Preamble: This article features language mined from transcripts of “The Brian Lehrer Show”, publicly available on the WNYC website. This information is being shared and discussed purely for entertainment purposes!

“It’s the Brian Lehrer show on WNYC, good morning everyone.”

How many of us New Yorkers rub the sleep from our eyes while listening to these soothing words on weekday mornings? I would go so far as to say that I eagerly await them as I sip my coffee in the morning and start my working day.

For those of you who aren’t familiar with these words, they come from “The Brian Lehrer Show”, a 2-hour-long radio show produced by WNYC which touches on a range of local and national topics, including politics, racial justice, the economy, history, and even language (notably with another favorite “disembodied voice” of mine, Dr. John McWhorter, associate professor of English and comparative literature at Columbia University).

As I was twiddling my thumbs and swiveling in my desk chair while listening to Brian last week, I was struck by the brilliant idea of carrying out a little text processing project using NLP to practice my skills. I thought it would be great fun to combine web scraping with Beautiful Soup and a sprinkle of NLTK and spaCy to create a (simplified) linguistic profile of how Brian speaks on the radio. The profile I set out to create included:

  • Brian’s twenty most frequently used lemma headwords (i.e., “play, plays, played, playing” => “play”)
  • Brian’s twenty most frequently used lemma bigrams/trigrams (which gives us an indication of Brian’s most-used phrases)
  • Brian’s twenty most frequently used proper nouns (using Named Entity Recognition with spaCy, giving us an indication of the topics of discussion)

Let’s take a look at how I did it!


Part 1: Web scraping

To create a linguistic profile of Brian, I first needed to create a corpus of as much of his language production as I could reasonably get my hands on. Lucky for me (and my poor fingers, which are still haunted by all of the transcriptions I did for my dissertation), WNYC keeps transcripts of all segments featured on the show on its website.

Although I was originally hoping for an entire history of Brian’s speech on the radio, WNYC only started (publicly) providing transcripts on the website from August 10th, 2020 to the present (October 15th, 2021, when I started the project). So, I built my corpus by scraping the transcripts of all segments published between these dates from the WNYC website, reframing Brian’s linguistic profile as covering roughly one year.



I used both Selenium and Beautiful Soup to carry out the scraping. Selenium was used to render the dynamic webpage in an automated browser session, which I then scraped using Beautiful Soup by iterating through the relevant HTML tags containing each segment’s transcript.
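As a hedged sketch of that approach (the container class name “transcript” and the use of `<p>` tags are assumptions for illustration, not the actual WNYC markup):

```python
from bs4 import BeautifulSoup

def extract_transcript(page_html: str) -> str:
    """Pull the transcript text out of a rendered segment page."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Assumed structure: transcript paragraphs in <p> tags inside a
    # container with class "transcript" (the real class name may differ).
    container = soup.find("div", class_="transcript")
    if container is None:
        return ""
    return "\n".join(p.get_text(strip=True) for p in container.find_all("p"))

# Selenium renders the dynamic page first (sketch):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(segment_url)  # one transcript page per show segment
# text = extract_transcript(driver.page_source)
```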

The following graphic helps to explain what the code does and my conceptual approach.

Part 2: Text analysis

Now that I had all of the transcripts in one place, I had to sift through them and get only Brian’s lines. I used the following code to find proper names followed by a colon (e.g., “Brian Lehrer:”, “Mayor De Blasio:”), and cut out lines of text not belonging to Brian.
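Since the original snippet isn’t reproduced here, this is a minimal reconstruction of the idea: match a capitalized speaker name followed by a colon, track who is currently speaking, and keep only Brian’s lines (cue included; it is stripped in the next step).

```python
import re

# A speaker cue is one or more capitalized words followed by a colon,
# e.g. "Brian Lehrer:" or "Mayor De Blasio:".
CUE = re.compile(r"^([A-Z][\w.'-]*(?: [A-Za-z][\w.'-]*)*):")

def brian_lines(transcript: str) -> list:
    """Keep only the transcript lines spoken by Brian."""
    keep, speaker = [], None
    for line in transcript.splitlines():
        match = CUE.match(line)
        if match:
            speaker = match.group(1)
        if speaker is not None and speaker.startswith("Brian") and line.strip():
            keep.append(line)
    return keep
```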

The next step was to remove Brian’s “cue” in each transcript. These appeared as both “Brian Lehrer:” and “Brian:”, as seen below:

I wrote some simple code to do this:
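A plausible version of that code, assuming the two cue variants noted above:

```python
import re

def strip_cue(line: str) -> str:
    """Remove Brian's cue, which appears as "Brian Lehrer:" or "Brian:"."""
    return re.sub(r"^Brian(?: Lehrer)?:\s*", "", line)
```

Non-Brian lines pass through unchanged, so the function is safe to map over the whole corpus.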

Next, I carried out several essential preprocessing steps:

  1. Expand contractions (e.g., to distinguish the “’s” in “let’s” (let us) from the one in “it’s” (it is) – this will be useful for generating frequency counts later)
  2. Make all words lowercase (to avoid repetitions of the same token; e.g., “she”, “She”) and tokenize the words
  3. Lemmatize the tokens (i.e., collapse different word forms to their headwords)
  4. Remove stop words (common words like “a, the, of, and is” which give us little information about the semantic content of the text)
  5. Strip punctuation

The code for this is presented below:
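Since the original code isn’t reproduced here, below is a self-contained sketch of the five steps. The post used NLTK for tokenization and lemmatization; the tiny contraction map, stopword list, and lemma map here are illustrative stand-ins for those resources.

```python
import re

# Toy stand-ins for NLTK's resources (assumptions for illustration only).
CONTRACTIONS = {"it's": "it is", "let's": "let us", "don't": "do not"}
STOPWORDS = {"a", "the", "of", "and", "is", "it", "us", "i", "to", "that", "they"}
LEMMAS = {"announced": "announce", "plays": "play", "played": "play"}

def preprocess(sentence: str) -> list:
    # 1. Expand contractions
    for short, full in CONTRACTIONS.items():
        sentence = re.sub(rf"\b{re.escape(short)}\b", full, sentence, flags=re.I)
    # 2. Lowercase and tokenize (5. the tokenizer also drops punctuation)
    tokens = re.findall(r"[a-z']+", sentence.lower())
    # 3. Lemmatize and 4. remove stop words
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOPWORDS]
```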

This code produced a preprocessed list of sentence strings. Compare, for instance, Brian’s first sentence in the corpus before and after preprocessing:

You can notice right away what happened in the preprocessing. All words were made lowercase and tokenized (e.g., “Biden” -> “biden”), stopwords were removed (e.g., “I”, “they”, “that”), and the words were lemmatized (e.g., “announced” -> “announce”). Plowing forward on my quest to profile Brian Lehrer’s language use, I then created unigram and n-gram frequency dictionaries using Brian’s tokenized lines. Frequency dictionaries are simple counts of how many times a given word token (or lemma, in our case) occurs in a text. The following code illustrates how I created them.
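A minimal version of that step uses `collections.Counter` over sliding windows of the tokenized lines (the variable names and toy lines are illustrative):

```python
from collections import Counter

def ngram_freqs(token_lines, n):
    """Count n-grams across a list of tokenized (lemmatized) lines."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

# Unigram, bigram, and trigram dictionaries from the same lines:
lines = [["good", "morning", "everyone"], ["good", "morning", "brian"]]
unigrams = ngram_freqs(lines, 1)
bigrams = ngram_freqs(lines, 2)
trigrams = ngram_freqs(lines, 3)
```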

Part 3: Frequency calculations

Once I had frequency dictionaries for Brian’s unigrams, bigrams, and trigrams, I was able to get to work making pretty visualizations of the data! I used a combination of WordCloud and Seaborn to do this.

First, let’s take a look at Brian’s most commonly used unigram lemmas using WordCloud visualization.  On the left, we have a boring “basic” word cloud, while on the right, we have a novelty word cloud that spells out the name of New York’s favorite public radio station! The code used to produce these visualizations is presented below.



A quick examination of these word clouds allows us to extract simple insights about Brian’s word use over the past year.

  • He spoke a lot about New York.
  • He used language that encouraged listeners to call in and ask questions.
  • He talked a lot about himself (just joking – we all know he isn’t a narcissist; he simply introduces every show by saying “It’s the Brian Lehrer Show”).
  • He discussed reported facts and speech (e.g., “Mayor De Blasio here has started to say, ‘tourists come back, come back’”).
  • He used common insight words to discuss opinions (e.g., “think”, “know”).

In other words, he used language that might be expected of a call-in radio show! A cleaner look at Brian’s language use can be provided by looking at the 20 most commonly used lemmas in the corpus, presented in the bar chart below.
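A chart along those lines can be sketched with Seaborn (toy counts again stand in for the real unigram frequency dictionary):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Toy counts standing in for the real unigram frequency dictionary.
freqs = Counter({"york": 120, "call": 95, "think": 80, "mayor": 60})
words, counts = zip(*freqs.most_common(20))

ax = sns.barplot(x=list(counts), y=list(words), palette="rainbow")
ax.set(xlabel="Frequency", ylabel="Lemma",
       title="Brian's 20 most common unigram lemmas")
plt.tight_layout()
plt.savefig("top_lemmas.png")
```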

The 20 most common unigrams in the Brian corpus presented above (in a pretty rainbow bar chart, no less!) confirm most of the insights we gleaned from our word clouds. Let’s combine these with insights drawn from bar plots of Brian’s most common bigrams and trigrams.


Brian’s 20 most frequent lemma bigrams


Brian’s 20 most frequent lemma trigrams

These n-gram charts are useful for telling us about the most common phrases that Brian uses on the show:

  • We can piece together from the pairs, “brian lehrer”, “lehrer show”, “show wnyc”, “wnyc good”, “good morning”,  and “morning everyone”, that Brian says the phrase “Brian Lehrer show [on] WNYC, good morning everyone” a lot!
  • “New York” and “New Jersey” emerge regularly in Brian’s discussions with guests and callers, but are likely very frequent because of Brian’s repeated station identification phrase: “This is WNYC, FM, HD, and AM New York, WNJT-FM 88.1 Trenton, WNJP 88.5 Sussex, WNJY 89.3 Netcong, and WNJO 90.3 Tom’s River, we are New York and New Jersey public radio.”
  • Brian uses fixed phrases to invite listeners to call into the show and interact with them (e.g., “let’s take a phone call”, “thank you for calling in”, “thank you very much call us again”). 
  • Brian speaks about Mayor Bill de Blasio a whole lot (which is unsurprising considering that every Friday is “ask the mayor”).

The final task outlined at the outset of this post was to identify the noun phrases Brian most commonly uses on the show, with an eye towards identifying the most popular topics of the show over the past year. We will do this using spaCy, a powerful library for processing text data.
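Since the original snippet isn’t shown, here is a sketch: the counting helper below works on plain strings, while the spaCy calls (using the standard `en_core_web_sm` model, which must be downloaded separately) appear in comments.

```python
from collections import Counter

def top_entities(ents, n=20, min_words=1):
    """Count entity strings, optionally keeping only multi-word ones."""
    counts = Counter(e.lower() for e in ents if len(e.split()) >= min_words)
    return counts.most_common(n)

# With spaCy (sketch; `brian_lines` is the list of Brian's cleaned lines):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# ents = [ent.text for doc in nlp.pipe(brian_lines) for ent in doc.ents]
# top_entities(ents)               # Graph 1: all named entities
# top_entities(ents, min_words=2)  # Graph 2: 2+ word entities only
```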


Top 20 most common named entities uttered by Brian


Top 20 most common named entities uttered by Brian (2+ words)

By looking at these two graphs, we can get an idea of the most common topics of the past year. These include:

  • General US politics (e.g., democrats, republicans, washington, senate, congress, white house, joe biden, kamala harris, Graphs 1 & 2)
  • Specific political issues within the US (e.g., supreme court, texas, north carolina)
  • The NYC Mayoral race (e.g., maya wiley, scott stringer, andrew yang, eric adams, Graph 2)
  • …and the US withdrawal from Afghanistan (e.g., afghanistan, taliban, Graph 1) 

What an interesting use of NLP! Thank you Brian for all that you do for us New Yorkers!
