Preamble: This article features language mined from transcripts of “The Brian Lehrer Show”, publicly available at https://www.wnyc.org/shows/bl. This information is being shared and discussed purely for entertainment purposes!
“It’s the Brian Lehrer show on WNYC, good morning everyone.”
How many of us New Yorkers rub the sleep from our eyes while listening to these soothing words on weekday mornings? I would go so far as to say that I eagerly await them as I sip my coffee in the morning and start my working day.
For those of you who aren’t familiar with these words, they come from “The Brian Lehrer Show”, a two-hour radio show produced by WNYC which touches on a range of local and national topics, including politics, racial justice, the economy, history, and even language (notably with another favorite “disembodied voice” of mine, Dr. John McWhorter, associate professor of English and comparative literature at Columbia University – for a recent episode, check out https://www.wnyc.org/story/words-avoid-john-mcwhorter).
As I was twiddling my thumbs and swiveling in my desk chair while listening to Brian last week, I was struck with the brilliant idea of carrying out a little text processing project to practice my NLP skills. I thought it would be great fun to combine web scraping with Beautiful Soup and a sprinkle of NLTK and spaCy to create a (simplified) linguistic profile of how Brian speaks on the radio. The profile I set out to create included the following (a toy illustration of these units appears right after the list):
- Brian’s twenty most frequently used lemma headwords (i.e., “play, plays, played, playing” => “play”)
- Brian’s twenty most frequently used lemma bigrams/trigrams (which gives us an indication of Brian’s most-used phrases)
- Brian’s twenty most frequently used proper nouns (using Named Entity Recognition with spaCy, giving us an indication of the topics of discussion)
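To make these three units concrete, here is a minimal toy sketch (the sentence is invented for illustration and is not from the corpus) of what lemmas, bigrams, and named entities look like with NLTK and spaCy:

import spacy
from nltk import ngrams

nlp = spacy.load('en_core_web_sm')
doc = nlp("Brian played a clip and callers phoned in from Brooklyn.")

lemmas = [token.lemma_ for token in doc] #Headwords, e.g. "played" -> "play", "callers" -> "caller"
bigram_list = list(ngrams(lemmas, 2)) #Adjacent two-word chunks, e.g. ("play", "a")
entity_list = [(ent.text, ent.label_) for ent in doc.ents] #Named entities, e.g. ("Brooklyn", "GPE")

print(lemmas, bigram_list, entity_list, sep="\n")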
Let’s take a look at how I did it!
Part 1: Web scraping
To create a linguistic profile of Brian, I first needed to create a corpus of as much of his language production as I could reasonably get my hands on. Lucky for me (and my poor fingers, which are still haunted by all of the transcriptions I did for my dissertation), WNYC keeps transcripts of all segments featured on the show at https://www.wnyc.org/shows/bl/segments.
Although I was originally hoping for an entire history of Brian’s speech on the radio, WNYC only started (publicly) providing transcripts on the website from August 10th, 2020 to the present (October 15th, 2021, when I started the project). So, I built my corpus by scraping the transcripts of all segments published between these dates from the WNYC website, reframing Brian’s linguistic profile as covering roughly one year of his on-air speech.
I used both Selenium and Beautiful Soup to carry out the scraping. Selenium loaded a rendered version of each dynamic webpage in a browser, and Beautiful Soup then scraped the result by iterating through the HTML tags containing each segment’s transcript.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup as bs #Import web scraping tool under a shorter name

driver = webdriver.Chrome(r"C:\Users\scorc\OneDrive\Desktop\SLS-680_NLP+Python\CodeAcademy\Projects\Brian Lehrer Scraping\chromedriver.exe")
url = "https://www.wnyc.org/shows/bl/segments" #Sets the base URL for the paginated segment listings

#Collect the URL of every segment listed on the paginated listing pages
url_list = []
for i in range(124):
    driver.get(url + "/" + str(i))
    time.sleep(7) #Give the dynamic page time to load before parsing
    soup = bs(driver.page_source, "html.parser") #Parse the rendered page with Beautiful Soup after passing it through Selenium
    for tag in soup.find_all('h2', {'class': 'story-tease__title'}):
        for x in tag.find_all('a'):
            url_list.append(x['href'])

#Remove duplicate URLs while preserving order
url_list_final = []
for item in url_list:
    if item not in url_list_final:
        url_list_final.append(item)

#Tell the program to skip a certain URL that contained no segments
skip = '//www.wnyc.org/story/the-brian-lehrer-show-2021-10-21/'

#Visit each segment page and scrape the transcript paragraphs
text_list = []
for url in url_list_final:
    current = []
    if url == skip:
        continue
    try:
        full_url = "https:" + url if url.startswith("//") else url #The scraped hrefs are protocol-relative, so add the scheme before loading
        driver.get(full_url)
        time.sleep(10) #Sleep is required here to make sure that the page is up long enough for Beautiful Soup to scrape it
        soup = bs(driver.page_source, "html.parser")
        for tag in soup.find_all('div', {'class': 'text'}):
            for x in tag.find_all('p'):
                current.append(x.text.strip())
    except Exception:
        pass
    text_list.append((url, current))
The following graphic helps to explain what the code does and my conceptual approach.
Part 2: Text analysis
Now that I had all of the transcripts in one place, I had to sift through them and get only Brian’s lines. I used the following code to find proper names followed by a colon (e.g., “Brian Lehrer:”, “Mayor De Blasio:”), and cut out lines of text not belonging to Brian.
#Let's write some code to splice out Brian's lines based on his name (and colon)
import re

name_syntax = r'^[A-Z]\w*\s[A-Z]\w*:|^[A-Z]\w*:' #Looks for a speaker cue: a proper name, with or without a last name, followed by a colon

brian_lines = [] #Defines empty list for only Brian's lines
for url, text in text_list:
    mark = None #Index of the most recent "Brian Lehrer:" cue
    for i in range(len(text)):
        if re.search('Brian Lehrer:|Brian:', text[i]): #Takes the first paragraph of any of Brian's turns
            mark = i #Sets the value of "mark" each time one of Brian's turns begins
            brian_lines.append(text[i])
        elif re.search(name_syntax, text[i]):
            if mark is not None:
                brian_lines.extend(text[mark + 1:i]) #Adds everything from the "mark" index to the current index (the rest of Brian's turn)
            mark = None #Resets "mark": the current speaker is no longer Brian
The next step was to remove Brian’s “cue” in each transcript. These cues appeared as both “Brian Lehrer:” and “Brian:”, as seen below:
I wrote some simple code to do this:
#Replace "Brian Lehrer:" and "Brian:" in Brian's lines so that we only have what Brian says.
brian_lines_all = [re.sub('Brian Lehrer:|Brian:', "", line) for line in brian_lines]
Next, I carried out several essential preprocessing steps:
- Expand contractions (e.g., to distinguish the “’s” in “let’s” (= “let us”) from the “’s” in “it’s” (= “it is”) – this will be useful for generating frequency counts later)
- Make all words lowercase (to avoid repetitions of the same token; e.g., “she”, “She”) and tokenize the words
- Lemmatize the tokens (i.e., collapse different word forms to their headwords)
- Remove stop words (common words like “a, the, of, and is” which give us little information about the semantic content of the text)
- Strip punctuation
The code for this is presented below:
#Import necessary modules
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from contractions import CONTRACTION_MAP #This is simply a dictionary of common contractions, e.g., {"ain't": "is not"}

brian_sentences = nltk.sent_tokenize(" ".join(brian_lines_all)) #Get a list of Brian's sentences

#Define list of unwanted punctuation and stop words
punctuation = '''!()-[]{};:,'"<>./?@#$%^&*_~``'''
stop_words = stopwords.words('english')

#Define necessary functions
#Expands contractions
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
            if contraction_mapping.get(match) \
            else contraction_mapping.get(match.lower())
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

#Basic preprocessing function (lowercase, tokenize)
def preprocess_basic(text):
    text = text.lower()
    tokenized = word_tokenize(text)
    return tokenized

#Define empty list, load spaCy for use in lemmatization
brian_sen_lemmas = []
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.max_length = 1000000000 #Allows spaCy to work with larger amounts of data

#Start of the key task
for sen in brian_sentences[:37699]: #Iterates through all of Brian's sentences
    current = [] #Defines empty list (necessary to output a list of sentences)
    expanded = expand_contractions(sen) #1 - Expand contractions
    doc = nlp(expanded) #Create a spaCy object
    lemmatized = " ".join([token.lemma_ for token in doc]) #2a - Lemmatize
    processed = preprocess_basic(lemmatized) #2b - Lowercase and tokenize
    for tok in processed:
        if tok not in punctuation: #2c - Remove punctuation
            if tok not in stop_words: #2d - Remove stop words
                current.append(re.sub('[^A-Za-z]', "", tok)) #2e - Remove non-word characters and add to the list
    brian_sen_lemmas.append(" ".join(current)) #Join processed words into a string and append the sentence to the main list
This code produced a preprocessed list of sentence strings. Compare, for instance, Brian’s first sentence in the corpus before and after preprocessing:
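The simplest way to see the difference is to print the same index from both lists (a quick sketch, assuming the raw and preprocessed sentence lists are index-aligned, which they are in the loop above; index 0 is just an example):

#Compare a raw sentence with its preprocessed counterpart
print(brian_sentences[0])
print(brian_sen_lemmas[0])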
You can notice right away what happened in the preprocessing. All words were made lowercase and tokenized (e.g., “Biden” -> “biden”), stopwords were removed (e.g., “I”, “they”, “that”), and the words were lemmatized (e.g., “announced” -> “announce”). Plowing forward on my quest to profile Brian Lehrer’s language use, I then created unigram and n-gram frequency dictionaries from Brian’s tokenized lines. Frequency dictionaries are simple counts of how many times a given word token (or lemma, in our case) occurs in a text. The following code illustrates how I created them.
#Creates unigram list from the list of sentences
brian_lemma_final = [nltk.word_tokenize(token) for token in brian_sen_lemmas if token]
brian_lemma_final = [item for sublist in brian_lemma_final for item in sublist] #Flatten into a single list of tokens

#Create unigram frequency dictionary
from nltk import FreqDist
unigram_dict = FreqDist(brian_lemma_final)

#Create bigram/trigram frequency dictionaries
from nltk import ngrams #Function which automatically chops a list of tokens into chunks of n adjacent words
b_bigrams = ngrams(brian_lemma_final, 2)
b_bi_dict = FreqDist(b_bigrams)
b_trigrams = ngrams(brian_lemma_final, 3)
b_tri_dict = FreqDist(b_trigrams)
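Before moving on to the visualizations, it is worth peeking at what these FreqDist objects contain: most_common(n) returns a list of (item, count) pairs sorted by frequency. A quick check might look like the sketch below (the counts in the comments are placeholders, not the real numbers from the corpus):

print(unigram_dict.most_common(3)) #Placeholder counts, e.g. [('think', 9999), ('know', 8888), ('go', 7777)]
print(b_bi_dict.most_common(3)) #Bigram keys are tuples, e.g. ('new', 'york')
print(b_tri_dict.most_common(3)) #Trigram keys are 3-tuples, e.g. ('good', 'morning', 'everyone')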
Part 3: Frequency calculations
Once I had frequency dictionaries for Brian’s unigrams, bigrams, and trigrams, I was able to get to work making pretty visualizations of the data! I used a combination of WordCloud and Seaborn to do this.
First, let’s take a look at Brian’s most commonly used unigram lemmas using WordCloud visualization. On the left, we have a boring “basic” word cloud, while on the right, we have a novelty word cloud that spells out the name of New York’s favorite public radio station! The code used to produce these visualizations is presented below.
#Let's create a word cloud from Brian's lemmatized lines
from wordcloud import WordCloud
import matplotlib.pyplot as plt

b_words = " ".join(brian_sen_lemmas) #WordCloud takes a single string as its argument
b_wc = WordCloud(width=1200, height=800).generate(b_words)
plt.imshow(b_wc)
plt.axis('off')
plt.show()

#Here's the code for a fancier word cloud, with the WNYC logo as an outline
from PIL import Image
import numpy as np
wnyc = np.array(Image.open("wnyc2.png")) #Note that the shape comes from a local file of the WNYC logo

from matplotlib.colors import LinearSegmentedColormap
colors = ["#000000", "#111111", "#101010", "#121212", "#212121", "#222222"]
cmap = LinearSegmentedColormap.from_list("mycmap", colors)

wc = WordCloud(background_color="white", mask=wnyc, colormap=cmap).generate(b_words)
plt.figure(figsize=(16,12))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
[Word clouds: “is so basic” (left) and “is so fancy” (right)]
A quick examination of these word clouds allows us to extract simple insights about Brian’s word use over the past year.
- He spoke a lot about New York.
- He used language that encouraged listeners to call in and ask questions.
- He talked a lot about himself (just joking – we all know he isn’t a narcissist, he simply introduces every show by saying “It’s the Brian Lehrer Show”)
- He discussed reported facts and speech (e.g., “Mayor De Blasio here has started to say, ‘tourists come back, come back'” – https://www.wnyc.org/story/us-travel-restrictions-changing)
- He used common insight words to discuss opinions (e.g., “think”, “know”).
In other words, he used language that might be expected of a call-in radio show! A cleaner look at Brian’s language use can be provided by looking at the 20 most commonly used lemmas in the corpus, presented in the bar chart below.
#Let's create bar charts of Brian's top 20 most frequent lemmas
import pandas as pd
import seaborn as sns

## Creating FreqDist for the whole bag of words, keeping the 20 most common tokens
brian_20 = unigram_dict.most_common(20)

## Conversion to a Pandas Series via a Python dictionary for easier plotting
all_fdist = pd.Series(dict(brian_20))

## Setting figure, ax into variables
fig, ax = plt.subplots(figsize=(10,10))

## Seaborn plotting using Pandas attributes + xtick rotation for ease of viewing
all_plot = sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)
plt.xticks(rotation=30)
plt.title('Frequency Distribution of Brian Lehrer\'s lemma unigrams: 08/10/2020 to 10/15/2021')
The 20 most common unigrams in the Brian corpus presented above (in a pretty rainbow bar chart, no less!) confirm most of the insights we gleaned from our word clouds. Let’s combine these with insights drawn from bar plots of Brian’s most common bigrams and trigrams (click the slider button below the graph to advance to the next image).
#Bigrams
## Get the top 20 bigrams, with the tokens of each bigram joined by a whitespace
b_bi_20 = b_bi_dict.most_common(20)
ngram_20 = {' '.join(x): y for x, y in b_bi_20}

## Convert to a Pandas Series for easy plotting
ngram_freqdist = pd.Series(ngram_20)

## Setting figure & axes for plots
fig, ax = plt.subplots(figsize=(10,10))

## Setting plot to horizontal for easy viewing + setting title + display
bar_plot = sns.barplot(x=ngram_freqdist.values, y=ngram_freqdist.index, orient='h', ax=ax)
plt.title('Frequency Distribution of Brian Lehrer\'s lemma bigrams: 08/10/2020 to 10/15/2021')
plt.show()

#Trigrams
## Get the top 20 trigrams, with the tokens of each trigram joined by a whitespace
b_tri_20 = b_tri_dict.most_common(20)
trigram_20 = {' '.join(x): y for x, y in b_tri_20}

## Convert to a Pandas Series for easy plotting
trigram_freqdist = pd.Series(trigram_20)

## Setting figure & axes for plots
fig, ax = plt.subplots(figsize=(10,10))

## Setting plot to horizontal for easy viewing + setting title + display
bar_plot = sns.barplot(x=trigram_freqdist.values, y=trigram_freqdist.index, orient='h', ax=ax)
plt.title('Frequency Distribution of Brian Lehrer\'s lemma trigrams: 08/10/2020 to 10/15/2021')
plt.tight_layout()
plt.show()
These n-gram charts are useful for telling us about the most common phrases that Brian uses on the show:
- We can piece together from the pairs, “brian lehrer”, “lehrer show”, “show wnyc”, “wnyc good”, “good morning”, and “morning everyone”, that Brian says the phrase “Brian Lehrer show [on] WNYC, good morning everyone” a lot!
- “New York” and “New Jersey” emerge regularly in Brian’s discussions with guests and callers, but they are likely so frequent because of Brian’s repeated station identification phrase: “This is WNYC, FM, HD, and AM New York, WNJT-FM 88.1 Trenton, WNJP 88.5 Sussex, WNJY 89.3 Netcong, and WNJO 90.3 Tom’s River, we are New York and New Jersey public radio.”
- Brian uses fixed phrases to invite listeners to call into the show and interact with them (e.g., “let’s take a phone call”, “thank you for calling in”, “thank you very much call us again”).
- Brian speaks about Mayor Bill de Blasio a whole lot (which is unsurprising considering that every Friday is “ask the mayor”).
The final task outlined at the outset of this post was to identify the proper nouns (named entities) that Brian most commonly uses on the show, with an eye towards identifying the most popular topics of the show over the past year. We will do this using spaCy, a powerful library for processing text data (https://spacy.io/).
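For readers new to spaCy, the .ents attribute of a processed document holds the named entities that the pre-trained model finds, each with a text span and a label. Here is a toy example (the sentence is invented, and the exact labels depend on the model):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Brian Lehrer spoke with the mayor about New York City schools on Thursday.")
for ent in doc.ents:
    print(ent.text, ent.label_) #Likely output includes "Brian Lehrer PERSON", "New York City GPE", "Thursday DATE"

With that in mind, the full pass over the corpus looks like this: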
import spacy
import nltk
from nltk import FreqDist
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1000000000 #Allows spaCy to work with larger amounts of data

#Collect the named entities in each of Brian's preprocessed sentences
entities = []
for sen in brian_sen_lemmas[:37698]: #Iterates through all of Brian's sentences
    doc = nlp(sen) #Create a spaCy object
    for ent in doc.ents: # .ents in spaCy holds the pre-trained entities identified in the text
        entities.append((ent.text, ent.label_))

#Get trimmed entities (no time/number labels) and entities that are two words or longer
b_nps_trim = []
b_nps_multi = []
stop_en = ['TIME', 'DATE', 'CARDINAL', 'ORDINAL']
for item in entities:
    if item[1] not in stop_en:
        b_nps_trim.append(item)
for item in entities:
    if len(nltk.word_tokenize(item[0])) > 1:
        if item[1] not in stop_en:
            b_nps_multi.append(item)

#Graph 1: the 20 most frequent trimmed entities
b_np_20 = FreqDist(b_nps_trim).most_common(20)
b_np_20 = {' '.join(x): y for x, y in b_np_20} #Join each (text, label) pair with a whitespace

b_np_freqdist = pd.Series(b_np_20) #Convert to a Pandas Series for easy plotting
fig, ax = plt.subplots(figsize=(10,10))
bar_plot = sns.barplot(x=b_np_freqdist.values, y=b_np_freqdist.index, orient='h', ax=ax) #Horizontal bars for easy viewing
plt.title('Frequency Distribution of Brian Lehrer\'s noun phrases: 08/10/2020 to 10/15/2021')
plt.show()
bar_plot.figure.savefig('b_nps.png')

#Graph 2: the 20 most frequent entities that are two words or longer
b_np_multi_20 = FreqDist(b_nps_multi).most_common(20)
b_np_multi_20 = {' '.join(x): y for x, y in b_np_multi_20}

b_np_multi_freqdist = pd.Series(b_np_multi_20)
fig, ax = plt.subplots(figsize=(10,10))
bar_plot = sns.barplot(x=b_np_multi_freqdist.values, y=b_np_multi_freqdist.index, orient='h', ax=ax)
plt.title('Frequency Distribution of Brian Lehrer\'s multi-word noun phrases: 08/10/2020 to 10/15/2021')
plt.show()
By looking at these two graphs, we can get an idea of the most common topics of the past year. These include:
- General US politics (e.g., democrats, republicans, washington, senate, congress, white house, joe biden, kamala harris, Graphs 1 & 2)
- Specific political issues within the US (e.g., supreme court, texas, north carolina)
- The NYC Mayoral race (e.g., maya wiley, scott stringer, andrew yang, eric adams, Graph 2)
- …and the US withdrawal from Afghanistan (e.g., afghanistan, taliban, Graph 1)
What an interesting use of NLP! Thank you Brian for all that you do for us New Yorkers!