Preamble: This article features language mined from transcripts of “The Brian Lehrer Show”, publicly available at https://www.wnyc.org/shows/bl. This information is being shared and discussed purely for entertainment purposes!
“It’s the Brian Lehrer show on WNYC, good morning everyone.”
How many of us New Yorkers rub the sleep from our eyes while listening to these soothing words on weekday mornings? I would go so far as to say that I eagerly await them as I sip my coffee in the morning and start my working day.
For those of you who aren’t familiar with these words, they come from “The Brian Lehrer Show”, a two-hour radio show produced by WNYC which touches on a range of local and national topics, including politics, racial justice, the economy, history, and even language (notably with another favorite “disembodied voice” of mine, Dr. John McWhorter, associate professor of English and comparative literature at Columbia University – for a recent episode, check out https://www.wnyc.org/story/words-avoid-john-mcwhorter).
As I was twiddling my thumbs and swiveling in my desk chair while listening to Brian last week, I was struck with the brilliant idea of carrying out a little text processing project to practice my NLP skills. I thought it would be great fun to combine web scraping with Beautiful Soup and a sprinkle of NLTK and spaCy to create a (simplified) linguistic profile of how Brian speaks on the radio. The profile I set out to create included the following (a toy illustration of these units appears right after the list):
- Brian’s twenty most frequently used lemma headwords (i.e., “play, plays, played, playing” => “play”)
- Brian’s twenty most frequently used lemma bigrams/trigrams (which gives us an indication of Brian’s most-used phrases)
- Brian’s twenty most frequently used proper nouns (using Named Entity Recognition with spaCy, giving us an indication of the topics of discussion)
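To make these three units concrete, here is a minimal toy sketch (the sentence is invented for illustration and is not from the corpus) of what lemmas, bigrams, and named entities look like with NLTK and spaCy:

import spacy
from nltk import ngrams

nlp = spacy.load('en_core_web_sm')
doc = nlp("Brian played a clip and callers phoned in from Brooklyn.")

lemmas = [token.lemma_ for token in doc] #Headwords, e.g. "played" -> "play", "callers" -> "caller"
bigram_list = list(ngrams(lemmas, 2)) #Adjacent two-word chunks, e.g. ("play", "a")
entity_list = [(ent.text, ent.label_) for ent in doc.ents] #Named entities, e.g. ("Brooklyn", "GPE")

print(lemmas, bigram_list, entity_list, sep="\n")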
Let’s take a look at how I did it!
Part 1: Web scraping
To create a linguistic profile of Brian, I first needed to create a corpus of as much of his language production as I could reasonably get my hands on. Lucky for me (and my poor fingers, which are still haunted by all of the transcriptions I did for my dissertation), WNYC keeps transcripts of all segments featured on the show at https://www.wnyc.org/shows/bl/segments.
Although I was originally hoping for an entire history of Brian’s speech on the radio, WNYC only started (publicly) providing transcripts on the website from August 10th, 2020 to the present (October 15th, 2021, when I started the project). So, I built my corpus by scraping the transcripts of all segments published between these dates from the WNYC website, reframing Brian’s linguistic profile as covering roughly one year of his on-air speech.
I used both Selenium and Beautiful Soup to carry out the scraping. Selenium loaded a rendered version of each dynamic webpage in a browser, and Beautiful Soup then scraped the result by iterating through the HTML tags containing each segment’s transcript.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup as bs #Import web scraping tool under a shorter name

driver = webdriver.Chrome(r"C:\Users\scorc\OneDrive\Desktop\SLS-680_NLP+Python\CodeAcademy\Projects\Brian Lehrer Scraping\chromedriver.exe")
url = "https://www.wnyc.org/shows/bl/segments" #Sets the base URL for the paginated segment listings

#Collect the URL of every segment listed on the paginated listing pages
url_list = []
for i in range(124):
    driver.get(url + "/" + str(i))
    time.sleep(7) #Give the dynamic page time to load before parsing
    soup = bs(driver.page_source, "html.parser") #Parse the rendered page with Beautiful Soup after passing it through Selenium
    for tag in soup.find_all('h2', {'class': 'story-tease__title'}):
        for x in tag.find_all('a'):
            url_list.append(x['href'])

#Remove duplicate URLs while preserving order
url_list_final = []
for item in url_list:
    if item not in url_list_final:
        url_list_final.append(item)

#Tell the program to skip a certain URL that contained no segments
skip = '//www.wnyc.org/story/the-brian-lehrer-show-2021-10-21/'

#Visit each segment page and scrape the transcript paragraphs
text_list = []
for url in url_list_final:
    current = []
    if url == skip:
        continue
    try:
        full_url = "https:" + url if url.startswith("//") else url #The scraped hrefs are protocol-relative, so add the scheme before loading
        driver.get(full_url)
        time.sleep(10) #Sleep is required here to make sure that the page is up long enough for Beautiful Soup to scrape it
        soup = bs(driver.page_source, "html.parser")
        for tag in soup.find_all('div', {'class': 'text'}):
            for x in tag.find_all('p'):
                current.append(x.text.strip())
    except Exception:
        pass
    text_list.append((url, current))
The following graphic helps to explain what the code does and my conceptual approach.
Part 2: Text analysis
Now that I had all of the transcripts in one place, I had to sift through them and get only Brian’s lines. I used the following code to find proper names followed by a colon (e.g., “Brian Lehrer:”, “Mayor De Blasio:”), and cut out lines of text not belonging to Brian.
#Let's write some code to splice out Brian's lines based on his name (and colon)
import re

name_syntax = r'^[A-Z]\w*\s[A-Z]\w*:|^[A-Z]\w*:' #Looks for a speaker cue: a proper name, with or without a last name, followed by a colon

brian_lines = [] #Defines empty list for only Brian's lines
for url, text in text_list:
    mark = None #Index of the most recent "Brian Lehrer:" cue
    for i in range(len(text)):
        if re.search('Brian Lehrer:|Brian:', text[i]): #Takes the first paragraph of any of Brian's turns
            mark = i #Sets the value of "mark" each time one of Brian's turns begins
            brian_lines.append(text[i])
        elif re.search(name_syntax, text[i]):
            if mark is not None:
                brian_lines.extend(text[mark + 1:i]) #Adds everything from the "mark" index to the current index (the rest of Brian's turn)
            mark = None #Resets "mark": the current speaker is no longer Brian
The next step was to remove Brian’s “cue” in each transcript. These cues appeared as both “Brian Lehrer:” and “Brian:”, as seen below:
I wrote some simple code to do this:
#Replace "Brian Lehrer:" and "Brian:" in Brian's lines so that we only have what Brian says.
brian_lines_all = [re.sub('Brian Lehrer:|Brian:', "", line) for line in brian_lines]
Next, I carried out several essential preprocessing steps:
- Expand contractions (e.g., to distinguish the “’s” in “let’s” (= “let us”) from the “’s” in “it’s” (= “it is”) – this will be useful for generating frequency counts later)
- Make all words lowercase (to avoid repetitions of the same token; e.g., “she”, “She”) and tokenize the words
- Lemmatize the tokens (i.e., collapse different word forms to their headwords)
- Remove stop words (common words like “a, the, of, and is” which give us little information about the semantic content of the text)
- Strip punctuation
The code for this is presented below:
#Import necessary modules
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from contractions import CONTRACTION_MAP #This is simply a dictionary of common contractions, e.g., {"ain't": "is not"}

brian_sentences = nltk.sent_tokenize(" ".join(brian_lines_all)) #Get a list of Brian's sentences

#Define list of unwanted punctuation and stop words
punctuation = '''!()-[]{};:,'"<>./?@#$%^&*_~``'''
stop_words = stopwords.words('english')

#Define necessary functions
#Expands contractions
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
            if contraction_mapping.get(match) \
            else contraction_mapping.get(match.lower())
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

#Basic preprocessing function (lowercase, tokenize)
def preprocess_basic(text):
    text = text.lower()
    tokenized = word_tokenize(text)
    return tokenized

#Define empty list, load spaCy for use in lemmatization
brian_sen_lemmas = []
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.max_length = 1000000000 #Allows spaCy to work with larger amounts of data

#Start of the key task
for sen in brian_sentences[:37699]: #Iterates through all of Brian's sentences
    current = [] #Defines empty list (necessary to output a list of sentences)
    expanded = expand_contractions(sen) #1 - Expand contractions
    doc = nlp(expanded) #Create a spaCy object
    lemmatized = " ".join([token.lemma_ for token in doc]) #2a - Lemmatize
    processed = preprocess_basic(lemmatized) #2b - Lowercase and tokenize
    for tok in processed:
        if tok not in punctuation: #2c - Remove punctuation
            if tok not in stop_words: #2d - Remove stop words
                current.append(re.sub('[^A-Za-z]', "", tok)) #2e - Remove non-word characters and add to the list
    brian_sen_lemmas.append(" ".join(current)) #Join processed words into a string and append the sentence to the main list
This code produced a preprocessed list of sentence strings. Compare, for instance, Brian’s first sentence in the corpus before and after preprocessing:
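The simplest way to see the difference is to print the same index from both lists (a quick sketch, assuming the raw and preprocessed sentence lists are index-aligned, which they are in the loop above; index 0 is just an example):

#Compare a raw sentence with its preprocessed counterpart
print(brian_sentences[0])
print(brian_sen_lemmas[0])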
You can notice right away what happened in the preprocessing. All words were made lowercase and tokenized (e.g., “Biden” -> “biden”), stopwords were removed (e.g., “I”, “they”, “that”), and the words were lemmatized (e.g., “announced” -> “announce”). Plowing forward on my quest to profile Brian Lehrer’s language use, I then created unigram and n-gram frequency dictionaries from Brian’s tokenized lines. Frequency dictionaries are simple counts of how many times a given word token (or lemma, in our case) occurs in a text. The following code illustrates how I created them.
#Creates unigram list from the list of sentences
brian_lemma_final = [nltk.word_tokenize(token) for token in brian_sen_lemmas if token]
brian_lemma_final = [item for sublist in brian_lemma_final for item in sublist] #Flatten into a single list of tokens

#Create unigram frequency dictionary
from nltk import FreqDist
unigram_dict = FreqDist(brian_lemma_final)

#Create bigram/trigram frequency dictionaries
from nltk import ngrams #Function which automatically chops a list of tokens into chunks of n adjacent words
b_bigrams = ngrams(brian_lemma_final, 2)
b_bi_dict = FreqDist(b_bigrams)
b_trigrams = ngrams(brian_lemma_final, 3)
b_tri_dict = FreqDist(b_trigrams)
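Before moving on to the visualizations, it is worth peeking at what these FreqDist objects contain: most_common(n) returns a list of (item, count) pairs sorted by frequency. A quick check might look like the sketch below (the counts in the comments are placeholders, not the real numbers from the corpus):

print(unigram_dict.most_common(3)) #Placeholder counts, e.g. [('think', 9999), ('know', 8888), ('go', 7777)]
print(b_bi_dict.most_common(3)) #Bigram keys are tuples, e.g. ('new', 'york')
print(b_tri_dict.most_common(3)) #Trigram keys are 3-tuples, e.g. ('good', 'morning', 'everyone')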
Part 3: Frequency calculations
Once I had frequency dictionaries for Brian’s unigrams, bigrams, and trigrams, I was able to get to work making pretty visualizations of the data! I used a combination of WordCloud and Seaborn to do this.
First, let’s take a look at Brian’s most commonly used unigram lemmas using WordCloud visualization. On the left, we have a boring “basic” word cloud, while on the right, we have a novelty word cloud that spells out the name of New York’s favorite public radio station! The code used to produce these visualizations is presented below.
#Let's create a word cloud from Brian's lemmatized lines
from wordcloud import WordCloud
import matplotlib.pyplot as plt

b_words = " ".join(brian_sen_lemmas) #WordCloud takes a single string as its argument
b_wc = WordCloud(width=1200, height=800).generate(b_words)
plt.imshow(b_wc)
plt.axis('off')
plt.show()

#Here's the code for a fancier word cloud, with the WNYC logo as an outline
from PIL import Image
import numpy as np
wnyc = np.array(Image.open("wnyc2.png")) #Note that the shape comes from a local file of the WNYC logo

from matplotlib.colors import LinearSegmentedColormap
colors = ["#000000", "#111111", "#101010", "#121212", "#212121", "#222222"]
cmap = LinearSegmentedColormap.from_list("mycmap", colors)

wc = WordCloud(background_color="white", mask=wnyc, colormap=cmap).generate(b_words)
plt.figure(figsize=(16,12))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
[Word clouds: “is so basic” (left) and “is so fancy” (right)]
A quick examination of these word clouds allows us to extract simple insights about Brian’s word use over the past year.
- He spoke a lot about New York.
- He used language that encouraged listeners to call in and ask questions.
- He talked a lot about himself (just joking – we all know he isn’t a narcissist, he simply introduces every show by saying “It’s the Brian Lehrer Show”)
- He discussed reported facts and speech (e.g., “Mayor De Blasio here has started to say, ‘tourists come back, come back'” – https://www.wnyc.org/story/us-travel-restrictions-changing)
- He used common insight words to discuss opinions (e.g., “think”, “know”).
In other words, he used language that might be expected of a call-in radio show! A cleaner look at Brian’s language use can be provided by looking at the 20 most commonly used lemmas in the corpus, presented in the bar chart below.
#Let's create bar charts of Brian's top 20 most frequent lemmas
import pandas as pd
import seaborn as sns

## Creating FreqDist for the whole bag of words, keeping the 20 most common tokens
brian_20 = unigram_dict.most_common(20)

## Conversion to a Pandas Series via a Python dictionary for easier plotting
all_fdist = pd.Series(dict(brian_20))

## Setting figure, ax into variables
fig, ax = plt.subplots(figsize=(10,10))

## Seaborn plotting using Pandas attributes + xtick rotation for ease of viewing
all_plot = sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)
plt.xticks(rotation=30)
plt.title('Frequency Distribution of Brian Lehrer\'s lemma unigrams: 08/10/2020 to 10/15/2021')
The 20 most common unigrams in the Brian corpus presented above (in a pretty rainbow bar chart, no less!) confirm most of the insights we gleaned from our word clouds. Let’s combine these with insights drawn from bar plots of Brian’s most common bigrams and trigrams (click the slider button below the graph to advance to the next image).
#Bigrams
## Get the top 20 bigrams, with the tokens of each bigram joined by a whitespace
b_bi_20 = b_bi_dict.most_common(20)
ngram_20 = {' '.join(x): y for x, y in b_bi_20}

## Convert to a Pandas Series for easy plotting
ngram_freqdist = pd.Series(ngram_20)

## Setting figure & axes for plots
fig, ax = plt.subplots(figsize=(10,10))

## Setting plot to horizontal for easy viewing + setting title + display
bar_plot = sns.barplot(x=ngram_freqdist.values, y=ngram_freqdist.index, orient='h', ax=ax)
plt.title('Frequency Distribution of Brian Lehrer\'s lemma bigrams: 08/10/2020 to 10/15/2021')
plt.show()

#Trigrams
## Get the top 20 trigrams, with the tokens of each trigram joined by a whitespace
b_tri_20 = b_tri_dict.most_common(20)
trigram_20 = {' '.join(x): y for x, y in b_tri_20}

## Convert to a Pandas Series for easy plotting
trigram_freqdist = pd.Series(trigram_20)

## Setting figure & axes for plots
fig, ax = plt.subplots(figsize=(10,10))

## Setting plot to horizontal for easy viewing + setting title + display
bar_plot = sns.barplot(x=trigram_freqdist.values, y=trigram_freqdist.index, orient='h', ax=ax)
plt.title('Frequency Distribution of Brian Lehrer\'s lemma trigrams: 08/10/2020 to 10/15/2021')
plt.tight_layout()
plt.show()
These n-gram charts are useful for telling us about the most common phrases that Brian uses on the show:
- We can piece together from the pairs, “brian lehrer”, “lehrer show”, “show wnyc”, “wnyc good”, “good morning”, and “morning everyone”, that Brian says the phrase “Brian Lehrer show [on] WNYC, good morning everyone” a lot!
- “New York” and “New Jersey” emerge regularly in Brian’s discussions with guests and callers, but they are likely so frequent because of Brian’s repeated station identification phrase: “This is WNYC, FM, HD, and AM New York, WNJT-FM 88.1 Trenton, WNJP 88.5 Sussex, WNJY 89.3 Netcong, and WNJO 90.3 Tom’s River, we are New York and New Jersey public radio.”
- Brian uses fixed phrases to invite listeners to call into the show and interact with them (e.g., “let’s take a phone call”, “thank you for calling in”, “thank you very much call us again”).
- Brian speaks about Mayor Bill de Blasio a whole lot (which is unsurprising considering that every Friday is “ask the mayor”).
The final task outlined at the outset of this post was to identify the proper nouns (named entities) that Brian most commonly uses on the show, with an eye towards identifying the most popular topics of the show over the past year. We will do this using spaCy, a powerful library for processing text data (https://spacy.io/).
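For readers new to spaCy, the .ents attribute of a processed document holds the named entities that the pre-trained model finds, each with a text span and a label. Here is a toy example (the sentence is invented, and the exact labels depend on the model):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Brian Lehrer spoke with the mayor about New York City schools on Thursday.")
for ent in doc.ents:
    print(ent.text, ent.label_) #Likely output includes "Brian Lehrer PERSON", "New York City GPE", "Thursday DATE"

With that in mind, the full pass over the corpus looks like this: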
import spacy
import nltk
from nltk import FreqDist
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1000000000 #Allows spaCy to work with larger amounts of data

#Collect the named entities in each of Brian's preprocessed sentences
entities = []
for sen in brian_sen_lemmas[:37698]: #Iterates through all of Brian's sentences
    doc = nlp(sen) #Create a spaCy object
    for ent in doc.ents: # .ents in spaCy holds the pre-trained entities identified in the text
        entities.append((ent.text, ent.label_))

#Get trimmed entities (no time/number labels) and entities that are two words or longer
b_nps_trim = []
b_nps_multi = []
stop_en = ['TIME', 'DATE', 'CARDINAL', 'ORDINAL']
for item in entities:
    if item[1] not in stop_en:
        b_nps_trim.append(item)
for item in entities:
    if len(nltk.word_tokenize(item[0])) > 1:
        if item[1] not in stop_en:
            b_nps_multi.append(item)

#Graph 1: the 20 most frequent trimmed entities
b_np_20 = FreqDist(b_nps_trim).most_common(20)
b_np_20 = {' '.join(x): y for x, y in b_np_20} #Join each (text, label) pair with a whitespace

b_np_freqdist = pd.Series(b_np_20) #Convert to a Pandas Series for easy plotting
fig, ax = plt.subplots(figsize=(10,10))
bar_plot = sns.barplot(x=b_np_freqdist.values, y=b_np_freqdist.index, orient='h', ax=ax) #Horizontal bars for easy viewing
plt.title('Frequency Distribution of Brian Lehrer\'s noun phrases: 08/10/2020 to 10/15/2021')
plt.show()
bar_plot.figure.savefig('b_nps.png')

#Graph 2: the 20 most frequent entities that are two words or longer
b_np_multi_20 = FreqDist(b_nps_multi).most_common(20)
b_np_multi_20 = {' '.join(x): y for x, y in b_np_multi_20}

b_np_multi_freqdist = pd.Series(b_np_multi_20)
fig, ax = plt.subplots(figsize=(10,10))
bar_plot = sns.barplot(x=b_np_multi_freqdist.values, y=b_np_multi_freqdist.index, orient='h', ax=ax)
plt.title('Frequency Distribution of Brian Lehrer\'s multi-word noun phrases: 08/10/2020 to 10/15/2021')
plt.show()
By looking at these two graphs, we can get an idea of the most common topics of the past year. These include:
- General US politics (e.g., democrats, republicans, washington, senate, congress, white house, joe biden, kamala harris, Graphs 1 & 2)
- Specific political issues within the US (e.g., supreme court, texas, north carolina)
- The NYC Mayoral race (e.g., maya wiley, scott stringer, andrew yang, eric adams, Graph 2)
- …and the US withdrawal from Afghanistan (e.g., afghanistan, taliban, Graph 1)
What an interesting use of NLP! Thank you Brian for all that you do for us New Yorkers!