A key goal of many English language learners is to reach a point where they can effortlessly understand spoken media – television, radio, YouTube videos, etc. This is not only a wonderful milestone in the language learning process – it also goes a long way towards helping learners develop other aspects of language proficiency! Reaching a state of fluent oral comprehension takes time and effort, but it can be jumpstarted by learning vocabulary. Countless studies have shown that the number of words (or word families*) a learner knows – their vocabulary size – predicts oral comprehension outcomes. But exactly how many words must English language learners learn in order to reach fluent oral comprehension?

*Word families include a headword (e.g., play), its inflections (e.g., played, playing), and its derivations (e.g., replay)

Although it is estimated that native English speakers have vocabulary sizes of between 15,000 and 20,000 words by the time they graduate high school (Nation & Waring, 1993), researchers have shown that learners can get by with far fewer words depending on the type of media they wish to consume. For instance, Nation (2006) suggests that written texts such as novels and newspapers require knowledge of between 6,000 and 9,000 words to achieve “good” or “optimal” text comprehension – that is, to reach a point where 95% or 98% of the words in the text are known, respectively. Learners require far fewer words to reach these levels for spoken texts. Studies have consistently shown that knowledge of between 3,000 and 5,000 words is enough to reach high levels of coverage* of movies and TV shows (Webb & Rodgers, 2009a, 2009b), conversations (Nation, 2006), songs (Tegge, 2017), and podcasts (Nurmukhamedov & Sharakhimov, 2021).

*Lexical coverage refers to the percentage of words in a text that are known to the listener or reader.

One type of spoken text which may interest learners is news radio, e.g., NPR’s All Things Considered. On each hour-long episode of All Things Considered, reporters discuss the latest news headlines and present in-depth stories on current issues and events. Each episode is broken up into anywhere between 15 and 20 segments of about 5 minutes each. NPR produces transcripts of All Things Considered, which can aid language-focused study, and conveniently sorts each segment by topic. These factors position All Things Considered as ideal for learners who wish to increase not only their language proficiency but also their cultural awareness and engagement with culturally-authentic discourse. So, what are the lexical demands of NPR’s All Things Considered?

In this post I will report on three different analyses aimed at answering this question:

  1. A calculation of the number of words needed to reach 95% and 98% lexical coverage of the All Things Considered corpus.
  2. A comparison of the lexical demands of All Things Considered segments by topic.
  3. An analysis of the multi-word sequences formed by words at different frequency levels.

Let’s jump in!

How many words are needed to understand All Things Considered?

To measure the lexical demands of this show, I will construct what is called a “lexical profile” of the corpus – a distribution of words at different frequency levels. In practice, researchers have typically used premade corpus-based frequency lists to construct such a profile. This analysis uses a list of the 25,000 most frequent lemmas* in English, separated into 1,000-word bands, developed from the British National Corpus (BNC) (Cobb, 2014). Four lists from Nation’s (2012) BNC-COCA 25000 list were used to get a sense of how many of the words in the corpus were proper nouns (e.g., “Mike”, “President Biden”), acronyms (e.g., “NPR”), marginal words (e.g., “hmm”, “uh”), or transparent compounds (e.g., “firetruck”) – hereafter referred to as “PMAT”. Lexical demands are estimated by examining the lexical profile of the corpus and manually identifying the frequency band at which lexical coverage surpasses 95% and 98% – two benchmarks that have been associated with “adequate” and “good” comprehension, respectively (e.g., van Zeeland & Schmitt, 2013). The All Things Considered corpus was compiled by webscraping with Python (see this post for details), while the lexical profile was constructed using a custom Python script that functions similarly to AntWordProfiler (Anthony, 2014). The corpus contains more than 7.5 million words from around 10,000 individual segments (more than 800 hours of program content). Let’s look at the lexical profile of All Things Considered.
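To make the profiling step concrete, here is a minimal sketch of the coverage calculation. The toy corpus and toy three-word “bands” below are invented stand-ins for the real 1,000-lemma BNC-COCA bands; the real script works the same way at scale:

```python
from collections import Counter

def lexical_profile(tokens, bands):
    """Compute cumulative % coverage of the corpus at each frequency band.

    tokens: list of lemmatized corpus tokens
    bands: list of sets; bands[0] holds the most frequent lemmas, etc.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    profile = []
    for band in bands:
        covered += sum(n for lemma, n in counts.items() if lemma in band)
        profile.append(100 * covered / total)
    return profile

def band_reaching(profile, benchmark):
    """Return how many lemmas (in 1,000-lemma bands) are needed to hit a benchmark."""
    for i, pct in enumerate(profile, start=1):
        if pct >= benchmark:
            return i * 1000
    return None  # benchmark never reached within the lists

# Toy example: two tiny "bands" and a ten-token corpus
bands = [{"the", "be", "say"}, {"report", "issue", "event"}]
tokens = ["the", "report", "say", "the", "issue", "be", "the", "event", "xyz", "the"]
profile = lexical_profile(tokens, bands)  # [60.0, 90.0]
```

With the real band lists, `band_reaching(profile, 95)` and `band_reaching(profile, 98)` return the figures discussed below.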

*A lemma is a headword (e.g., play) plus only its most immediate inflections (e.g., plays, played, playing)

Do the lexical demands of All Things Considered differ by topic?

If you visit the show page for All Things Considered (see image to the right), you will see that each segment is assigned a distinct topic. This is quite useful for learners who are more interested in certain issues than others. Naturally, then, a question that teachers and learners might ask is, “Do students need different levels of vocabulary knowledge to understand different topics in the news?”

In preparation for the analysis, I created a list of all the topics collected during the webscraping process. I initially identified more than twenty topics but decided to collapse them into a smaller number of categories for ease of analysis. For example, topics such as “Business”, “Markets”, “Industry”, and “Economy” were collapsed into the single category “Business, Money, & Economy”. The 10 categories identified were Business, Money, & Economy (BME), Entertainment (ENT), Healthcare (HLT), Law & Politics (LAW), Music (MSC), Nature & Environment (NAT), National News (NTL), Science & Technology (SCT), Sports (SPR), and World News (WLD). The table below summarizes the lexical profiles for these topics.
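The collapsing step amounts to a simple lookup table. A minimal sketch is below; the raw topic names “Movies” and “Climate” are hypothetical examples, and the full mapping covered all twenty-plus scraped topics:

```python
# Map raw scraped topics onto collapsed category codes (abbreviated sketch;
# the real mapping covers every topic found during webscraping).
TOPIC_MAP = {
    "Business": "BME",
    "Markets": "BME",
    "Industry": "BME",
    "Economy": "BME",
    "Movies": "ENT",    # hypothetical raw topic name
    "Environment": "NAT",
    "Climate": "NAT",   # hypothetical raw topic name
}

def collapse_topic(raw_topic):
    """Return the collapsed category code, or the raw topic if unmapped."""
    return TOPIC_MAP.get(raw_topic, raw_topic)
```

Unmapped topics fall through unchanged, which makes it easy to spot scraped labels that still need a category.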

It can be seen in the table that, without accounting for PMAT, all but one topic (Healthcare) failed to reach the 95% and 98% coverage benchmarks through the word lists alone. If we take PMAT into account, however, we find that most topics reached the 95% coverage benchmark with high-frequency vocabulary alone (i.e., the most frequent 3,000 lemmas). In other words, if a learner knows 3,000 lemmas plus PMAT, then for most topics they will encounter an unknown lemma only every 20 words on average. The exception to this trend was “Nature & Environment”, which had a smaller proportion of PMAT (notably, proper nouns) and thus required knowledge of the first 4,000 lemmas to reach 95% lexical coverage. The number of lemmas needed to reach the 98% coverage benchmark varied by topic, but never surpassed the range of mid-frequency vocabulary (i.e., between the most frequent 4,000 and 9,000 lemmas). For all but two topics, knowledge of the first 5,000 lemmas (plus PMAT) would mean that an unknown lemma would be encountered only every 50 words on average. Learners would need to know the first 6,000 and 8,000 lemmas to reach 98% lexical coverage for segments focused on “Entertainment” and “Nature & Environment”, respectively. To summarize, the results of this analysis suggest that most topics had similar lexical demands, and that segments focused on “Entertainment” and “Nature & Environment” required knowledge of more vocabulary, on average, to understand.

Teaching multi-word sequences to improve students’ comprehension of All Things Considered

There is plentiful evidence that learners’ phrasal knowledge – that is, their knowledge of collocations (i.e., what words co-occur with a given word) and formulaic sequences – predicts many different aspects of second language proficiency (e.g., Siyanova & Schmitt, 2008 for aural processing speed; Uchihara et al., 2021 for speaking proficiency). Developing learners’ phrasal knowledge is critical for ensuring that words that may be known in terms of their basic/most common senses (e.g., cake = 🎂) become better fleshed out and more easily recognized in context (e.g., “birthday cake”) (Nguyen & Webb, 2017). While collocation knowledge is generally thought to improve as a result of extensive reading or listening, rather than explicit instruction (Schmitt, 2008), there may be a role for the use of collocations in explicitly teaching and presenting new vocabulary items. Because the above analysis showed that the first 3,000 lemmas (plus PMAT) may be a key milestone in language development, I will focus on highlighting collocations of some of the most frequent words in each word band. And I’ll do it using Python and Natural Language Processing to boot! Let’s take a look.

Stage 1: Identify the most frequent words in each band

The preliminary code used to identify the most frequent words in each band is presented below. The code (1) loads previously created bigram lists and unigram frequency lists, (2) creates normalized unigram frequency lists of K1-K3 items, and (3) extracts and visualizes the 10 most frequent lemmas from the normalized K1-K3 lists. The output of the visualizations is presented after the code below.
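In outline, the normalization and extraction steps work as follows. This is a minimal sketch with invented counts standing in for the real saved frequency lists (the real script loads those lists from disk and plots the results):

```python
def normalize(freqs, corpus_size):
    """Convert raw counts to frequencies per million words."""
    return {w: n / corpus_size * 1_000_000 for w, n in freqs.items()}

def top_n(freqs, n=10):
    """Return the n most frequent items as (lemma, per-million frequency) pairs."""
    return sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy stand-in for a saved K1 unigram frequency list (counts are invented)
k1_raw = {"know": 30_000, "people": 25_000, "time": 20_000}
k1_norm = normalize(k1_raw, corpus_size=7_500_000)
top = top_n(k1_norm, n=2)  # the real script uses n=10 and plots the result
```

Normalizing to a per-million rate makes the K1-K3 lists directly comparable even though each band contributes a different share of the corpus.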

Stage 2: Identify the most frequent collocations of common K1-K3 words

The above graphs show the 10 most frequent lemmas in the All Things Considered corpus, excluding words on NLTK’s stopwords list. Let’s take a closer look at the collocations of a few words from each word list. These are presented in the graphs below.
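Under the hood, pulling out the collocations for a given word can be sketched like this, assuming (as in the graph labels) that bigrams are stored as underscore-joined strings mapped to counts; the counts below are invented for illustration:

```python
def collocations_for(target, bigram_counts, top=3):
    """Return the most frequent bigrams that contain the target lemma."""
    hits = {bg: n for bg, n in bigram_counts.items() if target in bg.split("_")}
    return sorted(hits, key=hits.get, reverse=True)[:top]

# Invented counts for illustration only
bigrams = {"know_like": 500, "know_people": 450, "even_know": 400,
           "pretty_much": 300, "know_thing": 100}
result = collocations_for("know", bigrams)
print(result)  # ['know_like', 'know_people', 'even_know']
```

Splitting on the underscore (rather than substring matching) ensures that, say, “knowing” or “pretty_much” are not accidentally counted as hits for “know”.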


As can be seen in the graphs, the three most common bigrams for the 1K-word “know” are “know_like”, “know_people”, and “even_know”. The following concordance lines illustrate these collocations in context.

From these concordance lines, we can derive two interesting insights. First, we see that “know_like” and “know_people” seem to form the common trigrams “you know like” and “you know people”. The bigram “you know” did not appear in the top bigrams for “know” because the word “you” was removed as a stop word. However, from these few examples, we can see that “you know” is a commonly used filler in English, and thus something learners could be made aware of. Second, we see that “even_know” is generally used in negative sentence patterns, specifically within the phrase “subj + do not even know”. Making learners aware of these patterns could help them improve their recognition of these collocations in context and ultimately expand their use of the word “know”.
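Concordance lines like these can be produced with a simple KWIC (key word in context) function – a minimal sketch is below; NLTK’s built-in `Text.concordance` method is a ready-made alternative:

```python
def kwic(tokens, target, window=4):
    """Return key-word-in-context lines for every occurrence of target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

tokens = "i do not even know what to say".split()
print(kwic(tokens, "know", window=2))  # ['not even [know] what to']
```

Running this over the full corpus (one token list per segment) surfaces exactly the kind of negative-polarity pattern around “even know” discussed above.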

Next, two bigrams for the 2K-word “pretty” seem to have drastically higher frequencies than the others: “pretty_much” and “pretty_good”. The following concordance lines illustrate these collocations in context.

From these concordance lines, we can see that the most common usage of “pretty” in this corpus is not as an adjective, but as an adverb. This is quite interesting, as learners may learn the adjective form of “pretty” first, and be unaware of its adverbial usage. It is possible that a learner who is unfamiliar with the adverbial usage of “pretty” might wonder whether a speaker who utters the phrase “pretty good number” is talking about the aesthetic quality of that number! This example shows how using concordance lines and bigram frequency counts can help determine which senses of a word should be learned.
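As a rough illustration, adjectival and adverbial uses of “pretty” can be separated with a simple next-word heuristic. The hand-picked word set below is purely illustrative; a real analysis would use a POS tagger instead:

```python
# Tiny illustrative set of adjectives/adverbs that "pretty" commonly modifies
ADJ_OR_ADV = {"much", "good", "big", "sure", "close", "well"}

def pretty_usage(tokens):
    """Crudely classify each occurrence of 'pretty' as adverbial or adjectival
    based on whether the following word is a known adjective/adverb.
    A rough heuristic, not a substitute for a real POS tagger."""
    uses = []
    for i, tok in enumerate(tokens):
        if tok == "pretty":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            uses.append("adverb" if nxt in ADJ_OR_ADV else "adjective")
    return uses

print(pretty_usage("that is a pretty good number".split()))  # ['adverb']
print(pretty_usage("she wore a pretty dress".split()))       # ['adjective']
```

Even this crude check, applied across the corpus, would confirm that the adverbial sense dominates in All Things Considered.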

Lastly, the 3K-word “republican” also has a few collocates that stand out from the rest: “republican_party”, “republican_governor”, and “republican_senator”, for instance. These collocates clearly make up a thematic set of “political” vocabulary which can be taught together. What’s more, many of the collocates are also high-frequency vocabulary. Thus, teaching them in thematic sets (which research generally suggests is an effective way to present vocabulary) has the added benefit of knocking a few items off the high-frequency lists that should be learned.

Readers who are interested in knowing more about the collocates of the top words in the K1-K3 lists can refer to the reference graphs below. Teachers and learners who want to look up common collocates of other words can check out COCA, which allows for searching the collocates of individual words (and draws on a much larger corpus of general American English)!

Stage 3: Using collocations and corpora in vocabulary teaching

As an instructor of academic listening and speaking, I have often used COCA to aid my students’ vocabulary learning. One activity I have found particularly effective is to have students create their own personal dictionaries. For instance, learners can choose smaller sections of words from the high-frequency word lists (e.g., “political words” such as “republican”, “governor”, “congressman”, etc.) and be instructed to use COCA to identify the most frequent collocations of these words, filling in information about (1) the basic meaning of the word; (2) common collocates of the word; (3) example sentences with the word; and (4) the contexts in which they feel they encounter the word most often. Item (1) can be filled in using a dictionary, while (2) and (3) can be filled in with reference to COCA or another concordance program (e.g., the free program AntConc) together with the raw text corpus I have made available. Then, they can listen to specific NPR segments which are likely to use that vocabulary (e.g., looking at the “republican” word set in the “National News” subcorpus), compare the examples to their personal dictionaries, and update them accordingly!

No matter what activity is used, I hope this post has shown how powerful language corpora and corpus analysis tools can be in informing L2 vocabulary pedagogy!


Bigrams of Top 10 most common 1K Lemmas


Bigrams of Top 10 most common 2K Lemmas


Bigrams of Top 10 most common 3K Lemmas



