Do Disney villains really use more sophisticated language than the heroes?

Who among us hasn’t noticed that the villains of popular Disney films tend to carry themselves with a certain air of superiority? This is certainly not a new observation. It is well known that villains in these animated films, more often than not, adopt British English accents.

In fact, this trend seems to be characteristic of animated villains on the whole – at least according to one 1998 study by sociolinguists Julia Dobrow and Calvin Gidney. In their analysis of the personality traits and speech patterns of 323 characters across 76 animated children’s television shows, they found that dialect stereotypes were frequently used to indicate a character’s status as a hero or a villain. For villains, this most commonly meant a British English accent and hyper-exaggerated features (e.g., excessively rolled “r’s”).

A 2019 master’s thesis by Dea Maržić at the University of Rijeka took the analysis a step further. Maržić performed discourse analysis on nine Disney films produced between 1989 and 1998 (the so-called “Golden Age” of Disney films) to examine the ways in which Disney villains construct a villainous identity. The study found, unsurprisingly, that villains tended to adopt non-American accents. However, Maržić also showed that villains tended to adopt a dual nature, presenting themselves at once as benevolent victims and authoritative dictators (think Frollo’s seemingly generous yet domineering adoption of Quasimodo in “The Hunchback of Notre Dame”).


This photo captures it well, doesn’t it?

Being a Disney aficionado myself, I wanted to know whether the superiority projected by Disney villains is also reflected at a fine-grained linguistic level. Specifically (and looking for an excuse to apply my experience with Natural Language Processing [NLP]), I wanted to know whether Disney villains use more sophisticated language than the heroes do.

Approaching the Problem

Answering this question necessitated a number of preliminary steps.

First and foremost, what do I mean when I say “sophisticated language”? I chose to measure sophistication at the lexical (i.e., word) level, and thus adopted the index most commonly used by us vocabulary researchers: word frequency. Briefly, word frequency is determined in reference to language corpora – large-scale, structured collections of texts suitable for linguistic analysis. It involves counting the number of times a given word form (e.g., make, makes, making) or lemma (i.e., the base form of a word – make) occurs within the corpus, and (usually) norming the frequency count to determine how many times that word form or lemma occurs per million words. Less frequent words are treated as more sophisticated. Word frequency is an appropriate way to gauge the sophistication of speech since it is known to correlate strongly with multiple aspects of language growth and proficiency. For example, it is known that children acquire more frequent words earlier in their first language (e.g., Dascalu et al., 2016; Landauer, Kireyev, & Panaccione, 2011); that frequent words are accessed more quickly (Brysbaert & New, 2009); and that measures of single and multi-word frequency correlate with second language proficiency (Crossley et al., 2010).
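To make the "per million words" norming concrete, here is a minimal sketch. The function name and the example numbers are made up for illustration; they are not taken from the corpus used in this post.

```python
def per_million(raw_count, corpus_size):
    """Norm a raw corpus count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiply first so that round numbers stay exact in floating point.
    return raw_count * 1_000_000 / corpus_size


# Hypothetical example: a lemma seen 500 times in a 2,000,000-word corpus.
print(per_million(500, 2_000_000))  # 250.0
```

Norming in this way lets counts from corpora of different sizes be compared on the same scale.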

The frequency list I had on hand for this project was the magazine sub-corpus of the Corpus of Contemporary American English (COCA), which consists of 127,352,030 word forms from 86,292 different texts. I was interested in analyzing the following:

  1. Overall type (number of unique words) and token (number of total word forms) frequency
  2. Most common two-word (i.e., bigram) and three-word (i.e., trigram) phrasal (i.e., part-of-speech) patterns.
  3. Average type and token frequency within the most common phrasal patterns.  

Second, I had to determine which Disney heroes and villains I would compare. For this post, I will be reporting on an NLP case study of Jafar and Aladdin from the Disney film “Aladdin”. Analysis of other Disney villains and heroes will be reported in upcoming posts! 

Third, the following was my conceptual approach to addressing this question:

  1. Importing and isolating the dialogue of Jafar and Aladdin
  2. Cleaning the text to remove punctuation and stopwords
  3. Tokenizing the lines of each character
  4. POS tagging the lines
  5. Identifying most common phrasal patterns
  6. Calculating word frequency 

Finally, I needed a number of Python 3 packages to analyze the data according to the following standard pattern of text analysis:

  1. Text preprocessing (using “re”)
  2. Tokenization (Natural Language Toolkit [NLTK])
  3. POS-tagging (StanfordPOSTagger)
  4. Frequency counts (NLTK)

Let’s look at how I did this!

The Procedure

Let’s start by importing the necessary packages, importing the Aladdin script, and tokenizing the script.
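As a rough sketch of this first step, the snippet below tokenizes a short stand-in string with a regular expression. This is not the code used in the post: the sample lines are invented (they are not quoted from the film), the "NAME: text" layout is an assumption, and the regex is only a standard-library substitute for an NLTK tokenizer. In practice the text would be read from the script file instead.

```python
import re

# Invented lines standing in for the real script file (assumed layout).
SAMPLE_SCRIPT = "JAFAR: (Quietly) Bring me the lamp.\nALADDIN: Don't count on it!"


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # A word may contain internal apostrophes (e.g. "Don't");
    # every other non-space character becomes its own token.
    return re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", text)


tokens = tokenize(SAMPLE_SCRIPT)
print(tokens[:6])
```

Keeping punctuation as separate tokens matters here, because the parentheses are what mark the scene descriptions later on.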

We can see by looking at the script that there are three main types of text: character names, dialogue, and scene descriptions:

You may have noticed that in this script the character names are printed in all capital letters and the scene descriptions are enclosed within parentheses. I thus wrote two quick functions to cut out anything between parentheses and add the remaining text to a separate “dialogue” list:
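The two functions might look something like the sketch below. It is an approximation rather than the original code: it works on raw lines instead of tokens, assumes parentheses are never nested, and the function names are hypothetical.

```python
import re


def strip_scene_descriptions(text):
    """Remove anything enclosed in (non-nested) parentheses."""
    return re.sub(r"\([^)]*\)", "", text)


def collect_dialogue(raw_lines):
    """Return the lines that still contain text once scene descriptions are cut."""
    dialogue = []
    for line in raw_lines:
        # Collapse the whitespace left behind by the removed description.
        cleaned = re.sub(r"\s+", " ", strip_scene_descriptions(line)).strip()
        if cleaned:
            dialogue.append(cleaned)
    return dialogue


# Invented sample lines, not quoted from the film.
sample = ["JAFAR: (Quietly) Bring me the lamp.", "(A door slams.)", "ALADDIN: Not today!"]
print(collect_dialogue(sample))
```

Lines that consist only of a scene description come out empty and are dropped, so the result holds nothing but speaker names and dialogue.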

Running these two functions on our dialogue file resulted in a list of tokenized words without scene descriptions, going from this:

To this:

The next step was to isolate only Jafar and Aladdin’s lines. I did this simply using the “sent_tokenize” function from NLTK, which transforms text into a list of sentences by using common punctuation as a reference:
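For readers who want to try this without NLTK, the sketch below isolates one character's lines using only the standard library. It assumes each turn starts with an all-caps name followed by a colon, which may not match the real script layout, and it does not split the lines into sentences the way sent_tokenize does. The names SPEAKER and lines_for are hypothetical.

```python
import re

# Assumed layout: an all-caps speaker name, a colon, then the spoken text.
SPEAKER = re.compile(r"^([A-Z][A-Z' ]+):\s*(.*)$")


def lines_for(character, dialogue):
    """Return the spoken text of every line attributed to `character`."""
    spoken = []
    current = None
    for line in dialogue:
        match = SPEAKER.match(line)
        if match:
            current, text = match.group(1).strip(), match.group(2)
        else:
            text = line  # continuation of the previous speaker's turn
        if current == character and text:
            spoken.append(text)
    return spoken


# Invented sample dialogue, not quoted from the film.
dialogue = [
    "JAFAR: Bring me the lamp.",
    "It must be mine.",
    "ALADDIN: Not today!",
    "JAFAR: We shall see.",
]
print(lines_for("JAFAR", dialogue))
```

Tracking the current speaker means a turn that runs over several lines stays attributed to the right character.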

The end result of this code was a list of Jafar’s tokenized lines without his name or needless punctuation:

After doing the same thing for Aladdin, the next step was to remove stopwords (i.e., words that serve a grammatical function but tell us little about the content of speech, such as “the” and “be”) and calculate the overall frequency of each character’s words. After combining a custom stoplist (containing words that could skew the sophistication scores, such as the character names “Sultan” and “Genie”) with the stoplist from NLTK, and lemmatizing each character’s lines, I built a frequency dictionary for each character and compared it to the COCA magazine word frequency list:
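A simplified sketch of this step is shown below. The stoplists are tiny stand-ins for the NLTK and custom lists, lemmatization is left out, and the reference dictionary holds only the three per-million values quoted later in this post. All function and variable names are hypothetical.

```python
from collections import Counter

# Tiny illustrative stand-ins for the NLTK stoplist and the custom stoplist.
GENERIC_STOPWORDS = {"the", "be", "a", "to", "me", "is"}
CUSTOM_STOPWORDS = {"sultan", "genie"}
STOPLIST = GENERIC_STOPWORDS | CUSTOM_STOPWORDS

# Per-million values quoted in this post; a real list has many more entries.
REFERENCE = {"one": 2853.0, "like": 1837.0, "make": 1064.0}


def content_word_counts(tokens):
    """Count lowercased alphabetic tokens that are not on the stoplist."""
    words = (t.lower() for t in tokens if t.isalpha())
    return Counter(w for w in words if w not in STOPLIST)


def attach_reference_frequencies(counts, reference):
    """Map each word found in `reference` to (count, per-million frequency)."""
    return {w: (n, reference[w]) for w, n in counts.items() if w in reference}


# Invented token list, not quoted from the film.
tokens = ["Make", "the", "Genie", "make", "one", "wish", "."]
counts = content_word_counts(tokens)
print(attach_reference_frequencies(counts, REFERENCE))
```

Words missing from the reference list are skipped in this sketch; how to treat such words is a design decision that the post does not spell out.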

With the frequencies in hand, I was able to calculate the average token and type score for each character by summing the frequencies of each word token and word type and dividing them by the total number of tokens or types. 
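The difference between the two scores can be sketched as follows, using a toy subset of three words with the counts and per-million values reported for Aladdin in this post. The function names are hypothetical, and the printed averages describe only this toy subset, not the full results.

```python
def average_token_frequency(scored):
    """Mean reference frequency over every token (repeats count each time)."""
    total_tokens = sum(count for count, _ in scored.values())
    if total_tokens == 0:
        raise ValueError("no tokens to average")
    weighted_sum = sum(count * freq for count, freq in scored.values())
    return weighted_sum / total_tokens


def average_type_frequency(scored):
    """Mean reference frequency over unique words (each counted once)."""
    if not scored:
        raise ValueError("no types to average")
    return sum(freq for _, freq in scored.values()) / len(scored)


# word -> (times used by the character, frequency per million words)
toy_subset = {"one": (7, 2853), "like": (6, 1837), "make": (6, 1064)}
print(round(average_token_frequency(toy_subset), 2))
print(average_type_frequency(toy_subset))
```

Because the token average weights each word by how often it is spoken, a character who repeats very common words is pulled upward on that score while the type score is unaffected.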

The following violin plots demonstrate the findings:

ggstatsplot of average lemma token frequency
ggstatsplot of average lemma type frequency

If you looked at these two plots and thought that they produced opposite results, you are correct! It seems that in “Aladdin”, the word tokens used by Aladdin are, on average, more common than those used by Jafar. For example, Aladdin uses words like “one” (2853 / million), “like” (1837 / million), and “make” (1064 / million) multiple times in the film (seven, six, and six times, respectively). By comparison, Jafar uses these same words just once. This is not entirely surprising, considering that Aladdin has almost twice as many lines in the film as Jafar does!

The result for type frequency is somewhat more interesting. Although Aladdin also uses more unique words than Jafar does (274 to 191), the average frequency of his word types does not differ statistically from that of Jafar’s. For example, while Jafar snootily rattles off words such as “humblest” (0.35 / million), “beheading” (0.79 / million), and “abject” (1.14 / million), Aladdin heroically counters with words like “hoofbeats” (0.12 / million), “valets” (0.33 / million), and “lawmen” (0.35 / million).

In other words, both characters seem to draw on a comparable range of frequent and infrequent words.



So, despite projecting an obvious air of superiority, a nefarious British accent, and palpable arrogance, Jafar’s lexical performance fails to surpass that of a lowly street rat. Stay tuned for an extension of this analysis to other features of language (e.g., phrasal patterns) and other Disney characters! 
