Do Disney villains really use more sophisticated language than the heroes?
Who among us hasn’t noticed that the villains of popular Disney films tend to carry themselves with a certain air of superiority? This is certainly not a new observation: more often than not, the villains in these animated films adopt British English accents.
In fact, this trend seems to be characteristic of animated villains on the whole – at least according to one 1998 study by sociolinguists Julie Dobrow and Calvin Gidney. In their analysis of the personality traits and speech patterns of 323 characters across 76 animated children’s television shows, they found that dialect stereotypes were frequently used to signal a character’s status as a hero or a villain. Most commonly, this meant a British English accent with hyper-exaggerated features (e.g., excessively rolled “r’s”). A 2019 master’s thesis by Dea Maržić at the University of Rijeka took the analysis a step further. Maržić performed discourse analysis on nine Disney films produced between 1989 and 1998 (the so-called “Golden Age” of Disney films) to examine the ways in which Disney villains construct a villainous identity. The study found, unsurprisingly, that villains tended to adopt non-American accents. However, Maržić also showed that villains tended to adopt a dual nature, presenting themselves at once as benevolent victims and authoritative dictators (think Frollo’s seemingly generous yet domineering adoption of Quasimodo in “The Hunchback of Notre Dame”).
Being a Disney aficionado myself, I wanted to know whether the superiority projected by Disney villains was also reflected at a fine-grained linguistic level. Specifically, and looking for an excuse to apply my knowledge and experience with Natural Language Processing (NLP), I was interested in knowing whether Disney villains use language that is more sophisticated than that of the heroes.
Approaching the Problem
Answering this question necessitated a number of preliminary steps.
First and foremost, what do I mean when I say “sophisticated language”? I chose to measure sophistication at the lexical (i.e., word) level, and thus adopted the index most commonly used by us vocabulary researchers: word frequency. Briefly, word frequency is determined in reference to language corpora – large-scale, structured collections of texts suitable for linguistic analysis. It involves counting the number of times a given word form (e.g., make, makes, making) or lemma (i.e., the base form of a word – make) occurs within the corpus, and (usually) norming the count to determine how many times that word form/lemma occurs per million words. Word frequency is an appropriate way to gauge the sophistication of speech because it strongly correlates with multiple aspects of language growth and proficiency. For example, children acquire more frequent words earlier in their first language (e.g., Dascalu et al., 2016; Landauer, Kireyev, & Panaccione, 2011); frequent words are accessed more quickly (Brysbaert & New, 2009); and measures of single- and multi-word frequency correlate with second language proficiency (Crossley et al., 2010).
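To make the norming step concrete, here is a tiny sketch. The corpus size is the one quoted below for the reference corpus I used; the raw count is an invented figure for illustration:

# A toy example of norming a raw frequency count to occurrences per million words.
# The raw count here is hypothetical, chosen purely for illustration.
corpus_size = 127352030  # Total word forms in the reference corpus
raw_count = 135503       # Invented raw count for the lemma "make"
normed_freq = raw_count / corpus_size * 1000000
print(round(normed_freq))  # ~1064 occurrences per million words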
The frequency list I had on hand for this project came from the Magazine sub-corpus of the Corpus of Contemporary American English (COCA), which consists of 127,352,030 word forms from 86,292 different texts. I was interested in analyzing the following (a short code sketch of these measures follows the list):
- Overall type (number of unique words) and token (number of total word forms) frequency
- Most common two-word (i.e., bigram) and three-word (i.e., trigram) phrasal patterns, defined as part-of-speech sequences.
- Average type and token frequency within the most common phrasal patterns.
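Here is a minimal sketch of what these measures look like in code. The sentence is a toy example, and I use NLTK’s built-in pos_tag for brevity (the analysis itself, described below, uses the StanfordPOSTagger):

import nltk
from nltk import FreqDist, bigrams, trigrams

# A toy token list standing in for a character's dialogue
tokens = ['you', 'are', 'late', 'you', 'have', 'it', 'then']

# Types vs. tokens: unique word forms vs. total word forms
print(len(tokens))       # 7 tokens
print(len(set(tokens)))  # 6 types ("you" occurs twice)

# Part-of-speech patterns: tag the tokens, then count two- and three-tag sequences
tags = [tag for word, tag in nltk.pos_tag(tokens)]
print(FreqDist(bigrams(tags)).most_common(3))
print(FreqDist(trigrams(tags)).most_common(3))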
Second, I had to determine which Disney heroes and villains I would compare. For this post, I will be reporting on an NLP case study of Jafar and Aladdin from the Disney film “Aladdin”. Analysis of other Disney villains and heroes will be reported in upcoming posts!
Third, my conceptual approach to answering the question was as follows:
- Importing the script and isolating the dialogue of Jafar and Aladdin
- Cleaning the text to remove punctuation and stopwords
- Tokenizing the lines of each character
- POS tagging the lines
- Identifying most common phrasal patterns
- Calculating word frequency
Finally, I needed a number of Python 3 packages to analyze the data according to the following standard text-analysis pipeline (a rough sketch of how these pieces fit together follows the list):
- Text preprocessing (using “re”)
- Tokenization (Natural Language Toolkit [NLTK])
- POS-tagging (StanfordPOSTagger)
- Frequency counts (NLTK)
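For orientation, here is roughly how these pieces fit together on a single line of dialogue. This is a sketch, not the analysis code itself: the Stanford tagger requires a separate Java download, and the model and jar paths below are placeholders, as is the parenthesized stage direction in the sample line:

import re
import nltk
from nltk.tag import StanfordPOSTagger

line = "ALADDIN: All this (GRABS BREAD) for a loaf of bread?"
cleaned = re.sub(r'\(.*?\)', '', line)  # Preprocessing: strip parenthesized text ("re")
tokens = nltk.word_tokenize(cleaned)    # Tokenization (NLTK)

# POS tagging (placeholder paths to the Stanford model and jar files)
st = StanfordPOSTagger('english-bidirectional-distsim.tagger', 'stanford-postagger.jar')
tagged = st.tag(tokens)

freqs = nltk.FreqDist(tokens)           # Frequency counts (NLTK)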
Let’s look at how I did this!
The Procedure
Let’s start by importing the necessary packages, importing the Aladdin script, and tokenizing the script.
import os #For loading text files from the hard drive
import nltk #Natural Language Toolkit
import re #Regular expressions for searching through text

path = r'...Villians Project\aladdin.txt'
a_text = open(path, 'r', encoding='utf-8').read() #Reads the Aladdin script
a_tokenized = nltk.word_tokenize(a_text) #Tokenizes the entire script using NLTK
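One practical note if you’re following along: NLTK’s tokenizers, stopword list, and lemmatizer (all used below) rely on data packages that need to be downloaded once:

import nltk
nltk.download('punkt')      # Used by word_tokenize and sent_tokenize
nltk.download('stopwords')  # Used by nltk.corpus.stopwords
nltk.download('wordnet')    # Used by the WordNet lemmatizer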
We can see by looking at the script that there are three main types of text: character names, dialogue, and scene descriptions. The character names are printed in all capital letters, and the scene descriptions are enclosed within parentheses. I thus wrote two quick functions to cut out the scene descriptions and add the remaining text to a separate “dialogue” list:
#This function is the first step in separating the dialogue from the scene descriptions.
#It splits the tokenized script into chunks, starting a new chunk each time a character
#name (two or more consecutive capital letters) is found.
def c_split(tokenized, empty_list):
    for item in tokenized:
        found_c = False #Sets the "found capital letter" variable to False
        current = [] #Defines a placeholder list
        length = len(item) #Sets length of list
        counter = 0 #Counter used with "length" to determine if an item is the last in the list
        for x in item:
            counter += 1
            if found_c == False: #If a capital letter has not been found...
                if re.search('[A-Z][A-Z]', x): #...and this item has two capital letters at the beginning
                    current.append(x) #Append to placeholder list
                else: #...and this item does not have two capital letters
                    current.append(x)
                    found_c = True #Sets the variable to "capital letter found"
            elif found_c == True: #If a capital letter has been found...
                if re.search(r'[A-Z][A-Z]', x): #...and this is another capital letter...
                    found_c = False #Set the variable back to False: a new character is talking
                    empty_list.append(current) #Appends the previous character's dialogue to "dialogue"
                    current = [] #Resets the placeholder list to take the new character's dialogue
                    current.append(x)
                else:
                    current.append(x)
            if counter == length:
                empty_list.append(current)

#This loop supplements the function above, and takes care of scene descriptions within
#a character's dialogue: chunks that do not begin with a character name are merged back
#into the preceding chunk.
for i in range(len(dialogue)):
    if i < len(dialogue):
        first_item = dialogue[i][0]
        if re.search(r'[A-Z][A-Z]', first_item):
            continue
        else:
            dialogue[i] = dialogue[i-1] + dialogue[i]
            del dialogue[i-1]

#The same pass is run a second time to catch any chunks left over after the first merge.
for i in range(len(dialogue)):
    if i < len(dialogue):
        first_item = dialogue[i][0]
        if re.search(r'[A-Z][A-Z]', first_item):
            continue
        else:
            dialogue[i] = dialogue[i-1] + dialogue[i]
            del dialogue[i-1]
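The call to c_split isn’t shown above, but given its signature and the `dialogue` list used in the merging loops, it was presumably invoked along these lines (a hypothetical reconstruction):

# Hypothetical invocation: c_split iterates over a list of token lists,
# so the flat token list from above is wrapped in an outer list.
dialogue = []
c_split([a_tokenized], dialogue)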
Running these two functions on our tokenized script resulted in a list of tokenized lines without scene descriptions:
[['ALADDIN', ':', 'All', 'this', 'for', 'a', 'loaf', 'of', 'bread', '?'],
 ['GUARD', '1', ':', 'There', 'he', 'is', '!'],
 ['GUARD', '2', ':', 'You', 'wo', "n't", 'get', 'away', 'so', 'easy', '!'],
 ['ALADDIN', ':', 'You', 'think', 'that', 'was', 'easy', '?'],
 ['GUARD', '1', ':', 'You', 'two', ',', 'over', 'that', 'way', ',', 'and', 'you', ',', 'with', 'me', '.', "We'll", 'find', 'him', '.'], ...]
The next step was to isolate only Jafar and Aladdin’s lines. I did this using the “sent_tokenize” function from NLTK, which splits text into a list of sentences, using common punctuation as a reference:
jafar_dialogue_split = []
for item in jafar_dialogue:
    current = []
    for y in item:
        if re.search(r'JAFAR', y): #Removes Jafar's name from his lines
            continue
        if re.search('[^A-Za-z!\.\'?]+', y):
            #Removes all punctuation except exclamation points, periods, question
            #marks, and apostrophes (needed for contractions)
            continue
        else:
            current.append(y)
    split = nltk.sent_tokenize(" ".join(current)) #Splits sentences based on sentence-final punctuation
    for x in split:
        current1 = []
        current1.append(nltk.word_tokenize(x))
        jafar_dialogue_split.append(current1)

#Now we can safely remove all punctuation!
jafar_dialogue_final = []
for item in jafar_dialogue_split:
    for y in item:
        current = []
        for z in y:
            if re.search('[^A-Za-z\']+', z):
                continue
            else:
                current.append(z.lower())
        jafar_dialogue_final.append(current) #Success!
The end result of this code was a list of Jafar’s tokenized lines without his name or needless punctuation:
[['you', 'are', 'late'],
 ['you', 'have', 'it', 'then'],
 ['you', "'ll", 'get', 'what', "'s", 'coming', 'to', 'you'],
 ['quickly', 'follow', 'the', 'trail'], ...]
After doing the same thing for Aladdin, the next step was to remove stopwords (i.e., words that serve a grammatical function but don’t tell us much else about the content of speech, e.g., “the”, “be”) and calculate the overall frequency of each character’s words. After combining a custom stoplist (containing words that could skew the sophistication measure, such as character names – e.g., “Sultan”, “Genie”) with the stoplist from NLTK, and lemmatizing each character’s lines, I built a frequency dictionary for each character and compared it against the COCA Magazine word frequency list:
#Let's define a few functions that will help us lemmatize the text
import nltk, re
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

stop_words = stopwords.words('english')
normalizer = WordNetLemmatizer()

def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)
    pos_counts = Counter()
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech

def preprocess_text(text):
    cleaned = re.sub(r'\W+', ' ', text).lower()
    tokenized = word_tokenize(cleaned)
    normalized = [normalizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]
    filtered = [word for word in normalized if word not in stop_words]
    return " ".join(filtered)

def preprocess_normal(text):
    cleaned = re.sub(r'\W+', ' ', text).lower()
    tokenized = word_tokenize(cleaned)
    normalized = " ".join([normalizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized])
    return normalized

#Let's lemmatize Jafar and Aladdin's lines
j_dialogue_lemmatized = [normalizer.lemmatize(y, get_part_of_speech(y)) for x in jafar_dialogue_final for y in x]
a_dialogue_lemmatized = [normalizer.lemmatize(y, get_part_of_speech(y)) for x in aladdin_dialogue_final for y in x]

#Let's create our reference frequency dictionary
import pandas as pd
df = pd.read_csv(r'C:...Villians Project\freq_input.csv')
ref_freq_dict = {}
ref_words = []
ref_values = []
for item in df['#word']:
    ref_words.append(item)
for value in df['normed_freq']:
    ref_values.append(value)
for i in range(len(ref_words)):
    ref_freq_dict[ref_words[i]] = ref_values[i]

#Let's calculate overall frequency for Jafar and Aladdin's dialogue!
j_diag_freq = {}
a_diag_freq = {}
j_diag_full = []
a_diag_full = []

#First append the dialogue to one giant list for each character
for sen in jafar_dialogue_final:
    for tok in sen:
        j_diag_full.append(tok)
for sen in aladdin_dialogue_final:
    for tok in sen:
        a_diag_full.append(tok)

#Let's remove stopwords from Aladdin and Jafar's dialogue
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
custom_stop = ['sultan', 'ababwa', 'abooboo', 'genie', 'abu', 'jasmine', 'ghazeem',
               'jafar', 'aladdin', 'agrabah', 'ali', 'al', '\'d', '\'re', '\'ll',
               '\'m', '\'s', 'uh', 'ah', '\'ve', 'ya', 'ai', 'um', 'ca']
stop_words = set(stop_words + custom_stop)
a_full_filtered = [w for w in a_diag_full if not w.lower() in stop_words]
j_full_filtered = [w for w in j_diag_full if not w.lower() in stop_words]

#Then create a frequency dictionary for each character for all of the words in their dialogue
for token in j_full_filtered:
    if token in ref_freq_dict.keys():
        j_diag_freq[token] = ref_freq_dict[token]
for token in a_full_filtered:
    if token in ref_freq_dict.keys():
        a_diag_freq[token] = ref_freq_dict[token]
With the frequencies in hand, I was able to calculate the average token and type frequency for each character by summing the frequencies of all word tokens (or word types) and dividing by the total number of tokens (or types).
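The code for this averaging step isn’t shown above, but with the frequency dictionaries already built, a minimal sketch might look like this (shown for Jafar; Aladdin’s scores are computed the same way):

# Average token frequency: every occurrence of a word counts once
token_freqs = [ref_freq_dict[tok] for tok in j_full_filtered if tok in ref_freq_dict]
j_avg_token_freq = sum(token_freqs) / len(token_freqs)

# Average type frequency: every unique word counts once
j_avg_type_freq = sum(j_diag_freq.values()) / len(j_diag_freq)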
The following violin plots demonstrate the findings:
If you looked at these two plots and thought that they showed opposite results, you are correct! It seems that in “Aladdin”, the word tokens used by Aladdin are, on average, more common than those used by Jafar. For example, Aladdin uses words like “one” (2853 / million), “like” (1837 / million), and “make” (1064 / million) multiple times in the film (seven, six, and six times, respectively). By comparison, Jafar uses each of these words just once. This is not entirely surprising, considering that Aladdin has almost twice as many lines in the film as Jafar!
The result for type frequency is somewhat more interesting. Although Aladdin uses more unique words than Jafar (274 vs. 191), the average frequency of these word types does not statistically differ between the two characters. For example, while Jafar snootily spits out words such as “humblest” (0.35 / million), “beheading” (0.79 / million), and “abject” (1.14 / million), Aladdin heroically counters with words like “hoofbeats” (0.12 / million), “valets” (0.33 / million), and “lawmen” (0.35 / million).
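I won’t claim which statistical test sits behind “does not statistically differ”; since word frequency distributions are heavily skewed, one plausible choice (an assumption on my part, not necessarily the test used here) is a non-parametric Mann-Whitney U test on the two characters’ type-frequency lists:

# Sketch of one way to test the type-frequency difference; the original
# analysis may well have used a different test.
from scipy.stats import mannwhitneyu

stat, p = mannwhitneyu(list(j_diag_freq.values()), list(a_diag_freq.values()))
print(stat, p)  # A p-value above .05 suggests no reliable difference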
In other words, both characters seem to use an equal range of frequent and infrequent words.
So, despite projecting an obvious air of superiority, a nefarious British accent, and palpable arrogance, Jafar’s lexical performance fails to surpass that of a lowly street rat. Stay tuned for an extension of this analysis to other features of language (e.g., phrasal patterns) and other Disney characters!