Introduction

In the pre-computer era, the creation of language corpora – compilations of millions of words drawn from multiple sources – was a painstaking and laborious undertaking. Imagine researchers huddled over desks, copying words from hundreds of texts onto index cards and filing them away in physical organizing systems.

Remember these?

Even with early computers and scanners, sampling and preparing a corpus was time-consuming, and as a result, corpora were quite small. However, with the increasing accessibility of computers, improvements in natural language processing (NLP), and the massive amount of text freely available on the web today, all that is required to compile a corpus is a little programming knowledge, and perhaps a large enough hard drive!

In this post, I will illustrate how web scraping can be used to mine large amounts of text from the web as a first step in creating a corpus!

Because a primary interest of mine is academic oral communication, I will apply this process to compiling an (albeit limited) corpus of academic spoken English based on lectures provided by MIT OpenCourseWare (https://ocw.mit.edu/index.htm).

Conceptual Overview

Conceptually, scraping text involves two key steps:

  1. Understanding the programmatic architecture of the website in terms of HTML tags.
  2. Writing a script based on these tags to extract the desired text.

To scrape data from the MIT OpenCourseWare website, I first needed to understand its architecture from both the user side and the developer side. From the user side, I immediately noticed that the actual data to scrape (video captions/transcripts already provided by the university) was nested several pages deep. Specifically, I had to navigate to a page listing all courses containing captioned video lectures, and from there through the following: Course overview/Captions & Transcripts/Lecture/Transcript – four separate lists of URLs. The following pictures illustrate this.

 

Audio/Video Lecture Home Page + Courses

 

Course homepage with “captions/transcript” link

 

List of lectures within course -> Individual lecture page

 

As I examined the HTML code of several sample courses, I noticed that the architecture was more or less the same, so I could write one program to collect the URLs I needed, visit each of them, and navigate to the page holding the transcript. There, I simply needed to locate the tag or tags containing the text I was trying to scrape and write a program to extract it.

For example, on the home page for the audio/video courses (https://ocw.mit.edu/courses/audio-video-courses/), URLs to individual courses are located within <h4> tags.

Once I arrived at the transcript page, all I needed to do was extract the already-provided transcript and save it to a text file.

Let’s look at how I did this.

Programmatic Overview

Step 1: Import necessary packages and define essential function

For this project, I used Python 3, nltk, and the BeautifulSoup package, an HTML/XML parser (https://pypi.org/project/beautifulsoup4/).

I also needed to write a function that would prepend the site's base URL (i.e., "https://ocw.mit.edu") to all links.
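A minimal sketch of such a helper is shown here. The function name `add_base_url` and the sample paths are hypothetical, and the sketch uses the standard library's `urljoin` rather than plain string concatenation, so that links which are already absolute pass through unchanged.

```python
from urllib.parse import urljoin

# Base URL of the site, as given in the post.
BASE_URL = "https://ocw.mit.edu"


def add_base_url(link, base=BASE_URL):
    """Return an absolute URL for a link scraped from the site.

    Hypothetical helper: urljoin resolves site-relative paths such as
    "/courses/..." against the base and leaves absolute URLs untouched.
    """
    return urljoin(base, link)


# Example usage with an illustrative relative path.
print(add_base_url("/courses/audio-video-courses/"))
```

Plain concatenation (`BASE_URL + link`) would work just as well if every scraped link is known to start with a slash.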

Step 2: Create a list of relevant URLs 

I started by locating the relevant HTML tags surrounding the links I needed, then extracted each URL from the tag's href attribute based on the text contained in the tag.
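As a rough sketch of this step, the snippet below pulls course links out of <h4> tags with BeautifulSoup. It is an illustration under assumptions rather than the exact script: the sample HTML, the course paths, and the function name `extract_course_urls` are all made up, and it assumes each link is an <a> element nested inside the <h4>.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://ocw.mit.edu"

# Hypothetical stand-in for the course listing page; the real markup
# may differ. In practice the HTML would be downloaded from the site.
SAMPLE_LISTING_HTML = """
<div>
  <h4><a href="/courses/example-department/example-course-one/">Example Course One</a></h4>
  <h4><a href="/courses/example-department/example-course-two/">Example Course Two</a></h4>
  <h4>Heading without a link</h4>
  <p><a href="/about/">About</a></p>
</div>
"""


def extract_course_urls(html, base=BASE_URL):
    """Collect absolute URLs from links nested inside <h4> tags."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for heading in soup.find_all("h4"):
        # Skip headings that do not wrap a link with an href attribute.
        link = heading.find("a", href=True)
        if link is not None:
            urls.append(base + link["href"])
    return urls


course_urls = extract_course_urls(SAMPLE_LISTING_HTML)
for url in course_urls:
    print(url)
```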

This code produces a list of URLs that looks something like this:

Continuing with the pattern, I was able to create a list of URLs for pages where the video captions could be found:
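One way to continue the pattern is to filter links by their visible text, for instance keeping only the link that mentions transcripts on a course page. The sketch below is hypothetical: the sample HTML, the keyword, and the name `find_links_by_text` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://ocw.mit.edu"

# Hypothetical stand-in for a course home page's navigation links.
SAMPLE_COURSE_HTML = """
<ul>
  <li><a href="/courses/example-department/example-course-one/syllabus/">Syllabus</a></li>
  <li><a href="/courses/example-department/example-course-one/captions-transcripts/">Captions &amp; Transcripts</a></li>
</ul>
"""


def find_links_by_text(html, keyword, base=BASE_URL):
    """Return absolute URLs of links whose visible text contains keyword."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for link in soup.find_all("a", href=True):
        # Case-insensitive match on the link's visible text.
        if keyword.lower() in link.get_text().lower():
            matches.append(base + link["href"])
    return matches


caption_urls = find_links_by_text(SAMPLE_COURSE_HTML, "transcript")
print(caption_urls)
```

Running the same function on each page in turn, with a different keyword at each level, would yield the nested lists of URLs described earlier.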

Step 3: Scrape the transcript text

Now that I had a list of all URLs for courses with audio/video lectures that were actually accompanied by a transcript, I could scrape the text of each. Every video lecture page had the same format: a 3-5 tab widget containing information about the video, a playlist, a transcript, a download link, and some related content. The problem, however, was that the id of the div tag corresponding to each tab did not always match the information contained within. For example, the “Transcript” tab above was under the <div id="vid_playlist"> tag; obviously, vid_playlist is not what it was supposed to be – vid_transcript.

I did some digging and noticed that there were 4 tab ids (about, playlist, index, related), meaning that there could be a number of mismatches! However, I also noticed that for every video, the text of the “Transcript” tab began with the word “Download”. As a quick fix, I simply wrote a program to take text only from the tag whose first 8 characters were “Download” (e.g., “Download English-US transcript (PDF)” in the image above), and the problem was solved.
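The quick fix can be sketched roughly as follows. This is an illustration under assumptions: only `vid_playlist` is confirmed above, so the other three ids are guessed from the same naming pattern, and the sample HTML and transcript text are placeholders.

```python
from bs4 import BeautifulSoup

# Tab ids to check. Only "vid_playlist" appears in the post; the other
# three are assumed to follow the same "vid_" naming pattern.
TAB_IDS = ["vid_about", "vid_playlist", "vid_index", "vid_related"]

# Hypothetical lecture page in which the transcript sits under the
# mismatched "vid_playlist" id.
SAMPLE_LECTURE_HTML = """
<div id="vid_about">About this video</div>
<div id="vid_playlist">Download English-US transcript (PDF)
Placeholder transcript text for illustration.</div>
<div id="vid_related">Related resources</div>
"""


def extract_transcript(html):
    """Return the text of the first tab that starts with "Download"."""
    soup = BeautifulSoup(html, "html.parser")
    for tab_id in TAB_IDS:
        tab = soup.find("div", id=tab_id)
        if tab is None:
            continue  # this tab is absent from the page
        text = tab.get_text().strip()
        # Check the first 8 characters, whatever the tab's id says.
        if text[:8] == "Download":
            return text
    return None  # no tab looked like a transcript


transcript = extract_transcript(SAMPLE_LECTURE_HTML)
print(transcript)
```

Returning `None` when no tab matches makes it easy to skip lectures without a transcript before writing anything to a text file.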
