
Book analysis with AI techniques

Years 7-8; 9-10

This learning sequence explores text analysis through Natural Language Processing, a significant application of Artificial Intelligence.

Teachers and students are led through a series of video tutorials to develop a Python program that can break down and analyse the content of a complete text, such as Robert Louis Stevenson's Treasure Island, and use sentiment analysis to attempt to determine the villain(s) and hero(es).

This learning sequence is recommended for Years 9 and 10 or experienced students in Years 7 and 8. It is not recommended for beginners to General Purpose Programming. Basic understanding of iteration, branching and functions is assumed. (For a learning sequence on Natural Language Processing designed to better suit new programmers in Years 7 and 8, try A Sentimental Chatbot.)


 


Overview

View the Overview video for more information on this learning sequence, including a short lecture on Natural Language Processing. (Check the Resources section for links to more information on advanced concepts such as Part Of Speech Tagging and Lexicon Normalisation.)

Part 1: Setup

View the video Setup. This will introduce the repl.it coding environment and the process for obtaining the text of Alice in Wonderland from the online repository Project Gutenberg.
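As a rough guide to what the setup produces, here is a minimal sketch (not the course code) of loading a saved Project Gutenberg text file into a string; the filename is only an example.

    def load_book(filename):
        # Project Gutenberg plain-text files are UTF-8; errors="ignore"
        # skips any stray characters the decoder cannot handle.
        with open(filename, encoding="utf-8", errors="ignore") as f:
            return f.read()

    book_text = load_book("alice_in_wonderland.txt")  # example filename only
    print(len(book_text), "characters loaded")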


Questions for discussion

Part 2: Removing punctuation

View the video Removing punctuation. In this part, we write and test a function to remove all punctuation from the text, so that our text analysis can focus on words only.

(Completed code up to this point.)
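If you would like a point of comparison, the sketch below shows one possible way to strip punctuation in Python; the video's implementation and function name may differ.

    import string

    def remove_punctuation(text):
        # str.maketrans maps every punctuation character to None,
        # so translate() deletes them all in a single pass.
        return text.translate(str.maketrans("", "", string.punctuation))

    print(remove_punctuation("Alice was beginning to get very tired, wasn't she?"))
    # Alice was beginning to get very tired wasnt she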


Questions for discussion

Skill review

Functions will be used throughout this learning sequence, including functions with parameters and return values.

View the video Intro to Functions in Python for a brief introduction to writing and using functions.
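As a quick refresher (not taken from the videos), the example below shows a function with a parameter and a return value.

    def word_count(text):
        # split() with no argument splits on any run of whitespace
        return len(text.split())

    total = word_count("Down the rabbit hole")
    print(total)  # 4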

Part 3: Tokenisation

View the video Tokenisation 1, noting the minor changes made manually to the text of Alice in Wonderland before the coding begins:

  • The word CHAPTER has been added before each chapter heading, eg. “CHAPTER I--DOWN THE RABBIT-HOLE”.
  • The Gutenberg license text has been removed from the end of the file.

In this part, we write new functions to break up the book text into a list of words, or into a list of sentences.

(Completed code up to this point.)
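For reference, here is one way such tokenising functions might look; the names and details are assumptions rather than the exact code from the video.

    def get_words(text):
        # assumes punctuation has already been removed, so splitting
        # on whitespace is enough to produce a list of words
        return text.split()

    def get_sentences(text):
        # a simple approach: split on full stops and drop empty pieces
        # (the video may use something more robust)
        return [s.strip() for s in text.split(".") if s.strip()]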


Questions for discussion

Now view the video Tokenisation 2. In this video, we write two more functions to break up the book text into a list of paragraphs, or into a list of chapters. This completes our library of functions for tokenising the book's text.

(Completed code up to this point.)
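The remaining tokenisers might look something like the sketch below; again, the names and details are assumptions, not the video's exact code.

    def get_paragraphs(text):
        # paragraphs in the Gutenberg file are separated by blank lines
        return [p.strip() for p in text.split("\n\n") if p.strip()]

    def get_chapters(text):
        # relies on the manual edit noted above: the word CHAPTER now
        # appears before every chapter heading
        parts = text.split("CHAPTER")
        return [c.strip() for c in parts[1:]]  # drop the text before Chapter I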


Skill building

Working with large bodies of text usually requires a bit of manual editing.

View the video Text File Preparation for more details on obtaining and editing the most suitable book text of Alice in Wonderland.


Part 4: Modular programming

View the video Modular programming. In this part, the functions we've created are moved into a separate file.

(Completed code up to this point.)
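After the split, the main program only needs an import. The sketch below assumes the functions were moved into a file called book_functions.py; the video may use a different module name.

    # main.py
    from book_functions import remove_punctuation, get_words, get_chapters

    with open("alice_in_wonderland.txt", encoding="utf-8") as f:
        book_text = f.read()

    words = get_words(remove_punctuation(book_text))
    print(len(words), "words across", len(get_chapters(book_text)), "chapters")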


Questions for discussion

Lecture: Sentiment analysis

View the video Lecture – Sentiment Analysis to:

  • discover interesting connections between linguistics and digital technologies,
  • import a Python module that makes it easy to incorporate sentiment analysis into your own programs,
  • explore two numbers for measuring sentiment: polarity and subjectivity.

Lectures are primarily intended for a teacher audience, but you may choose to view the video again with students.

Part 5: Testing sentiment analysis

View the video Testing Sentiment Analysis. In this part, the TextBlob module is used to attempt to rate the polarity and subjectivity of sentences, paragraphs and chapters. (For an explanation of these two concepts, be sure to view the lecture in the previous section.)

(Completed code up to this point.)
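A minimal check of the two measures looks like this; the sample sentence is only an example.

    from textblob import TextBlob  # install with: pip install textblob

    sentence = "The Queen shouted at poor Alice in a furious rage."
    blob = TextBlob(sentence)

    # polarity ranges from -1 (negative) to 1 (positive);
    # subjectivity ranges from 0 (objective) to 1 (subjective)
    print(blob.sentiment.polarity, blob.sentiment.subjectivity)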


Questions for discussion

Part 6: How many times does each word appear?

View the video Frequency of Words 1. In this part, we write and test a function to store each unique word in the book alongside the number of times that word appears in the book. In Part 7, we’ll rank these to find out the most frequent words in the book.

(Completed code up to this point.)
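One common way to build such a frequency dictionary is sketched below; the video's function name and details may differ.

    def word_frequencies(words):
        freq = {}
        for word in words:
            # get() returns 0 the first time a word is seen
            freq[word] = freq.get(word, 0) + 1
        return freq

    print(word_frequencies(["the", "Rabbit", "the", "Alice"]))
    # {'the': 2, 'Rabbit': 1, 'Alice': 1}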


Skill review

A dictionary data structure is used to hold the data for the most frequent words. This data structure is sometimes called an associative array or a map in other programming languages.

View the video Intro to Dictionaries for a brief introduction to the dictionary data structure.
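If you just need a reminder, the basics look like this:

    ages = {"Alice": 7, "Rabbit": 3}   # keys map to values
    ages["Hatter"] = 40                # add a new key/value pair
    print(ages["Alice"])               # look up a value by its key -> 7
    print("Queen" in ages)             # test whether a key exists -> False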

These short exercises will help you practise using Python dictionaries:

Questions for discussion

Now view the video Frequency of Words 2. In this video, we make some improvements to our function. Now, the dictionary will be constructed ignoring the case of the words from the book. So, “Rabbit” and “rabbit” will now be considered one unique word, and the frequency value will reflect all instances of both words.

But, if the function’s new second argument cap is set to True, something quite different will happen. The dictionary will be constructed with only the Title Case words from the book, such as proper nouns. So, “rabbit” will not be included at all, but words like “Rabbit” and “Alice” will be included.

During this video, the presenter makes use of a Python shortcut called a List Comprehension, which allows a list to be quickly made from another list without many lines of code. This is not an essential skill and is merely done for convenience. See this external tutorial for more information on List Comprehensions.
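A sketch of the improved function is shown below. The argument name cap matches the video, but the rest is an assumption about how the code might look; note the two list comprehensions.

    def word_frequencies(words, cap=False):
        if cap:
            # keep only Title Case words (likely proper nouns), eg. "Rabbit"
            words = [w for w in words if w.istitle()]
        else:
            # ignore case, so "Rabbit" and "rabbit" count as one word
            words = [w.lower() for w in words]
        freq = {}
        for word in words:
            freq[word] = freq.get(word, 0) + 1
        return freq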

Part 7: Ranking words by frequency

View the video Ranking Words by Frequency. In this short part, we make use of a dictionary created with the function we wrote in part 6. The dictionary contains unique words from the book alongside how often they appear in the book. Now we will sort those entries. The result is a simple list of the words ordered by frequency.

(Completed code up to this point.)
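One common way to produce that ordered list is shown below with toy counts; the video may reach the same result differently.

    freq = {"alice": 12, "the": 30, "rabbit": 5}   # toy counts for illustration
    ranked = sorted(freq, key=freq.get, reverse=True)
    print(ranked)  # ['the', 'alice', 'rabbit']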


Questions for discussion

Part 8: Removing stop words

View the video Removing Stop Words. In this part, we write a function to filter out very common English words, and another function to filter out stop words like “the”, “I”, “and”. By first removing all these from the list of words in the book, our ranked list of frequent words will be more useful.

(Completed code up to this point.)

  • Click here to access the webpage with 1000 common words.
  • Click here to access the webpage with stop words.
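As a point of comparison, a filter function might look like the sketch below. The real word lists come from the webpages linked above; a tiny hard-coded set stands in for them here.

    STOP_WORDS = {"the", "i", "and", "a", "to", "of", "it"}  # sample list only

    def remove_stop_words(words):
        return [w for w in words if w.lower() not in STOP_WORDS]

    print(remove_stop_words(["Down", "the", "rabbit", "hole", "I", "went"]))
    # ['Down', 'rabbit', 'hole', 'went']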

Questions for discussion

Part 9: Heroes and villains

View the video Heroes and Villains 1. By now, we have identified the likely main characters in the book. In this part, we use sentiment analysis to guess whether each character is a hero or villain.

(Completed code up to this point.)
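To give a sense of the approach, here is a rough sketch (not the course's exact method): average the polarity of every sentence that mentions a character, then treat a clearly negative average as a villain and a clearly positive one as a hero. The cut-off of 0.05 is an arbitrary choice for illustration.

    from textblob import TextBlob

    def judge_character(name, sentences):
        scores = [TextBlob(s).sentiment.polarity
                  for s in sentences if name in s]
        if not scores:
            return "unknown"
        average = sum(scores) / len(scores)
        if average > 0.05:       # arbitrary cut-off for this sketch
            return "hero"
        if average < -0.05:
            return "villain"
        return "unclear"

    sample = ["Alice was kind and gentle.", "The Queen screamed in fury."]
    print(judge_character("Alice", sample), judge_character("Queen", sample))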


Questions for discussion

Now view the final video Heroes and Villains 2. In this video, we try a couple of other books, and we tweak the behaviour of the function for judging the main characters.

Projects / Assessment


Initially, students may wish to try reusing the program already written in this learning sequence:

  • choose a different book that can be obtained in Plain Text format,
  • try to identify the main characters,
  • try to determine which of the main characters are heroes or villains,
  • by judging the polarity of each chapter, try to determine if the story has a happy or sad ending.

Sometimes it can be hard to tell if articles from newspapers and other online sources are reporting pieces or opinion pieces.

Write a fresh program that uses sentiment analysis to determine if an article is a reporting piece or an opinion piece, based on the subjectivity of its language. We might expect opinion pieces to have higher subjectivity.

You will need to:

  • find at least 3 reporting pieces and 3 opinion pieces, ideally from the same news source,
  • obtain or convert each article into Plain Text format (eg. using copy-paste),
  • use the TextBlob library to analyse the subjectivity of the article,
  • make a decision whether the article is reporting or opinion.
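As a starting point only, the sketch below reads an article saved as plain text and labels it using a subjectivity threshold of 0.5; the filename and threshold are placeholders to refine once you have tested real articles.

    from textblob import TextBlob

    def classify_article(filename, threshold=0.5):
        with open(filename, encoding="utf-8") as f:
            text = f.read()
        subjectivity = TextBlob(text).sentiment.subjectivity
        label = "opinion" if subjectivity > threshold else "reporting"
        return subjectivity, label

    print(classify_article("article1.txt"))  # placeholder filename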

Note, you may wish to reuse modules or functions written in this learning sequence, but your main program must be freshly written, with appropriate comments.

Design and implement a research project to test a limit of the sentiment analysis approach used in this learning sequence, eg. how well does it respond to different conversational styles?

Write a new tool for analysing dialogue in a movie or play script.

The tool will be able to separate each paragraph, ignore stage/screen directions and notes, and attribute each piece of dialogue to a character from a limited cast.

From there, characters can be compared based on volume of dialogue and sentiment analysis of dialogue.

Resources

  • Coding
    • Python cheat sheet (from Grok Learning)
    • Another Python cheat sheet that focuses on string functions (ways to manipulate text)
    • Visual to text coding series of lessons with videos and exercises to help you and your class transition from visual coding (eg Scratch) to general purpose programming (eg Python and JavaScript)
  • Natural Language Processing theory