Language Processing and Python

It is easy to get our hands on millions of words of text. What can we do with it, assuming we can write some simple programs? What can we achieve by combining simple programming techniques with large quantities of text?
How can we automatically extract key words and phrases that sum up the style and content of a text? What tools and techniques does the Python programming language provide for such work? What are some of the interesting challenges of natural language processing? This chapter is divided into sections that skip between two quite different styles. In the “computing with language” sections we will take on some linguistically motivated programming tasks without necessarily explaining how they work. In the “closer look at Python” sections we will systematically review key programming concepts.
We’ll flag the two styles in the section titles, but later chapters will mix both styles without being so up-front about it. If the material is completely new to you, this chapter will raise more questions than it answers, questions that are addressed in the rest of this book.

1 Computing with Language: Texts and Words

We’re all very familiar with text, since we read and write it every day. But before we can work with text in our programs, we have to get started with the Python interpreter. When the interpreter starts, it prints a banner that ends with the line: Type “help”, “copyright”, “credits” or “license” for more information. If you are unable to run the Python interpreter, you probably don’t have Python installed correctly. Once the banner has been printed, the Python interpreter is waiting for input.
Once the interpreter has finished calculating the answer and displaying it, the prompt reappears. This means the Python interpreter is waiting for another instruction. Your Turn: Enter a few more expressions of your own. Working interactively with the Python interpreter in this way lets you experiment with various expressions in the language to see what they do. In Python, it doesn’t make sense to end an instruction with a plus sign, so an incomplete expression such as 1 + produces a syntax error. Now that we can use the Python interpreter, we’re ready to start working with language data.
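The interaction described above can be sketched as a short script. This is an illustration, not a transcript of an interpreter session: the particular expression, and the use of the built-in compile() function to surface the syntax error without stopping the script, are assumptions chosen for the example.

```python
# Evaluate an arithmetic expression, as you might type it at the prompt.
result = 1 + 5 * 2 - 3
print(result)

# An instruction that ends with a plus sign is incomplete. Compiling it,
# which the interpreter must do before it can run anything, fails.
incomplete = "1 +"
try:
    compile(incomplete, "<example>", "eval")
    error_name = None
except SyntaxError as err:
    error_name = type(err).__name__
    print(error_name + ":", err.msg)
```

At the interactive prompt you would simply type the expression and press Enter; the try/except wrapper is only needed here so that the script keeps running after the error.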
2 Getting Started with NLTK

Before going further you should install NLTK 3. Follow the instructions on the NLTK website to download the version required for your platform. Downloading the NLTK Book Collection: browse the available packages using nltk. The Collections tab on the downloader shows how the packages are grouped into sets, and you should select the line labeled book to obtain all data required for the examples and exercises in this book.
The book collection consists of about 30 compressed files requiring about 100 MB of disk space. Once the data is downloaded to your machine, you can load some of it using the Python interpreter. The output that you will see tells you to type the name of the text or sentence to view it, and lists the loaded texts, ending with text9: The Man Who Was Thursday. Now that we can use the Python interpreter, and have some data to work with, we’re ready to get started.
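As a minimal sketch of the loading step, the code below imports two of the book's objects from the nltk.book module. At the interactive prompt this is usually written as from nltk.book import *. The helper name load_book_texts is hypothetical and exists only for this example, and the fallback behaviour when NLTK or its data is missing is an assumption.

```python
import importlib.util


def load_book_texts():
    """Return (text1, sent1) from nltk.book, or None if NLTK or its data is missing.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    if importlib.util.find_spec("nltk") is None:
        return None  # NLTK itself is not installed
    try:
        # Importing nltk.book loads the texts and prints their titles.
        from nltk.book import text1, sent1
    except (ImportError, LookupError):
        # The data was not found. It can be fetched once with:
        #   import nltk; nltk.download("book")
        return None
    return text1, sent1


loaded = load_book_texts()
if loaded is None:
    print("NLTK or the book collection is not available yet.")
else:
    text1, sent1 = loaded
    print(text1)      # typing the name of a text shows its title
    print(sent1[:4])  # the first few tokens of the first sample sentence
```

The sketch deliberately does not call the downloader itself, so it never touches the network; run the download step once by hand before relying on it.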
3 Searching Text

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. For example, a concordance for the word monstrous in Moby Dick includes the line: CHAPTER 55 Of the monstrous Pictures of Whales . The first time you use a concordance on a particular text, it takes a few extra seconds to build an index so that subsequent searches are fast. To save re-typing when you try other words, you might be able to use Ctrl-up-arrow or Alt-p to access the previous command and modify the word being searched. You can also try searches on some of the other texts we have included.
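To make the idea concrete, here is a small pure-Python sketch of a concordance that builds an index once and reuses it for every search. It is not NLTK's implementation: the function names build_index and concordance, the width parameter, and the toy token list are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each lowercased token to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index


def concordance(tokens, index, word, width=2):
    """Return (left context, match, right context) for every occurrence of word."""
    rows = []
    for position in index.get(word.lower(), []):
        left = " ".join(tokens[max(0, position - width):position])
        right = " ".join(tokens[position + 1:position + 1 + width])
        rows.append((left, tokens[position], right))
    return rows


# Toy token list standing in for a real text (an assumption for illustration).
sample = "the monstrous whale and the most monstrous pictures of whales".split()
index = build_index(sample)  # built once, then reused for every search
hits = concordance(sample, index, "monstrous")
for left, match, right in hits:
    print(f"{left:>12}  {match}  {right}")
```

Building the index costs one pass over the text, which mirrors the short delay on the first search; later searches only look up positions in the dictionary.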
Search the book of Genesis to find out how long some people lived, using text3. Note that one of the included texts, the NPS Chat Corpus (text5), is uncensored! Once you’ve spent a little while examining these texts, we hope you have a new sense of the richness and diversity of language. In the next chapter you will learn how to access a broader range of text, including text in languages other than English. A concordance permits us to see words in context.
What other words appear in a similar range of contexts? Observe that we get different results for different texts. It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears.
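Both ideas in this passage can be sketched in a few lines of plain Python. The code below is an illustrative approximation, not NLTK's own algorithm: it treats the context of a word as the pair of its neighbouring words, ranks other words by how many such contexts they share with the target, and separately reports how many words from the beginning each occurrence appears. The names contexts_by_word, similar_words and word_offsets, and the toy token list, are assumptions for the example.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    lowered = [token.lower() for token in tokens]
    contexts = defaultdict(set)
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return contexts


def similar_words(tokens, word):
    """Return other words that share at least one context with word, best first."""
    contexts = contexts_by_word(tokens)
    target = contexts.get(word.lower(), set())
    shared_counts = {}
    for other, other_contexts in contexts.items():
        if other == word.lower():
            continue
        shared = len(target & other_contexts)
        if shared:
            shared_counts[other] = shared
    # Most shared contexts first; ties broken alphabetically.
    return sorted(shared_counts, key=lambda w: (-shared_counts[w], w))


def word_offsets(tokens, word):
    """Return the position of every occurrence, counted in words from the start."""
    return [i for i, token in enumerate(tokens) if token.lower() == word.lower()]


# Toy token list standing in for a real text (an assumption for illustration).
sample = "a monstrous size and a curious size but a monstrous whale".split()
print(similar_words(sample, "monstrous"))
print(word_offsets(sample, "monstrous"))
```

Because the result depends entirely on which neighbours a word has in the given token list, running the same query on a different text gives different similar words, as the passage observes.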