Intro to Natural Language Processing and Text Preprocessing

Photo by Annelies Geneyn on Unsplash

When you hear about great strides being achieved in computer programming, one area you are likely to have heard about is Natural Language Processing. Natural Language Processing, usually shortened to NLP, is considered a branch of artificial intelligence that combines linguistics, artificial intelligence, and computer science. It allows computers to interpret, analyze, and generate human language. For example, one technology that depends on NLP is virtual assistants like Amazon’s Alexa, Apple’s Siri, and Google’s Duplex. These virtual assistants take verbal commands and allow the computer to complete actions for you, such as making a call or sending a text.

One of the first examples of NLP was the Turing Test. In the 1950s, Alan Turing came up with this test, known then as the imitation game, which tests a machine’s ability to display intelligent behavior similar to a human’s. The tester would hold a natural language conversation with both a machine designed to react with human-like responses and another human being. The participants would be separated and communicate in text format, removing the need for the computer to respond with a convincing human voice. If the tester was unable to tell the difference between the human and the machine, the machine passed the test. (Britannica: Turing Test)

There are many different ways that a programmer can use NLP to understand data. It can help detect spam or detect bias in news articles. It can classify the sentiment of reviews as positive, neutral, or negative. To start exploring NLP, we will look more closely at one of the most important steps: text preprocessing.

Text Preprocessing

One of the most powerful tools in NLP is a Python library called the Natural Language Toolkit, generally shortened to NLTK. The library was begun by Steven Bird and his student Edward Loper in 2001 at the University of Pennsylvania (Bird: FAQ). It “interfaces to 50 corpora and lexical resources” (NLTK Documentation). Another great tool is regex, or regular expressions. Using these tools, we can preprocess a text in preparation for NLP.

When we consider a file of text, we understand that there are rules of grammar that affect how the language is written. The computer does not automatically have this ability. When a reader sees words such as “Apple”, “Apple,”, and “apple”, they understand that these are all the same word; due to capitalization and punctuation, however, the computer reads them as different. One of the first steps in preprocessing a text file is to remove the “noise” in the file. This means we want to remove punctuation and symbols, such as the @ or # in tweets, and convert the string to lower case. An early lesson in Python probably taught you the lower() method, which converts all characters in a string to lower case. To remove punctuation, we can use the re library and its sub() method.

re.sub(pattern, repl, string)

The pattern is a regular expression describing what the method should look for in the text. For punctuation it might be a character class such as r'[\.\,\?\!\'\"]' (one possible pattern; extend it with whatever symbols you need to strip). The repl is what you want to replace those characters with, in this case “” (the empty string). Finally, the string is the text you want to clean.
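Putting these pieces together, a minimal noise-removal sketch might look like the following (the exact character class and the sample sentence are my own assumptions, not a fixed recipe):

```python
import re

# Sample tweet-like text (an invented example).
text = "Check out #NLP with @python! It's great."

# Lowercase first, then strip punctuation and social-media symbols.
lowered = text.lower()
cleaned = re.sub(r"[\.\,\?\!\'\"#@]", "", lowered)

print(cleaned)  # → check out nlp with python its great
```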

The second step is separating the text into a list of individual words, a process known as tokenization. In NLTK, the word_tokenize() method does this.

from nltk.tokenize import word_tokenize
tokenized_text = word_tokenize(text)

After we have separated each word into a list, there is a process known as text normalization. As mentioned before, two forms of a word such as “run” and “running” are similar, but we might want the computer to treat them as the same. Common forms of normalization are stemming, lemmatization, and stop word removal.

Stemming is a brute-force method that removes prefixes and suffixes. While this may change a word like “walking” to “walk”, it also changes words like “anywhere” to “anywher”, incorrectly removing the final e.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = stemmer.stem(word)
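A quick sketch of the Porter stemmer applied to a few tokens (the word list is my own illustration) shows both the useful and the over-aggressive behavior:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["walking", "running", "anywhere"]

# Stem each token; note the mangled result for "anywhere".
stems = [stemmer.stem(w) for w in words]
print(stems)  # → ['walk', 'run', 'anywher']
```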

Lemmatization is a preprocessing method that returns a word to its root or base form (its lemma) by looking the word up in a dictionary rather than simply chopping off affixes.

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized = lemmatizer.lemmatize(word)

Stop words are common words that add little to the computer’s understanding of what a text is about. Words such as “the”, “is”, or “at” would be removed from the list of tokenized words.

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

Each of these methods helps a programmer prepare text data for further analysis. I hope this introduction has helped you start into the brilliant field of Natural Language Processing.


Britannica, The Editors of Encyclopaedia. “Turing test”. Encyclopedia Britannica, 7 May 2020. Accessed 28 March 2021.

“Natural Language Toolkit”. Natural Language Toolkit — NLTK 3.5 Documentation.

Bird, Steven. “FAQ”. NLTK FAQ.




Malcolm Katzenbach
