Natural Language Processing: Introduction to TF-IDF

Malcolm Katzenbach
Apr 11, 2021

In previous posts, I have talked about different natural language processing methods, such as preprocessing and bag-of-words. TF-IDF is another of these tools. TF-IDF stands for “Term Frequency times Inverse Document Frequency,” and it is an extension of the bag-of-words method.

In the bag-of-words method, we first preprocess the document files so that each word is reduced to its base or root form. We can then count the number of occurrences of those words in the different documents. This can work well when you are modeling small text files, but for larger text files a raw count is not always enough. For example, if you build a word count vector from several newspaper front pages, a large number of common terms will have high counts yet tell you nothing about the topic of each article. It is not enough to have a word count; you also need to know which terms are distinctive to a certain topic.
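As a quick illustration, here is a minimal bag-of-words sketch using scikit-learn’s CountVectorizer. The sentences and variable names are made up for illustration, and get_feature_names_out assumes scikit-learn 1.0 or newer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the electorate voted", "the game ended", "the electorate decided"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix of raw counts

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(counts.toarray())  # one row of word counts per document

Notice that a common word like “the” gets just as high a count as the topic-bearing words, which is exactly the problem described above.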

Another problem occurs when the text files differ greatly in length. A longer text will naturally have higher counts for certain words than a shorter one. So even if two texts are written about the same topic, a model might not classify them together, because the number of times an important term shows up in the shorter text will never match its count in the longer one.

Term Frequency — Inverse Document Frequency

These possible errors can be mitigated by using TF-IDF. Taking a closer look at the method, the term frequency describes the number of times a term shows up relative to the rest of the document. So instead of keeping a raw count of the terms, as in the bag-of-words method, we compare that count to the length of the document, which helps diminish the errors caused by text files of different lengths. Term frequency can be calculated by:

Term Frequency = (# of times a term appears in the document) / (total # of terms in the document)
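As a minimal sketch of this formula in plain Python (the token list is made up for illustration):

tokens = ["the", "electorate", "voted", "and", "the", "electorate", "celebrated"]

def term_frequency(term, tokens):
    # (# of times the term appears) / (total # of terms in the document)
    return tokens.count(term) / len(tokens)

print(term_frequency("electorate", tokens))  # 2 / 7 ≈ 0.286

Because we divide by the document length, the same term in a document twice as long (with the same proportion of occurrences) gets the same score.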

The other part of the method, inverse document frequency, helps alleviate the problem of common words such as “he” or “she” ranking as the most used words while saying nothing about the topic being written about. Less commonly used words are likely to be more reflective of a text file’s topic: the word “electorate” might not show up often, but if it appears, you can probably tell the document deals with government or politics. Inverse document frequency can be calculated by:

Inverse Document Frequency = log(total # of documents / # of documents containing the term)
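A minimal sketch of this formula over a toy corpus (the three documents are hypothetical, and math.log is the natural logarithm):

import math

documents = [
    ["the", "electorate", "voted"],
    ["the", "game", "ended"],
    ["the", "markets", "rose"],
]

def inverse_document_frequency(term, documents):
    docs_with_term = sum(1 for doc in documents if term in doc)
    # log(total # of documents / # of documents containing the term)
    return math.log(len(documents) / docs_with_term)

print(inverse_document_frequency("the", documents))         # log(3/3) = 0.0
print(inverse_document_frequency("electorate", documents))  # log(3/1) ≈ 1.099

A word that appears in every document, like “the,” scores zero, while a word confined to a single document scores highest.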

By multiplying these two values together, we get the TF-IDF score for a term. Another way to look at this is as a weighting method. The number of times a word shows up indicates what kinds of words are being used, but if a word occurs in many documents, its weight, or importance to any specific document, is lower. Conversely, if a less common word is used in only a single document rather than many, its weight is greater.
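Continuing the toy numbers from the sketch above, the combined score looks like this:

import math

# "electorate" is 1 of 3 terms in its document, and it appears
# in 1 of the 3 documents in the toy corpus above
tf = 1 / 3
idf = math.log(3 / 1)
print(tf * idf)  # ≈ 0.366: a rare, topic-bearing word gets a high weight

By contrast, “the” would get a weight of exactly zero in that corpus, since its IDF is log(3/3) = 0.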

To apply this in practice, the scikit-learn library provides a ready-made implementation.

Using TF-IDF with scikit-learn

The first step in using this method is to import it into your workspace.

from sklearn.feature_extraction.text import TfidfTransformer

From there, you can instantiate the transformer, assign it to a variable, and call its fit_transform method.

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Calling fit_transform on the term counts transforms the count vectors into a weighted representation of those terms.
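Putting the pieces together, here is a minimal end-to-end sketch on made-up documents. One caveat: by default, scikit-learn’s TfidfTransformer uses a smoothed IDF and L2-normalizes each row, so the numbers will differ slightly from the textbook formulas above:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the electorate voted", "the game ended", "the electorate decided"]

# Step 1: raw bag-of-words counts
count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(docs)

# Step 2: re-weight the counts with TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print(count_vectorizer.get_feature_names_out())
print(X_train_tfidf.toarray())  # "the" now carries less weight than the rarer words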

To recap, the TF-IDF method can be a powerful tool for analyzing and classifying text files.

For documentation on TF-IDF in scikit-learn:

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#extracting-features-from-text-files
