Photo by shiyang xu on Unsplash

In a data science project, the first step is to understand the question that needs to be answered. From there, data needs to be gathered so that, once analyzed, it lets the data scientist answer the question. However, before the data can be analyzed, it must be cleaned. It is this critical step that we will go over here.

Problems from Uncleaned Data

There are multiple reasons why it is important to go through the data cleaning step. A number of errors can occur due to unclean data. One possible error is missing values. …


If you have just started to learn about data science and databases, you have probably heard about SQL, which stands for Structured Query Language. SQL first appeared in the early 1970s, and since then there have been many different ‘dialects’ of the language, such as those used by SQLite or PostgreSQL.

SQL is usually used with relational databases. A database is a program that helps store data and provides functions and methods for adding, modifying, or querying the data found within. …
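As a rough illustration of adding, modifying, and querying data with SQL, here is a minimal sketch using Python's built-in `sqlite3` module. The `cities` table and its rows are hypothetical and were invented for this example.

```python
import sqlite3

# Use an in-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table, invented purely for illustration.
cur.execute("CREATE TABLE cities (name TEXT PRIMARY KEY, population INTEGER)")

# Adding data.
cur.executemany(
    "INSERT INTO cities (name, population) VALUES (?, ?)",
    [("Springfield", 30000), ("Shelbyville", 25000)],
)

# Modifying data.
cur.execute(
    "UPDATE cities SET population = ? WHERE name = ?",
    (31000, "Springfield"),
)
conn.commit()

# Querying data, largest population first.
rows = cur.execute(
    "SELECT name, population FROM cities ORDER BY population DESC"
).fetchall()
print(rows)

conn.close()
```

The `?` placeholders let the database driver handle quoting, which is generally safer than building the SQL string by hand.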


Photo by Brett Zeck on Unsplash

Geospatial data is all around us. Streets, cities, states, countries, and events can all be considered geospatial data. We see evidence of its use in our lives every day. One of the most common examples is the weather forecast: the maps rendered on a weather channel are built from this type of data.

Another common application is GPS, the Global Positioning System, which allows us to get from point A to point B, or to find ourselves on the map when we get lost.

These different…


Data visualization can be one of the most important facets of a project. We see graphs every day, whether writing a research paper or reading the newspaper. Readers can be inundated with different data visualizations, each telling a story. Because of this, it is important to know how to create clear and concise data visualizations so that when somebody reads your work, they understand the points you are trying to convey and do not misinterpret the visualization.

There are multiple libraries that allow us to create data visualizations to better explore the data and/or show…


Photo by Flipboard on Unsplash

In previous posts, I have talked about different natural language methods, such as preprocessing and bag-of-words. TF-IDF is another of these tools. TF-IDF stands for “Term Frequency times Inverse Document Frequency.” It is an extension of the bag-of-words method.
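To make the name concrete, here is a minimal sketch of one common TF-IDF formulation: term frequency (a term's count divided by the document length) multiplied by the logarithm of the number of documents over the number of documents containing the term. The toy documents and the `tf_idf` helper are assumptions made up for this illustration, and libraries often use smoothed variants of the same idea.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration; each document is already tokenized.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]


def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(tokenized)
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

A word such as "the" that appears in every document gets a score of zero, while a word that appears in only one document is weighted more heavily.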

In the bag-of-words method, we first have to preprocess the document files so that we have the base or root of the words. We are then able to find the number of occurrences of those words in the different documents. This can be fine when you are modeling small text files, but for larger text files this process is not…


Photo by Brett Jordan on Unsplash

In my previous post, I gave an introduction to the history of natural language processing and to text preprocessing. Bag of Words is another NLP tool. Bag of Words (BoW) is considered a statistical language model. It analyzes texts based on word counts. This model does not care about the positioning of different words, only how often the words show up in the text. This is part of the reason why it is called a bag of words. When you add the text to the bag, the words are all jumbled up so that the positioning no longer matters…
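The idea that only counts matter, not positions, can be sketched in a few lines. This is a minimal illustration using Python's standard library; the `bag_of_words` helper and the example sentences are assumptions made up here, and the tokenizer is deliberately simple.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into words, and count each word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Two sentences with the same words in a different order.
first = bag_of_words("The dog chased the cat.")
second = bag_of_words("The cat chased the dog.")

print(first)
print(first == second)
```

Because the two sentences contain the same words the same number of times, their bags are identical even though the sentences mean different things.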


Photo by Annelies Geneyn on Unsplash

When you hear about great strides being made in computer programming, one area that you are likely to have heard about is Natural Language Processing. Natural Language Processing is usually shortened to NLP and is considered a branch of artificial intelligence. It combines linguistics, artificial intelligence, and computer science. It allows computers to interpret, analyze, and generate human language. For example, one technology that depends on NLP is the virtual assistant, such as Amazon’s Alexa, Apple’s Siri, and Google’s Duplex. …


OLS, or Ordinary Least Squares, is a useful method for evaluating a linear regression model. It does this by reporting statistical performance metrics for the model as a whole and for each individual parameter of the model. The OLS method comes from the StatsModels Python package. This package is well known for offering classes and functions that both estimate different types of statistical models and conduct statistical tests.
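To show what a least squares fit computes, here is a minimal hand-rolled sketch for a single predictor. It is not the StatsModels API: the `fit_simple_ols` helper and the sample data are assumptions invented for this illustration, and it reports only the slope, the intercept, and R-squared as one example of a whole-model metric.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by minimizing squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Closed-form least squares estimates for one predictor.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot

    return slope, intercept, r_squared


# Made-up data that roughly follows y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept, r_squared = fit_simple_ols(x, y)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
```

A full statistics package would also report per-parameter metrics such as standard errors and p-values, which this sketch leaves out.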

To understand this method, we should take a quick refresher on linear regression. Linear regression is an attempt at modeling the relationship between two or more variables. There is one…


Photo by Nicolas Picard on Unsplash

One of the many things you might have heard about while learning Python is web scraping. When I was working on a personal project, I had to collect weather data from multiple counties and looked to web scraping to gather that data. Two of the most useful tools that you can use for web scraping are Selenium and Beautiful Soup.

Beautiful Soup is a Python library that makes it easy for users to scrape data from web pages. Its tools make it easier to navigate through HTML or XML files and search for information through a tree-like…


Fuzzy Clustering

As I have been working through my studies of big data, one of the topics in my course was clustering. Clustering is fascinating because it is a type of unsupervised learning. A lot of what I have learned previously was about supervised learning. The difference between the two is that supervised learning needs known labels to answer specific questions, while unsupervised learning looks for any patterns within a dataset.

Supervised vs Unsupervised Learning:

Supervised Learning

Another way to look at supervised learning and unsupervised learning is to think about it mathematically. In supervised learning, we are answering…

Malcolm Katzenbach
