An Introduction to SQL

Malcolm Katzenbach
4 min readMay 2, 2021

If you have just started to learn about data science and databases, you probably have heard about SQL, which stands for Structured Query Language. SQL was first seen in the early 1970s and since then there have been many different ‘dialects’ of the language like SQLite or PostgreSQL.

SQL is usually used in connection to relational databases. A database is a program that helps store data and adds functions and methods that allow for the adding, modifying or querying of the data found within. As data scientists, a good portion of our time is spent querying databases to gather the data needed to answer a question.

A relational database stores each different type of data into tables. The tables would be what you would expect if you opened an excel spreadsheet. Each row represents an observation and each column represents a specific piece of information about that observation. It is considered a relational database because each of these tables is related to one or more other tables. SQL was designed to work with this type of relational database.

For example, lets say we were working for a small business with a few locations. Some of the types of data we might expect would be: Customers, Store Sites, Employees, Orders, Order Details, Products and Payments. Each of these tables could be connected to one or more different tables. Employees would be connected to Store Sites because each employee would be working out of a specific store. The Employees table might also be connected to the Customers table because there might be particular employee that filled out the orders for that customer. The Customers Table could then be connected to the Payments and Orders Table.

Looking a bit deeper into each table we can see you how each could connect to another. The Customer Table might have information such as their first and last name, address, and other personal information. However, the data from a particular observation might not be unique in that column. If the Customer table had a column for first names, there would be a lot of overlap. So the table might have a column with only unique values, which would be considered a primary key for that table. In the case of a Customer table, assigning a unique Customer ID value to each observation would be an example of a primary key.

But then we can look at the Orders table. Each order likely has a unique Order ID number as a primary key. However, if we check some of the other information available in the table, we might find a Customer ID column telling us which customer made the order. A particular customer might make multiple orders, so the column is not a primary key. This would be considered a foreign key, because it is a primary key in a foreign table. Its through these primary and foreign keys that we can find the different relationships between the tables of the database.

Querying in SQL

As mentioned previously, a lot of work with relational databases will be through queries. Data is retrieved from a database usually by the SELECT statement. It makes sense, when you want to query specific information, you are selecting a particular group of data.

Within that SELECT statement, as in any method, there can be multiple qualifiers. If you want to query for the Customer ID and State columns from the Customer table, it is coded similarly to the sentence.

SELECT Customer_ID, State FROM Customer

The above code would return the Customer ID and State columns from the Customer Table.

Another common qualifier would be using the WHERE clause. Using the WHERE clause we can filter the data down to where the data matches the qualifier. Using the same example as above, we might only want the Customer ID and State columns with the desired State listed as California.

SELECT Customer_ID, State FROM Customer
WHERE State = 'California'

Or we might want all the possible information from the table with that qualifier, which can be done by selecting *.

SELECT * FROM Customer
WHERE State = 'California'

There are many different ways of filtering for data. Some other common qualifiers are:

  1. ORDER BY: This is useful for cases where you might want the most expensive orders to be at the top of the list
  2. LIMIT: This is helpful if you are only trying find an example of the data or if you have a massive table where using all the observations would slow down possible modeling.
  3. BETWEEN: This can be used with a WHERE clause using numerical values.

As in any language, there are many ways that you can transform and filter different data tables. As you grow more comfortable with SQL, more ways open up like combining different data tables into one and then selecting data from that based on specific qualifiers.

I hope that this helps build on your foundation for further studies of SQL and thank you for reading.

--

--