Intro to SQL
October 12th, 2020
I strongly believe that if you want a Data Analyst or Data Scientist job, you must be comfortable with extracting and manipulating data from SQL. This post should not be used as a comprehensive guide to SQL, but it will provide an overview of most common SQL commands.
What is SQL? SQL stands for Structured Query Language (you can pronounce it as 'sequel' or 'ess-cue-elle'). It allows you to retrieve data stored in relational databases, such as Microsoft SQL Server, MySQL or Oracle, and manipulate it in a variety of ways.
If a concept of a database is new to you, you can think of it as a collection of tables. For instance, you can draw a comparison to an Excel workbook with multiple worksheets where each worksheet is somehow related to another. Worksheets in this comparison are SQL tables and the entire workbook would be a database. The key is to have some sort of relationship between the tables that allows you to bring all that data together. Please keep in mind that this is just an analogy and Excel workbook is not a database.
Each table in SQL consists of rows and columns where columns represent different data attributes and rows represent observations or data records. Tables can contain different types of data but the main types are integer, decimal, character or string, date, and time.
In this tutorial, I won't go into details on how to setup or create a database but will focus on several useful SQL commands instead. You will learn how to select data from tables, create new variables using conditional logic, filter records, aggregate tables, sort records in a table, and join multiple tables together. These are a must know basic commands for anyone interested in learning SQL.
SELECT
SELECT
is the most important command as it allows you to retrieve data
from a table. SELECT
on itself won't do anything, so you need to
specify what data you
want to retrieve and from where.
You can retrieve data from a table, so you need to specify a table name in the SELECT
statement. SELECT
statement refers to
the columns part of a table and not the rows, so you also need to list columns. Here are a few
things to remember:
- ✓To select a column, you will simply need to type its name
- ✓ You can select all columns by typing their name or you can
use
*
to select all columns without typing their names - ✓ You can select as many or as few columns from a table as you want
- ✓ You can even create new columns in the
SELECT
statement
Let's assume that we have a table named "Provinces" that contains two columns -- "Province" and "ProvinceCode".
SELECT Province, ProvinceCode FROM Provinces
The same can be achieved by running the following command.
SELECT * FROM Provinces
You can also select only Province from this table.
SELECT Province FROM Provinces
You can add a column to the output of your select statement. Let's say you want to indicate that
all of this provinces are in Canada by adding a column "Country" with a value of
"Canada". You'll need to specify first what the value should be followed by a key word
as
and the name you want this column to take. Please note that this
statement does not modify "Provinces" table! Also, you can only use single quotes in SQL
when working with strings.
SELECT *, 'Canada' as Country FROM Provinces
If you want to add a product of two numeric columns to the output of the select statement, this is what you can do
SELECT num_col1, num_col2, num_col1*num_col2 as num_col_product FROM table_name
Any other mathematical operator will work. There are also built-in functions that you can apply on columns to create new columns. Let's say you have a date column, and you want to add a year calculated from that column.
SELECT date_column, year(date_column) as yr FROM table_name
If you want to create a column that contains first two characters from one column and last three
characters from another column separated by -, you can use LEFT, RIGHT
functions and +
to combine multiple strings together.
SELECT str_col1, str_col2, left(str_col1,2)+'-'+right(str_col2,3) as new_col FROM table_name
You can find an overview of various functions Here. I encourage you to read documentation and familiarize yourself with what functions you can use. You don't need to remember the exact syntax or arguments that a function takes. Should you ever need to use one of these functions, you can always Google it.
WHERE
The commands above would select all rows from table "Provinces". There will be cases where you
don't want all of the observations to be returned; that's when you use the WHERE
clause. It is typically specified after you specify a table name
(unless you do a join, more on this to come). This is a typical syntax WHERE column_name operator value
. Here is a list of operations that you
can perform in the WHERE
clause.
- ✓Mathematical comparisons with the following operators
=, <,>, <=, >=, <>
. Note that<>
means not equal.=, <>
can be used for strings and numbers - ✓ To search for a pattern in a string, you can use
LIKE
operator - ✓ You can use
IN
operator to specify multiple values in a column - ✓ You can have multiple filters in the
WHERE
clause separated byAND
orOR
. Don't forget to specify column name for every filter you pass in theWHERE
clause
Let's say we want to select all columns from "Provinces" table where column "ProvinceCode" is ON.
SELECT * FROM Provinces
WHERE ProvinceCode = 'ON'
Let's say we want to select all columns from "Provinces" table where column "ProvinceCode" is not ON.
SELECT * FROM Provinces
WHERE ProvinceCode <> 'ON'
If we want to filter based on multiple values, we can do the following
SELECT * FROM Provinces
WHERE ProvinceCode = 'ON' or ProvinceCode = 'BC'
You can see that you can specify multiple filters by using OR
in this
case. Keep in mind if you have two or more filters in the WHERE
clause, you need to type variable name and operator again.
This will return an ERROR
INVALID SYNTAX!!
SELECT * FROM Provinces
WHERE ProvinceCode = 'ON' or 'BC'
The filtering example above is not very efficient. You would typically use multiple filtering
criteria for different columns. If you want to select multiple values from a column, you can use
IN
operator.
SELECT * FROM Provinces
WHERE ProvinceCode in ('ON', 'BC')
If you want all values but ON and BC, you can simply add not
before
IN
SELECT * FROM Provinces
WHERE ProvinceCode not in ('ON', 'BC')
When you have strings in a column and want to select values based on a certain pattern you can
use LIKE
operator. There are two wildcards that can be used with it.
%
which represents zero, one or multiple characters or _
which represents single character.
Let's select all provinces that have an s in their name. Since the length of each Province name can vary and it can have an s anywhere in the name, you can do the following
SELECT * FROM Provinces
WHERE Province like '%s%'
If you want to select a Province that has an i
as second to last
letter, you can use _
SELECT * FROM Provinces
WHERE Province like '%i_'
CASE WHEN
You can use conditional logic to create variables by following this syntax.
CASE WHEN condition THEN result1 ELSE result2 END AS new_variable
You can also have multiple conditions.
CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ELSE result3 END AS new_variable
Using "Province" table, you can create a new column "Region" that takes "Atlantic" value if "ProvinceCode" is NL, NB, PE or NS. For all other provinces, let's set the value to "Non-Atlantic".
SELECT *,
CASE WHEN ProvinceCode in ('NL', 'NB', 'PE', 'NS') then 'Atlantic' else 'Non-Atlantic' END AS Region
FROM Provinces
GROUP BY
GROUP BY
can be used to summarize values in a table (aggregate,
average, count, etc.). The following table can serve as an example for when you would want to
use GROUP BY
. The table contains 8 records, 4 observations for each
year where each row represents a quarter for the corresponding year.
What if you were asked to calculate average sales or total sales for 2019 and 2020? That's where
you can use GROUP BY
. In order to use it correctly, in the SELECT
statement, you will first need to specify the columns for which
you want to display aggregated data (year in our case), a column that needs to be transformed
and indicate which transformation needs to be applied (summed, averaged, counted, take the max
or min).
Please note that in order for this query to work correctly, you will need to exclude "Quarter" column. If you include it, since there is one record per Quart & Year, you will simply get the same output as the original table.
The rest of your query would follow as standard, i.e. you indicate FROM
which table you need to retrieve the data. If there are any filters you need to apply, you
specify them in the WHERE
clause. Finally, you add GROUP BY
which should contain one or more variables separated by commas
for which you are grouping the data.
SELECT var1, var2, sum(var3) as sum_var3 FROM table GROUP BY var1, var2
Notice that you only specify the variable that don't need to be grouped in the GROUP BY
statement.
Let's see how we can calculate total sales for each year.
SELECT Year, sum(Sales) as Sales
FROM qrt_sales
GROUP BY Year
This is how you can get average annual sales.
SELECT Year, avg(Sales) as avg_Sales
FROM qrt_sales
GROUP BY Year
You can get total and average sales all in one statement.
SELECT Year, avg(Sales) as avg_Sales, sum(Sales) as sum_Sales
FROM qrt_sales
GROUP BY Year
You can also count how many observations there are in this table for each year (it's obvious in this example, but there will be cases where you might need to get observation count).
SELECT Year, count(Sales) as count_Sales
FROM qrt_sales
GROUP BY Year
ORDER BY
If you want to sort the order of the SELECT
statement, you can use
ORDER BY
. This doesn't change the order of records in a table, and
only affects the output of the statement. You can sort in descending order by adding desc
key word after the name of the variable you want to sort. The default order is ascending. You can also sort by
multiple variables; you just need to separate them by commas.
SELECT * FROM Provinces
ORDER BY ProvinceCode
JOIN
If you remember in the beginning of the post, I mentioned that a database is a collection of
tables where each table has some sort of identifier or key that allows you to join the tables
together. That's really what a JOIN
clause does.
- ✓ LEFT JOIN returns all records from the left table and records from the right table that are found in the left table.
- ✓ RIGHT JOIN returns all records from the right table and records from the left table that are found in the right table.
- ✓ INNER JOIN only returns records that are present in both tables.
- ✓ FULL JOIN returns all records from both tables.
This is what the output of the four joins would look like using the tables in the example above.
Notice that the unique key in Table 1 and Table 2 is City. Also, a JOIN
returns a NULL
for when a match isn't
found; this is how SQL flags missing values.
Here is the syntax for four different joins shown above.
SELECT t1.City, t1.Airport, t2.Ocean FROM Table1 t1 LEFT JOIN Table2 t2 ON t1.City=t2.City
SELECT t2.City, t2.Ocean, t1.Airport FROM Table1 t1 RIGHT JOIN Table2 t2 ON t1.City=t2.City
SELECT t1.City, t1.Airport, t2.Ocean FROM Table1 t1 INNER JOIN Table2 t2 ON t1.City=t2.City
SELECT t1.City, t1.Airport, t2.Ocean FROM Table1 t1 FULL JOIN Table2 t2 ON t1.City=t2.City
A few things to remember about the syntax.
- ✓ You need to specify join type after the
FROM
command. - ✓ If certain columns exist in both tables, you need to
specify from which table you want to select a column by adding table name before column name
SELECT table_name1.variable_name1, table_name2.variable_name2
- ✓ Instead of typing full table name before column name, you
can assign an alias in the
FROM
command andJOIN
clause. In the example above, you can see that inFROM Table1 t1 LEFT JOIN Table2 t2
t1 and t2 are aliases which allows me to use them before column names rather than full table names. - ✓ You need to specify a key or keys on which you want to
join tables, like this
ON t1.key=t2.key
- ✓ You can join table on multiple keys
ON t1.key1=t2.key1 and t1.key2=t2.key2
. Please note that the names of the key columns you want to join on, don't need to match. - ✓ You can join more than two tables and can even perform different types of joins.
SELECT t1.Key, t2.Name1, t3.Name2
FROM table1 t1
LEFT JOIN table2 t2
ON t1.key = t2.key
RIGHT JOIN table3 t3
ON t1.key = t3.key
UNION
UNION
can also combine two or more tables but unlike a JOIN
clause, it appends the tables on top of each other. In order for
UNION
to work correctly, you need to specify the same number of
columns appearing in the same order from each table.
SELECT col1, col2
FROM table1
UNION
SELECT col1, co2
FROM table2
When can UNION
be helpful? If you have similar data stored in different
tables. Perhaps, data from one year is stored in one table whereas data from another year is
stored in a different table, and you want to combine it in one table, that's when you would use
the UNION
operator.
Putting all the code mentioned above together, this is how your syntax would look like.
SELECT t1.col1, t1.col2, sum(t2.col3) as col3
FROM table1 t1
LEFT JOIN table2 t2 on t1.id=t2.id
WHERE t1.col1 in ('A', 'B') and t1.col2 < 50
GROUP BY t1.col1, t1.col2
ORDER BY t1.col1, t1.col2
Notice how each command starts on a separate line? If you write everything in one line, your code will still work, but it will be not very readable, so I'd say it's a good practice to write your code in a similar way.
I'm planning on adding a SQL playground to this post where you can write basic commands covered in this tutorial. I'm hoping it can be completed within a few weeks. In the meantime, here are a few useful links you can checkout.