Introduction to Importing Data in Python, Part 1
March 5th, 2020
As a data scientist, you really need to get comfortable working with data. Before you can build a cool machine learning model, you need to spend a significant amount of time figuring out where the data resides and in what format, how to access it, how to join multiple tables together, and how to prepare it for further analysis. In this post, I will cover most of the tabular data formats I have had a chance to work with and how to import them in Python.
CSV Files
CSV files are probably the least common data format that I personally get to work with, but I've interviewed quite a few people who primarily work with CSV files, so it's important to cover them.
I downloaded a GDP csv file from the World Bank. Fortunately, the downloaded file is quite messy. There are extra empty columns, empty top and bottom rows, columns with information that we may not want, and column names take up two cells. While you can clean the file up in Excel, let's have some fun and clean it in Python.
In this example, we will use the read_csv function from the pandas library, which comes with a variety of helpful options. One of pandas' disadvantages is that the library is big, and I've run into a few problems referencing it during deployment, but since we are simply trying to load a CSV file into Python, I opted to use it. The built-in csv module in Python is also worth looking into.
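As a quick illustration of that standard-library alternative, here is a minimal sketch of reading CSV data with the built-in csv module; the data string is made up for the example.

```python
import csv
import io

# The standard-library csv module reads rows as plain lists of strings,
# with no extra dependency -- handy when pandas is too heavy to deploy.
data = "country,gdp\nAruba,3.2\nAngola,101.4\n"

reader = csv.reader(io.StringIO(data))  # with a file: csv.reader(open(path))
rows = list(reader)
print(rows)  # [['country', 'gdp'], ['Aruba', '3.2'], ['Angola', '101.4']]
```

Note that every value comes back as a string, so numeric conversion is up to you; that is the main convenience pandas adds on top.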
How do we clean up the file? We can use the skiprows option to remove the empty top rows. We can also drop the 3rd and 6th columns, since they aren't needed, and rename the columns in the data frame. Lastly, we will use the dropna function to remove rows with at least one missing value.
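Putting those steps together, the cleanup might look like the sketch below. A small in-memory string stands in for the World Bank download, and the row counts, column positions, and column names are illustrative; adjust them to match the actual file.

```python
import io
import pandas as pd

# A miniature stand-in for the messy download: empty top rows, an extra
# column we don't need, and a row with missing values.
raw = (
    ",,,\n"
    ",,,\n"
    "Country,Code,Extra,GDP\n"
    "Aruba,ABW,x,3.2\n"
    "Angola,AGO,,\n"
)

df = pd.read_csv(io.StringIO(raw), skiprows=2)   # skip the empty top rows
df = df.drop(df.columns[[2]], axis=1)            # drop the unneeded column by position
df = df.rename(columns={"Country": "country", "Code": "code", "GDP": "gdp"})
df = df.dropna()                                 # remove rows with missing values
print(df)
```

For the real file you would pass its path to read_csv instead of the StringIO buffer, and list both unwanted column positions (e.g. `df.columns[[2, 5]]`) in the drop call.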
Relational Databases
Most of the data I work with resides in relational databases, such as MySQL, Microsoft SQL Server or Oracle. I'll cover three Python libraries that I've used in the past.
- pypyodbc
- PyMySQL
- cx_Oracle
Before we access the data, we need to establish a connection by specifying the server name, database name, username, and password. We then need to create a cursor object. To get all the data from a query, use the fetchall function; if you want only one row back, use fetchone instead. You can have the output of fetchall saved to pretty much any data type, including a data frame, by using pandas' DataFrame constructor. Keep in mind that this won't keep the column names, so you'll need to list them separately.
A few things to note. pypyodbc might give you problems when deploying an application in a serverless environment on AWS. You can also try pandas' read_sql function to import data from MySQL or MS SQL Server. A more advanced way of transferring data from relational databases into Python is to use an object-relational mapper (ORM) such as SQLAlchemy or PonyORM. It requires a bit more coding, but comes with a variety of advantages. Lastly, make sure that you have the appropriate drivers installed.
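For comparison, here is a sketch of read_sql, which collapses the cursor/fetchall/DataFrame steps into one call and keeps the column names. It again uses sqlite3 so the example is self-contained; in practice you would pass your MySQL or SQL Server connection, or a SQLAlchemy engine from create_engine, as the second argument.

```python
import sqlite3
import pandas as pd

# Self-contained in-memory table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gdp (country TEXT, value REAL)")
conn.execute("INSERT INTO gdp VALUES ('Aruba', 3.2)")

# read_sql runs the query and returns a DataFrame with column names intact.
df = pd.read_sql("SELECT * FROM gdp", conn)
conn.close()
print(df)
```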