Extract Transform Load with Pandas

Aleksandar Gakovic

Follow

6 min readJun 7, 2020

--

Hello everyone O/

Photo by Suzanne D. Williams on Unsplash

Recently I wrote about how to obtain data by using and calling APIs with Python.

This week I will build upon the data that I was able to access and retrieve using the RO mobile Exchange API.

I will explain how I am using Pandas step by step throughout the Extract Transform Load (ETL) process.

You will be able to:

Explain what ETL is
Perform the ETL process using Pandas

Santa Poring Bouncing Happily

But first what is ETL?!

Extract Transform Load as I understand is the process whereby some data is obtained, (extracted) cleaned, wrangled (transformed), and placed into a user-friendly data structure like a data frame (loaded).

Often you may not know that much about the data you are working with. ETL is an essential first step to gaining insight into your data. It is common to begin by understanding the features in your data and recognising outliers/missing values/duplicates.

It may be very useful with some data especially of user-level data, to perform groupby’s and aggregate functions to explore common demographic features such as location and gender, etc.

ETL can be used to explore your data in all of the above capacities so that you are ready to make progress on the Exploratory data analysis.

Let's look at some examples of ETL in action.

Extract

The data requested via the API won’t necessarily have come in a format I am used to. Let’s have a look.

Printing r.text returns the request payload

Looks like its a list of dictionaries. It is difficult to read this. We can clean this up using JSON.

Java Script Object Notation
An open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types. Wikipedia

Below is a JSON decoder which is inbuilt into the Python request library. It will allow us to observe the request payload (the returned data ) in a more visually pleasing way.
In other words, the below code is telling the JSON decoder to JSONify the data.

j = r.json()

We can see the j object remains a list when we ask its type…now

Importing pandas and JSON and Jsonifying the payload

so let’s inspect j…

Now we’re looking to extract this data into a Pandas Dataframe? but it won’t be so easy. First, we have to navigate and correctly index the nested dictionaries and lists. As we can see above it begins as a list. Let’s see the first item in the list by indexing with [0]:

Reference for indexing the data properly

Using the correct indexing we can ask pandas to create a data frame from the dictionary we want inside this list:

How does it look?

First Dataframe from requested API data. Yay.

Transfer

Will there be any missing values? how is the time column formatted?
Let’s investigate:

Making the time column a pandas date-time format:

We could also check for duplicates but its price data so there may be duplicates and that is fine.

Here is a simple pandas groupby() method with an aggregate mean function to obtain insight into the average price between servers:

A simple Groupby aggregate function that shows us the mean prices of this item by Server

Load

We have Extracted and transformed the data in some ways but now we come to the load aspect of ETL. We can load the new data and any further transformations we make into a target database or data warehouse.

I will example by loading this data into an SQL database.

Python has an inbuilt SQL module — sqlite 3.
Using the pandas DataFrame.to_sql() method will easily load your data frame into an SQL table.

First lets import sqlite3 and create a connection to a database. Doing so will automatically manifest a database. The cur object below is a way to fetch results and keep track of results from queries you make in the SQL language.

import sqlite3 conn = sqlite3.connect('romexchange.db')
cur = conn.cursor()

Now we can load the data frame we have created into an SQL table

df1.to_sql(name='abyss_flowers', if_exists='replace', con=conn)

Finally, we can perform a fetchall() to check the data exists in the database:

cur.execute("""SELECT * FROM abyss_flowers;""").fetchall()

The Data exists in the database. Looks good.

Summary:

ETL is a key component of Data Science — ETL stands for Extract Transform Load. The process involves gaining insights and cleaning the data (removing outliers, checking for duplicates, and missing values).

The data can be transformed into useful values using apply() functions, Pivot Tables(), Groupby(), etc. Lastly, the modified and transformed data can be loaded into another data warehouse or database. Pandas is a useful tool for doing this in Python and supports the process by making it faster and easier and user friendly.

Alternatives to Pandas

If you’re not interested in using Pandas for ETL, Hassan Syyid provides quite a few alternatives to Pandas for ETL in Python in his Medium article:

Pyspark — Useful for BIG DATA.
Dask — more scalability then pandas can work on clusters of data.
Modin — like Dask but more automated less configuration.
PETL— Designed with ETL in mind lightweight and efficient and so lacks data analysis features.

Sources

Using and Calling an API with Python — Aleksandar Gakovic
Data Structures — Pandas Docs
Python Data tools for ETL — Hassan Syyid
Java script object notation — Wikipedia
An Excellent ETL article for Python — Diljeet Singh Sethi
sqlite3 docs