Extract Transform Load with Pandas

Hello everyone O/

Photo by Suzanne D. Williams on Unsplash

Recently I wrote about how to obtain data by using and calling APIs with Python.

This week I will build upon the data that I was able to access and retrieve using the RO mobile Exchange API.

I will explain how I am using Pandas step by step throughout the Extract Transform Load (ETL) process.

  • Explain what ETL is
  • Perform the ETL process using Pandas
Santa Poring Bouncing Happily

Extract Transform Load as I understand is the process whereby some data is obtained, (extracted) cleaned, wrangled (transformed), and placed into a user-friendly data structure like a data frame (loaded).

Often you may not know that much about the data you are working with. ETL is an essential first step to gaining insight into your data. It is common to begin by understanding the features in your data and recognising outliers/missing values/duplicates.

It may be very useful with some data especially of user-level data, to perform groupby’s and aggregate functions to explore common demographic features such as location and gender, etc.

ETL can be used to explore your data in all of the above capacities so that you are ready to make progress on the Exploratory data analysis.

Let's look at some examples of ETL in action.

Photo by Tyler Nix on Unsplash

Extract

The data requested via the API won’t necessarily have come in a format I am used to. Let’s have a look.

Printing r.text returns the request payload

Looks like its a list of dictionaries. It is difficult to read this. We can clean this up using JSON.

Java Script Object Notation

An open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types. Wikipedia

Below is a JSON decoder which is inbuilt into the Python request library. It will allow us to observe the request payload (the returned data ) in a more visually pleasing way.
In other words, the below code is telling the JSON decoder to JSONify the data.

j = r.json()

We can see the j object remains a list when we ask its type…now

Importing pandas and JSON and Jsonifying the payload

so let’s inspect j…

Oh! Much better. Thank you JSON

Now we’re looking to extract this data into a Pandas Dataframe? but it won’t be so easy. First, we have to navigate and correctly index the nested dictionaries and lists. As we can see above it begins as a list. Let’s see the first item in the list by indexing with [0]:

Reference for indexing the data properly

Using the correct indexing we can ask pandas to create a data frame from the dictionary we want inside this list:

How does it look?

First Dataframe from requested API data. Yay.
Photo by Meagan Carsience on Unsplash

Transfer

Will there be any missing values? how is the time column formatted?
Let’s investigate:

no missing values

Making the time column a pandas date-time format:

date time formatted column

We could also check for duplicates but its price data so there may be duplicates and that is fine.

Here is a simple pandas groupby() method with an aggregate mean function to obtain insight into the average price between servers:

A simple Groupby aggregate function that shows us the mean prices of this item by Server
Photo by DDP on Unsplash

Load

We have Extracted and transformed the data in some ways but now we come to the load aspect of ETL. We can load the new data and any further transformations we make into a target database or data warehouse.

I will example by loading this data into an SQL database.

Python has an inbuilt SQL module — sqlite 3.
Using the pandas DataFrame.to_sql() method will easily load your data frame into an SQL table.

First lets import sqlite3 and create a connection to a database. Doing so will automatically manifest a database. The cur object below is a way to fetch results and keep track of results from queries you make in the SQL language.

import sqlite3 conn = sqlite3.connect('romexchange.db')
cur = conn.cursor()

Now we can load the data frame we have created into an SQL table

df1.to_sql(name='abyss_flowers', if_exists='replace', con=conn)

Finally, we can perform a fetchall() to check the data exists in the database:

cur.execute("""SELECT * FROM abyss_flowers;""").fetchall()
The Data exists in the database. Looks good.

Summary:

ETL is a key component of Data Science — ETL stands for Extract Transform Load. The process involves gaining insights and cleaning the data (removing outliers, checking for duplicates, and missing values).

The data can be transformed into useful values using apply() functions, Pivot Tables(), Groupby(), etc. Lastly, the modified and transformed data can be loaded into another data warehouse or database. Pandas is a useful tool for doing this in Python and supports the process by making it faster and easier and user friendly.

If you’re not interested in using Pandas for ETL, Hassan Syyid provides quite a few alternatives to Pandas for ETL in Python in his Medium article:

  • Pyspark — Useful for BIG DATA.
  • Dask — more scalability then pandas can work on clusters of data.
  • Modin — like Dask but more automated less configuration.
  • PETL— Designed with ETL in mind lightweight and efficient and so lacks data analysis features.
Photo by Sid Balachandran on Unsplash

Sources

  1. Using and Calling an API with Python — Aleksandar Gakovic
  2. Data Structures — Pandas Docs
  3. Python Data tools for ETL — Hassan Syyid
  4. Java script object notation — Wikipedia
  5. An Excellent ETL article for Python — Diljeet Singh Sethi
  6. sqlite3 docs

Practicing Data Scientist. Interested in Games, Gamification, Ocean Sciences, Music, Biology.