Extract Transform Load with Pandas

Hello everyone O/

Image for post
Image for post
Photo by Suzanne D. Williams on Unsplash

Recently I wrote about how to obtain data by using and calling APIs with Python.

This week I will build upon the data that I was able to access and retrieve using the RO mobile Exchange API.

I will explain how I am using Pandas step by step throughout the Extract Transform Load (ETL) process.

  • Explain what ETL is
  • Perform the ETL process using Pandas
Image for post
Image for post
Santa Poring Bouncing Happily

Extract Transform Load as I understand is the process whereby some data is obtained, (extracted) cleaned, wrangled (transformed), and placed into a user-friendly data structure like a data frame (loaded).

Often you may not know that much about the data you are working with. ETL is an essential first step to gaining insight into your data. It is common to begin by understanding the features in your data and recognising outliers/missing values/duplicates.

It may be very useful with some data especially of user-level data, to perform groupby’s and aggregate functions to explore common demographic features such as location and gender, etc.

ETL can be used to explore your data in all of the above capacities so that you are ready to make progress on the Exploratory data analysis.

Image for post
Image for post

Let's look at some examples of ETL in action.

Image for post
Image for post
Photo by Tyler Nix on Unsplash

Extract

The data requested via the API won’t necessarily have come in a format I am used to. Let’s have a look.

Image for post
Image for post
Printing r.text returns the request payload

Looks like its a list of dictionaries. It is difficult to read this. We can clean this up using JSON.

Java Script Object Notation

An open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types. Wikipedia

Below is a JSON decoder which is inbuilt into the Python request library. It will allow us to observe the request payload (the returned data ) in a more visually pleasing way.
In other words, the below code is telling the JSON decoder to JSONify the data.

j = r.json()

We can see the j object remains a list when we ask its type…now

Image for post
Image for post
Importing pandas and JSON and Jsonifying the payload

so let’s inspect j…

Image for post
Image for post
Oh! Much better. Thank you JSON

Now we’re looking to extract this data into a Pandas Dataframe? but it won’t be so easy. First, we have to navigate and correctly index the nested dictionaries and lists. As we can see above it begins as a list. Let’s see the first item in the list by indexing with [0]:

Image for post
Image for post
Reference for indexing the data properly

Using the correct indexing we can ask pandas to create a data frame from the dictionary we want inside this list:

Image for post
Image for post

How does it look?

Image for post
Image for post
First Dataframe from requested API data. Yay.
Image for post
Image for post
Image for post
Image for post
Photo by Meagan Carsience on Unsplash

Transfer

Will there be any missing values? how is the time column formatted?
Let’s investigate:

Image for post
Image for post
no missing values

Making the time column a pandas date-time format:

Image for post
Image for post
date time formatted column

We could also check for duplicates but its price data so there may be duplicates and that is fine.

Here is a simple pandas groupby() method with an aggregate mean function to obtain insight into the average price between servers:

Image for post
Image for post
A simple Groupby aggregate function that shows us the mean prices of this item by Server
Image for post
Image for post
Image for post
Image for post
Photo by DDP on Unsplash

Load

We have Extracted and transformed the data in some ways but now we come to the load aspect of ETL. We can load the new data and any further transformations we make into a target database or data warehouse.

I will example by loading this data into an SQL database.

Python has an inbuilt SQL module — sqlite 3.
Using the pandas DataFrame.to_sql() method will easily load your data frame into an SQL table.

First lets import sqlite3 and create a connection to a database. Doing so will automatically manifest a database. The cur object below is a way to fetch results and keep track of results from queries you make in the SQL language.

import sqlite3 conn = sqlite3.connect('romexchange.db')
cur = conn.cursor()

Now we can load the data frame we have created into an SQL table

df1.to_sql(name='abyss_flowers', if_exists='replace', con=conn)

Finally, we can perform a fetchall() to check the data exists in the database:

cur.execute("""SELECT * FROM abyss_flowers;""").fetchall()
Image for post
Image for post
The Data exists in the database. Looks good.
Image for post
Image for post

Summary:

ETL is a key component of Data Science — ETL stands for Extract Transform Load. The process involves gaining insights and cleaning the data (removing outliers, checking for duplicates, and missing values).

The data can be transformed into useful values using apply() functions, Pivot Tables(), Groupby(), etc. Lastly, the modified and transformed data can be loaded into another data warehouse or database. Pandas is a useful tool for doing this in Python and supports the process by making it faster and easier and user friendly.

If you’re not interested in using Pandas for ETL, Hassan Syyid provides quite a few alternatives to Pandas for ETL in Python in his Medium article:

  • Pyspark — Useful for BIG DATA.
  • Dask — more scalability then pandas can work on clusters of data.
  • Modin — like Dask but more automated less configuration.
  • PETL— Designed with ETL in mind lightweight and efficient and so lacks data analysis features.
Image for post
Image for post
Photo by Sid Balachandran on Unsplash

Sources

  1. Using and Calling an API with Python — Aleksandar Gakovic
  2. Data Structures — Pandas Docs
  3. Python Data tools for ETL — Hassan Syyid
  4. Java script object notation — Wikipedia
  5. An Excellent ETL article for Python — Diljeet Singh Sethi
  6. sqlite3 docs

Written by

Practicing Data Scientist. Interested in Games, Gamification, Ocean Sciences, Music, Biology.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store