Hello everyone o/
Recently I wrote about how to obtain data by calling APIs with Python.
This week I will build on the data I accessed and retrieved using the RO Mobile Exchange API.
I will explain how I am using Pandas step by step throughout the Extract Transform Load (ETL) process.
You will be able to:
- Explain what ETL is
- Perform the ETL process using Pandas
But first, what is ETL?
Extract Transform Load, as I understand it, is the process whereby some data is obtained (extracted), cleaned and wrangled (transformed), and placed into a user-friendly data structure like a data frame (loaded).
Often you may not know that much about the data you are working with. ETL is an essential first step to gaining insight into your data. It is common to begin by understanding the features in your data and recognising outliers/missing values/duplicates.
With some data, especially user-level data, it can be very useful to perform groupby and aggregate operations to explore common demographic features such as location and gender.
ETL can be used to explore your data in all of the above ways so that you are ready to make progress on the exploratory data analysis (EDA).
Let's look at some examples of ETL in action.
The data requested via the API won’t necessarily arrive in a format I am used to. Let’s have a look.
Looks like it’s a list of dictionaries. It is difficult to read in this form. We can clean this up using JSON.
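The original post shows the raw response here; below is a sketch of what the structure might look like. The field names and values are assumptions for illustration, not the real API response.

```python
# Illustrative stand-in for the raw payload returned by the API request.
# Field names ("name", "global", "sea") and values are assumptions.
raw = [{"name": "Abyss Flower",
        "global": [{"time": "2020-06-01T00:00", "price": 12500}],
        "sea": [{"time": "2020-06-01T00:00", "price": 11800}]}]

print(raw)
```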
JavaScript Object Notation
An open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types. (Wikipedia)
Below is the JSON decoder built into the Python requests library. It allows us to view the request payload (the returned data) in a more readable way.
In other words, the code below asks requests to decode the JSON response into Python objects:
j = r.json()
We can see that the j object is still a list when we ask for its type.
So let’s inspect j…
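A quick sketch of that inspection, using an illustrative stand-in for j (the real object comes from r.json(); the field names here are assumptions):

```python
# Stand-in for j with illustrative values and assumed field names.
j = [{"name": "Abyss Flower",
      "global": [{"time": "2020-06-01T00:00", "price": 12500},
                 {"time": "2020-06-02T00:00", "price": 12750}]}]

print(type(j))     # confirms j is still a list
print(list(j[0]))  # top-level keys of the first element
```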
Now we want to extract this data into a Pandas DataFrame, but it won’t be so easy. First, we have to navigate and correctly index the nested dictionaries and lists. As we can see above, it begins as a list. Let’s see the first item in the list by indexing:
Using the correct indexing, we can ask Pandas to create a data frame from the dictionary we want inside this list:
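A minimal sketch of that step, again using an illustrative stand-in for j (the key name "global" is an assumption):

```python
import pandas as pd

# Stand-in for the decoded payload; field names are assumptions.
j = [{"name": "Abyss Flower",
      "global": [{"time": "2020-06-01T00:00", "price": 12500},
                 {"time": "2020-06-02T00:00", "price": 12750}]}]

# Index the list first (j[0]), then pick the key whose value is a list
# of records -- pandas turns a list of dicts straight into rows.
df1 = pd.DataFrame(j[0]["global"])
print(df1)
```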
How does it look?
Will there be any missing values? How is the time column formatted?
Converting the time column to a Pandas datetime format:
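A sketch of both checks on an illustrative data frame (column names are stand-ins for the real df1):

```python
import pandas as pd

df1 = pd.DataFrame({"time": ["2020-06-01T00:00", "2020-06-02T00:00"],
                    "price": [12500, 12750]})

print(df1.isna().sum())               # count missing values per column
df1["time"] = pd.to_datetime(df1["time"])
print(df1.dtypes)                     # time is now datetime64[ns]
```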
We could also check for duplicates, but since it’s price data there may be legitimate duplicates, and that is fine.
Here is a simple Pandas groupby() with an aggregate mean to gain insight into the average price across servers:
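A minimal sketch of that groupby, on hypothetical multi-server price data (the 'server' column and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical price data across two servers.
df = pd.DataFrame({"server": ["global", "global", "sea", "sea"],
                   "price": [12500, 12750, 11800, 12000]})

# Group rows by server, then take the mean price within each group.
avg = df.groupby("server")["price"].mean()
print(avg)
```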
We have extracted and transformed the data in some ways, but now we come to the Load aspect of ETL. We can load the new data, and any further transformations we make, into a target database or data warehouse.
I will demonstrate this by loading the data into a SQLite database.
First, let’s import sqlite3 and create a connection to a database; doing so will automatically create the database if it doesn’t exist. The cur object below is used to execute queries in SQL and fetch their results.
import sqlite3
conn = sqlite3.connect('romexchange.db')
cur = conn.cursor()
Now we can load the data frame we have created into an SQL table:
df1.to_sql(name='abyss_flowers', if_exists='replace', con=conn)
Finally, we can run fetchall() to check that the data exists in the database:
cur.execute("""SELECT * FROM abyss_flowers;""").fetchall()
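Putting the Load step together, here is a self-contained sketch using an in-memory database so it runs anywhere; the data frame contents are illustrative stand-ins for the real df1:

```python
import sqlite3

import pandas as pd

# Illustrative stand-in for the transformed data frame.
df1 = pd.DataFrame({"time": ["2020-06-01", "2020-06-02"],
                    "price": [12500, 12750]})

# An in-memory database for illustration; use a file path like
# 'romexchange.db' to persist the data.
conn = sqlite3.connect(":memory:")
df1.to_sql(name="abyss_flowers", if_exists="replace", con=conn)

# Read the rows back to confirm the load worked.
cur = conn.cursor()
rows = cur.execute("SELECT * FROM abyss_flowers;").fetchall()
print(rows)
conn.close()
```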
ETL (Extract, Transform, Load) is a key component of data science. The process involves gaining insight into the data and cleaning it (removing outliers, checking for duplicates and missing values).
The data can be transformed into useful values using apply(), pivot_table(), groupby(), and so on. Lastly, the modified and transformed data can be loaded into a target database or data warehouse. Pandas is a useful tool for doing this in Python, and supports the process by making it faster, easier, and more user friendly.
Alternatives to Pandas
If you’re not interested in using Pandas for ETL, Hassan Syyid provides quite a few alternatives to Pandas for ETL in Python in his Medium article:
- PySpark — useful for big data.
- Dask — more scalable than Pandas; can work on clusters.
- Modin — like Dask but more automated, with less configuration.
- PETL — designed with ETL in mind; lightweight and efficient, so it lacks data analysis features.