Creating a Complete Mini Data Science Environment with Docker

Aleksandar Gakovic
6 min read · Jul 12, 2020

Good day everyone,

Earlier I wrote an article detailing the use of Docker as a Data Scientist. What I didn’t explain in that article was how I made a complete mini Data Science environment.

Since then I’ve been thinking: it was so easy, why not share how to do it?


Intro to the Dockerfile

The slim DS environment lives here on Docker Hub, which, by the way, is easy to get going with and is a great registry of Docker images. More on that in my last article.

In order to work with Dockerfiles and open containers, we must install Docker Desktop.

Let’s take a deeper look into how I made this Dockerfile and what needs to go into a complete mini Data Science environment.

I start by navigating to a folder that I can work from.

I’m going to build our little environment on top of an existing slim Python image, which is itself built on a Debian Linux image.

First things first, we will create the Dockerfile, which will manifest the environment when you run the image. Barring the original’s choice of colour and contrast, that file looks like this:
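(Reassembled here in plain text from the breakdown that follows; the CMD line is written in exec form, spelled out in the CMD section below.)

FROM python:3.7.3-slim

LABEL maintainer="your name here"
LABEL version="0.1"
LABEL description="short description of what this image is about"

WORKDIR /data

COPY . /data

RUN pip install jupyter numpy matplotlib seaborn pandas

EXPOSE 8888

CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]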

This is the complete mini-environment, and it can be uploaded to and pulled from any Docker image registry. Using this environment, you will be able to do actual data science in a Jupyter notebook using Pandas, Matplotlib, NumPy, and more.

Let’s break it down

First, to create the empty Dockerfile, use the following command in your terminal:

nano Dockerfile

This will create a new file named Dockerfile and open it with Nano. You can do this with Vim, Visual Studio, or any code editor.

At the top, we write FROM and give an existing image name. As mentioned, we are using the slim Python image as the base of the image we are creating. When our image is built and run, the slim Python image will be pulled on the fly and form the base layer of the resulting container.

FROM python:3.7.3-slim

After setting the base image, we add some LABEL instructions to make it clear who maintains this file, which version it is, and so on.

LABEL maintainer="your name here"
LABEL version="0.1"
LABEL description="short description of what this image is about"

Note how Dockerfile instructions are capitalised (FROM and LABEL) by convention, much like keywords when writing SQL.

You should know enough to be able to write your own description if you are coding along.

Key Dockerfile Instructions

Great start.

The next instruction I give the Dockerfile is WORKDIR.

WORKDIR sets the, you guessed it, working directory of a Docker container. It applies to any RUN, CMD, ADD, COPY, or ENTRYPOINT instruction that follows it.

We set this to:

WORKDIR /data

What does this mean for us?

Looking back at our complete Dockerfile, we can see that our WORKDIR is set to /data, so the COPY and RUN instructions that follow operate relative to it.

Next comes the COPY instruction:

COPY . /data

The above instruction copies everything from the current directory on the host into the /data directory in the container. If no /data directory exists, one will be created. Below we create a data directory, so just follow along for now and it will connect later.

We can use a RUN command to get the necessary libraries installed quickly, and multiple installs can be chained together in one RUN command to save space and be more readable.

RUN pip install jupyter numpy matplotlib seaborn pandas

pip comes preinstalled in the slim Python image we set as our base at the top of the Dockerfile.

The EXPOSE instruction is really just documentation for people viewing the Dockerfile, telling them which port the container intends to serve Jupyter Notebook on; it does not publish the port by itself.

EXPOSE 8888

Finally, the CMD instruction (short for command) tells Docker what the container should run when it starts up. The full CMD instruction is below.

It tells the container to run Jupyter Notebook with IP 0.0.0.0, on port 8888, with no browser, and with root allowed.
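Spelled out in the exec form (one reasonable way to write it), that instruction reads:

CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]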


Adding Data and Building the Image

In the directory we’ve been working from, let’s create a data folder with ‘mkdir’ and add some data. This can be any CSV file, Excel file, what have you.

I also added a Jupyter Notebook .ipynb file in the data directory. However, on reflection, it makes more sense to add the .ipynb file to the directory with the Dockerfile instead. This way, when Jupyter runs on container startup, it will show the .ipynb file and the data folder.

mkdir data
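With the notebook moved next to the Dockerfile as suggested above, the project folder might look something like this (the notebook and CSV names here are just placeholders):

mini-ds-env/
├── Dockerfile
├── notebook.ipynb
└── data/
    └── dataset.csv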

There we go. We now have a Dockerfile ready to build an image with.

To build the image, make sure:

  • Docker Desktop is installed and running
  • You are in the directory with the Dockerfile and data folder

If you’ve followed along correctly, go ahead and run the following command:

docker build -t mini-ds-env .

This will take some time (a few minutes, perhaps). Docker will use the Dockerfile we’ve created in the current working directory and build an image from it, giving it the tag ‘mini-ds-env’. Congratulations.

To check that your image exists on your machine, in your command line/terminal write:

docker image ls

Look for the image with the tag you gave it. Try spinning up a container from the image with:

docker container run -p 8888:8888 mini-ds-env

Adding the -p (publish) flag maps a host port to a container port (host port:container port), which is how you reach the Jupyter server running inside the container. Of course, we specify the image tag last.
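The mapping need not be one-to-one. As a quick sketch, to serve the container’s port 8888 on host port 9999 instead (a hypothetical remapping), you could run the variant below; the notebook would then be reachable at localhost:9999.

docker container run -p 9999:8888 mini-ds-env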

Let's get this up on Docker Hub so you can share your new mini-environment with other Data Scientists.

Uploading to Docker Hub

First, let’s tag the new image with your Docker account name and the :latest tag.

Let’s grab the image ID and copy it:

docker image ls
Copy the IMAGE ID from the new image we just built.
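Its output looks something like this (the values are illustrative, reusing the IMAGE ID from my own build shown further down; yours will differ):

REPOSITORY    TAG      IMAGE ID       CREATED         SIZE
mini-ds-env   latest   7e809eb61b60   2 minutes ago   …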

With that IMAGE ID copied, we now input the following command in the command line:

docker tag <your IMAGEID> <your username>/mini-ds-env:latest

For comparison, mine looks like this:

docker tag 7e809eb61b60 algakovic/mini-ds-env:latest

Now we can upload this image, with its tag, user, and version data, to Docker Hub.

First, we log in to Docker Hub. In the command line, write:

docker login

Enter your credentials when prompted.

Finally, push the image to Docker Hub with:

docker push <your username>/mini-ds-env:latest

Log in to Docker Hub, check your repositories, and share on from there!

Summary

The environment and its contents will work on any computer they are pulled to, since they’re all packaged up nicely in a complete mini Data Science environment. That’s what Docker works so hard to do, and it does it brilliantly. We’ve built a mini DS environment from the ground up! All the hard work goes to Docker and the Docker daemon, of course, but we can safely say we’re standing on the shoulders of giants.

Uploading your environment to Docker Hub ensures that you can share and collaborate with no environment-dependency troubles.
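For a collaborator, getting the whole environment is then just a pull and a run away (shown here with my image name; substitute your own):

docker pull algakovic/mini-ds-env:latest
docker container run -p 8888:8888 algakovic/mini-ds-env:latest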

If you’ve read my previous article on using Docker, well then, congratulations are in order. You now know Docker!


Sources

  1. Docker Desktop
  2. Docker Hub
  3. Docker Docs
  4. Using Docker as a Data Scientist — recommended quick guide

