Creating a Complete Mini Data Science Environment with Docker
Good day everyone,
Earlier I wrote an article detailing how to use Docker as a Data Scientist. What I didn’t explain in that article was how I made a complete mini Data Science environment.
Since then I’ve been thinking it was so easy, why not share how to do it?
Intro to the Dockerfile
The slim DS environment lives here at DockerHub, which, by the way, is easy to get going with and a great registry of Docker images. More on that in my last article.
In order to work with Dockerfiles and open containers, we must install Docker Desktop.
Let’s take a deeper look into how I made this Docker file and what needs to go into a complete mini-Data Science Environment.
I start by navigating to a folder that I can work from.
I’m going to build our little environment on top of an existing slim Python image, which is itself built on a Debian Linux image.
First things first, we will create the Dockerfile, which defines the environment that is created when you run the image. That file looks like this:
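Sketched out from the steps we will walk through below (the exact slim Python tag and the placeholder label values are assumptions, so substitute your own):

```dockerfile
# Slim Python base image, itself built on Debian (exact tag is illustrative)
FROM python:3.8-slim

LABEL maintainer="your name here"
LABEL description="A complete mini Data Science environment"

# Work out of /data and copy the build context into it
WORKDIR /data
COPY . /data

# Install the data science stack in one chained RUN instruction
RUN pip install jupyter numpy matplotlib seaborn pandas

# Document the port Jupyter will be served on
EXPOSE 8888

# Start Jupyter notebook when a container starts
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```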
This is the complete mini-environment that can be uploaded to and pulled from any Docker image registry. Using this environment you will be able to do actual data science in a Jupyter notebook using Pandas, Matplotlib, Numpy, and more.
Let’s break it down
First, to create the empty Dockerfile, use the following command in your terminal:
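A minimal sketch, using touch to create the file before editing it:

```shell
# Create the empty Dockerfile
touch Dockerfile
# Then open it for editing (interactive, so run this yourself):
# nano Dockerfile
```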
This creates a new file named Dockerfile, which you then open with Nano. You can also do this with Vim, Visual Studio, or any code editor.
At the top, we write FROM followed by an existing image name; as mentioned, we are using the slim Python image. When the image we are creating here is built and run, the slim Python base image will be pulled on the go and form the foundation of the resulting container.
After setting the base image, we add some LABEL instructions to make it clear who maintains this file, which version it is, etc.
LABEL maintainer="your name here"
LABEL description="short description of what this image is about"
Note how Dockerfile instructions are capitalised (FROM and LABEL) by convention, much like keywords when writing SQL.
You should know enough to be able to write your own description if you are coding along.
Key Dockerfile Instructions
The next instruction I give the Dockerfile is WORKDIR.
WORKDIR sets the, you guessed it, working directory of a Docker container. Any RUN, CMD, ADD, COPY, or ENTRYPOINT instruction that follows is executed in that directory.
We set this to /data.
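As it appears in the Dockerfile:

```dockerfile
WORKDIR /data
```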
What does this mean for us?
Looking back on our complete Dockerfile:
We can see that our WORKDIR is set to /data, and the COPY command that follows is executed relative to it:
COPY . /data
The above command gives instructions to copy everything from the current directory on the host into the /data directory in the container. If no /data directory exists, one will be created. Below we create a data directory of our own, so just follow along for now and it will connect later.
We can execute a RUN command to get the necessary libraries installed quickly, and the installs can be chained together in one RUN command to save space and be more readable.
RUN pip install jupyter numpy matplotlib seaborn pandas
Pip is already present in the slim Python image that we set as the base at the top of the Dockerfile.
The EXPOSE instruction is essentially documentation for people viewing the Dockerfile, telling them which port the image intends to serve the Jupyter notebook on. It does not publish the port by itself; that happens at run time with the -p flag.
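In our file, that is simply:

```dockerfile
EXPOSE 8888
```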
Finally, the CMD instruction (short for command) tells the container what to do when it starts up. The full CMD instruction is below.
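In the exec (JSON) form, one common way to write it is:

```dockerfile
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```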
It tells the container to run Jupyter notebook with IP 0.0.0.0, on port 8888, with no browser, and to allow root.
Adding Data and Building the Image
In the directory we’ve been working from, let’s create a data folder with ‘mkdir’ and add some data. This can be any CSV file, Excel file, what-have-you.
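A quick sketch in the terminal (the dataset path is just a placeholder):

```shell
# Create the data folder next to the Dockerfile
mkdir -p data
# Drop any dataset into it, e.g.:
# cp ~/Downloads/some_dataset.csv data/
```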
I also added a Jupyter Notebook.ipynb file in the data directory. However, in reflection, it makes more sense to add the .ipynb file to the directory with the Dockerfile instead. This way when Jupyter runs on container startup, it will show the .ipynb file and the data folder.
There we go. We now have a Dockerfile ready to build an Image with.
To build the image, make sure:
- Docker Desktop is installed and running
- You are in the directory with the Dockerfile and Data folder
If you have followed everything correctly, go ahead and run the following command:
docker build -t mini-ds-env .
This will take some time (a few minutes perhaps). It will use the Dockerfile we’ve created in the current working directory and build an image from it giving it the tag ‘mini-ds-env’. Congratulations.
To check that your image exists on your machine, in your command line/terminal write:
docker image ls
Look for the image with the same tag you gave. Try opening a container of the image with:
docker container run -p 8888:8888 mini-ds-env
Adding the -p (port) flag maps a host port to a container port (host port:container port), so you can reach the Jupyter notebook at localhost:8888. Of course, we specify the image tag last.
Let's get this up on Docker Hub so you can share your new mini-environment with other Data Scientists.
Uploading to DockerHub
First, let’s tag the new image with your Docker account name and the :latest tag.
Let’s grab the image ID and copy it:
docker image ls
With that IMAGE ID copied, we now input the following command in the command line:
docker tag <your IMAGEID> <your username>/mini-ds-env:latest
for comparison mine looks like this:
docker tag 7e809eb61b60 algakovic/mini-ds-env:latest
Now we can upload this image with the tag, user, version data to DockerHub:
First, we log in to DockerHub. In the command line, write:
docker login
Enter your credentials when prompted
Finally, push the image to DockerHub with:
docker push <your username>/mini-ds-env:latest
Login to DockerHub, check your repositories, and share on from there!
The environment and its contents will work on any computer they are pulled to, since they’re all packaged up nicely in a complete mini Data Science environment. That’s what Docker works so hard to do, and it does it well. We’ve built a mini-ds-environment from the ground up! All the hard work goes to Docker and the Docker daemon, of course, but we can safely say we are… standing on the shoulders of giants?
Uploading your environment to DockerHub ensures that you can share and collaborate with no environment dependency troubles.
If you’ve read my previous article on using Docker, well then congratulations are in order. You now know Docker!