---
title: Lighten your Python image with Docker multi-stage builds
tags:
  - Docker
  - multi-stage builds
  - poetry
  - python
date: 2021-01-02T10:37:29.021773+00:00
---

In previous posts we talked about poetry and Docker image layers, and I promised I would write about Docker multi-stage builds, so here we go!

!!! info "Note"
    I will explain the basics of Docker multi-stage builds required to understand this post, but I won't repeat the documentation (see further reading).

## ⚙️ Multi-stage builds

Basically, a multi-stage build allows you to use multiple images sequentially in one Dockerfile and pass data between them.

This is especially useful for projects in statically compiled languages such as Go, in which the output is a completely standalone binary: you can use an image containing the Go toolchain to build your project and copy your binary to a barebones image to distribute it.

```go
package main

import "fmt"

func main() {
	fmt.Println("Hello Gab!")
}
```
```dockerfile
# Dockerfile
FROM golang:alpine AS builder
RUN mkdir /build
ADD . /build/
WORKDIR /build
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -ldflags '-extldflags "-static"' -o main .

FROM scratch
COPY --from=builder /build/main /app/
WORKDIR /app
CMD ["./main"]
```

This example[^1] produces a working Docker image containing only the binary built from the project. It also perfectly illustrates the basics of multi-stage builds.

Notice the second FROM instruction? It tells Docker to start again from a new image, just like at the beginning of a build, except that it still has access to the final contents of all the previous stages.

Then, the COPY --from instruction is used to retrieve the built binary from the first stage.

In this extreme case, the final image weighs nothing more than the binary itself since scratch is a special empty image with no operating system.
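
If you want to see this for yourself, a quick way is to build the example and look at the size Docker reports; the `hello-gab` tag below is just an arbitrary name chosen for this sketch:

```bash
# Build the multi-stage Dockerfile above; only the final stage
# ends up in the tagged image.
docker build -t hello-gab .

# The SIZE column should show little more than the Go binary itself.
docker image ls hello-gab
```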

## 🐍 Applying to Python & Poetry

### Install the dependencies

Let's start with a basic single-stage Dockerfile that just installs this blog's dependencies and runs the project.[^2]

```dockerfile
# Dockerfile
## Build venv
FROM python:3.8.6-buster

# Install poetry, see https://python-poetry.org/docs/#installation
ENV POETRY_VERSION=1.1.4
RUN curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
ENV PATH /root/.poetry/bin:$PATH

# Install dependencies
WORKDIR /app
RUN python -m venv /app/venv
COPY pyproject.toml poetry.lock ./
RUN . /app/venv/bin/activate && poetry install
ENV PATH /app/venv/bin:$PATH

# Add code
COPY . ./

HEALTHCHECK --start-period=30s CMD python -c "import requests; requests.get('http://localhost:8000', timeout=2)"

CMD ["gunicorn", "blog.wsgi", "-b 0.0.0.0:8000", "--log-file", "-"]
```

It's already not that bad! We are taking advantage of the cache by copying only the files that describe our dependencies before installing them, and the Dockerfile is easy to read.
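
As a quick illustration of that caching behaviour, rebuilding after touching only application code should reuse the expensive dependency layer. A sketch, assuming a source file such as blog/wsgi.py exists (as suggested by the gunicorn command above):

```bash
# First build: installs all the dependencies.
docker build -t blog .

# Change only application code, not pyproject.toml or poetry.lock.
touch blog/wsgi.py  # example path, assumed for illustration

# Rebuild: Docker should report "Using cache" for the dependency
# COPY and the poetry install layers.
docker build -t blog .
```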

Now, our final image's attack surface could still be reduced: we're using a full Debian Buster image with all the build tools included, and poetry is installed in the image even though we don't need it at runtime.

We'll add another stage to this build. In the first stage, we will install poetry and the project's dependencies; in the second, we will copy over the virtual environment and our source code.

### Multi-staged dependencies & code

```dockerfile
# Dockerfile
## Build venv
FROM python:3.8.6-buster AS venv

ENV POETRY_VERSION=1.1.4
RUN curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
ENV PATH /root/.poetry/bin:$PATH

WORKDIR /app
COPY pyproject.toml poetry.lock ./

# The `--copies` option tells `venv` to copy libs and binaries
# instead of using links (which could break since we will
# extract the virtualenv from this image)
RUN python -m venv --copies /app/venv
RUN . /app/venv/bin/activate && poetry install


## Beginning of runtime image
# Remember to use the same python version
# and the same base distro as the venv image
FROM python:3.8.6-slim-buster AS prod

COPY --from=venv /app/venv /app/venv/
ENV PATH /app/venv/bin:$PATH

WORKDIR /app
COPY . ./

HEALTHCHECK --start-period=30s CMD python -c "import requests; requests.get('http://localhost:8000', timeout=2)"

CMD ["gunicorn", "blog.wsgi", "-b 0.0.0.0:8000", "--log-file", "-"]
```

See? We didn't have to change much, but our final image is already much slimmer!

Without accounting for what we install or add inside, the base python:3.8.6-buster image weighs 882MB versus 113MB for the slim version. Of course, this comes at the expense of many tools such as build toolchains,[^3] but you probably don't need them in your production image.[^4]
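
If you want to check those numbers yourself, something like this should do (the exact sizes may vary slightly depending on when the tags were last rebuilt):

```bash
docker pull python:3.8.6-buster
docker pull python:3.8.6-slim-buster

# Lists the local images of the "python" repository with their sizes.
docker image ls python
```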

Your ops team should be happier with these lighter images: less attack surface, less code that can break, shorter transfer times, less disk space used... And our Dockerfile is still readable, so it should be easy to maintain.

### Final form

For this blog, I use a slightly modified version of what we just saw:

```dockerfile
# Dockerfile
## Build venv
FROM python:3.8.6-buster AS venv

ENV POETRY_VERSION=1.1.4
RUN curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
ENV PATH /root/.poetry/bin:$PATH

WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN python -m venv --copies /app/venv

# Allows me to tweak the dependency installation.
# See below.
ARG POETRY_OPTIONS
RUN . /app/venv/bin/activate \
    && poetry install $POETRY_OPTIONS


## Get git version
FROM alpine/git:v2.26.2 AS git
ADD . /app
WORKDIR /app
# I use this file to provide the git commit
# in the footer without having git present
# in my production image
RUN git rev-parse HEAD | tee /version


## Beginning of runtime image
FROM python:3.8.6-slim-buster AS prod

RUN echo "Europe/Paris" > /etc/timezone \
    && mkdir /db

COPY --from=venv /app/venv /app/venv/
ENV PATH /app/venv/bin:$PATH

WORKDIR /app
COPY manage.py LICENSE pyproject.toml ./
COPY docker ./docker/
COPY blog ./blog/
# These are the two folders that change the most.
COPY attachments ./attachments/
COPY articles ./articles/
COPY --from=git /version /app/.version

ENV SECRET_KEY "changeme"
ENV DEBUG "false"
ENV HOST ""
ENV DB_BASE_DIR "/db"

HEALTHCHECK --start-period=30s CMD python -c "import requests; requests.get('http://localhost:8000', timeout=2)"

CMD ["/app/docker/run.sh"]
```

There aren't many differences between this one and the previous one, except for an added stage that retrieves the git commit hash and some tweaking when copying the code.
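
As a quick sanity check of that git stage, you can inspect the version file baked into the final image; a small sketch, assuming the image is tagged blog as in the build commands below:

```bash
# Prints the commit hash recorded by the git stage at build time,
# without git being installed in the production image.
docker run --rm blog cat /app/.version
```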

There is also the addition of the POETRY_OPTIONS build argument. It allows me to build the same Dockerfile with two different outputs: one with the development dependencies like pytest or pre-commit, and the other without.

I use it like this:

```bash
# with pytest
docker build --pull --build-arg POETRY_OPTIONS="" -t blog-test .
# without pytest
docker build --pull --build-arg POETRY_OPTIONS="--no-dev" -t blog .
```

Again, this is in the spirit of minimizing the production image.
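
One way to double-check the difference between the two outputs is to look for a dev-only package in each image; a rough sketch, assuming the blog-test and blog tags from the commands above:

```bash
# The test image should contain pytest...
docker run --rm blog-test pip show pytest

# ...while the production image should not (pip exits with an error here).
docker run --rm blog pip show pytest || echo "pytest is not installed, as expected"
```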

## 🗒 Closing thoughts

Docker multi-stage builds helped me reduce my image sizes and attack surface - sometimes by a lot - without compromising on features.

I hope that you enjoyed reading this article and that you found it interesting or helpful! Please feel free to contact me if you want to comment on the subject.

In a future post, I'll talk about reducing Docker images build time in a CI environment where the filesystem isn't guaranteed to stay between runs.

## 📚 Further reading


[^1]: Thanks to Cloudreach for the example.

[^2]: The source code is available on Gitea.

[^3]: You often need these tools to install some Python dependencies that require compiling. That's why I don't use the slim version to install my dependencies.

[^4]: Except of course if your goal is to compile stuff on the go or to provide a platform for people to build their code.