In this post, we'll walk through Docker image layers and the caching around them from the point of view of a Docker user. I'll assume you're already familiar with Dockerfiles and Docker concepts in general.
![Docker logo](32.png)
## ✌️ The two axioms of Docker layers
There are two key concepts to understand, from which everything else is deduced. Let's call them our axioms.
Axiom 1
: Every instruction in a Dockerfile results in a layer[^1]. Each layer is stacked onto the previous one and depends upon it.
Axiom 2
: Layers are cached and this cache is invalidated whenever the layer or its parent changes. The cache is reused on subsequent builds.
[^1]: Well, that's not true anymore, see [Best practices for writing Dockerfiles: Minimize the number of layers (Docker docs)](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#minimize-the-number-of-layers). But since it's easier to understand this way, I'm willing to make this compromise for this article.
So, what happens when we build a small Docker image?
1. Docker first downloads our base image, since it isn't in the local image store yet.
2. It creates the `/app` directory. Subsequent commands will run inside this directory.
3. It copies the file from our local directory to the image.
4. It stores the MD5 hash of our file inside a file named `somefile.md5`.
Now if we try to build the image again, without changing anything, here's what happens:
```text
$ docker build -t gabnotes-example .
Sending build context to Docker daemon 3.072kB
Step 1/4 : FROM ubuntu
---> f643c72bc252
Step 2/4 : WORKDIR /app
---> Using cache
---> 8637829f8e9b
Step 3/4 : COPY somefile ./
---> Using cache
---> 5edc5d0aab9d
Step 4/4 : RUN md5sum somefile > somefile.md5
---> Using cache
---> c2d34241963a
Successfully built c2d34241963a
Successfully tagged gabnotes-example:latest
```
For every step, Docker says it's "using cache." Remember our axioms? Well, each step of our first build generated a layer that was cached locally and then reused for our second build.
## 🔄 Cache invalidation
We can get some information about the layers of our image using `docker history`:
```text
<missing> 4 weeks ago /bin/sh -c set -xe && echo '#!/bin/sh' > /… 811B
<missing> 4 weeks ago /bin/sh -c #(nop) ADD file:4f15c4475fbafb3fe… 72.9MB
```
This output should be read as a stack: the first layer is at the bottom and the last layer of the image is at the top. This illustrates the dependencies between layers: if a "foundation" layer changes, Docker has to rebuild it and all the layers that were built upon it.
It's natural: your layers 2 and 3 may depend on the output of layer 1, so they should be rebuilt when layer 1 changes.
In our example:
```Dockerfile
# Dockerfile
FROM ubuntu
WORKDIR /app
COPY somefile ./
RUN md5sum somefile > somefile.md5
```
* the `COPY` instruction depends on the previous layer because if the working directory were to change, we would need to change the location of the file.
* the `RUN` instruction must be replayed if the file changes or if the working directory changes because then the output file would be placed elsewhere. It also depends on the presence of the `md5sum` command, which exists in the `ubuntu` image but might not exist in another one.
So if we change the content of `somefile`, the `COPY` will be replayed as well as the `RUN`. If after that we change the `WORKDIR`, it will be replayed as well as the other two.[^docs]
[^docs]: Read more about how Docker detects when the cache should be invalidated: [Leverage build cache](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache)
Let's try this:
```text
$ echo "good bye world" > somefile
$ docker build -t gabnotes-example .
Sending build context to Docker daemon 3.072kB
Step 1/4 : FROM ubuntu
---> f643c72bc252
Step 2/4 : WORKDIR /app
---> Using cache
---> 8637829f8e9b
Step 3/4 : COPY somefile ./
---> ba3ed4869a32
Step 4/4 : RUN md5sum somefile > somefile.md5
---> Running in c66d26f47038
Removing intermediate container c66d26f47038
---> c10782060ad4
Successfully built c10782060ad4
Successfully tagged gabnotes-example:latest
```
See, Docker detected that our file had changed, so it ran the copy again as well as the `md5sum` but used the `WORKDIR` from the cache.
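To make this chain reaction more concrete, here's a toy model in Python. It's only a sketch of the idea, not Docker's actual implementation: the `layer_keys`, `cache_hits` and `build` helpers are hypothetical, and the initial content of `somefile` (`hello world`) is an assumption. Each step's cache key is derived from its parent's key, the instruction text and the content of any copied file, so a change anywhere ripples down to every later step.

```python
import hashlib


def layer_keys(steps):
    """Return one cache key per step; each key chains in the previous key.

    `steps` is a list of (instruction, content) pairs, where `content`
    stands in for the bytes of any file the instruction copies.
    """
    keys = []
    parent = ""
    for instruction, content in steps:
        digest = hashlib.sha256()
        digest.update(parent.encode())       # depends on the parent layer
        digest.update(b"\n" + instruction.encode())  # and on the instruction
        digest.update(b"\n" + content)       # and on the copied file, if any
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


def cache_hits(old_keys, new_keys):
    """True where a step of the new build could reuse the old layer."""
    return [old == new for old, new in zip(old_keys, new_keys)]


def build(somefile, workdir="/app"):
    """Toy 'build' of the four-step Dockerfile from this post."""
    return layer_keys([
        ("FROM ubuntu", b""),
        (f"WORKDIR {workdir}", b""),
        ("COPY somefile ./", somefile),
        ("RUN md5sum somefile > somefile.md5", b""),
    ])


first = build(b"hello world\n")            # assumed original content
second = build(b"good bye world\n")        # we edited somefile
third = build(b"good bye world\n", workdir="/srv")  # then changed WORKDIR

print(cache_hits(first, second))  # [True, True, False, False]
print(cache_hits(second, third))  # [True, False, False, False]
```

The first list matches the build output above: `FROM` and `WORKDIR` come from the cache, while `COPY` and `RUN` are replayed. The second list shows what would happen if we then changed the working directory.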
This mechanism is especially useful for builds that take time, like installing your app's dependencies.
If we add all of our files in one command, then whenever we modify our source code, Docker has to invalidate all the subsequent layers, including the dependency installation.
To speed up our local builds, we may want to skip the dependency installation when the dependencies haven't changed. It's quite easy: add the `requirements.txt` first, install the dependencies and then add our source code.
```Dockerfile
# Dockerfile
FROM python:3.8.6-buster
WORKDIR /app
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY main.py ./
CMD ["python", "main.py"]
```
After a first successful build, changing the source code will not trigger the dependency installation again. Dependencies will only be re-installed if:
1. You pull a newer version of `python:3.8.6-buster`
2. The `requirements.txt` file is modified
3. You change any instruction in the Dockerfile from the `FROM` to the `RUN pip install` (inclusive). For example, if you change the working directory, decide to copy another file along with the requirements, or change the base image.
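Here's a rough Python sketch of that reasoning, comparing two orderings when only `main.py` changes. The `steps_to_rebuild` helper is hypothetical and much simpler than Docker's real cache check, and the `COPY . ./` variant is my assumption of what an "add everything in one command" Dockerfile would look like.

```python
def steps_to_rebuild(steps, changed_files):
    """Return the instructions that cannot come from the cache.

    `steps` is an ordered list of (instruction, files_read) pairs, where
    `files_read` is the set of build-context files the instruction copies.
    The first step that reads a changed file misses the cache, and so does
    every step after it.
    """
    for index, (_, files_read) in enumerate(steps):
        if files_read & changed_files:
            return [instruction for instruction, _ in steps[index:]]
    return []


PIP_INSTALL = "RUN pip install -r requirements.txt"

# Assumed "everything at once" ordering (hypothetical).
all_at_once = [
    ("FROM python:3.8.6-buster", set()),
    ("WORKDIR /app", set()),
    ("COPY . ./", {"requirements.txt", "main.py"}),
    (PIP_INSTALL, set()),
    ('CMD ["python", "main.py"]', set()),
]

# Ordering from the Dockerfile above: requirements first, source code last.
requirements_first = [
    ("FROM python:3.8.6-buster", set()),
    ("WORKDIR /app", set()),
    ("COPY requirements.txt ./", {"requirements.txt"}),
    (PIP_INSTALL, set()),
    ("COPY main.py ./", {"main.py"}),
    ('CMD ["python", "main.py"]', set()),
]

changed = {"main.py"}
print(PIP_INSTALL in steps_to_rebuild(all_at_once, changed))         # True
print(PIP_INSTALL in steps_to_rebuild(requirements_first, changed))  # False
```

In this model, editing `main.py` only replays the last two steps of the second ordering, while editing `requirements.txt` would still replay the installation in both.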
## ⏬ Reduce your final image size
Now you may also want to keep your images small. Since an image's size is the sum of the sizes of its layers, if you create some files in a layer and delete them in a subsequent layer, these files will still count toward the total image size, even if they are not present in the final filesystem.
Let's consider a last example:
```Dockerfile
# Dockerfile
FROM ubuntu
WORKDIR /app
RUN fallocate -l 100M example
RUN md5sum example > example.md5
RUN rm example
```
Pop quiz! Given the following:
* The ubuntu image I'm using weighs 73MB
* The file created by `fallocate` is actually 104857600 bytes, or about 105MB
* The md5 sum file size is negligible
What will be the final size of the image?
1. 73MB
2. 105MB
3. 178MB
4. zzZZZzz... Sorry, you were saying?
Well, I'd like the answer to be 73MB, but instead the image will weigh the full 178MB. Because we created the big file in its own layer, it counts toward the total image size even though it's deleted afterwards.
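If you want to check the arithmetic, here's a small Python sketch under simplifying assumptions: each layer is modelled as a dictionary of the files it adds, a deletion is just a marker that adds no bytes, and the 42-byte size for `example.md5` is an assumption standing in for "negligible". The `final_filesystem` and `image_size` helpers are hypothetical, not anything Docker exposes.

```python
MB = 1_000_000
DELETED = None  # stands in for the deletion marker a layer records


def final_filesystem(layers):
    """Apply the layers in order and return the files visible at the end."""
    files = {}
    for layer in layers:
        for path, size in layer.items():
            if size is DELETED:
                files.pop(path, None)
            else:
                files[path] = size
    return files


def image_size(layers):
    """Sum the bytes every layer adds; deletions never subtract anything."""
    return sum(
        size
        for layer in layers
        for size in layer.values()
        if size is not DELETED
    )


layers = [
    {"<ubuntu base files>": 73 * MB},  # FROM ubuntu
    {},                                # WORKDIR /app (treated as empty)
    {"/app/example": 104_857_600},     # RUN fallocate -l 100M example
    {"/app/example.md5": 42},          # RUN md5sum example > example.md5
    {"/app/example": DELETED},         # RUN rm example
]

visible = final_filesystem(layers)
print("/app/example" in visible)          # False
print(round(sum(visible.values()) / MB))  # 73
print(round(image_size(layers) / MB))     # 178
```

The big file is gone from the final filesystem, yet the layer that created it still weighs in the total.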
What we could have done instead is combine the three `RUN` instructions into one, like so:
```Dockerfile
# Dockerfile
FROM ubuntu
WORKDIR /app
RUN fallocate -l 100M example \
&& md5sum example > example.md5 \
&& rm example
```
This Dockerfile produces a final image whose filesystem looks exactly the same as the previous one's, but without the extra 105MB. Of course, the downside is that the big file is recreated every time this layer is invalidated, which could be annoying if creating it is a costly operation.
This pattern is often used in official base images, which try to be small whenever they can. For example, consider this snippet from the [`python:3.8.7-buster`](https://github.com/docker-library/python/blob/756285c50c055d06052dd5b6ac34ea965b499c15/3.8/buster/Dockerfile#L28,L37) image (MIT License):
```Dockerfile
&& tar -xJC /usr/src/python --strip-components=1 -f python.tar.xz \
&& rm python.tar.xz
```
See how `python.tar.xz` is downloaded and then deleted all in the same step? That's to prevent it from weighing in the final image. It's quite useful! But don't overuse it or your Dockerfiles might become unreadable.
## 🗒 Key takeaways
* Every instruction in a Dockerfile results in a layer[^1]. Each layer is stacked onto the previous one and depends upon it.
* Layers are cached and this cache is invalidated whenever the layer or its parent changes. The cache is reused on subsequent builds.
* Use `docker history` to know more about your image's layers.
* Reduce your build duration by adding only the files you need when you need them. Push files that might change a lot to the bottom of your Dockerfile (dependency installation example).
* Reduce your image size by combining multiple `RUN` instructions into one if you create files and delete them shortly after (big file deletion example).
Well that wraps it up for today! It was quite technical but I hope you learned something along the way 🙂