Build Optimization Mechanisms in GitLab, Gradle, Docker

This post is an edited transcript from a talk given by Grammarly engineer Dmytro Patkovskyi in September of this year. You can watch the full talk in the embedded video below.


My name is Dmytro Patkovskyi. I’m here today to talk about build tools we commonly use at Grammarly: GitLab, Gradle, and Docker. Not every team uses Gradle and Docker, but pretty much every team uses GitLab. I’m going to talk about build optimization mechanisms and how to make your builds more efficient with these tools. A short disclaimer: This is my first public talk in a really, really long time, so forgive me if I’m a little bit nervous. 

The second disclaimer is that I’m not a longtime expert on those tools. Everything I’ve learned, I’ve learned over my short, four-month tenure so far at Grammarly. When I joined Grammarly, I was quickly onboarded onto a completely new project—single sign-on—which was being built from scratch, with a clean slate. One of my first tasks was to develop a skeleton plan for that project, including a build process. I dove deep into that task, incorporating the tools and optimizations I’m going to describe to you.

Setting build optimization goals

Let’s start by defining the goals to achieve with a build process. For my work on our single sign-on initiative, I defined three. The main one was reproducibility, which ensures correctness in the build process. If you cannot reproduce the same build output from the same set of inputs, your build process is simply incorrect. My second goal—as it probably will be for most—was speed. The least important was simplicity. It’s last because it’s so subjective and depends on the depth of your personal knowledge of the build tools you’re using.

The goal of speed can actually be split into two separate goals. One is isolated build speed, or clean build speed. The second is incremental build speed, meaning the speed of subsequent builds. If you have compiled 200 classes and downloaded all the third-party dependencies in your first build, and then change one class in your second commit, maybe you don’t need to do all the work from scratch on your second build. Maybe you can save some time and complete an incremental build much faster than for the first one.

When you look at goals, you should always ask: Is there a conflict? And yes, there is a conflict between these goals. The biggest conflict is between incrementality and reproducibility—because reproducibility is usually achieved by running your build in a disposable environment and not passing any artifacts or files between environments for subsequent builds. So every build is a clean build. This was the goal when continuous integration tools were first introduced: to have a disposable environment for each build so that each build is easily reproducible and you never have weird side effects.

However, incrementality requires you to pass some files to help make the subsequent build faster. So these goals are in conflict. The way to resolve this is to prefer reproducibility for the master branch and do clean builds every time while preferring incrementality in other branches. This way, when you work on features, subsequent builds are faster—and you can see the results of your changes quickly.

There is also conflict between simplicity and incrementality. But the way to resolve that conflict is to look at the quantifiable benefit of incrementality. If you implement some optimization and it saves you only five seconds, maybe it’s not worth making your build much more complex, complicated, and harder to understand for other developers on your team—it’s a relatively small amount of improvement. You should weigh the benefits of incremental speed when setting goals. 

Thinking about goals brings us to the focus of my talk, which is those three build tools. The first I’m going to cover is GitLab. We’ll discuss what it has to offer and then look at Gradle and Docker.

GitLab and caching for builds

GitLab has three optimization features: artifacts, caches, and persistent volumes. That might surprise you a little bit because artifacts are usually not thought of as an optimization feature. However, artifacts are used not only to present the final results of your build but also to pass intermediate results between different build stages. This makes artifacts a viable optimization feature that allows you to avoid redundant work in different stages of your build. (Persistent volumes are very similar to caches—they’re basically local caches—which I’ll cover later.)

[Image: GitLab concept overview]

For those who never learned GitLab concepts—or maybe forgot them—I made this diagram, which explains pipelines, stages, and jobs. For each commit, GitLab launches a pipeline. A pipeline is a full iteration of your build process for a particular code change. A pipeline consists of stages. Stages run sequentially, one after another, and within each stage you define jobs that run in parallel. Artifacts can only be passed forward, from jobs in earlier stages. For example, Job D cannot depend on artifacts from Job E, but it can depend on artifacts from Jobs A and B. And Job Z can depend on artifacts from every job that was executed previously.

Cache options and avoiding pitfalls

Where does caching fit here? Caches can be passed in every direction, except to the past—because you cannot reverse time. It depends on the cache keys you define, and you can pass them vertically between pipelines. You can probably even pass them across projects (though that would be a pretty questionable decision). With this established, let’s look at common pitfalls, recipes, and other things to keep in mind when using these tools.

First, pitfalls. A common pitfall when working with artifacts is not specifying dependencies in the jobs. For the first stage, this does not matter, because there are no jobs in previous stages. But if you define a job in the second stage and don’t specify dependencies, GitLab will have to download artifacts from all jobs in the first stage, even if you don’t really need them. Artifacts are a distributed concept, meaning they are uploaded to Amazon S3, and the GitLab documentation says that by default, artifacts from all jobs of all previous stages are passed to each job. To avoid that entirely, specify an empty array in the dependencies key of your job; to pull artifacts selectively, list just the jobs you need in that key.
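As a sketch in .gitlab-ci.yml (the job names and paths here are made up for illustration):

```yaml
build:
  stage: build
  script: ./build.sh
  artifacts:
    paths:
      - dist/

test:
  stage: test
  script: ./test.sh
  dependencies:
    - build        # fetch artifacts only from the build job

lint:
  stage: test
  script: ./lint.sh
  dependencies: [] # fetch no artifacts at all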

Let’s talk about the three cache options that we use at Grammarly. Two work in .gitlab-ci.yml, where you specify cache: the key, and then you write which paths you want to cache. Those are either distributed or local. The thing is, they have exactly the same syntax, but the choice between them is made when you set up runners. So you, as an author of .gitlab-ci.yml, cannot choose between distributed and local cache. The choice is made by the platform team. In my case, the platform team chose distributed cache. This is actually quite good because we have a third option, which sort of emulates local cache. It is basically the same as local cache for most purposes.

The third option is persistent volume, which can be used together with distributed cache in the same job. To use it, you just write files to the /cache folder, and they will be persisted in the same GitLab runner instance. So if the next build runs on the same GitLab runner instance, it will get the freshest cache possible. If the build after the next one falls on the same GitLab runner instance, with the middle one on a different runner, it will have a slightly stale cache—which may or may not be useful, depending on the contents of the cache.

When I’m talking about GitLab, I’m not telling you what you put in the cache, right? Because it depends on the tools you use. When I talk about Gradle, I will explain what you can put in the cache. With GitLab, though, the trick is that if you use distributed cache, basically every runner will have the freshest cache. But if you use local cache or persistent volume, you might get a slightly stale cache, depending on the number of runners you have. If you have just one runner defined for your project, then you would always have a fresh cache.

Let’s talk about how distributed cache works. It’s very simple—I actually think it’s too simple. It essentially downloads and extracts a zip file from S3, runs your job, and then packs and uploads a new version of that zip file. Do you see what’s missing here, as a developer, if you defined a very large, 500-megabyte cache? First, there are no checksums or hashes. Nothing is compared, so it’s not incremental. Second, this is not rsync. The whole zip file is literally downloaded and uploaded every time. Third, there is no automatic cleanup mechanism. So whenever you want to clean that distributed cache, you have to reset it via the GitLab user interface or change the cache key.

The first two points together mean that distributed cache is always downloaded and uploaded, even when there are no changes to it. With that, there are two minor annoyances, one of which is that you can only cache files that are inside your project directory. The other is that absolute paths to those files will differ across GitLab runners because your project directory is different on different runners, and that includes the runner ID. 

I’m jumping ahead here because it’s not immediately clear why this matters, but I’ll tell you this matters to some Gradle optimizations. They work based on absolute paths. If you take files from one GitLab runner and then pass them to another, and the absolute path on that GitLab runner is different, the optimization will just not see those files. It will not find them, and it will not work. But there is a workaround: Set the GIT_CLONE_PATH variable in your .gitlab-ci.yml to a value that is runner-independent, which does not include runner ID.

Next, let’s talk about how to choose a key for a distributed cache. I’ll offer three options. In fact, there are limitless options, but I’m going to talk about the ones that are more common and that you’ll see more often. You can cache per repository: if you use $CI_PROJECT_NAME as the key, you will use one cache for all branches in your repository. The risk here is that the cache quickly becomes a union of all branches. If you add some dependencies in one branch and then go back to building the master branch, master will get those dependencies even if the branch was never merged into it. Cleanup has to be handled outside of GitLab. For example, Gradle has automatic cleanup of caches, but then you depend on some tool other than GitLab, as GitLab has no cleanup mechanisms.

The second option is to cache per branch, which means that every time you create a new branch, the first commit will take longer to build because it doesn’t have a cache yet. The third option is kind of ridiculous: one cache per pipeline. This is used much less often because you can use artifacts instead, and you should. This is for when you want to pass intermediate build results between jobs of a single pipeline; I suggest using artifacts instead of creating a cache per pipeline. You can also add a job name to those keys and have separate caches for different jobs. That is optional and depends on which tools you use inside GitLab.
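The three keying strategies map onto GitLab’s predefined variables roughly like this (the cached paths are illustrative):

```yaml
# One cache per repository, shared across all branches:
cache:
  key: "$CI_PROJECT_NAME"
  paths:
    - .gradle/

# One cache per branch:
cache:
  key: "$CI_COMMIT_REF_SLUG"
  paths:
    - .gradle/

# One cache per pipeline (usually better served by artifacts):
cache:
  key: "$CI_PIPELINE_ID"
  paths:
    - .gradle/
```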

Distributed cache trade-off: network time vs. build time

Here is a trade-off in distributed cache that I have already mentioned: network time versus build time. By using distributed cache, you cut down the time it takes for Gradle to build your project, but you add time to download this cache from S3 and to upload it back to S3, right? The bigger you make this cache, the more time it takes to download and upload it. But a local or volume cache also has a trade-off, which is between the number of runners and cache freshness. The more runners you have in your project, the less likely it is that you will get the freshest cache on your next build.

The thing with volume cache is that with more runners, you get a bigger average lag. It depends on what you cache. Some caches are very prone to staleness. If it’s stale, it’s a complete cache miss, and the cache is absolutely not useful for you. Other caches can still be beneficial if they are slightly stale. 

Which cache should you use? These are some rules of thumb I came up with. 

Use volume cache when:

  • A stale cache is still beneficial
  • The file sizes are big enough that uploading and downloading will take a while
  • You download files (such as dependencies) and you’re not producing them with your build
  • The number of build runners is low 

Low is a subjective word, but basically, when it’s less than 10, it’s usually okay to use volume cache. 

Use distributed cache when: 

  • A stale cache will mean a complete cache miss
  • You need the freshest cache on every build in order to save time
  • The file size is small and downloading doesn’t take long
  • Your build process produces files

My last advice on the goals that I mentioned earlier regards simplicity, which isn’t directly related to speed and optimizations. GitLab has a feature called templates. You can use them to avoid config duplication in .gitlab-ci.yml, but sometimes you can save even more space in your .gitlab-ci.yml by using one-line tricks. I will show you one a little later.
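For example, a hidden job (prefixed with a dot) can act as a template that concrete jobs pull in via extends; the image name and scripts here are placeholders:

```yaml
.gradle_defaults:
  image: gradle:jdk11
  cache:
    key: "$CI_PROJECT_NAME"
    paths:
      - .gradle/

build:
  extends: .gradle_defaults
  script: ./gradlew assemble

test:
  extends: .gradle_defaults
  script: ./gradlew test
```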

I didn’t cover reproducibility here. My main advice for reproducibility is that if you need to install something, specify the versions in your script. And if you use a lot of cache and scary cache techniques, do clean builds in the master branch.

Gradle and optimizing builds

Now we come to Gradle. Gradle has a large optimization zoo—six optimizations—all different in magnitude. When you build locally, they’re all enabled by default, so you don’t need to do anything to get those speed-ups when you build on your machine. On CI, if you want to enable them, you must preserve certain files that store metadata info for these optimizations. Some of those files are pretty big. We will decide by the end of this section how to store them with GitLab, whether you’d store them in distributed cache or in volume cache. Or maybe you don’t need some of those optimizations at all because they will add very little benefit to your build process.

Dependency cache

Dependency cache is about saving the third-party libraries you need to build your project. Gradle puts those libraries in a dedicated folder, and this is the biggest of the caches: it can take up to 500 megabytes for a typical project, sometimes even a gigabyte. Downloading it from the internet on every build might be wasteful. You can check whether you are doing so by searching your build log for “Download” strings. You will see the libraries being downloaded—or maybe you won’t, which means you are already using some sort of cache.

Incremental build

Second: incremental build, which gets us into slightly confusing territory, as there are only very slight differences between the next couple of optimizations. Incremental build can skip a Gradle task if the inputs and outputs of that task have not changed since its last execution on your machine. Gradle stores hashes of the input files under {projectDir}/.gradle and stores the output files in your build directory. Those output files (e.g., for Java) can be generated class files; they will be in /build/classes. If Gradle runs and sees that neither the inputs nor the outputs have changed since the last execution, it will simply skip the task and mark it as UP-TO-DATE in the build log. Incremental build is also commonly called up-to-date checks.
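Up-to-date checks hinge on tasks declaring their inputs and outputs. A minimal sketch of a custom task in Groovy build-script syntax (the task and file names are invented):

```groovy
// build.gradle
task generateVersionFile {
    // Gradle hashes these declared inputs and outputs for its up-to-date check
    inputs.file('version.txt')
    outputs.file("$buildDir/version.properties")
    doLast {
        buildDir.mkdirs()
        def version = file('version.txt').text.trim()
        file("$buildDir/version.properties").text = "version=$version\n"
    }
}
```

If neither version.txt nor the generated file has changed, a second run of this task is skipped and reported as UP-TO-DATE.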

Local build cache

Next: local build cache. This is very similar to incremental build. However, it doesn’t require that the previous task execution was for the same input files. Essentially, it stores the hash of every execution in a list, and on the next execution it checks your current inputs against that list. If the inputs you have right now in your project directory are found in that list, it will take the outputs of the task from the cache. So instead of only reusing the last execution, you can reuse any previous execution. This optimization is disabled by default in Gradle and needs to be enabled via the --build-cache command-line argument. Tasks that are cache hits will be marked as taken from the cache in the build log.
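Rather than passing --build-cache on every invocation, you can enable the build cache permanently in gradle.properties:

```properties
# gradle.properties
org.gradle.caching=true
```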

Remote build cache

And now: remote build cache. This is also very similar to the previous one. However, the files used to carry this optimization over are not stored on the machine that runs builds but are instead stored on a remote Gradle cache server. Artifactory can serve this cache. In addition to enabling this cache on the command line when you call Gradle, you also need to specify the URL of the remote Gradle cache server in your Gradle settings. Otherwise, you cannot use this optimization.
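A sketch of that settings-file configuration, with a placeholder server URL:

```groovy
// settings.gradle
buildCache {
    remote(HttpBuildCache) {
        url = 'https://gradle-cache.example.com/cache/'
        // Allow this machine (e.g., CI) to publish entries, not just read them:
        push = true
    }
}
```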

Incremental compilation

Next is incremental compilation. We already had incremental build, and incremental build is a language-agnostic feature. It does not know which language you are using. It just checks input files against output files. Incremental compilation, on the other hand, is a language-specific feature. It exists only for some languages—prominently, for Java. It allows you to avoid full recompilation of all your Java classes if just a few classes changed. In that case, it might be enough to rebuild just those classes—along with the classes that depend on them—and save you some time.
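For Java, incremental compilation is configured per compile task; in recent Gradle versions it is on by default, so this is only needed to be explicit (or to turn it off):

```groovy
// build.gradle
tasks.withType(JavaCompile) {
    // Recompile only the changed classes and the classes that depend on them
    options.incremental = true
}
```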

Compilation avoidance

The last optimization is compilation avoidance. This is also language-specific. It’s very similar to incremental compilation, but here’s what it does: If class B changed, and class A depends on it, this optimization will avoid recompilation of A if B changed in an interface-compatible way. For instance, if only the internals of B changed, then it will avoid recompiling A to support that.
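A hypothetical two-class illustration of what “interface-compatible” means here:

```java
// A depends only on B's public signature, not on its internals.
class A {
    int doubled() { return new B().value() * 2; }
}

class B {
    // Changing only this body is ABI-compatible: A is not recompiled.
    int value() { return 21; }
    // Changing the signature (say, to long value()) would force A to recompile.
}
```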

[Image: Gradle optimization comparison table]

This table highlights a few useful facts about the different optimizations. I concentrated first on the sizes of the files produced by these optimizations, then the time savings, and then whether the cache will be beneficial if it’s stale. If what we are getting is not the last cache but a cache from a build before last, for example—or from a build that happened three commits ago—is it still useful? For dependency cache, it’s still very useful, even if you get a dependency cache from 10 commits ago, because dependencies change so rarely.

It’s rare to add libraries to projects. If you do, usually you have some baseline that is pretty big—maybe 20 libraries. Then you add 2. If a build gets a cache that has 20 libraries, it is still useful and saves time on the subsequent build to add just the additional 2 new ones. Some of these optimizations are hit or miss. For example, incremental build is hit or miss. If you’re getting a stale incremental build with files from the commit before the last one, they are not really useful at all, because any small change will completely invalidate this cache.

Gradle and effective cache choices

Here comes the official Gradle advice. Gradle advises you to use remote build cache—and, a bit more quietly, also advises using dependency cache. At least it is not actively against using it on CI! However, I have spent a lot of time trying to enable all of these caches on CI for my project, and I succeeded.

But then I realized that some of them are really not that useful or beneficial, so I turned them off to simplify my project, which is what I recommend. Essentially, you don’t really need remote build cache, even though Gradle recommends it. You don’t need it because you rarely run your builds on exactly the same files as before. So when would you need it? Maybe when you change a README file. There are just not many other cache hit situations here. Maybe there would be more for remote and local build cache if you defined your own exotic Gradle tasks with some small and rarely changed set of inputs. For a typical Java project, though, the cache hits will be too rare for using remote build cache to make sense.

With incremental compilation and compilation avoidance, it is useful to use those features if compilation takes a while for your project. But this is also pretty rare, as most projects would compile in less than 10 seconds unless they’re really huge. If your project takes a while to compile, then I would advise using the distributed GitLab cache to store incremental compilation information and use it across your runners.

GitLab and Gradle recipes

Some GitLab recipes for Gradle: If you want to enable dependency cache, I recommend doing it via a GitLab volume. All you need to do is define GRADLE_USER_HOME so that Gradle’s caches end up under the persistent /cache folder. That’s all we need in the Grammarly setup. If you decide to enable the more cryptic optimizations, you also need to define GIT_CLONE_PATH to make GitLab check out your project into a constant folder across all runner instances, and you would need to define a cache key and cache the files produced by those optimizations.
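Putting that together in .gitlab-ci.yml (the exact paths and the project folder name are assumptions, not a prescribed layout):

```yaml
variables:
  # Gradle's dependency cache lands in the runner's persistent /cache volume:
  GRADLE_USER_HOME: /cache/.gradle
  # Constant checkout path, so absolute-path-based Gradle optimizations
  # survive when a build lands on a different runner:
  GIT_CLONE_PATH: $CI_BUILDS_DIR/my-project
```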

The last GitLab recipe is about running a clean build in master, which is only relevant if you decide to use those incremental compilations I described. The only reason you would do this is if those features, such as incremental compilation and compilation avoidance, were historically a little buggy, as Gradle admits. Sometimes, they would include classes that you had already deleted in your jar, and you probably wouldn’t want them in your master branch. Also, I think it’s just a good practice to run clean builds in master so you are sure that all your tests pass and you end up with a very reproducible result.

Optimizing builds with Docker

With that in mind, let’s move on to Docker. Here are the optimizations available in Docker: layer cache . . . and basically nothing else (or at least nothing else that I know of). As I said, I’m not a huge expert on those tools. This is what I learned during a couple of months with Grammarly.

Is layer caching used in your GitLab builds? What do you think? The answer is no, unless you do something special to actually enable it. By default, it’s not used, because you’re building in a disposable environment and nothing is passed between builds, so there is nothing for the layer cache to hit. The interesting part, which you probably did not expect, is that you shouldn’t actually need it, anyway.

You shouldn’t need it because, with GitLab doing the build work, you can write your Dockerfile so simply that there will be nothing to optimize at all. There will be nothing to cache in that Dockerfile.

Let’s look at how layer cache works. It takes the Dockerfile command arguments, which it remembers as a cache key in addition to remembering the effect those commands have. The effect is a sum of all file changes it made, called the Docker layer. The sum of all the layers is the Docker image that you get in the end. The interesting thing is that if one command was changed and resulted in a cache miss, then all commands below it will also automatically miss cache. That’s called cache invalidation.

That’s an interesting thing because you really need to take it into account when writing your Dockerfile. You need to think about which lines will often result in cache misses. Usually, you will try to put them as late as possible in the Dockerfile so that the cache misses happen after all the other cache hits. This gives you the fastest Docker build possible.
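A classic illustration of that ordering principle (the package manager and paths are hypothetical): keep the rarely changing dependency installation near the top and copy the frequently changing source code last, so a source change invalidates only the final layers:

```dockerfile
FROM node:12-alpine
WORKDIR /app

# These files rarely change, so this expensive step is almost always a cache hit:
COPY package.json package-lock.json ./
RUN npm ci

# Source changes on every commit, so it goes last:
COPY src/ ./src/
CMD ["node", "src/index.js"]
```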

Let’s come back to what I said in the first slide: that you probably won’t need layer cache because complex Dockerfiles are things of the past in GitLab. Hopefully, anyway. The hope is that you will achieve a Dockerfile as short as this:

[Image: Simplifying Dockerfiles]

This is a full Dockerfile: five lines. It simply takes the distribution of your Java application, copies it inside the container, and runs it, and it does nothing else. There is absolutely nothing to cache here because this tar archive changes on every commit. You cannot cache it, there are no installation statements, and there is no multi-stage build. [Update: There is a way to achieve further optimization here by unpacking the fat jar and creating separate Docker layers for dependencies and code.]
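The Dockerfile itself is in the image above; as a sketch, such a five-line file might look like this (the base image, archive path, and entry point are assumptions):

```dockerfile
FROM openjdk:11-jre-slim
COPY build/distributions/app.tar /
RUN tar -xf /app.tar && rm /app.tar
WORKDIR /app
ENTRYPOINT ["bin/app"]
```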

Not every Dockerfile is as short as this. Even in our services, even after I tried hard to optimize them, we have docproc, where we need to install Node.js and all of its dependencies, and we need to do this in the Dockerfile as well. Here’s a simple but insightful flow chart for deciding whether you need to enable layer cache or not.

[Image: Dockerfiles, cache or no cache]

Let’s talk about how to enable layer cache. Start with this shorter Dockerfile: 

[Image: Enabling Docker layer cache]

In the longer version, we pull the latest image before the Docker build and then push the result of that Docker build back. We also add a --cache-from flag to the docker build command, telling it to use the latest image as a cache source.
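A sketch of such a job, using GitLab’s predefined registry variables (the tagging scheme is an assumption):

```yaml
build_image:
  script:
    - docker pull "$CI_REGISTRY_IMAGE:latest" || true
    - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"
```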

This might look a little wasteful to you because the pulled image is not stored locally between builds; it’s fetched from the registry every time. But the FROM line in your Dockerfile already downloads the base image, which makes up most of the pulled image’s size. On top of what you would download anyway, the full image usually doesn’t add too much, so you are not really being too wasteful by pulling it before the build.

Let’s look at the order of commands. I’ve mentioned that you can inadvertently invalidate the cache and get no benefit from the layer cache. This is a real example from our project docproc, in which I recently changed the order shown in this version:

[Image: Docker command order]

It saved about one minute of Docker build time for that project. The trick here is that there was a slow command close to the end of the sequence, which means it was never taken from the cache even when caching was enabled. Well, the cache was disabled in the first place, but even after I enabled it, the command was not being cached because it was too low in the Dockerfile: the commands above it were invalidating its cache on every commit. After moving things around, I get the same result, but now the slow command is cached. Since it rarely changes, it’s pretty much always a cache hit.

Q&A

Speaker 1: Thanks for the great talk and for all the work you have done. I would just like to add that in Docker 18.09, if you build with BuildKit, you can actually cache something in volumes. It has some useful features you could check out, but it won’t really change anything in this presentation. Thanks for the good work.

Dmytro Patkovskyi: Thanks. Yeah, I have not really used BuildKit. And to be honest, of those three tools, Docker is the one I’ve investigated the least, so I’m least experienced with Docker compared to GitLab and Gradle as of today. So thank you for the input. Anything else?

Speaker 2: Thank you. I also don’t have questions; I would just like to say thank you for the research, because with it we could apply some optimizations to our acquisition pipelines, for example, which will speed up the builds with Gradle caches.

Dmytro Patkovskyi: Thank you very much. I found that moving the dependency cache from distributed GitLab cache to local or volume cache actually gave me a pretty good speedup even in the SSO project. I just tried it today, and it really seems faster, considering that we have just three or four runners, I believe. It means the cache is pretty much always available, even though it’s local, and there is absolutely no network time associated with downloading or uploading it, which is cool.

I also want to say thank you for all the support my team gave me with GitLab and with experiments. I had to run through a couple of my ideas, and Dima Shevchuck and Maryna Veremenko had to spend some time setting up some custom things for me. And thank you Sasha Marynych for giving me the motivation to do this talk. 
