Fix a Slow Self-Hosted GitLab: The RAM Trap During Pipelines

The web UI freezes. git push hangs. A build that used to take 1:40 now takes 3:30. The runner VM looks idle in htop. The network is fine. It only happens while a pipeline is running.

The culprit isn’t your network or your runner. It’s the GitLab server hitting its RAM ceiling and swapping. A few default services eat more memory than most homelab boxes have to spare, and a Node-based pipeline is enough to tip the whole thing into swap.

This post is the playbook I used to diagnose and fix exactly that. Five-command diagnosis, three changes to gitlab.rb, build time back to 1:40.

Environment:

Component	Value
GitLab	GitLab EE 18.9.1 (Omnibus, Debian)
Host	Debian 13 VM on Proxmox
VM Specs	8 vCPU, 10 GB RAM, 60 GB SSD
Runner	Separate VM (Docker executor)
Workload	Astro static site → Cloudflare Pages

The Workflow That Lands on the Server

Every time the symptoms hit, a single commit is on its way from your editor to a built artifact.

The compute happens on the runner VM, in a Docker container. But the server is in every step: it ingests the push, creates the pipeline, hands jobs to the runner, streams logs back, and stores the artifact. That’s why a server with no headroom collapses during pipelines even when the runner is idle.

What We’ll Cover

Rule out the red herrings — TLS, DNS, runner contention
Understand what GitLab is actually doing during a pipeline — Puma, Sidekiq, Workhorse, Postgres
Diagnose in five commands — RAM, swap, load, OOM kills, disk
Read real numbers from a thrashing box — what the symptoms look like
Trim gitlab.rb — Puma workers, Sidekiq concurrency, disable Prometheus/KAS
Grow swap as a safety net
Verify the fix — what free -h and vmstat should look like under load

What It’s Not

Before tuning anything, rule out three things that aren’t the problem:

Suspect	Verdict	Why
TLS / HTTPS	Not the cause	TLS overhead is ~1–3% CPU. You’d see CPU pegged, not RAM exhausted.
Pi-hole / DNS routing	Not the cause	Slow DNS would slow everything, not just pipelines. Test with `dig`.
Runner / server contention	Not the cause	The runner already runs on a separate VM. If the runner VM is idle, it’s the server.

If htop on the runner is idle and the GitLab VM is the slow one, you’re in the right place.

Runner-side performance is a separate fight

This post is about the GitLab server. If your pipelines are slow because the runner is taking the long way around (e.g. routing through Cloudflare instead of the local network), or if npm ci runs from scratch on every job, see GitLab Runner Performance Optimization. The two posts are complementary.

What Actually Happens on the Server During a Pipeline

Even with the runner on its own VM, the GitLab server still has work to do every time a job runs:

Runner API polling. The runner hits /api/v4/jobs/request every 3 seconds by default. Every poll goes through Puma.
Trace log streaming. As the job runs, the runner ships log output back continuously: Workhorse → Redis (ci_build_trace_chunks) → Postgres. A noisy build (npm install is very noisy) means a lot of writes.
Sidekiq state machine. Every status change fires workers: BuildTraceChunkFlushWorker, BuildFinishedWorker, PipelineUpdateWorker, and friends.
Artifact upload. At the end of the job, the runner pushes artifacts back through Workhorse.

On a 10 GB box running Puma + Sidekiq + Postgres + Redis + Gitaly + Workhorse + nginx + Prometheus + KAS (plus Registry and Pages if they are enabled), all of that lands on a system already at its memory ceiling. RAM pressure → swap → everything stalls.

Diagnose First

Run this block on the GitLab VM while a pipeline is actively running. Five commands, one minute, full picture.

free -h && echo "---" && vmstat 2 5
sudo gitlab-ctl status
top -b -n 1 -o %MEM | head -25
sudo dmesg -T | grep -iE 'oom|killed process' | tail -20
df -h && sudo du -sh /var/opt/gitlab/* | sort -h

How to read each:

Command	Watch for	Verdict
`free -h`	`available` close to zero, swap used	RAM ceiling hit
`vmstat 2 5`	`si` / `so` columns non-zero after the first row (the first row is the average since boot)	Swap thrashing in progress
`gitlab-ctl status`	Long list of services	Spot anything you don’t use (kas, registry, pages, prometheus)
`top -o %MEM`	Top consumers	Tells you which knob to turn
`dmesg`	`Out of memory: Killed process …`	Kernel killed a GitLab process — explains hangs
`df -h` + `du`	`/var/opt/gitlab/prometheus` huge	Wasted disk on a service you may not need

Also clarify the symptom before changing anything. Slow web UI, slow git push / git clone, or slow pipeline runtime? Each points at a different bottleneck — Puma for UI, Gitaly for git, the whole stack for pipelines.
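If you'd rather get the `free -h` verdict as numbers you can log or alert on, the same signals are in /proc/meminfo. This is a minimal sketch, not part of my original playbook: the function name mem_pressure and the output format are my own choices, and it assumes a Linux kernel that reports MemAvailable.

```shell
# Summarize memory pressure from /proc/meminfo-style input on stdin.
# Prints available RAM, swap used (both in MiB) and swap used as a percentage.
# mem_pressure is a hypothetical helper name, not a GitLab or procps tool.
mem_pressure() {
  awk '
    /^MemAvailable:/ { avail  = $2 }   # /proc/meminfo values are in kB
    /^SwapTotal:/    { stotal = $2 }
    /^SwapFree:/     { sfree  = $2 }
    END {
      used = stotal - sfree
      pct  = (stotal > 0) ? used * 100 / stotal : 0
      printf "available_mib=%d swap_used_mib=%d swap_used_pct=%d\n",
             avail / 1024, used / 1024, pct
    }'
}

# On the GitLab VM, while a pipeline runs:
#   mem_pressure < /proc/meminfo
```

Low available_mib together with a high swap_used_pct is the "RAM ceiling hit" row of the table above.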

Where the Pressure Goes

Every service on the server side holds memory. Add it up with the Omnibus defaults (Puma alone is 4 × 1.2 GB) and you’re at the ceiling before the pipeline even starts.

Real Numbers From a Thrashing Box

Here’s what the diagnostic block showed on my box during a slow pipeline:

Headline signals

Signal	Value	Verdict
Free RAM	207 MB / 10 GB	Critically low
Swap used	1.0 GB / 1.0 GB (100%)	Completely full
Load avg (1/5/15)	0.17 / 19.70 / 23.71	System was crushed minutes ago
OOM kills	none	Swap absorbed it — which is why it feels slow
Disk usage	28 / 59 GB (50%)	Fine
`/var/opt/gitlab/prometheus`	13 GB	Wasteful for a solo instance

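vmstat was quiet at sample time (the 1-minute load average had already fallen to 0.17), but 5- and 15-minute load averages of 19.70 and 23.71 on an 8-vCPU box show the storm had only just passed. When swap is full, every new allocation stalls, and that stall is the slowness you feel.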

RAM breakdown from top

4 × Puma workers @ ~1.2 GB  = 4.8 GB   ← biggest consumer
1 × Sidekiq                 = 1.8 GB
1 × Puma master             = 0.7 GB
PostgreSQL (sum)            ~ 1.5 GB
Prometheus                  = 0.6 GB (+13 GB disk)
Gitaly                      = 0.2 GB
Redis, nginx, exporters     ~ 0.3 GB
                            ─────────
                            ~ 9.9 GB   right at the 10 GB ceiling

A 1–2 GB npm install during a CI job pushes the whole box into swap. That’s the root cause.
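To build the same kind of breakdown on your own box, you can group resident memory by command name. This is a rough sketch under a few assumptions: rss_by_command is a name I made up, it only looks at the first word of the command name, and RSS counts shared pages once per process, so forked workers such as Puma will look somewhat bigger than they really are.

```shell
# Sum resident memory (RSS, in kB) per command name and print the top
# consumers in MiB, largest first.
# Expects "rss command" pairs on stdin; the optional argument is how many rows to show.
# rss_by_command is a hypothetical helper name.
rss_by_command() {
  awk '{ rss[$2] += $1 }
       END { for (c in rss) printf "%d %s\n", rss[c] / 1024, c }' |
    sort -rn | head -n "${1:-10}"
}

# On the GitLab VM:
#   ps -eo rss= -o comm= | rss_by_command 10
```

Treat the totals as an upper bound per service, which is good enough to decide which knob to turn first.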

The full pre-fix snapshot

               total        used        free      shared  buff/cache   available
Mem:             9Gi       9.4Gi       419Mi       2.2Gi       2.6Gi       539Mi
Swap:          1.0Gi       1.0Gi       412Ki
---
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 0  0 1048160 429860  33120 2712088    6    4  7253   272 1307    2  1  1 97  1  0  0

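Swap fully used (1048160 KB), available RAM at 539 MB. Anything new has to evict pages to disk first. The single vmstat row shown is the first one, which reports averages since boot, so its si and so values are not a live reading.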

And the full service list — many of which I never use:

run: alertmanager: …
run: gitaly: …
run: gitlab-exporter: …
run: gitlab-kas: …
run: gitlab-workhorse: …
run: logrotate: …
run: nginx: …
run: node-exporter: …
run: postgres-exporter: …
run: postgresql: …
run: prometheus: …
run: puma: …
run: redis: …
run: redis-exporter: …
run: sidekiq: …

Alertmanager, node-exporter, postgres-exporter, redis-exporter, prometheus, gitlab-kas — six processes I haven’t touched in months, all holding memory.

The Fix

Apply in order. Step 1 alone is usually enough.

Step 1 — Trim gitlab.rb

Edit /etc/gitlab/gitlab.rb:

# Cut Puma from 4 workers to 2. Frees ~2.4 GB.
puma['worker_processes'] = 2

# Sidekiq default is 20; 10 is plenty for a solo instance.
# GitLab 17.0 and later use sidekiq['concurrency']; max_concurrency was removed.
sidekiq['concurrency'] = 10

# Frees ~600 MB RAM + lets you reclaim 13 GB of disk afterwards.
prometheus_monitoring['enable'] = false

# Disable only if you're not using the Kubernetes Agent.
gitlab_kas['enable'] = false

# Disable only if you're not using these.
registry['enable']     = false
gitlab_pages['enable'] = false

Apply and restart:

sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart
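After the restart it is worth confirming that the Puma change actually took effect. A small sketch, assuming Puma cluster workers show up with a process title that starts with "puma: cluster worker" (that is how they appear for me; check your own ps output if the count comes back as 0). The function name is hypothetical.

```shell
# Count Puma cluster workers in "ps -eo args=" output read from stdin.
# Assumption: worker process titles start with "puma: cluster worker".
count_puma_workers() {
  grep -c '^puma: cluster worker'
}

# On the GitLab VM, after the restart:
#   ps -eo args= | count_puma_workers
```

With the settings above, the expected count is 2.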

Only after Prometheus has actually stopped (verify with gitlab-ctl status), reclaim its disk:

sudo rm -rf /var/opt/gitlab/prometheus/data

Only delete Prometheus data after confirming it's disabled

Run sudo gitlab-ctl status | grep prometheus first and confirm no run: entries remain. Deleting /var/opt/gitlab/prometheus/data while the service is still running corrupts the in-flight write batch.
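The check and the delete can be tied together so the rm never runs while the service is up. This is only a sketch of that guard: both function names are mine, and it assumes gitlab-ctl status prints one "run: <service>:" line per running service, as in the listing earlier in this post.

```shell
# Exit 0 if gitlab-ctl status output on stdin still lists a running prometheus.
# Both function names here are hypothetical helpers.
prometheus_running() {
  grep -q '^run: prometheus:'
}

# Guarded cleanup: refuses to delete while the service is still up.
reclaim_prometheus_disk() {
  if sudo gitlab-ctl status | prometheus_running; then
    echo "prometheus is still running; not deleting" >&2
    return 1
  fi
  sudo rm -rf /var/opt/gitlab/prometheus/data
}

# On the GitLab VM, after reconfigure and restart:
#   reclaim_prometheus_disk
```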

Step 2 — Grow swap to 4 GB (safety net, not a cure)

Swap doesn’t make a thrashing box fast. It does prevent OOM kills, which turn slow into broken.

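# Turn off existing swap first. Do this after Step 1, so RAM can absorb what is swapped out.
sudo swapoff -a
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Add the fstab entry only if it is not already there.
grep -q '^/swapfile[[:space:]]' /etc/fstab || echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab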

Step 3 — Grow the VM (when convenient)

If you can bump RAM 10 → 16 GB in Proxmox, do it. That alone removes the constraint and lets you re-enable Prometheus or KAS without thinking about budget.

Resource	Current	Recommended
RAM	10 GB	16 GB
Swap	1 GB	4 GB
Disk	60 GB	60 GB (fine after Prometheus cleanup)

Verify the Fix

Restart, then re-run the diagnostic block. The state you want:

si / so in vmstat stay at 0 under pipeline load
available RAM stays above 1–2 GB
Load average stays well below CPU count (8)
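The load-average check is easy to script if you want it in a cron job or a post-deploy step. A minimal sketch, assuming the /proc/loadavg format and using the 1-minute figure; load_ok is a name I picked, and "below CPU count" is a looser bar than "well below", so tighten the threshold to taste.

```shell
# Exit 0 if the 1-minute load average is below the given CPU count.
# Args: loadavg text (as in /proc/loadavg), CPU count.
# load_ok is a hypothetical helper name.
load_ok() {
  echo "$1" | awk -v cpus="$2" '{ exit !($1 < cpus) }'
}

# On the GitLab VM:
#   load_ok "$(cat /proc/loadavg)" "$(nproc)" && echo "load OK"
```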

Here’s the after on my box, with no pipeline running:

               total        used        free      shared  buff/cache   available
Mem:             9Gi       5.0Gi       4.0Gi       208Mi       1.4Gi       4.9Gi
Swap:          1.0Gi        25Mi       998Mi

RAM use dropped from 9.4 GB to 5.0 GB. Swap dropped from 100% used to ~2%. There’s now 4 GB of headroom for a pipeline to chew through without touching disk.

Under pipeline load:

procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 1  0  25692 3456384  86496 1725492    6    4  7795   271 1320    2  1  1 97  1  0  0
 0  0  25692 3443544  86500 1725492    0    0     0    40 1143 1123  2  0 97  0  0  0
 0  0  25692 3442508  86500 1725492    0    0     0     0 1086 1165  2  0 98  0  0  0

si and so are 0 in every live 2-second sample; the first row (6 and 4) is vmstat's average since boot, not a current reading. No swap I/O. That’s what “not thrashing” looks like.
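Eyeballing the columns works, but a filter makes the check unambiguous. The sketch below prints only the samples with swap traffic and skips the first data row, which vmstat reports as an average since boot. It assumes the default vmstat layout shown above (two header lines, si and so in columns 7 and 8); swap_activity is my own name for it.

```shell
# Print vmstat sample rows whose si or so column is non-zero.
# Skips the two header lines and the first data row (averages since boot).
# Assumes the default vmstat column order: si is column 7, so is column 8.
swap_activity() {
  awk 'NR > 3 && ($7 > 0 || $8 > 0)'
}

# On the GitLab VM, during a pipeline:
#   vmstat 2 5 | swap_activity
```

No output means no swap I/O during the sampling window.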

Build and deploy time for the staging/production pipeline:

Phase	Before	After
Build + deploy	03:30	01:40

Just over 2× faster (210 s down to 100 s), with one config file edit and no hardware change.
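For the record, here is the arithmetic behind that ratio as a small sketch you can reuse on your own before and after timings. The helper names to_seconds and speedup are mine.

```shell
# Convert an mm:ss duration to seconds.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# Ratio of two mm:ss durations (before / after), to one decimal place.
speedup() {
  awk -v before="$(to_seconds "$1")" -v after="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", before / after }'
}

speedup 03:30 01:40   # prints 2.1
```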

The pipeline history makes the difference impossible to miss. Before the fix, pipeline durations were all over the map — some runs took 3, 4, even 9 minutes:

GitLab pipeline list showing build durations before and around the fix: pipelines #96–#99 took 4:26, 1:45, 9:43 and 3:26, while later runs after the change drop to around 1:41

After the fix, every recent pipeline finishes in about 1:40 or less, with no multi-minute outliers:

GitLab pipeline list after the fix: pipelines #106–#109 all complete between 0:46 and 1:36, with consistent build durations

Decision Tree — When to Apply What

Not every slow GitLab has the same shape. Match the symptom to the fix:

Symptom in diagnostic block	Likely cause	Apply
`si` / `so` non-zero in `vmstat`	RAM exhaustion → swap	Step 1 (trim gitlab.rb) or grow RAM to 16 GB
Puma processes dominate `top`	Too many Puma workers for the box	`puma['worker_processes'] = 2`
Sidekiq + Postgres dominate `top`	Trace log noise + state churn	Step 1 + quieter CI logs (set `NPM_CONFIG_LOGLEVEL=error`)
Gitaly dominates	Big repo or slow disk	Check repo size, check Proxmox storage I/O
Disk near full, `/var/opt/gitlab` huge	Old artifacts, traces, registry	Clean old job traces, expire artifacts, prune the registry
No RAM/CPU stress, just slow web UI	Puma worker count mismatch	`puma['worker_processes'] = 2`
Cloudflare-routed traffic, runner idle, server healthy	Network detour	See GitLab Runner Performance Optimization

What You Have Now

Layer	Outcome
Server	RAM use down from 9.4 GB to 5.0 GB. 4 GB of headroom for pipelines.
Services	Prometheus, KAS, Registry, Pages disabled (re-enable when you actually need them)
Disk	13 GB reclaimed from `/var/opt/gitlab/prometheus`
Pipeline	03:30 → 01:40 with no hardware change
Safety	4 GB swap so a runaway job slows things down instead of OOM-killing Puma

Next Steps

GitLab Runner Performance Optimization — the runner-side counterpart. Direct internal connections, Docker executor tuning, caching, needs for parallelism.
Self-Host Harbor Image Registry on Debian — stop hitting Docker Hub on every CI job. Smaller base-image pulls = less Workhorse traffic = less server load.
Monitoring Proxmox with Grafana — the diagnostic block in this post is reactive. A Grafana dashboard watching RAM and swap on the GitLab VM will surface this trend a week before it becomes “the pipeline is broken.”
Linux Server Security Baseline — once the GitLab server is performing, lock it down.