When something breaks in a Kubernetes cluster, the difference between fixing it in 5 minutes and 5 hours is knowing which command to run first. Not which command exists — there are hundreds — but which one maps to the symptom in front of you.
This is the symptom-driven cheatsheet I built from a single afternoon of debugging GitLab on K3s on ARM64. Every command and log signature here was used to find a real bug. Memorize the patterns, not the flags.
What You’ll Cover
- The default workflow: get → describe → logs
- When a pod won’t start (and the empty-log arch trap)
- When pods look healthy but the app returns 500
- When the ingress returns 404 for your hostname
- When cert-manager won’t issue a certificate
- When external access (LB or tunnel) is broken
- Operational commands: scale, restart, replay
- Log signatures and what they actually mean
The Default Workflow: get → describe → logs
Almost every Kubernetes debugging session starts with the same three commands, in this order. Skipping ahead wastes time.
# 1. What's broken?
kubectl get pods -n gitlab
# 2. Why is it broken? (events, image, conditions, mounts)
kubectl describe pod <pod-name> -n gitlab
# 3. What did it say before it died?
kubectl logs <pod-name> -n gitlab --previous
get shows status. describe shows the why: the Events section at the bottom of the output reports image pulls, scheduling failures, OOM kills, probe failures, and volume mounts. logs --previous is critical for crash loops because the current container has only just been restarted (or is still waiting in back-off), so its log is empty or incomplete; the failure is recorded in the previous instance.
For Jobs in CrashLoopBackOff, the pod name often changes between checks. If kubectl logs pod-xyz --previous returns pods "pod-xyz" not found, list pods again — the Job has spawned a new one. Or skip the name lookup entirely with kubectl logs -l job-name=<job> and kubectl describe pod -l job-name=<job>.
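The three steps can be wrapped in a small helper so the order is never skipped. This is a minimal sketch, not part of the original workflow: the function name triage and the default namespace are assumptions for illustration.

```shell
# triage: run get -> describe -> logs --previous for one pod, in that order.
# Usage: triage <pod-name> [namespace]
triage() {
    pod=$1
    ns=${2:-gitlab}   # assumption: default to the namespace used in this post

    echo "== 1. get =="
    kubectl get pod "$pod" -n "$ns"

    echo "== 2. describe =="
    kubectl describe pod "$pod" -n "$ns"

    echo "== 3. logs --previous =="
    # --previous fails when the container has never restarted,
    # so fall back to the current container log in that case.
    kubectl logs "$pod" -n "$ns" --previous 2>/dev/null ||
        kubectl logs "$pod" -n "$ns"
}
```

Paste it into a shell session and call it as triage <pod-name> gitlab. For multi-container pods you would still add -c by hand.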
When a Pod Won’t Start
kubectl get pods shows you a status. Each one points at a different category of problem.
| Status | First command to run | Likely cause |
|---|---|---|
| Pending | kubectl describe pod (look at Events) | No node fits: resources, taints, PVC not bound |
| ImagePullBackOff / ErrImagePull | kubectl describe pod (look at Image:) | Wrong image name/tag, registry auth missing, network |
| CrashLoopBackOff (with logs) | kubectl logs --previous | App startup error: read the log |
| CrashLoopBackOff (empty logs) | kubectl describe pod (look at Exit Code + Started/Finished) | Architecture mismatch, missing binary, exec failure |
| 0/2 Running | kubectl describe pod (look at Containers + Conditions) | One container in the pod is not Ready: probe failing, init still running |
The most expensive of these to diagnose is the empty-logs CrashLoopBackOff, because the obvious next move (logs) gives you nothing. The clue lives in describe.
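In a busy namespace it helps to filter the pod list down to the rows that match the table above. The following is a rough sketch that reads kubectl get pods --no-headers output on stdin; the function name unhealthy_pods is hypothetical, and it assumes the default column order (NAME, READY, STATUS).

```shell
# Print only the pods that need attention: anything not Running,
# plus Running pods whose READY count is short (for example 1/2).
unhealthy_pods() {
    awk '
        {
            split($2, ready, "/")             # READY column, e.g. "1/2"
            status = $3
            if (status == "Completed") next   # finished Job pods are fine
            if (status != "Running" || ready[1] != ready[2])
                print $1, $2, status
        }
    '
}

# Usage (needs a cluster):
#   kubectl get pods -n gitlab --no-headers | unhealthy_pods
```

Each line it prints is a candidate for the "first command" column of the table.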
The Empty-Log Arch Trap
Here is the actual output from a real bug — a bucket-create Job stuck in a loop:
Containers:
  minio-mc:
    Image:          registry.gitlab.com/.../minio/mc:RELEASE.2018-07-13T00-53-22Z
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Wed, 20 May 2026 12:34:50 +0700
      Finished:     Wed, 20 May 2026 12:34:50 +0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
    Restart Count:  3
Three signals together tell the whole story:
- Exit Code 255 — generic catch-all, often used for “exec failed at the kernel level”
- Started and Finished in the same second — the binary didn’t even get to do anything
- kubectl logs --previous returns “unable to retrieve container logs”: there are no logs because the process never wrote anything to stdout
This is the canonical signature of an architecture mismatch — usually an amd64-only image running on an ARM64 node. The kernel rejects the ELF binary before any code runs. The image tag from 2018 is a strong second clue: that’s old enough to predate widely-published multi-arch manifests.
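The first two signals can be checked mechanically. Here is a minimal sketch that reads kubectl describe pod output on stdin and succeeds only when it sees Exit Code 255 together with identical Started and Finished timestamps. The function name looks_like_exec_failure is an assumption, and the check is a heuristic, not proof of an architecture mismatch.

```shell
# Exit status 0 when the describe output shows exit code 255 and a
# container that started and finished in the same second.
looks_like_exec_failure() {
    awk '
        /Exit Code:/ && $NF == 255 { code255 = 1 }
        /Started:/  { sub(/^[ \t]*Started:[ \t]*/, "");  started = $0 }
        /Finished:/ { sub(/^[ \t]*Finished:[ \t]*/, ""); if ($0 == started) same_second = 1 }
        END { exit !(code255 && same_second) }
    '
}

# Usage (needs a cluster):
#   kubectl describe pod <pod-name> -n gitlab | looks_like_exec_failure \
#       && echo "suspect an architecture mismatch"
```

It compares each Finished line with the Started line just before it, so both the State and Last State blocks are covered.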
The fix is to override the image to a multi-arch tag and force the Job to re-run:
# After patching the chart values:
kubectl -n gitlab delete job -l component=create-buckets
kubectl -n gitlab get pods -l component=create-buckets -w
# expect: Completed (not CrashLoopBackOff)
On the node side: kubectl get nodes -L kubernetes.io/arch shows each node’s architecture as an extra column (plain -o wide does not include it). On the pod side: kubectl describe pod | grep Image: shows the image, and crane manifest <image> | jq '.manifests[].platform' (or Docker Hub’s “OS/ARCH” tab) shows what archs the image actually publishes. If they don’t intersect, no amount of restarts will help.
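The intersection check itself is a one-line membership test. This is a sketch under assumptions: the function name image_supports_arch is made up, and it expects one published architecture per line on stdin, which is what the jq filter in the usage comment produces for a multi-arch manifest list.

```shell
# Succeeds only if the node architecture ($1) is among the
# architectures read from stdin (one per line).
image_supports_arch() {
    node_arch=$1
    grep -qxF -- "$node_arch"
}

# Usage (needs a cluster, crane and jq):
#   node_arch=$(kubectl get node <node-name> \
#       -o jsonpath='{.status.nodeInfo.architecture}')
#   crane manifest <image> \
#       | jq -r '.manifests[].platform.architecture' \
#       | image_supports_arch "$node_arch" || echo "no matching arch"
```

A single-arch image has no manifests array, so the jq filter errors out on it; that is itself a hint that the image was built for one platform only.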
When Pods Look Healthy But the App Returns 500
This is the trickiest category because Kubernetes itself reports nothing wrong. Pods are 2/2 Running. Probes are green. The Service routes traffic. But the user sees a 500 page.
Kubernetes can’t help here — the failure is inside the application. You need to read application logs and, often, run code in the app’s context.
Step 1: Filter the logs for errors
# Tail the container's stdout/stderr stream (works for any pod)
kubectl logs deploy/gitlab-webservice-default -c webservice -n gitlab \
--tail=500 | grep -iE "error|exception|fatal|500" | tail -40
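The -c webservice selects a specific container in a multi-container pod. Without it, kubectl falls back to the pod’s default container (the kubectl.kubernetes.io/default-container annotation if set, otherwise the first container), which may not be the one with the error.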
Step 2: Read the app’s own log files
Many real apps (GitLab, Rails apps in general, anything with structured logging to file) write to log files inside the container, not to stdout. To read them you need to exec in:
# Read the actual Rails production.log inside the running container
kubectl exec deploy/gitlab-webservice-default -c webservice -n gitlab -- \
tail -n 200 /var/log/gitlab/production.log
You’ll get a real backtrace pointing at the actual broken line in the app. That’s a different signal than the access log — the access log shows that a request returned 500; the production.log shows why.
Step 3: Reproduce in a console inside the cluster
For Rails apps the cleanest move is to run the failing code path directly:
kubectl exec -it deploy/gitlab-toolbox -n gitlab -- gitlab-rails runner '
  begin
    u = Users::Internal.duo_code_review_bot
    puts "OK: #{u.inspect}"
  rescue => e
    puts "FAILED: #{e.class}: #{e.message}"
    puts e.record.errors.full_messages if e.respond_to?(:record)
  end
'
This bypasses the web layer entirely. If the call fails in the console too, the bug is in the app code or the database. If it works in the console but not via HTTP, the bug is in middleware, authentication, or session state.
When the Ingress Returns 404 for Your Hostname
You configured a tunnel or load balancer, the cert is valid, the pods are healthy — and the URL returns 404 from the cluster’s ingress controller. This is almost always a Host-header mismatch.
# What hostnames does the ingress controller actually know about?
kubectl get ingress -n gitlab
NAME                CLASS     HOSTS                ADDRESS   PORTS     AGE
gitlab-webservice   traefik   gitlab.example.com   ...       80,443    5h
The HOSTS column is the source of truth. If the tunnel or LB is sending Host: gitlabee.example.com and the Ingress is configured for gitlab.example.com, you get a 404 — the ingress controller has no rule that matches.
Test it from the node, bypassing DNS and TLS
The most reliable confirmation is a curl against the ingress controller directly with an explicit Host header:
# This 404s when the Ingress doesn't know the hostname
curl -ksI -H "Host: gitlabee.example.com" https://127.0.0.1
# This succeeds when you use the correct hostname
curl -ksI -H "Host: gitlab.example.com" https://127.0.0.1
If the second one returns 200/302 and the first 404s, the upstream (tunnel, load balancer, external DNS) is sending a hostname the ingress controller doesn’t know about. Fix the chart values or the Ingress, not the network.
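When there are several candidate hostnames, the same curl test can be looped. A small sketch, assuming the ingress controller listens on the URL you pass in; the function name probe_hosts is hypothetical.

```shell
# Print the HTTP status the ingress controller returns for each hostname.
# Usage: probe_hosts <ingress-url> <host>...
probe_hosts() {
    url=$1
    shift
    for host in "$@"; do
        # -k: skip certificate verification, -s: quiet,
        # -o /dev/null: drop the body, -w: print only the status code
        code=$(curl -ks -o /dev/null -w '%{http_code}' -H "Host: $host" "$url")
        echo "$host $code"
    done
}

# Usage (run on the node):
#   probe_hosts https://127.0.0.1 gitlabee.example.com gitlab.example.com
```

Any hostname that comes back 404 while another returns 200 or 302 is one the Ingress has no rule for.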
When cert-manager Won’t Issue a Certificate
# Top-level status
kubectl get certificate -n gitlab
# NAME READY SECRET AGE
# gitlab-tls False gitlab-tls 3m
# The full chain — Certificate → CertificateRequest → Order → Challenge
kubectl describe certificate gitlab-tls -n gitlab
kubectl describe certificaterequest -n gitlab
kubectl describe order -n gitlab
kubectl describe challenge -A
READY=False for more than a couple of minutes means something is wrong in the chain. The chain goes Certificate → CertificateRequest → Order → Challenge, and the failure surfaces at the lowest level — usually the Challenge:
Status:
Reason: Waiting for DNS-01 challenge propagation: NS ns1.example.com returned
REFUSED for _acme-challenge.gitlab.example.com.
That error string tells you exactly what’s wrong — in this case the DNS-01 solver can’t write the TXT record because the API token’s zone scope doesn’t include the target domain.
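To see where the chain stops without typing four commands, the resource kinds can be listed in order. A minimal sketch; the function name cert_chain and the default namespace are assumptions.

```shell
# List every link of the issuance chain in one namespace, top to bottom.
# Usage: cert_chain [namespace]
cert_chain() {
    ns=${1:-gitlab}   # assumption: default to the namespace used in this post
    for kind in certificate certificaterequest order challenge; do
        echo "== $kind =="
        kubectl get "$kind" -n "$ns"
    done
}
```

The lowest kind that still lists a resource is usually the one worth running kubectl describe on.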
Let’s Encrypt’s production rate limits are surprisingly tight. While iterating, point your ClusterIssuer at acme-staging-v02.api.letsencrypt.org/directory so you don’t burn through the production quota on bad attempts. Switch back once issuance works end-to-end.
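For reference, a staging issuer might look roughly like the manifest printed below. Everything except the staging URL is an assumption for illustration: the issuer name, the email, the Secret names, and the choice of a Cloudflare DNS-01 solver are placeholders to adapt to your setup.

```shell
# Print a ClusterIssuer manifest that points at the Let's Encrypt staging API.
# Usage: render_staging_issuer [email] | kubectl apply -f -
render_staging_issuer() {
    email=${1:-admin@example.com}   # hypothetical contact address
    cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token   # hypothetical Secret name
              key: api-token
EOF
}
```

Certificates issued by staging are not trusted by browsers, which is expected; the point is only to confirm the chain completes.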
When External Access Is Broken (Tunnel or LB)
If you’re using Cloudflare Tunnel (or any out-of-cluster connector), the pod-side check is straightforward:
# Tail the cloudflared logs
kubectl -n gitlab logs deploy/cloudflared --tail=50
# expect: "Registered tunnel connection" lines, no repeating errors
If the tunnel pod is in CrashLoopBackOff, the usual suspects are: wrong tunnel UUID in the ConfigMap, credentials secret name/key mismatch, or the tunnel was deleted in the upstream dashboard while the credentials remained. The logs will say which.
To prove the cluster itself is not at fault when the external URL is failing, take the tunnel out of the path: scale it to zero and see if the LAN path still works:
# Drain the tunnel
kubectl -n gitlab scale deploy/cloudflared --replicas=0
# Test direct ingress from the LAN (DNS pointed at the node IP)
curl -kI https://gitlab.example.com
# Bring it back
kubectl -n gitlab scale deploy/cloudflared --replicas=1
If the LAN path works with cloudflared at zero replicas, the tunnel is the only thing actually broken — not the cluster. Focus there.
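Done by hand, the scale-up step is easy to forget. Here is a sketch that always restores the tunnel and reports the curl result; the function name lan_check_without_tunnel is hypothetical, and it assumes the Deployment normally runs one replica.

```shell
# Scale the tunnel to zero, test the LAN path, then bring the tunnel back.
# Returns the exit status of the curl check.
# Usage: lan_check_without_tunnel <namespace> <deployment> <url>
lan_check_without_tunnel() {
    ns=$1
    deploy=$2
    url=$3

    kubectl -n "$ns" scale "deploy/$deploy" --replicas=0 || return 1

    curl -kI --max-time 10 "$url"
    rc=$?

    # Restore the tunnel whether or not the check succeeded.
    kubectl -n "$ns" scale "deploy/$deploy" --replicas=1
    return "$rc"
}

# Usage:
#   lan_check_without_tunnel gitlab cloudflared https://gitlab.example.com
```

A zero exit status means the LAN path answered with the tunnel out of the picture.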
Operational Commands: Scale, Restart, Replay
These are the ones you’ll type 100 times. Worth committing to memory.
# Restart all pods in a Deployment, picking up new ConfigMap / Secret values
kubectl -n gitlab rollout restart deploy/gitlab-webservice-default
# Drain a Deployment for testing (and bring it back)
kubectl -n gitlab scale deploy/cloudflared --replicas=0
kubectl -n gitlab scale deploy/cloudflared --replicas=1
# Re-run a Job that succeeded once (delete the Job; the chart recreates it on the next helm upgrade)
kubectl -n gitlab delete job <job-name>
# Watch status changes live
kubectl -n gitlab get pods -w
# Show resource usage (requires metrics-server)
kubectl top pods -n gitlab
kubectl top nodes
kubectl rollout restart does a rolling update: pods are replaced gradually, and old ones are terminated as new ones become Ready, within the limits set by the Deployment’s maxSurge and maxUnavailable. kubectl delete pod <name> kills a pod immediately and the Deployment respawns it. Use restart for config reloads; use delete only when you actually want to evict a specific stuck pod.
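A restart is only useful once you know it finished. A short sketch that pairs the restart with kubectl rollout status; the function name restart_and_wait and the 300 second timeout are assumptions.

```shell
# Restart a Deployment and block until the rollout completes or times out.
# Usage: restart_and_wait <namespace> <deployment>
restart_and_wait() {
    ns=$1
    deploy=$2
    kubectl -n "$ns" rollout restart "deploy/$deploy" &&
        kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=300s
}

# Usage:
#   restart_and_wait gitlab gitlab-webservice-default
```

A non-zero exit status means the new pods never became Ready in time, which sends you back to get, describe, logs.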
Reading Secrets Inline
You’ll do this constantly when bootstrapping a chart:
# Decode a Secret into something readable
kubectl -n gitlab get secret gitlab-gitlab-initial-root-password \
-o jsonpath='{.data.password}' | base64 -d ; echo
# Create one from a literal (e.g. for an initial root password)
kubectl create secret generic gitlab-initial-root-password \
--namespace gitlab \
--from-literal=password='ChangeMe-Strong-Password!'
# Create one from a file (e.g. cloudflared tunnel credentials)
kubectl -n gitlab create secret generic cloudflared-credentials \
--from-file=credentials.json=$HOME/.cloudflared/<tunnel-uuid>.json
The ; echo at the end of the decode is there because base64-decoded output has no trailing newline; without it, the shell prompt is printed on the same line, directly after the secret, which makes the value easy to misread or mis-copy.
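Since the decode gets typed so often, it is a natural candidate for a helper. A minimal sketch; the function name secret_value is an assumption.

```shell
# Print one decoded key of a Secret, followed by a newline.
# Usage: secret_value <namespace> <secret-name> <key>
secret_value() {
    ns=$1
    name=$2
    key=$3
    kubectl -n "$ns" get secret "$name" -o "jsonpath={.data.${key}}" | base64 -d
    echo   # base64 -d prints no trailing newline
}

# Usage:
#   secret_value gitlab gitlab-gitlab-initial-root-password password
# Keys containing dots must be escaped for jsonpath, for example:
#   secret_value gitlab cloudflared-credentials 'credentials\.json'
```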
Read a ConfigMap to See What the Chart Built
When a Helm chart produces something you didn’t expect — wrong hostname, missing flag, weird env var — read the rendered ConfigMap directly:
kubectl -n gitlab get cm gitlab-webservice -o yaml
kubectl -n gitlab get cm cloudflared-config -o yaml
This is the actual source of truth for what the container reads at startup. If the chart values look correct but the behavior is wrong, the ConfigMap is what proves which side has the bug.
Log Signatures and What They Actually Mean
The patterns that come up over and over in homelab K8s debugging:
| Signature | Likely cause |
|---|---|
| Exit 255 + Started == Finished + empty logs | Architecture mismatch (amd64 image on ARM64, or vice versa) |
| exec format error in pod events | Same as above, but the kernel said it out loud |
| OOMKilled in kubectl describe pod | Container exceeded its memory limit (raise the limit or fix the leak) |
| Liveness probe failed repeatedly | App slow to start; raise initialDelaySeconds or fix the probe path |
| BackOff pulling image | Image name typo, missing imagePullSecret, or registry unreachable |
| MountVolume.SetUp failed | PVC not bound, or a Secret/ConfigMap reference doesn’t exist |
| FailedScheduling: insufficient memory | Node doesn’t have room for the requested resources |
| 403 SignatureDoesNotMatch (in S3-style API logs) | Credentials drift between two consumers of the same store |
| App returns 500 but kubectl get pods shows 2/2 Running | Not a Kubernetes problem: read the app log inside the container |
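Several of these signatures are plain strings, so a first pass over describe or event output can be automated. The sketch below is a rough heuristic: the function name classify_signatures is hypothetical, the patterns are my reading of the table rather than an exhaustive list, and each cause is printed once.

```shell
# Read describe/event/log text on stdin and print a likely cause
# for each known signature found.
classify_signatures() {
    awk '
        function report(cause) {
            if (!(cause in seen)) { seen[cause] = 1; print cause }
        }
        /exec format error/           { report("architecture mismatch") }
        /OOMKilled/                   { report("memory limit exceeded") }
        /Liveness probe failed/       { report("slow start or wrong probe path") }
        /Back-?[Oo]ff pulling image/  { report("image name, pull secret, or registry") }
        /MountVolume\.SetUp failed/   { report("PVC not bound or missing Secret/ConfigMap") }
        /[Ii]nsufficient memory/      { report("node has no room for the request") }
        /SignatureDoesNotMatch/       { report("S3 credentials drift") }
    '
}

# Usage (needs a cluster):
#   kubectl describe pod <pod-name> -n gitlab | classify_signatures
```

No output means none of the string signatures matched, not that the pod is healthy.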
Quick Reference
# Discovery
kubectl get pods -n NS
kubectl get pods -n NS -w
kubectl get pods -n NS -l app=foo
# Diagnosis
kubectl describe pod POD -n NS
kubectl logs POD -n NS --previous
kubectl logs deploy/DEPLOY -c CONTAINER -n NS --tail=200
# Exec / introspection
kubectl exec -it POD -n NS -- bash
kubectl exec deploy/DEPLOY -n NS -- tail -n 200 /var/log/app/prod.log
# Operations
kubectl rollout restart deploy/DEPLOY -n NS
kubectl scale deploy/DEPLOY -n NS --replicas=N
kubectl delete job JOB -n NS
# Networking
kubectl get ingress -n NS
If you remember nothing else, remember the first triple: get then describe then logs --previous. Almost every diagnostic session in this post started there.
Next Steps
- Self-Host GitLab EE on K3s with Cloudflare Tunnel. The homelab setup that produced most of the commands in this post.
- Fix Slow Self-Hosted GitLab: The RAM Trap. When kubectl top pods shows trouble, this is what to do next.
- Self-Host Harbor Image Registry on Debian. Pairs nicely if your K8s debugging keeps surfacing image-related problems.