subuilds.dev

Debugging Kubernetes by Symptom: The Kubectl Commands I Reach For

When something breaks in a Kubernetes cluster, the difference between fixing it in 5 minutes and 5 hours is knowing which command to run first. Not which command exists — there are hundreds — but which one maps to the symptom in front of you.

This is the symptom-driven cheatsheet I built from a single afternoon of debugging GitLab on K3s on ARM64. Every command and log signature here was used to find a real bug. Memorize the patterns, not the flags.

What You’ll Cover

  1. The default workflow: get → describe → logs
  2. When a pod won’t start (and the empty-log arch trap)
  3. When pods look healthy but the app returns 500
  4. When the ingress returns 404 for your hostname
  5. When cert-manager won’t issue a certificate
  6. When external access (LB or tunnel) is broken
  7. Operational commands: scale, restart, replay
  8. Log signatures and what they actually mean

The Default Workflow: get → describe → logs

Almost every Kubernetes debugging session starts with the same three commands, in this order. Skipping ahead wastes time.

# 1. What's broken?
kubectl get pods -n gitlab

# 2. Why is it broken? (events, image, conditions, mounts)
kubectl describe pod <pod-name> -n gitlab

# 3. What did it say before it died?
kubectl logs <pod-name> -n gitlab --previous

get shows status. describe shows the why: the Events section at the bottom of the output reports image pulls, scheduling failures, OOM kills, probe failures, and volume mount problems. logs --previous is critical for crash loops because the current container has usually just been restarted and has logged little or nothing yet; the failure is in the previous instance.
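The three steps can be wrapped in one helper so the order never gets skipped. This is a minimal sketch, assuming a POSIX shell; the function name triage and the tail sizes are my own choices, not anything kubectl ships with.

```shell
# triage NAMESPACE POD: run get, describe, logs --previous in that order.
triage() {
  ns=$1; pod=$2
  echo "== get =="
  kubectl get pod "$pod" -n "$ns" -o wide
  echo "== describe (tail, where Events live) =="
  kubectl describe pod "$pod" -n "$ns" | tail -n 20
  echo "== logs --previous =="
  # --previous errors out if the container has never restarted,
  # so fall back to the current container's logs in that case.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null \
    || kubectl logs "$pod" -n "$ns" --tail=50
}
```

Called as triage gitlab <pod-name>, it prints the three views back to back, which is usually enough to decide which section of this post applies.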

Pod names change when Jobs restart

For Jobs in CrashLoopBackOff, the pod name often changes between checks. If kubectl logs pod-xyz --previous returns pods "pod-xyz" not found, list pods again — the Job has spawned a new one. Or skip the name lookup entirely with kubectl logs -l job-name=<job> and kubectl describe pod -l job-name=<job>.
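The label-selector approach can be sketched as a small helper. The function name job_debug is hypothetical; the job-name label is the one the Job controller puts on its pods. One detail worth knowing: when a selector is used, kubectl logs shows only the last 10 lines unless --tail is set.

```shell
# job_debug NAMESPACE JOB: inspect whichever pod(s) the Job currently owns,
# without ever typing a pod name.
job_debug() {
  ns=$1; job=$2
  # Oldest first, so the newest retry is the last row.
  kubectl get pods -n "$ns" -l "job-name=$job" --sort-by=.metadata.creationTimestamp
  kubectl describe pod -n "$ns" -l "job-name=$job"
  # --prefix labels each line with its pod and container name.
  kubectl logs -n "$ns" -l "job-name=$job" --tail=100 --prefix=true
}
```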

When a Pod Won’t Start

kubectl get pods shows you a status. Each one points at a different category of problem.

| Status | First command to run | Likely cause |
| --- | --- | --- |
| Pending | kubectl describe pod (look at Events) | No node fits: resources, taints, PVC not bound |
| ImagePullBackOff / ErrImagePull | kubectl describe pod (look at Image:) | Wrong image name/tag, registry auth missing, network |
| CrashLoopBackOff (with logs) | kubectl logs --previous | App startup error; read the log |
| CrashLoopBackOff (empty logs) | kubectl describe pod (look at Exit Code + Started/Finished) | Architecture mismatch, missing binary, exec failure |
| 0/2 Running | kubectl describe pod (look at Containers + Conditions) | One container in the pod is not Ready: probe failing, init still running |

The most expensive of these to diagnose is the empty-logs CrashLoopBackOff, because the obvious next move (logs) gives you nothing. The clue lives in describe.

The Empty-Log Arch Trap

Here is the actual output from a real bug — a bucket-create Job stuck in a loop:

Containers:
  minio-mc:
    Image:         registry.gitlab.com/.../minio/mc:RELEASE.2018-07-13T00-53-22Z
    State:         Terminated
      Reason:      Error
      Exit Code:   255
      Started:     Wed, 20 May 2026 12:34:50 +0700
      Finished:    Wed, 20 May 2026 12:34:50 +0700
    Last State:    Terminated
      Reason:      Error
      Exit Code:   255
    Restart Count: 3

Three signals together tell the whole story:

  • Exit Code 255 — generic catch-all, often used for “exec failed at the kernel level”
  • Started and Finished in the same second — the binary didn’t even get to do anything
  • kubectl logs --previous returns unable to retrieve container logs — there are no logs because the process never wrote anything to stdout

This is the canonical signature of an architecture mismatch — usually an amd64-only image running on an ARM64 node. The kernel rejects the ELF binary before any code runs. The image tag from 2018 is a strong second clue: that’s old enough to predate widely-published multi-arch manifests.
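The first two signals can be checked mechanically. Below is a rough sketch that reads kubectl describe pod output on stdin and flags any container state with exit code 255 and identical Started and Finished timestamps. The function name same_second_exit is made up, and the parsing assumes the field layout shown in the output above.

```shell
# Reads `kubectl describe pod` output on stdin.
# Exit status 0 if a suspicious state was found, 1 otherwise.
# Usage: kubectl describe pod POD -n NS | same_second_exit
same_second_exit() {
  awk '
    /^[ \t]*Exit Code:/ { code = $3 + 0 }
    /^[ \t]*Started:/   { sub(/^[ \t]*Started:[ \t]*/, ""); started = $0; next }
    /^[ \t]*Finished:/  {
      sub(/^[ \t]*Finished:[ \t]*/, "")
      if (code == 255 && started == $0) {
        print "exit 255, started and finished at " $0 ": check image arch"
        found = 1
      }
    }
    END { exit (found ? 0 : 1) }
  '
}
```

A hit is a hint, not proof: a missing binary or a bad entrypoint can produce the same pattern, so confirm against the image's published platforms before changing anything.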

The fix is to override the image to a multi-arch tag and force the Job to re-run:

# After patching the chart values:
kubectl -n gitlab delete job -l component=create-buckets
kubectl -n gitlab get pods -l component=create-buckets -w
# expect: Completed (not CrashLoopBackOff)

Verify the node arch matches the image

On the node side: kubectl get nodes -L kubernetes.io/arch adds a column with each node’s architecture (the default and -o wide views don’t show it directly). On the pod side: kubectl describe pod | grep Image: shows the image, and crane manifest <image> | jq '.manifests[].platform' (or Docker Hub’s “OS/ARCH” tab) shows what archs the image actually publishes. If they don’t intersect, no amount of restarts will help.
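The intersection check can be scripted. This is a sketch under a few assumptions: crane and jq are installed, the image is a multi-arch index (a single-arch image has no .manifests array, so jq will error), and the helper names node_archs, image_archs, and missing_archs are hypothetical.

```shell
# Architectures of all nodes, one per line, de-duplicated.
node_archs() {
  kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}' \
    | tr ' ' '\n' | sort -u
}

# Architectures an image publishes (multi-arch index only).
image_archs() {
  crane manifest "$1" | jq -r '.manifests[].platform.architecture' | sort -u
}

# Print every node arch the image does not publish; non-zero exit if any.
# Usage: missing_archs "$(node_archs)" "$(image_archs IMAGE)"
missing_archs() {
  missing=0
  for arch in $1; do
    # $2 is deliberately unquoted so each arch lands on its own line.
    if ! printf '%s\n' $2 | grep -qx "$arch"; then
      echo "no image for node arch: $arch"
      missing=1
    fi
  done
  return "$missing"
}
```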

When Pods Look Healthy But the App Returns 500

This is the trickiest category because Kubernetes itself reports nothing wrong. Pods are 2/2 Running. Probes are green. The Service routes traffic. But the user sees a 500 page.

Kubernetes can’t help here — the failure is inside the application. You need to read application logs and, often, run code in the app’s context.

Step 1: Filter the logs for errors

# Tail the container's stdout/stderr stream (works for any pod)
kubectl logs deploy/gitlab-webservice-default -c webservice -n gitlab \
  --tail=500 | grep -iE "error|exception|fatal|500" | tail -40

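The -c webservice selects a specific container in a multi-container pod. Without it, kubectl falls back to the pod’s default container (the one named by the kubectl.kubernetes.io/default-container annotation if set, otherwise the first container), which may not be the one with the error.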
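When it isn't obvious which container holds the error, one option is to pull logs from every container at once and let the prefix say where each line came from. A small sketch; the function name errors_all_containers and the tail sizes are arbitrary.

```shell
# errors_all_containers NAMESPACE TARGET
# TARGET is a pod name or deploy/NAME (which resolves to one pod of the Deployment).
errors_all_containers() {
  ns=$1; target=$2
  # --prefix tags each line with [pod/NAME/CONTAINER] so the source is visible.
  kubectl logs "$target" -n "$ns" --all-containers=true --prefix=true --tail=500 \
    | grep -iE 'error|exception|fatal' | tail -n 40
}
```

Once the prefix shows which container is noisy, go back to -c with that container name for the full context around the error.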

Step 2: Read the app’s own log files

Many real apps (GitLab, Rails apps in general, anything with structured logging to file) write to log files inside the container, not to stdout. To read them you need to exec in:

# Read the actual Rails production.log inside the running container
kubectl exec deploy/gitlab-webservice-default -c webservice -n gitlab -- \
  tail -n 200 /var/log/gitlab/production.log

You’ll get a real backtrace pointing at the actual broken line in the app. That’s a different signal than the access log — the access log shows that a request returned 500; the production.log shows why.

Step 3: Reproduce in a console inside the cluster

For Rails apps the cleanest move is to run the failing code path directly:

kubectl exec -it deploy/gitlab-toolbox -n gitlab -- gitlab-rails runner '
  begin
    u = Users::Internal.duo_code_review_bot
    puts "OK: #{u.inspect}"
  rescue => e
    puts "FAILED: #{e.class}: #{e.message}"
    puts e.record.errors.full_messages if e.respond_to?(:record)
  end
'

This bypasses the web layer entirely. If the call fails in the console too, the bug is in the app code or the database. If it works in the console but not via HTTP, the bug is in middleware, authentication, or session state.

When the Ingress Returns 404 for Your Hostname

You configured a tunnel or load balancer, the cert is valid, the pods are healthy — and the URL returns 404 from the cluster’s ingress controller. This is almost always a Host-header mismatch.

# What hostnames does the ingress controller actually know about?
kubectl get ingress -n gitlab
# NAME              CLASS     HOSTS                  ADDRESS  PORTS  AGE
# gitlab-webservice traefik   gitlab.example.com     ...      80,443 5h

The HOSTS column is the source of truth. If the tunnel or LB is sending Host: gitlabee.example.com and the Ingress is configured for gitlab.example.com, you get a 404 — the ingress controller has no rule that matches.

Test it from the node, bypassing DNS and TLS

The most reliable confirmation is a curl against the ingress controller directly with an explicit Host header:

# This 404s when the Ingress doesn't know the hostname
curl -ksI -H "Host: gitlabee.example.com" https://127.0.0.1

# This succeeds when you use the correct hostname
curl -ksI -H "Host: gitlab.example.com"   https://127.0.0.1

If the second one returns 200/302 and the first 404s, the upstream (tunnel, load balancer, external DNS) is sending a hostname the ingress controller doesn’t know about. Fix the chart values or the Ingress, not the network.
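The same comparison can be run over every hostname the Ingress knows plus the one the upstream is sending. This is a sketch with hypothetical helper names; it uses curl's --resolve option, which pins the hostname to an IP so that the name is used for both the TLS SNI and the Host header.

```shell
# Hostnames the Ingress objects in a namespace actually route, one per line.
ingress_hosts() {
  kubectl get ingress -n "$1" -o jsonpath='{.items[*].spec.rules[*].host}' | tr ' ' '\n'
}

# probe_host HOST [IP]: request HOST via IP (default: the local node)
# and print the HTTP status code.
probe_host() {
  host=$1; ip=${2:-127.0.0.1}
  code=$(curl -ks -o /dev/null -w '%{http_code}' \
           --resolve "$host:443:$ip" "https://$host/")
  echo "$host -> $code"
}

# Example loop (run on a node):
#   for h in $(ingress_hosts gitlab) gitlabee.example.com; do probe_host "$h"; done
```

Any hostname that comes back 404 while its neighbours return 200 or 302 is one the ingress controller has no rule for.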

When cert-manager Won’t Issue a Certificate

# Top-level status
kubectl get certificate -n gitlab
# NAME         READY   SECRET       AGE
# gitlab-tls   False   gitlab-tls   3m

# The full chain: Certificate → CertificateRequest → Order → Challenge
kubectl describe certificate gitlab-tls -n gitlab
kubectl describe certificaterequest -n gitlab
kubectl describe order -n gitlab
kubectl describe challenge -A

READY=False for more than a couple of minutes means something is wrong in the chain. The chain goes Certificate → CertificateRequest → Order → Challenge, and the failure surfaces at the lowest level — usually the Challenge:

Status:
  Reason:  Waiting for DNS-01 challenge propagation: NS ns1.example.com returned
           REFUSED for _acme-challenge.gitlab.example.com.

That error string points at the failing step. In this case the TXT record was never created: the DNS-01 solver couldn’t write it because the API token’s zone scope didn’t include the target domain.
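To see the whole chain in one pass instead of four describe calls, something like the following works. It is a sketch: the function name cert_chain is made up, and the jsonpath assumes the Challenge resource exposes status.state and status.reason, as it does in the cert-manager versions I have used.

```shell
# cert_chain NAMESPACE: list every object in the issuance chain, then the
# state and reason of each Challenge, where the root cause usually appears.
cert_chain() {
  ns=$1
  kubectl get certificate,certificaterequest,order,challenge -n "$ns"
  kubectl get challenge -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\t"}{.status.reason}{"\n"}{end}'
}
```

An empty Challenge list is not necessarily a problem: cert-manager cleans up Challenges once they complete, so they mostly exist while issuance is in progress or stuck.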

Use the staging endpoint while debugging

Let’s Encrypt’s production rate limits are strict, and there is a separate cap on failed validations. While iterating, point your ClusterIssuer at https://acme-staging-v02.api.letsencrypt.org/directory so you don’t burn through the production quota on bad attempts. Staging certificates are not trusted by browsers, so switch back to production once issuance works end-to-end.

When External Access Is Broken (Tunnel or LB)

If you’re using Cloudflare Tunnel (or any out-of-cluster connector), the pod-side check is straightforward:

# Tail the cloudflared logs
kubectl -n gitlab logs deploy/cloudflared --tail=50
# expect: "Registered tunnel connection" lines, no repeating errors

If the tunnel pod is in CrashLoopBackOff, the usual suspects are: wrong tunnel UUID in the ConfigMap, credentials secret name/key mismatch, or the tunnel was deleted in the upstream dashboard while the credentials remained. The logs will say which.

To prove the tunnel pod itself is not at fault when the external URL is failing, scale it to zero and see if the LAN path still works:

# Drain the tunnel
kubectl -n gitlab scale deploy/cloudflared --replicas=0

# Test direct ingress from the LAN (DNS pointed at the node IP)
curl -kI https://gitlab.example.com

# Bring it back
kubectl -n gitlab scale deploy/cloudflared --replicas=1

If the LAN path works with cloudflared at zero replicas, the tunnel is the only thing actually broken — not the cluster. Focus there.
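Because it is easy to forget the scale-back step, the isolation test can be wrapped so the tunnel is always restored. A sketch only: the function name is hypothetical, it hardcodes deploy/cloudflared and a single replica as in the commands above, and it assumes DNS for the URL already points at the node IP.

```shell
# lan_check_without_tunnel NAMESPACE URL
# Scales the tunnel to zero, tests the LAN path, then restores the tunnel
# whether or not the request succeeded. Returns curl's exit status.
lan_check_without_tunnel() {
  ns=$1; url=$2
  kubectl -n "$ns" scale deploy/cloudflared --replicas=0
  rc=0
  curl -kI --max-time 10 "$url" || rc=$?
  kubectl -n "$ns" scale deploy/cloudflared --replicas=1
  return "$rc"
}
```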

Operational Commands: Scale, Restart, Replay

These are the ones you’ll type 100 times. Worth committing to memory.

# Restart all pods in a Deployment, picking up new ConfigMap / Secret values
kubectl -n gitlab rollout restart deploy/gitlab-webservice-default

# Drain a Deployment for testing (and bring it back)
kubectl -n gitlab scale deploy/cloudflared --replicas=0
kubectl -n gitlab scale deploy/cloudflared --replicas=1

# Re-run a Job that already ran (delete the Job; the chart recreates it on the next helm upgrade)
kubectl -n gitlab delete job <job-name>

# Watch status changes live
kubectl -n gitlab get pods -w

# Show resource usage (requires metrics-server)
kubectl top pods -n gitlab
kubectl top nodes

rollout restart is not the same as delete

kubectl rollout restart triggers a rolling update: pods are replaced in batches sized by the Deployment’s maxSurge and maxUnavailable settings (25% each by default), so enough pods stay available to serve traffic while the new ones become Ready. kubectl delete pod <name> kills a pod immediately and the Deployment respawns it. Use restart for config reloads; use delete only when you actually want to evict a specific stuck pod.
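A restart is only useful once it has finished, so it helps to pair it with rollout status. A minimal sketch; the function name restart_and_wait and the five-minute timeout are arbitrary choices.

```shell
# restart_and_wait NAMESPACE DEPLOYMENT
restart_and_wait() {
  ns=$1; deploy=$2
  # Show the rolling-update settings that control how many pods move at once.
  kubectl -n "$ns" get "deploy/$deploy" -o jsonpath='{.spec.strategy.rollingUpdate}'
  echo
  kubectl -n "$ns" rollout restart "deploy/$deploy"
  # Block until the rollout completes, or fail after 5 minutes.
  kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=300s
}
```

If rollout status times out, the new pods are not becoming Ready, and the get, describe, logs workflow applies to them.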

Reading Secrets Inline

You’ll do this constantly when bootstrapping a chart:

# Decode a Secret into something readable
kubectl -n gitlab get secret gitlab-gitlab-initial-root-password \
  -o jsonpath='{.data.password}' | base64 -d ; echo

# Create one from a literal (e.g. for an initial root password)
kubectl create secret generic gitlab-initial-root-password \
  --namespace gitlab \
  --from-literal=password='ChangeMe-Strong-Password!'

# Create one from a file (e.g. cloudflared tunnel credentials)
kubectl -n gitlab create secret generic cloudflared-credentials \
  --from-file=credentials.json=$HOME/.cloudflared/<tunnel-uuid>.json

The ; echo at the end of the decode is there because base64-decoded output has no trailing newline; without it, the next shell prompt is printed on the same line, directly after the secret.
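When a Secret has several keys, decoding them one jsonpath at a time gets tedious. kubectl's go-template output supports a base64decode function, so all keys can be printed in one call. A sketch; the function name secret_dump is hypothetical.

```shell
# secret_dump NAMESPACE NAME: print every key of a Secret, base64-decoded.
secret_dump() {
  kubectl -n "$1" get secret "$2" \
    -o go-template='{{range $k, $v := .data}}{{$k}}: {{$v | base64decode}}{{"\n"}}{{end}}'
}
```

This prints credentials in clear text, so keep it out of shared terminals and CI logs.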

Read a ConfigMap to See What the Chart Built

When a Helm chart produces something you didn’t expect — wrong hostname, missing flag, weird env var — read the rendered ConfigMap directly:

kubectl -n gitlab get cm gitlab-webservice -o yaml
kubectl -n gitlab get cm cloudflared-config -o yaml

This is the actual source of truth for what the container reads at startup. If the chart values look correct but the behavior is wrong, the ConfigMap is what proves which side has the bug.

Log Signatures and What They Actually Mean

The patterns that come up over and over in homelab K8s debugging:

| Signature | Likely cause |
| --- | --- |
| Exit 255 + Started == Finished + empty logs | Architecture mismatch (amd64 image on ARM64, or vice versa) |
| exec format error in pod events | Same as above, but the kernel said it out loud |
| OOMKilled in kubectl describe pod | Container exceeded its memory limit (raise the limit or fix the leak) |
| Liveness probe failed repeatedly | App slow to start; raise initialDelaySeconds or fix the probe path |
| BackOff pulling image | Image name typo, missing imagePullSecret, or registry unreachable |
| MountVolume.SetUp failed | PVC not bound, or a Secret/ConfigMap reference doesn’t exist |
| FailedScheduling: insufficient memory | Node doesn’t have room for the requested resources |
| 403 SignatureDoesNotMatch (in S3-style API logs) | Credentials drift between two consumers of the same store |
| App returns 500 but kubectl get pods shows 2/2 Running | Not a Kubernetes problem; read the app log inside the container |

Quick Reference

# Discovery
kubectl get pods -n NS
kubectl get pods -n NS -w
kubectl get pods -n NS -l app=foo

# Diagnosis
kubectl describe pod POD -n NS
kubectl logs POD -n NS --previous
kubectl logs deploy/DEPLOY -c CONTAINER -n NS --tail=200

# Exec / introspection
kubectl exec -it POD -n NS -- bash
kubectl exec deploy/DEPLOY -n NS -- tail -n 200 /var/log/app/prod.log

# Operations
kubectl rollout restart deploy/DEPLOY -n NS
kubectl scale deploy/DEPLOY -n NS --replicas=N
kubectl delete job JOB -n NS

# Networking
kubectl get ingress -n NS

If you remember nothing else, remember the first triple: get then describe then logs --previous. Almost every diagnostic session in this post started there.

Next Steps