Part 15: Observing and debugging: events, logs, metrics | CodeTrekNomad

Three main sources of information

Source	Tools
Events	`kubectl get events`, `kubectl describe` (scheduling, probes, mounts…)
Logs	`kubectl logs`, stern
Metrics	`kubectl top`, Prometheus / metrics-server

Events, the Kubernetes "journal"

Events are Kubernetes objects that are kept only briefly (1 hour by default) and record what happens to cluster objects: scheduling, image pulls, probe failures, OOM kills, volume mounts…

# Events in a namespace
kubectl get events -n production --sort-by='.lastTimestamp'

# Events for a specific pod
kubectl describe pod my-app    # "Events:" section at the end of the output

# Warnings only
kubectl get events --field-selector type=Warning

# All namespaces
kubectl get events -A --sort-by='.metadata.creationTimestamp'
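To see which problems dominate a noisy cluster, warning events can be tallied by reason. The sketch below assumes the reasons arrive one per line on stdin; the function name `count_event_reasons` is illustrative and not part of kubectl.

```shell
# Tally event reasons read from stdin (one reason per line) and
# print "Reason=count", most frequent first.
# count_event_reasons is a hypothetical helper name.
count_event_reasons() {
  sort | uniq -c | sort -rn | awk '{ print $2 "=" $1 }'
}

# Against a cluster (assumes kubectl access; not executed here):
#   kubectl get events -A --field-selector type=Warning --no-headers \
#     -o custom-columns=REASON:.reason | count_event_reasons

# Offline demo with sample reasons:
printf 'BackOff\nUnhealthy\nBackOff\nFailedMount\nBackOff\nUnhealthy\n' | count_event_reasons
```

A reason that keeps topping this list is usually the one to chase first with `kubectl describe`.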

Common events and what they mean

Event	Meaning	Action
`FailedScheduling`	No node can take the pod (most often insufficient resources)	Scale nodes, lower requests
`Pulling` / `Pulled`	Image is being pulled / has been pulled	Normal
`BackOff`	Container crashed and is waiting to restart	Check `kubectl logs --previous`
`Unhealthy`	Liveness/readiness probe failed	Check the probe config
`OOMKilling`	A process was OOM-killed (a container that exceeds its memory limit also shows `OOMKilled` in its status)	Raise the limit or fix the leak
`FailedMount`	Volume mount failed	Check that the PVC or Secret exists
`FailedCreate`	Controller could not create the pod	Check RBAC, quota, admission webhooks

Logs, the container's stdout/stderr

Basics

# Main container's logs
kubectl logs my-app

# Follow in real time
kubectl logs -f my-app

# Previous container (after a restart)
kubectl logs my-app --previous

# Limit the output
kubectl logs my-app --tail=100
kubectl logs my-app --since=30m

# Multi-container pod: pick a container
kubectl logs my-app -c sidecar

# All containers
kubectl logs my-app --all-containers

# Label selector (multiple pods)
kubectl logs -l app=my-api --max-log-requests=10

stern, multi-pod log tailing

# Install stern
# macOS: brew install stern
# go: go install github.com/stern/stern@latest

# Tail every pod matching the pattern
stern my-api

# Regex + namespace
stern "my-api-.*" -n production

# Only a specific container
stern my-api -c app

# JSON output (for piping into jq)
stern my-api -o json | jq '.message'

# Exclude container
stern my-api --exclude-container sidecar

stern = kubectl logs -f for many pods at once, with output colored per pod.

Log architecture


  flowchart TB
    Container[Container stdout/stderr] --> Kubelet[kubelet<br/>node]
    Kubelet --> File["/var/log/containers/<br/>&lt;pod&gt;_&lt;ns&gt;_&lt;container&gt;-&lt;id&gt;.log"]
    File --> Agent[Log agent: Fluent Bit / Vector / Fluentd<br/>DaemonSet]
    Agent --> Central[Centralized: Elasticsearch / Loki / CloudWatch]

kubectl logs reads the log file on the node, so it only sees the current log (files are rotated, by size with the kubelet's default settings). For older logs you need centralized logging.
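When reading files directly on a node, it helps to split a log file name back into its parts. The following is a minimal sketch that assumes the `<pod>_<ns>_<container>-<id>.log` layout shown above; `parse_container_log_name` and the sample path are hypothetical.

```shell
# Split a /var/log/containers file name into pod, namespace, container, id.
# Relies on pod and namespace names never containing "_" and on the
# container id never containing "-". Helper name is hypothetical.
parse_container_log_name() {
  local base="${1##*/}"            # drop the directory
  base="${base%.log}"              # drop the extension
  local pod="${base%%_*}"          # text before the first "_"
  local rest="${base#*_}"
  local ns="${rest%%_*}"           # text between the first and second "_"
  local tail="${rest#*_}"
  local container="${tail%-*}"     # everything before the last "-"
  local id="${tail##*-}"           # everything after the last "-"
  printf 'pod=%s namespace=%s container=%s id=%s\n' "$pod" "$ns" "$container" "$id"
}

# Sample path (made up for illustration):
parse_container_log_name /var/log/containers/my-app-7d9f_production_side-car-0123abcd.log
```

Log agents such as Fluent Bit derive the same fields from the file name to tag each record with its pod and namespace.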

kubectl debug, ephemeral containers

When the container uses a distroless image (no shell, no curl, no nslookup):

# Attach a debug container to a running pod
kubectl debug -it my-app --image=busybox:1.36 --target=app
# --target=app: share the process namespace with the "app" container

# Inside the debug container:
# - See the app container's processes (ps aux)
# - Reach the app's filesystem through /proc/<pid>/root (its volumes are not mounted into the debug container)
# - Run the tools the debug image ships (busybox has wget and nslookup; curl or tcpdump need a fuller image)

# Debug a copy of the pod with a different command (the image must contain /bin/sh)
kubectl debug my-app -it --copy-to=debug-pod --container=app -- /bin/sh

# Debug on a node
kubectl debug node/node-1 -it --image=busybox:1.36
# The host filesystem is mounted at /host
# ls /host/var/log/containers/

Requires: K8s 1.25+ (ephemeral containers GA).
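Before reaching for `kubectl debug`, a script can check the server version against the 1.25 requirement. This is a rough sketch: `ephemeral_containers_ga` is a hypothetical helper, and it assumes a version string in the usual `v<major>.<minor>.<patch>` form.

```shell
# Succeed if the given Kubernetes version is 1.25 or newer.
# Hypothetical helper; expects strings such as "v1.27.3" or "v1.27+".
ephemeral_containers_ga() {
  local v="${1#v}"                 # drop the leading "v"
  local major="${v%%.*}"
  local rest="${v#*.}"
  local minor="${rest%%.*}"
  minor="${minor%%[!0-9]*}"        # tolerate suffixes such as "27+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 25 ]; }
}

# The server version can be read from `kubectl version -o json`
# (field serverVersion.gitVersion); a fixed sample is used here.
if ephemeral_containers_ga "v1.24.9"; then
  echo "ephemeral containers are GA"
else
  echo "older than 1.25: check the feature gate"
fi
```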

Metrics, real-time resource usage

metrics-server

# Install metrics-server (if not already present)
# kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# CPU + memory per pod
kubectl top pods
kubectl top pods -n production --sort-by=cpu

# Per container
kubectl top pods --containers

# Per node
kubectl top nodes

kubectl top gives a snapshot of the current moment. For historical data you need Prometheus + Grafana.
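The snapshot becomes easier to act on when filtered. The sketch below lists pods above a memory threshold; it assumes single-namespace `kubectl top pods --no-headers` output (name, CPU, memory) with memory reported in Mi, and `pods_over_memory` is a hypothetical name.

```shell
# Print the names of pods whose memory usage exceeds a threshold in Mi.
# Input: lines of "NAME CPU MEMORY" on stdin, memory like "340Mi".
# Hypothetical helper; values in other units are not converted.
pods_over_memory() {
  awk -v limit="$1" '{
    mem = $3
    sub(/Mi$/, "", mem)            # strip the unit suffix
    if (mem + 0 > limit + 0) print $1
  }'
}

# Against a cluster (not executed here):
#   kubectl top pods -n production --no-headers | pods_over_memory 256

# Offline demo with sample rows:
printf 'api-1 12m 340Mi\napi-2 5m 90Mi\n' | pods_over_memory 256
```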

Prometheus metrics quan trọng

# CPU usage
rate(container_cpu_usage_seconds_total{pod="my-app"}[5m])

# Memory working set
container_memory_working_set_bytes{pod="my-app"}

# CPU throttling
rate(container_cpu_cfs_throttled_seconds_total{pod="my-app"}[5m])

# Pod restart count
kube_pod_container_status_restarts_total{pod="my-app"}

# Pod ready status
kube_pod_status_ready{pod="my-app", condition="true"}

Debug playbook by symptom

Pod Pending

kubectl describe pod <name>
# Look at Events:

# "Insufficient cpu/memory" → scale nodes or lower requests
# "no nodes available" → taints/tolerations, wrong node selector
# "pod has unbound PersistentVolumeClaims" → PVC is Pending (Part 12)
# "0/N nodes are available: N node(s) had untolerated taint" → add a toleration

Pod CrashLoopBackOff

# 1. Events
kubectl describe pod <name>

# 2. Logs from the previous run
kubectl logs <name> --previous

# 3. Exit code
# 1 = application error
# 137 = SIGKILL (128+9): OOMKilled, or force-killed after the termination grace period
# 139 = SIGSEGV (128+11)
# 143 = SIGTERM (128+15): asked to stop, e.g. after a failed liveness probe

# 4. Empty logs → run it locally: docker run <image>
# 5. OOM → Part 13 (raise the limit or fix the leak)
# 6. Liveness kill → check the probe config (Part 05)
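Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137, 139 and 143 map to SIGKILL, SIGSEGV and SIGTERM. A small sketch of that decoding, with `explain_exit_code` as a hypothetical helper name:

```shell
# Translate a container exit code into a short hint.
# Codes above 128 are read as 128 + signal number.
# explain_exit_code is a hypothetical helper, not a kubectl feature.
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed successfully"
  elif [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: SIGKILL (OOMKilled or force-killed)" ;;
      11) echo "exit $code: SIGSEGV (segmentation fault)" ;;
      15) echo "exit $code: SIGTERM (asked to stop)" ;;
      *)  echo "exit $code: killed by signal $sig" ;;
    esac
  else
    echo "exit $code: application error"
  fi
}

explain_exit_code 137
explain_exit_code 1
```

The code itself can be read with the jsonpath query on `.status.containerStatuses[0].lastState.terminated.exitCode`.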

Pod Running but not receiving traffic

# 1. Service endpoints
kubectl get endpoints <service>
# Empty? → label mismatch or failing readiness probe

# 2. Pod Ready?
kubectl get pods
# READY 0/1 → readiness probe failing → describe pod

# 3. Service selector match?
kubectl get svc <service> -o yaml | grep -A5 selector
kubectl get pods --show-labels

# 4. Right port?
# Do the Service port/targetPort match the container port?

# 5. Blocked by a NetworkPolicy? (Part 10)
kubectl get networkpolicy

# 6. Test direct
kubectl port-forward pod/<name> 8080:8080
curl localhost:8080
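Step 3 comes down to a subset check: every key=value pair in the Service selector must appear among the pod's labels. The sketch below assumes both are already written as comma-separated `key=value` strings (the form `--show-labels` prints); `selector_matches` is a hypothetical helper.

```shell
# Succeed if every selector pair is present in the pod's labels.
# Both arguments are comma-separated key=value lists.
# Hypothetical helper; an empty selector trivially succeeds here.
selector_matches() {
  local selector="$1" labels=",$2,"
  local pair
  local IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;              # this pair is present
      *) return 1 ;;               # missing or different value
    esac
  done
  return 0
}

# Sample values (made up for illustration):
if selector_matches "app=api,tier=web" "app=api,pod-template-hash=5d4f"; then
  echo "selector matches"
else
  echo "label mismatch: the Service will have no endpoints for this pod"
fi
```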

Node NotReady

# 1. Node conditions
kubectl describe node <name> | grep -A10 Conditions

# 2. kubelet logs (on the node)
journalctl -u kubelet -n 100

# 3. Disk pressure? Memory pressure?
# DiskPressure → /var/lib/kubelet or /var/log is full
# MemoryPressure → too many pods, node running out of memory

# 4. Network? If the node loses connectivity, the kubelet cannot report its status
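Node conditions read in two directions: `Ready` should be `True`, while the pressure conditions should be `False`. A minimal sketch that prints only the conditions in a bad state, assuming "Type Status" pairs on stdin; `unhealthy_conditions` is a hypothetical name.

```shell
# Print node conditions that indicate a problem:
# Ready must be True; every other condition is a problem when True.
# Input: "Type Status" per line on stdin. Hypothetical helper.
unhealthy_conditions() {
  awk '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True") {
    print $1 "=" $2
  }'
}

# Against a cluster (not executed here):
#   kubectl get node <name> -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' \
#     | unhealthy_conditions

# Offline demo with sample conditions:
printf 'MemoryPressure False\nDiskPressure True\nPIDPressure False\nReady False\n' | unhealthy_conditions
```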

Structured debugging flow


  flowchart TB
    Symptom[Symptom] --> PodIssue{Pod issue?}
    PodIssue -->|Yes| Describe[kubectl describe pod → Events]
    Describe --> Logs[kubectl logs --previous]
    Logs --> Top[kubectl top pod → resource?]
    Top --> Debug[kubectl debug if needed]
    PodIssue -->|No| SvcIssue{Service/Network issue?}
    SvcIssue -->|Yes| Endpoints[kubectl get endpoints]
    Endpoints --> Exec[kubectl exec -- curl/nslookup]
    Exec --> NetPol[kubectl get networkpolicy]
    SvcIssue -->|No| NodeIssue{Node issue?}
    NodeIssue -->|Yes| DescNode[kubectl describe node]
    DescNode --> TopNode[kubectl top node]
    TopNode --> SSH[SSH → journalctl/dmesg]

Debug transcript: CrashLoopBackOff

Debugging CrashLoopBackOff takes three sources: events, the current log, and the log from the previous crash.

kubectl -n app get pod -l app=api
kubectl -n app describe pod api-abc123
kubectl -n app logs api-abc123 --previous
kubectl -n app get pod api-abc123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'; echo

If --previous shows a stack trace, the app really crashed. If the exit code is 137, suspect OOMKilled. If the log is empty and the events report a failed probe, go back to the probes part: the app may be alive but killed too early by the liveness probe.

What to remember when operating Kubernetes

Events → the first thing to check for any issue: scheduling, probes, OOM, mounts.
Logs → --previous for a container that has crashed; stern for multiple pods.
kubectl debug → when the container has no shell (distroless).
metrics-server + kubectl top → a resource snapshot; Prometheus for history.
Playbook: Pending → Events. CrashLoop → logs + exit code. No traffic → endpoints + readiness.

Frequently asked questions

Events disappear after 1 hour. Can they be kept longer?

Answer: Events have a default TTL of 1 hour (kube-apiserver --event-ttl). You can raise it, but it costs etcd storage. A better option is an event exporter (kubernetes-event-exporter) that ships events to Elasticsearch/Loki for long-term retention.

kubectl debug does not work: "ephemeral containers are disabled"?

Answer: Ephemeral containers have been GA since K8s 1.25 and enabled by default (beta) since 1.23. Older clusters need the EphemeralContainers feature gate enabled. Check the version with kubectl version.

Too many logs, and kubectl logs times out?

Answer: When the log file is large (the container writes a lot to stdout), limit the output with --tail=100. In production, use centralized logging (Loki, Elasticsearch) instead of kubectl logs. Containers should write structured JSON logs so they are easier to query.
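For a container whose entrypoint is a shell script, structured logging can be as small as the sketch below. It is an illustration only: `log_json` and the field names `ts`, `level`, `msg` are assumptions, and it escapes only backslashes and double quotes, so multi-line messages are not handled.

```shell
# Write one JSON log line to stdout.
# Field names (ts, level, msg) are assumptions for illustration.
# Only backslashes and double quotes in the message are escaped.
log_json() {
  local level="$1" msg
  msg="$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')"
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$msg"
}

log_json info "server started"
log_json error 'disk "full" on /data'
```

One JSON object per line keeps each record intact through the node log file and lets the central store filter on `level` without regex parsing.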

Next part (Stage VI): Helm, Kustomize, RBAC and the expansion map: packaging, permissions, and where to read next.