Part 5: Health probes: liveness, readiness, startup

Three types of probe

Probe	What it asks	Consequence of failure
Liveness	“Is the container still alive?”	kubelet restarts the container
Readiness	“Is the container ready to receive traffic?”	Pod is removed from Service endpoints (receives no traffic)
Startup	“Has the container finished starting?”	Liveness and readiness are held off until it passes; if it exhausts failureThreshold, kubelet restarts the container


  flowchart LR
    Start[Container start] --> Startup[Startup probe]
    Startup -->|pass| Liveness[Liveness + Readiness probe]

Liveness probe

Purpose: detect a container that is deadlocked or hung (the process is still running but no longer responds).

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15 # Wait before the first probe
  periodSeconds: 20 # Probe every 20s
  timeoutSeconds: 3 # 3s timeout per probe
  failureThreshold: 3 # 3 consecutive failures → restart
  successThreshold: 1 # 1 pass → considered healthy (must be 1 for liveness)

On failure: kubelet restarts the container (it does not delete the pod). The restart count goes up, and if liveness keeps failing the pod can end up in CrashLoopBackOff.

Common mistake #1: liveness probe checks a dependency

# ❌ WRONG: checking the database in liveness
livenessProbe:
  httpGet:
    path: /healthz # this endpoint checks the DB connection
    port: 8080
# Consequence: DB slows down → liveness fails → every pod restarts
# → every pod reconnects to the DB at once → DB overloaded → cascading failure

Rule: liveness only checks whether the container's own process is responsive. Dependency checks belong in readiness.
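One way to apply this rule is to split the two concerns across two endpoints. The fragment below is only a sketch: it assumes the app exposes /healthz answering 200 without touching the database and /ready performing the DB check. Both behaviours are assumptions about the application, not something Kubernetes provides.

```yaml
# Sketch of a container-spec fragment; endpoint behaviour is assumed.
livenessProbe:
  httpGet:
    path: /healthz   # assumed: returns 200 if the process can answer, no DB call
    port: 8080
  periodSeconds: 20
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready     # assumed: also verifies the DB connection
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

With this split, a slow database takes pods out of the Service endpoints instead of restarting them, so the reconnect storm described above does not occur.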

Readiness probe

Purpose: determine whether the pod is ready to receive traffic. A pod that is not ready is removed from Service endpoints.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1

On failure: the pod keeps running, but the Service stops forwarding traffic to it. Once the probe passes again, the pod is added back to the endpoints.

Readiness vs liveness: when to use which

Situation	Which probe
App hung, process not responding	Liveness (needs a restart)
App warming up, not ready yet	Readiness (no traffic yet)
Downstream DB slow	Readiness (temporarily no traffic, no restart)
App must load a cache at start	Startup + readiness

Startup probe

Purpose: give a slow-starting container time to boot without being killed by liveness.

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30 # 30 × 10s = 300s (5 minutes) to start

While the startup probe has not passed yet:

Liveness and readiness do not run.
The container has at most failureThreshold × periodSeconds to start; after that kubelet kills it and applies the pod's restartPolicy.
Once it passes → the startup probe stops, and liveness + readiness begin.

Use it for: JVM warm-up, apps loading an ML model, migrations that run at start.
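To show how the three probes fit together on one container, here is a minimal sketch. The container name and image are hypothetical placeholders, and the timing values are illustrative rather than recommended.

```yaml
containers:
  - name: api                  # hypothetical container name
    image: example/api:1.0     # hypothetical image
    ports:
      - containerPort: 8080
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30     # up to 30 × 10s = 300s to start
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 20        # no initialDelaySeconds: it only runs after startup passes
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```

Because the startup probe absorbs the slow boot, the liveness probe can keep a short detection window for hangs that happen later.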

Probe methods

HTTP GET

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders: # Custom headers if needed
      - name: X-Health
        value: check

Response 200-399 → pass. Any other status code → fail.

TCP socket

livenessProbe:
  tcpSocket:
    port: 5432 # Only checks that the port is open

Use it for services without an HTTP endpoint (database, message queue). Note: an open port does not mean the service is healthy.
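Since an open port says little about health, one option is to keep the TCP check for liveness and use a deeper check for readiness. The sketch below assumes a PostgreSQL container whose image ships the pg_isready client tool (the official postgres image does); the user name is illustrative.

```yaml
# Sketch for a PostgreSQL container; assumes pg_isready is on the image's PATH.
livenessProbe:
  tcpSocket:
    port: 5432               # cheap check: is the port open?
  periodSeconds: 20
readinessProbe:
  exec:
    # pg_isready exits 0 only when the server is accepting connections
    command: ["pg_isready", "-h", "127.0.0.1", "-p", "5432", "-U", "postgres"]
  periodSeconds: 10
  timeoutSeconds: 5
```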

exec command

livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy

Exit code 0 → pass; any other exit code → fail. Use it for complex health checks (a script checking a file lock, queue depth…). Warning: exec spawns a new process inside the container on every probe, which costs resources when the period is short.
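As a slightly richer variant of the cat example, the sketch below passes only while a heartbeat file is fresh. It assumes the app touches /tmp/heartbeat at least once a minute (a hypothetical convention) and that the image contains sh and a find that supports -mmin.

```yaml
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # Passes only if /tmp/heartbeat exists and was modified less than a minute ago.
      - 'test -n "$(find /tmp/heartbeat -mmin -1)"'
  periodSeconds: 30    # a longer period keeps the process-spawn cost low
  timeoutSeconds: 5
```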

gRPC (K8s 1.27+ GA)

livenessProbe:
  grpc:
    port: 50051
    service: "" # "" = overall health

Uses the gRPC Health Checking Protocol (grpc.health.v1.Health).

Timing configuration


  flowchart TB
    Start[Container start] --> Delay[Wait initialDelaySeconds]
    Delay --> Probe[Probe]
    Probe -->|pass| Healthy[Healthy]
    Probe -->|fail| FailCount[Increment fail count]
    FailCount -->|fail < threshold| Wait[Wait periodSeconds]
    Wait --> Probe
    FailCount -->|fail >= threshold| Action[Action: restart or remove from endpoints]

Parameter	Default	Meaning
`initialDelaySeconds`	0	Seconds to wait before the first probe
`periodSeconds`	10	Seconds between probes
`timeoutSeconds`	1	Seconds before a single probe times out
`failureThreshold`	3	Consecutive failures before the action is taken
`successThreshold`	1	Consecutive passes needed to count as healthy (must be 1 for liveness and startup; readiness may use > 1)
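To make these parameters concrete, the fragment below annotates the liveness values used earlier with a rough detection-time estimate. The arithmetic is an approximation that ignores kubelet scheduling jitter and assumes a hung app makes each probe run into its timeout.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 3
  failureThreshold: 3
# Time from "app hangs" to "restart triggered", approximately:
#   worst case (hang right after a passing probe):
#     failureThreshold × periodSeconds + timeoutSeconds = 3 × 20 + 3 = 63s
#   best case (hang right before a probe):
#     (failureThreshold - 1) × periodSeconds + timeoutSeconds = 2 × 20 + 3 = 43s
```

Lowering periodSeconds shortens detection, while lowering failureThreshold trades tolerance of transient blips for speed.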

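Common misconfigurations

1. No readiness probe

Consequence: without a readiness probe, a pod counts as ready as soon as its containers start, so a rolling update sends traffic to pods that have not warmed up yet → users see 502/503.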
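A minimal Deployment sketch that lets the rolling update wait for readiness could look like this. The names (api, namespace app, label app=api) follow the lab commands later in this lesson; the image and the strategy values are illustrative assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # never remove an old pod before a new one is Ready
      maxSurge: 1
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.0   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
```

With maxUnavailable: 0, the rollout only progresses once each new pod passes its readiness probe.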

2. Liveness too sensitive

# ❌ WRONG
livenessProbe:
  periodSeconds: 3
  timeoutSeconds: 1
  failureThreshold: 1 # 1 failure = restart

A brief latency spike → restart → the app loses its in-memory state → repeated restarts.
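A more tolerant variant of the configuration above might look like the following sketch. The numbers are illustrative starting points, not tuned recommendations.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3      # survives a short GC pause or load spike
  failureThreshold: 3    # roughly 30s of consecutive failures before a restart
```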

3. Liveness checks an external dependency

Covered above: it leads to cascading failure.

4. initialDelaySeconds too short

A JVM app needs 30s to warm up, but initialDelaySeconds: 5 → liveness kills it before it is ready → restart loop. Fix: use a startup probe instead of raising initialDelaySeconds.

5. Forgetting to raise timeoutSeconds

The /healthz endpoint does only light work, but during a GC pause or under heavy load the default 1s timeout is exceeded → fail. Set it to at least 3-5s.

Graceful shutdown and probes


  sequenceDiagram
    participant K as kubectl
    participant API as kube-apiserver
    participant EP as Endpoints Controller
    participant KL as kubelet
    participant C as Container

    K->>API: Delete pod
    API->>API: Pod status → Terminating
    par async
        EP->>EP: Remove pod from Service endpoints
    and
        KL->>C: Run preStop hook (if any)
        KL->>C: SIGTERM
        C->>C: Has up to terminationGracePeriodSeconds to exit
        KL->>C: SIGKILL (if not exited yet)
    end

Race condition: endpoint removal and the kubelet shutdown sequence run concurrently. kubelet may send SIGTERM before kube-proxy has updated its rules → traffic still reaches a pod that is shutting down.

Fix:

lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"] # Wait 5s for kube-proxy to update
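The preStop delay counts against the pod's termination grace period, so the two settings are best read together. The sketch below shows them in one pod spec fragment; the container name and image are hypothetical, and the exec form requires a sleep binary in the image.

```yaml
spec:
  terminationGracePeriodSeconds: 30   # default value; covers preStop plus app shutdown
  containers:
    - name: api                       # hypothetical container name
      image: example/api:1.0          # hypothetical image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # needs a sleep binary inside the image
```

Here the app still has about 25s after SIGTERM to drain in-flight requests before SIGKILL.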

Lab: reading a rollout through probes

When a rollout is stuck, don't look only at kubectl rollout status. Also open the events and endpoints:

kubectl -n app rollout status deploy/api
kubectl -n app describe pod -l app=api
kubectl -n app get endpoints api -w
kubectl -n app logs deploy/api --tail=80

If a Pod is Running but does not appear in Endpoints, its readiness probe is failing. If the container restarts repeatedly, either the liveness or startup probe is too sensitive, or the app is really crashing. Both states look like a “broken deploy”, but the fixes are completely different.

What to remember when operating Kubernetes

Liveness: app hung → restart the container. Check only the process, not dependencies.
Readiness: app not ready → removed from the Service. Check dependencies too.
Startup: give the app a long startup window without being killed by liveness.
Timing: neither too sensitive (restart loop) nor too slow (traffic reaches dead pods).
Graceful shutdown: a preStop sleep of 5-10s works around the race condition with kube-proxy.

Frequently asked questions

Are all 3 probes required?

Answer: No. At minimum, have readiness (for safe rolling updates). Add liveness if the app can hang (deadlock, leak). Add startup if the app starts slowly. No probe is mandatory, but without readiness a rolling update is blind.

Does a simple stateless app (Express/Gin) need a liveness probe?

Answer: If the app only serves HTTP and a crash means the process exits, kubelet restarts it automatically (thanks to restartPolicy: Always). A liveness probe adds value when the app can hang without exiting (deadlock, blocked event loop). In short: add a /healthz endpoint that returns 200; low cost, high benefit.

Should readinessProbe `successThreshold > 1` be used?

Answer: Yes, when the app flaps (ready → not ready → ready in quick succession). successThreshold: 2-3 makes the pod receive traffic only once it is really stable, and avoids it being added to and removed from endpoints repeatedly (which causes connection resets).
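As an illustration of that answer, a readiness probe that requires two consecutive passes could be sketched as follows. The period and thresholds are example values only.

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 2   # two consecutive passes, one period apart, before traffic returns
```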

Next lesson (Stage II): Job, CronJob and DaemonSet: batch workloads, scheduled tasks, node-level agents.