# Rate limiting production gotchas: what nobody tells you in the demo

You deploy the rate limiter to production, the smoke tests pass, the demo for your boss goes fine, and you sleep soundly. The next day the logs show a single user making 4,000 API requests within 60 seconds without being blocked once. The bypass is almost embarrassingly simple: send a different `X-Forwarded-For` value (for example `X-Forwarded-For: 1.2.3.4`) with each request. The rate limiter thinks each request comes from a different IP, so no counter ever reaches the threshold.

This is the first gotcha in a long list I have run into over 3 years of deploying rate limiters for production APIs. Tutorials demo with a single-instance Redis and `curl` from one IP. Reality is very different.

## Gotcha 1: X-Forwarded-For is a header any client can set

Roughly speaking, `X-Forwarded-For` is like the "home address" field on a hotel check-in form: the guest writes whatever they like and nobody verifies it, so the hotel can choose to believe it or not. Your load balancer is the one that actually knows the source IP.

When a request passes through a load balancer (nginx, ALB, Cloudflare), the header is added like this:

```text
X-Forwarded-For: <real-client-ip>, <proxy1-ip>, <proxy2-ip>
```

But if the client sets this header itself before sending the request, the load balancer **appends** the real IP after it instead of replacing it:

```text
X-Forwarded-For: 1.2.3.4, 203.0.113.5
#                ↑ fake   ↑ real IP appended by the LB
```

A naive rate limiter reads `X-Forwarded-For` and takes the first element, which is the IP the client declared, not the real one.

```python
# WRONG: the client can spoof X-Forwarded-For
def get_client_ip_wrong(request):
    forwarded_for = request.headers.get("X-Forwarded-For", "")
    if forwarded_for:
        return forwarded_for.split(",")[0].strip()  # first element = trivially faked
    return request.remote_addr

# RIGHT: take the last element, the one your own load balancer appended,
# or use a dedicated header that the LB sets and clients cannot override.
# Assumes exactly one trusted proxy in front of the app; with more hops,
# count back from the right by the number of proxies you operate.
def get_client_ip_correct(request):
    forwarded_for = request.headers.get("X-Forwarded-For", "")
    if forwarded_for:
        # The last element is the peer IP the load balancer saw directly.
        # The client cannot fake it because only the LB appends after it.
        ips = [ip.strip() for ip in forwarded_for.split(",")]
        return ips[-1]
    return request.remote_addr
```

Better still: configure your edge proxy to pass a dedicated header such as `X-Real-IP` (nginx) or `CF-Connecting-IP` (Cloudflare). That header is set by your infrastructure, not by the client, provided the proxy overwrites any client-supplied value and the app is reachable only through the proxy. Then you just read that header and ignore `X-Forwarded-For` entirely.
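
If more than one of your own proxies appends to `X-Forwarded-For`, "take the last element" generalizes to "count back from the right by the number of trusted hops". A minimal sketch, assuming a hypothetical helper `client_ip_from_xff` and a `trusted_hops` setting that you keep in sync with your infrastructure (neither comes from a library):

```python
import ipaddress


def client_ip_from_xff(xff_header: str, remote_addr: str, trusted_hops: int = 1) -> str:
    """Return the address appended by the outermost proxy you operate.

    trusted_hops (hypothetical setting) is the number of your own proxies
    that append to X-Forwarded-For before the request reaches the app.
    """
    entries = [part.strip() for part in xff_header.split(",") if part.strip()]
    if trusted_hops < 1 or len(entries) < trusted_hops:
        # Header missing or shorter than expected: do not trust it at all.
        return remote_addr
    candidate = entries[-trusted_hops]
    try:
        # Reject values that are not IP addresses (e.g. "unknown").
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return remote_addr


# One trusted load balancer: the spoofed first entry is ignored.
print(client_ip_from_xff("1.2.3.4, 203.0.113.5", "10.0.0.1"))  # 203.0.113.5
# CDN in front of the load balancer: two trusted hops.
print(client_ip_from_xff("1.2.3.4, 203.0.113.5, 10.0.0.7", "10.0.0.1", trusted_hops=2))  # 203.0.113.5
```

Falling back to `remote_addr` when the header is malformed means a bad header groups the request under the proxy's address rather than under an attacker-chosen key.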

---

## Gotcha 2: A distributed counter on Redis INCR still races if you use it wrong

Many people read in the Redis docs that `INCR` is atomic and conclude "so it's thread-safe, nothing to worry about". But rate limiting usually needs several operations: INCR, check the value, set the TTL. These three operations are not atomic as a group, and that is where things go wrong.

A common pattern I see in codebases:

```python
# NOT ATOMIC: INCR and EXPIRE are two separate round trips
# (`redis` here is a connected client instance)
def is_rate_limited_wrong(key: str, limit: int, window_seconds: int) -> bool:
    count = redis.incr(key)              # step 1
    if count == 1:
        # step 2: if the process crashes before this line runs,
        # the key never gets a TTL and lives forever
        redis.expire(key, window_seconds)
    return count > limit
```

If the instance crashes (or loses its connection) after step 1 but before step 2, the key stays around with no TTL. Once its counter passes the limit, that user is blocked permanently, and nothing repairs it later because only the caller that saw `count == 1` ever sets the TTL.

The fix is a Lua script, which Redis runs atomically:

```lua
-- sliding_window_rate_limit.lua
-- Sliding window log backed by a Redis sorted set
-- Atomic: Redis runs the whole script as one unit

local key = KEYS[1]
local now = tonumber(ARGV[1])           -- unix timestamp in milliseconds
local window_ms = tonumber(ARGV[2])     -- window size in ms
local limit = tonumber(ARGV[3])         -- max requests allowed per window

-- Drop entries that have fallen out of the window
-- (ZADD below uses score = timestamp, so ZREMRANGEBYSCORE removes expired ones)
redis.call("ZREMRANGEBYSCORE", key, 0, now - window_ms)

-- Count the requests in the current window
local count = redis.call("ZCARD", key)

if count < limit then
    -- Record this request (member = timestamp + random suffix to avoid collisions).
    -- Caveat: older Redis versions seed math.random() the same way on every
    -- script run, so the suffix may repeat; passing a unique request ID from
    -- the caller is more robust.
    redis.call("ZADD", key, now, now .. math.random())
    -- Set the TTL to the window size so the key cleans itself up
    redis.call("PEXPIRE", key, window_ms)
    -- Returns: [allowed=1, current_count, limit]
    return {1, count + 1, limit}
else
    -- Find the oldest request in the window to compute Retry-After
    local oldest = redis.call("ZRANGE", key, 0, 0, "WITHSCORES")
    local retry_after_ms = 0
    if #oldest > 0 then
        retry_after_ms = window_ms - (now - tonumber(oldest[2]))
    end
    -- Returns: [allowed=0, current_count, limit, retry_after_ms]
    return {0, count, limit, retry_after_ms}
end
```
The Python wrapper registers the script once and calls it on every check:

```python
import time

import redis  # redis-py


class SlidingWindowRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        # Register the Lua script once. The returned Script object calls EVALSHA
        # and reloads the script if Redis no longer has it (e.g. after a restart).
        with open("sliding_window_rate_limit.lua") as f:
            self.script = self.redis.register_script(f.read())

    def check(self, key: str, limit: int, window_seconds: int) -> dict:
        now_ms = int(time.time() * 1000)
        window_ms = window_seconds * 1000

        # keys -> KEYS[1]; args -> ARGV[1..3]
        result = self.script(keys=[key], args=[now_ms, window_ms, limit])

        allowed = result[0] == 1
        return {
            "allowed": allowed,
            "current": result[1],
            "limit": result[2],
            "retry_after_ms": result[3] if not allowed else 0,
        }
```

A sliding window log gives the most accurate result (it counts exactly the requests in any window) but costs memory: every request is one entry in a sorted set. With 1,000 users × 100 requests/minute you hold up to 100,000 entries in Redis at any moment. A fixed window counter uses far less memory (one counter per key) but has the "burst doubling" problem: with a limit of 100 per minute, a user who times the window boundary can send 200 requests within 2 seconds.
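
To see the boundary effect concretely, here is a small in-memory simulation. `FixedWindowCounter` is a hypothetical illustration class, not the Redis implementation, and timestamps are passed in by hand so the run is deterministic:

```python
from collections import defaultdict


class FixedWindowCounter:
    """In-memory fixed window counter, for illustration only."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str, now: float) -> bool:
        # Every timestamp maps to exactly one window: 0-59s -> 0, 60-119s -> 1, ...
        bucket = (key, int(now // self.window_seconds))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True


limiter = FixedWindowCounter(limit=100, window_seconds=60)

# 100 requests in the last second of window 0 ...
late = sum(limiter.allow("user:1", now=59.0 + i / 100) for i in range(100))
# ... then 100 more in the first second of window 1.
early = sum(limiter.allow("user:1", now=60.0 + i / 100) for i in range(100))

accepted = late + early
print(accepted)  # 200 accepted within about 2 seconds, despite limit=100 per minute
```

A sliding window log rejects the second hundred, because all 200 timestamps fall inside the same 60-second span.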

---

## Gotcha 3: Rate limiting by IP blocks a whole office behind shared NAT

NAT (Network Address Translation) is what an office router uses to put 200 computers on the internet behind a single public IP. Rate limiting by IP in that situation means one shared limit for all 200 people.

I once set a limit of 1,000 req/min per IP on an internal tool API. It sounds generous, but at standup time, when the whole 50-person team opened the dashboard at once, total traffic from that one IP rose to ~4,300 req/min and everyone got 429s simultaneously. Worse, the counter reset every 60 seconds, so the "storm" of 429s recurred like clockwork every minute and looked like an application bug.

The solution depends on the type of API:

- **Public API without auth**: rate limiting by IP is unavoidable, but the limit needs to be much higher (10,000+ req/min) and clearly documented
- **API with auth tokens**: rate limit by user ID or API key, not by IP. This is the right approach for authenticated endpoints
- **Internal tool**: rate limit by user ID, or skip IP-level limiting and rate limit only in application logic

```python
import hashlib


def get_rate_limit_key(request) -> str:
    # Prefer the user ID when the request is authenticated
    if hasattr(request, "user") and request.user.is_authenticated:
        return f"rl:user:{request.user.id}"

    # API key for machine-to-machine traffic
    api_key = request.headers.get("X-API-Key")
    if api_key:
        # Store a hash, not the key itself: a raw prefix still leaks part of
        # the secret, and keys sharing a prefix would share one counter
        digest = hashlib.sha256(api_key.encode()).hexdigest()[:16]
        return f"rl:apikey:{digest}"

    # Fall back to IP only when nothing else is available
    return f"rl:ip:{get_client_ip_correct(request)}"
```
---

## Gotcha 4: Bulk endpoints break per-request rate limits

Rate limits are usually designed as "N requests per window". But if the API has a `/batch` endpoint that takes one request and performs 100 operations, a per-request limit protects nothing.

A request to `/v1/messages/batch` whose body carries 500 message IDs is technically 1 request, but the load on the database is 500 queries. The rate limiter sees 1 request and lets it through. The database dies.

The fix: charge cost in units of operations, not requests.

```python
import json
from functools import wraps

from django.http import HttpResponse  # assumes Django-style views

# Build the limiter once at import time, not per request: the constructor reads
# and loads the Lua script. `redis_client` is the app's shared Redis client.
limiter = SlidingWindowRateLimiter(redis_client)

def rate_limit_with_cost(limit_per_window: int, window_seconds: int, cost_func=None):
    """
    Decorator that rate limits by cost rather than by request count.
    cost_func: callable that takes the request and returns its cost (int)
    """
    def decorator(view_func):
        @wraps(view_func)
        def wrapped_view(request, *args, **kwargs):
            # Work out the cost of this request
            cost = 1
            if cost_func:
                cost = cost_func(request)

            key = get_rate_limit_key(request)

            # check_with_cost is a cost-aware variant of check() that the class
            # above does not define: it must add `cost` units atomically,
            # and only if they fit within the limit
            result = limiter.check_with_cost(key, limit_per_window, window_seconds, cost)

            if not result["allowed"]:
                response = HttpResponse(status=429)
                response["X-RateLimit-Limit"] = limit_per_window
                response["X-RateLimit-Remaining"] = 0
                response["Retry-After"] = result["retry_after_ms"] // 1000 + 1
                return response

            return view_func(request, *args, **kwargs)
        return wrapped_view
    return decorator

# For the batch endpoint: cost = number of items in the batch
def batch_cost(request) -> int:
    try:
        body = json.loads(request.body)
        items = body.get("ids", [])
        # Clamp to 1-500. The view itself should reject batches above 500,
        # otherwise an oversized batch would be undercharged.
        return max(1, min(len(items), 500))
    except (ValueError, AttributeError, TypeError):
        # Malformed body: charge the minimum and let the view return the error
        return 1

@rate_limit_with_cost(limit_per_window=1000, window_seconds=60, cost_func=batch_cost)
def batch_messages_view(request):
    # handle the batch request
    ...
```
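
The decorator calls `check_with_cost`, which the limiter class shown earlier does not define. What such a method has to do can be sketched in memory. `InMemoryCostLimiter` is a hypothetical single-process stand-in (an assumption for illustration); a production version would need the same logic inside a Lua script so it stays atomic across instances:

```python
import time
from collections import defaultdict, deque


class InMemoryCostLimiter:
    """Cost-aware sliding window kept in process memory (illustration only)."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.events = defaultdict(deque)  # key -> deque of (timestamp_ms, cost)

    def check_with_cost(self, key: str, limit: int, window_seconds: int, cost: int) -> dict:
        now_ms = int(self.clock() * 1000)
        window_ms = window_seconds * 1000
        events = self.events[key]
        # Drop entries that have slid out of the window.
        while events and events[0][0] <= now_ms - window_ms:
            events.popleft()
        used = sum(c for _, c in events)
        if used + cost <= limit:
            events.append((now_ms, cost))
            return {"allowed": True, "current": used + cost, "limit": limit, "retry_after_ms": 0}
        # Rejected: find when enough old cost expires for this request to fit.
        needed = used + cost - limit
        freed = 0
        retry_after_ms = window_ms  # fallback when cost alone exceeds the limit
        for ts, c in events:
            freed += c
            if freed >= needed:
                retry_after_ms = ts + window_ms - now_ms
                break
        return {"allowed": False, "current": used, "limit": limit, "retry_after_ms": retry_after_ms}


# Usage with a fake clock so the outcome is deterministic.
fake_now = [1000.0]
demo = InMemoryCostLimiter(clock=lambda: fake_now[0])
first = demo.check_with_cost("rl:user:1", 1000, 60, 500)   # allowed, 500 used
second = demo.check_with_cost("rl:user:1", 1000, 60, 500)  # allowed, 1000 used
third = demo.check_with_cost("rl:user:1", 1000, 60, 1)     # rejected, budget is spent
```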
---

## Gotcha 5: A 429 without Retry-After is useless to a well-behaved client

A good client retries after a delay when it gets a 429. But if it doesn't know how long to wait, it retries immediately or falls back to exponential backoff starting at 1 second. The result: well-behaved clients pile more requests that will only get 429s onto the system, aggravating the problem instead of easing the load.

RFC 6585 defines the 429 status code and allows the response to carry a `Retry-After` header (the header itself is specified in RFC 9110). Beyond that, clients need a full set of headers to know where they stand against the limit:

```python
import time


def add_rate_limit_headers(response, result: dict, window_seconds: int):
    """
    Add the conventional (de facto, not standardized) rate limit headers.

    X-RateLimit-Limit: maximum allowed in the window
    X-RateLimit-Remaining: how many requests are left in this window
    X-RateLimit-Reset: unix timestamp (seconds) when the window resets
    Retry-After: seconds the client should wait before retrying (429 only)
    """
    response["X-RateLimit-Limit"] = result["limit"]
    response["X-RateLimit-Remaining"] = max(0, result["limit"] - result["current"])
    # Upper bound on the reset time: by now + window_seconds, every request
    # counted so far has left the window
    reset_timestamp = int(time.time()) + window_seconds
    response["X-RateLimit-Reset"] = reset_timestamp

    if not result["allowed"]:
        # Retry-After: a number of seconds, not a timestamp
        retry_after_seconds = (result["retry_after_ms"] // 1000) + 1
        response["Retry-After"] = retry_after_seconds

    return response
```
A few notes on `Retry-After`:
- Use a number of seconds (integer) rather than an HTTP date string; it is easier for clients to parse
- Always add a 1-second buffer to avoid the edge case where the client's retry races the window reset
- With a sliding window, the retry delay is the time until the oldest request in the window expires, which the Lua script above computes exactly
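
On the client side, honoring the header might look like the sketch below. The `retry_delay_seconds` helper and its `default` fallback are assumptions for illustration. It accepts both forms HTTP allows (a delay in seconds or an HTTP date), since other servers may send a date:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay_seconds(
    header_value: Optional[str],
    default: float = 1.0,
    now: Optional[datetime] = None,
) -> float:
    """Turn a Retry-After header value into a non-negative delay in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        # Delay-seconds form, e.g. "30"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep.
    return max(0.0, (retry_at - current).total_seconds())


print(retry_delay_seconds("30"))   # 30.0
print(retry_delay_seconds(None))   # 1.0
```

A caller would sleep for the returned delay before retrying, ideally with a little random jitter so that many clients do not all retry in the same instant.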

---

## Which algorithm to use when

This question comes up a lot: fixed window, sliding window, or token bucket? The answer depends on what you are protecting.

**Fixed window counter**: the simplest and the lightest on memory. Use it when the limit does not need to be exact and a burst at the window boundary is acceptable. Good for: login throttling, SMS OTP limits. Not good for: APIs that protect a database from overload.

**Sliding window log**: the most accurate and the heaviest on memory. Use it when you must guarantee that no window ever contains more than N requests. Good for: billing APIs, expensive computation endpoints. Not good for: high-throughput endpoints with millions of users, because of the large memory footprint.

**Token bucket**: the best choice for allowing controlled short bursts. A user can spend accumulated tokens to burst 10 requests at once, as long as the long-run average rate stays within the limit. Good for: interactive APIs where users occasionally, but not continuously, need a bulk action. A Redis implementation needs an atomic check-and-decrement, which again means a Lua script.
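
The refill logic is easiest to see in a single-process sketch. `TokenBucket` and its parameter names are illustrative assumptions. State lives in memory, so this version is not shared across instances and does not replace the Redis + Lua implementation mentioned above:

```python
import time


class TokenBucket:
    """Single-process token bucket (illustration only)."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity                    # maximum burst size
        self.refill_per_second = refill_per_second  # sustained average rate
        self.clock = clock
        self.tokens = capacity                      # start with a full bucket
        self.updated_at = clock()

    def try_consume(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill lazily from elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Fake clock so the example is deterministic.
fake_time = [0.0]
bucket = TokenBucket(capacity=10, refill_per_second=1, clock=lambda: fake_time[0])

burst = sum(bucket.try_consume() for _ in range(11))
print(burst)  # 10: the burst drains the bucket, the 11th request is rejected

fake_time[0] = 3.0  # three seconds later, three tokens have been refilled
after_wait = sum(bucket.try_consume() for _ in range(5))
print(after_wait)  # 3
```

`capacity` controls how large a burst can be and `refill_per_second` controls the average rate, so the two can be tuned independently.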

```text
API type                 | Recommended algorithm    | Why
-------------------------|--------------------------|------------------------------------
Login / OTP              | Fixed window             | Simple, bursts are not dangerous
Public REST API          | Sliding window           | Accurate, good user experience
Expensive computation    | Token bucket             | Allows bursts, caps average rate
Webhook delivery         | Token bucket             | Allows retry bursts
Internal service         | Sliding window           | Accurate, few users so memory is OK
```
A good rate limiter is not just correct code; it also needs enough telemetry to tell you who is being limited, why, and whether the limit is reasonable. I always emit the metrics `rate_limit.allowed` and `rate_limit.rejected`, tagged with `limit_key_type` (user/ip/apikey) and `endpoint`. Two weeks after deploy there is enough data to tune the thresholds.
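
As a rough sketch of that tagging scheme, the snippet below counts decisions in a plain `Counter`. The `record_decision` helper and the in-memory `metrics` store are stand-ins invented for illustration; in a real service the increment would go to whatever metrics client you already use:

```python
from collections import Counter

# Stand-in for a real metrics client; keys are (metric name, key type, endpoint).
metrics = Counter()


def record_decision(allowed: bool, limit_key: str, endpoint: str) -> None:
    """Count one rate limit decision, tagged by key type and endpoint."""
    # Keys look like "rl:user:42" or "rl:ip:203.0.113.5"; the middle part is the type.
    parts = limit_key.split(":")
    key_type = parts[1] if len(parts) >= 3 and parts[0] == "rl" else "unknown"
    name = "rate_limit.allowed" if allowed else "rate_limit.rejected"
    metrics[(name, key_type, endpoint)] += 1


record_decision(True, "rl:user:42", "/v1/messages/batch")
record_decision(False, "rl:ip:203.0.113.5", "/v1/messages/batch")
record_decision(False, "rl:ip:203.0.113.5", "/v1/messages/batch")

for (name, key_type, endpoint), count in sorted(metrics.items()):
    print(f"{name} limit_key_type={key_type} endpoint={endpoint} count={count}")
```

Tagging by key type rather than by the full key keeps metric cardinality low while still showing whether rejections come mostly from the IP fallback or from authenticated users.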