Learning about rate limiting algorithms
If you don’t know what rate limiters are, you can refer to this Cloudflare primer: https://www.cloudflare.com/en-gb/learning/bots/what-is-rate-limiting/
Why Rate Limiting?
Rate limiting protects your services from:
- DoS attacks - Malicious request floods that can take down your service
- Resource exhaustion - Prevents CPU, memory, and database overload
- Cost control - Essential for metered APIs and cloud services
- Fair usage - Prevents one client from starving others
Common limits: 100 req/min (tier 1), 1000 req/min (tier 2), 10000 req/min (enterprise), etc.
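As a rough sketch of how such tiers might be wired up, a service could keep them in a small lookup table. The tier names and helper below are hypothetical, mirroring the limits above rather than any particular provider:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical tier table mirroring the limits above (requests per minute).
const std::unordered_map<std::string, int> kTierLimits = {
    {"tier1", 100},
    {"tier2", 1000},
    {"enterprise", 10000},
};

// Look up a client's per-minute limit, falling back to the lowest tier.
int limitFor(const std::string& tier) {
    auto it = kTierLimits.find(tier);
    return it != kTierLimits.end() ? it->second : 100;
}
```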
Architecture
In a very oversimplified way, this is how a rate limiter works:
```mermaid
flowchart TD
A[Client] --> B[API Gateway]
B --> C{Rate Limit Check}
C -->|Allowed| D[Service]
C -->|Blocked| E[429 Too Many Requests]
subgraph Store
S[Counter / Token Bucket]
end
C --> S
```
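To make the diagram concrete, here is a minimal sketch of the gateway-side check. All names here (`Request`, `RateLimiter`, the stubs) are illustrative placeholders, not a real framework; the actual `allowRequest` logic is what the rest of this post explores.

```cpp
#include <iostream>
#include <string>

// Illustrative placeholders only; not a real gateway framework.
struct Request { std::string clientId; };

struct RateLimiter {
    bool allowRequest(const std::string& clientId) {
        (void)clientId;  // real logic: see the algorithms below
        return true;     // stub: always allow
    }
};

void forwardToService(const Request&) { std::cout << "200 OK\n"; }
void reply429(const Request&) { std::cout << "429 Too Many Requests\n"; }

void handle(RateLimiter& limiter, const Request& req) {
    if (limiter.allowRequest(req.clientId)) {
        forwardToService(req);  // allowed: pass through to the service
    } else {
        reply429(req);          // blocked: ask the client to back off
    }
}
```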
Token Bucket
A token bucket controls the rate of requests using a simple idea: a “bucket” slowly fills with tokens over time, and every allowed request spends one token.
The bucket has a pre-defined maximum capacity, above which no more tokens are added. Tokens are added at a constant rate, and this rate determines the rate limiter’s average throughput.
When a request comes in, it is allowed only if at least one token is available in the bucket (or, for weighted requests, at least the request’s cost in tokens). If not, the request is throttled.
The token bucket algorithm is ideal for services that experience frequent surges in requests, because it allows bursts, up to the bucket’s capacity, as long as tokens remain available.
Example:
```mermaid
sequenceDiagram
participant C as Client
participant G as API Gateway
participant R as Rate Limiter
participant S as Token Store
Note over R,S: config: capacity=5, refill=1 token/sec
C->>G: 5 requests at t=0 (burst)
G->>R: Check (req1)
R->>S: fetch state (tokens=5, last=0)
S-->>R: tokens=5
R-->>G: allow (consume -> tokens=4)
G-->>C: 200 OK
G->>R: Check (req2..req5, condensed)
R->>S: tokens=4..1
R-->>G: allow (consume)
G-->>C: 200 OK (×4)
Note over C,R: t=1s later
C->>G: Request at t=1.0
G->>R: Check (lazy refill)
R->>S: fetch state (tokens=0, last=0)
alt elapsed = 1s
R->>R: tokens += 1*refill -> tokens=1 (capped to capacity)
end
R-->>G: allow (consume -> tokens=0)
G-->>C: 200 OK
C->>G: Request at t=1.2
G->>R: Check
R->>S: fetch state (tokens=0, last=1.0)
R-->>G: reject (429)
G-->>C: 429 Too Many Requests
```
Implementation
Most implementations don’t refill the bucket at regular intervals; instead they do something called a “lazy refill”: tokens are added only when a request arrives, and the number added is exactly what would have accumulated if the bucket had been refilling continuously at the configured rate.
```mermaid
flowchart TD
A[Request arrives] --> B[Read state: tokens, last_timestamp]
B --> C[elapsed = now - last_timestamp]
C --> D[add = elapsed * refill_rate]
D --> E["tokens = min(capacity, tokens + add)"]
E --> F{tokens >= cost?}
F -->|yes| G["consume cost (tokens -= cost); allow request"]
F -->|no| H[reject / queue / throttle]
G --> I[update state: tokens, last_timestamp=now]
H --> I
```
Algorithm in C++
```cpp
#include <algorithm>  // std::min
#include <ctime>      // time()

class TokenBucket {
    double tokens;
    int capacity;
    double refillRate;     // tokens per second
    long long lastRefill;  // unix timestamp of the last refill

public:
    TokenBucket(int cap, double rate) : capacity(cap), refillRate(rate) {
        tokens = cap;
        lastRefill = time(0);
    }

    bool allowRequest(int cost = 1) {
        refill();
        if (tokens >= cost) {
            tokens -= cost;
            return true;
        }
        return false;
    }

    void refill() {
        long long now = time(0);
        double elapsed = now - lastRefill;
        // Add tokens based on time elapsed, capped at capacity
        tokens = std::min((double)capacity, tokens + elapsed * refillRate);
        lastRefill = now;
    }
};
```
Complexity Analysis
- Time: O(1) per request
- Space: O(1) - only stores tokens, capacity, rate, timestamp
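A quick usage sketch of the class above. Since it reads the real clock, the exact outcome is timing-dependent; with capacity 5 and 1 token/sec, a burst of 6 back-to-back requests should see the first 5 allowed and the 6th rejected.

```cpp
#include <iostream>

int main() {
    TokenBucket bucket(5, 1.0);  // capacity=5, refill=1 token/sec

    // Burst: the first 5 requests drain the bucket; the 6th finds it empty.
    for (int i = 1; i <= 6; i++) {
        std::cout << "req " << i << ": "
                  << (bucket.allowRequest() ? "allowed" : "rejected") << "\n";
    }
    // A second later, one token has been lazily refilled,
    // so a single further request would be allowed again.
}
```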
Pros
- Allows legitimate traffic bursts
- Simple and memory efficient
- Flexible - different costs per request type
- Used by major cloud providers (AWS, GCP)
Cons
- Doesn’t smooth traffic
- Large bursts can still overwhelm downstream services
- Requires careful tuning of capacity vs refill rate
Real-World Usage
- GitHub API: 5,000 requests/hour per user
- Stripe API: Token bucket for payment endpoints
- AWS API Gateway: Default rate limiting mechanism
Leaky Bucket
A leaky bucket algorithm sends requests to the service at a fixed rate no matter how fast they come in. It’s like pouring water into a bucket with a tiny hole at the bottom — you can pour as fast as you want, but the water will only drip out at one steady speed.
```mermaid
graph TB
subgraph "Leaky Bucket Algorithm"
A[Incoming Requests] --> B{Bucket}
B --> C[Queue/Buffer]
C --> D[Leak Rate Controller]
D --> E[Outgoing Requests]
B -.->|Bucket Full| F[Reject Request]
D -.->|Fixed Rate| G[Constant Output Rate]
end
```
Implementation
We usually implement a leaky bucket with the help of a queue that stores incoming requests up to a maximum capacity, combined with a process that drains the queue at a fixed rate: either a dedicated background worker, or a lazy drain performed whenever a request arrives, as in the code below.
```mermaid
sequenceDiagram
participant Client
participant Bucket
participant LeakController
participant Server
Client->>Bucket: Request 1 (t=0)
Bucket->>Bucket: Add to queue
Client->>Bucket: Request 2 (t=0.1)
Bucket->>Bucket: Add to queue
Client->>Bucket: Request 3 (t=0.2)
Bucket->>Bucket: Add to queue
loop Fixed Leak Rate
LeakController->>Bucket: Process next request
Bucket->>Server: Forward request
Server-->>Bucket: Response
Bucket-->>Client: Response
Note over LeakController: Wait for fixed interval
end
Client->>Bucket: Request N (Bucket Full)
Bucket-->>Client: 429 Too Many Requests
```
Algorithm in C++
```cpp
#include <ctime>  // time()
#include <queue>

using std::queue;

class LeakyBucket {
    queue<int> q;        // pending request ids
    int capacity;        // max queued requests
    long long lastLeak;  // unix timestamp of the last leak
    int leakRate;        // requests per second

public:
    LeakyBucket(int cap, int rate) : capacity(cap), leakRate(rate) {
        lastLeak = time(0);
    }

    bool addRequest(int id) {
        // Leak first, so capacity reflects the current time
        leak();
        // Check capacity
        if ((int)q.size() >= capacity) {
            return false;  // Reject
        }
        q.push(id);
        return true;  // Accept
    }

    void leak() {
        long long now = time(0);
        long long elapsed = now - lastLeak;
        // Number of requests to leak based on time elapsed
        int toRemove = (int)(elapsed * leakRate);
        while (toRemove-- > 0 && !q.empty()) {
            q.pop();
        }
        lastLeak = now;
    }
};
```
Complexity Analysis
- Time: O(M) per request, where M = queued requests drained on arrival (amortized O(1), since each request is pushed and popped once)
- Space: O(N) where N = requests in queue
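The class above drains lazily on each call. To match the “background process” description literally, a worker thread can pop the queue at a fixed interval instead. The following is a minimal thread-safe sketch under that assumption (no shutdown handling, and the forwarding step is just a print):

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Minimal sketch: a dedicated worker drains the queue at a fixed rate.
class LeakyBucketWorker {
    std::queue<int> q;
    std::mutex m;
    size_t capacity;

public:
    explicit LeakyBucketWorker(size_t cap) : capacity(cap) {}

    bool addRequest(int id) {
        std::lock_guard<std::mutex> lock(m);
        if (q.size() >= capacity) return false;  // bucket full: reject
        q.push(id);
        return true;
    }

    // Forward at most one request per tick; runs forever in this sketch.
    void run(std::chrono::milliseconds interval) {
        for (;;) {
            std::this_thread::sleep_for(interval);
            std::lock_guard<std::mutex> lock(m);
            if (!q.empty()) {
                std::cout << "forwarding request " << q.front() << "\n";
                q.pop();
            }
        }
    }
};

// Usage: std::thread worker(&LeakyBucketWorker::run, &bucket,
//                           std::chrono::milliseconds(100));  // 10 req/sec
```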
Pros
- Smooths out traffic bursts into steady stream
- Predictable output rate
- Good for network traffic shaping
Cons
- Requires queue/buffer storage
- Higher memory usage (O(N))
- Requests may experience variable latency
Real-World Usage
- Network QoS (Quality of Service)
- Video streaming rate control
- Traffic shaping in routers
Fixed Window Counter
This algorithm divides time into equal intervals; each interval can serve at most a fixed number of requests, tracked with a counter that resets every time a new interval begins.
```mermaid
graph TB
subgraph "Fixed Window Counter"
A[Incoming Request] --> B{Check Current Time}
B --> C{Window Expired?}
C -->|Yes| D[Reset Counter to 0]
C -->|No| E[Keep Current Counter]
D --> F[Update Window Start Time]
E --> G{Counter < Limit?}
F --> G
G -->|Yes| H[Increment Counter]
G -->|No| I[Reject Request]
H --> J[Accept Request]
end
```
The Burst Problem
The major issue with fixed window counters is the 2x burst at window boundaries:
```mermaid
graph LR
    subgraph "Burst Problem Example"
    W1["Window 1<br/>10:00:00-10:00:59<br/>100 req at 10:00:30-59"]
    W2["Window 2<br/>10:01:00-10:01:59<br/>100 req at 10:01:00-29"]
    W1 --> W2
    Problem["⚠️ Result: 200 requests<br/>in 60 seconds<br/>(10:00:30-10:01:29)<br/>2x the limit!"]
    end
```
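To see the boundary burst without waiting on a real clock, here is a hypothetical, test-friendly variant of the counter where the caller supplies the timestamp; it accepts all 200 requests even though they fall within one 60-second span.

```cpp
#include <iostream>

// Hypothetical test-friendly variant: the caller supplies the timestamp.
class FixedWindowDemo {
    int count = 0;
    long long windowStart = 0;
    int limit;
    int windowSize;  // seconds

public:
    FixedWindowDemo(int lim, int winSec) : limit(lim), windowSize(winSec) {}

    bool allowRequest(long long now) {
        if (now >= windowStart + windowSize) {  // window expired: reset
            count = 0;
            windowStart = now;
        }
        if (count >= limit) return false;
        count++;
        return true;
    }
};

int main() {
    FixedWindowDemo rl(100, 60);  // limit: 100 req per 60-sec window

    int accepted = 0;
    for (int i = 0; i < 100; i++) accepted += rl.allowRequest(30);  // ~10:00:30
    for (int i = 0; i < 100; i++) accepted += rl.allowRequest(60);  // ~10:01:00
    // All 200 requests pass, although they fall inside one 60-second span.
    std::cout << accepted << " accepted\n";  // prints: 200 accepted
}
```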
Implementation
```mermaid
sequenceDiagram
participant Client
participant RateLimiter
participant Window as Window State
Note over Window: Window 1: 10:00:00-10:00:59<br/>Counter=0, Limit=100
Client->>RateLimiter: Request 1 (t=10:00:05)
RateLimiter->>Window: Get current time
Window-->>RateLimiter: 10:00:05
RateLimiter->>Window: Check if window expired
Window-->>RateLimiter: No (still in window)
RateLimiter->>Window: Check counter (0 < 100?)
Window-->>RateLimiter: Yes
RateLimiter->>Window: Increment counter (0→1)
RateLimiter-->>Client: 200 OK (Request Accepted)
Note over Client,Window: ... more requests ...
Client->>RateLimiter: Request 100 (t=10:00:55)
RateLimiter->>Window: Check counter (99 < 100?)
Window-->>RateLimiter: Yes
RateLimiter->>Window: Increment counter (99→100)
RateLimiter-->>Client: 200 OK (Request Accepted)
Client->>RateLimiter: Request 101 (t=10:00:58)
RateLimiter->>Window: Check if window expired
Window-->>RateLimiter: No
RateLimiter->>Window: Check counter (100 < 100?)
Window-->>RateLimiter: No
RateLimiter-->>Client: 429 Too Many Requests (REJECTED)
Note over Window: Window expires at 10:01:00
Client->>RateLimiter: Request 102 (t=10:01:02)
RateLimiter->>Window: Get current time
Window-->>RateLimiter: 10:01:02
RateLimiter->>Window: Check if window expired
Window-->>RateLimiter: Yes (10:01:02 >= 10:01:00)
RateLimiter->>Window: Reset counter to 0
RateLimiter->>Window: Update window start (10:01:02)
Note over Window: Window 2: 10:01:02-10:02:01<br/>Counter=0
RateLimiter->>Window: Check counter (0 < 100?)
Window-->>RateLimiter: Yes
RateLimiter->>Window: Increment counter (0→1)
RateLimiter-->>Client: 200 OK (Request Accepted)
```
Algorithm in C++
```cpp
#include <ctime>  // time()

class FixedWindowCounter {
    int count;
    long long windowStart;
    int limit;
    int windowSize;  // seconds

public:
    FixedWindowCounter(int lim, int winSec) : limit(lim), windowSize(winSec) {
        count = 0;
        windowStart = time(0);
    }

    bool allowRequest() {
        long long now = time(0);
        // Check if window expired
        if (now >= windowStart + windowSize) {
            count = 0;
            windowStart = now;
        }
        // Check limit
        if (count >= limit) {
            return false;
        }
        count++;
        return true;
    }
};
```
Complexity Analysis
- Time: O(1) per request
- Space: O(1) - only stores counter, window start, limit
Pros
- Extremely simple to implement
- Memory efficient (O(1))
- Easy to understand and debug
Cons
- Major flaw: Allows 2x burst at window boundaries
- Not suitable for strict SLAs
- Doesn’t provide smooth rate limiting
Real-World Usage
- Simple analytics and metrics counting
- Non-critical rate limiting
- Basic logging systems where precision isn’t critical
Sliding Window Log
A sliding window log tracks every request timestamp inside the current time window. On each incoming request, it removes all timestamps older than now - window_size and checks whether the remaining count exceeds the limit. It’s accurate but memory-heavy.
```mermaid
graph TB
    subgraph "Sliding Window Log Architecture"
    A[Incoming Request] --> B[Get Current Timestamp]
    B --> C[Cleanup Phase]
    C --> D{"Remove timestamps<br/>older than<br/>now - window_size"}
    D -->|Has old entries| E[Pop from front of log]
    E --> D
    D -->|No old entries| F{"Log size < Limit?"}
    F -->|Yes| G["Add current timestamp<br/>to log"]
    F -->|No| H[Reject Request]
    G --> I[Accept Request]
    end
```
Implementation
```mermaid
sequenceDiagram
participant Client
participant RateLimiter
participant Log as Timestamp Log
participant Clock
Note over Log: Initially empty<br/>Limit: 3 req/10sec
Client->>RateLimiter: Request 1 (t=0s)
RateLimiter->>Clock: Get current time
Clock-->>RateLimiter: t = 0
RateLimiter->>Log: Remove timestamps < (0 - 10) = -10
Log-->>RateLimiter: None to remove
RateLimiter->>Log: Count entries
Log-->>RateLimiter: 0 entries
RateLimiter->>Log: Check (0 < 3?)
Log-->>RateLimiter: Yes
RateLimiter->>Log: Add timestamp: 0
Note over Log: [0]
RateLimiter-->>Client: 200 OK ✓
Client->>RateLimiter: Request 2 (t=2s)
RateLimiter->>Clock: Get current time
Clock-->>RateLimiter: t = 2
RateLimiter->>Log: Remove timestamps < (2 - 10) = -8
Log-->>RateLimiter: None to remove
RateLimiter->>Log: Count entries
Log-->>RateLimiter: 1 entry
RateLimiter->>Log: Check (1 < 3?)
Log-->>RateLimiter: Yes
RateLimiter->>Log: Add timestamp: 2
Note over Log: [0, 2]
RateLimiter-->>Client: 200 OK ✓
Client->>RateLimiter: Request 3 (t=5s)
RateLimiter->>Clock: Get current time
Clock-->>RateLimiter: t = 5
RateLimiter->>Log: Remove timestamps < (5 - 10) = -5
Log-->>RateLimiter: None to remove
RateLimiter->>Log: Count entries
Log-->>RateLimiter: 2 entries
RateLimiter->>Log: Check (2 < 3?)
Log-->>RateLimiter: Yes
RateLimiter->>Log: Add timestamp: 5
Note over Log: [0, 2, 5]
RateLimiter-->>Client: 200 OK ✓
Client->>RateLimiter: Request 4 (t=7s)
RateLimiter->>Clock: Get current time
Clock-->>RateLimiter: t = 7
RateLimiter->>Log: Remove timestamps < (7 - 10) = -3
Log-->>RateLimiter: None to remove
RateLimiter->>Log: Count entries
Log-->>RateLimiter: 3 entries
RateLimiter->>Log: Check (3 < 3?)
Log-->>RateLimiter: No
Note over Log: [0, 2, 5]<br/>(unchanged)
RateLimiter-->>Client: 429 Too Many Requests ✗
Note over Client,Log: Window slides forward...
Client->>RateLimiter: Request 5 (t=11s)
RateLimiter->>Clock: Get current time
Clock-->>RateLimiter: t = 11
RateLimiter->>Log: Remove timestamps < (11 - 10) = 1
activate Log
Log->>Log: Check: 0 <= 1? Yes, remove
Note over Log: [2, 5]
deactivate Log
Log-->>RateLimiter: Removed 1 entry
RateLimiter->>Log: Count entries
Log-->>RateLimiter: 2 entries
RateLimiter->>Log: Check (2 < 3?)
Log-->>RateLimiter: Yes
RateLimiter->>Log: Add timestamp: 11
Note over Log: [2, 5, 11]
RateLimiter-->>Client: 200 OK ✓
Client->>RateLimiter: Request 6 (t=13s)
RateLimiter->>Clock: Get current time
Clock-->>RateLimiter: t = 13
RateLimiter->>Log: Remove timestamps < (13 - 10) = 3
activate Log
Log->>Log: Check: 2 <= 3? Yes, remove
Note over Log: [5, 11]
deactivate Log
Log-->>RateLimiter: Removed 1 entry
RateLimiter->>Log: Count entries
Log-->>RateLimiter: 2 entries
RateLimiter->>Log: Check (2 < 3?)
Log-->>RateLimiter: Yes
RateLimiter->>Log: Add timestamp: 13
Note over Log: [5, 11, 13]
RateLimiter-->>Client: 200 OK ✓
```
Algorithm in C++
```cpp
#include <ctime>  // time()
#include <list>

using std::list;

class SlidingWindowLog {
    list<long long> log;  // timestamps of accepted requests
    int limit;
    int windowSize;  // seconds

public:
    SlidingWindowLog(int lim, int window) : limit(lim), windowSize(window) {}

    bool allowRequest() {
        long long now = time(0);
        // Remove old timestamps outside the window
        while (!log.empty() && log.front() <= now - windowSize) {
            log.pop_front();
        }
        // Check if limit exceeded
        if ((int)log.size() >= limit) {
            return false;  // Reject
        }
        // Add current timestamp
        log.push_back(now);
        return true;  // Accept
    }
};
```
Complexity Analysis
- Time: O(K) per request, where K = number of expired timestamps to remove (amortized close to O(1) for evenly spaced traffic)
- Space: O(N), storing one timestamp per request in the window
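A usage sketch matching the sequence diagram’s configuration (3 requests per 10 seconds). With a real clock, all four calls below land at roughly the same second, so the fourth is rejected:

```cpp
#include <iostream>

int main() {
    SlidingWindowLog rl(3, 10);  // limit=3, window=10 sec

    for (int i = 1; i <= 4; i++) {
        std::cout << "req " << i << ": "
                  << (rl.allowRequest() ? "allowed" : "rejected") << "\n";
    }
    // req 1-3: allowed; req 4: rejected (3 timestamps already in the window)
}
```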
Pros
- Very accurate rate limiting
- No burst problems like fixed windows
- Easy to reason about
- Handles irregular traffic patterns well
Cons
- Needs memory proportional to all requests in the window
- Slower at high QPS
- Not ideal for large-scale distributed systems
Real-World Usage
- Smaller services needing precise control
- Security-sensitive endpoints
- API endpoints with low-to-medium traffic where accuracy matters more than memory