API Rate Limiter
Summary

When a new request arrives, the Web Server asks the Rate Limiter to decide whether it should be served or throttled.
Requirements
Functional requirements
- Limit the number of requests an entity can send to an API within a time window.
- The APIs are accessible through a cluster, so the rate limit should be enforced across different servers.
- The user should get an error message whenever the defined threshold is crossed, whether on a single server or across a combination of servers.
Performance requirements
- The system should be highly available. The rate limiter should always work, since it protects our service from external attacks.
- The rate limiter should not introduce substantial latency that affects the user experience.
Throttling
- Throttling is the process of controlling customers' usage of the APIs during a given period.
- Throttling can be defined at the application level and/or the API level.
- When a throttle limit is crossed, the server returns HTTP status "429 Too Many Requests".
- Types of throttling
    - Hard throttling: the number of API requests cannot exceed the throttle limit.
    - Soft throttling: requests may exceed the limit by a set percentage (e.g., with a limit of 100 and 10% soft throttling, up to 110 requests are allowed).
    - Elastic / dynamic throttling: the number of requests can go beyond the threshold if the system has spare resources available.
Rate Limiting Algorithms

- Fixed window: the time window runs from the start of the time unit to its end (e.g., each calendar minute).
- Rolling window: the time window runs backward from the moment each request is made, covering the last window length. A fixed window can admit up to twice the limit around a window boundary (e.g., requests just before and just after a minute mark), which a rolling window prevents.
Fixed Window Solution
- Allow 3 requests per minute per user.
- Hashtable
    - Key: userId
    - Value: count and startTime
- Procedure for each request
    - For a new user
        - Create a new entry.
        - count = 1
        - startTime = currentTime
        - Allow the request.
    - For an existing user
        - If currentTime - startTime >= 1 min, reset count and startTime.
        - Otherwise, reject the request if count >= 3, or allow the request if count < 3 and increment count.
- Atomicity
    - In a distributed environment, the "read-and-then-write" behavior can create a race condition: two concurrent requests may both read the old count and both be allowed.
    - We can use a lock, but it will slow down concurrent requests from the same user and introduce extra complexity.
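The per-user-lock approach can be sketched in a single process as follows (class and field names are illustrative; a distributed deployment would need a distributed lock or an atomic store-side operation instead):

```python
import threading
import time
from collections import defaultdict

class LockingFixedWindow:
    """Fixed-window limiter whose check-and-increment is guarded by a
    per-user lock, so two concurrent requests cannot both read the same
    count and both decide they are under the limit."""

    def __init__(self, limit=3, window=60):
        self.limit = limit
        self.window = window
        self.locks = defaultdict(threading.Lock)  # userId -> lock
        self.state = {}                           # userId -> (count, startTime)

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        with self.locks[user_id]:  # read-modify-write happens atomically
            count, start = self.state.get(user_id, (0, now))
            if now - start >= self.window:
                count, start = 0, now  # window expired: reset
            if count >= self.limit:
                return False           # throttle
            self.state[user_id] = (count + 1, start)
            return True
```

Note the tradeoff the notes mention: all concurrent requests from one user serialize on that user's lock.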
- Memory usage
    - userId: 8B
    - count: 2B (counts up to 65K)
    - startTime: 4B; store only the minute and second part
    - lock: 4B number
    - hash overhead: 20B
    - total: (18B + 20B) * 1M users = 38MB
    - Can fit into one server, but should be distributed for performance reasons.
    - Up to 3M QPS if the rate limit is 3 requests per user per second.
- Implementation
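A minimal in-memory sketch of the fixed-window procedure above (single-process and illustrative; the class name and dict layout are assumptions, and a production version would sit behind the distributed store discussed later):

```python
import time

class FixedWindowLimiter:
    """Fixed window: at most `limit` requests per `window` seconds per user."""

    def __init__(self, limit=3, window=60):
        self.limit = limit
        self.window = window
        self.users = {}  # userId -> {"count": int, "startTime": float}

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        entry = self.users.get(user_id)
        if entry is None or now - entry["startTime"] >= self.window:
            # New user, or the window expired: reset count and startTime.
            self.users[user_id] = {"count": 1, "startTime": now}
            return True
        if entry["count"] < self.limit:
            entry["count"] += 1
            return True
        return False  # threshold crossed -> throttle (HTTP 429)
```

For example, with the default limit of 3 per minute, the first three calls to `allow` for a user return True, the fourth within the same window returns False, and a call a minute later resets the window.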
    
  
Sliding Window Solution
- Hashtable
    - Key: userId
    - Value: sorted set of timestamps
- Procedure for each request
    - Remove all timestamps from the sorted set that are older than currentTime - 1 min.
    - Reject the request if the remaining count is greater than or equal to the throttling limit.
    - Otherwise, allow the request and add the current time to the sorted set.
- Memory usage
    - userId: 8B
    - each timestamp: 4B + 8B (prev pointer) + 8B (next pointer) = 20B
    - hash overhead: 20B
    - each user: 8B + 20B * 3 + 20B = 88B
    - total: 88B * 1M users = 88MB
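The sorted-set procedure above can be sketched in-process with a sorted list (illustrative only; a real deployment would more likely use something like a Redis sorted set shared across servers):

```python
import bisect

class SlidingWindowLimiter:
    """Sliding window: keep each user's request timestamps in sorted order."""

    def __init__(self, limit=3, window=60):
        self.limit = limit
        self.window = window
        self.timestamps = {}  # userId -> sorted list of request times

    def allow(self, user_id, now):
        ts = self.timestamps.setdefault(user_id, [])
        # 1. Drop timestamps older than currentTime - window.
        cutoff = now - self.window
        del ts[:bisect.bisect_left(ts, cutoff)]
        # 2. Reject if the user is already at the limit.
        if len(ts) >= self.limit:
            return False
        # 3. Otherwise allow and record the current time.
        bisect.insort(ts, now)
        return True
```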
 
Sliding Window with Counters
- Keep track of request counts for each user using multiple fixed time windows.
- For example, for an hourly rate limit, we can keep a count for each minute and calculate the sum of all counters in the past hour.
- This reduces the memory footprint for large limits.
- For a rate limit of 500 requests per hour
    - Without counters: 8B userId + (4B timestamp + 20B sorted-set overhead) * 500 timestamps + 20B hashtable overhead = 12KB
    - With counters: 8B userId + (4B timestamp + 2B count + 20B hash overhead) * 60 entries + 20B hashtable overhead = 1.6KB
    - About 87% less memory.
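The counter variant can be sketched like this, with one counter per minute bucket (the bucket layout and names are illustrative assumptions):

```python
class SlidingWindowCounters:
    """Hourly limit approximated by summing 60 per-minute counters."""

    def __init__(self, limit=500):
        self.limit = limit
        self.counters = {}  # userId -> {minute_bucket: count}

    def allow(self, user_id, now):
        minute = int(now // 60)
        buckets = self.counters.setdefault(user_id, {})
        # Evict buckets that fell outside the last 60 minutes.
        for b in [b for b in buckets if b <= minute - 60]:
            del buckets[b]
        # Sum the surviving counters over the past hour.
        if sum(buckets.values()) >= self.limit:
            return False
        buckets[minute] = buckets.get(minute, 0) + 1
        return True
```

This trades a little precision for memory: requests are bucketed to minute granularity, so the window slides in one-minute steps rather than continuously.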
 
Data Sharding
- Shard rate-limiting data by userId.
- Use consistent hashing.
- If different APIs have different limits, we can shard per user per API.
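A minimal consistent-hash ring for picking a shard by userId might look like the following (virtual-node count and hash choice are illustrative assumptions):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map userIds to shards so that adding or removing a shard only
    remaps a small fraction of users."""

    def __init__(self, shards, vnodes=100):
        self.ring = []  # sorted list of (hash, shard); vnodes smooth the load
        for shard in shards:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{shard}#{i}"), shard))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, user_id):
        if not self.ring:
            raise ValueError("no shards configured")
        # First ring position at or after the key's hash, wrapping around.
        i = bisect.bisect(self.keys, self._hash(user_id)) % len(self.ring)
        return self.ring[i][1]
```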
Data Caching
- Application servers can quickly check whether the cache has the desired record before hitting the backend servers.
- Use a write-back cache: updates go to the cache first and are flushed to persistent storage at fixed intervals.
- This is especially useful once a user has hit their maximum limit, since the rate limiter then only reads data without making any updates.
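A write-back cache for rate-limit records can be sketched as below; the `store` argument is a stand-in for whatever persistence layer is used, and the flush policy shown (time-based, on write) is one assumed design among several:

```python
import time

class WriteBackCache:
    """Cache rate-limit records in memory; flush dirty entries to the
    backing store at a fixed interval instead of on every write."""

    def __init__(self, store, flush_interval=5.0):
        self.store = store              # dict-like persistent backend
        self.flush_interval = flush_interval
        self.cache = {}
        self.dirty = set()
        self.last_flush = time.monotonic()

    def get(self, key):
        if key not in self.cache and key in self.store:
            self.cache[key] = self.store[key]  # cache miss: load from backend
        return self.cache.get(key)

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)  # write hits the cache only; persisted later
        self.maybe_flush()

    def maybe_flush(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_flush >= self.flush_interval:
            for key in self.dirty:
                self.store[key] = self.cache[key]
            self.dirty.clear()
            self.last_flush = now
```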
Rate Limit by IP or by User
- By IP
    - Better than no rate limiting at all.
    - When multiple users share a single public IP, one bad user can cause throttling for the others.
    - IPv6 gives even a single computer access to a huge number of addresses, so per-IP limits are easy to evade and can exhaust server memory.
- By User
    - Performed after user authentication.
    - What about rate limiting the login API? It is hit before authentication, so it still needs an IP-based (or username-based) limit.
- Hybrid
    - Combine per-IP and per-user rate limiting.
    - Requires more memory and storage.