A rate limiter restricts the number of events that a certain user or device can perform within a given time range. In other words, a rate limiter limits the number of requests a sender can send in a particular period. So it is implemented as a protective mechanism to prevent excessive use of services. Once the upper limit is reached, the rate limiter blocks further requests.
In this blog, we will discuss components and algorithms related to the design of a rate limiter.
It’s critical to ask questions and clarify needs and restrictions with the interviewer during a system-design interview. Here are a few examples:
Let’s keep things simple and communicate using a basic client-server approach. Here is a critical question: where should the rate limiter be placed? One can implement a rate limiter on either the client or server side.
There is an alternative to the client and server-side implementations. We construct a rate limiter middleware, which throttles requests to your APIs, rather than establishing a rate limiter on the API servers.
The underlying concept of rate-limiting algorithms is simple. We need a counter at the highest level to track how many requests are submitted from the same user, IP address, etc. The request is rejected if the counter exceeds the limit.
Where should we keep the counters? Due to the slowness of disc access, using the database is not a smart option. So we can use an in-memory cache to make it quick and support a time-based expiration technique. For example, Redis is a popular choice for rate-limiting, where “INCR” and “EXPIRE” are two commands that can be used to access in-memory storage.
Now, let’s explore several rate-limiting algorithms available for implementing API rate limiting. Suppose we need to create an API Rate Limiter that limits API calls based on the user, where one user can send a fixed number of requests in a fixed time range. Based on the use case, we can change the implementation according to our criteria.
The idea of the token bucket algorithm works by maintaining a bucket with a fixed capacity of tokens. Here, tokens are added to the bucket at a constant rate, and each token represents permission to perform a certain action (such as making an API request).
When an API request comes in, the system checks if there are tokens available in the bucket. If there is at least one token available, the system removes one token from the bucket and processes the request. If there are no tokens available, the request is either rejected or delayed until tokens are available.
The token bucket gets refreshed or refilled after a specified time interval, which determines how quickly the bucket refills with tokens. For example, if a user is allocated 10 tokens per minute, they can make up to 10 requests per minute. Once the user uses up all their tokens within the time interval, they will be denied further requests until the bucket is refilled at the next interval.
Pros: Memory friendly and simple to use. Cons: A race condition counter may cause an issue in a distributed system due to concurrent requests from the same user.
You can explore the sample implementation for the token bucket using this link.
We can understand the leaky bucket algorithm idea using this analogy: Suppose there is a bucket with a hole in the bottom. Water (API requests) flows into the bucket (queue), but it can only drain out through the hole at a fixed rate. If water flows into the bucket faster than it drains out, eventually, it will overflow.
So, this algorithm uses a queue, which can be thought as a bucket that holds the requests. This queue is used to control that flow: requests are entered into the queue as they arrive (analogous to water being added to the bucket). These requests are then removed from the queue (first come, first served) at a fixed rate (comparable to water leaking from the bucket). If the queue is full (or leaked), any incoming request is rejected until there is space in the queue. Overall, this algorithm ensures that output is processed at a fixed rate.
If 3 requests/user is allowed and the queue size is 3, the API Rate Limiter will block the 4th Request.
In this algorithm, we will keep a bucket per unit time window and a counter to track the number of requests made in that time window. What is the meaning of the time window here? Suppose you have a fixed time window of 1 minute, it means that the rate limit applies within each minute interval.
Here we also set a threshold for the number of requests allowed within each time window. When a request is made, the algorithm increments the counter. Once the value of the counter reaches the value of the threshold, further requests are either delayed or rejected, until the start of the next time window when the counter resets.
In the above diagram, a new request is seen at 00:00:10. So we get the count from the current bucket 00:00-00:01 (which is 0) and check if processing one more request exceeds the permissible limit of 3. If not (which is the case here), the bucket's count is incremented by 1 and the request is allowed to be processed. When a request arrives at 00:00:40 and the window count is 3, no request is authorised and an exception is triggered.
Space complexity: O(1) for storing the count for the current window. Time complexity: O(1) for get and simple atomic increment operation.
Pros:
The algorithm maintains a log of request timestamps. This log can be implemented using a data structure like a queue or a circular buffer. One of the best ideas is to store these timestamps in a cache for faster retrieval and manipulation of the data.
For example, suppose the rate limiter allows 2 requests per minute, and the first request arrives at 1:00:00. Now, the second request arriving at 1:00:20 is allowed because the log has a size of 2, and it falls within the time range of 1:00:00 - 1:01:00. Log = [1:00:00, 1:00:20].
Space complexity: O(Max requests seen in a window time). In a time window, all of the requests’ timestamps are saved.
Time complexity: O(Max requests seen in a window time) — A portion of timestamps is deleted.
Pros: It works flawlessly
Cons:
The low processing cost of the fixed window algorithm is combined with the increased boundary constraints of the sliding log in this hybrid approach. To begin, we track a counter for each fixed window, similar to the fixed window technique. To reduce surges of traffic, we account for a weighted value of the previous window’s request rate based on the current date.
We keep track of each user’s request counts by using various fixed time windows, such as 1/60th the size of our rate limit’s time frame. If we have a minute rate limit, for example, we can record a count for each second and calculate the sum of all counters in the previous minute when we get a new request to determine the throttling limit.
Space Complexity: O(number of buckets).
Time Complexity: O(1).
Pros:
Cons:
Thanks to Navtosh Kumar for his contribution to creating the first draft of content. Please write in the message below if you find anything incorrect, or if you want to share more insight. Enjoy learning, Enjoy algorithms!