Service quotas and spike patterns as part of multi-tenant services’ SLAs
In his famous book “High Output Management”, Andrew Grove (Intel’s co-founder and former CEO) explains that quality checks can sometimes stop the whole organization. For example, if you put a quality check on the raw material coming into a factory, and a market change means suppliers can no longer meet those quality standards, then your whole factory stops. In that case managers should discuss what degradation of the product is acceptable, unless reliability is affected. Reliability is untouchable even if it means stopping the whole organization, because you can never know the consequences of low reliability. The example he gives is an unreliable component that goes into a pacemaker: if the defect is caught at the manufacturer, the cost is a replacement, but if it fails after the pacemaker is implanted, the consequences are far more than financial.
Most industries produce goods with higher reliability than software, so it is tempting to say that reliability compromises are unacceptable for hardware but tolerable for software. However, the basic principle still applies: you are unable to assess the overall consequences of low reliability.
And here come SLAs. An SLA basically says: I, the software service provider, can deliver this level of reliability, and the consequences of that level of reliability have to be assessed by you, the user, because I cannot assess them.
A more explicit example is the severity classification of outages used by companies like Amazon. A Sev1 outage affects amazon.com retail operations for most or all users: the basic ability to buy on the site, or a major part of it, is down. A Sev2 outage affects some customers or some operations, but not a major portion of them. A Sev3 outage is annoying to some customers but does not cause serious problems.
Then, when you build a service, you classify its severity: how reliable it is based on the severity of outages it is allowed to cause. S3 is a Sev1 service, which means parts of Amazon’s business that could generate a Sev1 outage are allowed to run on top of S3. A newly created service usually starts at a low severity level and, with time (usually years) and maturity, is upgraded until it reaches Sev1.
So this is another form of shifting the reliability problem to the user. If I build a service and claim that it is a Sev2 service, the developer using it is responsible for classifying what types of outages they may generate and deciding whether or not to use my service. If a developer on the team building the shopping cart, for example, starts using my service, that is the developer’s problem. It is another explicit way of shifting the assessment of the consequences of a low-reliability product to the user.
So if you are building a multi-tenant service, a service with many users where the usage patterns of user A (excess traffic, spikes) should not affect the service quality perceived by user B, how do you do capacity planning? The short answer is: do rate limiting. But what should the rate limiting restrictions and the service’s performance promises look like? How do you describe what is accepted under what usage conditions, and put that into your capacity plans and service description or SLAs?
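As a rough illustration of what per-tenant rate limiting means in practice, one common technique is a token bucket. Here is a minimal sketch in Python; the tenant names and numbers are invented, and a production limiter would keep these counters in a shared store (Redis, for example) rather than in process memory:

```python
import time

class TokenBucket:
    """Minimal per-tenant token bucket: `rate` requests/sec steady state, bursts up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate              # tokens added per second (steady-state quota)
        self.burst = burst            # maximum bucket size (allowed spike)
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # caller should throttle, e.g. return HTTP 429

# One bucket per tenant: tenant A's spike drains only tenant A's bucket.
buckets = {
    "tenant-a": TokenBucket(rate=100, burst=200),
    "tenant-b": TokenBucket(rate=10, burst=20),
}
```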
In the cloud we find the concept of service quotas. Cloud providers basically say: for this service we can give you this many requests per minute, day, month, and so on. If you want to exceed this quota, you file a quota increase request. The interesting part is that there are no guarantees on how long processing that request will take. If you ask to increase your Bigtable quota, the support engineer at Google may have to contact the Bigtable team, who would do capacity testing in the regions you run in, add more servers, and so on, before raising your quota.
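On AWS, for instance, quota increase requests are themselves API calls. A minimal sketch using boto3 and the Service Quotas API follows; the service code and quota code are placeholders, not real values, and credentials are assumed to be configured:

```python
import boto3

# A minimal sketch: ServiceCode/QuotaCode below are placeholders; real codes
# come from list_services() / list_service_quotas().
client = boto3.client("service-quotas")

current = client.get_service_quota(ServiceCode="elasticloadbalancing",
                                   QuotaCode="L-EXAMPLE123")
print("current value:", current["Quota"]["Value"])

# File the increase request; note there is no SLA on when (or whether) it is granted.
resp = client.request_service_quota_increase(
    ServiceCode="elasticloadbalancing",
    QuotaCode="L-EXAMPLE123",
    DesiredValue=2 * current["Quota"]["Value"],
)
print("request status:", resp["RequestedQuota"]["Status"])  # typically PENDING
```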
Even for auto-scaling services there is a limit on spikes. You would expect Amazon’s Elastic Load Balancer to handle any traffic spike (it has the word “elastic” in its name), but even there the spike is bounded: by default, an elastic load balancer does not support a spike in traffic of more than 50% in less than 5 minutes. If you expect your traffic to exceed this pattern, you have to file a quota increase request beforehand. Again, no promises on how long processing your quota increase request takes.
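If you want to check whether your own traffic history fits within such a spike allowance before relying on the defaults, a back-of-the-envelope check might look like the sketch below. The 50% / 5-minute values mirror the rule mentioned above; the function itself is only an illustration:

```python
def within_spike_allowance(requests_per_minute: list[float],
                           max_growth: float = 0.5,
                           window_minutes: int = 5) -> bool:
    """Check that traffic never grows by more than `max_growth` (e.g. 50%)
    within any `window_minutes`-long window of the per-minute series."""
    for start in range(len(requests_per_minute) - window_minutes):
        baseline = requests_per_minute[start]
        peak = max(requests_per_minute[start + 1:start + 1 + window_minutes])
        if baseline > 0 and (peak - baseline) / baseline > max_growth:
            return False   # this pattern needs a pre-warming / quota increase request
    return True

# e.g. a flash-sale pattern: 1000 -> 1800 req/min within a few minutes violates the rule
print(within_spike_allowance([1000, 1200, 1500, 1800, 1800, 1800]))  # False
```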
Being able to handle spikes means you are over-provisioning; there is no other way around this. The question is how to keep that over-provisioning within an acceptable range.
Just as SLAs shift the evaluation of the consequences of low reliability to the user, service quotas shift the capacity estimation and over-provisioning problem to the user of your service. You give each user a quota on their service account, and you know how many service accounts you have, so you know your capacity needs. If a user asks for more, you get more resources, with no promise on when you will respond to their quota increase request.
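To make that capacity arithmetic concrete, here is an illustrative back-of-the-envelope sketch. The account count, the quotas, and the assumption that only a fraction of tenants burst at the same time are all invented numbers; the point is that once every account has a quota, your worst case is bounded and your provisioning target becomes a choice between the worst case and the typical case:

```python
# Back-of-the-envelope capacity plan from per-account quotas (illustrative numbers).
accounts = 400                 # service accounts you have handed out
steady_quota = 100             # requests/sec each account is allowed at steady state
burst_quota = 200              # requests/sec each account may briefly spike to
burst_concurrency = 0.10       # assumption: at most 10% of tenants spike at the same time

worst_case = accounts * burst_quota      # 80,000 req/s: every tenant bursting at once
typical = accounts * steady_quota        # 40,000 req/s: everyone at steady state
planned = accounts * ((1 - burst_concurrency) * steady_quota
                      + burst_concurrency * burst_quota)   # 44,000 req/s target

print(worst_case, typical, planned)
```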
How to implement rate limiting is beyond the scope of this article. In the cloud there are services that provide quotas and rate limiting out of the box. For example, AWS API Gateway is a facade that controls users’ access to your service based on the service account you give them, applying whatever rate limits and quotas you specify. Google has a similar service, Envoy can integrate with any rate limiting service that implements a specific API, and there are many more options.
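As a concrete illustration, a per-tenant throttle and quota in AWS API Gateway can be expressed as a usage plan attached to an API key. The sketch below uses boto3; the API id, stage name, and key id are placeholders:

```python
import boto3

# A sketch only: the API id, stage name, and key id below are placeholders.
apigw = boto3.client("apigateway")

plan = apigw.create_usage_plan(
    name="tenant-b-basic",
    apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],
    throttle={"rateLimit": 100.0, "burstLimit": 200},   # steady rate and allowed spike
    quota={"limit": 1_000_000, "period": "MONTH"},      # hard monthly cap
)

# Attach the tenant's API key to the plan; requests beyond the limits get HTTP 429.
apigw.create_usage_plan_key(usagePlanId=plan["id"],
                            keyId="tenant-b-api-key-id",
                            keyType="API_KEY")
```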
However, it is not only about the rate limiting implementation; it is also an operational practice and an agreement between you and the consumers of your service.