Disclaimer: I am fully aware this method is horrendously inaccurate. It is, however, quick, and being rigorously accurate when calculating memory and CPU usage is complicated. And it doesn’t really make much difference given how coarse-grained Azure’s VM SKUs are.

Another disclaimer: all of this assumes that your workload is insensitive to scale-out/scale-in events. If scaling out or in causes unacceptable latency spikes or similar disruptions, you’ll want to optimize your thresholds for performance rather than cost.

How to Determine Your Scaling Thresholds in Azure

This applies to App Service plan instances, VMs, and any other Azure resource built on the stock Azure VM SKUs. It also assumes a general workload that’s constrained by CPU and memory, not some kind of I/O- or GPU-intensive workload.

Scale Out: Don’t Pay for Compute Capacity You’re Not Using

The great benefit of cloud hosting is the ability to scale out and in as load changes. The old-school rubric of over-provisioning just in case of a load spike is no longer relevant when Azure will seamlessly scale for you.

Ideally, all your systems would be running at 100% all of the time, because then you’re using 100% of the CPU/memory you’re paying for. The problem is that it can be tricky to tell the difference between “100% resource usage because everything is perfectly right-sized” and “100% resource usage because the system is overloaded and barely responding”. Also, the coarse granularity of resource metrics in Azure means it’s hard to detect CPU/memory spikes at sub-one-minute windows.

So the rule of thumb is that you should aim for 80% CPU/memory usage as a steady state. This gives headroom for short-term spikes and ensures you’re using most of the resource.

Azure is much faster at scaling out than it used to be; scale-outs can happen in under 30 seconds now. At the other end, five minutes seems to be the upper bound. VM scale sets may take longer if they require custom provisioning (don’t do that; use Packer). So your scale-out rule should look like this:

If (CPU OR Memory) usage >= 80% for >= 5 minutes, increment instance count by 1

If you have a load spike, you will have at most ten minutes of degraded performance (five minutes for the rule to trip, plus up to five minutes for the new instance to come online) before the load gets spread across more instances.
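If it helps to see the logic spelled out, here’s a minimal sketch of that rule in Python. The sample lists and one-minute granularity are my assumptions for illustration; in Azure you’d express this declaratively as an autoscale rule rather than write code:

```python
SCALE_OUT_THRESHOLD = 80.0  # percent
WINDOW = 5                  # five consecutive one-minute samples

def should_scale_out(cpu_samples, mem_samples):
    """Scale out if CPU *or* memory has been at or above the threshold
    for the entire five-minute window. Samples are one-minute averages,
    most recent last (hypothetical inputs for illustration)."""
    cpu_hot = len(cpu_samples) >= WINDOW and all(
        s >= SCALE_OUT_THRESHOLD for s in cpu_samples[-WINDOW:])
    mem_hot = len(mem_samples) >= WINDOW and all(
        s >= SCALE_OUT_THRESHOLD for s in mem_samples[-WINDOW:])
    return cpu_hot or mem_hot

# Five minutes of hot CPU trips the rule even though memory is idle:
print(should_scale_out([82, 85, 83, 81, 84], [40, 41, 39, 42, 40]))  # True
```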

Scale In: Put Your Toys Away When You’re Done Playing With Them

This gets more complicated because the threshold will vary depending on the minimum instance count you’ve set, and here’s why.

Suppose you have a minimum instance count of four, and they’re stationkeeping at 80% CPU usage. That’s 4 x 80 = 320 percentage points of CPU load (yes, I know that’s not exactly how CPU usage works; see disclaimer). If we scale out to five instances - say, because of a sustained load of 81% - that’s an average load of 64% per instance (320 / 5). So we could set a scale-in threshold of 60%; if load drops below that, we can run the load on only four instances.

But what if we have a minimum instance count of two? 2 x 80 = 160 percentage points; if the set scales out to three instances, the load per instance is 160 / 3 = 53%. That’s much lower than 60%; if we set a threshold of 60% the set would be constantly scaling out and then scaling in immediately after (this is sometimes referred to as “flapping”). A better scale-in threshold for this set would be 50% load.
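The arithmetic behind both examples is easy to sanity-check in a few lines (a toy calculation, nothing Azure-specific):

```python
def load_after_scale_out(instances, steady_pct=80.0):
    """Per-instance load after adding one instance to a set
    that was stationkeeping at steady_pct."""
    total_points = instances * steady_pct  # e.g. 4 * 80 = 320
    return total_points / (instances + 1)  # spread over one more instance

print(load_after_scale_out(4))  # 64.0   -> a 60% scale-in threshold is safe
print(load_after_scale_out(2))  # 53.33… -> 60% would flap; use 50% instead
```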

If you have a wide range between your minimum instance count and the count you need to handle your peak loads - perhaps you have an extremely cyclic load profile - you’ll notice that the appropriate scale-in threshold will be higher the more instances you start with. You want to avoid flapping but you also want to scale in when you’re not actually using the extra instances.

How long the load should be below the scale-in threshold before you actually scale in is highly dependent on your load-over-time profile, but a good rule of thumb is to start with a much longer duration than the scale-out rule and tune from there. So your scale-in rule should look like this:

If (CPU AND Memory) usage <= n% for >= 15 minutes, decrement instance count by 1 (where n = (80 * min instance count) / (min instance count + 1), rounded down to the nearest ten percent)
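Here’s that formula as code, alongside a matching scale-in check (a sketch: the function names and the 15-sample window are mine, mirroring the rule above):

```python
def scale_in_threshold(min_instances, target_pct=80.0):
    """n = (80 * min instance count) / (min instance count + 1),
    rounded down to the nearest ten percent."""
    raw = (target_pct * min_instances) / (min_instances + 1)
    return int(raw // 10) * 10

def should_scale_in(cpu_samples, mem_samples, min_instances, window=15):
    """Scale in only when *both* CPU and memory have stayed at or below
    the threshold for the whole window (AND, versus the OR for scale-out)."""
    n = scale_in_threshold(min_instances)
    cpu_cool = len(cpu_samples) >= window and all(s <= n for s in cpu_samples[-window:])
    mem_cool = len(mem_samples) >= window and all(s <= n for s in mem_samples[-window:])
    return cpu_cool and mem_cool

# The threshold climbs toward 80% as the minimum instance count grows:
for m in (2, 3, 4, 9):
    print(m, scale_in_threshold(m))  # 2 -> 50, 3 -> 60, 4 -> 60, 9 -> 70
```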

Note that we scale out when either CPU or memory is spiking, but only scale in when both are below the threshold. You want to be very quick to scale out, but cautious about scaling in, both for performance reasons and to avoid flapping.