Infrastructure as a Service (IaaS): common issues and solutions
For many people, the cloud is a magical place where they can find interesting information, store their data, or even run their own online business. As developers, we know that underneath this layer of magic sits cloud infrastructure: a set of servers, disks and networks which just happen to be accessible from the internet. For us, there is no magic. However, once you use cloud infrastructure, there are still many things you wouldn’t expect. For example, have you considered that the cloud… can run out of servers?
Avoiding these pitfalls starts with knowing that they exist, so let’s dive into some of the most common ones.
Instance type unavailability
It’s the holiday season, and your team is launching the development environment of some 25 server instances, as they always do. You think you have perfected the launch sequence. Except this time, you are greeted with an error:
We currently do not have sufficient <InstanceTypeX> capacity in the Availability Zone you requested.
You’ve just hit a cloud capacity issue. This happens when the cloud provider has run out of server instances of a specific type. It is most likely during the holiday season, because large retailers using the same cloud provider are scaling up their services to meet the increased demand from shoppers.
Luckily, this probably doesn’t impact all regions or availability zones (physically separate locations within a region) of a cloud provider.
As a short-term solution, you have several options:
Wait a while and try again later, if the deployment isn’t urgent. Cloud capacity can shift in your favor, so this might work, but it isn’t guaranteed.
Switch your deployment to another availability zone.
Scale down the number of instances that cause the capacity issue.
Deploy using a different instance type. You might not always want to do this, since a different type can change how your platform performs.
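The zone and instance-type fallbacks above can be automated. The following is a minimal sketch, not a provider API: `launch` is a hypothetical callable you would wrap around your own provisioning call, and `InsufficientCapacityError` is an assumed exception standing in for the capacity error shown earlier.

```python
from typing import Callable, Iterable, Optional, Tuple


class InsufficientCapacityError(Exception):
    """Hypothetical error raised when the provider has no capacity left."""


def launch_with_fallback(
    launch: Callable[[str, str], str],
    candidates: Iterable[Tuple[str, str]],
) -> Optional[str]:
    """Try each (zone, instance_type) pair in order of preference.

    Returns the ID of the first instance that launches, or None when
    every candidate is out of capacity.
    """
    for zone, instance_type in candidates:
        try:
            return launch(zone, instance_type)
        except InsufficientCapacityError:
            # Out of capacity here: move on to the next zone or type.
            continue
    return None
```

Ordering the candidates with your preferred instance type first, across all zones, keeps the platform on its usual hardware for as long as any zone has capacity.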
As a long-term solution for critical workloads, start thinking about reserving capacity with your cloud provider.
Some cloud providers (AWS and Azure, for example) allow you to reserve instance capacity without committing to a fixed time period. This is the most expensive form, but also the most flexible.
Some also offer reserved instances or savings plans, where you commit to a number of instances for a fixed period (e.g. one or more years). Your cloud provider might reserve capacity, so that these instances are always available to you (but please, read the small print). An additional benefit is that, in return for the commitment, you pay a reduced price.
Instance retirement
Cloud servers are not much different from regular servers. They run on physical hardware which, over time, can degrade and eventually fail beyond repair. When this happens, your cloud provider will decide to retire the server. They will notify you in advance that they’ll stop or reboot the instance, so that it starts on different physical hardware. This proactively ensures that your applications keep running smoothly, but may require some maintenance on your part.
While instructions to handle instance retirement may vary between cloud providers, there are a few rules that will smooth the maintenance.
First off, if you want to keep the data after having switched your instance to different physical hardware, you should use external block storage for your instance disk. The benefit of external block storage is that you or your cloud provider can detach it from the old instance and attach it to the new instance.
Second, make sure all services and applications start automatically upon instance launch, so that you don’t have to start them manually after the reboot.
Finally, perform or schedule the maintenance before your cloud provider’s automatic operation. This lets you do it in your own preferred time window, with time set aside for preparation work and after-care. Because the instance is already in a degraded state, the earlier you tackle the issue, the better.
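Picking a maintenance moment ahead of the provider's deadline can be expressed as a small scheduling rule. This is a sketch under assumptions: `pick_maintenance_slot`, its parameters, and the two-hour after-care default are all hypothetical and do not come from any provider's tooling.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def pick_maintenance_slot(
    preferred_slots: List[datetime],
    provider_deadline: datetime,
    now: datetime,
    after_care: timedelta = timedelta(hours=2),
) -> Optional[datetime]:
    """Return the earliest preferred slot that still leaves time for
    after-care before the provider's own scheduled stop or reboot."""
    usable = [
        slot
        for slot in preferred_slots
        if now <= slot and slot + after_care <= provider_deadline
    ]
    # Earliest usable slot wins, because the instance is already degraded.
    return min(usable, default=None)
```

A result of `None` means none of your preferred windows fits before the deadline, which is a signal to act immediately or accept the provider's schedule.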
The case of the bad instance
As we’ve seen, instances aren’t guaranteed to remain healthy during their lifetime. Sometimes you’ll notice the effects even before the cloud provider has scheduled an instance retirement. And occasionally, you might even draw a “bad” instance when launching a new one. This can happen when the instance is scheduled on physical hardware that is already in a degraded state. It can manifest itself as low performance, bad network connectivity, or unreliable behavior. Your instance might even become unreachable. It’s not always obvious that the instance is at fault, because your instinct might be to blame your application.
The usual fix is to stop and start the instance, or replace it with a new one, and hope it lands on different physical hardware. Note that on some providers, such as AWS, a plain reboot keeps the instance on the same host.
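When several instances run the same workload, comparing them against each other is one way to tell a bad instance from a bad application. The sketch below is an illustration only: the function name, the latency metric, and the factor of 3 are assumptions, not thresholds recommended by any provider.

```python
from statistics import median
from typing import Dict, List


def suspect_instances(latency_ms: Dict[str, float], factor: float = 3.0) -> List[str]:
    """Flag instances whose latency exceeds `factor` times the fleet median."""
    if not latency_ms:
        return []
    threshold = factor * median(latency_ms.values())
    return sorted(name for name, value in latency_ms.items() if value > threshold)
```

If every instance is equally slow, nothing is flagged, which points towards the application rather than the hardware.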
Account limits
You want to allocate more resources (e.g. instances, disks, object store buckets) and are willing to spend money on them, but the cloud is not allowing you to provision them. Not because it’s out of capacity, but because you’ve stumbled into one of its account limits. These limits exist to protect both your budget and the cloud provider, so that you can never allocate more resources than the limit allows.
They can be cumbersome, as you will run into them exactly when you want to provision new resources, and requesting a limit increase isn’t always instant. Sadly, there’s nothing you can do other than request an increase of the limit or reduce your resource allocation.
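Since limit increases take time, it can help to check a planned deployment against your limits before you start provisioning. This is a minimal sketch with hypothetical inputs: how you obtain the `limits` and `in_use` figures depends on your cloud provider, and the resource names are made up.

```python
from typing import Dict


def quota_shortfalls(
    limits: Dict[str, int],
    in_use: Dict[str, int],
    planned: Dict[str, int],
) -> Dict[str, int]:
    """Return, per resource type, how far the plan would exceed its limit."""
    shortfalls = {}
    for resource, extra in planned.items():
        # A resource without a known limit is treated as unlimited.
        headroom = limits.get(resource, float("inf")) - in_use.get(resource, 0)
        if extra > headroom:
            shortfalls[resource] = extra - headroom
    return shortfalls
```

An empty result means the plan fits. Anything else tells you which limit increase to request, and by how much, before the deployment fails halfway.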
API rate limits
So, you’ve fully automated the deployment of your platform’s cloud resources (e.g. using Terraform). Great! Now you can deploy, upgrade and destroy platforms like there’s no tomorrow! So, using your automation, you start the deployment of 5 different platforms. But a few minutes in, the deployment comes to a screeching halt with a “Rate exceeded” error.
You’ve stumbled into the API rate limits of your cloud provider.
First, don’t use the same API key to deploy all your platforms! It will certainly get you into this kind of trouble. Make sure you use multiple API keys, each with its own rate limit (this only helps if your provider tracks rate limits per key rather than per account). If you still run into rate limits, you might have to request a rate limit increase.
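Beyond the advice above, a common complementary technique is to retry throttled calls with exponential backoff and jitter. The sketch below is generic and makes assumptions: `RateLimitError` is a hypothetical stand-in for whatever your provider's client raises on a “Rate exceeded” response, and the default delays are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's "Rate exceeded" response."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimitError, doubling the wait cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

The random jitter matters when several deployments run in parallel: without it, they would all retry at the same moments and hit the limit together again.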
Disk burst balance
You have this one application which is heavy on reading or writing to disk, primarily when launching it. It has always performed well, but suddenly its performance on launch has become significantly worse. The only thing that has changed is that the amount of data it processes has increased. So you look at the CPU and memory metrics of the instance, but there’s still plenty of room left. How is this possible? The disk is still the same disk as before!
You are probably running into IOPS issues and have depleted the burst balance of your disk.
Cloud disks typically have an IOPS (input/output operations per second) limit, which defines their baseline performance. Some cloud providers allow your disks to burst above this limit temporarily, because they acknowledge that many applications need to read or write heavily for a brief time (e.g. during launch). When your disk bursts above the baseline, it spends credits, which are replenished while your application’s IOPS stay below the baseline. You only have a limited number of burst credits, so if your application has been bursting for an extended time, it might deplete its burst balance. Once that happens, the disk is throttled to its baseline limit, which can cause issues with your application performance.
Monitor the IOPS usage and burst balance of your disk if your cloud provider provides metrics for this. It will reduce the time spent troubleshooting performance issues for applications which use disks heavily. Based on these metrics, you can fine-tune your application’s I/O behavior, or change your volume to another type with more provisioned IOPS.
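A quick calculation shows how fast a burst balance can disappear. The sketch below assumes a simple credit-bucket model in which credits refill at the baseline rate and are spent at the rate of actual I/O. The numbers in the usage note are illustrative only, so check your provider's documentation for the real values of your volume type.

```python
def seconds_until_depleted(
    credits: float, baseline_iops: float, demand_iops: float
) -> float:
    """How long a disk can sustain `demand_iops` before its burst credits run out.

    Assumes credits refill at `baseline_iops` per second and drain at
    `demand_iops` per second, so the net drain is their difference.
    """
    if demand_iops <= baseline_iops:
        return float("inf")  # At or below baseline, the balance never drops.
    return credits / (demand_iops - baseline_iops)
```

With a hypothetical balance of 5,400,000 credits, a baseline of 300 IOPS and an application demanding 3,000 IOPS, `seconds_until_depleted(5_400_000, 300, 3000)` gives 2,000 seconds: roughly 33 minutes of bursting before the disk falls back to its baseline.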
Conclusion
Working with cloud infrastructure isn’t a silver bullet that will solve all your infrastructure problems. It still requires expertise and regular maintenance. While the major cloud paradigms are the same across vendors, every cloud provider’s implementation is still different. Therefore, a solution which works with one vendor might not be available with another.
Where the cloud shines is the ease and speed of deploying new infrastructure, which far outweighs the new challenges you might encounter. As a developer, even after years of working in the cloud, it remains a wondrous experience to launch and tear down platforms of some 40+ server instances almost daily on an automated schedule. As a business, it’s an immense advantage to have an environment where you can quickly experiment by launching new infrastructure, paying as you go, and tearing it down when you don’t need it anymore.