Designing a Reliable Architecture for the Cloud
Alright alright, this article might seem a little late. Although the cloud has been around for a decade and many people are perfectly comfortable with it, a lot of my corporate clients (you might be one of them!) are still used to traditional servered computing and are only now starting to consider the cloud for new projects. Don’t be embarrassed, the future welcomes you! If you follow these general rules, you should have a solid footing for any cloud-based project development.
Start with your head in the clouds
The best way to develop a cloud-native application is – surprise – to actually start the design of the application with the cloud in mind. It’s always a great disappointment to develop an entire application that works fine locally but fails for various reasons when attempting a cloud deploy. If you’ve ever tried to push an application that runs on a single traditional server to the cloud, you know the error messages I’m talking about: inaccessible ports, unresolved hostnames, database connection errors because localhost ain’t exactly a thing in the cloud. When scaling, you’ll run into even more issues: missing files, unreliable caching, data incongruity, chatty and insecure communication between nodes and services. If we start with the cloud in mind, we can mitigate some of these problems and prevent major rewrites of applications to retrofit them to the cloud.
Note: Designing for the cloud on day one isn’t always an option. Whether your application is so old that it pre-dates cloud computing, or it was a budget-less bootstrapped MVP that worked fine on traditional servers, or the choice isn’t yours to make (a common scenario with corporations beholden to slow-moving IT departments), sometimes it’s difficult to prepare for the cloud at the onset of a project. Regardless, you can still gain some value from the following recommendations.
Embrace change and relinquish some control
True story: I have a friend who has refused to update his Perl-based application at all. His operating system, runtime, database, and coding standards are all terribly out of date. When I say terribly, I mean in excess of 15 years out of date. He refuses to update because, as he says: “the system works, I don’t want it to break.” Since the moment he got it live, he hasn’t changed the infrastructure at all. Because he was operating on traditional servers (physical servers that he owned and maintained in our shared office space), he was able to make that choice. In cloud computing, you shouldn’t count on being in total control.
In the cloud, everything can change, and often does, whether you like it or not. By choosing AWS, Google Cloud, or IBM Cloud as your cloud provider, you are trading some control for a lot of convenience. The exact configuration you’ve used in the past might not be available in the cloud. I always recommend that businesses considering the cloud take a look at the server, storage, and networking practices of their preferred cloud provider.
Once you’ve established that the IaaS provider you’re choosing offers a configuration that works for you, you need to be prepared for the multitude of things that can change without notice in your application. This may sound scary, but it’s actually a good thing because it allows for the scaling and flexibility that the cloud is known for. Some initial things that you should absolutely expect to change:
- Database connection details
- IP addresses and hostnames
- Environment variables generated by AWS products
- Occasionally OS, runtimes, and default packages
In all cases, AWS has a way to make sure you don’t need to think about these changes. For example, when scaling automatically via Elastic Beanstalk, database connection details are passed to an application as environment variables. All you need to do is not hardcode values, but reference the environment variables that will contain them.
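As a minimal sketch of "reference, don't hardcode," here is how an application might read its database settings at startup. The `RDS_*` names follow the convention Elastic Beanstalk uses for an attached RDS database, but check what your own environment actually exposes. The `load_db_config` helper and the default port of 3306 (MySQL) are assumptions made for this illustration.

```python
import os


def load_db_config(env=None):
    """Build database settings from environment variables instead of hardcoded values.

    `env` defaults to the real process environment; passing a dict makes the
    function easy to exercise locally. A missing required variable raises
    KeyError, so a misconfigured instance fails loudly at startup.
    """
    env = os.environ if env is None else env
    return {
        "host": env["RDS_HOSTNAME"],
        "port": int(env.get("RDS_PORT", "3306")),  # assumed default: MySQL
        "name": env["RDS_DB_NAME"],
        "user": env["RDS_USERNAME"],
        "password": env["RDS_PASSWORD"],
    }
```

Because nothing here is tied to a specific host, the same code keeps working when the provider swaps the database endpoint out from under you.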
Decouple as much as possible and “design for failure”
Decoupling simply means making the individual services or components of your application as self-contained and independent as possible. A common strategy for a decoupled implementation is building REST-based layers rather than directly accessing internals. Using a recent application I completed as an example, I identified the following “functions” that were part of a single monolithic PHP class that could be broken down into individual services for migration to a cloud-based architecture:
- User login and authentication – because users could log in via the app itself or a web-based portal, it made sense to have this be a single “User management” service.
- Orders API – used to create and retrieve orders for prints of uploaded photos.
- Photo processing – as soon as users uploaded a photo, a variety of functions needed to be performed: ensure it was an acceptable format, crop, create a scaled copy, and convert to jpg.
- Printing API – calls to the external printing service used to be included with other business logic. By separating it into its own REST API, we were able to create a stable service that wasn’t as affected by changes to other business logic.
- Database layer – by abstracting the database interface with a REST API, we were able to ensure reliability when scaling across multiple database instances.
By separating these services, I was able to ensure that a failure in one service didn’t affect other parts of the application. If our printer’s API stopped responding, or cropping failed on an unsupported file type, other parts of the application could still work. Of course, fault tolerance doesn’t mean ignoring errors. You should still be diligent about capturing errors, queuing events for when services are brought back online, and displaying useful error messages to users when something does fail. AWS SQS comes in handy when developing this sort of infrastructure, which requires chaining services or components together.
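To make "design for failure" concrete, here is a rough sketch of the capture, queue, and inform pattern for the printing service. It is not the code from the project: `OrderDispatcher`, `PrintServiceUnavailable`, and the user-facing messages are hypothetical, and the in-memory `deque` is only a stand-in for a durable queue such as SQS.

```python
from collections import deque


class PrintServiceUnavailable(Exception):
    """Raised when the (hypothetical) printing service cannot be reached."""


class OrderDispatcher:
    def __init__(self, send_to_printer):
        # send_to_printer: callable that may raise PrintServiceUnavailable
        self.send_to_printer = send_to_printer
        # In-memory stand-in for a durable queue such as SQS
        self.pending = deque()

    def submit(self, order):
        """Try to print now; if the printer is down, queue the order and say so."""
        try:
            self.send_to_printer(order)
            return "Your order was sent to the printer."
        except PrintServiceUnavailable:
            self.pending.append(order)
            return "Printing is delayed; your order is saved and will be sent shortly."

    def retry_pending(self):
        """Resend queued orders oldest first; stop at the first failure."""
        sent = 0
        while self.pending:
            try:
                self.send_to_printer(self.pending[0])
            except PrintServiceUnavailable:
                break  # still down: keep the remaining orders queued
            self.pending.popleft()
            sent += 1
        return sent
```

The rest of the application only ever sees a friendly message, and nothing is lost while the printer is offline. In a real deployment the queue has to live outside the instance, or the pending orders disappear with it.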
Going stateless and assuming you won’t know
Ah state, the lovely principle that keeps developers on their feet. In my day, we had sessions and cookies to manage state, and that was about it. In many ways, the server managed state for you. With cloud computing, though, services or instances often have zero native recollection of who initiated something that could be defined as a “session.”
Being stateless means, again, planning for change. In particular, it means paying attention to three very specific differences between traditional and cloud computing.
- The files written to a filesystem after runtime are not necessarily reliable. Because new instances of a service only spin up the file structure defined in their template, they won’t “copy” over things like user-uploaded files. Using local filesystem references in this sort of situation will result in occasional broken references. For this reason, prepare for files in a stateless environment by moving as many as you can to cloud object storage. The same applies to logging – never log to the filesystem. Instead, choose an external store.
- In-memory sessions and caching don’t work well in cloud computing. Because an instance may die or instantiate unpredictably, you should anticipate building a caching layer that stores and fetches a unified cache from a remote source, like a database. Memcached and Redis are two popular caching mechanisms supported in the AWS and Google Cloud environments.
- Although not directly related to state, keep in mind that redundancy often means persistence even when you don’t expect it. Let’s say you need to change or remove a resource from your website. In a cloud computing environment, not only do you need to think about local browser caching but you also need to think about propagation between availability zones. Just because it has been changed or deleted, it might still exist in a previous form on another zone until the changes propagate.
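The caching point above can be sketched as a small cache-aside helper. The important part is that the store is passed in rather than living inside the instance. `InMemoryStore` is only a local stand-in I made up so the example runs by itself; in practice you would swap in a client for Redis or Memcached with equivalent get and set operations.

```python
import time


class InMemoryStore:
    """Local stand-in for a remote store such as Redis or Memcached."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)


def get_or_compute(store, key, compute, ttl_seconds=60):
    """Cache-aside: return the shared cached value, or compute and store it.

    A result of None is treated as a miss and is never cached.
    """
    cached = store.get(key)
    if cached is not None:
        return cached
    value = compute()
    store.set(key, value, ttl_seconds)
    return value
```

Any instance that is handed the same remote store sees the same cache, so it no longer matters which instance handles a request or how long that instance lives.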
Automate as much as possible
Although not explicitly a feature of cloud computing, it’s best to plan for as much automation in the cloud as possible. From deploying to provisioning to scaling to billing, almost all aspects of your cloud infrastructure can be automated without much effort from you. With the exception of CI/CD, which can technically be done entirely within AWS or Google Cloud but is more likely to involve external services such as Jenkins or GitHub/GitLab that may live outside of the cloud, all automation can be configured within your cloud computing provider. This is extremely liberating, as long as you don’t fall into the common pitfall:
Automation does not mean you can stop thinking
While reviewing a recent client’s bill, I saw they were paying an extraordinary amount for what seemed to be a pretty small Lambda function. When I talked with the developer who wrote the application calling the Lambda function, she admitted that she was calling it more often than necessary because she knew it would scale automatically and she would “rather be safe than sorry.” Don’t get lazy.
Although your cloud provider is on the hook for the security of the servers themselves, you are still responsible for securing your application. Thankfully, many of the best practices from traditional computing are a great starting point. However, there are some key concepts that need to be understood to properly secure an application in the cloud:
- Decoupling our application means that services need to allow inbound traffic without being so open that they are vulnerable to abuse. You want to ensure that your networking rules are as restrictive as possible, to only allow traffic between the services that need access to them.
- Have a rigid IAM policy. Maintain roles and groups, and ensure each service is granted only the minimum level of access it needs.
- Don’t let services “chat” with each other without encrypting the traffic and authenticating that the caller is allowed to communicate with the service. Something as simple as JWT could save you lots of headaches.
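To illustrate the authentication half of that last point, here is a sketch of one service signing a request and another verifying it before acting on it. It is a simplified stand-in for JWT, using an HMAC over a JSON body with a shared secret, and is not a JWT implementation. The `sign_request` and `verify_request` names and the five-minute freshness window are my own assumptions. It also only authenticates; encryption would still come from running the traffic over TLS.

```python
import hashlib
import hmac
import json
import time


def sign_request(payload, secret, now=None):
    """Wrap a payload with an issue time and an HMAC-SHA256 signature.

    `secret` is a bytes value known only to the two services.
    """
    issued_at = int(time.time() if now is None else now)
    body = json.dumps({"payload": payload, "iat": issued_at}, sort_keys=True)
    signature = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_request(message, secret, max_age_seconds=300, now=None):
    """Return the payload if the signature is valid and fresh; else raise ValueError."""
    expected = hmac.new(
        secret, message["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    if not hmac.compare_digest(expected, message["signature"]):
        raise ValueError("invalid signature")
    data = json.loads(message["body"])
    current = time.time() if now is None else now
    if current - data["iat"] > max_age_seconds:
        raise ValueError("request expired")
    return data["payload"]
```

A service like the printing API would call `verify_request` on every incoming request and refuse anything that raises, so knowing the service’s address is no longer enough to use it.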
Using the printing application example above, if we were cavalier about security, we could have deployed our printing API, opened up HTTP traffic, exposed it via a public IP, and assumed any request to print was legitimate and should be honored. If a ne’er-do-well discovered the service’s IP and could imitate the requests, they could print as many prints as they wanted, which could get costly for our client. Because the print logic was previously built into our internal application, the developers never had to think about “securing” it. With cloud computing, you need to think about it.
Cloud computing is a remarkable leap, but it comes with very specific considerations. These considerations might seem intimidating, but I promise they’re easy once you’ve got a grasp of the concepts. If you need help understanding, send me a message and I’ll be glad to help.