If you’ve ever listened to Amazon.com’s CTO Werner Vogels speak at a conference, you know that one of his favorite subjects is architecting for failure. In fact, he co-authored a paper that puts it this way: “The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.”
Message Queues as Transmogrifier
Many enterprise architectures make heavy use of message queues; however, their potential goes far beyond simple message delivery. This was an epiphany for me when I sat down to really understand one of our Web Services, Amazon Simple Queue Service, which software developers around the world use to build Web Scale applications. On the surface it sounded pedestrian: the product detail Web page described it as “a reliable, highly scalable hosted queue for storing messages as they travel between computers”.
Looking at the implementation details, I realized two things:
First, message queues are great “impedance matching devices”. In a high-scale Service-Oriented Architecture, you can’t make any assumptions about the speed of the other end of a connection. In any loosely coupled architecture, your application doesn’t know the relative speed of another system component; in fact, it doesn’t even know whether the other end of a communication channel is available at all.
Using a service that guarantees message delivery enables applications to “fire and forget”. The other endpoint will handle the message asynchronously. Maybe in 2 milliseconds; maybe in 2 days. It shouldn’t matter to the sending application (assuming the task doesn’t require synchronous communication). Why, for example, should your customer be forced to wait for the (relatively slow) credit card back-end system to acknowledge that the card is valid? If your application instead puts the validation request on a queue, it can return immediately and generate an email to the customer when there is an exception such as an invalid expiration date.
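To make that concrete, here is a minimal “fire and forget” sketch in Python, assuming the boto3 AWS SDK and a hypothetical queue URL and message format (none of which appear in the original service description):

```python
import json
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue URL for illustration only.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/card-validation"

def submit_card_validation(order_id, card_token):
    """Enqueue the validation and return immediately; the slow
    credit card back end consumes the message asynchronously."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"order_id": order_id, "card": card_token}),
    )
    # The customer's request completes here; any exception (say, an
    # invalid expiration date) is reported later by email.
```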
However, it was my second realization that really helped me understand what “everything fails” means in practice.
An ordinary message queue delivers a message and then immediately deletes it.
This particular message queue system delivers the message and merely marks it as invisible in the list of available messages. That’s because the queue assumes the receiving application may have failed before processing the message. In fact, if the receiving application does not return and explicitly delete the message from the queue within a prescribed period of time, Amazon SQS makes the message available again.
My initial reaction was that the approach seemed inefficient. Then I realized that the implementation amounts to an automatic rollback: if the receiving application fails to delete the message, the message becomes available to try again! The approach eliminates all sorts of exception logic in program code, and is brilliant in its simplicity.
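A consumer built on that model needs almost no retry logic of its own. Here is a hedged sketch, again assuming boto3 plus a hypothetical queue URL and handler; the key point is that a crash anywhere before the final delete simply makes the message visible again:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/card-validation"

def process(body):
    """Hypothetical handler; an exception here (or a server crash) means
    the message is never deleted, so SQS will re-deliver it later."""
    print("validating:", json.loads(body))

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,    # long polling
        VisibilityTimeout=60,  # hidden (not deleted) while we work on it
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Only this explicit delete removes the message for good.
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```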
Virtual Failure
Servers are a bit different from components such as a message queue service.
Virtual servers are no more or less reliable than physical ones, assuming all other variables are equal. However, there is one major difference: when a virtual server fails, you often lose the virtual disk drive attached to it. Think of this as loosely analogous to “RAID 0 with no persistence”, that is, non-redundant disk drives.
At the most basic level this characteristic requires incremental backups to persistent storage. But it’s really an opportunity to think about what “everything fails” means. Because a large pool of virtual servers is essentially “infinite capacity”, I prefer to think of them as transitory computing resources at my command, brought online and offline in minutes. Essentially, that means computing resources are a separate, decoupled pool from data resources: really the SOA model applied to what we traditionally considered hardware.
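As a sketch of that idea, assuming boto3 and a placeholder machine image, a transient worker can be brought online, used, and discarded, while the data stays in persistent storage:

```python
import boto3

ec2 = boto3.client("ec2")

# Bring a disposable worker online for one job, then throw it away.
# The AMI ID and instance type are placeholders; the real point is that
# the data lives elsewhere (a queue, S3, a database) and the server does not.
resp = ec2.run_instances(
    ImageId="ami-12345678",
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]

# ... the instance pulls work from the pool and writes results to
# persistent storage; losing it costs nothing but a relaunch ...

ec2.terminate_instances(InstanceIds=[instance_id])
```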
Academia has been thinking for some time about computing capacity as a resource rather than as hardware. In the Grid Computing community, there is a project known as Condor that assigns “jobs” to resources, matching each job to a resource based on the job’s characteristics. For example, a large financial model might require a 64-bit environment that will be available for ten hours. Another job might be a low-priority inventory analysis program that can be preempted, and requires a 32-bit resource for three days. Condor seems reminiscent of mainframe batch jobs; yet in reality it is all about distributed computing.
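The matchmaking idea can be sketched generically. This toy Python matcher, with invented attribute names, pairs each job’s requirements with a resource’s advertised characteristics, roughly in the spirit of Condor’s approach (it is not Condor’s actual API):

```python
def matches(job, resource):
    """Return True when a resource satisfies a job's requirements."""
    return (resource["arch"] == job["arch"]
            and resource["hours_available"] >= job["hours_needed"])

# The two example jobs from the text, with invented attributes.
jobs = [
    {"name": "financial-model", "arch": "x86_64", "hours_needed": 10},
    {"name": "inventory-scan",  "arch": "x86",    "hours_needed": 72},
]
resources = [
    {"name": "node-a", "arch": "x86_64", "hours_available": 12},
    {"name": "node-b", "arch": "x86",    "hours_available": 96},
]

for job in jobs:
    winner = next((r for r in resources if matches(job, r)), None)
    print(job["name"], "->", winner["name"] if winner else "queued")
```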
Then there’s Hadoop (lucene.apache.org/hadoop/), an open source framework intended to process large datasets on commodity-grade hardware, with the assumption that the hardware will fail. It slices computing tasks into smaller pieces in order to distribute them across multiple computers, any one of which could fail and require that its assignment be re-run. Hadoop really shines when failure occurs, because any given task is small relative to the overall process, which in turn means the “cost” of failure is low.
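The same principle can be imitated in miniature, without Hadoop itself: split the work into small tasks and simply re-queue any task that fails. A toy sketch, with a simulated failure standing in for flaky hardware:

```python
import random

def run_task(chunk):
    """Stand-in for a map task; randomly 'fails' like flaky hardware."""
    if random.random() < 0.1:
        raise RuntimeError("simulated hardware failure")
    return sum(chunk)

# Slice the job into small, independent chunks.
chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
results = {}
pending = list(enumerate(chunks))

while pending:
    idx, chunk = pending.pop()
    try:
        results[idx] = run_task(chunk)
    except RuntimeError:
        pending.append((idx, chunk))  # cheap retry: the task is small

print(sum(results.values()))
```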
Hadoop isn’t just a research project. The New York Times used Hadoop in combination with virtual servers to generate 11 million PDF files in just 24 hours.
How Do You Plan to Fail?
There are as many approaches to application-level resiliency as there are architects who design systems in this manner; only a few are presented here. The point is that designing for failure produces self-healing reliability at an affordable price.
How many components in your architecture are likely candidates for re-architecture with “everything fails” as the guiding principle? Every organization has unique needs, and for that reason there are no ready-made solutions. Fortunately, SOA provides the building blocks needed to create a resilient and scalable architecture.