Josh Yotty is the latest Blue Box cadet! Josh brings with him a significant history of data center and client services support. Coupling his extensive background in Perl with a passion for Ruby and cooking with Chef, Josh’s curiosity and thirst to “automate all of the things” will certainly have a huge impact at Blue Box. Beyond the world of bits and bytes, Josh lives and breathes highly efficient diesel cars and tasty ales (think Arrogant Bastard, Georgetown Lucille, Dogfish Head 90 Minute). Additionally, Josh is an avid reader. He’s presently neck deep in Ham on Rye. We’re delighted to have Josh on the team!
Josh Yotty Joins Blue Box!!!
Distributed Systems Design (Part 4/4)
Blue Box is proud to present this series on understanding the basics of distributed systems design. Distributed Systems Design Part One, posted on April 9th, reviewed common terminology, Brewer’s/CAP Theorem, and basic concepts in distributed system design. Distributed Systems Design Part Two covered general guidelines for distributed programming. Distributed Systems Design Part Three discussed partitioning, fail-over, and fail-back events.
Specific technology readiness for use in distributed systems
This section of the document discusses several commonly used pieces of software within site-local clusters and their applicability in a distributed system, as well as their CAP theorem-related predispositions.
MySQL
A MySQL master-slave relationship, like all simple master-slave relationships, is capable of having only one authoritative “master” at a single site.
MySQL’s standard replication procedure follows an eventual-consistency model, meaning the database remains available during a partition event, but the data between master and slave will temporarily be out of synch until the partition is corrected and the slave is able to catch up to the master.
A MySQL master-master relationship set-up is actually capable of running in active-active mode with a couple of caveats:
- You must have auto_increment_increment and auto_increment_offset set correctly or replication will almost certainly break quickly and unrecoverably.
- You must examine your schema for any unique non-auto-increment indexes. If uniqueness of the field is absolutely necessary, it implies a business decision favoring consistency over availability. This means you can only do inserts / updates to this field on one authoritative master server across your whole distributed system at a time.
- Be careful about “update” queries. The last update wins and, therefore, it is possible for data to get out of synch between sites even under normal running conditions due simply to normal network latency. In general, updates should only be run against a single authoritative master in the distributed system at a time.
- Despite all of the above, it’s still entirely possible to get the masters out of synch in subtle ways if arbitrary updates/inserts are being run on both masters at the same time. The rule of thumb should be to run write-type queries on a single master in the distributed network at a time. The exception is to allow inserts / updates anywhere only when you can be absolutely certain that data inconsistencies will not adversely affect your application.
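To make the first caveat concrete, here is a minimal sketch of why the two settings keep auto-increment keys from colliding. It is a simplified model rather than MySQL itself: it assumes each master simply hands out offset, offset + increment, offset + 2 × increment, and so on, and the helper names are ours.

```python
from itertools import islice


def auto_increment_ids(offset, increment):
    """Yield the IDs one master would hand out under this simplified model."""
    next_id = offset
    while True:
        yield next_id
        next_id += increment


def settings_for_masters(master_count):
    """One (offset, increment) pair per master: offsets 1..N, increment N."""
    return [(offset, master_count) for offset in range(1, master_count + 1)]


# Two masters: the ID sequences interleave and never overlap.
for offset, increment in settings_for_masters(2):
    ids = list(islice(auto_increment_ids(offset, increment), 5))
    print(f"auto_increment_offset={offset} "
          f"auto_increment_increment={increment} -> {ids}")
```

With two masters this prints the odd IDs for one master and the even IDs for the other. If both masters were left at the same offset, the sequences would be identical and the first concurrent insert would produce a duplicate key.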
In any case, as part of the health check of a given site, you should make sure to monitor the status of the replication relationship. Any site whose master either gets disconnected (when a partition isn’t happening) or too far behind the other master(s) in the relationship should generally return a ‘fail’ to health checks of the database.
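As a rough sketch of that health check, assume something else has already collected the replication state (for example from the Slave_IO_Running and Seconds_Behind_Master fields of SHOW SLAVE STATUS) and knows whether a partition is in progress. The ReplicationStatus tuple, the function name, and the 60-second threshold below are all hypothetical.

```python
from collections import namedtuple

# connected: is the replication link to the other master up?
# seconds_behind: replication lag in seconds, or None when it is unknown.
ReplicationStatus = namedtuple("ReplicationStatus", ["connected", "seconds_behind"])

MAX_LAG_SECONDS = 60  # hypothetical threshold; tune for your application


def database_health(status, partition_in_progress, max_lag=MAX_LAG_SECONDS):
    """Return "ok" or "fail" for the site's database health check."""
    if not status.connected:
        # A broken link is expected during a partition; otherwise it is a failure.
        return "ok" if partition_in_progress else "fail"
    if status.seconds_behind is None or status.seconds_behind > max_lag:
        # Connected, but too far behind (or lag unknown): do not trust this master.
        return "fail"
    return "ok"
```

Treating an unknown lag as a failure is a deliberately cautious choice on our part; an application that favors availability might decide otherwise.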
The active-active pattern for distributed systems works so long as any sensitive updates to the database only ever happen on one master in one site at a time.
PostgreSQL
At the time of this writing, PostgreSQL’s built-in replication models are not capable of running in master-master mode. There are some technologies, like proxy solutions, which are capable of making a series of PostgreSQL servers act as if they are in a master-master relationship, but we suspect they do not react well to partitioning at this time.
PostgreSQL 9’s standard replication model (i.e. streaming WAL updates) follows the eventual consistency model much like the MySQL master-slave relationship does. PostgreSQL 9 is also capable of running in a synchronous replication mode, which favors consistency over availability. This is not desirable for most applications, as a partition means update queries will hang indefinitely (i.e. become unavailable).
The active-active pattern for distributed systems works with PostgreSQL only insomuch as all insert / update queries happen on one server in one site at a time. Network latency for remote sites becomes a serious factor with this model.
Memcache
Memcache has no built-in concept of replication. Any kind of data synchronization between sites must happen in scripts or in-application functionality you write. The application developer has complete control over the consistency versus availability question when partitions happen.
If consistency is required, we recommend flushing the memcache on the fail-back event in any case.
Apache / Nginx / Passenger / Unicorn
As none of Apache, nginx, passenger or unicorn inherently have any concept of state built in, “consistency” is not a problem. These technologies are perfectly safe to use without having to worry about fail-over and fail-back transitions.
Having said this, it’s extremely unlikely that your application code or the external modules used with Apache or nginx don’t use any kind of state information. While the above pieces of software in their simplest form don’t need state, your application is probably using state to do what it needs to do. What matters is not so much what these pieces of software do as what they probably touch.
DRBD / HA-NFS
DRBD enforces a synchronous update mode until the other node in the DRBD pair is thought to be down. At this point, the surviving node assumes it is the only survivor and goes into “master” mode, if it wasn’t already there.
Taken in light of a distributed network which experiences partitions, without some other form of intervention, the DRBD servers will go into “split brain” mode which is nearly impossible to resolve without simply dumping one half of the cluster or the other. Locally, we use direct crossover cables to minimize the chances of unexpectedly entering split-brain mode. This does not work over a 2000-mile network.
The other main feature of DRBD is it requires a very large amount of network bandwidth to do synchronization between servers in the pair.
Due to these reasons, an inter-site DRBD relationship in a distributed system is not recommended in any circumstance. Instead, we recommend evaluating other solutions which are capable of doing eventual consistency. These include Amazon S3, rsync between site-local DRBD services, application code to push updates to both sites in an asynchronous manner, Hadoop, and Riak.
Since HA-NFS solutions utilize DRBD disks in the underlying technology stack, the same restrictions apply to them as DRBD.
CDNs and other caching technologies
CDNs and other caching technologies are great solutions to use in all cases where strict consistency is not necessary and when serving slightly stale data is not a showstopper for your application.
The nature of any cache is to favor availability over consistency.
It is generally a good idea to flush caches as part of the fail-back process.
Redis
Redis’ replication model resembles a simple master-slave relationship. You will more than likely treat a Redis store exactly as you would a PostgreSQL master-slave relationship.
The problem is Redis is generally used for its speed, and therefore the added latency when used with the active-active pattern generally becomes a showstopper for the single-master Redis approach. If you need to run using this model consider using asynchronous processing and a caching layer, which, yes, basically defeats the purpose of using Redis in the first place.
Hadoop, Riak, and other “eventually consistent” cloud technologies
Since most of these types of systems were designed to work in a distributed system, they usually have the ability to adapt well to being used in… a distributed system. Said another way, in going through the work to make your application work around some of the bizarre restrictions and behaviors of these systems, you’re necessarily going to make your application more tolerant of the eventual consistency model.
The important thing to note about these technologies is they are almost all universally written with the idea that availability is far more important than consistency. They all follow the “eventual consistency” model. As such, if strict consistency is required for any aspect of your application, you should not use one of these technologies for that particular data.
The exception to this is the eventually consistent models which allow for the idea of a “strictly consistent” read or write. These are generally far slower to do, and will favor consistency over availability. Strictly consistent operations will hang or return an error when a partition exists.
Beyond this, it is very important you understand the specific foibles and restrictions of these technologies, especially if an entire data center of cluster nodes is unexpectedly partitioned from the rest of the distributed system. Pay very close attention to how these technologies handle the fail-back transition and remember cross-continent bandwidth is a precious commodity.
Mongo
Mongo has some very interesting capabilities when it comes to data replication and distribution. In general there are two modes of operation.
(Legacy) master-slave mode acts like any other database in master-slave mode and the same restrictions apply as apply to PostgreSQL master-slave or Redis master-slave. If your data set is massive, as is almost always the case for Mongo, beware the enormous bandwidth hit when resynchronizing as part of the fail-back event.
The newer, recommended form of Mongo replication involves using “replica sets.” This allows for automatic sharing, scaling, failure detection and correction, redundant storage of data, and other features. It even has the ability to be data center aware so no data center is completely cut off from data during a partition event. In reality, this is still a form of master-slave replication, meaning all writes to the database still need to pass through an elected master server. This master can live in either data center when there is no partition. Again, because of latency realities in a distributed system this can lead to inconsistent performance.
Due to the way election happens in a Mongo replica set, writes can only happen when quorum can be reached. This reveals the implied business decision of consistency over availability, meaning for writes, at least, this is not an eventually-consistent model. Having said this, it appears from the Mongo documentation that the developers of this software have largely solved many of the complexities that need to happen in a Mongo replica set when a fail-over or fail-back event happens. (We have no operational experience here.) In any case, you still need to beware of the massive bandwidth hit that can occur when the fail-back event happens.
Reads with Mongo are “eventually consistent” unless you specify strong consistency for that particular read.
The one other major caveat we feel we need to mention is we have never actually seen a successful replica set implementation. This is not to say it does not work well, but given Mongo’s history of losing entire data sets upon server crash, our recommendation is to ensure you are backing up your data often if you care to preserve it.
Rails
There is nothing inherent in the Rails framework which makes it unsuitable for use in a distributed system. You should be aware, however, that its designers made some CAP theorem decisions without distributed systems in mind, and these trade-offs can cause major headaches if they are not understood.
The biggest of these tradeoffs is probably that in an effort to make the database back-end flavor agnostic, the Rails developers chose to give application designers a way to validate the uniqueness of certain fields outside of the context of the database itself. It is important to note insistence on uniqueness implies a business decision for consistency over availability. If your back-end is eventually consistent the results of this check might not actually be able to return trustworthy results.
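The failure mode is easier to see in a toy model. The sketch below is not Rails code: it imitates an application-level "check, then insert" uniqueness validation running at two sites whose data stores only exchange rows later. The Site class, the replicate helper, and the site names are hypothetical.

```python
class Site:
    """One site with its own local copy of a table of e-mail addresses."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def create_user(self, email):
        """Application-level uniqueness check followed by an insert."""
        if email in self.rows:
            return False  # validation fails: address already taken locally
        self.rows.append(email)
        return True


def replicate(a, b):
    """Naive eventual-consistency step: each site receives the other's rows."""
    a_rows, b_rows = list(a.rows), list(b.rows)
    a.rows.extend(b_rows)
    b.rows.extend(a_rows)


site_a, site_b = Site("site-a"), Site("site-b")

# Both sites accept the same address before replication catches up...
accepted_a = site_a.create_user("user@example.com")
accepted_b = site_b.create_user("user@example.com")

# ...so after replication the "unique" field holds a duplicate.
replicate(site_a, site_b)
print(accepted_a, accepted_b, site_a.rows)
```

Both validations pass because each site only sees its own copy of the data. The usual remedy, in line with the rest of this series, is to route such writes through a single authoritative master.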
You need to be very careful around such conventions if you are making an active-active distributed system.
Where to actually get started
Having read this far, you understand the main purpose of this paper is to give you the background you need in order to start thinking about how to make your single-site application work well in a distributed environment. Once you feel comfortable with all of the above concepts, and assuming you are familiar enough with your own code base to understand the implications of the former in light of the latter, I recommend the following strategy with regard to where you should start:
- Make sure you understand all the concepts discussed in this document.
- Audit the components in your application in light of the concepts discussed in this document to evaluate their readiness to operate in a distributed environment.
- Consider the cost and effort involved in getting your application into any of the above distributed system patterns.
- Decide which pattern you want to successfully achieve given the above.
- Develop a plan for getting there.
As far as that plan is concerned, we believe the above patterns suggest a logical progression. As such we suggest the following broad outline:
If going from pattern zero to fully manual fail-over:
- See below for detailed information on things to try and techniques to follow, depending on which automated active + standby pattern you hope to most resemble with your manual process.
- Carefully consider all the steps you would need to follow to transition your application to another datacenter and back again. Develop these into scripts or checklists to follow.
- Test your fail-over and fail-back plans.
If going from pattern zero to active + reduced functionality standby:
- Try running your site with all data-storing components in read-only mode.
- Make the necessary code-changes such that the important features of your site which can operate in a read-only state actually do so.
- Develop graceful workarounds / failure modes for those bits which necessarily cannot work in read-only mode.
If going from active + reduced functionality standby to active + full-functionality standby:
- Develop the fail-over plan in the context of your application’s needs. Again, failure at the local site means it should voluntarily fence itself off and refuse to become authoritative without administrator action.
- Develop a simple check for a site to determine whether it is the “authoritative” site. This check should consider the status of the site-availability service, the health of the other site, and the health of the local site in returning the status of local authority.
- Alter those sections of code doing writes / updates to your data stores to conditionally do this or operate in read-only mode (ie. fall back to the work-around / graceful failure) depending on the state of local authority.
- Start introducing asynchronous processing where you can.
- Develop the fail-back plan. Know what will need synchronization after a partition is resolved as well as administrator process on verifying the status of the reconnected site. Document this and make checklists.
- Test the fail-over. Test the fail-back. Again, fail-over should be completely automatic. Fail-back should require some administrator action.
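For the step about making writes conditional on local authority, a minimal sketch might look like the following. It assumes you already have some callable that answers "is this site authoritative right now?"; guarded_write, ReadOnlyFallback, and post_comment are hypothetical names used only for illustration.

```python
class ReadOnlyFallback(Exception):
    """Raised when a write is refused because this site is not authoritative."""


def guarded_write(is_authoritative, do_write):
    """Run do_write() only while this site holds local authority."""
    if not is_authoritative():
        raise ReadOnlyFallback("site is not authoritative; serving read-only")
    return do_write()


def post_comment(comments, text, is_authoritative):
    """Example caller: store a comment, or degrade gracefully to read-only."""
    try:
        guarded_write(is_authoritative, lambda: comments.append(text))
        return "saved"
    except ReadOnlyFallback:
        return "comments are temporarily read-only"
```

Funneling every sensitive write through one guard keeps the authority decision in a single place, so the same check can be reused by cron jobs and background tasks.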
If going from active + full functionality standby to active + active:
- Make everything requiring consistency operate asynchronously if you can. Otherwise you need to sacrifice availability of these components when partitioned and without quorum.
- Automate as much of your fail-back plan as you can. If it cannot be completely automated, know that you will need on-call administrators to handle the fail-backs.
We hope you have found this series on Distributed Systems Design interesting and informative. All further reading resources can be found at the end of Distributed Systems Design Part One.
Distributed Systems Design (Part 3/4)
Blue Box is proud to present this series on understanding the basics of distributed systems design. Distributed Systems Design Part 1, posted on April 9th, reviewed common terminology, Brewer’s/CAP Theorem, and basic concepts in distributed system design. Distributed Systems Design Part 2 covered general guidelines for distributed programming.
Handling partitioning
Now that you better understand the factors involved in making your application run in a distributed manner when there are no immediate problems happening on the network, we will discuss making your application do the right thing when problems start happening.
Distributed system states
One of the more common problems you are going to encounter with a distributed system happens when there are communication issues between sites. Unfortunately, when these happen the problem often lies in some router half-way between each of your sites. Here, each site-local cluster of machines is operating normally and may even be able to speak to a sizable portion of the internet. In this situation, continuing to operate both sites can be dangerous if a high degree of consistency is required for any portion of the data being shared between sites.
To those familiar with working with highly available clusters, the above situation is analogous to a partial-failure in the cluster.
Partial failures of a system are among the hardest events a designer can plan for, simply because there are innumerable ways a system can partially fail. As such, almost all highly available applications attempt to turn partial failures of components into much more predictable and workable complete failures of said components.
Reliably detecting a partition
In a normal site-local highly available cluster of machines, the system ensures partial failures become complete failures through the use of STONITH. STONITH stands for “shoot the other node in the head.” When implemented correctly, it usually means each node in a (for example) two-node cluster has direct access to the other node’s power supply, so if either server detects a problem with the other server, the still-functioning server can ensure the other node experiences a highly predictable complete failure by physically powering it off. The link to the power supply is usually made via a highly reliable serial cable or trusted network segment, so in practice the cluster designers almost never have to worry about this all-important link failing.
Unfortunately, when one tries to use this same concept to eliminate partial failures in a cluster whose nodes are separated by over 2000 miles, the STONITH methodology does not work anymore. Nobody makes a 2000+ mile serial cable, nor would such a thing be reliable if anyone were to make it. Instead, each site must have some mechanism for detecting when it has been disconnected from the rest of the distributed system.
Thankfully, this problem has already been mostly solved. By using the tried-and-true methods already explored in cluster theory, it is possible for a site to detect when it can no longer reach over 50% of the other nodes in the cluster and voluntarily refuse to operate in a production role (ie. fence itself off).
The key here is to have:
- At least three nodes (five is better) which are geographically distant from each other. Only two need actually be capable of handling production services.
- Have each of these nodes participate in a standard Linux heartbeat / Corosync cluster with each other.
- If at any point during operation a site loses quorum (it can talk to 50% or less of all the nodes), it is likely that node is experiencing enough of a localized network failure that it should voluntarily take itself offline for those application features which demand 100% consistency.
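The quorum rule in the last point boils down to a strict-majority test. The sketch below assumes the reachable count includes the node doing the counting; the function names are ours and are not part of heartbeat or Corosync.

```python
def has_quorum(reachable_nodes, total_nodes):
    """True when strictly more than half of the cluster is reachable.

    reachable_nodes is assumed to include the node doing the counting.
    """
    return 2 * reachable_nodes > total_nodes


def should_fence_itself(reachable_nodes, total_nodes):
    """A site that can talk to 50% or less of the nodes takes itself offline."""
    return not has_quorum(reachable_nodes, total_nodes)
```

In a five-node cluster, a site that still sees three nodes keeps quorum, while a site that sees only two fences itself. Exactly half is not enough, which is why an odd number of nodes is preferable.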
Blue Box site-availability service
Forcing each of our customers running multi-site applications to also run their own cluster of servers spread across many data centers seems prohibitive, especially since in most cases each such cluster would be determining the same thing. Blue Box is therefore in the process of deploying a simple “site-availability” service at each of its production data centers. This service will consist of the following:
- Each Blue Box production site will have a pair of servers designated as site-availability monitoring servers.
- These servers will be in a highly-available cluster with all the other servers of similar type in other data centers. (Five geographic sites will be used initially.)
- Based on the state of quorum in this cluster, there will be a highly-available query-able service on each of these servers which details the local server’s view on:
  - Whether the local data center is online, according to the majority opinion of the other nodes in the site-availability network.
  - What the status of each of the other sites in the site-availability network is.
We hope this service will provide enough automatic local intelligence to quickly determine whether a partition event is happening and which site has the best connection to the rest of the internet at the moment. Customers who want to use this service to determine local site and remote site connectivity to the internet will be able to query this service in order to make automated decisions about whether to trigger a failover event. Specific details on how this works will be forthcoming in another document.
Shortcomings of partition detection
The goal of the above site-availability service is to determine whether a given site has the minimal connectivity it needs in order to be considered connected to the rest of the distributed system. The above checks say nothing about whether a given site has enough bandwidth or server capacity to take on load, nor whether any individual client applications are in a state where they can safely run. These additional factors must be determined by the individual client applications themselves.
Don’t forget about the usual site-local health checks
Of course, a partition is just one thing which can make a fail-over necessary. It is still important to query the status of the other site’s application cluster in order to decide locally whether the local cluster should initiate a fail-over.
In any case, in the logic you write to determine whether to do a failover, we suggest the following:
- Is the local site capable of becoming primary (ie. are local health checks all returning “OK”)? If no, do not continue.
- Is the other site partitioned from the internet? If yes, trigger failover.
- Is the other site returning the expected data? Are all its various health checks returning “OK”? If no, trigger failover.
The above should be used in formulating an overall “should I be running in production?” decision for the whole application. Making this into a simple command-line interface which can be used from cron jobs, backgrounded tasks, and the production site code when evaluating requests is an exceedingly good idea. All components of the site-local cluster should be using the same logic for this decision. This should also be tied into the health checks the external DNS provider uses for determining site availability.
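The three questions above translate almost directly into code. This is a minimal sketch, assuming the three answers have already been gathered elsewhere; the function name and the command-line flags are hypothetical.

```python
import argparse


def should_fail_over(local_checks_ok, other_site_partitioned, other_site_checks_ok):
    """Apply the three questions in order and return True to trigger failover."""
    if not local_checks_ok:
        return False  # local site cannot become primary: do not continue
    if other_site_partitioned:
        return True   # other site is cut off from the internet
    if not other_site_checks_ok:
        return True   # other site is reachable but unhealthy
    return False


def main(argv=None):
    """Tiny command-line wrapper; exit status 0 means "fail over"."""
    parser = argparse.ArgumentParser(description="Decide whether to fail over.")
    parser.add_argument("--local-ok", action="store_true")
    parser.add_argument("--other-partitioned", action="store_true")
    parser.add_argument("--other-ok", action="store_true")
    args = parser.parse_args(argv)
    decision = should_fail_over(args.local_ok, args.other_partitioned, args.other_ok)
    print("failover" if decision else "stay")
    return 0 if decision else 1
```

Wiring main() to sys.exit() in a small script gives cron jobs and background tasks the same answer the production code gets by calling should_fail_over() directly.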
The fail-over event
What “fail-over” means in the case of a distributed system can vary a lot depending on which of the patterns above you implemented in your distributed system. In most cases, “fail-over” only has meaning for those business CAP theorem decisions where consistency is required. Specifically, “fail-over” means we switch which site is authoritative for that data (ex. where you’re doing your sensitive inserts / updates).
We specifically want our logic to follow the idea of creating the predictable “complete failure” mode of the local cluster if the local cluster becomes disconnected from too great a portion of the internet. The worst thing that can happen in the case where consistency is required is two disconnected sites both think they control the authoritative source of information at the same time. Any scripts handling the failover should be written with this in mind.
For the following patterns, this is what “fail-over” means:
Active + reduced functionality standby:
If the standby site has no state information (it is effectively a read-only site), then fail-over doesn’t mean anything. Your application will have reduced functionality until the active production site comes back online. Please note it will potentially serve stale data, which is usually acceptable for most applications.
Active + full-functionality standby:
All “authoritative” sources of information move to the stand-by site. Steps should be taken on the “active” site to prevent an automatic fail-back (ie. fencing), as this will almost always need to be done carefully after engineers verify a synchronization step occurred.
Active + Active:
All “authoritative” sources of information move off the local cluster which has detected it is disconnected from the rest of the distributed system. Automatic steps should be taken to ensure that a fail-back does not happen automatically unless this can be safely scripted (ie. fencing).
The fail-back event
The fail-back is the most dangerous transition you will have to handle during the no-partition -> partition -> no-partition cycle. Almost all our planning goes into handling the fail-over event, where we go from a highly predictable no-partition distributed system to a highly predictable partitioned distributed system. The fail-back, by contrast, is less predictable because in almost all systems there will be a certain amount of cleanup and resynchronization that needs to happen before a once-disconnected site can safely be the authoritative source for information again. Not every site failure is the result of a partitioning event, so it is important the cause for the failure is well understood and corrected, and the important data is fully resynchronized before a safe fail-back can occur.
As with a fail-over event, the fail-back event has different meanings depending on which distributed design pattern was followed:
Active + reduced functionality standby:
Fail-back has less meaning if the standby site is completely stateless. Administrators’ entire focus will be to get the primary site back online as soon as possible.
Active + full-functionality standby:
By way of a generic checklist, a fail-back usually has the following elements:
- Failed site is no longer partitioned / cause for failure has been corrected.
- Administrators check the failed site thoroughly enough to understand the cause for the failure, as well as any lingering effects the nature of the failure may have had. These are corrected.
- Important data is resynchronized between sites.
- Administrators remove the block preventing the failed site from participating fully in the distributed network (ie. remove fencing).
- Depending on the application, either things are left running on the secondary site (ie. if both sites are equally capable of handling the production application), or a fail-back may be manually triggered to force production traffic back to the original site.
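One way to keep people (or scripts) from skipping ahead is to encode the checklist so that fencing can only be removed after the earlier steps are done. The sketch below is a hypothetical illustration; the step names paraphrase the list above, and enforcing a strict order is our own assumption.

```python
FAILBACK_STEPS = [
    "cause of failure corrected",
    "site inspected by administrators",
    "important data resynchronized",
    "fencing removed",
]


class FailbackChecklist:
    """Tracks fail-back progress and insists the steps happen in order."""

    def __init__(self, steps=FAILBACK_STEPS):
        self.steps = list(steps)
        self.done = 0  # number of steps completed so far

    def complete(self, step):
        """Mark the next step complete; reject anything out of order."""
        if self.done >= len(self.steps):
            raise ValueError("checklist already complete")
        expected = self.steps[self.done]
        if step != expected:
            raise ValueError(f"next step is {expected!r}, not {step!r}")
        self.done += 1

    def ready_for_production(self):
        """True only once every step, including unfencing, is complete."""
        return self.done == len(self.steps)
```

Whether production traffic then moves back to the original site remains a separate, application-specific decision, as the last item in the list notes.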
Active + Active:
In this pattern, fail-back may share many of the same elements as the Active + full-functionality standby pattern. The hope is in active + active, the resynchronization steps have enough automation around them to the point where automatic fail-back may be an option. Before a failed site is allowed to go back into production in this model, it must have all important data brought back into a consistent state with the other sites.
Other good ideas around state changes
Synchronize your clocks!
One of the few bits of information you can almost categorically trust to not diverge too quickly from a highly consistent state and remain available during a partition is your system’s clock. Many of the algorithms used for automatic resolution of conflicting updates in eventually consistent systems rely heavily on timestamps being very accurate. With the ubiquity of public highly-accurate time sources on the internet, there is no excuse for not having your servers’ clocks always kept strictly within a couple dozen milliseconds of atomic-clock accurate time.
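To see why accuracy matters, consider a simple "last write wins" resolver of the kind many eventually consistent systems use. The sketch below is illustrative only: updates are (timestamp, value) pairs, and the skew figures are made up.

```python
def stamp(true_time, clock_skew, value):
    """Timestamp a write using a local clock that is off by clock_skew seconds."""
    return (true_time + clock_skew, value)


def last_write_wins(update_a, update_b):
    """Keep the update with the later timestamp (ties go to update_a)."""
    return update_a if update_a[0] >= update_b[0] else update_b


# Site A writes "old" at t=100.00; site B writes "new" 50 ms later.
in_sync = last_write_wins(stamp(100.00, 0.0, "old"), stamp(100.05, 0.0, "new"))

# Same writes, but site B's clock runs 200 ms slow.
skewed = last_write_wins(stamp(100.00, 0.0, "old"), stamp(100.05, -0.2, "new"))

print(in_sync[1], skewed[1])
```

With synchronized clocks the newer value survives. With a 200 ms skew the resolver silently keeps the older value, and nothing in the data reveals that anything went wrong.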
Don’t be too hasty!
Border Gateway Protocol (BGP) reconvergences often take between two and five minutes to complete, during which packets are routed sub-optimally and often lost. BGP reconvergence events are also constantly happening on the real-world internet. The more logical hops two geographically distant sites have to go through to get from site A to site B, the more likely inter-site communications are to get interrupted by this.
At the same time, a distributed system’s fail-over and fail-back procedures can be both risky and expensive, and generally shouldn’t be attempted under unexpected circumstances unless absolutely necessary.
On top of this, it should be noted cluster topology changes (ex. in the case of the use of the site-availability service) are not always detected by all nodes in a cluster at the same time. A standby site may detect the primary site has gone offline up to a minute before the primary site realizes this.
Due to these factors, and depending on your distributed system pattern, it may make sense to put a forced delay in the fail-over process, both to ensure the outage is not a brief BGP reconvergence event and to make sure the primary site truly is offline before the standby site attempts to take over. (The detection of a partition event starts a countdown to the point of no return when the standby site goes into production.) Especially in the case of BGP reconvergence, it is often less expensive and risky to deal with a less than five minute outage in your distributed application due to external network factors than to trigger an expensive and risky fail-over process, which may take five minutes or longer to complete anyway, depending on exactly what you are doing.
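A forced delay of this kind can be sketched as a countdown that starts when a partition is first seen and resets if the network recovers. The class name and the 300-second default are hypothetical; timestamps are passed in explicitly so the logic is easy to test.

```python
class FailoverCountdown:
    """Only allow fail-over after a partition has persisted for delay_seconds."""

    def __init__(self, delay_seconds=300):
        self.delay_seconds = delay_seconds
        self.partition_since = None  # time the current partition was first seen

    def observe(self, partition_detected, now):
        """Feed one check result; return True when fail-over should start."""
        if not partition_detected:
            self.partition_since = None  # network recovered: reset the countdown
            return False
        if self.partition_since is None:
            self.partition_since = now   # first sighting: start the countdown
        return now - self.partition_since >= self.delay_seconds
```

In practice the standby site would call observe() on every health-check cycle, passing something like time.monotonic() as the current time. A brief reconvergence that clears up within the delay never triggers a fail-over.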
Use different DNS sub-domains for application components using different design patterns!
If you have done a decent job of keeping your various application components separate, you may find it beneficial to have some components of your application follow one of the design patterns above while others follow a different pattern. As far as the fail-over and fail-back transitions (and even the working model of these components) it can be beneficial to think of them as separate applications. Even if you use one pattern throughout your application, you will probably have some components which favor consistency and some which favor availability.
Using different DNS sub-domains provides a good logical separation between these components. It gives you the ability to do fail-over and fail-back differently on a per-component basis at the DNS level. Plus, keeping everything under a single top-level domain means you can still share cookies and security features between components.
Create, use, and maintain checklists!
Even the most simple distributed application is going to have fairly complicated fail-over and fail-back procedures. Especially for those parts of the process which require manual intervention, it is an excellent idea to write down a few simple-to-understand checklists covering these kinds of transitions. This will ensure that when you or your team members actually have to do the process during a real partition event, you do not have to try to remember everything that needs to be done from memory.
These checklists will also become the scripts you follow in testing your fail-over and fail-back plans. They become the summary of what exactly in your process you need to look into automating as much as possible. You should plan on sharing your checklists with all the members of your team and vendors who may need to be involved in a fail-over or fail-back. A good checklist also needs to be able to pass the 3:00am test. That is to say, it must be direct and simple enough to follow by anyone on your team with sufficient access who is running at 10% capacity (having been woken up at 3:00am after minimal sleep).
Even if you are moving toward fully automated fail-overs and fail-backs, we still recommend keeping functional checklists up-to-date describing a manual equivalent of the automated process. This will not only help you to document what the automated processes are supposed to be doing, but will also help you know where and how you can intervene if an automated fail-over or fail-back process breaks down because of a failure scenario you didn’t anticipate.
*All references are provided at the end of Distributed Systems Design (part 1/4).
Distributed Systems Design (Part 2/4)
Blue Box is proud to present this series on understanding the basics of distributed systems design. Distributed Systems Design (Part 1/4), posted on April 9th, reviewed common terminology, Brewer’s/CAP Theorem, and basic concepts in distributed system design.
Some useful distributed design patterns
Now that we have gone through the fundamental concepts and have familiarized you with the key differences you need to think about when programming and designing for a distributed system, you are probably wondering how you take your application which works well in a single cluster in a single datacenter and make it work well in a distributed fashion. This section of the document is aimed at giving you some pointers to get started.
Note that there are a lot of different common design patterns that are used here (and not every pattern is covered below). Which is right for you will depend a lot on some more business decisions revolving around budget, willingness to alter code, and perceived cost of down-time / reduced functionality during a partition.
Running under optimal conditions
You will have absolutely no ability to get your application to work in a failover or partitioned environment if you cannot get it to work when the network is not having problems. As such, we will discuss achieving this before we talk about handling partition events.
We highly recommend ensuring any given site-specific cluster which makes up part of your distributed application is fully redundant, with no single points of failure. A local failover due to a failing piece of hardware is far easier to deal with than a complete failover to a remote site due to a failure in a single point of failure on the local network. The most dangerous times for a distributed application happen during the fail-over and fail-back stages, so minimizing the number of times you have to actually do this is an excellent idea.
There are a few standard patterns which get used here, and a couple variations of these patterns that should be discussed. Please note we will devote the majority of the discussion below toward patterns where fail-over events are automatic.
Manual fail-over patterns
Pattern zero
Pattern zero is not actually a distributed pattern at all. Rather, it is the term we are giving to the minimal cluster design we recommend you achieve at a single site before you can realistically start to contemplate creating a distributed application. The key characteristics of pattern zero are that all or almost all single points of failure within the cluster design have been addressed. Also, regular off-site backups are being made of all of the data you would need to set up a similar cluster elsewhere.
It should be noted that unless you are making regular off-site backups of your key data, you are taking a business-continuity risk. If a major data-destroying disaster occurs at your data center and you have no off-site backups, there will be no way to restore the lost data. If you are dependent on your application data to do business (as most are), this could spell doom for your business.
This pattern’s characteristics are:
- External DNS service not necessary
- All or almost all single points of failure have been eliminated in the single-site cluster.
- Off-site backups of all important data are being kept.
Fully-manual fail-over pattern
In a fully-manual fail-over pattern, enough of an application footprint is kept in standby at a remote data center to resume some semblance of normal business operations in a relatively short amount of time. However, all the steps for transitioning to the standby data center are managed manually. This is the most basic system design which can be considered distributed. Except for the fact that fail-over and fail-back happen only with administrator intervention, this pattern can otherwise closely resemble either the active + reduced functionality standby or active + full functionality standby patterns discussed below.
If your goal is to turn a single-site (pattern zero) application into a distributed application with automated fail-over, designing the system first for fully-manual fail-over can be a logical step in the process.
In coming up with your design, we highly recommend you consider each task which needs to happen for a successful fail-over or fail-back event and develop detailed checklists or scripts for administrators to follow in case a manual fail-over becomes necessary. These scripts also become your guide for which tasks you need to automate, if you want to ultimately achieve some degree of an automatic fail-over pattern.
This pattern’s characteristics are:
- Requires DNS service which can be administered when the primary site is down.
- Detailed checklists or scripts are used in lieu of automated processes for fail-over and fail-back events.
- At a minimum, off-site backups of all vital data are necessary.
- Otherwise, this pattern highly resembles one of the two active + standby patterns detailed below.
Automatic fail-over patterns
Active + reduced functionality standby
In the active + reduced functionality standby pattern, an external DNS service is used which has the capability of doing health checks against the full-production site. It will automatically send traffic to the standby site if the health checks detect a failure in the production site. Under normal running conditions, all traffic gets sent to just the primary / active / full-production site. The standby site itself can range from as simple as a single service which displays a “Site is down right now” message to clients, all the way up to something closely resembling production with only the most sensitive features of the site disabled.
In general, with this pattern, we recommend concentrating on making the standby site do as much as possible without requiring state information from the production site. In other words, the standby-site runs in “read only” mode. This way, failing over to the standby site and failing back can happen without much intelligence needing to be built in around synchronization. Most of the thinking / logic that has already gone into designing an application which functions in a non-distributed way does not need to change since the standby site is essentially stateless.
In a nutshell, this pattern’s characteristics are:
- This is usually the least expensive option to implement from a software engineering perspective.
- Requires external DNS service capable of doing site health checks.
- Simplest automated pattern to implement when coming from single-site clustered application.
- Standby site often does not need as much hardware as full-production site in order to work correctly.
- If standby site implements some, but not all, production site functionality, testing your code becomes more difficult.
- Fail-overs and fail-backs are relatively simple.
- Generally this is the least-expensive of the distributed architecture patterns from a raw infrastructure cost.
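The “read only” standby described above can be sketched as a request handler that serves reads but refuses anything that would change state. This is only an illustration: `handle_request` and its `read_only` flag are hypothetical names, and it assumes the HTTP method is a reasonable proxy for “read” versus “write”, which is not true of every application.

```python
READ_ONLY_METHODS = {"GET", "HEAD", "OPTIONS"}


def handle_request(method, path, read_only):
    """Return an (HTTP status, body) pair for a request.

    `read_only` is True on the standby site, which serves reads but
    refuses anything that would change state.
    """
    if read_only and method.upper() not in READ_ONLY_METHODS:
        # 503 tells clients the refusal is temporary.
        return 503, "This feature is temporarily unavailable."
    return 200, f"handled {method.upper()} {path}"


# On the standby site, browsing works but adding to the cart does not.
browse = handle_request("GET", "/items", read_only=True)
add_to_cart = handle_request("POST", "/cart", read_only=True)
```

Because the standby never accepts writes, nothing needs to be merged back into production when the primary site returns.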
Active + full-functionality standby
In the active + full-functionality standby pattern, an external DNS service is used which has the capability of doing health checks against the active site. It will automatically send traffic to the standby site if health checks fail against the active site. Under normal running conditions, all traffic gets sent to just the active site. The difference between this pattern and the one above is the standby site in this case synchronizes state information with the active site so that functionality is not reduced when fail-overs happen.
In this pattern, significantly more intelligence needs to be built around the fail-over and fail-back functionality. Most of the work in getting the fail-over to happen will be in making sure the right data is being synchronized in the first place. If this happens correctly, then fail-overs tend to be pretty simple. Otherwise, this pattern also allows software engineers to reuse much of the same code and thinking that went into single-site application programming. This is because, in the normal situation, only one site will be active at a time and, therefore, less state data needs to be constantly or immediately shared between sites.
The fail-back process can be fairly involved depending on the specific technologies used in the application in the first place. We generally do not recommend fail-backs to be an automated event in this environment.
This pattern’s characteristics are:
- This pattern’s software engineering expense lies in designing the synchronization but otherwise does not differ much from the active + reduced functionality standby pattern.
- Requires external DNS service capable of doing site health checks.
- Requires synchronization of some state data (e.g. database / shared filesystem / etc.) between sites.
- Standby site needs as much hardware as active site in order to avoid reduced functionality or performance during partition.
- Same code runs in both sites.
- Fail-overs tend to be simple. Fail-backs can be difficult and should generally be done manually.
- This tends to be the most expensive of the distributed architecture patterns from a raw infrastructure cost, as half of your hardware will essentially be sitting idle all the time.
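The combination of automatic fail-over and manual fail-back can be sketched as a “sticky” routing decision: traffic moves to the standby as soon as the active site fails a health check, but it only moves back when an operator says so. The class and method names are hypothetical, and a real external DNS service would make this decision on its own infrastructure rather than in your code.

```python
class StickyFailover:
    """Route to the active site until it fails, then stay on standby until told otherwise."""

    def __init__(self):
        self.failed_over = False

    def target(self, active_healthy):
        """Return which site should receive traffic after one health check."""
        if not active_healthy:
            self.failed_over = True      # automatic fail-over
        return "standby" if self.failed_over else "active"

    def fail_back(self):
        """Called by an operator once data has been resynchronized."""
        self.failed_over = False


router = StickyFailover()
```

Note that a healthy check from the old active site does not by itself move traffic back; that step is left to a person following the fail-back checklist.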
Active + Active
In the active + active pattern, production site traffic is sent to all sites making up the distributed application at all times. It is the most difficult pattern to implement when coming from a legacy application written for a single-site environment. It is also the one most capable of dealing with network instability. This is the pattern all the big guys out there use, including Google, eBay, and Amazon. It is often the one most clients would like to achieve with their distributed application.
As we can send traffic to many sites at the same time with this pattern, it opens up a number of other interesting options with regard to Global Server Load Balancing (GSLB). As with the other patterns discussed here, an external DNS service capable at least of doing health checks is required. With this pattern we may be able to use more advanced features of an external DNS service as well.
With the active + active pattern, a lot more work generally needs to be done within the application to get it to behave correctly in this environment. Specifically, work needs to be done, or at least evaluations made, around every point in the application where state data is touched.
This pattern’s characteristics are:
- This is by far the most expensive pattern to implement from a software-engineering perspective.
- Requires external DNS service capable of doing site health checks. May be able to utilize other GSLB features of external DNS service.
- Requires synchronization or other intelligence around handling all state data between sites.
- Active / active sites need to be approximately equal size and have enough redundant capacity to take the full load in case of a complete site outage (think at least N+1 redundancy here.)
- Same code runs at all sites.
- Fail-overs tend to be simple. Fail-backs tend to still be very complicated, but should be automated as much as possible.
- Tends to be fairly expensive from a raw infrastructure perspective. With two sites, 50% of server capacity will be idle. With three sites, 33% of server capacity will be idle.
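The idle-capacity figures in the last point follow from simple arithmetic, sketched below under the assumptions that all sites are the same size, load is spread evenly, and the design only needs to survive the loss of one site at a time. If peak load must fit on the remaining N - 1 sites, then one site's worth of capacity out of N is spare.

```python
def idle_fraction(site_count):
    """Fraction of total capacity idle in normal operation.

    Assumes equally sized sites, evenly spread load, and enough headroom
    for the remaining sites to absorb the loss of exactly one site.
    """
    if site_count < 2:
        raise ValueError("need at least two sites to survive losing one")
    # Peak load must fit on (site_count - 1) sites, so one site's worth
    # of capacity out of site_count is spare.
    return 1 / site_count


two_sites = idle_fraction(2)     # 0.5, the 50% quoted above
three_sites = idle_fraction(3)   # roughly 0.33, the 33% quoted above
```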
What about auto-scaling the standby site?
If your failover strategy includes spinning up the standby site on the fly during a partition event, please note we generally do not recommend attempting auto-scaling the standby site for the following reasons:
- Testing these kinds of fail-overs tends to be difficult and inconclusive unless you’re already in the habit of regularly performing these tests.
- During an actual partition, the fewer moving parts you have and the fewer topology changes you have to make, the better.
- Certain technologies, such as database servers, simply do not lend themselves to doing this well.
- Spinning up and configuring new servers tends to take a non-trivial amount of time on most cloud service providers today. Ranges are generally measured in minutes for this to happen and they must come down to seconds for it to be viable.
- The failure of a major site can result in a rush on capacity in other data centers. This means spinning up new servers will take even longer or may fail entirely, if the datacenter runs out of capacity. The cloud provides the illusion of infinite capacity but no cloud actually can provide infinite capacity. When the partition happens, it is far better for your servers to already be spun up.
Making your application capable of going active + active
When considering how to make your single-site application go to an active + active distributed pattern, you need to concentrate on two key areas: stateful/shared data and cache usage.
Any time you touch cache, stateful, or shared data in your application, you need to make the CAP theorem business decision around that interaction.
Ideally, you want all aspects of your application to be able to operate with an eventual-consistency model, at least to a limited degree (because of the effect of latency acting like a partition). Where consistency of a given piece of data from one node of the cluster to another is not absolutely necessary, you want your code to be able to treat the data like it can be at least partially inconsistent.
To help you make the distinction of what you are already doing in your code around the CAP theorem decisions, consider this:
- Anywhere in your code where you use cache, you already made the business decision availability is more important than consistency, for at least a little while, for that data.
- Anywhere in your code where you specifically cannot use cache, or must go to a “master” server (ex. if MySQL slaves are used in your cluster), then you have effectively made the business decision consistency is more important than availability.
- In most applications, “read” operations tend to not need strict atomic consistency and, therefore, can work with the eventual consistency model.
- In most applications, many “write” operations are able to operate with the eventual consistency model if asynchronous processing is used. For most applications, though, there will almost always be a few “write” operations that require strict consistency.
Those pieces where the second point above applies need to be very carefully evaluated to see whether consistency truly is required here or whether asynchronous processing can be used at all.
We suggest approaching these problems as follows:
- Go through your application with a fine-toothed comb looking for any instances where an external service (i.e., one not on “localhost”) is used. Make a list of all of these, as well as the apparent decision about whether you are implicitly choosing availability or consistency with the interaction.
- If any of these instances can be eliminated without making your application suffer, eliminate them.
- For those services that are not absolutely necessary for main site functionality, alter the code around them to allow the external service to fail without bringing the whole application down.
- Use asynchronous processing wherever you are able and where it makes sense to do so.
- For those services where consistency is chosen over availability, evaluate whether consistency is absolutely necessary. Try to use asynchronous processing where you can.
- If you are not able to eliminate all instances in the application where consistency is 100% necessary, then you must write code to detect when a partition is happening and gracefully disable those features. We will discuss how to reliably detect partitioning shortly.
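For services that are not essential to main site functionality, “allowing the external service to fail” can be as simple as wrapping the call and substituting a harmless default. The sketch below is a minimal illustration; `call_optional` and `fetch_recommendations` are hypothetical names, and the simulated failure stands in for an unreachable remote site.

```python
import logging

logger = logging.getLogger("degrade")


def call_optional(service_call, fallback):
    """Run a non-essential remote call, returning `fallback` if it fails."""
    try:
        return service_call()
    except OSError as exc:
        # ConnectionError and TimeoutError are both subclasses of OSError.
        # Log and carry on: losing one feature beats losing the whole page.
        logger.warning("optional service unavailable: %s", exc)
        return fallback


def fetch_recommendations():
    # Hypothetical remote call; here it simulates an unreachable site.
    raise ConnectionError("remote site unreachable")


page = {
    "items": ["shoes"],
    "recommendations": call_optional(fetch_recommendations, fallback=[]),
}
```

The page still renders with an empty recommendations list instead of failing outright. In practice the wrapped call should also carry a short timeout so that a slow remote site does not stall every request.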
When you just can’t sacrifice consistency
When you can’t sacrifice consistency, active + active can still work when partitions don’t exist. For example, if your “master” PostgreSQL server exists in Seattle, under normal running conditions your application servers in Virginia must connect to the “master” PostgreSQL server in Seattle (over an SSH or VPN tunnel) in order to perform these sensitive operations. Try to minimize the occurrence of this kind of activity as the latency of the coast-to-coast link will already add a lot to the response time of your Virginia application servers.
When partitions happen, you must be prepared to sacrifice availability of those site features where consistency is absolutely necessary.
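One way to sacrifice availability only for the features that need consistency is to guard those code paths with a check against the current partition state. The sketch below assumes some other component sets `state.partitioned`; how to detect a partition reliably is a separate topic, and all names here are hypothetical.

```python
import functools


class FeatureUnavailable(Exception):
    """Raised when a consistency-critical feature is disabled."""


class PartitionState:
    """Holds the result of partition detection (detection itself is out of scope)."""

    def __init__(self):
        self.partitioned = False


def requires_consistency(state):
    """Decorator: refuse to run the wrapped function during a partition."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if state.partitioned:
                raise FeatureUnavailable(f"{func.__name__} is disabled during a partition")
            return func(*args, **kwargs)
        return wrapper
    return decorate


state = PartitionState()


@requires_consistency(state)
def place_bid(amount):
    # Hypothetical operation that must reach the single "master" database.
    return f"bid of {amount} accepted"
```

The rest of the application keeps running during a partition; only calls such as `place_bid` raise `FeatureUnavailable`, which the front end can turn into a friendly “temporarily disabled” message.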
*All references are provided at the end of Distributed Systems Design (part 1/4).
Distributed Systems Design (part 1/4)
Blue Box is proud to present this series on understanding the basics of distributed systems design. Stephen Balukoff, BBG’s Principal Technologist, was surprised by the lack of practical information available on the subject. He authored this document to help fill the gap and hopefully start an exchange of information within the community.
Introduction
In recent months more and more Blue Box customers have been asking about the possibility of setting up their applications to be distributed across two or more geographically distant data centers. At first glance this seems like it ought to be easily accomplished by modeling the distributed system after a cluster with no single point of failure on a local LAN. In practice, however, there are a lot more concerns which need to be addressed in a geographically distributed system. As such, creating a well-functioning distributed system is probably a lot more difficult than you think.
This document was written with the goal of giving you a place to start to understand the concepts and concerns involved, as well as to give some practical advice as to “where to start” if you are trying to turn an application which exists happily in a single datacenter into an application which exists happily spread across two or more data centers. Unfortunately there is no panacea here which can accomplish this for any given application for reasons which will become evident below. Educating oneself as to the rules of the game and suggested best practices is the first step in avoiding becoming a casualty. This document was also not intended to be a comprehensive guide on designing distributed systems.
Caveat Emptor
In the interests of full disclosure it should be noted the authors of this document, while very skilled in systems administration, network and cluster design, and application design and support, do not claim to be experts in distributed systems design. Distributed systems design is a relatively new area of computer science, and there are actually few companies in the grand scheme of things who have successfully engineered large-scale distributed systems. Those who have tend to be fairly tight-lipped about it, guarding their trade secrets in this arena. There are certainly more intelligent and experienced people in this arena than us, and we encourage the reader to examine other sources of information when formulating a plan for implementing a distributed system.
The sources of information used in formulating this document range from various whitepapers available online, Wikipedia articles, scientific research papers, consultation with other experts in the field, experience gleaned from working with customers who have implemented such systems, our cumulative understanding of the implications of all of the above, and the occasional educated guess. In general, we’ve found that there are very few practical how-tos available for anyone looking to create a distributed application (especially if they’re starting from a non-distributed application). We wrote this document in the hopes it would be helpful to anyone else facing this problem but do not guarantee the accuracy, practicality, or any other value to the information contained herein.
The Rules of the Game
(What you absolutely need to understand)
Distributed System Design requirements
In their most generic form, the basic design requirements of any distributed system can be summarized as follows:
Consistency
Changes to data in one part of a distributed system are immediately represented throughout the whole of the system. For example, if I add a pair of shoes to my shopping cart by sending a web request to a server in Seattle, if my next request ends up going to a server in Virginia, I expect the server there to know about the pair of shoes in my shopping cart.
Availability
Every request sent to a distributed system should get a response, no matter which individual server in the distributed system I happen to be talking to. Continuing the above example, it should not matter whether I sent my request to a server in Seattle or Virginia, I expect to eventually get a response back.
Partition-tolerance
The distributed system should still be able to function even if arbitrary messages are lost in the system. Again, continuing the example, even if the servers in Seattle and Virginia are not able to talk to each other, the system should still work.
Brewer’s (CAP) theorem
While the above are arguably essential requirements for any web application regardless of whether the servers which make up the system are located within a single data center or spread across the whole internet, the problem with the above is that it turns out it is impossible to guarantee all of the above for any system. As the saying goes, “Pick any two. You can’t have all three.”
Specifically, in a keynote speech at the ACM Symposium on Principles of Distributed Computing in 2000, Eric Brewer conjectured it is impossible for a distributed web system to guarantee consistency, availability, and partition-tolerance all at the same time. Two years later, two very smart computer scientists at MIT named Seth Gilbert and Nancy Lynch formally, mathematically proved this.
It is important to understand the magnitude of impact of that proof. It formally establishes CAP theorem as a theorem: This means it speaks to the fundamental nature of the universe in which we live, just as much as 1 + 1 = 2, or the law of gravity, general relativity or the speed of light. When we say “pick any two, you can’t have all three,” we’re not trying to be glib. It is physically impossible for us or anyone to guarantee a distributed system will provide consistency, availability and partition-tolerance all at the same time.
The hidden lie in CAP theorem
It is actually a little worse than that. In real world applications (or until quantum entanglement sees a real computing application), one cannot choose to never have partitions. Network failures happen, bugs in the system happen, user error happens, and there will always (and usually frequently on globally distributed systems) be times when server A and server B in your network are not going to be able to talk to each other. In a real-world distributed system one of the choices is always going to be made for you. Again, blame the universe if you don’t like this. Of the two choices you get, partition-tolerance has to be one of them.
So what it really boils down to is: for your business application, which is more important to you, consistency or availability? When a partition happens, you are forced to choose one or the other. It is physically impossible to have both.
The light at the end of the tunnel
The good news is you do not necessarily have to choose between consistency and availability for your entire application all of the time. For example, it may make sense to choose consistency for those site features which are most sensitive (e.g. bidding in an auction, modifying account balances, etc.) but choose availability for those site features which are not very sensitive (e.g. browsing items in a store, reading forum posts, etc.).
Further, for brief partitions (ex. communication interruption that only lasts a couple of minutes), it may make sense to live with inconsistency for a short period (perhaps operating on the assumption that data won’t get very out of sync very fast). Then later you can choose to drop availability if the partition event lasts too long. It all depends on the nature of the application and the business decisions dictating appropriate application behavior.
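A time-based policy like this can be sketched as a small function that maps the age of a partition to a behavior. The two-minute grace period and the three behavior names are assumptions chosen for illustration; the right values are a business decision for each feature.

```python
def partition_policy(seconds_partitioned, grace_seconds=120):
    """Pick a behavior for a feature given how long the partition has lasted.

    `seconds_partitioned` is None when no partition is in progress.
    """
    if seconds_partitioned is None:
        return "normal"
    if seconds_partitioned <= grace_seconds:
        # Brief partition: favor availability and accept possibly stale data.
        return "serve-possibly-stale"
    # The partition has lasted too long: favor consistency and stop serving.
    return "unavailable"
```

Different features can call the same function with different grace periods, for example zero for account balances and several minutes for forum posts.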
…but it’s complicated
The choice of consistency versus availability is a business decision. Therefore, the appropriate action that a given cluster or node within a cluster should take during a partition event needs to be dictated by business logic. Partially due to the fact that businesses must choose for their application to behave in different ways according to their business needs– and that even different parts of any given application in a distributed setup may need to behave differently depending on the specifics of the circumstances– this means that handling a partition event gracefully is necessarily much more complicated than typical fail-over scenarios which occur within a single site or localized cluster.
The one exception to the rule
The only exception to the CAP theorem is the one case where consistency doesn’t matter. If your application is able to run completely stateless (i.e. no data preserved server-side of any kind), then by definition no data exists which needs to be kept consistent across the distributed system. Technically, this is not an exception to CAP theorem. It is the one case where one can safely choose availability and partition-tolerance all of the time.
Eventual Consistency and other half-truths
Before we go too much further into the implications of CAP theorem, we need to say something about eventual consistency. In particular, one clarification about the proof of CAP theorem is that by ‘consistency’ what is meant here is “atomic consistency.” The theorem allows for the possibility of a thing called “eventual consistency” (or “weak consistency”) which for some applications is a better alternative than losing availability during a partition event. There is an implied business decision with eventual consistency configurations to sacrifice consistency in the short term in favor of preserving availability. The idea behind eventual consistency is that partitioned servers in a distributed system are allowed to become inconsistent during a partition event, and that there’s pre-defined business logic within the software to resolve the data inconsistencies and conflicts this creates once the partition is resolved.
The problem is the devil is really in the details here. Further research into the topic of eventual consistency shows that for any given algorithm here, it may not be possible to resolve these inconsistencies in a reasonable amount of time (if ever). Plus, “partitioning” here is not always a total communication failure.
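As one illustration of pre-defined resolution logic, and of how the details bite, the sketch below merges the state of two sites with a “last write wins” rule. This is only one possible policy, chosen here because it is short; it assumes each site records a timestamp with every write and that the clocks at both sites agree closely enough for those timestamps to be comparable.

```python
def merge_last_write_wins(site_a, site_b):
    """Merge two {key: (timestamp, value)} maps, keeping the newest write per key.

    On a timestamp tie, the value from `site_a` is kept.
    """
    merged = dict(site_a)
    for key, (timestamp, value) in site_b.items():
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged


# Both sites accepted writes to the same cart while partitioned.
seattle = {"cart": (100, ["shoes"])}
virginia = {"cart": (105, ["socks"]), "wishlist": (90, ["hat"])}
merged = merge_last_write_wins(seattle, virginia)
```

After the merge the cart contains only the socks: the shoes added in Seattle are silently discarded. Whether that is acceptable, or whether the two carts should be combined instead, is exactly the kind of business decision the resolution logic has to encode.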
Latency as a form of partition
One other important thing to point out here is that up until now we treated partitioning as an “event” in the network. The definition of what a partition is, however, can be more broadly defined in real-world (i.e. mostly asynchronous) networking environments as the time it takes data to propagate across all the servers in the distributed system, if such systems are designed to retransmit dropped messages. During this window, depending on the specifics of how data is transferred, implied business decisions have been made about consistency vs. availability.
Example 1:
For example, suppose I have database server A in Seattle, and database server B in Virginia which are in a MySQL master-master replication configuration. Typical asynchronous MySQL replication is actually one form of “eventual consistency”. There is a significant amount of time between when an update is made to the data on server A and when it propagates to server B. The best case scenario is around 80 milliseconds, worst case is never– if server B is unable to process data at the same rate as server A and is continuously falling behind in replication. (See the above section about the half-truth of eventual consistency). Between the time an update to the data is made on server A, and the time when server B processes the same update, one could say server A and server B are effectively partitioned. Furthermore, because they will return different results if a client makes a query to each server during this window– but each server *will* return a result– this means that consistency has been sacrificed in favor of availability (at least until that update makes it through). The business decision here was made when the choice was made in favor of using MySQL asynchronous replication.
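The window described in this example can be modeled in a few lines. The sketch below is a toy simulation, not MySQL: it only models one direction of replication (A to B), and the `replicate()` call stands in for whatever delay the real replication link introduces.

```python
from collections import deque


class AsyncReplicatedPair:
    """Two stores where writes to A reach B only when the log is drained."""

    def __init__(self):
        self.a = {}
        self.b = {}
        self.log = deque()   # updates written on A but not yet applied on B

    def write_to_a(self, key, value):
        self.a[key] = value              # acknowledged immediately
        self.log.append((key, value))    # shipped to B later

    def replicate(self):
        """Apply every pending update to B (stands in for the replication link)."""
        while self.log:
            key, value = self.log.popleft()
            self.b[key] = value


pair = AsyncReplicatedPair()
pair.write_to_a("cart", "shoes")
stale = pair.b.get("cart")    # B still answers, but with old data
pair.replicate()
fresh = pair.b.get("cart")    # B has caught up
```

Between `write_to_a()` and `replicate()`, both stores answer queries but give different answers, which is the availability-over-consistency trade described above.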
Example 2:
Take the same scenario above but replace MySQL master-master replication with PostgreSQL master-slave replication in synchronous mode. In this case, the PostgreSQL synchronous replication system, with its acknowledgements and ties into transactions, ensures both server A and server B strictly have the same data set at all times. If we update data on server A, during the window in which this is replicated to server B, querying the data from server A will show the same thing as server B. The update to the data has not yet been applied. In fact, the update query will hang indefinitely if server B happens to be down. In this way, one can see the business decision has been made to sacrifice availability for consistency.
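The synchronous case can be modeled the same way. Again this is a toy simulation rather than PostgreSQL itself: where a real synchronous commit would block while waiting for server B, the sketch raises an exception so that the example terminates.

```python
class ReplicaDown(Exception):
    """Raised when the standby cannot acknowledge a write."""


class SyncReplicatedPair:
    """Two stores where a write commits on both or on neither."""

    def __init__(self):
        self.a = {}
        self.b = {}
        self.b_reachable = True

    def write(self, key, value):
        if not self.b_reachable:
            # A real synchronous commit would block here; the sketch raises instead.
            raise ReplicaDown("no acknowledgement from server B")
        self.b[key] = value   # B acknowledges first...
        self.a[key] = value   # ...then the write becomes visible on A


pair = SyncReplicatedPair()
pair.write("cart", "shoes")
```

The two stores never disagree, but once `b_reachable` is False every write is refused: consistency is kept at the cost of availability.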
CAP theorem all around us…
Given the above examples, if you are starting to understand some of the implications of the CAP theorem, you should start to see how it applies to all kinds of situations in various information networks, ranging from CDN caches all the way down to the write-back versus write-through buffer in a RAID array. In particular, CAP theorem applies on a local LAN within a cluster but we largely ignore it there simply because partitions at this level are rare enough, and latency-related partitioning is so small, that the effects can mostly be ignored. CAP theorem applies whenever there is more than one store for information where the possibility of partitioning exists.
Not every piece of technology gives you a choice
There is one more characteristic of CAP theorem that occurs when applied to real-world technology. Until now we implied implementors of a given distributed system always get a business choice as to what they want to sacrifice when partitioning becomes an issue. In reality, some implementations of technology are incapable of allowing this choice in the event of a partition. The nature of the software is such that it makes the choice of availability or consistency for you.
What you need to know about the network
(The eight fallacies of distributed computing)
Along with the implications of CAP theorem as applied to a geographically distributed system, there are a few other common pitfalls for anyone embarking on this road for the first time. Please note all of these assumptions are false and must be accounted for in any distributed system:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
The corollaries to the above fallacies are the truths about networks in a distributed environment:
- The network is unreliable and will fail often.
- Latency has a significant effect on your application.
- Bandwidth is often limited.
- The network is insecure.
- Network topology changes frequently, especially the more geographically distant two nodes are.
- There are many administrators. Different administrators unfortunately will do things differently. The less you have to rely on specific policies / behaviors of any given administrator, the better.
- Transport costs can be significant.
- The network is not homogeneous. Your packets may get split apart, TTLs messed with, and certain kinds of traffic filtered outside of your control. This can vary depending on which paths your packets take.
General guidelines for distributed programming
Everything we have discussed thus far leads to the following best practices for designing a distributed system:
Design for failure
Every component of your application is going to fail at some point, especially those components which communicate over the widely-distributed network. It is a good idea to write down the most common failures which will happen in your application and design acceptable fallback behaviors for those periods. It is also generally better to lose partial functionality of your application than the whole application when failures happen. In other words, fail gracefully.
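As a rough illustration of failing gracefully, the sketch below renders a page even when one non-essential remote call fails. The function names (`load_with_fallback`, `render_product_page`) and the "reviews" feature are invented for this example; the point is only that a cross-site failure degrades one part of the response instead of the whole thing.

```python
def load_with_fallback(remote_call, fallback):
    """Run a remote call, returning `fallback` if the network lets us down."""
    try:
        return remote_call()
    except (ConnectionError, TimeoutError):
        return fallback


def render_product_page(product, fetch_reviews):
    """Build the page; reviews are optional and may come from another site."""
    reviews = load_with_fallback(fetch_reviews, fallback=None)
    page = {"product": product}
    # Lose the reviews panel, not the whole page.
    page["reviews"] = reviews if reviews is not None else "temporarily unavailable"
    return page


def reviews_site_is_partitioned():
    raise ConnectionError("remote site unreachable")


page = render_product_page("blue mug", reviews_site_is_partitioned)
# page["product"] is still served; page["reviews"] holds the fallback text
```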
Minimize cross-site traffic
The less data you need to share between sites, the fewer opportunities there are for failure. Also, given that bandwidth limitations may occasionally happen, being minimal here can help a lot. Traffic traversing long-haul links is much more expensive than local traffic.
Try to be as stateless as possible
The less data that needs to be preserved between requests to your application, the less needs to be transferred between sites. For every bit of stateful data that needs to be shared between sites, a business CAP decision needs to be made about how the application behaves when a partition happens. The fewer of these you have to deal with, the easier time you will have dealing with partitions. Using client-side session stores can help. It should be noted, though, that some browsers / proxying solutions / web servers will limit how large such client-side data can be. In any case, a client-side session cache will be less reliable and more subject to security problems than a server-side cache under normal running conditions.
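One common way to keep session state on the client is a signed token, so any site can verify it without consulting a shared store. The following is a minimal sketch using only the Python standard library; `SECRET_KEY`, the 4096-byte limit, and the function names are assumptions made for illustration. Note that the payload is signed but not encrypted, so it should not carry anything the client must not read.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-secret-shared-by-all-sites"  # hypothetical
MAX_TOKEN_BYTES = 4096  # assumed size budget; cookie limits vary by client


def encode_session(data, key=SECRET_KEY):
    """Serialize and sign session data for storage on the client."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode())
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    token = payload + b"." + signature
    if len(token) > MAX_TOKEN_BYTES:
        raise ValueError("session too large for a client-side store")
    return token.decode()


def decode_session(token, key=SECRET_KEY):
    """Return the session data, or None if the token was tampered with."""
    payload, _, signature = token.encode().rpartition(b".")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


token = encode_session({"user_id": 42, "cart": ["mug"]})
session = decode_session(token)  # any site holding the key can verify this
```

Because verification needs only the shared key, no cross-site lookup is required per request, though rotating that key then becomes its own cross-site concern.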
Maintain separation of application components
There are many reasons to maintain separation of application components even outside the context of distributed systems: better testability, modular functionality that leads to code reuse, more predictable behavior in failure scenarios, bugs that are easier to find and fix, and better control and logging. In the context of distributed systems, modular systems are far easier to make:
1. Fail gracefully (i.e. it’s often better to lose non-essential parts of an application during a partition than to lose the whole application)
2. Make different CAP theorem decisions per component
Encrypt everything sent between sites
You simply cannot trust the internet like you can a local network. Back-filling in a security solution after a breach occurs can often prove to be extremely difficult and costly. It is better to start secure and stay there. This can be achieved through the use of a VPN, though we generally recommend simply setting up self-healing / self-spawning SSH tunnels or application-specific authentication and encryption between systems as this usually scales horizontally much better and eliminates the VPN as a single point of failure and bottleneck for an entire data center of machines.
Cache wherever you can, and flush caches appropriately
With the goal of minimizing cross-site traffic, caching anything which has to go over this link is generally a good idea. In a situation where a partition happens, it is often a good idea to flush caches.
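A small time-to-live cache with an explicit flush captures both halves of this guideline. The sketch below is illustrative only: `TTLCache` is a hypothetical class, and the `loader` argument stands in for whatever expensive cross-site fetch you are trying to avoid repeating.

```python
import time


class TTLCache:
    """Cache values fetched over the cross-site link for a limited time."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock        # injectable so expiry can be tested
        self._entries = {}        # key -> (value, expiry time)

    def get(self, key, loader):
        """Return the cached value, calling `loader(key)` only on a miss."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        value = loader(key)       # the expensive cross-site fetch
        self._entries[key] = (value, now + self.ttl)
        return value

    def flush(self):
        """Drop everything, e.g. when a partition is detected or heals."""
        self._entries.clear()


cache = TTLCache(ttl_seconds=30)
profile = cache.get("user:42", lambda key: {"id": key})  # goes over the link
profile = cache.get("user:42", lambda key: {"id": key})  # served locally
cache.flush()                                            # after a partition event
```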
Use queueing systems / asynchronous processing for cross-site updates as appropriate
In a partition event, most applications work well with a certain degree of inconsistency in order to preserve availability. This is most obviously done with transactions which are essentially read-only. In order to preserve the illusion of consistency for the client for write-type transactions during a partition, using asynchronous processing is often the only viable solution. It is better to simply design your application to use these systems even without partitions so that operating with a partition more closely resembles “normal” running conditions.
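A simple way to picture this is an outbox: writes are accepted locally right away and delivered to the remote site whenever the link allows. The sketch below keeps the queue in memory for brevity, which a real system would not do (the queue would need to survive restarts); `Outbox` and the `send` callable are hypothetical names.

```python
from collections import deque


class Outbox:
    """Accept updates locally and deliver them to the remote site later."""

    def __init__(self, send):
        self.send = send          # callable that ships one update cross-site
        self.pending = deque()

    def enqueue(self, update):
        # The client gets an immediate answer; delivery happens separately.
        self.pending.append(update)

    def drain(self):
        """Deliver queued updates in order; stop at the first failure."""
        delivered = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break             # still partitioned; retry on the next drain
            self.pending.popleft()
            delivered += 1
        return delivered
```

Since `drain` is the only delivery path, the application behaves the same way with or without a partition; only the delay before delivery changes. An update is removed from the queue only after `send` returns, so a failure mid-send can cause a repeat delivery, which is one reason the idempotency guideline matters.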
Batch cross-site updates
There is a significant amount of network latency that applies when transmitting data between geographically distant sites. Blame the speed of light for this. In any case, doing 50 individual request-response transactions over such a link will take a lot longer than batching all these requests together and getting the responses back in one large operation. If this can be done asynchronously, all the better.
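The arithmetic is worth spelling out. The sketch below uses a deliberately simple model in which each round trip costs a fixed latency and each request adds a fixed processing cost; the 80 ms round-trip time and 2 ms processing time are assumed figures for illustration, not measurements.

```python
def sequential_ms(n_requests, rtt_ms, per_request_ms=0):
    """Total time when every request waits for its own round trip."""
    return n_requests * (rtt_ms + per_request_ms)


def batched_ms(n_requests, rtt_ms, per_request_ms=0):
    """Total time when all requests share a single round trip."""
    return rtt_ms + n_requests * per_request_ms


# Assumed numbers: 50 requests, 80 ms round trip, 2 ms of work per request.
one_by_one = sequential_ms(50, rtt_ms=80, per_request_ms=2)   # 50 * 82 = 4100 ms
in_a_batch = batched_ms(50, rtt_ms=80, per_request_ms=2)      # 80 + 100 = 180 ms
```

Under these assumptions the batch finishes in under a twentieth of the time, and the gap widens as the round-trip latency grows.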
Use idempotent design patterns when possible
Idempotency in the context of algorithms, methods, etc. means you can run the same thing many times without changing the outcome any more than running that code once would have done. Especially if you have to design your code to deal with eventual consistency, it is very likely you will end up accidentally running the same procedure on the same bit of data more than once at some point in your application. This can be extremely bad for business, especially in the case of monetary transactions.
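One widely used pattern is to attach a unique request identifier to each operation and remember which identifiers have already been applied. The sketch below shows the idea with a hypothetical `PaymentLedger`; in practice the record of processed identifiers would live in durable storage alongside the balance.

```python
class PaymentLedger:
    """Apply each credit at most once, keyed by a caller-supplied request ID."""

    def __init__(self):
        self.balance_cents = 0
        self._results = {}        # request_id -> balance after that request

    def credit(self, request_id, amount_cents):
        if request_id in self._results:
            # Replay of a request we already handled: return the earlier
            # result instead of crediting the account a second time.
            return self._results[request_id]
        self.balance_cents += amount_cents
        self._results[request_id] = self.balance_cents
        return self.balance_cents


ledger = PaymentLedger()
ledger.credit("req-001", 500)
ledger.credit("req-001", 500)    # duplicate delivery; balance stays at 500
ledger.credit("req-002", 250)    # new request; balance becomes 750
```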
Unleash the chaos monkey
All the extra work / automated failover plans you will come up with are virtually guaranteed not to work if they are not tested. Unfortunately, it is often very difficult to simulate in a lab the kinds of network failures you can get in a distributed system. It is always a good idea to set up a fully-testable staging version of the application on which this kind of testing can occur if your budget allows for it. The next best alternative is to force “real” failures in the production environment (e.g. by shutting down certain vital systems) under more controlled conditions and in windows least likely to cause your business and your clients problems if the failover routines do not work as expected.
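In a staging environment, one cheap way to start is to wrap cross-site calls in something that fails on purpose. The sketch below is a toy fault injector, not a substitute for real network failure testing: `FlakyLink` is a hypothetical wrapper, and it only simulates one failure mode (a refused connection).

```python
import random


class FlakyLink:
    """Wrap a cross-site call and make it fail some fraction of the time."""

    def __init__(self, call, failure_rate, rng=None):
        self.call = call
        self.failure_rate = failure_rate      # 0.0 = never fail, 1.0 = always
        self.rng = rng or random.Random()     # pass a seeded RNG for repeatable runs

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.call(*args, **kwargs)


def replicate(update):
    """Stand-in for a real cross-site operation."""
    return "stored " + update


chaotic_replicate = FlakyLink(replicate, failure_rate=0.2, rng=random.Random(7))
# Run the failover routines against chaotic_replicate and check they cope.
```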
Document all your distributed systems design decisions
As a programmer or designer making distributed systems, you will make some decisions which will be manifested in code or strange looking cluster components which will not make sense to an engineer used to thinking about the problems in terms of single-site mentality. The surest way to make sure your successors end up repeating all your painful learning experiences is to document none of your hard-won but somewhat strange-looking decisions.
To continue with this series check out Distributed Systems Design Part 2.
Further reading
The following are some of the more helpful documents consulted when writing this paper:
* http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
* http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
* http://www.royans.net/arch/brewers-cap-theorem-on-distributed-systems/
* http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
* http://www.eukhost.com/web-hosting/kb/global-server-load-balancing/
* https://blogs.oracle.com/davew/entry/thoughts_on_global_server_load
* http://www.tenereillo.com/GSLBPageOfShame.htm
* http://code.google.com/edu/parallel/dsd-tutorial.html
* http://perspectives.mvdirona.com/2010/02/24/ILoveEventualConsistencyBut.aspx
* http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
* http://guysblogspot.blogspot.com/2008/09/cap-solution-proving-brewer-wrong.html
Office Warming
4:30 – 6:00 PM
119 Pine Street, Suite 200
Seattle, WA 98101
A Couple Quick Questions
We at Blue Box are always trying to serve you better. Would you recommend us to clients, colleagues, and friends? Yes or no, let us know why or why not! Two simple questions so we can find ways to assist our customers, partners, and friends on their own paths to world domination.
Thank you,
Blue Box Team
We Have a Winner!

Thank you to everyone who sent in mugs for our contest. We truly appreciate the tremendous outpouring of support for Trevor and his coffee habit. There were a lot of really creative designs submitted and it was difficult to pick a winner but in the end we had to make a decision. And the winner is…
Nathan Carnes with his one of a kind entry featuring an awesome image with a Blue Box Group shout out!
Congratulations to Nathan Carnes at Carnes Media for sending in the winning mug. Nathan will receive a delicious gift basket including Seattle coffee, chocolate, and more! Below are pictures of all the mug entries with the winning mug front and center.


It is hard to see in the group shots, so below is a large view of the artwork that won the contest.

Blue Box brings our first East Coast data center online!
For the last twelve months, many customers have been asking Blue Box about a timeline for the implementation of an East Coast data center. Besides providing a site for DR capabilities, an East Coast data center improves our ability to serve customers throughout the United States and Europe with lower latency. As of January 16th, 2012, we are happy to announce the public availability of our new data center facility in Ashburn, VA.
The new data center is housed in the state of the art Latisys facility, just outside Washington, DC. The Latisys facility is an SAS-70 Type II certified, Tier III data center. It offers top of the line fire suppression and cooling systems, as well as redundant power providers, generator and battery backup systems. A full list of Latisys facility specifics can be found within this PDF. We are currently hiring more data center engineers for this location, and you can apply here.

What does the new location mean for our customers? It means better bandwidth throughout, lower latency, and more infrastructure options. Ashburn is a major hub of East Coast Internet connectivity. Latisys is strategically located near Amazon, Facebook, and Google data centers with a two millisecond latency between these facilities. This allows BBG customers to use Amazon services as if they were a part of the local infrastructure. The proximity to Facebook’s infrastructure reduces potential communication failures between sites for our social gaming customers. With multiple data center locations, Blue Box now provides more options for redundancy and off-site back-ups, giving customers additional ways to keep their data secure and allowing for increased disaster preparedness.
Blue Box is excited about the opportunities for better connectivity and creative infrastructure solutions the new data center opens up for our customers. It also comprises another step in our quest for world domination. Seattle, New York, Ashburn/Washington, DC, Slovenia…where will we pop up next? Stay tuned!
Blue Box Has Mugs!
Thank you to everyone who has sent us mugs! Trevor will now be able to drink his coffee out of an appropriate container, but there are a lot of other people in the office who drink coffee so keep the mugs coming! Remember you have till January 31st 2012 to send in your craziest mug in order to win an assortment of Seattle goodies (our address is Blue Box Group 119 Pine St, Suite 200, Seattle WA 98101). Here is a sneak peek at what has been sent in so far.
