Maintaining replicated copies of state is the biggest challenge of building large-scale computer systems. Replication is used for a variety of reasons; for example, multiple copies can increase availability, reliability, and/or performance. Often, transformations are applied to data as it is replicated; for example, in search engines, indexes are inverted replicas of an underlying document collection.
Of the many difficulties of maintaining replicated state, the biggest is Murphy's law: Anything that can go wrong, will go wrong. Corruption of data on disk or in the network; mangling of data by buggy software; data loss due to hardware that fails during the replication process: in large enough systems, all of these problems and many more become annoyingly common. As builders of Web-scale, reliable systems, we not only must build mechanisms that prevent errors during replication, but also must build mechanisms to repair errors after replication -- sometimes long after replication.
At Yahoo!, to ensure that system designers adequately address the issue of repair, we ask them to address three issues we refer to as audit, patch, and bootstrap: audit procedures detect when replicas have diverged from the master; patch procedures repair individual corrupted records in a granular, low-latency manner; and bootstrap procedures rebuild a replica wholesale when patching is insufficient.
Let's use an example to illustrate our approach. (This section is illustrative only and does not describe the audit, patch, or bootstrap procedures of any real systems.)
Imagine an online advertising system that consists of two main sub-systems: the ads-management subsystem and the ad-serving subsystem. The ads-management subsystem is a database-backed, order-management application used by advertisers, which captures the actual "creatives" (banners) the advertiser wants to show, the conditions under which they are to be shown (e.g., on what pages, and to what groups of users), and the price the advertiser is willing to pay. The ad-serving subsystem delivers ads to consumers, which means delivering ads against a high volume of page views.
Both the ads-management and ad-serving subsystems store replicas of the orders entered by advertisers. The ads-management subsystem maintains them in a relational database in a standard, 3NF data model. For performance reasons, the ad-serving subsystem maintains the orders in highly-indexed data structures. Further, also for performance reasons, the ad servers are partitioned -- no single ad-server stores all possible ads, but rather the ads are partitioned across multiple machines. For increased parallelism, individual partitions are replicated onto multiple ad servers.
Updates to the ad-serving system happen along two paths. There is a “slow path” in which an entire partition of records is reindexed and stored in a compact file-structure distributed to the ad servers. To reduce latency between the time an advertiser changes an order and the time that change goes “live,” there is also a “fast path” in which recent changes to individual orders are kept in a small buffer that the ad server consults in parallel with its main data set. Anything the ad server finds in this buffer must be merged into the results that come from the compact data structure created on the slow path.
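The fast-path merge can be sketched as a lookup that prefers the recent-changes buffer over the compact main index. This is a minimal sketch; the names (`main_index`, `recent_changes`, `lookup_order`) are hypothetical, and the real structures would be highly indexed rather than plain dictionaries.

```python
def lookup_order(order_id, main_index, recent_changes):
    """Return the freshest view of an order, preferring the fast path."""
    if order_id in recent_changes:
        return recent_changes[order_id]  # None here marks a deletion
    return main_index.get(order_id)

# main_index is built on the slow path; recent_changes holds fast-path updates
main_index = {101: {"bid": 0.50}, 102: {"bid": 0.75}}
recent_changes = {102: {"bid": 0.90}, 103: {"bid": 0.25}, 101: None}

assert lookup_order(102, main_index, recent_changes) == {"bid": 0.90}  # updated
assert lookup_order(103, main_index, recent_changes) == {"bid": 0.25}  # new order
assert lookup_order(101, main_index, recent_changes) is None           # deleted
```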
Our first task is to identify where the data is mastered. In this example, it would be natural to designate the ads-management database as the master data system, and the ad servers as the replicas. In this particular example, it's probably the case that the ad servers contain only a subset of the data (e.g., they don't contain the contact info for the advertising accounts). But even if the ad servers contain all the data, the ads-management system would still be the better master because it is where changes to the data are initially captured, because it has the simpler and more natural representation of the data, and because it's easier to backup. (Backups are a very important form of replication. The best candidate to "master" data is typically the subsystem with the most reliable backup procedures.)
The next step is to define auditing procedures. In this application, the replicas (the ad-serving systems) store a transformation of the original data; thus, a bit-wise comparison isn’t meaningful, even where it would be practical. Instead, during the transformation process, we can compute and store checksums on a per-record and per-partition basis. Then, we can periodically check each replica of a partition against its per-partition checksum, and, when that check fails, check each record against its per-record checksum.
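The two-level check above might be sketched as follows, assuming each record can be serialized to a canonical byte string before hashing. The XOR combination makes the partition checksum order-independent, since replicas may store records in a different order than the master; all names here are illustrative.

```python
import hashlib

def record_checksum(record: bytes) -> str:
    return hashlib.sha256(record).hexdigest()

def partition_checksum(records) -> str:
    # XOR-combine the per-record digests so the result does not
    # depend on the order in which a replica stores its records.
    acc = 0
    for r in records:
        acc ^= int(record_checksum(r), 16)
    return format(acc, "064x")

master = [b"order-1|bid=0.50", b"order-2|bid=0.75"]
replica_ok = list(reversed(master))                    # same data, new order
replica_bad = [b"order-1|bid=0.50", b"order-2|bid=9.99"]  # corrupted record

# Cheap per-partition check first...
assert partition_checksum(master) == partition_checksum(replica_ok)
assert partition_checksum(master) != partition_checksum(replica_bad)

# ...and only on failure, drill down to per-record comparison.
bad = [i for i, (m, r) in enumerate(zip(master, replica_bad))
       if record_checksum(m) != record_checksum(r)]
assert bad == [1]  # pinpoints the corrupted record
```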
The final step is to define repair procedures. Earlier we described a “fast path” from the ads-management system to the ad-servers used to quickly propagate small changes straight to the serving systems. When our audit procedure finds a corrupted record, we can reuse this “fast path” to patch the corrupted record by pushing the uncorrupted value of the record through to the ad-server. This results in a granular, low-latency mechanism for patching the ad server.
While both granular and fast, this patching procedure is also fragile. For example, if too many records have been corrupted, then patching them in through this update path could overflow the buffer of recent changes. Also, it could be that data corruption in the main data set prevents the merge logic from correctly merging results from the recent-changes buffer into results from the main data set. Where this patching-procedure fails, the following bootstrap procedure may be used as a fall-back: copy the entire state of a working ad-server onto the corrupted ad-server. The amount of data copied could be substantial, and thus this repair could be disruptive (two replicas of a partition could be out of commission during the copy). But this bootstrap procedure would consist of just copying a few files, which can be implemented in a simple, robust, reliable manner.
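The patch-then-bootstrap fallback described above can be sketched as a small decision procedure. All names here are hypothetical: `push_via_fast_path` stands for re-sending a clean record through the update path, `bootstrap_from_peer` for copying a working replica's files, and the capacity constant is an assumed buffer size.

```python
FAST_PATH_CAPACITY = 1000  # assumed size of the recent-changes buffer

def repair(corrupted_ids, push_via_fast_path, bootstrap_from_peer):
    """Patch corrupted records when few; fall back to bootstrap when many."""
    corrupted_ids = list(corrupted_ids)
    if len(corrupted_ids) > FAST_PATH_CAPACITY:
        # Patching this many records would overflow the fast-path buffer,
        # so fall back to the disruptive-but-robust wholesale copy.
        bootstrap_from_peer()
        return "bootstrap"
    for record_id in corrupted_ids:
        push_via_fast_path(record_id)
    return "patch"

pushed, booted = [], []
assert repair([1, 2, 3], pushed.append, lambda: booted.append(True)) == "patch"
assert pushed == [1, 2, 3]
assert repair(range(5000), pushed.append, lambda: booted.append(True)) == "bootstrap"
assert booted == [True]
```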
As this example suggests, the difference between patching and bootstrapping is more of degree than of kind. There does tend to be a tradeoff between repair procedures that are more granular and less operationally disruptive, but also more fragile, versus procedures that are more disruptive but also more bullet-proof. We tend to call the former patching, and the latter bootstrapping. But we've also found that, when system designers are alert to the need for data repairs and design them in from the start, designers can substantially mitigate this tradeoff, blurring the boundaries between patching and bootstrapping.
The first step in designing audit procedures is to decide which subsystem holds the "source of truth" against which other subsystems are to be audited. Data can have multiple masters, which can be confusing. But more often, the confusion is about what constitutes "multi-mastering." This subsection considers the impact of multi-mastering -- and confusion about multi-mastering -- on our recommendations.
Large databases are often partitioned ("sharded") across multiple masters; for example, customers will be hashed to partitions, and all the data for a single customer will be mastered by the partition to which the customer hashes. This is sometimes called "multi-mastering," but it really isn't: Each record has a clear, single master, even if the totality of records does not have a single master. Auditing, patching, and bootstrapping apply to this scenario in a straight-forward manner. (In some cases, individual fields of a record are mastered in different storage systems. Although such a design complicates audit, patch, and bootstrap procedures, it should not be confused with true multi-mastering.)
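The hashed-partitioning scheme above can be illustrated in a few lines: each customer key deterministically maps to exactly one partition, so every record has a single, unambiguous master. The partition count and hash choice are illustrative assumptions.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed number of master partitions

def master_partition(customer_id: str) -> int:
    """Every record has exactly one master partition, derived from its key."""
    digest = hashlib.md5(customer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same customer always hashes to the same master, so audit, patch, and
# bootstrap can treat each partition as an independent single-master system.
assert master_partition("acme-corp") == master_partition("acme-corp")
assert 0 <= master_partition("acme-corp") < NUM_PARTITIONS
```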
True multi-mastering is rare, but it does happen. As one example of true multi-mastering, our ads-management and CRM systems might both accept changes to customer contact records, which eventually have to be reconciled through some mechanism. As another example, consider Amazon's Dynamo. Over simplified, in Dynamo you might have three servers, and when you write a record, you might send the write to all three servers, but only wait for the first two to complete before considering the write to be successful. When you read, you read from all three servers, and take the majority value. Thus, no single server can be designated the master of the data: All three must be consulted before the true value of a record is known.
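The over-simplified Dynamo description above (write to three servers, read all three, take the majority) can be sketched as follows. This is a toy model under the stated simplifications, not Dynamo's actual protocol: real systems send writes in parallel, return after a quorum of acks, and use vector clocks rather than simple majority.

```python
from collections import Counter

class QuorumStore:
    """Toy three-replica store: no server alone holds the truth."""

    def __init__(self):
        self.servers = [{}, {}, {}]

    def write(self, key, value):
        # In a real system the write returns after 2 of 3 acks and the
        # third completes asynchronously; here all three just succeed.
        for server in self.servers:
            server[key] = value

    def read(self, key):
        # Consult all three replicas and take the majority value.
        values = [server.get(key) for server in self.servers]
        value, count = Counter(values).most_common(1)[0]
        return value if count >= 2 else None

store = QuorumStore()
store.write("order-7", "bid=0.50")
store.servers[0]["order-7"] = "CORRUPT"  # one replica silently diverges
assert store.read("order-7") == "bid=0.50"  # the majority still wins
```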
The concepts of audit, patch, and bootstrap are easily applied to multi-master systems. In the ads-management/CRM example from above, the "master" from which the ad servers are replicated is the ads-management system, and any multi-mastering with the CRM system does not directly impact audit, patch, or bootstrap of the ad servers. In the Dynamo case, multi-mastering can be hidden behind an abstraction layer. In our advertising system, for example, we might use a dynamo-like system as the storage system for the ads-management application. But even in this scenario, the analysis from Section 2 still holds: the ads-management system, as a whole, is the master, and the ad-serving system is the replica. The fact that multi-mastering is used inside the ads-management system should be hidden from the audit, patch, and bootstrap procedures.
Auditing (and patching and bootstrapping) assumes that you're auditing against the actual source of truth. However, it's not always practical to audit all replicas of data directly against the ultimate source of truth. In these cases, transitive auditing -- that is, auditing through a proxy for the source of truth -- can be used to reduce load on the ultimate source of truth.
Let's consider again our advertising system. We said the source of truth is the ads-management application, but the storage system for this application might be a RDBMS against which "auditing scans" would severely compromise performance. Instead of auditing the serving systems directly against this database, one can instead dump the database into a staging area and audit the serving systems against the dump. Such dumping procedures are typically quite robust, so a practical auditing procedure in this case is to audit the serving systems against the dump, and only audit the dump against the original database infrequently.
Often, sources of truth are replicated by capturing a transaction-log of some kind from that source of truth, and using the log to update replicas. In these situations, it's tempting to use the transaction log as the proxy for the source of truth, rather than an independent dump. Designers should consider carefully before following this path. The effectiveness of the proxy is dependent on a highly robust process for maintaining the proxy. If it's possible to independently dump the source of truth in a highly-robust manner, and audit replicas against this independent proxy, then you will be auditing the end-to-end replication path, including the capture and replay of transaction logs, which can catch a lot of software bugs. If a full, independent dump is not possible, then you should consider mechanisms for performing light-weight audits on the transaction log. The next subsection enumerates a few ideas for such auditing.
Comparing counts is a fast, easy technique that can catch a lot of replication failures. Counting records is the easiest of light-weight audits, and fairly effective as well. Counting bytes is a great technique too, although it's difficult when the master and the replica represent data in different ways. Where records can be subdivided into meaningful classes, counting the size of each class is also valuable. In advertising, for example, orders can be "active" or "inactive," and counting each class is a good lightweight audit. Finally, while definitely heavier in weight, histograms are also a useful auditing tool (e.g., number of orders per advertiser).
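The count-based audits above might look like the following sketch, comparing the record count, per-class counts, and a per-advertiser histogram; the field names and data shapes are assumptions for illustration.

```python
from collections import Counter

def lightweight_audit(master, replica):
    """Compare counts and class counts instead of full record contents."""
    return {
        "record_count": len(master) == len(replica),
        "status_counts": Counter(o["status"] for o in master)
                         == Counter(o["status"] for o in replica),
        "orders_per_advertiser": Counter(o["advertiser"] for o in master)
                                 == Counter(o["advertiser"] for o in replica),
    }

master = [{"advertiser": "a", "status": "active"},
          {"advertiser": "b", "status": "inactive"}]
replica = [{"advertiser": "a", "status": "active"},
           {"advertiser": "b", "status": "active"}]  # status flipped in flight

result = lightweight_audit(master, replica)
assert result["record_count"]        # same number of records...
assert not result["status_counts"]   # ...but the class counts catch the error
```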
Comparing other aggregations can also be effective. For example, comparing the sum of the bids of every order in our advertising system is a good way to check the integrity of a particularly important piece of information maintained by that system. Also, checking the min and max modification-times on a set of records can also help catch missing updates.
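Along the same lines, a sketch of an aggregation-based audit, comparing the sum of bids and the maximum modification time (field names assumed; integer values avoid floating-point comparison noise):

```python
def aggregate_audit(master, replica):
    """Compare sums and modification-time extremes instead of full contents."""
    return {
        "bid_sum": sum(o["bid"] for o in master)
                   == sum(o["bid"] for o in replica),
        "max_mtime": max(o["mtime"] for o in master)
                     == max(o["mtime"] for o in replica),
    }

master = [{"bid": 50, "mtime": 100}, {"bid": 75, "mtime": 130}]
replica = [{"bid": 50, "mtime": 100}, {"bid": 75, "mtime": 120}]  # missed an update

checks = aggregate_audit(master, replica)
assert checks["bid_sum"]
assert not checks["max_mtime"]  # a stale max modification-time flags a missing update
```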
One might think that checksums are a useful tool for auditing, but they are more complicated to use than they first appear. Replicas typically store data in a format different than the master, and may transform the data further (e.g., creating inverted files). Further, different replicas often replicate only a subset of the records -- or only a subset of the data in records -- requiring the computation of different checksums for each replica. Finally, maintaining checksums in systems where records can be deleted or modified introduces even further complexities. As suggested above, aggregations on important fields (such as advertiser bids) can be an effective way of using partial checksums for lightweight auditing. And "heavy weight" checksums -- snapshotting (physically or logically) and scanning the snapshot to compute a point-in-time checksum -- can be a useful diagnostic tool. But incrementally maintaining checksums of the entire data set at both the master and the replicas is typically not a practical approach to lightweight auditing.
To dig a little deeper into the issues, as you increase the bandwidth of changes while keeping the latency low, you are inevitably forced to apply changes to replicas out-of-order (otherwise, a delay in one update cascades to the entire stream of updates). So in a high-bandwidth, low-latency scenario, the state of the master and replicas are both in constant flux (because of high bandwidth), and may never actually be exactly the same (because of differences in order).
This problem can easily be addressed by auditing only the older parts of the data set. The "flux" caused by high bandwidth of out-of-order updates is limited to the records that have changed recently. On the "older" part of the data-set -- the records that haven't changed for a bit -- the master and replicas should agree. By restricting audit procedures to this older part of the data set, it can be easier to define the audit logic.
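Restricting the audit to settled records might be sketched as a simple filter on last-modified time; the horizon value and field names are assumptions.

```python
import time

AUDIT_HORIZON = 3600  # assumed: audit only records quiet for at least an hour

def auditable_records(records, now=None):
    """Skip recently changed records, so in-flight, possibly out-of-order
    updates don't cause false alarms; audit only the settled, older part."""
    now = time.time() if now is None else now
    return [r for r in records if now - r["mtime"] > AUDIT_HORIZON]

records = [
    {"id": 1, "mtime": 0},      # old and settled: audit it
    {"id": 2, "mtime": 9990},   # changed seconds ago: skip it for now
]
assert [r["id"] for r in auditable_records(records, now=10000)] == [1]
```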
Bootstrap procedures can also be challenging to build for high-bandwidth systems, because bootstrap procedures tend to run long, and many changes can occur in a high-bandwidth system while they run. One technique for addressing this challenge is to replicate a snapshot for the bootstrap, capture the changes that occur during this replication, and apply those changes after the snapshot is installed.
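The snapshot-plus-catch-up technique can be sketched as follows. The names are hypothetical; the essential idea is to buffer changes that arrive while the long-running snapshot copy is in progress, then replay them on top of the installed snapshot.

```python
def bootstrap(snapshot, changes_during_copy):
    """Install a snapshot, then replay changes buffered during the copy."""
    replica = dict(snapshot)                  # 1. install the snapshot
    for key, value in changes_during_copy:    # 2. replay buffered changes
        if value is None:
            replica.pop(key, None)            # a deletion that happened mid-copy
        else:
            replica[key] = value
    return replica

snapshot = {"order-1": 50, "order-2": 75}
buffered = [("order-2", 90), ("order-3", 25), ("order-1", None)]
assert bootstrap(snapshot, buffered) == {"order-2": 90, "order-3": 25}
```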
In our experience, junior system designers typically fail to address audit, patch, and bootstrap procedures at all. They don't willfully ignore them, but rather design what they believe to be "bullet-proof" replication mechanisms, which are only bullet-proof under non-Byzantine assumptions. In reviews, naive designers will deny that Byzantine failures are possible, or argue that they are "low risk." However, with some experience -- including a few weekends spent cleaning the messes caused by Byzantine failures -- they come to understand the need for audit, patch, and bootstrap procedures.
Interestingly, at this point, many then go overboard, either building protection against Byzantine failures directly into the replication mechanisms, or designing extensive automation around audit, patch, and bootstrap. This is typically also a mistake, especially early into the lifecycle of a system.
Instead, it is better to initially design audit, patch, and bootstrap procedures as tool-supported manual procedures. That is, it is typically faster and more reliable to build scripts that help operators audit, patch, and bootstrap a system, rather than attempt to automate everything right from the beginning. A nice balance between automation and manual effort is to automate a simple, high-level audit (e.g., compare record counts), and build manual tools for deep auditing, and for patching and bootstrapping, to be used by operators when the high-level audits fail. In fact, this approach is sufficient for the entire lifetime of most systems, and further automation can reduce reliability. Keep in mind that our context is recovering from Byzantine failures in the replication path. In these recovery scenarios, something unexpected and possibly quite severe has occurred. In these scenarios, automated repair procedures can easily cause more damage, so it is often better to let an operator investigate the larger situation and apply human judgement before any repairs are attempted.
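The balance described above might be sketched as follows: automate only the cheap, high-level check, and hand anything suspicious to an operator rather than auto-repairing. The function and message names are illustrative.

```python
def high_level_audit(master_count, replica_count, alert_operator):
    """Automated cheap check; escalates to a human instead of auto-repairing."""
    if master_count != replica_count:
        alert_operator(
            f"record-count mismatch: master={master_count}, "
            f"replica={replica_count}; run the deep-audit tool and apply "
            "judgement before patching or bootstrapping"
        )
        return False
    return True

alerts = []
assert high_level_audit(1000, 1000, alerts.append)       # counts agree: no alert
assert not high_level_audit(1000, 997, alerts.append)    # mismatch: page a human
assert len(alerts) == 1
```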
Finally, note that bootstrap is required to recover from catastrophic failures, while audit/patch is an optimization. Even in cases where audit/patch exists, it's important to bootstrap periodically to ensure that the bootstrapping process works and to correct any errors not found in audit or caused by patch. Prophylactic bootstrap need not require a service outage, since replicas (often built to satisfy load/capacity requirements) should be able to be bootstrapped independently.