Backup Strategy

Introduction

Our overall backup strategy is not really about backup. It's about providing as many layers of failure tolerance as possible before the application fails. Backup is a large part of that plan, but we also need to know how the backups will be brought back online, how they will be archived, how long they will be kept, and so on.

Data

  1. Externally Consistent Data

    Distributed transactions should be "externally consistent". This includes SMTP "transactions", e.g. message texts. Also, if we are to have a version control system, the actual versioned files should be considered externally consistent, because the synchronization is potentially destructive (for example, if synchronization is being done from a mobile device, that device may discard its local copy once it believes the server has one).

    External consistency means that if the data has arrived, we will do everything that is even theoretically possible to keep it around. That means instant replication of the data. Luckily, all the data which must be stored in this manner is also amenable to flat-file storage, and is rarely if ever mutated.

    Because of its loose run-time requirements, this data can safely be stored on a network filesystem.

  2. Internally Consistent Data

    Application objects should be "internally consistent". This is managed almost entirely by the BSDDB transaction system, but there are a few caveats we need to be aware of. Because this data is mutable and required for any level of the system's operation, we have to perform any backups of this data "hot". It is acceptable to lose a small number of hours' worth of internally consistent data as long as the database is not corrupt and the non-user-entered data for that time period can be recovered by re-processing data from the externally consistent store.

    Since we have an append-only log-file that gets generated by the transaction system, we can perform system-wide backups fairly easily.

    It is theoretically possible, however unlikely, that our internally consistent data will fall behind our externally consistent data due to a crash. This is the main caveat we must handle at the application level - checking at startup that the database and the file store are in a self-consistent state, and replaying any data from the file store back into the database so that no data from other systems is lost (see the sketch after this list).

  3. Cache Data

    This is data which is potentially expensive to generate, but relatively cheap to store, such as the Lupy index or image thumbnails. These have different priorities, but the idea is the same: it's bad to lose them, but if you do, you can re-generate them from scratch. This kind of data can be lost safely, but it should be backed up anyway to avoid the extra performance hit.

    In particular, the Lupy index should be backed up because re-processing the whole thing could be massively expensive; we don't want to start from day 1, we want to start from the most recent comprehensive backup. Because Lupy already runs asynchronously, we can flush the index and block further indexing during the backup without too much trouble.

  4. Scratch Data

    Scratch data is data which needs a location on disk, but is only temporarily useful and should be discarded afterwards. I'm sure there are security-related use cases for this, but the one I'm thinking of is a temporary, non-transactional BSDDB for building short-lived indexes which are used once, periodically blown away, and never backed up.
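
To make the caveat in item 2 concrete, here is a minimal sketch of that startup check, assuming a hypothetical flat-file directory for the externally consistent store and a database object that exposes has_processed() and process() - the real interfaces will differ:

    import os

    # Hypothetical location of the externally consistent flat-file store.
    EXTERNAL_STORE = "/var/app/external"

    def replay_missing(db):
        """Replay any flat files the database has not yet seen.

        `db` is an assumed interface with has_processed() and process();
        the real application objects will look different.
        """
        for name in sorted(os.listdir(EXTERNAL_STORE)):
            if db.has_processed(name):
                continue
            with open(os.path.join(EXTERNAL_STORE, name), "rb") as f:
                # Safe to re-process: flat files are rarely, if ever,
                # mutated once written.
                db.process(name, f.read())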

Phases

Reliability is a cash-intensive problem. We all know that you can throw $1000 at the first 90%, $10,000 at 99%, and an order of magnitude more for every additional nine after that. Therefore, some of our "guarantees" will actually be optimistic, because it is not possible to provide true guarantees without more hardware than we are likely to see soon.

Phase 1

Our initial backup strategy will focus on comprehensiveness and self-consistency, at the risk of losing some of our external consistency. To put it simply: we will create a tar archive of all of the important data, and lock the user's account while this is happening.

Internally the filesystem data will have to be segmented into consistent, cache, and scratch directories so that we can run the tar in an order that makes sense.
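
Something like the following could do it, assuming per-user consistent/, cache/, and scratch/ directories and a placeholder account lock (the real locking mechanism isn't designed yet):

    import os
    import tarfile
    from contextlib import contextmanager

    @contextmanager
    def account_locked(username):
        # Placeholder: the real system would block logins and writes for
        # this user while the archive is being made.
        yield

    def archive_user(username, data_root, out_path):
        """Tar one user's data, consistent data first; scratch is skipped."""
        user_dir = os.path.join(data_root, username)
        with account_locked(username):
            with tarfile.open(out_path, "w:gz") as tar:
                # Order matters: internally consistent data first, then cache.
                for subdir in ("consistent", "cache"):
                    path = os.path.join(user_dir, subdir)
                    if os.path.isdir(path):
                        tar.add(path, arcname=os.path.join(username, subdir))
                # Scratch data is deliberately never archived.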

These archives should be run once per day and archived on CD-ROM. When the data becomes too large for a single CD, we can split it into database and content archives, with the database archive created first. Finally, we can do database dumps of a particular user's data in an external process. Since transactions should not be run across multiple users' database spaces at once anyway, this should be fine.
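
The external dump could be as simple as spawning a helper script per user; the script name and arguments below are purely hypothetical:

    import subprocess
    import sys

    def dump_user_in_subprocess(username, out_path):
        # Run the per-user dump outside the main server process so its large
        # transaction does not compete with live traffic.  "userdump.py" is a
        # hypothetical helper, not an existing entry point.
        subprocess.check_call(
            [sys.executable, "userdump.py", "--user", username, "--out", out_path])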

In the event of a power failure, the disk should be in a consistent state to recover from. In the event of a disk hardware failure, we can bring up a new machine and unpack the archives in parallel, adding the keys to the database in transactions of arbitrary size, tuned for performance, with the last key added in each user's overall transaction being the one that notifies the system that the user can log in.
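
A rough sketch of that restore path, assuming the dump yields (key, value) pairs and a database wrapper with begin()/put()/commit() - the key naming for the "user available" flag is invented:

    def restore_user(db, username, records, batch_size=500):
        """Replay dumped (key, value) pairs in batches of arbitrary size,
        writing the 'user available' key last of all."""
        batch = []
        for key, value in records:
            batch.append((key, value))
            if len(batch) >= batch_size:
                _commit_batch(db, batch)
                batch = []
        # The availability flag goes in last, so a partially restored
        # account never looks like it can be logged into.
        batch.append((b"user:" + username.encode() + b":available", b"1"))
        _commit_batch(db, batch)

    def _commit_batch(db, batch):
        if not batch:
            return
        txn = db.begin()
        for key, value in batch:
            db.put(key, value, txn)
        db.commit(txn)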

Potential Problems

The one problem with this strategy is transactional consistency during a backup. Currently our MAX_LOCKS is set to 1000 in order to make sure that we're not running really large transactions inside the main process. However, we would need to make that number substantially larger in order to archive each user's segment of the database in a single transaction, because a user might have tens of thousands of keys. Rather than attempting one enormous transaction, it may be wiser to add some application-level logic for a mini-transaction-log that is created whenever a user's data is modified during backup.
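
One possible shape for that mini-transaction-log, as a sketch of the idea only - all the names here are invented:

    class MiniTransactionLog:
        """Record writes that land while a user's backup is running, so
        they can be replayed against the finished archive afterwards."""

        def __init__(self):
            self.backing_up = set()     # usernames currently being archived
            self.pending = {}           # username -> list of (key, value)

        def start_backup(self, username):
            self.backing_up.add(username)
            self.pending[username] = []

        def record(self, username, key, value):
            # Called from the normal write path; a no-op unless a backup
            # of this user is in progress.
            if username in self.backing_up:
                self.pending[username].append((key, value))

        def finish_backup(self, username):
            self.backing_up.discard(username)
            return self.pending.pop(username, [])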

Phase 2

As we bring multiple machines into the picture, the backup strategy should change little. We will still archive CDs in much the same format - length-delimited flat key/value dumps of the data from the database, compressed as necessary and segmented by user, so that we can restore users individually.
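
The dump format itself could be as simple as a pair of length prefixes per record; the exact framing below is an assumption - any unambiguous length-delimited encoding would do:

    import struct

    def write_record(fobj, key, value):
        # Two 32-bit big-endian lengths followed by the raw key and value bytes.
        fobj.write(struct.pack(">II", len(key), len(value)))
        fobj.write(key)
        fobj.write(value)

    def read_records(fobj):
        while True:
            header = fobj.read(8)
            if len(header) < 8:
                return
            klen, vlen = struct.unpack(">II", header)
            yield fobj.read(klen), fobj.read(vlen)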

The one major difference is that we won't be able to use an implicit key/value mapping inside the master database to indicate a user's availability. We will need to provide a notification system to the domain master server to indicate when lost users' data becomes available again.

Phase 3

Of course, the holy grail of a backup plan is to provide zero downtime in the case of recovery. Eventually, we should emulate GoogleFS in this regard.

The strategy here would be to have each user's account hosted on at least 3 machines simultaneously: one machine for the live service, one for running backups, and one for hot failover. The backup machine would periodically stop, run a huge transaction to do the backup, and then resume processing its queue where it left off.
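
A very rough sketch of that cycle, where the queue, the apply step, and the backup step are hypothetical stand-ins for the replication feed, normal processing, and the big backup transaction:

    import queue
    import time

    def backup_machine_loop(command_queue, apply_command, run_backup,
                            backup_interval=3600):
        next_backup = time.time() + backup_interval
        while True:
            if time.time() >= next_backup:
                # Pause queue processing, take the backup in one big
                # transaction, then pick the queue up where it left off;
                # the queue simply grows while the backup runs.
                run_backup()
                next_backup = time.time() + backup_interval
            try:
                command = command_queue.get(timeout=1.0)
            except queue.Empty:
                continue
            apply_command(command)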

The hot failover machine would just keep doing what it was doing until it was called upon to start answering user-interaction requests directly, at which point it could take over feeding the queued data to the other servers.

When the production machine fails, the failover machine can take its place processing user interaction.

When the failover machine fails, another freshly formatted box can be brought in and initialized by pointing it at the backup machine, telling the backup machine to checkpoint to it and then feed it the queued commands starting from the transaction log.

When the backup machine fails, the failover machine can take its place running the backup until another machine is brought in; that machine is then told to restore itself from the current backup server and become the new failover machine, as above.

If any two servers fail, the one that remains does a full checkpoint and interrupts service until one of the machines can be replaced. To avoid service interruption we may actually want to have two backup machines running in parallel so that two machines can fail simultaneously.

If the whole bank fails, we can initialize a new bank from the last complete backup, as described in Phases 1 and 2.