Justin Francis

Saturday, July 26, 2008

Persistence of an In-Memory Application Server

My philosophy is that the database should be used primarily as dumb storage. Keeping as much as possible in the middle layer of an application affords us programmers the most power: most of the changes we need to make can then be made in the language we are most comfortable with.

For an In-Memory application like ours (one that loads all its data at startup and does not read from the database thereafter), the database's role changes completely: from holder of application state to glorified file. On each modification of the system, of course, we immediately write the change to the database, so that if the system crashes, it will boot back into the same state.

Recently, the largest single cause of wait times in our system was contention for these database writes. But with an In-Memory system, the state of the application is not maintained in the database, so the correct functioning of the system does not depend on it. In fact, if we could ignore system crashes, we could persist the entire domain model once, right before shutting down, and everything would work fine.

After becoming comfortable with the idea that writing changes to the database exists only so the system can reboot properly, the most profound revelation occurred to me: what does the user care whether the system has persisted the change they just made? When they make their next request (assuming no crash in between), the change is still there, because the state of the application lives in memory, not in the database. So why not skip the save (from the user's perspective) and schedule it to be done after the response is returned to the user?

Doing saves asynchronously improved the response time of all system modifications by a factor of 10 on average. Not bad for a day or two of programming. Oh, and 5 years of building software.

The implementation was relatively simple. We created a class on which domain object saves could be queued. In fact, we made it more general: a decorator that puts any function or method call on this queue, which then executes the call at some point in the near future.
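To make that concrete, here is a minimal sketch of the idea in Python -- the names are illustrative, not our actual code. A single background thread drains the queue, and a decorator re-routes any decorated call onto it:

    import queue
    import threading
    from functools import wraps

    # All deferred calls land here; one background thread drains it,
    # so only one writer ever touches the database at a time.
    _save_queue = queue.Queue()

    def _worker():
        while True:
            func, args, kwargs = _save_queue.get()
            try:
                func(*args, **kwargs)  # the actual database write
            finally:
                _save_queue.task_done()

    threading.Thread(target=_worker, daemon=True).start()

    def deferred(func):
        """Instead of running now, enqueue the call; the queue thread
        executes it shortly after the response has gone back to the user."""
        @wraps(func)
        def enqueue(*args, **kwargs):
            _save_queue.put((func, args, kwargs))
        return enqueue

    @deferred
    def save(domain_object):
        # ... write domain_object's current state to the database ...
        pass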

The obvious advantage is the speed boost, but as a bonus we also save writes when the same object is modified several times in a row, which, as you might expect, is very common. We also reduce the load on both the database server and our own system, because only a single thread (the queue thread) is writing to the database at any given time.
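The coalescing can be as simple as refusing to queue an object that already has a save pending; whenever the queue thread finally gets to it, it writes the object's latest in-memory state anyway. A rough sketch, this time queueing the objects themselves (write_to_database is a hypothetical persistence routine, and _save_queue and threading come from the sketch above):

    _pending = set()
    _pending_lock = threading.Lock()

    def save_later(obj):
        with _pending_lock:
            if id(obj) in _pending:
                return  # a save is already queued; it will persist
                        # the latest in-memory state when it runs
            _pending.add(id(obj))
        _save_queue.put(obj)

    def _drain_one():
        obj = _save_queue.get()
        with _pending_lock:
            _pending.discard(id(obj))  # clear before writing, so a
                                       # mid-write change re-queues
        write_to_database(obj)  # hypothetical persistence routine
        _save_queue.task_done()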

There are some disadvantages. Probably the biggest one you are thinking of is a failure during a save. Some others that come to mind are data-integrity errors, and modifications arriving faster than the queue can clear them. One final potential problem I'll discuss is a system crash, which could result in some lost writes.

Foreign-key, data-type and other data-related errors that could happen during the save are not a problem. First of all, they should never happen, because you have coded these constraints into your system (or they hold naturally -- how can you have a broken FK in your object model? The worst you could have is an unlinked object). Second, if they do happen, you can notify the programmers (because if the system did not prevent this kind of problem, the user certainly won't be able to solve it from some cryptic error message) or simply re-queue the save (it is probably a temporary problem anyway).
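In the worker loop, that policy might look something like this (a variant of the earlier sketch; the retry cap is my own assumption, there to keep a genuinely permanent error from cycling forever):

    import logging

    def _worker():
        # the enqueue side now puts (func, args, kwargs, 0) on the queue
        while True:
            func, args, kwargs, attempts = _save_queue.get()
            try:
                func(*args, **kwargs)
            except Exception:
                # The domain model should make these errors impossible,
                # so tell a programmer about it, not the user...
                logging.exception("queued save failed")
                # ...and re-queue on the theory that it is transient.
                if attempts < 3:
                    _save_queue.put((func, args, kwargs, attempts + 1))
            finally:
                _save_queue.task_done()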

The problem of saves queueing up faster than they can be cleared is unlikely in practice. I have never found that spawning multiple threads to write to a database speeds things up much (and that is effectively what happens when each request synchronously writes to the database). It is simple, however, to spawn extra threads to clear a backlogged queue. So far, we have not needed to: the average size of the backlog in our queue is 0.
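And if a backlog ever did appear, draining it faster is a few lines on top of the earlier sketch, at the cost of reintroducing some write contention:

    for _ in range(4):  # pool size is arbitrary; tune it to the backlog
        threading.Thread(target=_worker, daemon=True).start()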

Finally, what happens if the system crashes? I should point out that our system does not crash more than once a month, and that is only because we have been going through a rough patch. The real question is how many writes are normally waiting to be completed, and the answer will probably be close to zero. No matter what, some writes are lost in a crash (the ones in progress or just starting). If you tune your queue properly, you can minimize the risk and impact of a crash. The top priority, however, should be finding out why the system is crashing at all and fixing that.
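Part of that tuning is cheap: on a clean shutdown, block until the queue is empty, so that only a hard crash can lose the handful of writes still in flight. With the queue.Queue sketch above, that is a single call:

    def shutdown():
        # Blocks until task_done() has been called for every queued save.
        _save_queue.join()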

The conclusion here is that users don't care about persistence, so don't make them wait for it unless you have to. If you have to, examine your assumptions, and if you are still using the database as your authority on application state, consider an in-memory system. It might solve a lot of problems you are having, database contention probably chief among them.