Justin Francis Self-Portrait

Sunday, August 3, 2008

Aren't Application Servers Slow?

Application servers have a reputation for being sluggish resource hogs. While it is true that there are many, many slow application servers out there, I think it is unfair to single them out: most of the time, these systems are run under far more demands (data, users) than the average CGI script suite. Any system would run slowly in that kind of environment.

In fact, I have never seen a good-sized company use a pile of scripts centered around a database that wasn't slow, which brings us to the launching point for this discussion: any system running in a real environment, with a real number of users and a real amount of data, will run slowly, whether database-driven or not.

Once you understand the inevitability of growth, the real question is: what are you going to do when your system becomes slow? As I see it, there are two scenarios. Either you are running a database-driven application, in which case you optimize the database, or you are running an application server, in which case you optimize your code.

To optimize the database, you look at queries, indexes, and caching. Eventually you move on to table design and normalization, splitting of schemas, and so on. I am not a database developer, but I have designed and re-designed (for performance reasons) a number of very large databases, and it seems to me that there is a limit to what you can do painlessly in this domain. With databases, I feel the limit for new, innovative solutions is reached fairly quickly.
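To make those first steps concrete, here is a minimal sketch using Python's built-in sqlite3 module; the app.db file and the orders table with its customer_id column are hypothetical stand-ins for your own schema:

    import sqlite3

    conn = sqlite3.connect("app.db")  # hypothetical database and schema

    # Ask the engine how it plans to execute a slow query; a full table
    # scan here usually means a missing index.
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
    ):
        print(row)

    # Add an index so lookups by customer_id no longer scan the whole table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id)"
    )
    conn.commit()

Caching and the deeper schema surgery only start once cheap wins like this run out.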

When optimizing code in an application server, however, the possibilities are almost endless. While SQL and other database features are relatively limited, modern general-purpose languages allow for an almost unlimited number of ways to innovate and improve performance.

In the application server I am currently building, the size of the system has tripled since we started writing the software. However, with constant optimization, it runs as fast today as it did in the beginning. Certainly there were some database optimizations, but the vast majority were simple, effective and insightful changes based on the results of profiling the Python code. It was more fun and more interesting than debugging database locks.
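For reference, the standard library makes that profiling loop cheap; handle_request below is a stand-in for a real entry point, not our actual code:

    import cProfile
    import pstats

    def handle_request():
        # Stand-in for a real request handler; profile your own entry point.
        return sum(i * i for i in range(100_000))

    profiler = cProfile.Profile()
    profiler.enable()
    handle_request()
    profiler.disable()

    # Print the ten functions with the highest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)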

The conclusion here is that either way, your system will become slow as it reaches the limits of what it was originally designed to do. Don't blame the architecture or the technology (note that I am not talking about scalability here). Whatever your architecture, you must optimize. And if I have to optimize, I would much rather do it in my language of choice than in the database; I have more power, more comfort and more freedom.

Of course, whether in the database or codebase, performance optimizations are time consuming. And hard.

Saturday, July 26, 2008

Persistence of an In-Memory Application Server

My philosophy is that the database should be used primarily as dumb storage. Keeping as much as possible in the middle layer of an application affords us programmers the most power; most of the changes we will need to make will be in the language we are most comfortable in.

For an In-Memory application like ours (one that loads all its data at startup and does not read from the database thereafter), the database's role changes completely: from holder of application state to glorified file. On each modification of the system, of course, we immediately write the change to the database so that if the system crashes, it will boot back into the same state.

Recently, the largest single cause of wait-times in our system was contention for these database writes. But in an In-Memory system, the authoritative state of the application is not held in the database, so the correct functioning of the system does not depend on the database. In fact, if we ignored system crashes, we could persist the entire domain model right before shutting down, and everything would work fine.

After becoming comfortable with the idea that writing changes to the database serves only to let the system reboot properly, the most profound revelation occurred to me: what does the user care whether the system has persisted the change they just made? Either way, when they make their next request (assuming no system crash), their changes are still there, because the state of the application is in memory, not in the database. So why not skip the save (from the user's perspective) and schedule it to happen after the response is returned to the user?

Doing saves asynchronously improved the response time of all system modifications by a factor of 10 on average. Not bad for a day or two of programming. Oh, and 5 years of building software.

The implementation was relatively simple. We created a class on which domain-object saves would be queued. In fact, we made it more general and used a decorator that queues any function or method call on this queue, which then makes the actual call at some point in the near future.
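A minimal sketch of that pattern in today's Python (the names are mine, and this is the shape of the idea rather than our actual code):

    import functools
    import queue
    import threading

    class SaveQueue:
        """Runs queued callables, one at a time, on a background thread."""

        def __init__(self):
            self._queue = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def _run(self):
            while True:
                func, args, kwargs = self._queue.get()
                try:
                    func(*args, **kwargs)
                finally:
                    self._queue.task_done()

        def enqueue(self, func, *args, **kwargs):
            self._queue.put((func, args, kwargs))

        def deferred(self, func):
            """Decorator: calling the function merely enqueues the real call."""
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                self.enqueue(func, *args, **kwargs)
            return wrapper

    save_queue = SaveQueue()

    @save_queue.deferred
    def save(domain_object):
        # The actual database write happens here, on the queue thread,
        # after the response has already gone back to the user.
        ...

Any code that calls save() now returns immediately; the write follows shortly afterward on the queue thread.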

The most obvious advantage is the speed boost, but as a bonus we also save writes when there are multiple sequential modifications of the same object, which, as you might expect, are very common. We also reduce the load on both the database server and our own system, because only a single thread (the queue thread) is modifying the database at any given time.
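A hedged sketch of that write-coalescing: since the eventual save reads the object's latest in-memory state, queuing the same object twice is redundant.

    import threading

    class CoalescingSaves:
        """Pending saves keyed by object identity. Enqueuing an object
        that is already pending is a no-op, because the eventual save
        reads the object's latest in-memory state anyway."""

        def __init__(self):
            self._lock = threading.Lock()
            self._pending = {}  # id(obj) -> obj, insertion-ordered

        def enqueue(self, obj):
            with self._lock:
                self._pending[id(obj)] = obj  # overwrites an earlier entry

        def drain(self, save):
            # Called from the queue thread: persist everything queued so far.
            with self._lock:
                batch, self._pending = self._pending, {}
            for obj in batch.values():
                save(obj)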

There are some disadvantages. Probably the biggest one you are thinking of is a failure during a save. Some other things that come to mind are data-integrity errors and modifications arriving faster than we can persist them. One final potential problem I'll discuss is a system crash, which could result in some lost writes.

Foreign-key, data-type and other data-related errors during the save are not a problem. First of all, they should never happen, because you have coded these constraints into your system (or they hold naturally -- how can you have a broken foreign key in your object model? The worst you could have is an unlinked object). Second, if they do happen, you can notify the programmers (because if the system does not prevent this kind of problem, the user certainly won't be able to solve it from some cryptic error message) or simply re-queue the save (it's probably a temporary problem anyway).
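Re-queuing on failure is only a few lines. This sketch reuses the hypothetical save_queue from above; persist and the retry cap are likewise illustrative, not our production code:

    import logging

    MAX_ATTEMPTS = 3  # illustrative cap before escalating to the programmers

    def persist(obj):
        ...  # stand-in for the real database write

    def reliable_save(obj, attempt=1):
        try:
            persist(obj)
        except Exception:
            if attempt < MAX_ATTEMPTS:
                # Probably a temporary problem: put the save back on the queue.
                save_queue.enqueue(reliable_save, obj, attempt + 1)
            else:
                # A cryptic database error is useless to the user; tell us instead.
                logging.exception("save failed %d times", attempt)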

The problem of saves queuing up faster than they can be cleared is unlikely. I have never found that spawning multiple threads to handle many writes to a database speeds things up much (which is effectively what happens when each request synchronously writes to the database). It is simple, however, to spawn multiple threads to drain a backlogged queue. So far, we have not seen this in our system; the average size of the backlog in the queue is 0.

Finally, what happens if the system crashes? I should point out that our system does not crash more than once a month, and that only because we have been going through a rough patch. The real question is how many writes are normally waiting to be completed, and the answer is probably close to zero. No matter what, some writes will be lost in a crash (the ones in progress or about to start). If you tune your queue properly, you can minimize the risk and impact of a crash. The top priority, however, should be finding out why the system is crashing and fixing that problem at its source.

The conclusion here is that users don't care about persistence, so don't make them wait for it unless you have to. If you have to, examine your assumptions, and if you are still using the database as your authority on application state, consider an in-memory system. It might solve a lot of problems you are having, database contention probably chief among them.

Saturday, February 9, 2008

Managerial Technological Vision

I find that in a lot of companies, upper management relies too heavily on the development team for the implementation of its technological vision. What is worse, the vision is normally only truly supported by the CEO, and is more akin to a feature set than a vision. What invariably results is a vision that is too specific and lacking in feedback from end users. It is also usually unsupported by the rest of middle and upper management.

This leads to a conflict of interest between the development team, who want to implement the CEO's directive, and the affected department, who at best see the required changes as unnecessary and at worst see them as an unacceptable impediment to reaching their own goals, also set by the CEO.

Thus the feedback they give is hostile and unhelpful. The features eventually built are not helpful, and their required use is seen as a draconian measure demanding more work for no benefit (like time-sheets). If the users use the software at all, they won't use it properly or to its full ability. They will also complain and resent the development team for forcing them to change how they work. Perhaps the worst effect, however, is that with everyone dragging their feet, project pace slows to a crawl, and the development team gets blamed when projects never end.

I believe that the way this should work is that upper management as a whole must come up with a shared strategic vision for where the company's development effort should focus (efficiency, customer visibility, marketing, reporting?). This high-level, abstract vision should then be passed down the ranks to middle management to implement. Then specific departments and groups would come to the development team asking for advice on how to implement the vision, instead of the development team going to a department insisting it has a mandate for change.

So how do you get there? I am not all that sure, but I do have the following ideas. Critically, management must be educated, or perhaps the term should be "given feedback", on the effects of the current process. When that fails, there are ways to mitigate the damage to users.

Hold "secondary stakeholder meetings", which I define as a meeting of middle management and end users, without upper management present. The majority of each iteration goes toward implementing the vision, but this forum is used to schedule a portion of each iteration for immediate concerns. The secondary stakeholders can use this portion for anything they want, and upper management should not be able to override the requests. This allows time for building features that would never be implemented in the vision.

Finally, every good development team should reserve a portion of each iteration for internal use. Normally this effort is used for larger refactorings, experimental development, development tools, and other features that make the development team function smoothly. There is nothing wrong, if you fudge the agile rules a little, with taking some of this time to build a feature that you can see the company is screaming for, but that nobody is willing to request. The most popular features I have built have been based on ideas generated within the development team that the users subsequently realised they needed, but would never have scheduled on their own.

It is very difficult to overcome an incoherent or divided company with respect to technological vision. With hard work, however, it is possible to mitigate the negative impacts, and with constant feedback (and with end-users on your side), management may change their ways, leading to a company united behind the technological strategy, and critically, united behind the development team.

Sunday, January 27, 2008

"Responsibility" of a Software System

We have recently run into a problem at work where we inherited a piece of software with a user base who firmly believed that as soon as the software was able to help with a certain aspect of the business, the software (and thus the development team) became responsible for that aspect of the business.

For example, if the system could calculate commissions, then it was the system's responsibility to generate and pay commissions every month. This was exacerbated by the fact that the users did not really understand what the system did and did not do. So what ended up happening was that when the commission results were not what the accounting group thought they should be, they called up the dev team and told them "there is something wrong with the commission results".

This may seem normal, but follow the reasoning to its logical conclusion. In a system that automates most aspects of a department's job, what exactly is the department's responsibility? Clicking buttons in the system, at most. Everything else, from understanding how commissions are calculated to ensuring the incoming data is correct, falls onto the dev team's plate. I have actually seen support requests (to the dev team) from users asking questions like "why has this order been left in accounting for 10 days?" When the system is responsible for everything, the dev team is supposed to know why accounting has not processed the order yet.

So then what is the responsibility of a software system? The answer to this question is of paramount importance to every dev team, because it sets the baseline for what the dev team and related groups (networking, technical support, etc.) are responsible for. I firmly believe, after working in a number of different environments, that the only responsibility of a software system is to behave the way its stakeholders have requested it to behave.

This means that no responsibility is shifted when new software is built. Everyone is still responsible for what they were yesterday, but there is new responsibility added because the system needs to do what the users tell it to do. This new responsibility is also practically a job description for software developers. This makes sense; that responsibility is where we come in. Our job is to build software that helps people do their jobs. But that is where it ends; it does not extend to actually doing their jobs once the software is built. In other words, we build tools to help people do their jobs.

Is Microsoft responsible for ensuring that your presentation is presentable simply because you are using PowerPoint to help you? Of course not. In the example above, the accounting department is still responsible for calculating and paying commission, but now they have a tool to help them. If the tool is not working properly, then the dev team will fix it. But accounting is still responsible for finding that error and fixing it for the current commission period by any means, regardless of whether the dev team can make a fix in time.

The major corollary of this is that users need to understand how their system works. Not at a technical level, but certainly at a logical one. Again, this makes sense: users of a word processor must understand how it works if they hope to use it properly. If accounting does not understand how to calculate commissions, the developers certainly are not going to. This understanding should not be as rare as it unfortunately is; after all, the tool was built for them, presumably by them or their representatives, so somebody must have understood how to calculate commission at some point.

The agile philosophy spends a lot of effort laying out clear responsibility for various activities (effort estimation, scheduling decisions, etc.), with good reason. Without clear responsibility, groups start conflicting with each other. In order to have a successful project, all members of the team must be working harmoniously toward the same goal. Clear and fair responsibility lays the foundation for this.