Justin Francis Self-Portrait

Sunday, August 3, 2008

Aren't Application Servers Slow?

Application servers have a reputation for being sluggish resource hogs. While it is true that there are many, many slow application servers out there, I think it is unfair to single them out: most of the time, these systems are being run with many more demands (data, users) than the average CGI script suite. Any system would run slowly in this kind of environment.

In fact, I have never seen a good-sized company use a pile of scripts centered around a database that wasn't slow. Which leads us to the launching point for this discussion. Any system running in a real environment with a real number of users and a real volume of data will run slowly, whether database driven or not.

Understanding the inevitability of growth, the real question is what are you going to do when your system becomes slow? As I see it, there are two scenarios. Either you are a database-driven application in which case you optimize the database, or you are an application server, and you optimize your code.

To optimize the database, you look at queries, indexes, caching. Eventually you move on to table design and normalization, splitting of schemas and so on. I am not a database developer, but I have designed and re-designed (for performance reasons) a number of very large databases; and it seems to me that there is a limit to what you can do painlessly in this domain. With databases, I feel the limit for new innovative solutions is reached fairly quickly.

When optimizing code in an application server, however, the possibilities are almost endless. While SQL and other database features are relatively limited, modern general-purpose languages allow for an almost unlimited number of ways to innovate and improve performance.

In the application server I am currently building, the size of the system has tripled since we started building the software. However, with constant optimizations, it is running as fast today as it was in the beginning. Certainly there were some database optimizations, but the vast majority were simple, effective and insightful changes based on the results of profiling the python code. It was more fun and more interesting than debugging database locks.
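To make "profiling the python code" concrete, here is a minimal sketch using the standard library's cProfile and pstats modules. The functions slow_total and fast_total are hypothetical stand-ins for a hot spot and its fix; they are not taken from the system described above.

```python
import cProfile
import io
import pstats


def slow_total(values):
    # Hypothetical hot spot: re-sums the prefix on every step (quadratic).
    return [sum(values[: i + 1]) for i in range(len(values))]


def fast_total(values):
    # Same result using a running total (linear).
    totals, running = [], 0
    for value in values:
        running += value
        totals.append(running)
    return totals


def profile(func, *args):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


data = list(range(2000))
slow_result, slow_report = profile(slow_total, data)
fast_result, fast_report = profile(fast_total, data)
print(slow_report)
print(fast_report)
```

Comparing the two reports shows where the time goes before and after the change, which is the kind of evidence that makes a code-level optimization feel simple in hindsight.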

The conclusion here is that either way, your system will be slow as it reaches the limits of what it was originally designed to do. Don't blame it on the architecture or technology (note I am not talking about scalability here). Whatever your architecture, you must optimize. If I have to optimize, I would much rather optimize in my language of choice than in the database; I have more power, more comfort and more freedom.

Of course, whether in the database or codebase, performance optimizations are time consuming. And hard.

Saturday, July 26, 2008

Persistence of an In-Memory Application Server

My philosophy is that the database should be used primarily as dumb storage. Keeping as much as possible in the middle layer of an application affords us programmers the most power; most of the changes we will need to make will be in the language we are most comfortable in.

For an In-Memory application like ours (one that loads all its data at startup and does not read from the database thereafter), the role of the database changes completely: it goes from being the holder of application state to being a glorified file. On each modification of the system, of course, we immediately write the change to the database so if the system crashes, it will boot back in the same state.

Recently, the largest single cause of wait-times in our system was contention for these database writes. But with an In-Memory system, the state of the application is not maintained in the database. Therefore the correct functioning of the system is not dependent on the database. In fact, if we ignored system crashes, we could just persist the entire domain model of the system right before we shut it down and everything would work fine.

After becoming comfortable with the idea that writing changes to the database is only so the system can reboot properly, the most profound revelation occurred to me. What does the user care if the system persisted the change they just made? Regardless, when they make their next request (assuming no system crash) the changes they made are still there because the state of the application is in memory, not in the database. So why not skip the save (from the user's perspective) and schedule it to be done after the response is returned to the user?

Doing saves asynchronously reduced the response time of all system modifications by a factor of 10 on average. Not bad for a day or two of programming. Oh, and 5 years of building software.

The implementation was relatively simple. We created a class on which domain object saves would be queued. In fact, we made it more general and used a decorator that would queue any function or method on this queue, which would then call the function at some point in the near future.
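A minimal sketch of that idea might look like the following. The names SaveQueue, deferred, Customer and the dictionary standing in for the database are all hypothetical; this illustrates the general pattern with the standard queue and threading modules, not our actual code.

```python
import functools
import queue
import threading


class SaveQueue:
    """Runs queued callables on one background thread, in FIFO order."""

    def __init__(self):
        self._tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            func, args, kwargs = self._tasks.get()
            try:
                func(*args, **kwargs)
            except Exception as exc:
                # A real system would log this and notify the programmers.
                print(f"queued call failed: {exc!r}")
            finally:
                self._tasks.task_done()

    def put(self, func, *args, **kwargs):
        self._tasks.put((func, args, kwargs))

    def wait(self):
        """Block until every queued call has run (useful at shutdown)."""
        self._tasks.join()


save_queue = SaveQueue()


def deferred(func):
    """Decorator: calling func enqueues it instead of running it now."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        save_queue.put(func, *args, **kwargs)
    return wrapper


database = {}  # hypothetical stand-in for the real database


class Customer:
    def __init__(self, key, name):
        self.key, self.name = key, name

    @deferred
    def save(self):
        database[self.key] = self.name


customer = Customer(1, "Ada")
customer.save()      # returns immediately; the write happens later
save_queue.wait()    # flush pending writes, e.g. before shutdown
print(database)
```

Because the decorated method reads the object's attributes only when the worker finally runs it, the write reflects the state of the object at save time, not at the moment it was queued.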

The advantages are obviously the speed boost, but as a bonus we save writes when there are multiple sequential modifications of the same object, which as you might expect are very common. We also reduce the load both on the database server and our system because only a single thread (the queue thread) is trying to modify the database at any given time.
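One way the saved writes could come about is by keeping at most one pending save per object. The sketch below leaves out the worker thread so the effect is easy to follow; CoalescingSaveQueue, its schedule and flush methods, and the order keys are hypothetical names chosen for illustration.

```python
class CoalescingSaveQueue:
    """Keeps at most one pending save per object key."""

    def __init__(self, write):
        self._write = write      # callable that performs the real write
        self._pending = {}       # key -> object awaiting a save

    def schedule(self, key, obj):
        # Scheduling the same key again replaces the pending entry
        # instead of adding a second write.
        self._pending[key] = obj

    def flush(self):
        """Write each pending object once; return the number of writes."""
        pending, self._pending = self._pending, {}
        for key, obj in pending.items():
            self._write(key, obj)
        return len(pending)


writes = []
saves = CoalescingSaveQueue(lambda key, obj: writes.append((key, dict(obj))))

order = {"status": "new"}
saves.schedule("order-1", order)
order["status"] = "approved"
saves.schedule("order-1", order)
order["status"] = "shipped"
saves.schedule("order-1", order)
saves.schedule("order-2", {"status": "new"})

write_count = saves.flush()
print(write_count, writes)
```

Three modifications of the first order turn into a single write of its latest state, so four scheduled saves cost two database writes.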

There are some disadvantages. Probably the biggest one you are thinking of is a failure during a save. Some other things that come to mind are data integrity errors and too many modifications such that we cannot keep up. One final potential problem I'll discuss is a system crash, which could result in some lost writes.

Foreign keys, data-type and other data-related errors that could happen during the save are not a problem. First of all, they should never happen because you have coded these constraints into your system (or they exist naturally -- how can you have a broken FK in your object model? The worst you could have is an unlinked object). Second, if they do happen, you can notify the programmers (because if the system does not prevent this kind of problem, the user certainly won't be able to solve it from some cryptic error message) or simply re-queue the save (it's probably a temporary problem anyway).

It is unlikely that saves will queue up faster than they can be cleared. I have never found that spawning multiple threads to handle many writes to a database speeds things up much (which is what happens when each request synchronously writes to the database). It is simple, however, to create multiple threads to save a backlogged queue. So far, we have not seen this in our system. The average size of the backlog in the queue is 0.

Finally, what happens if the system crashes? I should point out that our system does not crash more than once a month, and that is just because we have been going through a rough patch. The real question is how many writes are normally waiting to be completed? It will probably be close to zero. No matter what, writes are getting lost in a crash (the ones in progress or starting). If you tune your queue properly, you can minimize the risk and impact of a crash. The top priority, however, should be finding out why the system is crashing and fixing the source of this problem.

The conclusion here is that users don't care about persistence, so don't make them wait for it unless you have to. If you have to, examine your assumptions, and if you are still using the database as your authority on application state, consider an in-memory system. It might solve a lot of problems you are having, database contention probably chief among them.

Saturday, February 9, 2008

Managerial Technological Vision

I find that in a lot of companies, upper management relies too heavily on development teams for the implementation of their technological vision. What is worse, the vision is normally only truly supported by the CEO, and is more akin to a feature set than a vision. What invariably results is a vision that is too specific and lacking in feedback from end users. It is also usually not supported by middle or upper management.

This leads to conflicting interests between the development team who want to implement the CEO's directive, and the department affected, who at best see the changes required as unnecessary and at worst see them as an unacceptable impediment to reaching their own goals set by the CEO.

Thus the feedback they give is hostile and unhelpful. The features eventually built are not helpful, and their required use is seen as a draconian measure requiring more work for no benefit (like time-sheets). If the users do use the software at all, they won't use it properly or to its maximum ability. They will also complain and resent the development team for forcing them to change how they work. Perhaps the worst effect, however, is that with everyone dragging their feet, project pace slows to a crawl, and the development team gets blamed when projects never end.

I believe that the way this should work is that upper management, as a whole, comes up with a shared strategic vision for where the company's development effort should focus (efficiency, customer visibility, marketing, reporting?). This high-level, abstract vision should then be passed down the ranks to middle management to implement. Then specific departments and groups would come to the development team requesting advice on how they can implement the vision, instead of the development team going to a department insisting they have a mandate for change.

So how do you get there? I am not all that sure. But I do have the following ideas. Critically, management must be educated, or perhaps the term should be "given feedback" on the effects of the current process. When that fails, there are ways to mitigate the damage to users.

Hold "secondary stakeholder meetings", which I define as a meeting of middle management and end users, without upper management present. The majority of each iteration goes toward implementing the vision, but this forum is used to schedule a portion of each iteration for immediate concerns. The secondary stakeholders can use this portion for anything they want, and upper management should not be able to override the requests. This allows time for building features that would never be implemented in the vision.

Finally, every good development team should reserve a portion of each iteration for internal use. Normally this effort is used for larger refactorings, experimental development, development tools, and other features that make the development team function smoothly. There is nothing wrong, if you fudge the agile rules a little, with taking some of this time to build a feature that you can see the company is screaming for, but that nobody is willing to request. The most popular features I have built have been based on ideas generated within the development team that the users subsequently realised they needed, but would never have scheduled on their own.

It is very difficult to overcome an incoherent or divided company with respect to technological vision. With hard work, however, it is possible to mitigate the negative impacts, and with constant feedback (and with end-users on your side), management may change their ways, leading to a company united behind the technological strategy, and critically, united behind the development team.

Sunday, January 27, 2008

"Responsibility" of a Software System

We have recently run into a problem at work where we inherited a piece of software with a user base who firmly believed that as soon as the software was able to help with a certain aspect of the business, the software (and thus the development team) became responsible for that aspect of the business.

For example, if the system could calculate commissions, then it was the system's responsibility to generate and pay commission every month. This was exacerbated by the problem that the users did not really understand what the system did and did not do. So what ended up happening is that when the commission results were not what the accounting group thought they should be, they called up the dev team and told them "there is something wrong with the commission results".

This may seem normal, but follow the reasoning to its logical conclusion. In a system that automates most aspects of a department's job, what exactly is the department's responsibility? Clicking buttons in the system at a maximum. Everything else from understanding how commissions are calculated to ensuring the incoming data is correct falls onto the dev team's plate. I have actually seen support requests (to the dev team) from users asking a question like "why has this order been left in accounting for 10 days"? When the system is responsible for everything, the dev team is supposed to know why accounting has not processed the order yet.

So then what is the responsibility of a software system? The answer to this question is of paramount importance to every dev team because it sets the baseline for what the dev team and related groups (networking, technical support, etc) are responsible for. I firmly believe, after working in a number of different environments, that the only responsibility of a software system is to behave the way the stakeholders have requested it to behave.

This means that no responsibility is shifted when new software is built. Everyone is still responsible for what they were yesterday, but there is new responsibility added because the system needs to do what the users tell it to do. This new responsibility is also practically a job description for software developers. This makes sense; that responsibility is where we come in. Our job is to build software that helps people do their jobs. But that is where it ends; it does not extend to actually doing their jobs once the software is built. In other words, we build tools to help people do their jobs.

Is Microsoft responsible for ensuring that your presentation is presentable simply because you are using Powerpoint to help you? Of course not. In the example above, the accounting department is still responsible for calculating and paying commission, but now they have a tool to help them. If the tool is not working properly, then the dev team will fix it. But accounting is still responsible for finding that error and fixing it for the current commission period by any means, regardless of whether the dev team can make a fix in time.

The major corollary of this is that users need to understand how their system works. Not at a technical level, but certainly at a logical one. Again, this makes sense. Users of a word processing tool must understand how it works if they hope to use it properly. If accounting does not understand how to calculate commissions, certainly the developers are not going to. This understanding should not be as unfortunately rare as it is; after all, the tool was built for them, presumably by them or their representatives. So somebody must have understood how to calculate commission at some point.

The agile philosophy spends a lot of effort laying out clear responsibility for various activities (effort estimation, scheduling decisions, etc) with good reason. Without clear responsibility, groups start conflicting with each other. In order to have a successful project, all members of the team must be working harmoniously toward the same goal. Clear and fair responsibility lays the foundation for this.

Saturday, December 22, 2007

Overlooked Developer Qualities

There is a lot more to a developer than ability to write code, or even to design software. I want to emphasize some non-technical qualities that do not normally get the recognition they deserve, but that I have noticed increase the value of a developer. They are, in no particular order:

  • Code Memory
  • Debugging Skills
  • Attention to Detail and Thoroughness

At my job, we have a very open, spontaneous environment, and developers will routinely raise their voice and ask a general question like "did anybody change this recently" or "what was that issue we had last week"? What amazes me is that not many developers remember how they designed something or how they solved a problem last week, let alone six months ago. It is a critical asset, therefore, to have a developer on the team with fantastic "code memory".

A developer with good code memory knows everything about how the system works, the current feature set, and current problems. In addition, they can remember how all those things have evolved over a period of months. This saves time when debugging recurring problems, answering questions from users and answering questions from developers. Every team should have this librarian-like keeper of knowledge, though ideally, this would be redundant across the entire team.

Another great quality to have as a developer is good debugging skills. To be able to quickly identify, isolate and fix problems is supremely valuable both during development of new features and during maintenance of a running system. There is nothing worse than having development slow to a crawl because you are plunging down a rabbit hole that may not be related to the problem at hand. On a running system, this skill is especially valuable as it means less downtime. Problem solving skills and code memory combine to vastly enhance this skill.

Finally, attention to detail and thoroughness make a big difference in the quality of a developer. This quality fundamentally allows a developer to be self-sufficient. Often, this skill is the difference between an intermediate developer and a senior developer. Without being able to think the entire feature through in all its detail and being sure that those details are covered by the solution, a developer cannot run projects, or even develop new features without support from someone who does have this quality.

These non-technical skills are based largely on fundamental learning abilities that ought to be taught to everyone starting in elementary school. These skills are not as easily quantifiable as languages known or coding ability, but deserve to be recognized for their indispensable value on a dev team.

Sunday, December 2, 2007

Velocity: It's not a race

I use an agile estimation process at work. During any given two-week iteration, we estimate all tasks that need to be done and then we measure how much we did at the end of the iteration. These numbers will then be used to predict how much we can do for the next two-week period. This sounds simple enough, but there is a common misconception that does not escape even the most experienced agile teams. I often hear comments like "is there an estimate for that?" when expanding the scope of a ticket, or "does that fall into this ticket"?

I guess we just cannot seem to help feeling we need to maximize how much estimated work we do during an iteration. To feel this way is to fundamentally misunderstand the purpose of estimating and measuring.

The key is that we are measuring how much estimated work is being done; not how much actual work is being done. So built into each estimate is a specific amount of uncertainty about what needs to be done to complete the ticket. Generally, tickets describe end-user functionality, and so the ticket includes whatever work is needed to get the feature done, including unknowns.

But there are other reasons besides unknowns for why the actual work differs from the estimated work. Perhaps much of the work being done is not estimated at all. We only estimate features and bug fixes. There is a whole slew of other work that needs to be done on a dev team: project management, live system maintenance, refactoring, etc.

The bottom line is that as long as you are consistent about how you estimate, it does not quantitatively matter how many units of estimated effort you complete. In fact, you could double all estimates starting in the new year. All that matters is that you then measure how much you get done in an iteration. Nobody should be concerned about their velocity except insofar as it allows for the accurate prediction of how much they can tell stakeholders they can accomplish for the next iteration.
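The arithmetic behind this can be sketched in a few lines. The numbers, ticket names and the three-iteration averaging window below are made up for illustration, and the sketch assumes the history has already been measured in the new, doubled units.

```python
def predict_capacity(completed_per_iteration, window=3):
    """Average the estimated units completed over the last few iterations."""
    recent = completed_per_iteration[-window:]
    return sum(recent) / len(recent)


def tickets_that_fit(backlog, capacity):
    """Take tickets in order until the predicted capacity is used up."""
    chosen, used = [], 0
    for name, estimate in backlog:
        if used + estimate > capacity:
            break
        chosen.append(name)
        used += estimate
    return chosen


history = [18, 22, 20]  # estimated units completed in past iterations
backlog = [("export", 8), ("login bug", 5), ("report", 6), ("search", 4)]

plan = tickets_that_fit(backlog, predict_capacity(history))

# Double every estimate, past and future: the numbers change, the plan does not.
doubled_history = [2 * units for units in history]
doubled_backlog = [(name, 2 * estimate) for name, estimate in backlog]
doubled_plan = tickets_that_fit(doubled_backlog, predict_capacity(doubled_history))

print(plan)
print(doubled_plan)
```

Both plans contain the same tickets, which is the point: velocity is only a conversion factor between estimates and iterations, so its absolute size says nothing about how hard the team is working.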

To change an estimate during an iteration because of actual work is to change your estimating strategy, which in the end will leave your estimates less consistent, and your planning less reliable.

Friday, November 16, 2007

Ordering work in an iteration

Recently, we have been in a velocity crunch at work. Our velocity plummeted 50% in a single iteration. The result of this (relevant to this post) is that we missed our target for the iteration by a big margin. Luckily, we were able to mitigate the impact and the stakeholders did not notice a huge discrepancy. This experience did, however, get me thinking about the best way to order work in an iteration to minimize the risk of missing deadlines, and to minimize the impact on stakeholders when you do.

Ordering tactics to mitigate the risk of not meeting an iteration deadline:

  • Begin with the largest tickets

  • Begin with user-visible tickets

  • Begin with highest priority tickets

  • Begin with tickets whose component is least familiar

One should begin with the largest tickets because they are the ones that generally have the most unknowns in them. The estimate is the roughest, so the time they will take to complete is most uncertain. They are also more difficult, and so carry more risk.

By starting with user-visible tickets, the likelihood of having to push a stakeholder's ticket at the end of the iteration decreases. It is much easier politically to push a ticket that the stakeholder will not see for an iteration or two (or ever if it is some kind of internal ticket).

Starting with higher priority tickets reduces the risk that a critical ticket will need to be pushed at the end of the iteration, because the critical tickets will already have been completed. Stakeholders are much more understanding when you push a ticket that can wait an iteration than when you push a critical bug fix.

Finally, if a developer has little experience with a certain component or domain (or the team if nobody has experience), they should not attempt to complete a ticket in that domain at the end of the iteration. It will take them longer than a developer with experience to do the same work. Having the developer make the attempt at the beginning of the iteration ensures extra time if needed, and extra help which is usually unavailable in the final two day sprint.
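The four tactics can be combined into a single sort key. The sketch below is only one possible combination: the Ticket fields, the scales used for priority and familiarity, and the precedence among the tactics (priority first, then visibility, size and familiarity) are all assumptions for illustration, since the tactics above are not ranked against each other.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    name: str
    estimate: int        # estimated units; larger usually means more unknowns
    user_visible: bool
    priority: int        # 1 = most urgent (assumed convention)
    familiarity: int     # 0 = nobody knows the component, 5 = well known


def start_order(tickets):
    """Sort so the tickets that would hurt most to push come first."""
    return sorted(
        tickets,
        key=lambda t: (
            t.priority,          # highest priority first
            not t.user_visible,  # user-visible before internal
            -t.estimate,         # largest first
            t.familiarity,       # least familiar first
        ),
    )


tickets = [
    Ticket("refactor cache", 8, False, 3, 4),
    Ticket("critical billing bug", 3, True, 1, 5),
    Ticket("new report", 13, True, 2, 1),
    Ticket("tooltip text", 1, True, 2, 5),
]

names = [ticket.name for ticket in start_order(tickets)]
print(names)
```

With this ordering, whatever is left over at the end of the iteration tends to be small, internal, low priority and well understood, which is exactly the work that is easiest to push.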

Following these guidelines has helped us to all but eliminate the need to push important features from an iteration due to lack of time, a welcome political victory for any agile development team.