Justin Francis Self-Portrait

Sunday, April 29, 2007

Open Sourcing Software Incidental to a Business

Maybe my search technique is slipping, or maybe search engines are losing the battle against clutter, but in researching this post I was hard-pressed to find any discussion that directly weighs in on how to open source the incidental tools built on company time while working on the software the business actually needs. Here are my own thoughts on the process.

A while ago we pitched our non-technical boss on releasing some tools we had built in-house in the process of building the software that supports the business. We are not a software development house; we build software to support the business in what it really sells: credit card processing. That software would never be a candidate for open source. It is customized for the company, highly specialized to our partners, and one of the value-added services that enable the business to sell its product. Therefore, even though we would never sell the software (we are not in that business), the business does want to deny its use to competitors, a sentiment I support.

There are, however, components of our software to which the preceding statement does not apply. The WSGI server we built from scratch (because we did not like the other Python web frameworks out there) and the mocking and other testing tools we built because nothing comparable existed for Python are good examples. Our argument for releasing these tools was a fairly standard one: higher quality from more users testing the software, and reduced maintenance effort from patches and new functionality contributed by external developers (never mind giving back to the community). Unsurprisingly, the response was a decidedly chilly "let's continue discussing it".

I think part of the reason for the less-than-enthusiastic response is that it is difficult to explain to a non-technical manager the difference between general, re-usable components of a system and the core system itself. In such a scenario, all the manager sees is risk; the benefit is much harder for them to see. I don't think this can ever be easily overcome. This is why I believe the decision to open source software not directly related to the core business domain must be the decision of the development team. Nobody else is qualified to weigh the pros and cons.

Which brings me to the main focus of this post. The primary and direct beneficiaries of open source software incidental to a business are software developers. This is especially true in this case, where we are talking about programming tools and frameworks, but holds true even for things like web servers and accounting software (because we do not have to re-invent the wheel). If we do not advocate and push to have the software open sourced, nobody else will because nobody else is able to see the cost of having to re-implement these things at each company.

Fundamentally, we all have an extremely selfish motive for open sourcing these tools. Most programmers are lazy, and we don't want to do the same thing over and over again. The tools I build at work are useful to me in my personal projects, and I want to use them. But if we can't get authorization to release the software, what can we do? I have a second-best solution.

Even if we all crassly sell away our copyrights as software developers eight hours a day for a salary, the company does not own the idea of what we have done (at least not yet). We can re-build the tools on our own time so that the decision to open source them is ours to make. This is what my colleague Iain Lowe dubs a "clean room implementation". There may even be benefits to re-implementation: the idea has been proven, the tool is clearly useful, and lessons have been learned. Note that you cannot do this in all cases; it depends on how close the component is to the business's domain and on your employment contract with the company.

So in a worst-case scenario, we will do the same thing twice, but never thrice. Personally, I prefer this technique because it yields a higher-quality piece of software. But more importantly, in our open source gift economy, I want to be recognized for the work I have done beyond the shallow salary I earn. Isn't that why we all have project lists prominently displayed on our home pages? Here's to fattening them up.

Monday, April 23, 2007

Pile of Scripts vs Application Server

When I first started doing some work with Java servlets, what struck me immediately was that the entire web application has a shared state that is wholly contained within itself. This differs from, say, a pile of PHP scripts, which do not share any state except what they store in the database (ignoring sessions). More recently, I moved into Python, where both approaches were possibilities, but we chose to have a single Python process handle all incoming web requests, yielding an application server instead of a pile of scripts.
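
To make the distinction concrete, here is a minimal sketch of the application-server model as a plain WSGI callable (the hit counter and all names are mine, purely for illustration): because one long-lived process serves every request, a module-level variable survives between requests.

    # One process serves every request, so this module-level variable
    # is shared application state that outlives any single request.
    from wsgiref.simple_server import make_server

    hit_count = 0

    def app(environ, start_response):
        global hit_count
        hit_count += 1  # no database round-trip needed to remember this
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [('Request number %d\n' % hit_count).encode('utf-8')]

    if __name__ == '__main__':
        make_server('', 8000, app).serve_forever()

A pile of scripts has no equivalent of hit_count: each request starts from a blank slate and must reconstruct everything from the database.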

This was probably the most fundamental complaint I had with many scripting languages, but to my surprise, I could not find a single online discussion of this issue. So to break the ice, I believe there are four reasons why an "In-Memory" model makes things easier for developers: lower impedance mismatch, native messaging technique, DB as secondary system, and theoretical performance.

Assuming you are using a Domain Model, it is much more natural to assume "Identity Integrity" in your application; by which I mean that if we have an object representing a specific user, it is much easier to understand the system as dealing with the same single object instance rather than with multiple copies that get synchronized with each other through the database on each request.
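
A tiny identity map illustrates the idea (the classes and names here are hypothetical, not from any particular framework): every lookup of the same id returns the very same object.

    # A hypothetical identity map: one live object per user id, so every
    # part of the application sees the same instance, not a copy.
    class User:
        def __init__(self, user_id, name):
            self.id = user_id
            self.name = name

    class UserRepository:
        def __init__(self):
            self._users = {}  # user id -> the single User instance

        def add(self, user):
            self._users[user.id] = user

        def get(self, user_id):
            return self._users[user_id]

    repo = UserRepository()
    repo.add(User(1, 'alice'))
    assert repo.get(1) is repo.get(1)  # same object, not two synchronized copies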

If you don't have any shared memory native to your application that persists between requests, then fundamentally, you are sending messages through the next best thing: the database. It seems much more efficient and powerful to send messages using the native framework, without requiring the message to pass through the database with all the restrictions that may imply. While rather abstract, this difference may have more of an impact than you might think, as I describe next.
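
As a sketch of what "native" messaging might look like (the MessageBus class and the topic name are mine, invented for illustration), a message becomes an ordinary function call to whoever subscribed, with no serialization and no polling of a table:

    from collections import defaultdict

    class MessageBus:
        """In-process publish/subscribe: handlers are plain callables."""
        def __init__(self):
            self._subscribers = defaultdict(list)

        def subscribe(self, topic, handler):
            self._subscribers[topic].append(handler)

        def publish(self, topic, payload):
            for handler in self._subscribers[topic]:
                handler(payload)  # delivered immediately, in memory

    bus = MessageBus()
    bus.subscribe('user.created', lambda name: print('welcome,', name))
    bus.publish('user.created', 'alice')

The pile-of-scripts equivalent is writing a row to a table and hoping some later request notices it.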

In an application server with a single state kept resident in the running instance itself, the database takes on very little significance. It literally becomes a secondary system that you use only to ensure that the next time the system comes up, it will be in roughly the state it was in when it went down. In other words, it is used for persistence only. This is important because it alters your perspective, which may alter the way you design your system.
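
Here is a sketch of that persistence-only role (sqlite3 is used only because it ships with Python; the schema and names are invented): state is read once at startup, lives in memory, and is written back whenever it changes.

    import sqlite3

    class Counter:
        """In-memory state; the database only records it for the next startup."""
        def __init__(self, path='state.db'):
            self._db = sqlite3.connect(path)
            self._db.execute('CREATE TABLE IF NOT EXISTS counter (value INTEGER)')
            row = self._db.execute('SELECT value FROM counter').fetchone()
            self.value = row[0] if row else 0  # load once, then live in memory

        def increment(self):
            self.value += 1                    # the authoritative state changes here
            self._db.execute('DELETE FROM counter')
            self._db.execute('INSERT INTO counter VALUES (?)', (self.value,))
            self._db.commit()                  # persistence is just a side effect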

Last, and probably least (as performance is by no means the only factor, or even the most important one), is the fact that if you are hitting the database to retrieve state on each request, you are doing far more work (and so is your database) than you need to. Who knows how much faster your system would run if, instead of reading from a DB each time a page loads, it just looked up a username in an array in memory?
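
The contrast, in sketch form (illustrative only: the functions and the users table are hypothetical, and real numbers will vary), is a query per request versus a single dictionary access:

    # Pile of scripts: every request pays a query, with connection handling,
    # SQL parsing, and result marshalling on both ends.
    def get_username_scripted(db, user_id):
        row = db.execute('SELECT name FROM users WHERE id = ?', (user_id,)).fetchone()
        return row[0]

    # Application server: the mapping was loaded once and stays resident,
    # so a lookup is a plain in-memory operation.
    users = {}  # user_id -> username, populated at startup

    def get_username_resident(user_id):
        return users[user_id]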

I guess the fundamental difference between the two is that with a pile of scripts, the DB is used both to maintain application state and to persist it, whereas with a single-process, in-memory application, the DB is only used to persist changes to the application state, which is just "there". I think you would be surprised at how liberating that concept is.