Justin Francis Self-Portrait

Thursday, May 31, 2007

Avoiding Storing Common One-To-Many Relationships

Occasionally, one comes across common classes of objects that you would like to be able to link to a number of different types of domain objects through a one-to-many relationship. For example, in a Customer Relationship Manager tool, you may want to have comments on many different types of entitities (customers, applications, disputes, etc). Same thing for tickets, attachments and other common structures.

Linking up these objects in the domain model is trivial: a list here, an ArrayList there. The real pain comes when you want to persist these relationships. Conventional wisdom holds you would create a link table for each of the entities you want to link to your common entity. The problem is that this is tedious and time consuming because not only do you have to create the tables, you also then need to modify your loading and saving procedures for the domain object. Because this is a one-many relationship, you will be getting multiple rows back per object, with all the muck that brings.

An alternative approach which we have begun to use at work would be to instead store the link to the domain object in the other table (the one side of the relationship). This way, there are no new tables to create. Of course, the immediate problem is that you cannot have a foreign key from one table to many tables. But imagine if instead of storing a database foreign key in the column, you instead store a "programming foreign key". In other words instead of a database link, store a programming link. In python, this would be the `repr` of an object. In this context, we could store a call to a finder function in this column, and evaluate it on the way out.

Storing a finder call in this way allows the common object to link to any number of different types of domain object. The domain objects just need to iterate over the collection of common objects to find those that link to themselves. When you add a new domain object, all you need to do is add a method that searches this extent. No database modification, no saving or loading code to manipulate.

Using some other tricks from python (though this is generally applicable), you could also use the `str` of the object to have a generic display of the common object's link to the domain object. If the protocol you are using (like python's `repr`) is generic enough, you could literally store anything in the column. If we take the comments example, the comment could have been left by a `User`, a string username or a list of `User`s.

There are downsides to this approach, of course. The biggest one is that ad-hoc queries on those common objects becomes more difficult. The good news is those common objects are not normally the subject of ad-hoc queries because they are not interesting domain objects. Additionally, if your system has the facility for ad-hoc queries in a programming language (no, not SQL), then this is a non-issue (depending on who is doing the querying). Secondly, you cannot enforce referential integrity in such a system at the database level. Again, though, I have not run into referential integrity issues on these kinds of common objects. Integrity is usually important for pieces of the domain that must fit together properly or system failure may occur. Finally, performance may also be impacted depending on how intense the finder function is, though these should be fast if you have designed your system well.

In the end, it is a trade-off between ease of addition of new domain objects and ease of querying and data integrity. For me, the database is just a storage system, so I will take advantage of the programming language I am working in, even if I lose some functionality at the data level.

Saturday, May 19, 2007

The Other Source Control

Writing software without source control these days is insanity. Having worked in professional environments both with and without source control, I know the pain of not having it, and the ease with which it can be introduced. Now, I never write anything that is not under source control, even if I am working alone. It is so free and the benefits so clear that I don't even question it anymore.

I think that almost everybody (except maybe Linus Torvalds) agrees that source control is critical to software development. For all its other shortcomings, the Joel Test has source control right at the top of the list.

What has strikes me, however, is how little concern is normally given to source control of the database. Especially if one is not using an In-Memory Model, your database is critical to the correct functioning of the code. Yet I rarely hear about shops that control change to the database in the same way they control changes to the code.

I'm not talking about changes to the data housed by the database (analogous to instance changes in the running code), but changes to the structure of the database. I can remember countless times where all of a sudden, our application would stop working. It turns out someone applied a database change directly on production. Other cases include asking "what changed in the database between version 1.1 and 1.2 of the system", "what do we need to do to the database when we release this version" and "what do I need to do to run a previous version of the software"? These questions are difficult to answer accurately without all change to the database being controlled just like all other changes to the code.

At work, we have spent a substantial amount of time working to bring the database in line with solid change control for almost a year. We are finally at the point where the code is able to detect the version of the database it is running on, and is able to either upgrade the version or roll it back to a previous version depending on what version of the code is currently running. The only changes that are applied are those that have been checked-in to the code repository as being required to run on the next release of the software.

We have found that controlling change in this way provides all the same benefits source control for our code provides. And in the end, doesn't this make sense? Doesn't it defy logic that part of our code should be controlled, while part of it (db schema) should not be? Now our database acts just like our code: changes are logged historically, changes are fully tested, changes are applied once per release on production, and the only changes made are those that are fully automated.

Like code source control, I will never look back.

Saturday, May 5, 2007

Love Your self In Python

When I was using Java, I always insisted on making the this reference explicit when referring to an attribute or method within the same class. I think it always makes the code clearer because there is never any doubt about where an identifier is coming from; either it is a temp or argument or there will be a this in front of it, in which case it is an attribute or method.

The habit of placing this in Java code even where it was not required certainly eased my transition to Python. In Python, self is the explicit first parameter to every method, and must be used whenever accessing methods or attributes of the object. I have often heard people complain about this, but it is one of the things I love about Python.

In addition to the clarity described above, the explicit self as the first parameter to all methods unifies functions and methods in a way that is not done in other languages. Once you start thinking about methods as functions that take an object as their first argument, functions and methods become very similar. This encourages a dual interface that is used often in Python. This dual interface is one where you can use either a function or an object's method to do the same thing (like Python's re module).

For example, I find myself often writing code like this:

def get_table_sql(name, db):
return 'CREATE TABLE %s (%s)' % (name, get_column_sql(name))

class Table(DbObject):
def get_sql(self):
return get_table_sql(self.name, self.db)


Now, nothing prevents you from doing this in other languages, but I find that the explicit self goes a long way towards encouraging programmers to think about functions and methods in similar ways. And at the end of the day, your design is based more upon your frame of mind than the technological limitations of the language you are using.