Version control strategies for relational database content

02 Mar 2012

I recently had to design a relational database in which content had to be versioned. Changes to more of less every record in every table in the database must be tracked. For just about any record, a user should be able to retrieve a report detailing the list of changes made to it, when they were made, and by whom.

I considered a few strategies for how to implement these requirements:

  1. Use one table and clone rows before change. Every time a record is modified, a copy of it is made beforehand. I use the Active attribute of a record to denote which records are active vs. which make up the history. This approach, however, requires that you make the primary key a combination of Id and VersionId columns. Likewise, for relationships between tables, you must make the foreign key be a combination of ForeignId and ForeignVersionId. Referential integrity now becomes relatively straightforward to maintain, but the table may grow large and may likely contain a lot redundant information.
  2. Create one history table for each source table. A variation of #1 where each source table would have en corresponding history table with a close to identical schema, but with its own primary key. While this ensures that the source table only contains current rows, it effectively doubles the number of tables. Not to mention that every structural change to the source table would have to be propagated to the history table.
  3. Create a generic and conceptually separate history database. This approach consists of a history table serving any number of source tables. Conceptually, it’s a audit trail database of its own that the rest of the database need not know about. Each row in this history table would hold before and after values for changed columns only. You may even take it so far as to start with an empty source and a some history and replay every change made to the source up to some point. It would require special care to be taken for non-additive schema changes.

I decided to go with the third option. To make it concrete, say I have a Contacts table. Besides the primary key, I store the name of the contact and possibly a comment:

CREATE TABLE [dbo].[Contacts](
  [Id] [int] IDENTITY(1,1) NOT NULL,
  [Name] [nvarchar](255) NOT NULL,
  [Comment] [nvarchar](max) NULL,
  [CreatedBy] [nvarchar](30) NOT NULL,
  [CreatedAt] [datetime] NOT NULL,
  [ModifiedBy] [nvarchar](30) NOT NULL,
  [ModifiedAt] [datetime] NOT NULL,
  [UniqueRowGuid] [uniqueidentifier] NOT NULL,
  [Active] [bit] NOT NULL)

The remaining fields are system fields that I add to almost every table in the database. You could argue that the created and modified columns should really be part of the history table, but I prefer for the source table to be self-contained. UniqueRowGuid provides a unique identity of any row in any table and is useful for creating weak relationships between tables in the same database or across databases.

If it weren’t for the fact that a primary key of type Guid makes for bad clustered index performance, I could’ve made the primary key in every table be the UniqueRowGuid — MS SQL Server even has a special type of sequential Guid for this purpose. But I stuck with the traditional integer-based key. The Active column is used to denote if a row has been deleted. I want to keep every row for some period of time, even though to the user it appears to have been deleted.

Jumping straight to the history schema, here’s what I decided to store for each change.

CREATE TABLE [dbo].[History](
  [Id] [int] IDENTITY(1,1) NOT NULL,
  [ChangeCorrelationGuid] [uniqueidentifier] NOT NULL,
  [UniqueRowGuid] [uniqueidentifier] NOT NULL,
  [ColumnName] [nvarchar](255) NOT NULL,
  [BeforeValue] [nvarchar](max) NOT NULL,
  [AfterValue] [nvarchar](max) NOT NULL,
  [CreatedAt] [datetime] NOT NULL,
  [Comment] [nvarchar](max) NULL)

The ChangeCorrelationGuid allows for a grouping of related history entries. Suppose in one transaction you modified both the Name and Comment of a contact. Using this ChangeCorrelationGuid, you effectively create different versions of the row. Next comes the UniqueRowGuid which contains the row-value from the source table, resulting in a loose coupling between History and Contacts. Think of it as identifying the source row whereas ColumnName identifying the source column.

Strictly speaking, I don’t need to store both the BeforeValue and an AfterValue, since the AfterValue of one change becomes the Before version of the next. But this piece of controlled redundency makes querying for the summary easier and less costly.