Referential Integrity - Data Modeling Mistake 1 of 10

| | Comments (5) | TrackBacks (0)

In my mind data models are like the foundations of a house. Whether you use ORM or a more traditional modeling tool, they form the base of the entire rest of your project. Consequently, every decision you make (or don't make) regarding your data model during the design phase(s) of your project will significantly affect the duration of your project and the maintainability and performance of your application.

You could de-emphasize up-front planning, but every correction you make to the data model once code has been written on top of it will introduce significant delays to the project as developers refactor data access, business logic, and user interface tiers. That's why mistakes made during design are expensive, and it would behoove any architect (or project manager) to be well aware of the repercussions of data model decisions and minimize mistakes before construction begins.

After years of working with or maintaining applications based on poorly designed data models, and after years of modeling my own databases from scratch I've seen and made a lot of mistakes. So, I've compiled ten of the most common ones and the arguments for and against them.

I'll be speaking on this topic in the upcoming IASA conference in October, and so I wanted to vet these ideas with the community. I know there are strong feelings on these topics, so please help me out by commenting if you feel I've missed something or am off base.

I'll start with Mistake #1: Not Using Referential Integrity in this post. I'll give four common reasons for avoiding referential integrity and then rebuff them. I'll then cover the more controversial Mistake #2 Not Using Surrogate Keys in my next post.

Mistake #1 - Not using referential integrity

I've heard a lot of excuses for not using referential integrity, but I've never been swayed by one of them. If you have a record with a foreign key field you should be 100% certain that it will always refer to the primary key of an existing record in one and only one foreign table. The last thing you want to do is write large amounts of conditional logic because you aren't 100% certain that you aren't dealing with orphaned data. Nonetheless, here are some almost compelling arguments I've heard for not using it:

Reason #1: Project Too Small

If your project or database is only a few tables and a couple lines of code then you don't need referential integrity right? Wrong, numerous projects start small, get big, and have major problems because of it. It doesn't take much extra time to put in constraints. Avoid the urge to be lazy.

Reason #2: Accidental Oversight

Numerous applications I've seen forget a relationship or two. This is borne of writing and executing database creation statements by hand and is the reason that data modeling tools exist. When you visualize your database in a model it's hard to miss a relationship. So use a modeling tool and keep it in sync with your database, you won't regret it.

Incidentally I like Microsoft Visio for data modeling because you can change your schema during development and Visio won't delete your data. This enables you to keep your data model in sync with the database for the entire lifetime of the database. There are other benefits too, if you're interested see my article on data modeling in Microsoft Visio.

Reason #3: Maximize Insert Speed

It's a fact: indexes and constraints slow down insert and update operations. If your application is heavy on writing and light on reading, then you could argue referential integrity isn't for you. This argument is often combined with the "Only one application ever uses my database" argument.

There are two problems with this. One problem comes when either a well meaning DBA modifies data by hand and messes up the state of the database, or more realistically when there's a bug in the application that accidentally orphans data. Orphaned data may not affect your application, but a well designed solution should plan for the future. When that data warehouse project finally gets around to importing data from your database, what do they do with the orphaned data? Ignore it? Try to integrate it? Who knows? If you've been in this position, you'll know what I mean when I say the responsible architect's name (or their app) will be synonymous with a curse word.

The second problem is that even if a database without referential integrity don't end up with orphaned data, a second application that might want to integrate can still never be 100% certain that foreign keys refer to existing records. It comes down to designing for the future.

The answer to speed is to build your database with referential integrity, drop or disable your constraints and indexes before a bulk load, and re-enable them after the bulk load. It will increase the duration of your bulk load operation over not using constraints at all, but it will be much faster than leaving them enabled and checking them for each insert. So use referential integrity: the pros outweigh the cons.

Reason #4: Mutually exclusive relationships

Too often I've seen databases with a foreign key that relates to one of five tables based on the value of a char(1) field. The space conserving mindset that comes up with this implementation is admirable, but it produces far too many negative side effects.

What happens when the char(1) field gets out of sync with the foreign key field? What happens when someone deletes the foreign record or changes its primary key? More orphaned data happens.

The solution is to use five fields that each refer to a single table. You may have more nullable fields that take up more space in the database, but it's worth it in the long run.

Conclusion

Well, hopefully I've convinced you to avoid the urge to be a lazy data modeler, design for the future, use a data modeling tool, and drop constraints during bulk load operations. In short, always use referential integrity. But if not, hopefully you'll at least understand when people curse your name several years from now. :)

5 Comments

Seth said:

I think the title needs a "not using".

Denormalization and OLAP data modeling might make interesting counterpoints.

Here's a sneaky example of avoiding referential integrity:

create table foo { id int primary key };
create table bar { id int primary key };
create table foo_or_bar { type char(3), id int };

A foreign key can't be created for foo_or_bar.id. The RDBMS is supposed to understand the structure of the data. Why obscure the relationships and take on extra work to maintain unnecessary triggers and check constraints?

Seth,

Interesting point about denormalization, but even with denormalized data you still can and should enforce referential integrity. The last data warehouse project I was on kept referential integrity in their star schema, but of course disabled it for bulk loads.

I don't understand your point with the sneaky example of avoiding referential integrity. It looks just like you're agreeing with my Reason #4 for enforcing RE to me. Could you elaborate?

Seth said:

Lee,

I need to do more reading (more than a quick glimpse at Wikipedia :) on denormalization. Hopefully I'll find something useful to include.

I didn't read #4 because I struggled with the title, so unfortunately I repeated your point with three lines of SQL :|

"Mutually exclusive relationships" wasn't a term I had heard before. The meaning wasn't clear to me. Now I understand it as describing the recommended solution of the problem: "1 non-null foreign key value per row". Enforcing that seems really hard, since every foreign key column would have to be considered individually in every subclause of the constraint:

CHECK (
(fk1 IS NOT NULL AND fk2 IS NULL AND fk3 IS NULL) OR
(fk1 IS NULL AND fk2 IS NOT NULL AND fk3 IS NULL) OR
(fk1 IS NULL AND fk2 IS NULL AND fk3 IS NOT NULL) OR
[...]
(fk1 IS NULL AND fk2 IS NULL AND fk3 IS NULL AND fkN IS NOT NULL))

PS: Is RE an acronym for referential integrity? Does anyone use RI?

I forgot to complete my last post -- sorry!

The first solution to #4 was one table with many kinds of foreign keys. Another solution is many mapping tables.

CREATE TABLE foo_bar_pair ( foo INT, bar INT, PRIMARY KEY (foo, bar), FOREIGN KEY (foo) REFERENCES foo(id) ON DELETE CASCADE, FOREIGN KEY (bar) REFERENCES bar(id) ON DELETE CASCADE);

CREATE TABLE foo_baz_pair ( foo INT, baz INT, PRIMARY KEY (foo, baz), FOREIGN KEY (foo) REFERENCES foo(id) ON DELETE CASCADE, FOREIGN KEY (baz) REFERENCES baz(id) ON DELETE CASCADE);

CREATE TABLE bar_baz_pair ( bar INT, baz INT, PRIMARY KEY (bar, baz), FOREIGN KEY (bar) REFERENCES bar(id) ON DELETE CASCADE, FOREIGN KEY (baz) REFERENCES baz(id) ON DELETE CASCADE);

Bob Chesley said:

Nice job of making the case for referential integrity at the database level. Too many don't understand this basic best practice.

Leave a comment


Type the characters you see in the picture above.

0 TrackBacks

Listed below are links to blogs that reference this entry: Referential Integrity - Data Modeling Mistake 1 of 10.

TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/500