Friday March 14, 2008 Lee Richardson
After writing my last blog entry on Deferred Execution in LINQ I had a conversation with Seth Schroeder who rightly pointed out among other things that I really didn't show how LINQ's deferred execution works internally. So in this post I wanted to implement my own LINQ Where() extension method based off of the one in the System.Linq namespace. So I'll show you the code, explain interesting parts of how it works including collection initializiers and extension methods, and then explain where the deferred execution behavior comes from (i.e. the yield statement). I will only explain in the context of LINQ to Objects since that's far simpler than other Linq's. I will implement a Where() like LINQ to SQL does in a later blog post (that's where things get really crazy).
Implementing MyWhere()
Let's start out with some code. The first question is does this compile?
using System;
using System.Collections.Generic;
using MyExtensionMethods;
namespace PlayingWithLinq {
public
class LinqToObjects
{
public
static void
DoStuff() {
IList<int> ints =
new List<int>() {9,8,7,6,5,4,3,2,1};
IEnumerable<int> result = ints.MyWhere(i
=> i < 5);
foreach (int i
in result) {
Console.WriteLine(i);
}
}
}
}
namespace MyExtensionMethods {
public
static class
ExtensionMethods {
public
static IEnumerable<TSource>
MyWhere<TSource>(
this
IEnumerable<TSource> source,
Func<TSource, bool> predicate
) {
foreach (TSource element in source) {
if (predicate(element)) {
yield return
element;
}
}
}
}
}
Side note: putting two namespaces in on file is far from a best practice, but yes that is allowed.
Lambdas and Collection Initializers
If you're new to C# 3.5 then your first thought may be that:
IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};
is not allowed. Actually it is. It's the collection initializer syntax that I initially whined about in my post C# 3.0: The Sweet and Sour of Syntactic Sugar (ironically I actually like this syntax the more I use it.)
Your next thought may be that:
i => i < 5
is not legitimate. This is in fact a Lambda Expression, and as I explained in Deferred Execution, The Elegance of LINQ it conceptually compiles down to an anonymous method. Incidentally those that know Groovy (myself not included) or Lisp may know this as a closure since as we'll see later it can access local variables.
Extension Methods
Ok, the .Net Framework certainly has no MyWhere() function on the List object so this certainly wouldn't compile in C# 2. But that's where C# 3's Extension Methods come in. The "this" in:
MyWhere<TSource>(this IEnumerable<TSource> source,
says that MyWhere() can be applied to any generic IEnumerable. If you want to, you can still call MyWhere() normally:
IList<int> ints = new
List<int>()
{9,8,7,6,5,4,3,2,1};
ExtensionMethods.MyWhere(ints, i => i < 5);
And in fact this is what the compiler does in the background when you call MyWhere() off of an IEnumerable. But now with extension methods you don't have to.
But does MyWhere() now exist on all IEnumerable objects everywhere? No, it turns out you only get MyWhere() when you import the namespace it exists in (MyExtensionMethods). Incidentally unlike Groovy and Ruby there is no way to add an extension method to a class itself, only to instances.
Whose got the Func()?
The last two questionable parts of the code are the Func<TSource, bool> and the yield. Func is pretty easy. It's simply one of several new predefined delegates (method signatures) that comes with the .Net framework off of the System namespace. The two generic argument one above will match any function that returns the second generic argument and takes the first generic argument as a parameter. It looks like this:
delegate TResult Func<T, TResult>(T arg1);
So rather than using a Lambda expression in my initial example I could have been very explicit about the delegate instance (myFunc):
public
static void
DoStuff() {
IList<int> ints = new
List<int>()
{9,8,7,6,5,4,3,2,1};
Func<int, bool> myFunc
= IsSmall;
IEnumerable<int> result = ints.MyWhere<int>(myFunc);
foreach
(int i in
result) {
Console.WriteLine(i);
}
}
public static
bool IsSmall(int
i) {
return
i < 5;
}
And that would have done the same thing. Notice I had to specify the generic type on the call to MyWhere() since the compiler can't infer the type in this example.
Yield
Now the really interesting part: yield. Yield is what makes deferred execution work. It actually was introduced with C# 2.0, but I don't think anyone really used it (I didn't know about it until recently). So because MyWhere() returns an IEnumerable (and because it isn't anonymous and doesn't have ref or out parameters) it is allowed to use the yield statement. When a method has a yield return (or yield break) statement, then execution of the method doesn't even begin until a calling method first iterates over the resulting IEnumerable. Execution then begins in the method and runs to the first yield statement, returns a result, and passes execution back to the caller. When the calling method iterates to the next value execution continues in the method where it left off until it gets to the next yield statement and then it passes execution back to the caller again and so on. Weird huh? Joshua Flanagan has a nice article that explains this in more detail along with some of the nice benefits like a smaller memory footprint.
So here's a quiz. What happens when you execute the following code?
IList<int> ints = new
List<int>()
{9,8,7,6,5,4,3,2,1};
IEnumerable<int>
result = ints.MyWhere<int>(i => i < 4);
ints.Add(0);
foreach (int i
in result) {
Console.WriteLine(i);
}
Without the yield you'd get the numbers 3 through 1 since you added 0 after the call to MyWhere(). But since the yield in MyWhere() (and the Where() in System.Linq) defers execution until the foreach statement, you actually get 3 through 0. Ready for a little more mind bending? How about this:
IList<int> ints = new
List<int>()
{9,8,7,6,5,4,3,2,1};
int j = 4;
IEnumerable<int>
result = ints.MyWhere<int>(i => i < j);
ints.Add(0);
j = 3;
foreach (int i
in result) {
Console.WriteLine(i);
}
Does the state of j get captured? My intuition would say yes. If so you'd expect 3 through 0. Well, the closure part of anonymous methods and lambdas work by keeping a reference to their calling object (this). So consequently they always get the most up to date value of a variable. So if your intuition works like mine you'd be wrong. You actually get the numbers 2 through 0. Crazy huh? And definitely something I hope I won't run into in someone's code (JetBrains ReSharper actually warns you if you do something crazy like this).
Conclusion
If this made sense then you should have a pretty solid grasp of how most of Linq to Objects works. Understanding extension methods, Func delegates, and yield statements should form the majority of what Linq does. Well, except for expression trees. But that's a topic for another post. Please post if this doesn't make sense or if I got it all wrong, I'd love to hear from you.
P.S. To comment on this article please use my public Blog.
One of the things I love about LINQ is its deferred execution model. It's the type of thing that makes sense academically when you first read about it (e.g. in Part Three of Scott Gunthrie's LINQ to SQL series), but for me anyway, it took some time to understand enough to use effectively.
For instance the Daily RSS Download open source application that I wrote about last week needs to download entries (posts) that are newly published since the last download. While it isn't a complicated problem, my first attempt at a solution didn't use the power of LINQ correctly. I'll explain my naïve solution in this post, describe how LINQ's deferred execution works (i.e. Lambda expressions), explain the problems with my solution, then give an the elegant solution that is only possible because of LINQ's deferred execution model. See if you can spot my error along the way.
Downloading the Latest Entries
Downloading the latest entries would be a ridiculously simple problem if there weren't multiple formats for RSS. But since the solution needs to support Atom and RSS 2.0 and 1.0 and potentially other future formats, the class structure should be set up appropriately:
The newspaper class primarily exists to enumerate feeds:
public
class Newspaper
{
public
void DownloadNow() {
foreach (Feed objFeed in
Settings.Feeds) {
objFeed.DownloadRecentEntries(...);
}
}
}
The Feed class is abstract and during runtime is either an RssFeed or an AtomFeed. The relevant function Feed.DownloadRecentEntries() calls the abstract Feed.GetEntries() method, which returns a group of Entry objects.
public
abstract class
Feed {
public
abstract IEnumerable<Entry> GetEntries(XDocument
rssfeed);
public
void DownloadRecentEntries(...) {
XDocument
xdocFeed = XDocument.Load(Url);
IEnumerable<Entry> lstRecentPosts = GetEntries(xdocFeed);
foreach
(Entry objEntry in
lstRecentPosts) {
objEntry.Download(...)
}
}
}
The Feed classes, RssFeed and AtomFeed then implement GetEntries as follows:
publpublic
class RssFeed
: Feed {
public
override IEnumerable<Entry> GetEntries(XDocument
rssfeed) {
return
from item in
rssfeed.Descendants("item")
where (DateParser.ParseDateTime(item.Element("pubDate").Value)
>=
this.LastDownloaded)
||
this.LastDownloaded ==
null
select (Entry)new
RssEntry(item, this);
}
}
public class
AtomFeed : Feed
{
public
override IEnumerable<Entry> GetEntries(XDocument
rssfeed) {
return
from item in
rssfeed.Descendants(_atomNamespace + "entry")
where (DateParser.ParseDateTime(
item.Element(_atomNamespace + "published").Value)
>=
this.LastDownloaded)
||
this.LastDownloaded ==
null
select (Entry)new AtomEntry(item,
this);
}
}
Yes, that's all LINQ to XML in there. It looks a lot like SQL, but as you'll see in a second it's really just glorified syntactic sugar. Expressive though, isn't it? While the astute reader may have already spotted the inelegance of my solution, for those unfamiliar with LINQ, let's first describe what AtomFeed.GetEntries() does.
What is this Deferred Execution Stuff?
If you already understand LINQ and how delayed execution works feel free to skip this section. For everyone else it's important to understand that the following line:
from item
in rssfeed.Descendants("item")
where (DateParser.ParseDateTime(item.Element("pubDate").Value)
>=
this.LastDownloaded)
||
this.LastDownloaded ==
null
select (Entry)new
RssEntry(item, this);
Is actually just syntactic sugar for the following set of statements:
rssfeed
.Descendants(_atomNamespace +
"entry")
.Where( item => (DateParser.ParseDateTime(
item.Element(_atomNamespace +
"published").Value)
>= this.LastDownloaded)
|| this.LastDownloaded
== null)
.Select( item => (Entry)new
AtomEntry(item, this));
Now XDocument.Descendants() returns IEnumerable
But more important for the topic of deferred execution is the => operator, which is a Lambda expression and is also new to C# 3.0. The best way to understand them is that they are essentially syntactic sugar for an anonymous method (e. (e.g. a type safe function pointer to code). So we could again rewrite our code as follows:
rssfeed
.Descendants(_atomNamespace +
"entry")
.Where(delegate(XElement
item) {
return
(DateParser.ParseDateTime(
item.Element(_atomNamespace +
"published").Value)
>= this.LastDownloaded) || this.LastDownloaded
== null; })
.Select(delegate(XElement
item) {
return
(Entry)new
AtomEntry(item, this);
});
Back in familiar territory yet? If not you probably aren't familiar with C# 2.0. In the background the compiler takes the anonymous methods above and turns them into methods on the current class and instantiates new delegates of the correct type that points to them and passes them to the Select() and Where() methods.
The The key thing to note is that the arguments for select and where are delegates, and so when those delegates are executed is beyond our control. In fact if you put a Console.WriteLine or a breakpoint inside of the AtomEntry constructor, it won't get called until the resulting IEnumerable is enumerated, specifically the following line in the first code sample:
foreach (Entry objEntry in lstRecentPosts) {
So that's delayed execution. But understanding how it works and how to use it are completely different things.
The Inelegant Solution
Getting back to my code sample you may have picked up that my where clause is the mistake. I implemented it like this because RSS and Atom have different field names for the published date. But the way I wrote it I'd have to make two changes if I wanted to change which entries to download. Ok, big deal, I'm extremely unlikely to make changes to that where clause right? Or I wasn't until I wanted functionality to set some defaults based on the average length of posts prior to downloading posts. Basically:
public
static Feed CreateFeed(string
strUrl, int intDisplayOrder) {
IEnumerable<Entry> lstRecentEntries = feed.GetEntries(rssfeed);
double
intAveragePostSize = lstRecentEntries.Average(
i => i.Description.Length);
// if the
feeds posts are typically small then include the
// description field in the summary
and download the content
// for the main article from the link
if
(intAveragePostSize < 1000) {
...
} else {
...
}
}
Except this now ties me to the were clause, when what I'd really like to do is just get the average post size for the last couple of posts. The problem is that GetEntries() isn't generic enough.
The Elegant Solution
The The solution is then to normalize out (excuse the database terminology) the where clause into the two methods that use GetEntries(). So GetEntries() becomes simple:
public
override IEnumerable<Entry> GetEntries(XDocument
rssfeed) {
return
from item in
rssfeed.Descendants(_atomNamespace + "entry")
select (Entry)new AtomEntry(item,
this);
}
And then Feed.CreateFeed() and Feed.DownloadRecentEntries() become more complicated
public
abstract class
Feed {
public
abstract IEnumerable<Entry> GetEntries(XDocument
rssfeed);
public
static Feed CreateFeed(string strUrl,
int intDisplayOrder) {
IEnumerable<Entry> lstEntries = feed.GetEntries(rssfeed);
// get the
five most recent posts
IEnumerable<Entry> lstRecentEntries =
from
entry in lstEntries.Take(5)
select entry;
double intAveragePostSize =
lstRecentEntries.Average(
i => i.Description.Length);
if (intAveragePostSize < 1000) {
...
} else {
...
}
}
public
void DownloadRecentEntries(...) {
XDocument xdocFeed = XDocument.Load(Url);
IEnumerable<Entry> lstEntries =
GetEntries(xdocFeed);
// get newly
published posts
IEnumerable<Entry>
lstRecentPosts = from entry in lstEntries
where
(entry.Published >= this.LastDownloaded)
||
this.LastDownloaded ==
null
select entry;
foreach
(Entry objEntry in
lstRecentPosts) {
objEntry.Download(...)
}
}
}
Note that we now have a second LINQ statement that runs against the results of the LINQ statement in GetEntries(). But since nothing's been executed yet we're just building out the statement that we will eventually run when the resulting IEnumerable if enumerated. So we've now spread our LINQ statements across an inheriting and a base class, and in process we've made GetEntries() extremely generic.
Conclusion
So what's the big deal? The big deal is that we can spread our data access statements across multple classes and because of deferred execution we don't need to worry about the performance of generic methods that are closer to the data that don't contain a "where" clause. This may not be a huge deal in this example, but it becomes extremely powerful when the user interface tier can tack on "order by" statements or "filters" BEFORE anything is executed against your data store. And that, for me, is at the heart of the beauty of LINQ.
I published Daily RSS Download, my first open source project on CodePlex* today. It's not going to change the world, but if you have a need for it there is a decided lack of decent products that perform this functionality. In this post I'll give a little background about why I wrote it and explain what it does and how to use it. Besides needing this functionality I also wrote it to learn LINQ to Objects and LINQ to XML, but I'll cover the more interesting implementation details in a later post.
Why I Wrote It
For Christmas I received an IRex Iliad which is an e-book reader combined with a Wacom tablet. It's an awesome product that allows reading PDF's (among other formats) and writing on them. It's pricey, but the ability to jot notes on technical documents (in addition to recipes and guitar tablature, etc) as you read is invaluable for me. I now read about twice as much as I did before. It supports Wi-Fi, and in particular can connect to a computer on a regular basis to download files you put in a specific directory.
So theoretically it could download a customized newspaper every morning for me, right? I could have today's world news, national news, local news, technical news, weather, and my RSS feeds like Scott Gunthrie all in one place while I eat my cereal! And then I could cancel my Washington Post subscription and after about 7.5 years I would have recouped the costs of the Iliad. Sweet.
The problem is that the product doesn't come with any way to download RSS feeds. Well, you can use software from MobiPocket, but it's a pain to setup, and use, and I couldn't figure out how to have it automatically run on a daily basis. And furthermore it can't grab the real content from the website if the RSS feed only contains an abstract (e.g. washingtonpost.com). I searched and there was some software out there, but none of it did what I liked. And of course none of it generated a manifest.xml file which is an Iliad specific file that links HTML pages together and gives names to groups of content (i.e. grouping the files in a directory to make a “book" called “My Daily News for February 13").
So what a great opportunity to write it myself and learn LINQ to XML and LINQ to Objects in the process.
What It Does
The end result (or the index page anyway) looks something like this:
The images are local, the links go to a full page of content, and on the Iliad, because Daily RSS Download generates a manifest.xml, the next and previous buttons can move you to the next or previous article and you can see at a glance how many articles there are.
If you want to recreate the screenshot above, first head over to the Releases page of Daily RSS Download, where you can download the msi and install the application. When you open “Daily RSS Download Config" you can view a home page like this:
You can type in an RSS or Atom URL and click Add Feed. The application will try to connect to the website, download the title, and set some configuration options based on the average length of posts (specifically if you put in a feed from the washingtonpost.com website it will detect that the average post size is small and determine that it should download the main content from the website).
You can click on any of the feeds you've added and you'll get a Feed Settings page like below:
The fields are mostly self explanatory, but here are three of the more interesting settings:
Summary Source Values:
This setting determines where the abstract (summary) on the index page should come from. There are three options:
No Summary – Does not display a summary on the index.html page. This is what Scott Gunthrie's feed was set to in the first screenshot.
Extract from the content – This takes the first 300 characters from the main content as the summary. This was set for the washingtonpost.com feed in the first screenshot (although Use the RSS description field would actually have been more appropriate).
Use the RSS description field – This uses the entire description field from the RSS (or Atom) feed. This is what the weatherbug feed was set to in the first screenshot. Obviously this is a bad choice for a Scott Gunthrie type of RSS entry since he posts everything in the description field.
Content Source Values:
This setting determines where the main content page should get it's value. There are thee options:
No content, summary only – If you set a feed to this, then Daily RSS Download won't generate a content file. This would be a good choice for the weather feed in the example.
Use the RSS description field – The content file will be created from the RSS description field. This would be a good choice for a Scott Gunthrie type of feed.
Download from the referenced web page –Daily RSS Download will download the page referenced by the RSS or Atom feed. This would be a good option for a washingtonpost.com type of feed.
Content Start/End Markers
These are regular expressions that are used if you set content source to download the referenced web page. You can leave them blank or you can set them if you want to try to strip out header, footer, navigation bars, etc. The content start marker in the screenshot:
\<div id=\"article_body\"[^\>]*\>
Says match ‘<div id="article_body"' up through to the next ‘>'. Both markers are exclusive (the thing your matching on won't be included in the results).
Customizing the CSS
So that's it for the general settings and use. You can click “Download Now" on the main config page to download your feeds, and you can set it up to run on a recurring basis (it will only download new content) by setting a recurring task to run “DailyRssDownload.exe DownloadNow". The only other thing of interest is to make the content more pretty.
The generated HTML is CSS customizable, so in order to get the two column look above (and/or make it look pretty on an Iliad) you can customize the CSS as below:
h1
{
margin-top:
0px;
/* A pretty
linux script font since the Iliad has a linux kernel */
font-family: Zapf
Chancery;
font-size:
30pt;
margin-bottom:
0px;
}
h2
{
}
.NewsHeader
{
border-bottom:
solid 1px
black;
text-align:
center;
}
.DailyRss_Date
{
text-align:
center;
}
.DailyRss_Feed
{
}
.DailyRss_Entry
{
}
.DailyRss_EvenEntry
{
}
.DailyRss_OddEntry
{
}
/* LEFT COLUMN */
#ScottGusBlog
{
float:
left;
width:
49%;
border-right:
solid 1px
gray;
}
#washingtonpostcom-TodaysHighlights
{
clear:
both;
float:
left;
width:
49%;
border-right:
solid 1px
gray;
}
/* RIGHT COLUMN */
#WeatherBugLocalWeatherfor20190
{
margin-left:
50%;
}
So basically just use the old float left, width 50%, margin-left 50% trick to get the pretty two-column look (without tables).
Conclusion
I hope you find the Daily RSS Download open source project useful. Please feel free to submit suggestions, feature requests, defects or preferably defects AND patches on the project's CodePlex home page.
* In case you aren't aware CodePlex is an open source project hosting website from Microsoft. It's similar to Source Forge, except there is no approval process for new projects and it integrates nicely with Visual Studio.
I really enjoyed Seth Schroeder’s critique of the last post in my ten part data modeling mistake series: Surrogate vs Natural Primary Keys. His argument regarding data migration in particular sheds light on a major shortcoming of using surrogate keys: they lead data modelers to a false sense of security regarding the uniqueness of data. Specifically if modelers ignore uniqueness constraints they allow duplicate data. And as Seth points out this has a nasty side effect of disallowing any clear way to compare data between systems. But there are other problems too.
So, in this post I’ll address the uniqueness problem introduced with surrogate keys by way of an example, I’ll provide two how-to’s, one implementing uniqueness in Visio and one in NHibernate, I’ll explain the difference between unique indexes and unique constraints, and finally I’ll provide reasons why unique indexes might be overlooked, specifically by providing a critique of ORM tools.
Surrogate Keys = Data Disaster?
So as mentioned above the biggest problem with surrogate keys is they lull junior data modelers or lazy developers into thinking they don’t need to worry about indexes. But they do; and it’s as vital as implementing referential integrity. And for the same reason: data integrity.
As an example, imagine you’re modeling a simple Country table. You could of course use CountryName as the primary key, but as you know from my post on surrogate keys, you would have problems with varchar join speed (assuming you disagree with Seth that it’s a premature optimization) and to a lesser extent cascading updates (since country names do occasionally change).
Introducing a surrogate key (CountryId) resolves these issues, but you also remove an inherent advantage that natural keys have: they require uniqueness in country names. In other words you can now have two New Zealand’s and the system wouldn’t stop you.
What’s the big deal? Country seems like a pretty benign table to have duplicates, right? Your users from New Zealand simply have an extra list item in their drop down to pick from and some pick one and some pick the other.
For Country one problem comes in reporting. Consider delivering a revenue by Country report. Your report probably lists New Zealand twice and a quick scan by an exec sees half of the actual revenue for that country that they should. And as a result numerous innocent sheep are slaughtered … uh, or something.
Another major problem could come in syncing data with other systems. How do those systems know which record to use?
As you can imagine the problem is even worse with major entities like Customer, Order, Product, or something more scary like Airline Flights. And the longer the system stays in production, the more production data the system collects, the more duplicates rack up, and the more time and money that will be required to clean up the data when the problem is finally identified. In short the bigger the data disaster.
How To #1: Visio
So the solution is to add at least one unique constraint (or index) to every single table. In other words if you have a table without a uniqueness constraint chances are very good you’ve done something wrong.
The good news is that it’s pretty easy to implement once you agree it’s necessary. If you’re modeling with Microsoft Visio this is a six step process:
- Select the table.
- Select the “Indexes” category.
- Click New.
- No need to enter a name, just click OK.
- Select either “Unique constraint only” or “Unique index only” (more on this decision later).
- Double click the column(s) to add.
Then when you generate or update your database Visio puts in DBMS specific uniqueness constraints. And voila, problem solved.
Unique Constraints vs Unique Indexes
The question will come up when using Visio or perhaps using various DBMS’s including SQL Server whether to use a unique constraint or unique index. The short answer is that most people use unique constraints, but ultimately they’re the same thing so it doesn’t matter.
In case you’re interested in the details though here’s a quick rundown of the differences:
Unique Constraint
- A logical construct.
- Defined in the ANSI SQL standard.
- Intent: data integrity.
- Usually part of a table definition.
Unique Index
- A physical DBMS implementation.
- Not specified in ANSI SQL.
- Intent: performance.
- Usually external to a table definition.
But since most DBMS’s implement unique constraints as unique indexes, it doesn’t really matter which you choose.
How To #2: NHibernate
Since I have the pleasure of learning the NHibernate ORM tool on my current project, I thought I’d also describe the same technique with a different tool. Basically you can either set the Unique attribute to true to obtain uniqueness in one column, or set the unique-key attribute to obtain uniqueness among multiple columns. If you use NHibernate mapping attributes you write:
[Property(NotNull = true, Length = 100, Unique = true)]
public virtual string CountryName {
get { return _strCountryName; }
set { _strCountryName = value; }
}
Which generates the following hbm:
<class name="Country"><id name="CountryId"><generator class="sequence" /></id>
<property name="CountryName" length="100" not-null="true" unique="true" />
</class>
Which NHibernate turns into the following DDL:
create table Country (
CountryId NUMBER(10,0) not null,
CountryName NVARCHAR2(100) not null <b>unique</b>,
primary key (CountryId)
So quick quiz: was that a unique index or unique constraint it generated? If you answered who cares you’re right. However if you answered a unique constraint you’re also right.
The Problem with ORM
Obviously ignorance of the problem and shortsightedness are two causes for systems going into production without unique indexes, but I’d like to point out a third. While Object Relational Mapping (ORM) tools like NHibernate are extremely convenient for generating database schemas, modeling database tables with classes and generating DDL can lead developers to a false sense of purpose.
This can occur because ORM tools focus entirely on the world of objects and classes. In this world data’s persistence is irrelevant. It exists for the purposes of a single operation, and consequently long term data persistence issues like data integrity are deemphasized. In fact, it would be easy to lose perspective of the fact that there is a database at all.
Don’t get me wrong, the benefits you get like mandatory surrogate keys, DBMS neutrality, lazy loading, and minimal data access code are wonderful. Just don’t forget that tags like NHibernate’s unique and unique-key exist. And are very necessary.
Conclusion
To sum it up don’t allow the convenience of surrogate keys to lull you into a false sense of security regarding the importance of keys. It’s a trap. And it will be a disastrous one if you aren’t careful.
In case you’re new to the series I’ve compiled a list of ten data modeling mistakes that I see over and over that I’m tackling one by one. I’ll be speaking about these topics at the upcoming IASA conference in October, so I’m hoping to generate some discussion to at least confirm I have well founded arguments.
The last post in this series Referential Integrity was probably less controversial than this one. After all, who can argue against enforcing referential integrity? But as obvious as surrogate keys may be to some, there is a good deal of diversity of opinion as evidenced by the fact that people continue to not use them.
I intend to address this topic by way of a fairly ubiquitous example that should draw out all of the arguments. I’ll investigate the options for primary keys in a Person table. I’ll provide four possible options and explain why each of them is a bad choice. I’ll then give four arguments against surrogate keys, which I will then shoot down. So without further ado:
Contender 1: Name and Location
Of course you can’t use name as a primary key for a person, because two people can have the same name and primary keys must be unique. But all too often I’ve seen databases with multiple, sometime numerous, natural (or business-related) primary keys. These databases combine a field like name with enough other fields such that the likelihood of uniqueness is approaching certainty.
In the case of person this would be equivalent to First and Last Name (wouldn’t want to violate first normal form by combining those into one field, but that’s a whole other topic), zip code, and we ought to throw address line 1 in to be safe. This is known as either a compound, composite, or multicolumn index.
Now our chances of uniqueness are close enough to certain to not warrant discussion, so let’s jump right to space and performance. There are three major problems with this approach.
Con 1: Primary key size. The primary key index for person becomes enormous. The database must now catalog four large (probably all varchar) fields. This increases the size of the index which increases overhead for insert, delete and update operations and can even decreases read speed because of the increased disk I/O.
Con 2: Foreign key size. If you have a child table like PhoneNumber then, as the diagram above shows, the foreign key becomes four columns. Those four columns take up a lot of space again. And now a common query like “Get all phone numbers for a person” involves a full table scan, or, if you throw an index on them you end up with another huge index. In fact, you most likely end up propagating huge indexes and vast amounts of data all over the place like some evil data-cancer.
Con 3: Asthetics. It just isn’t pretty. Having four column foreign keys all over the place increases the amount of code you need to write in stored procedures, middle tier, and presentation tier. Even intellisense won’t help you with this one.
Contender 2: Social Security Number
The most obvious choice for a natural key for a person object is social security number, right? Obviously it depends on what type of data person is, but regardless you’ll probably face the following four problems with this primary key candidate:
Con 4: Optionality. The social security administration specifies that U.S. citizens are not required to provide social security numbers in many circumstances. While employment is one of these circumstances, consumers of non-governmental services are definitely not. You can deny service if your consumer won’t provide the number, but is your CEO prepared to turn away business based on a data modeling decision you make?
Con 5: Applicability. Only U.S. citizens have a social security number. Your system might only cater to U.S. citizens now, but will it always?
Con 6: Uniqueness. The social security administration “is adamant” that the numbers are not recycled, even after someone dies. But eventually the numbers will run out. If you visit the slate article cited above, it calculates this date as in the next century. But the math fails to include the fact that location information is encoded in the number which significantly limits the permutations. I don’t know what the real number is, but the point is: you’re gambling with how long until a conflict occurs. And even if the time argument fails to sway you, just think of who is assigning the numbers. How much do you trust a government office to not make an occasional mistake?
Con 7: Privacy. Does your application use primary keys in the user interface tier to uniquely identify records? Does it pass primary keys between pages or use them to identify rows in a combo box? You certainly wouldn’t store such a thing in a cookie or pass it across the wire unencrypted right? Social security information is sensitive information and privacy zealots care very much how you handle their data. Primary keys are necessarily are closer to end users and harder to hide than regular fields. It just isn’t the type of data to take a chance on.
Contender 3: E-mail
So e-mail is a pretty likely choice right? It’s a relatively safe assumption that no two people share an e-mail (maybe). And anyone with a computer has one right? So there should be no uniqueness, privacy or optionality/applicability problems. But how about this:
Con 8: Accidental Denormalization. Do you have more than one e-mail address? Doesn’t everyone? Imagine what a pain it would be if Evite only allowed you one e-mail address per person (ok, well if you didn’t know it does allow you to consolidate accounts for those of us with multiple e-mail addresses). Even if your system only stores one e-mail address per person now, just think what a pain it would be to change the database to allow N e-mail addresses per person.
No… Wait. Really. Think about it...
Yea … yuck.
Contender 4: Username
If your users log in with a username, that’s a likely candidate for a primary key right? But what if they want to update their username (perhaps it was based on a last name that changed). This leads us to:
Con 9: Cascading Updates. If you have a natural key that might change you’ll need to implement some type of cascading updates (whether your DBMS supports it or you write code by hand). In other words, change the username in the person table and you have to change the username foreign key in all child records of the invoices, comments, sales, certifications, defects, or whatever other tables you track. It may not happen often, but when it does it sure will wreak havoc on your indexes. Imagine rebuilding even 10% of your indexes at once because of one operation. It’s just unnecessary.
Con 10: Varchar join speed. I left this to last because it applies to all of contenders thus far and is by far the most compelling argument against natural keys. Nine out of ten natural keys are varchar fields. Even an employee number as generated by another system may have a significant zero. It’s a fact: joining across tables with varchars is always slower than joining across tables with integers. How much? According to Peter Zaitsev who runs a MySql performance blog it’s 20% to 600% slower. And that’s for one join. How many joins do you think comprise an average user interaction? Five? Ten? Twenty? It could very likely make a significant difference to your end user.
And The Winner Is
So surrogate keys win right? Well, let’s review and see if any of the con’s of natural key’s apply to surrogate keys:
- Con 1: Primary key size – Surrogate keys generally don't have problems with index size since they're usually a single column of type int. That's about as small as it gets.
- Con 2: Foreign key size - They don't have foreign key or foreign index size problems either for the same reason as Con 1.
- Con 3: Asthetics - Well, it’s an eye of the beholder type thing, but they certainly don’t involve writing as much code as with compound natural keys.
- Con 4 & 5: Optionality & Applicability – Surrogate keys have no problems with people or things not wanting to or not being able to provide the data.
- Con 6: Uniqueness - They are 100% guaranteed to be unique. That’s a relief.
- Con 7: Privacy - They have no privacy concerns should an unscrupulous person obtain them.
- Con 8: Accidental Denormalization – You can’t accidentally denormalize non-business data.
- Con 9: Cascading Updates - Surrogate keys don't change, so no worries about how to cascade them on update.
- Con 10: Varchar join speed - They're generally int's, so they're generally as fast to join over as you can get.
For every natural key con I see a surrogate key pro. But not everyone agrees. Here are some arguments against them.
Disadvantage 1: Getting The Next Value
Some have argued that getting the next value for a surrogate keys is a pain. Perhaps that’s true in Oracle with its sequences, but generally it just takes a couple minutes research, or you can use ORM tools to hide the details for you.
Disadvantage 2: Users Don’t Understand Them
One argument I uncovered is if users were to perform ad-hoc queries on the database they wouldn’t be able to understand how to use surrogate keys.
Bunk. Bunk, bunk, bunk. End users shouldn’t be fiddling in databases any more than airline customers should be fiddling in airplane engines. And if they are savvy enough to try, then let them learn to perform joins like the pros do.
Disadvantage 3: Extra Joins
Suppose you have users table with a social security number natural primary key, and a phone number child table with social security as a foreign key.
If your user enters a social security number on a log in screen you could theoretically get their phone numbers without accessing the users table. In a surrogate key world you would have to look up the surrogate key in the person table before getting their phone numbers.
While this is true, I have found that with most CRUD applications there are few times when this scenario comes up. The vast majority of queries involve already known surrogate keys. So while this argument may be true in some situations, it just isn’t true enough of the time to be significant.
Disadvantage 4: Extra Indexes
I find this to be the most persuasive argument against natural keys. If your person object would normally have a natural key on social security number, then in surrogate-world you should have a unique index on social security number in addition to your primary key index on the surrogate key. In other words, you now have two indexes instead of one. In fact, if you have N indexes per table in natural key world, you’ll always have N + 1 indexes in surrogate key world.
While the additional indexes do indeed add indexes, which increase database size, and slow insert and update performance, you could offset some of that expense by converting your old natural key, social security number for example, to a clustered index.
Or you could just relax in the knowledge that there are pro’s and con’s to every architectural decision and for surrogate keys the pro’s outweigh the con’s.
Summary
So now if some well meaning DBA argues to use natural keys on your next project you should have ten arguments against them, which will double as ten arguments for surrogate keys, and you should be prepared with rebuttals for four arguments against surrogate keys. Whew, that was a lot. But I assure you, if you use surrogate keys today it will definitely make your life easier in the long run.
Just a heads up that DevX just published an article of mine today. The article is entitled Sync Your Database to SharePoint Using SSIS. The article covers how to import and export SharePoint list items using Collaborative Application Markup Language (CAML), SharePoint's web services API, and SQL Server Integration Services.
The latter half of the article is a fairly detailed how-to, but the former half covers what SharePoint lists are, what SSIS is, and why you would want to use them all together. I hope you find the article useful, and feel free to comment here if you have thoughts on the article.
In my mind data models are like the foundations of a house. Whether you use ORM or a more traditional modeling tool, they form the base of the entire rest of your project. Consequently, every decision you make (or don’t make) regarding your data model during the design phase(s) of your project will significantly affect the duration of your project and the maintainability and performance of your application.
You could de-emphasize up-front planning, but every correction you make to the data model once code has been written on top of it will introduce significant delays to the project as developers refactor data access, business logic, and user interface tiers. That’s why mistakes made during design are expensive, and it would behoove any architect (or project manager) to be well aware of the repercussions of data model decisions and minimize mistakes before construction begins.
After years of working with or maintaining applications based on poorly designed data models, and after years of modeling my own databases from scratch I’ve seen and made a lot of mistakes. So, I’ve compiled ten of the most common ones and the arguments for and against them.
I’ll be speaking on this topic in the upcoming IASA conference in October, and so I wanted to vet these ideas with the community. I know there are strong feelings on these topics, so please help me out by commenting if you feel I’ve missed something or am off base.
I’ll start with Mistake #1: Not Using Referential Integrity in this post. I'll give four common reasons for avoiding referential integrity and then rebuff them. I'll then cover the more controversial Mistake #2 Not Using Surrogate Keys in my next post.
Mistake #1 – Not using referential integrity
I’ve heard a lot of excuses for not using referential integrity, but I’ve never been swayed by one of them. If you have a record with a foreign key field you should be 100% certain that it will always refer to the primary key of an existing record in one and only one foreign table. The last thing you want to do is write large amounts of conditional logic because you aren’t 100% certain that you aren’t dealing with orphaned data. Nonetheless, here are some almost compelling arguments I’ve heard for not using it:
Reason #1: Project Too Small
If your project or database is only a few tables and a couple lines of code then you don’t need referential integrity right? Wrong, numerous projects start small, get big, and have major problems because of it. It doesn’t take much extra time to put in constraints. Avoid the urge to be lazy.
Reason #2: Accidental Oversight
Numerous applications I’ve seen forget a relationship or two. This is borne of writing and executing database creation statements by hand and is the reason that data modeling tools exist. When you visualize your database in a model it’s hard to miss a relationship. So use a modeling tool and keep it in sync with your database, you won’t regret it.
Incidentally I like Microsoft Visio for data modeling because you can change your schema during development and Visio won’t delete your data. This enables you to keep your data model in sync with the database for the entire lifetime of the database. There are other benefits too, if you’re interested see my article on data modeling in Microsoft Visio.
Reason #3: Maximize Insert Speed
It’s a fact: indexes and constraints slow down insert and update operations. If your application is heavy on writing and light on reading, then you could argue referential integrity isn’t for you. This argument is often combined with the “Only one application ever uses my database” argument.
There are two problems with this. One problem comes when either a well meaning DBA modifies data by hand and messes up the state of the database, or more realistically when there’s a bug in the application that accidentally orphans data. Orphaned data may not affect your application, but a well designed solution should plan for the future. When that data warehouse project finally gets around to importing data from your database, what do they do with the orphaned data? Ignore it? Try to integrate it? Who knows? If you’ve been in this position, you’ll know what I mean when I say the responsible architect’s name (or their app) will be synonymous with a curse word.
The second problem is that even if a database without referential integrity don’t end up with orphaned data, a second application that might want to integrate can still never be 100% certain that foreign keys refer to existing records. It comes down to designing for the future.
The answer to speed is to build your database with referential integrity, drop or disable your constraints and indexes before a bulk load, and re-enable them after the bulk load. It will increase the duration of your bulk load operation over not using constraints at all, but it will be much faster than leaving them enabled and checking them for each insert. So use referential integrity: the pros outweigh the cons.
Reason #4: Mutually exclusive relationships
Too often I’ve seen databases with a foreign key that relates to one of five tables based on the value of a char(1) field. The space conserving mindset that comes up with this implementation is admirable, but it produces far too many negative side effects.
What happens when the char(1) field gets out of sync with the foreign key field? What happens when someone deletes the foreign record or changes its primary key? More orphaned data happens.
The solution is to use five fields that each refer to a single table. You may have more nullable fields that take up more space in the database, but it’s worth it in the long run.
Conclusion
Well, hopefully I’ve convinced you to avoid the urge to be a lazy data modeler, design for the future, use a data modeling tool, and drop constraints during bulk load operations. In short, always use referential integrity. But if not, hopefully you’ll at least understand when people curse your name several years from now. :)










