Lee Richardson

All | General | Java | Ruby | .Net
XML
20080220 Wednesday February 20, 2008
Deferred Execution, the Elegance of LINQ

One of the things I love about LINQ is its deferred execution model. It's the type of thing that makes sense academically when you first read about it (e.g. in Part Three of Scott Gunthrie's LINQ to SQL series), but for me anyway, it took some time to understand enough to use effectively.

For instance the Daily RSS Download open source application that I wrote about last week needs to download entries (posts) that are newly published since the last download. While it isn't a complicated problem, my first attempt at a solution didn't use the power of LINQ correctly. I'll explain my naïve solution in this post, describe how LINQ's deferred execution works (i.e. Lambda expressions), explain the problems with my solution, then give an the elegant solution that is only possible because of LINQ's deferred execution model. See if you can spot my error along the way.

Downloading the Latest Entries

Downloading the latest entries would be a ridiculously simple problem if there weren't multiple formats for RSS. But since the solution needs to support Atom and RSS 2.0 and 1.0 and potentially other future formats, the class structure should be set up appropriately:

The newspaper class primarily exists to enumerate feeds:

public class Newspaper {
    public void DownloadNow() {
      foreach
(Feed objFeed in Settings.Feeds) {
        objFeed.DownloadRecentEntries(...);
      }
    }
}

The Feed class is abstract and during runtime is either an RssFeed or an AtomFeed. The relevant function Feed.DownloadRecentEntries() calls the abstract Feed.GetEntries() method, which returns a group of Entry objects.

public abstract class Feed {
    public
abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public
void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstRecentPosts = GetEntries(xdocFeed);

        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...)
        }
    }
}

The Feed classes, RssFeed and AtomFeed then implement GetEntries as follows:

publpublic class RssFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return from item in rssfeed.Descendants("item")
               where (DateParser.ParseDateTime(item.Element("pubDate").Value)
                   >= this.LastDownloaded)
                   || this.LastDownloaded == null
               select
(Entry)new RssEntry(item, this);
    }
}

public class AtomFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return from item in rssfeed.Descendants(_atomNamespace + "entry")
               where (DateParser.ParseDateTime(
                   item.Element(_atomNamespace + "published").Value)
                    >= this.LastDownloaded)
                    || this.LastDownloaded == null
               select (Entry)new AtomEntry(item, this);
    }
}

Yes, that's all LINQ to XML in there. It looks a lot like SQL, but as you'll see in a second it's really just glorified syntactic sugar. Expressive though, isn't it? While the astute reader may have already spotted the inelegance of my solution, for those unfamiliar with LINQ, let's first describe what AtomFeed.GetEntries() does.

What is this Deferred Execution Stuff?

If you already understand LINQ and how delayed execution works feel free to skip this section. For everyone else it's important to understand that the following line:

from item in rssfeed.Descendants("item")
               where (DateParser.ParseDateTime(item.Element("pubDate").Value)
                   >= this.LastDownloaded)
                   || this.LastDownloaded == null
               select
(Entry)new RssEntry(item, this);

Is actually just syntactic sugar for the following set of statements:

rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where( item => (DateParser.ParseDateTime(
        item.Element(_atomNamespace + "published").Value)
        >= this.LastDownloaded)
        || this.LastDownloaded == null)
    .Select( item => (Entry)new AtomEntry(item, this));

Now XDocument.Descendants() returns IEnumerable which most definitely does not have a Where() function on it. And if you look at the return type of Where() in this context, it returns an IEnumerable which definetly does not have a Select() method on it. That's because Where() and Select() are extension methods, meaning you can attach them on to just about anything. They're new to C# 3.0 and are beyond the scope of this article.

But more important for the topic of deferred execution is the => operator, which is a Lambda expression and is also new to C# 3.0. The best way to understand them is that they are essentially syntactic sugar for an anonymous method (e. (e.g. a type safe function pointer to code). So we could again rewrite our code as follows:

rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where(delegate(XElement item) {
        return (DateParser.ParseDateTime(
            item.Element(_atomNamespace + "published").Value)
            >= this.LastDownloaded) || this.LastDownloaded == null; }
)
    .Select(delegate(XElement item) {
        return (Entry)new AtomEntry(item, this); }
);

Back in familiar territory yet? If not you probably aren't familiar with C# 2.0. In the background the compiler takes the anonymous methods above and turns them into methods on the current class and instantiates new delegates of the correct type that points to them and passes them to the Select() and Where() methods.

The The key thing to note is that the arguments for select and where are delegates, and so when those delegates are executed is beyond our control. In fact if you put a Console.WriteLine or a breakpoint inside of the AtomEntry constructor, it won't get called until the resulting IEnumerable is enumerated, specifically the following line in the first code sample:

foreach (Entry objEntry in lstRecentPosts) {

So that's delayed execution. But understanding how it works and how to use it are completely different things.

The Inelegant Solution

Getting back to my code sample you may have picked up that my where clause is the mistake. I implemented it like this because RSS and Atom have different field names for the published date. But the way I wrote it I'd have to make two changes if I wanted to change which entries to download. Ok, big deal, I'm extremely unlikely to make changes to that where clause right? Or I wasn't until I wanted functionality to set some defaults based on the average length of posts prior to downloading posts. Basically:

public static Feed CreateFeed(string strUrl, int intDisplayOrder) {
    IEnumerable<Entry> lstRecentEntries = feed.GetEntries(rssfeed);
    double intAveragePostSize = lstRecentEntries.Average(
        i => i.Description.Length);
    // if the feeds posts are typically small then include the
    // description field in the summary and download the content
    // for the main article from the link
   
if (intAveragePostSize < 1000) {
        ...
    } else {
        ...
    }
}

Except this now ties me to the were clause, when what I'd really like to do is just get the average post size for the last couple of posts. The problem is that GetEntries() isn't generic enough.

The Elegant Solution

The The solution is then to normalize out (excuse the database terminology) the where clause into the two methods that use GetEntries(). So GetEntries() becomes simple:

public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
      return from item in rssfeed.Descendants(_atomNamespace + "entry")
               select (Entry)new AtomEntry(item, this);
}

And then Feed.CreateFeed() and Feed.DownloadRecentEntries() become more complicated

public abstract class Feed {
    public
abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public
static Feed CreateFeed(string strUrl, int intDisplayOrder) {
        IEnumerable<Entry> lstEntries = feed.GetEntries(rssfeed);
        // get the five most recent posts
        IEnumerable<Entry> lstRecentEntries =
            from entry in lstEntries.Take(5)
            select entry;

        double
intAveragePostSize = lstRecentEntries.Average(
            i => i.Description.Length);
        if (intAveragePostSize < 1000) {
            ...
        } else {
            ...
        }
    }

    public
void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstEntries = GetEntries(xdocFeed);
        // get newly published posts
        IEnumerable<Entry> lstRecentPosts = from entry in lstEntries
            where (entry.Published >= this.LastDownloaded)
                || this.LastDownloaded == null
            select entry;


        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...)
        }
    }
}

Note that we now have a second LINQ statement that runs against the results of the LINQ statement in GetEntries(). But since nothing's been executed yet we're just building out the statement that we will eventually run when the resulting IEnumerable if enumerated. So we've now spread our LINQ statements across an inheriting and a base class, and in process we've made GetEntries() extremely generic.

Conclusion

So what's the big deal? The big deal is that we can spread our data access statements across multple classes and because of deferred execution we don't need to worry about the performance of generic methods that are closer to the data that don't contain a "where" clause. This may not be a huge deal in this example, but it becomes extremely powerful when the user interface tier can tack on "order by" statements or "filters" BEFORE anything is executed against your data store. And that, for me, is at the heart of the beauty of LINQ.

Posted by lrichard Feb 20 2008, 09:45:58 AM EST
20080214 Thursday February 14, 2008
Daily RSS Download

I published Daily RSS Download, my first open source project on CodePlex* today. It's not going to change the world, but if you have a need for it there is a decided lack of decent products that perform this functionality. In this post I'll give a little background about why I wrote it and explain what it does and how to use it. Besides needing this functionality I also wrote it to learn LINQ to Objects and LINQ to XML, but I'll cover the more interesting implementation details in a later post.

Why I Wrote It

For Christmas I received an IRex Iliad which is an e-book reader combined with a Wacom tablet. It's an awesome product that allows reading PDF's (among other formats) and writing on them. It's pricey, but the ability to jot notes on technical documents (in addition to recipes and guitar tablature, etc) as you read is invaluable for me. I now read about twice as much as I did before. It supports Wi-Fi, and in particular can connect to a computer on a regular basis to download files you put in a specific directory.

So theoretically it could download a customized newspaper every morning for me, right? I could have today's world news, national news, local news, technical news, weather, and my RSS feeds like Scott Gunthrie all in one place while I eat my cereal! And then I could cancel my Washington Post subscription and after about 7.5 years I would have recouped the costs of the Iliad. Sweet.

The problem is that the product doesn't come with any way to download RSS feeds. Well, you can use software from MobiPocket, but it's a pain to setup, and use, and I couldn't figure out how to have it automatically run on a daily basis. And furthermore it can't grab the real content from the website if the RSS feed only contains an abstract (e.g. washingtonpost.com). I searched and there was some software out there, but none of it did what I liked. And of course none of it generated a manifest.xml file which is an Iliad specific file that links HTML pages together and gives names to groups of content (i.e. grouping the files in a directory to make a “book" called “My Daily News for February 13").

So what a great opportunity to write it myself and learn LINQ to XML and LINQ to Objects in the process.

What It Does

The end result (or the index page anyway) looks something like this:

The images are local, the links go to a full page of content, and on the Iliad, because Daily RSS Download generates a manifest.xml, the next and previous buttons can move you to the next or previous article and you can see at a glance how many articles there are.

If you want to recreate the screenshot above, first head over to the Releases page of Daily RSS Download, where you can download the msi and install the application. When you open “Daily RSS Download Config" you can view a home page like this:

You can type in an RSS or Atom URL and click Add Feed. The application will try to connect to the website, download the title, and set some configuration options based on the average length of posts (specifically if you put in a feed from the washingtonpost.com website it will detect that the average post size is small and determine that it should download the main content from the website).

You can click on any of the feeds you've added and you'll get a Feed Settings page like below:

The fields are mostly self explanatory, but here are three of the more interesting settings:

Summary Source Values:

This setting determines where the abstract (summary) on the index page should come from. There are three options:

No Summary – Does not display a summary on the index.html page. This is what Scott Gunthrie's feed was set to in the first screenshot.

Extract from the content – This takes the first 300 characters from the main content as the summary. This was set for the washingtonpost.com feed in the first screenshot (although Use the RSS description field would actually have been more appropriate).

Use the RSS description field – This uses the entire description field from the RSS (or Atom) feed. This is what the weatherbug feed was set to in the first screenshot. Obviously this is a bad choice for a Scott Gunthrie type of RSS entry since he posts everything in the description field.

Content Source Values:

This setting determines where the main content page should get it's value. There are thee options:

No content, summary only – If you set a feed to this, then Daily RSS Download won't generate a content file. This would be a good choice for the weather feed in the example.

Use the RSS description field – The content file will be created from the RSS description field. This would be a good choice for a Scott Gunthrie type of feed.

Download from the referenced web page –Daily RSS Download will download the page referenced by the RSS or Atom feed. This would be a good option for a washingtonpost.com type of feed.

Content Start/End Markers

These are regular expressions that are used if you set content source to download the referenced web page. You can leave them blank or you can set them if you want to try to strip out header, footer, navigation bars, etc. The content start marker in the screenshot:

\<div id=\"article_body\"[^\>]*\>

Says match ‘<div id="article_body"' up through to the next ‘>'. Both markers are exclusive (the thing your matching on won't be included in the results).

Customizing the CSS

So that's it for the general settings and use. You can click “Download Now" on the main config page to download your feeds, and you can set it up to run on a recurring basis (it will only download new content) by setting a recurring task to run “DailyRssDownload.exe DownloadNow". The only other thing of interest is to make the content more pretty.

The generated HTML is CSS customizable, so in order to get the two column look above (and/or make it look pretty on an Iliad) you can customize the CSS as below:

h1
{
      margin-top: 0px;
      /* A pretty linux script font since the Iliad has a linux kernel */
      font-family: Zapf Chancery;
      font-size: 30pt;
      margin-bottom: 0px;
}
h2
{
}
.NewsHeader
{
      border-bottom: solid 1px black;
      text-align: center;
}
.DailyRss_Date
{
      text-align: center;
}
.DailyRss_Feed
{
}
.DailyRss_Entry
{
}
.DailyRss_EvenEntry
{
}
.DailyRss_OddEntry
{
}

/* LEFT COLUMN */
#ScottGusBlog
{
      float: left;
      width: 49%;
      border-right: solid 1px gray;
}
#washingtonpostcom-TodaysHighlights
{
      clear: both;
      float: left;
      width: 49%;
      border-right: solid 1px gray;
}

/* RIGHT COLUMN */
#WeatherBugLocalWeatherfor20190
{
      margin-left: 50%;
}

So basically just use the old float left, width 50%, margin-left 50% trick to get the pretty two-column look (without tables).

Conclusion

I hope you find the Daily RSS Download open source project useful. Please feel free to submit suggestions, feature requests, defects or preferably defects AND patches on the project's CodePlex home page.

* In case you aren't aware CodePlex is an open source project hosting website from Microsoft. It's similar to Source Forge, except there is no approval process for new projects and it integrates nicely with Visual Studio.

Posted by lrichard Feb 14 2008, 11:29:00 AM EST
20070830 Thursday August 30, 2007
Create Data Disaster: Avoid Unique Indexes – (Mistake 3 of 10)

I really enjoyed Seth Schroeder’s critique of the last post in my ten part data modeling mistake series: Surrogate vs Natural Primary Keys. His argument regarding data migration in particular sheds light on a major shortcoming of using surrogate keys: they lead data modelers to a false sense of security regarding the uniqueness of data. Specifically if modelers ignore uniqueness constraints they allow duplicate data. And as Seth points out this has a nasty side effect of disallowing any clear way to compare data between systems. But there are other problems too.

So, in this post I’ll address the uniqueness problem introduced with surrogate keys by way of an example, I’ll provide two how-to’s, one implementing uniqueness in Visio and one in NHibernate, I’ll explain the difference between unique indexes and unique constraints, and finally I’ll provide reasons why unique indexes might be overlooked, specifically by providing a critique of ORM tools.

Surrogate Keys = Data Disaster?

So as mentioned above the biggest problem with surrogate keys is they lull junior data modelers or lazy developers into thinking they don’t need to worry about indexes. But they do; and it’s as vital as implementing referential integrity. And for the same reason: data integrity.

As an example, imagine you’re modeling a simple Country table. You could of course use CountryName as the primary key, but as you know from my post on surrogate keys, you would have problems with varchar join speed (assuming you disagree with Seth that it’s a premature optimization) and to a lesser extent cascading updates (since country names do occasionally change).

Introducing a surrogate key (CountryId) resolves these issues, but you also remove an inherent advantage that natural keys have: they require uniqueness in country names. In other words you can now have two New Zealand’s and the system wouldn’t stop you.

What’s the big deal? Country seems like a pretty benign table to have duplicates, right? Your users from New Zealand simply have an extra list item in their drop down to pick from and some pick one and some pick the other.

For Country one problem comes in reporting. Consider delivering a revenue by Country report. Your report probably lists New Zealand twice and a quick scan by an exec sees half of the actual revenue for that country that they should. And as a result numerous innocent sheep are slaughtered … uh, or something.

Another major problem could come in syncing data with other systems. How do those systems know which record to use?

As you can imagine the problem is even worse with major entities like Customer, Order, Product, or something more scary like Airline Flights. And the longer the system stays in production, the more production data the system collects, the more duplicates rack up, and the more time and money that will be required to clean up the data when the problem is finally identified. In short the bigger the data disaster.

How To #1: Visio

So the solution is to add at least one unique constraint (or index) to every single table. In other words if you have a table without a uniqueness constraint chances are very good you’ve done something wrong.

The good news is that it’s pretty easy to implement once you agree it’s necessary. If you’re modeling with Microsoft Visio this is a six step process:

  1. Select the table.
  2. Select the “Indexes” category.
  3. Click New.
  4. No need to enter a name, just click OK.
  5. Select either “Unique constraint only” or “Unique index only” (more on this decision later).
  6. Double click the column(s) to add.

Then when you generate or update your database Visio puts in DBMS specific uniqueness constraints. And voila, problem solved.

Unique Constraints vs Unique Indexes

The question will come up when using Visio or perhaps using various DBMS’s including SQL Server whether to use a unique constraint or unique index. The short answer is that most people use unique constraints, but ultimately they’re the same thing so it doesn’t matter.

In case you’re interested in the details though here’s a quick rundown of the differences:

Unique Constraint

  • A logical construct.
  • Defined in the ANSI SQL standard.
  • Intent: data integrity.
  • Usually part of a table definition.

Unique Index

  • A physical DBMS implementation.
  • Not specified in ANSI SQL.
  • Intent: performance.
  • Usually external to a table definition.

But since most DBMS’s implement unique constraints as unique indexes, it doesn’t really matter which you choose.

How To #2: NHibernate

Since I have the pleasure of learning the NHibernate ORM tool on my current project, I thought I’d also describe the same technique with a different tool. Basically you can either set the Unique attribute to true to obtain uniqueness in one column, or set the unique-key attribute to obtain uniqueness among multiple columns. If you use NHibernate mapping attributes you write:

[Property(NotNull = true, Length = 100, Unique = true)]

public virtual string CountryName {

    get { return _strCountryName; }

    set { _strCountryName = value; }

}

Which generates the following hbm:

<class name="Country">
    <id name="CountryId"><generator class="sequence" /></id>
    <property name="CountryName" length="100" not-null="true" unique="true" />
</class>

Which NHibernate turns into the following DDL:

create table Country (

   CountryId NUMBER(10,0) not null,

   CountryName NVARCHAR2(100) not null <b>unique</b>,

   primary key (CountryId)

)

So quick quiz: was that a unique index or unique constraint it generated? If you answered who cares you’re right. However if you answered a unique constraint you’re also right.

The Problem with ORM

Obviously ignorance of the problem and shortsightedness are two causes for systems going into production without unique indexes, but I’d like to point out a third. While Object Relational Mapping (ORM) tools like NHibernate are extremely convenient for generating database schemas, modeling database tables with classes and generating DDL can lead developers to a false sense of purpose.

This can occur because ORM tools focus entirely on the world of objects and classes. In this world data’s persistence is irrelevant. It exists for the purposes of a single operation, and consequently long term data persistence issues like data integrity are deemphasized. In fact, it would be easy to lose perspective of the fact that there is a database at all.

Don’t get me wrong, the benefits you get like mandatory surrogate keys, DBMS neutrality, lazy loading, and minimal data access code are wonderful. Just don’t forget that tags like NHibernate’s unique and unique-key exist. And are very necessary.

Conclusion

To sum it up don’t allow the convenience of surrogate keys to lull you into a false sense of security regarding the importance of keys. It’s a trap. And it will be a disastrous one if you aren’t careful.

Posted by lrichard Aug 30 2007, 07:54:53 PM EDT
20070816 Thursday August 16, 2007
Surrogate Keys – Data Modeling Mistake 2 of 10

In case you’re new to the series I’ve compiled a list of ten data modeling mistakes that I see over and over that I’m tackling one by one. I’ll be speaking about these topics at the upcoming IASA conference in October, so I’m hoping to generate some discussion to at least confirm I have well founded arguments.

The last post in this series Referential Integrity was probably less controversial than this one. After all, who can argue against enforcing referential integrity? But as obvious as surrogate keys may be to some, there is a good deal of diversity of opinion as evidenced by the fact that people continue to not use them.

I intend to address this topic by way of a fairly ubiquitous example that should draw out all of the arguments. I’ll investigate the options for primary keys in a Person table. I’ll provide four possible options and explain why each of them is a bad choice. I’ll then give four arguments against surrogate keys, which I will then shoot down. So without further ado:

Contender 1: Name and Location

Of course you can’t use name as a primary key for a person, because two people can have the same name and primary keys must be unique. But all too often I’ve seen databases with multiple, sometime numerous, natural (or business-related) primary keys. These databases combine a field like name with enough other fields such that the likelihood of uniqueness is approaching certainty.

In the case of person this would be equivalent to First and Last Name (wouldn’t want to violate first normal form by combining those into one field, but that’s a whole other topic), zip code, and we ought to throw address line 1 in to be safe. This is known as either a compound, composite, or multicolumn index.

Now our chances of uniqueness are close enough to certain to not warrant discussion, so let’s jump right to space and performance. There are three major problems with this approach.

Con 1: Primary key size. The primary key index for person becomes enormous. The database must now catalog four large (probably all varchar) fields. This increases the size of the index which increases overhead for insert, delete and update operations and can even decreases read speed because of the increased disk I/O.

Con 2: Foreign key size. If you have a child table like PhoneNumber then, as the diagram above shows, the foreign key becomes four columns. Those four columns take up a lot of space again. And now a common query like “Get all phone numbers for a person” involves a full table scan, or, if you throw an index on them you end up with another huge index. In fact, you most likely end up propagating huge indexes and vast amounts of data all over the place like some evil data-cancer.

Con 3: Asthetics. It just isn’t pretty. Having four column foreign keys all over the place increases the amount of code you need to write in stored procedures, middle tier, and presentation tier. Even intellisense won’t help you with this one.

Contender 2: Social Security Number

The most obvious choice for a natural key for a person object is social security number, right? Obviously it depends on what type of data person is, but regardless you’ll probably face the following four problems with this primary key candidate:

Con 4: Optionality. The social security administration specifies that U.S. citizens are not required to provide social security numbers in many circumstances. While employment is one of these circumstances, consumers of non-governmental services are definitely not. You can deny service if your consumer won’t provide the number, but is your CEO prepared to turn away business based on a data modeling decision you make?

Con 5: Applicability. Only U.S. citizens have a social security number. Your system might only cater to U.S. citizens now, but will it always?

Con 6: Uniqueness. The social security administration “is adamant” that the numbers are not recycled, even after someone dies. But eventually the numbers will run out. If you visit the slate article cited above, it calculates this date as in the next century. But the math fails to include the fact that location information is encoded in the number which significantly limits the permutations. I don’t know what the real number is, but the point is: you’re gambling with how long until a conflict occurs. And even if the time argument fails to sway you, just think of who is assigning the numbers. How much do you trust a government office to not make an occasional mistake?

Con 7: Privacy. Does your application use primary keys in the user interface tier to uniquely identify records? Does it pass primary keys between pages or use them to identify rows in a combo box? You certainly wouldn’t store such a thing in a cookie or pass it across the wire unencrypted right? Social security information is sensitive information and privacy zealots care very much how you handle their data. Primary keys are necessarily are closer to end users and harder to hide than regular fields. It just isn’t the type of data to take a chance on.

Contender 3: E-mail

So e-mail is a pretty likely choice right? It’s a relatively safe assumption that no two people share an e-mail (maybe). And anyone with a computer has one right? So there should be no uniqueness, privacy or optionality/applicability problems. But how about this:

Con 8: Accidental Denormalization. Do you have more than one e-mail address? Doesn’t everyone? Imagine what a pain it would be if Evite only allowed you one e-mail address per person (ok, well if you didn’t know it does allow you to consolidate accounts for those of us with multiple e-mail addresses). Even if your system only stores one e-mail address per person now, just think what a pain it would be to change the database to allow N e-mail addresses per person.

No… Wait. Really. Think about it...

Yea … yuck.

Contender 4: Username

If your users log in with a username, that’s a likely candidate for a primary key right? But what if they want to update their username (perhaps it was based on a last name that changed). This leads us to:

Con 9: Cascading Updates. If you have a natural key that might change you’ll need to implement some type of cascading updates (whether your DBMS supports it or you write code by hand). In other words, change the username in the person table and you have to change the username foreign key in all child records of the invoices, comments, sales, certifications, defects, or whatever other tables you track. It may not happen often, but when it does it sure will wreak havoc on your indexes. Imagine rebuilding even 10% of your indexes at once because of one operation. It’s just unnecessary.

Con 10: Varchar join speed. I left this to last because it applies to all of contenders thus far and is by far the most compelling argument against natural keys. Nine out of ten natural keys are varchar fields. Even an employee number as generated by another system may have a significant zero. It’s a fact: joining across tables with varchars is always slower than joining across tables with integers. How much? According to Peter Zaitsev who runs a MySql performance blog it’s 20% to 600% slower. And that’s for one join. How many joins do you think comprise an average user interaction? Five? Ten? Twenty? It could very likely make a significant difference to your end user.

And The Winner Is

So surrogate keys win right? Well, let’s review and see if any of the con’s of natural key’s apply to surrogate keys:

  • Con 1: Primary key size – Surrogate keys generally don't have problems with index size since they're usually a single column of type int. That's about as small as it gets.
  • Con 2: Foreign key size - They don't have foreign key or foreign index size problems either for the same reason as Con 1.
  • Con 3: Asthetics - Well, it’s an eye of the beholder type thing, but they certainly don’t involve writing as much code as with compound natural keys.
  • Con 4 & 5: Optionality & Applicability – Surrogate keys have no problems with people or things not wanting to or not being able to provide the data.
  • Con 6: Uniqueness - They are 100% guaranteed to be unique. That’s a relief.
  • Con 7: Privacy - They have no privacy concerns should an unscrupulous person obtain them.
  • Con 8: Accidental Denormalization – You can’t accidentally denormalize non-business data.
  • Con 9: Cascading Updates - Surrogate keys don't change, so no worries about how to cascade them on update.
  • Con 10: Varchar join speed - They're generally int's, so they're generally as fast to join over as you can get.

For every natural key con I see a surrogate key pro. But not everyone agrees. Here are some arguments against them.

Disadvantage 1: Getting The Next Value

Some have argued that getting the next value for a surrogate keys is a pain. Perhaps that’s true in Oracle with its sequences, but generally it just takes a couple minutes research, or you can use ORM tools to hide the details for you.

Disadvantage 2: Users Don’t Understand Them

One argument I uncovered is if users were to perform ad-hoc queries on the database they wouldn’t be able to understand how to use surrogate keys.

Bunk. Bunk, bunk, bunk. End users shouldn’t be fiddling in databases any more than airline customers should be fiddling in airplane engines. And if they are savvy enough to try, then let them learn to perform joins like the pros do.

Disadvantage 3: Extra Joins

Suppose you have users table with a social security number natural primary key, and a phone number child table with social security as a foreign key.

If your user enters a social security number on a log in screen you could theoretically get their phone numbers without accessing the users table. In a surrogate key world you would have to look up the surrogate key in the person table before getting their phone numbers.

While this is true, I have found that with most CRUD applications there are few times when this scenario comes up. The vast majority of queries involve already known surrogate keys. So while this argument may be true in some situations, it just isn’t true enough of the time to be significant.

Disadvantage 4: Extra Indexes

I find this to be the most persuasive argument against natural keys. If your person object would normally have a natural key on social security number, then in surrogate-world you should have a unique index on social security number in addition to your primary key index on the surrogate key. In other words, you now have two indexes instead of one. In fact, if you have N indexes per table in natural key world, you’ll always have N + 1 indexes in surrogate key world.

While the additional indexes do indeed add indexes, which increase database size, and slow insert and update performance, you could offset some of that expense by converting your old natural key, social security number for example, to a clustered index.

Or you could just relax in the knowledge that there are pro’s and con’s to every architectural decision and for surrogate keys the pro’s outweigh the con’s.

Summary

So now if some well meaning DBA argues to use natural keys on your next project you should have ten arguments against them, which will double as ten arguments for surrogate keys, and you should be prepared with rebuttals for four arguments against surrogate keys. Whew, that was a lot. But I assure you, if you use surrogate keys today it will definitely make your life easier in the long run.

Posted by lrichard Aug 16 2007, 08:08:45 AM EDT
20070730 Monday July 30, 2007
Sync Your Database to SharePoint Using SSIS

Just a heads up that DevX just published an article of mine today. The article is entitled Sync Your Database to SharePoint Using SSIS. The article covers how to import and export SharePoint list items using Collaborative Application Markup Language (CAML), SharePoint's web services API, and SQL Server Integration Services.

The latter half of the article is a fairly detailed how-to, but the former half covers what SharePoint lists are, what SSIS is, and why you would want to use them all together. I hope you find the article useful, and feel free to comment here if you have thoughts on the article.

Posted by lrichard Jul 30 2007, 06:47:01 PM EDT
20070718 Wednesday July 18, 2007
Referential Integrity - Data Modeling Mistake 1 of 10

In my mind data models are like the foundations of a house. Whether you use ORM or a more traditional modeling tool, they form the base of the entire rest of your project. Consequently, every decision you make (or don’t make) regarding your data model during the design phase(s) of your project will significantly affect the duration of your project and the maintainability and performance of your application.

You could de-emphasize up-front planning, but every correction you make to the data model once code has been written on top of it will introduce significant delays to the project as developers refactor data access, business logic, and user interface tiers. That’s why mistakes made during design are expensive, and it would behoove any architect (or project manager) to be well aware of the repercussions of data model decisions and minimize mistakes before construction begins.

After years of working with or maintaining applications based on poorly designed data models, and after years of modeling my own databases from scratch I’ve seen and made a lot of mistakes. So, I’ve compiled ten of the most common ones and the arguments for and against them.

I’ll be speaking on this topic in the upcoming IASA conference in October, and so I wanted to vet these ideas with the community. I know there are strong feelings on these topics, so please help me out by commenting if you feel I’ve missed something or am off base.

I’ll start with Mistake #1: Not Using Referential Integrity in this post. I'll give four common reasons for avoiding referential integrity and then rebuff them. I'll then cover the more controversial Mistake #2 Not Using Surrogate Keys in my next post.

Mistake #1 – Not using referential integrity

I’ve heard a lot of excuses for not using referential integrity, but I’ve never been swayed by one of them. If you have a record with a foreign key field you should be 100% certain that it will always refer to the primary key of an existing record in one and only one foreign table. The last thing you want to do is write large amounts of conditional logic because you aren’t 100% certain that you aren’t dealing with orphaned data. Nonetheless, here are some almost compelling arguments I’ve heard for not using it:

Reason #1: Project Too Small

If your project or database is only a few tables and a couple lines of code then you don’t need referential integrity right? Wrong, numerous projects start small, get big, and have major problems because of it. It doesn’t take much extra time to put in constraints. Avoid the urge to be lazy.

Reason #2: Accidental Oversight

Numerous applications I’ve seen forget a relationship or two. This is borne of writing and executing database creation statements by hand and is the reason that data modeling tools exist. When you visualize your database in a model it’s hard to miss a relationship. So use a modeling tool and keep it in sync with your database, you won’t regret it.

Incidentally I like Microsoft Visio for data modeling because you can change your schema during development and Visio won’t delete your data. This enables you to keep your data model in sync with the database for the entire lifetime of the database. There are other benefits too, if you’re interested see my article on data modeling in Microsoft Visio.

Reason #3: Maximize Insert Speed

It’s a fact: indexes and constraints slow down insert and update operations. If your application is heavy on writing and light on reading, then you could argue referential integrity isn’t for you. This argument is often combined with the “Only one application ever uses my database” argument.

There are two problems with this. One problem comes when either a well meaning DBA modifies data by hand and messes up the state of the database, or more realistically when there’s a bug in the application that accidentally orphans data. Orphaned data may not affect your application, but a well designed solution should plan for the future. When that data warehouse project finally gets around to importing data from your database, what do they do with the orphaned data? Ignore it? Try to integrate it? Who knows? If you’ve been in this position, you’ll know what I mean when I say the responsible architect’s name (or their app) will be synonymous with a curse word.

The second problem is that even if a database without referential integrity don’t end up with orphaned data, a second application that might want to integrate can still never be 100% certain that foreign keys refer to existing records. It comes down to designing for the future.

The answer to speed is to build your database with referential integrity, drop or disable your constraints and indexes before a bulk load, and re-enable them after the bulk load. It will increase the duration of your bulk load operation over not using constraints at all, but it will be much faster than leaving them enabled and checking them for each insert. So use referential integrity: the pros outweigh the cons.

Reason #4: Mutually exclusive relationships

Too often I’ve seen databases with a foreign key that relates to one of five tables based on the value of a char(1) field. The space conserving mindset that comes up with this implementation is admirable, but it produces far too many negative side effects.

What happens when the char(1) field gets out of sync with the foreign key field? What happens when someone deletes the foreign record or changes its primary key? More orphaned data happens.

The solution is to use five fields that each refer to a single table. You may have more nullable fields that take up more space in the database, but it’s worth it in the long run.

Conclusion

Well, hopefully I’ve convinced you to avoid the urge to be a lazy data modeler, design for the future, use a data modeling tool, and drop constraints during bulk load operations. In short, always use referential integrity. But if not, hopefully you’ll at least understand when people curse your name several years from now. :)

Posted by lrichard Jul 18 2007, 07:02:03 PM EDT
20070607 Thursday June 07, 2007
An Entity Relationship Diagram Example

It seems like a dying art, but I still strongly feel that Entity Relationship Diagrams (ERD) should be the starting point of all software development projects. Since they are for me anyway, I wanted a place to refer colleagues to for how to read these diagrams, and an Entity Relationship Diagram Example seemed like a great place to start.

The Example: A Resource Management Application

Consider that we’re writing a resource management application. The first step to creating an ERD is always to identify the nouns (entities). In this case let’s start with:

  • Company
  • Employee
  • Project; and
  • Technology Project (which are a specific type of Project that perhaps require special fields like “number of entities”)

Here’s the Example Entity Relationship Diagram I’ll decipher piece by piece in this article (click to enlarge):

(note that I’m now using singular names since my somewhat controversial decision to switch to naming entities in the singular)

To read the notations of an Entity Relationship Diagram:

An Entity Relationship Diagram conveys a lot of information with a very concise notation. The important part to keep in mind is to limit what you’re reading using the following technique:

  1. Choose two entities (e.g. Company and Employee)
  2. Pick one that you’re interested in (e.g. how a single Company relates to employees)
  3. Read the notation on the second entity (e.g. the crow’s feet with the O above it next to the Employee entity).

The set of symbols consist of Crow’s feet (which Wikipedia describes as looking like the forward digits of a bird’s claw), O, and dash, but they can be combined in four distinct combinations. Here are the four combinations:

  • Zero through Many (crow's feet, O)
  • One through Many (crow's feet, dash)
  • One and Only One (dash, dash)
  • Zero or One (dash, O)

Zero through Many

If, as in the diagram above, the notation closest to the second entity is a crow’s feet with an O next to it, then the first entity can have zero, one, or many of the second entity. Consequently the diagram above would read: “A company can have zero, one, or many employees”.

This is the most common relationship type, and consequently many people ignore the O. While you can consider the O optional, I consider it a best practice to be explicit to differentiate it from the less common one through many relationship.

One through Many

If, as the next diagram shows, the notation closest to the second entity is a crow’s feet with a dash, then the first entity can have one through many of the second entity. More specifically it may not contain zero of the second entity. The example above would thus read (read bottom to top): “A Project can have one through many Employees working on it.”

This is an interesting combination because it can’t (and for various reasons probably shouldn’t if it could) be enforced by a database. Thus, you will only see these in logical, but not a physical, data models. It is still useful to distinguish, but your application will need to enforce the relationship in business rules.

One and Only One (onne)

If the notation closest to the second entity contains two dashes it indicates that the first entity can have one and only one of the second. More specifically it cannot have zero, and it cannot have more than one. The example would thus read: “An Employee can have one and only one Company.”

This combination is the most common after zero through many, and so frequently people consider the second dash optional. In fact, some ignore both dashes, but I would highly recommend at least using one for clarity so as not to confuse the notation with “I’ll fill in the relationship details later”.

Zero or One

A zero or one relationship is indicated by a dash and an O. It indicates that the first entity can have zero or one of the second, but not more than one. The relationship in the example above would thus read: “A Project can have zero or one Technology Project.”

The zero or one relationship is quite common and is frequently abbreviated with just an O (however it is most commonly seen in a many-to-many relationship rather than the one-to-one above, more on this later).

Relationship Types

Having examined the four types of notation, the discussion wouldn’t be complete without a quick overview of the three relationship types. These are:

  • One to Many
  • Many to Many
  • One to One

One-to-Many

A one-to-many (1N) is by far the most common relationship type. It consists of either a one through many or a zero through many notation on one side of a relationship and a one and only one or zero or one notation on the other. The relationship between Company and Employee in the example is a one-to-many relationship.

Many-to-Many

The next most common relationship is a many-to-many (NM). It consists of a zero through many or one through many on both sides of a relationship. This construct only exists in logical data models because databases can’t implement the relationship directly. Physical data models implement a many-to-many relationship by using an associative (or link or resolving) table via two one-to-many relationships.

The relationship between Employee and Project in the example is a many to many relationship. It would exist in logical and physical data models as follows:

One-to-One

Probably the least common and most misunderstood relationship is the one-to-one. It consists of a one and only one notation on one side of a relationship and a zero or one on the other. It warrants a discussion unto itself, but for now the Project to Technology Project relationship in the example is a one to one. Because these relationships are easy to mistake for traditional one-to-many relationships, I have taken to drawing a red dashed line around them. The red dashed line is not standard at all (although a colleague, Steve Dempsey uses a similar notation), but in my experience it can help eliminate confusion.

Conclusion

I hope you’ve found this a useful example for deciphering and verifying entity relationship diagrams. As always please add any comments, disagreements, thoughts or related resources.

Posted by lrichard Jun 07 2007, 12:38:24 AM EDT
20070510 Thursday May 10, 2007
Export Visio Database Table Names to Excel

If you use the Enterprise Architect edition of Microsoft Visio for data modeling regularly, then there is a good chance that at some point you’ve wanted to export just the table names into Excel. You might want to do this to map logical ERD entities to physical data model tables, track project status by entity, or track overlap between database versions.

Regardless, it turns out to be non-trivial to perform this export, particularly if you are unable to generate to a database to retrieve the table names. The trick is to use the reporting feature of Visio, but there are many reports and report options, and you will need one that is table-based to get the data into Excel easily.

Note: If you are unfamiliar with the capabilities of Microsoft Visio as a data modeling tool you may wish to take a look at my Data Modeling in Microsoft Visio Tutorial.

Export Procedure

1. This may seem a little unusual, but if you don't have any comments in any of your tables (which really shouldn’t be the case), you will need to add comments for at least one of your tables. Without this step Visio will not display tables in a grid format in the report.

2. Now select the somewhat obscure "Report" item off of the "Database" menu.

3. Only the "Table Report" provides the ability to layout database tables in a grid. Select it and click Finish.

4. Under “Predefined logical/physical reports" Click the button labeled "Default To: General Report" and change it to "Default To: Database Report." This will remove tables formatted per page from the end of the report.

5. Under the "Attributes" tab select “Deselect All” then select the “Table stats summary” option.

6. Click “Export to RTF,” save the file somewhere, and open it with Microsoft Word.

7. (optional) If you have any new lines in the notes field you may have to replace them with spaces. Just do a search and replace for "^l" and replace with " ".

8. Now you’re ready to copy and paste.

And you’re done! Hopefully this tutorial will make life easier for you next time you need to export table names from Visio to Microsoft Excel.

Posted by lrichard May 10 2007, 06:23:07 PM EDT
20070504 Friday May 04, 2007
CAML: Nested Too Deep

I discovered an interesting error recently while working with Microsoft’s Collaborative Application Markup Language (CAML) that, surprisingly, had received no ink. Partly what surprises me about this is that the error may require you to rewrite large sections of your code if you haven't previously considered this SharePoint limitation. I'll start with some context, but first of all the error is:

“Some part of your SQL statement is nested too deeply. Rewrite the query or break it up into smaller queries,”