Lee Richardson

All | General | Java | Ruby | .Net
XML
20070816 Thursday August 16, 2007
Surrogate Keys – Data Modeling Mistake 2 of 10

In case you’re new to the series I’ve compiled a list of ten data modeling mistakes that I see over and over that I’m tackling one by one. I’ll be speaking about these topics at the upcoming IASA conference in October, so I’m hoping to generate some discussion to at least confirm I have well founded arguments.

The last post in this series Referential Integrity was probably less controversial than this one. After all, who can argue against enforcing referential integrity? But as obvious as surrogate keys may be to some, there is a good deal of diversity of opinion as evidenced by the fact that people continue to not use them.

I intend to address this topic by way of a fairly ubiquitous example that should draw out all of the arguments. I’ll investigate the options for primary keys in a Person table. I’ll provide four possible options and explain why each of them is a bad choice. I’ll then give four arguments against surrogate keys, which I will then shoot down. So without further ado:

Contender 1: Name and Location

Of course you can’t use name as a primary key for a person, because two people can have the same name and primary keys must be unique. But all too often I’ve seen databases with multiple, sometime numerous, natural (or business-related) primary keys. These databases combine a field like name with enough other fields such that the likelihood of uniqueness is approaching certainty.

In the case of person this would be equivalent to First and Last Name (wouldn’t want to violate first normal form by combining those into one field, but that’s a whole other topic), zip code, and we ought to throw address line 1 in to be safe. This is known as either a compound, composite, or multicolumn index.

Now our chances of uniqueness are close enough to certain to not warrant discussion, so let’s jump right to space and performance. There are three major problems with this approach.

Con 1: Primary key size. The primary key index for person becomes enormous. The database must now catalog four large (probably all varchar) fields. This increases the size of the index which increases overhead for insert, delete and update operations and can even decreases read speed because of the increased disk I/O.

Con 2: Foreign key size. If you have a child table like PhoneNumber then, as the diagram above shows, the foreign key becomes four columns. Those four columns take up a lot of space again. And now a common query like “Get all phone numbers for a person” involves a full table scan, or, if you throw an index on them you end up with another huge index. In fact, you most likely end up propagating huge indexes and vast amounts of data all over the place like some evil data-cancer.

Con 3: Asthetics. It just isn’t pretty. Having four column foreign keys all over the place increases the amount of code you need to write in stored procedures, middle tier, and presentation tier. Even intellisense won’t help you with this one.

Contender 2: Social Security Number

The most obvious choice for a natural key for a person object is social security number, right? Obviously it depends on what type of data person is, but regardless you’ll probably face the following four problems with this primary key candidate:

Con 4: Optionality. The social security administration specifies that U.S. citizens are not required to provide social security numbers in many circumstances. While employment is one of these circumstances, consumers of non-governmental services are definitely not. You can deny service if your consumer won’t provide the number, but is your CEO prepared to turn away business based on a data modeling decision you make?

Con 5: Applicability. Only U.S. citizens have a social security number. Your system might only cater to U.S. citizens now, but will it always?

Con 6: Uniqueness. The social security administration “is adamant” that the numbers are not recycled, even after someone dies. But eventually the numbers will run out. If you visit the slate article cited above, it calculates this date as in the next century. But the math fails to include the fact that location information is encoded in the number which significantly limits the permutations. I don’t know what the real number is, but the point is: you’re gambling with how long until a conflict occurs. And even if the time argument fails to sway you, just think of who is assigning the numbers. How much do you trust a government office to not make an occasional mistake?

Con 7: Privacy. Does your application use primary keys in the user interface tier to uniquely identify records? Does it pass primary keys between pages or use them to identify rows in a combo box? You certainly wouldn’t store such a thing in a cookie or pass it across the wire unencrypted right? Social security information is sensitive information and privacy zealots care very much how you handle their data. Primary keys are necessarily are closer to end users and harder to hide than regular fields. It just isn’t the type of data to take a chance on.

Contender 3: E-mail

So e-mail is a pretty likely choice right? It’s a relatively safe assumption that no two people share an e-mail (maybe). And anyone with a computer has one right? So there should be no uniqueness, privacy or optionality/applicability problems. But how about this:

Con 8: Accidental Denormalization. Do you have more than one e-mail address? Doesn’t everyone? Imagine what a pain it would be if Evite only allowed you one e-mail address per person (ok, well if you didn’t know it does allow you to consolidate accounts for those of us with multiple e-mail addresses). Even if your system only stores one e-mail address per person now, just think what a pain it would be to change the database to allow N e-mail addresses per person.

No… Wait. Really. Think about it...

Yea … yuck.

Contender 4: Username

If your users log in with a username, that’s a likely candidate for a primary key right? But what if they want to update their username (perhaps it was based on a last name that changed). This leads us to:

Con 9: Cascading Updates. If you have a natural key that might change you’ll need to implement some type of cascading updates (whether your DBMS supports it or you write code by hand). In other words, change the username in the person table and you have to change the username foreign key in all child records of the invoices, comments, sales, certifications, defects, or whatever other tables you track. It may not happen often, but when it does it sure will wreak havoc on your indexes. Imagine rebuilding even 10% of your indexes at once because of one operation. It’s just unnecessary.

Con 10: Varchar join speed. I left this to last because it applies to all of contenders thus far and is by far the most compelling argument against natural keys. Nine out of ten natural keys are varchar fields. Even an employee number as generated by another system may have a significant zero. It’s a fact: joining across tables with varchars is always slower than joining across tables with integers. How much? According to Peter Zaitsev who runs a MySql performance blog it’s 20% to 600% slower. And that’s for one join. How many joins do you think comprise an average user interaction? Five? Ten? Twenty? It could very likely make a significant difference to your end user.

And The Winner Is

So surrogate keys win right? Well, let’s review and see if any of the con’s of natural key’s apply to surrogate keys:

  • Con 1: Primary key size – Surrogate keys generally don't have problems with index size since they're usually a single column of type int. That's about as small as it gets.
  • Con 2: Foreign key size - They don't have foreign key or foreign index size problems either for the same reason as Con 1.
  • Con 3: Asthetics - Well, it’s an eye of the beholder type thing, but they certainly don’t involve writing as much code as with compound natural keys.
  • Con 4 & 5: Optionality & Applicability – Surrogate keys have no problems with people or things not wanting to or not being able to provide the data.
  • Con 6: Uniqueness - They are 100% guaranteed to be unique. That’s a relief.
  • Con 7: Privacy - They have no privacy concerns should an unscrupulous person obtain them.
  • Con 8: Accidental Denormalization – You can’t accidentally denormalize non-business data.
  • Con 9: Cascading Updates - Surrogate keys don't change, so no worries about how to cascade them on update.
  • Con 10: Varchar join speed - They're generally int's, so they're generally as fast to join over as you can get.

For every natural key con I see a surrogate key pro. But not everyone agrees. Here are some arguments against them.

Disadvantage 1: Getting The Next Value

Some have argued that getting the next value for a surrogate keys is a pain. Perhaps that’s true in Oracle with its sequences, but generally it just takes a couple minutes research, or you can use ORM tools to hide the details for you.

Disadvantage 2: Users Don’t Understand Them

One argument I uncovered is if users were to perform ad-hoc queries on the database they wouldn’t be able to understand how to use surrogate keys.

Bunk. Bunk, bunk, bunk. End users shouldn’t be fiddling in databases any more than airline customers should be fiddling in airplane engines. And if they are savvy enough to try, then let them learn to perform joins like the pros do.

Disadvantage 3: Extra Joins

Suppose you have users table with a social security number natural primary key, and a phone number child table with social security as a foreign key.

If your user enters a social security number on a log in screen you could theoretically get their phone numbers without accessing the users table. In a surrogate key world you would have to look up the surrogate key in the person table before getting their phone numbers.

While this is true, I have found that with most CRUD applications there are few times when this scenario comes up. The vast majority of queries involve already known surrogate keys. So while this argument may be true in some situations, it just isn’t true enough of the time to be significant.

Disadvantage 4: Extra Indexes

I find this to be the most persuasive argument against natural keys. If your person object would normally have a natural key on social security number, then in surrogate-world you should have a unique index on social security number in addition to your primary key index on the surrogate key. In other words, you now have two indexes instead of one. In fact, if you have N indexes per table in natural key world, you’ll always have N + 1 indexes in surrogate key world.

While the additional indexes do indeed add indexes, which increase database size, and slow insert and update performance, you could offset some of that expense by converting your old natural key, social security number for example, to a clustered index.

Or you could just relax in the knowledge that there are pro’s and con’s to every architectural decision and for surrogate keys the pro’s outweigh the con’s.

Summary

So now if some well meaning DBA argues to use natural keys on your next project you should have ten arguments against them, which will double as ten arguments for surrogate keys, and you should be prepared with rebuttals for four arguments against surrogate keys. Whew, that was a lot. But I assure you, if you use surrogate keys today it will definitely make your life easier in the long run.

Posted by lrichard Aug 16 2007, 08:08:45 AM EDT
20070730 Monday July 30, 2007
Sync Your Database to SharePoint Using SSIS

Just a heads up that DevX just published an article of mine today. The article is entitled Sync Your Database to SharePoint Using SSIS. The article covers how to import and export SharePoint list items using Collaborative Application Markup Language (CAML), SharePoint's web services API, and SQL Server Integration Services.

The latter half of the article is a fairly detailed how-to, but the former half covers what SharePoint lists are, what SSIS is, and why you would want to use them all together. I hope you find the article useful, and feel free to comment here if you have thoughts on the article.

Posted by lrichard Jul 30 2007, 06:47:01 PM EDT
20070718 Wednesday July 18, 2007
Referential Integrity - Data Modeling Mistake 1 of 10

In my mind data models are like the foundations of a house. Whether you use ORM or a more traditional modeling tool, they form the base of the entire rest of your project. Consequently, every decision you make (or don’t make) regarding your data model during the design phase(s) of your project will significantly affect the duration of your project and the maintainability and performance of your application.

You could de-emphasize up-front planning, but every correction you make to the data model once code has been written on top of it will introduce significant delays to the project as developers refactor data access, business logic, and user interface tiers. That’s why mistakes made during design are expensive, and it would behoove any architect (or project manager) to be well aware of the repercussions of data model decisions and minimize mistakes before construction begins.

After years of working with or maintaining applications based on poorly designed data models, and after years of modeling my own databases from scratch I’ve seen and made a lot of mistakes. So, I’ve compiled ten of the most common ones and the arguments for and against them.

I’ll be speaking on this topic in the upcoming IASA conference in October, and so I wanted to vet these ideas with the community. I know there are strong feelings on these topics, so please help me out by commenting if you feel I’ve missed something or am off base.

I’ll start with Mistake #1: Not Using Referential Integrity in this post. I'll give four common reasons for avoiding referential integrity and then rebuff them. I'll then cover the more controversial Mistake #2 Not Using Surrogate Keys in my next post.

Mistake #1 – Not using referential integrity

I’ve heard a lot of excuses for not using referential integrity, but I’ve never been swayed by one of them. If you have a record with a foreign key field you should be 100% certain that it will always refer to the primary key of an existing record in one and only one foreign table. The last thing you want to do is write large amounts of conditional logic because you aren’t 100% certain that you aren’t dealing with orphaned data. Nonetheless, here are some almost compelling arguments I’ve heard for not using it:

Reason #1: Project Too Small

If your project or database is only a few tables and a couple lines of code then you don’t need referential integrity right? Wrong, numerous projects start small, get big, and have major problems because of it. It doesn’t take much extra time to put in constraints. Avoid the urge to be lazy.

Reason #2: Accidental Oversight

Numerous applications I’ve seen forget a relationship or two. This is borne of writing and executing database creation statements by hand and is the reason that data modeling tools exist. When you visualize your database in a model it’s hard to miss a relationship. So use a modeling tool and keep it in sync with your database, you won’t regret it.

Incidentally I like Microsoft Visio for data modeling because you can change your schema during development and Visio won’t delete your data. This enables you to keep your data model in sync with the database for the entire lifetime of the database. There are other benefits too, if you’re interested see my article on data modeling in Microsoft Visio.

Reason #3: Maximize Insert Speed

It’s a fact: indexes and constraints slow down insert and update operations. If your application is heavy on writing and light on reading, then you could argue referential integrity isn’t for you. This argument is often combined with the “Only one application ever uses my database” argument.

There are two problems with this. One problem comes when either a well meaning DBA modifies data by hand and messes up the state of the database, or more realistically when there’s a bug in the application that accidentally orphans data. Orphaned data may not affect your application, but a well designed solution should plan for the future. When that data warehouse project finally gets around to importing data from your database, what do they do with the orphaned data? Ignore it? Try to integrate it? Who knows? If you’ve been in this position, you’ll know what I mean when I say the responsible architect’s name (or their app) will be synonymous with a curse word.

The second problem is that even if a database without referential integrity don’t end up with orphaned data, a second application that might want to integrate can still never be 100% certain that foreign keys refer to existing records. It comes down to designing for the future.

The answer to speed is to build your database with referential integrity, drop or disable your constraints and indexes before a bulk load, and re-enable them after the bulk load. It will increase the duration of your bulk load operation over not using constraints at all, but it will be much faster than leaving them enabled and checking them for each insert. So use referential integrity: the pros outweigh the cons.

Reason #4: Mutually exclusive relationships

Too often I’ve seen databases with a foreign key that relates to one of five tables based on the value of a char(1) field. The space conserving mindset that comes up with this implementation is admirable, but it produces far too many negative side effects.

What happens when the char(1) field gets out of sync with the foreign key field? What happens when someone deletes the foreign record or changes its primary key? More orphaned data happens.

The solution is to use five fields that each refer to a single table. You may have more nullable fields that take up more space in the database, but it’s worth it in the long run.

Conclusion

Well, hopefully I’ve convinced you to avoid the urge to be a lazy data modeler, design for the future, use a data modeling tool, and drop constraints during bulk load operations. In short, always use referential integrity. But if not, hopefully you’ll at least understand when people curse your name several years from now. :)

Posted by lrichard Jul 18 2007, 07:02:03 PM EDT
20070607 Thursday June 07, 2007
An Entity Relationship Diagram Example

It seems like a dying art, but I still strongly feel that Entity Relationship Diagrams (ERD) should be the starting point of all software development projects. Since they are for me anyway, I wanted a place to refer colleagues to for how to read these diagrams, and an Entity Relationship Diagram Example seemed like a great place to start.

The Example: A Resource Management Application

Consider that we’re writing a resource management application. The first step to creating an ERD is always to identify the nouns (entities). In this case let’s start with:

  • Company
  • Employee
  • Project; and
  • Technology Project (which are a specific type of Project that perhaps require special fields like “number of entities”)

Here’s the Example Entity Relationship Diagram I’ll decipher piece by piece in this article (click to enlarge):

(note that I’m now using singular names since my somewhat controversial decision to switch to naming entities in the singular)

To read the notations of an Entity Relationship Diagram:

An Entity Relationship Diagram conveys a lot of information with a very concise notation. The important part to keep in mind is to limit what you’re reading using the following technique:

  1. Choose two entities (e.g. Company and Employee)
  2. Pick one that you’re interested in (e.g. how a single Company relates to employees)
  3. Read the notation on the second entity (e.g. the crow’s feet with the O above it next to the Employee entity).

The set of symbols consist of Crow’s feet (which Wikipedia describes as looking like the forward digits of a bird’s claw), O, and dash, but they can be combined in four distinct combinations. Here are the four combinations:

  • Zero through Many (crow's feet, O)
  • One through Many (crow's feet, dash)
  • One and Only One (dash, dash)
  • Zero or One (dash, O)

Zero through Many

If, as in the diagram above, the notation closest to the second entity is a crow’s feet with an O next to it, then the first entity can have zero, one, or many of the second entity. Consequently the diagram above would read: “A company can have zero, one, or many employees”.

This is the most common relationship type, and consequently many people ignore the O. While you can consider the O optional, I consider it a best practice to be explicit to differentiate it from the less common one through many relationship.

One through Many

If, as the next diagram shows, the notation closest to the second entity is a crow’s feet with a dash, then the first entity can have one through many of the second entity. More specifically it may not contain zero of the second entity. The example above would thus read (read bottom to top): “A Project can have one through many Employees working on it.”

This is an interesting combination because it can’t (and for various reasons probably shouldn’t if it could) be enforced by a database. Thus, you will only see these in logical, but not a physical, data models. It is still useful to distinguish, but your application will need to enforce the relationship in business rules.

One and Only One (onne)

If the notation closest to the second entity contains two dashes it indicates that the first entity can have one and only one of the second. More specifically it cannot have zero, and it cannot have more than one. The example would thus read: “An Employee can have one and only one Company.”

This combination is the most common after zero through many, and so frequently people consider the second dash optional. In fact, some ignore both dashes, but I would highly recommend at least using one for clarity so as not to confuse the notation with “I’ll fill in the relationship details later”.

Zero or One

A zero or one relationship is indicated by a dash and an O. It indicates that the first entity can have zero or one of the second, but not more than one. The relationship in the example above would thus read: “A Project can have zero or one Technology Project.”

The zero or one relationship is quite common and is frequently abbreviated with just an O (however it is most commonly seen in a many-to-many relationship rather than the one-to-one above, more on this later).

Relationship Types

Having examined the four types of notation, the discussion wouldn’t be complete without a quick overview of the three relationship types. These are:

  • One to Many
  • Many to Many
  • One to One

One-to-Many

A one-to-many (1N) is by far the most common relationship type. It consists of either a one through many or a zero through many notation on one side of a relationship and a one and only one or zero or one notation on the other. The relationship between Company and Employee in the example is a one-to-many relationship.

Many-to-Many

The next most common relationship is a many-to-many (NM). It consists of a zero through many or one through many on both sides of a relationship. This construct only exists in logical data models because databases can’t implement the relationship directly. Physical data models implement a many-to-many relationship by using an associative (or link or resolving) table via two one-to-many relationships.

The relationship between Employee and Project in the example is a many to many relationship. It would exist in logical and physical data models as follows:

One-to-One

Probably the least common and most misunderstood relationship is the one-to-one. It consists of a one and only one notation on one side of a relationship and a zero or one on the other. It warrants a discussion unto itself, but for now the Project to Technology Project relationship in the example is a one to one. Because these relationships are easy to mistake for traditional one-to-many relationships, I have taken to drawing a red dashed line around them. The red dashed line is not standard at all (although a colleague, Steve Dempsey uses a similar notation), but in my experience it can help eliminate confusion.

Conclusion

I hope you’ve found this a useful example for deciphering and verifying entity relationship diagrams. As always please add any comments, disagreements, thoughts or related resources.

Posted by lrichard Jun 07 2007, 12:38:24 AM EDT
20070510 Thursday May 10, 2007
Export Visio Database Table Names to Excel

If you use the Enterprise Architect edition of Microsoft Visio for data modeling regularly, then there is a good chance that at some point you’ve wanted to export just the table names into Excel. You might want to do this to map logical ERD entities to physical data model tables, track project status by entity, or track overlap between database versions.

Regardless, it turns out to be non-trivial to perform this export, particularly if you are unable to generate to a database to retrieve the table names. The trick is to use the reporting feature of Visio, but there are many reports and report options, and you will need one that is table-based to get the data into Excel easily.

Note: If you are unfamiliar with the capabilities of Microsoft Visio as a data modeling tool you may wish to take a look at my Data Modeling in Microsoft Visio Tutorial.

Export Procedure

1. This may seem a little unusual, but if you don't have any comments in any of your tables (which really shouldn’t be the case), you will need to add comments for at least one of your tables. Without this step Visio will not display tables in a grid format in the report.

2. Now select the somewhat obscure "Report" item off of the "Database" menu.

3. Only the "Table Report" provides the ability to layout database tables in a grid. Select it and click Finish.

4. Under “Predefined logical/physical reports" Click the button labeled "Default To: General Report" and change it to "Default To: Database Report." This will remove tables formatted per page from the end of the report.

5. Under the "Attributes" tab select “Deselect All” then select the “Table stats summary” option.

6. Click “Export to RTF,” save the file somewhere, and open it with Microsoft Word.

7. (optional) If you have any new lines in the notes field you may have to replace them with spaces. Just do a search and replace for "^l" and replace with " ".

8. Now you’re ready to copy and paste.

And you’re done! Hopefully this tutorial will make life easier for you next time you need to export table names from Visio to Microsoft Excel.

Posted by lrichard May 10 2007, 06:23:07 PM EDT
20070504 Friday May 04, 2007
CAML: Nested Too Deep

I discovered an interesting error recently while working with Microsoft’s Collaborative Application Markup Language (CAML) that, surprisingly, had received no ink. Partly what surprises me about this is that the error may require you to rewrite large sections of your code if you haven't previously considered this SharePoint limitation. I'll start with some context, but first of all the error is:

“Some part of your SQL statement is nested too deeply. Rewrite the query or break it up into smaller queries,”

So let's back up and describe what CAML is for everyone who isn’t yet experiencing this problem.

CAML is an XML based query language that is similar to, and serves many of the same purposes as SQL and DDL. However, CAML is used primarily to interact with SharePoint lists. You can retrieve data using something like:

<Where><Eq><FieldRef Name='Title'/><Value Type='Text'>hello</Value></Eq></Where>

Which is equilivant to WHERE Title = ’hello’ in a SQL statement. At this point I’m sure you’re asking why exchange 80 characters for 21? I suppose the answer is related to how CAML is primarily used in web services requests. Regardless, the nested too deeply error occurs when you try to insert large amounts of data into a list.

The insert syntax is actually all batch based, so if you have 100 list items to insert into a list, you build them all into one big CAML statement and then send it across the wire with a call to Lists.UpdateListItems(). Your CAML statement will look something like the following:

<Batch OnError="Continue">
    <Method ID="1" Cmd="New">
        <Field Name="Title">Hello<Field>
        <Field Name="Document">5</Field>
   </Method>
   <Method ID="2" Cmd="New">
        <Field Name="Title" >World</Field>
        <Field Name="Document">5</Field>
   </Method>
</Batch>

The problem comes when you want to insert something like 19,642 list items into a list. SharePoint complains with “Some part of your SQL statement is nested too deeply. Rewrite the query or break it up into smaller queries,” and suddenly you’re stuck with an enormous CAML statement that you need to break up into smaller batches. How small? Well, for me 300 worked and 500 didn’t, so I set my batch size to 300.

This means that if you are currently experiencing this problem, then have fun rewriting your code to use smaller batches. And, if you're lucky enough to be reading this before you experience the error, then prepare yourself now for this possibility.

Or, if performance isn’t an issue and you don’t want to or can’t rewrite the code that built your CAML then feel free to use this code to batch up CAML batches into smaller batches.

/// <summary>

/// Breaks a larg CAML query into smaller batches to avoid the error "Some part of your SQL statement is nested too deeply. Rewrite the query or break it up into smaller queries."

/// </summary>

/// <param name="listService">The SharePoint list service to execute the CAML against.</param>

/// <param name="strListName">The name of the list to execute the CAML against.</param>

/// <param name="elementLargeBatch">The CAML batch list of commands to be broken up.</param>

/// <param name="intBatchSize">The size of batches to use.  If unsure use 300, it seems to work fairly well.</param>

/// <returns>The results of all batched queries.</returns>

public static XmlNode ExecuteLargeQuery(

    Lists listService,

    string strListName,

    XmlElement elementLargeBatch,

    int intBatchSize

    ) {

     

    // calculate useful information

    int intMethodCount = elementLargeBatch.ChildNodes.Count;

    int intBatchCount = (int)Math.Ceiling((double)intMethodCount / (double)intBatchSize);

     

    // prepare xml documents for batches and results

    XmlDocument xmlDocBatch = new XmlDocument();

    XmlDocument xmlDocResults = new XmlDocument();

    XmlElement elementResults = xmlDocResults.CreateElement("Results");

   

    // for each batch

    for (int intCurrentBatch = 0; intCurrentBatch < intBatchCount; intCurrentBatch++) {

        int intMethodStart = intCurrentBatch * intBatchSize;

        int intMethodEnd = Math.Min(intMethodStart + intBatchSize - 1, intMethodCount - 1);

     

        XmlElement elementSmallBatch = ListHelper.CreateBatch(xmlDocBatch);

     

        // for each method in the batch

        for (int intCurrentMethod = intMethodStart; intCurrentMethod <= intMethodEnd; intCurrentMethod++) {

            XmlElement element = (XmlElement)elementLargeBatch.ChildNodes[intCurrentMethod];

            elementSmallBatch.AppendChild(xmlDocBatch.ImportNode(element, true));

        }

     

        // execute the batch

        XmlNode nodeBatchResult = listService.UpdateListItems(strListName, elementSmallBatch);

       

        // add the results of the batch into the results xml document

        foreach (XmlElement elementResult in nodeBatchResult.ChildNodes)

            elementResults.AppendChild(xmlDocResults.ImportNode(elementResult, true));

     

        // clean up

        xmlDocBatch.RemoveAll();

     

    }

     

    return elementResults;

}

Useful MSDN References

Incidentally, these MSDN references might be useful if you’re interested in more info:

Introduction to Collaborative Application Markup Language (CAML)

CAML Reference: Query Schema

CAML Reference: Batch Element

Posted by lrichard May 04 2007, 01:04:38 PM EDT
20070425 Wednesday April 25, 2007
C# 3.0: The Sweet and Sour of Syntactic Sugar

I’ve just started reading the blog of Scott Guthrie who is a general manager at Microsoft and is currently writing about the new C# 3.0 (code named Orcas). The post I felt strongly about describes three new language features including:

  • Automatic Properties
  • Object Initializers, and
  • Collection Initializers

All three features are purely syntactic sugar: they add nothing of real value to the language (unlike generics in C# 2.0 for example). One features is wonderful, while two are terrible.

The Sweetest Syntactic Sugar Ever

Automatic properties are fabulous. I’ve always grumbled about the number of lines required to create simple properties. Anything that can turn:

public class Person {
    private string _firstName;

    public string FirstName {
        get { return _firstName; }
        set { _firstName = value; }
    }
}

Into:

public class Person {
    public string FirstName { get; set; }
}

Gets my vote.

Sickeningly Sweet

Perhaps I’m a purest, but with more language conveniences comes: a steeper learning curve because of a larger language, and decreased readability because of the numerous ways of doing the same thing. Picture reading this in someone else’s code:

Person p = new Person("Lee", "Richardson") {
    Phone = "111-111-1111",
    Age = 29,
    Company = new Company("Near Infinity") {
        Phone = "222-222-2222"
    }
}

Who would write this crap? Everyone else, now that Microsoft lets them. I vote for the verbose, the readable, the single approach:

Company company = new Company("Near Infinity");
company.Phone = "222-222-2222";

Person person = new Person("Lee", "Richardson");
person.Phone = "111-111-1111";
person.Age = 29;
person.Company = company;

Hmmm, and by the way did you count the number of lines?

Conclusion

So do the differences between my thoughts on automatic properties contradict with my thoughts on object initializers and collection initializers? They don’t. Basically a little sugar is good, but too much ruins the dish.

Posted by lrichard Apr 25 2007, 01:51:53 PM EDT
20070423 Monday April 23, 2007
Parameter passing in C#

The topic of parameter passing in C# keeps coming up, and so I thought this post might be useful. It's been my most popular post on my rapid application development blog, so I thought I'd repost here. Please kick the post if you like it.

So there I was, merrily browsing the Internet when I came across this fantastic site on topics such as:

  • Implementing a singleton pattern in C#,
  • Type initializers,
  • Static constructors,
  • Delegates and events,
  • And as my title suggests: Parameter passing in C#.

In short, all those things you rarely need to know to get your job done, but that separate mediocre developers from good ones. The site is by Jon Skeet and the articles are informative, well researched, well explained, well written. Here's the C# part of his site:

http://www.yoda.arachsys.com/csharp/

The article that caught my attention made sense to me, but being a very visual person I couldn't help but think that some pictures could really help illustrate the points. So without further ado, I illustrated the article. You probably don't need to read the article to understand this post - but you should:

http://www.yoda.arachsys.com/csharp/parameters.html.

Note: you can click the images to get a clearer view.

1. Value Types

Notice that the values live inside the box which will not be the case for reference types. Also, the assignment operation copies the value inside of the box, this is important to compare with reference types.

Quiz: What is the result of the WriteLine statement?

Answer: 5

2. Reference Types

Variables that hold reference types actually hold a reference to a location in memory (on the heap). So assignment operations copy the address. Notice this is still consistent with diagram #1, the copy operation copies the value inside of the box.

Quiz: What is the result of the WriteLine statement?

Answer: hello world

3. Immutable Reference Types

Immutable reference types like strings behave just like regular reference types except they don't provide a way to change their value.

Quiz: What is the result of the WriteLine statement?

Answer: hello

4. Value Types Passed by Value

Passing a variable to a function by value is equivilant to instantiating a new variable and assigning it to the first (well, ignoring scope issues and such). Notice that the diagram below is nearly identical to diagram #1.

Quiz: What is the result of the WriteLine statement?

Answer: 5, same as #1

5. Reference Types Passed by Value

In #4 I said passing a variable to a function by value is equivilant to instantiating a new variable and assigning it to the first. Is that still true of reference types? Yup. And did you notice there's an implicit assignment statement when passing by value? As you'll see shortly there won't be when passing by reference.

Quiz: What is the result of the WriteLine statement?

Answer: hello world

6. Value Types Passed by Reference

Passing by reference doesn't involve an implicity copy, instead it instantiates the inner variable to the address in memory of the outer variable. Then all references to the inner variable are implicitly dereferenced for you and voila, magically you're changing the value of the outer variable.

Quiz: What is the result of the WriteLine statement?

Answer: 10, and notice how different the diagram and results are than #1 and #4.

7. Reference Types Passed by Reference

Really this is no different than value types passed by reference (#6), except calling sb.Append() from an inner variable is dereferenced once to get to the outer variable and again because the outer variable is itself a pointer.

By the way, when you get to the section in Jon's article called:

"Sidenote: what is the difference between passing a value object by reference and a reference object by value?"

Please read it carefully, it's an extremely good point. It can be sumed up by comparing the final assignment statement above (Reference Types Passed by Reference) to the final assignment statement in in diagram #5 (Reference Types Passed by Value). It's a subtle, but important difference.

Oh and the quiz, what is the value of the Console.WriteLine in #7?

Answer: NullReferenceExceptinon – Object reference not set to an instance of an object

Still confused? Then I didn't do my job right, since this is the point in the article when I thought pictures would help. So please post your thoughts whether it makes sense or not.

- Lee

Posted by lrichard Apr 23 2007, 05:34:18 PM EDT
20070415 Sunday April 15, 2007
Multi Value Columns Solution #2 - Custom Activities in SPD

In the previous post in this series (Multi-Value Columns in SharePoint Designer - Solution #1), I described a problem where SharePoint Designer can’t send e-mail to multiple recipients if those recipients exist inside of a multi-value column in a SharePoint list. The simple hacky solution I described was to temporarily turn the column into a single value column just for SharePoint Designer. But this approach has problems, and there is a better way: developing custom activities in Visual Studio for use in SharePoint Designer.

In this post I will describe how to develop a custom activity in Visual Studio that will also solve this problem and I will also describe how to install it on a SharePoint server so that SharePoint Designer clients can automatically download and use it. Before I do let me back up and tell you why, from my perspective, this approach is probably not the best way to go.

SharePoint Designer is a Microsoft Office product that replaces FrontPage, integrates tightly with SharePoint, and allows non-developers (aka “knowledge workers” in the Microsoft lingo) to create simple workflows without writing any code.

The problem is that simple workflows and multi-value columns are like oil and water: not so compatible. If you’re using multi-value columns then your knowledge workers should admit defeat and let developers create the workflows in Visual Studio using the Windows Workflow Foundation (which, incidentally, will be the topic for third article in this series).

Still, you may have a good reason for continuing to develop workflows in SharePoint Designer, and I’ve already gone through the pain of writing and installing custom activities, so hopefully this post will make life easier for someone somewhere.

Creating the Custom Activity in Visual Studio

The code to create the custom activity in Visual Studio is the most interesting part of this solution. Make sure to check out the GetEmailAddressesFromUsernames() method if you have time to review the code. Here is the procedure assuming this is your first time working with Windows Workflow Foundation in Visual Studio.

1. The first step is to download and install Visual Studio 2005 extensions for .NET Framework 3.0 (Windows Workflow Foundation) which will add workflow options to Visual Studio.

2. If you aren’t running on Windows Server 2003, then you probably need to install the Windows Workflow Foundation DLL’s. You’ll know there’s a problem if, after creating the project in Visual Studio, your SharePoint references are invalid. I picked these DLL’s up from my Windows Server 2003 machine (actually a VMWare virtual machine) from the following location:

C:\Program Files\Common Files\Microsoft Shared\web server extensions\12\ISAPI

You should install these to the Global Assembly Cache (GAC) to make Visual Studio happier, although you may still need to reference them for your project. The easy way to install to the GAC is just to copy them to somewhere on your local (non-W2K3) machine and then drag them all over to C:\Windows\assembly.

3. After completing step #1 when you open Visual Studio you should get a new tree node under “Visual C#” (or Visual Basic) called “Workflow.” Select that, then “Workflow Activity Library” then call the project something like “MultiRecipientMail”.

4. Visual Studio should automatically create a blank activity called Activity1.cs. If you want a more reasonable name then delete it and “Add” a “New Item” of type “Activity” called something like “MultiRecipientMailActivity.” Don’t make the same painful mistake as me: name your activity something other than the name of the project or anything in the namespace.

5. At this point you could drag and drop a “Replicator” activity on the design surface and put a “Send Mail” activity inside, but I’ll cover that in my next post. For now I think code is the clearer way to go. Hit F7 or right click and “View Code” and paste the following which I’ll try to comment in-line rather than in this article.

(for the code see the original article Multi Value Columns Solution #2 - Custom Activities in SPD on BlogSpot)

You’ll need to substitute your own public key token from step 9 in the “Assembly=” statement.

12. Next you need to tell SharePoint designer clients to trust this new dll. To do so open the Web.Config in the virtual directory for your site (e.g. C:\Inetpub\wwwroot\wss\VirtualDirectories\34089\ web.config). Find the section called “” and add the following new line (substitute your own public token):

(again substitute your own public key token)

13. Restart IIS by opening a command prompt and typing “iisreset.” SharePoint will pick up the new MultiRecipientMail.ACTIONS file, find the dll in the GAC, and provide the new information to any SharePoint Designer clients that access the site.

14. Open SharePoint Designer and either navigate to the “Workflows\Multi Recipient Email.xoml” project from my previous article or create a new workflow project based on a list with a multi-value column. You should get a message like the following while SharePoint Designer downloads the new dll:

15. If everything worked correctly you should now be able to 1. click Actions; 2. More Actions; 3. Select the “Custom Actions” category from the dropdown (this category came from the .ACTIONS file in Step 11); and 4. Select the “Multi Recipient E-mail” action. If it doesn’t show up in the Workflow Designer page after clicking "add", then you probably referenced it incorrectly in the Web.Config.

16. Now you should be able to 1. click the Fx button next to “To;” 2. Select your multi value column (e.g. Peers To Review from my previous article); 3. Hit Ok; and 4. Fill in the remaining fields with values.

17. Now if you head back to Sharepoint, select the dropdown of a list item in the list associated with your workflow, click “Workflows”, select the workflow you created in SharePoint Designer, and click “Start” you should receive an e-mail at each address in your multi-value column.

Well … at least it worked for me. :) I did kitchen test this article, but please leave comments if it doesn't work for you.

Redeploying

If you want to make any changes to the code the steps to redeploy are:

1. Compile (you don’t need to sign again, that’s one time only)

2. Copy the dll to your W2K3 server

3. Re-install the dll to the GAC. You can just re-copy it over to c:\windows\assembly if you like.

4. Restart iis with an “iisreset” from the command line (no need to change the .ACTIONS or Web.Config files)

5. Finally, if you need to make changes to your workflow, you may need to restart SharePoint Designer to download the new dll (I usually have to).

You’re Done! Easy huh?

So you should now have a reusable component that non-developers can use in SharePoint Designer to send e-mail to multiple recipients as determined by a multi-value column in a SharePoint list. Of course if you decide that developing the workflow in Visual Studio is better, then the third article in this series may be for you.

Posted by lrichard Apr 15 2007, 12:26:09 PM EDT
20070409 Monday April 09, 2007
Multi-Value Columns in SharePoint Designer - Solution #1

Recently I’ve been working with Microsoft Office SharePoint Server (MOSS) 2007. Since this is my first post on the topic I’d love to start at a high level about what it is and how it works, but let’s get to the interesting stuff: what doesn’t work well and how to get around it.

Specifically, this will be the first in a series regarding a deficiency in the workflow component of SharePoint that doesn’t allow you to send e-mail to multiple recipients from a multi-value column of a list using SharePoint Designer 2007 (SPD).

This first post will describe the problem, provide lots of screenshots so it can double as a fast introduction to workflows in SharePoint for the uninitiated, and then provide a quick hacky solution.

The Multi-Value Column Problem

Here’s how it should work. You create a list (e.g. Performance Review) and add a column (e.g. “Peers To Review”) and select a type of “Person or Group” with “Allow multiple selections” set to Yes.

(click any image to enlarge)

Now if you add a blank workflow in SharePoint Designer (from the SharePoint Content tab):

And you select the list you created (Performance Review):

Then you should be able to

  1. Add a “Send an Email” action
  2. View the message’s properties
  3. Select the recipient
  4. Add a function by clicking “Workflow Lookup”
  5. Select the Current Item (aka the current list item, which is like a row in a spreadsheet and in my example this would be the current performance review that the workflow is being run on)
  6. Then select the new column “Peers to Review”

But wait. Where’s your column? And here is the problem. The UI of SharePoint Designer filters out all columns that are marked with “Allow multiple selections.”

Simple Hacky Solution #1

Believe it or not the Workflow Engine knows how to send e-mail to a column with “Allow multiple selections” but it’s the SharePoint Designer UI that doesn’t. This means that a simple solution to the problem is this:

  1. Turn off “Allow multiple selections” for your column (note the warning “This will remove all person values except the first one”)
  2. Close and reopen the workflow in SharePoint Designer and follow steps above and magically your column will appear in the dropdown

  3. Finish creating your workflow
  4. Turn back on “Allow multiple selections”
  5. Create an item for the list and run the workflow you created in SharePoint Designer

And voila, it works! Both (or all) recipients have received an e-mail.

Sadly, this simple technique isn’t an acceptable solution for the long term. What happens if you want to change your workflow down the road? You may need to turn “Allow multiple selections” back off, thus deleting your multi-user data. A better solution is to create a custom action in Visual Studio and get it to plug into SharePoint Designer. And that will be the topic of my next post:

Multi-Value Columns in SharePoint Designer - Solution #2

---

Note 1: this series is largely in response to the several users having similar problems at: http://forums.microsoft.com/MSDN/showpost.aspx?postid=1443799&siteid=1

Note 2: If you’re interested in the ERD for the topics in this SharePoint post this diagram might be useful to get you up to speed:

Note 3: This is double posted at http://rapidapplicationdevelopment.blogspot.com/2007/04/multi-value-columns-in-sharepoint.html

Posted by lrichard Apr 09 2007, 12:58:25 PM EDT
20070326 Monday March 26, 2007
Entity Naming Conventions

It seems as though as software developers mature they develop consistency in their approach to just about every aspect of their work, regardless if there is a good reason for adopting a particular practice or not.

For instance, in data modeling I developed the habit of always naming my tables in the plural – Employees instead of Employee, and such. There’s no reason for this convention, other than perhaps I copied what I saw from the Northwind database.

But it’s important to question these practices from time to time, and after over seven years of doing things the same way I have decided to make a change. And for the second time now (see my post The Importance of a Logical Data Model), it was a colleague: Steve Dempsey who initiated the change. So why would one opt for singular names over plural ones?

Developers might chose singular names because they are shorter and require less typing, but this argument never held for me because of tools like intellisense and code generation (not to mention touch typing). But Steve is extremely adamant about singular names for a different reason: because of relationship readability.

For instance, in Sharepoint, workflows relate to events. Specifically, a