Thursday March 27, 2008 Lee Richardson
In my last post I described how the Where() function works for LINQ to Objects via extension methods and the yield statement. That was interesting. But where things get crazy is how the other LINQ technologies, like LINQ to SQL, use extension methods. In particular, it’s their use of a new C# 3 feature called expression trees that makes them extremely powerful. And it’s an advantage that more traditional technologies like NHibernate will never touch until they branch out from being a simple port of a Java technology. In this post I’ll explain the inherent advantage conferred on LINQ technologies by expression trees and attempt to describe how the magic works.
What’s so Magic about LINQ to SQL?
LINQ to SQL (and its more powerful unreleased cousin LINQ to Entities) is a new Object Relational Mapping (ORM) technology from Microsoft. It allows you to write something like the following:
IEnumerable<Product> products =
    northwindDataContext.Products.Where(
        p => p.Category.CategoryName == "Beverages"
    );
Which, as you’d expect, returns products from the database whose category is Beverages. But wait, aren’t you impressed? If not, read over that code again; you should be very impressed. In the background that C# code is converted into the following SQL:
SELECT [t0].[ProductID], [t0].[ProductName], ...
FROM [dbo].[Products] AS [t0]
LEFT OUTER JOIN [dbo].[Categories] AS [t1]
    ON [t1].[CategoryID] = [t0].[CategoryID]
WHERE [t1].[CategoryName] = @p0
In other words it’s pretty smart. It isn’t just returning all products and filtering them in memory using the LINQ to Objects version of Where() I discussed previously.
Doing something like that using NHibernate Criteria would require something like this:
ICriteria c = session.CreateCriteria(typeof(Product));
c.Add(Expression.Eq("Category.CategoryName", "Beverages"));
IEnumerable<Product> products = c.List<Product>();
You could use HQL too, but both NHibernate options suffer from the same problem. Did you spot it?
The LINQ to SQL version is taking actual strongly typed C# code and somehow smartly converting it to useful SQL. The NHibernate version does the same thing, but always using a weakly typed alternative. In other words the column “CategoryName” in NHibernate is a string. If that column’s name or data type changes, you won’t find out until runtime. And that is the beauty of LINQ to SQL: you’ll find more errors at compile time. And if you’re like me you want the compiler to find your mistakes before the unit tests that you (or your fellow developers) may or may not have written do.
So you’re probably now wondering if you can put strongly typed C# in your where clause and it somehow magically gets converted to SQL, what’s the limit? If you put in a String.ToLower() or StartsWith() will it get converted to equivalent SQL? What about a loop or conditional? A function call? A recursive function call? At some point it has to break down and either return all products and filter them in memory or just fail right? Before answering those questions we need to understand what’s going on.
Understanding the Magic
The magic happens in a class called Expression<T>. Expression takes a generic argument that must be a delegate type, usually one of the built-in Func delegates. However, the compiler will only convert a lambda expression into an Expression<T>. That’s right: not a delegate instance or an anonymous method, only a lambda expression. In my deferred execution post, where I explained what lambda expressions are, I said they were essentially syntactic sugar for anonymous methods. Well, the emphasis is on the “essentially”, because here they really aren’t sugar at all. When you assign a lambda expression to an Expression<T>, the compiler, rather than generating the IL to evaluate the expression, generates IL that constructs an abstract syntax tree (AST) for the expression! You can then walk the tree and perform actions based on the code in the lambda expression.
Below is an example adapted from the .Net Developer’s guide on MSDN that shows how this works:
// convert the lambda expression to an abstract syntax tree
Expression<Func<int, bool>> expression = i => i < 5;
ParameterExpression param = (ParameterExpression)expression.Parameters[0];
// this next line would fail if we change the Lambda expression much
BinaryExpression operation = (BinaryExpression)expression.Body;
ParameterExpression left = (ParameterExpression)operation.Left;
ConstantExpression right = (ConstantExpression)operation.Right;
Console.WriteLine("Decomposed expression: {0} => {1} {2} {3}",
    param.Name,
    left.Name,
    operation.NodeType,
    right.Value
);
This outputs “Decomposed expression: i => i LessThan 5”. The first line is the most important. It defines an Expression that takes a delegate with a single int parameter and a return type of bool. It then instantiates the Expression to a simple lambda expression. Incidentally this would also work if we defined our own Delegate:
public delegate bool LessThanFive(int i);

public static void DoStuff() {
    Expression<LessThanFive> expression = i => i < 5;
}
It would, however, not work if we used an anonymous method:
Expression<Func<int, bool>> expression = delegate(int i) { return i < 5; };
While that looks legal it actually results in the compile time error “An anonymous method expression cannot be converted to an expression tree.”
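To make the delegate versus expression tree distinction concrete, here's a minimal sketch. The class and method names are my own and purely illustrative. It assigns the same lambda once to a Func and once to an Expression, inspects the tree, and then uses Compile() to turn the tree back into something executable:

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionVersusDelegate
{
    // Returns a description of the tree plus the result of running both forms.
    public static string Describe(int input)
    {
        // Assigned to a delegate: the compiler emits IL for the body.
        Func<int, bool> asDelegate = i => i < 5;

        // Assigned to an Expression: the compiler emits code that builds a tree.
        Expression<Func<int, bool>> asTree = i => i < 5;

        // The tree is data, so we can look at its shape...
        ExpressionType nodeType = asTree.Body.NodeType;

        // ...and only turn it into executable code when we choose to.
        Func<int, bool> compiled = asTree.Compile();

        return string.Format("{0}: delegate={1}, compiled tree={2}",
            nodeType, asDelegate(input), compiled(input));
    }
}
```

Calling ExpressionVersusDelegate.Describe(3) should report a LessThan node and True from both the delegate and the compiled tree.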
There is a lot of complexity in parsing the AST, far beyond the scope of this article. However, MSDN does have a nice diagram that helps explain the tree for the following slightly more complicated lambda expression, which determines whether a number is greater than the number of characters in a string:
Expression<Func<string, int, bool>> expression =
    (str, num) => num > str.Length;
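In lieu of a picture, here's a rough sketch that walks this particular tree by hand. TreeWalker is a name I made up for illustration, and the casts assume exactly this lambda; change the lambda and they would fail:

```csharp
using System;
using System.Linq.Expressions;

public static class TreeWalker
{
    public static string Walk()
    {
        Expression<Func<string, int, bool>> expression =
            (str, num) => num > str.Length;

        // Root of the body: a binary "greater than" node.
        BinaryExpression greaterThan = (BinaryExpression)expression.Body;

        // Left child: the parameter "num".
        ParameterExpression left = (ParameterExpression)greaterThan.Left;

        // Right child: a member access (".Length") applied to the parameter "str".
        MemberExpression right = (MemberExpression)greaterThan.Right;
        ParameterExpression target = (ParameterExpression)right.Expression;

        return string.Format("{0} {1} {2}.{3}",
            left.Name, greaterThan.NodeType, target.Name, right.Member.Name);
    }
}
```

So the tree has a GreaterThan node at the top, the num parameter on the left, and on the right a member access node whose own child is the str parameter.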
How Deep Does The Rabbit Hole Go?
So LINQ to SQL uses this Expression Tree technique to parse a plethora of possible code that you could throw at it and turn it into smart SQL. For instance check out a couple of the following conversions that LINQ to SQL will (or will not) perform:
p => p.Category.CategoryName.ToLower() == "beverages"
Results In:
SELECT [t0].[ProductID], ...
FROM [dbo].[Products] AS [t0]
LEFT OUTER JOIN [dbo].[Categories] AS [t1]
    ON [t1].[CategoryID] = [t0].[CategoryID]
WHERE LOWER([t1].[CategoryName]) = @p0
Not bad, huh? How about:
p => p.Category.CategoryName.Contains("everage")
That results in the following SQL snippet:
WHERE [t1].[CategoryName] LIKE @p0
And it sets @p0 to “%everage%”. Pretty cool. Ok this will get it to fail though, right?
public static string GetCat() {
    return "Beverages";
}

IEnumerable<Product> products = northwindDataContext.Products.Where(
    p => p.Category.CategoryName == GetCat()
);
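It turns out that LINQ to SQL handles this too! Since GetCat() doesn’t depend on p, its result can be computed up front and sent to the database as a parameter, so the filter still runs in SQL. Alright, there’s no way it can do complicated conditionals: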
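One plausible mechanism for the GetCat() case, sketched under my own assumptions rather than taken from LINQ to SQL's source: any piece of the tree that doesn't reference the lambda parameter can be wrapped in a parameterless lambda, compiled, and run on the client, leaving a plain value to send as a SQL parameter. PartialEvaluation and EvaluateRightSide are hypothetical names:

```csharp
using System;
using System.Linq.Expressions;

public static class PartialEvaluation
{
    public static string GetCat() { return "Beverages"; }

    // Evaluates the right-hand side of "x == <something>" locally.
    // Assumes the body is a binary comparison whose right side does not
    // use the lambda parameter.
    public static object EvaluateRightSide(Expression<Func<string, bool>> predicate)
    {
        BinaryExpression comparison = (BinaryExpression)predicate.Body;
        Expression rightSide = comparison.Right;

        // Wrap the sub-expression in a parameterless lambda and run it.
        Delegate evaluator = Expression.Lambda(rightSide).Compile();
        return evaluator.DynamicInvoke();
    }
}
```

With this, PartialEvaluation.EvaluateRightSide(name => name == PartialEvaluation.GetCat()) hands back the string "Beverages", even though the tree itself only contains a method call node.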
p => p.Category.CategoryName == "Beverages"
    ? p.UnitsInStock < 5
    : !p.Discontinued
This should only pick up Beverages that have fewer than 5 items in stock regardless of whether they are discontinued and any other products that aren’t discontinued. Would you believe that it runs a single SQL statement:
SELECT [t0].[ProductID], ...
FROM [dbo].[Products] AS [t0]
LEFT OUTER JOIN [dbo].[Categories] AS [t1]
    ON [t1].[CategoryID] = [t0].[CategoryID]
WHERE (
    (CASE
        WHEN [t1].[CategoryName] = @p0 THEN
            (CASE
                WHEN [t0].[UnitsInStock] < @p1 THEN 1
                WHEN NOT ([t0].[UnitsInStock] < @p1) THEN 0
                ELSE NULL
            END)
        ELSE CONVERT(Int,
            (CASE
                WHEN NOT ([t0].[Discontinued] = 1) THEN 1
                WHEN NOT NOT ([t0].[Discontinued] = 1) THEN 0
                ELSE NULL
            END))
    END)) = 1
Wow, it sure isn’t pretty, but it scales to multiple conditionals, and most importantly it didn’t return all products and process them in memory. Not bad.
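To give a feel for how a provider can get from a tree to SQL, here's a toy translator. It is emphatically not how LINQ to SQL is implemented; it's a sketch under my own assumptions that handles only ==, < and && over a hypothetical Product class, and it treats every member access as a column and every constant as a parameter:

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Hypothetical entity used only for this sketch.
public class Product
{
    public string ProductName { get; set; }
    public int UnitsInStock { get; set; }
}

public static class ToyWhereTranslator
{
    // Translates a very small subset of predicates into a parameterized WHERE clause.
    public static string Translate(Expression<Func<Product, bool>> predicate,
                                   List<object> parameters)
    {
        return "WHERE " + Visit(predicate.Body, parameters);
    }

    private static string Visit(Expression node, List<object> parameters)
    {
        switch (node.NodeType)
        {
            case ExpressionType.Equal:
                return Binary((BinaryExpression)node, "=", parameters);
            case ExpressionType.LessThan:
                return Binary((BinaryExpression)node, "<", parameters);
            case ExpressionType.AndAlso:
                return Binary((BinaryExpression)node, "AND", parameters);
            case ExpressionType.MemberAccess:
                // Assumes the member is a column on the Product parameter.
                return "[" + ((MemberExpression)node).Member.Name + "]";
            case ExpressionType.Constant:
                // Constants become SQL parameters rather than inline literals.
                parameters.Add(((ConstantExpression)node).Value);
                return "@p" + (parameters.Count - 1);
            default:
                throw new NotSupportedException("Cannot translate " + node.NodeType);
        }
    }

    private static string Binary(BinaryExpression node, string op, List<object> parameters)
    {
        return "(" + Visit(node.Left, parameters) + " " + op + " " +
               Visit(node.Right, parameters) + ")";
    }
}
```

Passing p => p.ProductName == "Chai" && p.UnitsInStock < 5 yields a WHERE clause with two parameters. Anything outside the supported subset throws a NotSupportedException, and a captured local variable would be mistaken for a column, which hints at how much work a real provider has to do.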
Conclusion
I asserted up front that using expression trees and the strong typing that comes with them is the reason LINQ to SQL is inherently better than NHibernate. I really can’t make that claim without admitting one of LINQ to SQL’s biggest shortcomings: it currently does not support multiple table inheritance. Ultimately, however, it’s a short term fault since the forthcoming LINQ to Entities does. And I stand by my claim because, from a long term perspective, as long as technologies like NHibernate remain pure ports of Java code they will never realize the full benefits of equivalent LINQ technologies that take advantage of .Net's native strengths, like expression trees.
After writing my last blog entry on Deferred Execution in LINQ I had a conversation with Seth Schroeder who rightly pointed out, among other things, that I really didn't show how LINQ's deferred execution works internally. So in this post I wanted to implement my own LINQ Where() extension method based off of the one in the System.Linq namespace. So I'll show you the code, explain interesting parts of how it works including collection initializers and extension methods, and then explain where the deferred execution behavior comes from (i.e. the yield statement). I will only explain in the context of LINQ to Objects since that's far simpler than the other LINQ flavors. I will implement a Where() like LINQ to SQL does in a later blog post (that's where things get really crazy).
Implementing MyWhere()
Let's start out with some code. The first question is does this compile?
using System;
using System.Collections.Generic;
using MyExtensionMethods;

namespace PlayingWithLinq {
    public class LinqToObjects {
        public static void DoStuff() {
            IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};
            IEnumerable<int> result = ints.MyWhere(i => i < 5);
            foreach (int i in result) {
                Console.WriteLine(i);
            }
        }
    }
}

namespace MyExtensionMethods {
    public static class ExtensionMethods {
        public static IEnumerable<TSource> MyWhere<TSource>(
            this IEnumerable<TSource> source,
            Func<TSource, bool> predicate
        ) {
            foreach (TSource element in source) {
                if (predicate(element)) {
                    yield return element;
                }
            }
        }
    }
}
Side note: putting two namespaces in one file is far from a best practice, but yes, that is allowed.
Lambdas and Collection Initializers
If you're new to C# 3.0 then your first thought may be that:
IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};
is not allowed. Actually it is. It's the collection initializer syntax that I initially whined about in my post C# 3.0: The Sweet and Sour of Syntactic Sugar (ironically I actually like this syntax the more I use it.)
Your next thought may be that:
i => i < 5
is not legitimate. This is in fact a Lambda Expression, and as I explained in Deferred Execution, The Elegance of LINQ it conceptually compiles down to an anonymous method. Incidentally those that know Groovy (myself not included) or Lisp may know this as a closure since as we'll see later it can access local variables.
Extension Methods
Ok, the .Net Framework certainly has no MyWhere() function on the List object so this certainly wouldn't compile in C# 2. But that's where C# 3's Extension Methods come in. The "this" in:
MyWhere<TSource>(this IEnumerable<TSource> source,
says that MyWhere() can be applied to any generic IEnumerable. If you want to, you can still call MyWhere() normally:
IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};
ExtensionMethods.MyWhere(ints, i => i < 5);
And in fact this is what the compiler does in the background when you call MyWhere() off of an IEnumerable. But now with extension methods you don't have to.
But does MyWhere() now exist on all IEnumerable objects everywhere? No, it turns out you only get MyWhere() when you import the namespace it exists in (MyExtensionMethods). Incidentally unlike Groovy and Ruby there is no way to add an extension method to a class itself, only to instances.
Who's Got the Func()?
The last two questionable parts of the code are the Func<TSource, bool> and the yield. Func is pretty easy. It's simply one of several new predefined delegates (method signatures) that come with the .Net framework in the System namespace. The two-generic-argument version above will match any function that takes the first generic argument as a parameter and returns the second generic argument. It looks like this:
delegate TResult Func<T, TResult>(T arg1);
So rather than using a Lambda expression in my initial example I could have been very explicit about the delegate instance (myFunc):
public static void DoStuff() {
    IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};
    Func<int, bool> myFunc = IsSmall;
    IEnumerable<int> result = ints.MyWhere<int>(myFunc);
    foreach (int i in result) {
        Console.WriteLine(i);
    }
}

public static bool IsSmall(int i) {
    return i < 5;
}
And that would have done the same thing. Notice that I specified the generic type explicitly on the call to MyWhere(). That's optional here, since the compiler can infer TSource from the types of ints and myFunc.
Yield
Now the really interesting part: yield. Yield is what makes deferred execution work. It actually was introduced with C# 2.0, but I don't think anyone really used it (I didn't know about it until recently). So because MyWhere() returns an IEnumerable (and because it isn't anonymous and doesn't have ref or out parameters) it is allowed to use the yield statement. When a method has a yield return (or yield break) statement, then execution of the method doesn't even begin until a calling method first iterates over the resulting IEnumerable. Execution then begins in the method and runs to the first yield statement, returns a result, and passes execution back to the caller. When the calling method iterates to the next value execution continues in the method where it left off until it gets to the next yield statement and then it passes execution back to the caller again and so on. Weird huh? Joshua Flanagan has a nice article that explains this in more detail along with some of the nice benefits like a smaller memory footprint.
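If the back-and-forth is hard to picture, here's a small sketch that records the order in which things happen. TracedWhere is a hypothetical variation on MyWhere() that writes to a log instead of staying silent:

```csharp
using System;
using System.Collections.Generic;

public static class YieldTrace
{
    // Like MyWhere(), but records when each step actually runs.
    public static IEnumerable<int> TracedWhere(
        IEnumerable<int> source, Func<int, bool> predicate, List<string> log)
    {
        log.Add("iterator started");
        foreach (int element in source)
        {
            if (predicate(element))
            {
                log.Add("yielding " + element);
                yield return element;
            }
        }
        log.Add("iterator finished");
    }

    public static List<string> Run()
    {
        List<string> log = new List<string>();
        IEnumerable<int> result = TracedWhere(new[] { 1, 7, 2 }, i => i < 5, log);

        // Nothing inside TracedWhere has run yet at this point.
        log.Add("after call");

        foreach (int i in result)
        {
            log.Add("caller got " + i);
        }
        return log;
    }
}
```

The log starts with "after call", and only then shows "iterator started", followed by the iterator and the caller taking turns for each matching element.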
So here's a quiz. What happens when you execute the following code?
IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};
IEnumerable<int> result = ints.MyWhere<int>(i => i < 4);
ints.Add(0);
foreach (int i in result) {
    Console.WriteLine(i);
}
Without the yield you'd get the numbers 3 through 1 since you added 0 after the call to MyWhere(). But since the yield in MyWhere() (and the Where() in System.Linq) defers execution until the foreach statement, you actually get 3 through 0. Ready for a little more mind bending? How about this:
IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};
int j = 4;
IEnumerable<int> result = ints.MyWhere<int>(i => i < j);
ints.Add(0);
j = 3;
foreach (int i in result) {
    Console.WriteLine(i);
}
Does the state of j get captured? My intuition would say yes. If so you'd expect 3 through 0. Well, the closure part of anonymous methods and lambdas works by capturing the variable itself rather than a copy of its value (the compiler hoists j into a generated class that the method and the lambda share). So consequently they always get the most up to date value of a variable. So if your intuition works like mine you'd be wrong. You actually get the numbers 2 through 0. Crazy huh? And definitely something I hope I won't run into in someone's code (JetBrains ReSharper actually warns you if you do something crazy like this).
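Here's the same experiment as a self-contained sketch using the real Where() from System.Linq, next to a variation that copies j into a variable that never changes afterwards. The class and method names are mine:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CapturedVariable
{
    public static List<int> Run()
    {
        List<int> ints = new List<int> { 9, 8, 7, 6, 5, 4, 3, 2, 1 };
        int j = 4;

        // The lambda captures the variable j itself, not the value 4.
        IEnumerable<int> result = ints.Where(i => i < j);

        ints.Add(0);
        j = 3;   // changes what the not-yet-run query will see

        return result.ToList();
    }

    public static List<int> RunWithCopy()
    {
        List<int> ints = new List<int> { 9, 8, 7, 6, 5, 4, 3, 2, 1 };
        int j = 4;

        // Copy the value into a variable that is never reassigned.
        int limit = j;
        IEnumerable<int> result = ints.Where(i => i < limit);

        ints.Add(0);
        j = 3;   // no effect on the query, which only knows about limit

        return result.ToList();
    }
}
```

Run() returns 2, 1, 0 while RunWithCopy() returns 3, 2, 1, 0. Both still see the 0 that was added later, because the list itself is only read when the query is enumerated.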
Conclusion
If this made sense then you should have a pretty solid grasp of how most of Linq to Objects works. Understanding extension methods, Func delegates, and yield statements should form the majority of what Linq does. Well, except for expression trees. But that's a topic for another post. Please post if this doesn't make sense or if I got it all wrong, I'd love to hear from you.
One of the things I love about LINQ is its deferred execution model. It's the type of thing that makes sense academically when you first read about it (e.g. in Part Three of Scott Guthrie's LINQ to SQL series), but for me anyway, it took some time to understand enough to use effectively.
For instance the Daily RSS Download open source application that I wrote about last week needs to download entries (posts) that are newly published since the last download. While it isn't a complicated problem, my first attempt at a solution didn't use the power of LINQ correctly. I'll explain my naïve solution in this post, describe how LINQ's deferred execution works (including lambda expressions), explain the problems with my solution, then give the elegant solution that is only possible because of LINQ's deferred execution model. See if you can spot my error along the way.
Downloading the Latest Entries
Downloading the latest entries would be a ridiculously simple problem if there weren't multiple formats for RSS. But since the solution needs to support Atom and RSS 2.0 and 1.0 and potentially other future formats, the class structure should be set up appropriately:
The newspaper class primarily exists to enumerate feeds:
public class Newspaper {
    public void DownloadNow() {
        foreach (Feed objFeed in Settings.Feeds) {
            objFeed.DownloadRecentEntries(...);
        }
    }
}
The Feed class is abstract and during runtime is either an RssFeed or an AtomFeed. The relevant function Feed.DownloadRecentEntries() calls the abstract Feed.GetEntries() method, which returns a group of Entry objects.
public abstract class Feed {
    public abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstRecentPosts = GetEntries(xdocFeed);
        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...);
        }
    }
}
The Feed classes, RssFeed and AtomFeed then implement GetEntries as follows:
public class RssFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return
            from item in rssfeed.Descendants("item")
            where (DateParser.ParseDateTime(item.Element("pubDate").Value)
                    >= this.LastDownloaded)
                || this.LastDownloaded == null
            select (Entry)new RssEntry(item, this);
    }
}

public class AtomFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return
            from item in rssfeed.Descendants(_atomNamespace + "entry")
            where (DateParser.ParseDateTime(
                    item.Element(_atomNamespace + "published").Value)
                    >= this.LastDownloaded)
                || this.LastDownloaded == null
            select (Entry)new AtomEntry(item, this);
    }
}
Yes, that's all LINQ to XML in there. It looks a lot like SQL, but as you'll see in a second it's really just glorified syntactic sugar. Expressive though, isn't it? While the astute reader may have already spotted the inelegance of my solution, for those unfamiliar with LINQ, let's first describe what AtomFeed.GetEntries() does.
What is this Deferred Execution Stuff?
If you already understand LINQ and how delayed execution works feel free to skip this section. For everyone else it's important to understand that the following line:
from item in rssfeed.Descendants(_atomNamespace + "entry")
where (DateParser.ParseDateTime(
        item.Element(_atomNamespace + "published").Value)
        >= this.LastDownloaded)
    || this.LastDownloaded == null
select (Entry)new AtomEntry(item, this);
Is actually just syntactic sugar for the following set of statements:
rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where(item => (DateParser.ParseDateTime(
            item.Element(_atomNamespace + "published").Value)
            >= this.LastDownloaded)
        || this.LastDownloaded == null)
    .Select(item => (Entry)new AtomEntry(item, this));
Now XDocument.Descendants() returns IEnumerable<XElement>.
But more important for the topic of deferred execution is the => operator, which introduces a lambda expression and is also new to C# 3.0. The best way to understand lambda expressions is that they are essentially syntactic sugar for an anonymous method (e.g. a type safe function pointer to code). So we could again rewrite our code as follows:
rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where(delegate(XElement item) {
        return (DateParser.ParseDateTime(
                item.Element(_atomNamespace + "published").Value)
                >= this.LastDownloaded)
            || this.LastDownloaded == null;
    })
    .Select(delegate(XElement item) {
        return (Entry)new AtomEntry(item, this);
    });
Back in familiar territory yet? If not you probably aren't familiar with C# 2.0. In the background the compiler takes the anonymous methods above and turns them into methods on the current class, instantiates new delegates of the correct type that point to them, and passes them to the Select() and Where() methods.
The key thing to note is that the arguments for select and where are delegates, and so when those delegates are executed is beyond our control. In fact if you put a Console.WriteLine or a breakpoint inside of the AtomEntry constructor, it won't get called until the resulting IEnumerable is enumerated, specifically the following line in the first code sample:
foreach (Entry objEntry in lstRecentPosts) {
So that's delayed execution. But understanding how it works and how to use it are completely different things.
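You can convince yourself of this without a debugger. The sketch below is my own stand-in, with strings instead of AtomEntry objects; it counts how many times the Select() lambda has run before and after enumerating:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredSelect
{
    public static int[] Run()
    {
        int constructed = 0;
        string[] titles = { "first post", "second post" };

        // Building the query runs none of the lambdas.
        IEnumerable<string> entries =
            titles.Select(t => { constructed++; return t.ToUpper(); });
        int beforeEnumeration = constructed;

        // Enumerating is what finally invokes the projection.
        foreach (string entry in entries) { }
        int afterFirstPass = constructed;

        // Enumerating again re-runs the projection from scratch.
        entries.ToList();
        int afterSecondPass = constructed;

        return new[] { beforeEnumeration, afterFirstPass, afterSecondPass };
    }
}
```

The counter stays at zero until the foreach, and a second enumeration runs the lambda all over again, which is worth remembering if the work inside it is expensive.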
The Inelegant Solution
Getting back to my code sample you may have picked up that my where clause is the mistake. I implemented it like this because RSS and Atom have different field names for the published date. But the way I wrote it I'd have to make two changes if I wanted to change which entries to download. Ok, big deal, I'm extremely unlikely to make changes to that where clause right? Or I wasn't until I wanted functionality to set some defaults based on the average length of posts prior to downloading posts. Basically:
public static Feed CreateFeed(string strUrl, int intDisplayOrder) {
    IEnumerable<Entry> lstRecentEntries = feed.GetEntries(rssfeed);
    double intAveragePostSize = lstRecentEntries.Average(
        i => i.Description.Length);
    // if the feeds posts are typically small then include the
    // description field in the summary and download the content
    // for the main article from the link
    if (intAveragePostSize < 1000) {
        ...
    } else {
        ...
    }
}
Except this now ties me to the where clause, when what I'd really like to do is just get the average post size for the last couple of posts. The problem is that GetEntries() isn't generic enough.
The Elegant Solution
The solution is then to normalize out (excuse the database terminology) the where clause into the two methods that use GetEntries(). So GetEntries() becomes simple:
public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
    return
        from item in rssfeed.Descendants(_atomNamespace + "entry")
        select (Entry)new AtomEntry(item, this);
}
And then Feed.CreateFeed() and Feed.DownloadRecentEntries() become more complicated:
public abstract class Feed {
    public abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public static Feed CreateFeed(string strUrl, int intDisplayOrder) {
        IEnumerable<Entry> lstEntries = feed.GetEntries(rssfeed);
        // get the five most recent posts
        IEnumerable<Entry> lstRecentEntries =
            from entry in lstEntries.Take(5)
            select entry;
        double intAveragePostSize =
            lstRecentEntries.Average(
                i => i.Description.Length);
        if (intAveragePostSize < 1000) {
            ...
        } else {
            ...
        }
    }

    public void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstEntries = GetEntries(xdocFeed);
        // get newly published posts
        IEnumerable<Entry> lstRecentPosts =
            from entry in lstEntries
            where (entry.Published >= this.LastDownloaded)
                || this.LastDownloaded == null
            select entry;
        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...);
        }
    }
}
Note that we now have a second LINQ statement that runs against the results of the LINQ statement in GetEntries(). But since nothing's been executed yet we're just building out the statement that we will eventually run when the resulting IEnumerable is enumerated. So we've now spread our LINQ statements across an inheriting and a base class, and in the process we've made GetEntries() extremely generic.
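Stripped of the XML parsing, the shape of this refactoring looks something like the sketch below. SketchEntry and ComposedQueries are stand-ins I invented so the example is self-contained; they are not the real Entry and Feed classes:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for Entry with only the members used here.
public class SketchEntry
{
    public DateTime Published { get; set; }
    public string Description { get; set; }
}

public static class ComposedQueries
{
    // Close to the data: no filtering, just a deferred query.
    public static IEnumerable<SketchEntry> GetEntries(IEnumerable<SketchEntry> raw)
    {
        return from item in raw select item;
    }

    // One caller layers a date filter on top of the unexecuted query.
    public static List<SketchEntry> PublishedSince(
        IEnumerable<SketchEntry> raw, DateTime? lastDownloaded)
    {
        IEnumerable<SketchEntry> recent =
            from entry in GetEntries(raw)
            where lastDownloaded == null || entry.Published >= lastDownloaded
            select entry;
        return recent.ToList();   // execution happens here
    }

    // Another caller only wants the first few entries and an average.
    public static double AverageDescriptionLength(
        IEnumerable<SketchEntry> raw, int count)
    {
        return GetEntries(raw).Take(count).Average(e => e.Description.Length);
    }
}
```

Each caller decides what to add to the query and when to run it. Note that Average() throws on an empty sequence, so a real version would need to guard against a feed with no entries.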
Conclusion
So what's the big deal? The big deal is that we can spread our data access statements across multiple classes, and because of deferred execution we don't need to worry about the performance of generic methods that are closer to the data and don't contain a "where" clause. This may not be a huge deal in this example, but it becomes extremely powerful when the user interface tier can tack on "order by" statements or "filters" BEFORE anything is executed against your data store. And that, for me, is at the heart of the beauty of LINQ.
I published Daily RSS Download, my first open source project on CodePlex* today. It's not going to change the world, but if you have a need for it there is a decided lack of decent products that perform this functionality. In this post I'll give a little background about why I wrote it and explain what it does and how to use it. Besides needing this functionality I also wrote it to learn LINQ to Objects and LINQ to XML, but I'll cover the more interesting implementation details in a later post.
Why I Wrote It
For Christmas I received an IRex Iliad which is an e-book reader combined with a Wacom tablet. It's an awesome product that allows reading PDF's (among other formats) and writing on them. It's pricey, but the ability to jot notes on technical documents (in addition to recipes and guitar tablature, etc) as you read is invaluable for me. I now read about twice as much as I did before. It supports Wi-Fi, and in particular can connect to a computer on a regular basis to download files you put in a specific directory.
So theoretically it could download a customized newspaper every morning for me, right? I could have today's world news, national news, local news, technical news, weather, and my RSS feeds like Scott Guthrie all in one place while I eat my cereal! And then I could cancel my Washington Post subscription and after about 7.5 years I would have recouped the costs of the Iliad. Sweet.
The problem is that the product doesn't come with any way to download RSS feeds. Well, you can use software from MobiPocket, but it's a pain to set up and use, and I couldn't figure out how to have it automatically run on a daily basis. And furthermore it can't grab the real content from the website if the RSS feed only contains an abstract (e.g. washingtonpost.com). I searched and there was some software out there, but none of it did what I wanted. And of course none of it generated a manifest.xml file, which is an Iliad specific file that links HTML pages together and gives names to groups of content (i.e. grouping the files in a directory to make a “book” called “My Daily News for February 13”).
So what a great opportunity to write it myself and learn LINQ to XML and LINQ to Objects in the process.
What It Does
The end result (or the index page anyway) looks something like this:
The images are local, the links go to a full page of content, and on the Iliad, because Daily RSS Download generates a manifest.xml, the next and previous buttons can move you to the next or previous article and you can see at a glance how many articles there are.
If you want to recreate the screenshot above, first head over to the Releases page of Daily RSS Download, where you can download the msi and install the application. When you open “Daily RSS Download Config” you can view a home page like this:
You can type in an RSS or Atom URL and click Add Feed. The application will try to connect to the website, download the title, and set some configuration options based on the average length of posts (specifically if you put in a feed from the washingtonpost.com website it will detect that the average post size is small and determine that it should download the main content from the website).
You can click on any of the feeds you've added and you'll get a Feed Settings page like below:
The fields are mostly self explanatory, but here are three of the more interesting settings:
Summary Source Values:
This setting determines where the abstract (summary) on the index page should come from. There are three options:
No Summary – Does not display a summary on the index.html page. This is what Scott Guthrie's feed was set to in the first screenshot.
Extract from the content – This takes the first 300 characters from the main content as the summary. This was set for the washingtonpost.com feed in the first screenshot (although Use the RSS description field would actually have been more appropriate).
Use the RSS description field – This uses the entire description field from the RSS (or Atom) feed. This is what the weatherbug feed was set to in the first screenshot. Obviously this is a bad choice for a Scott Guthrie type of RSS entry since he posts everything in the description field.
Content Source Values:
This setting determines where the main content page should get its value. There are three options:
No content, summary only – If you set a feed to this, then Daily RSS Download won't generate a content file. This would be a good choice for the weather feed in the example.
Use the RSS description field – The content file will be created from the RSS description field. This would be a good choice for a Scott Guthrie type of feed.
Download from the referenced web page – Daily RSS Download will download the page referenced by the RSS or Atom feed. This would be a good option for a washingtonpost.com type of feed.
Content Start/End Markers
These are regular expressions that are used if you set content source to download the referenced web page. You can leave them blank or you can set them if you want to try to strip out header, footer, navigation bars, etc. The content start marker in the screenshot:
\<div id=\"article_body\"[^\>]*\>
Says match ‘<div id="article_body"' up through to the next ‘>'. Both markers are exclusive (the thing you're matching on won't be included in the results).
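As an illustration of what exclusive markers mean, here's a minimal sketch of how such an extraction could be written with System.Text.RegularExpressions. ContentMarkers.Extract is a hypothetical helper and not the actual Daily RSS Download code, and treating a blank or unmatched marker as "keep everything on that side" is my assumption:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContentMarkers
{
    // Returns the text between the first match of startPattern and the
    // next match of endPattern, excluding both matches.
    public static string Extract(string html, string startPattern, string endPattern)
    {
        int begin = 0;
        if (!string.IsNullOrEmpty(startPattern))
        {
            Match start = Regex.Match(html, startPattern);
            if (start.Success) begin = start.Index + start.Length;
        }

        int finish = html.Length;
        if (!string.IsNullOrEmpty(endPattern))
        {
            // Only look for the end marker after the start marker.
            Match end = new Regex(endPattern).Match(html, begin);
            if (end.Success) finish = end.Index;
        }

        return html.Substring(begin, finish - begin);
    }
}
```

Given a page containing a div with id article_body, the start marker from the screenshot and an end marker of </div>, Extract() returns just the text inside the div, without the tags themselves.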
Customizing the CSS
So that's it for the general settings and use. You can click “Download Now” on the main config page to download your feeds, and you can set it up to run on a recurring basis (it will only download new content) by setting a recurring task to run “DailyRssDownload.exe DownloadNow”. The only other thing of interest is making the content prettier.
The generated HTML is CSS customizable, so in order to get the two column look above (and/or make it look pretty on an Iliad) you can customize the CSS as below:
h1
{
    margin-top: 0px;
    /* A pretty linux script font since the Iliad has a linux kernel */
    font-family: Zapf Chancery;
    font-size: 30pt;
    margin-bottom: 0px;
}
h2
{
}
.NewsHeader
{
    border-bottom: solid 1px black;
    text-align: center;
}
.DailyRss_Date
{
    text-align: center;
}
.DailyRss_Feed
{
}
.DailyRss_Entry
{
}
.DailyRss_EvenEntry
{
}
.DailyRss_OddEntry
{
}
/* LEFT COLUMN */
#ScottGusBlog
{
    float: left;
    width: 49%;
    border-right: solid 1px gray;
}
#washingtonpostcom-TodaysHighlights
{
    clear: both;
    float: left;
    width: 49%;
    border-right: solid 1px gray;
}
/* RIGHT COLUMN */
#WeatherBugLocalWeatherfor20190
{
    margin-left: 50%;
}
So basically just use the old float left, width 50%, margin-left 50% trick to get the pretty two-column look (without tables).
Conclusion
I hope you find the Daily RSS Download open source project useful. Please feel free to submit suggestions, feature requests, defects or preferably defects AND patches on the project's CodePlex home page.
* In case you aren't aware, CodePlex is an open source project hosting website from Microsoft. It's similar to SourceForge, except there is no approval process for new projects and it integrates nicely with Visual Studio.
I really enjoyed Seth Schroeder’s critique of the last post in my ten part data modeling mistake series: Surrogate vs Natural Primary Keys. His argument regarding data migration in particular sheds light on a major shortcoming of using surrogate keys: they lead data modelers to a false sense of security regarding the uniqueness of data. Specifically if modelers ignore uniqueness constraints they allow duplicate data. And as Seth points out this has a nasty side effect of disallowing any clear way to compare data between systems. But there are other problems too.
So, in this post I’ll address the uniqueness problem introduced with surrogate keys by way of an example, I’ll provide two how-to’s, one implementing uniqueness in Visio and one in NHibernate, I’ll explain the difference between unique indexes and unique constraints, and finally I’ll provide reasons why unique indexes might be overlooked, specifically by providing a critique of ORM tools.
Surrogate Keys = Data Disaster?
So as mentioned above, the biggest problem with surrogate keys is that they lull junior data modelers or lazy developers into thinking they don’t need to worry about unique indexes. But they do, and it’s as vital as implementing referential integrity, and for the same reason: data integrity.
As an example, imagine you’re modeling a simple Country table. You could of course use CountryName as the primary key, but as you know from my post on surrogate keys, you would have problems with varchar join speed (assuming you disagree with Seth that it’s a premature optimization) and to a lesser extent cascading updates (since country names do occasionally change).
Introducing a surrogate key (CountryId) resolves these issues, but you also remove an inherent advantage that natural keys have: they require uniqueness in country names. In other words, you can now have two New Zealands and the system wouldn’t stop you.
What’s the big deal? Country seems like a pretty benign table to have duplicates, right? Your users from New Zealand simply have an extra list item in their drop down to pick from and some pick one and some pick the other.
For Country, one problem comes in reporting. Consider delivering a revenue by Country report. Your report probably lists New Zealand twice, and an exec quickly scanning it sees only about half of the revenue that country actually generated. And as a result numerous innocent sheep are slaughtered … uh, or something.
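To make the reporting problem concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than the stack discussed in this post. The table names, column names, and revenue figures are all invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a Country table with a surrogate key and no uniqueness rule."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE country (
            country_id   INTEGER PRIMARY KEY,   -- surrogate key
            country_name TEXT NOT NULL          -- nothing enforces uniqueness
        );
        CREATE TABLE customer_order (
            order_id   INTEGER PRIMARY KEY,
            country_id INTEGER NOT NULL REFERENCES country(country_id),
            revenue    INTEGER NOT NULL
        );
        -- Two rows for the same real-world country: the schema allows it.
        INSERT INTO country (country_id, country_name) VALUES (1, 'New Zealand');
        INSERT INTO country (country_id, country_name) VALUES (2, 'New Zealand');
        -- Some users picked one drop-down entry, some picked the other.
        INSERT INTO customer_order (country_id, revenue) VALUES (1, 600);
        INSERT INTO customer_order (country_id, revenue) VALUES (2, 400);
    """)
    return conn


def revenue_by_country(conn):
    """Group revenue by the surrogate key, as a typical report query would."""
    return conn.execute("""
        SELECT c.country_name, SUM(o.revenue)
        FROM customer_order o
        JOIN country c ON c.country_id = o.country_id
        GROUP BY c.country_id
        ORDER BY c.country_id
    """).fetchall()


if __name__ == "__main__":
    for name, total in revenue_by_country(build_demo_db()):
        print(name, total)
```

The report comes back with two New Zealand rows, and neither one shows the real total of 1000.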
Another major problem could come in syncing data with other systems. How do those systems know which record to use?
As you can imagine, the problem is even worse with major entities like Customer, Order, Product, or something more scary like Airline Flights. And the longer the system stays in production, the more production data the system collects, the more duplicates rack up, and the more time and money will be required to clean up the data when the problem is finally identified. In short, the longer you wait, the bigger the data disaster.
How To #1: Visio
So the solution is to add at least one unique constraint (or index) to every single table. In other words, if you have a table without a uniqueness constraint, chances are very good you’ve done something wrong.
The good news is that it’s pretty easy to implement once you agree it’s necessary. If you’re modeling with Microsoft Visio this is a six-step process:
- Select the table.
- Select the “Indexes” category.
- Click New.
- No need to enter a name, just click OK.
- Select either “Unique constraint only” or “Unique index only” (more on this decision later).
- Double click the column(s) to add.
Then when you generate or update your database, Visio puts in DBMS-specific uniqueness constraints. And voila, problem solved.
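Whatever tool generates it, the end result is a constraint the database enforces on every insert. As a rough sketch of that behavior, again using Python's built-in sqlite3 module as a stand-in for your real DBMS, with a hypothetical helper name:

```python
import sqlite3


def insert_countries(names):
    """Insert names into a Country table that has a unique constraint.

    Returns a tuple of (accepted, rejected) name lists.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE country (
            country_id   INTEGER PRIMARY KEY,   -- surrogate key
            country_name TEXT NOT NULL UNIQUE   -- uniqueness is now enforced
        )
    """)
    accepted, rejected = [], []
    for name in names:
        try:
            conn.execute(
                "INSERT INTO country (country_name) VALUES (?)", (name,)
            )
            accepted.append(name)
        except sqlite3.IntegrityError:
            # The database refuses the duplicate instead of silently storing it.
            rejected.append(name)
    conn.close()
    return accepted, rejected


if __name__ == "__main__":
    print(insert_countries(["New Zealand", "Australia", "New Zealand"]))
```

The second New Zealand never makes it into the table, so the duplicate is caught when the data is entered, not months later in a report.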
Unique Constraints vs Unique Indexes
Whether you’re using Visio or working directly with a DBMS such as SQL Server, the question will come up whether to use a unique constraint or a unique index. The short answer is that most people use unique constraints, but in practice they amount to the same thing, so it doesn’t matter.
In case you’re interested in the details though here’s a quick rundown of the differences:
Unique Constraint
- A logical construct.
- Defined in the ANSI SQL standard.
- Intent: data integrity.
- Usually part of a table definition.
Unique Index
- A physical DBMS implementation.
- Not specified in ANSI SQL.
- Intent: performance.
- Usually external to a table definition.
But since most DBMSs implement unique constraints as unique indexes, it doesn’t really matter which you choose.
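You can see this for yourself in at least one DBMS. The sketch below uses SQLite through Python's built-in sqlite3 module; other databases will differ in the details, and the table, constraint, and index names are made up for the example. It declares uniqueness once as a constraint and once as an index, then asks SQLite which unique indexes it keeps for each table.

```python
import sqlite3


def build():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Variant A: unique constraint, part of the table definition.
        CREATE TABLE country_a (
            country_id   INTEGER PRIMARY KEY,
            country_name TEXT NOT NULL,
            CONSTRAINT uq_country_a_name UNIQUE (country_name)
        );
        -- Variant B: unique index, created outside the table definition.
        CREATE TABLE country_b (
            country_id   INTEGER PRIMARY KEY,
            country_name TEXT NOT NULL
        );
        CREATE UNIQUE INDEX ux_country_b_name ON country_b (country_name);
    """)
    return conn


def unique_indexes(conn, table):
    """Return the names of the unique indexes SQLite keeps for a table."""
    rows = conn.execute(f"PRAGMA index_list('{table}')").fetchall()
    # Each row starts with (seq, name, unique, ...); unique == 1 marks a unique index.
    return [row[1] for row in rows if row[2] == 1]


def rejects_duplicate(conn, table):
    """Insert the same country name twice and report whether the second fails."""
    conn.execute(f"INSERT INTO {table} (country_name) VALUES ('New Zealand')")
    try:
        conn.execute(f"INSERT INTO {table} (country_name) VALUES ('New Zealand')")
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    conn = build()
    for table in ("country_a", "country_b"):
        print(table, unique_indexes(conn, table), rejects_duplicate(conn, table))
```

In SQLite both tables end up with exactly one unique index, and both reject the duplicate row. The only visible difference is that the constraint's index is created and named automatically.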
How To #2: NHibernate
Since I have the pleasure of learning the NHibernate ORM tool on my current project, I thought I’d also describe the same technique with a different tool. Basically, you can either set the Unique attribute to true to enforce uniqueness on a single column, or set the unique-key attribute to enforce uniqueness across multiple columns. If you use NHibernate mapping attributes you write:
[Property(NotNull = true, Length = 100, Unique = true)]
public virtual string CountryName {
get { return _strCountryName; }
set { _strCountryName = value; }
}
Which generates the following hbm:
<class name="Country">
    <id name="CountryId">
        <generator class="sequence" />
    </id>
    <property name="CountryName" length="100" not-null="true" unique="true" />
</class>
Which NHibernate turns into the following DDL:
create table Country (
    CountryId NUMBER(10,0) not null,
    CountryName NVARCHAR2(100) not null unique,
    primary key (CountryId)
)
So quick quiz: was that a unique index or a unique constraint it generated? If you answered “who cares,” you’re right. However, if you answered “a unique constraint,” you’re also right.
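The example above covers the single-column case. For the multi-column case that unique-key targets, the database-level idea is a uniqueness rule spanning several columns. The sketch below shows that idea in SQLite via Python's built-in sqlite3 module; it is not the DDL NHibernate emits, and the Region table and its columns are hypothetical.

```python
import sqlite3


def build_region_table():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE region (
            region_id   INTEGER PRIMARY KEY,     -- surrogate key
            country_id  INTEGER NOT NULL,
            region_name TEXT NOT NULL,
            UNIQUE (country_id, region_name)     -- unique per country, not globally
        )
    """)
    return conn


def try_insert(conn, country_id, region_name):
    """Return True if the row was stored, False if uniqueness rejected it."""
    try:
        conn.execute(
            "INSERT INTO region (country_id, region_name) VALUES (?, ?)",
            (country_id, region_name),
        )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = build_region_table()
    print(try_insert(conn, 1, "Northland"))  # first use of the pair
    print(try_insert(conn, 2, "Northland"))  # same name, different country
    print(try_insert(conn, 1, "Northland"))  # exact pair repeated
```

The same region name is fine in two different countries; only repeating the exact (country, name) pair is rejected.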
The Problem with ORM
Obviously ignorance of the problem and shortsightedness are two causes for systems going into production without unique indexes, but I’d like to point out a third. While Object Relational Mapping (ORM) tools like NHibernate are extremely convenient for generating database schemas, modeling database tables with classes and generating DDL can lead developers to a false sense of purpose.
This can occur because ORM tools focus entirely on the world of objects and classes. In this world data’s persistence is irrelevant. It exists for the purposes of a single operation, and consequently long term data persistence issues like data integrity are deemphasized. In fact, it would be easy








