Joseph Ferner

Thursday May 22, 2008
RemoteLINQ - How to make your LINQ span the globe

After reading Jon Skeet's blog post about generating Mandelbrot images using PLINQ, I got the idea to build my own LINQ extension. Instead of just splitting the work across processors like PLINQ does, I decided to split it across machines as well. Thus was born RemoteLINQ. The concept is simple: take each item from an enumeration and send it to a remote machine for processing. From a user's perspective it is just that easy. The magic behind the curtains is not. Handling threading, ordering, communication, etc. was and still is difficult. The code I provide is still very beta and I cannot guarantee it won't blow up, but it's getting better, and it does generate Mandelbrot images just fine across 3 machines containing a total of 5 processors.

Code You Write To Make It Work

Here is an example of what you as a user of RemoteLINQ would need to do:
   1:  IEnumerable<int> numbers = Enumerable.Range(0, 500);
   2:  RemoteContext remoteContext = new RemoteContext(
   3:     new[] { this.GetType().Assembly }, 
   4:     new[] { "localhost", "someremotehost" },
   5:     7776);
   6:  RemoteOptions remoteOptions = new RemoteOptions();
   7:  IEnumerable<int> results = numbers.AsRemotable(remoteContext, remoteOptions).Select(i => DoSomeWork(i));

This will take the numbers 0 to 499 and divide them between "localhost" and "someremotehost" for processing.

Let's walk through the code.

  • Line 1 gives us an enumerable of the numbers 0 to 499 for processing (Enumerable.Range(0, 500) yields 500 values starting at 0).
  • Line 2 creates the RemoteContext. This is really just a client which has a list of server connections to which we can send work.
  • Line 3 is a list of assemblies that contain the code needed to run your LINQ statement. The client sends a copy of the dll to the RemoteLINQ server so that the server has the "DoSomeWork" method (line 7) and can execute it.
  • Line 4 is the list of servers to divvy the work across.
  • Line 5 is the port on which the client will try to connect to the server.
  • Line 6 contains any options that direct how the work gets split across the servers, etc.
  • Line 7 is your LINQ statement. Notice the "AsRemotable". This is where all the magic starts. It returns an IRemoteEnumerable, which knows how to take anything to the right of it and run it remotely.
Lines 2-6 can be stored in fields and reused across multiple RemoteLINQ calls. This buys you a couple of things. First, you will only have one connection to each server. Second, only one copy of your assemblies will be sent across to the server.

How it works

Let's start with what lambda expressions look like in compiled code. When we compile lambda expressions, they are moved into new methods with compiler-generated names. If you reflect on a class with lambdas and get all the methods on that type, you will see a bunch of method names which you didn't create. Those are your lambdas. The fact that they are ordinary methods allows us to call them from the outside, with reflection of course. So when we call "Select" on an IRemoteEnumerable, RemoteLINQ takes the lambda expression passed in and stores information about it (assembly, declaring type, method name, etc.) in a serializable object. This allows us to send that information to the server, which can then call the method without all the other stuff around it.
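
The idea above can be sketched in a few lines. This is only an illustration under my own assumptions, not the actual RemoteLINQ code: MethodDescriptor, Describe and Invoke are hypothetical names, and the sketch only handles lambdas that do not capture local variables.

```csharp
using System;
using System.Reflection;

// Hypothetical serializable description of a lambda (not RemoteLINQ's real type).
[Serializable]
public class MethodDescriptor {
    public string AssemblyName;
    public string TypeName;
    public string MethodName;
}

public static class LambdaDemo {
    // Record enough information to find the compiler-generated method again.
    public static MethodDescriptor Describe(Delegate lambda) {
        MethodInfo method = lambda.Method;
        return new MethodDescriptor {
            AssemblyName = method.DeclaringType.Assembly.FullName,
            TypeName = method.DeclaringType.FullName,
            MethodName = method.Name
        };
    }

    // Look the method up by name and call it, as a server could do.
    public static object Invoke(MethodDescriptor d, object argument) {
        Assembly assembly = Assembly.Load(d.AssemblyName);
        Type type = assembly.GetType(d.TypeName, true);
        MethodInfo method = type.GetMethod(d.MethodName,
            BindingFlags.Static | BindingFlags.Instance |
            BindingFlags.Public | BindingFlags.NonPublic);
        // Depending on the compiler, the lambda is a static method or an instance
        // method on a generated class. A non-capturing lambda carries no state,
        // so a fresh instance is enough in the second case.
        object target = method.IsStatic ? null : Activator.CreateInstance(type, true);
        return method.Invoke(target, new[] { argument });
    }

    public static void Main() {
        Func<int, int> square = i => i * i;
        MethodDescriptor d = Describe(square);
        Console.WriteLine(d.MethodName);   // a compiler-generated name
        Console.WriteLine(Invoke(d, 7));   // prints 49
    }
}
```

A lambda that captures local variables would also need its captured state shipped along with the method name, which this sketch leaves out.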

On the server side of things, when we start up a RemoteLINQ server, it creates a worker thread for each processor in the machine. Each worker thread waits quietly until new work arrives on the socket connection. When work arrives, we take the serialized object I mentioned before, find the assembly, type, and method, and execute that method with the work given to us. In this case the work will be one of the numbers from 0 to 499. The client handles load balancing by choosing the server with the least amount of pending work per processor. The amount of work queued up on the server is sent back after each unit of work is completed, so the client has an up-to-date picture of what the servers are doing and can make the best guess as to where to send the next piece of work.
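
The selection rule described above ("least pending work per processor") might look something like the following. Treat it as a rough sketch: ServerState and ChooseServer are names I made up for illustration, and the real client also has to deal with threading and the socket protocol.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical view of one server as the client sees it.
public class ServerState {
    public string Host;
    public int ProcessorCount;
    public int PendingWork;   // refreshed from the count the server reports back

    public double LoadPerProcessor {
        get { return (double)PendingWork / ProcessorCount; }
    }
}

public static class LoadBalancer {
    // Pick the server with the least pending work per processor.
    public static ServerState ChooseServer(IList<ServerState> servers) {
        if (servers == null || servers.Count == 0) {
            throw new ArgumentException("at least one server is required");
        }
        ServerState best = servers[0];
        foreach (ServerState s in servers) {
            if (s.LoadPerProcessor < best.LoadPerProcessor) {
                best = s;
            }
        }
        return best;
    }

    public static void Main() {
        var servers = new List<ServerState> {
            new ServerState { Host = "localhost", ProcessorCount = 1, PendingWork = 3 },
            new ServerState { Host = "someremotehost", ProcessorCount = 4, PendingWork = 8 }
        };
        ServerState target = ChooseServer(servers);
        Console.WriteLine(target.Host);   // someremotehost: 8/4 = 2 is less than 3/1 = 3
        target.PendingWork++;             // count the item that was just sent
    }
}
```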

That's about it. You now have a way to split complex LINQ tasks across multiple machines with just a single call on your LINQ statement.

Download The Code

Posted by jferner May 22 2008, 02:39:29 PM EDT
Thursday May 01, 2008
Performance: LINQ to XML vs XmlDocument vs XmlReader

I recently had a project where I needed to ingest large XML documents using C#, so I was curious which XML reader technology would be the fastest. So I coded up a quick benchmark that would compare LINQ to XML, XmlDocument.Load, and XmlReader against each other.

The Test Data

I generated a very simple XML file before each run of a test. The ids were random and the number of "child" nodes varied based on the run. The following is an example of the test data I used.
<root>
  <child id='123'/>
  <child id='234'/>
  ...
</root>

The Test

As I said before, I wanted to compare LINQ to XML, XmlDocument.Load, and XmlReader against each other. I ran each of these technologies using 1, 10, 100, 1,000, 10,000, and 100,000 "child" nodes. I also ran each against an XML document using UTF-8, ASCII, and UTF-32 encodings. Each iteration was run 100 times to reduce anomalies. In each of the tests I call the method "ProcessId", which simulates the processing of the "id" attribute.
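
For reference, a harness along these lines would produce the kind of per-run average reported in the results. It is a sketch under assumptions rather than the exact benchmark code: WriteTestFile and AverageMilliseconds are illustrative names, the file is written once instead of before every run, and File.ReadAllText stands in for the reader being measured.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

public static class Benchmark {
    // Write <root> containing the requested number of <child id='...'/> nodes.
    public static void WriteTestFile(string fileName, int childCount, Encoding encoding) {
        Random random = new Random();
        using (StreamWriter writer = new StreamWriter(fileName, false, encoding)) {
            writer.WriteLine("<root>");
            for (int i = 0; i < childCount; i++) {
                writer.WriteLine("  <child id='" + random.Next() + "'/>");
            }
            writer.WriteLine("</root>");
        }
    }

    // Run the reader repeatedly and return the mean time per run in milliseconds.
    public static double AverageMilliseconds(Action<string> reader, string fileName, int runs) {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++) {
            reader(fileName);
        }
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds / runs;
    }

    public static void Main() {
        string fileName = Path.GetTempFileName();
        try {
            WriteTestFile(fileName, 1000, Encoding.UTF8);
            // Placeholder workload; swap in one of the three readers under test.
            double ms = AverageMilliseconds(f => File.ReadAllText(f), fileName, 100);
            Console.WriteLine("{0:F7} ms per run", ms);
        } finally {
            File.Delete(fileName);
        }
    }
}
```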

XmlDocument.Load

I thought the code for XmlDocument.Load was the cleanest and easiest to understand, although I must admit I like XPath. XmlDocument does have some security concerns but that's another post. Here is the code I used to load and search the document:
private static void XmlDocumentReader(string fileName) {
    XmlDocument doc = new XmlDocument();
    doc.Load(fileName);
    XmlNodeList nodes = doc.SelectNodes("//child");
    if (nodes == null) {
        throw new ApplicationException("invalid data");
    }
    foreach (XmlNode node in nodes) {
        string id = node.Attributes["id"].Value;
        ProcessId(id);
    }
}

LINQ to XML

The LINQ to XML code was also very easy to read and understand. I did find that even though LINQ to XML is supposed to use XmlReaders under the covers, calling XDocument.Load does read the whole document into memory before returning. So if you are looking for data at the top or middle of a very large document, this could be a concern. Here is the code I used to load and search the document:
private static void XDocumentReader(string fileName) {
    XDocument doc = XDocument.Load(fileName);
    if (doc == null || doc.Root == null) {
        throw new ApplicationException("invalid data");
    }
    foreach (XElement child in doc.Root.Elements("child")) {
        XAttribute attr = child.Attribute("id");
        if (attr == null) {
            throw new ApplicationException("invalid data");
        }
        string id = attr.Value;
        ProcessId(id);
    }
}
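
If the whole-document load mentioned above is a concern, one option is to walk the file with an XmlReader and turn only each "child" element into an XElement via XNode.ReadFrom. This was not part of the benchmark, and StreamChildren is just a name chosen for this sketch:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StreamingReader {
    // Yield each <child> as a small XElement without loading the whole document.
    public static IEnumerable<XElement> StreamChildren(XmlReader reader) {
        reader.MoveToContent();
        while (!reader.EOF) {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "child") {
                // ReadFrom consumes the element and leaves the reader on the
                // following node, so no extra Read() call is needed here.
                yield return (XElement)XNode.ReadFrom(reader);
            } else {
                reader.Read();
            }
        }
    }

    public static void Main() {
        string xml = "<root><child id='123'/><child id='234'/></root>";
        using (XmlReader reader = XmlReader.Create(new StringReader(xml))) {
            foreach (XElement child in StreamChildren(reader)) {
                Console.WriteLine((string)child.Attribute("id"));   // 123, then 234
            }
        }
    }
}
```

This keeps the XElement and XAttribute API for each node while only one element is held in memory at a time.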

XmlReader

XmlReader, specifically XmlTextReader, was the hardest to write and understand. Because it is a forward-only reader, you need to take what you need while you have it; you can't rewind.
private static void XmlReaderReader(string fileName) {
    using (XmlReader reader = new XmlTextReader(fileName)) {
        while (reader.Read()) {
            if (reader.NodeType == XmlNodeType.Element) {
                if (reader.Name == "child") {
                    reader.MoveToAttribute("id");
                    string id = reader.Value;
                    ProcessId(id);
                }
            }
        }
    }
}

The Results

The following results are in milliseconds for each run. I took the total time to run and divided it by 100.

UTF8Encoding

Nodes                1         10        100      1,000      10,000      100,000
XmlDocument  0.1567800  0.1713450  0.3888620  1.9816480  22.8049260  459.8570340
XmlReader    0.1467460  0.1439580  0.2300500  0.8534400   7.5771640   76.8635690
LINQ to XML  0.1499530  0.1500640  0.2778720  1.4616730  15.7719020  208.9360300

ASCIIEncoding

Nodes                1         10        100      1,000      10,000      100,000
XmlDocument  0.1659350  0.1922080  0.3433140  1.9846330  22.5484690  482.8699720
XmlReader    0.1376840  0.1453730  0.2199810  0.8768260   7.9187380   77.7760560
LINQ to XML  0.1345900  0.1573340  0.2848420  1.4889930  15.1504500  214.9338990

UTF32Encoding

Nodes                1         10        100      1,000      10,000      100,000
XmlDocument  0.1672370  0.1799780  0.4156250  2.7188370  30.6423960  543.4604540
XmlReader    0.1386820  0.1503870  0.2867400  1.4981070  14.4428430  152.7660780
LINQ to XML  0.1317060  0.1866610  0.5385940  2.3631290  21.4566290  274.3280280

Conclusion

XmlReader beats LINQ to XML in almost every run except for very small XML documents. What's interesting is how the numbers scale between the encodings. On the larger documents, XmlReader is roughly twice as slow when reading UTF-32 versus UTF-8 or ASCII encoded XML, yet LINQ to XML and XmlDocument slowed down by a much smaller amount. If you need speed when reading XML documents, stick with XmlReader. If you need readability and maintainability of your code, go with LINQ to XML or XmlDocument.
Posted by jferner May 01 2008, 11:31:58 AM EDT