Description
I had a bet with a co-worker that C# would outperform F# for a simple counting exercise. This turned out to be quite a surprise for many reasons. In this blog post I will try to explain why.
Problem
Sum the numbers from 1 to n using brute force (can't use the (1 + N)*(N/2) equation). Do this in both C# and F# and compare the speed. Since F# needs to do this recursively I thought this should be a no contest win for C#.
Using F#
We initially wrote the code without using tail recursion which immediately ended with a stack overflow. So we went back and wrote it doing it the functional way, with tail recursion. Here is what we came up with.
module FSharpLib let sumNum fromNum toNum = let rec sumNumTail fromNum toNum result = if fromNum > toNum then result else sumNumTail (fromNum+1L) toNum (result+fromNum) sumNumTail fromNum toNum 0L
Using C# - Attempt 1
You'll see later why I say attempt 1. But, this is the simplest C# code I could think of.
public static long CSharpForLoop(long fromNumber, long toNumber) { long result = 0; for (long i = fromNumber; i <= toNumber; i++) { result += i; } return result; }
The Results - Attempt 1
This is the results of running the above code using a fromNumber of 1 and a toNumber of
10000000 and running it 10 times alternating between C# and F# to filter out start-up times, etc.
| F# | 0.8606902s | 11% faster |
|---|---|---|
| C# | 0.9556198s |
I ran it like 20 times because I couldn't believe it. I looked at the C# code for any possible optimizations I could add. But, being so simple there wasn't a lot I could do.
Reflector is an amazing tool
I turned to my good friend Reflector to see what was going on. Here is what the F# code looked like when viewing it as C# code.
public static long sumNum(long fromNum, long toNum) { FastFunc<long, FastFunc<long, FastFunc<long, long>>> sumNumTail = new clo@7(); return FastFunc<long, long>.InvokeFast3<long, long>(sumNumTail, fromNum, toNum, 0L); }and clo@7.Invoke looks like
public override long Invoke(long fromNum, long toNum, long result) { while (fromNum <= toNum) { result += fromNum; toNum = toNum; fromNum += 1L; } return result; }They used a while loop instead of a for loop. OK, I can do that. But, what happened to tail, I thought that tail is why the CLR is so good at functional language implementations.
C# - Attempt 2
public static long CSharpWhileLoop(long fromNumber, long toNumber) { long result = 0; while (fromNumber <= toNumber) { result += fromNumber; // I left out the toNum = toNum because it didn't do anything. fromNumber += 1L; } return result; }and the results (I've got you now F#)...
| F# | 0.8606902s |
|---|---|
| C# For Loop | 0.9556198s |
| C# While Loop | 0.9105592s |
Oh wait... DAMN! What could F# be doing to make it almost 10% faster than C#.
To the IL code Batman
This is the IL for the F# code.
L_0000: nop // while loop L_0001: nop L_0002: ldarg.1 // fromNum <= toNum L_0003: ldarg.2 L_0004: ble.s L_0009 L_0006: nop // return result L_0007: ldarg.3 L_0008: ret L_0009: nop // result += fromNum L_000a: ldarg.1 L_000b: ldc.i8 1 L_0014: add L_0015: ldarg.2 // fromNum += 1L L_0016: ldarg.3 L_0017: ldarg.1 L_0018: add L_0019: starg.s result // loop L_001b: starg.s toNum L_001d: starg.s fromNum L_001f: br.s L_0000This is the IL for the C# while loop.
L_0000: nop // long result = 0L L_0001: ldc.i4.0 L_0002: conv.i8 L_0003: stloc.0 L_0004: br.s L_0012 // while loop L_0006: nop // result += fromNumber L_0007: ldloc.0 L_0008: ldarg.0 L_0009: add L_000a: stloc.0 // fromNumber += 1L L_000b: ldarg.0 L_000c: ldc.i4.1 L_000d: conv.i8 L_000e: add L_000f: starg.s fromNumber L_0011: nop L_0012: ldarg.0 // fromNumber <= toNumber L_0013: ldarg.1 L_0014: cgt L_0016: ldc.i4.0 L_0017: ceq L_0019: stloc.2 L_001a: ldloc.2 L_001b: brtrue.s L_0006 // loop L_001d: ldloc.0 L_001e: stloc.1 L_001f: br.s L_0021 L_0021: ldloc.1 L_0022: ret
What just happened.
Wow, the C# compiler is quite a bit more verbose. First of all, C# makes the while loop into a do/while loop. Secondly, in the loop condition C# creates a temporary variable to store the comparison and then checks for 'true' ("brtrue.s"). If you look at the F# IL you'll see the use of the "ble.s" instructions (branch if less or equal) instead. Finally we see that C# uses a few "conv.i8" instructions (convert to 8 byte/64-bit int) which I can only guess adds to it's defeat.
Conclusion
So is F# just a more performant language or is C# just compiling to more robust IL code which as F# matures will end up suffering the same fate? I personally think (hope actually) the later is true. I'm not an expert in compilers but I would have to assume the C# compiler team is using every trick in the book to make highly performant IL code without sacrificing robustness and I can only assume that F# will get to a point where it needs to make the same concessions to handle some obscure cases.
I wrote my first firmware for the DEFCON badge. Doesn't do much other than replace the default Knight Rider LED sequence with my own, but it's a first step. I did want to note a couple of things that were not mentioned in the badge hacking information that might make it easier to get started.
There are two drivers for windows, one when the badge is in USB bootloader mode and one while the badge is in normal mode. To get to bootloader mode you need to hold down the mode select button while inserting the battery. The driver provided on the CD is for normal mode which took me a while to figure out. To get to the normal mode you need to plug the USB cord in BEFORE you insert the battery. Looking at the firmware I noticed the check for a USB connection is done as one of the very first things. Windows will then recognize the device and you can install the driver from the CD. The driver for bootloader mode is actually part of the Freescale JM60 bootloader program which can also be found on the CD but must be installed before you try bootloader mode.
Also, I couldn't get the full version of Code Warrior so the next best thing was the demo from Freescale. But, the demo has a 32k firmware limit which means removing some of the original firmware code. I removed fat.c from the project and recompiled and fixed until I got it working. This will of course remove your ability to read or write to the SD card using FAT.
Next step is to make the final hack which will hopefully be done by tomorrow so we can show it off to Kingpin. I'll post the final hack information and code tomorrow if it gets done.
After reading John Skeet's blog about Generating Mandelbrot images using PLINQ I got the idea to build my own LINQ extension. Instead of just splitting the work across processors like PLINQ does, I decided to split it across machines as well. Thus was born RemoteLINQ. The concept is simple, take each item from an enumeration and send it to a remote machine for processing. From a users perspective it is just that easy. The magic behind the curtains is not. Handling threading, ordering, communication, etc. was and still is difficult. The code I provide is still very beta and I cannot guarantee it won't blow up, but it's getting better and it does generates Mandelbrot images just fine across 3 machines containing a total of 5 processors.
Code You Write To Make It Work
Here is an example of what you as a user of RemoteLINQ would need to do:1: IEnumerable<int> numbers = Enumerable.Range(0, 500); 2: RemoteContext remoteContext = new RemoteContext( 3: new[] { this.GetType().Assembly }, 4: new[] { "localhost", "someremotehost" }, 5: 7776); 6: RemoteOptions remoteOptions = new RemoteOptions(); 7: IEnumerable<int> results = numbers.AsRemotable(remoteContext, remoteOptions).Select(i => DoSomeWork(i));
This will take the numbers 0 to 500 and send half of the numbers to "localhost" and half of the numbers to "someremotehost" for processing.
Let's walk through the code.
- Line 1, gives us an enumerable of numbers from 0 to 500 for processing.
- Line 2, creates the RemoteContext. This is really just a client which has a list of server connections to which we can send work.
- Line 3, is a list of assemblies that contain the necessary code to run your LINQ statement. The client will actually send a copy of the dll to the RemoteLINQ server so that the server will have a copy of the "DoSomeWork" (line 7) method so that it can execute it.
- Line 4, is the list of servers to divvy the work across.
- Line 5, is just the port that the client will try to connect to the server on.
- Line 6, contains any options to direct how the work gets split across the servers, etc.
- Line 7, is your LINQ statement. Notice the "AsRemotable". This is where all the magic starts. This will actually return a IRemoteEnumerable which knows how to take anything to the right of it and run it remotely.
How it works
Lets start out with what lambda expression look like in compiled code. When we compile lambda expressions they are moved into a new method with a compiler generated name. If you reflect on a class with lambdas and get all the methods on that type you will see a bunch of method names which you didn't create. Those are your lambdas. The fact that they are first class methods allows us from the outside, with reflection of course, to call those methods. So when we call "Select" on an IRemoteEnumerable, RemoteLINQ is taking the lambda expression passed in, and storing information about it into a serializable object (assembly, declaring type, method name, etc.). This will allow us to send that information to the server, who can then call the method without all the other stuff around it.
On the server side of things, when we start up a RemoteLINQ server, it will create a worker thread for each processor in the machine. The worker thread will wait quietly until new work arrives on the socket connection. When work arrives we take the serialized object I mentioned before and find the assembly, type, and method and execute that method with the work given to us. In this case the work will be one of the numbers from 0 to 500. The client will handle load balancing the servers by choosing the server with the least amount of pending work per processor. The amount of work queued up on the server is sent back after each unit of work is completed so that the client will have the most up to date picture of what the servers are doing to make the best guess as to where to send the next piece of work.
That's about it. You now have a way to split complex LINQ tasks across multiple machine with just a single call on your LINQ statement.
The Test Data
I generated a very simple XML file before each run of a test. The id's were random and the number of "child" nodes varied based on the run. The following is an example of the test data I used.<root> <child id='123'/> <child id='234'/> ... </root>
The Test
As I said before I wanted to compare LINQ to XML, XmlDocument.Load, and XmlReader against each other. I ran each of these technologies using 1, 10, 100, 1000, 10,000, 100,000 "child" nodes. I also ran each against a XML document using UTF-8, ASCII, and UTF-32 encodings. Each iteration was run 100 times to reduce anomalies. In each of the tests I call the method "ProcessId" which simulates the processing of the "id" attribute.XmlDocument.Load
I thought the code for XmlDocument.Load was the cleanest and easiest to understand, although I must admit I like XPath. XmlDocument does have some security concerns but that's another post. Here is the code I used to load and search the document:private static void XmlDocumentReader(string fileName) { XmlDocument doc = new XmlDocument(); doc.Load(fileName); XmlNodeList nodes = doc.SelectNodes("//child"); if (nodes == null) { throw new ApplicationException("invalid data"); } foreach (XmlNode node in nodes) { string id = node.Attributes["id"].Value; ProcessId(id); } }
LINQ to XML
LINQ to XML was also very easy to read and understand code. I did find that even though LINQ to XML is supposed to use XmlReaders under the covers calling XDocument.Load does read the whole document into memory before returning. So if you are looking for data at the top of middle of a very large document this could be a concern. Here is the code I used to load and search the document:private static void XDocumentReader(string fileName) { XDocument doc = XDocument.Load(fileName); if (doc == null | doc.Root == null) { throw new ApplicationException("invalid data"); } foreach (XElement child in doc.Root.Elements("child")) { XAttribute attr = child.Attribute("id"); if (attr == null) { throw new ApplicationException("invalid data"); } string id = attr.Value; ProcessId(id); } }
XmlReader
XmlReader, specifically XmlTextReader was the hardest to write and understand. With it's quirks of being a forward only reader you need to take what you need while you have it because you can't rewind.private static void XmlReaderReader(string fileName) { using (XmlReader reader = new XmlTextReader(fileName)) { while (reader.Read()) { if (reader.NodeType == XmlNodeType.Element) { if (reader.Name == "child") { reader.MoveToAttribute("id"); string id = reader.Value; ProcessId(id); } } } } }
The Results
The following results are in milliseconds for each run. I took the total time to run and divided it by 100.UTF8Encoding
| 1 | 10 | 100 | 1,000 | 10,000 | 100,000 | |
|---|---|---|---|---|---|---|
| XmlDocument | 0.1567800 | 0.1713450 | 0.3888620 | 1.9816480 | 22.8049260 | 459.8570340 |
| XmlReader | 0.1467460 | 0.1439580 | 0.2300500 | 0.8534400 | 7.5771640 | 76.8635690 |
| LINQ to XML | 0.1499530 | 0.1500640 | 0.2778720 | 1.4616730 | 15.7719020 | 208.9360300 |
ASCIIEncoding
| 1 | 10 | 100 | 1,000 | 10,000 | 100,000 | |
|---|---|---|---|---|---|---|
| XmlDocument | 0.1659350 | 0.1922080 | 0.3433140 | 1.9846330 | 22.5484690 | 482.8699720 |
| XmlReader | 0.1376840 | 0.1453730 | 0.2199810 | 0.8768260 | 7.9187380 | 77.7760560 |
| LINQ to XML | 0.1345900 | 0.1573340 | 0.2848420 | 1.4889930 | 15.1504500 | 214.9338990 |
UTF32Encoding
| 1 | 10 | 100 | 1,000 | 10,000 | 100,000 | |
|---|---|---|---|---|---|---|
| XmlDocument | 0.1672370 | 0.1799780 | 0.4156250 | 2.7188370 | 30.6423960 | 543.4604540 |
| XmlReader | 0.1386820 | 0.1503870 | 0.2867400 | 1.4981070 | 14.4428430 | 152.7660780 |
| LINQ to XML | 0.1317060 | 0.1866610 | 0.5385940 | 2.3631290 | 21.4566290 | 274.3280280 |

