I've been using Lucene for the better part of 2 years, from initial playing around, to prototyping to production application. It's an impressive library and it has come along way in the past couple of years.
When I first started playing around with it the version was 2.1 and the search times were so much faster than what we were trying to use at the time (Oracle Text). The first test was indexing a monster dataset and searching it quickly. It passed with flying colors!
Next was to add in record level access control. Easy and extremely fast.
Next was to add in all the other data needed for our application. That was a little bit harder, considering that we have close to 150 fields in our index and well into the billion record range (growing everyday).
The problem was that we needed more memory and there was no extra money for any more servers (or upgrades). So there we were, stuck. So I decided to start poking around using visualvm to see if there were any places in our application or in Lucene to save some memory.
We had already disabled norms on all our fields (we really didn't need norms for our data nor did we have the resources). Took a long look at all our fields that we were indexing to see if there were any we didn't need, but we really did need them all. Then I stumbled across the TermInfosReader class in Lucene.
This is where Lucene really gets it speed, but also uses quite a bit memory to do it. And this is where I wrote my first Lucene patch.
In TermInfosReader there is a bunch stuff but the big memory hogs are in three arrays.
- Terms[]
- TermInfos[]
- long[]
Basically Lucene does a binary search across the Terms array (that by default contains every 128th Term in the index) with a given Term to find where on disk the exact Term needed lives. There's a little bit more going on in the class than that, but that's basically what it's doing.
So, I started this patch with the need to save memory. So how in the world do you do that in java when everything is already in basic arrays and everything is needed in memory. Well you have to save it another way, references. References are a hidden cost in Java, every single reference in 32-bit JVM costs you 4 bytes, and 64-bit JVM it's 8 (assuming that you don't have compressed references).
Let's count the references.
- Terms[] length * 3, 1 reference for the Term and 2 references for the two Strings inside the Term
- TermInfo[] length * 1
- long[] = 1 reference total
So, let's talk numbers. If you have a billion terms in your index, that's 125 MB (1,000,000,000 / 128 * (3 + 1 references) * 4 bytes for every ref) bytes of memory for the references. In a 64-bit JVM that doubled 250 MB. Not to mention the object overhead for every one of those Term and TermInfos objects. Wow that's a lot!
So I decided to remove nearly all of those references by using a byte array and an int array as an offset index.
The results were impressive!
Given an index of 6.2 GB size 1,010,000 number of documents with 179,822,683 number of terms the default implementation uses 292,235,512 bytes to just get the index usable.
My no-ref implementation of the same index uses only 49,849,744 bytes get the index usable. That a 17% of the original size, that's an 83% savings!
And the best part is, that it loads the segments faster into memory. So those real-time updates will be online faster. The run-time performance is slightly faster as well. But the huge performance saving is in garbage collection. Over 7 times faster for full GC's on my Macbook Pro. Wow!
I think that the results speak for themselves, and I hope that the Lucene folks will accept my patch. That way I won't have to continue patching each version after the fact. Also removing references can be great, but the code required to do it, and maintain the same level of performance, is ugly! So don't try this at home!
7 Comments
Leave a comment
0 TrackBacks
Listed below are links to blogs that reference this entry: My First Lucene Patch - Making Lucene Do More With Less.
TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/1553



Nice work!
Going into Flex?
Yep, I need to rework the patch to go into flex. Should be complete in a couple of days. Thanks!
Another great example of why we need a true struct in java.
This is a fantastic way of getting your object count down, but imagine you could have an array of structs and bypass the need for reading into and out of the byte array.
Absolutely! I agree completely!
I'm use the SpellChecker from the contrib package.
My db size is: ~ 210MByte
I applied your patch to Lucene 3.0
The init code of my SpellChecker is:
dir = FSDirectory.open(new File(path + "/didyoumean/" + tableName + "_s1_l"));
ram = new RAMDirectory(dir);
sp = new SpellChecker(ram);
The result is only 2MByte ram savings. It is normal??
Well, the patch is for the internal term infos lookup. It won't have any effect on the RAMDirectory usage. If you were to start your app using the FSDirectory, see the memory usage overhead w/o the patch and then see with the patch that will show you the real % change. But my guess is that because your index is small enough to fit into memory, you won't see a big difference.
This patch was meant to help those indexes that have multi-gigabyte *.tis files.
Great job!