Lucene is a memory hog!


Let me start by saying that I like Lucene, and I have used it to solve many technical problems on my current project. But one aspect of Lucene that I have had issues with is its memory footprint.

Currently we index 38 fields across 1.5 billion documents, and we have implemented a fair similarity object (see Lucene Scoring Documentation) for normalized scoring. We have no use for index-time field boosting or for any type of Norms (see Lucene Scoring Documentation). However, Lucene reads all of the Norms into memory for fast scoring.

So let's do the math:

1 byte per Norm value * 38 fields * 1,500,000,000 documents = 57,000,000,000 bytes

That's roughly 57 Gigs of heap space!
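For anyone who wants to plug in their own numbers, here is a minimal sketch of the same arithmetic in Java. The class and method names are made up for illustration and are not part of Lucene; the only assumption carried over from above is one byte per Norm value per field per document.

```java
public class NormsMemoryEstimate {

    // Hypothetical helper: bytes needed to hold every Norm value in memory,
    // assuming a fixed number of bytes per field per document.
    // Uses long throughout, because 38 * 1,500,000,000 overflows an int.
    static long normsBytes(long bytesPerNorm, long fields, long documents) {
        return bytesPerNorm * fields * documents;
    }

    public static void main(String[] args) {
        long bytes = normsBytes(1, 38, 1_500_000_000L);
        System.out.println(bytes + " bytes");
        // Report both decimal gigabytes and binary gibibytes.
        System.out.printf("%.1f GB, %.1f GiB%n",
                bytes / 1e9, bytes / (double) (1L << 30));
    }
}
```

With the figures from this post, that comes to 57,000,000,000 bytes, or about 53 GiB if you count in powers of two.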

That's quite a bit of memory usage for something that we don't even use. I have since patched Lucene so that the indexed Norms have a much better memory footprint, something around 1.5 Gigs. Not great, but livable.

I haven't posted my patch because it's all or nothing; I haven't implemented a way to turn it on and off. But if there is anyone else out there with an application that is running out of memory as users use more and more indexed fields in Lucene, take a look at the SegmentReader class. There's a byte array in there that you should take a look at. Happy hunting!

4 Comments

Erik Hatcher said:

Aaron - Lucene has the capability to omit norms. Look at Field#setOmitNorms().

Erik

Aaron said:

Thanks Erik,

If you use:

Field field = new Field("n","v",Field.Store.YES,Field.Index.TOKENIZED);
field.setOmitNorms(true);

This works as you described. However, when I first went down this route I tried the NO_NORMS option on the Field.Index class. That option does not tokenize the field, and I didn't try combining the Field.Index.TOKENIZED option with manually setting omit norms on the field. Thanks again for your help!

Aaron

Sanne said:

Just out of curiosity, what's the size of your indexes? If you happen to rebuild the index, how much time does it take?

Aaron said:

We have around 1.5 billion documents, and the total size of the 16 partitions is around 950 GB. Once we get all of our data extracted from the source system (which takes about 18 hours right now), it takes about 8 hours to index everything. We have a dedicated 16-node cluster just for indexing; each machine has 4 cores and 3 GB of RAM. And we rebuild the index once a week.
