Recently about Lucene

While I'm working on getting my code into lucene.  I have patched 3.0.0 and 2.9.1 with my low memory patch.

By default the small memory footprint is enabled to change back to the default implementation set the following system property.

-Dorg.apache.lucene.index.TermInfosReader=default

Have fun!  If you have any problems or questions please let me know or add to this LUCENE-2205.



Lucene gives users the ability to search massive amounts of data in a very short amount of time.  However allowing users to page through the entire result set of their search can be difficult and risky depending on how many users are performing searches and how many of those users are paging through 100's if not 1,000's of hits per page.

Problem Scenario:

  • Each of your indexes contains 100,000,000's documents.
  • You have 500 users on your system actively performing searches.
  • You have 100 search results per page.
  • And, your typical user pages through the first 10 pages of results.  (Normal occurrence on some systems)

So for the 10th page you will have to collect 1,000 hits, at a cost of a float plus an int plus some object overhead per hit.  So let's say 20 bytes per hit.  So you have 500 users * 1,000 hits * 20 bytes = 1,000,000 bytes or 1M.  Easy, no problem, right?

Well what if you also give the users an easy way to move to the end of the result set.  Hmm...  Well for a result set size of 10,000 it's no big deal.  But what if you hand out result sets in the order of a 1,000,000 or even 10,000,000.

At this point you really just want to prevent the system from running out of memory.  Because if you have 25 users getting 10,000,000 results each and they all click last page at the same time.  That's going to cost you 5 Gig of heap!  At least.  Some might say that it won't ever happen, but in my experience, if it can happen, it will.

So I created a Paging Hit Collector, that windows the hits to the users.  It's uses the last hit collected from the previous search pass, to feed the next search pass.  So yes if a user clicks the last page, it might perform multiple searches but, the system won't run out of memory.

The user's will get there answer eventually, and if your system gives them some feedback as it searches and pages, they will probably sit and wait for it to come back.  Instead of giving up and hitting cancel and search and cancel and search, and making the system worse and worse.

The Simple Example:

IndexSearcher searcher = new IndexSearcher(reader);
TermQuery query = new TermQuery(new Term("f1", "value")); IterablePaging paging = new IterablePaging(searcher, query, 100);
for (ScoreDoc sd : paging.skipTo(90)) {   System.out.println("doc id [" + sd.doc + "] " +     "score [" + sd.score + "]"); }

The More Advanced Example:

IndexSearcher searcher = new IndexSearcher(reader);
TotalHitsRef totalHitsRef = new TotalHitsRef(); ProgressRef progressRef = new ProgressRef();
TermQuery query = new TermQuery(new Term("f1", "value")); IterablePaging paging = new IterablePaging(searcher, query, 100);
for (ScoreDoc sd : paging.skipTo(90).                           gather(20).                           totalHits(totalHitsRef).                           progress(progressRef)) {
  System.out.println("time [" + progressRef.queryTime() + "] " +     "total hits [" + totalHitsRef.totalHits() + "] " +     "searches [" + progressRef.searchesPerformed() + "] " +     "position [" + progressRef.currentHitPosition() + "] " +     "doc id [" + sd.doc + "] " +     "score [" + sd.score + "]"); }

Here's a link to the code LUCENE-2215.

I've been using Lucene for the better part of 2 years, from initial playing around, to prototyping to production application.  It's an impressive library and it has come along way in the past couple of years.

When I first started playing around with it the version was 2.1 and the search times were so much faster than what we were trying to use at the time (Oracle Text).  The first test was indexing a monster dataset and searching it quickly.  It passed with flying colors!

Next was to add in record level access control.  Easy and extremely fast.

Next was to add in all the other data needed for our application.  That was a little bit harder, considering that we have close to 150 fields in our index and well into the billion record range (growing everyday).

The problem was that we needed more memory and there was no extra money for any more servers (or upgrades).  So there we were, stuck.  So I decided to start poking around using visualvm to see if there were any places in our application or in Lucene to save some memory.

We had already disabled norms on all our fields (we really didn't need norms for our data nor did we have the resources).  Took a long look at all our fields that we were indexing to see if there were any we didn't need, but we really did need them all.  Then I stumbled across the TermInfosReader class in Lucene.

This is where Lucene really gets it speed, but also uses quite a bit memory to do it.  And this is where I wrote my first Lucene patch.

In TermInfosReader there is a bunch stuff but the big memory hogs are in three arrays.

  • Terms[]
  • TermInfos[]
  • long[]
Basically Lucene does a binary search across the Terms array (that by default contains every 128th Term in the index) with a given Term to find where on disk the exact Term needed lives.  There's a little bit more going on in the class than that, but that's basically what it's doing.

So, I started this patch with the need to save memory.  So how in the world do you do that in java when everything is already in basic arrays and everything is needed in memory.  Well you have to save it another way, references.  References are a hidden cost in Java, every single reference in 32-bit JVM costs you 4 bytes, and 64-bit JVM it's 8 (assuming that you don't have compressed references).

Let's count the references.

  • Terms[] length * 3, 1 reference for the Term and 2 references for the two Strings inside the Term
  • TermInfo[] length * 1
  • long[] = 1 reference total
So, let's talk numbers.  If you have a billion terms in your index, that's 125 MB (1,000,000,000 / 128 * (3 + 1 references) * 4 bytes for every ref) bytes of memory for the references.  In a 64-bit JVM that doubled 250 MB.  Not to mention the object overhead for every one of those Term and TermInfos objects. Wow that's a lot!

So I decided to remove nearly all of those references by using a byte array and an int array as an offset index.

The results were impressive!

Given an index of 6.2 GB size 1,010,000 number of documents with 179,822,683 number of terms the default implementation uses 292,235,512 bytes to just get the index usable.

My no-ref implementation of the same index uses only 49,849,744 bytes get the index usable.  That a 17% of the original size, that's an 83% savings!

And the best part is, that it loads the segments faster into memory.  So those real-time updates will be online faster.  The run-time performance is slightly faster as well.  But the huge performance saving is in garbage collection.  Over 7 times faster for full GC's on my Macbook Pro.  Wow!

I think that the results speak for themselves, and I hope that the Lucene folks will accept my patch.  That way I won't have to continue patching each version after the fact.  Also removing references can be great, but the code required to do it, and maintain the same level of performance, is ugly!  So don't try this at home!


Recently I have been rewriting a lot of our lucene search engine code for a web application that I'm currently supporting.  Our physical environment is a little different than most, we have a single very large computer (a left over from a previous project) running several virtual machines.  The virtual machine that we are currently using for our searching duties is a 64 cpu 128G box.

Before I go into the tuning saga, I would like to say that in a previous version of the search engine we ran many more (16), smaller jvms all on the same box.  They work together through a main controller module that coordinates calls to all the searching jvms.  After the rewrite of the searchers, there was a lot of pressure to get the new code out the door.  The new code partitions the data and searching duties in a very similar way to the previous version, however I haven't had time to focus on getting all the parts broken up across all the jvms.  It's very doable, but do to customer priorities it hasn't been developed.  But I was in luck (kinda), our environment was a single computer, so I thought I will just run one large jvm.  I knew going in that garbage collection (GC) was going to be a problem, I just didn't realize how much.

Currently, we are running the 64-bit IBM java 6 jvm, the IBM jvm is very different from the Sun jvm, but no more so than in the GC tuning department.  In the Sun jvm you have all kinds of settings to tweak, algorithms to use etc.  The IBM jvm just isn't as advanced, you have 4 different GC policies, and sizing for various parts of the internals.  And that's it!  Great, it should be simple right?  Well sorta.

First try, bring the system online with a 50G (yes that's gigabytes) heap, with all the default settings.  Run some load tests and see where we are.

Everything is running great right up to the first full GC, 25 seconds doesn't sound that long but when you are waiting for a computer to return results, it's an eternity.  For those that don't know, when a full GC occurs (some of the newer Sun algorithms are different) it stops the world (STW).  This means the JVM is frozen until the GC is complete.  No good.

So the default policy is the optthruput policy (-Xgcpolicy:optthruput).  After digging through the IBM documentation, this type of policy should be used for maximum throughput, but at the expense of pauses during GC.  They also mention that this should be used for batch processing when pauses are really a problem.

Next I tried the optavgpause policy (-Xgcpolicy:optavgpause), this is suppose to smooth out the GCs by kicking off the mark phase early.  I'm not going to talk about mark, sweep, and compaction, you can find it here (http://www.ibm.com/developerworks/ibm/library/i-incrcomp/).  But basically it's suppose to run a concurrent parallel mark phase before the jvm runs out of heap space and performs a full GC that STW.  This did help, got us down in the 10-15 second range on average, but the problem was that the mark phase was too slow to start under heavy load and didn't give us a whole lot of concurrent marking before the STW.  I found this out by adding -verbose:gc, this adds a lot of debugging information to the standard out.  You should just run this all the time, it provides a lot of useful information about your application.

Next I tried subpool (-Xgcpolicy:subpool), it worked fine but was plagued by the same problem, slow full GCs.  Subpool is suppose to work better on large SMP machines like ours, but in the end it was more of the same.  It's full GCs were in the 10-12 second range, plus the application seemed to run slower, by about 10%.

And last I tried gencon (-Xgcpolicy:gencon), the newest of all their policies.  Gencon is suppose to be used on "transactional systems", systems that create a lot of short lived objects.  Isn't that what most java applications do?  When I started it up, it seemed to be faster, and our load test confirmed that, 30% more throughput.  But the amazing thing was that the full GCs were fast, really fast, in the 200-400ms range, for a 50G heap!  But wait, it wasn't using most of the heap, and it was GCing all the time.  Back to the verbose:gc log, AHA!  The nursery was too small, it was about 10% of the heap, and because our application is almost all short lived objects I decided to increase the size of the nursery.  I gave it 50% of the heap, and the time between GCs slowed down, but the time to perform the GC was still in the 350ms range.  Awesome!

I finally settled on 100G heap with 50% a nursery, and the full GCs are now in the 400-600ms range.  I can live with that, because this gives us a huge ceiling for load, and capacity.

So to summarize, if you application needs to run a huge heap size, and you are using the IBM jvm, I would start by using the gencon policy.  It seems to be the most modern of all their policies, and it seems to work the best.

Good luck!
The application that I'm current developing uses Lucene for searching and data retrieval.  We recently had a rather special need to use one of the utility classes in Lucene, OpenBitSet.  OpenBitSet is not unlike the regular BitSet provided by the java.util package.  The biggest difference is that the class is NOT final.  So for someone like me, that tinkers and optimizes to get every last bit of performance out of my application, final classes are a major PITA.

The problem:

OpenBitSet provides an optimized "find the next set bit starting here" method called nextSetBit(long index).  However I needed to find the previous set bit, so naturally I started by using the built-in api.  Iterate backwards over each bit starting at given position until a bit that it is set to true is found.  Probably not the fastest way, but it works.  I thought that if I wrote my own "find the previous set bit starting here" method it could be potently a lot faster.  And because the class is not final and the internals are not private I can extend it, YAY!

So after some work and some tests I came up with this:

    /**
     * Finds the previous set bit in the bitset, start looking at the index position.
     * @param index the index to start looking.
     * @return the previous position in the bitset that is set.
     */
    public long prevSetBit(long index) {
        //get index of the word to start with
        int i = (int) (index >>> 6); //div by 64
        if (i >= wlen) {
            //if the index requested is greater than the actual
            //max word, just start with the last word
            i = wlen - 1;
        }
        //loop backwards over the bits[] until a non-zero word is found
        while (i >= 0) {
            long word = bits[i];
            if (word != 0) {
                //find the position of the most significant bit
                int pomsb = getPositionOfMostSigBit(word);
                //multiply by 64
                long l = ((long) i) << 6;
                //add together to find the actual position
                return l + pomsb;
            }
            i--;
        }
        return -1;
    }
   
    /**
     * The word passed in, should not be 0.
     * @param word the word to find the position of the most significant bit.
     * @return the position in the word.
     */
    private int getPositionOfMostSigBit(long word) {
        if (word == 1l) {
            return 0;
        }
        int i = 0;
        while (word != 1) {
            word = word >>> 1;
            i++;
        }
        return i;
    }

And of course I had to write a performance test to make sure that my work had produced something useful.  Here's the test:

    public static void main(String[] args) {
        CustomOpenBitSet bitSet = new CustomOpenBitSet(Integer.MAX_VALUE);
        int maxCount = 1000000;
        fillBitSet(bitSet, maxCount);
       
        for (int i = 0; i < 10; i++) {
            double s1 = System.nanoTime();
            int slow = runSlowReverseSeeksToSetBits(bitSet, maxCount);
            double e1 = System.nanoTime();
           
            double s2 = System.nanoTime();
            int fast = runFastReverseSeeksToSetBits(bitSet, maxCount);
            double e2 = System.nanoTime();
           
            assert (slow == fast);
           
            System.out.println("The slow time took " + ((e1-s1) / ONE_BILLION) + " seconds");
            System.out.println("The fast time took " + ((e2-s2) / ONE_BILLION) + " seconds");
        }
    }

    private static int runFastReverseSeeksToSetBits(CustomOpenBitSet bitSet, int maxCount) {
        int count = 0;
        long start = Integer.MAX_VALUE;
        while (count < maxCount - 1) {
            long position = bitSet.prevSetBit(start);
            count++;
            start = position - 1;
        }
        return count;
    }

    private static int runSlowReverseSeeksToSetBits(CustomOpenBitSet bitSet, int maxCount) {
        int count = 0;
        long start = Integer.MAX_VALUE;
        while (count < maxCount - 1) {
            long position = slowSeek(bitSet,start);
            count++;
            start = position - 1;
        }
        return count;
    }

    private static long slowSeek(CustomOpenBitSet bitSet, long position) {
        while (!bitSet.get(position)) {
            position--;
            if (position < 0) {
                return -1;
            }
        }
        return position;
    }

    private static void fillBitSet(CustomOpenBitSet bitSet, int numberOfBitsToSet) {
        Random random = new Random();
        int nextInt = -1;
        for (int i = 0; i < numberOfBitsToSet; i++) {
            do {
                nextInt = random.nextInt(Integer.MAX_VALUE);
            } while (bitSet.fastGet(nextInt));
            bitSet.set(nextInt);
        }
    }
The test creates a BitSet that is Integer.MAX_VALUE long and randomly populates it with 1 million bits set to true.  Then starting at Integer.MAX_VALUE, find all the bits that are set to true, moving from the greatest position to the smallest position.  Do this 10 times and print off the time taken to perform both algorithms.

Pass [0], the slow time took 7.867899136 seconds
Pass [0], the fast time took 0.059851008 seconds
Pass [1], the slow time took 7.84461184 seconds
Pass [1], the fast time took 0.051954944 seconds
Pass [2], the slow time took 7.820731904 seconds
Pass [2], the fast time took 0.07801088 seconds
Pass [3], the slow time took 7.81953408 seconds
Pass [3], the fast time took 0.074951168 seconds
Pass [4], the slow time took 7.798106112 seconds
Pass [4], the fast time took 0.07792384 seconds
Pass [5], the slow time took 7.742498048 seconds
Pass [5], the fast time took 0.067644928 seconds
Pass [6], the slow time took 7.664754944 seconds
Pass [6], the fast time took 0.07476992 seconds
Pass [7], the slow time took 7.841472768 seconds
Pass [7], the fast time took 0.073187072 seconds
Pass [8], the slow time took 7.927188992 seconds
Pass [8], the fast time took 0.07574912 seconds
Pass [9], the slow time took 7.85273088 seconds
Pass [9], the fast time took 0.076514048 seconds
So as you can see the new method is about 100 times faster than just iterating backwards over the bitset, given this sample data.  I would assume that if the sample size of 1 million were to increase, the gap between the 2 methods would narrow.