Lucene Hit Paging Made Easier

| | Comments (15) | TrackBacks (0)
Lucene gives users the ability to search massive amounts of data in a very short amount of time.  However allowing users to page through the entire result set of their search can be difficult and risky depending on how many users are performing searches and how many of those users are paging through 100's if not 1,000's of hits per page.

Problem Scenario:

  • Each of your indexes contains 100,000,000's documents.
  • You have 500 users on your system actively performing searches.
  • You have 100 search results per page.
  • And, your typical user pages through the first 10 pages of results.  (Normal occurrence on some systems)

So for the 10th page you will have to collect 1,000 hits, at a cost of a float plus an int plus some object overhead per hit.  So let's say 20 bytes per hit.  So you have 500 users * 1,000 hits * 20 bytes = 10,000,000 bytes or 10M.  Easy, no problem, right?

Well what if you also give the users an easy way to move to the end of the result set.  Hmm...  Well for a result set size of 10,000 it's no big deal.  But what if you hand out result sets in the order of a 1,000,000 or even 10,000,000.

At this point you really just want to prevent the system from running out of memory.  Because if you have 25 users getting 10,000,000 results each and they all click last page at the same time.  That's going to cost you 5 Gig of heap!  At least.  Some might say that it won't ever happen, but in my experience, if it can happen, it will.

So I created a Paging Hit Collector, that windows the hits to the users.  It's uses the last hit collected from the previous search pass, to feed the next search pass.  So yes if a user clicks the last page, it might perform multiple searches but, the system won't run out of memory.

The user's will get there answer eventually, and if your system gives them some feedback as it searches and pages, they will probably sit and wait for it to come back.  Instead of giving up and hitting cancel and search and cancel and search, and making the system worse and worse.

The Simple Example:

IndexSearcher searcher = new IndexSearcher(reader);
TermQuery query = new TermQuery(new Term("f1", "value")); IterablePaging paging = new IterablePaging(searcher, query, 100);
for (ScoreDoc sd : paging.skipTo(90)) {   System.out.println("doc id [" + sd.doc + "] " +     "score [" + sd.score + "]"); }

The More Advanced Example:

IndexSearcher searcher = new IndexSearcher(reader);
TotalHitsRef totalHitsRef = new TotalHitsRef(); ProgressRef progressRef = new ProgressRef();
TermQuery query = new TermQuery(new Term("f1", "value")); IterablePaging paging = new IterablePaging(searcher, query, 100);
for (ScoreDoc sd : paging.skipTo(90).                           gather(20).                           totalHits(totalHitsRef).                           progress(progressRef)) {
  System.out.println("time [" + progressRef.queryTime() + "] " +     "total hits [" + totalHitsRef.totalHits() + "] " +     "searches [" + progressRef.searchesPerformed() + "] " +     "position [" + progressRef.currentHitPosition() + "] " +     "doc id [" + sd.doc + "] " +     "score [" + sd.score + "]"); }

Here's a link to the code LUCENE-2215.

15 Comments

Bjørn said:

Hi

Thanks, this looks really nice.
I'm about to test your code but I have one question: I need to sort my results (using SortField?) before the paging. Any idea how how I do that? Can I add some code to achieve that?
(I have just upgraded from Lucene 2.3 to 3.0 and I have not used Collector's before).

Regards

Aaron said:

You are right I did leave out sorting! It will be easy to add, I will do it this evening and post a patch update to the LUCENE-2215 issue. Thanks!

Bjørn said:

That would be great Aaron, thanks, I'm looking forward seeing the update. Take care...

Bjørn said:

That would be great Aaron, thanks, I'm looking forward seeing the update. Take care...

Bjørn said:

Hi again Aaron,
Not to nag at you, I'm sure you have lots of things to do. But I would really appreciate if you could look at the "sort"-update soon..
Thanks again.

Aaron said:

Sorry, I have been digging out and dealing with the weather here. I will get back to this at some point today. But don't look for anything posted until tonight, east coast time.

Aaron

Aaron said:

Well I have taken a look at adding the Sort feature. It will be not be easy, it's going to require some core lucene code surgery. I will continue to work on it, but it's going to take awhile to get a solution. Sorry.

Bjørn said:

Okey, I was afraid of that. I hope you get it to work. Thanks..

sudarsan said:

Hi Aaron,

Thank you for your contributions.

BTW, from where did you use the HitQueue(int, boolean) class?

Could you provide some inputs, please.

Aaron said:

Not sure if I understand your question, but the HitQueue class is a lucene provided class in the org.apache.lucene.search package. It is used in most of the built-in collectors.

Aaron

Chetan said:

Thanks,

What about sorting?

Guester said:

Hi, the calculation of search result size should be 10M instead of 1M?

500*1000*20 = 10,000,000 = 10M

Or have I misinterpreted your calculation?

Thanks

Aaron said:

Yep you are right, thanks!

Olaf said:

Hello Aaron,

tnx for the very interesting post! I was wondering if this tecnique is also applicable to the (imho) rather interesting/complex issue of giving previous/next buttons to move through the adjoining brothers/sisters in a resultset. Especially concerning overhead/performance issues in a high traffic/volume data environment.

cheers, Olaf

Aaron said:

It's possible, but in my experience this is mainly used as a safety mechanize to prevent the search engine from running out of resources.

Thanks,
Aaron

Leave a comment


Type the characters you see in the picture above.

0 TrackBacks

Listed below are links to blogs that reference this entry: Lucene Hit Paging Made Easier.

TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/1556