Lucene Hit Paging Made Easier
By Aaron Mccurry
Jan 18, 2010
Lucene gives users the ability to search massive amounts of data in a very short amount of time. However allowing users to page through the entire result set of their search can be difficult and risky depending on how many users are performing searches and how many of those users are paging through 100's if not 1,000's of hits per page.
The Simple Example:
The More Advanced Example:
Problem Scenario:
- Each of your indexes contains 100,000,000's documents.
- You have 500 users on your system actively performing searches.
- You have 100 search results per page.
- And, your typical user pages through the first 10 pages of results. (Normal occurrence on some systems)
So for the 10th page you will have to collect 1,000 hits, at a cost of a float plus an int plus some object overhead per hit. So let's say 20 bytes per hit. So you have 500 users * 1,000 hits * 20 bytes = 10,000,000 bytes or 10M. Easy, no problem, right?
Well what if you also give the users an easy way to move to the end of the result set. Hmm... Well for a result set size of 10,000 it's no big deal. But what if you hand out result sets in the order of a 1,000,000 or even 10,000,000.
At this point you really just want to prevent the system from running out of memory. Because if you have 25 users getting 10,000,000 results each and they all click last page at the same time. That's going to cost you 5 Gig of heap! At least. Some might say that it won't ever happen, but in my experience, if it can happen, it will.
So I created a Paging Hit Collector, that windows the hits to the users. It's uses the last hit collected from the previous search pass, to feed the next search pass. So yes if a user clicks the last page, it might perform multiple searches but, the system won't run out of memory.
The user's will get there answer eventually, and if your system gives them some feedback as it searches and pages, they will probably sit and wait for it to come back. Instead of giving up and hitting cancel and search and cancel and search, and making the system worse and worse.
IndexSearcher searcher = new IndexSearcher(reader);<br />
TermQuery query = new TermQuery(new Term("f1", "value"));
IterablePaging paging = new IterablePaging(searcher, query, 100);<br />
for (ScoreDoc sd : paging.skipTo(90)) {
System.out.println("doc id [" + sd.doc + "] " +
"score [" + sd.score + "]");
}
IndexSearcher searcher = new IndexSearcher(reader);<br />
TotalHitsRef totalHitsRef = new TotalHitsRef();
ProgressRef progressRef = new ProgressRef();<br />
TermQuery query = new TermQuery(new Term("f1", "value"));
IterablePaging paging = new IterablePaging(searcher, query, 100);<br />
for (ScoreDoc sd : paging.skipTo(90).
gather(20).
totalHits(totalHitsRef).
progress(progressRef)) {<br />
System.out.println("time [" + progressRef.queryTime() + "] " +
"total hits [" + totalHitsRef.totalHits() + "] " +
"searches [" + progressRef.searchesPerformed() + "] " +
"position [" + progressRef.currentHitPosition() + "] " +
"doc id [" + sd.doc + "] " +
"score [" + sd.score + "]");
}
Here's a link to the code LUCENE-2215.