Lucene gives users the ability to search massive amounts of data in a very short amount of time. However allowing users to page through the entire result set of their search can be difficult and risky depending on how many users are performing searches and how many of those users are paging through 100's if not 1,000's of hits per page.
The Simple Example:
The More Advanced Example:
Problem Scenario:
- Each of your indexes contains 100,000,000's documents.
- You have 500 users on your system actively performing searches.
- You have 100 search results per page.
- And, your typical user pages through the first 10 pages of results. (Normal occurrence on some systems)
So for the 10th page you will have to collect 1,000 hits, at a cost of a float plus an int plus some object overhead per hit. So let's say 20 bytes per hit. So you have 500 users * 1,000 hits * 20 bytes = 10,000,000 bytes or 10M. Easy, no problem, right?
Well what if you also give the users an easy way to move to the end of the result set. Hmm... Well for a result set size of 10,000 it's no big deal. But what if you hand out result sets in the order of a 1,000,000 or even 10,000,000.
At this point you really just want to prevent the system from running out of memory. Because if you have 25 users getting 10,000,000 results each and they all click last page at the same time. That's going to cost you 5 Gig of heap! At least. Some might say that it won't ever happen, but in my experience, if it can happen, it will.
So I created a Paging Hit Collector, that windows the hits to the users. It's uses the last hit collected from the previous search pass, to feed the next search pass. So yes if a user clicks the last page, it might perform multiple searches but, the system won't run out of memory.
The user's will get there answer eventually, and if your system gives them some feedback as it searches and pages, they will probably sit and wait for it to come back. Instead of giving up and hitting cancel and search and cancel and search, and making the system worse and worse.
IndexSearcher searcher = new IndexSearcher(reader);
TermQuery query = new TermQuery(new Term("f1", "value")); IterablePaging paging = new IterablePaging(searcher, query, 100);
for (ScoreDoc sd : paging.skipTo(90)) { System.out.println("doc id [" + sd.doc + "] " + "score [" + sd.score + "]"); }
IndexSearcher searcher = new IndexSearcher(reader);
TotalHitsRef totalHitsRef = new TotalHitsRef(); ProgressRef progressRef = new ProgressRef();
TermQuery query = new TermQuery(new Term("f1", "value")); IterablePaging paging = new IterablePaging(searcher, query, 100);
for (ScoreDoc sd : paging.skipTo(90). gather(20). totalHits(totalHitsRef). progress(progressRef)) {
System.out.println("time [" + progressRef.queryTime() + "] " + "total hits [" + totalHitsRef.totalHits() + "] " + "searches [" + progressRef.searchesPerformed() + "] " + "position [" + progressRef.currentHitPosition() + "] " + "doc id [" + sd.doc + "] " + "score [" + sd.score + "]"); }
Here's a link to the code LUCENE-2215.
15 Comments
Leave a comment
0 TrackBacks
Listed below are links to blogs that reference this entry: Lucene Hit Paging Made Easier.
TrackBack URL for this entry: http://www.nearinfinity.com/mt/mt-tb.cgi/1556



Hi
Thanks, this looks really nice.
I'm about to test your code but I have one question: I need to sort my results (using SortField?) before the paging. Any idea how how I do that? Can I add some code to achieve that?
(I have just upgraded from Lucene 2.3 to 3.0 and I have not used Collector's before).
Regards
You are right I did leave out sorting! It will be easy to add, I will do it this evening and post a patch update to the LUCENE-2215 issue. Thanks!
That would be great Aaron, thanks, I'm looking forward seeing the update. Take care...
That would be great Aaron, thanks, I'm looking forward seeing the update. Take care...
Hi again Aaron,
Not to nag at you, I'm sure you have lots of things to do. But I would really appreciate if you could look at the "sort"-update soon..
Thanks again.
Sorry, I have been digging out and dealing with the weather here. I will get back to this at some point today. But don't look for anything posted until tonight, east coast time.
Aaron
Well I have taken a look at adding the Sort feature. It will be not be easy, it's going to require some core lucene code surgery. I will continue to work on it, but it's going to take awhile to get a solution. Sorry.
Okey, I was afraid of that. I hope you get it to work. Thanks..
Hi Aaron,
Thank you for your contributions.
BTW, from where did you use the HitQueue(int, boolean) class?
Could you provide some inputs, please.
Not sure if I understand your question, but the HitQueue class is a lucene provided class in the org.apache.lucene.search package. It is used in most of the built-in collectors.
Aaron
Thanks,
What about sorting?
Hi, the calculation of search result size should be 10M instead of 1M?
500*1000*20 = 10,000,000 = 10M
Or have I misinterpreted your calculation?
Thanks
Yep you are right, thanks!
Hello Aaron,
tnx for the very interesting post! I was wondering if this tecnique is also applicable to the (imho) rather interesting/complex issue of giving previous/next buttons to move through the adjoining brothers/sisters in a resultset. Especially concerning overhead/performance issues in a high traffic/volume data environment.
cheers, Olaf
It's possible, but in my experience this is mainly used as a safety mechanize to prevent the search engine from running out of resources.
Thanks,
Aaron