Recently by Aaron McCurry

How to implement row level access control in Lucene

| | Comments (2) | TrackBacks (0)
Below I have written some fully functionally code that shows how you could implement row level access control in Lucene (2.3.2). Basically you have to index enough information to be able to search (in a single query) and find all documents that a given user has access to read.

In the below example there are two fields:

DATA: Which contains any data that you want your users to be able to search. NOTE: You can have as many data fields as you like.

ACL_FIELD: The field used to determine what users have access to this document. Note: You can have as many access control fields as you like.

All you have to do is built the access control query for each user and submit your user's query unchanged.

public class TestIndexerSearcher {

   public static void main(String[] args) throws Exception {
      Directory directory = new RAMDirectory();
      IndexWriter indexWriter = new IndexWriter(directory, new StandardAnalyzer());
      indexWriter.addDocument(buildDocument("DATA:sametoken","ACL_FIELD:access"));
      indexWriter.addDocument(buildDocument("DATA:sametoken","ACL_FIELD:noaccess"));
      indexWriter.optimize();
      indexWriter.close();

      IndexSearcher indexSearcher = new IndexSearcher(directory);

      QueryParser parser = new QueryParser("DATA", new StandardAnalyzer());
      Query query = parser.parse("sametoken");
		
      //This is all you have to add to your existing code.
      Filter aclFilter = applyAccessControl(new TermQuery(
         new Term("ACL_FIELD","access")));

      Hits hits = indexSearcher.search(query, aclFilter);
      System.out.println("Hits[" + hits.length() + "]");
      for (int i = 0; i < hits.length(); i++) {
         Document doc = hits.doc(i);
         System.out.println("DATA [" + doc.get("DATA") + 
            "] ACL_FIELD [" + doc.get("ACL_FIELD") + "]");
      }
      indexSearcher.close();	
   }

   private static Filter applyAccessControl(Query aclQuery) {
      return new CachedQueryFilter(aclQuery.toString(), 
         new QueryWrapperFilter(aclQuery));
   }

   private static Document buildDocument(String... fieldInfo) {
      Document document = new Document();
      for (int i = 0; i < fieldInfo.length; i++) {
         String[] split = fieldInfo[i].split(":");
         String fieldName = split[0];
         String fieldValue = split[1];
         document.add(new Field(fieldName,fieldValue,
            Field.Store.YES,Field.Index.TOKENIZED));
      }
      return document;
   }	
}


After you run this code, you will get a single hit, not the two that you would normally get if the access control filter wasn't in place.

public class CachedQueryFilter extends Filter {
   private static final long serialVersionUID = 6797293376134753695L;
   private Filter filter;
   private String key;
   private static transient Map<String, BitSetCache> filterCache = 
      new ConcurrentHashMap<String, BitSetCache>();

   public CachedQueryFilter(String key, Filter filter) {
      this.filter = filter;
      this.key = key;
   }

   public BitSet bits(IndexReader reader) throws IOException {
      BitSetCache cachedBitSet = (BitSetCache) filterCache.get(key);
      if (cachedBitSet != null) {
         BitSet bitSet = cachedBitSet.bitSet.get();
         if (bitSet != null && cachedBitSet.indexReaderVersion == reader.getVersion()) {
            return bitSet;
         }
      }
      BitSet bits = filter.bits(reader);
      BitSetCache bitSetCache = new BitSetCache();
      bitSetCache.indexReaderVersion = reader.getVersion();
      bitSetCache.bitSet = new SoftReference<BitSet>(bits);
      filterCache.put(key, bitSetCache);
      return bits;
   }
	
   private class BitSetCache {
      long indexReaderVersion;
      SoftReference<BitSet> bitSet;
   }
}
There are two additional features that this query filter doesn't implements that you may want to consider.

1st - Provide per query locking around the bitset creation code. This would allow multiple bitset creation calls to occur at once, but the same access control query would block. Therefore we would only have to build it once, even if multiple user queries with the same access control hit the query filter at once.

2nd - Persist the bitsets. In the past I have used the same directory as the index, but you may want to use a database, or something else.

Lucene is a memory hog!

| | Comments (4) | TrackBacks (0)

Let me start by saying that, I like Lucene, I have used it to solve many technical problems on my current project. But one aspect of Lucene that I have had issues with is with its memory footprint.

Currently we index 38 fields across 1.5 billion documents, and we have implemented a fair similarity object (see Lucene Scoring Documentation) for normalized scoring. We have no use for index time field boosting or for any type of Norms (see Lucene Scoring Documentation). However Lucene reads all of the Norms into memory for fast scoring.

So let's do the math:

1 byte per Norm value * 38 fields * 1,500,000,000 = 57,000,000,000

That's +/- 57 Gigs of heap space!

That's quiet a bit of memory usage for something that we don't even use. I have since patched Lucene, so that the indexed Norms have a much better memory footprint, something around 1.5 Gigs. Not great, but livable.

I havenÕt posted my patch, because itÕs all or nothing, I havenÕt implemented a way to turn it on and off. But if there is anyone else out there, with an application that is running out of memory as users use more and more indexed fields in Lucene, take a look at the SegmentReader class. There's a byte array in there that you should take a look at. Happy hunting!

My current project has some unique searching requirements.

Requirements

  • Fuzzy searching is a must (Soundex, Levenshtein, etc.)

  • Has to be fast, a must with any searching solution

  • Has to provide access control

  • Full data load indexing needs to be completed in a reasonable amount of time

  • Scoring needs to be a custom implementation

  • Needs to run on a predetermined environment, meaning that new hardware purchases are not going to happen any time soon

  • And last but not least is ability do all these things on a dataset that exceeds a billion records

So we have had a lot of constraints to deal with, the hardest one by far is the last one.

The Data

  • 1 billion plus records

  • Over 30 million unique terms

Indexing and Searching Server Specs

  • 20 CPUs

  • 32 Gig of ram

  • Dedicated SAN storage

First Searching Experiences

After getting the index built in multiple partitions, I fired up a simple Lucene console to do some simple searches with a Lucene multi searcher. Ran out of memory with 2 Gig heap, tried the maximum heap size for the 32 bit JVM we were using, 3.3 Gig, and that ran out of memory as well. So, initial tries to just run one search were unsuccessful.

Then we installed a 64-bit JVM and tried an 8 Gig heap, and it worked! I could run searches and after the first couple of warm up searches it was getting 20 - 80 ms responses on single term searches. Great, but then we tried a Fuzzy search, which uses a Levenshtein algorithm to calculate matches, 2 minutes 45 seconds, this was unacceptable.

Next we wrote our own Levenshtein Lucene query and got the 2 minutes plus search down to about one second. We found that the built in Lucene Fuzzy query was taking 85-95% of the time to find the terms to search. Then after those terms were found the actual search with those expanded terms only took a second to two depending on how many terms were found. So we replaced the built in Fuzzy query with a custom one that gets near instantaneous results on Levenshtein fuzzy matches. Problem solved.

Indexing Time

After our initial proof of concept was complete, we needed to improve the indexing time down to something more reasonable. The index creation from scratch was taking 36 - 48 hours to build with 20 CPUs running at 100% utilization. Which means that the machine was indexing about 9,000 records a second. Not bad for Lucene 2.2, but not that great.

First we stopped merging the indexes after we created them, that by itself was taking about 12 hours. At this point we also started searching these multiple indexes in parallel, and we are seeing modest increases in query performance.

Second, we upgraded to Lucene 2.3, this provided a huge increase in indexing speed. Our index creation time went from 36 - 48 hours (depending on if we merged indexes or not) down to 3-4 hours. The indexing process is now indexing around 125,000+ records a second. Huge improvement, if you haven't upgraded to 2.3, you should!

Current Development

We are in the process of adding access control to Lucene as well as adding new custom queries and scoring. So far Lucene has performed better than any of the competition that it has come up against, and with it's price point it seems to have won acceptance on our project.

In upcoming parts I will go into more details about the technical solutions that we have developed to solve these problems, as well others that I haven't mentioned yet.

Column Oriented Databases

| | Comments (0) | TrackBacks (0)

What is it?

A column oriented database (or a column store database) is a database that stores it's information in a column oriented manor, instead of a row oriented manor.

Take for example this simple table of information.

Id

FirstName

LastName

Gender

BirthDate

1

John

Doe

M

1/1/1980

2

Jane

Doe

F

2/2/1981

3

John

Smith

M

3/3/1979

It is very natural for a row oriented database to store the following information as sequential appending rows on disk. This makes them very good for inserting and updating data (depending on implementation).

1,John,Doe,M,1/1/1980~2,Jane,Doe,F,2/2/1981~3,John,Smith,M,3/3/1979

A column oriented database might store the data like this:

1,2,3~John,Jane,John~Doe,Doe,Smith~M,F,M~1/1/1980,2/2/1981,3/3/1979

As you can imagine this type of structure would not perform well if you were making lots of changes to the data. But if you are scanning through a column looking for values, the column oriented database groups all the values of a column together as opposed to spreading them out like the row store database would do.

Why do I care?

Database vendors would argue that indexing the columns solves all the performance problems of a row store database. And for most applications they are right, but one benefit that most column store databases provide that may not be apparent, is their ability to compress the data. Take my example:

1,2,3~John,Jane,John~Doe,Doe,Smith~M,F,M~1/1/1980,2/2/1981,3/3/1979

Because the type of each column is consistent, compression of the entire column can be a lot greater than compressing each row in a row store database (some newer RDBMs are adding compression features). Vertica, a commercial column store database boasts up to 90% reduction in storage compared with a traditional row store databases.

Query speed can also be greatly improved on a column store database versus a row store database, although that feature is more dependent on implementation then on the structure of the data.

Conclusion

For problems where you are writing and updating a lot (OLTP in database talk) stick with a row store database. But if you are reading more than you are writing (OLAP) a column store database might be something worth considering.

Here are some column store databases that I have used or heard of:

Open source:

Commercial:

  • Vertica - Grid based column store database