Recently about Java
This year one of my goals is to try and become proficient in using ANTLR. I think that learning to translate text or build an external DSL is skill that, although not used everyday, will be very useful to know. For my first attempt I settled on something fairly easy, a SQL like grammar that could be used to search for files and the content within those files. You should also be able to narrow the search results based on when the file was last modified. My goal is to take something like the following:
select * from /logs where file="*.out" and pattern="foobar" and modified < 2 days ago select * from /logs where file='*.out' and pattern='foobar' and modified between 20 and 30 minutes agoand translate it to the corresponding find command and pipe the results to xargs and grep:
find /logs -name '*.out' -mtime -2 | xargs grep 'foobar' find /logs -name '*.out' -mmin +20 -mmin -30 | xargs grep 'foobar'As an aside, if you are not familiar with xargs, check out this xargs tutorial or the xargs man pages , it's a great utility that executes a command with the output of a previous command.
Disclaimer
Now before the villagers gather up with torches and pitch forks to run me out of town (I'm channeling Young Frankenstein here), I would like to make somewhat of a disclaimer. I am not suggesting a new language or discouraging learning the *nix command line tools. The point here is to learn ANTLR. I found it more interesting to translate something I use everyday on my current project, versus some of the other "Hello World" ANTLR examples I have seen. So other than a using this grammar as a learning exercise, I don't see it as being useful.Introduction
ANTLR is a deep topic, so obviously one blog post can not go into any great detail. So what follows is not in-depth coverage of ANTLR, but a detailed description of the grammar developed. I will explain each section as well as some of the decisions and trade-offs I made. For my development environment I'm using:- Eclipse 3.5.1
- Java 6
- The ANTLR IDE plugin for Eclipse. You could also use ANTLRWorks, the gui development environment for ANTLR. ANTLRWorks is an excellent tool, I just felt more comfortable to do this work in Eclipse.
- ANTLR version 3.2
- Mac OS X 10.6.2.
options, @header
grammar FQL;
options {
language = Java;
}
@header {
package bbejeck.antlr.fql;
}
Here I am specifying a combined grammar named FQL. (FQL is short for File Query Language and yes, I know the name sucks)
In options I'm specifying that I want the generated code to be Java. I could have also specified C,C++ or Python here as well. ANTLR also has support for generating code in Ruby, but with the version I am using (v 3.2) I could not get it to work. I did find ANTLR Ruby. I have not tried it out, but from the documentation it looks promising. The @header option is setting the package for the generated parser code. This is also where I would have specified any needed imports.
@members
The @members section is where you place instance variables and methods that will be placed and used in the generated parser. Most likely the code in the members section will be used in embedded actions in the parser rules. @members {
private StringBuilder findBuilder = new StringBuilder("find ");
private StringBuilder filter = new StringBuilder();
private void addString(String s){
if(s!=null){
findBuilder.append(s);
}
}
private String buildTimeArg(String s, String snum, String sign){
StringBuilder timeBuilder = new StringBuilder();
int num = Integer.parseInt(snum);
if(s.equals("days")){
return timeBuilder.append(" -mtime ").append(sign).append(num).toString();
}
if(s.equals("hours")){
return timeBuilder.append(" -mmin ").append(sign).append((num*60)).toString();
}
return timeBuilder.append(" -mmin ").append(sign).append(num).toString();
}
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException{
throw new MismatchedTokenException(ttype,input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow) throws RecognitionException{
throw e;
}
}
The two StringBuilders findBuilder and filter will be used by embedded actions to build up our translated query. The reason for two StringBuilders will be explained when we cover the parsing rules. The addString method is to check for optional tokens that could be null. I could have easily checked for null in the embedded code within each rule, but I felt it cluttered the grammar too much. The buildTimeArg method is used as sort of a poor man's symbol table to translate the modified clause to the proper time format for the mmin or mtime arguments.
The final two methods override how the generated parser responds to recognition errors (the generated parser extends ANTRL's Parser class which in turn extends the BaseRecognizer class). By default ANTLR will recover from recognition errors and continue on, trying to read more tokens if available. But in this grammar, if there is a recognition error along the way I want to stop processing right there.
@rulecatch
Each parser rule is converted into a method call in the generated parser with a try - catch block surrounding the parsing code. The catch statement here will be embedded in each one of the try-catch blocks in the parser.@rulecatch{
catch (RecognitionException e){
throw e;
}
}
If you remember from the previous section we want to stop parsing stop when RecognitionExceptions are encountered, so we re-throw the caught exception.
@lexer::header
Here we are specifying the package for the generated lexer.@lexer::header {
package bbejeck.antlr.fql;
}
Now let's move on to the parsing rules.
Parsing Rules
evaluate returns [String query]
: query';' {$query = builder.toString() + filter.toString() ;}
;
query
: select_stmt where_stmt
;
select_stmt
: 'select' '*' 'from' directory
;
Here evaluate is our top level rule and returns a String, translated and built as the input is parsed. Anything within the curly braces is code that will be embedded in the generated parser. Note how we reference query from the grammar by placing a '$' before the word 'query'. Also note that the string returned is a concatenation from the two StringBuilders we declared in the @members section. The query rule is comprised of a select_stmt followed by a where_stmt. The select_stmt is "select * from" followed by the directory rule.
directory
: (p='.'{addString($p.text);} | (p='/'?{addString($p.text);}IDENT{addString($IDENT.text);})+ )
;
The directory rule accepts either a '.', a relative or an absolute path. If the first expression is not provided there must be at least one path expression denoted by the '+'. The variable 'p' is used to give a handle to the '.' or '/' token so it can be extracted . IDENT is a lexer rule which will be explained a little bit later. All tokens here are passed into the addString method defined in the members section.
where_stmt
: ('where' clause ('and' clause)* ) ?
;
clause
: file_name
| pattern
| modified
;
The where_stmt rule expects the string 'where' followed by 0 or more clauses. Also the entire where_stmt is optional. Here I chose form over substance. By that I mean the grammar as it stands here will allow multiple clause's that would not make sense, i.e multiple file_name arguments etc. I could have specified an exact order of clauses that would have also effectively set the limit of clauses entered, but I would rather the grammar be flexible and trust that the user knows what they want to do.
file_name
: 'file' '=' STRING_LITERAL
{addString(" -name ");addString($STRING_LITERAL.text);}
;
pattern
: 'pattern' '=' STRING_LITERAL
{ filter.append(" | xargs grep ").append($STRING_LITERAL.text); }
;
The file_name rule sets the -name argument again using the addString method. The lexer rule STRING_LITERAL will accept whatever the user inputs. The pattern rule builds up the grep command. Here we see the use of the second StringBuilder filter that was defined in the @members section. I feel that having a second StringBuilder to capture text for the grep filter is a hack. The issue is that the grep command needs to be last in our translated query, but I really want the where statement to be in any order. So by placing the tokens captured by the pattern rule in a separate StringBuilder I can easily guarantee the grep statement will be last.
modified
: modified_less
| modified_more
| modified_between
;
The modified rule has three options. This portion builds the mmin/mtime argument(s) for the find command.
modified_less
: 'modified' '<' INTEGER time_span
{ addString(buildTimeArg($time_span.text,$INTEGER.text,"-")); }
;
modified_more
: 'modified' '>' INTEGER time_span
{ addString(buildTimeArg($time_span.text,$INTEGER.text,"+")); }
;
modified_between
: 'modified' 'between' int1=INTEGER 'and' int2=INTEGER time_span
{ addString(buildTimeArg($time_span.text,$int1.text,"+")); }
{ addString(buildTimeArg($time_span.text,$int2.text,"-")); }
;
The grammar allows you to specify searching by the time a file was last modified. Here we use the method buildTimeArg to translate the input to the correct argument for either mmin (minutes modified) or mtime (days modified). Also take note of setting the two variables int1 and int2. Those are used to disambiguate which INTEGER token to use.
time_span
: 'days'
| 'minutes'
| 'hours'
;
The time_span rule allows input of days, minutes or hours. The hours argument is converted into minutes by the buildTimeArg method.
That's it for the parsing rules, now on to the lexer rules.
Lexer Rules
fragment DIGIT : '0'..'9';
fragment LETTER : 'a'..'z'|'A'..'Z' ;
STRING_LITERAL : '\''.*'\'';
INTEGER : DIGIT+ ;
IDENT : LETTER(LETTER | DIGIT)* ;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;};
DIGIT and LETTER are not lexer rules, as you can see by the fragment definition. These are used for making the grammar more readable. In the WS definition the {$channel=HIDDEN;} is used to ignore whitespace in the input.
Test Code
I used the following code to test the grammar from the command line:public class FQLTester {
public static void main(String[] args) throws Exception{
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
String line = null;
System.out.println("Enter your search:");
while((line = reader.readLine())!= null){
if(line.equalsIgnoreCase("quit")){
System.exit(0);
}
CharStream charstream = new ANTLRStringStream(line);
FQLLexer lexer = new FQLLexer(charstream);
TokenStream tokenStream = new CommonTokenStream(lexer);
FQLParser parser = new FQLParser(tokenStream);
String parsed = null;
try{
parsed = parser.evaluate();
System.out.println("parsed query is ["+parsed+"]");
Process process = Runtime.getRuntime().exec(new String[]{"sh","-c",parsed});
InputStream input = process.getInputStream();
BufferedReader procReader = new BufferedReader(new InputStreamReader(input));
String searchResults = null;
while((searchResults=procReader.readLine())!=null){
System.out.println(searchResults);
}
}catch(Exception e){
e.printStackTrace();
}
System.out.println("Enter your search:");
}
}
Since this blog is just scratching the surface as far as ANTLR's capabilities are concerned, I plan to be writing more about ANTLR in the near future. Full source code for everything presented is available here.
More resources for learning ANTLR are:
That's it for now, thanks for your time.
Continue reading Learning ANTLR part I.
While I'm working on getting my code into lucene. I have patched 3.0.0 and 2.9.1 with my low memory patch.
By default the small memory footprint is enabled to change back to the default implementation set the following system property.
-Dorg.apache.lucene.index.TermInfosReader=default
Have fun! If you have any problems or questions please let me know or add to this LUCENE-2205.
Lucene gives users the ability to search massive amounts of data in a very short amount of time. However allowing users to page through the entire result set of their search can be difficult and risky depending on how many users are performing searches and how many of those users are paging through 100's if not 1,000's of hits per page.
The Simple Example:
The More Advanced Example:
Problem Scenario:
- Each of your indexes contains 100,000,000's documents.
- You have 500 users on your system actively performing searches.
- You have 100 search results per page.
- And, your typical user pages through the first 10 pages of results. (Normal occurrence on some systems)
So for the 10th page you will have to collect 1,000 hits, at a cost of a float plus an int plus some object overhead per hit. So let's say 20 bytes per hit. So you have 500 users * 1,000 hits * 20 bytes = 1,000,000 bytes or 1M. Easy, no problem, right?
Well what if you also give the users an easy way to move to the end of the result set. Hmm... Well for a result set size of 10,000 it's no big deal. But what if you hand out result sets in the order of a 1,000,000 or even 10,000,000.
At this point you really just want to prevent the system from running out of memory. Because if you have 25 users getting 10,000,000 results each and they all click last page at the same time. That's going to cost you 5 Gig of heap! At least. Some might say that it won't ever happen, but in my experience, if it can happen, it will.
So I created a Paging Hit Collector, that windows the hits to the users. It's uses the last hit collected from the previous search pass, to feed the next search pass. So yes if a user clicks the last page, it might perform multiple searches but, the system won't run out of memory.
The user's will get there answer eventually, and if your system gives them some feedback as it searches and pages, they will probably sit and wait for it to come back. Instead of giving up and hitting cancel and search and cancel and search, and making the system worse and worse.
IndexSearcher searcher = new IndexSearcher(reader);
TermQuery query = new TermQuery(new Term("f1", "value")); IterablePaging paging = new IterablePaging(searcher, query, 100);
for (ScoreDoc sd : paging.skipTo(90)) { System.out.println("doc id [" + sd.doc + "] " + "score [" + sd.score + "]"); }
IndexSearcher searcher = new IndexSearcher(reader);
TotalHitsRef totalHitsRef = new TotalHitsRef(); ProgressRef progressRef = new ProgressRef();
TermQuery query = new TermQuery(new Term("f1", "value")); IterablePaging paging = new IterablePaging(searcher, query, 100);
for (ScoreDoc sd : paging.skipTo(90). gather(20). totalHits(totalHitsRef). progress(progressRef)) {
System.out.println("time [" + progressRef.queryTime() + "] " + "total hits [" + totalHitsRef.totalHits() + "] " + "searches [" + progressRef.searchesPerformed() + "] " + "position [" + progressRef.currentHitPosition() + "] " + "doc id [" + sd.doc + "] " + "score [" + sd.score + "]"); }
Here's a link to the code LUCENE-2215.
I've been using Lucene for the better part of 2 years, from initial playing around, to prototyping to production application. It's an impressive library and it has come along way in the past couple of years.
When I first started playing around with it the version was 2.1 and the search times were so much faster than what we were trying to use at the time (Oracle Text). The first test was indexing a monster dataset and searching it quickly. It passed with flying colors!
Next was to add in record level access control. Easy and extremely fast.
Next was to add in all the other data needed for our application. That was a little bit harder, considering that we have close to 150 fields in our index and well into the billion record range (growing everyday).
The problem was that we needed more memory and there was no extra money for any more servers (or upgrades). So there we were, stuck. So I decided to start poking around using visualvm to see if there were any places in our application or in Lucene to save some memory.
We had already disabled norms on all our fields (we really didn't need norms for our data nor did we have the resources). Took a long look at all our fields that we were indexing to see if there were any we didn't need, but we really did need them all. Then I stumbled across the TermInfosReader class in Lucene.
This is where Lucene really gets it speed, but also uses quite a bit memory to do it. And this is where I wrote my first Lucene patch.
In TermInfosReader there is a bunch stuff but the big memory hogs are in three arrays.
- Terms[]
- TermInfos[]
- long[]
Basically Lucene does a binary search across the Terms array (that by default contains every 128th Term in the index) with a given Term to find where on disk the exact Term needed lives. There's a little bit more going on in the class than that, but that's basically what it's doing.
So, I started this patch with the need to save memory. So how in the world do you do that in java when everything is already in basic arrays and everything is needed in memory. Well you have to save it another way, references. References are a hidden cost in Java, every single reference in 32-bit JVM costs you 4 bytes, and 64-bit JVM it's 8 (assuming that you don't have compressed references).
Let's count the references.
- Terms[] length * 3, 1 reference for the Term and 2 references for the two Strings inside the Term
- TermInfo[] length * 1
- long[] = 1 reference total
So, let's talk numbers. If you have a billion terms in your index, that's 125 MB (1,000,000,000 / 128 * (3 + 1 references) * 4 bytes for every ref) bytes of memory for the references. In a 64-bit JVM that doubled 250 MB. Not to mention the object overhead for every one of those Term and TermInfos objects. Wow that's a lot!
So I decided to remove nearly all of those references by using a byte array and an int array as an offset index.
The results were impressive!
Given an index of 6.2 GB size 1,010,000 number of documents with 179,822,683 number of terms the default implementation uses 292,235,512 bytes to just get the index usable.
My no-ref implementation of the same index uses only 49,849,744 bytes get the index usable. That a 17% of the original size, that's an 83% savings!
And the best part is, that it loads the segments faster into memory. So those real-time updates will be online faster. The run-time performance is slightly faster as well. But the huge performance saving is in garbage collection. Over 7 times faster for full GC's on my Macbook Pro. Wow!
I think that the results speak for themselves, and I hope that the Lucene folks will accept my patch. That way I won't have to continue patching each version after the fact. Also removing references can be great, but the code required to do it, and maintain the same level of performance, is ugly! So don't try this at home!
At the beginning of last month I started prototyping various solutions for a customer using HBase. However I found myself writing tons of code to perform some fairly simple tasks. So I set out to simply my HBase code and ended up writing a Java HBase DSL. It's still fairly rough around the edges but it does allow the use of standard Java types and it's extensible.
Simple Put and Get Example
Direct HBase API:
Scanner Example
Direct HBase API:
See the unit tests, for more examples.
Simple Put and Get Example
Direct HBase API:
public class PutAndGet {
public static void main(String[] args) throws IOException {
HTable hTable = new HTable("test");
byte[] rowId = Bytes.toBytes("abcd");
byte[] famA = Bytes.toBytes("famA");
byte[] col1 = Bytes.toBytes("col1");
Put put = new Put(rowId).
add(famA, col1, Bytes.toBytes("hello world!"));
hTable.put(put);
Get get = new Get(rowId);
Result result = hTable.get(get);
byte[] value = result.getValue(famA, col1);
System.out.println(Bytes.toString(value));
}
}
HBase-dsl API:public class PutAndGetWithDsl {
public static void main(String[] args) throws IOException {
HBase<QueryOps, String> hBase = new HBase<QueryOps<String>, String>(String.class);
hBase.save("test").
row("abcd").
family("famA").
col("col1", "hello world!");
String value = hBase.fetch("test").
row("abcd").
family("famA").
value("col1", String.class)
System.out.println(value);
}
}
Now this is where the dsl becomes more powerful!Scanner Example
Direct HBase API:
public class Scanner {
public static void main(String[] args) throws IOException {
byte[] famA = Bytes.toBytes("famA");
byte[] col1 = Bytes.toBytes("col1");
HTable hTable = new HTable("test");
Scan scan = new Scan(Bytes.toBytes("a"), Bytes.toBytes("z"));
scan.addColumn(famA, col1);
SingleColumnValueFilter singleColumnValueFilterA = new SingleColumnValueFilter(
famA, col1, CompareOp.EQUAL, Bytes.toBytes("hello world!"));
singleColumnValueFilterA.setFilterIfMissing(true);
SingleColumnValueFilter singleColumnValueFilterB = new SingleColumnValueFilter(
famA, col1, CompareOp.EQUAL, Bytes.toBytes("hello hbase!"));
singleColumnValueFilterB.setFilterIfMissing(true);
FilterList filter = new FilterList(Operator.MUST_PASS_ONE, Arrays
.asList((Filter) singleColumnValueFilterA,
singleColumnValueFilterB));
scan.setFilter(filter);
ResultScanner scanner = hTable.getScanner(scan);
for (Result result : scanner) {
System.out.println(Bytes.toString(result.getValue(famA, col1)));
}
}
}
HBase-dsl API:public class ScannerWithDsl {
public static void main(String[] args) throws IOException {
HBase<QueryOps, String> hBase = new HBase<QueryOps<String>, String>(String.class);
hBase.scan("test","a","z").
select().
family("famA").
col("col1").
where().
family("famA").
col("col1").eq("hello world!","hello hbase!").
foreach(new ForEach() {
@Override
public void process(Row row) {
System.out.println(row.value("famA", "col1", String.class));
}
});
}
}
See the unit tests, for more examples.

