Recently about Hadoop

Using HBase-dsl

| | Comments (1) | TrackBacks (0)
At the beginning of last month I started prototyping various solutions for a customer using HBase.  However I found myself writing tons of code to perform some fairly simple tasks.  So I set out to simply my HBase code and ended up writing a Java HBase DSL.  It's still fairly rough around the edges but it does allow the use of standard Java types and it's extensible.

Simple Put and Get Example

Direct HBase API:

public class PutAndGet {
   public static void main(String[] args) throws IOException {
      HTable hTable = new HTable("test");

      byte[] rowId = Bytes.toBytes("abcd");
      byte[] famA = Bytes.toBytes("famA");
      byte[] col1 = Bytes.toBytes("col1");
      Put put = new Put(rowId).
         add(famA, col1, Bytes.toBytes("hello world!"));
      hTable.put(put);
      Get get = new Get(rowId);
      Result result = hTable.get(get);
      byte[] value = result.getValue(famA, col1);
      System.out.println(Bytes.toString(value));
   }
}
HBase-dsl API:

public class PutAndGetWithDsl {

   public static void main(String[] args) throws IOException {

      HBase<QueryOps, String> hBase = new HBase<QueryOps<String>, String>(String.class);

      hBase.save("test").
 
         row("abcd").

            family("famA").

               col("col1", "hello world!");

      String value = hBase.fetch("test").

         row("abcd").
            family("famA").

               value("col1", String.class)
      System.out.println(value);
   }

}
Now this is where the dsl becomes more powerful!

Scanner Example

Direct HBase API:

public class Scanner {
   public static void main(String[] args) throws IOException {
      byte[] famA = Bytes.toBytes("famA");
      byte[] col1 = Bytes.toBytes("col1");  

      HTable hTable = new HTable("test");  

      Scan scan = new Scan(Bytes.toBytes("a"), Bytes.toBytes("z"));
      scan.addColumn(famA, col1);  

      SingleColumnValueFilter singleColumnValueFilterA = new SingleColumnValueFilter(
           famA, col1, CompareOp.EQUAL, Bytes.toBytes("hello world!"));
      singleColumnValueFilterA.setFilterIfMissing(true);  

      SingleColumnValueFilter singleColumnValueFilterB = new SingleColumnValueFilter(
           famA, col1, CompareOp.EQUAL, Bytes.toBytes("hello hbase!"));
      singleColumnValueFilterB.setFilterIfMissing(true);  

      FilterList filter = new FilterList(Operator.MUST_PASS_ONE, Arrays
           .asList((Filter) singleColumnValueFilterA,
                singleColumnValueFilterB));  

      scan.setFilter(filter);  

      ResultScanner scanner = hTable.getScanner(scan);  

      for (Result result : scanner) {
         System.out.println(Bytes.toString(result.getValue(famA, col1)));
      }
   }
}
HBase-dsl API:

public class ScannerWithDsl {
   public static void main(String[] args) throws IOException {
      HBase<QueryOps, String> hBase = new HBase<QueryOps<String>, String>(String.class);

      hBase.scan("test","a","z").
         select().
            family("famA").
               col("col1").
         where().
            family("famA").
               col("col1").eq("hello world!","hello hbase!").
         foreach(new ForEach() {
            @Override
            public void process(Row row) {
               System.out.println(row.value("famA", "col1", String.class));
            }
         });
  }
}

See the unit tests, for more examples.

In the past few weeks I have been spending more and more time working with Hadoop and Hive.  For those of you that don't know what Hadoop is check out what wikipedia has to say.  Hive is built on top of Hadoop, simply stated is it a SQL engine that submits map/reduce jobs to Hadoop for execution.

So next you ask yourself, "why do I care"?  Well with Hive using Hadoop for all the heavy lifting, the amount of data that you can process is only limited by the amount of hardware you have in your cluster.  Hive is used for data warehousing which means that it is designed to work on huge datasets, huge joins, huge data loads, huge query results, etc.  However before you start thinking about getting rid of that MySQL database, think again.  Hive is not and never will be low latency.  All queries submit map/reduce jobs to Hadoop which then operates on files stored in HDFS.

Hive has a lot of nice features built in, like:
  • It can operate on raw files located in HDFS, like logs from you application, like csv files from your database(s).  So this can reduce your load time, because you don't have to actually load it into a database before you can use it.
  • It can operate on compressed files.  I started using this feature last week because I am getting a 4 to 1 compression ratio with no different in performance (I am using sequence files with block compression).
  • In your SQL statements you can actually use the Hadoop streaming api to build your own mapper and reducers, and they don't even have to be written in Java!
  • You can also create your own user defined functions, so when you have to do something crazy with the data, you can!

And there are lots more, so go check it out!

Hive, the real Netezza killer.