<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>Hadoop - Blogs at Near Infinity</title>


        <link>http://www.nearinfinity.com/blogs/</link>
        <description>Employee Blogs</description>
        <language>en</language>
        <copyright>Copyright 2011</copyright>
        <lastBuildDate>Wed, 09 Nov 2011 15:00:00 -0500</lastBuildDate>
        <generator>http://www.sixapart.com/movabletype/</generator>
        <docs>http://www.rssboard.org/rss-specification</docs>
        
        <item>
            <title>An Introduction to Blur</title>
            <description><![CDATA[<p>Blur is a new Apache 2.0 licensed software project that provides a search capability built on top of Hadoop and Lucene.  Elastic Search and Solr already exist so why build something new?  While these projects work well, they didn't have a solid integration with the Hadoop ecosystem.  Blur was built specifically for Big Data, taking scalability, redundancy, and performance into consideration from the very start, while leveraging all the goodness that already exists in the Hadoop stack.  </p>

<p>A year and a half ago, my project began using Hadoop for data processing.  Very early on, we were having networking issues that would make our HDFS cluster network connectivity spotty at best.  Over one weekend in particular, we steadily lost network connection to 47 of the 90 data nodes in the cluster.  When we came in on Monday morning, I noticed that the MapReduce system was a little sluggish but still working.  When I checked HDFS I saw that our capacity had dropped by about 50%.  After running an fsck on the cluster I was amazed to find that what seemed like a catastrophic failure over the weekend resulted in a still healthy file system.  This experience left a lasting impression on me.  It was then that I got the idea to somehow leverage the redundancy and fault tolerance of HDFS for the next version of a search system that I was just beginning to (re)write.  </p>

<p>I had already written a custom sharded Lucene server that had been in a production system for a couple of years.  Lucene worked really well and did everything that we needed for search.  The issue that we faced was that it was running on big iron that was not redundant and could not be easily expanded.  After seeing the resilient characteristics of Hadoop firsthand, I decided to look into marrying the already mature and impressive feature set of Lucene with the built-in redundancy and scalability of the Hadoop platform.  From this experiment Blur was created.</p>

<p>The biggest technical issues/features that Blur solves:</p>

<ul>
<li>Rapid mass indexing of entire datasets</li>
<li>Automatic Shard Server Failover</li>
<li>Near Real-time update compatibility via Lucene NRT</li>
<li>Compression of Lucene FDT files while maintaining random access performance</li>
<li>Lucene WAL (Write Ahead Log) to provide data reliability</li>
<li>Lucene R/W directly into HDFS (the seek on write problem)</li>
<li>Random access performance with block caching of the Lucene Directory</li>
</ul>

<h1>Data Model</h1>

<p>Data in Blur is stored in Tables that contain Rows.  Rows must have a unique row id and contain one or more Records.  Records have a unique record id (unique within the Row) and a column family for grouping columns that logically make up a single record.  Columns contain a name and a value, and a Record can contain multiple columns with the same name.</p>

<script src="https://gist.github.com/1349055.js?file=gistfile1.js"></script>
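
<p>As a rough illustration of the nesting described above, here is a minimal Java sketch.  The class and field names are assumptions made for this example and are not Blur's own Thrift types; the sample values are invented.</p>

```java
import java.util.Arrays;

// Hypothetical classes that mirror the Table -> Row -> Record -> Column
// nesting described above; these are not Blur's own Thrift types.
class BlurDataModelSketch {

    static final class Column {
        final String name;
        final String value;
        Column(String name, String value) { this.name = name; this.value = value; }
    }

    static final class Record {
        final String recordId;   // unique within the enclosing Row
        final String family;     // column family that groups the columns
        final Column[] columns;  // repeated column names are allowed
        Record(String recordId, String family, Column... columns) {
            this.recordId = recordId;
            this.family = family;
            this.columns = columns;
        }

        // Collects every value stored under the given column name.
        String[] values(String columnName) {
            return Arrays.stream(columns)
                    .filter(c -> c.name.equals(columnName))
                    .map(c -> c.value)
                    .toArray(String[]::new);
        }
    }

    static final class Row {
        final String rowId;      // unique within the Table
        final Record[] records;  // one or more Records
        Row(String rowId, Record... records) {
            if (records.length == 0) {
                throw new IllegalArgumentException("a Row needs at least one Record");
            }
            this.rowId = rowId;
            this.records = records;
        }
    }

    // Builds one Row with two Records in different column families.
    static Row sampleRow() {
        Record person = new Record("rec-1", "person",
                new Column("name", "Jane Doe"),
                new Column("email", "jane@example.com"),
                new Column("email", "jdoe@example.org")); // same name used twice
        Record address = new Record("rec-2", "address",
                new Column("city", "Reston"));
        return new Row("row-1", person, address);
    }

    public static void main(String[] args) {
        Row row = sampleRow();
        System.out.println(row.rowId + " has " + row.records.length + " records");
        System.out.println(String.join(", ", row.records[0].values("email")));
    }
}
```

<p>The repeated "email" column shows why a lookup by column name returns several values rather than one.</p>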

<h1>Architecture</h1>

<p>Blur uses Hadoop's MapReduce framework for indexing data, and Hadoop's HDFS filesystem for storing indexes.  Thrift is used for all inter-process communication, and Zookeeper is used to track the state of the system and to store metadata.  The Blur architecture is made up of two types of server processes:</p>

<ul>
<li>Blur Controller Server</li>
<li>Blur Shard Server</li>
</ul>

<p>The shard server serves zero or more shards from all the currently online tables.  Which shards are online in each shard server is calculated from the state information in Zookeeper.  If a shard server goes down, the remaining shard servers detect the failure through their interaction with Zookeeper and determine which, if any, of the missing shards they need to serve from HDFS.  </p>
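
<p>One way to picture this is a layout function that every shard server evaluates on its own against the same list of online servers.  The sketch below is a simplified assumption for illustration, not Blur's actual assignment logic: it sorts both lists and hands out shards round-robin, so all servers reach the same answer without talking to each other.  The server and shard names are made up.</p>

```java
import java.util.Arrays;

// A simplified, hypothetical layout rule; Blur's real assignment logic may differ.
class ShardLayoutSketch {

    // Returns, for each shard (in sorted order), the server that should serve it.
    // Every server can run this locally against the same list of online servers
    // (for example, one read from Zookeeper) and reach the same answer.
    static String[] owners(String[] shards, String[] onlineServers) {
        String[] sortedShards = shards.clone();
        String[] sortedServers = onlineServers.clone();
        Arrays.sort(sortedShards);
        Arrays.sort(sortedServers);
        String[] result = new String[sortedShards.length];
        if (sortedServers.length == 0) {
            return result; // nobody online: every shard stays unassigned (null)
        }
        // Round-robin: shard i goes to server (i mod number of servers).
        Arrays.setAll(result, i -> sortedServers[i % sortedServers.length]);
        return result;
    }

    public static void main(String[] args) {
        String[] shards = {"shard-0", "shard-1", "shard-2", "shard-3"};
        System.out.println(Arrays.toString(
                owners(shards, new String[] {"node-a", "node-b"})));
        // node-b has gone offline, so node-a picks up its shards.
        System.out.println(Arrays.toString(
                owners(shards, new String[] {"node-a"})));
    }
}
```

<p>Plain round-robin can move more shards than necessary when the server list changes; a production layout would likely try to keep surviving assignments in place.</p>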

<p>The controller server provides a single point of entry (logically) to the cluster for spraying out queries, collecting the responses, and providing a single response.  Both the controller and shard servers expose the same Thrift API which helps to ease debugging.  It also allows developers to start a single shard server and interact with it the same way they would with a large cluster.  Many controller servers can be (and should be) run for redundancy. The controllers act as gateways to all of the data that is being served by the shard servers.</p>
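
<p>The scatter/gather role of the controller, and the idea that controllers and shard servers share one interface, can be sketched as follows.  The "Searcher" and "Hit" types are hypothetical stand-ins for the Thrift API, and the fan-out here is sequential for brevity, whereas a real controller would presumably query the shard servers in parallel.</p>

```java
import java.util.Arrays;

// Hypothetical types for illustration; the real servers talk over Thrift.
class ControllerSketch {

    static final class Hit {
        final String rowId;
        final double score;
        Hit(String rowId, double score) { this.rowId = rowId; this.score = score; }
    }

    // Stand-in for the search call that controllers and shard servers share.
    interface Searcher {
        Hit[] search(String query, int limit);
    }

    // A controller is itself a Searcher: it forwards the query to every shard
    // server, merges the partial results, and returns the best hits overall.
    static Searcher controller(Searcher... shardServers) {
        return (query, limit) -> Arrays.stream(shardServers)
                .flatMap(s -> Arrays.stream(s.search(query, limit)))
                .sorted((a, b) -> Double.compare(b.score, a.score)) // highest score first
                .limit(limit)
                .toArray(Hit[]::new);
    }

    public static void main(String[] args) {
        // Two fake shard servers that return canned results.
        Searcher shard1 = (q, n) -> new Hit[] {new Hit("row-7", 0.9), new Hit("row-2", 0.4)};
        Searcher shard2 = (q, n) -> new Hit[] {new Hit("row-5", 0.7)};
        for (Hit hit : controller(shard1, shard2).search("value", 2)) {
            System.out.println(hit.rowId + " " + hit.score);
        }
    }
}
```

<p>Because the controller implements the same interface as the shard servers, client code written against a single shard server keeps working when it is pointed at a controller instead.</p>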

<h1>Updating / Loading Data</h1>

<p>Currently there are two ways to load and update data.  The first is through a bulk load in MapReduce and the second is through mutation calls in Thrift.</p>

<h2>Bulk Load MapReduce Example</h2>

<script src="https://gist.github.com/1348788.js?file=BlurMapReduce.java"></script>

<h2>Data Mutation Thrift Example</h2>

<script src="https://gist.github.com/1348845.js?file=ThriftMutationExample.java"></script>

<h1>Searching Data</h1>

<p>Any element in the Blur data model is searchable through the normal Lucene semantics: analyzers. Analyzers are defined per Blur table.</p>

<p>The standard Lucene query syntax is the default way to search Blur.  If anything outside of the standard syntax is needed, you can create a Lucene query directly with Java objects, and submit them through the expert query API.</p>

<p>The column family grouping within Rows allows for results to be discovered across column families similar to what you would get with an inner join across two tables that share the same key (or in this case rowid).  For complicated data models that have multiple column families, this makes for a very powerful search capability.</p>

<p>The following example searches for "value" as a full text search.  If I had wanted to search for "value" in a single field like column "colA" in column family "famB" the query would look like "famB.colA:value".</p>

<script src="https://gist.github.com/1348874.js?file=ThriftSearchExample.java"></script>
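
<p>For the field-restricted form mentioned above, a small helper can assemble the query string.  This is a hypothetical convenience method, not part of Blur; only the "family.column:value" shape is taken from the text, and no escaping of Lucene special characters is attempted.</p>

```java
// Hypothetical helper; only the "family.column:value" shape comes from the text above.
class BlurQueryStrings {

    // Builds a query restricted to one column of one column family.
    // No escaping of Lucene special characters is attempted here.
    static String fieldQuery(String family, String column, String value) {
        return family + "." + column + ":" + value;
    }

    public static void main(String[] args) {
        System.out.println(fieldQuery("famB", "colA", "value"));
    }
}
```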

<h1>Fetching Data</h1>

<p>Fetches can be done by row or by record.  This is done by creating a selector object in which you specify the rowid or recordid, and the specific column families or columns that you would like returned.  When not specified, the entire Row or Record is returned.</p>

<script src="https://gist.github.com/1348865.js?file=ThriftFetchExample.java"></script>

<h1>Current State</h1>

<p>Blur is nearing its first release, 0.1, and is relatively stable.  The first release candidate should be available for download within the next few weeks.  In the meantime you can check it out on GitHub:</p>

<p><a href="https://github.com/nearinfinity/blur">https://github.com/nearinfinity/blur</a></p>

<p><a href="http://blur.io">http://blur.io</a></p>
]]></description>
            <link>http://www.nearinfinity.com/blogs/aaron_mccurry/an_introduction_to_blur.html</link>
            <guid>http://www.nearinfinity.com/blogs/aaron_mccurry/an_introduction_to_blur.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">Hadoop</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">Java</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">Lucene</category>
            
            
                <category domain="http://www.sixapart.com/ns/types#tag">Big Data</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">Hadoop</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">Java</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">Lucene</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">MapReduce</category>
            
            <pubDate>Wed, 09 Nov 2011 15:00:00 -0500</pubDate>
        </item>
        
        <item>
            <title>Hadoop Presentation at NOVA/DC Java Users Group</title>
            <description><![CDATA[<p>Last Thursday (on Cinco de Mayo) I gave a presentation on <a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://hive.apache.org/">Hive</a> at the <a href="http://www.meetup.com/dc-jug/">Nova/DC Java Users Group</a>. As several people asked about getting the slides, I've shared them <a href="http://www.slideshare.net/scottleber/hadoop-7904044">here</a> on Slideshare. I also posted the presentation sample code on Github at <a href="https://github.com/sleberknight/basic-hadoop-examples">basic-hadoop-examples</a>.</p>]]></description>
            <link>http://www.nearinfinity.com/blogs/scott_leberknight/hadoop_presentation_at_novadc.html</link>
            <guid>http://www.nearinfinity.com/blogs/scott_leberknight/hadoop_presentation_at_novadc.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">Hadoop</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">Java</category>
            
            
                <category domain="http://www.sixapart.com/ns/types#tag">Big Data</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">hadoop</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">hive</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">java</category>
            
            <pubDate>Tue, 10 May 2011 00:30:15 -0500</pubDate>
        </item>
        
        <item>
            <title>Hadoop for Managers</title>
            <description><![CDATA[<span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;" id="internal-source-marker_0.32143889308278895">So,
you have a ton of data and are trying to figure out what to do with it.
&nbsp;Big data is the newest industry buzzword and there are a myriad of
solutions out there that claim to solve the big data problem. &nbsp;One of
the industry leading technologies at the moment is Hadoop. &nbsp;But what is
Hadoop and how does it make working with big data manageable? &nbsp;Let's see
if we can take some of the essentials of Hadoop and explain them in
more business terms. While no analogy is perfect, I think that this one
is pretty good.</span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">Coca
Cola started in Atlanta, Georgia. &nbsp;Coke products were all made in one
plant and distributed locally in Georgia. &nbsp;As the popularity of Coke
increased over the years, the Coca Cola Company had to change the
production and distribution of its product in order to meet global
demand. &nbsp;Now there are bottling plants all over the world that use raw
materials and Coke recipes to produce and distribute billions of
servings of Coke products a year. &nbsp;As your data grows like the demand
for Coke, one plant will not be enough to satisfy your data production
needs. &nbsp;Enter Hadoop.</span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">Hadoop
is a distributed data crunching architecture very much like Coke's
bottling and distribution network. &nbsp;So, why is it impractical to just
scale one bottling plant as demand grows? Let's see. &nbsp;Physical
infrastructure becomes a problem. &nbsp;As production scales up, you need
more raw material. &nbsp;This means a larger loading dock, more trucks, and
larger roads to accommodate increased traffic. &nbsp;You also need more
machinery, more power, and more people to run the plant. &nbsp;Of course all of
this can scale to a point but could you imagine the infrastructure that
Coca Cola would need in Atlanta to produce all of the billions of cans of Coke that they sell a
year?</span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">You
also need a delivery network that delivers product in a timely manner.
This is more problematic. &nbsp;Delivering Coke to new markets that are
further away from the plant means more trucks and aged product. &nbsp;For
all of these reasons Coca Cola made a transformational decision to ship
raw material all over the country, and eventually the world, and have
regional plants bottle Coke locally. &nbsp;You can do the same thing with
data.</span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">Instead
of having one large plant that has a finite production capability,
Hadoop allows you to have thousands of smaller distributed plants that
work together in a network of data production. &nbsp;This solves many big
data problems in the same way that Coke solved their production
problem. &nbsp;A Hadoop cluster of computers is made up of as many
production plants as you need to process your data. Your raw material
is your data. &nbsp;Hadoop automatically distributes this raw material to
all of your data production plants or nodes. &nbsp;When you run a Hadoop
job, instead of pulling data to the program, it pushes the program to
the data just like a recipe would be distributed to all of the bottling
plants that produce Coke. &nbsp;This approach has many benefits over large
singular databases or having one huge bottling plant.</span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><h5 style="margin-left: 36pt; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline;">Increase capacity without affecting production</span></h5><p style="margin-left: 36pt; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">Adding
a new plant has no impact on the other plants in the network. &nbsp;When a
new plant comes online, the Hadoop system automatically distributes raw
material to it and sends it data crunching recipes so that it can
immediately increase capacity.</span></p><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><h5 style="margin-left: 36pt; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline;">Upgrade individual plants without halting production </span></h5><p style="margin-left: 36pt; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">If
there is a new conveyor belt that can increase the capacity of a plant,
while one plant is being upgraded the rest can increase production
slightly in order to absorb the temporarily reduced capacity. &nbsp;</span></p><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><h5 style="margin-left: 36pt; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline;">Absorb plant failures</span></h5><p style="margin-left: 36pt; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">With
one large system, if something catastrophic happens, all production
stops. &nbsp;With a bottling plant network, if a plant in Wisconsin has to
halt production because of local flooding, the plants in Illinois and
Ohio can ratchet up capacity temporarily to meet demand while flood
waters subside.</span></p><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><h5 style="margin-left: 36pt; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline;">And then there's scalability</span><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span></h5><p style="margin-left: 36pt; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">If
I have one big plant that is at capacity, what do I do? &nbsp;Do I
build another big plant and double my capacity even though I may
initially only need to increase production by 5%? &nbsp;Since the plants are
smaller, Hadoop allows you to add what you need when you need it.</span></p><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;"></span><br /><span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline;">There
is much more to Hadoop than described here, but I think that this gives
you a good idea as to why you should take a look at it if you have big
data that you want to exploit in some way. &nbsp;This distribution and
production technique did wonders for Coca Cola, just think of what it
could do for you.</span> ]]></description>
            <link>http://www.nearinfinity.com/blogs/jeff_borst/hadoop_for_managers.html</link>
            <guid>http://www.nearinfinity.com/blogs/jeff_borst/hadoop_for_managers.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">Hadoop</category>
            
            
                <category domain="http://www.sixapart.com/ns/types#tag">Big Data</category>
            
            <pubDate>Mon, 07 Mar 2011 13:09:22 -0500</pubDate>
        </item>
        
        <item>
            <title>Using HBase-dsl</title>
            <description><![CDATA[At the beginning of last month I started prototyping various solutions for a customer using HBase. &nbsp;However, I found myself writing tons of code to perform some fairly simple tasks. &nbsp;So I set out to simplify my HBase code and ended up writing a Java <a href="http://wiki.github.com/nearinfinity/hbase-dsl" target="_blank">HBase DSL</a>. &nbsp;It's still fairly rough around the edges but it does allow the use of standard Java types and it's extensible.<div><br /><font class="Apple-style-span" style="font-size: 1.25em; "><font class="Apple-style-span" style="font-size: 1.25em; ">

Simple Put and Get Example</font></font><br /><br /><b>

Direct HBase API:</b><br />

<br />
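<pre class="prettyprint">import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PutAndGet {
   public static void main(String[] args) throws IOException {
      HTable hTable = new HTable("test");

      // The HBase client API works on byte arrays, so every key and value is converted by hand.
      byte[] rowId = Bytes.toBytes("abcd");
      byte[] famA = Bytes.toBytes("famA");
      byte[] col1 = Bytes.toBytes("col1");

      // Write one cell: row "abcd", column famA:col1.
      Put put = new Put(rowId).
         add(famA, col1, Bytes.toBytes("hello world!"));
      hTable.put(put);

      // Read the same cell back and decode it to a String.
      Get get = new Get(rowId);
      Result result = hTable.get(get);
      byte[] value = result.getValue(famA, col1);
      System.out.println(Bytes.toString(value));
   }
}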
</pre><b>HBase-dsl API:</b><br /><br />
<pre class="prettyprint">public class PutAndGetWithDsl { 
   public static void main(String[] args) throws IOException { 
      HBase&lt;QueryOps&lt;String&gt;, String&gt; hBase = new HBase&lt;QueryOps&lt;String&gt;, String&gt;(String.class);

      hBase.save("test").  
         row("abcd"). 
            family("famA"). 
               col("col1", "hello world!"); 
      String value = hBase.fetch("test"). 
         row("abcd").
            family("famA"). 
               value("col1", String.class);
      System.out.println(value);
   }
 }</pre>

Now this is where the dsl becomes more powerful!<div><br /><font class="Apple-style-span" style="font-size: 1.25em; "><font class="Apple-style-span" style="font-size: 1.25em; ">

Scanner Example</font></font><br /><br /><b>

Direct HBase API:</b><br /><br />

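<pre class="prettyprint">import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FilterList.Operator;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class Scanner {
   public static void main(String[] args) throws IOException {
      byte[] famA = Bytes.toBytes("famA");
      byte[] col1 = Bytes.toBytes("col1");

      HTable hTable = new HTable("test");

      // Scan rows from "a" (inclusive) to "z" (exclusive), returning only famA:col1.
      Scan scan = new Scan(Bytes.toBytes("a"), Bytes.toBytes("z"));
      scan.addColumn(famA, col1);

      // One filter per accepted value; rows without the column are skipped.
      SingleColumnValueFilter singleColumnValueFilterA = new SingleColumnValueFilter(
           famA, col1, CompareOp.EQUAL, Bytes.toBytes("hello world!"));
      singleColumnValueFilterA.setFilterIfMissing(true);

      SingleColumnValueFilter singleColumnValueFilterB = new SingleColumnValueFilter(
           famA, col1, CompareOp.EQUAL, Bytes.toBytes("hello hbase!"));
      singleColumnValueFilterB.setFilterIfMissing(true);

      // MUST_PASS_ONE acts as a logical OR across the two filters.
      FilterList filter = new FilterList(Operator.MUST_PASS_ONE, Arrays
           .asList((Filter) singleColumnValueFilterA,
                singleColumnValueFilterB));

      scan.setFilter(filter);

      ResultScanner scanner = hTable.getScanner(scan);
      try {
         for (Result result : scanner) {
            System.out.println(Bytes.toString(result.getValue(famA, col1)));
         }
      } finally {
         // Release the server-side scanner resources.
         scanner.close();
      }
   }
}</pre>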
<b>HBase-dsl API:</b><br /><br />

<pre class="prettyprint">public class ScannerWithDsl {
   public static void main(String[] args) throws IOException {
      HBase&lt;QueryOps&lt;String&gt;, String&gt; hBase = new HBase&lt;QueryOps&lt;String&gt;, String&gt;(String.class);

      hBase.scan("test","a","z").
         select().
            family("famA").
               col("col1").
         where().
            family("famA").
               col("col1").eq("hello world!","hello hbase!").
         foreach(new ForEach&lt;Row&gt;() {
            @Override
            public void process(Row row) {
               System.out.println(row.value("famA", "col1", String.class));
            }
         });
  }
}</pre><br />
See the unit tests for more examples.<br /><br /></div></div>]]></description>
            <link>http://www.nearinfinity.com/blogs/aaron_mccurry/using_hbase-dsl.html</link>
            <guid>http://www.nearinfinity.com/blogs/aaron_mccurry/using_hbase-dsl.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">Database</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">Hadoop</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">Java</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">Persistence</category>
            
            
                <category domain="http://www.sixapart.com/ns/types#tag">Big Data</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">hadoop</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">hbase</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">hbase-dsl</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">java</category>
            
            <pubDate>Tue, 05 Jan 2010 22:34:20 -0500</pubDate>
        </item>
        
        <item>
            <title>Hive - The next great data warehouse</title>
            <description><![CDATA[In the past few weeks I have been spending more and more time working with Hadoop and Hive.&nbsp; For those of you who don't know what Hadoop is, check out what <a href="http://en.wikipedia.org/wiki/Hadoop">Wikipedia</a> has to say.&nbsp; Hive is built on top of Hadoop; simply stated, it is a SQL engine that submits <a href="http://en.wikipedia.org/wiki/Map_Reduce">map/reduce</a> jobs to Hadoop for execution.<br /><br />So next you ask yourself, "why do I care"?&nbsp; Well, with Hive using Hadoop for all the heavy lifting, the amount of data that you can process is only limited by the amount of hardware you have in your cluster.&nbsp; Hive is used for data warehousing, which means that it is designed to work on huge datasets, huge joins, huge data loads, huge query results, etc.&nbsp; However, before you start thinking about getting rid of that MySQL database, think again.&nbsp; Hive is not and never will be low latency.&nbsp; All queries submit map/reduce jobs to Hadoop, which then operates on files stored in HDFS.<br /><br />Hive has a lot of nice features built in, like:<br /><ul><li>It can operate on <i>raw</i> files located in HDFS, such as logs from your application or CSV files from your database(s).&nbsp; This can reduce your load time, because you don't have to load the data into a database before you can use it.</li><li>It can operate on compressed files.&nbsp; I started using this feature last week because I am getting a 4 to 1 compression ratio with no difference in performance (I am using sequence files with block compression).</li><li>In your SQL statements you can actually use the Hadoop streaming API to build your own mappers and reducers, and they don't even have to be written in Java!</li><li>You can also create your own user-defined functions, so when you have to do something crazy with the data, you can!</li></ul><br />And there are lots more, so go check it out!<br /><br /><a href="http://wiki.apache.org/hadoop/Hive">Hive</a>, the real Netezza killer.<br />]]></description>
            <link>http://www.nearinfinity.com/blogs/aaron_mccurry/hive_-_the_next_great_data_war.html</link>
            <guid>http://www.nearinfinity.com/blogs/aaron_mccurry/hive_-_the_next_great_data_war.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">Hadoop</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">Java</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">SQL</category>
            
            
                <category domain="http://www.sixapart.com/ns/types#tag">Big Data</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">hadoop</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">hdfs</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">hive</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">java</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">sql</category>
            
            <pubDate>Sun, 04 Oct 2009 13:18:55 -0500</pubDate>
        </item>
        
    </channel>
</rss>

