Over the past few weeks I have been spending more and more time working with Hadoop and Hive. For those of you who don't know what Hadoop is, check out what Wikipedia has to say. Hive is built on top of Hadoop; simply stated, it is a SQL engine that submits map/reduce jobs to Hadoop for execution.
So next you ask yourself, "Why do I care?" Well, with Hive using Hadoop for all the heavy lifting, the amount of data that you can process is limited only by the amount of hardware you have in your cluster. Hive is used for data warehousing, which means that it is designed to work on huge datasets, huge joins, huge data loads, huge query results, etc. However, before you start thinking about getting rid of that MySQL database, think again. Hive is not and never will be low latency. All queries submit map/reduce jobs to Hadoop, and those jobs then operate on files stored in HDFS.
Hive has a lot of nice features built in, like:
- It can operate on raw files located in HDFS, such as logs from your application or CSV files exported from your database(s). This can reduce your load time, because you don't have to actually load the data into a database before you can use it.
- It can operate on compressed files. I started using this feature last week because I am getting a 4-to-1 compression ratio with no difference in performance (I am using sequence files with block compression).
- In your SQL statements you can actually use the Hadoop Streaming API to build your own mappers and reducers, and they don't even have to be written in Java!
- You can also create your own user-defined functions, so when you have to do something crazy with the data, you can!
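To make the streaming bullet concrete, here is a minimal sketch of a mapper written in Python instead of Java. Hive's TRANSFORM clause hands each row to the script as a tab-separated line on stdin and reads tab-separated lines back from stdout. The table `raw_logs`, the columns `host`, `status`, and `category`, and the file name `status_mapper.py` are all made up for illustration; they are not from any real setup.

```python
import io
import sys

# Hypothetical Hive usage (table, column, and file names are assumptions):
#
#   ADD FILE status_mapper.py;
#   SELECT TRANSFORM (host, status)
#          USING 'python status_mapper.py'
#          AS (host, category)
#   FROM raw_logs;


def classify(status):
    """Map an HTTP status string to 'ok', 'error', or 'unknown'."""
    try:
        code = int(status)
    except ValueError:
        # Anything that is not a number (including an empty field) is unknown.
        return "unknown"
    return "error" if code >= 400 else "ok"


def transform_row(line):
    """Turn one 'host<TAB>status' input row into 'host<TAB>category'."""
    fields = line.rstrip("\n").split("\t")
    host = fields[0]
    status = fields[1] if len(fields) > 1 else ""
    return host + "\t" + classify(status)


def run(source, sink):
    """Read rows from source and write one transformed row per input row."""
    for line in source:
        sink.write(transform_row(line) + "\n")


if __name__ == "__main__":
    # Under Hive this would be run(sys.stdin, sys.stdout). A tiny in-memory
    # sample is used here so the sketch can be run on its own.
    sample = io.StringIO("web01\t200\nweb02\t503\n")
    run(sample, sys.stdout)
```

Keeping the per-row logic in `transform_row` means it can be tested on a laptop without a cluster; only the last line changes when the script is handed to Hive.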
And there are lots more, so go check it out!
Hive, the real Netezza killer.
7 Comments



How does it relate to HBase? How do they compare?
Well, HBase is a non-relational database; it is designed to have low-latency (real-time) CRUD. However, its API is more limiting than that of an RDBMS. Don't get me wrong, I'm a big fan of HBase, and it has many uses. However, it is not designed to take relational data and be as flexible as an RDBMS.
Hive, on the other hand, is a SQL(ish) data warehouse engine. It operates on the data in a brute-force kind of way, meaning it does not have indexes. So most queries actually operate on all the data in the given table. There are some ways to optimize it through partitioning; however, for the most part it is brute force.
So it's slow by comparison to a normal SQL database like Oracle or MySQL, but if you ask one of those databases to join 3 tables that have more than 2 billion rows each and insert the results into a new table, you are going to be waiting a while. And more than likely, if you don't have a HUGE machine, the query will just fail, and worst case scenario the machine will crash. But I actually tried this out with Hive last week and it took about 1.5 hours on a 20-node cluster with 4 cores, 250 GB of storage, and 3 GB of RAM per node. So not a big cluster, but it worked.
So when you have a ton of data in a relational format and you need to work with it using SQL, Hive is a good choice.
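To illustrate the brute-force versus partitioning point above, here is a toy in-memory sketch. It is not how Hive is implemented (Hive keeps each partition as a separate directory of files in HDFS); the class `PartitionedTable` and the sample data are invented purely to show why a partition filter cuts down the number of rows a query has to read.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model of a table whose rows are grouped by a partition value."""

    def __init__(self):
        # partition value -> list of rows stored under that partition
        self.partitions = defaultdict(list)

    def insert(self, partition, row):
        self.partitions[partition].append(row)

    def scan(self, predicate, partition=None):
        """Return (matching_rows, rows_read).

        With no partition given, every row in the table is read (brute force).
        With a partition given, only that partition's rows are read.
        """
        if partition is None:
            chunks = list(self.partitions.values())
        else:
            chunks = [self.partitions.get(partition, [])]

        rows_read = 0
        matches = []
        for chunk in chunks:
            for row in chunk:
                rows_read += 1
                if predicate(row):
                    matches.append(row)
        return matches, rows_read


if __name__ == "__main__":
    table = PartitionedTable()
    for day in ["2009-01-01", "2009-01-02", "2009-01-03"]:
        for n in range(4):
            table.insert(day, {"day": day, "n": n})

    # Full scan: every row is read to answer the query.
    _, read_all = table.scan(lambda r: r["n"] == 0)
    # Partition-filtered scan: only one day's rows are read.
    _, read_one = table.scan(lambda r: r["n"] == 0, partition="2009-01-02")
    print("rows read without partition filter:", read_all)
    print("rows read with partition filter:", read_one)
```

Within a partition the scan is still brute force, which matches the description above: partitioning narrows what gets read, but it is not an index.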
Hmm…
I'm not sold on the real utility of Hive for general purpose DW usage. You incur a huge amount of overhead for each query that is run (reads, writes, etc.). It's a valid solution if you will only ever query the data once or twice, but that's a rare case.
For ultra low cost usage you'll get much better performance from a combination of Postgres and GridSQL (http://tr.im/BwtJ). Postgres also offers external table features that would allow you to operate on raw files in a similar way. Ultimately this solution will not scale as high as a Hive solution, but for anything less than 100s of servers/TBs it's a win.
Time is money though and in my experience Netezza will complete a similar join to the one you mention (2 billion rows, 3 ways, into new table) in a handful of *minutes*. If your budget is tighter consider Vertica, Aster, or Greenplum (in that order…), all of whom offer Hadoop/MapReduce integration.
Joe
Joe, sorry I missed your comment and haven't responded sooner.
I haven't looked at GridSQL; I will check it out, thanks! As far as Netezza is concerned, Hive can perform the large joins to create the data marts I need, which is my main use for the Netezza. However, for smaller, more ad hoc queries the Netezza will outperform Hive. My biggest problem with Netezza is: what do you do when you outgrow it? Hive+Hadoop can be expanded by adding more nodes. So as with anything in technology, it all depends on what you need and how much money you have. Right now Hive and Hadoop seem to be the best fit for what we need.
Aaron
I agree that Hive can be an option ONLY when Oracle fails, and that happens when the number of rows is in the hundreds of millions. I feel what Hive lacks is an intelligent data structure tailor-made to speed up search. I feel Hive and Oracle optimize for two completely different aspects. Hive tries to do everything using the distributed map/reduce paradigm, whereas Oracle tries to optimize in its data-storage architecture (B+-trees, inverted indexes, etc.). Probably what might make sense is to combine Lucene+Hive so that Hive can take advantage of the table schema.
I agree; I think that Hive could use some intermediate indexes, although I don't think that Lucene is the best option here, mainly because of the memory needed for Lucene to run. Possibly an HBase solution might be better.
Aaron
Hi, I'm Priya. Can anybody tell me what drawbacks are found in Hive and how to overcome them? If anybody has an idea about this, please mail me at my mail ID:
ppriyaccet@gmail.com