I’ve spent some time lately digging into the Hadoop ecosystem, through both a product survey and some hands-on work. Here are some remarks about the state of Hadoop in April 2013. I’ve played with Greenplum HD 1.2 and CDH4.2 and read a lot about Hadoop and peripheral products.
1. MapReduce is dead
MapReduce was a bad idea from the beginning, and so is every product built on top of it, especially Hive. It was probably the right thing to do at the time, because it was easier to implement and understand, and it gave Hadoop the market audience it currently enjoys. But now it will fade out. I’ve already pushed that idea earlier, but it’s really a game changer. Start Hive and run a simple select a from table limit 10. Then get yourself a coffee. No joke: over a 30MB dataset it will take 20 seconds.
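If you want to check that number on your own cluster, here is a minimal sketch of how the timing could be measured. It assumes the hive command-line client is on your PATH and relies on its -e option to run a single query string. The helper names build_hive_command and time_command are mine, not part of any Hadoop tool, and the query is the placeholder from above, so swap in a real column and table.

```python
import subprocess
import time


def build_hive_command(query):
    """Return the argv list that runs one query through the Hive CLI."""
    # `hive -e "<query>"` executes the query string and then exits.
    return ["hive", "-e", query]


def time_command(argv):
    """Run a command and return the wall-clock seconds it took."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command fails.
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder query from the post: replace `a` and `table`
    # with a real column and table name (assumption, not real schema).
    query = "select a from table limit 10"
    elapsed = time_command(build_hive_command(query))
    print(f"Hive answered in {elapsed:.1f} s")
```

The measured time includes the start-up of the Hive client itself, which is part of what you wait for when you run the query by hand.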
I’m not talking about HDFS here. I found HDFS convenient to use. I don’t know its technical limits, but it seems like solid ground.
2. Impala is serious business
Here things are getting really serious. First, it’s smart, as a DBMS should be, and second, Impala’s HQL is a bit better (you have to try it to discover the many minor things that run on Impala but not on Hive). I didn’t try Shark, which is similar, but I really like the story. Apache Drill is in the same space too, with a better foundation, but it is a bit too ambitious. I still hope it will succeed.
3. The ecosystem rocks: Apache Oozie and Cloudera Hue
They are both currently immature, but they provide an invaluable framework around Hadoop. Oozie is a workflow scheduler, in effect an ETL tool embedded in Hadoop, while Hue is a workbench where you can use Hive, Impala, … well, every Hadoop tool. The big momentum around Hadoop means that most issues you face in your daily work will get solved. It’s a bit like Java: no matter what you want, there is a library for that. The same thing is happening with Hadoop.
In conclusion, Hadoop will be in common use soon.
Real-time SQL is getting real, the stack is getting quite complete (Oozie, Hue), the momentum is huge and it’s free. The mix is getting better and better. It’s still way too early to do something serious with it, as I can see better alternatives for most uses (excluding the living archive one, which is not my area anyway). Major changes will happen this year, and now is the time to look closely at Hadoop.