Tuesday, December 18, 2012

What is Hadoop

The best way to understand it is through a simple analogy.



What is an operating system, and what does it do at its core?
      Simply put, it does just two things at its core:
  • It stores files
  • It runs applications on top of those files
Device drivers, security, libraries, and everything else are built on top of these two.

Similarly, you can think of Hadoop as a "modern-day operating system" that gives you the same two services; the only difference is that it does so across many, many machines. In fact, it is an abstraction one level up: it runs on top of existing operating systems such as Windows and Linux to do this.


Why such a strange name? Is it an acronym?
Hadoop was created by Doug Cutting, who named the project after his three-year-old son's favorite toy: a stuffed elephant that his son called Hadoop. So it is not an acronym and it doesn't mean anything; it's just a name.

And now his son is 12 years old and is proud of his achievement :)

Hadoop's architecture
Hadoop was not designed for enterprise architecture; it was designed for clustered architecture, where many machines work together as one system.
So think about it in the right context.

Why was it made ?
The initial work was done by Google, which wanted to index the entire web on a daily basis. Doug Cutting got the inspiration for Hadoop after reading Google's papers on the Google File System and Google MapReduce.
Google's published designs, Yahoo's engineering backing, and the Apache open-source community then came together to solve this problem.

Components
Hadoop was inspired by three white papers released by Google:
  1. Google File System :- a distributed file system
  2. Google MapReduce :- a combination of Mapper and Reducer functions
  3. Google BigTable :- quick lookups across large data (technically not part of the core)
Doug Cutting worked through these papers and implemented the same ideas, which later, building on Google's designs and with the backing of Yahoo and the Apache community, became Hadoop.
 
Component Details
1. Hadoop Distributed File System :-
Used to manage data across the cluster. It splits the data, scatters it across nodes, replicates it, and keeps track of it. It also takes care of the case where a node is unavailable or down.
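
To make the split / scatter / replicate idea concrete, here is a tiny Python sketch. It is not how HDFS is implemented: the function names, the node names, and the simple round-robin placement are all made up for illustration (real HDFS uses much larger blocks and a smarter placement policy). It only shows why losing one node does not lose data when every block has several copies.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement: dict, live_nodes: set) -> set:
    """Return the ids of blocks that still have at least one copy on a live node."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node in live_nodes for node in holders)
    }


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    placement = place_replicas(len(blocks), nodes, replication=3)
    # Pretend node1 went down: every block should still be readable.
    survivors = set(nodes) - {"node1"}
    print(readable_blocks(placement, survivors) == set(placement))
```

With three copies of every block, any single node (here node1) can go down and every block is still readable from another node.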

2. Hadoop Map Reduce :-
The computational mechanism. It executes an application in parallel by dividing it into tasks and co-locating these tasks with the pieces of data they work on. It also collects and distributes intermediate results and manages failures across the cluster.
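
The Mapper / Reducer idea can be sketched in a few lines of plain Python using the classic word-count example. This is a single-process illustration only, not the Hadoop API: the names `mapper`, `shuffle`, `reducer`, and `run_job` are hypothetical, and on a real cluster each input split would be handled by a separate map task running on the node that holds that piece of data.

```python
from collections import defaultdict


def mapper(line: str):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the intermediate values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key: str, values: list):
    """Reduce step: add up all the counts collected for one word."""
    return key, sum(values)


def run_job(splits: list) -> dict:
    """Run map, shuffle, and reduce over a list of input splits.

    Each split is a list of lines. Here the splits are processed in a loop;
    on a cluster they would be processed in parallel.
    """
    intermediate = []
    for split in splits:
        for line in split:
            intermediate.extend(mapper(line))
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = [["big data big"], ["data cluster"]]  # two pretend input splits
    print(run_job(splits))  # {'big': 2, 'data': 2, 'cluster': 1}
```

The point to notice is that the mapper only ever looks at one line and the reducer only ever looks at one key, which is what lets the framework spread the work over many machines.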

3. Apache HBase :-
This is technically not part of core Hadoop. It is an implementation of a Google paper called "BigTable", which describes how to hide the latency of the underlying distributed file system (HDFS, in HBase's case) so that very quick lookups are possible.
 
