Sunday, December 23, 2012

Hadoop : HDFS

Hadoop Distributed File System

As the name suggests, it is a distributed file system: it stores your files spread across the machines of a cluster.

It has a master/slave architecture. There are two types of nodes:
1.       Master :- NameNode & JobTracker :- a single instance of each (a single point of failure)
2.       Slave :- DataNode & TaskTracker :- multiple instances
In 2nd-generation Hadoop there can be more than one NameNode.
The NameNode keeps all of this metadata in RAM, with nothing served from disk. If it crashes, everything is lost, so a Secondary NameNode comes to the rescue: it contacts the NameNode every hour (by default), takes that metadata out, compacts and reshuffles it into a checkpoint file for you. Don't think of the Secondary NameNode as high availability for the NameNode; it is only a checkpointing helper.

  • Files are not stored on the NameNode; it keeps only the metadata. As shown in the figure, the NameNode is keeping track of two files: the 1st is broken into 3 blocks, the 2nd into 2 blocks.
  • File permissions are stored.
  • The last access time is stored.
  • Disk space quotas are tracked here.
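The metadata above can be pictured as a couple of in-memory maps. This is only a toy sketch, not real Hadoop code; the file paths, block IDs, and node names are invented for illustration.

```python
# Toy sketch of the NameNode's in-memory metadata (not actual Hadoop code).
# All names below are made up for illustration.

namenode_meta = {
    "/user/alice/file1.txt": {
        "blocks": ["blk_1", "blk_2", "blk_3"],   # 1st file: 3 blocks
        "permissions": "rw-r--r--",
        "last_access": "2012-12-23T10:00:00",
    },
    "/user/alice/file2.txt": {
        "blocks": ["blk_4", "blk_5"],            # 2nd file: 2 blocks
        "permissions": "rw-r--r--",
        "last_access": "2012-12-23T11:30:00",
    },
}

# The block-to-datanode map is rebuilt at runtime from datanode block reports.
block_locations = {
    "blk_1": ["dn1", "dn7", "dn9"],  # replication factor 3
    "blk_2": ["dn2", "dn5", "dn8"],
}

print(len(namenode_meta["/user/alice/file1.txt"]["blocks"]))  # 3
```

Note that the file-to-block mapping is durable metadata, while the block-to-node mapping is soft state the NameNode relearns from the data nodes.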

Here the replication factor is 3, so each block ends up on three different data nodes.

If one of the data nodes goes down, the heartbeat from that data node to the NameNode ceases too. After 10 minutes the NameNode will consider the data node dead, and the blocks that lived on the dead node will be re-replicated onto other nodes.
There is also the possibility that one of the nodes is merely slow (maybe it is going through a GC cycle). In that case Hadoop can spawn the same process on one more node to help finish the task on time.
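The dead-node handling above can be sketched as a small simulation. This is not NameNode code; the 10-minute timeout and replication factor of 3 come from the text, while the node names, timestamps, and helper functions are invented.

```python
# Sketch of dead-node detection and re-replication (toy simulation).
TIMEOUT = 10 * 60   # seconds of silence after which a datanode is "dead"
REPLICATION = 3

def find_dead_nodes(last_heartbeat, now):
    """Return datanodes whose last heartbeat is older than TIMEOUT."""
    return {dn for dn, t in last_heartbeat.items() if now - t > TIMEOUT}

def blocks_to_rereplicate(block_locations, dead, live):
    """For each block, pick replacement nodes for replicas lost on dead nodes."""
    plan = {}
    for blk, replicas in block_locations.items():
        survivors = [dn for dn in replicas if dn not in dead]
        missing = REPLICATION - len(survivors)
        if missing > 0:
            candidates = [dn for dn in sorted(live) if dn not in survivors]
            plan[blk] = survivors + candidates[:missing]
    return plan

last_hb = {"dn1": 990, "dn2": 950, "dn3": 100}     # dn3 silent for 900 s
dead = find_dead_nodes(last_hb, now=1000)           # -> {"dn3"}
live = (set(last_hb) - dead) | {"dn4", "dn5"}       # spare nodes available
plan = blocks_to_rereplicate({"blk_1": ["dn1", "dn2", "dn3"]}, dead, live)
print(dead, plan)
```

The real NameNode tracks this per block report, but the shape of the decision (timeout, count survivors, pick replacements) is the same.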
Now let's see how the second file is stored...

There is some intelligence to how this distribution happens.

The NameNode also runs a web server hosting a basic web site with statistics about the file system. You can see the configured capacity, how many data nodes make up the cluster, the number of live nodes, the number of dead nodes, used capacity, and so on.

You don't format the underlying hard drives into one cluster-wide HDFS partition. On each slave machine you format the disks with an ordinary local file system, and HDFS sits on top of it. Underlying file system options:
  • ext3 - used by Yahoo
  • ext4 - used by Google
  • xfs -

HDFS has Linux-like shell commands (e.g. hadoop fs -ls, hadoop fs -cat, hadoop fs -mkdir).

Name Node is Rack Aware

In the above figure, file.txt is broken into two blocks, which are stored on different data nodes (the replication factor is 3). Also notice that the NameNode is rack-aware: it knows which rack holds which nodes. (The rack mapping has to be configured manually, e.g. via a topology script written in Python.)

If there were no rack awareness in Hadoop, all replicas of a block could land on one rack, and if that rack went down the file would be lost; hence the intelligent placement.

For a smaller cluster (fewer than 20 nodes) you may not want to make it rack aware, and can assume all nodes sit on a single default rack.
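The rack-aware placement described above (first replica on one rack, the other two together on a different rack) can be sketched like this. It is a toy simulation under those assumptions; rack and node names are invented, and real HDFS learns the topology from the configured script, not a dict.

```python
import random

# Toy sketch of rack-aware replica placement: 1st replica on a random node,
# 2nd and 3rd on two nodes of a *different* rack, so losing one rack can
# never lose all three copies. (Illustrative only; names are made up.)
racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn7", "dn8", "dn9"],
}

def place_replicas(racks, rng=random):
    first_rack = rng.choice(sorted(racks))
    first = rng.choice(racks[first_rack])
    other_rack = rng.choice([r for r in sorted(racks) if r != first_rack])
    second, third = rng.sample(racks[other_rack], 2)
    return [first, second, third]

replicas = place_replicas(racks)
print(replicas)   # e.g. one node on one rack, two on the other
```

Whatever the random choices, the three replicas always span exactly two racks.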

How Write Happens

The sequence of communication goes like this:
  • The client issues a copyFromLocal command (or writes via a MapReduce job), and everything below happens by itself.
  • The request goes to the NameNode, which replies: "OK, write it to Data Nodes 1, 7, and 9."
  • The client directly contacts Data Node 1 with a TCP "ready" command, asking it to also get DN 7 & 9 ready. DN 1 hops the same command to DN 7 and DN 9. A ready ack then travels back from 9 to 7 to 1 to the client. The TCP pipeline is now ready to take the block.
  • Note that from here on the client doesn't have to go through the NameNode any more. The NameNode is consulted only to learn where in the cluster to write the block; after that the client contacts the data nodes directly.
  • The client sends the block to DN 1, which forwards it to DN 7, which forwards it to DN 9.
  • Each of the data nodes (individually) lets the NameNode know it received the block successfully, and from these reports the NameNode builds up its metadata.
  • The 1st DN also lets the client know the operation succeeded.
  • The client then moves on to write block B, which may be stored on an entirely different set of racks and data nodes.
  • NOTE :- we are spanning 2 racks. The 1st rack is chosen randomly; then the NameNode chooses a different rack, which gets 2 of the DNs (always? I think yes, because going from DN 1 to DN 7 you cross 3 switches: one on the 1st rack, one in the network, and one on the 2nd rack; but from DN 7 to DN 9 you cross only 2 switches, keeping latency low). The client needs to write to all 3 DNs for the write to be considered successful.
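The forward-data/backward-ack pipeline in the steps above can be sketched as a tiny simulation. No real TCP or Hadoop RPC is involved; node names and the function are invented for illustration.

```python
# Toy sketch of the HDFS write pipeline: data hops DN-to-DN down the chain
# (client -> dn1 -> dn7 -> dn9); acks travel back up in reverse order.

def pipeline_write(block, pipeline, received):
    """Forward the block down the pipeline; return the ack order back up."""
    for dn in pipeline:                 # data flows forward: dn1 -> dn7 -> dn9
        received.setdefault(dn, []).append(block)
    return list(reversed(pipeline))     # acks travel 9 -> 7 -> 1 -> client

received = {}
ack_order = pipeline_write("blockA", ["dn1", "dn7", "dn9"], received)
print(ack_order)   # each DN then reports the block to the NameNode separately
```

The point of the chain is that the client uploads the block once; the data nodes shoulder the replication traffic between themselves.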

A write with replication = 1 will be 5-10 times faster.

Reading a file
1. The client asks the NameNode for the block locations of a file.
2. The NameNode has the locations of blocks A & B for the file.
3. The NameNode responds that there are two blocks for this file: A[1,4,5] and B[9,1,2].
4. The client then connects directly with the data nodes to read the blocks. The list is ordered: it will first try to read from DN 1, and if that is not possible, read from DN 4.
5. The NameNode is like the librarian in a large library: you can't go shelf to shelf to find a book, so you go to the librarian, who tells you where everything is, since he has organized the books via index cards (by author, publish date, etc.).
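Step 4's ordered-list fallback can be sketched in a few lines. This is illustrative only; the node names and the notion of a "down" set are made up, and a real client also handles checksum failures and timeouts.

```python
# Toy sketch of reading a block: try datanodes in the NameNode's ordered
# (closest-first) list, falling back to the next replica on failure.

def read_block(locations, down_nodes):
    """Return the first reachable datanode from the ordered replica list."""
    for dn in locations:
        if dn not in down_nodes:
            return dn
    raise IOError("all replicas unreachable")

# Block A lives on DN 1, 4, 5; DN 1 is down, so the client reads from DN 4.
print(read_block(["dn1", "dn4", "dn5"], down_nodes={"dn1"}))  # dn4
```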

A file comes in; you chop it into blocks (64 MB by default), store those blocks on the infrastructure, and replicate them. Every file gets broken down into blocks, each 64 MB.

Why a uniform size? Otherwise the time taken to read a file would be dominated by its largest block.
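The chopping itself is simple arithmetic. A quick sketch, using the 64 MB default from the text (the helper function is invented for illustration):

```python
# Sketch of splitting a file into fixed-size HDFS blocks (64 MB default).
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB in bytes

def split_into_blocks(size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `size` bytes is chopped into."""
    full, rest = divmod(size, block_size)
    return [block_size] * full + ([rest] if rest else [])

# A 150 MB file becomes three blocks: 64 MB + 64 MB + 22 MB.
print(split_into_blocks(150 * 1024 * 1024))
```

Only the last block of a file can be smaller than the block size, which is why uniformity holds for everything before it.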

It's not a fully POSIX-compliant file system; it's optimized for
·         Throughput
·         Put / Get / Delete
·         Appends
·         NOT for latency
·         You CAN'T insert data in the middle of a file
·         BUT you can append
·         Compression, if needed, is available at a higher layer (CHS from google), not at this layer
