Hadoop Distributed File System
As the name suggests, it is a distributed file system: it stores your files in a distributed way.
Why must blocks be of uniform size? :- otherwise the time taken to read the file would depend on the largest block.
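The point above can be sketched with a toy model: when a file's blocks are read in parallel (one reader per block), total read time is governed by the largest block, not the sum. The numbers below are made up for illustration, not Hadoop internals.

```python
READ_SPEED_MB_S = 100  # assumed per-node disk read speed (illustrative)

def parallel_read_time(block_sizes_mb):
    """Time to read all blocks in parallel = time of the slowest block."""
    return max(block_sizes_mb) / READ_SPEED_MB_S

uniform = [64, 64, 64]   # uniform 64 MB blocks
skewed  = [8, 8, 176]    # same total size, but one huge block

print(parallel_read_time(uniform))  # 0.64 s
print(parallel_read_time(skewed))   # 1.76 s -- dominated by the big block
```

Same total data, but the skewed layout takes nearly 3x longer to read in parallel.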
It has a master-slave architecture. There are two types of nodes:
1. Master :- Name Node & Job Tracker :- single (single point of failure)
2. Slave :- Data Node & Task Tracker :- multiple
In 2nd-gen Hadoop there can be more than one Name Node.
The Name Node keeps everything in RAM and nothing on disk. If it crashes, everything is lost, so a Secondary Name Node comes to the rescue: it contacts the Name Node every hour, takes that data out, and cleans and reshuffles it into a file for you. Don't think of the Secondary Name Node as high availability for the Name Node.
Files are not stored on the Name Node; it keeps only the metadata. As shown in the figure, the Name Node is keeping track of two files: the 1st is broken into 3 blocks, the 2nd into 2 blocks.
- File permissions are stored.
- The last access time is stored.
- Disk space quotas tracked here.
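A minimal sketch (not real Hadoop code) of the kind of metadata the Name Node holds in RAM: which blocks make up each file, where each block's replicas live, plus permissions and access times. All paths, block names, and node numbers are made up.

```python
# Hypothetical in-memory metadata table, in the spirit of the figure:
# two files, the first in 3 blocks, the second in 2 blocks.
namenode_meta = {
    "/user/a/file1.txt": {
        "blocks": {"blk_1": [1, 7, 9], "blk_2": [2, 5, 8], "blk_3": [3, 4, 6]},
        "permissions": "rw-r--r--",
        "last_access": "2012-01-01T10:00:00",
    },
    "/user/a/file2.txt": {
        "blocks": {"blk_4": [1, 4, 5], "blk_5": [9, 1, 2]},
        "permissions": "rw-r--r--",
        "last_access": "2012-01-02T09:30:00",
    },
}

print(len(namenode_meta["/user/a/file1.txt"]["blocks"]))  # 3
print(len(namenode_meta["/user/a/file2.txt"]["blocks"]))  # 2
```

Note the file *contents* never appear here; only block IDs and the data nodes holding each replica.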
Here the replication factor is 3. So if one of the data nodes goes down, the heartbeat from that data node to the Name Node ceases too. After 10 minutes the Name Node will consider the data node dead, and the blocks that were on the dead node will be re-spawned on other nodes.
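A toy sketch of that dead-node logic, under the assumptions above: no heartbeat for 10 minutes means the node is dead and its blocks get re-replicated elsewhere. Node numbers, timings, and the replacement-picking rule are all illustrative, not Hadoop's actual implementation.

```python
DEAD_AFTER_S = 10 * 60  # consider a data node dead after 10 silent minutes

block_locations = {"blk_1": {1, 7, 9}, "blk_2": {2, 7, 8}}
# Seconds since each DN's last heartbeat (DN7 has been silent > 10 min):
last_heartbeat = {1: 0.0, 2: 0.0, 7: -700.0, 8: 0.0, 9: 0.0}

def check_heartbeats(now, live_nodes):
    for dn, seen in last_heartbeat.items():
        if now - seen > DEAD_AFTER_S:
            # DN is dead: re-spawn each of its blocks on another live node
            for nodes in block_locations.values():
                if dn in nodes:
                    nodes.discard(dn)
                    replacement = min(n for n in live_nodes if n not in nodes)
                    nodes.add(replacement)

check_heartbeats(now=0.0, live_nodes={1, 2, 8, 9})
print(block_locations)  # DN7 no longer holds any block
```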
There is also the possibility that one of the nodes is slow (maybe it is going through a GC (garbage collection) cycle). In that case the Name Node can spawn the slow node's task on one more node, to help finish the work on time.
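This idea, spawning a duplicate of a straggler's work elsewhere and taking whichever copy finishes first, can be sketched as below. The straggler-detection rule (progress below half the average) is an assumption for illustration, not Hadoop's actual heuristic.

```python
def speculate(task_progress, threshold=0.5):
    """Return tasks progressing slower than `threshold` of the average."""
    avg = sum(task_progress.values()) / len(task_progress)
    return [t for t, p in task_progress.items() if p < avg * threshold]

# Hypothetical task progress (0.0 .. 1.0); the task on DN7 is stuck.
progress = {"task-on-dn1": 0.9, "task-on-dn2": 0.8, "task-on-dn7": 0.1}
print(speculate(progress))  # ['task-on-dn7'] -- re-launch this elsewhere
```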
Now let's see how the second file is stored... There is some intelligence to how this distribution happens.
The Name Node also has a web server that hosts a basic web site with statistics about the file system: you can see the configured capacity, how many data nodes make it up, the number of live nodes, the number of dead nodes, the used capacity, etc.
You don't format the underlying hard drives into a cluster with HDFS; on each slave machine you format the disk with a logical file system such as ext3, ext4, or xfs. Underlying file system options:
- ext3 :- used by Yahoo
- ext4 :- used by Google
- xfs
HDFS has Linux-like commands.
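For instance, the HDFS shell (invoked as `hadoop fs <command>`) mirrors familiar Linux names. The mapping below is a small illustrative sample; the example paths are made up.

```python
# Familiar Linux commands and their (real) HDFS shell counterparts,
# shown against hypothetical example paths:
linux_to_hdfs = {
    "ls":    "hadoop fs -ls /user/a",
    "cat":   "hadoop fs -cat /user/a/file.txt",
    "mkdir": "hadoop fs -mkdir /user/a/dir",
    "rm":    "hadoop fs -rm /user/a/file.txt",
    "cp (local -> HDFS)": "hadoop fs -copyFromLocal file.txt /user/a/",
}
for linux_cmd, hdfs_cmd in linux_to_hdfs.items():
    print(f"{linux_cmd:20s} -> {hdfs_cmd}")
```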
Name Node is Rack Aware
In the above figure, file.txt is broken into two blocks, which are stored on different data nodes (replication factor is 3). Also notice that the Name Node is rack-aware: it knows which rack has which nodes. (The rack mapping needs to be configured manually when setting up, e.g. via a Python script.)
If there were no rack awareness in Hadoop, there is a chance that all replicas could land on one rack, and if that rack went down the file would be lost :- hence the intelligent placement.
For a smaller cluster (fewer than 20 nodes) you may not want to make it rack-aware, and can assume that all nodes are on a default rack.
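The rack-aware placement described above can be sketched as: put the first replica on one rack, and the remaining two together on a different rack, so no single rack failure loses all copies. Rack names, node numbers, and the random choices are illustrative, not Hadoop's actual placement code.

```python
import random

racks = {"rack1": [1, 2, 3, 4, 5], "rack2": [6, 7, 8, 9, 10]}

def place_replicas(rng):
    """Pick 3 data nodes spanning exactly 2 racks (1 + 2 split)."""
    first_rack = rng.choice(sorted(racks))
    other_rack = next(r for r in sorted(racks) if r != first_rack)
    first = rng.choice(racks[first_rack])
    second, third = rng.sample(racks[other_rack], 2)
    return [first, second, third]

replicas = place_replicas(random.Random(42))
rack_of = {dn: r for r, dns in racks.items() for dn in dns}
print({rack_of[dn] for dn in replicas})  # exactly two racks are used
```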
How a Write Happens
The sequence of communication is as follows:
- The client issues a copyFromLocal command (or writes via a MapReduce job) and everything happens by itself.
- This goes to the Name Node, which replies "OK, write it to Data Nodes 1, 7, 9".
- The client directly contacts Data Node 1 with a TCP "ready" command, also asking it to get DN 7 & 9 ready. DN1 hops the same command to DN7 and DN9. Then a ready ack goes from 9 to 7 to 1 to the client. The TCP pipeline is now ready to take the block.
- Note that from here on the client doesn't have to go through the Name Node any more. The Name Node is used only to find out where in the cluster the block will be written; then the client contacts the data nodes directly.
- The client sends the block to DN1, which forwards it to DN7, which forwards it to DN9.
- Then each of the DNs (individually) lets the Name Node know it received the block successfully, and the Name Node builds up its metadata.
- AND the first DN lets the client know the operation was successful.
- Then the client moves on to write block B, which may be stored on an entirely different set of racks and data nodes.
- NOTE :- we are spanning 2 racks... the 1st rack is chosen randomly, then the Name Node chooses a different rack that holds 2 of the DNs. (Always 2 on the second rack? I think yes, because from DN1 to DN7 you go through 3 different switches: one on the 1st rack, one in the network core, and one on the 2nd rack; but from DN7 to DN9 you go through only 2 switches, keeping latency low.) The client needs to write to all 3 DNs for the write to be considered successful.
A write with replication factor 1 will be 5-10 times faster.
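The pipeline steps above can be sketched as toy code: the client streams a block to the first data node, which forwards it down the chain; each DN reports to the Name Node, and the first DN tells the client the write succeeded. Block and node names are made up.

```python
def write_block(block, pipeline, namenode_meta):
    """Toy model of the write pipeline: client -> DN1 -> DN7 -> DN9."""
    received = []
    for dn in pipeline:
        received.append(dn)                              # DN stores & forwards the block
        namenode_meta.setdefault(block, []).append(dn)   # each DN reports to the Name Node
    # Write is successful only once all replicas in the pipeline have it:
    return received == pipeline

meta = {}
ok = write_block("blk_A", [1, 7, 9], meta)
print(ok, meta)  # True {'blk_A': [1, 7, 9]}
```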
Reading a file
1. The client asks what the block locations for a file are.
2. The Name Node has the locations of blocks A & B for the file.
3. The Name Node responds saying there are two blocks for this file: A[1,4,5] and B[9,1,2].
4. The client will directly connect with the data nodes to read the blocks. The list is ordered: it will first try to read from DN1, and if that is not possible, read from DN4.
5. The Name Node is like the bookkeeper in a library: in a large library you can't go shelf to shelf to find a book; instead you go to the bookkeeper, and he tells you the rest, since he has organized the books via index cards (by author, publish date, etc.).
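The read path above can be sketched as follows, using the same block lists as the example; the "dead node" set is a hypothetical failure to show the ordered fallback.

```python
block_locations = {"A": [1, 4, 5], "B": [9, 1, 2]}  # from the Name Node
dead_nodes = {1}  # suppose DN1 is unreachable

def read_file(blocks):
    """Read each block from the first reachable DN in its ordered list."""
    chosen = {}
    for blk, candidates in blocks.items():
        chosen[blk] = next(dn for dn in candidates if dn not in dead_nodes)
    return chosen

print(read_file(block_locations))  # {'A': 4, 'B': 9}
```

Block A falls back to DN4 because DN1 is down; block B reads from DN9, the first entry in its list.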
A file comes in; you chop it into blocks (default 64 MB), store those blocks on the infrastructure, and replicate them.
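The chopping step is just fixed-size slicing of the byte stream, scaled down here to a 10-byte block for illustration (HDFS's default is 64 MB).

```python
BLOCK_SIZE = 10  # illustrative; the HDFS default is 64 * 1024 * 1024

def chop(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks; only the last may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = chop(b"x" * 25)
print([len(b) for b in blocks])  # [10, 10, 5]
```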
It's not a fully POSIX-compliant file system; it's optimized for:
- Throughput
- Put / Get / Delete
- Appends
- NOT for latency
- CAN'T insert stuff in the middle of a file
- BUT can do appends
- Compression, if needed, is available at a higher layer (CHS from Google), not at this layer