Sunday, July 30, 2017

HDFS

http://lavnish.blogspot.in/2012/12/hadoop-hdfs.html

Filesystems that manage the storage across a network of machines are called distributed filesystems.

Commodity hardware: Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors).

Scenarios in which HDFS does not work so well
Although this may change in the future, these are areas where HDFS is not a good fit today:
  1. Low-latency data access: applications that require access to data in the tens-of-milliseconds range will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.
  2. Lots of small files: because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes, so one million files, each occupying one block, would need at least 300 MB of namenode memory.

Blocks
HDFS also has the concept of a block, but it is a much larger unit than a disk block: 128 MB by default.

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage. (For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.)
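
To see this for yourself, here is a minimal Java sketch (assuming the Hadoop client libraries are on the classpath and fs.defaultFS points at the cluster; the path /my_dir/file.txt is made up for illustration) that prints a file’s configured block size alongside its actual length:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus st = fs.getFileStatus(new Path("/my_dir/file.txt"));  // hypothetical path
    // A 1 MB file reports a 128 MB block size but only 1 MB of length.
    System.out.println("block size = " + st.getBlockSize() + ", length = " + st.getLen());
  }
}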

Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. For example, with a seek time of around 10 ms and a transfer rate of 100 MB/s, making the seek time 1% of the transfer time calls for a block size of around 100 MB.

HDFS’s fsck command understands blocks. For example, running:
% hdfs fsck /my_dir -files -blocks
will list the blocks that make up each file in the filesystem.

NameNode
The namenode knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, because this information is reconstructed from the datanodes when the system starts.
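
As a rough sketch of how this block-to-datanode mapping surfaces to clients (again assuming the Hadoop client libraries; the path is hypothetical), the Java API exposes it through getFileBlockLocations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/my_dir/file.txt"));  // hypothetical path
    // One BlockLocation per block; each lists the datanodes holding a replica.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("offset " + loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
    }
  }
}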

Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.
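
This cache is managed with the hdfs cacheadmin tool; for example, something along these lines (the pool name and path here are made up):
% hdfs cacheadmin -addPool hot-pool
% hdfs cacheadmin -addDirective -path /my_dir -pool hot-pool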

HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling.
HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share.

DataNode: periodically sends a block report to the NameNode, listing the blocks it stores.

For high availability, HDFS can run a pair of namenodes in an active-standby configuration. The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active.

Two ways to access HDFS

  1. The command line: shell commands such as hdfs dfs (for example, % hdfs dfs -ls /my_dir).
  2. The Java API: there are Java APIs to interact with HDFS, chiefly the org.apache.hadoop.fs.FileSystem class, which is also what the shell commands use underneath.
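
For example, a minimal Java sketch (assuming the Hadoop client libraries are on the classpath; the path is hypothetical) that copies a file from HDFS to standard output using the FileSystem API:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());  // reads fs.defaultFS from config
    InputStream in = null;
    try {
      in = fs.open(new Path("/my_dir/file.txt"));       // hypothetical path
      IOUtils.copyBytes(in, System.out, 4096, false);   // stream file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}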