
Hadoop Terminologies

Files and Blocks

File abstraction using blocks
File abstraction means a file can be larger than any single hard disk in the cluster
This can be achieved with a network file system as well as with HDFS
HDFS and other distributed file systems typically build on the local file system of each node rather than on a network file system
Files on HDFS are split into blocks based on dfs.blocksize, and the blocks are distributed across the datanodes
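
A minimal sketch of how block size shows up in the Java FileSystem API, assuming a reachable HDFS cluster; the path /tmp/example.txt and the 256 MB block size are hypothetical values for illustration, and cluster-wide defaults normally come from dfs.blocksize in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            long blockSize = 256L * 1024 * 1024; // illustrative per-file override of dfs.blocksize
            short replication = 3;               // matches the dfs.replication default
            Path out = new Path("/tmp/example.txt"); // hypothetical path

            // create(path, overwrite, bufferSize, replication, blockSize) lets a
            // single file use a block size different from the cluster default.
            try (FSDataOutputStream stream =
                     fs.create(out, true, 4096, replication, blockSize)) {
                stream.writeBytes("hello hdfs");
            }
        }
    }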



Mappers and Reducers

       Mappers
  • Number of mappers is determined by the framework based on block size and input split size
  • Map tasks use data locality (they are scheduled close to the data they process)
  • Logic to filter and perform row-level transformations is implemented in the map function
  • Mapper tasks execute the map function (see the sketch after this list)

       Shuffle & Sort
  • Typically taken care of by the Hadoop MapReduce framework
  • Can be enhanced or customized in the form of custom partitioners and custom comparators
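
A minimal sketch of a map function and a custom partitioner, assuming hypothetical input lines of the form "year,region,amount"; the class names SalesMapper and YearPartitioner are made up for illustration:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical input: text lines of the form "year,region,amount".
    public class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length != 3) {
                return; // filtering: skip malformed rows
            }
            // Row-level transformation: emit (year, amount).
            context.write(new Text(fields[0]),
                          new IntWritable(Integer.parseInt(fields[2])));
        }
    }

    // Custom partitioner: decides which reducer receives a given key during shuffle.
    class YearPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

The partitioner would be registered on the job with job.setPartitionerClass(YearPartitioner.class); the sort and grouping behaviour can similarly be customized with job.setSortComparatorClass(...) and job.setGroupingComparatorClass(...).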



       Reducers

  • Number of reducers can be pre-determined in some cases
  • If the report has to be generated by year, then the number of reducers can be the number of years covered by the report
  • If the report has to be generated for a number of regions or states, then the number of reducers can be the number of regions or states
  • Logic to implement aggregations, joins, etc. is implemented in the reduce function
  • Reducer tasks execute the reduce function (see the sketch after this list)
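
A minimal sketch of a matching reducer that aggregates by key, continuing the hypothetical sales example above:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums all amounts emitted by SalesMapper for a given year.
    public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

The number of reduce tasks would then be set on the job, for example job.setNumReduceTasks(numberOfYears) for a per-year report.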





Fault Tolerance – Replication Factor


HDFS is fault tolerant, which means the system keeps functioning without any data loss even if some hardware components of the system have failed.
HDFS does not use RAID (RAID only addresses hard disk failure; mirroring is expensive and striping is slow)
HDFS instead replicates blocks, and dfs.replication controls how many copies are made (default 3).
HDFS replication handles disk failure as well as any other hardware failure (except network failures)
Network failures are addressed by spreading replicas across multiple racks with multiple switches
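
A minimal sketch of changing the replication factor of an existing file through the Java FileSystem API, assuming a reachable cluster and the same hypothetical path as above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/example.txt"); // hypothetical path

            // Raise the replication factor of this file to 4; the Namenode
            // schedules the extra copies in the background.
            boolean accepted = fs.setReplication(file, (short) 4);
            System.out.println("Replication change accepted: " + accepted);
        }
    }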




Metadata

Files are divided into blocks based on dfs.blocksize (default 128 MB)
Each block has multiple copies, stored on the servers designated as datanodes; the number of copies is controlled by the parameter dfs.replication (default 3)

What is file metadata?
An HDFS file is logical
Each block has a block id and multiple copies
Each copy is stored on a separate datanode
The mapping between a file, its blocks and the block locations is the metadata of the file
Metadata also includes file permissions, directories, etc.
All of this is stored in memory on the Namenode
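
A minimal sketch of querying this metadata from the Namenode through the Java FileSystem API, again assuming a reachable cluster and the hypothetical path used earlier; it prints where each block of the file is stored:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt")); // hypothetical path

            // Ask the Namenode (which holds file-to-block-to-location metadata
            // in memory) where each block of the file lives.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }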




Heartbeat and block report

Each Datanode sends a heartbeat to the Namenode every 3 seconds
The heartbeat interval is controlled by dfs.heartbeat.interval
Along with the heartbeat, the Datanode sends information such as disk capacity and current activity
The Datanode also sends a periodic block report (default every 6 hours) to the Namenode, controlled by the dfs.blockreport.* parameters
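
A minimal sketch of reading these properties through the Java Configuration API; the property names are the standard HDFS ones, and the second argument of get() is only a fallback used when the property is not present in the loaded configuration:

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Fallback defaults match the values described above:
            // 3 seconds for the heartbeat, 6 hours (in milliseconds) for the block report.
            System.out.println("dfs.heartbeat.interval = "
                + conf.get("dfs.heartbeat.interval", "3"));
            System.out.println("dfs.blockreport.intervalMsec = "
                + conf.get("dfs.blockreport.intervalMsec", "21600000"));
        }
    }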



Checksum

    Checksums are used to ensure that blocks or files are not corrupted while they are being read from or written to HDFS
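
A minimal sketch of retrieving the file-level checksum through the Java FileSystem API, assuming the same hypothetical path; checksum verification itself happens automatically during reads:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/example.txt"); // hypothetical path

            // Checksums are verified automatically while reading; this call just
            // exposes the file-level checksum, e.g. to compare two copies of a file.
            FileChecksum checksum = fs.getFileChecksum(file);
            if (checksum != null) {
                System.out.println(checksum.getAlgorithmName() + " : " + checksum);
            }
            // fs.setVerifyChecksum(false); // would disable client-side verification
        }
    }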













