
Hadoop Terminologies

Files and Blocks

File abstraction using blocks
File abstraction means a file can be larger than any single hard disk in the cluster
This can be achieved with a network file system as well as with HDFS
HDFS and other distributed file systems typically build on the local file system of each node rather than on a network file system
Files on HDFS are split into blocks based on dfs.blocksize, and the blocks are distributed across the datanodes
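
A minimal sketch of how block size shows up in the Java FileSystem API, assuming a reachable HDFS cluster; the path /tmp/example.txt and the 256 MB block size are hypothetical values for illustration, and cluster-wide defaults normally come from dfs.blocksize in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            long blockSize = 256L * 1024 * 1024; // illustrative per-file override of dfs.blocksize
            short replication = 3;               // matches the dfs.replication default
            Path out = new Path("/tmp/example.txt"); // hypothetical path

            // create(path, overwrite, bufferSize, replication, blockSize) lets a
            // single file use a block size different from the cluster default.
            try (FSDataOutputStream stream =
                     fs.create(out, true, 4096, replication, blockSize)) {
                stream.writeBytes("hello hdfs");
            }
        }
    }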



Mappers and Reducers

       Mappers
  • Number of mappers is determined by the framework based on block size and input split size
  • Map tasks use data locality (they are scheduled close to the data they process)
  • Logic to filter and perform row-level transformations is implemented in the map function
  • Mapper tasks execute the map function (see the sketch after this list)

       Shuffle & Sort
  • Typically taken care of by the Hadoop MapReduce framework
  • Can be enhanced or customized in the form of custom partitioners and custom comparators
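
A minimal sketch of a map function and a custom partitioner, assuming hypothetical input lines of the form "year,region,amount"; the class names SalesMapper and YearPartitioner are made up for illustration:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical input: text lines of the form "year,region,amount".
    public class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length != 3) {
                return; // filtering: skip malformed rows
            }
            // Row-level transformation: emit (year, amount).
            context.write(new Text(fields[0]),
                          new IntWritable(Integer.parseInt(fields[2])));
        }
    }

    // Custom partitioner: decides which reducer receives a given key during shuffle.
    class YearPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

The partitioner would be registered on the job with job.setPartitionerClass(YearPartitioner.class); the sort and grouping behaviour can similarly be customized with job.setSortComparatorClass(...) and job.setGroupingComparatorClass(...).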



       Reducers

  • Number of reducers can be pre-determined in some cases
  • If the report has to be generated by year, then the number of reducers can be the number of years covered by the report
  • If the report has to be generated for a number of regions or states, then the number of reducers can be the number of regions or states
  • Logic to implement aggregations, joins, etc. is implemented in the reduce function
  • Reducer tasks execute the reduce function (see the sketch after this list)
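
A minimal sketch of a matching reducer that aggregates by key, continuing the hypothetical sales example above:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums all amounts emitted by SalesMapper for a given year.
    public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

The number of reduce tasks would then be set on the job, for example job.setNumReduceTasks(numberOfYears) for a per-year report.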





Fault Tolerance – Replication Factor


HDFS is fault tolerant, which means the system keeps functioning without any data loss even if some hardware components of the system have failed.
HDFS does not use RAID (RAID only addresses hard disk failure; mirroring is expensive and striping is slow)
HDFS instead replicates blocks, and dfs.replication controls how many copies are made (default 3).
HDFS replication handles disk failure as well as any other hardware failure (except network failures)
Network failures are addressed by spreading replicas across multiple racks with multiple switches
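
A minimal sketch of changing the replication factor of an existing file through the Java FileSystem API, assuming a reachable cluster and the same hypothetical path as above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/example.txt"); // hypothetical path

            // Raise the replication factor of this file to 4; the Namenode
            // schedules the extra copies in the background.
            boolean accepted = fs.setReplication(file, (short) 4);
            System.out.println("Replication change accepted: " + accepted);
        }
    }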




Metadata

Files are divided into blocks based on dfs.blocksize (default 128 MB)
Each block has multiple copies, stored on the servers designated as datanodes; the number of copies is controlled by the parameter dfs.replication (default 3)

What is file metadata?
An HDFS file is logical
Each block has a block id and multiple copies
Each copy is stored on a separate datanode
The mapping between a file, its blocks and the block locations is the metadata of the file
Metadata also includes file permissions, directories, etc.
All of this is stored in memory on the Namenode
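
A minimal sketch of querying this metadata from the Namenode through the Java FileSystem API, again assuming a reachable cluster and the hypothetical path used earlier; it prints where each block of the file is stored:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt")); // hypothetical path

            // Ask the Namenode (which holds file-to-block-to-location metadata
            // in memory) where each block of the file lives.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }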




Heartbeat and block report

Each Datanode sends a heartbeat to the Namenode every 3 seconds
The heartbeat interval is controlled by dfs.heartbeat.interval
Along with the heartbeat, the Datanode sends information such as disk capacity and current activity
The Datanode also sends a periodic block report (default every 6 hours) to the Namenode, controlled by the dfs.blockreport.* parameters
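
A minimal sketch of reading these properties through the Java Configuration API; the property names are the standard HDFS ones, and the second argument of get() is only a fallback used when the property is not present in the loaded configuration:

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Fallback defaults match the values described above:
            // 3 seconds for the heartbeat, 6 hours (in milliseconds) for the block report.
            System.out.println("dfs.heartbeat.interval = "
                + conf.get("dfs.heartbeat.interval", "3"));
            System.out.println("dfs.blockreport.intervalMsec = "
                + conf.get("dfs.blockreport.intervalMsec", "21600000"));
        }
    }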



Checksum

    Checksums are used to ensure that blocks or files are not corrupted while they are being read from or written to HDFS
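
A minimal sketch of retrieving the file-level checksum through the Java FileSystem API, assuming the same hypothetical path; checksum verification itself happens automatically during reads:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/example.txt"); // hypothetical path

            // Checksums are verified automatically while reading; this call just
            // exposes the file-level checksum, e.g. to compare two copies of a file.
            FileChecksum checksum = fs.getFileChecksum(file);
            if (checksum != null) {
                System.out.println(checksum.getAlgorithmName() + " : " + checksum);
            }
            // fs.setVerifyChecksum(false); // would disable client-side verification
        }
    }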













