
14 Top Hadoop Interview Questions and Answers


Here is a list of 14 top Hadoop interview questions and answers. Let's take a look at them one by one.


1 Explain “Distributed Cache” in a “MapReduce Framework”.


Distributed Cache is a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework makes it available on every DataNode where your map/reduce tasks are running. You can then access the cached file as a local file in your Mapper or Reducer.
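The pattern is easiest to see in miniature. The sketch below is plain Python, not Hadoop: it simulates the framework having already copied a small lookup file onto a task's local disk, which the mapper then loads once and uses for a map-side join. The file name and record format are illustrative assumptions.

```python
def load_cached_lookup(path):
    """Load the locally cached file (code,name per line) into a dict."""
    lookup = {}
    with open(path) as f:
        for line in f:
            code, name = line.strip().split(",")
            lookup[code] = name
    return lookup

def mapper(record, lookup):
    """Enrich each tab-separated input record using the cached table."""
    user, code = record.split("\t")
    return (user, lookup.get(code, "unknown"))

# Simulate the framework placing the cache file on the task's local disk:
with open("countries.txt", "w") as f:
    f.write("IN,India\nUS,United States\n")

table = load_cached_lookup("countries.txt")
print(mapper("alice\tIN", table))  # ('alice', 'India')
```

In real Hadoop the file is registered on the job (so the framework ships it to the nodes) rather than written locally by hand; the point here is only that each task reads it as an ordinary local file.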

2 What is MapReduce?


It is a framework/programming model used for processing large data sets over a cluster of computers using parallel programming.
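The model itself fits in a few lines of plain Python: map emits (key, value) pairs, a shuffle groups values by key, and reduce aggregates each group. This is a sketch of the idea (word count), not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate all values for one key."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

In a real cluster, map tasks run in parallel on different input splits and reduce tasks run in parallel on different key ranges; the three phases are the same.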

3 What is the difference between an “HDFS Block” and an “Input Split”?


The “HDFS Block” is the physical division of the data, while the “Input Split” is the logical division of the data. HDFS divides data into blocks for storage, whereas for processing, MapReduce divides the data into input splits and assigns each split to a mapper function.
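A quick back-of-the-envelope calculation makes the distinction concrete. Assuming the common 128 MB default block size, a 300 MB file is stored as 3 physical blocks; with the default of one input split per block, it is also processed by 3 mappers, but a custom InputFormat could choose different split boundaries without moving any blocks.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS default (configurable per cluster)

def num_blocks(file_size_mb):
    """Physical blocks a file occupies; the last one may be partial."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

# 300 MB -> blocks of 128, 128, and 44 MB; by default, 3 input splits
# and therefore 3 map tasks.
print(num_blocks(300))  # 3
```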

4 What is “speculative execution” in Hadoop?


If a node appears to be executing a task more slowly than expected, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted and the other is killed. This process is called “speculative execution”.
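The "accept the first finisher, discard the straggler" idea can be sketched with two threads standing in for two nodes. The delays and node names are illustrative; real Hadoop decides when to launch a speculative attempt based on task progress statistics.

```python
import concurrent.futures
import time

def task(delay, node):
    """Stand-in for the same task attempt running on a given node."""
    time.sleep(delay)
    return f"result from {node}"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # The same logical task, launched on a slow node and a fast node:
    futures = [pool.submit(task, 0.5, "slow-node"),
               pool.submit(task, 0.01, "fast-node")]
    done, not_done = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    winner = next(iter(done)).result()
    for f in not_done:
        f.cancel()  # the slower duplicate attempt is discarded

print(winner)  # result from fast-node
```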

5 What is Rack Awareness in Hadoop?


Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, to reduce network traffic between racks. With the default replication factor of 3, the policy is that “for every block of data, two copies will exist in one rack and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
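The placement rule above can be sketched directly: one replica on the writer's rack, and the other two on two different nodes of a single remote rack. The rack/node naming is hypothetical, and real HDFS applies further constraints (disk space, load) that this sketch ignores.

```python
import random

def place_replicas(racks, local_rack):
    """Sketch of the default Replica Placement Policy (factor 3).

    racks: {rack_name: [node, ...]}, each remote rack needs >= 2 nodes.
    Returns three (rack, node) placements spanning exactly two racks.
    """
    # First replica: on the local (writer's) rack.
    first = (local_rack, random.choice(racks[local_rack]))
    # Second and third: two different nodes on one remote rack.
    remote_rack = random.choice([r for r in racks if r != local_rack])
    node_a, node_b = random.sample(racks[remote_rack], 2)
    return [first, (remote_rack, node_a), (remote_rack, node_b)]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
placement = place_replicas(racks, local_rack="rack1")
# e.g. [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]
```

Note how the result survives a whole-rack failure (a copy always exists on a second rack) while keeping two of the three replica writes within one rack.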

6 What does ‘jps’ command do?


The ‘jps’ command helps us check whether the Hadoop daemons are running. It lists all the Hadoop daemons, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the machine.

7 Why do we use HDFS for applications having large data sets and not when there are a lot of small files?


HDFS is more suitable for large amounts of data in a single file than for small amounts of data spread across many files. As you know, the NameNode stores the metadata of the file system in RAM, so the amount of memory places a limit on the number of files an HDFS file system can hold. In other words, too many files generate too much metadata, and storing all of that metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block, or directory takes about 150 bytes.
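The 150-bytes rule of thumb makes the problem easy to quantify. The sketch below compares the NameNode heap needed for 100 million one-block small files against the same data packed into a few large files (the file counts are illustrative).

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb metadata cost per file/block/dir

def namenode_ram_gb(num_files, blocks_per_file=1):
    """Approximate NameNode heap for the given namespace, in GB."""
    objects = num_files * (1 + blocks_per_file)  # file entry + block entries
    return objects * BYTES_PER_OBJECT / 1024**3

# 100 million single-block small files:
print(round(namenode_ram_gb(100_000_000), 1))          # ~27.9 GB of heap

# Roughly the same data as 1,000 large files of ~800 blocks each:
print(round(namenode_ram_gb(1_000, blocks_per_file=800), 2))  # ~0.11 GB
```

Same data volume, a ~250x difference in metadata footprint: that is why HDFS favors few large files over many small ones.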

8 Explain Fault Tolerance in HDFS


When data is stored in HDFS, the NameNode replicates it to several DataNodes. The default replication factor is 3, and you can change it as per your needs. If a DataNode goes down, the NameNode automatically copies the data to another node from the replicas and makes the data available. This provides fault tolerance in HDFS.
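The re-replication step can be sketched as a tiny simulation: when a node dies, every block that drops below the replication factor gets a new copy on a surviving node. Block and node names are made up; real HDFS also applies rack-aware placement when choosing the target node.

```python
REPLICATION = 3  # default replication factor

def handle_dead_node(block_map, nodes, dead):
    """Remove a dead node and re-replicate under-replicated blocks.

    block_map: {block_id: set of node names holding a replica}
    """
    survivors = [n for n in nodes if n != dead]
    for replicas in block_map.values():
        replicas.discard(dead)
        for n in survivors:
            if len(replicas) >= REPLICATION:
                break
            replicas.add(n)  # copy the block to a surviving node
    return block_map

blocks = {"blk_1": {"dn1", "dn2", "dn3"},
          "blk_2": {"dn2", "dn3", "dn4"}}
handle_dead_node(blocks, ["dn1", "dn2", "dn3", "dn4", "dn5"], dead="dn2")
print(blocks["blk_1"])  # three replicas again, none of them on dn2
```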

9 What is Checkpointing in Hadoop?


“Checkpointing” is a process that takes an FsImage and edit log and compacts them into a new FsImage. Thus, instead of replaying the edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode.
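Conceptually, a checkpoint just applies the logged operations to the old image to produce a new one. In the sketch below the FsImage is modeled as a dict of path-to-block mappings and the edit log as a list of operations; the operation names are illustrative, not HDFS's actual edit-log record types.

```python
def checkpoint(fsimage, edit_log):
    """Fold the edit log into the FsImage, producing a new FsImage."""
    new_image = dict(fsimage)
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

fsimage = {"/a.txt": "blk_1"}
edits = [("create", "/b.txt", "blk_2"),
         ("delete", "/a.txt", None)]
print(checkpoint(fsimage, edits))  # {'/b.txt': 'blk_2'}
```

After the checkpoint, a restarting NameNode loads the new image directly instead of replaying the two edits, which is the whole point of the operation.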

10 What needs to be done when the NameNode is down?


The NameNode recovery process involves the following steps to bring the Hadoop cluster back up:
1. Start a new NameNode using the file system metadata replica (FsImage).
2. Configure the DataNodes and clients so that they acknowledge the new NameNode.
3. The new NameNode starts serving clients once it has finished loading the last checkpointed FsImage (for metadata information) and has received enough block reports from the DataNodes.
On large Hadoop clusters this recovery process can take a long time, which becomes an even greater challenge during routine maintenance.

11 How does NameNode tackle DataNode failures?


The NameNode periodically receives a Heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly. A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send heartbeats for a specific period of time, it is marked dead, and the NameNode replicates the dead node’s blocks to other DataNodes using the replicas created earlier.

12 Why does one remove or add nodes in a Hadoop cluster frequently?


One of the most attractive features of the Hadoop framework is its utilization of commodity hardware. However, this leads to frequent “DataNode” crashes in a Hadoop cluster. Another striking feature of the Hadoop framework is the ease of scaling with the rapid growth in data volume. Because of these two reasons, one of the most common tasks of a Hadoop administrator is to commission (add) and decommission (remove) DataNodes in a Hadoop cluster.

13 What are the various Hadoop daemons and their roles in a Hadoop cluster?


NameNode: The master node, responsible for storing the metadata of all files and directories. It knows the blocks that make up each file and where those blocks are located in the cluster.
DataNode: The slave node that contains the actual data.
Secondary NameNode: Periodically merges the changes (edit log) with the FsImage (Filesystem Image) from the NameNode. It stores the modified FsImage in persistent storage, which can be used in case of NameNode failure.
ResourceManager: The central authority that manages resources and schedules applications running on top of YARN.
NodeManager: Runs on slave machines and is responsible for launching the applications’ containers (where applications execute their parts), monitoring their resource usage (CPU, memory, disk, network), and reporting it to the ResourceManager.
JobHistoryServer: Maintains information about MapReduce jobs after the ApplicationMaster terminates.

14 What is Apache Hadoop?


Apache Hadoop is a framework that provides various services and tools to store and process Big Data. It helps in analyzing Big Data and making business decisions from it, which can’t be done efficiently and effectively using traditional systems.
The two main components of Hadoop are:
Storage unit – HDFS (NameNode, DataNode)
Processing framework – YARN (ResourceManager, NodeManager)

