These are a few problems whose solutions a good Hadoop admin should know.
List 3 hadoop fs shell commands that perform a copy operation
- fs -copyToLocal
- fs -copyFromLocal
- fs -put
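For example, assuming a local file /tmp/data.txt and an HDFS home directory of /user/hadoop (both paths are illustrative):

    # Copy from the local filesystem into HDFS
    hadoop fs -copyFromLocal /tmp/data.txt /user/hadoop/data.txt
    # -put behaves like -copyFromLocal when the source is a local file
    hadoop fs -put /tmp/data.txt /user/hadoop/data.txt
    # Copy from HDFS back to the local filesystem
    hadoop fs -copyToLocal /user/hadoop/data.txt /tmp/data-copy.txt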
How to decommission nodes from an HDFS cluster?
- Add the nodes to the exclude file referenced by dfs.hosts.exclude, run hadoop dfsadmin -refreshNodes, and remove them from the slaves file once decommissioning completes (see the sketch below).
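A minimal sketch of the sequence, assuming hdfs-site.xml points dfs.hosts.exclude at /etc/hadoop/conf/dfs.exclude (the path and hostname below are illustrative):

    # Add the node to the exclude file read by the NameNode
    echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude
    # Tell the NameNode to re-read its include/exclude lists
    hadoop dfsadmin -refreshNodes
    # Watch the node move through "Decommission in progress" to "Decommissioned"
    hadoop dfsadmin -report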
How to add new nodes to the HDFS cluster?
- Add the new node's hostname to the slaves file and start the DataNode and TaskTracker daemons on the new node, as in the sketch below.
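For example, after adding the hostname to conf/slaves on the master, the daemons can be started directly on the new node (paths are relative to the Hadoop install directory):

    # Run these on the new node itself
    bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker
    # Confirm the daemon JVMs are running
    jps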
How to copy data across multiple HDFS clusters?
- Use distcp to copy files across multiple clusters.
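A sketch of a cluster-to-cluster copy; the NameNode hostnames, port and paths are illustrative:

    # distcp runs as a MapReduce job on the cluster it is submitted from
    hadoop distcp hdfs://namenode1.example.com:8020/user/data \
                  hdfs://namenode2.example.com:8020/user/data

When the two clusters run different Hadoop versions, the source side is usually read over HFTP (e.g. hftp://namenode1.example.com:50070/user/data) and the copy is run from the destination cluster.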
How to verify if HDFS is corrupt?
- Run hadoop fsck to check for missing or corrupt blocks.
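For example, to check the whole namespace and print block details:

    # Reports missing, corrupt and under-replicated blocks for the entire filesystem
    hadoop fsck / -files -blocks -locations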
What are the default configuration files that are used in Hadoop?
As of the 0.20 release, Hadoop supports the following read-only default configuration files:
- src/core/core-default.xml
- src/hdfs/hdfs-default.xml
- src/mapred/mapred-default.xml
How will you make changes to the default configuration files?
Hadoop does not recommend changing the default configuration files;
instead, it recommends making all site-specific changes in the following
files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:
- core-default.xml: Read-only defaults for Hadoop.
- core-site.xml: Site-specific configuration for a given Hadoop installation.
Hence, if the same property is defined in both core-default.xml and core-site.xml, the value from core-site.xml is used because it is loaded later (the same holds for the other two file pairs).
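As an illustration, a minimal conf/core-site.xml might look like the following (the NameNode hostname and port are placeholders for your own site):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>

Because core-site.xml is loaded after core-default.xml, this value overrides the shipped default, unless the property is marked final in an earlier resource.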
Consider a scenario where you have set the property mapred.output.compress to true to
ensure that all output files are compressed for efficient space usage
on the cluster. If a cluster user does not want to compress data for a
specific job, what would you recommend they do?
- Ask them to create their own configuration file, set mapred.output.compress to false in it, and load this file as a resource in their job.
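One way the user might do this, sketched with a hypothetical jar name, driver class and paths: put the override in a small XML file, e.g. nocompress.xml,

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.output.compress</name>
        <value>false</value>
      </property>
    </configuration>

and load it for that job only via the -conf generic option (the job's driver must use ToolRunner/GenericOptionsParser for generic options to be parsed):

    hadoop jar myjob.jar com.example.MyJob -conf nocompress.xml /input /output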
Which of the following is the only variable that must be set in file conf/hadoop-env.sh for Hadoop to work?
- HADOOP_LOG_DIR
- JAVA_HOME
- HADOOP_CLASSPATH
- The only required variable is JAVA_HOME, which must point to the Java installation directory.
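For example, in conf/hadoop-env.sh (the JDK path below is only an example; use the path of your own installation):

    export JAVA_HOME=/usr/lib/jvm/java-6-sun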
List all the daemons required to run the Hadoop cluster
- NameNode
- DataNode
- JobTracker
- TaskTracker
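On the master these are typically brought up with the bundled start scripts, and jps shows which daemon JVMs are running (paths are relative to the Hadoop install directory):

    bin/start-dfs.sh      # starts the NameNode, DataNodes (and the SecondaryNameNode)
    bin/start-mapred.sh   # starts the JobTracker and TaskTrackers
    jps                   # lists the daemon JVMs on the local machine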
What is the default port the JobTracker web UI listens on: 50030
What is the default port the HDFS NameNode web UI listens on: 50070
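Both are web UI ports, so they can be checked from a browser or with curl (hostnames are illustrative):

    curl http://jobtracker.example.com:50030/   # JobTracker web UI
    curl http://namenode.example.com:50070/     # NameNode (HDFS) web UI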