Hadoop – Rack and Rack Awareness

In Hadoop, a rack is a group of machines (typically DataNodes) that are physically located close together, usually in the same cabinet of a data center and connected through the same network switch. A large cluster spans many racks. Hadoop uses rack information to improve the data locality and fault tolerance of the cluster.

 

Rack awareness is a feature of Hadoop that allows the system to take the physical location of each node into account when placing block replicas. The idea is to spread the replicas of a block across more than one rack, which limits the impact of a rack failure, while still keeping as much traffic as possible inside a single rack.

 

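When a client reads a data block, the NameNode returns the replica locations ordered by network distance from the client, so the client reads from a replica on its own node or its own rack when one exists. This minimizes cross-rack network traffic and improves performance. When a client writes a block with the default replication factor of three, HDFS places the first replica on the writer's node if the writer is a DataNode (otherwise on a DataNode chosen by the NameNode), the second replica on a node in a different rack, and the third replica on a different node in that same remote rack.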
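To make the idea of "closeness" concrete, here is a minimal sketch of how the distance between a client and a replica could be computed from topology paths. It assumes a two-level topology written as /rack/node; the function names `distance` and `sort_by_proximity` and the node names are made up for illustration and are not Hadoop APIs. The resulting values (0 for the same node, 2 for the same rack, 4 for a different rack) follow the usual hop-counting convention.

```python
def distance(a: str, b: str) -> int:
    """Hops between two topology paths such as "/Rack1/DN1".

    The distance is the number of steps from each node up to their
    closest common ancestor in the topology tree.
    """
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def sort_by_proximity(client: str, replicas: list[str]) -> list[str]:
    """Order replica locations from nearest to farthest from the client."""
    return sorted(replicas, key=lambda r: distance(client, r))


# Hypothetical replica locations for one block.
replicas = ["/Rack2/DN3", "/Rack1/DN2", "/Rack2/DN4"]
print(sort_by_proximity("/Rack1/DN1", replicas))
```

With a client on /Rack1/DN1, the replica on /Rack1/DN2 is listed first because it shares the client's rack.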

 

Rack awareness also helps to increase the fault tolerance of the cluster by ensuring that data blocks are stored on multiple racks. This means that if a rack fails, the data blocks stored on that rack can still be accessed from replicas on other racks.
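The fault-tolerance argument can be sketched in a few lines of Python. This is a simplified model of three-replica placement (writer's node, then two nodes on one other rack), not Hadoop's actual implementation: it ignores node load, free space, and writers that are not DataNodes. The cluster layout and the names `place_replicas` and `survives_rack_failure` are assumptions made for the example.

```python
import random

# Hypothetical cluster layout: rack name -> DataNodes in that rack.
CLUSTER = {
    "/Rack1": ["DN1", "DN2"],
    "/Rack2": ["DN3", "DN4"],
}


def rack_of(node, cluster):
    """Return the rack that contains the given node."""
    for rack, nodes in cluster.items():
        if node in nodes:
            return rack
    raise KeyError(node)


def place_replicas(writer, cluster, rng=random):
    """Pick three targets: the writer's node, a node on another rack,
    and a second node on that same remote rack."""
    local_rack = rack_of(writer, cluster)
    # Only racks that can hold two replicas are candidates in this sketch.
    remote_racks = [r for r in cluster if r != local_rack and len(cluster[r]) >= 2]
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(cluster[remote_rack], 2)
    return [writer, second, third]


def survives_rack_failure(targets, cluster):
    """True if losing any single rack still leaves at least one replica."""
    racks = {rack_of(t, cluster) for t in targets}
    return len(racks) >= 2


targets = place_replicas("DN1", CLUSTER)
print(targets, survives_rack_failure(targets, CLUSTER))
```

Because the replicas always land on two racks, the check returns True for any placement this function produces; placing all replicas on one rack (for example DN1 and DN2 only) would fail it.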

 

Rack awareness is configured by specifying a rack topology script in the Hadoop configuration, through the net.topology.script.file.name property (conventionally set in core-site.xml). The script maps the nodes in the cluster to their respective racks. Once the script is configured, the NameNode uses the rack information to make decisions about block placement and replication.

To configure rack awareness in Hadoop, you can use the following steps:

 

  1. Create a rack topology script that maps the nodes in the cluster to their respective racks. The script receives one or more hostnames or IP addresses as arguments and must print one rack name for each of them.

  2. Update the Hadoop configuration (conventionally core-site.xml) to specify the path to the rack topology script using the property net.topology.script.file.name.

  3. Restart the NameNode for the configuration changes to take effect.

You can check the rack assignments with the command hdfs dfsadmin -printTopology, which shows the rack information of all the DataNodes in the cluster.
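If you want to process that output in a script, a small parser is enough. The sketch below assumes the output consists of "Rack: <name>" header lines, each followed by indented node entries; the exact layout can differ between Hadoop versions, and the sample text, addresses, and the function name `parse_topology` are made up for illustration rather than captured from a real cluster.

```python
def parse_topology(text):
    """Group DataNode entries under the rack header that precedes them."""
    racks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Rack:"):
            current = line.split(":", 1)[1].strip()
            racks[current] = []
        elif line and current is not None:
            # Keep only the address token; any trailing "(hostname)" is dropped.
            racks[current].append(line.split()[0])
    return racks


# Sample text in the assumed layout, not captured from a real cluster.
SAMPLE = """\
Rack: /Rack1
   10.0.0.1:9866 (DN1)
   10.0.0.2:9866 (DN2)

Rack: /Rack2
   10.0.0.3:9866 (DN3)
   10.0.0.4:9866 (DN4)
"""

for rack, nodes in parse_topology(SAMPLE).items():
    print(rack, nodes)
```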

 

Rack awareness is an optional feature in Hadoop and must be configured manually. If it is not configured, Hadoop treats all DataNodes as if they were in a single rack, named /default-rack.

For example, let's say you have a Hadoop cluster with four DataNodes in two racks. The first rack is named "Rack1" and contains nodes "DN1" and "DN2", and the second rack is named "Rack2" and contains nodes "DN3" and "DN4".

 

To configure rack awareness, you would need to create a rack topology script. This script can be written in any language, as long as it is executable. Hadoop passes it one or more hostnames or IP addresses as arguments, and it must print one rack name for each argument, in the same order. Python is a convenient choice for a simple script.
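A minimal sketch of such a script for the four-node example might look like the following. The names `RACK_MAP`, `DEFAULT_RACK`, and `resolve` are illustrative, and the mapping uses the example's node names as if they were resolvable hostnames. The rack names are written as paths (/Rack1, /Rack2) because Hadoop expects rack identifiers that start with a slash.

```python
#!/usr/bin/env python3
"""Rack topology script sketch: prints one rack path per argument."""
import sys

# Hypothetical mapping for the four-node example; replace with real
# hostnames and/or IP addresses, since Hadoop may pass either form.
RACK_MAP = {
    "DN1": "/Rack1",
    "DN2": "/Rack1",
    "DN3": "/Rack2",
    "DN4": "/Rack2",
}

DEFAULT_RACK = "/default-rack"  # fallback used here for unmapped nodes


def resolve(hosts):
    """Return the rack path for each host, in the same order as given."""
    return [RACK_MAP.get(h, DEFAULT_RACK) for h in hosts]


if __name__ == "__main__":
    # Hadoop can pass several hosts in a single invocation; it expects
    # one rack per host on standard output, in argument order.
    print("\n".join(resolve(sys.argv[1:])))
```

Saved under a name of your choice and made executable, the script prints /Rack1 and /Rack2 on separate lines when called with the arguments DN1 DN3. Falling back to a default rack for unknown hosts keeps the script from failing when a new node is added before the mapping is updated.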

 


To configure rack awareness in Hadoop, you will need to follow these steps:

  1. Edit the hdfs-site.xml configuration file located in the Hadoop configuration directory (usually /etc/hadoop/conf).

  2. Optionally, select a block placement policy. The default policy (BlockPlacementPolicyDefault) is already rack-aware. To spread replicas over as many racks as possible instead, add the following property:

<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant</value>
</property>

  3. Optionally, control whether the NameNode takes DataNode load into account when choosing replica targets. This setting does not change how many replicas are placed on different racks. It defaults to true and is named dfs.namenode.replication.considerLoad in Hadoop 2 (renamed to dfs.namenode.redundancy.considerLoad in Hadoop 3):

<property>
  <name>dfs.namenode.replication.considerLoad</name>
  <value>true</value>
</property>

  4. Specify your network topology. You can use a script to determine the rack for each node, configured with the following property (conventionally placed in core-site.xml):

<property>
  <name>net.topology.script.file.name</name>
  <value>/path/to/your/script</value>
</property>

  5. Save the configuration files and restart the Hadoop daemons for the changes to take effect.

  6. Use the command "hdfs dfsadmin -printTopology" to see the rack of every node in your cluster. The command "hdfs dfsadmin -report -live" prints a detailed report for each live DataNode.
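Before restarting the daemons, it can be useful to confirm that the topology property is actually present in the configuration file. The sketch below parses Hadoop-style configuration XML with Python's standard library. The helper name `read_property` and the sample content are assumptions for illustration, and the script path is the same placeholder used in the steps above; in practice you would read the real file from the configuration directory.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the value of a Hadoop-style <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


# Hypothetical file content; the script path is the placeholder from the text.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>net.topology.script.file.name</name>
    <value>/path/to/your/script</value>
  </property>
</configuration>
"""

script = read_property(SAMPLE_SITE_XML, "net.topology.script.file.name")
print(script or "no topology script configured; all nodes fall into one rack")
```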

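Note that storage policies are a separate feature from rack awareness. The command "hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy>" sets the storage policy for a specific directory or file, for example "hdfs storagepolicies -setStoragePolicy -path /example.txt -policy COLD". A storage policy (such as HOT, COLD, or ALL_SSD) controls which type of storage media holds the blocks; it does not control which racks the replicas are placed on. Rack placement is governed by the topology script and the block placement policy, and there is no rack-aware storage policy to set per file.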

 

 
