Software
HDFS-HC2: A Data Placement Module for Heterogeneous Hadoop Clusters
Publication
If you use our HDFS-HC module to conduct your research, please cite our paper and software in your publications. We would greatly appreciate it if you gave us credit.
This HDFS-HC tool is based on our paper - Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters - by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010. [PDF | PPT]
For information on HDFS-HC2, please refer to the report - Analysis of Data Placement Strategy based on Computing Power of Nodes on Heterogeneous Hadoop Clusters - by Sanket Reddy Chintapalli. [PDF | PPT | Source Code ]
Introduction
HDFS-HC is a software module for rebalancing data load in heterogeneous Hadoop clusters. This data placement tool was integrated into the Hadoop distributed file system (HDFS) to initially distribute a large data set across multiple nodes in accordance with the computing capacity of each node.
If you are not very familiar with the Hadoop system, please visit the MapReduce overview, which provides background on MapReduce and explains how to install Hadoop on a cluster. The purpose of this document is to help you install and use the HDFS-HC data placement tool in a heterogeneous Hadoop cluster.
Do you have questions on HDFS-HC2?
For questions, please contact Sanket Reddy Chintapalli at szc0060@auburn.edu, Jiong Xie at jzx0009@auburn.edu, or Xiao Qin at xqin@auburn.edu.
Supported Platforms
Required Software
Required software for Linux and Windows includes:
Additional requirements for Windows include:
Download
You can download the HDFS-HC tool here. Note that this is a recent Hadoop HDFS project file, in which the HDFS-HC module is integrated.
Extracting the tar Archive
After downloading the tarball (i.e., hdfs-hc2.tar.gz), you can extract the archive with the following command:
tar -xzvf hdfs-hc2.tar.gz
Installation of Hadoop
You need to install Hadoop before working on HDFS-HC2; the required software is listed in the Required Software section above.
Hadoop Startup
To start your Hadoop cluster, you will need to start both HDFS and Map/Reduce.
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh
The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
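For illustration, a slaves file simply lists the hostname of each worker node, one per line (the hostnames below are hypothetical):
datanode01
datanode02
datanode03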
Start Map-Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh
The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
Hadoop Shutdown
Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh
The bin/stop-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.
Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh
The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.
The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (which defaults to ${HADOOP_HOME}/logs).
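For example, you can follow a DataNode's log on a worker node with a command such as the one below (the exact file name depends on the user and hostname that started the daemon):
$ tail -f ${HADOOP_LOG_DIR}/hadoop-*-datanode-*.log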
HDFS-HC2 (a.k.a. CRBalancer)
To run the CRBalancer utility, follow the instructions below.
Instructions
Before getting started, please read 'How to Contribute to Hadoop Projects' at the following link - How to Contribute
How to Interpret Source Files
There are two files and a folder you have to be concerned with in the package.
1. JAR file – hadoop-hdfs-2.3.0.jar
2. Script – hdfs
3. Hadoop HDFS Project Folder – hadoop-hdfs, the source tree containing the CRBalancer code
Steps to run the CRBalancer
1. Replace the hadoop-hdfs-2.3.0.jar file at the following path - HADOOP_HOME_PATH/share/hadoop/hdfs
2. Replace the script file hdfs at the following path - HADOOP_HOME_PATH/bin/hdfs
3. Run the script with the following parameters:
hdfs crbalancer -file {full path to computation ratio file} -namenodename {hostname of the NameNode} -port {port number to access the NameNode}
What should the computation-ratio configuration file look like?
hostname1 ratio
hostname2 ratio
hostname3 ratio
.
.
.
hostnameN ratio
Note: The ratios are calculated by placing the entire data set on a single node and finding their least common multiple. Place the file in HDFS using hadoop fs -put filename HDFS_DIRECTORY_PATH/filename
Example:
hpxeon01 0.36
jedi05 0.54
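If the ratios are interpreted as relative shares of the data set (an assumption, since the interpretation is not spelled out above), hpxeon01 would hold 0.36 / (0.36 + 0.54) = 40% of the blocks and jedi05 the remaining 60%, so the node with the larger ratio stores more data.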
Example Command:
hdfs crbalancer -file /user/sanket/crmap.txt -namenodename hpxeon01 -port 54310
How to View Source Code
1. Open Hadoop hdfs project folder.
2. Navigate to the path hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/crbalancer.
3. You will find three files: CRBalancer.java, CRBalancingPolicy.java, and CRNamenodeConnector.java.
4. CRBalancer.java is the main program, which decides how to transfer data among nodes based on their computing power.
5. CRBalancingPolicy.java calculates and stores the space occupied by each node; it tracks how much space a node currently uses and feeds into the balancing decision.
6. CRNamenodeConnector.java connects to the NameNode to obtain information about the DataNodes.
7. In every program, I have marked the places where the code differs from the original Balancer, which balances nodes based on space utilization rather than computing utilization. A simplified sketch of this computing-ratio decision rule is shown below.
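To make this concrete, the following minimal Java sketch illustrates the computing-ratio decision rule, assuming each node's target share of the stored data is proportional to its computing ratio. This is not the actual CRBalancer source; the class name, data structures, and sample values are hypothetical.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A simplified illustration of the CRBalancer decision rule described above.
// This is NOT the actual source code; names and sample values are hypothetical.
public class CRBalancerSketch {
    public static void main(String[] args) {
        // Bytes currently stored on each node (hypothetical sample values).
        Map<String, Long> usedBytes = new HashMap<>();
        usedBytes.put("hpxeon01", 500L << 20);  // 500 MB
        usedBytes.put("jedi05",   500L << 20);  // 500 MB

        // Computing ratios, as read from the configuration file above.
        Map<String, Double> ratios = new HashMap<>();
        ratios.put("hpxeon01", 0.36);
        ratios.put("jedi05",   0.54);

        long totalBytes = usedBytes.values().stream().mapToLong(Long::longValue).sum();
        double totalRatio = ratios.values().stream().mapToDouble(Double::doubleValue).sum();

        List<String> overUtilized = new ArrayList<>();
        List<String> underUtilized = new ArrayList<>();
        for (Map.Entry<String, Double> e : ratios.entrySet()) {
            // Target: a node's share of the data is proportional to its ratio.
            long target = (long) (totalBytes * e.getValue() / totalRatio);
            long held = usedBytes.get(e.getKey());
            if (held > target) overUtilized.add(e.getKey());
            else if (held < target) underUtilized.add(e.getKey());
            System.out.printf("%s holds %d bytes, target %d bytes%n", e.getKey(), held, target);
        }
        // The real balancer would then schedule block moves from over- to
        // under-utilized DataNodes until every node reaches its target.
        System.out.println("Move blocks from " + overUtilized + " to " + underUtilized);
    }
}

With the sample values above, both nodes start with equal data, so the sketch reports hpxeon01 as over-utilized and jedi05 as under-utilized, matching the intuition that the node with the larger computing ratio should store more data.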
References
If you use our HDFS-HC module to conduct your research, please cite the following paper:
J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, "Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters," Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.
Copyright and Disclaimer
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Acknowledgments
This software is based upon work supported by the US National Science Foundation under Grants CCF-0845257 (CAREER), CNS-0917137 (CSR), CNS-0757778 (CSR), CCF-0742187 (CPA), CNS-0831502 (CyberTrust), CNS-0855251 (CRI), OCI-0753305 (CI-TEAM), DUE-0837341 (CCLI), and DUE-0830831 (SFS), as well as Auburn University under a startup grant and a gift (Number 2005-04-070) from the Intel Corporation.