QoSec Project
A Middleware Approach to Teaching Computer Security (2009 - )
Project 3: Cluster Performance
Project Description
Hadoop is at its best when working with large data sets. This project illustrates how to run a program against increasingly large data sets and how to interpret the results.
Resources
1. Hadoop: The Definitive Guide. Author: Tom White. O’Reilly Media, 2009.
2. Input files are located on BlackBoard (1MB.txt, 64MB.txt, 512MB.txt, 1GB.txt, 2GB.txt, 4GB.txt, 8GB.txt)
System Requirements
1. Ubuntu version 8.04 or later.
2. Sun Java 6
3. SSH installed
4. Apache Hadoop installed
Project Tasks
This project introduces large data sets into your Hadoop runs. You will organize and chart the results to see what Hadoop can really do.
1. (XX points) To begin, write a script that executes wordcount on each of the input files. Make sure the script outputs the information needed to calculate how long each execution takes. Refer to Project 1 to refamiliarize yourself with wordcount if necessary.
Submit your job to the cluster when the script is ready. Proofread your script carefully so you do not have to run it again.
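One way to structure such a script is sketched below. The jar name hadoop-examples.jar and the HDFS input/ directory are assumptions, not part of this assignment; substitute the paths from your own Project 1 setup.

```shell
#!/bin/sh
# Sketch of a timing script. Each line of output is "filename seconds",
# which is exactly the data you need for the charts in the next task.

time_job() {
    # Run a command, then print a label and the elapsed wall-clock seconds.
    label=$1; shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo "$label $((end - start))"
}

# Assumed layout: input files already copied into HDFS under input/, and
# the wordcount jar from Project 1 named hadoop-examples.jar.
if command -v hadoop >/dev/null 2>&1; then
    for f in 1MB.txt 64MB.txt 512MB.txt 1GB.txt 2GB.txt 4GB.txt 8GB.txt; do
        time_job "$f" hadoop jar hadoop-examples.jar wordcount \
            "input/$f" "output-$f"   # Hadoop requires a fresh output dir per run
    done
fi
```

Note that each run is given its own output directory: a MapReduce job fails if its output directory already exists, so reusing one name across the seven runs would stop the script partway through.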
2. (XX points) Organize the data you collected (Excel is recommended). Use tables and charts to show
your findings.
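Before building charts, it can help to reduce the raw timings to a small table. The snippet below is a sketch that assumes the timing script from task 1 wrote "filename seconds" pairs to a file named timings.txt (a hypothetical name); it derives each file's size in MB from its name and adds a throughput column worth plotting alongside the raw runtimes.

```shell
#!/bin/sh
# Summarize a "filename seconds" log as runtime plus MB/s throughput.
# The input sizes are recovered from the file names (1MB.txt, 8GB.txt, ...).
summarize='
{
    n = $1; t = $2
    sub(/\.txt$/, "", n)                  # "512MB" or "8GB"
    mb = n
    sub(/MB$/, "", mb)                    # "512MB" -> 512
    if (n ~ /GB$/) { sub(/GB$/, "", mb); mb = mb * 1024 }   # "8GB" -> 8192
    printf "%-10s %8d s %10.2f MB/s\n", $1, t, (t > 0 ? mb / t : 0)
}'

# timings.txt is an assumed name for the log written by your timing script.
[ -f timings.txt ] && awk "$summarize" timings.txt
```

A throughput column like this makes one common observation easy to spot: small inputs are dominated by fixed job-startup overhead, so MB/s tends to rise as the input grows.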
3. (XX points) Record all of your observations regarding the data and draw conclusions that your findings
support. Be clear and descriptive.
Submission
You need to submit a ReadMe (including your script and command-line output) as well as a detailed lab report (including your collected data and conclusions). You also need to explain any observations that are interesting or surprising.