While data replication is widely used in clusters to provide fault tolerance,
it can heavily stress communication networks and degrade overall
performance of parallel applications. This degradation is particularly
unacceptable for disk-write-intensive applications. As a
result, data duplication management for parallel applications running on
clusters is a significant and urgent challenge. This paper presents the
design, implementation, and evaluation of a network-aware task
duplication management system called TUFF, in which redundant data is
regenerated by the corresponding duplicate tasks rather than replicated
directly over the network. In addition, TUFF improves the availability
of parallel applications, because it allows two replicas of each
I/O-intensive task to execute on two different nodes. We have
implemented and evaluated TUFF using extensive
simulations under a diverse set of workload conditions. Experimental
results show that TUFF improves the overall performance of parallel
applications running on clusters by efficiently reducing network
resource consumption.