Computer Science and Engineering, Department of


Date of this Version



2013 International Conference on Parallel and Distributed Computing, Applications and Technologies, Pages: 141 - 146, DOI: 10.1109/PDCAT.2013.29


Copyright © 2013 IEEE. Used by permission.


Energy efficiency is now an important metric for evaluating computing systems. However, saving energy is challenging because of many constraints. For example, Hadoop, one of the most popular distributed processing frameworks, randomly distributes three replicas of each data block to improve performance and fault tolerance. This placement limits the number of machines that can be turned off to save energy without affecting data availability. To overcome this limitation, previous research introduced a mechanism called the covering subset, which maintains a set of active nodes that ensures the immediate availability of data even when all other nodes are turned off. The covering-subset mechanism works smoothly as long as no failure occurs. However, a node in the covering subset may fail.
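The covering-subset property can be illustrated with a minimal sketch: a set of active nodes covers the data if every block keeps at least one replica on an active node. The placement map, node names, and function names below are hypothetical and chosen only for illustration; they are not taken from the paper or from Hadoop's API.

```python
from typing import Dict, List, Set


def uncovered_blocks(block_locations: Dict[str, Set[str]],
                     active_nodes: Set[str]) -> List[str]:
    """Return the blocks that have no replica on any active node."""
    return sorted(
        block
        for block, replicas in block_locations.items()
        if not (replicas & active_nodes)
    )


def is_covering_subset(block_locations: Dict[str, Set[str]],
                       active_nodes: Set[str]) -> bool:
    """True if every block has at least one replica on an active node."""
    return not uncovered_blocks(block_locations, active_nodes)


# Hypothetical placement: 4 blocks, 3 replicas each, spread over 5 nodes.
placement = {
    "b1": {"n1", "n2", "n3"},
    "b2": {"n1", "n4", "n5"},
    "b3": {"n2", "n3", "n4"},
    "b4": {"n1", "n3", "n5"},
}

# With n1 and n3 active, every block is reachable; n2, n4, n5 could sleep.
covering = {"n1", "n3"}
print(is_covering_subset(placement, covering))

# If n1 fails, the remaining active set no longer covers block b2.
after_failure = covering - {"n1"}
print(uncovered_blocks(placement, after_failure))
```

In this toy placement, losing one covering node leaves a block with no active replica, which is the situation failure recovery has to handle, for example by waking a sleeping node that holds that block.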

In this paper, we study energy-efficient failure recovery in Hadoop clusters. Rather than relying only on replication, which Hadoop uses by default, we investigate both replication and erasure coding as possible redundancy mechanisms. We develop failure recovery algorithms for both systems and analytically compare their energy efficiency.
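As a rough illustration of why the two redundancy mechanisms can differ in recovery cost, the sketch below compares r-way replication with a generic (n, k) erasure code. It assumes a maximum distance separable code such as Reed-Solomon, where any k of the n coded blocks suffice to rebuild a lost one. The function names and the (9, 6) parameters are hypothetical, and this is not the paper's algorithm or its energy model.

```python
def replication_cost(replicas: int) -> dict:
    """Basic properties of r-way replication for a single data block."""
    return {
        "storage_overhead": float(replicas),  # stored bytes per byte of data
        "blocks_read_to_repair": 1,           # copy any one surviving replica
        "failures_tolerated": replicas - 1,
    }


def erasure_cost(n: int, k: int) -> dict:
    """Basic properties of an (n, k) MDS erasure code (assumed, e.g. Reed-Solomon)."""
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    return {
        "storage_overhead": n / k,            # n coded blocks for k data blocks
        "blocks_read_to_repair": k,           # any k surviving blocks rebuild one
        "failures_tolerated": n - k,
    }


# Hadoop's default 3-way replication versus a hypothetical (9, 6) code.
print(replication_cost(3))
print(erasure_cost(9, 6))
```

Under these assumptions the erasure code stores less data for a comparable failure tolerance, but repairing one lost block reads from more nodes, so more nodes may need to be active during recovery. How that trade-off translates into energy depends on the cluster and is what an analytical comparison would have to quantify.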