Computer Science and Engineering, Department of


Date of this Version



Weiyue Xu, Energy-efficient Failure Recovery in Hadoop Cluster, MS thesis, University of Nebraska-Lincoln, April 2013.


A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professor Ying Lu. Lincoln, Nebraska: May, 2013

Copyright (c) 2013 Weiyue Xu


According to U.S. Environmental Protection Agency estimates, billions of dollars are spent each year on electricity for data centers in the U.S. alone, and the cost continues to rise rapidly. Energy efficiency is now an important metric for evaluating a computing system. However, saving energy is challenging because of many constraints. For example, Hadoop, one of the most popular distributed processing frameworks, randomly distributes three replicas of each data block to improve performance and fault tolerance, but this mechanism limits the number of machines that can be turned off to save energy without affecting data availability. To overcome this limitation, previous research introduced a mechanism called the covering subset, which maintains a set of active nodes that ensures the immediate availability of data even when all nodes outside the covering subset are turned off. This covering subset based mechanism works smoothly as long as no failure occurs. However, a node in the covering subset may fail. In this thesis, we study energy-efficient failure recovery in Hadoop clusters where nodes are grouped into covering and non-covering subsets. Rather than relying only on replication, which Hadoop adopts by default, we study both replication and erasure coding as possible redundancy mechanisms. We first present a replication based greedy failure recovery algorithm and then introduce an erasure coding based greedy failure recovery algorithm. We also develop a recovery aware data placement strategy to further improve energy efficiency in failure recovery.
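The covering subset idea can be illustrated with a small sketch. The code below is not the algorithm from the thesis or from the prior work it cites. It assumes, purely for illustration, that a covering subset is chosen by a greedy set-cover heuristic over a known block-to-node placement. The function name `greedy_covering_subset`, the node and block labels, and the sample placement are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_covering_subset(placement: Dict[str, Set[str]]) -> List[str]:
    """Pick nodes so that every block has at least one replica on a chosen node.

    `placement` maps a block id to the set of nodes holding a replica of it.
    This is a generic greedy set-cover heuristic (an assumption made for
    illustration), not the selection method used in the thesis.
    """
    # Invert the placement: node -> blocks stored on that node.
    node_blocks: Dict[str, Set[str]] = {}
    for block, nodes in placement.items():
        for node in nodes:
            node_blocks.setdefault(node, set()).add(block)

    uncovered = set(placement)
    chosen: List[str] = []
    while uncovered:
        # Choose the node that covers the most still-uncovered blocks.
        # Sorting the candidates makes ties resolve by node name.
        best = max(
            sorted(node_blocks),
            key=lambda n: len(node_blocks[n] & uncovered),
            default=None,
        )
        gained = node_blocks[best] & uncovered if best is not None else set()
        if not gained:
            raise ValueError("some block has no replica on any node")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical placement: four blocks, three replicas each, six nodes.
example_placement = {
    "b1": {"n1", "n2", "n3"},
    "b2": {"n2", "n3", "n4"},
    "b3": {"n1", "n4", "n5"},
    "b4": {"n2", "n5", "n6"},
}
covering = greedy_covering_subset(example_placement)
```

In this hypothetical placement, two of the six nodes are enough to keep one replica of every block online, so the other four could be powered down. If either chosen node fails, at least one block loses its only active replica, which is the failure recovery situation the thesis addresses.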

To evaluate the algorithms, we simulate node failure recovery in clusters of different sizes, construct an energy model, and analyze the energy consumed during the failure recovery process. The simulation results show that the erasure coding based failure recovery algorithm often outperforms the replication based approach. On average, the former requires 60% of the energy of the latter, and the energy saving increases with cluster size. In addition, with our recovery aware data placement strategy, the energy consumption of both approaches can be further reduced.

Advisor: Ying Lu