Date of this Version
MapReduce is a framework for processing huge amounts of data in a distributed environment and Hadoop is Apache’s open source implementation of MapReduce, which is widely used. However, Hadoop’s performance is currently limited by its default task scheduler, which assumes that cluster nodes are homogeneous when estimating the task progress and choosing slow tasks for re-execution. In practice, the homogeneity assumption does not always hold. Longest Approximate Time to End (LATE) is a scheduling algorithm that takes heterogeneity into account. It, however, still depends on a static method to estimate the task execution time. As a result, neither Hadoop default nor LATE schedulers perform very well. Self-adaptive MapReduce Scheduling Algorithm (SAMR) is more advantageous than LATE. It uses historical information on each node to adjust the stage weights of map and reduce tasks when estimating the task execution time. The limitation of SAMR is that it only considers node types but not any other factors, such as job types and dataset sizes, which also affect stage weights. In this thesis, we propose ESAMR: an Enhanced Self-Adaptive MapReduce scheduling algorithm to improve the speculative re-execution of slow tasks in MapReduce. In ESAMR, in order to identify slow tasks accurately, we differentiate historical stage weights information on each node and divide them into K clusters using a K-means clustering algorithm; and when executing a job’s tasks on a node, ESAMR classifies the tasks into one of the clusters and uses the cluster’s weights to estimate the execution time of the job’s tasks on the node. Experimental results show that among the aforementioned algorithms, ESAMR leads to the smallest error in task execution time estimation and identifies slow tasks most accurately.
Adviser: Ying Lu