Computer Science and Engineering, Department of


First Advisor

Ying Lu

Date of this Version

Summer 6-10-2019

Document Type



Suman, Sai. "Scheduling And Prefetching In Hadoop With Block Access Pattern Awareness And Global Memory Sharing With Load Balancing Scheme." Master's thesis, University of Nebraska-Lincoln, 2019.


A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfilment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professor Ying Lu. Lincoln, Nebraska: March, 2019

Copyright 2019 Sai Suman


Although several scheduling and prefetching algorithms have been proposed to improve data locality in Hadoop, little research has targeted data locality to improve cluster performance while jointly considering 1) cluster memory, 2) data access patterns, and 3) real-time scheduling issues.

First, considering data access patterns is crucial because a computation might access some portions of the data in the cluster only once while accessing the rest multiple times. Blindly retaining data in memory can therefore lead to inefficient memory utilization.

Second, several studies have found that cluster memory is heavily underutilized, leaving much room that can be leveraged to store input data for future tasks. Exploiting this underutilization is important because cluster nodes are usually equipped with large amounts of memory.

Third, enabling a prefetching mechanism to retain popular blocks in memory could eventually lead to memory shortage. We therefore present two cache eviction algorithms that evict data that will not be accessed frequently. Furthermore, the caching mechanism could lead to unbalanced memory utilization across a cluster’s nodes, so we present a mechanism for balancing memory load across the cluster such that memory utilization is uniform on all nodes and no node’s memory is overutilized due to prefetching.

Keeping the above issues in mind, this thesis presents a scheduling and data prefetching framework for Hadoop that leverages data access patterns and memory underutilization. The framework has been implemented and fully integrated into the Hadoop 2.8 ecosystem. We evaluate it on the standard WordCount benchmark in two modes: pseudo-distributed mode and fully distributed mode with 5 nodes. Our experiments show that the framework achieves improved job completion times, higher memory utilization, higher locality in task placement, and better overall system performance.

Adviser: Ying Lu