Date of this Version
The Hadoop Distributed File System (HDFS) is the distributed storage infrastructure for the Hadoop big-data analytics ecosystem. A single node, called the NameNode, stores the metadata of the entire file system and coordinates file content placement and retrieval across the data storage nodes, called DataNodes. However, the single-NameNode architecture has long been viewed as the Achilles' heel of HDFS: it is a single point of failure, and it also limits the scalability of the storage tier in the system stack. As Hadoop is deployed at increasing scale, this concern has become more prominent. Various solutions have been proposed to address the issue, but they focus primarily on improving availability and pay little or no attention to the equally important issue of scalability. In this thesis, we first present a brief study of the state-of-the-art solutions to the problem, assessing proposals from both industry and academia. Based on our observation that most metadata operations in Hadoop workloads access files directly rather than exploiting locality, we argue that HDFS should have a flat namespace instead of the hierarchical one used in traditional POSIX-based file systems. We propose a novel distributed NameNode architecture based on the flat namespace that improves both the availability and the scalability of HDFS. It uses the well-established hash-based namespace partitioning approach, which most existing solutions avoid because it gives up the hierarchical structure of the namespace. We also evaluate the enhanced architecture on a Hadoop cluster, applying both a metadata micro-benchmark and the standard Hadoop macro-benchmark.
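As a rough illustration of hash-based partitioning over a flat namespace, the following minimal Python sketch maps each full path to one of several metadata partitions by hashing the whole path string. It is not the architecture proposed in the thesis: the names `namenode_for` and `FlatNamespace`, the choice of SHA-256, and the simple modulo placement are all assumptions made for the example.

```python
import hashlib


def namenode_for(path: str, num_namenodes: int) -> int:
    """Pick a partition index by hashing the full path (hypothetical scheme)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_namenodes


class FlatNamespace:
    """Toy in-memory model: one dict per NameNode, keyed by full path."""

    def __init__(self, num_namenodes: int):
        self.partitions = [dict() for _ in range(num_namenodes)]

    def put(self, path: str, metadata: dict) -> None:
        idx = namenode_for(path, len(self.partitions))
        self.partitions[idx][path] = metadata

    def get(self, path: str):
        # Direct access: one hash, one partition, no directory traversal.
        idx = namenode_for(path, len(self.partitions))
        return self.partitions[idx].get(path)

    def list_prefix(self, prefix: str) -> list:
        # Without a hierarchy, a directory-style listing must ask every partition.
        return sorted(
            p for part in self.partitions for p in part if p.startswith(prefix)
        )


ns = FlatNamespace(num_namenodes=4)
ns.put("/logs/2013/part-00000", {"blocks": 3})
ns.put("/logs/2013/part-00001", {"blocks": 2})
print(ns.get("/logs/2013/part-00000"))
print(ns.list_prefix("/logs/2013/"))
```

The sketch shows the trade-off the abstract refers to: a lookup by full path touches exactly one partition, while an operation that depends on the hierarchy, such as listing a directory, has to consult all of them.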
Adviser: Hong Jiang