Computer Science and Engineering, Department of

 

Date of this Version

Spring 5-1-2014

Document Type

Article

Comments

A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professor Hong Jiang. Lincoln, Nebraska: April, 2014

Copyright (c) 2014 Ziling Huang

Abstract

The Hadoop Distributed File System (HDFS) is the distributed storage infrastructure for the Hadoop big-data analytics ecosystem. A single node, called the NameNode, stores the metadata of the entire file system and coordinates the file content placement and retrieval actions of the data storage subsystems, called DataNodes. However, the single-NameNode architecture has long been viewed as the Achilles' heel of HDFS, as it not only represents a single point of failure but also limits the scalability of the storage tier in the system stack. Since Hadoop is now being deployed at increasing scale, this concern has become more prominent. Various solutions have been proposed to address this issue, but they focus primarily on improving availability and pay little or no attention to the equally important issue of scalability. In this thesis, we first present a brief study of the state-of-the-art solutions to the problem, assessing proposals from both industry and academia. Based on our observation that most metadata operations in Hadoop workloads tend to be direct accesses rather than accesses that exploit locality, we argue that HDFS should have a flat namespace instead of the hierarchical one used in traditional POSIX-based file systems. We propose a novel distributed NameNode architecture based on the flat namespace that improves both the availability and the scalability of HDFS. It uses the well-established hash-based namespace partitioning approach, which most existing solutions avoid because it gives up the hierarchical structure of the namespace. We also evaluate the enhanced architecture on a Hadoop cluster, applying both a micro metadata benchmark and the standard Hadoop macro benchmark.
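
As a rough illustration of hash-based namespace partitioning over a flat namespace, the sketch below maps each full file path to one of several NameNodes by hashing the whole path. It is not the thesis's implementation: the function names `namenode_for` and `partition`, the choice of SHA-1, and the simple modulo placement are all assumptions made for this example.

```python
import hashlib


def namenode_for(path: str, num_namenodes: int) -> int:
    """Return the index of the (hypothetical) NameNode that owns `path`.

    The full path is treated as an opaque key in a flat namespace, so
    files in the same directory may land on different NameNodes.
    """
    if num_namenodes <= 0:
        raise ValueError("num_namenodes must be positive")
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer key.
    return int.from_bytes(digest[:8], "big") % num_namenodes


def partition(paths, num_namenodes: int):
    """Group paths by owning NameNode index (illustrative helper)."""
    shards = {i: [] for i in range(num_namenodes)}
    for p in paths:
        shards[namenode_for(p, num_namenodes)].append(p)
    return shards


if __name__ == "__main__":
    sample = [f"/user/logs/part-{i:05d}" for i in range(8)]
    for node, owned in partition(sample, 4).items():
        print(node, owned)
```

A direct lookup of a known path needs only one hash computation to find the owning node, which fits the direct-access pattern described above. The trade-off is that an operation over a whole directory, such as a listing, can no longer be served by a single node, and plain modulo placement would remap most paths if the number of NameNodes changed.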

Adviser: Hong Jiang
