Computing, School of

First Advisor

Hong Jiang

Date of this Version

4-2012

Document Type

Thesis

Comments

A thesis presented to the faculty of the Graduate College at the University of Nebraska in partial fulfillment of requirements for the degree of Master of Science

Major: Computer Science

Under the supervision of Professor Hong Jiang. Lincoln, Nebraska, April 2012

Abstract

The benefits provided by cloud computing and the space savings offered by data deduplication make it attractive to host data storage services such as backup in the cloud. Data deduplication compares the fingerprints of data chunks, which are stored in the chunk index, to identify and remove redundant data, with the ultimate goal of saving storage space and network bandwidth.
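As a rough illustration of fingerprint-based deduplication, the sketch below splits a byte stream into chunks, fingerprints each chunk, and stores only chunks whose fingerprint is not yet in the chunk index. The fixed 4 KiB chunk size, the use of SHA-1, the in-memory dictionary, and the names `deduplicate` and `restore` are assumptions made for this example; the abstract does not specify them.

```python
import hashlib


def deduplicate(data: bytes, chunk_index: dict, chunk_size: int = 4096):
    """Back up `data`, storing only chunks not already in `chunk_index`.

    Returns the file recipe (ordered list of fingerprints) and the number
    of bytes that were actually new. Chunk size and SHA-1 are assumptions.
    """
    recipe = []
    new_bytes = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint not in chunk_index:
            # First time this content is seen: store it.
            chunk_index[fingerprint] = chunk
            new_bytes += len(chunk)
        recipe.append(fingerprint)
    return recipe, new_bytes


def restore(recipe, chunk_index) -> bytes:
    """Rebuild the original data from its recipe."""
    return b"".join(chunk_index[fp] for fp in recipe)
```

Every chunk costs one index lookup, which is why the index can limit backup throughput once it no longer fits in fast memory.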

However, the chunk index is a bottleneck for backup throughput. Although several solutions for improving deduplication throughput have been proposed, the chunk index remains a centralized resource that limits the scalability of both storage capacity and backup throughput in public cloud environments. To address this challenge, we propose the Scalable Hybrid Hash Cluster (SHHC), which hosts a low-latency distributed hash table for storing fingerprints. SHHC is a cluster of nodes designed to scale and to handle numerous concurrent backup requests while maintaining high fingerprint lookup throughput. Each node in the cluster features hybrid memory consisting of DRAM and Solid State Drives (SSDs), presenting a large usable memory for storing the chunk index. Our evaluation with real-world workloads shows that SHHC scales consistently: throughput increases almost linearly with the number of nodes.
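The idea of a distributed fingerprint table with hybrid memory on each node can be sketched as follows. This is not the SHHC implementation: the modulo routing rule, the LRU policy for the DRAM tier, the plain dictionary standing in for the SSD tier, and the class names `HybridNode` and `HashCluster` are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict


class HybridNode:
    """One cluster node: a small LRU cache (DRAM stand-in) in front of a
    larger table (SSD stand-in). Both tiers are hypothetical."""

    def __init__(self, ram_capacity: int):
        self.ram = OrderedDict()
        self.ssd = {}
        self.ram_capacity = ram_capacity

    def lookup_or_insert(self, fp: bytes) -> bool:
        """Return True if `fp` was already known; otherwise record it."""
        if fp in self.ram:
            self.ram.move_to_end(fp)  # refresh LRU position
            return True
        found = fp in self.ssd
        if not found:
            self.ssd[fp] = True
        self._cache(fp)
        return found

    def _cache(self, fp: bytes) -> None:
        self.ram[fp] = True
        self.ram.move_to_end(fp)
        if len(self.ram) > self.ram_capacity:
            self.ram.popitem(last=False)  # evict least recently used


class HashCluster:
    """Routes each fingerprint to exactly one node (assumed modulo rule)."""

    def __init__(self, num_nodes: int, ram_capacity: int = 1024):
        self.nodes = [HybridNode(ram_capacity) for _ in range(num_nodes)]

    def node_for(self, fp: bytes) -> HybridNode:
        return self.nodes[int.from_bytes(fp[:8], "big") % len(self.nodes)]

    def is_duplicate(self, fp: bytes) -> bool:
        return self.node_for(fp).lookup_or_insert(fp)
```

Because each fingerprint maps to a single node, lookups for different fingerprints can proceed on different nodes in parallel, which is the property that lets throughput grow as nodes are added.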

Restore performance over relatively low-bandwidth wide area network (WAN) links is another drawback of cloud backup services. High-speed network connectivity is either too expensive for most organizations or reserved for special applications. Removing redundant data before transmission over the WAN is a viable way to improve network throughput during the restore operation. To that end, we propose Application-Aware Phased Restore (AAPR), a simple restore solution for deduplication-based cloud backup clients. AAPR improves restore time by removing redundant data before transmitting over the WAN. Furthermore, we exploit application awareness to restore critical data first and thus improve the recovery time. Our evaluations show that, for workloads with high redundancy, AAPR reduces restore time by over 85%.
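The two ideas in this paragraph, sending each unique chunk only once and restoring critical data first, can be sketched as a restore planner. This is an illustrative sketch rather than the AAPR design: the function name `plan_restore`, the `(name, priority, recipe)` tuple layout, the convention that a lower number means more critical, and the assumption that the client already knows which chunks it holds locally are all hypothetical.

```python
def plan_restore(files, local_chunks):
    """Decide which chunks to fetch over the WAN, and in what order.

    files: list of (name, priority, recipe), where recipe is the ordered
           list of chunk fingerprints and a lower priority is more critical.
    local_chunks: fingerprints already available on the client.
    Returns a list of (name, fingerprints_to_fetch) in restore order.
    """
    have = set(local_chunks)
    plan = []
    for name, _priority, recipe in sorted(files, key=lambda f: f[1]):
        to_fetch = []
        for fp in recipe:
            if fp not in have:
                # Transfer each unique chunk at most once.
                to_fetch.append(fp)
                have.add(fp)
        plan.append((name, to_fetch))
    return plan
```

For example, with `files = [("video.bin", 2, ["a", "b", "c"]), ("db.sqlite", 1, ["a", "d", "d"])]` and chunk `"c"` already on the client, the plan restores `db.sqlite` first and fetches only three chunks in total instead of six.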

Advisor: Hong Jiang

Included in

Computer Engineering Commons, Computer Sciences Commons

School of Computing: Dissertations, Theses, and Student Research

Improving Backup and Restore Performance for Deduplication-based Cloud Backup Services
