Computing, School of
School of Computing: Technical Reports
Date of this Version
Summer 5-30-2012
Document Type
Article
Citation
UNL CSE Technical Report TR-UNL-CSE-2012-0005.pdf
Abstract
Cluster deduplication has become a widely deployed technology in data protection services for Big Data to satisfy the requirements of service level agreement (SLA). However, it remains a great challenge for cluster deduplication to strike a sensible tradeoff between the conflicting goals of scalable deduplication throughput and high duplicate elimination ratio in cluster systems with low-end individual secondary storage nodes. We propose Σ-Dedupe, a scalable inline cluster deduplication framework, as a middleware deployable in cloud data centers, to meet this challenge by exploiting data similarity and locality to optimize cluster deduplication in inter-node and intra-node scenarios, respectively. Governed by a similarity-based stateful data routing scheme, Σ-Dedupe assigns similar data to the same backup server at the super-chunk granularity using a handprinting technique to maintain high cluster-deduplication efficiency without cross-node deduplication, and balances the workload of servers from backup clients. Meanwhile, Σ-Dedupe builds a similarity index over the traditional locality-preserved caching design to alleviate the chunk index-lookup bottleneck in each node. Extensive evaluation of our Σ-Dedupe prototype against state-of-the-art schemes, driven by real-world datasets, demonstrates that Σ-Dedupe achieves a cluster-wide duplicate elimination ratio almost as high as the high-overhead and poorly scalable traditional stateful routing scheme but at an overhead only slightly higher than that of the scalable but low duplicate-elimination-ratio stateless routing approaches.
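The routing idea summarized in the abstract can be illustrated with a minimal sketch. It is not the report's implementation, and several details are assumptions made only for illustration: fixed-size chunking, SHA-1 chunk fingerprints, a handprint defined as the k smallest fingerprints of a super-chunk, and a score that divides handprint overlap by relative storage usage. The names `Node`, `handprint`, `route`, and `backup` are hypothetical.

```python
import hashlib


def chunk_fingerprints(data, chunk_size=4096):
    """Fingerprint each fixed-size chunk (assumption: real systems often
    use content-defined chunking instead)."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def handprint(fingerprints, k=8):
    """Assumed handprint: the k smallest distinct chunk fingerprints."""
    return set(sorted(set(fingerprints))[:k])


class Node:
    """Hypothetical backup node with a similarity index and a chunk store."""

    def __init__(self, name):
        self.name = name
        self.similarity_index = set()  # representative fingerprints seen so far
        self.chunk_store = set()       # fingerprints of all stored chunks

    def store(self, fingerprints, hp):
        """Deduplicate locally; return the number of newly stored chunks."""
        new_chunks = set(fingerprints) - self.chunk_store
        self.chunk_store |= new_chunks
        self.similarity_index |= hp
        return len(new_chunks)


def route(hp, nodes):
    """Pick the node whose similarity index overlaps the handprint most,
    discounted by relative storage usage; ties go to the emptier node."""
    avg = sum(len(n.chunk_store) for n in nodes) / len(nodes)

    def key(node):
        overlap = len(hp & node.similarity_index)
        usage = (len(node.chunk_store) + 1) / (avg + 1)
        return (overlap / usage, -len(node.chunk_store))

    return max(nodes, key=key)


def backup(super_chunk, nodes):
    """Route one super-chunk and store it; return (node name, new chunks)."""
    fps = chunk_fingerprints(super_chunk)
    hp = handprint(fps)
    target = route(hp, nodes)
    return target.name, target.store(fps, hp)
```

In this sketch only the small handprint is compared against each node's similarity index when routing, so the full chunk index is consulted only on the node that receives the super-chunk. A super-chunk that shares most of its chunks with earlier data is therefore sent to the node already holding that data, while unrelated data drifts toward less-loaded nodes.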
Included in
Computer and Systems Architecture Commons, Computer Sciences Commons, Data Storage Systems Commons