Cluster deduplication has become a widely deployed technology in data protection services for Big Data to satisfy the requirements of service-level agreements (SLAs). However, it remains a great challenge for cluster deduplication to strike a sensible tradeoff between the conflicting goals of scalable deduplication throughput and high duplicate elimination ratio in cluster systems with low-end individual secondary storage nodes. We propose Σ-Dedupe, a scalable inline cluster deduplication framework, as a middleware deployable in cloud data centers, to meet this challenge by exploiting data similarity and locality to optimize cluster deduplication in inter-node and intra-node scenarios, respectively. Governed by a similarity-based stateful data routing scheme, Σ-Dedupe assigns similar data to the same backup server at the super-chunk granularity using a handprinting technique, which maintains high cluster-deduplication efficiency without cross-node deduplication, and balances the workload that backup clients place on the servers. Meanwhile, Σ-Dedupe builds a similarity index over the traditional locality-preserved caching design to alleviate the chunk index-lookup bottleneck in each node. Extensive evaluation of our Σ-Dedupe prototype against state-of-the-art schemes, driven by real-world datasets, demonstrates that Σ-Dedupe achieves a cluster-wide duplicate elimination ratio almost as high as that of the high-overhead and poorly scalable traditional stateful routing scheme, but at an overhead only slightly higher than that of the scalable but low duplicate-elimination-ratio stateless routing approaches.
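The routing idea summarized above can be illustrated with a minimal sketch. It is not the Σ-Dedupe implementation, and the abstract does not specify these details, so the following are assumptions made for illustration only: a handprint is taken to be the k smallest chunk fingerprints of a super-chunk, candidate nodes are picked by taking each handprint fingerprint modulo the node count, and the winner is the candidate with the largest handprint overlap discounted by its relative storage usage. The names `Node`, `handprint`, and `route` are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> int:
    """Chunk fingerprint as an integer (SHA-1 is an assumed choice)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def handprint(chunks, k=4):
    """Assumed handprint: the k smallest distinct chunk fingerprints."""
    return sorted({fingerprint(c) for c in chunks})[:k]


class Node:
    """Hypothetical backup node holding two in-memory indexes."""

    def __init__(self):
        self.similarity_index = set()  # representative fingerprints seen
        self.chunk_index = set()       # fingerprints of all stored chunks

    def store(self, chunks, hp):
        # Intra-node deduplication: a set keeps only unique fingerprints.
        self.chunk_index.update(fingerprint(c) for c in chunks)
        self.similarity_index.update(hp)


def route(nodes, chunks, k=4):
    """Send one super-chunk to a node and return that node's position."""
    hp = handprint(chunks, k)
    # Only a handful of candidate nodes are consulted, not the whole cluster.
    candidates = sorted({fp % len(nodes) for fp in hp})
    avg = sum(len(n.chunk_index) for n in nodes) / len(nodes)

    def score(i):
        node = nodes[i]
        overlap = sum(fp in node.similarity_index for fp in hp)
        usage = len(node.chunk_index)
        relative = usage / avg if avg > 0 else 1.0
        # Discount similarity by relative usage so full nodes attract less.
        weighted = overlap / relative if relative > 0 else 0.0
        # Ties go to the emptier node, then to the lower position.
        return (weighted, -usage, -i)

    target = max(candidates, key=score)
    nodes[target].store(chunks, hp)
    return target


if __name__ == "__main__":
    cluster = [Node() for _ in range(4)]
    super_chunk = [bytes([i]) * 64 for i in range(8)]
    first = route(cluster, super_chunk)
    second = route(cluster, super_chunk)
    print(first == second, len(cluster[first].chunk_index))
```

In this sketch, routing the same super-chunk twice selects the same node, so the second copy is eliminated by that node's local chunk index and no cross-node lookup is needed. Consulting at most k candidate nodes per super-chunk is what keeps the routing overhead bounded as the cluster grows.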