Computer Science and Engineering, Department of


Document Type


Date of this Version



Published in Journal of Parallel and Distributed Computing 66:2 (February 2006), pp. 291–306; doi: 10.1016/j.jpdc.2005.10.008 Copyright © 2005 Elsevier Inc. Used by permission.


The vulnerability of computer nodes due to component failures is a critical issue for cluster-based file systems. This paper studies the development and deployment of mirroring in cluster-based parallel virtual file systems to provide fault tolerance and analyzes the tradeoffs between the performance and the reliability in the mirroring scheme. It presents the design and implementation of CEFT, a scalable RAID-10 style file system based on PVFS, and proposes four novel mirroring protocols depending on whether the mirroring operations are server-driven or client-driven, whether they are asynchronous or synchronous. The comparisons of their write performances, measured in a real cluster, and their reliability and availability, obtained through analytical modeling, show that these protocols strike different tradeoffs between the reliability and performance. Protocols with higher peak write performance are less reliable than those with lower peak write performance, and vice versa. A hybrid protocol is proposed to optimize this tradeoff.