Design of a Robust Scientific Grid System for User-Defined Task and Data Availability

Zhe Zhang, University of Nebraska - Lincoln


Failure is inevitable in scientific computing. As scientific applications and facilities have grown in scale over the last decades, investigating the root cause of a failure has become challenging. Scientific computing customers differ both in their availability demands and in their willingness to pay for availability. In contrast to existing solutions that try to provide ever higher availability, we look at the problem from a different angle. By analyzing real-time tracing logs from scientific grids, we discovered failure behaviors that are common in, yet unique to, scientific grids. With knowledge of these failures, scientific applications can be accommodated more efficiently than with traditional approaches. This dissertation addresses this opportunity and targets two types of availability in scientific computing: task availability and data availability. With regard to user-defined task availability, existing solutions adaptively vary the number of replicas to meet the user-defined availability, where a replica is a redundant copy of the original task. However, they ignore failure behaviors intrinsic to scientific grids, such as preemption and temporal and spatial locality. This dissertation takes a statistical approach to estimating failures in modern scientific grids and proposes novel selection algorithms that replicate tasks to adequate resource sites. Compared with traditional task replication, our replication algorithms can meet all user-defined availability requirements. The second target is user-defined data availability. We design a Robust Intermediate Storage System (RISS), in which we implement a few techniques that expose tunable parameters for user data. These parameters allow the system to achieve different design tradeoffs among storage overhead, network bandwidth, and data availability.
The overarching idea of the dissertation is to abstract the process of running scientific applications on infrastructures. On the application side, a system can accept multiple levels of availability from users; on the infrastructure side, a system can encounter various types of failures. Our goal is to design negotiating algorithms between the two sides that take full advantage of computational resources while meeting user-defined availability. We also design and build systems that employ the proposed techniques. These systems exhibit benefits over existing replication algorithms.
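
The replica-count idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the dissertation's algorithm. It assumes site failures are independent, which is exactly the simplification the abstract says breaks down under preemption and temporal and spatial locality. The function names, the site names, and the availability figures are hypothetical and chosen only for illustration.

```python
def replicas_needed(site_availability: float, target: float) -> int:
    """Smallest replica count n such that 1 - (1 - a)^n >= target.

    Assumption: every replica runs on a site with the same availability a,
    and replicas fail independently of one another.
    """
    if not 0.0 < site_availability < 1.0:
        raise ValueError("site_availability must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    n = 1
    p_all_fail = 1.0 - site_availability
    # Add replicas until the chance that at least one survives meets the target.
    while 1.0 - p_all_fail < target:
        n += 1
        p_all_fail *= 1.0 - site_availability
    return n


def select_sites(site_availability: dict, target: float) -> list:
    """Greedily pick the most available sites until the target is met.

    Assumption: per-site availabilities are known and independent.
    Raises ValueError if all sites together still fall short of the target.
    """
    chosen = []
    p_all_fail = 1.0
    ranked = sorted(site_availability.items(), key=lambda kv: kv[1], reverse=True)
    for name, availability in ranked:
        chosen.append(name)
        p_all_fail *= 1.0 - availability
        if 1.0 - p_all_fail >= target:
            return chosen
    raise ValueError("target availability not reachable with the given sites")


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(replicas_needed(0.8, 0.99))
    print(select_sites({"site_a": 0.8, "site_b": 0.5, "site_c": 0.7}, 0.9))
```

A user who asks for higher availability receives more replicas, and a user who asks for less consumes fewer resources. A scheme that accounts for correlated failures would replace the independence assumption with failure estimates derived from grid logs.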

Subject Area

Computer science|Computer Engineering

Recommended Citation

Zhang, Zhe, "Design of a Robust Scientific Grid System for User-Defined Task and Data Availability" (2020). ETD collection for University of Nebraska - Lincoln. AAI28086724.