The R Journal

Date of this Version

6-2009

Document Type

Article

Citation

The R Journal (June 2009) 1(1)

Comments

Copyright 2009, The R Foundation. Open access material. License: CC BY 3.0 Unported

Abstract

Many statistical analysis tasks in areas such as bioinformatics are computationally very intensive, and many of them rely on embarrassingly parallel computations (Grama et al., 2003). Multiple computers, or even the multiple processor cores now common in standard desktop machines, can easily contribute to faster analyses.

R itself does not allow parallel execution. There are existing solutions for R that distribute calculations over many computers (a cluster), for example Rmpi, rpvm, snow, nws, or papply. However, these solutions require users to set up and manage the cluster on their own, and therefore demand deeper knowledge of cluster computing itself. In our experience this is a barrier for many R users, who primarily use R as a tool for statistical computing.

Parallel computing has several pitfalls. A single misbehaving program can easily affect an entire computing infrastructure, for example by allocating too many CPUs or too much RAM, leaving no resources for other users and their processes, or by degrading the performance of one or more individual machines. Another problem is keeping track of what is going on in the cluster, which can be difficult when a program fails for some reason on a slave node.

We developed the management tool sfCluster and the corresponding R package snowfall, which are designed to make parallel programming easier and more flexible. sfCluster completely hides the setup and handling of clusters from the user and monitors the execution of all parallel programs for problems affecting machines and the cluster. Together with snowfall it allows the use of parallel computing in R without further knowledge of cluster implementation and configuration.
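To illustrate the intended workflow, the following is a minimal sketch of a snowfall session based on the package's documented sfInit()/sfExport()/sfLapply()/sfStop() interface; the toy bootstrap-style computation and the choice of four CPUs are illustrative assumptions, not taken from the article. Under sfCluster, the session settings would normally be supplied by the management tool rather than hard-coded.

    library(snowfall)

    # Start a parallel session; here we request 4 CPUs explicitly
    # (an illustrative choice).
    sfInit(parallel = TRUE, cpus = 4)

    # An embarrassingly parallel toy task: means of many independent
    # resamples of a data vector.
    data <- rnorm(1000)
    resample_mean <- function(i) mean(sample(data, replace = TRUE))

    # Export the data to all slaves and run the iterations in parallel.
    sfExport("data")
    results <- sfLapply(1:500, resample_mean)

    # Shut the cluster down cleanly.
    sfStop()

    mean(unlist(results))

The same code runs sequentially if sfInit() is called with parallel = FALSE, which is one way snowfall keeps parallel programs usable without a cluster.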
