The progression of scientific data leads to an increasing demand for powerful high-performance computing (HPC) systems where data can be stored and analyzed. High-performance computing is used in several research areas to solve large problems. However, efficient use of the available supercomputing resources is a major concern for many researchers. Before submitting a job, the user has to choose the amount of resources needed for its execution, and over- or under-subscription of resources leads to low system utilization and/or low user satisfaction.
Bioinformatics is one such field that relies heavily on HPC. One of the fundamental problems in bioinformatics is genome assembly. Genome assembly tools generally require HPC systems with very large memories, so estimating the memory requirements of these tools before the assembly process begins is highly important.
We explore methods to predict the dynamic memory allocation of big data applications. Our approach is based on the idea that the memory required by a specific type of application (e.g., genome assembly) can be predicted by executing the application on small fractions of the dataset. To test this hypothesis, we carried out experiments in three stages. First, we compare two resource monitoring tools in order to study the effective memory consumption of applications. For this purpose, we analyzed the memory consumption of three de novo genome assembly applications (Velvet, Ray, and IDBA) on six datasets. Furthermore, we investigate whether running the assembly on a small fraction of the data (a process executable in a short time) can help us predict the memory and time resources required to assemble the full dataset. The results show that the memory usage reported by the different resource monitoring tools differs only slightly for our datasets. Moreover, different fractions of the data show similar patterns of memory usage.
In stage 2, to further test our hypothesis, we analyzed the memory consumption of three de novo genome assembly applications (Velvet, SPAdes, and SoapDeNovo) on four datasets. For each experiment we use three fractions of the dataset (10%, 20%, and 30%) and record the memory usage over time. From these measurements, we build a linear model that predicts the memory usage of the entire dataset from the dataset size. Using this model, we predict the memory usage of the full datasets with error rates of 2.582%, 19.29%, 9.62%, and 43.55% for Velvet; 4.506%, 27%, 55.11%, and 173% for SPAdes; and 3.353%, 68%, 32%, and 20.77% for SoapDeNovo on dataset1, dataset2, dataset3, and dataset4, respectively.
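The stage-2 extrapolation idea can be sketched as follows. This is a minimal illustration, not the thesis implementation: the fraction sizes and peak-memory values below are hypothetical, and the fit is an ordinary least-squares line of peak memory versus dataset size.

```python
import numpy as np

# Hypothetical measurements: sizes (GB) of the 10%, 20%, and 30% fractions
# of an assumed 100 GB dataset, and the peak memory observed for each run.
sizes_gb = np.array([10.0, 20.0, 30.0])
peak_mem_gb = np.array([4.1, 7.9, 12.2])

# Least-squares fit of a line: peak_mem = slope * size + intercept
slope, intercept = np.polyfit(sizes_gb, peak_mem_gb, 1)

# Extrapolate to the full (assumed) 100 GB dataset
full_size_gb = 100.0
predicted = slope * full_size_gb + intercept
print(f"Predicted peak memory for the full dataset: {predicted:.1f} GB")
```

The prediction error quoted in the abstract would then be the relative difference between this extrapolated value and the peak memory actually observed when assembling the full dataset.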
In stage 3, we analyzed the memory usage of three de novo genome assembly applications (Velvet, SoapDeNovo, and SPAdes) on three input datasets. For each experiment we use six fractions of the dataset (1%, 2%, 3%, 10%, 20%, and 30%) and record the memory usage over time. Next, we use seven machine learning techniques (k-nearest neighbor, linear regression, multivariate adaptive regression splines, neural networks, principal component regression, stepwise linear regression, and support vector machines) to build the prediction model. The linear regression techniques produce the most accurate models, with error rates of 2.582%, 19.29%, and 9.62% for Velvet; 3.353%, 68%, and 32% for SoapDeNovo; and 4.506%, 27%, and 68% for SPAdes on Dataset1, Dataset2, and Dataset3, respectively.
Considering that, for a given input dataset and application, the data fractions exhibit similar peak memory distributions, we believe that building a model for dynamic memory prediction and allocation may be realistic and feasible.
Advisor: Jitender S Deogun