Computing, School of

First Advisor

Jitender S. Deogun

Date of this Version

5-2022

Document Type

Thesis

Comments

A thesis presented to the faculty of the Graduate College at the University of Nebraska in partial fulfillment of requirements for the degree of Master of Science

Major: Computer Science

Under the supervision of Professor Jitender S. Deogun. Lincoln, Nebraska, May 2022

Abstract

Gene expression and transcriptome analysis are currently among the main research focuses of many scientists. However, assembling raw sequence data into a draft transcriptome of an organism is a complex multi-stage process, usually composed of pre-processing, assembly, and post-processing. Each of these stages includes multiple steps, such as data cleaning, error correction, and assembly validation. Different combinations of steps, as well as different computational methods for the same step, generate transcriptome assemblies of differing accuracy. Choosing a combination that yields more accurate assemblies is therefore crucial for novel biological discoveries. Implementing accurate transcriptome assembly requires extensive knowledge of the algorithms, bioinformatics tools, and software that can be used in an analysis pipeline. Many pipelines can be represented as automated, scalable scientific workflows that run in parallel on powerful distributed computational resources, such as campus clusters, grids, and clouds, which speeds up the analyses.

In this thesis, we 1) compared and optimized de novo transcriptome assembly pipelines for diploid wheat; 2) investigated the impact of a few key parameters on the accuracy of transcriptome assemblies, such as digital normalization and error correction methods, de novo assemblers, and k-mer length strategies; 3) built a distributed and scalable scientific workflow for blast2cap3, the protein-guided assembly step of the transcriptome assembly pipeline, using the Pegasus Workflow Management System (WMS); and 4) deployed and examined the scientific workflow for blast2cap3 on two different computational platforms.

Based on the analysis performed in this thesis, we conclude that the best transcriptome assembly is produced when the error correction method is used with Velvet Oases and the “multi-k” strategy. Moreover, the experiments show that the Pegasus WMS implementation of blast2cap3 reduces the running time by more than 95% compared to its current serial implementation. The results presented in this thesis provide valuable insight for designing a good de novo transcriptome assembly pipeline and show the importance of using scientific workflows for executing computationally demanding pipelines.
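
To put the running-time figure in perspective, the short sketch below converts a fractional reduction in running time into the equivalent speed-up factor. It assumes that "reduction" means the fraction of the serial wall-clock time that is saved; the function name is hypothetical and does not come from the thesis, and no timing data from the thesis is used.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional running-time reduction into a speed-up factor.

    Assumption (ours): reduction = 1 - T_parallel / T_serial,
    so the speed-up T_serial / T_parallel equals 1 / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # A 95% reduction is the lower bound reported in the abstract.
    print(f"{speedup_from_reduction(0.95):.1f}x")
```

Under that reading, a reduction of more than 95% means the workflow version runs more than roughly 20 times faster than the serial one.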

Advisor: Jitender S. Deogun

Included in

Computational Biology Commons, Computer Sciences Commons

School of Computing: Dissertations, Theses, and Student Research

Comparative Analyses of De Novo Transcriptome Assembly Pipelines for Diploid Wheat
