A Pipeline for Analyzing High-Dimensional NMR Metabolomics Data

Thao T Vu, University of Nebraska - Lincoln


Before the underlying biological processes can be interpreted, metabolomics data must pass through a pipeline of procedures: preprocessing, multivariate statistical modeling, and metabolite identification. However, undesirable perturbations from multiple sources, including variations in experimental factors, instrument instability, and inconsistencies in sample handling, are inherent to experimental data. Such non-biological variation, if not properly accounted for, may mask true signals and complicate subsequent analyses. Nine widely used normalization algorithms are evaluated using both simulated and experimental NMR data sets at different levels of added noise. Constant sum (CS) and probabilistic quotient (PQ) normalization are shown to be the top-performing algorithms, based on their ability to suppress non-biological perturbations while preserving true signals. Moreover, a maximum allowable level of noise is suggested to ensure the robustness, precision, and power of subsequent analyses. After preprocessing, multivariate statistical models are applied to the metabolomics data to extract information describing differences between experimental groups. Five commonly used classification models are assessed at different levels of group separation. All models perform equally well when the groups are clearly separated. The orthogonal projections to latent structures discriminant analysis (OPLS-DA) model is shown to perform best: it successfully identifies the true classifying features while maintaining reasonably high prediction accuracy, sensitivity, and specificity on both robust and relatively extreme data sets. In the last stage of every metabolomics study, it is critical to obtain a list of metabolites that can be used to interpret the underlying biological functions.
Manual assignment approaches are time-consuming, labor-intensive, and heavily reliant on the knowledge and judgment of NMR experts. Automating the metabolite identification process is therefore desirable. A new method is proposed that simultaneously addresses two major challenges: peak shifting errors and the sparsity of some metabolites in complex mixtures. Comparisons with existing automated methods show that the proposed method detects the highest number of true metabolites while keeping the cost of false identifications reasonably low. The method is demonstrated using simulated, experimental, and biological NMR mixtures.
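
For readers unfamiliar with the two normalization algorithms the abstract singles out, the following is a minimal sketch of constant sum (CS) and probabilistic quotient (PQ) normalization as they are commonly described in the metabolomics literature. It is not taken from the dissertation: the function names, the choice of the per-feature median spectrum as the PQ reference, and the use of plain Python lists are all assumptions made for illustration, and positive intensities are assumed throughout.

```python
from statistics import median


def constant_sum(spectrum, total=1.0):
    """Scale one spectrum so that its intensities sum to `total`."""
    s = sum(spectrum)
    if s == 0:
        raise ValueError("spectrum has zero total intensity")
    return [total * x / s for x in spectrum]


def probabilistic_quotient(spectra, reference=None):
    """PQ-normalize a list of spectra (one row per sample).

    Assumption: when no reference is given, the per-feature median of
    the CS-normalized spectra is used as the reference spectrum.
    """
    # Step 1: constant sum normalization of every sample.
    cs = [constant_sum(row) for row in spectra]

    # Step 2: build the reference spectrum (hypothetical default).
    if reference is None:
        reference = [median(col) for col in zip(*cs)]

    normalized = []
    for row in cs:
        # Step 3: feature-wise quotients against the reference,
        # skipping features where the reference is zero.
        quotients = [x / r for x, r in zip(row, reference) if r != 0]
        # Step 4: the median quotient is taken as the dilution factor.
        factor = median(quotients)
        normalized.append([x / factor for x in row])
    return normalized


# Three samples with the same profile at different dilutions.
samples = [[1, 2, 3, 4], [2, 4, 6, 8], [3, 6, 9, 12]]
for row in probabilistic_quotient(samples):
    print([round(x, 3) for x in row])
```

In this toy input the three samples differ only by a dilution factor, so both steps map them onto the same profile. PQ is usually preferred over CS alone when a few large peaks change between groups, because the median quotient is less sensitive to such peaks than the total intensity.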

Recommended Citation

Vu, Thao T, "A Pipeline for Analyzing High-Dimensional NMR Metabolomics Data" (2020). ETD collection for University of Nebraska - Lincoln. AAI28092081.