Graduate Studies


Comparing Machine Learning Techniques with State-of-the-Art Parametric Prediction Models for Predicting Soybean Traits

Susweta Ray, University of Nebraska-Lincoln

A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Statistics, Under the Supervision of Professor Reka Howard. Lincoln, Nebraska: December, 2021

Copyright © 2021 Susweta Ray


Soybean is a significant source of protein and oil and is also widely used as animal feed. Developing lines that are superior in yield, protein, and oil content is therefore important for feeding an ever-growing population. Phenotyping is expensive, and evaluating new lines in different environments (location-year combinations) can be costly, whereas genotyping is both cost- and time-efficient for breeders. Several genomic prediction (GP) methods have been developed to use marker and environment data effectively to predict yield or other relevant phenotypic traits of crops. Our study compares a conventional GP method (genomic best linear unbiased prediction [GBLUP]), a kernel method (Gaussian kernel [GK]), a machine learning method (deep learning [DL]), and a hybrid method that emulates a machine learning model using a kernel method (arc-cosine kernel [AK]) in terms of their accuracy in predicting grain yield, oil, and protein, using data from the Soybean Nested Association Mapping experiment (1,379 genotypes tested in six environments, with all genotypes in all environments). The relative performance of the four methods varied with the response variable and with whether the model included genotype-by-environment interaction effects. GBLUP consistently showed better performance, and GK and AK followed a similar pattern to GBLUP. DL performed slightly worse than the other three methods in most cases; however, this may be attributable to sub-optimal hyperparameters. DL performed notably worse than the other three methods in the presence of genotype-by-environment interaction effects. In general, all four methods performed better when the interaction effects were included than when only main effects were modeled.
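
As a rough illustration of how the three kernel-based methods named above differ, the sketch below builds a linear (GBLUP-style) relationship matrix, a Gaussian kernel, and a degree-1 arc-cosine kernel from a marker matrix. It is a minimal sketch under stated assumptions, not the thesis's implementation: the marker matrix is simulated, the function names and the median-distance scaling of the Gaussian kernel are hypothetical choices, and the arc-cosine kernel is the degree-1 form commonly attributed to Cho and Saul, which may differ from the exact form used in the study.

```python
import numpy as np

def linear_kernel(X):
    """GBLUP-style relationship matrix from column-centered markers."""
    Z = X - X.mean(axis=0)
    return Z @ Z.T / X.shape[1]

def gaussian_kernel(X, bandwidth=1.0):
    """Gaussian kernel on squared Euclidean marker distances.

    Scaling by the median off-diagonal distance is an assumption made
    here for illustration; it requires at least two distinct genotypes.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    scale = np.median(d2[np.triu_indices_from(d2, k=1)])
    return np.exp(-bandwidth * d2 / scale)

def arc_cosine_kernel(X):
    """Degree-1 arc-cosine kernel, which emulates a one-hidden-layer
    ReLU network of infinite width. Rows of X must be non-zero."""
    norms = np.linalg.norm(X, axis=1)
    outer = np.outer(norms, norms)
    cos = np.clip(X @ X.T / outer, -1.0, 1.0)
    theta = np.arccos(cos)
    return outer * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

# Simulated markers coded 0/1/2 (hypothetical data: 6 genotypes, 50 markers).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 50)).astype(float)

G = linear_kernel(X)
GK = gaussian_kernel(X)
AK = arc_cosine_kernel(X)
```

Each matrix is a genotype-by-genotype covariance structure that could stand in for the genomic relationship matrix in a mixed model; only the notion of similarity between marker profiles changes from one kernel to the next.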

Advisor: Reka Howard