Department of Statistics: Dissertations, Theses, and Student Research

On Bayesian Empirical Likelihood-based Method for Complex Survey Data with Application to Non-probability Sampling

Md Hasibur Rahman — Tue, 16 Dec 2025 14:35:17 PST

This thesis develops a Bayesian empirical likelihood (BEL) framework for inference under complex survey designs and extends it to non-probability sampling. Parametric likelihood-based methods are difficult to apply to complex survey data because the likelihood is rarely available in closed form. Empirical likelihood (EL) provides a flexible alternative by replacing the parametric likelihood with one constructed from moment conditions. The proposed method first integrates empirical likelihood constraints with survey design features, then extends BEL to non-probability sampling through selection models and design-consistent restrictions. Posterior inference is carried out using a Metropolis–Hastings MCMC algorithm. A real-data analysis further illustrates the framework’s ability to correct selection bias when combining probability and non-probability samples.

Advisor: Sanjay Chaudhuri

Detection of Activity Cliffs Produced by Anti-cancer Drugs and an Algorithm for Reliable Predictions in Affected Areas

Sarah Josephine Aurit — Thu, 21 Aug 2025 10:45:17 PDT

An activity cliff (AC) occurs when drugs close in chemical space produce dissimilar biological results. We focus on developing an inferential procedure to detect the presence of ACs in a chemical landscape. If ACs are detected, we provide a distance-based procedure that can be used to identify regions of stability in the chemical landscape of interest and generate predictions with higher precision in those areas of stability. We conceptualize the chemical landscape as a spatial random field and use spatial models for prediction of efficacy for new drugs based on “distance” in chemical space. We argue that an AC manifests itself by inducing non-stationarity in the foregoing spatial random field. We utilize a formal non-parametric test of stationarity to detect the presence of ACs. If non-stationarity is detected, a metric-learning algorithm is employed to transform the coordinate system of the original random field. Once the transformation is complete, the data are retested for stationarity. If the transformed random field is stationary, then ordinary kriging is used for predictions of test points. We show that the precision of the prediction can be further improved by generating convex clusters in the chemical landscape and training cluster-specific spatial models. Finally, we use Euler-Bernoulli beam theory to attach uncertainty to test points that fall outside all clusters.

Advisor: Souparno Ghosh

Leveraging Historical Data for Estimating Genetic Gain and Implementing Genomic Selection in a Student Led Barley Breeding Program

Sydney Graham — Wed, 28 May 2025 15:05:22 PDT

In Nebraska, winter feed barley presents an emerging market for producers and an opportunity to diversify cropping systems. The University of Nebraska Barley Breeding Program aims to develop high-yielding, winter-hardy varieties. A unique aspect of this program is that doctoral students serve as barley breeders and are responsible for crossing, data collection, and advancement decisions. While this provides hands-on experience for the students, the impact of student leadership has not been examined.

This study used a historical data set to evaluate the realized genetic gain of the breeding program, and as a training population for genomic selection. The dataset consisted of 302 genotypes from the advanced yield trial evaluated from 2002 to 2022 in three Nebraska locations. The rate of realized genetic gain for yield was estimated by regressing estimated genotypic means on the year they entered the trial. Additionally, SNP data for 189 genotypes was generated with the USDA-SoyWheOatBar-3K array, and daily weather covariates were collected. The 2024 observation nursery was genotyped and used as a testing set for genomic selection. Genomic selection models included four variations of GBLUP (G+E, G+E+GxE, G+W, G+W+GxW), four Bayesian models (Bayes A, Bayes B, Bayes C, Bayes LASSO), and two machine learning approaches (random forest, support vector machine). The models were applied for both winter survival and grain yield.

The realized genetic gain for yield was 62.4 kilograms per hectare per year. For grain yield, the GBLUP and Bayesian genomic selection models performed well, and the subset of the training population used was more important than which model was selected. The highest predictive accuracy was achieved using the Bayes A model trained on Lincoln data only (r = 0.420). For winter survival, the highest prediction accuracy was achieved by the random forest model trained on Lincoln data only (r = 0.263). While improvements can be made for winter survival, implementing genomic selection for yield can aid future student barley breeders and provide continuity between transitions.
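As a rough illustration of the gain estimate described above, the rate of realized genetic gain can be read as the slope of an ordinary least-squares regression of genotypic means on year of trial entry. The sketch below assumes a simple one-predictor regression; the function name `realized_gain` and the numbers are hypothetical and are not taken from the thesis data.

```python
from statistics import mean

def realized_gain(entry_years, genotypic_means):
    """Ordinary least-squares slope of genotypic mean on year of trial entry."""
    x_bar = mean(entry_years)
    y_bar = mean(genotypic_means)
    # Sum of cross-products and sum of squares about the means
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(entry_years, genotypic_means))
    sxx = sum((x - x_bar) ** 2 for x in entry_years)
    return sxy / sxx

# Hypothetical genotypic means (kg/ha) by year of entry, for illustration only
years = [2002, 2006, 2010, 2014, 2018, 2022]
yields = [3000.0, 3200.0, 3400.0, 3600.0, 3800.0, 4000.0]

gain = realized_gain(years, yields)
print(f"Estimated gain: {gain:.1f} kg/ha/year")
```

The slope is in the units of the response per year, which is how a figure in kilograms per hectare per year would be interpreted.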

Advisor: Reka Howard

Comparing Machine Learning Techniques with State-of-the-art Parametric Prediction Models for Predicting Soybean Traits

Susweta Ray — Mon, 18 Dec 2023 11:05:18 PST

Soybean is a significant source of protein and oil and is also widely used as animal feed. Thus, developing lines that are superior in terms of yield, protein, and oil content is important to feed the ever-growing population. Evaluating new lines in different environments (location-year combinations) requires costly phenotyping, whereas genotyping is both cost- and time-efficient for breeders. Several genomic prediction (GP) methods have been developed to use the marker and environment data effectively to predict the yield or other relevant phenotypic traits of crops. Our study compares a conventional GP method (GBLUP), a kernel method (Gaussian kernel [GK]), a machine learning method (deep learning [DL]), and a hybrid method that emulates a machine learning model using a kernel method (arc-cosine kernel [AK]) in terms of their accuracy in predicting grain yield, oil, and protein, using data from the Soybean Nested Association Mapping experiment (1,379 genotypes tested in six environments, all genotypes in all environments). The relative performance of the four methods varied with the response variable and with whether the model included genotype-by-environment interaction effects. GBLUP consistently showed better performance, and GK and AK followed a similar pattern to GBLUP. DL performed slightly worse than the other three methods in most cases; however, this may be attributed to sub-optimal hyperparameters. DL performed particularly poorly relative to the other three methods in the presence of genotype-by-environment interaction effects. In general, all four methods performed better when the interaction effects were included than with main-effects-only models.

Advisor: Reka Howard

Exploring Experimental Design and Multivariate Analysis Techniques for Evaluating Community Structure of Bacteria in Microbiome Data

Kelsey Karnik — Mon, 31 Jul 2023 09:50:21 PDT

The gut microbiome plays a crucial role in human health, and by working collaboratively with microbiologists, we aim to further our understanding of the human gut and its impact on human health. Promoting a diverse microbiome is emphasized throughout microbiology literature, and involving a statistician in designing experiments to relate gut bacteria and some measured health outcome is crucial for ensuring valid and accurate results. By adopting new experimental design and analysis methods, researchers can begin to gain a deeper understanding of how the genetics of our food affect the composition of taxa within the gut microbiome. This dissertation is structured around three main objectives, demonstrating how applying new experimental design techniques and multivariate analysis methodologies could potentially benefit domain-specific researchers throughout the scientific process. This work developed a new experimental design structure for assigning treatments to well-plates. Multivariate analysis methods were used to analyze the data, creating new polymicrobial traits to introduce a community taxonomic effect into genome-wide association models. Finally, the effects of experimental parameters on statistical optimality criteria were explored. Our randomizations and experimental design structure exhibited increased efficiency over a design that included only replicate effects. After analyzing our taxonomic abundance data and decomposing the variability in multiple formats, our new pseudo-multivariate phenotypes were included in our collaborators' GWAS models. We found that 57% of the calculated polymicrobial traits were included in the genome-wide association study (GWAS) models. Over half of the polymicrobial traits used as responses contained either a direct or related overlap with a univariate taxon on the same Major Effect Loci, where some of the unique and helpful relationships were explored more in-depth regarding taxonomic functions within the microbiome.
Lastly, we developed a function that calculates the composite optimality criteria to compare design optimality for a multivariate linear mixed model with a covariance structure on the random genetic effects. In the future, similar models and optimal design functions could help researchers improve their experimental design layouts by leveraging their knowledge of genetic relationships in our diets and the relationships between taxa in the gut.

Advisor: Kent Eskridge

Examining the Effect of Word Embeddings and Preprocessing Methods on Fake News Detection

Jessica Hauschild — Fri, 05 May 2023 11:40:19 PDT

The words people choose to use hold a lot of power, whether that be in spreading truth or deception. As listeners and readers, we do our best to understand how words are being used. There are many current methods in computer science literature attempting to embed words into numerical information for statistical analyses. Some of these embedding methods, such as Bag of Words, treat words as independent, while others, such as Word2Vec, attempt to gain information about the context of words. It is of interest to compare how well these various methods of translating text into numerical data work specifically with detecting fake news. The term “fake news” can be quite divisive, but we define it as news that is hyper-partisan, filled with untruths, and written to cause anger and outrage, as defined in Potthast & Kiesel (2018). We hypothesize that a person’s word choice relates to the factualness of an article. In Chapter 5, we utilize this embedded information in several binary classification methods. We find that words are only marginally valuable in detecting fake news regardless of the embedding or classification method used. However, within natural language processing tasks, there are many preprocessing steps taken to get the text ready for analysis, which are explored in Chapter 6. The embedding methods are confounded with the preprocessing methods used. Preprocessing of text includes, but is not limited to, filtering out words that do not appear a minimum number of times, filtering out stop words, removing numbers, and translating all letters to lower case. We find that filtering out stop words and removing words not appearing a minimum number of times have the most significant effect in combination with embedding and classification methods. Finally, in Chapter 7, we extend the classification to six categories ranging from true to pants-on-fire false and find that these preprocessing methods are not as influential as they were with the binary outcome.
Other predictors outside of the words and word embeddings themselves are necessary for improvement in the detection of fake news.

Advisor: Kent Eskridge

Human Perception of Exponentially Increasing Data Displayed on a Log Scale Evaluated Through Experimental Graphics Tasks

Emily Robinson — Fri, 05 Aug 2022 12:35:16 PDT

Log scales are often used to display data over several orders of magnitude within one graph. We conducted a series of three graphical studies to evaluate the impact displaying data on the log scale has on human perception of exponentially increasing trends compared to displaying data on the linear scale. Each study was related to a different graphical task, each requiring a different level of interaction and cognitive use of the data being presented. The first experiment evaluated whether our ability to perceptually notice differences in exponentially increasing trends is impacted by the choice of scale. Participants were shown a set of plots and asked to identify which plot appeared to differ most from the other plots. Results indicated the choice of scale changes the contextual appearance of the data, leading to slight perceptual advantages for both scales depending on the curvatures of the trend lines being compared. The second study validated a new method, ‘You Draw It’, for testing statistical graphics and introduced an appropriate statistical analysis method for comparing visually fitted trend lines to statistical regression results. This new method was then used to test participants’ ability to make forecast predictions for exponentially increasing trends on both scales. The results from the analysis showed a clear underestimation of forecasting trends with high exponential growth rates when participants were asked to make predictions on the linear scale; forecasts improved when participants were asked to make predictions on the log scale. The third study evaluated graph comprehension as it relates to the contextual scenario of the data shown. Overall, our results suggested that log logic is difficult and that anchoring and rounding biases result in a sacrifice in accuracy in estimates made on the log scale for large magnitudes.
The studies conducted in this research relied on graphical tasks of varying complexity to help us understand the perceptual and cognitive advantages and disadvantages of displaying exponentially increasing data on the log scale. The results are instrumental in establishing guidelines for making design choices about scale which result in data visualizations effective at communicating the intended results.

Advisers: Susan VanderPlas and Reka Howard

Factors Influencing Student Outcomes in a Large, Online Simulation-Based Introductory Statistics Course

Ella M. Burnham — Sun, 01 Aug 2021 16:05:15 PDT

The demand for statistical knowledge and skills is growing in many disciplines, so more students are enrolling in introductory statistics courses (Blair, Kirkman, & Maxwell, 2018). At the same time, institutions are seeking course delivery methods that allow for greater flexibility for students, especially following the onset of the COVID-19 pandemic; therefore, there is more interest in the development and delivery of online introductory statistics courses.

To address this, I collaboratively designed an online introductory statistics course which focuses on simulation-based inference for the University of Nebraska-Lincoln. The course design was informed by the Community of Inquiry framework (Garrison, Anderson, & Archer, 2000). The course is delivered asynchronously and has the capacity for high enrollment. Following the development of the course, I co-taught this course from Fall 2018 to Spring 2021 and recruited enrolled students to participate in my study. Participants granted research access to several components of their normal coursework and completed three surveys: Survey of Attitudes Toward Statistics (36-question version pre-test and post-test; Schau, 2003a, 2003b) and the Distance Education and Technological Advancements Survey (Joosten & Reddy, 2015).

The primary goal of this study was to understand factors that influence student outcomes in this course. An intervention was designed to support the community of inquiry within the course and was implemented during Fall 2019 and Fall 2020. Bayesian hierarchical models showed no evidence of an effect of the intervention on student outcomes. However, a variety of other self-reported factors were found to be associated with student outcomes. The secondary aim of the study was to understand whether students' attitudes toward statistics changed during the term; descriptive statistics suggest that they did not.

To address some of the limitations of this study, future research could examine these research questions for simulation-based introductory statistics courses across multiple institutions. This study may help create recommendations for developing online introductory statistics courses.

Adviser: Erin E. Blankenship

Statistical Methodology to Establish a Benchmark for Evaluating Antimicrobial Resistance Genes through Real Time PCR assay

Enakshy Dutta — Wed, 05 Aug 2020 10:40:24 PDT

Novel diagnostic tests are usually compared with gold standard tests for evaluating diagnostic accuracy. For assessing antimicrobial resistance (AMR) to bovine respiratory disease (BRD) pathogens, the phenotypic broth microdilution method is used as the gold standard (GS). The objective of the thesis is to evaluate the optimal cycle threshold (Ct) generated by real-time polymerase chain reaction (rtPCR) for genes that confer resistance, such that it translates to the phenotypic classification of AMR. Data from two different methodologies are assessed to identify the Ct that will discriminate between resistance (R) and susceptibility (S). First, the receiver operating characteristic (ROC) curve was used to determine the optimal Ct by optimizing the area under the curve (AUC), which was further validated by assessing the sufficiency of sample sizes involved in this study and by 5-fold cross-validation. AUC is a straightforward method, using a default probability threshold (Pt) of 0.5 and independent of misclassification cost to discriminate between the classes. An alternative methodology, the H measure, is proposed, which selects the Pt based on minimum error rate. The H measure is quite flexible, and the threshold can be selected according to researchers’ interest by minimizing the false positive or false negative rate.

A total of 297 lung and 111 nasal swabs from bovine were tested for AMR using the gold standard and rtPCR for three specific drugs. The level of agreement between the two tests was measured using Cohen’s kappa (κ). Using the first approach, the optimal Ct for lung tissue samples was between 32.6 and 35.7, with a good level of agreement between the two tests. For the nasal tissue, the rtPCR results were only validated for one drug, with a Ct of 33.3 and a moderate level of agreement. For the second approach, the lung and nasal samples are combined, and the optimal Ct, evaluated by averaging the results from the AUC and the H measure, lies between 32.0 and 32.9 with a moderate level of agreement.
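For readers unfamiliar with the agreement statistic mentioned above, the following minimal sketch computes Cohen's kappa from a square cross-classification of two tests. The function name `cohens_kappa` and the counts are hypothetical and are not the thesis data; rows are taken to be the gold standard classification and columns the rtPCR classification.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table of counts (rows: test A, columns: test B)."""
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of samples on the diagonal
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: product of matching row and column marginal proportions
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical counts: [[R/R, R/S], [S/R, S/S]], for illustration only
example = [[20, 5],
           [10, 15]]
kappa = cohens_kappa(example)
print(f"kappa = {kappa:.2f}")
```

Kappa rescales observed agreement by the agreement expected from the marginal totals alone, so a value of 0 indicates chance-level agreement and 1 indicates perfect agreement.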

Adviser: Jennifer Clarke

Community Impact on the Home Advantage within NCAA Men's Basketball

Erin O'Donnell — Tue, 12 May 2020 20:20:19 PDT

The home advantage is commonly accepted as fact throughout sports. This paper investigates the magnitude of the home advantage among NCAA Men’s Basketball teams. It then looks to relate the magnitude of the home advantage to community aspects such as attendance, location, past program success, and social media presence. Univariate and multivariate models are investigated.

Advisor: Walter W. Stroup

Group Testing Identification: Objective Functions, Implementation, and Multiplex Assays

Brianna D. Hitt — Thu, 07 May 2020 09:20:19 PDT

Group testing is the process of combining items into groups to test for a binary characteristic. One of its most widely used applications is infectious disease testing. In this context, specimens (e.g., blood, urine) are amalgamated into groups and tested. For groups that test positive, there are many algorithmic retesting procedures available to identify positive individuals. The appeal of group testing is that the overall number of tests needed is significantly less than for individual testing when disease prevalence is small and an appropriate algorithm is chosen. Group testing has a number of applications beyond infectious disease testing, such as drug discovery, food contamination detection, and diagnosis of faulty network sensors.

An important decision that needs to be made prior to implementation is the group sizes to use. In best practice, an objective function is minimized to determine the optimal set of group sizes, known as the optimal testing configuration (OTC). We examine several different objective functions and show that the OTCs and corresponding results (e.g., number of tests, accuracy) are largely the same for these functions when using standard group testing algorithms.
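To make the idea of minimizing an objective function over group sizes concrete, here is a minimal sketch for one simple case. It assumes two-stage (Dorfman) testing with a perfect assay and uses the expected number of tests per individual as the objective; both choices are simplifying assumptions made here for illustration. The function names are hypothetical and are not part of the binGroup2 package.

```python
def expected_tests_per_individual(p, group_size):
    """Expected tests per individual under two-stage testing with a perfect assay.

    p is the probability that an individual is positive. A group of size I
    needs one pooled test, plus I individual retests if the pool is positive.
    """
    if group_size == 1:
        return 1.0  # individual testing: exactly one test each
    prob_group_positive = 1.0 - (1.0 - p) ** group_size
    return 1.0 / group_size + prob_group_positive

def optimal_group_size(p, max_size=50):
    """Group size in 1..max_size that minimizes expected tests per individual."""
    return min(range(1, max_size + 1),
               key=lambda size: expected_tests_per_individual(p, size))

best = optimal_group_size(0.01)
print(best, round(expected_tests_per_individual(0.01, best), 4))
```

Under these assumptions, a prevalence of 1% gives an optimal group size of 11 and roughly 0.2 tests per individual, which illustrates why group testing is attractive when prevalence is small.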

Both estimating the probability of disease and identifying positive individuals are goals of group testing. We present the first general R functions for identification and make these available in the new binGroup2 package. We also include in this package estimation functions from the binGroup package by creating a unified framework for them.

We developed a web-based Shiny application to assist laboratory personnel in determining how well a group testing algorithm is expected to perform before implementation. The app utilizes binGroup2 functions to calculate the expected number of tests and diagnostic accuracy measures for a wide variety of algorithms using one- and two-disease assays. The OTC can be found with the app as well.

Most group testing research using one-disease assays makes the assumption of equal sensitivity and equal specificity values across all stages of testing. We present derivations of operating characteristics for group testing algorithms that allow the diagnostic test accuracy to differ across stages of testing. These resulting expressions are incorporated into the binGroup2 package.

Adviser: Christopher R. Bilder

Using Stability to Select a Shrinkage Method

Dean Dustin — Thu, 07 May 2020 09:00:21 PDT

Shrinkage methods are estimation techniques based on optimizing expressions to find which variables to include in an analysis, typically a linear regression. The general form of these expressions is the sum of an empirical risk plus a complexity penalty based on the number of parameters. Many shrinkage methods are known to satisfy an ‘oracle’ property, meaning that asymptotically they select the correct variables and estimate their coefficients efficiently. In Section 1.2, we show oracle properties in two general settings. The first uses a log likelihood in place of the empirical risk and allows a general class of penalties. The second uses a general class of empirical risks and a general class of penalties, obtaining limiting behavior for a large class of smooth likelihoods. The second contribution of this thesis is to realize that shrinkage techniques with oracle properties are asymptotically the same, but differ in their finite sample properties. To address this, in Section 2.1, we propose selection of a shrinkage method based on a stability criterion. Part of our analysis in Section 2.2 is a computational comparison of several specific shrinkage methods. In future work, we hope to optimize a stability criterion directly to derive a data-driven shrinkage method using techniques from genetic algorithms; we describe this in Section 2.3.

Adviser: Bertrand Clarke

Optimal Design for a Causal Structure

Zaher Kmail — Wed, 14 Aug 2019 12:15:43 PDT

Linear models and mixed models are important statistical tools. But in many natural phenomena, there is more than one endogenous variable involved and these variables are related in a sophisticated way. Structural Equation Modeling (SEM) is often used to model the complex relationships between the endogenous and exogenous variables. It was first implemented in research to estimate the strength and direction of direct and indirect effects among variables and to measure the relative magnitude of each causal factor.

Historically, optimal design theory has focused on univariate linear, nonlinear, and mixed models. There is no current literature on optimal design for a causal structure, so this research is the first contribution in the field. This dissertation has five objectives. For a given causal structure, the first four are to obtain an optimal design for: (1) a completely randomized experiment that produces the most precise estimates for the endogenous and exogenous parameters; (2) an experiment with random blocks or split-plots that produces the most precise estimates for the endogenous and exogenous parameters; (3) an experiment with fixed blocks that produces the most precise estimates for the endogenous and exogenous parameters; and (4) an experiment with random blocks or split-plots that produces the most precise estimates for the endogenous parameters, exogenous parameters, and the variance components. The fifth objective is (5) to use the methods above to demonstrate the improvement in efficiency for two applications published in previous research.

In each case, the causal relationship dramatically changed the optimal designs. The new optimal designs were more efficient. Even orthogonal designs, which are universally optimal in the univariate case, are not optimal when considering a causal structure.

Advisor: Kent M. Eskridge

Role of Misclassification Estimates in Estimating Disease Prevalence and a Non-linear Approach to Study Synchrony Using Heart Rate Variability in Chickens

Dola Pathak — Mon, 03 Dec 2018 17:09:42 PST

Infectious disease assays can be imperfect. When estimating disease prevalence, these imperfections are accounted for by incorporating assay sensitivity and specificity into point and variance estimates. Unfortunately, these accuracy measures are often treated as fixed constants, rather than acknowledging that they are estimates from an assay validation process. The purpose of this study is to show the detrimental effect of not taking into account this sampling variability when samples are obtained through group testing (also known as pooled testing). We show that confidence interval coverage can dramatically decline as the sample size increases for the main sample of interest. As a remedy for this problem, we propose a new confidence interval which takes into account the extra sampling variability. This new interval is shown to obtain coverage near the nominal level.
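The role of sensitivity and specificity in a prevalence point estimate can be sketched with the standard Rogan-Gladen correction. This sketch assumes individually tested specimens, which is a simplification of the group testing setting studied in the thesis, and it treats sensitivity and specificity as fixed constants, which is exactly the practice the abstract cautions against. The function name and the input values are hypothetical.

```python
def corrected_prevalence(apparent, sensitivity, specificity):
    """Rogan-Gladen estimate of true prevalence from the apparent positive rate.

    Assumes individual (not pooled) testing and treats sensitivity and
    specificity as known constants, ignoring their own sampling variability.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("assay must perform better than chance")
    estimate = (apparent + specificity - 1.0) / denominator
    # Truncate to the valid range for a probability
    return min(1.0, max(0.0, estimate))

# Hypothetical inputs: 12% of tests positive, Se = 0.95, Sp = 0.98
estimate = corrected_prevalence(0.12, 0.95, 0.98)
print(f"Corrected prevalence: {estimate:.4f}")
```

Because the correction divides by sensitivity + specificity - 1, any uncertainty in those two quantities carries through to the prevalence estimate, which is the extra variability a fixed-constant interval omits.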

Heart Rate Variability (HRV) has been used to study stress-induced reactions in humans, and in mammals in general, using non-linear analysis. Studies have also been done to establish synchrony in humans. Non-linear analysis of HRV, using recurrence and cross-recurrence plots and recurrence and cross-recurrence quantification analysis, has also been used to study feather-pecking behavior in chickens. The main purpose of this study is to see if the human study on the degree of synchrony can be replicated for the avian population. If such synchrony exists in the avian population, then it will establish that the degree of synchrony is a primal instinct. Female leghorn chickens were used in the study as they have similar cardiac structure but are evolutionarily distant from mammals. If the presence of synchrony can be established for cagemate hens, it might lead to significant improvement in poultry well-being.

Advisers: Professors Kathryn J Hanford and Erin E Blankenship

A Characterization of a Value Added Model and a New Multi-stage Model For Estimating Teacher Effects within Small School Systems

Julie M. Garai — Thu, 03 Aug 2017 08:30:20 PDT

At both the national and state level there is increasing pressure to develop metrics to determine if school systems are meeting educational objectives. All states mandate some form of assessment by standardized tests. One method currently used to model student test scores is Value Added Modeling (VAM), which models student scores as a product of classroom and school environments. One VAM approach is the Tennessee Value Added Assessment System (TVAAS) which models student gains from year to year. Teacher effects are included in this layered model, which estimates the teacher’s added value to a student score through best linear unbiased prediction.

Research using VAM typically occurs in school systems with a large number of students (e.g. New York City, Los Angeles, Chicago, etc.) or in statewide assessments that are combined across school districts (e.g. Tennessee). VAM performance in school systems with small numbers of students is unknown.

One common issue with estimation based on small samples is lack of precision. An area of statistics that has developed methodology for small sample sizes is small area estimation. One approach in this area is indirect estimation which links similar subjects together allowing the small groups to “borrow strength” from each other.

This dissertation introduces a multi-stage model that incorporates small area estimation techniques with the traditional TVAAS. The performance of both the multi-stage and TVAAS models is studied through data simulated for small school systems. The precision of predicted teacher value added scores is assessed for both modeling methods.

Advisers: Walter W. Stroup and Erin E. Blankenship

Methods to Account for Breed Composition in a Bayesian GWAS Method which Utilizes Haplotype Clusters

Danielle F. Wilson-Wells — Thu, 29 Jun 2017 14:10:21 PDT

In livestock, prediction of an animal’s genetic merit using genomic information is becoming increasingly common. The models used to make these predictions typically assume that we are sampling from a homogeneous population. However, in both commercial and experimental populations the sire and dam of an individual may be a mixture of different breeds. Haplotype models can capture this population structure.

Two models based on breed-specific haplotype clusters were developed to account for differences across multiple breeds. The first model utilizes the breed composition of the individual, while the second utilizes the breed composition from the sire and dam. Haplotype clusters were modeled as hidden states in a hidden Markov model where the genomic effects are associated with loci located on the unobserved clusters. Similar to the Bayes C model, we can model the genomic effects at the loci using a prior that consists of a mixture of a multivariate normal distribution and a point mass at zero.

The performance of the first model will be evaluated in a composite beef cattle population, representing various fractions of several breeds, using five weight traits, seven carcass traits, and two other traits related to calving on 6,552 cattle genotyped for 99,827 mapped SNPs. The performance of the second model will be evaluated in a two-way cross population, which was a cross between two independent lines, using age of puberty records on 1,654 swine genotyped for 48,408 mapped SNPs. Both models will also be evaluated in a simulated composite population of two lines of 12,500 individuals and 61,255 mapped SNPs.

Overall, the breed-specific haplotype models led to larger and more clearly observed estimated QTL. However, the prediction accuracy for the haplotype models was typically lower than that for the traditional Bayesian GWAS models. Therefore, while our ability to locate QTL was increased, the traditional models are still the preferred choice for prediction as they have higher prediction accuracy when it comes to estimating an animal’s genetic merit.

Simulations of a New Response-adaptive Biased Coin Design

Aleksandra Stein — Tue, 08 Dec 2015 14:20:44 PST

Modern medical experiments accrue and treat patients, and hence obtain treatment response data, throughout a trial. Designs which prospectively plan to modify patient allocation by leveraging accumulating data are response-adaptive randomization (RAR) designs. Many such designs attempt to balance the desire to bias assignment proportions towards a treatment which is performing better against the need to maintain randomization in the face of continued equipoise.

This dissertation consists of simulated investigations into frequentist and ethical properties of a new RAR biased coin design. Chapter 2 proposes a new adaptive design for phase III clinical trials, a modification of the 2001 Bandyopadhyay and Biswas biased coin design. Simulations show how the new design continues to ethically expose patients to the better treatment while simultaneously mitigating power loss inherent in the original design. Chapters 2 and 3 expand the applicability of the new design to scenarios where treatment variances or covariate-treatment impacts are unequal. In Chapter 4, simulations demonstrate that the new response-adaptive biased coin design can be more ethical than equal allocation, even when patient outcomes are not immediately available. Each chapter illustrates the utility and benefits of the new design through a real-world application of an HIV treatment adherence intervention. Asymptotic results are applied to a special case of the BBS design and small sample implications are compared with simulated outcomes in Chapter 5.

Adviser: Kent M. Eskridge
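To make the idea of response-adaptive allocation concrete, the sketch below simulates a two-arm trial in which the probability of assigning the next patient to arm A is the standard normal CDF of the scaled difference in observed mean responses. This is a generic probit-type rule in the spirit of the original biased coin design, not the new design studied in the dissertation, whose details the abstract does not give. It assumes normal responses, that larger responses are better, and that outcomes are available immediately. The names `simulate_trial`, `tuning_t`, and `burn_in` are hypothetical.

```python
import math
import numpy as np

def normal_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_trial(n_patients, mean_a, mean_b, sd, tuning_t, burn_in, rng):
    """Simulate one trial and return the number of patients on each arm."""
    responses = {"A": [], "B": []}
    for i in range(n_patients):
        if i < 2 * burn_in:
            # Alternate arms at the start so both means can be estimated
            arm = "A" if i % 2 == 0 else "B"
        else:
            # Skew the coin towards the arm with the larger observed mean
            diff = np.mean(responses["A"]) - np.mean(responses["B"])
            prob_a = normal_cdf(diff / tuning_t)
            arm = "A" if rng.random() < prob_a else "B"
        true_mean = mean_a if arm == "A" else mean_b
        responses[arm].append(rng.normal(true_mean, sd))
    return len(responses["A"]), len(responses["B"])

rng = np.random.default_rng(1)
n_a, n_b = simulate_trial(
    n_patients=200, mean_a=2.0, mean_b=0.0, sd=1.0,
    tuning_t=1.0, burn_in=5, rng=rng,
)
print("patients on A:", n_a, "patients on B:", n_b)
```

A larger `tuning_t` pulls the allocation probability towards 0.5, which trades ethical skew for balance between the arms.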

A New Approach to Modeling Multivariate Time Series on Multiple Temporal Scales

Tucker Zeleny — Fri, 07 Aug 2015 10:40:17 PDT

In certain situations, observations are collected on a multivariate time series at a certain temporal scale. However, there may also exist underlying time series behavior on a larger temporal scale that is of interest. Often, identifying the behavior of the data over the course of the larger scale is the key objective. Because this large-scale trend is not being directly observed, describing the trends of the data on this scale can be more difficult. To further complicate matters, the observed data on the smaller time scale may be unevenly spaced from one larger-scale time point to the next. The existence of these multiple time scales means that it may be more appropriate to view the observations as coming from multiple, shorter multivariate time series occurring at each large-scale time point as opposed to a single, long multivariate time series. Approaching the problem by examining the smaller-scale time series separately, and then modeling the resulting estimates over the larger time scale, will provide an alternative to previous methods of dealing with similar situations while also producing additional information on the behavior of the data on the smaller observable time scale.

Advisor: David B. Marx

Modeling the Dynamic Processes of Challenge and Recovery (Stress and Strain) Over Time

Fan Yang — Fri, 07 Aug 2015 10:40:16 PDT

A dynamic process with challenge and recovery is an important branch in the family of stochastic processes. The data from such processes are often observed over time and hence are time dependent. The purpose of this dissertation is to develop methods to characterize a dynamic process with challenge and recovery under different dimensionalities and error assumptions. In this dissertation, a univariate dynamic process under a Gaussian assumption is discussed first and a bi-logistic model is developed by three different methods: compartment, additive, and Bayesian. Then the discussion is extended to a bivariate hysteresis system with challenge and recovery. Three methods (linear, nonlinear, and two-step simple harmonic) were developed to study hysteresis under the independent bivariate Gaussian assumptions. Finally, to be more general, a multivariate cylinder distribution was developed to analyze a multivariate dynamic process with challenge and recovery under more general error assumptions. In this case, the dimensionality could be any positive integer and the errors are not necessarily independent Gaussian. The cylinder method is applied to the hysteretic system and the results show that the cylinder method can be used in various scenarios to obtain the least biased and most efficient parameter estimates.

Advisor: Anne M. Parkhurst

Beta-binomial Kriging: A New Approach to Modeling Spatially Correlated Proportions

Aimee Schwab — Tue, 30 Jun 2015 12:10:16 PDT

Spatially correlated count data sets appear often in applied data analysis problems, but there is little consensus in the literature about how best to analyze the data. The two prevailing approaches provide accurate parameter estimates and predictions, at the cost of model interpretability and simplicity. This dissertation will present a new approach to modeling spatially correlated binomial observations: beta-binomial kriging. The model proposed here is a modified form of spatial kriging which assumes the data are generated from a correlated beta-binomial distribution. Given this assumption, the spatial parameters and predicted values can be estimated using simple matrix algebra. Beta-binomial kriging will be thoroughly assessed in the dissertation and shown to be a competitive option for modeling spatially correlated proportions. The model’s advantages will be illustrated using childhood vaccination rates.

Adviser: David Marx
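As background for the beta-binomial assumption mentioned in the abstract above, the sketch below draws independent beta-binomial counts and compares their empirical variance with the standard formula n p (1 - p) [1 + (n - 1) rho], where p = alpha / (alpha + beta) and rho = 1 / (alpha + beta + 1). It deliberately ignores spatial correlation and is not the kriging model proposed in the dissertation. The names `beta_binomial_draws`, `alpha`, `beta`, and `n_trials` are hypothetical.

```python
import numpy as np

def beta_binomial_draws(n_trials, alpha, beta, size, rng):
    """Draw counts by first sampling a success probability from Beta(alpha, beta)."""
    p = rng.beta(alpha, beta, size=size)
    return rng.binomial(n_trials, p)

def beta_binomial_variance(n_trials, alpha, beta):
    """Theoretical variance, written in terms of the overdispersion rho."""
    p = alpha / (alpha + beta)
    rho = 1.0 / (alpha + beta + 1.0)
    return n_trials * p * (1.0 - p) * (1.0 + (n_trials - 1) * rho)

rng = np.random.default_rng(2)
n_trials, alpha, beta = 20, 2.0, 6.0
counts = beta_binomial_draws(n_trials, alpha, beta, size=200_000, rng=rng)
theory_var = beta_binomial_variance(n_trials, alpha, beta)
binomial_var = n_trials * 0.25 * 0.75  # variance of a plain binomial with the same mean

print("empirical variance:", counts.var())
print("beta-binomial variance:", theory_var)
print("binomial variance:", binomial_var)
```

The gap between the last two numbers is the overdispersion that a plain binomial model would miss.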