Advisor: David B. Marx

Advisor: Anne M. Parkhurst

Adviser: David Marx

Adviser: Walter W. Stroup

Statistical methods are the main data analysis technique used for developing quantitative predictions in the life sciences, but they are rarely applied to long-term datasets because suitable methods are underdeveloped in most cases. This underdevelopment of statistical methods and applications was the motivation for my research. In Chapter 1, I develop a time series analysis method for populations that accounts for errors in detection. In Chapter 2, I develop and apply a variety of methods to predict an extinction threshold using long-term monitoring data from a population of bobwhite quail (*Colinus virginianus*). In Chapter 3, I link the unified framework of missing data developed in the statistical literature to species distribution modelling, which is a common method used to analyze historical location reports of a species. In Chapter 4, I introduce an example using location records of one of the rarest avian species in the world, the whooping crane (*Grus americana*). Because the whooping crane location records were imprecisely recorded, I also extend regression calibration methods in Chapter 4 to correct for the location error. In Chapter 5, I explore when a commonly used statistical estimation method will fail for analyses using historical location records; I then test several alternative estimation methods. Finally, in Chapter 6, I present an application that predicts the spatial and temporal distribution of whooping cranes using historical location records. This application was developed to determine what habitat whooping cranes use during migration and what habitat may require special protection to ensure survival of the species.

Advisors: Erin E. Blankenship and Richard A.J. Tyre

A simulation study was conducted using a closed network of ten nodes and three edge density values (low, moderate, and high) to randomly generate the edges (connections) between nodes. A Poisson AR(1) process was used to generate the number of communications between nodes at each time period. Changes were then randomly assigned in time periods 26 and 52, and the aR^{2} values were calculated between adjacent time periods. A separate simulation was conducted for each combination of edge density (3 levels), AR(1) correlation parameter (3 levels), number of edges perturbed (3 levels), perturbation factor (3 levels), time period of perturbation (2 levels), and configuration dimension (2 levels). The results suggest that under these conditions the method as proposed has reasonable power for detecting “abnormal” changes in the number of communications.
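As a rough illustration of how communication counts with AR(1)-type dependence might be generated, the sketch below uses a binomial-thinning construction with Poisson innovations. This construction is an assumption; the abstract does not state which Poisson AR(1) formulation was used. The names `poisson_draw` and `simulate_poisson_ar1`, and the parameters `lam` and `alpha`, are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_poisson_ar1(n_periods, lam, alpha, seed=0):
    """Simulate counts X_t = thin(X_{t-1}, alpha) + e_t with e_t ~ Poisson(lam * (1 - alpha)).

    Assumed construction (not taken from the abstract): each of the previous
    period's communications carries over with probability alpha, and new
    communications arrive as Poisson innovations. The marginal mean is lam
    and the lag-1 autocorrelation is alpha.
    """
    rng = random.Random(seed)
    counts = [poisson_draw(rng, lam)]
    for _ in range(n_periods - 1):
        # Binomial thinning of the previous count.
        survivors = sum(1 for _ in range(counts[-1]) if rng.random() < alpha)
        innovations = poisson_draw(rng, lam * (1.0 - alpha))
        counts.append(survivors + innovations)
    return counts


# One edge observed over 52 time periods, as in the simulation design.
edge_counts = simulate_poisson_ar1(n_periods=52, lam=5.0, alpha=0.5, seed=1)
```

A perturbation in period 26 or 52 could then be imposed by rescaling the mean for the affected edges, although the exact mechanism used in the study is not described here.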

Adviser: David B. Marx

Advisor: Christopher R. Bilder

For the first problem, we examine group testing regression models when the positive and negative statuses of individuals are identified. Identification leads to additional tests, known as “retests,” beyond those performed for the initial groups of individuals. We show how regression models can be fit in this setting while also incorporating the extra information from these retests. Through Monte Carlo simulations, we present evidence that incorporating retesting information yields significant gains in efficiency. Furthermore, we demonstrate that some group testing protocols can actually lead to more efficient estimates than individual testing when diagnostic tests are imperfect. Finally, we show that halving and matrix testing protocols are the most efficient to use in application.
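To make the role of imperfect diagnostic tests concrete, the sketch below computes the probability that an initial group tests positive, given individual-level probabilities from a logistic regression and an assay with known sensitivity and specificity. It covers initial group tests only, not the retesting likelihood developed in the dissertation. The function names and the single-covariate logistic form are assumptions made for illustration.

```python
import math


def individual_prob(beta0, beta1, x):
    """Logistic model for one individual's probability of being truly positive (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def group_positive_prob(individual_probs, sensitivity, specificity):
    """Probability that a pooled sample tests positive.

    The group is truly positive unless every member is negative, so
    P(test +) = Se * (1 - prod(1 - p_i)) + (1 - Sp) * prod(1 - p_i),
    assuming independent individuals and no dilution effect.
    """
    prob_all_negative = math.prod(1.0 - p for p in individual_probs)
    return sensitivity * (1.0 - prob_all_negative) + (1.0 - specificity) * prob_all_negative


# Two individuals, each with probability 0.1, and a perfect assay: 1 - 0.9 * 0.9.
perfect = group_positive_prob([0.1, 0.1], sensitivity=1.0, specificity=1.0)
# The same group under an imperfect assay (illustrative accuracy values).
imperfect = group_positive_prob([0.1, 0.1], sensitivity=0.95, specificity=0.98)
```

Each observed group response would contribute a Bernoulli term with this probability to the likelihood for the regression coefficients.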

For the second problem, we consider situations when individuals are tested in groups for multiple diseases simultaneously. This problem is important because assays frequently screen for more than one disease at a time. When these assays are used in a group testing setting, the individual positive/negative statuses consist of unobserved, correlated random variables. To estimate models in this setting, we develop an expectation-solution based algorithm that provides consistent parameter estimates and natural large-sample inference procedures.

Advisor: Christopher R. Bilder

Advisor: Anne M. Parkhurst

To investigate the differences between weighted kriging and ordinary kriging, a simulation study was conducted. Validation statistics were used to evaluate and compare the prediction procedures, and it was found that weighted kriging yields more desirable results than traditional kriging methods. As a follow-up, the prediction procedures were compared using real data from a groundwater quality study.

Bayesian Maximum Entropy (BME) is then introduced as an alternative method to utilize soft data in prediction. Numerical implementation of this approach is possible with the Spatiotemporal Epistemic Knowledge Synthesis-Graphical User Interface (SEKS-GUI). Using this interface, two simulation studies were conducted to investigate the differences between BME and weighted kriging. In the first study, probabilistic soft data in the form of the Gaussian distribution were used. However, since proponents of the BME approach claim that it performs extremely well when the soft data are skewed, the second study used nonsymmetrical soft data generated using a triangular distribution. In both studies, the weighted kriging validation statistics were more desirable than those from BME.

Advisor: David B. Marx

Chapter 2 introduces value-added methodology by describing several value-added models available for estimating teacher effects, along with their respective advantages and disadvantages. It also discusses modeling variations and their impact on estimated teacher effects, as well as the statistical and psychometric issues associated with estimating value-added teacher effects.

Because value-added analyses require high-quality longitudinal data that are often not available, Chapters 3 and 4 propose methodology for analyzing less-than-ideal assessment data. Chapter 3 proposes value-added methodology for analyzing longitudinal student achievement data not on a single developmental scale and addresses issues arising when using a layered, longitudinal mixed model to analyze gains in standardized scores. The chapter also discusses methods for estimating teacher effects on student learning before and after entering professional development programs and applies these methods of analysis to achievement data.

Chapter 4 describes the use of curve-of-factors methodology to analyze longitudinal achievement data collected from two differently scaled assessments in a single year and subject, such as mathematics. Assuming data come from a curve-of-factors model structure, a simulation study evaluates the performance of the proposed curve-of-factors model in its ability to accurately rank teachers in the presence of either complete or missing test data and compares it to the performance of the Z-score methodology proposed in Chapter 3.

We then present a stochastic model, the Multi-Order Markov Model under Hidden States (MMMHS), for representing heterogeneous sequences. MMMHS is similar to the conventional Hidden Markov Model (HMM) and Double Chain Markov Model (DCMM) in that it uses hidden states to describe the non-homogeneity of a sequence, but it provides a more flexible dependency structure by changing the order of Markov dependency under different hidden states. We extend the forward-backward procedure to MMMHS and provide the complete model estimation procedure based on the Expectation-Maximization (EM) algorithm. The method is then illustrated with applications to several real data sets, and the results are compared with those of traditional methods.
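For orientation, the sketch below shows the forward pass for a conventional first-order HMM, which is the procedure the abstract says is extended to MMMHS. It is not the MMMHS extension itself, whose details are not given here. The function name and the toy parameter values are assumptions.

```python
def forward_likelihood(obs, initial, transition, emission):
    """Likelihood of an observation sequence under a standard discrete HMM.

    initial[s]       : probability of starting in hidden state s
    transition[r][s] : probability of moving from state r to state s
    emission[s][k]   : probability of emitting symbol k in state s
    Unscaled recursion, so it is only suitable for short sequences.
    """
    n_states = len(initial)
    alpha = [initial[s] * emission[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            emission[s][symbol] * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy two-state model over a binary alphabet (illustrative values only).
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.2, 0.8]]
emission = [[0.9, 0.1], [0.3, 0.7]]
likelihood = forward_likelihood([0, 1, 1], initial, transition, emission)
```

In an EM fit, these forward quantities are combined with the matching backward quantities to obtain the expected state occupancies used in the maximization step.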

The first paper examines spatial clustering when only one numeric response has been recorded for each observation. The geographic or spatial location is incorporated into the likelihood of the multivariate normal distribution through the variance-covariance matrix. The variance-covariance matrix can be computed using any appropriate spatial covariance function; the spherical covariance function was used for this research. The second paper extends the clustering algorithm to the multivariate case, i.e., when more than one response has been recorded on each observation. Again, the spatial location is incorporated through the variance-covariance matrix of the multivariate normal distribution. However, the construction of the variance-covariance matrix must take into account the cross-covariance between the variates. Oliver’s (2003) approach for modeling the cross-covariance is incorporated into the clustering algorithm.
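As a minimal sketch of how spatial location can enter a variance-covariance matrix, the code below evaluates the standard spherical covariance function on pairwise distances between observation sites. The parameter names `sill` and `range_`, the omission of a nugget effect, and the use of Euclidean distance are assumptions; the abstract does not give the parameterization used.

```python
import math


def spherical_covariance(distance, sill, range_):
    """Spherical covariance: sill * (1 - 1.5*(h/a) + 0.5*(h/a)**3) for h < a, else 0."""
    if distance >= range_:
        return 0.0
    ratio = distance / range_
    return sill * (1.0 - 1.5 * ratio + 0.5 * ratio ** 3)


def covariance_matrix(coords, sill, range_):
    """Variance-covariance matrix for observations at the given (x, y) locations."""
    return [
        [spherical_covariance(math.dist(a, b), sill, range_) for b in coords]
        for a in coords
    ]


# Three sites: the first two are 5 units apart, the third is beyond the range.
sites = [(0.0, 0.0), (3.0, 4.0), (30.0, 40.0)]
sigma = covariance_matrix(sites, sill=2.0, range_=10.0)
```

With these values the first two sites have covariance 0.625 and the distant site is uncorrelated with both. A matrix of this kind would then serve as the covariance in the multivariate normal likelihood for a candidate cluster.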

Since not all recorded variables of interest are numeric, the third paper investigates incorporating categorical (non-numeric) responses into the spatial clustering algorithm. This paper looks first at the case where only categorical responses are recorded on the observations. After this has been implemented, the final step is to spatially cluster observations which contain both numeric and categorical responses. The algorithm must account for the spatial pattern of the data, the actual numeric responses and the categorical responses, and an appropriate weighting of the spatial component is determined. The final clustering algorithm clusters both numeric and categorical data while incorporating the geographic location of the observations.
