From: The Computer and the Decision-Making Process, edited by Terry B. Gutkin and Steven L. Wise (Hillsdale, New Jersey, Hove & London: Lawrence Erlbaum Associates, 1991).
My aim in this chapter is to outline some of the substantive and psychometric bases on which we can build a science of assessment that takes advantage of the enormous potential inherent in the digital computer and in artificial intelligence. Some of these foundations are within the traditions of classical assessment. But others represent urgently needed areas of explication and research.
It is my view, in the tradition of Cronbach (1954), that developers of computer software for testing should listen to what psychometricians say, and, as well, psychometricians should be sensitive to new research problems waiting to be solved that arise out of the experience of preparing software for test interpretation. This is particularly true because some of classical test theory, based on fixed sets of items, is rendered obsolete by the prospect of adaptive testing. The fact that psychometricians and authors of interpretive software are rarely prone to listen to one another brings to mind a quotation from the world-weary French novelist and philosopher, André Gide, cited by Block (1978): "It has all been said before, but you must say it again, since nobody listens."
Some Preconditions for Valid Computer-Assisted Test Interpretation
Accurate test interpretations depend on valid data. Stated another way, the validity of the score data sets an upper bound for the accuracy of test interpretations. This sounds like such a truism as to appear almost trivial. But surprisingly little attention has been directed at this issue by those who write, and write about, computer software for test interpretation. For example, in a recent book devoted to computer-based test interpretation (Butcher, 1987) there is scant attention directed at fundamental questions about the reliability of scores or indexes forming the bases for interpretations.
I would like to outline five preconditions for valid computer-assisted test interpretations and to discuss each in turn. These preconditions point both to the traditional wisdom of testing that can be incorporated appropriately into thinking about test interpretations, and, as well, to areas of needed research. Let me list the five: (1) Interpretations should, in general, be built around constructs of broad import; (2) Interpretations should bear an explicit substantive relationship to the constructs underlying the measures employed; (3) Where predictions are made about specific behaviors, both the reliability of the assessment data and the reliability of the criterion to be predicted should be taken into account; (4) The implications of evaluative biases both in the assessment situation and in outcomes need to be given explicit attention; and (5) Attention needs to be directed to base rates, both in the assessment situation and in outcome situations.
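The third precondition can be made concrete. In classical test theory, the correlation observable between a predictor and a criterion is bounded by the square root of the product of their reliabilities, and an observed correlation may be "disattenuated" by dividing by that bound. A minimal sketch of both formulas, using hypothetical reliability values rather than figures from any study discussed here:

```python
import math

def validity_ceiling(rxx: float, ryy: float) -> float:
    """Classical upper bound on the observable predictor-criterion
    correlation, given predictor reliability rxx and criterion
    reliability ryy."""
    return math.sqrt(rxx * ryy)

def disattenuated_r(rxy: float, rxx: float, ryy: float) -> float:
    """Observed correlation corrected for unreliability in both measures."""
    return rxy / math.sqrt(rxx * ryy)

# Hypothetical values: a test with reliability .81 predicting a criterion
# rated with reliability .49 can correlate with that criterion at most ~.63.
print(round(validity_ceiling(0.81, 0.49), 2))        # 0.63
print(round(disattenuated_r(0.40, 0.81, 0.49), 2))
```

No interpretive system, however sophisticated, can predict a criterion more accurately than this ceiling allows, which is why the reliability of both the test and the criterion must enter any prediction.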
The Usefulness of Personality Constructs
With regard to the importance of theory-based constructs, I do not know whether I should say a great deal or very little. There is a substantial literature in personality and social judgment bearing on this topic. But there is an unfortunate tendency for psychologists to consider a new area such as computerized test interpretation in isolation, as if little were to be gained from treating it as part of a larger assessment endeavor. Yet there is something to be learned from the knowledge and controversies of personality and assessment. One of the most controversial issues in the personality literature over the past two decades is the question of whether or not there are broad personality traits or dispositions. One of the strongest advocates of the position that there are not is Walter Mischel, who has argued forcefully that what appear to be broad behavioral consistencies are in fact illusory. The evidence proffered in support of this position and its implications for computerized assessment warrant careful examination.
Mischel and Peake (1982) presented evidence that they believed failed to support the existence of broad traits of conscientiousness and friendliness. They intercorrelated behaviors purportedly representing each of these traits and interpreted mean intercorrelations of the order of .13 as grounds for doubting the existence of broad traits. But their analyses and interpretations are illustrative of the sort of ad hoc theorizing that is tempting when constructing computer-based test interpretation systems. Mischel and Peake merely assumed that certain behaviors were linked to the traits of conscientiousness and friendliness without providing any explicit bases, in the form of definitions or classification rules, for their categorization. Nor did they fully consider the importance of aggregating data prior to inferring broadly based personality dispositions. Jackson and Paunonen (1985) undertook a reconceptualization and reanalysis of the Mischel and Peake data on conscientiousness, distinguishing separate dimensions of studiousness, punctuality, and academic diligence by conceptual and empirical means. We estimated reliabilities of .93, .95, and .86, respectively, for aggregates of 20 behaviors relevant to our reinterpreted dimensions. A major import of these findings is that in drawing inferences about behavior from sample observations, the steps in construct validation (Jackson, 1971; Loevinger, 1957; Wiggins, 1973) do not apply only to tests, but apply equally to other formal and informal assessment situations, such as might be involved in combining behavioral "signs" in a computerized interpretation. The whole assessment procedure should be evaluated. A number of our conclusions (Jackson & Paunonen, 1985) have special relevance to automated test interpretations.
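The role of aggregation here is worth making explicit. By the Spearman-Brown formula (equivalently, standardized coefficient alpha), the reliability of a sum of k indicators with mean intercorrelation r grows steadily with k, so single-behavior correlations near .13 are entirely compatible with dependable aggregate dispositions. A minimal sketch:

```python
def aggregate_reliability(k: int, mean_r: float) -> float:
    """Spearman-Brown / standardized-alpha reliability of the sum of k
    indicators whose mean intercorrelation is mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# A mean intercorrelation of .13 looks damning for any single behavior,
# but an aggregate of 20 such behaviors is far more dependable:
print(round(aggregate_reliability(20, 0.13), 2))  # 0.75
```

The still-higher reliabilities of .93, .95, and .86 reported above presumably reflect the further gain from grouping behaviors into conceptually homogeneous dimensions rather than pooling them indiscriminately.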
First, in drawing an inference about a respondent based on the magnitude of a score representing a trait or disposition, a crucial aspect of construct validation is the explicit definition of traits and of situations, including their theoretical and empirical implications, and their differentiation from other related traits. Second, the structure of behavioral representations of traits and of different situations should be evaluated in a multidimensional framework. For example, if the basis for linking predicted behaviors to scores on a test is expert clinical judgment, it would be fitting to provide expert judges with a set of construct-based trait definitions and to instruct them to perform a multidimensional scaling of these traits and a larger set of predicted behavioral exemplars. Third, a crucial step in the appraisal of the predictability of behavior is its evaluation in a multitrait-multimethod context in which situations are also carefully defined and empirically studied. As an initial step in such an undertaking it is appropriate to employ scales or scores that possess appropriate levels of convergent and discriminant validity. If differential predictions are to be made on the basis of scale scores, or if profile shape is the basis for classification, it can be demonstrated that predictions or classifications will be more accurate if the constituent scales are minimally intercorrelated and discriminantly valid. This is often difficult to achieve because many measures of personality, particularly those of psychopathology, share a large common component reflecting general psychopathology or self-evaluation. The presence of such a large elevation component, while perhaps facilitating the classification of the person's results into a global category of psychopathology, militates against accuracy in differential prediction, for example, of specific manifestations of psychopathology.
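The convergent-discriminant requirement can be stated operationally: in a multitrait-multimethod matrix, correlations between different methods of measuring the same trait should exceed correlations between different traits, whether the latter are measured by the same or different methods. A minimal sketch of that check, using entirely hypothetical correlations for two traits assessed by two methods:

```python
# Hypothetical correlations among four measures: traits A and B, each
# assessed by self-report ("s") and peer rating ("p").
r = {
    ("A_s", "A_p"): 0.62,  # convergent: same trait, different methods
    ("B_s", "B_p"): 0.58,  # convergent
    ("A_s", "B_s"): 0.21,  # discriminant: different traits, same method
    ("A_p", "B_p"): 0.18,  # discriminant
    ("A_s", "B_p"): 0.12,  # discriminant: different traits and methods
    ("A_p", "B_s"): 0.15,  # discriminant
}

def trait(label: str) -> str:
    """Extract the trait letter from a "trait_method" label."""
    return label.split("_")[0]

convergent = [v for (a, b), v in r.items() if trait(a) == trait(b)]
discriminant = [v for (a, b), v in r.items() if trait(a) != trait(b)]

# Campbell-and-Fiske-style criterion: every convergent correlation
# should exceed every discriminant correlation.
print(min(convergent) > max(discriminant))  # True for these values
```

A matrix that fails this check signals the sort of large shared elevation component described above, and with it degraded accuracy in differential prediction.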
The simple implication of the foregoing is that good automated test interpretation systems depend on good tests, a point to which I shall return.