From: The Computer and the Decision-Making Process, edited by Terry B. Gutkin and Steven L. Wise (Hillsdale, New Jersey, Hove & London: Lawrence Erlbaum Associates, 1991).
Testing by computer is big business. Many companies are offering software enabling a psychologist to test a client by seating him or her at a computer terminal and pressing Return. The software presents the instructions on the screen, guides the test taker through some sample items to see if the instructions are understood, and then presents the test, automatically recording the responses. After one or more tests have been completed, the equipment scores the responses, and delivers test scores. But it doesn't stop there. It then continues by printing out a complete test interpretation in fairly well-constructed narrative prose. The prose often shows a few signs of having been pasted together out of standard phrases, sentences, and paragraphs, but then so do many reports written by real psychologists.
The proliferation of testing systems and automated test interpreters has generated consternation among some clinical psychologists. Matarazzo (1983) cried "Wolf" in an editorial in Science, and went a little far, seeming to condemn all computerized testing. I replied (Green, 1983b) that there is much less concern about the computer giving the test than about the computer interpreting the test. In fact, a group at the Navy Personnel Research and Development Center in San Diego (McBride & Martin, 1983; Moreno, Wetzel, McBride, & Weiss, 1984) had just successfully transferred the Armed Services Vocational Aptitude Battery to the computer, with no major difficulties.
The Navy group used Computerized Adaptive Testing (CAT), the most important advance in cognitive testing (Green, 1983a; Weiss, 1985). In a CAT, the computer chooses the next item to be administered on the basis of the responses to the previous items. This procedure requires a new kind of test theory; classical test theory is not adequate. The new theory is called item response theory (IRT), and is now quite well developed, although it is still new and cumbersome. Using IRT, a computer can readily tailor the test to each test taker. The Navy group has successfully used the technique to administer the Armed Services Vocational Aptitude Battery (ASVAB). It has been found that a conventional test can be replaced by an adaptive test with about half the items, at no loss of reliability or validity. For many test takers, a conventional test has a lot of wasted items: items that are too easy for the good students and items that are too hard for the poor students. If the items are chosen to be most informative about the individual test taker, a lot of time can be saved. Of course, this means developing an estimate of the test taker's ability as the test progresses, and it implies many intermediate calculations, but the computer is good at that. An interesting by-product of CAT is that nearly everybody who takes it likes it. Such a test provides more success experiences than the lower half of the ability spectrum is used to, and does not seem to disconcert the high scorers. Also, the computer is responsive. As soon as an answer is input, another item appears on the screen; the computer is attending to the test taker in an active way that an answer sheet cannot emulate. Hardwicke and Yoes (1984) report that one recruit said, of the CAT version of the ASVAB, "It's faster, it's funner, and it's more easier."
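The adaptive loop described above can be illustrated with a minimal sketch. It is not the Navy group's procedure: the two-parameter logistic item model, the grid-based posterior-mean ability estimate with a standard normal prior, and the maximum-information selection rule are all assumptions chosen for brevity, and every function name and item parameter here is hypothetical.

```python
import math

# Assumption: a two-parameter logistic (2PL) IRT model, where each item
# is a pair (a, b) = (discrimination, difficulty). Illustrative only.
def p_correct(theta, a, b):
    """Probability that a test taker of ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information the item provides about ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Ability grid from -4.0 to 4.0 in steps of 0.1 (an arbitrary choice).
GRID = [g / 10.0 for g in range(-40, 41)]

def estimate_ability(responses, items):
    """Posterior mean of ability over GRID, with a standard normal prior.

    responses is a list of (item_index, answered_correctly) pairs.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for idx, correct in responses:
            a, b = items[idx]
            p = p_correct(theta, a, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(GRID, weights)) / total

def next_item(theta, items, used):
    """Pick the unused item that is most informative at the current estimate."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))

def run_cat(items, answer_fn, test_length):
    """Administer test_length items, updating the estimate after each answer."""
    responses = []
    used = set()
    theta = 0.0  # start at the prior mean
    for _ in range(test_length):
        idx = next_item(theta, items, used)
        used.add(idx)
        responses.append((idx, answer_fn(idx)))
        theta = estimate_ability(responses, items)
    return theta, responses

if __name__ == "__main__":
    # Hypothetical five-item pool with equal discrimination.
    pool = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
    # Hypothetical test taker who answers correctly when difficulty is below 1.
    theta, responses = run_cat(pool, lambda i: pool[i][1] < 1.0, 3)
    print(responses, round(theta, 2))
```

In this sketch a correct answer raises the ability estimate, so the next item chosen is harder, and an incorrect answer does the reverse. That is the sense in which items "too easy" or "too hard" for a given test taker are skipped.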
Although computerized administration seemed to be working well in the cognitive area, there was more concern about personality tests. The American Psychological Association began getting several calls each week from its members asking about, or complaining about, computerized testing. Apparently, some guidelines were needed for the users and the developers of computer-based tests and assessments. We hoped to stimulate orderly, controlled growth in an important and volatile field. The Guidelines (APA, 1986; see Appendix) address the development, use, and technical evaluation of computerized tests and test interpretations. They emphasize personality tests and personality assessments, but are relevant to all computer testing.
Why develop guidelines when we have just finished congratulating ourselves about the new joint Testing Standards (APA, AERA, & NCME, 1985)? Because the Testing Standards cover this situation only in a generic sort of way, and deserve amplification in particular details, especially for computer-based assessments, that is, narrative interpretations. The new Guidelines are viewed as a special application of the new Testing Standards and as subordinate to them in case of any perceived conflict.