Licensure Testing: Purposes, Procedures, and Practices, ed. James C. Impara (Lincoln, NE: Buros Institute of Mental Measurements, University of Nebraska-Lincoln, 1995).
Testing programs nearly always need examinations that measure the same thing but are composed of different questions (i.e., alternate forms of the same test). When different questions are used, however, there is no assurance that scores on the forms are equivalent; different sets of items might be easier or harder and, therefore, produce higher or lower scores. Equating is used to overcome this problem. Simply stated, it is the design and statistical procedure that permits scores on one form of a test to be comparable to scores on an alternate form.
A hypothetical example will help explain why equating is needed. Suppose Fred takes a certifying examination for aspiring baseball umpires. The examination has 100 questions sampled from the domain of questions about baseball rules and regulations. Fred gets 50 questions right and receives a score of 50. Ethel also takes an examination about baseball rules and regulations, but her test is composed of 100 different items. Ethel gets 70 questions right. Does Ethel know more about baseball than Fred? Or, might it be that Fred's test was much more difficult than Ethel's test, and contrary to appearances, Fred knows more about baseball than Ethel? The answers to these questions lie in equating, the process of ensuring that scores from multiple forms of the same test are comparable.
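The Fred and Ethel comparison can be made concrete with a small numerical sketch. The form means and standard deviations below are hypothetical (the example above does not supply them), and the sketch assumes that the two forms were taken by equivalent groups of examinees, so that differences in form statistics reflect form difficulty alone. Under those assumptions, a simple linear conversion places Fred's score on the scale of Ethel's form by matching standardized scores. It is meant only as an illustration of the idea, not as the procedure any particular testing program uses.

```python
def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Place a Form X raw score on the Form Y scale by matching z-scores.

    Assumes the groups taking the two forms are equivalent, so that
    differences in the form statistics reflect form difficulty only.
    """
    return mean_y + (sd_y / sd_x) * (x - mean_x)


# Hypothetical form statistics (invented for illustration):
# Fred's form (X) is harder, so its mean raw score is lower.
FORM_X_MEAN, FORM_X_SD = 45.0, 10.0   # Fred's form
FORM_Y_MEAN, FORM_Y_SD = 68.0, 10.0   # Ethel's form

fred_raw = 50
ethel_raw = 70

# Express Fred's score on the scale of Ethel's form.
fred_on_y = linear_equate(fred_raw, FORM_X_MEAN, FORM_X_SD, FORM_Y_MEAN, FORM_Y_SD)

print(f"Fred's 50 on Form X is comparable to {fred_on_y:.1f} on Form Y")
print(f"Ethel's score on Form Y is {ethel_raw}")
```

With these invented numbers, Fred's 50 corresponds to a 73 on Ethel's form, slightly above her 70, even though his raw score is 20 points lower. Different assumed form statistics could just as easily preserve or widen Ethel's apparent advantage, which is why raw scores from different forms cannot be compared without equating.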
Equating is a technical topic and it generally requires a considerable background in statistics. The goal of this chapter is to provide a helpful and readable introduction to the issues and concepts, while highlighting useful references that will provide technical details. The chapter begins with some general background and then presents common equating designs and an overview of methods and statistical techniques. For the most often used design, the common-item design, discussion will be expanded and examples will be provided. This will be followed by a consideration of factors that affect the precision of equating and an outline of some basic research questions. Finally, examples of currently available software will be inventoried.
At the outset it should be noted that the term "equating" implies that scores from different forms of a test will be rendered interchangeable. In fact, few data sets ever meet all of the strict assumptions that lead to interchangeable or equated scores. A more technically correct term would be scaled or comparable scores (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1985). In keeping with this notion, an attempt has been made to use the terms "scaled" or "comparable" scores throughout the chapter.
Reasons for Multiple Forms
There are at least three reasons to have multiple forms of a test. The first is security. Many testing programs administer high-stakes examinations in which performance has an important impact upon the examinee and the public: conferring a license or certificate to practice a profession, permitting admittance to a college or other training program, or granting credit for an educational experience. For a test score to have validity in any of these circumstances, it is crucial that it reflect the uncontaminated knowledge and ability of the examinees. Therefore, security is a concern and it is often desirable to give different forms to examinees seated beside each other, those who take the examination on different days, or those who take the examination on more than one occasion (Petersen, Kolen, & Hoover, 1989).
A second and related reason for different test forms is the current movement to open testing. Many programs find it necessary or desirable to release test items to the public (Holland & Rubin, 1982a). When this occurs, it is not possible to use the released items on future forms of a test without providing examinees an unfair advantage.
A third reason for different forms is that test content, and therefore the test questions, must change gradually over time. Knowledge in virtually all occupations and professions evolves, and it is crucial for the test to reflect the current state of practice. For example, it is obvious that today's medical licensure and certification examinations should include questions on HIV and AIDS, whereas these topics were not relevant several years ago. Even when the knowledge does not so obviously change, the context within which test items are presented is at risk of becoming dated. One could imagine a clinical scenario in medicine where descriptions of a patient's condition should be rewritten to include current drugs; in law one might want to include references to timely cases and rulings, especially if they lead to different interpretations of the law. It also sometimes happens that the correct answer to a previously used question simply changes. When this occurs it is necessary to rewrite or replace the item. [As will be discussed later, equating assumes that the test scores are based on parallel forms of the test. Thus, if the changes in content are too severe, it is not appropriate to equate.]