Top 18 Journal of Educational Measurement papers published in 1999

University of Massachusetts Amherst¹

TL;DR: The authors used a Bayesian approach to estimate the probabilities that the true DIF for an item falls into the A, B, or C categories (the True DIF method in our terminology) or to

Abstract: future observed status. DIF status is expressed in terms of the probabilities associated with each of the five DIF levels defined by the ETS classification system: C-, B-, A, B+, and C+. The EB methods yield more stable DIF estimates than do conventional methods, especially in small samples, which is advantageous in computer-adaptive testing. The EB approach may also convey information about DIF stability in a more useful way by representing the state of knowledge about an item's DIF status as probabilistic. The results of a Mantel-Haenszel (MH; Mantel & Haenszel, 1959) analysis of differential item functioning (DIF) typically include an index of the magnitude of DIF, along with an estimated standard error (see Holland & Thayer, 1988). In making decisions about whether to discard items or flag them for review, however, testing companies sometimes rely on categorical ratings of the severity of DIF. Educational Testing Service (ETS) has a system for categorizing DIF as negligible ("A"), slight to moderate ("B") or moderate to severe ("C") based on both the magnitude of the DIF index and the statistical significance of the results. A disadvantage of this classification system is that it sometimes conveys the notion that an item's DIF category is deterministic. A possible solution is to use a Bayesian approach to estimate the probabilities that the true DIF for an item falls into the A, B, or C categories (the True DIF method in our terminology) or to

252 citations

Journal Article•10.1111/J.1745-3984.1999.TB00553.X•

Identifying the causes of DIF in translated verbal items

[...]

Avi Allalouf, Ronald K. Hambleton¹, Stephen G. Sireci¹•Institutions (1)

TL;DR: The authors found that 34% of the items functioned differentially across languages, mostly in favor of the Russian-speaking examinees, and the main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.

Abstract: Translated tests are being used increasingly for assessing the knowledge and skills of individuals who speak different languages. There is little research exploring why translated items sometimes function differently across languages. If the sources of differential item functioning (DIF) across languages could be predicted, it could have important implications for test development, scoring and equating. This study focuses on two questions: "Is DIF related to item type?", "What are the causes of DIF?" The data were taken from the Israeli Psychometric Entrance Test in Hebrew (source) and Russian (translated). The results indicated that 34% of the items functioned differentially across languages. The analogy items were the most problematic with 65% showing DIF, mostly in favor of the Russian-speaking examinees. The sentence completion items were also a problem (45% DIF). The main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.

163 citations

Journal Article•10.1111/J.1745-3984.1999.TB00549.X•

Effects of coaching on SAT I: Reasoning Test scores.

[...]

Donald E. Powers, Donald A. Rock

TL;DR: A College Board-sponsored survey of a nationally representative sample of 1995-96 SAT takers yielded a data base for more than 4,000 examinees, about 500 of whom had attended formal coaching programs outside their schools.

Abstract: A College Board-sponsored survey of a nationally representative sample of 1995–96 SAT takers yielded a data base for more than 4,000 examinees, about 500 of whom had attended formal coaching programs outside their schools. Several alternative analytical methods were used to estimate the effects of coaching on SAT I: Reasoning Test scores. The various analyses produced slightly different estimates. All of the estimates, however, suggested that the effects of coaching are far less than is claimed by major commercial test preparation companies. The revised SAT does not appear to be any more coachable than its predecessor.

149 citations

Journal Article•10.1111/J.1745-3984.1999.TB00558.X•

An Investigation of the Power of the Likelihood Ratio Goodness-of-Fit Statistic in Detecting Differential Item Functioning.

[...]

Robert D. Ankenmann¹, Elizabeth A. Witt, Stephen B. Dunbar¹•Institutions (1)

University of Iowa¹

University of Massachusetts Amherst¹

TL;DR: In this paper, the authors investigated the power and Type I error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model.

Abstract: The purpose of this study was to investigate the power and Type I error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model. A multiple-replication Monte Carlo study was utilized in which DIF was modeled in simulated data sets which were then calibrated with MULTILOG (Thissen, 1991) using hierarchically nested item response models. In addition, the power and Type I error rate of the Mantel (1963) approach for detecting DIF in ordered response categories were investigated using the same simulated data, for comparative purposes. The power of both the Mantel and LR procedures was affected by sample size, as expected. The LR procedure lacked the power to consistently detect DIF when it existed in reference/focal groups with sample sizes as small as 500/500. The Mantel procedure maintained control of its Type I error rate and was more powerful than the LR procedure when the comparison group ability distributions were identical and there was a constant DIF pattern. On the other hand, the Mantel procedure lost control of its Type I error rate, whereas the LR procedure did not, when the comparison groups differed in mean ability; and the LR procedure demonstrated a profound power advantage over the Mantel procedure under conditions of balanced DIF in which the comparison group ability distributions were identical. The choice and subsequent use of any procedure requires a thorough understanding of the power and Type I error rates of the procedure under varying conditions of DIF pattern, comparison group ability distributions (or, as a surrogate, observed score distributions), and item characteristics.

85 citations

Journal Article•10.1111/J.1745-3984.1999.TB00546.X•

Note On Sources of Sampling Variability in Science Performance Assessments

[...]

Richard J. Shavelson¹, Maria Araceli Ruiz-Primo¹, Edward W. Wiley¹•Institutions (1)

Stanford University¹

01 Jan 1999-Journal of Educational Measurement

TL;DR: The authors found that large task-sampling variability was due to both the person x task interaction and the person x task x occasion interaction, and concluded that direct observation, notebook and computer simulation methods were equally exchangeable, but their exchangeability was limited by the volatility of student performances across tasks and occasions.

Abstract: In 1993, we reported in Journal of Educational Measurement that task-sampling variability was the Achilles’ heel of science performance assessment. To reduce measurement error, tasks needed to be stratified before sampling, sampled in large number, or possibly both. However, Cronbach, Linn, Brennan, & Haertel (1997) pointed out that a task-sampling interpretation of a large person x task variance component might be incorrect. Task and occasion sampling are confounded because tasks are typically given on only a single occasion. The person x task (pt) source of measurement error is then confounded with the pt x occasion (pto) source. If pto variability accounts for a substantial part of the commonly observed pt interaction, stratifying tasks into homogeneous subsets (a cost-effective way of addressing task sampling variability) might not increase accuracy. Stratification would not address the pto source of error. Another conclusion reported in JEM was that only direct observation (DO) and notebook (NB) methods of collecting performance assessment data were exchangeable; computer simulation, short-answer, and multiple-choice methods were not. However, if Cronbach et al. were right, our exchangeability conclusion might be incorrect. After re-examining and re-analyzing data, we found support for Cronbach et al. We concluded that large task-sampling variability was due to both the person x task interaction and person x task x occasion interaction. Moreover, we found that direct observation, notebook and computer simulation methods were equally exchangeable, but their exchangeability was limited by the volatility of student performances across tasks and occasions.

85 citations

Journal Article•10.1111/J.1745-3984.1999.TB00559.X•

Alternative Displays for Communicating NAEP Results: A Redesign and Validity Study

[...]

Howard Wainer, Ronald K. Hambleton¹, Kevin Meara¹•Institutions (1)

National Board of Medical Examiners¹

TL;DR: In this article, five displays, chosen from the NAEP 1994 Reading: A First Look, were redesigned using the principles developed and enunciated in Wainer's 1997 popular text Visual Revelations.

Abstract: Five displays, chosen from the NAEP 1994 Reading: A First Look, were redesigned. The redesign was informed by the principles developed and enunciated in Wainer's 1997 popular text Visual Revelations. After the redesign was completed, a survey of educational policymakers was done in which substantive questions were asked about the content of the various displays. Each redesign was paired with the original and was assigned randomly to one of two survey forms. We found that, on average, the redesigns yielded both more accurate and faster answers to the questions asked. The more difficult the question, the greater the disparity between the original format and the redesigned one.

60 citations

Journal Article•10.1111/J.1745-3984.1999.TB00544.X•

Components of Rater Error in a Complex Performance Assessment

[...]

Brian E. Clauser¹, Stephen G. Clyman¹, David B. Swanson¹•Institutions (1)

Florida State University¹

TL;DR: In this paper, the authors present two studies that focus on aspects of the rating process that are not explicitly accounted for in this typical design, and highlight the importance of clearly specifying the relevant facets of the universe of interest.

Abstract: Numerous studies have examined performance assessment data using generalizability theory. Typically, these studies have treated raters as randomly sampled from a population, with each rater judging a given performance on a single occasion. This paper presents two studies that focus on aspects of the rating process that are not explicitly accounted for in this typical design. The first study makes explicit the “committee” facet, acknowledging that raters often work within groups. The second study makes explicit the “rating-occasion” facet by having each rater judge each performance on two separate occasions. The results of the first study highlight the importance of clearly specifying the relevant facets of the universe of interest. Failing to include the committee facet led to an overly optimistic estimate of the precision of the measurement procedure. By contrast, failing to include the rating-occasion facet, in the second study, had minimal impact on the estimated error variance.

45 citations

Journal Article•10.1111/J.1745-3984.1999.TB00560.X•

A Cautionary Note on IRT-Based Linking of Tests With Polytomous Items

[...]

Richard L. Tate¹•Institutions (1)

TL;DR: The authors argued that a modification of the traditional linking is necessary when tests consist of constructed-response items judged by raters and there is a possibility of year-to-year variation in the rating discrimination and severity.

Abstract: Published discussions of the year-to-year linking of tests comprised of polytomous items appear to suggest that the linking logic traditionally used for multiple-choice items is also appropriate for polytomous items. It is argued and illustrated that a modification of the traditional linking is necessary when tests consist of constructed-response items judged by raters and there is a possibility of year-to-year variation in the rating discrimination and severity.

42 citations

Journal Article•10.1111/J.1745-3984.1999.TB00550.X•

Contextual Explanations of Local Dependence in Item Clusters in a Large Scale Hands-On Science Performance Assessment

[...]

Steven Ferrara¹, Huynh Huynh², Hillary R. Michaels³•Institutions (3)

American Institutes for Research¹, University of South Carolina², CTB/McGraw Hill³

TL;DR: In this paper, the authors provide hypothesized explanations for local item dependence (LID) in a large scale hands-on science performance assessment using contextual analysis procedures described in this and other studies.

Abstract: This study provides hypothesized explanations for local item dependence (LID) in a large scale hands-on science performance assessment. Items within multi-step item clusters were classified as low or high in LID using contextual analysis procedures described in this and other studies. LID was identified statistically using the average within cluster (AWC) correlation procedure described in previous studies. Levels of LID identified in contextual analyses were compared to levels of LID identified in correlation analyses. Consistent with other studies, items that appear to elicit locally dependent responses require examinees to answer and explain their answer or to use given or generated information to respond

38 citations

Journal Article•10.1111/J.1745-3984.1999.TB00555.X•

Stepwise Analysis of Differential Item Functioning Based on Multiple‐Group Partial Credit Model

[...]

Eiji Muraki

TL;DR: In this paper, uniform and non-uniform DIF items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups, and a stepwise DIF analysis based on the multiple-group partial credit model was then applied to the National Assessment of Educational Progress (NAEP) writing trend data.

Abstract: Bock, Muraki, and Pfeiffenberger (1988) proposed a dichotomous item response theory (IRT) model for the detection of differential item functioning (DIF), and they estimated the IRT parameters and the means and standard deviations of the multiple latent trait distributions. This IRT DIF detection method is extended to the partial credit model (Masters, 1982; Muraki, 1993) and presented as one of the multiple-group IRT models. Uniform and non-uniform DIF items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups. The DIF method was applied to this simulated data using a stepwise procedure. The standardized DIF measures for slope and item location parameters successfully detected the non-uniform and uniform DIF items as well as recovered the means and standard deviations of the latent trait distributions. This stepwise DIF analysis based on the multiple-group partial credit model was then applied to the National Assessment of Educational Progress (NAEP) writing trend data.

32 citations

Journal Article•10.1111/J.1745-3984.1999.TB00561.X•

Possible Determinants of Differential Item Functioning: Familiarity, Interest, and Emotional Reaction.

[...]

Lawrence J. Stricker, Walter Emmerich

TL;DR: In this paper, the authors evaluated the connection between gender differences in examinees' familiarity, interest, and negative emotional reactions to items on the Advanced Placement Psychology Examination and the items' gender differential item functioning (DIF).

Abstract: This study evaluated the connection between gender differences in examinees’ familiarity, interest, and negative emotional reactions to items on the Advanced Placement Psychology Examination and the items’ gender differential item functioning (DIF). Gender DIF and gender differences in interest varied appreciably with the content of the items. Gender differences in the three variables were substantially related to the items’ gender DIF (e.g., R = .50). Much of the gender DIF on this test may be attributable to gender differences in these variables.

Journal Article•10.1111/J.1745-3984.1999.TB00551.X•

Can Examinees Use a Review Option to Obtain Positively Biased Ability Estimates on a Computerized Adaptive Test

[...]

Walter P. Vispoel¹, Thomas R. Rocklin¹, Tianyou Wang², Timothy Bleiler³•Institutions (3)

University of Iowa¹, The American College of Financial Services², University at Buffalo³

TL;DR: This article showed that there were instances in which the Wainer strategy yielded inflated ability estimates as well as instances in which it yielded deflated ability estimates, depending on the ability estimation method used (maximum likelihood versus Bayesian), the examinee's true ability level, the standard error of the ability estimate, the examinee's ability to implement the strategy, and the type of decision made from the estimate.

Abstract: Part of the controversy about allowing examinees to review and change answers to previous items on computerized adaptive tests (CATs) centers on a strategy for obtaining positively biased ability estimates attributed to Wainer (1993) in which examinees intentionally answer items incorrectly before review and to the best of their abilities upon review. Our results, based on both simulated and live testing data, showed that there were instances in which the Wainer strategy yielded inflated ability estimates as well as instances in which it yielded deflated ability estimates. The success of the strategy in inflating ability estimates depended on the ability estimation method used (maximum likelihood versus Bayesian), the examinee's true ability level, the standard error of the ability estimate, the examinee's ability to implement the strategy, and the type of decision made from the ability estimate. We discuss approaches to dealing with the Wainer strategy in operational CAT settings. The terms "review" and "item review" in testing contexts refer to administrative rules that allow examinees to change their responses to previously answered items. Review is a non-controversial element of most fixed-item, paper-and-pencil administered tests. Over a half century of research on review supports several conclusions: (a) only a very small percentage of answers are changed; (b) more answers are changed from wrong to right than from right to wrong (or wrong to wrong),

Journal Article•10.1111/J.1745-3984.1999.TB00547.X•

Measurement Error, Multidimensionality, and Scale Shrinkage: A Reply to Yen and Burket

[...]

Gregory Camilli¹•Institutions (1)

Rutgers University¹

TL;DR: In this article, select research on reliability, multidimensionality and scale shrinkage is reviewed, and these perspectives are then integrated with issues of practical importance to educators, such as the importance of scale in vertical equating.

Abstract: Yen and Burket (1997) provided a critical analysis of the work by Camilli (1988) regarding scale shrinkage, and suggested that shrinkage in vertical equating cannot be understood apart from multidimensionality. In this response, select research on reliability, multidimensionality and scale shrinkage is reviewed. These perspectives are then integrated with issues of practical importance to educators.

Journal Article•10.1111/J.1745-3984.1999.TB00556.X•

Psychometric and Cognitive Functioning of an Under-Determined Computer-Based Response Type for Quantitative Reasoning

[...]

Randy Elliot Bennett, Mary Morley, Dennis Quardt, Donald A. Rock, Mark K. Singley¹, Irvin R. Katz², Adisack Nhouyvanisvong³•Institutions (3)

IBM¹, George Mason University², Carnegie Mellon University³

TL;DR: This paper evaluated a computer-delivered response type for measuring quantitative skill and found that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills.

Abstract: We evaluated a computer-delivered response type for measuring quantitative skill. "Generating Examples" (GE) presents under-determined problems that can have many right answers. We administered two GE tests that differed in the manipulation of specific item features hypothesized to affect difficulty. Analyses related to internal consistency reliability, external relations, and features contributing to item difficulty, adverse impact, and examinee perceptions. Results showed that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills. Item features that increased difficulty included asking examinees to supply more than one correct answer and to identify whether an item was solvable. Gender differences were similar to those found on the GRE quantitative and analytical test sections. Finally, examinees were divided on whether GE items were a fairer indicator of ability than multiple-choice items, but still overwhelmingly preferred to take the more conventional questions.

Journal Article•10.1111/J.1745-3984.1999.TB00545.X•

The Effect of Model Misspecification on Classification Decisions Made Using a Computerized Test

[...]

John C. Kalohn, Judith A. Spray

TL;DR: This study examined the effects of item response theory model misspecification on the precision of the decisions made using the sequential probability ratio test (SPRT) in a computerized classification test (CCT).

Abstract: Many computerized testing algorithms require the fitting of some item response theory (IRT) model to examinees' responses to facilitate item selection, the determination of test stopping rules, and classification decisions. Some IRT models are thought to be particularly useful for small volume certification programs that wish to make the transition to computerized adaptive testing (CAT). The one-parameter logistic model (1-PLM) is usually assumed to require a smaller sample size than the three-parameter logistic model (3-PLM) for item parameter calibrations. This study examined the effects of model misspecification on the precision of the decisions made using the sequential probability ratio test (SPRT). For this comparison, the 1-PLM was used to estimate item parameters, even though the items' characteristics were represented by a 3-PLM. Results demonstrated that the 1-PLM produced considerably more decision errors under simulation conditions similar to a real testing environment, compared to the true model and to a fixed-form standard reference set of items. In certification and licensure testing, a balance must be maintained between minimizing costs for clients and ensuring the protection of the public in terms of making valid decisions regarding minimum competency. To remain competitive, a testing organization must be able to offer clients testing services at the lowest price allowable while providing good measurement. These services frequently include computerized testing. There are many forms of computerized testing available today. These vary from the simple administration of a fixed form on a computer to computerized adaptive testing for the estimation of a broad range of examinee abilities. The primary focus of this paper is related to item response theory model selection and the computerized classification test (CCT). A CCT is an examination that is designed to function optimally at making pass/fail decisions.
When determining if a credentialing or licensure client is ready to implement CCT, many factors are taken into consideration. These factors include the current size and status of the client's item pool, the testing volume (i.e., the number of examinees who take each test form), the test administration frequency, the frequency of pretesting, the nature of the examinee population (i.e., first-time test takers versus recertifiers or advanced-level practitioners), and the current test blueprint or test content outline. It is desirable, from the client's point of view, to know the predicted impact of making the transition from a fixed-length, fixed-form testing format to a test of variable length consisting of different items for different examinees. The outcomes or effects of a change in testing format could include changes in the average testing time or test length for an examinee, the average

Journal Article•

The validity of comparative educational studies

[...]

Cornelis de Glopper¹, Timo M. Bechger, G.L.H. van den Wittenboer, Joop J. Hox•Institutions (1)

University of Amsterdam¹

01 Jan 1999-Journal of Educational Measurement

TL;DR: In this article, the authors discuss the conditions that allow appropriate comparisons among groups and ask whether the concept of comparative validity should be added to the lexicon of the psychometrician.

Abstract: What are the conditions that will allow for appropriate comparisons among groups? Should the concept of comparative validity be added to the lexicon of the psychometrician?

Journal Article•10.1111/J.1745-3984.1999.TB00552.X•

"Mental Model" Comparison of Automated and Human Scoring

[...]

David M. Williamson, Isaac I. Bejar, Anne S. Hone

TL;DR: It is concluded that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency, and mental models used by automated scoring are contrasted with those used by experienced human graders.

Abstract: ‘Mental models’ used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses suggest high correspondence between automated and human scores, thereby suggesting similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria; objective criteria; tolerances/weighting; details; examinee task interpretation; and unjustified. The tendency of the human graders to be compelled by automated score rationales varied by the nature of original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, there remain some important differences, both intentional and incidental, which distinguish between human and automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency.

Journal Article•10.1111/J.1745-3984.1999.TB00557.X•

Multiple Objective Test Assembly Problems

[...]

Bernard P. Veldkamp¹•Institutions (1)

University of Twente¹