Scispace (Formerly Typeset)
Showing papers in "Journal of Educational Measurement in 1999"
Journal Article•10.1111/J.1745-3984.1999.TB00543.X•
An Empirical Bayes Approach to Mantel-Haenszel DIF Analysis.

Rebecca Zwick1, Dorothy T. Thayer, Charles Lewis•
University of California, Santa Barbara1
01 Mar 1999-Journal of Educational Measurement
TL;DR: The authors used a Bayesian approach to estimate the probabilities that the true DIF for an item falls into the A, B, or C categories (the True DIF method in our terminology) or to
Abstract: future observed status. DIF status is expressed in terms of the probabilities associated with each of the five DIF levels defined by the ETS classification system: C-, B-, A, B+, and C+. The EB methods yield more stable DIF estimates than do conventional methods, especially in small samples, which is advantageous in computer-adaptive testing. The EB approach may also convey information about DIF stability in a more useful way by representing the state of knowledge about an item's DIF status as probabilistic. The results of a Mantel-Haenszel (MH; Mantel & Haenszel, 1959) analysis of differential item functioning (DIF) typically include an index of the magnitude of DIF, along with an estimated standard error (see Holland & Thayer, 1988). In making decisions about whether to discard items or flag them for review, however, testing companies sometimes rely on categorical ratings of the severity of DIF. Educational Testing Service (ETS) has a system for categorizing DIF as negligible ("A"), slight to moderate ("B") or moderate to severe ("C") based on both the magnitude of the DIF index and the statistical significance of the results. A disadvantage of this classification system is that it sometimes conveys the notion that an item's DIF category is deterministic. A possible solution is to use a Bayesian approach to estimate the probabilities that the true DIF for an item falls into the A, B, or C categories (the True DIF method in our terminology) or to

252 citations
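
The Mantel-Haenszel index of DIF magnitude mentioned in this abstract can be illustrated with a short sketch. It uses the standard textbook definitions: the MH common odds ratio pooled over matched score strata, and the ETS delta-scale index MH D-DIF = -2.35 ln(alpha_MH). It does not implement the empirical Bayes method of the paper. The function name `mh_d_dif` and all counts are hypothetical, chosen only for illustration.

```python
import math

def mh_d_dif(strata):
    """Mantel-Haenszel common odds ratio and MH D-DIF.

    strata: list of 2x2 tables, one per matched score level, each given as
    (ref_right, ref_wrong, focal_right, focal_wrong).
    Raises ZeroDivisionError if the denominator sum is zero.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty strata
        num += a * d / n
        den += b * c / n
    alpha = num / den
    # ETS delta metric: negative values indicate the item favors the reference group
    return alpha, -2.35 * math.log(alpha)

# Hypothetical counts for three score strata (not data from the paper)
strata = [
    (20, 30, 15, 35),
    (40, 20, 30, 30),
    (50, 10, 45, 15),
]
alpha_mh, d_dif = mh_d_dif(strata)
print(f"alpha_MH = {alpha_mh:.3f}, MH D-DIF = {d_dif:.3f}")
```

With these made-up counts the reference group has higher odds of a correct answer in every stratum, so the odds ratio exceeds 1 and MH D-DIF is negative.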

Journal Article•10.1111/J.1745-3984.1999.TB00553.X•
Identifying the causes of DIF in translated verbal items

Avi Allalouf, Ronald K. Hambleton1, Stephen G. Sireci1•
University of Massachusetts Amherst1
01 Sep 1999-Journal of Educational Measurement
TL;DR: The authors found that 34% of the items functioned differentially across languages, mostly in favor of the Russian-speaking examinees, and the main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.
Abstract: Translated tests are being used increasingly for assessing the knowledge and skills of individuals who speak different languages. There is little research exploring why translated items sometimes function differently across languages. If the sources of differential item functioning (DIF) across languages could be predicted, it could have important implications on test development, scoring and equating. This study focuses on two questions: "Is DIF related to item type?", "What are the causes of DIF?" The data were taken from the Israeli Psychometric Entrance Test in Hebrew (source) and Russian (translated). The results indicated that 34% of the items functioned differentially across languages. The analogy items were the most problematic with 65% showing DIF, mostly in favor of the Russian-speaking examinees. The sentence completion items were also a problem (45% DIF). The main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.

163 citations

Journal Article•10.1111/J.1745-3984.1999.TB00549.X•
Effects of coaching on SAT I: Reasoning Test scores.

Donald E. Powers, Donald A. Rock
01 Jun 1999-Journal of Educational Measurement
TL;DR: A College Board-sponsored survey of a nationally representative sample of 1995-96 SAT takers yielded a data base for more than 4,000 examinees, about 500 of whom had attended formal coaching programs outside their schools.
Abstract: A College Board-sponsored survey of a nationally representative sample of 1995–96 SAT takers yielded a data base for more than 4,000 examinees, about 500 of whom had attended formal coaching programs outside their schools. Several alternative analytical methods were used to estimate the effects of coaching on SAT I: Reasoning Test scores. The various analyses produced slightly different estimates. All of the estimates, however, suggested that the effects of coaching are far less than is claimed by major commercial test preparation companies. The revised SAT does not appear to be any more coachable than its predecessor.

149 citations

Journal Article•10.1111/J.1745-3984.1999.TB00558.X•
An Investigation of the Power of the Likelihood Ratio Goodness-of-Fit Statistic in Detecting Differential Item Functioning.

Robert D. Ankenmann1, Elizabeth A. Witt, Stephen B. Dunbar1•
University of Iowa1
01 Dec 1999-Journal of Educational Measurement
TL;DR: In this paper, the authors investigated the power and error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model.
Abstract: The purpose of this study was to investigate the power and Type I error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model. A multiple-replication Monte Carlo study was utilized in which DIF was modeled in simulated data sets which were then calibrated with MULTILOG (Thissen, 1991) using hierarchically nested item response models. In addition, the power and Type I error rate of the Mantel (1963) approach for detecting DIF in ordered response categories were investigated using the same simulated data, for comparative purposes. The power of both the Mantel and LR procedures was affected by sample size, as expected. The LR procedure lacked the power to consistently detect DIF when it existed in reference/focal groups with sample sizes as small as 500/500. The Mantel procedure maintained control of its Type I error rate and was more powerful than the LR procedure when the comparison group ability distributions were identical and there was a constant DIF pattern. On the other hand, the Mantel procedure lost control of its Type I error rate, whereas the LR procedure did not, when the comparison groups differed in mean ability; and the LR procedure demonstrated a profound power advantage over the Mantel procedure under conditions of balanced DIF in which the comparison group ability distributions were identical. The choice and subsequent use of any procedure requires a thorough understanding of the power and Type I error rates of the procedure under varying conditions of DIF pattern, comparison group ability distributions (or, as a surrogate, observed score distributions), and item characteristics.

85 citations

Journal Article•10.1111/J.1745-3984.1999.TB00546.X•
Note On Sources of Sampling Variability in Science Performance Assessments

Richard J. Shavelson1, Maria Araceli Ruiz-Primo1, Edward W. Wiley1•
Stanford University1
01 Jan 1999-Journal of Educational Measurement
TL;DR: The authors found that large task-sampling variability was due to both the person x task interaction and the person x task x occasion interaction, and concluded that direct observation, notebook and computer simulation methods were equally exchangeable, but their exchangeability was limited by the volatility of student performances across tasks and occasions.
Abstract: In 1993, we reported in Journal of Educational Measurement that task-sampling variability was the Achilles’ heel of science performance assessment. To reduce measurement error, tasks needed to be stratified before sampling, sampled in large number, or possibly both. However, Cronbach, Linn, Brennan, & Haertel (1997) pointed out that a task-sampling interpretation of a large person x task variance component might be incorrect. Task and occasion sampling are confounded because tasks are typically given on only a single occasion. The person x task source of measurement error is then confounded with the pt x occasion source. If pto variability accounts for a substantial part of the commonly observed pt interaction, stratifying tasks into homogeneous subsets (a cost-effective way of addressing task sampling variability) might not increase accuracy. Stratification would not address the pto source of error. Another conclusion reported in JEM was that only direct observation (DO) and notebook (NB) methods of collecting performance assessment data were exchangeable; computer simulation, short-answer, and multiple-choice methods were not. However, if Cronbach et al. were right, our exchangeability conclusion might be incorrect. After re-examining and re-analyzing data, we found support for Cronbach et al. We concluded that large task-sampling variability was due to both the person x task interaction and person x task x occasion interaction. Moreover, we found that direct observation, notebook and computer simulation methods were equally exchangeable, but their exchangeability was limited by the volatility of student performances across tasks and occasions.

85 citations

Journal Article•10.1111/J.1745-3984.1999.TB00559.X•
Alternative Displays for Communicating NAEP Results: A Redesign and Validity Study

Howard Wainer, Ronald K. Hambleton1, Kevin Meara1•
University of Massachusetts Amherst1
01 Dec 1999-Journal of Educational Measurement
TL;DR: In this article, five displays, chosen from the NAEP 1994 Reading: A First Look, were redesigned using the principles developed and enunciated in Wainer's 1997 popular text Visual Revelations.
Abstract: Five displays, chosen from the NAEP 1994 Reading: A First Look, were redesigned. The redesign was informed by the principles developed and enunciated in Wainer's 1997 popular text Visual Revelations. After the redesign was completed a survey of educational policymakers was done in which substantive questions were asked about the content of the various displays. Each redesign was paired with the original and were assigned randomly to one of two survey forms. We found that, on average, the redesigns yielded both more accurate and faster answers to the questions asked. The more difficult the question the greater the disparity between the original format and the redesigned one.

60 citations

Journal Article•10.1111/J.1745-3984.1999.TB00544.X•
Components of Rater Error in a Complex Performance Assessment

Brian E. Clauser1, Stephen G. Clyman1, David B. Swanson1•
National Board of Medical Examiners1
01 Mar 1999-Journal of Educational Measurement
TL;DR: In this paper, the authors present two studies that focus on aspects of the rating process that are not explicitly accounted for in this typical design, and highlight the importance of clearly specifying the relevant facets of the universe of interest.
Abstract: Numerous studies have examined performance assessment data using generalizability theory. Typically, these studies have treated raters as randomly sampled from a population, with each rater judging a given performance on a single occasion. This paper presents two studies that focus on aspects of the rating process that are not explicitly accounted for in this typical design. The first study makes explicit the “committee” facet, acknowledging that raters often work within groups. The second study makes explicit the “rating-occasion” facet by having each rater judge each performance on two separate occasions. The results of the first study highlight the importance of clearly specifying the relevant facets of the universe of interest. Failing to include the committee facet led to an overly optimistic estimate of the precision of the measurement procedure. By contrast, failing to include the rating-occasion facet, in the second study, had minimal impact on the estimated error variance.

45 citations

Journal Article•10.1111/J.1745-3984.1999.TB00560.X•
A Cautionary Note on IRT-Based Linking of Tests With Polytomous Items

Richard L. Tate1•
Florida State University1
01 Dec 1999-Journal of Educational Measurement
TL;DR: The authors argued that a modification of the traditional linking is necessary when tests consist of constructed-response items judged by raters and there is a possibility of year-to-year variation in the rating discrimination and severity.
Abstract: Published discussions of the year-to-year linking of tests comprised of polytomous items appear to suggest that the linking logic traditionally used for multiple-choice items is also appropriate for polytomous items. It is argued and illustrated that a modification of the traditional linking is necessary when tests consist of constructed-response items judged by raters and there is a possibility of year-to-year variation in the rating discrimination and severity.

42 citations

Journal Article•10.1111/J.1745-3984.1999.TB00550.X•
Contextual Explanations of Local Dependence in Item Clusters in a Large Scale Hands-On Science Performance Assessment

Steven Ferrara1, Huynh Huynh2, Hillary R. Michaels3•
American Institutes for Research1, University of South Carolina2, CTB/McGraw Hill3
01 Jun 1999-Journal of Educational Measurement
TL;DR: In this paper, the authors provide hypothesized explanations for local item dependence (LID) in a large scale hands-on science performance assessment using contextual analysis procedures described in this and other studies.
Abstract: This study provides hypothesized explanations for local item dependence (LID) in a large scale hands-on science performance assessment. Items within multi-step item clusters were classified as low or high in LID using contextual analysis procedures described in this and other studies. LID was identified statistically using the average within cluster (AWC) correlation procedure described in previous studies. Levels of LID identified in contextual analyses were compared to levels of LID identified in correlation analyses. Consistent with other studies, items that appear to elicit locally dependent responses require examinees to answer and explain their answer or to use given or generated information to respond

38 citations

Journal Article•10.1111/J.1745-3984.1999.TB00555.X•
Stepwise Analysis of Differential Item Functioning Based on Multiple‐Group Partial Credit Model

Eiji Muraki
01 Sep 1999-Journal of Educational Measurement
TL;DR: In this paper, a stepwise DIF analysis based on the multiple-group partial credit model was evaluated on simulated polytomous responses generated with uniform and non-uniform DIF items and heterogeneous latent trait distributions, and then applied to the National Assessment of Educational Progress (NAEP) writing trend data.
Abstract: Bock, Muraki, and Pfeiffenberger (1988) proposed a dichotomous item response theory (IRT) model for the detection of differential item functioning (DIF), and they estimated the IRT parameters and the means and standard deviations of the multiple latent trait distributions. This IRT DIF detection method is extended to the partial credit model (Masters, 1982; Muraki, 1993) and presented as one of the multiple-group IRT models. Uniform and non-uniform DIF items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups. The DIF method was applied to this simulated data using a stepwise procedure. The standardized DIF measures for slope and item location parameters successfully detected the non-uniform and uniform DIF items as well as recovered the means and standard deviations of the latent trait distributions. This stepwise DIF analysis based on the multiple-group partial credit model was then applied to the National Assessment of Educational Progress (NAEP) writing trend data.

32 citations

Journal Article•10.1111/J.1745-3984.1999.TB00561.X•
Possible Determinants of Differential Item Functioning: Familiarity, Interest, and Emotional Reaction.

Lawrence J. Stricker, Walter Emmerich
01 Dec 1999-Journal of Educational Measurement
TL;DR: In this paper, the authors evaluated the connection between gender differences in examinees' familiarity, interest, and negative emotional reactions to items on the Advanced Placement Psychology Examination and the items' gender differential item functioning (DIF).
Abstract: This study evaluated the connection between gender differences in examinees’ familiarity, interest, and negative emotional reactions to items on the Advanced Placement Psychology Examination and the items’ gender differential item functioning (DIF). Gender DIF and gender differences in interest varied appreciably with the content of the items. Gender differences in the three variables were substantially related to the items’ gender DIF (e.g., R = .50). Much of the gender DIF on this test may be attributable to gender differences in these variables.
Journal Article•10.1111/J.1745-3984.1999.TB00551.X•
Can Examinees Use a Review Option to Obtain Positively Biased Ability Estimates on a Computerized Adaptive Test

Walter P. Vispoel1, Thomas R. Rocklin1, Tianyou Wang2, Timothy Bleiler3•
University of Iowa1, The American College of Financial Services2, University at Buffalo3
01 Jun 1999-Journal of Educational Measurement
TL;DR: This article showed that there were instances in which the Wainer strategy yielded inflated ability estimates as well as instances in which it yielded deflated ability estimates, depending on the ability estimation method used (maximum likelihood versus Bayesian), the examinee's true ability level, the standard error of the ability estimate, the examinee's ability to implement the strategy, and the type of decision made from the estimate.
Abstract: Part of the controversy about allowing examinees to review and change answers to previous items on computerized adaptive tests (CATs) centers on a strategy for obtaining positively biased ability estimates attributed to Wainer (1993) in which examinees intentionally answer items incorrectly before review and to the best of their abilities upon review. Our results, based on both simulated and live testing data, showed that there were instances in which the Wainer strategy yielded inflated ability estimates as well as instances in which it yielded deflated ability estimates. The success of the strategy in inflating ability estimates depended on the ability estimation method used (maximum likelihood versus Bayesian), the examinee's true ability level, the standard error of the ability estimate, the examinee's ability to implement the strategy, and the type of decision made from the ability estimate. We discuss approaches to dealing with the Wainer strategy in operational CAT settings. The terms "review" and "item review" in testing contexts refer to administrative rules that allow examinees to change their responses to previously answered items. Review is a non-controversial element of most fixed-item, paper-and-pencil administered tests. Over a half century of research on review supports several conclusions: (a) only a very small percentage of answers are changed; (b) more answers are changed from wrong to right than from right to wrong (or wrong to wrong),
Journal Article•10.1111/J.1745-3984.1999.TB00547.X•
Measurement Error, Multidimensionality, and Scale Shrinkage: A Reply to Yen and Burket

Gregory Camilli1•
Rutgers University1
01 Mar 1999-Journal of Educational Measurement
TL;DR: In this article, select research on reliability, multidimensionality and scale shrinkage is reviewed, and these perspectives are then integrated with issues of practical importance to educators.
Abstract: Yen and Burket (1997) provided a critical analysis of the work by Camilli (1988) regarding scale shrinkage, and suggested that shrinkage in vertical equating cannot be understood apart from multidimensionality. In this response, select research on reliability, multidimensionality and scale shrinkage are reviewed. These perspectives are then integrated with issues of practical importance to educators.
Journal Article•10.1111/J.1745-3984.1999.TB00556.X•
Psychometric and Cognitive Functioning of an Under-Determined Computer-Based Response Type for Quantitative Reasoning

Randy Elliot Bennett, Mary Morley, Dennis Quardt, Donald A. Rock, Mark K. Singley1, Irvin R. Katz2, Adisack Nhouyvanisvong3 •
IBM1, George Mason University2, Carnegie Mellon University3
01 Sep 1999-Journal of Educational Measurement
TL;DR: This paper evaluated a computer-delivered response type for measuring quantitative skill and found that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills.
Abstract: We evaluated a computer-delivered response type for measuring quantitative skill. "Generating Examples" (GE) presents under-determined problems that can have many right answers. We administered two GE tests that differed in the manipulation of specific item features hypothesized to affect difficulty. Analyses related to internal consistency reliability, external relations, and features contributing to item difficulty, adverse impact, and examinee perceptions. Results showed that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills. Item features that increased difficulty included asking examinees to supply more than one correct answer and to identify whether an item was solvable. Gender differences were similar to those found on the GRE quantitative and analytical test sections. Finally, examinees were divided on whether GE items were a fairer indicator of ability than multiple-choice items, but still overwhelmingly preferred to take the more conventional questions.
Journal Article•10.1111/J.1745-3984.1999.TB00545.X•
The Effect of Model Misspecification on Classification Decisions Made Using a Computerized Test

John C. Kalohn, Judith A. Spray
01 Mar 1999-Journal of Educational Measurement
TL;DR: This study examined the effects of item response theory model misspecification on the precision of the decisions made using the sequential probability ratio test (SPRT) in a computerized classification test (CCT).
Abstract: Many computerized testing algorithms require the fitting of some item response theory (IRT) model to examinees' responses to facilitate item selection, the determination of test stopping rules, and classification decisions. Some IRT models are thought to be particularly useful for small volume certification programs that wish to make the transition to computerized adaptive testing (CAT). The one-parameter logistic model (1-PLM) is usually assumed to require a smaller sample size than the three-parameter logistic model (3-PLM) for item parameter calibrations. This study examined the effects of model misspecification on the precision of the decisions made using the sequential probability ratio test (SPRT). For this comparison, the 1-PLM was used to estimate item parameters, even though the items' characteristics were represented by a 3-PLM. Results demonstrated that the 1-PLM produced considerably more decision errors under simulation conditions similar to a real testing environment, compared to the true model and to a fixed-form standard reference set of items. In certification and licensure testing, a balance must be maintained between minimizing costs for clients and ensuring the protection of the public in terms of making valid decisions regarding minimum competency. To remain competitive, a testing organization must be able to offer clients testing services at the lowest price allowable while providing good measurement. These services frequently include computerized testing. There are many forms of computerized testing available today. These vary from the simple administration of a fixed form on a computer to computerized adaptive testing for the estimation of a broad range of examinee abilities. The primary focus of this paper is related to item response theory model selection and the computerized classification test (CCT). A CCT is an examination that is designed to function optimally at making pass/fail decisions.
When determining if a credentialing or licensure client is ready to implement CCT, many factors are taken into consideration. These factors include the current size and status of the client's item pool, the testing volume (i.e., the number of examinees who take each test form), the test administration frequency, the frequency of pretesting, the nature of the examinee population (i.e., first-time test takers versus recertifiers or advanced-level practitioners), and the current test blueprint or test content outline. It is desirable, from the client's point of view, to know the predicted impact of making the transition from a fixed-length, fixed-form testing format to a test of variable length consisting of different items for different examinees. The outcomes or effects of a change in testing format could include changes in the average testing time or test length for an examinee, the average
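
The pieces named in this entry (the 1-PLM, the 3-PLM, and an SPRT pass/fail rule) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's procedure: it uses the logistic 3-PLM without a scaling constant, Wald's standard SPRT boundaries, and a fixed item order. The function names, the two ability points `theta_fail` and `theta_pass`, the error rates, and the item parameters are all hypothetical.

```python
import math

def p_3pl(theta, a, b, c):
    """3-PLM probability of a correct response (discrimination a, difficulty b, guessing c)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_1pl(theta, b):
    """1-PLM as the special case a = 1, c = 0."""
    return p_3pl(theta, 1.0, b, 0.0)

def sprt_classify(responses, items, theta_fail, theta_pass, alpha=0.05, beta=0.05):
    """Wald SPRT between theta_fail (H0) and theta_pass (H1).

    responses: 0/1 item scores in administration order.
    items: matching list of (a, b, c) parameters.
    Returns (decision, number_of_items_used).
    """
    upper = math.log((1.0 - beta) / alpha)   # cross above: classify as pass
    lower = math.log(beta / (1.0 - alpha))   # cross below: classify as fail
    llr = 0.0
    for used, (u, (a, b, c)) in enumerate(zip(responses, items), start=1):
        p1 = p_3pl(theta_pass, a, b, c)
        p0 = p_3pl(theta_fail, a, b, c)
        llr += math.log(p1 / p0) if u == 1 else math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "pass", used
        if llr <= lower:
            return "fail", used
    return "undecided", len(responses)

# Hypothetical 30-item pool with identical 3-PLM parameters
items = [(1.0, 0.0, 0.2)] * 30
print(sprt_classify([1] * 30, items, theta_fail=-0.5, theta_pass=0.5))
print(sprt_classify([0] * 30, items, theta_fail=-0.5, theta_pass=0.5))
```

Misspecification of the kind the abstract describes would correspond to computing `p1` and `p0` from 1-PLM parameter estimates while the responses are actually generated by a 3-PLM.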
Journal Article•
The validity of comparative educational studies

Cornelis de Glopper1, Timo M. Bechger, G.L.H. van den Wittenboer, Joop J. Hox•
University of Amsterdam1
01 Jan 1999-Journal of Educational Measurement
TL;DR: This article asks what conditions allow for appropriate comparisons among groups and whether the concept of comparative validity should be added to the lexicon of the psychometrician.
Abstract: What are the conditions that will allow for appropriate comparisons among groups? Should the concept of comparative validity be added to the lexicon of the psychometrician?
Journal Article•10.1111/J.1745-3984.1999.TB00552.X•
"Mental Model" Comparison of Automated and Human Scoring

David M. Williamson, Isaac I. Bejar, Anne S. Hone
01 Jun 1999-Journal of Educational Measurement
TL;DR: It is concluded that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency, and mental models used by automated scoring are contrasted with those used by experienced human graders.
Abstract: ‘Mental models’ used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses suggest high correspondence between automated and human scores, thereby suggesting similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria; objective criteria; tolerances/weighting; details; examinee task interpretation; and unjustified. The tendency of the human graders to be compelled by automated score rationales varied by the nature of original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, there remain some important differences, both intentional and incidental, which distinguish between human and automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency.
Journal Article•10.1111/J.1745-3984.1999.TB00557.X•
Multiple Objective Test Assembly Problems

Bernard P. Veldkamp1•
University of Twente1
01 Sep 1999-Journal of Educational Measurement
TL;DR: In this article, the authors propose methods appropriate for solving multiple objective test assembly problems, and the impact of the method on the optimality of the solution is shown and the appropriateness of the methods is discussed.
Abstract: Mathematical programming techniques for optimal test assembly are discussed. Most methods optimize a single objective: for instance, the amount of information in a test, subject to a number of constraints. However, some test assembly problems have multiple objectives. A recent example in the literature is the problem of assembling tests that measure multiple traits, where the amount of information in the test about each different trait has to be maximized. The present paper proposes methods appropriate for solving multiple objective test assembly problems. An overview of multiple objective optimization methods is given. The impact of the method on the optimality of the solution is shown and the appropriateness of the methods is discussed. The methods are illustrated using an empirical example of a test assembly problem for a two-dimensional mathematics item pool.
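
To make the two-trait example in this abstract concrete, here is a toy sketch of how the choice of multiple-objective method can change the assembled test. It compares a weighted-sum objective with a maximin objective. It swaps the paper's mathematical programming for brute-force enumeration, which is only feasible for tiny pools, and the abstract does not say which methods the paper actually covers. The item pool, the information values, and the function names are hypothetical.

```python
from itertools import combinations

def weighted_sum(items, w=0.5):
    """Weighted sum of the test information on trait 1 and trait 2."""
    return w * sum(i[0] for i in items) + (1.0 - w) * sum(i[1] for i in items)

def maximin(items):
    """Information on the worse-measured trait (to be maximized)."""
    return min(sum(i[0] for i in items), sum(i[1] for i in items))

def assemble(pool, length, objective):
    """Enumerate every test of the given length and return the best one.

    Brute force stands in for a mathematical programming solver here;
    the only constraint modeled is test length.
    """
    best = max(combinations(pool, length), key=objective)
    return list(best)

# Hypothetical pool: (information on trait 1, information on trait 2) per item
pool = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.7), (0.5, 0.5), (0.95, 0.0)]

test_ws = assemble(pool, 3, weighted_sum)
test_mm = assemble(pool, 3, maximin)
print("weighted sum:", test_ws, "min trait info =", maximin(test_ws))
print("maximin:     ", test_mm, "min trait info =", maximin(test_mm))
```

By construction, the maximin test never measures its weaker trait worse than the weighted-sum test does, which is one simple way to see how the method shapes the solution.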
