• Procedures typically used by test developers in their efforts to select the best items from a pool of tryout items
  • Criteria for the best items may differ as a function of the test developer’s objectives.
  • e.g. one test developer might deem the best items to be those that optimally contribute to the internal reliability of the test; another test developer may wish to design a test with the highest possible criterion-related validity and then select items accordingly.
  • Among the tools test developers might employ to analyse and select items are:
    • An index of the item’s difficulty
    • An index of the item’s reliability
    • An index of the item’s validity
    • An index of the item’s discrimination

The item-difficulty Index

  • An index of an item’s difficulty is obtained by calculating the proportion of the total number of testtakers who answered the item correctly.
  • p = item difficulty
  • A subscript refers to the item number e.g. p1
  • The value of an item-difficulty index can theoretically range from 0 (if no one got the item right) to 1 (if everyone got the item right) e.g. 50/100 people got item 2 right —> p2= .5
  • The larger the item-difficulty index, the easier the item
  • Percent of people passing the item
  • Item-difficulty index = item-endorsement index in other contexts such as personality testing – The statistic provides not a measure of percent of people passing the item but a measure of the percent of people who said yes to, agreed with, or endorsed the item.
  • An index of the difficulty of the average test item for a particular test can be calculated by averaging the item-difficulty indices for all the test’s items.
  • Sum the item-difficulty indices for all test items, divide by the total number of items on the test
  • For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approx. .5, with individual items on the test ranging in difficulty from about .3 to .8.
  • Possible effect of guessing must be considered for selected-response tests
  • With this type of item, the optimal average item difficulty is usually the midpoint between 1.00 and the chance success proportion (probability of answering correctly by random guessing)
    • e.g. in true-false, the chance success proportion is 1/2, so the optimal item difficulty is halfway between .50 and 1.00, i.e. .75 (see the sketch after this list)
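
A minimal Python sketch (using hypothetical 0/1 response data; names are illustrative, not from the notes) of item-difficulty values, the average item difficulty, and the optimal difficulty for a selected-response format:

```python
# Minimal sketch: item-difficulty (p) values, average difficulty, and the
# optimal average difficulty for selected-response items, assuming the
# hypothetical response matrix below (1 = correct, 0 = incorrect).

import numpy as np

# Rows = testtakers, columns = items (hypothetical data)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
])

p = responses.mean(axis=0)            # proportion answering each item correctly
print("item difficulties:", p)        # p1, p2, ...
print("average item difficulty:", p.mean())

def optimal_difficulty(n_options: int) -> float:
    """Midpoint between the chance success proportion (1/n_options) and 1.00."""
    chance = 1.0 / n_options
    return chance + (1.0 - chance) / 2

print(optimal_difficulty(2))   # true-false: 0.75
print(optimal_difficulty(4))   # 4-option multiple choice: 0.625
```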

The item-discrimination index

  • If most of the high scorers fail a particular item, these testtakers may be making an alternative interpretation of a response intended to serve as a distractor.
  • In such a case, the test developer should interview the examinees to understand better the basis for the choice and then appropriately revise (or eliminate) the item.
  • item discrimination index = d
  • Estimate of item discrimination compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
  • The optimal boundary lines for what we refer to as the “upper” and “lower” areas of a distribution of scores demarcate the upper and lower 27% of the distribution – provided the distribution is normal.
  • As the distribution of test scores becomes more platykurtic (flatter), the optimal boundary line for defining upper and lower increases to near 33%.
  • For most applications, any percentage between 25 and 33 will yield similar estimations.
  • The item-discrimination index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly (a minimal computational sketch follows this list)
  • The higher the value of d, the greater the proportion of high scorers (relative to low scorers) answering the item correctly
  • Negative d value = red flag —> indicates that low-scoring examinees are more likely to answer the item correctly than high-scoring examinees.
  • Highest possible value of d is +1.00 —> all members of Upper group (top 27%) answered correctly, all members of Lower group (bottom 27%) group answered incorrectly
  • If the same proportion of members of the U and L groups pass the item, the item is not discriminating between testtakers at all and d will be 0.
  • Lowest value is d = -1 —> all members of U group failed the item, all members of the L group passed it.
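
A minimal Python sketch of the item-discrimination index d, using simulated 0/1 response data and the upper/lower 27% convention described above (all data and names are illustrative only):

```python
# Minimal sketch: d = (proportion correct in upper group) - (proportion
# correct in lower group), with "upper"/"lower" taken as the top and bottom
# 27% of total test scores. Response data are simulated.

import numpy as np

rng = np.random.default_rng(0)
responses = (rng.random((100, 20)) < 0.6).astype(int)   # hypothetical 0/1 data
total = responses.sum(axis=1)

def item_discrimination(item_scores, total_scores, tail=0.27):
    n = len(total_scores)
    k = int(round(n * tail))
    order = np.argsort(total_scores)          # ascending by total score
    lower, upper = order[:k], order[-k:]
    return item_scores[upper].mean() - item_scores[lower].mean()

d_values = [item_discrimination(responses[:, j], total)
            for j in range(responses.shape[1])]
print(np.round(d_values, 2))    # negative d values are red flags
```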

Analysis of item alternatives

  • Quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers.
  • No formulas or statistics are required
  • By charting the number of testtakers in the U and L groups who chose each alternative, the test developer can get an idea of the effectiveness of a distractor by means of a simple eyeball test —> look at the distribution of answers for each item (a small tabulation sketch follows)
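
A small illustrative tabulation (hypothetical choice data for one multiple-choice item) of the kind of U/L breakdown a test developer might eyeball:

```python
# Minimal sketch of an "eyeball" distractor analysis, assuming hypothetical
# alternative choices ('a'-'d') for one item, split by upper/lower groups.

from collections import Counter

upper_choices = Counter("aaabacada")   # hypothetical choices of upper-group testtakers
lower_choices = Counter("abbccddbd")   # hypothetical choices of lower-group testtakers

print("alt  U   L")
for alt in "abcd":
    print(f" {alt}   {upper_choices[alt]}   {lower_choices[alt]}")
# A well-functioning keyed answer draws mostly U testtakers;
# effective distractors draw mostly L testtakers.
```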

Item characteristic curves (ICCs)

  • Item characteristic curve: graphic representation of item difficulty and discrimination
  • The steeper the slope of the ICC, the greater the item discrimination
  • Items may also vary in difficulty level —> an easy item shifts the ICC to the left along the ability axis; a difficult item shifts the ICC to the right along the horizontal axis
  • For a difficult item, only testtakers at high ability levels have a high probability of their response being scored as correct (a sketch of two ICCs follows this list)
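
A minimal sketch of two ICCs under a two-parameter logistic (2PL) model — a common IRT formulation assumed here for illustration (the model and parameter values are not specified in the notes): difficulty b shifts the curve along the ability axis, discrimination a sets the slope.

```python
# Minimal sketch of two item characteristic curves under a 2PL model:
# P(correct | theta) = 1 / (1 + exp(-a * (theta - b))).
# b (difficulty) shifts the curve; a (discrimination) controls the slope.
# Parameter values are illustrative only.

import numpy as np

def icc(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
easy_item = icc(theta, a=1.0, b=-1.0)        # shifted left: easier
hard_steep_item = icc(theta, a=2.0, b=1.0)   # shifted right, steeper slope

for t, p1, p2 in zip(theta, easy_item, hard_steep_item):
    print(f"theta={t:+.1f}  easy={p1:.2f}  hard/steep={p2:.2f}")
```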

Other considerations in Item Analysis

Guessing

 

Criteria that any correction for guessing must meet:

  1. Recognise that when a respondent guesses at an answer on an achievement test, the guess is not typically made on a totally random basis. It is more reasonable to assume that the testtaker’s guess is based on some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives. However, the individual testtaker’s amount of knowledge of the subject matter will vary from one item to the next.
  2. Must deal with the problem of omitted items. Sometimes, instead of guessing, the testtaker will simply omit a response to an item. Should omitted item be scored wrong? Should it be excluded? etc.
  3. Some testtakers may be luckier than others in guessing the choices that are keyed correct.
  • No solution to the problem of guessing has been deemed entirely satisfactory (the classic correction formula is sketched after this list)
  • A responsible test developer addresses the problem of guessing by including in the test manual:
    1. Explicit instructions regarding this point for the examiner to convey to the examinees
    2. Specific instructions for scoring and interpreting omitted items
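
The notes do not give a specific correction formula; for reference, a minimal sketch of the classic correction for guessing (rights minus wrongs divided by k − 1, omits excluded) — one conventional approach rather than a fully satisfactory solution:

```python
# Minimal sketch of the classic correction-for-guessing formula (not taken
# from these notes): corrected = R - W/(k - 1), where R = number right,
# W = number wrong (omitted items excluded), k = number of answer options.

def corrected_score(num_right: int, num_wrong: int, num_options: int) -> float:
    return num_right - num_wrong / (num_options - 1)

# A testtaker with 30 right, 12 wrong, 8 omitted on a 4-option test:
print(corrected_score(30, 12, 4))   # 26.0
```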

Item fairness

  • The degree to which a test item is biased
  • Biased test item: favours one particular group of examinees in relation to another when differences in group ability are controlled
  • Item characteristic curves can be used to identify biased items
  • Specific items are identified as biased in a statistical sense if they exhibit differential item functioning
  • Differential item functioning: exemplified by different shapes of item-characteristic curves for different groups (e.g. men and women) when the two groups do not differ in total test score.
  • If an item is to be considered fair to different groups of testtakers, the item-characteristic curves for the different groups should not be significantly different.
  • Establishing the presence of differential item functioning requires a statistical test of the null hypothesis of no difference between the item-characteristic curves of the two groups
  • Items exhibiting a significant difference in item-characteristic curves must be revised or eliminated

Speed tests

  • Item analyses of tests taken under speed conditions yield misleading or uninterpretable results.
  • The closer an item is to the end of the test, the more difficult it may appear to be —> This is because testtakers may not get to items near the end of the test before time runs out
  • Measures of item discrimination may be artificially high for late-appearing items —> because test takers who know the material better may work faster and are thus more likely to answer the later items
  • Items appearing late in a speed test are consequently more likely to show positive item-total correlations because of the select group of examinees reaching those items.
  • So, how can items on a speed test be analysed? —> the most obvious solution is to restrict the item analysis to the items completed by each testtaker —> not recommended for 3 reasons
  • 1) item analyses of the later items would be based on a progressively smaller number of testtakers, yielding progressively less reliable results
  • 2) if the more knowledgeable examinees reach the later items, then part of the analysis is based on all testtakers and part is based on a selected sample
  • 3) because the more knowledgeable testtakers are more likely to score correctly, their performance will make items occurring toward the end of the test appear to be easier than they are

Qualitative Item Analysis

  • Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
  • Encouraging testtakers to discuss aspects of their test-taking experience is one way of eliciting/generating such data
  • These data may be used to improve the test
  • Qualitative item analysis: general term for various non statistical procedures designed to explore how individual test items work.
  • The analysis compares individual test items to each other and to the test as a whole
  • Involve exploration of the issues through verbal means such as interviews and group discussions conducted with testtakers and other relevant parties
  • Caution —> there may be abuse of the process – respondents may be disgruntled for any number of reasons, from failure to prepare adequately for the test to disappointment in their test performance.

“Think aloud” test administration

  • Having respondents verbalise thoughts as they occur
  • Shed light on testtaker’s thought processes during the administration of a test
  • One to one basis

Expert panels

  • Sensitivity review: study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations
  • Possible forms of content bias that may be in an achievement test
    • Status
    • Stereotype
    • Familiarity
    • Offensive choice of words
    • Other

TEST REVISION

Test revision as a Stage in New Test Development

  • One approach to test revision is characterising each item according to its strengths and weaknesses – some items may be highly reliable but lack criterion validity, others may be largely unbiased but too easy.
  • Some items will be found with many weaknesses —> prone to deletion/revision

Test developers may find that they must balance various strengths and weaknesses across items e.g. if many otherwise good items tend to be somewhat easy, the test developer may purposefully include some more difficult items even if they have other problems

  • Purpose of the test influences the blueprint/plan for the revision e.g. if the test will be used to influence major decisions about educational placement or employment, the test developer should be scrupulously concerned with item bias —> if there is a need to identify the most highly skilled individuals among those being tested, items demonstrating excellent item discrimination, leading to the best possible test discrimination, will be made a priority.
  • As revision proceeds, the advantage of writing a large item pool becomes more and more apparent —> poor items can be eliminated in favour of those that were shown on the test tryout to be good items
  • Next step is to administer the revised test under standardised conditions to a second appropriate sample of examinees.
  • Once the test is in its finished form, the test’s norms may be developed from the data, and the test will be said to have been “standardised” on this (second) sample.

Test revision in the Life Cycle of an Existing Test

  • No hard and fast rules exist for when to revise a test
  • The APA offered two general suggestions: that an existing test be kept in its present form as long as it remains “useful” but that it should be revised when “significant changes in the domain represented, or new conditions of test use and interpretation, make the test inappropriate for its intended use”
  • Many tests due for revision when any of the following conditions exist:
  • stimulus materials look dated and current testtakers cannot relate to them
  • verbal content of the test, including the administration instructions and the test items, contains dated vocabulary that is not readily understood by current testtakers
  • as popular culture changes and words take on new meanings, certain words or expressions in the test items or directions may be perceived as inappropriate or offensive to a particular group and must be changed
  • test norms are no longer adequate as a result of group membership changes in the population of potential testtakers
  • The test norms are no longer adequate as a result of age-related shifts in the abilities measured over time, and so an age extension of the norms (upward, downward, or in both directions) is necessary.
  • Reliability or validity, as well as the effectiveness of individual test items can be significantly improved by a revision
  • The theory on which the test was originally based has been improved significantly, and these changes should be reflected in the design and content of the test
  • Steps to revise an existing test parallel those to create a brand new one
  • Test conceptualisation phase —> the test developer must think through the objectives of the revision and how they can best be met
  • In the test construction phase —> the proposed changes are made
  • Test tryout, item analysis and test revision follow
  • Once the successor to an established test is published, there are inevitably questions about the equivalence of the two editions e.g. does a measured full-scale IQ of 110 on the first edition of an intelligence test mean exactly the same thing as a full-scale IQ of 110 on the second edition?

 

Formal item analysis methods must be employed to evaluate the stability of items between revisions of the same test

Ultimately scores on a test and on its updated version may not be directly comparable

Cross-validation and co-validation

Cross-validation

  • Revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
  • We expect that items selected for final version of the test (in part because of their high correlations with a criterion measure) will have smaller item validities when administered to a second sample of testtakers —> because of the operation of chance
  • The decrease in item validities that inevitably occurs upon cross-validation of findings = validity shrinkage (simulated in the sketch after this list)
  • Such shrinkage is expected and integral to the test development process.
  • Such shrinkage is preferable to a scenario wherein (spuriously) high item validities are published in a test manual as a result of inappropriately using the identical sample of testtakers for test standardisation and cross-validation of findings.
  • When such scenarios occur, test users will typically be let down by lower-than-expected test validity.
  • The test manual accompanying commercially prepared tests should outline the test development procedures used, along with reliability information (e.g. test-retest and internal consistency estimates).
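
A minimal simulation sketch of validity shrinkage (all data, sample sizes, and numbers are fabricated for illustration): items selected for their criterion correlations in one sample tend to show lower validities in an independent sample.

```python
# Minimal sketch illustrating validity shrinkage: items are selected for their
# correlation with a criterion in a derivation sample, then the selected items'
# validities are re-estimated in an independent cross-validation sample.

import numpy as np

rng = np.random.default_rng(42)

def simulate(n_people=200, n_items=40):
    ability = rng.normal(size=n_people)
    items = (ability[:, None] * 0.3 + rng.normal(size=(n_people, n_items)) > 0).astype(int)
    criterion = ability + rng.normal(size=n_people)
    return items, criterion

def item_validities(items, criterion):
    return np.array([np.corrcoef(items[:, j], criterion)[0, 1]
                     for j in range(items.shape[1])])

items1, crit1 = simulate()          # derivation sample
items2, crit2 = simulate()          # independent cross-validation sample

r1 = item_validities(items1, crit1)
best = np.argsort(r1)[-10:]         # keep the 10 "most valid" items

r2 = item_validities(items2, crit2)
print("mean validity in derivation sample:      ", round(r1[best].mean(), 3))
print("mean validity in cross-validation sample:", round(r2[best].mean(), 3))
# The second value is typically lower: validity shrinkage.
```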

Co-validation

  • a test validation process conducted on two or more tests using the same sample of testtakers
  • When used in conjunction with the creation of norms or the revision of existing norms —> ‘co-norming’
  • A current trend among test publishers who publish more than one test designed for use with the same population is to co-validate and/or co-norm tests
  • Co-validation beneficial to test publishers because it is economical
  • During the process of validating a test, many prospective testtakers must first be identified —> After being identified as a possible participant in the validation study, a person will be prescreened for suitability e.g. by face-to-face interview or telephone interview —> costs money
  • Both money and time saved if person is deemed suitable in the validation studies for multiple tests
  • Cost of retaining professional personnel on a per test basis is minimised when the work is done for multiple tests simultaneously
  • Benefits for test users and testtakers —> many tests that tend to be used together are published by the same publisher —> when such tests are co-normed on the same sample, error due to differences between normative samples is eliminated

Quality assurance during test revision

  • Ensuring examiners adhere to standardisation procedures —> examiners must be qualified and trained
  • (for the WISC-IV) —> having two qualified scorers rescore each protocol collected during the national tryout and standardisation stages of test development
  • If there were discrepancies in scoring —> resolved by another scorer – the resolver.

Anchor protocol: a test protocol scored by a highly authoritative scorer that is designated as a model for scoring and a mechanism for resolving scoring discrepancies.

Scoring drift: a discrepancy between the scoring in an anchor protocol and the scoring of another protocol

  • For quality assurance during the data entry phase of test development, test developers may employ computer programs to seek out and identify any irregularities in score reporting
    • e.g. if a score on a particular subtest can range from a low of 1 to a high of 10, any score reported outside that range would be flagged by the computer (a minimal range-check sketch follows this list)
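
A minimal sketch of the kind of automated range check described above (the valid 1–10 range and the entered scores are hypothetical):

```python
# Minimal sketch of an automated range check during data entry, assuming a
# hypothetical subtest whose valid scores run from 1 to 10.

VALID_RANGE = range(1, 11)   # 1 through 10 inclusive

def flag_out_of_range(scores):
    """Return the (index, score) pairs that fall outside the valid range."""
    return [(i, s) for i, s in enumerate(scores) if s not in VALID_RANGE]

entered = [3, 7, 10, 14, 0, 9]       # hypothetical keyed-in subtest scores
print(flag_out_of_range(entered))    # [(3, 14), (4, 0)]
```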

Use of IRT in Building and Revising Tests

  • One of the disadvantages of applying CTT in test development is the extent to which item statistics are dependent upon characteristics (strength of traits or ability level) of the group of people tested —> all CTT-based statistics are sample dependent.
  • e.g. consider a hypothetical Perceptual-Motor Ability Test (PMAT), and the characteristics of items on that test with reference to different groups of testtakers
  • From a CTT perspective, a PMAT item might be judged to be very high in difficulty when it is administered to a sample of people known to be very low in perceptual-motor ability. From that same perspective, the PMAT item might be judged to be very low in difficulty when administered to a group of people known to be very high in perceptual- motor ability.
  • Because the way an item is viewed is so dependent on the group of testtakers taking the test, the ideal situation, from the CTT perspective, is one in which all testtakers represent a truly random sample of how well the trait or ability being studied is represented in the population.
  • Using IRT, test developers evaluate individual item performance, with reference to item characteristic curves (ICCs) —> provide info about the relationship between the performance of individual items and the presumed underlying ability or trait level in the testtaker.
  • 3 of many possible applications of IRT in building and revising tests include:
    1. evaluating existing tests for the purpose of mapping test revisions
    2. determining measurement equivalence across testtaker populations
    3. developing item banks

Evaluating the properties of existing tests and guiding test revision

  • IRT information curves can help test developers evaluate how well an individual item (or entire test) is working to measure different levels of the underlying construct
  • Developers can use these information curves to weed out uninformative questions or to eliminate redundant items that provide duplicate levels of information
  • Information curves allow test developers to tailor an instrument to provide high information (i.e. measurement precision) where it is needed (a 2PL item-information sketch follows this list)
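
A minimal sketch of an item information function under an assumed 2PL model (the model and parameter values are illustrative, not from the notes): information peaks near the item’s difficulty and grows with the square of its discrimination.

```python
# Minimal sketch of item information under a 2PL model:
# I(theta) = a^2 * P(theta) * (1 - P(theta)).
# High-discrimination items give high information (precision) near their
# difficulty level and little elsewhere. Parameter values are illustrative.

import numpy as np

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

for t in np.linspace(-3, 3, 7):
    print(f"theta={t:+.1f}  info(a=2.0, b=0.0)={item_information(t, 2.0, 0.0):.3f}")
```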

Determining measurement equivalence across testtaker populations 

  • Test developers often aspire to have their tests become so popular that they will be translated into other languages and used in many places throughout the world
  • IRT —> tool to help ensure that the same construct is being measured, no matter what language the test has been translated into
  • Different populations may interpret items differently

Differential Item Functioning (DIF): an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same or similar level of the underlying trait

Instruments containing such items may have reduced validity for between-group comparisons because their scores may indicate a variety of attributes other than those the scale is intended to measure.

  • DIF analysis: test developers scrutinise group by group item response curves, looking for DIF items.
  • DIF items: those items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership
  • DIF analysis has been used to evaluate measurement equivalence in item content across groups that vary by culture, gender and age (a simulated DIF screen is sketched after this list)
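
A minimal simulated sketch of one common DIF screening idea (not a method specified in these notes): match respondents from two groups on a trait proxy (here, the rest score on the other items), then compare each group’s probability of endorsing the studied item within matched score bands.

```python
# Minimal DIF screen sketch (simulated data, illustrative only): large,
# consistent endorsement gaps between groups at matched trait levels flag DIF.

import numpy as np

rng = np.random.default_rng(1)
n = 400
group = rng.integers(0, 2, size=n)                 # 0 / 1 group membership
trait = rng.normal(size=n)
other_items = (trait[:, None] + rng.normal(size=(n, 19)) > 0).astype(int)
# Studied item: biased in favour of group 1 at the same trait level
studied = (trait + 0.8 * group + rng.normal(size=n) > 0).astype(int)

matching_score = other_items.sum(axis=1)           # trait proxy (rest score)
bands = np.digitize(matching_score, bins=[5, 10, 15])

print("band  P(endorse | group 0)  P(endorse | group 1)")
for band in np.unique(bands):
    in_band = bands == band
    p0 = studied[in_band & (group == 0)].mean()
    p1 = studied[in_band & (group == 1)].mean()
    print(f"{band:>4}  {p0:20.2f}  {p1:20.2f}")
```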

Developing item banks

  • Each of the items assembled as part of an item bank, whether taken from an existing test or written especially for the item bank, has undergone rigorous qualitative and quantitative evaluation
  • All items available for use as well as new items created especially for the item bank constitute the item pool
  • Item pool evaluated by content experts, potential respondents and survey experts using a variety of qualitative and quantitative methods
  • Individual items in an item pool may be evaluated by cognitive testing procedures —> interviewer conducts one on one interviews with respondents in an effort to identify any ambiguities associated with the items.
  • Item pools may also be evaluated by groups of respondents, which allows for discussion of the clarity and relevance of each item, among other item characteristics —> items that make the cut after such scrutiny constitute the preliminary item bank
  • Next step in creating the item bank is the administration of all the questionnaire items to a large and representative sample of the target population —> for ease in data analysis, group administration by computer is preferable.
  • However, depending upon the content and method of administration required by the items, the questionnaire (or portions of it) may be administered individually using paper and pencil methods.
  • After administration of the preliminary item bank to the entire sample of respondents, responses to the items are evaluated with regard to several variables such as validity, reliability, domain coverage and differential item functioning
  • The final item bank will consist of a large set of items all measuring a single domain (single trait or ability)
  • A test developer will use the banked items to create one or more tests with a fixed number of items e.g. teacher may create two different versions of a math test in order to minimise efforts by testtakers to cheat
  • When used within a CAT environment, a testtaker’s response to an item may automatically trigger which item is presented to the testtaker next
  • The software has been programmed to present next the item that will be most informative with regard to the testtaker’s standing on the construct being measured (a maximum-information selection sketch follows this list)
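
A minimal sketch of maximum-information item selection from a banked item pool, under an assumed 2PL model (the bank contents, parameters, and ability estimate are hypothetical):

```python
# Minimal CAT-style selection sketch: after each response, pick the unused
# banked item with maximum information at the current ability estimate.

import numpy as np

bank = {                      # hypothetical bank: item -> (a, b)
    "item_01": (1.8, -1.0),
    "item_02": (1.2,  0.0),
    "item_03": (2.0,  0.5),
    "item_04": (0.9,  1.5),
}

def information(theta, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

def next_item(theta_hat, administered):
    candidates = {k: v for k, v in bank.items() if k not in administered}
    return max(candidates, key=lambda k: information(theta_hat, *candidates[k]))

print(next_item(theta_hat=0.4, administered={"item_02"}))   # most informative next item
```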