  1. Test conceptualisation
  2. Test construction
  3. Test tryout
  4. Item analysis
  5. Test revision

TEST CONCEPTUALISATION

  • An emerging social phenomenon or pattern of behaviour might serve as the stimulus for the development of a new test
  • Development of a new test may be in response to a need to assess mastery in an emerging occupation or profession

Some preliminary questions

  • Regardless of the stimulus for developing the new test, a number of questions immediately confront the prospective test developer
    • What is the test designed to measure?
    • What is the objective of the test?
    • Is there a need for this test?
    • Who will use this test?
    • Who will take this test?
    • What content will the test cover?
    • How will the test be administered?
    • What is the ideal format of the test?
    • Should more than one form of the test be developed?
    • What special training will be required of test users for administering or interpreting the test?
    • What types of responses will be required of testtakers?
    • Who benefits from an administration of this test?
    • Is there any potential for harm as the result of an administration of this test?
    • How will meaning be attributed to scores on this test?

Norm-referenced versus criterion-referenced tests: Item development issues

  • Different approaches to test development and individual item analyses are necessary, depending on whether the finished test is designed to be norm-referenced or criterion-referenced.

Scaling methods

  • Generally speaking, a testtaker is presumed to have more or less of the characteristic measured by a (valid) test as a function of the test score
  • The higher or lower the score, the more or less of the characteristic the testtaker presumably possesses.
  • Numbers are assigned to responses through scaling the test items so that a test score can be calculated.
  • e.g. consider a moral-issues opinion measure, the Morally Debatable Behaviours Scale-Revised (MDBS-R) —> developed as a “practical means of assessing what people believe, the strength of their convictions, as well as individual differences in moral tolerance”; it contains 30 items, each presenting a brief description of a moral issue or behaviour on which testtakers express their opinion by means of a 10-point scale that ranges from “never justified” to “always justified”
  • MDBS-R is an example of a rating scale, which can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude or emotion are indicated by the testtaker
  • Rating scales can be used to record judgments of oneself, others, experiences, or objects, and they can take several forms.
  • On the MDBS-R, the ratings that the testtaker makes for each of the 30 test items are added together to obtain a final score
  • Scores range from a low of 30 (if the testtaker indicates that all 30 behaviours are never justified) to a high of 300 (if the testtaker indicates that all 30 behaviours are always justified)
  • Summative scale: a final score obtained by summing the ratings across all the items (a minimal scoring sketch follows this list)
  • Likert scale —> type of summative rating scale —> usually used to scale attitudes
  • Likert scales usually reliable
  • Ratings of 1 to 5 tend to work best
  • rating scales = ordinal data
  • Unidimensional: one dimension presumed to underlie the ratings
  • Multidimensional: more than one dimension thought to guide the testtaker’s responses – e.g. responses in the low to middle range to the MDBS-R item regarding marijuana use may be interpreted in many ways, i.e. 1) people should not engage in illegal activities, 2) people should not take risks with their health, 3) people should avoid activities that could lead to contact with a bad crowd, etc.
  • Method of paired comparisons
    • produces ordinal data
    • testtakers presented with pairs of stimuli (two photographs, two objects, two statements), which they are asked to compare —> must select one of the stimuli according to some rule e.g. the rule that they agree more with one statement than the other, or the rule that they find one stimulus more appealing than the other.
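
A minimal sketch of the summative (Likert-type) scoring described above, in Python. The function name and the example ratings are invented for illustration; it simply assumes 30 items rated 1 (“never justified”) to 10 (“always justified”), as on the MDBS-R.

```python
# Minimal sketch of summative scoring for an MDBS-R-like rating scale.
# Assumes 30 items, each rated 1 ("never justified") to 10 ("always justified").

def summative_score(ratings, n_items=30, scale_min=1, scale_max=10):
    """Sum the item ratings to obtain the total score (30 to 300 for the MDBS-R)."""
    if len(ratings) != n_items:
        raise ValueError(f"expected {n_items} ratings, got {len(ratings)}")
    if any(not scale_min <= r <= scale_max for r in ratings):
        raise ValueError("a rating falls outside the response scale")
    return sum(ratings)

# A hypothetical testtaker who rates every behaviour as "never justified":
print(summative_score([1] * 30))   # 30, the scale minimum
```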

 

  • For each pair of options, testtakers receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges (who rated the pairs before the distribution of the test)
  • The test score reflects the number of times the testtaker agreed with the judges
  • Advantage: forces testtakers to choose between options
  • Sorting tasks: Comparative scaling
    • Entails judgments of a stimulus in comparison with every other stimulus on the scale
    • A version of the MDBS-R that employs comparative scaling might feature 30 items, each on a separate index card
    • Testtakers asked to sort through the cards from most justifiable to least justifiable
    • Comparative scaling could also be accomplished by providing test takers with a list of 30 items on a sheet of paper and asking them to rank the justifiability of the items from 1 to 30.
  • Sorting tasks: Categorical scaling – Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
    • In the MDBS-R example, testtakers might be given 30 index cards, on each of which is printed one of the 30 items
    • Testtakers would be asked to sort the cards into three piles: those behaviours that are never justified, those that are sometimes justified, and those that are always justified.
  • Guttman scale
    • Ordinal
    • Items on it range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured
    • A feature of Guttman scales is that all respondents who agree with the stronger statements of the attitude will also agree with the milder statements
    • Using the MDBS-R scale as an example, consider the following statements that reflect attitudes towards suicide.
    • If this were a perfect Guttman scale, all respondents who agreed with ‘a’ would also agree with ’b’, ‘c’ and ‘d’.

etc.

  • Guttman scales developed through the administration of a number of items to a target group
  • Resulting data analysed by scalogram analysis: an item-analysis procedure involving a graphic mapping of testtakers’ responses (a simplified check of the Guttman pattern is sketched after this list).
  • Objective for the developer of a measure of attitudes is to obtain an arrangement of items wherein endorsement of one item automatically connotes endorsement of less extreme positions —> not always possible to do this.
  • Guttman scaling/scalogram analysis appeals to test developers in consumer psychology, where an objective may be to learn whether a consumer who will purchase one product will also purchase another product.
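
A simplified illustration of the cumulative pattern that scalogram analysis looks for, assuming items are ordered from mildest to strongest and responses are coded 1 (agree) or 0 (disagree). This is only a check of the Guttman property for a single response pattern, not the full scalogram procedure.

```python
# Simplified check of the Guttman property for one respondent's pattern.
# With items ordered from mildest to strongest, a perfect pattern is a run of
# agreements (1s) followed only by disagreements (0s), e.g. [1, 1, 1, 0].

def is_perfect_guttman_pattern(responses):
    """Return True if no agreement appears after the first disagreement."""
    seen_disagreement = False
    for r in responses:
        if r == 0:
            seen_disagreement = True
        elif seen_disagreement:   # agreed with a stronger item after rejecting a milder one
            return False
    return True

print(is_perfect_guttman_pattern([1, 1, 1, 0]))  # True: consistent with the scale
print(is_perfect_guttman_pattern([1, 0, 1, 0]))  # False: an error (inconsistent) pattern
```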

Equal-appearing intervals (Thurstone, 1929)

  • A scaling method used to obtain interval data

Steps involved in creating a scale using equal-appearing intervals method

  1. A reasonably large number of statements reflecting positive and negative attitudes towards suicide are collected, e.g.

life is sacred, so people should never take their own lives

  2. Judges evaluate each statement in terms of how strongly it indicates that suicide is justified. Each judge is instructed to rate each statement on a scale as if the scale were interval in nature, e.g. the scale might range from 1 (suicide never justified) to 9 (suicide always justified) —> each judge is instructed that the 1-to-9 scale is being used as if there were an equal distance between each of the values, i.e. as an interval scale
  3. A mean and SD of the judges’ ratings are calculated for each statement
  4. Items are selected for inclusion in the final scale based on several criteria, including a) the degree to which the item contributes to a comprehensive measurement of the variable in question, and b) the test developer’s degree of confidence that the items have been sorted into equal intervals. Item means and SDs are taken into account – a low SD indicates a good item.
  5. The scale is now ready for administration – typically respondents are asked to select the statements that most accurately reflect their own attitudes —> the values of the items that the respondent selects (based on the judges’ ratings) are averaged, producing a score on the test (a minimal numerical sketch of steps 3 to 5 follows).
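
A minimal numerical sketch of steps 3 to 5 above, using Python’s statistics module. The statements (apart from the one quoted earlier) and the judges’ ratings are hypothetical; a real application would use many more statements and judges.

```python
# Sketch of the equal-appearing intervals method (steps 3 to 5 above):
# each statement gets a scale value (mean of judges' 1-9 ratings) and an SD
# (low SD = better item); a respondent's score is the mean scale value of the
# statements they endorse.
from statistics import mean, stdev

# Hypothetical judges' ratings (1 = suicide never justified ... 9 = always justified)
judge_ratings = {
    "Life is sacred, so people should never take their own lives": [1, 1, 2, 1, 1],
    "Suicide may be acceptable when a person is terminally ill":   [6, 7, 6, 7, 6],
    "People are free to end their lives whenever they wish":       [9, 8, 9, 9, 8],
}

scale_values = {s: mean(r) for s, r in judge_ratings.items()}
spreads      = {s: stdev(r) for s, r in judge_ratings.items()}   # screen out high-SD items

# Score a hypothetical respondent who endorses only the middle statement:
endorsed = ["Suicide may be acceptable when a person is terminally ill"]
print(round(mean(scale_values[s] for s in endorsed), 2))   # 6.4
```
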
  • Equal-appearing intervals is an example of a direct estimation scaling method —> in contrast to methods that involve indirect estimation, there is no need to transform the testtaker’s responses into some other scale.

Writing items

  • Three questions
    • What range of content should the items cover?
    • Which of the many different types of item formats should be employed?
    • How many items should be written in total and for each content area covered?
  • When devising a standardised test using a multiple-choice format, it is usually advisable that the first draft contain approximately twice the number of items that the final version of the test will contain —> item pool
  • Item pool: reservoir or well from which items will or will not be drawn for the final version of the test.
  • A comprehensive sampling provides a basis for content validity for the final version of the test
  • How does one develop items for the item pool? The test developer may write a large number of items from personal experience or academic acquaintance with the subject matter; help may also be sought from experts
  • For psych tests designed to be used in clinical settings, clinicians, patients, patient’s family members, clinical staff and others may be interviewed for insights that could assist in item writing
  • For psych tests designed to be used by personnel psychologists, interviews with members of a targeted industry or organisation will be of good value
  • For psych tests designed to be used by school psychologists, interviews with teachers, administrative staff, educational psychologists may be valuable.
  • Considerations related to variables such as the purpose of the test and the number of examinees to be tested at one time enter into decisions regarding the format of the test under construction.

Item format

  • Variables such as the form, plan, structure, arrangement and layout of individual test items are collectively referred to as item format.
  • Selected response format: requires testtakers to select a response from a set of alternative responses
  • Multiple choice
    • Has 3 elements
    • Stem
    • Correct alternative/option
    • Several incorrect alternatives/options —> distractors/foils
  • Matching item – Testtaker presented with two columns: premises on left and responses on right —> determine which response is best associated with which premise i.e. draw a line to, or number each side to match them
    • Providing more options than needed on one side eliminates the possibility of getting a perfect score without knowing the answers by matching all the other pairs first and then deducing the answer to the last one.
    • Another way to lessen the probability of getting an answer correct by chance or guessing is to state that each response may be used as a correct answer once, more than once, or not at all.
    • Some guidelines should be observed in writing matching items for classroom use
    • short and to the point
    • no more than a dozen or so premises, otherwise some students will forget what they are looking for as they go through the lists
    • the lists of premises and responses should be homogeneous —> lists of the same sort of thing, e.g. names of actors matched to names of film characters
  • True-false item (binary choice item)
    • Sentence that requires the testtaker to indicate whether the statement is or is not a fact
    • A good binary-choice item contains a single idea, is not excessively long, is not subject to debate, and is definitely one of the two choices (i.e. unambiguously true or false)
    • Cannot contain distractor alternatives —> easier to write and can be written more quickly
    • Disadvantage: the probability of obtaining a correct response purely on the basis of chance is 50%, whereas for a four-option multiple-choice item it is 25%
  • Constructed response format: require test takers to supply or create the answer, not select it.
  • Completion item
    • Requires the examinee to provide a word or phrase that completes a sentence
    • Good completion item should be worded so that the correct answer is specific —> Completion items that can be correctly answered in many ways lead to scoring problems
    • e.g. the standard deviation is generally considered the most useful measure of ______ (variability)
  • Short-answer item
    • e.g. What descriptive statistic is generally considered the most useful measure of variability?
    • Word, term, sentence or paragraph
  • Essay item
    • Test item that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation
    • Useful when the test developer wants the examinee to demonstrate a depth of knowledge on a single topic
    • Not only permits the restating of learned material but also allows for the creative integration and expression of the material in the testtaker’s own words.
    • True-false & matching items = recognition, essay = recall, organisation, planning and writing ability
    • Disadvantages: tends to focus on a more limited area than can be covered in the same amount of time when using a series of selected-response items or completion items; subjectivity in scoring and inter-scorer differences.

Writing items for computer administration

  • A number of widely available computer programs are designed to facilitate the construction of tests as well as their administration, scoring, and interpretation
  • These programs typically make use of two advantages of digital media: (1) ability to store items in an item bank (2) the ability to individualise testing through a technique called item branching
  • Item bank
    • Relatively large and easily accessible collection of test questions
    • Advantage: access to a large number of test items conveniently classified by subject area, item statistics, or other variables.
    • Items may be added, withdrawn from and modified in an item bank
  • Computerised Adaptive Testing (CAT) – Interactive, computer administered test-taking process wherein items presented to the testtaker are based in part on the testtakers performance on previous items
    • Test may be different for each testtaker depending on performance
    • Each item on an achievement test, for example, may have a known difficulty level —> this may be factored in when it comes to tallying the final score on the items administered (do not say “final score on the test” because what constitutes the test may be different for different testtakers).
    • Advantages
    • only a sample of the total number of items in the item pool is administered to any one testtaker —> on the basis of previous response patterns, items that have a high probability of being answered in a particular fashion (correctly, in the case of an ability test) are not presented, thus providing economy in testing time and in the total number of items presented —> CAT can reduce the number of test items that need to be administered by as much as 50% while simultaneously reducing measurement error by 50%.
    • Reduced floor effects and ceiling effects.
    • Floor effect: diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured. A test of ninth-grade maths, for example, may contain items that range from easy to hard for testtakers having the maths ability of a ninth-grader. However, testtakers who have not yet achieved such ability might fail all of the items; because of the floor effect, the test would not provide any guidance as to the relative maths ability of testtakers in this group.
    • Ceiling effect: diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured. e.g. what would happen if all the testtakers answered all of the items correctly? —> the test user would conclude that the test was too easy for this group of testtakers, so discrimination was impaired by a ceiling effect.
  • Item branching: the ability of a computer to tailor the content and order of presentation of test items on the basis of responses to previous items
  • A computer that has stored a bank of achievement test items of different difficulty levels can be programmed to present items according to an algorithm or rule.
  • e.g. one rule might be “Do not present an item of the next difficulty level until two consecutive items of the current difficulty level are answered correctly”; another might be “Terminate the test when five consecutive items of a given level of difficulty have been answered incorrectly” (a minimal sketch of this logic appears after this list)
  • Alternatively, the pattern of items to which the testtaker is exposed might be based not on the testtaker’s responses to preceding items but on a random drawing from the pool of test items
  • Random presentation of items reduces the ease with which testtakers can memorise items on behalf of future testtakers.
  • Item branching technology may be applied when constructing tests not only of achievement but also of personality, e.g. if a respondent answers an item in a way that suggests he or she is depressed, the computer may automatically probe for depression-related symptoms and behaviour
  • Item branching technology may also be used in personality tests to recognise nonpurposive or inconsistent responding.
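
A minimal sketch of an item bank plus the two branching rules quoted above (advance a level after two consecutive correct answers; terminate after five consecutive errors at a level). The bank, difficulty levels, and function names are invented for illustration; this is the logic only, not any particular CAT system.

```python
# Minimal sketch of item branching over a small, hypothetical item bank.
import random

# Item bank: each item is tagged with a difficulty level (1 = easiest).
ITEM_BANK = [
    {"id": i, "level": level, "text": f"item {i}"}
    for i, level in enumerate([1] * 6 + [2] * 6 + [3] * 6)
]

def next_item(level, used):
    """Draw a random, not-yet-used item at the requested difficulty level."""
    candidates = [it for it in ITEM_BANK if it["level"] == level and it["id"] not in used]
    return random.choice(candidates) if candidates else None

def administer(answer_item, max_level=3):
    """answer_item(item) -> True/False. Returns the list of items actually presented."""
    level, run_correct, run_wrong = 1, 0, 0
    used, presented = set(), []
    while True:
        item = next_item(level, used)
        if item is None:                       # current level exhausted: stop the sketch
            break
        used.add(item["id"])
        presented.append(item)
        if answer_item(item):
            run_correct, run_wrong = run_correct + 1, 0
            if run_correct >= 2 and level < max_level:   # rule 1: move up a level
                level, run_correct = level + 1, 0
        else:
            run_wrong, run_correct = run_wrong + 1, 0
            if run_wrong >= 5:                 # rule 2: stop after 5 consecutive errors
                break
    return presented

print(len(administer(lambda item: True)))    # strong testtaker: 2 + 2 + 6 = 10 items
print(len(administer(lambda item: False)))   # weak testtaker: terminates after 5 items
```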

Scoring items

  • Scoring model used most commonly is cumulative model
  • Cumulative model: Higher score on test = higher ability, trait etc.
  • Class scoring (category scoring): testtaker responses earn credit towards placement in a particular class/category with other testtakers whose responses are similar in some way, e.g. where individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis.
  • Ipsative scoring: comparing a testtaker’s score on one scale within a test to another scale within that same test – e.g. the Edwards Personal Preference Schedule (EPPS), designed to measure the relative strength of different psychological needs (a minimal scoring sketch follows this list)
    • EPPS ipsative scoring system yields information on the strength of various needs in relation to the strength of other needs of the testtaker.
    • Test does not yield info on the strength of the testtaker’s need relative to the presumed strength of that need in the general population.
    • Edwards constructed his test of 210 pairs of statements in a way that respondents were “forced” to answer true or false or yes or no to only one of two statements.
    • Prior research by Edwards has indicated that the two statements were equivalent in terms of how socially desirable the responses were.
    • Sample of an EPPS-like forced-choice item, to which the respondents would indicate which is “more true” of themselves
    • I feel depressed when I fail at something
    • I feel nervous when giving a talk before a group
    • On the basis of such an ipsatively scored personality test, it would be possible to draw only intra-individual conclusions about the testtaker, e.g. “John’s need for achievement is higher than his need for affiliation”
    • It would not be appropriate to draw inter-individual comparisons on the basis of an ipsatively scored test, e.g. it would be inappropriate to compare two testtakers with a statement like “John’s need for achievement is higher than Jane’s need for achievement”
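
A minimal contrast between cumulative and ipsative scoring in Python. The need labels and the forced-choice data are hypothetical; the point is only that an ipsative profile describes the balance of needs within one testtaker, so it should not be compared across testtakers.

```python
# Sketch contrasting cumulative and ipsative scoring models.

def cumulative_score(item_credits):
    """Cumulative model: the higher the summed credit, the more of the trait."""
    return sum(item_credits)

def ipsative_profile(choices):
    """Ipsative model: each forced choice credits one of the testtaker's own needs,
    yielding only intra-individual (within-person) comparisons."""
    profile = {}
    for need in choices:
        profile[need] = profile.get(need, 0) + 1
    return profile

# Hypothetical data
print(cumulative_score([1, 0, 1, 1, 1]))                     # 4: four items endorsed
print(ipsative_profile(["achievement", "achievement",
                        "affiliation", "achievement", "autonomy"]))
# {'achievement': 3, 'affiliation': 1, 'autonomy': 1}
# Valid: "this person's need for achievement outranks their need for affiliation."
# Not valid: comparing this achievement tally with another testtaker's tally.
```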

TEST TRYOUT

  • Tests should be tried out on people who are similar in critical respect to the people for whom the test was designed
  • Informal rule of thumb —> no fewer than 5 subjects and preferably as many as 10 for each item on the test (a trivial sketch of this rule follows this list)
  • The more subjects the better —> the more subjects employed, the weaker the role of chance in the data analysis
  • Risk in using too few subjects during test tryout comes during factor analysis of the findings —> phantom factors, factors that are actually just artifacts of the small sample size, may emerge
  • Test tryout should be executed under conditions as identical as possible to the conditions under which the standardised test will be administered —> ensure differences in response are due to the items not extraneous factors.
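
The 5-to-10-subjects-per-item rule of thumb translates directly into a tryout sample-size range; a trivial sketch, with the 100-item figure chosen only as an example.

```python
# Rule-of-thumb tryout sample size: 5 to 10 subjects per test item.
def tryout_sample_range(n_items, low=5, high=10):
    return n_items * low, n_items * high

print(tryout_sample_range(100))   # (500, 1000) subjects for a 100-item draft
```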

What is a good item?

  • Reliable and valid
  • Helps discriminate testtakers
  • A good test item is one that is answered correctly by high scorers on the test as a whole; an item that is answered incorrectly by high scorers on the test as a whole is not a good item, and vice versa for low scorers (a minimal item-analysis sketch follows this list)
  • How does a test developer identify good items? —> After the first draft of the test has been administered to a representative group of examinees, the test developer analyses test scores and responses to individual items.
  • Item analysis: the statistical scrutiny that the test data can undergo
  • Item analysis may be quantitative and qualitative.
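
A minimal item-analysis sketch of the point above: item difficulty as the proportion answering correctly, and a simple discrimination index as the difference between the proportion correct in the top-scoring half and in the bottom-scoring half of testtakers. The response matrix is invented, and this upper-lower index is only one of several ways such statistics are computed.

```python
# Minimal item analysis: difficulty (proportion correct) and an upper-lower
# discrimination index for each item of a small, hypothetical scored test.

# Rows = testtakers, columns = items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

def item_difficulty(resp, item):
    """Proportion of testtakers answering the item correctly."""
    return sum(row[item] for row in resp) / len(resp)

def discrimination_index(resp, item):
    """Proportion correct among high scorers minus proportion correct among low scorers."""
    ranked = sorted(resp, key=sum, reverse=True)       # order testtakers by total score
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    return (sum(row[item] for row in upper) - sum(row[item] for row in lower)) / half

for i in range(4):
    print(f"item {i}: difficulty={item_difficulty(responses, i):.2f}, "
          f"discrimination={discrimination_index(responses, i):.2f}")
# An item answered correctly by high scorers but not by low scorers (index near +1)
# is doing its job; an index near 0 or below flags a poor item.
```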