• Reliability = consistency in measurement
  • A score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error.
  • Error: component of the observed test score that does not have to do with the testtaker’s ability.
  • X = observed score
  • T = true score
  • E = Error
  • Classical test theory holds that an observed score equals the true score plus error:

X = T + E

  • Variance (σ²) – the standard deviation squared —> describes test score variability
  • This statistic is useful because it can be broken up into components
    • True variance – variance from true differences
    • Error variance – variance from irrelevant, random sources
  • If σ² represents the total variance, σ²tr the true variance, and σ²e the error variance, then the relationship of the variances can be expressed as:

σ² = σ²tr + σ²e

  • Reliability: the proportion of the total variance attributed to the true variance (σ²tr/σ²).
  • The greater the proportion of the total variance attributed to the true variance, the more reliable the test
  • The more reliable the test, the smaller the error variance (σ²e) – see the sketch after this list
  • Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of the test.
  • Measurement error: all of the factors associated with the process of measuring some variable, other than the variable being measured.
  • e.g. a mathematics test administered in English to a group of Chinese ‘whiz kids’ newly arrived in America – if they fail, does the test show that they are not whiz kids? Possibly, but more likely their English-language skills need evaluating – perhaps they did not do well because they could not read or understand the test. —> the fact that the test was written in English could have contributed in large part to the measurement error in this evaluation.
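
A quick way to see the variance decomposition is to simulate it. Below is a minimal sketch (not from the source; all numbers are illustrative) in which observed scores are generated as true score plus random error, exactly as X = T + E describes:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                         # simulated testtakers
true = rng.normal(100, 10, n)       # true scores: sigma_tr = 10, variance 100
error = rng.normal(0, 5, n)         # random error: sigma_e = 5, variance 25
observed = true + error             # classical test theory: X = T + E

print(observed.var())               # ~125: total variance = true + error variance
print(true.var() / observed.var())  # ~0.80: reliability = true / total variance
```

With these illustrative numbers, about 80% of the score variance reflects true differences among testtakers, i.e. a reliability of roughly .80.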

 

 

The Standard Error of Measurement

  • We don’t know the true score for any individual testtaker, so we must estimate it
  • Best estimate available of the individual’s true score on the test is the test score already obtained.
  • Thus, if a student achieved a score of 50 on one spelling test and the test had a SEM of 4, then – using 50 as the point estimate – we can be:
    • 68% confident that the true score falls within 50 ± 1(SEM), i.e. between 46 and 54
    • 95% confident that the true score falls within 50 ± 2(SEM), i.e. between 42 and 58
    • 99.7% confident that the true score falls within 50 ± 3(SEM), i.e. between 38 and 62
  • The formula is SEM = SD√(1 – r), where r is the test’s reliability coefficient
  • If the SD of a test is held constant, then the smaller the SEM, the more reliable the test will be; as r increases, the SEM decreases

e.g. for a test with SD = 10: if r = .84, SEM = 10√(1 – .84) = 4; if r = .91, SEM = 10√(1 – .91) = 3

  • The SEM is most frequently used in the interpretation of individual test scores
  • Confidence interval: a range or band of test scores that is likely to contain the true score.
  • Calculating a confidence interval, e.g. at 95% confidence —> suppose a 22-year-old testtaker obtained a FSIQ of 75. The test user can be 95% sure that this testtaker’s true FSIQ falls in the range of 70 to 80 —> take the observed score of 75, plus or minus 1.96 multiplied by the standard error of measurement. In the test manual we find that the standard error of measurement of the FSIQ for a 22-year-old testtaker is 2.37. With this info in hand, the 95% confidence interval is calculated as follows:

75 ± 1.96(2.37) = 75 ± 4.65

Therefore, 75 – 4.65 ≈ 70 and 75 + 4.65 ≈ 80 —> the 95% confidence interval is 70–80
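
As a sanity check on the arithmetic above, here is a minimal sketch (illustrative; the function names are my own) that computes the SEM and the resulting confidence interval:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed: float, sem_value: float, z: float = 1.96):
    """Band of scores around the observed score likely to contain the true score."""
    return observed - z * sem_value, observed + z * sem_value

# FSIQ example from the notes: observed score 75, SEM of 2.37 from the manual
low, high = confidence_interval(75, 2.37)
print(f"95% CI: {low:.2f} to {high:.2f}")  # 70.35 to 79.65, i.e. roughly 70-80

# SD = 10 example: reliability .84 gives SEM = 4
print(sem(10, .84))                        # ~4.0
```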

The Standard Error of the Difference Between Two Scores

  • Comparisons between scores are made using the standard error of the difference —> statistical measure that can determine how large a difference should be before it is considered statistically significant.
  • Conventional significance levels: 5%, or 1% (more rigorous)
  • The standard error of the difference between two scores can be the appropriate statistical tool to address three types of questions:
    • How did this individual’s performance on test 1 compare with his/her performance on test 2?
    • How did this individual’s performance on test 1 compare with someone else’s performance on test 1?
    • How did this individual’s performance on test 1 compare with someone else’s performance on test 2?
  • Essential that scores are converted to the same scale
  • The formula for the standard error of the difference between two scores is:

σdiff = √(σ²meas1 + σ²meas2)

where σmeas1 and σmeas2 are the standard errors of measurement of the two scores.

  • If we substitute reliability coefficients for the standard errors of measurement of the separate scores, the formula becomes:

σdiff = SD√(2 – r1 – r2)

where r1 and r2 are the reliability coefficients of the two tests.

  • **both tests would have to have the same SD because they must be on the same scale
  • The standard error of the difference between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores.
  • The value obtained by calculating the standard error of the difference is used in much the same way as the standard error of the mean, i.e. if we wish to be 95% confident that two scores are different, we would want them to be separated by 2 standard errors of the difference. A separation of only 1 standard error of the difference would give us 68% confidence that the two true scores are different.
  • Example of use of standard error of the difference between two scores
    • Situation of a corporate personnel manager who is seeking a highly responsible person for the position of vice president of safety. The personnel officer decides to use a newly published test called the Safety-Mindedness Test (SMT) to screen applicants for the position. After placing an ad in the employment section of the local newspaper, the personnel officer tests 100 applicants for the position using the SMT and narrows the field down to the two highest scorers: Moe (score: 125) and Larry (score: 134).
    • Assuming the measured reliability of this test to be .92 and its SD to be 14, should the personnel officer conclude that Larry performed significantly better than Moe? To answer this question, first calculate the standard error of the difference:

σdiff = 14√(2 – .92 – .92) = 14√.16 = 5.6
  • **in this application of the formula, the two test reliability coefficients are the same because the two scores being compared are derived from the same test.
  • For any standard error of the difference, we can be:
    • 68% confident that two scores differ if they are separated by 1 σdiff
    • 95% confident that two scores differ if they are separated by 2 σdiff
    • 99.7% confident that two scores differ if they are separated by 3 σdiff
  • Applying this info to the standard error of the difference just computed for the SMT, we see that the personnel officer can be:
    • 68% confident that two scores differing by 5.6 points reflect a true difference
    • 95% confident that two scores differing by 11.2 points reflect a true difference
    • 99.7% confident that two scores differing by 16.8 points reflect a true difference
  • The difference between Larry’s and Moe’s scores is only 9 points, not a large enough difference for the personnel officer to conclude with 95% confidence that the two individuals have true scores that differ on this test.
  • If Larry and Moe were to take a parallel form of the SMT, the personnel officer could not be 95% confident that at the next testing, Larry would outperform Moe.
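
The Moe/Larry calculation can be reproduced in a few lines. This is a minimal sketch (illustrative; the function name is my own) using the substituted formula σdiff = SD√(2 – r1 – r2):

```python
import math

def se_diff(sd: float, r1: float, r2: float) -> float:
    """Standard error of the difference between two scores on the same scale."""
    return sd * math.sqrt(2 - r1 - r2)

# SMT example from the notes: SD = 14, reliability .92 for both scores
sigma_diff = se_diff(14, .92, .92)
print(round(sigma_diff, 1))              # 5.6

# Larry (134) vs. Moe (125): a 9-point gap vs. the 11.2 needed for 95% confidence
print(abs(134 - 125) >= 2 * sigma_diff)  # False -> not significant at 95%
```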

THE CONCEPT OF VALIDITY

  • Validity: a judgment/estimate of how well a test measures what it purports to measure in a particular context.
  • A judgment based on the evidence about the appropriateness of inferences drawn from test scores.
  • No test is universally valid —> ‘valid test’ means it is valid for a particular use with a particular population of testtakers at a particular time.
  • Validation: the process of gathering and evaluating evidence about validity.
  • Test users may conduct their own local validation studies (necessary when the test user plans to alter in some way the format, instructions, language, or content of the test)
  • Validity has traditionally been conceptualised into 3 categories (the trinitarian model of validity):
  • Content validity: measure of validity based on evaluation of the subjects, topics, or content covered by the items in the test
  • Criterion-related validity: measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures
  • Construct validity: measure of validity that is arrived at by executing a comprehensive analysis of:
    1. how scores on the test relate to other test scores and measures
    2. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.
  • Trinitarian view —> construct validity = umbrella validity because every other variety of validity falls under it.

Face validity

  • Relates more to what a test appears to measure to the person being tested than to what the test actually measures
  • Face validity: a judgment concerning how relevant the test items appear to be
  • e.g. a paper-and-pencil personality test named The Introversion/Extraversion Test, with items that ask respondents whether they have acted in an introverted or extraverted way in particular situations, may be perceived by respondents as a highly face-valid test
  • Face validity is judged from the perspective of the testtaker

Content validity

  • Content validity: describes a judgment of how adequately a test samples behaviour representative of the universe of behaviour that the test was designed to sample.
  • e.g. the universe of behaviour referred to as assertive is very wide-ranging – a content-valid, paper-and-pencil test of assertiveness would be one that is adequately representative of this wide range
  • For an educational achievement test to be content-valid —> the proportion of material covered by the test should approximate the proportion of material covered in the course
  • For an employment test to be content-valid —> its content must be a representative sample of the job-related skills required for employment

The quantification of content validity

  • Content validity is important in employment settings, where tests used to hire and promote people are carefully scrutinised for their relevance to the job, among other factors.
  • One method of quantifying content validity gauges agreement among raters or judges regarding how essential a particular item is.
  • Lawshe (1975) proposed that, for each item, each rater respond to the following question: “Is the skill or knowledge measured by this item:
    • Essential
    • Useful but not essential
    • Not necessary
  • to the performance of the job?”.
  • If more than half the panelists indicate that an item is essential, that item has at least some content validity —> greater levels of content validity exist as larger numbers of panelists agree that a particular item is essential.

Content Validity Ratio (CVR):

  1. Negative CVR: When fewer than half the panelists indicate “essential”, the CVR is negative
  2. Zero CVR: When exactly half the panelists indicate “essential”, the CVR is zero.
  3. Positive CVR: When more than half but not all the panelists indicate “essential”, the CVR ranges between .00 and .99.
  • If the amount of agreement observed is more than 5% likely to occur because of chance, the item should be eliminated.
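
The notes describe the three CVR cases without giving the formula; Lawshe’s standard formula is CVR = (nₑ – N/2)/(N/2), where nₑ is the number of panelists rating the item “essential” and N is the total number of panelists. A minimal sketch:

```python
def cvr(n_essential: int, n_panelists: int) -> float:
    """Lawshe's content validity ratio: CVR = (n_e - N/2) / (N/2)."""
    half = n_panelists / 2
    return (n_essential - half) / half

print(cvr(5, 10))   #  0.0 -> exactly half say essential: zero CVR
print(cvr(9, 10))   #  0.8 -> more than half: positive CVR
print(cvr(3, 10))   # -0.4 -> fewer than half: negative CVR
```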

Culture and the relativity of content validity

  • A history test considered valid in one classroom, at one time, and in one place will not necessarily be considered so in another classroom, at another time, and in another place.
  • Politics may play a part in perceptions and judgments concerning the validity of tests and test items.

CRITERION RELATED VALIDITY

Criterion-related validity: a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest – the measure of interest being the criterion.

Concurrent validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time.

Predictive validity: an index of the degree to which a test score predicts some criterion measure.

What is a criterion?

  • Criterion: a standard against which a test or a test score is evaluated.
  • Characteristics of an adequate criterion:
    • It is relevant
    • It is valid for the purpose for which it is being used
    • It is uncontaminated
  • Criterion contamination: the term applied to a criterion measure that has been based, at least in part, on predictor measures.
  • E.g. a hypothetical “Inmate Violence Potential Test” (IVPT) designed to predict a prisoner’s potential for violence in the cell block. In part, this evaluation entails ratings from fellow inmates, guards, and other staff in order to come up with a number that represents each inmate’s violence potential. After all the inmates in the study have been given scores on this test, the study authors then attempt to validate the test by asking guards to rate each inmate on their violence potential. Because the guards’ opinions were used to formulate the inmates’ test scores in the first place (the predictor variable), the guards’ opinions cannot be used as a criterion against which to judge the soundness of the test. If the guards’ opinions were used both as a predictor and as a criterion, we would say that criterion contamination had occurred.

Concurrent validity

  • If test scores are obtained at about the same time as the criterion measures are obtained, measures of the relationship between test scores and the criterion provide evidence of concurrent validity.
  • Statements of concurrent validity indicate the extent to which test scores may be used to estimate an individual’s present standing on a criterion.
  • e.g. if scores (or classifications) made on the basis of a psychodiagnostic test were to be validated against a criterion of already diagnosed psychiatric patients, then the process would be one of concurrent validation.
  • Once the validity of the inference from the test scores is established, a test may provide a faster, less expensive way to offer a diagnosis or classification decision.
  • A test with satisfactorily demonstrated concurrent validity may be appealing because —> potential savings of money and time
  • Sometimes the concurrent validity of a particular test (test A) is explored with respect to another test (test B) – test B has been validated —> how well does test A compare with test B?
  • Test B = validating criterion.

Predictive Validity

  • Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place.
  • Intervening event may be training, experience, therapy, medication or the passage of time.
  • Measures of the relationship between test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test —> how accurately scores on the test predict some criterion measure
  • e.g. measures of the relationship between college admission tests and freshman GPAs provide evidence of the predictive validity of the admission tests.

The validity coefficient

  • A correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure
  • e.g. the correlation coefficient computed from a score (or classification) on a psychodiagnostic test and the criterion score (or classification) assigned by psychodiagnosticians
  • Typically the Pearson correlation coefficient is used
  • However, others can be used depending on the type of data, sample size, and shape of the distribution
  • Validity coefficient affected by the restriction or inflation of range —> key issue whether the range of scores employed is appropriate to the objective of the correlational analysis
  • e.g. in situations where attrition in the number of subjects has occurred over the course of the study, the validity coefficient may be adversely affected.
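
A minimal sketch (all data hypothetical) of computing a validity coefficient as the Pearson r between test scores and a criterion, plus a restricted-range recomputation to illustrate the point above:

```python
import numpy as np

# Hypothetical data: admission test scores and later freshman GPAs (the criterion)
test_scores = np.array([45, 52, 61, 58, 70, 66, 49, 75])
freshman_gpa = np.array([2.5, 2.2, 3.1, 2.6, 3.4, 2.9, 2.4, 3.3])

# Validity coefficient: Pearson r between test and criterion
r = np.corrcoef(test_scores, freshman_gpa)[0, 1]
print(f"full-range r = {r:.2f}")             # ~0.88

# Restriction of range: suppose only high scorers (those admitted) are available
mask = test_scores >= 58
r_restricted = np.corrcoef(test_scores[mask], freshman_gpa[mask])[0, 1]
print(f"restricted r = {r_restricted:.2f}")  # ~0.80, smaller than full-range r
```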

CONSTRUCT VALIDITY

  • Construct validity: a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
  • Construct: a well informed, scientific idea developed or hypothesised to describe or explain behaviour.
  • e.g. intelligence is a construct that may be invoked to describe why a student performs well in school
  • Anxiety is a construct that may be invoked to describe why a psychiatric patient paces the floor
  • Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test behaviour or criterion performance
  • The researcher investigating construct validity must formulate hypotheses about the expected behaviour of high and low scorers on the test.
  • If the test is a valid measure of the construct, then high scorers and low scorers will behave as predicted by the theory.
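
One common way to test such hypotheses is to compare the criterion behaviour of high and low scorers. A minimal sketch (hypothetical data and group sizes; illustrative only) using a t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical construct-validation check: high scorers on an anxiety test are
# predicted to pace more (minutes per hour) than low scorers
high_scorers = rng.normal(12, 3, 40)
low_scorers = rng.normal(8, 3, 40)

t, p = stats.ttest_ind(high_scorers, low_scorers)
print(f"t = {t:.2f}, p = {p:.4f}")  # a small p supports the construct hypothesis
```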