• Biases that unjustifiably obscure or create differences among groups of respondents' test scores
  • It is important that test scores do not discriminate unjustifiably against any particular group
  • In this context, almost invariably ‘group’ refers to gender and/or race

Gender differences in mathematics:

  • Females on average tend to score lower on mathematics tests than males
  • However, is it possible that conventional mathematics tests discriminate against females? If so, then the differences should not be interpreted as a construct-level difference

Construct bias:

  • Concerns the relationship of observed scores to true scores on a psychological test
    o Different factorial validity between groups
  • If the relationship is systematically different for different groups, then we might conclude that the test is biased
    o In some cases, two people with different observed scores may actually have the same level of the construct
    o Even when the two group means are equal, construct bias must still be tested (the bias may be what produces the equal means)

Predictive bias:

  • Occurs when a test's use has different implications for two or more groups
  • e.g. using the ATAR to predict first-year university performance
    o If it were discovered that the ATAR was differentially predictive of academic performance for males in comparison to females, then that would be an example of predictive bias
  • Construct bias does not necessarily imply predictive bias

Identifying test score bias:

  • There are at least two categories of methods that can be used to identify test score bias
    o Internal methods to identify construct bias
    o External methods to identify predictive bias
  • No single method can be used to establish bias
  • This is analogous to establishing validity for test scores

 

Test bias is about the interpretation of scores (validity) and whether those interpretations are justifiable across all groups

Test score bias vs. true group differences:

  • Simply because two groups differ in their mean scores on a test does not imply that the test suffers from test score bias
  • If women score higher on a scale that putatively measures openness to feelings, then it is entirely possible that they are, as a group, more open to feelings than men
  • Similarly, a weighing scale would be expected to demonstrate that on average, males are heavier than females
  • This does not mean that the scale is biased

Detecting construct bias - internal evaluation:

  • Construct bias is detected through internal evaluation
  • Evaluating construct bias typically occurs by conducting analyses at the item level
    o An item on a test is biased if:
      ▪ Persons belonging to different groups respond in different ways to the item, and
      ▪ It can be shown that the differing responses are not related to the group differences associated with the psychological attribute putatively measured by the test
      ▪ i.e. variance not related to the true construct
  • In plain language, an item is considered biased if two people (e.g. a male and a female) who have the same level of an attribute (e.g. openness to emotions) tend to respond differently to that particular item
    o Construct bias is present when different groups perceive the item differently for some reason

Approach to evaluating item bias:

  • The procedure focuses on the internal structure of the test
    o No need to collect data from other measures/occasions
  • Internal structure refers to
    o The pattern of correlations among items and/or
    o The correlations between each item and the total test score
  • If the two groups exhibit different internal structures in their test responses, then we can conclude that the test is likely to suffer from construct bias

Factor analysis and bias:

  • Factor analysis (often conducted as principal components analysis, PCA) is a data-reduction technique used to
    o Determine the number of dimensions associated with a collection of items (or subscales)
    o Define the nature of the dimensions based on the factor loadings
  • Construct bias is present if:
    o A different number of dimensions is associated with test scores when a factor analysis is performed on the groups separately
      ▪ e.g. when measured separately, one dimension is found for men and two dimensions are found for women
  • More subtle construct bias can be present even when the groups are associated with the same number of dimensions (this is the more common case: the number of dimensions matches, but the pattern of factor loadings differs between the groups)
  • We estimate this with a factor congruence coefficient
    o The coefficient should be positive and large to indicate the absence of construct bias
    o It is computed by correlating the factor loadings obtained for males with those obtained for females across items

Factor congruence coefficient:

  • There are no published guidelines to help interpret factor congruence coefficients
  • The correlation, however, should at least be positive (a minimal sketch of the computation follows)
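
A hedged illustration of how the coefficient might be computed, assuming one-factor loadings have already been extracted separately for each group; the loading values below are hypothetical. Tucker's congruence coefficient is the usual formalisation, and the plain correlation of loadings described above is shown alongside it.

```python
import numpy as np

def tucker_congruence(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two factor-loading vectors.

    Values near +1 suggest the factor has the same meaning in both groups;
    small or negative values point toward construct bias.
    """
    a = np.asarray(loadings_a, dtype=float)
    b = np.asarray(loadings_b, dtype=float)
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

# Hypothetical loadings for the same five items, estimated separately per group
male_loadings = [0.71, 0.65, 0.58, 0.62, 0.70]
female_loadings = [0.68, 0.70, 0.55, 0.60, 0.66]

print(tucker_congruence(male_loadings, female_loadings))  # congruence coefficient
print(np.corrcoef(male_loadings, female_loadings)[0, 1])  # plain Pearson correlation
```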

Differential item functioning (DIF) analysis:

  • DIF is a sophisticated approach to the assessment of construct bias developed within the context of item response theory (IRT)
  • Theoretically, DIF analysis assumes that we can estimate a person's trait level on a particular dimension via the test data
  • This is true for all groups
  • We wish to determine whether the trait levels and the item responses match up in the same way for both groups
  • If they do not, we have evidence of construct bias (a simplified sketch follows)
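
Full IRT-based DIF is beyond a short example, so the sketch below uses the logistic-regression approach to DIF as an illustrative stand-in: each item response is predicted from a trait proxy (here, a noisy total-score analogue), group membership, and their interaction. The simulated data and effect sizes are entirely hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
group = rng.integers(0, 2, n)            # 0 = male, 1 = female (hypothetical coding)
theta = rng.normal(0, 1, n)              # latent trait level
total = theta + rng.normal(0, 0.5, n)    # observed total score standing in for theta

# Simulate one biased item: at the same trait level, one group endorses it less often
p = 1 / (1 + np.exp(-(theta - 0.6 * group)))
item = rng.binomial(1, p)

X = sm.add_constant(pd.DataFrame({"total": total,
                                  "group": group,
                                  "interaction": total * group}))
fit = sm.Logit(item, X).fit(disp=False)
print(fit.params)
# A substantial 'group' coefficient indicates uniform DIF;
# a substantial 'interaction' coefficient indicates non-uniform DIF.
```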

Rank order consistency:

  • A simple way to test for evidence of construct bias (see the sketch below)
    o Calculate the means associated with each item separately for each group
    o Then calculate the Spearman rank correlation between the item means (i.e. item difficulties)
  • If the correlation between item means is low, then we would have some evidence to suggest construct bias
    o As a rough guide, a correlation between item means below .90 would suggest construct bias
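
A minimal sketch of the rank-order consistency check, assuming per-item means have already been computed for each group; the numbers are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-item mean scores (item difficulties) for each group
male_item_means = np.array([3.1, 2.4, 4.0, 3.6, 2.9, 3.3])
female_item_means = np.array([3.0, 2.6, 3.9, 3.5, 3.1, 3.2])

rho, p_value = spearmanr(male_item_means, female_item_means)
print(f"Spearman rho = {rho:.2f}")
# A rho well below .90 would hint at construct bias; here the rank
# ordering of item difficulties is nearly identical across groups.
```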

Detecting predictive bias - external evaluation:

  • Predictive bias is detected through external evaluation
  • Construct bias tends to be examined more in pure research contexts; predictive bias, by contrast, is a more applied consideration
  • Specifically, do the scores from a particular test predict a criterion (or outcome) with equal accuracy for two or more groups?
  • e.g. the government employees who create the tests behind the ATAR should make sure that the scores have equal predictive validity with respect to university grade performance across all groups

Predictive bias analyses:

  • Involves two steps:
    • Need to determine whether the test scores predict the dependent variable to start with
    • Need to determine whether the test scores predict the dependent variable equally across groups
  • Both steps involve conducting a regression analysis

Bivariate regression review:

  • Is an extension of Pearson correlation
  • Consists of using one continuously measured variable to predict another continuously measured variable
  • The main difference is that bivariate regression includes additional terms which can be used to create a regression equation
  • Intercept:
    o The expected value of Y when X = 0
    o The point at which the regression line crosses the Y-axis
  • Slope:
    o The expected increase or decrease in the dependent variable (Y) as a function of an increase or decrease in the independent variable (X)
  • Predicting people's scores (see the sketch below)
    o The intercept is the predicted score when the independent score is zero
    o The slope is the expected increase in the dependent score for each one-unit increase in the independent score
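
A small sketch making the intercept and slope concrete, with hypothetical test scores (X) predicting a hypothetical outcome (Y).

```python
import numpy as np

# Hypothetical data: test score (X) predicting first-year performance (Y)
x = np.array([55, 62, 70, 74, 81, 88, 93])
y = np.array([4.1, 4.5, 5.0, 5.2, 5.8, 6.1, 6.5])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares regression line
print(f"Y-hat = {intercept:.2f} + {slope:.3f} * X")

# The intercept is the expected Y when X = 0; the slope is the expected
# change in Y for each one-unit increase in X.
print(f"Predicted Y for X = 75: {intercept + slope * 75:.2f}")
```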

One size fits all:

  • The assumption that different groups share a common regression equation is based on the idea that ‘one size fits all’
  • When this is observed, there is no predictive bias associated with the test scores
  • The regression equation derived from the total sample is referred to as the ‘common regression equation’
  • We also use the terms 'common intercept' and 'common slope'

Next we estimate separate regression equations for each group

 

We then compare the group-level regression equations with the common regression equation

  • If the group-level estimates do not match the common regression equation, then you might suspect that your test scores are biased (a sketch of this two-step comparison follows)
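
A sketch of the two-step procedure on simulated data: fit the common equation first, then a moderated regression whose group and interaction terms capture intercept and slope differences. The variable names and the injected intercept difference are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 300
group = rng.integers(0, 2, n)                 # 0 = male, 1 = female (hypothetical)
score = rng.normal(70, 10, n)                 # test scores
# Simulated outcome with a small built-in intercept difference between groups
outcome = 1.0 + 0.05 * score + 0.4 * group + rng.normal(0, 0.5, n)

df = pd.DataFrame({"outcome": outcome, "score": score, "group": group})

common = smf.ols("outcome ~ score", data=df).fit()             # step 1: common equation
moderated = smf.ols("outcome ~ score * group", data=df).fit()  # step 2: group terms
print(common.params)
print(moderated.params)
# A substantial 'group' term suggests intercept bias;
# a substantial 'score:group' term suggests slope bias.
```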

Testing predictive bias statistically - intercept bias:

  • If the slope is equal for both groups, there is no slope bias
    o But what if the intercepts are different?
    o Different intercepts suggest intercept bias, e.g. if the male intercept is higher, males receive higher predicted outcomes than females with the same test score
  • If the lines run parallel, there is no indication of slope bias

Slope bias:

  • Suppose two groups had similar intercepts, but their slopes differed in magnitude
  • This implies that we would expect different predicted scores for males and females with the same test score (see the worked sketch below)
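
A tiny worked example under hypothetical group equations: equal intercepts, unequal slopes, so the same test score yields different predictions.

```python
# Hypothetical group-level regression equations with a shared intercept
intercept = 1.0
slope_male, slope_female = 0.05, 0.08

test_score = 70
print(intercept + slope_male * test_score)    # 4.5 predicted for a male
print(intercept + slope_female * test_score)  # 6.6 predicted for a female
# Same test score, different predicted outcomes: slope bias.
```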

Intercept and slope bias:

  • There are occasions where both intercept and slope bias co-occur
  • It is rare to have one form of bias (slope or intercept) without the other

Outcome score bias:

  • Thus far, the discussion of bias has focused on the predictor variable (e.g. test/questionnaire)
  • It is entirely possible that the outcome variable scores are the ones that are biased (e.g. quality of relationships, the tester's mood, supervisor ratings)
  • For a full analysis of bias, you would want to examine both possibilities- predictor bias and outcome bias

The effect of reliability:

  • There are occasions where one group will have less reliable scores than the other group (reliabilities are unlikely to ever be exactly equal)
  • These differences in reliability can cause differences in slopes and intercepts
  • Differences in reliability are a form of construct bias that can have an impact on predictive bias (a simulation sketch follows)
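
A simulation sketch of the attenuation effect: random measurement error in the predictor lowers its reliability and flattens the estimated slope, so groups with unequal reliabilities can show unequal slopes even without true predictive bias. All quantities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
true_x = rng.normal(0, 1, n)
y = 2.0 * true_x + rng.normal(0, 1, n)         # true slope = 2.0

for error_sd in (0.0, 0.5, 1.0):               # increasing measurement error
    observed_x = true_x + rng.normal(0, error_sd, n)
    reliability = 1 / (1 + error_sd**2)        # var(true) / var(observed)
    slope = np.polyfit(observed_x, y, deg=1)[0]
    print(f"reliability ~ {reliability:.2f}, estimated slope ~ {slope:.2f}")
# Lower reliability attenuates the slope toward zero, mimicking slope bias.
```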

Keep in mind:

  • The absence of a difference between two means might itself be caused by construct bias
  • We should probably always test the possibility of construct bias when we test the difference between two means based on composite scores