TEST RELIABILITY
Reliability
In statistics, reliability is the consistency of a measuring instrument. This can mean either that the instrument gives, or is likely to give, the same measurement on repeated use, or, in the case of more subjective instruments, that two independent assessors give similar scores.
Reliability is the extent to which a
test is repeatable and yields consistent scores. In order to be
valid, a test must be reliable; but reliability does not guarantee validity.
All measurement procedures have the potential for error, so the aim is to
minimize it. The goal of estimating reliability (consistency) is to
determine how much of the variability in test scores is due to measurement
error and how much is due to variability in true scores.
For example, suppose a person takes a personality assessment today and scores high on a trait like dominance, and we retest that same person after six weeks. If the person again scores high on dominance, we can say that the test is reliable. If, however, the individual now scores low on dominance, we would have to conclude that the measure is unreliable.
Reliability can be improved by taking repeated measurements with the same test and by obtaining many different measures using slightly different techniques and methods. For example, we would not consider one multiple-choice exam question a reliable basis for testing your knowledge of "individual differences"; many questions are asked, in many different formats (e.g., exam, essay, presentation), to help provide a more reliable score.
Validity
Validity is the extent to
which a test measures what it is supposed to measure. Validity is a subjective judgment
made on the basis of experience and empirical indicators. Validity asks
"Is the test measuring what you think it’s measuring?” For example, we
might define "violence" as an act intended to cause harm to another
person (a conceptual definition) but the operational definition might be
seeing:
- how many times a child hits a doll
- how often a child pushes to the front of the queue
- how many physical scraps he/she gets into in the playground.
Are these valid measures of
aggression? i.e., how well does the operational definition match the
conceptual definition?
Difference between Reliability and Validity
Validity and reliability are two terms that go hand in hand in any form of testing, and it is important to understand the difference between them. Validity refers to evidence that a test actually measures what it is supposed to measure, whereas reliability is a measure of consistency. In terms of accuracy and precision, reliability is precision, while validity is accuracy.
Bathroom scale analogy
An example often used to elucidate the difference between reliability and validity in the experimental sciences is a common bathroom scale. If someone who weighs 200 lbs steps on the scale 10 times and it reads "200" each time, then the measurement is both reliable and valid. If the scale consistently reads "150", then it is not valid, but it is still reliable because the measurement is very consistent. If the scale varied a lot around 200 (190, 205, 192, 209, etc.), then the scale could be considered valid but not reliable.
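To make the accuracy/precision distinction concrete, here is a minimal sketch in Python, assuming NumPy and made-up readings (only 190, 205, 192, 209 come from the text above): validity corresponds to how close the average reading is to the true weight, while reliability corresponds to how little the readings spread.

```python
import numpy as np

true_weight = 200  # the person's actual weight in lbs

# Hypothetical readings from three scales
reliable_and_valid = np.array([200] * 10)
reliable_not_valid = np.array([150] * 10)
valid_not_reliable = np.array([190, 205, 192, 209, 198, 204, 195, 207, 193, 206])

for name, readings in [("reliable and valid", reliable_and_valid),
                       ("reliable but not valid", reliable_not_valid),
                       ("valid but not reliable", valid_not_reliable)]:
    bias = readings.mean() - true_weight    # accuracy/validity: how far the average is from the truth
    spread = readings.std(ddof=1)           # precision/reliability: how much the readings vary
    print(f"{name}: mean error = {bias:+.1f} lbs, spread (SD) = {spread:.1f} lbs")
```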
TYPES OF RELIABILITY
Reliability may be estimated through a variety of methods that fall into two types: single-administration (the split-half and internal consistency methods) and multiple-administration (the test-retest and parallel-forms methods). Multiple-administration methods require that two assessments be administered. Each estimation method is sensitive to different sources of error, so the resulting estimates should not be expected to be equal. Reliability estimates from one sample might differ from those of a second sample if the second sample is drawn from a different population. (This is true of measures of all types: yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)
Inter-Rater or Inter-Observer Reliability
Inter-rater or inter-observer reliability measures homogeneity: the same form is administered to the same people by two or more raters/interviewers in order to establish the extent of consensus on the use of the instrument by those who administer it. It is used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.
There are two major ways to estimate inter-rater reliability. If your measurement consists of categories (the raters are checking off which category each observation falls in), you can calculate the percent of agreement between the
raters. For instance, let's say you had 100 observations that were being rated
by two raters. For each observation, the rater could check one of three
categories. Imagine that on 86 of the 100 observations the raters checked the
same category. In this case, the percent of agreement would be 86%. OK, it's a
crude measure, but it does give an idea of how much agreement exists, and it
works no matter how many categories are used for each observation.
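As a minimal sketch of the percent-agreement calculation, in plain Python and with hypothetical ratings for ten observations rather than the 100 described above:

```python
def percent_agreement(rater1, rater2):
    """Percent of observations on which two raters chose the same category."""
    agreements = sum(a == b for a, b in zip(rater1, rater2))
    return 100.0 * agreements / len(rater1)

# Hypothetical ratings: each rater places ten observations into category 1, 2, or 3
rater1 = [1, 2, 2, 3, 1, 1, 2, 3, 3, 1]
rater2 = [1, 2, 2, 3, 1, 2, 2, 3, 1, 1]

print(percent_agreement(rater1, rater2))  # 80.0 (the raters agree on 8 of the 10 observations)
```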
The other major
way to estimate inter-rater reliability is appropriate when the measure is a
continuous one. There, all you need to do is calculate the correlation between
the ratings of the two observers. For instance, they might be rating the
overall level of activity in a classroom on a 1-to-7 scale. You could have them
give their rating at regular time intervals (e.g., every 30 seconds). The
correlation between these ratings would give you an estimate of the reliability
or consistency between the raters.
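For the continuous case, the same idea might look like this in Python, assuming NumPy and hypothetical 1-to-7 activity ratings recorded by two observers every 30 seconds:

```python
import numpy as np

# Hypothetical activity ratings (1-to-7 scale) from two observers at ten 30-second intervals
observer1 = np.array([3, 4, 4, 5, 2, 6, 5, 3, 4, 7])
observer2 = np.array([3, 5, 4, 5, 3, 6, 4, 3, 5, 7])

# The correlation between the two sets of ratings estimates inter-rater reliability
r = np.corrcoef(observer1, observer2)[0, 1]
print(f"Inter-rater reliability estimate: r = {r:.2f}")
```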
Test-retest Reliability
The most commonly used method of determining reliability is the test-retest method, which measures stability over time. It involves administering the test to the same group of people at least twice and then correlating the first set of scores with the second to determine how closely they are related. Reliability correlations range between 0 (low reliability) and 1 (high reliability); it is highly unlikely they will be negative.
This approach
assumes that there is no substantial change in the construct being measured
between the two occasions. The amount of time allowed between measures is
critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation.
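A minimal sketch of the test-retest calculation, assuming NumPy and hypothetical scores for six people tested on two occasions:

```python
import numpy as np

# Hypothetical dominance scores for the same six people, tested six weeks apart
time1 = np.array([22, 35, 28, 41, 30, 25])
time2 = np.array([24, 33, 30, 40, 28, 26])

# The test-retest reliability estimate is the correlation between the two administrations
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: r = {r:.2f}")
```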
Split-Half Reliability
In split-half reliability we randomly divide all the items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate is simply the correlation between these two total scores (for example, .87): the relationship between half the items and the other half. The random division of items for split-half reliability can be done by software.
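A minimal sketch of the split-half procedure, assuming NumPy and simulated data in which each item score is a person's true score plus random error:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: 20 people answering 10 items that measure the same construct
true_score = rng.normal(0, 1, size=(20, 1))              # each person's standing on the construct
scores = true_score + rng.normal(0, 1, size=(20, 10))    # item score = true score + random error

# Randomly divide the 10 items into two halves and total each half for every person
items = rng.permutation(10)
half_a = scores[:, items[:5]].sum(axis=1)
half_b = scores[:, items[5:]].sum(axis=1)

# The split-half reliability estimate is the correlation between the two half-totals
r = np.corrcoef(half_a, half_b)[0, 1]
print(f"Split-half reliability estimate: r = {r:.2f}")
```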
Parallel-Forms Reliability
In parallel
forms reliability we first have to create two parallel forms. One way to
accomplish this is to create a large set of questions that address the same
construct and then randomly divide the questions into two sets. We administer
both instruments to the same sample of people. The correlation between the two
parallel forms is the estimate of reliability. One major problem with this
approach is that we have to be able to generate lots of items that reflect the
same construct.
Furthermore,
this approach makes the assumption that the randomly divided halves are
parallel or equivalent. Even by chance this will sometimes not be the case. The
parallel forms approach is very similar to the split-half reliability. The
major difference is that parallel forms are constructed so that the two forms
can be used independently of each other and considered equivalent measures. For instance, we might be concerned about a testing threat to internal validity. If we use Form A for the pretest and Form B for the posttest, we minimize that problem. It would be even better if we randomly assigned individuals to receive Form A or Form B on the pretest and then switched forms on the posttest.
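A minimal sketch of the logistics just described, in plain Python with a hypothetical pool of 20 questions and hypothetical participant IDs: the pool is randomly split into two parallel forms, and each person receives one form at pretest and the other at posttest.

```python
import random

random.seed(1)

# Hypothetical pool of 20 questions written to tap the same construct
question_pool = [f"Q{i}" for i in range(1, 21)]

# Randomly divide the pool into two parallel forms
random.shuffle(question_pool)
form_a, form_b = question_pool[:10], question_pool[10:]
print("Form A:", form_a)
print("Form B:", form_b)

# Counterbalance: each person gets one form at pretest and the other at posttest
for person in ["P1", "P2", "P3", "P4"]:
    pretest = random.choice(["A", "B"])
    posttest = "B" if pretest == "A" else "A"
    print(f"{person}: pretest Form {pretest}, posttest Form {posttest}")
```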
Internal Consistency Reliability
Internal consistency reliability assesses the consistency of results across items within a test. In internal
consistency reliability estimation we use our single measurement instrument
administered to a group of people on one occasion to estimate reliability. In
effect we judge the reliability of the instrument by estimating how well the
items that reflect the same construct yield similar results. We are looking at
how consistent the results are for different items for the same construct
within the measure. There are a wide variety of internal consistency measures that can be used; one of them is the average inter-item correlation.
Average inter-item correlation: the average inter-item correlation uses all of the items on our instrument that are designed to measure the same construct. We first compute the correlation between each pair of items. For example, if we have six items we will have 15 different item pairings (i.e., 15 correlations). The average inter-item correlation is simply the mean of all these correlations; we might find, for example, an average inter-item correlation of .90, with the individual correlations ranging from .84 to .95.
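A minimal sketch of the average inter-item correlation, assuming NumPy and simulated responses from 50 people to six items measuring the same construct:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data: 50 people answering 6 items that measure the same construct
true_score = rng.normal(0, 1, size=(50, 1))
items = true_score + rng.normal(0, 0.5, size=(50, 6))   # item score = true score + random error

# Correlate every pair of items: 6 items give 15 unique pairings
corr = np.corrcoef(items, rowvar=False)                 # 6 x 6 correlation matrix
pairs = corr[np.triu_indices(6, k=1)]                   # the 15 correlations above the diagonal

print(f"{len(pairs)} item pairings, average inter-item correlation = {pairs.mean():.2f}")
```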
How reliable should tests be? Some reliability guidelines
- .90 = high reliability
- .80 = moderate reliability
- .70 = low reliability
- High reliability is required when tests are used to make important decisions or to sort individuals into many different categories based upon relatively small individual differences, e.g. intelligence (most standardized tests of intelligence report reliability estimates around .90).
- Lower reliability is acceptable when tests are used for preliminary rather than final decisions, or to sort people into a small number of groups based on gross individual differences, e.g. height or sociability/extraversion.
- Reliability estimates below .60 are usually regarded as unacceptably low.
Sources of Error or Sources of Unreliability
- Respondent’s or subject’s mood, fatigue, or motivation, which affect his or her responses
- Observer’s measurements, which can be influenced by the same factors affecting the subject’s responses
- The conditions under which measurement is made, which may produce responses that do not reflect the true scores. Measurement errors are essentially random: a person’s test score might not reflect the true score because they were sick, anxious, in a noisy room, etc.
- Problems with the measurement instrument, such as poorly worded questions in an interview
- Processing problems such as simple coding or mechanical errors