TEST RELIABILITY
Reliability
In statistics, reliability is the consistency of a measuring instrument. This can be either whether the instrument gives, or is likely to give, the same measurement on repeated use or, in the case of more subjective instruments, whether two independent assessors give similar scores.
Reliability is the extent to which a
test is repeatable and yields consistent scores. In order to be
valid, a test must be reliable; but reliability does not guarantee validity.
All measurement procedures have the potential for error, so the aim is to
minimize it. The goal of estimating reliability (consistency) is to
determine how much of the variability in test scores is due to measurement
error and how much is due to variability in true scores.
For example, if a person takes a personality assessment today and scores high in a trait like dominance, and we retest that same person after six weeks, then if the person again scores high in dominance we can say that the test is reliable. If, however, the individual scored low in dominance, we would have to conclude that the measure was unreliable.
Reliability can be improved by taking repeated measurements using the same test and by taking many different measures using slightly different techniques and methods. For example, we would not consider one multiple-choice exam question to be a reliable basis for testing your knowledge of "individual differences"; many questions are asked in many different formats (e.g., exam, essay, presentation) to help provide a more reliable score.
Validity
Validity is the extent to which a test measures what it is supposed to measure. Validity is a subjective judgment made on the basis of experience and empirical indicators; it asks, "Is the test measuring what you think it's measuring?" For example, we might define "aggression" as an act intended to cause harm to another person (a conceptual definition), but the operational definition might be observing:
- how many times a child hits a doll
- how often a child pushes to the front of the queue
- how many physical scraps he/she gets into in the playground.
Are these valid measures of
aggression? i.e., how well does the operational definition match the
conceptual definition?
Difference between Reliability and Validity
Validity and reliability are two terms that go hand in hand in any form of testing, and it is important to understand the difference between them. Validity refers to evidence that a test actually measures what it is supposed to measure, whereas reliability is a measure of consistency. In terms of accuracy and precision, reliability is precision, while validity is accuracy.
Bathroom scale analogy
An example often used to elucidate the difference between reliability and validity in the experimental sciences is a common bathroom scale. If someone who weighs 200 lbs. steps on the scale 10 times and it reads "200" each time, then the measurement is both reliable and valid. If the scale consistently reads "150", then it is not valid, but it is still reliable because the measurement is very consistent. If the scale varied a lot around 200 (190, 205, 192, 209, etc.), then the scale could be considered valid but not reliable.
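To make the analogy concrete, here is a minimal sketch in Python: the mean of repeated readings reflects validity (accuracy), while the spread reflects reliability (precision). The readings are invented for illustration, with the true weight assumed to be 200 lbs.

```python
# Invented scale readings; true weight assumed to be 200 lbs.
from statistics import mean, stdev

reliable_and_valid = [200, 200, 200, 200, 200]   # consistent and correct
reliable_not_valid = [150, 150, 150, 151, 150]   # consistent but biased
valid_not_reliable = [190, 205, 192, 209, 204]   # centred on 200 but scattered

for label, readings in [("reliable and valid", reliable_and_valid),
                        ("reliable, not valid", reliable_not_valid),
                        ("valid on average, not reliable", valid_not_reliable)]:
    # Mean close to 200 suggests validity; small SD suggests reliability.
    print(f"{label}: mean = {mean(readings):.0f}, SD = {stdev(readings):.1f}")
```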
TYPES OF RELIABILITY
Reliability may be estimated through a variety of methods that fall into
two types: single-administration (the split-half and internal-consistency methods) and multiple-administration (the test-retest and parallel-forms methods).
Multiple-administration methods require that two assessments are administered.
Each of these estimation methods is sensitive to
different sources of error and so might not be expected to be equal.
Reliability estimates from one sample might differ from those of a second
sample if the second sample is drawn from a different population. (This is true
of measures of all types--yardsticks might measure houses well yet have poor
reliability when used to measure the lengths of insects.)
Inter-Rater or Inter-Observer Reliability
Inter-rater or inter-observer reliability measures homogeneity: the same form is administered to the same people by two or more raters or interviewers so as to establish the extent of consensus on the use of the instrument by those who administer it. It is used to assess the degree to which different raters or observers give consistent estimates of the same phenomenon.
There are two
major ways to actually estimate inter-rater reliability. If your measurement
consists of categories -- the raters are checking off which category each
observation falls in -- you can calculate the percent of agreement between the
raters. For instance, let's say you had 100 observations that were being rated
by two raters. For each observation, the rater could check one of three
categories. Imagine that on 86 of the 100 observations the raters checked the
same category. In this case, the percent of agreement would be 86%. OK, it's a
crude measure, but it does give an idea of how much agreement exists, and it
works no matter how many categories are used for each observation.
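As a rough sketch of that calculation, the percent agreement can be computed in a few lines of Python; the ratings below are made up purely for illustration (any category labels would do):

```python
# Hypothetical ratings of the same observations by two raters (categories A/B/C).
rater_1 = ["A", "B", "A", "C", "B", "A", "C", "C", "B", "A"]
rater_2 = ["A", "B", "A", "C", "A", "A", "C", "B", "B", "A"]

# Count observations where both raters checked the same category.
agreements = sum(r1 == r2 for r1, r2 in zip(rater_1, rater_2))
percent_agreement = 100 * agreements / len(rater_1)
print(f"Percent agreement: {percent_agreement:.0f}%")  # 80% for these made-up data
```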
The other major
way to estimate inter-rater reliability is appropriate when the measure is a
continuous one. There, all you need to do is calculate the correlation between
the ratings of the two observers. For instance, they might be rating the
overall level of activity in a classroom on a 1-to-7 scale. You could have them
give their rating at regular time intervals (e.g., every 30 seconds). The
correlation between these ratings would give you an estimate of the reliability
or consistency between the raters.
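A minimal sketch of that approach, assuming Python 3.10+ for statistics.correlation; the 1-to-7 activity ratings below are invented for illustration:

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical classroom-activity ratings (1-7 scale) at regular intervals.
observer_1 = [3, 5, 4, 6, 2, 7, 5, 4]
observer_2 = [4, 5, 3, 6, 2, 6, 5, 5]

# The correlation between the two observers estimates inter-rater reliability.
print(f"Inter-rater reliability estimate: {correlation(observer_1, observer_2):.2f}")
```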
Test-retest Reliability
The most commonly used method of determining reliability is the test-retest method, which measures stability over time. It involves administering the test to the same group of people at least twice; the first set of scores is then correlated with the second set to determine whether scores on the first administration are related to scores on the second. As reliability estimates, these correlations range from 0 (low reliability) to 1 (high reliability); it is highly unlikely they will be negative.
This approach
assumes that there is no substantial change in the construct being measured
between the two occasions. The amount of time allowed between measures is
critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation.
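The same correlation idea applies here; a small sketch follows, with invented scores standing in for two administrations of the same test to the same people (again assuming Python 3.10+):

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical scores for the same seven people on two occasions.
scores_first_administration = [12, 18, 25, 30, 22, 15, 28]
scores_second_administration = [14, 17, 24, 29, 23, 16, 27]

test_retest = correlation(scores_first_administration, scores_second_administration)
print(f"Test-retest reliability estimate: {test_retest:.2f}")
```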
Split-Half Reliability
In split-half reliability we randomly divide all of the items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate is simply the correlation between these two total scores (for example, .87); it is the relationship between one half of the items and the other half. The random division of items for split-half reliability can be done by software.
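A sketch of the split-half procedure in Python; the item responses, the 8-item instrument, and the random split are assumptions made purely for illustration (Python 3.10+ assumed):

```python
import random
from statistics import correlation  # Pearson correlation, Python 3.10+

# Rows = people, columns = 8 hypothetical items measuring the same construct.
responses = [
    [4, 3, 4, 5, 3, 4, 4, 5],
    [2, 2, 3, 2, 1, 2, 3, 2],
    [5, 5, 4, 5, 5, 4, 5, 5],
    [3, 4, 3, 3, 4, 3, 3, 4],
    [1, 2, 1, 2, 2, 1, 2, 1],
]

# Randomly divide the items into two halves.
item_indices = list(range(8))
random.shuffle(item_indices)
half_a, half_b = item_indices[:4], item_indices[4:]

# Total score on each half for every person, then correlate the totals.
totals_a = [sum(person[i] for i in half_a) for person in responses]
totals_b = [sum(person[i] for i in half_b) for person in responses]
print(f"Split-half reliability estimate: {correlation(totals_a, totals_b):.2f}")
```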
Parallel-Forms Reliability
In parallel
forms reliability we first have to create two parallel forms. One way to
accomplish this is to create a large set of questions that address the same
construct and then randomly divide the questions into two sets. We administer
both instruments to the same sample of people. The correlation between the two
parallel forms is the estimate of reliability. One major problem with this
approach is that we have to be able to generate lots of items that reflect the
same construct.
Furthermore,
this approach makes the assumption that the randomly divided halves are
parallel or equivalent. Even by chance this will sometimes not be the case. The
parallel forms approach is very similar to the split-half reliability. The
major difference is that parallel forms are constructed so that the two forms
can be used independently of each other and considered equivalent measures. For
instance, we might be concerned about a testing threat to
internal validity. If we use Form A for the pretest and Form B for the
posttest, we minimize that problem. It would be even better if we randomly assigned individuals to receive Form A or Form B on the pretest and then switched forms on the posttest.
Internal Consistency Reliability
It is used to
assess the consistency of results across items within a test. In internal
consistency reliability estimation we use our single measurement instrument
administered to a group of people on one occasion to estimate reliability. In
effect we judge the reliability of the instrument by estimating how well the
items that reflect the same construct yield similar results. We are looking at
how consistent the results are for different items for the same construct
within the measure. There is a wide variety of internal consistency measures that can be used; one of them is the average inter-item correlation.
Average inter-item correlation: The average inter-item correlation uses all of the items on our instrument that are designed to measure the same construct. We first compute the correlation between each pair of items. For example, if we have six items we will have 15 different item pairings (i.e., 15 correlations). The average inter-item correlation is simply the average, or mean, of all these correlations. We might find, for instance, an average inter-item correlation of .90, with the individual correlations ranging from .84 to .95.
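A sketch of the average inter-item correlation in Python, using invented responses for six items (Python 3.10+ assumed):

```python
from itertools import combinations
from statistics import correlation, mean  # Python 3.10+

# Rows = people, columns = 6 hypothetical items measuring the same construct.
responses = [
    [4, 4, 3, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [5, 5, 4, 5, 5, 5],
    [3, 3, 3, 2, 3, 3],
    [1, 2, 1, 1, 2, 1],
]

items = list(zip(*responses))  # one sequence of scores per item
# Correlate every pair of items: 6 items give 15 pairings.
pair_correlations = [correlation(items[i], items[j])
                     for i, j in combinations(range(len(items)), 2)]

print(f"{len(pair_correlations)} item pairings, "
      f"average inter-item correlation = {mean(pair_correlations):.2f}")
```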
How reliable should tests be? Some reliability guidelines
- .90 = high reliability
- .80 = moderate reliability
- .70 = low reliability
- High reliability is required when tests are used to make important decisions, or when individuals are sorted into many different categories based upon relatively small individual differences, e.g. intelligence (most standardized tests of intelligence report reliability estimates around .90).
- Lower reliability is acceptable when tests are used for preliminary rather than final decisions, or when tests are used to sort people into a small number of groups based on gross individual differences, e.g. height or sociability/extraversion.
- Reliability estimates below .60 are usually regarded as unacceptably low.
Sources of Error or Sources of Unreliability
- Respondent's or subject's mood, fatigue, or motivation, which affect his or her responses
- Observer's measurements, which can be influenced by the same factors affecting the subject's responses
- The conditions under which measurement is made, which may produce responses that do not reflect the true scores. Measurement errors are essentially random: a person's test score might not reflect the true score because they were sick, anxious, in a noisy room, etc.
- Problems with the measurement instrument, such as poorly worded questions in an interview
- Processing problems, such as simple coding or mechanical errors