Test 1

Texas A&M University – Central Texas

PSYK 581-110

Contents

1. Chapter 2

2. Chapter 4

3. Chapter 5

4. Chapter 6

5. Chapter 7

Chapter 2

1. Develop an example of each of the following scales: nominal, ordinal, interval, ratio.

Nominal scales are not really scales at all; they are used when the information is qualitative rather than quantitative. An example would be gender (male, female).

Ordinal scales have the property of magnitude but neither equal intervals nor an absolute zero. The values have a natural order, but we cannot state with certainty whether the intervals between them are equal. An example would be ranks such as 1st place, 2nd place, and 3rd place.

Interval scales are like ordinal scales, except that the intervals between values are equal. An example would be the IQ score: the difference between an IQ of 101 and 105 is the same as the difference between 89 and 93.

Data with an absolute zero point are measured on a ratio scale. An example would be any measurement of time or length.

2. Explain why the mean of a distribution of Z-scores is equal to 0.

A z-score reflects how many standard deviations above or below the population mean a raw score is. In a normal, standardized distribution, a z-score is the deviation of a score $X_{i}$ from the mean in standard deviation units. If a score is equal to the mean, then its z-score is 0; put differently, z-scores have a mean of 0 and a standard deviation of 1.0. We know that the sum of the deviations around the mean is always equal to zero.

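The numerator of the z-score equation is the deviation around the mean, while the denominator is constant. With $\bar{X}$ as the mean and $S$ as the standard deviation of the $N$ scores, the mean of the z-scores can be expressed as $\bar{Z} = \frac{1}{N}\sum Z_{i}$ or, substituting $Z_{i} = (X_{i} - \bar{X})/S$, as $\bar{Z} = \frac{1}{N S}\sum (X_{i} - \bar{X})$. Because $\sum (X_{i} - \bar{X})$ will always equal 0, the mean of the z-scores will always be zero.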
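The argument can also be checked numerically. The following is a minimal sketch in Python; the raw scores and the helper name `z_scores` are made up for illustration, and the standard deviation is computed with N in the denominator.

```python
import statistics

def z_scores(scores):
    """Convert raw scores to z-scores using the mean and SD of the same list."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # SD with N in the denominator
    return [(x - mean) / sd for x in scores]

raw = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical raw scores (mean 5, SD 2)
z = z_scores(raw)

# The deviations around the mean cancel out, so the z-scores average to 0.
print("mean of z-scores:", statistics.fmean(z))
print("SD of z-scores:  ", statistics.pstdev(z))
```

Any other list of scores with some variability should give the same result, up to rounding error: a mean of 0 and a standard deviation of 1.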

Chapter 4

2. Classic Test Theory is based on certain assumptions. Discuss these basic assumptions and the theory behind them; address the issue of challenges to any of the assumptions.

Classical test score theory assumes that each person has a true score which would be obtained if there were absolutely no errors in measurement. However, measuring instruments are not perfect, so the score we observe differs somewhat from a person’s true score.

The difference between the true score and the observed score is called the measurement error. One major assumption in classical test theory is that these errors of measurement are random. Classical test theory also assumes that the true score of an individual will not change with repeated applications of the same test. However, because of random error, repeated applications of the same test can produce different scores.

The standard deviation of the distribution of errors for each person indicates the magnitude of measurement error. In classical test theory, this standard deviation of errors is used as the basic measure of error. It is called the standard error of measurement, and it indicates how much an observed score can be expected to vary from the true score.
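These assumptions can be illustrated with a small simulation. This is only a sketch: the true score of 100, the error standard deviation of 5, and the normally distributed errors are assumptions chosen for the example, not values from the text.

```python
import random
import statistics

TRUE_SCORE = 100.0  # hypothetical true score, assumed constant across administrations
ERROR_SD = 5.0      # hypothetical standard error of measurement

def administer(true_score, error_sd, rng):
    """One administration: observed score = true score + random error."""
    return true_score + rng.gauss(0.0, error_sd)

rng = random.Random(42)  # fixed seed so the run is repeatable
observed = [administer(TRUE_SCORE, ERROR_SD, rng) for _ in range(10_000)]
errors = [x - TRUE_SCORE for x in observed]

# Random errors cancel out on average, so the mean observed score
# approaches the true score; the SD of the errors estimates the SEM.
mean_observed = statistics.fmean(observed)
sem_estimate = statistics.pstdev(errors)
print("mean observed score:", round(mean_observed, 2))
print("estimated SEM:      ", round(sem_estimate, 2))
```

Each simulated administration gives a different observed score, yet the scores cluster around the true score with a spread close to the assumed error standard deviation.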

3. There are several methods to estimate reliability. Compare and contrast different methods of reliability discussed in this chapter. Stress the importance of coefficient alpha.

Reliability can be estimated from the correlation of the observed test score with the true score (or an estimate of what the true score would be). There are several methods available to assess reliability. The test-retest method is used to evaluate the error associated with administering a test at two different points in time.

This analysis is usually used if we want to measure characteristics or traits which do not change over time. With this type of analysis one has to watch out for the carryover effect, which can lead to flawed scores because the first test session could have influenced the second.

The parallel-forms method evaluates the test across different forms, meaning that we compare two equivalent forms of a test that measure the same attributes. The two forms use different items, but the rules used to select the difficulty of each item are the same. This method is not used as much in practice, simply because it is very rigorous.

In the split-half reliability method a test is divided into two halves, which are then scored separately. The results of both halves are then compared with each other. This can cause problems if the items in one half are more difficult than those in the other.

To estimate the reliability of the test, one finds the correlation between the two halves. Because each half is only half as long as the full test, this correlation underestimates the reliability of the whole test and has to be corrected for half-length. One solution is the Spearman-Brown formula, which estimates what the correlation between the two halves would have been if each half had been the length of the whole test. The use of this formula increases the estimate of reliability, but it is not the best choice every time.
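For two halves, the Spearman-Brown correction is usually written as r_SB = 2r / (1 + r), where r is the correlation between the halves. A minimal sketch follows; the function name and the half-test correlation of .65 are illustrative assumptions.

```python
def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)

r_half = 0.65  # hypothetical correlation between the two halves
corrected = spearman_brown(r_half)
print("half-test correlation:   ", r_half)
print("corrected for full length:", round(corrected, 3))
```

For any positive half-test correlation below 1, the corrected value is larger than the uncorrected one, which is why the formula increases the estimate of reliability.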

If, for example, the two halves of the test have unequal variances, then Cronbach’s coefficient alpha (α) can be used. Alpha provides the lowest estimate of reliability that one can expect. If α is high, then we might assume that the reliability of the test is acceptable because the lowest boundary of reliability is still high; the reliability will not drop below α.

On the other hand, a low α gives us less information: because α marks only the lower bound, the actual reliability may still be high. If the variances of the two halves are equal, then coefficient alpha and the Spearman-Brown coefficient provide the same results.

As α has wider applicability, it has increasingly replaced KR_{20} as a measure of agreement or internal consistency.
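Coefficient alpha is commonly computed as α = (k / (k − 1)) · (1 − Σ item variances / variance of total scores), where k is the number of items. The sketch below applies this formula to made-up data; the function name and the scores are assumptions for illustration. The items here are scored 0 or 1, the case in which alpha and KR_{20} give the same value.

```python
import statistics

def cronbach_alpha(scores):
    """scores: one list per person, holding that person's score on each item."""
    k = len(scores[0])                          # number of items
    items = list(zip(*scores))                  # regroup scores by item
    item_vars = [statistics.pvariance(col) for col in items]
    totals = [sum(person) for person in scores]
    total_var = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses of five people to four right/wrong items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
print("coefficient alpha:", round(alpha, 3))
```

The same function also accepts items that are not scored right or wrong, such as rating scales, which reflects the wider applicability mentioned above.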

Chapter 5

2. It is important to consider several important factors when interpreting the meaning of a validity coefficient. Outline and discuss what those factors are. Give examples when appropriate.

When evaluating validity coefficients, one has to pay attention to several factors. One has to keep in mind that conditions of a validity study are never exactly reproduced, because the conditions under which a certain test is taken are never exactly the same.

For example, the population taking the test could have changed. So, one has to pay attention to changes in the cause of the relationship and in the subject population of the validity study. An example here could be race: there could be a problem with applying validity coefficients from a test that was validated on a predominantly white sample to a sample of African-Americans.

Furthermore, the score range of the predictor and the criterion should not be restricted. This means that the scores should not fall too close together, because reduced variability lowers the correlation that can be observed.
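The effect of a restricted range can be shown with simulated data. In this sketch the population correlation of .60, the normal distributions, and the cutoff of one standard deviation above the mean are all assumptions chosen for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

RHO = 0.6  # assumed correlation between predictor and criterion
rng = random.Random(1)
predictor = [rng.gauss(0, 1) for _ in range(5000)]
criterion = [RHO * x + math.sqrt(1 - RHO ** 2) * rng.gauss(0, 1) for x in predictor]

r_full = pearson_r(predictor, criterion)

# Restrict the range: keep only people who scored high on the predictor,
# as happens when only selected applicants are followed up.
selected = [(x, y) for x, y in zip(predictor, criterion) if x > 1.0]
r_restricted = pearson_r([x for x, _ in selected], [y for _, y in selected])

print("validity coefficient, full range:      ", round(r_full, 2))
print("validity coefficient, restricted range:", round(r_restricted, 2))
```

The coefficient in the restricted group comes out clearly lower, even though the underlying relationship between predictor and criterion is the same for everyone.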

Also, criterion-related validity evidence obtained in one situation should not automatically be generalized to other similar situations. There are usually too many differences between settings (the situation, the way the predictor construct is measured, the demographic group, etc.) to apply the same validity coefficient to every similar test.

Differential prediction is another factor one has to keep in mind when interpreting the meaning of a validity coefficient. Predictive relationships may not be the same for all demographic groups (e.g., men and women), or a certain test could have been validated for a language other than the language of the group for which it is being used.

4. Explain how reliability is related to validity, and/or vice versa.

Reliability and validity are related concepts: attempting to establish the validity of a test is pointless if the test is not reliable. The relation can be illustrated with a weight scale. If the scale is reliable, it shows the same weight every time one steps on it (provided that one has not gained or lost any weight between measurements).

However, if the scale is not working properly, the number shown on the display may not be the actual weight. If that is the case, this is an example of a scale that is reliable, but not valid. For the scale to be valid and reliable, it would have to tell the same weight every time one steps on the scale, but it also has to measure the actual weight.

Chapter 6

1. Develop several test items and describe methods for analyzing the appropriateness or inappropriateness of their inclusion on a test. (Hint: It may be helpful to actually "administer" these items to a group of friends.)

A.) Multiple choice tests are better than essay questions.

- Not true

- Not false

Question A should not be used on a test because its answer options are phrased as negations, which turn the item into a double-negative question that is not clear to the test taker. Better here would be to use the possible answers

- True

- False

B.) Many people feel that, in this day and age, children have too many freedoms, have too much money and are not subject to sufficient discipline to make them respectful to others. To what extent would you agree with this?

One method for analyzing the appropriateness or inappropriateness of test items is the assessment of item difficulty. Item difficulty is defined by the proportion of people who get a particular item correct. Because some test takers will answer correctly by guessing, a true-false item should have a difficulty level of more than .50, and a multiple-choice item with four possible answers should have a difficulty greater than .25.

There is also the option of assessment by item discriminability. This determines whether people who have done well on a particular test item also do well on the entire test.

Another method for analyzing the quality of test items is to use the point biserial correlation between performance on the item and the total test score. A negative or very low value would then indicate that the item should be eliminated from the test.
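Both statistics can be computed from a table of right/wrong responses. The sketch below uses made-up responses from six people to three items; the function names are illustrative. The point biserial correlation is computed as ((mean total score of those who passed the item − mean total score of everyone) / SD of total scores) · √(p / (1 − p)), where p is the item difficulty.

```python
import math
import statistics

def item_difficulty(responses, item):
    """Proportion of people who answered the item correctly."""
    return sum(person[item] for person in responses) / len(responses)

def point_biserial(responses, item):
    """Correlation between passing the item and the total test score.

    Assumes the item was neither passed nor failed by everyone (0 < p < 1).
    """
    totals = [sum(person) for person in responses]
    p = item_difficulty(responses, item)
    mean_all = statistics.fmean(totals)
    sd_all = statistics.pstdev(totals)
    passed = [t for person, t in zip(responses, totals) if person[item] == 1]
    mean_passed = statistics.fmean(passed)
    return (mean_passed - mean_all) / sd_all * math.sqrt(p / (1 - p))

# Hypothetical data: one row per person, one column per item (1 = correct).
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]

for item in range(3):
    print("item", item + 1,
          "difficulty:", round(item_difficulty(responses, item), 2),
          "point biserial:", round(point_biserial(responses, item), 2))
```

In this made-up data set the third item is passed only by the people with the lowest total scores, so its point biserial correlation is negative and the item would be a candidate for removal.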

2. Criterion-referenced tests offer some advantages over tests that may be graded more subjectively and they are now quite prolific in school systems across the country. However, they have some specific problems. Discuss these advantages and disadvantages (feel free to integrate information from previous chapters). How would you improve this situation?

A criterion-referenced test compares performance with some clearly defined criterion for learning. Criterion-referenced tests are used to document specific skills, not to compare students with one another. They do this with an established set of learning objectives.

These objectives define what the student should have learned by the end of a given course. A criterion-referenced test would then be used to determine the actual learning outcome. An advantage of this type of test would be that if applied correctly, it provides good information about how much a certain student’s knowledge has improved over a certain amount of time.

Chapter 7

1. The state of the subject may well affect his or her test performance and may be a serious source of error. Discuss some possible subject variables that may interfere with or improve an individual's performance on a test.

The examiner should always take the state of the subject into consideration when evaluating test scores, because it can be a source of serious error in testing. Test scores are greatly affected by motivation and anxiety (with test anxiety being a common condition).

Besides anxiety, physical illness such as a cold or a headache can also affect the outcome of a test. Even hormonal changes can influence test takers. Furthermore, numerous studies have shown that a positive, familiar relationship between the test taker and the examiner can result in better test scores. Factors such as disability or age also have to be taken into consideration.

3. As an administrator of test, what factors should you consider based on this chapter (e.g., characteristics of an administrator, training administrator, administering context, subject factor, mode of administration, etc.)?

Standardized test administration procedures are essential for validity. Situational factors can affect test scores in obvious or subtle ways, but these effects might not be observed in every study. Some studies have shown that factors such as the expectancies of the administrator and even his or her race can affect scores.

There should be no direct reinforcement during standardized tests because this is known to have a direct impact on the performance of the test taker. The mode of administration can also influence the outcome of the test. Computer-assisted testing, for example, reduces examiner bias and is therefore a good alternative for administrators. Administrators also have to keep subject factors in mind.

Administrators have to constantly check their behavior in order to avoid observer drift (the tendency of observers to stray away from the rules they learned). They also have to be aware of the problem of reactivity (a phenomenon that occurs when individuals alter their performance or behavior due to the awareness that they are being observed) and expectancy (the tendency for results to be influenced by what the observer expects to find).