Assessment and Evaluation

Selecting High-Quality Assessments

For an assessment to be high quality, it needs to have good validity and reliability as well as an absence of bias.


Validity is the evaluation of the “adequacy and appropriateness of the interpretations and uses of assessment results” for a given group of individuals (Linn & Miller, 2005, p. 68). For example, is it appropriate to conclude that the results of a mathematics test on fractions given to recent immigrants accurately represent their understanding of fractions? Is it appropriate for the teacher to conclude, based on her observations, that a kindergarten student, Jasmine, has Attention Deficit Disorder because she does not follow the teacher’s oral instructions? In each situation, other interpretations are possible: the immigrant students may have poor English skills rather than poor mathematics skills, and Jasmine may be hearing impaired.

It is important to understand that validity refers to the interpretation and uses made of the results of an assessment procedure, not of the assessment procedure itself. For example, making judgments about the results of the same test on fractions may be valid if the students all understand English well. A teacher concluding from her observations that the kindergarten student has Attention Deficit Disorder (ADD) may be appropriate if the student has been screened for hearing and other disorders (although the classification of a disorder like ADD cannot be made by one teacher). Validity involves making an overall judgment of the degree to which the interpretations and uses of the assessment results are justified. Validity is a matter of degree (e.g. high, moderate, or low validity) rather than all-or-none (e.g. totally valid vs invalid) (Linn & Miller, 2005).

Three sources of evidence are considered when assessing validity: content, construct, and criterion-related (predictive). Content validity evidence is associated with the question: How well does the assessment include the content or tasks it is supposed to? For example, suppose your educational psychology instructor devises a mid-term test and tells you it covers chapters one to seven in the textbook. All the items in the test should be based on the content from educational psychology, not your methods or cultural foundations classes. The items should also cover content from all seven chapters and not just chapters three to seven, unless the instructor tells you that these chapters have priority.

Teachers have to be clear about their purposes and priorities for instruction before they can begin to gather evidence related to content validity. Content validation determines the degree to which assessment tasks are relevant to and representative of the tasks judged by the teacher (or test developer) to represent their goals and objectives (Linn & Miller, 2005). It is important for teachers to think about content validation when devising assessment tasks, and one way to do this is to devise a Table of Specifications. An example, based on Pennsylvania’s state standards for grade 3 geography, is in the table below. The left-hand column lists the instructional content for a 20-item test the teacher has decided to construct with two kinds of instructional objectives: identifies, and uses or locates. The second and third columns give the number of items for each content area under each instructional objective. Notice that the teacher has decided that six items should be devoted to the sub-area of geographic representations, more than any other sub-area. Devising a table of specifications helps teachers determine whether some content areas or concepts are over-sampled (i.e. there are too many items) and others are under-sampled (i.e. there are too few items).

Table 10.2.1. Example of Table of Specifications: grade 3 basic geography literacy

| Content | Instructional objective: Identifies | Instructional objective: Uses or locates | Total number of items | Percent of items |
|---|---|---|---|---|
| Identify geography tools and their uses | | | | |
| Geographic representations: e.g., maps, globe, diagrams, and photographs | 3 | 3 | 6 | 30% |
| Spatial information: sketch & thematic maps | 1 | 1 | 2 | 10% |
| Mental maps | 1 | 1 | 2 | 10% |
| Identify and locate places and regions | | | | |
| Physical features (e.g. lakes, continents) | 1 | 2 | 3 | 15% |
| Human features (countries, states, cities) | 3 | 2 | 5 | 25% |
| Regions with unifying geographic characteristics, e.g. river basins | 1 | 1 | 2 | 10% |
| Total number of items | 10 | 10 | 20 | |
| Total percentage of items | 50% | 50% | 100% | |
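
The arithmetic behind a table of specifications is easy to check by hand, but a short script can help when a test is revised often. The following is a minimal sketch in Python, not part of the original example: the names SPEC and summarize are our own, and the item counts are copied from the table above. It recomputes the row totals, the percentage of items per sub-area, and the column totals so that over-sampled or under-sampled areas stand out.

```python
# Table of specifications for the 20-item geography test described above.
# Each entry: (content sub-area, items for "identifies", items for "uses or locates").
# The variable and function names are illustrative assumptions.
SPEC = [
    ("Geographic representations", 3, 3),
    ("Spatial information", 1, 1),
    ("Mental maps", 1, 1),
    ("Physical features", 1, 2),
    ("Human features", 3, 2),
    ("Regions with unifying characteristics", 1, 1),
]


def summarize(spec):
    """Return per-row totals and percentages, the column totals, and the item count."""
    total_items = sum(identifies + uses for _, identifies, uses in spec)
    rows = []
    for name, identifies, uses in spec:
        row_total = identifies + uses
        rows.append((name, row_total, 100 * row_total / total_items))
    column_totals = (
        sum(identifies for _, identifies, _ in spec),
        sum(uses for _, _, uses in spec),
    )
    return rows, column_totals, total_items


rows, column_totals, total_items = summarize(SPEC)
for name, row_total, percent in rows:
    print(f"{name}: {row_total} items ({percent:.0f}%)")
print(
    f"Identifies: {column_totals[0]}, "
    f"Uses or locates: {column_totals[1]}, Total: {total_items}"
)
```

If the percentages do not match the teacher's instructional priorities, the item counts in SPEC can be adjusted and the summary rerun before any items are written.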

Construct validity evidence is more complex than content validity evidence. Often we are interested in making broader judgments about students’ performances than specific skills such as doing fractions. The focus may be on constructs such as mathematical reasoning or reading comprehension. A construct is a characteristic of a person we assume exists to help explain behavior. For example, we use the concept of test anxiety to explain why some individuals, when taking a test, have difficulty concentrating, have physiological reactions such as sweating, and perform poorly on tests but not on in-class assignments. Similarly, mathematical reasoning and reading comprehension are constructs because we use them to help explain performance on an assessment. Construct validation is the process of determining the extent to which performance on an assessment can be interpreted in terms of the intended constructs and is not influenced by factors irrelevant to the construct. For example, judgments about recent immigrants’ performance on a mathematical reasoning test administered in English will have low construct validity if the results are influenced by English language skills that are irrelevant to mathematical problem-solving. Similarly, construct validity of end-of-semester examinations is likely to be poor for those students who are highly anxious when taking major tests but not during regular class periods or when doing assignments. Teachers can help increase construct validity by trying to reduce factors that influence performance but are irrelevant to the construct being assessed. These factors include anxiety, English language skills, and reading speed (Linn & Miller, 2005).

The third form of validity evidence is criterion-related validity: how well assessment results predict performance on another measure (the criterion). Selective colleges in the USA use the ACT or SAT, among other criteria, to choose who will be admitted because these standardized tests help predict freshman grades, i.e. they have high criterion-related validity. Some K-12 schools give students math or reading tests in the fall semester in order to predict which students are likely to do well on the annual state tests administered in the spring semester and which students are unlikely to pass the tests and will need additional assistance. If the tests administered in the fall do not predict students’ performances accurately, then the additional assistance may be given to the wrong students, which illustrates the importance of criterion-related validity.
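
One common way to summarize how well a fall test predicts spring results is the correlation between the two sets of scores. The sketch below is an illustration under stated assumptions, not a procedure from the source: the function pearson_r is our own helper, and the fall and spring scores are invented for eight hypothetical students. A coefficient near 1 would suggest the fall test ranks students much as the spring test does; a coefficient near 0 would suggest it predicts poorly.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical data, invented for illustration only: each position is one student.
fall_scores = [52, 61, 70, 45, 88, 75, 66, 58]
spring_scores = [55, 65, 72, 50, 90, 70, 69, 60]

r = pearson_r(fall_scores, spring_scores)
print(f"Correlation between fall and spring scores: r = {r:.2f}")
```

A single coefficient from one small class is only rough evidence, so it should be read alongside other information about the students rather than on its own.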


Reliability refers to the consistency of the measurement (Linn & Miller, 2005). Suppose Mr. Garcia is teaching a unit on food chemistry in his tenth-grade class and gives an assessment at the end of the unit using test items from the teachers’ guide. Reliability is related to questions such as: How similar would the scores of the students be if they had taken the assessment on a Friday or Monday? Would the scores have varied if Mr. Garcia had selected different test items, or if a different teacher had graded the test? An assessment provides information about students by using a specific measure of performance at one particular time. Unless the results from the assessment are reasonably consistent over different occasions, different raters, or different tasks (in the same content domain), confidence in the results will be low and the results cannot be useful in improving student learning.

Obviously, we cannot expect perfect consistency. Students’ memory, attention, fatigue, effort, and anxiety fluctuate and so influence performance. Even trained raters vary somewhat when grading assessments such as essays, science projects, or oral presentations. Also, the wording and design of specific items influence students’ performances. However, some assessments are more reliable than others and there are several strategies teachers can use to increase reliability.

First, assessments with more tasks or items typically have higher reliability. To understand this, consider two tests, one with five items and one with 50 items. Chance factors influence the shorter test more than the longer test. If a student does not understand one of the items in the five-item test, the total score is strongly affected (it would be reduced by 20 percent). In contrast, if one item in the 50-item test was confusing, the total score would be affected much less (by only 2 percent). This does not mean that assessments should be inordinately long, but, on average, enough tasks should be included to reduce the influence of chance variations. Second, clear directions and tasks help increase reliability. If the directions or wording of specific tasks or items are unclear, then students have to guess what they mean, which undermines the accuracy of their results. Third, clear scoring criteria are crucial in ensuring high reliability (Linn & Miller, 2005). Later in this chapter, we describe strategies for developing scoring criteria for a variety of types of assessment.
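
The first strategy can be made concrete with a little arithmetic. The sketch below reproduces the 20 percent and 2 percent figures from the five-item and 50-item example, assuming every item carries equal weight. It also includes the Spearman-Brown formula, a standard psychometric result that is not discussed in this chapter, to estimate how reliability might change when a test is lengthened with comparable items. The function names and the starting reliability of 0.5 are our own assumptions for illustration.

```python
def share_of_one_item(n_items):
    """Percentage of the total score carried by one item, assuming equal weights."""
    return 100 / n_items


def spearman_brown(reliability, length_factor):
    """Predicted reliability after lengthening a test by length_factor.

    Assumes the added items are comparable to the existing ones; this formula
    is brought in as an illustration and does not come from the chapter.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


for n_items in (5, 50):
    print(f"{n_items}-item test: one item is {share_of_one_item(n_items):.0f}% of the score")

# Hypothetical starting reliability of 0.5 for a short test, doubled in length.
print(f"Predicted reliability after doubling: {spearman_brown(0.5, 2):.2f}")
```

The formula shows diminishing returns: each further increase in length adds less reliability than the one before, which is consistent with the caution that assessments should not be inordinately long.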

Absence of Bias

Bias occurs in assessment when components of the assessment method or its administration distort students’ performance because of personal characteristics such as gender, ethnicity, or social class (Popham, 2005). Two types of assessment bias are important: offensiveness and unfair penalization. An assessment is most likely to be offensive to a subgroup of students when negative stereotypes are included in the test. For example, the assessment in a health class could include items in which all the doctors were men and all the nurses were women. Or, a series of questions in a social studies class could portray Latinos and Asians as immigrants rather than native-born Americans. In these examples, some female, Latino, or Asian students are likely to be offended by the stereotypes, and this can distract them from performing well on the assessment.

Unfair penalization occurs when items disadvantage one group not because they may be offensive but because of differential background experiences. For example, an item on a math assessment that assumes knowledge of a particular sport may disadvantage groups not as familiar with that sport (e.g. American football for recent immigrants). Or an assessment on teamwork that asks students to model their concept of a team on a symphony orchestra is likely to be easier for those students who have attended orchestra performances, probably students from affluent families. Unfair penalization does not occur just because some students do poorly in class. For example, asking questions about a specific sport in a physical education class when information on that sport had been discussed in class is not unfair penalization, as long as the questions do not require knowledge beyond that taught in class that some groups are less likely to have.

It can be difficult for new teachers in multi-ethnic classrooms to devise interesting assessments that do not penalize any groups of students. Teachers need to think seriously about the impact of students’ differing backgrounds on the assessments they use in class. Listening carefully to what students say is important, as is learning about the backgrounds of the students.



Educational Psychology Copyright © 2020 by Nicole Arduini-Van Hoose is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
