ABC on Scoring Rubrics Development for Large Scale Performance Assessment in Mathematics and Science

author:	Westat
description:	As part of its technical assistance effort, Westat is developing an Occasional Papers series addressing issues of concern in doing outcome evaluation. The first of these papers, the development of scoring rubrics, has now been completed and is available for use and comment. Suggestions for additional papers are welcome. Remember Westat staff and their consultants are available to provide assistance to you in developing or reviewing your outcome evaluation plans. NSF is providing the resources for this technical assistance. Please don't wait until the last minute to ask for help. To suggest themes for occasional papers or request technical assistance, please contact Joy Frechtling. She can be reached at frechtj1@westat.com or (301) 517-4006.
published in:	WESTAT
published:	05/01/2002
posted to site:	05/31/2001

ABC on Scoring Rubrics Development

for Standardized Performance Assessment Items

in Mathematics and Science

Prepared by

Westat

May 2001

Contents

1 Introduction

2 Basic Concepts of Scoring Rubrics

2.1 Definition of a Scoring Rubric

2.2 Functions of a Scoring Rubric

2.3 Types of Scoring Rubrics

2.3.1 Scoring Rubrics by Depth of Information Provided

2.3.2 Scoring Rubrics by Breadth of Application

2.4 Elements of a Scoring Rubric

2.5 Scales of a Scoring Rubric

3 Options for Scoring Rubric Development

3.1 Adopt

3.2 Adapt

3.3 Do It Yourself

4 Development of a Scoring Rubric for Performance Assessment in Mathematics and Science

4.1 Common Types of Performance Assessment Items in Standardized Mathematics and Science Assessments

4.2 General Procedures for Scoring Rubrics Development

4.3 Tips on Scoring Rubric Development

4.4 Checklist for Scoring Rubric Development

5 Sample Sources of Additional Information for Scoring Rubric Development

5.1 Off-line Sources

5.2 On-line Sources

6 Summary

1. Introduction

After decades in which multiple-choice items were the dominant format in large-scale standardized assessments, educators and test developers recognized the need to change assessment practice to keep pace with educational reform, and began moving toward more authentic performance assessments. According to the current Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999), performance assessments are defined as:

Product- and behavior-based measurements based on settings designed to emulate real-life contexts or conditions in which specific knowledge or skills are actually applied (p.179).

Examples of performance assessments include science investigations, portfolio assessments, constructed-response questions, and essay writing. This kind of assessment has the following common characteristics:

It presents students with tasks that simulate real-world challenges and problems;

It requires that students generate an answer or produce a product or perform an act, instead of selecting an option from multiple choices; and

It can have more than one correct answer and more than one way to approach a problem, instead of a definite right answer as in multiple-choice items.

Consequently, performance assessment can be more flexible in assessing students’ abilities and more complex in format and content design than multiple-choice items in conventional assessments. However, since a performance assessment does not have an absolute right/wrong answer key, more detailed scoring rules, i.e., scoring rubrics, are needed. The purpose of scoring rubrics is to provide scorers with an effective tool that helps them focus on the target and understand which attributes an item or task truly measures.

Since a scoring rubric is an inseparable part of performance assessment and its quality directly affects the accuracy of measurement of a student’s knowledge or skills, performance assessment developers must master an understanding of basic concepts and procedures of rubric construction. Accordingly, this paper will introduce some basic concepts of scoring rubrics and general procedures of rubric development for standardized assessment items in mathematics and science. The focus is on assessments that provide data determining program efficacy. Classroom assessment and diagnostic achievement tests are not addressed.

This paper has five main sections. This introduction has defined performance assessment and described its characteristics. Section 2 introduces some basic concepts of a scoring rubric, including its definition, functions, types, elements, and scales. Section 3 describes different options for selecting a scoring rubric, and Section 4 provides step-by-step instructions for those who would like to develop a scoring rubric from scratch. Finally, Section 5 provides samples of on-line and off-line sources of information on scoring rubric development.

2. Basic Concepts of Scoring Rubrics

Before we start to develop a scoring rubric, it is necessary to become familiar with some basic concepts, including the functions, types, elements, and scales of scoring rubrics used in standardized testing situations.

2.1 Definition of a Scoring Rubric

Many of us have experience with rubrics in our daily lives. For instance, when taking the driver’s license road examination, we have to complete a certain number of tasks to be granted the license. The Department of Motor Vehicles has criteria for determining a passing or cut score. The word "rubric" may not be used here, but the concept definitely is. In the field of educational assessment, scoring rubrics are becoming a highly sought-after tool due to the increasing popularity of performance assessments. Strictly speaking, a scoring rubric is defined as:

The established criteria, including rules, principles, and illustrations, used in scoring responses to individual items and clusters of items (AERA/APA/NCME, 1999, p. 182).

Loosely speaking, it is a set of rules used in scoring performance assessment items. The term usually refers to the scoring procedures for assessment tasks that do not provide enumerated responses from which test takers make a choice. Scoring rubrics vary in the degree of judgment entailed, in the number of distinct score levels defined, and in the latitude given scorers for assigning intermediate or fractional score values.

2.2 Functions of a Scoring Rubric

As a set of rules for evaluating a student’s mastery of knowledge or skills in educational assessments, rubrics serve three main functions for teachers, test developers, decisionmakers, and students.

First, scoring rubrics provide uniform, objective criteria for judging a performance assessment item. Since all performance assessment items currently involve professional judgment, differences among scorers are inevitable. Having clear, well-established scoring criteria will improve the agreement between scorers and reduce bias.
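One way to see whether a rubric is serving this function is to compare the scores that two scorers assign to the same set of responses. The sketch below is an illustration only: the function name `agreement_rates` and the sample scores are hypothetical, and exact and adjacent agreement are used here as simple summary statistics, not as a method prescribed by this paper.

```python
def agreement_rates(scores_a, scores_b):
    """Return (exact, adjacent) agreement rates for two scorers.

    exact    = share of responses given identical scores
    adjacent = share of responses whose scores differ by at most 1 point
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical scores from two scorers on the same eight responses (0-3 scale).
scorer_1 = [3, 2, 2, 0, 1, 3, 2, 1]
scorer_2 = [3, 2, 1, 0, 1, 2, 2, 3]
exact, adjacent = agreement_rates(scorer_1, scorer_2)
print(f"exact agreement: {exact:.3f}, adjacent agreement: {adjacent:.3f}")
```

Low exact agreement on a pilot set of responses would suggest that the criteria for neighboring score points need sharper wording or more examples.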

Second, scoring rubrics provide established expectations for teachers and students that help them identify the relationships among instruction, learning, and assessment. Without them, teachers and students may have difficulty interpreting or "guessing" what type and level of performance is expected. This may heighten test anxiety, a factor that can depress student performance, especially on high-stakes assessments. Clear, publicly communicated scoring rubrics give test takers an idea of the knowledge and skills that will be measured and how they will be evaluated. This practice reduces unnecessary anxiety and provides more accurate measurement.

Third, well-constructed scoring rubrics reinforce a focus on content, providing performance standards for student work. Rubrics help focus teachers and students on the most critical skills and knowledge to be learned. Distractions caused by peripheral variables (e.g., overzealous attention to penmanship when the content of writing is the focus) can be minimized or eliminated.

2.3 Types of Scoring Rubrics

There are two major categories of scoring rubrics. One is by depth of information provided; the other is by breadth of application.

2.3.1 Scoring Rubrics by Depth of Information Provided

Two types of scoring procedures are used regarding depth of information provided: holistic and analytic. Both methods require explicit criteria that reflect the test framework, but they differ in the amount of detailed analyses of student responses.

Holistic scoring is based on an overall impression of a student’s work as a whole, even though the same performance criteria used in analytic scoring may be implicitly considered. Holistic scoring produces only a single score based on an established scale. Almost all statewide standardized assessments for K-12 education adopt this scoring procedure. Holistic scoring is preferred when a consistent overall judgment is desired and when the skills being assessed are complex and highly interrelated.

Example of Holistic Scoring

This is an item-specific scoring rubric that is applicable to the assessed item only.

Subject: Science

Grade(s): 6

Number of dimensions: 1

Scale length: 4

Source: Vermont Science Assessment Blueprint

Task

The laws of force and motion can be used to understand the motions of the planets and stars. They can also be used to understand something as simple as sliding a bowl of soup across a table.

In the school cafeteria, your friend slides a bowl of soup quickly across the table to you. The soup sloshes onto her hand as she slides it toward you and then the soup sloshes out in front of you when the bowl comes to a stop.

1. Explain why the soup sloshes onto the hand of the student who is pushing the bowl.

2. Explain why the soup sloshes out toward you when the bowl comes to a stop.

Key Elements for Scoring (Total points = 2)

1 point: Force is applied to the bowl (push > go; friction > stop)

1 point: Soup either stays where it is or keeps on going (inertia)

With analytic scoring, each critical dimension of the performance criteria is judged independently and awarded an individual score. Additionally, an overall score is given to the examinee for the item. Compared to holistic scoring, analytic scoring is more time-consuming, but it yields more detailed information that can be used for diagnostic purposes or to provide students with specific feedback on their strengths and weaknesses. Analytic scoring is sometimes also used to evaluate curriculum and instructional programs with regard to aspects that are strong or that may need improvement.

Example of Analytic Scoring

This is a general rubric that can be used for scoring any scientific experiment of the same kind.

Subject: Science

Grade(s): Not specified

Number of dimensions: 3

Scale lengths: 4, 5

Dimension I: Designing the Experiment (Total points = 3)

3. Design allows comparison of variables and indicates sufficient number of tests to obtain meaningful data.

2. Design allows comparison of variables but lacks sufficient number of tests to obtain meaningful data.

1. Design allows comparison of variables to standards.

0. Fails to develop any type of plan.

Dimension II: Collecting and Reporting Data (Total points = 4)

4. Makes a meaningful table and records the data accurately.

3. Makes a meaningful table, but fails to record the observations or records them inaccurately.

2. Makes a data table, but the table lacks meaningful labels.

1. Describes observation in rambling discourse.

0. Fails to collect any data.

Dimension III: Drawing Conclusions (Total points = 3)

3. Draws a conclusion that is supported by the data, and gives supporting evidence for the conclusion.

2. Draws a conclusion that is supported by the data, but fails to show the support for the conclusion.

1. Draws a conclusion that is not supported by the data.

0. Fails to reach a conclusion.
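The way an analytic rubric produces one score per dimension plus an overall score can be sketched as follows. This is a minimal illustration, not part of the original rubric: the function name `analytic_score` and the sample ratings are hypothetical, and taking the overall score to be the simple sum of the dimension scores is an assumption.

```python
# Maximum points per dimension, taken from the example rubric above.
MAX_POINTS = {
    "Designing the Experiment": 3,
    "Collecting and Reporting Data": 4,
    "Drawing Conclusions": 3,
}


def analytic_score(dimension_scores):
    """Validate one score per dimension and return (scores, overall total)."""
    if set(dimension_scores) != set(MAX_POINTS):
        raise ValueError("exactly one score is required for every dimension")
    for name, score in dimension_scores.items():
        if not 0 <= score <= MAX_POINTS[name]:
            raise ValueError(f"{name}: score {score} is outside 0-{MAX_POINTS[name]}")
    # Assumption: the overall score is the unweighted sum of dimension scores.
    return dimension_scores, sum(dimension_scores.values())


# Hypothetical ratings for one student's experiment report.
ratings = {
    "Designing the Experiment": 2,
    "Collecting and Reporting Data": 4,
    "Drawing Conclusions": 1,
}
per_dimension, overall = analytic_score(ratings)
print(per_dimension, overall)
```

Keeping the per-dimension scores alongside the total preserves the diagnostic detail that distinguishes analytic from holistic scoring.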

Selection of a scoring procedure depends on the needs of a user. As you examine these two approaches, think about the purpose of the assessment and the information you would like to obtain from test scores. In addition, efficiency, cost, and human resources are all factors to consider.

2.3.2 Scoring Rubrics by Breadth of Application

The other way to classify scoring rubrics is by breadth of application: general or specific. A general scoring rubric can apply to more than one item or task of the same kind, while a specific scoring rubric is applicable to only one item or task.

Example of General Scoring

This is a general rubric that can apply to more than one mathematics item of the same kind.

Subject: Mathematics

Grade(s): 8

Number of dimensions: 1

Scale length: 5

Source: Maryland Math Communication Rubric

Maryland Department of Education

General Scoring (Total points = 4)

4 Uses mathematical language (terms, symbols, signs, and/or representations) that is highly effective, accurate, and thorough to describe operations, concepts, and processes.

3 Uses mathematical language (terms, symbols, signs, and/or representations) that is partially effective, accurate, and thorough to describe operations, concepts, and processes.

2 Uses mathematical language (terms, symbols, signs, and/or representations) that is minimally effective and accurate to describe operations, concepts, and processes.

1 An incorrect response; an attempt is made.

0 Off task, off topic, illegible, blank or insufficient to score.

For instructional purposes, general rubrics usually prove to be most useful, because they eliminate the need for constant adaptation to particular assignments and provide an enduring vision of quality work that can guide both students and teachers. The analytic scoring example in Section 2.3.1 and the Maryland example above are both general rubrics.

For standardized assessments such as the ones that we are discussing in this paper, specific rubrics are more common, because each test item measures a specific skill and contributes to estimating a student’s ability in a specific content area.

Example of Item-Specific Scoring

This is a rubric that applies to the specific item only.

Subject: Mathematics

Grade(s): 4

Number of dimensions: 1

Scale length: 4

Source: New Jersey State Assessment Program

Elementary School Proficiency Assessment

Item 28. Veronica is making a rectangular garden. She plans to put a fence around the garden using 28 feet of fencing, and she wants the garden to be 8 feet long.

• How wide will Veronica’s garden be? Show how you got your answer.

• If Veronica is going to put fence posts two feet apart around the outside of the garden, how many fence posts will she need? Show all of your work and explain your answer.

Item 28 Scoring Rubric (Total points = 3)

3 points: The student correctly determines the width of the garden (6 feet) and shows his or her work. The student also determines that Veronica would need 14 fence posts and explains his or her answer.

2 points: The student correctly determines the width of the garden (6 feet) and shows his or her work, but does not get the correct answer to the number of fence posts.

OR: The student correctly determines the width of the garden and the correct number of fence posts but shows no work.

1 point: The student does not attempt to determine the width of the garden, but finds the number of fence posts, with an incomplete explanation.

OR: The student correctly determines the width of the garden with an incomplete or inadequate explanation and the number of fence posts is incorrect or missing.

0 points: The response shows insufficient understanding of the problem’s mathematical concepts.
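The two answers that this rubric treats as correct (a width of 6 feet and 14 fence posts) can be checked with a few lines of arithmetic. The sketch below is an illustration only; the variable names are hypothetical and are not part of the New Jersey item.

```python
fencing_ft = 28      # total fencing, which is the garden's perimeter
length_ft = 8        # given length of the garden
post_spacing_ft = 2  # distance between neighboring fence posts

# Perimeter of a rectangle is 2 * (length + width); solve for width.
width_ft = fencing_ft / 2 - length_ft

# The fence is a closed loop, so the number of posts equals the number of
# equal gaps between them: perimeter / spacing, with no extra end post.
posts = fencing_ft // post_spacing_ft

print(width_ft, posts)
```

Working the key out explicitly is a useful habit when writing an item-specific rubric, because every score point depends on the stated correct answers.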

2.4 Elements of a Scoring Rubric

Regardless of the type of scoring procedure, a scoring rubric needs to have the following elements:

One or more dimensions that serve as the basis for judging the student response. A dimension means a trait or feature to be used in judging a student’s performance or product.

Definitions and examples to clarify the meaning of each trait or dimension. This element is very important in that it will define the range of the trait or dimension to be assessed.

A scale of values on which to rate each dimension. A scale is the system of numbers, and their units, by which a value is reported on some dimension of measurement. In testing, scoring rubric rating scales may be numerical, qualitative, or a combination of the two.

Standards of excellence for specified performance levels accompanied by models or examples of each level. When a scale is established, standards for each level of the scale need to be stated clearly and accurately. Examples are very helpful to clarify the boundary of each score point level (Herman, Aschbacher, and Winters, 1992).
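The four elements above can be pictured as a small data structure. The sketch below is one possible representation, not a format defined by this paper: the class names `Level` and `Dimension` and the method `standard_for` are hypothetical, and the definition text is invented for illustration. The level standards are taken from Dimension III of the analytic scoring example in Section 2.3.1.

```python
from dataclasses import dataclass, field


@dataclass
class Level:
    value: int                                    # score point on the scale
    standard: str                                 # stated standard for this level
    examples: list = field(default_factory=list)  # model responses, if available


@dataclass
class Dimension:
    name: str          # trait or feature being judged
    definition: str    # clarifies the range of the trait
    levels: list       # the scale: one Level per score point

    def standard_for(self, value):
        """Return the stated standard for a given score point."""
        for level in self.levels:
            if level.value == value:
                return level.standard
        raise KeyError(f"{self.name} has no score point {value}")


conclusions = Dimension(
    name="Drawing Conclusions",
    # Hypothetical definition, written for this sketch only.
    definition="How well the stated conclusion follows from the collected data.",
    levels=[
        Level(3, "Draws a conclusion that is supported by the data, and gives "
                 "supporting evidence for the conclusion."),
        Level(2, "Draws a conclusion that is supported by the data, but fails "
                 "to show the support for the conclusion."),
        Level(1, "Draws a conclusion that is not supported by the data."),
        Level(0, "Fails to reach a conclusion."),
    ],
)
print(conclusions.standard_for(1))
```

A rubric with an empty `examples` list at some level is a visible reminder that model responses for that score point still need to be collected.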

2.5 Scales of a Scoring Rubric

As mentioned above, in testing situations the scoring rubric rating scales may be numerical, qualitative, or a combination of the two. Numerical scales are probably most popular for mathematics and science items or questions in standardized assessments. One possible reason for this practice is the convenience of score calculation. Qualitative scales are used more often in subjects such as writing or presentation.

How many points should a rating scale have? The answer is "it depends …." Generally speaking, the following issues should be considered as we make decisions for mathematics and science test items:

What is the total number of outcomes to be assessed in an item (for holistic scoring)? Is there a clear definition for each outcome? For analytic scoring, how many dimensions need to be assessed in an item? Is there a clear definition for each dimension?

How many critical stages of cognitive skill within a dimension does a student need to demonstrate in answering an item? Determining the scale is easier if you can define those stages.

Under the analytic procedure, if you are rating a product/performance on several different dimensions, how much weight does each dimension have? Will you want to add up the scores so that each is equally weighted? If so, you may find it easier to have all scales the same length.

Do you simply want to divide students into two or three groups, based on whether they have attained or exceeded the standard for an outcome? If so, then a short scale may be adequate.
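The weighting question above can be made concrete with a small calculation. When dimension scales differ in length, a simple sum gives the longer scale more influence; converting each score to a proportion of its scale maximum before weighting is one way to equalize them. The sketch below is an illustration under that assumption: the function name `weighted_total`, the short dimension labels, and the sample ratings are hypothetical.

```python
def weighted_total(scores, max_points, weights):
    """Combine dimension scores on scales of different lengths.

    Each score is converted to a proportion of its scale maximum and then
    multiplied by the dimension's weight. Weights must sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[d] * scores[d] / max_points[d] for d in scores)


max_points = {"design": 3, "data": 4, "conclusions": 3}  # scale maxima
scores = {"design": 3, "data": 2, "conclusions": 3}      # hypothetical ratings

# Simple sum: the 0-4 "data" scale counts for 4 of the 10 available points.
raw_total = sum(scores.values()) / sum(max_points.values())

# Equal weighting: each dimension counts for one third of the result.
equal = weighted_total(scores, max_points, {d: 1 / 3 for d in scores})

print(f"raw sum: {raw_total:.3f}, equally weighted: {equal:.3f}")
```

If all scales have the same length, the two results coincide, which is why equal-length scales simplify equal weighting.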

For item-specific scoring, the total number of score points in a scale is usually between 2 and 6, depending on the complexity of an item. In general, too many score points make it hard for scorers to reach agreement; too few make it hard to distinguish differences between students or to interpret the scorers’ intent.
