Guidelines for Evaluating Student Outcomes

author: WESTAT
submitter: Joy Frechtling, WESTAT
published: 03/23/2001
posted to site: 03/23/2001


March 2001

I. Purpose

The purpose of this handbook is to help Local Systemic Change (LSC) projects assess the effects of their activities on students and student learning. Recognizing the difficulty of measuring student impacts, these guidelines have been developed to help projects design studies that will meet both their own information needs and those of the National Science Foundation (NSF). The handbook addresses a number of important issues for research and evaluation studies, including deciding on appropriate measures, study design, data analysis, and reporting, with a particular emphasis on being able to make the case that any gains you may detect are attributable to the LSC. Appendix A provides a quick reference guide to the concepts included in this document. Appendix B contains a glossary of terms used in this document. Additional resources are listed in Appendix C.

II. Why Study Student Outcomes?

The goal of NSF's Teacher Enhancement program is "the improvement of science, mathematics and technology teaching and learning at pre K-12 grade levels." Projects funded under this program address improving teaching with the goal of enhancing student learning. Until recently, the LSCs have been assessed primarily in terms of their effects on teaching. Now that the program is becoming more mature, it is appropriate to take the next step and examine effects on students.

Stakeholders at all levels want to know about student outcomes. Evidence of student impact is important at the local level, where parents and the community pay close attention to how well our public schools are meeting their young people's needs; at the state level, where decisionmakers want to know how well systems are operating; and at the federal level, where policymakers closely monitor the nation's accomplishments. At all levels, building support for reform efforts rests strongly on showing that investments pay off in improving what students know and can do.

In line with this increased focus on student outcomes, federal agencies have placed greater emphasis on collecting sound data on student learning. NSF has identified student impacts as one of the critical indicators of the success of its programs. Specifically, one of the outcomes that it must demonstrate to Congress is "improved achievement in mathematics and science skills needed by all Americans." In order to do this, NSF looks to the projects it funds to provide such evidence.

It is important to note that, in looking at the effects of the LSC, one should interpret "effect on students" fairly broadly. In addition to examining student achievement, you may want to consider looking at the effect of the LSC on other student outcomes, such as student participation in higher level mathematics and science courses, attendance patterns, and attitudes towards mathematics and science. The examination of multiple outcomes in your studies, as well as the use of both quantitative and qualitative data collection methodologies, is encouraged.

III. Attribution - Making the Case for Your Results

The LSC program is designed to improve the teaching and learning of science, mathematics, and technology by focusing on the professional development of teachers within whole schools or school districts. Projects are expected to designate the instructional materials to be used and then to provide extensive professional development to help teachers deepen their subject matter knowledge and become skilled in the use of the instructional strategies called for in those materials; they also are expected to provide support for teachers as they implement the instructional materials in their classrooms.

While all of the LSC projects share those elements of program design, projects were encouraged to develop intervention strategies that fit the needs of their particular target population and their particular context. Thus projects vary, for example, in the relative emphasis they give to teacher content knowledge, how they distribute the required hours of professional development over the course of the project, and the extent to which they provide professional development district-wide as opposed to at the school site.

Individual projects are being asked to assess the effectiveness of their strategies for students. Taken together, the results of these individual studies will provide valuable information on how effective the overall program strategy has been. When teachers are provided extensive professional development around the use of high quality instructional materials, do their students learn more? If the LSC does not lead to improved student learning, it will be difficult to make the case that the program should be continued. If, on the other hand, different researchers, studying variations of the LSC design in diverse contexts and using a variety of outcome measures, demonstrate the effect of the LSC on students, there will be good reason to continue and even expand the program.

In assessing the effect of the LSC on students, a key question that projects need to address is: "How do you determine with reasonable probability that the LSC, and not some other policy, program, or event, was responsible for any student gains?" These guidelines are intended to help projects make the case that any growth they identify is, in fact, due to the LSC. Three conditions are needed to make a case for causality: temporal precedence, a correlation between treatment and outcomes, and a lack of plausible, alternative explanations.

First, it must be clear that the treatment occurred before the observed effect (i.e., "temporal precedence"). While this may seem obvious in most educational research, there are times when the order of events must be considered. When cyclic fluctuations occur, as often happens in economics, establishing a causal relationship can be difficult. In the case of the LSCs, you know when the professional development began, and you will have a measure of outcomes at some point after that, so the condition of temporal precedence is easily met.

Next, you have to show that there is a relationship between the treatment (professional development) and the effect (e.g., student scores, participation in advanced mathematics/science courses, or some other outcome of interest). This relationship can be demonstrated by showing that if the program is provided, you have a particular outcome, and if the program is not provided, you don't. Perhaps more applicable to the LSC program, where projects work with all teachers, you need to show that providing more of the program leads to more of the outcome, while less of the treatment leads to less of the outcome. It should be emphasized that showing a relationship between the treatment (professional development training) and the outcome (student scores) is not sufficient to show that the treatment caused the outcome.
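A dose-response relationship of this kind can be examined with a simple correlation between the amount of treatment teachers received and a class-level outcome. The sketch below is a minimal, hypothetical illustration in Python: the professional development hours and class mean scores are invented for the example, and a real study would use your project's actual data together with an appropriate statistical test.

```python
# Hypothetical illustration: does more professional development (PD)
# correspond to higher class-level outcomes? All numbers are invented.

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hours of LSC professional development each teacher received (hypothetical)
pd_hours = [10, 25, 40, 60, 80, 100, 120, 130]
# Mean assessment score of that teacher's class (hypothetical)
class_means = [48, 51, 55, 54, 60, 63, 61, 66]

r = pearson_r(pd_hours, class_means)
print(f"correlation between PD hours and class means: r = {r:.2f}")
```

A positive correlation here is consistent with a dose-response pattern, but, as the paragraph above stresses, it does not by itself establish that the professional development caused the gains.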

In order to support the likelihood of a causal relationship, you must rule out other possible explanations for the effect. Here is where the research design comes in, which is the central focus of this document.

IV. Instrumentation

Some projects will be able to access existing data that will meet the needs of their studies, while others will need to administer an assessment in order to study the effects of the LSC on students. To determine if existing data will meet your needs, you should consider the following questions:

  • Are the outcomes that were measured relevant and important in light of the goals of the LSC, the LSC guidelines, and the information needs of your stakeholders?

  • Are the instruments valid and reliable?

  • Are the instruments potentially sensitive to the LSC treatment?

  • Are the outcomes measured in a way that will be acceptable to your key stakeholders?

  • Are the data reported at the individual student level, or at least at the classroom level, so you will be able to design a reasonable study?

An obstacle faced by many LSC projects is that state- or district-mandated assessments are often not aligned with the goals of the LSC. One option, as described below, is to administer an additional assessment that is aligned with the project's goals to all the students involved in the project or to a sample of classes. A second option is to construct a sub-scale that is fairly well-aligned with the goals of the LSC. This option is feasible only if you have access to results for individual items on the assessment and access to the assistance of someone knowledgeable about measurement issues.

There is no one way to determine alignment, and a number of different approaches can be used. You may find it helpful to review the work of Norm Webb and see how his approaches might be used in your project. Information on alignment between expectations and assessments can be found in several articles by Webb, which are available at the following websites:

  • Vol_1_No_2/



These articles describe the three major methods of alignment: sequential development, expert review, and document analysis. In addition, five categories of criteria for judging alignment are presented: focus of the content, articulation across grades and ages, equity and fairness, pedagogical implications, and system applicability (realistic and manageable in the real world).

If you determine that previously collected data are not useful for studying the effects of the LSC, or if you do not have teacher- or student-level data, you will likely need to administer an assessment that meets both of these criteria: alignment with the goals of the LSC and reporting at the teacher or student level.

There are several issues to consider when selecting an assessment. First, you must consider the information needs of your stakeholders. Second, you need to recognize that the types of assessment tools your project uses will convey some messages about what you think should be taught and learned. In selecting an assessment tool, it is important to balance practical concerns (what is or might be easily available) with what that choice might say about your teaching goals. Traditional measures include, for example, scores on routinely administered achievement tests such as the Iowa Tests of Basic Skills (ITBS), Tests of Achievement and Proficiency (TAP), the ninth version of the Stanford Achievement Tests (Stanford-9), the Metropolitan Achievement Tests (Metro), the Comprehensive Tests of Basic Skills (CTBS), or Terra Nova. Advantages of this approach include its familiarity to most stakeholders and the availability of many commercial products. Indeed, the data from such instruments may already be available through your local or state assessment programs. A disadvantage is that these products tend not to model the kind of teaching and learning embraced by reform efforts; although multiple-choice tests can assess reasoning and higher order thinking skills, the tests currently available at the elementary and secondary levels rarely do so.

Another approach, and one in line with recommendations in the national science and mathematics standards, is to use open-ended items or performance assessments that involve multiple responses and can reflect real-life, complex problems. Developing such measures can be a challenge, however, and many assessments that appear valuable because of their "authenticity" may have questionable reliability and validity (e.g., they focus on a very small subset of the domains of interest), limiting the extent to which the results can be generalized. Disadvantages of this approach include the difficulty of finding an appropriate instrument and the amount of time needed to administer and score the performance items, as well as the costs associated with each.

Often, commercial tests, regardless of whether they use multiple-choice items or performance tasks, include multiple sub-scales. For example, the New Standards Reference Exam reports student performance as an overall mathematics score and on the following sub-scales: skills, concepts, and problem solving. By analyzing student scores on the sub-scales you might be able to address the information demands of different sets of stakeholders. The reform community might care most about student performance on the concept and problem-solving scales while other stakeholders might be most interested in knowing how students performed on the skills scale. When choosing an assessment, you should consider which, if any, sub-scales the instrument contains.

You might want to use both multiple-choice and performance assessments to take advantage of the benefits of each. Both the New Standards Reference Exam and the assessments developed by the Partnership for the Assessment of Standards-Based Science (PASS) include both multiple choice items and performance tasks. Similarly, the Iowa Tests of Educational Development (ITED), a commercial product for grades 9-12, requires students to read long passages, solve multi-step mathematics problems, and analyze simulated science experiments. Other instruments, including the Stanford-9 and the Terra Nova, also have optional performance assessment components. Whatever test(s) are selected, you should be sure that they:

  • Are valid (the assessments measure what you intend for them to measure);

  • Are designed for the population you will be assessing;

  • Are reliable (if a student takes the assessment multiple times, his/her score will remain stable); and

  • Report scores at a level (teacher or student) that will give you enough cases to conduct a meaningful analysis of the data.

Another issue to consider is the metric on which outcomes are reported. Two types of approaches are used most frequently for conceptualizing student outcomes: continuous scales and categorical scales. Continuous scales provide data on changes occurring over some range of possible outcomes, such as percentile ranks or normal curve equivalents, both of which operate on a 0-100 point scale. Categorical scales are far more restricted, employing only a few gradations. Proficiency scores, reflecting below-, at-, or above-level expectations, provide a categorical outcome metric that is very popular today. Both approaches have their pluses and minuses. Proficiency scores send the message that all students are expected to reach the same high standard, but do not measure growth within a proficiency level. On the other hand, continuous scores highlight improvement and allow you to examine more finely grained changes in student achievement. However, relatively small changes on continuous scales may be statistically significant when large samples are used but educationally meaningless.

Commercially-available tests frequently include both types of outcome metrics (continuous and categorical) in order to meet a variety of user needs. Careful thought needs to be given to the selection of an outcome metric within each individual study, as the outcome metric plays a big role in determining the types of statistical analyses you can apply, as well as the types of conclusions you can draw from your study.
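As a concrete, and entirely hypothetical, illustration of how the choice of metric shapes what you see, the sketch below summarizes the same invented scale scores two ways: as a mean gain on a continuous scale and as the percentage of students reaching an assumed proficiency cut score. Both the scores and the cut score are made up for the example.

```python
# Hypothetical illustration of how the outcome metric shapes what you see.
# The same invented scale scores are summarized on a continuous metric
# (mean change) and on a categorical metric (percent reaching a cut score).

PROFICIENT_CUT = 60  # hypothetical proficiency cut score

pre = [52, 55, 58, 61, 63, 66, 70, 74]   # invented baseline scale scores
post = [56, 59, 59, 64, 68, 70, 73, 79]  # invented follow-up scores

mean_gain = sum(b - a for a, b in zip(pre, post)) / len(pre)

pct_proficient_pre = 100 * sum(s >= PROFICIENT_CUT for s in pre) / len(pre)
pct_proficient_post = 100 * sum(s >= PROFICIENT_CUT for s in post) / len(post)

print(f"continuous view:  mean gain = {mean_gain:.1f} scale points")
print(f"categorical view: {pct_proficient_pre:.0f}% -> "
      f"{pct_proficient_post:.0f}% proficient")

# Students who improved but stayed below (or were already above) the cut
# score contribute to the mean gain yet leave the proficiency rate unchanged.
```

In this invented example every student's score rises, so the continuous metric shows a clear gain, while the categorical metric shows no change at all, which is exactly the trade-off described above: proficiency categories hide growth within a level, while continuous scales can surface changes too small to be educationally meaningful.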
