Guidelines for Evaluating Student Outcomes

author:	WESTAT
submitter:	Joy Frechtling, WESTAT
published:	03/23/2001
posted to site:	03/23/2001

V. Sampling

Most states and districts assess mathematics achievement every year, at least in selected grades, and some assess science as well. If the project decides that one of these instruments, or some definable subset of an instrument, is an adequate measure of the LSC goals, then sampling is not usually an issue. If you can get scores on all students, you probably will want to use them all in order to maximize your ability to detect gains. On the other hand, if your study design calls for linking additional information to student records, e.g., the number of hours of professional development each student's teacher has had, you may need to work with a sample of the student records. Similarly, if you need to get signed releases from parents in order to access the scores, you might want to select a sample of students.

More typically, however, sampling comes into play only when (1) you need to administer an assessment instrument either because test scores are not available, or because the tests being used are not appropriate for your goals; and (2) it is too expensive to administer and score an assessment for the entire student population at one or more grade levels.

A sampling textbook will tell you that a simple random sample is likely your best bet. A common practical stand-in is a systematic sample: make a list of all students whose teachers are participating in the project, pick a random starting point, and then select every 2nd, 5th, or 10th name, depending on the total sample size you need for a reasonable chance of detecting growth due to the LSC. But the realities of school life typically make this strategy infeasible; teachers will not take kindly to your pulling random students out of their classes. A more feasible alternative is to make a list of classes and randomly select from those to get the number of students you need, enabling you to administer the assessments to intact classes.
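
The class-based alternative described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class labels, enrollments, target of 100 students, and the function name select_intact_classes are all hypothetical, and a real project would draw the class list from district records.

```python
import random

def select_intact_classes(class_sizes, target_students, seed=None):
    """Randomly pick whole classes until the target number of students is reached.

    class_sizes: dict mapping a class label to its enrollment (hypothetical data).
    """
    rng = random.Random(seed)
    order = list(class_sizes)
    rng.shuffle(order)              # a random order gives a random selection of classes
    chosen, total = [], 0
    for label in order:
        if total >= target_students:
            break
        chosen.append(label)
        total += class_sizes[label]
    return chosen, total

# Hypothetical roster: 12 classes taught by participating teachers
enrollments = [24, 27, 22, 25, 30, 21, 26, 23, 28, 25, 24, 29]
class_sizes = {f"class_{i:02d}": n for i, n in enumerate(enrollments, start=1)}

chosen, total = select_intact_classes(class_sizes, target_students=100, seed=2001)
print(chosen, total)
```

Because whole classes are drawn, the final count will usually overshoot the target by part of one class. Note also that students within a class tend to resemble one another, so a sample of intact classes generally carries somewhat less information than the same number of students drawn individually.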

Generally, the larger your sample, the more likely you are to be able to show gains. For example, a study with 100 students (50 treated and 50 untreated) would have a reasonable (80 percent) chance of detecting a difference between the two groups of half a standard deviation. A sample size of 300 total would virtually ensure that you could detect a difference of this magnitude. In contrast, a study using a sample of only 50 students total is far less likely (a 54 percent chance) to be able to distinguish a difference of this magnitude from random noise. Looking for smaller gains requires even larger samples, e.g., you would need over 1100 students to have an acceptable chance (80 percent) to detect a gain of half this size.
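
To see how sample size drives the chance of detecting a difference, the following sketch approximates statistical power for a two-group comparison. It rests on assumptions that the text does not state: equal group sizes, a one-sided test at the 0.05 level, and a normal approximation to the test statistic. The function name approx_power is hypothetical. Under these assumptions the results land close to the 54 percent and 80 percent figures quoted above for 50 and 100 students, but treat the output as a rough planning aid rather than an exact calculation.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a one-sided, two-group comparison of means.

    effect_size is the true difference in standard deviation units.
    Assumes equal group sizes and uses the normal approximation,
    which slightly overstates power for small samples.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)                # one-sided cutoff (assumed)
    shift = effect_size / sqrt(2 / n_per_group)    # expected value of the test statistic
    return 1 - z.cdf(critical - shift)

# Total sample sizes discussed in the text, split evenly between groups
for total in (50, 100, 300):
    print(total, round(approx_power(total // 2, 0.5), 2))
```

Changing effect_size or alpha shows how quickly the required sample grows as the gain you are looking for shrinks.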

In selecting your sample, you need to decide to which population you want to generalize. If your LSC covers the K-8 range, you need to make sure your sample covers that range in some reasonable way. If this is not possible, you must either delimit the claims you make from your findings or present a convincing argument for drawing broader conclusions. In quantitative studies, the sample should be large enough that the study has sufficient statistical power to detect meaningful differences over time and/or among comparison groups. In qualitative studies, trustworthiness of the research is an issue, since the sample is likely to be small. In selecting a sample for qualitative studies, it is important to be able to make the argument that the results are not isolated examples arising as a product of selective sampling of data or subjects.

In designing your study, tradeoffs are often necessary. You need to use data collection approaches that are feasible and cost effective. A brief example of the types of choices you will need to make follows:

Project XYZ has high quality performance tasks but can only afford to have 200 students' work analyzed by expert scorers. They have several options, including:

a. Administer the tasks to all classes, have teachers score their own students, and compare the results of classes where teachers have participated in varying amounts of professional development;
b. Same as "a," but have teachers score each other's classes;
c. Administer the tasks to 100 randomly selected students of each treatment level (heavy and low/no treatment); or
d. Administer the tasks to all classes and randomly select 100 students of each treatment level to score by an expert.

All of these options are subject to selection bias (discussed later in the handbook) if the highly treated teachers are different from the less treated teachers, but some alternatives are better than others. Choices "a" and "b" likely have the problem of unreliability of scores unless the teachers have had substantial training in scoring. Choice "a" also has a problem regarding the apparent lack of objectivity. Choice "c" would create the feasibility problems of pulling individual students out of classes. Given the constraints, choice "d" is probably the best option; it is feasible in terms of both cost and administering the test to whole classrooms, sends the right message to teachers about the objectivity of the scorer, and can be used as a vehicle for professional development.
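
The selection step in choice "d" amounts to a stratified random draw: every student completes the tasks, and 100 papers per treatment level are then picked at random for expert scoring. A minimal sketch follows. The student IDs, the two level labels, the roster of 600 students, and the function name pick_for_expert_scoring are hypothetical and serve only to illustrate the idea.

```python
import random

def pick_for_expert_scoring(students, per_level=100, seed=None):
    """Randomly choose student work for expert scoring within each treatment level.

    students: list of (student_id, treatment_level) tuples (hypothetical data).
    """
    rng = random.Random(seed)
    by_level = {}
    for student_id, level in students:
        by_level.setdefault(level, []).append(student_id)
    # Draw without replacement inside each level; take everyone if a level is small
    return {level: sorted(rng.sample(ids, min(per_level, len(ids))))
            for level, ids in by_level.items()}

# Hypothetical roster: 600 students who all completed the performance tasks
students = [(f"S{i:03d}", "heavy" if i % 2 == 0 else "low/no") for i in range(600)]

selected = pick_for_expert_scoring(students, per_level=100, seed=7)
print({level: len(ids) for level, ids in selected.items()})
```

Drawing the papers after administration keeps classes intact during testing while still holding the expert scoring budget to 200 papers.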

VI. Design

The design of your study is the foundation for a number of decisions about what data you need to collect, the analysis that you will do on that data, and the conclusions that you will be able to draw as a result of your study. For LSC student outcome studies, you want to be able to determine if student outcomes have changed as a result of the teacher enhancement project. Your design, therefore, should allow you to show both that student outcomes have changed, and that any changes in student outcomes are likely a result of the LSC project and are not primarily due to some other factor. A strong study design can increase the chances that you measure a true effect, that is, a change caused by your project. The integrity and credibility of your conclusions depend on having an appropriate and sound study design.

There are two fundamental design features that your study should include in order to make the case for your LSC's effect on student outcomes. First, the study is best if it involves a comparison or control group in addition to a treatment group. In cases where a comparison group is unavailable, a comparison to an accepted standard (e.g., outcomes of students in similar districts or grade-level equivalent scores) might suffice. Second, the study should examine the initial status of the comparison or control and treatment groups, so that you can make a case that they were initially equivalent or adjust for initial differences.

A. Using a Control or Comparison Group

Consistent with good research design, studies of the effect of the LSC on student outcomes will be strongest if they include both a treatment group and a comparison group. You can accomplish this by comparing outcomes for students of teachers participating in the LSC to outcomes for students of teachers not participating in the LSC, or by comparing outcomes for students of teachers fully participating in LSC to outcomes of students of teachers with more limited involvement. Using a control or comparison group allows you to examine the effect of the project's treatment aside from other factors. Without a control group, it is nearly impossible to say that any change in outcomes is due to the treatment as opposed to other factors.

As an example, consider a school district that is shopping for a new reform-oriented elementary school mathematics program. They have narrowed their choices down to two programs. To help make the final decision, the school board asks the publishers to present evidence that their mathematics program helps students learn mathematics.

The first publisher shows the school board results of a study that compared 3rd graders' test scores before and after using that curriculum package. The results of the study show that the 3rd graders significantly improved their mathematics scores on the SAT-9 over the course of 1 year.

The second publisher then presents the results of their study. They also found that students scored significantly higher in mathematics on the SAT-9 after 1 year of exposure to their curriculum (comparing end of grade scores at 2nd and 3rd grades). Further, this increase was significantly larger than the increase in scores of students in the same school district exposed to the traditional mathematics program, which also is used currently by this district.

Which program do you think the school board should adopt? While the first publisher's study showed that students achieved more after using their program, there is no evidence that the increase in student scores is attributable to their mathematics program. After all, the students are older and have been in school for an entire year between the two administrations of the assessment. Most likely, the students would have scored better on the assessment after one year if they had experienced any mathematics instruction, perhaps even if they had experienced no mathematics instruction. Without further supporting data, this type of design can be readily challenged.

In contrast, the second publisher's study used a control group to eliminate those possible explanations for the change in mathematics scores. Although the second publisher's study does not answer every question one might have about the curriculum's effectiveness, it clearly makes a stronger case that their curriculum is more successful than the traditional one at helping students learn mathematics.
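
The logic of the second publisher's comparison can be made concrete with a small calculation: compute the average gain in each group, then compare the gains. The scores below are invented purely for illustration, and the function name mean_gain is hypothetical; nothing here comes from an actual study.

```python
from statistics import mean

def mean_gain(pre, post):
    """Average post-minus-pre change for paired scores."""
    return mean(after - before for before, after in zip(pre, post))

# Hypothetical scale scores for the same students at the end of grades 2 and 3
new_pre,  new_post  = [600, 612, 590, 605, 598], [640, 650, 628, 647, 635]
trad_pre, trad_post = [602, 608, 593, 601, 599], [625, 633, 615, 627, 620]

gain_new = mean_gain(new_pre, new_post)       # growth under the new program
gain_trad = mean_gain(trad_pre, trad_post)    # growth under the traditional program
difference_in_gains = gain_new - gain_trad    # growth beyond what a normal year produces

print(gain_new, gain_trad, difference_in_gains)
```

The first publisher's study reports only the equivalent of gain_new. The comparison group supplies gain_trad, the growth that would be expected anyway, so only the difference between the two speaks to the program itself. Whether that difference is larger than chance would still need a significance test.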

It is sometimes difficult to construct comparison groups with no exposure to the treatment, especially if a project is directed at a whole school or district. It is frequently more feasible to use data on documented differences in level of treatment as a way of defining groups for comparison in a study. If all teachers in your district are participating to some extent in the LSC program, you might look at the amount of training each has received as a way of defining your treatment and comparison groups. For example, you could examine the scores of students whose teachers had participated in the program for three years compared to students whose teachers had participated for only one year. Please note, however, that the appropriateness of this approach would depend on how teachers are selected for training. If teachers could volunteer to participate in the first year and only the stronger teachers tended to do so, it would not be appropriate to use teachers' years of treatment as a way to select groups for comparison, because the teachers who volunteered early in the project might well have been different from those who did not, even prior to the LSC.

Another approach would be to create groups for comparison based on the degree to which teachers are implementing the instruction intended by the project. For example, you could determine the number of instructional units used by each teacher or use a measure of the extent of implementation of intended instruction based on questionnaires or classroom observations. This approach, however, again runs the risk that teachers who are able to implement the project well were different in important ways prior to the LSC from those who are not able to do so.

Sometimes no comparison group is available, e.g., if all teachers participated in the same professional development activities at the same time. In these cases, conducting pre- and post-tests with the treatment group might still be considered if you can compare any change to an expected level of growth or to changes in a similar population. However, this approach must be backed up with a very solid logical argument that changes were the result of the treatment. This logical argument could be strengthened through the conduct of a supplementary mini-study. For example, let's say that the LSC professional development training consisted primarily of a one-week intensive program held during the summer. Although all teachers were required to attend, five were not able to do so due to illness or family matters. It might be possible to have the students of these teachers serve as a control group that can be compared to a similar subset of the treatment group.

B. Examining Initial Status

The second key element to a strong research study is examining initial status of the treatment group and the control or comparison group(s), in order to establish equivalence of the groups prior to the treatment or to take initial differences into account when drawing conclusions. This requirement can be accomplished in a variety of ways:

By using repeated measures of the outcome, generally a pre-test and post-test;
By using a relevant covariate to adjust for initial differences, if necessary;
By using matched samples; and
By making a reasonable case that the groups are initially equivalent.

The most straightforward way to examine initial differences among groups in a study is to use data on the outcome variable prior to treatment. By including such data in a study, you can examine changes in the outcome variable directly and determine if there are differences across groups. Generally termed a pre-test, post-test design, this approach requires that the outcome be measured more than once for each group.

If data on the outcome of the study that measure initial status prior to treatment are not available, other data closely related to the outcome can be used as an alternative. For instance, student reading achievement scores tend to be highly correlated with mathematics achievement scores. These kinds of related data are termed covariates. Ideally, measurement of the covariates would occur prior to treatment, but measurement of a covariate during or following treatment is acceptable as long as the covariate is not likely to be affected by the treatment. A covariate may be used either to show that groups were initially equivalent or to adjust for initial differences.

A third approach to examining initial equivalence or difference among groups is to use matched samples. Doing so requires information about characteristics of the groups that might be related to the outcomes being studied, and, if not controlled, might offer explanations other than treatment as part of the LSC for differences in outcomes. For example, characteristics such as race, socioeconomic status (free or reduced-price lunch eligibility is frequently used), gender, and ability should be fairly consistent across the groups being compared. While matched sample designs do not typically include initial equivalence or difference information in the outcome analysis, they do minimize the likelihood that initial differences were present. The more alike the groups were initially, the more likely it is that any measured difference in outcomes is due to the one characteristic that is known to be different, namely treatment in the LSC. The major difficulty with matching is that you can never be sure that all the relevant factors were considered that might be critical in explaining differences across the groups.
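
One simple way to build matched samples is to pair each treated student with a comparison student who shares a key characteristic and has a similar prior score. The sketch below uses greedy one-to-one matching. The field names ('id', 'frl', 'prior'), the 5-point tolerance, and the function name match_students are assumptions made for illustration; a real study would match on whichever characteristics are most plausibly related to the outcome.

```python
def match_students(treated, pool, caliper=5.0):
    """Greedy one-to-one matching of treated students to comparison students.

    treated, pool: lists of dicts with hypothetical keys 'id', 'frl'
    (free or reduced-price lunch eligibility) and 'prior' (prior score).
    A comparison student qualifies if 'frl' matches exactly and the
    prior scores differ by at most `caliper` points.
    """
    available = list(pool)
    pairs = []
    for t in treated:
        candidates = [c for c in available
                      if c["frl"] == t["frl"]
                      and abs(c["prior"] - t["prior"]) <= caliper]
        if not candidates:
            continue  # leave unmatched rather than force a poor match
        best = min(candidates, key=lambda c: abs(c["prior"] - t["prior"]))
        pairs.append((t["id"], best["id"]))
        available.remove(best)  # each comparison student is used only once
    return pairs

# Hypothetical records
treated = [{"id": "T1", "frl": True,  "prior": 48},
           {"id": "T2", "frl": False, "prior": 61},
           {"id": "T3", "frl": True,  "prior": 75}]
pool = [{"id": "C1", "frl": True,  "prior": 50},
        {"id": "C2", "frl": False, "prior": 60},
        {"id": "C3", "frl": False, "prior": 74},
        {"id": "C4", "frl": True,  "prior": 47}]

pairs = match_students(treated, pool)
print(pairs)
```

In this invented data one treated student finds no acceptable match and is dropped, which illustrates a practical cost of matching: the tighter the criteria, the smaller the usable sample.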

A fourth approach, clearly the weakest of the four, is to make a reasonable case that the groups were initially equivalent in the absence of data showing the extent of their similarity. The logic of this approach is similar to that of the matched samples approach, but it differs in that data about important group characteristics are not available. For example, the treatment and comparison groups might have come from schools in similar districts or might have come from the same schools, but from different teachers. It clearly would be better to know more about the characteristics of the groups, but at least some argument is provided that the group being treated is not otherwise dissimilar in important ways from the comparison group.

Study Designs: Four Examples

Different research designs incorporate none, some, or all of the necessary elements for a defensible study of the LSC's effect on student outcomes. Four of the most common designs in education research are discussed. For each design, an example is presented and the strengths and weaknesses are identified. Although the last design is clearly the strongest, it is possible to enhance the other designs, in effect making them more like the last design.

A. Treatment Group Only, Post-Test Only

In this design, one group of students is observed or tested only after they have received the treatment (instruction from an LSC treated teacher). No information is available on the level of treatment of these teachers in the LSC program. For example, an elementary science LSC might have student scores on a district assessment in science given in the fourth grade. However, these data are not linked to the students' teachers, nor to any prior science achievement score or other potential covariate.

The scores on this science assessment show how students in the district are performing in science at the fourth grade level. This information is useful for examining student mastery of certain skills or concepts (similar to what teachers seek in their classrooms when they administer end-of-unit assessments). However, the design gives you little or no chance to demonstrate a relationship between the LSC and the student outcomes, because you cannot show growth on the outcome over a period of time in which the LSC might have influenced scores. Further, you cannot judge whether the LSC treatment had any effect on the outcome; the students may have scored better, worse, or the same with or without the LSC.

There are several reasons why this design does not allow you to make the case that the LSC had an effect on student outcomes. The design lacks any comparison groups, whether treated and untreated groups or groups that differ in their levels of treatment, and it makes no comparison to a standard of achievement. Since there are no comparison groups, the design cannot examine initial status differences among the groups. In fact, the initial status of the students prior to the LSC is unknown; the scores may be the result of little or no change in science achievement or a very large change. For these reasons, this design fails to provide evidence of the effect of the LSC on students, and its use is discouraged. It is presented primarily to show how the other designs address some of the deficiencies with this design.

B. Treatment Group Only, Pre- and Post-Tests

In this design, students are given a pre-test or baseline measure, then the treatment (instruction by a teacher targeted by the LSC), and finally a post-test. The pre-test and post-test scores can be compared to examine growth. Note that this design, in its basic form, does not include any comparison groups. An example of this design follows:

A group of secondary mathematics LSC teachers all used a previous end-of-course exam in Algebra as a pre-test for their Algebra students at the beginning of the school year. At the end of the year, they compare their students' results on the pre-test to their scores on this year's district-mandated end-of-course exam in Algebra. The teachers can link individual student scores between the two tests by the students' names, so the change in test score from the beginning of the year to the end of the year is available. The comparison shows significant growth in Algebra achievement for these students.

This design contains some of the elements of a strong study, but falls short on others. The design includes a measurement of initial status on the outcome of interest for only one group, students whose teachers received LSC training, but does not include the use of a comparison group. The growth in Algebra achievement is known for the treatment group, but how that growth would compare to a similar group of students receiving a year of Algebra instruction by non-LSC-treated teachers is unknown. The effect, therefore, is difficult to attribute to the LSC professional development. Further, if the same test is used for pre- and post-testing, it is possible to argue that the test itself caused the change by sensitizing students to what was important to learn.

The study would be strengthened by providing a comparison group. It should be noted, however, that even if an appropriate comparison group of students is used, the teachers of those students still might not be comparable to the LSC-trained teachers. It could be the case that the LSC teachers have been the ones whose students always performed very well on the end-of-course Algebra exam, even prior to the LSC. Sometimes, in such cases, statistical adjustments can be made to make these groups equivalent. However, if the proposed comparison group is considerably different from the treatment group, it probably is not an appropriate group to use.

C. Treatment and Control Groups (or Varying Levels of Dosage), Post-Test Only

In this design, you either have two groups of students, with only one group receiving instruction from teachers participating in the LSC, or groups of students who receive instruction from teachers with varying amounts of participation in the program. In this design, all groups are tested once, after a period of treatment, and their outcome scores are compared. The following example illustrates this two-group, post-test only design:

A K-8 science LSC has been taking place in several districts, one of which is testing science in the 8th grade in a standardized way for the first time this year. The LSC perceives this as an opportunity to study the contribution of the LSC to the students' scores on this assessment. The LSC has information on the extent to which teachers have participated in LSC activities over the three years of the project. District records allow the LSC to identify which science teacher each student in the district has had in the past three years. Students' scores on the science assessment, therefore, can be grouped by the number of years the students received instruction from a teacher who had participated in the LSC for more than 30 hours.

The results show that with each additional year of instruction from an LSC-trained teacher, students performed better on the science assessment.

Like the previous design, this one contains some features of high-quality research. The design of this study includes comparison groups so that differences in outcomes across groups can be examined directly. However, examination of initial status, particularly possible initial differences across groups, is not included. In order to strengthen this study, at least one of the methods for examining initial status should be employed.

D. Treatment and Control Groups (or Varying Levels of Dosage), Pre- and Post-Tests

In this design, you have two groups of students, a treatment group and a control group. A pre-test and a post-test are administered to both groups of students. Overall, this is the strongest design presented as it includes both a comparison group and the means for examining initial equivalence of the two groups. Further, if the groups are not initially equivalent, the pre-test scores allow you to make an adjustment in your analysis for the initial difference. Consider the following example:

A mathematics LSC is located in a state that mandates end-of-year tests in reading, writing, and mathematics for all students in grades 3-8. The district also uses electronic cumulative folders allowing them to track students' progress over time. Using this information, the LSC analyzes students' growth, as measured by the change in test scores year-to-year, by the number of years the students had an LSC-trained teacher. The district thus is able to create three groups of students, those with 0, 1, and 2 years of instruction by an LSC-trained teacher. The analysis shows that students with one year of instruction by an LSC-trained teacher had larger gains in their mathematics scores than students who never received instruction from an LSC-trained teacher. Students with two years of instruction from an LSC-trained teacher had larger gains than each of the other two groups.

This design is the strongest of the four presented and allows for a credible case to be made that the LSC is responsible for the differential gains of the students. It is important to note that, while the use of a pre-test allows for any initial differences among students to be controlled through use of an analysis of covariance, differences among teachers are not controlled. Thus, as described in the next section, you need to examine your groups for possible selection biases.
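
The adjustment that an analysis of covariance makes for initial differences can be illustrated with a short calculation of covariate-adjusted means. This sketch covers only the adjustment step, not the significance test, and it assumes a common pre-test/post-test slope across groups. The scores, group labels, and function names pooled_slope and adjusted_means are hypothetical.

```python
from statistics import mean

def pooled_slope(groups):
    """Pooled within-group slope of post-test on pre-test."""
    sxy = sxx = 0.0
    for pre, post in groups.values():
        mx, my = mean(pre), mean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) ** 2 for x in pre)
    return sxy / sxx

def adjusted_means(groups):
    """Post-test means adjusted to a common pre-test level (the grand pre-test mean)."""
    b = pooled_slope(groups)
    grand_pre = mean(x for pre, _ in groups.values() for x in pre)
    return {name: mean(post) - b * (mean(pre) - grand_pre)
            for name, (pre, post) in groups.items()}

# Hypothetical (pre-test, post-test) scores; the LSC group starts out higher
groups = {
    "comparison": ([40, 50, 60], [45, 55, 65]),
    "LSC":        ([50, 60, 70], [63, 73, 83]),
}

raw_difference = mean(groups["LSC"][1]) - mean(groups["comparison"][1])
adjusted = adjusted_means(groups)
adjusted_difference = adjusted["LSC"] - adjusted["comparison"]
print(raw_difference, adjusted_difference)
```

In this invented example the raw post-test difference overstates the effect because the LSC group began with higher pre-test scores; the adjusted difference removes that head start. As the text notes, this controls for differences among students only, not among teachers.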
