Guidelines for Evaluating Student Outcomes

author: WESTAT
submitter: Joy Frechtling, WESTAT
published: 03/23/2001
posted to site: 03/23/2001

IX. Reporting

The strength of your study depends heavily on sound decisions about each of the components addressed so far. Yet even the strongest study will be of little interest or value if it is not reported appropriately and well. At the same time, a variety of stakeholders will be interested in the results of your project, and how you report your results will often vary with the audience. NSF and the research community will be interested in the technical details, while a summary of the project will generally be more appropriate for the school board and parents. Thus, you should be prepared to develop more than one report, each appropriately presented for its audience.

A full report should provide an overview of the project, including the goals and objectives. You should indicate why you are studying the student outcomes you have chosen, and how they are connected to the goals of the LSC and your teacher enhancement project.

It is extremely important that you provide a clear description of your study. The description should build a case for the approach you took by describing the instruments used, sample characteristics and selection, study design, and choice of analyses. Give sufficient detail so that the reader can understand and judge the credibility of the analyses undertaken. It is not enough to say, for example, that the treatment and comparison groups consisted of students in the 4th grade. Data about these students, such as their prior achievement levels, socioeconomic status, gender, and race also should be provided. Information about any special programs, in addition to the LSC, in which they, their teachers, or their schools might be participating, also should be provided.

The report should indicate the number of students involved and a rationale for any sampling decisions. Groups included in the design should be specified. The report also should include, for all groups and all variables, either frequency distributions for categorical data or means and standard deviations for continuous data. Inferential statistical analyses, including the test statistic, degrees of freedom, p-value, and effect size should be presented as well. Tables and graphs often help to organize and explain the data succinctly; they also help to communicate the most important results in meaningful ways.
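These conventions can be made concrete in a few lines. The following is a minimal sketch in Python using numpy and scipy (not taken from this handbook); the score arrays are hypothetical placeholder data, and the independent-samples t test with Cohen's d stands in for whichever analyses a given study actually uses.

    import numpy as np
    from scipy import stats

    # Hypothetical placeholder scores for two groups of students
    treatment = np.array([72, 85, 78, 90, 66, 81, 74, 88])
    comparison = np.array([70, 75, 68, 82, 64, 71, 69, 77])

    # Descriptive statistics: Ns, means, and standard deviations
    for name, group in [("treatment", treatment), ("comparison", comparison)]:
        print(f"{name}: N = {len(group)}, M = {group.mean():.1f}, "
              f"SD = {group.std(ddof=1):.1f}")

    # Inferential statistics: test statistic, degrees of freedom, p-value
    t, p = stats.ttest_ind(treatment, comparison)
    df = len(treatment) + len(comparison) - 2

    # Effect size (Cohen's d), using the pooled standard deviation
    pooled_sd = np.sqrt(((len(treatment) - 1) * treatment.var(ddof=1)
                         + (len(comparison) - 1) * comparison.var(ddof=1)) / df)
    d = (treatment.mean() - comparison.mean()) / pooled_sd
    print(f"t({df}) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")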

Alternative explanations for any differences detected in the study results should be identified. If further analyses ruled out these explanations, these analyses should be presented, and if alternative explanations remain, they should be acknowledged and reasonable arguments regarding their likelihood presented. Conclusions, implications, and generalizations that can be drawn from your study also should be provided. It is also important to offer any lessons that you have learned that might be applicable to related efforts so that others can benefit from your experience in conducting the study, as well as from your results.

Reports for other audiences, such as school boards, principals, or parents, probably would focus on the purpose of the overall project and the specific issues you addressed in your study. These audiences would be extremely interested in the results and implications, which would need to be described in very straightforward terms that speak to the issues of greatest interest to each audience. Most of the details of the instrumentation, sampling, and analysis would not be appropriate for these audiences, although you should be prepared to make that information available upon request.

APPENDIX A

Quick Reference: Handbook for Conducting Studies of the Effect of the LSC on Student Outcomes

Instrumentation

It is up to each project to decide which instruments to use in measuring student achievement and other student outcomes. Optimally, you would examine student outcomes using multiple measures in order to satisfy the information needs of a variety of your stakeholders, to triangulate findings, and to provide a rich array of evidence of the effect of the LSC on students. Each project should make a case that:

  • The studied outcomes are relevant and important to the LSC project;

  • The chosen instruments appropriately measure the studied outcomes;

  • The instruments are reliable; and

  • The instruments are potentially sensitive to the LSC treatment.
Sampling

Studies might include data for all students targeted by the LSC, but more likely will include data from a sample drawn from one grade level and/or from a subset of districts, schools, or classrooms. The studied sample should be:

  • Representative of the population of students being targeted by the LSC;

  • Exposed sufficiently to the LSC in order to merit an examination of effect; and

  • Large enough to provide statistical power to detect differences in outcomes or to ensure trustworthiness of qualitative methods (see the power-analysis sketch below).
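As a concrete illustration of the last criterion, the sketch below, which assumes the statsmodels package, computes the sample size needed per group to detect a medium-sized difference between two groups; the effect size, significance level, and power values are common conventions, not requirements.

    from statsmodels.stats.power import TTestIndPower

    # How many students per group are needed to detect a medium effect
    # (d = 0.5) with 80% power at a .05 significance level?
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 64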

Design

The study design should enable the researcher and the audience to answer the question: To what extent has the LSC had an effect on student outcomes? Strong studies:

  • Compare outcomes for treated students to outcomes for untreated students; and/or

  • Compare outcomes for students with varying degrees of treatment; and/or,

  • Compare outcomes of treated students to another standard (e.g., outcomes of students in similar districts or grade-level equivalent scores).

An examination of the initial equivalency of comparison groups on outcomes is usually necessary. Options for examining equivalency, in decreasing order of preference, are as follows (a covariate-adjustment sketch follows the list):

  • Using pre- and post-measures;

  • Using a relevant covariate (e.g., reading test scores);

  • Using matched samples; or

  • Making the case that samples are initially equivalent, or that an appropriate standard of comparison has been chosen.
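The covariate option, for example, can be implemented as an analysis of covariance. The sketch below assumes the pandas and statsmodels packages; the data frame, its column names, and the scores are all hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: post-test outcome, pre-test covariate, and group
    df = pd.DataFrame({
        "post":  [78, 85, 70, 90, 66, 74, 72, 81],
        "pre":   [70, 80, 65, 85, 60, 72, 68, 75],
        "group": ["T", "T", "T", "T", "C", "C", "C", "C"],
    })

    # The coefficient on group estimates the treatment effect after
    # statistically subtracting the effect of the pre-measure
    model = smf.ols("post ~ pre + C(group)", data=df).fit()
    print(model.summary())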

Internal Validity

The credibility of a study can be undermined if alternative explanations for the results, such as selection biases, are ignored. A sound study will:

  • Identify plausible alternative explanations for its results;

  • Address plausible alternative explanations, either through statistical methods or through arguments with evidence refuting alternative explanations; and

  • Acknowledge remaining shortcomings of the study, possibly providing recommendations for further research to address those limitations.

Analysis

Analysis methods and tools should be consistent with the study design and the type and level of outcome data being investigated. An appropriate qualitative analysis describes how the data were collected and analyzed, and how conclusions were drawn. An appropriate analysis in a quantitative study includes both descriptive and inferential statistics.

The goals of reporting the results of your study are to communicate your findings to audiences with an interest in your LSC and to make the case that the results of the study represent the effect of the LSC on the student outcomes you have studied. Including the information outlined here will enable your audiences to judge the study's results fairly.

For instrumentation, a technical report should include the following pieces of information:

  • What outcomes are being measured by what instruments;

  • Why you expect the outcomes and instruments to be sensitive to the LSC treatment;

  • On what metric the outcomes are measured;

  • At what level of aggregation the outcomes are reported (e.g., student, classroom); and

  • Information about the reliability and validity of the instruments.

For sampling, a technical report generally should include information on how representative the sample is, including:

  • How the sample was selected or determined;

  • The size of the sample and of any sub-groups that will be considered in the analyses; and

  • Descriptive characteristics of the population, the sample, and any sub-groups.

For design, a technical report should include:

  • A description of the design; and

  • A rationale for the design, making a case that it allows the effect of the LSC on student outcomes to be studied better than reasonable alternative designs.

For internal validity, a technical report should include:

  • An examination of possible selection biases;

  • Identification of plausible alternative explanations;

  • Analysis or evidence to rule out plausible alternative explanations; and

  • Acknowledgement of remaining shortcomings.

For quantitative analysis, common conventions for information to be conveyed include:

  • Descriptive statistics:
    • For continuous variables, Ns, means, and standard deviations
    • For categorical variables, such as gender and race/ethnicity, Ns and frequency distributions

  • Inferential statistics:
    • Test statistic, degrees of freedom, p-value, and effect size; and
    • Information confirming that the data meet the statistical assumptions of the procedures (e.g., normality or homogeneity of error variances), as in the sketch below.
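The assumption checks named in the last bullet can be run directly. A minimal sketch with scipy, using hypothetical placeholder scores:

    from scipy import stats

    treatment = [72, 85, 78, 90, 66, 81, 74, 88]    # hypothetical scores
    comparison = [70, 75, 68, 82, 64, 71, 69, 77]

    # Shapiro-Wilk: a small p-value suggests departure from normality
    for name, group in [("treatment", treatment), ("comparison", comparison)]:
        w, p = stats.shapiro(group)
        print(f"Shapiro-Wilk ({name}): W = {w:.3f}, p = {p:.3f}")

    # Levene's test: a small p-value suggests unequal group variances
    stat, p = stats.levene(treatment, comparison)
    print(f"Levene: W = {stat:.3f}, p = {p:.3f}")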

For qualitative analysis, common conventions for information to be conveyed include:

  • How the data were collected;

  • How the data were analyzed;

  • How conclusions were drawn; and

  • Examples from the data.

Reporting

If several reports are produced for different audiences, reference to the most comprehensive report and contact information are generally included so that interested parties can find the most complete information available.

APPENDIX B

Glossary

Analysis of variance (ANOVA) A test of the statistical significance of the differences among the mean scores of three or more groups. It is an extension of the t test, which can handle only two groups, to a larger number of groups.
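For illustration, a minimal one-way ANOVA in Python with scipy; the three groups of scores are hypothetical placeholders.

    from scipy.stats import f_oneway

    group_a = [75, 82, 68, 90, 77]
    group_b = [70, 65, 72, 68, 74]
    group_c = [85, 88, 80, 92, 86]

    # The F statistic tests whether any of the group means differ
    f_stat, p = f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.2f}, p = {p:.4f}")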

Baseline The first phase of research in which outcomes are measured before any treatment is administered.

Bias Any systematic error that influences the results and undermines the quality of the research.

Categorical scale A scale that distinguishes among individuals by putting them into a limited number of groups or categories.

Chi-square test A statistical procedure used with categorical data to test relationships between frequencies in categories of independent variables.
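For illustration, a minimal sketch with scipy using a hypothetical 2 x 2 frequency table (say, group by pass/fail):

    from scipy.stats import chi2_contingency

    table = [[30, 10],   # treatment:  pass, fail (hypothetical counts)
             [22, 18]]   # comparison: pass, fail
    chi2, p, df, expected = chi2_contingency(table)
    print(f"chi-square({df}) = {chi2:.2f}, p = {p:.3f}")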

Comparison group A group that provides a basis for contrast with an experimental group (i.e., the group of people participating in the program or project being evaluated). The comparison group is not subjected to the treatment, thus creating a means for comparison with the experimental group that does receive the treatment. Comparison groups should be as similar as possible to the treatment group but can be used even when close matching is not possible.

Continuous scale A scale containing a large, perhaps infinite, number of intervals. Units on a continuous scale do not have a minimum size but rather can be broken down into smaller and smaller parts. For example, grade point average (GPA) is measured on a continuous scale; a student can have a GPA of 3, 3.5, 3.51, etc. (See categorical scale.)

Control group A group that does not receive the treatment. The function of the control group is to determine the extent to which the same effect occurs without the treatment. The control group must be closely matched to the experimental group. (See comparison group.)

Correlation A statistical measure of the degree of relationship between variables.

Covariate A variable that a researcher "controls for" in a study by statistically subtracting the effects of the variable.

Degrees of freedom The number of values that are free to vary when computing a statistic; the number of pieces of information that can vary independently of one another. The degrees of freedom (df) tell you the amount of data used to calculate a particular statistic and are usually one less than the number of cases. They are needed to interpret a chi-square statistic or a t value.

Descriptive statistics Statistical procedures that involve summarizing, tabulating, organizing, and graphing data for the purpose of describing objects or individuals that have been measured or observed.

Design The process of stipulating the investigatory procedures to be followed in doing a certain evaluation.

Disaggregate To separate data for the purposes of analyses. For example, achievement test scores might be disaggregated to look for separate trends by gender and race/ethnicity.

Dispersion The amount of variation in the scores around the central tendency. There are two common measures of dispersion, the range and the standard deviation.

Effect size A statistic indicating how much the outcome for the average participant who received a treatment differs from the outcome for the average participant who did not (or who received a different level of the treatment). The effect size indicates whether the difference is substantial or meaningful. Typically, an effect size of 0.2 is considered small, 0.5 medium, and 0.8 large.
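For illustration, Cohen's d, one common effect-size measure, computed from hypothetical summary statistics:

    # Hypothetical group means and pooled standard deviation
    mean_treated, mean_untreated, pooled_sd = 78.0, 72.0, 10.0
    d = (mean_treated - mean_untreated) / pooled_sd
    print(f"Cohen's d = {d:.2f}")  # 0.60, a medium-sized effect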

Hierarchical linear modeling (HLM) A statistical procedure used when data are nested within levels, e.g., students grouped within classes, classes grouped within schools. The method's advantage is that it makes it possible to separate the variance into components explaining the effects of different levels of analysis upon the outcome variable, such as the effects of teacher or school factors on mathematics achievement.
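For illustration, a minimal two-level sketch using statsmodels' mixed linear models; the data frame, column names, and values are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: student scores nested within schools, with a
    # school-level predictor (teacher professional development hours)
    df = pd.DataFrame({
        "score":    [70, 75, 68, 82, 64, 71, 88, 77, 90, 85, 73, 79],
        "hours_pd": [10, 10, 10, 40, 40, 40, 80, 80, 80, 25, 25, 25],
        "school":   list("AAABBBCCCDDD"),
    })

    # A random intercept for each school separates school-level variance
    # from student-level variance in the outcome
    model = smf.mixedlm("score ~ hours_pd", data=df, groups=df["school"]).fit()
    print(model.summary())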

Homogeneity of error variances An assumption of some statistical procedures (e.g., ANOVA) that the populations from which the samples have been drawn have equal amounts of unexplained variability.

Inferential statistics Procedures that indicate the probability associated with inferring the characteristics of the population based on data from samples.

Instrument An assessment device (test, questionnaire, protocol, etc.) adopted, adapted, or constructed for the purpose of the evaluation.

Internal validity The extent to which the results of a study can be attributed to the treatment rather than to flaws in the research design. Internal validity depends on the extent to which extraneous variables have been controlled by the researcher.

Matched samples An experimental procedure in which the subjects are divided, by means other than random assignment, to produce groups that are considered to be of equal merit or ability. (Often, matched groups are created by ensuring that they are the same or nearly so on such variables as sex, age, grade point averages, and past test scores.)

Measures of central tendency The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency: mean, median, and mode.

Normal distribution An ideal distribution that results in the familiar bell-shaped curve, which is perfectly symmetrical. A large part of inferential statistics rests on the assumption that the population from which we are sampling is normally distributed. The results of a number of statistical procedures are invalid if this assumption is grossly violated.

Performance assessment A method of evaluating what skills students or other project participants have acquired by examining how they accomplish complex tasks or the products they have created (e.g., poetry, artwork).

Population The total group of individuals from which a sample is drawn.

p value Probability value. Usually found in expressions such as p < .05. This means "the probability (p) that this result could have been produced by chance is less than (<) 5 percent (.05)." The smaller the number, the more likely it is that the result was not due to chance.

Qualitative analysis The approach to evaluation that is primarily descriptive and interpretive.

Random sampling Selecting subjects from a population so that every individual subject has a specified probability of being chosen.

Rank-ordered comparisons Analyses that show the degree of relationship between two variables that are measured on an ordinal scale, that is, items on the scale can be put in order, or ranked, but the intervals between the ranks may not be equal.

Regression A set of statistical techniques that allow one to assess the relationship between independent and dependent variables.

Regression effect The tendency for extreme scores to become closer to the mean score on a second testing. Also called "regression to the mean."

Reliability The consistency of the readings from a scientific instrument or human judge.

Repeated measures A research design in which participants are measured two or more times.

Sample A subset of a population.

Selection bias Any factor other than the program that leads to post-test differences between groups.

Stakeholders Persons who have a vested interest in a project.

Standard deviation A measure of the spread of a variable based on the average amount that the scores in the distribution are different from the mean. The more widely the scores are spread out, the larger the standard deviation.

t test A test of statistical significance, frequently of the difference between two group means.

Threats to validity Factors that can lead to false conclusions.

Treatment group The group that receives whatever is being applied by the project that distinguishes it from the comparison group.

Triangulation An attempt to get a fix on a phenomenon or measurement by approaching it via several independent routes; this provides cross-validation of results.

Validity The extent to which an instrument measures what it is intended to measure.

Sources

National Science Foundation. (1993). User-Friendly Handbook for Project Evaluation: Science, Mathematics, Engineering and Technology Education. NSF 93-152. Washington, DC: NSF.

National Science Foundation. (1997). User-Friendly Handbook for Mixed Method Evaluations. NSF 97-153. Washington, DC: NSF.

Scriven, Michael. (1991). Evaluation Thesaurus (4th ed.). Newbury Park, CA: Sage.

Schumacher, Sally, and James H. McMillan. (1993). Research in Education: A Conceptual Introduction. New York, NY: Harper Collins College Publishers.

Trochim, William M. The Research Methods Knowledge Base. http://trochim.human.cornell.edu/kb/index.htm

Vogt, W. Paul. (1999). Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences (2nd ed.). Thousand Oaks, CA: Sage.

APPENDIX C

Additional Resources

Designing research to study the effect of the LSC on student outcomes is not a trivial task. This handbook raises many of the key issues, but it is not possible, nor was it intended, for this document to treat all of the issues at the depth required to transform a neophyte into an expert. Listed below are additional resources on research design.

Bond, Sally L., Sally E. Boyd, and Kathleen A. Rapp. (1997). Taking Stock: A Practical Guide to Evaluating Your Own Programs. Chapel Hill, NC: Horizon Research, Inc. http://www.horizon-research.com/publications/stock.pdf

Campbell, Donald T. and Julian C. Stanley. (1966). Experimental and Quasi-Experimental Designs for Research. Boston, MA: Houghton Mifflin.

Cohen, Jacob. (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.

Denzin, Norman K., and Yvonna S. Lincoln, eds. (2000). Handbook of Qualitative Research (2nd Ed.). Thousand Oaks, CA: Sage.

Joint Committee on Standards for Educational Evaluation. (1994). The Program Evaluation Standards (2nd Ed.). Thousand Oaks, CA: Sage.

National Science Foundation. (1993). User-Friendly Handbook for Project Evaluation: Science, Mathematics, Engineering and Technology Education. NSF 93-152. Washington, DC: NSF.

National Science Foundation. (1997). User-Friendly Handbook for Mixed Method Evaluations. NSF 97-153. Washington, DC: NSF.

Trochim, William M. The Research Methods Knowledge Base. http://trochim.human.cornell.edu/kb/index.htm
