In this study we ask: Do observational instruments predict teachers' value-added equally well across different state tests and district/state contexts? And, to what extent are differences in these correlations a function of the match between the observation instrument and tested content? We use data from the Gates Foundation-funded Measures of Effective Teaching (MET) Project (N=1,333) study of elementary and middle school teachers from six large public school districts, and from a smaller (N=250) study of fourth- and fifth-grade math teachers from four large public school districts. Early results indicate that estimates of the relationship between teachers' value-added scores and their observed classroom instructional quality differ considerably by district.
The purpose of this study is to investigate three aspects of construct validity for the Mathematical Quality of Instruction classroom observation instrument: (1) the dimensionality of scores, (2) the generalizability of these scores across districts, and (3) the predictive validity of these scores in terms of student achievement.
The authors used a random-assignment experiment in Los Angeles Unified School District to evaluate various non-experimental methods for estimating teacher effects on student test scores. Having estimated teacher effects during a pre-experimental period, they used these estimates to predict student achievement following random assignment of teachers to classrooms. While all of the teacher effect estimates considered were significant predictors of student achievement under random assignment, those that controlled for prior student test scores yielded unbiased predictions and those that further controlled for mean classroom characteristics yielded the best prediction accuracy. In both the experimental and non-experimental data, the authors found that teacher effects faded out by roughly 50 percent per year in the two years following teacher assignment.
Measurement scholars have recently constructed validity arguments in support of a variety of educational assessments, including classroom observation instruments. In this article, we note that users must examine the robustness of validity arguments to variation in the implementation of these instruments. We illustrate how such an analysis might be used to assess a validity argument constructed for the Mathematical Quality of Instruction instrument, focusing in particular on the effects of varying the rater pool, subject matter content, observation procedure, and district context. Variation in the subject matter content of lessons did not affect rater agreement with master scores, but the evaluation of other portions of the validity argument varied according to the composition of the rater pool, observation procedure, and district context. These results demonstrate the need for conducting such analyses, especially for classroom observation instruments that are subject to multiple sources of variation.
The effect of evaluation on employee performance is traditionally studied in the context of the principal-agent problem. Evaluation can, however, also be characterized as an investment in the evaluated employee’s human capital. We study a sample of mid-career public school teachers where we can consider these two types of evaluation effect separately. Employee evaluation is a particularly salient topic in public schools, where teacher effectiveness varies substantially and where teacher evaluation itself is increasingly a focus of public policy proposals. We find evidence that a quality classroom-observation-based evaluation and performance measures can improve mid-career teacher performance both during the period of evaluation, consistent with the traditional predictions, and in subsequent years, consistent with human capital investment. However, the estimated improvements during evaluation are less precise. Additionally, the effect sizes represent a substantial gain in welfare given the program’s costs.
The authors administered an in-depth survey to new math teachers in New York City and collected information on a number of non-traditional predictors of effectiveness: teaching-specific content knowledge, cognitive ability, personality traits, feelings of self-efficacy, and scores on a commercially available teacher selection instrument. They find that a number of these predictors have statistically and economically significant relationships with student and teacher outcomes. The authors conclude that, while there may be no single factor that can predict success in teaching, using a broad set of measures can help schools improve the quality of their teachers.
Education agencies are evaluating teachers using student achievement data. However, very little is known about the comparability of test-based or "value-added" metrics across districts and the extent to which they capture variability in classroom practices. Drawing on data from four urban districts, we find that teachers are categorized differently when compared within versus across districts. In addition, analyses of scores from two observation instruments, as well as qualitative viewing of lesson videos, identify stark differences in instructional practices across districts among teachers who receive similar within-district value-added rankings. Exploratory analyses suggest that these patterns are not explained by observable background characteristics of teachers and that factors beyond labor market sorting likely play a key role.
In this article, Heather Hill and Pam Grossman discuss the current focus on using teacher observation instruments as part of new teacher evaluation systems being considered and implemented by states and districts. They argue that if these teacher observation instruments are to achieve the goal of supporting teachers in improving instructional practice, they must be subject-specific, involve content experts in the process of observation, and provide information that is both accurate and useful for teachers. They discuss the instruments themselves, raters and system design, and timing of and feedback from the observations. They conclude by outlining the challenges that policy makers face in designing observation systems that will work to improve instructional practice at scale.