From one perspective,
the standards-based reform movement in the United States has been
extremely successful:
At least 47 states have created standards for student
learning; many have also adopted new curriculum frameworks to guide
instruction and new assessments to test students' knowledge. Many school districts across
the country have weighed in with their own versions of
standards-based reform, including new curricula, testing systems,
accountability schemes, and promotion or graduation
requirements.
Yet
not all of these initiatives have accomplished the goals that early
proponents of standards-based reforms envisioned. Advocates hoped that
standards outlining what students should know and be able to do
would spur other reforms that mobilize resources for student
learning, including high quality curriculum frameworks, materials,
and assessments tied to the standards; more widely available course
offerings that reflect this high quality curriculum; more intensive
teacher preparation and professional development guided by related
standards for teaching; more equalized resources for schools; and
more readily available safety nets for educationally needy students
(O’Day and Smith, 1993).
This comprehensive approach has been followed in some states
and districts, such as Connecticut, Kentucky, Maine, and North
Carolina as well as New York’s District #2, San Diego, and New
Haven, California. In these cases, investments in schooling and
teaching have raised student achievement while strengthening
instruction and taking steps to equalize educational opportunity.
However, this
comprehensive approach to improving education has not been pursued
everywhere. In a number
of states, the notions of standards and “accountability” have become
synonymous with mandates for student testing that are detached from
policies that might address the quality of teaching, the allocation
of resources, or the nature of schooling. In states where “high stakes
testing” is the primary policy reform, disproportionate numbers of
minority, low-income, and special needs students have failed tests
for promotion and graduation, leading to grade retention, failure to
graduate, and sanctions for schools, without efforts to ensure equal
and adequate teaching, texts, curriculum, or other educational
resources. A new
generation of equity lawsuits has emerged where standards have been
imposed without attention to educational inequalities. “Adequacy” litigation in
Alabama, California, Florida, New York, South Carolina, and
elsewhere has followed recently successful equity lawsuits in
Kentucky and New Jersey.
There are other concerns about the quality of
tests many states are using and their influence on the curriculum,
about the negative effects of high stakes tests on student
placements and opportunities to learn, and about the unintended
consequences of incentive systems that reward or sanction schools
based on average student scores rather than value-added assessments
of student growth.
These approaches appear to create incentives for pushing
low-scorers into special education, holding them back in the grades,
and encouraging them to drop out so that schools’ average scores
will look better.
Evidence of rising dropout rates in Georgia, Florida,
Massachusetts, New York, and Texas has been tied to the effects of
grade retention, student discouragement, and school exclusion and
transfer policies stimulated by high stakes tests.
In addition, sanctions for low-scoring schools
appear to reduce the likelihood that they can attract and keep
qualified teachers. For
example, Florida’s use of aggregate test scores, unadjusted for
student characteristics, to allocate school rewards and sanctions
led to reports that qualified teachers were leaving the schools
rated D or F “in droves” (DeVise, 1999; Fischer, 1999), to be
replaced by teachers without experience or training. As one principal queried,
“Is anybody going to want to dedicate their lives to a school that
has already been labeled a failure?”
States and districts that have relied primarily
on test-based accountability emphasizing sanctions for students and
teachers have often produced greater failure, rather than greater
success, for their most educationally vulnerable students. More successful reforms have
emphasized the use of standards for teaching and learning to guide
investments in better prepared teachers, higher quality teaching,
more performance-oriented curriculum and assessment, better designed
schools, more equitable and effective resource allocations, and more
diagnostic supports for student learning.
There are at least three areas in which
mid-course corrections are needed if standards and assessments are
to support improved education rather than greater inequality:
· The quality and alignment of standards, curriculum guidance, and
assessments;
· The appropriate use of assessments to improve instruction rather
than punish students and schools;
· The development of systems that assure equal and adequate
opportunity to learn.
Below I discuss issues in each of these areas and
then offer an example of a state that has developed a thoughtful
approach to standards-based reform that provides a useful model.
THE QUALITY AND ALIGNMENT OF STANDARDS, CURRICULUM GUIDANCE, AND ASSESSMENTS
Much research has found that high-stakes tests –
particularly when they use limited measures of achievement – can
narrow the curriculum, pushing instruction toward lower order
cognitive skills and distorting the meaning of scores (Klein,
Hamilton, McCaffrey & Stecher, 2000; Koretz and Barron, 1998;
Koretz, Linn, Dunbar, & Shepard, 1991; Linn, 2000; Linn, Graue,
and Sanders, 1990; Stecher, Barron, Kaganoff, & Goodwin,
1998). Recent
studies cast doubt on the gains noted on the state tests in Texas,
for example, finding that Texas students have not made comparable
gains on national standardized tests or on the state’s own college
entrance test. These
studies have suggested that teaching to the test may be raising
scores on the state high-stakes test in ways that do not generalize
to other tests that examine a broader set of higher order skills;
that many students are excluded from the state tests to prop up
average scores; and that the tests have been made easier over time
to give the appearance of gains (Haney, 2000; Gordon and Reese,
1997; Hoffman, Assaf, Pennington, & Paris, 1999; Klein et al.,
2000; Stotsky, 1998).
While some states have developed thoughtful
performance-based assessments that challenge students to demonstrate
their thinking and learning in extended tasks and responses,
including some that are embedded in the school curriculum, many
others have settled on multiple choice tests that encourage little
challenging learning and measure few of the standards. Some have adopted
off-the-shelf tests that are unaligned with state standards
altogether, creating a disjuncture between expectations that schools
should teach to the standards and accountability systems that do not
assess the standards.
And the use of norm-referenced tests in some states makes it
impossible to gauge progress accurately, as items are removed from
the test as greater numbers of students can answer them, thus
guaranteeing continuing high rates of failure, especially for
certain subpopulations of students (Haney, 2002).
Early efforts at standards-based reform
demonstrated that it is possible to create thoughtful standards and
educationally productive assessments that are aligned to them.
Connecticut, Kentucky, Maine, Maryland, Minnesota, Nebraska,
Vermont, and Washington are examples of states that have developed
fairly sophisticated performance-based assessment systems based on
intellectually ambitious standards. They developed their systems
carefully over a sustained period of time and have used them
primarily to inform ongoing school improvement rather than to punish
students or schools. [[1]] Most policy discussions have
treated tests as a “black box.” It would be useful for
analyses to examine these systems and how they operate, so that
others can learn from them.
It would also be useful for the state consortia that once began
forming to create more thoughtful assessments, such as the one
launched by the Council of Chief State School Officers, to renew
their efforts so that states can collaborate in developing
standards-based, criterion-referenced assessment systems that can
assess the range of abilities suggested by the standards.
Finally, amendments to the new federal education
legislation are needed.
The legislation has already caused one state, Maryland, to
drop its sophisticated performance assessment system and another,
Vermont, to reject the new federal funds in order to maintain its
performance assessments.
The law’s current requirements for annual testing that allows
cross-state comparability are likely to push states back to the
lowest common denominator, undoing progress that has been made to
improve the quality of assessments and delaying the move from
antiquated norm-referenced tests to criterion-referenced
systems. More state
flexibility will be needed, along with federal supports for
improving assessment systems and enabling them to assess higher
order thinking and performance skills.
THE APPROPRIATE USE OF ASSESSMENTS TO IMPROVE INSTRUCTION RATHER THAN PUNISH STUDENTS AND SCHOOLS
Many unhappy
outcomes of recent reforms stem from the inappropriate stakes
attached to test scores, rather than from the nature of the tests
themselves. The decision to attach high stakes
to tests has pushed some states back to less ambitious forms of
assessment, dropping portfolio and performance tasks, even when they
have been shown to support improved instruction. [[2]]
It has also caused a variety of dysfunctional outcomes for students
when sanctions are targeted at either students or schools.
For example,
grade retention and denial of diplomas have been major thrusts of
some state policies, although a substantial body of research has
long found lower achievement and higher dropout rates for retained
students than for comparable peers who move on through the grades (see,
e.g., Holmes and Matthews, 1984; Labaree, 1984; Meisels, 1992;
Shepard and Smith, 1986; Walker and Madhere, 1987). A recent
evaluation by the Consortium on Chicago School Research of a policy
that retained thousands of students based on their test scores
reiterates the recurrent findings:
Retained students did not do better than
previously socially promoted students. The progress among retained
third graders was most troubling. Over the two years between
the end of second grade and the end of the second time through third
grade, the average ITBS reading scores of these students increased
only 1.2 GEs (grade equivalents) compared to 1.5 GEs for students
with similar test scores who had been promoted prior to the
policy. Also troubling
is that one-year dropout rates among eighth graders with low skills
are higher under this policy…. Both the history of prior attempts to
redress poor performance with retention and previous research would
clearly have predicted this finding. Few studies of retention
have found positive impacts, and most suggest that retained students
do no better than socially promoted students. The CPS policy now
highlights a group of students who are facing significant barriers
to learning and are falling farther and farther behind (Roderick,
Bryk, Jacob, Easton, & Allensworth, 1999, pp. 55-56).
This kind of finding
is not new. A decade
earlier, researchers found that test-based sanctions in Georgia also
led to large increases in grade retention and dropping out.
Although most of the
reforms were popular, the policymakers and educators simply ignored
a large body of research showing that they would not produce
academic gains and would increase dropout rates. In other words, this was a
policy with no probable educational benefits and large costs. The benefits were political
and the costs were borne by at-risk students. The damage was psychological
as well as educational, increasing the likelihood that at-risk
students would drop out before receiving their diplomas; school
districts were also hurt by the diversion of resources to repetitive
years of education for many students (Orfield & Ashkinaze, 1991,
p.139).
Recent data
from the Department of Education in Massachusetts, where a similar
policy has been recently enacted, show more grade retention and
higher dropout rates, with the steepest increase in middle schools
(a 300% increase in dropouts between 1997-98 and 1999-2000), greater
proportions of students dropping out in 9th and 10th grades, more of
them African American and Latino, and fewer dropouts returning to
school. Meanwhile the
steepest increases in test scores are occurring in schools that have
the highest retention and dropout rates. As the example in Table 1
below shows, some test increases are largely a function of excluding
students from school.
Similar trends have been noted in Texas, where sharply
increasing scores are found in schools and districts with high
increases in middle school and high school dropouts (Intercultural
Development Research Association, 1986; Haney, 2000).
Table 1 – Retentions, Test Failures, and Dropouts in Three Boston Middle Schools, 1997-98 to 1999-2000
| | 1997-98 | 1998-99 | 1999-2000 |
| Students retained in Boston Public Schools | 3150 (4.9%) | 3406 (5.4%) | 3869 (6.1%) |
| Curley Middle School | | | |
| % failing ELA | | 39% | 31% |
| % failing Math exam | | 91% | 77% |
| # dropping out (# dropping out in 6th grade) | | 11 | 24 (16) |
| Thompson Middle School | | | |
| % failing ELA | | 35% | 25% |
| % failing Math exam | | 84% | 67% |
| # dropping out (# dropping out in 6th grade) | | 3 | 38 (21) |
| Wilson Middle School | | | |
| % failing ELA | | 34% | 21% |
| % failing Math exam | | 83% | 67% |
| # dropping out (# dropping out in 6th grade) | | 12 | 30 (15) |
The
negative consequences of these policies have been exacerbated by
sanctions attached to schools’ average test scores. Because these scores are
sensitive to changes in the population of students taking the test
and such changes can be induced by manipulating admissions,
dropouts, and pupil classifications, schools have been found to
label large numbers of low-scoring students for special education
placements so that their scores won't "count" in school reports
(Allington and McGill-Franzen, 1992; Figlio & Getzler, 2002);
retain students in grade so that their relative standing will look
better on "grade-equivalent" scores (Jacob, 2002; Haney, 2000);
exclude low-scoring students from admission to "open enrollment"
schools; and encourage such students to leave schools or drop out
(Darling-Hammond, 1991; Haney, 2000; Smith et al., 1986). Smith and colleagues
explained the widespread engineering of student populations that they
found in their study of New York City’s implementation of test-based
accountability as a basis for school level sanctions:
(S)tudent selection provides the greatest
leverage in the short-term accountability game....The easiest way to
improve one's chances of winning is (1) to add some highly likely
students and (2) to drop some unlikely students, while simply
hanging on to those in the middle. School admissions is a
central thread in the accountability fabric (Smith et al., 1986, pp.
30-31).
Finally, several studies have now found that
school averages are extremely volatile and that large gains from one
year to the next can be followed by declines in the following year,
especially in small schools, rendering decisions about school
rewards and sanctions invalid (Bolon, 2001; Haney, 2002; Kane &
Staiger, 2001). Because
of inevitable limitations of test reliability and validity for
making high stakes decisions, the American Psychological
Association, American Educational Research Association, and the
National Council on Measurement in Education have issued standards
for the use of tests.
These standards state that test scores should not be used as
the sole source of information for any major decision about student
placement or promotion. A recent report of the National Research
Council on high stakes testing summarized appropriate policy as
follows:
Scores
from large-scale assessments should never be the only sources of
information used to make a promotion or retention decision…. Test
scores should always be used in combination with other sources of
information about student achievement (Hauser & Heubert, 1999,
p. 286).
Mid-course corrections should encourage states to
follow these professional standards for the uses of tests, as a
number of states already have.
(See, for example, discussion of Connecticut’s policies
below.) Interestingly,
despite the common view that tests will only be meaningful if they
are used to allocate rewards and punishments to individuals and
organizations, these states have seen growth in student performance
without the inappropriate use of test scores for purposes they
cannot serve well.
States should be encouraged to:
· Use local, school-based measures (including first-hand assessments
of performance) as an important component of all placement and
graduation decisions;
· Prohibit the use of test scores as single arbiters of decisions
about students, teachers, or schools;
· Use analyses of (value-added) individual gain scores over time in
lieu of aggregated cross-sectional measures (such as grade level
averages or proportions of students meeting cut scores) for
understanding school trends. Include multiple measures of learning
and participation in school for evaluating school trends;
· Use assessment data to trigger additional supports for students who
are struggling and for schools in need of additional supports rather
than for allocating sanctions.
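The difference between a cross-sectional school average and individual gain scores can be illustrated with a small sketch. Everything in it is hypothetical: the five students, their scores, the set of students missing from the second-year results, and the function names are invented for illustration and are not drawn from any of the studies cited here. The simple gain score it computes is also only a stand-in for a fuller value-added analysis.

```python
from statistics import mean

# Hypothetical scores for one school: student id -> (year-1 score, year-2 score).
# All numbers are invented for illustration only.
scores = {
    "s1": (35, 38),
    "s2": (40, 43),
    "s3": (55, 58),
    "s4": (70, 73),
    "s5": (85, 88),
}

# Students assumed to be missing from the year-2 results
# (for example retained in grade, reclassified, or dropped out).
excluded_in_year2 = {"s1", "s2"}


def cross_sectional_change(scores, excluded):
    """Change in the school average, comparing whoever was tested each year."""
    year1_avg = mean(y1 for y1, _ in scores.values())
    year2_avg = mean(y2 for sid, (_, y2) in scores.items() if sid not in excluded)
    return year2_avg - year1_avg


def mean_individual_gain(scores, excluded):
    """Average gain for students tested in both years (a simple gain score)."""
    return mean(y2 - y1 for sid, (y1, y2) in scores.items() if sid not in excluded)


print(cross_sectional_change(scores, set()))              # everyone tested
print(cross_sectional_change(scores, excluded_in_year2))  # two low scorers missing
print(mean_individual_gain(scores, excluded_in_year2))    # gain of students who remain
```

In this invented example every student gains 3 points. With all students tested, the school average also rises by 3 points. When the two lowest scorers are left out of the second-year results, the school average appears to rise by 16 points, while the individual gain of the remaining students is still 3 points. The apparent improvement comes entirely from who was tested, which is the distortion that gain-based measures are meant to avoid.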
A CASE EXAMPLE - CONNECTICUT
Connecticut
provides an especially instructive example of how state level policy
makers have used a standards-based starting point to upgrade
teachers’ knowledge and skills as a means of improving student
learning. Since the
early 1980s, the state has pursued a purposeful and comprehensive
teaching quality agenda, using teaching standards, followed later by
student standards, to guide investments in school finance
equalization, teacher salary increases tied to higher standards for
teacher education and licensing, curriculum and assessment reforms,
and a teacher support and assessment system that strengthened
professional development (Wilson, Darling-Hammond, & Berry,
2001).
By 1998,
Connecticut’s 4th grade students ranked first in the nation in
reading and mathematics on the National Assessment of Educational
Progress (NAEP), despite increased student poverty and language
diversity in the state’s public schools during that decade (NCES,
1997; NEGP, 1999). In
addition, the proportion of Connecticut 8th graders scoring at or
above proficient in reading was first in the nation; Connecticut was
also the top performing state in writing and science. The achievement gap between
white students and the large and growing minority student population
has been shrinking, and the more than 1/4 of Connecticut’s students
who are black or Hispanic substantially outperform their
counterparts nationally (Baron, 1999).
Connecticut’s
preparation, licensing, and mentoring requirements – which are
tightly connected to its student standards – ensure that all
entering teachers have strong content and pedagogical knowledge to
enable them to teach a wide range of diverse learners well –
including those who have special education needs and those who are
English language learners.
In addition to strong standards for preservice education and
initial licensing, portfolio assessments for beginning teacher
licensing, modeled on the National Board process, examine how a
teacher can teach to Connecticut’s student learning standards in
each content area to which the teacher is assigned.
Student
assessments are aimed at higher order thinking and performance
skills embedded in state standards and are used to evaluate and
continually improve practice.
While the highly public reporting system places strong
pressure on districts and schools to improve their practice, the
student assessments are not used for rewards or punishments for
students, teachers, or schools. In evaluating the reasons
for Connecticut’s success, a National Education Goals Panel report
noted the benefits of the state’s low-stakes testing approach, which
precludes the use of test scores for graduation or promotion. This
has allowed the measurement of more ambitious skills and the
encouragement of more strategies for examining student
performance. The
state requires that each district develop its own assessment
criteria for graduation, which criteria must include but “may not be
exclusively based on” the results of the 10th grade mastery
examination. The use of
multiple measures, including local assessments and
curriculum-embedded performances, is encouraged.[[3]]
The assessment
system emphasizes reporting and analysis strategies that focus on
curriculum and teaching reforms, including widespread professional
development; the use of authentic measures of reading and writing on
the state tests; the wide dissemination of the standards and test
objectives along with widespread professional development around
literacy and the teaching of reading; and support to districts and
schools to disaggregate and analyze their data in ways that permit
diagnosis of student needs and curriculum effects (Baron,
1999). The state then
provides targeted resources to the neediest districts to help them
improve, including funding for professional development for teachers
and administrators, preschool and all-day kindergarten for students,
and smaller pupil-teacher ratios. Rather than pursue a
punitive approach that creates dysfunctional responses without
generating learning, Connecticut has made ongoing investments in
improving teaching and schooling through high standards and high
supports.
CONCLUSIONS
Mid-course corrections
to standards-based reforms are needed to develop more productive
systems of accountability for student learning. These should focus on:
· assessing meaningful learning using high-quality measures tied to
standards and supplemented by local indicators of learning;
· using assessment data to inform curriculum reform and guide
investments rather than to punish students and schools for low
performance;
· developing high-quality teaching in schools that provide equitable
access to curriculum that can enable students to learn the standards.
REFERENCES
Allington, Richard L. and Anne McGill-Franzen (1992). Unintended effects of educational reform in New York. Educational Policy, 6 (4), 397-414.
Baron, J. B. (1999). Exploring High and Improving Reading Achievement in Connecticut. Washington, DC: National Education Goals Panel.
Bolon, C. (2001). Significance of test-based ratings for metropolitan Boston schools. Education Policy Analysis Archives, 9 (42), http://epaa.asu.edu/epaa/v0n42.
Darling-Hammond, L. (1991, November). The implications of testing policy for quality and equality. Phi Delta Kappan, 73 (3), 220-225.
Ferguson, R. F. (1991). Paying for public education: New evidence on how and why money matters. Harvard Journal on Legislation, 28 (2), 465-98.
Figlio, D.N. & Getzler, L.S. (2002, April). Accountability, ability, and disability: Gaming the system? Cambridge, MA: National Bureau of Economic Research.
Gordon, S.P. & Reese, M. (1997, July). High stakes testing: Worth the price? Journal of School Leadership, 7, 345-368.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8 (41), http://epaa.asu.edu/epaa/v8n41/
Haney, W. (2002). Lake Woebeguaranteed: Misuse of test scores in Massachusetts, Part I. Education Policy Analysis Archives, 10 (24), http://epaa.asu.edu/epaa/v10n24/
Hoffman, J. V., Assaf, L., Pennington, J., & Paris, S. G. (1999). High stakes testing in reading: Today in Texas, tomorrow? The Reading Teacher, 52, 482-492.
Holmes, C.T. and K.M. Matthews (1984). The effects of nonpromotion on elementary and junior high school pupils: A meta-analysis. Review of Educational Research, 54, 225-236.
Intercultural Development Research Association (1986, October). Texas school survey project: A summary of findings. San Antonio, TX: Author.
Jacob, B. A. (2002). The impact of high-stakes testing on student achievement: Evidence from Chicago. Working Paper. Harvard University.
Kane, T. & Staiger, D. (2001, April). Volatility in school test scores: Implications of test-based accountability systems. Cambridge, MA: National Bureau of Economic Research. http://www.nber.org/papers/w8156.
Klein, S.P., Hamilton, L.S., McCaffrey, D.F., & Stecher, B.M. (2000). What do test scores in Texas tell us? Santa Monica: The RAND Corporation.
Koretz, D., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional Results Information System (KIRIS). Santa Monica, CA: RAND, MR-1014-EDU.
Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991, April). The effects of high-stakes testing: Preliminary evidence about generalization across tests, in R. L. Linn (chair), The Effects of High Stakes Testing. Symposium presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Chicago.
Labaree, D.F. (1984). Setting the standard: Alternative policies for student promotion. Harvard Educational Review, 54 (1), 67-87.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29 (2), 4-16.
Linn, R. L., Graue, M. E., & Sanders, N. M. (1990). Comparing state and district test results to national norms: The validity of claims that "everyone is above average." Educational Measurement: Issues and Practice, 9, 5-14.
Meisels, S. (1992, June). Doing harm by doing good: Iatrogenic effects of early childhood enrollment and promotion policies. Early Childhood Research Quarterly, 7 (2), 155-74.
National Center for Education Statistics (NCES) (1997). NAEP 1996 mathematics report card for the nation and the states. Washington, DC: US Department of Education.
National Commission on Teaching and America's Future (NCTAF) (1996). What Matters Most: Teaching for America's Future. New York: Author.
National Education Goals Panel (NEGP) (1999). Reading achievement state by state, 1999. Washington, DC: U.S. Government Printing Office.
Oakes, Jeannie (1990). Multiplying inequalities: The effects of race, social class, and tracking on opportunities to learn mathematics and science. Santa Monica: The RAND Corporation.
O'Day, J. A. & Smith, M.S. (1993). Systemic school reform and educational opportunity, in S. Fuhrman (ed.), Designing coherent education policy: Improving the system. San Francisco: Jossey-Bass.
Orfield, G., & Ashkinaze, C. (1991). The closing door: Conservative policy and black opportunity. Chicago: University of Chicago Press, p. 139.
Roderick, M., Bryk, A.S., Jacob, B.A., Easton, J.Q. & Allensworth, E. (1999). Ending social promotion: Results from the first two years. Chicago: Consortium on Chicago School Research.
Sergi, T.S. (2001). Circular Letter C-3, 2001-2002, August 17, 2001, Local Graduation Competency Requirements, New Legislation. Hartford, CT: Connecticut State Department of Education.
Shepard, L. and Smith, M.L. (1986, November). Synthesis of research on school readiness and kindergarten retention. Educational Leadership, 44 (3), 78-86.
Smith, F., et al. (1986). High School Admission and the Improvement of Schooling. NY: New York City Board of Education.
Stecher, B. M., Barron, S., Kaganoff, T., & Goodwin, J. (1998). The effects of standards-based assessment on classroom practices: Results of the 1996-97 RAND Survey of Kentucky Teachers of Mathematics and Writing (CSE Technical Report 482). Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Stotsky, S. (1998). Analysis of Texas reading tests, grades 4, 8, and 10, 1995-1998. Report prepared for the Tax Research Association. http://www.educationnews.org/analysis_of_the_texas_reading_te.htm
Walker, E. & Madhere, S. (1987). Multiple retentions: Some consequences for the cognitive and affective maturation of minority elementary students. Urban Education, 22, 85-89.
Wilson, S., Darling-Hammond, L., & Berry, B. (2001). A case of successful teaching policy: Connecticut’s long-term efforts to improve teaching and learning. Seattle: Center for the Study of Teaching and Policy, University of Washington.