
1991, p. 9). At the same time, Skehan correctly points out that as research progresses, this model will be modified and eventually superseded. Both Alderson and Skehan indicate that an area where further progress is needed is in the application of theoretical models of language proficiency to the design and development of language tests. Alderson, for example, states that “we need to be concerned not only with . . . the nature of language proficiency, but also with language learning and the design and researching of achievement tests; not only with testers, and the problems of our professionalism, but also with testees, with students, and their interests, perspectives and insights” (Alderson, 1991, p. 5).

A second area of research and progress is in our understanding of the effects of the method of testing on test performance. A number of empirical studies conducted in the 1980s clearly demonstrated that the kind of test tasks used can affect test performance as much as the abilities we want to measure (e.g., Bachman & Palmer, 1981, 1982, 1988; Clifford, 1981; Shohamy, 1983, 1984). Other studies demonstrated that the topical content of test tasks can affect performance (e.g., Alderson & Urquhart, 1985; Erickson & Molloy, 1983). Results of these studies have stimulated a renewed interest in the investigation of test content. And here the results have been mixed. Alderson and colleagues (Alderson, 1986, 1990; Alderson & Lukmani, 1986; Alderson, Henning, & Lukmani, 1987) have been investigating (a) the extent to which “experts” agree in their judgments about what specific skills EFL reading test items measure, and at what levels, and (b) whether these expert judgments about ability levels are related to the difficulty of items.
Their results indicate first, that these experts, who included test designers assessing the content of their own tests, do not agree and, second, that there is virtually no relationship between judgments of the levels of ability tested and empirical item difficulty. Bachman and colleagues, on the other hand (Bachman, Davidson, Lynch, & Ryan, 1989; Bachman, Davidson, & Milanovic, 1991; Bachman, Davidson, Ryan, & Choi, in press) have found that by using a content-rating instrument based on a taxonomy of test method characteristics (Bachman, 1990b) and by training raters, a high degree of agreement among raters can be obtained, and such content ratings are related to item difficulty and item discrimination. In my view, these results are not inconsistent. The research of Alderson and colleagues presents, I believe, a sobering picture of actual practice in the design and development of language tests: Test designers and experts in the field disagree about what language tests measure, and neither the designers nor the experts have a clear sense of the levels of ability measured by their tests. This research uncovers a potentially serious problem in the way language testers practice their trade. Bachman’s research, on the other hand, presents what can be accomplished in a highly controlled situation, and provides one approach to solving this problem. Thus, an important area for future research in the years to come will be in the refinement of approaches to the analysis of test method characteristics, of which content is a substantial component, and the investigation of how specific characteristics of test method affect test performance. Progress will be realized in the area of language testing practice when insights