Bachman and Palmer (2010) Language Assessment in Practice: part 2

Bachman, L. and Palmer, A. (2010) Language Assessment in Practice. Oxford: Oxford University Press.

In chapter 16, the authors discuss rating language performance in situations where the response to the assessment stimulus is extended and relatively unlimited in scope, and language use is situated or in the form of discourse.  These are situations in which it is not possible or practicable to use a rating system in which specific tasks or items are scored.  Thus, rating of these types of assessments is done by describing the person’s level of ability as demonstrated by the response. 

The authors describe two types of rating scales–global and analytic–and express a strong preference for analytic scales.  Global scales of language ability (as described by the authors, p. 339) give a single score, and often include multiple (and distinct) constructs within this one rating. The example they give is the global proficiency scale used by the US Foreign Service (Interagency Language Roundtable).  The authors point out that the rating description includes a large number of constructs related to various types of language ability across various domains.  The way the scores are given can make interpreting the results difficult, as it is not possible to know whether, for example, a person was strong in one area but weak in another. It is also difficult to assign levels, as one total score must be given, and there is no clear indication of the weight assigned to different aspects of the assessment.

The authors prefer analytic rating scales, and stress the importance of tying the scales to the constructs being assessed. This type of scale requires the rater to consider and rate different aspects of language ability separately. The rating scales they use run from “no evidence of” whatever ability is under consideration to “evidence of mastery” (p. 341).  Separate sub-scores of the analytic scale can be weighted and combined to produce a total score in cases where one number must be arrived at.  Analytic scales allow more nuance and detail in scoring and giving feedback.  The authors’ other recommendation is that the levels of the scale be criterion referenced: “they allow test users to make inferences about how much language ability a test taker has, and not merely how well she performs relative to other individuals, including native speakers” (p. 342).
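To make the idea of weighting and combining sub-scores concrete, here is a minimal sketch in Python. It is not taken from the book: the component names, the 0 to 4 numeric scale, and the weights are all my own assumptions for illustration, and a weighted average is only one of several ways the sub-scores could be combined.

```python
def combine_subscores(ratings, weights):
    """Return the weighted average of analytic sub-scores.

    The result is on the same scale as the individual ratings.
    """
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same components")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(ratings[name] * weights[name] for name in ratings)
    return weighted_sum / total_weight


# Hypothetical components, rated on an assumed 0-4 scale
# (0 = "no evidence of", 4 = "evidence of mastery").
ratings = {"accuracy": 3, "completeness": 4, "delivery": 2}

# Hypothetical weights: accuracy and completeness count double.
weights = {"accuracy": 2, "completeness": 2, "delivery": 1}

total = combine_subscores(ratings, weights)
print(round(total, 2))  # (3*2 + 4*2 + 2*1) / 5 = 3.2
```

Keeping the separate ratings alongside the total preserves the main advantage of an analytic scale: the single number can be reported where it is required, while the sub-scores still show where a person was strong or weak.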

The sample analytic scale provided in the book (p. 345) defines the construct, gives a performance criterion, and then for each scale level provides a description which includes information about range (“variety in the use of the particular component”) and accuracy (“accuracy or appropriateness in using those components”).  Thus, a person might have a limited range of use of some specific structure or aspect of language use, but have high accuracy within that range–the person’s level would still be low, given the limited range. The sample rubrics they provide show how the levels of ability/mastery can be converted to numeric scores. 

In the following pages of the chapter, the authors explore issues related to the use of rating scales, including issues of time, training raters, inter-rater reliability, having sufficient raters, etc. 

In the rest of this post, I will try to apply what I have learned from this book to the question of interpreting exams.

First, of course, comes the construct, as discussed in my previous post.  Quality in interpreting is NOT a straightforward topic; a large body of journal articles, book chapters, and other writings has been published on the subject.

These are the types of interpreting exams we give in our program: dialogue consecutive, non-dialogue consecutive, sight translation, simultaneous.  The construct is not the same for all of them, although I would say that it’s roughly similar.

For the purposes of this post, I am concentrating on dialogue interpreting (role play) exams, as those are, to my mind, the most difficult to assess.

What does complete mastery of dialogue interpreting look like? 

  • No errors or distortions of meaning; potential errors self-corrected by interpreter (either through noting and fixing them on her own or through initiating a clarification sequence).
  • No omissions or additions of content that affect meaning.
  • Manages physical presence in encounter smoothly. Shifts position, gaze as needed.
  • Supports direct communication, autonomy of parties.
  • Sets expectations for all parties and intervenes appropriately to ensure flow of communication.
  • Shows understanding of the specific situation and modifies behavior/performance as needed.
  • Transparent in all side sequences.
  • Explains rationale for encounter-related requests and actions when necessary.
  • Shows evidence of preparation effort (may possibly be inferred from knowledge of vocabulary or ease of understanding of situation-specific concepts; also includes requesting pre-session at the beginning of the encounter).
  • Understands SL utterances (from all parties) with a minimum of requests for repetition because of lack of understanding.
  • Produces natural-sounding (not contaminated with SL syntax, grammar, vocabulary, idioms), grammatical output in TL (to all parties).
  • Uses appropriate register, honorifics, and body language.
  • Is easy to listen to–volume, pace, and tone appropriate for situation and faithful to original speaker; diction clear in both languages.
  • Behaves to all parties in a culturally appropriate manner.
  • Effectively uses note-taking technique when needed.

The list is certainly not complete, and some items may be the subject of debate. However, it will serve as a starting point.  (There’s a place to leave replies to this post if anyone reads it and has comments…)

I do not have a specific scoring system worked out for these exams, although I do use a rubric. One difficulty I have in putting a grade on the exams is that while I listen to all of the exams, and give comments to all of the students, I don’t understand all the languages they speak. Our program is language-neutral, and we pay language reviewers to give general feedback to students for us.  As I generally put more weight on accuracy and completeness than on other aspects, I end up relying on what the reviewers tell me about accuracy and completeness–while at the same time I am very aware that their criteria and mine are not necessarily the same.  Not all of the reviewers have training in pedagogy, which makes it even more difficult.  Therefore, I see a real need for me to develop a better rubric and rating scale which will help both me and the reviewers to improve the feedback given to students.

One possibility would be to establish specific points to ‘check’ in the exam–to score the exam based on the percentage of identified items interpreted or responded to correctly. Because the exam is given by me and the reviewer together, and because I want to make sure the encounter simulation is as realistic as possible, I encourage the reviewers to ad-lib to a certain extent, and to add/change things to make the role play culturally relevant.  I have discovered that the trouble with this is that I then don’t know for sure what the reviewer has said (and thus can’t judge the accuracy of the interpretation into English), and that some reviewers do this better than others.  Thus, some students’ exams are quite different from others’.  I need to revise the exam and add in the most frequently used ‘extra’ difficulties, and then ask the role play participants to follow the script. However, there’s a big problem with that: the interaction becomes much less natural when the client is reading off the page rather than speaking naturally.
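The checkpoint idea from the start of the paragraph above comes down to simple arithmetic, sketched here in Python. The checkpoint labels are invented for illustration and are not from my actual exam script; the sketch also assumes every checkpoint counts equally, which may not be what I ultimately want.

```python
def checkpoint_score(results):
    """Return the percentage of pre-identified checkpoints judged correct.

    `results` maps a checkpoint label to True (interpreted or responded
    to correctly) or False.
    """
    if not results:
        raise ValueError("at least one checkpoint is required")
    correct = sum(1 for ok in results.values() if ok)
    return 100 * correct / len(results)


# Hypothetical checkpoints marked by the reviewer and me during a role play.
results = {
    "medication dosage": True,
    "follow-up date": False,
    "allergy question": True,
    "idiom in client turn": True,
}

score = checkpoint_score(results)
print(score)  # 3 of 4 checkpoints correct: 75.0
```

A score like this only works if the checkpoints actually occur in every student's exam, which is exactly the tension with ad-libbed role plays described above.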

It is possible that I could create a mixed rubric which allows the reviewer and me to score specific items as correct/incorrect in terms of accuracy, but which also allows for more descriptive latitude, especially in the parts that I can observe myself no matter the language.  It’s important that I be the person to give the grade (rather than the reviewer), but at the same time, I need to make sure that the reviewers’ feedback is structured and as consistent as possible.

I’m not sure how best to create a more structured rating system for language skills in the language other than English; I suspect the main thing I need to do is revise my description of each level and add some range and accuracy descriptions, as well as a descriptor about contamination from the SL. For things like delivery and situational management skills, I will also revise my descriptors. Range and accuracy are not necessarily good criteria for such skills; perhaps for delivery I will create some checklist type points (varies tone, pace appropriate…) and for situational management I can expand on the descriptors I have now, which include how consistently the student produces a behavior, and how smoothly/appropriately. 

My goal is to have a much-improved rubric by early in the spring semester and test it out in my next set of exams (which will be in March).