Sunday, October 29, 2023

Hybrid/AI Scoring: Content and Implications for Instruction


Something to Read for Background and Further Research:

Here's what we know about "hybrid" automatic scoring: https://www.pearsonassessments.com/large-scale-assessments/k-12-large-scale-assessments/automated-scoring.html

Be sure to look at this one too: https://www.pearsonassessments.com/large-scale-assessments/k-12-large-scale-assessments/automated-scoring/automated-scoring--5-things-to-know--and-a-history-lesson.html 

What's been going on? 

So, there's a LOT of data that has already been collected. This has been going on in the background since at least 1994, and the IEA (Intelligent Essay Assessor) probably drew on data and technology that preceded even that. Student papers. Rater scores. Rubrics. Processes. Refinements. Pearson alone scored 39 million responses in 2019. And Pearson isn't the only one working on this stuff.

What's happening in December 2023

The test has already been designed and field tested. 

The passages, genres, and prompts have already been selected. The scoring guide has already been prepared by humans. They've decided what the possible answers are. They've decided what text evidence should support those answers. The humans have decided what wrong answers and evidence are probable. The humans have decided what paraphrases and synonyms are likely.

Sample/anchor papers have been scored and uploaded into the machine/system. 

The machine/system is programmed with all of this information. The machine is programmed with the rubric.

When retesters in December submit their tests, the machine gets the papers. The machine/system already knows how the sample papers scored. It scans the new submissions, compares them to what it already "knows," and generates a score.

When the system/machine is challenged, the writing gets sent to a human. Sometimes both the human and the system/machine get the same paper, and the two calibrate or recalibrate against each other. This will happen for about 25% of... papers? Scoring attempts? We're not sure; all we have is the 25% figure.

It's called hybrid because humans decide what the system/machine looks for. It's called hybrid because humans keep scoring continuously, alongside and in between the machine's work.
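
If it helps to see the shape of that loop written down, here's a tiny sketch in Python. To be clear, this is NOT Pearson's or Cambium's actual engine: real systems use far richer features than word counts, and the AnchorPaper structure, cosine similarity, and 0.6 confidence cutoff below are illustrative assumptions of mine. The flow is the point: compare a new paper to the human-scored anchors, borrow the closest anchor's score, and send low-confidence papers to a human rater.

```python
# Conceptual sketch only: NOT the actual hybrid scoring engine.
# Data structures, similarity measure, and threshold are illustrative assumptions.

from collections import Counter
from dataclasses import dataclass
from math import sqrt


@dataclass
class AnchorPaper:
    text: str   # a human-scored sample response
    score: int  # the score humans assigned on the rubric


def bag_of_words(text: str) -> Counter:
    """Crude word-count representation (real engines use far richer features)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def score_response(response: str, anchors: list[AnchorPaper], confidence_threshold: float = 0.6):
    """Score a new paper against the anchors the system already 'knows.'

    Returns the most similar anchor's score plus a flag that routes
    low-confidence papers to a human rater.
    """
    vec = bag_of_words(response)
    best = max(anchors, key=lambda a: cosine_similarity(vec, bag_of_words(a.text)))
    confidence = cosine_similarity(vec, bag_of_words(best.text))
    return best.score, confidence < confidence_threshold


if __name__ == "__main__":
    anchors = [
        AnchorPaper("The author argues recycling saves energy, citing the study in paragraph 2.", 4),
        AnchorPaper("Recycling is good.", 1),
    ]
    score, needs_human = score_response(
        "The author says recycling saves energy and uses the study as evidence.", anchors
    )
    print(f"machine score: {score}, route to a human rater: {needs_human}")
```

The part that matters for us isn't the code; it's that everything the machine "knows" came from human-scored anchor papers and human-built scoring guides first.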

Accuracy/Validity

Can a computer do this? Is it fair to kids and the ways they interpret the text? Some people will argue with me...but YES, it's fair. Much more fair than what we've had before. Why? Let's think about what's being assessed here. It's not really TEKS. And it's not really writing for ECR. 

We want to know - and it's a good thing to know - 
  • Can kids read and understand a text of any genre? 
  • Can kids follow written instructions? 
  • Can kids use data to make decisions? 
  • Do kids know how to read stuff at their grade level? 
  • Can kids make decisions about important ideas and communicate them in a variety of ways (writing correspondence, using information, arguing, etc.)? 

When TEA released information about the new ELAR test, they gave us something really important: a guide for raters. The guide included the text, the possible answers, and the possible text evidence that could support them. It also explained, alongside the rubric description, what a good response would include.

So...kids will read a passage (or passages). They'll be given a prompt with a genre. The passage will have clearly supported answers and text evidence to match them. Texas teachers will have written, revised, and validated this information. The item will have been field tested and reviewed psychometrically. And all of that will be programmed into the computer.
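
You can picture all of that human prep work landing in a structure something like the one below. Again, this is purely hypothetical: the field names and example values are mine, not anything TEA, Pearson, or Cambium has published. It just makes visible how much of the "intelligence" is really human decisions written down ahead of time.

```python
# Hypothetical sketch of an item specification. Field names and values are
# invented for illustration; this is not TEA's, Pearson's, or Cambium's format.

from dataclasses import dataclass, field


@dataclass
class ItemSpec:
    passage: str                     # the text (or texts) kids read
    prompt: str                      # the writing prompt
    genre: str                       # the genre the prompt calls for
    acceptable_answers: list[str]    # ideas a good response should land on
    supporting_evidence: list[str]   # text evidence that matches those answers
    likely_wrong_answers: list[str]  # probable misreadings the humans anticipated
    likely_paraphrases: list[str]    # synonyms/paraphrases raters expect to see
    rubric: dict[int, str] = field(default_factory=dict)  # score point -> descriptor


example_item = ItemSpec(
    passage="(passage text goes here)",
    prompt="Write a letter to your principal about the schedule change described in the passage.",
    genre="correspondence",
    acceptable_answers=["the author believes the change helps students focus"],
    supporting_evidence=["'students returned ready to work' (paragraph 3)"],
    likely_wrong_answers=["the change is only about saving money"],
    likely_paraphrases=["concentrate", "pay attention", "stay on task"],
    rubric={4: "effective idea development with relevant evidence",
            1: "weak idea development with little or no evidence"},
)
```

Everything the machine compares a student's paper against lives in something like this before the first test is ever submitted.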

Here's how I think it will work: We teach kids to dig for the good stuff. 

Implications for Instruction: 

1. Kids don't want to, but they're going to have to read the passages and the prompts in full. 
2. Deep comprehension is essential. 
3. Digital reading and digital composing skills must be transferred from physical reading and writing tasks. In other words, learners must see what this stuff looks like in practice IN the platform or some other digital platform. 
4. Readers and writers must learn not just WHAT the tools in the platform are, but HOW those tools can be used to support reading and writing processes. The Cambium tutorial builds familiarity, but it gives NO information on how the tools aid the thinking processes. 
5. Text evidence is KING. And it must be connected to answering the question. 
6. One single acronym is NOT going to help kids with all the ways this thing can be assessed. RACE isn't good enough. (I actually despise it because it causes bad results. More on that another time.)

Application: 

Here's what we tried with a group of kids who scored between 3 and 5 on their essays last year. It's a PowerPoint of notes and the steps we used in working through the process of revising their previous work. We pulled up the Cambium testing platform and used what they have available on the test. We learned a TON about what kids said they did and didn't do, and what they knew and didn't know. I still need to think about that for a while. But in the meantime, I hope this helps you make some instructional decisions and plan next steps. 


