Thursday, April 11, 2024

79% vs. 8% Zeros: Let's Panic about the Right Thing: Some Facts to Consider about "AI" Scoring

 

Raise Your Hand Texas posted this graphic on April 10th, 2024.

There's lots to say initially, and more to say after that. 

First, it's not AI. It's an Automated Scoring Engine. 

It does what it's told and doesn't learn. Not sure that helps anyone feel better. Semantics at this point. To-may-to. To-mah-to. 

Second: 79% of retesters in DECEMBER 2023 received zeros on the extended constructed responses. 

These are the folks who haven't passed the English II exam all the other times they've taken it and aren't even in English II classes anymore. Here's a scenario: 

Senior-level student. End of the first semester of their last year in school. They need the English II credit to graduate. Their school needs them to pass because that gets the school points for accountability stuff. They've taken the test this many times: when they were in English II, the summer after their sophomore year, the December of their junior year, the spring of their junior year, the summer after their junior year, and now the December of their senior year. They have the next spring and summer opportunities as well. Yeah. That's 8 times. And...a lot of these kids have been doing this for both English I and English II. 16 times. And...a lot of these kids have been doing this retesting thing for YEARS. AND, some of these kids were in English I, English II, and English III at the same time because they failed the classes and the assessments. And...some of these kids were in an English class and a remedial class. And...when they originally took English II, the Extended Constructed Response, Short Constructed Response, and other new item types weren't a thing. So they haven't had initial instruction or experience on much of any of it. Only remediation, if all the rules were followed. 

And they made a zero on the ECR in December of 2023 when most folks didn't know that a computer would be scoring the responses. While the lack of information and knowledge about scoring is a problem - a big one - it should be NO surprise that these kids made zeros. It's not the computer's fault. More on that later. 

These kids - the 79% of retesters from December 2023 who made a zero - aren't failing because a computer scored their writing. Most of them didn't write. Some of them (from what we saw on the 2023 responses) left the box blank, typed a placeholder like a single parenthesis, wrote "IDK," or wrote "Fuck this." They aren't failing because they haven't had instruction. They are failing because they are just DONE with the testing bullshit. They are disenfranchised, tired, and ruined. 

The computer isn't the problem: the culture created by the testing regime is the problem. 

Third: So Where Did RYHT Get This Data, and Why Did They Post It Now?

Author's Purpose and Genre, y'all. It's rhetoric; it's propaganda. 

Raise Your Hand Texas wants you to VOTE. The post was incendiary and meant to rally the masses and voting base. It's a classic technique for campaigns and lobbying. We SHOULD vote. We SHOULD be angry. But the data posted is a bit deceptive in directing our actions for the kids in front of us. 

Part of their information comes from the TEA accountability report given to the SBOE. Here it is.  And part of it comes from a research portal that I've mentioned in previous blogs. 

Notice also that they are only showing you English II. There are other grades and results out there. 

Here's a chart of what you'd find in the TEA Research Portal. I took the data and made a chart. ASE means Automated Scoring Engine. 

Source: TEA Research Portal

Action: Note the differences in the population, the prompt, and the rubric. Yes. There's a problem with the number of kids scoring a zero. But it's probably not a computer problem. Let's dive deeper. 

Fourth: There's More from Texas School Alliance and Others

The Texas School Alliance (and others) have been talking to TEA. You can become a member here. Various news organizations are also reporting and talking to TEA, asking hard questions and receiving answers. And school officials have been able to view student responses from the December 2023 retest. 

In researching the reports and ideas, I found some key things to know and understand before acting. 

Difference in the Type of Writing

The big change that caused the zeros probably wasn't the computer scoring thingy. Kids are answering different kinds of prompts than they had been answering before. Before, kids wrote in persuasive or explanatory modes about a topic they knew something about. They used their own schema and probably didn't care much about the topic. Now, kids have to write in argumentative or explanatory modes in a classic essay format or in correspondence (which we haven't seen yet). They have to write from their understanding of a text and not from their schema. 

Action: We need to make sure kids know how to do the right kind of writing. We need to make sure kids can comprehend the texts. 

The Rubric is Different

The rubric used in 2023 and 2024 retesting is DIFFERENT because the type of writing is different. From discussions with others, it's clear that people are pretty confused about how the new rubric works and what it means. If teachers are confused, KIDS are confused. There has been little to no TEA training on the rubric other than some samples. 

Action: Our actions should be to really study our kids' responses and the scoring alongside the scoring guides. 

The Test ISN'T Harder Overall: SO WHAT? 

You can argue with me on this if you want to. But there's a detailed process all tests go through before and after administration. It's called equating. It's where we get the scale scores, why the cut points for raw scores are different each year, and why we don't know what passing is until after the tests have been scored. 
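If the equating idea feels abstract, here's a minimal sketch of the effect it has on passing. Every number below (the conversion tables and the 4000 scale-score cut) is invented for illustration - these are not TEA's actual tables - but it shows why the passing RAW score can move from year to year while the passing SCALE score stays put.

```python
# Hypothetical raw-score-to-scale-score conversion tables for two test forms.
# Equating builds tables like these so the SAME scale score represents the
# same level of performance even when one form is a little harder than another.
harder_form = {35: 3980, 36: 3992, 37: 4004, 38: 4016}   # invented numbers
easier_form = {37: 3978, 38: 3990, 39: 4002, 40: 4014}   # invented numbers

PASSING_SCALE = 4000   # hypothetical cut score, set on the scale, not on raw points

def passing_raw(conversion: dict[int, int], cut: int) -> int:
    """Lowest raw score whose equated scale score meets the cut."""
    return min(raw for raw, scale in conversion.items() if scale >= cut)

print(passing_raw(harder_form, PASSING_SCALE))   # 37 raw points needed on the harder form
print(passing_raw(easier_form, PASSING_SCALE))   # 39 raw points needed on the easier form
```

That's also why we don't know what "passing" looks like in raw points until the equating work is done after the tests are scored.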

The ECR is probably harder. So are some of the SCRs. Some of the item types and questions are harder. Some questions have always been harder than others. That's not new. TEA is right that the test isn't harder overall; it's just that there are other things that matter to living, breathing, and feeling humans.

Just because the overall test is not harder doesn't mean the student experience is the same. When stuff is new to kids, it's scary. When stuff is hard, scary, and high stakes, kids have anxiety. That impacts test scores in ways that are unrelated to psychometrics. The test may not be harder psychometrically, but it is still a challenging psychological experience, and that experience impacts instruction. 

Action: We need to do more work on preparing students for the user experience, in terms of the technical, online, and test-design details as well as the social-emotional realities. Especially for kids who have previously experienced failure. 

People Who Viewed the Zero Responses Agreed with the Computer Ratings

TEA reported before (December 2023) that the ASE is trained and MUST agree with human raters. It's part of their process. 

And the folks I've talked to, who reviewed the responses at TEA from the kids at their campuses who scored a zero, agree that those papers deserved a zero. Most of them submitted no rescoring requests, or only one or two. 

This means that our colleagues agree that the ASE isn't the problem. 
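For context: TEA hasn't published (that I've seen) exactly what "must agree" means statistically. In automated essay scoring work generally, engine scores are compared to human scores with agreement measures like exact agreement and adjacent agreement. Here's a rough sketch of that kind of check; the ratings are made up, and the real rubric, score points, and required thresholds may well differ.

```python
# Made-up human and engine ratings for ten responses, just to show the
# kind of agreement check that's typical in automated essay scoring.
human  = [0, 0, 3, 5, 2, 0, 4, 1, 0, 3]
engine = [0, 0, 3, 4, 2, 0, 4, 2, 0, 3]

exact    = sum(h == e for h, e in zip(human, engine)) / len(human)
adjacent = sum(abs(h - e) <= 1 for h, e in zip(human, engine)) / len(human)

print(f"exact agreement:    {exact:.0%}")     # 80% - engine matched the human exactly
print(f"adjacent agreement: {adjacent:.0%}")  # 100% - never off by more than one point
```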

The Hybrid Scoring Study Conducted by TEA

Here's the link.  And here's a screenshot of the Executive Summary on page 5. 

Huh? Here's what I think it means: 

Paragraph One: They studied the Spring 2023 results. The ASE did a good job. They are going to use the results of the study to study future tests - including the one that was given in December of 2023. 

What we don't know: Who did the study? What was the research design? What is "sufficient performance criteria"? Where is the data and criteria for the "field-test programmed" and "operationally held out validation samples"? What do those terms mean and how do they function in the programming and study? 

Paragraph Two: They studied the first 50% of the extended constructed responses scored by the automated scoring engine (ASE). The ASE did a good job for the subpops used for accountability purposes.

What we don't know: What is "sufficient performance criteria"? How did the subpops compare and what was used as criteria to know that the comparisons were fair? What are the models? 

Paragraph Three: Because of the way the test is made now (with different texts and prompts each time), the computer scoring engine will have to be reprogrammed for each test administration AND while scoring is going on. They are going to reprogram on all score points as the data arises during scoring. As changes are made to the models, all the stuff that was scored with the old models will be scored again. (That's good research practice, with both a priori and constant comparative analysis. BUT - we don't really know what research protocols they are following, so I'm just guessing here.) As more essays are scored, they'll figure out new ways that kids answer that confuse the models and need rerouting. If responses were rerouted before the new codes, they will be rescored with the new models. 

What we don't know: We still don't know how the ASE is programmed and monitored. Like, at all. 
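To make paragraph three a little more concrete, here's my guess at the shape of that loop in code. This is a sketch of the workflow the study seems to describe, not TEA's actual pipeline; the function names, the batching, and the routing logic are all assumptions on my part.

```python
def hybrid_scoring_run(batches, model, retrain, needs_human_review):
    """Sketch: score responses with the ASE, recalibrating and rescoring as data arrives."""
    seen = {}     # response_id -> response text, kept so earlier work can be rescored
    scores = {}   # response_id -> score, or a condition-code-style routing decision

    for batch in batches:                       # responses arrive in waves during the window
        new_model = retrain(model, batch)       # reprogram/recalibrate on the incoming data
        if new_model is not model:
            model = new_model
            # Everything scored under the old model gets rescored under the new one.
            scores = {rid: model.score(text) for rid, text in seen.items()}

        for rid, text in batch:
            seen[rid] = text
            if needs_human_review(text):        # responses that confuse the model get rerouted
                scores[rid] = "routed to a human rater"
            else:
                scores[rid] = model.score(text)

    return scores
```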

Paragraph Four: Condition codes change based on paragraph three. They'll keep working on changing them and refining that process during current and future administrations for all grades. They will have to, because the grades, texts, prompts, and human ratings all change each administration. The data also changes as more stuff is scored, so the codes have to change as the student responses are scored. 

Action: All of us need to read the whole study. And we need to ask more questions. For now, I think there's not a real ASE scoring problem with the zero responses. The 79% who got zeros probably earned them, and we need to look elsewhere for the cause.





2 comments:

  1. December 2024? Have we had that yet??

    1. Updated with edits. Thanks for catching that.
