Here's an article you should read. https://www.dallasnews.com/news/education/2024/02/14/computers-are-grading-texas-students-staar-essay-answers/
Falsehood or Semantics:
So...ECRs and SCRs aren't graded by AI, Artificial Intelligence. I used the wrong word initially. But they ARE graded by a machine. For most of us (parents and practitioners), neither word makes us feel all warm and fuzzy about its use. And why do we have to learn about this by accident, or from the Dallas Morning News?
Some important points:
1. In the February SBOE board meeting, the commissioner was asked if AI was being used to score essays and short answers. He said no, which is true. But he also said that two humans were grading. That has been untrue for quite some time now; the document about scoring came out in December. Unless you count the convoluted way that statement would be true: two human raters score essays, and then their ratings are used to train a machine that now scores essays the way the original humans did. (Which is also problematic, because the machine can inherit their bias and inaccuracies.)
A Truth: December testers were scored by the "machine." The machine has to score based on the training it receives, which teaches it to mimic how humans previously scored a large number of essays. (A toy example of what that kind of training looks like follows the numbered points below.)
An Inference: If the machine that scored December retesters was trained on a database of previously scored essays, then that training data had to come from a field test.
2. December testers were scored by the machine, using data from previous scoring events, probably a field test. Students don't experience field tests the same way they experience official testing. Since the writing types were new, we had stand-alone field tests. And raters don't experience field-test scoring the same way either, in setting or in urgency. This creates scoring inconsistencies and variables that don't match the real data and experiences on and after test day. That's called unreliable and invalid.
3. If December testers (who are most at risk because they haven't passed in a previous administration) were scored by a machine, there are a few scenarios. All of them are problematic.
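For anyone who wants to picture what "training a machine on previously scored essays" means in practice, here is a toy sketch in Python. This is NOT the state's or the vendor's actual system; the essays, scores, and model are made up purely to show the general idea, and to show how the machine ends up reproducing whatever the human raters did, bias and all.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Essays two human raters already scored (say, from a field test). All made up.
training_essays = [
    "The author uses evidence from the passage to support the claim...",
    "I think it is good because it is good and I liked it...",
    "The central idea is developed through examples and word choice...",
]
human_scores = [4, 1, 3]  # whatever the raters gave, bias and inconsistency included

# The "training": the model learns to reproduce the human raters' scores.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(training_essays)
model = Ridge().fit(features, human_scores)

# Later, a new essay from a December retester is scored by the machine, not a person.
new_essay = ["The evidence in the passage shows that..."]
predicted = model.predict(vectorizer.transform(new_essay))
print(round(float(predicted[0])))  # mimics the humans it learned from, mistakes and all

The point of the sketch: the machine never "reads" the new essay the way a person would; it just assigns the score that best matches the patterns in the pile of human-scored essays it was trained on.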
Hypothetical Scoring Scenarios:
Ultimately, we are left guessing about what happened and how. Here are some possibilities and the problems they pose.
Hypothetical Scoring Scenario One: Machine scoring was previously used to validate human scoring on a STAAR test or Field Test.
Problem: We knew nothing about machine scoring until the December document and the 2023-2024 Test Administrator training. Since we didn't know, my grandma would call that sneaky and dishonest.
Hypothetical Scoring Scenario Two: Machine scoring was NOT used previously to pilot its validity against human scoring on an operational assessment.
Problem: That's called unethical, because something was used without data to prove its effectiveness. For large-scale assessments with high-stakes outcomes for the entire state of TEXAS, why not pilot it first?
Hypothetical Scoring Scenario Three: Machine scoring was tested by the development company on something that wasn't STAAR. That's called unreliable and invalid, or at the very least unwise.
Problem: STAAR is its own beast. It's not really like anything else. And, y'all, this is Texas. We do our own thing.
Call for Answers:
What are the implications for learners and their future outcomes?
These are great questions you raise, and teachers want answers! How do we get them? The lack of transparency is quite suspect!