Wednesday, February 14, 2024

Semantics: No AI STAAR Scoring, but does it make me feel better that it's still a machine?

 Here's an article you should read. https://www.dallasnews.com/news/education/2024/02/14/computers-are-grading-texas-students-staar-essay-answers/ 

Falsehood or Semantics:

So...ECRs and SCRs (extended and short constructed responses) aren't graded by AI. Artificial intelligence. I used the wrong word initially. But they ARE graded by a machine. For most of us (parents and practitioners), neither word makes us feel all warm and fuzzy about its use. And why do we have to learn about this stuff by accident or from the Dallas Morning News? 

Some important points: 

1. In the February SBOE board meeting, the commissioner was asked if AI was being used to score essays and short answers. He said no. Which is true. But he also said that two humans were grading. That has been untrue for quite some time now, since the document about scoring came out in December. Unless you count the convoluted way that statement would be true: two raters score essays, and then their ratings are used to program a machine that now scores the essays the way the original humans did. (Which is also problematic, because the machine can inherit their bias and inaccuracies.) 

A Truth: December testers were scored by the "machine." The machine has to score based on training designed to make it mimic the scores humans gave to a large number of previously scored essays. (A rough sketch of how that kind of training works is below.)
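
For anyone wondering what "trained to mimic humans" even means in practice, here is a bare-bones, hypothetical sketch in Python. The essays, the scores, and the model are all made up, because TEA hasn't told us how its "machine" actually works; this is only a generic illustration of how automated essay scoring engines are typically built. The point is simple: an engine like this learns to predict the scores humans already gave, so whatever bias or inconsistency lived in those human scores comes along for the ride.

```python
# Hypothetical sketch of automated essay scoring (NOT TEA's actual system).
# The "machine" never reads essays the way a person does; it learns to
# predict the scores humans already assigned.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Made-up training data: essays previously scored by human raters (0-5 scale).
essays = [
    "The author develops the theme through dialogue and imagery...",
    "I think the story was good because it was good.",
    "The central idea is supported by evidence in paragraphs two and three...",
    "stuff happens and then it ends",
]
human_scores = [4, 1, 5, 0]

# Convert each essay into word-frequency features; the machine only sees numbers.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(essays)

# Fit a model that mimics the human raters' past judgments.
model = Ridge().fit(X, human_scores)

# A new December retester's essay gets whatever score the learned pattern
# predicts, including any bias or inconsistency baked into the training scores.
new_essay = ["The theme is shown when the character changes her mind."]
predicted = model.predict(vectorizer.transform(new_essay))
print(round(float(predicted[0]), 1))
```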

An Inference: If the machine was trained to score December retesters based on a database of previously scored essays, then the data used to train it had to come from a field test. 

2. December testers were scored by the machine using data from previous scoring events. Probably a field test. Field tests aren't experienced by students in the same ways as official testing. Since the writing types were new, we had stand-alone field tests. And raters don't experience field-test scoring the same way either, in setting or in urgency. This creates scoring inconsistencies and variables that don't match the real data and experiences on and after test day. That's called unreliable and invalid. 

3. If December testers (who are most at risk because they've not passed in the previous administration) were scored by a machine, there are a few scenarios. All of them are problematic. 

Hypothetical Scoring Scenarios: 

Ultimately, we are left guessing about what happened and how. Here are some possibilities and the problems they pose. 

Hypothetical Scoring Scenario One: Machine scoring was previously used to validate human scoring on a STAAR test or Field Test.  

Problem: We knew nothing about machine scoring until the December document and the 2023-2024 Test Administrator training. Since we didn't know, my grandma would call that sneaky and dishonest.

Hypothetical Scoring Scenario Two: Machine scoring was NOT previously run alongside human scoring on an operational assessment to pilot its validity. That's called unethical, because something was used without data to prove its effectiveness.

Problem: For a large-scale assessment with high-stakes outcomes for the entire state of TEXAS, why not? 

Hypothetical Scoring Scenario Three: Machine scoring was tested by the development company on something that wasn't STAAR. That's called unreliable and invalid. Or, at the very least, unwise.

Problem: STAAR is its own beast. It's not really like anything else. And, y'all. This is Texas. We do our own thing. 

Call for Answers: 

What "machine" is being used? 
What's the "machine" called? 
Who developed it? 
How were the trials conducted? Were there trials?
Why weren't we told? 
Why didn't the SBOE know?
Is this scoring method authorized in the Texas Register or any House Bill?
How is the machine programmed? 
Who is programming the machine? 
How does the machine work? 
Did we hire more folks at TEA to manage the computer stuff? 
Or is there a company managing that? 
Does the machine use latent semantic analysis? (A sketch of what that would look like follows these questions.)
Does the machine use keywords? 
Where is the data on content evaluation? 
Where is the data on grammar and mechanics? 
Where is the data on diction, style, and voice? 
Where is the data on organizational structure and genre?
Where is the data on effectiveness? 
Where is the data that says this is a good idea to begin with, other than cost? 
How are scoring inconsistencies flagged so that essays get sent to human raters? 
How is the program/machine "monitored"?
How is the process sustainable? 
How many field tests will be required to sustain the number of essays needed to train the machine for each year, each grade level, and each ECR and SCR? 
How is a field test a valid measure and a valid source of essays and data?
How many essays did the machine use for its training?
How many essays does the research say the machine needs? 
Were there studies about the machine in this context? 
How was the research conducted and by whom? 
What happens if the writer's response is creative and unformulaic?
How have cautions in the research about machine scoring been addressed and overcome? 
What other states and exams are using this scoring method? 
How does our data compare to other states and assessments? 
How do our assessments and scoring compare to others? 
How much did it cost? 
What are the implications for instruction? 
What are the implications for learners and their future outcomes? 
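
About that latent semantic analysis question: since TEA hasn't said what's under the hood, here is a purely hypothetical sketch of what LSA-style scoring looks like. Every essay and exemplar in it is invented; it simply shows the technique being asked about. LSA compresses essays into a handful of statistical dimensions and measures how close a student's essay sits to previously scored exemplars. Notice what it doesn't measure: voice, creativity, or whether the writing is actually well reasoned.

```python
# Hypothetical illustration of latent semantic analysis (LSA), one technique
# some automated scoring engines use. Whether TEA's "machine" uses it is
# exactly the unanswered question above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Made-up reference essays that humans scored high.
high_scoring_examples = [
    "The author uses vivid imagery to develop the central theme of perseverance.",
    "Evidence from the text supports the idea that the character grows over time.",
]
student_essay = ["The character keeps trying, and the pictures the words make show that theme."]

# Build a word-count matrix, then compress it into a few "latent" dimensions.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(high_scoring_examples + student_essay)
lsa = TruncatedSVD(n_components=2).fit_transform(matrix)

# LSA judges the essay by how close it sits to the exemplars in that space,
# not by grammar, voice, creativity, or the quality of the argument.
similarity = cosine_similarity(lsa[-1:], lsa[:-1])
print(similarity.round(2))
```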

And Finally: 

The research actually states that this kind of thing should be used on low-stakes work, and frequently. TEA and the SBOE talk all the time about making the assessment better match how we teach. 

SO:

Why don't teachers have access to the same technology to match assessment to instruction? Or instruction to assessment? But that's another can of worms, isn't it? 



1 comment:

  1. These are great questions you raise, and teachers want answers! How do we get them? The lack of transparency is quite suspect!
