Shona's Shennanigans: February 2024

Friday, February 16, 2024

TEA's Annual Report and The Teacher Vacancy Task Force Report: Important Reading for School Practitioners

I'm looking forward to the technical report released in the fall by the Assessment Division.

Here's an overview of what sticks out to me and where you might read to be informed.

Page 14 shows our results and compares them to New York and funding. I'd be interested to see what Paul Thomas says about that data and what it really means about a reading crisis.

Page 12 talks about a program called ACE that I wanted to learn more about based on the commissioner's remarks during the Feb SBOE meeting. He said the method was a sure thing for turning a school around. But he also noted it couldn't be done in small schools because you can't move staff around. I'd like to know if this work has been done at high schools, because as far as I've found in the research, that's a much more complex beast and not much works with such sure and high results.

Page 8 talks about the money spent on teacher incentive allotments. Problematic in practice, because teachers tell me their schools refuse to give them the ratings on their evaluations above a certain level. They are told it is district policy not to give above a certain mark regardless of the teacher proficiency or documentation provided.

Note also the graphic on page 8. In practicality, teachers don't have time to plan lessons. Not sure what they mean by planning their "master schedule" as that's not how we usually use that term. I can tell you this: providing instructional materials so teachers don't have to plan lessons really isn't what we are looking for here. I have YET to see materials that take the place of a thinking and reasoning being with high quality training and the time to do what needs to be done.

See page 7 to learn more about the increases in those served by SPED. With possible changes to Dyslexia processes, that number is going to be even larger.

Safe schools are addressed on page 6.

Page 5 addresses how the 88th legislature changed school funding. School finance at large is on page 4.

Page three gives you a cool overview of Student Outcomes and TEA goals. It gives a comparison for preCovid numbers too. It's SO important to know the overall goals because that's how they decide what happens with assessment, curriculum, and initiatives that roll out at the Region Service Centers. This stuff is what ends up happening TO us as practitioners.

The Teacher Vacancy Task force report was referenced in the citations. Here is is.

Thursday, February 15, 2024

Show me the Data: No CR Data on December retesters? No ECR reports at all? No pdf's for reteaching? Why?

Update: Just learned about a tool. The Cambium Research Portal: https://txresearchportal.com/selections?tab=state

Here's the report for English I:

Here's the report for English II

The question is WHY weren't these results part of the STATE's Summary reports? WHY did I have to do to a Cambium research portal?

Here's the Results and a ANOTHER Question: Why are we still doubling the points on a score that used to come from two people and now comes from a machine? Still don't have SCR data.

English I:

Score Point 0: 63%

Score Point 1: 5%

Score Point 2: 6%

Score Point 3: 2%

Score Point 4: 3%

Score Point 5: 3%

Score Point 6: 3%

Score Point 7: 2%

Score Point 8: 2%

Score Point 9: 1%

Score Point 10: 1%

English II:

Score Point 0: 79%

Score Point 1: 7%

Score Point 2: 5%

Score Point 3: 2%

Score Point 4: 1%

Score Point 5: 1%

Score Point 6: 1%

Score Point 7: 0%

Score Point 8: 1%

Score Point 9: 1%

Score Point 10: 1%

Original Post:

Here are the EOC state reports for December retesters:

English I

English II

And here are the FAILURE rates: A(Algebra)1: 62% E1: 68% E2: 76%

For the last administration, we saw charts like this on the summary reports:

And now, I'm hearing that schools are not getting the pdf's of retesters' essays? AND, we don't get to see any of the test items? We have to wait until they retest (AGAIN) in the spring to see data? And all of this is happening with the new machine scoring in place.

Do we have to wait until 3/25-28 for the Accountability Reports?

How are we supposed to help the students without seeing how they responded? How are we supposed to understand how they are graded without seeing the connection with the score and the essay?

I understand that releasing the tests is expensive. We've been dealing with lack of info to support retesters for a while now. But not having reports and data about constructed responses is inexcusable at this juncture regardless of the administration type.

Wednesday, February 14, 2024

Semantics: No AI STAAR Scoring, but does it make me feel better that it's still a machine?

Here's an article you should read. https://www.dallasnews.com/news/education/2024/02/14/computers-are-grading-texas-students-staar-essay-answers/

Falsehood or Semantics:

So...ECR's and SCR's aren't graded by AI. Artificial Intelligence. I used the wrong word initially. But they ARE graded by a machine. For most of us (parents and practitioners), neither word makes me feel all warm and fuzzy about its use. And why do we have to learn about this stuff by accident or from the Dallas Morning News?

Some important points:

1. In the February SBOE board meeting, the commissioner was asked if AI was being used to score essays and short answer. He said No. Which is true. But he also said that two humans were grading. Untrue for quite some time now, as the document about scoring came out in December. Unless you count the convoluted way that statement would be true: two raters score essays and then their ratings are programmed into a machine that now scores the essays the way the original humans did. (Which is also problematic because the machine can inherit bias and inaccuracies.)

A Truth: December testers were scored by the "machine." The machine has to score based on training it receives to mimic a large number of essays previously scored by humans.

An Inference: If the machine was trained to score December retesters based on a database of previously scored essays, then the December data had to come from a field test.

2. December testers were scored by the machine and data on previous scoring events. Probably a field test. Field tests aren't experienced by students in the same ways as official testing. Since the writing types were new, we had stand alone field tests. And scoring isn't experienced by raters in the same way in setting or urgency. This creates scoring inconsistencies and variables that don't match the real data and experiences on and after test day. That's called unreliable and invalid.

3. If December testers (who are most at risk because they've not passed in the previous administration) were scored by a machine, there's a few scenarios. All of them are problematic.

Hypothetical Scoring Scenarios:

Ultimately, we are left guessing about what happened and how. Here's some possibilities and the problems they pose.

Hypothetical Scoring Scenario One: Machine scoring was previously used to validate human scoring on a STAAR test or Field Test.

Problem: We know and knew nothing about machine scoring until the December document and the Test Administrator's 2023-2024 training. Since we didn't know, my grandma would call that sneaky and dishonest.

Hypothetical Scoring Scenario Two: Machine scoring was NOT used previously to pilot the validity of human scoring on an operational assessment. That's called unethical because something was used that we didn't have data to prove effectiveness.

Problem: For large scale assessments of high stakes outcomes for the entire state of TEXAS, why not?

Hypothetical Scoring Scenario Three: Machine scoring was tested by the development company on something that wasn't STAAR. That's called unreliable and invalid. Or just unwise at the least.

Problem: STAAR is it's own beast. It's not really like anything else. And, y'all. This is Texas. We do our own thing.

Call for Answers:

What "machine" is being used?

What's the "machine" called?

Who developed it?

How were the trials conducted? Were there trials?

Why weren't we told?

Why didn't the SBOE know?

Is this scoring method authorized in the Texas Register or any House Bill?

How is the machine programmed?

Who is programming the machine?

How does the machine work?

Did we hire more folks at TEA to manage the computer stuff?

Or is there a company managing that?

Does the machine use latent semantic analysis?

Does the machine use keywords?

Where is the data on content evaluation?

Where is the data on grammar and mechanics?

Where is the data on diction, style, and voice?

Where is the data on organizational structure and genre?

Where is the data from effectiveness?

Where is the data that says it's a good idea to begin with other than cost?

How are inconsistencies in scoring triggered to send essays to humans?

How is the program/machine "monitored"?

How is the process sustainable?

How many field tests will be required to sustain the number of essays for training the machine for each year and each grade level and each ECR, SCR?

How is a field test a valid measure and source of essays? Data?

How many essays did the machine use for its training?

How many essays does the research say the machine needs?

Were there studies about the machine in this context?

How was the research conducted and by whom?

What happens if the writer's response is creative and unformulaic?

How have cautions in the research about machine scoring been addressed and overcome?

What other states and exams are using this method of programming?

How does our data compare to other states and assessments?

How do our assessments and scoring compare to others?

How much did it cost?

What are the implications for instruction?
What are the implications for learners and their future outcomes?

And Finally:

The research actually states that this kind of thing should be used on low stakes stuff and frequently. TEA and the SBOE talk all the time about making the assessment more match how we teach.

SO:

Why don't teachers have access to the same technology to match assessment to instruction? Or instruction to assessment? But that's another can of worms, isn't it?

Friday, February 9, 2024

Researching Automated Scoring for Constructed Responses STAAR

Automated scoring. I want to know more about this, so I've been reading and researching what's out there for teachers and practitioners and policy makers. What kind of research are these folks using to make decisions? WHAT IS this stuff and HOW does it work?

I'm researching - won't you join me?

Here's something from ETS about the difference between Human and Machine Scoring. I used a program called Kami to annotate, highlight, and point out features and ideas we should consider. I did find guidance on developing a litmus test for use and some terms and ideas for further investigation. I didn't find much research. I did find some terms to research and some programs/companies to research about their effectiveness - but since they are making money selling this stuff and conducting their own "research", I'm not hopeful in finding valid and reliable data.

Points of Interest in the Document Linked Above:

Be sure to look at Table 2 on page 5.
Consider the considerations in the bullet points on page 7 in considering your own research and to guide what we can say to policy makers and our board members and staff at TEA. Are these conditions being met? How?

NOTE: NONE of this explains the hybrid process TEA describes in documents released in December or in statements made at CREST and TCTELA in January. And the publication is from 2014: ancient. This information is just a place to begin understanding the topics and processes behind automated scoring.

In previous posts, I analyzed a series of citations given by Pearson, trying to dive deeper to find out how Texas is going to use the stuff with our kids on STAAR. (Pearson develops the stuff they use for TELPAS speaking. Cambium develops the stuff for STAAR and TELPAS writing. I'll be diving into that next.)

Scoring Process for Constructed Response: This gives background about scoring and STAAR.

Review of Pearson's Commentary and Citations: Primarily, I was looking for actual research and where to find the support and development of the programs they've developed. I didn't find research. And some of it was flat out alarming.