Wednesday, April 17, 2024

Morath's Presentation on Hybrid Scoring: Corrections to Commentary about How Hybrid Scoring Was Communicated

TEA posted this PowerPoint to answer questions about hybrid scoring; Morath presented it this week. Here's the link: https://tea.texas.gov/student-assessment/testing/hybrid-scoring-key-questions.pdf 

Inaccurate Information about Communication

On slide 8, the presentation explains that "TEA provided information in other stakeholder group presentations, e.g., Texas Science Education Leadership Association (TSELA), Texas Social Studies Supervisors Association (TSSSA), Texas Council of Teachers of English Language Arts (TCTELA), Coalition of Reading and English Supervisors of Texas (CREST)" in November of 2023. 

This is not true. TEA did NOT provide information to TCTELA or CREST in 2023 about hybrid scoring. 

There was NO CREST conference in 2023. 

There was a TCTELA conference in February of 2023, but hybrid scoring was NOT mentioned. Here are my notes, because TEA did not release the slides. I also recorded the session on my iPhone. I reviewed my personal recording today. There was NO mention of hybrid scoring. 

 https://docs.google.com/document/d/1K6SFfwEPeKXO43WHEugVHcQwJSicCVNJqXLlNHRF6AI/edit?usp=sharing 

The FIRST time ELAR folks heard about this was from their testing coordinators and from a PDF posted on the TEA website in December of 2023. To reiterate - the top ELAR organizations in Texas did NOT have the information. And from what I know, the region service centers did NOT cover any of the information about hybrid scoring until after December of 2023. We didn't know. And they didn't tell us in the way described in this presentation. The information on slide 8 is blatantly inaccurate. My momma would call that a lie. 

Other Information and Questions

The slides do give us some good information. But we still aren't getting the answers we need for transparency or instructional intervention or initial instruction. 



Thursday, April 11, 2024

79% vs. 8% Zeros: Let's Panic about the Right Thing: Some Facts to Consider about "AI" Scoring

 

Raise Your Hand Texas posted this graphic April 10th, 2024

There's lots to say initially, and more to say after that. 

First, it's not AI. It's an Automated Scoring Engine. 

It does what it's told and doesn't learn. Not sure that helps anyone feel better. Semantics at this point. To-may-to. To-mah-to. 

Second: 79% of retesters in DECEMBER received zeros on the extended constructed responses. 

These are the folks that haven't passed the English II exam all the other times they've taken it and aren't even in English II classes anymore. Here's a scenario: 

Senior level student. End of first semester of their last year in school. They need the English II credit to graduate. Their school needs them to pass because that gets them points for accountability stuff. They've taken the test this many times: when they were in English II, the summer after the Sophomore year, the December of their Junior year, the spring of their Junior year, the summer after their Junior year, and now the December of their Senior year. They have the next Spring and Summer opportunities as well. Yeah. That's 8 times. And...a lot of these kids have been doing this for both English I and English II. 16 times. And...a lot of these kids have been doing this retesting thing for YEARS. AND, some of these kids were in English I, English II, and English III at the same time because they failed the class and the assessments. And...some of these kids were in an English class and a remedial class. And...when they originally took English II, the Extended Constructed Response, Short Constructed Response, and new item types weren't a thing. So they haven't had initial instruction or experience on much of any of it. Only remediation if all the rules were followed. 

And they made a zero on the ECR in December of 2023 when most folks didn't know that a computer would be scoring the responses. While the lack of information and knowledge about scoring is a problem - a big one - it should be NO surprise that these kids made zeros. It's not the computer's fault. More on that later. 

These kids - the 79% of December retesters who made a zero - aren't failing because a computer scored their writing. Most of them didn't write. Some of them (from what we saw on the 2023 responses) wrote nothing, typed placeholders like a parenthesis, said "IDK", or "Fuck this." They aren't failing because they haven't had instruction. They are failing because they are just DONE with the testing bullshit. They are disenfranchised, tired, and ruined. 

The computer isn't the problem: the culture created by the testing regime is the problem. 

Third: So Where Did RYHT Get This Data, and Why Did They Post It Now?

Author's Purpose and Genre, y'all. It's rhetoric; it's propaganda. 

Raise Your Hand Texas wants you to VOTE. The post was incendiary and meant to rally the masses and voting base. It's a classic technique for campaigns and lobbying. We SHOULD vote. We SHOULD be angry. But the data posted is a bit deceptive in directing our actions for the kids in front of us. 

Part of their information comes from the TEA accountability report given to the SBOE. Here it is.  And part of it comes from a research portal that I've mentioned in previous blogs. 

Notice also that they are only showing you English II. There are other grades and results out there. 

Here's a chart of what you'd find in the TEA Research Portal. I took the data and made a chart. ASE means Automated Scoring Engine. 

Source: TEA Research Portal

Action: Note the differences in the population, the prompt, and the rubric. Yes. There's a problem with the number of kids scoring a zero. But it's probably not a computer problem. Let's dive deeper. 

Fourth: There's More from Texas School Alliance and Others

The Texas School Alliance (and others) have been talking to TEA. You can become a member here. Various news organizations are also reporting and talking to TEA, asking hard questions and receiving answers. And school officials have been able to view student responses from the December 2023 retest. 

In researching the reports and ideas, here are some key things to know and understand before acting. 

Difference in the Type of Writing

The big change that caused the zeros probably wasn't the computer scoring thingy. Kids are answering different kinds of prompts than they had been answering before. Before, kids wrote in persuasive or explanatory modes about a topic they knew something about. They used their own schema and probably didn't care much about the topic. Now, kids have to write in argumentative or explanatory modes in a classic essay format or in correspondence (which we haven't seen yet). They have to write from their understanding of a text and not from their schema. 

Action: We need to make sure kids know how to do the right kind of writing. We need to make sure kids can comprehend the texts. 

The Rubric is Different

The rubric used in 2023 and 2024 retesting is DIFFERENT because the type of writing is different. From discussions with others, people are pretty confused about how that new rubric works and what it means. If teachers are confused, KIDS are confused. There has been little to no TEA training on the rubric other than some samples. 

Action: Our actions should be to really study our kids' responses and the scoring alongside the scoring guides. 

The Test ISN'T Harder Overall: SO WHAT? 

You can argue with me on this if you want to. But there's a detailed process all tests go through before and after administration. It's called equating. It's where we get the scale scores, why the cut points for raw scores are different each year, and why we don't know what passing is until after the tests have been scored. 
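If you want to see the basic idea, here's a tiny sketch. To be clear: this is NOT TEA's actual procedure (STAAR equating runs on IRT models and a whole technical process we never see); it's a toy mean-sigma linear equating example with made-up raw scores, just to show why the raw cut point can move from year to year even when the standard doesn't.

```python
# Toy illustration of equating -- NOT TEA's actual method (STAAR uses
# IRT-based procedures); a mean-sigma linear equating sketch with made-up
# raw scores to show why raw cut points shift between forms.
import statistics

def linear_equate(raw_scores_new, raw_scores_base, raw):
    """Map a raw score on this year's form onto last year's scale."""
    m_new, s_new = statistics.mean(raw_scores_new), statistics.stdev(raw_scores_new)
    m_base, s_base = statistics.mean(raw_scores_base), statistics.stdev(raw_scores_base)
    return m_base + (raw - m_new) * (s_base / s_new)

base_year = [22, 30, 35, 41, 48, 52, 57, 63]   # hypothetical raw scores, last year's form
new_year  = [20, 27, 33, 39, 45, 50, 55, 60]   # hypothetical raw scores, this year's (harder) form

# A raw 45 on this year's form lands where roughly a 47 landed last year,
# so the passing raw score moves even though the standard stays put.
print(round(linear_equate(new_year, base_year, 45), 1))
```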

The ECR is probably harder. So are some of the SCRs. Some of the item types and questions are harder. Some questions have always been harder than others. That's not new. TEA is right that the test isn't harder overall, it's just that there are other things that matter to living, breathing, and feeling humans.

Even if the overall test is not harder, the student experience is not the same. When stuff is new to kids, it's scary. When stuff is hard and scary, and when stuff is high stakes, kids have anxiety. This impacts test scores in ways that are unrelated to psychometrics. Just because the test isn't harder psychometrically doesn't mean the experience isn't psychologically challenging in ways that affect instruction. 

Action: We need to do more work preparing students for the testing experience - the technical, online, and test-day details as well as the social-emotional realities. Especially for kids who have previously experienced failure. 

People Who Viewed the Zero Responses Agreed with the Computer Ratings

TEA reported before (December 2023) that the ASE is trained and MUST agree with human raters. It's part of their process. 
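When TEA says the engine "must agree" with human raters, that agreement is usually quantified with statistics like exact agreement, adjacent agreement, and quadratic weighted kappa. Here's a small sketch with made-up ratings - TEA hasn't published its thresholds or data, so don't read these numbers as theirs.

```python
# Standard rater-agreement statistics: exact, adjacent, and quadratic
# weighted kappa. The ratings below are made up for illustration only.
from sklearn.metrics import cohen_kappa_score

human  = [0, 0, 2, 3, 1, 4, 0, 2, 5, 3]   # hypothetical human ratings (0-5)
engine = [0, 1, 2, 3, 1, 3, 0, 2, 5, 4]   # hypothetical engine ratings

exact    = sum(h == e for h, e in zip(human, engine)) / len(human)
adjacent = sum(abs(h - e) <= 1 for h, e in zip(human, engine)) / len(human)
qwk      = cohen_kappa_score(human, engine, weights="quadratic")

print(f"exact: {exact:.0%}, within one point: {adjacent:.0%}, QWK: {qwk:.2f}")
```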

And the folks I've talked to who viewed their campus's zero responses at TEA agree that the papers deserved a zero. Most of them submitted no rescoring requests, or only one or two. 

This means that our colleagues agree that the ASE isn't the problem. 

The Hybrid Scoring Study Conducted by TEA

Here's the link.  And here's a screenshot of the Executive Summary on page 5. 

Huh? Here's what I think it means: 

Paragraph One: They studied the Spring 2023 results. The ASE did a good job. They are going to use the results of the study to evaluate future tests - including the one that was given in December of 2023. 

What we don't know: Who did the study? What was the research design? What is "sufficient performance criteria"? Where is the data and criteria for the "field-test programmed" and "operationally held out validation samples"? What do those terms mean and how do they function in the programming and study? 

Paragraph Two: They studied the first 50% of the extended constructed responses scored by the automated scoring engine (ASE). The ASE did a good job for the subpops used for accountability purposes.

What we don't know: What is "sufficient performance criteria"? How did the subpops compare and what was used as criteria to know that the comparisons were fair? What are the models? 

Paragraph Three: The way the test is made now (with different texts and prompts each time), the computer scoring engines will have to be reprogrammed each test administration AND while scoring is going on. They are going to reprogram on all score points as the data arises during scoring. As changes are made to the models, everything that was scored with the old models will be scored again. (That's good research practice with both a priori and constant comparative analysis. BUT - we don't really know what research protocols they are following, so I'm just guessing here.) As more essays are scored, they'll discover new kinds of responses that confuse the models and need rerouting. Responses that were rerouted before the new codes existed will be rescored with the new models. 

What we don't know: We still don't know how the ASE is programmed and monitored. Like, at all. 
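Here's my guess at the general shape of that refit-and-rescore loop, sketched in code. Everything about it is hypothetical - we don't know TEA's or Cambium's actual protocol - but it shows the idea of refitting the engine as human ratings accumulate and rescoring anything that was scored under an older model.

```python
# Hypothetical sketch of a "reprogram while scoring, then rescore" loop.
# Not TEA's or Cambium's actual process -- just the general shape.

def score_with_refits(responses, fit_engine, refit_every=500):
    """responses: list of dicts with 'text' and, for some, a 'human_score'.
    fit_engine(rated) must return a callable that maps text -> score."""
    engine, version, rated = None, 0, []
    for resp in responses:
        if resp.get("human_score") is not None:
            rated.append(resp)
            if len(rated) % refit_every == 0:
                engine = fit_engine(rated)        # refit on all human ratings so far
                version += 1
                # rescore anything that was scored under an older model version
                for earlier in responses:
                    if "engine_score" in earlier and earlier.get("model_version", -1) < version:
                        earlier["engine_score"] = engine(earlier["text"])
                        earlier["model_version"] = version
        if engine is not None and "engine_score" not in resp:
            resp["engine_score"] = engine(resp["text"])
            resp["model_version"] = version
    # final pass: make sure everything carries a score from the latest model
    if engine is not None:
        for resp in responses:
            if "engine_score" not in resp or resp.get("model_version", -1) < version:
                resp["engine_score"] = engine(resp["text"])
                resp["model_version"] = version
    return responses
```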

Paragraph Four: Condition codes change based on paragraph three. They'll keep working on changing them and refining that process during current and future administrations for all grades. They will have to, because the grades, texts, prompts, and human ratings all change each administration. The data also changes as more responses are scored, so the codes have to change as the student responses are scored. 

Action: All of us need to read the whole study. And we need to ask more questions. For now, I think there's not a real ASE scoring problem with the zero responses. The 79% that got zeros probably earned them, and we need to look elsewhere for the cause.





Wednesday, April 10, 2024

13 Trips and A Request for Feedback

Note: I wrote this last week during an Abydos session on Building Community. We used a grouping strategy called "Pointing," originally developed by Peter Elbow. But we ran out of time. I'd like to try it out with my friends online so I can use the data to show how we extend the grouping activities for revision and differentiated instruction.

Would you help? Can you read my writing and then "point out" the words and phrases that "penetrate your skull" or seem to be the center of "grabbity" that catch your attention? If you have time, give me an idea of why the words or phrases stuck out to you. 

There's 13 trips for a sprinkler  repair. Unlucky? Always. Why? 

You never have all the right things. It's the wrong size. Wrong gauge. Wrong thread. Wrong length. Just wrong-wrong-wrong. And that was wrong too. 

1. You bought the wrong thing the first time because you were hapless enough to go to the store to buy what you thought you needed before you started digging.

2. You returned to the store to buy the new thing, but when you went back to install it, you realize that thing doesn't fit any better than the first thing. You need that other thing that makes the second thing fit. Misfortune again. 

3. When you put the adapter on to make the part fit, a foreboding crack undermines your efforts. So now you need more parts to fix that thing too. 

4. When fixing the new problem, you use an old tool at a cursed angle and the plastic for the new part snaps. You go buy the new part only to realize upon return home that...

5. Jinxed, you need a different tool so you can remove that thing that broke. So you go back to the store to buy the tool to remove the part that broke so you can use the part you bought the last time. 

6. Now that things are in place, you dry fit the parts only to find that the ill-fated glue you bought the last time - only last week - has already turned solid. 

7. And the lid on the other thing that makes the first thing work won't come off. So you need another tool or chemical to open the second one and buy a new one so you just buy both of them. But then there are three choices now for the same product that used to work just fine and the wretched original is nowhere to be found. 

8. When you are digging past the mud and clay, you realize that the blighted person before you must have taped the thing together with Gorilla Glue and an old garden hose and you'll have to fix that part too. 

9. Then you go to put the new part on and realize that some star-crossed do-it-yourselfer laid the new pipe on the old pipe and there isn't room for the replacement connector. So you go back for a flexible bendy part to add to the other parts you just bought. 

10. So you dig a bigger trench because you think you might lift up the pipe to make some room with the bendy part you just bought. Avalanches of dirt cover your previous attempts. So you carefully leverage what you have in the dark.  Blindly, you lift until you hear a crack and notice that on the other side of your repair there is a T where the damaged pipe intersects with two other directions. 

11. So you dig a bigger trench to uncover the new problems. Traipse back to the store to get more things. Things seem to be going fine when you dry fit the parts and use the new glue stuff. Only to realize that you weren't quite ready for the glue. The pipe is still too long, but now the connections are too short and you'll have to catastrophically cut off the T and replace the whole thing because the pipe is too close to accept a new adapter. 

12. Back to the store. For all the new things and other adapter things. You dry fit and double check everything only to realize - calamity - when you reach for the glue that you didn't close the lids properly and the gunk is now a dried goo all over your new tools.

13. Since you've already spent funds and wait time equivalent to calling a pro, you throw your muddy gloves down into the hole, wipe your bloody hands on your soaked pants, and ask Siri to call the sprinkler repair guy you should have called before your first, ill-fated trip. 


MH's Pointing Feedback: I love the phrase "avalanches of dirt."  It creates a vivid image in my mind.

I like the repeated use of the word "thing."  In real life, I get frustrated with too many uses of "thing," but it fits perfectly here because I'm guessing you don't really know what all the "things" are really called!  It shows from the beginning that you probably shouldn't have been the one making this repair.
The phrase "foreboding crack" tells me "uh-oh." 

Wednesday, March 6, 2024

TEA Communication Regarding December ECR Scores

TEA Communication

 TEA Communication to Testing Coordinators

Not all of us receive notifications that the testing coordinators receive. And often - the information just doesn't "trickle down." 

Background

So here's what happened. 

1. Many people shared concerns about the scores on the December retests, since the scoring method was different (not AI, but a machine that doesn't learn) and there were a lot of zeros. 

2. We don't see responses on December tests because the tests aren't released. It's expensive to release an exam and they might need to use the passages and questions again. 

3. Responses from the December test were scored by two Automated Scoring Engines. 25% of them were also routed to and scored by a human for "various auditing purposes."  (That sounds like the human scores were used for making sure the ratings were correct and not used for reporting.) 

4. If the two engines' scores weren't the same or adjacent (same: 0-0, 1-1, 2-2, 3-3, 4-4, 5-5; adjacent: 0-1, 1-2, 2-3, 3-4, 4-5), then the essays were routed to humans for rescoring. (There's a sketch of this routing rule after this list.) 

5. The ASE also sent some responses to the humans flagged with condition codes. These are the codes: (I'm assuming there are other codes not associated with zeros, but that information is unclear. It is also unclear how the machine is programmed other than with the rubric.)

    a. response uses just a few words

    b. response uses mostly duplicated text

    c. response is written in another language

    d. response consists primarily of stimulus material

    e. response uses vocabulary that does not overlap with the vocabulary in the subset of responses used to program the ASE

    f. response uses language patterns that are reflective of off-topic or off-task responses

 Additional language describes "unusual" responses that could trigger a condition code for review. 

6. Responses routed to a human for rescoring retain the human's rating. The language is unclear about whether the human scores for the 25% audit sample are also kept, because that was addressed in a paragraph that did not address condition codes. 
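Here's a sketch of the routing rule from item 4, plus rough stand-ins for a couple of the condition codes in item 5. The adjacency rule comes straight from TEA's letter; the condition-code checks are my own hypothetical approximations, because we don't know how the engine actually flags them.

```python
# Routing rule per item 4; condition-code checks are hypothetical stand-ins.

def route_response(score_a, score_b):
    """Decide routing from two engine scores (0-5 each)."""
    if abs(score_a - score_b) <= 1:   # same (0-0 ... 5-5) or adjacent (0-1 ... 4-5)
        return "accept engine scores"
    return "route to human for rescoring"

def condition_code(response, stimulus):
    """Very rough stand-ins for a few of the condition codes in item 5."""
    words = response.split()
    if len(words) < 10:
        return "uses just a few words"
    if len(set(w.lower() for w in words)) / len(words) < 0.3:
        return "mostly duplicated text"
    stimulus_words = set(stimulus.lower().split())
    if sum(w.lower() in stimulus_words for w in words) / len(words) > 0.9:
        return "consists primarily of stimulus material"
    return None

print(route_response(2, 3))   # adjacent -> accepted
print(route_response(1, 4))   # discrepant -> human rescoring
```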

Implications for Instruction: Revising

Instruction can address each element of the rubric as well as the information we see in TEA's letter to testing coordinators and bulleted above. 

We can have students ask themselves some questions and revise for these elements: 

  1. Does your response use just a few words? (Pinpoint these areas with the highlighting tool: Where did you: 1) address the prompt 2) provide evidence?) How could you expand your words into sentences? How could you use a strategy such as looping to add additional sentences? 
  2. Does your response use mostly duplicated text? This means repetition. Go back into your writing. Use the strikethrough tool as you read through your work. Strike through any repeated words and phrases. Then go through each sentence. Does each sentence say something new? Use the strikethrough tool to remove those elements. Did you use an organizational framework for your writing? QA12345 helps you write new things for each segment. 
  3. Does your response use another language? Highlight the words in your language. Go back and translate your writing into English, doing the best you can to recreate your thoughts. Leave the text in your language as you work. Then go back and delete things that are not written in English.
  4. Does your response mainly use words from the text or passage? Good! This means you are using text evidence. First, make sure the evidence helps answer the ideas the prompt wants you to write about. Second, add a sentence that explains how that text evidence connects to the ideas in the prompt.

Implications for Instruction: Comprehension and Preparation

The last bullet indicates that the program is prepared with sample answers. This means that instructional materials and our questions must also consider what answers we are expecting and how we go about putting that down in words. It also means that we should be prepared for how students might misinterpret the text, a section, or the prompt. 

  1. Teach students how to diffuse the prompt for words of distinction that change the conditions of the prompt focus (today vs past). This also includes how we decide what genre to use for the response: informational, argumentative, correspondence (informational or argumentative). 
  2. Discuss the differences between prompts that ask for responses about the whole text vs. prompts about sections of the text (an aquifer vs. a salamander). 
  3. In developing prompts, be sure to compose prompts that can have multiple answers and have multiple pieces of text evidence to support. Be sure to compose prompts that can have WRONG interpretations and associated evidence. 
  4. Develop sets of exemplars that writers can use to match the evidence to the thesis to the prompt and finally to the passage. They need to SEE all the ways in which their thinking can go astray. 
  5. Teach them about vocabulary and bots that scan for key words and synonyms. We may not like this, but how could that be a bad thing? Students think about the key ideas in the prompt and passage and make sure the vocabulary in their responses matches. (See the sketch after this list.) 
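To make that last point concrete for students, here's a rough classroom-side sketch of checking how much a response's vocabulary overlaps with the key terms of the prompt and passage. This is NOT how the ASE works - we don't know that - it just turns the idea of "vocabulary overlap" into something kids can see.

```python
# Classroom-side sketch of vocabulary overlap -- not the ASE's actual logic.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "how", "from"}

def key_terms(text):
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 3}

def overlap_report(response, prompt_and_passage):
    expected = key_terms(prompt_and_passage)
    used = key_terms(response)
    shared = expected & used
    return {"shared_terms": sorted(shared),
            "coverage": round(len(shared) / max(len(expected), 1), 2)}

prompt = "Explain how the aquifer supports the salamander population, using evidence from the text."
response = "The aquifer gives the salamander clean water, so when the aquifer drops, the population shrinks."
print(overlap_report(response, prompt))
```

Students can do the same thing by hand with two highlighter colors: one for the prompt's key words, one for the words they actually used.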

Implications for Instruction: Creativity

Most of the research about machine scoring says we really aren't ready for machines to score diverse answers. But...the language in the letter here suggests that "unusual" stuff might receive a condition code. And we don't have a description of what an unusual condition code for creative responses might be other than what was described in the previous bullet points. IDK. 

What to Ask Your Testing Coordinator

Because of the concerns we all raised, TEA isn't going to let us see the ECRs, but they are going to let us have some data. (I'd like to see the statewide data, wouldn't you?) Ask your coordinator about the report they can ask TEA to provide. It will include information about scores on ECRs from December: 
  • How many students turned in a blank response? 
  • How many responses were unscorable with a condition code? 
  • How many responses received a zero? 
Coordinators can then ask for an appointment to VIEW responses with text that received a zero. They won't be able to ask questions about scoring, the rubric, or for responses to be rescored. But they'll be able to see the responses that have a zero. Not sure that will help much - and understand that seeing more would compromise test security and cost a lot of money because the passage and question could not be reused. 

Friday, February 16, 2024

TEA's Annual Report and The Teacher Vacancy Task Force Report: Important Reading for School Practitioners

 Here's TEA's Annual Report. 

I'm looking forward to the technical report released in the fall by the Assessment Division. 

Here's an overview of what sticks out to me and where you might read to be informed. 

Page 14 shows our results and compares them to New York and funding. I'd be interested to see what Paul Thomas says about that data and what it really means about a reading crisis. 

Page 12 talks about a program called ACE that I wanted to learn more about based on the commissioner's remarks during the Feb SBOE meeting. He said the method was a sure thing for turning a school around. But he also noted it couldn't be done in small schools because you can't move staff around. I'd like to know if this work has been done at high schools, because as far as I've found in the research, that's a much more complex beast and not much works with such sure and high results. 

Page 8 talks about the money spent on teacher incentive allotments. Problematic in practice, because teachers tell me their schools refuse to give them the ratings on their evaluations above a certain level. They are told it is district policy not to give above a certain mark regardless of the teacher proficiency or documentation provided. 

Note also the graphic on page 8. In practice, teachers don't have time to plan lessons. Not sure what they mean by planning their "master schedule" as that's not how we usually use that term. I can tell you this: providing instructional materials so teachers don't have to plan lessons really isn't what we are looking for here. I have YET to see materials that take the place of a thinking and reasoning being with high-quality training and the time to do what needs to be done. 


See page 7 to learn more about the increases in those served by SPED. With possible changes to Dyslexia processes, that number is going to be even larger. 

Safe schools are addressed on page 6. 

Page 5 addresses how the 88th legislature changed school funding. School finance at large is on page 4. 

Page three gives you a cool overview of Student Outcomes and TEA goals. It gives a comparison for preCovid numbers too. It's SO important to know the overall goals because that's how they decide what happens with assessment, curriculum, and initiatives that roll out at the Region Service Centers. This stuff is what ends up happening TO us as practitioners. 

The Teacher Vacancy Task Force report was referenced in the citations. Here it is.



Thursday, February 15, 2024

Show me the Data: No CR Data on December retesters? No ECR reports at all? No PDFs for reteaching? Why?

Update: Just learned about a tool. The Cambium Research Portal: https://txresearchportal.com/selections?tab=state 

Here's the report for English I: 

Here's the report for English II

The question is WHY weren't these results part of the STATE's summary reports? WHY did I have to go to a Cambium research portal? 

Here are the Results and ANOTHER Question: Why are we still doubling the points on a score that used to come from two people and now comes from a machine? Still don't have SCR data. 

Score Point    English I    English II
0              63%          79%
1              5%           7%
2              6%           5%
3              2%           2%
4              3%           1%
5              3%           1%
6              3%           1%
7              2%           0%
8              2%           1%
9              1%           1%
10             1%           1%

Original Post: 

Here are the EOC state reports for December retesters: 

English I

English II

And here are the FAILURE rates: Algebra I: 62%, English I: 68%, English II: 76% 

For the last administration, we saw charts like this on the summary reports: 


And now, I'm hearing that schools are not getting the PDFs of retesters' essays? AND, we don't get to see any of the test items? We have to wait until they retest (AGAIN) in the spring to see data? And all of this is happening with the new machine scoring in place. 

Do we have to wait until 3/25-28 for the Accountability Reports? 

How are we supposed to help the students without seeing how they responded? How are we supposed to understand how they are graded without seeing the connection with the score and the essay? 

I understand that releasing the tests is expensive. We've been dealing with lack of info to support retesters for a while now. But not having reports and data about constructed responses is inexcusable at this juncture regardless of the administration type. 




Wednesday, February 14, 2024

Semantics: No AI STAAR Scoring, but does it make me feel better that it's still a machine?

 Here's an article you should read. https://www.dallasnews.com/news/education/2024/02/14/computers-are-grading-texas-students-staar-essay-answers/ 

Falsehood or Semantics:

So...ECRs and SCRs aren't graded by AI. Artificial Intelligence. I used the wrong word initially. But they ARE graded by a machine. For most of us (parents and practitioners), neither word makes us feel all warm and fuzzy. And why do we have to learn about this stuff by accident or from the Dallas Morning News? 

Some important points: 

1. In the February SBOE board meeting, the commissioner was asked if AI was being used to score essays and short answer. He said No. Which is true. But he also said that two humans were grading. Untrue for quite some time now, as the document about scoring came out in December. Unless you count the convoluted way that statement would be true: two raters score essays and then their ratings are programmed into a machine that now scores the essays the way the original humans did. (Which is also problematic because the machine can inherit bias and inaccuracies.) 

A Truth: December testers were scored by the "machine." The machine has to score based on training it receives to mimic a large number of essays previously scored by humans.

An Inference: If the machine was trained to score December retesters based on a database of previously scored essays, then the December data had to come from a field test. 

2. December testers were scored by the machine using data from previous scoring events. Probably a field test. Field tests aren't experienced by students in the same ways as official testing. Since the writing types were new, we had stand-alone field tests. And scoring isn't experienced by raters in the same way in setting or urgency. This creates scoring inconsistencies and variables that don't match the real data and experiences on and after test day. That's called unreliable and invalid. 

3. If December testers (who are most at risk because they've not passed in a previous administration) were scored by a machine, there are a few scenarios. All of them are problematic. 

Hypothetical Scoring Scenarios: 

Ultimately, we are left guessing about what happened and how. Here's some possibilities and the problems they pose. 

Hypothetical Scoring Scenario One: Machine scoring was previously used to validate human scoring on a STAAR test or Field Test.  

Problem: We know and knew nothing about machine scoring until the December document and the Test Administrator's 2023-2024 training. Since we didn't know, my grandma would call that sneaky and dishonest.

Hypothetical Scoring Scenario Two: Machine scoring was NOT used previously to pilot the validity of human scoring on an operational assessment. That's called unethical, because something was used without data to prove its effectiveness.

Problem: For large scale assessments of high stakes outcomes for the entire state of TEXAS, why not? 

Hypothetical Scoring Scenario Three: Machine scoring was tested by the development company on something that wasn't STAAR. That's called unreliable and invalid. Or just unwise at the least.

Problem: STAAR is its own beast. It's not really like anything else. And, y'all. This is Texas. We do our own thing. 

Call for Answers: 

What "machine" is being used? 
What's the "machine" called? 
Who developed it? 
How were the trials conducted? Were there trials?
Why weren't we told? 
Why didn't the SBOE know?
Is this scoring method authorized in the Texas Register or any House Bill?
How is the machine programmed? 
Who is programming the machine? 
How does the machine work? 
Did we hire more folks at TEA to manage the computer stuff? 
Or is there a company managing that? 
Does the machine use latent semantic analysis? 
Does the machine use keywords? 
Where is the data on content evaluation? 
Where is the data on grammar and mechanics? 
Where is the data on diction, style, and voice? 
Where is the data on organizational structure and genre?
Where is the data on effectiveness? 
Where is the data that says it's a good idea to begin with other than cost? 
How are inconsistencies in scoring triggered to send essays to humans? 
How is the program/machine "monitored"?
How is the process sustainable? 
How many field tests will be required to sustain the number of essays for training the machine for each year and each grade level and each ECR, SCR? 
How is a field test a valid measure and source of essays? Data?
How many essays did the machine use for its training?
How many essays does the research say the machine needs? 
Were there studies about the machine in this context? 
How was the research conducted and by whom? 
What happens if the writer's response is creative and unformulaic?
How have cautions in the research about machine scoring been addressed and overcome? 
What other states and exams are using this method of programming? 
How does our data compare to other states and assessments? 
How do our assessments and scoring compare to others? 
How much did it cost? 
What are the implications for instruction? 
What are the implications for learners and their future outcomes? 

And Finally: 

The research actually states that this kind of thing should be used on low-stakes stuff, and frequently. TEA and the SBOE talk all the time about making the assessment better match how we teach. 

SO:

Why don't teachers have access to the same technology to match assessment to instruction? Or instruction to assessment? But that's another can of worms, isn't it? 



Friday, February 9, 2024

Researching Automated Scoring for Constructed Responses STAAR

Automated scoring. I want to know more about this, so I've been reading and researching what's out there for teachers and practitioners and policy makers. What kind of research are these folks using to make decisions? WHAT IS this stuff and HOW does it work? 

I'm researching - won't you join me? 

Here's something from ETS about the difference between Human and Machine Scoring. I used a program called Kami to annotate, highlight, and point out features and ideas we should consider. I did find guidance on developing a litmus test for use and some terms and ideas for further investigation. I didn't find much research. I did find some terms to research and some programs/companies to investigate for their effectiveness - but since they are making money selling this stuff and conducting their own "research", I'm not hopeful about finding valid and reliable data. 

Points of Interest in the Document Linked Above: 

  • Be sure to look at Table 2 on page 5. 
  • Review the considerations in the bullet points on page 7 as you plan your own research and to guide what we can say to policy makers, our board members, and staff at TEA. Are these conditions being met? How? 
NOTE: NONE of this explains the hybrid process TEA describes in documents released in December or in statements made at CREST and TCTELA in January. And the publication is from 2014: ancient. This information is just a place to begin understanding the topics and processes behind automated scoring. 

In previous posts, I analyzed a series of citations given by Pearson, trying to dive deeper to find out how Texas is going to use the stuff with our kids on STAAR. (Pearson develops the stuff they use for TELPAS speaking. Cambium develops the stuff for STAAR and TELPAS writing. I'll be diving into that next.) 

Scoring Process for Constructed Response: This gives background about scoring and STAAR. 

Review of Pearson's Commentary and Citations: Primarily, I was looking for actual research and where to find the support and development of the programs they've developed. I didn't find research. And some of it was flat out alarming. 

Part One

Part Two

Part Three

Part Four

Part Five

Part Six


Tuesday, January 30, 2024

TEA Instructional Materials Review, The IMRA Process: Rubrics

I'm watching the SBOE meeting today (Jan 30, 2024). At TCTELA and CREST, we asked for the rubrics, how they were developed, and the research behind them. The curriculum division didn't have the knowledge. So, I started looking. 

HB 1605 Requires a New Instructional Materials Review Process



Right now in the board meeting, they are talking about the IMRA process that will replace the previous Texas Resource Review Process. The focus is to approve the rubrics that have been developed. The development process and public review has already happened. 

In the IMQE/Texas Resource Review, we already evaluated the stuff in the light blue. HB 1605 says that we have to have guidance about the dark blue stuff as well. 

And don't get me going on the three-cueing ban. How can you ban phonics? V is phonics in MSV. No one ever meant that kids would look at the pictures and guess what the words were. And that only works in a very limited range of texts anyway. That practice is one of those things that gets bastardized in application. It probably happened. But it was never the research. And it's a silly thing to ban a practice that says we need to focus on meaning, the way our language functions in terms of grammar, and the visual/phonic elements of the language. 


Where are we on the timeline? 

The Call for Reviewers has already gone out. A publisher interest form has been sent. As I searched the TEA website and elsewhere, I couldn't find the 9-12 ELAR rubrics. Does anyone know where these are? 

The IMRA Process

K-3 IMRA ELA Rubric

K-3 IMRA ELA Rubric Updated in December

K-3 IMRA SLA Rubric Updated in December

4-8 IMRA SLA Rubric

4-8 IMRA ELA Rubric

4-8 IMRA ELA Rubric Updated in December

4-8 IMRA SLA Rubric Updated in December 

4-8 IMRA SLAR Rubric Updated in January

4-8 IMRA ELAR Rubric Updated in January

K-3 IMRA SLA Rubric

K-12 IMRA Math Rubric

K-12 IMRA Math Rubric Updated in December

K-12 IMRA Math Rubric Updated in January

What's Next:

Well, I have questions. What's the theoretical framework of how decisions were made for this? Where's the research behind the rubrics? TEA staff said they used such things. What were these considerations? 

And, the biggie: How does one decide the line between what is suitable and appropriate, obscene and harmful? I'm not really seeing any of that in the language of the rubrics. Maybe that's why there's not one for 9-12? Can't imagine everyone would agree on how Shakespeare measures up to suitable and appropriate, obscene and harmful. I'd like to see how they are defining those concepts. There's no evidence on the 1605 website that 9-12 is a part of the process. 

PS - Suitability is a separate rubric. They aren't scoring for this, but only adding a flag for things they find unsuitable. There was supposed to be a draft of this from the December meeting, but I haven't found it yet. 

PS - Also would like to see the research referenced that guides this rubric component.

There's a sustainability rubric too...but that's for another time. 



Scoring Process for STAAR Constructed Responses

TEA Statements about Essay Scoring

A document was released in December of 2023 about how constructed responses were scored. We already knew that retesters would have their work scored in a hybrid manner. In other words, a machine (ASE - automated scoring engine) would score all of the papers and 25 percent of them would be scored by a human. It caused quite a stir. TEA staff doesn't want to call it AI. Semantics? 

To help with understanding what we know and what we don't know, I've annotated the December Document, Scoring Process for STAAR Constructed Responses. 

Background and Resources for Consideration

Automated Essay Scoring

Since we don't really know ANYTHING about the ASE other than what is in this document, a friend and I started looking for background. 

ETS has an automated scoring engine called e-rater. They talk about it here.

This is a literature review about automated scoring engines. It will be important to read and consider these ideas as background while we wait for guidance on what the ASE actually is and who developed it. And here's the article from the DOI. The article uses the term AES (automated essay scoring), which is close to the ASE term TEA uses in their document. 

A literature review IS a respected form of research. And this article does describe the research method, the research questions that focused the study, and how the information was searched and included. 

The authors are: 

  • Dadi Ramesh: School of Computer Science and Artificial Intelligence, SR University, Warangal, TS, India; Research Scholar, JNTU, Hyderabad, India
  • Suresh Kumar Sanampudi: Department of Information Technology, JNTUH College of Engineering, Nachupally, Kondagattu, Jagtial, TS, India


The aims and scope of the journal are listed here. They focus on Artificial Intelligence. 

Latent Semantic Analysis

Latent Semantic Analysis is a mathematical tool used in essay scoring. This study was conducted by folks from Parallel Consulting and members of the Army Research Institute for the Behavioral and Social Sciences. They wanted to see if essay scoring would work for a test they use called the Consequences Test that measures creativity and divergent thinking. Their research process is clearly outlined and gives us some ideas about how LSA worked for the Army in scoring the constructed responses on the exam. It was published in a journal called Educational and Psychological Measurement. 
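To see what LSA-style analysis actually does, here's a minimal sketch using generic tools (scikit-learn's TF-IDF plus truncated SVD). This is textbook LSA, not TEA's, Pearson's, or Cambium's engine - just the underlying idea: project texts into a low-dimensional semantic space and compare them there.

```python
# Minimal, generic LSA sketch -- not any vendor's actual scoring engine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

essays = [
    "The aquifer protects the salamander because it supplies clean spring water.",
    "Clean water from the aquifer keeps the salamander population alive.",
    "My favorite football team won the championship game last night.",
]

tfidf = TfidfVectorizer().fit_transform(essays)          # weighted word counts
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# The first two essays should come out far more similar to each other
# than either is to the off-topic third one.
print(cosine_similarity(lsa).round(2))
```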


Where to Begin Research: 

As we know, Pearson and Cambium are running STAAR. So that's why I used the Pearson website on Automated Scoring to start looking for the research in the previous blogs and for the CREST presentation. Similar information can be found here for Cambium. As practitioners, until we have more answers from TEA about who designed the products, we only have these places to start looking at what we're dealing with and the research behind it. For now, we have a place to begin informing ourselves. 

PS: Here is a document about human vs machine scoring. 
And a copy of the TEA Constructed Response Workbook that was shared at the assessment conference. 








 


Part 6: Science? Research? An AI Example

At CREST, we looked at evaluating research and developed suggestions for laypeople to evaluate citations provided to us as support for topics that impact our instruction. Below are our notes on evaluating the last two citations on Pearson's website about Automated Scoring.  

  • Citation: Landauer, T. K., Laham, R. D. & Foltz, P. W. (2003b). Automated Essay Assessment. Assessment in Education, 10,3, pp. 295-308.
  • Authors: Landauer and Foltz direct Research at Pearson’s Knowledge Division
  • Source: Journal: Assessment in Education; The Journal is associated with the International Association for Educational Assessment. We can read about their aims and scope here. 
  • Question: While the citation on Pearson’s website lists three authors, the actual journal only lists Landauer. Why?
  • Type of Research: Survey, description; 
  • Validity: Reports on scientific research, but is not scientific research in itself. This article would be a good place to begin: read the studies it references.
  • Something we might consider are works written about the Intelligent Essay Assessor like the one in chapter 5 of Machine Scoring of Student Essays: Truth and Consequences. This chapter describes the user experience with the tool and the outcomes.

  • Citation: Landauer, T. K., Laham, D. & Foltz, P. W. (2001). Automated essay scoring. IEEE Intelligent Systems. September/October.
  • Authors: We are already familiar with Landauer and Foltz. Laham is a PhD who has written and published 33 articles according to research gate. The publication itself gives additional information about the authors. Laham also works for Knowledge Analysis Technologies - which Pearson has purchased. 
  • Source: IEEE Intelligent Systems: This is a journal published by the IEEE Computer Society and is sponsored by the Association for the Advancement of Artificial Intelligence (Wikipedia).
  • Link to Article: https://www.researchgate.net/publication/256980328_The_Intelligent_Essay_Assessor 
  • Validity: The article is basically a report with interviews with Landauer and others like ETS. It includes data from research, but technically isn’t scientific research itself. 
  • The journal did give some insightful information about the authors. We even have a face to put with the name.
So it's also important to dig into the types of research and how the research is designed and conducted. But, as we see from the notes, none of the citations are actually the research itself that would help us evaluate the impact of these types of programs. It's like we need a way to know when we are presented with research and where to find it when it's not present. And then, we need a clearinghouse of some type that would explain what all this stuff means. Right now, it feels like a big mess that's hard to untangle and arrive at the actual research. And that's just the beginning - we still have to make sense of the research and how it impacts our classrooms - ultimately, how it impacts the humans we serve. 



Part 5: Research? Science? An AI Example

So, we're trying to find the research behind AI scoring. Stuff we need to know as practitioners. The previous posts are how the participants at CREST evaluated the citations provided on Pearson's website about Automated Scoring. As we looked at each citation, we developed suggestions for understanding and evaluating what is research and what it isn't. Here's the next citation and evaluation:

Citation: Foltz, P. W. (2007). Discourse coherence and LSA. In T. K. Landauer, D. McNamara, S. Dennis, & W. Kintsch, (Eds.), LSA: A road to meaning. Mahwah, NJ: Lawrence Erlbaum Publishing.

Here's our notes: 

Authors: Foltz directs research at Pearson's Knowledge Technologies Group. He was a pioneer in AI scoring. 


Source: LSA: A Road to Meaning; Published by Lawrence Erlbaum Publishing. Erlbaum is a respectable company for research from my limited experience.


Validity: The chapter referenced is in the section of the text about various essay scorers. The chapters report about the tools and are not peer reviewed scientific research about the tools. The previous and following sections of the book do address psychometric issues in assessment and scoring. 

Purpose: In the abstract to this chapter, we learn that the chapter is not really about using AI to score but to compare: “For all coherent discourse, however, a key feature is the subjective quality of the overlap and transitions of the meaning as it flows across the discourse. LSA provides an ability to model this quality of coherence and quantify it by measuring the semantic similarity of one section of text to the next.” It’s not really about research on how great AI scoring is to use with assessment as posed on the website’s presentation.
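Here's a crude stand-in for that coherence idea: compare each sentence to the next one and watch how the similarity moves across the piece. Real LSA does this in a latent semantic space; plain word overlap is used below only to make the quoted idea concrete.

```python
# Crude coherence profile via word overlap between adjacent sentences.
# Real LSA would measure this in a latent semantic space, not on raw words.

def word_set(sentence):
    return {w.strip(".,!?").lower() for w in sentence.split() if w}

def coherence_profile(sentences):
    return [round(len(word_set(a) & word_set(b)) / max(len(word_set(a) | word_set(b)), 1), 2)
            for a, b in zip(sentences, sentences[1:])]

essay = [
    "The aquifer stores clean water underground.",
    "That clean water feeds the springs where the salamander lives.",
    "My cousin got a new truck last weekend.",
]
print(coherence_profile(essay))   # the overlap drops to zero at the off-topic sentence
```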

Again, we see some places that we need to go to find the research. And, we see that comparing semantic quality of one section or text to another isn't really the same as giving a kid a grade on a standardized assessment that will impact their graduation and the funding of their school community. Bias remains. And we still haven't found the research and science that we're looking for. 

Discern Original Purpose

Furthermore, the abstract tells us something about the purpose (even if it is from 2007): 

The purpose of LSA, the way the essay assessor works, was to help machines understand the natural language of humans so they could do stuff the human wanted the machine to do. Reminds me of Oppenheimer. Understanding the atom can do a lot of good. But it can be used in ways to do harm. LSA was developed for the purpose of helping a machine understand language. I can't see that its purpose was to assign a score in an assessment regime. Also reminds me of that guy in Jurassic Park. Have we gotten so excited that we can do something that we forgot to think about whether or not we should? 



Part 4: Is it research or science? Tips for Evaluating Citations.

Just because something is cited doesn't mean that it is research. It's called a reference. At CREST, I worked with a group of colleagues to establish some advice about how to evaluate citations to determine what research says and what we need to know about it as practitioners - people who work with kids and teachers. 

In this post, we'll evaluate Pearson's third citation about Automated Scoring.

Citation: Foltz, P. W., Streeter, L. A. , Lochbaum, K. E., & Landauer, T. K. (2013). Implementation and applications of the Intelligent Essay Assessor. In Handbook of Automated Essay Evaluation, M. D. Shermis & J. Burstein (Eds.), pp. 68-88. Routledge: New York.

Authors:

Landauer and Foltz used AI scoring in their classes starting in 1994. Now they both direct research at Pearson's Knowledge Technologies Group. Lochbaum is VP of Technology Services at Pearson. As before, we've pointed out the possible bias. But I'd also have to say, if I were a company, I'd be all about hiring the best researchers in the field. It's not that the folks are evil - we just need to be aware that they get a paycheck from the company that sells the product.


Source:

Handbook of Automated Essay Evaluation; Published by Routledge. Routledge is a respected publisher from my limited experience in academia. I found the whole book online and the chapter, Implementation and Applications of the Intelligent Essay Assessor.


Title: 

Maybe this should be obvious, but the title tells you a lot. The chapter is about Implementation and Applications of automated scoring. Does that answer our questions as practitioners? It should also tell us a bit about the audience.  Are they writing for us? 

Abstract: 

On a separate site, from the American Psychological Association, I looked for information about the chapter. The abstract of a text gives you an overview of the work. See it here. Now, just be aware, writers craft their own abstracts. Much of the text in the abstract is also in the introduction of the chapter. In the abstract, we learn about the background of the authors, their use of the technology in their classes during the late '90s, and the association with Pearson. The abstract chronologically highlights the activities of developing the technology. We learn about prompts: "Describe the differences between classical and operant conditioning." and "Describe the functioning of the human heart." We learn that NLP, natural language processing (readability, grammar, and spelling), is combined with LSA (latent semantic analysis) to measure content and semantics. 

So is this helpful for grading essays? In the abstract, we learn that the machine can tell if a writer is high school student, an undergraduate, or a medical student describing the heart prompt mentioned above. 

We learn that the ELAR crowd wanted more - stuff about style, grammar, and feedback. So the program algorithms became more robust - up to 60 different criteria, including "trait scores such as organization or conventions." 
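The general shape being described - surface NLP features combined with a content/semantics measure, fit against human scores - looks roughly like the sketch below. The features, numbers, and model here are entirely made up for illustration; IEA's actual criteria and weights aren't spelled out in the chapter.

```python
# Toy sketch of combining surface features with a semantic feature and
# fitting them to human scores. Made-up data -- not IEA's actual model.
from sklearn.linear_model import Ridge

# columns: word count, spelling errors, semantic similarity to exemplar essays
features = [
    [310, 2, 0.82],
    [120, 9, 0.35],
    [450, 1, 0.90],
    [60,  4, 0.20],
    [280, 3, 0.75],
]
human_scores = [4, 1, 5, 0, 3]                 # hypothetical human ratings

model = Ridge(alpha=1.0).fit(features, human_scores)
new_essay = [[300, 2, 0.80]]                   # features for an unseen essay
print(round(float(model.predict(new_essay)[0]), 1))
```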

Common Core is mentioned to show a focus on what is valued and assessed in terms of "mastery and higher order thinking." Then, the article promises to address how the IEA (Intelligent Essay Assessor) will be used and how it works, even in comparison to human raters. 

So...the article is a history and summary of the project. It will include research, but it's NOT the research. 

The Article Itself:

Note that Write to Learn's essay feedback scoreboard (Pearson's Product) appears on page 70 of the article. Research IS reported and documented in the article. Lots of great information follows about how this stuff works. But, to really know about the quality and meaning of the research, we'd have to actually look at the stuff cited in the article. 


Drilling Down to the Citations

The internal citations throughout the article and the references on pages 86-88 are the places we would head next to validate and search for the actual research. The chapter is a great place to begin, but the chapter itself isn't the research. 

Admission of Conflict of Interest

Scholarly work will show when conflicts of interest or associations are present. It's part of being honest. 




The chapter concludes with such a statement. And it is important that practitioners recognize the involvement that might color the way the research referenced in the article is presented. 

The Book and Table of Contents

Remember, this search is less about the particular authors and text and more about answering our questions and determining the content and quality of the research behind the topic. Sometimes, the book the chapter is in will have the kind of information we seek as practitioners. 


Chapter 11, on Validity and Reliability of Automated Essay Scoring, might just be the chapter we need to consider. 

Bottom Line: 

Yep. The article posted on the Pearson website to validate the product is a reference. It's NOT research. But, the article is a bounty of background and a map of places to go in search of the research. 


Monday, January 29, 2024

Part 3: Is it research or science? Tips for Evaluating Citations

In evaluating citations, you might find out that something really doesn't count as scientific research. But, you can certainly learn a lot. 

Let's take the first two citations from Pearson's Automated Scoring Website. 

As we worked at the 2024 CREST conference session, we looked at the first two citations and developed some important ideas and tips for evaluating citations. 

Part of evaluating the citations comes from looking at the parts of the citation itself: author, venue/source, accessibility of the link. 

We started by trying out the links. As you know, the links don't work. 

Next, we copied and pasted the entire citation to see if we could find it posted or cited elsewhere on an internet or generic scholar search. 

As explained before, we were able to look up the organization/conference. 

Citation One (2018, June)
We found a pdf of the conference handbook. From there, we were able to see information about the presenters and the session itself.  


The Authors: On page 7 of the program, we see that the first presenter is J. Hauger, a member of the conference program planning committee, and member of the New Jersey Department of Education. On page 13 of the program, we see the session description and the remaining authors. 


Lochbaum is listed first in the program, but wasn't listed first in the citation.  (That's a red flag when people change the order of authorship/participation, as that's part of accurate scholarship because it is used for tenure and issues of academic integrity and property.) She's the VP of Technology Services at Pearson and Knowledge Technologies. 

 
Quesen is listed as an Associate Research Scientist at Pearson. 

Zurkowski was listed in Pearson's citation, but not in the conference program. That is not an academic practice of integrity, and it is dishonest. 

We also see from the program that another person from Pearson is associated with the presentation as a moderator for the session: Trent Workman, Vice President of School Assessment. Since this is a national conference on assessment, members from other states would be in attendance. Meeting potential clients at the conference would be beneficial. And nothing wrong with that. But it's not research. And it does show bias. 

The Session Description: From the session description, we can see that the session is about how Pearson's assessment product benefits "school needs" and saves time. "Consistency, accuracy, and time savings" are offered as powerful enticements, but research is not the focus of this session. 

Citation 2 (2018, June) We already know about Lochbaum. We did find a citation to the conference in another work, but yet again, the order of authorship/participation was different. 

Tips on Evaluating Citations

Find out who the authors are. Where do they work? What else have they written? 
Make sure the citations match when considered/reported in multiple places. 
Consider the dates. Y'all, this stuff is from 2016 and 2018. That's 6 to 8 years ago. ANCIENT history in the technology world. And most scholarship doesn't consider anything older than 5 years unless it's seminal research or philosophy.