Wednesday, April 17, 2024

Morath's Presentation on Hybrid Scoring: Corrections to Commentary about How Hybrid Scoring Was Communicated

TEA posted this PowerPoint to answer questions about hybrid scoring. Morath gave a presentation this week. Here's the link: https://tea.texas.gov/student-assessment/testing/hybrid-scoring-key-questions.pdf 

Inaccurate Information about Communication

On slide 8, the presentation explains that "TEA provided information in other stakeholder group presentations, e.g., Texas Science Education Leadership Association (TSELA), Texas Social Studies Supervisors Association (TSSSA), Texas Council of Teachers of English Language Arts (TCTELA), Coalition of Reading and English Supervisors of Texas (CREST)" in November of 2023. 

This is not true. TEA did NOT provide information to TCTELA or CREST in 2023 about hybrid scoring. 

There was NO CREST conference in 2023. 

There was a TCTELA conference in February of 2023, but hybrid scoring was NOT mentioned. Here are my notes, because TEA did not release the slides. I also recorded the session on my iPhone. I reviewed my personal recording today. There was NO mention of hybrid scoring. 

 https://docs.google.com/document/d/1K6SFfwEPeKXO43WHEugVHcQwJSicCVNJqXLlNHRF6AI/edit?usp=sharing 

The FIRST time ELAR folks heard about this was from their Testing Coordinators and from a pdf posted on the TEA website in December of 2023. To reiterate - the top ELAR organizations in Texas did NOT have the information. And from what I know, the region service centers did NOT cover any of the information about hybrid scoring until after December of 2023. We didn't know. And they didn't tell us in the way described in this presentation. The information on slide 8 is blatantly inaccurate. My momma would call that a lie. 

Other Information and Questions

The slides do give us some good information. But we still aren't getting the answers we need for transparency or instructional intervention or initial instruction. 



Thursday, April 11, 2024

79 vs 8% Zeros: Let's Panic about the Right Thing: Some Facts to Consider about "AI" Scoring

 

Raise Your Hand Texas posted this graphic April 10th, 2024

There's lots to say initially, and more to say after that. 

First, it's not AI. It's an Automated Scoring Engine. 

It does what it's told and doesn't learn. Not sure that helps anyone feel better. Semantics at this point. Tomayto, tomahto. 

Second: 79% of re-testers in DECEMBER received zeros on the extended constructed responses. 

These are the folks that haven't passed the English II exam all the other times they've taken it and aren't even in English II classes anymore. Here's a scenario: 

Senior level student. End of first semester of their last year in school. They need the English II credit to graduate. Their school needs them to pass because that gets them points for accountability stuff. They've taken the test this many times: when they were in English II, the summer after the Sophomore year, the December of their Junior year, the spring of their Junior year, the summer after their Junior year, and now the December of their Senior year. They have the next Spring and Summer opportunities as well. Yeah. That's 8 times. And...a lot of these kids have been doing this for both English I and English II. 16 times. And...a lot of these kids have been doing this retesting thing for YEARS. AND, some of these kids were in English I, English II, and English III at the same time because they failed the class and the assessments. And...some of these kids were in an English class and a remedial class. And...when they originally took English II, the Extended Constructed Response, Short Constructed Response, and new item types weren't a thing. So they haven't had initial instruction or experience on much of any of it. Only remediation if all the rules were followed. 

And they made a zero on the ECR in December of 2023 when most folks didn't know that a computer would be scoring the responses. While the lack of information and knowledge about scoring is a problem - a big one - it should be NO surprise that these kids made zeros. It's not the computer's fault. More on that later. 

These kids - the 79% of December re-testers who made a zero - aren't failing because a computer scored their writing. Most of them didn't write. Some of them (from what we saw on the 2023 responses) wrote nothing, typed placeholders like a parenthesis, said "IDK", or "Fuck this." They aren't failing because they haven't had instruction. They are failing because they are just DONE with the testing bullshit. They are disenfranchised, tired, and ruined. 

The computer isn't the problem: the culture created by the testing regime is the problem. 

Third: So Where Did RYHT Get This Data and Why Did They Post It Now? 

Author's Purpose and Genre, y'all. It's rhetoric; it's propaganda. 

Raise Your Hand Texas wants you to VOTE. The post was incendiary and meant to rally the masses and voting base. It's a classic technique for campaigns and lobbying. We SHOULD vote. We SHOULD be angry. But the data posted is a bit deceptive in directing our actions for the kids in front of us. 

Part of their information comes from the TEA accountability report given to the SBOE. Here it is.  And part of it comes from a research portal that I've mentioned in previous blogs. 

Notice also that they are only showing you English II. There are other grades and results out there. 

Here's a chart of what you'd find in the TEA Research Portal. I took the data and made a chart. ASE means Automated Scoring Engine. 

Source: TEA Research Portal

Action: Note the differences in the population, the prompt, and the rubric. Yes. There's a problem with the number of kids scoring a zero. But it's probably not a computer problem. Let's dive deeper. 

Fourth: There's More from Texas School Alliance and Others

The Texas School Alliance (and others) have been talking to TEA. You can become a member here. Various news organizations are also reporting and talking to TEA, asking hard questions and receiving answers. And school officials have been able to view student responses from the December 2023 retest. 

In researching the reports and ideas, here are some key things to know and understand before acting. 

Difference in the Type of Writing

The big change that caused the zeros probably wasn't the computer scoring thingy. Kids are answering different kinds of prompts than they were answering before. Before, kids wrote in persuasive or explanatory modes about a topic they knew something about. They used their own schema and probably didn't care much about the topic. Now, kids have to write in argumentative or explanatory modes in a classic essay format or in correspondence (which we haven't seen yet). They have to write from their understanding of a text and not from their schema. 

Action: We need to make sure kids know how to do the right kind of writing. We need to make sure kids can comprehend the texts. 

The Rubric is Different

The rubric used in 2023 and 2024 retesting is DIFFERENT because the type of writing is different. From discussions with others, people are pretty confused about how that new rubric works and what it means. If teachers are confused, KIDS are confused. There has been little to no TEA training on the rubric other than some samples. 

Action: Our actions should be to really study our kids' responses and the scoring alongside the scoring guides. 

The Test ISN'T Harder Overall: SO WHAT? 

You can argue with me on this if you want to. But there's a detailed process all tests go through before and after administration. It's called equating. It's where we get the scale scores, why the cut points for raw scores are different each year, and why we don't know what passing is until after the tests have been scored. 
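If you want a concrete picture of what equating does, here's a tiny sketch of one classic method (mean-sigma linear equating). To be clear: this is a generic illustration with made-up numbers, NOT TEA's actual procedure, which is more involved and not public.

```python
# A generic mean-sigma linear equating sketch -- NOT TEA's actual procedure.
# The idea: place raw scores from a new form onto the scale of a reference
# form so that "passing" reflects the same level of performance even when
# the forms differ in difficulty. The numbers below are invented.

from statistics import mean, stdev

reference_form_scores = [22, 25, 28, 30, 31, 33, 35, 38, 40, 42]  # old form (anchor group)
new_form_scores       = [20, 23, 26, 28, 29, 31, 33, 36, 38, 40]  # new form (anchor group)

# Mean-sigma equating: match the mean and standard deviation of the two forms.
slope = stdev(reference_form_scores) / stdev(new_form_scores)
intercept = mean(reference_form_scores) - slope * mean(new_form_scores)

def equate(raw_score_on_new_form: float) -> float:
    """Convert a raw score on the new form to the reference form's scale."""
    return slope * raw_score_on_new_form + intercept

# The same raw score can map to a different scaled score each year, which is
# why cut points move and why we don't know "passing" until after scoring.
print(equate(30))
```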

The ECR is probably harder. So are some of the SCRs. Some of the item types and questions are harder. Some questions have always been harder than others. That's not new. TEA is right that the test isn't harder overall; it's just that there are other things that matter to living, breathing, and feeling humans.

Just because the overall test is not harder doesn't mean the student experience is the same. When stuff is new to kids, it's scary. When stuff is hard and scary, and when stuff is high stakes, kids have anxiety. This impacts test scores in ways that are unrelated to psychometrics. Just because the test isn't harder psychometrically doesn't mean the experience isn't psychologically challenging in ways that impact instruction. 

Action: We need to do more work on preparing students for the user experience - the technical and online test details as well as the social-emotional realities. Especially for kids who have previously experienced failure. 

People Who Viewed the Zero Responses Agreed with the Computer Ratings

TEA reported before (December 2023) that the ASE is trained and MUST agree with human raters. It's part of their process. 

And the folks I've talked to - the ones who went to TEA and saw the responses from kids at their campus who scored a zero - agree that the papers deserved a zero. Most of them requested no rescores, or only one or two. 

This means that our colleagues agree that the ASE isn't the problem. 
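For context, "agreement" in scoring-land is usually reported as exact and adjacent agreement rates (and sometimes a weighted kappa). Here's a minimal sketch of how those rates could be computed from paired ASE and human scores - invented numbers, and not anything TEA has published.

```python
# Minimal sketch: exact and adjacent agreement between ASE and human ratings.
# Invented data -- not TEA's process, thresholds, or reporting code.

ase_scores   = [0, 0, 2, 3, 1, 4, 0, 2, 5, 3]
human_scores = [0, 1, 2, 3, 2, 4, 0, 4, 5, 3]

pairs = list(zip(ase_scores, human_scores))
exact    = sum(a == h for a, h in pairs) / len(pairs)
adjacent = sum(abs(a - h) <= 1 for a, h in pairs) / len(pairs)  # exact counts as adjacent here

print(f"Exact agreement:  {exact:.0%}")
print(f"Exact + adjacent: {adjacent:.0%}")
```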

The Hybrid Scoring Study Conducted by TEA

Here's the link.  And here's a screenshot of the Executive Summary on page 5. 

Huh? Here's what I think it means: 

Paragraph One: They studied the Spring 2023 results. The ASE did a good job. They are going to use the results of the study to evaluate future tests - including the one that was given in December of 2023. 

What we don't know: Who did the study? What was the research design? What is "sufficient performance criteria"? Where is the data and criteria for the "field-test programmed" and "operationally held out validation samples"? What do those terms mean and how do they function in the programming and study? 

Paragraph Two: They studied the first 50% of the extended constructed responses scored by the automated scoring engine (ASE). The ASE did a good job for the subpops used for accountability purposes.

What we don't know: What is "sufficient performance criteria"? How did the subpops compare and what was used as criteria to know that the comparisons were fair? What are the models? 

Paragraph Three: The way the test is made now (with different texts and prompts each time), the computer scoring engines will have to be reprogrammed each test administration AND while scoring is going on. They are going to reprogram on all score points as the data arises during scoring. As changes are made to the models, everything that was scored with the old models will be scored again. (That's good research practice with both a priori and constant comparative analysis. BUT - we don't really know what research protocols they are following, so I'm just guessing here.) As more essays are scored, they'll find new kinds of responses that confuse the models and need rerouting. Responses that were rerouted before the new codes will be rescored with the new models. 

What we don't know: We still don't know how the ASE is programmed and monitored. Like, at all. 
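Still, to make paragraph three concrete, here's a rough sketch of the loop it seems to describe: human-scored responses pile up during the administration, the model gets refit, and everything scored under the older model gets rescored under the new one. This is my guess at the workflow - the "model" here is a silly stand-in, not TEA's or Cambium's actual engine.

```python
# Rough sketch of the "retrain while scoring, then rescore" loop described in
# paragraph three. Purely illustrative: the real ASE, its features, and its
# retraining rules are not public. The stand-in model just averages human
# scores by response length.

from collections import defaultdict

human_scored_so_far = []   # (response_text, human_score) pairs collected during scoring
all_responses = {}         # response_id -> response text
machine_scores = {}        # response_id -> current machine score (can change!)

def fit_model(human_scored):
    """Stand-in model: average human score per response-length bucket."""
    buckets = defaultdict(list)
    for text, score in human_scored:
        buckets[len(text.split()) // 50].append(score)
    return {bucket: sum(scores) / len(scores) for bucket, scores in buckets.items()}

def predict(model, text):
    return round(model.get(len(text.split()) // 50, 0))

def process_batch(new_responses, newly_human_scored):
    """Absorb a new batch plus any new human ratings, refit, and rescore EVERYTHING."""
    all_responses.update(new_responses)
    human_scored_so_far.extend(newly_human_scored)
    model = fit_model(human_scored_so_far)       # the model changes mid-administration
    for response_id, text in all_responses.items():
        machine_scores[response_id] = predict(model, text)
    # Responses scored under an earlier model get rescored with the updated one,
    # so a score assigned early in the scoring window can change later.
```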

Paragraph Four: Condition codes change based on paragraph three. They'll keep working on changing them and refining that process during current and future administrations for all grades. They will have to, because the grades, texts, prompts, and human ratings all change each administration. The data also changes as more responses are scored, so the codes have to change as the student responses are scored. 

Action: All of us need to read the whole study. And we need to ask more questions. For now, I think there's not a real ASE scoring problem with the zero responses. The 79% that got zeros probably earned them, and we need to look elsewhere for the cause.





Wednesday, April 10, 2024

13 Trips and A Request for Feedback

Note: I wrote this last week during an Abydos session on Building Community. We used a grouping strategy called "Pointing," originally developed by Peter Elbow. But we ran out of time. I'd like to try it out with my friends online so I can use the data to show how we extend the grouping activities for revision and differentiated instruction.

Would you help? Can you read my writing and then "point out" the words and phrases that "penetrate your skull" or seem to be the center of "grabbity" that catch your attention? If you have time, give me an idea of why the words or phrases stuck out to you. 

There were 13 trips for a sprinkler repair. Unlucky? Always. Why? 

You never have all the right things. It's the wrong size. Wrong gauge. Wrong thread. Wrong length. Just wrong-wrong-wrong. And that was wrong too. 

1. You bought the wrong thing the first time because you were hapless enough to go to the store to buy what you thought you needed before you started digging.

2. You returned to the store to buy the new thing, but when you went back to install it, you realized that thing doesn't fit any better than the first thing. You need that other thing that makes the second thing fit. Misfortune again. 

3. When you put the adapter on to make the part fit, a foreboding crack undermines your efforts. So now you need more parts to fix that thing too. 

4. When fixing the new problem, you use an old tool at a cursed angle and the plastic for the new part snaps. You go buy the new part only to realize upon return home that...

5. Jinxed, you need a different tool so you can remove that thing that broke. So you go back to the store to buy the tool to remove the part that broke so you can use the part you bought the last time. 

6. Now that things are in place, you dry fit the parts only to find that the ill-fated glue you bought the last time - only last week - has already turned solid. 

7. And the lid on the other thing that makes the first thing work won't come off. So you need another tool or chemical to open the second one and buy a new one so you just buy both of them. But then there are three choices now for the same product that used to work just fine and the wretched original is nowhere to be found. 

8. When you are digging past the mud and clay, you realize that the blighted person before you must have taped the thing together with Gorilla Glue and an old garden hose and you'll have to fix that part too. 

9. Then you go to put the new part on and realize that some star-crossed do-it-yourselfer laid the new pipe on the old pipe and there isn't room for a replacement connector. So you go back for a flexible bendy part to add to the other parts you just bought. 

10. So you dig a bigger trench because you think you might lift up the pipe to make some room with the bendy part you just bought. Avalanches of dirt cover your previous attempts. So you carefully leverage what you have in the dark.  Blindly, you lift until you hear a crack and notice that on the other side of your repair there is a T where the damaged pipe intersects with two other directions. 

11. So you dig a bigger trench to uncover the new problems. Traipse back to the store to get more things. Things seem to be going fine when you dry fit the parts and use the new glue stuff. Only to realize that you weren't quite ready for the glue. The pipe is still too long, but now the connections are too short and you'll have to catastrophically cut off the T and replace the whole thing because the pipe is too close to accept a new adapter. 

12. Back to the store. For all the new things and other adapter things. You dry fit and double check everything only to realize - calamity - when you reach for the glue that you didn't close the lids properly and the gunk is now a dried goo all over your new tools.

13. Since you've already spent the equivalent funds and wait time of calling a pro, you throw your muddy gloves down into the hole, wipe your bloody hands on your soaked pants, and ask Siri to call the sprinkler repair guy you should have called before your first, ill-fated trip. 


MH's Pointing Feedback: I love the phrase "avalanches of dirt." It creates a vivid image in my mind.

I like the repeated use of the word "thing."  In real life, I get frustrated with too many uses of "thing," but it fits perfectly here because I'm guessing you don't really know what all the "things" are really called!  It shows from the beginning that you probably shouldn't have been the one making this repair.
The phrase "foreboding crack" tells me "uh-oh." 

Wednesday, March 6, 2024

TEA Communication Regarding December ECR Scores

TEA Communication

 TEA Communication to Testing Coordinators

Not all of us receive notifications that the testing coordinators receive. And often - the information just doesn't "trickle down." 

Background

So here's what happened. 

1. Many people shared concerns about the scores on the December retests since the scoring method was different (not AI but by a machine that doesn't learn) and that there were a lot of zeros. 

2. We don't see responses on December tests because the tests aren't released. It's expensive to release an exam and they might need to use the passages and questions again. 

3. Responses from the December test were scored by two Automated Scoring Engines. 25% of them were also routed to and scored by a human for "various auditing purposes."  (That sounds like the human scores were used for making sure the ratings were correct and not used for reporting.) 

4. If the machines didn't have scores that were the same or adjacent (same: 0-0, 1-1, 2-2, 3-3, 4-4, 5-5; adjacent: 0-1, 1-2, 2-3, 3-4, 4-5), then the essays were routed to humans for rescoring. (A sketch of this routing check follows the list below.) 

5. The ASE (the computer) sent flagged responses to the humans with condition codes attached. These are the codes: (I'm assuming there are other codes not associated with zeros, but that information is unclear. It is also unclear how the machine is programmed other than with the rubric.)

    a. response uses just a few words

    b. response uses mostly duplicated text

    c. response is written in another language

    d. response consists primarily of stimulus material

    e. response uses vocabulary that does not overlap with the vocabulary in the subset of responses used to program the ASE

    f. response uses language patterns that are reflective of off-topic or off-task responses

 Additional language describes "unusual" responses that could trigger a condition code for review. 

6. Responses routed to a human for rescoring retain the human's rating. The language is unclear about whether the 25% scored by a human also keep the human rating, because that was addressed in a paragraph that did not address condition codes. 
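Putting items 4 and 5 together, here's a small sketch of what that routing decision appears to look like, based only on the language in TEA's letter. The two-engine agreement rule comes straight from item 4; the condition-code checks and thresholds are placeholders I made up, because how the real engine detects any of this is exactly what we don't know.

```python
# Sketch of the routing logic in items 4-5 above. The two-engine agreement
# rule comes from TEA's letter; the condition-code checks and thresholds are
# invented placeholders -- the real engine's internals aren't public.

from typing import Optional

def needs_human(score_a: int, score_b: int) -> bool:
    """Route to a human unless the two engine scores are the same or adjacent."""
    return abs(score_a - score_b) > 1

def condition_code(response: str, stimulus: str) -> Optional[str]:
    """Very rough stand-ins for condition codes (a), (b), and (d) above."""
    words = response.split()
    if len(words) < 10:                        # (a) just a few words -- threshold invented
        return "too_few_words"
    if len(set(words)) < 0.3 * len(words):     # (b) mostly duplicated text -- threshold invented
        return "duplicated_text"
    if words and response.strip() in stimulus: # (d) primarily stimulus material (crudely)
        return "copied_stimulus"
    return None                                # (c), (e), (f) would need real language analysis

# Example: engines score the same essay 0 and 2 -- not adjacent, so a human rescores.
print(needs_human(0, 2))   # True
```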

Implications for Instruction: Revising

Instruction can address each element of the rubric as well as the information we see in TEA's letter to testing coordinators and bulleted above. 

We can have students ask themselves some questions and revise for these elements: 

  1. Does your response use just a few words? (Pinpoint these areas with the highlighting tool: where did you 1) address the prompt and 2) provide evidence?) How could you expand your words into sentences? How could you use a strategy such as looping to add additional sentences? 
  2. Does your response use mostly duplicated text? This means repetition. Go back into your writing. Use the strikethrough tool as you read through your work. Strike through any repeated words and phrases. Then go through each sentence. Does each sentence say something new? Use the strikethrough tool to remove the parts that don't. Did you use an organizational framework for your writing? QA12345 helps you write new things for each segment. 
  3. Does your response use another language? Highlight the words in your language. Go back and translate your writing into English, doing the best you can to recreate your thoughts. Leave the text in your language as you work. Then go back and delete things that are not written in English.
  4. Does your response mainly use words from the text or passage? Good! This means you are using text evidence. First, make sure the evidence helps answer the ideas the prompt wants you to write about. Second, add a sentence that explains how that text evidence connects to the ideas in the prompt.

Implications for Instruction: Comprehension and Preparation

The last bullet indicates that the program is prepared with sample responses. This means that instructional materials and our questions must also consider what answers we are expecting and how we go about putting that down in words. It also means that we should be prepared for how students might misinterpret the text, a section of it, or the prompt. 

  1. Teach students how to diffuse the prompt for words of distinction that change the conditions of the prompt focus (today vs past). This also includes how we decide what genre to use for the response: informational, argumentative, correspondence (informational or argumentative). 
  2. Discuss the differences between prompts that ask for responses about the whole text vs. sections of the text (an aquifer vs. a salamander). 
  3. In developing prompts, be sure to compose prompts that can have multiple answers and have multiple pieces of text evidence to support. Be sure to compose prompts that can have WRONG interpretations and associated evidence. 
  4. Develop sets of exemplars that writers can use to match the evidence to the thesis to the prompt and finally to the passage. They need to SEE all the ways in which their thinking can go astray. 
  5. Teach them about vocabulary and bots that scan for key words and synonyms. We may not like this, but how could it be a bad thing for students to think about the key ideas in the prompt and passage and make sure the vocabulary in their responses matches? (See the sketch after this list.) 
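Since condition code (e) above is about vocabulary overlap with the responses used to program the engine, here's a tiny sketch of the kind of overlap check students could run on their own drafts. It's an assumption about how such a check might work - it is NOT how the ASE actually works.

```python
# Tiny sketch of a vocabulary-overlap check, in the spirit of condition code (e)
# and teaching point 5 above. NOT how the ASE works -- just a way to let students
# see whether their word choices connect to the prompt and passage.

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def key_vocabulary(text):
    """Lowercased content words, minus a small stop-word list."""
    return {w.strip(".,!?;:").lower() for w in text.split()} - STOP_WORDS

def overlap_report(response, prompt_and_passage):
    """Share of the response's vocabulary that also appears in the prompt/passage."""
    response_vocab = key_vocabulary(response)
    source_vocab = key_vocabulary(prompt_and_passage)
    if not response_vocab:
        return 0.0
    return len(response_vocab & source_vocab) / len(response_vocab)

# Example: a response that never touches the passage's key words scores near 0.
print(overlap_report("IDK", "Explain how the aquifer supports the salamander."))
```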

Implications for Instruction: Creativity

Most of the research about machine scoring says we really aren't ready for machines to score diverse answers. But...the language in the letter here suggests that "unusual" stuff might receive a condition code. And we don't have a description of what an unusual condition code for creative responses might be, other than what was described in the previous bullet points. IDK. 

What to Ask Your Testing Coordinator

Because of the concerns we all raised, TEA isn't going to let us see the ECR's, but they are going to let us have some data. (I'd like to see the statewide data, wouldn't you?) Ask your coordinator about the report they can ask TEA to provide. It will include information about scores on ECR's from December: 
  • How many students turned in a blank response? 
  • How many responses were unscorable with a condition code? 
  • How many responses received a zero? 
Coordinators can then ask for an appointment to VIEW responses with text that received a zero. They won't be able to ask questions about scoring, the rubric, or for responses to be rescored. But they'll be able to see the responses that have a zero. Not sure that will help much - and understand that seeing more would compromise test security and cost a lot of money because the passage and question could not be reused. 

Friday, February 16, 2024

TEA's Annual Report and The Teacher Vacancy Task Force Report: Important Reading for School Practitioners

 Here's TEA's Annual Report. 

I'm looking forward to the technical report released in the fall by the Assessment Division. 

Here's an overview of what sticks out to me and where you might read to be informed. 

Page 14 shows our results and compares them to New York and funding. I'd be interested to see what Paul Thomas says about that data and what it really means about a reading crisis. 

Page 12 talks about a program called ACE that I wanted to learn more about based on the commissioner's remarks during the Feb SBOE meeting. He said the method was a sure thing for turning a school around. But he also noted it couldn't be done in small schools because you can't move staff around. I'd like to know if this work has been done at high schools, because as far as I've found in the research, that's a much more complex beast and not much works with such sure and high results. 

Page 8 talks about the money spent on teacher incentive allotments. Problematic in practice, because teachers tell me their schools refuse to give them the ratings on their evaluations above a certain level. They are told it is district policy not to give above a certain mark regardless of the teacher proficiency or documentation provided. 

Note also the graphic on page 8. In practice, teachers don't have time to plan lessons. Not sure what they mean by planning their "master schedule," as that's not how we usually use that term. I can tell you this: providing instructional materials so teachers don't have to plan lessons really isn't what we are looking for here. I have YET to see materials that take the place of a thinking and reasoning being with high-quality training and the time to do what needs to be done. 


See page 7 to learn more about the increases in those served by SPED. With possible changes to Dyslexia processes, that number is going to be even larger. 

Safe schools are addressed on page 6. 

Page 5 addresses how the 88th legislature changed school funding. School finance at large is on page 4. 

Page 3 gives you a cool overview of Student Outcomes and TEA goals. It gives a comparison to pre-COVID numbers too. It's SO important to know the overall goals because that's how they decide what happens with assessment, curriculum, and initiatives that roll out at the Region Service Centers. This stuff is what ends up happening TO us as practitioners. 

The Teacher Vacancy Task Force report was referenced in the citations. Here it is.



Thursday, February 15, 2024

Show me the Data: No CR Data on December retesters? No ECR reports at all? No pdf's for reteaching? Why?

Update: Just learned about a tool. The Cambium Research Portal: https://txresearchportal.com/selections?tab=state 

Here's the report for English I: 

Here's the report for English II

The question is WHY weren't these results part of the STATE's summary reports? WHY did I have to go to a Cambium research portal? 

Here are the Results and ANOTHER Question: Why are we still doubling the points on a score that used to come from two people and now comes from a machine? Still don't have SCR data. 

English I / English II: 

Score Point 0: 63% / 79%
Score Point 1: 5% / 7%
Score Point 2: 6% / 5%
Score Point 3: 2% / 2%
Score Point 4: 3% / 1%
Score Point 5: 3% / 1%
Score Point 6: 3% / 1%
Score Point 7: 2% / 0%
Score Point 8: 2% / 1%
Score Point 9: 1% / 1%
Score Point 10: 1% / 1%

Original Post: 

Here are the EOC state reports for December retesters: 

English I

English II

And here are the FAILURE rates: Algebra I: 62%, English I: 68%, English II: 76% 

For the last administration, we saw charts like this on the summary reports: 


And now, I'm hearing that schools are not getting the pdf's of retesters' essays? AND, we don't get to see any of the test items? We have to wait until they retest (AGAIN) in the spring to see data? And all of this is happening with the new machine scoring in place. 

Do we have to wait until 3/25-28 for the Accountability Reports? 

How are we supposed to help the students without seeing how they responded? How are we supposed to understand how they are graded without seeing the connection with the score and the essay? 

I understand that releasing the tests is expensive. We've been dealing with lack of info to support retesters for a while now. But not having reports and data about constructed responses is inexcusable at this juncture regardless of the administration type. 




Wednesday, February 14, 2024

Semantics: No AI STAAR Scoring, but does it make me feel better that it's still a machine?

 Here's an article you should read. https://www.dallasnews.com/news/education/2024/02/14/computers-are-grading-texas-students-staar-essay-answers/ 

Falsehood or Semantics:

So...ECR's and SCR's aren't graded by AI - Artificial Intelligence. I used the wrong word initially. But they ARE graded by a machine. For most of us (parents and practitioners), neither word makes us feel all warm and fuzzy about its use. And why do we have to learn about this stuff by accident or from the Dallas Morning News? 

Some important points: 

1. In the February SBOE board meeting, the commissioner was asked if AI was being used to score essays and short answer. He said No. Which is true. But he also said that two humans were grading. Untrue for quite some time now, as the document about scoring came out in December. Unless you count the convoluted way that statement would be true: two raters score essays and then their ratings are programmed into a machine that now scores the essays the way the original humans did. (Which is also problematic because the machine can inherit bias and inaccuracies.) 

A Truth: December testers were scored by the "machine." The machine has to score based on training it receives to mimic a large number of essays previously scored by humans.

An Inference: If the machine was trained to score December retesters based on a database of previously scored essays, then the December data had to come from a field test. 
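To picture what "trained to mimic human raters" could mean in practice, here's a bare-bones sketch using off-the-shelf tools (TF-IDF features plus a simple regression). This is a generic approach from the automated-scoring literature with invented placeholder essays - not the actual engine, features, or training data TEA and Cambium use.

```python
# Bare-bones sketch of training a scoring model to mimic human raters.
# Generic technique (TF-IDF + linear model), NOT the actual STAAR engine;
# the essays and scores below are invented placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# A real training set would be thousands of field-test essays with human scores.
training_essays = [
    "The author argues that the aquifer must be protected because ...",
    "idk",
    "The evidence in paragraph 3 shows that the salamander depends on ...",
    "this is dumb",
]
human_scores = [4, 0, 5, 0]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(training_essays)

model = Ridge()
model.fit(features, human_scores)

# The trained model can only reproduce patterns present in the human-scored
# sample -- including any bias or inconsistency in those original ratings.
new_essay = ["The text explains that the aquifer recharges slowly, so ..."]
predicted = model.predict(vectorizer.transform(new_essay))
print(round(float(predicted[0])))
```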

2. December testers were scored by the machine using data from previous scoring events - probably a field test. Field tests aren't experienced by students in the same ways as official testing. Since the writing types were new, we had stand-alone field tests. And scoring isn't experienced by raters in the same way in setting or urgency. This creates scoring inconsistencies and variables that don't match the real data and experiences on and after test day. That's called unreliable and invalid. 

3. If December testers (who are most at risk because they've not passed in previous administrations) were scored by a machine, there are a few scenarios. All of them are problematic. 

Hypothetical Scoring Scenarios: 

Ultimately, we are left guessing about what happened and how. Here's some possibilities and the problems they pose. 

Hypothetical Scoring Scenario One: Machine scoring was previously used to validate human scoring on a STAAR test or Field Test.  

Problem: We knew nothing about machine scoring until the December document and the Test Administrator's 2023-2024 training. Since we didn't know, my grandma would call that sneaky and dishonest.

Hypothetical Scoring Scenario Two: Machine scoring was NOT used previously to pilot the validity of human scoring on an operational assessment. That's called unethical, because something was used without data to prove its effectiveness.

Problem: For large-scale assessments with high-stakes outcomes for the entire state of TEXAS, why not? 

Hypothetical Scoring Scenario Three: Machine scoring was tested by the development company on something that wasn't STAAR. That's called unreliable and invalid. Or just unwise at the least.

Problem: STAAR is its own beast. It's not really like anything else. And, y'all. This is Texas. We do our own thing. 

Call for Answers: 

What "machine" is being used? 
What's the "machine" called? 
Who developed it? 
How were the trials conducted? Were there trials?
Why weren't we told? 
Why didn't the SBOE know?
Is this scoring method authorized in the Texas Register or any House Bill?
How is the machine programmed? 
Who is programming the machine? 
How does the machine work? 
Did we hire more folks at TEA to manage the computer stuff? 
Or is there a company managing that? 
Does the machine use latent semantic analysis? 
Does the machine use keywords? 
Where is the data on content evaluation? 
Where is the data on grammar and mechanics? 
Where is the data on diction, style, and voice? 
Where is the data on organizational structure and genre?
Where is the data on effectiveness? 
Where is the data that says it's a good idea to begin with other than cost? 
How are inconsistencies in scoring triggered to send essays to humans? 
How is the program/machine "monitored"?
How is the process sustainable? 
How many field tests will be required to sustain the number of essays for training the machine for each year and each grade level and each ECR, SCR? 
How is a field test a valid measure and source of essays? Data?
How many essays did the machine use for its training?
How many essays does the research say the machine needs? 
Were there studies about the machine in this context? 
How was the research conducted and by whom? 
What happens if the writer's response is creative and unformulaic?
How have cautions in the research about machine scoring been addressed and overcome? 
What other states and exams are using this method of programming? 
How does our data compare to other states and assessments? 
How do our assessments and scoring compare to others? 
How much did it cost? 
What are the implications for instruction? 
What are the implications for learners and their future outcomes? 

And Finally: 

The research actually states that this kind of thing should be used on low-stakes stuff, and frequently. TEA and the SBOE talk all the time about making the assessment better match how we teach. 

SO:

Why don't teachers have access to the same technology to match assessment to instruction? Or instruction to assessment? But that's another can of worms, isn't it?