Tuesday, January 30, 2024

TEA Instructional Materials Review, The IMRA Process: Rubrics

I'm watching the SBOE meeting today (Jan 30, 2024). At TCTELA and CREST, we asked for the rubrics, how they were developed, and the research behind them. The curriculum division couldn't answer. So I started looking.

HB 1605 Requires a New Instructional Materials Review Process



Right now in the board meeting, they are talking about the IMRA process that will replace the previous Texas Resource Review process. The focus is to approve the rubrics that have been developed. The development process and public review have already happened.

In the IMQE/Texas Resource Review, we already evaluated the stuff in the light blue. HB 1605 says that we have to have guidance about the dark blue stuff as well. 

And don't get me going on the three-cueing ban. How can you ban phonics? The V in MSV is phonics. No one ever meant that kids would look at the pictures and guess what the words were, and that only works in a very limited range of texts anyway. That practice is one of those things that gets bastardized in application. It probably happened. But it was never the research. And it's a silly thing to ban a practice that says we need to focus on meaning, the way our language functions in terms of grammar, and the visual, phonic elements of the language.


Where are we on the timeline? 

The call for reviewers has already gone out, and a publisher interest form has been sent. Searching the TEA website and elsewhere, I couldn't find the 9-12 ELAR rubrics. Does anyone know where these are?

The IMRA Process

K-3 IMRA ELA Rubric

K-3 IMRA ELA Rubric Updated in December

K-3 IMRA SLA Rubric

K-3 IMRA SLA Rubric Updated in December

4-8 IMRA ELA Rubric

4-8 IMRA ELA Rubric Updated in December

4-8 IMRA ELAR Rubric Updated in January

4-8 IMRA SLA Rubric

4-8 IMRA SLA Rubric Updated in December

4-8 IMRA SLAR Rubric Updated in January

K-12 IMRA Math Rubric

K-12 IMRA Math Rubric Updated in December

K-12 IMRA Math Rubric Updated in January

What's Next:

Well, I have questions. What theoretical framework guided these decisions? Where's the research behind the rubrics? TEA staff said they used such things. What, exactly, were those considerations?

And, the biggie: How does one draw the line between what is suitable and appropriate versus obscene and harmful? I'm not really seeing any of that in the language of the rubrics. Maybe that's why there's not one for 9-12? I can't imagine everyone would agree on how Shakespeare measures up against suitable and appropriate, obscene and harmful. I'd like to see how they are defining those concepts. There's no evidence on the 1605 website that 9-12 is a part of the process.

PS - Suitability is a separate rubric. They aren't scoring for this, but only adding a flag for things they find unsuitable. There was supposed to be a draft of this from the December meeting, but I haven't found it yet. 

PS - I'd also like to see the research referenced that guides this rubric component.

There's a sustainability rubric too...but that's for another time. 



Scoring Process for STAAR Constructed Responses

TEA Statements about Essay Scoring

A document was released in December of 2023 about how constructed responses are scored. We already knew that retesters would have their work scored in a hybrid manner. In other words, a machine (an ASE, or automated scoring engine) would score all of the papers, and 25 percent of them would also be scored by a human. It caused quite a stir. TEA staff doesn't want to call it AI. Semantics?
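To make the "hybrid" idea concrete, here is a minimal Python sketch of what a read-behind sampling step could look like. This is purely hypothetical: the only detail grounded in TEA's document is that the engine scores everything and humans score about 25 percent; the sampling method, function name, and data are mine.

```python
import random

def route_for_human_review(response_ids, sample_rate=0.25, seed=2024):
    """Hypothetical read-behind routing: every response gets an engine
    score; a random sample of responses also goes to a human rater."""
    rng = random.Random(seed)
    sample_size = round(len(response_ids) * sample_rate)
    human_sample = set(rng.sample(response_ids, sample_size))
    return {
        rid: ["engine", "human"] if rid in human_sample else ["engine"]
        for rid in response_ids
    }

plan = route_for_human_review([f"resp_{i}" for i in range(200)])
print(sum("human" in steps for steps in plan.values()), "of", len(plan), "also go to a human")
```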

To help with understanding what we know and what we don't know, I've annotated the December Document, Scoring Process for STAAR Constructed Responses. 

Background and Resources for Consideration

Automated Essay Scoring

Since we don't really know ANYTHING about the ASE other than what is in this document, a friend and I started looking for background. 

ETS has an automated scoring engine called e-rater. They talk about it here.

This is a literature review about automated scoring engines. It will be important to read and consider these ideas as background while we wait for guidance on what the ASE actually is and who developed it. And here's the article from the DOI. The article does use the term AES that TEA uses in their document.

A literature review IS a respected form of research. And this article does describe the research method, the research questions that focused the study, and how the information was searched and included.

The authors are: 

  • Dadi Ramesh: School of Computer Science and Artificial Intelligence, SR University, Warangal, TS, India; Research Scholar, JNTU, Hyderabad, India
  • Suresh Kumar Sanampudi: Department of Information Technology, JNTUH College of Engineering, Nachupally, Kondagattu, Jagtial, TS, India


The aims and scope of the journal are listed here. They focus on Artificial Intelligence. 

Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a mathematical tool used in essay scoring. This study was conducted by folks from Parallel Consulting and members of the Army Research Institute for the Behavioral and Social Sciences. They wanted to see if automated essay scoring would work for a test they use, the Consequences Test, which measures creativity and divergent thinking. Their research process is clearly outlined and gives us some ideas about how LSA worked for the Army in scoring the constructed responses on the exam. It was published in a journal called Educational and Psychological Measurement.
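For readers who, like me, want to see what "a mathematical tool" looks like in practice, here is a toy Python sketch of the general LSA idea, assuming scikit-learn is installed. It is not the Army study's procedure or any vendor's engine; the essays and scores are invented, and real systems train their semantic space on large corpora rather than a handful of samples.

```python
# Toy sketch of LSA-style scoring: project essays into a reduced semantic
# space and score a new essay by its similarity to already-scored essays.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

scored_essays = [
    "The heart pumps blood through arteries and veins to the body.",
    "Blood is moved by the heart, which has four chambers.",
    "The heart is in the chest. It beats.",
]
human_scores = np.array([4, 4, 1])  # hypothetical rubric scores

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(scored_essays)

# The "latent semantic" space: keep only a few dimensions of the SVD
lsa = TruncatedSVD(n_components=2, random_state=0)
reference_vectors = lsa.fit_transform(tfidf)

new_essay = ["The heart's chambers push blood out through the arteries."]
new_vector = lsa.transform(vectorizer.transform(new_essay))

# Weight the human scores by semantic similarity to the new essay
similarities = np.clip(cosine_similarity(new_vector, reference_vectors)[0], 0, None)
if similarities.sum() > 0:
    print("predicted score:", round(float(np.average(human_scores, weights=similarities)), 2))
else:
    print("no sufficiently similar reference essays")
```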


Where to Begin Research: 

As we know, Pearson and Cambium are running STAAR. That's why I used the Pearson website on Automated Scoring to start looking for the research in the previous blogs and for the CREST presentation. Similar information can be found here for Cambium. As practitioners, until we have more answers from TEA about who designed the products, we only have these places to start looking at what we're dealing with and the research behind it. For now, we have a place to begin informing ourselves.

PS: Here is a document about human vs. machine scoring.
And a copy of the TEA Constructed Response Workbook that was shared at the assessment conference.


Part 6: Science? Research? An AI Example

At CREST, we looked at evaluating research and developed suggestions for laypeople to evaluate citations provided to us as support for topics that impact our instruction. Below are our notes on evaluating the last two citations on Pearson's website about Automated Scoring.  

  • Citation: Landauer, T. K., Laham, R. D., & Foltz, P. W. (2003b). Automated essay assessment. Assessment in Education, 10(3), 295-308.
  • Authors: Landauer and Foltz direct research at Pearson's Knowledge Division.
  • Source: The journal Assessment in Education, which is associated with the International Association for Educational Assessment. We can read about its aims and scope here.
  • Question: While the citation on Pearson's website lists three authors, the actual journal only lists Landauer. Why?
  • Type of Research: Survey/description.
  • Validity: Reports on scientific research, but is not scientific research in itself. This article would be a good place to begin; from there, read the studies it references.
  • We might also consider works written about the Intelligent Essay Assessor, like chapter 5 of Machine Scoring of Student Essays: Truth and Consequences, which describes the user experience with the tool and the outcomes.

  • Citation: Landauer, T. K., Laham, D., & Foltz, P. W. (2001). Automated essay scoring. IEEE Intelligent Systems, September/October.
  • Authors: We are already familiar with Landauer and Foltz. Laham is a PhD who has written and published 33 articles according to ResearchGate. The publication itself gives additional information about the authors. Laham also works for Knowledge Analysis Technologies, which Pearson has purchased.
  • Source: IEEE Intelligent Systems. This is a journal published by the IEEE Computer Society and sponsored by the Association for the Advancement of Artificial Intelligence (Wikipedia).
  • Link to Article: https://www.researchgate.net/publication/256980328_The_Intelligent_Essay_Assessor 
  • Validity: The article is basically a report with interviews with Landauer and others, like ETS. It includes data from research, but technically isn't scientific research itself.
  • The journal did give some insightful information about the authors. We even have a face to put with the name.
So it's also important to dig into the types of research and how the research is designed and conducted. But, as we see from the notes, none of the citations are actually the research itself that would help us evaluate the impact of these types of programs. It's like we need a way to know when we are presented with research and where to find it when it's not present. And then, we need a clearinghouse of some type that would explain what all this stuff means. Right now, it feels like a big mess that's hard to untangle and arrive at the actual research. And that's just the beginning - we still have to make sense of the research and how it impacts our classrooms - ultimately, how it impacts the humans we serve. 



Part 5: Research? Science? An AI Example

So, we're trying to find the research behind AI scoring. Stuff we need to know as practitioners. The previous posts show how the participants at CREST evaluated the citations provided on Pearson's website about Automated Scoring. As we looked at each citation, we developed suggestions for understanding and evaluating what is research and what isn't. Here's the next citation and evaluation:

Citation: Foltz, P. W. (2007). Discourse coherence and LSA. In T. K. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), LSA: A road to meaning. Mahwah, NJ: Lawrence Erlbaum Publishing.

Here are our notes:

Authors: Foltz directs research at Pearson's Knowledge Technologies Group. He was a pioneer in AI scoring.


Source: LSA: A Road to Meaning, published by Lawrence Erlbaum Publishing. Erlbaum is a respectable publisher of research, from my limited experience.


Validity: The chapter referenced is in the section of the text about various essay scorers. The chapters report on the tools and are not peer-reviewed scientific research about the tools. The preceding and following sections of the book do address psychometric issues in assessment and scoring.

Purpose: In the abstract to this chapter, we learn that the chapter is not really about using AI to score but to compare: “For all coherent discourse, however, a key feature is the subjective quality of the overlap and transitions of the meaning as it flows across the discourse. LSA provides an ability to model this quality of coherence and quantify it by measuring the semantic similarity of one section of text to the next.” It's not really research about how well AI scoring works for assessment, as the website's presentation implies.
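To make the quoted idea concrete, here is a toy Python sketch, again assuming scikit-learn, of measuring "coherence" as the semantic similarity of one chunk of text to the next. Real LSA coherence work builds its semantic space from a large background corpus; this miniature version only shows the shape of the computation, and the sentences are invented.

```python
# Toy sketch of LSA-style coherence: how similar is each sentence to the
# next one in a reduced semantic space?
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def coherence(sentences, n_components=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    vectors = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tfidf)
    adjacent = [
        cosine_similarity(vectors[i:i + 1], vectors[i + 1:i + 2])[0, 0]
        for i in range(len(sentences) - 1)
    ]
    return float(np.mean(adjacent))

essay = [
    "The heart pumps blood to the rest of the body.",
    "That blood carries oxygen to the muscles and organs.",
    "My favorite lunch is a crunchy taco.",
]
print(round(coherence(essay), 2))  # average similarity of each sentence to the next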

Again, we see some places we need to go to find the research. And we see that comparing the semantic quality of one section of text to another isn't really the same as giving a kid a grade on a standardized assessment that will impact their graduation and the funding of their school community. Bias remains. And we still haven't found the research and science that we're looking for.

Discern Original Purpose

Furthermore, the abstract tells us something about the purpose (even if it is from 2007): 

LSA, the way the essay assessor works, was created to help machines understand the natural language of humans so the machine could do what the human wanted it to do. Reminds me of Oppenheimer. Understanding the atom can do a lot of good. But it can be used in ways that do harm. LSA was developed for the purpose of helping a machine understand language. I can't see that its purpose was to assign a score in an assessment regime. Also reminds me of that guy in Jurassic Park. Have we gotten so excited that we can do something that we forgot to think about whether or not we should?



Part 4: Is it research or science? Tips for Evaluating Citations.

Just because something is cited doesn't mean that it is research. It's called a reference. At CREST, I worked with a group of colleagues to establish some advice about how to evaluate citations to determine what research says and what we need to know about it as practitioners - people who work with kids and teachers.

In this post, we'll evaluate Pearson's third citation about Automated Scoring.

Citation: Foltz, P. W., Streeter, L. A. , Lochbaum, K. E., & Landauer, T. K. (2013). Implementation and applications of the Intelligent Essay Assessor. In Handbook of Automated Essay Evaluation, M. D. Shermis & J. Burstein (Eds.), pp. 68-88. Routledge: New York.

Authors:

Landauer and Foltz used AI scoring in their classes starting in 1994. Now they both direct research at Pearson's Knowledge Technologies Group. Lochbaum is VP of Technology Services at Pearson. As before, we've pointed out the possible bias. But I'd also have to say, if I were a company, I'd be all about hiring the best researchers in the field. It's not that these folks are evil - we just need to be aware that they get a paycheck from the company that sells the product.


Source:

Handbook of Automated Essay Evaluation, published by Routledge. Routledge is a respected publisher, from my limited experience in academia. I found the whole book online, as well as the chapter, Implementation and Applications of the Intelligent Essay Assessor.


Title: 

Maybe this should be obvious, but the title tells you a lot. The chapter is about Implementation and Applications of automated scoring. Does that answer our questions as practitioners? It should also tell us a bit about the audience.  Are they writing for us? 

Abstract: 

On a separate site, from the American Psychological Association, I looked for information about the chapter. The abstract of a text gives you an overview of the work. See it here. Now, just be aware, writers craft their own abstracts. Much of the text in the abstract also appears in the introduction of the chapter. In the abstract, we learn about the background of the authors, their use of the technology in their classes during the late '90s, and the association with Pearson. The abstract chronologically highlights the activities of developing the technology. We learn about prompts: "Describe the differences between classical and operant conditioning." and "Describe the functioning of the human heart." We learn that NLP, natural language processing (readability, grammar, and spelling), is combined with LSA (latent semantic analysis) to measure content and semantics.
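The abstract's description of combining surface NLP features with an LSA content measure suggests a regression-style model. Here is a purely hypothetical Python sketch of that combination, assuming scikit-learn; the chapter does not publish the IEA's actual features, data, or weights, so every feature name and number below is invented for illustration.

```python
# Hypothetical sketch of "NLP features combined with LSA": join surface
# measures with a semantic-content score and fit them to human ratings.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: spelling errors per 100 words, mean sentence length, LSA content similarity
features = np.array([
    [0.5, 18.0, 0.92],
    [1.2, 15.0, 0.81],
    [4.0,  9.0, 0.40],
    [2.5, 12.0, 0.63],
])
human_scores = np.array([4, 3, 1, 2])  # hypothetical rubric scores

model = LinearRegression().fit(features, human_scores)
new_essay = np.array([[1.0, 16.0, 0.75]])
print("predicted score:", round(float(model.predict(new_essay)[0]), 2))
```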

So is this helpful for grading essays? In the abstract, we learn that the machine can tell whether a writer is a high school student, an undergraduate, or a medical student responding to the heart prompt mentioned above.

We learn that the ELAR crowd wanted more - stuff about style, grammar, and feedback. So the program algorithms became more robust - up to 60 different criteria, including "trait scores such as organization or conventions." 

Common Core is mentioned to show a focus on what is valued and assessed in terms of "mastery and higher order thinking." Then, the article promises to address how the IEA (Intelligent Essay Assessor) will be used and how it works, even in comparison to human raters. 

So...the article is a history and summary of the project. It will include research, but it's NOT the research. 

The Article Itself:

Note that Write to Learn's essay feedback scoreboard (Pearson's Product) appears on page 70 of the article. Research IS reported and documented in the article. Lots of great information follows about how this stuff works. But, to really know about the quality and meaning of the research, we'd have to actually look at the stuff cited in the article. 


Drilling Down to the Citations

The internal citations throughout the article and the references on pages 86-88 are the places we would head next to validate and search for the actual research. The chapter is a great place to begin, but the chapter itself isn't the research. 

Admission of Conflict of Interest

Scholarly work will show when conflicts of interest or associations are present. It's part of being honest. 




The chapter concludes with such a statement. And it is important that practitioners recognize the involvement that might color the way the research referenced in the article is presented. 

The Book and Table of Contents

Remember, this search is less about the particular authors and text and more about answering our questions and determining the content and quality of the research behind the topic. Sometimes, the book the chapter is in will have the kind of information we seek as practitioners. 


Chapter 11, on Validity and Reliability of Automated Essay Scoring, might be just the chapter we need to consider.

Bottom Line: 

Yep. The article posted on the Pearson website to validate the product is a reference. It's NOT research. But, the article is a bounty of background and a map of places to go in search of the research. 


Monday, January 29, 2024

Part 3: Is it research or science? Tips for Evaluating Citations

In evaluating citations, you might find out that something really doesn't count as scientific research. But, you can certainly learn a lot. 

Let's take the first two citations from Pearson's Automated Scoring Website. 

As we worked at the 2024 CREST conference session, we looked at the first two citations and developed some important ideas and tips for evaluating citations. 

Part of evaluating the citations comes from looking at the parts of the citation itself: author, venue/source, accessibility of the link. 

We started by trying out the links. As you know, the links don't work. 

Next, we copied and pasted the entire citation to see if we could find it posted or cited elsewhere on an internet or generic scholar search. 

As explained before, we were able to look up the organization/conference. 

Citation One (2018, June)
We found a PDF of the conference handbook. From there, we were able to see information about the presenters and the session itself.


The Authors: On page 7 of the program, we see that the first presenter is J. Hauger, a member of the conference program planning committee and a member of the New Jersey Department of Education. On page 13 of the program, we see the session description and the remaining authors.


Lochbaum is listed first in the program, but wasn't listed first in the citation. (Changing the order of authorship/participation is a red flag; author order is part of accurate scholarship because it is used for tenure and for issues of academic integrity and intellectual property.) She's the VP of Technology Services at Pearson and Knowledge Technologies.

 
Quesen is listed as an Associate Research Scientist at Pearson. 

Zurkowski was listed in Pearson's citation, but not in the conference program. This is not an academic practice of integrity, and it is dishonest.

We also see from the program that another person from Pearson is associated with the presentation as the moderator for the session: Trent Workman, Vice President of School Assessment. Since this is a national conference on assessment, members from other states would be in attendance. Meeting potential clients at the conference would be beneficial. And there's nothing wrong with that. But it's not research. And it does show bias.

The Session Description: From the session description, we can see that the session is about how Pearson's assessment product benefits "school needs" and saves time. "Consistency, accuracy, and time savings" are offered as powerful enticements, but research is not the focus of this session. 

Citation Two (2018, June): We already know about Lochbaum. We did find a citation to the conference presentation in another work, but yet again, the order of authorship/participation was different.

Tips on Evaluating Citations

  • Find out who the authors are. Where do they work? What else have they written?
  • Make sure the citations match when considered/reported in multiple places.
  • Consider the dates. Y'all, this stuff is from 2016 and 2018. That's 6 to 8 years ago. ANCIENT history in the technology world. And most scholarship doesn't consider anything older than 5 years unless it's seminal research or philosophy.


Part 2: Is it research or science? Tips for Evaluating Citations

Here are some things we discovered when analyzing the citations on Pearson's website about Automated Scoring.

What do people usually do when they see citations on a website? It's easy to see accurately cited information and think that it's research. That would be a mistake. Some folks might go as far as trying to click on the links. Well, they don't work.

Evaluating Citations: DOI

Links fail. That's a reality. And some stuff gets posted that's old. Modern peer-reviewed research often comes with a DOI link: a digital object identifier, a permanent web address for published work. When webpages change, the DOI doesn't. If you don't have a DOI, sometimes you can find the article here. Respectable research is out there that doesn't have a DOI, so it's not a deal killer if what you are looking at doesn't use one. And you might have to reach out to a librarian or professor to get access to some articles. But citations with a DOI are preferred when I'm doing research.
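If you want to see that permanence for yourself, here is a tiny Python sketch (assuming the requests package is installed) that resolves a DOI through doi.org. I'm using the DOI of the Compton-Lilly et al. article cited later in this post; the resolver redirects the permanent identifier to wherever the publisher currently hosts the article.

```python
# Small sketch of DOI resolution: doi.org redirects the permanent
# identifier to the publisher's current page for the article.
import requests

doi = "10.1002/trtr.2258"  # Compton-Lilly, Spence, Thomas, & Decker (2023)
response = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
print("Permanent identifier:", f"https://doi.org/{doi}")
print("Currently resolves to:", response.url)
```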

Conference Presentations aren't Research

Pearson lists two conference presentations in the first two citations. 

Presented at the National Conference on Student Assessment (NCSA), San Diego, CA. and Presented at the National Conference on Student Assessment (NCSA), Philadelphia, PA. 

Researchers love to present at these things to share and celebrate their findings. While presentations can contain research, they aren't research in themselves; we don't know what kind of research they contain, and we have no way to evaluate the source and product. Especially when the conference links are broken.

Usually, it's tough to be approved to present at national conferences. So even though the presentations aren't research, we do learn that the presenters were actively involved in presenting and refining their professionalism. 

Consider Affiliations and Origins

We also need to note the name of the conference. These folks are all about student assessment. That's their focus. We need to keep that in mind. As practitioners look at research and researchers, is that our focus? Usually not. Usually, our focus is on the people who have to take the assessments. That changes things for us and how we look at their activities in comparison to ours.

It's probably also important to go look at the websites of these conferences and ask some important questions:
Why do they exist? What kind of people belong and attend? Well, they are associated with the Council of Chief State School Officers, state education leaders, and assessment practitioners. They "support states in implementing high-quality assessments and robust accountability systems, ultimately driving better outcomes for all students." That sounds like politics and state agencies to me.
Who gives them money? Pearson was a Platinum sponsor for 2024. (There is also no evidence that proposals are blind or double-blind for selection. This means that those selecting the presenters know who the presenters are. This could cause bias, especially since Pearson sponsors the conference - and, as you will see later, some of the presenters also work for Pearson.)

Ultimately, we have to ask: Do these answers resonate with our purposes, people, and needs? 



Is it Research? Is it Scientific Research? An Example with AI Scoring Citations

Y'all. Some of you are going to get mad at me for some of this. But stick around for a minute or two to follow the ideas. I'm going to be really direct: We are being told that things are researched. We are told that ideas are scientific. Some of that isn't true. We MUST develop the skills to evaluate what is behind the statements, policies, programs, politics, and published material. We MUST know what is research and what isn't. We must also develop skills to find the red flags and warnings. 

First, some background and a note of affiliation:

I'm a member of a group called International Literacy Educator's Coalition. We've been working on this stuff: 


ILEC advocates for educators and families to make responsive, research-informed decisions for literacy learning that realize the potential of diverse learners to be literate, critical thinkers.

We are an international group of teachers, teacher educators, literacy scholars, researchers, and concerned parents who are dedicated to promoting literacy learning practices that enable all children and youth to realize their full potential as literate, thinking human beings.   We reject top-down, science of reading mandates that force teachers to commit educational malpractice.  We believe that teachers should be able to make the research-based decisions that are best for their students. 

And recently, we've been focused on what research really is and what it says about the science of reading. Controversial, I know.

An Example of ILEC Work Regarding Research:


Here's a recent contribution from Dr. Paul Thomas and some of his colleagues. It provides a bit of background about how research doesn't always say what we are told it does.

Compton-Lilly, C., Spence, L. K., Thomas, P. L., & Decker, S. L. (2023). Stories Grounded in Decades of Research: What We Truly Know about the Teaching of Reading. The Reading Teacher. ILA.

LINK: https://ila.onlinelibrary.wiley.com/doi/10.1002/trtr.2258


ARTICLE OVERVIEW

Stories Grounded in Decades of Research illustrates key ideas including research informants that challenge the “Science of Reading” (SoR) debates. The authors draw from a “multi-faceted and comprehensive” view of literacy using reputable research support that includes authentic student-centered observations. In combination, this offers a platform for understanding not just the WHAT but WHY of responsive professional decision-making that includes child-informed references. Contrary to the ‘simple and settled’ view of The Science of Reading, the authors position literacy as “complex, multidimensional, and mediated by social and cultural practices.”

How Do Laymen Tell if It's Research? An AI Example

So, these folks are admirable geniuses. But I'm a regular kind of folk. How am I supposed to figure this out for myself?

Lately, I've been studying AI and essay scoring. So I went to the Pearson website to read more about it. I know that TEA has contracted with Pearson and Cambium to work on assessment. I do NOT know if they are using Pearson's tool to score STAAR stuff. More on that later.

Have a look at Pearson's website about AI scoring. Looking at all the items on the site, what is their purpose?
  • Persuasive: Sales. There's a shopping cart and links to other things you can purchase.
  • Informational: Note the breadcrumb: Large Scale, K-12 Assessments, and Automated Scoring.
  • Argumentative: The good things it provides and its development over time.
  • Informational/Persuasive: The concept of continuous flow and why that's a good thing.
  • Persuasive: It has solutions and is innovative.

Questionable: But then we get to the bottom of the page, with Selected Research and White Papers and an article called Automated Scoring: 5 Things to Know, And A History Lesson (sic). Is it informational, persuasive, argumentative, or propaganda? Is it even research?

Let's take the components one at a time and call out guidelines for telling the difference between research, science, genre, and purpose.

Internal Links and Articles are Not Often Research:

The article, Automated Scoring: 5 Things to Know, And A History Lesson, goes to another page on the Pearson website. It's stuff they wrote about their own product. That's called bias. And it's a red flag when you are looking for research.

The pages give a bit of background information about who is involved, where they worked, and on what topics. The pages answer some basic questions (and concerns) about what AI scoring is, the kind of thing that would clear up frequently asked questions. It's definitely worth reading. But it isn't research. And its purpose is to make us think good things about the product. Its purpose is to put us at ease.

Un-cited, Unlinked, and Unproductive Searches are Not Often Research


Citations, Please?
Note also the callout and quoted text at the beginning of the text. It looks like they are citing something from LearnEd: News about Learning. As if LearnEd is a company and News about Learning is a publication. But there's no citation. There's no link. And when you try to find the source...nothing like that pulls up on Google.

And that's another red flag when you are looking for research. Stuff with no links or references isn't research. And stuff that's made to look like something it isn't is flat-out deception.

We can contact the company and ask for this publication. But a conscientious and respectable approach would already include references.

We do find helpful information in this article. We find out the background of Pearson's VP of Automated Scoring. Karen Lochbaum sounds like a person I would admire - a PhD from Harvard and a person who was involved from the beginning in research about how computers understand language. Impressive. And Dr. Lochbaum sounds like a person I would like, one who wanted to be a part of solutions. But note - she has a PhD in computer science, not assessment. She studies how computers understand language, not how language is evaluated for high-stakes assessment. She writes software for Pearson in the automated scoring division - and gets a paycheck from the people who sell this stuff to Texas.

The webpage is a place to begin researching. But it is not research.

Note also that the webpage promotes a program called Write to Learn, also a Pearson product. That's another example of sales purposes and bias. A red flag for evaluating research.

White Papers are Not Research


Note the titles of the research and white papers included before the citations at the bottom of Pearson's webpage on Automated Scoring.

White papers might include research, but they also include the author's point of view. Often, the point of view is also that of the entity - in this case, Pearson. That's an example of bias and a red flag for research. A lot of times, white papers give position statements and the approach a company takes with an idea or product. You can read more about this in the LibGuide the University of Massachusetts Lowell compiled about white papers. Bottom line: white papers are often persuasion and function as mini commercials in marketing programs. White papers are not research.

Next Up:

I presented this information at CREST at the end of January (the 29th). I worked with a group of colleagues to evaluate the citations and to develop guidelines to help laypeople evaluate research statements. In these blog posts, I'll share what we found for each of the citations listed as research and white papers that Pearson used to support their AI automated scoring programs.