Rasch-based comparison of items created with and without generative AI

Karla Karina Ruiz Mendoza, Luis Horacio Pedroza Zúñiga

Abstract


This study explores the evolving interaction between Generative Artificial Intelligence (AI) and education, focusing on how technologies such as Natural Language Processing, and specific models such as OpenAI's ChatGPT, can be used in high-stakes examinations. The main objective is to evaluate the ability of ChatGPT version 4.0 to generate written-language assessment items and to compare them with items written by human experts. The pilot items were developed for the Higher Education Entrance Examination (ExIES, for its Spanish initials) administered at the Autonomous University of Baja California. Item Response Theory (IRT) analyses were performed on responses from 2,263 test-takers. Results show that although the ChatGPT-generated items tend to be more difficult, both item sets exhibit comparable fit to the Rasch model and discriminate across varying levels of student ability, with the ChatGPT items showing a slightly higher capacity to differentiate among students of different skill levels. These findings suggest that Generative AI can effectively complement exam developers in creating large-scale assessments. In conclusion, the study underscores the value of continuing to explore AI-driven item generation as a means of enhancing educational assessment practices and improving pedagogical outcomes.
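The comparison described above rests on fitting a Rasch model to a persons-by-items 0/1 response matrix and contrasting the estimated difficulties of the two item sets. As a rough, hypothetical illustration of that idea only (the authors would have used dedicated IRT software; the function name, the simulated data, and the 0.4-logit gap below are all invented for this example, not taken from the ExIES data), a minimal joint-maximum-likelihood sketch in Python/NumPy might look like this:

```python
import numpy as np

def fit_rasch_jml(X, n_iter=100, tol=1e-6):
    """Joint maximum-likelihood fit of the Rasch model
    P(correct) = 1 / (1 + exp(-(theta_i - b_j)))
    for a persons-by-items 0/1 response matrix X."""
    n_persons, n_items = X.shape
    theta = np.zeros(n_persons)   # person abilities
    b = np.zeros(n_items)         # item difficulties
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        info_t = np.maximum((p * (1 - p)).sum(axis=1), 1e-9)
        info_b = np.maximum((p * (1 - p)).sum(axis=0), 1e-9)
        # One Fisher-scoring step per parameter block; abilities are
        # clipped so extreme scorers cannot blow up numerically.
        theta_new = np.clip(theta + (X - p).sum(axis=1) / info_t, -6, 6)
        b_new = b + (p - X).sum(axis=0) / info_b
        b_new -= b_new.mean()     # identify the scale: mean difficulty = 0
        done = max(np.abs(theta_new - theta).max(),
                   np.abs(b_new - b).max()) < tol
        theta, b = theta_new, b_new
        if done:
            break
    return theta, b

# Illustrative simulation: items 0-9 play the role of human-written
# items, items 10-19 the AI-generated ones, made 0.4 logits harder.
rng = np.random.default_rng(0)
true_theta = rng.normal(0.0, 1.0, 500)
true_b = rng.normal(0.0, 1.0, 20)
true_b[10:] += 0.4
p_true = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
X = (rng.random(p_true.shape) < p_true).astype(int)

# Drop all-correct / all-wrong respondents: their JML estimates diverge.
scores = X.sum(axis=1)
X = X[(scores > 0) & (scores < X.shape[1])]

theta_hat, b_hat = fit_rasch_jml(X)
print("estimated difficulty gap (AI minus human items):",
      round(b_hat[10:].mean() - b_hat[:10].mean(), 2))
```

A full analysis of the kind the abstract reports would also examine item fit statistics (infit/outfit mean squares) and person-separation reliability, which this sketch omits.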


Keywords


Artificial Intelligence, ChatGPT, educational evaluation, test, digital technology



DOI: https://doi.org/10.3926/jotse.3135


This work is licensed under a Creative Commons Attribution 4.0 International License

Journal of Technology and Science Education (OmniaScience). Online ISSN: 2013-6374; Print ISSN: 2014-5349.