Rasch-based comparison of items created with and without generative AI
Abstract
This study explores the evolving interaction between Generative Artificial Intelligence (AI) and education, focusing on how technologies such as Natural Language Processing and specific models like OpenAI’s ChatGPT can be used in high-stakes examinations. The main objective is to evaluate the ability of ChatGPT version 4.0 to generate written language assessment items and to compare them with items created by human experts. The pilot items were developed for the Higher Education Entrance Examination (ExIES, according to its Spanish initials) administered at the Autonomous University of Baja California. Item Response Theory (IRT) analyses were performed on responses from 2,263 test-takers. Results show that, although ChatGPT-generated items tend to be more difficult, both item sets exhibit comparable Rasch model fit and discriminatory power across levels of student ability. This finding suggests that Generative AI can effectively complement exam developers in creating large-scale assessments. Moreover, the ChatGPT 4.0 items show a slightly higher capacity to differentiate among students of different skill levels. In conclusion, the study underscores the importance of continuing to explore AI-driven item generation as a potential means of enhancing educational assessment practices and improving pedagogical outcomes.
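For context, the Rasch model fit referred to above is based on the standard dichotomous Rasch formulation, in which the probability of a correct response depends only on the difference between person ability and item difficulty. The equation below is the conventional expression of that model (it is not reproduced in the abstract itself and is included here only as a reference sketch):

$$P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}$$

where $\theta_n$ is the ability of test-taker $n$ and $b_i$ is the difficulty of item $i$. Under this model, comparing human-written and ChatGPT-generated items amounts to comparing their estimated difficulties $b_i$ and their fit statistics.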
Keywords
DOI: https://doi.org/10.3926/jotse.3135
This work is licensed under a Creative Commons Attribution 4.0 International License