Abstract:
This paper introduces and applies a novel methodological framework for the comprehensive evaluation of Large Language Models (LLMs) as educational co-authors. It addresses the lack of structured, reproducible processes for assessing the quality and practical utility of AI-generated educational content. In a controlled experiment, a standardised task was assigned to several leading LLMs, and each model was evaluated within the newly developed framework, which measures both final product quality and the “refinement effort” required to reach it. The results reveal a non-linear evolution in model capabilities and highlight persistent shortcomings in complex reasoning. These findings underscore the importance of a holistic evaluation that considers both the final output and the efficiency of the collaborative process, offering a practical tool for educators and researchers.
International Scientific Multidisciplinary Conference: AI for a Smarter Tomorrow - AI-SMART, September 25-26, 2025
Creative Commons Non Commercial CC BY-NC: This article is distributed under the terms of the Creative Commons Attribution-Non-Commercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission.


