Evaluating the Effectiveness of Large Language Models for Automatic Question Generation in International Chinese Reading

Jing, H.; Xu, J.

A double-blind peer-reviewed online publication with in-print supplement since 2010. ISSN: 1949-260X

JTCLT Abstract

Volume 17 Number 1, 2026

Jing, H., & Xu, J. (2026). Evaluating the Effectiveness of Large Language Models for Automatic Question Generation in International Chinese Reading. Journal of Technology and Chinese Language Teaching, 17(1), 30-47.
[景宏伟, & 徐娟. (2026). 大语言模型在国际中文阅读自动出题中的效能评估. 科技与中文教学 (Journal of Technology and Chinese Language Teaching), 17(1), 30-47.]

Abstract/摘要：

As the digital transformation of international Chinese language education advances, traditional manual test item development increasingly faces bottlenecks in efficiency, cost, and scalability. Against this backdrop, artificial intelligence technologies, particularly large language models (LLMs), have opened up new possibilities for automated test item generation. This study focuses on HSK Level 6 reading comprehension items and systematically evaluates the practical effectiveness of LLMs in automatic item generation. Four LLMs were tested with prompt engineering, covering instruction-tuned models, reasoning-oriented models, and a domain-specific model fine-tuned with Low-Rank Adaptation (LoRA). The generated items were evaluated across six dimensions: linguistic fluency, content accuracy, item complexity, distractor quality, answer uniqueness, and item-type diversity. Machine-based metrics, including BLEU, ROUGE, and Distinct, provided complementary assessments. The results indicate that LLM-generated items are highly similar to those developed by human experts. However, the models remain unstable in controlling item difficulty and ensuring answerability, so human review and revision are still necessary before the generated items can be used in instructional settings. Among the models evaluated, reasoning-oriented LLMs achieved the best overall performance. Based on these findings, the study offers practical recommendations for optimizing the item generation process and advancing a human-AI collaborative approach to test development.

随着国际中文教育数字化转型的持续推进，传统人工命题模式在效率、成本与规模化应用方面的瓶颈日益凸显。在此背景下，以大模型为代表的人工智能技术，为自动命题提供了新的技术路径。本研究以HSK6级阅读理解题为研究对象，系统评估大模型在自动命题任务中的实际效能。研究选取四种大模型，结合提示工程开展实验，涵盖指令大模型、推理大模型以及经LoRA微调后的垂直大模型。并从语言流畅度、内容准确性、题目复杂度、选项干扰性、答案唯一性、题型多样性六个维度，辅以BLEU、ROUGE、Distinct等机器指标，对生成题目进行综合评估。研究表明：大模型生成的题目与人工命题具有较高的相似性，但在难度控制、可回答性等方面尚存在不稳定性，需经人工审核修订后方可用于教学；在模型对比中，推理大模型整体表现更优。基于此，本研究进一步提出相应的使用建议，以优化题目生成过程，推动人机协同命题模式的发展。

This website is supported by
Department of World Languages, Literatures, and Cultures, Middle Tennessee State University