European Journal of Educational Research
Research Article

Enhancing Readability Assessment for Language Learners: A Comparative Study of AI and Traditional Metrics in German Textbooks

Bora Başaran


Abstract:

Text readability assessment is a fundamental component of foreign language education because it directly determines students' ability to understand their course materials. Whether current tools, including ChatGPT, can measure text readability precisely remains uncertain. Readability describes the ease with which readers can understand written material; vocabulary complexity, sentence structure, syllable counts, and sentence length determine its level. Traditional readability formulas rely on data from native speakers and therefore fail to address the specific requirements of language learners, underscoring the need for specialized assessment approaches in foreign language instruction. This research investigates the potential of ChatGPT to evaluate text readability for foreign language students. Selected textbooks were analyzed with ChatGPT to determine their readability levels, and the results were evaluated against traditional readability assessment approaches and established formulas. The research aims to establish whether ChatGPT provides an effective method for evaluating educational texts for foreign language instruction. Beyond technical aspects, the study examines how this technology may influence students' learning experiences and outcomes: ChatGPT's text clarity evaluation capabilities could support innovative educational tools and yield lasting benefits for classroom practice. For example, ChatGPT's readability classifications correlated strongly with Flesch-Kincaid scores (r = .75, p < .01), and its mean readability rating (M = 2.08, SD = 1.16) confirmed its sensitivity to text complexity.

Keywords: Educational technology, foreign language education, learning materials, readability assessment, text analysis.


Introduction

As education continues to evolve, foreign language instruction faces the distinctive challenge of matching learning materials to student capabilities (Kavrayici et al., 2025). Readability assessment plays an essential role in this process, supporting learning outcomes, curriculum development, and student motivation. German presents particular learning challenges because of its complex grammar and syntax, so educational materials must be suited to students' proficiency levels.

The Flesch Reading Ease and Flesch-Kincaid Grade Level are traditional readability formulas that measure text complexity through sentence length and word frequency analysis. Applying these metrics in foreign language settings proves inadequate because they were created for native speakers and focus on surface text characteristics; they do not detect the linguistic complexities that are vital for language learners to understand. Artificial intelligence (AI) advancements have brought new possibilities for evaluating text complexity. The language analysis and generation capabilities of ChatGPT provide more detailed insights than traditional formulas (Huang et al., 2024; Shaitarova et al., 2024), and AI-powered systems have the potential to transform readability assessment by evaluating idiomatic expressions, discourse coherence, and grammatical intricacies, which traditional metrics do not measure.

This research investigates how ChatGPT performs in evaluating the readability of German foreign language textbooks. The research compares AI-generated assessments to established formulas to determine whether AI tools provide a more precise and linguistically sensitive alternative. The study advances the discussion about AI implementation in foreign language education while examining how AI could transform material selection and curriculum planning processes. These perspectives are essential for understanding the evolving role of AI in language teaching.

Literature Review

Readability plays a vital role in foreign language instruction by aligning text complexity—determined by lexical, syntactic, and textual features—with learners' proficiency levels, thereby supporting vocabulary acquisition, grammar learning, and overall academic success (N. T. D. Nguyen & Lam, 2023; Omer & Al-Khaza'leh, 2021). The following sections explore its pedagogical significance and practical applications.

Readability in Language Pedagogy

Readability assessment enables teachers to select materials that match students' proficiency levels, which supports language learning success and sustains motivation (Kamal, 2019; Kasule, 2011; Maqsood et al., 2022; Nassiri et al., 2018). By aligning text complexity with learner abilities, it also fosters vocabulary acquisition and grammatical development (Brooks et al., 2021; Hu et al., 2022; Parhadjanovna, 2023). Reading tasks support students in developing monologue and text response skills in language classes (Gordeeva & Usupova, 2023; Zheleznova, 2019). Individualized instruction, grounded in sound teaching principles, improves reading abilities in English as a foreign language (Chen, 2023; Noviyenty, 2017; Zhu, 2024). Effective readability integration requires understanding learners' background knowledge, cognitive styles, and engagement patterns. Therefore, readability should be considered in conjunction with other pedagogical factors when designing curricula.

Text complexity can be evaluated using readability indices such as Gunning Fog and Flesch-Kincaid to determine appropriate student difficulty levels (Bakuuro, 2024). Traditional first-language readability indices, however, do not accurately predict EFL difficulty. According to Brown (1998), linguistic features such as sentence syllable count, word frequency, and function word percentage are better indicators of EFL difficulty, which points to the need for EFL-specific readability indices. AI systems have the potential to enhance conventional educational methods while simultaneously reducing teacher workload. Language education is a domain where readability plays a crucial role in ensuring material suitability and learner comprehension.

Traditional Readability Metrics: Approaches and Limitations

1. Flesch-Kincaid Grade Level (F-K) (Flesch, 1948)

Created by Rudolf Flesch (1948) and later refined by J. Peter Kincaid et al. (1975), this metric calculates readability using the following formula. It has been applied across various contexts to estimate the readability of English texts:

Grade Level = (0.39 × W) + (11.8 × S) - 15.59

W = Average number of words per sentence

S = Average number of syllables per word

2. Automated Readability Index (ARI) (Senter & Smith, 1967)

Automated Readability Index (ARI): Developed by Senter and Smith (1967) for the U.S. Air Force, this formula estimates the U.S. school grade level required to understand a text by calculating the number of characters per word and words per sentence, unlike other formulas that rely on syllable counts:

ARI = (4.71 × C) + (0.5 × W) - 21.43

C = Average number of characters per word

W = Average number of words per sentence

3. Läsbarhetsindex (LIX) (Björnsson, 1968)

Proposed by Björnsson in 1968, this formula was initially developed for Swedish texts but has proven effective across multiple languages. It calculates readability based on average sentence length and the percentage of long words (more than six characters).

LIX = (W / S) + (L × 100 / W)

W = Number of words in the text

S = Number of sentences in the text

L = Number of long words (more than 6 characters)

4. Simple Measure of Gobbledygook (SMOG) (McLaughlin, 1969)

Created by Harry McLaughlin (1969), SMOG evaluates text complexity by counting polysyllabic words (three or more syllables) per 100 words. It is particularly valued for its high correlation with actual text comprehension.

SMOG = 1.0430 × √(P × (30 / N)) + 3.1291

P = Number of polysyllabic words (words with three or more syllables)

N = Number of sentences in the sample

5. Coleman-Liau Index (CLI) (Coleman & Liau, 1975)

Developed by Coleman and Liau (1975), this formula measures readability based on the average number of letters and sentences per 100 words, and it is expressed as follows:

CLI = (0.0588 × L) - (0.296 × S) - 15.8

L = Average number of letters per 100 words

S = Average number of sentences per 100 words

Higher scores indicate more complex texts.
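
To make the five formulas concrete, the following is a minimal Python sketch that computes each score from pre-computed counts. How words, sentences, syllables, and characters are counted (non-trivial for German) is left to the surrounding tooling and is an assumption here, not part of the published formulas:

```python
import math

def flesch_kincaid(words: int, sentences: int, syllables: int) -> float:
    """Grade Level = (0.39 * W) + (11.8 * S) - 15.59"""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def ari(characters: int, words: int, sentences: int) -> float:
    """ARI = (4.71 * C) + (0.5 * W) - 21.43"""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

def lix(words: int, sentences: int, long_words: int) -> float:
    """LIX = (W / S) + (L * 100 / W); long words exceed six characters."""
    return words / sentences + long_words * 100 / words

def smog(polysyllables: int, sentences: int) -> float:
    """SMOG = 1.0430 * sqrt(P * (30 / N)) + 3.1291"""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

def coleman_liau(letters_per_100: float, sentences_per_100: float) -> float:
    """CLI = (0.0588 * L) - (0.296 * S) - 15.8"""
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8
```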

Table 1. Readability Level Distribution

Readability Level F-K ARI LIX SMOG CLI
Very Easy ≤ 5 1-5 < 25 ≤ 5 ≤ 6
Easy 6-8 6-8 25-35 6-8 7-9
Moderate 9-12 9-12 35-45 9-12 10-12
Difficult 13-16 13-16 45-55 13-16 13-15
Very Difficult 17+ 17+ > 55 17+ 16+

Readability metrics function as essential evaluation tools for measuring the complexity and accessibility of written content. Writers and educators use these measurements to verify that their content aligns with the reading capabilities of their target audience. This research uses a five-level readability classification system with practical applications in mind. The five levels—Very Easy, Easy, Moderate, Difficult, and Very Difficult—provide a more detailed text complexity analysis than traditional three-level systems. The extended classification enables precise identification of both the audience group and the reading requirements of written materials. A three-level system typically places text into basic categories of 'difficult' or 'easy', whereas the five-level system distinguishes, for example, undergraduate-level content (Difficult) from graduate or professional material (Very Difficult). This detailed classification shows its greatest value when evaluating academic or technical documents, where small variations in complexity directly affect reader understanding. Multiple well-known readability metrics (Coleman-Liau Index, Flesch-Kincaid Grade Level, ARI, and SMOG) work together within this system to create a complete evaluation of text accessibility across various readability dimensions.
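
As a sketch of how the five-level system operates in practice, the helper below maps a raw score to a Table 1 band. The exact handling of boundary values (e.g., a LIX score of exactly 25) is an assumption where the table is silent:

```python
BANDS = {
    # Upper bounds for Very Easy, Easy, Moderate, Difficult (per Table 1);
    # anything above the last bound is Very Difficult.
    "F-K":  (5, 8, 12, 16),
    "ARI":  (5, 8, 12, 16),
    "LIX":  (25, 35, 45, 55),
    "SMOG": (5, 8, 12, 16),
    "CLI":  (6, 9, 12, 15),
}
LABELS = ("Very Easy", "Easy", "Moderate", "Difficult", "Very Difficult")

def classify(metric: str, score: float) -> str:
    for label, upper in zip(LABELS, BANDS[metric]):
        if score <= upper:
            return label
    return LABELS[-1]

print(classify("LIX", 40.3))   # Moderate
print(classify("CLI", 16.41))  # Very Difficult
```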

The research concentrates on traditional readability metrics instead of machine learning methods for five main reasons:

1. Proven track record: These formulas have been extensively validated through decades of empirical research.

2. Transparency: The mathematical foundations and calculations are clear and reproducible.

3. Language independence: These metrics rely on surface-level text features that can be applied across different languages without requiring large training datasets.

4. Computational efficiency: Traditional formulas can be implemented with minimal computational resources.

5. Interpretability: The results are easily understood by educators and researchers without requiring expertise in machine learning.

Machine learning methods for readability assessment show potential but require extensive language-specific training data and adapt poorly across languages and contexts. Traditional readability metrics, by contrast, enable consistent readability assessment across languages because they rest on a well-defined framework.

The Flesch-Kincaid assessment tool, along with other traditional metrics, remains widely used to measure text complexity, yet it fails to measure text difficulty accurately because it evaluates only quantifiable characteristics such as word length and sentence structure (DuBay, 2004). These formulas concentrate on surface-level characteristics while neglecting complex linguistic elements that affect comprehension, especially in online texts, technical documentation, and domain-specific content (Crossley et al., 2017). Classical readability formulas such as Flesch-Kincaid and LIX evaluate readability through word and sentence length measurements and therefore remain popular in educational settings despite their known restrictions (Chall, 1996). Combining cognitive and semantic measures with human qualitative evaluations provides better readability insights, but traditional metrics neglect domain-specific jargon and contextual complexities (Redish, 2000; Redmiles et al., 2018).

Studies have investigated enhanced readability assessment methods that combine natural language processing with crowd-sourced ratings and machine learning techniques (Crossley et al., 2017). These newer methods demonstrate stronger predictive capabilities than traditional formulas by evaluating text-base processing, situation models, and cohesion (Crossley et al., 2017). Tools like Coh-Metrix provide extensive linguistic analyses that address several limitations of traditional metrics, especially their inability to adapt to adult or technical texts, where layout and retrieval aids significantly affect readability (Crossley et al., 2017; Redish, 2000). Traditional metrics, in contrast, focus primarily on quantifiable features such as word and sentence length while overlooking critical comprehension factors, like reader background and text purpose (Plung, 1981). Emerging methods that combine cognitive and semantic analysis with qualitative assessments show potential for improving readability evaluation, but computational models for readability assessment require further development, partly because new methods demand substantial computational power and standardized ways of evaluating subjective reader responses across different populations. The connection between computational research and educational and psychological insights remains essential for future developments, according to Vajjala (2022).

AI-Inspired and Machine Learning Approaches to Readability

Traditional readability metrics serve as basic assessment tools; in recent years, however, researchers have developed methods that incorporate additional linguistic elements. Advanced methods assess readability by evaluating word frequency together with morphological analysis and parse tree depth to generate more detailed readability results. Machine learning approaches have advanced the field substantially. Support Vector Machines (SVMs) have been applied to study lexical, syntactic, and morpho-syntactic features in research by Petersen and Ostendorf (2009). The analysis of sequential text complexity has been developed through neural network models that utilize Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNNs) (Hochreiter & Schmidhuber, 1997; Lo Bosco et al., 2019).

Language modeling techniques have proven essential for readability prediction tasks (Collins-Thompson & Callan, 2004; Si & Callan, 2001). The integration of syntactic complexity features, including parse tree height and entity coherence, has been studied by researchers such as Schwarm and Ostendorf (2005) and Heilman et al. (2007). Statistical and traditional measures can be combined to evaluate features including noun phrases, verb phrases, and subordinate clauses per sentence (Schwarm & Ostendorf, 2005). Artificial Neural Networks (ANNs) improve readability assessment through their analysis of lexical and syntactic indicators, automating text complexity evaluation. This method delivers an automated alternative to traditional formulas while providing flexible application across different situations (Samawi et al., 2023). These methods have been applied primarily to English but show promise for German and other languages with diverse linguistic structures and complex syntax. Large Language Models (LLMs), including ChatGPT, present researchers with a fresh method to evaluate readability. LLMs analyze texts holistically, detecting both explicit and implicit indicators of complexity beyond traditional linguistic features and statistical models. Their syntactic analysis, together with semantic evaluation and discourse-level processing, allows for flexible, context-specific assessments. This study examines ChatGPT's potential by running the following prompt for text classification into predetermined readability categories:

“Analyze the readability of the following text and categorize it into one of the five levels: Very Easy, Easy, Moderate, Difficult, or Very Difficult. Consider factors such as sentence complexity, vocabulary difficulty, grammatical structures, and overall comprehensibility for a general audience.”
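
The study applied this prompt through the ChatGPT interface. As an illustration of how the same classification could be scripted, the sketch below uses the OpenAI Python client (v1.x); the model name is an assumption, since the paper does not specify a particular model version:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Analyze the readability of the following text and categorize it into "
    "one of the five levels: Very Easy, Easy, Moderate, Difficult, or Very "
    "Difficult. Consider factors such as sentence complexity, vocabulary "
    "difficulty, grammatical structures, and overall comprehensibility for "
    "a general audience."
)

def rate_readability(text: str) -> str:
    # Send the study's prompt followed by the sample text; return the label.
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat-capable model would do
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{text}"}],
    )
    return response.choices[0].message.content
```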

The flexibility of machine learning-based approaches, including LLMs, does not eliminate certain limitations. Such systems often depend on restricted datasets, use minimal classifiers, and analyze entire texts instead of sentences. The effectiveness of LLMs in detecting specific linguistic features across different proficiency levels remains unknown because they lack explicit readability assessment training (Pitler & Nenkova, 2008). Established readability formulas maintain their proven reliability and transparency, which keeps traditional metrics essential for cross-linguistic assessment.

Research Objectives and Questions

The research investigates how AI-based methods, including ChatGPT, compare to traditional readability metrics for evaluating German foreign language textbook readability. The research examines whether AI tools deliver superior text complexity evaluation compared to standard measurement formulas.

Key Research Questions:

1. How do readability scores generated by ChatGPT compare to those obtained from traditional readability formulas?

2. Can AI tools like ChatGPT provide more accurate and linguistically insightful assessments of text complexity in German foreign language textbooks?

3. What implications do AI-based readability assessments have for textbook selection and foreign language education?

The research investigates how AI readability tools can benefit education through their strengths and weaknesses while exploring their practical educational uses. The research results will expand knowledge about AI's function in language education, educational content creation, and instructional methods.

Methodology

This study uses a methodological framework that compares ChatGPT with traditional readability metrics to evaluate the readability of German foreign language textbooks. The methodology is designed to provide a thorough and systematic evaluation by using both quantitative and qualitative approaches to analyze and compare the effectiveness of various readability assessment tools.

The research design implements a comparative analytical design to evaluate the effectiveness of ChatGPT's readability assessment capabilities relative to traditional evaluation methods. The research includes a controlled comparison between ChatGPT-generated linguistic outputs and results from established readability formulas: the Flesch-Kincaid Grade Level (Flesch, 1948), the Automated Readability Index (ARI) (Senter & Smith, 1967), LIX (Björnsson, 1968), the Simple Measure of Gobbledygook (SMOG) (McLaughlin, 1969), and the Coleman-Liau Index (Coleman & Liau, 1975). Traditional measures such as sentence length and word frequency have been used for a long time; however, they do not detect the sophisticated linguistic elements found in educational texts.

Data Collection: Textbook Selection

The research utilizes a carefully curated collection of German textbooks designed for foreign students to compare educational materials across various learning environments and skill levels. The corpus consists of four textbooks: two publications from the Turkish Ministry of National Education and two from German educational publishers, at various CEFR (Common European Framework of Reference for Languages) levels. The corpus selection process followed specific criteria to establish validity and reliability. Textbooks had to display explicit CEFR level alignment, which enabled proper readability comparisons between proficiency levels; they had to be in active use in their respective educational systems at the time of analysis, reflecting contemporary language teaching practices; and only recent editions published during the previous five years were considered, to capture modern educational methods and teaching approaches.

The inclusion of textbooks from both Turkish and German educational contexts provides an opportunity to examine how different educational traditions and publishing approaches might influence text complexity and readability. This dual-context approach also allows for the investigation of potential cultural and pedagogical differences in text presentation and complexity progression across CEFR levels.

The textbook selection process focused on materials which received official recognition from their respective educational authorities. The Turkish textbooks received approval from the Ministry of National Education while German textbooks originated from established educational publishers who specialize in foreign language education materials.

The methodically organized corpus establishes a reliable base for studying readability patterns across educational settings and proficiency levels, which helps explain how foreign language education materials manage text complexity. The following textbooks were selected:

Table 2. Selected Textbooks

Textbook Name CEFR Level Year
Deutsch für Gymnasien Schülerbuch (Zabun, 2020) A1.1 2020
Kontext Deutsch als Fremdsprache (Koithan et al., 2021) B1 2021
Beste Freunde (Georgiakaki et al., 2019) A1.2 2022
Mein Schlüssel zu Deutsch (Aksu Güney et al., 2023) B1.2 2023

Sample Research Corpus Selection

Although simple random sampling (SRS) is the most common data selection method, alternative approaches can provide better efficiency and performance. Active learning minimizes text data labeling expenses, especially in cases with imbalanced classes, through document-specific targeting instead of random sampling (Miller et al., 2020). Modified Extreme and Double Extreme Ranked Set Sampling schemes deliver more powerful tests and more efficient estimates than SRS, according to Samawi et al. (2023). Bhushan et al. (2023) developed an efficient class of estimators that performs better than current methods for variance estimation in SRS. Model-based clustering through data mining techniques enables national surveys with small sample sizes to develop homogeneous strata, boosting sampling efficiency 1.7 times above SRS (Parsaeian et al., 2021). These alternative sampling methods offer researchers more efficient and cost-effective data collection and analysis options than traditional SRS in different research settings.

Table 3. "Deutsch für Gymnasien Schülerbuch" Texts

Sample Page Text
1 55 Meine Mutter kocht jeden Tag. Gemüse ist unser Hauptgericht. Obst ist unsere Nachspeise. Sie meint: „Gemüse und Obst sind sehr gesund.“ Ab und zu macht sie auch Hamburger und Pommes. Aber die Pommes bratet sie nie im Öl. Meine Mutter backt sie im Ofen. Sie meint: „Gesund leben ist auch abhängig vom Essen. Besonders zu viel Fett schadet der Gesundheit.“
2 80 Er heißt Florian. Er treibt immer Sport und ist ein Mitglied von einer Mannschaft. Die Mannschaft trainiert sehr oft zusammen. Manchmal traineren sie in der Woche vier Mal, manchmal drei Mal. Florian verpasst fast nie das Training, außer er ist krank. Florian wird selten krank. Er liebt das Spiel und das Training.
3 90 Nächste Woche ist Weihnachten. Frau Müller möchte einkaufen gehen. Sie will für ihre Familie Geschenke kaufen. In der Woche hat sie kaum Zeit. Am Wochenende ist die ganze Familie zusammen. Sie muss eine Lösung finden. Ihre Freundin sagt: „Kauf doch online! Du kannst Zeit sparen.“ Du wählst aus und sie liefern dir deine Geschenke nach Hause. Frau Müller findet die Idee toll. Sie setzt sich gleich vor den PC und sucht Geschenke für ihre Familie. Sie kauft für Herrn Müller eine Krawatte. Für Susi eine Kette und für Thomas eine Armbanduhr.

Table 4. "Kontext Deutsch als Fremdsprache" Texts

Sample Page Text
4 159 Naturparadiese und Wildnis sind viel näher, als man denkt, auch in Deutschland. Nur das Nötigste in den Rucksack packen, sich aus dem, was man in der Natur findet, ein Lager für die Nacht bauen und das Abendessen über dem Feuer zubereiten. All das ist möglich, wenn man weiß, wie. Und das erfährt man in diesem Ratgeber. Die Autorin ist Überlebenstrainerin und Expertin, wenn es um Überlebensstrategien in der Natur geht: Outdoor-Leben, Gefahren, Heilmittel aus der Natur. Dinge, auf die man nicht verzichten sollte. Rezepte und Antworten auf die Frage, was erlaubt ist und was nicht. Raus aus dem Alltag, rein in die Natur!
5 166 Das Ferne und Unbekannte zu erreichen, galt schon immer als Beweis für menschliche Leistungsfähigkeit. Dass es nun ausgerechnet der Rote Planet ist, liegt an den vielen Erkenntnissen, die Wissenschaftler in den letzten Jahren über den Mars gewonnen haben. Diese machen deutlich: Der Mars hat etwas zu bieten, nämlich nicht nur eine vielfältige Oberfläche, sondern auch eine Atmosphäre. Und spätestens seitdem die Sonde Phoenix nachgewiesen hat, dass es gefrorenes Wasser auf dem Roten Planeten gibt, sind die Forscher weltweit begeistert. Wasser ist nun einmal die Quelle allen Lebens. Möglicherweise gab es früher Leben auf dem Mars oder es existiert heute immer noch in irgendeiner Form, die wir nicht kennen. Je mehr Erkenntnisse wir darüber haben, umso mehr Wissen erhalten wir über das ganze Sonnensystem. Vielleicht werden wir Menschen den Mars besiedeln. Das ist zwar noch Zukunftsmusik, gilt aber als sehr wahrscheinlich. Denn wir müssten uns in zeitlicher Hinsicht kaum umstellen: Ein Marstag dauert mit 24,6 Stunden kaum länger als ein Erdentag. Die Temperaturen sind im Vergleich zu anderen Planeten fast angenehm, jedoch für uns ziemlich extrem. Einerseits erreichen sie tagsüber 20 °C, andererseits sinken sie nachts auf unter minus 80 °C. Außerdem gibt es genügend Sonnenlicht auf dem Mars, um Solarenergie zu erzeugen. Leben auf dem Mars – ein Traum? Schon der deutsche Astronaut Alexander Gerst sagte: „Wir Menschen sind Entdecker. Sobald wir Flöße bauen konnten, sind wir über Flüsse gefahren. Sobald wir Schiffe bauen konnten, sind wir hinter den Horizont gesegelt. Jetzt können wir Raumschiffe bauen, also fliegen wir ins All.“
6 173 Bildinhalte zu erfassen und auszuwerten ist wahrscheinlich die Fähigkeit im Bereich der künstlichen Intelligenz, die schon am weitesten perfektioniert ist. Beispiele dafür finden sich in fast jedem Smartphone: Die Lichtverhältnisse anpassen, den automatischen Fokus auf ein Gesicht legen, die Kamera genau dann auslösen, wenn man lächelt. So kommt man meistens zum perfekten Foto. Wer also ein Smartphone besitzt, verlässt sich beim Fotografieren auf KI. Und noch viel mehr Funktionen des Handys arbeiten damit. So merkt sich das Handy z. B., welche Funktionen und Einstellungen der Nutzer sehr oft wählt. Das Ergebnis: die morgendlich zugeschnittenen Nachrichten aus aller Welt, die Erinnerung an wichtige wiederkehrende Termine und die Ermittlung der idealen Weckzeit. Das alles spart dem Nutzer lange Sucherei und damit Zeit.

Table 5."Beste FreundeTexts

Sample Page Text
7 23 Aktion „Schlafhandys sammeln": Viele Handys sind alt. Niemand braucht sie, sie „schlafen" zu Hause. Deshalb heißen sie „Schlafhandys". Jugendliche der Mittelschule Bergheim sammeln diese Handys. Eine Recycling-Firma zahlt vier Euro für ein Handy. Für 200 Handys bekommt die Schule genug Geld für einen neuen Computer.
8 41 Das ist der Berliner Reichstag am Platz der Republik 1. Hier arbeitet das Parlament von Deutschland. Für euch interessant ist sicherlich die Glaskuppel: Man kann in die Kuppel steigen und in den Himmel schauen.
9 54 Hi Lukas, ich habe lange nichts von dir gehört oder gelesen. Alles klar bei dir? Was machst du im August? Ich möchte wieder mit meinem Bruder an den Ammersee ins Feriencamp fahren. Das war doch ganz toll, oder? Also mir hat es gut gefallen. Ich möchte dann einen Surfkurs machen und SUPen*. Was das ist, siehst du auf dem Foto. Und? Bist du wieder dabei? Dann ruf mich an oder schreib mir. Bis bald, Paul

Table 6. "Mein Schlüssel zu Deutsch" Texts

Sample Page Text
10 62 Die Weltklimakonferenz COP 23 fand zum zweiten Mal im November 2017 in Bonn in den Zelten auf einer Blumenwiese statt. Dieses Gelände konnte erst vor wenigen Wochen vor der Konferenz renaturiert werden. Die Mitarbeiter haben Tag und Nacht ununterbrochen gearbeitet. Dabei wurden 59.000 Quadratmeter Rasenfläche wiederhergestellt. Die Flächeneinzäunung vom ehemaligen Veranstaltungsgelände musste beibehalten werden. Damit konnte der neu verlegte Rollrasen auch geschützt werden. Die COP 23 war die erste Weltklimakonferenz, das mit dem EMAS-Zertifikat offiziell als umweltfreundlich ausgezeichnet wurde. EMAS steht für „Eco Management and Audit Scheme“ (Das Gemeinschaftssystem für das Umweltmanagement und die Umweltbetriebsprüfung). Auf der Konferenz wurden alle Ziele und Maßnahmen für die Umwelterklärung dokumentiert. Die Ziele und Maßnahmen wurden von einem Umweltgutachter mehrere Tage vor Ort geprüft und schließlich bestätigt.
11 114 Nachdem ich mich entschieden hatte, meinen Master in „Geschichte der Wissenschaften“ zu machen, erfuhr ich, dass man vor der Aufnahme viele Prüfungen zu bestehen hat. Ich musste beweisen, dass ich neben sehr guten Deutsch- und Englischkenntnissen auch befriedigende Französischkenntnisse besaß. Zusätzlich habe ich während meines Studiums die Gelegenheit wahrgenommen, vier Semester lang einen Arabischkurs zu belegen. Diese Sprachen sind für Wissenschaftshistoriker wichtig, denn die meisten historischen Quellen sind ins Türkische nicht übersetzt worden.
12 119-120 Schüler sollen ihre Position in Gruppen einschätzen können, aber das geht nicht mit Berichtsbewertungen. Das ist ein großer Nachteil. Schulen vermitteln nicht nur Wissen oder Werte, sie bereiten die Schüler auch auf das weitere Leben vor. In der Gesellschaft werden unsere Leistungen immer bewertet. Schulen müssen diese Realität vor Augen halten. Es wäre unfair, wenn man die Schüler mit komplett anderen Kriterien bewertet und nicht auf diese Gesellschaft vorbereitet. Außerdem müssen Kinder lernen, mit Erfolg und Misserfolg umzugehen. Negative Rückmeldungen oder Leistungseinbrüche sind ein Teil des Lebens. Jeder sollte darauf vorbereitet werden. Problematisch sind nicht Ziffernnoten, problematisch ist, wenn man nicht mehr weiter weiß.

Quantitative and Qualitative Analysis

The quantitative analysis involves compiling readability scores into datasets, followed by statistical methods such as mean comparisons, standard deviations, and correlation analyses. The goal is to compare ChatGPT-generated scores with those from traditional readability formulas to identify score patterns and evaluate relative effectiveness.

Prior to conducting these analyses, the normality of all readability metrics (F-K, ARI, LIX, SMOG, CLI) was tested using the Shapiro-Wilk test (Shapiro & Wilk, 1965) to ensure that the assumptions for parametric statistical techniques were met. The results confirmed that the data did not significantly deviate from normal distribution (p > .05 for all metrics), allowing the use of Pearson correlation and other parametric methods.
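
As an illustration, the same normality screening can be run with scipy.stats.shapiro on the twelve raw scores per metric (Flesch-Kincaid values taken from Table 7):

```python
from scipy.stats import shapiro

# Flesch-Kincaid scores for the twelve samples (Table 7).
fk = [3.27, 3.87, 3.22, 10.40, 8.87, 10.13, 9.08, 7.92, 3.05, 12.16, 13.35, 9.57]

w, p = shapiro(fk)
print(f"F-K: W = {w:.3f}, p = {p:.3f}")  # the paper reports W = .888, p = .111
```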

The qualitative dimension applies thematic analysis to ChatGPT’s feedback, focusing on the linguistic indicators it identifies in relation to the variables used in classical readability metrics. To ensure transparency and reproducibility, the exact prompt used to generate ChatGPT responses was:

“Analyze the readability of the following text and categorize it into one of the five levels: Very Easy, Easy, Moderate, Difficult, or Very Difficult. Consider factors such as sentence complexity, vocabulary difficulty, grammatical structures, and overall comprehensibility for a general audience.”

This mixed-methods approach enables a comprehensive assessment of the capabilities and limitations of both traditional formulas and AI-powered tools.

Findings/Results

The results of this study provide a comprehensive analysis of the readability scores generated by ChatGPT compared to traditional readability formulas. This chapter presents both quantitative and qualitative findings, highlighting the effectiveness and accuracy of AI-driven tools in assessing text complexity.

Table 7. Comparative Analysis

Sample F-K ARI LIX SMOG CLI ChatGPT ChatGPT Score
1 3.27 3.63 20.45 6.46 6.78 Very Easy 1
2 3.87 5.38 34.35 5.93 9.05 Very Easy 1
3 3.22 4.58 22.98 7.30 8.07 Very Easy 1
4 10.40 7.83 31.83 9.71 10.40 Moderate 3
5 8.87 10.56 40.3 10.95 13.66 Difficult 4
6 10.13 11.65 45.7 12.00 15.57 Moderate 3
7 9.08 12.01 39.72 10.25 16.41 Very Easy 1
8 7.92 8.20 44.33 11.37 11.52 Very Easy 1
9 3.05 1.41 14.25 6.54 4.06 Very Easy 1
10 12.16 14.28 45.46 12.34 19.11 Moderate 3
11 13.35 18.15 56.61 13.25 20.64 Moderate 3
12 9.57 11.98 42.13 11.12 16.60 Moderate 3
Mean ± SD (ChatGPT Score) 2.08 ± 1.16

To allow statistical comparison, the categorical outputs of ChatGPT were converted into numerical values: Very Easy = 1, Easy = 2, Moderate = 3, Difficult = 4, and Very Difficult = 5. Based on this transformation, ChatGPT's readability classifications (n = 12) yielded a mean score of 2.08 with a standard deviation of approximately 1.16 (Table 7). This numeric representation enabled comparative statistical analysis alongside traditional readability metrics, thereby enhancing transparency and facilitating the calculation of central tendencies and variation.
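
A minimal sketch of this conversion and of the reported summary statistics, using the twelve classifications from Table 7:

```python
from statistics import mean, stdev

LEVELS = {"Very Easy": 1, "Easy": 2, "Moderate": 3,
          "Difficult": 4, "Very Difficult": 5}

# ChatGPT labels for samples 1-12, in Table 7 order.
labels = ["Very Easy", "Very Easy", "Very Easy", "Moderate", "Difficult",
          "Moderate", "Very Easy", "Very Easy", "Very Easy",
          "Moderate", "Moderate", "Moderate"]

scores = [LEVELS[label] for label in labels]
print(f"M = {mean(scores):.2f}, SD = {stdev(scores):.2f}")  # M = 2.08, SD = 1.16
```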

The comparative analysis table presents the results of five readability metrics—Flesch-Kincaid (F-K), Automated Readability Index (ARI), LIX, SMOG, and Coleman-Liau Index (CLI)—alongside ChatGPT's qualitative classification of text difficulty. F-K and ARI are highly correlated, with ARI generally producing slightly higher values. The LIX scores differ greatly, from 14.25 (Very Easy) to 56.61 (Difficult), indicating a large range of text complexity. The F-K and ARI scores are lower than the SMOG and CLI scores. ChatGPT's classification agrees well with the lower F-K and ARI scores but is inconsistent for some moderate and difficult cases. For example, samples 1, 2, 3, and 9 have very low readability scores across all the metrics, which supports their “Very Easy” classification. However, samples 5, 6, and 12 have high SMOG and CLI values suggesting a “Difficult” level, yet ChatGPT labels only sample 5 as such, which may indicate that samples 6 and 12 are underestimated. An anomaly appears in sample 7: although its CLI is 16.41 and its LIX is 39.72, ChatGPT classifies it as “Very Easy” rather than “Moderate”. Likewise, samples 10 and 11 have the highest F-K (12.16, 13.35) and CLI (19.11, 20.64) values but are labeled “Moderate” rather than the “Difficult” level indicated by their numerical scores. In general, ChatGPT's classification agrees with F-K and ARI but tends to underestimate difficulty when LIX, SMOG, and CLI suggest higher text complexity. Adjusting the classifications for samples 6, 7, 10, and 11 to better reflect these metrics could improve consistency and accuracy.

Table 8. Descriptive Analysis of Raw Scores

Statistic F-K ARI LIX SMOG CLI
n 12 12 12 12 12
Mean 7.9075 9.1383 36.5092 9.7683 12.6558
SD 3.6593 4.8604 12.2439 2.5572 5.1644
Min 3.05 1.41 14.25 5.93 4.06
Q1 3.72 5.18 29.6175 7.11 8.805
Median 8.975 9.38 40.01 10.60 12.59
Q3 10.1975 11.9875 44.6125 11.5275 16.4575
Max 13.35 18.15 56.61 13.25 20.64

The readability metrics analysis provides an extensive summary of text complexity. The mean Flesch-Kincaid (F-K) score of 7.91 corresponds to a 7th-to-8th-grade reading level, with a median of 8.98. The standard deviation of 3.66 shows that text difficulty levels vary moderately.

The Automated Readability Index (ARI) yields an average of 9.14, which aligns with a 9th-grade reading level, as indicated by its median score of 9.38. The standard deviation of 4.86 shows that this measure has a wider spread than F-K.

The LIX score exhibits the greatest degree of variability, with a mean of 36.51 and a standard deviation of 12.24. The range of 14.25 to 56.61 indicates that texts vary from very easy to fairly difficult, while the median value of 40.01 suggests that most texts fall into the moderate to slightly difficult category.

The SMOG index stands as the most reliable readability measure because it shows the lowest standard deviation, at 2.56, while its mean reaches 9.77. Most texts fall into the high school reading level category according to the median score of 10.60.

The Coleman-Liau Index (CLI) yields the most challenging results among the metrics, with a mean of 12.66 and a median of 12.59. Texts in this assessment extend from basic elementary material to advanced college-level complexity because the range spans from 4.06 to 20.64. The standard deviation of 5.16 indicates substantial text difficulty variations.

The skew across several metrics indicates that the more challenging texts tend to pull the mean score upward. SMOG stands out as the most consistent of the five metrics, while LIX and CLI exhibit the greatest variability. The interquartile ranges demonstrate that most texts fall between middle school and early high school readability levels, though some outliers extend to both the very easy and very difficult ends. The analysis reveals that the texts exhibit diverse levels of complexity, leaning toward moderate to difficult readability.

Table 9. Correlations Between Raw Scores

  F-K ARI LIX SMOG CLI
F-K 1 .932 .868 .940 .908
ARI .932 1 .932 .911 .989
LIX .868 .932 1 .890 .921
SMOG .940 .911 .890 1 .894
CLI .908 .989 .921 .894 1

The correlation table demonstrates that all five readability metrics (F-K, ARI, LIX, SMOG, and CLI) exhibit strong positive relationships, with all values exceeding .86, indicating a similar approach to assessing text difficulty. The correlation between ARI and CLI is the highest at .989, demonstrating near-identical operational characteristics. F-K correlates at .940 with SMOG and .932 with ARI, since these measures all depend on sentence length and syllable count. The correlation between LIX and SMOG is the lowest at .890, because LIX evaluates readability through word length and density rather than syllable complexity. The near-identical behavior of ARI and CLI suggests some redundancy between the two in readability assessments. Overall, texts rated similarly by one metric receive comparable ratings from the others, with LIX showing a slightly different pattern.
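
The matrix in Table 9 can be reproduced directly from the raw scores in Table 7, for example with pandas:

```python
import pandas as pd

# Raw scores for the twelve samples (Table 7).
df = pd.DataFrame({
    "F-K":  [3.27, 3.87, 3.22, 10.40, 8.87, 10.13, 9.08, 7.92, 3.05, 12.16, 13.35, 9.57],
    "ARI":  [3.63, 5.38, 4.58, 7.83, 10.56, 11.65, 12.01, 8.20, 1.41, 14.28, 18.15, 11.98],
    "LIX":  [20.45, 34.35, 22.98, 31.83, 40.30, 45.70, 39.72, 44.33, 14.25, 45.46, 56.61, 42.13],
    "SMOG": [6.46, 5.93, 7.30, 9.71, 10.95, 12.00, 10.25, 11.37, 6.54, 12.34, 13.25, 11.12],
    "CLI":  [6.78, 9.05, 8.07, 10.40, 13.66, 15.57, 16.41, 11.52, 4.06, 19.11, 20.64, 16.60],
})

# Pairwise Pearson correlations, rounded as in Table 9.
print(df.corr(method="pearson").round(3))
```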

Prior to conducting correlation analysis, normality assumptions for the five readability metrics were tested using the Shapiro-Wilk test. The results indicated that none of the metrics significantly deviated from a normal distribution, with all p-values exceeding the .05 threshold (see Table 10). Specifically, F-K (W = .888, p = .111), ARI (W = .976, p = .959), LIX (W = .950, p = .638), SMOG (W = .899, p = .154), and CLI (W = .972, p = .934) all met the assumption of normality. These results confirm the statistical suitability of using Pearson correlation coefficients in subsequent analyses.

Table 10. Shapiro-Wilk Test Results

Readability Metric W Statistic p-value
F-K .888 .111
ARI .976 .959
LIX .950 .638
SMOG .899 .154
CLI .972 .934

Table 10 presents the results of the Shapiro-Wilk normality tests for each readability metric. All p-values exceed .05, indicating that the data do not significantly deviate from a normal distribution.


Figure 1. Correlation Heatmap of Readability Metrics

The figure displays the correlation heatmap of the five traditional readability metrics, showing pairwise Pearson correlation coefficients. The heatmap highlights the strong positive relationships between the metrics, illustrating their similar approach to assessing text complexity.

Table 11. Standard Errors

Readability Metric SE
F-K 1.0564
ARI 1.4031
LIX 3.5345
SMOG 0.7382
CLI 1.4908

The standard errors for the readability metrics reflect the degree of variability in their estimations. SMOG shows the lowest standard error (0.7382), indicating that it provides the most consistent and stable readability scores across samples. In contrast, LIX presents the highest standard error (3.5345), likely due to its reliance on word length and lexical density, which can fluctuate widely across texts. The Flesch-Kincaid (1.0564), ARI (1.4031), and CLI (1.4908) metrics fall within a moderate range, with ARI and CLI displaying slightly higher variability than Flesch-Kincaid. Overall, the findings suggest that SMOG offers the most stable readability evaluations, whereas LIX tends to be more variable and sensitive to text-specific characteristics.

Table 12. 95% Confidence Intervals

  Mean SE 95% CI Lower 95% CI Upper
F-K 7.9075 1.0564 5.8370 9.9780
ARI 9.1383 1.4031 6.3883 11.8884
LIX 36.5092 3.5345 29.5815 43.4368
SMOG 9.7683 0.7382 8.3215 11.2152
CLI 12.6558 1.4908 9.7338 15.5779

The 95% confidence intervals estimate the range in which the true mean readability scores are expected to fall, reflecting the variability in the data. For instance, the F-K metric has a mean of 7.91, corresponding to a 7th-to-8th-grade reading level, with an interval ranging from 5.84 to 9.98. The other metrics also exhibit clearly defined intervals: ARI from 6.39 to 11.89, LIX from 29.58 to 43.44, SMOG from 8.32 to 11.22, and CLI from 9.73 to 15.58. Among them, SMOG has the narrowest interval (2.89), reinforcing its stability as a readability measure with minimal variation. In contrast, LIX shows the widest interval (13.86), reflecting its greater standard error and variability. The overlapping intervals of ARI, CLI, and SMOG suggest that their scores are relatively consistent, whereas LIX results are more dispersed. Overall, the intervals indicate that the texts analyzed fall mostly within the moderate to difficult range in terms of readability.
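
The reported bounds are consistent with the normal approximation, mean ± 1.96 × SE (an inference from the numbers; the text does not state the method), as the sketch below shows:

```python
# SE here is the standard error of the mean (SD / sqrt(n), n = 12);
# values are taken from Tables 8 and 11.
z = 1.96  # normal approximation, inferred from the reported bounds
means = {"F-K": 7.9075, "ARI": 9.1383, "LIX": 36.5092, "SMOG": 9.7683, "CLI": 12.6558}
ses   = {"F-K": 1.0564, "ARI": 1.4031, "LIX": 3.5345, "SMOG": 0.7382, "CLI": 1.4908}

for m in means:
    lo, hi = means[m] - z * ses[m], means[m] + z * ses[m]
    print(f"{m}: [{lo:.4f}, {hi:.4f}]")  # e.g., F-K: [5.8370, 9.9780]
```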

Table 13. Correlations With GPT Assessments

  Correlation with GPT
F-K .75
ARI .65
LIX .57
SMOG .69
CLI .62

The matrix reveals the relationship between GPT readability scores and traditional readability measures. The Flesch-Kincaid (F-K) score demonstrates the strongest relationship (r = .75) with GPT scores, showing the highest agreement with GPT's readability ratings. This strong relationship can perhaps be explained by F-K's reliance on sentence composition and syllable complexity—two factors that notably influence people's perceptions of text readability. Moderate correlations were found between GPT and SMOG (r = .69), ARI (r = .65), and CLI (r = .62), suggesting that GPT captures, to some extent, the features measured by these tools, though not to the degree found for F-K. The weakest correlation was found between GPT and LIX (r = .57), which may be due to the higher variability of LIX and its dependence on attributes such as word length and lexical density, features that GPT may weigh differently.

The results indicate that GPT's readability judgments are mostly influenced by F-K and SMOG, while showing the least alignment with LIX. The results indicate that GPT focuses on sentence structure and syllabic complexity instead of word length and lexical density, which are essential components of LIX.
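
These pairwise coefficients can be checked with scipy.stats.pearsonr against the numeric ChatGPT scores derived earlier, for example for F-K:

```python
from scipy.stats import pearsonr

# Numeric ChatGPT scores (Table 7) and F-K raw scores for samples 1-12.
gpt = [1, 1, 1, 3, 4, 3, 1, 1, 1, 3, 3, 3]
fk  = [3.27, 3.87, 3.22, 10.40, 8.87, 10.13, 9.08, 7.92, 3.05, 12.16, 13.35, 9.57]

r, p = pearsonr(gpt, fk)
print(f"F-K vs GPT: r = {r:.2f}, p = {p:.3f}")  # the paper reports r = .75
```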


Figure 2. Distribution of Readability Scores Across Metrics

The statistical results showed significant correlation coefficient values, and combining AI-driven analysis tools such as ChatGPT with traditional methods brings new understanding to text readability analysis. As detailed in the comparative analysis above, ChatGPT's classifications agree well with the lower F-K and ARI scores, while LIX, SMOG, and CLI assign higher difficulty levels that ChatGPT tends to underestimate, most visibly for samples 6, 7, 10, and 11.

The qualitative insights obtained from the feedback given by ChatGPT on text readability are particularly useful. ChatGPT was able to recognize features of language such as syntactic complexity, idiomatic expressions, and cultural references, which are usually not picked up by traditional metrics. This ability to capture deeper linguistic nuances highlights the potential of ChatGPT in supporting foreign language teaching and learning by providing insights that align with core competencies identified by language teachers, emphasizing the need for enhanced pedagogical skills in conjunction with AI.

The study results were validated through expert judgment to establish the reliability and validity of the findings. The alignment between ChatGPT's evaluations and traditional metrics assessments indicates that AI holds potential benefits for educational settings beyond traditional readability metrics.

The findings suggest that AI tools such as ChatGPT can deliver more precise and context-specific readability assessments than standard measurement approaches. By complementing traditional methods with additional capabilities, AI tools could transform readability evaluation and lead to better educational materials.

Discussion

Implications of AI-Based Readability Assessment

The evaluation of readability demonstrates how AI tools such as ChatGPT measure text complexity relative to established formulas. The results indicate some similarities in classification, but AI systems deliver additional understanding through contextual and semantic analysis that goes beyond traditional formula capabilities. AI assessment results differ from established tools because they do not match traditional measurement standards (Golan et al., 2024). AI writing capabilities continue to progress toward mimicking human writing, but AI still produces different linguistic features, which affects readability evaluation (Devitska & Horvat-Choblya, 2024). The educational application of ChatGPT and similar AI tools demonstrates potential by making complex texts easier to understand, especially when working with non-English languages, where substantial readability score improvements have been achieved (Sudharshan et al., 2024). The quality and reliability of AI-generated content need further development because AI systems occasionally produce assessments that differ from human evaluators' judgments about quality and readability (Golan et al., 2024; Ömür Arça et al., 2024).

The research demonstrates both the educational benefits of AI technology and the need to address its limitations and implement it properly. AI tools offer new methods to analyze text complexity, yet their combination with traditional readability assessments needs ongoing evaluation to maintain accuracy, fairness, and educational value (Golan et al., 2024; Gondode et al., 2024; Sudharshan et al., 2024).

Limitations of ChatGPT’s Approach

ChatGPT's approach has several limitations that must be taken seriously in pedagogical contexts. First is the "black box" nature of the system, which offers no clear insight into its internal reasoning. This makes it impossible for instructors and researchers to monitor how linguistic features—such as syntactic complexity and lexical density—influence the model's readability decisions. In addition, the model is not specifically designed for CEFR foreign language levels and cannot differentiate between factors essential in foreign language learning, e.g., high-frequency versus low-frequency words.

A further major limitation is the sensitivity of ChatGPT to prompt wording, affecting reproducibility and standardization. The literature suggests that minor changes in prompt wording can substantially alter the model's output and accuracy.

For instance, in testing on educational assessment material, ChatGPT responded correctly or partially correctly 55.6% of the time, improving slightly when context for the questions was given (Newton & Xiromeriti, 2024). Similarly, its performance varied by subject area across UK standardized entry examinations, with strengths in reading comprehension and poorer scores in maths, indicating that the model's reliability is domain-dependent (Newton & Xiromeriti, 2024).

Additionally, issues have been raised regarding ChatGPT use in universities, with a focus on academic integrity and the abuse of AI to avoid learning (Consuegra‐Fernández et al., 2024; Welskop, 2023). One challenge is that it is becoming almost impossible to tell whether text was written by humans or by AI, since teachers have shown low accuracy in detecting AI-written essays (An et al., 2023). Although prompt engineering has the potential to improve the quality of ChatGPT's output, several concerns remain, such as the generation of incorrect information, the potential to bypass plagiarism detection tools, and algorithmic biases capable of distorting the content (Adel et al., 2024; Lo, 2023). Such findings highlight the need for rigorous assessment methods, organizational guidelines, and ethical standards to ensure the responsible and effective integration of artificial intelligence in the educational sector (Adel et al., 2024).

Practical Implications for Educators

In spite of its limitations, ChatGPT has tremendous potential as an ancillary tool in text modification and selection tasks. Teachers can use it for instant text appraisals of suitability for A1 or B1 learners and even seek feedback for student-authored exercises. Classroom exercises can include comparing ChatGPT's appraisals with traditional teaching methods or editing texts as per its recommendations. As part of teacher training programs, ChatGPT can assist in developing prompts, exploring limits in AI capabilities, and analyzing recommendations by AI. To ensure effective integration, educators should formalize AI use through clear prompts aligned with pedagogical goals and map AI feedback to CEFR descriptors. Regular calibration sessions—comparing teacher and AI evaluations—can build trust and optimize classroom use. Institutions should also provide workshops to share best practices and address AI-related challenges. Embedding AI tools in Learning Management Systems offers seamless lesson planning with immediate feedback.

Specifically regarding ChatGPT, educators can use open-source alternatives like LM Studio to enable offline use and ensure privacy. When applied judiciously and combined with pedagogical expertise, AI tools can enhance teaching but must not replace professional judgment. Beyond education, ChatGPT has shown versatility across fields. In healthcare, it adapts communication for patients and professionals but cannot replace expert judgment (Kaarre et al., 2023; Li et al., 2023). In academic writing, it aids in drafting and organization but raises concerns over accuracy and source reliability (Akpur, 2024; Livberber, 2023). Applications in programming, geography, and materials science demonstrate potential for personalized learning, though challenges like bias, inaccuracies, and ethical issues persist (Bringula, 2024; Chowdhury & Haque, 2023; Deb et al., 2024; X.-H. Nguyen et al., 2023). While ChatGPT's conversational abilities are impressive, its limitations require careful management. As AI evolves, more advanced and reliable models are expected (Cheng, 2023; Hariri, 2023).

To enhance the effectiveness of AI-based readability assessments, future research should focus on optimizing outputs to better align with human assessments and with levels of the Common European Framework of Reference (CEFR). The CEFR offers a systematic framework for language proficiency assessment, and its adoption can improve the assessment of text difficulty (Arase et al., 2022). Current studies highlight challenges in extrapolating models to diverse text types and languages, underscoring the need for a wider range of training data (Ribeiro et al., 2024). Multilingual and multi-domain datasets like ReadMe++ have shown potential to improve cross-lingual transfer (Naous et al., 2023). Furthermore, BERT-based models using feature projection and length-balanced loss have achieved almost human-level performance (Li et al., 2022). In European Portuguese, carefully calibrated models achieved around 80% accuracy but struggled when applied to different genres (Ribeiro et al., 2024). Comparatively, in the case of French, studies explored interactions between attention mechanisms, linguistic features, and CEFR predictions (Wilkens et al., 2024). Redefining readability as a regression task improved accuracy and generalization (Ribeiro et al., 2024).

Instruction-tuned models like BLOOMZ and FlanT5 have outperformed ChatGPT in readability-controlled tasks, highlighting the need for refined prompts and tuning (Imperial & Madabushi, 2023). However, human variability in using GPT-4 for readability manipulation reveals ongoing refinement needs (Trott & Rivière, 2024).

In summary, advancing AI readability tools requires expanding multilingual datasets, enhancing cross-domain generalization, and fine-tuning outputs to align closely with CEFR and human standards.

Conclusion

The evaluation of text readability through traditional methods shows strong agreement with AI-based classification but produces some inconsistent results. The F-K and ARI metrics match ChatGPT's ratings closely, but LIX, SMOG, and CLI tend to rate texts as more challenging. The correlation analysis shows that the readability metrics share similar patterns, while LIX stands apart because it focuses on word length and lexical density. The research shows that AI readability assessments work well overall but tend to miss the level of difficulty when texts contain complex structures. The accuracy of AI models would increase if developers added more advanced linguistic elements, such as sentence structure and lexical variation, to their systems. SMOG demonstrates the highest reliability among the metrics, while LIX produces the most variable results. Future studies should investigate how combining human evaluation with AI-based models could improve readability scoring precision. The observed differences in classification demonstrate that using AI-driven assessments together with traditional formulas could create a more balanced readability evaluation system.

Recommendations

The study offers key recommendations for teaching practice and for future research on AI-based readability assessment. The implementation of ChatGPT and other AI tools in educational settings, particularly foreign language instruction, shows promise yet demands careful strategic execution to achieve meaningful results.

Educators, together with textbook authors and curriculum developers, need to combine traditional readability formula assessments with insights from artificial intelligence models in their evaluation methods. The study demonstrates that Flesch-Kincaid, SMOG, and LIX provide solid bases for determining text difficulty by analyzing sentence length and word forms. These approaches, however, fail to detect essential linguistic, semantic, and cultural elements that significantly affect comprehension, especially for students learning foreign languages. ChatGPT successfully identified subtler features, such as formulaic expressions, discourse coherence, and syntactic transformation, which frequently determine text readability for students at different skill levels. The sketch below illustrates the surface features on which the traditional formulas rely.
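The following is a minimal sketch, using the published formula coefficients, of how Flesch-Kincaid and LIX scores are computed from sentence length and word form alone. The syllable counter is a crude vowel-group heuristic added for illustration, and applying the English-calibrated F-K formula to German text is itself an approximation.

```python
import re

def surface_stats(text):
    """Count words, sentences, syllables, and long words (> 6 letters)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    # Crude heuristic: each vowel group approximates one syllable.
    syllables = sum(max(1, len(re.findall(r"[aeiouyäöü]+", w.lower())))
                    for w in words)
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words), sentences, syllables, long_words

def flesch_kincaid_grade(text):
    w, s, syl, _ = surface_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59

def lix(text):
    w, s, _, lw = surface_stats(text)
    return (w / s) + 100 * lw / w

sample = "Der Hund schläft. Die Katze beobachtet ihn aufmerksam vom Fensterbrett."
print(f"F-K grade: {flesch_kincaid_grade(sample):.1f}")
print(f"LIX: {lix(sample):.1f}")
```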

AI evaluations should be viewed as supplementary tools that provide additional perspectives rather than as replacements for current methods. Combining standard methods with AI analysis enhances the precision of text ratings, particularly in multilingual and multicultural educational environments. Educators and researchers evaluating, selecting, or modifying teaching materials for German as a Foreign Language (GFL) courses should therefore use AI alongside standard measurement tools.

AI literacy should be integrated as a core component of teacher professional development and career advancement programs. AI outputs, including readability ratings and language feedback, require teachers to develop the ability to appraise them critically. Teachers need professional development that covers tool proficiency, an understanding of the tools' boundaries, and the significance of prompt design. Educators would then be able to evaluate AI feedback effectively while developing data-driven and considerate teaching practices.

Prompts and readability evaluation models tailored to the requirements of language learning assessment also need to be developed. Standard prompts were used with ChatGPT in this study, but the model produced mostly consistent yet occasionally incorrect responses when evaluating medium and difficult text categories. The results indicate that adapting prompts to language learners' needs, for instance by referencing CEFR-level vocabulary and grammatical markers, might improve the reliability of AI evaluations; a hedged sketch of such a prompt follows below. Organizations and publishers should explore developing domain-specific models trained on educational materials to enhance accuracy while maintaining meaningful results.
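The following is a minimal sketch, using the OpenAI Python client, of how a CEFR-explicit readability prompt might be issued. The prompt wording, model choice, and helper name are illustrative assumptions, not the prompts used in this study.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical CEFR-explicit prompt; not the study's actual wording.
PROMPT = (
    "You are assessing German texts for foreign-language learners. "
    "Assign exactly one CEFR level (A1, A2, B1, B2, C1, or C2) and briefly "
    "justify it by citing sentence length, clause structure, and vocabulary "
    "frequency. Text:\n\n{text}"
)

def rate_cefr(text: str) -> str:
    """Request a CEFR-level classification for one text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,  # deterministic output aids reliability checks
    )
    return response.choices[0].message.content

print(rate_cefr("Der Hund schläft. Die Katze spielt im Garten."))
```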

Researchers and policymakers should pilot and evaluate programs that implement AI tools such as ChatGPT in actual classroom environments, at both the classroom and policy levels. Such studies would assess how AI feedback compares with students' comprehension and perceived difficulty, as well as with teachers' expert assessments. This work would reveal effective practices while exposing deficiencies, supporting equitable and beneficial AI use in learning assessment.

Future studies should also evaluate cross-language validity by determining the applicability of AI readability evaluations across multiple languages and student populations. Because syntactic and lexical difficulty vary across languages, and because the current work focused on German-language textbooks, additional research is required. The ability of AI models to analyze text difficulty in languages such as Turkish, Japanese, and Arabic would extend the scope of these results and foster inclusive learning practices.

Finally, AI-based learning evaluation requires a strong emphasis on educational ethics alongside teaching requirements. Readability extends beyond numerical or computational values because it relates directly to students' backgrounds, motivations, cultures, and educational objectives. Readability tools exist to assist teaching professionals rather than to replace their judgment. Teachers must retain their central role in determining appropriate materials for their students, while AI suggestions should serve as one of several resources in the overall teaching process. ChatGPT and similar AI tools provide a valuable but still evolving readability assessment capability. Foreign language learning will advance toward more suitable, student-centered approaches through collaboration between human and machine intelligence.

Limitations

The research focuses on a limited number of German foreign language textbooks and a single AI model, ChatGPT. The results may therefore not apply to other languages, AI systems, or writing styles. The AI assessments depended on the prompt formats used. Although they were produced in real time, ChatGPT's evaluations were not validated against judgments from language learners and instructors. ChatGPT was not specifically trained for readability evaluation, and the consistency of its text categorization may vary across repeated runs, especially for difficult texts.

Ethics Statements

The research project involved no human participants or private information and therefore did not require approval from an ethics committee. All textbook excerpts were publicly available or authorized course materials. Artificial intelligence was employed appropriately, and the methods and analyses are reported transparently. Care was taken not to misrepresent what the system can achieve and to ensure that the comparative analyses were accurate and fair.

The author presented an initial version of this research at the 9th International Conference on the Future of Teaching and Education, held in Vienna in March 2025, and used the feedback received there to refine and finalize the paper. The author also gratefully acknowledges the valuable comments and constructive suggestions of the anonymous reviewers, which have substantially improved the quality and precision of this work.

Conflict of Interest

The author declares no conflict of interest. There are no financial, personal, or professional affiliations that could be perceived as influencing the research.

Funding

This research was funded by the Scientific Research Coordination Unit of Anadolu University under project number SÇB-2024-2587. The funding institution had no role in study design, data collection, analysis, or publication decisions.

Generative AI Statement

As the author of this work, I used generative AI (ChatGPT, GPT-4) exclusively to assess the readability of selected textbook texts, based on predefined prompts. Its role was limited to generating readability classifications, which were then compared with traditional metrics. After using this tool, I reviewed and verified the final version of my work and, as the author, take full responsibility for the content of the published work.

References

Adel, A., Ahsan, A., & Davison, C. (2024). ChatGPT promises and challenges in education: Computational and ethical perspectives. Education Sciences, 14(8), Article 814. https://doi.org/10.3390/educsci14080814

Akpur, A. (2024). Exploring the potential and limitations of ChatGPT in academic writing and editorial tasks. Fırat Üniversitesi Sosyal Bilimler Dergisi, 34(1), 177-186. https://doi.org/10.18069/firatsbed.1299700

Aksu Güney, B., Nergiz, P., Varlık, Ş., Kızılca, Ş., & Ertem, Y. (2023). Mein Schlüssel zu Deutsch: Schülerbuch Deutsch als Fremdsprache für Gymnasien [My key to German: Student book for German as a foreign language for secondary schools]. Milli Eğitim Bakanlığı Yayınları.

An, R., Yang, Y., Yang, F., & Wang, S. (2023). Use prompt to differentiate text generated by ChatGPT and humans. Machine Learning with Applications, 14, Article 100497. https://doi.org/10.1016/j.mlwa.2023.100497

Arase, Y., Uchida, S., & Kajiwara, T. (2022, October 21). CEFR-based sentence difficulty annotation and assessment (Version 1). arXiv. https://doi.org/10.48550/arxiv.2210.11766

Bakuuro, J. (2024). In the belly of text complexity: Unravelling the nexus between lexical density and readability. Athens Journal of Philology, 11(3), 255-274. https://doi.org/10.30958/ajp.11-3-4

Bhushan, S., Kumar, A., Alsubie, A., & Lone, S. A. (2023). Variance estimation under an efficient class of estimators in simple random sampling. Ain Shams Engineering Journal, 14(8), Article 102012. https://doi.org/10.1016/j.asej.2022.102012

Björnsson, C. H. (1968). Läsbarhet [Readability]. Liber.  

Bringula, R. (2024). ChatGPT in a programming course: Benefits and limitations. Frontiers in Education, 9, Article 1248705. https://doi.org/10.3389/feduc.2024.1248705

Brooks, G., Clenton, J., & Fraser, S. (2021). Exploring the importance of vocabulary for English as an additional language learners’ reading comprehension. Studies in Second Language Learning and Teaching, 11(3), 351-376.  

Brown, J. D. (1998). An EFL readability index. JALT Journal, 20(2), 7-36. http://bit.ly/46KChv5

Chall, J. S. (1996). Varying approaches to readability measurement. Revue québécoise de linguistique, 25(1), 23-40. https://doi.org/10.7202/603125ar

Chen, X. (2023). Instructional strategies for improving English language learners' reading comprehension: A systematic review. In Proceedings of the 4th International Conference on Educational Innovation and Philosophical Inquiries (pp. 125-132). EWA. https://doi.org/10.54254/2753-7048/15/20231044

Cheng, H.-W. (2023). Challenges and limitations of ChatGPT and artificial intelligence for scientific research: A perspective from organic materials. AI, 4(2), 401-405. https://doi.org/10.3390/ai4020021

Chowdhury, M. N.-U.-R., & Haque, A. (2023). ChatGPT: Its applications and limitations. In 2023 3rd International Conference on Intelligent Technologies (CONIT) (pp. 1-7). IEEE. https://doi.org/10.1109/conit59222.2023.10205621

Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2), 283-284. https://doi.org/10.1037/h0076540

Collins-Thompson, K., & Callan, J. P. (2004). A language modeling approach to predicting reading difficulty. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004 (pp. 193-200). Association for Computational Linguistics. https://aclanthology.org/N04-1025/

Consuegra‐Fernández, M., Sanz-Aznar, J., Burguera-Serra, J. G., & Caballero-Molina, J. J. (2024). ChatGPT. RIE: Revista de Investigación Educativa, 42(2), 1-18. https://doi.org/10.6018/rie.565391

Crossley, S. A., Skalicky, S., Dascalu, M., McNamara, D. S., & Kyle, K. (2017). Predicting text comprehension, processing, and familiarity in adult readers: New approaches to readability formulas. Discourse Processes, 54(5-6), 340-359. https://doi.org/10.1080/0163853x.2017.1296264

Deb, J., Saikia, L., Dihingia, K. D., & Sastry, G. N. (2024). ChatGPT in the material design: Selected case studies to assess the potential of ChatGPT. Journal of Chemical Information and Modeling, 64(3), 799-811. https://doi.org/10.1021/acs.jcim.3c01702

Devitska, A., & Horvat-Choblya, A. (2024). Linguistic domains: Comparison of texts written by human and artificial intelligence. Věda a Perspektivy, 11(42), 358-365. https://doi.org/10.52058/2695-1592-2024-11(42)-358-365

DuBay, W. H. (2004). The principles of readability. Impact Information.  

Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221-233. https://doi.org/10.1037/h0057532

Georgiakaki, M., Graf-Riemann, E., & Seuthe, C. (2019). Beste Freunde A1/2. Kursbuch: Deutsch für Jugendliche [Best friends A1/2. Coursebook: German for teenagers]. Hueber Verlag.   

Golan, R., Ripps, S., Raghuram, R., Loloi, J., Bernstein, A., Connelly, Z. M., Golan, N. S., & Ramasamy, R. (2024). ChatGPT’s ability to assess quality and readability of online medical information. The Journal of Sexual Medicine, 21(S1), Article qdae001.019. https://doi.org/10.1093/jsxmed/qdae001.019

Gondode, P., Duggal, S., Garg, N., Lohakare, P., Jakhar, J., Bharti, S., & Dewangan, S. (2024). Comparative analysis of accuracy, readability, sentiment, and actionability: Artificial intelligence chatbots (ChatGPT and Google Gemini) versus traditional patient information leaflets for local anesthesia in eye surgery. The British and Irish Orthoptic Journal, 20, 183-192. https://doi.org/10.22599/bioj.377

Gordeeva, L., & Usupova, Z. (2023). К вопросу о типологии текстов по русскому языку как иностранному [On text typology in Russian as a foreign language]. Philology and Culture/Филология и Культура, (1), 156-160. https://doi.org/10.26907/2782-4756-2023-71-1-156-160

Hariri, W. (2023, March 27). Unlocking the potential of ChatGPT: A comprehensive exploration of its applications, advantages, limitations, and future directions in natural language processing. arXiv. https://doi.org/10.48550/arXiv.2304.02017

Heilman, M., Collins-Thompson, K., Callan, J., & Eskenazi, M. (2007). Combining lexical and grammatical features to improve readability measures for first and second language texts. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (pp. 460-467). Association for Computational Linguistics.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735

Hu, T.-C., Sung, Y.-T., Liang, H.-H., Chang, T.-J., & Chou, Y.-T. (2022). Relative roles of grammar knowledge and vocabulary in the reading comprehension of EFL elementary-school learners: Direct, mediating, and form/meaning-distinct effects. Frontiers in Psychology, 13, Article 827007. https://doi.org/10.3389/fpsyg.2022.827007

Huang, R., Adarkwah, M. A., Liu, M., Hu, Y., Zhuang, R., & Chang, T. (2024). Digital pedagogy for sustainable education transformation: Enhancing learner-centred learning in the digital era. Frontiers of Digital Education, 1, 279-294. https://doi.org/10.1007/s44366-024-0031-x

Imperial, J. M., & Madabushi, H. T. (2023, September 11). Flesch or fumble? Evaluating readability standard alignment of instruction-tuned language models (Version 2). arXiv. https://doi.org/10.48550/arxiv.2309.05454

Kaarre, J., Feldt, R., Keeling, L. E., Dadoo, S., Zsidai, B., Hughes, J. D., Samuelsson, K., & Musahl, V. (2023). Exploring the potential of ChatGPT as a supplementary tool for providing orthopaedic information. Knee Surgery, Sports Traumatology, Arthroscopy, 31(11), 5190-5198. https://doi.org/10.1007/s00167-023-07529-2

Kamal, S. M. (2019). Developing EFL learners' vocabulary by reading English comprehension in EFL classroom. International Journal of English Language and Literature Studies, 8(1), 28-35. https://doi.org/10.18488/journal.23.2019.81.28.35

Kasule, D. (2011). Textbook readability and ESL learners. Reading and Writing, 2(1), Article a13. https://doi.org/10.4102/rw.v2i1.13

Kavrayici, C., Shikhkamalova, V., & Tamdoğan Evliyaoğlu, A. (2025). Psychological capital as a predictor of professional resilience capacities of teachers in public schools. CADMO, (2), 71-77. https://doi.org/10.3280/cad2024-002005

Kincaid, J. P., Fishburne, R. P., Jr., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (Automated Readability Index, Fog Count, and Flesch Reading Ease Formula) for Navy enlisted personnel (Research Branch Report No. 8-75). Naval Air Station Memphis.  

Koithan, U., Mayr-Sieber, T., Schmitz, H., Sonntag, R., & Pilaski, A. (2021). Kontext B1+. Kursbuch mit Audios/Videos: Deutsch als Fremdsprache [Context B1+. Coursebook with audios/videos: German as a foreign language]. Ernst Klett Verlag.  

Li, W., Wang, J., Yan, Y., Yuan, P., Cao, C., Li, S., & Wu, Q. (2023). Potential applications of ChatGPT in endoscopy: Opportunities and limitations. Gastroenterology and Endoscopy, 1(3), 152-154. https://doi.org/10.1016/j.gande.2023.06.001

Li, W., Ziyang, W., & Wu, Y. (2022). A unified neural network model for readability assessment with feature projection and length-balanced loss. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp. 7446-7457). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.emnlp-main.504

Livberber, T. (2023). Toward non-human-centered design: Designing an academic article with ChatGPT. El Profesional de La Información, 32(5), Article e320512. https://doi.org/10.3145/epi.2023.sep.12

Lo Bosco, G., Pilato, G., & Schicchi, D. (2019). A recurrent deep neural network model to measure sentence complexity for the Italian language. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Cognition (pp. 1-9). CEUR-WS. http://ceur-ws.org/Vol-2418/paper10.pdf

Lo, C. K. (2023). What is the impact of ChatGPT on education? A rapid review of the literature. Education Sciences, 13(4), Article 410. https://doi.org/10.3390/educsci13040410

Maqsood, S., Shahid, A., Afzal, M. T., Roman, M., Khan, Z., Nawaz, Z., & Aziz, M. H. (2022). Assessing English language sentences readability using machine learning models. PeerJ Computer Science, 8, Article e818. https://doi.org/10.7717/peerj-cs.818

McLaughlin, G. H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12(8), 639-646. https://bit.ly/4dcoVux

Miller, B., Linder, F., & Mebane, W. R., Jr. (2020). Active learning approaches for labeling text: Review and assessment of the performance of active learning approaches. Political Analysis, 28(4), 532-551. https://doi.org/10.1017/pan.2020.4

Naous, T., Ryan, M. J., Lavrouk, A., Chandra, M., & Xu, W. (2023, May 23). ReadMe++: Benchmarking multilingual language models for multi-domain readability assessment (Version 4). arXiv. https://doi.org/10.48550/arxiv.2305.14463

Nassiri, N., Lakhouaja, A., & Cavalli-Sforza, V. (2018). Arabic readability assessment for foreign language learners. In M. Silberztein, F. Atigui, E. Kornyshova, E. Métais, & F. Meziane (Eds.), Natural Language Processing and Information (pp. 480-488). Springer. https://doi.org/10.1007/978-3-319-91947-8_49

Newton, P. M., & Xiromeriti, M. (2024). ChatGPT performance on multiple choice question examinations in higher education: A pragmatic scoping review. Assessment and Evaluation in Higher Education, 49(6), 781-798. https://doi.org/10.1080/02602938.2023.2299059

Nguyen, N. T. D., & Lam, T. N. (2023). Applications of readability findings in teaching languages. Journal of Science of Hong Bang International University/Tạp chí Khoa học Trường Đại học Quốc tế Hồng Bàng, 5, 103-110. https://doi.org/10.59294/hiujs.vol.5.2023.554

Nguyen, X.-H., Nguyen, H.-A., Cao, L., & Trương, H. (2023, August 3). Unleashing the potential and recognizing the limitations of ChatGPT in Vietnamese geography education. EdArXiv. https://doi.org/10.35542/osf.io/r4dg6

Noviyenty, L. (2017). Teaching English reading ability for second/foreign language learners. Indonesian Journal of Integrated English Language Teaching, 3(1), 15-26. https://doi.org/10.24014/ijielt.v3i1.3965

Omer, I.-E. A. M., & Al-Khaza’leh, B. A. (2021). Readability and readability formulas: English as a foreign language tertiary education teachers’ awareness in Shaqra University, Saudi Arabia. Journal of Education in the Black Sea Region, 7(1), 166-180. https://doi.org/10.31578/jebs.v7i1.256

Ömür Arça, D., Erdemir, İ., Kara, F., Shermatov, N., Odacıoğlu, M., İbişoğlu, E., Hancı, F. B., Sağıroğlu, G., & Hancı, V. (2024). Assessing the readability, reliability, and quality of artificial intelligence chatbot responses to the 100 most searched queries about cardiopulmonary resuscitation: An observational study. Medicine, 103(22), Article e38352. https://doi.org/10.1097/md.0000000000038352

Parhadjanovna, S. S. (2023). Mastering the art of reading: Techniques and strategies to enhance reading skills. In Teaching foreign languages in the context of sustainable development: Best practices, problems and opportunities (pp. 482-485). International Science and Culture Research Center. https://doi.org/10.37547/geo-89

Parsaeian, M., Mahdavi, M., Saadati, M., Mehdipour, P., Sheidaei, A., Khatibzadeh, S., Farzadfar, F., & Shahraz, S. (2021). Introducing an efficient sampling method for national surveys with limited sample sizes: Application to a national study to determine quality and cost of healthcare. BMC Public Health, 21, Article 1414. https://doi.org/10.1186/s12889-021-11441-0

Petersen, S. E., & Ostendorf, M. (2009). A machine learning approach to reading level assessment. Computer Speech and Language, 23(1), 89-106. https://doi.org/10.1016/j.csl.2008.04.003

Pitler, E., & Nenkova, A. (2008). Revisiting readability: A unified framework for predicting text quality. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 186-195). Association for Computational Linguistics. https://aclanthology.org/D08-1020/

Plung, D. L. (1981). Readability formulas and technical communication. IEEE Transactions on Professional Communication, 24(1), 52-54. https://doi.org/10.1109/TPC.1981.6447827

Redish, J. (2000). Readability formulas have even more limitations than Klare discusses. Journal of Computer Documentation, 24(3), 132-137. https://doi.org/10.1145/344599.344637  

Redmiles, E. M., Maszkiewicz, L., Hwang, E., Kuchhal, D., Liu, E., Morales, M., Peskov, D., Rao, S., Stevens, R., Gligoric, K., Kross, S., Mazurek, M. L., & Daumé, H. (2018). Digital words: Moving forward with measuring the readability of online texts [Doctoral dissertation, University of Maryland]. Digital Repository at the University of Maryland. https://doi.org/10.13016/M2B853N3M

Ribeiro, E., Mamede, N., & Baptista, J. (2024). Automatic text readability assessment in European Portuguese. In P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira, & R. Amaro (Eds.), Proceedings of the 16th International Conference on Computational Processing of Portuguese – Vol. 1 (pp. 97-107). Association for Computational Linguistics. https://aclanthology.org/2024.propor-1.10/

Samawi, H., Yu, L., & Yin, J. (2023). On Cox proportional hazards model performance under different sampling schemes. PLOS ONE, 18(4), Article e0278700. https://doi.org/10.1371/journal.pone.0278700

Schwarm, S., & Ostendorf, M. (2005). Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05) (pp. 523-530). Association for Computational Linguistics. https://aclanthology.org/P05-1065/

Senter, R. J., & Smith, E. A. (1967). Automated readability index (AMRL-TR-66-220). Aerospace Medical Research Laboratories, Wright-Patterson Air Force Base. https://apps.dtic.mil/sti/pdfs/AD0667273.pdf

Shaitarova, A., Bauer, N., Vamvas, J., & Volk, M. (2024). Tracing linguistic footprints of ChatGPT across tasks, domains and personas in English and German. In Proceedings of the 9th edition of the Swiss Text Analytics Conference (pp. 102-112). Association for Computational Linguistics. https://aclanthology.org/2024.swisstext-1.9/

Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3-4), 591-611. https://doi.org/10.1093/biomet/52.3-4.591

Si, L., & Callan, J. P. (2001). A statistical model for scientific readability. In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM) (pp. 574-576). Association for Computing Machinery. https://doi.org/10.1145/502585.502695

Sudharshan, R., Shen, A., Gupta, S., & Zhang-Nunes, S. (2024). Assessing the utility of ChatGPT in simplifying text complexity of patient educational materials. Cureus, 16(3), Article e55304. https://doi.org/10.7759/cureus.55304

Trott, S., & Rivière, P. D. (2024, October 17). Measuring and modifying the readability of English texts with GPT-4. arXiv. https://doi.org/10.48550/arxiv.2410.14028

Vajjala, S. (2022). Trends, limitations and open challenges in automatic readability assessment research. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022) (pp. 5366-5377). European Language Resources Association. https://aclanthology.org/2022.lrec-1.574/

Welskop, W. (2023). ChatGPT in higher education. International Journal of New Economics and Social Sciences, 1(17), 9-18. https://doi.org/10.5604/01.3001.0053.9601

Wilkens, R., Watrin, P., & François, T. (2024). Paying attention to the words: Explaining readability prediction for French as a foreign language. In R. Wilkens, R. Cardon, A. Todirascu, & N. Gala (Eds.), Proceedings of the 3rd Workshop on Tools and Resources for People with Reading Difficulties (READI) @ LREC-COLING 2024 (pp. 102-115). ELRA and ICCL. https://aclanthology.org/2024.readi-1.9/

Zabun, O. A. (2020). Deutsch für Gymnasien Schülerbuch Niveaustufe A1.1 [German for secondary schools student book level A1.1]. Ata Yayıncılık.  

Zheleznova, E. G. (2019). Чтение как один из способов обучения английскому языку [Reading as a way of teaching English]. Scientific Bulletin of the Southern Institute of Management, (1), 110-114.

Zhu, Z. (2024). Teaching reading for English as a foreign language learners: Insights from psychological and social perspectives. Applied and Educational Psychology, 5(4), 173-176. https://doi.org/10.23977/appep.2024.050425

...