Introduction
Artificial intelligence (AI) has transformed many fields, especially education and research, with advances in natural language processing (NLP) and large language models (LLMs) expanding how these technologies are used and understood (Alqahtani et al., 2023). This impact is most evident in educational practice: AI has become a powerful tool in education, reshaping writing instruction as it is increasingly integrated into higher education contexts such as language learning (Romero et al., 2024). Opinion essays, a common form of academic writing at university, help students develop critical thinking by stating a stance, supporting it with evidence, and organizing ideas using structures such as cause-effect or compare-contrast (Gordon, 1990; Setyowati, 2016). The easy availability of online essay samples (Setyowati, 2016) and AI-generated responses has raised concerns about originality and authorship, highlighting the need to teach students responsible, informed use of generative AI throughout the writing process. Rather than generating full essays, students can use AI more constructively for outlining or receiving feedback to support learning. AI-enabled learning systems, such as adaptive platforms and Automated Writing Evaluation (AWE) tools, provide immediate, personalized feedback, supporting revisions and helping students refine their work over time (Akyüz, 2020; Hung et al., 2024; Kabudi et al., 2021). AI tools such as ChatGPT-4, Google Gemini, and Microsoft Copilot support brainstorming, idea organization, and revision, yet research shows notable limitations in their use for writing instruction. Bibliometric and systematic analyses show that most studies focus on single tools in narrow settings, with ChatGPT-4 dominating the literature while Gemini and Copilot remain underrepresented (Frumin et al., 2025; Kabudi et al., 2021). This highlights the need for research on a broader range of generative AI tools. Literature reviews note a gap in comparative studies on the educational impact of different platforms across diverse learners and writing tasks (Alqahtani et al., 2023; Kabudi et al., 2021). Furthermore, the needs of EFL students who are transitioning from paragraph to essay writing remain underexplored. This study addresses this gap by comparing three generative AI tools in supporting students through key stages of the writing process. Evaluating students’ experiences in a state university preparatory program in Ankara offers insights for educators supporting learners at this transitional proficiency level. To this end, the study examines how B+ level EFL students use three generative AI tools in the key stages of essay writing and addresses the following research questions:
1. How do B+ level EFL students’ perceptions of effectiveness differ across ChatGPT-4, Google Gemini, and Microsoft Copilot in supporting the stages of opinion essay writing (brainstorming, outlining, and feedback)?
2. What qualitative experiences and perceptions do B+ level students report about the strengths and limitations of these tools in supporting their writing?
3. How does working with these tools influence students’ attitudes toward future AI use for writing support?
Literature Review
Generative AI in Education
Artificial intelligence (AI) has garnered significant public interest, particularly following the release of large language models (LLMs) and chatbots such as ChatGPT, Copilot, and Gemini, which enable users to interact directly with LLMs (Barrett & Pack, 2023; Lin, 2023). Lin’s (2023) review of major AI developments clearly explains the rapid pace and broad scale at which generative AI tools like ChatGPT, Gemini, and Copilot have entered mainstream use. These tools, trained on vast natural language data, can generate original and contextually appropriate responses in ways that echo how humans produce novel language based on limited input. This ability reflects principles of generative grammar (Chomsky, 1991). While the technology is impressive, using these tools in education demands more than technical skill. Educators must make them accessible and ensure ethical, pedagogically sound use. This is crucial in language education, where writing requires complex skills like organizing ideas, developing arguments, and revising for clarity. Unlike receptive skills, writing involves active production, complicating AI integration. Barrett and Pack (2023) found that students and teachers generally accept generative AI for early writing stages like brainstorming and outlining but worry about its use for full-text production. These findings highlight the need to understand how these tools are used and perceived by educators and learners. Recent studies also highlight how intelligent technologies enhance personalized learning and support creativity for learners and teachers (Urmeneta & Romero, 2024; Yan et al., 2024). AI has become a disruptive force in education, enabling hybrid intelligence and transforming practices such as essay writing and feedback (Järvelä et al., 2023). In language education specifically, AI is becoming a key component in supporting teaching and learning (T. L. Nguyen et al., 2024). Individuals can now generate writing instantly on almost any topic by using simple prompts. Generative AI tools have become indispensable aids for students. AI-powered tools provide timely, automated feedback that not only addresses surface-level issues such as grammar, punctuation, and spelling but also enhances complex aspects of writing, such as content development, argumentation, and coherence. By providing grammar and style suggestions, these tools enhance students’ linguistic accuracy and allow educators to concentrate on higher-order writing skills. This makes AI tools an efficient and scalable alternative to traditional teacher feedback (Gayed et al., 2022; Link et al., 2022; Malik et al., 2023). In this context, generative AI is both a source of inspiration and a practical support tool in the writing process.
AI-Assisted Writing and Its Pedagogical Benefits
Writing, a productive skill, is particularly challenging due to its high cognitive demands. Unlike receptive skills such as listening or reading, which involve comprehension, it requires generating ideas, organizing thoughts, and applying grammar and vocabulary (Taskiran & Goksel, 2022). At the higher education level, it requires a systematic approach to exploring and organizing ideas and demonstrating critical thinking to support the writer’s perspective (Bailey, 2014). This process follows a structured format of introduction, body, and conclusion, each serving a specific function in presenting the thesis, developing arguments, and summarizing key points (Biber et al., 2004; Cumming et al., 2000). As students plan, draft, and revise, they are expected to move beyond grammar, focusing on organization, mechanics, and cohesion to produce effective writing (Mukminin, 2012). Given the complexity of these tasks, which require clarity, precision, and critical engagement with ideas (Paltridge & Starfield, 2016), writing can be time-consuming and mentally demanding (Flower & Hayes, 1981; Graham & Sandmel, 2011). Building on AI’s role as a supportive writing tool, many educators integrate it into classrooms for grammar correction, content development, personalized feedback, and to boost writing confidence and motivation. AI tools are already popular among EFL users. The pedagogical impact of generative AI in writing instruction is particularly relevant from constructivist and sociocultural perspectives. Vygotsky’s Zone of Proximal Development highlights the need for appropriate tools and support to help learners reach their potential. AI tools can serve as supportive peers or scaffolding, providing immediate, ongoing feedback that personalizes learning and fosters self-regulation (Dizon & Gold, 2023; Fong & Schallert, 2023). Zaiarna et al. (2024) observed that EFL instructors increasingly use AI tools like ChatGPT and Gemini in classrooms, recognizing their potential for instruction and assessment. Xu et al. (2024) reported that EFL students using ChatGPT improved coherence and organization after multiple rounds of AI feedback. AI tools are also described as transformative in EFL settings, supporting grammar, outlining, and content development (Godwin-Jones, 2022). Studies also show measurable improvements in students’ writing quality and organization with such tools (Malik et al., 2023; Marzuki et al., 2023). AI tools also offer personalized feedback that promotes learner autonomy. Moreover, they can provide immediate, individualized feedback, sometimes more readable and detailed than instructor feedback. They can suggest alternative expressions, improve readability, and generate new topic ideas to support revision. Research shows that automated feedback can improve students’ writing, especially with multiple rounds of revision (Huang & Wilson, 2021; Mushthoza et al., 2023). Beyond this, AI writing tools can boost student confidence and reduce anxiety by offering accessible help during the writing process. As writing is often emotionally demanding, such tools can enhance motivation and encourage learner independence. These effects are important given that writing is shaped not only by cognitive demands but also by the writer’s emotional and motivational experiences (Bruning & Horn, 2000; Dizon & Gold, 2023; Fong & Schallert, 2023; Ghafouri, 2024; Graham, 2018).
Concerns and Limitations in AI-Assisted Writing
Despite their benefits, AI tools in writing instruction raise concerns about academic integrity, over-reliance on technology, reduced creativity, and unequal access. This study acknowledges that a balanced view is crucial, and the integration of AI must be carefully structured to account for these pedagogical, ethical, and affective dimensions (Dehouche, 2021; Iskender, 2023). A key concern is academic integrity, as educators question the originality of AI-generated content. Over-reliance may encourage shortcuts, undermining the purpose of academic writing and honest self-expression (Dehouche, 2021; Malik et al., 2023). It may also hinder self-editing, weaken critical thinking and creativity, and limit the development of higher-order writing skills (Iskender, 2023; Marzuki et al., 2023). Excessive dependence on AI may risk “cognitive decline,” with rapid solutions potentially harming critical thinking and problem-solving (Kasneci et al., 2023). Heavy reliance on AI-generated feedback may reduce learners’ original thinking and critical evaluation skills. LLMs may not capture the emotional and cultural nuances of language, potentially hindering students’ communication skills (Shidiq, 2023). While some students may overuse AI tools, others lack access due to limited technology or internet, raising concerns about fairness and equal learning opportunities (Marzuki et al., 2023). This digital divide raises questions about whether the potential benefits of AI-supported education can be equally accessible to all learners (Alzubi, 2024; Modhish & Al-Kadi, 2016). Another concern is that biased or incomplete information from generative AI tools can lead to linguistic injustice (Lucy & Bamman, 2021; Rawas, 2024; Uzun, 2023). This risk stems from LLMs internalizing social biases, prejudices, and stereotypes in their training data, potentially leading to pedagogically problematic outcomes (Kasneci et al., 2023; Lund & Wang, 2023). These limitations suggest that AI integration into classrooms must be approached with caution.
The Rationale for Comparing ChatGPT-4, Gemini, and Copilot
Given AI-assisted writing’s potential and limitations, it is essential to examine how different generative AI tools function in education and how well they support learning. In response, this study compares three widely available tools: ChatGPT-4, Google Gemini, and Microsoft Copilot. These three tools were specifically chosen for their widespread use, significant market presence, and distinct developmental philosophies, making them the most prominent and relevant platforms for a comparative study in today’s educational landscape. Each has followed a distinct development path. ChatGPT, launched by OpenAI in 2022, quickly became a key player for its coherent, contextually appropriate responses and rapid evolution from GPT-3 to GPT-4 (Aydın & Karaarslan, 2022; Zaiarna et al., 2024). In contrast, Microsoft Copilot, integrated into Microsoft’s productivity suite, is designed to streamline work processes, particularly in professional and academic writing (Microsoft, 2025). Google’s Gemini, a newer entrant to the AI space, is built to integrate with Google’s ecosystem and offers strong capabilities in document and content creation (Zaiarna et al., 2024). While these tools share some functionalities, they differ in their design, development, and target use cases. These differences are not merely technical but also reflect distinct pedagogical potentials that can be linked to established theories. For instance, ChatGPT’s conversational interface, with its dialogic-feedback model, can be seen as an extension of Vygotsky’s ZPD theory, where the AI acts as a more capable peer. Gemini’s multimodal capability and its integration within Google’s ecosystem support a process-oriented writing approach (Flower & Hayes, 1981) by offering continuous support across different stages of writing. Similarly, Copilot’s design, focused on efficiency and integration into Microsoft’s productivity suite, aligns with pedagogical goals of reducing cognitive load and streamlining the writing workflow. Asadi et al. (2025) emphasize that more research is needed to explore how AI-driven feedback can be integrated with traditional methods to enhance writing instruction. Similarly, Carlson et al. (2023) encourage educators to explore how different tools can meet the specific needs of their learners. Including AI tools in the writing classroom shows great promise for addressing student difficulties and improving writing proficiency, especially when tools are selected based on their specific features and instructional contexts (Silva & Janes, 2020).
Rubric Design for Evaluating AI Writing Support
To evaluate how ChatGPT-4, Google Gemini, and Microsoft Copilot supported students at different stages of the writing process, a 10-item rubric was developed by the researcher. The rubric includes criteria aligned with three key stages: brainstorming, outlining, and feedback. Brainstorming criteria such as argument diversity, relevance, and idea depth were informed by studies highlighting the cognitive and affective benefits of this prewriting strategy. Brainstorming has been shown to help students activate prior knowledge, generate creative ideas, and overcome common writing difficulties (Karim et al., 2016; T. H. Nguyen, 2020; Rao, 2007). It also supports student motivation by creating a low-pressure environment for idea exploration (T. H. Nguyen, 2020). More recently, AI tools like ChatGPT and Gemini have been recognized for their role in enhancing brainstorming by offering a broader range of perspectives and encouraging deeper engagement with the topic (Karanjakwut & Charunsri, 2025). These tools are increasingly viewed by both teachers and students as valuable supports for overcoming initial barriers and stimulating idea generation (Xu et al., 2024). In EFL contexts, AI-supported brainstorming can also ease anxiety and help organize thoughts before writing, thus improving overall expressive ability (Pratiwi & Julianti, 2022). Outlining criteria such as thesis strength, logical structure, and support development were informed by literature emphasizing the importance of organization and clarity in EFL writing (T. H. Nguyen, 2020; Wahyudin, 2018). The outlining stage serves as a bridge between idea generation and full-text drafting, and AI tools can facilitate this process by providing organizational templates or prompting students to visualize logical flow. Research suggests that outlining, whether AI-assisted or not, enhances coherence and planning, ultimately supporting writing quality (Pratiwi & Julianti, 2022; Xu et al., 2024). Feedback-related criteria, including specificity, actionability, and accuracy, were added to reflect the increasing use of automated writing evaluation (AWE) tools. While tools such as Grammarly can identify surface-level issues like grammar, vocabulary, and mechanics, they often fall short when addressing higher-order concerns such as content development, argument strength, or global organization. Students report appreciating the immediate feedback AI provides, especially in terms of revision accessibility and technical accuracy, but research shows the best outcomes occur when AI feedback is combined with teacher or peer guidance (Yuan, 2023). Grounded in existing literature and finalized through expert review, the rubric was designed for this study and has not been tested for generalizability across other contexts or learner populations. Its theoretical foundation and practical focus, however, suggest potential for adaptation and further development in future studies on generative AI in EFL writing instruction.
Methodology
Research Design
A mixed-methods design with a within-subjects 3×3 factorial framework evaluated three AI tools across three argumentative essay topics. Each participant experienced every Tool × Topic condition, allowing direct comparisons while controlling for individual differences. Randomization of topic order was initially considered; however, due to the voluntary and self-paced nature of participation outside of class hours, all topics were provided at the outset so students could manage their workload according to their own schedules. During data collection, several participants withdrew and were replaced by new volunteers. To ensure continuity and allow new participants to complete all tasks within the limited study period, the same full set of topics and materials was given to all, and topic order was not controlled. Tool interactions were logged for auditability. Data quality was maintained through routine screening for outliers and entry errors. Listwise handling was used for repeated-measures tests to preserve comparability, and no revisions were permitted after AI feedback to prevent contamination of subsequent measures.
The ten rubric criteria covered brainstorming (argument diversity, relevance, depth), outlining (thesis strength, logical structure, strength of supporting reasons, detail level), and feedback (feedback specificity, actionability, grammar and coherence). These were treated as continuous outcomes. Descriptive statistics were calculated, assumptions for repeated-measures models (normality and sphericity) were tested, and Greenhouse–Geisser corrections were applied as needed. Two-way repeated-measures ANOVAs tested main effects and the Tool × Topic interaction, with Bonferroni-adjusted post-hoc comparisons. Effect sizes were reported as partial eta squared for omnibus tests and standardized mean differences for pairwise contrasts.
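For reference, the partial eta squared reported for the omnibus tests follows the standard definition, and the Greenhouse–Geisser procedure rescales both degrees of freedom by the estimated sphericity coefficient $\hat{\varepsilon}$:

$$\eta_p^2=\frac{SS_{\text{effect}}}{SS_{\text{effect}}+SS_{\text{error}}},\qquad df_{\text{adj}}=\hat{\varepsilon}\,df_{\text{effect}},\qquad df_{\text{error,adj}}=\hat{\varepsilon}\,df_{\text{error}}$$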
Open-ended responses were analyzed inductively so that emergent themes could contextualize the quantitative patterns and show how user experience shaped the perceived usefulness and relevance of AI-generated support.
Participants
The participants were 15 B+ level students at the preparatory school of a public university in Ankara, Türkiye, experiencing both their first year at the school and their first time writing full opinion essays in English. The language program follows a modular system, with the B+ level reflecting most CEFR Threshold features but functioning closer to upper-B1 in this institution. It marks the transition from paragraph writing to more complex academic genres, presenting challenges such as developing arguments and structuring ideas that may benefit from AI-assisted support.
Participation was voluntary, with no grades or performance-based evaluation. All participants signed consent forms, chose nicknames for anonymity, and worked independently outside class hours, representing different classrooms at the same level. Pre-survey results offered insights into their prior AI experience, writing challenges, and confidence in using AI for academic writing.
Pre-survey responses showed that 11 of the 15 students had previously used an AI writing tool, while 4 had not (Figure 1). Among those with prior use, ChatGPT-4 was the most frequently mentioned tool. Regarding usage frequency, 7 students reported using AI tools rarely (1–2 times per month), 4 students used them sometimes (1–2 times per week), 2 students used them often (3–5 times per week), and 2 students used them very often (almost every day) (Figure 2). When asked what they expected from an AI writing tool, the most commonly selected options were grammar correction, vocabulary support, and idea generation. Students rated their confidence in essay writing on a scale from 1 to 5, with most selecting scores between 2 and 4, indicating moderate confidence. A similar pattern emerged for comfort levels with AI in academic writing, indicating openness despite limited use. Finally, students most often identified organizing ideas, developing thesis statements, and maintaining coherence as their biggest writing challenges (Figure 3).

Figure 1. Use of AI Tools Prior to the Study

Figure 2. Frequency of AI Tool Use for Academic Writing

Figure 3. Aspects of Writing Students Struggled with the Most
Data Collection
Data were collected using multiple tools: (a) a pre-survey and a post-survey administered via Google Forms; (b) three task-specific Google Forms (one for each essay topic); (c) an AI Writing Tool Evaluation Worksheet that guided students through prompt generation at each writing stage; and (d) screenshots of their interactions with the AI tools, which were submitted to a shared Google Drive folder. Students used ChatGPT-4, Gemini, and Copilot for brainstorming, outlining, and feedback in each task. Screenshots verified the completion of each stage and ensured the reliability and authenticity of the data.
The study focused on opinion essays, a genre defined by the program’s curriculum as a formal academic text presenting a reasoned argument for a specific point of view. Students were required to write an introduction with a clear thesis, body paragraphs with supporting details, and a concluding summary. Tasks were selected to align with the B+ curriculum objectives and were based on opinion essay topics similar to those students typically complete in class:
Task 1: Traveling alone is better than traveling in a group.
Task 2: People who live in apartments shouldn't own pets.
Task 3: Some people believe that artificial intelligence will soon replace many jobs. Do you agree or disagree?
The worksheet guided students in crafting appropriate prompts for each AI tool at three key writing stages:
1. Brainstorming: Students asked each tool to list arguments for and against the topic.
2. Outlining: Students requested a structured outline including a thesis, three main ideas, and brief supporting details.
3. Feedback: Students submitted their essay to each tool and prompted them to give specific feedback on thesis strength, argument clarity, organization, grammar, coherence, and tone.
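To illustrate the kind of wording involved, hypothetical prompt templates for the three stages might look like the sketch below; these are illustrative only, and the study’s actual prompt sheet, described later in this section, was developed separately through trial-and-error testing.

```python
# Hypothetical prompt templates for the three writing stages (Task 1 topic shown);
# illustrative only, not the study's actual prompt sheet.
ESSAY_TOPIC = "Traveling alone is better than traveling in a group."

STAGE_PROMPTS = {
    "brainstorming": (
        f"List arguments for and against the following opinion essay topic: {ESSAY_TOPIC}"
    ),
    "outlining": (
        f"Create an outline for an opinion essay on '{ESSAY_TOPIC}' with a thesis "
        "statement, three main ideas, and brief supporting details for each."
    ),
    "feedback": (
        "Here is my opinion essay: <essay text>. Give specific feedback on thesis "
        "strength, argument clarity, organization, grammar, coherence, and tone."
    ),
}

for stage, prompt in STAGE_PROMPTS.items():
    print(f"{stage.capitalize()}: {prompt}\n")
```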
Students completed all tasks independently at their own pace. A Telegram group was used for communication, deadline reminders, support, and clarifying instructions. Some contacted the instructor by email, visited the office, or received one-on-one help, and a few without personal computers used the instructor’s computer. The initial face-to-face orientation was held on February 13, 2025, with a support session on February 27, 2025. All materials, including a presentation, instructional video, AI writing evaluation rubric, writing samples, and prompt sheets, were shared via Google Drive and in print.
The AI writing evaluation rubric was developed by the researcher with criteria for three stages: brainstorming, outlining, and feedback. Each criterion was based on principles from second language writing research and AI-integrated instruction. The rubric aimed to provide a standardized, objective evaluation of writing support, enhancing the reliability of data collection. To establish validity, the criteria were developed from the literature and then reviewed by one expert in English Language Teaching and one in Computer Education and Instructional Technology. Their feedback was incorporated to refine the criteria and ensure content validity and relevance to the academic context of B+ level students.
The prompt sheet was developed through extensive trial-and-error testing with ChatGPT-4, Google Gemini, and Microsoft Copilot to determine concise, effective prompts for brainstorming, outlining, and feedback. Designed for clarity and pedagogical suitability for B+ level EFL learners, it was reviewed by a Computer Education and Instructional Technology faculty member.
Students evaluated the feedback from each AI tool but were not required to revise their essays based on it. Some requested and received additional teacher feedback for their own learning, but this was excluded from the formal analysis since the study focused on autonomous use of AI tools.
Data Analysis
Quantitative data from rubric-based evaluations were analyzed using descriptive statistics and two-way repeated-measures ANOVA to examine the main effects of AI tool and essay topic, and their interaction, on student ratings across ten writing criteria for brainstorming, outlining, and feedback. Data were screened for outliers and entry errors, and assumptions of normality and sphericity were checked; Greenhouse–Geisser corrections were applied when necessary. Post-hoc pairwise comparisons used Bonferroni adjustments. Effect sizes were reported as partial eta squared for omnibus tests and standardized mean differences for pairwise contrasts. All analyses were conducted in JASP. No revisions were allowed after AI feedback to avoid contamination of subsequent measures, and listwise handling was applied for repeated-measures tests to maintain comparability.
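For illustration, a minimal re-analysis sketch of this pipeline is given below, assuming simulated ratings and hypothetical variable names; the study itself ran the analyses in JASP, and sphericity checks, Greenhouse–Geisser corrections, and partial eta squared are not reproduced here.

```python
# Illustrative sketch only: simulated 1-5 rubric ratings for one criterion,
# analyzed with a two-way repeated-measures ANOVA (Tool x Topic) and
# Bonferroni-adjusted pairwise t-tests, mirroring the analyses run in JASP.
from itertools import combinations, product

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
tools = ["ChatGPT-4", "Gemini", "Copilot"]
topics = ["Topic 1", "Topic 2", "Topic 3"]

# 15 participants x 3 tools x 3 topics, as in the within-subjects 3x3 design.
rows = [
    {"participant": p, "tool": tool, "topic": topic,
     "score": float(np.clip(rng.normal(4.0, 0.8), 1, 5))}
    for p in range(1, 16)
    for tool, topic in product(tools, topics)
]
df = pd.DataFrame(rows)

# Omnibus tests: main effects of Tool and Topic plus their interaction.
anova = AnovaRM(df, depvar="score", subject="participant", within=["tool", "topic"]).fit()
print(anova.summary())

# Pairwise tool comparisons averaged over topics, with Bonferroni adjustment.
by_tool = df.groupby(["participant", "tool"])["score"].mean().unstack()
for a, b in combinations(tools, 2):
    t, p = ttest_rel(by_tool[a], by_tool[b])
    print(f"{a} vs {b}: t(14) = {t:.3f}, p_bonf = {min(p * 3, 1.0):.3f}")
```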
Qualitative data from open-ended responses were analyzed inductively using thematic analysis. Coding and thematization were conducted systematically by the researcher, with an audit trail maintained to document coding decisions and theme development. Codes and themes were iteratively reviewed and refined to enhance consistency and minimize bias. Microsoft Excel was used to organize responses, identify recurring keywords, and group comments thematically across different writing stages and evaluation questions.
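As a simplified illustration of keyword-based grouping that could complement such coding, the sketch below tallies a few hypothetical responses against themes taken from Table 5; the study itself coded inductively in Microsoft Excel, and the keywords and sample responses shown here are hypothetical.

```python
# Minimal sketch: tally hypothetical open-ended responses against recurring
# keywords for a few of the themes reported in Table 5. Inductive coding in the
# study itself was done manually; this only illustrates keyword-based screening.
from collections import Counter

theme_keywords = {
    "Clarity/Understandability": ["clear", "understandable", "organized"],
    "Detail/Comprehensiveness": ["detail", "detailed", "comprehensive"],
    "Effectiveness/Efficiency": ["fast", "quick", "efficient", "helpful"],
}

responses = [  # hypothetical examples of student comments
    "It gave a clear and organized structure for my outline.",
    "The feedback was detailed and explained my mistakes.",
    "It was fast and helpful when I had no ideas.",
]

counts = Counter()
for response in responses:
    text = response.lower()
    for theme, keywords in theme_keywords.items():
        if any(keyword in text for keyword in keywords):
            counts[theme] += 1

for theme, n in counts.most_common():
    print(f"{theme}: {n} response(s)")
```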
To strengthen validity, this study used methodological triangulation by combining quantitative rubric scores with qualitative open-ended responses. Quantitative data were analyzed first to identify patterns and significant effects, followed by qualitative analysis to provide context and contrast. Integrating both data types enabled cross-validation and a fuller understanding of participants’ experiences with the AI tools.
Findings
This section reports the findings from the evaluation of three generative AI tools: ChatGPT-4, Google Gemini, and Microsoft Copilot. Table 1 summarizes the mean scores and two-way repeated measures ANOVA results for each rubric criterion, indicating differences related to both the AI tool and the essay topic. The findings are presented under three categories: brainstorming, outlining, and feedback. Each section integrates quantitative results with qualitative student feedback to provide a comprehensive analysis of tool effectiveness.
Table 1. Descriptive Statistics for AI Tool Performance by Rubric Criterion and Essay Topic
Rubric Criterion | AI Tool | Topic 1 (M, SD) | Topic 2 (M, SD) | Topic 3 (M, SD) |
Argument Diversity | ChatGPT | 3.867 (0.915) | 3.667 (1.291) | 4.000 (0.926) |
Gemini | 3.800 (0.941) | 3.867 (0.834) | 4.200 (0.862) | |
Copilot | 3.933 (0.704) | 3.933 (0.704) | 4.000 (1.000) | |
Relevance | ChatGPT | 4.667 (0.488) | 4.333 (0.816) | 4.200 (1.146) |
Gemini | 4.533 (0.915) | 3.867 (0.743) | 4.133 (1.125) | |
Copilot | 4.600 (0.632) | 4.067 (0.799) | 4.400 (1.056) | |
Depth of Arguments | ChatGPT | 3.600 (1.121) | 4.333 (0.816) | 3.933 (1.100) |
Gemini | 3.933 (0.884) | 4.133 (0.743) | 4.000 (1.069) | |
Copilot | 4.067 (0.799) | 3.733 (0.799) | 4.067 (1.033) | |
Thesis Strength | ChatGPT | 3.933 (1.033) | 4.000 (1.000) | 4.200 (1.014) |
Gemini | 3.733 (0.799) | 3.933 (0.961) | 4.000 (1.069) | |
Copilot | 3.733 (0.704) | 4.067 (0.961) | 3.933 (0.961) | |
Logical Structure | ChatGPT | 4.067 (0.884) | 4.067 (0.704) | 4.133 (0.834) |
Gemini | 4.067 (0.884) | 3.867 (0.640) | 3.733 (0.884) | |
Copilot | 4.067 (0.799) | 3.867 (0.640) | 4.067 (0.884) | |
Strength of Supporting Reasons | ChatGPT | 4.000 (1.000) | 4.000 (0.756) | 4.000 (1.134) |
Gemini | 3.867 (0.743) | 4.467 (0.743) | 4.067 (1.163) | |
Copilot | 3.867 (0.915) | 4.267 (0.884) | 3.867 (1.060) | |
Detail Level | ChatGPT | 3.400 (1.121) | 4.000 (1.069) | 4.000 (1.000) |
Gemini | 3.867 (1.060) | 3.800 (1.082) | 3.867 (1.187) | |
Copilot | 3.600 (0.632) | 3.733 (1.223) | 3.800 (0.941) | |
Feedback Specificity | ChatGPT | 4.133 (0.990) | 4.333 (0.900) | 4.067 (0.961) |
Gemini | 3.733 (1.100) | 4.133 (0.834) | 4.267 (0.799) | |
Copilot | 3.733 (0.704) | 4.067 (0.884) | 4.067 (0.884) | |
Actionability | ChatGPT | 4.267 (0.594) | 4.267 (0.704) | 4.467 (0.743) |
Gemini | 3.667 (1.234) | 4.067 (0.704) | 4.200 (0.676) | |
Copilot | 4.000 (0.655) | 4.067 (0.704) | 3.667 (0.976) | |
Grammar & Coherence Feedback | ChatGPT | 4.267 (0.594) | 4.333 (0.900) | 4.467 (0.743) |
Gemini | 3.800 (1.320) | 3.867 (0.743) | 4.267 (0.961) | |
Copilot | 4.000 (0.655) | 4.067 (0.884) | 4.200 (0.775) |
Note. M = mean; SD = standard deviation.
Table 2. Mean Scores and Repeated Measures ANOVA Results for AI Tool and Essay Topic
Rubric Criterion | Factor | df | F | p | η² | Significant Difference (p < .05) |
Argument Diversity | Topic | 2,28 | 0.640 | .535 | 0.044 | NO |
Tool | 2,28 | 0.414 | .665 | 0.029 | NO | |
Topic×Tool | 4,56 | 0.314 | .868 | 0.022 | NO | |
Relevance of Arguments | Topic | 2,28 | 3.109 | .060 | 0.182 | NO |
Tool | 2,28 | 3.664 | .039 | 0.207 | YES (p < .05) | |
Topic×Tool | 4,56 | 0.769 | .550 | 0.052 | NO | |
Depth of Arguments | Topic | 2,28 | 0.398 | .676 | 0.028 | NO |
Tool | 2,28 | 0.098 | .907 | 0.007 | NO | |
Topic×Tool | 4,56 | 1.812 | .139 | 0.115 | NO | |
Thesis Strength | Topic | 2,28 | 0.653 | .528 | 0.045 | NO |
Tool | 2,28 | 0.455 | .639 | 0.032 | NO | |
Topic×Tool | 4,56 | 0.178 | .949 | 0.013 | NO | |
Logical Structure | Topic | 2,28 | 0.287 | .753 | 0.020 | NO |
Tool | 2,28 | 0.966 | .393 | 0.065 | NO | |
Topic×Tool | 4,56 | 0.500 | .736 | 0.035 | NO | |
Strength of Supporting Reasons | Topic | 2,28 | 0.972 | .391 | 0.065 | NO |
Tool | 2,28 | 0.375 | .691 | 0.026 | NO | |
Topic×Tool | 4,56 | 0.801 | .530 | 0.054 | NO | |
Detail Level | Topic | 2,28 | 0.853 | .437 | 0.057 | NO |
Tool | 2,28 | 0.216 | .807 | 0.015 | NO | |
Topic×Tool | 4,56 | 0.619 | .651 | 0.042 | NO | |
Feedback Specificity | Topic | 2,28 | 1.754 | .192 | 0.111 | NO |
Tool | 2,28 | 0.645 | .532 | 0.044 | NO | |
Topic×Tool | 4,56 | 0.473 | .755 | 0.033 | NO | |
Actionability | Topic | 2,28 | 0.507 | .608 | 0.035 | NO |
Tool | 2,28 | 5.025 | .014 | 0.264 | YES (p < .05) | |
Topic×Tool | 4,56 | 1.526 | .207 | 0.098 | NO | |
Grammar & Coherence Feedback | Topic | 2,28 | 1.057 | .361 | 0.070 | NO |
Tool | 2,28 | 2.860 | .074 | 0.170 | NO | |
Topic×Tool | 4,56 | 0.198 | .938 | 0.014 | NO |
Table 3. Post-hoc Pairwise Comparisons for Relevance of Arguments by Tool
Comparison |  | Mean difference | Standard error | df | t | p_bonf
ChatGPT | Gemini | 0.222 | 0.090 | 14 | 2.467 | .081 |
Copilot | 0.044 | 0.079 | 14 | 0.564 | 1.00 | |
Gemini | Copilot | -0.178 | 0.091 | 14 | -1.948 | .215 |
Note. P-value adjusted for comparing a family of 3 estimates.
Note. Results are averaged over the levels of: Topic
Table 4. Post-hoc Pairwise Comparisons for Actionability by Tool
Comparison |  | Standard error | df | t | p_bonf
ChatGPT | Gemini | 0.124 | 14 | 2.874 | .037 |
Copilot | 0.136 | 14 | 3.106 | .023 | |
Gemini | Copilot | 0.167 | 14 | 0.400 | 1.00 |
Note. P-value adjusted for comparing a family of 3 estimates.
Note. Results are averaged over the levels of: Topic
Table 5. Thematic Analysis of Student Reasoning Patterns
Theme | Brainstorming | Outlining | Feedback | Future Use | TOTAL
Clarity/Understandability | 4 | 19 | 11 | 6 | 40 |
Detail/Comprehensiveness | 7 | 8 | 16 | 5 | 36 |
Effectiveness/Efficiency | 6 | 2 | 4 | 10 | 22 |
Examples/Variety | 5 | 9 | 1 | 5 | 20 |
Structure/Organization | 0 | 6 | 1 | 2 | 9 |
Grammar/Language | 0 | 0 | 8 | 0 | 8 |
Interface/Presentation | 0 | 0 | 0 | 7 | 7 |
Specificity/Precision | 0 | 0 | 5 | 0 | 5 |
Brainstorming
The brainstorming phase encompasses the initial generation and development of opinion essay content, specifically evaluating the diversity, relevance, and depth of arguments produced by the AI tools.
Analysis of argument diversity revealed no statistically significant differences among the three AI tools, with mean scores remaining consistent across the tools (Table 1). There were also no significant effects of topic or tool-topic interaction (Table 2), suggesting that all tools generated similarly diverse content for opinion essays, regardless of topic.
For argument relevance, there was a significant main effect of Tool, F(2, 28) = 3.66, p = .039, η² = .21, indicating that the type of AI tool used influenced the relevance of arguments (Table 2). However, post-hoc pairwise comparisons using the Bonferroni adjustment (Table 3) did not reveal any significant differences between tools: ChatGPT vs. Gemini (mean difference = 0.22, p = .081), ChatGPT vs. Copilot (mean difference = 0.04, p = 1.00), and Gemini vs. Copilot (mean difference = -0.18, p = .215). There was no significant main effect of Topic and no Topic × Tool interaction. Thus, while the overall effect of Tool was statistically significant, no specific tool outperformed the others in producing more relevant arguments.
Argument depth showed no significant differences between the AI tools or topics, and no interaction effects were observed (Tables 1 and 2). These results indicate that all three AI tools offer comparable depth in opinion content generation across topics.
Qualitative responses on brainstorming preferences showed more variation than for other tasks. Among the 15 students, 5 preferred ChatGPT-4, 4 chose Gemini, 1 favored Copilot, 4 rated all tools similarly, and 1 found none helpful. Comprehensiveness (7 mentions) and effectiveness (6 mentions) were the most cited reasons (Table 5). While the quantitative results indicated an overall effect of tool on the relevance of arguments, post-hoc comparisons did not identify a significant difference between any pair of tools. Students nevertheless expressed clear perceptions of what made certain tools more helpful. For example, Pancake found Gemini “the most helpful” for brainstorming because it gave “lots of item[s]” and offered a wider variety of ideas. Belma preferred ChatGPT-4, noting it “helped me narrow down relevant points.” Perceptions of diversity and depth were broadly similar across tools. Gemini was frequently linked to detail and insight, Copilot to clarity and uniqueness, and ChatGPT-4 to strong or creative content. Still, many students found the tools to produce comparable brainstorming support overall.
Outlining
The outlining phase evaluates the structural organization of opinion essays, including thesis development, logical structure, strength of supporting reasons, and level of detail.
Thesis strength did not differ significantly across AI tools, with no significant main effect of topic or interaction effects observed (Table 2). As indicated in Table 1, mean scores were comparable across all three tools, suggesting that each is similarly capable of producing strong thesis statements.
The logical structure was consistent across all AI tools, with no significant differences between tools, topics, or interaction effects observed (Tables 1 and 2). This suggests all three AI tools are equally capable of organizing opinion content coherently.
No significant differences were found in the strength of supporting reasons between AI tools, topics, or their interaction (Tables 1 and 2). Findings show all three tools performed similarly in generating strong supporting reasons.
Analysis of the detail level criterion revealed no significant differences between AI tools, topics, or their interaction (Tables 1 and 2). Although Gemini scored slightly higher on detail, Table 1 shows these differences were not significant, indicating comparable performance across tools in providing detailed opinion content.
Qualitative responses on outlining preferences showed clearer distinctions between tools. Of the 15 students, 7 preferred Gemini, 5 Copilot, and 3 ChatGPT-4. Clarity (19 mentions) and variety of examples (9 mentions) were the most cited reasons (Table 5), with many describing a “clear and organized structure” and a “wide range of examples for main points.” Unlike brainstorming, no students rated all tools similarly, suggesting greater perceived differences in this phase. Copilot was often praised for clarity and structure, as Ares noted it provided a “strong and clear thesis” and was “logically structured.” Gemini was frequently associated with depth and detail, with Serendipity commenting it “gave the most details,” while dnz_ preferred ChatGPT-4 for offering “more options.” These patterns reflect the stronger individual preferences students formed during outlining.
Feedback
The feedback phase assesses the AI tools’ capacity to provide specific, actionable, and grammatically coherent feedback on opinion essays.
Feedback specificity showed no significant differences between tools, topics, or their interaction (Tables 1 and 2). Despite the numerical differences evident in Table 1, all three tools demonstrated a statistically comparable capacity for providing specific feedback on opinion essays.
For the Actionability criterion, the main effect of AI tool was statistically significant (Table 2). Post-hoc pairwise comparisons with Bonferroni correction indicated that ChatGPT-4 received significantly higher scores than both Google Gemini (p = .037) and Microsoft Copilot (p = .023) (Table 4). No significant difference was found between Gemini and Copilot (p = 1.000). Mean scores across topics were highest for ChatGPT-4 (M ≈ 4.33), followed by Gemini (M ≈ 3.98) and Copilot (M ≈ 3.91), indicating that students perceived ChatGPT-4’s feedback as more actionable than that of the other two tools.
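These tool-level means follow from averaging the three topic means for Actionability reported in Table 1:

$$\bar{M}_{\text{ChatGPT-4}}=\tfrac{4.267+4.267+4.467}{3}\approx 4.33,\qquad \bar{M}_{\text{Gemini}}=\tfrac{3.667+4.067+4.200}{3}\approx 3.98,\qquad \bar{M}_{\text{Copilot}}=\tfrac{4.000+4.067+3.667}{3}\approx 3.91$$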
Grammar and coherence feedback did not differ significantly between tools, topics, or their interaction (Tables 1 and 2). Results show that all three tools performed equally in providing feedback on grammar and coherence.
Qualitative responses to feedback preferences showed the greatest differentiation among tasks. Of the 15 students, 8 preferred ChatGPT-4, 5 Gemini, and 2 Copilot. Detail (16 mentions) and clarity (11 mentions) were the most cited reasons, with many highlighting “longer and more detailed” or “clear and understandable” feedback. ChatGPT-4 emerged as the clear favorite, often described as providing not only specific but also actionable suggestions. Şampiyonpati noted it “explained all my mistakes in detail,” while Ares praised its “clear revision steps and precise corrections.” Gemini also received comments on step-by-step support, such as Pancake’s mention of guidance for correcting mistakes. Copilot was valued by a few, such as Harmony, for being concise and direct, but was less frequently described as revision-oriented. These reflections align with the quantitative finding that ChatGPT-4 scored significantly higher than both Gemini and Copilot for actionability, while no significant difference was found between Gemini and Copilot. This alignment between statistical results and student perceptions reinforces the view that ChatGPT-4’s feedback was not only more comprehensive but also more readily applicable in the revision process.
Student Reflections After the Study
After completing all three writing tasks and engaging with ChatGPT-4, Google Gemini, and Microsoft Copilot, students were asked to reflect on their overall experiences. The post-study survey included four open-ended questions: (1) How did the AI tools help you improve your writing? (2) What difficulties did you encounter while using the AI tools? (3) Which tool did you find the most helpful, and why? (4) Would you recommend using AI tools for future writing tasks? Why or why not? The responses provided valuable insights into students’ evolving attitudes toward AI-assisted writing, highlighting both benefits and challenges.
Post-study reflections from 15 students revealed that AI tools were perceived as most beneficial for grammar and language correction (7 mentions) and providing feedback for learning (7 mentions), followed by idea generation and brainstorming support (5 mentions). Students valued the immediate feedback and error identification capabilities of the tools.
Technical issues like internet connectivity and upload problems were the most common challenges (5 mentions), while 4 students reported no difficulties. Device limitations, especially using mobile phones instead of computers, also posed challenges for some.
When asked to identify the most helpful tool overall, 5 students preferred ChatGPT-4, 4 Gemini, and 1 Copilot, while 5 expressed mixed or equal preferences. Although task-specific preferences were clear, overall preferences varied more among individuals.
Ten of the fifteen students recommended using AI tools for future writing tasks, highlighting faster feedback, independent learning, and easy access to help. Three gave conditional recommendations, expressing concerns that over-reliance might hinder independent thinking.
Discussion
This study compared ChatGPT-4, Google Gemini, and Microsoft Copilot in supporting B+ level EFL students during brainstorming, outlining, and feedback in opinion essay writing. Drawing on quantitative rubric scores and qualitative reflections, it explored whether any tool outperformed the others and how students experienced each stage. The discussion interprets these findings in relation to existing literature and student perceptions.
Brainstorming
Quantitative results for brainstorming showed no statistically significant differences among ChatGPT-4, Google Gemini, and Microsoft Copilot in terms of argument diversity or depth, indicating comparable idea quality. Students’ perceptions supported this finding, with several describing the tools as producing similarly comprehensive and useful ideas. For instance, Cansh characterized Gemini’s ideas as “detailed, well-developed, and insightful,” while Cloud felt that “they were all the same, they said the same things.” For argument relevance, an overall effect of tool was observed; however, post-hoc comparisons did not identify a statistically significant difference between any pair of tools. Up-to-date comparative AI technology reviews frequently evaluate the strengths of popular generative AI tools. In recent comparisons, ChatGPT is consistently highlighted for its creative writing and brainstorming capabilities, while Gemini is noted for its multimodal reasoning within the Google ecosystem, and Copilot is recognized for its productivity-oriented integration with the Microsoft environment (DataStudios, 2025; UpforceTech, 2025). This consistent emphasis on ChatGPT’s brainstorming strengths may help explain why, despite the absence of statistical significance, it achieved the highest average scores for argument relevance in the brainstorming stage.
Some participants preferred ChatGPT-4 for its user-friendliness and speed, echoing Almumen and Jouhar’s (2025) finding that it can enhance argument quality and supporting evidence through a rich brainstorming process. Belma described it as “easy and fast,” while SteelShadow emphasized that “it gave details so much” and was “more comfortable.” Copilot received fewer mentions, but Şampiyonpati appreciated its “broad brainstorming session” and ability to anticipate “potential issues.”
Overall, these perceptions confirm that while quantitative data showed parity in performance, subjective experiences varied. Students’ comments indicate that preferences were shaped by the quality of ideas as well as clarity, speed, and perceived creativity, with Gemini noted for offering varied perspectives (Karanjakwut & Charunsri, 2025).
Outlining
Outlining results showed no significant differences between the three tools across thesis strength, logical structure, supporting reasons, and detail level, suggesting all can help students structure essays effectively. However, qualitative data revealed clearer task-specific preferences: Gemini was chosen by 7 students, Copilot by 5, and ChatGPT-4 by 3, with clarity and variety of examples most often cited. These findings align with Azmi and Fithriani (2025), who reported that Gemini improved organization and coherence by supporting clearer paragraph transitions and logical sequencing. In our study, Cloud appreciated Gemini’s clarity in outlining, and Cansh noted its “wide range of examples.” Similarly, Zainurrahman (2024) observed that Copilot can generate clear, logically divided outlines, which some of our participants found useful. ECRN, for example, chose Copilot for brainstorming but preferred ChatGPT-4 for feedback, noting that both were “detailed.”
ChatGPT-4’s outlining strengths, described by Almumen and Jouhar (2025) as producing coherent structures and exposing students to sophisticated text patterns, were echoed by Ares, who found its outlines “accurate and clear.” However, the lack of significant differences suggests that these strengths were not unique to any one tool. Student comments suggest that outlining preferences were driven more by clarity, structure, and breadth of examples than by measurable differences.
Feedback
In the feedback phase, no significant differences were found between tools in specificity, grammar, and coherence. However, actionability showed a statistically significant difference, with ChatGPT-4 scoring higher than both Google Gemini and Microsoft Copilot, while no significant difference was found between Gemini and Copilot. ChatGPT-4 was most preferred, chosen by 8 students, followed by Gemini (5) and Copilot (2).
Students valued ChatGPT-4’s detailed, clear, and actionable feedback, consistent with Bruhn and Marquart (2025), who found it useful for early drafting and language feedback. Dnz_ highlighted its “more detailed and explanatory grammar feedback,” and Asiye described its feedback as “better and more detailed.” These perceptions align with Escalante et al. (2023), who describe ChatGPT as an intelligent tutor providing personalized, targeted guidance. Students also appreciated the non-judgmental nature of AI feedback, which Bruhn and Marquart (2025) linked to reduced anxiety in revision. ECRN admitted they sometimes avoided asking teachers or peers questions, but felt comfortable consulting AI. Similarly, Cloud valued the independence it provided, saying, “You can study on your own, you do not need anyone else.” Lu et al. (2024) noted that immediate feedback can enhance metacognitive reflection, a benefit echoed by Zeyzey, who “learned a lot from the instant feedback.”
Gemini’s feedback performance was competitive, especially in coherence and cohesion, as D. L. Nguyen et al. (2025) found. Hera described its feedback as “very detailed and clear.” However, limitations in spelling and word formation feedback noted by D. L. Nguyen et al. (2025) may help explain why it scored lower than ChatGPT-4 in grammar-related criteria.
Copilot’s straightforward and easy-to-understand feedback, as described by Zainurrahman (2024), was appreciated by Şampiyonpati for its breadth, though fewer students preferred it overall.
The statistical advantage of ChatGPT-4 in actionability was reinforced by qualitative comments describing its feedback as offering clear next steps and essay-specific suggestions, aligning with Woo et al. (2023), who suggest diverse information can foster creative thinking. Overall, although ChatGPT-4, Google Gemini, and Microsoft Copilot performed similarly across most feedback criteria, ChatGPT-4 stood out for the actionable guidance it provided.
Conclusion
This study examined the effectiveness of ChatGPT-4, Google Gemini, and Microsoft Copilot in supporting B+ level EFL students through brainstorming, outlining, and feedback stages of opinion essay writing. Quantitative analysis revealed no statistically significant differences between the tools for most rubric criteria, with the only significant effect being in the feedback stage, where ChatGPT-4 scored higher than both Gemini and Copilot for actionability. During the brainstorming stage, an overall effect of tool on argument relevance was observed; however, post-hoc comparisons did not identify a statistically significant difference between any pair of tools. Qualitative findings highlighted clear task-specific preferences: Gemini was often valued for clarity and variety in brainstorming and outlining, ChatGPT-4 was favored for detailed, clear, and actionable feedback, and Copilot, while less frequently selected, was appreciated for certain organizational and brainstorming strengths. These results suggest that while all three AI tools are objectively comparable in many aspects of performance, individual learner experiences and perceptions can vary substantially depending on both the task and the tool’s characteristics.
Recommendations
Recommendations for Educational Practice
The findings suggest AI integration in writing instruction is most effective when students can choose the tool that suits each stage of the process. Rather than prescribing a single platform, educators can offer multiple AI tools and encourage their strategic use for brainstorming, outlining, and feedback. Careful topic selection is essential, as familiarity and interest significantly influence the relevance of the arguments generated. Training in effective prompt design can help students obtain clearer, targeted outputs, and AI feedback should complement rather than replace teacher input. To promote equity, institutions should address technical barriers such as device and connectivity issues and provide adequate resources. Finally, classroom discussions on AI’s benefits and limitations can encourage students to utilize these tools to enhance, rather than replace, their own critical thinking and creativity.
Recommendations for Future Research
Future research could develop and validate a reliable, widely applicable rubric for evaluating AI tools across educational purposes. While the rubric in this study was tailored for a small group and based on literature and expert review, a more robust, widely tested instrument would improve comparability across contexts. Including larger, more diverse student populations could enable richer analysis, such as comparisons by proficiency level, preparatory school setting, and cross-country context. Such studies could clarify how contextual and demographic factors influence the performance of AI tools and learner perceptions, offering a more comprehensive understanding of their role in language learning.
Limitations
This study has several limitations that should be considered when interpreting the findings. The sample consisted of 15 B+ level students from a single preparatory program in Türkiye; therefore, the results may not be generalizable to other proficiency levels, educational settings, or institutions.
While the quantitative phase involved a small sample (N = 15), the data collection was intensive and comprehensive. Each participant completed three essay tasks on different topics, using all three AI tools for brainstorming, outlining, and feedback, and provided detailed ratings and reflections for each condition. This repeated-measures design generated multiple observations per participant, enhancing reliability and statistical power despite the modest sample size. The demanding protocol limited the number of participants but enabled in-depth, controlled analysis of tool and topic effects. Findings should be viewed as exploratory and context-specific, with future research encouraged to use larger, more diverse samples. The topic order was not randomized, as all topics were provided at the outset to accommodate the self-paced format and the replacement of withdrawn participants; thus, potential sequencing effects cannot be ruled out. The study spanned only three weeks and three opinion essay tasks, which did not allow for long-term tracking of students’ writing development or analysis of how tool use might influence writing habits over time. Another limitation is the use of free versions of ChatGPT-4, Google Gemini, and Microsoft Copilot, which may offer reduced functionality compared to paid or updated versions. Since access to GPT-4 ended during the finalization of this article, future replications may involve newer versions, which could affect comparability. The study was also a preliminary investigation aimed at identifying the most suitable AI tool for this learner group. Broader areas such as teacher feedback comparison, detailed analysis of student–AI interactions, and long-term writing improvement were outside the current scope but could be explored in future research. The evaluation rubric, although informed by literature and reviewed by experts, was designed for this small cohort and not tested for broader validity or reliability, which limits its applicability beyond this study. Reliance on student self-reports for much of the qualitative data may have introduced bias, as reflections could have been shaped by prior experiences or immediate impressions rather than sustained evaluation. Technical and environmental factors such as device type, internet stability, and prior familiarity with AI tools may also have influenced interactions with each platform. These constraints, together with the focus on a single proficiency level in one institution, mean that caution is needed when applying these findings to other learner groups or contexts.
Ethics Statements
The study involving human participants was reviewed and approved by the Ethics Committee of Ankara University. All participants were informed about the study and its procedures, and they provided written informed consent to participate voluntarily. The study complied with ethical standards for research involving human subjects.
The author would like to thank the students who voluntarily participated in the study and the instructors who provided valuable feedback during the development of the data collection tools.
Generative AI Statement
As the author of this work, I used the AI tool ChatGPT-4o (OpenAI, paid version) for the purpose of language polishing, sentence rephrasing, and improving clarity. After using this AI tool, I reviewed and verified the final version of the manuscript. I take full responsibility for the content of the published work.