
Generative AI Usage and Academic Performance

Janik Ole Wecks^a,*
ORCID: https://orcid.org/0009-0000-4116-1888
LinkedIn: https://www.linkedin.com/in/janik-ole-wecks-960ba01b2/

Johannes Voshaar^a
ORCID: https://orcid.org/0000-0003-0276-4509
LinkedIn: https://www.linkedin.com/in/johannes-voshaar-431aa6201

Benedikt Jost Plate^a
LinkedIn: https://www.linkedin.com/in/benedikt-j-plate-710a2b144/

Jochen Zimmermann^a
ORCID: https://orcid.org/0000-0002-1189-7007

^a Faculty of Business Studies and Economics, University of Bremen, Bremen, Germany

This version: May 2024

Abstract
This study evaluates the impact of students’ usage of generative artificial intelligence
(GenAI) tools such as ChatGPT on their academic performance. We analyze student essays
using GenAI detection systems to identify GenAI users among the cohort. Employing
multivariate regression analysis, we find that students using GenAI tools score on average
6.71 (out of 100) points lower than non-users. While GenAI tools may offer benefits for
learning and engagement, the way students actually use them correlates with diminished
academic outcomes. Exploring the underlying mechanism, additional analyses show that the
effect is particularly detrimental to students with high learning potential, suggesting a mechanism
whereby GenAI tool usage hinders learning. Our findings provide important empirical
evidence for the ongoing debate on the integration of GenAI in higher education and
underscore the necessity for educators, institutions, and policymakers to carefully consider
its implications for student performance.

Keywords: Generative AI, ChatGPT, Artificial Intelligence, Higher Education, Exam Performance.

* Corresponding author. Address: Faculty of Business Studies and Economics, University of Bremen, Max-von-Laue-Str. 1, 28213 Bremen, Germany; Tel.: +49 (421) 218 666 89; Email: wecks@uni-bremen.de.
1 Introduction

The launch of OpenAI’s user-friendly and conversational ChatGPT in November 2022 made

generative artificial intelligence (GenAI) models widely accessible to a broad audience, regardless

of technical proficiency (Kishore et al. 2023). ChatGPT can process and generate natural language

and performs exceptionally well at solving real-world problems through back-and-forth conversations,

question-answering, and machine translation (Lee 2023; Zhu et al. 2023). These novel

characteristics led to a tremendous surge in public attention, with over 100 million monthly active

users two months after its launch (Ahangama 2023; Gregor 2024). The launch spurred heated

discussions on the implications of GenAI across various sectors of society. Researchers and media

voiced concern about its potential for spreading misinformation, undermining trust (Hsu and

Thompson 2023), and threatening democratic processes and social cohesion (Ferrara 2024).

The popularity of ChatGPT among students has led to extensive debate over what role

GenAI applications should play in higher education (Abdaljaleel et al. 2024; Katavic et al. 2023;

Ngo 2023; Strzelecki 2023; Tiwari et al. 2023). GenAI tools offer considerable benefits such as

enabling personalized learning and adaptive instruction, enhancing learning efficiency and student

engagement, as well as providing intelligent tutoring systems including real-time feedback, hints,

and scaffolding (Chen et al. 2022; Kishore et al. 2023). Nevertheless, these tools may hinder

students’ ability to think independently and critically and to solve problems; they also harbor

strong potential for perpetuating biases and misinformation (Kasneci et al. 2023; Kishore et al.

2023). Some educational institutions have thus prohibited the use of ChatGPT at school or blocked

it on school devices and networks (Johnson 2023; Weber-Wulff et al. 2023).

This debate has been informed by recent research examining the implications of GenAI

applications in higher education settings. Studies have found that GenAI enhances student

understanding, engagement, and academic writing by simplifying information (Engelmann et al.

2023; Pavlik 2023), providing grammatical feedback (Perkins 2023; Wu et al. 2023), and fostering

creativity (Qadir 2023). Other research has also found detrimental effects, such as GenAI leading

to decreased learning performance due to passive information consumption (Markauskaite et al.

2022), reduced social interaction (Eager and Brunton 2023), and superficial learning (Rasul et al.

2023). Studies across the board have identified various ways GenAI can support or hinder student

learning. However, the overall effect on students’ performance remains unclear. We seek to bridge

this gap by addressing the research question: How does students’ GenAI usage affect their exam

performance?

To do so, we collect a sample of student data and empirically assess the effect of GenAI

usage on students’ academic performance. We employ a fixed effects regression model controlling

for numerous factors affecting the exam score. Because students’ use of GenAI for writing case

study essays is not directly observable, we harness the capabilities of ZeroGPT, a renowned GenAI

detection system. We conduct a set of robustness checks to ensure our findings hold under different

approaches for identifying GenAI usage and we address potential endogeneity issues using entropy

balancing. In additional analyses, we utilize an identification strategy with exam retakers to further

examine the causal effect and disaggregate the effect by considering students’ learning potential.

Our results show that students who use GenAI score significantly lower on exams. This finding

holds after including control variables, fixed effects, and several robustness checks. The negative

effect is particularly large for students with high learning potential, indicating that GenAI use

affects exam performance by impeding users’ learning progress.

We contribute to the existing literature in several ways. First, we extend the information

systems (IS) research on GenAI applications by examining their implications in higher education. A

number of IS studies emphasize the technical capabilities and power of GenAI tools and their

disruptive potential (Ahangama 2023; Chen et al. 2020; Kasneci et al. 2023; Lee 2023; Mu et al.

2022; Radford et al. 2018). We shed light on how GenAI impacts critical areas of society by

documenting empirical evidence on the effect of GenAI usage on exam scores. Second, we

contribute to higher education literature by investigating the effect of GenAI on academic

performance. Educational institutions and researchers have discussed the costs and benefits of

GenAI extensively, ultimately asking whether tools such as ChatGPT should be banned in higher

education (Chhina et al. 2023; Kishore et al. 2023; Van Slyke et al. 2023). We not only provide

evidence on the implications of GenAI for academic performance, but are also able to address

effects on different user groups. Third, by using detection systems to identify GenAI usage, we

extend research on the functionalities of GenAI detectors, discuss various approaches to identify

AI-generated content, and document empirical findings on which GenAI detectors provide

trustworthy results.

The paper is structured as follows: Section 2 covers the conceptual basics of GenAI and

large language models (LLMs) and summarizes the relevant literature. In Section 3, we describe

our multivariate model, discuss AI detectors, and provide details on our sample and descriptive

statistics. Section 4 presents the empirical results, robustness checks, and additional analyses,

which are discussed in detail in Section 5. In Section 6, the paper concludes with a summary, an

examination of the study’s limitations and an outlook for further research.

2 Conceptual Basics and Related Work

2.1 Generative AI and Large Language Models

GenAI refers to machine learning techniques (e.g., neural networks) that create seemingly novel and

meaningful data instances or artifacts based on patterns and relationships in training data

(Feuerriegel et al. 2024; Tao et al. 2023). These artifacts appear in various forms such as text,

images, sound, and video (Alavi et al. 2024). LLMs are a subset of GenAI models capable of

processing and creating natural language by applying learning technologies to extensive datasets

(Lee 2023). They can comprehend context and create textual data outputs similar to human

language without requiring specific input formats (Brown et al. 2020; Teubner et al. 2023; von

Brackel-Schmidt et al. 2023; Wilson et al. 2023).

GenAI constitutes the larger technological infrastructure required for the practical

implementation of LLMs, including the actual model and user-facing components, their modality,

and corresponding data processing (Feuerriegel et al. 2024). Such implementation enables users

to enter input data and instructions conditioning the LLM, which is referred to as prompting

(Feuerriegel et al. 2024; von Brackel-Schmidt et al. 2023). With the emergence of conversational

LLMs (e.g., models with a chat-based interface), prompting shifted from one-off inputs toward

multi-step interactions (von Brackel-Schmidt et al. 2023). Such GenAI models are capable of

completing various tasks, such as developing creative ideas, software coding, or textual content

creation with high accuracy in grammar and wording (Yuan and Chen 2023). These capabilities

render GenAI applications particularly interesting for knowledge work as in academia and higher

education (Benbya et al. 2024; Yuan and Chen 2023).

2.2 Literature Review and Research Question

A considerable body of literature has rapidly emerged discussing how GenAI influences learning

behavior and success. GenAI seems to offer several benefits for learning, potentially supporting

academic performance. For example, numerous studies discuss how GenAI chatbots can serve as

virtual tutors. Fauzi et al. (2023) and Gilson et al. (2023) report that ChatGPT can help students

by providing accurate answers to unclear questions, which can lead to better understanding and

knowledge retention. In this way, GenAI has the potential to replace search engines by responding

directly to questions rather than providing information from which the answer must be pieced

together (AlAfnan et al. 2023). Pavlik (2023) and Engelmann et al. (2023) highlight the ability of

GenAI to summarize or simplify information into a shorter or less complex form. Students can use

this to understand and process textual learning material better or faster (Calderon et al. 2023;

Sallam et al. 2023). Qadir (2023) emphasizes the usefulness of ChatGPT in enhancing students’

creativity by helping them brainstorm ideas and organize their thoughts. Students can also use

GenAI as an academic writing assistant that helps with phrasing (Lund et al. 2023), correcting

grammar (Wu et al. 2023), or providing feedback (Perkins 2023). Research has also shown more

latent benefits. Cotton et al. (2023) document increased student engagement and collaboration,

while other studies mention higher motivation when studying is supported by ChatGPT (Ali et al.

2023; Fauzi et al. 2023). By providing plain language explanations, giving feedback on grammar,

and demystifying academic conventions, GenAI can be particularly helpful to disadvantaged or

less privileged students, such as non-native speakers or those with communication disabilities

(Sullivan et al. 2023).

Other studies report potential negative consequences of the use of GenAI on academic

performance. Markauskaite et al. (2022) argue that AI-assisted learning may decrease learning

performance by promoting passive consumption of information rather than active engagement.

Excessive usage also reduces opportunities for interactions with teachers and peers, impacting

social and emotional learning. Similarly, Eager and Brunton (2023) find that AI-enabled learning

companions negatively affect social interaction between students, which is an important

determinant of academic performance (Jain and Kapoor 2015). Research also argues that students’

approaches to learning are affected by the use of GenAI. Easy access to answers without the need

for close and detailed engagement with materials may lead to superficial learning (Rasul et al.

2023), which could hinder students’ ability to deeply understand learning materials (Crawford et

al. 2023a). Sallam et al. (2023) argue that the quick and easy answers from AI software impede

students’ development of independent problem-solving skills, which becomes a problem as there

is no AI available in exams. Crawford et al. (2023b) mention that inaccuracies and lack of

accountability in AI-generated content may lead to misinformation being incorporated into

students’ knowledge, ultimately reducing learning quality. Moreover, the extent to which GenAI

facilitates writing scientific texts and essays for students can also have negative effects on learning

(Lund et al. 2023). According to Milano et al. (2023), this diminishes the effort involved in crafting

well-written and argued texts – effort that helps in understanding course materials and which has

a positive influence on academic performance (Bangert-Drowns et al. 2004).

Research has thus identified various ways in which GenAI can support or hinder learning,

presenting a mixed picture of its impact on students’ performance. Ultimately, assessment in

higher education rests on exam performance; hence, a comprehensive understanding of GenAI

tools in higher education requires investigation of this tangible effect, which so far has not been

thoroughly explored. While existing studies have documented several individual effects of GenAI

usage on performance, our study seeks to examine its overall impact. To do so, we analyze how

students' GenAI use influences their exam scores. Focusing on exam scores provides a measure

that encapsulates the individual effects of GenAI on learning, offering a comprehensive view of

its impact on student performance.

3 Data and Methodology

3.1 Multivariate Model and Approach

To answer the research question, we utilize the educational setting of our institution’s first-year

introductory accounting class. To detect GenAI users among the cohort, we rely on case study

essays our students submitted during the semester. The case study is a knowledge transfer exercise designed for students to immerse themselves in the course material and enhance comprehension.

We identify GenAI-written texts by harnessing the capabilities of ZeroGPT, a popular and

frequently used online GenAI detector. This measure of GenAI usage allows us to empirically

assess its impact on exam performance using a fixed effects ordinary least squares (OLS)

regression. In line with related research (e.g., Chiu et al. 2023; Eskew and Faley 1988), we control

for various factors that have been shown to affect exam performance. The full OLS model reads

as follows:

$$\begin{aligned}
\mathit{Exam\ Score}_i = \beta_0 &+ \beta_1\, \mathit{GenAI\ User}_i + \beta_2\, \mathit{A\text{-}Level\ Grade}_i + \beta_3\, \mathit{Attempt}_i \\
&+ \beta_4\, \mathit{Attendance}_i + \beta_5\, \mathit{Vocational\ Training}_i + \beta_6\, \mathit{Voluntary\ Service}_i \\
&+ \beta_7\, \mathit{Female}_i + \beta_8\, \mathit{LinkedIn\ User}_i + \mathit{Course\ of\ Study\ FE} + \varepsilon_i \qquad (1)
\end{aligned}$$

Our dependent variable is Exam Score, which is a continuous measure indicating the

percentage of points a student achieved in the final exam. While the minimum is zero, the highest score actually achieved is 96.67 (out of an achievable 100).

taking the value one if a student uses GenAI for studying and for producing work that the instructor

intended to be written without such assistance, and zero otherwise. Based on the indicated

probability of our GenAI detector, we classify students as GenAI User if the text is more likely to

have been written by AI than not (percentage > 50%).[1]
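For concreteness, the estimation of Equation (1) can be sketched in Python as follows. This is a minimal illustration assuming a pandas DataFrame df with one row per student; the column names (e.g., zerogpt_score) are illustrative, not our actual estimation code.

```python
import statsmodels.formula.api as smf

# "More likely than not" rule: flag a student as a GenAI user if ZeroGPT
# classifies more than 50% of the essay text as AI-generated.
df["genai_user"] = (df["zerogpt_score"] > 0.5).astype(int)

FORMULA = (
    "exam_score ~ genai_user + a_level_grade + attempt + attendance"
    " + vocational_training + voluntary_service + female + linkedin_user"
    " + C(course_of_study)"  # course-of-study fixed effects
)
fit = smf.ols(FORMULA, data=df).fit()
print(fit.summary())  # the genai_user coefficient corresponds to beta_1
```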

To eliminate potential confounding effects biasing our inference, we use established

determinants of exam performance as control variables. First, we control for academic

preparedness and achievements prior to higher education. We include A-Level Grade as a common

predictor for academic performance (e.g., Lento 2018; Massoudi et al. 2017). We also include two

dummy variables indicating completion of Vocational Training or Voluntary Service as indications

of maturity and experience (e.g., Guney 2009; Hartnett et al. 2004; Voshaar et al. 2023b). Second,

we control for academic behavior by including session Attendance and the number of Attempts at

taking the final exam (e.g., Cheng and Ding 2021; Massoudi et al. 2017; Voshaar et al. 2023a).

Third, previous studies have found correlations between exam performance and gender as well as

course of study (e.g., Aldamen et al. 2015; Hu et al. 2023; Wecks et al. 2023). We include Course

of study-fixed effects and a dummy variable indicating students’ gender (Female). In addition, we

introduce a novel control variable by adding a dummy indicating whether a student is a LinkedIn

User.[2] LinkedIn usage has been found to be correlated with exam performance (Paul et al. 2012).

Also, because GenAI acceptance among students is driven by personal innovativeness (Strzelecki

2023), which is also correlated with (new) social media usage (Aldahdouh et al. 2020; Wijesundara

and Xixiang 2018), we adopt LinkedIn User as a potentially important control variable for our

analysis.

[1] Alternatively, and to rule out potential biases, we use higher (0.6) and lower (0.4) thresholds in ZeroGPT as well as other GenAI detection tools to define our variable of interest (also see our robustness checks).
[2] We also show results without including the variable LinkedIn User.

3.2 Generative AI Detection Systems

Constructing our variable of interest requires us to differentiate between students who utilize

GenAI to write their case study essays and those who do not. For this, we leverage the capabilities

of a GenAI detection tool. As the availability of GenAI models has become more widespread, the

development and dissemination of AI detection tools have also accelerated (e.g., Dalalah and

Dalalah 2023). Many detection tools are now available online for a fee or free of charge. They

process the inputted text by splitting it into individual tokens and predicting the probability that a

specific token will be followed by the subsequent sequence (Crothers et al. 2023). The detector

also analyzes a text’s perplexity, which refers to the use of random elements and idiosyncrasies

typical of human writing and speech (Walters 2023). If the detector identifies high predictability

and low perplexity, the text is likely AI-written and is flagged as such. In this

instance, the GenAI text detector provides a qualitative (e.g., “Your file content is AI/GPT

generated”) or quantitative (e.g., “63%”) evaluation of the probability that the text was generated

by AI.
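While ZeroGPT’s internal model is proprietary, the predictability-and-perplexity logic described above can be illustrated with any public language model. The following is a minimal sketch using GPT-2 via the Hugging Face transformers library; it stands in for the general mechanism only (an English model would also be ill-suited to our German essays), not for ZeroGPT’s actual pipeline.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the model: low values mean highly
    predictable text, which detectors treat as a marker of AI generation."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy loss; exp(loss) is the perplexity.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return float(torch.exp(loss))
```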

After assessing the capabilities of the most frequently used GenAI detectors and their

suitability for German language texts, and upon reviewing the scientific literature, we opted to use

ZeroGPT (https://www.zerogpt.com/) for multiple reasons.[3] First, previous research has found

ZeroGPT to be among the best detection tools, consistently and accurately identifying

AI-generated texts (Aremu 2023; Liang et al. 2023; Walters 2023; Weber-Wulff et al. 2023). Its

capabilities in detecting human-generated texts have also been shown to be precise (Aremu 2023;

Liang et al. 2023; Pegoraro et al. 2023; Weber-Wulff et al. 2023). Second, it is essential to select

a detector that does not have high rates of false positives or negatives. While false negatives lead

to GenAI use remaining undetected, false positives might lead to unwarranted accusations against

students. ZeroGPT has been shown to perform well at avoiding both (Walters 2023). Third,

ZeroGPT is specifically designed to identify content generated by the most popular GenAI models,

such as ChatGPT 3.5 and 4, Gemini, and LLaMA. Finally, ZeroGPT claims to be multilingual,

making it suitable for our German language setting (ZeroGPT 2024).

[3] For a comprehensive overview on the effectiveness and capabilities of 16 AI text detectors, we refer the reader to Walters (2023).

ZeroGPT uses machine learning algorithms and natural language processing techniques to

analyze textual data and identify patterns common to GenAI-generated text (Alhijawi et al. 2024;

ZeroGPT 2024). The tool offers a front-end for inputting texts, which it passes along to a pretrained

model in the back-end. It outputs the proportion of the tokens estimated to be AI-generated, which

is more detailed than the binary outputs of several other detectors (e.g., Copyleaks).

3.3 Sample Selection and Descriptive Statistics

Our study is based on a broad sample of business, economics, and management students taking an

introductory financial accounting course at a German university in the winter term 2023/2024. To

obtain the required data for our analysis, we have drawn on several data sources. First, using an

online survey at the beginning of the semester, we collected data on student characteristics that

might influence exam success. Second, we retained the students’ case study essays throughout the

semester, and processed and analyzed them in terms of the use of GenAI after the final exam.

Third, we obtained the final exam scores from the central examination office to evaluate the impact

on academic performance.

Starting with an initial sample of 572 students who participated in the survey, we first

excluded students who did not hand in an essay (N = 243). Additionally, we excluded those

students who did not take the final exam (N = 127). Finally, we dropped the observations with

missing data for the required variables (N = 9). This leaves us with a final sample of 193 students.

Given the 502 students in the final exam, our sample accounts for about 38% of the underlying

population and can thus be considered representative.
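In code, this selection funnel reduces to three filters; a minimal sketch assuming a pandas DataFrame survey with one row per survey participant and illustrative column names:

```python
controls = ["a_level_grade", "attempt", "attendance", "vocational_training",
            "voluntary_service", "female", "linkedin_user", "course_of_study"]

sample = survey.copy()                          # N = 572 survey participants
sample = sample[sample["essay_text"].notna()]   # drop 243 without an essay
sample = sample[sample["exam_score"].notna()]   # drop 127 who skipped the exam
sample = sample.dropna(subset=controls)         # drop 9 with missing variables
assert len(sample) == 193                       # final sample (~38% of 502 examinees)
```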

[Insert Table 1 about here]

Table 1 reports the descriptive statistics of the student characteristics for the full sample in

Panel A.[4] The mean exam score is 45.39, indicating that our sample students on average fall below

the 50% threshold. This reflects the high failure rates commonly observed in higher-education

introductory accounting courses (Prinsloo et al. 2010; Sanders and Willis 2009). The binary

variable of interest GenAI User has a mean of 0.306, indicating that 30.6% of the students in our

sample (i.e., 59 students) are identified as GenAI users by ZeroGPT. The mean value of ZeroGPT

indicates that on average 35.4% of students’ texts are flagged as AI-generated. The average student

in our sample has taken the final exam for the first time (mean Attempt = 1.425) and has attended

fewer than half of the offered tutorials (mean Attendance = 0.447). A rather small subset of 19.2%

of the students has a LinkedIn profile.

Panel B of Table 1 presents descriptive statistics divided into GenAI users and non-users.

The last column presents the results of a two-tailed t-test. The average GenAI user achieves 9.027

(p-value < 0.01) fewer exam points, has a lower A-level grade (0.152; p-value < 0.1), and a higher

number of attempts compared to non-users. This gives an initial indication of poorer exam

performance among GenAI users compared to non-users. However, because not only GenAI usage

but also student characteristics may influence exam performance, the association of GenAI usage and exam performance calls for multivariate examination taking control variables into consideration.

[4] Pearson correlations are reported in Online Appendix B (https://tinyurl.com/zjehfa3n) and do not show any indication of multicollinearity issues, as the highest absolute value is 0.331.
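The group comparison in Panel B corresponds to a standard two-sample t-test; a minimal sketch with SciPy, reusing the illustrative df from above:

```python
from scipy.stats import ttest_ind

users = df.loc[df["genai_user"] == 1, "exam_score"]
nonusers = df.loc[df["genai_user"] == 0, "exam_score"]
t_stat, p_value = ttest_ind(users, nonusers)  # two-tailed by default
```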

4 Impact of Generative AI Usage on Exam Performance

4.1 Main Results

Table 2 shows the results of our multivariate regression. The GenAI User dummy variable, being

the only independent variable in column (1), displays a significantly negative coefficient. In

column (2), we add the control variables commonly found in literature explaining exam

performance. The coefficient of the variable of interest remains significant and slightly lower.

Finally, column (3) presents the results of our main model (Equation 1).[5] The control variables

largely show significant coefficients with signs in line with the literature. For the variable of interest (GenAI

User), we continue to find a statistically significant and negative coefficient. According to the

model, students using GenAI score 6.71 points lower in the final exam, which is substantial, as the

mean student scores 45.39 points. Thus, on average, the scores of GenAI users are about 15%

lower than those of the mean non-user. Given that the passing threshold is 41 points, GenAI use

can tip the scales toward failing the exam – at least statistically. The empirical evidence provides

a clear picture of a negative influence of students’ GenAI usage on their exam scores.
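The roughly 15% figure follows directly from the coefficient and the mean score:

$$\frac{6.71}{45.39} \approx 0.148 \approx 15\%.$$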

[Insert Table 2 about here]

[5] The link test (Pregibon 1980; Tukey 1949), a significant F-test, and the coefficient of determination (adjusted R²) all indicate a well-fitted model. As the Breusch and Pagan (1979) and Cook and Weisberg (1983) test detects no heteroscedasticity (p-value of 0.56), we refrain from using robust standard errors.

4.2 Robustness Checks

We conduct a set of robustness checks to ensure the reliability and validity of our results. Our

initial step involves scrutinizing the robustness of the GenAI User variable. In our main analysis,

we identify GenAI users based on a threshold of over 50% in the written case study essays, as

determined by ZeroGPT. This approach yields 30.6% of our sample as GenAI users. To test

whether this share is realistic, we conduct an anonymous survey among all students in our sample

(see Appendix C: https://tinyurl.com/zjehfa3n). Among the 30 survey responses (15.5% of the

sample), 30.0% state that they used GenAI tools for the written case study, aligning well with our

measured value. Identifying about a third of our population as GenAI users is also consistent with

findings reported elsewhere in the literature. von Garrel and Mayer (2023) conducted a

representative survey among German university students and find that 34.8% of them report using

AI-based tools for studying occasionally, frequently, or very often. Considering the variation in

reported usage rates across different studies and online reports (Abdaljaleel et al. 2024), we

proceed to test alternative thresholds for the detection tool. Adjusting the threshold to 0.6 reduces

GenAI users to 20.2%, while a threshold of 0.4 increases them to 40.9%. The results in columns

(1) and (2) of Table 3 with the adjusted thresholds remain robust and unchanged.
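Operationally, this check re-estimates Equation (1) under alternative cutoffs; a sketch reusing the illustrative df and FORMULA from the earlier snippet:

```python
import statsmodels.formula.api as smf

for tau in (0.4, 0.5, 0.6):  # 40.9%, 30.6%, and 20.2% classified as users
    df["genai_user"] = (df["zerogpt_score"] > tau).astype(int)
    fit = smf.ols(FORMULA, data=df).fit()
    print(tau, fit.params["genai_user"], fit.pvalues["genai_user"])
```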

[Insert Table 3 about here]

Additionally, we ensure the robustness of our findings by using alternative detection tools.

We utilize Originality.AI, given its prominence in the literature (Walters 2023) and its claims to

be multilingual (Originality.AI 2024). The coefficient of GenAI User, while pointing in the same

direction, is not statistically significant. Looking at the detector score

distributions in Online Appendix D (https://tinyurl.com/zjehfa3n) reveals that Originality.AI may

indeed face difficulties with German language texts. Except for a few cases, all values are either

exactly 0%, 50%, or 100%, showing an apparent dysfunction. Despite the difficulties of

Originality.AI for German texts, the GenAI User coefficient remains negative and shows a similar

value, though it is marginally insignificant (p-value of 0.17).

Given this potential for bias with German texts, we repeat the robustness check using an

AI detector particularly designed for the German language, developed at the University of Applied

Sciences Wedel (Tlok et al. 2023). According to the developer, this tool’s outputs are probabilities

and thus not directly comparable to other tools. As shown in Online Appendix D

(https://tinyurl.com/zjehfa3n), this results in a distribution of outputs with many at 0% probability

and a uniform distribution across the remaining value range. A “more likely than not” classification

would be impractical. Instead, we run a pre-test with 12 student seminar papers written before

GenAI was available and modify them using ChatGPT 4, creating a paired sample of known GenAI

and non-GenAI texts on the same topics as our main study. The German detector consistently

shows values below 10% for human-written and above 10% for AI-generated texts. Adopting this

threshold for our robustness check, we identify 36.3% of our sample as GenAI users, which is

similar to the main analysis, our survey, and the values reported in previous literature (von Garrel

and Mayer 2023). Column (4) presents the results using the German detector. The GenAI User

coefficient is now even more negative and significant, suggesting improved accuracy due to the

detector's optimization for German texts.

To address potential concerns about the opacity of AI detectors, we conduct a further

robustness check by manually computing the propensity of GenAI usage. A growing body of

research analyzes AI-generated texts and how to distinguish them from human-written ones. The

literature reports that AI-generated texts typically exhibit lower readability, higher lexical richness,

and a greater number of adjectives than human-written texts (Martínez et al. 2024; Muñoz-Ortiz

et al. 2023; Shah et al. 2023). AI texts tend to have more words per sentence and sentences per

paragraph, contributing to lower readability scores (Deveci et al. 2023; Pehlivanoğlu et al. 2023).

We employ the Gunning-Fog Index as a well-regarded measure of readability (Gunning 1952).

Lexical richness essentially refers to the ratio of unique words measurable by the metric Herdan’s

C (Herdan 1960). As a third metric, we consider the proportion of adjectives used

(Markowitz et al. 2023). We conduct a principal component analysis to consolidate these

three variables into a single vector.[6] This results in a factor variable ranging from 0 to 1 indicating

AI markers. In column (5), we use this factor as the variable of interest and find that the presence of

lexical characteristics of AI-generated texts in a student’s work correlates with lower exam scores,

reinforcing the findings from previous analyses.
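The following sketch shows how the three features and their consolidation could be computed, assuming textstat for the Gunning-Fog Index, NLTK for tokenization and part-of-speech tagging, and scikit-learn for the PCA. Names are illustrative, and our German-language essays would require German-capable counterparts of these English-centric tools.

```python
import math
import nltk            # requires the punkt and averaged_perceptron_tagger data
import textstat        # textstat.set_lang() can switch language formulas
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def herdans_c(tokens):
    # Herdan's C = log(#unique words) / log(#words), a length-robust
    # measure of lexical richness.
    return math.log(len(set(tokens))) / math.log(len(tokens))

def adjective_share(tokens):
    tags = nltk.pos_tag(tokens)
    return sum(tag.startswith("JJ") for _, tag in tags) / len(tags)

def features(text):
    tokens = nltk.word_tokenize(text)
    return [textstat.gunning_fog(text), herdans_c(tokens), adjective_share(tokens)]

X = MinMaxScaler().fit_transform([features(t) for t in essays])
factor = PCA(n_components=1).fit_transform(X)
ai_marker = MinMaxScaler().fit_transform(factor)  # factor scaled to [0, 1]
```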

Beyond the robustness of our measure, we also account for potential endogeneity issues.

The group-wise comparison between GenAI users and non-users in Panel B of Table 1 indicates

that GenAI users have a lower A-level grade and a higher number of attempts. Academic

preparedness or experience might affect both exam performance and GenAI usage. Our approach

already addresses potential endogeneity to some extent, as our control variables can capture such

characteristics (Hill et al. 2021). To provide additional robustness, we use entropy-balancing to

balance the distributions of control variables between GenAI users and non-users and subsequently

repeat the analysis (Hainmueller 2012). The results in column (6) underscore our main findings

even when controlling for potential endogeneity.[7]
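Entropy balancing solves for control-group weights that match the treated group’s covariate moments while staying as close to uniform as possible. A compact sketch of the first-moment case with NumPy/SciPy (an illustration, not our implementation):

```python
import numpy as np
from scipy.optimize import minimize

def entropy_balance(X_control, treated_means):
    """Weights for the non-user group whose weighted covariate means
    equal the GenAI-user group's means (Hainmueller 2012, first moments)."""
    Z = X_control - treated_means  # center controls on treated moments

    def dual(lam):
        # Minimizing this convex dual yields softmax weights that
        # satisfy the moment constraints exactly.
        return np.log(np.exp(Z @ lam).sum())

    lam = minimize(dual, np.zeros(Z.shape[1])).x
    w = np.exp(Z @ lam)
    return w / w.sum()
```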

[6] The Kaiser-Meyer-Olkin measure of 0.657 (Kaiser and Rice 1974) and a significant Bartlett (1937) test of sphericity (p-value < 0.01) indicate that the variables are highly correlated and collectively measure the construct of a text being written by GenAI.
[7] This approach primarily addresses observable sample selection bias, while this issue might also arise from omitted correlated variables (unobserved). However, our course of study-fixed effects mitigate this to some extent (Wooldridge 1995) and the Ramsey (1969) RESET test indicates no omitted variables (p-value of 0.764). Additionally, potential instrumental variables related to technical affinity and engagement do not show any correlation with GenAI User, allowing us to discount endogeneity due to omitted correlated variables.

4.3 Additional Analyses

Our main results show a negative impact of students’ GenAI usage on their exam performance. To

further explore the effect and the mechanism at work, we complement our findings by conducting

additional analyses. We explore the documented effect for different levels of student engagement

and cognitive abilities and measure how GenAI use affects performance improvement when

repeating the exam.

In our first additional analysis, we apply an identification strategy to further examine the causal effect: we identify and match all repeating students who did not pass the exam in the year before our main analysis, when GenAI models were not yet available for student use (Pre GenAI).[8] This leaves us with a sample of matched observations before and after broad GenAI

availability containing (Pre) GenAI Users (N = 15) and (Pre) Non-Users (N = 12). Figure 1 shows

the distribution of exam scores for each group and the differences in mean exam scores between

the groups and time periods (within group) along with their statistical significance. Due to the

small sample size, we bootstrap the distributions around the mean.
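The bootstrap itself is straightforward; a minimal NumPy sketch for one group’s mean, where scores denotes that group’s exam-score vector:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
              for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # 95% bootstrap CI
```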

[Insert Figure 1 about here]

The Pre GenAI period solely consists of students who failed the exam. We observe a

statistically significant improvement in exam score in the next attempt for both groups (Post

GenAI).[9] However, the GenAI User group shows a substantially lower increase. While both subsamples perform equally well in their second attempt, those in the GenAI User group had reached far more points than the Non-User group in their earlier attempt during the Pre GenAI period. In the attempt after GenAI was widely available, the Non-User group increases their exam scores to a greater extent than the GenAI User group. Consequently, we observe a learning-hindering effect for students using GenAI, as their improvement is far lower.

[8] The Pre GenAI semester ended in January 2023. The first publicly available GenAI model (ChatGPT 3) was launched only a few weeks before the exam but had very few users, difficult access, and extensive downtime at that early stage. Therefore, an effect on the exam performance can be ruled out, allowing us to consider this semester as a Pre GenAI period.

[9] An increased performance among repeating students aligns with related research attributing the effect to increased commitment (e.g., Martínez and Martinez 1992; Voshaar et al. 2023a; Wecks et al. 2023).

In our second additional analysis, we perform split sample analyses using two measures of

student capabilities and engagement. If the documented effect in our main results is indeed

attributable to hindering learning, the effect should be more pronounced for students for whom

individual learning and comprehending the content would otherwise have fallen on fertile ground.

To approximate this characteristic, we use students’ A-level grades and attendance at tutorials.

While the first addresses academic preparedness, pre-university achievements, and cognitive

abilities, the latter measures engagement. We conduct a median split for both measures to create

two samples with low and high A-level grades and attendance to gain additional insights into the

mechanism behind the effect. Table 4 presents the results of the additional split sample analysis.

Columns (1) and (2) include the two regressions for the split samples by A-level grade, while

attendance is used to split the sample in columns (3) and (4).
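The split-sample estimations can be sketched as follows, again reusing the illustrative df and FORMULA:

```python
import statsmodels.formula.api as smf

for var in ("a_level_grade", "attendance"):
    cut = df[var].median()
    for label, sub in (("high", df[df[var] >= cut]), ("low", df[df[var] < cut])):
        fit = smf.ols(FORMULA, data=sub).fit()
        print(var, label, fit.params["genai_user"], fit.pvalues["genai_user"])
```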

[Insert Table 4 about here]

For the split sample of students with good A-level grades (and therefore strong cognitive

abilities demonstrated through considerable pre-university performance), we find the coefficient

of GenAI User to be highly significant and negative (column (1)). While this aligns with our main

results, the coefficient’s magnitude is almost doubled compared to the one in column (3) of

Table 2. In contrast, the coefficient is positive but insignificant for students with A-level grades

below the median (column (2)). A similar picture emerges when utilizing students’ tutorial

attendance (and hence engagement) for the median split. While the students with higher attendance

perform worse not only statistically significantly but to an educationally impactful extent when

using GenAI, the effect is insignificant for low-attendance students (columns (3) and (4),

respectively). This suggests that the documented negative effect of GenAI use in the main analysis

is primarily due to the effect on students with good A-level grades and high attendance.

We can conclude that the impact of GenAI use on exam performance varies depending on

students’ prior academic achievements and/or cognitive abilities as well as engagement during the

semester. We find using GenAI to be detrimental to the exam scores of higher achieving and more

engaged students. This confirms the learning-hindering mechanism, as those students who would

have been well equipped to understand the learning materials suffer particularly from the forgone

opportunity to engage with the course content. When compared intra-group with other students

with good prerequisites and who prepare for and write the essays themselves, the disadvantage of

(over-)reliance on GenAI is even more glaring.

5 Discussion and Implications

Our main results show that GenAI usage negatively affects students’ exam scores. While the

literature has found many aspects of GenAI that can have a positive or negative influence on

academic performance, it is unclear whether the benefits or the downsides prevail. We observe

clear evidence of a negative overall effect. Positive aspects such as summarizing information

(Pavlik 2023), increasing study motivation (Ali et al. 2023; Fauzi et al. 2023), or providing plain-

language explanations (Sullivan et al. 2023) may still occur. However, our results show that these

are overshadowed by the negative effects that have been described in previous studies (Crawford

et al. 2023a; Markauskaite et al. 2022; Rasul et al. 2023).

We explore this effect in greater depth, leveraging an identification strategy in a sample of

repeating students and find that students opting for GenAI usage for learning and academic writing

purposes do not exhibit an increase in their exam scores similar to that of their peers not using

GenAI. We ascribe this result primarily to a learning-hindering mechanism when using GenAI.

For example, when students let GenAI write an essay on a complex and challenging topic, instead

of exploring, grappling with, and mastering the content and then writing the essay themselves,

students waste the opportunity to learn and to experience the inherent rewards of figuring things

out. Research such as Milano et al. (2023) warns of this, stating that reliance on GenAI’s facilitation of

academic writing can reach the point of negatively impacting students’ learning. Similarly,

the ready availability of quick answers may reduce the intensity of students’ engagement with the

subject matter and ultimately deter their learning, as argued by Cotton et al. (2023), Rudolph et al.

(2023), and Sallam et al. (2023).

The results of our split sample regressions further support the learning-hindering

mechanism, as students with high learning potential are especially impacted by GenAI usage. The

significantly negative effect for the students in our sample with good A-level grades or high

attendance – which can be reasonably equated to higher levels of skill or commitment – indicates

that these students have more to lose when they do not immerse themselves in the subject matter.

And indeed, we find no effect for the opposite group, who are less predisposed to assimilate

knowledge due to lower attendance or cognitive ability. We can, however, document the impact of

GenAI usage for these students when looking at the results of the identification strategy analysis.

This subsample solely consists of repeating students with attendance and A-Level grades that are

below average. While we document a significant performance increase in the next exam attempt,

GenAI usage hampers this improvement. Thus, the negative GenAI influence becomes evident

when there is considerable learning potential, providing further support for the learning-hindering

mechanism.

These findings align with the constructivist theory of learning (cf. Bada and Olusegun

2015), which highlights the importance of being actively involved in the learning process to achieve

deeper understanding and build knowledge. In this sense, writing a case study essay is valuable

not merely as an end in itself but also as an instrument to assist students in developing skills in

planning, completing, and editing their written work (Dweck 1986) as well as engaging with,

exploring, and understanding subject-related content. When students immerse themselves in their

subjects, they are more likely to experience a eureka moment that enhances comprehension. By

using GenAI for essay writing, students may bypass this essential cognitive process of

comprehension, analysis, and summarization. This might similarly occur when GenAI is used to

study for exams.[10] If, for instance, GenAI is used to simplify or explain complex topics, students

might use it as a shortcut that makes learning seem easier, but which actually prevents them from

going through the process of understanding and learning on a deeper level.

Our findings have important implications for students, educators, and educational

institutions. For students, the results suggest a cautious approach to using GenAI for learning.

While GenAI may appear to ease the learning process, it can adversely affect learning

performance. Students should be mindful of the potential drawbacks and consider integrating

GenAI as a supplementary tool rather than a primary resource for grappling with complex topics.

Educators likewise need to take students’ use of GenAI into account when designing curricula. It

is essential to provide tasks and learning materials that promote deep learning and minimize the

potential for GenAI to diminish engagement with the subject matter. Strategies could include

incorporating in-class discussions, handwritten assignments, and other methods that encourage

active learning and critical thinking. Lastly, this study offers valuable insights for educational

institutions regarding GenAI policies in higher education. Although our results point to negative

implications of GenAI use on student performance, we do not advocate for outright bans. As with

many revolutionary information systems, there are both positive and negative aspects of GenAI.

The negative results of this study may rather show that higher education does not yet harness the

full potential of GenAI. Educational institutions should guide educators on how to instruct students

in the proper use of GenAI and develop policies that mitigate its negative effects while amplifying

its benefits. Similar to the disruptions of higher education caused by calculators and the internet,

banning is not a practical solution. Students will inevitably encounter GenAI outside the university

setting and must learn to use it effectively rather than confining themselves to getting by without

it.

[10] Our detection solely identifies GenAI use in essay writing, yet it is likely that GenAI users apply this versatile technology for various academic tasks. This is supported by our survey, as 26.7% indicate using GenAI broadly for academic purposes, close to the 30.0% reporting that they use it for the essay.

6 Conclusion, Limitations, and Future Research

The present study contributes to the rich debate on how GenAI will affect learning in (higher)

education by evaluating the tangible effects of GenAI usage on exam performance. We address an

important research gap, as performance effects have not yet been examined but are nevertheless

crucial when discussing how to adapt education to the age of GenAI. Our findings reveal that using

GenAI tools for writing essays and likely for other learning purposes significantly decreases exam

scores. The additional analysis offers nuanced insights by documenting a learning-hindering

mechanism through which GenAI usage negatively affects exam scores. Our study thus has

implications for students, educators, and institutions.

Some of the limitations of our study warrant further attention. First, using GenAI detector

tools to identify GenAI users among our students comes with the risk of inaccurate detection

results. Failing to identify all GenAI users (or falsely flagging too many) in our sample would affect our findings.

Having said that, we improve the robustness of our results by using several different GenAI

detector tools. These tools are particularly suited to German language applications or have distinct

operating principles. We also find the share of detected GenAI users to align with the numbers

found in related research (von Garrel and Mayer 2023) and through our own anonymous survey.

Second, our results may lack generalizability as our study is limited to a financial accounting class

at one German university. Finally, our study does not account for usage behavior and hence usage

intensity. Different intensities of use might come with different impacts on exam performance.

In addressing these limitations, future research might explore students’ GenAI usage

behavior. This could help in understanding the implications more comprehensively. Future

research might also evaluate the impact of GenAI tools from an educator’s perspective. Our study

takes a student-centric viewpoint, but GenAI also affects the daily work of educators. Exploring

the threats and opportunities from this perspective would help institutions position themselves in

discussing whether using GenAI tools should be banned, tolerated, or taught.

References

Abdaljaleel, M., Barakat, M., Alsanafi, M., Salim, N. A., Abazid, H., Malaeb, D., Mohammed, A. H., Hassan,
B. A. R., Wayyes, A. M., Farhan, S. S., Khatib, S. E., Rahal, M., Sahban, A., Abdelaziz, D. H., Mansour,
N. O., AlZayer, R., Khalil, R., Fekih-Romdhane, F., Hallit, R., Hallit, S., and Sallam, M. 2024. "A
Multinational Study on the Factors Influencing University Students’ Attitudes and Usage of ChatGPT,"
Scientific Reports (14:1), pp. 1983-1997.
Ahangama, S. 2023. "An Investigation of Domain-Based Social Influence on ChatGPT-Related Twitter
Data," ICIS 2023 Proceedings.
AlAfnan, M. A., Dishari, S., Jovic, M., and Lomidze, K. 2023. "ChatGPT as an Educational tool:
Opportunities, Challenges, and Recommendations for Communication, Business Writing, and
Composition Courses," Journal of Artificial Intelligence and Technology (3:2), pp. 60-68.
Alavi, M., Leidner, D. E., and Mousavi, R. 2024. "Knowledge Management Perspective of Generative
Artificial Intelligence," Journal of the Association for Information Systems (25:1), pp. 1-12.
Aldahdouh, T. Z., Nokelainen, P., and Korhonen, V. 2020. "Technology and Social Media Usage in Higher
Education: The Influence of Individual Innovativeness," SAGE Open (10:1).
Aldamen, H., Al-Esmail, R., and Hollindale, J. 2015. "Does Lecture Capturing Impact Student Performance
and Attendance in an Introductory Accounting Course?," Accounting Education (24:4), pp. 291-317.
Alhijawi, B., Jarrar, R., AbuAlRub, A., and Bader, A. 2024. "Deep Learning Detection Method for Large
Language Models-Generated Scientific Content," available at arXiv: 2403.00828.
Ali, J. K. M., Shamsan, M. A. A., Hezam, T. A., and Mohammed, A. A. Q. 2023. "Impact of ChatGPT on
Learning Motivation," Journal of English Studies in Arabia Felix (2:1).
Aremu, T. 2023. "Unlocking Pandora's Box: Unveiling the Elusive Realm of AI Text Detection," available at SSRN: 4470719.
Bada, S. O., and Olusegun, S. 2015. "Constructivism Learning Theory: A Paradigm for Teaching and
Learning," Journal of Research & Method in Education (5:6), pp. 66-70.
Bangert-Drowns, R., Hurley, M. M., and Wilkinson, B. 2004. "The Effects of School-Based Writing-to-
Learn Interventions on Academic Achievement: A Meta-Analysis," Review of Educational Research
(74), pp. 29-58.
Bartlett, M. S. 1937. "Properties of Sufficiency and Statistical Tests," Proceedings of the Royal Society of
London. Series A, Mathematical and Physical Sciences (160:901), pp. 268-282.
Benbya, H., Strich, F., and Tamm, T. 2024. "Navigating Generative Artificial Intelligence Promises and
Perils for Knowledge and Creative Work," Journal of the Association for Information Systems (25:1),
pp. 23-36.
Breusch, T., and Pagan, A. 1979. "A Simple Test for Heteroscedasticity and Random Coefficient Variation,"
Econometrica (47:5), pp. 1287-1294.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry,
G., and Askell, A. 2020. "Language Models are Few-Shot Learners," Advances in Neural Information
Processing Systems (33), pp. 1877-1901.
Calderon, T. G., Gao, L., and Cardoso, R. L. 2023. "Generative Artificial Intelligence in the Classroom: A
Financial Accounting Experience," in Advances in Accounting Education: Teaching and Curriculum
Innovations, T.G. Calderon (ed.). Emerald Publishing Limited, pp. 125-144.
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. 2020. "Generative Pretraining
from Pixels," International Conference on Machine Learning, pp. 1691-1703.
Chen, X., Zou, D., Xie, H., Cheng, G., and Liu, C. 2022. "Two Decades of Artificial Intelligence in Education
Contributors, Collaborations, Research Topics, Challenges, and Future Directions," Educational
Technology & Society (25:1), pp. 28-47.
Cheng, P., and Ding, R. 2021. "The Effect of Online Review Exercises on Student Course Engagement and
Learning Performance: A Case Study of an Introductory Financial Accounting Course at an
International Joint Venture University," Journal of Accounting Education (54:1), p. 100699.
Chhina, S., Antony, B., and Firmin, S. 2023. "Navigating the Terrain of Large Language Models in Higher
Education: A Systematic Literature Review," ACIS 2023 Proceedings.
Chiu, C., King, R., and Crossin, C. 2023. "Using Colour-Coded Digital Annotation for Enhanced Case-Based
Learning Outcomes," Accounting Education (32:2), pp. 201-221.
Cook, R. D., and Weisberg, S. 1983. "Diagnostics for Heteroscedasticity in Regression," Biometrika (70:1).

Cotton, D. R. E., Cotton, P. A., and Shipway, J. R. 2023. "Chatting and Cheating: Ensuring Academic
Integrity in the Era of ChatGPT," Innovations in Education and Teaching International (61:2), pp.
228-239.
Crawford, J., Cowling, M., and Allen, K.-A. 2023a. "Leadership is Needed for Ethical ChatGPT: Character,
Assessment, and Learning Using Artificial Intelligence (AI)," Journal of University Teaching &
Learning Practice (20:3).
Crawford, J., Cowling, M., Ashton-Hay, S., Kelder, J.-A., Middleton, R., and Wilson, G. S. 2023b. "Artificial
Intelligence and Authorship Editor Policy: ChatGPT, Bard, Bing AI, and Beyond," Journal of University
Teaching & Learning Practice (20:5).
Crothers, E. N., Japkowicz, N., and Viktor, H. L. 2023. "Machine-Generated Text: A Comprehensive Survey
of Threat Models and Detection Methods," IEEE Access (11), pp. 70977-71002.
Dalalah, D., and Dalalah, O. M. A. 2023. "The False Positives and False Negatives of Generative AI Detection
Tools in Education and Academic Research: The Case of ChatGPT," The International Journal of
Management Education (21:2), p. 100822.
Deveci, C. D., Baker, J. J., Sikander, B., and Rosenberg, J. 2023. "A Comparison of Cover Letters Written
by ChatGPT-4 or Humans," Danish Medical Bulletin (70:11).
Dweck, C. S. 1986. "Motivational Processes Affecting Learning," American Psychologist (41:10), pp. 1040-
1048.
Eager, B., and Brunton, R. 2023. "Prompting Higher Education Towards AI-Augmented Teaching and
Learning Practice," Journal of University Teaching & Learning Practice (20:5).
Engelmann, B., Haak, F., Kreutz, C. K., Khasmakhi, N. N., and Schaer, P. 2023. "Text Simplification of
Scientific Texts for Non-Expert Readers," available at arXiv: 2307.03569.
Eskew, R. K., and Faley, R. H. 1988. "Some Determinants of Student Performance in the First College-Level
Financial Accounting Course," The Accounting Review (63:1), pp. 137–147.
Fauzi, F., Tuhuteru, L., Sampe, F., Ausat, A. M. A., and Hatta, H. R. 2023. "Analysing the Role of ChatGPT
in Improving Student Productivity in Higher Education," Journal on Education (5:4), pp. 14886-14891.
Ferrara, E. 2024. "GenAI Against Humanity: Nefarious Applications of Generative Artificial Intelligence
and Large Language Models," Journal of Computational Social Science (Forthcoming).
Feuerriegel, S., Hartmann, J., Janiesch, C., and Zschech, P. 2024. "Generative AI," Business & Information
Systems Engineering (66:1), pp. 111-126.
Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi, L., Taylor, R. A., and Chartash, D. 2023. "How Does
ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of
Large Language Models for Medical Education and Knowledge Assessment," JMIR Medical Education
(9:1).
Gregor, S. 2024. "Responsible Artificial Intelligence and Journal Publishing," Journal of the Association
for Information Systems (25:1), pp. 48-60.
Guney, Y. 2009. "Exogenous and Endogenous Factors Influencing Students' Performance in
Undergraduate Accounting Modules," Accounting Education (18:1), pp. 51-73.
Gunning, R. 1952. The Technique of Clear Writing. McGraw-Hill.
Hainmueller, J. 2012. "Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to
Produce Balanced Samples in Observational Studies," Political Analysis (20:1), pp. 25-46.
Hartnett, N., Römcke, J., and Yap, C. 2004. "Student Performance in Tertiary‐Level Accounting: An
International Student Focus," Accounting & Finance (44:2), pp. 163-185.
Herdan, G. 1960. Type-Token Mathematics. The Hague: Mouton.
Hill, A. D., Johnson, S. G., Greco, L. M., O’Boyle, E. H., and Walter, S. L. 2021. "Endogeneity: A Review and
Agenda for the Methodology-Practice Divide Affecting Micro and Macro Research," Journal of
Management (47:1), pp. 105-143.
Hsu, T., and Thompson, S. A. 2023. "Disinformation Researchers Raise Alarms About A.I. Chatbots." The
New York Times, from https://www.nytimes.com/2023/02/08/technology/ai-chatbots-disinformation.html
Hu, Y., Nath, N., Zhu, Y., and Laswad, F. 2023. "Accounting Students’ Online Engagement, Choice of Course
Delivery Format and Their Effects on Academic Performance," Accounting Education (32), pp. 1-36.
Jain, T., and Kapoor, M. 2015. "The Impact of Study Groups and Roommates on Academic Performance,"
Review of Economics and Statistics (97:1), pp. 44-54.
Johnson, A. 2023. "ChatGPT in Schools: Here’s Where it’s Banned - and How it Could Potentially Help
Students." Forbes.

24
Kaiser, H. F., and Rice, J. 1974. "Little Jiffy," Educational and Psychological Measurement (34:1), pp. 111-
117.
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G.,
Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet,
O., Sailer, M., Schmidt, A., Seidel, T., Stadler, M., Weller, J., Kuhn, J., and Kasneci, G. 2023. "ChatGPT
for Good? On Opportunities and Challenges of Large Language Models for Education," Learning and
Individual Differences (103).
Katavic, R., Pahuja, A., and Syed, T. A. 2023. "Navigating the Use of ChatGPT in Classrooms: A Study of
Student Experiences," ICIS 2023 Proceedings.
Kishore, S., Hong, Y., Nguyen, A., and Qutab, S. 2023. "Should ChatGPT be Banned at Schools? Organizing
Visions for Generative Artificial Intelligence (AI) in Education," ICIS 2023 Proceedings.
Lee, M. 2023. "A Mathematical Investigation of Hallucination and Creativity in GPT Models," Mathematics
(11:10).
Lento, C. 2018. "Student Usage of Assessment-Based and Self-Study Online Learning Resources in
Introductory Accounting," Issues in Accounting Education (33:4), pp. 13–31.
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., and Zou, J. 2023. "GPT Detectors are Biased Against Non-Native English Writers," available at arXiv: 2304.02819.
Lund, B. D., Wang, T., Mannuru, N. R., Nie, B., Shimray, S., and Wang, Z. 2023. "ChatGPT and a New
Academic Reality: Artificial Intelligence‐Written Research Papers and the Ethics of the Large Language
Models in Scholarly Publishing," Journal of the Association for Information Science and Technology
(74:5), pp. 570-581.
Markauskaite, L., Marrone, R., Poquet, O., Knight, S., Martinez-Maldonado, R., Howard, S., Tondeur, J.,
De Laat, M., Buckingham Shum, S., Gašević, D., and Siemens, G. 2022. "Rethinking the Entwinement
Between Artificial Intelligence and Human Learning: What Capabilities Do Learners Need for a World
with AI?," Computers and Education: Artificial Intelligence (3), p. 100056.
Markowitz, D. M., Hancock, J. T., and Bailenson, J. N. 2023. "Linguistic Markers of Inherently False AI
Communication and Intentionally False Human Communication: Evidence From Hotel Reviews,"
Journal of Language and Social Psychology (43:1), pp. 63-82.
Martínez, G., Hernández, J. A., Conde, J., Reviriego, P., and Merino, E. 2024. "Beware of Words: Evaluating
the Lexical Richness of Conversational Large Language Models," available at arXiv: 2402.15518.
Martínez, J. G., and Martinez, N. C. 1992. "Re-Examining Repeated Testing and Teacher Effects in a
Remedial Mathematics Course," The British Journal of Educational Psychology (62:3), pp. 356-363.
Massoudi, D., Koh, S., Hancock, P. J., and Fung, L. 2017. "The Effectiveness of Usage of Online Multiple
Choice Questions on Student Performance in Introductory Accounting," Issues in Accounting
Education (32:4), pp. 1-17.
Milano, S., McGrane, J. A., and Leonelli, S. 2023. "Large Language Models Challenge the Future of Higher
Education," Nature Machine Intelligence (5:4), pp. 333-334.
Mu, N., Kirillov, A., Wagner, D., and Xie, S. 2022. "SLIP: Self-Supervision Meets Language-Image Pre-Training," European Conference on Computer Vision: Springer, pp. 529-544.
Muñoz-Ortiz, A., Gómez-Rodríguez, C., and Vilares, D. 2023. "Contrasting Linguistic Patterns in Human and LLM-Generated Text," available at arXiv: 2308.09067.
Ngo, T. T. A. 2023. "The Perception by University Students of the Use of ChatGPT in Education,"
International Journal of Emerging Technologies in Learning (18:17).
Originality.AI. 2024. "Originality AI Plagiarism and Fact Checker." from https://originality.ai/
Paul, J. A., Baker, H. M., and Cochran, J. D. 2012. "Effect of Online Social Networking on Student Academic
Performance," Computers in Human Behavior (28:6), pp. 2117-2127.
Pavlik, J. V. 2023. "Collaborating With ChatGPT: Considering the Implications of Generative Artificial
Intelligence for Journalism and Media Education," Journalism & Mass Communication Educator
(78:1), pp. 84-93.
Pegoraro, A., Kumari, K., Fereidooni, H., and Sadeghi, A.-R. 2023. "To ChatGPT, or Not To ChatGPT: That
is the Question!," available at arXiv: 2304.01487.
Pehlivanoğlu, M. K., Syakura, M. A., and Duru, N. 2023. "Enhancing Paraphrasing in Chatbots Through
Prompt Engineering: A Comparative Study on ChatGPT, Bing, and Bard," 8th International Conference
on Computer Science and Engineering: IEEE, pp. 432-437.
Perkins, M. 2023. "Academic Integrity Considerations of AI Large Language Models in the Post-Pandemic
Era: ChatGPT and Beyond," Journal of University Teaching and Learning Practice (20:2).
Pregibon, D. 1980. "Goodness of Link Tests for Generalized Linear Models," Applied Statistics (29:1), pp. 15-24.
Prinsloo, P., Müller, H., and Du Plessis, A. 2010. "Raising Awareness of the Risk of Failure in First-Year Accounting Students," Accounting Education (19:1-2), pp. 203-218.
Qadir, J. 2023. "Engineering Education in the Era of ChatGPT: Promise and Pitfalls of Generative AI for
Education," IEEE Global Engineering Education Conference IEEE, pp. 1-9.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. 2018. "Improving Language Understanding by Generative Pre-Training." OpenAI.
Ramsey, J. B. 1969. "Tests for Specification Errors in Classical Linear Least-Squares Regression Analysis,"
Journal of the Royal Statistical Society Series B: Statistical Methodology (31:2), pp. 350-371.
Rasul, T., Nair, S., Kalendra, D., Robin, M., de Oliveira Santini, F., Ladeira, W. J., Sun, M., Day, I., Rather,
R. A., and Heathcote, L. 2023. "The Role of ChatGPT in Higher Education: Benefits, Challenges, and
Future Research Directions," Journal of Applied Learning and Teaching (6:1).
Rudolph, J., Tan, S., and Tan, S. 2023. "War of the Chatbots: Bard, Bing Chat, ChatGPT, Ernie and Beyond.
The New AI Gold Rush and Its Impact on Higher Education," Journal of Applied Learning and
Teaching (6:1).
Sallam, M., Salim, N. A., Barakat, M., and Ala'a, B. 2023. "ChatGPT Applications in Medical, Dental,
Pharmacy, and Public Health Education: A Descriptive Study Highlighting the Advantages and
Limitations," Narra J (3:1).
Sanders, D. E., and Willis, V. F. 2009. "Setting the PACE for Student Success in Intermediate Accounting,"
Issues in Accounting Education (24:3), pp. 319-337.
Shah, A., Ranka, P., Dedhia, U., Prasad, S., Muni, S., and Bhowmick, K. 2023. "Detecting and Unmasking
Ai-Generated Texts Through Explainable Artificial Intelligence Using Stylistic Features," International
Journal of Advanced Computer Science and Applications (14:10).
Strzelecki, A. 2023. "To Use or Not to Use ChatGPT in Higher Education? A Study of Students’ Acceptance
and Use of Technology," Interactive Learning Environments (31), pp. 1-14.
Sullivan, M., Kelly, A., and McLaughlan, P. 2023. "ChatGPT in Higher Education: Considerations for
Academic Integrity and Student Learning," Journal of Applied Learning & Teaching (6:1).
Tao, Y., Yoo, C., and Animesh, A. 2023. "AI Plus Other Technologies? The Impact of ChatGPT and Creativity
Support Systems on Individual Creativity," ICIS 2023 Proceedings.
Teubner, T., Flath, C. M., Weinhardt, C., van der Aalst, W., and Hinz, O. 2023. "Welcome to the Era of
ChatGPT et al," Business & Information Systems Engineering (65:2), pp. 95-101.
Tiwari, C. K., Bhat, M. A., Khan, S. T., Subramaniam, R., and Khan, M. A. I. 2023. "What Drives Students
Toward ChatGPT? An Investigation of the Factors Influencing Adoption and Usage of ChatGPT,"
Interactive Technology and Smart Education (2023).
Tlok, T., Annuth, H., and Pawłowski, M. 2023. "Robuste Erkennung von KI-generierten Texten in deutscher Sprache [Robust Detection of AI-Generated Texts in German]," Master's Thesis, Department of Data Science & Artificial Intelligence, FH Wedel.
Tukey, J. W. 1949. "One Degree of Freedom for Non-additivity," Biometrics (5:3), pp. 232-242.
Van Slyke, C., Johnson, R., and Sarabadani, J. 2023. "Generative Artificial Intelligence in Information
Systems Education: Challenges, Consequences, and Responses," Communications of the Association
for Information Systems (53:1), pp. 1-21.
von Brackel-Schmidt, C., Kučević, E., Memmert, L., Tavanapour, N., Cvetkovic, I., Bittner, E. A., and
Böhmann, T. 2023. "A User-Centric Taxonomy for Conversational Generative Language Models," ICIS
2023 Proceedings.
von Garrel, J., and Mayer, J. 2023. "Artificial Intelligence in Studies - Use of ChatGPT and Ai-Based Tools
Among Students in Germany," Humanities and Social Sciences Communications (10:1), p. 799.
Voshaar, J., Knipp, M., Loy, T., Zimmermann, J., and Johannsen, F. 2023a. "The Impact of Using a Mobile App on Learning Success in Accounting Education," Accounting Education, pp. 222-247.
Voshaar, J., Wecks, J. O., Johannsen, F., Knipp, M., Loy, T., and Zimmermann, J. 2023b. "Supporting
Students in the Transition to Higher Education: Evidence from a Mobile App in Accounting Education,"
International Conference in Information Systems, Hyderabad.
Walters, W. H. 2023. "The Effectiveness of Software Designed to Detect AI-Generated Writing: A
Comparison of 16 AI Text Detectors," Open Information Science (7:1), p. 20220158.
Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S., Foltýnek, T., Guerrero-Dib, J., Popoola, O., Šigut,
P., and Waddington, L. 2023. "Testing of Detection Tools for AI-Generated Text," International
Journal for Educational Integrity (19:1), p. 26.
Wecks, J. O., Voshaar, J., and Zimmermann, J. 2023. "Using Machine Learning to Address Individual Learning Needs in Accounting Education," available at SSRN.
Wijesundara, T. R., and Xixiang, S. 2018. "Social Networking Sites Acceptance: The Role of Personal
Innovativeness in Information Technology," International Journal of Business and Management
(13:8).
Wilson, D., Burnett, P., Valacich, J. S., and Jenkins, J. 2023. "Human or AI? Using Digital Behavior to Verify
Essay Authorship," ICIS 2023 Proceedings.
Wooldridge, J. M. 1995. "Selection Corrections for Panel Data Models Under Conditional Mean
Independence Assumptions," Journal of Econometrics (68:1), pp. 115-132.
Wu, H., Wang, W., Wan, Y., Jiao, W., and Lyu, M. R. 2023. "ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark," available at arXiv: 2303.13648.
Yuan, Z., and Chen, H. 2023. "The Impact of ChatGPT on the Demand for Human Content Generating and
Editing Services: Evidence from an Online Labor Market," ICIS 2023 Proceedings.
ZeroGPT. 2024. "AI Text Detector." from https://www.zerogpt.com
Zhu, Q., Zhang, X., and Luo, J. 2023. "Biologically Inspired Design Concept Generation Using Generative
Pre-Trained Transformers," Journal of Mechanical Design (145:4), p. 041409.
Figures and Tables

[Bootstrap distribution plots not reproducible in this text rendering. Recoverable values:

Group                   |Mean Diff| (Pre vs. Post GenAI)
GenAI User (N = 15)     12.76**
Non-User (N = 12)       18.91***

|Mean Diff| (GenAI User vs. Non-User): 6.86 (Pre GenAI), 0.72 (Post GenAI)]
Figure 1 presents the bootstrap distributions (1,000 replications) and the 95% confidence intervals of the exam scores for the four groups: GenAI users and non-users, each before and after the public release of GenAI. The mean differences between GenAI user and non-user performance are reported as absolute values and tested with a paired-sample (two-sample) t-test for the within- (between-)group differences. ***, **, * indicate statistical significance at the 1%, 5%, and 10% level, respectively (two-tailed).

Figure 1. Results of the Identification Strategy
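
For readers who wish to apply this identification strategy to their own data, the following is a minimal Python sketch of the bootstrap and t-test procedure described in the figure note. The data frame, column names (score_pre, score_post, genai_user), and sample are synthetic placeholders, not the study's material.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic placeholder data: students observed before and after the
# public release of GenAI (illustrative only).
n = 27
df = pd.DataFrame({
    "genai_user": rng.integers(0, 2, n),
    "score_pre": rng.normal(50, 15, n),
    "score_post": rng.normal(48, 15, n),
})

def bootstrap_means(scores: np.ndarray, reps: int = 1000) -> np.ndarray:
    """Resample with replacement; return the distribution of sample means."""
    m = len(scores)
    return np.array([scores[rng.integers(0, m, m)].mean() for _ in range(reps)])

users = df[df["genai_user"] == 1]
non_users = df[df["genai_user"] == 0]

# Within-group difference (pre vs. post): paired-sample t-test.
t_within, p_within = stats.ttest_rel(users["score_pre"], users["score_post"])

# Between-group difference (users vs. non-users): two-sample t-test.
t_between, p_between = stats.ttest_ind(users["score_post"], non_users["score_post"])

# 95% percentile confidence interval from 1,000 bootstrap replications.
boot = bootstrap_means(users["score_post"].to_numpy())
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"within p = {p_within:.3f}, between p = {p_between:.3f}, "
      f"95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```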
Panel A: Student Data
Variables                 N      Mean    Median   SD      P25     P75
Exam Score 193 45.39 45.83 22.23 27.50 61.11
GenAI User 193 0.306 0.462
ZeroGPT 193 0.354 0.296 0.277 0.119 0.556
A-Level Grade 193 2.290 2.200 0.607 1.800 2.700
Attempt 193 1.425 1 1.223 1 1
Attendance (relative) 193 0.447 0.444 0.322 0.111 0.778
Vocational Training 193 0.135 0.342
Voluntary Service 193 0.363 0.482
Female 193 0.472 0.500
LinkedIn User 193 0.192 0.395
Panel B: Student Data by GenAI Usage
                          GenAI User         Non-User
Variables                 N      Mean        N      Mean       |Diff.|
Exam Score 59 39.120 134 48.147 9.027 ***
A-Level Grade 59 2.184 134 2.337 0.152 *
Attempt 59 1.763 134 1.276 0.486 **
Attendance (relative) 59 0.413 134 0.463 0.051
Vocational Training 59 0.085 134 0.157 0.072
Voluntary Service 59 0.322 134 0.381 0.059
Female 59 0.390 134 0.508 0.118
LinkedIn User 59 0.221 134 0.179 0.041
Table 1 presents descriptive statistics of student characteristics in Panel A. For binary variables, only means and standard deviations are reported. Panel B shows student characteristics disaggregated by GenAI usage. The last column presents the absolute difference in mean values and the significance level of a two-tailed t-test (chi-squared test) for continuous (binary) variables. ***, **, * indicate statistical significance at the 1%, 5%, and 10% level, respectively.

Table 1. Descriptive Statistics
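
The group comparisons in Panel B can be reproduced with standard tests. Below is a minimal sketch on synthetic placeholder data with illustrative column names (exam_score, genai_user, female); it applies a two-tailed t-test to a continuous variable and a chi-squared test to a binary one, mirroring the table note.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 193  # sample size reported in Panel A
df = pd.DataFrame({  # synthetic placeholder data
    "genai_user": rng.integers(0, 2, n),
    "exam_score": rng.normal(45, 22, n),
    "female": rng.integers(0, 2, n),
})

users = df[df["genai_user"] == 1]
non_users = df[df["genai_user"] == 0]

# Continuous variable: two-tailed two-sample t-test on group means.
t_stat, p_t = stats.ttest_ind(users["exam_score"], non_users["exam_score"])

# Binary variable: chi-squared test on the 2x2 contingency table.
chi2, p_chi, dof, expected = stats.chi2_contingency(
    pd.crosstab(df["genai_user"], df["female"])
)
print(f"t-test p = {p_t:.3f}, chi-squared p = {p_chi:.3f}")
```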
Variables (1) (2) (3)
GenAI User -8.48 ** -6.32 ** -6.71 **
(-2.40) (-2.01) (-2.17)
A-Level Grade 12.04 *** 11.42 ***
(4.98) (4.79)
Attempt 0.93 0.76
(0.77) (0.63)
Attendance (relative) 17.99 *** 18.30 ***
(3.73) (3.86)
Vocational Training 10.54 ** 10.80 **
(2.41) (2.52)
Voluntary Service 2.20 1.76
(0.72) (0.59)
Female -9.54 *** -10.16 ***
(-3.23) (-3.50)
LinkedIn User 9.60 ***
(2.75)
Constant Included Included Included
Course of Study-Fixed Effects Included Included Included
N 193 193 193
Adj. R² 0.03 0.28 0.30
Table 2 presents the results of regressing the exam score on GenAI usage. Column (1) depicts the standalone effect of the GenAI usage dummy. In column (2), we add control variables commonly used in the literature to explain exam performance. Column (3) adds a further control variable that helps explain the variable of interest; this specification represents our main results. Bold font indicates the variable of interest. ***, **, * indicate statistical significance at the 1%, 5%, and 10% level (two-tailed), respectively. t-values are presented in parentheses. All variables are defined in Appendix A (https://tinyurl.com/zjehfa3n).
Table 2. Regression Results on the Impact of GenAI Usage on Exam Performance
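
A minimal sketch of the column (3) specification follows, estimated with OLS and course-of-study fixed effects via statsmodels' formula API. All variable names and the data frame are illustrative assumptions; the synthetic data will not reproduce the reported coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 193
df = pd.DataFrame({  # synthetic placeholder data, not the study's sample
    "exam_score": rng.normal(45, 22, n),
    "genai_user": rng.integers(0, 2, n),
    "a_level_grade": rng.normal(2.3, 0.6, n),
    "attempt": rng.integers(1, 4, n),
    "attendance": rng.uniform(0, 1, n),
    "vocational_training": rng.integers(0, 2, n),
    "voluntary_service": rng.integers(0, 2, n),
    "female": rng.integers(0, 2, n),
    "linkedin_user": rng.integers(0, 2, n),
    "course": rng.choice(["A", "B", "C"], n),  # course of study
})

FORMULA = (
    "exam_score ~ genai_user + a_level_grade + attempt + attendance"
    " + vocational_training + voluntary_service + female + linkedin_user"
    " + C(course)"  # course-of-study fixed effects
)
fit = smf.ols(FORMULA, data=df).fit()

# In the paper, the coefficient on genai_user in this specification is -6.71.
print(fit.params["genai_user"], fit.tvalues["genai_user"], fit.rsquared_adj)
```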
                     (1)            (2)            (3)             (4)              (5)                 (6)
Variables            Threshold 0.4  Threshold 0.6  Originality.AI  German Detector  Manual Computation  Balanced Sample
GenAI User -7.10 ** -7.17 ** -4.14 -8.53 *** -2.66 *** -6.51 **
(-2.43) (-2.02) (-1.38) (-2.90) (-2.70) (-2.07)
A-Level Grade 11.64 *** 11.42 *** 11.42 *** 11.21 *** 11.62 *** 8.88 ***
(4.92) (4.78) (4.73) (4.75) (4.87) (3.03)
Attempt 0.89 0.84 0.29 0.94 0.79 2.05 **
(0.74) (0.69) (0.24) (0.79) (0.66) (1.98)
Attendance (relative) 17.70 *** 18.38 *** 18.01 *** 16.81 *** 15.55 *** 17.75 ***
(3.75) (3.87) (3.78) (3.58) (3.25) (2.95)
Vocational Training 10.37 ** 11.25 *** 11.98 *** 10.93 ** 12.09 *** 8.75
(2.41) (2.63) (2.79) (2.58) (2.86) (1.65)
Voluntary Service 1.81 2.07 2.13 1.57 1.24 1.52
(0.60) (0.69) (0.71) (0.53) (0.41) (0.45)
Female -10.14 *** -9.29 *** -9.97 *** -10.20 *** -10.18 *** -10.49 ***
(-3.50) (-3.18) (-3.41) (-3.54) (-3.50) (-3.22)
LinkedIn User 10.75 *** 9.25 *** 9.05 ** 10.26 *** 7.89 ** 10.53 ***
(3.05) (2.65) (2.58) (2.96) (2.25) (2.82)
Constant Included Included Included Included Included Included
Course of Study-FE Included Included Included Included Included Included
N 193 193 193 193 193 193
Adj. R² 0.31 0.30 0.29 0.32 0.31 0.20
Table 3 presents the results of the robustness checks. In columns (1) and (2), we lower (> 0.4) or raise (> 0.6) the AI-detector threshold above which students are classified into the GenAI User group. Columns (3) and (4) use alternative AI detectors. Column (5) uses a manually computed score that proxies for AI detection. In column (6), we re-estimate our main specification on an entropy-balanced sample. Bold font indicates the variable of interest. ***, **, * indicate statistical significance at the 1%, 5%, and 10% level (two-tailed), respectively. t-values are presented in parentheses. All variables are defined in Appendix A (https://tinyurl.com/zjehfa3n).

Table 3. Results of Robustness Checks
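
The threshold checks in columns (1) and (2) amount to re-classifying students as GenAI users at alternative detector cut-offs and re-estimating the model. The sketch below illustrates this with synthetic data, an assumed detector-score column (zerogpt, in [0, 1]), and an abbreviated control set:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 193
df = pd.DataFrame({  # synthetic placeholder data
    "exam_score": rng.normal(45, 22, n),
    "zerogpt": rng.uniform(0, 1, n),  # AI-detector score per essay
    "a_level_grade": rng.normal(2.3, 0.6, n),
    "course": rng.choice(["A", "B", "C"], n),
})

# Re-classify users at each cut-off and re-estimate (controls abbreviated).
for cutoff in (0.4, 0.5, 0.6):
    df["genai_user"] = (df["zerogpt"] > cutoff).astype(int)
    fit = smf.ols("exam_score ~ genai_user + a_level_grade + C(course)",
                  data=df).fit()
    print(f"cutoff {cutoff}: b = {fit.params['genai_user']:.2f}, "
          f"t = {fit.tvalues['genai_user']:.2f}")
```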
                     (1) Higher      (2) Lower       (3) Higher    (4) Lower
Variables            A-Level Grade   A-Level Grade   Attendance    Attendance
GenAI User -12.27 *** 2.09 -11.90 *** -2.92
(-2.73) (0.46) (-2.74) (-0.64)
A-Level Grade 11.96 ** 24.13 *** 11.58 *** 12.64 ***
(2.37) (3.25) (3.60) (3.31)
Attempt -2.03 1.21 -1.78 2.27
(-0.87) (0.87) (-0.71) (1.44)
Attendance (relative) 17.18 ** 12.07 * 18.62 34.93 *
(2.24) (1.89) (1.65) (1.83)
Vocational Training 4.41 24.40 *** 10.33 * 11.93 *
(0.76) (3.80) (1.77) (1.76)
Voluntary Service -1.17 3.97 -2.80 8.47 *
(-0.27) (0.94) (-0.67) (1.83)
Female -9.65 ** -9.44 ** -11.07 *** -8.22 *
(-2.25) (-2.30) (-2.76) (-1.71)
LinkedIn User 9.87 ** 6.00 16.44 *** 0.81
(2.12) (1.12) (3.37) (0.15)
Constant Included Included Included Included
Course of Study-FE Included Included Included Included
N 103 90 104 89
Adj. R² 0.27 0.28 0.30 0.12
Table 4 presents the regression results using split samples. In columns (1) and (2), we repeat our main regression analysis on restricted samples containing only students with above- (below-)median A-Level Grade. Columns (3) and (4) present the main regression separately for students with above- and below-median attendance. Bold font indicates the variable of interest. ***, **, * indicate statistical significance at the 1%, 5%, and 10% level (two-tailed), respectively. t-values are presented in parentheses. All variables are defined in Appendix A (https://tinyurl.com/zjehfa3n).
Table 4. Results of Split Sample Regressions
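
The split-sample logic behind Table 4 is a median split followed by separate estimations of the main model. A minimal sketch under the same illustrative assumptions and synthetic data as above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 193
df = pd.DataFrame({  # synthetic placeholder data
    "exam_score": rng.normal(45, 22, n),
    "genai_user": rng.integers(0, 2, n),
    "a_level_grade": rng.normal(2.3, 0.6, n),
    "attendance": rng.uniform(0, 1, n),
    "course": rng.choice(["A", "B", "C"], n),
})

# Split at the median and estimate the model on each subsample.
median_grade = df["a_level_grade"].median()
subsamples = {
    "above-median A-Level Grade": df[df["a_level_grade"] > median_grade],
    "below-median A-Level Grade": df[df["a_level_grade"] <= median_grade],
}
for label, sub in subsamples.items():
    fit = smf.ols("exam_score ~ genai_user + a_level_grade + attendance"
                  " + C(course)", data=sub).fit()
    print(f"{label}: b(genai_user) = {fit.params['genai_user']:.2f}")
```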