Potential Benefits of Employing Large Language Models in Research in Moral Education and Development

Hyemin Han
Educational Psychology Program, University of Alabama

Author Note
We have no known conflict of interest to disclose. Correspondence concerning this article should be addressed to Hyemin Han, University of Alabama, Box 872031, Tuscaloosa, AL 35487, United States. Email: hyemin.han@ua.edu

Abstract
Recently, computer scientists have developed large language models (LLMs) by training prediction models with large-scale language corpora and human reinforcements. LLMs have become a promising way to implement accurate artificial intelligence in various fields. Interestingly, recent LLMs possess emergent functional features that emulate sophisticated human cognition, especially in-context learning and the chain of thought, which were unavailable in previous prediction models. In this paper, I will examine how LLMs might contribute to moral education and development research. To achieve this goal, I will review recently published conference papers and arXiv preprints to overview the novel functional features implemented in LLMs. I also intend to conduct brief experiments with ChatGPT to investigate how LLMs behave while addressing ethical dilemmas and external feedback. The results suggest that LLMs might be capable of solving dilemmas based on reasoning and revising their reasoning process with external input. Furthermore, a preliminary experimental result from the moral exemplar test may demonstrate that exemplary stories can elicit moral elevation in LLMs as they do among human participants. I will discuss the potential implications of LLMs for research on moral education and development in light of these results.

Keywords: Large language models, Artificial intelligence, Moral reasoning, Moral exemplar, Simulation

Introduction
One of the most impactful recent developments in computer science is large language models (LLMs) (Grossmann et al., 2023), which implement advanced artificial intelligence. Computer scientists developed LLMs to predict the most probable solution to a given inquiry by training prediction models with large amounts of language input (Zhao et al., 2023). LLMs utilize large corpora from various sources to train their prediction models. Now, people working in fields other than computer science are using such LLMs for various purposes. For instance, ChatGPT, which is based on one of the most widely used LLMs, GPT, has significantly influenced every aspect of human life due to its user-friendliness and versatility (Mogavi et al., 2023). People employ ChatGPT to achieve diverse goals, such as drafting and reviewing paragraphs, preprocessing data, and generating source code for computer programming. They have found that ChatGPT and LLMs in general can produce high-quality products following directions provided in human languages (Dwivedi et al., 2023). Because computer science evolves rapidly, I will briefly overview recent conference papers and arXiv preprints addressing LLMs to examine how they work and their distinctive functional features.

Computer scientists invented LLMs with large-scale language datasets to make them capable of understanding human language input and producing plausible responses (Zhao et al., 2023). They develop LLMs by training prediction models with large-scale corpora, which include demonstrations and exemplars. While training LLMs, scientists link exemplar inputs (X) with labels (Y) indicating whether, or to what extent, a specific X is desirable. In the long run, an LLM tends to produce the predicted output (ŷ) for a given condition (x) that is most likely and deemed most desirable based on the trained set.
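To make the notation concrete, the prediction and feedback processes just described can be summarized schematically as follows. This is only an illustrative simplification: the symbols D (the training corpus of demonstration–label pairs) and θ (the model's parameters) are my shorthand rather than notation from the cited sources, and actual training objectives are considerably more complex.

```latex
% Illustrative formalization of the prediction process described above.
% D: corpus of demonstration-label pairs (X, Y); x: a new inquiry;
% y: a candidate response; \theta: model parameters learned from D.
\[
  \hat{y} = \arg\max_{y} \, p_{\theta}\left(y \mid x\right)
\]
% Feedback-driven correction loosely resembles a Bayesian update of prior
% beliefs about the parameters into a posterior after observing D:
\[
  p\left(\theta \mid D\right) \propto p\left(D \mid \theta\right)\, p\left(\theta\right)
\]
```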
Because LLMs include functionalities to interpret natural language input during this process, they can understand x and generate ŷ in human language (Arcas & Agüera, 2022). When they generate a low-quality ŷ, it is possible to provide feedback to LLMs so that they correct their prediction models through human reinforcement learning (Ouyang et al., 2022; Srivastava et al., 2023). For instance, when LLMs generated biased statements against specific groups, providing human feedback reduced the bias significantly (Ganguli et al., 2023).

Such training and learning processes are similar to what occurs among humans. The abovementioned processes induced by demonstrations and exemplars resemble Bayesian learning at the behavioral and neural levels (Friston, 2003). Each individual has prior beliefs that exist before experience. External inputs, such as observations, examples, and instructions, update the existing priors into posteriors (Mathys, 2011). Although Bayes' theorem does not predict the posterior update completely, the mechanism of Bayesian learning approximates human learning processes well (McDiarmid et al., 2021). Recent works have proposed that the Bayesian learning mechanism can also apply to moral psychology and development (Cohen et al., 2022; Han, 2023a; Railton, 2017).

Interestingly, recent developments in LLMs demonstrate significant emerging features simulating human psychological processes, which were unavailable in previous simulation models (Grossmann et al., 2023). I would like to review two major emerging features, in-context learning and reasoning and the chain of thought and reasoning, which provide greater degrees of freedom and flexibility in simulated cognition.

First, the most up-to-date LLMs can conduct in-context learning (Dong et al., 2023). Unlike simple machines designed to provide specified responses to specified inquiries, LLMs can learn from a set of contextually relevant exemplars and demonstrations (Dong et al., 2023; Zhao et al., 2023). Then, they can solve problems within a similar context even if those problems are not identical to the examples used for training. In-context learning is a feature that emerged from the large-scale and complex networks constituting LLMs; it was not available in the past, when computational resources did not allow the implementation of large prediction models (Dong et al., 2023). At this point, in the domains of learning and problem-solving, LLMs can behave more similarly to human beings than previous simulation models because they can exercise contextual training and prediction with enhanced flexibility (Moor et al., 2023; Wu et al., 2023).
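In practice, in-context learning is typically exploited by prepending a handful of labeled demonstrations to a new query so that the model continues the pattern. The following minimal sketch only assembles such a few-shot prompt; the demonstration items are invented for illustration, and no particular model API is assumed.

```python
# Minimal sketch of a few-shot (in-context learning) prompt.
# The labeled demonstrations are hypothetical; the model is expected to
# infer the pattern and complete the label for the final, unlabeled case.

demonstrations = [
    ("Sharing class notes with a sick classmate", "prosocial"),
    ("Spreading a false rumor about a coworker", "antisocial"),
    ("Returning a lost wallet to its owner", "prosocial"),
]

new_case = "Helping a stranger carry groceries up the stairs"

prompt_lines = ["Classify each behavior as prosocial or antisocial."]
for behavior, label in demonstrations:
    prompt_lines.append(f"Behavior: {behavior}\nLabel: {label}")
prompt_lines.append(f"Behavior: {new_case}\nLabel:")  # left open for the model to complete

prompt = "\n\n".join(prompt_lines)
print(prompt)
```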
Second, recent LLMs show emerging capabilities of the chain of thought and reasoning (Wei et al., 2023). Previously, computers could only provide direct answers to problems without elaborating on the process of problem-solving (Zhao et al., 2023). For instance, if we ask, "Corgi has two retriever puppets. She purchased three more boxes with two puppets in each. How many retriever puppets does Corgi have?" they could say "8" without explaining any rationale. Unlike previous computational models, LLMs can explain the process of thought and reasoning and can even learn from an explanation of such a process provided by humans. In the case of the abovementioned inquiry, LLMs can explain their rationale, such as "Corgi started with two puppets. Three boxes of two puppets each are six puppets. 2 + 6 = 8." Providing the chain of reasoning significantly improved LLMs' performance in math problem-solving (Wei et al., 2023). Likewise, in the case of the bias correction study, when researchers provided additional feedback accompanying the chain of reasoning, LLMs showed a significantly greater decrease in bias than when researchers gave the models simple factual corrections (Ganguli et al., 2023). Like humans, LLMs can improve their prediction models more effectively by learning from examples of thinking and reasoning processes (Huang & Chang, 2023; Zhao et al., 2023). In addition to the capability of in-context learning, the chain of thought capability suggests that recent LLMs can exercise rudimentary forms of reasoning with contextual flexibility (Huang & Chang, 2023).

Such advanced emerging features suggest LLMs might be able to simulate rudimentary human moral functioning. Related to human morality, several previous studies examined whether LLMs are capable of morality-related cognitive faculties, such as the theory of mind (ToM) and reasoning (May, 2018; Young et al., 2007), thanks to the newly emerging features. Although the evidence is equivocal (Shapira et al., 2023), some demonstrated that the most up-to-date LLMs showed rudimentary forms of ToM equivalent to what seven-year-olds may perform (Kosinski, 2023). One study reported that ChatGPT could generate philosophical statements mimicking those of the philosopher Daniel Dennett that non-expert participants could not accurately distinguish from his original ones (Schwitzgebel et al., 2023). Furthermore, one previous study examined whether LLMs can correct their bias against marginalized groups (Ganguli et al., 2023). In that study, when researchers provided the models with feedback for correction, LLMs showed a significant decrease in discrimination against minority groups. Of course, I admit that these findings are insufficient to support arguments that LLMs are fully equipped with the cognitive faculties that humans possess, so whether we should consider them moral agents and whether they can perform moral functioning as human beings do remain controversial (Chalmers, 2023). Hence, I intend to focus on the potential practical value of LLMs while considering their current technological limitations.

Current Paper
In this paper, I propose that LLMs can eventually assist our research on moral education and development, particularly research involving empirical and practical investigations, notwithstanding current limitations. I will examine how LLMs can contribute to research in moral education. Philosophers are primarily interested in addressing the potential ethical impacts of LLMs or how enhanced artificial intelligence based on LLMs will change philosophical inquiries about cognition and psychology (e.g., Hosseini et al., 2023). In educational research, researchers have started considering the educational implications of LLMs, particularly those related to teaching and learning, such as how they will change the ways we teach and learn writing, search for information, and so on (e.g., Kasneci et al., 2023; Milano et al., 2023).
Unlike these previous works, in this paper, I will explicitly focus on how the development of LLMs will influence empirical and practical research in moral education, with concrete points. Hence, the abovementioned issues related to the ethical and educational implications of LLMs are beyond the scope of this paper. In the first section, I will briefly test LLMs' capability to conduct moral reasoning with ethical dilemmas. Then, I will examine whether LLMs' in-context learning and reasoning and chain of thought and reasoning features can facilitate revisions of moral reasoning with contextual information and feedback, which are closely associated with moral educational activities. Furthermore, I will also examine whether the stories of moral exemplars can elicit emotional and motivational impacts in LLMs. Because moral reasoning alone cannot accurately explain the mechanism of moral motivation and behavior (Kristjánsson, 2010), I planned to employ the moral exemplar intervention, which primarily targets moral emotion and motivation; moral educators tend to utilize this intervention to generate affective and motivational impacts rather intuitively (Kristjánsson, 2017; Sanderse, 2012). Given that demonstrations constitute the basis for the learning mechanisms of LLMs (Zhao et al., 2023), I expected that the moral exemplar intervention might produce outcomes in LLMs similar to those among human participants. Based on the concrete experimental outcomes, I intend to discuss the implications of LLMs for research in moral education and development. Finally, I will overview the limitations and future directions with concluding remarks.

The Behavioral Defining Issues Test
I examined how LLMs address moral problems to acquire insights about whether they can exercise some aspects of moral functioning, especially moral reasoning, like humans, based on the abovementioned functionalities, i.e., in-context learning and the chain of thought and reasoning. Thus, I briefly tested how ChatGPT (May 24 version) solves ethical dilemmas in the behavioral Defining Issues Test (bDIT). The bDIT is a simplified version of the DIT that assesses individuals' development of moral reasoning in terms of postconventional reasoning (Choi et al., 2019; Han, Dawson, et al., 2020). I employed the bDIT instead of the original DIT due to its simple structure, which can be feasibly implemented in the ChatGPT environment. Because I used the free ChatGPT (https://chat.openai.com/), I assumed that only general corpora, which were not specific to moral philosophy and psychology, were used to train the GPT model (Guo et al., 2023).

First, I entered the dilemmas and items asking for the moral philosophical rationale supporting behavioral decisions, quoted from the bDIT. The bDIT presents three dilemma stories, i.e., Heinz and the drug, Newspaper, and Escaped Prisoner. For each dilemma, I asked ChatGPT to evaluate whether a provided behavioral solution (e.g., "Should Heinz steal the drug?" in the case of Heinz and the drug) was morally appropriate. Then, I asked eight questions about the most important rationale supporting the decision per dilemma. For each question, I presented three options corresponding to three reasoning schemas, i.e., the personal interest, maintaining norms, and postconventional schemas (Rest et al., 1999). Following the bDIT scoring guidelines, I calculated the postconventional reasoning (P) score indicating the likelihood of employing the postconventional schema across all 24 questions.
For instance, if ChatGPT selected the postconventional options for 12 questions out of 24, the resultant P-score became 12 / 24 × 100 = 50. The resultant P-score from the trial with ChatGPT was 45.83 (see https://osf.io/ryq5w for the complete conversation transcript). I compared this score with the P-scores calculated from a large dataset collected from undergraduate participants. I reanalyzed the dataset primarily collected by Han (2023b). Han (2023b) collected responses from 1,596 participants (85.37% women; mean age = 21.85 years, SD = 5.88 years) via Qualtrics. The Institutional Review Board at the University of Alabama reviewed and approved the original study (IRB #: 18-12-1842). I used customized R code to calculate their P-scores (all data and source code files are available via the Open Science Framework: https://osf.io/j98p4/). Interestingly, when I calculated the scores, the P-score reported by ChatGPT, 45.83, was slightly lower than the median P-score among the undergraduate participants, 50.00 (mean = 52.61, SD = 21.74). A score of 45.83 was equivalent to the 40th percentile of the whole undergraduate student group. Despite potential caveats, the result might suggest that ChatGPT demonstrates moral judgment and reasoning comparable to those of undergraduate students.

Second, I collected more qualitative responses to examine further details about the reasoning process. I again presented the three dilemmas and asked ChatGPT to elaborate on their supporting rationale. First, I asked about the rationale supporting ChatGPT's response to Heinz and the drug: Heinz should steal the drug. Consistent with their answers to the bDIT items, ChatGPT presented several points corresponding to postconventional reasoning, such as the principle of preserving life, moral duty, compassion, and a consequentialist perspective. Second, in the case of the Newspaper dilemma, ChatGPT argued that the principal should not stop the newspaper. As in the case of the Heinz dilemma, when I asked for a rationale, ChatGPT provided several points relevant to the postconventional schema: freedom of expression, student engagement, and education. Third, when I presented the Escaped Prisoner dilemma to ChatGPT, they said Mrs. Jones should report Mr. Thompson to the police. ChatGPT provided themes corresponding to the maintaining norms schema, such as upholding the law, accountability, fairness, equity, and preserving the integrity of the justice system.

ChatGPT could provide rationale coherently supporting their responses corresponding to each reasoning schema. The results suggest that ChatGPT can generate moral decisions based on a reasoning process similar to that of humans. Although the evidence at this point is rudimentary, it may open the door to the possibility that ChatGPT engages in a chain of moral reasoning. In addition, we should note that only general corpora, not corpora designed specifically for moral philosophy or psychology, were used to train ChatGPT. That said, ChatGPT is also capable of in-context learning and cognition, and they could use exemplary demonstrations learned from the general corpora in the context of moral problem-solving. ChatGPT did not merely provide fixed responses to the questions like classical machines; instead, they demonstrated the flexibility to render decisions based on reasoning with contextual information.
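To make the scoring arithmetic described above concrete, here is a minimal sketch of how a bDIT P-score and its percentile standing could be computed. The response vector and the human comparison sample are hypothetical placeholders, not the actual data from Han (2023b).

```python
# Minimal sketch of bDIT P-score scoring and percentile comparison.
# The responses and the human sample below are hypothetical illustrations only.

def p_score(selected_schemas):
    """P-score = proportion of items answered with the postconventional option, x 100."""
    postconventional = sum(1 for s in selected_schemas if s == "postconventional")
    return postconventional / len(selected_schemas) * 100

# 24 items (8 per dilemma x 3 dilemmas); each entry is the schema of the chosen option.
chatgpt_responses = (
    ["postconventional"] * 11 + ["maintaining_norms"] * 9 + ["personal_interest"] * 4
)
chatgpt_p = p_score(chatgpt_responses)  # 11 / 24 * 100 ≈ 45.83

# Hypothetical distribution of human P-scores, used only to illustrate the comparison.
human_p_scores = [25.0, 37.5, 41.7, 50.0, 54.2, 62.5, 70.8, 79.2]
percentile = 100 * sum(p <= chatgpt_p for p in human_p_scores) / len(human_p_scores)

print(f"ChatGPT P-score: {chatgpt_p:.2f}; percentile in the comparison sample: {percentile:.1f}")
```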
A Brief Experiment to Test LLMs' Advanced Learning and Reasoning Capabilities
In addition to gathering information about the reasoning process, I examined whether ChatGPT could learn moral messages from material that did not directly address the presented dilemma (see https://osf.io/zcfvq for the complete conversation transcript). I focused on whether ChatGPT possessed advanced mental capabilities by using a task demanding additional flexibility and sophisticated reasoning. I tested whether ChatGPT could update their response to the story, and the rationale behind that response, based on such messages. To test this possibility, I started with the Escaped Prisoner dilemma, for which ChatGPT had argued that one should report the prisoner to the police based on themes relevant to the maintaining norms schema. I conducted a brief intervention to examine whether ChatGPT could update its decision and rationale based on indirectly relevant contextual information. Given that moral psychologists have regarded Letter from Birmingham Jail, authored by Martin Luther King, Jr., as an exemplary work demonstrating postconventional reasoning (Rest et al., 1999), I requested that ChatGPT read and extract moral messages from the letter. ChatGPT properly presented the moral lessons they learned from the letter, such as the moral obligation to fight injustice, civil disobedience, and the rule of law. As shown, these themes were consistent with the postconventional schema.

When I asked ChatGPT to solve the Escaped Prisoner dilemma again while considering the moral lessons, they altered their behavioral decision: one should not report the prisoner to the police. Furthermore, when I requested that they elaborate on the rationale supporting the decision, they provided answers corresponding to the postconventional schema, such as the need for rehabilitation, consideration of potential benefits to society and the community, and balancing justice and mercy.

The result from my brief experiment with the Escaped Prisoner dilemma may suggest that ChatGPT possesses advanced learning and reasoning capabilities; it is consistent with what I preliminarily found in the initial dilemma test. ChatGPT could learn moral lessons from the letter and then apply them to another context, the Escaped Prisoner dilemma (in-context learning and problem-solving). They were also able to exercise moral reasoning to support the altered behavioral decision based on the ethical themes of the letter (chain of thought and reasoning). These results demonstrate that ChatGPT can not only answer moral dilemmas based on a rationale but also engage in moral learning with flexibility. Even though MLK's letter does not directly address any issue in the Escaped Prisoner dilemma, ChatGPT could modify its behavioral decision based on contextually relevant information, i.e., the moral messages learned from the letter. Furthermore, related to the chain of thought, they showed the capability to update the reasoning process to render the modified decision accordingly.
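The intervention just described boils down to a three-step prompt sequence within a single conversation. The sketch below only records that protocol; the send function is a hypothetical placeholder, not the actual interface used for the experiment, and the dilemma text is abbreviated.

```python
# Sketch of the three-step intervention protocol described above.
# `send` is a hypothetical placeholder for a function that submits a prompt
# within one ongoing conversation and returns the model's reply.

def run_intervention(send):
    # Step 1: baseline decision and rationale for the Escaped Prisoner dilemma.
    baseline = send(
        "Here is the Escaped Prisoner dilemma: [dilemma text]. "
        "Should Mrs. Jones report Mr. Thompson to the police? Explain your rationale."
    )
    # Step 2: indirectly relevant contextual input (the intervention material).
    lessons = send(
        "Consider Letter from Birmingham Jail by Martin Luther King, Jr. "
        "What moral messages can be extracted from it?"
    )
    # Step 3: re-pose the same dilemma while asking the model to apply the lessons.
    revised = send(
        "Considering the moral lessons you just extracted, answer the Escaped "
        "Prisoner dilemma again and explain the rationale for your decision."
    )
    return baseline, lessons, revised

if __name__ == "__main__":
    # Dry run with an echo stub standing in for a real model.
    for reply in run_intervention(lambda prompt: f"[model reply to: {prompt[:40]}...]"):
        print(reply)
```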
Moral Exemplar Experiment
In addition to the moral reasoning tests, I also conducted an experiment utilizing the stories of moral exemplars to examine whether LLMs can generate affective and motivational reactions toward presented moral exemplars. One point regarding moral reasoning that we should note is that reasoning alone does not necessarily predict moral motivation and behavior (Blasi, 1980). That said, we need to consider additional factors, including affective and intuitive aspects of morality, in an integrative manner to explain motivation and behavior accurately (Kristjánsson, 2010). The results of the experiments involving moral reasoning might be insufficient to demonstrate the full potential of LLMs in moral education, which should also consider non-reasoning aspects of morality. Hence, I decided to utilize the moral exemplar intervention as an additional example in this paper. Moral educators, including moral psychologists and virtue ethicists, have suggested that the stories of moral exemplars can be powerful and efficient sources for moral education by promoting motivation for emulation (Sanderse, 2012).

The foundational learning mechanism of LLMs also suggests that the proposed test with moral exemplars is legitimate (Zhao et al., 2023). Given that LLMs learn patterns and train prediction models with series of demonstrations (Zhao et al., 2023), I deemed that moral exemplars demonstrating moral paragons with concrete content would also be likely to elicit significant changes and responses from LLMs, as in the moral reasoning experiments. Interestingly, one recent paper in computer science demonstrated that LLMs are capable of emotional inference (Li et al., 2023). So, it might be worth examining whether exemplary stories cause the abovementioned affective and motivational responses.

Research has demonstrated that moral exemplars can intuitively elicit affective reactions associated with moral motivation and behavior among participants (Haidt, 2000; Kristjánsson, 2017). For instance, in social psychology, researchers, especially those interested in moral intuition, have reported that being presented with others' exemplary moral behavior produces moral elevation (including an uplifting sensation) (Haidt, 2000) and, ultimately, prosocial motivation and behavior across various domains (e.g., Algoe & Haidt, 2009; Schnall et al., 2010; Silvers & Haidt, 2008; Vianello et al., 2010). Additional studies in moral education have shown that the perceived relatability and attainability of the presented exemplars positively predict such affective and motivational outcomes (Han et al., 2022; Han & Dawson, 2023).

Based on the abovementioned previous research, I tested whether ChatGPT could demonstrate responses similar to those of human participants when presented with moral exemplary stories. I focused on whether ChatGPT could report moral elevation as an emotional response; then, I also examined whether the LLM successfully differentiated responses depending on the relatability and attainability of the presented stories. As Han and Dawson (2023) reported with their data collected from human participants, I anticipated that ChatGPT would report stronger affective and motivational responses when the presented exemplars were relatable and attainable (vs. unrelatable and unattainable). I used three exemplary stories initially developed and tested by Han et al. (2022). The three stories included relatable and attainable, relatable and unattainable, and unrelatable stories (see https://osf.io/qwtbd for the three selected stories, and https://osf.io/jqxv3 for the complete list of stories used in Han et al. [2022]). As shown, the relatable stories demonstrated moral behaviors by Ron, a hypothetical college student in the United States. The unrelatable story presented prosocial behavior performed by Federica, the CEO of a large Italian corporation.
Han et al. (2022) adjusted the attainability of the two relatable stories by presenting two different exemplary behaviors. In the attainable condition, Ron stayed next to a victim of a traffic accident until paramedics arrived at the scene. In the case of the unattainable story, Ron visited the hospitalized victim for two weeks. Han and Dawson (2023) reported that participants, American college students, could evaluate the perceived attainability and relatability of the stories as initially intended. Furthermore, as introduced, they showed that relatability and attainability positively predicted elevation.

I requested that ChatGPT imagine they were a young college student in the United States to make them perceive Ron's stories as more attainable (see https://osf.io/eaxvm for the complete conversation transcript). After entering each story, I asked three questions to examine the perceived relatability, attainability, and evoked elevation. Here are the three questions from Han et al. (2022):
- "How similar do you think your cultural and social background is to the person described in the story?" (Relatability)
- "How difficult do you think it would be to do the same things as the person described in the story?" (Attainability)
- "Did the story make you feel morally elevated (warm, uplifted, like when seeing unexpected acts of human goodness, kindness, or compassion)?" (Elevation)

When I presented the three stories, ChatGPT could accurately compare the perceived relatability and attainability of the stories. They reported that Ron's behavior in the attainable story was significantly more replicable than that in the unattainable story. Also, ChatGPT said that Ron was deemed more relatable than Federica. Finally, when I examined the evoked elevation, ChatGPT responded that Ron's story, which they perceived as more relatable and attainable than Federica's, might be more elevating from a US college student's perspective. Interestingly, while responding to the questions, ChatGPT explained the rationale behind their responses instead of merely providing short answers. Such a capability to elaborate in-depth, reason-supported responses was similar to what I found during the moral dilemma experiments.
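For readers who wish to replicate or extend this design, the following sketch lays out the 3 (story condition) × 3 (question) structure described above as data and assembles the prompts in presentation order. The story texts are abbreviated placeholders; the full materials are available via the OSF links cited in the text.

```python
# Sketch of the 3 (story condition) x 3 (question) design described above.
# Story texts are abbreviated placeholders; full materials are on OSF (see links in the text).

role_instruction = "Imagine you are a young college student in the United States."

stories = {
    "relatable_attainable": "Ron, a U.S. college student, stayed next to a traffic-accident victim until paramedics arrived. [full story text]",
    "relatable_unattainable": "Ron, a U.S. college student, visited the hospitalized victim for two weeks. [full story text]",
    "unrelatable": "Federica, the CEO of a large Italian corporation, ... [full story text]",
}

questions = {
    "relatability": "How similar do you think your cultural and social background is to the person described in the story?",
    "attainability": "How difficult do you think it would be to do the same things as the person described in the story?",
    "elevation": "Did the story make you feel morally elevated (warm, uplifted)?",
}

# Assemble the prompts in presentation order; sending them to a model is left open.
for condition, story in stories.items():
    print(f"--- Condition: {condition} ---")
    print(role_instruction)
    print(story)
    for construct, question in questions.items():
        print(f"[{construct}] {question}")
```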
Potential Implications for Research in Moral Education and Development
From the previously reported interactions with ChatGPT, I found that LLMs can potentially address moral problems with advanced learning and reasoning capacities. Although developers did not train the models with morality-specific datasets, the models trained on general corpora could successfully answer ethical questions with contextual information and flexibility. Moreover, while solving the problems, instead of merely providing predetermined answers, they could elaborate on the chain of reasoning constituting the basis for their decisions. Finally, the models could demonstrate an ability to update their reasoning process and eventual decision based on indirectly relevant sources of information via in-context learning. Furthermore, I demonstrated that LLMs can predict affective and motivational outcomes when presented with the stories of moral exemplars. ChatGPT could accurately report how relatability and attainability are positively associated with moral elevation and motivation among human participants. Such results are consistent with the foundational learning mechanism of LLMs, which train their prediction models with series of concrete demonstrations. Also, based on this additional example, I suggest that LLMs have the potential to emulate human psychological processes related to various aspects of moral functioning, including reasoning, emotion, motivation, and intuition.

These features of LLMs suggest they can contribute to research in moral education and development in several ways. First, they can provide data about how learning in moral domains occurs and how it influences our moral functioning. As I explained in the introduction, LLMs can learn from a set of demonstrations and exemplars (X) associated with labels (Y) to update their prediction models (Dong et al., 2023; Zhao et al., 2023). One of their functional features that warrants our attention regarding this point is in-context learning and reasoning (Dong et al., 2023; Huang & Chang, 2023; Wei et al., 2023). Unlike conventional prediction models developed to predict outcomes from input in a specific context, LLMs can improve their models with data from indirectly relevant contexts (Dong et al., 2023). I demonstrated that an LLM trained with general corpora that were not directly relevant to moral contexts, i.e., ChatGPT, could solve ethical dilemmas and update its reasoning process based on contextual information, e.g., MLK's letter. ChatGPT could also examine the relatability and attainability of presented exemplars and how the stories elicited elevation and motivation. The conversation records indicate that ChatGPT could explain the rationale for which stories might more effectively promote moral emotion and motivation, instead of merely providing short answers to my questions. Hence, moral educators can examine how educational input, e.g., exemplar stories, reasoning demonstrations, etc., across diverse contexts generates potential changes before implementing it among human populations. Although several previous simulation studies examined how educational interventions change behavioral outcomes within moral education, they designed their simulation models to predict specific outcomes within designated contexts (Han et al., 2016, 2018; Han, Lee, et al., 2020), so the models could not perform in-context learning and reasoning. In addition, it is noteworthy that it is possible to train LLMs with large-scale language datasets collected from people with diverse backgrounds and from diverse sources (Grossmann et al., 2023). Thus, unlike conventional simulation models, LLMs will enable moral educators to predict the potential impacts of contextually various input materials among general populations on a large scale.

Second, moral educators will be able to simulate how educational input influences one's reasoning, emotional, and motivational processes, not simply behavioral or self-reported outcomes, thanks to LLMs' chain of thought and reasoning feature (Volkman & Gabriels, 2023). As shown in the moral correction study and my brief intervention experiment, LLMs can dynamically update their training set and adjust prediction outputs following learning inputs in real time (Ganguli et al., 2023; Zhao et al., 2023). The moral correction study and my brief investigation showed how human reinforcements, i.e., the rationale supporting bias correction in the moral correction study and the postconventional moral messages from MLK's letter in my experiment, interact with and update the chain of thought and reasoning leading to output (Ganguli et al., 2023; Srivastava et al., 2023).
Moreover, in the case of the moral exemplar experiment, ChatGPT demonstrated capabilities to simulate emotional and motivational outcomes (e.g., elevation) similar to those of human participants. They could also elaborate on the rationale supporting their answers regarding moral elevation. The result is consistent with what Li et al. (2023) reported in their study: LLMs' capability to conduct emotional inference. Although several previous studies in moral education have developed and tested simulation models predicting intervention outcomes, they could only demonstrate simulated behavioral or group-level results (Han et al., 2016, 2018; Han, Lee, et al., 2020). Unlike LLMs, they did not have any ability to emulate the reasoning process inside individuals or how external demonstrations and exemplars as input influence such a process. Because LLMs possess the emergent chain of thought and reasoning feature, their unique benefit, the models will allow moral educators to examine internal reasoning, emotional, and motivational processes within moral domains. By doing so, moral educators will gain additional information and insights about how their educational input, such as presentations of sophisticated reasoning exemplars as originally suggested by Blatt and Kohlberg (1975), impacts internal processes among diverse populations before testing it with human subjects; and, in the domains of moral emotion and motivation, educators can test how different types of exemplary stories differently influence students' psychological processes associated with emotion and motivation.

In conclusion, thanks to the newly emerged features, i.e., in-context learning and reasoning and the chain of thought and reasoning, LLMs will provide practical benefits to researchers in moral education and development interested in the potential impacts of educational input. Even if they cannot perfectly emulate human behavior and cognition, they can still be versatile simulation tools for improving education, with functional features unavailable in prior tools (Volkman & Gabriels, 2023). For example, in biological science and engineering, large-scale simulation models based on deep learning (which also constitutes the computational basis of LLMs), such as AlphaFold, are changing how materials are developed across the whole field (Chan et al., 2019). Previously, biological scientists and engineers had to spend months to years exploring protein materials that could become candidates for potential pharmaceutical development. Now, with AlphaFold, they can identify the best candidate materials within a few days (Chan et al., 2019). Although the tool per se can only simulate structural aspects of the candidates and cannot perfectly emulate how they work in living organisms, the extracted information can effectively inform follow-up in-vitro and in-vivo experiments (Samorodnitsky, 2022). Likewise, in the case of LLMs and moral education, moral educators can conduct preliminary simulations to predict how their educational materials and activities influence students' cognitive, motivational, and behavioral processes. Then, just as outputs from AlphaFold inform further in-vitro and in-vivo investigations, the simulation outputs can provide researchers and educators with insights into how to conduct their experiments and how to implement their educational activities with students.
This might be a significant advantage because even a brief educational intervention can produce non-trivial long-term impacts (Yeager & Walton, 2011). Thus, gathering information to predict potential outcomes in advance would be helpful for researchers and educators (Han et al., 2016, 2018).

Concluding Remarks
In this paper, I reviewed recent updates in LLM research and considered how LLMs might assist research in moral education and development. In the process, I tested a widely used LLM, ChatGPT, with ethical dilemmas presented by the bDIT. I also examined whether ChatGPT possessed the chain of thought and reasoning capabilities to update its moral decision with an alternative moral philosophical rationale suggested by MLK's letter. Interestingly, ChatGPT demonstrated moral reasoning and the capacity to modify its reasoning process while solving the presented dilemmas. Additionally, I tested ChatGPT's emotional and motivational capabilities by presenting different types of moral exemplars. They could report perceived relatability, attainability, and elevation similarly to human participants. Also, ChatGPT provided the rationale behind their responses regarding moral emotion and motivation in addition to short answers. Although the resultant outputs might only support the presence of rudimentary reasoning abilities and emotional and motivational capabilities, ChatGPT demonstrated its potential for simulating moral functioning and its improvement via interventions. Based on the outcomes, I briefly discussed how LLMs might help moral educators better conduct research in moral education and development, particularly research related to simulating moral psychological processes and educational outcomes.

Although LLMs possess the abovementioned practical benefits, several limitations warrant our attention. First, at this point, we cannot ensure that LLMs can perfectly simulate human cognition and behavior. Some scholars argue that even if LLMs might perform rudimentary philosophical reasoning and ToM tasks (Kosinski, 2023; Schwitzgebel et al., 2023), they could be philosophical zombies that conduct their behavior according to what they learned from large corpora (Chalmers, 2023). According to this critique, their human-like behaviors are mere products of prediction models trained on linguistic data, so whether they emulate human cognition or sentience is not ensured (Arcas & Agüera, 2022). Instead of relying on LLMs without reservation, until the technology develops further, we may utilize LLMs as testbeds for moral psychology and education without assuming that they perfectly emulate humans. Similar to the case of biotechnology, in which scientists use AlphaFold before in-vivo experiments (Samorodnitsky, 2022), researchers may use LLMs before conducting experiments with human subjects to gather additional information.

Second, we need to be aware of the issue of hallucinations. Because developers trained LLMs primarily with large-scale general corpora, which may include false information, LLMs might produce untrustworthy output when we enter inquiries into them. Developers built LLMs as generative prediction models based on trained corpora while paying less attention to veracity, so hallucination is now considered one of their severe limitations (McKenna et al., 2023).
That said, moral educators should carefully check the quality of products generated by LLMs, such as chain of thought and reasoning outputs, so that the products do not contain significant false or unreliable information. One mitigating fact is that users can address such problematic products, including biased responses, by providing human reinforcements and chain of thought and reasoning inputs (Ganguli et al., 2023; Srivastava et al., 2023; Zhao et al., 2023). In the long run, how LLMs correct falsehoods and unreliability in their products, particularly those in moral domains, through interactive feedback processes can also be an interesting research topic for moral educators. Insights from such correction processes can further inform moral education intended to address biases and misinformation, which has become a crucial topic after the COVID-19 pandemic (Blackburn et al., 2023; Gover et al., 2020).

Third, most current LLMs can only utilize language corpora as input, so they might not be able to perform simulations with diverse modalities of input, e.g., visual, tactile, and visceral information, unlike human beings (Chalmers, 2023). Thus, only limited linguistic information constitutes the input for simulations, so the models might only imperfectly emulate moral functioning that uses diverse modalities of input and involves embodiment (Narvaez, 2016). One piece of good news is that engineers are now developing devices supporting multiple modalities for data input and output (Chalmers, 2023; Mu et al., 2023). Once such devices become available, researchers will be able to simulate moral functioning with multimodal data obtained beyond human language. The enhanced simulation models will allow researchers to examine human moral functioning more realistically. For the same reason, I could only investigate limited domains of moral psychology and education, e.g., moral reasoning and the moral exemplar intervention, in this paper. Although moral reasoning is one fundamental factor predicting moral motivation and behavior (May, 2018), it is not a sufficient condition for them (Darnell et al., 2022). Also, moral educators utilize various educational methods other than the moral exemplar intervention, such as service learning. Once multimodal input and output are supported, we will be able to examine various functional components, such as moral identity and empathy, which constitute the complex network of moral functioning (Darnell et al., 2022; Han, 2023b), as well as additional educational methods.

Despite the limitations of LLMs at this point, I suggest that LLMs are noteworthy for research on moral education and development in the long run. Recent developments in computer science have enabled LLMs to possess emerging features central to simulating human psychological processes, such as in-context learning and reasoning, the chain of thought and reasoning, reasoning-based correction, and ToM capabilities, which were not available previously. Given that the abovementioned novel capabilities constitute the basis for moral functioning, it will be interesting to see how LLMs evolve. Once they acquire additional functionalities to simulate human cognition more accurately (Arcas & Agüera, 2022), moral educators will gain more insights for their research. Until then, we should pay keen attention to novel findings and updates regarding LLMs, particularly those closely related to human morality.
References
Algoe, S. B., & Haidt, J. (2009). Witnessing excellence in action: The 'other-praising' emotions of elevation, gratitude, and admiration. The Journal of Positive Psychology, 4(2), 105–127.
Arcas, Y., & Agüera, B. (2022). Do Large Language Models Understand Us? Daedalus, 151(2), 183–197. https://doi.org/10.1162/daed_a_01909
Blackburn, A. M., Han, H., Gelpí, R. A., Stöckli, S., Jeftić, A., Ch'ng, B., Koszałkowska, K., Lacko, D., Milfont, T. L., Lee, Y., COVIDiSTRESS II Consortium, & Vestergren, S. (2023). Mediation analysis of conspiratorial thinking and anti-expert sentiments on vaccine willingness. Health Psychology, 42(4), 235–246. https://doi.org/10.1037/hea0001268
Blasi, A. (1980). Bridging moral cognition and moral action: A critical review of the literature. Psychological Bulletin, 88, 1–45. https://doi.org/10.1037/0033-2909.88.1.1
Blatt, M. M., & Kohlberg, L. (1975). The Effects of Classroom Moral Discussion upon Children's Level of Moral Judgment. Journal of Moral Education, 4(2), 129–161. https://doi.org/10.1080/0305724750040207
Chalmers, D. J. (2023). Could a Large Language Model be Conscious? (arXiv:2303.07103). arXiv. http://arxiv.org/abs/2303.07103
Chan, H. C. S., Shan, H., Dahoun, T., Vogel, H., & Yuan, S. (2019). Advancing Drug Discovery via Artificial Intelligence. Trends in Pharmacological Sciences, 40(8), 592–604. https://doi.org/10.1016/j.tips.2019.06.004
Choi, Y.-J., Han, H., Dawson, K. J., Thoma, S. J., & Glenn, A. L. (2019). Measuring moral reasoning using moral dilemmas: Evaluating reliability, validity, and differential item functioning of the behavioural defining issues test (bDIT). European Journal of Developmental Psychology, 16(5), 622–631. https://doi.org/10.1080/17405629.2019.1614907
Cohen, H., Nissan-Rozen, I., & Maril, A. (2022). Empirical evidence for moral Bayesianism. Philosophical Psychology, 1–30. https://doi.org/10.1080/09515089.2022.2096430
Darnell, C., Fowers, B. J., & Kristjánsson, K. (2022). A multifunction approach to assessing Aristotelian phronesis (practical wisdom). Personality and Individual Differences, 196, 111684. https://doi.org/10.1016/j.paid.2022.111684
Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., Sun, X., Xu, J., Li, L., & Sui, Z. (2023). A Survey on In-context Learning (arXiv:2301.00234). arXiv. http://arxiv.org/abs/2301.00234
Dwivedi, Y. K., Kshetri, N., Hughes, L., Slade, E. L., Jeyaraj, A., Kar, A. K., Baabdullah, A. M., Koohang, A., Raghavan, V., Ahuja, M., Albanna, H., Albashrawi, M. A., Al-Busaidi, A. S., Balakrishnan, J., Barlette, Y., Basu, S., Bose, I., Brooks, L., Buhalis, D., … Wright, R. (2023). Opinion Paper: "So what if ChatGPT wrote it?" Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management, 71, 102642. https://doi.org/10.1016/j.ijinfomgt.2023.102642
Friston, K. (2003). Learning and inference in the brain. Neural Networks, 16(9), 1325–1352. https://doi.org/10.1016/j.neunet.2003.06.005
Ganguli, D., Askell, A., Schiefer, N., Liao, T. I., Lukošiūtė, K., Chen, A., Goldie, A., Mirhoseini, A., Olsson, C., Hernandez, D., Drain, D., Li, D., Tran-Johnson, E., Perez, E., Kernion, J., Kerr, J., Mueller, J., Landau, J., Ndousse, K., … Kaplan, J. (2023). The Capacity for Moral Self-Correction in Large Language Models (arXiv:2302.07459). arXiv. http://arxiv.org/abs/2302.07459
Gover, A. R., Harper, S. B., & Langton, L. (2020). Anti-Asian Hate Crime During the COVID-19 Pandemic: Exploring the Reproduction of Inequality. American Journal of Criminal Justice, 45(4), 647–667. https://doi.org/10.1007/s12103-020-09545-1
Grossmann, I., Feinberg, M., Parker, D. C., Christakis, N. A., Tetlock, P. E., & Cunningham, W. A. (2023). AI and the transformation of social science research. Science, 380(6650), 1108–1109. https://doi.org/10.1126/science.adi1778
Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., & Wu, Y. (2023). How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection (arXiv:2301.07597). arXiv. http://arxiv.org/abs/2301.07597
Haidt, J. (2000). The positive emotion of elevation. Prevention and Treatment, 3(3), 1–5.
Han, H. (2023a). Considering the Purposes of Moral Education with Evidence in Neuroscience: Emphasis on Habituation of Virtues and Cultivation of Phronesis. Ethical Theory and Moral Practice. https://doi.org/10.1007/s10677-023-10369-1
Han, H. (2023b). Examining the Network Structure among Moral Functioning Components with Network Analysis [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/ufg7e
Han, H., & Dawson, K. J. (2023). Relatable and attainable moral exemplars as sources for moral elevation and pleasantness. Journal of Moral Education. https://doi.org/10.1080/03057240.2023.2173158
Han, H., Dawson, K. J., Thoma, S. J., & Glenn, A. L. (2020). Developmental Level of Moral Judgment Influences Behavioral Patterns During Moral Decision-Making. The Journal of Experimental Education, 88(4), 660–675. https://doi.org/10.1080/00220973.2019.1574701
Han, H., Lee, K., & Soylu, F. (2016). Predicting long-term outcomes of educational interventions using the Evolutionary Causal Matrices and Markov Chain based on educational neuroscience. Trends in Neuroscience and Education, 5(4), 157–165. https://doi.org/10.1016/j.tine.2016.11.003
Han, H., Lee, K., & Soylu, F. (2018). Simulating outcomes of interventions using a multipurpose simulation program based on the evolutionary causal matrices and Markov chain. Knowledge and Information Systems. https://doi.org/10.1007/s10115-017-1151-0
Han, H., Lee, K., & Soylu, F. (2020). Applying the Deep Learning Method for Simulating Outcomes of Educational Interventions. SN Computer Science, 1(2), 70. https://doi.org/10.1007/s42979-020-0075-z
Han, H., Workman, C. I., May, J., Scholtens, P., Dawson, K. J., Glenn, A. L., & Meindl, P. (2022). Which moral exemplars inspire prosociality? Philosophical Psychology, 35(7), 943–970. https://doi.org/10.1080/09515089.2022.2035343
Hosseini, M., Resnik, D. B., & Holmes, K. (2023). The ethics of disclosing the use of artificial intelligence tools in writing scholarly manuscripts. Research Ethics, 17470161231180449. https://doi.org/10.1177/17470161231180449
Huang, J., & Chang, K. C.-C. (2023). Towards Reasoning in Large Language Models: A Survey (arXiv:2212.10403). arXiv. http://arxiv.org/abs/2212.10403
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., … Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274
Kosinski, M. (2023). Theory of Mind May Have Spontaneously Emerged in Large Language Models (arXiv:2302.02083). arXiv. http://arxiv.org/abs/2302.02083
Kristjánsson, K. (2010). Educating Moral Emotions or Moral Selves: A false dichotomy? Educational Philosophy and Theory, 42(4), 397–409. https://doi.org/10.1111/j.1469-5812.2008.00489.x
Kristjánsson, K. (2017). Emotions targeting moral exemplarity: Making sense of the logical geography of admiration, emulation and elevation. Theory and Research in Education, 15(1), 20–37. https://doi.org/10.1177/1477878517695679
Li, M., Su, Y., Huang, H.-Y., Cheng, J., Hu, X., Zhang, X., Wang, H., Qin, Y., Wang, X., Liu, Z., & Zhang, D. (2023). Language-Specific Representation of Emotion-Concept Knowledge Causally Supports Emotion Inference (arXiv:2302.09582). arXiv. http://arxiv.org/abs/2302.09582
Mathys, C. (2011). A Bayesian foundation for individual learning under uncertainty. Frontiers in Human Neuroscience, 5. https://doi.org/10.3389/fnhum.2011.00039
May, J. (2018). Regard for Reason in the Moral Mind. Oxford University Press.
McDiarmid, A. D., Tullett, A. M., Whitt, C. M., Vazire, S., Smaldino, P. E., & Stephens, J. E. (2021). Psychologists update their beliefs about effect sizes after replication studies. Nature Human Behaviour, 5(12), 1663–1673. https://doi.org/10.1038/s41562-021-01220-7
McKenna, N., Li, T., Cheng, L., Hosseini, M. J., Johnson, M., & Steedman, M. (2023). Sources of Hallucination by Large Language Models on Inference Tasks (arXiv:2305.14552). arXiv. http://arxiv.org/abs/2305.14552
Milano, S., McGrane, J. A., & Leonelli, S. (2023). Large language models challenge the future of higher education. Nature Machine Intelligence, 5(4), 333–334. https://doi.org/10.1038/s42256-023-00644-2
Mogavi, R. H., Deng, C., Kim, J. J., Zhou, P., Kwon, Y. D., Metwally, A. H. S., Tlili, A., Bassanelli, S., Bucchiarone, A., Gujar, S., Nacke, L. E., & Hui, P. (2023). Exploring User Perspectives on ChatGPT: Applications, Perceptions, and Implications for AI-Integrated Education (arXiv:2305.13114). arXiv. http://arxiv.org/abs/2305.13114
Moor, M., Banerjee, O., Abad, Z. S. H., Krumholz, H. M., Leskovec, J., Topol, E. J., & Rajpurkar, P. (2023). Foundation models for generalist medical artificial intelligence. Nature, 616(7956), 259–265. https://doi.org/10.1038/s41586-023-05881-4
Mu, Y., Zhang, Q., Hu, M., Wang, W., Ding, M., Jin, J., Wang, B., Dai, J., Qiao, Y., & Luo, P. (2023). EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought (arXiv:2305.15021). arXiv. http://arxiv.org/abs/2305.15021
Narvaez, D. (2016). Embodied Morality: Protectionism, Engagement and Imagination. Palgrave Macmillan UK. https://doi.org/10.1057/978-1-137-55399-7_1
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., & others. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
Railton, P. (2017). Moral Learning: Conceptual foundations and normative relevance. Cognition, 167, 172–190. https://doi.org/10.1016/j.cognition.2016.08.015
Rest, J. R., Narvaez, D., Bebeau, M. J., & Thoma, S. J. (1999). Postconventional moral thinking: A Neo-Kohlbergian approach. Lawrence Erlbaum Associates, Publishers.
Samorodnitsky, D. (2022). The Future of Biotech in an Artificially Intelligent World: Biotech hopes to benefit from protein structure prediction, pattern recognition, and support for iterative development. Genetic Engineering & Biotechnology News, 42(1), 26–27, 29. https://doi.org/10.1089/gen.42.01.09
Sanderse, W. (2012). The meaning of role modelling in moral and character education. Journal of Moral Education, 42(1), 28–42. https://doi.org/10.1080/03057240.2012.690727
Schnall, S., Roper, J., & Fessler, D. M. T. (2010). Elevation leads to altruistic behavior. Psychological Science, 21, 315–320. https://doi.org/10.1177/0956797609359882
Schwitzgebel, E., Schwitzgebel, D., & Strasser, A. (2023). Creating a Large Language Model of a Philosopher. Mind & Language. https://doi.org/10.1111/mila.12466
Shapira, N., Levy, M., Alavi, S. H., Zhou, X., Choi, Y., Goldberg, Y., Sap, M., & Shwartz, V. (2023). Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models (arXiv:2305.14763). arXiv. http://arxiv.org/abs/2305.14763
Silvers, J. A., & Haidt, J. (2008). Moral elevation can induce nursing. Emotion, 8(2), 291–295.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., Kluska, A., Lewkowycz, A., Agarwal, A., Power, A., Ray, A., Warstadt, A., Kocurek, A. W., Safaya, A., Tazarv, A., … Wu, Z. (2023). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (arXiv:2206.04615). arXiv. http://arxiv.org/abs/2206.04615
Vianello, M., Galliani, E. M., & Haidt, J. (2010). Elevation at work: The effects of leaders' moral excellence. The Journal of Positive Psychology, 5(5), 390–411.
Volkman, R., & Gabriels, K. (2023). AI Moral Enhancement: Upgrading the Socio-Technical System of Moral Engagement. Science and Engineering Ethics, 29(2), 11. https://doi.org/10.1007/s11948-023-00428-2
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2023). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv:2201.11903). arXiv. http://arxiv.org/abs/2201.11903
Wu, Y., Prabhumoye, S., Min, S. Y., Bisk, Y., Salakhutdinov, R., Azaria, A., Mitchell, T., & Li, Y. (2023). SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning (arXiv:2305.15486). arXiv. http://arxiv.org/abs/2305.15486
Yeager, D. S., & Walton, G. M. (2011). Social-psychological interventions in education: They're not magic. Review of Educational Research, 81(2), 267–301. https://doi.org/10.3102/0034654311405999
Young, L., Cushman, F., Hauser, M., & Saxe, R. (2007). The neural basis of the interaction between theory of mind and moral judgment. Proceedings of the National Academy of Sciences of the United States of America, 104, 8235–8240. https://doi.org/10.1073/pnas.0701408104
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., … Wen, J.-R. (2023). A Survey of Large Language Models (arXiv:2303.18223). arXiv. http://arxiv.org/abs/2303.18223