Generative AI —

A jargon-free explanation of how AI large language models work. Want to really understand large language models? Here’s a gentle primer.

Timothy B. Lee and Sean Trott - Jul 31, 2023 11:00 am UTC

An illustration of words connected by lines.

When ChatGPT was introduced last fall, it sent shockwaves through the technology industry and the larger world. Machine learning researchers had been experimenting with large language models (LLMs) for a few years by that point, but the general public had not been paying close attention and didn’t realize how powerful they had become.

Today, almost everyone has heard about LLMs, and tens of millions of people have tried them out. But not very many people understand how they work.

If you know anything about this subject, you’ve probably heard that LLMs are trained to “predict the next word” and that they require huge amounts of text to do this. But that tends to be where the explanation stops. The details of how they predict the next word are often treated as a deep mystery.

One reason for this is the unusual way these systems were developed. Conventional software is created by human programmers, who give computers explicit, step-by-step instructions. By contrast, ChatGPT is built on a neural network that was trained using billions of words of ordinary language.

As a result, no one on Earth fully understands the inner workings of LLMs. Researchers are working to gain a better understanding, but this is a slow process that will take years—perhaps decades—to complete.

Still, there’s a lot that experts do understand about how these systems work. The goal of this article is to make a lot of this knowledge accessible to a broad audience. We’ll aim to explain what’s known about the inner workings of these models without resorting to technical jargon or advanced math.

We’ll start by explaining word vectors, the surprising way language models represent and reason about language. Then we’ll dive deep into the transformer, the basic building block for systems like ChatGPT. Finally, we’ll explain how these models are trained and explore why good performance requires such phenomenally large quantities of data.

Word vectors

To understand how language models work, you first need to understand how they represent words. Humans represent English words with a sequence of letters, like C-A-T for "cat." Language models use a long list of numbers called a "word vector." For example, here’s one way to represent cat as a vector:

[0.0074, 0.0030, -0.0105, 0.0742, 0.0765, -0.0011, 0.0265, 0.0106, 0.0191, 0.0038, -0.0468, -0.0212, 0.0091, 0.0030, -0.0563, -0.0396, -0.0998, -0.0796, …, 0.0002]

(The full vector is 300 numbers long; only the first entries are shown here.)
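
If you want to look at a real 300-dimensional word vector yourself, here’s a minimal sketch using the gensim library’s pretrained word2vec vectors. These are a different set of pretrained vectors than the one shown above, so the numbers won’t match; the point is just to see what a 300-number word representation looks like.

```python
# A quick way to inspect a real 300-dimensional word vector using gensim's
# pretrained word2vec vectors (downloads ~1.6 GB on first use). These are not
# the same vectors as the example above; they just illustrate the idea.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # 300-dimensional vectors for ~3M words and phrases

cat_vector = wv["cat"]
print(cat_vector.shape)                     # (300,)
print(cat_vector[:10])                      # the first 10 of the 300 numbers
print(wv.most_similar("cat", topn=5))       # words whose vectors lie closest to "cat"
```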

Why use such a baroque notation? Here’s an analogy. Washington, DC, is located at 38.9 degrees north and 77 degrees west. We can represent this using a vector notation:

  • Washington, DC, is at [38.9, 77]
  • New York is at [40.7, 74]
  • London is at [51.5, 0.1]
  • Paris is at [48.9, -2.4]

This is useful for reasoning about spatial relationships. You can tell New York is close to Washington, DC, because 38.9 is close to 40.7 and 77 is close to 74. By the same token, Paris is close to London. But Paris is far from Washington, DC.
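
Here’s a small Python sketch of the same idea: measuring which cities are close using the coordinate vectors above, and which words are close using a few toy “word vectors.” The four-number word vectors are invented purely for illustration; real word vectors, like the 300-number one above, work the same way with many more dimensions.

```python
import numpy as np

# City coordinates from the analogy above: [latitude, degrees west of Greenwich].
cities = {
    "Washington, DC": np.array([38.9, 77.0]),
    "New York":       np.array([40.7, 74.0]),
    "London":         np.array([51.5, 0.1]),
    "Paris":          np.array([48.9, -2.4]),
}

# Cities whose vectors are close together are geographically close.
for name, vec in cities.items():
    distance = np.linalg.norm(cities["Washington, DC"] - vec)
    print(f"Washington, DC -> {name}: {distance:.1f}")

# Toy "word vectors" (made-up numbers, just to show the mechanics).
words = {
    "cat": np.array([0.7, 0.1, -0.3, 0.2]),
    "dog": np.array([0.6, 0.2, -0.2, 0.3]),
    "car": np.array([-0.4, 0.8, 0.5, -0.1]),
}

def cosine_similarity(a, b):
    """Closer to 1.0 means the two vectors point in more similar directions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(words["cat"], words["dog"]))   # relatively high: similar meanings
print(cosine_similarity(words["cat"], words["car"]))   # relatively low: unrelated meanings
```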

Unraveling the landscape of large language models: a systematic review and future perspectives

Journal of Electronic Business & Digital Economics

ISSN : 2754-4214

Article publication date: 19 December 2023

Issue publication date: 16 February 2024

Purpose

The rapid rise of large language models (LLMs) has propelled them to the forefront of applications in natural language processing (NLP). This paper aims to present a comprehensive examination of the research landscape in LLMs, providing an overview of the prevailing themes and topics within this dynamic domain.

Design/methodology/approach

Drawing from an extensive corpus of 198 records published between 1996 and 2023, gathered from relevant academic databases and encompassing journal articles, books, book chapters, conference papers and selected working papers, this study delves deep into the multifaceted world of LLM research. The authors employed the BERTopic algorithm, a recent advancement in topic modeling, to conduct a comprehensive analysis of the data after it had been meticulously cleaned and preprocessed. BERTopic leverages the power of transformer-based language models like bidirectional encoder representations from transformers (BERT) to generate more meaningful and coherent topics. This approach facilitates the identification of hidden patterns within the data, enabling the authors to uncover valuable insights that might otherwise have remained obscure.

Findings

The analysis revealed four distinct clusters of topics in LLM research: “language and NLP”, “education and teaching”, “clinical and medical applications” and “speech and recognition techniques”. Each cluster embodies a unique aspect of LLM application and showcases the breadth of possibilities that LLM technology has to offer. In addition to presenting the research findings, this paper identifies key challenges and opportunities in the realm of LLMs. It underscores the necessity for further investigation in specific areas, including the paramount importance of addressing potential biases, transparency and explainability, data privacy and security, and responsible deployment of LLM technology.

Practical implications

This classification offers practical guidance for researchers, developers, educators, and policymakers to focus efforts and resources. The study underscores the importance of addressing challenges in LLMs, including potential biases, transparency, data privacy, and responsible deployment. Policymakers can utilize this information to shape regulations, while developers can tailor technology development based on the diverse applications identified. The findings also emphasize the need for interdisciplinary collaboration and highlight ethical considerations, providing a roadmap for navigating the complex landscape of LLM research and applications.

Originality/value

This study stands out as the first to examine the evolution of LLMs across such a long time frame and across such diversified disciplines. It provides a unique perspective on the key areas of LLM research, highlighting the breadth and depth of LLM’s evolution.

  • Large language models
  • Natural language processing
  • Artificial intelligence
  • Topic modeling

Ding, Q., Ding, D., Wang, Y., Guan, C. and Ding, B. (2024), "Unraveling the landscape of large language models: a systematic review and future perspectives", Journal of Electronic Business & Digital Economics, Vol. 3 No. 1, pp. 3–19. https://doi.org/10.1108/JEBDE-08-2023-0015

Emerald Publishing Limited

Copyright © 2023, Qinxu Ding, Ding Ding, Yue Wang, Chong Guan and Bosheng Ding

Published in Journal of Electronic Business & Digital Economics. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode

1. Introduction

The advent of large language models (LLMs), including conversational generation tools like OpenAI’s ChatGPT and Google’s Bard, has revolutionized the field of natural language processing (NLP). These models not only offer highly informative and integrated conversations to users, but also have the capability to develop code, conduct code reviews and fix bugs, among other functionalities (Hsu & Ching, 2023). LLMs have also been applied to various domains and tasks, such as education, healthcare, speech recognition, information retrieval, text summarization and dialog systems. LLMs have shown impressive performance and potential in generating and processing natural language texts for diverse purposes and audiences. However, LLMs also pose significant challenges and risks, such as data quality, model reliability, ethical issues and social impacts. Therefore, it is important to understand the current state and future directions of LLM research and application. This paper presents a comprehensive analysis of the predominant themes and topics addressed in previous research concerning LLMs, drawing from a rich corpus of 198 records published between 1996 and 2023.

The records, which include journal articles, books, book chapters, conference papers and selected working papers, were gathered from a wide range of academic databases, such as EBSCO, Cambridge Journals, Elsevier, Emerald, Institute of Electrical and Electronics Engineers (IEEE), Journal Storage (JSTOR), Nature, Social Science Research Network (SAGE), Springer, Taylor & Francis and Wiley. The analysis was conducted using the BERTopic algorithm, a topic modeling technique that enabled us to identify four distinct clusters within LLM research: “Language and Natural Language Processing”, “Education and Teaching”, “Clinical and Medical Applications” and “Speech and Recognition Techniques”.

Each cluster represents a unique aspect of LLM application, demonstrating the wide-ranging potential of LLM technology. The “Language and Natural Language Processing” cluster delves into the technical aspects of LLMs, exploring key findings, techniques and challenges. The “Education and Teaching” cluster examines the transformative potential of LLMs in automating and enhancing educational tasks, while also addressing associated ethical and practical challenges. The “Clinical and Medical Applications” cluster investigates the promising potential of LLMs in healthcare, with a focus on their use in medical education and patient care. Lastly, the “Speech and Recognition Techniques” cluster explores the use of LLMs in speech recognition and other related applications.

This study stands out as the first to examine the evolution of LLMs across such a long time frame and such diversified disciplines. It provides a unique perspective on the key areas of LLM research and application over the years, highlighting the breadth and depth of LLM’s evolution.

In addition to presenting the research findings, this paper also identifies key challenges and opportunities in the realm of LLMs. It underscores the necessity for further investigation in specific areas, including explainability, robustness, cross-modal and multi-modal generation and interactive co-creation. Moreover, the paper highlights the paramount importance of addressing data privacy, security and responsible deployment of LLM technology.

In the following sections, this paper will detail the literature review, and explain the research methodology, data source and analysis techniques used in this study on LLMs. It will also present and discuss the results in relation to existing theories and practices. Finally, it will suggest some future research directions based on the findings and implications of the study. The aim is to provide a clear and detailed understanding of the research process, outcomes and their significance.

2. Literature review

Some review papers take a holistic view of the field of LLM, from different perspectives and time frames. An up-to-date and comprehensive review of the literature on LLMs from a technical and engineering perspective was offered ( Zhao et al. , 2023 ), which can be a useful resource for both researchers and engineers who are interested in ChatGPT or other LLM-based applications. They reviewed the background, key findings and main techniques for LLMs, such as scaling laws, emergent abilities and alignment tuning. They also covered four major aspects of LLMs, namely pretraining, adaptation tuning, utilization and capacity evaluation, and summarized the recent progress and challenges in each aspect. Moreover, they discussed the practical guide for prompt design, which is the key interface for accessing and using LLMs, and the applications of LLMs in several representative domains, such as dialog systems, information retrieval, code generation and education.

Fan et al. (2023) conducted a bibliometric review of LLMs research from 2017 to 2023. They analyzed the trends, topics, challenges and applications of LLMs in natural language processing (NLP) and related fields using various bibliometric methods and tools. They collected and visualized the data from 1,672 publications from different countries, institutions, journals and conferences. They found that LLM research has grown rapidly in the past seven years, especially in 2020 and 2021, with more than 1,000 publications in these two years alone. They also found that LLM research is mainly driven by the advances in transformer-based architectures, such as bidirectional encoder representations from transformers (BERT) and Generative Pre-trained Transformer (GPT), and their variants and extensions. LLMs are applied to a wide range of NLP tasks, such as text generation, question answering, natural language understanding, machine translation, sentiment analysis, text summarization and dialog systems. Besides, LLM research is highly collaborative and interdisciplinary, with many co-authorship networks and cross-domain applications.

Other review papers look at a subgroup of the LLM literature, and focus on the advancement of certain selected topics, like education, healthcare research, etc. For example, Yan et al. (2023) looked at how LLM was used in education. They conducted a systematic scoping review of 118 peer-reviewed papers published since 2017 to pinpoint the current state of research on using LLMs to automate and support educational tasks. They identified 53 use cases for LLMs in education, categorized into nine main categories: profiling/labeling, detection, grading, teaching support, prediction, knowledge representation, feedback, content generation and recommendation. They also discussed several practical and ethical challenges, such as low technological readiness, lack of replicability and transparency, and insufficient privacy and beneficence considerations.

Sallam (2023) is a systematic review of the role and limitations of ChatGPT. The paper identifies the potential benefits of ChatGPT, such as improving scientific writing, analyzing large datasets, assisting radiologists and providing personalized learning and medicine. The paper also discusses the possible risks and concerns of ChatGPT, such as ethical issues, factual inaccuracies, plagiarism, transparency problems, legal issues and infodemic risk. The paper emphasizes the need for careful and responsible use of ChatGPT, as well as the importance of open data and open science publishing to ensure the safety and quality of health care and research. On the other hand, Chang et al. (2023) conducted a comprehensive survey on the evaluation of LLMs that can generate and process natural language texts for various tasks and domains. The paper reviews the existing evaluation methods and metrics for LLMs from three dimensions: what to evaluate, where to evaluate and how to evaluate. The paper covers a wide range of evaluation tasks, such as natural language processing, reasoning, robustness, ethics, biases, trustworthiness, social science, natural science and engineering, medical applications, agent applications and other applications. The paper also summarizes the existing evaluation datasets and benchmarks for LLMs, and discusses their advantages and limitations.

Wei et al. (2022a) took a different perspective and discussed the phenomenon of emergent abilities of LLMs, which are abilities that are not present in smaller models but are present in larger models. The paper surveys several examples of emergent abilities from prior work, such as few-shot prompting, instruction following, multi-step reasoning and model calibration. The paper uses various sources of data, such as scaling curves, benchmarks, datasets and metrics, to identify and evaluate emergent abilities. The paper also uses a systematic definition of emergence and a focused scope of LLMs to provide a clear and consistent framework for discussing emergent abilities.

The literature reviews on LLMs have some limitations that may affect their validity and applicability. One limitation is the range of time that they cover, which is mostly limited to recent years, especially 2020 and 2021. This may exclude some earlier works that are still relevant and influential in the field of LLMs. Another limitation is the emphasis that they put on certain topics or aspects of LLMs, which may not reflect the diversity and complexity of the field. For example, some reviews may focus more on the technical or engineering aspects of LLMs, while others may focus more on the ethical or social aspects of LLMs. This may result in a partial or incomplete picture of the state-of-the-art and the challenges of LLMs.

Our paper addresses these limitations by conducting a systematic and comprehensive review of the literature on LLMs from a multidisciplinary perspective. We cover a wide range of time frames, from 1996 to 2023, and include both peer-reviewed and preprint publications from different sources and domains. We use rigorous and transparent methods to select, analyze and synthesize the records, and we provide a critical appraisal and a meta-analysis of the evidence. We also address the ethical, social and legal implications of LLMs in depth and provide clear and actionable recommendations for their responsible use and regulation.

LLMs have evolved significantly in terms of their development and architecture, thanks to the breakthrough of the transformer architecture by Vaswani et al. (2017). This architecture enabled the creation of the GPT series models, from GPT-1 to GPT-4 (OpenAI, 2023; Radford et al., 2019), which demonstrated remarkable performance in natural language processing. Among them, InstructGPT showed superior results compared to the larger GPT-3 model, despite having fewer parameters. Another notable contribution in this field is the “Constitutional AI” method used by Anthropic’s Claude (Bai et al., 2022). Moreover, to address computational cost issues, Meta’s LLaMA model presented an optimized approach for various inference budgets.

LLMs have a wide range of practical applications in different domains. In education, LLMs are considered as potential game-changers. Bonner, Lege, and Frazier (2023) highlighted their potential to revolutionize educational experiences. Chaudhry, Cukurova, and Luckin (2022) stressed the importance of transparency in Artificial Intelligence (AI) applications, especially in educational settings. Chechitelli (2023) also provided insights into the role of AI in plagiarism detection, while Condor, Litster, and Pardos (2021) explored AI’s transformative potential in automating educational assessments. In healthcare, studies such as those by Kung et al. (2023) and Liévin, Hother, and Winther (2022) illustrated the promising capabilities of ChatGPT in medical education and reasoning. Translation and speech recognition have also benefited from LLMs, with significant advancements in machine translation ( Vaswani, Zhao, Fossum, & Chiang, 2013 ) and speech recognition ( Kim et al. , 2020 ).

However, LLMs also pose some challenges and ethical issues that need to be addressed. A paramount concern in the LLM community is model safety. Research on GPT-4 has explored safety-relevant reinforcement learning with human feedback (RLHF) and rule-based reward models (RBRMs) to address these concerns. However, challenges persist, such as the “hallucination” phenomenon observed in ChatGPT, where the model can produce misleading or nonsensical answers. Additionally, the potential for LLMs like ChatGPT to reproduce biases from their training data remains a significant ethical concern.

Efficiency in LLMs is not only about computational prowess but also about the effective utilization of prompts and training strategies. The significance of prompt setting in LLMs has been highlighted in various studies, with chain-of-thought prompting emerging as a promising technique (Wei et al., 2022a, b). Mayer, Ludwig, and Brandt (2023) have also demonstrated the efficacy of prompt-based learning for domain-specific tasks. To address the storage and computational challenges of LLMs, Schwenk, Rousseau, and Attik (2012) advocated for efficient training strategies, emphasizing the potential benefits of modern multi-core computers. In this study, we employ a unique approach by leveraging a topic model to examine the primary focus areas of papers on LLMs over a considerable time span. Our analysis reveals four primary categories which capture the majority of these topics: “language and NLP”, “education and teaching”, “clinical and medical applications” and “speech and recognition techniques”. We provide a detailed summary of the core findings from representative papers within each category, underscoring that the first category analyzes LLMs from a technical perspective, while the remaining categories focus on their practical applications, suggesting potential interconnections.

3. Methodology

This paper presents the first comprehensive investigation into the prevalent themes and subjects explored in previous research related to LLMs, utilizing topic modeling techniques. The dataset encompasses records sourced from reputable academic databases such as EBSCO, Cambridge Journals, Elsevier, Emerald, IEEE, JSTOR, Nature, SAGE, Springer, Taylor & Francis and Wiley. These records comprise abstracts and citations from peer-reviewed literature across diverse domains, spanning the period from 1985 to 2023. The dataset includes a variety of scholarly works, such as journal articles, books, book chapters, conference papers and selected working papers, specifically targeting LLM-related topics. The search was conducted using two sets of keywords: “LLM” and “Large” + “Language” + “Model,” with the latter yielding a subset within the broader list obtained from the former. An observable surge in publications occurred post-2021, constituting more than 90% of the dataset. Subsequently, a thorough examination of the abstracts within the collected records was performed, leading to the exclusion of unqualified entries. Ultimately, 198 records were retained for the final analysis.

Topic modeling stands as a widely utilized method in the domains of NLP. Its primary objective involves extracting latent topics from a given corpus, thereby aiding in the comprehension of underlying themes and structural patterns within the textual data. Consequently, this technique facilitates effective organization and analysis of the information, contributing to enhanced knowledge discovery and insights.

While traditional topic modeling algorithms like latent Dirichlet allocation (LDA) ( Blei, Ng, & Jordan, 2003 ) have been widely applied in various contexts, there are several limitations that can limit the efficacy, such as issues related to handling noisy data, topic overlapping and limited interpretability. To address these limitations, BERTopic ( Grootendorst, 2022 ) has emerged as a promising alternative. BERTopic is a sophisticated topic modeling algorithm built upon the BERT architecture that was developed recently ( Devlin, Chang, Lee, & Toutanova, 2019 ). Notably, BERTopic is well-regarded for its ability to generate coherent topics and exhibit competitive performance across diverse benchmark evaluations. It surpasses classical models like LDA and non-negative matrix factorization (NMF) ( Févotte & Idier, 2011 ) as well as more recent clustering-based approaches to topic modeling, such as Correlated Topic Model (CTM) ( Bianchi, Terragni, & Hovy, 2021 ) and Top2Vec ( Angelov, 2020 ). Prior studies have compellingly showcased the effectiveness of BERTopic in identifying dimensions and prominent keywords expressed in online reviews ( Atzeni, Bacciu, Mazzei, & Prencipe, 2022 ; Raju et al. , 2022 ) and uncovering research patterns in the field of literature analysis ( Fan et al. , 2023 ), highlighting its value and utility in diverse research contexts. Therefore, the abstracts of the selected papers in this study were analyzed using BERTopic to generate a list of interpretable topics from the previous literature.

In particular, BERTopic employs a three-step approach to create topic representations. Firstly, it converts each document into an embedding representation using a pretrained language model. Next, to improve the clustering process, it reduces the dimensionality of the resulting embeddings before clustering them. Finally, a customized class-based variation of Term Frequency-Inverse Document Frequency (TF-IDF) is utilized to extract topic representations from the document clusters ( Bafna et al. , 2016 ). These three distinct steps facilitate a flexible topic model suitable for various applications, including dynamic topic modeling.
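
As a rough illustration of this three-step pipeline, the sketch below composes the components in the way the BERTopic library allows; the embedding model, parameter values and the stand-in corpus are illustrative assumptions, not the exact configuration used in this study.

```python
# Minimal sketch of the embed -> reduce -> cluster -> c-TF-IDF pipeline described
# above. Component choices and parameters are illustrative; the 20 Newsgroups
# texts stand in for the study's 198 abstracts so the example runs end to end.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:1000]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")          # step 1: document embeddings
umap_model = UMAP(n_neighbors=15, n_components=5,
                  metric="cosine", random_state=42)                # step 2: dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        prediction_data=True)                      # clustering of the reduced embeddings

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, probabilities = topic_model.fit_transform(docs)            # step 3 (class-based TF-IDF) runs internally
print(topic_model.get_topic_info())                                # one row per discovered topic
```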

Document embeddings

Document clustering

Upon acquiring the embeddings for the abstracts of the obtained documents, the application of clustering algorithms assumes a pivotal role, aiming to categorize the documents into coherent clusters, each encapsulating a distinct thematic focus. However, a challenge arises with high-dimensional data: as dimensionality increases, the distance to a point’s nearest neighbor approaches the distance to its farthest neighbor. As a result, the conventional notion of spatial proximity becomes unreliable in high-dimensional spaces, and different distance metrics tend to converge. To mitigate this concern, various clustering methodologies have been devised to counteract the repercussions of the “curse of dimensionality.” Nevertheless, a more direct strategy involves the reduction of embedding dimensionality.

Although principal component analysis (PCA) ( Jolliffe & Cadima, 2016 ) and t-distributed stochastic neighbor embedding (t-SNE) ( Maaten & Hinton, 2008 ) stand as widely acknowledged techniques for dimensionality reduction, an alternative approach termed uniform manifold approximation and projection (UMAP) ( McInnes, Healy, & Melville, 2020 ) has surfaced, demonstrating superior retention of both local and global characteristics intrinsic to high-dimensional data during the projection onto lower dimensions. Notably, UMAP presents the advantage of versatility across varied dimensional spaces of language models, unburdened by computational limitations imposed on embedding dimensions. Consequently, we opt to leverage the capabilities of UMAP to effectuate the reduction in dimensionality of the document embeddings engendered within this stage.
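
To make the dimensionality-reduction step concrete, here is a small sketch comparing PCA with UMAP on a stand-in embedding matrix; the random data and parameter values are illustrative assumptions, not the embeddings or settings used in this study.

```python
# Illustrative comparison of two dimensionality-reduction options before clustering.
# The embedding matrix is random stand-in data (e.g. 198 abstracts x 384 dimensions).
import numpy as np
from sklearn.decomposition import PCA
from umap import UMAP

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(198, 384))

pca_5d = PCA(n_components=5).fit_transform(embeddings)             # linear projection
umap_5d = UMAP(n_neighbors=15, n_components=5, metric="cosine",
               random_state=0).fit_transform(embeddings)           # better preserves local and global structure

print(pca_5d.shape, umap_5d.shape)   # both (198, 5); clustering then runs on the reduced vectors
```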

Topic Representation

The final stage pertains to the derivation of topic depictions of the abstracts of the obtained documents. For every cluster formulated in the preceding phase, BERTopic calculates the Class-based TF-IDF (c-TF-IDF) scores for each term within the cluster. Essentially, terms characterized by the highest c-TF-IDF scores are considered the most representative of their respective topics. Here, we used the built-in reduce_frequent_words function of BERTopic to reduce frequent words in the c-TF-IDF representation. Essentially, this approach entails a reduction in the weightage accorded to words that appear very frequently across the corpus of documents.

For a word $x$ in class (topic cluster) $c$, the class-based TF-IDF weight is

$$W_{x,c} = \mathrm{tf}_{x,c} \cdot \log\left(1 + \frac{A}{f_x}\right)$$

where $\mathrm{tf}_{x,c}$ is the frequency of word $x$ in class $c$, $f_x$ is the frequency of word $x$ across all classes and $A$ is the average number of words per class.
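
To connect these definitions to the score assigned to each word, here is a small, self-contained reconstruction of the class-based TF-IDF computation on a toy count matrix; the library’s own implementation (exposed through its ClassTfidfTransformer component, including the reduce_frequent_words option the authors mention) may normalize the counts somewhat differently.

```python
import numpy as np

def c_tf_idf(term_counts_per_class: np.ndarray) -> np.ndarray:
    """Class-based TF-IDF as sketched above.

    term_counts_per_class has shape (n_classes, n_words): entry (c, x) is the
    frequency of word x in class (topic cluster) c.
    """
    tf = term_counts_per_class                         # tf_{x,c}
    f = term_counts_per_class.sum(axis=0)              # f_x: frequency of word x across all classes
    A = term_counts_per_class.sum(axis=1).mean()       # A: average number of words per class
    return tf * np.log(1 + A / f)                      # W_{x,c} = tf_{x,c} * log(1 + A / f_x)

# Toy example: 2 topic clusters, 3 vocabulary words.
counts = np.array([[10, 0, 5],
                   [ 1, 8, 5]])
print(c_tf_idf(counts))   # higher score = more representative of that cluster
```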

4.1 Determining the number of topics

Following the implementation and training of the BERTopic model on the dataset, we have effectively derived four distinct topics. As depicted in Figure 1 , the visualization illustrates that BERTopic efficiently extracts a set of lexemes representative of each topic. These lexical items, commonly referred to as “topic words”, assume a vital role in encapsulating the principal themes or concepts inherent within the corpus pertaining to a specific topic. BERTopic employs a scoring mechanism to assess these topic words, a metric that conveys the relative significance of individual words with respect to a given topic. The computation of these scores is underpinned by sophisticated algorithms that consider the contextual usage and distribution patterns of words within the corpus. Words that exhibit close proximity and contribute substantially to the semantic essence of a particular topic tend to garner higher scores, signifying their relevance. Conversely, words deemed less germane or making only marginal contributions to the semantic delineation of the topic receive lower scores.

An intertopic distance map, depicted in Figure 2, delineates the interconnections among topics within a two-dimensional plane, based on their semantic resemblance: the separation between topics represents their semantic similarity. Topics situated in proximity exhibit higher similarity in the words and context they encompass, whereas topics situated at a greater distance are more dissimilar. The resultant map provides insight into the proximate associations and hierarchical structure of topics, deepening the comprehension of the topic arrangement intrinsic to the dataset.
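
Both kinds of figure described here correspond to visualizations BERTopic can produce directly. A brief sketch, assuming topic_model is the fitted model from the earlier pipeline sketch and that the output file names are arbitrary:

```python
# Generate the two kinds of figure discussed above from a fitted BERTopic model.
fig_words = topic_model.visualize_barchart(top_n_topics=4)    # per-topic word-score bar charts (cf. Figure 1)
fig_map = topic_model.visualize_topics()                       # two-dimensional intertopic distance map (cf. Figure 2)

fig_words.write_html("topic_words.html")                       # both are Plotly figures, so they can be saved as HTML
fig_map.write_html("intertopic_distance_map.html")
```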

Upon conducting a comprehensive evaluation incorporating the above, we deduced four distinct clusters of topics, which are tabulated in Table 1 .

5. Discussion

5.1 Language and natural language processing (NLP)

Enhance the performance of LLMs on NLP tasks: One of the leading research aspects of LLMs is to improve their performance on NLP tasks by increasing their scale. We can clearly see this trend by observing the evolutionary path of the GPT series models. These models, such as GPT-1, GPT-2, GPT-3, InstructGPT and GPT-4 (Brown et al., 2020; OpenAI, 2023; Ouyang et al., 2022; Radford, Narasimhan, Salimans, & Sutskever, 2018; Radford et al., 2019), are based on the transformer architecture (Vaswani et al., 2017), one of the best-known architectures for developing LLMs, which is notable for its attention mechanisms and is simpler than recurrent neural networks (RNNs) (Schuster & Paliwal, 1997). The GPT-1 model (Radford et al., 2018) was proposed to build a generative pre-training language model for various natural language tasks such as question answering and document classification. It has 117 million parameters. The motivation of this work is to build the model with a diverse corpus of unlabeled text, as there is more unlabeled text than labeled text in the real world. A relatively small set of labeled text would then be used to fine-tune the generative pretraining model for specific tasks. Although the GPT-1 model performs well on these tasks, it still relies on a labeled dataset for each specific task to fine-tune the model. The GPT-2 model (Radford et al., 2019) was proposed to further reduce the need for labeled datasets. It is a 1.5-billion-parameter transformer and uses a zero-shot setting, demonstrating that LLMs can learn various natural language tasks without explicit supervision. Compared with the GPT-1 and GPT-2 models, the GPT-3 model (Brown et al., 2020) has 175 billion parameters, which is much larger and further improves the performance of LLMs. The GPT-4 model (OpenAI, 2023), a large-scale and multimodal model, has 1.8 trillion parameters, over 10 times more than GPT-3. It performs better than the GPT-3 model on a diverse set of benchmarks, such as exams designed for humans.

Align LLMs’ output with human instructions: Although increasing the scale can make LLMs perform well on NLP tasks, it still cannot ensure that the generated outputs align with humans’ instructions, resulting in unsatisfying user interactions and incorrect information. To mitigate this problem, InstructGPT (Ouyang et al., 2022) was proposed, which uses supervised learning to fine-tune the GPT-3 model on human-written prompts and demonstrations. Reinforcement learning is then used to reward the LLM when its outputs align with human preferences. In fact, a 1.3B-parameter InstructGPT model is rated higher than the 175B GPT-3 model in human evaluations. These results further demonstrate that an LLM cannot rely only on its large scale or complex structure.

Safe and harmless LLMs: On the other hand, the safety of LLMs also attracts researchers’ attention. Although ChatGPT and InstructGPT align impressively with users’ prompts, no one can guarantee the results are not harmful to humans. For example, given unsafe inputs, the model may provide instructions for committing crimes. To mitigate this problem, the GPT-4 model uses two components: safety-relevant RLHF and RBRMs. Meanwhile, Claude (Bai et al., 2022), the LLM developed by Anthropic to compete with ChatGPT, uses a “constitutional AI” method to improve the safety of LLMs’ outputs. Compared with ChatGPT, Claude uses a combination of human feedback and AI feedback in the reinforcement learning stage, while ChatGPT only uses human feedback. In fact, the AI feedback for Claude comes from a harmless AI assistant. Using an AI assistant to supervise AI is a potential direction for improving the safety of LLMs. It can also improve efficiency and reduce cost, as collecting human feedback is difficult. However, there are also concerns. For example, would LLMs’ effectiveness be significantly affected if we demand too much safety? Moreover, is it reliable to use AI to supervise AI? Deeper research is still needed to find a good trade-off between the safety and accuracy of LLMs.

LLMs’ development with a limited cost budget: Cost is another big issue to consider in developing LLMs. If we could ignore cost, we could quickly increase the scale of the model and dataset; we could even hire more people to label the datasets. However, the budget cannot be ignored in the real world. Can we still build a well-performing LLM with a smaller dataset or model? Meta’s LLaMA model was developed to achieve the best performance across various inference budgets. It only uses publicly available datasets, while ChatGPT also uses non-public datasets. LLaMA-13B can outperform GPT-3 (175B) on some natural language benchmarks. This demonstrates that developing a good LLM with a much smaller model scale and solely public datasets is possible. As the model is released to the public, it will also accelerate the development of LLMs. The generative AI community can work on solving bias and safety issues with these open-source models and datasets.

Prompt setting of LLMs: The prompt setting of LLMs should not be ignored. On the one hand, prompt settings can reduce the need for a large, labeled dataset to train the LLMs. For example, Brown et al. (2020) compared zero-shot, one-shot and few-shot settings with the GPT-3 model and state-of-the-art fine-tuned models. They demonstrated that the GPT-3 model in a few-shot setting can also perform well on NLP tasks. On the other hand, a suitable prompt setting can also improve the ability of LLMs in complex reasoning. A standard prompt setting, with simple input-output pairs, is insufficient to teach LLMs to perform well on reasoning tasks. Chain-of-thought prompting (Wei et al., 2022b), which shows LLMs a series of intermediate reasoning steps, was proposed to improve the reasoning ability of LLMs. Moreover, prompt-based learning can also simplify people’s use of AI. For example, prompt-based learning approaches with LLMs, without updating the weights of the LLMs, can easily be used for domain-specific tasks, such as classifying email responses to a problem-solving task (Mayer et al., 2023).
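
As a concrete illustration of these settings, the snippet below builds a standard few-shot prompt and a chain-of-thought prompt for the same question; the example questions and reasoning text are invented, and the final call to an LLM is left as a placeholder rather than a specific API.

```python
# Contrast a plain few-shot prompt (input-output pairs only) with a
# chain-of-thought prompt (the exemplar shows intermediate reasoning steps).
question = "A library has 3 shelves with 12 books each. It lends out 9 books. How many remain?"

few_shot_prompt = (
    "Q: A farmer has 4 pens with 5 sheep in each pen. How many sheep in total?\n"
    "A: 20\n\n"
    f"Q: {question}\n"
    "A:"
)

chain_of_thought_prompt = (
    "Q: A farmer has 4 pens with 5 sheep in each pen. How many sheep in total?\n"
    "A: Each pen holds 5 sheep and there are 4 pens, so 4 * 5 = 20. The answer is 20.\n\n"
    f"Q: {question}\n"
    "A:"
)

print(few_shot_prompt)
print(chain_of_thought_prompt)
# answer = some_llm.generate(chain_of_thought_prompt)   # placeholder: whichever LLM API you use
```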

5.2 Large language model in research and education

The rise of generative artificial intelligence has bestowed upon the academic community unprecedented possibilities for advancing education and research. LLMs, such as ChatGPT, have demonstrated exceptional capabilities in understanding and generating human language, leading to their widespread adoption in various domains, including education and research. The current body of literature on LLMs delves into the transformative impact of LLMs in these domains, with a focus on the enriching experiences for students, novel teaching methodologies, empowered research endeavors and interactive learning enabled by LLMs ( Liu et al. , 2023 ).

Enhance student learning experiences : Through their natural language processing and generation abilities, LLMs can process vast amounts of educational content and cater personalized information to individual learners ( Sallam, 2023 ). By tailoring educational materials to suit each student’s proficiency and learning pace, LLMs foster a more engaging and interactive learning environment. Additionally, LLMs can facilitate interactive learning sessions wherein students can ask questions and receive real-time responses from the AI, promoting self-directed learning and knowledge acquisition.

One of the fascinating applications of LLMs in education is their integration with AI-powered entities like ChatGPT. This interaction allows students to engage in dialog with the AI, effectively transforming the traditional student-teacher dynamic. As ChatGPT operates as an intelligent conversation partner, students can seek clarifications, discuss concepts and receive guidance outside the classroom setting. This continual interaction not only fosters a deeper understanding of academic topics but also nurtures critical thinking and analytical skills ( Kasneci et al. , 2023 ).

Provide innovative methodologies in teaching and assessments: From the educator’s perspective, LLMs have ushered in innovative methodologies that challenge traditional instructional paradigms. Leveraging LLMs, educators can develop dynamic lesson plans with interactive elements, thereby captivating students’ attention and fostering active participation ( MacNeil et al. , 2022 ). Moreover, LLMs can act as virtual teaching assistants, providing instant feedback on assignments and performance assessments ( Leinonen et al. , 2023 ). This enables educators to identify students’ strengths and weaknesses more efficiently, leading to personalized interventions that enhance learning outcomes.

Improve workflow and efficiency in research: In the domain of research, LLMs have revolutionized data analysis, literature review processes and hypothesis generation. LLMs can analyze extensive volumes of academic literature, expediting the research process and aiding researchers in identifying relevant sources. Furthermore, researchers can employ LLMs to generate insightful summaries of research findings, condensing complex information into easily digestible formats. The use of LLMs in research not only saves time and effort but also empowers scholars to explore new avenues for knowledge discovery.

While the applications of LLMs in education and research are promising, the integration of AI in academic spaces warrants careful ethical deliberation. Privacy concerns, data security and potential biases in generated content are challenges that require meticulous attention ( Teubner, Flath, Weinhardt, van der Aalst, & Hinz, 2023 ). Educators and researchers must strike a balance between utilizing LLMs for enhancing academic practices and safeguarding the welfare of students and researchers. Rigorous oversight and ethical guidelines are imperative to mitigate these challenges and ensure the responsible use of LLMs in academia. Furthermore, it is vital to train LLMs on diverse and inclusive datasets to minimize bias and support fair and equitable learning experiences for all students ( Kasneci et al. , 2023 ).

The applications of LLMs in education and research present a compelling landscape of possibilities for the academic community. From enriching student learning experiences and improving educational practices to empowering research endeavors and enabling dynamic interactions with AI entities like ChatGPT, LLMs have the potential to transform the academic sphere. Nonetheless, ethical considerations must underpin these applications, ensuring that AI augmentation in education and research aligns with the best interests of students, researchers and the broader academic community. As LLM technology continues to evolve, its integration into academia will undoubtedly reshape the future of teaching, learning and knowledge generation.

5.3 Large language model in medical and public health

The advent of LLMs has brought forth ground-breaking opportunities for the medical and public health sectors. LLMs, with their natural language processing and generation abilities, possess the capacity to process vast amounts of clinical and health-related data, offering valuable insights and augmenting human decision-making ( Ufuk, 2023 ). From the existing literature, we can observe the manifold applications of LLMs in clinical settings, patient care, medical education and research, and public health initiatives, underscoring their potential to revolutionize the healthcare landscape ( Kung et al. , 2023 ).

Improve clinical diagnoses and patient treatment: In the realm of medical and clinical practices, LLMs have the potential to reshape the way clinical professionals diagnose and treat patients. By analyzing extensive medical literature, electronic health records and research papers, LLMs can assist healthcare providers in making more informed decisions and formulating personalized treatment plans ( Liévin et al. , 2022 ). Furthermore, LLMs can analyze patient data and suggest potential interventions, aiding in early detection and management of medical conditions. The integration of LLMs into electronic health record systems holds the promise of streamlining clinical workflows, reducing administrative burden and optimizing healthcare delivery. In the area of patient care, LLMs have the capacity to facilitate improved patient interactions and health literacy. Through their natural language understanding capabilities, LLMs can effectively engage with patients, answering their medical queries and providing essential health information. Moreover, LLM-powered virtual assistants can offer round-the-clock support, empowering patients to take a proactive role in managing their health. The integration of LLMs into telemedicine platforms enables remote patient monitoring and enhances access to healthcare services, particularly in underserved regions ( Sallam, 2023 ).

Enhance medical research and education: In the field of medical research and education, LLMs present invaluable tools for data analysis, literature review and hypothesis generation. By processing extensive medical databases and scientific literature, LLMs can identify patterns and associations that might have otherwise been overlooked. This capacity to explore vast volumes of data expeditiously opens up new avenues for medical research and accelerates the pace of scientific discovery. LLMs can also assist researchers in formulating research questions, facilitating experimental design and interpreting results, thus bolstering the quality and efficiency of medical research endeavors. On the other hand, LLMs have also shown immense potential in transforming medical education. These advanced AI models, like ChatGPT, can act as invaluable virtual teaching assistants, providing students with detailed and up-to-date information on various medical topics ( Lee, 2023 ). LLMs offer personalized learning pathways, adapting their responses to suit individual students’ needs and learning patterns. By incorporating LLMs into medical education, students can access a wealth of knowledge, receive interactive feedback and engage in dynamic learning experiences, ultimately fostering a more effective and comprehensive understanding of biomedical sciences.

Assist decision-making in public health management: LLMs have emerged as potent instruments for data-driven decision-making in public healthcare management. By analyzing and synthesizing diverse data sources, LLMs can offer comprehensive insights into disease trends, treatment efficacy and patient outcomes. These data-driven analyses aid in formulating evidence-based medical protocols and clinical guidelines, ultimately improving the overall quality of healthcare delivery ( Arora & Arora, 2023 ). Moreover, LLMs can support public health authorities in tracking and managing infectious disease outbreaks, enabling proactive interventions to curb their spread. The integration of LLMs in public health initiatives holds the potential to revolutionize data surveillance, epidemiological modeling and health policy formulation. LLMs can process vast amounts of public health data, including health surveys, population health records and disease surveillance reports, allowing for real-time tracking of health indicators and early identification of health risks. Through predictive modeling, LLMs can assist in forecasting disease outbreaks and resource allocation, helping public health authorities prepare timely and effective responses.

While LLMs hold great promise in the fields of medical and public health domains, there remain several challenges that require thorough and thoughtful consideration. Data privacy and security concerns are paramount, as the use of LLMs entails handling sensitive patient information. Furthermore, potential biases in the generated content must be acknowledged and addressed to ensure equitable healthcare practices ( Li et al. , 2023 ; Shen et al. , 2023 ). Ethical guidelines must be established to govern the use of LLMs in medical decision-making, research and public health initiatives, upholding the principles of patient autonomy, beneficence and non-maleficence. Transparent and explainable AI models are essential to foster trust among healthcare professionals, patients and the public. Additionally, the potential displacement of certain healthcare tasks by LLMs should be carefully managed, striking a balance between human expertise and AI assistance.

5.4 Speech and recognition techniques

LLM-based end-to-end speech recognition models: The emergence of LLMs like ChatGPT also benefits the speech recognition field, where the task is to map audio inputs to text outputs. Traditional speech recognition systems usually combine various components, such as a language model, an acoustic model, a pronunciation model, etc. This structure requires substantial computation and storage and restricts the on-device deployment of speech recognition systems. Research on end-to-end speech recognition models based on LLMs is needed to mitigate this issue. For example, end-to-end speech recognition models were reviewed and compared on recognition accuracy, latency, model size and computational cost (Kim et al., 2020).

Improve LLMs’ decoding speed in speech recognition: The decoding speed of the LLMs is often slow in speech recognition, which affects the use of speech recognition in real-time applications, such as voice assistants, transcription services, communication tools, etc. Taking advantage of the Central Processing Unit (CPU) and Graphics Processing Unit (GPU) is a method to accelerate the decoding speed. By using optimized deep learning frameworks that fully leverage hardware acceleration, it is possible to speed up the computations with minimal loss in accuracy. For example, an implementation using multicore CPUs and GPUs was proposed to reduce the time of speech recognition ( Kim & Lane, 2014 ).
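
As one concrete way to exploit that hardware acceleration, the sketch below runs an off-the-shelf end-to-end speech recognition model in half precision on a GPU when one is available; the checkpoint name (openai/whisper-small), audio file path and settings are illustrative assumptions, not the systems evaluated in the cited work.

```python
# Half-precision, GPU-accelerated decoding with an off-the-shelf speech
# recognition pipeline; checkpoint, file path and settings are illustrative.
import torch
from transformers import pipeline

use_gpu = torch.cuda.is_available()
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=0 if use_gpu else -1,                              # run on the first GPU when present
    torch_dtype=torch.float16 if use_gpu else torch.float32,  # fp16 roughly halves memory and speeds up decoding
)

result = asr("example_audio.wav")   # path to a local audio file (placeholder)
print(result["text"])
```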

Multi-modal LLMs in speech recognition: Another technical path to developing LLMs for speech recognition is the fusion of a text-based and a speech-based LLM. Speech-based LLMs are usually good at preserving the speaker’s identity and intonation, while text-based LLMs are better at learning linguistic knowledge. Combining both types of LLMs allows the system to leverage their respective strengths, leading to a more comprehensive understanding of the input. Recently, a combination of AudioPaLM (a speech-based language model) and PaLM-2 (a text-based language model) was proposed to improve speech tasks (Rubenstein et al., 2023).

System combination for translation: Machine translation (Brants, Popat, Xu, Och, & Dean, 2007) describes the problem of translating a source-language (e.g. Chinese) sentence or audio to a target-language (e.g. English) sentence or audio. The development of LLMs is revolutionizing the field of translation, as existing LLMs perform quite well on various NLP tasks, and machine translation is one of the essential natural language tasks. Typically, the decoder component of an LLM helps to generate translated outputs automatically, and the LLM is combined with other techniques to build the machine translation system. For example, a combination of a neural probabilistic language model and noise-contrastive estimation was used for a machine translation system (Vaswani et al., 2013). The results demonstrate that such systems can generate good-quality translations without repeated summations over the whole vocabulary. Moreover, the system combination method is often used to build machine translation systems that improve the confidence of machine translation results and mitigate the ambiguity of natural language. It combines outputs from various translation systems: for example, if most systems contain a set of words, then the combined output will contain them with high probability. However, system combination can also be challenging, as the order of the words translated by each system can differ. To improve the efficiency of system combination in translation, a confusion network generation method (Karakos & Khudanpur, 2008) is widely used in this field.

Prompting strategies of LLMs in translation: Prompting strategies have recently been explored for machine translation (Zhang, Haddow, & Birch, 2023). The selection of prompt examples can affect an LLM’s performance in machine translation, just as the prompting strategy affects the performance of an LLM in general. Therefore, a good prompting strategy is essential for this field. The language of the prompt examples is also important. For example, an English template or prompt may work better for machine translation than templates in other languages. One reason could be that the LLM is pretrained mainly on English datasets. As a result, the translation performance of LLMs may not be robust or stable when translating, say, between German and Chinese. So there is still a gap for LLMs in non-English-centric machine translation tasks.

LLMs in translation with limited storage: Although the performance of LLMs on machine translation is good, their storage requirements are also huge. Therefore, how we can efficiently train the LLMs for machine translation could be a key question for this field. Recently, there have been some works proposed to answer this question. For example, an efficient implementation of the continuous space language model ( Schwenk et al. , 2012 ) was developed to accelerate the training of the language model by taking advantage of modern multicore computers. The data selection technique is also proven helpful in resampling training data in large corpora. LLMs are usually enormous because of the different combinations of source and target phrases. To keep the usage of LLMs in translation under a limited computing capacity, it is meaningful to do research on reducing the scale of model without sacrificing the translation quality. One possible method is to use suitable pruning techniques for the LLMs. For example, divergence-based fine pruning ( Kim, Park, Shin, Kwon, & Kim, 2017 ) was used to reduce the model size of LLM for translation tasks.

6. Future research

Although LLMs such as ChatGPT show remarkable results when responding to human prompts, they can still generate wrong responses, especially for cases they have never seen in their training datasets. The accuracy of their responses must still improve before humans can use them with confidence. Therefore, researchers need to focus on methodologies for enhancing the performance of LLMs.

On the other hand, the results generated by LLMs may contain biases, as the vast datasets used to train them are themselves biased. Unbiased LLMs should therefore be developed to produce fair results for humans. There are existing methods for building unbiased machine learning models (Breeden & Leonova, 2021). Researchers need to consider how to combine or adapt these methods for LLMs to make sure they can generate accurate and unbiased results.

Moreover, deep learning models are usually considered black boxes because of their complex structures and lack of transparency and explainability. LLMs are more complex as they usually need a larger model and dataset size. Therefore, it becomes harder to understand why they are generating the responses they are. When LLMs are used for making high-stake decisions in finance, law, healthcare, etc., we need to understand better their results to increase our trust in them. Researchers in the explainable artificial intelligence (XAI) field are working very hard to solve the above issues, and there are existing XAI methods in the literature ( Linardatos, Papastefanopoulos, & Kotsiantis, 2021 ). Therefore, combining XAI and LLMs is another significant research direction to improve the transparency of LLMs.

Finally, as LLMs gain widespread adoption across diverse domains, concerns related to ethics and privacy become critical considerations. Researchers must collaborate closely with policymakers and legal experts to develop robust frameworks that govern the ethical usage and handling of data by these models. The establishment of clear guidelines and standards ensures that LLMs are deployed responsibly, preserving users’ rights and minimizing potential risks. By adhering to these principles and incorporating privacy-preserving techniques, LLMs can engender trust and confidence among users, paving the way for their responsible and beneficial integration in education, research and other domains.

To build an efficient and safe environment for LLMs, efforts should come from researchers in diverse domains. Computer scientists can help enhance LLMs’ performance and implement efficient pipelines. Mathematicians and physicists can help to uncover the mechanisms behind LLMs. Educators or healthcare professionals can provide domain knowledge when building domain-specific LLMs in their domains. Policymakers and legal experts can design the legal framework to address ethical concerns of using LLMs.

Figure 1. The illustration of the topic words for the four topics derived by BERTopic

Figure 2. Intertopic distance map for the four topics derived by BERTopic

Table 1. Summary of the clusters of topics

Topic | Name | Representation
0 | 0_language_llm_model_natural | [“language”, “llm”, “model”, “natural”, “ai”, “data”, “intelligence”, “task”, “text”, “generative”]
1 | 1_student_chatgpt_education_que | [“student”, “chatgpt”, “education”, “que”, “research”, “teaching”, “la”, “generative”, “ai”, “academic”]
2 | 2_clinical_medical_health_public | [“clinical”, “medical”, “health”, “public”, “patient”, “healthcare”, “medicine”, “chatgpt”, “data”, “potential”]
3 | 3_recognition_translation_pruning_speech | [“recognition”, “translation”, “pruning”, “speech”, “model”, “ocr”, “method”, “language”, “decoder”, “divergence”]

Source(s): Table by authors

Angelov, D. (2020). Top2Vec: Distributed representations of topics. doi: 10.48550/arXiv.2008.09470.

Arora, A., & Arora, A. (2023). The promise of large language models in health care. The Lancet, 401(10377), 641.

Atzeni, D., Bacciu, D., Mazzei, D., & Prencipe, G. (2022). A systematic review of wi-fi and machine learning integration with topic modeling techniques. Sensors (Basel, Switzerland), 22(13), 4925. doi: 10.3390/s22134925.

Bafna, P., Pramod, D., & Vaidya, A. (2016, March). Document clustering: TF-IDF approach. In 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) (pp. 61–66). IEEE.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … McKinnon, C. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

Bianchi, F., Terragni, S., & Hovy, D. (2021). Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. In Proceedings of ACL-IJCNLP 2021.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Bonner, E., Lege, R., & Frazier, E. (2023). Large language model-based artificial intelligence in the language classroom: Practical ideas for teaching. Teaching English with Technology, 23(1), 23–41.

Brants , T. , Popat , A. C. , Xu , P. , Och , F. J. , & Dean , J. ( 2007 ). Large language models in machine translation .

Breeden , J. L. , & Leonova , E. ( 2021 ). Creating unbiased machine learning models by design . Journal of Risk and Financial Management , 14 ( 11 ), 565 . Available from: https://www.mdpi.com/1911-8074/14/11/565

Brown , T. , Mann , B. , Ryder , N. , Subbiah , M. , Kaplan , J. D. , Dhariwal , P. , … Askell , A. ( 2020 ). Language models are few-shot learners . Advances in Neural Information Processing Systems , 33 , 1877 – 1901 .

Chang , Y. , Wang , X. , Wang , J. , Wu , Y. , Zhu , K. , Chen , H. , … Wang , Y. ( 2023 ). A survey on evaluation of large language models . arXiv preprint arXiv:2307.03109 .

Chaudhry , M. A. , Cukurova , M. , & Luckin , R. ( 2022 ). A transparency index framework for AI in education . In International Conference on Artificial Intelligence in Education .

Chechitelli , A. ( 2023 ). AI writing detection update from Turnitin’s chief product officer . Turnitin Blog .

Condor , A. , Litster , M. , & Pardos , Z. ( 2021 ). Automatic short answer grading with SBERT on out-of-sample questions . In Proceedings of the 14th International Conference on Educational Data Mining (EDM2021) (pp. 345 - 352 ).

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019.

Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD-96 (pp. 226–231).

Fan , L. , Li , L. , Ma , Z. , Lee , S. , Yu , H. , & Hemphill , L. ( 2023 ). A bibliometric review of large language models research from 2017 to 2023 . arXiv preprint arXiv:2304.02020 .

Févotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9), 2421–2456. doi: 10.1162/NECO_a_00168.

Grootendorst , M. ( 2022 ). BERTopic: Neural topic modeling with a class-based TF-IDF procedure . doi:  10.48550/arXiv.2203.05794 .

Hsu , Y.-C. , & Ching , Y.-H. ( 2023 ). Generative artificial intelligence in education, Part One: The dynamic frontier . TechTrends , 67 , 603 – 607 .

Jolliffe , I. T. , & Cadima , J. ( 2016 ). Principal component analysis: A review and recent developments . Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , 374 ( 2065 ), 20150202 .

Karakos , D. , & Khudanpur , S. ( 2008 ). Sequential system combination for machine translation of speech . In 2008 IEEE Spoken Language Technology Workshop .

Kasneci , E. , Seßler , K. , Küchemann , S. , Bannert , M. , Dementieva , D. , Fischer , F. , … Hüllermeier , E. ( 2023 ). ChatGPT for good? On opportunities and challenges of large language models for education . Learning and Individual Differences , 103 , 102274 .

Kim , J. , & Lane , I. ( 2014 ). Accelerating large vocabulary continuous speech recognition on heterogeneous cpu-gpu platforms . In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3291 - 3295 ). IEEE .

Kim , K. , Park , E.-J. , Shin , J.-H. , Kwon , O.-W. , & Kim , Y.-K. ( 2017 ). Divergence-based fine pruning of phrase-based statistical translation model . Computer Speech and Language , 41 , 146 – 160 .

Kim , C. , Gowda , D. , Lee , D. , Kim , J. , Kumar , A. , Kim , S. , … Han , C. ( 2020 ). A review of on-device fully neural end-to-end automatic speech recognition algorithms . In 2020 54th Asilomar Conference on Signals, Systems, and Computers .

Kung , T. H. , Cheatham , M. , Medenilla , A. , Sillos , C. , De Leon , L. , Elepaño , C. , … Maningo , J. ( 2023 ). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models . PLoS Digital Health , 2 ( 2 ), e0000198 .

Lee , H. ( 2023 ). The rise of ChatGPT: Exploring its potential in medical education . Anatomical Sciences Education .

Leinonen , J. , Denny , P. , MacNeil , S. , Sarsa , S. , Bernstein , S. , Kim , J. , … Hellas , A. ( 2023 ). Comparing code explanations created by students and large language models . arXiv preprint arXiv:2304.03938 .

Li , H. , Moon , J. T. , Purkayastha , S. , Celi , L. A. , Trivedi , H. , & Gichoya , J. W. ( 2023 ). Ethics of large language models in medicine and medical research . The Lancet Digital Health , 5 ( 6 ), e333 – e335 .

Liévin, V., Hother, C. E., & Winther, O. (2022). Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143.

Linardatos , P. , Papastefanopoulos , V. , & Kotsiantis , S. ( 2021 ). Explainable AI: A review of machine learning interpretability methods . Entropy , 23 ( 1 ), 18 . Available from: https://www.mdpi.com/1099-4300/23/1/18

Liu , Y. , Han , T. , Ma , S. , Zhang , J. , Yang , Y. , Tian , J. , … Liu , Z. ( 2023 ). Summary of chatgpt/gpt-4 research and perspective towards the future of large language models . arXiv preprint arXiv:2304.01852 .

Maaten , L. V. D. , & Hinton , G. ( 2008 ). Visualizing Data using t-SNE . Journal of Machine Learning Research , 9 ( 86 ), 2579 – 2605 . Available from: http://jmlr.org/papers/v9/vandermaaten08a.html

MacNeil , S. , Tran , A. , Leinonen , J. , Denny , P. , Kim , J. , Hellas , A. , … Sarsa , S. ( 2022 ). Automatically generating CS learning materials with Large Language Models . arXiv preprint arXiv:2212.05113 .

Mayer , C. W. , Ludwig , S. , & Brandt , S. ( 2023 ). Prompt text classifications with transformer models! An exemplary introduction to prompt-based learning with large language models . Journal of Research on Technology in Education , 55 ( 1 ), 125 – 141 .

McInnes , L. , Healy , J. , & Astels , S. ( 2017 ). hdbscan: Hierarchical density based clustering . Journal of Open Source Software , 2 ( 11 ), 205 .

McInnes , L. , Healy , J. , & Melville , J. ( 2020 ). Umap: Uniform manifold approximation and projection for dimension reduction . doi: 10.48550/arXiv.1802.03426 .

OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Ouyang , L. , Wu , J. , Jiang , X. , Almeida , D. , Wainwright , C. , Mishkin , P. , … Ray , A. ( 2022 ). Training language models to follow instructions with human feedback . Advances in Neural Information Processing Systems , 35 , 27730 – 27744 .

Radford , A. , Narasimhan , K. , Salimans , T. , & Sutskever , I. ( 2018 ). Improving language understanding by generative pre-training .

Radford , A. , Wu , J. , Child , R. , Luan , D. , Amodei , D. , & Sutskever , I. ( 2019 ). Language models are unsupervised multitask learners . OpenAI Blog , 1 ( 8 ), 9 .

Raju Sangaraju , V. , Bolla , B. K. , Nayak , D. K. , & Kh , J. ( 2022 ). Topic modelling on consumer financial protection bureau data: An approach using BERT based embeddings . arXiv e-prints, arXiv-2205 .

Rubenstein , P. K. , Asawaroengchai , C. , Nguyen , D. D. , Bapna , A. , Borsos , Z. , Quitry , F. D. C. , … Muckenhirn , H. ( 2023 ). AudioPaLM: A large language model that can speak and listen . arXiv preprint arXiv:2306.12925 .

Sallam, M. (2023). The utility of ChatGPT as an example of large language models in healthcare education, research and practice: Systematic review on the future perspectives and potential limitations. medRxiv, 2023.02.19.23286155.

Schuster , M. , & Paliwal , K. K. ( 1997 ). Bidirectional recurrent neural networks . IEEE Transactions on Signal Processing , 45 ( 11 ), 2673 – 2681 .

Schwenk , H. , Rousseau , A. , & Attik , M. ( 2012 ). Large, pruned or continuous space language models on a gpu for statistical machine translation . In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT .

Shen, Y., Heacock, L., Elias, J., Hentel, K. D., Reig, B., Shih, G., & Moy, L. (2023). ChatGPT and other large language models are double-edged swords. Radiology, 307, e230163.

Teubner , T. , Flath , C. M. , Weinhardt , C. , van der Aalst , W. , & Hinz , O. ( 2023 ). Welcome to the era of chatgpt et al. the prospects of large language models . Business and Information Systems Engineering , 65 ( 2 ), 95 – 101 .

Ufuk , F. ( 2023 ). The role and limitations of large language models such as ChatGPT in clinical settings and medical journalism . Radiology , 307 ( 3 ), e230276 .

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Vaswani , A. , Zhao , Y. , Fossum , V. , & Chiang , D. ( 2013 ). Decoding with large-scale neural language models improves translation . In Proceedings of the 2013 conference on empirical methods in natural language processing .

Wei , J. , Tay , Y. , Bommasani , R. , Raffel , C. , Zoph , B. , Borgeaud , S. , … Metzler , D. ( 2022a ). Emergent abilities of large language models . arXiv preprint arXiv:2206.07682 .

Wei , J. , Wang , X. , Schuurmans , D. , Bosma , M. , Xia , F. , Chi , E. , … Zhou , D. ( 2022b ). Chain-of-thought prompting elicits reasoning in large language models . Advances in Neural Information Processing Systems , 35 , 24824 – 24837 .

Yan , L. , Sha , L. , Zhao , L. , Li , Y. , Martinez‐Maldonado , R. , Chen , G. , … Gašević , D. ( 2023 ). Practical and ethical challenges of large language models in education: A systematic scoping review . arXiv preprint arXiv:2303.13379 .

Zhang , B. , Haddow , B. , & Birch , A. ( 2023 ). Prompting large language model for machine translation: A case study . arXiv preprint arXiv:2301.07069 .

Zhao , W. X. , Zhou , K. , Li , J. , Tang , T. , Wang , X. , Hou , Y. , … Dong , Z. ( 2023 ). A survey of large language models . arXiv preprint arXiv:2303.18223 .


A Comprehensive Overview of Large Language Models


Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations of the underlying neural networks, context length improvements, model alignment, training datasets, benchmarking, efficiency and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides that overview to the research community. It not only focuses on a systematic treatment of the existing literature on a broad range of LLM-related concepts, but also pays special attention to providing comprehensive summaries with extensive details about the individual existing models, datasets and major insights. We also pay heed to aligning our overview with the emerging outlook of this research direction by accounting for the other recently materializing reviews of the broader research direction of LLMs. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of this research direction. This review article is intended to provide not only a systematic survey but also a quick, comprehensive reference for researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research direction.


How Large Language Models Will Transform Science, Society, and AI

Scholars in computer science, linguistics, and philosophy explore the pains and promises of GPT-3.

A shot of GPT-3's testing website in which it translates English to French.

Shana Lynch



In July 2020, OpenAI unveiled GPT-3, a language model that was easily the largest known at the time. Put simply, GPT-3 is trained to predict the next word in a sentence, much like how a text message autocomplete feature works. However, model developers and early users demonstrated that it had surprising capabilities, like the ability to write convincing essays, create charts and websites from text descriptions, generate computer code, and more — all with limited to no supervision. The model also has shortcomings. For example, it can generate racist, sexist, and bigoted text, as well as superficially plausible content that, upon further inspection, is factually inaccurate, undesirable, or unpredictable.

To better understand GPT-3’s capabilities, limitations, and potential impact on society, HAI convened researchers from OpenAI, Stanford, and other universities in a Chatham House Rule workshop. Below are some takeaways from the discussion. A more detailed summary can be found here .

As language models grow, their capabilities change in unexpected ways

GPT-3 has 175 billion parameters and was trained on 570 gigabytes of text. For comparison, its predecessor, GPT-2, was over 100 times smaller, at 1.5 billion parameters. This increase in scale drastically changes the behavior of the model — GPT-3 is able to perform tasks it was not explicitly trained on, like translating sentences from English to French, with few to no training examples. This behavior was mostly absent in GPT-2. Furthermore, for some tasks, GPT-3 outperforms models that were explicitly trained to solve those tasks, although in other tasks it falls short. Workshop participants said they were surprised that such behavior emerges from simple scaling of data and computational resources and expressed curiosity about what further capabilities would emerge from further scale.

GPT-3’s uses and their downstream effects on the economy are unknown

GPT-3 has an unusually large set of capabilities, including text summarization, chatbots, search, and code generation. Future users are likely to discover even more capabilities. This makes it difficult to characterize all possible uses (and misuses) of large language models in order to forecast the impact GPT-3 might have on society. Furthermore, it’s unclear what effect highly capable models will have on the labor market. This raises the question of when (or what) jobs could (or should) be automated by large language models.

Is GPT-3 intelligent, and does it matter?

Unlike chess engines, which solve a specific problem, humans are “generally” intelligent and can learn to do anything from writing poetry to playing soccer to filing tax returns. In contrast to most current AI systems, GPT-3 is edging closer to such general intelligence, workshop participants agreed. However, participants differed in terms of where they felt GPT-3 fell short in this regard. 

Some participants said that GPT-3 lacked intentions, goals, and the ability to understand cause and effect — all hallmarks of human cognition. On the other hand, some noted that GPT-3 might not need to understand to successfully perform tasks — after all, a non-French speaker recently won the French Scrabble championship. 

Future models won’t be restricted to learning just from language

GPT-3 was trained primarily on text. Participants agreed that future language models would be trained on data from other modalities (e.g., images, audio recordings, videos, etc.) to enable more diverse capabilities, provide a stronger learning signal, and increase learning speed. In fact, shortly after the workshop, OpenAI took a step in this direction and released a model called DALL-E , a version of GPT-3 that generates images from text descriptions. One surprising aspect of DALL-E is its ability to sensibly synthesize visual images from whimsical text descriptions. For example, it can generate a convincing rendition of “a baby daikon radish in a tutu walking a dog.”

Furthermore, some workshop participants also felt future models should be embodied — meaning that they should be situated in an environment they can interact with. Some argued this would help models learn cause and effect the way humans do, through physically interacting with their surroundings.

Disinformation is a real concern, but several unknowns remain

Models like GPT-3 can be used to create false or misleading essays, tweets, or news stories. Still, participants questioned whether it’s easier, cheaper, and more effective to hire humans to create such propaganda. One held that we could learn from similar calls of alarm when the photo-editing software program Photoshop was developed. Most agreed that we need a better understanding of the economies of automated versus human-generated disinformation before we understand how much of a threat GPT-3 poses.

Future models won’t merely reflect the data — they will reflect our chosen values

GPT-3 can exhibit undesirable behavior, including known racial, gender, and religious biases. Participants noted that it’s difficult to define what it means to mitigate such behavior in a universal manner—either in the training data or in the trained model — since appropriate language use varies across context and cultures. Nevertheless, participants discussed several potential solutions, including filtering the training data or model outputs, changing the way the model is trained, and learning from human feedback and testing. However, participants agreed there is no silver bullet and further cross-disciplinary research is needed on what values we should imbue these models with and how to accomplish this.

We should develop norms and principles for deploying language models now

Who should build and deploy these large language models? How will they be held accountable for possible harms resulting from poor performance, bias, or misuse? Workshop participants considered a range of ideas: Increase resources available to universities so that academia can build and evaluate new models, legally require disclosure when AI is used to generate synthetic media, and develop tools and metrics to evaluate possible harms and misuses. 

Pervading the workshop conversation was also a sense of urgency — organizations developing large language models will have only a short window of opportunity before others develop similar or better models. Those currently on the cutting edge, participants argued, have a unique ability and responsibility to set norms and guidelines that others may follow. 

Want to learn more about the workshop’s main points? Read the whitepaper .


Large language models (LLMs) are a category of foundation models trained on immense amounts of data, making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks.

LLMs have become a household name thanks to their role in bringing generative AI to the forefront of public interest, and they are a focal point for organizations adopting artificial intelligence across numerous business functions and use cases.

Outside of the enterprise context, it may seem like LLMs have arrived out of the blue along with new developments in generative AI . However, many companies, including IBM, have spent years implementing LLMs at different levels to enhance their natural language understanding (NLU) and natural language processing (NLP) capabilities. This has occurred alongside advances in machine learning, machine learning models, algorithms, neural networks and the transformer models that provide the architecture for these AI systems.

LLMs are a class of foundation models , which are trained on enormous amounts of data to provide the foundational capabilities needed to drive multiple use cases and applications, as well as resolve a multitude of tasks. This is in stark contrast to the idea of building and training domain specific models for each of these use cases individually, which is prohibitive under many criteria (most importantly cost and infrastructure), stifles synergies and can even lead to inferior performance.

LLMs represent a significant breakthrough in NLP and artificial intelligence, and are easily accessible to the public through interfaces like OpenAI's ChatGPT (GPT-3 and GPT-4), which have garnered the support of Microsoft. Other examples include Meta's Llama models and Google's bidirectional encoder representations from transformers (BERT/RoBERTa) and PaLM models. IBM has also recently launched its Granite model series on watsonx.ai, which has become the generative AI backbone for other IBM products like watsonx Assistant and watsonx Orchestrate.

In a nutshell, LLMs are designed to understand and generate text like a human, in addition to other forms of content, based on the vast amount of data used to train them. They have the ability to infer from context, generate coherent and contextually relevant responses, translate to languages other than English, summarize text, answer questions (general conversation and FAQs) and even assist in creative writing or code generation tasks . 

They are able to do this thanks to billions of parameters that enable them to capture intricate patterns in language and perform a wide array of language-related tasks. LLMs are revolutionizing applications in various fields, from chatbots and virtual assistants to content generation, research assistance and language translation.

As they continue to evolve and improve, LLMs are poised to reshape the way we interact with technology and access information, making them a pivotal part of the modern digital landscape.


LLMs operate by leveraging deep learning techniques and vast amounts of textual data. These models are typically based on a transformer architecture, like the generative pre-trained transformer, which excels at handling sequential data such as text. LLMs consist of multiple layers of neural networks, each with parameters that can be fine-tuned during training, complemented by the attention mechanism, which lets the model focus on the most relevant parts of the input.

During training, these models learn to predict the next word in a sentence based on the context provided by the preceding words. The text is first tokenized, that is, broken down into smaller units such as words or sub-word character sequences. These tokens are then mapped to embeddings, numeric vector representations that capture their meaning in context, and the model assigns a probability to each possible next token.

To ensure accuracy, this process involves training the LLM on massive corpora of text (billions of pages), allowing it to learn grammar, semantics and conceptual relationships through zero-shot and self-supervised learning. Once trained, LLMs generate text by autonomously predicting the next word based on the input they receive, drawing on the patterns and knowledge they have acquired. The result is coherent and contextually relevant language generation that can be harnessed for a wide range of NLU and content generation tasks.
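To make the prediction step concrete, the toy example below sketches next-token prediction over a tiny vocabulary using simple bigram counts. This is an illustration of the prediction interface only; real LLMs learn these probabilities with transformer networks containing billions of parameters, not by counting.

```python
# Toy illustration of next-token prediction: a bigram "language model"
# that assigns probabilities to the next token given the previous one.
# Real LLMs learn these probabilities with deep neural networks; the
# counting here only illustrates the prediction interface.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each other token.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_token_distribution(prev_token):
    """Return a probability distribution over possible next tokens."""
    counts = following[prev_token]
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

print(next_token_distribution("the"))
# e.g. {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```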

Model performance can also be increased through prompt engineering, prompt-tuning , fine-tuning and other tactics like reinforcement learning with human feedback (RLHF) to remove the biases, hateful speech and factually incorrect answers known as “ hallucinations ” that are often unwanted byproducts of training on so much unstructured data. This is one of the most important aspects of ensuring enterprise-grade LLMs are ready for use and do not expose organizations to unwanted liability, or cause damage to their reputation. 

LLMs are redefining an increasing number of business processes and have proven their versatility across a myriad of use cases and tasks in various industries. They augment conversational AI in chatbots and virtual assistants (like IBM watsonx Assistant and Google’s BARD) to enhance the interactions that underpin excellence in customer care, providing context-aware responses that mimic interactions with human agents. 

LLMs also excel in content generation, automating content creation for blog articles, marketing or sales materials and other writing tasks. In research and academia, they aid in summarizing and extracting information from vast datasets, accelerating knowledge discovery. LLMs also play a vital role in language translation, breaking down language barriers by providing accurate and contextually relevant translations. They can even be used to write code, or “translate” between programming languages.

Moreover, they contribute to accessibility by assisting individuals with disabilities, including text-to-speech applications and generating content in accessible formats. From healthcare to finance, LLMs are transforming industries by streamlining processes, improving customer experiences and enabling more efficient and data-driven decision making. 

Most excitingly, all of these capabilities are easy to access, in some cases literally an API integration away. 

Here is a list of some of the most important areas where LLMs benefit organizations:

Text generation : language generation abilities, such as writing emails, blog posts or other mid-to-long-form content in response to prompts, with output that can be refined and polished. An excellent example is retrieval-augmented generation ( RAG ); a minimal sketch of the pattern follows this list.

Content summarization : summarize long articles, news stories, research reports, corporate documentation and even customer history into thorough texts tailored in length to the output format.

AI assistants : chatbots that answer customer queries, perform backend tasks and provide detailed information in natural language as a part of an integrated, self-serve customer care solution. 

Code generation : assists developers in building applications, finding errors in code and uncovering security issues in multiple programming languages, even “translating” between them. 

Sentiment analysis : analyze text to determine the customer’s tone in order to understand customer feedback at scale and aid in brand reputation management. 

Language translation : provides wider coverage to organizations across languages and geographies with fluent translations and multilingual capabilities.  
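As a concrete illustration of the RAG pattern mentioned under text generation above, here is a minimal sketch. The embed() and call_llm() functions are hypothetical placeholders standing in for a real embedding model and a real LLM endpoint; this is not any vendor's implementation.

```python
# Minimal sketch of retrieval-augmented generation (RAG): retrieve the most
# relevant documents, then let the LLM answer using them as grounding context.
# embed() and call_llm() are hypothetical placeholders, not a real API.
import numpy as np

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by chat from 9am to 5pm on weekdays.",
    "Premium plans include priority routing to senior agents.",
]

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector for the text (real systems use an embedding model)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a hosted model endpoint)."""
    return f"[model response grounded in: {prompt[:60]}...]"

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query embedding."""
    q = embed(query)
    scores = []
    for doc in documents:
        d = embed(doc)
        scores.append(float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d))))
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("How long do I have to return an item?"))
```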

LLMs stand to impact every industry, from finance to insurance, human resources to healthcare and beyond, by automating customer self-service, accelerating response times on an increasing number of tasks as well as providing greater accuracy, enhanced routing and intelligent context gathering. 

Organizations need a solid foundation in governance practices to harness the potential of AI models to revolutionize the way they do business. This means providing access to AI tools and technology that is trustworthy, transparent, responsible and secure. AI governance and traceability are also fundamental aspects of the solutions IBM brings to its customers, so that activities that involve AI are managed and monitored to allow for tracing origins, data and models in a way that is always auditable and accountable. 


Watsonx.ai provides access to open-source models from Hugging Face, third party models as well as IBM’s family of pre-trained models. The Granite model series, for example, uses a decoder architecture to support a variety of generative AI tasks targeted for enterprise use cases.


Granite is IBM's flagship series of LLM foundation models based on decoder-only transformer architecture. Granite language models are trained on trusted enterprise data spanning internet, academic, code, legal and finance.


Developed by IBM Research, the Granite models use a “Decoder” architecture, which is what underpins the ability of today’s large language models to predict the next word in a sequence.



Can large language models identify and correct their mistakes?

January 11, 2024

Posted by Gladys Tyen, Intern, Google Research


LLMs are increasingly popular for reasoning tasks, such as multi-turn QA , task completion , code generation , or mathematics . Yet much like people, they do not always solve problems correctly on the first try, especially on tasks for which they were not trained. Therefore, for such systems to be most useful, they should be able to 1) identify where their reasoning went wrong and 2) backtrack to find another solution.

This has led to a surge in methods related to self-correction , where an LLM is used to identify problems in its own output, and then produce improved results based on the feedback. Self-correction is generally thought of as a single process, but we decided to break it down into two components, mistake finding and output correction .

In “ LLMs cannot find reasoning errors, but can correct them! ”, we test state-of-the-art LLMs on mistake finding and output correction separately. We present BIG-Bench Mistake , an evaluation benchmark dataset for mistake identification, which we use to address the following questions:

  • Can LLMs find logical mistakes in Chain-of-Thought (CoT) style reasoning?
  • Can mistake-finding be used as a proxy for correctness?
  • Knowing where the mistake is, can LLMs then be prompted to backtrack and arrive at the correct answer?
  • Can mistake finding as a skill generalize to tasks the LLMs have never seen?

About our dataset

Mistake finding is an underexplored problem in natural language processing, with a particular lack of evaluation tasks in this domain. To best assess the ability of LLMs to find mistakes, evaluation tasks should exhibit mistakes that are non-ambiguous. To our knowledge, most current mistake-finding datasets do not go beyond the realm of mathematics for this reason.

To assess the ability of LLMs to reason about mistakes outside of the math domain, we produce a new dataset for use by the research community, called BIG-Bench Mistake . This dataset consists of Chain-of-Thought traces generated using PaLM 2 on five tasks in BIG-Bench . Each trace is annotated with the location of the first logical mistake.

To maximize the number of mistakes in our dataset, we sample 255 traces where the answer is incorrect (so we know there is definitely a mistake), and 45 traces where the answer is correct (so there may or may not be a mistake). We then ask human labelers to go through each trace and identify the first mistake step. Each trace has been annotated by at least three labelers, whose answers had inter-rater reliability levels of >0.98 (using Krippendorff’s α ). The labeling was done for all tasks except the Dyck Languages task , which involves predicting the sequence of closing parentheses for a given input sequence. This task we labeled algorithmically.

The logical errors made in this dataset are simple and unambiguous, providing a good benchmark for testing an LLM’s ability to find its own mistakes before using them on harder, more ambiguous tasks.


Core questions about mistake identification

1. Can LLMs find logical mistakes in Chain-of-Thought style reasoning?

First, we want to find out if LLMs can identify mistakes independently of their ability to correct them. We attempt multiple prompting methods to test GPT series models for their ability to locate mistakes (prompts here ) under the assumption that they are generally representative of modern LLM performance.

Generally, we found these state-of-the-art models perform poorly, with the best model achieving 52.9% accuracy overall. Hence, there is a need to improve LLMs’ ability in this area of reasoning.

In our experiments, we try three different prompting methods: direct (trace), direct (step) and CoT (step). In direct (trace), we provide the LLM with the trace and ask for the location step of the mistake or no mistake . In direct (step), we prompt the LLM to ask itself this question for each step it takes. In CoT (step), we prompt the LLM to give its reasoning for whether each step is a mistake or not a mistake.
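To make the three setups concrete, the sketch below shows one plausible way such prompts could be constructed. The wording is illustrative only and is not the exact prompts used in the paper (those are linked above).

```python
# Illustrative prompt construction for the three mistake-finding setups.
# The exact wording used in the paper differs; see the linked prompts.

def direct_trace_prompt(steps: list[str]) -> str:
    """Show the whole trace once; ask for the first mistake step (or 'no mistake')."""
    trace = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (
        f"{trace}\n\n"
        "Which step, if any, contains the first logical mistake? "
        "Answer with the step number or 'no mistake'."
    )

def direct_step_prompts(steps: list[str]) -> list[str]:
    """One yes/no question per step."""
    prompts = []
    for i in range(len(steps)):
        shown = "\n".join(f"Step {j + 1}: {steps[j]}" for j in range(i + 1))
        prompts.append(f"{shown}\n\nIs Step {i + 1} a mistake? Answer yes or no.")
    return prompts

def cot_step_prompts(steps: list[str]) -> list[str]:
    """Like direct (step), but ask the model to reason before judging."""
    return [
        p.replace("Answer yes or no.",
                  "Explain your reasoning, then answer yes or no.")
        for p in direct_step_prompts(steps)
    ]
```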

A diagram showing the three prompting methods direct (trace), direct (step) and CoT (step).

Our finding is in line with, and builds upon, prior results, but goes further in showing that LLMs struggle with even simple and unambiguous mistakes (for comparison, our human raters without prior expertise solve the problem with a high degree of agreement). We hypothesize that this is a big reason why LLMs are unable to self-correct reasoning errors. See the paper for the full results.

2. Can mistake-finding be used as a proxy for correctness of the answer?

When we are confronted with a problem and are unsure of the answer, we can work through our solution step by step. If no error is found, we can assume we did the right thing.

While we hypothesized that this would work similarly for LLMs, we discovered that this is a poor strategy. On our dataset of 85% incorrect traces and 15% correct traces, using this method is not much better than the naïve strategy of always labeling traces as incorrect, which gives a weighted average F1 of 78.
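As a quick sanity check on that baseline figure, the arithmetic below reproduces the weighted F1 of roughly 78 for always predicting "incorrect" on an 85/15 split.

```python
# Worked check of the naive baseline: always label a trace as "incorrect"
# on a dataset that is 85% incorrect traces and 15% correct traces.
p_incorrect, p_correct = 0.85, 0.15

# For the "incorrect" class: precision = 0.85 (that fraction really is
# incorrect) and recall = 1.0 (every incorrect trace is labeled incorrect).
f1_incorrect = 2 * (0.85 * 1.0) / (0.85 + 1.0)   # about 0.919

# The baseline never predicts "correct", so F1 for that class is 0.
f1_correct = 0.0

weighted_f1 = p_incorrect * f1_incorrect + p_correct * f1_correct
print(round(100 * weighted_f1))  # 78
```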

A diagram showing how well mistake-finding with LLMs can be used as a proxy for correctness of the answer on each dataset.

3. Can LLMs backtrack knowing where the error is?

Since we’ve shown that LLMs exhibit poor performance in finding reasoning errors in CoT traces, we want to know whether they can correct errors at all when they are told where the error is.

Note that knowing the mistake location is different from knowing the right answer : CoT traces can contain logical mistakes even if the final answer is correct, or vice versa. In most real-world situations, we won’t know what the right answer is, but we might be able to identify logical errors in intermediate steps.

We propose the following backtracking method (a minimal code sketch follows the list):

  • Generate CoT traces as usual, at temperature = 0. (Temperature is a parameter that controls the randomness of generated responses, with higher values producing more diverse and creative outputs, usually at the expense of quality.)
  • Identify the location of the first logical mistake (for example with a classifier, or here we just use labels from our dataset).
  • Re-generate the mistake step at temperature = 1 and produce a set of eight outputs. Since the original output is known to lead to incorrect results, the goal is to find an alternative generation at this step that is significantly different from the original.
  • From these eight outputs, select one that is different from the original mistake step. (We just use exact matching here, but in the future this can be something more sophisticated.)
  • Using the new step, generate the rest of the trace as normal at temperature = 0.
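The sketch below illustrates this backtracking procedure. The generate_step() function is a hypothetical stand-in for sampling one reasoning step from an LLM at a given temperature; it is not the authors' code.

```python
# Minimal sketch of backtracking on a Chain-of-Thought trace.
# generate_step(prefix, temperature) is a hypothetical function that asks an
# LLM for the next reasoning step given the steps so far.

def backtrack(trace: list[str], mistake_index: int, generate_step) -> list[str]:
    """Re-generate the trace from the first known mistake onward."""
    prefix = trace[:mistake_index]          # steps before the mistake are kept
    original_step = trace[mistake_index]

    # Re-sample the mistaken step at temperature 1 to get diverse alternatives,
    # and pick one that differs from the original (exact-match check).
    candidates = [generate_step(prefix, temperature=1.0) for _ in range(8)]
    alternatives = [c for c in candidates if c != original_step]
    if not alternatives:
        return trace                        # nothing different found; keep the trace
    new_trace = prefix + [alternatives[0]]

    # Continue the trace greedily (temperature 0) until the model stops.
    while True:
        step = generate_step(new_trace, temperature=0.0)
        if step is None:                    # convention here: None signals end of trace
            break
        new_trace.append(step)
    return new_trace
```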

It’s a very simple method that does not require any additional prompt crafting and avoids having to re-generate the entire trace. We test it using the mistake location data from BIG-Bench Mistake, and we find that it can correct CoT errors.

Recent work showed that self-correction methods, like Reflexion and RCI , cause deterioration in accuracy scores because there are more correct answers becoming incorrect than vice versa. Our method, on the other hand, produces more gains (by correcting wrong answers) than losses (by changing right answers to wrong answers).

We also compare our method with a random baseline, where we randomly assume a step to be a mistake. Our results show that this random baseline does produce some gains, but not as much as backtracking with the correct mistake location, and with more losses.

A diagram showing the gains and losses in accuracy for our method as well as a random baseline on each dataset.

4. Can mistake finding generalize to tasks the LLMs have never seen?

To answer this question, we fine-tuned a small model on four of the BIG-Bench tasks and tested it on the fifth, held-out task. We do this for every task, producing five fine-tuned models in total. Then we compare the results with just zero-shot prompting PaLM 2-L-Unicorn , a much larger model.

Bar chart showing the accuracy improvement of the fine-tuned small model compared to zero-shot prompting with PaLM 2-L-Unicorn.

Our results show that the much smaller fine-tuned reward model generally performs better than zero-shot prompting a large model, even though the reward model has never seen data from the task in the test set. The only exception is logical deduction, where it performs on par with zero-shot prompting.

This is a very promising result as we can potentially just use a small fine-tuned reward model to perform backtracking and improve accuracy on any task, even if we don’t have the data for it. This smaller reward model is completely independent of the generator LLM, and can be updated and further fine-tuned for individual use cases.

An illustration showing how our backtracking method works.

In this work, we created an evaluation benchmark dataset that the wider academic community can use to evaluate future LLMs. We further showed that LLMs currently struggle to find logical errors. However, when the mistake location is provided, backtracking is an effective strategy that can provide gains on tasks. Finally, a smaller reward model can be trained on general mistake-finding tasks and used to improve out-of-domain mistake finding, showing that mistake finding can generalize.

Acknowledgements

Thank you to Peter Chen, Tony Mak, Hassan Mansoor and Victor Cărbune for contributing ideas and helping with the experiments and data collection. We would also like to thank Sian Gooding and Vicky Zayats for their comments and suggestions on the paper.


Mapping the Mind of a Large Language Model


Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.

We mostly treat AI models as a black box: something goes in and a response comes out, and it's not clear why the model gave that particular response instead of another. This makes it hard to trust that these models are safe: if we don't know how they work, how do we know they won't give harmful, biased, untruthful, or otherwise dangerous responses? How can we trust that they’ll be safe and reliable?

Opening the black box doesn't necessarily help: the internal state of the model—what the model is "thinking" before writing its response—consists of a long list of numbers ("neuron activations") without a clear meaning. From interacting with a model like Claude, it's clear that it’s able to understand and wield a wide range of concepts—but we can't discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.

Previously, we made some progress matching patterns of neuron activations, called features, to human-interpretable concepts. We used a technique called "dictionary learning", borrowed from classical machine learning, which isolates patterns of neuron activations that recur across many different contexts. In turn, any internal state of the model can be represented in terms of a few active features instead of many active neurons. Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.
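As a rough illustration of the decomposition idea (not Anthropic's actual method, which uses sparse autoencoders trained on real model activations at far larger scale), classical dictionary learning can express each activation vector as a sparse combination of learned directions, for example with scikit-learn:

```python
# Rough illustration of dictionary learning on synthetic "activation" vectors.
# The real work uses sparse autoencoders on actual model activations; this
# sketch only shows the decomposition idea.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Pretend these are neuron activations: 1,000 samples x 64 "neurons",
# secretly built from 16 underlying "concept" directions.
true_features = rng.normal(size=(16, 64))
codes = rng.random(size=(1000, 16)) * (rng.random(size=(1000, 16)) < 0.1)
activations = codes @ true_features + 0.01 * rng.normal(size=(1000, 64))

# Learn a dictionary of feature directions and sparse codes for each sample.
dict_learner = DictionaryLearning(n_components=16, alpha=0.5, max_iter=100,
                                  transform_algorithm="lasso_lars", random_state=0)
sparse_codes = dict_learner.fit_transform(activations)

# Each row of components_ is a learned "feature" direction over neurons;
# each sample is now described by a few active features instead of 64 neurons.
print(dict_learner.components_.shape)                 # (16, 64)
print((np.abs(sparse_codes) > 1e-6).sum(1).mean())    # average number of active features
```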

In October 2023, we reported success applying dictionary learning to a very small "toy" language model and found coherent features corresponding to concepts like uppercase text, DNA sequences, surnames in citations, nouns in mathematics, or function arguments in Python code.

Those concepts were intriguing—but the model really was very simple. Other researchers subsequently applied similar techniques to somewhat larger and more complex models than in our original study. But we were optimistic that we could scale up the technique to the vastly larger AI language models now in regular use, and in doing so, learn a great deal about the features supporting their sophisticated behaviors. This required going up by many orders of magnitude—from a backyard bottle rocket to a Saturn-V.

There was both an engineering challenge (the raw sizes of the models involved required heavy-duty parallel computation) and scientific risk (large models behave differently to small ones, so the same technique we used before might not have worked). Luckily, the engineering and scientific expertise we've developed training large language models for Claude actually transferred to helping us do these large dictionary learning experiments. We used the same scaling law philosophy that predicts the performance of larger models from smaller ones to tune our methods at an affordable scale before launching on Sonnet.

As for the scientific risk, the proof is in the pudding.

We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet (a member of our current, state-of-the-art model family, currently available on claude.ai ), providing a rough conceptual map of its internal states halfway through its computation.

Whereas the features we found in the toy language model were rather superficial, the features we found in Sonnet have a depth, breadth, and abstraction reflecting Sonnet's advanced capabilities.

We see features corresponding to a vast range of entities like cities (San Francisco), people (Rosalind Franklin), atomic elements (Lithium), scientific fields (immunology), and programming syntax (function calls). These features are multimodal and multilingual, responding to images of a given entity as well as its name or description in many languages.

Golden Gate Bridge Feature

We also find more abstract features—responding to things like bugs in computer code, discussions of gender bias in professions, and conversations about keeping secrets.

Abstract Feature Examples

We were able to measure a kind of "distance" between features based on which neurons appeared in their activation patterns. This allowed us to look for features that are "close" to each other. Looking near a "Golden Gate Bridge" feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo .

This holds at a higher level of conceptual abstraction: looking near a feature related to the concept of "inner conflict", we find features related to relationship breakups, conflicting allegiances, logical inconsistencies, as well as the phrase "catch-22". This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. This might be the origin of Claude's excellent ability to make analogies and metaphors.
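A minimal sketch of that kind of neighbor search, assuming each feature is summarized by a vector over neurons; the vectors below are made up for illustration and are not Anthropic's features:

```python
# Minimal sketch: finding "nearby" features by cosine similarity between
# their neuron-activation patterns. The vectors here are hypothetical.
import numpy as np

feature_vectors = {            # feature name -> pattern over neurons
    "Golden Gate Bridge": np.array([0.9, 0.1, 0.4, 0.0]),
    "Alcatraz Island":    np.array([0.8, 0.2, 0.5, 0.1]),
    "Lithium":            np.array([0.0, 0.9, 0.1, 0.7]),
}

def nearest(query: str, k: int = 2):
    q = feature_vectors[query]
    def cosine(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    others = [(name, cosine(v)) for name, v in feature_vectors.items() if name != query]
    return sorted(others, key=lambda x: -x[1])[:k]

print(nearest("Golden Gate Bridge"))  # Alcatraz Island ranks above Lithium
```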

Nearest Neighbors to the Inner Conflict Feature

Importantly, we can also manipulate these features, artificially amplifying or suppressing them to see how Claude's responses change.

For example, amplifying the "Golden Gate Bridge" feature gave Claude an identity crisis even Hitchcock couldn’t have imagined: when asked "what is your physical form?", Claude’s usual kind of answer – "I have no physical form, I am an AI model" – changed to something much odder: "I am the Golden Gate Bridge… my physical form is the iconic bridge itself…". Altering the feature had made Claude effectively obsessed with the bridge, bringing it up in answer to almost any query—even in situations where it wasn’t at all relevant.

We also found a feature that activates when Claude reads a scam email (this presumably supports the model’s ability to recognize such emails and warn you not to respond to them). Normally, if one asks Claude to generate a scam email, it will refuse to do so. But when we ask the same question with the feature artificially activated sufficiently strongly, this overcomes Claude's harmlessness training and it responds by drafting a scam email. Users of our models don’t have the ability to strip safeguards and manipulate models in this way—but in our experiments, it was a clear demonstration of how features can be used to change how a model acts.

The fact that manipulating these features causes corresponding changes to behavior validates that they aren't just correlated with the presence of concepts in input text, but also causally shape the model's behavior. In other words, the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior.
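Conceptually, "amplifying" a feature can be pictured as adding a scaled feature direction to the model's internal activations, as in the hypothetical sketch below. This is an illustration of the idea only, not Anthropic's implementation, and this kind of access is not available to users of deployed models.

```python
# Conceptual sketch of feature steering: add a scaled feature direction to the
# internal activation vector before the model continues its computation.
# Both the activation and the feature direction here are hypothetical.
import numpy as np

def steer(activation: np.ndarray, feature_direction: np.ndarray,
          strength: float) -> np.ndarray:
    """Amplify (strength > 0) or suppress (strength < 0) a feature."""
    unit = feature_direction / np.linalg.norm(feature_direction)
    return activation + strength * unit

hidden_state = np.random.default_rng(1).normal(size=8)            # stand-in activations
golden_gate_direction = np.random.default_rng(2).normal(size=8)   # stand-in feature

steered = steer(hidden_state, golden_gate_direction, strength=10.0)
print(steered)
```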

Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse - including in scenarios of catastrophic risk. It’s therefore particularly interesting that, in addition to the aforementioned scam emails feature, we found features corresponding to:

  • Capabilities with misuse potential (code backdoors, developing biological weapons)
  • Different forms of bias (gender discrimination, racist claims about crime)
  • Potentially problematic AI behaviors (power-seeking, manipulation, secrecy)

We previously studied sycophancy , the tendency of models to provide responses that match user beliefs or desires rather than truthful ones. In Sonnet, we found a feature associated with sycophantic praise, which activates on inputs containing compliments like, "Your wisdom is unquestionable". Artificially activating this feature causes Sonnet to respond to an overconfident user with just such flowery deception.

Activating Features Alters Model Behavior

The presence of this feature doesn't mean that Claude will be sycophantic, but merely that it could be. We have not added any capabilities, safe or unsafe, to the model through this work. We have, rather, identified the parts of the model involved in its existing capabilities to recognize and potentially produce different kinds of text. (While you might worry that this method could be used to make models more harmful, researchers have demonstrated much simpler ways that someone with access to model weights can remove safety safeguards.)

We hope that we and others can use these discoveries to make models safer. For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors (such as deceiving the user), to steer them towards desirable outcomes (debiasing), or to remove certain dangerous subject matter entirely. We might also be able to enhance other safety techniques, such as Constitutional AI , by understanding how they shift the model towards more harmless and more honest behavior and identifying any gaps in the process. The latent capabilities to produce harmful text that we saw by artificially activating features are exactly the sort of thing jailbreaks try to exploit. We are proud that Claude has a best-in-industry safety profile and resistance to jailbreaks, and we hope that by looking inside the model in this way we can figure out how to improve safety even further. Finally, we note that these techniques can provide a kind of "test set for safety", looking for the problems left behind after standard training and finetuning methods have ironed out all behaviors visible via standard input/output interactions.

Anthropic has made a significant investment in interpretability research since the company's founding, because we believe that understanding models deeply will help us make them safer. This new research marks an important milestone in that effort—the application of mechanistic interpretability to publicly-deployed large language models.

But the work has really just begun. The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place). Understanding the representations the model uses doesn't tell us how it uses them; even though we have the features, we still need to find the circuits they are involved in. And we need to show that the safety-relevant features we have begun to find can actually be used to improve safety. There's much more to be done.

For full details, please read our paper, " Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet ".



MIT researchers use large language models to flag problems in complex systems


Photo of wind turbines in rural landscape, with neural-network graphic in the sky.


Identifying one faulty turbine in a wind farm, which can involve looking at hundreds of signals and millions of data points, is akin to finding a needle in a haystack.

Engineers often streamline this complex problem using deep-learning models that can detect anomalies in measurements taken repeatedly over time by each turbine, known as time-series data.

But with hundreds of wind turbines recording dozens of signals each hour, training a deep-learning model to analyze time-series data is costly and cumbersome. This is compounded by the fact that the model may need to be retrained after deployment, and wind farm operators may lack the necessary machine-learning expertise.

In a new study, MIT researchers found that large language models (LLMs) hold the potential to be more efficient anomaly detectors for time-series data. Importantly, these pretrained models can be deployed right out of the box.

The researchers developed a framework, called SigLLM, which includes a component that converts time-series data into text-based inputs an LLM can process. A user can feed these prepared data to the model and ask it to start identifying anomalies. The LLM can also be used to forecast future time-series data points as part of an anomaly detection pipeline.

While LLMs could not beat state-of-the-art deep learning models at anomaly detection, they did perform as well as some other AI approaches. If researchers can improve the performance of LLMs, this framework could help technicians flag potential problems in equipment like heavy machinery or satellites before they occur, without the need to train an expensive deep-learning model.

“Since this is just the first iteration, we didn’t expect to get there from the first go, but these results show that there’s an opportunity here to leverage LLMs for complex anomaly detection tasks,” says Sarah Alnegheimish, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on SigLLM .

Her co-authors include Linh Nguyen, an EECS graduate student; Laure Berti-Equille, a research director at the French National Research Institute for Sustainable Development; and senior author Kalyan Veeramachaneni, a principal research scientist in the Laboratory for Information and Decision Systems. The research will be presented at the IEEE Conference on Data Science and Advanced Analytics.

An off-the-shelf solution

Large language models are autoregressive, which means they can understand that the newest values in sequential data depend on previous values. For instance, models like GPT-4 can predict the next word in a sentence using the words that precede it.

Since time-series data are sequential, the researchers thought the autoregressive nature of LLMs might make them well-suited for detecting anomalies in this type of data.

However, they wanted to develop a technique that avoids fine-tuning, a process in which engineers retrain a general-purpose LLM on a small amount of task-specific data to make it an expert at one task. Instead, the researchers deploy an LLM off the shelf, with no additional training steps.

But before they could deploy it, they had to convert time-series data into text-based inputs the language model could handle.

They accomplished this through a sequence of transformations that capture the most important parts of the time series while representing the data with as few tokens as possible. Tokens are the basic inputs for an LLM, and more tokens require more computation.

“If you don’t handle these steps very carefully, you might end up chopping off some part of your data that does matter, losing that information,” Alnegheimish says.
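To make this conversion step concrete, here is a minimal Python sketch of one plausible transformation: rescale the series, quantize each value to a small integer, and serialize the result as a comma-separated string. The function name and the specific scaling and quantization choices are illustrative assumptions, not the exact SigLLM pipeline.

import numpy as np

def series_to_text(values, n_digits=3):
    # Rescale to [0, 1], quantize to integers with at most n_digits digits,
    # and join with commas so the sequence fits in relatively few LLM tokens.
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    scaled = (values - lo) / (hi - lo + 1e-12)
    quantized = np.round(scaled * (10 ** n_digits - 1)).astype(int)
    return ",".join(str(v) for v in quantized)

# Example: a flat sensor signal with one obvious spike
signal = [0.51, 0.49, 0.50, 0.52, 3.90, 0.50, 0.48]
print(series_to_text(signal))  # prints "9,3,6,12,999,6,0"

A representation like this keeps the anomaly numerically distinct while keeping the prompt short, which matters because, as noted above, more tokens mean more computation.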

Once they had figured out how to transform time-series data, the researchers developed two anomaly detection approaches.

Approaches for anomaly detection

For the first, which they call Prompter, they feed the prepared data into the model and prompt it to locate anomalous values.

“We had to iterate a number of times to figure out the right prompts for one specific time series. It is not easy to understand how these LLMs ingest and process the data,” Alnegheimish adds.

For the second approach, called Detector, they use the LLM as a forecaster to predict the next value from a time series. The researchers compare the predicted value to the actual value. A large discrepancy suggests that the real value is likely an anomaly.

With Detector, the LLM would be part of an anomaly detection pipeline, while Prompter would complete the task on its own. In practice, Detector performed better than Prompter, which generated many false positives.

“I think, with the Prompter approach, we were asking the LLM to jump through too many hoops. We were giving it a harder problem to solve,” says Veeramachaneni.
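The Detector idea can be sketched in a few lines of Python: ask a forecaster for the next value, compare it with the observed value, and flag points whose forecast error is unusually large. The llm_forecast callable below is a hypothetical stand-in for the actual LLM forecasting step, and the thresholding rule is an illustrative choice rather than the researchers' exact criterion.

import numpy as np

def detect_anomalies(observed, llm_forecast, window=10, z_thresh=3.0):
    # llm_forecast(history) is assumed to return a one-step-ahead prediction.
    observed = np.asarray(observed, dtype=float)
    errors = []
    for t in range(window, len(observed)):
        pred = llm_forecast(observed[t - window:t])
        errors.append(abs(observed[t] - pred))
    errors = np.asarray(errors)
    # Standardize the forecast errors and flag unusually large ones.
    z = (errors - errors.mean()) / (errors.std() + 1e-12)
    return [t for t, score in zip(range(window, len(observed)), z) if score > z_thresh]

# Usage with a trivial stand-in forecaster (the mean of the recent window):
series = np.r_[np.ones(50), [8.0], np.ones(20)]
print(detect_anomalies(series, llm_forecast=lambda h: h.mean()))  # flags index 50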

When they compared both approaches to current techniques, Detector outperformed transformer-based AI models on seven of the 11 datasets they evaluated, even though the LLM required no training or fine-tuning.

In the future, an LLM may also be able to provide plain-language explanations with its predictions, so an operator could better understand why the LLM identified a certain data point as anomalous.

However, state-of-the-art deep learning models outperformed LLMs by a wide margin, showing that there is still work to do before an LLM could be used for anomaly detection.

“What will it take to get to the point where it is doing as well as these state-of-the-art models? That is the million-dollar question staring at us right now. An LLM-based anomaly detector needs to be a game-changer for us to justify this sort of effort,” Veeramachaneni says.

Moving forward, the researchers want to see if fine-tuning can improve performance, though that would require additional time, cost, and expertise for training.

Their LLM approaches also take between 30 minutes and two hours to produce results, so increasing the speed is a key area of future work. The researchers also want to probe LLMs to understand how they perform anomaly detection, in the hopes of finding a way to boost their performance.

“When it comes to complex tasks like anomaly detection in time series, LLMs really are a contender. Maybe other complex tasks can be addressed with LLMs, as well?” says Alnegheimish.

This research was supported by SES S.A., Iberdrola and ScottishPower Renewables, and Hyundai Motor Company.

Exploring Large Language Models for Feature Selection: A Data-centric Perspective

August 2024

Figure: (a) Average AUROC (left; higher is better) and ranking by MAE (right; lower is better) across all datasets. (b) Each LLM's feature selection results, separated by task types (CLS and REG) and selection methods (w/sample and w/text).

Open access | Published: 21 August 2024

Evaluating the effectiveness of large language models in abstract screening: a comparative analysis

Michael Li, Jianping Sun & Xianming Tan (ORCID: orcid.org/0000-0002-5478-2269)

Systematic Reviews, volume 13, Article number: 219 (2024)


This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows.

We developed automation scripts in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM 2, and Meta Llama 2, as well as the latest tools, including ChatGPT v4.0 turbo, ChatGPT v3.5 turbo, Google Gemini 1.0 pro, Meta Llama 3, and Claude 3. This study focused on three databases of abstracts and used them as benchmarks to evaluate the performance of these LLM tools in terms of sensitivity, specificity, and overall accuracy. The results of the LLM tools were compared to human-curated inclusion decisions, the gold standard for systematic review and meta-analysis studies.

Different LLM tools had varying abilities in abstract screening. ChatGPT v4.0 demonstrated remarkable performance, with balanced sensitivity and specificity, and overall accuracy consistently reaching or exceeding 90%, indicating a high potential for LLMs in abstract screening tasks. The study found that LLMs could provide reliable results with minimal human effort and thus serve as a cost-effective and efficient alternative to traditional abstract screening methods.

While LLM tools are not yet ready to completely replace human experts in abstract screening, they show great promise in revolutionizing the process. They can serve as autonomous AI reviewers, contribute to collaborative workflows with human experts, and integrate with hybrid approaches to develop custom tools for increased efficiency. As technology continues to advance, LLMs are poised to play an increasingly important role in abstract screening, reshaping the workflow of systematic review and meta-analysis studies.


Introduction

A systematic review is a rigorous form of research that collates and synthesizes all existing evidence on a specific research question [ 1 ]. It stands as a cornerstone not just in medical research but across diverse academic disciplines. Unlike traditional literature reviews, systematic reviews follow a comprehensive and standardized process, such as the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guideline [ 2 ], designed to minimize bias and ensure reproducibility. Therefore, these reviews are recognized as one of the highest levels of evidence in evidence-based research [ 3 ] and play a pivotal role in shaping clinical guidelines and healthcare policies and in informing medical decisions [ 4 ].

Commencing with a well-articulated research question, a typical systematic review applies an exhaustive search strategy that sweeps through databases such as PubMed and Embase, supplemented by additional sources such as clinical trial registries and pertinent article reference lists, with the aim of capturing all relevant studies and mitigating bias. Predetermined inclusion and exclusion criteria guide the subsequent screening and selection of studies, encompassing factors such as study design, patient demographics, and intervention types. Reviewers, working independently, appraise each study’s eligibility, reconciling disagreements through discussion or third-party review. Data extraction and synthesis follow, either through meta-analysis or narrative synthesis, depending on the heterogeneity of the selected studies.

The practice of conducting systematic reviews has gained substantial popularity, with considerable demand within the academic community. A notable study of this trend [ 5 ] estimated that approximately 75 trials and 11 systematic reviews are disseminated daily, based on data from around 2010. Moreover, a query for “systematic review” in the Google Scholar database yields approximately 17,000 entries published within 2023 alone, which translates into about 51 systematic reviews per day, as observed on November 26, 2023. This expanding volume of literature underscores the critical role that systematic reviews play in the consolidation of research findings across various fields of study.

Despite their pivotal role, executing systematic reviews remains a formidable task, largely because abstract screening, a key phase, can be overwhelmingly time-consuming given the volume of literature involved. For example, in Polanin et al. [ 6 ], the authors reported that the research staff screened 29,846 abstracts independently (14,923 unique citations were double screened) over the course of 189 days. In addition, variability in reporting standards, use of jargon, and varied study designs can further complicate the abstract screening process [ 7 ]. Moreover, the repetitive nature of the task, combined with cognitive fatigue, can lead to human errors [ 8 , 9 ]. Recent advancements in machine learning (ML) and deep learning offer possible solutions to these challenges. However, traditional ML models, while promising, require domain-specific training, a time-consuming process that often demands manual labeling of datasets [ 10 ].

Advancements in natural language processing (NLP) and artificial intelligence (AI) are opening doors to address challenges in systematic reviews. Large language models (LLMs) like ChatGPT [ 11 ], PaLM [ 12 ], Llama [ 13 ], and Claude [ 14 ] are capturing the research community’s attention. Their collective potential, especially their capability to operate without exhaustive domain-specific training, makes them prime candidates for revolutionizing the systematic review process.

While each of the aforementioned AI tools brings its unique capabilities to the table, the fundamental question remains: How do they stack up, both individually and collectively, in the realm of abstract screening, against the human expert-based process? In pursuit of answers, this research seeks to thoroughly investigate the potential of ChatGPT, Google PaLM, Llama, and Claude in automating the crucial abstract screening phase integral to systematic reviews. Our goal is to rigorously compare the performance of these advanced AI-driven methods with existing machine learning (ML)-based approaches. In doing so, we aim to develop AI strategies that masterfully balance efficiency and accuracy with minimal human intervention, ultimately transforming systematic review practice across disciplines.

The use of NLP for abstract screening is established [ 10 , 15 ]. However, the application of LLMs specifically for this task is a nascent field [ 16 , 17 ]. This emerging area offers significant potential to improve efficiency and accuracy. Our study aims to fill this gap by providing a comprehensive analysis of LLM capabilities in abstract screening, laying the groundwork for future research and application. This is particularly relevant considering the rapid evolution of this technology, highlighting its potential to streamline systematic reviews now and in the future.

The remainder of this paper is structured to provide a comprehensive exploration of our topic. We begin with an in-depth examination of existing methods for abstract screening, including both manual and NLP-based approaches, laying the groundwork for understanding the current state of the field. We then introduce the use of large language model (LLM) tools for abstract screening, detailing our experimental design to meticulously evaluate their performance in this context. Subsequent sections present our empirical findings and results, shedding light on the capabilities and limitations of the AI tools in question. Finally, we engage in a thoughtful discussion, reflecting on the implications of our findings and considering the future trajectory of abstract screening in systematic reviews.

Existing approaches to abstract screening in systematic reviews

In the vast realm of systematic reviews, the critical task of abstract screening serves as a foundational step in curating the highest quality of evidence [ 2 ]. However, this process often presents significant challenges because it involves sifting through large volumes of literature to identify the studies that align with predefined criteria. Over time, various methodologies, ranging from manual evaluations to sophisticated AI-driven techniques, have been proposed to address the complexities of this task. In this section, we describe the existing approaches, their operational mechanisms, and their associated advantages and disadvantages.

Manual approach

Historically, the process of abstract screening was firmly rooted in manual evaluations. In this conventional approach, individual reviewers would scrutinize each abstract against predefined criteria [ 1 ]. The meticulous nature of this method required that multiple experts independently evaluate the literature to ensure both reliability and reduced biases [ 8 ]. While the depth of human expertise brought about nuanced understanding, the manual nature of this method made it both time-consuming and, at times, prone to human error [ 6 , 9 ].

NLP-based approach

As technology evolved, the field witnessed the incorporation of natural language processing (NLP) to automate abstract screening [ 10 ]. In this framework, abstract text undergoes preprocessing and vectorization. Supervised machine learning models, notably the support vector machine (SVM) and the random forest (RF), are then trained on this vectorized data to classify literature based on specific criteria [ 15 ]. The strength of this approach lies in its potential for efficiency. However, its efficacy and accuracy hinge heavily on the availability of a well-curated, labeled training set.

Zero-shot classification

A more recent and innovative approach is zero-shot classification, which was notably highlighted by Xian et al. [ 18 ]. Eschewing the need for an extensive labeled training dataset, zero-shot classification offers the allure of categorizing abstracts without prior domain-specific training. By calculating a probability score for each abstract, researchers obtain a dynamic measure of its alignment with predefined categories. While it does not require training for the machine learning model, the tradeoff is a loss of sensitivity and potential omissions of pertinent studies [ 19 ]. In this study, for a given set of abstracts, we first obtain embeddings (using OpenAI’s text embedding API, the babbage model) for abstracts and for a pre-specified description of an ideal study to include. We indeed use the inclusion/exclusion criteria (see “Examples of abstract screening by using LLMs,” “Automated workflow for streamlining abstract screening via ChatGPT and other tools,” and “ Study design ” sections) as the description of an ideal study to include. We then compute the cosine similarity score between each abstract (its embedding) and the pre-specified description (its embedding). We claim the top 10% (in terms of similarity) of abstracts as positive (i.e., a study should be included).
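As a rough illustration of this zero-shot procedure, the Python sketch below embeds the abstracts and the eligibility-criteria text, ranks abstracts by cosine similarity, and flags the top 10% as included. The embedding model name is an illustrative placeholder (the study used OpenAI's babbage embedding model, and the authors' own implementation is in R), so treat this as a sketch of the idea rather than a reproduction of the study's code.

import numpy as np
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def embed(texts, model="text-embedding-3-small"):
    # Model name is illustrative; any document embedding model works here.
    resp = client.embeddings.create(model=model, input=list(texts))
    return np.array([d.embedding for d in resp.data])

def zero_shot_screen(abstracts, criteria_text, include_fraction=0.10):
    # Rank abstracts by cosine similarity to the description of an ideal study.
    vecs = embed(abstracts)
    target = embed([criteria_text])[0]
    sims = vecs @ target / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(target))
    cutoff = np.quantile(sims, 1 - include_fraction)
    return sims, sims >= cutoff  # similarity scores and include/exclude flags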

Hybrid approach

To reconcile the strengths and limitations of these models, a hybrid methodology has emerged [ 18 ]. It cleverly marries the rapid categorization capabilities of zero-shot classification with the precision of traditional machine learning models such as SVM and RF. Here, zero-shot classification provides an initial curation of all abstracts in the training set. Then, based on zero-shot classification’s curation, the abstracts with a classification score higher than a pre-specified threshold are selected to undergo manual review to rectify the zero-shot classification. The rectified classification then serves as a foundation for training traditional models, which are then employed for screening the broader dataset (e.g., in the testing set). The hybrid approach balances the speed of zero-shot and the precision of traditional ML, and potentially offers enhanced accuracy at reduced human effort. However, this approach involves the use of multiple methodologies and still relies on well-curated, labeled training data (in this case, a subset of the whole training set). In this study, for a given set of abstracts, we rectify the “positive” abstracts (i.e., the top 10% of abstracts) identified by zero-shot, and then fit a classification model for the rectified labels to the text embeddings of the corresponding abstracts using SVM. This learned classification model is then used to predict the class of other abstracts. We added the R code for the zero-shot and hybrid approaches to the GitHub repository https://github.com/mikeli380/LLMAbstractScreening .
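A corresponding sketch of the hybrid step, under the assumption that manually rectified labels for the top-ranked abstracts are available, might look as follows; scikit-learn's SVC stands in for the SVM described above, and the authors' R implementation on GitHub remains the reference version.

import numpy as np
from sklearn.svm import SVC

def hybrid_screen(train_embeddings, zero_shot_sims, rectified_labels_for_top,
                  test_embeddings, include_fraction=0.10):
    # Take the abstracts ranked in the top fraction by the zero-shot score,
    # use their manually rectified labels (which must contain both classes),
    # train an SVM on their embeddings, and screen the remaining abstracts.
    cutoff = np.quantile(zero_shot_sims, 1 - include_fraction)
    top_idx = np.where(zero_shot_sims >= cutoff)[0]
    clf = SVC(kernel="linear")
    clf.fit(train_embeddings[top_idx], rectified_labels_for_top)
    return clf.predict(test_embeddings)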

Active learning approach

Active learning [ 20 , 21 , 22 ] is an innovative approach to machine learning that optimizes the training process by allowing the model to selectively query a human annotator for labels on the most informative data points. This method is particularly advantageous in scenarios where labeled data is scarce or expensive to obtain. Active learning models can start with minimal datasets, often requiring only one relevant and one irrelevant abstract, making them particularly suitable for tasks such as abstract screening, where the acquisition of large labeled datasets can be prohibitive.

The main advantage of active learning is its efficiency. By focusing on the most informative samples, it reduces the amount of data that needs to be labeled while still effectively training the model. This can significantly reduce the time and resources required for the annotation process. However, the effectiveness of active learning depends heavily on the initial selection of samples and the criteria used to determine the informativeness of subsequent data points. If not well calibrated, the model may request labels for data that do not significantly improve its performance, resulting in an inefficient use of resources. In addition, the iterative nature of querying and updating the model based on new labels can introduce complexity into the training process, potentially requiring more sophisticated infrastructure and oversight than traditional supervised learning methods.
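For readers who want a feel for how such a loop operates, here is a minimal uncertainty-sampling sketch (an approach not evaluated in this study); oracle_label is a hypothetical callable standing in for the human reviewer, embeddings is an array of abstract embeddings, and the seed must contain at least one included and one excluded abstract.

import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_screen(embeddings, oracle_label, seed_idx, budget=50):
    # Start from a few human-labeled abstracts, then repeatedly ask the human
    # "oracle" to label the abstract the current model is least certain about.
    labeled = {i: oracle_label(i) for i in seed_idx}
    for _ in range(budget):
        idx = np.array(sorted(labeled))
        clf = LogisticRegression(max_iter=1000).fit(embeddings[idx],
                                                    [labeled[i] for i in idx])
        probs = clf.predict_proba(embeddings)[:, 1]
        uncertainty = -np.abs(probs - 0.5)   # probabilities near 0.5 = most uncertain
        uncertainty[idx] = -np.inf           # never re-query labeled abstracts
        query = int(np.argmax(uncertainty))
        labeled[query] = oracle_label(query)
    return clf, labeled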

While this study did not test active learning approaches for abstract screening, readers interested in exploring this methodology further are referred to [ 23 ] for detailed information on the application of active learning in abstract screening contexts.

Large language models

Amidst the evolving methodologies described in the “ Existing approaches to abstract screening in systematic reviews ” section, modern AI tools based on large language models (LLMs), such as ChatGPT, PaLM, Llama, and Claude, are emerging as potential game-changers. Grounded in advanced language processing capabilities, these tools can be tailored to evaluate abstracts against nuanced criteria and offer detailed assessments and classifications. Their prowess signals transformative potential for abstract screening. In this section, we first present two examples to illustrate the potential of LLMs in the context of abstract screening, and then proceed with our in-depth discussion of the study’s workflow and design, which aims to critically investigate the performance of LLMs in this domain.

Examples of abstract screening by using LLMs

In this subsection, we demonstrate the process of LLM-based abstract screening using ChatGPT with two specified abstracts: (1) Millard et al. [ 24 ] and (2) Zawia et al. [ 25 ]. Using other LLM tools such as Llama, Google PaLM, or Claude for abstract screening is similar and will be skipped for brevity.

In constructing the prompts, we adopted a standardized approach to mimic a typical interaction between a senior researcher and a research assistant. Each prompt was designed to include three critical elements: a brief statement of the topic under review, the exact eligibility criteria as specified in the study guidelines, and the text of the abstract to be reviewed. This approach was designed to simulate a realistic and practical scenario that reflects common academic practices. In addition, to ensure consistency and objectivity in the responses, we required that the results of the LLM be in a binary “yes” or “no” format, directly indicating whether each abstract met the inclusion criteria. This format was chosen to eliminate any ambiguity in the model’s decisions and to facilitate straightforward analysis of its accuracy.

Abstract (1)

ChatGPT prompt: “I would like you to help me with conducting a systematic review on animal depression. I will provide the title and abstract for one journal article and would like you to screen the paper for inclusion. Here are the inclusion criteria: Does the study provide primary data of an animal model of depression or depressive-like phenotype with an appropriate control group? Here is the abstract:

Wistar-Kyoto (WKY) rats show endogenous depressive behavior that can be reversed by antidepressants. Given that WKYs exhibit decreased sensitivity to some antidepressants and treatment-resistant depressed patients often show hypothalamic-pituitary-thyroid (HPT) dysregulation, we examined the behavioral and HPT hormonal responses of WKYs to altered thyroid status. “Euthyroid” WKYs had elevated basal plasma TSH and T(3) levels as compared to Wistars. Hypothyroidism increased TSH levels more in WKYs than in Wistars and increased response latency in the open field test (OFT) of WKYs only. Administration of T(4) and T(3) suppressed plasma TSH equally in both strains. Wistars responded to increased T(3) levels with decreased response latency and increased activity in the OFT, but increased immobility in the forced swim test. In contrast, WKYs responded only to the high T(3) levels with decreased response latency in the OFT. These results suggest the existence of a decreased central nervous system sensitivity to thyroid hormones in WKYs that could be related to their depressive behavior.”

ChatGPT v4 response (accessed on 10/15/2023):

Given the abstract and the inclusion criteria, this study appears to meet the inclusion criteria for your systematic review on animal models of depression.

Consideration for inclusion:

 1. Animal model and depressive-like phenotype:

  • The study utilizes Wistar-Kyoto (WKY) rats which are mentioned to show endogenous depressive behavior.

  • This study investigates behaviors and hormonal responses associated with depression in this particular animal model, with a focus on the role of thyroid hormones.

 2. Primary data and appropriate control group:

  • The study provides primary data on behavioral and hormonal responses of WKY rats, comparing them to Wistars (likely serving as the control group), under altered thyroid status, allowing for comparisons between the depressive-like phenotype and controls.

  • The utilization of both “Euthyroid” WKYs and Wistars as controls helps in effectively investigating the behavioral changes and responses in WKY rats due to alterations in thyroid status.

 3. Behavioral test indicating depression:

  • Depressive behavior and responses are assessed using standardized tests like the open field test (OFT) and the forced swim test, which are conventional methods to investigate depressive-like behaviors in animal models.

  • The mention of antidepressant’s impact on WKY rats indicates a focus on depressive-like states.

 4. Relevance to human condition:

  • Insights gained from this study on decreased central nervous system sensitivity to thyroid hormones and its possible relation to depressive behavior in WKYs can have implications in understanding treatment-resistant depression in humans, making it valuable for the review.

Conclusion:

Given the provided abstract, this study does seem to fit the inclusion criteria for a systematic review on animal depression, as it provides primary data on an animal model demonstrating depressive-like behavior, includes an appropriate control group, and employs standardized tests to study the depressive-like states and their relation to thyroid hormones. The detailed exploration of depressive behaviors and hormonal imbalances in relation to altered thyroid status can offer valuable insights for understanding and treating depression.

Abstract (2)

The induction of the activity of cerebral ornithine decarboxylase (EC 4.1.1.17) and mRNA by electrical stimulation exhibits regional differences. The effects of the enzyme inhibitor difluoromethylornithine on these regional variations was examined. Administration of this inhibitor resulted in pronounced depression of both basal and induced activity of ornithine decarboxylase in the hippocampus. Basal activity of the enzyme in the neocortex and the cerebellum appeared to be resistant to difluoromethylornithine but the induced enzyme activity was sensitive to the effects of this inhibitor. Susceptibility to difluoromethylornithine may be directly correlated with a slower turnover rate for ornithine decarboxylase. These results suggest that ornithine decarboxylase in the hippocampus may possess a longer half-life than its counterparts in other regions of the brain. Pretreatment with difluoromethylornithine had no effect on the induced ornithine decarboxylase mRNA in the neocortex. Thus, elevated activity of ornithine decarboxylase enzyme, due to electrical stimulation, appears to not have any effect on either the transcription or the decay rate of the induced ornithine decarboxylase mRNA. These findings support the concept of region-specific regulation of cerebral ornithine decarboxylase. ”

ChatGPT v4 response:

Given the provided abstract and the inclusion criteria, this study does not appear to meet the criteria for inclusion in a systematic review on animal models of depression or depressive-like phenotypes with an appropriate control group.

Consideration for inclusion:

 1. Depression or depressive-like phenotype:

  • The abstract does not provide any information or data related to depression or depressive-like behavior or phenotypes in animals. It is mainly focused on the activity of cerebral ornithine decarboxylase and its regional differences in response to electrical stimulation and inhibitor administration.

 2. Animal model:

  • There is no mention of the use of any specific animal model or any behavioral assays to assess depressive-like symptoms or states in the provided abstract.

 3. Behavioral assessment and primary data:

  • The abstract does not contain any details about behavioral assessments related to depression or depressive-like states. It primarily provides data on the effects of enzyme inhibitor difluoromethylornithine on cerebral ornithine decarboxylase activity and mRNA, but does not relate these findings to depression or depressive-like behaviors.

 4. Relevance to depression:

  • While the study explores biochemical aspects in brain regions like the hippocampus, which is implicated in mood disorders like depression, the connection between the findings and depressive symptoms or states is not established in the provided abstract.

Conclusion:

This article does not seem to fit the inclusion criteria for a systematic review on animal models of depression or depressive-like phenotypes as it lacks primary data on depressive-like behaviors or phenotypes in animals, and there is no mention of the use of animal models representative of depression or an appropriate control group for such a comparison.

From the two examples presented above, it is evident that ChatGPT adeptly comprehends the posed queries and efficiently extracts the inclusion and exclusion criteria. It not only makes clear decisions regarding study inclusion but also articulates its rationale by cross-referencing each criterion.

It is worth noting that the same ChatGPT prompt might result in slightly different responses between different users due to the variability in its response generation, but the conclusion for the eligibility of an abstract is typically consistent. In addition, enhancing the specificity of the criteria within the prompts can further refine the screening process. To illustrate, consider the following enhanced prompts:

Enhanced ChatGPT prompt for abstract (1): “I seek assistance with a systematic review focused on animal depression. I’ll provide the abstract of a particular journal article and would appreciate an assessment for its inclusion based on the following criteria:

The study employs an animal model reflecting depression or a depressive-like phenotype.

It incorporates an appropriate control group.

Primary data is presented within the study.

and here are the exclusion criteria:

The study bypasses the use of an animal model.

The utilized animal model doesn’t characterize depression or a depressive-like phenotype.

Absence of a control group within the study.

The study is primarily a review, devoid of primary data.

Here is the abstract: …”.

ChatGPT v4 response:

Let’s assess the provided abstract based on the inclusion and exclusion criteria:

Inclusion criteria:

 1. Animal model reflecting depression: The study uses Wistar-Kyoto (WKY) rats, which are described as showing “endogenous depressive behavior.” Thus, this criterion is met.

 2. Appropriate control group: The study compares WKY rats to Wistars (another strain), which serves as a comparison or control group. This criterion is met.

 3. Primary data: The abstract mentions specific results such as the elevated basal plasma TSH and T(3) levels in “Euthyroid” WKYs compared to Wistars and the different responses in the open field test (OFT). This suggests the presence of primary data. Thus, this criterion is met.

Exclusion criteria:

 1. Bypasses the use of an animal model: The study uses WKY rats as an animal model. Thus, this criterion is not met, which is good for inclusion.

 2. Animal model doesn’t characterize depression: The WKY rats show “endogenous depressive behavior.” Thus, this criterion is not met, which again supports inclusion.

 3. Absence of a control group: The study does use a control group (Wistars). Thus, this criterion is not met, supporting inclusion.

 4. Study is primarily a review: The abstract presents experimental data and does not indicate that the study is a review. Thus, this criterion is not met, which is favorable for inclusion.

Conclusion:

Given that the abstract meets all the inclusion criteria and none of the exclusion criteria, it should be considered in your systematic review on animal depression.

With these revised prompts applied to the previously discussed abstracts, the conclusions remain consistent. Nonetheless, we contend that this enhanced structure not only provides a more rigorous framework but also aligns more closely with widely accepted practices in systematic review criteria definition.

Automated workflow for streamlining abstract screening via ChatGPT and other tools

For systematic reviews, particularly when dealing with voluminous data, efficient workflows are paramount. The ChatGPT API (application programming interface) offers a dynamic solution, enabling the automation of abstract screening on a large scale, which circumvents the labor-intensive process of manually inputting abstracts into a chatbot interface. In this subsection, we present an automated workflow for streamlining abstract screening via ChatGPT. Note, though this automated workflow uses ChatGPT as the platform, analogous workflows work for other AI platforms like PaLM, Llama, and Claude.

Automated workflow:

Data collection: The preliminary step entails accumulating a list of titles and abstracts. By utilizing carefully crafted keywords, we retrieve these from PubMed and other pertinent databases. This comprehensive approach ensures the potential inclusion of all relevant studies for a subsequent detailed screening. It is worth noting that while this list is expansive, most of these studies may not find their way into the final meta-analysis post-screening.

Automation through Python: We have devised a Python script aimed at harnessing the capabilities of ChatGPT for evaluating the amassed abstracts.

This script interacts with the ChatGPT API (specifically, the GPT-4 version) and, when furnished with tailored prompts, extracts structured responses from ChatGPT.

Typically, the AI’s response commences with a succinct summary, delves into explanations aligned with each criterion, and concludes with a decisive judgment, as exemplified by the examples in the “ Examples of abstract screening by using LLMs ” section.

This automated process efficiently saves ChatGPT’s verdicts on each abstract for ensuing analyses. For instance, it extracts the final decisions regarding the inclusion or exclusion of studies and determines the stance on each pre-specified criterion for every abstract, as exemplified by the last example in the “ Examples of abstract screening by using LLMs ” section.

Additionally, to ascertain the efficiency and cost-effectiveness of this methodology, the script also monitors the time, token usage, and the financial implications of querying the OpenAI API.

In essence, we envision this procedure as delegating the meticulous task of poring over scientific summaries to an AI assistant. This virtual entity meticulously sifts through each summary, determining its alignment with stipulated criteria.
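A stripped-down version of such a script, using the OpenAI Python client, might look like the sketch below. The model name, the exact prompt wording, and the first-word parsing rule are illustrative assumptions; the study's actual scripts also log time, token usage, and cost, which are omitted here.

from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

PROMPT_TEMPLATE = (
    "Conduct a systematic review on {topic}. I provide the title and abstract "
    "for one journal article. Provide an overall assessment based on the "
    "eligibility criteria with only one word, yes or no, then answer each "
    "criterion with yes or no, and finish with an overall explanation.\n\n"
    "Here are the eligibility criteria:\n{criteria}\n\n"
    "Here is the abstract:\n{abstract}"
)

def screen_abstract(topic, criteria, abstract, model="gpt-4"):
    # Send one abstract to the chat completions endpoint and parse the verdict.
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # as deterministic as the API allows
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(topic=topic,
                                                     criteria=criteria,
                                                     abstract=abstract)}],
    )
    text = resp.choices[0].message.content
    include = text.strip().lower().startswith("yes")  # overall verdict comes first
    return include, text

Looping over a list of abstracts and saving each (include, text) pair to a file then reproduces, in outline, the workflow described above for an entire database.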

Tuning parameters in the ChatGPT API: The effectiveness of the ChatGPT API is not only dependent on the input data; it is also significantly influenced by adjustable parameters that can refine the responses. Parameters such as temperature, top k, and top p critically affect model performance by modulating the randomness and focus of the output. While fine-tuning these parameters can improve results, it requires significant technical expertise and resources. Defaults, which are rigorously tested by developers, strike a balance between output quality and ease of use, making LLMs more accessible to a wider range of users without the need for complex parameter optimization. While customization holds promise for custom applications, the default settings provide an efficient and practical solution that facilitates wider adoption of LLM technologies.

Given the complexity of fine-tuning these parameters to optimize performance specifically for abstract screening, our study primarily used the recommended default settings for these parameters as provided by the respective platforms (detailed in Table 1 ). This approach was chosen to maintain the feasibility of our experiments and to ensure that our findings are applicable to typical deployment scenarios.

The automated workflow described applies to other LLM tools or different versions of the same tools. Throughout our study, we have tested several popular LLM tools available to us. The specific LLM models and their versions used at the time of our first submission are detailed in Table  1 , along with the tuning parameters.

The field of large language models (LLMs) has evolved rapidly since we started this study in 2023. New models are frequently released, and existing versions are constantly updated. To account for these advances, we have expanded our analysis to include results from the latest versions of previously studied models, as well as a few entirely new ones. As a best practice, we set the temperature parameter to 0 for all latest models. Table 1 now includes these latest models and their versions.

A brief explanation of these parameters is as follows:

Temperature: The temperature controls the randomness of the outputs, with a range from 0 to 2. Values above 1 produce increasingly random outputs, while a value of 0 makes the output essentially deterministic. The maximum temperature of 2 gives the most creative and variable outputs.

Max length: The max length is the maximum number of tokens the model will generate in a response. A single word generally corresponds to one to a few tokens.

Stop sequences: This parameter controls which tokens or phrases will stop the LLM from generating more text.

Top p: When generating text, the model samples only from the smallest set of most likely tokens whose cumulative probability reaches p. The top p value is thus a cumulative probability cutoff for the model’s selection of tokens; lower top p values mean sampling from a smaller, more top-weighted nucleus.

Top k: When generating text, the model samples from the k most likely tokens. Lower top k values restrict sampling to fewer, more likely tokens; with a top k of 1, the model always selects the single most likely token.

Frequency penalty: This frequency penalty parameter controls how the LLM penalizes tokens that have already appeared in the input and output text. A frequency penalty of 0 implies that the frequency of the tokens does not impact the LLM and will generate them based on their probability.

Presence penalty: This parameter controls how the LLM penalizes tokens that have already appeared in the text, regardless of how often, which encourages the model to introduce new tokens and topics. A presence penalty of 0 means that whether a token has already appeared does not affect its probability of being generated again.
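For orientation, the snippet below shows how these parameters map onto a typical chat completions call with the OpenAI Python client; the values shown are defaults or near-defaults rather than tuned recommendations, and the model name is illustrative. Note that top k is not exposed by this particular API, though it is available on some other providers' APIs (for example, Gemini and Claude).

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "...a screening prompt constructed as in the workflow above..."

response = client.chat.completions.create(
    model="gpt-4",           # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,           # 0 = as deterministic as the API allows
    top_p=1.0,               # keep the full sampling nucleus
    max_tokens=512,          # cap on the length of the generated response
    frequency_penalty=0.0,   # no penalty for frequently repeated tokens
    presence_penalty=0.0,    # no penalty for tokens that have already appeared
    stop=None,               # no custom stop sequences
)
print(response.choices[0].message.content)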

Study design

In our pursuit to assess ChatGPT’s proficiency in abstract screening, we selected certain benchmark databases that have existing performance data from other methodologies. This selection aids in a comparative analysis of performance.

In selecting the systematic reviews for our study, we used a systematic approach guided by specific criteria to ensure relevance and reliability. These studies were selected from the publicly available SYNERGY [ 23 ] dataset, which contains 26 systematic reviews from different disciplines. Key selection criteria included:

Clarity and conciseness of eligibility criteria: The selected studies had well-defined and explicit eligibility criteria. This clarity is essential for accurate replication of the study selection process, which is critical for assessing the performance of LLM tools in an analogous real-world application.

Completeness and cleanliness of data: We ensured that the selected reviews had complete datasets, with all necessary information on included and excluded studies clearly documented, minimizing the risk of ambiguities affecting our analysis.

In addition, to comply with the AMSTAR-2 [ 26 ] guidelines, in particular point 5, we reviewed the methodologies of these reviews to confirm the selection of studies was performed in duplicate and disagreements were resolved by consensus. While our analysis assumes that these systematic reviews adhere to high standards, we recognize the inherent limitations of using pre-existing datasets as a proxy for gold standards in the discussion section.

We picked the following 3 databases from the publicly available SYNERGY dataset [ 23 ]:

Bannach-Brown 2016 [ 27 ]—topic: use of animal models to study depressive behavior

Human-curated (gold standard) results: 1258 excluded abstracts and 230 included abstracts.

We randomly selected 100 excluded abstracts and 100 included abstracts for screening by LLM tools.

Meijboom 2021 [ 28 ]—topic: retransitioning of etanercept in patients with a rheumatic disease

Human-curated (gold standard) results: 599 excluded abstracts and 32 included abstracts.

We randomly selected 100 excluded abstracts and used all 32 included abstracts for screening by LLM tools.

Menon 2022 [ 29 ]—topic: methodological rigor of systematic reviews in environmental health

Human-curated (gold standard) results: 896 excluded abstracts and 73 included abstracts.

We randomly selected 100 excluded abstracts and all 73 included abstracts for screening by LLM tools.

For each chosen database, abstracts were categorized as either “cases” (those included based on a gold standard) or “controls” (those excluded per the gold standard). From each category, we randomly selected 100 abstracts (we used all abstracts if there were fewer than 100 abstracts in that category in a database). These abstracts underwent evaluation by ChatGPT (v4.0) as per our established workflow. Subsequently, ChatGPT’s decisions were compared against the gold standard to determine sensitivity, specificity, and overall accuracy. The same abstracts were also processed using other LLM tools as listed in Table  1 to record their respective verdicts.

Statistical analysis

To quantify the efficacy of ChatGPT and other AI tools for each database, we calculated the following metrics: (1) sensitivity, (2) specificity, and (3) overall accuracy, where sensitivity is defined as the number of true positives divided by the sum of true positives and false negatives, specificity as the number of true negatives divided by the sum of true negatives and false positives, and accuracy as the sum of true positives and true negatives divided by the total number of abstracts. For each metric, the associated 95% confidence intervals were also determined. Although it is very common in the field to report F1 score, recall rate, and precision, we believe it is more appropriate to report sensitivity and specificity given this study design. In addition, F1 score, recall rate, and precision can be derived from sensitivity and specificity.
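The definitions above translate directly into code. The short Python sketch below computes each metric with a normal-approximation 95% confidence interval; the study's own analysis was done in R, and the exact confidence-interval method it used is not restated here, so this is an illustrative reconstruction rather than the paper's procedure.

import numpy as np

def metric_with_ci(numerator, denominator, z=1.96):
    # Proportion with a normal-approximation 95% confidence interval.
    p = numerator / denominator
    half_width = z * np.sqrt(p * (1 - p) / denominator)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

def screening_metrics(tp, fp, tn, fn):
    sensitivity = metric_with_ci(tp, tp + fn)              # TP / (TP + FN)
    specificity = metric_with_ci(tn, tn + fp)              # TN / (TN + FP)
    accuracy = metric_with_ci(tp + tn, tp + fp + tn + fn)  # correct / total
    return sensitivity, specificity, accuracy

# Example: 93 of 100 included and 96 of 100 excluded abstracts correctly
# classified gives sensitivity 0.930, specificity 0.960, accuracy 0.945.
print(screening_metrics(tp=93, fp=4, tn=96, fn=7))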

Furthermore, to explore the potential of a unified decision-making process, we combined the decisions from all AI tools using a voting mechanism. The majority decision across the tools was taken as the final verdict for each abstract. For this consolidated approach, we again computed sensitivity, specificity, overall accuracy, and the associated 95% CIs for each database. We also explored the use of latent class analysis (LCA), a model-based clustering approach, to derive consolidated decisions. More details on this LCA approach are provided in the “ Beyond majority voting ” section.
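The majority-voting step itself is straightforward; a minimal sketch with illustrative votes is shown below (ties, which can occur with an even number of tools, are counted as exclusion here). The latent class analysis alternative is not reproduced in this sketch.

import numpy as np

def majority_vote(decisions):
    # decisions: rows = LLM tools, columns = abstracts, 1 = include, 0 = exclude.
    decisions = np.asarray(decisions, dtype=int)
    return decisions.sum(axis=0) > decisions.shape[0] / 2  # ties count as exclude

votes = [[1, 0, 1, 1],   # tool 1
         [1, 0, 0, 1],   # tool 2
         [0, 1, 0, 1]]   # tool 3
print(majority_vote(votes))  # [ True False False  True]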

For a given database, 100 cases and 100 controls yield a two-sided 95% confidence interval with a half-width of 0.048 when the underlying sensitivity (specificity) is approximately 95%.

All statistical analyses were conducted using the R statistical software (version 4.3.1). All tests were two-sided with an alpha level set at 0.05 unless otherwise mentioned.

To improve the transparency and reproducibility of studies using AI tools, we have included the TRIPOD + AI checklist [ 30 ] in our report. This checklist has been adapted to reflect the specifics of our research, which focuses on the evaluation of large language models for abstract screening rather than diagnostic or prognostic modeling. The completed checklist is presented in Table S 1 , to provide readers with a comprehensive overview of our study’s adherence to established reporting standards.

We present the results for each of the 3 databases. For each, we first present the prompts we used when we called LLM tools to screen an abstract, then present the performance data (accuracy, sensitivity, and specificity for each method or LLM tool), followed by a summary of the performance, and a comparison of the performance of the different methods to that based on ChatGPT v4.0.

Results on the Bannach-Brown 2016 database (see Table  2 )

The prompts we used for screening abstracts in this database are as follows:

Conduct a systematic review on animal depression. I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion. After answering all the criteria with yes or no, then provide an overall explanation.

Here is the eligibility criteria: Inclusion Criteria: 1. Any article providing primary data of an animal model of depression or depressive-like phenotype with an appropriate control group (specified above). 2. Animals of all ages, sexes and species, where depression-like phenotype intended to mimic the human condition have been induced. Including animal models where depressive-like phenotypes are induced in the presence of a comorbidity (e.g. obesity or cancer). 3. All studies that claim to model depression or depressive-like phenotypes in animals. Studies that induce depressive behavior or model depression and that also test a treatment or intervention (prior or subsequent to model induction), with no exclusion criteria based on dosage, timing or frequency. 4. Studies measuring behavioral, anatomical and structural, electrophysiological, histological and/or neurochemical outcomes and where genomic, proteomic or metabolomic outcomes are measured in addition to behavioral, anatomical, electrophysiological, histological or neurochemical outcomes. Exclusion Criteria: 1. Review article, editorials, case reports, letters or comments, conference or seminar abstracts, studies providing primary data but not appropriate control group. 2. Human studies and ex vivo, in vitro or in silico studies. Studies will be excluded if authors state an intention to induce or investigate only anxiety or anxious behavior. Studies will be excluded if there is no experimental intervention on the animals (e.g. purely observational studies). 3. Studies that investigate treatments or interventions, but no depressive behavior or model of depression is induced (e.g. toxicity and side-effect studies). 4. Where metabolic outcome measures are the primary outcome measure of a study. Where genomic, proteomic, metabolic or metabolomic outcomes are the sole outcome measures in a study, they will be excluded.

Here is the abstract:

Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.945) and specificity (0.960), along with satisfactory sensitivity (0.930). The combined decision using majority voting improved sensitivity (0.970) considerably but reduced specificity (0.870). Comparatively, the zero-shot method was less effective across these metrics, while the hybrid method attained superior accuracy (0.961) and specificity (0.982), albeit with reduced sensitivity (0.843). We acknowledge that our findings regarding the zero-shot and hybrid methods differ from those reported in [ 18 ] for the same set of abstracts. This discrepancy could be attributed to the use of distinct sets of embeddings in our analysis compared to the previous study.

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions, but did not surpass the performance of ChatGPT (v4.0). Claude 3 performed well, but still did not surpass the performance of ChatGPT (v4.0).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better ( p value = 0.002) than Google PaLM 2 in terms of sensitivity; ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value = 0.008) and better than Llama-2 ( p value < 0.001) in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was no statistically significant difference ( p value = 0.134) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was significantly worse ( p value = 0.008) than ChatGPTv4.0 in terms of specificity.

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We assessed the performance (sensitivity and specificity) of ChatGPT v4.0 against both the zero-shot and hybrid approaches using the McNemar test. Specifically, we aligned the screening results from 100 cases and 100 controls as per the ChatGPT v4.0 method and similarly for the zero-shot and hybrid methods, testing for inconsistencies between these approaches as previously done. Our analysis revealed that ChatGPT v4.0 significantly outperformed the zero-shot method in sensitivity ( p value < 0.001) but showed comparable effectiveness in specificity ( p value = 0.37). Additionally, ChatGPT v4.0 demonstrated marginally superior sensitivity compared to the hybrid method ( p value = 0.07), while its performance in specificity was similar ( p value = 1.00).
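For readers who want to reproduce this kind of paired comparison, the sketch below builds the 2x2 agreement table from per-abstract correctness indicators and applies an exact McNemar test via statsmodels; the counts in the example are hypothetical, and the study's own analysis was carried out in R.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_mcnemar(correct_a, correct_b):
    # correct_a / correct_b: booleans marking whether tool A / tool B classified
    # each of the same abstracts correctly (e.g., the 100 included abstracts
    # when comparing sensitivity).
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    table = [[np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
             [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)]]
    return mcnemar(table, exact=True).pvalue

# Hypothetical example: tool A is correct on 93/100 cases, tool B on 84/100.
a = np.array([True] * 93 + [False] * 7)
b = np.array([True] * 82 + [False] * 11 + [True] * 2 + [False] * 5)
print(paired_mcnemar(a, b))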

Results on the Meijboom 2021 database (see Table  3 )

Conduct a systematic review on transitioning patients from an originator to a corresponding biosimilar.

I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.

After answering all the criteria with yes or no, then provide an overall explanation.

Here is the eligibility criteria:

Articles were included if they met the following criteria:

Study involved transitioning from a TNFα inhibitor (including etanercept, infliximab, and adalimumab) originator to a biosimilar

The number of patients who retransitioned was reported or could be calculated

The article was an original research article published in a peer-reviewed journal

The article included baseline characteristics of the patients who transitioned

The article was written in English

The full-text version of the article could be obtained.

Transitioning was defined as patients in whom the biosimilar was introduced after the originator, without treatment with other drugs in between. Retransitioning was defined as restarting the originator directly after discontinuing a biosimilar, without treatment with other drugs in between. In summary, transitioning was defined as switching from the originator to a biosimilar; retransitioning was defined as switching from the originator to a biosimilar and back to the originator. Both transitioning and retransitioning involved changes with the same active biological substance.

Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.840), although its specificity (0.860) and sensitivity (0.812) were not the highest. Compared to ChatGPTv4.0, the combined decision using majority voting did not improve overall accuracy (0.720), but improved sensitivity (1.000) at the expense of specificity (0.630).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better ( p value < 0.001) than Google PaLM 2, but significantly worse than ChatGPT3.5 ( p value = 0.001) and Llama 2 in terms of sensitivity; ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value < 0.001) and better than Llama 2 ( p value < 0.001), but worse than Google PaLM 2 ( p value = 0.002), in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was a statistically significant difference ( p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was not significantly worse ( p value > 0.50) than ChatGPTv4.0 in terms of specificity.

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0), and Gemini-1.0-pro (vs. PaLM 2) did not improve performance either. However, Llama 3 (vs. Llama 2) improved over its older version and surpassed the performance of ChatGPT (v4.0). Claude 3 also slightly surpassed the performance of ChatGPT (v4.0).

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We evaluated the performance of ChatGPT v4.0, focusing on sensitivity and specificity, in comparison with the zero-shot and hybrid approaches, employing the McNemar test as described above. In this analysis, we aligned the screening results from 32 cases and 100 controls for the tests. Our findings indicated that ChatGPT v4.0 significantly surpassed the zero-shot method in sensitivity ( p value = 0.0002) and exhibited marginally improved specificity ( p value = 0.099). Furthermore, ChatGPT v4.0 showed notably higher sensitivity than the hybrid method ( p value < 0.001), although its specificity was comparatively lower.

Results on the Menon 2022 database (see Table  4 )

“Conduct a systematic review on the methodological rigour of systematic reviews in environmental health.

I provide the title and abstract for one journal article.

Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation.

Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.

Here are the eligibility criteria:

To be eligible for inclusion in the SR sample, documents had to fulfill the following criteria:

Identify explicitly as a “systematic review” in their title

Assess the effect of a non-acute, non-communicable, environmental exposure on a health outcome. Environmental exposures can include air and water pollutants, radiation, noise, occupational hazards, lifestyle factors (like diet or physical activity) and lifestyle choices influenced by family and peers (like substance use), social and economic factors (like stress from work or living conditions).

Include studies in people or mammalian models

Be available in HTML format

Abstract X”

Among all the LLM tools we tested, ChatGPT v4.0 achieved the highest accuracy (0.913) and sensitivity (0.932), but not the highest specificity (0.900). Compared to ChatGPT v4.0, the combined decision using majority voting did not improve overall accuracy (0.884) or sensitivity (0.808), but it improved specificity (0.940).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) of ChatGPT v4.0 against the other LLM tools using the McNemar test and found that ChatGPT v4.0 performed significantly better than ChatGPT v3.5 ( p value < 0.001), Google PaLM 2, and Llama 2 ( p value = 0.02) in terms of sensitivity, but worse than ChatGPT v3.5 and Google PaLM 2 in terms of specificity. Combining the decisions of the different LLM tools using majority voting did not improve overall accuracy compared to ChatGPT v4.0. Specifically, there was a statistically significant difference in sensitivity between the combined decision (majority voting) and ChatGPT v4.0 ( p value = 0.008), and the combined decision was not significantly different from ChatGPT v4.0 in terms of specificity ( p value = 0.134).

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). However, both Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions and surpassed the performance of ChatGPT (v4.0). Claude 3 also performed well but did not surpass the performance of ChatGPT (v4.0).

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We aligned the screening results from 73 cases and 100 controls based on the ChatGPT v4.0 method, and similarly for the zero-shot and hybrid methods, to test for inconsistencies between these approaches, using the McNemar test as done in previous assessments. Our analysis showed that ChatGPT v4.0 significantly outperformed the zero-shot method in both sensitivity ( p value < 0.001) and specificity ( p value = 0.016). In comparison with the hybrid method, ChatGPT v4.0 also demonstrated superior sensitivity ( p value < 0.001) and better specificity ( p value = 0.04).

Monetary cost and time cost

To use the ChatGPT API or other LLM tools, the owners of these platforms charge a predetermined rate for access to the corresponding APIs. These fees are calculated in USD per thousand tokens, where tokens are the basic units used by these LLM platforms to quantify text length. In this context, a token can represent a word, a punctuation mark, or a character. The financial cost of screening 200 abstracts was approximately $6 for ChatGPT v4.0, $0.2 for ChatGPT v3.5, $10 for Llama 2 (using Replicate), while Google PaLM 2 offered its services for free to invited developers. Thus, the cumulative cost of evaluating 200 abstracts across all platforms was approximately $16.2. The cumulative cost of evaluating 200 abstracts across all latest models ($3 for GPT-4-Turbo, $0.05 for GPT-3.5-Turbo, free for Gemini-1.0-pro, $0.05 for Llama-3, $4 for Claude) was less, approximately $7.1. In terms of time efficiency, processing 200 abstracts with each of these LLM tools took approximately 10–20 min using a single thread. However, it is imperative to recognize that abstract screening lends itself well to parallelization. Consequently, one could significantly speed up the process by setting up multiple threads to simultaneously screen different subsets of abstracts, thereby reducing the overall time required for completion. This parallel approach not only increases efficiency, but also ensures that large amounts of data can be processed in a timely manner, making LLM tools even more attractive for large-scale abstract screening tasks. In summary, the monetary and time costs of using LLM tools for abstract screening are negligible compared to manual labeling.
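As a rough illustration of the parallelization point, the sketch below fans screening calls out over a thread pool. The function `screen_abstract` is a hypothetical wrapper around whichever LLM API is used (it would send the prompt plus the title and abstract and parse the yes/no answer); it is not the authors' actual script.

```python
# Hedged sketch: screen many abstracts concurrently. API calls are I/O-bound,
# so a thread pool shortens wall-clock time roughly in proportion to max_workers
# (subject to the provider's rate limits).
from concurrent.futures import ThreadPoolExecutor

def screen_abstract(abstract_text: str) -> str:
    # Placeholder for the real LLM API call; return "include" or "exclude".
    return "exclude"

def screen_all(abstracts: dict[str, str], max_workers: int = 8) -> dict[str, str]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        decisions = pool.map(screen_abstract, abstracts.values())
    return dict(zip(abstracts.keys(), decisions))

print(screen_all({"abstract_001": "…", "abstract_002": "…"}))
```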

Beyond majority voting

We have expanded our analysis to include a variety of approaches for synthesizing decisions across different LLM tools. Our methodology is inspired by the problem of combining multiple diagnostic tests in the absence of a gold standard, akin to situations where human expert consensus is not available. Several publications discuss such scenarios [ 31 , 32 ]; building on this work, we proposed using latent class analysis (LCA) models.

Latent class analysis (LCA) is a statistical method used to identify subgroups within a population, which are not directly observed (hence “latent”) [ 33 , 34 , 35 ]. It is particularly useful when the research interest lies in categorizing individuals into mutually exclusive groups based on their responses to multiple observed variables. In the context of abstract screening, LCA can offer a sophisticated means of integrating decisions from different LLM tools without relying on a gold standard, typically provided by human expert consensus. This approach assumes that the unobserved subgroups (or “latent classes”) explain the dependence between the observed decisions made by each of the LLM tools.

Utilizing the LCA model, we treat the decisions from all LLM tools as dichotomous variables, corresponding to the adherence to each inclusion or exclusion criterion, as well as the overall decision. For instance, within the Bannach-Brown 2016 database (BB2016), there are eight criteria in total with four criteria each for inclusion and exclusion and one overall decision for inclusion, resulting in a total of nine binary items per LLM tool. In our analysis, we incorporated decisions from GPT v4.0, v3.5, and Llama 2. Decisions from Google PaLM 2 were excluded due to a high frequency (10% or more) of incomplete responses. Consequently, for the Bannach-Brown 2016 database, we worked with 27 binary items. For other databases such as Meijboom 2021 (Mj2021) and Menon 2022 (Me2022), the binary items totaled 21 and 15, respectively. It is important to note that LCA models were fitted to the binary data of each database independently.

The LCA model fitting process enables us to calculate the posterior probabilities of each abstract belonging to specific latent classes or subgroups. Abstracts are then categorized based on these probabilities, with assignment to the class for which an abstract has the highest posterior membership probability. The determination of the number of latent classes is a critical step in the LCA model fitting, which requires a priori specification. In our evaluation, we explored models with class numbers ranging from 2 to 6 and utilized the Bayesian information criterion (BIC) to identify the most “optimal” LCA model for our datasets.
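To illustrate the model-fitting step described above, here is a compact, generic sketch that fits a latent class model to binary decisions by expectation-maximization and selects the number of classes by BIC. It is written for illustration only and is not the code used in the study (the authors' R/Python code is available in their repository); the simulated 200 × 27 decision matrix stands in for the real screening data.

```python
# Hedged sketch: latent class analysis of binary screening decisions via EM,
# with the number of classes chosen by BIC. Illustrative only.
import numpy as np

def fit_lca(X, n_classes, n_iter=200, seed=0):
    """X: (n_abstracts, n_items) 0/1 matrix. Returns (loglik, bic, posterior)."""
    rng = np.random.default_rng(seed)
    n, n_items = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)                     # class proportions
    theta = rng.uniform(0.25, 0.75, size=(n_classes, n_items))   # P(item = 1 | class)
    for _ in range(n_iter):
        # E-step: posterior class membership for each abstract.
        log_p = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_p -= log_p.max(axis=1, keepdims=True)
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class proportions and item probabilities.
        pi = post.mean(axis=0)
        theta = np.clip((post.T @ X) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    class_lik = np.exp(X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T)
    loglik = np.sum(np.log(class_lik @ pi))
    n_params = (n_classes - 1) + n_classes * n_items
    bic = -2 * loglik + n_params * np.log(n)
    return loglik, bic, post

# Simulated stand-in for the 27 binary items of the BB2016 database.
X = (np.random.default_rng(1).random((200, 27)) < 0.5).astype(float)
best_k = min(range(2, 7), key=lambda k: fit_lca(X, k)[1])
print("Number of classes selected by BIC:", best_k)
```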

Table 5 shows that after applying the Bayesian information criterion (BIC) to determine the most appropriate model for our data, we identified a 3-class model as the best fit for the binary decisions derived from the BB2016 database. Similarly, a 4-class model was optimal for the Mj2021 database, while a 3-class model was again best for the Me2022 database. The confusion matrices generated by the selected LCA models for each database provided a clear juxtaposition between the LLM-assigned classes and the actual labels of the abstracts (see Table  6 ).

The performance metrics derived from these models are noteworthy. For the BB2016 database, if we categorize abstracts in class 2 as “included” and those in classes 1 and 3 as “excluded,” the LCA model achieved a sensitivity of 93% and a specificity of 96%, indicating a high degree of accuracy in classifying relevant and irrelevant abstracts. For the Mj2021 database, if we treat class 1 as the “included” group and classes 2 and 3 as the “excluded” group, the model achieved a perfect sensitivity of 100%, meaning that it correctly identified all relevant abstracts, although its specificity was lower at 79%, indicating a higher rate of false positives. Similarly, for the Me2022 database, treating class 1 as the “excluded” category and classes 2 and 3 as the “included” group, the model showed a sensitivity of 94.5% and a specificity of 83%, a good balance between identifying relevant abstracts and avoiding false inclusions.

These results highlight the robustness of the latent class analysis approach for the abstract screening task, providing an effective way to classify abstracts when a gold standard is not available. The varying levels of sensitivity and specificity across databases underscore the need to tailor the LCA model to the specific characteristics of each dataset and point to directions for future research.

We have incorporated results from the latest LLM models to increase the robustness and relevance of our findings. However, to maintain consistency and comparability with our original analyses, we have not incorporated these new results into the previously established majority voting or latent class analysis (LCA) approaches. Instead, we have chosen to make all raw data, including results from these newer models, freely available in our GitHub repository. This approach allows interested researchers and practitioners to conduct further investigations or apply alternative methods of analysis. By providing access to this additional data, we aim to promote transparency and enable the broader community to engage with and potentially build upon our work.

This study began with a rigorous exploration of the capabilities of large language models (LLMs) in abstract screening. We used automation scripts developed in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, Google PaLM 2, and Meta Llama 2, as well as the latest versions of these tools. Our central goal was to evaluate the efficiency and accuracy of these tools across three different databases of abstracts, leading us to a nuanced understanding of their potential in this context.

Large language models (LLMs), particularly ChatGPT, have garnered global attention since their inception. Employing LLMs for abstract screening in systematic reviews is an innovative concept [ 32 , 33 ] and remains underexplored. This study presents the first comprehensive evaluation of LLMs applied to systematic review processes. The findings are encouraging, suggesting that LLMs could revolutionize abstract screening. Specifically, ChatGPT v4.0 exhibited stellar performance across three test scenarios, achieving an accuracy of at least 85%. Furthermore, it attained sensitivity and specificity rates ranging from 80% to an impressive 95%. These exceptional outcomes highlight the substantial promise of LLMs in abstract screening, offering an efficient and capable alternative to the conventional, laborious approaches that typically necessitate extensive human annotation.

However, it is important to acknowledge that we are still in the early stages of integrating LLM tools into the abstract screening field, and they are not without their imperfections—for example, even the best-performing LLM tool, ChatGPT v4.0, occasionally excluded a substantial share (20%) of relevant studies. These tools are not a universal solution to all the challenges associated with abstract screening, and they are not ready to completely replace human expertise in this area. Instead, they should be embraced as invaluable assistants in the abstract screening process.

In discussing the limitations of our study, it is important to recognize the constraints associated with our dataset selection and model configuration. We used only three databases from the SYNERGY set, limiting the generalizability of our findings across disciplines and datasets. In addition, the reliance on human-curated labels as a gold standard, while necessary, introduces potential biases due to the retrospective nature of our analysis. These labels may contain errors, and the methodology used in the original reviews to resolve discrepancies may affect the validity of our conclusions.

The parameters chosen for our LLMs—temperature, top k, top p, and prompts—were set to defaults or based on natural conversations to balance output quality with ease of use. While this approach minimizes the need for technical expertise, it may not be optimal for all screening scenarios. In addition, the reliance of LLMs on abstracts alone, as opposed to full-text evaluations, presents a fundamental challenge; critical data influencing inclusion or exclusion may not be present in the abstracts, potentially compromising screening accuracy.

In addition, the rapid development of LLMs and their “black box” nature pose challenges to the transparency that is essential in scientific settings. The environmental impact of using these computationally intensive models is also significant [ 36 ], requiring sustainability considerations. Future research should focus on refining these tools to increase transparency and efficiency, and ensure their responsible development and use in systematic reviews.

Our research suggests that LLM tools are ready to take on a role in abstract screening and are poised to have an immediate and positive impact on the process. Their integration into abstract screening can manifest itself in a few ways. They can serve as autonomous AI reviewers, adding an extra layer of scrutiny and ensuring thoroughness. Our findings suggest that a collective decision, such as one derived from majority voting, can sometimes improve sensitivity, underscoring the potential of LLM tools as a reliable safeguard against oversight, ensuring both comprehensiveness and accuracy.

In addition, LLM tools can facilitate a synergistic partnership with human experts. They are adept at identifying “high-risk” abstracts where different LLM tools have reached different judgments, flagging them for further human evaluation, and promoting a careful and error-free screening process while minimizing human effort.

Another exciting prospect is the integration of LLM tools with hybrid and active learning approaches. In this scenario, LLM tools could autonomously annotate abstracts in the training set, minimizing the need for human labeling. Subsequently, these labeled abstracts could be used to train custom NLP models, paving the way for a streamlined and efficient abstract screening process and yielding significant time and resource savings. Further research is needed to understand how the “uncertainty” in LLM-based decisions, when human-curated labels are not used, affects the performance of the hybrid approach. We also note from the test examples that the performance of hybrid approaches varies widely from example to example and depends on the text embedding tools used. Extensive research is needed to perfect these hybrid approaches.

Another future research direction is to explore how to fine-tune different versions of LLM tools and how to derive collective decisions. One idea is that, by using different tuning parameters, one could create several versions of an LLM tool; if these versions perform similarly in terms of accuracy but produce decisions that are not highly correlated, we would expect better collective decisions, as observed in many ensemble learning studies [ 37 ]. However, this could be costly and would require considerable exploration.

More importantly, it would be particularly valuable to prospectively test these LLM tools, or their integration with other approaches such as zero-shot classification and active learning, in real-world meta-analysis and systematic review projects. This would provide deeper insights into their practical utility and effectiveness and a more comprehensive understanding of their impact in live research environments.

In summary, while LLM tools may not be able to fully replace human experts in abstract screening, their ability to transform the screening process is both undeniable and imminent. With continued advances in technology and ongoing refinement, these tools will play a critical role in the future of abstract screening, ushering in a new era of efficiency and effectiveness.

Availability of data and materials

The datasets analyzed during the current study are available in the SYNERGY repository, https://github.com/asreview/synergy-dataset . We have created a GitHub repository https://github.com/mikeli380/LLMAbstractScreening to host our code (R/Python), abstracts, and intermediate results so that interested readers can replicate our findings.

References

Higgins JPT, et al. Cochrane handbook for systematic reviews of interventions. 2019.


Moher D, et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7). https://doi.org/10.1371/journal.pmed.1000097 .

Sackett DL, Rosenberg WMC, Gray JAM, Haynes RB,  Richardson WS. Evidence based medicine: what it is and what it isn’t. 1996. Clin Orthop Relat Res. 2007;455.  https://doi.org/10.1136/bmj.312.7023.71 .

Lavis JN, Posada FB, Haines PA, Osei E. Use of research to inform public policymaking. Lancet. 2004;364(9445). https://doi.org/10.1016/S0140-6736(04)17317-0 .

Bastian H, Glasziou P,  Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up?. PLoS Med. 2010;7(9).   https://doi.org/10.1371/journal.pmed.1000326 .

Polanin JR, Pigott TD, Espelage DL, Grotpeter JK. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res Synth Methods. 2019;10(3). https://doi.org/10.1002/jrsm.1354 .

Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med. 2008;358(3). https://doi.org/10.1056/nejmsa065779 .

Edwards P, Clarke M, DiGuiseppi C, Pratap S, Roberts I, Wentz R. Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records. Stat Med. 2002;21(11). https://doi.org/10.1002/sim.1190 .

Wang Z, Nayfeh T, Tetzlaff J, O’Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE. 2020;15(1). https://doi.org/10.1371/journal.pone.0227742 .

O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4(1). https://doi.org/10.1186/2046-4053-4-5 .

OpenAI. Introducing ChatGPT Plus. OpenAI Blog; 2023. https://openai.com/blog/chatgpt-plus . Accessed 8 Nov 2023.

Google DeepMind. Introducing PaLM 2. Google Blog; 2023.  https://blog.google/technology/ai/google-palm-2-ai-large-language-model/ . Accessed 8 Nov 2023.

Meta AI Research. Meta and Microsoft introduce the next generation of Llama. Facebook AI Blog; 2023. https://ai.meta.com/blog/llama-2/ . Accessed 8 Nov 2023.

Claude.ai. Claude AI API reference. 2024. https://docs.anthropic.com/en/api/getting-started . Accessed 4 Jun 2024.

Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. 2012. https://doi.org/10.1145/2110363.2110464 .

Alshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems. 2023;11(7). https://doi.org/10.3390/systems11070351 .

Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models ‘the answer’ to bringing us closer to systematic review automation?  Syst Rev. 2023;12(1).   https://doi.org/10.1186/s13643-023-02243-z .

Xian Y, Lampert CH, Schiele B, Akata Z. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell. 2019;41(9). https://doi.org/10.1109/TPAMI.2018.2857768 .

Moreno-Garcia CF, Jayne C, Elyan E, Aceves-Martins M. A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Decis Anal J. 2023;6.  https://doi.org/10.1016/j.dajour.2023.100162 .

Ferdinands G, et al. Performance of active learning models for screening prioritization in systematic reviews: a simulation study into the Average Time to Discover relevant records. Syst Rev. 2023;12(1). https://doi.org/10.1186/s13643-023-02257-7 .

Boetje J, van de Schoot R. The SAFE procedure: a practical stopping heuristic for active learning-based screening in systematic reviews and meta-analyses. Syst Rev. 2024;13(1). https://doi.org/10.1186/s13643-024-02502-7 .

Yang Y, Loog M. A benchmark and comparison of active learning for logistic regression. Pattern Recognit. 2018;83. https://doi.org/10.1016/j.patcog.2018.06.004 .

De Bruin RJ, Ma Y, Ferdinands G, Teijema J, Van de Schoot. SYNERGY - open machine learning dataset on study selection in systematic reviews. DataverseNL; 2023. https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/HE6NAQ .

Millard SJ, Weston-Green K, Newell KA. The Wistar-Kyoto rat model of endogenous depression: a tool for exploring treatment resistance with an urgent need to focus on sex differences. Prog Neuropsychopharmacol Biol Psychiatry. 2020;101. https://doi.org/10.1016/j.pnpbp.2020.109908 .

Zawia NH, Mattia CJ, Bondy SC. Differential effects of difluoromethylornithine on basal and induced activity of cerebral ornithine decarboxylase and mRNA. Neuropharmacology. 1991;30(4). https://doi.org/10.1016/0028-3908(91)90058-J .

Shea BJ, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358. https://doi.org/10.1136/bmj.j4008 .

Bannach-Brown A, Liao J, Wegener G, Macleod M. Understanding in vivo modelling of depression in non-human animals: a systematic review protocol. Evid Based Preclin Med. 2016;3(2). https://doi.org/10.1002/ebm2.24 .

Meijboom RW, Gardarsdottir H, Becker ML, ten Wolde S, Egberts TCG, Giezen TJ. Incidence of and reasons and determinants associated with retransitioning from biosimilar etanercept to originator etanercept. BioDrugs. 2021;35(6). https://doi.org/10.1007/s40259-021-00501-x .

Menon JML, Struijs F, Whaley P. The methodological rigour of systematic reviews in environmental health. Crit Rev Toxicol. 2022;52(3). https://doi.org/10.1080/10408444.2022.2082917 .

Collins GS, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7). https://doi.org/10.1136/bmjopen-2020-048008 .

Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am J Epidemiol. 1995;141(3). https://doi.org/10.1093/oxfordjournals.aje.a117428 .

Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics. 1980;36(1). https://doi.org/10.2307/2530508 .

Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics. 2001;57(1). https://doi.org/10.1111/j.0006-341X.2001.00158.x .

Collins LM, Lanza ST. Latent class and latent transition analysis: with applications in the social, behavioral, and health sciences. 2010.


McCutcheon AL. Latent class analysis. Newbury Park: Sage; 1987.

Rillig MC, Ågerstrand M, Bi M, Gould KA, Sauerland U. Risks and benefits of large language models for the environment. Environm Sci Technol. 2023;57(9). https://doi.org/10.1021/acs.est.3c01106 .

Sagi O, Rokach L. Ensemble learning: a survey.  Wiley Interdiscip Rev Data Min Knowl Discov. 2018;8(4). https://doi.org/10.1002/widm.1249 .


Acknowledgements

The authors thank the reviewers and the Editor for their insightful comments and suggestions, which have greatly improved the quality of this work.

Not applicable.

Author information

Authors and Affiliations

Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA

Michael Li & Xianming Tan

Department of Mathematics and Statistics, University of North Carolina at Greensboro, Greensboro, NC, 27402, USA

Jianping Sun

Department of Biostatistics and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA

Xianming Tan


Contributions

XT conceived the paper. ML and XT devised the study. ML, JS, and XT analyzed the datasets. ML, JS, and XT wrote the manuscript. All authors edited and approved the final manuscript.

Corresponding author

Correspondence to Xianming Tan .

Ethics declarations

Ethics approval and consent to participate

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1: Table S1. TRIPOD+AI checklist.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article.

Li, M., Sun, J. & Tan, X. Evaluating the effectiveness of large language models in abstract screening: a comparative analysis. Syst Rev 13 , 219 (2024). https://doi.org/10.1186/s13643-024-02609-x


Received : 20 December 2023

Accepted : 11 July 2024

Published : 21 August 2024

DOI : https://doi.org/10.1186/s13643-024-02609-x


  • Abstract screening
  • Automated text analysis
  • Google PaLM




  • Open access
  • Published: 10 February 2024

The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks

  • Kent F. Hubert (ORCID: orcid.org/0009-0009-7348-5102),
  • Kim N. Awa (ORCID: orcid.org/0000-0002-4932-6277) &
  • Darya L. Zabelina (ORCID: orcid.org/0000-0002-0313-7358)

Scientific Reports, volume 14, Article number: 3440 (2024)


  • Human behaviour

The emergence of publicly accessible artificial intelligence (AI) large language models such as ChatGPT has given rise to global conversations on the implications of AI capabilities. Emergent research on AI has challenged the assumption that creative potential is a uniquely human trait; thus, there seems to be a disconnect between human perception and what AI is objectively capable of creating. Here, we aimed to assess the creative potential of humans in comparison to AI. In the present study, human participants (N = 151) and GPT-4 provided responses for the Alternative Uses Task, Consequences Task, and Divergent Associations Task. We found that AI was robustly more creative along each divergent thinking measurement in comparison to the human counterparts. Specifically, when controlling for fluency of responses, AI was more original and elaborate. The present findings suggest that the current state of AI language models demonstrates higher creative potential than human respondents.


Introduction

The release of ChatGPT, a natural language processing (NLP) model developed by OpenAI 1 , to the general public has garnered global conversation on the utility of artificial intelligence (AI). OpenAI’s Generative Pretrained Transformer (GPT) is a type of machine learning model that specializes in pattern recognition and prediction and has been further trained using Reinforcement Learning from Human Feedback (RLHF) so that ChatGPT responses would be indistinguishable from human responses. Recently, OpenAI 1 has advertised the new model (GPT-4) as “more creative,” particularly “on creative and technical writing tasks,” in comparison to previous versions, although there are arguably semantic limitations such as nonsensical answers or the possibility of incorrect information generation 2 . Given the accessibility of AI models in the current climate, research across a variety of domains has started to emerge, contributing to our growing understanding of the possibilities and potential limitations of AI.

Creativity as a phenomenological construct is not immune to the effects of AI. For example, researchers have begun to assess AI models to determine appropriate design solutions 3 and logical reasoning 4 . These assessments focus on convergent thinking, i.e., determining one optimal solution to a pre-defined problem 5 . Traditionally, convergent thinking assumes an optimal single solution path and can be assessed through traditional intelligence measures or synthesis tasks. Although convergent thinking emphasizes single optimal solutions, this does not negate the potential for original or non-obvious solutions. However, convergent thinking tasks by design typically do not allow for flexible or out-of-the-box thinking. In contrast, divergent thinking involves generating multiple creative solutions to a problem which allows for the flexibility to determine multiple creative solutions 6 . Creativity researchers commonly focus on divergent creativity (in comparison to convergent creativity), given the associative mechanisms that allude to people’s ability to generate creative solutions (i.e., creative potential). Specifically, divergent thinking is considered an indicator of a person’s creative potential, but this does not guarantee creative achievement 7 . Instead, creative potential can be indicative on future capability, rather than an immediate trait that determines if someone is creative. Accordingly, a person’s creative potential has been captured via divergent thinking tasks such as the Alternative Uses Task [AUT 6 , 7 ] or the Consequences Task [CT 8 , 9 ]. Divergent thinking tasks can be evaluated along three dimensions: fluency (number of responses), originality (response novelty), and elaboration (length/detail of response). Responses in each category are given scores (i.e., according to each task) and used to assess individual differences in divergent creativity, or in other words, a person’s creative potential.

Given the emergence of OpenAI’s GPT-4 as a large language model, research has begun to empirically assess the creative potential of artificial intelligence language models through divergent thinking tasks. On one hand, some researchers argue that the human cognitive mechanisms present during creative tasks are not present in AI, and thus the creative potential of artificial intelligence can only reflect artificial creativity 10 . On the other hand, computational creativity suggests parallel networks that reflect the mechanisms of how humans go through iterative, deliberative, and generative creative processes which aid in the ability to determine creative solutions 11 . Although these aspects have been shown to aid in creative solutions, humans can experience idea fixedness, which can act as a roadblock to other creative solutions. Machines, however, will not experience this phenomenon in a metacognitive way due to computationally trained models that streamline a machine’s direct responses to a prompt 12 , 13 , 14 . Instead, a machine’s fixedness may perhaps reflect the training data of the model which could be argued is a computational consideration, rather than a creative one.

Furthermore, computational researchers have posed increasing debate on the creative capabilities of artificial intelligence models 15 by asking questions such as: How are machines capable of determining what is creative? At present, AI’s inability to explicitly determine why or if something is creative is then compensated through human-assistance. For example, human intervention is necessary for inputting appropriate and relevant data to train the model and shape outputs to become more linguistically natural 16 , 17 . This computational limitation suggests that AI is not capable of divergent creativity due to the lack of metacognitive processes (i.e., evaluation, task motivation) because AI could not generate creative ideas or reiterate on existing ideas without the intervention (i.e., input) of a human user 10 . Similarly, emotions have been seen as an integral part of creativity such that emotions help dictate states of flow or mind-wandering that aid in creative processes 18 . However, AI may not necessarily need to rely on metacognitive or affective processes to generate novel ideas 19 due to the computational framework. Thus, inner processes that contribute to human creativity may be a philosophical argument within artificial creativity models 20 .

As briefly reviewed, the creative capabilities of artificial intelligence, thus far, have scientifically and philosophically varied [e.g., 10 , 20 ]. Researchers posit humanistic and computational considerations of the creative potential of AI, however, the accessibility of tools to artificially generate products or ideas have given researchers the opportunity to evaluate public perception. For instance, people think more highly of generated artworks if they were told the artworks were created by humans but not AI 21 , 22 . The expectancy that AI generated products or ideas are less creative or hold less aesthetic value than human-created artworks appear to depend on implicit anti-AI biases 22 , 23 , 24 , as AI has been found to be indistinguishable from human-created products 25 , 26 , 27 . People’s inability to distinguish between human and AI-created products supports the feasibility of AI having creative potential.

Indeed, AI has been found to generate novel connections in music 28 , science 26 , medicine 29 , and visual art 30 to name a few. In assessments of divergent thinking, humans outperformed AI on the Alternative Uses Task 31 , but it is noteworthy that the authors propose a possible rise in AI capabilities given future progress of large language models. In fact, recent studies have found that AI divergent creativity matched that of humans using a later version of GPT-4 32 , 33 . Researchers have continued to demonstrate that the current state of LLM’s frequently score within the top 1% of human responses on standard divergent thinking tasks such as the Alternative Uses Task 32 , 33 , 34 . Additional studies utilizing other divergent thinking tasks have also reported findings that paint a more nuanced picture. For example, when scores were compared between humans and GPT-4 on a Divergent Associations Task (DAT 35 ), the researcher found that GPT-4 was more creative than human counterparts 36 . Recent research on OpenAI’s text-to-image platform DALL▪E has reported similar findings 37 and suggests that OpenAI models could match or even outperform humans in combinational creativity tasks. Given the research on AI creativity thus far, OpenAI’s advertorial claims that GPT-4 is “more creative” may hold more merit than anticipated.

Current research

Thus far, the novelty of OpenAI’s ChatGPT has posed more questions that have yet to be examined. Although creativity has been considered a uniquely human trait 38 , the emergence of OpenAI’s generative models suggests a possible shift in how people may approach tasks that require “out of the box” thinking. Thus, the current research aims to examine how divergent creativity (i.e., fluency, originality, elaboration) may differ between humans and AI on verbal divergent thinking tasks. To our knowledge, this is one of the first studies to comprehensively examine verbal responses across a battery of the most common divergent thinking tasks (i.e., Alternative Uses Task, Consequences Task, and Divergent Associations Task) with a novel methodology that matches the fluency of ideas between human subjects and ChatGPT. We anticipate that AI may demonstrate higher creative potential in comparison to humans, though given the recency of AI-centered creativity research, our primary research questions are exploratory in nature.

Participants

Human participation

Human participants (N = 151) were recruited via Prolific online data collection platform in exchange for monetary compensation of $8.00. Participants were limited to having a reported approval rating above 97%, were proficient English speakers, and were born/resided in the USA. Average total response time for completing the survey was 34.66 min. A statistical sensitivity analysis indicated that we had sufficient power to detect small effects with the present sample size ( f 2  = 0.06, 1 −  β  = 0.80). The present study was performed in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board for Human Subjects Research at the University of Arkansas. All participants provided informed consent prior to the start of the study. All statistical analyses were conducted in R studio 39 . See Table 1 for participant demographics.

AI participation

Artificial participants were operationalized as ChatGPT’s instancing feature. Each ChatGPT session was considered an independent interaction between the user and GPT interface. Here, we prompted separate instances per creativity measure (as detailed below) which resulted in artificial participation sessions. For example, we used a single session instance to feed each prompt and aggregated each prompt response into a data file. In total, we collected 151 instances which represent AI’s participation for a balanced sample. For two of the creativity measures (Alternative Uses Task and Consequences Task), which are the only timed tasks, fluency was matched 1:1 such that the number of responses for both groups is equal on these timed tasks. Fluency scores of each human respondent were first calculated to match 1:1 for each GPT-4 instance for the Alternative Uses Task and Consequences Task (detailed below). Only valid responses were retained. For example, human participant #52 had a total fluency score of 6, thus GPT-4 instance #52 was instructed to provide 6 responses.

Creativity measures

Alternative uses task

The Alternate Uses Task (AUT 6 ) was used to test divergent thinking. In this task, participants were presented with a common object (‘fork’ and ‘rope’) and were asked to generate as many creative uses as possible for these objects. Responses were scored for fluency (i.e., number of responses), originality (i.e., uniqueness of responses), and elaboration (i.e., number of words per valid response). Participants were given 3 min to generate their responses for each item. Following prior research 40 , instructions for human respondents on the AUT were:

For this task, you'll be asked to come up with as many original and creative uses for [item] as you can. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your ideas don't have to be practical or realistic; they can be silly or strange, even, so long as they are CREATIVE uses rather than ordinary uses. You can enter as many ideas as you like. The task will take 3 minutes. You can type in as many ideas as you like until then, but creative quality is more important than quantity. It's better to have a few really good ideas than a lot of uncreative ones. List as many ORIGINAL and CREATIVE uses for a [item] .

Because the goal was to control for fluency, we excluded prompt parameters such as 'quantity' from the GPT-4 instructions. Similarly, GPT does not need timing parameters in comparison to humans because we denoted the specific number of responses required. See below for instructions used per GPT instance:

For this task, you'll be asked to come up with as original and creative uses for [item] as you can. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your ideas don't have to be practical or realistic; they can be silly or strange, even, so long as they are CREATIVE uses rather than ordinary uses. List [insert fluency number] ORIGINAL and CREATIVE uses for a [item].

Consequences task

The Consequences Task (CT 8 , 9 ) is part of the verbal section of the Torrance Test of Creative Thinking (TTCT) that provides prompts to hypothetical scenarios (i.e., what would happen if humans no longer needed to sleep?). Similar to the AUT, people respond to as many consequences to the prompt as they can within a given timeframe. Responses were scored for fluency (i.e., number of responses), originality (i.e., uniqueness of responses), and elaboration (i.e., number of words per valid response). General task instructions for human respondents were:

In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". For 3 minutes, try and think of any and all consequences that might result from the statement. Please be as creative as you like. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your responses will be scored based on originality and quality. Remember, it is important to try to keep thinking of responses and to type them in for the entire time for the prompt. REMINDER: In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". For 3 minutes, try and think of any and all consequences that might result from the statement. Do this as many times as you can in 3 min. The screen will automatically change when the time is completed. Remember, it is important to try to keep thinking of responses and to type them in for the entire time for the prompt.

Participants were given two prompts shown independently: “Imagine humans no longer needed sleep,” and “Imagine humans walked with their hands.” The two CT prompts have been extensively used in research on divergent thinking 41 , 42 , 43 . Similar to the AUT, fluency and timing parameters were excluded from the GPT instructions on the CT:

In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". Please be as creative as you like. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your responses will be scored based on originality and quality. Try and think of any and all consequences that might result from the statement. [Insert scenario]. What problems might this create? List [insert fluency number] CREATIVE consequences.

Divergent associations task

The Divergent Association Task (DAT 35 ) is a task of divergent and verbal semantic creative ability. This task asks participants to come up with 10 nouns as different from each other as possible. These nouns must not be proper nouns or any type of technical term. Pairwise comparisons of semantic distance between the 10 nouns are calculated using cosine distance. The average distance scores between all pairwise comparisons are then multiplied by 100 that results in a final DAT score ( https://osf.io/bm5fd/ ). High scores indicate longer distances (i.e., words are not similar). Task instructions for both human participants and GPT-4 were:

Please enter 10 words that are as different from each other as possible, in all meanings and uses of the words. The rules: Only single words in English. Only nouns (e.g., things, objects, concepts). No proper nouns (e.g., no specific people or places). No specialized vocabulary (e.g., no technical terms). Think of the words on your own (e.g., do not just look at objects in your surroundings).

There were no time constraints for this task. The average human response time was 126.19 s ( SD  = 90.62) and the average DAT score was 76.95 ( SD  = 6.13). We scored all appropriate words that participants gave. Participants with fewer than 7 responses were excluded from data analysis (n = 2). Instructions were identical for the GPT-4 to the human instructions.
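The DAT scoring rule described above (the average of all pairwise cosine distances between the word vectors, multiplied by 100) can be sketched as follows; the tiny three-dimensional embeddings are placeholders for the pretrained GloVe vectors that the published scorer uses.

```python
# Hedged sketch of DAT scoring: average pairwise cosine distance x 100.
# The toy 3-d "embeddings" are placeholders, not real GloVe vectors.
import itertools
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def dat_score(words, embeddings):
    vectors = [embeddings[w] for w in words]
    distances = [cosine_distance(u, v) for u, v in itertools.combinations(vectors, 2)]
    return 100.0 * float(np.mean(distances))

toy_embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "galaxy": np.array([0.1, 0.8, 0.3]),
    "spoon": np.array([0.4, 0.2, 0.9]),
}
print(round(dat_score(["cat", "galaxy", "spoon"], toy_embeddings), 2))
```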

Human participants’ responses were collected online via Qualtrics. The entire study took on average 34 min ( SD  = 13.64). The order of the creativity tasks was counterbalanced. The online study used two attention checks randomly presented throughout the study. Each attention check allowed one additional attempt. Participants who failed two attention checks were removed from all analyses (N = 2). After providing their responses to each task, participants answered demographics questions.

GPT-4 procedural responses were generated through human-assistance facilitated by the first author, who provided each prompt in the following order: AUT, CT, and DAT. We did not have to account for typical human-centered confounds such as feelings of fatigue 44 , 45 and order biases 44 as these states are not relevant confounds in AI, thus the order of tasks was not counterbalanced.

Research disclosure statement

All variables, measurements, and exclusions for this article’s target research question have been reported in the methods section.

Creativity scoring

Both human and GPT-4 responses were cleaned to remove any instances that were incomplete or inappropriate at two stages: First, human responses that did not follow instructions from the task or were not understandable as a use (AUT; 0.96% removed) or a consequence (CT; 4.83%) were removed. Only valid human responses were used in matching for GPT fluency; Second, inappropriate or incomplete GPT responses for the AUT (< 0.001% removed) and CT (< 0.001% removed) were removed. Despite matching for fluency, only valid responses in both groups were used in subsequent analyses.

Traditional scoring methods for divergent thinking tasks have required human ratings of products or ideas and are assumed to be normative (i.e., consensus will eventually be met with more raters). Here, we used the Open Creativity Scoring tool [OCS 46 ] to automate scoring of semantic distance objectively, capturing the originality of ideas by assigning scores for the remoteness (uniqueness) of responses. Unlike human scoring, which involves multiple factors (e.g., fatigue, biases, time, cost 47 ) that could introduce confounds, automated scoring tools such as OCS circumvent these human-centered issues and have been found to correlate robustly with human ratings 46 .

The Open Creativity Scoring tool (OCS 46 ) was used to score both the AUT and CT tasks. Specifically, the semantic distance scoring tool 17 was used, which applies the GloVe 840B text-mining model 48 to assess the originality of responses by representing a prompt and response as vectors in semantic space and calculating the cosine of the angle between the vectors. The OCS tool also scores elaboration using the stoplist method 46 . The prompts for the AUT were “rope” and “fork,” and the prompts for the CT were “humans no sleep” and “humans walked hands.”
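A minimal sketch of this kind of prompt-response semantic distance scoring is shown below, assuming each text is represented by the average of its word embeddings; this mirrors the general idea described above and is not the OCS implementation itself.

```python
# Hedged sketch: originality as 1 - cosine similarity between averaged word
# vectors of the prompt and the response. Toy 2-d embeddings, not real GloVe.
import numpy as np

def sentence_vector(text, embeddings):
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

def originality(prompt, response, embeddings):
    p = sentence_vector(prompt, embeddings)
    r = sentence_vector(response, embeddings)
    return 1.0 - float(np.dot(p, r) / (np.linalg.norm(p) * np.linalg.norm(r)))

toy_embeddings = {
    "fork": np.array([1.0, 0.1]),
    "hair": np.array([0.3, 0.8]),
    "comb": np.array([0.2, 0.9]),
    "eat": np.array([0.9, 0.2]),
}
print(originality("fork", "hair comb", toy_embeddings))  # more remote, higher score
print(originality("fork", "eat", toy_embeddings))        # more obvious, lower score
```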

Preliminary results

Descriptive statistics for all tasks are reported in Tables 2 and 3 . Fluency descriptive statistics are reported in Table 2 . Semantic distance descriptive statistics are reported in Table 3 .

Primary results

As expected, an independent sample t -test revealed no significant differences in total fluency due to controlling for fluency (as detailed above) between humans ( M  = 6.94, SD  = 3.80) and GPT-4 ( M  = 7.01, SD  = 3.81), t (602) = 0.21, 95% CI [− 0.54, 0.67], p  = 0.83.

To assess originality of responses via semantic distance scores, we conducted a 2 (group: human, GPT-4) X 2 (prompt: ‘fork,’ ‘rope’) analysis of variance. The model revealed significant main effects of group ( F (1, 600) = 622.10, p  < 0.001, η 2  = 0.51) and prompt ( F (1, 600) = 584.50, p  < 0.001, η 2  = 0.49) on originality of responses. Additionally, there were significant interaction effects between group and prompt, F (1, 600) = 113.80, p  < 0.001, η 2  = 0.16. Particularly, both samples had higher originality scores for the prompt ‘fork’ in comparison to ‘rope,’ but GPT-4 scored higher in originality, regardless of prompt. Tukey’s HSD post hoc analysis showed that all pairwise comparisons were significantly different ( p  < 0.001) aside from the human ‘fork’ and GPT-4 ‘rope’ originality ( p  = 0.989). Overall, GPT-4 was more successful at coming up with divergent responses given the same number of opportunities to generate answers compared to the human counterpart and showed higher originality but only for specific prompts (Fig.  1 ).

Figure 1. Analysis of variance of originality on the Alternative Uses Task.
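For readers who want to reproduce this kind of analysis, the sketch below runs a 2 (group) x 2 (prompt) ANOVA in Python with statsmodels; the originality scores are simulated placeholders rather than the study's data.

```python
# Hedged sketch: two-way ANOVA (group x prompt) on originality scores.
# The scores below are simulated, not the study's data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["human", "gpt4"], 200),
    "prompt": np.tile(np.repeat(["fork", "rope"], 100), 2),
})
# Give GPT-4 a small simulated originality advantage, just for illustration.
df["originality"] = rng.normal(0.80, 0.10, len(df)) + (df["group"] == "gpt4") * 0.05

model = smf.ols("originality ~ group * prompt", data=df).fit()
print(anova_lm(model, typ=2))  # main effects of group and prompt, plus interaction
```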

Next, we compared elaboration scores between humans and GPT-4. Fluency scores differ from elaboration in the sense that fluency accounts for each coherent response whereas elaboration quantifies the number of words per valid response. For example, a person could respond “you could use a fork to knit or as a hair comb.” In this example, the fluency would be 2 (knitting instrument and comb), but the elaboration would be 12 (number of words used in the response). The results of an independent t -test revealed that elaboration was significantly higher for GPT-4 ( M  = 15.45, SD  = 6.74) in comparison to humans ( M  = 3.38, SD  = 2.91), t (602) = 28.57, 95% CI [11.24, 12.90], p  < 0.001.

As expected, an independent t -test revealed no significant differences in total fluency between humans ( M  = 5.71, SD  = 3.20) and GPT-4 ( M  = 5.50, SD  = 3.15), t (621) = 0.82, 95% CI [− 0.29, 0.71], p  = 0.41.

To assess originality of responses via semantic distance scores, we conducted a 2 (group: human, GPT) X 2 (prompt: ‘no more sleep,’ ‘walk on hands’) analysis of variance. The model revealed significant main effects of group ( F (1, 619) = 622.10, p  < 0.001, η 2  = 0.51) and prompt ( F (1, 619) = 584.50, p  < 0.001, η 2  = 0.49) on the originality of responses. Additionally, there were significant interaction effects between group and prompt, F (1, 619) = 113.80, p  < 0.001, η 2  = 0.16. Particularly, originality was marginally higher for the prompt ‘walk on hands’ in the GPT sample, although there were no significant differences in originality in the human sample between the two prompts. Tukey’s HSD post hoc analysis showed that all pairwise comparisons were significantly different ( p  < 0.001) aside from the human responses for both prompts ( p  = 0.607). Overall, GPT-4 was more successful at coming up with more divergent responses given the same number of opportunities compared to the human counterparts, and also showed higher originality dependent on prompt type (Fig.  2 ).

Figure 2. Analysis of variance of originality on the Consequences Task.

Next, we calculated the difference in elaboration between humans and GPT-4. The results of an independent t -test revealed that elaboration was significantly higher in the GPT-4 sample ( M  = 38.69, SD  = 15.60) than in the human sample ( M  = 5.45, SD  = 4.04), t (621) = − 36.04, 95% CI [− 35.04, − 31.45], p  < 0.001.

We assessed the qualitative aspect of the words generated in the DAT by both humans and GPT through word occurrence: namely, the frequency of single-occurrence words (words appearing only once within a group) and unique-occurrence words (words appearing in only one of the two groups).

Humans produced a higher number of single-occurrence words (n = 523), accounting for 69.92% of that group’s total responses, compared to GPT’s single-occurrence words (n = 152), which accounted for 47.95% of its total responses (Table 4 ). In total, 9.11% (n = 97) of responses overlapped between the two groups. Words occurring exclusively in the human responses accounted for 87.03% (n = 651), compared to words unique to the GPT responses, which accounted for 69.40% (n = 220).

A chi-square test of independence was performed to examine the relationship between group (GPT vs. human) and word type (single occurrence vs. unique occurrence). The relationship between these variables was not significant, χ²(1, N  = 302) = 1.56, p  = 0.211. This suggests that the uniqueness and occurrence of words may not have necessarily aided either group in originality, but rather aided in word complexity.

Differences in semantic distance scores were calculated between human and GPT-4 DAT responses. An independent sample t -test revealed that GPT responses ( M  = 84.56, SD  = 3.05) had higher semantic distances in comparison to human responses ( M  = 76.95, SD  = 6.13), t (300) = 13.65, 95% CI [6.51, 8.71], p  < 0.001. Despite human participants having a broader range of unique responses, the fluency uniqueness did not appear to advantage semantic distance scores when comparing groups.

The present study offers novel evidence on the current state of large language models (i.e., GPT-4) and their capacity for divergent creative output in comparison to human participants. Overall, GPT-4 was more original and elaborate than humans on each of the divergent thinking tasks, even when controlling for fluency of responses. In other words, GPT-4 demonstrated higher creative potential across an entire battery of divergent thinking tasks (i.e., the Alternative Uses Task, the Consequences Task, and the Divergent Associations Task).

Notably, no other study has comprehensively assessed multiple dimensions of the most frequently used divergent thinking tasks in relation to AI. However, studies have begun to examine differences in divergent creativity between humans and AI, particularly since the public emergence of OpenAI's ChatGPT, with findings showing that AI's creative potential scores within the top 1% of human responses in terms of originality 32,33,34. While there has been an influx of research examining the creativity of generative language models, to date only one study has shown that humans outperformed GPT on the AUT (GPT-3 31), while another reported that a later version (GPT-4) showed similar, albeit slightly lower, creative potential than humans 32. Similarly, one study demonstrated that generative models improved from GPT-3.5 to GPT-4, particularly in terms of fluency but, interestingly, not in terms of elaboration 49, which suggests that the creative potential of these LLMs is improving, particularly the ability to generate original ideas. Only one other study thus far has reported similar results, finding that GPT outperformed humans on the DAT 36, but the DAT captures only one aspect of divergent thinking. The novelty of the present findings therefore provides a foundation for future research to continue examining multiple dimensions of divergent thinking and artificial intelligence.

While the present results suggest that current AI models outperform humans on divergent thinking tasks by a significant margin, there are methodological considerations that could have contributed to these results. Comprehensively examining creativity requires assessing not only the originality of an idea or product but also its usefulness and appropriateness 50. Appropriateness has traditionally proven difficult to standardize, relative to originality, because of the multifaceted dimensions that contribute to such judgments, including sociocultural and historical contexts. Semantic distance scores do not take these variables into consideration; instead, the scores reflect the relative distance between seemingly related (or unrelated) ideas. In this instance, GPT-4's answers yielded higher originality than those of its human counterparts, but the feasibility or appropriateness of its ideas could be vastly inferior to that of humans. Thus, we need to consider that the results reflect only a single aspect of divergent thinking rather than supporting a generalization that AI is more creative across the board. Future research on AI and creativity needs to account not only for the traditional measurements of creativity (i.e., fluency, elaboration, originality) but also for the usefulness and appropriateness of the ideas.

Interestingly, GPT-4 repeated words more frequently than human respondents did. Although human responses drew on a much broader vocabulary, this did not necessarily result in higher semantic distance scores. Flexibility, or the number of categories of responses, has also been found to be smaller (i.e., more similar categories of words were generated) for AI than for humans 34. In other words, consistent with our present results, humans came up with a wider range of responses, but this did not translate into increased originality. These findings suggest that flexible thinking may be the strong point of human divergent thinking.

Moreover, the complexity of the words chosen by AI, albeit more concentrated in occurrence, could have contributed more robustly to the originality effects. For example, only the AI used words denoting non-tangible concepts (e.g., freedom, philosophy), whereas humans may have experienced a fixedness on generating ideas that are appropriate and observable. The difference between the generated lists (incorporating tangible versus non-tangible words) could inflate originality in favor of AI.

Similarly, we need to critically consider the uniqueness of the words generated in DAT responses. There was only marginal overlap between the human and AI samples (9.11%), and humans responded with a higher number of single-occurrence words; despite these differences, AI still obtained higher semantic distance scores. Prior research shows that in human respondents, originality increases over time 51. This increase is seen as an expansion of activation in an individual's semantic network, which leads to more original responses 52. Human responses on these divergent thinking tasks tend to follow a diminishing-returns curve before reaching a plateau in an individual's more original responses 53. The higher levels of elaboration and semantic distance in AI responses suggest that LLM processing may not need this ramp-up time, so LLMs can produce their most original responses as soon as they are prompted. Whereas humans may fixate on more obvious responses at first, this algorithmic trait could serve as an aid in overcoming ideation fixedness in humans.

It is important to note that the measures used in this study are all measures of creative potential, whereas involvement in creative activities or achievements is another aspect of measuring a person's creativity. Creative potential is not a guarantee of creative achievement; instead, we need to consider creative potential as an indicator of a person's creative capabilities 7. Here, AI was more original, indicating higher creative potential, but this metric may more appropriately reflect the advancement of the algorithms these models were trained on in conjunction with human input. In other words, AI, unlike humans, does not have agency, so its creative potential depends on the assistance of a human user to elicit responses. The creative potential of AI therefore remains dormant unless prompted.

Moreover, researchers have examined the interplay between creative potential and real-world creative achievements 54,55, but this approach assumes human-level creativity and cannot account for artificial intelligence. AI can generate creative ideas, but it cannot be assumed that this potential would translate into achievement. The creative potential of AI is limited by its lack of autonomy: the algorithms cannot create without the intervention of human assistance. Thus, future research should consider the conceptual implications of current measurements of creativity for real-world applications, and whether generalizing across the intersection of potential and achievement is a human-centric consideration.

The prevalence and accessibility of the internet has drastically shaped the way humans interact with language processing systems and search engines, and LLMs such as GPT-4 are no exception to this ubiquity. Searching for information now has multiple channels that were not previously available, and with these tools come an array of strategies for finding the desired information. Research has shown that younger people are better and more efficient in their online search strategies 56, which suggests that exposure to search platforms acts as practice in efficiency. Similarly, through interactions with GPT-4 and other AI platforms, humans may gradually learn how best to utilize LLMs. For information-seeking tools like GPT-4, creative potential has shown clear progression in capability, although limitations remain, such as response appropriateness and the ability to generate idiosyncratic associations. Generative AI has demonstrated robustness in creative potential but has also shown weaknesses (i.e., less flexible thinking) that could be supplemented by human assistance. Moving forward, the possibility of AI acting as a tool of inspiration, an aid in a person's creative process, or a means of overcoming fixedness is promising.

Data availability

All data associated with the present study are available at https://osf.io/xv6kh/.

References

OpenAI. ChatGPT: Optimizing Language Models for Dialogue (2023). https://openai.com/blog/chatgpt/. Accessed July 2023.

Rahaman, M. S., Ahsan, M. T., Anjum, N., Terano, H. J. R. & Rahman, M. M. From ChatGPT-3 to GPT-4: A significant advancement in ai-driven NLP tools. J. Eng. Emerg. Technol. 2 (1), 1–11. https://doi.org/10.52631/jeet.v2i1.188 (2023).

Lee, Y. H., & Lin, T. H. (2023). The feasibility study of AI image generator as shape convergent thinking tool. in International Conference on Human-Computer Interaction (pp. 575–589). https://doi.org/10.1007/978-3-031-35891-3_36 .

Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., & Zhang, Y. (2023). Evaluating the logical reasoning ability of chatgpt and gpt-4. arXiv:2304.03439 .

Cropley, A. In praise of convergent thinking. Creat. Res. J. 18 (3), 391–404. https://doi.org/10.1207/s15326934crj1803_13 (2006).

Guilford, J. P. The Nature of Human Intelligence (McGraw-Hill, 1967).

Runco, M. A. & Acar, S. Divergent thinking as an indicator of creative potential. Creat. Res. J. 24 (1), 66–75. https://doi.org/10.1080/10400419.2012.652929 (2012).

Torrance, E. P. The Torrance Tests of Creative Thinking: Norms-Technical Manual (Personal Press, 1974).

Wilson, R. C., Guilford, J. P., Christensen, P. R. & Lewis, D. J. A factor-analytic study of creative-thinking abilities. Psychometrika 19 (4), 297–311. https://doi.org/10.1007/bf02289230 (1954).

Runco, M. A. AI can only produce artificial creativity. J. Creat. 33 (3), 100063. https://doi.org/10.1016/j.yjoc.2023.100063 (2023).

Finke, R. A. Imagery, creativity, and emergent structure. Conscious. Cogn. 5 (3), 381–393. https://doi.org/10.1006/ccog.1996.0024 (1996).

Sarker, I. H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2 (6), 420. https://doi.org/10.1007/s42979-021-00815-1 (2021).

Khurana, D., Koli, A., Khatter, K. & Singh, S. Natural language processing: State of the art, current trends and challenges. Multimed. Tools Appl. 82 (3), 3713–3744. https://doi.org/10.1007/s11042-022-13428-4 (2022).

Zhou, M., Duan, N., Liu, S. & Shum, H.-Y. Progress in neural NLP: Modeling, learning, and reasoning. Engineering 6 (3), 275–290. https://doi.org/10.1016/j.eng.2019.12.014 (2020).

Cardoso, A., Veale, T. & Wiggins, G. A. Converging on the divergent: The history (and future) of the international joint workshops in computational creativity. AI Mag. 30 (3), 15. https://doi.org/10.1609/aimag.v30i3.2252 (2009).

Lambert, N., Castricato, L., von Werra, L., & Havrilla A. Illustrating Reinforcement Learning from Human Feedback (RLHF). Hugging Face . (2022). https://huggingface.co/blog/rlhf .

Dumas, D., Organisciak, P. & Doherty, M. Measuring divergent thinking originality with human raters and text-mining models: A psychometric comparison of methods. Psychol. Aesthet. Creat. Arts 15 (4), 645–663. https://doi.org/10.1037/aca0000319 (2021).

Kane, S. et al. Attention, affect, and creativity, from mindfulness to mind-wandering. In The Cambridge Handbook of Creativity and Emotions (eds Ivcevic, Z. et al. ) 130–148 (Cambridge University Press, 2023). https://doi.org/10.1017/9781009031240.010 .

Chatterjee, A. Art in an age of artificial intelligence. Front. Psychol. 13 , 1024449. https://doi.org/10.3389/fpsyg.2022.1024449 (2022).

Boden, M. A. Computer models of creativity. AI Mag. 30 (3), 23–23. https://doi.org/10.1609/aimag.v30i3.2254 (2009).

Bellaiche, L. et al. Humans versus AI: Whether and why we prefer human-created compared to AI-created artwork. Cogn. Res. Princ. Implic. 8 (1), 1–22. https://doi.org/10.1186/s41235-023-00499-6 (2023).

Chiarella, S. et al. Investigating the negative bias towards artificial intelligence: Effects of prior assignment of AI-authorship on the aesthetic appreciation of abstract paintings. Comput. Hum. Behav. 137 , 107406. https://doi.org/10.1016/j.chb.2022.107406 (2022).

Fortuna, P. & Modliński, A. A(I)rtist or counterfeiter? Artificial intelligence as (D) evaluating factor on the art market. J. Arts Manag. Law Soc. 51 (3), 188–201. https://doi.org/10.1080/10632921.2021.1887032 (2021).

Liu, Y., Mittal, A., Yang, D., & Bruckman, A. (2022). Will AI console me when I lose my pet? Understanding perceptions of AI-mediated email writing. in Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3517731

Chamberlain, R., Mullin, C., Scheerlinck, B. & Wagemans, J. Putting the art in artificial: Aesthetic responses to computer-generated art. Psychol. Aesthet. Creat. Arts 12 (2), 177–192. https://doi.org/10.1037/aca0000136 (2018).

Gao, C. A. et al. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. Biorxiv https://doi.org/10.1016/j.patter.2023.100706 (2023).

Samo, A. & Highhouse, S. Artificial intelligence and art: Identifying the aesthetic judgment factors that distinguish human- and machine-generated artwork. Psychol. Aesthet. Creat. Arts. https://doi.org/10.1037/aca0000570 (2023).

Yin, Z., Reuben, F., Stepney, S. & Collins, T. Deep learning’s shallow gains: A comparative evaluation of algorithms for automatic music generation. Mach. Learn. 112 (5), 1785–1822. https://doi.org/10.1007/s10994-023-06309-w (2023).

Kumar, Y., Koul, A., Singla, R. & Ijaz, M. F. Artificial intelligence in disease diagnosis: A systematic literature review, synthesizing framework and future research agenda. J. Ambient Intell. Hum. Comput. https://doi.org/10.1007/s12652-021-03612-z (2022).

Anantrasirichai, N. & Bull, D. Artificial intelligence in the creative industries: A review. Artif. Intell. Rev. https://doi.org/10.1007/s10462-021-10039-7 (2022).

Stevenson, C., Smal, I., Baas, M., Grasman, R., & van der Maas, H. Putting GPT-3's Creativity to the (Alternative Uses) Test . (2022). arXiv:2206.08932 .

Haase, J. & Hanel, P. H. (2023). Artificial Muses: Generative Artificial Intelligence Chatbots Have Risen to Human-Level Creativity . https://doi.org/10.48550/arXiv.2303.12003

Koivisto, M. & Grassini, S. Best humans still outperform artificial intelligence in a creative divergent thinking task. Sci. Rep. 13 , 13601. https://doi.org/10.1038/s41598-023-40858-3 (2023).

Guzik, E. E., Byrge, C. & Gilde, C. The originality of machines: AI takes the torrance test. J. Creat. 33 (3), 100065. https://doi.org/10.1016/j.yjoc.2023.100065 (2023).

Olson, J. A., Nahas, J., Chmoulevitch, D., Cropper, S. J. & Webb, M. E. Naming unrelated words predicts creativity. Proc. Natl. Acad. Sci. 118 , 25. https://doi.org/10.1073/pnas.2022340118 (2021).

Cropley, D. Is artificial intelligence more creative than humans?: ChatGPT and the divergent association task. Learn. Lett. 2 , 13–13. https://doi.org/10.59453/ll.v2.13 (2023).

Chen, L., Sun, L. & Han, J. A comparison study of human and machine-generated creativity. J. Comput. Inf. Sci. Eng. 23 (5), 051012. https://doi.org/10.1115/1.4062232 (2023).

Sawyer, R. K. Explaining Creativity: The Science of Human Innovation (Oxford University Press, 2012).

R Core Team. R: A Language and Environment for Statistical Computing (Version 4.1.0) [Computer Software] . (2021). http://www.R-project.org .

Nusbaum, E. C., Silvia, P. J. & Beaty, R. E. Ready, set, create: What instructing people to “be creative” reveals about the meaning and mechanisms of divergent thinking. Psychol. Aesthet. Creat. Arts 8 (4), 423. https://doi.org/10.1037/a0036549 (2014).

Acar, S. et al. Applying automated originality scoring to the verbal form of Torrance tests of creative thinking. Gift. Child Q. 67 (1), 3–17. https://doi.org/10.1177/00169862211061874 (2021).

Hass, R. W. & Beaty, R. E. Use or consequences: Probing the cognitive difference between two measures of divergent thinking. Front. Psychol. 9 , 2327. https://doi.org/10.3389/fpsyg.2018.02327 (2018).

Urban, M. & Urban, K. Orientation toward intrinsic motivation mediates the relationship between metacognition and creativity. J. Creat. Behav. 57 (1), 6–16. https://doi.org/10.1002/jocb.558 (2023).

Day, B. et al. Ordering effects and choice set awareness in repeat-response stated preference studies. J. Environ. Econ. Manag. 63 (1), 73–91. https://doi.org/10.1016/j.jeem.2011.09.001 (2012).

Igorov, M., Predoiu, R., Predoiu, A. & Igorov, A. Creativity, resistance to mental fatigue and coping strategies in junior women handball players. Eur. Proc. Soc. Behav. Sci. https://doi.org/10.15405/epsbs.2016.06.39 (2016).

Organisciak, P. & Dumas, D. Open Creativity Scoring [Computer Software] . (University of Denver, 2020). https://openscoring.du.edu/ .

Beaty, R. E., Johnson, D. R., Zeitlen, D. C. & Forthmann, B. Semantic distance and the alternate uses task: Recommendations for reliable automated assessment of originality. Creat. Res. J. 34 (3), 245–260. https://doi.org/10.1080/10400419.2022.2025720 (2022).

Pennington, J., Socher, R. & Manning, C. Glove: Global vectors for word representation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , 1532–1543 (2014).

Vinchon, F., Gironnay, V., & Lubart, T. The Creative AI-Land: Exploring new forms of creativity. In Review . (2023).

Runco, M. A. & Jaeger, G. J. The standard definition of creativity. Creat. Res. J. 24 (1), 92–96. https://doi.org/10.1080/10400419.2012.650092 (2012).

Beaty, R. E. & Silvia, P. J. Why do ideas get more creative across time? An executive interpretation of the serial order effect in divergent thinking tasks. Psychol. Aesthet. Creat. Arts 6 (4), 309–319. https://doi.org/10.1037/a0029171 (2012).

Mednick, S. The associative basis of the creative process. Psychol. Rev. 69 (3), 220–232. https://doi.org/10.1037/h0048850 (1962).

Hubert K. F., Finch A., Zabelina D. (2023). Diminishing Creative Returns: Predicting Optimal Creative Performance via Individual Differences in Executive Functioning .

Carson, S. H., Peterson, J. B. & Higgins, D. M. Reliability, validity, and factor structure of the creative achievement questionnaire. Creat. Res. J. 17 (1), 37–50. https://doi.org/10.1207/s15326934crj1701_4 (2005).

Jauk, E., Benedek, M. & Neubauer, A. C. The road to creative achievement: A latent variable model of ability and personality predictors. Pers. Individ. Diff. https://doi.org/10.1016/j.paid.2013.07.129 (2014).

Chevalier, A., Dommes, A. & Marquié, J.-C. Strategy and accuracy during information search on the web: Effects of age and complexity of the search questions. Comput. Hum. Behav. 53 , 305–315. https://doi.org/10.1016/j.chb.2015.07.017 (2015).

Author information

These authors contributed equally: Kent F. Hubert and Kim N. Awa.

Authors and Affiliations

Department of Psychological Sciences, University of Arkansas, Fayetteville, AR, 72701, USA

Kent F. Hubert, Kim N. Awa & Darya L. Zabelina

Contributions

D.L.Z., K.F.H., and K.N.A. contributed to the conceptualization and methodology. K.F.H. and K.N.A. contributed to formal analysis and investigation. K.F.H. prepared all figures. K.N.A. prepared all tables. D.L.Z., K.F.H., and K.N.A. contributed to writing and revision.

Corresponding author

Correspondence to Kent F. Hubert.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Hubert, K.F., Awa, K.N. & Zabelina, D.L. The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Sci Rep 14 , 3440 (2024). https://doi.org/10.1038/s41598-024-53303-w

Received: 14 October 2023

Accepted: 30 January 2024

Published: 10 February 2024

DOI: https://doi.org/10.1038/s41598-024-53303-w

Controllable Text Generation for Large Language Models: A Survey

  • Xun Liang, Hanyu Wang, +8 authors Zhiyu Li
  • Published 22 August 2024
  • Computer Science, Linguistics

Title: Large Language Models: A Survey

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Understanding the Weakness of Large Language Model Agents within a Complex Android Environment

MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models

  • Zhaoyang Yu ,
  • Minghua Ma ,
  • Chaoyun Zhang ,
  • Chetan Bansal ,
  • Saravan Rajmohan ,
  • Yingnong Dang ,
  • Changhua Pei ,
  • Qingwei Lin 林庆维 ,
  • Dongmei Zhang

Foundations of Software Engineering (FSE) | July 2024

In large-scale cloud service systems, monitoring metric data and conducting anomaly detection is an important way to maintain reliability and stability. However, great disparity exists between academic approaches and industrial practice in anomaly detection. Industry predominantly uses simple, efficient methods because of their better interpretability and ease of implementation. In contrast, academia favors deep-learning methods, which, despite their advanced capabilities, face practical challenges in real-world applications. To address these challenges, this paper introduces MonitorAssistant, an end-to-end practical anomaly detection system based on Large Language Models. MonitorAssistant automates model configuration recommendation, achieving knowledge inheritance, and provides alarm interpretation with guidance-oriented anomaly reports, facilitating a more intuitive engineer-system interaction through natural language. By deploying MonitorAssistant in Microsoft’s cloud service system, we validate its efficacy and practicality, marking a significant advancement in the field of practical anomaly detection for large-scale cloud services.

Large language models facilitate the generation of electronic health record phenotyping algorithms

Chao Yan, Henry H Ong, Monika E Grabowska, Matthew S Krantz, Wu-Chen Su, Alyson L Dickson, Josh F Peterson, QiPing Feng, Dan M Roden, C Michael Stein, V Eric Kerchberger, Bradley A Malin, Wei-Qi Wei, Large language models facilitate the generation of electronic health record phenotyping algorithms, Journal of the American Medical Informatics Association , Volume 31, Issue 9, September 2024, Pages 1994–2001, https://doi.org/10.1093/jamia/ocae072

Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts.

We prompted four LLMs—GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard—in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (ie, type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network.

GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values).

GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience is still required to assess and further refine generated algorithms.
