Source(s): Table by authors
Angelov , D. ( 2020 ). Top2Vec: Distributed representations of topics . doi: 10.48550/arXiv.2008.09470 .
Arora , A. , & Arora , A. ( 2023 ). The promise of large language models in health care . The Lancet , 401 ( 10377 ), 641 .
Atzeni , D. , Bacciu , D. , Mazzei , D. , & Prencipe , G. ( 2022 ). A systematic review of wi-fi and machine learning integration with topic modeling techniques . Sensors (Basel, Switzerland) , 22 ( 13 ), 4925 . doi: 10.3390/s22134925 .
Bafna , P. , Pramod , D. , & Vaidya , A. ( 2016 , March). Document clustering: TF-IDF approach . In 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) (pp. 61 - 66 ). IEEE .
Bai , Y. , Kadavath , S. , Kundu , S. , Askell , A. , Kernion , J. , Jones , A. , … McKinnon , C. ( 2022 ). Constitutional AI: Harmlessness from AI feedback . arXiv preprint arXiv:2212.08073 .
Bianchi , F. , Terragni , S. , & Hovy , D. ( 2021 ). Pre-training is a hot topic: Contextualized document embeddings improve topic coherence . ACL-IJCNLP , 2021 , 2021 , /08// .
Blei , D. M. , Ng , A. Y. , & Jordan , M. I. ( 2003 ). Latent dirichlet allocation . Journal of Machine Learning Research , 3 , 10 .
Bonner , E. , Lege , R. , & Frazier , E. ( 2023 ). LARGE LANGUAGE model-based artificial intelligence in the language classroom: Practical ideas for teaching . Teaching English with Technology , 23 ( 1 ), 23 – 41 .
Brants , T. , Popat , A. C. , Xu , P. , Och , F. J. , & Dean , J. ( 2007 ). Large language models in machine translation .
Breeden , J. L. , & Leonova , E. ( 2021 ). Creating unbiased machine learning models by design . Journal of Risk and Financial Management , 14 ( 11 ), 565 . Available from: https://www.mdpi.com/1911-8074/14/11/565
Brown , T. , Mann , B. , Ryder , N. , Subbiah , M. , Kaplan , J. D. , Dhariwal , P. , … Askell , A. ( 2020 ). Language models are few-shot learners . Advances in Neural Information Processing Systems , 33 , 1877 – 1901 .
Chang , Y. , Wang , X. , Wang , J. , Wu , Y. , Zhu , K. , Chen , H. , … Wang , Y. ( 2023 ). A survey on evaluation of large language models . arXiv preprint arXiv:2307.03109 .
Chaudhry , M. A. , Cukurova , M. , & Luckin , R. ( 2022 ). A transparency index framework for AI in education . In International Conference on Artificial Intelligence in Education .
Chechitelli , A. ( 2023 ). AI writing detection update from Turnitin’s chief product officer . Turnitin Blog .
Condor , A. , Litster , M. , & Pardos , Z. ( 2021 ). Automatic short answer grading with SBERT on out-of-sample questions . In Proceedings of the 14th International Conference on Educational Data Mining (EDM2021) (pp. 345 - 352 ).
Devlin , J. , Chang , M.-W. , Lee , K. , & Toutanova , K. ( 2019 ). Bert: Pre-Training of deep bidirectional transformers for language understanding . In Proceedings of naacL-HLT 2019 Jun 2 (Vol. 1 , p. 2 ).
Ester , M. , Kriegel , H. P. , Sander , J. , & Xu , X. ( 1996 ). A density-based algorithm for discovering clusters in large spatial databases with noise . In kdd (Vol. 96 , No. 34 , pp. 226 - 231 ).
Fan , L. , Li , L. , Ma , Z. , Lee , S. , Yu , H. , & Hemphill , L. ( 2023 ). A bibliometric review of large language models research from 2017 to 2023 . arXiv preprint arXiv:2304.02020 .
Févotte , C. , & Idier , J. ( 2011 ). Algorithms for nonnegative Matrix factorization with the β-divergence . Neural Computation , 23 ( 9 ), 2421 – 2456 . doi: 10.1162/NECO_a_00168 .
Grootendorst , M. ( 2022 ). BERTopic: Neural topic modeling with a class-based TF-IDF procedure . doi: 10.48550/arXiv.2203.05794 .
Hsu , Y.-C. , & Ching , Y.-H. ( 2023 ). Generative artificial intelligence in education, Part One: The dynamic frontier . TechTrends , 67 , 603 – 607 .
Jolliffe , I. T. , & Cadima , J. ( 2016 ). Principal component analysis: A review and recent developments . Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , 374 ( 2065 ), 20150202 .
Karakos , D. , & Khudanpur , S. ( 2008 ). Sequential system combination for machine translation of speech . In 2008 IEEE Spoken Language Technology Workshop .
Kasneci , E. , Seßler , K. , Küchemann , S. , Bannert , M. , Dementieva , D. , Fischer , F. , … Hüllermeier , E. ( 2023 ). ChatGPT for good? On opportunities and challenges of large language models for education . Learning and Individual Differences , 103 , 102274 .
Kim , J. , & Lane , I. ( 2014 ). Accelerating large vocabulary continuous speech recognition on heterogeneous cpu-gpu platforms . In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3291 - 3295 ). IEEE .
Kim , K. , Park , E.-J. , Shin , J.-H. , Kwon , O.-W. , & Kim , Y.-K. ( 2017 ). Divergence-based fine pruning of phrase-based statistical translation model . Computer Speech and Language , 41 , 146 – 160 .
Kim , C. , Gowda , D. , Lee , D. , Kim , J. , Kumar , A. , Kim , S. , … Han , C. ( 2020 ). A review of on-device fully neural end-to-end automatic speech recognition algorithms . In 2020 54th Asilomar Conference on Signals, Systems, and Computers .
Kung , T. H. , Cheatham , M. , Medenilla , A. , Sillos , C. , De Leon , L. , Elepaño , C. , … Maningo , J. ( 2023 ). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models . PLoS Digital Health , 2 ( 2 ), e0000198 .
Lee , H. ( 2023 ). The rise of ChatGPT: Exploring its potential in medical education . Anatomical Sciences Education .
Leinonen , J. , Denny , P. , MacNeil , S. , Sarsa , S. , Bernstein , S. , Kim , J. , … Hellas , A. ( 2023 ). Comparing code explanations created by students and large language models . arXiv preprint arXiv:2304.03938 .
Li , H. , Moon , J. T. , Purkayastha , S. , Celi , L. A. , Trivedi , H. , & Gichoya , J. W. ( 2023 ). Ethics of large language models in medicine and medical research . The Lancet Digital Health , 5 ( 6 ), e333 – e335 .
Liévin , V. , Hother , C. E. , & Winther , O. ( 2022 ). Can large language models reason about medical questions? . arXiv preprint arXiv:2207.08143 .
Linardatos , P. , Papastefanopoulos , V. , & Kotsiantis , S. ( 2021 ). Explainable AI: A review of machine learning interpretability methods . Entropy , 23 ( 1 ), 18 . Available from: https://www.mdpi.com/1099-4300/23/1/18
Liu , Y. , Han , T. , Ma , S. , Zhang , J. , Yang , Y. , Tian , J. , … Liu , Z. ( 2023 ). Summary of chatgpt/gpt-4 research and perspective towards the future of large language models . arXiv preprint arXiv:2304.01852 .
Maaten , L. V. D. , & Hinton , G. ( 2008 ). Visualizing Data using t-SNE . Journal of Machine Learning Research , 9 ( 86 ), 2579 – 2605 . Available from: http://jmlr.org/papers/v9/vandermaaten08a.html
MacNeil , S. , Tran , A. , Leinonen , J. , Denny , P. , Kim , J. , Hellas , A. , … Sarsa , S. ( 2022 ). Automatically generating CS learning materials with Large Language Models . arXiv preprint arXiv:2212.05113 .
Mayer , C. W. , Ludwig , S. , & Brandt , S. ( 2023 ). Prompt text classifications with transformer models! An exemplary introduction to prompt-based learning with large language models . Journal of Research on Technology in Education , 55 ( 1 ), 125 – 141 .
McInnes , L. , Healy , J. , & Astels , S. ( 2017 ). hdbscan: Hierarchical density based clustering . Journal of Open Source Software , 2 ( 11 ), 205 .
McInnes , L. , Healy , J. , & Melville , J. ( 2020 ). Umap: Uniform manifold approximation and projection for dimension reduction . doi: 10.48550/arXiv.1802.03426 .
OpenAI ( 2023 ). GPT-4 technical report . ArXiv./abs/2303.08774 .
Ouyang , L. , Wu , J. , Jiang , X. , Almeida , D. , Wainwright , C. , Mishkin , P. , … Ray , A. ( 2022 ). Training language models to follow instructions with human feedback . Advances in Neural Information Processing Systems , 35 , 27730 – 27744 .
Radford , A. , Narasimhan , K. , Salimans , T. , & Sutskever , I. ( 2018 ). Improving language understanding by generative pre-training .
Radford , A. , Wu , J. , Child , R. , Luan , D. , Amodei , D. , & Sutskever , I. ( 2019 ). Language models are unsupervised multitask learners . OpenAI Blog , 1 ( 8 ), 9 .
Raju Sangaraju , V. , Bolla , B. K. , Nayak , D. K. , & Kh , J. ( 2022 ). Topic modelling on consumer financial protection bureau data: An approach using BERT based embeddings . arXiv e-prints, arXiv-2205 .
Rubenstein , P. K. , Asawaroengchai , C. , Nguyen , D. D. , Bapna , A. , Borsos , Z. , Quitry , F. D. C. , … Muckenhirn , H. ( 2023 ). AudioPaLM: A large language model that can speak and listen . arXiv preprint arXiv:2306.12925 .
Sallam , M. ( 2023 ). The utility of ChatGPT as an example of large language models in healthcare education, research and practice: Systematic review on the future perspectives and potential limitations . medRxiv , 2023.2002. 2019.23286155 .
Schuster , M. , & Paliwal , K. K. ( 1997 ). Bidirectional recurrent neural networks . IEEE Transactions on Signal Processing , 45 ( 11 ), 2673 – 2681 .
Schwenk , H. , Rousseau , A. , & Attik , M. ( 2012 ). Large, pruned or continuous space language models on a gpu for statistical machine translation . In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT .
Shen , Y. , Heacock , L. , Elias , J. , Hentel , K. D. , Reig , B. , Shih , G. , & Moy , L. ( 2023 ). ChatGPT and other large language models are double-edged swords . In (Vol. 307 , p. e230163 ): Radiological Society of North America .
Teubner , T. , Flath , C. M. , Weinhardt , C. , van der Aalst , W. , & Hinz , O. ( 2023 ). Welcome to the era of chatgpt et al. the prospects of large language models . Business and Information Systems Engineering , 65 ( 2 ), 95 – 101 .
Ufuk , F. ( 2023 ). The role and limitations of large language models such as ChatGPT in clinical settings and medical journalism . Radiology , 307 ( 3 ), e230276 .
Vaswani , A. , Shazeer , N. , Parmar , N. , Uszkoreit , J. , Jones , L. , … Polosukhin , I. ( 2017 ). Attention is all you need . Advances in neural information processing systems , (Vol. 30). .
Vaswani , A. , Zhao , Y. , Fossum , V. , & Chiang , D. ( 2013 ). Decoding with large-scale neural language models improves translation . In Proceedings of the 2013 conference on empirical methods in natural language processing .
Wei , J. , Tay , Y. , Bommasani , R. , Raffel , C. , Zoph , B. , Borgeaud , S. , … Metzler , D. ( 2022a ). Emergent abilities of large language models . arXiv preprint arXiv:2206.07682 .
Wei , J. , Wang , X. , Schuurmans , D. , Bosma , M. , Xia , F. , Chi , E. , … Zhou , D. ( 2022b ). Chain-of-thought prompting elicits reasoning in large language models . Advances in Neural Information Processing Systems , 35 , 24824 – 24837 .
Yan , L. , Sha , L. , Zhao , L. , Li , Y. , Martinez‐Maldonado , R. , Chen , G. , … Gašević , D. ( 2023 ). Practical and ethical challenges of large language models in education: A systematic scoping review . arXiv preprint arXiv:2303.13379 .
Zhang , B. , Haddow , B. , & Birch , A. ( 2023 ). Prompting large language model for machine translation: A case study . arXiv preprint arXiv:2301.07069 .
Zhao , W. X. , Zhou , K. , Li , J. , Tang , T. , Wang , X. , Hou , Y. , … Dong , Z. ( 2023 ). A survey of large language models . arXiv preprint arXiv:2303.18223 .
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations of the underlying neural networks, context length improvements, model alignment, training datasets, benchmarking, efficiency and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides that overview to the research community. It not only offers a systematic treatment of the existing literature on a broad range of LLM-related concepts, but also pays special attention to providing comprehensive summaries with extensive details about the individual existing models, datasets and major insights. We also align our overview with the emerging outlook of this research direction by accounting for other recently published reviews of the broader field of LLMs. Our self-contained, comprehensive overview of LLMs discusses relevant background concepts and covers the advanced topics at the frontier of this research direction. This review article is intended to provide not only a systematic survey, but also a quick, comprehensive reference from which researchers and practitioners can draw insights, through extensive informative summaries of the existing works, to advance the LLM research direction.
Scholars in computer science, linguistics, and philosophy explore the pains and promises of GPT-3.
Shana Lynch
GPT-3 can translate language, write essays, generate computer code, and more — all with limited to no supervision.
In July 2020, OpenAI unveiled GPT-3, a language model that was easily the largest known at the time. Put simply, GPT-3 is trained to predict the next word in a sentence, much like how a text message autocomplete feature works. However, model developers and early users demonstrated that it had surprising capabilities, like the ability to write convincing essays, create charts and websites from text descriptions, generate computer code, and more — all with limited to no supervision. The model also has shortcomings. For example, it can generate racist, sexist, and bigoted text, as well as superficially plausible content that, upon further inspection, is factually inaccurate, undesirable, or unpredictable.
To better understand GPT-3’s capabilities, limitations, and potential impact on society, HAI convened researchers from OpenAI, Stanford, and other universities in a Chatham House Rule workshop. Below are some takeaways from the discussion. A more detailed summary can be found here .
GPT-3 has 175 billion parameters and was trained on 570 gigabytes of text. For comparison, its predecessor, GPT-2, was over 100 times smaller, at 1.5 billion parameters. This increase in scale drastically changes the behavior of the model — GPT-3 is able to perform tasks it was not explicitly trained on, like translating sentences from English to French, with few to no training examples. This behavior was mostly absent in GPT-2. Furthermore, for some tasks, GPT-3 outperforms models that were explicitly trained to solve those tasks, although in other tasks it falls short. Workshop participants said they were surprised that such behavior emerges from simple scaling of data and computational resources and expressed curiosity about what further capabilities would emerge from further scale.
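To make the scale comparison and the idea of "few training examples" concrete, here is a minimal sketch in Python. The parameter counts come from the paragraph above; the prompt template, the function names, and the example word pairs are hypothetical and chosen only for illustration. The sketch builds the text a user might send to a model and does not call any model.

```python
# Parameter counts as stated in the text above.
GPT2_PARAMS = 1.5e9
GPT3_PARAMS = 175e9


def scale_ratio(larger: float, smaller: float) -> float:
    """Return how many times larger one model is than the other."""
    return larger / smaller


def build_few_shot_prompt(examples, query):
    """Format demonstration pairs followed by an unanswered query.

    The 'English:' / 'French:' layout is a hypothetical template,
    not a format prescribed by the article or by OpenAI.
    """
    blocks = ["Translate English to French."]
    for english, french in examples:
        blocks.append(f"English: {english}\nFrench: {french}")
    # The final block is left open so the model continues with the answer.
    blocks.append(f"English: {query}\nFrench:")
    return "\n\n".join(blocks)


ratio = scale_ratio(GPT3_PARAMS, GPT2_PARAMS)
prompt = build_few_shot_prompt(
    [("cheese", "fromage"), ("good morning", "bonjour")],
    "thank you",
)
print(f"GPT-3 has roughly {ratio:.0f} times as many parameters as GPT-2")
print(prompt)
```

With zero demonstration pairs the same function produces a zero-shot prompt, which is the "no training examples" case the paragraph mentions.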
GPT-3 has an unusually large set of capabilities, including text summarization, chatbots, search, and code generation. Future users are likely to discover even more capabilities. This makes it difficult to characterize all possible uses (and misuses) of large language models in order to forecast the impact GPT-3 might have on society. Furthermore, it’s unclear what effect highly capable models will have on the labor market. This raises the question of when (or what) jobs could (or should) be automated by large language models.
Unlike chess engines, which solve a specific problem, humans are “generally” intelligent and can learn to do anything from writing poetry to playing soccer to filing tax returns. In contrast to most current AI systems, GPT-3 is edging closer to such general intelligence, workshop participants agreed. However, participants differed in terms of where they felt GPT-3 fell short in this regard.
Some participants said that GPT-3 lacked intentions, goals, and the ability to understand cause and effect — all hallmarks of human cognition. On the other hand, some noted that GPT-3 might not need to understand to successfully perform tasks — after all, a non-French speaker recently won the French Scrabble championship.
GPT-3 was trained primarily on text. Participants agreed that future language models would be trained on data from other modalities (e.g., images, audio recordings, and videos) to enable more diverse capabilities, provide a stronger learning signal, and increase learning speed. In fact, shortly after the workshop, OpenAI took a step in this direction and released a model called DALL-E, a version of GPT-3 that generates images from text descriptions. One surprising aspect of DALL-E is its ability to sensibly synthesize visual images from whimsical text descriptions. For example, it can generate a convincing rendition of “a baby daikon radish in a tutu walking a dog.”
Furthermore, some workshop participants also felt future models should be embodied — meaning that they should be situated in an environment they can interact with. Some argued this would help models learn cause and effect the way humans do, through physically interacting with their surroundings.
Models like GPT-3 can be used to create false or misleading essays, tweets, or news stories. Still, participants questioned whether it’s easier, cheaper, and more effective to hire humans to create such propaganda. One held that we could learn from similar calls of alarm when the photo-editing software program Photoshop was developed. Most agreed that we need a better understanding of the economics of automated versus human-generated disinformation before we understand how much of a threat GPT-3 poses.
GPT-3 can exhibit undesirable behavior, including known racial, gender, and religious biases. Participants noted that it’s difficult to define what it means to mitigate such behavior in a universal manner—either in the training data or in the trained model — since appropriate language use varies across context and cultures. Nevertheless, participants discussed several potential solutions, including filtering the training data or model outputs, changing the way the model is trained, and learning from human feedback and testing. However, participants agreed there is no silver bullet and further cross-disciplinary research is needed on what values we should imbue these models with and how to accomplish this.
Who should build and deploy these large language models? How will they be held accountable for possible harms resulting from poor performance, bias, or misuse? Workshop participants considered a range of ideas: Increase resources available to universities so that academia can build and evaluate new models, legally require disclosure when AI is used to generate synthetic media, and develop tools and metrics to evaluate possible harms and misuses.
Pervading the workshop conversation was also a sense of urgency — organizations developing large language models will have only a short window of opportunity before others develop similar or better models. Those currently on the cutting edge, participants argued, have a unique ability and responsibility to set norms and guidelines that others may follow.
Large language models (LLMs) are a category of foundation models trained on immense amounts of data making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks.
LLMs have become a household name thanks to their role in bringing generative AI to the forefront of public interest. They are also where organizations are focusing their efforts to adopt artificial intelligence across numerous business functions and use cases.
Outside of the enterprise context, it may seem like LLMs have arrived out of the blue along with new developments in generative AI . However, many companies, including IBM, have spent years implementing LLMs at different levels to enhance their natural language understanding (NLU) and natural language processing (NLP) capabilities. This has occurred alongside advances in machine learning, machine learning models, algorithms, neural networks and the transformer models that provide the architecture for these AI systems.
LLMs are a class of foundation models , which are trained on enormous amounts of data to provide the foundational capabilities needed to drive multiple use cases and applications, as well as resolve a multitude of tasks. This is in stark contrast to the idea of building and training domain specific models for each of these use cases individually, which is prohibitive under many criteria (most importantly cost and infrastructure), stifles synergies and can even lead to inferior performance.
LLMs represent a significant breakthrough in NLP and artificial intelligence, and are easily accessible to the public through interfaces like OpenAI’s ChatGPT, built on GPT-3.5 and GPT-4, which have garnered the support of Microsoft. Other examples include Meta’s Llama and RoBERTa models and Google’s bidirectional encoder representations from transformers (BERT) and PaLM models. IBM has also recently launched its Granite model series on watsonx.ai, which has become the generative AI backbone for other IBM products like watsonx Assistant and watsonx Orchestrate.
In a nutshell, LLMs are designed to understand and generate text like a human, in addition to other forms of content, based on the vast amount of data used to train them. They have the ability to infer from context, generate coherent and contextually relevant responses, translate to languages other than English, summarize text, answer questions (general conversation and FAQs) and even assist in creative writing or code generation tasks .
They are able to do this thanks to billions of parameters that enable them to capture intricate patterns in language and perform a wide array of language-related tasks. LLMs are revolutionizing applications in various fields, from chatbots and virtual assistants to content generation, research assistance and language translation.
As they continue to evolve and improve, LLMs are poised to reshape the way we interact with technology and access information, making them a pivotal part of the modern digital landscape.
LLMs operate by leveraging deep learning techniques and vast amounts of textual data. These models are typically based on a transformer architecture, like the generative pre-trained transformer, which excels at handling sequential data like text input. LLMs consist of multiple layers of neural networks, each with parameters that are adjusted during training. These layers are enhanced by the attention mechanism, which lets the model focus on the most relevant parts of its input.
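As an illustration of the attention mechanism mentioned above, here is a minimal sketch of scaled dot-product attention, the form used in transformer models, written in plain Python. It is a teaching sketch under simplifying assumptions: a single attention head, no learned projection matrices, no masking, and vectors represented as Python lists. It is not how any production LLM is implemented.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Each output vector is a weighted average of the value vectors, with
    weights given by how well the query matches each key.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean value.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [2.0, 0.0]],
)
print(result)
```

A key that matches the query more closely gets a larger weight, which is the sense in which attention "focuses" on particular parts of the input.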
During the training process, these models learn to predict the next word in a sentence based on the context provided by the preceding words. The model does this by assigning a probability score to each candidate next token, after the text has been tokenized, that is, broken down into smaller sequences of characters. These tokens are then transformed into embeddings, which are numeric representations of this context.
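The idea of tokenizing text and assigning probabilities to the next token can be sketched with a toy model. The example below is deliberately far simpler than an LLM: it assumes a whitespace tokenizer instead of a subword tokenizer, conditions on only the single preceding token (a bigram model), and estimates probabilities by counting rather than with a neural network and embeddings. The corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train_bigram(corpus):
    """Count, for every token, which tokens follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = tokenize(sentence)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Return a probability for each token observed after `prev`."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    if total == 0:
        return {}
    return {token: count / total for token, count in following.items()}


corpus = ["the cat sat on the mat", "the cat ate", "the dog sat"]
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)
```

In this corpus "the" is followed by "cat" twice and by "mat" and "dog" once each, so "cat" receives half of the probability mass. An LLM performs the same kind of prediction but conditions on a long context through learned parameters.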
To ensure accuracy, this process involves training the LLM on massive corpora of text (in the billions of pages), allowing it to learn grammar, semantics and conceptual relationships through self-supervised learning. Once trained on this data, LLMs can generate text by autonomously predicting the next word based on the input they receive, drawing on the patterns and knowledge they've acquired. The result is coherent and contextually relevant language generation that can be harnessed for a wide range of NLU and content generation tasks.
Model performance can also be increased through prompt engineering, prompt-tuning , fine-tuning and other tactics like reinforcement learning with human feedback (RLHF) to remove the biases, hateful speech and factually incorrect answers known as “ hallucinations ” that are often unwanted byproducts of training on so much unstructured data. This is one of the most important aspects of ensuring enterprise-grade LLMs are ready for use and do not expose organizations to unwanted liability, or cause damage to their reputation.
LLMs are redefining an increasing number of business processes and have proven their versatility across a myriad of use cases and tasks in various industries. They augment conversational AI in chatbots and virtual assistants (like IBM watsonx Assistant and Google’s Bard) to enhance the interactions that underpin excellence in customer care, providing context-aware responses that mimic interactions with human agents.
LLMs also excel in content generation, automating content creation for blog articles, marketing or sales materials and other writing tasks. In research and academia, they aid in summarizing and extracting information from vast datasets, accelerating knowledge discovery. LLMs also play a vital role in language translation, breaking down language barriers by providing accurate and contextually relevant translations. They can even be used to write code, or “translate” between programming languages.
Moreover, they contribute to accessibility by assisting individuals with disabilities, including text-to-speech applications and generating content in accessible formats. From healthcare to finance, LLMs are transforming industries by streamlining processes, improving customer experiences and enabling more efficient and data-driven decision making.
Most excitingly, all of these capabilities are easy to access, in some cases literally an API integration away.
Here is a list of some of the most important areas where LLMs benefit organizations:
Text generation : language generation abilities, such as writing emails, blog posts or other mid-to-long form content in response to prompts that can be refined and polished. An excellent example is retrieval-augmented generation ( RAG ).
Content summarization : summarize long articles, news stories, research reports, corporate documentation and even customer history into thorough texts tailored in length to the output format.
AI assistants : chatbots that answer customer queries, perform backend tasks and provide detailed information in natural language as a part of an integrated, self-serve customer care solution.
Code generation : assists developers in building applications, finding errors in code and uncovering security issues in multiple programming languages, even “translating” between them.
Sentiment analysis : analyze text to determine the customer’s tone in order to understand customer feedback at scale and aid in brand reputation management.
Language translation : provides wider coverage to organizations across languages and geographies with fluent translations and multilingual capabilities.
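The list above names retrieval-augmented generation (RAG) as an example of text generation. The retrieval half of that pattern can be sketched as follows. This is a minimal illustration under stated assumptions: the "embedding" is a bag-of-words count vector standing in for a learned embedding model, the "vector store" is a plain Python list, and the documents, function names, and prompt wording are all hypothetical. No LLM is called; the sketch only assembles the prompt that would be sent to one.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "refunds are processed within five business days",
    "our support line is open on weekdays",
    "shipping is free for orders over fifty dollars",
]
top = retrieve("how long do refunds take", docs)
print(build_prompt("how long do refunds take", docs))
```

Grounding the prompt in retrieved text is what lets the generated answer draw on business content rather than only on what the model learned during training.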
LLMs stand to impact every industry, from finance to insurance, human resources to healthcare and beyond, by automating customer self-service, accelerating response times on an increasing number of tasks as well as providing greater accuracy, enhanced routing and intelligent context gathering.
Organizations need a solid foundation in governance practices to harness the potential of AI models to revolutionize the way they do business. This means providing access to AI tools and technology that is trustworthy, transparent, responsible and secure. AI governance and traceability are also fundamental aspects of the solutions IBM brings to its customers, so that activities that involve AI are managed and monitored to allow for tracing origins, data and models in a way that is always auditable and accountable.
Trained on enterprise-focused datasets curated directly by IBM to help mitigate the risks that come with generative AI, so that models are deployed responsibly and require minimal input to ensure they are customer ready.
Watsonx.ai provides access to open-source models from Hugging Face, third party models as well as IBM’s family of pre-trained models. The Granite model series, for example, uses a decoder architecture to support a variety of generative AI tasks targeted for enterprise use cases.
Granite is IBM's flagship series of LLM foundation models based on decoder-only transformer architecture. Granite language models are trained on trusted enterprise data spanning internet, academic, code, legal and finance.
Developed by IBM Research, the Granite models use a “Decoder” architecture, which is what underpins the ability of today’s large language models to predict the next word in a sequence.
January 11, 2024
Posted by Gladys Tyen, Intern, Google Research
LLMs are increasingly popular for reasoning tasks, such as multi-turn QA, task completion, code generation, or mathematics. Yet much like people, they do not always solve problems correctly on the first try, especially on tasks for which they were not trained. Therefore, for such systems to be most useful, they should be able to 1) identify where their reasoning went wrong and 2) backtrack to find another solution.
This has led to a surge in methods related to self-correction, where an LLM is used to identify problems in its own output, and then produce improved results based on the feedback. Self-correction is generally thought of as a single process, but we decided to break it down into two components, mistake finding and output correction.
In “LLMs cannot find reasoning errors, but can correct them!”, we test state-of-the-art LLMs on mistake finding and output correction separately. We present BIG-Bench Mistake, an evaluation benchmark dataset for mistake identification, which we use to address the following questions:
Mistake finding is an underexplored problem in natural language processing, with a particular lack of evaluation tasks in this domain. To best assess the ability of LLMs to find mistakes, evaluation tasks should exhibit mistakes that are non-ambiguous. To our knowledge, most current mistake-finding datasets do not go beyond the realm of mathematics for this reason.
To assess the ability of LLMs to reason about mistakes outside of the math domain, we produce a new dataset for use by the research community, called BIG-Bench Mistake. This dataset consists of Chain-of-Thought traces generated using PaLM 2 on five tasks in BIG-Bench. Each trace is annotated with the location of the first logical mistake.
To maximize the number of mistakes in our dataset, we sample 255 traces where the answer is incorrect (so we know there is definitely a mistake), and 45 traces where the answer is correct (so there may or may not be a mistake). We then ask human labelers to go through each trace and identify the first mistake step. Each trace has been annotated by at least three labelers, whose answers had inter-rater reliability levels of >0.98 (using Krippendorff’s α). The labeling was done for all tasks except the Dyck Languages task, which involves predicting the sequence of closing parentheses for a given input sequence. We labeled this task algorithmically.
The logical errors made in this dataset are simple and unambiguous, providing a good benchmark for testing an LLM’s ability to find its own mistakes before applying it to harder, more ambiguous tasks.
1. Can LLMs find logical mistakes in Chain-of-Thought style reasoning?
First, we want to find out if LLMs can identify mistakes independently of their ability to correct them. We attempt multiple prompting methods to test GPT series models for their ability to locate mistakes (prompts here) under the assumption that they are generally representative of modern LLM performance.
Generally, we found these state-of-the-art models perform poorly, with the best model achieving 52.9% accuracy overall. Hence, there is a need to improve LLMs’ ability in this area of reasoning.
In our experiments, we try three different prompting methods: direct (trace), direct (step) and CoT (step). In direct (trace), we provide the LLM with the trace and ask for the location step of the mistake or no mistake. In direct (step), we prompt the LLM to ask itself this question for each step it takes. In CoT (step), we prompt the LLM to give its reasoning for whether each step is a mistake or not a mistake.
A diagram showing the three prompting methods direct (trace), direct (step) and CoT (step).
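As a rough illustration, the three prompting setups can be written as prompt builders. The prompt wording below is hypothetical (the actual prompts are not reproduced in this text), the function names are made up for this sketch, and showing each per-step query only the trace up to that step is an assumption.

```python
from typing import List


def _numbered(steps: List[str]) -> str:
    """Render steps as 'Step 1: ...' lines."""
    return "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, start=1))


def direct_trace_prompt(steps: List[str]) -> str:
    """direct (trace): show the whole trace once and ask for the mistake location."""
    return (
        f"{_numbered(steps)}\n\n"
        "Which step contains the first mistake? "
        "Reply with a step number or 'No mistake'."
    )


def direct_step_prompts(steps: List[str]) -> List[str]:
    """direct (step): one yes/no query per step (hypothetical wording)."""
    return [
        f"{_numbered(steps[:i])}\n\nIs Step {i} a mistake? Reply Yes or No."
        for i in range(1, len(steps) + 1)
    ]


def cot_step_prompts(steps: List[str]) -> List[str]:
    """CoT (step): like direct (step), but ask for reasoning before the verdict."""
    return [
        f"{_numbered(steps[:i])}\n\n"
        f"Explain whether Step {i} is a mistake, then reply Yes or No."
        for i in range(1, len(steps) + 1)
    ]
```

With a three-step trace, `direct_trace_prompt` yields a single query, while the two per-step builders yield three queries each.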
Our finding is in line with, and builds upon, prior results, but goes further in showing that LLMs struggle with even simple and unambiguous mistakes (for comparison, our human raters without prior expertise solve the problem with a high degree of agreement). We hypothesize that this is a big reason why LLMs are unable to self-correct reasoning errors. See the paper for the full results.
When we are confronted with a problem where we are unsure of the answer, we can work through our solution step by step. If no error is found, we can make the assumption that we did the right thing.
While we hypothesized that this would work similarly for LLMs, we discovered that this is a poor strategy. On our dataset of 85% incorrect traces and 15% correct traces, using this method is not much better than the naïve strategy of always labeling traces as incorrect, which gives a weighted average F1 of 78.
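The naïve baseline figure can be checked by hand. The short calculation below assumes the standard definition of weighted-average F1 (per-class F1 weighted by class frequency) and uses the 255 incorrect and 45 correct traces described earlier.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1_always_incorrect(n_incorrect: int, n_correct: int) -> float:
    """Weighted-average F1 of a classifier that labels every trace 'incorrect'."""
    total = n_incorrect + n_correct
    # Class "incorrect": everything is predicted incorrect, so recall is 1
    # and precision equals the share of truly incorrect traces.
    f1_incorrect = f1(n_incorrect / total, 1.0)
    # Class "correct": never predicted, so its F1 is 0.
    f1_correct = 0.0
    return (n_incorrect * f1_incorrect + n_correct * f1_correct) / total


score = weighted_f1_always_incorrect(255, 45)
print(round(100 * score))  # 78
```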
A diagram showing how well mistake-finding with LLMs can be used as a proxy for correctness of the answer on each dataset.
Since we’ve shown that LLMs exhibit poor performance in finding reasoning errors in CoT traces, we want to know whether LLMs can even correct errors at all, even if they know where the error is.
Note that knowing the mistake location is different from knowing the right answer: CoT traces can contain logical mistakes even if the final answer is correct, or vice versa. In most real-world situations, we won’t know what the right answer is, but we might be able to identify logical errors in intermediate steps.
We propose the following backtracking method:
It’s a very simple method that does not require any additional prompt crafting and avoids having to re-generate the entire trace. We test it using the mistake location data from BIG-Bench Mistake, and we find that it can correct CoT errors.
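The list of steps is not reproduced in this text, so the sketch below is only one plausible reading of the description: keep the steps before the identified mistake, resample the mistaken step until it differs from the original, then continue the trace from there. The `generate_step` callable, the temperature values, and the retry limit are all assumptions, and a toy generator stands in for the LLM.

```python
from typing import Callable, List, Optional

# Hypothetical interface: given the steps so far and a sampling temperature,
# return the next step, or None when the trace is finished.
StepFn = Callable[[List[str], float], Optional[str]]


def backtrack(trace: List[str], mistake_idx: int, generate_step: StepFn,
              max_tries: int = 8, max_steps: int = 20) -> List[str]:
    prefix = list(trace[:mistake_idx])  # steps before the mistake are kept as-is
    original = trace[mistake_idx]

    # Resample only the mistaken step until we get something different.
    replacement = None
    for _ in range(max_tries):
        candidate = generate_step(prefix, 1.0)  # assumed: sample with temperature
        if candidate is not None and candidate != original:
            replacement = candidate
            break
    if replacement is None:
        return list(trace)  # no alternative found; leave the trace unchanged

    # Continue from the replaced step instead of regenerating the whole trace.
    new_trace = prefix + [replacement]
    while len(new_trace) < max_steps:
        nxt = generate_step(new_trace, 0.0)  # assumed: greedy continuation
        if nxt is None:
            break
        new_trace.append(nxt)
    return new_trace


def toy_generator(prefix: List[str], temperature: float) -> Optional[str]:
    """Toy stand-in for an LLM: the 'right' trace simply counts 1, 2, 3."""
    target = ["1", "2", "3"]
    return target[len(prefix)] if len(prefix) < len(target) else None


fixed = backtrack(["1", "5", "6"], mistake_idx=1, generate_step=toy_generator)
print(fixed)  # ['1', '2', '3']
```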
Recent work showed that self-correction methods, like Reflexion and RCI, cause deterioration in accuracy scores because there are more correct answers becoming incorrect than vice versa. Our method, on the other hand, produces more gains (by correcting wrong answers) than losses (by changing right answers to wrong answers).
We also compare our method with a random baseline, where we randomly assume a step to be a mistake. Our results show that this random baseline does produce some gains, but not as much as backtracking with the correct mistake location, and with more losses.
A diagram showing the gains and losses in accuracy for our method as well as a random baseline on each dataset.
To answer this question, we fine-tuned a small model on four of the BIG-Bench tasks and tested it on the fifth, held-out task. We do this for every task, producing five fine-tuned models in total. Then we compare the results with just zero-shot prompting PaLM 2-L-Unicorn, a much larger model.
Bar chart showing the accuracy improvement of the fine-tuned small model compared to zero-shot prompting with PaLM 2-L-Unicorn.
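The held-out setup amounts to a leave-one-task-out split. In the sketch below, only two task names come from this text (Dyck languages and logical deduction); the other three are placeholders, and no actual fine-tuning is performed.

```python
from typing import Iterator, List, Tuple

# Two names are taken from the text; "task_c" to "task_e" are placeholders.
TASKS = ["dyck_languages", "logical_deduction", "task_c", "task_d", "task_e"]


def leave_one_task_out(tasks: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training tasks, held-out task) pairs, one per task."""
    for held_out in tasks:
        train = [t for t in tasks if t != held_out]
        yield train, held_out


splits = list(leave_one_task_out(TASKS))
for train, held_out in splits:
    # One model would be fine-tuned on `train` and evaluated on `held_out`.
    print(f"train on {train} -> test on {held_out}")
```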
Our results show that the much smaller fine-tuned reward model generally performs better than zero-shot prompting a large model, even though the reward model has never seen data from the task in the test set. The only exception is logical deduction, where it performs on par with zero-shot prompting.
This is a very promising result as we can potentially just use a small fine-tuned reward model to perform backtracking and improve accuracy on any task, even if we don’t have the data for it. This smaller reward model is completely independent of the generator LLM, and can be updated and further fine-tuned for individual use cases.
An illustration showing how our backtracking method works.
In this work, we created an evaluation benchmark dataset that the wider academic community can use to evaluate future LLMs. We further showed that LLMs currently struggle to find logical errors. However, if they could, backtracking would be an effective strategy for providing gains on tasks. Finally, a smaller reward model can be trained on general mistake-finding tasks and be used to improve out-of-domain mistake finding, showing that mistake-finding can generalize.
Thank you to Peter Chen, Tony Mak, Hassan Mansoor and Victor Cărbune for contributing ideas and helping with the experiments and data collection. We would also like to thank Sian Gooding and Vicky Zayats for their comments and suggestions on the paper.
Mapping the Mind of a Large Language Model
Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.
We mostly treat AI models as a black box: something goes in and a response comes out, and it's not clear why the model gave that particular response instead of another. This makes it hard to trust that these models are safe: if we don't know how they work, how do we know they won't give harmful, biased, untruthful, or otherwise dangerous responses? How can we trust that they’ll be safe and reliable?
Opening the black box doesn't necessarily help: the internal state of the model—what the model is "thinking" before writing its response—consists of a long list of numbers ("neuron activations") without a clear meaning. From interacting with a model like Claude, it's clear that it’s able to understand and wield a wide range of concepts—but we can't discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.
Previously, we made some progress matching patterns of neuron activations, called features, to human-interpretable concepts. We used a technique called "dictionary learning", borrowed from classical machine learning, which isolates patterns of neuron activations that recur across many different contexts. In turn, any internal state of the model can be represented in terms of a few active features instead of many active neurons. Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.
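A toy illustration of the representation idea (not of how the dictionary is learned): an internal state over four "neurons" is built from two active features and then decomposed back into them. The feature names, the four-dimensional vectors, and the use of mutually orthogonal unit-length directions are simplifying assumptions; in practice a learned dictionary has far more features than neurons, and they are not orthogonal.

```python
from typing import Dict, List

# Hypothetical feature directions; each one is spread across all four neurons,
# and every neuron takes part in every feature.
FEATURES: Dict[str, List[float]] = {
    "uppercase_text": [0.5, 0.5, 0.5, 0.5],
    "dna_sequence":   [0.5, 0.5, -0.5, -0.5],
    "python_args":    [0.5, -0.5, 0.5, -0.5],
}


def compose(codes: Dict[str, float]) -> List[float]:
    """Build neuron activations as a weighted sum of feature directions."""
    state = [0.0] * 4
    for name, strength in codes.items():
        for i, weight in enumerate(FEATURES[name]):
            state[i] += strength * weight
    return state


def decompose(state: List[float], threshold: float = 1e-9) -> Dict[str, float]:
    """Recover the active features by projecting onto each direction."""
    codes = {}
    for name, direction in FEATURES.items():
        strength = sum(s * w for s, w in zip(state, direction))
        if strength > threshold:  # keep only features that are actually active
            codes[name] = strength
    return codes


state = compose({"uppercase_text": 2.0, "python_args": 1.0})
print(state)             # [1.5, 0.5, 1.5, 0.5]
print(decompose(state))  # {'uppercase_text': 2.0, 'python_args': 1.0}
```

The four raw numbers in `state` say little on their own, while the two recovered feature strengths are directly readable.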
In October 2023, we reported success applying dictionary learning to a very small "toy" language model and found coherent features corresponding to concepts like uppercase text, DNA sequences, surnames in citations, nouns in mathematics, or function arguments in Python code.
Those concepts were intriguing—but the model really was very simple. Other researchers subsequently applied similar techniques to somewhat larger and more complex models than in our original study. But we were optimistic that we could scale up the technique to the vastly larger AI language models now in regular use, and in doing so, learn a great deal about the features supporting their sophisticated behaviors. This required going up by many orders of magnitude—from a backyard bottle rocket to a Saturn-V.
There was both an engineering challenge (the raw sizes of the models involved required heavy-duty parallel computation) and scientific risk (large models behave differently to small ones, so the same technique we used before might not have worked). Luckily, the engineering and scientific expertise we've developed training large language models for Claude actually transferred to helping us do these large dictionary learning experiments. We used the same scaling law philosophy that predicts the performance of larger models from smaller ones to tune our methods at an affordable scale before launching on Sonnet.
As for the scientific risk, the proof is in the pudding.
We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet (a member of our current, state-of-the-art model family, currently available on claude.ai), providing a rough conceptual map of its internal states halfway through its computation. This is the first ever detailed look inside a modern, production-grade large language model.
Whereas the features we found in the toy language model were rather superficial, the features we found in Sonnet have a depth, breadth, and abstraction reflecting Sonnet's advanced capabilities.
We see features corresponding to a vast range of entities like cities (San Francisco), people (Rosalind Franklin), atomic elements (Lithium), scientific fields (immunology), and programming syntax (function calls). These features are multimodal and multilingual, responding to images of a given entity as well as its name or description in many languages.
We also find more abstract features—responding to things like bugs in computer code, discussions of gender bias in professions, and conversations about keeping secrets.
We were able to measure a kind of "distance" between features based on which neurons appeared in their activation patterns. This allowed us to look for features that are "close" to each other. Looking near a "Golden Gate Bridge" feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo.
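One common way to make such a "distance" concrete is cosine similarity between feature vectors. The text does not say which measure was used, so the choice of cosine similarity is an assumption, and the three-dimensional vectors below are invented purely for illustration.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up feature vectors over three "neurons".
FEATURE_VECTORS: Dict[str, List[float]] = {
    "Golden Gate Bridge": [0.9, 0.1, 0.0],
    "Alcatraz Island":    [0.8, 0.2, 0.1],
    "immunology":         [0.0, 0.1, 0.9],
}


def nearest(name: str, k: int = 1) -> List[str]:
    """Return the k features most similar to the named feature."""
    query = FEATURE_VECTORS[name]
    scored = [
        (cosine(query, vec), other)
        for other, vec in FEATURE_VECTORS.items()
        if other != name
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


print(nearest("Golden Gate Bridge"))  # ['Alcatraz Island']
```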
This holds at a higher level of conceptual abstraction: looking near a feature related to the concept of "inner conflict", we find features related to relationship breakups, conflicting allegiances, logical inconsistencies, as well as the phrase "catch-22". This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. This might be the origin of Claude's excellent ability to make analogies and metaphors.
Importantly, we can also manipulate these features, artificially amplifying or suppressing them to see how Claude's responses change.
For example, amplifying the "Golden Gate Bridge" feature gave Claude an identity crisis even Hitchcock couldn’t have imagined: when asked "what is your physical form?", Claude’s usual kind of answer – "I have no physical form, I am an AI model" – changed to something much odder: "I am the Golden Gate Bridge… my physical form is the iconic bridge itself…". Altering the feature had made Claude effectively obsessed with the bridge, bringing it up in answer to almost any query—even in situations where it wasn’t at all relevant.
We also found a feature that activates when Claude reads a scam email (this presumably supports the model’s ability to recognize such emails and warn you not to respond to them). Normally, if one asks Claude to generate a scam email, it will refuse to do so. But when we ask the same question with the feature artificially activated sufficiently strongly, this overcomes Claude's harmlessness training and it responds by drafting a scam email. Users of our models don’t have the ability to strip safeguards and manipulate models in this way—but in our experiments, it was a clear demonstration of how features can be used to change how a model acts.
The fact that manipulating these features causes corresponding changes to behavior validates that they aren't just correlated with the presence of concepts in input text, but also causally shape the model's behavior. In other words, the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior.
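Amplifying or suppressing a feature can be pictured as changing how strongly its direction is present in the model's internal state. The sketch below sets that strength to a chosen value; this "clamping" formulation, the unit-length direction, and all the numbers are assumptions for illustration, not a description of the actual experimental setup.

```python
from typing import List


def strength(state: List[float], direction: List[float]) -> float:
    """How strongly a unit-length feature direction is present in the state."""
    return sum(s * d for s, d in zip(state, direction))


def steer(state: List[float], direction: List[float], target: float) -> List[float]:
    """Return a copy of `state` whose strength along `direction` equals `target`."""
    delta = target - strength(state, direction)
    # Only the component along `direction` changes; the rest is left untouched.
    return [s + delta * d for s, d in zip(state, direction)]


bridge = [0.6, 0.8, 0.0]  # hypothetical unit-length feature direction
state = [1.0, 0.0, 2.0]   # hypothetical internal state

amplified = steer(state, bridge, 10.0)   # feature forced strongly on
suppressed = steer(state, bridge, 0.0)   # feature removed

print(round(strength(amplified, bridge), 6))
print(round(strength(suppressed, bridge), 6))
```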
Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse, including in scenarios of catastrophic risk. It’s therefore particularly interesting that, in addition to the aforementioned scam emails feature, we found features corresponding to:
We previously studied sycophancy, the tendency of models to provide responses that match user beliefs or desires rather than truthful ones. In Sonnet, we found a feature associated with sycophantic praise, which activates on inputs containing compliments like, "Your wisdom is unquestionable". Artificially activating this feature causes Sonnet to respond to an overconfident user with just such flowery deception.
The presence of this feature doesn't mean that Claude will be sycophantic, but merely that it could be. We have not added any capabilities, safe or unsafe, to the model through this work. We have, rather, identified the parts of the model involved in its existing capabilities to recognize and potentially produce different kinds of text. (While you might worry that this method could be used to make models more harmful, researchers have demonstrated much simpler ways that someone with access to model weights can remove safety safeguards.)
We hope that we and others can use these discoveries to make models safer. For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors (such as deceiving the user), to steer them towards desirable outcomes (debiasing), or to remove certain dangerous subject matter entirely. We might also be able to enhance other safety techniques, such as Constitutional AI, by understanding how they shift the model towards more harmless and more honest behavior and identifying any gaps in the process. The latent capabilities to produce harmful text that we saw by artificially activating features are exactly the sort of thing jailbreaks try to exploit. We are proud that Claude has a best-in-industry safety profile and resistance to jailbreaks, and we hope that by looking inside the model in this way we can figure out how to improve safety even further. Finally, we note that these techniques can provide a kind of "test set for safety", looking for the problems left behind after standard training and finetuning methods have ironed out all behaviors visible via standard input/output interactions.
Anthropic has made a significant investment in interpretability research since the company's founding, because we believe that understanding models deeply will help us make them safer. This new research marks an important milestone in that effort—the application of mechanistic interpretability to publicly-deployed large language models.
But the work has really just begun. The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place). Understanding the representations the model uses doesn't tell us how it uses them; even though we have the features, we still need to find the circuits they are involved in. And we need to show that the safety-relevant features we have begun to find can actually be used to improve safety. There's much more to be done.
For full details, please read our paper, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet".
If you are interested in working with us to help interpret and improve AI models, we have open roles on our team and we’d love for you to apply. We’re looking for Managers, Research Scientists, and Research Engineers.
Identifying one faulty turbine in a wind farm, which can involve looking at hundreds of signals and millions of data points, is akin to finding a needle in a haystack.
Engineers often streamline this complex problem using deep-learning models that can detect anomalies in measurements taken repeatedly over time by each turbine, known as time-series data.
But with hundreds of wind turbines recording dozens of signals each hour, training a deep-learning model to analyze time-series data is costly and cumbersome. This is compounded by the fact that the model may need to be retrained after deployment, and wind farm operators may lack the necessary machine-learning expertise.
In a new study, MIT researchers found that large language models (LLMs) hold the potential to be more efficient anomaly detectors for time-series data. Importantly, these pretrained models can be deployed right out of the box.
The researchers developed a framework, called SigLLM, which includes a component that converts time-series data into text-based inputs an LLM can process. A user can feed these prepared data to the model and ask it to start identifying anomalies. The LLM can also be used to forecast future time-series data points as part of an anomaly detection pipeline.
While LLMs could not beat state-of-the-art deep learning models at anomaly detection, they did perform as well as some other AI approaches. If researchers can improve the performance of LLMs, this framework could help technicians flag potential problems in equipment like heavy machinery or satellites before they occur, without the need to train an expensive deep-learning model.
“Since this is just the first iteration, we didn’t expect to get there from the first go, but these results show that there’s an opportunity here to leverage LLMs for complex anomaly detection tasks,” says Sarah Alnegheimish, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on SigLLM.
Her co-authors include Linh Nguyen, an EECS graduate student; Laure Berti-Equille, a research director at the French National Research Institute for Sustainable Development; and senior author Kalyan Veeramachaneni, a principal research scientist in the Laboratory for Information and Decision Systems. The research will be presented at the IEEE Conference on Data Science and Advanced Analytics.
An off-the-shelf solution
Large language models are autoregressive, which means they can understand that the newest values in sequential data depend on previous values. For instance, models like GPT-4 can predict the next word in a sentence using the words that precede it.
Since time-series data are sequential, the researchers thought the autoregressive nature of LLMs might make them well-suited for detecting anomalies in this type of data.
However, they wanted to develop a technique that avoids fine-tuning, a process in which engineers retrain a general-purpose LLM on a small amount of task-specific data to make it an expert at one task. Instead, the researchers deploy an LLM off the shelf, with no additional training steps.
But before they could deploy it, they had to convert time-series data into text-based inputs the language model could handle.
They accomplished this through a sequence of transformations that capture the most important parts of the time series while representing data with the fewest number of tokens. Tokens are the basic inputs for an LLM, and more tokens require more computation.
“If you don’t handle these steps very carefully, you might end up chopping off some part of your data that does matter, losing that information,” Alnegheimish says.
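To give a feel for what such a conversion might involve, the sketch below turns a list of readings into a compact comma-separated string of integers and back. The specific transformations here (shifting by the minimum, fixed-precision rounding, comma joining) are assumptions chosen for illustration; the article does not list the exact steps SigLLM applies.

```python
from typing import List, Tuple


def series_to_text(values: List[float], decimals: int = 1) -> Tuple[str, float]:
    """Encode readings as short non-negative integers joined by commas."""
    lowest = min(values)
    factor = 10 ** decimals
    # Shift so the minimum becomes 0, then keep `decimals` decimal places
    # as whole numbers. Anything finer than that precision is lost.
    ints = [round((v - lowest) * factor) for v in values]
    return ",".join(str(i) for i in ints), lowest


def text_to_series(text: str, lowest: float, decimals: int = 1) -> List[float]:
    """Invert series_to_text, up to the rounding precision."""
    factor = 10 ** decimals
    return [int(token) / factor + lowest for token in text.split(",")]


text, offset = series_to_text([20.5, 20.7, 21.0, 35.2])
print(text)  # 0,2,5,147
```

Shifting and rounding keep each value short, which matters because longer strings mean more tokens. The rounding step is also where information can be lost if the precision is set too low.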
Once they had figured out how to transform time-series data, the researchers developed two anomaly detection approaches.
Approaches for anomaly detection
For the first, which they call Prompter, they feed the prepared data into the model and prompt it to locate anomalous values.
“We had to iterate a number of times to figure out the right prompts for one specific time series. It is not easy to understand how these LLMs ingest and process the data,” Alnegheimish adds.
For the second approach, called Detector, they use the LLM as a forecaster to predict the next value from a time series. The researchers compare the predicted value to the actual value. A large discrepancy suggests that the real value is likely an anomaly.
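The forecast-and-compare logic can be sketched as follows. A simple moving-average function stands in for the LLM forecaster, and the window size and threshold are arbitrary values picked for this example rather than settings from the study.

```python
from typing import Callable, List, Sequence


def detect_anomalies(series: Sequence[float],
                     forecast: Callable[[Sequence[float]], float],
                     window: int = 3,
                     threshold: float = 6.0) -> List[int]:
    """Flag indices where the actual value is far from the forecast value."""
    anomalies = []
    for t in range(window, len(series)):
        predicted = forecast(series[t - window:t])  # predict from recent history
        if abs(series[t] - predicted) > threshold:  # large discrepancy
            anomalies.append(t)
    return anomalies


def mean_forecaster(history: Sequence[float]) -> float:
    """Placeholder for the LLM: predict the average of the recent window."""
    return sum(history) / len(history)


readings = [10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 10.3]
print(detect_anomalies(readings, mean_forecaster))  # [4]
```

Because the forecaster is passed in as a function, the same comparison step would apply whether the prediction comes from an LLM or from any other model.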
With Detector, the LLM would be part of an anomaly detection pipeline, while Prompter would complete the task on its own. In practice, Detector performed better than Prompter, which generated many false positives.
“I think, with the Prompter approach, we were asking the LLM to jump through too many hoops. We were giving it a harder problem to solve,” says Veeramachaneni.
When they compared both approaches to current techniques, Detector outperformed transformer-based AI models on seven of the 11 datasets they evaluated, even though the LLM required no training or fine-tuning.
In the future, an LLM may also be able to provide plain language explanations with its predictions, so an operator could be better able to understand why an LLM identified a certain data point as anomalous.
However, state-of-the-art deep learning models outperformed LLMs by a wide margin, showing that there is still work to do before an LLM could be used for anomaly detection.
“What will it take to get to the point where it is doing as well as these state-of-the-art models? That is the million-dollar question staring at us right now. An LLM-based anomaly detector needs to be a game-changer for us to justify this sort of effort,” Veeramachaneni says.
Moving forward, the researchers want to see if finetuning can improve performance, though that would require additional time, cost, and expertise for training.
Their LLM approaches also take between 30 minutes and two hours to produce results, so increasing the speed is a key area of future work. The researchers also want to probe LLMs to understand how they perform anomaly detection, in the hopes of finding a way to boost their performance.
“When it comes to complex tasks like anomaly detection in time series, LLMs really are a contender. Maybe other complex tasks can be addressed with LLMs, as well?” says Alnegheimish.
This research was supported by SES S.A., Iberdrola and ScottishPower Renewables, and Hyundai Motor Company.
Press mentions
TechCrunch reporter Kyle Wiggers writes that MIT researchers have developed a new tool, called SigLLM, that uses large language models to flag problems in complex systems. In the future, SigLLM could be used to “help technicians flag potential problems in equipment like heavy machinery before they occur.”
Systematic Reviews, volume 13, Article number: 219 (2024)
This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows.
We developed automation scripts in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM 2, and Meta Llama 2, and more recent tools including ChatGPT v4.0 turbo, ChatGPT v3.5 turbo, Google Gemini 1.0 pro, Meta Llama 3, and Claude 3. This study focused on three databases of abstracts and used them as benchmarks to evaluate the performance of these LLM tools in terms of sensitivity, specificity, and overall accuracy. The results of the LLM tools were compared to human-curated inclusion decisions, the gold standard for systematic review and meta-analysis studies.
Different LLM tools had varying abilities in abstract screening. ChatGPT v4.0 demonstrated remarkable performance, with balanced sensitivity and specificity, and overall accuracy consistently reaching or exceeding 90%, indicating a high potential for LLMs in abstract screening tasks. The study found that LLMs could provide reliable results with minimal human effort and thus serve as a cost-effective and efficient alternative to traditional abstract screening methods.
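For reference, the three metrics can be computed from paired inclusion decisions as in the short sketch below. The five decisions are made-up example data, not results from the study, and the function assumes that both included and excluded abstracts are present in the human labels.

```python
from typing import Dict, List


def screening_metrics(llm: List[bool], human: List[bool]) -> Dict[str, float]:
    """Compare LLM include/exclude decisions with human decisions (True = include)."""
    tp = sum(1 for l, h in zip(llm, human) if l and h)          # both include
    tn = sum(1 for l, h in zip(llm, human) if not l and not h)  # both exclude
    fp = sum(1 for l, h in zip(llm, human) if l and not h)      # LLM wrongly includes
    fn = sum(1 for l, h in zip(llm, human) if not l and h)      # LLM wrongly excludes
    return {
        "sensitivity": tp / (tp + fn),  # share of relevant abstracts kept
        "specificity": tn / (tn + fp),  # share of irrelevant abstracts rejected
        "accuracy": (tp + tn) / len(human),
    }


# Made-up example: five abstracts, human labels as the reference standard.
human_labels = [True, True, False, False, False]
llm_labels = [True, False, False, False, True]
metrics = screening_metrics(llm_labels, human_labels)
print(metrics)
```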
While LLM tools are not yet ready to completely replace human experts in abstract screening, they show great promise in revolutionizing the process. They can serve as autonomous AI reviewers, contribute to collaborative workflows with human experts, and integrate with hybrid approaches to develop custom tools for increased efficiency. As technology continues to advance, LLMs are poised to play an increasingly important role in abstract screening, reshaping the workflow of systematic review and meta-analysis studies.
A systematic review is a rigorous form of research that collates and synthesizes all existing evidence on a specific research question [ 1 ]. It stands as a cornerstone not just in medical research but across diverse academic disciplines. Unlike traditional literature reviews, systematic reviews follow a comprehensive and standardized process, such as the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guideline [ 2 ], designed to minimize bias and ensure reproducibility. Therefore, these reviews are recognized as one of the zenith levels of evidence in evidence-based research [ 3 ] and play a pivotal role in shaping clinical guidelines, healthcare policies, and informing medical decisions [ 4 ].
Commencing with a well-articulated research question, a typical systematic review launches an exhaustive search strategy that sweeps through databases, such as PubMed and Embase, supplemented by additional sources such as clinical trial registries and pertinent article reference lists, with the aim of capturing all relevant studies and mitigating bias. Predetermined inclusion and exclusion criteria guide the subsequent screening and selection of studies, encompassing factors like study design, patient demographics, and intervention types. Reviewers, working independently, appraise each study’s eligibility, reconciling disagreements through discussions or third-party reviews. Data extraction and synthesis follow, either through meta-analysis or narrative synthesis, depending on the heterogeneity of the selected studies.
The practice of conducting systematic reviews has gained substantial popularity with considerable demand within the academic community. A notable reference [ 5 ] to this trend found that approximately 75 trials and potentially 11 systematic reviews are disseminated daily, based on data from around 2010. Moreover, a query for “systematic review” in the Google Scholar database yields approximately 17,000 entries that have been published within the year 2023 alone, which translates into about 51 systematic reviews per day, as observed on November 26, 2023. This expanding volume of literature underscores the critical role that systematic reviews play in the consolidation of research findings across various fields of study.
Despite their pivotal role, executing systematic reviews remains a formidable task because of abstract screening, a key phase that can be overwhelmingly time-consuming due to its volume. For example, in Polanin et al. [ 6 ], the authors reported that the research staff screened 29,846 abstracts independently (14,923 unique citations were double screened) over the course of 189 days. In addition, the variability in reporting standards, use of jargon, and varied study designs can further complicate the abstract screening process [ 7 ]. Moreover, the repetitive nature of the task, combined with cognitive fatigue, can lead to human errors [ 8 , 9 ]. Recent advancements in machine learning (ML) and deep learning propose possible solutions to these challenges. However, traditional ML models, while promising, require domain-specific training, a time-consuming process that often demands manual labeling of datasets [ 10 ].
Advancements in natural language processing (NLP) and artificial intelligence (AI) are opening doors to address challenges in systematic reviews. Large language models (LLMs) like ChatGPT [ 11 ], PaLM [ 12 ], Llama [ 13 ], and Claude [ 14 ] are capturing the research community’s attention. Their collective potential, especially their capability to operate without exhaustive domain-specific training, makes them prime candidates for revolutionizing the systematic review process.
While each of the aforementioned AI tools brings its unique capabilities to the table, the fundamental question remains: How do they stack up, both individually and collectively, in the realm of abstract screening, against the human expert-based process? In pursuit of answers, this research seeks to thoroughly investigate the potential of ChatGPT, Google PaLM, Llama, and Claude in automating the crucial abstract screening phase integral to systematic reviews. Our goal is to rigorously compare the performance of these advanced AI-driven methods with existing machine learning (ML)-based approaches. In doing so, we aim to develop AI strategies that masterfully balance efficiency and accuracy with minimal human intervention, ultimately transforming systematic review practice across disciplines.
The use of NLP for abstract screening is established [ 10 , 15 ]. However, the application of LLMs specifically for this task is a nascent field [ 16 , 17 ]. This emerging area offers significant potential to improve efficiency and accuracy. Our study aims to fill this gap by providing a comprehensive analysis of LLM capabilities in abstract screening, laying the groundwork for future research and application. This is particularly relevant considering the rapid evolution of this technology, highlighting its potential to streamline systematic reviews now and in the future.
The remainder of this paper is structured to provide a comprehensive exploration of our topic. We begin with an in-depth examination of existing methods for abstract screening, including both manual and NLP-based approaches, laying the groundwork for understanding the current state of the field. We then introduce the use of large language model (LLM) tools for abstract screening, detailing our experimental design to meticulously evaluate their performance in this context. Subsequent sections present our empirical findings and results, shedding light on the capabilities and limitations of the AI tools in question. Finally, we engage in a thoughtful discussion, reflecting on the implications of our findings and considering the future trajectory of abstract screening in systematic reviews.
In systematic reviews, abstract screening is a foundational step in curating the highest quality of evidence [ 2 ]. However, this process often presents significant challenges because it involves sifting through large volumes of literature to identify studies that align with predefined criteria. Over time, various methodologies, ranging from manual evaluations to sophisticated AI-driven techniques, have been proposed to address the complexities of this task. In this section, we describe the existing approaches, their operational mechanisms, and their associated advantages and disadvantages.
Historically, the process of abstract screening was firmly rooted in manual evaluations. In this conventional approach, individual reviewers would scrutinize each abstract against predefined criteria [ 1 ]. The meticulous nature of this method required that multiple experts independently evaluate the literature to ensure both reliability and reduced biases [ 8 ]. While the depth of human expertise brought about nuanced understanding, the manual nature of this method made it both time-consuming and, at times, prone to human error [ 6 , 9 ].
As technology evolved, the field witnessed the incorporation of natural language processing (NLP) to automate abstract screening [ 10 ]. In this framework, abstract text undergoes preprocessing and vectorization. Supervised machine learning models, notably the support vector machine (SVM) and the random forest (RF), are then trained on this vectorized data to classify literature based on specific criteria [ 15 ]. The strength of this approach lies in its potential for efficiency. However, its efficacy and accuracy hinge heavily on the availability of a well-curated, labeled training set.
A more recent and innovative approach is zero-shot classification, which was notably highlighted by Xian et al. [ 18 ]. Eschewing the need for an extensive labeled training dataset, zero-shot classification offers the allure of categorizing abstracts without prior domain-specific training. By calculating a probability score for each abstract, researchers obtain a dynamic measure of its alignment with predefined categories. While it does not require training for the machine learning model, the tradeoff is a loss of sensitivity and potential omissions of pertinent studies [ 19 ]. In this study, for a given set of abstracts, we first obtain embeddings (using OpenAI’s text embedding API, the babbage model) for abstracts and for a pre-specified description of an ideal study to include. Specifically, we use the inclusion/exclusion criteria (see “Examples of abstract screening by using LLMs,” “Automated workflow for streamlining abstract screening via ChatGPT and other tools,” and “ Study design ” sections) as the description of an ideal study to include. We then compute the cosine similarity score between the embedding of each abstract and the embedding of the pre-specified description. We classify the top 10% of abstracts (in terms of similarity) as positive (i.e., a study should be included).
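The ranking step described above can be illustrated with a minimal sketch. It assumes the embeddings have already been obtained and are available as plain lists of floats; the function names and the toy three-dimensional vectors are hypothetical and are not taken from our R code.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_screen(abstract_embeddings, criteria_embedding, top_fraction=0.10):
    """Flag the top `top_fraction` most similar abstracts as positive (True)."""
    scores = [cosine_similarity(e, criteria_embedding) for e in abstract_embeddings]
    # At least one abstract is flagged; rounding avoids float artifacts.
    n_positive = max(1, round(top_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positive = set(ranked[:n_positive])
    return [i in positive for i in range(len(scores))], scores

# Toy vectors standing in for real text embeddings (hypothetical data).
criteria = [1.0, 0.0, 0.0]
abstracts = [[0.9, 0.1, 0.0]] + [[0.0, 1.0, 0.2 * k] for k in range(9)]
labels, scores = zero_shot_screen(abstracts, criteria)
```

With ten toy abstracts and a 10% cutoff, only the single abstract whose vector points in nearly the same direction as the criteria vector is flagged for inclusion.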
To reconcile the strengths and limitations of these models, a hybrid methodology has emerged [ 18 ]. It marries the rapid categorization capabilities of zero-shot classification with the precision of traditional machine learning models such as SVM and RF. Here, zero-shot classification provides an initial curation of all abstracts in the training set. Then, based on this curation, the abstracts with a classification score higher than a pre-specified threshold are selected to undergo manual review to rectify the zero-shot labels. The rectified classification then serves as a foundation for training traditional models, which are then employed for screening the broader dataset (e.g., in the testing set). The hybrid approach balances the speed of zero-shot and the precision of traditional ML, and potentially offers enhanced accuracy with reduced human effort. However, this approach involves the use of multiple methodologies and still relies on well-curated, labeled training data (in this case, a subset of the whole training set). In this study, for a given set of abstracts, we rectify the “positive” abstracts (i.e., the top 10% of abstracts) identified by zero-shot, and then fit a classification model for rectified labels to the text embeddings of the corresponding abstracts using SVM. This learned classification model is then used to predict the class of other abstracts. The R code for the zero-shot and hybrid approaches is available in the GitHub repository https://github.com/mikeli380/LLMAbstractScreening .
Active learning [ 20 , 21 , 22 ] is an innovative approach to machine learning that optimizes the training process by allowing the model to selectively query a human annotator for labels on the most informative data points. This method is particularly advantageous in scenarios where labeled data is scarce or expensive to obtain. Active learning models can start with minimal datasets, often requiring only one relevant and one irrelevant abstract, making them particularly suitable for tasks such as abstract screening, where the acquisition of large labeled datasets can be prohibitive.
The main advantage of active learning is its efficiency. By focusing on the most informative samples, it reduces the amount of data that needs to be labeled while still effectively training the model. This can significantly reduce the time and resources required for the annotation process. However, the effectiveness of active learning depends heavily on the initial selection of samples and the criteria used to determine the informativeness of subsequent data points. If not well calibrated, the model may request labels for data that do not significantly improve its performance, resulting in an inefficient use of resources. In addition, the iterative nature of querying and updating the model based on new labels can introduce complexity into the training process, potentially requiring more sophisticated infrastructure and oversight than traditional supervised learning methods.
While this study did not test active learning approaches for abstract screening, readers interested in exploring this methodology further are referred to [ 23 ] for detailed information on the application of active learning in abstract screening contexts.
Amidst the evolving methodologies described in the “ Existing approaches to abstract screening in systematic reviews ” section, modern AI tools based on large language models (LLMs), such as ChatGPT, PaLM, Llama, and Claude, are emerging as potential game-changers. Grounded in advanced language processing capabilities, these tools can be tailored to evaluate abstracts against nuanced criteria and offer detailed assessments and classifications. Their prowess signals transformative potential for abstract screening. In this section, we first present two examples to illustrate the potential of LLMs in the context of abstract screening, and then proceed with an in-depth discussion of the study’s workflow and design, which aims to critically investigate the performance of LLMs in this domain.
In this subsection, we demonstrate the process of LLM-based abstract screening using ChatGPT with two specified abstracts: (1) Millard et al. [ 24 ] and (2) Zawia et al. [ 25 ]. Using other LLM tools such as Llama, Google PaLM, or Claude for abstract screening is similar and will be skipped for brevity.
In constructing the prompts, we adopted a standardized approach to mimic a typical interaction between a senior researcher and a research assistant. Each prompt was designed to include three critical elements: a brief statement of the topic under review, the exact eligibility criteria as specified in the study guidelines, and the text of the abstract to be reviewed. This approach was designed to simulate a realistic and practical scenario that reflects common academic practices. In addition, to ensure consistency and objectivity in the responses, we required that the results of the LLM be in a binary “yes” or “no” format, directly indicating whether each abstract met the inclusion criteria. This format was chosen to eliminate any ambiguity in the model’s decisions and to facilitate straightforward analysis of its accuracy.
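As a minimal sketch of the three-element prompt structure described above, the following function assembles a topic statement, the eligibility criteria, and the abstract text, and requests a binary answer. The function name and the exact wording of the template are hypothetical; they illustrate the structure and are not the verbatim prompts used in our experiments.

```python
def build_screening_prompt(topic, criteria, abstract):
    """Assemble a screening prompt from topic, eligibility criteria, and abstract."""
    return (
        f"I would like you to help me with conducting a systematic review on {topic}. "
        "I will provide the title and abstract for one journal article and would "
        "like you to screen the paper for inclusion.\n"
        f"Here are the eligibility criteria: {criteria}\n"
        "Answer with only one word, yes or no.\n"
        f"Here is the abstract:\n{abstract}"
    )

prompt = build_screening_prompt(
    topic="animal depression",
    criteria=(
        "Does the study provide primary data of an animal model of depression "
        "or depressive-like phenotype with an appropriate control group?"
    ),
    abstract="Wistar-Kyoto (WKY) rats show endogenous depressive behavior ...",
)
```

Keeping the template in one function makes it straightforward to apply the same wording to every abstract, so that differences in the responses reflect the abstracts and not the prompt.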
ChatGPT prompt: “I would like you to help me with conducting a systematic review on animal depression. I will provide the title and abstract for one journal article and would like you to screen the paper for inclusion. Here are the inclusion criteria: Does the study provide primary data of an animal model of depression or depressive-like phenotype with an appropriate control group? Here is the abstract:
Wistar-Kyoto (WKY) rats show endogenous depressive behavior that can be reversed by antidepressants. Given that WKYs exhibit decreased sensitivity to some antidepressants and treatment-resistant depressed patients often show hypothalamic-pituitary-thyroid (HPT) dysregulation, we examined the behavioral and HPT hormonal responses of WKYs to altered thyroid status. “Euthyroid” WKYs had elevated basal plasma TSH and T(3) levels as compared to Wistars. Hypothyroidism increased TSH levels more in WKYs than in Wistars and increased response latency in the open field test (OFT) of WKYs only. Administration of T(4) and T(3) suppressed plasma TSH equally in both strains. Wistars responded to increased T(3) levels with decreased response latency and increased activity in the OFT, but increased immobility in the forced swim test. In contrast, WKYs responded only to the high T(3) levels with decreased response latency in the OFT. These results suggest the existence of a decreased central nervous system sensitivity to thyroid hormones in WKYs that could be related to their depressive behavior.”
ChatGPT v4 response (accessed on 10/15/2023):
Given the abstract and the inclusion criteria, this study appears to meet the inclusion criteria for your systematic review on animal models of depression.
Consideration for inclusion: 1. Animal model and depressive-like phenotype: • The study utilizes Wistar-Kyoto (WKY) rats which are mentioned to show endogenous depressive behavior. • This study investigates behaviors and hormonal responses associated with depression in this particular animal model, with a focus on the role of thyroid hormones. 2. Primary data and appropriate control group: • The study provides primary data on behavioral and hormonal responses of WKY rats, comparing them to Wistars (likely serving as the control group), under altered thyroid status, allowing for comparisons between the depressive-like phenotype and controls. • The utilization of both “Euthyroid” WKYs and Wistars as controls helps in effectively investigating the behavioral changes and responses in WKY rats due to alterations in thyroid status. 3. Behavioral test indicating depression: • Depressive behavior and responses are assessed using standardized tests like the open field test (OFT) and the forced swim test, which are conventional methods to investigate depressive-like behaviors in animal models. • The mention of antidepressant’s impact on WKY rats indicates a focus on depressive-like states. 4. Relevance to human condition: • Insights gained from this study on decreased central nervous system sensitivity to thyroid hormones and its possible relation to depressive behavior in WKYs can have implications in understanding treatment-resistant depression in humans, making it valuable for the review. Conclusion: Given the provided abstract, this study does seem to fit the inclusion criteria for a systematic review on animal depression, as it provides primary data on an animal model demonstrating depressive-like behavior, includes an appropriate control group, and employs standardized tests to study the depressive-like states and their relation to thyroid hormones. 
The detailed exploration of depressive behaviors and hormonal imbalances in relation to altered thyroid status can offer valuable insights for understanding and treating depression.
The induction of the activity of cerebral ornithine decarboxylase (EC 4.1.1.17) and mRNA by electrical stimulation exhibits regional differences. The effects of the enzyme inhibitor difluoromethylornithine on these regional variations was examined. Administration of this inhibitor resulted in pronounced depression of both basal and induced activity of ornithine decarboxylase in the hippocampus. Basal activity of the enzyme in the neocortex and the cerebellum appeared to be resistant to difluoromethylornithine but the induced enzyme activity was sensitive to the effects of this inhibitor. Susceptibility to difluoromethylornithine may be directly correlated with a slower turnover rate for ornithine decarboxylase. These results suggest that ornithine decarboxylase in the hippocampus may possess a longer half-life than its counterparts in other regions of the brain. Pretreatment with difluoromethylornithine had no effect on the induced ornithine decarboxylase mRNA in the neocortex. Thus, elevated activity of ornithine decarboxylase enzyme, due to electrical stimulation, appears to not have any effect on either the transcription or the decay rate of the induced ornithine decarboxylase mRNA. These findings support the concept of region-specific regulation of cerebral ornithine decarboxylase. ”
Given the provided abstract and the inclusion criteria, this study does not appear to meet the criteria for inclusion in a systematic review on animal models of depression or depressive-like phenotypes with an appropriate control group.
Consideration for inclusion: 1. Depression or depressive-like phenotype: • The abstract does not provide any information or data related to depression or depressive-like behavior or phenotypes in animals. It is mainly focused on the activity of cerebral ornithine decarboxylase and its regional differences in response to electrical stimulation and inhibitor administration. 2. Animal model: • There is no mention of the use of any specific animal model or any behavioral assays to assess depressive-like symptoms or states in the provided abstract. 3. Behavioral assessment and primary data: • The abstract does not contain any details about behavioral assessments related to depression or depressive-like states. It primarily provides data on the effects of enzyme inhibitor difluoromethylornithine on cerebral ornithine decarboxylase activity and mRNA, but does not relate these findings to depression or depressive-like behaviors. 4. Relevance to depression: • While the study explores biochemical aspects in brain regions like the hippocampus, which is implicated in mood disorders like depression, the connection between the findings and depressive symptoms or states is not established in the provided abstract. Conclusion: This article does not seem to fit the inclusion criteria for a systematic review on animal models of depression or depressive-like phenotypes as it lacks primary data on depressive-like behaviors or phenotypes in animals, and there is no mention of the use of animal models representative of depression or an appropriate control group for such a comparison.
From the two examples above, it is evident that ChatGPT comprehends the posed queries and efficiently extracts the inclusion and exclusion criteria. It not only makes clear decisions regarding study inclusion but also articulates its rationale by cross-referencing each criterion.
It is worth noting that the same ChatGPT prompt might result in slightly different responses between different users due to the variability in its response generation, but the conclusion for the eligibility of an abstract is typically consistent. In addition, enhancing the specificity of the criteria within the prompts can further refine the screening process. To illustrate, consider the following enhanced prompts:
Enhanced ChatGPT prompt for abstract (1): “I seek assistance with a systematic review focused on animal depression. I’ll provide the abstract of a particular journal article and would appreciate an assessment for its inclusion based on the following criteria:
The study employs an animal model reflecting depression or a depressive-like phenotype.
It incorporates an appropriate control group.
Primary data is presented within the study.
and here are the exclusion criteria:
The study bypasses the use of an animal model.
The utilized animal model doesn’t characterize depression or a depressive-like phenotype.
Absence of a control group within the study.
The study is primarily a review, devoid of primary data.
Here is the abstract: …”.
Let’s assess the provided abstract based on the inclusion and exclusion criteria:
Inclusion criteria: 1. Animal model reflecting depression: The study uses Wistar-Kyoto (WKY) rats, which are described as showing “endogenous depressive behavior.” Thus, this criterion is met. 2. Appropriate control group: The study compares WKY rats to Wistars (another strain), which serves as a comparison or control group. This criterion is met. 3. Primary data: The abstract mentions specific results such as the elevated basal plasma TSH and T(3) levels in “Euthyroid” WKYs compared to Wistars and the different responses in the open field test (OFT). This suggests the presence of primary data. Thus, this criterion is met.
Exclusion criteria: 1. Bypasses the use of an animal model: The study uses WKY rats as an animal model. Thus, this criterion is not met, which is good for inclusion. 2. Animal model doesn’t characterize depression: The WKY rats show “endogenous depressive behavior.” Thus, this criterion is not met, which again supports inclusion. 3. Absence of a control group: The study does use a control group (Wistars). Thus, this criterion is not met, supporting inclusion. 4. Study is primarily a review: The abstract presents experimental data and does not indicate that the study is a review. Thus, this criterion is not met, which is favorable for inclusion.
Conclusion: Given that the abstract meets all the inclusion criteria and none of the exclusion criteria, it should be considered in your systematic review on animal depression.
With these revised prompts applied to the previously discussed abstracts, the conclusions remain consistent. Nonetheless, we contend that this enhanced structure not only provides a more rigorous framework but also aligns more closely with widely accepted practices in systematic review criteria definition.
For systematic reviews, particularly when dealing with voluminous data, efficient workflows are paramount. The ChatGPT API (application programming interface) offers a dynamic solution, enabling the automation of abstract screening on a large scale, which circumvents the labor-intensive process of manually inputting abstracts into a chatbot interface. In this subsection, we present an automated workflow for streamlining abstract screening via ChatGPT. Although this automated workflow uses ChatGPT as the platform, analogous workflows apply to other AI platforms such as PaLM, Llama, and Claude.
Automated workflow:
Data collection: The preliminary step entails accumulating a list of titles and abstracts. By utilizing carefully crafted keywords, we retrieve these from PubMed and other pertinent databases. This comprehensive approach ensures the potential inclusion of all relevant studies for a subsequent detailed screening. It is worth noting that while this list is expansive, most of these studies may not find their way into the final meta-analysis post-screening.
Automation through Python: We have devised a Python script aimed at harnessing the capabilities of ChatGPT for evaluating the amassed abstracts.
This script interacts with the ChatGPT API (specifically, the GPT-4 version) and, when furnished with tailored prompts, extracts structured responses from ChatGPT.
Typically, the AI’s response commences with a succinct summary, delves into explanations aligned with each criterion, and concludes with a decisive judgment, as exemplified by the examples in the “ Examples of abstract screening by using LLMs ” section.
This automated process efficiently saves ChatGPT’s verdicts on each abstract for ensuing analyses. For instance, it extracts the final decisions regarding the inclusion or exclusion of studies and determines the stance on each pre-specified criterion for every abstract, as exemplified by the last example in the “ Examples of abstract screening by using LLMs ” section.
Additionally, to ascertain the efficiency and cost-effectiveness of this methodology, the script also monitors the time, token usage, and the financial implications of querying the OpenAI API.
In essence, we envision this procedure as delegating the meticulous task of poring over scientific summaries to an AI assistant. This virtual entity meticulously sifts through each summary, determining its alignment with stipulated criteria.
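The screening loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not our actual script: the model call is abstracted behind a hypothetical `ask_llm` callable that takes a prompt string and returns the response text, `fake_llm` is a stand-in used only so that the example runs offline, and the verdict parser assumes that the response begins with "yes" or "no". Token and cost tracking are omitted because they depend on the usage metadata returned by the specific API.

```python
import time

def parse_verdict(response_text):
    """Return True for 'yes', False for 'no', None if the first word is neither."""
    words = response_text.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    return {"yes": True, "no": False}.get(first)

def screen_abstracts(abstracts, ask_llm, prompt_prefix):
    """Send each abstract to the model and record its verdict and elapsed time."""
    records = []
    for index, abstract in enumerate(abstracts):
        start = time.perf_counter()
        response = ask_llm(prompt_prefix + "\nHere is the abstract:\n" + abstract)
        elapsed = time.perf_counter() - start
        records.append({
            "index": index,
            "include": parse_verdict(response),
            "seconds": elapsed,
            "raw_response": response,  # kept for later per-criterion analysis
        })
    return records

def fake_llm(prompt):
    """Hypothetical stand-in for a real API call; replace with an actual client."""
    return "Yes. Animal model with controls." if "WKY" in prompt else "No."

records = screen_abstracts(
    ["WKY animals show endogenous depressive behavior.", "Enzyme kinetics in vitro."],
    ask_llm=fake_llm,
    prompt_prefix="Screen this paper for a systematic review. Answer yes or no.",
)
```

Passing the model call in as a function keeps the loop independent of any one vendor, which mirrors the observation that analogous workflows apply to other LLM platforms.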
Tuning parameters in the ChatGPT API: The effectiveness of the ChatGPT API is not only dependent on the input data; it is also significantly influenced by adjustable parameters that can refine the responses. Parameters such as temperature, top k, and top p critically affect model performance by modulating the randomness and focus of the output. While fine-tuning these parameters can improve results, it requires significant technical expertise and resources. Defaults, which are rigorously tested by developers, strike a balance between output quality and ease of use, making LLMs more accessible to a wider range of users without the need for complex parameter optimization. While customization holds promise for custom applications, the default settings provide an efficient and practical solution that facilitates wider adoption of LLM technologies.
Given the complexity of fine-tuning these parameters to optimize performance specifically for abstract screening, our study primarily used the recommended default settings for these parameters as provided by the respective platforms (detailed in Table 1 ). This approach was chosen to maintain the feasibility of our experiments and to ensure that our findings are applicable to typical deployment scenarios.
The automated workflow described applies to other LLM tools or different versions of the same tools. Throughout our study, we have tested several popular LLM tools available to us. The specific LLM models and their versions used at the time of our first submission are detailed in Table 1 , along with the tuning parameters.
The field of large language models (LLMs) has evolved rapidly since we started this study in 2023. New models are frequently released, and existing versions are constantly updated. To account for these advances, we have expanded our analysis to include results from the latest versions of previously studied models, as well as a few entirely new ones. As a best practice, we set the temperature parameter to 0 for all latest models. Table 1 now includes these latest models and their versions.
A brief explanation of these parameters is as follows:
Temperature: The temperature controls the randomness of the outputs, with a range from 0 to 2. Higher values produce more random outputs, whereas a value of 0 makes the output nearly deterministic. The maximum temperature of 2 gives the most creative and variable outputs.
Max length: The max length is the maximum number of tokens for the model to generate as a response. For English text, one token corresponds to roughly three quarters of a word on average, so a single word is typically 1–2 tokens.
Stop sequences: This parameter controls which tokens or phrases will stop the LLM from generating more text.
Top p: When generating text, the model samples from the top p percentage of most likely tokens to generate. The top p is the cumulative probability cutoff for the model’s selection of tokens to generate. Lower top p values mean sampling from a smaller, more top-weighted nucleus.
Top k: When generating text, the model samples from the top k most likely tokens. Lower values of top k restrict the choice to fewer candidates; with top k equal to 1, the model always selects the most likely token.
Frequency penalty: This parameter controls how strongly the LLM penalizes tokens in proportion to how often they have already appeared in the text so far. A frequency penalty of 0 means that token frequency has no effect, and tokens are generated based on their probability alone.
Presence penalty: This parameter controls how the LLM penalizes tokens that have already appeared in the text at least once, regardless of how often, which encourages the model to introduce new tokens. A presence penalty of 0 means that prior appearance has no effect, and tokens are generated based on their probability alone.
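To make the effect of temperature, top k, and top p concrete, the following sketch applies each of them to a toy next-token distribution. It follows the textbook definitions of these sampling controls; how each commercial platform implements them internally is not documented here, and the logit values are hypothetical.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

logits = [2.0, 1.0, 0.5, -1.0]               # hypothetical scores for four tokens
sharp = apply_temperature(logits, 0.2)       # low temperature: mass concentrates
flat = apply_temperature(logits, 2.0)        # high temperature: mass spreads out
greedy = top_k_filter(apply_temperature(logits, 1.0), k=1)
nucleus = top_p_filter(apply_temperature(logits, 1.0), p=0.8)
```

In this toy case, lowering the temperature concentrates almost all probability on the first token, top k equal to 1 leaves only that token, and a top p of 0.8 retains the two most likely tokens. A temperature of exactly 0 cannot be passed to `apply_temperature`; it corresponds to the limiting case of always choosing the most likely token.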
In our pursuit to assess ChatGPT’s proficiency in abstract screening, we selected certain benchmark databases that have existing performance data from other methodologies. This selection aids in a comparative analysis of performance.
In selecting the systematic reviews for our study, we used a systematic approach guided by specific criteria to ensure relevance and reliability. These studies were selected from the publicly available SYNERGY [ 23 ] dataset, which contains 26 systematic reviews from different disciplines. Key selection criteria included:
Clarity and conciseness of eligibility criteria: The selected studies had well-defined and explicit eligibility criteria. This clarity is essential for accurate replication of the study selection process, which is critical for assessing the performance of LLM tools in an analogous real-world application.
Completeness and cleanliness of data: We ensured that the selected reviews had complete datasets, with all necessary information on included and excluded studies clearly documented, minimizing the risk of ambiguities affecting our analysis.
In addition, to comply with the AMSTAR-2 [ 26 ] guidelines, in particular point 5, we reviewed the methodologies of these reviews to confirm the selection of studies was performed in duplicate and disagreements were resolved by consensus. While our analysis assumes that these systematic reviews adhere to high standards, we recognize the inherent limitations of using pre-existing datasets as a proxy for gold standards in the discussion section.
We picked the following 3 databases from the publicly available SYNERGY dataset [ 23 ]:
Bannach-Brown 2016 [ 27 ]—topic: use of animal models to study depressive behavior
Human-curated (gold standard) results: 1258 excluded abstracts and 230 included abstracts.
We randomly selected 100 excluded abstracts and 100 included abstracts for screening by LLM tools.
Meijboom 2021 [ 28 ]—topic: retransitioning of etanercept in patients with a rheumatic disease
Human-curated (gold standard) results: 599 excluded abstracts and 32 included abstracts.
We randomly selected 100 excluded abstracts and all 32 included abstracts for screening by LLM tools.
Menon 2022 [ 29 ]—topic: methodological rigor of systematic reviews in environmental health
Human-curated (gold standard) results: 896 excluded abstracts and 73 included abstracts.
We randomly selected 100 excluded abstracts and all 73 included abstracts for screening by LLM tools.
For each chosen database, abstracts were categorized as either “cases” (those included based on a gold standard) or “controls” (those excluded per the gold standard). From each category, we randomly selected 100 abstracts (we used all abstracts if there were fewer than 100 abstracts in that category in a database). These abstracts underwent evaluation by ChatGPT (v4.0) as per our established workflow. Subsequently, ChatGPT’s decisions were juxtaposed against the gold standard to determine sensitivity, specificity, and overall accuracy. The same abstracts were also processed using other LLM tools as listed in Table 1 to record their respective verdicts.
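The sampling rule above can be sketched as follows. The record layout, the function name, and the fixed random seed are assumptions made for illustration; the class sizes in the example are those reported for the Menon 2022 database.

```python
import random

def sample_for_screening(records, per_class=100, seed=0):
    """Draw up to `per_class` cases and controls; keep all if a class is smaller."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    cases = [r for r in records if r["included"]]
    controls = [r for r in records if not r["included"]]

    def draw(group):
        if len(group) <= per_class:
            return list(group)
        return rng.sample(group, per_class)  # sampling without replacement

    return draw(cases), draw(controls)

# 73 included and 896 excluded abstracts, as in Menon 2022.
records = [{"id": i, "included": i < 73} for i in range(73 + 896)]
cases, controls = sample_for_screening(records)
```

Here all 73 included abstracts are retained, while 100 of the 896 excluded abstracts are drawn at random.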
To quantify the efficacy of ChatGPT and other AI tools for each database, we calculated the following metrics: (1) sensitivity, (2) specificity, and (3) overall accuracy, where sensitivity is defined as the number of true positives divided by the sum of true positives and false negatives, specificity as the number of true negatives divided by the sum of true negatives and false positives, and accuracy as the sum of true positives and true negatives divided by the total number of abstracts. For each metric, associated 95% confidence intervals were also determined. Although it is very common in the field to report F1 score, recall rate, and precision, we believe it is more appropriate to report sensitivity and specificity given this study design. In addition, F1 score, recall rate, and precision can be derived from sensitivity and specificity together with the numbers of cases and controls (recall is identical to sensitivity).
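The definitions above translate directly into code. The sketch below computes the three reported metrics, plus precision and F1 for readers who prefer them, from gold-standard and predicted labels. The function name is hypothetical, and the example counts (93 true positives and 96 true negatives) are chosen to be consistent with the rates reported for ChatGPT v4.0 on the first database, assuming 100 cases and 100 controls.

```python
def screening_metrics(gold, predicted):
    """Compute screening metrics from parallel lists of booleans."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }

# 100 cases (93 correctly included) and 100 controls (96 correctly excluded).
gold = [True] * 100 + [False] * 100
predicted = [True] * 93 + [False] * 7 + [False] * 96 + [True] * 4
metrics = screening_metrics(gold, predicted)
```

The sketch assumes that each class and each predicted label occurs at least once; otherwise one of the denominators would be zero.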
Furthermore, to explore the potential of a unified decision-making process, we combined the decisions from all AI tools using a voting mechanism. The majority decision across the tools was taken as the final verdict for each abstract. For this consolidated approach, we again computed sensitivity, specificity, overall accuracy, and the associated 95% CIs for each database. We also explore the use of latent class analysis (LCA), a model-based clustering approach, to derive consolidated decisions. More details on this LCA approach are provided in the “ Beyond majority voting ” section.
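A minimal sketch of the majority-voting rule is given below. The handling of ties is an assumption (a tie is resolved as exclusion here), since ties can only arise with an even number of tools and the rule for that case is not specified above; the tool names in the example are placeholders.

```python
def majority_vote(decisions):
    """Return True if more than half of the tools voted to include the abstract."""
    yes_votes = sum(1 for d in decisions if d)
    return yes_votes > len(decisions) / 2  # ties count as exclusion (assumption)

def combine_tools(decisions_by_tool):
    """decisions_by_tool maps a tool name to a list of per-abstract booleans."""
    per_abstract = zip(*decisions_by_tool.values())
    return [majority_vote(list(votes)) for votes in per_abstract]

combined = combine_tools({
    "tool_a": [True, True, False],
    "tool_b": [True, False, False],
    "tool_c": [False, True, False],
})
```

In this toy example, the first two abstracts receive two of three votes for inclusion and are included, while the third receives none and is excluded.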
For a given database, 100 cases and 100 controls yield a two-sided 95% confidence interval with a half-width equal to 0.048 when the underlying sensitivity (specificity) is approximately 95%.
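The precision statement above can be checked numerically. The sketch below computes an exact (Clopper-Pearson) binomial interval by bisection using only the standard library. The interval method is an assumption, because the text does not state which method produced the 0.048 figure; the exact interval for 95 successes out of 100 happens to give a half-width close to that value.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_increasing(f, target, iterations=60):
    """Bisection for f(p) = target, where f is increasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(95, 100)
half_width = (upper - lower) / 2
```

The same function can be applied to the smaller classes (for example, 32 or 73 included abstracts) to see how much wider the intervals become when fewer than 100 abstracts are available.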
All statistical analyses were conducted using the R statistical software (version 4.3.1). All tests were two-sided with an alpha level set at 0.05 unless otherwise mentioned.
To improve the transparency and reproducibility of studies using AI tools, we have included the TRIPOD + AI checklist [ 30 ] in our report. This checklist has been adapted to reflect the specifics of our research, which focuses on the evaluation of large language models for abstract screening rather than diagnostic or prognostic modeling. The completed checklist is presented in Table S 1 , to provide readers with a comprehensive overview of our study’s adherence to established reporting standards.
We present the results for each of the 3 databases. For each, we first present the prompts we used when we called LLM tools to screen an abstract, then present the performance data (accuracy, sensitivity, and specificity for each method or LLM tool), followed by a summary of the performance, and a comparison of the performance of the different methods to that based on ChatGPT v4.0.
The prompts we used for screening abstracts in this database are as follows:
Conduct a systematic review on animal depression. I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion. After answering all the criteria with yes or no, then provide an overall explanation.
Here is the eligibility criteria: Inclusion Criteria: 1. Any article providing primary data of an animal model of depression or depressive-like phenotype with an appropriate control group (specified above). 2. Animals of all ages, sexes and species, where depression-like phenotype intended to mimic the human condition have been induced. Including animal models where depressive-like phenotypes are induced in the presence of a comorbidity (e.g. obesity or cancer). 3. All studies that claim to model depression or depressive-like phenotypes in animals. Studies that induce depressive behavior or model depression and that also test a treatment or intervention (prior or subsequent to model induction), with no exclusion criteria based on dosage, timing or frequency. 4. Studies measuring behavioral, anatomical and structural, electrophysiological, histological and/or neurochemical outcomes and where genomic, proteomic or metabolomic outcomes are measured in addition to behavioral, anatomical, electrophysiological, histological or neurochemical outcomes. Exclusion Criteria: 1. Review article, editorials, case reports, letters or comments, conference or seminar abstracts, studies providing primary data but not appropriate control group. 2. Human studies and ex vivo, in vitro or in silico studies. Studies will be excluded if authors state an intention to induce or investigate only anxiety or anxious behavior. Studies will be excluded if there is no experimental intervention on the animals (e.g. purely observational studies). 3. Studies that investigate treatments or interventions, but no depressive behavior or model of depression is induced (e.g. toxicity and side-effect studies). 4. Where metabolic outcome measures are the primary outcome measure of a study. Where genomic, proteomic, metabolic or metabolomic outcomes are the sole outcome measures in a study, they will be excluded.
Here is the abstract:
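To make the structure of such a request concrete, the sketch below assembles a prompt from its three parts and extracts the one-word answers from a reply. It is an assumption-laden illustration, not the study's code: `build_prompt` and `parse_decisions` are hypothetical helpers, and the parser assumes the reply lists the overall decision first, followed by one yes/no per criterion, before any free-text explanation.

```python
import re

def build_prompt(task_instructions, eligibility_criteria, title, abstract):
    """Concatenate instructions, criteria, and the article text."""
    return (
        f"{task_instructions}\n"
        f"Here is the eligibility criteria: {eligibility_criteria}\n"
        f"Here is the abstract: {title}. {abstract}"
    )

def parse_decisions(reply, n_criteria):
    """Return (overall, per_criterion) from the first n_criteria + 1
    yes/no tokens, or None if the reply is incomplete."""
    tokens = re.findall(r"\b(yes|no)\b", reply.lower())
    if len(tokens) < n_criteria + 1:
        return None  # treat as an incomplete response
    return tokens[0], tokens[1:n_criteria + 1]

# Hypothetical reply for a prompt with eight criteria.
reply = (
    "Yes\n1. Yes\n2. Yes\n3. No\n4. Yes\n"
    "5. No\n6. No\n7. No\n8. No\n"
    "The study induces a depressive-like phenotype and reports no human data."
)
parsed = parse_decisions(reply, n_criteria=8)
incomplete = parse_decisions("Yes\n1. Yes", n_criteria=8)
```

Returning `None` for short replies gives a simple way to count incomplete responses per tool.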
Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.945) and specificity (0.960), and satisfactory sensitivity (0.930). Combined decision using majority voting seemed to improve sensitivity (0.970) considerably but did not improve specificity (0.870) much. Comparatively, the zero-shot method was less effective across these metrics, while the hybrid method attained superior accuracy (0.961) and specificity (0.982), albeit with reduced sensitivity (0.843). We acknowledge that our findings regarding the zero-shot and hybrid methods differ from those reported in [ 18 ] for the same set of abstracts. This discrepancy could be attributed to the use of distinct sets of embeddings in our analysis compared to the previous study.
For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions, but did not surpass the performance of ChatGPT (v4.0). Claude 3 performed well, but still did not surpass the performance of ChatGPT (v4.0).
Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better ( p value = 0.002) than Google PaLM 2 in terms of sensitivity; ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value = 0.008) and better than Llama-2 ( p value < 0.001) in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was no statistically significant difference ( p value = 0.134) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was significantly worse ( p value = 0.008) than ChatGPTv4.0 in terms of specificity.
Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We assessed the performance (sensitivity and specificity) of ChatGPT v4.0 against both the zero-shot and hybrid approaches using the McNemar test. Specifically, we aligned the screening results from 100 cases and 100 controls as per the ChatGPT v4.0 method and similarly for the zero-shot and hybrid methods, testing for inconsistencies between these approaches as previously done. Our analysis revealed that ChatGPT v4.0 significantly outperformed the zero-shot method in sensitivity ( p value < 0.001) but showed comparable effectiveness in specificity ( p value = 0.37). Additionally, ChatGPT v4.0 demonstrated marginally superior sensitivity compared to the hybrid method ( p value = 0.07), while its performance in specificity was similar ( p value = 1.00).
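The paired comparisons above rest on the discordant pairs, that is, abstracts that one method classified correctly and the other did not. The following sketch shows an exact (binomial) version of the McNemar test; whether the exact or the chi-square version was used is not stated in the text, so this choice, like the function name and the toy data, is an assumption.

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p value for paired True/False outcomes.
    To compare sensitivity, pass outcomes for the cases only; to compare
    specificity, pass outcomes for the controls only."""
    b = sum(x and not y for x, y in zip(a_correct, b_correct))  # A right, B wrong
    c = sum(y and not x for x, y in zip(a_correct, b_correct))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical example: method A is correct on all 10 cases,
# method B on only the first 2, giving 8 discordant pairs in one direction.
a_correct = [True] * 10
b_correct = [True] * 2 + [False] * 8
p_value = mcnemar_exact(a_correct, b_correct)
```

Concordant pairs do not enter the calculation, which is why two tools with similar marginal accuracy can still differ significantly.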
Conduct a systematic review on transitioning patients from an originator to a corresponding biosimilar.
I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.
After answering all the criteria with yes or no, then provide an overall explanation.
Here is the eligibility criteria:
Articles were included if they met the following criteria:
Study involved transitioning from a TNFα inhibitor (including etanercept, infliximab, and adalimumab) originator to a biosimilar
The number of patients who retransitioned was reported or could be calculated
The article was an original research article published in a peer-reviewed journal
The article included baseline characteristics of the patients who transitioned
The article was written in English
The full-text version of the article could be obtained.
Transitioning was defined as patients in whom the biosimilar was introduced after the originator, without treatment with other drugs in between. Retransitioning was defined as restarting the originator directly after discontinuing a biosimilar, without treatment with other drugs in between. In summary, transitioning was defined as switching from the originator to a biosimilar; retransitioning was defined as switching from the originator to a biosimilar and back to the originator. Both transitioning and retransitioning involved changes with the same active biological substance.
Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.840), but not with the highest specificity (0.860) or a satisfactory sensitivity (0.812). Compared to ChatGPT v4.0, combined decision using majority voting did not improve overall accuracy (0.720), but improved sensitivity (1.000) at the expense of specificity (0.630).
Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better ( p value < 0.001) than Google PaLM 2, but significantly worse than ChatGPT3.5 ( p value = 0.001) and Llama 2 in terms of sensitivity; ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value < 0.001) and better than Llama 2 ( p value < 0.001), but worse than Google PaLM 2 ( p value = 0.002), in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was a statistically significant difference ( p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was not significantly worse ( p value > 0.50) than ChatGPTv4.0 in terms of specificity.
For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0), and Gemini-1.0-pro (vs. PaLM 2) did not improve performance either. However, Llama 3 (vs. Llama 2) improved over its older version and surpassed the performance of ChatGPT (v4.0). Claude 3 also slightly surpassed the performance of ChatGPT (v4.0).
Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We evaluated the performance of ChatGPT v4.0, focusing on sensitivity and specificity, in comparison with the zero-shot and hybrid approaches, employing the McNemar test as described above. In this analysis, we aligned the screening results from 32 cases and 100 controls for the tests. Our findings indicated that ChatGPT v4.0 significantly surpassed the zero-shot method in sensitivity ( p value = 0.0002) and exhibited marginally improved specificity ( p value = 0.099). Furthermore, ChatGPT v4.0 showed notably higher sensitivity than the hybrid method ( p value < 0.001), although its specificity was comparatively lower.
“Conduct a systematic review on the methodological rigour of systematic reviews in environmental health.
I provide the title and abstract for one journal article.
Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation.
Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.
Here are the eligibility criteria:
To be eligible for inclusion in the SR sample, documents had to fulfill the following criteria:
Identify explicitly as a “systematic review” in their title
Assess the effect of a non-acute, non-communicable, environmental exposure on a health outcome. Environmental exposures can include air and water pollutants, radiation, noise, occupational hazards, lifestyle factors (like diet or physical activity) and lifestyle choices influenced by family and peers (like substance use), social and economic factors (like stress from work or living conditions).
Include studies in people or mammalian models
Be available in HTML format
Abstract X”
Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.913) and sensitivity (0.932), but not with the highest specificity (0.900). Compared to ChatGPT v4.0, combined decision using majority voting did not improve overall accuracy (0.884) or sensitivity (0.808), but improved specificity (0.940).
Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better than ChatGPT3.5 ( p value < 0.001), Google PaLM 2, and Llama 2 ( p value = 0.02) in terms of sensitivity; ChatGPTv4.0 performed worse than ChatGPTv3.5 and Google PaLM 2, in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was a statistically significant difference ( p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was not significantly different ( p value = 0.134) from ChatGPTv4.0 in terms of specificity.
For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). However, both Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions and surpassed the performance of ChatGPT (v4.0). Claude 3 also performed well but did not surpass the performance of ChatGPT (v4.0).
Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We aligned the screening results from 73 cases and 100 controls based on the ChatGPT v4.0 method, and similarly for the zero-shot and hybrid methods, to test for inconsistencies between these approaches, using the McNemar test as done in previous assessments. Our analysis showed that ChatGPT v4.0 significantly outperformed the zero-shot method in both sensitivity ( p value < 0.001) and specificity ( p value = 0.016). In comparison with the hybrid method, ChatGPT v4.0 also demonstrated superior sensitivity ( p value < 0.001) and better specificity ( p value = 0.04).
To use the ChatGPT API or other LLM tools, the owners of these platforms charge a predetermined rate for access to the corresponding APIs. These fees are calculated in USD per thousand tokens, where tokens are the basic units used by these LLM platforms to quantify text length. In this context, a token can represent a word, a punctuation mark, or a character. The financial cost of screening 200 abstracts was approximately $6 for ChatGPT v4.0, $0.2 for ChatGPT v3.5, $10 for Llama 2 (using Replicate), while Google PaLM 2 offered its services for free to invited developers. Thus, the cumulative cost of evaluating 200 abstracts across all platforms was approximately $16.2. The cumulative cost of evaluating 200 abstracts across all latest models ($3 for GPT-4-Turbo, $0.05 for GPT-3.5-Turbo, free for Gemini-1.0-pro, $0.05 for Llama-3, $4 for Claude) was less, approximately $7.1. In terms of time efficiency, processing 200 abstracts with each of these LLM tools took approximately 10–20 min using a single thread. However, it is imperative to recognize that abstract screening lends itself well to parallelization. Consequently, one could significantly speed up the process by setting up multiple threads to simultaneously screen different subsets of abstracts, thereby reducing the overall time required for completion. This parallel approach not only increases efficiency, but also ensures that large amounts of data can be processed in a timely manner, making LLM tools even more attractive for large-scale abstract screening tasks. In summary, the monetary and time costs of using LLM tools for abstract screening are negligible compared to manual labeling.
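The parallelization mentioned above can be sketched with Python's standard thread pool. This is an illustration under stated assumptions: `screen_abstract` is a hypothetical stand-in for a single API call (here replaced by a trivial keyword rule so the example runs offline), and the number of worker threads is arbitrary; in practice it would be bounded by the provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

def screen_abstract(abstract):
    """Hypothetical placeholder for one LLM API request.
    A real implementation would send the prompt and parse the reply."""
    return "yes" if "depression" in abstract.lower() else "no"

def screen_all(abstracts, max_workers=4):
    """Screen abstracts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(screen_abstract, abstracts))

# Hypothetical abstracts.
abstracts = [
    "A rat model of depression induced by chronic mild stress.",
    "Toxicity of compound X in zebrafish embryos.",
    "Depression-like behaviour in mice after social defeat.",
]
decisions = screen_all(abstracts)
```

Threads suit this task because each request spends most of its time waiting on the network rather than computing.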
We have expanded our analysis to include a variety of approaches for synthesizing decisions across different LLM tools. Our methodology is inspired by the concept of combining multiple diagnostic tests in the absence of a gold standard, akin to situations where human expert consensus is not available. There are several publications discussing such scenarios [ 31 , 32 ], among which we proposed to use the latent class analysis (LCA) models.
Latent class analysis (LCA) is a statistical method used to identify subgroups within a population, which are not directly observed (hence “latent”) [ 33 , 34 , 35 ]. It is particularly useful when the research interest lies in categorizing individuals into mutually exclusive groups based on their responses to multiple observed variables. In the context of abstract screening, LCA can offer a sophisticated means of integrating decisions from different LLM tools without relying on a gold standard, typically provided by human expert consensus. This approach assumes that the unobserved subgroups (or “latent classes”) explain the dependence between the observed decisions made by each of the LLM tools.
Utilizing the LCA model, we treat the decisions from all LLM tools as dichotomous variables, corresponding to the adherence to each inclusion or exclusion criterion, as well as the overall decision. For instance, within the Bannach-Brown 2016 database (BB2016), there are eight criteria in total with four criteria each for inclusion and exclusion and one overall decision for inclusion, resulting in a total of nine binary items per LLM tool. In our analysis, we incorporated decisions from GPT v4.0, v3.5, and Llama 2. Decisions from Google PaLM 2 were excluded due to a high frequency (10% or more) of incomplete responses. Consequently, for the Bannach-Brown 2016 database, we worked with 27 binary items. For other databases such as Meijboom 2021 (Mj2021) and Menon 2022 (Me2022), the binary items totaled 21 and 15, respectively. It is important to note that LCA models were fitted to the binary data of each database independently.
The LCA model fitting process enables us to calculate the posterior probabilities of each abstract belonging to specific latent classes or subgroups. Abstracts are then categorized based on these probabilities, with assignment to the class for which an abstract has the highest posterior membership probability. The determination of the number of latent classes is a critical step in the LCA model fitting, which requires a priori specification. In our evaluation, we explored models with class numbers ranging from 2 to 6 and utilized the Bayesian information criterion (BIC) to identify the most “optimal” LCA model for our datasets.
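The classification and model-selection steps described above can be written out for a fitted model. The sketch below assumes the standard LCA formulation in which binary items are conditionally independent given the latent class; it takes the class weights and item probabilities as given (the fitting itself, typically done by EM, is omitted), and all names and numbers are illustrative rather than taken from our analysis.

```python
from math import log, prod

def class_posteriors(items, class_weights, item_probs):
    """Posterior P(class k | items) for one abstract.
    items: list of 0/1 responses; class_weights[k] = P(class k);
    item_probs[k][j] = P(item j = 1 | class k)."""
    joint = []
    for weight, probs in zip(class_weights, item_probs):
        likelihood = prod(p if x == 1 else 1 - p for x, p in zip(items, probs))
        joint.append(weight * likelihood)
    total = sum(joint)
    return [j / total for j in joint]

def assign_class(posteriors):
    """1-based label of the class with the highest posterior probability."""
    return max(range(len(posteriors)), key=posteriors.__getitem__) + 1

def lca_bic(log_likelihood, n_classes, n_items, n_obs):
    """BIC with (K - 1) class weights plus K * J item probabilities."""
    n_params = (n_classes - 1) + n_classes * n_items
    return -2 * log_likelihood + n_params * log(n_obs)

# Hypothetical two-class model with three binary items.
weights = [0.5, 0.5]
probs = [[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]]
posteriors = class_posteriors([1, 1, 0], weights, probs)
label = assign_class(posteriors)
bic = lca_bic(-100.0, n_classes=2, n_items=3, n_obs=200)
```

Models with different numbers of classes would each be fitted and the one with the lowest BIC retained.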
Table 5 shows that after applying the Bayesian information criterion (BIC) to determine the most appropriate model for our data, we identified a 3-class model as the best fit for the binary decisions derived from the BB2016 database. Similarly, a 4-class model was optimal for the Mj2021 database, while a 3-class model was again best for the Me2022 database. The confusion matrices generated by the selected LCA models for each database provided a clear juxtaposition between the LLM-assigned classes and the actual labels of the abstracts (see Table 6 ).
The performance metrics derived from these models are noteworthy. For the BB2016 database, if we categorize abstracts in class 2 as “included” and those in classes 1 or 3 as “excluded”, the LCA model achieved a sensitivity of 93% and a specificity of 96%, indicating a high degree of accuracy in classifying relevant and irrelevant abstracts. For the Mj2021 database, if we view class 1 as the “included” group and classes 2 and 3 as the “excluded” group, the model achieved a perfect sensitivity of 100%, meaning that it correctly identified all relevant abstracts, although the specificity was lower at 79%, indicating a higher rate of false positives. Similarly, for the Me2022 database, considering class 1 to be the “excluded” category and classes 2 and 3 to be the “included” group, the model showed a sensitivity of 94.5% and a specificity of 83%, a good balance between identifying relevant abstracts and avoiding false inclusions.
These results highlight the robustness of the latent class analysis approach to the abstract screening task, providing an effective method for classifying abstracts when a gold standard is not available. The varying levels of sensitivity and specificity across databases underscore the need to tailor the LCA model to the specific characteristics of each dataset, and the need for further research.
We have incorporated results from the latest LLM models to increase the robustness and relevance of our findings. However, to maintain consistency and comparability with our original analyses, we have not incorporated these new results into the previously established majority voting or latent class analysis (LCA) approaches. Instead, we have chosen to make all raw data, including results from these newer models, freely available in our GitHub repository. This approach allows interested researchers and practitioners to conduct further investigations or apply alternative methods of analysis. By providing access to this additional data, we aim to promote transparency and enable the broader community to engage with and potentially build upon our work.
This study began with a rigorous exploration of the capabilities of large language models (LLMs) in abstract screening. We used automation scripts developed in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, Google PaLM 2, and Meta Llama 2, as well as the latest versions of these tools. Our central goal was to evaluate the efficiency and accuracy of these tools across three different databases of abstracts, leading us to a nuanced understanding of their potential in this context.
Large language models (LLMs), particularly ChatGPT, have garnered global attention since their inception. Employing LLMs for abstract screening in systematic reviews is an innovative concept [ 32 , 33 ] and remains underexplored. This study presents the first comprehensive evaluation of LLMs applied to systematic review processes. The findings are encouraging, suggesting that LLMs could revolutionize abstract screening. Specifically, ChatGPT v4.0 exhibited stellar performance across three test scenarios, achieving an accuracy of at least 85%. Furthermore, it attained sensitivity and specificity rates ranging from 80% to an impressive 95%. These exceptional outcomes highlight the substantial promise of LLMs in abstract screening, offering an efficient and capable alternative to the conventional, laborious approaches that typically necessitate extensive human annotation.
However, it is important to acknowledge that we are still in the early stages of integrating LLM tools into the abstract screening field, and they are not without their imperfections; for example, even the best-performing LLM tool, ChatGPT v4.0, occasionally excluded many (20%) relevant studies. These tools are not a universal solution to all the challenges associated with abstract screening, and they are not ready to completely replace human expertise in this area. Instead, they should be embraced as invaluable assistants in the abstract screening process.
In discussing the limitations of our study, it is important to recognize the constraints associated with our dataset selection and model configuration. We used only three databases from the SYNERGY set, limiting the generalizability of our findings across disciplines and datasets. In addition, the reliance on human-curated labels as a gold standard, while necessary, introduces potential biases due to the retrospective nature of our analysis. These labels may contain errors, and the methodology used in the original reviews to resolve discrepancies may affect the validity of our conclusions.
The parameters chosen for our LLMs—temperature, top k, top p, and prompts—were set to defaults or based on natural conversations to balance output quality with ease of use. While this approach minimizes the need for technical expertise, it may not be optimal for all screening scenarios. In addition, the reliance of LLMs on abstracts alone, as opposed to full-text evaluations, presents a fundamental challenge; critical data influencing inclusion or exclusion may not be present in the abstracts, potentially compromising screening accuracy.
In addition, the rapid development of LLMs and their “black box” nature pose challenges to the transparency that is essential in scientific settings. The environmental impact of using these computationally intensive models is also significant [ 36 ], requiring sustainability considerations. Future research should focus on refining these tools to increase transparency and efficiency, and ensure their responsible development and use in systematic reviews.
Our research suggests that LLM tools are ready to take on a role in abstract screening and are poised to have an immediate and positive impact on the process. Their integration into abstract screening can manifest itself in a few ways. They can serve as autonomous AI reviewers, adding an extra layer of scrutiny and ensuring thoroughness. Our findings suggest that a collective decision, such as one derived from majority voting, can sometimes improve sensitivity, underscoring the potential of LLM tools as a reliable safeguard against oversight, ensuring both comprehensiveness and accuracy.
In addition, LLM tools can facilitate a synergistic partnership with human experts. They are adept at identifying “high-risk” abstracts where different LLM tools have reached different judgments, flagging them for further human evaluation, and promoting a careful and error-free screening process while minimizing human effort.
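One simple way to operationalize this triage is to flag every abstract on which the tools do not all agree. The sketch below is illustrative only; the tool names, the decisions, and the rule that any disagreement triggers human review are assumptions.

```python
def flag_for_review(decisions_by_tool):
    """Return indices of abstracts on which the tools disagree.
    decisions_by_tool maps a tool name to its list of yes/no decisions,
    with all lists aligned to the same abstracts."""
    n_abstracts = len(next(iter(decisions_by_tool.values())))
    flagged = []
    for i in range(n_abstracts):
        votes = {decisions[i] for decisions in decisions_by_tool.values()}
        if len(votes) > 1:  # at least one tool dissents
            flagged.append(i)
    return flagged

# Hypothetical decisions from three tools on three abstracts.
decisions_by_tool = {
    "tool_a": ["yes", "no", "yes"],
    "tool_b": ["yes", "no", "no"],
    "tool_c": ["yes", "yes", "no"],
}
needs_human_review = flag_for_review(decisions_by_tool)
```

Abstracts with unanimous decisions would be accepted or rejected automatically, so human effort concentrates on the flagged subset.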
Another exciting prospect is the integration of LLM tools with hybrid and active learning approaches. In this scenario, LLM tools could autonomously annotate abstracts in the training set, minimizing the need for human labeling. Subsequently, these labeled abstracts could be used to train custom NLP models, paving the way for a streamlined and efficient abstract screening process, resulting in significant time and resource savings. Further research is needed to understand how the “uncertainty” in LLM-based decisions, when human-curated labels are not used, affects the performance of the hybrid approach. We also note from the test examples that the performance of hybrid approaches varies widely from example to example and depending on the text embedding tools used. Extensive research is needed to perfect the hybrid approaches.
Another future research direction is to explore how to fine-tune different versions of LLM tools and how to derive collective decisions. One idea is that by using different tuning parameters, one could propose different versions of LLM tools, and if they perform similarly in terms of accuracy but give different decisions (e.g., not highly correlated), we expect that better collective decisions would be obtained, as observed in many ensemble learning experiences [ 37 ]. However, this could be costly and require a lot of exploration.
More importantly, it would be particularly valuable to prospectively test these LLM tools, or the integration of these tools with other approaches such as zero-shot or active learning, in real-world meta-analysis and systematic review projects. This would provide deeper insights into their practical utility and effectiveness, and a more comprehensive understanding of their impact in live research environments.
In summary, while LLM tools may not be able to fully replace human experts in abstract screening, their ability to transform the screening process is both undeniable and imminent. With continued advances in technology and ongoing refinement, these tools will play a critical role in the future of abstract screening, ushering in a new era of efficiency and effectiveness.
The datasets analyzed during the current study are available in the SYNERGY repository, https://github.com/asreview/synergy-dataset . We have created a GitHub repository https://github.com/mikeli380/LLMAbstractScreening to host our code (R/Python), abstracts, and intermediate results so that interested readers can replicate our findings.
Higgins JPT, et al. Cochrane handbook for systematic reviews of interventions. 2019.
Moher D, et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7). https://doi.org/10.1371/journal.pmed.1000097 .
Sackett DL, Rosenberg WMC, Gray JAM, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn’t. 1996. Clin Orthop Relat Res. 2007;455. https://doi.org/10.1136/bmj.312.7023.71 .
Lavis JN, Posada FB, Haines PA, Osei E. Use of research to inform public policymaking. Lancet. 2004;364(9445). https://doi.org/10.1016/S0140-6736(04)17317-0 .
Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up?. PLoS Med. 2010;7(9). https://doi.org/10.1371/journal.pmed.1000326 .
Polanin JR, Pigott TD, Espelage DL, Grotpeter JK. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res Synth Methods. 2019;10(3). https://doi.org/10.1002/jrsm.1354 .
Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med. 2008;358(3). https://doi.org/10.1056/nejmsa065779 .
Edwards P, Clarke M, DiGuiseppi C, Pratap S, Roberts I, Wentz R. Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records. Stat Med. 2002;21(11). https://doi.org/10.1002/sim.1190 .
Wang Z, Nayfeh T, Tetzlaff J, O’Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE. 2020;15(1). https://doi.org/10.1371/journal.pone.0227742 .
O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4(1). https://doi.org/10.1186/2046-4053-4-5 .
OpenAI. Introducing ChatGPT Plus. OpenAI Blog; 2023. https://openai.com/blog/chatgpt-plus . Accessed 8 Nov 2023.
Google DeepMind. Introducing PaLM 2. Google Blog; 2023. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/ . Accessed 8 Nov 2023.
Meta AI Research. Meta and Microsoft introduce the next generation of Llama. Facebook AI Blog; 2023. https://ai.meta.com/blog/llama-2/ . Accessed 8 Nov 2023.
Claude.ai. Claude AI API reference. 2024. https://docs.anthropic.com/en/api/getting-started . Accessed 4 Jun 2024.
Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. 2012. https://doi.org/10.1145/2110363.2110464 .
Alshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems. 2023;11(7). https://doi.org/10.3390/systems11070351 .
Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models ‘the answer’ to bringing us closer to systematic review automation? Syst Rev. 2023;12(1). https://doi.org/10.1186/s13643-023-02243-z .
Xian Y, Lampert CH, Schiele B, Akata Z. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell. 2019;41(9). https://doi.org/10.1109/TPAMI.2018.2857768 .
Moreno-Garcia CF, Jayne C, Elyan E, Aceves-Martins M. A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Decis Anal J. 2023;6. https://doi.org/10.1016/j.dajour.2023.100162 .
Ferdinands G, et al. Performance of active learning models for screening prioritization in systematic reviews: a simulation study into the Average Time to Discover relevant records. Syst Rev. 2023;12(1). https://doi.org/10.1186/s13643-023-02257-7 .
Boetje J, van de Schoot R. The SAFE procedure: a practical stopping heuristic for active learning-based screening in systematic reviews and meta-analyses. Syst Rev. 2024;13(1). https://doi.org/10.1186/s13643-024-02502-7 .
Yang Y, Loog M. A benchmark and comparison of active learning for logistic regression. Pattern Recognit. 2018;83. https://doi.org/10.1016/j.patcog.2018.06.004 .
De Bruin RJ, Ma Y, Ferdinands G, Teijema J, Van de Schoot. SYNERGY - open machine learning dataset on study selection in systematic reviews. DataverseNL; 2023. https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/HE6NAQ .
Millard SJ, Weston-Green K, Newell KA. The Wistar-Kyoto rat model of endogenous depression: a tool for exploring treatment resistance with an urgent need to focus on sex differences. Prog Neuropsychopharmacol Biol Psychiatry. 2020;101. https://doi.org/10.1016/j.pnpbp.2020.109908 .
Zawia NH, Mattia CJ, Bondy SC. Differential effects of difluoromethylornithine on basal and induced activity of cerebral ornithine decarboxylase and mRNA. Neuropharmacology. 1991;30(4). https://doi.org/10.1016/0028-3908(91)90058-J .
Shea BJ, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358. https://doi.org/10.1136/bmj.j4008 .
Bannach-Brown A, Liao J, Wegener G, Macleod M. Understanding in vivo modelling of depression in non-human animals: a systematic review protocol. Evid Based Preclin Med. 2016;3(2). https://doi.org/10.1002/ebm2.24 .
Meijboom RW, Gardarsdottir H, Becker ML, ten Wolde S, Egberts TCG, Giezen TJ. Incidence of and reasons and determinants associated with retransitioning from biosimilar etanercept to originator etanercept. BioDrugs. 2021;35(6). https://doi.org/10.1007/s40259-021-00501-x .
Menon JML, Struijs F, Whaley P. The methodological rigour of systematic reviews in environmental health. Crit Rev Toxicol. 2022;52(3). https://doi.org/10.1080/10408444.2022.2082917 .
Collins GS, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7). https://doi.org/10.1136/bmjopen-2020-048008 .
Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am J Epidemiol. 1995;141(3). https://doi.org/10.1093/oxfordjournals.aje.a117428 .
Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics. 1980;36(1). https://doi.org/10.2307/2530508 .
Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics. 2001;57(1). https://doi.org/10.1111/j.0006-341X.2001.00158.x .
Collins LM, Lanza ST. Latent class and latent transition analysis: with applications in the social, behavioral, and health sciences. 2010.
Google Scholar
McCutcheon AL. Latent class analysis. Newbury Park: Sage; 1987.
Rillig MC, Ågerstrand M, Bi M, Gould KA, Sauerland U. Risks and benefits of large language models for the environment. Environm Sci Technol. 2023;57(9). https://doi.org/10.1021/acs.est.3c01106 .
Sagi O, Rokach L. Ensemble learning: a survey. Wiley Interdiscip Rev Data Min Knowl Discov. 2018;8(4). https://doi.org/10.1002/widm.1249 .
Download references
The authors thank the reviewers and the Editor for their insightful comments and suggestions, which have greatly improved the quality of this work.
Not applicable.
Authors and affiliations.
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
Michael Li & Xianming Tan
Department of Mathematics and Statistics, University of North Carolina at Greensboro, Greensboro, NC, 27402, USA
Jianping Sun
Department of Biostatistics and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
Xianming Tan
XT conceived the paper. ML and XT devised the study. ML, JS, and XT analyzed the datasets. ML, JS, and XT wrote the manuscript. All authors edited and approved the final manuscript.
Correspondence to Xianming Tan .
Ethics approval and consent to participate, competing interests.
The authors declare that they have no competing interests.
Publisher’s note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material 1: Table S1. TRIPOD + AI checklist.
Rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .
Cite this article.
Li, M., Sun, J. & Tan, X. Evaluating the effectiveness of large language models in abstract screening: a comparative analysis. Syst Rev 13 , 219 (2024). https://doi.org/10.1186/s13643-024-02609-x
Received : 20 December 2023
Accepted : 11 July 2024
Published : 21 August 2024
DOI : https://doi.org/10.1186/s13643-024-02609-x
ISSN: 2046-4053
Scientific Reports volume 14, Article number: 3440 (2024)
The emergence of publicly accessible artificial intelligence (AI) large language models such as ChatGPT has given rise to global conversations on the implications of AI capabilities. Emergent research on AI has challenged the assumption that creative potential is a uniquely human trait; thus, there seems to be a disconnect between human perception and what AI is objectively capable of creating. Here, we aimed to assess the creative potential of humans in comparison to AI. In the present study, human participants (N = 151) and GPT-4 provided responses for the Alternative Uses Task, Consequences Task, and Divergent Associations Task. We found that AI was robustly more creative along each divergent thinking measurement in comparison to the human counterparts. Specifically, when controlling for fluency of responses, AI was more original and elaborate. The present findings suggest that the current state of AI language models demonstrates higher creative potential than human respondents.
Introduction.
The release of ChatGPT, a natural language processing (NLP) model developed by OpenAI 1 , to the general public has prompted global conversation on the utility of artificial intelligence (AI). OpenAI’s Generative Pretrained Transformer (GPT) is a type of machine learning model that specializes in pattern recognition and prediction and has been further trained using Reinforcement Learning from Human Feedback (RLHF) so that ChatGPT responses would be indistinguishable from human responses. Recently, OpenAI 1 has advertised the new model (GPT-4) as “more creative,” particularly “on creative and technical writing tasks,” in comparison to previous versions, although there are arguably semantic limitations such as nonsensical answers or the possibility of generating incorrect information 2 . Given the accessibility of AI models in the current climate, research across a variety of domains has started to emerge, thus contributing to our growing understanding of the possibilities and potential limitations of AI.
Creativity as a phenomenological construct is not immune to the effects of AI. For example, researchers have begun to assess AI models to determine appropriate design solutions 3 and logical reasoning 4 . These assessments focus on convergent thinking, i.e., determining one optimal solution to a pre-defined problem 5 . Traditionally, convergent thinking assumes an optimal single solution path and can be assessed through traditional intelligence measures or synthesis tasks. Although convergent thinking emphasizes single optimal solutions, this does not negate the potential for original or non-obvious solutions. However, convergent thinking tasks by design typically do not allow for flexible or out-of-the-box thinking. In contrast, divergent thinking involves generating multiple creative solutions to a single problem, which leaves room for flexibility 6 . Creativity researchers commonly focus on divergent creativity (in comparison to convergent creativity), given the associative mechanisms that underlie people’s ability to generate creative solutions (i.e., creative potential). Specifically, divergent thinking is considered an indicator of a person’s creative potential, but this does not guarantee creative achievement 7 . Instead, creative potential can be indicative of future capability, rather than an immediate trait that determines if someone is creative. Accordingly, a person’s creative potential has been captured via divergent thinking tasks such as the Alternative Uses Task [AUT 6 , 7 ] or the Consequences Task [CT 8 , 9 ]. Divergent thinking tasks can be evaluated along three dimensions: fluency (number of responses), originality (response novelty), and elaboration (length/detail of response). Responses in each category are given scores (i.e., according to each task) and used to assess individual differences in divergent creativity, or in other words, a person’s creative potential.
Given the emergence of OpenAI’s GPT-4 as a large language model, research has begun to empirically assess the creative potential of artificial intelligence language models through divergent thinking tasks. On one hand, some researchers argue that the human cognitive mechanisms present during creative tasks are not present in AI, and thus the creative potential of artificial intelligence can only reflect artificial creativity 10 . On the other hand, computational creativity suggests parallel networks that reflect the mechanisms of how humans go through iterative, deliberative, and generative creative processes which aid in the ability to determine creative solutions 11 . Although these aspects have been shown to aid in creative solutions, humans can experience idea fixedness, which can act as a roadblock to other creative solutions. Machines, however, will not experience this phenomenon in a metacognitive way due to computationally trained models that streamline a machine’s direct responses to a prompt 12 , 13 , 14 . Instead, a machine’s fixedness may perhaps reflect the training data of the model which could be argued is a computational consideration, rather than a creative one.
Furthermore, computational researchers have posed increasing debate on the creative capabilities of artificial intelligence models 15 by asking questions such as: How are machines capable of determining what is creative? At present, AI’s inability to explicitly determine why or if something is creative is then compensated through human-assistance. For example, human intervention is necessary for inputting appropriate and relevant data to train the model and shape outputs to become more linguistically natural 16 , 17 . This computational limitation suggests that AI is not capable of divergent creativity due to the lack of metacognitive processes (i.e., evaluation, task motivation) because AI could not generate creative ideas or reiterate on existing ideas without the intervention (i.e., input) of a human user 10 . Similarly, emotions have been seen as an integral part of creativity such that emotions help dictate states of flow or mind-wandering that aid in creative processes 18 . However, AI may not necessarily need to rely on metacognitive or affective processes to generate novel ideas 19 due to the computational framework. Thus, inner processes that contribute to human creativity may be a philosophical argument within artificial creativity models 20 .
As briefly reviewed, the creative capabilities of artificial intelligence, thus far, have scientifically and philosophically varied [e.g., 10 , 20 ]. Researchers posit humanistic and computational considerations of the creative potential of AI; however, the accessibility of tools to artificially generate products or ideas has given researchers the opportunity to evaluate public perception. For instance, people think more highly of generated artworks if they were told the artworks were created by humans but not AI 21 , 22 . The expectancy that AI-generated products or ideas are less creative or hold less aesthetic value than human-created artworks appears to depend on implicit anti-AI biases 22 , 23 , 24 , as AI-generated products have been found to be indistinguishable from human-created products 25 , 26 , 27 . People’s inability to distinguish between human and AI-created products supports the feasibility of AI having creative potential.
Indeed, AI has been found to generate novel connections in music 28 , science 26 , medicine 29 , and visual art 30 , to name a few. In assessments of divergent thinking, humans outperformed AI on the Alternative Uses Task 31 , but it is noteworthy that the authors propose a possible rise in AI capabilities given future progress of large language models. In fact, recent studies have found that AI divergent creativity matched that of humans using a later version of GPT-4 32 , 33 . Researchers have continued to demonstrate that current LLMs frequently score within the top 1% of human responses on standard divergent thinking tasks such as the Alternative Uses Task 32 , 33 , 34 . Additional studies utilizing other divergent thinking tasks have also reported findings that paint a more nuanced picture. For example, when scores were compared between humans and GPT-4 on a Divergent Associations Task (DAT 35 ), the researcher found that GPT-4 was more creative than human counterparts 36 . Recent research on OpenAI’s text-to-image platform DALL·E has reported similar findings 37 and suggests that OpenAI models could match or even outperform humans in combinational creativity tasks. Given the research on AI creativity thus far, OpenAI’s advertorial claims that GPT-4 is “more creative” may hold more merit than anticipated.
Thus far, the novelty of OpenAI’s ChatGPT has posed more questions that have yet to be examined. Although creativity has been considered a uniquely human trait 38 , the emergence of OpenAI’s generative models suggests a possible shift in how people may approach tasks that require “out of the box” thinking. Thus, the current research aims to examine how divergent creativity (i.e., fluency, originality, elaboration) may differ between humans and AI on verbal divergent thinking tasks. To our knowledge, this is one of the first studies to comprehensively examine the verbal responses across a battery of the most common divergent thinking tasks (i.e., Alternative Uses Task, Consequences Task, and Divergent Associations Task) with novel methodology by matching the fluency of ideas between human subjects and ChatGPT. We anticipate that AI may demonstrate higher creative potential in comparison to humans, though given the recency of AI-centered creativity research, our primary research questions are exploratory in nature.
Human participation.
Human participants (N = 151) were recruited via the Prolific online data collection platform in exchange for monetary compensation of $8.00. Participation was limited to people who had a reported approval rating above 97%, were proficient English speakers, and were born in or resided in the USA. Average total response time for completing the survey was 34.66 min. A statistical sensitivity analysis indicated that we had sufficient power to detect small effects with the present sample size ( f ² = 0.06, 1 − β = 0.80). The present study was performed in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board for Human Subjects Research at the University of Arkansas. All participants provided informed consent prior to the start of the study. All statistical analyses were conducted in RStudio 39 . See Table 1 for participant demographics.
Artificial participants were operationalized as ChatGPT’s instancing feature. Each ChatGPT session was considered an independent interaction between the user and GPT interface. Here, we prompted separate instances per creativity measure (as detailed below) which resulted in artificial participation sessions. For example, we used a single session instance to feed each prompt and aggregated each prompt response into a data file. In total, we collected 151 instances which represent AI’s participation for a balanced sample. For two of the creativity measures (Alternative Uses Task and Consequences Task), which are the only timed tasks, fluency was matched 1:1 such that the number of responses for both groups is equal on these timed tasks. Fluency scores of each human respondent were first calculated to match 1:1 for each GPT-4 instance for the Alternative Uses Task and Consequences Task (detailed below). Only valid responses were retained. For example, human participant #52 had a total fluency score of 6, thus GPT-4 instance #52 was instructed to provide 6 responses.
Alternative uses task.
The Alternative Uses Task (AUT 6 ) was used to test divergent thinking. In this task, participants were presented with two common objects (‘fork’ and ‘rope’) and were asked to generate as many creative uses as possible for each object. Responses were scored for fluency (i.e., number of responses), originality (i.e., uniqueness of responses), and elaboration (i.e., number of words per valid response). Participants were given 3 min to generate their responses for each item. Following prior research 40 , instructions for human respondents on the AUT were:
For this task, you'll be asked to come up with as many original and creative uses for [item] as you can. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your ideas don't have to be practical or realistic; they can be silly or strange, even, so long as they are CREATIVE uses rather than ordinary uses. You can enter as many ideas as you like. The task will take 3 minutes. You can type in as many ideas as you like until then, but creative quality is more important than quantity. It's better to have a few really good ideas than a lot of uncreative ones. List as many ORIGINAL and CREATIVE uses for a [item] .
Because the goal was to control for fluency, we excluded prompt parameters such as 'quantity' from the GPT-4 instructions. Similarly, GPT does not need timing parameters in comparison to humans because we denoted the specific number of responses required. See below for instructions used per GPT instance:
For this task, you'll be asked to come up with as original and creative uses for [item] as you can. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your ideas don't have to be practical or realistic; they can be silly or strange, even, so long as they are CREATIVE uses rather than ordinary uses. List [insert fluency number] ORIGINAL and CREATIVE uses for a [item].
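The 1:1 fluency matching described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' procedure (the prompts were provided to ChatGPT by the first author). The names `GPT_AUT_CLOSING` and `build_matched_prompts` and the example fluency values are hypothetical, and only the closing sentence of the GPT instructions is reproduced here.

```python
# Closing sentence of the GPT-4 AUT instructions quoted above; in practice the
# full instruction text would precede it.
GPT_AUT_CLOSING = "List {fluency} ORIGINAL and CREATIVE uses for a {item}."


def build_matched_prompts(human_fluency, item):
    """Return one prompt per GPT-4 instance, matched to a human's fluency.

    human_fluency maps a participant ID to that participant's number of valid
    responses. Skipping participants with no valid responses is an assumption
    of this sketch, not something the source specifies.
    """
    prompts = {}
    for participant_id, fluency in human_fluency.items():
        if fluency < 1:
            continue
        prompts[participant_id] = GPT_AUT_CLOSING.format(fluency=fluency, item=item)
    return prompts


# Hypothetical fluency scores; participant 52 mirrors the example in the text.
example_fluency = {51: 4, 52: 6, 53: 0}
prompts = build_matched_prompts(example_fluency, "fork")
print(prompts[52])
```

Keying the prompts by participant ID keeps each GPT-4 instance paired with the human respondent whose fluency it mirrors.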
The Consequences Task (CT 8 , 9 ) is part of the verbal section of the Torrance Test of Creative Thinking (TTCT) that presents hypothetical scenarios as prompts (e.g., what would happen if humans no longer needed to sleep?). Similar to the AUT, participants generate as many consequences of the prompt as they can within a given timeframe. Responses were scored for fluency (i.e., number of responses), originality (i.e., uniqueness of responses), and elaboration (i.e., number of words per valid response). General task instructions for human respondents were:
In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". For 3 minutes, try and think of any and all consequences that might result from the statement. Please be as creative as you like. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your responses will be scored based on originality and quality. Remember, it is important to try to keep thinking of responses and to type them in for the entire time for the prompt. REMINDER: In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". For 3 minutes, try and think of any and all consequences that might result from the statement. Do this as many times as you can in 3 min. The screen will automatically change when the time is completed. Remember, it is important to try to keep thinking of responses and to type them in for the entire time for the prompt.
Participants were given two prompts shown independently: “Imagine humans no longer needed sleep,” and “Imagine humans walked with their hands.” The two CT prompts have been extensively used in research on divergent thinking 41 , 42 , 43 . Similar to the AUT, fluency and timing parameters were excluded from the GPT instructions on the CT:
In this task, a statement will appear on the screen. The statement might be something like "imagine gravity ceases to exist". Please be as creative as you like. The goal is to come up with creative ideas, which are ideas that strike people as clever, unusual, interesting, uncommon, humorous, innovative, or different. Your responses will be scored based on originality and quality. Try and think of any and all consequences that might result from the statement. [Insert scenario]. What problems might this create? List [insert fluency number] CREATIVE consequences.
The Divergent Association Task (DAT 35 ) is a measure of divergent and verbal semantic creative ability. This task asks participants to come up with 10 nouns as different from each other as possible. These nouns must not be proper nouns or any type of technical term. Pairwise comparisons of semantic distance between the 10 nouns are calculated using cosine distance. The average distance across all pairwise comparisons is then multiplied by 100 to give the final DAT score ( https://osf.io/bm5fd/ ). High scores indicate longer distances (i.e., words are not similar). Task instructions for both human participants and GPT-4 were:
Please enter 10 words that are as different from each other as possible, in all meanings and uses of the words. The rules: Only single words in English. Only nouns (e.g., things, objects, concepts). No proper nouns (e.g., no specific people or places). No specialized vocabulary (e.g., no technical terms). Think of the words on your own (e.g., do not just look at objects in your surroundings).
There were no time constraints for this task. The average human response time was 126.19 s ( SD = 90.62) and the average DAT score was 76.95 ( SD = 6.13). We scored all appropriate words that participants gave. Participants with fewer than 7 responses were excluded from data analysis (n = 2). GPT-4 received the same instructions as the human participants.
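The DAT scoring rule described above (mean pairwise cosine distance, multiplied by 100) can be sketched as follows. This is a minimal illustration, not the official scoring code linked above: the three-dimensional vectors are hypothetical stand-ins for real word embeddings, and the function names are assumptions.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dat_score(vectors):
    """Average the cosine distance over all word pairs, then scale by 100."""
    pairs = list(combinations(vectors, 2))
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy embeddings (real scoring would use pretrained word vectors).
unrelated_words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
similar_words = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]

print(dat_score(unrelated_words))
print(dat_score(similar_words))
```

Mutually orthogonal vectors give the largest score in this toy setup, while vectors pointing in the same direction give a score of zero, which matches the reading that higher scores mean less similar words.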
Human participants’ responses were collected online via Qualtrics. The entire study took on average 34 min ( SD = 13.64). The order of the creativity tasks was counterbalanced. The online study used two attention checks randomly presented throughout the study. Each attention check allowed one additional attempt. Participants who failed two attention checks were removed from all analyses (N = 2). After providing their responses to each task, participants answered demographics questions.
GPT-4 responses were generated with human assistance from the first author, who provided each prompt in the following order: AUT, CT, and DAT. We did not have to account for typical human-centered confounds such as feelings of fatigue 44 , 45 and order biases 44 , as these states are not relevant confounds in AI; thus, the order of tasks was not counterbalanced.
All variables, measurements, and exclusions for this article’s target research question have been reported in the methods section.
Both human and GPT-4 responses were cleaned to remove any instances that were incomplete or inappropriate at two stages: First, human responses that did not follow instructions from the task or were not understandable as a use (AUT; 0.96% removed) or a consequence (CT; 4.83%) were removed. Only valid human responses were used in matching for GPT fluency; Second, inappropriate or incomplete GPT responses for the AUT (< 0.001% removed) and CT (< 0.001% removed) were removed. Despite matching for fluency, only valid responses in both groups were used in subsequent analyses.
Traditional scoring methods of divergent thinking tasks have required human ratings of products or ideas and are assumed to be normative tasks (i.e., consensus will eventually be met with more raters). Here, we used the Open Creativity Scoring tool [OCS 46 ] to automate scoring of semantic distance, which captures the originality of ideas objectively by scoring the remoteness (uniqueness) of responses. Unlike human scoring, which involves multiple factors (e.g., fatigue, biases, time, cost 47 ) that could introduce confounds, automated scoring tools such as OCS circumvent these human-centered issues and have been found to robustly correlate with human ratings 46 .
The Open Creativity Scoring tool (OCS 46 ) was used to score both the AUT and CT tasks. Specifically, the semantic distance scoring tool 17 was used, which applies the GloVe 840B text-mining model 48 to assess originality of responses by representing a prompt and response as vectors in semantic space and calculating the cosine of the angle between the vectors. The OCS tool also scores for elaboration by using the stoplist method 46 . The prompts for the AUT were “rope” and “fork” and the prompts for the CT were “humans no sleep” and “humans walked hands.”
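The prompt-to-response comparison described above can be illustrated with a short Python sketch. It does not reproduce the OCS tool: the two-dimensional `TOY_VECTORS` table is a hypothetical stand-in for GloVe embeddings, averaging word vectors to represent a multi-word response is an assumption, and reporting distance as one minus the cosine is a common convention assumed here rather than confirmed by the source.

```python
import math

# Hypothetical 2-D word vectors; a real run would load pretrained embeddings.
TOY_VECTORS = {
    "fork": [1.0, 0.0],
    "eat": [0.9, 0.1],
    "hair": [0.2, 0.8],
    "comb": [0.0, 1.0],
}


def mean_vector(words):
    """Average the vectors of the words that appear in the toy vocabulary."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not known:
        raise ValueError("no word in the text has a vector")
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def semantic_distance(prompt, response):
    """Return 1 minus the cosine similarity of the prompt and response vectors."""
    p = mean_vector(prompt.lower().split())
    r = mean_vector(response.lower().split())
    return 1.0 - cosine_similarity(p, r)


print(semantic_distance("fork", "eat"))        # common use, small distance
print(semantic_distance("fork", "hair comb"))  # remote use, larger distance
```

In this toy vocabulary the ordinary use sits close to the prompt while the unusual use sits far from it, which is the property that lets semantic distance serve as a proxy for originality.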
Descriptive statistics for all tasks are reported in Tables 2 and 3 . Fluency descriptive statistics are reported in Table 2 . Semantic distance descriptive statistics are reported in Table 3 .
As expected, an independent sample t -test revealed no significant differences in total fluency due to controlling for fluency (as detailed above) between humans ( M = 6.94, SD = 3.80) and GPT-4 ( M = 7.01, SD = 3.81), t (602) = 0.21, 95% CI [− 0.54, 0.67], p = 0.83.
To assess originality of responses via semantic distance scores, we conducted a 2 (group: human, GPT-4) X 2 (prompt: ‘fork,’ ‘rope’) analysis of variance. The model revealed significant main effects of group ( F (1, 600) = 622.10, p < 0.001, η 2 = 0.51) and prompt ( F (1, 600) = 584.50, p < 0.001, η 2 = 0.49) on originality of responses. Additionally, there was a significant interaction between group and prompt, F (1, 600) = 113.80, p < 0.001, η 2 = 0.16. In particular, both samples had higher originality scores for the prompt ‘fork’ in comparison to ‘rope,’ but GPT-4 scored higher in originality, regardless of prompt. Tukey’s HSD post hoc analysis showed that all pairwise comparisons were significantly different ( p < 0.001) aside from the human ‘fork’ and GPT-4 ‘rope’ originality ( p = 0.989). Overall, GPT-4 was more successful at coming up with divergent responses given the same number of opportunities to generate answers compared to the human counterparts and showed higher originality but only for specific prompts (Fig. 1 ).
Analysis of variance of originality on the alternative uses task.
Next, we compared elaboration scores between humans and GPT-4. Fluency scores differ from elaboration in the sense that fluency accounts for each coherent response whereas elaboration quantifies the number of words per valid response. For example, a person could respond “you could use a fork to knit or as a hair comb.” In this example, the fluency would be 2 (knitting instrument and comb), but the elaboration would be 12 (number of words used in the response). The results of an independent t -test revealed that elaboration was significantly higher for GPT-4 ( M = 15.45, SD = 6.74) in comparison to humans ( M = 3.38, SD = 2.91), t (602) = 28.57, 95% CI [11.24, 12.90], p < 0.001.
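As a rough check on the elaboration comparison above, the t statistic can be recomputed from the reported means and standard deviations. This sketch assumes equal group sizes of 302 (inferred from the reported 602 degrees of freedom, not stated in the text), a pooled-variance test, and a normal approximation (1.96) for the 95% interval; the function name is hypothetical.

```python
import math


def pooled_t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's t statistic, degrees of freedom, and standard error from summary data."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df, se


# Reported AUT elaboration values; n = 302 per group is an assumption.
t_stat, df, se = pooled_t_from_summary(15.45, 6.74, 302, 3.38, 2.91, 302)
diff = 15.45 - 3.38
ci = (diff - 1.96 * se, diff + 1.96 * se)  # normal approximation to the t quantile

print(round(t_stat, 2), df, [round(x, 2) for x in ci])
```

Under these assumptions the result lands close to the reported statistic and interval, which suggests the published test was a pooled-variance comparison over roughly 302 observations per group.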
As expected, an independent t -test revealed no significant differences in total fluency between humans ( M = 5.71, SD = 3.20) and GPT-4 ( M = 5.50, SD = 3.15), t (621) = 0.82, 95% CI [− 0.29, 0.71], p = 0.41.
To assess originality of responses via semantic distance scores, we conducted a 2 (group: human, GPT) X 2 (prompt: ‘no more sleep,’ ‘walk on hands’) analysis of variance. The model revealed significant main effects of group ( F (1, 619) = 622.10, p < 0.001, η 2 = 0.51) and prompt ( F (1, 619) = 584.50, p < 0.001, η 2 = 0.49) on the originality of responses. Additionally, there was a significant interaction between group and prompt, F (1, 619) = 113.80, p < 0.001, η 2 = 0.16. In particular, originality was marginally higher for the prompt ‘walk on hands’ in the GPT sample, although there were no significant differences in originality in the human sample between the two prompts. Tukey’s HSD post hoc analysis showed that all pairwise comparisons were significantly different ( p < 0.001) aside from the human responses for both prompts ( p = 0.607). Overall, GPT-4 was more successful at coming up with divergent responses given the same number of opportunities compared to the human counterparts, and also showed higher originality dependent on prompt type (Fig. 2 ).
Analysis of variance of originality on the consequences task.
Next, we calculated the difference in elaboration between humans and GPT-4. The results of an independent t -test revealed that elaboration was significantly higher in the GPT-4 sample ( M = 38.69, SD = 15.60) than in the human sample ( M = 5.45, SD = 4.04), t (621) = − 36.04, 95% CI [− 35.04, − 31.45], p < 0.001.
We assessed the qualitative aspect of the words generated in the DAT by both humans and GPT through word occurrence, namely the frequency of single-occurrence words (non-repeating words within groups) and unique-occurrence words (words only occurring once between groups).
Humans had a higher number of single-occurrence words (n = 523), which accounted for 69.92% of the total group response, in comparison to GPT’s number of single-occurrence words (n = 152), which accounted for 47.95% of the total group response (Table 4 ). In total, 9.11% (n = 97) of responses overlapped between the two groups. Exclusively unique words that only occurred in the human responses accounted for 87.03% (n = 651) in comparison to unique GPT responses, which accounted for 69.40% (n = 220).
A chi-square test of independence was performed to examine the relationship between groups (GPT vs human) and word type (single occurrence vs unique occurrence). The relationship between these variables was not significant, \(\chi^2\)(1, N = 302) = 1.56, p = 0.211. This suggests that uniqueness and occurrences of words may not have necessarily aided either group in originality, but rather aided in word complexity.
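The chi-square test of independence can be sketched in plain Python. Treating the word counts from the preceding paragraph (523 and 651 for humans, 152 and 220 for GPT) as the four cells of the contingency table is an assumption, since the source reports N = 302 and does not show the table itself; no continuity correction is applied, and the function name is hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (df = 1) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For one degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: human, GPT. Columns: single-occurrence, unique-occurrence words.
word_counts = [[523, 651], [152, 220]]
chi2, p_value = chi_square_2x2(word_counts)
print(round(chi2, 2), round(p_value, 3))
```

Under this assumption the statistic and p-value come out close to the reported values, which is consistent with a test run on word counts without a continuity correction.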
Differences in semantic distance scores were calculated between human and GPT-4 DAT responses. An independent sample t -test revealed that GPT responses ( M = 84.56, SD = 3.05) had higher semantic distances in comparison to human responses ( M = 76.95, SD = 6.13), t (300) = 13.65, 95% CI [6.51, 8.71], p < 0.001. Although human participants produced a broader range of unique responses, this breadth did not translate into higher semantic distance scores when comparing groups.
The present study offers novel evidence on the current state of large language models (i.e., GPT-4) and the capabilities of divergent creative output in comparison to human participants. Overall, GPT-4 was more original and elaborate than humans on each of the divergent thinking tasks, even when controlling for fluency of responses. In other words, GPT-4 demonstrated higher creative potential across an entire battery of divergent thinking tasks (i.e., Alternative Uses Task, Consequences Task, and Divergent Associations Task).
Notably, no other study has comprehensively assessed multiple dimensions of the most frequently used divergent thinking tasks and AI. However, studies have begun to examine differences in divergent creativity between humans and AI, particularly after the public emergence of OpenAI’s ChatGPT, with findings showing that AI’s creative potential scores within the top 1% of human responses in terms of originality 32 , 33 , 34 . While there has been an influx in research examining the creativity of generative language models, to date only one previous study showed that humans outperformed GPT on the AUT (GPT-3 31 ), while another study reported that later versions of GPT (GPT-4) showed similar, albeit slightly less, creative potential in comparison to humans 32 . Similarly, one previous study demonstrated that generative models were improved in GPT-4 compared to GPT-3.5, particularly in terms of fluency, but interestingly, not in terms of elaboration 49 , which suggests that the creative potential of these LLMs is improving, particularly the ability to generate original ideas. Indeed, only one other study thus far has reported similar results that GPT outperformed humans on the DAT 36 , but the DAT is only one aspect of divergent thinking. Instead, the novelty of the present findings provides a foundation for future research to continue to examine multiple dimensions of divergent thinking and artificial intelligence.
While the present results suggest that current AI models outperform humans on divergent thinking tasks by a significant margin, there are methodological considerations that could have contributed to these results. Comprehensively examining creativity requires an assessment not only of originality, but also of the usefulness and appropriateness of an idea or product 50. Traditionally, appropriateness has proven more difficult to standardize than originality, given the many dimensions that contribute to its assessment, such as sociocultural and historical context. Semantic distance scores do not take these variables into consideration; instead, the scores reflect the relative distance between seemingly related (or unrelated) ideas. In this instance, GPT-4’s answers yielded higher originality than those of its human counterparts, but the feasibility or appropriateness of its ideas could be vastly inferior to that of human ideas. Thus, the results should be taken to reflect only a single aspect of divergent thinking, rather than as a generalization that AI is more creative across the board. Future research on AI and creativity needs to account not only for the traditional measurements of creativity (i.e., fluency, elaboration, originality) but also for the usefulness and appropriateness of the ideas.
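To make concrete why a semantic distance score captures only distance and not appropriateness, here is a minimal sketch of a DAT-style score. It takes the mean cosine distance over all pairs of words and multiplies it by 100. This is an illustration, not the study's scoring pipeline. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented, whereas a real analysis would load pretrained word embeddings such as GloVe.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(words, embeddings):
    """Average cosine distance over all word pairs, scaled by 100."""
    pairs = list(combinations(words, 2))
    total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
    return 100.0 * total / len(pairs)


# HYPOTHETICAL toy vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "cat": (1.0, 0.0, 0.0),
    "dog": (0.9, 0.1, 0.0),
    "freedom": (0.0, 1.0, 0.0),
    "volcano": (0.0, 0.0, 1.0),
}

related = mean_pairwise_distance(["cat", "dog"], TOY_EMBEDDINGS)
unrelated = mean_pairwise_distance(["cat", "freedom", "volcano"], TOY_EMBEDDINGS)
print(f"related pair: {related:.2f}, unrelated triple: {unrelated:.2f}")
```

Nothing in this computation looks at whether a response is useful or appropriate. Any list of mutually distant words scores highly, which is the limitation the paragraph above describes.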
Interestingly, GPT-4 repeated words more frequently than human respondents did. Although the vocabulary used in human responses was much broader, this did not necessarily result in higher semantic distance scores. Flexibility, or the number of categories of responses, has also been found to be lower (i.e., more similar categories of words were generated) for AI than for humans 34. In other words, as in our present results, humans came up with a wider range of responses; however, this did not indicate increased originality. These findings suggest that flexible thinking may be the strong point of human divergent thinking.
Moreover, the complexity of the words chosen by AI, albeit more concentrated in occurrence, could have contributed more robustly to the originality effects. For example, only AI used words for non-tangible concepts (i.e., freedom, philosophy), whereas humans may have fixated on generating ideas that are appropriate and observable. These differences between the generated lists (tangible versus non-tangible words) could bias originality scores in favor of AI.
Similarly, we need to critically consider the uniqueness of words generated in DAT responses. There was marginal overlap between the human and AI responses (9.11%), and humans responded with a higher number of single-occurrence words. Despite these differences, AI still had a higher semantic distance score. Prior research shows that human respondents’ originality increases over time 51. This increase is seen as an expansion of activation in an individual’s semantic network, which leads to more original responses 52. Human responses on these DT tasks tend to follow a diminishing-returns curve before reaching a plateau for an individual’s more original responses 53. The higher levels of elaboration and semantic distance in AI responses suggest that LLM processing may not need the ramp-up time seen in human responses; LLMs can therefore respond at their highest level of originality when prompted. Whereas humans may fixate on more obvious responses at first, this algorithmic trait could serve as an aid in overcoming ideation fixedness in humans.
It is important to note that the measures used in this study are all measures of creative potential, but involvement in creative activities or achievements is another aspect of measuring a person’s creativity. Creative potential is not a guarantee for creative achievement; instead, we need to consider creative potential as an indicator of a person’s creative capabilities 7 . Here, AI was more original thus indicating higher creative potential, but this metric may more appropriately reflect the advancement of the algorithms these models were trained on in conjunction with human input. In other words, AI, unlike humans, does not have agency, thus AI creative potentials are dependent on the assistance of a human user to elicit responses. Therefore, the creative potential of AI is in a constant state of stagnation unless prompted.
Moreover, researchers have examined the interplay between creative potential and real-world creative achievements 54,55, but this approach assumes human-level creativity and cannot account for artificial intelligence. AI can generate creative ideas, but it cannot be assumed that this potential would translate to achievement. The creative potential of AI is limited by its lack of autonomy: what the algorithms can create depends on the intervention of a human user. Thus, future research should consider the conceptual implications of current measurements of creativity for applications in real-world settings, and how generalizability at the intersection of potential and achievement may be a human-centric consideration.
The prevalence and accessibility of the internet have drastically shaped the way in which humans interact with language processing systems and search engines. LLMs such as GPT-4 are no exception in their ubiquity. Searching for information now has multiple channels that were not previously available, and with these functions comes an array of strategies to best find the desired information. Research has shown that younger people are better and more efficient in their online search strategies for finding the information they want 56, which suggests that exposure to search platforms acts as practice in efficiency. As with search platforms, humans may gradually learn how to best utilize GPT-4 and other LLM-based AI platforms. For information-seeking tools like GPT-4, creative potential has shown clear progression in capabilities, although limitations remain, such as response appropriateness and AI’s ability to generate idiosyncratic associations. Generative AI has demonstrated robustness in creative potential but has also shown weaknesses (i.e., less flexible thinking) that could be supplemented by human assistance. Moving forward, the possibilities of AI acting as a tool of inspiration, as an aid in a person’s creative process, or as a means to overcome fixedness are promising.
All data associated with the present study is available at https://osf.io/xv6kh/ .
OpenAI. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/ (2023). Accessed July 2023.
Rahaman, M. S., Ahsan, M. T., Anjum, N., Terano, H. J. R. & Rahman, M. M. From ChatGPT-3 to GPT-4: A significant advancement in ai-driven NLP tools. J. Eng. Emerg. Technol. 2 (1), 1–11. https://doi.org/10.52631/jeet.v2i1.188 (2023).
Lee, Y. H. & Lin, T. H. The feasibility study of AI image generator as shape convergent thinking tool. In International Conference on Human-Computer Interaction 575–589. https://doi.org/10.1007/978-3-031-35891-3_36 (2023).
Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q. & Zhang, Y. Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv:2304.03439 (2023).
Cropley, A. In praise of convergent thinking. Creat. Res. J. 18 (3), 391–404. https://doi.org/10.1207/s15326934crj1803_13 (2006).
Guilford, J. P. The Nature of Human Intelligence (McGraw-Hill, 1967).
Runco, M. A. & Acar, S. Divergent thinking as an indicator of creative potential. Creat. Res. J. 24 (1), 66–75. https://doi.org/10.1080/10400419.2012.652929 (2012).
Torrance, E. P. The Torrance Tests of Creative Thinking: Norms-Technical Manual (Personal Press, 1974).
Wilson, R. C., Guilford, J. P., Christensen, P. R. & Lewis, D. J. A factor-analytic study of creative-thinking abilities. Psychometrika 19 (4), 297–311. https://doi.org/10.1007/bf02289230 (1954).
Runco, M. A. AI can only produce artificial creativity. J. Creat. 33 (3), 100063. https://doi.org/10.1016/j.yjoc.2023.100063 (2023).
Finke, R. A. Imagery, creativity, and emergent structure. Conscious. Cogn. 5 (3), 381–393. https://doi.org/10.1006/ccog.1996.0024 (1996).
Sarker, I. H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2 (6), 420. https://doi.org/10.1007/s42979-021-00815-1 (2021).
Khurana, D., Koli, A., Khatter, K. & Singh, S. Natural language processing: State of the art, current trends and challenges. Multimed. Tools Appl. 82 (3), 3713–3744. https://doi.org/10.1007/s11042-022-13428-4 (2022).
Zhou, M., Duan, N., Liu, S. & Shum, H.-Y. Progress in neural NLP: Modeling, learning, and reasoning. Engineering 6 (3), 275–290. https://doi.org/10.1016/j.eng.2019.12.014 (2020).
Cardoso, A., Veale, T. & Wiggins, G. A. Converging on the divergent: The history (and future) of the international joint workshops in computational creativity. AI Mag. 30 (3), 15. https://doi.org/10.1609/aimag.v30i3.2252 (2009).
Lambert, N., Castricato, L., von Werra, L. & Havrilla, A. Illustrating Reinforcement Learning from Human Feedback (RLHF). Hugging Face. https://huggingface.co/blog/rlhf (2022).
Dumas, D., Organisciak, P. & Doherty, M. Measuring divergent thinking originality with human raters and text-mining models: A psychometric comparison of methods. Psychol. Aesthet. Creat. Arts 15 (4), 645–663. https://doi.org/10.1037/aca0000319 (2021).
Kane, S. et al. Attention, affect, and creativity, from mindfulness to mind-wandering. In The Cambridge Handbook of Creativity and Emotions (eds Ivcevic, Z. et al. ) 130–148 (Cambridge University Press, 2023). https://doi.org/10.1017/9781009031240.010 .
Chatterjee, A. Art in an age of artificial intelligence. Front. Psychol. 13 , 1024449. https://doi.org/10.3389/fpsyg.2022.1024449 (2022).
Boden, M. A. Computer models of creativity. AI Mag. 30 (3), 23–23. https://doi.org/10.1609/aimag.v30i3.2254 (2009).
Bellaiche, L. et al. Humans versus AI: Whether and why we prefer human-created compared to AI-created artwork. Cogn. Res. Princ. Implic. 8 (1), 1–22. https://doi.org/10.1186/s41235-023-00499-6 (2023).
Chiarella, S. et al. Investigating the negative bias towards artificial intelligence: Effects of prior assignment of AI-authorship on the aesthetic appreciation of abstract paintings. Comput. Hum. Behav. 137 , 107406. https://doi.org/10.1016/j.chb.2022.107406 (2022).
Fortuna, P. & Modliński, A. A(I)rtist or counterfeiter? Artificial intelligence as (D) evaluating factor on the art market. J. Arts Manag. Law Soc. 51 (3), 188–201. https://doi.org/10.1080/10632921.2021.1887032 (2021).
Liu, Y., Mittal, A., Yang, D. & Bruckman, A. Will AI console me when I lose my pet? Understanding perceptions of AI-mediated email writing. In Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3517731 (2022).
Chamberlain, R., Mullin, C., Scheerlinck, B. & Wagemans, J. Putting the art in artificial: Aesthetic responses to computer-generated art. Psychol. Aesthet. Creat. Arts 12 (2), 177–192. https://doi.org/10.1037/aca0000136 (2018).
Gao, C. A. et al. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. Biorxiv https://doi.org/10.1016/j.patter.2023.100706 (2023).
Samo, A. & Highhouse, S. Artificial intelligence and art: Identifying the aesthetic judgment factors that distinguish human- and machine-generated artwork. Psychol. Aesthet. Creat. Arts. https://doi.org/10.1037/aca0000570 (2023).
Yin, Z., Reuben, F., Stepney, S. & Collins, T. Deep learning’s shallow gains: A comparative evaluation of algorithms for automatic music generation. Mach. Learn. 112 (5), 1785–1822. https://doi.org/10.1007/s10994-023-06309-w (2023).
Kumar, Y., Koul, A., Singla, R. & Ijaz, M. F. Artificial intelligence in disease diagnosis: A systematic literature review, synthesizing framework and future research agenda. J. Ambient Intell. Hum. Comput. https://doi.org/10.1007/s12652-021-03612-z (2022).
Anantrasirichai, N. & Bull, D. Artificial intelligence in the creative industries: A review. Artif. Intell. Rev. https://doi.org/10.1007/s10462-021-10039-7 (2022).
Stevenson, C., Smal, I., Baas, M., Grasman, R. & van der Maas, H. Putting GPT-3's Creativity to the (Alternative Uses) Test. arXiv:2206.08932 (2022).
Haase, J. & Hanel, P. H. Artificial Muses: Generative Artificial Intelligence Chatbots Have Risen to Human-Level Creativity. https://doi.org/10.48550/arXiv.2303.12003 (2023).
Koivisto, M. & Grassini, S. Best humans still outperform artificial intelligence in a creative divergent thinking task. Sci. Rep. 13 , 13601. https://doi.org/10.1038/s41598-023-40858-3 (2023).
Guzik, E. E., Byrge, C. & Gilde, C. The originality of machines: AI takes the torrance test. J. Creat. 33 (3), 100065. https://doi.org/10.1016/j.yjoc.2023.100065 (2023).
Olson, J. A., Nahas, J., Chmoulevitch, D., Cropper, S. J. & Webb, M. E. Naming unrelated words predicts creativity. Proc. Natl. Acad. Sci. 118 , 25. https://doi.org/10.1073/pnas.2022340118 (2021).
Cropley, D. Is artificial intelligence more creative than humans?: ChatGPT and the divergent association task. Learn. Lett. 2 , 13–13. https://doi.org/10.59453/ll.v2.13 (2023).
Chen, L., Sun, L. & Han, J. A comparison study of human and machine-generated creativity. J. Comput. Inf. Sci. Eng. 23 (5), 051012. https://doi.org/10.1115/1.4062232 (2023).
Sawyer, R. K. Explaining Creativity: The Science of Human Innovation (Oxford University Press, 2012).
R Core Team. R: A Language and Environment for Statistical Computing (Version 4.1.0) [Computer Software]. http://www.R-project.org (2021).
Nusbaum, E. C., Silvia, P. J. & Beaty, R. E. Ready, set, create: What instructing people to “be creative” reveals about the meaning and mechanisms of divergent thinking. Psychol. Aesthet. Creat. Arts 8 (4), 423. https://doi.org/10.1037/a0036549 (2014).
Acar, S. et al. Applying automated originality scoring to the verbal form of Torrance tests of creative thinking. Gift. Child Q. 67 (1), 3–17. https://doi.org/10.1177/00169862211061874 (2021).
Hass, R. W. & Beaty, R. E. Use or consequences: Probing the cognitive difference between two measures of divergent thinking. Front. Psychol. 9 , 2327. https://doi.org/10.3389/fpsyg.2018.02327 (2018).
Urban, M. & Urban, K. Orientation toward intrinsic motivation mediates the relationship between metacognition and creativity. J. Creat. Behav. 57 (1), 6–16. https://doi.org/10.1002/jocb.558 (2023).
Day, B. et al. Ordering effects and choice set awareness in repeat-response stated preference studies. J. Environ. Econ. Manag. 63 (1), 73–91. https://doi.org/10.1016/j.jeem.2011.09.001 (2012).
Igorov, M., Predoiu, R., Predoiu, A. & Igorov, A. Creativity, resistance to mental fatigue and coping strategies in junior women handball players. Eur. Proc. Soc. Behav. Sci. https://doi.org/10.15405/epsbs.2016.06.39 (2016).
Organisciak, P. & Dumas, D. Open Creativity Scoring [Computer Software] (University of Denver, 2020). https://openscoring.du.edu/.
Beaty, R. E., Johnson, D. R., Zeitlen, D. C. & Forthmann, B. Semantic distance and the alternate uses task: Recommendations for reliable automated assessment of originality. Creat. Res. J. 34 (3), 245–260. https://doi.org/10.1080/10400419.2022.2025720 (2022).
Pennington, J., Socher, R. & Manning, C. Glove: Global vectors for word representation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , 1532–1543 (2014).
Vinchon, F., Gironnay, V. & Lubart, T. The Creative AI-Land: Exploring new forms of creativity. In Review (2023).
Runco, M. A. & Jaeger, G. J. The standard definition of creativity. Creat. Res. J. 24 (1), 92–96. https://doi.org/10.1080/10400419.2012.650092 (2012).
Beaty, R. E. & Silvia, P. J. Why do ideas get more creative across time? An executive interpretation of the serial order effect in divergent thinking tasks. Psychol. Aesthet. Creat. Arts 6 (4), 309–319. https://doi.org/10.1037/a0029171 (2012).
Mednick, S. The associative basis of the creative process. Psychol. Rev. 69 (3), 220–232. https://doi.org/10.1037/h0048850 (1962).
Hubert, K. F., Finch, A. & Zabelina, D. Diminishing Creative Returns: Predicting Optimal Creative Performance via Individual Differences in Executive Functioning (2023).
Carson, S. H., Peterson, J. B. & Higgins, D. M. Reliability, validity, and factor structure of the creative achievement questionnaire. Creat. Res. J. 17 (1), 37–50. https://doi.org/10.1207/s15326934crj1701_4 (2005).
Jauk, E., Benedek, M. & Neubauer, A. C. The road to creative achievement: A latent variable model of ability and personality predictors. Pers. Individ. Diff. https://doi.org/10.1016/j.paid.2013.07.129 (2014).
Chevalier, A., Dommes, A. & Marquié, J.-C. Strategy and accuracy during information search on the web: Effects of age and complexity of the search questions. Comput. Hum. Behav. 53 , 305–315. https://doi.org/10.1016/j.chb.2015.07.017 (2015).
These authors contributed equally: Kent F. Hubert and Kim N. Awa.
Department of Psychological Sciences, University of Arkansas, Fayetteville, AR, 72701, USA
Kent F. Hubert, Kim N. Awa & Darya L. Zabelina
D.L.Z., K.F.H., and K.N.A. contributed to the conceptualization and methodology. K.F.H. and K.N.A. contributed to formal analysis and investigation. K.F.H. prepared all figures. K.N.A. prepared all tables. D.L.Z., K.F.H., and K.N.A. contributed to writing and revision.
Correspondence to Kent F. Hubert .
Competing interests.
The authors declare no competing interests.
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
Cite this article.
Hubert, K.F., Awa, K.N. & Zabelina, D.L. The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Sci Rep 14 , 3440 (2024). https://doi.org/10.1038/s41598-024-53303-w
Received : 14 October 2023
Accepted : 30 January 2024
Published : 10 February 2024
DOI : https://doi.org/10.1038/s41598-024-53303-w
Title: Large Language Models: A Survey
Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
Foundations of Software Engineering (FSE) | July 2024
Organized by ACM
Publication
In large-scale cloud service systems, monitoring metric data and conducting anomaly detection is an important way to maintain reliability and stability. However, a great disparity exists between academic approaches and industrial practice in anomaly detection. Industry predominantly uses simple, efficient methods because of their better interpretability and ease of implementation. In contrast, academically favored deep-learning methods, despite their advanced capabilities, face practical challenges in real-world applications. To address these challenges, this paper introduces MonitorAssistant, an end-to-end practical anomaly detection system via Large Language Models. MonitorAssistant automates model configuration recommendation, achieving knowledge inheritance, and interprets alarms with guidance-oriented anomaly reports, facilitating a more intuitive engineer-system interaction through natural language. By deploying MonitorAssistant in Microsoft’s cloud service system, we validate its efficacy and practicality, marking a significant advancement in the field of practical anomaly detection for large-scale cloud services.
Chao Yan, Henry H Ong, Monika E Grabowska, Matthew S Krantz, Wu-Chen Su, Alyson L Dickson, Josh F Peterson, QiPing Feng, Dan M Roden, C Michael Stein, V Eric Kerchberger, Bradley A Malin, Wei-Qi Wei, Large language models facilitate the generation of electronic health record phenotyping algorithms, Journal of the American Medical Informatics Association , Volume 31, Issue 9, September 2024, Pages 1994–2001, https://doi.org/10.1093/jamia/ocae072
Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts.
We prompted four LLMs—GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard—in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (ie, type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network.
GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values).
GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience is still required to assess and further refine generated algorithms.
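The trade-off described above, where an algorithm is either too restrictive (low recall) or too broad (low positive predictive value), can be illustrated with a small calculation. The sketch below is not taken from the study. The patient IDs and both candidate algorithms are hypothetical, and `recall_and_ppv` is an illustrative helper name.

```python
def recall_and_ppv(predicted_cases, true_cases):
    """Return (recall, PPV) for a set of flagged patients against a reference set."""
    true_positives = len(predicted_cases & true_cases)
    # Recall: share of true cases that the algorithm flagged.
    recall = true_positives / len(true_cases) if true_cases else 0.0
    # PPV: share of flagged patients that are true cases.
    ppv = true_positives / len(predicted_cases) if predicted_cases else 0.0
    return recall, ppv


# HYPOTHETICAL data: patients 1-8 truly have the phenotype.
true_cases = set(range(1, 9))

# An overly restrictive algorithm flags only two of the true cases.
restrictive = {1, 2}
# An overly broad algorithm flags every true case plus sixteen non-cases.
broad = set(range(1, 25))

restrictive_recall, restrictive_ppv = recall_and_ppv(restrictive, true_cases)
broad_recall, broad_ppv = recall_and_ppv(broad, true_cases)
print(f"restrictive: recall={restrictive_recall:.2f}, PPV={restrictive_ppv:.2f}")
print(f"broad:       recall={broad_recall:.2f}, PPV={broad_ppv:.2f}")
```

In this toy setup the restrictive algorithm is precise but misses most cases, while the broad one finds every case at the cost of many false positives.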
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics ...
To understand how language models work, you first need to understand how they represent words. Humans represent English words with a sequence of letters, like C-A-T for "cat."
Transformer-based large language models are making significant strides in various fields, such as natural language processing 1,2,3,4,5, biology 6,7, chemistry 8,9,10 and computer programming 11 ...
To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size. Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT, which has attracted widespread attention from society.
The rapid rise of large language models (LLMs) has propelled them to the forefront of applications in natural language processing (NLP). This paper aims to present a comprehensive examination of the research landscape in LLMs, providing an overview of the prevailing themes and topics within this dynamic domain. Drawing from an extensive corpus ...
Science in the age of large language models. Abeba Birhane, Atoosa Kasirzadeh, David Leslie & Sandra Wachter. Nature Reviews Physics 5, 277-280 (2023).
Abstract. Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations of the underlying neural networks, context ...
Large language models (LLMs), such as OpenAI's GPT-4, Google's Bard or Meta's LLaMa, have created unprecedented opportunities for analysing and generating language data on a massive scale.
A large language model (LLM) is a computational model capable of language generation or other natural language processing tasks. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process. [1] The largest and most capable LLMs, as of August 2024, are artificial neural networks ...
The research will be presented at the International Conference on Learning Representations. Finding facts. Most large language models, also called transformer models, are neural networks. Loosely based on the human brain, neural networks contain billions of interconnected nodes, or neurons, that are grouped into many layers, and which encode ...
We've trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training.
For the latest Stanford research and news on large language models, subscribe to our newsletter. In July 2020, OpenAI unveiled GPT-3, a language model that was easily the largest known at the time. Put simply, GPT-3 is trained to predict the next word in a sentence, much like how a text message autocomplete feature works. However, model ...
a revolutionary transformation by enabling the creation of Large Language Models (LLMs) that can approximate human-level performance on certain evaluation benchmarks [3], [4]. LLMs ...
Large language models (LLMs) are a category of foundation models trained on immense amounts of data making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks. LLMs have become a household name thanks to the role they have played in bringing generative AI to the forefront of ...
On our dataset of 85% incorrect traces and 15% correct traces, using this method is not much better than the naïve strategy of always labeling traces as incorrect, which gives a weighted average F1 of 78. A diagram showing how well mistake-finding with LLMs can be used as a proxy for correctness of the answer on each dataset.
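The baseline figure quoted above can be checked by hand. The sketch below assumes a two-class setup in which every trace is labeled incorrect, the F1 of the never-predicted "correct" class is taken as 0, and the weighted average uses class shares as weights; the function name is hypothetical and is not taken from the quoted work.

```python
def weighted_f1_always_negative(neg_share: float) -> float:
    """Weighted-average F1 when every trace is labeled 'incorrect'.

    neg_share is the fraction of traces that are truly incorrect.
    """
    pos_share = 1.0 - neg_share
    # 'incorrect' class: every truly incorrect trace is recovered (recall 1),
    # and precision equals the share of incorrect traces in the data.
    precision_neg, recall_neg = neg_share, 1.0
    f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)
    # 'correct' class: never predicted, so its F1 is taken as 0 (assumption).
    f1_pos = 0.0
    # Weight each class F1 by that class's share of the dataset.
    return neg_share * f1_neg + pos_share * f1_pos


score = weighted_f1_always_negative(0.85)
print(round(100 * score))  # 78
```

With an 85/15 split, the "incorrect" class has an F1 of about 0.92, and weighting it by 0.85 gives roughly 0.78, which agrees with the reported value of 78.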
The emergence of Generative Artificial Intelligence (AI) and Large Language Models (LLMs) has marked a new era of Natural Language Processing (NLP), introducing unprecedented capabilities that are revolutionizing various domains. This paper explores the current state of these cutting-edge technologies, demonstrating their remarkable advancements and wide-ranging applications. Our paper ...
extensive informative summaries of the existing works to advance the LLM research. Keywords: Large Language Models, LLMs, chatGPT, Augmented LLMs, Multimodal LLMs, LLM training, LLM Benchmarking. 1. Introduction. Language plays a fundamental role in facilitating communication and self-expression for humans, and their interaction with machines.
Mapping the Mind of a Large Language Model. Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language ...
The research will be presented at the IEEE Conference on Data Science and Advanced Analytics. An off-the-shelf solution. Large language models are autoregressive, which means they can understand that the newest values in sequential data depend on previous values. For instance, models like GPT-4 can predict the next word in a sentence using the ...
The rapid advancement of Large Language Models (LLMs) has significantly influenced various domains, leveraging their exceptional few-shot and zero-shot learning capabilities.
The use of Knowledge Graphs (KGs) which constitute large networks of real-world entities and their interrelationships, has grown rapidly. A substantial body of research has emerged, exploring the integration of deep learning (DL) and large language models (LLMs) with KGs.
Objective: This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows. Methods: We developed automation scripts in Python to interact with the APIs of several LLM tools, including ...
The present study offers novel evidence on the current state of large language models (i.e., GPT-4) and the capabilities of divergent creative output in comparison to human participants.
Controllable Text Generation for Large Language Models: A Survey. Xun Liang, Hanyu Wang, Yezhaohui Wang, Shichao Song, Jiawei Yang, Simin Niu, Jie Hu, Dan Liu, Shunyu Yao, Feiyu Xiong and Zhiyu Li (2024). Corpus ID: 271924120.
Abstract: In recent years, the emergence of large language models has profoundly transformed our production and lifestyle. These models have shown tremendous potential in fields such as natural language processing, speech recognition, and recommendation systems, and are increasingly playing crucial roles in applications like human-computer interaction and intelligent customer service.
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \\cite{kaplan2020scaling ...
Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. Firstly, the action space is vast and dynamic, posing difficulties for LLM agents to maintain an up-to-date understanding ...
In large-scale cloud service systems, monitoring metric data and conducting anomaly detection is an important way to maintain reliability and stability. However, great disparity exists between academic approaches and industrial practice to anomaly detection. Industry predominantly uses simple, efficient methods due to better interpretability and ease of implementation. In contrast ...
Chao Yan, Henry H Ong, Monika E Grabowska, Matthew S Krantz, Wu-Chen Su, Alyson L Dickson, Josh F Peterson, QiPing Feng, Dan M Roden, C Michael Stein, V Eric Kerchberger, Bradley A Malin, Wei-Qi Wei, Large language models facilitate the generation of electronic health record phenotyping algorithms, Journal of the American Medical Informatics Association, Volume 31, Issue 9, September 2024 ...