
  • Open access
  • Published: 09 July 2024

Automating psychological hypothesis generation with AI: when large language models meet causal graph

  • Song Tong (ORCID: orcid.org/0000-0002-4183-8454)1,2,3,4,
  • Kai Mao5,
  • Zhen Huang2,
  • Yukun Zhao2 &
  • Kaiping Peng1,2,3,4

Humanities and Social Sciences Communications, volume 11, Article number: 896 (2024)


Subjects: Science, technology and society

Leveraging the synergy between causal knowledge graphs and a large language model (LLM), our study introduces a groundbreaking approach for computational hypothesis generation in psychology. We analyzed 43,312 psychology articles using an LLM to extract causal relation pairs. This analysis produced a specialized causal graph for psychology. Applying link prediction algorithms, we generated 130 potential psychological hypotheses focusing on “well-being”, then compared them against research ideas conceived by doctoral scholars and those produced solely by the LLM. Interestingly, our combined approach of an LLM and causal graphs mirrored expert-level insights in terms of novelty, clearly surpassing the LLM-only hypotheses (t(59) = 3.34, p = 0.007 and t(59) = 4.32, p < 0.001, respectively). This alignment was further corroborated by deep semantic analysis. Our results show that combining LLMs with machine learning techniques such as causal knowledge graphs can revolutionize automated discovery in psychology, extracting novel insights from the extensive literature. This work stands at the crossroads of psychology and artificial intelligence, championing a new, enriched paradigm for data-driven hypothesis generation in psychological research.


Introduction

In an age in which the confluence of artificial intelligence (AI) with various subjects profoundly shapes sectors ranging from academic research to commercial enterprises, dissecting the interplay of these disciplines becomes paramount (Williams et al., 2023 ). In particular, psychology, which serves as a nexus between the humanities and natural sciences, consistently endeavors to demystify the complex web of human behaviors and cognition (Hergenhahn and Henley, 2013 ). Its profound insights have significantly enriched academia, inspiring innovative applications in AI design. For example, AI models have been molded on hierarchical brain structures (Cichy et al., 2016 ) and human attention systems (Vaswani et al., 2017 ). Additionally, these AI models reciprocally offer a rejuvenated perspective, deepening our understanding from the foundational cognitive taxonomy to nuanced esthetic perceptions (Battleday et al., 2020 ; Tong et al., 2021 ). Nevertheless, the multifaceted domain of psychology, particularly social psychology, has exhibited a measured evolution compared to its tech-centric counterparts. This can be attributed to its enduring reliance on conventional theory-driven methodologies (Henrich et al., 2010 ; Shah et al., 2015 ), a characteristic that stands in stark contrast to the burgeoning paradigms of AI and data-centric research (Bechmann and Bowker, 2019 ; Wang et al., 2023 ).

In the journey of psychological research, each exploration originates from a spark of innovative thought. These research trajectories may arise from established theoretical frameworks, daily event insights, anomalies within data, or intersections of interdisciplinary discoveries (Jaccard and Jacoby, 2019). Hypothesis generation is pivotal in psychology (Koehler, 1994; McGuire, 1973), as it facilitates the exploration of multifaceted influencers of human attitudes, actions, and beliefs. The HyGene model (Thomas et al., 2008) elucidated the intricacies of hypothesis generation, encompassing the constraints of working memory and the interplay between ambient and semantic memories. Recently, causal graphs have provided psychology with a systematic framework that enables researchers to construct and simulate intricate systems for a holistic view of “bio-psycho-social” interactions (Borsboom et al., 2021; Crielaard et al., 2022). Yet the labor-intensive nature of this methodology poses challenges, as it requires multidisciplinary expertise in algorithmic development, which exacerbates the complexities (Crielaard et al., 2022). Meanwhile, advancements in AI, exemplified by models such as the generative pretrained transformer (GPT), present new avenues for creativity and hypothesis generation (Wang et al., 2023).

Building on this, large language models (LLMs) such as GPT-3, GPT-4, and Claude-2 demonstrate profound capabilities to comprehend and infer causality from natural language text, opening a promising path for extracting causal knowledge from vast textual data (Binz and Schulz, 2023; Gu et al., 2023). Exciting possibilities arise in scenarios in which LLMs and causal graphs manifest complementary strengths (Pan et al., 2023). Their synergistic combination converges human analytical and systemic thinking, echoing the holistic versus analytic cognition delineated in social psychology (Nisbett et al., 2001). This amalgamation enables fine-grained semantic analysis and conceptual understanding via LLMs, while causal graphs offer a global perspective on causality, alleviating the interpretability challenges of AI (Pan et al., 2023). This integrated methodology efficiently counters the inherent limitations of working and semantic memories in hypothesis generation and, as previous academic endeavors indicate, has proven efficacious across disciplines. For example, a groundbreaking study in physics synthesized 750,000 physics publications, utilizing cutting-edge natural language processing to extract 6368 pivotal quantum physics concepts, culminating in a semantic network forecasting research trajectories (Krenn and Zeilinger, 2020). Additionally, integrating knowledge-based causal graphs into the foundation of an LLM significantly improves the LLM’s capability for causal inference (Kıcıman et al., 2023).

To this end, our study seeks to build a pioneering analytical framework, combining the semantic and conceptual extraction proficiency of LLMs with the systemic thinking of causal graphs, with the aim of crafting a comprehensive causal network of semantic concepts within psychology. We meticulously analyzed 43,312 psychological articles, devising an automated method to construct a causal graph and systematically mining causative concepts and their interconnections. Specifically, initial sifting and preparation of the data ensure a high-quality corpus, followed by advanced extraction techniques to identify standardized causal concepts. This results in a graph database that serves as a reservoir of causal knowledge. Finally, using node embedding and similarity-based link prediction, we unearthed potential causal relationships and generated the corresponding hypotheses.

To gauge the pragmatic value of our network, we selected 130 hypotheses on “well-being” generated by our framework, comparing them with hypotheses crafted by novice experts (doctoral students in psychology) and by LLMs. The results are encouraging: our algorithm matches the caliber of novice experts in novelty, outshining the hypotheses generated solely by the LLMs. Additionally, through deep semantic analysis, we demonstrated that our algorithm's hypotheses exhibit more profound conceptual incorporation and a broader semantic spectrum.

Our study advances the field of psychology in two significant ways. Firstly, it extracts invaluable causal knowledge from the literature and converts it to visual graphics. These aids can feed algorithms to help deduce more latent causal relations and guide models in generating a plethora of novel causal hypotheses. Secondly, our study furnishes novel tools and methodologies for causal analysis and scientific knowledge discovery, representing the seamless fusion of modern AI with traditional research methodologies. This integration serves as a bridge between conventional theory-driven methodologies in psychology and the emerging paradigms of data-centric research, thereby enriching our understanding of the factors influencing psychology, especially within the realm of social psychology.

Methodological framework for hypothesis generation

The proposed LLM-based causal graph (LLMCG) framework encompasses three steps: literature retrieval, causal pair extraction, and hypothesis generation, as illustrated in Fig. 1. In the literature-gathering phase, ~140k psychology-related articles were downloaded from public databases. In step two, GPT-4 was used to distil causal relationships from these articles, culminating in a causal relationship network built from 43,312 selected articles. In the third step, an in-depth examination of these data was executed, adopting link prediction algorithms to forecast the dynamics within the causal relationship network and thereby identify concept pairs with high causal potential.

Figure 1: Overview of the LLMCG framework. Note: LLM stands for large language model; LLMCG stands for the LLM-based causal graph algorithm, which comprises literature retrieval, causal pair extraction, and hypothesis generation.

Step 1: Literature retrieval

The primary data source for this study was a public repository of scientific articles, the PMC Open Access Subset. Our decision to utilize this repository was informed by several key attributes. The PMC Open Access Subset boasts an expansive collection of over 2 million full-text XML science and medical articles, providing a substantial and diverse base from which to derive insights for our research. Furthermore, the open-access nature of the articles not only enhances the transparency and reproducibility of our methodology but also ensures that the results and processes can be independently accessed and verified by other researchers. Notably, the content within this subset originates from recognized journals, all of which have undergone rigorous peer review, lending credence to the quality and reliability of the data we leveraged. Finally, an added advantage was the rich metadata accompanying each article. These metadata were instrumental in refining our article selection process, ensuring coherent thematic alignment with our research objectives in the domain of psychology.

To identify articles relevant to our study, we applied a series of filtering criteria. First, the presence of certain keywords within article titles or abstracts was mandatory. Some examples of these keywords include “psychol”, “clin psychol”, and “biol psychol”. Second, we exploited the metadata accompanying each article. The classification of articles based on these metadata ensured alignment with recognized thematic standards in the domains of psychology and neuroscience. Upon the application of these criteria, we managed to curate a subset of approximately 140K articles that most likely discuss causal concepts in both psychology and neuroscience.
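As a rough illustration of this two-stage filter, the sketch below combines a keyword search over titles and abstracts with a metadata subject check. It is a minimal sketch, assuming articles have already been parsed into dictionaries; the field names and the abbreviated keyword list are illustrative rather than the study's exact configuration.

```python
import re

# Hypothetical in-memory records; in practice these come from the PMC
# Open Access Subset's XML files and their accompanying metadata.
all_articles = [
    {"title": "Well-being and resilience", "abstract": "A clin psychol study ...",
     "subjects": ["Clinical Psychology"]},
    {"title": "Crop yields in 2020", "abstract": "An agronomy report ...",
     "subjects": ["Agriculture"]},
]

# Illustrative keyword stems; the study's full list is longer.
KEYWORDS = re.compile(r"psychol|clin psychol|biol psychol", re.IGNORECASE)

def is_relevant(article: dict) -> bool:
    """Keep an article if its title/abstract mentions a target keyword or
    its metadata subject tags align with psychology or neuroscience."""
    text = f"{article['title']} {article['abstract']}"
    if KEYWORDS.search(text):
        return True
    return any("psycholog" in s.lower() or "neuro" in s.lower()
               for s in article.get("subjects", []))

corpus = [a for a in all_articles if is_relevant(a)]
print(len(corpus))  # -> 1
```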

Step 2: Causal pair extraction

The process of extracting causal knowledge from vast troves of scientific literature is intricate and multifaceted. Our methodology distils this complex process into four coherent steps, each serving a distinct purpose. (1) Article selection and cost analysis: determines the feasibility of processing a specific volume of articles, ensuring optimal resource allocation. (2) Text extraction and analysis: ensures the purity of the data entering our causal extraction phase by filtering out nonrelevant content. (3) Causal knowledge extraction: uses advanced language models to detect, classify, and standardize the causal relationships present in texts. (4) Graph database storage: facilitates structured storage, easy retrieval, and the possibility of advanced relational analyses for future research. This streamlined approach ensures accuracy, consistency, and scalability in our endeavor to understand the interplay of causal concepts in psychology and neuroscience.

Text extraction and cleaning

After a meticulous cost analysis detailed in Appendix A, our selection process identified 43,312 articles. This selection was strategically based on the criterion that the journal title must incorporate the term “Psychol”, signifying direct relevance to the field of psychology. The distributions of publication sources and years can be found in Table 1. Extracting the full texts of the articles from their PDF sources was an essential initial step, and, for this purpose, the PyPDF2 Python library was used. This library allowed us to seamlessly extract and concatenate titles, abstracts, and main content from each PDF article. However, a challenge arose with the presence of extraneous sections, such as references or tables, in the extracted texts. The implemented procedure, employing regular expressions in Python, was not only adept at identifying variations of the term “references” but also ascertained whether this section appeared as an isolated segment. This check was critical to ensure that the identified “references” section was indeed distinct, marking the start of a reference list without continuation into other text. Once identified as a standalone entity, the reference section and its subsequent content were removed.
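The sketch below shows one way to implement this extraction-and-truncation step with PyPDF2 and a multiline regular expression. It is a minimal sketch under stated assumptions: the heading pattern is simplified relative to the variations of “references” the study describes.

```python
import re
from PyPDF2 import PdfReader

def extract_body_text(pdf_path: str) -> str:
    """Concatenate the text of every page, then truncate at a standalone
    'References' heading so citations and back matter are excluded."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Treat 'references' as a section boundary only when it occupies a
    # line by itself, i.e., as a heading rather than inline prose.
    match = re.search(r"^\s*references\s*$", text,
                      flags=re.IGNORECASE | re.MULTILINE)
    return text[:match.start()] if match else text
```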

Causal knowledge extraction method

In our effort to extract causal knowledge, the choice of GPT-4 was not arbitrary. While several models were available for such tasks, GPT-4 emerged as the frontrunner due to its advanced capabilities (Wu et al., 2023), its extensive training on diverse data, and its proven proficiency in understanding context, especially in complex scientific texts (Cheng et al., 2023; Sanderson, 2023). Other models were indeed considered; however, GPT-4's capacity to generate coherent, contextually relevant responses gave it the edge for our project's specific requirements.

The extraction process commenced with the segmentation of the articles. Due to the token constraints inherent to GPT-4, it was imperative to break down the articles into manageable chunks, specifically those of 4000 tokens or fewer. This approach ensured a comprehensive interpretation of the content without omitting any potential causal relationships. The next phase was prompt engineering. To effectively guide the extraction capabilities of GPT-4, we crafted explicit prompts, for example a directive asking the model to elucidate causal pairs in a predetermined JSON format. For a clearer understanding, readers are referred to Table 2, which presents an example prompt and the subsequent model response. After extraction, the outputs were not immediately cataloged. A filtering process was initiated to ascertain the standardization of the concept pairs, weeding out suboptimal outputs. Aiding in this quality control, GPT-4 played a pivotal role in verifying the causal pairs, determining their relevance and causality and ensuring correct directionality. Finally, while extracting knowledge, we operated within the constraints imposed by the GPT-4 API: 60 requests and 150k tokens per minute. This interplay of prompt engineering and stringent filtering was productive.
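A condensed sketch of this chunk-then-prompt loop is given below. It assumes the openai and tiktoken Python packages and an API key in the environment; the prompt wording and JSON keys are illustrative stand-ins for the exact prompt shown in Table 2.

```python
import json
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
encoder = tiktoken.encoding_for_model("gpt-4")

def chunk_text(text: str, max_tokens: int = 4000) -> list:
    """Split an article into chunks of at most max_tokens tokens."""
    tokens = encoder.encode(text)
    return [encoder.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# Hypothetical prompt wording; the study's exact prompt appears in Table 2.
PROMPT = ("Identify causal relationships asserted in the text below. "
          "Respond with a JSON list of objects with keys "
          '"cause", "effect", and "evidence".\n\nTEXT:\n')

def extract_causal_pairs(chunk: str) -> list:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": PROMPT + chunk}],
    )
    return json.loads(response.choices[0].message.content)
```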

In addition, we conducted an exploratory study to assess GPT-4's discernment between “causality” and “correlation”, involving four graduate students (mean age 31 ± 10.23), each of whom evaluated relationship pairs extracted from psychology articles familiar to them. The experimental details and results can be found in Appendix A and Table A1. The results showed that, of 289 relationships identified by GPT-4, 87.54% were validated. Notably, when GPT-4 classified relationships as causal, only 13.02% (31/238) were judged to be non-relationships, while 65.55% (156/238) were agreed upon as causal. This shows that GPT-4 can accurately extract relationships (causality or correlation) from psychological texts, underscoring its potential as a tool for the construction of causal graphs.

To enhance the robustness of the extracted causal relationships and minimize biases, we adopted a multifaceted approach. Recognizing the indispensable role of human judgment, we periodically subjected random samples of extracted causal relationships to the scrutiny of domain experts. Their valuable feedback was instrumental in fine-tuning the extraction process in real time. Instead of relying heavily on referenced hypotheses, our focus was on extracting causal pairs primarily from the findings reported in the main texts. This systematic methodology ultimately resulted in a refined text corpus distilled from 43,312 articles, rich in conceptual insights and primed for rigorous causal extraction.

Graph database storage

Our decision to employ Neo4j as the database system was strategic. Neo4j, as a graph database (Thomer and Wickett, 2020), is inherently designed to capture and represent complex relationships between data points, an attribute that is essential for understanding intricate causal relationships. Beyond its technical prowess, Neo4j provides advantages such as scalability, resilience, and efficient querying capabilities (Webber, 2012). It is particularly adept at traversing interconnected data points, making it an excellent fit for our causal relationship analysis. The mined causal knowledge finds its abode in the Neo4j graph database. Each causal concept is represented as a node, and each causal link as a directed relationship whose direction and interpretation are stored as attributes; these relationships bind related concepts together. Storing the knowledge graph in Neo4j allows for the execution of graph algorithms to analyze concept interconnectivity and reveal potential relationships.
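A minimal sketch of this storage step, using the official neo4j Python driver, is shown below. The connection details are placeholders, and the MERGE-based Cypher pattern is one plausible way to realize the schema just described (concept nodes joined by directed CAUSES relationships), not necessarily the study's exact implementation.

```python
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_causal_pair(tx, cause: str, effect: str, interpretation: str):
    """MERGE keeps each concept node unique; the directed CAUSES
    relationship carries the extracted interpretation as an attribute."""
    tx.run(
        "MERGE (a:Concept {name: $cause}) "
        "MERGE (b:Concept {name: $effect}) "
        "MERGE (a)-[r:CAUSES]->(b) "
        "SET r.interpretation = $interpretation",
        cause=cause, effect=effect, interpretation=interpretation,
    )

with driver.session() as session:
    session.execute_write(
        store_causal_pair, "self-efficacy", "life satisfaction",
        "Higher self-efficacy was reported to increase life satisfaction.")
driver.close()
```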

The graph database contains 197k concepts and 235k connections. Table 3 encapsulates the core concepts and provides a vivid snapshot of the most recurring themes, helping us to understand the central topics that dominate current psychological discourse. In a comprehensive examination of the core concepts extracted from 43,312 psychological papers, several distinct patterns and focal areas emerged. In particular, there is a clear balance between health and illness in psychological research. The prominence of terms such as “depression”, “anxiety”, and “symptoms of depression” magnifies the discipline's commitment to understanding and addressing mental illnesses. However, juxtaposed against these are positive terms such as “life satisfaction” and “sense of happiness”, suggesting that psychology not only fixates on challenges but also delves deeply into the nuances of positivity and well-being. Furthermore, the significance given to concepts such as “life satisfaction”, “sense of happiness”, and “job satisfaction” underscores an increasing recognition of emotional well-being and job satisfaction as integral to overall mental health. Intertwining the realms of psychology and neuroscience, terms such as “microglial cell activation”, “cognitive impairment”, and “neurodegenerative changes” signal a growing interest in understanding the neural underpinnings of cognitive and psychological phenomena. In addition, the emphasis on “self-efficacy”, “positive emotions”, and “self-esteem” reflects a profound interest in understanding how self-perception and emotions influence human behavior and well-being. Concepts such as “age”, “resilience”, and “creativity” further expand the canvas, showcasing the eclectic and comprehensive nature of inquiries in the field of psychology.

Overall, this analysis paints a vivid picture of modern psychological research, illuminating its multidimensional approach. It demonstrates a discipline that is deeply engaged with both the challenges and triumphs of human existence, offering holistic insight into the human mind and its myriad complexities.

Step 3: Hypothesis generation using link prediction

In the quest to uncover novel causal relationships beyond direct extraction from texts, the technique of link prediction emerges as a pivotal methodology. It hinges on proposing potential causal ties between concepts that our knowledge graph does not explicitly connect. The process intricately weaves together vector embedding, similarity analysis, and probability-based ranking. Initially, concepts are transposed into a vector space using node2vec, which is valued for its ability to capture topological nuances. Every pair of unconnected concepts is assigned a similarity score, and pairs that do not meet a set benchmark are discarded. For the highest-scoring pairs, the likelihood of linkage is then assessed using the Jaccard similarity of their neighboring concepts. Subsequently, these potential causal relationships are organized in descending order of their derived probabilities, and the top-ranked pairs are selected.
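The pipeline can be sketched on a toy graph as below. It is a minimal sketch assuming the node2vec and networkx Python packages; the similarity benchmark of 0.3 and the embedding hyperparameters are illustrative, and the real graph holds 197k concepts rather than four.

```python
from itertools import combinations
import networkx as nx
import numpy as np
from node2vec import Node2Vec

# Toy graph standing in for the 197k-concept causal graph.
G = nx.Graph([("BIS", "BAS"), ("BIS", "BAS reward response"),
              ("interference", "BAS"), ("interference", "BAS reward response")])

model = Node2Vec(G, dimensions=32, walk_length=10, num_walks=50, quiet=True).fit()

def cosine(u: str, v: str) -> float:
    a, b = model.wv[u], model.wv[v]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(u: str, v: str) -> float:
    nu, nv = set(G[u]), set(G[v])
    return len(nu & nv) / len(nu | nv)

# Score unconnected pairs above the similarity benchmark, then rank by
# the Jaccard overlap of their neighborhoods.
candidates = [(u, v, jaccard(u, v))
              for u, v in combinations(G.nodes, 2)
              if not G.has_edge(u, v) and cosine(u, v) > 0.3]
candidates.sort(key=lambda t: t[2], reverse=True)
print(candidates[0][:2] if candidates else "no candidates")  # e.g., ('BIS', 'interference')
```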

An illustration of this approach is provided in the case highlighted in Figure A1. For instance, the behavioral inhibition system (BIS) exhibits ties to both the behavioral activation system (BAS) and the subsequent behavioral response of the BAS when encountering reward stimuli, termed the BAS reward response. Simultaneously, another concept, interference, is bound to both the BAS and the BAS reward response. This configuration hints at a plausible link between the BIS and interference. Such highly probable causal pairs are not mere intellectual curiosities. They act as springboards, catalyzing the genesis of new experimental designs or research hypotheses ripe for empirical probing. In essence, this capability equips researchers with a cutting-edge instrument, empowering them to navigate the unexplored waters of the psychological and neurological domains.

Using pairs of highly probable causal concepts, we prompted GPT-4 to generate novel causal hypotheses that bridge the concepts. To further elucidate this method, Table 4 provides some examples of hypotheses generated by the process. Such hypotheses, as exemplified in the last row, underscore the potential and power of our method for generating innovative causal propositions.
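For illustration, a predicted pair can be turned into a hypothesis with a single prompt, as in the sketch below. The prompt wording is hypothetical; the prompts and outputs actually used in the study are exemplified in Table 4.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_hypothesis(cause: str, effect: str) -> str:
    """Phrase a testable hypothesis bridging a predicted concept pair."""
    prompt = (f"Propose a novel, testable psychological hypothesis in which "
              f"'{cause}' causally influences '{effect}'. Briefly state a "
              f"plausible mechanism and one way to test it.")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_hypothesis("behavioral inhibition system", "interference"))
```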

Hypotheses evaluation and results

In this section, we present an analysis of the quality of the generated hypotheses in terms of novelty and usefulness. According to the existing literature, these dimensions are instrumental in encapsulating the essence of inventive ideas (Boden, 2009; McCarthy et al., 2018; Miron-Spektor and Beenen, 2015). These parameters have not only been quintessential for gauging creative concepts but have also been adopted to evaluate the caliber of research hypotheses (Dowling and Lucey, 2023; Krenn and Zeilinger, 2020; Oleinik, 2019). Specifically, we evaluate the quality of hypotheses generated by the proposed LLMCG algorithm against those generated by PhD students from an elite university, who represent human junior experts; by an LLM alone, which represents advanced AI systems; and by the LLMCG with research ideas vetted by psychological researchers, which represents cooperation between AI and humans.

The evaluation comprises three main stages. In the first stage, the hypotheses are generated by all contributors, with steps taken to ensure fairness and relevance for comparative analysis. In the second stage, the hypotheses from the first stage are independently and blindly reviewed by experts who represent the human academic community. These experts provide hypothesis ratings using a specially designed questionnaire to ensure statistical validity. The third stage delves deeper by transforming each research idea into the semantic space of bidirectional encoder representations from transformers (BERT) (Lee et al., 2023), allowing us to intricately analyze the intrinsic reasons behind the rating disparities among the groups. This semantic mapping not only pinpoints the nuanced differences but also provides potential insights into the cognitive constructs of each hypothesis.

Evaluation procedure

Selection of the focus area for hypothesis generation

Selecting an appropriate focus area for hypothesis generation is crucial to ensure a balanced and insightful comparison of the hypothesis generation capacities of the various contributors. In this study, our goal is to gauge the quality of hypotheses derived from four distinct contributors, with measures in place to mitigate potential confounding variables that might skew the results among groups (Rubin, 2005). Our choice of domain is informed by two pivotal criteria: the intricacy and subtlety of the subject matter, and familiarity with the domain. It is essential that our chosen domain boasts sufficient complexity to prompt meaningful hypothesis generation and offer a robust assessment of both AI and human contributors' depth of understanding and creativity. Furthermore, while human contributors should be well-acquainted with the domain, their expertise need not match the vast corpus knowledge of the AI.

In terms of overarching human pursuits such as the search for happiness, positive psychology distinguishes itself by avoiding narrowly defined, individual-centric challenges (Seligman and Csikszentmihalyi, 2000). This alignment with our selection criteria is epitomized by well-being, a salient concept within positive psychology, as shown in Table 3. Well-being, with its multidimensional essence encompassing emotional, psychological, and social facets, and its central stature in both research and practical applications of positive psychology (Diener et al., 2010; Fredrickson, 2001; Seligman and Csikszentmihalyi, 2000), becomes the linchpin of our evaluation. The growing importance of well-being in the current global context offers myriad novel avenues for hypothesis generation and theoretical advancement (Forgeard et al., 2011; Madill et al., 2022; Otu et al., 2020). Adding to our rationale, the Positive Psychology Research Center at Tsinghua University is a globally renowned hub for cutting-edge research in this domain. Leveraging this stature, we secured participation from specialized Ph.D. students, reinforcing positive psychology as the most fitting domain for our inquiry.

Hypotheses comparison

In our study, the generated psychological hypotheses were categorized into four distinct groups: two experimental groups and two control groups. The experimental groups encapsulate hypotheses generated by our algorithm, either through random selection or through handpicking by experts from a pool of generated hypotheses. The control groups comprise research ideas meticulously crafted by doctoral students with substantial academic expertise in the domain, and hypotheses generated by a representative LLM. In the following, we elucidate the methodology and underlying rationale for each group:

LLMCG algorithm output (Random-selected LLMCG)

Following the requirement of generating hypotheses centered on well-being, the LLMCG algorithm crafted 130 unique hypotheses. These hypotheses were derived from LLMCG's evaluation of the most likely causal relationships related to well-being that had not been previously documented in the research literature. From this refined pool, 30 research ideas were chosen at random for this experimental group. These hypotheses represent the algorithm's ability to identify causal relationships and formulate pertinent hypotheses.

LLMCG expert-vetted hypotheses (Expert-selected LLMCG)

For this group, two seasoned psychological researchers, one male aged 47 and one female aged 46, with in-depth expertise in the realm of positive psychology, conscientiously handpicked 30 of the most promising hypotheses from the refined pool, excluding those in the Random-selected LLMCG category. The selection criteria centered on a holistic understanding of both the novelty and the practical relevance of each hypothesis. With illustrious postdoctoral journeys and robust portfolios of publications in positive psychology to their names, they rigorously sifted through the hypotheses, pinpointing those that showcased a perfect confluence of originality and actionable insight. These hypotheses were meticulously appraised for their relevance, structural coherence, and potential academic value, representing the nexus of machine intelligence and seasoned human discernment.

PhD students’ output (Control-Human)

We enlisted the expertise of 16 doctoral students from the Positive Psychology Research Center at Tsinghua University. Under the guidance of their supervisor, each student was provided with a questionnaire geared toward research on well-being. The participants were given a period of four working days to complete and return the questionnaire, which was distributed during vacation to ensure minimal external disruptions and commitments. The specific instructions provided in the questionnaire are detailed in Table B1, and each participant was asked to complete 3–4 research hypotheses. By the stipulated deadline, we received responses from 13 doctoral students, with a mean age of 31.92 years (SD = 7.75 years), cumulatively presenting 41 hypotheses related to well-being. To maintain uniformity with the other groups, a random selection was made to shortlist 30 hypotheses for further analysis. These hypotheses reflect the integration of core theoretical concepts with the latest insights into the domain, presenting an academic interpretation rooted in the students' rigorous training and education. Including this group in our study not only provides a natural benchmark for human ingenuity and expertise but also underscores the invaluable contribution of human cognition in research ideation, serving as a pivotal contrast to AI-generated hypotheses. This juxtaposition illuminates the nuanced differences between human intellectual depth and AI's analytical progress, enriching the comparative dimensions of our study.

Claude model output (Control-Claude)

This group exemplifies the pinnacle of current LLM technology in generating research hypotheses. Since LLMCG is a nascent technology, its assessment requires a comparative study with well-established counterparts, a key paradigm in comparative research. Currently, Claude-2 and GPT-4 represent the apex of AI technology. For example, Claude-2, with an accuracy rate of 54.4%, excels in reasoning and question answering, substantially outperforming models such as Falcon, Koala, and Vicuna, whose accuracy rates range from 17.1% to 25.5% (Wu et al., 2023). To facilitate a more comprehensive evaluation of the new model and to increase the diversity and breadth of comparison, we chose Claude-2 as the control model. Using the detailed instructions provided in Table B2, Claude-2 was iteratively prompted to generate research hypotheses, producing ten hypotheses per prompt and culminating in a total of 50 hypotheses. Although the sheer number and range of these hypotheses accentuate the capabilities of Claude-2, a subsequent refinement was considered essential to ensure compatibility in complexity and depth across all groups. With minimal human intervention, GPT-4 was used to evaluate these 50 hypotheses and select the top 30 that exhibited the most innovative, relevant, and academically valuable insights. This process ensured the infusion of both the LLM's analytical prowess and a layer of qualitative rigor, giving rise to a set of hypotheses that not only align with the overarching theme of well-being but also resonate with current academic discourse.

Hypotheses assessment

The assessment of the hypotheses encompasses two key components: the evaluation conducted by eminent psychology professors emphasizing novelty and utility, and the deep semantic analysis involving BERT and t -distributed stochastic neighbor embedding ( t -SNE) visualization to discern semantic structures and disparities among hypotheses.

Human academic community

The review task was entrusted to three eminent psychology professors (all male, mean age = 42.33), each with a decade-long legacy of guiding doctoral and master's students in positive psychology and editorial stints at renowned journals; their task was to conduct a meticulous evaluation of the 120 hypotheses. Importantly, to ensure unbiased evaluation, the hypotheses were presented to them in a completely randomized order in the questionnaire.

Our emphasis was undeniably anchored to two primary tenets: novelty and utility (Cohen, 2017 ; Shardlow et al., 2018 ; Thompson and Skau, 2023 ; Yu et al., 2016 ), as shown in Table B3 . Utility in hypothesis crafting demands that our propositions extend beyond mere factual accuracy; they must resonate deeply with academic investigations, ensuring substantial practical implications. Given the inherent challenges of research, marked by constraints in time, manpower, and funding, it is essential to design hypotheses that optimize the utilization of these resources. On the novelty front, we strive to introduce innovative perspectives that have the power to challenge and expand upon existing academic theories. This not only propels the discipline forward but also ensures that we do not inadvertently tread on ground already covered by our contemporaries.

Deep semantic analysis

While human evaluations provide invaluable insight into the novelty and utility of hypotheses, to objectively discern and visualize semantic structures and the disparities among them, we turn to the realm of deep learning. Specifically, we employ BERT (Devlin et al., 2018). BERT, as highlighted by Lee et al. (2023), has remarkable potential to assess the innovation of ideas. By translating each hypothesis into a high-dimensional vector in the BERT domain, we obtain the profound semantic core of each statement. However, such granularity in dimensions presents challenges when aiming for visualization.

To alleviate this, and to intuitively understand the clustering and dispersion of these hypotheses in semantic space, we deploy the t-SNE (t-distributed stochastic neighbor embedding) technique (Van der Maaten and Hinton, 2008), which is adept at reducing the dimensionality of the data while preserving the relative pairwise distances between items. Thus, when we map our BERT-encoded hypotheses onto a 2D t-SNE plane, we gain an immediate visual grasp of how closely or distantly related the hypotheses are in terms of their semantic content. Our intent is twofold: to understand the semantic terrains carved out by the different groups and to infer potential reasons why some hypotheses garnered heightened novelty or utility ratings from experts. The convergence of human evaluations and semantic layouts, as delineated by Algorithm 1 in Appendix B, reveals the interplay between human intuition and the inherent semantic structure of the hypotheses.
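The embedding-and-projection step might look like the sketch below. It is a minimal sketch assuming the transformers and scikit-learn packages; mean pooling over the final hidden states and the listed example hypotheses are illustrative choices, not necessarily the study's exact configuration.

```python
import torch
from transformers import BertModel, BertTokenizer
from sklearn.manifold import TSNE

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def embed(hypothesis: str) -> torch.Tensor:
    """Mean-pool BERT's final hidden states into one sentence vector."""
    inputs = tokenizer(hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

hypotheses = [
    "Virtual resilience training increases subjective well-being.",
    "Community gardening enhances collective life satisfaction.",
    "Microglial activation mediates stress-related cognitive decline.",
]
vectors = torch.stack([embed(h) for h in hypotheses]).numpy()

# perplexity must be smaller than the sample count; with the study's
# 120 hypotheses a conventional value such as 30 would apply.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
print(coords.shape)  # (3, 2)
```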

Qualitative analysis by topic analysis

To better understand the underlying thought processes and the topical emphasis of both PhD students and the LLMCG model, qualitative analyses were performed using visual tools such as word clouds and connection graphs, as detailed in Appendix B . The word cloud, as a graphical representation, effectively captures the frequency and importance of terms, providing direct visualization of the dominant themes. Connection graphs, on the other hand, elucidate the relationships and interplay between various themes and concepts. Using these visual tools, we aimed to achieve a more intuitive and clear representation of the data, allowing for easy comparison and interpretation.

Observations drawn from both the word clouds and the connection graphs in Figures B1 and B2 provide us with a rich tapestry of insights into the thought processes and priorities of Ph.D. students and the LLMCG model. For instance, the emphasis in the Control-Human word cloud on terms such as “robot” and “AI” indicates a strong interest among Ph.D. students in the nexus between technology and psychology. It is particularly fascinating to see a group of academically trained individuals focusing on the real-world implications and intersections of their studies, as shown by their apparent draw toward trending topics. This not only underscores their adaptability but also emphasizes the importance of contextual relevance. Conversely, the LLMCG groups, particularly the Expert-selected LLMCG group, emphasize community, collective experiences, and the nuances of social interconnectedness. This denotes a deep-rooted understanding and application of higher-order social psychological concepts, reflecting the model's ability to dive deep into the intricate layers of human social behavior.

Furthermore, the connection graphs support these observations. The Control-Human graph, with its exploration of themes such as “Robot Companionship” and its relation to factors such as “heart rate variability (HRV)”, demonstrates a confluence of technology and human well-being. The other groups, especially the Random-selected LLMCG group, yield themes that are more societal and structural, hinting at broader determinants of individual well-being.

Analysis of human evaluations

To quantify the agreement among the raters, we employed Spearman correlation coefficients. The results, as shown in Table B5, reveal a spectrum of agreement levels between the reviewer pairs, showcasing the subjective dimension intrinsic to the evaluation of novelty and usefulness. In particular, the correlations between reviewer 1 and reviewer 2 in novelty (Spearman r = 0.387, p < 0.0001) and between reviewer 2 and reviewer 3 in usefulness (Spearman r = 0.376, p < 0.0001) suggest a meaningful level of consensus, highlighting the reviewers' capacity to identify valuable insights when evaluating hypotheses.

The variations in correlation values, such as that between reviewer 2 and reviewer 3 (r = 0.069, p = 0.453), can be attributed to the diverse research orientations and backgrounds of each reviewer. Reviewer 1 focuses on social ecology, reviewer 3 specializes in neuroscientific methodologies, and reviewer 2 integrates various views using technologies such as virtual reality and computational methods. In our evaluation, we present specific hypothesis cases to illustrate the differing perspectives between reviewers, as detailed in Table B4 and Figure B3. For example, C5 introduces the novel concept of “Virtual Resilience”. Reviewers 1 and 3 highlighted its originality and utility, while reviewer 2 rated it lower in both categories. Meanwhile, C6, which focuses on social neuroscience, resonated with reviewer 3, while reviewers 1 and 2 only partially affirmed it. These differences underscore the complexity of evaluating scientific contributions and highlight the importance of considering a range of expert opinions for a comprehensive evaluation.

This assessment is divided into two main sections: novelty analysis and usefulness analysis.

Novelty analysis

In the dynamic realm of scientific research, measuring and analyzing novelty is gaining paramount importance (Shin et al., 2022). ANOVA was used to analyze the novelty scores represented in Fig. 2a, and we identified a significant influence of the group factor on the mean novelty score across reviewers. Initially, z-scores were calculated for each reviewer's ratings to standardize the scoring scale, and these were then averaged. The distinct differences between the groups, as visualized in the boxplots, are statistically underpinned by the results in Table 5. The ANOVA results revealed a pronounced effect of the grouping factor (F(3,116) = 6.92, p = 0.0002), with the variance explained by the grouping factor (R-squared) being 15.19%.

Figure 2: Box plots (a) and (b) depict distributions of novelty and usefulness scores, respectively, while the smoothed line plots on the right show novelty and usefulness scores in descending order, subjected to a moving average with a window size of 2. * denotes p < 0.05, ** denotes p < 0.01.
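The scoring-and-testing pipeline described above can be sketched with scipy as below. The ratings array is placeholder data, and the Bonferroni factor of 6 reflects the six pairwise comparisons among four groups; this is an illustrative reconstruction, not the study's analysis script.

```python
from itertools import combinations
import numpy as np
from scipy import stats

# Placeholder ratings: 3 reviewers x 120 hypotheses (30 per group).
rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(3, 120)).astype(float)
groups = np.repeat(["Control-Human", "Control-Claude",
                    "Random-LLMCG", "Expert-LLMCG"], 30)

z = stats.zscore(ratings, axis=1)   # standardize within each reviewer
mean_z = z.mean(axis=0)             # average across reviewers per hypothesis

samples = [mean_z[groups == g] for g in np.unique(groups)]
F, p = stats.f_oneway(*samples)     # one-way ANOVA over the four groups
print(f"F(3,116) = {F:.2f}, p = {p:.4f}")

# Pairwise t-tests with Bonferroni correction over the 6 comparisons.
for a, b in combinations(np.unique(groups), 2):
    t, p_raw = stats.ttest_ind(mean_z[groups == a], mean_z[groups == b])
    print(f"{a} vs {b}: t = {t:.2f}, p_adj = {min(p_raw * 6, 1.0):.3f}")
```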

Further pairwise comparisons using the Bonferroni method, as delineated in Table 5 and visually corroborated by Fig. 2a, discerned significant disparities between Random-selected LLMCG and Control-Claude (t(59) = 3.34, p = 0.007) and between Control-Human and Control-Claude (t(59) = 4.32, p < 0.001). The Cohen's d values of 0.8809 and 1.1192, respectively, indicate that the novelty scores for the Random-selected LLMCG and Control-Human groups are significantly higher than those for the Control-Claude group. Additionally, the cumulative distribution plots to the right of Fig. 2a show the distributional characteristics of the novelty scores. For example, the Expert-selected LLMCG curve is more concentrated in the middle score range than the Control-Claude curve but dominates at high novelty scores (highlighted in the dashed rectangle). Moreover, comparisons of Control-Human with both Random-selected LLMCG and Expert-selected LLMCG did not manifest statistically significant variances, indicating aligned novelty perceptions among these groups. Finally, the comparison between Expert-selected LLMCG and Control-Claude (t(59) = 2.49, p = 0.085) suggests a trend toward significance, with a Cohen's d value of 0.6226 indicating generally higher novelty scores for Expert-selected LLMCG than for Control-Claude.

To mitigate potential biases due to individual reviewer inclinations, we expanded our evaluation to include both the median and the maximum z-scores from the three reviewers for each hypothesis. These multifaceted analyses enhance the robustness of our results by minimizing the influence of extreme values and potential outliers. First, when analyzing the median novelty scores, the ANOVA test demonstrated a notable association with the grouping factor (F(3,116) = 6.54, p = 0.0004), which explained 14.41% of the variance. As illustrated in Table 5, pairwise evaluations revealed significant disparities between Control-Human and Control-Claude (t(59) = 4.01, p = 0.001), with Control-Human scoring significantly higher than Control-Claude (Cohen's d = 1.1031). Similarly, there were significant differences between Random-selected LLMCG and Control-Claude (t(59) = 3.40, p = 0.006), where Random-selected LLMCG also significantly outperformed Control-Claude (Cohen's d = 0.8875). Interestingly, the comparison of Expert-selected LLMCG with Control-Claude (t(59) = 1.70, p = 0.550) and the other group pairings did not reveal statistically significant differences.

Subsequently, turning our attention to maximum novelty scores provided crucial insights, especially where outlier scores may carry significant weight. The influence of the grouping factor was evident ( F (3,116) = 7.20, p  = 0.0002), indicating an explained variance of 15.70%. In particular, clear differences emerged between Control-Human and Control-Claude ( t (59) = 4.36, p  < 0.001), and between Random-selected LLMCG and Control-Claude ( t (59) = 3.47, p  = 0.004). A particularly intriguing observation was the significant difference between Expert-selected LLMCG and Control-Claude ( t (59) = 3.12, p  = 0.014). The Cohen’s d values of 1.1637, 1.0457, and 0.6987 respectively indicate that the novelty scores for the Control-Human , Random-selected LLMCG , and Expert-selected LLMCG groups are significantly higher than those for the Control-Claude group. Together, these analyses offer a multifaceted perspective on novelty evaluations. Specifically, the results of the median analysis echo and support those of the mean, reinforcing the reliability of our assessments. The discerned significance between Control-Claude and Expert-selected LLMCG in the median data emphasizes the intricate differences, while also pointing to broader congruence in novelty perceptions.

Usefulness analysis

Evaluating the practical impact of hypotheses is crucial in scientific research assessments. For mean usefulness scores, the grouping factor did not exert a significant influence (F(3,116) = 5.25, p = 0.553). Figure 2b presents the utility score distributions between groups. The narrow interquartile range of Control-Human suggests a relatively consistent assessment among reviewers. On the other hand, the spread and outliers in the Control-Claude distribution hint at varied utility perceptions. Both LLMCG groups cover a broad score range, demonstrating a mixture of high and low utility scores, while the Expert-selected LLMCG gravitates more toward higher usefulness scores. The smoothed line plots accompanying Fig. 2b further detail the score densities. For instance, Random-selected LLMCG boasts several high utility scores, counterbalanced by a smattering of low scores. Interestingly, the distributions for Control-Human and Expert-selected LLMCG appear to be closely aligned. While mean utility scores provide an overarching view, the nuances within the boxplots and smoothed plots offer deeper insights. This comprehensive understanding can guide future endeavors in content generation and evaluation, spotlighting key areas of focus and potential improvements.

Comparison between the LLMCG and GPT-4

To evaluate the impact of integrating a causal graph with GPT-4, we performed an ablation study comparing the hypotheses generated by GPT-4 alone and those of the proposed LLMCG framework. For this experiment, 60 hypotheses were created using GPT-4, following the detailed instructions in Table B2 . Furthermore, 60 hypotheses for the LLMCG group were randomly selected from the remaining pool of 70 hypotheses. Subsequently, both sets of hypotheses were assessed by three independent reviewers for novelty and usefulness, as previously described.

Table 6 shows a comparison between the GPT-4 and LLMCG groups, highlighting a significant difference in novelty scores (mean value: t (119) = 6.60, p  < 0.0001) but not in usefulness scores (mean value: t (119) = 1.31, p  = 0.1937). This indicates that the LLMCG framework significantly enhances hypothesis novelty (all Cohen’s d  > 1.1) without affecting usefulness compared to the GPT-4 group. Figure B6 visually contrasts these findings, underlining the causal graph’s unique role in fostering novel hypothesis generation when integrated with GPT-4.

The t-SNE visualizations (Fig. 3) illustrate the semantic relationships between the different groups, capturing the patterns of novelty and usefulness. Notably, a distinct clustering among PhD students suggests shared academic influences, while the LLMCG groups display broader topic dispersion, hinting at a wider semantic understanding. The size of the bubbles reflects the novelty and usefulness scores, emphasizing the diverse perceptions of what is considered innovative versus beneficial. Additionally, the numbers near the yellow dots represent participant IDs, demonstrating that hypotheses from the same participant, such as H05 or H06, are semantically closely aligned. In Fig. B4, a distinct clustering of examples is observed, particularly the close proximity of hypotheses C3, C4, and C8 within the semantic space. This observation is further elucidated in Appendix B, enhancing comprehension of BERT's semantic representation. Instead of depending solely on superficial textual descriptions, this analysis penetrates to the underlying understanding of concepts within the semantic space, a topic also explored in recent research (Johnson et al., 2023).

Figure 3: Comparison of (a) novelty and (b) usefulness scores (bubble size scaled by 100) among the different groups.

In the distribution of semantic distances (Fig. 4), we observed that the Control-Human group exhibits a distinctively greater semantic distance than the other groups, emphasizing its unique semantic orientations. The statistical support for this observation is derived from the ANOVA results, with a significant F-statistic (F(3,1652) = 84.1611, p < 0.00001) underscoring the impact of the grouping factor. This factor explains a remarkable 86.96% of the variance, as indicated by the R-squared value. Multiple comparisons, as shown in Table 7, further elucidate the subtleties of these group differences. Control-Human and Control-Claude exhibit a significant contrast in their semantic distances, as highlighted by the t value of 16.41 and the adjusted p value (p < 0.0001). This difference indicates distinct thought patterns or emphases in the two groups, with Control-Human demonstrating a greater semantic distance (Cohen's d = 1.1630). Similarly, comparisons of Control-Claude with the LLMCG groups reveal pronounced differences (Cohen's d > 0.9), more so with Expert-selected LLMCG (p < 0.0001). A comparison of Control-Human with the LLMCG groups shows divergent semantic orientations, with significantly larger distances than Random-selected LLMCG (p = 0.0036) and a trend toward difference with Expert-selected LLMCG (p = 0.0687). Intriguingly, the two LLMCG groups, Random-selected and Expert-selected, exhibit similar semantic distances, as evidenced by a nonsignificant p value of 0.4362. Furthermore, the significant distinctions we observed, particularly between Control-Human and the other groups, align with human evaluations of novelty. This coherence indicates that the BERT space representation, coupled with statistical analyses, could effectively mimic human judgment. Such results underscore the potential of this approach for automated hypothesis testing, paving the way for more efficient and streamlined semantic evaluations in the future.

Figure 4: Distribution of semantic distances by group. Note: ** denotes p < 0.01, **** denotes p < 0.0001.

In general, visual and statistical analyses reveal the nuanced semantic landscapes of each group. While the Ph.D. students’ shared background influences their clustering, the machine models exhibit a comprehensive grasp of topics, emphasizing the intricate interplay of individual experiences, academic influences, and algorithmic understanding in shaping semantic representations.

This investigation carried out a detailed evaluation of the various hypothesis contributors, blending quantitative and qualitative analyses. In terms of topic analysis, distinct variations were observed between Control-Human and LLMCG, the latter presenting more expansive thematic coverage. In the human evaluation, hypotheses from Ph.D. students paralleled the LLMCG in novelty, reinforcing AI's growing competence in mirroring human innovative thinking; furthermore, when juxtaposed with AI models such as Control-Claude, the LLMCG exhibited increased novelty. Deep semantic analysis via t-SNE and BERT representations allowed us to intuitively grasp the semantic essence of the hypotheses, signaling the possibility of future automated hypothesis assessments. Interestingly, LLMCG appeared to encompass broader complementary domains compared with human input. Taken together, these findings highlight the emerging role of AI in hypothesis generation and provide key insights into hypothesis evaluation across diverse origins.

General discussion

This research delves into the synergistic relationship between LLMs and causal graphs in the hypothesis generation process. Our findings underscore the ability of LLMs, when integrated with causal graph techniques, to produce meaningful hypotheses with increased efficiency and quality. By centering our investigation on “well-being”, we emphasize its pivotal role in psychological studies and highlight the potential convergence of technology and society. A multifaceted assessment approach, combining topic analysis, human evaluation, and deep semantic analysis, demonstrates that AI-augmented methods not only outshine LLM-only techniques in generating hypotheses of superior novelty and of quality on par with human expertise, but also boast the capability for more profound conceptual incorporation and a broader semantic spectrum. Such a multifaceted lens of assessment introduces a novel perspective for the scholarly realm, equipping researchers with an enriched understanding and an innovative toolset for hypothesis generation. At its core, the melding of LLMs and causal graphs signals a promising frontier, especially for dissecting cornerstone psychological constructs such as “well-being”. This marriage of methodologies, enriched by the comprehensive assessment angle, deepens our comprehension of both the immediate and broader ramifications of our research endeavors.

The prominence of causal graphs in psychology is profound: they offer researchers a unified platform for synthesizing and hypothesizing across diverse psychological realms (Borsboom et al., 2021; Uleman et al., 2021). Our study echoes this, producing groundbreaking hypotheses comparable in depth to early expert propositions. Deep semantic analysis bolstered these findings, emphasizing that our hypotheses have distinct cross-disciplinary merits, particularly when compared with those of individual doctoral scholars. However, the traditional use of causal graphs in psychology presents challenges due to its demanding nature, often requiring insights from multiple experts (Crielaard et al., 2022). Our research harnesses LLM-based causal extraction, automating causal pair derivation and, in turn, minimizing the need for extensive expert input. The union of the causal graphs' systematic approach with AI-driven creativity, as seen with LLMs, paves the way for the future of psychological inquiry. Thanks to advancements in AI, barriers once created by the intricate procedures of causal graphs are being dismantled. Furthermore, as the era of big data dawns, the integration of AI and causal graphs in psychology not only augments research capabilities but also brings into focus the broader implications for society. This fusion provides a nuanced understanding of intricate sociopsychological dynamics, emphasizing the importance of adapting research methodologies in tandem with technological progress.

In the realm of research, LLMs serve a unique purpose, often acting as the foundation or baseline against which newer methods and approaches are assessed. The productivity enhancements demonstrated by generative AI tools, as evidenced by Noy and Zhang (2023), indicate the potential of such LLMs. In our investigation, we pitted the hypotheses generated by such substantial models against our integrated LLMCG approach. Intriguingly, while these LLMs showcased admirable practicality in their hypotheses, they lagged substantially behind the doctoral students and the LLMCG groups in terms of innovation. This divergence in results can be attributed to the causal network curated from 43k research papers, funneling the vast knowledge reservoir of the LLM squarely into the realm of scientific psychology. The increased precision in hypothesis generation by these models fits well within the framework of generative networks. Tong et al. (2021) highlighted that, by integrating structured constraints, conventional neural networks can accurately generate semantically relevant content. One of the salient merits of the causal graph, in this context, is its ability to alleviate the inherent ambiguity and interpretability challenges posed by LLMs. By providing a systematic and structured framework, the causal graph aids in unearthing the underlying logic and rationale of the outputs generated by LLMs. Notably, this finding echoes the perspective of Pan et al. (2023), in which the integration of structured knowledge from knowledge graphs was shown to provide an invaluable layer of clarity and interpretability to LLMs, especially in complex reasoning tasks. Such structured approaches not only boost the confidence of researchers in the hypotheses derived but also augment the transparency and understandability of LLM outputs. In essence, leveraging causal graphs may very well herald a new era in model interpretability, serving as a conduit to unlock the black box that large models often represent in contemporary research.

In the ever-evolving tapestry of research, every advancement comes with its own set of constraints, and our study is no exception. On the technical front, a pivotal challenge stems from the opaque inner workings of GPT. Determining the exact mechanisms within GPT that lead to the formation of specific causal pairs remains elusive, reintroducing the age-old issue of AI's inherent lack of transparency (Buruk, 2023; Cao and Yousefzadeh, 2023). This opacity is magnified in our sparse causal graph, which, while expansive, occasionally contains concepts that are lexically distinct yet overlap in meaning. In practical applications, careful and meticulous algorithmic evaluation would therefore be imperative to construct an accurate psychological conceptual landscape. Psychology, bridging the humanities and the natural sciences, continuously aims to unravel human cognition and behavior (Hergenhahn and Henley, 2013). Despite the dominance of traditional methodologies (Henrich et al., 2010; Shah et al., 2015), the present data-centric era amplifies the synergy of technology and the humanities, resonating with Hasok Chang's vision of an enriched science (Chang, 2007). This symbiosis is evident when assessing structural holes in social networks (Burt, 2004) and when viewing novelty as a bridge across these divides (Foster et al., 2021). Such perspectives emphasize the importance of thorough algorithmic assessment and highlight potential avenues in humanities research, especially when incorporating large language models for innovative hypothesis crafting and verification.

However, this research has limitations. First, the construction of causal relationship graphs carries potential inaccuracies: ~13% of the extracted relationship pairs did not align with human expert estimations. Improving relationship extraction could raise the accuracy of the causal graph, potentially leading to more robust hypotheses. Second, our validation process covered only 130 hypotheses, whereas the vastness of our conceptual landscape suggests countless possibilities; the twenty pivotal psychological concepts highlighted in Table 3 alone could spawn an extensive array of hypotheses, and validating all of their surrounding hypotheses would demand considerable further work. A striking observation during validation was the inconsistency in the evaluations of the senior expert panels (as shown in Table B5). This underscores a pivotal insight: our integration of AI has shifted the dependency on scarce expert resources from hypothesis generation to hypothesis evaluation. In the future, rigorous evaluations ensuring both novelty and utility could become a focal point of exploration. The promising path forward necessitates a thoughtful integration of technological innovation and human expertise to fully realize the potential suggested by our study.

In conclusion, our research provides pioneering insight into the symbiotic fusion of LLMs, epitomized by GPT, and causal graphs for psychological hypothesis generation, with particular emphasis on “well-being”. Importantly, as highlighted by Cao and Yousefzadeh (2023), ensuring a synergistic alignment between domain knowledge and AI extrapolation is crucial; this alignment keeps AI models within their conceptual limits and thereby bolsters the validity and reliability of the generated hypotheses. Our approach interweaves the advanced capabilities of LLMs with the methodological strengths of causal graphs, refining both the depth and the precision of hypothesis generation. The causal graph, of paramount importance in psychology owing to its cross-disciplinary potential, traditionally demands extensive expert involvement; our approach addresses this by exploiting the LLM's exceptional causal extraction abilities, shifting intensive expert engagement from hypothesis creation to evaluation. By combining LLMs with causal graphs, our methodology propels psychological research forward, improving hypothesis generation and offering tools that blend theory-driven and data-centric approaches. This synergy particularly enriches our understanding of social psychology's complex dynamics, such as happiness research, demonstrating the profound impact of integrating AI with traditional research frameworks.

Data availability

The data generated and analyzed in this study are partially available within the Supplementary materials. For additional data supporting the findings of this research, interested parties may contact the corresponding author, who will provide the information upon receiving a reasonable request.

Battleday RM, Peterson JC, Griffiths TL (2020) Capturing human categorization of natural images by combining deep networks and cognitive models. Nat Commun 11(1):5418


Bechmann A, Bowker GC (2019) Unsupervised by any other name: hidden layers of knowledge production in artificial intelligence on social media. Big Data Soc 6(1):2053951718819569


Binz M, Schulz E (2023) Using cognitive psychology to understand GPT-3. Proc Natl Acad Sci 120(6):e2218523120


Boden MA (2009) Computer models of creativity. AI Mag 30(3):23–23


Borsboom D, Deserno MK, Rhemtulla M, Epskamp S, Fried EI, McNally RJ (2021) Network analysis of multivariate data in psychological science. Nat Rev Methods Prim 1(1):58


Burt RS (2004) Structural holes and good ideas. Am J Sociol 110(2):349–399

Buruk O (2023) Academic writing with GPT-3.5: reflections on practices, efficacy and transparency. arXiv preprint arXiv:2304.11079

Cao X, Yousefzadeh R (2023) Extrapolation and AI transparency: why machine learning models should reveal when they make decisions beyond their training. Big Data Soc 10(1):20539517231169731

Chang H (2007) Scientific progress: beyond foundationalism and coherentism1. R Inst Philos Suppl 61:1–20

Cheng K, Guo Q, He Y, Lu Y, Gu S, Wu H (2023) Exploring the potential of GPT-4 in biomedical engineering: the dawn of a new era. Ann Biomed Eng 51:1645–1653


Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A (2016) Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Sci Rep 6(1):27755


Cohen BA (2017) How should novelty be valued in science? Elife 6:e28699


Crielaard L, Uleman JF, Châtel BD, Epskamp S, Sloot P, Quax R (2022) Refining the causal loop diagram: a tutorial for maximizing the contribution of domain expertise in computational system dynamics modeling. Psychol Methods 29(1):169–201


Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), p 4171–4186

Diener E, Wirtz D, Tov W, Kim-Prieto C, Choi D-W, Oishi S, Biswas-Diener R (2010) New well-being measures: short scales to assess flourishing and positive and negative feelings. Soc Indic Res 97:143–156

Dowling M, Lucey B (2023) ChatGPT for (finance) research: the Bananarama conjecture. Financ Res Lett 53:103662

Forgeard MJ, Jayawickreme E, Kern ML, Seligman ME (2011) Doing the right thing: measuring wellbeing for public policy. Int J Wellbeing 1(1):79–106

Foster JG, Shi F, Evans J (2021) Surprise! Measuring novelty as expectation violation. SocArXiv

Fredrickson BL (2001) The role of positive emotions in positive psychology: The broaden-and-build theory of positive emotions. Am Psychol 56(3):218

Gu Q, Kuwajerwala A, Morin S, Jatavallabhula KM, Sen B, Agarwal A et al. (2024) ConceptGraphs: open-vocabulary 3D scene graphs for perception and planning. In: 2nd Workshop on Language and Robot Learning: Language as Grounding

Henrich J, Heine SJ, Norenzayan A (2010) Most people are not WEIRD. Nature 466(7302):29–29


Hergenhahn BR, Henley T (2013) An introduction to the history of psychology. Cengage Learning

Jaccard J, Jacoby J (2019) Theory construction and model-building skills: a practical guide for social scientists. Guilford Publications

Johnson DR, Kaufman JC, Baker BS, Patterson JD, Barbot B, Green AE (2023) Divergent semantic integration (DSI): Extracting creativity from narratives with distributional semantic modeling. Behav Res Methods 55(7):3726–3759

Kıcıman E, Ness R, Sharma A, Tan C (2023) Causal reasoning and large language models: opening a new frontier for causality. arXiv preprint arXiv:2305.00050

Koehler DJ (1994) Hypothesis generation and confidence in judgment. J Exp Psychol Learn Mem Cogn 20(2):461–469

Krenn M, Zeilinger A (2020) Predicting research trends with semantic and neural networks with an application in quantum physics. Proc Natl Acad Sci 117(4):1910–1916

Lee H, Zhou W, Bai H, Meng W, Zeng T, Peng K, Kumada T (2023) Natural language processing algorithms for divergent thinking assessment. In: Proc IEEE 6th Eurasian Conference on Educational Innovation (ECEI), p 198–202

Madill A, Shloim N, Brown B, Hugh-Jones S, Plastow J, Setiyawati D (2022) Mainstreaming global mental health: Is there potential to embed psychosocial well-being impact in all global challenges research? Appl Psychol Health Well-Being 14(4):1291–1313

McCarthy M, Chen CC, McNamee RC (2018) Novelty and usefulness trade-off: cultural cognitive differences and creative idea evaluation. J Cross-Cult Psychol 49(2):171–198

McGuire WJ (1973) The yin and yang of progress in social psychology: seven koan. J Personal Soc Psychol 26(3):446–456

Miron-Spektor E, Beenen G (2015) Motivating creativity: The effects of sequential and simultaneous learning and performance achievement goals on product novelty and usefulness. Organ Behav Hum Decis Process 127:53–65

Nisbett RE, Peng K, Choi I, Norenzayan A (2001) Culture and systems of thought: holistic versus analytic cognition. Psychol Rev 108(2):291–310


Noy S, Zhang W (2023) Experimental evidence on the productivity effects of generative artificial intelligence. Science 381:187–192

Oleinik A (2019) What are neural networks not good at? On artificial creativity. Big Data Soc 6(1):2053951719839433

Otu A, Charles CH, Yaya S (2020) Mental health and psychosocial well-being during the COVID-19 pandemic: the invisible elephant in the room. Int J Ment Health Syst 14:1–5

Pan S, Luo L, Wang Y, Chen C, Wang J & Wu X (2024) Unifying large language models and knowledge graphs: a roadmap. IEEE Transactions on Knowledge and Data Engineering 36(7):3580–3599

Rubin DB (2005) Causal inference using potential outcomes: design, modeling, decisions. J Am Stat Assoc 100(469):322–331


Sanderson K (2023) GPT-4 is here: what scientists think. Nature 615(7954):773

Seligman ME, Csikszentmihalyi M (2000) Positive psychology: an introduction. Am Psychol 55(1):5–14

Shah DV, Cappella JN, Neuman WR (2015) Big data, digital media, and computational social science: possibilities and perils. Ann Am Acad Political Soc Sci 659(1):6–13

Shardlow M, Batista-Navarro R, Thompson P, Nawaz R, McNaught J, Ananiadou S (2018) Identification of research hypotheses and new knowledge from scientific literature. BMC Med Inform Decis Mak 18(1):1–13

Shin H, Kim K, Kogler DF (2022) Scientific collaboration, research funding, and novelty in scientific knowledge. PLoS ONE 17(7):e0271678

Thomas RP, Dougherty MR, Sprenger AM, Harbison J (2008) Diagnostic hypothesis generation and human judgment. Psychol Rev 115(1):155–185

Thomer AK, Wickett KM (2020) Relational data paradigms: what do we learn by taking the materiality of databases seriously? Big Data Soc 7(1):2053951720934838

Thompson WH, Skau S (2023) On the scope of scientific hypotheses. R Soc Open Sci 10(8):230607

Tong S, Liang X, Kumada T, Iwaki S (2021) Putative ratios of facial attractiveness in a deep neural network. Vis Res 178:86–99

Uleman JF, Melis RJ, Quax R, van der Zee EA, Thijssen D, Dresler M (2021) Mapping the multicausality of Alzheimer’s disease through group model building. GeroScience 43:829–843

Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11):2579–2605

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems

Wang H, Fu T, Du Y, Gao W, Huang K, Liu Z (2023) Scientific discovery in the age of artificial intelligence. Nature 620(7972):47–60

Webber J (2012) A programmatic introduction to Neo4j. In: Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity, p 217–218

Williams K, Berman G, Michalska S (2023) Investigating hybridity in artificial intelligence research. Big Data Soc 10(2):20539517231180577

Wu S, Koo M, Blum L, Black A, Kao L, Scalzo F, Kurtz I (2023) A comparative study of open-source large language models, GPT-4 and Claude 2: multiple-choice test taking in nephrology. arXiv preprint arXiv:2308.04709

Yu F, Peng T, Peng K, Zheng SX, Liu Z (2016) The Semantic Network Model of creativity: analysis of online social media data. Creat Res J 28(3):268–274


Acknowledgements

The authors thank Dr. Honghong Bai (Radboud University), Dr. ChienTe Wu (The University of Tokyo), Dr. Peng Cheng (Tsinghua University), and Yusong Guo (Tsinghua University) for their great comments on an earlier version of this manuscript. This research has been generously funded by personal contributions, with special acknowledgment to K. Mao, who conceived and developed the causality graph and AI hypothesis generation technology presented in this paper from scratch, generated all AI hypotheses, and covered the associated costs. The authors sincerely thank K. Mao for his support, which enabled this research. In addition, K. Peng and S. Tong were partly supported by the Tsinghua University Initiative Scientific Research Program (No. 20213080008), the Self-Funded Project of the Institute for Global Industry, Tsinghua University (202-296-001), the Shuimu Scholars program of Tsinghua University (No. 2021SM157), and the China Postdoctoral International Exchange Program (No. YJ20210266).

Author information

These authors contributed equally: Song Tong, Kai Mao.

Authors and Affiliations

Department of Psychological and Cognitive Sciences, Tsinghua University, Beijing, China

Song Tong & Kaiping Peng

Positive Psychology Research Center, School of Social Sciences, Tsinghua University, Beijing, China

Song Tong, Zhen Huang, Yukun Zhao & Kaiping Peng

AI for Wellbeing Lab, Tsinghua University, Beijing, China

Institute for Global Industry, Tsinghua University, Beijing, China

Kindom KK, Tokyo, Japan


Contributions

Song Tong: Data analysis, Experiments, Writing—original draft & review. Kai Mao: Designed the causality graph methodology, Generated AI hypotheses, Developed hypothesis generation techniques, Writing—review & editing. Zhen Huang: Statistical Analysis, Experiments, Writing—review & editing. Yukun Zhao: Conceptualization, Project administration, Supervision, Writing—review & editing. Kaiping Peng: Conceptualization, Writing—review & editing.

Corresponding authors

Correspondence to Yukun Zhao or Kaiping Peng.

Ethics declarations

Competing interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval

In this study, ethical approval was granted by the Institutional Review Board (IRB) of the Department of Psychology at Tsinghua University, China. The Research Ethics Committee documented this approval under the number IRB202306, following an extensive review that concluded on March 12, 2023. This approval indicates the research’s strict compliance with the IRB’s guidelines and regulations, ensuring ethical integrity and adherence throughout the study.

Informed consent

Before participating, all study participants gave their informed consent. They received comprehensive details about the study’s goals, methods, potential risks and benefits, confidentiality safeguards, and their rights as participants. This process guaranteed that participants were fully informed about the study’s nature and voluntarily agreed to participate, free from coercion or undue influence.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental material

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Tong, S., Mao, K., Huang, Z. et al. Automating psychological hypothesis generation with AI: when large language models meet causal graph. Humanit Soc Sci Commun 11, 896 (2024). https://doi.org/10.1057/s41599-024-03407-5


Received: 08 November 2023

Accepted: 25 June 2024

Published: 09 July 2024

DOI: https://doi.org/10.1057/s41599-024-03407-5


Data-driven hypothesis generation in clinical research: what we learned from a human subject study


Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve it often goes unrecognized. Without an impactful hypothesis, the significance of any research project can be questionable, regardless of the rigor or diligence applied to other steps of the study, e.g., study design, data collection, and result analysis. In this perspective article, the authors first review the literature on scientific thinking, reasoning, medical reasoning, literature-based discovery, and a field study exploring scientific thinking and discovery. Over the years, research on scientific thinking has made excellent progress in cognitive science and its applied areas: education, medicine, and biomedical research. However, a review of the literature reveals a lack of original studies on hypothesis generation in clinical research. The authors then summarize their first human participant study exploring data-driven hypothesis generation by clinical researchers in a simulated setting. The results indicate that a secondary data analytical tool, VIADS (a visual interactive analytic tool for filtering, summarizing, and visualizing large health data sets coded with hierarchical terminologies), can shorten the time participants need, on average, to generate a hypothesis and also requires fewer cognitive events per hypothesis generated. As a counterpoint, the hypotheses generated with VIADS received significantly lower ratings for feasibility. Despite its small scale, the study confirmed the feasibility of conducting a human participant study to directly explore the hypothesis generation process in clinical research. It provides supporting evidence for a larger-scale study, with a specifically designed tool, to facilitate hypothesis generation among inexperienced clinical researchers. A larger study could provide generalizable evidence, which in turn could improve clinical research productivity and the overall clinical research enterprise.



Hypothesis Generation with Large Language Models

5 Apr 2024 · Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, Chenhao Tan

Effective generation of novel hypotheses is instrumental to scientific progress. So far, researchers have been the main powerhouse behind hypothesis generation by painstaking data analysis and thinking (also known as the Eureka moment). In this paper, we examine the potential of large language models (LLMs) to generate hypotheses. We focus on hypothesis generation based on data (i.e., labeled examples). To enable LLMs to handle arbitrarily long contexts, we generate initial hypotheses from a small number of examples and then update them iteratively to improve the quality of hypotheses. Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process. Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks, improving accuracy by 31.7% on a synthetic dataset and by 13.9%, 3.3%, and 24.9% on three real-world datasets. We also outperform supervised learning by 12.8% and 11.2% on two challenging real-world datasets. Furthermore, we find that the generated hypotheses not only corroborate human-verified theories but also uncover new insights for the tasks.
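As a rough illustration of the bandit-inspired update step described above, the following sketch scores candidate hypotheses with an upper-confidence-bound (UCB) style reward that balances empirical accuracy (exploitation) against under-evaluation (exploration). The reward form, the toy numbers, and all names are assumptions for exposition, not the authors' released implementation.

    import math

    def ucb_reward(accuracy, times_evaluated, total_evaluations, c=1.0):
        """Exploitation term (empirical accuracy on labeled examples) plus an
        exploration bonus that shrinks as a hypothesis is evaluated more often."""
        return accuracy + c * math.sqrt(math.log(total_evaluations) / times_evaluated)

    # Toy state: hypothesis -> [correct predictions, times evaluated].
    stats = {"h1": [8, 10], "h2": [3, 4], "h3": [1, 1]}
    total = sum(n for _, n in stats.values())

    # The update loop would refine the hypothesis with the highest reward next,
    # favoring both well-performing and rarely tested candidates.
    best = max(stats, key=lambda h: ucb_reward(stats[h][0] / stats[h][1], stats[h][1], total))
    print("refine next:", best)  # here "h3": least evaluated, so largest exploration bonus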


Formulating Hypotheses for Different Study Designs

Durga Prasanna Misra

1 Department of Clinical Immunology and Rheumatology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, India.

Armen Yuri Gasparyan

2 Departments of Rheumatology and Research and Development, Dudley Group NHS Foundation Trust (Teaching Trust of the University of Birmingham, UK), Russells Hall Hospital, Dudley, UK.

Olena Zimba

3 Department of Internal Medicine #2, Danylo Halytsky Lviv National Medical University, Lviv, Ukraine.

Marlen Yessirkepov

4 Department of Biology and Biochemistry, South Kazakhstan Medical Academy, Shymkent, Kazakhstan.

Vikas Agarwal

George D. Kitas

5 Centre for Epidemiology versus Arthritis, University of Manchester, Manchester, UK.

Generating a testable working hypothesis is the first step towards conducting original research. Such research may prove or disprove the proposed hypothesis. Case reports, case series, online surveys and other observational studies, clinical trials, and narrative reviews help to generate hypotheses. Observational and interventional studies help to test hypotheses. A good hypothesis is usually based on previous evidence-based reports. Hypotheses without evidence-based justification and a priori ideas are not received favourably by the scientific community. Original research to test a hypothesis should be carefully planned to ensure appropriate methodology and adequate statistical power. While hypotheses can challenge conventional thinking and may be controversial, they should not be destructive. A hypothesis should be tested by ethically sound experiments with meaningful ethical and clinical implications. The coronavirus disease 2019 pandemic has brought into sharp focus numerous hypotheses, some of which were proven (e.g. effectiveness of corticosteroids in those with hypoxia) while others were disproven (e.g. ineffectiveness of hydroxychloroquine and ivermectin).


DEFINING WORKING AND STANDALONE SCIENTIFIC HYPOTHESES

Science is the systematized description of natural truths and facts. Routine observations of existing life phenomena lead to creative thinking and the generation of ideas about the mechanisms of such phenomena and related human interventions. Such ideas, presented in a structured format, can be viewed as hypotheses. After generating a hypothesis, it is necessary to test it to prove its validity. Thus, a hypothesis can be defined as a proposed mechanism of a naturally occurring event or a proposed outcome of an intervention.1,2

Hypothesis testing requires choosing the most appropriate methodology and statistically powering the study adequately to be able to “prove” or “disprove” the hypothesis within predetermined and widely accepted levels of certainty. This entails sample size calculation, which often takes into account previously published observations and pilot studies.2,3 In the era of digitization, hypothesis generation and testing may benefit from the availability of numerous platforms for data dissemination, social networking, and expert validation. Related expert evaluations may reveal the strengths and limitations of proposed ideas at early stages of post-publication promotion, preventing the implementation of unsupported controversial points.4
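As a minimal illustration of such a sample size calculation, the sketch below (Python with statsmodels) computes the per-group sample size for a two-group comparison. The effect size, alpha, and power values are assumptions chosen purely for the example; in practice they would come from pilot data or prior literature.

    from statsmodels.stats.power import TTestIndPower

    # Assumed planning values: medium effect (Cohen's d = 0.5),
    # two-sided alpha of 0.05, and 80% power.
    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(f"Required sample size per group: {n_per_group:.1f}")  # about 64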

Thus, hypothesis generation is an important initial step in the research workflow, reflecting accumulating evidence and experts' stance. In this article, we overview the genesis and importance of scientific hypotheses and their relevance in the era of the coronavirus disease 2019 (COVID-19) pandemic.

DO WE NEED HYPOTHESES FOR ALL STUDY DESIGNS?

Broadly, research can be categorized as primary or secondary. In the context of medicine, primary research may include real-life observations of disease presentations and outcomes. Single case descriptions, which often lead to new ideas and hypotheses, serve as important starting points or justifications for case series and cohort studies. The importance of case descriptions is particularly evident in the context of the COVID-19 pandemic when unique, educational case reports have heralded a new era in clinical medicine. 5

Case series serve similar purpose to single case reports, but are based on a slightly larger quantum of information. Observational studies, including online surveys, describe the existing phenomena at a larger scale, often involving various control groups. Observational studies include variable-scale epidemiological investigations at different time points. Interventional studies detail the results of therapeutic interventions.

Secondary research is based on already published literature and does not directly involve human or animal subjects. Review articles are generated by secondary research. These could be systematic reviews which follow methods akin to primary research but with the unit of study being published papers rather than humans or animals. Systematic reviews have a rigid structure with a mandatory search strategy encompassing multiple databases, systematic screening of search results against pre-defined inclusion and exclusion criteria, critical appraisal of study quality and an optional component of collating results across studies quantitatively to derive summary estimates (meta-analysis). 6 Narrative reviews, on the other hand, have a more flexible structure. Systematic literature searches to minimise bias in selection of articles are highly recommended but not mandatory. 7 Narrative reviews are influenced by the authors' viewpoint who may preferentially analyse selected sets of articles. 8

In relation to primary research, case studies and case series are generally not driven by a working hypothesis. Rather, they serve as a basis for generating hypotheses. Observational and interventional studies should have a hypothesis for choosing the research design and sample size. The results of observational and interventional studies further lead to the generation of new hypotheses, the testing of which forms the basis of future studies. Review articles, on the other hand, may not be hypothesis-driven, but they form fertile ground for generating future hypotheses for evaluation. Fig. 1 summarizes which types of studies are hypothesis-driven and which lead to hypothesis generation.

[Fig. 1: Study designs that are hypothesis-driven and study designs that lead to hypothesis generation.]

STANDARDS OF WORKING AND SCIENTIFIC HYPOTHESES

A review of the published literature did not enable the identification of clearly defined standards for working and scientific hypotheses. It is essential to distinguish influential from non-influential hypotheses, evidence-based hypotheses from a priori statements and ideas, and ethical from unethical or potentially harmful ideas. The following points are proposed for consideration while generating working and scientific hypotheses.1,2 Table 1 summarizes these points.

Table 1. Points to be considered while evaluating the validity of hypotheses

  • Backed by evidence-based data
  • Testable by relevant study designs
  • Supported by preliminary (pilot) studies
  • Testable by ethical studies
  • Maintaining a balance between scientific temper and controversy

Evidence-based data

A scientific hypothesis should have a sound basis on previously published literature as well as the scientist's observations. Randomly generated (a priori) hypotheses are unlikely to be proven. A thorough literature search should form the basis of a hypothesis based on published evidence. 7

Testable by relevant study designs

Unless a scientific hypothesis can be tested, it can neither be proven nor disproven. Therefore, a scientific hypothesis should be amenable to testing with the available technologies and the present understanding of science.

Supported by pilot studies

If a hypothesis is based purely on a novel observation by the scientist in question, it should be grounded on some preliminary studies to support it. For example, if a drug that targets a specific cell population is hypothesized to be useful in a particular disease setting, then there must be some preliminary evidence that the specific cell population plays a role in driving that disease process.

Testable by ethical studies

The hypothesis should be testable by experiments that are ethically acceptable. 9 For example, a hypothesis that parachutes reduce mortality from falls from an airplane cannot be tested using a randomized controlled trial. 10 This is because it is obvious that all those jumping from a flying plane without a parachute would likely die. Similarly, the hypothesis that smoking tobacco causes lung cancer cannot be tested by a clinical trial that makes people take up smoking (since there is considerable evidence for the health hazards associated with smoking). Instead, long-term observational studies comparing outcomes in those who smoke and those who do not, as was performed in the landmark epidemiological case control study by Doll and Hill, 11 are more ethical and practical.

Balance between scientific temper and controversy

Novel findings, including novel hypotheses, particularly those that challenge established norms, are bound to face resistance to their wider acceptance. Such resistance is inevitable until such findings are proven with appropriate scientific rigor. However, hypotheses that generate controversy are generally unwelcome. For example, at the time the pandemic of human immunodeficiency virus (HIV) and AIDS was taking hold, there were numerous deniers who refused to believe that HIV caused AIDS.12,13 Similarly, at a time when climate change is causing catastrophic changes to weather patterns worldwide, denial that climate change is occurring and consequent attempts to block action on climate change are certainly unwelcome.14 The denialism and misinformation during the COVID-19 pandemic, including unfortunate examples of vaccine hesitancy, are more recent examples of controversial hypotheses not backed by science.15,16 An example of a controversial hypothesis that proved a revolutionary scientific breakthrough was the proposal by Warren and Marshall that Helicobacter pylori causes peptic ulcers. Initially, the idea that a microorganism could cause gastritis and gastric ulcers faced immense resistance. Only when the scientists who proposed the hypothesis ingested H. pylori themselves, inducing gastritis, could they convince the wider world. Such was the impact of the hypothesis that Barry Marshall and Robin Warren were awarded the Nobel Prize in Physiology or Medicine in 2005 for this discovery.17,18

DISTINGUISHING THE MOST INFLUENTIAL HYPOTHESES

Influential hypotheses are those that have stood the test of time. An archetype of an influential hypothesis is that proposed by Edward Jenner in the eighteenth century that cowpox infection protects against smallpox. While this observation had been reported for nearly a century before this time, it had not been suitably tested and publicised until Jenner conducted his experiments on a young boy by demonstrating protection against smallpox after inoculation with cowpox. 19 These experiments were the basis for widespread smallpox immunization strategies worldwide in the 20th century which resulted in the elimination of smallpox as a human disease today. 20

Other influential hypotheses are those which have been read and cited widely. An example of this is the hygiene hypothesis proposing an inverse relationship between infections in early life and allergies or autoimmunity in adulthood. An analysis reported that this hypothesis had been cited more than 3,000 times on Scopus. 1

LESSONS LEARNED FROM HYPOTHESES AMIDST THE COVID-19 PANDEMIC

The COVID-19 pandemic devastated the world like no other in recent memory. During this period, various hypotheses emerged, understandably so considering the public health emergency situation with innumerable deaths and suffering for humanity. Within weeks of the first reports of COVID-19, aberrant immune system activation was identified as a key driver of organ dysfunction and mortality in this disease. 21 Consequently, numerous drugs that suppress the immune system or abrogate the activation of the immune system were hypothesized to have a role in COVID-19. 22 One of the earliest drugs hypothesized to have a benefit was hydroxychloroquine. Hydroxychloroquine was proposed to interfere with Toll-like receptor activation and consequently ameliorate the aberrant immune system activation leading to pathology in COVID-19. 22 The drug was also hypothesized to have a prophylactic role in preventing infection or disease severity in COVID-19. It was also touted as a wonder drug for the disease by many prominent international figures. However, later studies which were well-designed randomized controlled trials failed to demonstrate any benefit of hydroxychloroquine in COVID-19. 23 , 24 , 25 , 26 Subsequently, azithromycin 27 , 28 and ivermectin 29 were hypothesized as potential therapies for COVID-19, but were not supported by evidence from randomized controlled trials. The role of vitamin D in preventing disease severity was also proposed, but has not been proven definitively until now. 30 , 31 On the other hand, randomized controlled trials identified the evidence supporting dexamethasone 32 and interleukin-6 pathway blockade with tocilizumab as effective therapies for COVID-19 in specific situations such as at the onset of hypoxia. 33 , 34 Clues towards the apparent effectiveness of various drugs against severe acute respiratory syndrome coronavirus 2 in vitro but their ineffectiveness in vivo have recently been identified. Many of these drugs are weak, lipophilic bases and some others induce phospholipidosis which results in apparent in vitro effectiveness due to non-specific off-target effects that are not replicated inside living systems. 35 , 36

Another hypothesis proposed was the association of routine Bacillus Calmette-Guerin (BCG) vaccination policies with lower deaths due to COVID-19. This hypothesis emerged in the middle of 2020, when COVID-19 was still taking hold in many parts of the world.37,38 Subsequently, many countries that had lower deaths at that time point went on to have higher mortality, comparable to that in other areas of the world. Furthermore, the hypothesis that BCG vaccination reduced COVID-19 mortality was a classic example of the ecological fallacy: associations between population-level events (ecological studies; in this case, BCG vaccination and COVID-19 mortality) cannot be directly extrapolated to the individual level. Such associations cannot per se be attributed as causal in nature, and can only serve to generate hypotheses that need to be tested at the individual level.39

IS TRADITIONAL PEER REVIEW EFFICIENT FOR EVALUATION OF WORKING AND SCIENTIFIC HYPOTHESES?

Traditionally, publication after peer review has been considered the gold standard before any new idea finds acceptability amongst the scientific community. Getting a work (including a working or scientific hypothesis) reviewed by experts in the field before experiments are conducted to prove or disprove it helps to refine the idea further as well as improve the experiments planned to test the hypothesis. 40 A route towards this has been the emergence of journals dedicated to publishing hypotheses such as the Central Asian Journal of Medical Hypotheses and Ethics. 41 Another means of publishing hypotheses is through registered research protocols detailing the background, hypothesis, and methodology of a particular study. If such protocols are published after peer review, then the journal commits to publishing the completed study irrespective of whether the study hypothesis is proven or disproven. 42 In the post-pandemic world, online research methods such as online surveys powered via social media channels such as Twitter and Instagram might serve as critical tools to generate as well as to preliminarily test the appropriateness of hypotheses for further evaluation. 43 , 44

Some radical hypotheses might be difficult to publish after traditional peer review. These hypotheses might only be acceptable by the scientific community after they are tested in research studies. Preprints might be a way to disseminate such controversial and ground-breaking hypotheses. 45 However, scientists might prefer to keep their hypotheses confidential for the fear of plagiarism of ideas, avoiding online posting and publishing until they have tested the hypotheses.

SUGGESTIONS ON GENERATING AND PUBLISHING HYPOTHESES

Publication of hypotheses is important; however, a balance is required between scientific temper and controversy. Journal editors and reviewers might keep in mind the following specific points, summarized in Table 2 and detailed hereafter, while judging the merit of hypotheses for publication. Keeping in mind the ethical principle of primum non nocere, a hypothesis should be published only if it is testable in a manner that is ethically appropriate.46 Such hypotheses should be grounded in reality and lend themselves to further testing to either prove or disprove them. It must be considered that subsequent experiments to prove or disprove a hypothesis have an equal chance of failing or succeeding, akin to tossing a coin. A preconceived belief that a hypothesis is unlikely to be proven correct should not form the basis of its rejection for publication. In this context, hypotheses generated after a thorough literature search to identify knowledge gaps, or based on concrete clinical observations on a considerable number of patients (as opposed to random observations on a few patients), are more likely to be acceptable for publication by peer-reviewed journals. Also, hypotheses should be considered for publication or rejection based on their implications for science at large rather than on whether the subsequent experiments to test them end up with results in favour of or against the original hypothesis.

Table 2. Points to be considered before a hypothesis is acceptable for publication

  • Experiments required to test hypotheses should be ethically acceptable as per the World Medical Association declaration on ethics and related statements
  • Pilot studies support hypotheses
  • Single clinical observations and expert opinion surveys may support hypotheses
  • Testing hypotheses requires robust methodology and statistical power
  • Hypotheses that challenge established views and concepts require proper evidence-based justification

Hypotheses form an important part of the scientific literature. The COVID-19 pandemic has reiterated the importance and relevance of hypotheses for dealing with public health emergencies and highlighted the need for evidence-based and ethical hypotheses. A good hypothesis is testable in a relevant study design, backed by preliminary evidence, and has positive ethical and clinical implications. General medical journals might consider publishing hypotheses as a specific article type to enable more rapid advancement of science.

Disclosure: The authors have no potential conflicts of interest to disclose.

Author Contributions:

  • Data curation: Gasparyan AY, Misra DP, Zimba O, Yessirkepov M, Agarwal V, Kitas GD.

ORIGINAL RESEARCH article

Temporal dynamics of hypothesis generation: the influences of data serial order, data consistency, and elicitation timing.


  • 1 Department of Psychological Sciences, Birkbeck College, University of London, London, UK
  • 2 Department of Psychology, University of Oklahoma, Norman, OK, USA

The pre-decisional process of hypothesis generation is a ubiquitous cognitive faculty that we continually employ in an effort to understand our environment and thereby support appropriate judgments and decisions. Although we are beginning to understand the fundamental processes underlying hypothesis generation, little is known about how various temporal dynamics, inherent in real world generation tasks, influence the retrieval of hypotheses from long-term memory. This paper presents two experiments investigating three data acquisition dynamics in a simulated medical diagnosis task. The results indicate that the mere serial order of data, data consistency (with previously generated hypotheses), and mode of responding influence the hypothesis generation process. An extension of the HyGene computational model endowed with dynamic data acquisition processes is forwarded and explored to provide an account of the present data.

Hypothesis generation is a pre-decisional process by which we formulate explanations and beliefs regarding the occurrences we observe in our environment. The hypotheses we generate from long-term memory (LTM) bring structure to many of the ill-structured decision making tasks we commonly encounter. As such, hypothesis generation represents a fundamental and ubiquitous cognitive faculty on which we constantly rely in our day-to-day lives. Given the regularity with which we employ this process, it is no surprise that hypothesis generation forms a core component of several professions. Auditors, for instance, must generate hypotheses regarding abnormal financial patterns, mechanics must generate hypotheses concerning car failure, and intelligence analysts must interpret the information they receive. Perhaps the clearest example, however, is that of medical diagnosis. A physician observes a pattern of symptoms presented by a patient (i.e., data) and uses this information to generate likely diagnoses (i.e., hypotheses) in an effort to explain the patient’s presenting symptoms. Given these examples, the importance of developing a full understanding of the processes underlying hypothesis generation is clear, as the consequences of impoverished or inaccurate hypothesis generation can be injurious.

Issues of temporality pervade hypothesis generation and its underlying information acquisition processes. Hypothesis generation is a task situated at the confluence of external environmental dynamics and internal cognitive dynamics. External dynamics in the environment dictate the manifestation of the information we acquire and use as cues to retrieve likely hypotheses from LTM. Internal cognitive dynamics then determine how this information is used in service of the generation process and how the resulting hypotheses are maintained over the further course of time as judgments and decisions are rendered. Additionally, these further internal processes are influenced by and interact with the ongoing environmental dynamics as new information is acquired. These complicated interactions govern the beliefs (i.e., hypotheses) we entertain over time. It is likely that these factors interact in such a manner that would cause the data acquisition process to deviate from normative prescriptions.

Important to the present work is the fact that data acquisition generally occurs serially over some span of time. This, in turn, dictates that individual pieces of data are acquired in some relative temporal relation to one another. These constraints, individual data acquisition over time and the relative ordering of data, are likely to have significant consequences for hypothesis generation processes. Given these basic constraints, it is intuitive that temporal dynamics must form an integral part of any comprehensive account of hypothesis generation processes. At present there exists only a scant amount of data concerning the temporal dynamics of hypothesis generation. Thus, the influences of the constraints operating over these processes are not yet well understood. Until such influences are addressed more deeply at an empirical and theoretical level, a full understanding of hypothesis generation processes will remain speculative.

The empirical paradigm used in the following experiments is a simulated diagnosis task comprising two main phases. The first phase represents a form of category learning in which the participant learns the conditional probabilities of medical symptoms (i.e., data) and fictitious diseases (i.e., hypotheses) from experience over time by observing a large sample of hypothetical pre-diagnosed patients. The second phase of the task involves presenting symptoms to the participant, whose task is to generate (i.e., retrieve) likely disease states from memory. At a broader level, such experiments involving a learning phase followed by a decision making phase have been utilized widely in previous work (e.g., McKenzie, 1998; Cooper et al., 2003; Nelson et al., 2010; Sprenger and Dougherty, 2012). In the to-be-presented experiments, we presented the symptoms sequentially and manipulated the structure of the symptom sequences in the decision making phase. As data acquisition unfolds over time, the results of these experiments provide insight into the dynamic data acquisition and hypothesis generation processes operating over time that are important for computational models.

In this paper, we present a novel extension of an existing computational model of hypothesis generation. This extension is designed to capture the working memory dynamics operating during data acquisition and how these factors contribute to the process of hypothesis generation. Additionally, two experiments exploring three questions of interest to dynamic hypothesis generation are described whose results are captured by this model. Experiment 1 utilized an adapted generalized order effects paradigm to assess how the serial position of an informative piece of information (i.e., a diagnostic datum), amongst uninformative information (i.e., non-diagnostic data), influences its contribution to the generation process. Experiment 2 investigated (1) how the acquisition of data inconsistent with previously generated hypotheses influences further generation and maintenance processes and (2) if generation behavior differs when it is based on the acquisition of a set of data vs. when those same pieces of data are acquired in isolation and generation is carried out successively as each datum is acquired. This distinction underscores different scenarios in which it is advantageous to maintain previously acquired data vs. previously generated hypotheses over time.

HyGene: A Computational Model of Hypothesis Generation

HyGene (Thomas et al., 2008; Dougherty et al., 2010), short for hypothesis generation, is a computational architecture addressing hypothesis generation, evaluation, and testing. This framework has provided a useful account through which to understand the cognitive mechanisms underlying these processes. This process model is presented in Figure 1.


Figure 1. Flow diagram of the HyGene model of hypothesis generation, judgment, and testing. A_s: semantic activation of a retrieved hypothesis; Act_MinH: minimum semantic activation criterion for placement of a hypothesis in the SOC; T: total number of retrieval failures; K_max: number of retrieval failures allowed before terminating hypothesis generation.

HyGene rests upon three core principles. First, as underscored by the above examples, it is assumed that hypothesis generation represents a generalized case of cued recall. That is, the data observed in the environment (D_obs), which one would like to explain, act as cues prompting the retrieval of hypotheses from LTM. For instance, when a physician examines a patient, he/she uses the symptoms expressed by the patient as cues to related experiences stored in LTM. These cues activate a subset of related memories from which hypotheses are retrieved. These retrieval processes are indicated in Steps 1, 2, and 3 shown in Figure 1. Step 1 represents the environmental data being matched against episodic memory. In Step 2, the instances in episodic memory that are highly activated by the environmental data contribute to the extraction of an unspecified probe representing a prototype of these highly activated episodic instances. This probe is then matched against all known hypotheses in semantic memory, as indicated in Step 3. Hypotheses are then sampled into working memory based on their activations resulting from this semantic memory match.

As viable hypotheses are retrieved from LTM, they are placed in the Set of Leading Contenders (SOC), as demonstrated in Step 4. The SOC represents HyGene's working memory construct, to which HyGene's second principle applies. The second principle holds that the number of hypotheses that can be maintained at one time is constrained by cognitive limitations (e.g., working memory capacity) as well as task characteristics (e.g., divided attention, time pressure). Accordingly, the more working memory resources one has available to devote to the generation and maintenance of hypotheses, the more hypotheses can be placed in the SOC. Working memory capacity places an upper bound on the number of hypotheses and data that one can maintain at any point in time. In many circumstances, however, attention will be divided by a secondary task. Under such conditions this upper bound is reduced, as the alternative task siphons resources that would otherwise allow the SOC to be populated to its unencumbered capacity (Dougherty and Hunter, 2003a, b; Sprenger and Dougherty, 2006; Sprenger et al., 2011).

The third principle states that the hypotheses maintained in the SOC form the basis from which probability judgments are derived and provide the basis from which hypothesis testing is implemented. This principle underscores the function of hypothesis generation as a pre-decisional process underlying higher-level decision making tasks. The tradition of much of the prior research on probability judgment and hypothesis testing has been to provide the participant with the options to be judged or tested. HyGene highlights this as somewhat limiting the scope of the conclusions drawn from such procedures, as decision makers in real world tasks must generally generate the to-be-evaluated hypotheses themselves. As these higher-level tasks are contingent upon the output of the hypothesis generation process, any conclusions drawn from such experimenter-provided tasks are likely limited to such conditions.

Hypothesis Generation Processes in HyGene

The representation used by HyGene was borrowed from the multiple-trace global matching memory model MINERVA II (Hintzman, 1986, 1988) and the decision making model MINERVA-DM (Dougherty et al., 1999)^1. Memory traces are represented in the model as a series of concatenated minivectors consisting of arbitrarily assigned 1s, 0s, and −1s, where each minivector represents either a hypothesis or a piece of data (i.e., a feature of the memory). Separate episodic and semantic memory stores are present in HyGene, each made up of instances of such concatenated feature minivectors. While semantic memory contains prototypes of each disease, episodic memory contains individual traces for every experience the model acquires.

Retrieval is initiated when D_obs is matched against each of the data minivectors in episodic LTM. This returns an activation value for each trace in episodic LTM, whereby greater overlap between the features present in a trace and those present in D_obs results in greater activation. A threshold is applied to these episodic activation values such that only traces with long-term episodic activations exceeding the threshold contribute to additional processing in the model. A prototype is extracted from this subset of traces and is then used as a cue to semantic memory for the retrieval of hypotheses. We refer to this cue as the unspecified probe. The unspecified probe is matched against all hypotheses in semantic memory, which returns an activation value for each known hypothesis. These activation values serve as input into retrieval through sampling via Luce's choice rule. Generation proceeds in this way until a stopping rule is reached, based on the total number of resamples of previously generated hypotheses (i.e., retrieval failures).
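A compact numerical sketch of this retrieval cycle is given below. It follows the MINERVA-style global matching scheme described above (similarity cubed as activation, thresholding, prototype extraction, and Luce's choice rule); the vector sizes, threshold value, and randomly generated traces are assumptions for illustration, not HyGene's calibrated parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    def activation(probe, traces):
        """MINERVA-style match: similarity is the dot product normalized by the
        number of relevant (non-zero) features; activation is similarity cubed."""
        relevant = (probe != 0) | (traces != 0)
        sims = (traces @ probe) / np.maximum(relevant.sum(axis=1), 1)
        return sims ** 3

    n_feat = 20
    episodic_data = rng.choice([-1, 0, 1], size=(50, n_feat))   # data minivectors
    semantic_hyps = rng.choice([-1, 0, 1], size=(5, n_feat))    # hypothesis prototypes

    d_obs = episodic_data[0]                     # the observed data, used as a cue

    # Steps 1-2: match D_obs against episodic memory, keep traces whose
    # activation exceeds a threshold, and extract a prototype (the
    # "unspecified probe") from that activated subset.
    acts = activation(d_obs, episodic_data)
    above = acts > 0.1                           # assumed activation threshold
    probe = np.sign((acts[above, None] * episodic_data[above]).sum(axis=0))

    # Step 3: match the probe against semantic memory, then sample a hypothesis
    # into working memory via Luce's choice rule over positive activations.
    sem_acts = np.clip(activation(probe, semantic_hyps), 0.0, None) + 1e-9
    p = sem_acts / sem_acts.sum()
    generated = rng.choice(len(semantic_hyps), p=p)
    print("sampled hypothesis:", generated, "choice probabilities:", np.round(p, 2))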

In its current form, the HyGene model is static with regard to data acquisition and utilization. The model receives all available data from the environment simultaneously and engages in only a single iteration of hypothesis generation. Given the static nature of the model, each piece of data used to cue LTM contributes equally to the recall process. Based on effects observed in related domains, however, it seems reasonable to suspect that all available data do not contribute equally in hypothesis generation tasks. Anderson (1965), for example, observed primacy weightings in an impression formation task in which attributes describing a person were revealed sequentially. Moreover, recent work has demonstrated biases in the serial position of data used to support hypothesis generation tasks (Sprenger and Dougherty, 2012). By ignoring differential use of available data in the generation process, HyGene, as previously implemented, ignores temporal dynamics influencing hypothesis generation tasks. In our view, what is needed is an understanding of working memory dynamics as data acquisition, hypothesis generation, and maintenance processes unfold and evolve over time.

Dynamic Working Memory Buffer of the Context-Activation Model

The context-activation model of memory ( Davelaar et al., 2005 ) is one of the most comprehensive models of memory recall to date. It is a dual-trace model of list memory accounting for a large set of data from various recall paradigms. Integral to the model’s behavior are the activation-based working memory dynamics of its buffer. The working memory buffer of the model dictates that the activations of the items in working memory systematically fluctuate over time as the result of competing processes described by Eq. 1.

Equation 1: activation calculation of the context-activation model

x_i(t + 1) = λ·x_i(t) + (1 − λ)·[α·F(x_i(t)) + I_i(t) − β·Σ_{j≠i} F(x_j(t))] + N(0, σ)

The activation level of each item, x_i, is determined by the item's activation on the previous time step, the self-recurrent excitation α that each item recycles onto itself, lateral inhibition β from the other active items, bottom-up sensory input I_i(t) while the item is presented, and zero-mean Gaussian noise N with standard deviation σ. F(·) converts an item's internal activation into its output (memory) activation. Lastly, λ is the Euler integration constant that discretizes the differential equation. Note, however, that as this equation is applied in the present model, noise was only applied to an item's activation value once it was presented to the model 2 .
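A minimal sketch of these dynamics, assuming a simple saturating form for F and illustrative values for the integration constant and input strength (neither is specified in the text), might look as follows:

```python
import numpy as np

rng = np.random.default_rng(1)

def F(x):
    """Saturating output function; an illustrative stand-in for the published F(x)."""
    xp = np.maximum(x, 0.0)
    return xp / (1.0 + xp)

def run_buffer(n_items=4, steps_per_item=1500, alpha=2.0, beta=0.2,
               lam=0.98, sigma=0.0, inp=0.33):
    """Present n_items sequentially; return output activations F(x) at the end
    of the list (cf. Figure 2). lam and inp are assumed, not from the text."""
    x = np.zeros(n_items)
    for item in range(n_items):
        bottom_up = np.zeros(n_items)
        bottom_up[item] = inp                      # input only while presented
        presented = np.arange(n_items) <= item     # noise only once presented
        for _ in range(steps_per_item):
            out = F(x)
            dx = alpha * out + bottom_up - beta * (out.sum() - out)
            x = lam * x + (1.0 - lam) * dx + rng.normal(0, sigma, n_items) * presented
    return F(x)

acts = run_buffer()                                # noiseless run, alpha = 2, beta = 0.2
weights = np.where(acts > 0.2, acts, 0.0)          # WM threshold from the text
```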

Figure 2 illustrates the interplay of the competitive buffer dynamics in a noiseless run of the buffer in which four pieces of data have been presented to the model successively. The activation of each datum rises as it is presented to the model and its bottom-up sensory input contributes to its activation. These activations are then dampened in the absence of bottom-up input, as inhibition from the other items drives activation down. Self-recurrency can keep an item in the buffer in the absence of bottom-up input, but this ability diminishes in proportion to the amount of competition from other items in the buffer. The line at 0.2 represents the model's working memory threshold. In the combined dynamic HyGene model (which utilizes the dynamics of the buffer to determine the weights of the data), this WM threshold separates data that are available to contribute to generation (>0.2) from those that are not (<0.2). That is, if a piece of data's activation is greater than this threshold at the time of generation, then it contributes to the retrieval of hypotheses from LTM and is weighted by its amount of activation. If, on the other hand, a piece of data falls below the WM threshold, then it is weighted zero and as a result does not contribute to hypothesis retrieval.


Figure 2. Noiseless activation trajectories for four sequentially received data in the dynamic activation-based buffer. Each item was presented to the buffer for 1500 iterations. F(x) = memory activation.

The activations of individual items are sensitive to the amount of recurrency (alpha) and inhibition (beta) operating in the buffer. Figure 3 demonstrates differential sensitivity to values of alpha and beta by item presentation serial position (1 through 4 in this case). This plot was generated by running the working memory buffer across a range of alpha and beta values for 50 runs at each parameter combination. Each panel presents the activation of an item in a four-item sequence after the final item has been presented. The activation levels vary with serial position, as shown by the differences among the four panels, and with the values of the alpha and beta parameters, as shown within each panel. It can be seen that items one and two are mainly sensitive to the value of alpha. As alpha is increased, these items are more likely to maintain high activation values at the end of the data presentation. Item three demonstrates a similar pattern under low values of beta, but under higher values of beta this item achieves only modest activation, as it cannot overcome the strong competition exerted by items one and two. Item four demonstrates a pattern distinct from the others. Like the previous three items, the value of alpha limits the influence of beta up to a certain point. At moderate to high values of alpha, however, beta has a large impact on the activation value of the fourth item. At very low values of beta (under high alpha) this item is able to attain high activation, but it quickly moves to very low activation values with modest increases in beta. These modest increases in beta render the competition from the three preceding items too severe for the fourth item to overcome.


Figure 3. Contour plot displaying activation values of four items at the end of data presentation across a range of beta (x axes) and alpha (y axes), demonstrating differences in the activation weight gradients produced by the working memory buffer.

Taken as a whole, these plots describe differences in the activation gradients (profiles of activation across all four items) taken on by the buffer across various values of alpha and beta. The stars in the plot mark two settings of alpha and beta that result in different activation gradients across the items. The setting of alpha = 2 and beta = 0.2, represented by the white stars, produces a recency gradient in the item activations: the earlier items have only slight activation, the third item modest activation, and the last item is highly active relative to the others. Tracing the activations across the setting of alpha = 3 and beta = 0.4, represented by the yellow stars, on the other hand, shows a primacy gradient in which the earlier items are highly active, item three is less so, and the last item's activation is very low. As will be seen, this pattern of activation values across different values of alpha and beta becomes important for the computational account of Experiment 2. At a broader level, this plot shows the range of activation gradients that can be obtained with the working memory buffer. In general the buffer produces recency gradients, but primacy gradients are also possible. Additionally, there are patterns of activation across items that the buffer cannot produce; an inverted-U shape of item activations, for instance, would not result from the buffer's processes.
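A Figure 3-style parameter sweep can be sketched by reusing run_buffer from the earlier sketch; the parameter ranges and noise level below are illustrative:

```python
# Reuses run_buffer (and numpy as np) from the sketch above.
alphas = np.linspace(1.0, 4.0, 7)
betas = np.linspace(0.0, 0.6, 7)

grid = np.zeros((alphas.size, betas.size, 4))
for i, a in enumerate(alphas):
    for j, b in enumerate(betas):
        runs = [run_buffer(alpha=a, beta=b, sigma=0.1) for _ in range(50)]
        grid[i, j] = np.mean(runs, axis=0)   # mean end-of-list activation, items 1-4

# grid[:, :, k] maps item k+1's activation over the (alpha, beta) plane (cf. Figure 3):
# e.g., alpha = 2, beta = 0.2 yields a recency gradient; alpha = 3, beta = 0.4 primacy.
```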

These dynamics are theoretically meaningful as they produce data patterns that item-based working memory buffers (e.g., SAM; Raaijmakers and Shiffrin, 1981) cannot account for. For example, the buffer dynamics of the context-activation model dictate that items presented early in a sequence will remain high in activation (i.e., remain in working memory) under fast presentation rates; that is, the model predicts a primacy effect under such rates. Such effects have been observed in cued recall (Davelaar et al., 2005), free recall (Usher et al., 2008), and in a hypothesis generation task (Lange et al., 2012). Given these findings and the unique ability of the activation-based buffer to account for these effects, we have selected the activation-based buffer as our starting point for endowing the HyGene model with dynamic data acquisition processes.

A Dynamic Model of Hypothesis Generation: Endowing HyGene with Dynamic Data Acquisition

The competitive working memory processes of the context-activation model's dynamic buffer provide a principled means for incorporating fine-grained temporal dynamics into currently static portions of HyGene. As a first step in incorporating these dynamics, we use the buffer as a means to endow HyGene with dynamic data acquisition. In so doing, the HyGene architecture gains two main advantages. As pointed out by Sprenger and Dougherty (2012), any model of hypothesis generation seeking to account for situations in which data are presented sequentially needs a means of weighting the contribution of individual data. In using the buffer's output as weights on the generation process, we provide such a weighting mechanism. Additionally, as a natural consequence of utilizing the buffer to provide weights on data observed in the environment, working memory capacity constraints are imposed on the amount of data that can contribute to the generation process. As data acquisition was not a focus of the original instantiation of HyGene, capacity limitations in this part of the generation process were not addressed. However, recent data suggest that capacity constraints operating over data acquisition influence hypothesis generation (Lange et al., 2012). Lastly, at a less pragmatic level, this integration provides insight into the working memory dynamics unfolding throughout the data acquisition period, thereby providing a window into processing occurring over this previously unmodeled epoch of the hypothesis generation process.

In order to endow HyGene with dynamic data acquisition, each run of the model begins with the context-activation model being sequentially presented with a series of items. In the context of this model, these items are the environmental data the model has observed. The activation values for each piece of data at the end of the data acquisition period are then used as the weights on the generation process. A working memory threshold is imposed on the data activations such that data with activations falling below 0.2 are weighted with a zero rather than their actual activation value 3 . Specifically, the global memory match performed between the current D_obs and episodic memory in HyGene is weighted by the individual item activations in the dynamic working memory buffer (with the application of the working memory threshold). As each trace in HyGene's episodic memory is made up of concatenated minivectors, each representing a particular data feature (e.g., fever vs. normal temperature), this weighting is applied in a feature-by-feature manner in the global matching process. From this point on, the model operates in accordance with the original instantiation of HyGene. That is, a subset of the highly activated traces in episodic memory is used as the basis for the extraction of the unspecified probe. This probe is then matched against semantic memory, from which hypotheses are serially retrieved into working memory for further processing.
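One plausible reading of this weighted, feature-by-feature match is sketched below; the function name and the exact placement of the weights are assumptions rather than the published specification:

```python
def weighted_activation(d_obs_minivecs, trace_minivecs, weights):
    """Global match in which each datum's minivector contribution is scaled by
    its buffer activation (zero if it fell below the WM threshold)."""
    num, n = 0.0, 0
    for w, p_mv, t_mv in zip(weights, d_obs_minivecs, trace_minivecs):
        if w == 0.0:
            continue                          # sub-threshold data contribute nothing
        num += w * float(p_mv @ t_mv)
        n += int(((p_mv != 0) | (t_mv != 0)).sum())
    return (num / max(n, 1)) ** 3             # cubed, as in the unweighted match
```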

In order to demonstrate how the integrated dynamic HyGene model responds to variation in the buffer dynamics, a simulation was run in which alpha and beta were manipulated at the two levels highlighted above in Figure 3. In this simulation, the model was sequentially presented with four pieces of data. Only one of these pieces of data was diagnostic, whereas the remaining three were completely non-diagnostic. An additional independent variable in this simulation was the serial position in which the diagnostic piece of data was placed. Displayed in Figure 4 is the model's generation of the most likely hypothesis (i.e., the hypothesis suggested by the diagnostic piece of data) across that data's serial position, plotted by the two levels of alpha (recurrent activation) and beta (global lateral inhibition). What this plot demonstrates, in effect, is how the contribution of each datum's serial position to the model's generation process is influenced by alpha and beta. As displayed on the left side of the plot, at the lower value of alpha there are clear recency effects. This is due to the buffer dynamics, which under these settings predict an "early in – early out" cycling of items through the buffer, as shown in Figure 2. The recency effects emerge because earlier data are less likely to reside in the buffer at the time of generation than later data. It should be noted that these parameters (alpha = 2, beta = 0.2) have been used in previous work accounting for data from multiple list recall paradigms (Davelaar et al., 2005). By way of preview, we utilize the model's prediction of recency under these standard parameter settings in guiding our expectations and the implementation of Experiment 1.


Figure 4. Influence of data serial position on the hypothesis generation behavior of the dynamic HyGene model at two levels of alpha and beta (and the performance of an equal-weighted model in blue). Data plotted represent the proportion of simulation runs on which the most likely hypothesis was generated.

Under the higher value of alpha, however, recency does not obtain. In this case, the serial position function flattens substantially as the increased recurrency allows more items to remain available to contribute to generation at the end of the sequential data presentation. That is, even when the diagnostic datum appears early, it is maintained long enough in the buffer to be incorporated into the cue to episodic memory. Under the higher value of beta, this flattening transitions to a mild primacy gradient. This results from the increased inhibition making it more difficult for the later items to gain enough activation in working memory to contribute to the retrieval process. The greater amount of inhibition essentially renders the later items uncompetitive, as they face more competition than they are generally able to overcome. Figure 4 additionally plots a line in blue demonstrating the generation level of the static HyGene model in which, rather than utilizing the weights produced by the buffer, each piece of data was weighted equally with a value of one. This line of performance is intermediate under low alpha, but roughly consistent with the high alpha condition, in which more data contribute to the generation process more regularly.

Experiment 1: Data Serial Position

Order effects are pervasive in investigations of memory and decision making (Murdock, 1962; Weiss and Anderson, 1969; Hogarth and Einhorn, 1992; Page and Norris, 1998). Such effects have even been obtained in a hypothesis generation task specifically. Although observed under different conditions than addressed by the present experiment, Sprenger and Dougherty (2012, Experiments 1 and 3) found that people sometimes tend to generate hypotheses suggested by more recent cues.

The generalized order effect paradigm was developed by Anderson (1965, 1973) and couched within the algebra of information integration theory to derive weight estimates for individual pieces of information presented in impression formation tasks (e.g., adjectives describing a person). This procedure involved embedding a critical piece of information at various serial positions within an otherwise fixed list of information. The serial position occupied by the critical information thus defined the independent variable and, given that all other information was held constant between conditions, differences in final judgment were attributable to this difference in serial position. The present experiment represents an adaptation of this paradigm to assess the impact of data serial position on hypothesis generation.

Participants

Seventy-two participants from the University of Oklahoma participated in this experiment for course credit.

Design and procedure

The design of Experiment 1 was a one-way within-subjects design with symptom order as the independent variable. The statistical ecology for this experiment, as defined by the conditional probabilities between the various diseases and symptoms, is shown in Table 1 . Each of the values appearing in this table represents the probability that the symptom will be positive (e.g., fever) given the disease [where the complementary probability represents the probability of the symptom being negative (e.g., normal temperature) given the disease]. The only diagnostic (i.e., informative) symptom is S1 whereas the remaining symptoms, S2–S4, are non-diagnostic (uninformative).


Table 1. Disease × Symptom ecology of Experiment 1.

Table 2 displays the four symptom orders. Each of these orders was identical (S2 → S3 → S4) except for the position of S1 within them. All participants received and judged all four symptom orders.


Table 2. Symptom presentation orders used in Experiment 1.

There were three main phases to the experiment: an exemplar training phase to learn the contingencies displayed in Table 1, a learning test to discriminate participants who had learned during training from those who had not, and an elicitation phase in which the symptom order manipulation was applied in a diagnosis task in which the patient's symptoms were presented sequentially. The procedure began with the exemplar training phase, in which a series of hypothetical pre-diagnosed patients was presented to the participant in order for them to learn, through experience, the contingencies between the diseases and symptoms. Each of these patients was represented by a diagnosis at the top of the screen and a series of test results (i.e., symptoms) pertaining to the columns of S1, S2, S3, and S4, as in the example displayed in Figure 5.


Figure 5. Example exemplar used in Experiment 1.

Each participant saw 50 exemplars of each disease for a total of 150 exemplars, thus making the base rates of the diseases equal. The specific results of these tests respected the probabilities in Table 1. The exemplars were drawn in blocks of 10 in which the symptom states were drawn from the fixed distribution of symptom states given that disease, sampled independently without replacement from exemplar to exemplar. Therefore, over the 10 exemplars presented in each individual disease block, the symptoms observed by the participant perfectly represented the distribution of symptoms for that disease. The disease blocks themselves were sampled at random without replacement, a process that was repeated after every third disease block. Thus, over the course of training the participants were repeatedly presented with the exact probabilities displayed in Table 1. Each exemplar appeared on the screen for a minimum of 5000 ms, at which point the participant could continue studying the current exemplar or advance to the next exemplar by entering (on the keyboard) the first letter of the current disease exemplar. This optional prolonged study made the training pseudo-self-paced. Prior to beginning the exemplar training phase, the participants were informed that they had an opportunity to earn a $5.00 gift card to Wal-Mart if they performed well enough in the task.
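The block structure of this training scheme can be sketched as follows (reusing rng from the earlier sketches); the probability values passed in would come from Table 1, which is not reproduced here:

```python
def training_blocks(p_positive, n_blocks=5, block_size=10):
    """Exemplar blocks for one disease: within each block of 10, symptom states
    exactly reproduce that disease's conditional probabilities (the p values
    here are placeholders for the entries of Table 1)."""
    exemplars = []
    for _ in range(n_blocks):
        cols = []
        for p in p_positive:                  # one probability per symptom
            n_pos = round(p * block_size)
            col = [1] * n_pos + [0] * (block_size - n_pos)
            rng.shuffle(col)                  # drawn without replacement within block
            cols.append(col)
        exemplars.extend(zip(*cols))          # rows: (S1, S2, S3, S4) per patient
    return exemplars

# e.g., five blocks of 10 per disease -> the 50 exemplars described above
```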

The diagnosis test phase directly followed exemplar training. This test was included to allow discrimination of participants who had learned the contingencies between the symptoms and the diseases in the training phase 4 . The participants were presented with the symptoms of a series of 12 patients (four of each disease), defined principally by the presence or absence of S1. That is, four of the patients had S1 present (suffering from Metalytis) and the remaining eight had S1 absent (four suffering from Zymosis and four suffering from Gwaronia). The remaining symptoms for the four patients of each disease were the same across the three diseases. For one patient these symptoms were all positive. For the remaining three patients, one of these symptoms (S2, S3, S4) was selected without replacement to be absent while the other two were present. Note that S2, S3, and S4 were completely non-diagnostic, as the presence or absence of these symptoms does not influence the likelihood of the disease state; the disease likelihood is completely dependent on the state of S1. The symptoms of each patient were presented simultaneously on a single screen. The participants' task was to diagnose each patient with the disease of greatest posterior probability given the presenting symptoms. No feedback on test performance was provided. As only S1 was diagnostic, the participants' scores on this test were tallied based on their correct discrimination of each patient as Metalytis vs. Gwaronia or Zymosis. If a participant scored greater than 60% on the diagnosis test, they were awarded the gift card at the end of the experiment 5 . Prior to the end of the experiment, the participants were not informed of their performance on the diagnosis test. Each participant then completed a series of arithmetic distracters in order to clear working memory of information processed during the diagnosis test phase. The distracter task consisted of a series of 15 arithmetic equations for which correctness or incorrectness was to be reported (e.g., 15/3 + 2 = 7? Correct or Incorrect?). This distracter task was self-paced.

The elicitation phase then proceeded. First, the diagnosis task was described to the participants as follows: “You will now be presented with additional patients that need to be diagnosed. Each symptom of the patient will be presented one at a time. Following the last symptom you will be asked to diagnose the patient based on their symptoms. Keep in mind that sometimes the symptoms will help you narrow down the list of likely diagnoses to a single disease and other times the symptoms may not help you narrow down the list of likely diagnoses at all. It is up to you to determine if the patient is likely to be suffering from 1 disease, 2 diseases, or all 3 diseases. When you input your response make sure that you respond with the most likely disease first. You will then be asked if you think there is another likely disease. If you think so then you will enter the next most likely disease second. If you do not think there is another likely disease then just hit the Spacebar. You will then have the option to enter a third disease or hit the Spacebar in the same manner. To input the diseases you will use the first letter of the disease, just as you have been during the training and previous test.”

The participant was then presented with the first patient and triggered the onset of the stream of symptoms when they were ready. Each of the four symptoms was presented individually for 1.5 s with a 250 ms interstimulus interval following each symptom. The order in which the symptoms were presented was determined by the order condition, as shown in Table 2. Additionally, all of the patient symptoms presented in this phase were positive (i.e., present; the values in Table 1 represent the likelihood of the symptoms being present given the disease state). The Bayesian posterior probability of D1 was 0.67, whereas the posterior probability of either D2 or D3 was 0.17. Following the presentation of the last symptom, the participant responded to two sets of prompts: the diagnosis prompts (as previously described in the instructions to the participants) and a single probability judgment of their highest ranked diagnosis. The probability judgment was elicited with the following prompt: "If you were presented 100 patients with the symptoms of the patient you just observed how many would have [INSERT HIGHEST RANKED DISEASE]?" The participant was then presented with the remaining symptom orders in the same manner, with distracter tasks intervening between each trial. The first order received by each participant was randomized between participants and the sequence of the remaining three orders was randomized within participants. Eighteen participants received each symptom order first.
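These posterior values can be checked with Bayes' rule. Because Table 1 is not reproduced here, the likelihoods below are assumed values chosen to be consistent with the reported posteriors; since the non-diagnostic symptoms are equally likely under every disease, they cancel out of the calculation and only S1 matters:

```python
def posterior(likelihoods, priors):
    """Bayes' rule over the three diseases for one symptom pattern."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)
    return [j / z for j in joint]

# Assumed values: P(S1+ | Metalytis) = 0.8, P(S1+ | Zymosis) = P(S1+ | Gwaronia) = 0.2.
print(posterior([0.8, 0.2, 0.2], [1/3, 1/3, 1/3]))  # -> [0.667, 0.167, 0.167]
```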

Hypotheses and Predictions

A recency effect was predicted on the grounds that more recent cues would be more active in working memory and would contribute to the hypothesis generation process to a greater degree than less recent cues. Given that the activation of the diagnostic symptom (S1) in working memory at the time of generation was predicted to increase in correspondence with its serial position, increases in the generation of Metalytis were predicted with greater recency of S1. As suggested by Figure 2, the context-activation model, under parameters based on previous work in list recall paradigms (Davelaar et al., 2005), predicts this general recency effect, as later items are more often more active in memory at the end of list presentation. Correspondingly, decreases in the generation of the alternatives to Metalytis were expected with increases in the serial position of S1. This prediction stems directly from the buffer activation dynamics of the context-activation model.

Results and discussion

The main DV for the analyses was the discrete generation vs. non-generation of Metalytis as the most likely disease (i.e., the first disease generated). All participants were included in the analyses regardless of performance in the diagnosis test phase, and there were no differences in results based on learning. Carry-over effects were evident, as demonstrated by a significant interaction between order condition and trial, χ²(3) = 12.68, p < 0.016 6 . In light of this, only the data from the first trial for each participant were subjected to further analysis, as it was assumed that this was the only uncontaminated trial for each subject. Nominal logistic regression was used to examine the effect of data serial position on the generation of Metalytis (the disease with the greatest posterior probability given the data). A logistic regression contrast test demonstrated a trend for the generation of Metalytis: it was more often generated as the most likely hypothesis with increases in the serial position of the diagnostic data, χ²(1) = 4.32, p < 0.05. The number of hypotheses generated did not differ between order conditions, F(3,68) = 0.567, p = 0.64, η_p² = 0.02, ranging from an average of 1.67–1.89 hypotheses. There were no differences in the probability judgments of Metalytis as a function of data order when it was generated as the most likely hypothesis (with group means ranging from 56.00 to 67.13), F(3,33) = 0.66, p = 0.58, η_p² = 0.06.

Simulating Experiment 1

To simulate Experiment 1, the model's episodic memory was endowed with the Disease × Symptom contingencies described in Table 1. On each trial, each symptom was presented to the buffer for 1500 iterations (mapping onto the presentation duration of 1500 ms) and the order of the symptoms was manipulated to match the symptom orders used in the experiment. 1000 iterations of the entire simulation were run for each condition 7 . The primary model output of interest was the first hypothesis generated on each trial. As demonstrated in Figure 6, the model captures the qualitative trend in the empirical data quite well. Although the rate of generation is slightly lower for the model, the model clearly captures the recency trend observed in the empirical data: increased generation of the most likely hypothesis corresponded to the recency of the diagnostic datum. This effect is directly attributable to the buffer activation weights being applied to the generation process. Although Figure 10 will become more pertinent later, the left-hand side of that figure demonstrates the recency gradient in the data activation weights produced by the model under these parameter settings. Inspection of the average weights for the first two data acquired shows them to be below the working memory threshold of 0.2. Therefore, on a large proportion of trials the model relied on only the third and fourth pieces of data (or just the last piece). This explains why the model performs around chance under the first two data orders and only deviates under orders three and four. Additionally, it should be noted that the model could provide a suitable quantitative fit to the empirical data by incorporating an assumption concerning the rate of guessing in the task or potentially by manipulating the working memory threshold. Although the aim of the current paper is to capture the qualitative effects evidenced in the data, future work may seek more precise quantitative fits.
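Assembling the earlier sketches, the simulation loop might look roughly as follows; generate_first is a hypothetical stand-in for the buffer-weighted HyGene generation process (weighted episodic match, probe extraction, and Luce sampling from the sketches above), and the noise level is illustrative:

```python
def simulate_condition(s1_pos, n_runs=1000, alpha=2.0, beta=0.2):
    """Proportion of runs on which the S1-implied hypothesis (Metalytis) is
    generated first when the diagnostic symptom occupies serial position s1_pos."""
    hits = 0
    for _ in range(n_runs):
        acts = run_buffer(alpha=alpha, beta=beta, sigma=0.1)
        w = np.where(acts > 0.2, acts, 0.0)            # WM threshold
        # serial positions occupied by S1..S4 under this order condition
        positions = [s1_pos - 1] + [p for p in range(4) if p != s1_pos - 1]
        data_weights = [w[positions[k]] for k in range(4)]
        hits += generate_first(data_weights) == "Metalytis"   # hypothetical helper
    return hits / n_runs

# serial position curve across the four order conditions (cf. Figures 4 and 6)
curve = [simulate_condition(pos) for pos in (1, 2, 3, 4)]
```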


Figure 6. Empirical data (solid line) and model data (dashed line) for Experiment 1, plotting the probability of reporting D1 (Metalytis) as most likely across order conditions. Error bars represent standard errors of the mean.

The primary prediction of the experiment was confirmed. The generation of the most likely hypothesis increased in correspondence with increasing recency of the diagnostic data (i.e., symptom). This finding clearly demonstrates that not all available data contribute equally to the hypothesis generation process (i.e., some data are weighted more heavily than others) and that the serial position of a datum can be an important factor governing the weight allocated to it in the generation process. Furthermore, these results are consistent with the notion that the data weightings utilized in the generation process are governed by the amount of working memory activation possessed by each datum.

There are, however, two alternative explanations for the present finding to consider that do not necessarily implicate unequal weightings of data in working memory as governing generation. First, it could be the case that all data resident in working memory at the time of generation were equally weighted, but that the likelihood of S1 dropping out of working memory increased with its distance in time from the generation prompt. Such a discrete utilization (i.e., all that matters is that data are in or out of working memory regardless of the activation associated with individual data) would likely result in a more gradual recency effect than seen in the data. Future investigations measuring working memory capacity could provide illuminating tests of this account. If generation is sensitive to only the presence or absence of data in working memory (as opposed to graded activations of the data in working memory) it could be expected that participants with higher capacity would be less biased by serial order (as shown in Lange et al., 2012 ) or would demonstrate the bias at a different serial position relative to those with lower capacity.

A second alternative explanation could be that the participants engaged in spontaneous rounds of generation following each piece of data as it was presented. Because hypothesis generation performance was only assessed after the final piece of data in the present experiment, such "step-by-step" generation would result in stronger generation of Metalytis as the diagnostic datum is presented closer to the end of the list. For instance, if spontaneous generation occurs as each piece of data is presented, then when the diagnostic datum is presented first, there remain three more rounds of generation (based on non-diagnostic data in this case) that could obscure the generation of the initial round. As the diagnostic datum moves closer to the end of the data stream, the likelihood that that particular round of generation will be obscured by forthcoming rounds diminishes. It is likely that the present data represent a mixture of participants who engaged in such spontaneous generation and those who did not engage in generation until prompted. This is likely the reason for the quantitative discrepancy between the model and empirical data. Future investigations could attempt to determine the likelihood that a participant will engage in such spontaneous generation and the conditions making it more or less likely.

The probability judgments observed in the present experiments did not differ across order conditions. Because the probability judgments were only elicited for the highest ranked hypothesis, the conditions under which the probability judgments were collected were highly constrained. It should be noted that the focus of the present experiment was to address generation behavior and the collection of the judgment data was ancillary. An independent experiment manipulating serial order in the manner done here and designed explicitly for the examination of judgment behavior would be useful for examining the influence of specific data serial positions on probability judgments. This would be interesting as HyGene predicts the judged probability of a hypothesis to be directly influenced by the relative support for the hypotheses currently in working memory. In so far as serial order influences the hypotheses generated into working memory, effects of serial position on probability judgment are likely to be observed as well.

The goal of Experiment 1 was to determine how relative data serial position affects the contribution of individual data to hypothesis generation processes. It was predicted that data presented later in the sequence would be more active in working memory and would thereby contribute more to the generation process based on the dynamics of the context-activation buffer. Such an account predicts a recency profile for the generation of hypotheses from LTM. This effect was obtained and is well-captured by our model in which such differences in the working memory activation possessed by individual data govern the generation process. Despite these positive results, however, the specific processes underlying this data are not uniquely discernible in the present experiment as the aforementioned alternative explanations likely predict similar results. Converging evidence for the notion that data activation plays a governing role in the generation process should be sought.

Experiment 2: Data Maintenance and Data Consistency

When acquiring information from the world that we may use as cues for the generation of hypotheses we acquire these cues in variously sized sets. In some cases we might receive several pieces of environmental data over a brief period, such as when a patient rattles off a list of symptoms to a physician. At other times, however, we receive cues in isolation across time and generate hypotheses based on the first cue and update this set of hypotheses as further data are acquired, such as when an underlying cause of car failure reveals itself over a few weeks. Such circumstances are more complicated as additional processes come into play as further data are received and previously generated hypotheses are evaluated in light of the new data. Hogarth and Einhorn (1992) refer to this task characteristic as the response mode.

In the context of understanding dynamic hypothesis generation this distinction is of interest, as it contrasts hypothesis generation following the acquisition of a set of data with a situation in which hypotheses are generated (and updated or discarded) while further data are acquired and additional hypotheses generated. An experiment manipulating this response mode variable in a hypothesis generation task was conducted by Sprenger and Dougherty (2012, Experiment 3), in which people hypothesized about which psychology courses were being described by various keywords. The two response modes are step-by-step (SbS), in which a response is elicited following each piece of incoming data, and end-of-sequence (EoS), in which a response is made only after all the data have been acquired as a grouped set. Following the last piece of data, the SbS conditions exhibited clear recency effects, whereas the EoS conditions did not demonstrate reliable order effects. A careful reader may notice a discrepancy between the lack of order effects in their EoS condition and the recency effect in the present Experiment 1 (which essentially represents an EoS condition). In the Sprenger and Dougherty experiment, the participants received nine cues from which to generate hypotheses, as opposed to the four cues in our Experiment 1. As the amount of data in their experiment exceeded working memory capacity more severely, it is likely that the cue usage strategies utilized by the participants differed between the two experiments. Indeed, it is important to gain a deeper understanding of such cue usage strategies in order to develop a better understanding of dynamic hypothesis generation.

The present experiment compared response modes to examine differences between data maintenance prior to generation (EoS mode) and generation that does not encourage the maintenance of multiple pieces of data (SbS mode). Considered in another light, SbS responding can be thought of as encouraging an anchoring and adjustment process in which the set of hypotheses generated in response to the first piece of data supplies the set of beliefs within which forthcoming data may be interpreted. The EoS condition, on the other hand, does not engender such belief anchoring, as generation is not prompted until all data have been observed. As such, the SbS conditions allow investigation of a potential propensity to discard previously generated hypotheses and/or generate new hypotheses in the face of inconsistent data.

Participants

One hundred fifty-seven participants from the University of Oklahoma participated in this experiment for course credit.

Design and procedure

As previously mentioned, the first independent variable was the timing of the generation and judgment prompts provided to the participant, as dictated by the response mode condition. This factor was manipulated within-subjects. The second independent variable, manipulated between-subjects, was the consistency of the second symptom (S2) with the hypotheses likely to be entertained by the participant following the first symptom. This consistency or inconsistency was manipulated within the ecologies learned by the participants, as displayed in Table 3. In addition, this table shows the temporal order in which the symptoms were presented in the elicitation phase of this experiment (i.e., S1 → S2 → S3 → S4). Note that only positive symptom (i.e., symptom present) states were presented in the elicitation phase. The only difference between the ecologies was the conditional probability of S2 being positive under D1. This probability was 0.9 in the "consistent ecology" and 0.1 in the "inconsistent ecology." Given that S1 should prompt the generation of D1 and D2, this manipulation of the ecology can be seen to govern the consistency of S2 with the hypothesis(es) currently under consideration following S1. This can be seen in Table 4, which displays the Bayesian posterior probabilities for each disease following each symptom. Seventy-nine participants were in the consistent ecology condition and 78 participants were in the inconsistent ecology condition. Response mode was counterbalanced within ecology condition.


Table 3. Disease × Symptom ecologies of Experiment 2.


Table 4. Bayesian posterior probabilities as further symptoms are acquired within each ecology of Experiment 2.

The procedure was much like that of Experiment 1: exemplar training to learn the probability distributions, a test to verify learning (for which a $5.00 gift card could be earned for performance greater than 60%) 8 , and a distracter task prior to elicitation. The experiment was again cast in terms of medical diagnosis, where D1, D2, and D3 represented fictitious disease states and S1–S4 represented various test results (i.e., symptoms).

There were slight differences in each phase of the procedure, however. The exemplars presented in the exemplar training phase were simplified and consisted of the disease name and a single test result (as opposed to all four). This change was made in an effort to enhance learning. Exemplars were blocked by disease such that a disease was selected at random without replacement. For each disease the participant was presented with 40 exemplars selected at random without replacement; therefore, over the course of these 40 exemplars the entire (and exact) distribution of symptoms was presented for that disease. This was then done for the remaining two diseases and the entire process was repeated two more times. The participant thus observed 120 exemplars per disease (inducing equal base rates for each disease) and observed the entire distribution three times. Each exemplar was again pseudo-self-paced, displayed on the screen for 1500 ms before the participant could proceed to the next exemplar by pressing the first letter of the disease. Patient cases in the diagnosis test phase likewise presented with only individual symptoms. Each of the eight possible symptom states was individually presented to the participants, who were asked to report the most likely disease given that particular symptom. Diseases with a posterior probability greater than or equal to 0.39 were tallied as correct responses.

In the elicitation phase, the prompts for hypothesis generation were the same as those used in Experiment 1, but the probability judgment prompt differed slightly. The judgment prompt used in the present experiment was as follows: “How likely is it that the patient has [INSERT HIGHEST RANKED DISEASE]? (Keep in mind that an answer of 0 means that there is NO CHANCE that the patient has [INSERT HIGHEST RANKED DISEASE] and that 100 means that you are ABSOLUTELY CERTAIN that the patient has [INSERT HIGHEST RANKED DISEASE].) Type in your answer from 1 to 100 and press Enter to continue.” Probability judgments were taken following each generation sequence in the SbS condition (i.e., there were four probability judgments taken, one for the disease ranked highest on each round of generation).

Hypotheses and predictions

The general prediction for the end-of-sequence response mode was that recency would be demonstrated in both ecologies as the more recent symptoms should contribute more strongly to the generation process as seen in Experiment 1. Therefore, greater generation of D3 relative to the alternatives was expected in both ecologies. The focal predictions for the SbS conditions concerned the generation behavior following S2. It was predicted that participants in the consistent ecology would generate D1 to a greater extent than those in the inconsistent ecology who were expected to purge D1 from their hypothesis set in response to its inconsistency with S2. It was additionally predicted that those in the inconsistent ecology would generate D3 to a greater extent at this point than those in the consistent ecology as they would utilize S2 to repopulate working memory with a viable hypothesis.

Results and discussion

As no interactions with trial order were detected, both trials from each subject were used in the present analyses, and no differences in results were found as a function of learning. The main dependent variable analyzed for this experiment was the hypothesis generated as most likely on each round of elicitation. All participants were included in the analyses regardless of performance in the diagnosis test phase. In order to test whether a recency effect obtained following the last symptom (S4), comparisons between the rates of generation of each disease were carried out within each of the four ecology-by-response mode conditions. Within the step-by-step conditions, the three diseases were generated at different rates in the consistent ecology according to Cochran's Q test, χ²(2) = 9.14, p < 0.05, but not in the inconsistent ecology, χ²(2) = 1.00, p = 0.61. In the end-of-sequence conditions, significant differences in generation rates were revealed in both the consistent ecology, χ²(2) = 17.04, p < 0.001, and the inconsistent ecology, χ²(2) = 7.69, p < 0.05.

As D2 was very unlikely in both ecologies, the comparison of interest in all cases is between D1 and D3. This pairwise comparison was carried out within each of the ecology-by-response mode conditions and reached significance only in the EoS mode in the consistent ecology, χ²(1) = 6.79, p < 0.01, with D1 generated to a greater degree than D3 according to Cochran's Q test. These results, displayed in Figure 7, demonstrate the absence of a recency effect in the present experiment. This pattern is additionally observed by comparing rates of D1 generation across the entire design, which demonstrated a main effect of ecology, χ²(1) = 8.87, p < 0.01, but no effect of mode, χ²(1) = 0.987, p = 0.32, and no interaction, χ²(1) = 0.554, p = 0.457.


Figure 7. Proportion of generation for each disease by response mode and ecology conditions. Error bars represent standard errors of the mean.

To test the influence of the inconsistent cue on the maintenance of D1 (the most likely disease in both ecologies following S1) in the SbS conditions, elicitation round (post-S1 and post-S2) was entered as an independent variable with ecology and tested in a 2 × 2 logistic regression. As plotted in Figure 8, this revealed a main effect of elicitation round, χ²(1) = 10.51, p < 0.01, an effect of ecology, χ²(1) = 6.65, p < 0.05, and a marginal interaction, χ²(1) = 3.785, p = 0.052. When broken down by ecology, it is evident that the effect of round and the marginal interaction were due to the decreased generation of D1 following S2 in the inconsistent ecology, χ²(1) = 10.51, p < 0.01, as there was no difference between rounds in the consistent ecology, χ²(1) = 0.41, p = 0.524.


Figure 8. Proportion of generation for each disease within the SbS condition following S1 and S2. Error bars represent standard errors of the mean.

This same analysis was done with D3 to examine potential differences in its rate of generation over these two rounds of generation. This test revealed a main effect of elicitation round, χ²(1) = 12.135, p < 0.001, but no effect of ecology, χ²(1) = 1.953, p = 0.162, and no interaction, χ²(1) = 1.375, p = 0.241.

Simulating Experiment 2

To model the EoS conditions, the model was presented with all four symptoms in sequence and run in conditions in which it was endowed with either the consistent or inconsistent ecology. This simulation was run for 1000 iterations in each condition. As is intuitive from the computational results of Experiment 1, when the model is run with the same parameters utilized in the previous simulation it predicts greater generation of D3 in both ecologies (i.e., recency), which was not observed in the present experiment. However, the model is able to capture the data of the EoS mode quite well by increasing the amount of recurrent activation that each piece of data recycles onto itself (alpha parameter) and the amount of lateral inhibition applied to each piece of data (beta parameter) as it is acquired prior to generation. These results appear alongside the empirical results in Figure 9. Although the model is able to capture the qualitative pattern in the data in the inconsistent ecology reasonably well with either set of parameters, it produces divergent results under the two alpha and beta levels in the consistent ecology. Only when recurrency and inhibition are increased does the model capture the data from both ecologies.


Figure 9. Empirical data (bars) from Experiment 2 for the EoS conditions in both ecologies plotted with model data (diamonds and circles) at two levels of alpha and beta. Error bars represent standard errors of the mean.

Examination of how the data activations are influenced by the increased alpha and beta levels reveals the underlying cause of this difference in generation. As displayed in Figure 10, there is a steep recency gradient in the data activations under alpha = 2 and beta = 0.2 (the parameters from Experiment 1), but a markedly different pattern of activations under alpha = 3 and beta = 0.4 9 . Most notably, these higher alpha and beta levels cause the earlier pieces of data to reach high levels of activation, which then suppress the activation levels of later data. This is due to the competitive dynamics of the buffer, which restrict the rise of activation for later items under high alpha and beta values, resulting in a primacy gradient in the activation values as opposed to the recency gradient observed under the lower values.


Figure 10. Individual data activations under both levels of alpha and beta.
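The two activation regimes can be illustrated by reusing run_buffer from the earlier sketch (the noise level is again illustrative):

```python
# Reuses run_buffer (and numpy as np) from the earlier sketch.
for a, b in [(2.0, 0.2), (3.0, 0.4)]:
    acts = np.mean([run_buffer(alpha=a, beta=b, sigma=0.1) for _ in range(1000)],
                   axis=0)
    print(f"alpha={a}, beta={b}:", np.round(acts, 2))
# Low alpha/beta: a recency gradient over the four data (early items tend to fall
# below the 0.2 threshold); high alpha/beta: early items dominate, a primacy gradient.
```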

To capture the SbS conditions for generation following S1 and generation following S2, the model was presented with different amounts of data on different trials. Specifically, the model was presented with S1 only, capturing the situation in which only the first piece of data had been received, or with S1 and S2 successively, capturing the SbS condition following the second piece of data. This was done for both ecologies in order to assess the effects of data inconsistency on the model's generation behavior 10 . As can be seen in Figure 11, the model captures the empirical data quite well following S1 while providing a decent, although imperfect, account of the post-S2 data as well 11 . Focally, the model as implemented captures the influences of S2 on the hypothesis sets generated in response to S1. Following S2 in the inconsistent ecology, generation of D1 decreases substantially, capturing its purging from working memory. Additionally, the increases in the generation of D3 are present in both ecologies.


Figure 11. Empirical data (bars) from Experiment 2 in the SbS conditions following S1 and S2 plotted with model data (diamonds). Error bars represent standard errors of the mean.

The present experiment has provided a window into two distinct processing dynamics. The first dynamic under investigation was how generation differs when based on the acquisition of a set of data (EoS condition) vs. when each piece of data is acquired in isolation (SbS condition). The generation behavior across these conditions was somewhat similar overall, as neither D1 nor D3 dominated generation in three of the four conditions. The EoS consistent ecology condition, however, was clearly dominated by D1. This result stands in contrast to the prediction of recency in the EoS conditions, which would have been evidenced by higher rankings of D3 in both ecologies.

The divergence between the recency effect in Experiment 1 and the absence of a recency effect in the EoS conditions of Experiment 2 is surprising. In order for the model to account for the elimination of the recency effect, an adjustment was made to the alpha and beta parameters, which govern how much activation each piece of data recycles onto itself and the level of competition between items, thereby eliminating the recency gradient in the activations. Moreover, the last piece of data did not contribute as often or as strongly to the cue to LTM under these settings. Therefore, rather than a recency effect, the model suggests a primacy effect whereby the earlier cues contributed more to generation than the later cues. As we did not manipulate serial order in the present experiment, it is difficult to assert a primacy effect based on the empirical data alone. The model's account of the current data, however, certainly suggests that a primacy gradient is needed to capture the results. Additionally, a recent experiment in a similar paradigm utilizing an EoS response mode demonstrated a primacy effect in a diagnostic reasoning task (Rebitschek et al., 2012), suggesting that primacy may be somewhat prevalent under EoS data acquisition conditions.

As for why the earlier cues may have enjoyed greater activation in the present experiment relative to Experiment 1, we need to consider the main difference between these paradigms. The largest difference was that in the present experiment each piece of data in the ecology carried a good amount of informational value, whereas in Experiment 1 three of the four symptoms in the ecology were entirely non-diagnostic. It is possible that this difference between information-rich and information-scarce ecologies unintentionally led to a change in how the participants allocated their attention over the course of the data streams between the two experiments. As all of the data in Experiment 2 were somewhat useful, the participants may have used this as a cue to utilize as much of the information as possible, thereby rehearsing/reactivating the data as much as possible prior to generation. In contrast, the information-scarce ecology of Experiment 1 would not have incentivized such maximization of the data activations for most of the data. Future experiments could address how the complexity of the ecology might influence dynamic attentional allocation during data acquisition.

The second dynamic explored was how inconsistent data influence the hypotheses currently under consideration. In the step-by-step conditions it was observed that a previously generated hypothesis was purged from working memory in response to the inconsistency of a newly received cue. This can be viewed as consistent with an extension of the consistency checking mechanism employed in the original HyGene framework. The present data suggest that hypotheses currently under consideration are checked against newly acquired data and are purged in accordance with their degree of (in)consistency. This is different from, although entirely compatible with, the operation of the original consistency checking mechanism, which operates over a single round of hypothesis generation. The consistency checking operation within the original version of HyGene checks each hypothesis retrieved into working memory for its consistency with the data used as a cue to its retrieval as the SOC is populated. The consistency checking exposed in the present experiment, however, suggests that people also check the consistency of newly acquired data against hypotheses generated on previous rounds of generation. If the previously generated hypotheses fall below some threshold of agreement with the newly acquired data, they are purged from working memory. Recent work by Mehlhorn et al. (2011) also investigated the influence of consistent and inconsistent cues on the memory activation of hypotheses. They utilized a clever adaptation of the lexical decision task to assess the automatic memory activation of hypotheses as data were presented and found that memory activation was sensitive to the consistency of the data. As the present experiment utilized overt report, these findings complement one another quite well, as automatic memory activation can be understood as a precursor to the generation of hypotheses into working memory. The present experiment additionally revealed that S2 was used to re-cue LTM, as evidenced by increased generation of D3 following S2. In contrast to the prediction that this would occur only in the inconsistent ecology, this re-cuing was observed in both ecologies. Lastly, although the model as currently implemented represents a simplification of the participants' task in the SbS conditions, it was able to capture these effects.
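A sketch of this extended consistency check, reusing activation from the earlier sketch, might look as follows; the agreement measure, threshold, and data structure are assumptions rather than the published mechanism:

```python
def purge_inconsistent(soc, new_datum, episodic_by_hyp, min_agreement=0.0):
    """Re-evaluate hypotheses already in the SOC against a newly acquired datum
    and drop any whose agreement falls below a threshold (assumed mechanism)."""
    kept = []
    for hyp in soc:
        traces = episodic_by_hyp.get(hyp, [])   # that datum's minivectors stored under hyp
        agreement = (np.mean([activation(new_datum, d) for d in traces])
                     if traces else 0.0)
        if agreement >= min_agreement:
            kept.append(hyp)                    # consistent enough: retained in the SOC
    return kept
```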

General Discussion

This paper presented a model of dynamic data acquisition and hypothesis generation, which was then used to account for data from two experiments investigating three consequences of hypothesis generation being extended over time. Experiment 1 varied the serial position of a diagnostic datum and demonstrated a recency effect whereby the hypothesis implied by this datum was generated more often when the datum appeared later in the data stream. Experiment 2 examined how generation might differ when it is based on isolated data acquired one at a time (step-by-step response mode) vs. when generation is based upon the acquisition of the entire set of data (end-of-sequence response mode). Secondly, the influence of an inconsistent cue (conflicting with hypotheses suggested by the first datum) was investigated by manipulating a single contingency of the data-hypothesis ecology in which the participants were trained. It was found that the different response modes did not influence hypothesis generation a great deal, as the two most likely hypotheses were generated at roughly the same rates in most cases. The difference that was observed, however, was that the most likely hypothesis was favored in the EoS condition within the consistent ecology. This occurred in contrast to the prediction of recency for both EoS conditions, thereby suggesting that the participants weighted the data more equally than in Experiment 1, or perhaps weighted the earlier cues slightly more heavily. Data from the SbS conditions following the acquisition of the inconsistent cue revealed that this cue caused participants to purge from working memory a previously generated hypothesis that was incompatible with the newly acquired data. Moreover, this newly acquired data was utilized to re-cue LTM. Interestingly, this re-cuing was demonstrated in both ecologies and was therefore not contingent on the purging of hypotheses from working memory.

Given that the EoS conditions of Experiment 2 were procedurally very similar to the procedure used in Experiment 1, it becomes important to reconcile their contrasting results. As discussed above, the main factor distinguishing these conditions was the statistical ecology defining their respective data-hypothesis contingencies. The ecology of the first experiment contained mostly non-diagnostic data, whereas each datum in the ecology utilized in Experiment 2 carried information as to the relative likelihood of each hypothesis. It is possible that this difference of relative information scarcity and information richness influenced the processing of the data streams between the two experiments. In order to capture the data from Experiment 2 with our model, the level of recurrent activation recycled by each piece of data was adjusted upwards and lateral inhibition was increased, thereby giving the early items a large processing advantage over the later pieces of data. Although post hoc, this suggests the presence of a primacy bias. It is then perhaps of additional interest to note that the EoS results resemble the SbS results following D2, particularly within the consistent ecology. This could be taken to suggest that those in the EoS condition were utilizing the initial cues more heavily than the later cues. Fisher (1987) suggested that people tend to use a subset of the pool of provided data and estimated that people generally use two cues when three are available and three cues when four are available. Interestingly, the model forwarded in the present paper provides support for this estimate, as it used three of the four available cues in accounting for the EoS data in Experiment 2. While the utilization of three as opposed to four data could be understood as resulting from working memory constraints, why people would fail to utilize all three pieces of data when only three are available is less clear. Future investigation of the conditions under which people underutilize available data in three- and four-datum hypothesis generation problems could be illuminating for the working memory dynamics of these tasks.

It is also important to compare the primacy effect in the EoS conditions with the results of Sprenger and Dougherty (2012), in which the SbS conditions revealed recency (Experiments 1 and 3) and no order effects were revealed in the EoS conditions (only implemented in Experiment 3). Why the SbS results of the present experiment do not demonstrate recency, as in their Experiments 1 and 3, is unclear. The ecologies used in these experiments were quite different, however, and it could be that the ecology implemented in their experiments was better able to capture this effect. Moreover, they explicitly manipulated data serial order, and it was through this manipulation that the recency effect was observed. As serial order was not manipulated in the present experiment, we did not have the opportunity to observe recency in the same fashion and instead relied on relative rates of generation given one data ordering. Perhaps the manipulation of serial order within the present ecology would uncover recency as well.

In comparing the present experiment to the procedure of Sprenger and Dougherty's Experiment 3, a clearer reason for the diverging results is available. In their experiment, the participants were presented with a greater pool of data from which to generate hypotheses, nine pieces in total. Participants in the present experiment, on the other hand, were provided with only four cues. It is quite possible that people's strategies for cue usage would differ between these conditions. Whereas the present experiment provided enough data to fill working memory to capacity (or barely breach it), Sprenger and Dougherty's experiment provided an abundance of data, thereby providing insight into a situation in which the data could not be held in working memory all at once. It is possible that the larger pool of data engendered a larger pool of strategies than in the present study. Understanding the strategies that people employ and the retrieval plans developed under such conditions (Raaijmakers and Shiffrin, 1981; Gillund and Shiffrin, 1984; Fisher, 1987), as well as how these processes contrast with situations in which fewer cues are available, is a crucial aspect of dynamic memory retrieval in need of better understanding.

The model presented in the present work represents a fusion of the HyGene model (Thomas et al., 2008) with the activation dynamics of the context-activation model of memory (Davelaar et al., 2005). As the context-activation model provides insight into the working memory dynamics underlying list memory tasks, it provides a suitable guidepost for understanding some of the likely working memory dynamics supporting data acquisition and hypothesis generation over time. The present model acquires data over time whose activations systematically ebb and flow in concert with the competitive buffer dynamics borrowed from the context-activation model. The resulting activation levels possessed by each piece of data are then used as weights in the retrieval of hypotheses from LTM. In addition to providing an account of the data from the present experiments, this model has demonstrated further usefulness by suggesting potentially fruitful areas of future investigation.
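
For readers who want the mechanics, the sketch below reconstructs one time-step of the competitive buffer dynamics, using the parameter values reported in the footnotes (alpha = 2.0, beta = 0.2, lambda = 0.98). The squashing function and noise term follow the general form of Davelaar et al. (2005), but the details here are a simplified reconstruction rather than the authors' code.

```python
import numpy as np

ALPHA, BETA, LAMBDA = 2.0, 0.2, 0.98  # values from the simulation footnote
SIGMA = 1.0                           # noise s.d.; an illustrative assumption

def squash(x):
    # Map activations to bounded outputs (simplified reconstruction).
    return np.where(x > 0, x / (1 + x), 0.0)

def buffer_step(x, input_drive, rng):
    # One step of the buffer: self-recurrency (ALPHA) sustains active items,
    # lateral inhibition (BETA) lets items suppress one another, and LAMBDA
    # controls how slowly activations change toward the new drive.
    out = squash(x)
    inhibition = BETA * (out.sum() - out)  # inhibition from all other items
    drive = ALPHA * out - inhibition + input_drive
    return LAMBDA * x + (1 - LAMBDA) * (drive + rng.normal(0, SIGMA, x.shape))

# The resulting data activations then weight the LTM retrieval cue, e.g.:
#   probe = activations @ data_traces   # activation-weighted composite probe
```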

The modeling presented here represents the first step of a work in progress. As we are working toward a fully dynamical model of data acquisition, hypothesis generation, maintenance, and use in decision-making tasks, additional facets clearly still await inclusion. Within the current implementation of the model, only the environmental data are subject to the activation dynamics of the working memory buffer. In future work, hypotheses generated into working memory (HyGene's SOC) will additionally be sensitive to these dynamics. This will provide us with the means of fully capturing the hypothesis maintenance dynamics (e.g., step-by-step generation) that the present model ignores. Moreover, by honoring such dynamic maintenance processes, we may be able to address considerations of what information people utilize at different portions of a hypothesis generation task. For instance, when data are acquired over long lags (e.g., minutes), it is unclear what information people use to populate working memory with hypotheses at different points in the task. If someone is reminded of the diagnostic problem they are trying to solve, do they recall the hypotheses directly (e.g., via contextual retrieval), or do they sometimes recall previous data to be combined with new data and re-generate the current set of hypotheses? Presumably both strategies are prevalent, but the conditions under which each is more or less likely to manifest are unclear. It is hoped that this more fully specified model may provide insight into situations favoring one over the other.

As pointed out by Sprenger and Dougherty (2012), a fuller understanding of hypothesis generation dynamics will entail learning about how working memory resources are dynamically allocated between data and hypotheses over time. One way this could be achieved in the forthcoming model would be to have two sets of information available for use at any given time, one being the set of relevant data (RED) and the other being the SOC hypotheses. The competitive dynamics of the buffer could be brought to bear between these sets by allowing them to inhibit one another, thereby instantiating competition between the items in these sets for the same limited resource, as sketched below. Setting up the model in this or a similar manner would be informative for addressing the dynamic working memory tradeoffs that are struck between data and hypotheses over time.
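
Building on the buffer sketch above, one speculative way to instantiate this data/hypothesis competition is to let the two sets inhibit each other in addition to their within-set competition; everything below, including the cross-set inhibition strength, is an assumption for illustration, not part of the implemented model.

```python
def two_set_step(data_x, hyp_x, data_drive, hyp_drive, rng, cross_beta=0.2):
    # Relevant data (RED) and SOC hypotheses compete for the same limited
    # resource: each set's total output inhibits the other set's items.
    data_out, hyp_out = squash(data_x), squash(hyp_x)
    data_x = buffer_step(data_x, data_drive - cross_beta * hyp_out.sum(), rng)
    hyp_x = buffer_step(hyp_x, hyp_drive - cross_beta * data_out.sum(), rng)
    return data_x, hyp_x
```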

In addition, this more fully elaborated model could inform maintenance dynamics as hypotheses are utilized to render judgments and decisions. The output of the judgment and decision processes could cohabit the working memory buffer, and its maintenance and potential influence on other items' activations could be gauged across time. Lastly, as the model progresses in future work, it will be important and informative to examine the model's behavior more broadly. For the present paper we have focused on the first hypothesis generated in each round of generation. The generation behavior of people, and of the model, of course furnishes more than one hypothesis into working memory. Further work with this model has the potential to provide a richer window into hypothesis generation behavior by taking a greater focus on the full hypothesis sets considered over time.

Developing an understanding of the temporal dynamics governing the rise and fall of beliefs over time is a complicated problem in need of further investigation and theoretical development. This paper has presented an initial model of how data acquisition dynamics influence the generation of hypotheses from LTM, along with two experiments considering three distinct processing dynamics. It was found that the recency of the data sometimes, but not always, biases the generation of hypotheses. Additionally, it was found that previously generated hypotheses are purged from working memory in light of new data with which they are inconsistent. Future work will develop a more fully specified model of dynamic hypothesis generation, maintenance, and use in decision-making tasks.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

  • ^ For a more thorough treatment of HyGene’s computational architecture please see Thomas et al. (2008) or Dougherty et al. (2010) .
  • ^ This was done from the pragmatic view that the buffer cannot apply noise to an item representation that does not yet exist in the environment or in the system. A full and systematic analysis of how this assumption affects the behavior of the buffer has not been carried out as of yet, but in the context of the current simulations preliminary analysis suggests that this change affects the activation values produced by the buffer only slightly.
  • ^ This working memory threshold has been carried over from the context-activation model as it proved valuable for that model’s account of data from a host of list recall paradigms ( Davelaar et al., 2005 ).
  • ^ Previous investigations in our lab utilizing exemplar training tasks have demonstrated variation in the conclusions drawn from results conditionalized on such learning data versus the entire non-conditionalized data set. Including this learning test therefore allows us to check for such discrepancies, in addition to obtaining data that may inform how greater or lesser learning influences the generation process.
  • ^ Thirty-five participants (48%) exceeded this 60% criterion.
  • ^ This carry-over effect was not entirely surprising as the same symptom states were presented for every patient and our manipulation of serial order was likely transparent on later trials.
  • ^ The parameters used for this simulation were the following. Original HyGene parameters: L = 0.85, Ac = 0.1, Phi = 4, KMAX = 8. Context-activation model parameters: Alpha = 2.0, Beta = 0.2, Lambda = 0.98, Delta = 1. Note, these parameters were based on values utilized in previous work and were not chosen based on fitting the model to the current data.
  • ^ Eighty-eight participants (56%) exceeded this 60% criterion.
  • ^ These parameter values were based on a grid search to examine the neighborhood of values capturing the qualitative patterns in the data and not based on a quantitative fit to the empirical data.
  • ^ This is, of course, a simplification of the participant’s task in the SbS condition. This is addressed in the general discussion.
  • ^ This simulation was run with alpha = 3 and beta = 0.4.

Anderson, N. H. (1965). Primacy effects in personality impression formation using a generalized order effect paradigm. J. Pers. Soc. Psychol. 2, 1–9.


Anderson, N. H. (1973). Serial position curves in impression formation. J. Exp. Psychol. 97, 8–12.

Cooper, R. P., Yule, P., and Fox, J. (2003). Cue selection and category learning: a systematic comparison of three theories. Cogn. Sci. Q. 3, 143–182.

Davelaar, E. J., Goshen-Gottstein, Y., Ashkenazi, A., Haarmann, H. J., and Usher, M. (2005). The demise of short term memory revisited: empirical and computational investigations of recency effects. Psychol. Rev. 112, 3–42.


Dougherty, M. R. P., Gettys, C. F., and Ogden, E. E. (1999). A memory processes model for judgments of likelihood. Psychol. Rev. 106, 180–209.

Dougherty, M. R. P., and Hunter, J. E. (2003a). Probability judgment and subadditivity: the role of WMC and constraining retrieval. Mem. Cognit. 31, 968–982.

Dougherty, M. R. P., and Hunter, J. E. (2003b). Hypothesis generation, probability judgment, and working memory capacity. Acta Psychol . (Amst.) 113, 263–282.

Dougherty, M. R. P., Thomas, R. P., and Lange, N. (2010). Toward an integrative theory of hypothesis generation, probability judgment, and hypothesis testing. Psychol. Learn. Motiv. 52, 299–342.

Fisher, S. D. (1987). Cue selection in hypothesis generation: Reading habits, consistency checking, and diagnostic scanning. Organ. Behav. Hum. Decis. Process. 40, 170–192.

Gillund, G., and Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychol. Rev. 91, 1–67.

Hintzman, D. L. (1986). “Schema Abstraction” in a multiple-trace memory model. Psychol. Rev. 93, 411–428.

Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychol. Rev. 95, 528–551.

Hogarth, R. M., and Einhorn, H. J. (1992). Order effects in belief updating: the belief-adjustment model. Cogn. Psychol. 24, 1–55.

Lange, N. D., Thomas, R. P., and Davelaar, E. J. (2012). “Data acquisition dynamics and hypothesis generation,” in Proceedings of the 11th International Conference on Cognitive Modelling , eds N. Rußwinkel, U. Drewitz, J. Dzaack, H. van Rijn, and F. Ritter (Berlin: Universitaetsverlag der TU), 31–36.

McKenzie, C. R. M. (1998). Taking into account the strength of an alternative hypothesis. J. Exp. Psychol. Learn. Mem. Cogn. 24, 771–792.

Mehlhorn, K., Taatgen, N. A., Lebiere, C., and Krems, J. F. (2011). Memory activation and the availability of explanations in sequential diagnostic reasoning. J. Exp. Psychol. Learn. Mem. Cogn. 37, 1391–1411.

Murdock, B. B. (1962). The serial position effect of free recall. J. Exp. Psychol. 64, 482–488.

Nelson, J. D., McKenzie, C. R. M., Cottrell, G. W., and Sejnowski, T. J. (2010). Experience matters: information acquisition optimizes probability gain. Psychol. Sci. 21, 960–969.

Page, M. P. A., and Norris, D. (1998). The primacy model: a new model of immediate serial recall. Psychol. Rev. 105, 761–781.

Raaijmakers, J. G. W., and Shiffrin, R. M. (1981). Search of associative memory. Psychol. Rev. 88, 93–134.

Rebitschek, F., Scholz, A., Bocklisch, F., Krems, J. F., and Jahn, G. (2012). “Order effects in diagnostic reasoning with four candidate hypotheses,” in Proceedings of the 34th Annual Conference of the Cognitive Science Society , eds N. Miyake, D. Peebles, and R. P. Cooper (Austin, TX: Cognitive Science Society) (in press).

Sprenger, A., and Dougherty, M. P. (2012). Generating and evaluating options for decision making: the impact of sequentially presented evidence. J. Exp. Psychol. Learn. Mem. Cogn. 38, 550–575.

Sprenger, A., and Dougherty, M. R. P. (2006). Differences between probability and frequency judgments: the role of individual differences in working memory capacity. Organ. Behav. Hum. Decis. Process. 99, 202–211.

Sprenger, A. M., Dougherty, M. R., Atkins, S. M., Franco-Watkins, A. M., Thomas, R. P., Lange, N. D., and Abbs, B. (2011). Implications of cognitive load for hypothesis generation and probability judgment. Front. Psychol. 2:129. doi:10.3389/fpsyg.2011.00129

Thomas, R. P., Dougherty, M. R., Sprenger, A. M., and Harbison, J. I. (2008). Diagnostic hypothesis generation and human judgment. Psychol. Rev. 115, 155–185.

Usher, M., Davelaar, E. J., Haarmann, H., and Goshen-Gottstein, Y. (2008). Short term memory after all: comment on Sederberg, Howard, and Kahana (2008). Psychol. Rev. 115, 1108–1118.

Weiss, D. J., and Anderson, N. H. (1969). Subjective averaging of length with serial position. J. Exp. Psychol. 82, 52–63.

Keywords: hypothesis generation, temporal dynamics, working memory, information acquisition, decision making

Citation: Lange ND, Thomas RP and Davelaar EJ (2012) Temporal dynamics of hypothesis generation: the influences of data serial order, data consistency, and elicitation timing. Front. Psychology 3:215. doi: 10.3389/fpsyg.2012.00215

Received: 24 January 2012; Accepted: 09 June 2012; Published online: 29 June 2012.

Copyright: © 2012 Lange, Thomas and Davelaar. This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License , which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.

*Correspondence: Nicholas D. Lange, Department of Psychological Sciences, Birkbeck College, University of London, Malet Street, London WC1E 7HX, UK. e-mail: ndlange@gmail.com


Musing 20: Hypothesis Generation with Large Language Models

A paper out of the University of Chicago on whether LLMs can generate novel 'scientific hypotheses'.


Today’s paper: Hypothesis Generation with Large Language Models. Zhou et al. 5 Apr 2024. https://arxiv.org/pdf/2404.04326.pdf

A number of my colleagues and I have always had the dream of an AI (whether or not we build it) that is able to make autonomous scientific progress worthy of a Nobel prize, or an equivalent. For example, can an AI discover a novel theory of economics? Be capable of proving a theorem that hasn't been proven by a human? Prove or disprove the famous P=NP conjecture in theoretical computer science?

Of course, we’re far from that at present…or are we? That’s what my brief musing today will address (I’ve kept it deliberately short and sweet, since not all readers here are scientists!). The title of the paper really says it all. It’s not claiming that an LLM can do science completely on its own, but it is looking at a particularly creative element in scientific research: the formulation of interesting hypotheses that, when investigated, will lead to novel scientific insights.


This paper specifically explores the potential of large language models (LLMs) to generate novel hypotheses, focusing in particular on data-driven hypothesis generation. The authors present a computational framework, HypoGeniC (Hypothesis Generation in Context), which iteratively generates and updates hypotheses to improve predictive performance in classification tasks. This approach is inspired by the multi-armed bandit problem, utilizing a reward function to balance exploration and exploitation during hypothesis generation. Key contributions include the following (a sketch of the core loop appears after the list):

Introduction of a Novel Computational Framework: The paper proposes HypoGeniC, a novel framework for generating and evaluating hypotheses with LLMs. This method starts with generating initial hypotheses from a few examples and iteratively updates them to enhance their quality and predictive power.

Improvement Over Existing Methods: The generated hypotheses significantly outperform few-shot in-context learning and supervised learning benchmarks in both synthetic and real-world datasets. For example, accuracy improvements include 31.7% on a synthetic dataset and varying degrees (13.9%, 3.3%, and 24.9%) on three real-world datasets, demonstrating the robustness of the hypotheses across different settings.

Generation of High-Quality, Interpretable Hypotheses: The framework not only corroborates existing human-verified theories but also uncovers new insights, thereby contributing to the body of knowledge in the field. This quality ensures that the hypotheses are not only useful for improving classification performance but also valuable for advancing scientific understanding.

Cross-Generalization Capability: Hypotheses generated are not only applicable to the LLMs they were produced from but also show effectiveness when applied to other models. This indicates the general applicability of the generated hypotheses across different large language models.

Robustness Across Datasets: The generated hypotheses are shown to be robust across different datasets, including out-of-distribution sets, where they can outperform oracle fine-tuned models like RoBERTa.
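
As a rough illustration of that iterative loop, the sketch below reconstructs the generate-test-update cycle with a UCB-style reward. The `llm.generate_hypotheses` and `llm.predict` helpers are assumed interfaces, and the reward details are simplified relative to the paper; treat this as a reading aid, not the authors' code.

```python
import math

def ucb_score(s, t, c=1.0):
    # Mean accuracy plus an exploration bonus (multi-armed-bandit style).
    mean = s["correct"] / s["tried"] if s["tried"] else 0.0
    return mean + c * math.sqrt(math.log(t + 2) / (s["tried"] + 1))

def hypogenic_sketch(llm, examples, num_init=5):
    # Seed hypotheses from a few examples, then treat each hypothesis as a
    # bandit arm: test the best-scoring one, generating replacements on misses.
    pool = llm.generate_hypotheses(examples[:num_init])        # assumed helper
    stats = {h: {"correct": 0, "tried": 0} for h in pool}
    for t, ex in enumerate(examples[num_init:]):
        best = max(stats, key=lambda h: ucb_score(stats[h], t))
        stats[best]["tried"] += 1
        if llm.predict(best, ex) == ex.label:                  # assumed helper
            stats[best]["correct"] += 1
        else:  # a miss triggers generation of fresh candidate hypotheses
            for h in llm.generate_hypotheses([ex]):
                stats.setdefault(h, {"correct": 0, "tried": 0})
    t_final = len(examples)
    return sorted(stats, key=lambda h: ucb_score(stats[h], t_final), reverse=True)
```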

Speaking of experiments, the results are quite promising!


Also, the authors conduct some qualitative analyses that reveal interesting details. On a synthetic dataset, all models were found to identify the correct hypothesis underlying a task like "Shoe Sales", e.g., that "customers tend to purchase shoes that match the color of their shirts." On real-world datasets, the authors compared their generated hypotheses with those found in existing literature, confirming the validity of some while uncovering new insights that prior studies had not addressed. Examples of these hypotheses are shown in Table 4 in the main paper, with a complete list available in Appendix D. The hypotheses corroborated useful features highlighted in the literature, and the automatic evaluation of hypothesis quality also revealed negative findings. More details are provided in Section 4.3 of the paper.

Final thoughts: the paper is one of several probing LLMs' ability to be 'creative.' While we might equate creativity with writing poems or making art, it is also required in science, in the form of 'interesting' hypothesis generation. As the saying goes, identifying a good problem is half the battle in scientific research. There's all the wild stuff we can think of, like cheap nuclear fusion, Star Trek-style replicators, and time travel. But realistically, our bread and butter as scientists is to find interesting, but relatively 'doable', hypotheses that we can then publish in Nature.

LLMs are not all there yet, but this paper shows that they're getting there. Will scientists one day be out of a job too?


Hypothesis Generation from Literature for Advancing Biological Mechanism Research: A Perspective


How generative AI models can fuel scientific discovery

Using generative models to come up with new ideas, we can dramatically accelerate the pace at which we can discover new molecules, materials, drugs, and more.

Throughout history, humanity has made progress often through a combination of curiosity and creativity. When we have problems that need overcoming, we try to understand why something is the case to figure out a solution.

Many scientific discoveries were made as a result of trial and error. While methodical, this process can also be painstakingly slow. And in some fields of study, the impetus for solving problems can be extremely urgent, whether that’s developing new life-saving drugs, or finding new ways to mitigate the effects of climate change. It can take a decade to discover, test, and develop a new drug. In light of new realities like the COVID-19 pandemic, this is simply not fast enough.

We need to find new ways to spur our creativity and inspiration. No one person, or even a group of people, could possibly keep up with all the latest research in their field of study, let alone remember every iota of what they’ve read over their lifetimes. This, though, is an area where AI can greatly help us.

Today, there are already systems that can ingest large volumes of data, sift through it, and help find patterns in the noise. And there are newer, emerging streams of AI research that we work on and believe can accelerate the pace of discovery even more. One of these areas is called generative models.

Generative models are a powerful tool in AI that’s crossed over into popular culture in recent years. We’ve seen AI tools that can mimic the styles of master painters, videos where an actor’s face is eerily plastered on a video of another actor, and AI systems where a user gives a prompt, for a picture or a short story, and they generate something entirely fictional based on the request.

These are the green shoots of the potential of generative models. They are probably our most powerful tool right now for leveraging the vast troves of data in science: using them to come up with starting points for designing and discovering new materials and drugs, generating new knowledge, and creating new solutions to challenging problems, including those related to climate, sustainability, healthcare, and the life sciences.

How generative models can accelerate the scientific method

In scientific discovery, we follow the scientific method: we start with a question, study it, come up with ideas, study some more, create a hypothesis, test it, assess the results, and report back. But in any discovery application, there are reams of information to potentially consume and understand before coming up with an idea. Scientists can spend years working on a single question and not find an answer.

That’s partly a result of the limits in our knowledge, but it’s also because the space of possible answers is simply too large to systematically search. In just the field of drug discovery, it’s believed that there are some 10⁶³ possible drug-like molecules in the universe. Trial and error can’t possibly get us through all those combinations.

This is where generative models can be our creative aid and help us find new ideas that we might not have thought to consider before. It helps us break through the bottleneck in the process of idea generation and create new eureka moments.

All scientific discovery involves a hypothesis, and until now hypotheses have been developed exclusively by humans. But building AI systems that can learn from data and make novel and valuable suggestions can greatly augment human creativity and drastically shorten the time it takes to find new ideas to test.


At IBM Research, we’ve been building a body of research exploring the development and application of generative models in discovery. Specifically, we created generative model-based AI systems to design molecules for a variety of materials discovery applications.

Our team developed one family of generative model algorithms that efficiently combines conditional generative models with reinforcement learning to design ligands¹ with desired activity against specific proteins and hit-like anticancer molecules² for specific omic profiles. We showed how generative models are able to support the initial design phases of the materials discovery process and demonstrated how they can be combined with data-driven chemical synthesis planning to swiftly produce candidates for wet-lab experimentation.

Recently, my colleagues built a generative model that can propose new antimicrobial peptides³ (AMPs) with desired properties. AMPs are viewed as a “drug of last resort” against antimicrobial resistance, one of the biggest threats to global health and food security. Our generative model identified novel candidate molecules, and a second AI system filtered them using predicted properties such as toxicity and broad-spectrum activity. In the span of a few weeks, we were able to identify several dozen novel candidate molecules, a process that can normally take years.

Similarly, another team at IBM Research used generative models, along with several other AI and high-performance computing advances, to come up with a new photoacid generator (PAG), a material key to manufacturing semiconductors, in a process that usually takes years but was completed in weeks.

Generative models, however, don’t have to be limited to just the hypothesis step of the scientific method. In the future, they can potentially help us figure out what questions we should even be asking before we try to find answers: Given everything we know about a field, what is the next question we should ask?

We can potentially create generative models to help us tackle questions we don’t know where to start with, such as how to find a new antiviral for an unknown protein, or whether we could make a catalyst for CO₂ in the atmosphere. We can potentially use generative models in testing, to help us determine what conditions we need to create for the most accurate results, and we can even use them to help us refine future tests after we’ve gotten our results.

Creating a scientific community of discovery

As part of our mission to accelerate discovery for IBM and its partners, we want to foster an open community around scientific discovery. Technologies like AI should be tools that scientists and researchers use to carry out their research more quickly and effectively, rather than something that requires very specific domain knowledge to utilize.

To that end, we recently launched what we’re calling the Generative Toolkit for Scientific Discovery (GT4SD). It’s an open-source library (released under the MIT license) that accelerates hypothesis generation in the scientific discovery process and eases the adoption of state-of-the-art generative models. GT4SD includes models that can generate new molecule designs based on properties like target proteins, target omics profiles, scaffold distances, binding energies, and additional targets relevant for materials and drug discovery.


The GT4SD library provides an effective environment for generating new hypotheses (or inference) and for fine-tuning generative models for specific domains using custom data sets (or retraining). It’s compatible with many popular deep learning frameworks, including PyTorch, PyTorch Lightning, HuggingFace Transformers, GuacaMol, and Moses. It serves a wide range of applications, ranging from materials science to drug discovery.

GT4SD’s common framework makes generative models easily accessible to a broad community, including AI/ML practitioners developing new generative models who want to deploy with just a few lines of code. GT4SD provides a centralized environment for scientists and students interested in using generative models in their scientific research, allowing them to access and explore a variety of different pretrained models. GT4SD provides consistent commands and interfaces for inference and retraining with customizable parameters across the different generative models.
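
To give a flavour of what this looks like in practice, here is a sketch following the inference pattern shown in GT4SD's public documentation. Module paths, class names, and the protein target below are illustrative and may differ across library versions, so verify them against the project's GitHub README before relying on them.

```python
# Sketch of GT4SD-style inference (names per the project's docs; check your
# installed version, as the API has evolved across releases).
from gt4sd.algorithms.conditional_generation.paccmann_rl.core import (
    PaccMannRL,
    PaccMannRLProteinBasedGenerator,
)

# Condition the generator on a target protein sequence (illustrative target).
target = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF"
configuration = PaccMannRLProteinBasedGenerator()
algorithm = PaccMannRL(configuration=configuration, target=target)

# Sample candidate ligands (SMILES strings) from the pretrained model.
candidates = list(algorithm.sample(10))
print(candidates)
```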

The development of problem-specific intelligence is made possible by automatic workflows that allow for retraining with a user’s own data covering molecular structures and properties. Replacing manual processes, and reducing human bias, in the discovery process has important effects on applications that rely on generative models, accelerating the application of expert knowledge.

The entirety of GT4SD is available on GitHub , and we encourage you to try it out for yourself. In the near-term, we plan to continue expanding the toolkit’s portfolio and release new algorithms, frameworks and pre-trained models. It is our hope that through tools like GT4SD and partnerships, we can build an open community of discovery that together accelerates scientific discovery for urgent problems and speeds up the path for creating solutions that impact the world.

Learn more about:

Trustworthy Generation: Our methods facilitate data augmentation for trustworthy machine learning and accelerate novel designs for drug and material discovery, and beyond.


Jannis Born et al. 2021. Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2. Mach. Learn.: Sci. Technol. 2, 025024.

Jannis Born et al. 2021. PaccMann RL: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning. iScience 24, 102269.

Das, P., Sercu, T., Wadhawan, K., et al. 2021. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 5, 613–623.

January 13, 2024


Demystifying Hypothesis Generation: A Guide to AI-Driven Insights

Hypothesis generation involves making informed guesses about various aspects of a business, market, or problem that need further exploration and testing. This article discusses the process you need to follow while generating hypotheses, and how an AI tool like Akaike's BYOB can help you carry out that process faster and better.


What is Hypothesis Generation?

Hypothesis generation involves making informed guesses about various aspects of a business, market, or problem that need further exploration and testing. It's a crucial step while applying the scientific method to business analysis and decision-making. 

Here is an example from a popular B-school marketing case study: 

A bicycle manufacturer noticed that their sales had dropped significantly in 2002 compared to the previous year. The team investigating the reasons for this had many hypotheses. One of them was: “many cycling enthusiasts have switched to walking with their iPods plugged in.” The Apple iPod was launched in late 2001 and was an immediate hit among young consumers. Data collected manually by the team seemed to show that the geographies around Apple stores had indeed shown a sales decline.

Traditionally, hypothesis generation is time-consuming and labour-intensive. However, the advent of Large Language Models (LLMs) and Generative AI (GenAI) tools has transformed the practice altogether. These AI tools can rapidly process extensive datasets, quickly identifying patterns, correlations, and insights that might have even slipped human eyes, thus streamlining the stages of hypothesis generation.

These tools have also revolutionised experimentation by optimising test designs, reducing resource-intensive processes, and delivering faster results. LLMs' role in hypothesis generation goes beyond mere assistance, bringing innovation and easy, data-driven decision-making to businesses.

Hypotheses come in various types, such as simple, complex, null, alternative, logical, statistical, or empirical. These categories are defined based on the relationships between the variables involved and the type of evidence required for testing them. In this article, we aim to demystify hypothesis generation. We will explore the role of LLMs in this process and outline the general steps involved, highlighting why it is a valuable tool in your arsenal.

Understanding Hypothesis Generation

A hypothesis is born from a set of underlying assumptions and a prediction of how those assumptions are anticipated to unfold in a given context. Essentially, it's an educated, articulated guess that forms the basis for action and outcome assessment.

A hypothesis is a declarative statement that has not yet been proven true. Based on past scholarship, we could sum it up as follows:

  • A definite statement, not a question
  • Based on observations and knowledge
  • Testable and can be proven wrong
  • Predicts the anticipated results clearly
  • Contains a dependent and an independent variable where the dependent variable is the phenomenon being explained and the independent variable does the explaining

In a business setting, hypothesis generation becomes essential when people are made to explain their assumptions. This clarity from hypothesis to expected outcome is crucial, as it allows people to acknowledge a failed hypothesis if it does not provide the intended result. Promoting such a culture of effective hypothesising can lead to more thoughtful actions and a deeper understanding of outcomes. Failures become just another step on the way to success, and success brings more success.

Hypothesis generation is a continuous process where you start with an educated guess and refine it as you gather more information. You form a hypothesis based on what you know or observe.

Say you're a pen maker whose sales are down. You look at what you know:

  • I can see that pen sales for my brand are down in May and June.
  • I also know that schools are closed in May and June and that schoolchildren use a lot of pens.
  • I hypothesise that my sales are down because school children are not using pens in May and June, and thus not buying newer ones.

The next step is to collect and analyse data to test this hypothesis, like tracking sales before and after school vacations. As you gather more data and insights, your hypothesis may evolve. You might discover that your hypothesis only holds in certain markets but not others, leading to a more refined hypothesis.

Once your hypothesis is proven correct, there are many actions you may take - (a) reduce supply in these months (b) reduce the price so that sales pick up (c) release a limited supply of novelty pens, and so on.

Once you decide on your action, you will further monitor the data to see if your actions are working. This iterative cycle of formulating, testing, and refining hypotheses - and using insights in decision-making - is vital in making impactful decisions and solving complex problems in various fields, from business to scientific research.

How do Analysts generate Hypotheses? Why is it iterative?

A typical human working towards a hypothesis would start with:

    1. Picking the Default Action

    2. Determining the Alternative Action

    3. Figuring out the Null Hypothesis (H0)

    4. Inverting the Null Hypothesis to get the Alternate Hypothesis (H1)

    5. Hypothesis Testing

The default action is what you would naturally do, regardless of any hypothesis or in a case where you get no further information. The alternative action is the opposite of your default action.

The null hypothesis, or H0, is what brings about your default action. The alternative hypothesis (H1) is essentially the negation of H0.

For example, suppose you are tasked with analysing a highway tollgate data (timestamp, vehicle number, toll amount) to see if a raise in tollgate rates will increase revenue or cause a volume drop. Following the above steps, we can determine:

Default Action: “I want to increase toll rates by 10%.”
Alternative Action: “I will keep my rates constant.”
H0: “A 10% increase in the toll rate will not cause a significant dip in traffic (say, more than 3%).”
H1: “A 10% increase in the toll rate will cause a dip in traffic of greater than 3%.”

Now, we can start looking at past data of tollgate traffic in and around rate increases for different tollgates. Some data might be irrelevant. For example, some tollgates might be much cheaper so customers might not have cared about an increase. Or, some tollgates are next to a large city, and customers have no choice but to pay. 

Ultimately, you are looking for the level of significance between traffic and rates for comparable tollgates. Significance is often noted as a P-value, or probability value. The P-value is a way to measure how surprising your test results are, assuming that your H0 holds true.

The lower the p-value, the more convincing your data is to change your default action.

Usually, a p-value that is less than 0.05 is considered to be statistically significant, meaning there is a need to change your null hypothesis and reject your default action. In our example, a low p-value would suggest that a 10% increase in the toll rate causes a significant dip in traffic (>3%). Thus, it is better if we keep our rates as is if we want to maintain revenue. 
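
Here is a toy version of that tollgate test in Python, using synthetic daily traffic counts. The numbers, the 3% adjustment, and the choice of a one-sided two-sample t-test are illustrative assumptions, not a prescription for real toll data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic daily vehicle counts before and after a 10% rate increase.
before = rng.normal(loc=10_000, scale=400, size=60)
after = rng.normal(loc=9_500, scale=400, size=60)  # ~5% dip, for illustration

# H0 says the dip is not greater than 3%, so compare 'after'
# against 97% of the 'before' counts, one-sided.
t_stat, p_value = stats.ttest_ind(after, 0.97 * before, alternative="less")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the dip likely exceeds 3%; keep rates constant.")
else:
    print("Fail to reject H0: proceed with the 10% increase.")
```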

In other examples, where one has to explore the significance of different variables, we might find that some variables are not correlated at all. In general, hypothesis generation is an iterative process - you keep looking for data and keep considering whether that data convinces you to change your default action.

Internal and External Data 

Hypothesis generation feeds on data. Data can be internal or external. In businesses, internal data is produced by company owned systems (areas such as operations, maintenance, personnel, finance, etc). External data comes from outside the company (customer data, competitor data, and so on).

Let’s consider a real-life hypothesis generated from internal data: 

Multinational company Johnson & Johnson was looking to enhance employee performance and retention. Initially, they favoured experienced industry candidates for recruitment, assuming they'd stay longer and contribute faster. However, HR and the people analytics team at J&J hypothesised that recent college graduates outlast experienced hires and perform equally well. They compiled data on 47,000 employees to test the hypothesis and, based on it, Johnson & Johnson increased hires of new graduates by 20%, leading to reduced turnover with consistent performance.

For an analyst (or an AI assistant), external data is often hard to source - it may not be available as organised datasets (or reports), or it may be expensive to acquire. Teams might have to collect new data from surveys, questionnaires, customer feedback and more. 

Further, there is the problem of context. Suppose an analyst is looking at the dynamic pricing of hotels offered on his company’s platform in a particular geography. Suppose further that the analyst has no context of the geography, the reasons people visit the locality, or of local alternatives; then the analyst will have to learn additional context to start making hypotheses to test. 

Internal data, of course, is internal, meaning access is already guaranteed. However, this probably adds up to staggering volumes of data. 

Looking Back, and Looking Forward

Data analysts often have to generate hypotheses retrospectively, where they formulate and evaluate H0 and H1 based on past data. For the sake of this article, let's call it retrospective hypothesis generation.

Alternatively, a prospective approach to hypothesis generation could be one where hypotheses are formulated before data collection or before a particular event or change is implemented. 

For example: 

A pen seller has a hypothesis that during the lean periods of summer, when schools are closed, a Buy One Get One (BOGO) campaign will lead to a 100% sales recovery because customers will buy pens in advance.  He then collects feedback from customers in the form of a survey and also implements a BOGO campaign in a single territory to see whether his hypothesis is correct, or not.
The HR head of a multi-office employer realises that some of the company’s offices have been providing snacks at 4:30 PM in the common area, and the rest have not. He has a hunch that these offices have higher productivity. The leader asks the company’s data science team to look at employee productivity data and the employee location data. “Am I correct, and to what extent?”, he asks. 

These examples also reflect another nuance, in which the data is collected differently: 

  • Observational: Observational testing happens when researchers observe a sample population and collect data as it occurs without intervention. The data for the snacks vs productivity hypothesis was observational. 
  • Experimental: In experimental testing, the sample is divided into multiple groups, with one control group. The test for the non-control groups will be varied to determine how the data collected differs from that of the control group. The data collected by the pen seller in the single territory experiment was experimental.

Such data-backed insights are a valuable resource for businesses because they allow for more informed decision-making, leading to the company's overall growth. Taking a data-driven decision, from forming a hypothesis to updating and validating it across iterations, to taking action based on your insights reduces guesswork, minimises risks, and guides businesses towards strategies that are more likely to succeed.

How can GenAI help in Hypothesis Generation?

Of course, hypothesis generation is not always straightforward. Understanding the earlier examples is easy for us because we're already inundated with context. But, in a situation where an analyst has no domain knowledge, suddenly, hypothesis generation becomes a tedious and challenging process.

AI, particularly high-capacity, robust tools such as LLMs, have radically changed how we process and analyse large volumes of data. With its help, we can sift through massive datasets with precision and speed, regardless of context, whether it's customer behaviour, financial trends, medical records, or more. Generative AI, including LLMs, are trained on diverse text data, enabling them to comprehend and process various topics.

Now, imagine an AI assistant helping you with hypothesis generation. LLMs are not born with context. Instead, they are trained on vast amounts of data, enabling them to develop context even in a completely unfamiliar environment. This skill is instrumental when adopting a more exploratory approach to hypothesis generation. For example, the HR leader from earlier could simply ask an LLM tool: “Can you look at this employee productivity data, find high-productivity cohorts, and see if they correlate with any other employee data, like location, pedigree, years of service, marital status, etc.?”

For an LLM-based tool to be useful, it requires a few things:

  • Domain Knowledge: A human could take months to years to acclimatise to a particular field fully, but LLMs, when fed extensive information and utilising Natural Language Processing (NLP), can familiarise themselves in a very short time.
  • Explainability: The tool must be able to explain its thought process and outputs, so that it ceases to be a "black box".
  • Customisation: For consistent improvement, contextual AI must allow tweaks, letting users change its behaviour to meet their expectations. Human intervention and validation are necessary steps in adopting AI tools.

NLP allows these tools to discern context within textual data, meaning they can read, categorise, and analyse data with remarkable speed. LLMs can thus quickly develop contextual understanding and generate human-like text while processing vast amounts of unstructured data, making it easier for businesses and researchers to organise and utilise data effectively. LLMs have the potential to become indispensable tools for businesses. The future rests on AI tools that harness the powers of LLMs and NLP to deliver actionable insights, mitigate risks, inform decision-making, predict future trends, and drive business transformation across various sectors.

Together, these technologies empower data analysts to unravel hidden insights within their data. For our pen maker, for example, an AI tool could aid data analytics. It can look through historical data to track when sales peaked, or go through sales data to identify the pens that sold the most. It can refine a hypothesis across iterations, just as a human analyst would. It can even be used to brainstorm other hypotheses.

Consider the situation where you ask the LLM, "Where do I sell the most pens?". It will go through all of the data you have made available - places where you sell pens, the number of pens you sold - to return the answer. If we were to do this on our own, even if we were particularly meticulous about keeping records, it would take us at least five to ten minutes, and that is only if we know how to query a database and extract the needed information. If we don't, there's the added effort of finding and training someone who does. An AI assistant, on the other hand, could share the answer with us in mere seconds. Its finely honed talents in sorting through data, identifying patterns, refining hypotheses iteratively, and generating data-backed insights enhance problem-solving and decision-making, supercharging our business model.
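
Under the hood, a question like "Where do I sell the most pens?" compiles down to a simple aggregation. The sketch below shows the kind of query an AI assistant might run; the ledger and column names are made-up assumptions.

```python
import pandas as pd

# Hypothetical sales ledger; columns are illustrative assumptions.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "units_sold": [120, 340, 210, 95, 400],
})

# "Where do I sell the most pens?" becomes a group-and-rank query:
by_region = sales.groupby("region")["units_sold"].sum().sort_values(ascending=False)
print(by_region.head(1))  # South leads with 740 units
```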

Top-Down and Bottom-Up Hypothesis Generation

As we discussed earlier, every hypothesis begins with a default action that determines your initial hypotheses and all your subsequent data collection. Then you look at data, and a lot of it. The significance of your data depends on its effect on, and relevance to, your default action. This would be a top-down approach to hypothesis generation.

There is also the bottom-up method, where you start by going through your data and figuring out whether there are any interesting correlations that you could leverage. This method is usually less focused than the top-down approach and, as a result, involves even more data collection, processing, and analysis. AI is a stellar tool for Exploratory Data Analysis (EDA). Wading through swathes of data to highlight trends, patterns, gaps, opportunities, errors, and concerns is hardly a challenge for an AI tool equipped with NLP and powered by LLMs.

EDA can help with: 

  • Cleaning your data
  • Understanding your variables
  • Analysing relationships between variables

An AI assistant performing EDA can help you review your data, remove redundant data points, identify errors, note relationships, and more. All of this ensures ease, efficiency, and, best of all, speed for your data analysts.
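
For concreteness, these are the kinds of pandas calls an EDA assistant would automate across the three steps above; the file name and columns are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical dataset

# 1. Clean: drop exact duplicates and rows missing key fields.
df = df.drop_duplicates().dropna(subset=["region", "units_sold"])

# 2. Understand your variables: types, ranges, summary statistics.
print(df.dtypes)
print(df.describe(include="all"))

# 3. Analyse relationships: pairwise correlations of numeric columns.
print(df.corr(numeric_only=True))
```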

Good hypotheses are extremely difficult to generate. They are nuanced and, without necessary context, almost impossible to ascertain in a top-down approach. On the other hand, an AI tool adopting an exploratory approach is swift, easily running through available data - internal and external. 

If you want to rearrange how your LLM looks at your data, you can also do that. Changing the weight you assign to the various events and categories in your data is a simple process. That’s why LLMs are a great tool in hypothesis generation - analysts can tailor them to their specific use cases. 

Ethical Considerations and Challenges

There are numerous reasons why you should adopt AI tools into your hypothesis generation process. But why are they still not as popular as they should be?

Some worry that AI tools can inadvertently pick up human biases through the data they are fed. Others fear AI and raise privacy and trust concerns. Data quality and availability are also often questioned. Since LLMs and Generative AI are still developing technologies, such issues are bound to arise, but these are all obstacles researchers are earnestly tackling.

One oft-raised complaint against LLM tools (like OpenAI's ChatGPT) is that they 'fill in' gaps in knowledge, providing information where there is none, thus giving inaccurate, embellished, or outright wrong answers; this tendency to "hallucinate" was a major cause for concern. But, to combat this phenomenon, newer AI tools have started providing citations with the insights they offer so that their answers become verifiable. Human validation is an essential step in interpreting AI-generated hypotheses and queries in general. This is why we need a collaboration between the intelligent and artificially intelligent mind to ensure optimised performance.

Clearly, hypothesis generation is an immensely time-consuming activity. But AI can take care of all these steps for you. From helping you figure out your default action, determining all the major research questions, initial hypotheses and alternative actions, and exhaustively weeding through your data to collect all relevant points, AI can help make your analysts' jobs easier. It can take any approach - prospective, retrospective, exploratory, top-down, bottom-up, etc. Furthermore, with LLMs, your structured and unstructured data are taken care of, meaning no more worries about messy data! With the wonders of human intuition and the ease and reliability of Generative AI and Large Language Models, you can speed up and refine your process of hypothesis generation based on feedback and new data to provide the best assistance to your business.


Hypothesis Maker

AI-Powered Research Hypothesis Generator

  • Scientific Research: Generate a hypothesis for your experimental or observational study based on your research question.
  • Academic Studies: Formulate a hypothesis for your thesis, dissertation, or academic paper.
  • Market Research: Develop a hypothesis for your market research study to understand consumer behavior or market trends.
  • Social Science Research: Create a hypothesis for your social science research to explore societal or behavioral patterns.

Yes, HyperWrite offers a limited trial for users to test the Hypothesis Maker. For additional access, you can choose the Premium Plan at $19.99/mo or Ultra for $44.99/mo. Use the code 'TRYHYPERWRITE' for 50% off your first month.

The Hypothesis Maker is powered by advanced AI models. These models analyze your research question and use their understanding of scientific research and hypothesis formulation to generate a clear, concise, and specific hypothesis that can guide your research process.

Yes, the Hypothesis Maker generates original hypotheses based on your provided research question. It uses advanced AI models to ensure that the generated hypothesis is unique, relevant to your research question, and can guide your research process effectively.

Yes, the Hypothesis Maker is versatile and can be used for a wide range of research types, including scientific, academic, market, and social science research. However, the output should always be reviewed and adjusted as necessary to fit the specific context and objectives of your research.


Hypothesis Maker Online

Looking for a hypothesis maker? This free online tool for students will help you formulate a clear hypothesis quickly and efficiently.

📄 Hypothesis Maker: How to Use It

Our hypothesis maker is a simple and efficient tool you can access online for free.

If you want to create a research hypothesis quickly, you should fill out the research details in the given fields on the hypothesis generator.

Below are the fields you should complete to generate your hypothesis:

  • Who or what is your research based on? For instance, the subject can be research group 1.
  • What does the subject (research group 1) do?
  • What does the subject affect? - This shows the predicted outcome, which is the object.
  • Who or what will be compared with research group 1? (research group 2).

Once you fill in the fields, click the 'Make a hypothesis' button to get your results.
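Under the hood, a tool like this essentially slots your four answers into an if-then template. The toy function below is an illustrative sketch of that idea, not the actual implementation; all names and example values are made up.

```python
# A toy sketch of a hypothesis template: the four fields above are slotted
# into an "if-then" statement with a comparison group.
def make_hypothesis(subject: str, action: str, outcome: str, comparison: str) -> str:
    return (f"If {subject} {action}, then {outcome} will differ "
            f"compared to {comparison}.")

print(make_hypothesis(
    subject="research group 1",
    action="studies with spaced repetition",
    outcome="test scores",
    comparison="research group 2",
))
```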

⚗️ What Is a Hypothesis in the Scientific Method?

A hypothesis is a statement describing an expectation or prediction of your research through observation.

It is similar to academic speculation and reasoning that anticipates the outcome of your scientific test. An effective hypothesis, therefore, should be crafted carefully and with precision.

A good hypothesis should have dependent and independent variables . These variables are the elements you will test in your research method – it can be a concept, an event, or an object as long as it is observable.

You observe the dependent variable's response while the independent variable is changed during the experiment.

In a nutshell, a hypothesis directs and organizes the research methods you will use, forming a large section of research paper writing.

Hypothesis vs. Theory

A hypothesis is a realistic expectation that researchers make before any investigation. It is formulated and tested to prove whether the statement is true. A theory, on the other hand, is a factual principle supported by evidence. Thus, a theory is more fact-backed compared to a hypothesis.

Another difference is that a hypothesis is presented as a single statement , while a theory can be an assortment of things . Hypotheses are based on future possibilities toward a specific projection, but the results are uncertain. Theories are verified with undisputable results because of proper substantiation.

When it comes to data, a hypothesis relies on limited information , while a theory is established on an extensive data set tested on various conditions.

You should observe the stated assumption to prove its accuracy.

Since hypotheses have observable variables, their outcome is usually based on a specific occurrence. Conversely, theories are grounded on a general principle involving multiple experiments and research tests.

This general principle can apply to many specific cases.

The primary purpose of formulating a hypothesis is to present a tentative prediction for researchers to explore further through tests and observations. Theories, in their turn, aim to explain plausible occurrences in the form of a scientific study.

👍 What Does a Good Hypothesis Mean?

It helps to rely on several criteria when establishing a good hypothesis. Below are the parameters you should use to analyze the quality of your hypothesis.

  • Testability: You should be able to test the hypothesis and obtain a true or false outcome after the investigation. Apart from a purely logical hypothesis, ensure you can test your predictions through experiments or statistical analysis.
  • Variables: It should have a dependent and an independent variable. Identifying the appropriate variables will help readers comprehend your prediction and know what to expect at the conclusion phase.
  • Cause and effect: A good hypothesis should have a cause-and-effect connection, with one variable influencing another in some way. It should be written as an "if-then" statement to allow the researcher to make accurate predictions of the investigation results, although this rule does not apply to every type of hypothesis.
  • Clear language: Writing can get complex, especially when research terminology is involved, so express your hypothesis as a brief statement. Avoid being vague, because your readers might get confused. Your hypothesis has a direct impact on your entire research paper's quality, so use simple words that are easy to understand.
  • Ethics: Hypothesis generation should comply with ethical standards. Don't formulate hypotheses that contravene taboos or are questionable. Besides, your hypothesis should have correlations to published academic works to look data-based and authoritative.

🧭 6 Steps to Making a Good Hypothesis

Writing a hypothesis becomes way simpler if you follow a tried-and-tested algorithm. Let’s explore how you can formulate a good hypothesis in a few steps:

Step #1: Ask Questions

The first step in hypothesis creation is asking real questions about the surrounding reality.

Why do things happen as they do? What are the causes of some occurrences?

Your curiosity will trigger great questions that you can use to formulate a stellar hypothesis. So, ensure you pick a research topic of interest to scrutinize the world’s phenomena, processes, and events.

Step #2: Do Initial Research

Carry out preliminary research and gather essential background information about your topic of choice.

The extent of the information you collect will depend on what you want to prove.

Your initial research can be complete with a few academic books or a simple Internet search for quick answers with relevant statistics.

Still, keep in mind that at this phase, it is too early to prove or disprove your hypothesis.

Step #3: Identify Your Variables

Now that you have a basic understanding of the topic, choose the dependent and independent variables.

Take note that the independent variable is the one you change or manipulate in your test, so understand the limitations of your experiment before settling on a final hypothesis.

Step #4: Formulate Your Hypothesis

You can write your hypothesis as an 'if-then' expression. Presenting any hypothesis in this format is reliable, since it describes the cause-and-effect relationship you want to test.

For instance: If I study every day, then I will get good grades.

Step #5: Gather Relevant Data

Once you have identified your variables and formulated the hypothesis, you can start the experiment. Remember, the conclusion you make will be a proof or rebuttal of your initial assumption.

So, gather relevant information, whether for a simple or statistical hypothesis, because you need to back your statement.

Step #6: Record Your Findings

Finally, write down your conclusions in a research paper .

Outline in detail whether the test has proved or disproved your hypothesis.

Edit and proofread your work, using a plagiarism checker to ensure the authenticity of your text.

We hope that the above tips will be useful for you. Note that if you need to conduct business analysis, you can use the free templates we’ve prepared: SWOT , PESTLE , VRIO , SOAR , and Porter’s 5 Forces .


IvyPanda's free online hypothesis maker will help you formulate a hypothesis for your study. With this easy-to-use tool, you just need to provide basic info about the focus of your research, its variables, and predicted outcomes. The rest is on us. Get a perfect hypothesis fast!

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators


Recent advances in large language models (LLMs) have advanced multilingual speech and machine translation by reducing representation errors and incorporating external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in the N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in the N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypothesis-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.


Link prediction for hypothesis generation: an active curriculum learning infused temporal graph-based approach

  • Open access
  • Published: 12 August 2024
  • Volume 57, article number 244 (2024)


  • Uchenna Akujuobi 1   na1 ,
  • Priyadarshini Kumari 2   na1 ,
  • Jihun Choi 3 ,
  • Samy Badreddine 1 ,
  • Kana Maruyama 3 ,
  • Sucheendra K. Palaniappan 4 &
  • Tarek R. Besold 1  

135 Accesses


Over the last few years Literature-based Discovery (LBD) has regained popularity as a means to enhance the scientific research process. The resurgent interest has spurred the development of supervised and semi-supervised machine learning models aimed at making previously implicit connections between scientific concepts/entities within often extensive bodies of literature explicit—i.e., suggesting novel scientific hypotheses. In doing so, understanding the temporally evolving interactions between these entities can provide valuable information for predicting the future development of entity relationships. However, existing methods often underutilize the latent information embedded in the temporal aspects of the interaction data. Motivated by applications in the food domain—where we aim to connect nutritional information with health-related benefits—we address the hypothesis-generation problem using a temporal graph-based approach. Given that hypothesis generation involves predicting future (i.e., still to be discovered) entity connections, in our view the ability to capture the dynamic evolution of connections over time is pivotal for a robust model. To address this, we introduce THiGER , a novel batch contrastive temporal node-pair embedding method. THiGER excels in providing a more expressive node-pair encoding by effectively harnessing node-pair relationships. Furthermore, we present THiGER-A , an incremental training approach that incorporates an active curriculum learning strategy to mitigate label bias arising from unobserved connections. By progressively training on increasingly challenging and high-utility samples, our approach significantly enhances the performance of the embedding model. Empirical validation of our proposed method demonstrates its effectiveness on established temporal-graph benchmark datasets, as well as on real-world datasets within the food domain.

Similar content being viewed by others

Connecting the Dots: Hypotheses Generation by Leveraging Semantic Shifts

Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network

Dyport: dynamic importance-based biomedical hypothesis generation benchmarking technique

1 Introduction

The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom. — Isaac Asimov

Science is advancing at an increasingly quick pace, as evidenced, for instance, by the exponential growth in the number of published research articles per year (White 2021). Effectively navigating this ever-growing body of knowledge is tedious and time-consuming in the best of cases, and more often than not becomes infeasible for individual scientists (Brainard 2020). In order to augment the efforts of human scientists in the research process, computational approaches have been introduced to automatically extract hypotheses from the knowledge contained in published resources. Swanson (1986) systematically used a scientific literature database to find potential connections between previously disjoint bodies of research, as a result hypothesizing a (later confirmed) curative relationship between dietary fish oils and Raynaud's syndrome. Swanson and Smalheiser then automated the search and linking process in the ARROWSMITH system (Swanson and Smalheiser 1997). Their work and other more recent examples (Fan and Lussier 2017; Trautman 2022) clearly demonstrate the usefulness of computational methods in extracting latent information from the vast body of scientific publications.

Over time, various methodologies have been proposed to address the Hypothesis Generation (HG) problem. Swanson and Smalheiser (Smalheiser and Swanson 1998; Swanson and Smalheiser 1997) pioneered the use of a basic ABC model grounded in a stringent interpretation of structural balance theory (Cartwright and Harary 1956). In essence, if entities A and B, as well as entities A and C, share connections, then entities B and C should be associated. Subsequent years have seen the exploration of more sophisticated machine learning-based approaches for improved inference. These encompass techniques such as text mining (Spangler et al. 2014; Spangler 2015), topic modeling (Sybrandt et al. 2017; Srihari et al. 2007; Baek et al. 2017), association rules (Hristovski et al. 2006; Gopalakrishnan et al. 2016; Weissenborn et al. 2015), and others (Jha et al. 2019; Xun et al. 2017; Shi et al. 2015; Sybrandt et al. 2020).

In the context of HG, where the goal is to predict novel relationships between entities extracted from scientific publications, comprehending prior relationships is of paramount importance. For instance, in the domain of social networks, the principles of social theory come into play when assessing the dynamics of connections between individuals. When there is a gradual reduction in the social distance between two distinct individuals, as evidenced by factors such as the establishment of new connections with shared acquaintances and increased geographic proximity, there emerges a heightened likelihood of a subsequent connection between these two individuals (Zhang and Pang 2015 ; Gitmez and Zárate 2022 ). This concept extends beyond social networks and finds relevance in predicting scientific relationships or events through the utilization of temporal information (Crichton et al. 2018 ; Krenn et al. 2023 ; Zhang et al. 2022 ). In both contexts, the principles of proximity and evolving relationships serve as valuable indicators, enabling a deeper understanding of the intricate dynamics governing these complex systems.

Modeling the temporal evolution of these relationships assumes a critical role in constructing an effective and resilient hypothesis generation model. To harness the temporal dynamics, Akujuobi et al. (2020b, 2020a) and Zhou et al. (2022) conceptualize the HG task as a temporal graph problem. More precisely, given a sequence of graphs \(G = \{G_{0}, G_{1}, \ldots , G_{T}\}\), the objective is to deduce which previously unlinked nodes in \(G_{T}\) ought to be connected. In this framework, nodes denote biomedical entities, and the graphs \(G_{\tau }\) represent temporal graphlets (see Fig. 1).

Definition 1

Temporal graphlet : A temporal graphlet \(G_{\tau } = \{V^{\tau },E^{\tau }\}\) is a temporal subgraph at time point \(\tau\) , where \(V^{\tau } \subset V\) and \(E^{\tau } \subset E\) are the temporal set of nodes and edges of the subgraph.
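A minimal sketch of Definition 1 as a Python data structure might look as follows; the entity names are illustrative only.

```python
from dataclasses import dataclass

# Sketch of Definition 1: a temporal graphlet holds the subset of nodes
# (V^tau) and edges (E^tau) observed at time step tau.
@dataclass(frozen=True)
class TemporalGraphlet:
    tau: int
    nodes: frozenset   # V^tau, a subset of V
    edges: frozenset   # E^tau, a subset of V x V

# A temporal graph is then just the ordered sequence G_0, ..., G_T.
G = [
    TemporalGraphlet(0, frozenset({"fish oil"}), frozenset()),
    TemporalGraphlet(1, frozenset({"fish oil", "Raynaud's syndrome"}),
                     frozenset({("fish oil", "Raynaud's syndrome")})),
]
```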

Their approach tackles the HG problem by introducing a temporal perspective. Instead of relying solely on the final state \(E_{T}\) on a static graph, it considers how node pairs evolve over discrete time steps \(E^{\tau }: \tau = 0 \dots T\) . To model this sequential evolution effectively, Akujuobi et al. and Zhou et al. leverage the power of recurrent neural networks (RNNs) (see Fig.  2 a). However, it is essential to note that while RNNs have traditionally been the preferred choice for HG, their sequential nature may hinder capturing long-range dependencies, impacting performance for lengthy sequences.

Fig. 1: Modeling hypothesis generation as a temporal link prediction problem

Fig. 2: Predicting the link probability \(p_{i,j}\) for a node pair \(v_i\) and \(v_j\) using (a) a recurrent neural network approach (Akujuobi et al. 2020b; Zhou et al. 2022) and (b) THiGER, our approach. The recurrent approach aggregates the neighborhood information \({{\mathcal {N}}}^t(v_i)\) and \({{\mathcal {N}}}^t(v_j)\) sequentially, while THiGER aggregates the neighborhood information hierarchically in parallel

To overcome these limitations, we propose THiGER (Temporal Hierarchical Graph-based Encoder Representation), a robust transformer-based model designed to capture the evolving relationships between node pairs. THiGER overcomes the constraints of previous methods by representing temporal relationships hierarchically (see Fig. 2b). The proposed hierarchical layer-wise framework models the temporal dynamics among given concepts incrementally, progressively extracting the temporal interactions between consecutive time steps and thus enabling the model to prioritize attention to the informative regions of temporal evolution. Our method effectively addresses issues arising from imbalanced temporal information (see Sect. 5.2). Moreover, it employs a contrastive learning strategy to improve the quality of task-specific node embeddings for node-pair representation and relationship inference tasks.

An equally significant challenge in HG is the lack of negative-class samples for training. Our dataset provides positive-class samples, which represent established connections between entities, but it lacks negative-class samples denoting non-existent connections (as opposed to undiscovered connections, which could potentially lead to scientific breakthroughs). This situation aligns with the positive-unlabeled (PU) learning problem. Prior approaches have typically either discarded unobserved connections as uninformative or wrongly treated them as negative-class samples. The former approach leads to the loss of valuable information, while the latter introduces label bias during training.

In response to these challenges, we furthermore introduce THiGER-A, an active curriculum learning strategy designed to train the model incrementally. THiGER-A utilizes progressively complex positive samples and highly informative, diverse unobserved connections as negative-class samples. Our experimental results demonstrate that by employing incremental training with THiGER-A, we achieve enhanced convergence and performance for hypothesis-generation models compared to training on the entire dataset in one go. Remarkably, our approach demonstrates strong generalization capabilities, especially in challenging inductive test scenarios where the entities were not part of the seen training dataset.

Inspired by Swanson’s pioneering work, we chose the food domain as a promising application area for THiGER. This choice is motivated by the increasing prevalence of diet-related health conditions, such as obesity and type-2 diabetes, alongside the growing recognition and utilization of the health benefits associated with specific food products in wellness and medical contexts.

In summary, our contributions are as follows:

Methodology: We propose a novel temporal hierarchical transformer-based architecture for node pair encoding. In utilizing the temporal batch-contrastive strategy, our architecture differs from existing approaches that learn in conventional static or temporal graphs. In addition, we present a novel incremental training strategy for temporal graph node pair embedding and future relation prediction. This strategy effectively mitigates negative-label bias through active learning and improves generalization by training the model progressively on increasingly complex positive samples using curriculum learning.

Evaluation: We test the model’s efficacy on several real-world graphs of different sizes to give evidence for the model’s strength for temporal graph problems and hypothesis generation. The model is trained end-to-end and shows superior performance on HG tasks.

Application: To the best of our knowledge, this is the first application of temporal hypothesis generation in the health-related food domain. Through case studies, we validate the practical relevance of our findings.

The remaining sections of this paper include a discussion of related work in Sect.  2 , a detailed introduction of the proposed THiGER model and the THiGER-A active curriculum learning strategy in Sect.  3 , an overview of the datasets, the model setup and parameter tuning, and our evaluation approach in Sect. 4 , the results of our experimental evaluations in Sect.  5 , and finally, our conclusions and a discussion of future work in Sect.  6 .

2 Related works

2.1 Hypothesis generation

The development of effective methods for machine-assisted discovery is crucial in pushing scientific research into the next stage (Kitano 2021 ). In recent years, several approaches have been proposed in a bid to augment human abilities relevant to the scientific research process including tools for research design and analysis (Tabachnick and Fidell 2000 ), process modelling and simulation (Klein et al. 2002 ), or scientific hypothesis generation (King et al. 2004 , 2009 ).

The early pioneers of the hypothesis generation domain proposed the so called ABC model for generating novel scientific hypothesis based on existing knowledge (Swanson 1986 ; Swanson and Smalheiser 1997 ). ABC-based models are simple and efficient, and have been implemented in classical hypothesis generation systems such as ARROWSMITH (Swanson and Smalheiser 1997 ). However, several drawbacks remain, including the need for similarity metrics defined on heuristically determined term lists and significant costs in terms of computational complexity with respect to the size of common entities.

More recent approaches, thus, have aimed to curtail the limitations of the ABC model. Spangler et al. (2014); Spangler (2015) proposed text mining techniques to identify entity relationships from unstructured medical texts. AGATHA (Sybrandt et al. 2020) used a transformer encoder architecture to learn the ranking criteria between regions of a given semantic graph and the plausibility of new research connections. Srihari et al. (2007); Baek et al. (2017) proposed several text mining approaches to detect how concepts are linked within and across multiple text documents. Sybrandt et al. (2017) proposed incorporating machine learning techniques such as clustering and topical phrase mining. Shi et al. (2015) modeled the probability that concepts will be linked within a given time window using random walks.

The previously mentioned methods do not consider temporal attributes of the data. More recent works (Jha et al. 2019 ; Akujuobi et al. 2020a ; Zhou et al. 2022 ; Xun et al. 2017 ) argue that capturing the temporal information available in scholarly data can lead to better predictive performance. Jha et al. ( 2019 ) explored the co-evolution of concepts across knowledge bases using a temporal matrix factorization framework. Xun et al. ( 2017 ) modeled concepts’ co-occurrence probability using their temporal embedding. Akujuobi et al. ( 2020a , 2020b ) and Zhou et al. ( 2022 ) captured the temporal information in the scholarly data using RNN techniques.

Our approach captures the dynamic relationship information using a temporal hierarchical transformer encoder model. This strategy alleviates the limitations of the RNN-based models. Furthermore, with the incorporation of active curriculum learning strategies, our model can incrementally learn from the data.

2.2 Temporal graph learning

Learning on temporal graphs has received considerable attention from the research community in recent years. Some works (Hisano 2018; Ahmed et al. 2016; Milani Fard et al. 2019) apply static methods on aggregated graph snapshots. Others, including (Zhou et al. 2018; Singer et al. 2019), utilize time as a regularizer over consecutive snapshots of the graph to impose a smoothness constraint on the node embeddings. A popular category of approaches for dynamic graphs introduces point processes that are continuous in time. DyRep (Trivedi et al. 2019) models the occurrence of an edge as a point process using graph attention on the destination node's neighbors. Dynamic-Triad (Zhou et al. 2018) models the evolution patterns in a graph by imposing triadic closure, where a triad with three connected nodes develops from an open triad (i.e., one in which two of the nodes are not connected).

Some recent works on temporal graphs apply several combinations of GNNs and recurrent architectures (e.g., GRU). EvolveGCN (Pareja et al. 2020 ) adapts the graph convolutional network (GCN) model along the temporal dimension by using an RNN to evolve the GCN parameters. T-PAIR (Akujuobi et al. 2020b , a ) recurrently learns a node pair embedding by updating GraphSAGE parameters using gated neural networks (GRU). TGN (Rossi et al. 2020 ) introduces a memory module framework for learning on dynamic graphs. TDE (Zhou et al. 2022 ) captures the local and global changes in the graph structure using hierarchical RNN structures. TNodeEmbed (Singer et al. 2019 ) proposes the use of orthogonal procrustes on consecutive time-step node embeddings along the time dimension.

However, the limitations of RNNs remain due to their sequential nature and reduced robustness, especially when working on long timelines. Since the introduction of transformers, there has been interest in their application to temporal graph data. Most related to this work, Zhong and Huang (2023) and Wang et al. (2022) both propose the use of a transformer architecture to aggregate the node neighborhood information while updating the memory of the nodes using a GRU. TLC (Wang et al. 2021a) designs a two-stream encoder that independently processes the temporal neighborhoods associated with the two target interaction nodes using a graph-topology-aware transformer and then integrates them at a semantic level through a co-attentional transformer.

Our approach utilizes a single hierarchical encoder model to better capture the temporal information in the network while simultaneously updating the node embedding on the task. The model training and node embedding learning is performed end-to-end.

2.3 Active curriculum learning

Active learning (AL) has been well-explored for vision and learning tasks (Settles 2012 ). However, most of the classical techniques rely on single-instance-oracle strategies, wherein, during each training round, a single instance with the highest utility is selected using measures such as uncertainty sampling (Kumari et al. 2020 ), expected gradient length (Ash et al. 2020 ), or query by committee (Gilad-Bachrach et al. 2006 ). The single-instance-oracle approach becomes computationally infeasible with large training datasets such as ours. To address these challenges, several batch-mode active learning methods have been proposed (Priyadarshini et al. 2021 ; Kirsch et al. 2019 ; Pinsler et al. 2019 ). Priyadarshini et al. ( 2021 ) propose a method for batch active metric learning, which enables sampling of informative and diverse triplet data for relative similarity ordering tasks. In order to prevent the selection of correlated samples in a batch, Kirsch et al. ( 2019 ); Pinsler et al. ( 2019 ) develop distinct methods that integrate mutual information into the utility function. All three approaches demonstrate effectiveness in sampling diverse batches of informative samples for metric learning and classification tasks. However, none of these approaches can be readily extended to our specific task of hypothesis prediction on an entity-relationship graph.

Inspired by human learning, Bengio et al. (2009) introduced the concept of progressive training, wherein the model is trained on increasingly difficult training samples. Various prior works have proposed different measures to quantify the difficulty of training examples. Hacohen and Weinshall (2019) introduced curriculum learning by transfer, where they developed a score function based on the prediction confidence of a pre-trained model. Wang et al. (2021b) proposed a curriculum learning approach specifically for graph classification tasks. Another interesting work, relational curriculum learning (RCL) (Zhang et al. 2023), suggests training the model progressively on complex samples. Unlike most prior work, which typically considers data points to be independent, RCL quantifies the difficulty level of an edge by aggregating the embeddings of the neighboring nodes. While their approach utilizes relational data similar to ours, their method does not specifically tackle the challenges inherent to the PU learning setting, which involves sampling both edges and unobserved relationships from the training data. In contrast, our proposed method introduces an incremental training strategy that progressively trains the model on positive edges of increasing difficulty, while incorporating highly informative and diverse negative edges.

Fig. 3: Schematic representation of the proposed model for temporal node-pair link prediction. In (a), the hierarchical graph transformer model takes as input the aggregated node-pair embeddings obtained at each time step \(\tau\); these temporal node-pair embeddings are further encoded and aggregated at each encoder layer. The final output is the generalized node-pair embedding across all time steps. In (b), a general overview of the model is given, highlighting the incorporation of the active curriculum learning strategy

3 Methodology

3.1 Notation

  • \(G = \{G_0, \dots , G_T\}\) is a temporal graph such that \(G_\tau = \{V^\tau , E^\tau \}\) evolves over time \(\tau =0\dots T\),
  • \(e(v_i, v_j)\) or \(e_{ij}\) denotes the edge between nodes \(v_i\) and \(v_j\), and \((v_i, v_j)\) denotes the node pair corresponding to the edge,
  • \(y_{i,j}\) is the label associated with the edge \(e(v_i,v_j)\),
  • \({{\mathcal {N}}}^{\tau }(v)\) gives the neighborhood of a node \(v\) in \(V^\tau\),
  • \(x_{v}\) is the embedding of a node \(v\) and is static across time steps,
  • \(z_{i,j}^{\tau }\) is the embedding of a node pair \(\langle v_i, v_j \rangle\); it depends on the neighborhood of the nodes at time step \(\tau\),
  • \(h_{i,j}^{[\tau _0,\tau _f]}\) is the embedding of a node pair over a time-step window \(\tau _0, \dots , \tau _f\), where \(0 \le \tau _0 \le \tau _f \le T\),
  • \(f(.; \theta )\) is a neural network depending on a set of parameters \(\theta\); for brevity, \(\theta\) can be omitted if it is clear from the context,
  • \(E^+\) and \(E^-\) are the subsets of positive and negative edges, denoting observed and non-observed connections between biomedical concepts, respectively,
  • \(L\) is the number of encoder layers in the proposed model.

Algorithm 1: Hierarchical Node-Pair Embedding \(h_{i,j}^{[\tau _0,\tau _f]}\)

Algorithm 2: Link Prediction

3.2 Model overview

The whole THiGER(-A) model is shown in Fig.  3 b. Let \(v_i, v_j \in V_T\) be nodes denoting two concepts. The pair is assigned a positive label \(y_{i,j} = 1\) if a corresponding edge (i.e., a link) is observed in \(G_T\) . That is, \(y_{i,j} = 1\) iff \(e(v_i, v_j) \in E^{T}\) , otherwise 0. The model predicts a score \(p_{i,j}\) that reflects \(y_{i,j}\) . The prediction procedure is presented in Algorithm 2.

The link prediction score is given by a neural classifier \(p_{i,j} = f_C(h_{i,j}^{[0,T]}; \theta _C)\) , where \(h_{i,j}^{[0,T]}\) is an embedding vector for the node pair. This embedding is calculated in Algorithm 1 using a hierarchical transformer encoder and illustrated in Fig.  3 a.

The input to the hierarchical encoder layer is the independent local node-pair embedding aggregation at each time step, shown in line 3 of Algorithm 1 as

\(z_{i,j}^{\tau } = f_A\big (x_{v_{i}}, {\textbf{x}}_{{{\mathcal {N}}}^{\tau }(v_{i})}, x_{v_{j}}, {\textbf{x}}_{{{\mathcal {N}}}^{\tau }(v_{j})}; \theta _A\big ),\)

where \({\textbf{x}}_{{{\mathcal {N}}}^{\tau }(v_{i})} = \{x_{v'}: v' \in {{\mathcal {N}}}^{\tau }(v_{i})\}\) and \({\textbf{x}}_{{{\mathcal {N}}}^{\tau }(v_{j})} = \{x_{v'}: v' \in {{\mathcal {N}}}^{\tau }(v_{j})\}\) are the embeddings of the neighbors of \(v_{i}\) and \(v_{j}\) at the given time step.

Subsequently, the local node-pair embedding aggregations are processed by the aggregation layer illustrated in Fig. 3a and shown in line 10 of Algorithm 1. At each hierarchical layer, temporal node-pair embeddings are calculated for a sub-window using

\(h_{i,j}^{[\tau -n,\tau ]} = f^l_E\big (h_{i,j}^{[\tau -n,\tau -\frac{n}{2}]}, h_{i,j}^{[(\tau -\frac{n}{2}) + 1,\tau ]}; \theta ^l_E\big ),\)

where \(n\) represents the sub-window size. When necessary, we ensure an even number of leaves to aggregate by adding zero-padding values \(H_\textrm{padding} = {\textbf{0}}_d\), where \(d\) is the dimension of the leaf embeddings. The entire encoder architecture is denoted as \(f_E = \{f^l_E: l=1\dots L\}\).

In this work, the classifier \(f_C(.; \theta _C)\) is modeled using a multilayer perceptron network (MLP), \(f_A(.; \theta _A)\) is elaborated in Sect.  3.3 , and \(f_E(.;\theta _E)\) is modeled by a multilayer transformer encoder network, which is detailed in Sect.  3.4 .

3.3 Neighborhood aggregation

The neighborhood aggregation is modeled using GraphSAGE (Hamilton et al. 2017). GraphSAGE uses K layers to iteratively aggregate a node embedding \(x_{v}\) and its neighbor embeddings \({\textbf{x}}_{{{\mathcal {N}}}^{\tau }(v)} = \{x_{v'}: v' \in {{\mathcal {N}}}^{\tau }(v)\}\). \(f_A\) uses the GraphSAGE block to aggregate \((x_{v_{i}}, {\textbf{x}}_{{{\mathcal {N}}}^{\tau }(v_{i})})\) and \((x_{v_{j}}, {\textbf{x}}_{{{\mathcal {N}}}^{\tau }(v_{j})})\) in parallel, then merges the two aggregated representations using an MLP layer. In this paper, we explore three models based on the aggregation technique used at each iterative step of GraphSAGE.

Mean Aggregation: This straightforward technique amalgamates neighborhood representations by computing element-wise means of each node's neighbors and subsequently propagating this information iteratively. For all nodes within the specified set:

\(\beta _{v}^{k} = \sigma \big (W^S \beta _{v}^{k-1} + W^N \cdot \textrm{mean}(\{\beta _{v'}^{k-1}: v' \in {{\mathcal {N}}}(v)\})\big )\)

Here, \(\beta _{v}^{k}\) denotes the aggregated vector at iteration \(k\), and \(\beta ^{k-1}_{v}\) at iteration \(k-1\). \(W^S\) and \(W^N\) represent trainable weights, and \(\sigma\) constitutes a sigmoid activation, collectively forming a conventional MLP layer.
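As a rough illustration, one mean-aggregation step can be sketched in NumPy as below, with randomly initialized weights standing in for the trained parameters \(W^S\) and \(W^N\).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_aggregate(beta_v, beta_neighbors, W_S, W_N):
    """One mean-aggregation step: combine a node's previous-iteration
    embedding with the element-wise mean of its neighbors' embeddings."""
    neigh_mean = beta_neighbors.mean(axis=0)
    return sigmoid(W_S @ beta_v + W_N @ neigh_mean)

d = 128
rng = np.random.default_rng(0)
W_S, W_N = rng.normal(size=(d, d)), rng.normal(size=(d, d))
beta_v = rng.normal(size=d)               # node embedding at iteration k-1
beta_neighbors = rng.normal(size=(5, d))  # its sampled neighbors
beta_v_next = mean_aggregate(beta_v, beta_neighbors, W_S, W_N)
```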

GIN (Graph Isomorphism Networks): Arguing that traditional graph aggregation methods, like mean aggregation, possess limited expressive power, GIN introduces the concept of aggregating neighborhood representations as follows:

\(\beta _{v}^{k} = \textrm{MLP}\big ((1 + \epsilon ^{k}) \cdot \beta _{v}^{k-1} + \sum _{v' \in {{\mathcal {N}}}(v)} \beta _{v'}^{k-1}\big )\)

In this formulation, \(\epsilon ^{k}\) governs the relative importance of the node compared to its neighbors at layer \(k\) and can be a learnable parameter or a fixed scalar.

Multi-head Attention: We introduce a multi-head attention-based aggregation technique. This method aggregates neighborhood representations by applying multi-head attention to the node and its neighbors at each iteration:

\(\beta _{v}^{k} = \phi \big (\beta _{v}^{k-1}, \{\beta _{v'}^{k-1}: v' \in {{\mathcal {N}}}(v)\}\big )\)

Here, \(\phi\) represents a multi-head attention function, as detailed in Vaswani et al. (2017).

3.3.1 Neighborhood definition

To balance performance and scalability considerations, we adopt the neighborhood sampling approach utilized in GraphSAGE to maintain a consistent computational footprint for each batch of neighbors. In this context, we employ a uniform sampling method to select a neighborhood node set of fixed size, denoted as \({{\mathcal {N}}}'(v) \subset {{\mathcal {N}}}^{\tau }(v)\), from the original neighbor set at each step. This sampling procedure is essential: without it, the memory and runtime complexity of a single batch becomes unpredictable and, in the worst-case scenario, reaches a prohibitive \({{\mathcal {O}}}(|V|)\), making it impractical for handling large graphs.
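A minimal sketch of this fixed-size uniform sampling follows; sampling with replacement when a node has fewer neighbors than the budget is one plausible convention, assumed here.

```python
import random

def sample_neighbors(neighbors, size, seed=None):
    """Uniformly sample a fixed-size neighbor set N'(v) from N^tau(v),
    keeping the per-batch computational footprint constant. Falls back
    to sampling with replacement for nodes with few neighbors."""
    rng = random.Random(seed)
    neighbors = list(neighbors)
    if len(neighbors) >= size:
        return rng.sample(neighbors, size)
    return [rng.choice(neighbors) for _ in range(size)]
```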

3.4 Temporal hierarchical multilayer encoder layer

The temporal hierarchical multilayer encoder is the fundamental component of our proposed model, responsible for processing neighborhood representations collected over multiple time steps, specifically \((z_{i,j}^{0}, z_{i,j}^{1}, \dots , z_{i,j}^{T})\) . These neighborhood representations are utilized to construct a hierarchical tree.

At the initial hierarchical layer, we employ an encoder, denoted as \(f_E^1\) , to distill adjacent sequential local node-pair embeddings, represented as \((z_{i,j}^{\tau }, z_{i,j}^{\tau + 1})\) , combining them into a unified embedding, denoted as \(h_{i,j}^{[\tau ,\tau +1]}\) . In cases where the number of time steps is not an even multiple of 2, a zero-vector dummy input is appended.

This process repeats at each hierarchical level l within the tree, with \(h_{i,j}^{[\tau -n,\tau ]} = f^l_E(h_{i,j}^{[\tau -n,\tau -\frac{n}{2}]}, h_{i,j}^{[(\tau -\frac{n}{2}) + 1,\tau ]};\theta ^l_E)\) . Each layer \(f_E^l\) consists of a transformer encoder block and may contain \(N - 1\) encoder sublayers, where \(N \ge 1\) . This mechanism can be viewed as an iterative knowledge aggregation process, wherein the model progressively summarizes the information from pairs of local node pair embeddings.

The output of each encoder layer, denoted as \(h_{i,j}^{[\tau _0,\tau _f]}\) , offers a comprehensive summary of temporal node pair information from time step \(\tau _0\) to \(\tau _f\) . Finally, the output of the last layer, \(h_{i,j}^{[0,T]}\) , is utilized for inferring node pair relationships.
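The hierarchical reduction just described can be sketched as a simple loop over levels, where encode_pair stands in for the trained encoder layers \(f^l_E\) and zero-padding handles odd-length levels.

```python
import numpy as np

def hierarchical_aggregate(z_steps, encode_pair):
    """Reduce the sequence z^0, ..., z^T of local node-pair embeddings to
    a single vector h^{[0,T]} by repeatedly encoding adjacent pairs.
    encode_pair(left, right, level) stands in for the encoder f_E^l."""
    level, layer = 0, list(z_steps)
    while len(layer) > 1:
        if len(layer) % 2:                        # zero-pad to an even count
            layer.append(np.zeros_like(layer[0]))
        layer = [encode_pair(layer[i], layer[i + 1], level)
                 for i in range(0, len(layer), 2)]
        level += 1
    return layer[0]                               # h^{[0,T]}
```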

3.5 Parameter learning

The trainable parts of the architecture are the weights and parameters of the neighborhood aggregator \(f_A\), the transformer network \(f_E\), the classifier \(f_C\), and the embedding representations \(\{x_{v}: v \in V\}\).

To obtain suitable representations, we employ a combination of supervised and contrastive loss functions on the output of the hierarchical encoder layer \(h_{i,j}^{[0,T]}\) . The contrastive loss function encourages the embeddings of positive (i.e. a link exists in \(E^T\) ) node pairs to be closer while ensuring that the embeddings of negative node pairs are distinct.

We adopt a contrastive learning framework (Chen et al. 2020) to distinguish between positive and negative classes. For brevity, we temporarily denote \(h_{i,j}^{[0,T]}\) as \(h_{i,j}\). Given two positive node pairs with corresponding embeddings \(e(v_i, v_j) \rightarrow h_{i,j}\) and \(e(v_o, v_n) \rightarrow h_{o,n}\), the loss function is defined as follows:

\(\ell _{i,j} = -\log \frac{\exp \big (\textrm{sim}(h_{i,j}, h_{o,n})/\alpha \big )}{\sum _{(k,w) \in B} \mathbbm {1}_{(k,w) \ne (i,j)} \exp \big (\textrm{sim}(h_{i,j}, h_{k,w})/\alpha \big )},\)

where \(\alpha\) represents a temperature parameter, \(B\) is the set of node pairs in a given batch, and \(\mathbbm {1}_{(k,w) \ne (i,j)}\) indicates that the labels of node pairs \((k, w)\) and \((i, j)\) are different. We employ the angular similarity function \(\textrm{sim}(x)=1 - \arccos (x)/\pi\). We do not explicitly sample negative examples, following the methodology outlined in Chen et al. (2020).

The contrastive loss is summed over the positive training data \(E^+\):

\({{\mathcal {L}}}_{con} = \sum _{e(v_i,v_j) \in E^+} \ell _{i,j}\)

To further improve the discriminative power of the learned features, we also minimize the center loss:

\({{\mathcal {L}}}_{cent} = \frac{1}{2} \sum _{e(v_i,v_j) \in E} \Vert h_{i,j} - c_{y_{i,j}}\Vert _{2}^{2},\)

where E is the data of positive and negative edges, \(y_{i,j}\) is the class of the pair (0 or 1), \(c_{y_{{i,j}}} \in R^d\) denotes the corresponding class center. The class centers are updated after each mini-batch step following the method proposed in Wen et al. ( 2016 ).

Finally, a good node-pair vector \(h_{i,j}^{[0,T]}\) should minimize the binary cross-entropy loss of the node-pair prediction task:

\({{\mathcal {L}}}_{pred} = -\sum _{e(v_i,v_j) \in E} \big [\, y_{i,j} \log p_{i,j} + (1 - y_{i,j}) \log (1 - p_{i,j}) \,\big ]\)

We adopt the joint supervision of the prediction loss, contrastive loss, and center loss to jointly train the model for discriminative feature learning and relationship inference:

\({{\mathcal {L}}} = {{\mathcal {L}}}_{pred} + \lambda _{1} {{\mathcal {L}}}_{con} + \lambda _{2} {{\mathcal {L}}}_{cent},\)

where \(\lambda _{1}\) and \(\lambda _{2}\) balance the contribution of the auxiliary losses.

As is usual, the losses are applied over subsets of the entire dataset. In this case, we have an additional requirement for pairs of nodes in \(E^-\) : at least one of the two nodes needs to appear in \(E^+\) . An elaborate batch sampling strategy is proposed in the following section. The model parameters are trained end to end.
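Under the reconstruction above (the weighting coefficients \(\lambda _1, \lambda _2\) and their values are our assumption), the joint objective can be sketched in NumPy as follows.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    # Binary cross-entropy over a batch of link predictions p_ij.
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def center_loss(h, y, centers):
    # Pull each pair embedding toward the center of its class (0 or 1).
    return 0.5 * ((h - centers[y]) ** 2).sum()

def joint_loss(y, p, h, centers, l_con, lam1=0.1, lam2=0.01):
    # l_con is the batch contrastive loss; lam1/lam2 are assumed weights.
    return bce(y, p) + lam1 * l_con + lam2 * center_loss(h, y, centers)
```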

Algorithm 3: Training Procedure in THiGER-A

3.6 Incremental training strategy

This section introduces the incremental training strategy THiGER-A , which extends our base THiGER model. The pseudo-code for THiGER-A is presented in Algorithm 3. We represent the parameters used in the entire architecture as \(\varvec{\theta }= (\theta _A, \theta _E, \theta _C)\) . Let \(P(y \mid e_{i,j}; \varvec{\theta })\) , where \(y\in \{0,1\}\) , denote the link predictor for the nodes \((v_i, v_j)\) . Specifically, in shorthand, we denote \(P(y=1 \mid e_{i,j};\varvec{\theta })\) by \(p_{i,j}\) as in line 3 of Algorithm 2, likewise \(P(y=0\mid e_{i,j}; \varvec{\theta }) = 1 - p_{i,j}\) .

We define \(E^- = (V \times V) \setminus E\) as the set of negative edges representing non-observed connections in the graph. The size of the negative set grows quadratically with the number of nodes, resulting in a computational complexity of \({{\mathcal {O}}}(|V|^2)\) . For large, sparse graphs like ours, the vast number of negative edges makes it impractical to use all of them for model training.

Randomly sampling negative examples may introduce noise and hinder training convergence. To address this challenge, we propose an approach to sample a smaller subset of “informative” negative edges that effectively capture the entity relationships within the graph. Leveraging active learning, a technique for selecting high-utility datasets, we aim to choose a subset \(B^*_N \subset E^-\) that leads to improved model learning.

3.6.1 Negative Edge Sampling using Active Learning

Active learning (AL) is an iterative process centered around acquiring a high-utility subset of samples and subsequently retraining the model. The initial step involves selecting a subset of samples with high utility, determined by a specified informativeness measure. Once this subset is identified, it is incorporated into the training data, and the model is subsequently retrained. This iterative cycle, involving sample acquisition and model retraining, aims to improve the model’s performance and generalizability through the learning process.

In this context, we evaluate the informativeness of edges using a score function denoted as \(S_{AL}: (v_{i}^{-}, v_{j}^{-}) \rightarrow {\mathbb {R}}\) . An edge \((v_{i}^{-}, v_{j}^{-})\) is considered more informative than \((v_{k}^{-}, v_{l}^{-})\) if \(S_{AL}(v_{i}^{-}, v_{j}^{-}) > S_{AL}(v_{k}^{-}, v_{l}^{-})\) . The key challenge in AL lies in defining \(S_{AL}\) , which encodes the learning of the model \(P(.;\varvec{\theta })\) trained in the previous iteration.

We gauge the informativeness of an edge based on model uncertainty. An edge is deemed informative when the current model \(P(.;\varvec{\theta })\) exhibits high uncertainty in predicting its label. Uncertainty sampling is one of the most popular choices for the quantification of informativeness due to its simplicity and high effectiveness in selecting samples for which the model lacks sufficient knowledge. Similar to various previous techniques, we use Shannon entropy to approximate informativeness (Priyadarshini et al. 2021; Kirsch et al. 2019). It is important to emphasize that ground truth labels are unavailable for negative edges, which represent unobserved entity connections. Therefore, to estimate the informativeness of negative edges, we calculate the expected Shannon entropy across all possible labels. Consequently, the expected entropy for a negative edge \((v_{i}^{-}, v_{j}^{-})\) at the \(m^{th}\) training round is defined as:

\(S_{AL}(v_{i}^{-}, v_{j}^{-}) = -\sum _{y \in \{0,1\}} P(y \mid e_{i,j}^{-}; \varvec{\theta }^{m-1}) \log P(y \mid e_{i,j}^{-}; \varvec{\theta }^{m-1})\)

Here, \(\varvec{\theta }^{m-1}\) is the base hypothesis predictor model trained at the \((m-1)^{th}\) training round, and \(m = 0, 1, \cdots , M\) denotes the AL training round. Selecting a subset of uncertain edges \(B_{U}\) using Eq. 12 unfortunately does not ensure diversity within the selected subset. The diversity metric is crucial in subset selection, as it encourages the selection of diverse samples within the embedding space. This, in turn, results in a higher cumulative informativeness for the selected subset, particularly when the edges exhibit overlapping features. The presence of highly correlated edges in the selected subset can lead to a sub-optimal batch with high redundancy. The importance of diversity in selecting informative edges has been emphasized in several prior works (Kirsch et al. 2019; Priyadarshini et al. 2021). To obtain a diverse subset, both approaches aim to maximize the joint entropy (and consequently, minimize mutual information) among the samples in the selected batch. However, maximizing joint entropy is an expensive combinatorial optimization problem and does not scale well for larger datasets, as in our case.

We adopt a similar approach as Kumari et al. (2020) and utilize the k-means++ algorithm (Arthur and Vassilvitskii 2006) to cluster the selected batch \(B_U\) into diverse landmark points. While Kumari et al. (2020) is tailored for metric learning tasks with triplet samples as inputs, our adaptation of the k-means++ algorithm is designed for graph datasets, leading to the selection of diverse edges within the gradient space. Although diversity in the gradient space is effective for gradient-based optimizers, a challenge arises due to the high dimensionality of the gradient space, particularly when the model is large. To overcome this challenge, we compute the expected gradient of the loss function with respect to only the penultimate layer of the network, \(\nabla _{\theta _{out}}{{\mathcal {L}}}_{e_{ij}^{-}}\), assuming it captures task-specific features. We begin to construct an optimal subset \(B_{N}^{*} \subset B_{U}\) by initially (say, at \(k=0\)) selecting the two edges with the most distinct gradients. Subsequently, we iteratively select the edge whose gradient is most dissimilar to the selected subset, using the max-min optimization objective defined in Eq. 13:

\(e^{*} = \mathop {\textrm{argmax}}\limits _{e_{ij}^{-} \in B_{U} \setminus B_{N}^{*}} \; \min _{e' \in B_{N}^{*}} \; d_{E}\big (\nabla _{\theta _{out}}{{\mathcal {L}}}_{e_{ij}^{-}}, \nabla _{\theta _{out}}{{\mathcal {L}}}_{e'}\big )\)

Here \(d_{E}\) represents the Euclidean distance between two vectors in the gradient space, consisting of \(\nabla _{\theta _{out}}{{\mathcal {L}}}_{e_{ij}^{-}}\), which denotes the gradient of the loss function \({{\mathcal {L}}}\) with respect to the penultimate layer of the network \(\theta _{out}\). The process continues until we reach the allocated incremental training budget, \(|B_{N}^{*}| = K\). The resulting optimal subset of negative edges, \(B_{N}^{*}\), comprises negative edges that are both diverse and informative.
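A greedy NumPy sketch of this max-min selection over penultimate-layer gradients (ignoring the k-means++ initialization details) might read:

```python
import numpy as np

def select_diverse(grads, budget):
    """Greedy max-min selection in gradient space (a sketch of Eq. 13).
    grads: (n, d) array of per-edge loss gradients w.r.t. the penultimate
    layer. Returns indices of a diverse subset B_N^* of size `budget`."""
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    i, j = np.unravel_index(dists.argmax(), dists.shape)
    chosen = [int(i), int(j)]           # start from the two most distinct
    while len(chosen) < budget:
        min_dist = dists[:, chosen].min(axis=1)  # distance to nearest chosen
        min_dist[chosen] = -np.inf               # never re-select an edge
        chosen.append(int(min_dist.argmax()))
    return chosen
```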

3.6.2 Positive Edge Sampling

Inspired by curriculum learning (CL), a technique mimicking certain aspects of human learning, we investigate its potential to enhance the performance and generalization of the node-pair predictor model. Curriculum learning involves presenting training data to the model in a purposeful order, starting with easier examples and gradually progressing to more challenging ones. We hypothesize that applying CL principles can benefit our node-pair predictor model: by initially emphasizing the learning of simpler connections and leveraging prior knowledge, the model can effectively generalize to more complex connections during later stages of training. Although active learning (AL) and CL both involve estimating the utility of training samples, they differ in their approach to label availability. AL operates in scenarios where labels are unknown and estimates sample utility based on expected scores. In contrast, CL uses known labels to assess sample difficulty. For our model, we follow one of the common approaches and define a difficulty score \(S_{CL}\) based on the model's prediction confidence, where higher prediction confidence indicates an easier sample:

\(S_{CL}(v_{i}, v_{j}) = 1 - P(y = 1 \mid e_{i,j}; \varvec{\theta }^{m-1})\)

Here, \(S_{CL}(v_{i}, v_{j})\) indicates the predictive uncertainty that an edge \(e_{ij}\) is positive under the existing model \(\varvec{\theta }^{m-1}\) trained at the \((m-1)^{th}\) iteration. In summary, for hypothesis prediction using a large training dataset, active curriculum learning provides a natural approach to sample an informative and diverse subset of high-quality samples, helping to alleviate the challenges associated with label bias.
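Assuming the difficulty score takes the simple form reconstructed above, ordering a batch of positive edges from easy to hard is a one-liner:

```python
import numpy as np

def curriculum_order(p_positive):
    """Order positive edges from easy to hard using the difficulty score
    S_CL = 1 - P(y=1 | e_ij): high model confidence means an easy sample."""
    difficulty = 1.0 - np.asarray(p_positive)
    return np.argsort(difficulty)          # easiest edges first

batch_order = curriculum_order([0.97, 0.51, 0.88])  # -> [0, 2, 1]
```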

4 Experimental setup

In this section, we present the experimental setup for our evaluation. We compare our proposed model, THiGER(-A), against several state-of-the-art (SOTA) methods to provide context for the empirical results on benchmark datasets. To ensure fair comparisons, we utilize publicly available baseline implementations and modify those as needed to align with our model’s configuration and input requirements. All experiments were conducted using Python. For the evaluation of the interaction datasets, we train all models on a single NVIDIA A10G GPU. In the case of the food-related biomedical dataset, we employ 4 NVIDIA V100 GPUs for model training. Notably, all models are trained on single machines. In our experiments, we consider graphs as undirected. The node attribute embedding dimension is set to \(d=128\) for all models evaluated. For baseline methods, we performed a parameter search on the learning rate and training steps, and we report the best results achieved. Our model is implemented in TensorFlow.

4.1 Datasets and model setup

Table 1 shows the statistics of the datasets used in this study. Unless explicitly mentioned, all methods, including our model, share the same initial node attributes provided by pretrained Node2Vec (Grover and Leskovec 2016). The pretrained Node2Vec embedding effectively captures the structural information of nodes in the training graph. In our proposed framework, the choice of a fixed node embedding enables the model to capture the temporal evolution of network relations, given that the node embeddings lie in the same vector space. While employing a dynamic node embedding framework might enhance results, it introduces complexities associated with aligning vector spaces across different timestamps; this aspect is deferred to future research. It is important to note that the Node2Vec embeddings serve solely as initializations for the embedding layer, and the embedding vectors undergo fine-tuning during the learning process to further capture the dynamic evolution of node relationships. For models that solely learn embedding vectors for individual nodes, we represent the \(h_{i,j}\) of a given node pair as the concatenation of the embedding vectors for nodes \(\langle x_i, x_j \rangle\).

4.1.1 Interaction datasets

We have restructured the datasets to align with our specific use case. We partition the edges in the temporal graphs into five distinct groups based on their temporal labels. For example, if a dataset is labeled up to 500 time units, we reorganize them as follows: \(\{0, \dots , 100\} \rightarrow 0\) , \(\{101, \dots , 200\} \rightarrow 1\) , \(\{201, \dots , 300\} \rightarrow 2\) , \(\{301, \dots , 400\} \rightarrow 3\) , and \(\{401, \dots , 500\} \rightarrow 4\) . These User-Item based datasets create bipartite graphs. For all inductive evaluations, we assume knowledge of three nearest node neighbors for each of the unseen nodes. Neighborhood information is updated after model training to incorporate this knowledge, with zero vectors assigned to new nodes.
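For instance, the five-way re-bucketing of a dataset labeled up to 500 time units can be sketched as:

```python
import numpy as np

# Map timestamps in [0, 500] to group labels 0-4 in 100-unit windows,
# matching the regrouping described above.
timestamps = np.array([3, 117, 250, 399, 500])
groups = np.clip((timestamps - 1) // 100, 0, 4)
print(groups)  # -> [0 1 2 3 4]
```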

4.1.2 Food-related biomedical temporal datasets

To construct the relationship graph, we extract sentences containing predefined entities (Genes, Diseases, Chemical Compounds, Nutrition, and Food Ingredients). We establish connections between two concepts that appear in the same sentence within any publication in the dataset. The time step for each relationship between concept pairs corresponds to the publication year when the first mention was identified (i.e., the oldest publication year among all the publications where the concepts are associated). We generate three datasets for evaluation based on concept pair domains: Ingredient, disease pairs, Ingredient, Chemical compound pairs, and all pairs (unfiltered). Graph statistics are provided in Table 1 . For training and testing sets, we divide the graph into 10-year intervals starting from 1940 (i.e., { \(\le 1940\) }, {1941–1950}, \(\dots\) , {2011–2020}). The splits \(\le\) 2020 are used for training, and the split {2021–2022} is used for testing. In accordance with the problem configuration in the interaction dataset, we update the neighborhood information and also assume knowledge of three nearest node neighbors pertaining to each of the unseen nodes for inductive evaluations.

4.1.3 Model setup & parameter tuning

Model Configuration: We employ a hierarchical encoder with \(N\lceil \log _{2} T \rceil\) layers, where N is the multiple applied at each hierarchical layer (i.e., with \(N-1\) encoder sublayers) and T is the number of time steps input to each hierarchical encoder layer. In our experiments, we set the encoder layer multiple to \(N=2\). We use 8 attention heads with 128-dimensional states, and 512-dimensional inner states for the position-wise feed-forward networks. As activation function, we apply the Gaussian Error Linear Unit (GELU, Hendrycks and Gimpel 2016). We apply dropout (Srivastava et al. 2014) to the output of each sublayer with a rate of \(P_{drop} = 0.1\).
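For illustration, the sketch below builds one encoder sublayer with the stated hyper-parameters and computes the hierarchy depth; the exact sublayer wiring (e.g., layer-norm placement) is our assumption, not the published implementation:

```python
import math
import tensorflow as tf

def encoder_sublayer(d_model=128, heads=8, d_ff=512, p_drop=0.1):
    """One encoder sublayer: 8-head attention over 128-d states, a 512-d
    GELU feed-forward block, dropout 0.1, residual connections, layer norm."""
    inputs = tf.keras.Input(shape=(None, d_model))
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=heads, key_dim=d_model // heads)(inputs, inputs)
    x = tf.keras.layers.LayerNormalization()(
        inputs + tf.keras.layers.Dropout(p_drop)(attn))
    ffn = tf.keras.layers.Dense(d_ff, activation="gelu")(x)
    ffn = tf.keras.layers.Dense(d_model)(ffn)
    out = tf.keras.layers.LayerNormalization()(
        x + tf.keras.layers.Dropout(p_drop)(ffn))
    return tf.keras.Model(inputs, out)

def hierarchy_depth(T: int, N: int = 2) -> int:
    """Total number of layers, N * ceil(log2(T))."""
    return N * math.ceil(math.log2(T))
```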

Optimizer: Our models are trained using the AdamW optimizer (Loshchilov and Hutter 2017) with the hyper-parameters \(\beta _1 = 0.9\), \(\beta _2 = 0.99\), and \(\epsilon = 10^{-7}\). We use a linear decay of the learning rate and set the number of warmup steps to \(10\%\) of the number of training steps. We vary the learning rate with the size of the training data.
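A sketch of this optimizer setup; the peak learning rate and step count are placeholders, since the actual values vary with the training data size as noted above (AdamW is available as tf.keras.optimizers.AdamW in recent TensorFlow versions):

```python
import tensorflow as tf

class WarmupLinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup over the first 10% of steps, then linear decay to zero."""
    def __init__(self, peak_lr, total_steps, warmup_frac=0.1):
        self.peak_lr = peak_lr
        self.total_steps = float(total_steps)
        self.warmup_steps = max(1.0, warmup_frac * total_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * step / self.warmup_steps
        decay = self.peak_lr * (self.total_steps - step) / (
            self.total_steps - self.warmup_steps)
        return tf.maximum(0.0, tf.minimum(warmup, decay))

optimizer = tf.keras.optimizers.AdamW(
    learning_rate=WarmupLinearDecay(peak_lr=1e-4, total_steps=10_000),
    beta_1=0.9, beta_2=0.99, epsilon=1e-7)
```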

Time Embedding: We use Time2Vec (T2V, Kazemi et al. 2019) to generate time-step embeddings that encode the temporal sequence of the time steps. The T2V parameters are learned and updated during model training.
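A minimal Time2Vec layer following the published formulation, with one linear component and \(k-1\) periodic sine components (the output dimensionality k is a placeholder here):

```python
import tensorflow as tf

class Time2Vec(tf.keras.layers.Layer):
    """Time2Vec (Kazemi et al. 2019): t2v(t)[0] = w_0*t + b_0 (linear trend),
    t2v(t)[i] = sin(w_i*t + b_i) for i > 0 (periodic components)."""
    def __init__(self, k=128, **kwargs):
        super().__init__(**kwargs)
        self.k = k

    def build(self, input_shape):
        self.w = self.add_weight(name="w", shape=(self.k,),
                                 initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(self.k,),
                                 initializer="zeros")

    def call(self, t):              # t: (batch, 1) scalar time steps
        z = t * self.w + self.b     # broadcasts to (batch, k)
        return tf.concat([z[:, :1], tf.sin(z[:, 1:])], axis=-1)
```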

Active learning: The size of the subset \(B_U\) is twice the size of the optimal subset \(B^{*}\). The model undergoes seven training rounds for the Wikipedia, Reddit, and LastFM datasets, and three rounds for the food-related biomedical datasets (All, Ingredient–Disease, Ingredient–Chemical). Due to the large size of the biomedical dataset, we limit its training to three rounds; however, we anticipate that increasing the number of training rounds would lead to further performance improvements.

4.2 Evaluation metrics

In this study, we assess the efficacy of the models using the binary F1 score and the average precision score (AP) as performance metrics. The binary F1 score is defined as the harmonic mean of precision and recall:

\(F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}\)

Here, precision denotes the ratio of true positive predictions to all predicted positives, while recall denotes the ratio of true positive predictions to all actual positives.

The average precision score is the weighted mean of precisions achieved at successive thresholds, with the incremental change in recall from the previous threshold used as the weight:

\(\mathrm{AP} = \sum_{k=1}^{N} (R_{k} - R_{k-1}) \, P_{k}\)

where N is the total number of thresholds, \(P_{k}\) is the precision at cut-off k, and \(\Delta R_{k} = R_{k} - R_{k-1}\) is the sequential change in the recall value. Our emphasis on positive predictions in the evaluations is driven by our preference for models that efficiently forecast future connections between pairs of nodes.
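Both metrics are available off the shelf; a minimal sketch with toy labels and scores (scikit-learn's average_precision_score implements exactly the weighted mean above):

```python
from sklearn.metrics import average_precision_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]                     # toy edge labels
y_prob = [0.92, 0.40, 0.67, 0.25, 0.08, 0.81]   # toy predicted probabilities

f1 = f1_score(y_true, [int(p >= 0.5) for p in y_prob])  # binary F1 at 0.5
ap = average_precision_score(y_true, y_prob)            # threshold-free AP
print(f"F1 = {f1:.3f}, AP = {ap:.3f}")
```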

4.3 Method categories

We categorize the methods into two main groups based on their handling of temporal information:

Static Methods: These methods treat the graph as static data and do not consider the temporal aspect. The static methods under consideration are logistic regression, GraphSAGE (Hamilton et al. 2017), and AGATHA (Sybrandt et al. 2020).

Temporal Methods: These state-of-the-art methods leverage temporal information to create more informative node representations. We evaluate the performance of our base model, THiGER, and the final model, THiGER-A, against the following temporal methods: CTDNE (Nguyen et al. 2018 ), TGN (Rossi et al. 2020 ), JODIE (Kumar et al. 2019 ), TNodeEmbed (Singer et al. 2019 ), DyRep (Trivedi et al. 2019 ), T-PAIR (Akujuobi et al. 2020b ), and TDE (Zhou et al. 2022 ).

5 Experiments

The performance of THiGER-A is rigorously assessed across multiple benchmark datasets, as presented in Tables 2 and 3 . The experimental evaluations are primarily geared toward two distinct objectives:

Assessing the model’s effectiveness in handling interaction datasets pertinent to temporal graph problems.

Evaluating the model’s proficiency in dealing with food-related biomedical datasets, specifically for predicting relationships between food-related concepts and other biomedical terms.

Sections 4.1.1 and 4.1.2 provide a comprehensive overview of the datasets used. Our evaluations encompass two fundamental settings:

Transductive setup: This scenario involves utilizing data from all nodes during model training.

Inductive setup: In this configuration, at least one node in each evaluated node pair has not been encountered during the model’s training phase.

These experiments are designed to rigorously assess THiGER-A’s performance across diverse datasets, offering insights into its capabilities under varying conditions and problem domains.

5.1 Quantitative evaluation: interaction temporal datasets

We assess the performance of our proposed model in the context of future interaction prediction (Rossi et al. 2020 ; Kumar et al. 2019 ). The datasets record interactions between users and items.

We evaluate the performance on three distinct datasets: (i) Reddit, (ii) LastFM, and (iii) Wikipedia, considering both transductive and inductive settings. In the transductive setting, THiGER-A outperforms the other models across all datasets except Wikipedia, where AGATHA exhibits significant superiority. Our analysis reveals that AGATHA's advantage lies in its use of the entire graph for neighborhood and negative sampling, which gives it an edge over models that, due to computational constraints, use only a subset of the graph. This advantage is more evident in the transductive setup, since AGATHA's training strategy leans towards seen nodes. Nevertheless, THiGER-A consistently achieves comparable or superior performance even in the presence of AGATHA's implicit bias. Note that AGATHA was originally designed for purposes other than node-pair prediction; we adapted the algorithm to the node-pair configuration specifically for our evaluations.

In the inductive setup, our method excels in the Wikipedia and Reddit datasets but lags behind some baselines in the LastFM dataset. Striking a balance between inductive and transductive performance, THiGER-A’s significant performance gain over THiGER underscores the effectiveness of the proposed incremental learning strategy. This advantage is particularly pronounced in the challenging inductive test setting.

5.2 Quantitative evaluation: food-related biomedical temporal datasets

This section presents the quantitative evaluation of our proposed model on temporal node pair (or "link") prediction, focusing explicitly on food-related concept relationships extracted from scientific publications in the PMC dataset. The evaluation encompasses concept pairs from different domains: Ingredient–Disease pairs (F-ID), Ingredient–Chemical Compound pairs (F-IC), and all food-related pairs (F-A). The statistical characteristics of the dataset are summarized in Table 1.

Table  3 demonstrates that our model outperforms the baseline models in both inductive and transductive setups. The second-best performing model is AGATHA, which, as discussed in the previous section, exhibits certain advantages over alternative methods. It is noteworthy that the CTDNE method exhibits scalability issues with larger datasets.

An intriguing observation from this evaluation is that, aside from our proposed model, static methods outperform temporal methods on this dataset. Further investigation revealed that the data is distributed more densely toward the later time steps, with a substantial increase in information during the last ones. Up to the year 2000, the average number of edges per time step is approximately 100,000. This number surges to about 1 million in the 2001–2010 time window, followed by another leap to around 4 million in the 2011–2020 time step. This surge indicates a significant influx of knowledge in food-related research in recent years.

We hypothesize that while this influx is advantageous for static methods, it might adversely affect some temporal methods due to limited temporal information. To test this hypothesis, we conduct an incremental evaluation, illustrated in Fig. 4, using two comparable link prediction methods (logistic regression and GraphSAGE) and the two best temporal methods (tNodeEmbed and THiGER). In this evaluation, we incrementally assess the transductive performance on testing pairs up to the year 2000. Specifically, we evaluate model performance on the food dataset (F-A) in the time interval 1961–1970 using all available training data up to 1960, and analogously for subsequent time intervals.
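The incremental protocol can be sketched as follows, with toy per-decade edge splits standing in for the real data and the model fit/evaluate calls left abstract:

```python
# Toy per-decade splits: decade end-year -> edges first seen in that decade.
splits = {
    1950: {("a", "b")}, 1960: {("b", "c")},
    1970: {("c", "d")}, 1980: {("d", "e")},
}
decades = sorted(splits)
for i in range(1, len(decades)):
    train = set().union(*(splits[d] for d in decades[:i]))
    test = splits[decades[i]]
    # model.fit(train); score = model.evaluate(test)   # any compared method
    print(f"train <= {decades[i - 1]} ({len(train)} edges), "
          f"test on {decades[i]} ({len(test)} edges)")
```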

From Fig.  4 , it is evident that temporal methods outperform static methods when the temporal data is more evenly distributed, i.e., when there is an incremental increase in temporal data. The sudden exponential increase in data during the later years biases the dataset towards the last time steps. However, THiGER consistently outperforms the baseline methods in the incremental evaluation, underscoring its robustness and flexibility.

Fig. 4: Transductive F1 score of incremental predictions (per year) made by THiGER and three baseline methods. The models are incrementally trained on data preceding the displayed evaluation time window.

5.3 Ablation study

In this section, we conduct an ablation study to assess the impact of the proposed sampling strategies on the base model's performance. The results are presented in Table 4, showing the performance improvements achieved by the different versions of the THiGER model (-mean, -gin, and -attn) for each dataset. Because of the much larger size of the food-related biomedical dataset, we conduct the ablation study only on the baseline datasets.

First, we investigate the influence of the active learning (AL)-based negative sampler on the base THiGER model. A comparison of the model’s performance with and without the AL-based negative sampler reveals significant improvements across all datasets. Notably, the performance gains are more pronounced in the challenging inductive test cases where at least one node of an edge is unseen in the training data. This underscores the effectiveness and generalizability of the AL-based learner for the hypothesis prediction model in the positive-unlabeled (PU) learning setup.

Next, we integrate curriculum learning (CL) as a positive data sampler, resulting in further enhancements to the base model. Similar to the AL-based sampling, the performance gains are more pronounced in the inductive test setting. The relatively minor performance improvement in the transductive case may be attributed to the limited room for enhancement in that specific context. Nevertheless, both AL alone and AL combined with CL enhance the base model’s performance and generalizability, particularly in the inductive test scenario.

Fig. 5: Pair embedding visualization. Blue denotes true negatives, red false negatives, green true positives, and purple false positives.

5.4 Pair embedding visualization

In this section, we analyze in detail the node pair embeddings generated by THiGER on the F-ID dataset. To facilitate visualization, we randomly select 900 pairs and employ t-SNE (Van der Maaten and Hinton 2008) to compare these embeddings with those generated by Node2Vec, as shown in Fig. 5. We color-code the points by their observed and predicted labels. Notably, the learned embeddings differ markedly: THiGER effectively separates positive and negative node pairs in the embedding space. True positives (green) and true negatives (blue) lie far apart, while false negatives (red) and false positives (purple) occupy an intermediate region. This observation aligns with the idea that unknown connections are not unequivocal in our application domain, possibly owing to missing data or discoveries yet to be made.
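The visualization step itself is straightforward; in this sketch the embeddings and labels are random placeholders standing in for THiGER's pair vectors and predictions, and the color scheme follows the Fig. 5 legend:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
emb = rng.normal(size=(900, 256))      # placeholder pair embeddings
y_true = rng.integers(0, 2, 900)       # placeholder observed labels
y_pred = rng.integers(0, 2, 900)       # placeholder predicted labels

xy = TSNE(n_components=2, random_state=0).fit_transform(emb)
groups = {"TN": ((y_true == 0) & (y_pred == 0), "blue"),
          "FN": ((y_true == 1) & (y_pred == 0), "red"),
          "TP": ((y_true == 1) & (y_pred == 1), "green"),
          "FP": ((y_true == 0) & (y_pred == 1), "purple")}
for name, (mask, color) in groups.items():
    plt.scatter(xy[mask, 0], xy[mask, 1], s=4, c=color, label=name)
plt.legend()
plt.show()
```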

5.5 Case study

To assess the predictive accuracy of our model, we conducted a detailed analysis using the entire available food-related biomedical temporal dataset and collaborated with biologists to evaluate the correctness of the generated hypotheses. Rather than providing binary predictions (1 or 0), we take a probabilistic approach and assign each predicted node pair a probability score in the range 0 to 1, reflecting the likelihood that a connection exists. Consequently, ranking a set of relation predictions associated with a specific node amounts to ranking the corresponding predicted probabilities.
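Ranking then reduces to sorting by score; a minimal sketch with hypothetical candidate pairs and placeholder probabilities:

```python
# Hypothetical candidates for one query node, with placeholder scores.
candidates = {
    ("flaxseed oil", "root caries"): 0.91,
    ("gingelly oil", "benzoxazinoid"): 0.84,
    ("soybean oil", "senile osteoporosis"): 0.78,
    ("flaxseed oil", "unrelated concept"): 0.11,
}
# Keep only high-probability predictions for expert review.
shortlist = sorted((pair for pair, p in candidates.items() if p >= 0.5),
                   key=lambda pair: -candidates[pair])
```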

Using this methodology, we selected 402 node pairs and presented them to biomedical researchers for evaluation. The researchers sought hypotheses related to specific oils, so we generated hypotheses representing potential future connections between the oil nodes and other nodes. Because the resulting list was extensive, we filtered it by the associated probability scores, selectively retaining high-probability predictions, which were then communicated to the biomedical researchers. Their evaluation encompassed two distinct approaches.

First, they manually searched for references to the predicted positive node pairs in various biology texts outside our dataset. Through these literature searches and reviews, they found supporting relationships for 70 percent of the node pairs.

Second, to explore cases where no direct relationship was apparent in the existing literature, they randomly selected and analyzed three intriguing node pairs: (i) Flaxseed oil and Root caries , (ii) Benzoxazinoid and Gingelly oil , and (iii) Senile osteoporosis and Soybean oil .

5.5.1 Flaxseed oil and root caries

Root caries is a dental condition characterized by the decay and demineralization of tooth root surfaces. It occurs when tooth roots become exposed due to gum recession, allowing bacterial invasion and erosion of the tooth structure. While the scientific literature does not explicitly mention the use of flaxseed oil for root caries, it is well established that flaxseed oil possesses antibacterial properties (Liu et al. 2022), which may inhibit bacterial species responsible for root caries. Furthermore, flaxseed oil is a rich source of omega-3 fatty acids and lignans, factors potentially relevant in this context. Interestingly, observational studies are investigating the oil's effects on gingivitis (Deepika 2018).

5.5.2 Benzoxazinoid and gingelly oil

Benzoxazinoids are plant secondary metabolites synthesized in many monocotyledonous species and some dicotyledonous plants (Schullehner et al. 2008 ). Gingelly oil, derived from sesame seeds, originates from a dicotyledonous plant. In the biologists’ opinion, this concurrence suggests a valid basis for the hypothesized connection.

5.5.3 Senile osteoporosis and soybean oil

Senile osteoporosis is a subtype of osteoporosis occurring in older individuals due to age-related bone loss. Soybean oil, a common vegetable oil derived from soybeans, contains phytic acid (Anderson and Wolf 1995 ). Phytic acid is known to inhibit the absorption of certain minerals, including calcium, which is essential for bone strength (Lönnerdal et al. 1989 ). Again, in the experts’ opinion, this suggests a valid basis for a (unfortunately detrimental) connection between the oil and the health condition.

6 Conclusions

We introduce an innovative approach to tackle the hypothesis generation problem within the context of temporal graphs. We present THiGER, a novel transformer-based model designed for node pair prediction in temporal graphs. THiGER leverages a hierarchical framework to effectively capture and learn from temporal information inherent in such graphs. This framework enables efficient parallel temporal information aggregation. We also introduce THiGER-A, an incremental training strategy that enhances the model’s performance and generalization by training it on high-utility samples selected through active curriculum learning, particularly benefiting the challenging inductive test setting. Quantitative experiments and analyses demonstrate the efficiency and robustness of our proposed method when compared to various state-of-the-art approaches. Qualitative analyses illustrate its practical utility.

For future work, an enticing avenue involves incorporating additional node-pair relationship information from established biomedical and/or food-related knowledge graphs. In scientific research, specific topics often experience sudden exponential growth, leading to imbalances in the temporal data distribution. Another intriguing research direction is thus the study of the relationship between temporal data distribution and the performance of temporal graph neural network models. We plan to analyze the performance of several temporal GNN models across diverse temporal data distributions and propose model enhancement methods tailored to such scenarios.

Due to the vast scale of the publication graph, training the hypothesis predictor with all positive and negative edges is impractical and limits the model’s ability to generalize, especially when the input data is noisy. Thus, it is crucial to train the model selectively on a high-quality subset of the training data. Our work presents active curriculum learning as a promising approach for feasible and robust training for hypothesis predictors. However, a static strategy struggles to generalize well across different scenarios. An exciting direction for future research could be to develop dynamic policies for data sampling that automatically adapt to diverse applications. Furthermore, improving time complexity is a critical challenge, particularly for applications involving large datasets and models.

Ahmed NM, Chen L, Wang Y et al. (2016) Sampling-based algorithm for link prediction in temporal networks. Inform Sci 374:1–14

Akujuobi U, Chen J, Elhoseiny M et al. (2020) Temporal positive-unlabeled learning for biomedical hypothesis generation via risk estimation. Adv Neural Inform Proc Syst 33:4597–4609

Akujuobi U, Spranger M, Palaniappan SK et al. (2020) T-pair: Temporal node-pair embedding for automatic biomedical hypothesis generation. IEEE Trans Knowledge Data Eng 34(6):2988–3001

Anderson RL, Wolf WJ (1995) Compositional changes in trypsin inhibitors, phytic acid, saponins and isoflavones related to soybean processing. J Nutr 125(suppl 3):581S–588S

Arthur D, Vassilvitskii S (2006) k-means++: the advantages of careful seeding. Tech. rep., Stanford University

Ash JT, Zhang C, Krishnamurthy A et al. (2020) Deep batch active learning by diverse, uncertain gradient lower bounds. ICLR, Vienna

Baek SH, Lee D, Kim M et al. (2017) Enriching plausible new hypothesis generation in pubmed. PloS One 12(7):e0180539

Bengio Y, Louradour J, Collobert R, et al. (2009) Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning, 41–48

Brainard J (2020) Scientists are drowning in COVID-19 papers. Can new tools keep them afloat? — science.org. https://www.science.org/content/article/scientists-are-drowning-covid-19-papers-can-new-tools-keep-them-afloat , [Accessed 25-May-2023]

Cartwright D, Harary F (1956) Structural balance: a generalization of Heider’s theory. Psychol Rev 63(5):277

Chen T, Kornblith S, Norouzi M, et al. (2020) A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, PMLR, 1597–1607

Crichton G, Guo Y, Pyysalo S et al. (2018) Neural networks for link prediction in realistic biomedical graphs: a multi-dimensional evaluation of graph embedding-based approaches. BMC Bioinform 19(1):1–11

Deepika A (2018) Effect of flaxseed oil in plaque induced gingivitis-a randomized control double-blind study. J Evid Based Med Healthc 5(10):882–5

Fan Jw, Lussier YA (2017) Word-of-mouth innovation: hypothesis generation for supplement repurposing based on consumer reviews. In: AMIA Annual Symposium Proceedings, American Medical Informatics Association, p 689

Gilad-Bachrach R, Navot A, Tishby N (2006) Query by committee made real. NeurIPS, Denver

Gitmez AA, Zárate RA (2022) Proximity, similarity, and friendship formation: Theory and evidence. arXiv preprint arXiv:2210.06611

Gopalakrishnan V, Jha K, Zhang A, et al. (2016) Generating hypothesis: Using global and local features in graph to discover new knowledge from medical literature. In: Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB, 23–30

Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 855–864

Hacohen G, Weinshall D (2019) On the power of curriculum learning in training deep networks. In: International Conference on Machine Learning, PMLR, 2535–2544

Hamilton W, Ying Z, Leskovec J (2017) Inductive representation learning on large graphs. Adv Neural Inform Proc Syst. https://doi.org/10.48550/arXiv.1706.02216

Hendrycks D, Gimpel K (2016) Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR abs/1606.08415

Hisano R (2018) Semi-supervised graph embedding approach to dynamic link prediction. In: Complex Networks IX: Proceedings of the 9th Conference on Complex Networks CompleNet 2018 9, Springer, 109–121

Hristovski D, Friedman C, Rindflesch TC, et al. (2006) Exploiting semantic relations for literature-based discovery. In: AMIA Annual Symposium Proceedings, 349

Jha K, Xun G, Wang Y, et al. (2019) Hypothesis generation from text based on co-evolution of biomedical concepts. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 843–851

Kazemi SM, Goel R, Eghbali S, et al. (2019) Time2vec: Learning a vector representation of time. arXiv preprint arXiv:1907.05321

King RD, Whelan KE, Jones FM et al. (2004) Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427(6971):247–252

King RD, Rowland J, Oliver SG et al. (2009) The automation of science. Science 324(5923):85–89

Kirsch A, van Amersfoort J, Gal Y (2019) BatchBALD: efficient and diverse batch acquisition for deep Bayesian active learning. NeurIPS, Denver

Kitano H (2021) Nobel turing challenge: creating the engine for scientific discovery. npj Syst Biol Appl 7(1):29

Klein MT, Hou G, Quann RJ et al. (2002) Biomol: a computer-assisted biological modeling tool for complex chemical mixtures and biological processes at the molecular level. Environ Health Perspect 110(suppl 6):1025–1029

Krenn M, Buffoni L, Coutinho B et al. (2023) Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network. Nat Machine Intell 5(11):1326–1335

Kumari P, Goru R, Chaudhuri S et al. (2020) Batch decorrelation for active metric learning. IJCAI-PRICAI, Jeju Island

Kumar S, Zhang X, Leskovec J (2019) Predicting dynamic embedding trajectory in temporal interaction networks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1269–1278

Liu Y, Liu Y, Li P et al. (2022) Antibacterial properties of cyclolinopeptides from flaxseed oil and their application on beef. Food Chem 385:132715

Lönnerdal B, Sandberg AS, Sandström B et al. (1989) Inhibitory effects of phytic acid and other inositol phosphates on zinc and calcium absorption in suckling rats. J Nutr 119(2):211–214

Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

Milani Fard A, Bagheri E, Wang K (2019) Relationship prediction in dynamic heterogeneous information networks. In: Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14–18, 2019, Proceedings, Part I 41, Springer, 19–34

Nguyen GH, Lee JB, Rossi RA et al. (2018) Continuous-time dynamic network embeddings. Companion Proc Web Conf 2018:969–976

Pareja A, Domeniconi G, Chen J, et al. (2020) Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In: Proceedings of the AAAI conference on artificial intelligence, 5363–5370

Pinsler R, Gordon J, Nalisnick E et al. (2019) Bayesian batch active learning as sparse subset approximation. NeurIPS, Denver

Priyadarshini K, Chaudhuri S, Borkar V, et al. (2021) A unified batch selection policy for active metric learning. In: Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part II 21, Springer, 599–616

Rossi E, Chamberlain B, Frasca F, et al. (2020) Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637

Schullehner K, Dick R, Vitzthum F et al. (2008) Benzoxazinoid biosynthesis in dicot plants. Phytochemistry 69(15):2668–2677

Settles B (2012) Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool

Shi F, Foster JG, Evans JA (2015) Weaving the fabric of science: dynamic network models of science’s unfolding structure. Soc Networks 43:73–85

Singer U, Guy I, Radinsky K (2019) Node embedding over temporal graphs. arXiv preprint arXiv:1903.08889

Smalheiser NR, Swanson DR (1998) Using Arrowsmith: a computer-assisted approach to formulating and assessing scientific hypotheses. Comput Methods Prog Biomed 57(3):149–153

Spangler S (2015) Accelerating discovery: mining unstructured information for hypothesis generation. Chapman and Hall/CRC, Boca Raton

Spangler S, Wilkins AD, Bachman BJ, et al. (2014) Automated hypothesis generation based on mining scientific literature. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 1877–1886

Srihari RK, Xu L, Saxena T (2007) Use of ranked cross document evidence trails for hypothesis generation. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 677–686

Srivastava N, Hinton G, Krizhevsky A et al. (2014) Dropout: a simple way to prevent neural networks from overfitting. J Machine Learn Res 15(1):1929–1958

Swanson DR (1986) Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect Biol Med 30(1):7–18

Swanson DR, Smalheiser NR (1997) An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artif Intell 91(2):183–203

Sybrandt J, Shtutman M, Safro I (2017) Moliere: Automatic biomedical hypothesis generation system. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1633–1642

Sybrandt J, Tyagin I, Shtutman M, et al. (2020) Agatha: automatic graph mining and transformer based hypothesis generation approach. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2757–2764

Tabachnick BG, Fidell LS (2000) Computer-assisted research design and analysis. Allyn & Bacon Inc, Boston

Trautman A (2022) Nutritive knowledge based discovery: Enhancing precision nutrition hypothesis generation. PhD thesis, The University of North Carolina at Charlotte

Trivedi R, Farajtabar M, Biswal P, et al. (2019) Dyrep: Learning representations over dynamic graphs. In: International Conference on Learning Representations

Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Machine Learn Res 9(11):2579–2605

Vaswani A, Shazeer N, Parmar N et al. (2017) Attention is all you need. Adv Neural Inform Proc Syst. https://doi.org/10.48550/arXiv.1706.03762

Wang Y, Wang W, Liang Y et al. (2021) Curgraph: curriculum learning for graph classification. Proc Web Conf 2021:1238–1248

Wang Z, Li Q, Yu D et al. (2022) Temporal graph transformer for dynamic network. In: Part II (ed) Artificial Neural Networks and Machine Learning-ICANN 2022: 31st International Conference on Artificial Neural Networks, Bristol, UK, September 6–9, 2022, Proceedings. Springer, Cham, pp 694–705

Wang L, Chang X, Li S, et al. (2021a) Tcl: Transformer-based dynamic graph modelling via contrastive learning. arXiv preprint arXiv:2105.07944

Weissenborn D, Schroeder M, Tsatsaronis G (2015) Discovering relations between indirectly connected biomedical concepts. J Biomed Semant 6(1):28

Wen Y, Zhang K, Li Z, et al. (2016) A discriminative feature learning approach for deep face recognition. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, Springer, 499–515

White K (2021) Publications Output: U.S. Trends and International Comparisons | NSF - National Science Foundation — ncses.nsf.gov. https://ncses.nsf.gov/pubs/nsb20214 , [Accessed 25-May-2023]

Xun G, Jha K, Gopalakrishnan V, et al. (2017) Generating medical hypotheses based on evolutionary medical concepts. In: 2017 IEEE International Conference on Data Mining (ICDM), IEEE, 535–544

Zhang R, Wang Q, Yang Q et al. (2022) Temporal link prediction via adjusted sigmoid function and 2-simplex structure. Sci Rep 12(1):16585

Zhang Y, Pang J (2015) Distance and friendship: A distance-based model for link prediction in social networks. In: Asia-Pacific Web Conference, Springer, 55–66

Zhang Z, Wang J, Zhao L (2023) Relational curriculum learning for graph neural networks. https://openreview.net/forum?id=1bLT3dGNS0

Zhong Y, Huang C (2023) A dynamic graph representation learning based on temporal graph transformer. Alexandria Eng J 63:359–369

Zhou H, Jiang H, Yao W et al. (2022) Learning temporal difference embeddings for biomedical hypothesis generation. Bioinformatics 38(23):5253–5261

Zhou L, Yang Y, Ren X, et al. (2018) Dynamic network embedding by modeling triadic closure process. In: Proceedings of the AAAI Conference on Artificial Intelligence

Author information

Uchenna Akujuobi and Priyadarshini Kumari have contributed equally to this work.

Authors and Affiliations

Sony AI, Barcelona, Spain

Uchenna Akujuobi, Samy Badreddine & Tarek R. Besold

Sony AI, Cupertino, USA

Priyadarshini Kumari

Sony AI, Tokyo, Japan

Jihun Choi & Kana Maruyama

The Systems Biology Institute, Tokyo, Japan

Sucheendra K. Palaniappan

Contributions

U.A. and P.K. co-led the reported work and the writing of the manuscript; J.C., S.B., K.M., and S.P. supported the work and the writing of the manuscript; T.B. supervised the work overall. All authors reviewed the manuscript and contributed to the revisions based on the reviewers' feedback.

Corresponding authors

Correspondence to Uchenna Akujuobi or Priyadarshini Kumari .

Ethics declarations

Conflict of interest.

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Akujuobi, U., Kumari, P., Choi, J. et al. Link prediction for hypothesis generation: an active curriculum learning infused temporal graph-based approach. Artif Intell Rev 57 , 244 (2024). https://doi.org/10.1007/s10462-024-10885-1

Accepted: 25 July 2024

Published: 12 August 2024

DOI: https://doi.org/10.1007/s10462-024-10885-1


Keywords

  • Temporal graph neural network
  • Active learning
  • Hierarchical transformer
  • Curriculum learning
  • Literature-based discovery
  • Edge prediction