Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans. Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

Table of contents

  • Step 1: State your null and alternate hypotheses
  • Step 2: Collect data
  • Step 3: Perform a statistical test
  • Step 4: Decide whether to reject or fail to reject your null hypothesis
  • Step 5: Present your findings
  • Other interesting articles
  • Frequently asked questions about hypothesis testing

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H0) and alternate (Ha) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women. Ha: Men are, on average, taller than women.


For a statistical test to be valid, it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p-value. This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p-value. This means it is likely that any difference you measure between groups is due to chance.
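
As a rough illustration of this idea, the sketch below (not part of the original article) runs a one-way ANOVA on three hypothetical groups with scipy; the group means, spreads, and sizes are invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=170, scale=5, size=30)   # hypothetical measurements per category
group_b = rng.normal(loc=172, scale=5, size=30)
group_c = rng.normal(loc=178, scale=5, size=30)

# The F statistic is essentially between-group variance divided by within-group
# variance; a large F (and small p-value) indicates groups that barely overlap.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```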

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data. In the height example, a test comparing group means (such as a two-sample t-test, sketched below after this list) would give you:

  • an estimate of the difference in average height between the two groups.
  • a p-value showing how likely you are to see this difference if the null hypothesis of no difference is true.
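
For instance, here is a minimal sketch of such a test on simulated heights rather than real data (the group means, standard deviations, and sample sizes below are assumptions, not the article's dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
men = rng.normal(loc=175, scale=7, size=100)     # hypothetical heights in cm
women = rng.normal(loc=162, scale=6, size=100)

# Welch's two-sample t-test: estimate of the mean difference plus a p-value.
t_stat, p_value = stats.ttest_ind(men, women, equal_var=False)
print(f"estimated difference = {men.mean() - women.mean():.1f} cm, p = {p_value:.3g}")

# Decision rule used in the next step: reject H0 when p is below the 0.05 level.
alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```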

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p-value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).

The results of hypothesis testing will be presented in the results and discussion sections of your research paper, dissertation, or thesis.

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p-value). In the discussion, you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis. But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis.

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

Statistics

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.


Explaining Hypothesis Testing with Real-World Examples


Hypothesis testing is a fundamental concept in statistics, particularly in the field of medical research. At StatisMed, we understand the importance of accurately analyzing data to draw meaningful conclusions. In this blog post, we share insights into hypothesis testing with real-world examples and show how it can be applied in medical settings.

Understanding Hypothesis Testing

Hypothesis testing is a statistical method used to make inferences about a population based on sample data. It involves formulating a null hypothesis (H0) and an alternative hypothesis (H1), collecting data, performing statistical tests, and drawing conclusions based on the results.

Real-World Examples

  • A pharmaceutical company wants to test a new drug’s effectiveness in lowering blood pressure compared to the current standard treatment. The null hypothesis states that there is no difference in blood pressure reduction between the two treatments, while the alternative hypothesis suggests that the new drug is more effective.
  • Researchers conduct a clinical trial to determine the efficacy of a new vaccine in preventing a specific disease. The null hypothesis posits that the vaccine does not provide any protection, while the alternative hypothesis proposes that the vaccine is effective (see the sketch after this list).
  • A medical team wants to compare the effectiveness of two treatments for a certain condition. By formulating null and alternative hypotheses, collecting data on patient outcomes, and performing statistical tests, they can determine which treatment is more effective.
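
To make the vaccine example concrete, here is a hedged sketch with invented counts (the case numbers and group sizes are assumptions, not trial data), using a two-proportion z-test from statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

cases = [12, 45]             # hypothetical disease cases: vaccine group, placebo group
participants = [1000, 1000]  # hypothetical group sizes

# H0: the vaccine provides no protection (equal disease proportions).
# Ha: the proportion of cases is lower in the vaccine group (one-sided test).
z_stat, p_value = proportions_ztest(cases, participants, alternative="smaller")
print(f"z = {z_stat:.2f}, one-sided p = {p_value:.4g}")
```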

Application in Medical Research

In the field of medical research, hypothesis testing plays a crucial role in evaluating new treatments, determining the effectiveness of interventions, and establishing evidence-based practices. By using statistical analysis services from StatisMed, medical professionals can ensure their research is conducted rigorously and ethically.

Hypothesis testing is a powerful tool that allows researchers to draw meaningful conclusions from data. By grounding hypothesis testing in real-world examples and applying it to real scenarios, medical professionals can make informed decisions about treatment options, interventions, and research outcomes. At StatisMed, we are committed to providing top-notch statistical analysis services to support evidence-based medical research. Contact us to learn more about how we can assist you in unlocking the potential of hypothesis testing in your research projects.


Hypothesis Testing, P Values, Confidence Intervals, and Significance

Affiliations

  • 1 University of Louisville School of Medicine
  • 2 University of Louisville
  • PMID: 32491353
  • Bookshelf ID: NBK557421

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

Copyright © 2024, StatPearls Publishing LLC.


Conflict of interest statement

Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

  • Definition/Introduction
  • Issues of Concern
  • Clinical Significance
  • Nursing, Allied Health, and Interprofessional Team Interventions
  • Review Questions



Statistics By Jim

Making statistics intuitive

Statistical Hypothesis Testing Overview

By Jim Frost

In this blog post, I explain why you need to use statistical hypothesis testing and help you navigate the essential terminology. Hypothesis testing is a crucial procedure to perform when you want to make inferences about a population using a random sample. These inferences include estimating population properties such as the mean, differences between means, proportions, and the relationships between variables.

This post provides an overview of statistical hypothesis testing. If you need to perform hypothesis tests, consider getting my book, Hypothesis Testing: An Intuitive Guide .

Why You Should Perform Statistical Hypothesis Testing

Figure: Mean drug scores by group. Hypothesis testing determines whether the difference between the means is statistically significant.

Hypothesis testing is a form of inferential statistics that allows us to draw conclusions about an entire population based on a representative sample. You gain tremendous benefits by working with a sample. In most cases, it is simply impossible to observe the entire population to understand its properties. The only alternative is to collect a random sample and then use statistics to analyze it.

While samples are much more practical and less expensive to work with, there are trade-offs. When you estimate the properties of a population from a sample, the sample statistics are unlikely to equal the actual population value exactly. For instance, your sample mean is unlikely to equal the population mean. The difference between the sample statistic and the population value is the sampling error.

Differences that researchers observe in samples might be due to sampling error rather than representing a true effect at the population level. If sampling error causes the observed difference, the next time someone performs the same experiment the results might be different. Hypothesis testing incorporates estimates of the sampling error to help you make the correct decision. Learn more about Sampling Error .

For example, if you are studying the proportion of defects produced by two manufacturing methods, any difference you observe between the two sample proportions might be sampling error rather than a true difference. If the difference does not exist at the population level, you won’t obtain the benefits that you expect based on the sample statistics. That can be a costly mistake!

Let’s cover some basic hypothesis testing terms that you need to know.

Background information : Difference between Descriptive and Inferential Statistics and Populations, Parameters, and Samples in Inferential Statistics

Hypothesis Testing

Hypothesis testing is a statistical analysis that uses sample data to assess two mutually exclusive theories about the properties of a population. Statisticians call these theories the null hypothesis and the alternative hypothesis. A hypothesis test assesses your sample statistic and factors in an estimate of the sampling error to determine which hypothesis the data support.

When you can reject the null hypothesis, the results are statistically significant, and your data support the theory that an effect exists at the population level.

The effect is the difference between the population value and the null hypothesis value. The effect is also known as population effect or the difference. For example, the mean difference between the health outcome for a treatment group and a control group is the effect.

Typically, you do not know the size of the actual effect. However, you can use a hypothesis test to help you determine whether an effect exists and to estimate its size. Hypothesis tests convert your sample effect into a test statistic, which it evaluates for statistical significance. Learn more about Test Statistics .

An effect can be statistically significant, but that doesn’t necessarily indicate that it is important in a real-world, practical sense. For more information, read my post about Statistical vs. Practical Significance .

Null Hypothesis

The null hypothesis is one of two mutually exclusive theories about the properties of the population in hypothesis testing. Typically, the null hypothesis states that there is no effect (i.e., the effect size equals zero). The null is often signified by H0.

In all hypothesis testing, the researchers are testing an effect of some sort. The effect can be the effectiveness of a new vaccination, the durability of a new product, the proportion of defects in a manufacturing process, and so on. There is some benefit or difference that the researchers hope to identify.

However, it’s possible that there is no effect or no difference between the experimental groups. In statistics, we call this lack of an effect the null hypothesis. Therefore, if you can reject the null, you can favor the alternative hypothesis, which states that the effect exists (doesn’t equal zero) at the population level.

You can think of the null as the default theory that requires sufficiently strong evidence against it before you can reject it.

For example, in a 2-sample t-test, the null often states that the difference between the two means equals zero.

When you can reject the null hypothesis, your results are statistically significant. Learn more about Statistical Significance: Definition & Meaning .

Related post : Understanding the Null Hypothesis in More Detail

Alternative Hypothesis

The alternative hypothesis is the other theory about the properties of the population in hypothesis testing. Typically, the alternative hypothesis states that a population parameter does not equal the null hypothesis value. In other words, there is a non-zero effect. If your sample contains sufficient evidence, you can reject the null and favor the alternative hypothesis. The alternative is often identified with H1 or HA.

For example, in a 2-sample t-test, the alternative often states that the difference between the two means does not equal zero.

You can specify either a one- or two-tailed alternative hypothesis:

If you perform a two-tailed hypothesis test, the alternative states that the population parameter does not equal the null value. For example, when the alternative hypothesis is HA: μ ≠ 0, the test can detect differences both greater than and less than the null value.

A one-tailed alternative has more power to detect an effect, but it can test for a difference in only one direction. For example, HA: μ > 0 can only test for differences that are greater than zero.
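
A minimal sketch of this difference on simulated data (the sample below is an invented example, not from the post):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=0.4, scale=1.0, size=25)   # hypothetical measurements

# Two-tailed test of HA: mu != 0 versus one-tailed test of HA: mu > 0.
t_two, p_two = stats.ttest_1samp(sample, popmean=0, alternative="two-sided")
t_one, p_one = stats.ttest_1samp(sample, popmean=0, alternative="greater")

# When the observed effect lies in the hypothesized direction, the one-tailed
# p-value is half the two-tailed one, which is where the extra power comes from.
print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```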

Related posts : Understanding T-tests and One-Tailed and Two-Tailed Hypothesis Tests Explained

P-values

P-values are the probability that you would obtain the effect observed in your sample, or larger, if the null hypothesis is correct. In simpler terms, p-values tell you how strongly your sample data contradict the null. Lower p-values represent stronger evidence against the null. You use P-values in conjunction with the significance level to determine whether your data favor the null or alternative hypothesis.

Related post : Interpreting P-values Correctly

Significance Level (Alpha)


The significance level, also known as alpha (α), is the evidence threshold you set before collecting data: it defines how strong the sample evidence must be before you reject the null hypothesis. For instance, a significance level of 0.05 signifies a 5% risk of deciding that an effect exists when it does not exist.

Use p-values and significance levels together to help you determine which hypothesis the data support. If the p-value is less than your significance level, you can reject the null and conclude that the effect is statistically significant. In other words, the evidence in your sample is strong enough to be able to reject the null hypothesis at the population level.

Related posts : Graphical Approach to Significance Levels and P-values and Conceptual Approach to Understanding Significance Levels

Types of Errors in Hypothesis Testing

Statistical hypothesis tests are not 100% accurate because they use a random sample to draw conclusions about entire populations. There are two types of errors related to drawing an incorrect conclusion.

  • False positives: You reject a null that is true. Statisticians call this a Type I error . The Type I error rate equals your significance level or alpha (α).
  • False negatives: You fail to reject a null that is false. Statisticians call this a Type II error. Generally, you do not know the Type II error rate. However, it is a larger risk when you have a small sample size , noisy data, or a small effect size. The type II error rate is also known as beta (β).

Statistical power is the probability that a hypothesis test correctly infers that a sample effect exists in the population. In other words, the test correctly rejects a false null hypothesis. Consequently, power is inversely related to a Type II error. Power = 1 – β. Learn more about Power in Statistics .
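
As an illustration of how power, alpha, effect size, and sample size relate, here is a short sketch using statsmodels (the effect size, alpha, and power targets are assumptions chosen for the example):

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size of a two-sample t-test that reaches
# 80% power (beta = 0.2) at alpha = 0.05 for a 0.5-standard-deviation effect.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                          alternative="two-sided")
print(f"required sample size per group ≈ {n_per_group:.0f}")
```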

Related posts : Types of Errors in Hypothesis Testing and Estimating a Good Sample Size for Your Study Using Power Analysis

Which Type of Hypothesis Test is Right for You?

There are many different types of procedures you can use. The correct choice depends on your research goals and the data you collect. Do you need to understand the mean or the differences between means? Or, perhaps you need to assess proportions. You can even use hypothesis testing to determine whether the relationships between variables are statistically significant.

To choose the proper statistical procedure, you’ll need to assess your study objectives and collect the correct type of data . This background research is necessary before you begin a study.

Related Post : Hypothesis Tests for Continuous, Binary, and Count Data

Statistical tests are crucial when you want to use sample data to make conclusions about a population because these tests account for sample error. Using significance levels and p-values to determine when to reject the null hypothesis improves the probability that you will draw the correct conclusion.

To see an alternative approach to these traditional hypothesis testing methods, learn about bootstrapping in statistics !

If you want to see examples of hypothesis testing in action, I recommend the following posts that I have written:

  • How Effective Are Flu Shots? This example shows how you can use statistics to test proportions.
  • Fatality Rates in Star Trek . This example shows how to use hypothesis testing with categorical data.
  • Busting Myths About the Battle of the Sexes. A fun example based on a Mythbusters episode that assesses continuous data using several different tests.
  • Are Yawns Contagious? Another fun example inspired by a Mythbusters episode.


Reader Interactions


January 14, 2024 at 8:43 am

Hello professor Jim, how are you doing! Pls. What are the properties of a population and their examples? Thanks for your time and understanding.


January 14, 2024 at 12:57 pm

Please read my post about Populations vs. Samples for more information and examples.

Also, please note there is a search bar in the upper-right margin of my website. Use that to search for topics.


July 5, 2023 at 7:05 am

Hello, I have a question as I read your post. You say in p-values section

“P-values are the probability that you would obtain the effect observed in your sample, or larger, if the null hypothesis is correct. In simpler terms, p-values tell you how strongly your sample data contradict the null. Lower p-values represent stronger evidence against the null.”

But according to your definition of effect, the null states that an effect does not exist, correct? So what I assume you want to say is that “P-values are the probability that you would obtain the effect observed in your sample, or larger, if the null hypothesis is **incorrect**.”

July 6, 2023 at 5:18 am

Hi Shrinivas,

The correct definition of p-value is that it is a probability that exists in the context of a true null hypothesis. So, the quotation is correct in stating “if the null hypothesis is correct.”

Essentially, the p-value tells you the likelihood of your observed results (or more extreme) if the null hypothesis is true. It gives you an idea of whether your results are surprising or unusual if there is no effect.

Hence, with sufficiently low p-values, you reject the null hypothesis because it’s telling you that your sample results were unlikely to have occurred if there was no effect in the population.

I hope that helps make it more clear. If not, let me know I’ll attempt to clarify!


May 8, 2023 at 12:47 am

Thanks a lot Ny best regards

May 7, 2023 at 11:15 pm

Hi Jim Can you tell me something about size effect? Thanks

May 8, 2023 at 12:29 am

Here’s a post that I’ve written about Effect Sizes that will hopefully tell you what you need to know. Please read that. Then, if you have any more specific questions about effect sizes, please post them there. Thanks!


January 7, 2023 at 4:19 pm

Hi Jim, I have only read two pages so far but I am really amazed because in few paragraphs you made me clearly understand the concepts of months of courses I received in biostatistics! Thanks so much for this work you have done it helps a lot!

January 10, 2023 at 3:25 pm

Thanks so much!


June 17, 2021 at 1:45 pm

Can you help in the following question: Rocinante36 is priced at ₹7 lakh and has been designed to deliver a mileage of 22 km/litre and a top speed of 140 km/hr. Formulate the null and alternative hypotheses for mileage and top speed to check whether the new models are performing as per the desired design specifications.


April 19, 2021 at 1:51 pm

Its indeed great to read your work statistics.

I have a doubt regarding the one sample t-test. So as per your book on hypothesis testing with reference to page no 45, you have mentioned the difference between “the sample mean and the hypothesised mean is statistically significant”. So as per my understanding it should be quoted like “the difference between the population mean and the hypothesised mean is statistically significant”. The catch here is the hypothesised mean represents the sample mean.

Please help me understand this.

Regards Rajat

April 19, 2021 at 3:46 pm

Thanks for buying my book. I’m so glad it’s been helpful!

The test is performed on the sample but the results apply to the population. Hence, if the difference between the sample mean (observed in your study) and the hypothesized mean is statistically significant, that suggests that population does not equal the hypothesized mean.

For one sample tests, the hypothesized mean is not the sample mean. It is a mean that you want to use for the test value. It usually represents a value that is important to your research. In other words, it’s a value that you pick for some theoretical/practical reasons. You pick it because you want to determine whether the population mean is different from that particular value.

I hope that helps!


November 5, 2020 at 6:24 am

Jim, you are such a magnificent statistician/economist/econometrician/data scientist etc whatever profession. Your work inspires and simplifies the lives of so many researchers around the world. I truly admire you and your work. I will buy a copy of each book you have on statistics or econometrics. Keep doing the good work. Remain ever blessed

November 6, 2020 at 9:47 pm

Hi Renatus,

Thanks so much for you very kind comments. You made my day!! I’m so glad that my website has been helpful. And, thanks so much for supporting my books! 🙂


November 2, 2020 at 9:32 pm

Hi Jim, I hope you are aware of 2019 American Statistical Association’s official statement on Statistical Significance: https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913 In case you do not bother reading the full article, may I quote you the core message here: “We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way."

With best wishes,

November 3, 2020 at 2:09 am

I’m definitely aware of the debate surrounding how to use p-values most effectively. However, I need to correct you on one point. The link you provide is NOT a statement by the American Statistical Association. It is an editorial by several authors.

There is considerable debate over this issue. There are problems with p-values. However, as the authors state themselves, much of the problem is over people’s mindsets about how to use p-values and their incorrect interpretations about what statistical significance does and does not mean.

If you were to read my website more thoroughly, you’d be aware that I share many of their concerns and I address them in multiple posts. One of the authors’ key points is the need to be thoughtful and conduct thoughtful research and analysis. I emphasize this aspect in multiple posts on this topic. I’ll ask you to read the following three because they all address some of the authors’ concerns and suggestions. But you might run across others to read as well.

Five Tips for Using P-values to Avoid Being Misled How to Interpret P-values Correctly P-values and the Reproducibility of Experimental Results


September 24, 2020 at 11:52 pm

HI Jim, i just want you to know that you made explanation for Statistics so simple! I should say lesser and fewer words that reduce the complexity. All the best! 🙂

September 25, 2020 at 1:03 am

Thanks, Rene! Your kind words mean a lot to me! I’m so glad it has been helpful!


September 23, 2020 at 2:21 am

Honestly, I never understood stats during my entire M.Ed course and was another nightmare for me. But how easily you have explained each concept, I have understood stats way beyond my imagination. Thank you so much for helping ignorant research scholars like us. Looking forward to get hardcopy of your book. Kindly tell is it available through flipkart?

September 24, 2020 at 11:14 pm

I’m so happy to hear that my website has been helpful!

I checked on flipkart and it appears like my books are not available there. I’m never exactly sure where they’re available due to the vagaries of different distribution channels. They are available on Amazon in India.

Introduction to Statistics: An Intuitive Guide (Amazon IN) Hypothesis Testing: An Intuitive Guide (Amazon IN)


July 26, 2020 at 11:57 am

Dear Jim I am a teacher from India . I don’t have any background in statistics, and still I should tell that in a single read I can follow your explanations . I take my entire biostatistics class for botany graduates with your explanations. Thanks a lot. May I know how I can avail your books in India

July 28, 2020 at 12:31 am

Right now my books are only available as ebooks from my website. However, soon I’ll have some exciting news about other ways to obtain it. Stay tuned! I’ll announce it on my email list. If you’re not already on it, you can sign up using the form that is in the right margin of my website.


June 22, 2020 at 2:02 pm

Also can you please let me if this book covers topics like EDA and principal component analysis?

June 22, 2020 at 2:07 pm

This book doesn’t cover principal components analysis. Although, I wouldn’t really classify that as a hypothesis test. In the future, I might write a multivariate analysis book that would cover this and others. But, that’s well down the road.

My Introduction to Statistics covers EDA. That’s the largely graphical look at your data that you often do prior to hypothesis testing. The Introduction book perfectly leads right into the Hypothesis Testing book.

June 22, 2020 at 1:45 pm

Thanks for the detailed explanation. It does clear my doubts. I saw that your book related to hypothesis testing has the topics that I am studying currently. I am looking forward to purchasing it.

Regards, Take Care

June 19, 2020 at 1:03 pm

For this particular article I did not understand a couple of statements and it would great if you could help: 1)”If sample error causes the observed difference, the next time someone performs the same experiment the results might be different.” 2)”If the difference does not exist at the population level, you won’t obtain the benefits that you expect based on the sample statistics.”

I discovered your articles by chance and now I keep coming back to read & understand statistical concepts. These articles are very informative & easy to digest. Thanks for the simplifying things.

June 20, 2020 at 9:53 pm

I’m so happy to hear that you’ve found my website to be helpful!

To answer your questions, keep in mind that a central tenet of inferential statistics is that the random sample that a study drew was only one of an infinite number of possible samples it could’ve drawn. Each random sample produces different results. Most results will cluster around the population value assuming they used good methodology. However, random sampling error always exists and makes it so that population estimates from a sample almost never exactly equal the correct population value.

So, imagine that we’re studying a medication and comparing the treatment and control groups. Suppose that the medicine is truly not effective and that the population difference between the treatment and control group is zero (i.e., no difference.) Despite the true difference being zero, most sample estimates will show some degree of either a positive or negative effect thanks to random sampling error. So, just because a study has an observed difference does not mean that a difference exists at the population level. So, on to your questions:

1. If the observed difference is just random error, then it makes sense that if you collected another random sample, the difference could change. It could change from negative to positive, positive to negative, more extreme, less extreme, etc. However, if the difference exists at the population level, most random samples drawn from the population will reflect that difference. If the medicine has an effect, most random samples will reflect that fact and not bounce around on both sides of zero as much.

2. This is closely related to the previous answer. If there is no difference at the population level, but say you approve the medicine because of the observed effects in a sample. Even though your random sample showed an effect (which was really random error), that effect doesn’t exist. So, when you start using it on a larger scale, people won’t benefit from the medicine. That’s why it’s important to separate out what is easily explained by random error versus what is not easily explained by it.

I think reading my post about how hypothesis tests work will help clarify this process. Also, in about 24 hours (as I write this), I’ll be releasing my new ebook about Hypothesis Testing!


May 29, 2020 at 5:23 am

Hi Jim, I really enjoy your blog. Can you please link me on your blog where you discuss about Subgroup analysis and how it is done? I need to use non parametric and parametric statistical methods for my work and also do subgroup analysis in order to identify potential groups of patients that may benefit more from using a treatment than other groups.

May 29, 2020 at 2:12 pm

Hi, I don’t have a specific article about subgroup analysis. However, subgroup analysis is just the dividing up of a larger sample into subgroups and then analyzing those subgroups separately. You can use the various analyses I write about on the subgroups.

Alternatively, you can include the subgroups in regression analysis as an indicator variable and include that variable as a main effect and an interaction effect to see how the relationships vary by subgroup without needing to subdivide your data. I write about that approach in my article about comparing regression lines . This approach is my preferred approach when possible.


April 19, 2020 at 7:58 am

sir is confidence interval is a part of estimation?


April 17, 2020 at 3:36 pm

Sir can u plz briefly explain alternatives of hypothesis testing? I m unable to find the answer

April 18, 2020 at 1:22 am

Assuming you want to draw conclusions about populations by using samples (i.e., inferential statistics ), you can use confidence intervals and bootstrap methods as alternatives to the traditional hypothesis testing methods.


March 9, 2020 at 10:01 pm

Hi JIm, could you please help with activities that can best teach concepts of hypothesis testing through simulation, Also, do you have any question set that would enhance students intuition why learning hypothesis testing as a topic in introductory statistics. Thanks.


March 5, 2020 at 3:48 pm

Hi Jim, I’m studying multiple hypothesis testing & was wondering if you had any material that would be relevant. I’m more trying to understand how testing multiple samples simultaneously affects your results & more on the Bonferroni Correction

March 5, 2020 at 4:05 pm

I write about multiple comparisons (aka post hoc tests) in the ANOVA context . I don’t talk about Bonferroni Corrections specifically but I cover related types of corrections. I’m not sure if that exactly addresses what you want to know but is probably the closest I have already written. I hope it helps!


January 14, 2020 at 9:03 pm

Thank you! Have a great day/evening.

January 13, 2020 at 7:10 pm

Any help would be greatly appreciated. What is the difference between The Hypothesis Test and The Statistical Test of Hypothesis?

January 14, 2020 at 11:02 am

They sound like the same thing to me. Unless this is specialized terminology for a particular field or the author was intending something specific, I’d guess they’re one and the same.


April 1, 2019 at 10:00 am

so these are the only two forms of Hypothesis used in statistical testing?

April 1, 2019 at 10:02 am

Are you referring to the null and alternative hypothesis? If so, yes, that’s those are the standard hypotheses in a statistical hypothesis test.

April 1, 2019 at 9:57 am

year very insightful post, thanks for the write up


October 27, 2018 at 11:09 pm

hi there, am upcoming statistician, out of all blogs that i have read, i have found this one more useful as long as my problem is concerned. thanks so much

October 27, 2018 at 11:14 pm

Hi Stano, you’re very welcome! Thanks for your kind words. They mean a lot! I’m happy to hear that my posts were able to help you. I’m sure you will be a fantastic statistician. Best of luck with your studies!


October 26, 2018 at 11:39 am

Dear Jim, thank you very much for your explanations! I have a question. Can I use t-test to compare two samples in case each of them have right bias?

October 26, 2018 at 12:00 pm

Hi Tetyana,

You’re very welcome!

The term “right bias” is not a standard term. Do you by chance mean right skewed distributions? In other words, if you plot the distribution for each group on a histogram they have longer right tails? These are not the symmetrical bell-shape curves of the normal distribution.

If that’s the case, yes you can as long as you exceed a specific sample size within each group. I include a table that contains these sample size requirements in my post about nonparametric vs parametric analyses .

Bias in statistics refers to cases where an estimate of a value is systematically higher or lower than the true value. If this is the case, you might be able to use t-tests, but you’d need to be sure to understand the nature of the bias so you would understand what the results are really indicating.

I hope this helps!


April 2, 2018 at 7:28 am

Simple and upto the point 👍 Thank you so much.

April 2, 2018 at 11:11 am

Hi Kalpana, thanks! And I’m glad it was helpful!


March 26, 2018 at 8:41 am

Am I correct if I say: Alpha – Probability of wrongly rejection of null hypothesis P-value – Probability of wrongly acceptance of null hypothesis

March 28, 2018 at 3:14 pm

You’re correct about alpha. Alpha is the probability of rejecting the null hypothesis when the null is true.

Unfortunately, your definition of the p-value is a bit off. The p-value has a fairly convoluted definition. It is the probability of obtaining the effect observed in a sample, or more extreme, if the null hypothesis is true. The p-value does NOT indicate the probability that either the null or alternative is true or false. Although, those are very common misinterpretations. To learn more, read my post about how to interpret p-values correctly .


March 2, 2018 at 6:10 pm

I recently started reading your blog and it is very helpful to understand each concept of statistical tests in easy way with some good examples. Also, I recommend to other people go through all these blogs which you posted. Specially for those people who have not statistical background and they are facing to many problems while studying statistical analysis.

Thank you for your such good blogs.

March 3, 2018 at 10:12 pm

Hi Amit, I’m so glad that my blog posts have been helpful for you! It means a lot to me that you took the time to write such a nice comment! Also, thanks for recommending by blog to others! I try really hard to write posts about statistics that are easy to understand.


January 17, 2018 at 7:03 am

I recently started reading your blog and I find it very interesting. I am learning statistics by my own, and I generally do many google search to understand the concepts. So this blog is quite helpful for me, as it have most of the content which I am looking for.

January 17, 2018 at 3:56 pm

Hi Shashank, thank you! And, I’m very glad to hear that my blog is helpful!


January 2, 2018 at 2:28 pm

thank u very much sir.

January 2, 2018 at 2:36 pm

You’re very welcome, Hiral!


November 21, 2017 at 12:43 pm

Thank u so much sir….your posts always helps me to be a #statistician

November 21, 2017 at 2:40 pm

Hi Sachin, you’re very welcome! I’m happy that you find my posts to be helpful!


November 19, 2017 at 8:22 pm

great post as usual, but it would be nice to see an example.

November 19, 2017 at 8:27 pm

Thank you! At the end of this post, I have links to four other posts that show examples of hypothesis tests in action. You’ll find what you’re looking for in those posts!



Confidence Intervals in Clinical Research

Schober, Patrick MD, PhD, MMedStat * ; Vetter, Thomas R. MD, MPH †

From the * Department of Anesthesiology, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands

† Department of Surgery and Perioperative Care, Dell Medical School at the University of Texas at Austin, Austin, Texas.

Address correspondence to Patrick Schober, MD, PhD, MMedStat, Department of Anesthesiology, Amsterdam UMC, Vrije Universiteit Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands. Address e-mail to [email protected] .


Related Article, see p e119

Confidence intervals provide a range of plausible values for estimates of population parameters or effect sizes.

In an online article in this issue of Anesthesia & Analgesia , Reale et al 1 report an increase in the risk of postpartum hemorrhage (PPH) from 2.9% (95% confidence interval [CI], 2.7%–3.1%) of deliveries in 2010 to 3.2% (95% CI, 3.1%–3.3%) in 2014, with an estimated odds ratio for a 1-year increase of 1.03 (95% CI, 1.01–1.05).

In clinical research, authors commonly use a sample of study subjects to make inferences about the population from which the sample was drawn. 2 However, any sample is affected by randomness, and estimates would be different in a different sample. When estimating a parameter (eg, proportion of PPH) or the effect of an exposure (eg, odds ratio) from a sample, it is not innately evident how precise this estimate actually is. A CI addresses this issue by providing a range of values of what the true parameter value might plausibly be in the underlying population—assuming a representative sample. 2 CIs can be calculated for a wide range of parameters (eg, proportions or means) but also for various effect sizes measures including correlation and regression coefficients, measures of agreement, or measures of diagnostic accuracy. 2–5

With repeated sampling, a CI would be expected to contain the true population parameter in a fixed percentage of the samples. This fixed percentage is the so-called confidence level, which is commonly chosen as 95%. This means that if a random sample were to be taken over and over again from the same population—with a 95% CI calculated each time, about 95% of CIs would contain the true population parameter. 2 While it is impossible to know whether a specific 95% CI actually contains the true population parameter, the CI is often considered the best estimate of the range of plausible values that can be obtained from a study.

In this context, the width of the CI is a measure of the precision of the estimate. In the study by Reale et al, 1 the CIs are very narrow around the point estimates, and thus, the “true” risk of PPH is assumed to be very close to the estimated risk in the sample. Such narrow CIs can be explained by the large sample size as large samples generally provide more precise estimates than smaller samples.
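
The effect of sample size on CI width can be sketched directly; the counts below are illustrative assumptions (chosen to roughly match a 2.9% risk), not the study's data:

```python
from statsmodels.stats.proportion import proportion_confint

for n in (1_000, 100_000):
    events = round(0.029 * n)                     # hypothetical PPH cases
    low, high = proportion_confint(events, n, alpha=0.05, method="wilson")
    print(f"n = {n:>7}: estimate = {events / n:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```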

There is a close relationship between CIs of effect size estimates and hypothesis testing. When the 95% CI of an effect size does not contain the null hypothesis value that indicates “no effect” (eg, an odds ratio of exactly 1), this corresponds to a “statistically significant” result with a .05 alpha level in a hypothesis test.

The CI additionally suggests how large the effect of a treatment or of an exposure could plausibly be. This is helpful in determining not only statistical significance but the clinical relevance of the findings.


Therefore, important estimates of population parameters and effect sizes—in particular, when reporting primary outcomes—should generally be accompanied by CIs, as appropriately done by Reale et al. 1


Reference

1. Reale SC, Easter SR, Xu X, Bateman BT, Farber MK. Trends in Postpartum Hemorrhage in the United States From 2010 to 2014. Anesthesia & Analgesia. 2020;130(5):e119–e122.


Enago Academy

Quick Guide to Biostatistics in Clinical Research: Hypothesis Testing


In this article series, we will be looking at some of the important concepts of biostatistics in clinical trials and clinical research. Statistics is frequently used to analyze quantitative research data. Clinical trials and clinical research both often rely on statistics. Clinical trials proceed through many phases. Contract Research Organizations (CROs) can be hired to conduct a clinical trial. Clinical trials are an important step in deciding if a treatment can be safely and effectively used in medical practice. Once the clinical trial phases are completed, biostatistics is used to analyze the results.

Research generally proceeds in an orderly fashion as shown below.

Research Process

Once you have identified the research question you need to answer, it is time to frame a good hypothesis. The hypothesis is the starting point for biostatistics and is usually based on a theory. Experiments are then designed to test the hypothesis. What is a hypothesis? A research hypothesis is a statement describing a relationship between two or more variables that can be tested. A good hypothesis will be clear, specific, objective, free of moral judgments, and relevant to the research question. Above all, a hypothesis must be testable.

A simple hypothesis would contain one predictor and one outcome variable. For instance, if your hypothesis was, “Chocolate consumption is linked to type II diabetes” the predictor would be whether or not a person eats chocolate and the outcome would be developing type II diabetes. A good hypothesis would also be specific. This means that it should be clear which subjects and research methodology will be used to test the hypothesis. An example of a specific hypothesis would be, “Adults who consume more than 20 grams of milk chocolate per day, as measured by a questionnaire over the course of 12 months, are more likely to develop type II diabetes than adults who consume less than 10 grams of milk chocolate per day.”

Null and Alternative Hypothesis

In statistics, the null hypothesis (H 0 ) states that there is no relationship between the predictor and the outcome variable in the population being studied. For instance, “There is no relationship between a family history of depression and the probability that a person will attempt suicide.” The alternative hypothesis (H 1 ) states that there is a relationship between the predictor (a history of depression) and the outcome (attempted suicide). It is impossible to prove a statement by making several observations but it is possible to disprove a statement with a single observation. If you always saw red tulips, it is not proof that no other colors exist. However, seeing a single tulip that was not red would immediately prove that the statement, “All tulips are red” is false. This is why statistics tests the null hypothesis. It is also why the alternative hypothesis cannot be tested directly.

The alternative hypothesis proposed in medical research may be one-tailed or two-tailed. A one-tailed alternative hypothesis would predict the direction of the effect. Clinical studies may have an alternative hypothesis that patients taking the study drug will have a lower cholesterol level than those taking a placebo. This is an example of a one-tailed hypothesis. A two-tailed alternative hypothesis would only state that there is an association without specifying a direction. An example would be, “Patients who take the study drug will have a significantly different cholesterol level than those patients taking a placebo”. The alternative hypothesis does not state if that level will be higher or lower in those taking the placebo.

The P-Value Approach to Test Hypothesis

Once the hypothesis has been designed, statistical tests help you decide whether to reject or fail to reject the null hypothesis. Statistical tests determine the p-value associated with the research data. The p-value is the probability of obtaining the observed result by chance, assuming the null hypothesis (H0) is true. You must reject the null hypothesis if the p-value of the data falls below the predetermined level of statistical significance. Usually, the level of statistical significance is set at 0.05. If the p-value is less than 0.05, then you would reject the null hypothesis stating that there is no relationship between the predictor and the outcome in the sample population.

However, if the p-value is greater than the predetermined level of significance, then there is no statistically significant association between the predictor and the outcome variable. This does not mean that there is no association between the predictor and the outcome in the population. It only means that the observed relationship could plausibly have occurred by random chance, so the data do not provide strong evidence against the null hypothesis.

For example, null hypothesis (H0): Patients who take the study drug after a heart attack do not have a better chance of avoiding a second heart attack over the next 24 months.

Suppose the data show that those who did not take the study drug were twice as likely to have a second heart attack, with a p-value of 0.08. This p-value indicates that, if the drug truly had no effect, there would be an 8% probability of seeing a difference at least this large (people on the placebo being twice as likely to have a second heart attack) purely because of random chance.
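
A hedged sketch of how such a comparison could be run (the counts are invented and will not reproduce the 8% figure above):

```python
from statsmodels.stats.proportion import proportions_ztest

second_attacks = [5, 10]   # hypothetical events: study drug group, placebo group
group_sizes = [100, 100]

# H0: the study drug does not change the chance of a second heart attack.
z_stat, p_value = proportions_ztest(second_attacks, group_sizes)
print(f"z = {z_stat:.2f}, two-sided p = {p_value:.3f}")

# With significance set at 0.05, a p-value above that threshold means we fail
# to reject the null hypothesis, just as in the example described above.
alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```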

The hypothesis is not a trivial part of the clinical research process. It is a key element in a good biostatistics plan regardless of the clinical trial phase. There are many other concepts that are important for analyzing data from clinical trials. In our next article in the series, we will examine hypothesis testing for one or many populations, as well as error types.


The null hypothesis significance test in health sciences research (1995-2006): statistical analysis and interpretation

Luis Carlos Silva-Ayçaguer, Patricio Suárez-Gil & Ana Fernández-Somoano

BMC Medical Research Methodology, volume 10, Article number: 44 (2010)


The null hypothesis significance test (NHST) is the most frequently used statistical method, although its inferential validity has been widely criticized since its introduction. In 1988, the International Committee of Medical Journal Editors (ICMJE) warned against sole reliance on NHST to substantiate study conclusions and suggested supplementary use of confidence intervals (CI). Our objective was to evaluate the extent and quality of the use of NHST and CI in both English- and Spanish-language biomedical publications between 1995 and 2006, taking into account the ICMJE recommendations, with particular focus on the accuracy of the interpretation of statistical significance and the validity of conclusions.

Original articles published in three English and three Spanish biomedical journals in three fields (General Medicine, Clinical Specialties and Epidemiology - Public Health) were considered for this study. Papers published in 1995-1996, 2000-2001, and 2005-2006 were selected through a systematic sampling method. After excluding the purely descriptive and theoretical articles, analytic studies were evaluated for their use of NHST with P-values and/or CI for interpretation of statistical "significance" and "relevance" in study conclusions.

Among 1,043 original papers, 874 were selected for detailed review. The exclusive use of P-values was less frequent in English language publications as well as in Public Health journals; overall such use decreased from 41% in 1995-1996 to 21% in 2005-2006. While the use of CI increased over time, the "significance fallacy" (to equate statistical and substantive significance) appeared very often, mainly in journals devoted to clinical specialties (81%). In papers originally written in English and Spanish, 15% and 10%, respectively, mentioned statistical significance in their conclusions.

Conclusions

Overall, the results of our review show some improvement in the statistical reporting of results, but further efforts by scholars and journal editors are clearly required to move communication toward the ICMJE advice, especially in the clinical setting and, most urgently, among publications in Spanish.


The null hypothesis statistical testing (NHST) has been the most widely used statistical approach in health research over the past 80 years. Its origins date back to 1279 [ 1 ], although it was in the second decade of the twentieth century when the statistician Ronald Fisher formally introduced the concept of the "null hypothesis" (H0), which, generally speaking, establishes that certain parameters do not differ from each other. He was the inventor of the "P-value" through which it could be assessed [ 2 ]. Fisher's P-value is defined as a conditional probability calculated using the results of a study. Specifically, the P-value is the probability of obtaining a result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. The Fisherian significance testing theory considered the p-value as an index to measure the strength of evidence against the null hypothesis in a single experiment. The father of NHST never endorsed, however, the inflexible application of the ultimately subjective threshold levels almost universally adopted later on (although the 0.05 threshold was also his creation).

A few years later, Jerzy Neyman and Egon Pearson considered the Fisherian approach inefficient, and in 1928 they published an article [ 3 ] that would provide the theoretical basis of what they called hypothesis statistical testing. The Neyman-Pearson approach is based on the notion that one of two choices has to be made on the basis of the information provided: accept the null hypothesis, or reject it in favor of an alternative one. Thus, one can incur one of two types of errors: a Type I error, if the null hypothesis is rejected when it is actually true, and a Type II error, if the null hypothesis is accepted when it is actually false. They established a rule to optimize the decision process, using the p-value introduced by Fisher, by setting the maximum frequency of errors that would be admissible.

The null hypothesis statistical testing, as applied today, is a hybrid coming from the amalgamation of the two methods [ 4 ]. As a matter of fact, some 15 years later, both procedures were combined to give rise to the nowadays widespread use of an inferential tool that would satisfy none of the statisticians involved in the original controversy. The present method essentially goes as follows: given a null hypothesis, an estimate of the parameter (or parameters) is obtained and used to create statistics whose distribution, under H 0 , is known. With these data the P-value is computed. Finally, the null hypothesis is rejected when the obtained P-value is smaller than a certain comparative threshold (usually 0.05) and it is not rejected if P is larger than the threshold.
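A minimal sketch of that hybrid recipe, applied to a difference in means with simulated data (all values below are illustrative assumptions, not data from any cited study):

```python
# Hybrid NHST recipe: estimate the parameter, form a statistic whose null
# distribution is known, compute the p-value, and compare it with the threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=0.0, scale=1.0, size=80)   # simulated outcomes, group A
group_b = rng.normal(loc=0.3, scale=1.0, size=80)   # simulated outcomes, group B

diff = group_a.mean() - group_b.mean()              # parameter estimate
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
z = diff / se                                       # approx. N(0, 1) under H0 for large samples
p_value = 2 * stats.norm.sf(abs(z))                 # two-sided p-value

alpha = 0.05
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(f"estimate = {diff:.3f}, z = {z:.2f}, p = {p_value:.4f} -> {decision}")
```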

The first reservations about the validity of the method began to appear around 1940, when some statisticians censured the logical roots and practical convenience of Fisher's P-value [ 5 ]. Significance tests and P-values have repeatedly drawn the attention and criticism of many authors over the past 70 years, who have kept questioning its epistemological legitimacy as well as its practical value. What remains in spite of these criticisms is the lasting legacy of researchers' unwillingness to eradicate or reform these methods.

Although there are very comprehensive works on the topic [ 6 ], we list below some of the criticisms most universally accepted by specialists.

The P-values are used as a tool to make decisions in favor of or against a hypothesis. What really may be relevant, however, is to get an effect size estimate (often the difference between two values) rather than rendering dichotomous true/false verdicts [ 7 – 11 ].

The P-value is a conditional probability of the data, provided that some assumptions are met, but what really interests the investigator is the inverse probability: what degree of validity can be attributed to each of several competing hypotheses, once that certain data have been observed [ 12 ].

The two elements that affect the results, namely the sample size and the magnitude of the effect, are inextricably linked in the value of p and we can always get a lower P-value by increasing the sample size. Thus, the conclusions depend on a factor completely unrelated to the reality studied (i.e. the available resources, which in turn determine the sample size) [ 13 , 14 ].

Those who defend the NHST often assert the objective nature of that test, but the process is actually far from being so. NHST does not ensure objectivity. This is reflected in the fact that we generally operate with thresholds that are ultimately no more than conventions, such as 0.01 or 0.05. What is more, for many years their use has unequivocally demonstrated the inherent subjectivity that goes with the concept of P, regardless of how it will be used later [ 15 – 17 ].

In practice, the NHST is limited to a binary response sorting hypotheses into "true" and "false" or declaring "rejection" or "no rejection", without demanding a reasonable interpretation of the results, as has been noted time and again for decades. This binary orthodoxy validates categorical thinking, which results in a very simplistic view of scientific activity that induces researchers not to test theories about the magnitude of effect sizes [ 18 – 20 ].

Despite the weaknesses and shortcomings of NHST, it is frequently taught as if it were the key inferential statistical method, or the most appropriate, or even the sole unquestioned one. Statistical textbooks, with only some exceptions, do not even mention the NHST controversy. Instead, the myth is spread that NHST is the "natural" final action of scientific inference and the only procedure for testing hypotheses. However, relevant specialists and important regulators of the scientific world advocate avoiding it.

Taking especially into account that NHST does not offer the most important information (i.e. the magnitude of an effect of interest, and the precision of the estimate of the magnitude of that effect), many experts recommend the reporting of point estimates of effect sizes with confidence intervals as the appropriate representation of the inherent uncertainty linked to empirical studies [ 21 – 25 ]. Since 1988, the International Committee of Medical Journal Editors (ICMJE, known as the Vancouver Group ) incorporates the following recommendation to authors of manuscripts submitted to medical journals: "When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as P-values, which fail to convey important information about effect size" [ 26 ].
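As an illustration of the kind of reporting the ICMJE recommendation calls for, the sketch below computes an odds ratio with a Wald-type 95% confidence interval from a hypothetical 2×2 table; the counts and the choice of a Wald interval are assumptions made purely for illustration.

```python
# Sketch: effect size (odds ratio) with a 95% confidence interval, rather than
# a bare p-value. The 2x2 counts are hypothetical.
import numpy as np
from scipy import stats

a, b = 30, 70   # exposed group:   events, non-events
c, d = 15, 85   # unexposed group: events, non-events

odds_ratio = (a * d) / (b * c)
log_or = np.log(odds_ratio)
se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # Wald standard error on the log scale
z = stats.norm.ppf(0.975)
ci_low = np.exp(log_or - z * se_log_or)
ci_high = np.exp(log_or + z * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
# The interval conveys both the magnitude of the association and the
# uncertainty of the estimate, which a p-value alone does not.
```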

As will be shown, the use of confidence intervals (CI), occasionally accompanied by P-values, is recommended as a more appropriate method for reporting results. Some authors have noted several shortcomings of CI long ago [ 27 ]. In spite of the fact that calculating CI could be complicated indeed, and that their interpretation is far from simple [ 28 , 29 ], authors are urged to use them because they provide much more information than the NHST and do not merit most of its criticisms of NHST [ 30 ]. While some have proposed different options (for instance, likelihood-based information theoretic methods [ 31 ], and the Bayesian inferential paradigm [ 32 ]), confidence interval estimation of effect sizes is clearly the most widespread alternative approach.

Although twenty years have passed since the ICMJE began to disseminate such recommendations, systematically ignored by the vast majority of textbooks and hardly incorporated in medical publications [ 33 ], it is interesting to examine the extent to which the NHST is used in articles published in medical journals during recent years, in order to identify what is still lacking in the process of eradicating the widespread ceremonial use that is made of statistics in health research [ 34 ]. Furthermore, it is enlightening in this context to examine whether these patterns differ between English- and Spanish-speaking worlds and, if so, to see if the changes in paradigms are occurring more slowly in Spanish-language publications. In such a case we would offer various suggestions.

In addition to assessing the adherence to the above cited statistical recommendation proposed by ICMJE relative to the use of P-values, we consider it of particular interest to estimate the extent to which the significance fallacy is present, an inertial deficiency that consists of attributing -- explicitly or not -- qualitative importance or practical relevance to the found differences simply because statistical significance was obtained.

Many authors produce misleading statements such as "a significant effect was (or was not) found" when it should be said that "a statistically significant difference was (or was not) found". A detrimental consequence of this equivalence is that some authors believe that finding out whether there is "statistical significance" or not is the aim, so that this term is then mentioned in the conclusions [ 35 ]. This means virtually nothing, except that it indicates that the author is letting a computer do the thinking. Since the real research questions are never statistical ones, the answers cannot be statistical either. Accordingly, the conversion of the dichotomous outcome produced by a NHST into a conclusion is another manifestation of the mentioned fallacy.

The general objective of the present study is to evaluate the extent and quality of use of NHST and CI, both in English- and in Spanish-language biomedical publications, between 1995 and 2006 taking into account the International Committee of Medical Journal Editors recommendations, with particular focus on accuracy regarding interpretation of statistical significance and the validity of conclusions.

We reviewed the original articles from six journals, three in English and three in Spanish, over three disjoint periods sufficiently separated from each other (1995-1996, 2000-2001, 2005-2006) as to properly describe the evolution in prevalence of the target features along the selected periods.

The selection of journals was intended to get representation for each of the following three thematic areas: clinical specialties ( Obstetrics & Gynecology and Revista Española de Cardiología) ; Public Health and Epidemiology ( International Journal of Epidemiology and Atención Primaria) and the area of general and internal medicine ( British Medical Journal and Medicina Clínica ). Five of the selected journals formally endorsed ICMJE guidelines; the remaining one ( Revista Española de Cardiología ) suggests observing ICMJE demands in relation with specific issues. We attempted to capture journal diversity in the sample by selecting general and specialty journals with different degrees of influence, resulting from their impact factors in 2007, which oscillated between 1.337 (MC) and 9.723 (BMJ). No special reasons guided us to choose these specific journals, but we opted for journals with rather large paid circulations. For instance, the Spanish Cardiology Journal is the one with the largest impact factor among the fourteen Spanish Journals devoted to clinical specialties that have impact factor and Obstetrics & Gynecology has an outstanding impact factor among the huge number of journals available for selection.

It was decided to take around 60 papers for each biennium and journal, which means a total of around 1,000 papers. As recently suggested [ 36 , 37 ], this number was not established using a conventional method, but by means of a purposive and pragmatic approach in choosing the maximum sample size that was feasible.

Systematic sampling in phases [ 38 ] was used in applying a sampling fraction equal to 60/N, where N is the number of articles, in each of the 18 subgroups defined by crossing the six journals and the three time periods. Table 1 lists the population size and the sample size for each subgroup. While the sample within each subgroup was selected with equal probability, estimates based on other subsets of articles (defined across time periods, areas, or languages) are based on samples with various selection probabilities. Proper weights were used to take into account the stratified nature of the sampling in these cases.
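The following is a simplified sketch of this sampling scheme (plain systematic sampling within one subgroup, with the inverse selection probability attached as an analysis weight); the random start, the handling of small subgroups, and the function itself are illustrative assumptions, not the authors' actual procedure or code.

```python
# Simplified sketch: systematic sampling of ~60 articles from one journal-by-period
# subgroup, with an analysis weight equal to the inverse selection probability.
import numpy as np

def systematic_sample(article_ids, target=60, seed=0):
    n = len(article_ids)
    if n <= target:                       # small subgroup: take every article
        return list(article_ids), 1.0
    step = n / target                     # sampling interval = N / 60
    start = np.random.default_rng(seed).uniform(0, step)
    picks = [article_ids[int(start + k * step)] for k in range(target)]
    weight = n / target                   # inverse of the selection probability
    return picks, weight

subgroup = [f"paper_{i:04d}" for i in range(1, 181)]   # e.g. a subgroup with N = 180
sample, weight = systematic_sample(subgroup)
print(len(sample), "articles sampled; each carries weight", round(weight, 2))
```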

Forty-nine of the 1,092 selected papers were eliminated because, although the journal section to which they were assigned suggested they were original articles, detailed scrutiny revealed that they were not. The sample, therefore, consisted of 1,043 papers. Each of them was classified into one of three categories: (1) purely descriptive papers, those designed to review or characterize the state of affairs as it exists at present, (2) analytical papers, or (3) articles that address theoretical, methodological or conceptual issues. An article was regarded as analytical if it seeks to explain the reasons behind a particular occurrence by discovering causal relationships or, even if self-classified as descriptive, it was carried out to assess cause-effect associations among variables. We classified as theoretical or methodological those articles that do not handle empirical data as such, and focus instead on proposing or assessing research methods. We identified 169 papers as purely descriptive or theoretical, which were therefore excluded from the sample. Figure 1 presents a flow chart showing the process for determining eligibility for inclusion in the sample.

Figure 1. Flow chart of the selection process for eligible papers.

To estimate the adherence to ICMJE recommendations, we considered whether the papers used P-values, confidence intervals, and both simultaneously. By "the use of P-values" we mean that the article contains at least one P-value, explicitly mentioned in the text or at the bottom of a table, or that it reports that an effect was considered as statistically significant . It was deemed that an article uses CI if it explicitly contained at least one confidence interval, but not when it only provides information that could allow its computation (usually by presenting both the estimate and the standard error). Probability intervals provided in Bayesian analysis were classified as confidence intervals (although conceptually they are not the same) since what is really of interest here is whether or not the authors quantify the findings and present them with appropriate indicators of the margin of error or uncertainty.

In addition we determined whether the "Results" section of each article attributed the status of "significant" to an effect on the sole basis of the outcome of a NHST (i.e., without clarifying that it is strictly statistical significance). Similarly, we examined whether the term "significant" (applied to a test) was mistakenly used as synonymous with substantive , relevant or important . The use of the term "significant effect" when it is only appropriate as a reference to a "statistically significant difference," can be considered a direct expression of the significance fallacy [ 39 ] and, as such, constitutes one way to detect the problem in a specific paper.

We also assessed whether the "Conclusions," which sometimes appear as a separate section in the paper and otherwise in the last paragraphs of the "Discussion" section, mentioned statistical significance and, if so, whether any such mention was no more than an allusion to the results.

To perform these analyses we considered both the abstract and the body of the article. To assess the handling of the significance issue, however, only the body of the manuscript was taken into account.

The information was collected by four trained observers. Every paper was assigned to two reviewers. Disagreements were discussed and, if no agreement was reached, a third reviewer was consulted to break the tie and so moderate the effect of subjectivity in the assessment.

In order to assess the reliability of the criteria used for the evaluation of articles and to effect a convergence of criteria among the reviewers, a pilot study of 20 papers from each of three journals ( Clinical Medicine , Primary Care , and International Journal of Epidemiology) was performed. The results of this pilot study were satisfactory. Our results are reported using percentages together with their corresponding confidence intervals. For sampling errors estimations, used to obtain confidence intervals, we weighted the data using the inverse of the probability of selection of each paper, and we took into account the complex nature of the sample design. These analyses were carried out with EPIDAT [ 40 ], a specialized computer program that is readily available.

A total of 1,043 articles were reviewed, of which 874 (84%) were found to be analytic, while the remainder were purely descriptive or of a theoretical and methodological nature. Five of the analytic papers did not employ either P-values or CI; consequently, the analysis was made using the remaining 869 articles.

Use of NHST and confidence intervals

The percentage of articles that use only P-values, without even mentioning confidence intervals, to report their results has declined steadily throughout the period analyzed (Table 2 ). The percentage decreased from approximately 41% in 1995-1996 to 21% in 2005-2006. However, it does not differ notably among journals of different languages, as shown by the estimates and confidence intervals of the respective percentages. Concerning thematic areas, it is highly surprising that most of the clinical articles ignore the recommendations of ICMJE, while for general and internal medicine papers such a problem is only present in one in five papers, and in the area of Public Health and Epidemiology it occurs only in one out of six. The use of CI alone (without P-values) has increased slightly across the studied periods (from 9% to 13%), but it is five times more prevalent in Public Health and Epidemiology journals than in Clinical ones, where it reached a scanty 3%.

Ambivalent handling of significance

While the percentage of articles referring implicitly or explicitly to significance in an ambiguous or incorrect way - that is, incurring the significance fallacy -- seems to decline steadily, the prevalence of this problem exceeds 69%, even in the most recent period. This percentage was almost the same for articles written in Spanish and in English, but it was notably higher in the Clinical journals (81%) compared to the other journals, where the problem occurs in approximately 7 out of 10 papers (Table 3 ). The kappa coefficient for measuring agreement between observers concerning the presence of the "significance fallacy" was 0.78 (CI95%: 0.62 to 0.93), which is considered acceptable in the scale of Landis and Koch [ 41 ].
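For readers unfamiliar with the kappa statistic, the short sketch below computes Cohen's kappa for two hypothetical reviewers' binary "significance fallacy" ratings; the rating vectors are invented so that the agreement is of roughly the same order as reported, and they are not the study data.

```python
# Sketch: Cohen's kappa for two reviewers' binary ratings (1 = fallacy present).
# Rating vectors are invented for illustration; they are not the study data.
import numpy as np

rater1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0])
rater2 = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0])

p_observed = np.mean(rater1 == rater2)            # raw agreement
p1, p2 = rater1.mean(), rater2.mean()
p_chance = p1 * p2 + (1 - p1) * (1 - p2)          # agreement expected by chance alone
kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"observed agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
```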

Reference to numerical results or statistical significance in Conclusions

The percentage of papers mentioning a numerical finding as a conclusion is similar in the three periods analyzed (Table 4 ). Concerning languages, this percentage is nearly twice as large for Spanish journals as for those published in English (approximately 21% versus 12%). And, again, the highest percentage (16%) corresponded to clinical journals.

A similar pattern is observed, although with less pronounced differences, in references to the outcome of the NHST (significant or not) in the conclusions (Table 5 ). The percentage of articles that introduce the term in the "Conclusions" does not appreciably differ between articles written in Spanish and in English. Again, the area where this insufficiency is more often present (more than 15% of articles) is the Clinical area.

There are some previous studies addressing the degree to which researchers have moved beyond the ritualistic use of NHST to assess their hypotheses. This has been examined for areas such as biology [ 42 ], organizational research [ 43 ], or psychology [ 44 – 47 ]. However, to our knowledge, no recent research has explored the pattern of use of P-values and CI in the medical literature and, in any case, no efforts have been made to study this problem in a way that takes into account different languages and specialties.

At first glance it is puzzling that, after decades of questioning and technical warnings, and twenty years after the inception of the ICMJE recommendation to avoid NHST, these tests continue to be applied ritualistically and mindlessly as the dominant doctrine. Not long ago, when researchers did not observe statistically significant effects, they were unlikely to write them up and report "negative" findings, since they knew there was a high probability that the paper would be rejected. This has changed a bit: editors are more prone to judge all findings as potentially eloquent. This is probably due to the frequent denunciations of the tendency for papers presenting a significant positive result to receive more favorable publication decisions than equally well-conducted ones that report a negative or null result, the so-called publication bias [ 48 – 50 ]. This new openness is consistent with the fact that if the substantive question addressed is really relevant, the answer (whether positive or negative) will also be relevant.

Consequently, even though it was not an aim of our study, we found many examples in which statistical significance was not obtained. However, many of those negative results were reported with a comment of this type: " The results did not show a significant difference between groups; however, with a larger sample size, this difference would have probably proved to be significant ". The problem with this statement is that it is true; more specifically, it will always be true and it is, therefore, sterile. It is not fortuitous that one never encounters the opposite, and equally tautological, statement: " A significant difference between groups has been detected; however, perhaps with a smaller sample size, this difference would have proved to be not significant" . Such a double standard is itself an unequivocal sign of the ritual application of NHST.

Although the declining rates of NHST usage show that, gradually, ICMJE and similar recommendations are having a positive impact, most of the articles in the clinical setting still considered NHST as the final arbiter of the research process. Moreover, it appears that the improvement in the situation is mostly formal, and the percentage of articles that fall into the significance fallacy is huge.

The contradiction between what has been conceptually recommended and common practice is noticeably less acute in the area of Epidemiology and Public Health, but the same mechanical way of applying significance tests was evident everywhere. Nevertheless, the clinical journals remain the most unmoved by the recommendations.

The ICMJE recommendations are not cosmetic statements but substantial ones, and the vigorous exhortations made by outstanding authorities [ 51 ] are not mere intellectual exercises due to ingenious and inopportune methodologists, but rather they are very serious epistemological warnings.

In some cases, the role of CI is not as clearly suitable (e.g. when estimating multiple regression coefficients or because effect sizes are not available for some research designs [ 43 , 52 ]), but when it comes to estimating, for example, an odds ratio or a rates difference, the advantage of using CI instead of P values is very clear, since in such cases it is obvious that the goal is to assess what has been called the "effect size."

The inherent resistance to change old paradigms and practices that have been entrenched for decades is always high. Old habits die hard. The estimates and trends outlined are entirely consistent with Alvan Feinstein's warning 25 years ago: "Because the history of medical research also shows a long tradition of maintaining loyalty to established doctrines long after the doctrines had been discredited, or shown to be valueless, we cannot expect a sudden change in this medical policy merely because it has been denounced by leading connoisseurs of statistics [ 53 ]".

It is possible, however, that the nature of the problem has an external explanation: it is likely that some editors prefer to "avoid troubles" with the authors and vice versa, thus resorting to the most conventional procedures. Many junior researchers believe that it is wise to avoid long back-and-forth discussions with reviewers and editors. In general, researchers who want to appear in print and survive in a publish-or-perish environment are motivated by force, fear, and expedience in their use of NHST [ 54 ]. Furthermore, it is relatively natural that rank-and-file researchers use NHST when they observe that some of its theoretical critics have themselves used this statistical analysis in empirical studies published after the appearance of their own critiques [ 55 ].

For example, the Journal of the American Medical Association published a bibliometric study [ 56 ] discussing the impact of statisticians' co-authorship of medical papers on publication decisions by two major high-impact journals: British Medical Journal and Annals of Internal Medicine. The data analysis is characterized by methodological orthodoxy. The authors use only chi-square tests without any reference to CI, although NHST had been repeatedly criticized over the years by two of the authors: Douglas Altman, an early promoter of confidence intervals as an alternative [ 57 ], and Steve Goodman, a critic of NHST from a Bayesian perspective [ 58 ]. Individual authors, however, cannot be blamed for broader institutional problems and systemic forces opposed to change.

The present effort is certainly partial in at least two ways: it is limited to only six specific journals and to three biennia. It would be therefore highly desirable to improve it by studying the problem in a more detailed way (especially by reviewing more journals with different profiles), and continuing the review of prevailing patterns and trends.

Curran-Everett D: Explorations in statistics: hypothesis tests and P values. Adv Physiol Educ. 2009, 33: 81-86. 10.1152/advan.90218.2008.


Fisher RA: Statistical Methods for Research Workers. 1925, Edinburgh: Oliver & Boyd


Neyman J, Pearson E: On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika. 1928, 20: 175-240.

Silva LC: Los laberintos de la investigación biomédica. En defensa de la racionalidad para la ciencia del siglo XXI. 2009, Madrid: Díaz de Santos

Berkson J: Test of significance considered as evidence. J Am Stat Assoc. 1942, 37: 325-335. 10.2307/2279000.


Nickerson RS: Null hypothesis significance testing: A review of an old and continuing controversy. Psychol Methods. 2000, 5: 241-301. 10.1037/1082-989X.5.2.241.


Rozeboom WW: The fallacy of the null-hypothesis significance test. Psychol Bull. 1960, 57: 418-428. 10.1037/h0042040.

Callahan JL, Reio TG: Making subjective judgments in quantitative studies: The importance of using effect sizes and confidence intervals. HRD Quarterly. 2006, 17: 159-173.

Nakagawa S, Cuthill IC: Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev. 2007, 82: 591-605. 10.1111/j.1469-185X.2007.00027.x.

Breaugh JA: Effect size estimation: factors to consider and mistakes to avoid. J Manage. 2003, 29: 79-97. 10.1177/014920630302900106.

Thompson B: What future quantitative social science research could look like: confidence intervals for effect sizes. Educ Res. 2002, 31: 25-32.

Matthews RA: Significance levels for the assessment of anomalous phenomena. Journal of Scientific Exploration. 1999, 13: 1-7.

Savage IR: Nonparametric statistics. J Am Stat Assoc. 1957, 52: 332-333.

Silva LC, Benavides A, Almenara J: El péndulo bayesiano: Crónica de una polémica estadística. Llull. 2002, 25: 109-128.

Goodman SN, Royall R: Evidence and scientific research. Am J Public Health. 1988, 78: 1568-1574. 10.2105/AJPH.78.12.1568.


Berger JO, Berry DA: Statistical analysis and the illusion of objectivity. Am Sci. 1988, 76: 159-165.

Hurlbert SH, Lombardi CM: Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Ann Zool Fenn. 2009, 46: 311-349.

Fidler F, Thomason N, Cumming G, Finch S, Leeman J: Editors can lead researchers to confidence intervals but they can't make them think: Statistical reform lessons from Medicine. Psychol Sci. 2004, 15: 119-126. 10.1111/j.0963-7214.2004.01502008.x.

Balluerka N, Vergara AI, Arnau J: Calculating the main alternatives to null-hypothesis-significance testing in between-subject experimental designs. Psicothema. 2009, 21: 141-151.

Cumming G, Fidler F: Confidence intervals: Better answers to better questions. J Psychol. 2009, 217: 15-26.

Jones LV, Tukey JW: A sensible formulation of the significance test. Psychol Methods. 2000, 5: 411-414. 10.1037/1082-989X.5.4.411.

Dixon P: The p-value fallacy and how to avoid it. Can J Exp Psychol. 2003, 57: 189-202.

Nakagawa S, Cuthill IC: Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc. 2007, 82: 591-605. 10.1111/j.1469-185X.2007.00027.x.

Brandstaetter E: Confidence intervals as an alternative to significance testing. MPR-Online. 2001, 4: 33-46.

Masson ME, Loftus GR: Using confidence intervals for graphically based data interpretation. Can J Exp Psychol. 2003, 57: 203-220.

International Committee of Medical Journal Editors: Uniform requirements for manuscripts submitted to biomedical journals. Update October 2008. Accessed July 11, 2009, [ http://www.icmje.org ]

Feinstein AR: P-Values and Confidence Intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998, 51: 355-360. 10.1016/S0895-4356(97)00295-3.

Haller H, Kraus S: Misinterpretations of significance: A problem students share with their teachers?. MPR-Online. 2002, 7: 1-20.

Gigerenzer G, Krauss S, Vitouch O: The null ritual: What you always wanted to know about significance testing but were afraid to ask. The Handbook of Methodology for the Social Sciences. Edited by: Kaplan D. 2004, Thousand Oaks, CA: Sage Publications, Chapter 21: 391-408.

Curran-Everett D, Taylor S, Kafadar K: Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol. 1998, 85: 775-786.


Royall RM: Statistical evidence: a likelihood paradigm. 1997, Boca Raton: Chapman & Hall/CRC

Goodman SN: Of P values and Bayes: A modest proposal. Epidemiology. 2001, 12: 295-297. 10.1097/00001648-200105000-00006.

Sarria M, Silva LC: Tests of statistical significance in three biomedical journals: a critical review. Rev Panam Salud Publica. 2004, 15: 300-306.

Silva LC: Una ceremonia estadística para identificar factores de riesgo. Salud Colectiva. 2005, 1: 322-329.

Goodman SN: Toward Evidence-Based Medical Statistics 1: The p Value Fallacy. Ann Intern Med. 1999, 130: 995-1004.

Schulz KF, Grimes DA: Sample size calculations in randomised clinical trials: mandatory and mystical. Lancet. 2005, 365: 1348-1353. 10.1016/S0140-6736(05)61034-3.

Bacchetti P: Current sample size conventions: Flaws, harms, and alternatives. BMC Med. 2010, 8: 17-10.1186/1741-7015-8-17.


Silva LC: Diseño razonado de muestras para la investigación sanitaria. 2000, Madrid: Díaz de Santos

Barnett ML, Mathisen A: Tyranny of the p-value: The conflict between statistical significance and common sense. J Dent Res. 1997, 76: 534-536. 10.1177/00220345970760010201.

Santiago MI, Hervada X, Naveira G, Silva LC, Fariñas H, Vázquez E, Bacallao J, Mújica OJ: [The Epidat program: uses and perspectives] [letter]. Pan Am J Public Health. 2010, 27: 80-82. Spanish.

Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977, 33: 159-74. 10.2307/2529310.

Fidler F, Burgman MA, Cumming G, Buttrose R, Thomason N: Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conserv Biol. 2005, 20: 1539-1544. 10.1111/j.1523-1739.2006.00525.x.

Kline RB: Beyond significance testing: Reforming data analysis methods in behavioral research. 2004, Washington, DC: American Psychological Association


Curran-Everett D, Benos DJ: Guidelines for reporting statistics in journals published by the American Physiological Society: the sequel. Adv Physiol Educ. 2007, 31: 295-298. 10.1152/advan.00022.2007.

Hubbard R, Parsa AR, Luthy MR: The spread of statistical significance testing: The case of the Journal of Applied Psychology. Theor Psychol. 1997, 7: 545-554. 10.1177/0959354397074006.

Vacha-Haase T, Nilsson JE, Reetz DR, Lance TS, Thompson B: Reporting practices and APA editorial policies regarding statistical significance and effect size. Theor Psychol. 2000, 10: 413-425. 10.1177/0959354300103006.

Krueger J: Null hypothesis significance testing: On the survival of a flawed method. Am Psychol. 2001, 56: 16-26. 10.1037/0003-066X.56.1.16.

Rising K, Bacchetti P, Bero L: Reporting Bias in Drug Trials Submitted to the Food and Drug Administration: Review of Publication and Presentation. PLoS Med. 2008, 5: e217. doi:10.1371/journal.pmed.0050217.

Sridharan L, Greenland L: Editorial policies and publication bias the importance of negative studies. Arch Intern Med. 2009, 169: 1022-1023. 10.1001/archinternmed.2009.100.

Falagas ME, Alexiou VG: The top-ten in journal impact factor manipulation. Arch Immunol Ther Exp (Warsz). 2008, 56: 223-226. 10.1007/s00005-008-0024-5.

Rothman K: Writing for Epidemiology. Epidemiology. 1998, 9: 98-104. 10.1097/00001648-199805000-00019.

Fidler F: The fifth edition of the APA publication manual: Why its statistics recommendations are so controversial. Educ Psychol Meas. 2002, 62: 749-770. 10.1177/001316402236876.

Feinstein AR: Clinical epidemiology: The architecture of clinical research. 1985, Philadelphia: W.B. Saunders Company

Orlitzky M: Institutionalized dualism: statistical significance testing as myth and ceremony. Accessed Feb 8, 2010, [ http://ssrn.com/abstract=1415926 ]

Greenwald AG, González R, Harris RJ, Guthrie D: Effect sizes and p-value. What should be reported and what should be replicated?. Psychophysiology. 1996, 33: 175-183. 10.1111/j.1469-8986.1996.tb02121.x.

Altman DG, Goodman SN, Schroter S: How statistical expertise is used in medical research. J Am Med Assoc. 2002, 287: 2817-2820. 10.1001/jama.287.21.2817.

Gardner MJ, Altman DJ: Statistics with confidence. Confidence intervals and statistical guidelines. 1992, London: BMJ

Goodman SN: P Values, Hypothesis Tests and Likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol. 1993, 137: 485-496.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/44/prepub


Acknowledgements

The authors would like to thank Tania Iglesias-Cabo and Vanesa Alvarez-González for their help with the collection of empirical data and their participation in an earlier version of the paper. The manuscript has benefited greatly from thoughtful, constructive feedback by Carlos Campillo-Artero, Tom Piazza and Ann Séror.

Author information

Authors and Affiliations

Centro Nacional de Investigación de Ciencias Médicas, La Habana, Cuba

Luis Carlos Silva-Ayçaguer

Unidad de Investigación. Hospital de Cabueñes, Servicio de Salud del Principado de Asturias (SESPA), Gijón, Spain

Patricio Suárez-Gil

CIBER Epidemiología y Salud Pública (CIBERESP), Spain and Departamento de Medicina, Unidad de Epidemiología Molecular del Instituto Universitario de Oncología, Universidad de Oviedo, Spain

Ana Fernández-Somoano


Corresponding author

Correspondence to Patricio Suárez-Gil .

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LCSA designed the study, wrote the paper and supervised the whole process; PSG coordinated the data extraction and carried out statistical analysis, as well as participated in the editing process; AFS extracted the data and participated in the first stage of statistical analysis; all authors contributed to and revised the final manuscript.


Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Cite this article

Silva-Ayçaguer, L.C., Suárez-Gil, P. & Fernández-Somoano, A. The null hypothesis significance test in health sciences research (1995-2006): statistical analysis and interpretation. BMC Med Res Methodol 10 , 44 (2010). https://doi.org/10.1186/1471-2288-10-44


Received : 29 December 2009

Accepted : 19 May 2010

Published : 19 May 2010

DOI : https://doi.org/10.1186/1471-2288-10-44


  • Clinical Specialty
  • Significance Fallacy
  • Null Hypothesis Statistical Testing
  • Medical Journal Editor
  • Clinical Journal



Insights in Hypothesis Testing and Making Decisions in Biomedical Research

Varin Sacha

1 Collège de Villamont, Lausanne, Switzerland

Demosthenes B. Panagiotakos

2 School of Health Science and Education, Harokopio University, Athens, Greece

It is a fact that p values are commonly used for inference in biomedical and other social fields of research. Unfortunately, the p value is very often misused and misinterpreted; that is why the use of resampling methods, like the bootstrap method, has been recommended to calculate confidence intervals, which provide more robust results for inference than the p value does. In this review, we discuss the use of p values through hypothesis testing and its alternatives, using resampling methods to develop confidence intervals for the tested statistic or effect measure.

BRIEF HISTORY OF HYPOTHESIS TESTING

At first it has to be clarified that a "significance test" is different from a "hypothesis test". Many textbooks, especially in the social and biomedical sciences, mix these two approaches into a logically flawed mishmash, which is referred to as the "null-hypothesis significance test". The null-hypothesis significance test is a combination of ideas developed in the 1920s and 1930s, primarily by Ronald Fisher (1925) and Jerzy Neyman & Egon Pearson (1933) [ 1 ]. These two testing approaches are not philosophically compatible even if they are technically related. Fisher developed tests of significance as an inferential tool. The main reason was to walk away from the subjectivism inherent in Bayesian inference (namely, the assignment of equal prior probabilities to hypotheses) and substitute a more objective approach. However, Fisher's tests also depend on two other important elements: research methodology (Fisher pioneered experimental control, random allocation to groups, etc.) and small samples. Neyman & Pearson liked Fisher's approach, although they felt it lacked a strong mathematical foundation. As their theory progressed, it stopped being a refinement of Fisher's method and became a different approach. The main differences between the Fisher and Neyman & Pearson approaches are both philosophical and technical. Philosophically, Neyman and Pearson's approach assumes known hypotheses, is based on repeated sampling from the same population, focuses on decision making, and aims to control decision errors in the long run. Thus, it can be considered less inferential and more deductive. Technically, Neyman and Pearson's approach uses Fisher's tests of significance, but also incorporates other elements, like effect sizes, Type II errors, and the power of the statistical test. Neyman and Pearson also incorporated other methodological improvements, such as random sampling [ 2 - 15 ].

Significance tests and hypothesis tests are based on the assumption of a (statistical) null hypothesis, i.e., a statement that there is no relationship, e.g., no difference between treatment effects on an outcome. This is a mere technical requirement that provides the statistical context needed to apply probabilistic calculations. In the approach suggested by Fisher, a "significance test" considers only the null hypothesis and gives a p value, a continuous empirical measure of the "significance of the results" (given the considered null hypothesis). This measure has no particular meaning and is not calibrated to any kind of relevance. It is just a value between 0 and 1, referring to how likely it is to observe "more extreme results" given the null hypothesis. In the approach suggested by Neyman & Pearson, a "hypothesis test" is actually a test about an alternative hypothesis, which refers to a "minimally relevant effect" (rather than merely "some non-zero effect", in contrast with the null hypothesis). These tests are designed to control error rates and allow a balance of the expected cost/benefit ratios associated with the actions taken based on the test results. To perform such tests, a minimally relevant effect and acceptable error rates must be specified. After the experiment or the study is conducted, the decision is actually about rejecting (or not) a hypothesis. So either the "null hypothesis" is not rejected, which means that the assumed effect was not relevant, or the alternative hypothesis is accepted, which means that the effect was relevant. Note that there is no point where the "truthfulness" of an effect is discussed. This does not matter in statistical hypothesis testing. The only thing that matters is what actions are taken based on an effect that is considered relevant [ 2 - 15 ].
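To make those ingredients concrete, the sketch below uses a standard normal-approximation formula to turn a chosen minimally relevant effect and chosen error rates into a per-group sample size; the numerical choices are illustrative assumptions, not values taken from any particular study.

```python
# Sketch: Neyman-Pearson ingredients made concrete. Given a minimally relevant
# effect (in standard-deviation units), a Type I error rate alpha, and a Type II
# error rate beta, a normal-approximation formula gives the per-group sample size
# for a two-sample comparison of means.
from scipy import stats

effect_size = 0.5         # minimally relevant difference, in SD units (illustrative)
alpha, beta = 0.05, 0.20  # Type I and Type II error rates (power = 1 - beta = 0.80)

z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(1 - beta)
n_per_group = 2 * ((z_alpha + z_beta) / effect_size) ** 2

print(f"approximately {n_per_group:.0f} patients per group")   # ~63 per group here
```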

Major Problems in Using p Values as the Result of a Hypothesis Test

Many investigators, in various research fields, refer to Neyman & Pearson hypothesis tests and their associated p values. Indeed, the p value is a widely used tool for inference in studies. However, despite the numerous books, papers and other scientific literature published on this topic, there still seem to be serious misuses and misinterpretations of the p value. According to Daniel Goodman, "a p value is the right answer to the wrong question" [ 1 ]. A summary given by Joseph Lawrence presents at least four major problems associated with the use of p values [ 16 ]:

  • " P values are often misinterpreted as the probability of the null hypothesis, given the data, when in fact they are calculated assuming the null hypothesis to be true."
  • "Researchers often use p values to “dichotomize” results into “important” or “unimportant” depending on whether p is less or greater than a significance level, e.g. , 5%, respectively. However, there is not much difference between p -values of 0.049 and 0.051, so that the cut off of 0.05 is considered arbitrary."
  • " P values concentrate attention away from the magnitude of the actual effect sizes. For example, one could have a p value that is very small, but is associated with a clinically unimportant difference. This is especially prone to occur in cases where the sample size is large. Conversely, results of potentially great clinical interest are not necessarily ruled out if p > 0.05, especially in studies with small sample sizes. Therefore, one should not confuse statistical significance with practical or clinical importance."
  • "The null hypothesis is almost never exactly true. In fact it is hard to believed that the null hypothesis, H o : µ = µ 0 , is correct! Since the null hypothesis is almost surely false to begin with, it makes little sense to test it. Instead, it should rational to start with the question “by how much are the two treatments different?"

There are so many major problems related to p values that most statisticians now recommend against their use, in favour of, for example, confidence intervals. In a previous publication entitled “The value of p -value in biomedical research” alternatives for evaluating the observed evidence were briefly discussed [ 17 ]. Here, a thorough review on hypothesis testing is presented.

Hypothesis Testing Versus Confidence Intervals

Researchers from many fields are very familiar with calculating and interpreting the outcome of empirical research based solely on the p value [ 18 ]. The commonly suggested alternative to the use of hypothesis tests is the use of confidence intervals [ 19 - 26 ]. As suggested by Wood (2014), the idea of confidence intervals is to use the data to derive an interval within which, at a specified level of confidence, the population parameter will lie [ 19 ]. Two-sided hypothesis tests are dual to two-sided confidence intervals: a parameter value lies inside the (1-α)×100% confidence interval if and only if the level-α hypothesis test whose null value is that parameter does not reject the null. This principle is called the duality of hypothesis testing and confidence intervals [ 20 ]. Likewise, there is a one-to-one relationship between one-sided tests and one-sided confidence intervals. The relationship is exact only if the standard error used in the confidence interval and in the statistical test is identical.
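The duality can be demonstrated directly; the sketch below runs a one-sample t-test and builds the matching 95% confidence interval from the same standard error, using simulated data (the numbers are illustrative only).

```python
# Sketch of the test/interval duality: a two-sided level-alpha test rejects the
# null value exactly when that value falls outside the (1 - alpha) confidence
# interval built from the same standard error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
x = rng.normal(loc=2.4, scale=5.0, size=40)        # simulated measurements
null_value = 0.0
alpha = 0.05

mean = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))
t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

t_stat = (mean - null_value) / se
p_value = 2 * stats.t.sf(abs(t_stat), df=len(x) - 1)

print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
print(f"p = {p_value:.4f}; reject H0: {p_value < alpha}")
print(f"null value outside the CI: {not (ci_low <= null_value <= ci_high)}")
# The last two lines always agree; that agreement is the duality.
```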

However, many statisticians nowadays avoid using any hypothesis tests, since their interpretations may vary and the derived p values cannot, generally, be interpreted in meaningful ways. Moreover, it is acknowledged that by calculating the confidence interval, researchers may gain insights into the nature of their data and the evaluated associations, whereas p values tell them almost nothing of the kind. Criticisms of hypothesis testing, most of them dating back more than 50 years, suggest that hypothesis tests "are not a contribution to science" (Savage, 1957, in Gerrodette, 2011, p. 404), are "a serious impediment to the interpretation of data" (Skipper et al., 1967, in Gerrodette, 2011, p. 404), are "worse than irrelevant" (Nelder, 1985, in Gerrodette, 2011, p. 404), or are "completely devoid of practical utility" (Finney, 1989, in Gerrodette, 2011, p. 404) [ 1 ].

Nevertheless, and despite all the criticism, hypothesis tests and their associated p values are still widely prevalent. According to Lesaffre (2008) [ 21 ], it is important to note that a 95% confidence interval bears more information than a p value, since the confidence interval has a much easier interpretation and allows better comparability of results across different trials. Moreover, in meta-analyses, the confidence interval is the preferred tool for making statistical inference. According to Wood (2014) [ 19 ], a (1-α)×100% confidence interval directly provides the strength of the effect, as well as the uncertainty due to sampling error, in an obvious way through the width of the interval. The information displayed is not trivial or obvious, as NHST conclusions may be, and misinterpretations seem far less likely than for NHSTs. Thus, the use of confidence intervals has the potential to avoid many of the widely acknowledged problems of NHSTs and p values [ 19 ]. Moreover, several high-impact journals, especially in the health sciences and other fields, as well as societies (e.g., the American Psychological Association's (APA) Task Force on Statistical Inference (TFSI)), have strongly discouraged the use of p values, preferring point and interval estimates of the effect size (i.e., odds ratios, relative risks, etc.) as an expression of the uncertainty resulting from limited sample size, and also encouraging the use of Bayesian methodology [ 21 - 22 ]. It is not surprising to note that, a century after its introduction, many researchers still poorly understand the exact meaning of the p value, resulting in many misinterpretations [ 17 ].

Advantages of the Confidence Interval Versus the p Value

It is now common belief that researchers should be interested in defining the size of the effect of a measured outcome, rather than obtaining a simple indication of whether or not it is statistically significant [ 23 ]. On the basis of the sample data, confidence intervals present a range of alternative values within which the unknown population value for such an effect is likely to lie. Indeed, confidence intervals give different information and have a different interpretation than p values, since they specify a range of alternative values for the actual effect size (presenting the results directly on the scale of measurement), while p values do not. Moreover, confidence intervals make the extent of uncertainty salient, which a p value cannot do. Since the mid-1980s, Gardner & Altman have suggested that "a confidence interval produces a move from a single value estimate - such as the sample mean, difference between sample means, etc - to a range of values that are considered to be plausible for the population" [ 24 ].

Resampling Techniques

It is known from basic statistics that many statistical criteria (e.g., the t-test) are asymptotically normally distributed, but the normal distribution may not always be a good approximation to their actual sampling distribution in the empirical samples derived from experiments, clinical trials or observational surveys. Indeed, the validity of traditional statistical inference rests mostly on a theorem known as the Central Limit Theorem, which stipulates that, under fairly general conditions, the sampling distribution of the test statistic can be approximated by a normal distribution or, under more limited assumptions, by the t- or chi-square distributions. Confidence intervals and p values are then calculated on the basis of these assumptions, which may leave considerable room for doubt and concern.

The point of resampling methods is to avoid relying on Gaussian assumptions. Resampling is a methodology proposed in the early 1940s for estimating the precision of statistics, such as means, medians, proportions, odds ratios, relative risks, etc., by using k subsets of size m (< n) of the originally collected data (i.e., the jackknife method) or by drawing random sets of data with replacement from the original sample (i.e., the bootstrap method). Indeed, when the Gaussian assumptions do not hold, the validity of classical inferential statistics tends to be undermined; it is in these situations that resampling methods come to the rescue. The main idea of resampling is to obtain an empirical distribution of the test statistic based on what is observed and to use it to approximate the true, but unknown, distribution of that statistic. An important advantage of this approach is that it can be applied to many statistics (e.g., means, medians, etc.) and effect size measures (e.g., correlation coefficients, odds ratios, relative risks, etc.) with the use of computer software. There are several types of resampling methods, i.e., the bootstrap, the jackknife, cross-validation (also called rotation estimation), and the permutation test (also called the randomization exact test). In classical parametric tests the observed statistics are compared to theoretical sampling distributions, whereas resampling methods build the reference distribution from the observed data themselves, which is what makes them innovative [25]. Among all resampling methods, the bootstrap is certainly the most frequently used [26]. Thus, resampling methods can be a substantial improvement over traditional inference, since a confidence interval for the true value of an unknown statistic or effect size measure has a much more concrete interpretation than the p value from a statistical test, although there is still no guarantee.
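As a rough illustration of the bootstrap idea described above, the sketch below draws resamples with replacement, recomputes the statistic on each resample, and reads a percentile confidence interval off the resulting empirical distribution; the data and function names are hypothetical.

```python
import numpy as np

def bootstrap_ci(data, statistic=np.median, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample with replacement and recompute the statistic on each resample
    estimates = np.array([
        statistic(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    # Read the confidence limits off the empirical distribution of the estimates
    return tuple(np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Hypothetical skewed outcome, e.g., length of hospital stay in days
sample = np.random.default_rng(1).lognormal(mean=1.2, sigma=0.8, size=80)
print(bootstrap_ci(sample, statistic=np.median))
```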

However, it should be mentioned that the sampling distribution of many effect size measures is often highly skewed, so traditional symmetric confidence intervals will not work well in these cases. Symmetric confidence intervals are appropriate for a few quantities, such as means and linear regression coefficients, but they are inappropriate for many other measures [27]. It is therefore better not to assume a symmetric confidence interval for a measure of association, and instead to start from the assumption that it is not normally distributed. The empirical distribution derived, for example, from the bootstrap method does not assume that the distribution is symmetric.
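A minimal sketch of how the percentile bootstrap yields an interval that need not be symmetric around the point estimate, here for an odds ratio; the two samples below are simulated and zero-cell corrections are ignored for brevity.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical binary outcomes in an exposed and an unexposed group (1 = event)
exposed = rng.binomial(1, 0.30, size=120)
unexposed = rng.binomial(1, 0.15, size=150)

def odds_ratio(exp_group, unexp_group):
    a, b = exp_group.sum(), (exp_group == 0).sum()
    c, d = unexp_group.sum(), (unexp_group == 0).sum()
    return (a * d) / (b * c)   # zero-cell corrections ignored for brevity

boot = []
for _ in range(5_000):
    e = rng.choice(exposed, size=exposed.size, replace=True)
    u = rng.choice(unexposed, size=unexposed.size, replace=True)
    boot.append(odds_ratio(e, u))

point = odds_ratio(exposed, unexposed)
low, high = np.percentile(boot, [2.5, 97.5])
# The interval is generally not symmetric around the point estimate
print(f"OR = {point:.2f}, 95% percentile bootstrap CI = ({low:.2f}, {high:.2f})")
```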

In conclusion, for inferential purposes, it is recommended to present the results from studies using confidence intervals for the statistics and effect size measures of interest, rather than hypothesis tests and their associated p values. Moreover, depending on the statistic of interest, bootstrap techniques or other resampling methods are also recommended, because these techniques do not depend on the shape of the underlying distribution and can easily be performed using standard software.

CONFLICT OF INTEREST

The authors confirm that this article content has no conflict of interest.

ACKNOWLEDGEMENTS

Declared None.


Predicting cardiovascular disease risk using photoplethysmography and deep learning

* E-mail: [email protected] (WHW); [email protected] (SP); [email protected] (DA)

¶ ‡ WHW, SB and MD contributed equally as first authors to this work.

  • Published: June 4, 2024
  • https://doi.org/10.1371/journal.pgph.0003204

Cardiovascular diseases (CVDs) are responsible for a large proportion of premature deaths in low- and middle-income countries. Early CVD detection and intervention is critical in these populations, yet many existing CVD risk scores require a physical examination or lab measurements, which can be challenging in such health systems due to limited accessibility. We investigated the potential of photoplethysmography (PPG), a sensing technology available on most smartphones that could enable large-scale screening at low cost, for CVD risk prediction. We developed a deep learning PPG-based CVD risk score (DLS) to predict the probability of having major adverse cardiovascular events (MACE: non-fatal myocardial infarction, stroke, and cardiovascular death) within ten years, given only age, sex, smoking status and PPG as predictors. We compare the DLS with the office-based refit-WHO score, which adopts the shared predictors from the WHO and Globorisk scores (age, sex, smoking status, height, weight and systolic blood pressure) but is refitted on the UK Biobank (UKB) cohort. All models were trained on a development dataset (141,509 participants) and evaluated on a geographically separate test dataset (54,856 participants), both from UKB. The DLS's C-statistic (71.1%, 95% CI 69.9–72.4) is non-inferior to the office-based refit-WHO score (70.9%, 95% CI 69.7–72.2; non-inferiority margin of 2.5%, p<0.01) in the test dataset. The calibration of the DLS is satisfactory, with a 1.8% mean absolute calibration error. Adding DLS features to the office-based score increases the C-statistic by 1.0% (95% CI 0.6–1.4). DLS predicts ten-year MACE risk comparably with the office-based refit-WHO score. Interpretability analyses suggest that the DLS-extracted features are related to PPG waveform morphology and are independent of heart rate. Our study provides a proof of concept and suggests the potential of PPG-based strategies for community-based primary prevention in resource-limited regions.

Citation: Weng W-H, Baur S, Daswani M, Chen C, Harrell L, Kakarmath S, et al. (2024) Predicting cardiovascular disease risk using photoplethysmography and deep learning. PLOS Glob Public Health 4(6): e0003204. https://doi.org/10.1371/journal.pgph.0003204

Editor: Julia Robinson, PLOS: Public Library of Science, UNITED STATES

Received: May 25, 2023; Accepted: April 12, 2024; Published: June 4, 2024

Copyright: © 2024 Weng et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: This research has been conducted using the UK Biobank Resource under Application Number 65275. Individual-level data from the UK Biobank are not publicly available according to policy but can be made available upon application to the UK Biobank. Please visit the UK Biobank website, https://www.ukbiobank.ac.uk/ , for application procedures. Code for collecting PPG in an app is available at https://github.com/google-research/CVD-paper-mobile-camera-example . Statistical code used for this study will be available at https://github.com/Google-Health/google-health . Embeddings for the UK Biobank PPG data will be returned to and made available via the UK Biobank.

Funding: The study was supported by Google LLC. All Google-affiliated authors are Google employees and own Alphabet stock. Google LLC was involved in the design and conduct of the study; analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Competing interests: Authors WHW, SB, MD, CC, SK, YL, and DA are employed at Google LLC and hold shares in Alphabet, and are co-inventors on patents (in various stages) for CVD risk prediction using deep learning and PPG, but declare no non-financial competing interests. LH, BB, CYM, YM, GSC, SS, and SP are employed at Google LLC and hold shares in Alphabet but declare no non-financial competing interests. SK serves as an Associate Editor for this journal but had no role in the editorial process or decisions for this manuscript. GD declares no financial or non-financial competing interests.

Introduction

Cardiovascular diseases (CVDs) are responsible for one third of deaths globally [1], with approximately three quarters occurring in low- and middle-income countries (LMICs), where there is a paucity of resources for early disease detection [2, 3]. Because CVD risk factors such as hypertension, diabetes, or hyperlipidemia are typically symptomless before advanced disease, there is a great need for screening programs to identify those at high risk of CVD events. Interventions such as lifestyle counseling, with or without prescription medications, have been shown to be an effective strategy for CVD prevention among these individuals [4].

Multiple risk scores, such as the WHO/ISH risk chart and Globorisk scores, have been developed to triage CVD risk based on demographics, past medical history, vital signs, and laboratory data [4–7]. However, the dependency of these risk scores on medical and laboratory equipment (e.g., sphygmomanometers) [8, 9] limits their reach. Specifically, low-resource healthcare systems have relied largely on opportunistic screening [10], such as via community healthcare workers (CHWs) [11], to close access gaps. We reasoned that developing low-cost, easy-to-use, lightweight, digital point-of-care tools using sensors already available in smartphones [12–14] could further the reach and capability of CHW-based programs and enable large-scale screening at low cost [15].

Among sensing signals for the circulatory system, photoplethysmography (PPG) is a non-invasive, fast, simple, and low-cost technology, and can be captured with sensors available on increasingly ubiquitous devices such as smartphones and pulse oximeters [16]. PPG measures the change in blood volume in an area of tissue across cardiac cycles and is primarily used for heart rate monitoring in healthcare settings [17, 18]. Research has also investigated the utility of PPG for understanding short-term fluctuations in vascular compliance, by estimating continuous blood pressure (BP) in an ICU setting [17, 19, 20], though the accuracy of such approaches is known to be insufficient even when per-user calibration is available [17]. Beyond short-term vascular changes, research has also been conducted into understanding the slow manifestation of vascular aging and arterial stiffness from PPG waveforms [21–23], which is useful for longer-term CVD risk assessment. Since PPG is potentially more accessible and requires less training to measure, such technologies could provide accurate real-time insights. The ubiquity of smartphones has also prompted research involving PPG as measured from smartphone cameras, by placing a finger on the camera [16]. Taken together, enabling CVD risk estimation based on PPG signals could provide a highly accessible screening tool in low-resource health systems ( Fig 1 ).


Fig 1. The motivation for applying PPG-based cardiovascular disease (CVD) risk assessment in low-resource health systems. Non-office-based information acquired from mobile-sensing technologies may help address the burden of cardiovascular disease risk screening and triage in resource-limited areas. In this study, we compare our developed model (DLS) with existing office-based and lab-based CVD risk scores that have been developed for low-resource medical settings, including models refitting the variables in the WHO and Globorisk scores, and office-based and lab-based Globorisk scores recalibrated on the same study cohort.

https://doi.org/10.1371/journal.pgph.0003204.g001

While there is existing work on BP estimation and on evaluating other related CVD risk factors such as heart rate variability (HRV) or arterial stiffness from PPG [24, 25], we have not found existing literature on predicting CVD risk directly from PPG waveform signals. The closest related works are those that predicted CVD risk via the arterial stiffness index (ASI) estimated by PPG [26–28]. In this paper, we investigate the feasibility of leveraging PPG for CVD risk prediction using data from the UK Biobank (UKB). Specifically, we predict the ten-year risk of developing a major adverse cardiovascular event (MACE) using deep learning-based PPG embeddings and heart rate (measured by PPG), along with other demographics, including age, sex and smoking status, but without any inputs from physical examination or laboratory data ( S1 Fig ). We find that our deep learning PPG-based CVD risk prediction score (DLS) is well calibrated and non-inferior to the existing comparative office-based CVD risk score using predictors from WHO/ISH and Globorisk, which requires blood pressure, weight and height measurements, or laboratory data.

Material and methods

We developed a new CVD risk prediction score, DLS, based on age, sex, smoking status and the results of a deep learning analysis of PPG signals. We used a Cox proportional hazards model and data from UKB to predict the ten-year risk of MACE among individuals free of CVD at baseline.

Data source and cohort

The DLS was developed and evaluated using data from the UKB dataset, filtered to participants aged 40–74 to mirror a previous study [4]. We then stratified UKB participants who had PPG waveforms recorded into train (n = 105,319), tune (n = 46,868), and test (n = 57,702) subsets based on geographic information about the site of data collection, i.e., latitude and longitude. This strategy aligns with the TRIPOD guidelines [29] on external validation (specifically, validation on a different geographic region) by allowing for non-random variation between data splits, such as differences in data acquisition or environment.

We used PPG waveforms from all visits for the participants in this train subset to train the PPG feature extractor in DLS (details in "Model development"). The low-dimensional numeric outputs (embeddings) computed by this model were used as additional input features to our Cox model. To develop the Cox model that generates the DLS prediction of MACE risk, additional clinical and demographic variables and inclusion/exclusion criteria were needed. First, we excluded participants with non-fatal myocardial infarction or stroke before their first visit, or missing any of the variables for our model (age, sex, and smoking status). We also excluded those without body mass index (BMI) or systolic BP (SBP), for a fair comparison against the other office- and lab-based risk prediction models. For each participant, we only included measurements related to their first visit. All numerically measured variables were standard-scaled. Cox models were regularized using a ridge penalty. In the final cohort, 97,970, 43,539, and 54,856 participants were included to train, tune, and test the survival model, respectively ( Fig 2 ). The participants were recruited to the UKB study between 2006 and 2010, and the anonymized individual participant data in UKB were collected and accessed by the authors for modeling and data analysis from September 2022 to March 2023. Descriptive statistics for this cohort are in Table 1 .

Fig 2: https://doi.org/10.1371/journal.pgph.0003204.g002

Table 1. Details of the geographic split based on longitude and latitude are listed in S1 Table and S2 Fig. https://doi.org/10.1371/journal.pgph.0003204.t001

Model development

First stage: PPG feature extractor.

For DLS, we first trained a deep learning-based feature extractor to learn PPG representations from UKB summarized PPG waveform signals, using a one-dimensional ResNet18 [21, 22, 30] as the neural network architecture. Specifically, the UKB PPG waveform signals (UKB Data-Field 4205) were collected by the PPG device PulseTrace PCA2 (CareFusion, USA); the device averaged a minimum of six pulse waveforms with a pulse interval close to the average pulse interval, and output a single summarized waveform stretched to 100 temporal units irrespective of heart rate (for example, see https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=100181 ). Since the UKB normalized the PPG data into a 100-length time series irrespective of heart rate, the time interval between successive values can vary across participants due to variations in heart rate. We therefore resampled the x-axis of the raw data so that the interval between two timesteps is uniform (18.2 milliseconds per step), resulting in a variable-length time series per participant. We then padded all samples to the same length with zeros. We also applied the Brownian-tape speed data augmentation (details in S1 Text, the "Details of model training" section), which is specific to time-series data.
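The following is a minimal sketch of the kind of resampling and padding described in this paragraph; the 100-sample waveform, pulse interval and padded length are hypothetical inputs, and the actual UKB preprocessing pipeline is not reproduced here.

```python
import numpy as np

TARGET_DT_MS = 18.2   # uniform step used in this paragraph
MAX_LEN = 128         # hypothetical padded length

def resample_and_pad(waveform_100, pulse_interval_ms):
    """Map a 100-point summarized PPG pulse onto a uniform 18.2 ms grid, then zero-pad.

    waveform_100: hypothetical 100-sample summarized waveform (one averaged pulse).
    pulse_interval_ms: the participant's average pulse interval in milliseconds.
    """
    waveform_100 = np.asarray(waveform_100, dtype=float)
    # The 100 samples are spread evenly over one pulse interval
    original_t = np.linspace(0.0, pulse_interval_ms, num=waveform_100.size)
    # New uniform grid; its length varies with the participant's heart rate
    new_t = np.arange(0.0, pulse_interval_ms, TARGET_DT_MS)
    resampled = np.interp(new_t, original_t, waveform_100)
    # Zero-pad so all participants share a single tensor shape
    padded = np.zeros(MAX_LEN)
    n = min(resampled.size, MAX_LEN)
    padded[:n] = resampled[:n]
    return padded

# Example: a synthetic pulse with an 840 ms pulse interval (roughly 71 bpm)
pulse = np.abs(np.sin(np.linspace(0.0, np.pi, 100)))
print(resample_and_pad(pulse, 840.0).shape)
```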

We selected the one-dimensional ResNet18 as the backbone for time-series modeling because one-dimensional convolutional neural networks are known to be strong models for time-series tasks [31], and the one-dimensional ResNet18 is lightweight enough for potential low-resource use, such as deployment on mobile devices.

We trained the feature extractor on the train subset, and picked the network weights that maximized the Cox pseudolikelihood (see the description of the second stage below) on the tune subset. These weights were used to compute PPG embeddings on the train, tune, and test subsets. The PPG embeddings were further processed by principal component analysis (PCA), a dimensionality-reduction technique, into five PCA-derived DLS features that are used by the survival model. Modeling details are listed in S1 Text .
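As a sketch of the dimensionality-reduction step, the snippet below reduces hypothetical PPG embeddings to five principal components; the array shapes are invented, and we assume (as the split design implies) that the transform would be fitted on the train subset and then applied unchanged to the tune and test subsets.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical deep-learning PPG embeddings, shape (n_participants, embedding_dim)
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(1_000, 64))
test_embeddings = rng.normal(size=(200, 64))

# Fit the projection on the train subset, then apply it unchanged elsewhere
pca = PCA(n_components=5)
dls_features_train = pca.fit_transform(train_embeddings)   # (1000, 5)
dls_features_test = pca.transform(test_embeddings)         # (200, 5)
print(pca.explained_variance_ratio_)
```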

Second stage: Survival model.

In the second stage, we developed a Cox proportional hazards regression model for predicting ten-year MACE risk, using as inputs age, sex, smoking status, PCA-derived PPG embeddings and PPG-HR (heart rate measured during PPG assessment). The model was trained on the train subset and tuned on the tune subset to decide the best-performing ridge regularization parameter ( S2 Table ).
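A minimal sketch of a ridge-penalized Cox model of this form using the Lifelines library mentioned in the Statistical analysis section; the data frame, column names and penalty strength are hypothetical and do not reflect the study's actual configuration.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical training frame: standard-scaled predictors plus follow-up time and event flag
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "age": rng.normal(size=n),
    "sex": rng.integers(0, 2, size=n),
    "smoking": rng.integers(0, 2, size=n),
    "ppg_hr": rng.normal(size=n),
    **{f"ppg_pca_{i}": rng.normal(size=n) for i in range(1, 6)},
})
df["duration_years"] = rng.exponential(scale=12.0, size=n)
df["mace_event"] = rng.integers(0, 2, size=n)

# Ridge-penalized Cox proportional hazards model (l1_ratio=0 gives a pure L2 penalty)
cph = CoxPHFitter(penalizer=0.1, l1_ratio=0.0)
cph.fit(df, duration_col="duration_years", event_col="mace_event")

# Predicted ten-year risk = 1 - S(10 | covariates)
covariates = df.drop(columns=["duration_years", "mace_event"])
ten_year_risk = 1.0 - cph.predict_survival_function(covariates, times=[10.0]).loc[10.0]
print(ten_year_risk.head())
```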

Models for comparison/reference.

For comparison, we developed different survival models based on different feature sets ( Table 2 "Features used" column, and S3 Table ), including office-based and laboratory-based refit-WHO scores using the CVD risk predictors adopted in the WHO/ISH risk chart and Globorisk studies, a metadata-only model (age, sex, smoking status), a metadata + PPG morphology model (metadata plus engineered PPG features describing waveform morphology, such as dicrotic notch presence; details in S4 Table ), models without smoking status as an input (metadata without smoking, DLS without smoking), and a "Full" reference model that considered metadata, laboratory data, medication and medical history (details in the S2 Text ). We chose the model using the shared predictors from the office-based WHO/ISH risk chart and Globorisk score (office-based refit-WHO score) as the main reference since these predictors are adopted in CVD risk research for low-resource settings. To ensure the fairest comparison, the coefficients for the WHO and Globorisk predictors were re-fitted on the same UKB train subset as our DLS, and a sensitivity analysis was conducted using the original coefficients with recalibration.

Table 2: https://doi.org/10.1371/journal.pgph.0003204.t002

We further developed DLS+ (DLS with BMI), and DLS++ (DLS with BMI and SBP) that additionally included more non-laboratory, office-based measurements as inputs of the survival model to better understand the prognostic value of PPG on top of the existing office-based refit-WHO model.

All models were trained on the same train subset and tuned on the tune subset, except for the laboratory-based refit-WHO score, the metadata + PPG morphology model, and the Full model, which we trained, tuned and compared on a subset of the test data without missing values for the input features.

The outcome, ten-year risk of MACE, was defined as a composite of three components: non-fatal myocardial infarction, stroke, and CVD-related death (identified using ICD codes and cause of death; see S4 Table for details) [7, 32]. To define the outcome, we used (1) the date of heart attack, myocardial infarction, stroke, or ischemic stroke, either diagnosed by a doctor or self-reported, (2) the record of ICD-10 (International Classification of Diseases, 10th revision) clinical codes, and (3) strings associated with CVD-related death. The ICD-10 codes used included I21 (acute myocardial infarction), I22 (subsequent myocardial infarction), I23 (complications after myocardial infarction), I63 (cerebral infarction), and I64 (stroke not specified as hemorrhage or infarction). The strings used for matching include those related to coronary artery diseases, myocardial infarction, stroke, hypertensive diseases, heart failure, thromboembolism, arrhythmia, valvular diseases and other heart problems. We used the earliest date in any of the data sources mentioned above as the outcome date.
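As an illustration of how such a composite outcome can be assembled from ICD-10 codes, the pandas sketch below flags qualifying diagnoses and takes the earliest qualifying date per participant; the diagnosis table and column names are hypothetical.

```python
import pandas as pd

# ICD-10 prefixes for the composite MACE outcome listed above
MACE_ICD10_PREFIXES = ("I21", "I22", "I23", "I63", "I64")

def is_mace_code(icd10_code: str) -> bool:
    """True if an ICD-10 code falls under any of the MACE categories."""
    return str(icd10_code).upper().startswith(MACE_ICD10_PREFIXES)

# Hypothetical per-diagnosis table: one row per (participant, diagnosis, date)
diagnoses = pd.DataFrame({
    "participant_id": [1, 1, 2, 3],
    "icd10": ["I21.0", "E11.9", "I63.4", "I10"],
    "date": pd.to_datetime(["2012-03-01", "2010-06-15", "2015-09-30", "2011-01-20"]),
})

mace = diagnoses[diagnoses["icd10"].map(is_mace_code)]
# The earliest qualifying date per participant defines the outcome date
outcome_dates = mace.groupby("participant_id")["date"].min()
print(outcome_dates)
```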

Statistical analysis.

For primary analysis, we compared DLS with the office-based refit-WHO score, which is a risk model for healthy individuals across different countries [ 4 , 7 , 33 , 34 ], using Harrell’s C-statistic. We conducted a non-inferiority test with a pre-specified margin of 2.5% and alpha of 0.05, both selected based on power simulations using the tune subset. For secondary analyses, we also compared DLS with scores generated by other models mentioned in “Models for Comparison/Reference” above.
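A minimal sketch of computing Harrell's C-statistic with Lifelines and checking the point estimate against the 2.5% non-inferiority margin; the arrays are simulated, and the permutation-based inference actually used in the study is not shown.

```python
import numpy as np
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
n = 2_000
# Hypothetical test-set data: follow-up times, event flags, and two predicted risks
durations = rng.exponential(scale=12.0, size=n)
events = rng.integers(0, 2, size=n)
risk_dls = rng.uniform(size=n)
risk_office = rng.uniform(size=n)

# Harrell's C: concordance_index expects scores where larger means longer survival,
# so the predicted risks are negated.
c_dls = concordance_index(durations, -risk_dls, events)
c_office = concordance_index(durations, -risk_office, events)

delta = c_dls - c_office
margin = 0.025  # pre-specified non-inferiority margin of 2.5%
print(f"C(DLS)={c_dls:.3f}  C(office)={c_office:.3f}  delta={delta:+.3f}")
print("point estimate within margin" if delta > -margin else "point estimate outside margin")
```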

Additional evaluation metrics included the category-free net reclassification improvement (cfNRI) [35] and, after defining a specific risk threshold (model operating point), sensitivity, specificity, NRI, and adjusted hazard ratios (HRs). For NRI and cfNRI, we also reported the respective event and non-event components. Risk thresholds were selected in three ways: (1) matching the sensitivity of SBP-140 (described next), (2) matching the specificity of SBP-140, and (3) the 10% predicted risk threshold suggested by the Globorisk study [4]. Elevated SBP above 140 mmHg ("SBP-140") [36] was used for threshold selection because it is used as a simple single-visit indicator of BP control in the healthcare programs of some countries such as India [37], and we hypothesized that the PPG provides a single-visit indicator of vascular properties. To calculate sensitivity and specificity, we excluded participants without ten-year follow-up who did not have a MACE event within ten years. To evaluate model calibration, we used the slope of the line comparing predicted and actual event rates across deciles of model prediction [33]. We also performed subgroup analyses based on smoking status, sex, age, elevated HbA1c and hypertension status. We used quintiles for the elevated HbA1c subgroup due to the smaller sample size.
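The decile-based calibration slope described here can be sketched as follows; the predictions and outcomes are simulated, and a well-calibrated model should give a slope near 1.

```python
import numpy as np

def calibration_slope(predicted_risk, observed_event, n_bins=10):
    """Slope of observed event rate regressed on mean predicted risk per decile."""
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    observed_event = np.asarray(observed_event, dtype=float)
    # Assign each participant to a decile of predicted risk
    edges = np.quantile(predicted_risk, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.digitize(predicted_risk, edges[1:-1]), 0, n_bins - 1)
    mean_pred = np.array([predicted_risk[bins == b].mean() for b in range(n_bins)])
    obs_rate = np.array([observed_event[bins == b].mean() for b in range(n_bins)])
    slope, _intercept = np.polyfit(mean_pred, obs_rate, deg=1)
    return slope

rng = np.random.default_rng(0)
pred = rng.uniform(0.0, 0.3, size=5_000)
obs = rng.binomial(1, pred)            # outcomes simulated at the predicted risk
print(calibration_slope(pred, obs))    # close to 1 for this perfectly calibrated simulation
```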

For statistical precision, we used the Clopper-Pearson exact method to compute the 95% confidence intervals (CIs) for sensitivity and specificity, and the non-parametric bootstrap method with 1,000 iterations to compute 95% CIs for all remaining metrics and delta values. For hypothesis tests in the secondary and exploratory analyses, we used a permutation test to examine the non-inferiority and superiority of the C-statistic, and the one-sided Wald test for sensitivity and specificity. The log-rank test was used to determine whether survival differs between the model-defined low and high risk groups. For all two-sided tests, we used an alpha value of 0.05.
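As a sketch of the log-rank comparison between model-defined risk groups, using the Lifelines implementation; the follow-up times and event indicators below are simulated.

```python
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
# Hypothetical follow-up data for model-defined low- and high-risk groups
t_low = rng.exponential(scale=15.0, size=1_000)
t_high = rng.exponential(scale=9.0, size=1_000)
e_low = rng.integers(0, 2, size=1_000)
e_high = rng.integers(0, 2, size=1_000)

result = logrank_test(t_low, t_high, event_observed_A=e_low, event_observed_B=e_high)
print(result.p_value)  # two-sided p-value for a difference in survival between the groups
```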

The deep learning framework (JAX) used in this study is available at https://github.com/google/jax [ 38 ]. All survival analyses were conducted using Lifelines [ 39 ], an open source Python library.

Ethics statement

This study involving retrospective de-identified data was reviewed by and granted a waiver by the Advarra Institutional Review Board. All participants in the UK Biobank study gave broad consent to use their anonymized data and samples for any health-related research [ 40 ].

Results

We showed that DLS was non-inferior to the office-based refit-WHO score. We evaluated the ten-year MACE risk prediction performance of all methods using our UKB test subset, which was held out during the training process. The DLS yielded a C-statistic of 71.1% (95% CI [69.9, 72.4]). When compared with the office-based refit-WHO score, the DLS was non-inferior (p<0.01), with a delta of +0.2% (-0.4, 0.8). The cfNRI was 0.1% (0.0, 0.1), stemming primarily from improved reclassification of events (0.1% [0.0, 0.2]), without a performance penalty in the non-events (0.0% [0.0, 0.0]).

Based on the C-statistic, there was an incremental improvement when the metadata model (69.1%) was augmented with manually engineered (not deep learning derived) PPG morphology features (70.0%). The DLS was superior to this metadata + PPG morphology model (p<0.01), indicating the value of deep learning-based feature extraction. The lab-based model (which requires total cholesterol and glucose information) was superior to the office-based refit-WHO score (71.6% versus 70.9%, p<0.01). With only the five deep learning-based PPG features, the C-statistic was 62.7% (61.3, 63.9). Applying the Globorisk scores from [7], recalibrated on the UKB cohort for the baseline hazard and mean risk factors but without re-estimating the coefficients, the office-based Globorisk yielded a C-statistic of 70.0% (68.8, 71.2) and the lab-based Globorisk a C-statistic of 69.8% (68.5, 71.1). More details are shown in Table 2 .

For a fair comparison, we then selected the risk thresholds that matched the specificity or sensitivity of SBP-140 (see Statistical analysis in Methods ) (specificity of 63.7%, sensitivity of 55.2%). We found that at matched specificity, the sensitivity of the DLS (67.9%) was non-inferior to the office-based refit-WHO score (67.7%) (p = 0.012), with a comparable NRI, while the metadata and metadata + PPG morphology models were not (p = 0.984 and p = 0.305, respectively). At matched sensitivity, the DLS's specificity (74.0%) was also non-inferior to the baseline (73.1%) (and showed superiority with p<0.01), with a comparable NRI. The laboratory-based refit-WHO and the model using metadata and PPG morphology-based features were also non-inferior to the office-based refit-WHO score, though these models require additional inputs from laboratory measurements or engineered PPG features, respectively. The metadata-only model performed more poorly than the office-based refit-WHO score across different metrics. More details are shown in Table 3 . We also conducted Kaplan-Meier analyses on risk groups defined using the above approach ( Fig 3 ). Both thresholds showed significant (p<0.01, log-rank tests) differences between the groups. Results for the 10% risk threshold are in S5 Table and S3 Fig .

Fig 3. (A) Risk threshold corresponding to a specificity of 63.6%; (B) risk threshold corresponding to a sensitivity of 55.4% (see Methods ). The p-values were calculated by the log-rank test.

https://doi.org/10.1371/journal.pgph.0003204.g003

Table 3: https://doi.org/10.1371/journal.pgph.0003204.t003

In addition to the default set of inputs to the DLS, we also evaluated models with BMI, and with both BMI and SBP (both of which are predictors in the office-based refit-WHO score), included as additional inputs, which we refer to as DLS+ and DLS++, respectively. We found that adding BMI (DLS+) improved DLS in terms of both discrimination and net reclassification. Further improvement was observed after adding SBP (DLS++), which demonstrated superiority across different metrics ( S6 Table ). We also showed that for DLS and its variants (DLS+ and DLS++), the cfNRI and NRI at different risk thresholds ( S6A Table ) were on par with the office-based refit-WHO score ( S6B Table ). These findings indicate that combining the existing non-laboratory risk factors from the refit-WHO score with the DLS features yields a more accurate CVD risk estimation. We further developed a model (Full model) that includes more of the risk factors used in CVD risk scores common in high-income countries (QRISK and/or ASCVD), as well as a model that incorporates genetic risk, and report the findings in the S2 Text .

Meanwhile, the predicted and observed risks of ten-year MACE were similar across the different models ( Fig 4 ), indicating that DLS has calibration performance similar to the other models. The calibration slope of DLS was similar to that of the office-based refit-WHO score (0.981 versus 0.979) ( Table 2 ). We also found that DLS++ has comparable calibration performance ( S6 Table ). All models except DLS+ estimated the observed ten-year MACE risk within a 5% mean absolute calibration error (i.e., the slopes were between 0.95 and 1.05).

Fig 4. We discretized each model's output into deciles; the slopes indicate the coefficient of a linear regression. METADATA: the risk model with age, sex, and smoking status. OFFICE: the risk model using the shared predictors from the office-based WHO/ISH risk chart and Globorisk score. DLS: the risk model with metadata and deep learning-based PPG features. DLS+: the risk model using all DLS predictors plus BMI. DLS++: the risk model using all DLS predictors plus BMI and systolic blood pressure.

https://doi.org/10.1371/journal.pgph.0003204.g004

Finally, DLS is on par with the office-based refit-WHO score in some subgroups. S7 Table shows that DLS demonstrated non-inferiority in several subgroups and showed superiority in the smoking, hypertensive and male subgroups. Both the office-based refit-WHO score and DLS had similar performance trends: both models have higher sensitivity and lower calibration error, but lower specificity, in the smoking, older, male, and hypertensive subgroups. The models were well calibrated for most subgroups, but systematically overestimated absolute risk by about 4.0% in the elevated A1c subgroup and by about 1.0% in the hypertensive subgroup. This finding indicates that the developed risk models tend to be better calibrated and to better predict ten-year MACE risk in populations with higher known CVD risk factors, such as the older, male, smoking, higher blood glucose and hypertensive subgroups ( S7 Table ). Across the different age, sex, smoking, and comorbidity (diabetes and hypertension) subgroups, the calibration of all risk scores was similar in predicting ten-year MACE risk in the smoking, age greater than 55, male, and non-elevated A1c populations, with prediction errors within 10% (i.e., the calibration regression slope between 0.9 and 1.1; S7 Table , S4 Fig ).

To understand the DLS model more deeply, we analyzed the DLS-based PPG features via their coefficients and hazard ratios (HRs), their correlations with known PPG morphological features, the differences in PPG waveforms between high and low values of each DLS PPG feature, and saliency maps of the PPG features, computed using integrated gradients [41] of the Cox log partial hazard with respect to the input waveform, using the same metadata and linear interpolation from a constant (all zeros) baseline for the PPG. We first examined the association between each model and MACE via the coefficients and HRs ( S8 Table ). We found that in the office-based refit-WHO score, smoking, older age, higher BMI, and higher SBP were associated with ten-year MACE risk. We found that some DLS features were also associated with ten-year MACE risk (p<0.05 for four deep learning PPG features in DLS and DLS+, and for two PPG features in DLS++). Next, we computed the Spearman rank correlation coefficient between the DLS features and the engineered PPG morphological features ( Table 4 ), and visualized the relationship between the waveforms and the PPG feature values / predicted risk score change, along with the integrated gradients.
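To illustrate the integrated-gradients computation itself (not the study's trained network), the JAX sketch below attributes a scalar score to each timestep of a waveform by averaging gradients along a straight path from an all-zeros baseline; the linear scorer standing in for the Cox log partial hazard and the waveform are hypothetical.

```python
import jax
import jax.numpy as jnp

# Toy differentiable scorer standing in for the trained network and Cox log partial
# hazard; the weights and waveform below are hypothetical.
weights = jnp.sin(jnp.linspace(0.0, 3.0, 100))

def log_partial_hazard(ppg_waveform):
    return jnp.dot(weights, ppg_waveform)

def integrated_gradients(x, steps=50):
    """Approximate integrated gradients from an all-zeros baseline to input x."""
    baseline = jnp.zeros_like(x)
    alphas = jnp.linspace(0.0, 1.0, steps)
    grad_fn = jax.grad(log_partial_hazard)
    # Average the gradient along a straight path from the baseline to the input
    grads = jnp.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)   # per-timestep attribution

ppg = jnp.abs(jnp.sin(jnp.linspace(0.0, jnp.pi, 100)))   # hypothetical waveform
saliency = integrated_gradients(ppg)
print(saliency.shape)
```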

Table 4: https://doi.org/10.1371/journal.pgph.0003204.t004

In Fig 5 , we visualized the PPG waveforms according to the predicted risk score and the five DLS PPG feature values, by presenting the average of the 100 PPG waveforms sampled nearest to the 10th, 50th, and 90th quantiles. We found that with a higher PPG-1 value, the systolic peak shifts earlier and the notch appears more prominent. A leftward systolic peak shift is consistent with a higher slope and thus less stiff vessels, corroborated by the strong positive correlation with the peak-to-peak time feature and the negative correlation with the stiffness index. Together, these findings are consistent with the risk prediction observations above, because PPG-1 has the highest absolute Cox coefficient among the five DLS PPG features and its direction is negative (hence the inverse direction relative to the observations in the predicted score). PPG-2 is also correlated with peak-to-peak time and stiffness index, and the cases with higher PPG-2 values show a lower waveform and notch amplitude, which may be consistent with the negative correlation with the reflection index, though other morphological changes along this spectrum are harder to describe precisely. PPG-3 is weakly correlated with cardiac events (it has a small Cox coefficient) and weakly correlated with the available UK Biobank-provided PPG features; we also did not see a significant difference between waveforms as the PPG-3 value changed. PPG-4 is correlated with the shoulder position and, more weakly, with the systolic peak position. The morphological differences at the notch are subtle, though the leftward shift is consistent from the 10th to the 50th to the 90th percentile. PPG-5 appears to correlate with changes in the relative notch height, which is consistent with its high correlation with the reflection index. The reflection index is in turn a measure of peripheral resistance; a higher reflection index tends to be more common in a specific age group (40–54) and in obese groups [42], and it also loosely correlates with other risk predictor metrics such as pulse wave velocity (PWV) [43]. Interestingly, PPG-5 is the only PPG feature that had a positive univariable association (positive Cox coefficient) with MACE outcomes but became negatively associated in the multivariable analysis; this likely reflects its effect being that of modulating other features. We also found that, when stratifying by the DLS-predicted risk score, the systolic peak shifts later and the notch appears less prominent with higher predicted risk, which is also related to vascular stiffness. Finally, we found that the saliency maps, computed using integrated gradients ( Fig 5 right), highlighted the waveform's peak, notch and diastolic phase as the areas most responsible for changes in predicted risk.

Fig 5. The first row sorts PPGs by predicted risk, whereas the next five rows sort PPGs based on the five PPG feature values. The first column presents the average of 100 PPG waveforms sampled nearest to the following quantiles of the quantity mentioned on the left: 10th (red), 50th (green), and 90th (blue). The next three columns present the respective averaged PPGs along with the normalized saliency values based on integrated gradients. We observed that, in general, the salient areas that most influence the predictions are near the top of the systolic peak and the notch, independent of which quantile and feature/prediction the PPG was sampled from. Each PPG feature appears to correspond to different morphological aspects (see Table 4 ).

https://doi.org/10.1371/journal.pgph.0003204.g005

Discussion

We developed a deep learning PPG-based CVD risk score, DLS, to predict ten-year MACE risk using age, sex, smoking status, heart rate and deep learning-derived PPG features. Without requiring any vital signs or laboratory measurements, DLS demonstrated non-inferior performance compared to the office-based refit-WHO score with coefficients re-estimated on the same cohort. Results were consistent across metrics (C-statistic, NRI, cfNRI, sensitivity, specificity, calibration slope) and in various subgroups. The improved cfNRI and NRI also indicate the capability of DLS to reclassify cases better than the office-based refit-WHO score. Additionally, when available, adding office-based features (BMI, SBP) on top of DLS further improved model performance.

Our work focuses on understanding the role that PPG and deep learning can play in settings where access to healthcare equipment is limited, such as community-based screening programs in LMICs. Several CVD prediction scores that do not assume the availability of laboratory measurements exist for primary prevention, such as the WHO/ISH risk prediction chart [34], the office-based Framingham risk score (FRS) [44], the office-based Globorisk score [4], the non-laboratory INTERHEART risk score [45], and the Harvard NHANES risk score [46]. Some of these are also deployed in real-world clinical practice [4, 32], though these methods require either body measurements (BMI, waist-hip ratio), SBP, or both. Challenges remain in scaling up CVD screening in resource-limited areas for reasons such as the lack of laboratory devices, sphygmomanometer cuffs, or the training of CHWs needed for accurate measurements. In our study, the DLS demonstrated performance comparable to that of the re-estimated office-based refit-WHO score, without requiring laboratory examinations, vital signs measured with additional devices, or BMI. This improves accessibility for health systems that have limited resources to collect vitals and labs for CVD risk screening and triage. More intriguingly, PPG signals could in principle be captured through a smartphone [16], and future work could leverage smartphone-based PPGs along with the DLS to enable large-scale screening and triage in the community at low cost ( Fig 1 ) [14, 47].

Due to the higher prevalence and the lower rates of diagnosis and treatment of CVD in LMICs, the WHO has listed preventing and controlling CVD as a main target in its "Global action plan for the prevention and control of non-communicable diseases (NCDs) 2013–2030" [48]. PPG-based screening may allow healthcare systems to optimize the use of resources by prioritizing those who are likely to benefit the most and by improving the early detection of CVDs. Thus, our study represents a step on the journey towards enabling community-based preventive treatment for high-CVD-risk individuals with limited healthcare access.

The deep learning-based features are challenging to interpret directly, and the pathophysiological relationship between PPG and CVD risk is still under investigation [49]. Our analysis, including the hazard ratios, correlations, feature values, and saliency maps, provides interpretability insights, suggesting that the DLS-extracted features reflect morphologic changes in the waveform independent of heart rate. However, further investigation is still required to interpret the deep learning-based features and to determine whether experts can learn from them. We also found preliminary evidence that using a resting electrocardiogram (ECG) yielded performance comparable to the office-based model on a UKB subset with available resting ECG data.

Several limitations of the study should also be noted. We used a single dataset, UKB, for both modeling and evaluation. Though we stratified the UKB cohort based on geographical information to allow for non-random variation [29], further work is needed to understand generalization to other populations. Notably, UKB is not representative of the population in LMICs. However, using UKB to demonstrate the capability of DLS for long-term CVD risk prediction is an important first step in justifying prospective data collection in LMICs. The device used for PPG acquisition across the UKB is a specific clinical pulse oximeter (PulseTrace PCA2), so our results provide direct evidence only that waveforms from this pulse oximeter may be a reliable CVD screening tool. Studies have found that the heart rate and rhythm extracted from smartphone PPG are comparable with clinical-grade devices such as ECG [50–52], but additional work is needed to know whether our model would transfer to smartphone-collected PPGs. Since the PPG waveform signals in the UKB were collected with a single device and a specific protocol (details in the Model development section), further work may be necessary to understand whether these data are biased in some way (e.g., less noisy) relative to less structured data collection protocols. However, our results on UKB indicate that using PPG for CVD screening and triage is promising and worth investigating further, particularly in lower-resourced regions. Because a dataset of PPG collected using commodity devices such as smartphones does not yet exist, we are open-sourcing a sample PPG data collection app (see Data Availability Statement) to facilitate future data collection and similar longitudinal PPG research studies in LMICs. Future work may be needed to understand how to mitigate any differences in PPG features based on PPG device or manufacturer. We are also releasing the trained PPG embeddings via the UK Biobank, and the analysis code via the GitHub repository (see Data Availability Statement). Future work could focus on predicting CVD risk using prospective smartphone PPG datasets from low-resource healthcare systems. Additional work will also be needed to know whether our model would transfer to the smartphone setup; because smartphone PPG datasets with longitudinal MACE outcomes in LMICs do not, to our knowledge, exist, direct evidence of the efficacy of such an approach will need to be evaluated when such data become available. We also investigated other, larger network architectures such as the dilated CNN and WaveNet, but did not observe marked performance improvements; further investigation of efficient models that improve performance while remaining lightweight will be considered.

To summarize, our study found that a deep learning model extracted features that, when added to easily obtainable clinical and demographic variables (such as smoking status, age, and sex), provided statistically significant prognostic information about cardiovascular risk. Our work is an initial step towards accurate and scalable CVD screening in resource-limited areas around the world, and we hope it will inspire the collection of real-world datasets with smartphone-acquired PPG and longitudinal outcomes.

Supporting information

S1 Fig. Overview of our deep learning-based risk prediction model, DLS.

Blue: models; yellow: inputs; red: intermediate data representations (embeddings) obtained from the deep learning-based PPG feature extractor.

https://doi.org/10.1371/journal.pgph.0003204.s001

S2 Fig. Geographical location information of sites visualized by longitude and latitude for dataset splits.

https://doi.org/10.1371/journal.pgph.0003204.s002

S3 Fig. Kaplan-Meier estimation of DLS with different operating points.

We compared the survival estimation between the high and low risk groups, which were defined by the risk threshold at 10% suggested by the Globorisk study [ 1 ]. For example, a case with prediction value higher than 0.1 will be high risk, else low risk. The p-values were calculated by the log-rank test.

https://doi.org/10.1371/journal.pgph.0003204.s003

S4 Fig. Calibration plots for all subgroups.

The calibration slope values indicate the coefficient of a linear regression where the dependent variable was the fraction of positives and the independent variable was the mean predicted risk. We used ten bins to discretize the prediction interval and chose deciles of predicted risk to define the widths of the bins. For the elevated HbA1c subgroup, we used quintiles to ensure sufficient events. All models (office-based refit-WHO, DLS) are calibrated better in the smoking, older, male, non-elevated A1c, and non-hypertensive subgroups.

https://doi.org/10.1371/journal.pgph.0003204.s004

S5 Fig. Prevalence of major adverse cardiovascular event (MACE) in individuals according to model-predicted risk percentiles.

For each of four risk models, the prevalence of MACE was computed in the individuals scoring in the highest 20, 10, and 5% risk according to the model. Error bars computed via 100 bootstrap iterations. The dashed gray line shows MACE prevalence in the entire sample. Metadata+, model containing age, sex, smoking status, and BMI. DLS+, model containing age, sex, smoking status, BMI, and PPG. Metadata+ + polygenic risk score (PRS), model containing age, sex, smoking status, BMI, and polygenic risk score. DLS+ + PRS, model containing age, sex, smoking status, BMI, PPG, and PRS.

https://doi.org/10.1371/journal.pgph.0003204.s005

S1 Text. Supporting methods.

https://doi.org/10.1371/journal.pgph.0003204.s006

S2 Text. Supporting results.

https://doi.org/10.1371/journal.pgph.0003204.s007

S1 Table. Geographical location information of sites for split division.

https://doi.org/10.1371/journal.pgph.0003204.s008

S2 Table. Training setup for the photoplethysmography (PPG) feature extractor.

https://doi.org/10.1371/journal.pgph.0003204.s009

S3 Table. Features used in different models for comparison.

We compared all methods with the office-based refit-WHO model. The evaluations of DLS models and additional reference methods are in the main content, S5 , S6 and S10 Tables. For the supporting reference methods, the results are listed in the Supporting Tables. *Lab-based refit-WHO and metadata + PPG morphology models are compared with a subset of the whole cohort. **The full model used most QRISK features. The detail of the feature set is described in the Supporting Methods.

https://doi.org/10.1371/journal.pgph.0003204.s010

S4 Table. UK Biobank variables used in the study.

https://doi.org/10.1371/journal.pgph.0003204.s011

S5 Table. Model performance comparison of 10-year major adverse cardiovascular event (MACE) risk prediction between DLS versus other methods at the 10% risk threshold.

The sensitivity, specificity, and net reclassification improvement (NRI) were calculated at the 10% risk threshold suggested by the Globorisk study for the British population [ 1 ]. CIs of sensitivity and specificity were obtained from the Clopper-Pearson exact method, and the p-values were calculated by the permutation test with a prespecified margin of 2.5% and alpha of 0.05. The 95% CIs of NRI were computed by bootstrapping.

https://doi.org/10.1371/journal.pgph.0003204.s012

S6 Table. Model performance comparison of 10-year major adverse cardiovascular events (MACE) risk prediction between DLS versus DLS+ (adding BMI) and DLS++ (adding BMI and SBP).

(a) We examined the discrimination performance using C-statistic, reclassification improvement using category-free net reclassification improvement (cfNRI), and model calibration using the slope value from the reliability diagram. *In “Feature used” column, “Metadata” includes age, sex, and smoking status. (b) The sensitivity was calculated at the risk threshold matching the specificity of SBP-140, and the specificity was calculated at the risk threshold matching the sensitivity of SBP-140. 95% confidence intervals (CIs) of C-statistic, cfNRI, and slope were obtained from the bootstrapping, and p-values were computed by the permutation test. CIs of sensitivity and specificity were obtained from the Clopper-Pearson exact method, and the p-values were calculated by a permutation test with the prespecified margin of 2.5% and alpha of 0.05. The 95% CIs of NRI were computed by bootstrapping.

https://doi.org/10.1371/journal.pgph.0003204.s013

S7 Table. Comparison of 10-year major adverse cardiovascular event (MACE) risk prediction performance between different subgroups using DLS versus office-based refit-WHO model.

The sensitivity and specificity were calculated at the risk threshold matching SBP-140’s specificity (see Statistical Analysis). 95% confidence intervals (CIs) were obtained from the Clopper-Pearson exact method.

https://doi.org/10.1371/journal.pgph.0003204.s014

S8 Table. Coefficients and hazard ratios from the Cox’s models for 10-year major adverse cardiovascular events (MACE) risk prediction on the UK Biobank (UKB) cohort using DLS, DLS+ and DLS++.

Hazard ratios are shown at the median age of the MACE event, which is 63 years in the train split of UKB cohort. Hazard ratios for smokers are for men, and their interaction with sex shows the adjusted risk for women. We included interaction terms between age and other predictors because the HRs for proportional effects on CVD declined with age [ 2 , 3 ].

https://doi.org/10.1371/journal.pgph.0003204.s015

S9 Table. The list of proxy tasks used for multitask learning.

https://doi.org/10.1371/journal.pgph.0003204.s016

S10 Table. Model performance comparison of 10-year major adverse cardiovascular event (MACE) risk prediction between the office-based reference and DLS, and models without smoking status.

(a) We examined the ability of discrimination using C-statistic, reclassification improvement using category-free net reclassification improvement (cfNRI), and model calibration using the slope value from the reliability diagram. *In “Feature used” column, “Metadata” includes age, sex, and smoking status. (b) The sensitivity was calculated at the risk threshold matching specificity of the SBP-140 baseline at 63.7%, and the specificity was calculated based on the risk threshold matching sensitivity of the SBP-140 baseline at 55.2%. 95% confidence intervals (CIs) of C-statistic, cfNRI, and slope were obtained from the bootstrapping, and the p-values were computed by the permutation test. CIs of sensitivity and specificity were obtained from the Clopper-Pearson exact method, and the p-values were calculated by the permutation test with the prespecified margin of 2.5% and alpha of 0.05. The 95% CIs of NRI were computed by bootstrapping.

https://doi.org/10.1371/journal.pgph.0003204.s017

https://doi.org/10.1371/journal.pgph.0003204.s018

Acknowledgments

We acknowledge Nick Furlotte (Google Research) and the Google Research team for software infrastructure support. We thank Boris Babenko (Google Research) for his critical feedback on the manuscript. We also thank Madhuram Jajoo for the development of the open-source mobile application for collecting PPG signals. This research was conducted with the UK Biobank resource application 65275.

  • 2. Cardiovascular diseases (CVDs). [cited 17 Oct 2022]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)#:~:text=Cardiovascular%20diseases%20(CVDs)%20are%20the,%2D%20and%20middle%2Dincome%20countries
  • 16. Lovisotto G, Turner H, Eberz S, Martinovic I. Seeing Red: PPG Biometrics Using Smartphone Cameras. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2020. https://doi.org/10.1109/cvprw50498.2020.00417
  • 19. Schlesinger O, Vigderhouse N, Eytan D, Moshe Y. Blood Pressure Estimation From PPG Signals Using Convolutional Neural Networks And Siamese Network. ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020. https://doi.org/10.1109/icassp40776.2020.9053446
  • 30. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. https://doi.org/10.1109/cvpr.2016.90
  • 43. Padilla JM, Berjano EJ, Saiz J, Facila L, Diaz P, Merce S. Assessment of relationships between blood pressure, pulse wave velocity and digital volume pulse. 2006 Computers in Cardiology. IEEE; 2006. pp. 893–896.
  • 47. ASER Centre. Annual Status of Education Report (Rural) 2021. In: ASER 2021—Rural [Internet]. 17 Nov 2021 [cited 3 Jan 2023]. Available: www.asercentre.org
  • 48. World Health Organization. Global action plan for the prevention and control of NCDs 2013–2020. In: World Health Organization [Internet]. 14 Nov 2013 [cited 3 Dec 2022]. Available: https://www.who.int/publications/i/item/9789241506236

IMAGES

  1. Hypothesis Testing Concept Map

    hypothesis testing examples in healthcare

  2. Hypothesis testing

    hypothesis testing examples in healthcare

  3. PPT

    hypothesis testing examples in healthcare

  4. Hypothesis Testing Presentation

    hypothesis testing examples in healthcare

  5. PPT

    hypothesis testing examples in healthcare

  6. PPT

    hypothesis testing examples in healthcare

VIDEO

  1. Two-Sample Hypothesis Testing: Dependent Sample

  2. HYPOTHESIS TESTING PROBLEM-4 USING Z TEST VIDEO-7

  3. HYPOTHESIS TESTING PROBLEM-5 USING Z TEST VIDEO-8

  4. HYPOTHESIS TESTING PROBLEM-2 USING Z TEST VIDEO-5

  5. t-TEST PROBLEM 2- HYPOTHESIS TESTING VIDEO-17

  6. hypothesis testing in healthcare

COMMENTS

  1. Hypothesis Testing, P Values, Confidence Intervals, and Significance

    Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting ...

  2. 4 Examples of Hypothesis Testing in Real Life

    Example 1: Biology. Hypothesis tests are often used in biology to determine whether some new treatment, fertilizer, pesticide, chemical, etc. causes increased growth, stamina, immunity, etc. in plants or animals. For example, suppose a biologist believes that a certain fertilizer will cause plants to grow more during a one-month period than ...

  3. PDF Hypothesis Testing

    23.1 How Hypothesis Tests Are Reported in the News 1. Determine the null hypothesis and the alternative hypothesis. 2. Collect and summarize the data into a test statistic. 3. Use the test statistic to determine the p-value. 4. The result is statistically significant if the p-value is less than or equal to the level of significance.

  4. Probability, clinical decision making and hypothesis testing

    The present paper attempts to put the P value in proper perspective by explaining different types of probabilities, their role in clinical decision making, medical research and hypothesis testing. Keywords: Hypothesis testing, P value, Probability. The clinician who wishes to remain abreast with the results of medical research needs to develop ...

  5. Hypothesis Testing

    The first step in testing hypotheses is the transformation of the research question into a null hypothesis, H 0, and an alternative hypothesis, H A. 6 The null and alternative hypotheses are concise statements, usually in mathematical form, of 2 possible versions of "truth" about the relationship between the predictor of interest and the outcome in the population.

  6. An Introduction to Statistics: Understanding Hypothesis Testing and

    HYPOTHESIS TESTING. A clinical trial begins with an assumption or belief, and then proceeds to either prove or disprove this assumption. In statistical terms, this belief or assumption is known as a hypothesis. Counterintuitively, what the researcher believes in (or is trying to prove) is called the "alternate" hypothesis, and the opposite ...

  7. Hypothesis Testing

    Present the findings in your results and discussion section. Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps. Table of contents. Step 1: State your null and alternate hypothesis. Step 2: Collect data. Step 3: Perform a statistical test.

  8. Explaining Hypothesis Testing with Real-World Examples

    It involves formulating a null hypothesis (H0) and an alternative hypothesis (H1), collecting data, performing statistical tests, and drawing conclusions based on the results. Real-World Examples: Drug Efficacy. A pharmaceutical company wants to test a new drug's effectiveness in lowering blood pressure compared to the current standard treatment.
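
As a rough sketch of how such a drug-efficacy comparison might be analyzed: the blood-pressure reductions below are invented (not from any trial), and the 95% confidence interval for the difference is computed by hand with the Welch-Satterthwaite degrees of freedom:

```python
# Hypothetical reductions in systolic blood pressure (mmHg); all values invented.
import numpy as np
from scipy import stats

new_drug = np.array([14.2, 11.8, 15.1, 12.5, 13.9, 10.7, 14.8, 12.1, 13.3, 15.6])
standard = np.array([10.1,  9.4, 11.2,  8.7, 10.9,  9.8, 11.5,  8.9, 10.3,  9.6])

# Welch two-sample t-test: H0 says both treatments lower blood pressure equally.
t_stat, p_value = stats.ttest_ind(new_drug, standard, equal_var=False)

# Report an estimate of the difference alongside the p-value.
diff = new_drug.mean() - standard.mean()
v1 = new_drug.var(ddof=1) / len(new_drug)
v2 = standard.var(ddof=1) / len(standard)
se = np.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / (len(new_drug) - 1) + v2 ** 2 / (len(standard) - 1))
margin = stats.t.ppf(0.975, df) * se

print(f"difference = {diff:.1f} mmHg, 95% CI [{diff - margin:.1f}, {diff + margin:.1f}]")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```
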

  9. Hypothesis Testing, P Values, Confidence Intervals, and ...

    Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the ...

  10. Statistical Hypothesis Testing Overview

    Hypothesis testing is a crucial procedure to perform when you want to make inferences about a population using a random sample. These inferences include estimating population properties such as the mean, differences between means, proportions, and the relationships between variables. This post provides an overview of statistical hypothesis testing.

  11. Hypothesis Testing

    Hypothesis testing is the method for determining the probability of an observed event that occurs only by chance. If chance were not the cause of an event, then something else must have been the cause, such as the treatment having had an effect on the observed event (the outcome) that was measured. This process of testing a hypothesis is at the ...

  12. Anesthesia & Analgesia

    There is a close relationship between CIs of effect size estimates and hypothesis testing. When the 95% CI of an effect size does not contain the null hypothesis value that indicates "no effect" (eg, an odds ratio of exactly 1), this corresponds to a "statistically significant" result with a .05 alpha level in a hypothesis test.
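
A small sketch of that correspondence, using an invented 2x2 table (exposure by outcome) and the usual normal approximation on the log odds ratio; the counts are made up purely for illustration:

```python
# Invented 2x2 table of counts:     outcome   no outcome
#   exposed                            30         70
#   unexposed                          15         85
import math
from scipy.stats import norm

a, b = 30, 70     # exposed:   with / without outcome
c, d = 15, 85     # unexposed: with / without outcome

or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)

z = norm.ppf(0.975)                       # about 1.96 for a 95% interval
lo = math.exp(math.log(or_hat) - z * se_log_or)
hi = math.exp(math.log(or_hat) + z * se_log_or)

print(f"OR = {or_hat:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# If 1 lies outside this interval, the two-sided test of "OR = 1" is
# significant at the 0.05 level, and vice versa (the duality described above).
print("Significant at alpha = 0.05" if not (lo <= 1 <= hi)
      else "Not significant at alpha = 0.05")
```
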

  13. Hypothesis tests

    A hypothesis test is a procedure used in statistics to assess whether a particular viewpoint is likely to be true. They follow a strict protocol, and they generate a 'p-value', on the basis of which a decision is made about the truth of the hypothesis under investigation. All of the routine statistical 'tests' used in research (t-tests, χ² tests, Mann-Whitney tests, etc.) are all ...

  14. S.3.3 Hypothesis Testing Examples

    If the biologist set her significance level α at 0.05 and used the critical value approach to conduct her hypothesis test, she would reject the null hypothesis if her test statistic t* were less than -1.6939 (determined using statistical software or a t-table). Since the biologist's test statistic, t* = -4.60, is less than -1.6939, the biologist rejects the null hypothesis.
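
The quoted critical value can be reproduced from a t-distribution quantile. The snippet does not state the degrees of freedom; df = 32 (a sample of 33) is an assumption that gives approximately -1.6939 for a left-tailed test at α = 0.05:

```python
# Left-tailed critical value for a one-sample t-test at alpha = 0.05.
# df = 32 is an assumption; it reproduces the -1.6939 quoted above.
from scipy import stats

alpha, df = 0.05, 32
t_crit = stats.t.ppf(alpha, df)          # lower-tail quantile
t_star = -4.60                           # test statistic from the example

print(f"critical value = {t_crit:.4f}")  # approximately -1.6939
print("Reject H0" if t_star < t_crit else "Fail to reject H0")
```
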

  15. Second Edition (PDF)

    Hypothesis testing Hypothesis testing, also known as statistical inference or significance testing, involves testing a specified hypothesized condition for a population's parameter. This condition is best described as the null hypothesis. For example, in a clinical trial of a new anti-hypertensive drug, the null hypothesis would state

  16. Hypothesis Testing in Medical Research: a Key Statistical Application

    Abstract: "Hypothesis testing" is an integral and most important component of research methodology, in all research, whether in medical sciences, social sciences or any such allied field ...

  17. Quick Guide to Biostatistics in Clinical Research: Hypothesis Testing

    Once the clinical trial phases are completed, biostatistics is used to analyze the results. Research generally proceeds in an orderly fashion as shown below. Once you have identified the research question you need to answer, it is time to frame a good hypothesis. The hypothesis is the starting point for biostatistics and is usually based on a ...

  18. The null hypothesis significance test in health sciences research (1995

    The null hypothesis statistical testing (NHST) has been the most widely used statistical approach in health research over the past 80 years. Its origins date back to 1279, although it was in the second decade of the twentieth century when the statistician Ronald Fisher formally introduced the concept of the "null hypothesis" H0, which, generally speaking, establishes that certain parameters ...

  19. Hypothesis Testing in Public Health

    In this second course of the Biostatistics in Public Health Specialization, you'll learn to evaluate sample variability and apply statistical hypothesis testing methods. Along the way, you'll perform calculations and interpret real-world data from the published scientific literature. Topics include sample statistics, the central limit theorem ...

  20. Insights in Hypothesis Testing and Making Decisions in Biomedical

    Two-sided hypothesis tests are dual to two-sided confidence intervals. A parameter value is in the (1 − α) × 100% confidence interval if and only if the hypothesis test whose assumed value under the null hypothesis is that parameter value accepts the null at level α. The principle is called the duality of hypothesis testing and confidence ...

  21. Hypothesis Test vs. Confidence Interval: What's the Difference?

    Here's the difference between the two: A hypothesis test is a formal statistical test that is used to determine if some hypothesis about a population parameter is true. A confidence interval is a range of values that is likely to contain a population parameter with a certain level of confidence. This tutorial shares a brief overview of each ...

  22. The Importance of Testing a Hypothesis Before Building Machine ...

    The hypothesis being tested is typically about the value of a population parameter, such as the mean or variance. In machine learning, hypothesis testing can be used to assess the performance of a model. For example, a healthcare provider may use hypothesis testing to compare the accuracy of two models for predicting heart disease.
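
For two models evaluated on the same patients, a paired test such as McNemar's is one option. A minimal sketch with an invented 2x2 table of correct/incorrect predictions (the counts are not from any real study), using statsmodels:

```python
# Invented paired results on one test set of heart-disease predictions:
# rows = model A correct / wrong, columns = model B correct / wrong.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

table = np.array([[620, 42],    # A correct: B correct / B wrong
                  [ 18, 120]])  # A wrong:   B correct / B wrong

# McNemar's test uses only the discordant cells (42 vs. 18).
# H0: both models have the same error rate on this population of patients.
result = mcnemar(table, exact=True)

print(f"p = {result.pvalue:.4f}")
print("Accuracies differ significantly" if result.pvalue < 0.05
      else "No significant difference detected")
```
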

  23. Predicting cardiovascular disease risk using ...

    Cardiovascular diseases (CVDs) are responsible for a large proportion of premature deaths in low- and middle-income countries. Early CVD detection and intervention is critical in these populations, yet many existing CVD risk scores require a physical examination or lab measurements, which can be challenging in such health systems due to limited accessibility. We investigated the potential to ...