P-Value in Statistical Hypothesis Tests: What is it?

P value definition.

A p value is used in hypothesis testing to help you decide whether to support or reject the null hypothesis. It quantifies the evidence against the null hypothesis: the smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

P values are expressed as decimals, although it may be easier to understand them if you convert them to percentages. For example, a p value of 0.0254 is 2.54%. This means that if the null hypothesis were true, there would only be a 2.54% chance of seeing results at least as extreme as yours purely by chance. That's pretty small. On the other hand, a large p-value of .9 (90%) means that results like yours would be very likely even if nothing in your experiment had any effect. Therefore, the smaller the p-value, the more "significant" your results.

When you run a hypothesis test, you compare the p value from your test to the alpha level you selected before running it. Alpha levels can also be written as percentages.


P Value vs Alpha level

Alpha levels are controlled by the researcher and are related to confidence levels. You get an alpha level by subtracting your confidence level from 100%. For example, if you want to be 98% confident in your research, the alpha level would be 2% (100% – 98%). When you run the hypothesis test, the test will give you a value for p. Compare that value to your chosen alpha level. For example, let's say you chose an alpha level of 5% (0.05). If the results from the test give you:

  • A small p (≤ 0.05): reject the null hypothesis. Your data are strong evidence against the null.
  • A large p (> 0.05): the evidence against the null is weak, so you do not reject the null.


What if I Don’t Have an Alpha Level?

In an ideal world, you’ll have an alpha level. But if you do not, you can still use the following rough guidelines in deciding whether to support or reject the null hypothesis:

  • If p > .10 → "not significant"
  • If .05 < p ≤ .10 → "marginally significant"
  • If .01 < p ≤ .05 → "significant"
  • If p ≤ .01 → "highly significant."

How to Calculate a P Value on the TI 83

Example question: The average wait time to see an E.R. doctor is said to be 150 minutes. You think the wait time is actually less. You take a random sample of 30 people and find their average wait is 148 minutes with a standard deviation of 5 minutes. Assume the distribution is normal. Find the p value for this test.

  • Press STAT then arrow over to TESTS.
  • Press ENTER for Z-Test .
  • Arrow over to Stats. Press ENTER.
  • Arrow down to μ0 and type 150. This is our null hypothesis mean.
  • Arrow down to σ. Type in the standard deviation: 5.
  • Arrow down to x̄ (xbar). Type in your sample mean: 148.
  • Arrow down to n. Type in your sample size: 30.
  • Arrow to <μ0 for a left-tailed test. Press ENTER.
  • Arrow down to Calculate. Press ENTER. P is given as .014, or about 1.4%.

The probability of getting a sample mean of 148 minutes or lower, if the true average wait really were 150 minutes, is tiny, so you should reject the null hypothesis.

Note: If you don't want to run a full test, you could also use the TI 83's normalcdf function to get the area (which is the same thing as the probability value).
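If you'd rather check the result with software, here is a minimal Python sketch (assuming SciPy is available; the numbers are the ones from the example question) that reproduces the left-tailed z-test:

    from math import sqrt
    from scipy.stats import norm

    mu0 = 150    # null hypothesis mean (minutes)
    xbar = 148   # sample mean
    sigma = 5    # standard deviation
    n = 30       # sample size

    # Standardize the sample mean
    z = (xbar - mu0) / (sigma / sqrt(n))

    # Left-tailed test: p is the area under the standard normal curve to the left of z
    p = norm.cdf(z)
    print(f"z = {z:.3f}, p = {p:.3f}")  # z ≈ -2.191, p ≈ 0.014

This is the same area the calculator's normalcdf function would give you for the region below the test statistic.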



Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans. Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.


Step 1: State your null and alternate hypothesis

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H0) and alternate (Ha) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women.
  • Ha: Men are, on average, taller than women.


Step 2: Collect data

For a statistical test to be valid, it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

Step 3: Perform a statistical test

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data. In the height example, a t-test comparing the two group means will give you (a code sketch follows the list):

  • an estimate of the difference in average height between the two groups.
  • a p-value showing how likely you are to see this difference if the null hypothesis of no difference is true.
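As a concrete illustration of this step, here is a minimal Python sketch for the height example. The two samples are invented for illustration; only the structure of the test (a one-sided two-sample t-test) is the point:

    from scipy import stats

    # Hypothetical height samples in cm (invented for illustration)
    men = [178, 181, 175, 183, 179, 177, 185, 180, 176, 182]
    women = [165, 170, 163, 168, 172, 166, 169, 164, 171, 167]

    # One-sided test: Ha is that men are taller on average
    t_stat, p_value = stats.ttest_ind(men, women, alternative="greater")

    diff = sum(men) / len(men) - sum(women) / len(women)
    print(f"estimated difference = {diff:.1f} cm")
    print(f"t = {t_stat:.2f}, p = {p_value:.4g}")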

Step 4: Decide whether to reject or fail to reject your null hypothesis

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).

Step 5: Present your findings

The results of hypothesis testing will be presented in the results and discussion sections of your research paper, dissertation or thesis.

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

Statistics

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Frequently asked questions about hypothesis testing

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Cite this Scribbr article

Bevans, R. (2023, June 22). Hypothesis testing | A step-by-step guide with easy examples. Scribbr. Retrieved August 13, 2024, from https://www.scribbr.com/statistics/hypothesis-testing/


Interpreting P values

By Jim Frost

P values determine whether your hypothesis test results are statistically significant, and they appear all over statistics. You'll find P values in t-tests, distribution tests, ANOVA, and regression analysis. P values have become so important that they've taken on a life of their own. They can determine which studies are published, which projects receive funding, and which university faculty members become tenured!

Ironically, despite being so influential, P values are misinterpreted very frequently. What is the correct interpretation of P values? What do P values really mean? That’s the topic of this post!


P values are a slippery concept. Don’t worry. I’ll explain p-values using an intuitive, concept-based approach so you can avoid making a widespread misinterpretation that can cause serious problems.

Learn more about Statistical Significance: Definition & Meaning .

What Is the Null Hypothesis?

P values are directly connected to the null hypothesis. So, we need to cover that first!

In all hypothesis tests, the researchers are testing an effect of some sort. The effect can be the effectiveness of a new vaccination, the durability of a new product, and so on. There is some benefit or difference that the researchers hope to identify.


To understand this idea, imagine a hypothetical study for medication that we know is entirely useless. In other words, the null hypothesis is true. There is no difference at the population level between subjects who take the medication and subjects who don’t.

Despite the null being accurate, you will likely observe an effect in the sample data due to random sampling error. It is improbable that samples will ever exactly equal the null hypothesis value. Therefore, the position you take for the sake of argument (devil’s advocate) is that random sample error produces the observed sample effect rather than it being an actual effect.

What Are P values?

P-values indicate the believability of the devil’s advocate case that the null hypothesis is correct given the sample data. They gauge how consistent your sample statistics are with the null hypothesis. Specifically, if the null hypothesis is right, what is the probability of obtaining an effect at least as large as the one in your sample?

  • High P-values: Your sample results are consistent with a true null hypothesis.
  • Low P-values: Your sample results are not consistent with a true null hypothesis.

If your P value is small enough, you can conclude that your sample is so incompatible with the null hypothesis that you can reject the null for the entire population. P-values are an integral part of inferential statistics because they help you use your sample to draw conclusions about a population.

Background information : Difference between Descriptive and Inferential Statistics and Populations, Parameters, and Samples in Inferential Statistics

How Do You Interpret P values?

Here is the technical definition of P values:

P values are the probability of observing a sample statistic that is at least as extreme as your sample statistic when you assume that the null hypothesis is true.

Let’s go back to our hypothetical medication study. Suppose the hypothesis test generates a P value of 0.03. You’d interpret this P-value as follows:

If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.
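One way to make that interpretation concrete is to simulate it. The sketch below is a toy version of the medication study (the per-group sample size and the observed difference are invented for illustration): it repeatedly draws both groups from the same population, so the null is true by construction, and counts how often sampling error alone produces a difference at least as large as the observed one.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 50                # per-group sample size (hypothetical)
    observed_diff = 0.44  # observed group difference, chosen for illustration
    n_sims = 100_000

    # The null is true: both groups come from the same population
    treatment = rng.normal(0, 1, size=(n_sims, n))
    control = rng.normal(0, 1, size=(n_sims, n))
    diffs = treatment.mean(axis=1) - control.mean(axis=1)

    # Fraction of null studies with an effect at least as extreme as observed
    p_sim = np.mean(np.abs(diffs) >= observed_diff)
    print(f"simulated p ≈ {p_sim:.3f}")  # ≈ 0.03 with these made-up numbers

With these made-up numbers, the simulated fraction comes out near 0.03, which is exactly the kind of long-run fraction a p-value describes.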

How probable are your sample data if the null hypothesis is correct? That’s the only question that P values answer. This restriction segues to a very persistent and problematic misinterpretation.

Related posts: Understanding P values can be easier using a graphical approach: see How Hypothesis Tests Work: Significance Levels and P-values. You can also learn about significance levels from a conceptual standpoint.

P values Are NOT an Error Rate

Unfortunately, P values are frequently misinterpreted. A common mistake is that they represent the likelihood of rejecting a null hypothesis that is actually true (Type I error). The idea that P values are the probability of making a mistake is WRONG! You can read a blog post I wrote to learn why P values are misinterpreted so frequently .

You can’t use P values to directly calculate the error rate for several reasons.

First, P value calculations assume that the null hypothesis is correct. Thus, from the P value’s point of view, the null hypothesis is 100% true. Remember, P values assume that the null is true, and sampling error caused the observed sample effect.

Second, P values tell you how consistent your sample data are with a true null hypothesis. However, when your data are very inconsistent with the null hypothesis, P values can’t determine which of the following two possibilities is more probable:

  • The null hypothesis is true, but your sample is unusual due to random sampling error.
  • The null hypothesis is false.

To figure out which option is right, you must apply expert knowledge of the study area and, very importantly, assess the results of similar studies.

Going back to our medication study, let’s highlight the correct and incorrect way to interpret the P value of 0.03:

  • Correct : Assuming the medication has zero effect in the population, you’d obtain the sample effect, or larger, in 3% of studies because of random sample error.
  • Incorrect : There’s a 3% chance of making a mistake by rejecting the null hypothesis.

Yes, I realize that the incorrect definition seems more straightforward, and that’s why it is so common. Unfortunately, using this definition gives you a false sense of security, as I’ll show you next.

Related posts : See a graphical illustration of how t-tests and the F-test in ANOVA produce P values.

Learn why you “fail to reject the null hypothesis” rather than accepting it.

What Is the True Error Rate?


The P value for our medication study is 0.03. If you interpret that P value as a 3% chance of making a mistake by rejecting the null hypothesis, you’d feel like you’re on pretty safe ground. However, after reading this post, you should realize that P values are not an error rate, and you can’t interpret them this way.

If the P value is not the error rate for our study, what is the error rate? Hint: It’s higher!

As I explained earlier, you can’t directly calculate an error rate based on a P value, at least not using the frequentist approach that produces P values. However, you can estimate error rates associated with P values by using the Bayesian approach and simulation studies.

Sellke et al.* have done this. While the exact error rate varies based on different assumptions, the values below use run-of-the-mill assumptions.

  • P value of 0.05 → probability of rejecting a true null hypothesis of at least 23% (and typically close to 50%)
  • P value of 0.01 → probability of rejecting a true null hypothesis of at least 7% (and typically close to 15%)

These higher error rates probably surprise you! Regrettably, the common misconception that P values are the error rate produces the false impression of considerably more evidence against the null hypothesis than is warranted. A single study with a P value around 0.05 does not provide substantial evidence that the sample effect exists in the population. For more information about how these false positive rates are calculated, read my post about P-values, Error Rates, and False Positives .
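To see where numbers like this can come from, here is a rough simulation sketch. Everything in it is an assumption for illustration: a 50/50 prior on the null being true, a standardized effect of 0.5 when it is false, and 30 observations per study. It generates many one-sample t-tests, keeps only the studies whose p-value lands just under 0.05, and asks what fraction of those significant-looking results came from a true null.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_studies, n = 200_000, 30
    null_true = rng.random(n_studies) < 0.5   # 50/50 prior (assumption)
    effect = np.where(null_true, 0.0, 0.5)    # d = 0.5 when an effect exists (assumption)

    # One-sample t-test of each study against mu = 0
    samples = rng.normal(effect[:, None], 1, size=(n_studies, n))
    t = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
    p = 2 * stats.t.sf(np.abs(t), df=n - 1)

    # Among studies with p just under 0.05, how many had a true null?
    borderline = (p > 0.04) & (p < 0.05)
    print(f"false positive fraction ≈ {null_true[borderline].mean():.2f}")

With these particular assumptions the fraction lands in the rough vicinity of the Sellke et al. lower bound; change the prior or the effect size and it moves, which is exactly why the figures above are reported as ranges rather than a single number.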

These estimated error rates emphasize the need to have lower P values and replicate studies that confirm the initial results before you can safely conclude that an effect exists at the population level. Additionally, studies with smaller P values have higher reproducibility rates in follow-up studies . Learn about the Types of Errors in Hypothesis Testing .

Now that you know how to interpret P values correctly, check out my Five P Value Tips to Avoid Being Fooled by False Positives and Other Misleading Results!

Typically, you’re hoping for low p-values, but even high p-values have benefits !

Learn more about What is P-Hacking: Methods & Best Practices .

*Thomas Sellke, M. J. Bayarri, and James O. Berger, "Calibration of p Values for Testing Precise Null Hypotheses," The American Statistician, February 2001, Vol. 55, No. 1.


Reader Interactions


September 13, 2023 at 10:25 am

Thanks Jim for the nice explanation!

Given that even low p-values are associated with quite high false positive rates (23 – 50%), wouldn't studies for new drugs/medicines that get approved based on a significant p-value (and meaningful clinical benefit in the sample population) end up not showing any effect in the real-world setting?

Or do real world drug studies look at a myriad of other sample statistics as well?


September 13, 2023 at 3:57 pm

I don’t know the specific guidelines the FDA uses to approve drugs here in the U.S., but I have heard that there must be more than one statistically significant study. Additionally, the FDA should evaluate not just the statistical significance but also the practical significance in real-world terms. Is the effect size meaningful?

Furthermore, many clinical trials for medications use extremely large sample sizes. For example, Moderna’s COVID vaccine study had 30,000 participants! (Click to read my review of the study.) Consequently, when a practically meaningful effect size truly exists, these powerful studies tend to produce very low p-values.

So, between assessing multiple studies, effect sizes, and using very large samples, the series of clinical trials that pharmaceutical companies perform for new medication tend to produce strong evidence. Pharmaceutical companies might not have the best reputation, but I have to admit their clinical trial protocols are top notch.


April 26, 2022 at 11:51 pm

This is extremely helpful, thank you!

April 28, 2022 at 12:21 am

You’re very welcome, Kirsten. I’m so glad it was helpful!

April 25, 2022 at 6:32 pm

Hi Jim, I hope it’s not too late to comment and ask a question on this subject? My graduate stats professor (one of them anyway) taught us not to report the actual p value if we reject the null hypothesis. If I remember his thinking properly (and I may not!) it is because, as you say, p values are based on the premise that the null hypothesis is true. Once you reject the null, then the actual p value shouldn’t then be reported. Just if it’s p< .01 or whatever. I've taken this as gospel for years. But it seems like you might not agree with this based on your reporting that it seems like it does matter how low the p value is (lower p value results more reproducible, etc.). I could imagine a couple conclusions here: he's right, he's wrong, he's right but your point is still valid that the lower p values are meaningful but you still shouldn't report the actual p value and just p<.001, or some combination I haven't thought of. Can you elucidate a bit? Thanks so much!

April 26, 2022 at 9:32 pm

Hi Kirsten,

I do disagree with the advice to only state that the results are significant at some level, such as 0.05. The particular value of the p-value provides additional information. When the p-value is less than 0.05, it's significant, but there's a vast difference between a p-value of 0.045 and one of 0.001. When the p-value is near 0.05, it can be significant but the evidence against the null is fairly weak. On the other hand, if it's near 0.001, that's really strong evidence against the null. So just saying the results are significant at the 0.05 level leaves out a lot of information!

The key point to remember is that the precise p-value indicates the strength of the evidence against the null. So, it's really helpful to know the exact value. And that's true even when you do reject the null. Just how strong was the evidence?

In this post, I discuss some reasons for doing that based on Bayesian ideas. You might also be interested in an empirical look at how lower p-values are related to greater reproducibility of results. In that article, I look at studies that were reproduced. But imagine you only had one study. You can see how the p-value is very helpful for understanding the strength of the results!

And there’s really no reason to not report the precise p-value. It’s not like it costs you more!


December 20, 2021 at 1:37 pm

Can you please suggest an article that explains the relationship between bias in data and the p value? I am new to these concepts, so I'm getting confused.

December 21, 2021 at 12:48 am

I don’t have an article for you. However, most p-values assume that the data are unbiased. When there is bias (measurements tend to be too high or too low), p-values are generally not valid.


November 12, 2021 at 10:11 am

Thanks Jim, that’s very helpful and a bit of a relief as I’m just getting to grips with it, so not thrilled about the idea of having the whole rug pulled out from under my feet. I’ve also been reading your https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/ page which confirms that the p-values in regression models are simply a form of inferential statistical hypothesis test, which suggests the answer to my last query about this is yes, the same applies, but this is also subject to the info you provided in your reply (? or is there a different slant in relation to regression model p-values?)

November 13, 2021 at 11:41 pm

Sorry, I accidentally missed your question about p-values in regression!

The same principles apply to p-values in regression analysis. Although, I'd say there are extra concerns surrounding them because now you need to worry about the characteristics of the model. There are various issues that can affect the validity of the model and bias the p-values. However, once you get to a valid model, you're dealing with the same principles behind p-values as elsewhere. P-values all relate to hypothesis tests that are a part of inferential statistics. These tests, from t-tests to regression analysis, all help you use samples to draw conclusions about the population.

November 11, 2021 at 4:09 pm

Hi Jim, I love your website and am a happy owner of your ‘Intro to Statistics’ ebook.

I’m not properly trained in statistics, and I’ve been recently reading some of the debate around interpretation of p-values and statistical significance in the journals, such as ‘The ASA Statement on p-values’ in 2016 ( https://doi.org/10.1080/00031305.2016.1154108 ), the follow-up Editorial in 2019 ‘Moving to a world beyond “p<0.05"' ( https://doi.org/10.1080/00031305.2019.1583913 ), and recent articles like 'The p-value statement, five years on' ( https://doi.org/10.1111/1740-9713.01505 ).

For untrained people like me, it seems like statisticians are at war with other scientists over the issue of p-value interpretation, and it's difficult to know what to make of it.

Do you have any views on this debate, and whether the conventional use and interpretation of p-values for inferential hypothesis testing and statistical significance still holds any value or meaning?

And do the concerns raised by the ASA and others also potentially undermine interpretation of p-values in regression models?

November 12, 2021 at 12:38 am

Thanks so much for getting my book and so glad to hear that you’re a happy owner! 🙂

I've followed those p-value debates with interest over the years. I do have some thoughts on it. For starters, in this post, you get some sense of where I think the problem lies. There's the common misinterpretation that I write about, which falsely overstates the strength of the evidence against the null hypothesis. And that's where I think the problem really starts. You get a p-value of 0.04 and think, it's significant! But a single study with that p-value provides fairly weak evidence against the null hypothesis. So, you really need lower p-values and/or more replication studies. Preferably both! Even one study with a lower p-value isn't conclusive.

But, I do think p-values are valuable tools. They quantify the strength of the evidence against the null. The problem is that people misuse and abuse them.

To get a sense of how I think they should be used, read my post about Five P Value Tips to Avoid Being Fooled . Hopefully, from that post you’ll see there is a smart way to use p-values and other tools, such as confidence intervals. And read my post about P-values and Reproducibility to see how they can really shine as measures of evidence.

Finally, to get a perspective on why p-values are misinterpreted so frequently , click that link to learn more!

I’d hope that people can learn to use p-values correctly. They’re good tools, but they’re being used incorrectly.

I hope that answers your question!


April 26, 2021 at 12:05 pm

Hey, can you please tell me how to calculate the P value mathematically in regression?


April 12, 2021 at 10:12 am

Wow, thank you for this brilliant article, Jim!

If I get a p-value of 0.03, would it be correct to say: "If the H0 is true, there is only a probability of 3% of observing this data. Hence, we reject the null hypothesis."

Is this statement correct, and is there any other credible way of bringing the word ‘probability’ into the interpretation?

Thank you very much mate!

Cheers, Christian

April 13, 2021 at 12:33 am

Hi Christian,

That's very close to being 100% correct! The only thing I'd add is "this data or more extreme." But you're right on with the idea. Most people don't get it that close!

There’s really no other way to work in probability to this context. In fact, I’ll often tell people that if they’re using probability in relation to anything other than their data/effect size, that’s a sign that they’re barking up the wrong tree.

Thanks for writing!


March 5, 2021 at 4:04 pm

I conducted a mediation analysis (Baron and Kenny) and my p-value from a Strobel Test came back negative? What does a negative p-value signify?

March 5, 2021 at 11:01 pm

Hi Monique, I’ll assume that you’re actually asking about the Sobel test (there is no Strobel test that I’m aware of). I don’t know why you got a negative p-value. That should not occur. There might be a problem with the code or application you’re using.


February 7, 2021 at 11:28 am

I enjoy reading your blogs. I purchased two of your books. I have learnt more from these books than from textbooks written by other people. I have a question about interpretation of significance level and p-value – two statements from your book come across as contradictory (to me).

On page 11 of your “hypothesis testing” book, these statements concerning interpretation of significance level are made :

(1) In other words it is the probability that you say there is an effect when there is no effect. For instance, a significance level of 0.05 signifies a 5% risk of deciding that an effect exists when it does not exist.

On page 77 the following statement is made about interpretation of pvalue : (2) A common mistake is that they represent the likelihood of rejecting a null hypothesis that is actually true (Type I error). The idea that p-values are the probability of making a mistake is wrong !

I find statements (1) and (2) contradictory because of the following. In making the decision about whether to reject the null hypothesis, one compares the p-value to the significance level. (If the p-value is lower than the preset significance level, one rejects the null hypothesis.) It is possible to compare two quantities only if they have the same interpretation (same units, in physics problems). Therefore the interpretation of the significance level and the p-value should be the same! For example, if the p-value turns out to be 0.04, we reject the null hypothesis since 0.04 is lower than 0.05. If a 0.05 significance level implies a 5% risk of (incorrectly) rejecting a true null hypothesis, then shouldn't a p-value of 0.04 be interpreted as a 4% risk of (incorrectly) rejecting a true null hypothesis?

What am I missing here ?

February 7, 2021 at 2:36 pm

Thanks so much for supporting my books!

This issue is very confusing. You might find it surprising, but there are no contradictory statements in what I wrote!

Keep in mind that your 1 and 2 statements are about the significance level and p-values, respectively. So, they’re about different concepts and, hence, it’s not surprising that different conditions apply.

For significance levels (alpha), it is appropriate to say that if you use a significance level of 0.05, then for all studies that use that significance level, you’d expect 5% of them to be positive when the null hypothesis is true. Importantly, significance levels apply to a range of p-values. Also, note that stating that you have a 5% false-positive rate when the null is true is entirely different than applying an error rate probability to the null hypothesis itself.

We’re not saying there’s a 5% chance that the test results for an individual study are incorrectly saying that the null is false when it is actually true. We’re saying that in cases where the null is true, 5% of studies that use a significance level of 0.05 will get false positives. Unfortunately, we’re never sure when the null is true or not. We just know the error rate for when it is true. In other words, it’s based on the assumption that the null is true.

Your second statement is about the p-value. That's the probability for a specific study rather than a class of studies. It's the probability of obtaining the observed results, or more extreme, under the assumption that the null is true.

So, alpha applies to a class of studies (have p-values within a range and the null is true), whereas p-values apply to a specific study. For both, it’s under the assumption that the null is true and does not indicate the probability related to any hypothesis.

Let’s get to your example with a p-value of 0.04 and we’re using a significance level of 0.05. The correct interpretation for the p-value is that you have a 4% chance of observing the results you obtained, or more extreme, if the null is true. For the significance level, your study is significant. Consequently, it is in the class of studies that obtain significant results using an alpha of 0.05. In that class, 5% of the studies will produce significant results when the null is true. However, we don’t know whether the null is true or not for your study. Additionally, we can’t use those results to determine the probability of whether the null is true.

Specifically, it is NOT accurate to say that a p-value of 0.04 represents a 4% risk of incorrectly rejecting the null. That’s the common misconception I warn about!

I hope that helps clarify! It is a tricky area. Just remember that any time you start to think that either p-values or the significance level allow you to apply a probability to the null hypothesis, you’re barking up the wrong tree. Both assume that the null is true. Please note in my hypothesis testing book my illustrations of sampling distributions of the various tests statistics. All of those are based on the assumption that the null is true. From those distributions, we can apply the significance level and derive p-values. So, they’re incorporating the underlying assumption that the null is true.
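A quick simulation sketch can make that distinction tangible (the population, sample size, and number of experiments are arbitrary choices): draw many samples from a population where the null really is true, run a t-test on each at alpha = 0.05, and watch the error rate for the class of studies settle near 5%, even though no individual p-value tells you the probability that its study is one of the errors.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n_experiments, n = 50_000, 25

    # The null is true by construction: every sample comes from N(100, 15)
    data = rng.normal(100, 15, size=(n_experiments, n))
    _, p = stats.ttest_1samp(data, popmean=100, axis=1)

    # About 5% of this class of studies is significant at alpha = 0.05,
    # and every one of those is a false positive
    print(f"fraction significant: {np.mean(p < 0.05):.3f}")  # ≈ 0.05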


January 14, 2021 at 2:03 pm

Hi, when writing the interpretation, do we set it up as "Assuming the null is true, there is a 3% chance of getting…" the null hypothesis or the alternative? I also do not necessarily understand why, if the p-value is bigger than alpha, we fail to reject the null hypothesis.


November 11, 2020 at 12:49 pm

Would this be a fair statement?

With an alpha of 0.05, if one repeats the sampling enough times, the mean percentage of Type I errors will approach 5% (since Type I errors do assume a true null hypothesis)? However, we cannot say that about an individual test and its P-value.

November 11, 2020 at 3:58 pm

Hi, that is sort of correct. More correct would be to say that if you repeat an experiment on a population where the null is true, you’d expect 5% (using alpha = 0.05) of the studies to be statistically significant (false positives). However, if the null is false, you can’t have a false positive! So, keep in mind that what you write is true only when the null is true.

And, right, using frequentist methodology, you can’t use the p-value (or anything else) to determine the probability that an individual study is a false positive.

November 11, 2020 at 9:58 am

I hope you don’t mind me continuing the conversation here, if not tell me.

Hopefully, I am also helping you in giving a clue where the mental blocks are.

I believe I get the distinction between P values and alpha (I would not conflate them). As I understand it now, P-Values are sample specific, point values, Alphas are related to a parameterized test statistic (PDF) that captures the results of repeated iterations of taking samples from the population. If that is wrong, then I need to be corrected before going any further.

What I did not grok, and which probably should be emphasized in the post, is that alphas still assume the null is true. Also, I read in your posts that alpha === error rate; I was taking this as the Type I error rate. It seems that was a false reading.

For the moment I am not interested in Type II errors and distinguishing them from Type I (False Positives). So what I would like to see in the blog post is why alpha is different from Type I errors, and why Bayesian simulation is needed to get a better handle on Type I errors.

And yes, this section (below) of your comment is also probably a great place of a blog article, since it would be great with a worked out example and a chart showing exactly how this disparity can happen.

"So, yes, you can be 95% confident that the CI contains the true parameter, but you might be in the 5% portion and not know it. And, it comes down to the probability that the null is false. If it's likely that the null is false then you're more likely to be in the 5%. When the null is more likely to be correct, you're more likely in the 95% portion. I can see a lot of minds blowing now!"

November 11, 2020 at 3:50 pm

Yes, that’s right about p-values and alpha. P-values are specific to a particular study. Using the frequentist methodology, there is no way to translate from a p-value to an error rate for a hypothesis. Alpha applies to a class of studies and it IS an error rate. It is the Type I error rate.

You had it right earlier that alpha = the Type I error rate. Alpha is the probability that your test produces significant results when the null is true. And Type I errors are when you reject a null hypothesis that is true. Hence, alpha and the Type I error rate are the same thing.

Think back to the plots that show the sampling distributions of the test statistics. Again, these graphs show the distribution of the test statistic assuming the null is true. To determine probabilities, you need an area under the curve. The significance level (alpha) is simply shading the outer 5% (usually) portion of the curve. The test statistic will fall in those portions 5% of the time when the null is true. You can’t get a probability for an individual value of the test statistic because that doesn’t produce an area under the curve.

November 10, 2020 at 1:58 pm

Well, that is a good answer at the definitional level, i.e. that is the probability of the effect, with the assumption that the null hypothesis is true. OK, but what I am trying to do with my clogged block-head is wrap my mind around this. (I am 1/2 way through the hypothesis testing book, and yes, the diagrams help, but not yet on this.)

Here is another way I am struggling with this. OK, granted that the P-value is disconnected from the error rate, but in your book you mention that alpha is the same thing as the Type I error rate.

So if my alpha is 0.05 and my P-value is 0.03, why am I not at a 95% confidence level? As you say in this post, Sellke et al.* using simulation show that the actual error rate is probably closer to 50%. Huh? Should I not be at least 95% confident there is no Type I error?

Now, I have a hunch this all has to do with the fact that after the alternative hypothesis is accepted, there are some conditional probabilities (Bayes strikes again). But I am trying to ground this in intuition, and that is why I think a worked example of how we go from 0.05 to 0.5 would help.

That is why I am looking for an example worked out with graphs that identify where the “additional” source of Type I errors is occurring.

November 10, 2020 at 10:59 pm

Hi Yechezkal,

I highlight the definition because it’ll point in the right direction when you’re starting out. If you ever start thinking that it’s the probability of a hypothesis, you know you’re barking up the wrong tree!

As you look at the graphs, keep in mind that they show the sampling distributions of the test statistic. These distributions assume that the null is true. Hence, the peak occurs at the null hypothesis value. You then place the test statistic for your sample into that distribution. That whole process shows how the null being true is baked right into the calculations. The distributions apply to a class of studies, those with the same characteristics as yours. The test statistics is specific to your test. You’ll see that distinction between class of study and your specific study again in a moment.

You raise a good point about alpha. And the fact that you’re comparing alpha (which is an error rate) to the p-value (not an error rate) definitely adds to the confusion. I write about this and other reasons for why p-values are misinterpreted so frequently . (There’s some historical reasons at play among other things.)

The significance level (alpha) is an error rate. However, it is an error rate for when the null hypothesis is true. This error rate applies to all studies where the null is true and have the same alpha. For example, if you use an alpha of 0.05 and you have 100 studies where the null is true, you’d expect five of them to have significant results. The key point is that the error rate for alpha applies to a class of studies (null is true, same alpha).

On the other hand, p-values apply to a specific study. Furthermore, while you know alpha, you don't know whether the null is true. Not for sure. So, if you obtain significant results, is it because the effect exists in the population, or is it a Type I error (false positive)? You just don't know.

So, when you obtain a significant p-value and calculate a 95% confidence interval, those results will agree. However, you still don't know the probability that the null is true or not. So, yes, you can be 95% confident that the CI contains the true parameter, but you might be in the 5% portion and not know it. And, it comes down to the probability that the null is false. If it's likely that the null is false, then you're more likely to be in the 5%. When the null is more likely to be correct, you're more likely in the 95% portion. I can see a lot of minds blowing now!

I will be writing a blog post on this, so I’m not going to explain it all here. It’s just too much for the comments section. P-values and CIs are part of the frequentist tradition in statistics. Under this view, there is no probability that the null is true or false. It’s either true or false but you don’t know. You can’t calculate the probability using frequentist methods. You know that if the null is true, then there’s a 5% chance of obtaining significant results anyway. However, there is no way to calculate the probability of the null being true so there’s no way to convert it into an error rate.

However, using simulations and Bayesian methodology, you can get to the point of estimating error rates for p-values . . . sort of in some cases. Some Frequentists don’t like this because it is going outside their methodology, but it sheds light on the real strength of the evidence for different p-values. And, the conclusions of the simulation studies and Bayesian methodology are consistent with attempts to reproduce significant results in experiments . P-values predict the likelihood of reproducing significant results.

So, stay tuned for that blog post! I’ll make it my next one. If you’re on my email list, you’ll receive an email when I publish it. If not, add yourself to the email list by looking in the right margin of my website and scroll partway down. You’ll see a box to enter your email to receive notifications of new blog posts.

November 4, 2020 at 11:10 am

Jim, I am a Ph.D. in Computer science. I really like your approach to teaching this, I have always struggled with getting an intuition into stats. But I am still mentally blocked on why the P-value is not the same as the error rate.

"The null hypothesis is false." — I get this, that is part of the definition and assumption of p, but I still don't see how it affects the error rate.

Later on you state (and I can accept it on authority, but not on intuition):

—– Sellke et al.* have done this. While the exact error rate varies based on different assumptions, the values below use run-of-the-mill assumptions.

P value → probability of rejecting a true null hypothesis: 0.05 → at least 23% (and typically close to 50%); 0.01 → at least 7% (and typically close to 15%). These higher error rates probably surprise you!

Well yes, it does surprise me. Can I be somewhat chutzpanik and ask you to create a numerical example problem or two that has low p-values (e.g. 0.05) and error rates of 15%-50%, then show the factors (from the example) that lead to the higher error rate?

I have also read that if the significance level I am seeking is 0.05 (and yes, I grok that it is different from the p-value), then if you do enough experiments, the error rate will approach the alpha (significance level).

If that could be also part of the example, I think folks would grasp this better from a real world example than from declarative statements?

Do you think this would be worth a blog post to attach to this one? Tell me if that is true, and ping me if you do such a thing.

What I am working on are modeling and simulations of military battles with new equipment. I am looking at how many times I need to run a stochastic simulation (since casualties will be different each time) till I get a definitive statement that this new equipment leads to fewer casualties.

November 4, 2020 at 10:52 pm

The p-value is a conditional probability based on the assumption that the null is true. However, what is it a probability of? It's a probability of observing an effect size at least as large as the one you observed. That probability has nothing to do with whether the null is true or false. So, keep that in mind. It's a probability of seeing an effect size. There's nothing in the definition about being a probability related to one of the hypotheses! That's why it's not an error rate! Then map on to that the conditional assumption that the null is true.

I think it’s easier to understand graphically. So, check it out in the context of how hypothesis tests work where I graphically show how p-values and significance levels work.

I will write a post about how this works and the factors involved. It’s an interesting area to study. Bayesian and simulation studies have looked at this using their different methodologies and have come up with similar answers. Look for the post either later this year or early 2021!

Thanks for writing and the great suggestion!


October 29, 2020 at 1:51 am

Thank you for the article. I have always struggled to correctly interpret the p-value. I have two sets of data (readings for process durations conducted using different approaches). I have used graphical representation, and the two sets seem very similar. However, I want to apply the t-test and examine whether they are really similar or not. I have two questions: A) Should I use the whole datasets when conducting the t-test and examining the p-value? I have more than 10k in both datasets. Or should I "randomly" select a sample from these 10k records I have? B) If, let's say, I got a t-test statistic of 2.5 and a p-value of 0.000045 (very small), what does that mean? Does it mean that the two datasets are actually different? (Meaning that I reject the null hypothesis that assumes they are similar.) Is there a better interpretation?

October 29, 2020 at 2:59 pm

This is a great question.

First, you should use the full dataset. There’s generally little reason to throw out data unless you question the data themselves. If you think the data are good, then keep it!

The "problem" with a large dataset is that it gives hypothesis tests a lot of statistical power. Having a lot of power gives the test the ability to detect very small effects. Imagine that there is a trivial difference between the means of the two populations. A test with very large sample sizes can detect this trivial difference and produce a very small, significant p-value. That might occur in your case because you have 10k observations in both groups. However, I put "problem" in quotes because it's not actually a problem, since there are methods for determining whether a statistically significant result is also practically significant.

I point out in various places that a significant p-value does not automatically indicate that the results are practically meaningful in the real world. In your example with a p-value of 0.000045, it indicates that the evidence supports the hypothesis that an effect exists at the population level. However, the p-value by itself does not indicate that the effect is necessarily meaningful in a practical sense. You should always take the extra step of assessing the precision and magnitude of that effect and the real-world implications, regardless of your sample size. I write about this process in my post about practical versus statistical significance.
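Here is a toy version of that large-sample effect (the 0.1-unit shift, standard deviation, and group sizes are invented): with 10,000 observations per group, even a practically trivial difference produces a tiny p-value, which is exactly why the estimated effect size deserves a look alongside the p-value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 10_000
    a = rng.normal(50.0, 1, n)   # group A
    b = rng.normal(50.1, 1, n)   # group B: shifted by a trivial 0.1 units (assumption)

    t_stat, p = stats.ttest_ind(a, b)
    print(f"difference = {b.mean() - a.mean():.3f}, p = {p:.3g}")
    # p is typically far below 0.05 here, yet a 0.1-unit difference
    # may be meaningless in practical terms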

I also write about it in my post with 5 tips to avoid being mislead by p-values .

I'd read those posts and follow those tips. Pay particular attention to the parts about assessing CIs of the estimated effect. In your case, the populations are probably different, but now you need to determine whether that difference is meaningful in a real-world sense.

I hope that helps!


October 27, 2020 at 9:45 pm

Unfortunately, the correct interpretation of the p-value is not valuable and is not informative for making judgements on the strength of the null hypothesis. Many people forget that the p-value strongly depends on the sample size: the larger n the smaller p (E. Demidenko. The p-value you can’t buy, 2016). The correct interpretation of the p-value is the proportion of samples from future samples of the same size that have the p-value less than the original one, if the null hypothesis is true. That is why I claim that the p-value is not informative but people try to overemphasize it. Use d-value — it has more sense.

October 27, 2020 at 11:04 pm

I’d agree that p-values are confusing and don’t answer the question that many people think it does. However, I’m afraid I have to disagree that it is not informative. It measures the strength of the evidence against the null hypothesis. As such, it is informative.

Sample size does affect p-values, but only when an effect is present. When the null hypothesis is true for the population, p-values do not tend to shrink as the sample grows. So, it's not accurate to say "the larger the n, the smaller the p." Sometimes yes. Sometimes no. I think you're referring to the potential problem that huge samples can detect minuscule effects that aren't important in the real world. I write about this in my post about practical significance vs statistical significance.

I’m guessing that when you say “d-value,” you’re talking about Cohen’s d, a measure of the relative effect size and not the d-value in microbiology that is the decimal reduction time! Cohen’s d indicates the effect size relative to the pooled standard deviation. It can be informative when you’re assessing differences between means. But, it doesn’t help you with other types of parameters. I’d suggest that you need to evaluate confidence intervals. They indicate the likely effect size while incorporating a margin of error. You can also use them to determine statistical significance. Unlike Cohen’s d, you can assess confidence intervals for all types of parameters, such as means, proportions, and counts. In short, CIs help you assess practical significance, the precision of the estimate, and statistical significance. As I write in my blog posts, I really like confidence intervals!
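As a small illustration of that confidence-interval approach (the two groups are invented numbers, and the pooled degrees of freedom are a simplification rather than the Welch correction):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    a = rng.normal(10.0, 2, 40)   # hypothetical group A
    b = rng.normal(11.0, 2, 40)   # hypothetical group B

    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    t_crit = stats.t.ppf(0.975, df=a.size + b.size - 2)

    # A 95% CI gives a range of plausible effect sizes,
    # not just a significant / not significant verdict
    lo, hi = diff - t_crit * se, diff + t_crit * se
    print(f"difference = {diff:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")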

Your definition of the p-value isn’t quite correct. P values are the probability of observing a sample statistic that is at least as extreme as your sample statistic when you assume that the null hypothesis is true.

Are p-values informative? Yes, they are. As I show towards the end of this post, lower p-values are associated with lower false discovery rates. Additionally, a replication study found that lower p-values in the initial study were associated with a higher chance that the follow-up replication study is also statistically significant. Read my post about Relationship Between the Reproducibility of Experimental Results and P-values .

High p-values can help prevent you from jumping to conclusions!

And, finally, I present tips for how to use p-values and avoid misleading results .

I hope that helps clarify p-values!


September 10, 2020 at 5:05 am

Hello Sir.. If p value is 0.03 and it means that 3% in the study show the sample effect due to random errors, what does it mean?

Can you please extend the explanation from there.

Why do we call it a statistically significant value, sir? What are we inferring here?

September 11, 2020 at 5:13 pm

Hi Gudelli,

For what a p-value of 0.03 means, just use the information I provide in this article. In fact, I give the correct interpretation for a p-value of 0.03 in this article! Scroll down until you see the green Correct in the text. That's what you're looking for. If there's a more specific point that's not clear, please let me know. But there's no point in me repeating what is already written in the article.

As for statistical significance, that indicates that an effect/relationship you observe in a random sample is likely to exist in the population from which you drew the sample. It helps you rule out the possibility that it was random sampling error. Remember, just because you see an effect in a sample does not mean it necessarily exists in the population.


August 4, 2020 at 11:26 pm

Hope you’re well.

When calculating our z scores, we obviously use (score-mean)/SD.

Say I have 50 years of annual climate data (1951-2000) – one mean for each year – do I have to use the mean and standard deviation of all this data? Or can I use the mean & SD of 1951-1980, for example? (That is, (1999 mean minus the mean of the 1951-1980 means) / SD of the 1951-1980 data.) Of course, this may well flag more statistically significant points between 1981 and 2000.

However, is this reasonable practice in data science, or is this over-manipulation/an absolute no-no? Thank you in advance for your help! Hope you have a good day! Ben

August 5, 2020 at 12:12 am

A normal distribution, for which you calculate Z-scores, involves a series of independent and identically distributed events. I just wrote a post about that concept. Time series data don't qualify because they're not independent events – one point in time is correlated with another point in time. And, if there is a trend in temperatures, they're not identically distributed. In a technical sense, it wouldn't be best practice to calculate Z-scores for that type of data. If you're just calculating them to find outliers, the requirements aren't so stringent. However, be aware that a trend in the data would increase the variability, which decreases the Z-scores because the SD is in the denominator. If you were to use shorter timeframes, there might not be noticeable trends in the data.

Typically, what you’d want to do is fit a time series model to the data and then look for deviations from the model’s expected values (i.e., large residuals).
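A rough sketch of that suggestion (the synthetic warming trend is invented for illustration): fit a simple trend model to the annual means, then flag years whose residuals are unusually large instead of z-scoring the raw values.

    import numpy as np

    rng = np.random.default_rng(11)
    years = np.arange(1951, 2001)
    # Synthetic annual means with a slight trend plus noise (illustration only)
    temps = 14.0 + 0.02 * (years - 1951) + rng.normal(0, 0.3, years.size)

    # Fit a linear trend and work with residuals instead of raw values
    slope, intercept = np.polyfit(years, temps, 1)
    residuals = temps - (slope * years + intercept)

    # Z-scores of residuals; |z| > 2 flags unusually warm or cold years
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    print(years[np.abs(z) > 2])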

I hope this helps!


July 10, 2020 at 3:53 am

What is the difference between the p-value as given by Excel or a statistics program, such as R, and the alpha level? What is the relation to the critical value? Why does this matter?

July 12, 2020 at 5:50 pm

If you're performing the same test with the same options, there should be no differences between Excel and statistical programs. However, I do notice that Excel sometimes has limited methodology options compared to statistical packages, which means their calculations might not always match up.

As for p-values and critical values (regions), I write about that in a post about p-values and significance levels . Read that article and if you more questions on that topic, please post them there!


June 25, 2020 at 3:14 am

My p value from a one-way ANOVA is 1.09377645180021E-12. What does it mean? Is it significant?

June 27, 2020 at 4:12 pm

Hi Aakash, your p-value is written using scientific notation. Scientific notation is a convenient way to represent very large and very small numbers. In your case, it represents a very small p-value. Yes, it’s significant!

The minus 12 indicates that you need to move the decimal point 12 places to the left. Your p-value is much smaller than any reasonable significance level and, therefore, represents statistically significant results. You can reject the null hypothesis for your ANOVA.


June 10, 2020 at 7:30 pm

Hello, thank you very much for your explanations. I have studied the significance of the correlations between several quantitative variables using software, but in practice I want to know how to calculate a p-value manually, in order to understand its principle. On the other hand, concerning the p-value, what does it mean technically? I find it difficult to define this parameter practically in my field of environmental chemistry. Cordially


June 9, 2020 at 3:45 pm

Hi! Thanks so much! This clarifies the difference very much. I’m analyzing and writing reports about nutrition-related literature. Two of the studies are prospective cohort studies with several covariates. The topic is the egg/dietary cholesterol relationship with cardiovascular disease. You probably know that nutrition research is like a roller coaster 🙂 So I encountered new terms for statistical analyses used in these types of studies, which explore non-linear associations. The Rao-Scott chi-square test, the Cox proportional hazards model, and restricted cubic splines are terms that I’ve learned recently. I love your blog; it’s helping me A LOT to understand and clarify basic and more advanced statistical concepts. I have bookmarked it and will be using it a lot! Lizette

June 10, 2020 at 12:08 pm

Hi Lizette, I often describe statistics as an adventure because it’s a process that leads to discoveries but it is filled with trials and tribulations! It sounds like you’re having an adventure! And, of course, we like having our “cool” terms in statistics! I don’t have blog posts on the procedures you mention, at least not yet.

I’m so glad my blog has been helpful in your journey! Thanks for taking the time to write. I really appreciate it!! 🙂

June 7, 2020 at 1:02 am

Hi, I’m trying to understand what “p linear” and “p non-linear trend” mean. I have only taken basic statistics, and I’m working on reviewing nutrition-related research articles. Thanks so much!

June 8, 2020 at 3:29 pm

Hi Lizette,

The context matters, and I’m not sure what kind of analysis this is from. I’ve heard of those p-values in the context of time series analysis. In that scenario, these p-values help you determine whether the time series has a constant rate of change over time (p linear) or a variable rate of change over time (nonlinear). The meaning of a linear trend is easy to understand because it represents a constant rate of change. Nonlinear trends are more nuanced because you might have a greater rate of change earlier, later, or in the middle; it’s not consistent throughout. You can also learn more from the combinations of the two p-values.

If the linear p-value is significant but nonlinear is not significant, you have a nice consistent rate of change (increase or decrease) over time. If both p-values are significant, it would suggest a variable rate of change but one that has a consistent direction over time. If neither p-value is significant, it suggests that the variable does not systematically tend to increase or decrease over time. If the nonlinear p-value is significant but not the linear p-value, it suggests you have variable rates of change in the short term but in the long run there is no systematic increase or decrease in the variable.


May 19, 2020 at 3:56 pm

How do you interpret a p-value that is displayed as P = 1.5 x 10^-19?

May 19, 2020 at 4:33 pm

Hi Natalie,

That p-value is written using scientific notation. Scientific notation is a convenient way to represent very large and very small numbers. In your case, it represents a very small p-value. Yes, it’s significant!

The minus 19 indicates that you need to move the decimal point 19 places to the left.

Your p-value is much smaller than any reasonable significance level and, therefore, represents statistically significant results. You can reject the null hypothesis for whichever hypothesis test you are performing.


May 15, 2020 at 1:45 am

I am getting a p-value of 0.351. Can you please explain it?


May 7, 2020 at 2:23 am

My p-value is 6.18694E-23. What does it mean? Is it significant?

May 7, 2020 at 3:45 pm

That p-value is written using scientific notation. Scientific notation is a convenient way to represent very large and very small numbers. In your case, it represents a very small p-value. Yes, it’s significant!

The number after the E specifies the direction and number of places to move the decimal point. For example, the negative 23 value in “E-23” indicates you need to move the decimal point 23 places to the left. On the other hand, positive values indicate that you need to shift the decimal point to the right.
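As a tiny illustrative aside (my own snippet, not from the reply): most languages, including Python, parse E-notation directly, so you can compare such a value to your significance level without converting it by hand.

```python
p = 6.18694E-23   # Python reads E-notation directly: decimal moved 23 places left
alpha = 0.05
print(p < alpha)  # True -> statistically significant at the 0.05 level
```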


April 18, 2020 at 3:00 am

Thanks so much for your answer Jim!

Indeed I think we want to reach the same conclusion, but I’d like to see the results of the wrong approach to further cement my understanding, since I’m not an expert in statistics (“seeing is believing!”). In other words, I entirely agree that it’s WRONG to keep testing the p-value as the experiment runs. But how can I prove it empirically? My idea was to show myself and others that all tests that should fail to reject the null hypothesis (my list of A/A tests described above) can reject it if left to run long enough (in other words, every A/A test will have p < 0.05 if it runs long enough). Is this statement correct? If not, why not? Thank you!

April 15, 2020 at 8:44 pm

After reading again, I have one more question about: “If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.”

In online testing, is it correct to say that we cannot have sampling error, since we always compare (for a limited time) the entire population in A and the entire population in B? If yes, how does that affect the interpretation of the p-value?

April 16, 2020 at 10:53 pm

Hi again Alfonso,

It’s still a sample. While you might have tested the entire population for a limited time, you are presumably applying the results outside of that time span. That’s also why you need to be careful about the time frame you choose. Is it representative? You can only apply the results beyond the sample (i.e., to other times) if your sample is representative of those other times. If you collect data all day Sunday, you might be capturing different users than you’d get during the week. If that’s true, you wouldn’t be able to generalize your Sunday sample to other days. The same applies if you were to collect data during only a specific time of day.

In your context, you’re still collecting a sample that you want to generalize beyond the sample. So, you’d need to use hypothesis testing for that reason. You should also ensure that your sampling method collects a representative sample.

I hope this helps! I’m glad that you’re hooked and reading!!

April 15, 2020 at 7:38 pm

A colleague just shared your blog with me, and after two posts I’m hooked. I will read more today. I use t-tests and p-values in the domain of web and app A/B testing, and I’ve read everything I could find online, but I still wasn’t sure I understood. I built an A/A simulator in Python, and I got a lot more statistically significant results than 5%, so I’m confused. Just for clarity, I call an A/A test a randomized experiment where both series use the same success rate in %. Even after reading your article, alpha and the p-value still somehow overlap for me. I’ll keep reading your articles to further clarify.

I have 3 questions that I hope you can answer:

  • What would the graph look like if I plotted the p-values of 20 A/A tests over time? I would expect the p-value to swing widely in the beginning and then stay firmly above 0.05, and every so often a test would go to statistical significance for a while and then come back above 0.05. I would expect 1 or at most 2 statistically significant experiments *at any given point in time* (this is crucial in my understanding) after a big enough sample size has been reached.
  • Is it true that if I keep collecting samples, every single A/A test will eventually turn statistically significant, even if just briefly?
  • Given that I will run hundreds or thousands of tests, is there an accepted standard way to build my analysis framework to guarantee a 5% false positive rate? I was thinking all I needed was to set the sample size at the start to avoid falling into the trap I ask about in the previous question, but now I’m not so sure anymore. (I use a well-known online tool to calculate my sample size based on the base conversion rate and the observable absolute or relative difference.)

I will keep reading, but if you talk about any of this in detail in any other article, I would be grateful if you could share the link. And if you haven’t covered these topics, I hope you might do so in the future.

April 16, 2020 at 10:46 pm

Hi Alfonso,

I know enough about the context of online A/B testing to know that it is often fairly different from how we’d teach using hypothesis tests in statistics.

For statistical studies, you’d select a sample size, collect your sample, perform the experiment, and analyze the results just once. You wouldn’t keep collecting data and performing the hypothesis test with each new observation. The risk with continuing to perform the analysis as you collect the data is that, yes, you are very likely to get an anomalous significant result at some point. I don’t recommend the process you describe of plotting p-values over time. Pick a sample size, stick with it, and calculate the results only at the end.

Also, be aware that different types of users might be online at different times and days of the week due to things like age, employment status, and time zone. Use a sampling plan that gets a good representative sample. Otherwise, your results might apply only to a subset.

If you follow the standard rule of collecting the sample of a set size and analyzing the data once at the end, then your false positives should equal your significance level. If you’re checking the p-values repeatedly or keep testing until you get a significant p-value, that will dramatically increase your false positive rate.
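For readers who want to see that inflation for themselves, here is a rough simulation sketch (my own illustration, with assumed normal data and arbitrary check points, not code from this exchange): both groups are identical A/A data, yet stopping the first time p < 0.05 flags far more than 5% of experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_max, checks = 500, range(20, 501, 20)
hits_once, hits_peek = 0, 0

for _ in range(1000):
    a = rng.normal(0, 1, n_max)
    b = rng.normal(0, 1, n_max)          # same distribution: the null is true
    # Correct procedure: a single test at the planned sample size.
    if stats.ttest_ind(a, b).pvalue < 0.05:
        hits_once += 1
    # Peeking: test repeatedly, stop at the first significant p-value.
    if any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in checks):
        hits_peek += 1

print(hits_once / 1000)   # ~0.05, as alpha promises
print(hits_peek / 1000)   # noticeably higher than 0.05
```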

Finally, I’ve heard that some A/B testing uses one tailed testing. Don’t do that! For more details, read my post about when you should use one-tailed testing .


March 4, 2020 at 2:18 am

I have read the comments. I am not a specialist in statistics, but I use statistics in my research. Let us consider the application of p to t and r. In each case, a study is conducted and the results are significant at the 5% level (p ≤ .05). A t-test assesses the mean difference in the wage rate of females in two locations, X and Y (the means indicate Y has a higher value); r indicates the relation between depression and low exam marks among students. In each case, the sample size is 100. It may be understood that in research we test a directional alternate hypothesis, not the null (which is obviously the no-difference or no-relation statement and the opposite of the alternate hypothesis). Taking the p into account, how do we give a convincing interpretation or linguistic expression so that a non-expert can understand it? Linking it to false positives or error may not be understood by a common man. Please reply. Does it mean the following in the context of t and r, respectively? t: There is a 95% chance that the wages in location Y are higher than in location X and a 5% chance that the difference is not there. r: The relation between depression and low exam marks holds good in 95% of cases and does not hold good in 5% of cases.

March 4, 2020 at 11:35 am

Hi Damodar,

Many of the answers to your questions are right here in this post. So, you might want to reread it.

P-values represent the strength of the evidence that your sample provides against the null hypothesis. That’s why you use the p-value in conjunction with the significance level to determine whether you can reject the null. Hypothesis testing is all about the null hypothesis and whether you can reject it or fail to reject it.

Coming up with an easy-to-understand definition of p-values is difficult, if not impossible. That’s unfortunate because it makes them difficult to interpret correctly. Read my post about why p-values are misinterpreted so frequently for more on that.

As for your interpretation, those are the common misconceptions that I refer to in this post. So, please reread the sections where I talk about the common misconceptions! P-values are NOT the probability that either of the hypotheses is correct!

P-values are the probability of obtaining the observed results, or more extreme, if the null hypothesis is correct.


January 9, 2020 at 2:20 am

Hi Jim, thanks for the prompt reply. I have a fair understanding now. Please tell me if I am wrong in saying that for a statistically significant result, if my null hypothesis is true, I would expect the measure under consideration to be at least as large as the one observed in my study.

I came to this conclusion by comparing my p-value with alpha. If the p-value lies in the critical region, we reject the null hypothesis, and vice versa. Now that you have stated that for a single study we can’t say the false positive error rate is alpha, how are we comparing alpha and the p-value to draw conclusions?

January 9, 2020 at 2:12 pm

Hi again Himani!

If you have read it already, read my post about p-values and significance levels . I think that will answer many of your questions.

A statistically significant result indicates that IF the null hypothesis is true, you’d be unlikely to obtain the results that your study actually obtained. Statistical significance and p-values relate to the probability of obtaining your observed data IF the null is true. Always note that the probability is based on the assumption that the null is true.

You can think of the significance level as an evidentiary standard. It describes how strong the evidence must be for you to be able to conclude using your sample that an effect exists in the population. The strength of the evidence is defined in terms of how probable is your observed data if the null is true.

The p-value represents the strength of your sample evidence against the null. Lower p-values represent stronger evidence. Like the significance level, the p-value is stated in terms of the likelihood of your sample evidence if the null is true. For example, a p-value of 0.03 indicates that the sample effect you observe, or more extreme, had a 3% chance of occurring if the null is true.

So, the significance level indicates how strong the evidence must be while the p-value indicates how strong your sample evidence actually is. If your sample evidence is stronger than the evidentiary standard, you can conclude that the effect exists in the population. In other words, when the p-value is less than or equal to the significance level, you have statistically significant results, you can reject the null, and conclude that the effect exists in the population.
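To make that comparison concrete, here is a trivial illustrative snippet (mine, not from the reply; the alpha and p-value are just example numbers). The decision rule reduces to a single inequality:

```python
alpha = 0.05                    # evidentiary standard, chosen before the study
p_value = 0.03                  # strength of the observed sample evidence
reject_null = p_value <= alpha  # True -> statistically significant
print(reject_null)
```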

Please do read the other post if you haven’t already because I show how this works graphically and I think it’s easier to understand in that format!

January 8, 2020 at 3:16 pm

Hi Jim, your blog has been a great help. It would be great if you could explain a bit further how alpha (the false positive rate) differs from the false positive rate of 0.23 you mention in the post, and the role of simulation in this case.

Big help! Thank you

January 8, 2020 at 3:48 pm

Thanks for writing with the excellent question. I can see how these two errors sound kind of similar, but they’re actually very different!

The Type I error rate is the probability of rejecting the null hypothesis when it is actually true. It is a probability that applies to a class of studies. For an alpha of 0.05, it applies to studies with that alpha level and to studies where the null is true. You can say that 5% of all studies that have a true null will have statistically significant results when alpha = 0.05. However, you cannot apply that probability to a single study. For example, for a statistically significant study at the alpha = 0.05 level, you CANNOT state that there is a 5% chance that the null is true. You cannot obtain the probability for a single study using alpha, p-values, etc., with Frequentist methodologies. The reason you can’t apply it to a single study is that you don’t know whether the null is true or false, and the Type I error rate only applies when the null is true.

The error rates based on the simulation studies and Bayesian methodology can be applied to individual studies, at least in theory. However, to get precise probabilities you’ll need information that you often won’t have. Using these methodologies, you can take the p-value of an individual study and estimate the probability that the particular study is a false positive. However, I don’t want you to get too wrapped up in mapping p-values to false positive rates. You’ll need to know the prior probability, which is often unknown. However, the gist is that the common misinterpretation of p-values underestimates the chance of a false positive. Also, a p-value near 0.05 often represents weak evidence even though it is statistically significant.
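As a rough sketch of that last point (my own simplified textbook-style calculation, with an assumed power of 0.8; these are not Jim’s numbers), the share of significant results that are false positives depends heavily on the prior probability that a real effect exists:

```python
def false_discovery_rate(prior, alpha=0.05, power=0.8):
    """P(null is true | significant result) across a class of studies."""
    false_positives = alpha * (1 - prior)   # null true, yet significant
    true_positives = power * prior          # effect real and detected
    return false_positives / (false_positives + true_positives)

for prior in (0.5, 0.25, 0.1):
    print(prior, round(false_discovery_rate(prior), 2))
# 0.5 -> 0.06, 0.25 -> 0.16, 0.1 -> 0.36: the weaker the prior,
# the more of your significant results are false positives,
# regardless of alpha.
```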

I hope this clarifies matters!

' src=

November 18, 2019 at 12:21 pm

Thanks for the helpful posts. I have been browsing your blog for some time now, and I have gained a lot.

One quick question:

What happens if the null hypothesis is rejected based on the t-statistic, but we can’t do so by looking at the p-value?

I know one is derived from the other statistic. But which one should we look at first in order to say something about the null hypothesis: the t-statistic or the p-value in the t-test?

The same applies to ANOVA as well.

Which one do we look at first: whether Significance F is less than the F statistic, or the p-value alone?

November 18, 2019 at 3:24 pm

You can either reject the null hypothesis by determining whether the test statistic (t, F, chi-square, etc.) falls into the critical region or by comparing the p-value to the significance level. These two methods will always agree. If the test statistic falls within the critical region, then the p-value is less than or equal to the significance level.

Because the two methods are 100% consistent, you can use either one to evaluate statistical significance. You don’t need to use both methods, except maybe when you’re learning about how it all works. Personally, I find it easiest just to look at the p-value.
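For anyone who wants to verify that equivalence numerically, here is a minimal sketch using SciPy (the statistic and degrees of freedom are made-up example values, not from the exchange):

```python
from scipy import stats

t_stat, df, alpha = 2.3, 20, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-tailed critical value
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p-value

print(abs(t_stat) > t_crit)  # True: statistic falls in the critical region...
print(p_value <= alpha)      # ...exactly when p <= alpha
```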

To see how both methods work, read my posts about how hypothesis tests work , how t-tests work , and how the F-test works in one-way ANOVA .


November 12, 2019 at 3:50 am

Hi Jim, I have 3 p-values: 0, 2E-12, and 3.2E-316. I don’t know whether something is wrong, but how do I interpret these values?

November 12, 2019 at 9:23 am

Those p-values are written using scientific notation. Scientific notation is a convenient way to represent very large and very small numbers. In your case, these represent very small p-values.

The number after the E specifies the direction and number of places to move the decimal point. For example, the negative 12 value in “E-12” indicates you need to move the decimal point 12 places to the left. On the other hand, positive values indicate that you need to shift the decimal point to the right.

These values are smaller than any reasonable significance level and, therefore, represent statistically significant results. You can reject the null hypothesis for whichever hypothesis test you are performing.


November 11, 2019 at 7:44 am

You are good, Jim. You are the best!


October 25, 2019 at 9:46 pm

Jim, thank you so, so much for your patience and help over the past week. I think I can finally say that I get it. Not easy to keep everything straight, but your simple breakdown in your most recent post really helped to clear everything up. Even though I previously read about p-values and Type I errors from your other blog posts, I guess I needed to re-hear/re-think those tricky concepts in a variety of different ways to finally absorb them. I finally feel comfortable enough to share these cool insights with my research peers, and I’ll point them to your blog for extra stats goodies!

Thank you so much, again. I’m slowly making my way through your blog (trying to balance grad school at the same time); I look forward to your other posts!

aloha trent

P.S. Please do email me about the notification issue, I don’t believe I received an email from you yet. Your blog has really helped me get a better grasp of stats (I found your blog from your chocolate vs mustard analogy for interaction analyses, that was brilliant!), and so I’d be more than happy to help with the notification issue in any way I can.

October 25, 2019 at 10:55 pm

You’re very welcome! P-values are a very confusing concept. Somewhere in one of my posts, I have a link to an external article that shows how even scientists have a hard time describing what they are! They’re not intuitive. And, when you conduct a study, p-values really aren’t exactly what you want them to be. You want them to be the probability that the null is true. That would be the most helpful. Unfortunately, they’re not that–and they can’t be that. I’m not sure if you read it, but I’ve written a post about why p-values are so easy to misunderstand .

Despite these difficulties, p-values provide valuable information. In fact, as I write in an article, there’s a relationship between p-values and the reproducibility of studies .

Just a couple more p-value posts to read if you’re so interested! If you haven’t already.

Best of luck with grad school! I’m sure you’ll do great!

By the way, I did email you. If you haven’t received it, that’s odd! I will try again from a different email address over the weekend.


October 25, 2019 at 10:21 am

Jim… I cannot explain how many videos I have watched and articles I have read to try and understand this and you just cleared it all up with this. Saved my life. Thank you, thank you, thank you.

October 25, 2019 at 2:02 pm

You’re very welcome! Presenting statistics in a clear manner is my goal for the website. So, it makes my day when I hear that my articles have actually helped people! Thanks for writing!

October 23, 2019 at 1:33 am

Hi Jim, thank you so much for your reply! I’m sorry I wasn’t able to check back in until now. It seems that I still haven’t been able to connect the final pieces of the puzzle, based on your response to: “Thus, for a sample statistic assessed by a large group of similar studies, a P<0.05 would translate to a Type I error rate of <5%."

This is where I'm getting stuck: Prior to a study, researchers typically set their significance level (alpha level) to 0.05. Researchers will then compare their p-value to the alpha level of 0.05 to determine if their results are statistically significant. If P<0.05, then the results are statistically significant at an alpha level of 0.05, which by extension means that the results have a 5% or lower probability of being a false positive (since the alpha level was set to 0.05, and alpha level = probability of a false positive), right? If this is all true, then a P<0.05 for a study with a significance level of 0.05 does not have a false positive probability of 23% (and typically close to 50%)… it has a 5% or lower probability of being a false positive.

That said, based on your article, I know I'm messing up my logic somewhere, but I can't figure out where…

P.S. I double checked my gmail spam & trash folders and there were no notification emails of any of your replies.

October 23, 2019 at 3:20 pm

I’m going to send you an email soon about the notification issue. So, be on the lookout for that.

I think part of the confusion is over the issue of single studies versus a group of studies. Or, relatedly, a single p-value versus a range of p-values. Alpha is a range of p-values and applies to a group of studies. All studies (the group) that have p-values less than or equal to 0.05 (range of p-values) have a Type I error rate of 0.05. That error rate applies to the groups of studies. You can’t apply it to a single study (i.e., a single p-value).

A single p-value for a single study is not that type of error rate at all. It represents the probability of obtaining your sample if the null is true. In other words, the p-value calculations begin with the assumption that the null is true. Therefore, you can not use the p-value to determine the probability that the null (or alternative) hypothesis is true. In other words, you can’t map p-values to the false positive rate.

So, when you say “If P<0.05, then the results are statistically significant at an alpha level of 0.05, which by extension means that the results have a 5% or lower probability of being a false positive (since the alpha level was set to 0.05, and alpha level = probability of a false positive)," that's not true. For one thing, the p-value assumes the null *is* true. For another, the group of studies as a whole has an error rate of 0.05, but you don't know the error rate for an individual study. Additionally, you just don't know whether the null is true or false. The error rate only applies to studies where the null is true. And, the p-value calculations assume the null is true. But, you don't know for sure whether it is true or not for any given study.

Let's go back to what I said about the p-values being the "devil's advocate" argument. For any treatment effect that you observe in sample data, you can make the argument that the effect is simply random sampling error rather than a true effect. The p-value essentially says, "OK, let's assume the null is true. How likely was it for us to observe these results in that case?" If the probability is low, you were unlikely to obtain that sample if the null is true. It pokes a hole in the devil's advocate argument. It's important to remember that p-values are a probability related to obtaining your data assuming the null is true and *not* a probability that the null is true. You're trying to equate p-values to the probability of the null being true--which is not possible with the Frequentist approach.

October 18, 2019 at 5:05 pm

Thank you for your reply. The two other articles you linked were really helpful. I think I’m almost there with understanding the whole picture. May I clarify my current understanding with you?

Alpha applies to a group of similar studies, thus we can’t directly translate the p-value of a single study to the Type I error rate for a given hypothesis. However, using simulation studies or Bayesian methods, we can estimate the Type I error rate–from a single study–for a P=0.05 sample statistic to 23% (and typically close to 50%).

That said, in order to estimate the Type I error rate directly using alpha (and P-values), we need to see the results from a group of similar studies (ie meta-analysis). Thus, for a sample statistic assessed by a large group of similar studies, a P<0.05 would translate to a Type I error rate of <5%.

How did I do?

P.S. I'm unsure how the "Notify me of new comments via email" function is supposed to work on your blog, but it didn't notify me via email of your reply. So I had no idea that you replied to my comment until I checked back on this post.

October 18, 2019 at 10:43 pm

I’m glad the other articles were helpful! There’s actually quite a bit to take in to understand p-values. It’s possible to come up with a brief definition, but it implies a thorough knowledge of underlying concepts! I will look into the Notify function. It should email you. I’ll hunt around in the settings to be sure, but I believe it is set up to send emails. Is there a chance it went to your junk folder?

Yes! That’s very close! Just a couple of minor quibbles and clarifications. I wouldn’t say that you use simulation and Bayesian methods to estimate the Type I error rate. That’s specific to the hypothesis testing framework, and it applies to a group of similar studies. Alpha = the Type I error rate, and both apply to a group of studies.

Simulation studies and Bayesian methods can help you take a P-value from an individual study and estimate the probability of a false discovery (or false positive). P-values relate to individual studies and the probability of a false positive applies to that individual study. So, we’ve moved from probabilities for a group of studies (Alpha/Type I error) to probabilities of false positive for an individual study. To make that shift from a group to an individual study, we must switch methodologies because the Frequentist method cannot calculate the false discovery rate for a single study.

An important note: for simulation studies or Bayesian methodology to estimate the false discovery rate, you need additional information beyond just the sample data. You need an estimate of the probability that the alternative hypothesis is true at the beginning of the study. This is known as the prior probability in Bayesian circles. To develop this probability, you already need to know and incorporate external information into the calculations. This information can come from a group of similar studies, as you mention. This probability, along with the P-value, affects the false discovery rate. That’s why there is a range of values for any given P-value. There is no direct, fixed mapping of p-values to the false discovery rate. A criticism of the prior probability is that it is being estimated. Presumably, the researchers are performing a study because they’re not sure if the alternative is true or not.

It’s not clear to me what you mean in your sentence, “Thus, for a sample statistic assessed by a large group of similar studies, a P<0.05 would translate to a Type I error rate of <5%." I'll assume you're referring to a p-value from a meta-analysis. In that case, it still depends on the prior probability. If the prior probability is very high, the false discovery rate will be low. Conversely, if the prior probability is low, the false discovery rate will be higher. You can't state a general rule like the one in your sentence.

Thanks for writing with the interesting questions!

October 15, 2019 at 7:37 pm

Hi Jim, wonderful post! A lot to chew on. May I clarify a point of confusion?

I’ve been taught that alpha is the probability of committing a Type I error. In addition, studies typically set alpha to 0.05, and beta to 0.20 (giving a power of 0.8). Based on your article, this must be false. A true statement should read:

“Studies typically set the P-VALUE cut-off to 0.05, and beta to 0.20 (giving a power of 0.8).”

Logically following, this means that alpha is generally not set to anything. And for a study with a p-value cut-off of 0.05, the alpha would actually be about 0.23 (and typically close to 0.50).

Is my understanding, correct?

October 16, 2019 at 4:03 pm

It’s correct that alpha (aka the significance level) represents the probability of a Type I error. Hypothesis tests are designed so that researchers can set that value. However, it’s not possible to set beta. You can estimate beta using a power analysis. Power is just 1 − beta. However, power analyses are estimates and not something you’re technically setting like you do with alpha. I write more about this in my post about Type I and Type II errors.

I definitely understand your confusion regarding p-values and alpha. The important thing to keep in mind is that alpha really applies to a class of studies. Of all studies that use an alpha of 0.05 and the null is true, you’d expect to obtain significant results (i.e., a false positive) in 5% of those cases.

P-values represent the strength of the evidence against the null for an individual study. You can state it as being the probability of obtaining the observed outcome, or more extreme, if the null is true. However, you can’t state that it is the probability of the null being true. It’s the probability of the outcome if you assume the null is true (which you don’t really know for sure). Not the probability of whether the null is true.

I think based on what you write, you might be confusing that issue (re: alpha actually being 0.23). Both P-values and alpha relate to cases where the null is true–which you don’t know. The false positive error rates which I think you’re getting at, and I write about at the end, are dealing with the probability of the null being true. In the former, you’re assuming the null is true while in the latter you’re calculating the probability of whether it is true. Using the Frequentist approach (p-values, alpha) you cannot calculate the probability of the null being true. However, you can do that using simulation studies and sometimes using Bayesian methods.

I always think this is a bit easier to understand using graphs and so highly recommend reading my post about p-values and the significance level , which primarily uses graphs.


May 27, 2019 at 6:02 am

Thank you. You give me good insight


May 2, 2019 at 12:54 pm

Awesome read! How would sample size affect the True Error rate? I would assume since p-values tend to become smaller as sample size increases, that would also effectively reduce the True Error rate since you are more confident about the population (assuming True Error means type I and type II errors).

May 3, 2019 at 1:57 am

Hi David, Thanks, and I’m glad you enjoyed the article!

There are two types of errors in hypothesis testing. So, let’s see how changing the sample size affects them. You might want to read my article about Type I and Type II Errors in Hypothesis Testing .

There are three basic components for calculating p-values: the effect size, the variability in the data, and the sample size. For the sake of discussion, let’s hold the effect size and the variability constant and just increase the sample size. In that case, you would expect the p-values to decrease. Frequentists will cringe at this, but lower p-values are associated with lower false discovery rates (Type I errors). Additionally, increasing the sample size while holding the other two factors constant will increase the power of your test. Power is just (1 − Type II error rate). So, you’d expect the Type II errors (false negatives) to decrease. Increasing the sample size is good all around because it lowers both types of error *for a single study*! I explain that qualifier below!

However, a couple of important caveats for the above. Of course, as I point out in this article, you can’t calculate any error rates from the p-value using the frequentist approach. There’s no direct mapping from p-values to an error rate. You can use simulation studies and the Bayesian approach to estimate the false positive rate from the p-value. However, this requires an estimate of the a priori probability that the alternative hypothesis is correct. That information might be hard to obtain. After all, you’re conducting the study because you don’t know. Additionally, it’s always difficult to calculate the type II error rate. So, while you can say that increasing the sample should reduce both type I and type II errors, you don’t really know what they are! By the way, in a related vein, you might want to read how P-values correlate with the reproducibility of scientific studies .

Let’s return to Frequentist approach because there’s another side of things that isn’t obvious. In contrast with the earlier example for an individual study, the Frequentist approach talks about the Type I errors not for an individual study but for a class of studies that use the same significance level. A result is statistically significant when the p-value is less than the significance level. The significance level equals the Type I Error for all studies that use a particular significance level. For example, 5% of all studies that use a significance level of 0.05 should be false positives. Of course, when you see significant test results, you don’t know for sure which ones are real effects and which ones are false discoveries.

Let’s now hold the other two factors constant but reduce the sample size. Let’s reduce it enough so that you have low power for detecting an effect. As your statistical power decreases, your test is less likely to detect real effects when they exist (the Type II error rate increases). However, the hypothesis test controls, or holds constant, the Type I error rate at your significance level. That’s built into the test. If you have a low-power hypothesis test, the test’s ability to detect a real effect is low, but its false positive rate remains the same. Consequently, when you obtain statistically significant results from a test with low power, you need to be wary because they are relatively likely to be false positives and less likely to represent a real effect.
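A rough simulation of those trade-offs (my own sketch, with an assumed effect size of 0.3 and SD of 1, not numbers from the reply) shows power rising with sample size while alpha stays fixed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
effect, sd, sims = 0.3, 1.0, 2000

for n in (20, 80, 320):                      # per-group sample sizes
    pvals = [stats.ttest_ind(rng.normal(effect, sd, n),
                             rng.normal(0.0, sd, n)).pvalue
             for _ in range(sims)]
    power = np.mean(np.array(pvals) < 0.05)  # share of real effects detected
    print(n, round(power, 2))
# Power rises toward 1 as n grows, so Type II errors (misses) shrink,
# while the Type I error rate stays pinned at the 0.05 significance level.
```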

That’s probably more than what you wanted, but it’s a fascinating topic!


October 20, 2018 at 2:02 am

Dear Jim, thank you very much for your posts!

Does it mean that after I have obtained some small p-value, I have to do some other tests?

October 21, 2018 at 1:06 am

Hi Tetyana,

After you obtain a small p-value, you can reject the null hypothesis. You don’t necessarily need to perform other tests. I just want analysts to avoid a common misinterpretation. Obtaining a statistically significant result is still a good thing, but you have to keep in mind what it really represents.



October 11, 2018 at 11:23 am

Thank you very much. That is reassuring. Appreciated. How could I report this result in a scientific manuscript?

October 11, 2018 at 2:12 pm

I think it’s perfectly acceptable to report such a small p-value using the scientific notation that is in your output. The other option would be to report it as a regular value by moving the decimal point 16 places to the left, but that takes up so much more room. So, I’d use scientific notation. It’s there to save space for extremely small and large values depending on the context.

October 10, 2018 at 2:40 pm

Hi Jim. Thanks for this valuable post. If you can help me with this: I got the result (6.79974E-16). What does that mean? Appreciated.

October 10, 2018 at 3:06 pm

That is called scientific notation. The E-16 in it indicates that you need to move the decimal point 16 digits to the left. That’s a very small value. Therefore, you have a very significant p-value!


June 14, 2018 at 5:27 pm

What an awesome post! Should be required reading for all STEM students.

June 14, 2018 at 11:35 pm

Thanks, Pamela. That means a lot to me!


April 15, 2018 at 3:47 pm

Thanks, Jim, for your response. I think I got it.

April 15, 2018 at 9:09 am

Thanks for the post. I’m a little confused by the statement below: “If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.”

Now, as per the definition, “P-values indicate the believability of the devil’s advocate case that the null hypothesis is true given the sample data.”

So doesn’t that mean a higher p-value supports the alternate hypothesis, since there is a higher probability of the alternate happening when the null is true? I’m not able to get my head wrapped around this concept.

April 15, 2018 at 1:28 pm

Great question! So, the first thing to realize is that the null and alternative hypotheses are mutually exclusive. If the probability of the alternative being true is higher, then the probability of the null must be lower.

However, the p-value doesn’t indicate the probability of either hypothesis being true. This is a very common misconception. Anytime you start linking p-values to the probability that a hypothesis is true, you know you’re going in the wrong direction!

P-values represent the probability of obtaining the effect observed in your sample, or more extreme, if the null hypothesis is true. It’s a probability of obtaining your data assuming the null is true. Consequently, a low p-value indicates that you were unlikely to obtain the sample data that was collected if the null is true. In this manner, lower p-values represent stronger evidence against the null hypothesis. Lower p-values indicate that your data are less compatible with the null hypothesis.

I think this is easier to understand graphically. I have a link in this post to another post How Hypothesis Tests Work: Significance Levels and P-values. This post shows how it works with graphs. I’d recommend taking a look at it.


February 6, 2018 at 2:38 am

Hello sir, hope you are fine. I have no words; you have cleared up a lot of statistics concepts for me, and I am really happy. Whatever you are uploading is awesome.

February 6, 2018 at 10:05 am

Hi Khursheed, I’m so happy to hear that you found this post to be helpful. Thanks for the encouraging words. They mean a lot to me!


July 12, 2017 at 9:22 am

What should be the nature of the relationship of p values (especially Bonferroni corrected) with the Cohen’s d values for the same set of data?


April 19, 2017 at 8:38 am

Jim, thanks for this post, but perhaps you could clarify something for me: assuming that H0 is true, if we set an alpha=0.05 level of significance and get a p-value less than that as the result of our sample data, wouldn’t that indicate, since less than 5% of samples would have such an effect due to random sample error, that there is only a 5% chance of getting such a sample, and thus, a 5% chance of rejecting the null hypothesis incorrectly? What am I missing here? Almost every stats book I’ve ever read has presented the concept this way (a type 1 error is even called an alpha-error!) Thanks for your feedback!

April 19, 2017 at 10:56 am

Hi Sean, thanks for your comment. Yes, you’re absolutely correct. The significance level (alpha) is the type I error rate. It’s the probability that you will reject the null hypothesis when it is true. However, the p-value is not an error rate. It’s a bit confusing because you compare one to the other.

In the post above, I provide a link to a post where I explain significance levels and p-values using graphs. I think it’s much easier to understand that way. I’ll explain below, but check that post out too.

Both alpha and p-values refer to regions on a probability distribution plot. You need an area under the curve to calculate probabilities. You can calculate probabilities for regions, but not a specific value.

That works fine for alpha. If the null is true, you expect sample values to fall in the critical regions X% of times based on the significance level that you specify. For p-values, the problems occur when you want to know the error rate for your specific study. You can’t do that for a single value from an individual study because you need an area under the curve.

The best you can say for p-values is: if the null is true, then you’d expect X% of studies to have an effect at least as large as the one in your study, where X = your p-value. Notice the “at least as large.” That’s needed to produce the range of values for an area under the curve. It also means you can’t apply the percentage to your specific study. You can apply it only to the entire range of theoretical studies that have an effect at least as large as yours. That range collectively has an error rate that equals the p-value, but not your study alone.

Another thing to consider is that, within the range defined by the p-value, your study provides the weakest results because it defines the point closest to the null. So, the overall error rate for the range is largely based on theoretical studies that provide stronger evidence than your actual study!

In a similar fashion, if you reject the null for your study using an alpha = 0.05, you know that all studies in the critical region have a Type I error rate = 0.05. Again, this applies to the entire range of studies and not yours alone.

I hope this all makes sense. Again, read the other post and it’s easier to see with graphs.


P-Value And Statistical Significance: What It Is & Why It Matters

Saul McLeod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul McLeod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.


The p-value in statistics quantifies the evidence against a null hypothesis. A low p-value suggests data is inconsistent with the null, potentially favoring an alternative hypothesis. Common significance thresholds are 0.05 or 0.01.


Hypothesis testing

When you perform a statistical test, a p-value helps you determine the significance of your results in relation to the null hypothesis.

The null hypothesis (H0) states no relationship exists between the two variables being studied (one variable does not affect the other). It states the results are due to chance and are not significant in supporting the idea being investigated. Thus, the null hypothesis assumes that whatever you try to prove did not happen.

The alternative hypothesis (Ha or H1) is the one you would believe if the null hypothesis is concluded to be untrue.

The alternative hypothesis states that the independent variable affected the dependent variable, and the results are significant in supporting the theory being investigated (i.e., the results are not due to random chance).

What a p-value tells you

A p-value, or probability value, is a number describing how likely it is that your data would have occurred by random chance alone (that is, assuming the null hypothesis is true).

The level of statistical significance is often expressed as a p-value between 0 and 1.

The smaller the p-value, the less likely the results occurred by random chance, and the stronger the evidence that you should reject the null hypothesis.

Remember, a p-value doesn’t tell you if the null hypothesis is true or false. It just tells you how likely you’d see the data you observed (or more extreme data) if the null hypothesis was true. It’s a piece of evidence, not a definitive proof.

Example: Test Statistic and p-Value

Suppose you’re conducting a study to determine whether a new drug has an effect on pain relief compared to a placebo. If the new drug has no impact, your test statistic will be close to the one predicted by the null hypothesis (no difference between the drug and placebo groups), and the resulting p-value will be close to 1. It may not be precisely 1 because real-world variations may exist. Conversely, if the new drug indeed reduces pain significantly, your test statistic will diverge further from what’s expected under the null hypothesis, and the p-value will decrease. The p-value will never reach zero because there’s always a slim possibility, though highly improbable, that the observed results occurred by random chance.
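A hedged illustration of this example using SciPy’s two-sample t-test (simulated data with an assumed effect; not results from a real trial):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
placebo = rng.normal(5.0, 1.0, 50)  # simulated pain scores, placebo group
drug = rng.normal(4.2, 1.0, 50)     # drug assumed to lower scores slightly

result = stats.ttest_ind(drug, placebo)
print(result.statistic, result.pvalue)  # a sizable |t| pairs with a small p
```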

P-value interpretation

The significance level (alpha) is a set probability threshold (often 0.05), while the p-value is the probability you calculate based on your study or analysis.

A p-value less than or equal to your significance level (typically ≤ 0.05) is statistically significant.

A p-value less than or equal to a predetermined significance level (often 0.05 or 0.01) indicates a statistically significant result, meaning the observed data provide strong evidence against the null hypothesis.

This suggests the effect under study likely represents a real relationship rather than just random chance.

For instance, if you set α = 0.05, you would reject the null hypothesis if your p-value ≤ 0.05.

It indicates strong evidence against the null hypothesis, as there is less than a 5% probability of obtaining results at least this extreme if the null hypothesis were correct.

Therefore, we reject the null hypothesis and accept the alternative hypothesis.

Example: Statistical Significance

Upon analyzing the pain relief effects of the new drug compared to the placebo, the computed p-value is less than 0.01, which falls well below the predetermined alpha value of 0.05. Consequently, you conclude that there is a statistically significant difference in pain relief between the new drug and the placebo.

What does a p-value of 0.001 mean?

A p-value of 0.001 is highly statistically significant beyond the commonly used 0.05 threshold. It indicates strong evidence of a real effect or difference, rather than just random variation.

Specifically, a p-value of 0.001 means there is only a 0.1% chance of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is correct.

Such a small p-value provides strong evidence against the null hypothesis, leading to rejecting the null in favor of the alternative hypothesis.

A p-value greater than the significance level (typically p > 0.05) is not statistically significant and indicates that the evidence against the null hypothesis is insufficient.

This means we retain (fail to reject) the null hypothesis. You should note that you cannot accept the null hypothesis; we can only reject it or fail to reject it.

Note : when the p-value is above your threshold of significance,  it does not mean that there is a 95% probability that the alternative hypothesis is true.

One-Tailed Test


Two-Tailed Test


How do you calculate the p-value?

Most statistical software packages like R, SPSS, and others automatically calculate your p-value. This is the easiest and most common way.

Online resources and tables are available to estimate the p-value based on your test statistic and degrees of freedom.

These tables help you understand how often you would expect to see your test statistic under the null hypothesis.
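For example, here is a minimal sketch of recovering a two-tailed p-value from a test statistic in code rather than from a printed table (the statistic and degrees of freedom are made-up inputs):

```python
from scipy import stats

t_stat, df = 2.5, 30                           # made-up example inputs
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df) # area in both tails
print(round(p_two_tailed, 4))                  # ~0.018 for these inputs
```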

Understanding the Statistical Test:

Different statistical tests are designed to answer specific research questions or hypotheses. Each test has its own underlying assumptions and characteristics.

For example, you might use a t-test to compare means, a chi-squared test for categorical data, or a correlation test to measure the strength of a relationship between variables.

Be aware that the number of independent variables you include in your analysis can influence the magnitude of the test statistic needed to produce the same p-value.

This factor is particularly important to consider when comparing results across different analyses.

Example: Choosing a Statistical Test

If you’re comparing the effectiveness of just two different drugs in pain relief, a two-sample t-test is a suitable choice for comparing these two groups. However, when you’re examining the impact of three or more drugs, it’s more appropriate to employ an Analysis of Variance ( ANOVA) . Utilizing multiple pairwise comparisons in such cases can lead to artificially low p-values and an overestimation of the significance of differences between the drug groups.
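Here is a small sketch of that guidance with simulated pain-relief scores (illustrative data only; the group means are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
drug_a = rng.normal(4.0, 1.0, 40)   # simulated pain-relief scores
drug_b = rng.normal(4.5, 1.0, 40)
drug_c = rng.normal(5.0, 1.0, 40)

print(stats.ttest_ind(drug_a, drug_b))         # two groups: t-test
print(stats.f_oneway(drug_a, drug_b, drug_c))  # three groups: one ANOVA,
                                               # not three pairwise t-tests
```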

How to report

A statistically significant result cannot prove that a research hypothesis is correct (which implies 100% certainty).

Instead, we may state our results “provide support for” or “give evidence for” our research hypothesis (as there is still a slight probability that the results occurred by chance and the null hypothesis was correct – e.g., less than 5%).

Example: Reporting the results

In our comparison of the pain relief effects of the new drug and the placebo, we observed that participants in the drug group experienced a significant reduction in pain ( M = 3.5; SD = 0.8) compared to those in the placebo group ( M = 5.2; SD  = 0.7), resulting in an average difference of 1.7 points on the pain scale (t(98) = -9.36; p < 0.001).

The 6th edition of the APA style manual (American Psychological Association, 2010) states the following on the topic of reporting p-values:

“When reporting p values, report exact p values (e.g., p = .031) to two or three decimal places. However, report p values less than .001 as p < .001.

The tradition of reporting p values in the form p < .10, p < .05, p < .01, and so forth, was appropriate in a time when only limited tables of critical values were available.” (p. 114)

  • Do not use a 0 before the decimal point for the statistical value p, as it cannot be greater than 1. In other words, write p = .001 instead of p = 0.001.
  • Please pay attention to issues of italics ( p is always italicized) and spacing (either side of the = sign).
  • p = .000 (as outputted by some statistical packages such as SPSS) is impossible and should be written as p < .001.
  • The opposite of significant is “nonsignificant,” not “insignificant.”

Why is the p-value not enough?

A lower p-value  is sometimes interpreted as meaning there is a stronger relationship between two variables.

However, statistical significance only means that the observed data would be unlikely (less than a 5% chance) if the null hypothesis were true; it says nothing about the size of the effect.

To understand the strength of the difference between the two groups (control vs. experimental) a researcher needs to calculate the effect size .

When do you reject the null hypothesis?

In statistical hypothesis testing, you reject the null hypothesis when the p-value is less than or equal to the significance level (α) you set before conducting your test. The significance level is the probability of rejecting the null hypothesis when it is true. Commonly used significance levels are 0.01, 0.05, and 0.10.

Remember, rejecting the null hypothesis doesn’t prove the alternative hypothesis; it just suggests that the alternative hypothesis may be plausible given the observed data.

The p -value is conditional upon the null hypothesis being true but is unrelated to the truth or falsity of the alternative hypothesis.

What does a p-value of 0.05 mean?

If your p-value is less than or equal to 0.05 (the significance level), you would conclude that your result is statistically significant. This means the evidence is strong enough to reject the null hypothesis in favor of the alternative hypothesis.

Are all p-values below 0.05 considered statistically significant?

No, not all p-values below 0.05 are considered statistically significant. The threshold of 0.05 is commonly used, but it’s just a convention. Statistical significance depends on factors like the study design, sample size, and the magnitude of the observed effect.

A p-value below 0.05 means there is evidence against the null hypothesis, suggesting a real effect. However, it’s essential to consider the context and other factors when interpreting results.

Researchers also look at effect size and confidence intervals to determine the practical significance and reliability of findings.

How does sample size affect the interpretation of p-values?

Sample size can impact the interpretation of p-values. A larger sample size provides more reliable and precise estimates of the population, leading to narrower confidence intervals.

With a larger sample, even small differences between groups or effects can become statistically significant, yielding lower p-values. In contrast, smaller sample sizes may not have enough statistical power to detect smaller effects, resulting in higher p-values.

Therefore, a larger sample size increases the chances of finding statistically significant results when there is a genuine effect, making the findings more trustworthy and robust.

Can a non-significant p-value indicate that there is no effect or difference in the data?

No, a non-significant p-value does not necessarily indicate that there is no effect or difference in the data. It means that the observed data do not provide strong enough evidence to reject the null hypothesis.

There could still be a real effect or difference, but it might be smaller or more variable than the study was able to detect.

Other factors like sample size, study design, and measurement precision can influence the p-value. It’s important to consider the entire body of evidence and not rely solely on p-values when interpreting research findings.

Can P values be exactly zero?

While a p-value can be extremely small, it cannot be exactly zero. When a p-value is reported as p = 0.000, the actual p-value is simply too small for the software to display. This is often interpreted as strong evidence against the null hypothesis. Report p values less than 0.001 as p < .001.

Further Information

  • P Value Calculator From T Score
  • P-Value Calculator For Chi-Square
  • P-values and significance tests (Khan Academy)
  • Hypothesis testing and p-values (Khan Academy)
  • Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond "p < 0.05".
  • Criticism of using the p < 0.05 threshold.
  • Publication manual of the American Psychological Association
  • Statistics for Psychology Book Download




p-value Calculator

Contents:

  • What is the p-value?
  • How do I calculate the p-value from a test statistic?
  • How to interpret the p-value
  • How to use the p-value calculator to find the p-value from a test statistic
  • How do I find the p-value from a Z-score?
  • How do I find the p-value from t?
  • P-value from a chi-square score (χ² score)
  • P-value from an F-score

Welcome to our p-value calculator! You will never again have to wonder how to find the p-value, as here you can determine the one-sided and two-sided p-values from test statistics, following all the most popular distributions: normal, t-Student, chi-squared, and Snedecor's F.

P-values appear all over science, yet many people find the concept a bit intimidating. Don't worry – in this article, we will explain not only what the p-value is but also how to interpret p-values correctly . Have you ever been curious about how to calculate the p-value by hand? We provide you with all the necessary formulae as well!

🙋 If you want to revise some basics from statistics, our normal distribution calculator is an excellent place to start.

Formally, the p-value is the probability that the test statistic will produce values at least as extreme as the value it produced for your sample . It is crucial to remember that this probability is calculated under the assumption that the null hypothesis H 0 is true !

More intuitively, p-value answers the question:

Assuming that I live in a world where the null hypothesis holds, how probable is it that, for another sample, the test I'm performing will generate a value at least as extreme as the one I observed for the sample I already have?

It is the alternative hypothesis that determines what "extreme" actually means , so the p-value depends on the alternative hypothesis that you state: left-tailed, right-tailed, or two-tailed. In the formulas below, S stands for a test statistic, x for the value it produced for a given sample, and Pr(event | H₀) is the probability of an event, calculated under the assumption that H₀ is true:

Left-tailed test: p-value = Pr(S ≤ x | H₀)

Right-tailed test: p-value = Pr(S ≥ x | H₀)

Two-tailed test: p-value = 2 × min{Pr(S ≤ x | H₀), Pr(S ≥ x | H₀)}

(By min{a,b}, we denote the smaller number out of a and b.)

If the distribution of the test statistic under H₀ is symmetric about 0, then: p-value = 2 × Pr(S ≥ |x| | H₀), or, equivalently: p-value = 2 × Pr(S ≤ -|x| | H₀)

As a picture is worth a thousand words, let us illustrate these definitions. Here, we use the fact that the probability can be neatly depicted as the area under the density curve for a given distribution. We give two sets of pictures: one for a symmetric distribution and the other for a skewed (non-symmetric) distribution.

  • Symmetric case (normal distribution): the left-tailed, right-tailed, and two-tailed p-values are the corresponding tail areas under the density curve.
  • Non-symmetric case (chi-squared distribution): the same tail areas apply, but the two tails are not mirror images.

For the two-tailed p-value in the skewed case, the area of the left-hand side is equal to the area of the right-hand side.

To determine the p-value, you need to know the distribution of your test statistic under the assumption that the null hypothesis is true . Then, with the help of the cumulative distribution function ( cdf ) of this distribution, we can express the probability of the test statistics being at least as extreme as its value x for the sample:

Left-tailed test:

p-value = cdf(x) .

Right-tailed test:

p-value = 1 - cdf(x) .

Two-tailed test:

p-value = 2 × min{cdf(x), 1 - cdf(x)}.

If the distribution of the test statistic under H 0 is symmetric about 0 , then a two-sided p-value can be simplified to p-value = 2 × cdf(-|x|) , or, equivalently, as p-value = 2 - 2 × cdf(|x|) .

The probability distributions that are most widespread in hypothesis testing tend to have complicated cdf formulae, and finding the p-value by hand may not be possible. You'll likely need to resort to a computer or to a statistical table, where people have gathered approximate cdf values.
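To make this concrete, here is a minimal sketch in Python using scipy.stats (an assumed tool choice; any statistics package exposes the same cdfs). It turns a hypothetical observed test statistic into left-, right-, and two-tailed p-values exactly as in the formulas above:

```python
from scipy import stats

x = 1.8                  # hypothetical observed test statistic
dist = stats.norm(0, 1)  # distribution under H0; swap in stats.t(d),
                         # stats.chi2(d), or stats.f(d1, d2) as appropriate

p_left = dist.cdf(x)              # left-tailed:  cdf(x)
p_right = dist.sf(x)              # right-tailed: 1 - cdf(x)
p_two = 2 * min(p_left, p_right)  # two-tailed:   2 × min{cdf(x), 1 - cdf(x)}

print(p_left, p_right, p_two)
```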

Well, you now know how to calculate the p-value, but… why do you need to calculate this number in the first place? In hypothesis testing, the p-value approach is an alternative to the critical value approach . Recall that the latter requires researchers to pre-set the significance level, α, which is the probability of rejecting the null hypothesis when it is true (i.e., of a type I error ). Once you have your p-value, you just need to compare it with any given α to quickly decide whether or not to reject the null hypothesis at that significance level, α. For details, check the next section, where we explain how to interpret p-values.

As we have mentioned above, the p-value answers the following question: assuming the null hypothesis holds, how probable is it that the test would generate a value at least as extreme as the one observed for your sample?

What does that mean for you? Well, you've got two options:

  • A high p-value means that your data is highly compatible with the null hypothesis; and
  • A small p-value provides evidence against the null hypothesis , as it means that your result would be very improbable if the null hypothesis were true.

However, it may happen that the null hypothesis is true, but your sample is highly unusual! For example, imagine we studied the effect of a new drug and got a p-value of 0.03 . This means that in 3% of similar studies, random chance alone would still be able to produce the value of the test statistic that we obtained, or a value even more extreme, even if the drug had no effect at all!

The question "what is p-value" can also be answered as follows: p-value is the smallest level of significance at which the null hypothesis would be rejected. So, if you now want to make a decision on the null hypothesis at some significance level α , just compare your p-value with α :

  • If p-value ≤ α , then you reject the null hypothesis and accept the alternative hypothesis; and
  • If p-value > α , then you don't have enough evidence to reject the null hypothesis.

Obviously, the fate of the null hypothesis depends on α . For instance, if the p-value was 0.03 , we would reject the null hypothesis at a significance level of 0.05 , but not at a level of 0.01 . That's why the significance level should be stated in advance and not adapted conveniently after the p-value has been established! A significance level of 0.05 is the most common value, but there's nothing magical about it, and too strong a faith in the 0.05 threshold can lead to poor decisions. It's always best to report the p-value and allow the reader to draw their own conclusions.

Also, bear in mind that subject area expertise (and common sense) is crucial. Otherwise, by mindlessly applying statistical principles, you can easily arrive at statistically significant results, despite the conclusion being 100% untrue.

As our p-value calculator is here at your service, you no longer need to wonder how to find p-value from all those complicated test statistics! Here are the steps you need to follow:

  • Pick the alternative hypothesis : two-tailed, right-tailed, or left-tailed.
  • Tell us the distribution of your test statistic under the null hypothesis: is it N(0,1), t-Student, chi-squared, or Snedecor's F? If you are unsure, check the sections below, as they are devoted to these distributions.
  • If needed, specify the degrees of freedom of the test statistic's distribution.
  • Enter the value of the test statistic computed for your data sample.
  • Our calculator determines the p-value from the test statistic and provides the decision to be made about the null hypothesis. The standard significance level is 0.05 by default.
  • Go to the advanced mode if you need to increase the precision with which the calculations are performed or change the significance level .

In terms of the cumulative distribution function (cdf) of the standard normal distribution, which is traditionally denoted by Φ , the p-value is given by the following formulae, where z is the Z-score:

Left-tailed z-test: p-value = Φ(z)

Right-tailed z-test: p-value = 1 - Φ(z)

Two-tailed z-test: p-value = 2 × Φ(-|z|), or, equivalently, p-value = 2 - 2 × Φ(|z|)
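As a quick sanity check of these formulae, here is a minimal sketch using scipy.stats.norm for Φ (the Z-score of 2.1 is hypothetical):

```python
from scipy.stats import norm

z = 2.1  # hypothetical Z-score

p_left = norm.cdf(z)           # Φ(z)
p_right = 1 - norm.cdf(z)      # 1 - Φ(z)
p_two = 2 * norm.cdf(-abs(z))  # 2 × Φ(-|z|)

print(p_left, p_right, p_two)
```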

🙋 To learn more about Z-tests, head to Omni's Z-test calculator .

We use the Z-score if the test statistic approximately follows the standard normal distribution N(0,1) . Thanks to the central limit theorem, you can count on the approximation if you have a large sample (say at least 50 data points) and treat your distribution as normal.

A Z-test most often refers to testing the population mean , or the difference between two population means, in particular between two proportions. You can also find Z-tests in maximum likelihood estimations.

The p-value from the t-score is given by the following formulae, in which cdf(t, d) stands for the cumulative distribution function of the t-Student distribution with d degrees of freedom, evaluated at the t-score t:

Left-tailed t-test: p-value = cdf(t, d)

Right-tailed t-test: p-value = 1 - cdf(t, d)

Two-tailed t-test: p-value = 2 × cdf(-|t|, d), or, equivalently, p-value = 2 - 2 × cdf(|t|, d)
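The same formulae in code, sketched with scipy.stats.t for the t-Student cdf (the t-score and degrees of freedom below are hypothetical):

```python
from scipy.stats import t

t_score, d = 2.3, 14  # hypothetical t-score and degrees of freedom

p_left = t.cdf(t_score, d)
p_right = 1 - t.cdf(t_score, d)
p_two = 2 * t.cdf(-abs(t_score), d)

print(p_left, p_right, p_two)
```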

Use the t-score option if your test statistic follows the t-Student distribution . This distribution has a shape similar to N(0,1) (bell-shaped and symmetric) but has heavier tails – the exact shape depends on the parameter called the degrees of freedom . If the number of degrees of freedom is large (>30), which generically happens for large samples, the t-Student distribution is practically indistinguishable from the normal distribution N(0,1).

The most common t-tests are those for population means with an unknown population standard deviation, or for the difference between means of two populations , with either equal or unequal yet unknown population standard deviations. There's also a t-test for paired (dependent) samples .

🙋 To get more insights into t-statistics, we recommend using our t-test calculator .

Use the χ²-score option when performing a test in which the test statistic follows the χ²-distribution .

This distribution arises if, for example, you take the sum of squared variables, each following the normal distribution N(0,1). Remember to check the number of degrees of freedom of the χ²-distribution of your test statistic!

How do you find the p-value from the χ²-score? You can do it with the help of the following formulae, in which cdf(χ², d) denotes the cumulative distribution function of the χ²-distribution with d degrees of freedom, evaluated at the χ²-score:

Left-tailed χ²-test: p-value = cdf(χ², d)

Right-tailed χ²-test: p-value = 1 - cdf(χ², d)

Remember that χ²-tests for goodness-of-fit and independence are right-tailed tests! (see below)

Two-tailed χ²-test: p-value = 2 × min{cdf(χ², d), 1 - cdf(χ², d)}

(By min{a,b} , we denote the smaller of the numbers a and b .)
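A corresponding sketch with scipy.stats.chi2 (the χ²-score and degrees of freedom below are hypothetical):

```python
from scipy.stats import chi2

chi2_score, d = 9.4, 4  # hypothetical χ² score and degrees of freedom

p_left = chi2.cdf(chi2_score, d)
p_right = chi2.sf(chi2_score, d)  # right-tailed: the usual case (see below)
p_two = 2 * min(p_left, p_right)

print(p_left, p_right, p_two)
```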

The most popular tests which lead to a χ²-score are the following:

Testing whether the variance of normally distributed data has some pre-determined value. In this case, the test statistic has the χ²-distribution with n - 1 degrees of freedom, where n is the sample size. This can be a one-tailed or two-tailed test .

Goodness-of-fit test checks whether the empirical (sample) distribution agrees with some expected probability distribution. In this case, the test statistic follows the χ²-distribution with k - 1 degrees of freedom, where k is the number of classes into which the sample is divided. This is a right-tailed test .

Independence test is used to determine if there is a statistically significant relationship between two variables. In this case, its test statistic is based on the contingency table and follows the χ²-distribution with (r - 1)(c - 1) degrees of freedom, where r is the number of rows, and c is the number of columns in this contingency table. This also is a right-tailed test .

Finally, the F-score option should be used when you perform a test in which the test statistic follows the F-distribution , also known as the Fisher–Snedecor distribution. The exact shape of an F-distribution depends on two degrees of freedom .

To see where those degrees of freedom come from, consider the independent random variables X and Y , which both follow the χ²-distributions with d₁ and d₂ degrees of freedom, respectively. In that case, the ratio (X/d₁)/(Y/d₂) follows the F-distribution with (d₁, d₂) degrees of freedom. For this reason, the two parameters d₁ and d₂ are also called the numerator and denominator degrees of freedom .

The p-value from the F-score is given by the following formulae, where we let cdf(F, d₁, d₂) denote the cumulative distribution function of the F-distribution with (d₁, d₂) degrees of freedom, evaluated at the F-score:

Left-tailed F-test: p-value = cdf(F, d₁, d₂)

Right-tailed F-test: p-value = 1 - cdf(F, d₁, d₂)

Two-tailed F-test: p-value = 2 × min{cdf(F, d₁, d₂), 1 - cdf(F, d₁, d₂)}
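And a corresponding sketch with scipy.stats.f (the F-score and the two degrees of freedom below are hypothetical):

```python
from scipy.stats import f

f_score, d1, d2 = 3.2, 2, 27  # hypothetical F score; numerator/denominator df

p_left = f.cdf(f_score, d1, d2)
p_right = f.sf(f_score, d1, d2)  # right-tailed: the usual case (see below)
p_two = 2 * min(p_left, p_right)

print(p_left, p_right, p_two)
```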

Below we list the most important tests that produce F-scores. All of them are right-tailed tests .

A test for the equality of variances in two normally distributed populations . Its test statistic follows the F-distribution with (n - 1, m - 1) -degrees of freedom, where n and m are the respective sample sizes.

ANOVA is used to test the equality of means in three or more groups that come from normally distributed populations with equal variances. We arrive at the F-distribution with (k - 1, n - k) -degrees of freedom, where k is the number of groups, and n is the total sample size (in all groups together).

A test for overall significance of regression analysis . The test statistic has an F-distribution with (k - 1, n - k) -degrees of freedom, where n is the sample size, and k is the number of variables (including the intercept).

Once the above test establishes that a linear relationship is present in your data sample, you can calculate the coefficient of determination, R², which indicates the strength of this relationship . You can do it by hand or use our coefficient of determination calculator .

A test to compare two nested regression models . The test statistic follows the F-distribution with (k 2 - k 1 , n - k 2 ) -degrees of freedom, where k 1 and k 2 are the numbers of variables in the smaller and bigger models, respectively, and n is the sample size.

You may notice that the F-test of an overall significance is a particular form of the F-test for comparing two nested models: it tests whether our model does significantly better than the model with no predictors (i.e., the intercept-only model).

Can p-value be negative?

No, the p-value cannot be negative. This is because probabilities cannot be negative, and the p-value is the probability of the test statistic satisfying certain conditions.

What does a high p-value mean?

A high p-value means that under the null hypothesis, there's a high probability that for another sample, the test statistic will generate a value at least as extreme as the one observed in the sample you already have. A high p-value doesn't allow you to reject the null hypothesis.

What does a low p-value mean?

A low p-value means that under the null hypothesis, there's little probability that for another sample, the test statistic will generate a value at least as extreme as the one observed for the sample you already have. A low p-value is evidence in favor of the alternative hypothesis – it allows you to reject the null hypothesis.


Hypothesis Testing

Key Topics:

  • Basic approach
  • Null and alternative hypothesis
  • Decision making and the p -value
  • Z-test & Nonparametric alternative

Basic approach to hypothesis testing

  • State a model describing the relationship between the explanatory variables and the outcome variable(s) in the population and the nature of the variability. State all of your assumptions .
  • Specify the null and alternative hypotheses in terms of the parameters of the model.
  • Invent a test statistic that will tend to be different under the null and alternative hypotheses.
  • Using the assumptions of step 1, find the theoretical sampling distribution of the statistic under the null hypothesis of step 2. Ideally the form of the sampling distribution should be one of the “standard distributions”(e.g. normal, t , binomial..)
  • Calculate a p -value , as the area under the sampling distribution more extreme than your statistic. Depends on the form of the alternative hypothesis.
  • Choose your acceptable type 1 error rate (alpha) and apply the decision rule : reject the null hypothesis if the p-value is less than alpha, otherwise do not reject.
  • Assume data are independently sampled from a normal distribution with unknown mean μ and known variance σ². Make an initial assumption μ₀ about the mean.
  • Null hypothesis: H₀: μ = μ₀ (or H₀: μ ≤ μ₀, or H₀: μ ≥ μ₀)
  • Alternative hypothesis: Hₐ: μ ≠ μ₀ (two-sided), or Hₐ: μ > μ₀, or Hₐ: μ < μ₀ (one-sided)
  • Test statistic: \(\frac{\bar{X}-\mu_0}{\sigma / \sqrt{n}}\)
  • The general form is: (estimate - value we are testing) / (standard deviation of the estimate)
  • The z-statistic follows a N(0,1) distribution
  • p-value: 2 × the area above |z| (two-sided), the area above z, or the area below z (one-sided); or
  • compare the statistic to a critical value: |z| ≥ zα/2, z ≥ zα, or z ≤ -zα
  • Choose an acceptable type 1 error rate (e.g., alpha = 0.05) and state the conclusion.

Making the Decision

It is either likely or unlikely that we would collect the evidence we did given the initial assumption. (Note: “likely” or “unlikely” is measured by calculating a probability!)

If it is likely , then we “ do not reject ” our initial assumption. There is not enough evidence to do otherwise.

If it is unlikely , then:

  • either our initial assumption is correct and we experienced an unusual event or,
  • our initial assumption is incorrect

In statistics, if it is unlikely, we decide to “ reject ” our initial assumption.

Example: Criminal Trial Analogy

First, state 2 hypotheses, the null hypothesis (“H 0 ”) and the alternative hypothesis (“H A ”)

  • H 0 : Defendant is not guilty.
  • H A : Defendant is guilty.

Usually the H 0 is a statement of “no effect”, or “no change”, or “chance only” about a population parameter.

While the H A , depending on the situation, is that there is a difference, trend, effect, or a relationship with respect to a population parameter.

  • It can be one-sided or two-sided.
  • In a two-sided test we only care that there is a difference, not the direction of it. In a one-sided test we care about a particular direction of the relationship: we want to know if the value is strictly larger or smaller.

Then, collect evidence, such as finger prints, blood spots, hair samples, carpet fibers, shoe prints, ransom notes, handwriting samples, etc. (In statistics, the data are the evidence.)

Next, you make your initial assumption.

  • Defendant is innocent until proven guilty.

In statistics, we always assume the null hypothesis is true .

Then, make a decision based on the available evidence.

  • If there is sufficient evidence (“beyond a reasonable doubt”), reject the null hypothesis . (Behave as if defendant is guilty.)
  • If there is not enough evidence, do not reject the null hypothesis . (Behave as if defendant is not guilty.)

If the observed outcome, e.g., a sample statistic, is surprising under the assumption that the null hypothesis is true, but more probable if the alternative is true, then this outcome is evidence against H 0 and in favor of H A .

An observed effect so large that it would rarely occur by chance is called statistically significant (i.e., not likely to happen by chance).

Using the p -value to make the decision

The p -value represents how likely we would be to observe such an extreme sample if the null hypothesis were true. The p -value is a probability, computed assuming the null hypothesis is true, that the test statistic would take a value as extreme or more extreme than that actually observed. Since it's a probability, it is a number between 0 and 1. The closer the number is to 0, the more "unlikely" the event. So if the p -value is "small" (typically, less than 0.05), we can reject the null hypothesis.

Significance level and p -value

The significance level, α, is a decisive value for the p -value. In this context, significant does not mean "important"; it means "not likely to have happened just by chance".

α is the maximum probability of rejecting the null hypothesis when the null hypothesis is true. If α = 1 we always reject the null; if α = 0 we never reject the null hypothesis. In articles, journals, etc., you may read: "The results were significant ( p <0.05)." So if p =0.03, it's significant at the level of α = 0.05 but not at the level of α = 0.01. If we reject the H₀ at the level of α = 0.05 (which corresponds to a 95% CI), we are saying that if H₀ is true, the observed phenomenon would happen no more than 5% of the time (that is 1 in 20). If we choose to compare the p -value to α = 0.01, we are insisting on stronger evidence!

Neither the decision to reject nor the decision not to reject H₀ entails proving the null hypothesis or the alternative hypothesis. We merely state there is enough evidence to behave one way or the other. This is always true in statistics!

So, what kind of error could we make? No matter what decision we make, there is always a chance we made an error.


Errors in Hypothesis Testing

Type I error (False positive): The null hypothesis is rejected when it is true.

  • α is the maximum probability of making a Type I error.

Type II error (False negative): The null hypothesis is not rejected when it is false.

  • β is the probability of making a Type II error

There is always a chance of making one of these errors. But, a good scientific study will minimize the chance of doing so!

The power of a statistical test is its probability of rejecting the null hypothesis if the null hypothesis is false. That is, power is the ability to correctly reject H 0 and detect a significant effect. In other words, power is one minus the type II error risk.

\(\text{Power} = 1 - \beta = P\left(\text{reject } H_0 \mid H_0 \text{ is false}\right)\)

Which error is worse?

Type I = you are innocent, yet accused of cheating on the test. Type II = you cheated on the test, but you are found innocent.

This depends on the context of the problem too. But in most cases scientists try to be "conservative": it's worse to make a spurious discovery than to fail to make a good one. Our goal is to increase the power of the test, that is, to minimize the length of the CI.

We need to keep in mind:

  • the effect of the sample size,
  • the correctness of the underlying assumptions about the population,
  • statistical vs. practical significance, etc…

To study the tradeoffs between the sample size, α, and the Type II error rate, we can use power and operating characteristic curves.

Assume data are independently sampled from a normal distribution with unknown mean μ and known variance σ² = 9. Make an initial assumption that μ = 65.

Specify the hypothesis: H₀: μ = 65 vs. Hₐ: μ ≠ 65

z-statistic: 3.58

The z-statistic follows a N(0,1) distribution.

The p-value, about 0.0003, indicates that, if the average height in the population is 65 inches, it is very unlikely that a sample of 54 students would have an average height of 66.4630.

Alpha = 0.05. Decision: p-value < alpha, thus we reject the null hypothesis.

Conclude that the average height is not equal to 65.
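These numbers can be reproduced in a few lines of Python; a minimal sketch using the example's values (n = 54, sample mean 66.4630, μ₀ = 65, σ = 3):

```python
import math
from scipy.stats import norm

n, x_bar, mu0, sigma = 54, 66.4630, 65, 3  # known variance σ² = 9

z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_two = 2 * norm.sf(abs(z))  # two-sided p-value

print(round(z, 2), p_two)    # z ≈ 3.58, p ≈ 0.0003
```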

What type of error might we have made?

Type I error is claiming that the average student height is not 65 inches when it really is 65. Type II error is failing to claim that the average student height is not 65 inches when it really differs from 65.

We rejected the null hypothesis, i.e., claimed that the height is not 65, thus making potentially a Type I error. But sometimes the p -value is too low because of the large sample size, and we may have statistical significance but not really practical significance! That's why most statisticians are much more comfortable with using CI than tests.

Based on the CI only, how do you know that you should reject the null hypothesis?

The 95% CI is (65.6628, 67.2631). Since the hypothesized value of 65 falls outside this interval, we reject the null hypothesis.

What about practical and statistical significance now? Is there another reason to suspect this test and the p-value calculations?

There is a need for a further generalization. What if we can't assume that σ is known? In this case we would use s (the sample standard deviation) to estimate σ.

If the sample is very large, we can treat σ as known by assuming that σ = s . According to the law of large numbers, this is not too bad a thing to do. But if the sample is small, the fact that we have to estimate both the standard deviation and the mean adds extra uncertainty to our inference. In practice this means that we need a larger multiplier for the standard error.

We need one-sample t -test.

One sample t -test

  • Assume data are independently sampled from a normal distribution with unknown mean μ and variance σ². Make an initial assumption, μ₀.
  • Null hypothesis: H₀: μ = μ₀ (or H₀: μ ≤ μ₀, or H₀: μ ≥ μ₀)
  • Alternative hypothesis: Hₐ: μ ≠ μ₀ (two-sided), or Hₐ: μ > μ₀, or Hₐ: μ < μ₀ (one-sided)
  • t-statistic: \(\frac{\bar{X}-\mu_0}{s / \sqrt{n}}\), where s is the sample standard deviation
  • The t-statistic follows a t -distribution with df = n - 1
  • Choose an acceptable alpha (e.g., 0.05) and state the conclusion. A sketch of this test in code follows below.
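A minimal sketch of this test in Python with scipy (the sample data and μ₀ = 65 below are hypothetical):

```python
from scipy import stats

# Hypothetical sample; initial assumption mu0 = 65
data = [64.2, 66.1, 67.5, 65.8, 63.9, 66.7, 68.0, 65.2]

t_stat, p_value = stats.ttest_1samp(data, popmean=65)
print(t_stat, p_value)  # two-sided p-value, df = len(data) - 1
```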

Testing for the population proportion

Let's go back to our CNN poll. Assume we have an SRS of 1,017 adults.

We are interested in testing the following hypothesis: H₀: p = 0.50 vs. Hₐ: p > 0.50

What is the test statistic?

If alpha = 0.05, what do we conclude?
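Here is a sketch of the calculation (the sample proportion p̂ = 0.54 is a hypothetical stand-in, since the poll's observed proportion is not restated here):

```python
import math
from scipy.stats import norm

n, p0 = 1017, 0.50
p_hat = 0.54  # hypothetical sample proportion

# One-proportion z-statistic: (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = norm.sf(z)  # right-tailed test, Ha: p > 0.50

print(z, p_value)
```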

We will see more details in the next lesson on proportions, then distributions, and possible tests.


How to Correctly Interpret P Values

Topics: Hypothesis Testing

The P value is used all over statistics, from t-tests to regression analysis . Everyone knows that you use P values to determine statistical significance in a hypothesis test . In fact, P values often determine what studies get published and what projects get funding.

Despite being so important, the P value is a slippery concept that people often interpret incorrectly. How do you interpret P values?

In this post, I'll help you to understand P values in a more intuitive way and to avoid a very common misinterpretation that can cost you money and credibility.

What Is the Null Hypothesis in Hypothesis Testing?


In every experiment, there is an effect or difference between groups that the researchers are testing. It could be the effectiveness of a new drug, building material, or other intervention that has benefits. Unfortunately for the researchers, there is always the possibility that there is no effect, that is, that there is no difference between the groups. This lack of a difference is called the null hypothesis , which is essentially the position a devil’s advocate would take when evaluating the results of an experiment.

To see why, let’s imagine an experiment for a drug that we know is totally ineffective. The null hypothesis is true: there is no difference between the experimental groups at the population level.

Despite the null being true, it’s entirely possible that there will be an effect in the sample data due to random sampling error. In fact, it is extremely unlikely that the sample groups will ever exactly equal the null hypothesis value. Consequently, the devil’s advocate position is that the observed difference in the sample does not reflect a true difference between populations .

What Are P Values?

  • High P values: your data are likely with a true null.
  • Low P values: your data are unlikely with a true null.

A low P value suggests that your sample provides enough evidence that you can reject the null hypothesis for the entire population.

How Do You Interpret P Values?


For example, suppose that a vaccine study produced a P value of 0.04. This P value indicates that if the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.

P values address only one question: how likely are your data, assuming a true null hypothesis? It does not measure support for the alternative hypothesis. This limitation leads us into the next section to cover a very common misinterpretation of P values.


P values are not the probability of making a mistake.

Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error ).

There are several reasons why P values can’t be the error rate.

First, P values are calculated based on the assumptions that the null is true for the population and that the difference in the sample is caused entirely by random chance. Consequently, P values can’t tell you the probability that the null is true or false because it is 100% true from the perspective of the calculations.

Second, while a low P value indicates that your data are unlikely assuming a true null, it can’t evaluate which of two competing cases is more likely:

  • The null is true but your sample was unusual.
  • The null is false.

Determining which case is more likely requires subject area knowledge and replicate studies.

Let’s go back to the vaccine study and compare the correct and incorrect way to interpret the P value of 0.04:

  • Correct: Assuming that the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.  
  • Incorrect: If you reject the null hypothesis, there’s a 4% chance that you’re making a mistake.

To see a graphical representation of how hypothesis tests work, see my post: Understanding Hypothesis Tests: Significance Levels and P Values .

What Is the True Error Rate?


If a P value is not the error rate, what the heck is the error rate? (Can you guess which way this is heading now?)

Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I discuss here ), the table summarizes them for middle-of-the-road assumptions.

  • P value 0.05: the probability of incorrectly rejecting a true null hypothesis is at least 23% (and typically close to 50%).
  • P value 0.01: the probability of incorrectly rejecting a true null hypothesis is at least 7% (and typically close to 15%).

Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. That can be costly!

Now that you know how to interpret P values, read my five guidelines for how to use P values and avoid mistakes .

You can also read my rebuttal to an academic journal that actually banned P values !

An exciting study about the reproducibility of experimental results was published in August 2015. This study highlights the importance of understanding the true error rate. For more information, read my blog post: P Values and the Replication of Experiments .

The American Statistical Association speaks out on how to use p-values!

*Thomas Sellke, M. J. Bayarri, and James O. Berger, "Calibration of p Values for Testing Precise Null Hypotheses," The American Statistician, February 2001, Vol. 55, No. 1.


P-Value: Comprehensive Guide to Understand, Apply, and Interpret

A p-value is a statistical metric used to assess a hypothesis by comparing it with observed data.

This article delves into the concept of p-value, its calculation, interpretation, and significance. It also explores the factors that influence p-value and highlights its limitations.

Table of Content

  • What is P-value?
  • How is the P-value calculated?
  • How to interpret the p-value
  • P-value in hypothesis testing
  • Implementing p-value in Python
  • Applications of p-value

What is the P-value?

The p-value, or probability value, is a statistical measure used in hypothesis testing to assess the strength of evidence against a null hypothesis. It represents the probability of obtaining results as extreme as, or more extreme than, the observed results under the assumption that the null hypothesis is true.

In simpler words, it is used to reject or support the null hypothesis during hypothesis testing. In data science, it gives valuable insights on the statistical significance of an independent variable in predicting the dependent variable. 

Calculating the p-value typically involves the following steps:

  • Formulate the Null Hypothesis (H0) : Clearly state the null hypothesis, which typically states that there is no significant relationship or effect between the variables.
  • Choose an Alternative Hypothesis (H1) : Define the alternative hypothesis, which proposes the existence of a significant relationship or effect between the variables.
  • Determine the Test Statistic : Calculate the test statistic, which is a measure of the discrepancy between the observed data and the expected values under the null hypothesis. The choice of test statistic depends on the type of data and the specific research question.
  • Identify the Distribution of the Test Statistic : Determine the appropriate sampling distribution for the test statistic under the null hypothesis. This distribution represents the expected values of the test statistic if the null hypothesis is true.
  • Calculate the p-value : Based on the observed test statistic and its sampling distribution, find the probability of obtaining the observed test statistic or a more extreme one, assuming the null hypothesis is true.
  • Interpret the results : Compare the p-value with the chosen significance level (or, equivalently, compare the test statistic with the critical value). A p-value at or below the significance level provides evidence to reject the null hypothesis, and vice versa.

Its interpretation depends on the specific test and the context of the analysis. Several popular methods for calculating test statistics that are utilized in p-value calculations.

  • Z-test. Scenario: used when dealing with large sample sizes or when the population standard deviation is known. Interpretation: a small p-value (smaller than 0.05) indicates strong evidence against the null hypothesis, leading to its rejection.
  • T-test. Scenario: appropriate for small sample sizes or when the population standard deviation is unknown. Interpretation: similar to the Z-test.
  • Chi-square test. Scenario: used for tests of independence or goodness-of-fit. Interpretation: a small p-value indicates that there is a significant association between the categorical variables, leading to the rejection of the null hypothesis.
  • F-test (ANOVA). Scenario: commonly used in Analysis of Variance (ANOVA) to compare variances between groups. Interpretation: a small p-value suggests that at least one group mean is different from the others, leading to the rejection of the null hypothesis.
  • Correlation test. Scenario: measures the strength and direction of a linear relationship between two continuous variables. Interpretation: a small p-value indicates that there is a significant linear relationship between the variables, leading to rejection of the null hypothesis that there is no correlation.

In general, a small p-value indicates that the observed data is unlikely to have occurred by random chance alone, which leads to the rejection of the null hypothesis. However, it’s crucial to choose the appropriate test based on the nature of the data and the research question, as well as to interpret the p-value in the context of the specific test being used.

The possible decisions in hypothesis testing, and the errors that can occur, are as follows:

  • Correct decision: rejecting the null hypothesis when it is false, or failing to reject it when it is true.
  • Type I error: incorrect rejection of the null hypothesis when it is true. Its probability is the significance level, denoted by α.
  • Type II error: incorrect acceptance (failure to reject) of the null hypothesis when it is false. Its probability is denoted by β; the power of the test is 1 - β.

Let’s consider an example to illustrate the process of calculating a p-value for Two Sample T-Test:

A researcher wants to investigate whether there is a significant difference in mean height between males and females in a population of university students.

Suppose we have the following data:

  • Sample 1 (males): \(\overline{x_1} = 175\), \(s_1 = 5\), \(n_1 = 30\)
  • Sample 2 (females): \(\overline{x_2} = 168\), \(s_2 = 6\), \(n_2 = 35\)

Let's walk through the process of calculating the p-value:

Step 1 : Formulate the Null Hypothesis (H0):

H0: There is no significant difference in mean height between males and females.

Step 2 : Choose an Alternative Hypothesis (H1):

H1: There is a significant difference in mean height between males and females.

Step 3 : Determine the Test Statistic:

The appropriate test statistic for this scenario is the two-sample t-test, which compares the means of two independent groups.

The t-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.

t = \frac{\overline{x_1} - \overline{x_2}}{ \sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}}

  • s1 = First sample’s standard deviation
  • s2 = Second sample’s standard deviation
  • n1 = First sample’s sample size
  • n2 = Second sample’s sample size

\begin{aligned}t &= \frac{175 - 168}{\sqrt{\frac{5^2}{30} + \frac{6^2}{35}}}\\&= \frac{7}{\sqrt{0.8333 + 1.0286}}\\&= \frac{7}{\sqrt{1.8619}}\\& \approx  \frac{7}{1.364}\\& \approx 5.13\end{aligned}

So, the calculated two-sample t-test statistic (t) is approximately 5.13.
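The same arithmetic can be checked in a few lines of Python (a sketch using the sample statistics above):

```python
import math

x1_bar, s1, n1 = 175, 5, 30  # sample 1
x2_bar, s2, n2 = 168, 6, 35  # sample 2

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
t = (x1_bar - x2_bar) / se

print(round(t, 2))  # ≈ 5.13
```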

Step 4 : Identify the Distribution of the Test Statistic:

The t-distribution is used for the two-sample t-test . The degrees of freedom for the t-distribution are determined by the sample sizes of the two groups.

 The t-distribution is a probability distribution with tails that are thicker than those of the normal distribution.

\(df = (n_1 + n_2) - 2\)

  • where n₁ is the total number of values in the first group, and
  • n₂ is the total number of values in the second group.

df = (30 + 35) - 2 = 63

The degrees of freedom (63) represent the variability available in the data to estimate the population parameters. In the context of the two-sample t-test, higher degrees of freedom provide a more precise estimate of the population variance, influencing the shape and characteristics of the t-distribution.


The t-distribution is symmetric and bell-shaped, similar to the normal distribution. As the degrees of freedom increase, the t-distribution approaches the shape of the standard normal distribution. Practically, it affects the critical values used to determine statistical significance and confidence intervals.

Step 5 : Calculate Critical Value.

To find the two-tailed critical t-value at α = 0.05 with 63 degrees of freedom, we can either consult a t-table or use statistical software.

We can use scipy.stats module in Python to find the critical t-value using below code.
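A minimal version of that code, for a two-tailed test at α = 0.05:

```python
from scipy import stats

alpha, d = 0.05, 63

# Two-tailed critical value: the t with P(T > t) = alpha / 2
critical_t = stats.t.ppf(1 - alpha / 2, d)
print(round(critical_t, 4))  # 1.9983
```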

Comparing with the t-statistic:

1.9983 < 5.13

The larger t-statistic suggests that the observed difference between the sample means is unlikely to have occurred by random chance alone. Therefore, we reject the null hypothesis.

To interpret the result, compare the p-value with the significance level (α):

  • p ≤ α (= 0.05): Reject the null hypothesis. There is sufficient evidence to conclude that the observed effect or relationship is statistically significant, meaning it is unlikely to have occurred by chance alone.
  • p > α (= 0.05): Fail to reject the null hypothesis. The observed effect or relationship does not provide enough evidence to reject the null hypothesis. This does not necessarily mean there is no effect; it simply means the sample data does not provide strong enough evidence to rule out the possibility that the effect is due to chance.

In case the significance level is not specified, consider the below general inferences while interpreting your results. 

  • If p > .10: not significant
  • If p ≤ .10: slightly significant
  • If p ≤ .05: significant
  • If p ≤ .001: highly significant

Graphically, the p-value corresponds to the tail area(s) of the sampling distribution beyond the observed test statistic [as shown in Fig 1].

Fig 1: Graphical representation of the p-value as a tail area.

What influences p-value?

The p-value in hypothesis testing is influenced by several factors:

  • Sample Size : Larger sample sizes tend to yield smaller p-values, increasing the likelihood of detecting significant effects.
  • Effect Size: A larger effect size results in smaller p-values, making it easier to detect a significant relationship.
  • Variability in the Data : Greater variability often leads to larger p-values, making it harder to identify significant effects.
  • Significance Level : A lower chosen significance level increases the threshold for considering p-values as significant.
  • Choice of Test: Different statistical tests may yield different p-values for the same data.
  • Assumptions of the Test : Violations of test assumptions can impact p-values.

Understanding these factors is crucial for interpreting p-values accurately and making informed decisions in hypothesis testing.

Significance of P-value

  • The p-value provides a quantitative measure of the strength of the evidence against the null hypothesis.
  • It supports decision-making in hypothesis testing: comparing it with the chosen significance level gives a clear reject / fail-to-reject rule.
  • It serves as a guide for interpreting the results of a statistical test. A small p-value suggests that the observed effect or relationship is statistically significant, but it does not necessarily mean that it is practically or clinically meaningful.

Limitations of P-value

  • The p-value is not a direct measure of the effect size, which represents the magnitude of the observed relationship or difference between variables. A small p-value does not necessarily mean that the effect size is large or practically meaningful.
  • It is influenced by various factors (sample size, effect size, variability in the data, and choice of test), so it should not be interpreted in isolation.

The p-value is a crucial concept in statistical hypothesis testing, serving as a guide for making decisions about the significance of the observed relationship or effect between variables.

Let’s consider a scenario where a tutor believes that the average exam score of their students is equal to the national average (85). The tutor collects a sample of exam scores from their students and performs a one-sample t-test to compare it to the population mean (85).

  • The code performs a one-sample t-test to compare the mean of a sample data set to a hypothesized population mean.
  • It utilizes the scipy.stats library to calculate the t-statistic and p-value. SciPy is a Python library that provides efficient numerical routines for scientific computing.
  • The p-value is compared to a significance level (alpha) to determine whether to reject the null hypothesis.
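A minimal sketch of that code is below; the exam scores are hypothetical stand-ins (the run described in this article reported p ≈ 0.7059):

```python
from scipy import stats

# Hypothetical exam scores from the tutor's students
scores = [88, 92, 79, 85, 90, 84, 81, 87, 95, 83]
national_average = 85  # hypothesized population mean

t_stat, p_value = stats.ttest_1samp(scores, popmean=national_average)

alpha = 0.05
if p_value <= alpha:
    print(f"Reject H0 (p = {p_value:.4f})")
else:
    print(f"Fail to reject H0 (p = {p_value:.4f})")
```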

Since 0.7059 > 0.05, we fail to reject the null hypothesis. This means that, based on the sample data, there isn't enough evidence to claim a significant difference in the exam scores of the tutor's students compared to the national average. The tutor would conclude that the average exam score of their students is statistically consistent with the national average.

  • During forward and backward feature selection: when fitting a model (say, a multiple linear regression model), we use the p-value to find the most significant variables, those that contribute significantly to predicting the output.
  • Effects of drug treatments: p-values are used heavily in medical research to determine whether the constituents of a drug will have the desired effect on humans. The p-value is a very strong statistical tool for hypothesis testing, providing valuable information for important decisions such as drawing a business intelligence inference or determining whether a drug should be used on humans.

The p-value is a crucial concept in statistical hypothesis testing, providing a quantitative measure of the strength of evidence against the null hypothesis. It guides decision-making by comparing the p-value to a chosen significance level, typically 0.05. A small p-value indicates strong evidence against the null hypothesis, suggesting a statistically significant relationship or effect. However, the p-value is influenced by various factors and should be interpreted alongside other considerations, such as effect size and context.

Frequently Asked Questions (FAQs)

Can a p-value be greater than 1?

No. A p-value is a probability, and probabilities must be between 0 and 1, so a p-value greater than 1 is not possible.

What does p = 0.01 mean?

It means that the observed test statistic is unlikely to occur by chance if the null hypothesis is true: there is only a 1% chance of observing the test statistic or a more extreme one under the null hypothesis.

Is 0.9 a good p-value?

A p-value of 0.9 provides essentially no evidence against the null hypothesis. By convention, a result is called statistically significant only when the p-value is less than or equal to 0.05; a p-value as large as 0.9 means the data are entirely consistent with the null hypothesis.

What is p-value in a model?

It is a measure of the statistical significance of a parameter in the model. It represents the probability of obtaining a parameter estimate (or its test statistic) as extreme as the one observed, assuming the null hypothesis (typically, that the parameter equals zero) is true.

Why is p-value so low?

A low p-value means that the observed test statistic is unlikely to occur by chance if the null hypothesis is true. It suggests that the observed relationship or effect is statistically significant and not due to random sampling variation.

How Can You Use P-value to Compare Two Different Results of a Hypothesis Test?

Compare the p-values: a lower p-value indicates stronger evidence against the null hypothesis, so of two results from the same kind of test, the one with the smaller p-value provides stronger evidence against its null hypothesis.



StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


Hypothesis Testing, P Values, Confidence Intervals, and Significance

Jacob Shreffler ; Martin R. Huecker .


Last Update: March 13, 2023 .

  • Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

  • Issues of Concern

Without a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance, healthcare providers may be limited in their ability to make clinical decisions without relying purely on the level of significance deemed appropriate by the research investigators. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate to be applied in healthcare practice.

Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis.

Researchers should be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent.

Regarding p values, as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases. With very large sample sizes, the p-value can be very low even when there are only small differences in the reduction of symptoms for Disease A between Drug 23 and Drug 22. The null hypothesis is deemed true until a study presents significant data to support rejecting the null hypothesis. Based on the results, the investigators will either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (they could not provide proof that there were significant differences or associations).

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1]  When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]

Significance

Significance is a term to describe the substantive importance of medical research. Statistical significance refers to how unlikely the observed results would be if chance alone were at work. [3]  Healthcare providers should always delineate statistical significance from clinical significance, a common error when reviewing biomedical research. [4]  When conceptualizing findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5]  One criterion often used to determine statistical significance is the utilization of p values.

P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability that the observed effect within the study would have occurred by chance if, in reality, there was no true effect. Conventionally, data yielding a p<0.05 or p<0.01 is considered statistically significant. While some have debated that the 0.05 level should be lowered, it is still universally practiced. [6]  Note that hypothesis testing with p values alone does not allow us to determine the size of the effect.

An example of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n=100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.

Statement: Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude significant differences. Noticeably, as can be seen in the two statements above, some researchers will report findings with < or > and others will provide an exact p-value (0.000001) but never zero [6] . When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7]  The inclusion of all p values provides evidence for study validity and limits suspicion for selective reporting/data mining.  

While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8] . P-values alone do not allow us to understand the size or the extent of the differences or associations. [3]  In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]

When applying research to clinical work, healthcare professionals should consider p values alongside a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted more heavily than one from a retrospective observational study. [7] The p-value debate has smoldered since the 1950s, [10] and replacement with confidence intervals has been suggested since the 1980s. [11]

Confidence Intervals

A confidence interval provides a range of values that, with a given level of confidence (e.g., 95%), is expected to contain the true value of a statistical parameter in the target population. [12] Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13] A CI provides the lower and upper bounds of a difference or association that would be plausible for a population. [14] A 95% CI therefore indicates that if the study were carried out 100 times, the interval would contain the true value in 95 of them. [15] Confidence intervals provide more evidence regarding the precision of an estimate than p values do. [6]

Returning to the research example above, one could make the following statement with a 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; the mean difference between the two groups in days to recovery was 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing the sample size results in less precision of the CI (increasing its width). [14] A larger width indicates a smaller sample size or larger variability. [16] Researchers generally aim to increase the precision of the CI; for example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes clinically significant values. [14]
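
As a minimal sketch of how sample size drives CI width (hypothetical data, not drawn from the studies cited), the snippet below computes a 95% CI for a mean at two sample sizes:

```python
import numpy as np
from scipy import stats

def mean_ci(data, confidence=0.95):
    """CI for the mean: estimate +/- t* x standard error."""
    data = np.asarray(data)
    n = len(data)
    se = stats.sem(data)                      # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    m = data.mean()
    return m - t_crit * se, m + t_crit * se

rng = np.random.default_rng(0)
population = rng.normal(loc=4.2, scale=3.0, size=100_000)

for n in (20, 200):
    sample = rng.choice(population, size=n, replace=False)
    lo, hi = mean_ci(sample)
    print(f"n={n:>3}: 95% CI = ({lo:.2f}, {hi:.2f}), width = {hi - lo:.2f}")
# The larger sample produces a visibly narrower (more precise) interval.
```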

A null value is often used to judge significance with CIs (zero for differences, 1 for ratios); however, CIs provide more information than that. [15] Consider this example: a hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: −2.5 to 41 minutes). Because the range crosses zero, implementing this protocol in other populations could result in longer wait times; however, the interval lies mostly on the positive side. Thus, while a p-value for this result may come out "not significant," readers should examine the range, consider the study design, and weigh whether the protocol is still worth piloting in their workplace.

As with p-values, 95% CIs cannot compensate for researcher errors (e.g., study bias or improper data analysis). [14] When deciding whether to report p-values or CIs, researchers should examine journal preferences; when in doubt, reporting both may be beneficial. [13] An example is below:

Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference between the two groups in days to recovery was 4.2 days (95% CI: 1.9 – 7.8).

  • Clinical Significance

Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and a large sample size may be of no interest to clinicians, whereas a study with a smaller sample size and statistically non-significant results could impact clinical practice. [14] Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of Type I and Type II errors, data analysis, and reporting of statistical findings (p values, 95% CIs, or both). [4] Interestingly, some experts have called for "statistically significant" and "not significant" to be excluded from published work, as statistical significance never has been and never will be equivalent to clinical significance. [17]

The decision on what is clinically significant can be challenging, depending on the provider's experience and especially the severity of the disease. Providers should use their knowledge and experience to determine the meaningfulness of study results, making inferences based not only on the significance reported by researchers but also on their own understanding of study limitations and practical implications.

  • Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care. 


Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

  • Shreffler J, Huecker MR. Hypothesis Testing, P Values, Confidence Intervals, and Significance. [Updated 2023 Mar 13]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


P-Value: What It Is, How to Calculate It, and Why It Matters


In statistics, a p-value indicates the likelihood of obtaining a value equal to or greater than the observed result if the null hypothesis is true.

The p-value serves as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means stronger evidence in favor of the alternative hypothesis.

P-value is often used to promote credibility for studies or reports by government agencies. For example, the U.S. Census Bureau stipulates that any analysis with a p-value greater than 0.10 must be accompanied by a statement that the difference is not statistically different from zero. The Census Bureau also has standards in place stipulating which p-values are acceptable for various publications.

Key Takeaways

  • A p-value is a statistical measurement used to validate a hypothesis against observed data.
  • A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true.
  • The lower the p-value, the greater the statistical significance of the observed difference.
  • A p-value of 0.05 or lower is generally considered statistically significant.
  • P-value can serve as an alternative to—or in addition to—preselected confidence levels for hypothesis testing.


P-values are usually calculated using statistical software or p-value tables based on the assumed or known probability distribution of the specific statistic tested. While the sample size influences the reliability of the observed data, the p-value approach to hypothesis testing specifically involves calculating the p-value based on the deviation between the observed value and a chosen reference value, given the probability distribution of the statistic. A greater difference between the two values corresponds to a lower p-value.

Mathematically, the p-value is calculated using integral calculus from the area under the probability distribution curve for all values of statistics that are at least as far from the reference value as the observed value is, relative to the total area under the probability distribution curve. Standard deviations, which quantify the dispersion of data points from the mean, are instrumental in this calculation.

The calculation for a p-value varies based on the type of test performed. The three test types describe the location on the probability distribution curve: lower-tailed test, upper-tailed test, or two-tailed test. In each case, the degrees of freedom play a crucial role in determining the shape of the distribution and thus the calculation of the p-value.

In a nutshell, the greater the difference between two observed values, the less likely it is that the difference is due to simple random chance, and this is reflected by a lower p-value.
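
As a concrete sketch of those three cases, the snippet below converts a z test statistic into a p-value for lower-tailed, upper-tailed, and two-tailed tests, assuming a standard normal null distribution:

```python
from scipy import stats

z = 1.96  # example test statistic under a standard normal null distribution

p_lower = stats.norm.cdf(z)            # lower-tailed: P(Z <= z)
p_upper = stats.norm.sf(z)             # upper-tailed: P(Z >= z)
p_two   = 2 * stats.norm.sf(abs(z))    # two-tailed: both extremes count

print(f"lower={p_lower:.4f}, upper={p_upper:.4f}, two-tailed={p_two:.4f}")
# The two-tailed p is about 0.05, matching the familiar 1.96 critical value.
```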

The P-Value Approach to Hypothesis Testing

The p-value approach to hypothesis testing uses the calculated probability to determine whether there is evidence to reject the null hypothesis. This determination relies heavily on the test statistic, which summarizes the information from the sample relevant to the hypothesis being tested. The null hypothesis, also known as the conjecture, is the initial claim about a population (or data-generating process). The alternative hypothesis states whether the population parameter differs from the value of the population parameter stated in the conjecture.

In practice, the significance level is stated in advance to determine how small the p-value must be to reject the null hypothesis. Because different researchers use different levels of significance when examining a question, a reader may sometimes have difficulty comparing results from two different tests. P-values provide a solution to this problem.

Even a low p-value is not necessarily proof that an effect is real, since there is still a possibility that the observed data are the result of chance. Only repeated experiments or studies can confirm whether a relationship holds.

For example, suppose a study comparing returns from two particular assets was undertaken by different researchers who used the same data but different significance levels. The researchers might come to opposite conclusions regarding whether the assets differ.

If one researcher used a confidence level of 90% and the other required a confidence level of 95% to reject the null hypothesis, and if the p-value of the observed difference between the two returns was 0.08 (corresponding to a confidence level of 92%), then the first researcher would find that the two assets have a difference that is statistically significant , while the second would find no statistically significant difference between the returns.

To avoid this problem, the researchers could report the p-value of the hypothesis test and allow readers to interpret the statistical significance themselves. This is called a p-value approach to hypothesis testing. Independent observers could note the p-value and decide for themselves whether that represents a statistically significant difference or not.

Example of P-Value

An investor claims that their investment portfolio’s performance is equivalent to that of the Standard & Poor’s (S&P) 500 Index . To determine this, the investor conducts a two-tailed test.

The null hypothesis states that the portfolio’s returns are equivalent to the S&P 500’s returns over a specified period, while the alternative hypothesis states that the portfolio’s returns and the S&P 500’s returns are not equivalent—if the investor conducted a one-tailed test , the alternative hypothesis would state that the portfolio’s returns are either less than or greater than the S&P 500’s returns.

The p-value hypothesis test does not necessarily make use of a preselected confidence level at which the investor should reject the null hypothesis that the returns are equivalent. Instead, it provides a measure of how much evidence there is to reject the null hypothesis. The smaller the p-value, the greater the evidence against the null hypothesis.

Thus, if the investor finds that the p-value is 0.001, there is strong evidence against the null hypothesis, and the investor can confidently conclude that the portfolio’s returns and the S&P 500’s returns are not equivalent.

Although this does not provide an exact threshold as to when the investor should accept or reject the null hypothesis, it does have another very practical advantage. P-value hypothesis testing offers a direct way to compare the relative confidence that the investor can have when choosing among multiple different types of investments or portfolios relative to a benchmark such as the S&P 500.

For example, for two portfolios, A and B, whose performance differs from the S&P 500 with p-values of 0.10 and 0.01, respectively, the investor can be much more confident that portfolio B, with a lower p-value, will actually show consistently different results.
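
A rough sketch of how the investor's two-tailed comparison might be run in practice (the returns here are simulated, not real portfolio or S&P 500 data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated daily returns: the portfolio drifts slightly above the benchmark.
benchmark = rng.normal(loc=0.0004, scale=0.01, size=252)
portfolio = benchmark + rng.normal(loc=0.0002, scale=0.004, size=252)

# Paired two-tailed t-test: H0 says the mean daily difference is zero.
t_stat, p_value = stats.ttest_rel(portfolio, benchmark)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A very small p (e.g., 0.001) would be strong evidence the returns differ.
```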

Is a 0.05 P-Value Significant?

A p-value less than 0.05 is typically considered to be statistically significant, in which case the null hypothesis should be rejected. A p-value greater than 0.05 means that deviation from the null hypothesis is not statistically significant, and the null hypothesis is not rejected.

What Does a P-Value of 0.001 Mean?

A p-value of 0.001 indicates that if the null hypothesis tested were indeed true, then there would be a one-in-1,000 chance of observing results at least as extreme. This leads the observer to reject the null hypothesis because either a highly rare data result has been observed or the null hypothesis is incorrect.

How Can You Use P-Value to Compare 2 Different Results of a Hypothesis Test?

If you have two different results, one with a p-value of 0.04 and one with a p-value of 0.06, the result with a p-value of 0.04 will be considered more statistically significant than the p-value of 0.06. Beyond this simplified example, you could compare a 0.04 p-value to a 0.001 p-value. Both are statistically significant, but the 0.001 example provides an even stronger case against the null hypothesis than the 0.04.

The p-value is used to measure the significance of observational data. When researchers identify an apparent relationship between two variables, there is always a possibility that this correlation might be a coincidence. A p-value calculation helps determine if the observed relationship could arise as a result of chance.

U.S. Census Bureau. "Statistical Quality Standard E1: Analyzing Data."


What Is P-Value in Statistical Hypothesis?

Few statistical estimates are as significant as the p-value. The p-value or probability value is a number, calculated from a statistical test, that describes how likely your results would have occurred if the null hypothesis were true. A p-value less than 0.05 is typically considered statistically significant, while a higher value indicates that the evidence is not strong enough to reject the null hypothesis. So, what is the p-value exactly, and why is it so important?

In statistical hypothesis testing, the p-value or probability value can be defined as the probability that a real-valued test statistic is at least as extreme as the value actually obtained. The p-value shows how likely it is that your set of observations could have occurred under the null hypothesis. P-values are used in statistical hypothesis testing to determine whether to reject the null hypothesis: the smaller the p-value, the stronger the evidence that you should reject it.


P-values are expressed as decimals and can be converted into percentages. For example, a p-value of 0.0237 is 2.37%, which means there is a 2.37% chance of your results having occurred by chance. The smaller the p-value, the more significant your results.

In a hypothesis test, you compare the p-value from your test with the alpha level selected while running the test. Now, let's look at the p-value vs. the alpha level.

A p-value indicates the probability of getting an effect at least as large as the one actually observed in the sample data.

An alpha level will tell you the probability of wrongly rejecting a true null hypothesis. The level is selected by the researcher and obtained by subtracting your confidence level from 100%. For instance, if you are 95% confident in your research, the alpha level will be 5% (0.05).

When you run the hypothesis test, if you get:

  • A small p value (≤ 0.05), you should reject the null hypothesis
  • A large p value (> 0.05), you should not reject the null hypothesis

In addition to the p-value, you can use other values given by your test to decide whether to reject the null hypothesis. 

For example, if you run an F-test to compare two variances in Excel, you will obtain a p-value, an f-value, and an f-critical value. Compare the f-value with the f-critical value: if the f-value is larger (i.e., the f-critical value is lower), you should reject the null hypothesis. 
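
A rough Python equivalent of that two-sample F-test for variances might look like the sketch below (Excel's F-Test tool reports a one-tailed probability, which is what this computes; the sample data are made up):

```python
import numpy as np
from scipy import stats

def f_test_variances(x, y):
    """One-tailed F-test for equality of two variances (larger variance on top)."""
    x, y = np.asarray(x), np.asarray(y)
    var_x = x.var(ddof=1)  # sample variances
    var_y = y.var(ddof=1)
    if var_x < var_y:      # keep F >= 1 so the upper tail is the relevant one
        x, y = y, x
        var_x, var_y = var_y, var_x
    f_value = var_x / var_y
    df1, df2 = len(x) - 1, len(y) - 1
    p_one_tailed = stats.f.sf(f_value, df1, df2)
    f_critical = stats.f.ppf(0.95, df1, df2)   # 5% significance level
    return f_value, f_critical, p_one_tailed

rng = np.random.default_rng(1)
a = rng.normal(0, 1.0, 40)
b = rng.normal(0, 1.6, 40)
f, f_crit, p = f_test_variances(a, b)
print(f"F = {f:.2f}, F-critical = {f_crit:.2f}, p = {p:.4f}")
# Reject equal variances when F exceeds F-critical (equivalently, p < 0.05).
```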

P-Values are usually calculated using p-value tables or spreadsheets, or calculated automatically using statistical software like R, SPSS, etc. 

Depending on the test statistic and the degrees of freedom of your test (roughly, the number of observations minus the number of independent variables), you can find out from the tables how often a test statistic at least that extreme would occur under the null hypothesis. 

How to calculate P-value depends on which statistical test you’re using to test your hypothesis.  

  • Every statistical test uses different assumptions and generates different statistics. Select the test method that best suits your data and matches the effect or relationship being tested.
  • The number of independent variables included in your test determines how big or small the test statistic should be in order to generate the same p-value. 

Regardless of what statistical test you are using, the p-value always denotes the same thing: how frequently you can expect to see a test statistic as extreme as, or more extreme than, the one given by your test, assuming the null hypothesis is true. 
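
In place of a printed table, the same lookup can be sketched with the t distribution (the test statistic and degrees of freedom below are assumed purely for illustration):

```python
from scipy import stats

t_stat = 2.4   # hypothetical test statistic from some t-test
df = 28        # degrees of freedom for that test

# "As extreme or more extreme" in both directions -> two-tailed p-value.
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
print(f"p = {p_two_tailed:.4f}")  # roughly 0.02 for these values
```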

In the P-Value approach to hypothesis testing, a calculated probability is used to decide if there’s evidence to reject the null hypothesis, also known as the conjecture. The conjecture is the initial claim about a data population, while the alternative hypothesis ascertains if the observed population parameter differs from the population parameter value according to the conjecture. 

Effectively, the significance level is declared in advance to determine how small the p-value needs to be for the null hypothesis to be rejected. The levels of significance vary from one researcher to another, so it can get difficult for readers to compare results from two different tests. That is when the p-value makes things easier. 

Readers could interpret the statistical significance by referring to the reported P-value of the hypothesis test. This is known as the P-value approach to hypothesis testing. Using this, readers could decide for themselves whether the p value represents a statistically significant difference.  

The level of statistical significance is usually represented as a P-value between 0 and 1. The smaller the p-value, the more likely it is that you would reject the null hypothesis. 

  • A p-value ≤ 0.05 is considered statistically significant. It denotes strong evidence against the null hypothesis: if the null were true, there would be less than a 5% probability of seeing results this extreme. So we reject the null hypothesis in favor of the alternative hypothesis.
  • But even when the p-value is lower than your threshold of significance, rejecting the null hypothesis does not mean there is a 95% probability of the alternative hypothesis being true. 
  • A p-value > 0.05 is not statistically significant. It indicates insufficient evidence against the null hypothesis, so we fail to reject it. Note that we cannot accept the null hypothesis; we can only reject or fail to reject it. 

A statistically significant result does not prove a research hypothesis to be correct. Instead, it provides support for or provides evidence for the hypothesis. 

  • Report exact p-values to two or three decimal places. 
  • For p-values less than .001, report p < .001. 
  • Do not use a zero before the decimal point, since p cannot exceed 1: write p = .001, not p = 0.001.
  • Make sure p is always italicized and there is a space on either side of the = sign. 
  • It is impossible to get p = .000; such results should be written as p < .001 (a small formatting helper encoding these rules is sketched below).
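
Here is a minimal helper encoding the reporting rules above (an illustrative sketch, not an official APA tool):

```python
def format_p(p: float) -> str:
    """Format a p-value per the reporting rules above (APA-style)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "p < .001"          # never report p = .000
    # Exact value, three decimals, no leading zero before the decimal point.
    return f"p = {p:.3f}".replace("0.", ".", 1)

print(format_p(0.0004))   # p < .001
print(format_p(0.0237))   # p = .024
print(format_p(0.2))      # p = .200
```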

An investor says that the performance of their investment portfolio is equivalent to that of the Standard & Poor's (S&P) 500 Index and performs a two-tailed test to determine this. 

The null hypothesis here says that the portfolio’s returns are equivalent to the returns of S&P 500, while the alternative hypothesis says that the returns of the portfolio and the returns of the S&P 500 are not equivalent.  

The p-value hypothesis test gives a measure of how much evidence is present to reject the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis. 

Therefore, if the investor gets a p-value of .001, it indicates strong evidence against the null hypothesis, and the investor can confidently deduce that the portfolio's returns and the S&P 500's returns are not equivalent.

1. What does P-value mean?

P-Value or probability value is a number that denotes the likelihood of your data having occurred under the null hypothesis of your statistical test. 

2. What does p 0.05 mean?

A P-value less than 0.05 is deemed to be statistically significant, meaning the null hypothesis should be rejected in such a case. A P-Value greater than 0.05 is not considered to be statistically significant, meaning the null hypothesis should not be rejected. 

3. What is P-value and how is it calculated?

The p-value or probability value is a number, calculated from a statistical test, that tells how likely it is that your results would have occurred under the null hypothesis of the test.  

P-values are usually calculated automatically by statistical software. They can also be found using p-value tables for the relevant statistical test. P-values are calculated from the null distribution of the test statistic: if the test statistic is far from the mean of the null distribution, the resulting p-value is small, indicating that the test statistic is unlikely to have occurred under the null hypothesis. 

4. What is p-value in research?

P values are used in hypothesis testing to help determine whether the null hypothesis should be rejected. It plays a major role when results of research are discussed. Hypothesis testing is a statistical methodology frequently used in medical and clinical research studies. 

5. Why is the p-value significant?

Statistical significance is a term that researchers use to say that it is not likely that their observations could have occurred if the null hypothesis were true. The level of statistical significance is usually represented as a P-value or probability value between 0 and 1. The smaller the p-value, the more likely it is that you would reject the null hypothesis. 

6. What is null hypothesis and what is p-value?

A null hypothesis is a kind of statistical hypothesis that suggests that there is no statistical significance in a set of given observations. It says there is no relationship between your variables.   

P-value or probability value is a number, calculated from a statistical test, that tells how likely it is that your results would have occurred under the null hypothesis of the test.   

The p-value is used to determine the significance of observational data. Whenever researchers notice an apparent relation between two variables, a p-value calculation helps ascertain whether the observed relationship arose as a result of chance.



How can I get a p-value for a hypothesis test?

Hello everyone,

I was trying out BlueSky yesterday and I couldn't get the p-value for a hypothesis test. A 2-sample independent t-test, to be precise. Any idea?

The independent samples t-test first does an F test on the equality of variances. The p-value for that is labeled "Sig." You can use that test to choose which of the two t-tests to use. If the F is not significant (i.e., its p-value is larger than 0.05), look at the top t-test for "Equal variances assumed" and its p-value labeled "Sig.(2-tailed)." If the F is significant, choose the bottom t-test labeled "Equal variances not assumed" and its p-value, also labeled "Sig.(2-tailed)."

An alternative approach is to never assume the variances are equal and so always choose the bottom t-test.

For the paired-samples t-test, look at the p-value column.

For the one-sample t-test, look at the Sig.(2-tail) column.
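
For anyone cross-checking BlueSky's output against the other tools mentioned in this thread, a rough Python equivalent of the two t-tests (pooled vs. Welch) would be the sketch below, using made-up sample data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group1 = rng.normal(10.0, 2.0, 35)   # made-up sample data
group2 = rng.normal(11.2, 3.5, 40)

# "Equal variances assumed" row (pooled t-test):
t_eq, p_eq = stats.ttest_ind(group1, group2, equal_var=True)

# "Equal variances not assumed" row (Welch's t-test):
t_w, p_w = stats.ttest_ind(group1, group2, equal_var=False)

print(f"pooled: t = {t_eq:.3f}, Sig.(2-tailed) = {p_eq:.4f}")
print(f"Welch : t = {t_w:.3f}, Sig.(2-tailed) = {p_w:.4f}")
# "Sig." in this SPSS-style output is just the p-value.
```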

Just to confirm, the p-value is the column highlighted in red?

[screenshot of BlueSky t-test output omitted]

Oh, and I went to Analysis -> Means -> T-test, Independent Samples.

Review response above, I just made edits

Thanks Aaron. Will do that. :)

But any idea why it says Sig instead of p-value? I am studying using Python, R, and Minitab, and all of them report a p-value. Sig usually means significance level.

For example, a 5% significance level (alpha) corresponds to a 95% confidence level, and if p is lower than 5%, you reject the null.

I thought Sig here meant the significance level, not the p-value.

