Statistical Conclusion Validity: Some Common Threats and Simple Remedies

Miguel A. García-Pérez

1 Facultad de Psicología, Departamento de Metodología, Universidad Complutense, Madrid, Spain

The ultimate goal of research is to produce dependable knowledge or to provide the evidence that may guide practical decisions. Statistical conclusion validity (SCV) holds when the conclusions of a research study are founded on an adequate analysis of the data, generally meaning that adequate statistical methods are used whose small-sample behavior is accurate, besides being logically capable of providing an answer to the research question. Compared to the three other traditional aspects of research validity (external validity, internal validity, and construct validity), interest in SCV has recently grown on evidence that inadequate data analyses are sometimes carried out which yield conclusions that a proper analysis of the data would not have supported. This paper discusses evidence of three common threats to SCV that arise from widespread recommendations or practices in data analysis, namely, the use of repeated testing and optional stopping without control of Type-I error rates, the recommendation to check the assumptions of statistical tests, and the use of regression whenever a bivariate relation or the equivalence between two variables is studied. For each of these threats, examples are presented and alternative practices that safeguard SCV are discussed. Educational and editorial changes that may improve the SCV of published research are also discussed.

Psychologists are well aware of the traditional aspects of research validity introduced by Campbell and Stanley ( 1966 ) and further subdivided and discussed by Cook and Campbell ( 1979 ). Despite initial criticisms of the practically oriented and somewhat fuzzy distinctions among the various aspects (see Cook and Campbell, 1979 , pp. 85–91; see also Shadish et al., 2002 , pp. 462–484), the four facets of research validity have gained recognition and they are currently covered in many textbooks on research methods in psychology (e.g., Beins, 2009 ; Goodwin, 2010 ; Girden and Kabacoff, 2011 ). Methods and strategies aimed at securing research validity are also discussed in these and other sources. To simplify the description, construct validity is sought by using well-established definitions and measurement procedures for variables, internal validity is sought by ensuring that extraneous variables have been controlled and confounds have been eliminated, and external validity is sought by observing and measuring dependent variables under natural conditions or under an appropriate representation of them. The fourth aspect of research validity, which Cook and Campbell called statistical conclusion validity (SCV), is the subject of this paper.

Cook and Campbell (1979, pp. 39–50) discussed that SCV pertains to the extent to which data from a research study can reasonably be regarded as revealing a link (or lack thereof) between independent and dependent variables as far as statistical issues are concerned. This particular facet was separated from other factors acting in the same direction (the three other facets of validity) and includes three aspects: (1) whether the study has enough statistical power to detect an effect if it exists, (2) whether there is a risk that the study will “reveal” an effect that does not actually exist, and (3) how the magnitude of the effect can be confidently estimated. They nevertheless considered the latter aspect as a mere step ahead once the first two aspects had been satisfactorily solved, and they summarized their position by stating that SCV “refers to inferences about whether it is reasonable to presume covariation given a specified α level and the obtained variances” (Cook and Campbell, 1979, p. 41). Given that mentioning “the obtained variances” was an indirect reference to statistical power and mentioning α was a direct reference to statistical significance, their position about SCV may have seemed to only entail consideration that the statistical decision can be incorrect as a result of Type-I and Type-II errors. Perhaps as a consequence of this literal interpretation, review papers studying SCV in published research have focused on power and significance (e.g., Ottenbacher, 1989; Ottenbacher and Maas, 1999), strategies aimed at increasing SCV have only considered these issues (e.g., Howard et al., 1983), and tutorials on the topic only or almost only mention these issues along with effect sizes (e.g., Orme, 1991; Austin et al., 1998; Rankupalli and Tandon, 2010). This emphasis on issues of significance and power may also be the reason that some sources refer to threats to SCV as “any factor that leads to a Type-I or a Type-II error” (e.g., Girden and Kabacoff, 2011, p. 6; see also Rankupalli and Tandon, 2010, Section 1.2), as if these errors had identifiable causes that could be prevented. It should be noted that SCV has also occasionally been purported to reflect the extent to which pre-experimental designs provide evidence for causation (Lee, 1985) or the extent to which meta-analyses are based on representative results that make the conclusion generalizable (Elvik, 1998).

But Cook and Campbell’s ( 1979 , p. 80) aim was undoubtedly broader, as they stressed that SCV “is concerned with sources of random error and with the appropriate use of statistics and statistical tests ” (italics added). Moreover, Type-I and Type-II errors are an essential and inescapable consequence of the statistical decision theory underlying significance testing and, as such, the potential occurrence of one or the other of these errors cannot be prevented. The actual occurrence of them for the data on hand cannot be assessed either. Type-I and Type-II errors will always be with us and, hence, SCV is only trivially linked to the fact that research will never unequivocally prove or reject any statistical null hypothesis or its originating research hypothesis. Cook and Campbell seemed to be well aware of this issue when they stressed that SCV refers to reasonable inferences given a specified significance level and a given power. In addition, Stevens ( 1950 , p. 121) forcefully emphasized that “ it is a statistician’s duty to be wrong the stated number of times,” implying that a researcher should accept the assumed risks of Type-I and Type-II errors, use statistical methods that guarantee the assumed error rates, and consider these as an essential part of the research process. From this position, these errors do not affect SCV unless their probability differs meaningfully from that which was assumed. And this is where an alternative perspective on SCV enters the stage, namely, whether the data were analyzed properly so as to extract conclusions that faithfully reflect what the data have to say about the research question. A negative answer raises concerns about SCV beyond the triviality of Type-I or Type-II errors. There are actually two types of threat to SCV from this perspective. One is when the data are subjected to thoroughly inadequate statistical analyses that do not match the characteristics of the design used to collect the data or that cannot logically give an answer to the research question. The other is when a proper statistical test is used but it is applied under conditions that alter the stated risk probabilities. In the former case, the conclusion will be wrong except by accident; in the latter, the conclusion will fail to be incorrect with the declared probabilities of Type-I and Type-II errors.

The position elaborated in the foregoing paragraph is well summarized in Milligan and McFillen’s ( 1984 , p. 439) statement that “under normal conditions (…) the researcher will not know when a null effect has been declared significant or when a valid effect has gone undetected (…) Unfortunately, the statistical conclusion validity, and the ultimate value of the research, rests on the explicit control of (Type-I and Type-II) error rates.” This perspective on SCV is explicitly discussed in some textbooks on research methods (e.g., Beins, 2009 , pp. 139–140; Goodwin, 2010 , pp. 184–185) and some literature reviews have been published that reveal a sound failure of SCV in these respects.

For instance, Milligan and McFillen (1984, p. 438) reviewed evidence that “the business research community has succeeded in publishing a great deal of incorrect and statistically inadequate research” and they dissected and discussed in detail four additional cases (among many others that reportedly could have been chosen) in which a breach of SCV resulted from gross mismatches between the research design and the statistical analysis. Similarly, García-Pérez (2005) reviewed alternative methods to compute confidence intervals for proportions and discussed three papers (among many others that reportedly could have been chosen) in which inadequate confidence intervals had been computed. More recently, Bakker and Wicherts (2011) conducted a thorough analysis of psychological papers and estimated that roughly 50% of published papers contain reporting errors, although they only checked whether the reported p value was correct and not whether the statistical test used was appropriate. A similar analysis carried out by Nieuwenhuis et al. (2011) revealed that 50% of the papers reporting the results of a comparison of two experimental effects in top neuroscience journals had used an incorrect statistical procedure. And Bland and Altman (2011) reported further data on the prevalence of incorrect statistical analyses of a similar nature.

An additional indicator of the use of inadequate statistical procedures arises from consideration of published papers whose title explicitly refers to a re-analysis of data reported in some other paper. A literature search for papers including in their title the terms “a re-analysis,” “a reanalysis,” “re-analyses,” “reanalyses,” or “alternative analysis” was conducted on May 3, 2012 in the Web of Science (WoS; http://thomsonreuters.com), which rendered 99 such papers with subject area “Psychology” published in 1990 or later. Although some of these were false positives, a sizeable number of them actually discussed the inadequacy of analyses carried out by the original authors and reported the results of proper alternative analyses that typically reversed the original conclusion. This type of outcome upon re-analysis of data is more frequent than the results of this quick and simple search suggest, because the information needed for identification is not always included in the title of the paper or is included in some other form: For a simple example, the search for the clause “a closer look” in the title rendered 131 papers, many of which also presented re-analyses of data that reversed the conclusion of the original study.

Poor design or poor sample size planning may, unbeknownst to the researcher, lead to unacceptable Type-II error rates, which will certainly affect SCV (as long as the null is not rejected; if it is, the probability of a Type-II error is irrelevant). Although insufficient power due to lack of proper planning has consequences on statistical tests, the thread of this paper de-emphasizes this aspect of SCV (which should perhaps more reasonably fit within an alternative category labeled design validity ) and emphasizes the idea that SCV holds when statistical conclusions are incorrect with the stated probabilities of Type-I and Type-II errors (whether the latter was planned or simply computed). Whether or not the actual significance level used in the research or the power that it had is judged acceptable is another issue, which does not affect SCV: The statistical conclusion is valid within the stated (or computed) error probabilities. A breach of SCV occurs, then, when the data are not subjected to adequate statistical analyses or when control of Type-I or Type-II errors is lost.

It should be noted that a further component was included into consideration of SCV in Shadish et al.’s ( 2002 ) sequel to Cook and Campbell’s ( 1979 ) book, namely, effect size. Effect size relates to what has been called a Type-III error (Crawford et al., 1998 ), that is, a statistically significant result that has no meaningful practical implication and that only arises from the use of a huge sample. This issue is left aside in the present paper because adequate consideration and reporting of effect sizes precludes Type-III errors, although the recommendations of Wilkinson and The Task Force on Statistical Inference ( 1999 ) in this respect are not always followed. Consider, e.g., Lippa’s ( 2007 ) study of the relation between sex drive and sexual attraction. Correlations generally lower than 0.3 in absolute value were declared strong as a result of p values below 0.001. With sample sizes sometimes nearing 50,000 paired observations, even correlations valued at 0.04 turned out significant in this study. More attention to effect sizes is certainly needed, both by researchers and by journal editors and reviewers.

The remainder of this paper analyzes three common practices that result in SCV breaches, also discussing simple replacements for them.

Stopping Rules for Data Collection without Control of Type-I Error Rates

The asymptotic theory that provides justification for null hypothesis significance testing (NHST) assumes what is known as fixed sampling , which means that the size n of the sample is not itself a random variable or, in other words, that the size of the sample has been decided in advance and the statistical test is performed once the entire sample of data has been collected. Numerous procedures have been devised to determine the size that a sample must have according to planned power (Ahn et al., 2001 ; Faul et al., 2007 ; Nisen and Schwertman, 2008 ; Jan and Shieh, 2011 ), the size of the effect sought to be detected (Morse, 1999 ), or the width of the confidence intervals of interest (Graybill, 1958 ; Boos and Hughes-Oliver, 2000 ; Shieh and Jan, 2012 ). For reviews, see Dell et al. ( 2002 ) and Maxwell et al. ( 2008 ). In many cases, a researcher simply strives to gather as large a sample as possible. Asymptotic theory supports NHST under fixed sampling assumptions, whether or not the size of the sample was planned.
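As a minimal illustration of fixed-sample-size planning (my own sketch, not an example from the sources just cited), the following Python snippet assumes the statsmodels library and computes the per-group sample size needed to detect a medium-sized difference between two independent means:

```python
# Hypothetical fixed-sampling power analysis for a two-sample t test.
# The effect size, alpha, and power values are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,        # Cohen's d assumed for the effect sought
    alpha=0.05,             # two-sided significance level
    power=0.80,             # planned power
    alternative='two-sided'
)
print(round(n_per_group))   # about 64 observations per group
```

Once n has been fixed in this way, the test is run a single time when the full sample is in, which is the scenario that the asymptotic theory assumes.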

In contrast to fixed sampling, sequential sampling implies that the number of observations is not fixed in advance but depends by some rule on the observations already collected (Wald, 1947 ; Anscombe, 1953 ; Wetherill, 1966 ). In practice, data are analyzed as they come in and data collection stops when the observations collected thus far satisfy some criterion. The use of sequential sampling faces two problems (Anscombe, 1953 , p. 6): (i) devising a suitable stopping rule and (ii) finding a suitable test statistic and determining its sampling distribution. The mere statement of the second problem evidences that the sampling distribution of conventional test statistics for fixed sampling no longer holds under sequential sampling. These sampling distributions are relatively easy to derive in some cases, particularly in those involving negative binomial parameters (Anscombe, 1953 ; García-Pérez and Núñez-Antón, 2009 ). The choice between fixed and sequential sampling (sometimes portrayed as the “experimenter’s intention”; see Wagenmakers, 2007 ) has important ramifications for NHST because the probability that the observed data are compatible (by any criterion) with a true null hypothesis generally differs greatly across sampling methods. This issue is usually bypassed by those who look at the data as a “sure fact” once collected, as if the sampling method used to collect the data did not make any difference or should not affect how the data are interpreted.

There are good reasons for using sequential sampling in psychological research. For instance, in clinical studies in which patients are recruited on the go, the experimenter may want to analyze data as they come in to be able to prevent the administration of a seemingly ineffective or even hurtful treatment to new patients. In studies involving a waiting-list control group, individuals in this group are generally transferred to an experimental group midway along the experiment. In studies with laboratory animals, the experimenter may want to stop testing animals before the planned number has been reached so that animals are not wasted when an effect (or the lack thereof) seems established. In these and analogous cases, the decision as to whether data will continue to be collected results from an analysis of the data collected thus far, typically using a statistical test that was devised for use in conditions of fixed sampling. In other cases, experimenters test their statistical hypothesis each time a new observation or block of observations is collected, and continue the experiment until they feel the data are conclusive one way or the other. Software has been developed that allows experimenters to find out how many more observations will be needed for a marginally non-significant result to become significant on the assumption that sample statistics will remain invariant when the extra data are collected (Morse, 1998 ).

The practice of repeated testing and optional stopping has been shown to affect in unpredictable ways the empirical Type-I error rate of statistical tests designed for use under fixed sampling (Anscombe, 1954 ; Armitage et al., 1969 ; McCarroll et al., 1992 ; Strube, 2006 ; Fitts, 2011a ). The same holds when a decision is made to collect further data on evidence of a marginally (non) significant result (Shun et al., 2001 ; Chen et al., 2004 ). The inaccuracy of statistical tests in these conditions represents a breach of SCV, because the statistical conclusion thus fails to be incorrect with the assumed (and explicitly stated) probabilities of Type-I and Type-II errors. But there is an easy way around the inflation of Type-I error rates from within NHST, which solves the threat to SCV that repeated testing and optional stopping entail.

In what appears to be the first development of a sequential procedure with control of Type-I error rates in psychology, Frick ( 1998 ) proposed that repeated statistical testing be conducted under the so-called COAST (composite open adaptive sequential test) rule: If the test yields p  < 0.01, stop collecting data and reject the null; if it yields p  > 0.36, stop also and do not reject the null; otherwise, collect more data and re-test. The low criterion at 0.01 and the high criterion at 0.36 were selected through simulations so as to ensure a final Type-I error rate of 0.05 for paired-samples t tests. Use of the same low and high criteria rendered similar control of Type-I error rates for tests of the product-moment correlation, but they yielded slightly conservative tests of the interaction in 2 × 2 between-subjects ANOVAs. Frick also acknowledged that adjusting the low and high criteria might be needed in other cases, although he did not address them. This has nevertheless been done by others who have modified and extended Frick’s approach (e.g., Botella et al., 2006 ; Ximenez and Revuelta, 2007 ; Fitts, 2010a , b , 2011b ). The result is sequential procedures with stopping rules that guarantee accurate control of final Type-I error rates for the statistical tests that are more widely used in psychological research.
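To make the COAST rule concrete, the following Monte Carlo sketch (my own illustration; the block size, starting sample size, and safety cap are arbitrary assumptions rather than part of Frick’s proposal) applies the rule to paired-samples t tests under a true null hypothesis and checks where the final Type-I error rate lands:

```python
# Simulate repeated testing with the COAST stopping rule under a true null.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)

def coast_experiment(start_n=10, block=5, max_n=500):
    """Test after each new block of pairs; stop by the COAST criteria."""
    a = rng.normal(size=start_n)
    b = rng.normal(size=start_n)            # no true difference between conditions
    while True:
        p = ttest_rel(a, b).pvalue
        if p < 0.01:                        # low criterion: stop and reject H0
            return True
        if p > 0.36 or len(a) >= max_n:     # high criterion (or cap): stop, retain H0
            return False
        a = np.concatenate([a, rng.normal(size=block)])
        b = np.concatenate([b, rng.normal(size=block)])

reps = 5000
type_i = sum(coast_experiment() for _ in range(reps)) / reps
print(type_i)   # empirical Type-I error rate, expected to be near 0.05
```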

Yet, these methods do not seem to have ever been used in actual research, or at least their use has not been acknowledged. For instance, of the nine citations to Frick’s (1998) paper listed in WoS as of May 3, 2012, only one is from a paper (published in 2011) in which the COAST rule was reportedly used, although unintentionally. And not a single citation is to be found in WoS from papers reporting the use of the extensions and modifications of Botella et al. (2006) or Ximenez and Revuelta (2007). Perhaps researchers in psychology invariably use fixed sampling, but it is hard to believe that “data peeking” or “data monitoring” was never used, or that the results of such interim analyses never led researchers to collect some more data. Wagenmakers (2007, p. 785) regretted that “it is not clear what percentage of p values reported in experimental psychology have been contaminated by some form of optional stopping. There is simply no information in Results sections that allows one to assess the extent to which optional stopping has occurred.” This incertitude was quickly resolved by John et al. (2012), who surveyed over 2000 psychologists with highly revealing results: Respondents admitted to the practices of data peeking, data monitoring, or conditional stopping at rates that varied between 20 and 60%.

Besides John et al.’s ( 2012 ) proposal that authors disclose these details in full and Simmons et al.’s ( 2011 ) proposed list of requirements for authors and guidelines for reviewers, the solution to the problem is simple: Use strategies that control Type-I error rates upon repeated testing and optional stopping. These strategies have been widely used in biomedical research for decades (Bauer and Köhne, 1994 ; Mehta and Pocock, 2011 ). There is no reason that psychological research should ignore them and give up efficient research with control of Type-I error rates, particularly when these strategies have also been adapted and further developed for use under the most common designs in psychological research (Frick, 1998 ; Botella et al., 2006 ; Ximenez and Revuelta, 2007 ; Fitts, 2010a , b ).

It should also be stressed that not all instances of repeated testing or optional stopping without control of Type-I error rates threaten SCV. A breach of SCV occurs only when the conclusion regarding the research question is based on the use of these practices. For an acceptable use, consider the study of Xu et al. ( 2011 ). They investigated order preferences in primates to find out whether primates preferred to receive the best item first rather than last. Their procedure involved several experiments and they declared that “three significant sessions (two-tailed binomial tests per session, p  < 0.05) or 10 consecutive non-significant sessions were required from each monkey before moving to the next experiment. The three significant sessions were not necessarily consecutive (…) Ten consecutive non-significant sessions were taken to mean there was no preference by the monkey” (p. 2304). In this case, the use of repeated testing with optional stopping at a nominal 95% significance level for each individual test is part of the operational definition of an outcome variable used as a criterion to proceed to the next experiment. And, in any event, the overall probability of misclassifying a monkey according to this criterion is certainly fixed at a known value that can easily be worked out from the significance level declared for each individual binomial test. One may object to the value of the resultant risk of misclassification, but this does not raise concerns about SCV.
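For readers who want to see how that misclassification probability can be worked out, here is a small simulation sketch (my own, not from Xu et al.). It treats each session’s test as coming out significant with probability α under the null of no preference, which slightly overstates the risk because the actual level of a discrete binomial test can fall below α:

```python
# Probability that a monkey with no preference is nevertheless classified as
# showing one: 3 significant sessions occur before 10 consecutive
# non-significant ones. Analytically this is (1 - 0.95**10)**3, about 0.065.
import numpy as np

rng = np.random.default_rng(7)

def misclassified_under_null(alpha=0.05):
    significant_total = 0
    consecutive_nonsig = 0
    while True:
        if rng.random() < alpha:        # this session's test is "significant"
            significant_total += 1
            consecutive_nonsig = 0
            if significant_total == 3:
                return True             # classified as having a preference
        else:
            consecutive_nonsig += 1
            if consecutive_nonsig == 10:
                return False            # classified as having no preference

reps = 50_000
print(np.mean([misclassified_under_null() for _ in range(reps)]))  # about 0.065
```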

In sum, the use of repeated testing with optional stopping threatens SCV for lack of control of Type-I and Type-II error rates. A simple way around this is to refrain from these practices and adhere to the fixed sampling assumptions of statistical tests; otherwise, use the statistical methods that have been developed for use with repeated testing and optional stopping.

Preliminary Tests of Assumptions

To derive the sampling distribution of test statistics used in parametric NHST, some assumptions must be made about the probability distribution of the observations or about the parameters of these distributions. The assumptions of normality of distributions (in all tests), homogeneity of variances (in Student’s two-sample t test for means or in ANOVAs involving between-subjects factors), sphericity (in repeated-measures ANOVAs), homoscedasticity (in regression analyses), or homogeneity of regression slopes (in ANCOVAs) are well known cases. The data on hand may or may not meet these assumptions and some parametric tests have been devised under alternative assumptions (e.g., Welch’s test for two-sample means, or correction factors for the degrees of freedom of F statistics from ANOVAs). Most introductory statistics textbooks emphasize that the assumptions underlying statistical tests must be formally tested to guide the choice of a suitable test statistic for the null hypothesis of interest. Although this recommendation seems reasonable, serious consequences on SCV arise from following it.

Numerous studies conducted over the past decades have shown that the two-stage approach of testing assumptions first and subsequently testing the null hypothesis of interest has severe effects on Type-I and Type-II error rates. It may seem at first sight that this is simply the result of cascaded binary decisions each of which has its own Type-I and Type-II error probabilities; yet, this is the result of more complex interactions of Type-I and Type-II error rates that do not have fixed (empirical) probabilities across the cases that end up treated one way or the other according to the outcomes of the preliminary test: The resultant Type-I and Type-II error rates of the conditional test cannot be predicted from those of the preliminary and conditioned tests. A thorough analysis of what factors affect the Type-I and Type-II error rates of two-stage approaches is beyond the scope of this paper but readers should be aware that nothing suggests in principle that a two-stage approach might be adequate. The situations that have been more thoroughly studied include preliminary goodness-of-fit tests for normality before conducting a one-sample t test (Easterling and Anderson, 1978 ; Schucany and Ng, 2006 ; Rochon and Kieser, 2011 ), preliminary tests of equality of variances before conducting a two-sample t test for means (Gans, 1981 ; Moser and Stevens, 1992 ; Zimmerman, 1996 , 2004 ; Hayes and Cai, 2007 ), preliminary tests of both equality of variances and normality preceding two-sample t tests for means (Rasch et al., 2011 ), or preliminary tests of homoscedasticity before regression analyses (Caudill, 1988 ; Ng and Wilcox, 2011 ). These and other studies provide evidence that strongly advises against conducting preliminary tests of assumptions. Almost all of these authors explicitly recommended against these practices and hoped for the misleading and misguided advice given in introductory textbooks to be removed. Wells and Hintze ( 2007 , p. 501) concluded that “checking the assumptions using the same data that are to be analyzed, although attractive due to its empirical nature, is a fruitless endeavor because of its negative ramifications on the actual test of interest.” The ramifications consist of substantial but unknown alterations of Type-I and Type-II error rates and, hence, a breach of SCV.
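The following simulation sketch (purely illustrative; the group sizes, variance ratio, and choice of Levene’s test as the preliminary test are my own assumptions rather than a reproduction of any of the studies cited above) shows the kind of distortion involved, contrasting the two-stage approach with unconditional use of Welch’s test:

```python
# Empirical Type-I error of a conditional (two-stage) t test vs. Welch's test.
import numpy as np
from scipy.stats import levene, ttest_ind

rng = np.random.default_rng(3)
n1, n2 = 10, 40          # unequal group sizes (assumed values)
sd1, sd2 = 2.0, 1.0      # unequal variances, but equal means: H0 is true
reps = 20_000

two_stage_rej = welch_rej = 0
for _ in range(reps):
    x = rng.normal(0.0, sd1, n1)
    y = rng.normal(0.0, sd2, n2)
    # Two-stage approach: preliminary test of equal variances, then choose test.
    if levene(x, y).pvalue > 0.05:
        p_cond = ttest_ind(x, y, equal_var=True).pvalue    # Student's t
    else:
        p_cond = ttest_ind(x, y, equal_var=False).pvalue   # Welch's t
    two_stage_rej += p_cond < 0.05
    # Unconditional Welch test on the same data.
    welch_rej += ttest_ind(x, y, equal_var=False).pvalue < 0.05

print("two-stage :", two_stage_rej / reps)   # noticeably above the nominal 0.05
print("Welch only:", welch_rej / reps)       # close to the nominal 0.05
```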

Some authors suggest that the problem can be solved by replacing the formal test of assumptions with a decision based on a suitable graphical display of the data that helps researchers judge by eye whether the assumption is tenable. It should be emphasized that the problem still remains, because the decision on how to analyze the data is conditioned on the results of a preliminary analysis. The problem is not brought about by a formal preliminary test, but by the conditional approach to data analysis. The use of a non-formal preliminary test only prevents a precise investigation of the consequences on Type-I and Type-II error rates. But the “out of sight, out of mind” philosophy does not eliminate the problem.

It thus seems that a researcher must make a choice between two evils: either not testing assumptions (and, thus, threatening SCV as a result of the uncontrolled Type-I and Type-II error rates that arise from a potentially undue application of the statistical test) or testing them (and, then, also losing control of Type-I and Type-II error rates owing to the two-stage approach). Both approaches are inadequate, as applying non-robust statistical tests to data that do not satisfy the assumptions has generally as severe implications on SCV as testing preliminary assumptions in a two-stage approach. One of the solutions to the dilemma consists of switching to statistical procedures that have been designed for use under the two-stage approach. For instance, Albers et al. ( 2000 ) used second-order asymptotics to derive the size and power of a two-stage test for independent means preceded by a test of equality of variances. Unfortunately, derivations of this type are hard to carry out and, hence, they are not available for most of the cases of interest. A second solution consists of using classical test statistics that have been shown to be robust to violation of their assumptions. Indeed, dependable unconditional tests for means or for regression parameters have been identified (see Sullivan and D’Agostino, 1992 ; Lumley et al., 2002 ; Zimmerman, 2004 , 2011 ; Hayes and Cai, 2007 ; Ng and Wilcox, 2011 ). And a third solution is switching to modern robust methods (see, e.g., Wilcox and Keselman, 2003 ; Keselman et al., 2004 ; Wilcox, 2006 ; Erceg-Hurn and Mirosevich, 2008 ; Fried and Dehling, 2011 ).

Avoidance of the two-stage approach in either of these ways will restore SCV while observing the important requirement that statistical methods should be used whose assumptions are not violated by the characteristics of the data.

Regression as a Means to Investigate Bivariate Relations of all Types

Correlational methods define one of the branches of scientific psychology (Cronbach, 1957 ) and they are still widely used these days in some areas of psychology. Whether in regression analyses or in latent variable analyses (Bollen, 2002 ), vast amounts of data are subjected to these methods. Regression analyses rely on an assumption that is often overlooked in psychology, namely, that the predictor variables have fixed values and are measured without error. This assumption, whose validity can obviously be assessed without recourse to any preliminary statistical test, is listed in all statistics textbooks.

In some areas of psychology, predictors actually have this characteristic because they are physical variables defining the magnitude of stimuli, and any error with which these magnitudes are measured (or with which stimuli with the selected magnitudes are created) is negligible in practice. Among others, this is the case in psychophysical studies aimed at estimating psychophysical functions describing the form of the relation between physical magnitude and perceived magnitude (e.g., Green, 1982 ) or psychometric functions describing the form of the relation between physical magnitude and performance in a detection, discrimination, or identification task (Armstrong and Marks, 1997 ; Saberi and Petrosyan, 2004 ; García-Pérez et al., 2011 ). Regression or analogous methods are typically used to estimate the parameters of these relations, with stimulus magnitude as the independent variable and perceived magnitude (or performance) as the dependent variable. The use of regression in these cases is appropriate because the independent variable has fixed values measured without error (or with a negligible error). Another area in which the use of regression is permissible is in simulation studies on parameter recovery (García-Pérez et al., 2010 ), where the true parameters generating the data are free of measurement error by definition.

But very few other predictor variables used in psychology meet this requirement, as they are often test scores or performance measures that are typically affected by non-negligible and sometimes large measurement error. This is the case of the proportion of hits and the proportion of false alarms in psychophysical tasks, whose theoretical relation is linear under some signal detection models (DeCarlo, 1998 ) and, thus, suggests the use of simple linear regression to estimate its parameters. Simple linear regression is also sometimes used as a complement to statistical tests of equality of means in studies in which equivalence or agreement is assessed (e.g., Maylor and Rabbitt, 1993 ; Baddeley and Wilson, 2002 ), and in these cases equivalence implies that the slope should not differ significantly from unity and that the intercept should not differ significantly from zero. The use of simple linear regression is also widespread in priming studies after Greenwald et al. ( 1995 ; see also Draine and Greenwald, 1998 ), where the intercept (and sometimes the slope) of the linear regression of priming effect on detectability of the prime are routinely subjected to NHST.

In all the cases just discussed and in many others where the X variable in the regression of Y on X is measured with error, a study of the relation between X and Y through regression is inadequate and has serious consequences on SCV. The least of these problems is that there is no basis for assigning the roles of independent and dependent variable in the regression equation (as a non-directional relation exists between the variables, often without even a temporal precedence relation), but regression parameters will differ according to how these roles are assigned. In influential papers of which most researchers in psychology seem to be unaware, Wald (1940) and Mandansky (1959) distinguished regression relations from structural relations, the latter reflecting the case in which both variables are measured with error. Both authors illustrated the consequences of fitting a regression line when a structural relation is involved and derived suitable estimators and significance tests for the slope and intercept parameters of a structural relation. This topic was brought to the attention of psychologists by Isaac (1970) in a criticism of Treisman and Watts’ (1966) use of simple linear regression to assess the equivalence of two alternative estimates of psychophysical sensitivity (d′ measures from signal detection theory analyses). The difference between regression and structural relations is briefly mentioned in passing in many elementary books on regression, the issue of fitting structural relations (sometimes referred to as Deming’s regression or the errors-in-variables regression model) is addressed in detail in most intermediate and advanced books on regression (e.g., Fuller, 1987; Draper and Smith, 1998), and hands-on tutorials have been published (e.g., Cheng and Van Ness, 1994; Dunn and Roberts, 1999; Dunn, 2007). But this type of analysis is not in the toolbox of the average researcher in psychology 1 . In contrast, recourse to this type of analysis is quite common in the biomedical sciences.

Use of this commendable method may generalize when researchers realize that estimates of the slope β and the intercept α of a structural relation can be easily computed through

$$\hat{\beta} = \frac{S_y^2 - \lambda S_x^2 + \sqrt{\left(S_y^2 - \lambda S_x^2\right)^2 + 4\lambda S_{xy}^2}}{2S_{xy}} \qquad (1)$$

$$\hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X} \qquad (2)$$

where X̄, Ȳ, S_x², S_y², and S_xy are the sample means, variances, and covariance of X and Y, and λ = σ_εy²/σ_εx² is the ratio of the variances of measurement errors in Y and in X. When X and Y are the same variable measured at different times or under different conditions (as in Maylor and Rabbitt, 1993; Baddeley and Wilson, 2002), λ = 1 can safely be assumed (for an actual application, see Smith et al., 2004). In other cases, a rough estimate can be used, as the estimates of α and β have been shown to be robust except under extreme departures of the guesstimated λ from its true value (Ketellapper, 1983).
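For researchers who want to try this, a minimal computational sketch follows (my own illustration in Python; the function name and the simulated data are assumptions, and λ = 1 is assumed as in the test–retest case just described):

```python
# Errors-in-variables (structural relation) estimates of slope and intercept.
import numpy as np

def structural_relation(x, y, lam=1.0):
    """Slope and intercept of a structural relation between X and Y;
    lam is the assumed ratio of error variances in Y and X (Eqs 1 and 2 above)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y)[0, 1]
    beta = (syy - lam * sxx + np.sqrt((syy - lam * sxx) ** 2
                                      + 4.0 * lam * sxy ** 2)) / (2.0 * sxy)
    alpha = y.mean() - beta * x.mean()
    return beta, alpha

# Illustrative data: the same latent scores measured twice with error.
rng = np.random.default_rng(0)
latent = rng.normal(10.0, 2.0, 200)
x = latent + rng.normal(0.0, 1.0, 200)
y = latent + rng.normal(0.0, 1.0, 200)
print(structural_relation(x, y))   # slope near 1, intercept near 0
# Ordinary regression of y on x would give an attenuated slope (about 0.8 here).
```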

For illustration, consider Yeshurun et al.’s (2008) comparison of signal detection theory estimates of d′ in each of the intervals of a two-alternative forced-choice task, which they pronounced different as revealed by a regression analysis through the origin. Note that this is the context in which Isaac (1970) had illustrated the inappropriateness of regression. The data are shown in Figure 1, and Yeshurun et al. rejected equality of d′1 and d′2 because the regression slope through the origin (red line, whose slope is 0.908) differed significantly from unity: The 95% confidence interval for the slope ranged between 0.844 and 0.973. Using Eqs 1 and 2, the estimated structural relation is instead given by the blue line in Figure 1. The difference seems minor by eye, but the slope of the structural relation is 0.963, which is not significantly different from unity (p = 0.738, two-tailed; see Isaac, 1970, p. 215). This outcome, which reverses a conclusion raised upon inadequate data analyses, is representative of other cases in which the null hypothesis H0: β = 1 was rejected. The reason is dual: (1) the slope of a structural relation is estimated with severe bias through regression (Riggs et al., 1978; Kalantar et al., 1995; Hawkins, 2002) and (2) regression-based statistical tests of H0: β = 1 render empirical Type-I error rates that are much higher than the nominal rate when both variables are measured with error (García-Pérez and Alcalá-Quintana, 2011).


Figure 1. Replot of data from Yeshurun et al. (2008, their Figure 8) with their fitted regression line through the origin (red line) and a fitted structural relation (blue line). The identity line is shown with dashed trace for comparison. For additional analyses bearing on the SCV of the original study, see García-Pérez and Alcalá-Quintana (2011).

In sum, SCV will improve if structural relations instead of regression equations are fitted when both variables are measured with error.

Type-I and Type-II errors are essential components of the statistical decision theory underlying NHST and, therefore, data can never be expected to answer a research question unequivocally. This paper has promoted a view of SCV that de-emphasizes consideration of these unavoidable errors and considers instead two alternative issues: (1) whether statistical tests are used that match the research design, goals of the study, and formal characteristics of the data and (2) whether they are applied in conditions under which the resultant Type-I and Type-II error rates match those that are declared as limiting the validity of the conclusion. Some examples of common threats to SCV in these respects have been discussed and simple and feasible solutions have been proposed. For reasons of space, another threat to SCV has not been covered in this paper, namely, the problems arising from multiple testing (i.e., in concurrent tests of more than one hypothesis). Multiple testing is commonplace in brain mapping studies and some implications on SCV have been discussed, e.g., by Bennett et al. ( 2009 ), Vul et al. ( 2009a , b ), and Vecchiato et al. ( 2010 ).

All the discussion in this paper has assumed the frequentist approach to data analysis. In closing, and before commenting on how SCV could be improved, a few words are in order about how Bayesian approaches fare on SCV.

The Bayesian approach

Advocates of Bayesian approaches to data analysis, hypothesis testing, and model selection (e.g., Jennison and Turnbull, 1990 ; Wagenmakers, 2007 ; Matthews, 2011 ) overemphasize the problems of the frequentist approach and praise the solutions offered by the Bayesian approach: Bayes factors (BFs) for hypothesis testing, credible intervals for interval estimation, Bayesian posterior probabilities, Bayesian information criterion (BIC) as a tool for model selection and, above all else, strict reliance on observed data and independence of the sampling plan (i.e., fixed vs. sequential sampling). There is unquestionable merit in these alternatives and a fair comparison with their frequentist counterparts requires a detailed analysis that is beyond the scope of this paper. Yet, I cannot resist the temptation of commenting on the presumed problems of the frequentist approach and also on the standing of the Bayesian approach with respect to SCV.

One of the preferred objections to p values is that they relate to data that were never collected and which, thus, should not affect the decision of what hypothesis the observed data support or fail to support. Intuitively appealing as it may seem, the argument is flawed because the referent for a p value is not other data sets that could have been observed in undone replications of the same experiment. Instead, the referent is the properties of the test statistic itself, which is guaranteed to have the declared sampling distribution when data are collected as assumed in the derivation of such distribution. Statistical tests are calibrated procedures with known properties, and this calibration is what makes their results interpretable. As is the case for any other calibrated procedure or measuring instrument, the validity of the outcome only rests on adherence to the usage specifications. And, of course, the test statistic and the resultant p value on application cannot be blamed for the consequences of a failure to collect data properly or to apply the appropriate statistical test.

Consider a two-sample t test for means. Those who need a referent may want to notice that the p value for the data from a given experiment relates to the uncountable times that such test has been applied to data from any experiment in any discipline. Calibration of the t test ensures that a proper use with a significance level of, say, 5% will reject a true null hypothesis on 5% of the occasions, no matter what the experimental hypothesis is, what the variables are, what the data are, what the experiment is about, who carries it out, or in what research field. What a p value indicates is how tenable it is that the t statistic will attain the observed value if the null were correct, with only a trivial link to the data observed in the experiment of concern. And this only places in a precise quantitative framework the logic that the man on the street uses to judge, for instance, that getting struck by lightning four times over the past 10 years is not something that could identically have happened to anybody else, or that the source of a politician’s huge and untraceable earnings is not the result of allegedly winning top lottery prizes numerous times over the past couple of years. In any case, the advantage of the frequentist approach as regards SCV is that the probability of a Type-I or a Type-II error can be clearly and unequivocally stated, which is not to be mistaken for a statement that a p value is the probability of a Type-I error in the current case, or that it is a measure of the strength of evidence against the null that the current data provide. The most prevalent problems of p values are their potential for misuse and their widespread misinterpretation (Nickerson, 2000 ). But misuse or misinterpretation do not make NHST and p values uninterpretable or worthless.
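The calibration property is easy to check empirically; a small simulation sketch (my own, under assumed normal data) follows:

```python
# Under a true null, two-sample t-test p values are uniform, so a 0.05
# criterion rejects about 5% of the time regardless of the subject matter.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(11)
pvals = np.array([ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
                  for _ in range(10_000)])
print((pvals < 0.05).mean())   # close to 0.05
```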

Bayesian approaches are claimed to be free of these presumed problems, yielding a conclusion that is exclusively grounded on the data. In a naive account of Bayesian hypothesis testing, Malakoff (1999) attributes to biostatistician Steven Goodman the assertion that the Bayesian approach “says there is an X% probability that your hypothesis is true–not that there is some convoluted chance that if you assume the null hypothesis is true, you will get a similar or more extreme result if you repeated your experiment thousands of times.” Besides being misleading and reflecting a poor understanding of the logic of calibrated NHST methods, what goes unmentioned in this and other accounts is that the Bayesian potential to find out the probability that the hypothesis is true will not materialize without two crucial extra pieces of information. One is the a priori probability of each of the competing hypotheses, which certainly does not come from the data. The other is the probability of the observed data under each of the competing hypotheses, which has the same origin as the frequentist p value and whose computation requires distributional assumptions that must necessarily take the sampling method into consideration.

In practice, Bayesian hypothesis testing generally computes BFs and the result might be stated as “the alternative hypothesis is x times more likely than the null,” although the probability that this type of statement is wrong is essentially unknown. The researcher may be content with a conclusion of this type, but how much of these odds comes from the data and how much comes from the extra assumptions needed to compute a BF is undecipherable. In many cases research aims at gathering and analyzing data to make informed decisions such as whether application of a treatment should be discontinued, whether changes should be introduced in an educational program, whether daytime headlights should be enforced, or whether in-car use of cell phones should be forbidden. Like frequentist analyses, Bayesian approaches do not guarantee that the decisions will be correct. One may argue that stating how much more likely one hypothesis is than another bypasses the decision to reject or not reject any of them and, therefore, that Bayesian approaches to hypothesis testing are free of Type-I and Type-II errors. Although this is technically correct, the problem remains from the perspective of SCV: Statistics is only a small part of a research process whose ultimate goal is to reach a conclusion and make a decision, and researchers are in a better position to defend their claims if they can supplement them with a statement of the probability with which those claims are wrong.

Interestingly, analyses of decisions based on Bayesian approaches have revealed that they are no better than frequentist decisions as regards Type-I and Type-II errors and that parametric assumptions (i.e., the choice of prior and the assumed distribution of the observations) crucially determine the performance of Bayesian methods. For instance, Bayesian estimation is also subject to potentially large bias and lack of precision (Alcalá-Quintana and García-Pérez, 2004 ; García-Pérez and Alcalá-Quintana, 2007 ), the coverage probability of Bayesian credible intervals can be worse than that of frequentist confidence intervals (Agresti and Min, 2005 ; Alcalá-Quintana and García-Pérez, 2005 ), and the Bayesian posterior probability in hypothesis testing can be arbitrarily large or small (Zaslavsky, 2010 ). On another front, use of BIC for model selection may discard a true model as often as 20% of the times, while a concurrent 0.05-size chi-square test rejects the true model between 3 and 7% of times, closely approximating its stated performance (García-Pérez and Alcalá-Quintana, 2012 ). In any case, the probabilities of Type-I and Type-II errors in practical decisions made from the results of Bayesian analyses will always be unknown and beyond control.

Improving the SCV of research

Most breaches of SCV arise from a poor understanding of statistical procedures and the resultant inadequate usage. These problems can be easily corrected, as illustrated in this paper, but the problems would not have arisen if researchers had had better statistical training in the first place. There was a time when one simply could not run statistical tests without a moderate understanding of NHST. But these days the application of statistical tests is only a mouse-click away and all that students regard as necessary is learning the rule by which p values pouring out of statistical software tell them whether the hypothesis is to be accepted or rejected, as the study of Hoekstra et al. (2012) seems to reveal.

One way to eradicate the problem is by improving statistical education at undergraduate and graduate levels, perhaps not just focusing on giving formal training on a number of methods but by providing students with the necessary foundations that will subsequently allow them to understand and apply methods for which they received no explicit formal training. In their analysis of statistical errors in published papers, Milligan and McFillen ( 1984 , p. 461) concluded that “in doing projects, it is not unusual for applied researchers or students to use or apply a statistical procedure for which they have received no formal training. This is as inappropriate as a person conducting research in a given content area before reading the existing background literature on the topic. The individual simply is not prepared to conduct quality research. The attitude that statistical technology is secondary or less important to a person’s formal training is shortsighted. Researchers are unlikely to master additional statistical concepts and techniques after leaving school. Thus, the statistical training in many programs must be strengthened. A single course in experimental design and a single course in multivariate analysis is probably insufficient for the typical student to master the course material. Someone who is trained only in theory and content will be ill-prepared to contribute to the advancement of the field or to critically evaluate the research of others.” But statistical education does not seem to have changed much over the subsequent 25 years, as revealed by survey studies conducted by Aiken et al. ( 1990 ), Friedrich et al. ( 2000 ), Aiken et al. ( 2008 ), and Henson et al. ( 2010 ). Certainly some work remains to be done in this arena, and I can only second the proposals made in the papers just cited. But there is also the problem of the unhealthy over-reliance on narrow-breadth, clickable software for data analysis, which practically obliterates any efforts that are made to teach and promote alternatives (see the list of “Pragmatic Factors” discussed by Borsboom, 2006 , pp. 431–434).

The last trench in the battle against breaches of SCV is occupied by journal editors and reviewers. Ideally, they also watch for problems in these respects. There is no known in-depth analysis of the review process in psychology journals (but see Nickerson, 2005) and some evidence reveals that the focus of the review process is not always on the quality or validity of the research (Sternberg, 2002; Nickerson, 2005). Simmons et al. (2011) and Wicherts et al. (2012) have discussed empirical evidence of inadequate research and review practices (some of which threaten SCV) and they have proposed detailed schemes through which feasible changes in editorial policies may help eradicate not only common threats to SCV but also other threats to research validity in general. I can only second proposals of this type. Reviewers and editors have the responsibility of filtering out (or requesting amendments to) research that does not meet the journal’s standards, including SCV. The analyses of Milligan and McFillen (1984) and Nieuwenhuis et al. (2011) reveal a sizeable number of published papers with statistical errors. This indicates that some work remains to be done in this arena too, and some journals have indeed started to take action (see Aickin, 2011).

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This research was supported by grant PSI2009-08800 (Ministerio de Ciencia e Innovación, Spain).

1 SPSS includes a regression procedure called “two-stage least squares,” which only implements the method described by Mandansky (1959) as “use of instrumental variables” to estimate the slope of the relation between X and Y. Use of this method requires extra variables with specific characteristics (variables which may simply not be available for the problem at hand) and differs meaningfully from the simpler and more generally applicable method discussed in the text.

  • Agresti A., Min Y. (2005). Frequentist performance of Bayesian confidence intervals for comparing proportions in 2 × 2 contingency tables. Biometrics 61, 515–523. doi: 10.1111/j.1541-0420.2005.031228.x
  • Ahn C., Overall J. E., Tonidandel S. (2001). Sample size and power calculations in repeated measurement analysis. Comput. Methods Programs Biomed. 64, 121–124. doi: 10.1016/S0169-2607(00)00095-X
  • Aickin M. (2011). Test ban: policy of the Journal of Alternative and Complementary Medicine with regard to an increasingly common statistical error. J. Altern. Complement. Med. 17, 1093–1094. doi: 10.1089/acm.2011.0878
  • Aiken L. S., West S. G., Millsap R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD programs in North America. Am. Psychol. 63, 32–50. doi: 10.1037/0003-066X.63.1.32
  • Aiken L. S., West S. G., Sechrest L., Reno R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: a survey of PhD programs in North America. Am. Psychol. 45, 721–734. doi: 10.1037/0003-066X.45.6.721
  • Albers W., Boon P. C., Kallenberg W. C. M. (2000). The asymptotic behavior of tests for normal means based on a variance pre-test. J. Stat. Plan. Inference 88, 47–57. doi: 10.1016/S0378-3758(99)00211-6
  • Alcalá-Quintana R., García-Pérez M. A. (2004). The role of parametric assumptions in adaptive Bayesian estimation. Psychol. Methods 9, 250–271. doi: 10.1037/1082-989X.9.2.250
  • Alcalá-Quintana R., García-Pérez M. A. (2005). Stopping rules in Bayesian adaptive threshold estimation. Spat. Vis. 18, 347–374. doi: 10.1163/1568568054089375
  • Anscombe F. J. (1953). Sequential estimation. J. R. Stat. Soc. Series B 15, 1–29.
  • Anscombe F. J. (1954). Fixed-sample-size analysis of sequential observations. Biometrics 10, 89–100. doi: 10.2307/3001665
  • Armitage P., McPherson C. K., Rowe B. C. (1969). Repeated significance tests on accumulating data. J. R. Stat. Soc. Ser. A 132, 235–244. doi: 10.2307/2343787
  • Armstrong L., Marks L. E. (1997). Differential effect of stimulus context on perceived length: implications for the horizontal–vertical illusion. Percept. Psychophys. 59, 1200–1213. doi: 10.3758/BF03214208
  • Austin J. T., Boyle K. A., Lualhati J. C. (1998). Statistical conclusion validity for organizational science researchers: a review. Organ. Res. Methods 1, 164–208. doi: 10.1177/109442819812002
  • Baddeley A., Wilson B. A. (2002). Prose recall and amnesia: implications for the structure of working memory. Neuropsychologia 40, 1737–1743. doi: 10.1016/S0028-3932(01)00146-4
  • Bakker M., Wicherts J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behav. Res. Methods 43, 666–678. doi: 10.3758/s13428-011-0075-y
  • Bauer P., Köhne K. (1994). Evaluation of experiments with adaptive interim analyses. Biometrics 50, 1029–1041. doi: 10.2307/2533441
  • Beins B. C. (2009). Research Methods: A Tool for Life, 2nd Edn. Boston, MA: Pearson Education.
  • Bennett C. M., Wolford G. L., Miller M. B. (2009). The principled control of false positives in neuroimaging. Soc. Cogn. Affect. Neurosci. 4, 417–422. doi: 10.1093/scan/nsp053
  • Bland J. M., Altman D. G. (2011). Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials 12, 264. doi: 10.1186/1745-6215-12-264
  • Bollen K. A. (2002). Latent variables in psychology and the social sciences. Annu. Rev. Psychol. 53, 605–634. doi: 10.1146/annurev.psych.53.100901.135239
  • Boos D. D., Hughes-Oliver J. M. (2000). How large does n have to be for Z and t intervals? Am. Stat. 54, 121–128. doi: 10.1080/00031305.2000.10474524
  • Borsboom D. (2006). The attack of the psychometricians. Psychometrika 71, 425–440. doi: 10.1007/s11336-006-1502-3
  • Botella J., Ximenez C., Revuelta J., Suero M. (2006). Optimization of sample size in controlled experiments: the CLAST rule. Behav. Res. Methods Instrum. Comput. 38, 65–76. doi: 10.3758/BF03192751
  • Campbell D. T., Stanley J. C. (1966). Experimental and Quasi-Experimental Designs for Research. Chicago, IL: Rand McNally.
  • Caudill S. B. (1988). Type I errors after preliminary tests for heteroscedasticity. Statistician 37, 65–68. doi: 10.2307/2348380
  • Chen Y. H. J., DeMets D. L., Lang K. K. G. (2004). Increasing sample size when the unblinded interim result is promising. Stat. Med. 23, 1023–1038. doi: 10.1002/sim.1617
  • Cheng C. L., Van Ness J. W. (1994). On estimating linear relationships when both variables are subject to errors. J. R. Stat. Soc. Series B 56, 167–183.
  • Cook T. D., Campbell D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston, MA: Houghton Mifflin.
  • Crawford E. D., Blumenstein B., Thompson I. (1998). Type III statistical error. Urology 51, 675. doi: 10.1016/S0090-4295(98)00124-1
  • Cronbach L. J. (1957). The two disciplines of scientific psychology. Am. Psychol. 12, 671–684. doi: 10.1037/h0043943
  • DeCarlo L. T. (1998). Signal detection theory and generalized linear models. Psychol. Methods 3, 186–205. doi: 10.1037/1082-989X.3.2.186
  • Dell R. B., Holleran S., Ramakrishnan R. (2002). Sample size determination. ILAR J. 43, 207–213.
  • Draine S. C., Greenwald A. G. (1998). Replicable unconscious semantic priming. J. Exp. Psychol. Gen. 127, 286–303. doi: 10.1037/0096-3445.127.3.286
  • Draper N. R., Smith H. (1998). Applied Regression Analysis, 3rd Edn. New York: Wiley.
  • Dunn G. (2007). Regression models for method comparison data. J. Biopharm. Stat. 17, 739–756. doi: 10.1080/10543400701329513
  • Dunn G., Roberts C. (1999). Modelling method comparison data. Stat. Methods Med. Res. 8, 161–179. doi: 10.1191/096228099668524590
  • Easterling R. G., Anderson H. E. (1978). The effect of preliminary normality goodness of fit tests on subsequent inference. J. Stat. Comput. Simul. 8, 1–11. doi: 10.1080/00949657808810243
  • Elvik R. (1998). Evaluating the statistical conclusion validity of weighted mean results in meta-analysis by analysing funnel graph diagrams. Accid. Anal. Prev. 30, 255–266. doi: 10.1016/S0001-4575(97)00076-6
  • Erceg-Hurn C. M., Mirosevich V. M. (2008). Modern robust statistical methods: an easy way to maximize the accuracy and power of your research. Am. Psychol. 63, 591–601. doi: 10.1037/0003-066X.63.7.591
  • Faul F., Erdfelder E., Lang A.-G., Buchner A. (2007). G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods 39, 175–191. doi: 10.3758/BF03193146
  • Fitts D. A. (2010a). Improved stopping rules for the design of efficient small-sample experiments in biomedical and biobehavioral research . Behav. Res. Methods 42 , 3–22 10.3758/BRM.42.1.3 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Fitts D. A. (2010b). The variable-criteria sequential stopping rule: generality to unequal sample sizes, unequal variances, or to large ANOVAs . Behav. Res. Methods 42 , 918–929 10.3758/BRM.42.1.3 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Fitts D. A. (2011a). Ethics and animal numbers: Informal analyses, uncertain sample sizes, inefficient replications, and Type I errors . J. Am. Assoc. Lab. Anim. Sci. 50 , 445–453 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Fitts D. A. (2011b). Minimizing animal numbers: the variable-criteria sequential stopping rule . Comp. Med. 61 , 206–218 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Frick R. W. (1998). A better stopping rule for conventional statistical tests . Behav. Res. Methods Instrum. Comput. 30 , 690–697 10.3758/BF03209488 [ CrossRef ] [ Google Scholar ]
  • Fried R., Dehling H. (2011). Robust nonparametric tests for the two-sample location problem . Stat. Methods Appl. 20 , 409–422 10.1007/s10260-011-0164-1 [ CrossRef ] [ Google Scholar ]
  • Friedrich J., Buday E., Kerr D. (2000). Statistical training in psychology: a national survey and commentary on undergraduate programs . Teach. Psychol. 27 , 248–257 10.1207/S15328023TOP2704_02 [ CrossRef ] [ Google Scholar ]
  • Fuller W. A. (1987). Measurement Error Models . New York: Wiley [ Google Scholar ]
  • Gans D. J. (1981). Use of a preliminary test in comparing two sample means . Commun. Stat. Simul. Comput. 10 , 163–174 10.1080/03610918108812201 [ CrossRef ] [ Google Scholar ]
  • García-Pérez M. A. (2005). On the confidence interval for the binomial parameter . Qual. Quant. 39 , 467–481 10.1007/s11135-005-0233-3 [ CrossRef ] [ Google Scholar ]
  • García-Pérez M. A., Alcalá-Quintana R. (2007). Bayesian adaptive estimation of arbitrary points on a psychometric function . Br. J. Math. Stat. Psychol. 60 , 147–174 10.1348/000711006X104596 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • García-Pérez M. A., Alcalá-Quintana R. (2011). Testing equivalence with repeated measures: tests of the difference model of two-alternative forced-choice performance . Span. J. Psychol. 14 , 1023–1049 [ PubMed ] [ Google Scholar ]
  • García-Pérez M. A., Alcalá-Quintana R. (2012). On the discrepant results in synchrony judgment and temporal-order judgment tasks: a quantitative model . Psychon. Bull. Rev. (in press). 10.3758/s13423-012-0278-y [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • García-Pérez M. A., Alcalá-Quintana R., García-Cueto M. A. (2010). A comparison of anchor-item designs for the concurrent calibration of large banks of Likert-type items . Appl. Psychol. Meas. 34 , 580–599 10.1177/0146621609351259 [ CrossRef ] [ Google Scholar ]
  • García-Pérez M. A., Alcalá-Quintana R., Woods R. L., Peli E. (2011). Psychometric functions for detection and discrimination with and without flankers . Atten. Percept. Psychophys. 73 , 829–853 10.3758/s13414-011-0167-x [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • García-Pérez M. A., Núñez-Antón V. (2009). Statistical inference involving binomial and negative binomial parameters . Span. J. Psychol. 12 , 288–307 [ PubMed ] [ Google Scholar ]
  • Girden E. R., Kabacoff R. I. (2011). Evaluating Research Articles. From Start to Finish , 3rd Edn. Thousand Oaks, CA: Sage [ Google Scholar ]
  • Goodwin C. J. (2010). Research in Psychology. Methods and Design , 6th Edn. Hoboken, NJ: Wiley [ Google Scholar ]
  • Graybill F. A. (1958). Determining sample size for a specified width confidence interval . Ann. Math. Stat. 29 , 282–287 10.1214/aoms/1177706627 [ CrossRef ] [ Google Scholar ]
  • Green B. G. (1982). The perception of distance and location for dual tactile figures . Percept. Psychophys. 31 , 315–323 10.3758/BF03206211 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Greenwald A. G., Klinger M. R., Schuh E. S. (1995). Activation by marginally perceptible (“subliminal”) stimuli: dissociation of unconscious from conscious cognition . J. Exp. Psychol. Gen. 124 , 22–42 10.1037/0096-3445.124.1.22 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Hawkins D. M. (2002). Diagnostics for conformity of paired quantitative measurements . Stat. Med. 21 , 1913–1935 10.1002/sim.1013 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Hayes A. F., Cai L. (2007). Further evaluating the conditional decision rule for comparing two independent means . Br. J. Math. Stat. Psychol. 60 , 217–244 10.1348/000711005X62576 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Henson R. K., Hull D. M., Williams C. S. (2010). Methodology in our education research culture: toward a stronger collective quantitative proficiency . Educ. Res. 39 , 229–240 10.3102/0013189X10365102 [ CrossRef ] [ Google Scholar ]
  • Hoekstra R., Kiers H., Johnson A. (2012). Are assumptions of well-known statistical techniques checked, and why (not)? Front. Psychol. 3 :137. 10.3389/fpsyg.2012.00137 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Howard G. S., Obledo F. H., Cole D. A., Maxwell S. E. (1983). Linked raters’ judgments: combating problems of statistical conclusion validity . Appl. Psychol. Meas. 7 , 57–62 10.1177/014662168300700108 [ CrossRef ] [ Google Scholar ]
  • Isaac P. D. (1970). Linear regression, structural relations, and measurement error . Psychol. Bull. 74 , 213–218 10.1037/h0029777 [ CrossRef ] [ Google Scholar ]
  • Jan S.-L., Shieh G. (2011). Optimal sample sizes for Welch’s test under various allocation and cost considerations . Behav. Res. Methods 43 , 1014–1022 10.3758/s13428-011-0095-7 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Jennison C., Turnbull B. W. (1990). Statistical approaches to interim monitoring of clinical trials: a review and commentary . Stat. Sci. 5 , 299–317 10.1214/ss/1177012095 [ CrossRef ] [ Google Scholar ]
  • John L. K., Loewenstein G., Prelec D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling . Psychol. Sci. 23 , 524–532 10.1177/0956797611430953 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kalantar A. H., Gelb R. I., Alper J. S. (1995). Biases in summary statistics of slopes and intercepts in linear regression with errors in both variables . Talanta 42 , 597–603 10.1016/0039-9140(95)01453-I [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Keselman H. J., Othman A. R., Wilcox R. R., Fradette K. (2004). The new and improved two-sample t test . Psychol. Sci. 15 , 47–51 10.1111/j.0963-7214.2004.01501008.x [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ketellapper R. H. (1983). On estimating parameters in a simple linear errors-in-variables model . Technometrics 25 , 43–47 10.1080/00401706.1983.10487818 [ CrossRef ] [ Google Scholar ]
  • Lee B. (1985). Statistical conclusion validity in ex post facto designs: practicality in evaluation . Educ. Eval. Policy Anal. 7 , 35–45 10.3102/01623737007001035 [ CrossRef ] [ Google Scholar ]
  • Lippa R. A. (2007). The relation between sex drive and sexual attraction to men and women: a cross-national study of heterosexual, bisexual, and homosexual men and women . Arch. Sex. Behav. 36 , 209–222 10.1007/s10508-006-9151-2 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lumley T., Diehr P., Emerson S., Chen L. (2002). The importance of the normality assumption in large public health data sets . Annu. Rev. Public Health 23 , 151–169 10.1146/annurev.publhealth.23.100901.140546 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Malakoff D. (1999). Bayes offers a “new” way to make sense of numbers . Science 286 , 1460–1464 10.1126/science.286.5441.883b [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Mandansky A. (1959). The fitting of straight lines when both variables are subject to error . J. Am. Stat. Assoc. 54 , 173–205 10.1080/01621459.1959.10501505 [ CrossRef ] [ Google Scholar ]
  • Matthews W. J. (2011). What might judgment and decision making research be like if we took a Bayesian approach to hypothesis testing? Judgm. Decis. Mak. 6 , 843–856 [ Google Scholar ]
  • Maxwell S. E., Kelley K., Rausch J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation . Annu. Rev. Psychol. 59 , 537–563 10.1146/annurev.psych.59.103006.093735 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Maylor E. A., Rabbitt P. M. A. (1993). Alcohol, reaction time and memory: a meta-analysis . Br. J. Psychol. 84 , 301–317 10.1111/j.2044-8295.1993.tb02485.x [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • McCarroll D., Crays N., Dunlap W. P. (1992). Sequential ANOVAs and type I error rates . Educ. Psychol. Meas. 52 , 387–393 10.1177/0013164492052002014 [ CrossRef ] [ Google Scholar ]
  • Mehta C. R., Pocock S. J. (2011). Adaptive increase in sample size when interim results are promising: a practical guide with examples . Stat. Med. 30 , 3267–3284 10.1002/sim.4102 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Milligan G. W., McFillen J. M. (1984). Statistical conclusion validity in experimental designs used in business research . J. Bus. Res. 12 , 437–462 10.1016/0148-2963(84)90024-9 [ CrossRef ] [ Google Scholar ]
  • Morse D. T. (1998). MINSIZE: a computer program for obtaining minimum sample size as an indicator of effect size . Educ. Psychol. Meas. 58 , 142–153 10.1177/0013164498058003003 [ CrossRef ] [ Google Scholar ]
  • Morse D. T. (1999). MINSIZE2: a computer program for determining effect size and minimum sample size for statistical significance for univariate, multivariate, and nonparametric tests . Educ. Psychol. Meas. 59 , 518–531 10.1177/00131649921969901 [ CrossRef ] [ Google Scholar ]
  • Moser B. K., Stevens G. R. (1992). Homogeneity of variance in the two-sample means test . Am. Stat. 46 , 19–21 10.1080/00031305.1992.10475839 [ CrossRef ] [ Google Scholar ]
  • Ng M., Wilcox R. R. (2011). A comparison of two-stage procedures for testing least-squares coefficients under heteroscedasticity . Br. J. Math. Stat. Psychol. 64 , 244–258 10.1348/000711010X508683 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Nickerson R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy . Psychol. Methods 5 , 241–301 10.1037/1082-989X.5.2.241 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Nickerson R. S. (2005). What authors want from journal reviewers and editors . Am. Psychol. 60 , 661–662 10.1037/0003-066X.60.6.661 [ CrossRef ] [ Google Scholar ]
  • Nieuwenhuis S., Forstmann B. U., Wagenmakers E.-J. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance . Nat. Neurosci. 14 , 1105–1107 10.1038/nn.2812 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Nisen J. A., Schwertman N. C. (2008). A simple method of computing the sample size for chi-square test for the equality of multinomial distributions . Comput. Stat. Data Anal. 52 , 4903–4908 10.1016/j.csda.2008.04.007 [ CrossRef ] [ Google Scholar ]
  • Orme J. G. (1991). Statistical conclusion validity for single-system designs . Soc. Serv. Rev. 65 , 468–491 10.1086/603858 [ CrossRef ] [ Google Scholar ]
  • Ottenbacher K. J. (1989). Statistical conclusion validity of early intervention research with handicapped children . Except. Child. 55 , 534–540 [ PubMed ] [ Google Scholar ]
  • Ottenbacher K. J., Maas F. (1999). How to detect effects: statistical power and evidence-based practice in occupational therapy research . Am. J. Occup. Ther. 53 , 181–188 10.5014/ajot.53.2.181 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rankupalli B., Tandon R. (2010). Practicing evidence-based psychiatry: 1. Applying a study’s findings: the threats to validity approach . Asian J. Psychiatr. 3 , 35–40 10.1016/j.ajp.2010.01.002 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rasch D., Kubinger K. D., Moder K. (2011). The two-sample t test: pre-testing its assumptions does not pay off . Stat. Pap. 52 , 219–231 10.1007/s00362-009-0224-x [ CrossRef ] [ Google Scholar ]
  • Riggs D. S., Guarnieri J. A., Addelman S. (1978). Fitting straight lines when both variables are subject to error . Life Sci. 22 , 1305–1360 10.1016/0024-3205(78)90098-X [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rochon J., Kieser M. (2011). A closer look at the effect of preliminary goodness-of-fit testing for normality for the one-sample t-test . Br. J. Math. Stat. Psychol. 64 , 410–426 10.1348/2044-8317.002003 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Saberi K., Petrosyan A. (2004). A detection-theoretic model of echo inhibition . Psychol. Rev. 111 , 52–66 10.1037/0033-295X.111.1.52 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Schucany W. R., Ng H. K. T. (2006). Preliminary goodness-of-fit tests for normality do not validate the one-sample Student t . Commun. Stat. Theory Methods 35 , 2275–2286 10.1080/03610920600853308 [ CrossRef ] [ Google Scholar ]
  • Shadish W. R., Cook T. D., Campbell D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference . Boston, MA: Houghton Mifflin [ Google Scholar ]
  • Shieh G., Jan S.-L. (2012). Optimal sample sizes for precise interval estimation of Welch’s procedure under various allocation and cost considerations . Behav. Res. Methods 44 , 202–212 10.3758/s13428-011-0139-z [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Shun Z. M., Yuan W., Brady W. E., Hsu H. (2001). Type I error in sample size re-estimations based on observed treatment difference . Stat. Med. 20 , 497–513 10.1002/sim.533 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Simmons J. P., Nelson L. D., Simoshohn U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant . Psychol. Sci. 22 , 1359–1366 10.1177/0956797611417632 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Smith P. L., Wolfgang B. F., Sinclair A. J. (2004). Mask-dependent attentional cuing effects in visual signal detection: the psychometric function for contrast . Percept. Psychophys. 66 , 1056–1075 10.3758/BF03194995 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sternberg R. J. (2002). On civility in reviewing . APS Obs. 15 , 34 [ Google Scholar ]
  • Stevens W. L. (1950). Fiducial limits of the parameter of a discontinuous distribution . Biometrika 37 , 117–129 10.2307/2332154 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Strube M. J. (2006). SNOOP: a program for demonstrating the consequences of premature and repeated null hypothesis testing . Behav. Res. Methods 38 , 24–27 10.3758/BF03192746 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sullivan L. M., D’Agostino R. B. (1992). Robustness of the t test applied to data distorted from normality by floor effects . J. Dent. Res. 71 , 1938–1943 10.1177/00220345920710121601 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Treisman M., Watts T. R. (1966). Relation between signal detectability theory and the traditional procedures for measuring sensory thresholds: estimating d’ from results given by the method of constant stimuli . Psychol. Bull. 66 , 438–454 10.1037/h0020413 [ CrossRef ] [ Google Scholar ]
  • Vecchiato G., Fallani F. V., Astolfi L., Toppi J., Cincotti F., Mattia D., Salinari S., Babiloni F. (2010). The issue of multiple univariate comparisons in the context of neuroelectric brain mapping: an application in a neuromarketing experiment . J. Neurosci. Methods 191 , 283–289 10.1016/j.jneumeth.2010.07.009 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Vul E., Harris C., Winkielman P., Pashler H. (2009a). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition . Perspect. Psychol. Sci. 4 , 274–290 10.1111/j.1745-6924.2009.01132.x [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Vul E., Harris C., Winkielman P., Pashler H. (2009b). Reply to comments on “Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition.” Perspect. Psychol. Sci. 4 , 319–324 10.1111/j.1745-6924.2009.01132.x [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wagenmakers E.-J. (2007). A practical solution to the pervasive problems of p values . Psychon. Bull. Rev. 14 , 779–804 10.3758/BF03194105 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wald A. (1940). The fitting of straight lines if both variables are subject to error . Ann. Math. Stat. 11 , 284–300 10.1214/aoms/1177731946 [ CrossRef ] [ Google Scholar ]
  • Wald A. (1947). Sequential Analysis . New York: Wiley [ Google Scholar ]
  • Wells C. S., Hintze J. M. (2007). Dealing with assumptions underlying statistical tests . Psychol. Sch. 44 , 495–502 10.1002/pits.20241 [ CrossRef ] [ Google Scholar ]
  • Wetherill G. B. (1966). Sequential Methods in Statistics . London: Chapman and Hall [ Google Scholar ]
  • Wicherts J. M., Kievit R. A., Bakker M., Borsboom D. (2012). Letting the daylight in: reviewing the reviewers and other ways to maximize transparency in science . Front. Comput. Psychol. 6 :20. 10.3389/fncom.2012.00020 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wilcox R. R. (2006). New methods for comparing groups: strategies for increasing the probability of detecting true differences . Curr. Dir. Psychol. Sci. 14 , 272–275 10.1111/j.0963-7214.2005.00379.x [ CrossRef ] [ Google Scholar ]
  • Wilcox R. R., Keselman H. J. (2003). Modern robust data analysis methods: measures of central tendency . Psychol. Methods 8 , 254–274 10.1037/1082-989X.8.3.254 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wilkinson L., The Task Force on Statistical Inference (1999). Statistical methods in psychology journals: guidelines and explanations . Am. Psychol. 54 , 594–604 10.1037/0003-066X.54.8.594 [ CrossRef ] [ Google Scholar ]
  • Ximenez C., Revuelta J. (2007). Extending the CLAST sequential rule to one-way ANOVA under group sampling . Behav. Res. Methods Instrum. Comput. 39 , 86–100 10.3758/BF03192847 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Xu E. R., Knight E. J., Kralik J. D. (2011). Rhesus monkeys lack a consistent peak-end effect . Q. J. Exp. Psychol. 64 , 2301–2315 10.1080/17470218.2011.591936 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Yeshurun Y., Carrasco M., Maloney L. T. (2008). Bias and sensitivity in two-interval forced choice procedures: tests of the difference model . Vision Res. 48 , 1837–1851 10.1016/j.visres.2007.10.015 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zaslavsky B. G. (2010). Bayesian versus frequentist hypotheses testing in clinical trials with dichotomous and countable outcomes . J. Biopharm. Stat. 20 , 985–997 10.1080/10543401003619023 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zimmerman D. W. (1996). Some properties of preliminary tests of equality of variances in the two-sample location problem . J. Gen. Psychol. 123 , 217–231 10.1080/00221309.1996.9921274 [ CrossRef ] [ Google Scholar ]
  • Zimmerman D. W. (2004). A note on preliminary tests of equality of variances . Br. J. Math. Stat. Psychol. 57 , 173–181 10.1348/000711004849222 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zimmerman D. W. (2011). A simple and effective decision rule for choosing a significance test to protect against non-normality . Br. J. Math. Stat. Psychol. 64 , 388–409 10.1348/000711010X501671 [ PubMed ] [ CrossRef ] [ Google Scholar ]

Validity – Types, Examples and Guide


Validity

Validity is a fundamental concept in research, referring to the extent to which a test, measurement, or study accurately reflects or assesses the specific concept that the researcher is attempting to measure. Ensuring validity is crucial as it determines the trustworthiness and credibility of the research findings.

Research Validity

Research validity pertains to the accuracy and truthfulness of the research. It examines whether the research truly measures what it claims to measure. Without validity, research results can be misleading or erroneous, leading to incorrect conclusions and potentially flawed applications.

How to Ensure Validity in Research

Ensuring validity in research involves several strategies:

  • Clear Operational Definitions : Define variables clearly and precisely.
  • Use of Reliable Instruments : Employ measurement tools that have been tested for reliability.
  • Pilot Testing : Conduct preliminary studies to refine the research design and instruments.
  • Triangulation : Use multiple methods or sources to cross-verify results.
  • Control Variables : Control extraneous variables that might influence the outcomes.

Types of Validity

Validity is categorized into several types, each addressing different aspects of measurement accuracy.

Internal Validity

Internal validity refers to the degree to which the results of a study can be attributed to the treatments or interventions rather than other factors. It is about ensuring that the study is free from confounding variables that could affect the outcome.

External Validity

External validity concerns the extent to which the research findings can be generalized to other settings, populations, or times. High external validity means the results are applicable beyond the specific context of the study.

Construct Validity

Construct validity evaluates whether a test or instrument measures the theoretical construct it is intended to measure. It involves ensuring that the test is truly assessing the concept it claims to represent.

Content Validity

Content validity examines whether a test covers the entire range of the concept being measured. It ensures that the test items represent all facets of the concept.

Criterion Validity

Criterion validity assesses how well scores on a measure predict or correspond to a relevant outcome or established criterion measure. It is divided into two types:

  • Predictive Validity : How well a test predicts future performance.
  • Concurrent Validity : How well a test correlates with a currently existing measure.

Face Validity

Face validity refers to the extent to which a test appears to measure what it is supposed to measure, based on superficial inspection. While it is the least scientific measure of validity, it is important for ensuring that stakeholders believe in the test’s relevance.

Importance of Validity

Validity is crucial because it directly affects the credibility of research findings. Valid results ensure that conclusions drawn from research are accurate and can be trusted. This, in turn, influences the decisions and policies based on the research.

Examples of Validity

  • Internal Validity : A randomized controlled trial (RCT) where the random assignment of participants helps eliminate biases.
  • External Validity : A study on educational interventions that can be applied to different schools across various regions.
  • Construct Validity : A psychological test that accurately measures depression levels.
  • Content Validity : An exam that covers all topics taught in a course.
  • Criterion Validity : A job performance test that predicts future job success.

Where to Write About Validity in A Thesis

In a thesis, the methodology section should include discussions about validity. Here, you explain how you ensured the validity of your research instruments and design. Additionally, you may discuss validity in the results section, interpreting how the validity of your measurements affects your findings.

Applications of Validity

Validity has wide applications across various fields:

  • Education : Ensuring assessments accurately measure student learning.
  • Psychology : Developing tests that correctly diagnose mental health conditions.
  • Market Research : Creating surveys that accurately capture consumer preferences.

Limitations of Validity

While ensuring validity is essential, it has its limitations:

  • Complexity : Achieving high validity can be complex and resource-intensive.
  • Context-Specific : Some validity types may not be universally applicable across all contexts.
  • Subjectivity : Certain types of validity, like face validity, involve subjective judgments.

By understanding and addressing these aspects of validity, researchers can enhance the quality and impact of their studies, leading to more reliable and actionable results.



Validity in Analysis, Interpretation, and Conclusions


Apollo M. Nkwake


This phase of the evaluation process uses the appropriate methods and tools for cleaning, processing, and analysis; interprets the results to determine what they mean; applies appropriate approaches for comparing, verifying, and triangulating results; lastly, documents appropriate conclusions and recommendations. Therefore, critical validity questions include the following:

Are conclusions and inferences accurately derived from evaluation data and measures that generate this data?

To what extent can findings be applied to situations other than the one in which evaluation is conducted?

The main forms of validity affected at this stage include statistical conclusion validity, internal validity, and external validity. This chapter discusses the meaning, preconditions, and assumptions of these validity types.


a) Design of the study, for example, how were participants allocated to different comparison groups and conditions?
b) Characteristics of study participants and settings (e.g., age and gender of individuals, socio-demographic features of areas).
c) Sample sizes and attrition rates.
d) Hypotheses to be tested and theories from which they are derived.
e) The operational definition and detailed description of the intervention’s theory of change (including its intensity and duration).
f) Implementation details and program delivery personnel.
g) Description of what treatment the control or other comparison groups received.
h) The operational definition and measurement of the outcome before and after the intervention.
i) The reliability and validity of outcome measures.
j) The follow-up period after the intervention (where applicable).
k) Effect size, confidence intervals, statistical significance, and statistical methods used.
l) How independent and extraneous variables were controlled so that it was possible to disentangle the impact of the intervention, or how threats to internal validity were ruled out.
m) Who knows what about the intervention? Conflict of interest issues: who funded the intervention, and how independent were the researchers? (Farrington, 2003)

Calloway, M., & Belyea, M. J. (1988). Ensuring validity using coworker samples: A situationally driven approach. Evaluation Review, 3 (2), 186–195.


Campbell, D. T. (1986). Relabeling internal and external validity for applied social scientists. In W. M. K. Trochim (Ed.), Advances in quasi-experimental design and analysis. New directions for program evaluation (31st ed., pp. 67–78). Hoboken: Wiley. (Fall).


Chen, H. T., & Garbe, P. (2011). Assessing program outcomes from the bottom-up approach: An innovative perspective to outcome evaluation. In H. T. Chen, S. I. Donaldson, & M. M. Mark (Eds.), Advancing validity in outcome evaluation: Theory and practice. New Directions for Evaluation (130th ed., pp. 93–106). Hoboken: Wiley. (summer).

Cronbach, L. H., Glesser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles . New York: John Wiley.

Dikmen, S., Reitan, R. M., & Temkin, N. R. (1983). Neuropsychological recovery in head injury. Archives of Neurology, 40, 333–338.

Farrington, D. F. (2003). Methodological quality standards for evaluation research. Annals of the American Academy of Political and Social Science, 587 (2003), 49–68.

Field, A. (2014). Discovering statistics using IBM SPSS. London: Sage.

Glasgow, R. E., Klesges, L. M., Dzewaltowski, D. A., Bull, S. S., & Estabrooks, P. (2004). The future of health behavior change research: What is needed to improve translation of research into health promotion practice? Annals of Behavioral Medicine, 27, 3–12.

Glasgow, R. E., Green, L. W., & Ammerman, A. (2007). A focus on external validity. Evaluation & the Health Professions, 3 (2), 115–117.

Green, L. W., & Glasgow, R. E. (2006). Evaluating the relevance, generalization, and applicability of research issues in external validation and translation methodology. Evaluation & the Health Professions, 29 (1), 126–153.

Hahn, G. J., & Meeker, W. Q. (1993). Assumptions for statistical inference. The American Statistician, 47 (1), 1–11.

House, E. R. (1980). The logic of evaluative argument, monograph #7. Los Angeles: Center for the Study of Evaluation, UCLA.

House, E. R. (2008). Blowback: Consequences of evaluation for evaluation. American Journal of Evaluation, 29, 416–426.

Julnes, G. (2011). Reframing validity in research and evaluation: A multidimensional, systematic model of valid inference. In H. T. Chen, S. I. Donaldson, & M. M. Mark (Eds.), Advancing validity in outcome evaluation: Theory and practice. New directions for evaluation (130th ed., pp. 55–67). Hoboken: Wiley.

Klass, G. M. (1984). Drawing inferences from policy experiments: Issues of external validity and conflict of interest. Evaluation Review, 8 (1), 3–24.

Mark, M. M. (2011). New (and old) directions for validity concerning generalizability. In H. T. Chen, S. I. Donaldson, & M. M. Mark (Eds.), Advancing validity in outcome evaluation: Theory and practice. New Directions for Evaluation (130th ed., pp. 31–42). Hoboken: Wiley.

Peck, L. R., Kim, Y., & Lucio, J. (2012). An empirical examination of validity in evaluation. American Journal of Evaluation, 0 (0), 1–16.

Reichardt, C. S. (2011). Criticisms of and an alternative to the Shadish, Cook, and Campbell validity typology. In H. T. Chen, S. I. Donaldson, & M. M. Mark (Eds.), Advancing validity in outcome evaluation: Theory and practice. New Directions for Evaluation (130th ed., pp. 43–53). Hoboken: Wiley.

Shadish, W. R., Cook, T. D., & Leviton, L. C. (1991). Foundations of program evaluation: Theories of practice . Thousand Oaks: Sage.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental design for generalized causal inference . Boston: Houghton Mifflin.

Stone, R. (1993). The assumptions on which causal inferences rest. Journal of the Royal Statistical Society. Series B (Methodological), 55 (2), 455–466.

Tebes, J. K., Snow, D. L., & Arthur, M. W. (1992). Panel attrition and external validity in the short-term follow-up study of adolescent substance use. Evaluation Review, 16 (2), 151–170.

Tunis, S. R., Stryer, D. B., & Clancy, C. M. (2003). Practical clinical trials. Increasing the value of clinical research for decision making in clinical and health policy. Journal of the American Medical Association, 290, 1624–1632.

Yeaton, W. H., & Sechrest, L. (1986). Use and misuse of no-difference findings in eliminating threats to validity. Evaluation Review, 10(6), 836–852.


About this chapter

Nkwake, A. (2015). Validity in Analysis, Interpretation, and Conclusions. In: Credibility, Validity, and Assumptions in Program Evaluation Methodology. Springer, Cham. https://doi.org/10.1007/978-3-319-19021-1_6



Reliability vs Validity in Research | Differences, Types & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 10 October 2022.

Reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method, technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure.

It’s important to consider reliability and validity when you are creating your research design, planning your methods, and writing up your results, especially in quantitative research.

Reliability vs validity

What does it tell you?
  • Reliability: The extent to which the results can be reproduced when the research is repeated under the same conditions.
  • Validity: The extent to which the results really measure what they are supposed to measure.

How is it assessed?
  • Reliability: By checking the consistency of results across time, across different observers, and across parts of the test itself.
  • Validity: By checking how well the results correspond to established theories and other measures of the same concept.

How do they relate?
  • A reliable measurement is not always valid: the results might be reproducible, but they’re not necessarily correct.
  • A valid measurement is generally reliable: if a test produces accurate results, they should be reproducible.

Table of contents

  • Understanding reliability vs validity
  • How are reliability and validity assessed?
  • How to ensure validity and reliability in your research
  • Where to write about reliability and validity in a thesis

Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid. However, if a measurement is valid, it is usually also reliable.

What is reliability?

Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable.

What is validity?

Validity refers to how accurately a method measures what it is intended to measure. If research has high validity, that means it produces results that correspond to real properties, characteristics, and variations in the physical or social world.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it probably isn’t valid.

However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it may not accurately reflect the real situation.

Validity is harder to assess than reliability, but it is even more important. To obtain useful results, the methods you use to collect your data must be valid: the research must be measuring what it claims to measure. This ensures that your discussion of the data and the conclusions you draw are also valid.


Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.

Types of reliability

Different types of reliability can be estimated through various statistical methods.

  • Test-retest reliability: the consistency of a measure across time. Do you get the same results when you repeat the measurement? Example: A group of participants complete a questionnaire designed to measure personality traits. If they repeat the questionnaire days, weeks, or months apart and give the same answers, this indicates high test-retest reliability.
  • Inter-rater reliability: the consistency of a measure across observers. Do you get the same results when different people conduct the same measurement? Example: Based on an assessment criteria checklist, five examiners submit substantially different results for the same student project. This indicates that the assessment checklist has low inter-rater reliability (for example, because the criteria are too subjective).
  • Internal consistency: the consistency of the measurement itself. Do you get the same results from different parts of a test that are designed to measure the same thing? Example: You design a questionnaire to measure self-esteem. If you randomly split the results into two halves, there should be a strong correlation between the two sets of results. If the two results are very different, this indicates low internal consistency (a computational sketch follows this list).
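The split-half check in the last item can be worked through directly. The following is a minimal sketch in Python, assuming made-up questionnaire data (the sample size, item count, and noise level are arbitrary); it correlates two half-test scores and applies the standard Spearman-Brown correction to estimate full-length reliability.

```python
import numpy as np

# Simulated data: 50 respondents answer a 10-item self-esteem questionnaire.
# Each item is a noisy reflection of one underlying trait (purely illustrative).
rng = np.random.default_rng(1)
trait = rng.normal(size=(50, 1))
items = trait + rng.normal(scale=0.8, size=(50, 10))

# Randomly split the items into two halves and score each half.
order = rng.permutation(10)
half_a = items[:, order[:5]].sum(axis=1)
half_b = items[:, order[5:]].sum(axis=1)

# Correlation between the two half-test scores.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction estimates reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)

print(f"split-half r = {r_half:.2f}, Spearman-Brown estimate = {r_full:.2f}")
```

A low corrected value would signal low internal consistency, exactly the situation described in the example above.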

Types of validity

The validity of a measurement can be estimated based on three main types of evidence. Each type can be evaluated through expert judgement or statistical methods.

  • Construct validity: the adherence of a measure to existing theory and knowledge of the concept being measured. Example: A self-esteem questionnaire could be assessed by measuring other traits known or assumed to be related to the concept of self-esteem (such as social skills and optimism). Strong correlation between the scores for self-esteem and associated traits would indicate high construct validity.
  • Content validity: the extent to which the measurement covers all aspects of the concept being measured. Example: A test that aims to measure a class of students’ level of Spanish contains reading, writing, and speaking components, but no listening component. Experts agree that listening comprehension is an essential aspect of language ability, so the test lacks content validity for measuring the overall level of ability in Spanish.
  • Criterion validity: the extent to which the result of a measure corresponds to other valid measures of the same concept. Example: A survey is conducted to measure the political opinions of voters in a region. If the results accurately predict the later outcome of an election in that region, this indicates that the survey has high criterion validity.

To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment ) and external validity (the generalisability of the results).

The reliability and validity of your results depends on creating a strong research design , choosing appropriate methods and samples, and conducting the research carefully and consistently.

Ensuring validity

If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability, or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data .

  • Choose appropriate methods of measurement

Ensure that your method and measurement technique are of high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardised questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or the findings of previous studies, and the questions should be carefully and precisely worded.

  • Use appropriate sampling methods to select your subjects

To produce valid generalisable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession). Ensure that you have enough participants and that they are representative of the population.

Ensuring reliability

Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible.

  • Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations, clearly define how specific behaviours or responses will be counted, and make sure questions are phrased the same way each time.

  • Standardise the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.

For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions.

It’s appropriate to discuss reliability and validity in various sections of your thesis or dissertation or research paper. Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.

Reliability and validity in a thesis
  • Literature review: What have other researchers done to devise and improve methods that are reliable and valid?
  • Methodology: How did you plan your research to ensure reliability and validity of the measures used? This includes the chosen sample set and size, sample preparation, external conditions, and measuring techniques.
  • Results: If you calculate reliability and validity, state these values alongside your main results.
  • Discussion: This is the moment to talk about how reliable and valid your results actually were. Were they consistent, and did they reflect true values? If not, why not?
  • Conclusion: If reliability and validity were a big problem for your findings, it might be helpful to mention this here.


Middleton, F. (2022, October 10). Reliability vs Validity in Research | Differences, Types & Examples. Scribbr. Retrieved 24 June 2024, from https://www.scribbr.co.uk/research-methods/reliability-or-validity/



Validity in Research and Psychology: Types & Examples

By Jim Frost

What is Validity in Psychology, Research, and Statistics?

Validity in research, statistics , psychology, and testing evaluates how well test scores reflect what they’re supposed to measure. Does the instrument measure what it claims to measure? Do the measurements reflect the underlying reality? Or do they quantify something else?


For example, does an intelligence test assess intelligence or another characteristic, such as education or the ability to recall facts?

Researchers need to consider whether they’re measuring what they think they’re measuring. Validity addresses the appropriateness of the data rather than whether measurements are repeatable ( reliability ). However, for a test to be valid, it must first be reliable (consistent).

Evaluating validity is crucial because it helps establish which tests to use and which to avoid. If researchers use the wrong instruments, their results can be meaningless!

Validity is usually less of a concern for tangible measurements like height and weight. You might have a cheap bathroom scale that tends to read too high or too low—but it still measures weight. For those types of measurements, you’re more interested in accuracy and precision . However, other types of measurements are not as straightforward.

Validity is often a more significant concern in psychology and the social sciences, where you measure intangible constructs such as self-esteem and positive outlook. If you’re assessing the psychological construct of conscientiousness, you need to ensure that the measurement instrument asks questions that evaluate this characteristic rather than, say, obedience.

Psychological assessments of unobservable latent constructs (e.g., intelligence, traits, abilities, proclivities, etc.) have a specific application known as test validity, which is the extent that theory and data support the interpretations of test scores. Consequently, it is a critical issue because it relates to understanding the test results.

Related post : Reliability vs Validity

Evaluating Validity

Researchers validate tests using different lines of evidence. An instrument can be strong for one type of validity but weaker for another. Consequently, it is not a black or white issue—it can have degrees.

In this vein, there are many different types of validity and ways of thinking about it. Let’s take a look at several of the more common types. Each kind is a line of evidence that can help support or refute a test’s overall validity. In this post, learn about face, content, criterion, discriminant, concurrent, predictive, and construct validity.

If you want to learn about experimental validity, read my post about internal and external validity . Those types relate to experimental design and methods.

Types of Validity

In this post, I cover the following seven types of validity:

  • Face Validity : On its face, does the instrument measure the intended characteristic?
  • Content Validity : Do the test items adequately evaluate the target topic?
  • Criterion Validity : Do measures correlate with other measures in a pattern that fits theory?
  • Discriminant Validity : Is there no correlation between measures that should not have a relationship?
  • Concurrent Validity : Do simultaneous measures of the same construct correlate?
  • Predictive Validity : Does the measure accurately predict outcomes?
  • Construct Validity : Does the instrument measure the correct attribute?

Let’s look at these types of validity in more detail!

Face Validity

Face validity is the simplest and weakest type. Does the measurement instrument appear “on its face” to measure the intended construct? For a survey that assesses thrill-seeking behavior, you’d expect it to include questions about seeking excitement, getting bored quickly, and risky behaviors. If the survey contains these questions, then “on its face,” it seems like the instrument measures the construct that the researchers intend.

While this is a low bar, it’s an important issue to consider. Never overlook the obvious. Ensure that you understand the nature of the instrument and how it assesses a construct. Look at the questions. After all, if a test can’t clear this fundamental requirement, the other types of validity are a moot point. However, when a measure satisfies face validity, understand it is an intuition or a hunch that it feels correct. It’s not a statistical assessment. If your instrument passes this low bar, you still have more validation work ahead of you.

Content Validity

Content validity is similar to face validity—but it’s a more rigorous form. The process often involves assessing individual questions on a test and asking experts whether each item appraises the characteristics that the instrument is designed to cover. This process compares the test against the researcher’s goals and the theoretical properties of the construct. Researchers systematically determine whether each question contributes, and that no aspect is overlooked.

For example, if researchers are designing a survey to measure the attitudes and activities of thrill-seekers, they need to determine whether the questions sufficiently cover both of those aspects.

Learn more about Content Validity .
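Expert ratings of the kind described above are sometimes summarized numerically. One common index, not discussed in the text but consistent with its logic, is Lawshe’s content validity ratio (CVR), computed per item from the number of experts who judge the item essential. A minimal sketch with hypothetical items and ratings:

```python
# Hypothetical panel: 8 experts rate each thrill-seeking item as essential (1) or not (0).
ratings = {
    "seeks excitement": [1, 1, 1, 1, 1, 1, 1, 0],
    "gets bored quickly": [1, 1, 1, 1, 1, 0, 0, 0],
    "enjoys risky activities": [1, 1, 1, 1, 1, 1, 1, 1],
}

for item, votes in ratings.items():
    n_experts = len(votes)
    n_essential = sum(votes)
    # Lawshe's CVR ranges from -1 (no expert says essential) to +1 (all do).
    cvr = (n_essential - n_experts / 2) / (n_experts / 2)
    print(f"{item}: CVR = {cvr:+.2f}")
```

Items with low or negative CVR values are candidates for revision or removal before the content of the instrument is considered adequate.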

Criterion Validity

Criterion validity relates to the relationships between the variables in your dataset. If your data are valid, you’d expect to observe a particular correlation pattern between the variables. Researchers typically assess criterion validity by correlating different types of data. For whatever you’re measuring, you expect it to have particular relationships with other variables.

For example, measures of anxiety should correlate positively with the number of negative thoughts. Anxiety scores might also correlate positively with depression and eating disorders. If we see this pattern of relationships, it supports criterion validity. Our measure for anxiety correlates with other variables as expected.

This type is also known as convergent validity because scores for different measures converge or correspond as theory suggests. You should observe high correlations (either positive or negative).

Related posts : Criterion Validity: Definition, Assessing, and Examples and Interpreting Correlation Coefficients

Discriminant Validity

This type is the opposite of criterion validity. If you have valid data, you expect particular pairs of variables to correlate positively or negatively. However, for other pairs of variables, you expect no relationship.

For example, if self-esteem and locus of control are not related in reality, their measures should not correlate. You should observe a low correlation between scores.

It is also known as divergent validity because it relates to how different constructs are differentiated. Low correlations (close to zero) indicate that the values of one variable do not relate to the values of the other variables—the measures distinguish between different constructs.
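A small simulation can make the contrast between the last two sections concrete: a convergent (criterion) correlation that should be clearly positive and a discriminant correlation that should sit near zero. The variable names and effect sizes below are invented purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n = 200

anxiety = rng.normal(size=n)
negative_thoughts = 0.6 * anxiety + rng.normal(scale=0.8, size=n)  # related construct
locus_of_control = rng.normal(size=n)                               # unrelated construct

r_conv, p_conv = pearsonr(anxiety, negative_thoughts)
r_disc, p_disc = pearsonr(anxiety, locus_of_control)

print(f"convergent:   r = {r_conv:.2f} (p = {p_conv:.3f})  -> expect a sizable correlation")
print(f"discriminant: r = {r_disc:.2f} (p = {p_disc:.3f})  -> expect a correlation near zero")
```

Seeing the expected pattern in real data would count as evidence for criterion and discriminant validity; seeing the opposite pattern would count against them.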

Concurrent Validity

Concurrent validity evaluates the degree to which a measure of a construct correlates with other simultaneous measures of that construct. For example, if you administer two different intelligence tests to the same group, there should be a strong, positive correlation between their scores.

Learn more about Concurrent Validity: Definition, Assessing and Examples .

Predictive Validity

Predictive validity evaluates how well a construct predicts an outcome. For example, standardized tests such as the SAT and ACT are intended to predict how high school students will perform in college. If these tests have high predictive ability, test scores will have a strong, positive correlation with college achievement. Testing this type of validity requires administering the assessment and then measuring the actual outcomes.

Learn more about Predictive Validity: Definition, Assessing and Examples .
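As a rough illustration of the test-score example above, predictive validity can be summarized by correlating (or regressing) later outcomes on earlier test scores. The scores and grades below are simulated, and the assumed effect size is arbitrary.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
n = 150

test_score = rng.normal(loc=500, scale=100, size=n)                      # admission test
college_gpa = 2.0 + 0.002 * test_score + rng.normal(scale=0.4, size=n)   # later outcome

fit = linregress(test_score, college_gpa)
# A strong positive correlation between test scores and later achievement
# would support the predictive validity of the test.
print(f"r = {fit.rvalue:.2f}, slope = {fit.slope:.4f}, p = {fit.pvalue:.3g}")
```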

Construct Validity

A test with high construct validity correctly fits into the big picture with other constructs. Consequently, this type incorporates aspects of criterion, discriminant, concurrent, and predictive validity. A construct must correlate positively and negatively with the theoretically appropriate constructs, have no correlation with the correct constructs, correlate with other measures of the same construct, etc.

Construct validity combines the theoretical relationships between constructs with empirical relationships to see how closely they align. It evaluates the full range of characteristics for the construct you’re measuring and determines whether they all correlate correctly with other constructs, behaviors, and events.

As you can see, validity is a complex issue, particularly when you’re measuring abstract characteristics. To properly validate a test, you need to incorporate a wide range of subject-area knowledge and determine whether the measurements from your instrument fit in with the bigger picture! Researchers often use factor analysis to assess construct validity. Learn more about Factor Analysis .
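Since the paragraph above points to factor analysis, here is a minimal sketch using scikit-learn; the two-factor structure and the simulated item scores are assumptions made only for illustration, not a recipe for a real construct validation.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
n = 300

# Two latent constructs each drive three observed items (plus noise).
f1, f2 = rng.normal(size=n), rng.normal(size=n)
items = np.column_stack([f1, f1, f1, f2, f2, f2]) + rng.normal(scale=0.5, size=(n, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)

# The loading pattern should group items 1-3 on one factor and items 4-6 on
# the other; recovering the theorized structure is the kind of evidence that
# supports construct validity.
print(np.round(fa.components_, 2))
```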

For more in-depth information, read my article about Construct Validity .

Learn more about Experimental Design: Definition, Types, and Examples .

Nevo, Baruch (1985), Face Validity Revisited , Journal of Educational Measurement.


Statistical Conclusion Validity


You may find it helpful to read this article first: Reliability and Validity in Research.

What is Statistical Conclusion Validity?


It’s important to realize that there’s no such thing as perfect validity. Type 1 errors and Type 2 errors are a part of any testing process, so you can never be 100% certain that your conclusions are correct. However, SCV refers to reasonable conclusions based on your data — not perfect ones.
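
To make the point concrete, the small simulation below estimates how often a two-sample t-test produces a Type 1 and a Type 2 error under hypothetical settings (two groups of 30, alpha = .05, and a true difference of half a standard deviation in the second scenario); all values are illustrative assumptions.

```python
# A minimal simulation of Type 1 and Type 2 error rates for a two-sample
# t-test (hypothetical settings), illustrating why conclusions are never
# 100% certain.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
alpha, n, reps = 0.05, 30, 2000

# Type 1 error: both groups drawn from the same population
type1 = np.mean([ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
                 for _ in range(reps)])
# Type 2 error: a true difference of 0.5 SD exists but may be missed
type2 = np.mean([ttest_ind(rng.normal(0, 1, n), rng.normal(0.5, 1, n)).pvalue >= alpha
                 for _ in range(reps)])
print(f"Type 1 error rate ~ {type1:.3f}, Type 2 error rate ~ {type2:.3f}")
```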

Threats to Statistical Conclusion Validity

Threats lead you to make incorrect conclusions about relationships. They include:

  • Fishing (mining the data and repeating tests to find something…anything! significant…): can result in incorrectly concluding there is a relationship when in fact there is not.
  • Low statistical power can cause you to incorrectly conclude there is no relationship between your variables.
  • Poor reliability of treatment implementation: if you haven’t used standard procedures and protocols, it could cause you to underestimate effects.
  • Random irrelevancies in the setting: this means any distraction, from weather that’s too hot to dealing with cantankerous people.
  • Restriction of range: limited variability in scores (for example, from ceiling or floor effects) attenuates correlations and can lead to incorrect estimates of the relationship.
  • Unreliable measures : can result in over- or underestimating the size of the relationship between variables.
  • Violated assumptions for tests : can cause a multitude of problems including overestimating or underestimating effects.

Other Types of Validity

Three other types of validity are used to analyze research and tests:

  • External Validity : the extent to which the results of the test or research generalize to other settings, people, and times.
  • Internal Validity : the extent to which observed effects can be attributed to the independent variable rather than to confounding factors.
  • Construct Validity : the extent to which the test or research actually measures the theoretical construct it is intended to measure.

Design and Analysis of Time Series Experiments


6 Statistical Conclusion Validity

  • Published: May 2017

Chapter 6 addresses the sub-category of internal validity defined by Shadish et al. as statistical conclusion validity, or “validity of inferences about the correlation (covariance) between treatment and outcome.” The common threats to statistical conclusion validity can arise, or become plausible, through either model misspecification or hypothesis testing. The risk of a serious model misspecification is inversely proportional to the length of the time series, for example, and so is the risk of misstating the Type I and Type II error rates. Threats to statistical conclusion validity arise from the classical and modern hybrid significance testing structures; the serious threats that weigh heavily in p-value tests are shown to be undefined in Bayesian tests. While the particularly vexing threats raised by modern null hypothesis testing could be resolved by eliminating the modern null hypothesis test, threats to statistical conclusion validity would inevitably persist and new threats would arise.


Validity In Psychology Research: Types & Examples

Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

In psychology research, validity refers to the extent to which a test or measurement tool accurately measures what it’s intended to measure. It ensures that the research findings are genuine and not due to extraneous factors.

Validity can be categorized into different types based on internal and external validity .

The concept of validity was formulated by Kelley (1927, p. 14), who stated that a test is valid if it measures what it claims to measure. For example, a test of intelligence should measure intelligence and not something else (such as memory).

Internal and External Validity In Research

Internal validity refers to whether the effects observed in a study are due to the manipulation of the independent variable and not some other confounding factor.

In other words, there is a causal relationship between the independent and dependent variables .

Internal validity can be improved by controlling extraneous variables, using standardized instructions, counterbalancing, and eliminating demand characteristics and investigator effects.

External validity refers to the extent to which the results of a study can be generalized to other settings (ecological validity), other people (population validity), and over time (historical validity).

External validity can be improved by setting experiments more naturally and using random sampling to select participants.

Types of Validity In Psychology

Two main categories of validity are used to assess the validity of the test (i.e., questionnaire, interview, IQ test, etc.): Content and criterion.

  • Content validity refers to the extent to which a test or measurement represents all aspects of the intended content domain. It assesses whether the test items adequately cover the topic or concept.
  • Criterion validity assesses the performance of a test based on its correlation with a known external criterion or outcome. It can be further divided into concurrent (measured at the same time) and predictive (measuring future performance) validity.

[Table: the different types of validity]

Face Validity

Face validity is simply whether the test appears (at face value) to measure what it claims to. This is the least sophisticated measure of content-related validity, and is a superficial and subjective assessment based on appearance.

Tests wherein the purpose is clear, even to naïve respondents, are said to have high face validity. Accordingly, tests wherein the purpose is unclear have low face validity (Nevo, 1985).

A direct measurement of face validity is obtained by asking people to rate the validity of a test as it appears to them. This rater could use a Likert scale to assess face validity.

For example:

  • The test is extremely suitable for a given purpose
  • The test is very suitable for that purpose;
  • The test is adequate
  • The test is inadequate
  • The test is irrelevant and, therefore, unsuitable

It is important to select suitable people to rate a test (e.g., questionnaire, interview, IQ test, etc.). For example, individuals who actually take the test would be well placed to judge its face validity.

Also, people who work with the test could offer their opinion (e.g., employers, university administrators). Finally, the researcher could use members of the general public with an interest in the test (e.g., parents of testees, politicians, teachers, etc.).

The face validity of a test can be considered a robust construct only if a reasonable level of agreement exists among raters.
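
A minimal, hypothetical sketch of summarizing such ratings is shown below: it simply reports the mean rating and the proportion of raters who judge the test at least “adequate” on the scale described above. The ratings and the cut-off are illustrative assumptions; a formal agreement index would be a further step.

```python
# A minimal sketch of summarizing face validity ratings from several raters
# (hypothetical data), using a 1-5 scale where 5 = extremely suitable and
# 1 = irrelevant/unsuitable.
import numpy as np

ratings = np.array([4, 5, 4, 3, 4, 5, 4, 4, 3, 4])  # hypothetical ratings from 10 raters

mean_rating = ratings.mean()
agreement = np.mean(ratings >= 3)  # share of raters judging the test at least "adequate"
print(f"mean face validity rating = {mean_rating:.1f}")
print(f"proportion rating the test adequate or better = {agreement:.0%}")
```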

It should be noted that the term face validity should be avoided when the rating is done by an “expert,” as content validity is more appropriate.

Having face validity does not mean that a test really measures what the researcher intends to measure, but only that, in the judgment of raters, it appears to do so. Consequently, it is a crude and basic measure of validity.

A test item such as “ I have recently thought of killing myself ” has obvious face validity as an item measuring suicidal cognitions and may be useful when measuring symptoms of depression.

However, the implication of items on tests with clear face validity is that they are more vulnerable to social desirability bias. Individuals may manipulate their responses to deny or hide problems or exaggerate behaviors to present a positive image of themselves.

It is possible for a test item to lack face validity but still have general validity and measure what it claims to measure. This is good because it reduces demand characteristics and makes it harder for respondents to manipulate their answers.

For example, the test item “ I believe in the second coming of Christ ” would lack face validity as a measure of depression (as the purpose of the item is unclear).

This item appeared on the first version of The Minnesota Multiphasic Personality Inventory (MMPI) and loaded on the depression scale.

Because most of the original normative sample of the MMPI were good Christians, only a depressed Christian would think Christ is not coming back. Thus, for this particular religious sample, the item does have general validity but not face validity.

Construct Validity

Construct validity assesses how well a test or measure represents and captures an abstract theoretical concept, known as a construct. It indicates the degree to which the test accurately reflects the construct it intends to measure, often evaluated through relationships with other variables and measures theoretically connected to the construct.

The concept of construct validity was introduced by Cronbach and Meehl (1955). It refers to the extent to which a test captures a specific theoretical construct or trait, and it overlaps with some of the other aspects of validity.

Construct validity does not concern the simple, factual question of whether a test measures an attribute.

Instead, it is about the complex question of whether test score interpretations are consistent with a nomological network involving theoretical and observational terms (Cronbach & Meehl, 1955).

To test for construct validity, it must be demonstrated that the phenomenon being measured actually exists. So, the construct validity of a test for intelligence, for example, depends on a model or theory of intelligence .

Construct validity entails demonstrating the power of such a construct to explain a network of research findings and to predict further relationships.

The more evidence a researcher can demonstrate for a test’s construct validity, the better. However, there is no single method of determining the construct validity of a test.

Instead, different methods and approaches are combined to present the overall construct validity of a test. For example, factor analysis and correlational methods can be used.

Convergent validity

Convergent validity is a subtype of construct validity. It assesses the degree to which two measures that theoretically should be related are related.

It demonstrates that measures of similar constructs are highly correlated. It helps confirm that a test accurately measures the intended construct by showing its alignment with other tests designed to measure the same or similar constructs.

For example, suppose there are two different scales used to measure self-esteem:

Scale A and Scale B. If both scales effectively measure self-esteem, then individuals who score high on Scale A should also score high on Scale B, and those who score low on Scale A should score similarly low on Scale B.

If the scores from these two scales show a strong positive correlation, then this provides evidence for convergent validity because it indicates that both scales seem to measure the same underlying construct of self-esteem.
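
The sketch below mirrors this Scale A/Scale B example with simulated, hypothetical scores: both scales are generated as noisy reflections of the same underlying self-esteem level, so their correlation should be strong and positive. The sample size and noise levels are illustrative assumptions.

```python
# A minimal sketch of the Scale A / Scale B convergent validity check
# (hypothetical data): two self-esteem measures should correlate strongly.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
true_self_esteem = rng.normal(0, 1, 250)
scale_a = true_self_esteem + rng.normal(0, 0.4, 250)  # hypothetical Scale A scores
scale_b = true_self_esteem + rng.normal(0, 0.4, 250)  # hypothetical Scale B scores

r, _ = pearsonr(scale_a, scale_b)
print(f"convergent validity correlation r = {r:.2f}")  # a strong positive r supports convergence
```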

Concurrent Validity (i.e., occurring at the same time)

Concurrent validity evaluates how well a test’s results correlate with the results of a previously established and accepted measure, when both are administered at the same time.

It helps in determining whether a new measure is a good reflection of an established one without waiting to observe outcomes in the future.

If the new test is validated by comparison with a currently existing criterion, we have concurrent validity.

Very often, a new IQ or personality test might be compared with an older but similar test known to have good validity already.

Predictive Validity

Predictive validity assesses how well a test predicts a criterion that will occur in the future. It measures the test’s ability to foresee the performance of an individual on a related criterion measured at a later point in time. It gauges the test’s effectiveness in predicting subsequent real-world outcomes or results.

For example, a prediction may be made on the basis of a new intelligence test that high scorers at age 12 will be more likely to obtain university degrees several years later. If the prediction is born out, then the test has predictive validity.
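
As a hypothetical illustration of this kind of check, the sketch below relates simulated test scores at age 12 to a simulated binary outcome (degree obtained or not) using a point-biserial correlation. The data and the assumed relationship are illustrative assumptions, not real findings.

```python
# A minimal sketch of the age-12 test / later degree example (hypothetical
# data): a point-biserial correlation relates test scores to a binary
# outcome observed years later.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(11)
iq_at_12 = rng.normal(100, 15, 400)
# hypothetical outcome: probability of obtaining a degree rises with the score
degree = (rng.random(400) < 1 / (1 + np.exp(-(iq_at_12 - 100) / 10))).astype(int)

r, p = pointbiserialr(degree, iq_at_12)
print(f"point-biserial r = {r:.2f}, p = {p:.3g}")  # a clear positive r supports predictive validity
```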

Cronbach, L. J., and Meehl, P. E. (1955) Construct validity in psychological tests. Psychological Bulletin , 52, 281-302.

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory . New York: Psychological Corporation.

Kelley, T. L. (1927). Interpretation of educational measurements. New York : Macmillan.

Nevo, B. (1985). Face validity revisited . Journal of Educational Measurement , 22(4), 287-293.



Validity implies precise and exact results acquired from the data collected.  In technical terms, a valid measure allows proper and correct conclusions to be drawn from the sample and generalized to the entire population.

Four Major Types:

1. Internal validity: The extent to which the relationship between the independent and dependent variables is genuinely causal.  It is associated with the design of the experiment and is relevant only in studies that try to establish a causal relationship, for example through the random assignment of treatments.

2. External validity: The extent to which a causal relationship between cause and effect can be transferred to other people, treatments, settings, and measurement variables that differ from those in the original study.

3. Statistical conclusion validity: The extent to which the conclusion reached, or inference drawn, about the relationship between two variables is correct. For instance, it is at stake whenever we aim to estimate the strength of the relationship between two variables under observation and analysis.  If we reach the correct conclusion, statistical conclusion validity holds. Two types of error threaten statistical conclusion validity:

a. Type one error: Concluding that there is a relationship between two variables (rejecting a true null hypothesis) when in reality there is no relationship between them.  This false-positive conclusion can be particularly damaging.

b. Type two error: Failing to reject a false null hypothesis; that is, concluding there is no relationship between the variables when one actually exists.

In assessing statistical conclusion validity, power analysis is used to determine whether a study can detect the relationship (see the sketch below).  Several problems can undermine a statistical conclusion.  For instance, if a small sample size is used, the result may well be incorrect; to avoid this, the sample should be of adequate size.  Statistical conclusion validity is also threatened by violations of statistical assumptions.  The results may likewise be inaccurate if the values entering the analysis are biased or if the wrong statistical test is applied.
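
A minimal sketch of such a power analysis, using statsmodels and illustrative values (a medium effect size of 0.5, alpha = .05, desired power = .80), is shown below; the chosen effect size and targets are assumptions, not recommendations for any particular study.

```python
# A minimal sketch of a power analysis for a two-sample t-test, using
# statsmodels (the effect size, alpha, and power values are illustrative).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# sample size per group needed to detect a medium effect (d = 0.5)
# with alpha = .05 and power = .80
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required n per group ~ {n_per_group:.0f}")
```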

4. Construct validity: The extent to which a measurement actually represents the construct it is measuring.  For instance, in structural equation modeling , when we specify a construct, we expect the factor loadings for its indicators to be greater than .70.  Cronbach’s alpha is commonly reported alongside this evidence: .60 is accepted for exploratory purposes, .70 for confirmatory purposes, and .80 is considered good.  If the construct satisfies these expectations, it should be useful for predicting relationships with dependent variables.  Convergent/divergent validation and factor analysis are also used to test construct validity.
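
For reference, the sketch below computes Cronbach’s alpha from a hypothetical item-score matrix using its standard formula; the data and the number of items are assumptions chosen only for illustration.

```python
# A minimal sketch of computing Cronbach's alpha from an item-score matrix
# (hypothetical data: rows = respondents, columns = items).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 1))
scores = latent + rng.normal(scale=0.7, size=(300, 5))  # five hypothetical items
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")  # .70+ is commonly considered acceptable
```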

Relationship between reliability and validity: A test that is unreliable cannot be valid, and any test that is valid must be reliable.  From this it follows that validity plays a significant role in analysis, as it ensures that accurate conclusions are drawn.

Overall threats:

1. Insufficient data collected to make a valid conclusion
2. Measurement done with too few measurement variables
3. Too much variation in data or outliers in data
4. Wrong selection of samples
5. Inaccurate measurement method taken for analysis

Bagozzi, R. P., Yi, Y., & Phillips, L. W. (1991). Assessing construct validity in organizational research. Administrative Science Quarterly, 36 (3), 421-458.

Brinkman, W. -P., Haakma, R., & Bouwhuis, D. G. (2009). The theoretical foundation and validity of a component-based usability questionnaire. Behaviour & Information Technology, 28 (2), 121-137.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52 , 281-302.

Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18 (1), 39-50.

Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement , 6 (5), 427-439.

Krause, M. S. (1972). The implications of convergent and discriminant validity data for instrument validation. Psychometrika, 37 (2), 179-186.

Lieberman, D. Z. (2008). Evaluation of the stability and validity of participant samples recruited over the internet. CyberPsychology & Behavior, 11 (6), 743-746.

Lozano, L. M., Carcía-Cueto, E., & Muñoz, J. (2008). Effect of the number of response categories on the reliability and validity of rating scales. Methodology, 4 (2), 73-79.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Education measurement (3rd ed., pp. 13-103). Washington, DC: American Council on Education.

Moret, M., Reuzel, R., van der Wilt, G. J., & Grin, J. (2007). Validity and reliability of qualitative data analysis: Interobserver agreement in reconstructing interpretative frames. Field Methods, 19 (1), 24-39.

Rosenbaum, P. R. (1989). Criterion-related construct validity. Psychometrika, 54 (4), 625-659.

Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19 , 405-450.


Writing a Research Paper Conclusion | Step-by-Step Guide

Published on October 30, 2022 by Jack Caulfield. Revised on April 13, 2023.

  • Restate the problem statement addressed in the paper
  • Summarize your overall arguments or findings
  • Suggest the key takeaways from your paper


The content of the conclusion varies depending on whether your paper presents the results of original empirical research or constructs an argument through engagement with sources .


Table of contents

  • Step 1: Restate the problem
  • Step 2: Sum up the paper
  • Step 3: Discuss the implications
  • Research paper conclusion examples
  • Frequently asked questions about research paper conclusions

The first task of your conclusion is to remind the reader of your research problem . You will have discussed this problem in depth throughout the body, but now the point is to zoom back out from the details to the bigger picture.

While you are restating a problem you’ve already introduced, you should avoid phrasing it identically to how it appeared in the introduction . Ideally, you’ll find a novel way to circle back to the problem from the more detailed ideas discussed in the body.

For example, an argumentative paper advocating new measures to reduce the environmental impact of agriculture might restate its problem as follows:

Meanwhile, an empirical paper studying the relationship of Instagram use with body image issues might present its problem like this:

“In conclusion …”

Avoid starting your conclusion with phrases like “In conclusion” or “To conclude,” as this can come across as too obvious and make your writing seem unsophisticated. The content and placement of your conclusion should make its function clear without the need for additional signposting.


Having zoomed back in on the problem, it’s time to summarize how the body of the paper went about addressing it, and what conclusions this approach led to.

Depending on the nature of your research paper, this might mean restating your thesis and arguments, or summarizing your overall findings.

Argumentative paper: Restate your thesis and arguments

In an argumentative paper, you will have presented a thesis statement in your introduction, expressing the overall claim your paper argues for. In the conclusion, you should restate the thesis and show how it has been developed through the body of the paper.

Briefly summarize the key arguments made in the body, showing how each of them contributes to proving your thesis. You may also mention any counterarguments you addressed, emphasizing why your thesis holds up against them, particularly if your argument is a controversial one.

Don’t go into the details of your evidence or present new ideas; focus on outlining in broad strokes the argument you have made.

Empirical paper: Summarize your findings

In an empirical paper, this is the time to summarize your key findings. Don’t go into great detail here (you will have presented your in-depth results and discussion already), but do clearly express the answers to the research questions you investigated.

Describe your main findings, even if they weren’t necessarily the ones you expected or hoped for, and explain the overall conclusion they led you to.

Having summed up your key arguments or findings, the conclusion ends by considering the broader implications of your research. This means expressing the key takeaways, practical or theoretical, from your paper—often in the form of a call for action or suggestions for future research.

Argumentative paper: Strong closing statement

An argumentative paper generally ends with a strong closing statement. In the case of a practical argument, make a call for action: What actions do you think should be taken by the people or organizations concerned in response to your argument?

If your topic is more theoretical and unsuitable for a call for action, your closing statement should express the significance of your argument—for example, in proposing a new understanding of a topic or laying the groundwork for future research.

Empirical paper: Future research directions

In a more empirical paper, you can close by either making recommendations for practice (for example, in clinical or policy papers), or suggesting directions for future research.

Whatever the scope of your own research, there will always be room for further investigation of related topics, and you’ll often discover new questions and problems during the research process .

Finish your paper on a forward-looking note by suggesting how you or other researchers might build on this topic in the future and address any limitations of the current paper.

Full examples of research paper conclusions are shown in the tabs below: one for an argumentative paper, the other for an empirical paper.

  • Argumentative paper
  • Empirical paper

While the role of cattle in climate change is by now common knowledge, countries like the Netherlands continually fail to confront this issue with the urgency it deserves. The evidence is clear: To create a truly futureproof agricultural sector, Dutch farmers must be incentivized to transition from livestock farming to sustainable vegetable farming. As well as dramatically lowering emissions, plant-based agriculture, if approached in the right way, can produce more food with less land, providing opportunities for nature regeneration areas that will themselves contribute to climate targets. Although this approach would have economic ramifications, from a long-term perspective, it would represent a significant step towards a more sustainable and resilient national economy. Transitioning to sustainable vegetable farming will make the Netherlands greener and healthier, setting an example for other European governments. Farmers, policymakers, and consumers must focus on the future, not just on their own short-term interests, and work to implement this transition now.

As social media becomes increasingly central to young people’s everyday lives, it is important to understand how different platforms affect their developing self-conception. By testing the effect of daily Instagram use among teenage girls, this study established that highly visual social media does indeed have a significant effect on body image concerns, with a strong correlation between the amount of time spent on the platform and participants’ self-reported dissatisfaction with their appearance. However, the strength of this effect was moderated by pre-test self-esteem ratings: Participants with higher self-esteem were less likely to experience an increase in body image concerns after using Instagram. This suggests that, while Instagram does impact body image, it is also important to consider the wider social and psychological context in which this usage occurs: Teenagers who are already predisposed to self-esteem issues may be at greater risk of experiencing negative effects. Future research into Instagram and other highly visual social media should focus on establishing a clearer picture of how self-esteem and related constructs influence young people’s experiences of these platforms. Furthermore, while this experiment measured Instagram usage in terms of time spent on the platform, observational studies are required to gain more insight into different patterns of usage—to investigate, for instance, whether active posting is associated with different effects than passive consumption of social media content.

If you’re unsure about the conclusion, it can be helpful to ask a friend or fellow student to read your conclusion and summarize the main takeaways.

  • Do they understand from your conclusion what your research was about?
  • Are they able to summarize the implications of your findings?
  • Can they answer your research question based on your conclusion?



The conclusion of a research paper has several key elements you should make sure to include:

  • A restatement of the research problem
  • A summary of your key arguments and/or findings
  • A short discussion of the implications of your research

No, it’s not appropriate to present new arguments or evidence in the conclusion . While you might be tempted to save a striking argument for last, research papers follow a more formal structure than this.

All your findings and arguments should be presented in the body of the text (more specifically in the results and discussion sections if you are following a scientific structure). The conclusion is meant to summarize and reflect on the evidence and arguments you have already presented, not introduce new ones.

Appraising systematic reviews: a comprehensive guide to ensuring validity and reliability


Systematic reviews play a crucial role in evidence-based practices as they consolidate research findings to inform decision-making. However, it is essential to assess the quality of systematic reviews to prevent biased or inaccurate conclusions. This paper underscores the importance of adhering to recognized guidelines, such as the PRISMA statement and Cochrane Handbook. These recommendations advocate for systematic approaches and emphasize the documentation of critical components, including the search strategy and study selection. A thorough evaluation of methodologies, research quality, and overall evidence strength is essential during the appraisal process. Identifying potential sources of bias and review limitations, such as selective reporting or trial heterogeneity, is facilitated by tools like the Cochrane Risk of Bias and the AMSTAR 2 checklist. The assessment of included studies emphasizes formulating clear research questions and employing appropriate search strategies to construct robust reviews. Relevance and bias reduction are ensured through meticulous selection of inclusion and exclusion criteria. Accurate data synthesis, including appropriate data extraction and analysis, is necessary for drawing reliable conclusions. Meta-analysis, a statistical method for aggregating trial findings, improves the precision of treatment impact estimates. Systematic reviews should consider crucial factors such as addressing biases, disclosing conflicts of interest, and acknowledging review and methodological limitations. This paper aims to enhance the reliability of systematic reviews, ultimately improving decision-making in healthcare, public policy, and other domains. It provides academics, practitioners, and policymakers with a comprehensive understanding of the evaluation process, empowering them to make well-informed decisions based on robust data.
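To illustrate the point about meta-analysis improving the precision of treatment impact estimates, here is a minimal, self-contained sketch of fixed-effect inverse-variance pooling. The effect estimates and standard errors below are made up for demonstration and are not data from any review discussed in the paper; the technique itself (weighting each study by the inverse of its variance) is standard.

```python
import math

# Hypothetical study results: (effect estimate, standard error)
studies = [(0.30, 0.15), (0.45, 0.20), (0.25, 0.10), (0.38, 0.18)]

weights = [1.0 / se**2 for _, se in studies]          # inverse-variance weights
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled estimate: {pooled:.3f} (SE = {pooled_se:.3f})")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Because the weights are inverse variances, the pooled standard error (about 0.07 in this illustration) is smaller than the standard error of any single study, which is what "improved precision" means in practice.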

Keywords: bias evaluation; quality assessment; systematic review; systematic review appraisal; systematic review methodology.

Copyright © 2023 Shaheen, Shaheen, Ramadan, Hefnawy, Ramadan, Ibrahim, Hassanein, Ashour and Flouty.


Conflict of interest statement

Disclosure is widely used as a way to manage a competing interest, which is a significant cause of bias in research, particularly given the significance of systematic reviews and the differing incidence of conflicting interests across research disciplines. For researchers, reviewers, and editors, the recognition and declaration of competing interests, particularly nonfinancial interests, continues to be difficult. To identify and reduce potential conflicting interests in systematic reviews, the International Committee of Medical Journal Editors (ICMJE) must continue to create more effective and efficient tools (Yu et al., 2020). The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure: Boolean operators; (A) describe using AND in the search strategy, (B) describe using…




Threats to Conclusion Validity


A threat to conclusion validity is a factor that can lead you to reach an incorrect conclusion about a relationship in your observations. You can essentially make two kinds of errors about relationships:

  • Conclude that there is no relationship when in fact there is (you missed the relationship or didn’t see it)
  • Conclude that there is a relationship when in fact there is not (you’re seeing things that aren’t there!)

Most threats to conclusion validity have to do with the first problem. Why? Maybe it’s because it’s so hard in most research to find relationships in our data at all that it’s not as big or frequent a problem — we tend to have more problems finding the needle in the haystack than seeing things that aren’t there! So, I’ll divide the threats by the type of error they are associated with.

Finding no relationship when there is one (or, “missing the needle in the haystack”)

When you’re looking for the needle in the haystack you essentially have two basic problems: the tiny needle and too much hay. You can view this as a signal-to-noise ratio problem. The “signal” is the needle — the relationship you are trying to see. The “noise” consists of all of the factors that make it hard to see the relationship. There are several important sources of noise, each of which is a threat to conclusion validity. One important threat is low reliability of measures (see reliability). This can be due to many factors including poor question wording, bad instrument design or layout, illegibility of field notes, and so on. In studies where you are evaluating a program you can introduce noise through poor reliability of treatment implementation. If the program doesn’t follow the prescribed procedures or is inconsistently carried out, it will be harder to see relationships between the program and other factors like the outcomes. Noise that is caused by random irrelevancies in the setting can also obscure your ability to see a relationship. In a classroom context, the traffic outside the room, disturbances in the hallway, and countless other irrelevant events can distract the researcher or the participants. The types of people you have in your study can also make it harder to see relationships. The threat here is due to random heterogeneity of respondents. If you have a very diverse group of respondents, they are likely to vary more widely on your measures or observations. Some of their variety may be related to the phenomenon you are looking at, but at least part of it is likely to just constitute individual differences that are irrelevant to the relationship being observed.

All of these threats add variability into the research context and contribute to the “noise” relative to the signal of the relationship you are looking for. But noise is only one part of the problem. We also have to consider the issue of the signal — the true strength of the relationship. There is one broad threat to conclusion validity that tends to subsume or encompass all of the noise-producing factors above and also takes into account the strength of the signal, the amount of information you collect, and the amount of risk you’re willing to take in making a decision about whether a relationship exists. This threat is called low statistical power. Because this idea is so important in understanding how we make decisions about relationships, we have a separate discussion of statistical power.
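To make “low statistical power” concrete, the following sketch (illustrative Python, not part of the original discussion) simulates many two-group studies in which a genuine, medium-sized effect exists, and counts how often a two-sample t-test at the 0.05 level actually detects it. The effect size, group sizes, and number of simulations are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(n_per_group, effect_size, alpha=0.05, n_sims=5_000):
    """Estimate the power of a two-sample t-test by simulation."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)  # a true effect is present
        if stats.ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / n_sims

for n in (10, 30, 64, 120):
    print(f"n = {n:3d} per group -> power ≈ {simulated_power(n, effect_size=0.5):.2f}")
```

With small groups, a genuine medium-sized effect is missed more often than not, which is exactly the “missing the needle” error described above; choosing a sample size for which this detection rate is acceptably high is the purpose of a power analysis.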

Finding a relationship when there is not one (or “seeing things that aren’t there”)

In anything but the most trivial research study, the researcher will spend a considerable amount of time analyzing the data for relationships. Of course, it’s important to conduct a thorough analysis, but most people are well aware of the fact that if you play with the data long enough, you can often “turn up” results that support or corroborate your hypotheses. In more everyday terms, you are “fishing” for a specific result by analyzing the data repeatedly under slightly differing conditions or assumptions.

In statistical analysis, we attempt to determine the probability that the finding we get is a “real” one or could have been a “chance” finding. In fact, we often use this probability to decide whether to accept the statistical result as evidence that there is a relationship. In the social sciences, researchers often use the rather arbitrary value known as the 0.05 level of significance to decide whether their result is credible or could be considered a “fluke.” Essentially, the value 0.05 means that the result you got could be expected to occur by chance about 5 times out of every 100 times you run the statistical analysis. Most statistical analyses assume that each analysis is “independent” of the others. But that may not be true when you conduct multiple analyses of the same data.

For instance, let’s say you conduct 20 statistical tests and for each one you use the 0.05 level criterion for deciding whether you are observing a relationship. For each test, the odds are 5 out of 100 that you will see a relationship even if there is not one there (that’s what it means to say that the result could be “due to chance”). Odds of 5 out of 100 are equal to the fraction 5/100, which is also equal to 1 out of 20. Now, in this example, you conduct 20 separate analyses and find that only one of the twenty results is statistically significant at the 0.05 level. Does that mean you have found a statistically significant relationship? If you had only done the one analysis, you might conclude that you’d found a relationship in that result. But if you did 20 analyses, you would expect to find one of them significant by chance alone, even if there is no real relationship in the data. We call this threat to conclusion validity fishing and the error rate problem. The basic problem is that you were “fishing” by conducting multiple analyses and treating each one as though it were independent. Instead, when you conduct multiple analyses, you should adjust the error rate (i.e., the significance level) to reflect the number of analyses you are doing. The bottom line is that you are more likely to see a relationship that isn’t there when you keep reanalyzing your data and don’t take that fishing into account when drawing your conclusions.
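The arithmetic can be made concrete with a minimal sketch (plain Python with NumPy and SciPy, not code from any of the sources discussed here). It estimates the familywise error rate of running 20 tests on pure-noise data at the 0.05 level, and shows how a simple Bonferroni adjustment brings that rate back under control. The group size and number of simulations are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_simulations = 2_000   # simulated "studies"
n_tests = 20            # analyses per study, all on null (no-effect) data
alpha = 0.05
n_per_group = 30        # illustrative group size

any_hit_raw = 0
any_hit_bonferroni = 0

for _ in range(n_simulations):
    p_values = []
    for _ in range(n_tests):
        # Both groups are drawn from the same population, so any "significant"
        # difference is a false positive by construction.
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        p_values.append(stats.ttest_ind(a, b).pvalue)
    p_values = np.array(p_values)
    any_hit_raw += np.any(p_values < alpha)
    any_hit_bonferroni += np.any(p_values < alpha / n_tests)  # Bonferroni adjustment

print(f"Theoretical familywise error rate: {1 - (1 - alpha) ** n_tests:.2f}")
print(f"Simulated, unadjusted:  {any_hit_raw / n_simulations:.2f}")
print(f"Simulated, Bonferroni:  {any_hit_bonferroni / n_simulations:.2f}")
```

With 20 independent tests at the 0.05 level, the chance of at least one spurious “significant” result is 1 − 0.95^20 ≈ 0.64, which the unadjusted simulation reproduces; dividing the significance level by the number of tests keeps the familywise rate near 0.05.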

Problems that can lead to either conclusion error

Every analysis is based on a variety of assumptions about the nature of the data, the procedures you use to conduct the analysis, and the match between these two. If you are not sensitive to the assumptions behind your analysis you are likely to draw erroneous conclusions about relationships. In quantitative research we refer to this threat as the violated assumptions of statistical tests . For instance, many statistical analyses assume that the data are distributed normally — that the population from which they are drawn would be distributed according to a “normal” or “bell-shaped” curve. If that assumption is not true for your data and you use that statistical test, you are likely to get an incorrect estimate of the true relationship. And, it’s not always possible to predict what type of error you might make — seeing a relationship that isn’t there or missing one that is.
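As a hedged sketch of how one might act on this threat in practice (the data, variable names, and cut-offs below are illustrative, not a prescription from the text), the snippet checks the normality assumption before running a two-sample t-test and falls back to a rank-based test when the assumption looks doubtful.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative data: one roughly normal group, one heavily skewed group.
group_a = rng.normal(50, 10, 40)
group_b = rng.exponential(scale=10, size=40) + 40

def compare_groups(a, b, alpha=0.05):
    """Use a t-test when both samples look normal, otherwise Mann-Whitney U."""
    normal_a = stats.shapiro(a).pvalue > alpha
    normal_b = stats.shapiro(b).pvalue > alpha
    if normal_a and normal_b:
        test_name, result = "Student's t-test", stats.ttest_ind(a, b)
    else:
        test_name, result = "Mann-Whitney U", stats.mannwhitneyu(a, b, alternative="two-sided")
    return test_name, result.pvalue

name, p = compare_groups(group_a, group_b)
print(f"{name}: p = {p:.4f}")
```

Routine pre-testing of assumptions is itself debated; the broader point is simply that the analysis chosen should match the data it is applied to.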

I believe that the same problem can occur in qualitative research as well. There are assumptions, some of which we may not even realize, behind our qualitative methods. For instance, in interview situations we may assume that the respondent is free to say anything s/he wishes. If that is not true — if the respondent is under covert pressure from supervisors to respond in a certain way — you may erroneously see relationships in the responses that aren’t real and/or miss ones that are.

The threats listed above illustrate some of the major difficulties and traps that are involved in one of the most basic of research tasks — deciding whether there is a relationship in your data or observations. So, how do we attempt to deal with these threats? The researcher has a number of strategies for improving conclusion validity through minimizing or eliminating the threats described above.


Pew Research Center revises report about 'racial conspiracy theories' after backlash

The Pew Research Center has revised a report after it received criticism for saying a majority of Black Americans believe “racial conspiracy theories” about U.S. institutions. 

In the report released June 10 titled “Most Black Americans Believe Racial Conspiracy Theories About U.S. Institutions,” Pew detailed “the suspicions that Black adults might have about the actions of U.S. institutions based on their personal and collective historical experiences with racial discrimination.” Survey respondents highlighted issues such as discrimination in the medical field, incarceration, and guns and drugs in Black communities. 

The report’s initial title prompted swift backlash from critics who said “racial conspiracy theories” implied that Black Americans’ distrust of U.S. institutions is irrational and without historical context. The report made brief mention of the Tuskegee syphilis experiment, a medical scandal that fueled distrust in medical institutions.

JustLeadershipUSA, a social justice organization, was one of the most vocal critics of the report, calling it “shockingly offensive” for labeling Black Americans’ distrust over well-substantiated discrimination as conspiracy theories.

Two days later, Pew officials marked the report as being under revision and acknowledged that using the phrase “racial conspiracy theories” was not the best choice. 

“The comments were so thoughtful,” Neha Sahgal, vice president of research at Pew Research Center, said of the criticism. Sahgal said leaders at Pew “paid attention to what people were trying to tell us.”

“Upon reflection, we felt that this editorial shorthand detracted from the findings of this report, which we maintain are hugely important at this time in our country,” Sahgal said. “We have since revised the report. We have taken accountability for using a shorthand that was inappropriate.”

Pew released the revised report Saturday with a new title: “Most Black Americans Believe U.S. Institutions Were Designed To Hold Black People Back.” The updated report includes a new headline, additional context and direct quotes from respondents.

“This is very important and an excellent update to correct those errors in the original version,” DeAnna Hoskins, president of JustLeadershipUSA, said. “But why didn’t you do that from the beginning?”

Before Pew’s acknowledgement and revisions, one person wrote in a post on X: “This new Pew report on Black belief in ‘conspiracy theories’ is interesting, but I take issue with the ‘CT’ label bc of how it lumps in well-substantiated truths alongside bunk like Q*Anon and flat earth.”

There are well-documented episodes of discrimination and targeting throughout the nation’s history, from the Tuskegee experiment to exclusion of Black Americans from New Deal programs and government targeting of civil rights and Black Power leaders under COINTELPRO.

“We have to ask: Why would the people at the Pew Research Center call the opinion of the vast majority of Black Americans—which is rooted in facts, history, and lived experience—a ‘conspiracy theory,’ when it is actually a reality?” Hoskins wrote in a statement on the organization’s website. 

In an interview with NBC News, Hoskins said it was irresponsible of Pew to equate Black people’s concerns with conspiracy theories at such a politically turbulent time in the country.

“We’re talking about election fraud, we’re talking about QAnon — you were throwing us into that,” Hoskins said of Pew. 

The report states that most Black Americans believe U.S. institutions fall short “when it comes to treating Black people fairly.” More than 60% of Black Americans surveyed cited prison, political and economic systems as just some of the institutions intentionally designed to “hold Black people back, either a great deal or a fair amount.”

“Black Americans’ mistrust of U.S. institutions is informed by history, from slavery to the implementation of Jim Crow laws in the South, to the rise of mass incarceration and more,” the updated report states. “Several studies show that racial disparities in income, wealth, education, imprisonment and health outcomes persist to this day.”


Char Adams is a reporter for NBC BLK who writes about race.


Rafel RAT, Android Malware from Espionage to Ransomware Operations

Research by: Antonis Terefos, Bohdan Melnykov

Introduction

Android, Google’s most popular mobile operating system, powers billions of smartphones and tablets globally. Known for its open-source nature and flexibility, Android offers users a wide array of features, customization options, and access to a vast ecosystem of applications through the Google Play Store and other sources.

However, with its widespread adoption and open environment comes the risk of malicious activity. Android malware, a malicious software designed to target Android devices, poses a significant threat to users’ privacy, security, and data integrity. These malicious programs come in various forms, including viruses, Trojans, ransomware, spyware, and adware, and they can infiltrate devices through multiple vectors, such as app downloads, malicious websites, phishing attacks, and even system vulnerabilities.

The evolving landscape of Android malware presents challenges for users, developers, and security experts. As attackers employ increasingly sophisticated techniques to evade detection and compromise devices, understanding the nature of Android malware, its distribution methods and effective prevention and mitigation strategies become paramount.

Rafel RAT is an open-source malware tool that operates stealthily on Android devices. It provides malicious actors with a powerful toolkit for remote administration and control, enabling a range of malicious activities from data theft to device manipulation.

Check Point Research has identified multiple threat actors utilizing Rafel, an open-source remote administration tool (RAT). The discovery of an espionage group leveraging Rafel in their operations was of particular significance, as it indicates the tool’s efficacy across various threat actor profiles and operational objectives.

In an earlier publication, we identified APT-C-35 / DoNot Team utilizing Rafel RAT. Rafel’s features and capabilities, such as remote access, surveillance, data exfiltration, and persistence mechanisms, make it a potent tool for conducting covert operations and infiltrating high-value targets.

Figure 1 - Rafel RAT features.

Campaigns Overview & Victims Analysis

We observed around 120 different malicious campaigns, some of which successfully targeted high-profile organizations, including the military sector. While most of the targeted victims were from the United States, China, and Indonesia, the geography of the attacks is pretty vast.

Such campaigns can be considered high-risk: an exfiltrated phone book can leak sensitive information about other contacts and enable lateral movement within the organization based on that data. Another point of concern is stolen two-factor authentication messages, which could lead to multiple account takeovers.


The majority of victims had Samsung phones, with Xiaomi, Vivo, and Huawei users comprising the second-largest group among the targeted victims. This result corresponds to the popularity of the devices in various markets.


While certain brands had higher numbers of infected devices, a wide range of models were involved. Therefore, we categorized the models based on their series. Our findings also highlighted that most victims had Google devices (Pixel, Nexus), Samsung Galaxy A & S Series, and Xiaomi Redmi Series.

Figure 4 - Top Models.

It’s intriguing to note the distribution of Android versions among the most affected victims. Android 11 is the most prevalent, followed by versions 8 and 5. Despite the variety of Android versions, malware can generally operate across all. However, newer versions of the operating system typically present more challenges for malware to execute its functions or require more actions from the victim to be effective.

Figure 5 - Android Versions.

One thing we constantly observe in Windows bots is the consistently high number of Windows XP infections, despite the fact that this version reached its End of Life in 2014. We observed the same scenario in infected Android devices. More than 87% of the affected victims are running Android versions that are no longer supported and, consequently, not receiving security fixes.

Android Version    Release Date       Last Security Patch (End of Life)
4                  October 2011       October 2017
5                  November 2014      March 2018
6                  October 2015       August 2018
7                  August 2016        October 2019
8                  August 2017        October 2021
9                  August 2018        January 2022
10                 September 2019     February 2023
11                 September 2020     February 2024
12                 October 2021       N/A
13                 August 2022        N/A

Figure 6 - Victims’ Android version support.
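As a small illustration of the tallying behind the 87% figure, the sketch below computes the share of infected devices running Android versions that no longer receive security patches. The end-of-life dates follow the table above; the per-version device counts are made up for demonstration and are not Check Point's data.

```python
from datetime import date

# End-of-life (last security patch) month per Android version, from the table above.
END_OF_LIFE = {
    4: date(2017, 10, 1), 5: date(2018, 3, 1), 6: date(2018, 8, 1),
    7: date(2019, 10, 1), 8: date(2021, 10, 1), 9: date(2022, 1, 1),
    10: date(2023, 2, 1), 11: date(2024, 2, 1),
    12: None, 13: None,   # still supported at the time of writing
}

# Hypothetical counts of infected devices per Android version.
victims_per_version = {5: 30, 6: 12, 7: 18, 8: 40, 9: 25, 10: 35, 11: 80, 12: 20, 13: 15}

as_of = date(2024, 6, 1)
total = sum(victims_per_version.values())
unsupported = sum(
    count for version, count in victims_per_version.items()
    if END_OF_LIFE.get(version) is not None and END_OF_LIFE[version] < as_of
)
print(f"Unsupported share: {unsupported / total:.0%} of {total} devices")
```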

Technical Analysis

This malware was developed to participate in phishing campaigns. It leverages deceptive tactics to manipulate user trust and exploit their interactions. Upon initiation, the malware requests the necessary permissions and may also ask to be added to the allowlist, especially when the device’s manufacturer offers extra app-optimization services; this helps ensure its persistence on the system.


Our investigation uncovered numerous phishing operations utilizing this specific malware variant. Under the guise of legitimate entities, the malware impersonates multiple widely recognized applications, including Instagram, WhatsApp, various e-commerce platforms, antivirus programs, and support apps for numerous services.

Figure 8 - Screenshots of malware activity.

Depending on the attacker’s modifications, the malware may request permissions for Notifications or Device Admin rights or stealthily seek minimal sensitive permissions (such as SMS, Call Logs, and Contacts) in its quest to remain undetected. Regardless, the malware commences its operations in the background immediately upon activation. It deploys a Background service that generates a notification with a deceptive label while operating covertly. At the same time, it initiates an InternalService to manage communications with the command-and-control (C&C) server.


The InternalService initiates communication with the C&C server, activates location tracking, and begins setting up Text-To-Speech components.

Figure 10 - InternalService’s first actions.

Communication occurs over HTTP(S) protocols, beginning with the initial phase of client-server interaction. This involves transmitting information about the device, including its identifiers, characteristics, locale, country, model specifics, and operator details. Next, a request is sent to the C&C server for the commands to execute on the device.

Figure 11 - Request to C&C with device information.

The range of supported commands and their names may vary depending on the specific malware variant. The fundamental commands found in the original malware sources include the ability to:

  • Leak the phone book to the C&C
  • Leak all SMS messages to the C&C
  • Send text messages to a provided phone number
  • Send device information (country, operator, model, language, battery, root status, amount of RAM)
  • Leak the live location to the C&C
  • Leak call logs to the C&C
  • Show a toast (floating message) with a provided text for the victim
  • Delete all files under a specified path
  • Lock the device screen
  • Start the process of file encryption
  • Change the device wallpaper
  • Vibrate the device for 20 seconds
  • Wipe the call history
  • Play incoming messages from the attackers in different languages via text-to-speech
  • Send the directory tree of a specified path to the C&C
  • Upload a specific file to the C&C
  • Send a list of all installed applications

In addition to the primary communication channel, the malware was initially able to send quick messages through the Discord API. During the onboarding process, it notifies the attacker of a new victim’s appearance. This enables attackers to respond swiftly and extract the necessary data from the compromised device.


This communication channel is also used to intercept device notifications. The malware scans the content of these notifications and forwards it to the attackers. This enables the attackers to siphon sensitive data from other applications, such as capturing 2FA codes sent through messaging platforms.

Figure 14 - Notification Listener that leaks all notifications.

During our analysis, we encountered several protective mechanisms employed by the attackers. These ranged from string encryption and packer usage to various anti-evasion techniques designed to disrupt automated analysis pipelines or render some tools ineffective.


Some of the evasions used can be mitigated by newer versions of the analysis tools.

For more information about evasion techniques, refer to our  Check Point Research Evasion Encyclopedia .

Command & Control

Threat actors who use Rafel are provided with a PHP panel, which operates without the need for a traditional database setup and relies instead on JSON files for storage and management. During installation, the threat actor uses a designated username and password to access the administration panel. Through this interface, the threat actors can monitor and control the infected mobile devices.

Figure 16 - Admin login page.

Upon logging into the command and control interface, threat actors can access essential information about the infected devices, such as:

  • Device  – Phone model
  • Version  – Android Version
  • Country  – Provides geographical context, allowing threat actors to tailor their malicious activities or campaigns to specific regions or demographics.
  • SIM operator  – The mobile network operator associated with the device’s SIM card, which can help track the device’s location.
  • Charge  – The current power level of the infected device.
  • Is Rooted  – Indicates whether the device is rooted, providing information on the permitted access level.

Figure 17 - Control Panel Devices.

As the threat actors view bot details within the panel, additional information regarding the device’s specifications and available commands becomes accessible. The panel shows the following extracted device information:

  • Language  – Specifies the language setting configured on the infected device.
  • RAM  – Provides details about the device’s random access memory (RAM) capacity. This information could indicate whether the device is a sandbox.

In addition, the panel grants the operator access to a suite of phone features and commands that can be executed remotely on the infected device.

Figure 18 - Victim’s device Information.

The GetContact command enables the threat actors to retrieve contact details from the victim’s device, including names and phone numbers. This gives the attackers access to sensitive personal information stored on the device, facilitating identity theft, social engineering attacks, or further exploitation of the victim’s contacts for malicious purposes.

Figure 19 - Contacts List.

Threat actors can retrieve SMS messages containing sensitive information by using the GetSMS command. We observed malicious actors abusing this functionality to obtain two-factor authentication (2FA) details. This presents a significant security risk, as 2FA codes are commonly used to secure accounts and transactions.

Figure 20 - SMS List.

The Application command provides further information regarding the installed applications on the victim’s devices.

Figure 21 - Application List.

Newer versions of the command and control panel provide extended functionalities, as seen below.

Figure 22 - Commands.

Our analysis of executed bot commands provided valuable insights into the tactics, techniques, and procedures (TTPs) employed by cyber criminals and yielded actionable intelligence.

Figure 23 - Executed Commands.

Deeper Analysis of Campaigns

Check Point Research took a deeper dive into three specific areas of Android infections:

  • Ransomware operations
  • Two-factor authentication messages that could have led to a 2FA bypass
  • Threat actors who hacked Pakistani government sites

The cases we uncovered underscore severe dangers for individuals and corporations operating in the Android ecosystem.

Ransomware Operation Analysis

In its fundamental iteration, the Rafel application possesses all the essential features required for executing extortion schemes effectively. When malware obtains DeviceAdmin privileges, it can alter the lock-screen password. In addition, leveraging device admin functionality aids in preventing the malware’s uninstallation. If a user attempts to revoke admin privileges from the application, it promptly changes the password and locks the screen, thwarting any attempts to intervene.


In addition to its locker functionality, the malware incorporates a variant that encrypts files using AES encryption, employing a predefined key. Alternatively, it may delete files from the device’s storage.

Figure 26 - File encryption methods.

Check Point Research identified a ransomware operation performed using Rafel RAT. The threat actor, who possibly originates from Iran, initially executed typical information-retrieving commands such as:

  • device_info  – Get device info.
  • application_list  – Get the device application list.
  • arama_gecmisi  – Get call logs.
  • rehber_oku  – Get contact details.
  • sms_oku  – Get SMS messages.

At this point, the operator uses the information obtained to determine whether the victim has any value in terms of espionage, and then begins the ransomware operation with these commands:

  • deletecalls  – Wipes call history.
  • ransomware  – Displays the message “Loda Pakistan” (the victim was from Pakistan).
  • changewallpaper  – Change the wallpaper, and message “loda Pakistan.”
  • LockTheScreen  – Locks the screen with the message “Loda Pakistan.”
  • send_sms  – Sends a message containing the ransom note.
  • vibrate  – Vibrate to alert the victim.

The “ransom note” in the form of an SMS message is written in Arabic and provides a Telegram channel to continue the dialogue.

Figure 28 - “Ransom note” message.

Two-Factor Authentication (2FA)

Our investigations revealed numerous cases where 2FA messages were stolen, potentially leading to a 2FA bypass. Compromised 2FA codes (OTP – one-time passwords) can enable malicious actors to circumvent additional security measures and gain unauthorized access to sensitive accounts and information.

Figure 29 - 2FA messages.

Threat Actors Targeting Government Infrastructure

In one recent case, we identified a threat actor who managed to hack a Pakistani government website. The actor also installed the Rafel web panel on this server, and we observed infected devices reporting to this C&C.

Figure 31 - Hacked Pakistani government website.

The hacker @LoaderCrazy published his “achievement” on the Telegram channel @EgyptHackerTeam, with the message in Arabic “ما نخترقه نترك بصمتنا عليه” (English: What we penetrate we leave our mark on).

Figure 32 - Communication on a Telegram Channel.

The Rafel web panel was installed on May 18, 2024, though traces of the hacking date back to April 2023.

Figure 33 - proof.txt file.

The Rafel victims on this C&C are from diverse countries, including the United States, Russia, China, and Romania.

Figure 34 - Rafel RAT is hosted on Pakistan’s government website.

Conclusion

Rafel RAT is a potent example of the evolving landscape of Android malware, characterized by its open-source nature, extensive feature set, and widespread utilization across various illicit activities. The prevalence of Rafel RAT highlights the need for continual vigilance and proactive security measures to safeguard Android devices against malicious exploitation. As cyber criminals continue to leverage techniques and tools such as Rafel RAT to compromise user privacy, steal sensitive data, and perpetrate financial fraud, a multi-layered approach to cybersecurity is essential. Effective mitigation strategies should encompass comprehensive threat intelligence, robust endpoint protection mechanisms, user education initiatives, and stakeholder collaboration within the cybersecurity ecosystem.

Check Point’s Harmony Mobile prevents malware from infiltrating mobile devices by detecting and blocking the download of malicious apps in real-time. Harmony Mobile’s unique network security infrastructure, On-device Network Protection, allows you to stay ahead of emerging threats by extending Check Point’s industry-leading network security technologies to mobile devices.

SHA256
d1f2ed3e379cde7375a001f967ce145a5bba23ca668685ac96907ba8a0d29320
442fbbb66efd3c21ba1c333ce8be02bb7ad057528c72bf1eb1e07903482211a9
344d577a622f6f11c7e1213a3bd667a3aef638440191e8567214d39479e80821
c94416790693fb364f204f6645eac8a5483011ac73dba0d6285138014fa29a63
9b718877da8630ba63083b3374896f67eccdb61f85e7d5671b83156ab182e4de
5148ac15283b303357107ab4f4f17caf00d96291154ade7809202f9ab8746d0b
Command & Control Servers
districtjudiciarycharsadda.gov[.]pk
kafila001.000webhostapp[.]com
uni2phish[.]ru
zetalinks[.]tech
ashrat.000webhostapp[.]com
bazfinc[.]xyz
discord-rat23.000webhostapp[.]com

