Hypothesis in Machine Learning

The hypothesis is a common term in Machine Learning and data science projects. As we know, machine learning is one of the most powerful technologies across the world, which helps us to predict results based on past experiences. Moreover, data scientists and ML professionals conduct experiments that aim to solve a problem. These ML professionals and data scientists make an initial assumption for the solution of the problem.

This assumption in Machine learning is known as Hypothesis. In Machine Learning, at various times, Hypothesis and Model are used interchangeably. However, a Hypothesis is an assumption made by scientists, whereas a model is a mathematical representation that is used to test the hypothesis. In this topic, "Hypothesis in Machine Learning," we will discuss a few important concepts related to a hypothesis in machine learning and their importance. So, let's start with a quick introduction to Hypothesis.

A hypothesis is a supposition or proposed explanation made on the basis of limited evidence. It is a guess based on some known facts that has not yet been proven. A good hypothesis is testable and results in either true or false.

Example: Let's understand the hypothesis with a common example. A scientist claims that ultraviolet (UV) light can damage the eyes; from this, it may be assumed that it also causes blindness.

In this example, the scientist claims only that UV rays are harmful to the eyes, while we assume they may also cause blindness. This may or may not turn out to be true. Such assumptions are called hypotheses.

The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is specifically used in Supervised Machine learning, where an ML model learns a function that best maps the input to corresponding outputs with the help of an available dataset.

There are some common methods for finding the best possible hypothesis from the hypothesis space, where the hypothesis space is represented by H and a hypothesis by h. These are defined as follows:

Hypothesis space (H): The set of all candidate hypotheses from which a supervised machine learning algorithm determines the best possible hypothesis, i.e., the one that best describes the target function or best maps inputs to outputs. It is often constrained by the framing of the problem, the choice of model, and the choice of model configuration.

Hypothesis (h): A single candidate function that maps inputs to proper outputs. It is primarily based on the data as well as the bias and restrictions applied to the data. It can be evaluated and used to make predictions.

The hypothesis (h) can be formulated in machine learning as a linear function:

y = mx + c

Where,

y: range (the output)

m: slope of the line that divides the data, i.e., the change in y divided by the change in x

x: domain (the input)

c: intercept (constant)
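As a minimal illustration in Python (the slope and intercept values here are assumptions chosen for demonstration, not from the text), a linear hypothesis of this form can be written as:

```python
# Illustrative slope and intercept; in practice these are learned from data.
m, c = 2.0, 1.0

def h(x):
    """Candidate hypothesis mapping an input x (domain) to an output y (range)."""
    return m * x + c

print(h(3))  # 7.0
```

Training amounts to searching the hypothesis space for the values of m and c that best fit the data.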

Example: Let's understand the hypothesis (h) and hypothesis space (H) with a two-dimensional coordinate plane showing the distribution of data as follows:

Hypothesis space (H) is the set of all legal possible ways to divide the coordinate plane so that inputs are best mapped to proper outputs.

Further, each individual possible way is called a hypothesis (h).

Hypothesis in Statistics

Similar to the hypothesis in machine learning, a statistical hypothesis is an assumption about an outcome. However, it is falsifiable, which means it can fail in the presence of sufficient evidence.

Unlike in machine learning, we cannot simply accept a hypothesis in statistics, because it is an imagined result based on probability. Before starting work on an experiment, we must be aware of two important types of hypotheses:

Null Hypothesis: A null hypothesis is a type of statistical hypothesis stating that no statistically significant effect exists in the given set of observations. It is also known as a conjecture and is used in quantitative analysis to test theories about markets, investment, and finance to decide whether an idea is true or false.

Alternative Hypothesis: An alternative hypothesis is a direct contradiction of the null hypothesis: if one of the two hypotheses is true, the other must be false. In other words, it is a type of statistical hypothesis stating that some significant effect exists in the given set of observations.

The significance level is the first thing that must be set before starting an experiment. It defines the tolerance for error, the level at which an effect can be considered significant. Conventionally, a 95% confidence level is used, which means up to 5% of results can be attributed to chance. The significance level also determines the critical or threshold value. For example, if the confidence level is set to 98%, then the critical value is 0.02.

The p-value in statistics quantifies the evidence against a null hypothesis. In other words, the p-value is the probability, under the null hypothesis, that random chance would generate data as extreme as, or more extreme than, what was observed.

The smaller the p-value, the stronger the evidence against the null hypothesis, and the more confidently it can be rejected in testing. It is always expressed in decimal form, such as 0.035.

Whenever a statistical test is carried out on a sample to find the p-value, the decision depends on the critical value. If the p-value is less than the critical value, the effect is significant and the null hypothesis can be rejected. If it is higher than the critical value, there is no significant effect, and we fail to reject the null hypothesis.
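As a minimal sketch in Python (the threshold 0.05 is the conventional default, an assumption here), this decision rule looks like:

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value against the significance threshold (critical value) alpha."""
    if p_value < alpha:
        return "reject the null hypothesis (significant effect)"
    return "fail to reject the null hypothesis (no significant effect)"

print(decide(0.035))  # reject the null hypothesis (significant effect)
print(decide(0.20))   # fail to reject the null hypothesis (no significant effect)
```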

In supervised machine learning, where instances of inputs are mapped to outputs, the hypothesis is a very useful concept that helps to approximate a target function. It appears across analytics domains and is one of the important factors in deciding whether a change should be introduced. It also bears on the efficiency and performance of models across the entire training dataset.

Hence, in this topic, we have covered various important concepts related to the hypothesis in machine learning and statistics and some important parameters such as p-value, significance level, etc., to understand hypothesis concepts in a better way.






Best Guesses: Understanding The Hypothesis in Machine Learning

Stewart Kaplan

  • February 22, 2024
  • General , Supervised Learning , Unsupervised Learning

Machine learning is a vast and complex field that has inherited many terms from other places all over the mathematical domain.

It can sometimes be challenging to get your head around all the different terminologies, never mind trying to understand how everything comes together.

In this blog post, we will focus on one particular concept: the hypothesis.

While you may think this is simple, there is a caveat in machine learning: the term has both a statistics side and a learning side.

Don’t worry; we’ll do a full breakdown below.

You’ll learn the following:

What Is a Hypothesis in Machine Learning?

  • Is This any different than the hypothesis in statistics?
  • What is the difference between the alternative hypothesis and the null?
  • Why do we restrict hypothesis space in artificial intelligence?
  • Example code performing hypothesis testing in machine learning


In machine learning, the term ‘hypothesis’ can refer to two things.

First, it can refer to the hypothesis space: the set of all candidate functions (hypotheses) from which the algorithm can choose when predicting or answering a new instance.

Second, it can refer to the traditional null and alternative hypotheses from statistics.

Since machine learning works so closely with statistics, 90% of the time, when someone is referencing the hypothesis, they’re referencing hypothesis tests from statistics.

Is This Any Different Than The Hypothesis In Statistics?

In statistics, the hypothesis is an assumption made about a population parameter.

The statistician's goal is to gather evidence that either rejects it or fails to reject it.


This will take the form of two different hypotheses, one called the null, and one called the alternative.

Usually, you'll establish your null hypothesis as an assumption that a population parameter equals some value.

For example, in Welch's T-Test Of Unequal Variance, our null hypothesis is that the two population means we are testing are equal.

We run our statistical tests, and if our p-value is significant (very low), we reject the null hypothesis.

This would mean the population means of the two samples you are testing are unequal.

Usually, statisticians will use the significance level of .05 (a 5% risk of being wrong) when deciding what to use as the p-value cut-off.

What Is The Difference Between The Alternative Hypothesis And The Null?

The null hypothesis is our default assumption, which we retain unless the evidence lets us reject it.

The alternative hypothesis is usually the opposite of the null and is much broader in scope.

For most statistical tests, the null and alternative hypotheses are already defined.

You are then just trying to find "significant" evidence you can use to reject the null hypothesis.


These two hypotheses are easy to spot by their specific notation. The null hypothesis is usually denoted by H₀, while H₁ denotes the alternative hypothesis.

Example Code Performing Hypothesis Testing In Machine Learning

Since there are many different hypothesis tests in machine learning and data science, we will focus on one of my favorites.

This test is Welch’s T-Test Of Unequal Variance, where we are trying to determine if the population means of these two samples are different.

There are a couple of assumptions for this test, but we will ignore those for now and show the code.

You can read more in our other post, Welch's T-Test of Unequal Variance.
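The code block itself did not survive extraction, so here is a sketch of what it might have computed, using only the standard library (the sample data is invented; with SciPy installed you would typically call scipy.stats.ttest_ind(a, b, equal_var=False) to get the p-value directly):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)   # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb             # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / se2 ** 0.5
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Invented treatment and control measurements for illustration.
treatment = [12.1, 14.3, 13.8, 12.9, 13.5, 14.0]
control = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2]
t, df = welch_t(treatment, control)
print(t, df)  # |t| far exceeds ~2 at these degrees of freedom
```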

We see that our p-value is very low, and we reject the null hypothesis.


What Is The Difference Between The Biased And Unbiased Hypothesis Spaces?

The difference between the biased and unbiased hypothesis spaces is the number of possible hypotheses your algorithm can draw on to make predictions.

The unbiased space contains all of them, while the biased space contains only those built from the training examples you've supplied.

Since neither of these is optimal (one is too small, the other far too big), your algorithm creates generalized rules (inductive learning) to be able to handle examples it hasn't seen before.

Here’s an example of each:

Example of The Biased Hypothesis Space In Machine Learning

The Biased Hypothesis space in machine learning is a biased subspace where your algorithm does not consider all training examples to make predictions.

This is easiest to see with an example.

Let’s say you have the following data:

Happy and Sunny and Stomach Full = True

Whenever your algorithm sees those three together in the biased hypothesis space, it’ll automatically default to true.

This means when your algorithm sees:

Sad and Sunny and Stomach Full = False

It’ll automatically default to False since it didn’t appear in our subspace.

This is a greedy approach, but it has some practical applications.
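A minimal sketch of this lookup-table behaviour (the feature names come from the example above; the function and variable names are my own):

```python
# Observed training examples: (mood, weather, stomach) -> label
training = {
    ("Happy", "Sunny", "Stomach Full"): True,
}

def predict(mood, weather, stomach):
    """Greedy biased-space prediction: anything unseen defaults to False."""
    return training.get((mood, weather, stomach), False)

print(predict("Happy", "Sunny", "Stomach Full"))  # True (seen in training)
print(predict("Sad", "Sunny", "Stomach Full"))    # False (never seen)
```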


Example of the Unbiased Hypothesis Space In Machine Learning

The unbiased hypothesis space is a space where all combinations are stored.

We can re-use our example above. In the unbiased space, it starts to break down into every combination:

Happy = True

Happy and Sunny = True

Happy and Stomach Full = True

Let’s say you have four options for each of the three choices.

This would mean our subspace would need 2^12 instances (4096) just for our little three-word problem.

This is practically impossible; the space would become huge.
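One way to count these spaces, assuming three attributes with four possible values each:

```python
values_per_attribute = 4
attributes = 3

# Every distinct combination of attribute values is one instance.
instance_space = values_per_attribute ** attributes

# The unbiased hypothesis space labels each instance True or False,
# so it contains one hypothesis per possible labeling.
unbiased_hypotheses = 2 ** instance_space

print(instance_space)       # 64
print(unbiased_hypotheses)  # 2**64, roughly 1.8e19
```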


So while it would be highly accurate, this has no scalability.

More reading on this idea can be found in our post, Inductive Bias In Machine Learning .

Why Do We Restrict Hypothesis Space In Artificial Intelligence?

We have to restrict the hypothesis space in machine learning. Without any restrictions, our domain becomes much too large, and we lose any form of scalability.

This is why our algorithm creates rules to handle examples that are seen in production. 

This gives our algorithms a generalized approach that will be able to handle all new examples that are in the same format.

Other Quick Machine Learning Tutorials

At EML, we have a ton of cool data science tutorials that break things down so anyone can understand them.

Below we’ve listed a few that are similar to this guide:

  • Instance-Based Learning in Machine Learning
  • Types of Data For Machine Learning
  • Verbose in Machine Learning
  • Generalization In Machine Learning
  • Epoch In Machine Learning
  • Inductive Bias in Machine Learning
  • Understanding The Hypothesis In Machine Learning
  • Zip Codes In Machine Learning
  • get_dummies() in Machine Learning
  • Bootstrapping In Machine Learning
  • X and Y in Machine Learning
  • F1 Score in Machine Learning

Hypothesis Testing – A Deep Dive into Hypothesis Testing, The Backbone of Statistical Inference

  • September 21, 2023

Explore the intricacies of hypothesis testing, a cornerstone of statistical analysis. Dive into methods, interpretations, and applications for making data-driven decisions.


In this Blog post we will learn:

  • What is Hypothesis Testing?
  • Steps in Hypothesis Testing
      • Set up Hypotheses: Null and Alternative
      • Choose a Significance Level (α)
      • Calculate a test statistic and P-Value
      • Make a Decision
  • Example: Testing a new drug
  • Example in Python

1. What is Hypothesis Testing?

In simple terms, hypothesis testing is a method used to make decisions or inferences about population parameters based on sample data. Imagine being handed a dice and asked if it’s biased. By rolling it a few times and analyzing the outcomes, you’d be engaging in the essence of hypothesis testing.

Think of hypothesis testing as the scientific method of the statistics world. Suppose you hear claims like “This new drug works wonders!” or “Our new website design boosts sales.” How do you know if these statements hold water? Enter hypothesis testing.

2. Steps in Hypothesis Testing

  • Set up Hypotheses : Begin with a null hypothesis (H0) and an alternative hypothesis (Ha).
  • Choose a Significance Level (α) : Typically 0.05, this is the probability of rejecting the null hypothesis when it’s actually true. Think of it as the chance of accusing an innocent person.
  • Calculate Test statistic and P-Value : Gather evidence (data) and calculate a test statistic.
  • p-value : This is the probability of observing the data, given that the null hypothesis is true. A small p-value (typically ≤ 0.05) suggests the data is inconsistent with the null hypothesis.
  • Decision Rule : If the p-value is less than or equal to α, you reject the null hypothesis in favor of the alternative.

2.1. Set up Hypotheses: Null and Alternative

Before diving into testing, we must formulate hypotheses. The null hypothesis (H0) represents the default assumption, while the alternative hypothesis (H1) challenges it.

For instance, in drug testing, H0: "The new drug is no better than the existing one," H1: "The new drug is superior."

2.2. Choose a Significance Level (α)

You collect and analyze data to test the H0 and H1 hypotheses. Based on your analysis, you decide whether to reject the null hypothesis in favor of the alternative, or fail to reject it.

The significance level, often denoted by $α$, represents the probability of rejecting the null hypothesis when it is actually true.

In other words, it’s the risk you’re willing to take of making a Type I error (false positive).

Type I Error (False Positive) :

  • Symbolized by the Greek letter alpha (α).
  • Occurs when you incorrectly reject a true null hypothesis . In other words, you conclude that there is an effect or difference when, in reality, there isn’t.
  • The probability of making a Type I error is denoted by the significance level of a test. Commonly, tests are conducted at the 0.05 significance level , which means there’s a 5% chance of making a Type I error .
  • Commonly used significance levels are 0.01, 0.05, and 0.10, but the choice depends on the context of the study and the level of risk one is willing to accept.

Example : If a drug is not effective (truth), but a clinical trial incorrectly concludes that it is effective (based on the sample data), then a Type I error has occurred.

Type II Error (False Negative) :

  • Symbolized by the Greek letter beta (β).
  • Occurs when you fail to reject (i.e., accept) a false null hypothesis . This means you conclude there is no effect or difference when, in reality, there is.
  • The probability of making a Type II error is denoted by β. The power of a test (1 – β) represents the probability of correctly rejecting a false null hypothesis.

Example : If a drug is effective (truth), but a clinical trial incorrectly concludes that it is not effective (based on the sample data), then a Type II error has occurred.

Balancing the Errors :


In practice, there’s a trade-off between Type I and Type II errors. Reducing the risk of one typically increases the risk of the other. For example, if you want to decrease the probability of a Type I error (by setting a lower significance level), you might increase the probability of a Type II error unless you compensate by collecting more data or making other adjustments.

It’s essential to understand the consequences of both types of errors in any given context. In some situations, a Type I error might be more severe, while in others, a Type II error might be of greater concern. This understanding guides researchers in designing their experiments and choosing appropriate significance levels.

2.3. Calculate a test statistic and P-Value

Test statistic : A test statistic is a single number that helps us understand how far our sample data is from what we’d expect under a null hypothesis (a basic assumption we’re trying to test against). Generally, the larger the test statistic, the more evidence we have against our null hypothesis. It helps us decide whether the differences we observe in our data are due to random chance or if there’s an actual effect.

P-value : The p-value tells us how likely we would be to get our observed results (or something more extreme) if the null hypothesis were true. It is a value between 0 and 1.

  • A smaller p-value (typically below 0.05) means the observation is rare under the null hypothesis, so we might reject the null hypothesis.
  • A larger p-value suggests that what we observed could easily happen by random chance, so we might not reject the null hypothesis.

2.4. Make a Decision

Relationship between α and the p-value

When conducting a hypothesis test:

  • We first choose a significance level (α), which sets a threshold for making decisions.

We then calculate the p-value from our sample data and the test statistic.

Finally, we compare the p-value to our chosen α:

  • If p-value ≤ α: We reject the null hypothesis in favor of the alternative hypothesis. The result is said to be statistically significant.
  • If p-value > α: We fail to reject the null hypothesis. There isn't enough statistical evidence to support the alternative hypothesis.

3. Example : Testing a new drug.

Imagine we are investigating whether a new drug treats headaches faster than a placebo.

Setting Up the Experiment : You gather 100 people who suffer from headaches. Half of them (50 people) are given the new drug (the 'Drug Group'), and the other half are given a sugar pill that contains no medication (the 'Placebo Group').

  • Set up Hypotheses : Before starting, you make a prediction:
  • Null Hypothesis (H0): The new drug has no effect. Any difference in healing time between the two groups is just due to random chance.
  • Alternative Hypothesis (H1): The new drug does have an effect. The difference in healing time between the two groups is significant and not just by chance.
  • Choose a Significance Level (α) : Typically 0.05, this is the probability of rejecting the null hypothesis when it’s actually true

Calculate Test statistic and P-Value : After the experiment, you analyze the data. The “test statistic” is a number that helps you understand the difference between the two groups in terms of standard units.

For instance, let’s say:

  • The average healing time in the Drug Group is 2 hours.
  • The average healing time in the Placebo Group is 3 hours.

The test statistic helps you understand how significant this 1-hour difference is. If the groups are large and the spread of healing times in each group is small, then this difference might be significant. But if there’s a huge variation in healing times, the 1-hour difference might not be so special.

Imagine the P-value as answering this question: “If the new drug had NO real effect, what’s the probability that I’d see a difference as extreme (or more extreme) as the one I found, just by random chance?”

For instance:

  • P-value of 0.01 means there’s a 1% chance that the observed difference (or a more extreme difference) would occur if the drug had no effect. That’s pretty rare, so we might consider the drug effective.
  • P-value of 0.5 means there’s a 50% chance you’d see this difference just by chance. That’s pretty high, so we might not be convinced the drug is doing much.
  • If the P-value is less than ($α$) 0.05: the results are “statistically significant,” and they might reject the null hypothesis , believing the new drug has an effect.
  • If the P-value is greater than ($α$) 0.05: the results are not statistically significant, and they don’t reject the null hypothesis , remaining unsure if the drug has a genuine effect.

4. Example in python

For simplicity, let’s say we’re using a t-test (common for comparing means). Let’s dive into Python:
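The original code block did not survive extraction. As a stand-in sketch, the following uses invented healing times and a permutation test rather than a t-test so that it needs no external libraries (with SciPy available you would typically call scipy.stats.ttest_ind instead):

```python
import random

random.seed(0)

# Invented healing times in hours; illustrative only.
drug_group = [1.8, 2.2, 1.9, 2.1, 2.0, 2.3, 1.7, 2.0, 2.1, 1.9]
placebo_group = [3.1, 2.9, 3.2, 2.8, 3.0, 3.3, 2.7, 3.1, 2.9, 3.0]

observed = abs(sum(drug_group) / len(drug_group)
               - sum(placebo_group) / len(placebo_group))

# Shuffle the group labels and count how often a mean difference at least
# as large as the observed one arises by chance alone.
pooled = drug_group + placebo_group
n = len(drug_group)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / n)
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
print(f"p-value ~= {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the drug appears to have an effect.")
```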

Making a Decision : If the p-value is below 0.05, "The results are statistically significant! The drug seems to have an effect!" If not, we'd say, "Looks like the drug isn't as miraculous as we thought."

5. Conclusion

Hypothesis testing is an indispensable tool in data science, allowing us to make data-driven decisions with confidence. By understanding its principles, conducting tests properly, and considering real-world applications, you can harness the power of hypothesis testing to unlock valuable insights from your data.


Evaluating Hypotheses in Machine Learning: A Comprehensive Guide

Learn how to evaluate hypotheses in machine learning, including types of hypotheses, evaluation metrics, and common pitfalls to avoid. Improve your ML model's performance with this in-depth guide.

Introduction

Machine learning is a crucial aspect of artificial intelligence that enables machines to learn from data and make predictions or decisions. The process of machine learning involves training a model on a dataset, and then using that model to make predictions on new, unseen data. However, before deploying a machine learning model, it is essential to evaluate its performance to ensure that it is accurate and reliable. One crucial step in this evaluation process is hypothesis testing.

In this blog post, we will delve into the world of hypothesis testing in machine learning, exploring what hypotheses are, why they are essential, and how to evaluate them. We will also discuss the different types of hypotheses, common pitfalls to avoid, and best practices for hypothesis testing.

What are Hypotheses in Machine Learning?

In machine learning, a hypothesis is a statement that proposes a possible explanation for a phenomenon or a problem. It is a conjecture that is made about a population parameter, and it is used as a basis for further investigation. In the context of machine learning, hypotheses are used to define the problem that we are trying to solve.

For example, let's say we are building a machine learning model to predict the prices of houses based on their features, such as the number of bedrooms, square footage, and location. A possible hypothesis could be: "The price of a house is directly proportional to its square footage." This hypothesis proposes a possible relationship between the price of a house and its square footage.
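As a small illustration of this hypothesis (all numbers below are invented), a through-the-origin linear model price = m * sqft can be fit by least squares:

```python
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200_000, 310_000, 395_000, 505_000, 600_000]

# Least-squares slope for a model through the origin: m = sum(x*y) / sum(x*x)
m = sum(x * y for x, y in zip(sqft, price)) / sum(x * x for x in sqft)

def h(x):
    """The learned hypothesis: predicted price for x square feet."""
    return m * x

print(round(m, 2))     # dollars per square foot
print(round(h(1750)))  # predicted price for a 1750 sq ft house
```

Hypothesis testing then asks whether the fitted slope reflects a real relationship or could have arisen by chance.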

Why are Hypotheses Essential in Machine Learning?

Hypotheses are essential in machine learning because they provide a framework for understanding the problem that we are trying to solve. They help us to identify the key variables that are relevant to the problem, and they provide a basis for evaluating the performance of our machine learning model.

Without a clear hypothesis, it is difficult to develop an effective machine learning model. A hypothesis helps us to:

  • Identify the key variables that are relevant to the problem
  • Develop a clear understanding of the problem that we are trying to solve
  • Evaluate the performance of our machine learning model
  • Refine our model and improve its accuracy

Types of Hypotheses in Machine Learning

There are two main types of hypotheses in machine learning: null hypotheses and alternative hypotheses.

Null Hypothesis

A null hypothesis is a hypothesis that proposes that there is no significant difference or relationship between variables. It is a hypothesis of no effect or no difference. For example, let's say we are building a machine learning model to predict the prices of houses based on their features. A null hypothesis could be: "There is no significant relationship between the price of a house and its square footage."

Alternative Hypothesis

An alternative hypothesis is a hypothesis that proposes that there is a significant difference or relationship between variables. It is a hypothesis of an effect or a difference. For example, let's say we are building a machine learning model to predict the prices of houses based on their features. An alternative hypothesis could be: "There is a significant positive relationship between the price of a house and its square footage."

Evaluating Hypotheses in Machine Learning

Evaluating hypotheses in machine learning involves testing the null hypothesis against the alternative hypothesis. This is typically done using statistical methods, such as t-tests, ANOVA, and regression analysis.

Here are the general steps involved in evaluating hypotheses in machine learning:

  • Formulate the null and alternative hypotheses : Clearly define the null and alternative hypotheses that you want to test.
  • Collect and prepare the data : Collect the data that you will use to test the hypotheses. Ensure that the data is clean, relevant, and representative of the population.
  • Choose a statistical method : Select a suitable statistical method to test the hypotheses. This could be a t-test, ANOVA, regression analysis, or another method.
  • Test the hypotheses : Use the chosen statistical method to test the null hypothesis against the alternative hypothesis.
  • Interpret the results : Interpret the results of the hypothesis test. If the null hypothesis is rejected, it suggests that there is a significant relationship between the variables. If the null hypothesis is not rejected, it suggests that there is no significant relationship between the variables.

Common Pitfalls to Avoid in Hypothesis Testing

Here are some common pitfalls to avoid in hypothesis testing:

  • Overfitting : Overfitting occurs when a model is too complex and performs well on the training data but poorly on new, unseen data. To avoid overfitting, use techniques such as regularization, early stopping, and cross-validation.
  • Underfitting : Underfitting occurs when a model is too simple and fails to capture the underlying patterns in the data. To avoid underfitting, use techniques such as feature engineering, hyperparameter tuning, and model selection.
  • Data leakage : Data leakage occurs when the model is trained on data that it will also be tested on. To avoid data leakage, use techniques such as cross-validation and walk-forward optimization.
  • P-hacking : P-hacking occurs when a researcher selectively reports the results of multiple hypothesis tests to find a significant result. To avoid p-hacking, use techniques such as preregistration and replication.

Best Practices for Hypothesis Testing in Machine Learning

Here are some best practices for hypothesis testing in machine learning:

  • Clearly define the hypotheses : Clearly define the null and alternative hypotheses that you want to test.
  • Use a suitable statistical method : Choose a suitable statistical method to test the hypotheses.
  • Use cross-validation : Use cross-validation to evaluate the performance of the model on unseen data.
  • Avoid overfitting and underfitting : Use techniques such as regularization, early stopping, and feature engineering to avoid overfitting and underfitting.
  • Document the results : Document the results of the hypothesis test, including the statistical method used, the results, and any conclusions drawn.

Evaluating hypotheses is a crucial step in machine learning that helps us to understand the problem that we are trying to solve and to evaluate the performance of our machine learning model. By following the best practices outlined in this blog post, you can ensure that your hypothesis testing is rigorous, reliable, and effective.

Remember to clearly define the null and alternative hypotheses, choose a suitable statistical method, and avoid common pitfalls such as overfitting, underfitting, data leakage, and p-hacking. By doing so, you can develop machine learning models that are accurate, reliable, and effective.




What is hypothesis in Machine Learning?

The hypothesis is a term that is frequently used in Machine Learning and data science projects. As we all know, machine learning is one of the most powerful technologies in the world, allowing us to predict outcomes based on past experience. Data scientists and ML specialists conduct experiments aimed at solving a problem, and they begin by making an initial assumption about its solution.

What is a Hypothesis?

A hypothesis is a conjecture or proposed explanation based on limited evidence or assumptions. It is a guess grounded in certain known facts that have yet to be confirmed. A good hypothesis is testable and yields a result that is either true or false.

Let's look at an example to better grasp the idea. Some scientists claim that ultraviolet (UV) light can harm the eyes and may even cause blindness.

In this case, the scientist merely states that UV rays are hazardous to the eyes, while people presume they can lead to blindness. This may or may not turn out to be true. Assumptions of this kind are referred to as hypotheses.

Defining Hypothesis in Machine Learning

In machine learning, a hypothesis is a mathematical function or model that converts input data into output predictions. It represents the model's initial belief or explanation, based on the available data. The hypothesis is typically expressed as a collection of parameters characterizing the behavior of the model.

Suppose we are building a model to predict the price of a property based on its size and location. The hypothesis function may look something like this −

$$\mathrm{h(x)\:=\:θ0\:+\:θ1\:*\:x1\:+\:θ2\:*\:x2}$$

Here h(x) is the hypothesis function, x is the input data, θ0, θ1, and θ2 are the model's parameters, and x1 and x2 are the features.

The machine learning model's purpose is to discover the optimal values of the parameters θ0, θ1, and θ2 that minimize the difference between predicted and actual output labels.

To put it another way, we're looking for the hypothesis function that best represents the underlying link between the input and output data.
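
A minimal sketch of this parameter search, using plain batch gradient descent on a made-up (size, location) dataset; the data values, learning rate, and iteration count are illustrative assumptions, not part of the original example:

```python
# Toy data: x1 = size, x2 = location score, y = price (all in scaled units),
# generated from y = 2*x1 + 3*x2 so the true parameters are known.
data = [(1.0, 2.0, 8.0), (2.0, 1.0, 7.0), (3.0, 3.0, 15.0), (4.0, 2.0, 14.0)]

theta0, theta1, theta2 = 0.0, 0.0, 0.0   # initial hypothesis: h(x) = 0
lr = 0.01                                # learning rate

for _ in range(20_000):
    g0 = g1 = g2 = 0.0
    for x1, x2, y in data:
        err = (theta0 + theta1 * x1 + theta2 * x2) - y   # h(x) - y
        g0 += err
        g1 += err * x1
        g2 += err * x2
    n = len(data)
    # Step each parameter against its average gradient.
    theta0 -= lr * g0 / n
    theta1 -= lr * g1 / n
    theta2 -= lr * g2 / n

print(f"h(x) = {theta0:.2f} + {theta1:.2f}*x1 + {theta2:.2f}*x2")
```

Gradient descent here searches the parameter space of the hypothesis; for a linear model like this, closed-form least squares would find the same θ values directly.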

Types of Hypotheses in Machine Learning

The next step is to build a hypothesis after identifying the problem and obtaining evidence. A hypothesis is a proposed explanation or solution to a problem based on limited data; it acts as a springboard for further investigation and experimentation. In machine learning, a hypothesis is a function that maps inputs to outputs based on certain assumptions. A good hypothesis contributes to the creation of an accurate and efficient machine-learning model. Common types of hypotheses in machine learning are as follows −

1. Null Hypothesis

A null hypothesis is a basic hypothesis that states that no link exists between the independent and dependent variables. In other words, it assumes the independent variable has no influence on the dependent variable. It is symbolized by H0. The null hypothesis is typically rejected when the p-value falls below the significance level (α). If the null hypothesis is true, α is the probability of wrongly rejecting it. A null hypothesis is involved in tests such as t-tests and ANOVA.

2. Alternative Hypothesis

An alternative hypothesis is a hypothesis that contradicts the null hypothesis. It assumes that there is a relationship between the independent and dependent variables. In other words, it assumes that there is an effect of the independent variable on the dependent variable. It is denoted by Ha. An alternative hypothesis is generally accepted if the p-value is less than the significance level (α). An alternative hypothesis is also known as a research hypothesis.

3. One-tailed Hypothesis

A one-tailed test is a type of significance test in which the region of rejection lies at one end of the sample distribution. It checks whether the estimated test statistic is greater than (or less than) the critical value, in which case the alternative hypothesis is accepted over the null hypothesis. It is commonly used with the chi-square distribution, where the entire critical region, corresponding to α, is placed in one of the two tails. A one-tailed test can be either left-tailed or right-tailed.

4. Two-tailed Hypothesis

The two-tailed test is a hypothesis test in which the region of rejection or critical area is on both ends of the normal distribution. It determines whether the sample tested falls within or outside a certain range of values, and an alternative hypothesis is accepted if the calculated value falls in either of the two tails of the probability distribution. α is bifurcated into two equal parts, and the estimated parameter is either above or below the assumed parameter, so extreme values work as evidence against the null hypothesis.
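
As a hedged sketch of the two-tailed decision rule, assuming a standard-normal test statistic (a normal approximation, not any particular library's test), the p-value sums the probability mass in both tails beyond |z|:

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value for a standard-normal test statistic z."""
    # P(|Z| >= |z|) = 2 * (1 - Phi(|z|)), with Phi built from the error function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

alpha = 0.05
for z in (0.5, 1.96, 3.0):
    p = two_tailed_p(z)
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"z = {z:4.2f} -> p = {p:.4f} ({verdict})")
```

Note how z = 1.96 gives p ≈ 0.05: that is exactly why 1.96 is the familiar two-tailed critical value at α = 0.05, with α/2 = 0.025 of the mass in each tail.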

Overall, the hypothesis plays a critical role in the machine learning model. It provides a starting point for the model to make predictions and helps to guide the learning process. The accuracy of the hypothesis is evaluated using various metrics like mean squared error or accuracy.

The hypothesis is a mathematical function or model that converts input data into output predictions, typically expressed as a collection of parameters characterizing the behavior of the model. It is an explanation or solution to a problem based on insufficient data. A good hypothesis contributes to the creation of an accurate and efficient machine-learning model. A two-tailed hypothesis is used when there is no prior knowledge or theoretical basis to infer a certain direction of the link.

Premansh Sharma


Hypothesis in Machine Learning: Comprehensive Overview (2021)


Introduction

Supervised machine learning (ML) is often described as the problem of approximating a target function that maps inputs to outputs. This can be framed as searching through and evaluating candidate hypotheses from a hypothesis space.

The discussion of hypotheses in machine learning can be confusing for a novice, particularly because “hypothesis” has a distinct but related meaning in statistics and, more broadly, in science.

Hypothesis Space (H)

The hypothesis space used by an ML system is the set of all hypotheses that it might return. It is typically characterized by a hypothesis language, possibly combined with a language bias.

Many ML algorithms rely on some form of search: given a set of observations and the space of all hypotheses that might be considered, they search this space for the hypotheses that best fit the data or are optimal with respect to some other quality criterion.

ML can be described as the problem of using available data to find the function that most reliably maps inputs to outputs, referred to as function approximation: we approximate an unknown target function that maps inputs to outputs as well as possible over all expected observations from the problem domain. A model that approximates this target function and performs the mapping from inputs to outputs is called a hypothesis.

The hypothesis class in machine learning is the set of all potential hypotheses that you are searching over, regardless of their form. For convenience, the hypothesis class is usually restricted to one type of function or model at a time, since learning techniques typically work on one type at a time. This does not have to be the case, however:

  • Hypothesis classes don’t need to consist of only one kind of function. If you are searching over exponential, quadratic, and general linear functions, those together form your combined hypothesis class.
  • Hypothesis classes also don’t need to consist of only simple functions. If you manage to search over all piecewise-tanh2 functions, those functions are what your hypothesis class includes.

The big trade-off is that the larger your hypothesis class in machine learning, the better the best hypothesis in it models the underlying true function, but the harder it is to find that best hypothesis. This is related to the bias-variance trade-off.

Hypothesis (h)

A hypothesis function in machine learning is the function that best describes the target. The hypothesis that an algorithm arrives at depends on the data as well as on the bias and restrictions that we have imposed on the data.

The hypothesis formula in machine learning is y = mx + b, where:

  • y is the range (output)
  • m is the slope: the change in y divided by the change in x
  • x is the domain (input)
  • b is the intercept
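
Assuming this simple linear form, m and b can be recovered from data with ordinary least squares; a minimal pure-Python sketch on made-up points:

```python
# Toy (x, y) pairs generated from y = 2x + 1, so least squares should
# recover slope m = 2 and intercept b = 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def least_squares(xs, ys):
    """Closed-form ordinary least squares for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x          # slope: covariance(x, y) / variance(x)
    b = mean_y - m * mean_x     # intercept passes through the means
    return m, b

m, b = least_squares(xs, ys)
print(f"hypothesis: y = {m:.1f}x + {b:.1f}")   # prints: hypothesis: y = 2.0x + 1.0
```

Each (m, b) pair is one hypothesis; fitting selects the member of this family that minimizes squared error on the data.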

The purpose of restricting the hypothesis space in machine learning is so that the resulting hypotheses fit the available data well. The learner checks the truth or falsity of observations or inputs and analyzes them accordingly. It thereby performs the useful function of mapping all inputs through to outputs. Consequently, the target functions are carefully examined and restricted based on the outcomes (whether or not they are free of bias) in ML.

Regarding hypothesis space and inductive bias in machine learning: the hypothesis space is a collection of valid hypotheses, i.e., all the candidate functions under consideration, while the inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions the learner uses to predict outputs for inputs it has not encountered. Regression and classification are kinds of learning that deal with continuous-valued and discrete-valued targets, respectively. Problems of this sort are called inductive learning problems, since we identify a function by inducing it from data.

Maximum a Posteriori (MAP) estimation provides a Bayesian probability framework for fitting model parameters to training data; an alternative, sibling approach is the more common Maximum Likelihood Estimation (MLE). MAP learning selects the single most probable hypothesis given the data. The prior over hypotheses is still used, and the technique is often more tractable than full Bayesian learning.

Bayesian techniques can be used to determine the most probable hypothesis given the data: the MAP hypothesis. This is the optimal hypothesis in the sense that no other hypothesis is more likely.
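
As an illustrative sketch (a coin-bias example with a Beta prior; the counts and prior values are assumptions for illustration only): the MAP estimate combines the prior with the likelihood, whereas maximum likelihood uses the data alone.

```python
# Estimating a coin's heads probability from 7 heads in 10 flips.
heads, flips = 7, 10

# MLE: maximize the likelihood alone.
mle = heads / flips

# MAP with a Beta(a, b) prior: maximize prior * likelihood.
# The posterior is Beta(heads + a, flips - heads + b), whose mode is:
a, b = 2.0, 2.0          # a mild prior belief that the coin is fair
map_est = (heads + a - 1) / (flips + a + b - 2)

print(f"MLE estimate: {mle:.3f}")       # uses the data alone
print(f"MAP estimate: {map_est:.3f}")   # pulled toward 0.5 by the prior
```

The MAP hypothesis (≈0.667 here) sits between the MLE (0.7) and the prior's belief (0.5), which is exactly the "prior is still used" point above.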

Hypothesis in machine learning: a candidate model that approximates a target function for mapping instances of inputs to outputs.

Hypothesis in statistics: a probabilistic explanation about the presence of a relationship between observations.

Hypothesis in science: a provisional explanation that fits the evidence and can be disproved or confirmed. We can see that a hypothesis in machine learning draws upon this broader meaning of hypothesis in science.



What is Hypothesis in Machine Learning? How to Form a Hypothesis?

Hypothesis Testing is a broad subject that is applicable to many fields. In statistics, Hypothesis Testing involves data from multiple populations, and the test determines how significant the effect is on the population.


This involves calculating the p-value and comparing it with the critical value or the alpha. When it comes to Machine Learning, Hypothesis Testing deals with finding the function that best approximates independent features to the target. In other words, map the inputs to the outputs.

By the end of this tutorial, you will know the following:


  • What is Hypothesis in Statistics vs Machine Learning
  • What is Hypothesis space?

  • Process of Forming a Hypothesis


Hypothesis in Statistics

A Hypothesis is an assumption of a result that is falsifiable, meaning it can be proven wrong by some evidence. A Hypothesis can be either rejected or failed to be rejected. We never accept any hypothesis in statistics because it is all about probabilities and we are never 100% certain. Before the start of the experiment, we define two hypotheses:

1. Null Hypothesis: says that there is no significant effect

2. Alternative Hypothesis: says that there is some significant effect

In statistics, we compare the P-value (calculated using different types of statistical tests) with the critical value, or alpha. The larger the P-value, the more likely the observed effect is to have occurred by chance, which signifies that the effect is not significant, and we conclude that we fail to reject the null hypothesis.

In other words, the effect is highly likely to have occurred by chance and there is no statistical significance of it. On the other hand, if we get a P-value very small, it means that the likelihood is small. That means the probability of the event occurring by chance is very low. 


Significance Level

The Significance Level is set before starting the experiment. It defines the tolerance for error: the level at which an effect can be considered significant. A common choice is a significance level of 0.05, equivalent to a 95% confidence level, which means there is a 5% chance of the test fooling us into making an error. In other words, alpha is 0.05 and acts as a threshold for the P-value. Similarly, a 99% confidence level corresponds to an alpha of 0.01.

A statistical test is carried out on the sample to find the P-value, which is then compared with alpha. If the P-value is less than alpha, we conclude that the effect is significant and hence reject the Null Hypothesis (which said there is no significant effect). If the P-value is greater than alpha, we conclude that there is no significant effect and hence fail to reject the Null Hypothesis.

Now, as we can never be 100% sure, there is always a chance that our test procedure is correct but the results are misleading. Either we reject the null hypothesis when it is actually true (a Type 1 error), or we fail to reject the null hypothesis when it is actually false (a Type 2 error).

Example  

Consider you’re working for a vaccine manufacturer and your team develops a vaccine for Covid-19. To prove the efficacy of this vaccine, it needs to be statistically proven that it is effective on humans. Therefore, we take two groups of people of equal size and similar properties. We give the vaccine to group A and a placebo to group B. We then carry out analysis to see how many people in each group got infected.

We test this multiple times to see if group A developed any significant immunity against Covid-19 or not. We calculate the P-value for all these tests and conclude that P-values are always less than the critical value. Hence, we can safely reject the null hypothesis and conclude there is indeed a significant effect.


Hypothesis in Machine Learning

Hypothesis in Machine Learning is used when in a Supervised Machine Learning, we need to find the function that best maps input to output. This can also be called function approximation because we are approximating a target function that best maps feature to the target.

1. Hypothesis(h): A Hypothesis is a single model that maps features to the target; it is the result a learning algorithm produces. A hypothesis is signified by “ h ”.

2. Hypothesis Space(H): A Hypothesis space is the complete range of models and their possible parameters that can be used to model the data. It is signified by “ H ”. In other words, each Hypothesis is an element of the Hypothesis Space.

In essence, we have the training data (independent features and the target) and a target function that maps features to the target. These are then run on different types of algorithms using different types of configuration of their hyperparameter space to check which configuration produces the best results. The training data is used to formulate and find the best hypothesis from the hypothesis space. The test data is used to validate or verify the results produced by the hypothesis.

Consider an example where we have a dataset of 10000 instances with 10 features and one target. The target is binary, which means it is a binary classification problem. Now, say, we model this data using Logistic Regression and get an accuracy of 78%. We can draw the regression line which separates both the classes. This is a Hypothesis(h). Then we test this hypothesis on test data and get a score of 74%. 


Now, again assume we fit a RandomForests model on the same data and get an accuracy score of 85%. This is a good improvement over Logistic Regression already. Now we decide to tune the hyperparameters of RandomForests to get a better score on the same data. We do a grid search and run multiple RandomForest models on the data and check their performance. In this step, we are essentially searching the Hypothesis Space(H) to find a better function. After completing the grid search, we get the best score of 89% and we end the search. 
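
The idea of searching a hypothesis space H can be sketched without any library. Below, H is a tiny made-up family of threshold rules "predict 1 if x > t", and a grid search simply evaluates every candidate and keeps the best, mirroring the hyperparameter search described above:

```python
# Toy 1-D binary classification data: small x -> class 0, large x -> class 1.
X = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
y = [0,   0,   0,   0,   1,   1,   1,   1  ]

# Hypothesis space H: threshold rules h_t(x) = 1 if x > t else 0.
candidate_ts = [0.5 * i for i in range(0, 21)]   # grid t = 0.0, 0.5, ..., 10.0

def accuracy(t):
    preds = [1 if x > t else 0 for x in X]
    return sum(p == label for p, label in zip(preds, y)) / len(y)

# Grid search = evaluate every hypothesis in H and keep the best one.
best_t = max(candidate_ts, key=accuracy)
print(f"best hypothesis: predict 1 if x > {best_t} (train acc {accuracy(best_t):.2f})")
# prints: best hypothesis: predict 1 if x > 4.0 (train acc 1.00)
```

A real grid search over RandomForest hyperparameters works the same way, just with a richer hypothesis space and cross-validated scores instead of raw training accuracy.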


Now we also try more models, like XGBoost, Support Vector Machine, and Naive Bayes, to test their performance on the same data. We then pick the best-performing model and test it on the test data to validate its performance, obtaining a score of 87%.


Before you go

The hypothesis is a crucial aspect of Machine Learning and Data Science. It is present in all domains of analytics, be it pharma, software, or sales, and it is the deciding factor in whether a change should be introduced or not. A Hypothesis is evaluated over the complete training dataset to check the performance of the models from the Hypothesis space.

A Hypothesis must be falsifiable, which means that it must be possible to test and prove it wrong if the results go against it. The process of searching for the best configuration of the model is time-consuming when a lot of different configurations need to be verified. There are ways to speed up this process as well by using techniques like Random Search of hyperparameters.


Pavan Vadapalli


EDUCBA

Hypothesis in Machine Learning

Aashiya Mittal

Definition of Hypothesis in Machine Learning

A hypothesis in machine learning is an initial assumption or proposed explanation regarding the relationship between independent variables (features) and dependent variables (target) within a dataset. It serves as the foundational concept for constructing a statistical model. The hypothesis is formulated to elucidate patterns or phenomena observed in the data and is subject to validation through statistical methods and empirical testing. In the context of machine learning, the hypothesis often manifests as a predictive model, typically represented by a mathematical function or a set of rules.

Throughout the training phase, the machine learning algorithm refines this hypothesis by iteratively adjusting its parameters to minimize the disparity between predicted outputs and actual observations in the training data. Once the model is trained, the hypothesis encapsulates the learned relationship between input features and output labels, enabling the algorithm to generalize its predictions to new, unseen data. Therefore, a well-formulated machine learning hypothesis is testable and can generate predictions that extend beyond the training dataset.


For example, some scientists say we should not eat milk products with fish or seafood. In this situation, the scientists only claim that combining the two food types is harmful, but people presume it results in fatal disease or death. Such assumptions are called hypotheses.

Calculate Hypothesis

For a simple linear model, the hypothesis is y = mx + b, where:

  • m = slope of the line
  • b = intercept

Table of Contents

  • Hypothesis in the Machine Learning Workflow
  • Hypothesis Testing
  • Types of Hypotheses in Machine Learning
  • Components of a Hypothesis in Machine Learning
  • How does a Hypothesis work
  • Hypothesis Testing in Model Evaluation
  • Hypothesis Testing and Validation
  • Hypothesis in Statistics
  • Real-world Examples
  • Challenges and Pitfalls

Key Takeaways

  • A hypothesis in ML is a predictive model or function.
  • During training, we fine-tune parameters to achieve accurate predictions.
  • Aims to make predictions applicable to new, unseen data.
  • We can broadly categorize hypotheses into two types: null hypotheses and alternative hypotheses.

In machine learning, however, the hypothesis is a mathematical function that predicts the relationship between input data and output predictions; the model starts working from the known facts. A hypothesis here is like a guess or a proposed idea about how the data works: a model we create to make predictions.

We express the hypothesis as a collection of various parameters that impact the model’s behavior. The algorithm attempts to discover a mapping function using the data set. The parameters are modified throughout the learning process to reduce discrepancies between the expected and actual results. The goal is to fine-tune the model so it predicts well on new data, and we use a measure (cost function) to check its accuracy.

Let us illustrate this with the following example.

Imagine you want to predict students’ exam scores based on their study hours. Your hypothesis could be.

Predicted Score = w × Study Hours

The hypothesis suggests that the more hours a student studies, the higher their exam score. The coefficient w is what the machine learning algorithm will figure out during training. You collect data on study hours and actual exam scores, and the algorithm adjusts w to make the predictions as accurate as possible. This process of refining the hypothesis is at the core of machine learning.
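
A minimal sketch of learning the unknown coefficient (call it w) by least squares on made-up (hours, score) pairs; the data values are illustrative only:

```python
# Hypothetical training data: (study hours, exam score) pairs.
hours  = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [9.0, 21.0, 30.0, 42.0, 48.0]   # roughly score = 10 * hours

# Least-squares solution for score = w * hours (no intercept term):
# w = sum(x * y) / sum(x * x)
w = sum(x * s for x, s in zip(hours, scores)) / sum(x * x for x in hours)
print(f"learned w = {w:.2f}")

# Use the fitted hypothesis to predict the score for 6 hours of study.
print(f"predicted score for 6 hours: {w * 6:.1f}")
```

The single learned number w fully specifies this hypothesis; richer hypotheses would add an intercept or more features in the same way.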

Hypothesis testing refers to the systematic approach to determine if the findings of a specific study validate the researcher’s theory regarding a population. You can say that hypothetical testing is just an assumption made about a population parameter.

To conduct a hypothesis on a population, researchers or scientists perform hypothesis testing on sample data. Then, they evaluate the assumptions against the evidence. It includes evaluating two mutually exclusive statements regarding the population to determine which is best supported by the sample data.

Types of Hypotheses in Machine Learning

In machine learning, we can broadly categorize hypotheses into two types: the null hypothesis and the alternative hypothesis. The null and alternative hypotheses are distinct statements regarding a population. Through a hypothesis test, the sample data decides whether to reject the null hypothesis.

1. Null Hypothesis (H0): This hypothesis assumes that the samples exhibit identical characteristics with respect to the population; it posits no relationship between the independent and dependent variables. When the difference between two means is negligible or lacks significance, the data align with the null hypothesis.

Example: The new study method has no significant effect on exam scores compared to the traditional method.

2. Alternative Hypothesis (H1 or Ha): This hypothesis contradicts the null hypothesis, stating that the actual value of the population parameter differs from the value posited by the null hypothesis.

Example: The new study method is more effective, leading to higher exam scores than the traditional method.

Here, the mean for the new method is the average exam score of students using the new study method, and the mean for the traditional method is the average exam score of students using the traditional study method.

The null hypothesis assumes no divergence in average exam scores between the new and traditional study methods. In contrast, the alternate hypothesis proposes a positive difference, implying the new study method is more effective. The goal of data gathering and statistical testing is to ascertain whether there is enough evidence to reject the null hypothesis, which would support the notion that the new study method is superior in enhancing exam scores.

Below are the core components of testing a hypothesis.

  • Level of Significance

It signifies the probability of rejecting the null hypothesis when it is in fact true. The significance level α (alpha) is the threshold used to accept or reject a hypothesis.

For example, a significance level of 0.05 (5%) implies 95% confidence in the results: if we repeated the test many times, roughly 95% of the outcomes would fall within the accepted range.

  • P-Value

It refers to the probability of getting results at least as extreme as the observed ones, assuming the null hypothesis is true. If the P-value is less than or equal to the chosen significance level (α), the null hypothesis is rejected.

For Example, A P-value of 0.03 suggests a 3% chance of obtaining the observed results if the null hypothesis is correct. If α is 0.05, the P-value is less than α, indicating a rejection of the null hypothesis.
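As a small illustration of this decision rule (the z value is invented), the two-sided p-value for a test statistic can be computed with scipy and compared against α:

```python
from scipy import stats

alpha = 0.05
z = 2.17  # hypothetical observed test statistic

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * stats.norm.sf(abs(z))

print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```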

  • Test Statistic

Refers to the numerical value calculated from the sample datasets during hypothesis testing. The test statistics formula assesses the deviation of the sample data from the null hypothesis’ expected values.


For example, in a t-test, the test statistic may be the t-value, calculated by comparing the means of two groups and assessing if the difference is statistically significant.

  • Critical Value

It refers to the pre-defined threshold value that will help you decide whether to reject or accept the null hypothesis. You must reject the null hypothesis if the test statistic exceeds the critical value.

For Example, In a z-test, if the test statistic is greater than the critical value for a 95% confidence level, the null hypothesis is rejected.

  • Degrees of Freedom

Degrees of freedom refer to the variability in estimating a parameter, often linked to sample size. In hypothesis testing, degrees of freedom affect the shape of the distribution.

For example, in a t-test the degrees of freedom are determined by the sample size and affect the critical values. Larger degrees of freedom provide more precision in estimating population parameters.
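The effect of degrees of freedom on critical values can be seen directly; a sketch with scipy's t distribution (the α and df values here are arbitrary):

```python
from scipy import stats

# One-sided critical t values at alpha = 0.05 for increasing degrees of freedom.
# As df grows, the t distribution approaches the standard normal, so the
# critical value shrinks toward z = 1.645.
for df in (5, 30, 1000):
    t_crit = stats.t.ppf(0.95, df)
    print(f"df = {df:4d}  critical t = {t_crit:.3f}")
```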

How does a Hypothesis work?

In many machine learning methods, our main aim is to discover a hypothesis (a potential solution) from a set of possible solutions. The goal is to find a hypothesis that accurately connects the input data to the correct outcomes. The process typically involves exploring various hypotheses in a space of possibilities to identify the most suitable one.


Hypothesis Space (H)

The “hypothesis space” is the set of all candidate functions — all the allowed guesses — that a machine learning system can consider. The algorithm searches this set for the guess that best matches the expected outcomes.

Hypothesis (h)

In supervised machine learning, a hypothesis is a function that tries to explain the expected outcome. The specific function the algorithm picks is influenced by the data and any limitations or preferences we’ve set. For a simple linear model, the formula for this function can be expressed as

y = mx + b

In this formula,

  • y represents the predicted outcome,
  • m represents the line slope,
  • x refers to the input,
  • b is the intercept.
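In code, a single hypothesis h from this space is just one particular choice of the slope m and the intercept b (the numbers below are arbitrary):

```python
def h(x, m, b):
    """One linear hypothesis from the space H: y = m * x + b."""
    return m * x + b

# Two different hypotheses drawn from the same hypothesis space
print(h(3, m=2.0, b=1.0))  # candidate y = 2x + 1 -> 7.0
print(h(3, m=0.5, b=4.0))  # candidate y = 0.5x + 4 -> 5.5
```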

Let us explain the concepts of (h) and (H) using the following coordinates.

concepts of (h) and (H)

Consider that we have some test data for which we have to identify the result. See the below image with test data.

Image with test data

Now, we divide the coordinates to predict the outcome.

coordinates to predict the outcome

The below image will reflect the test data result.

reflect the test data result

How we split the coordinate plane to make predictions depends on the data, algorithm, and rules we set. The collection of all the legal ways we can divide the plane to predict test data outcomes is called the Hypothesis Space. Each specific way is called a hypothesis. In this example, the hypothesis space looks like this:

Hypothesis -possible and space

Hypothesis testing in model evaluation involves formulating assumptions about the model’s performance based on sample statistics and rigorously evaluating these assumptions against empirical evidence. It helps determine whether observed differences between model outcomes and expected results are statistically significant. This statistical method checks the validity of hypotheses regarding the model’s predictive accuracy. It also provides a systematic approach to determining the model’s effectiveness in new, unseen data.

For example, we are testing a new model that predicts whether emails are spam.

  • Null Hypothesis (H0): The new model performs no better than the existing one.
  • Alternative Hypothesis (H1): The new model is better than the existing one.
  • Train both models on a dataset.
  • Collect predictions on a sample of emails from each model.
  • Use hypothesis testing to assess if the differences in prediction accuracy are statistically significant.
  • Reject H0: If the new model’s improvement is statistically significant, you may conclude it performs better.
  • Fail to Reject H0: If there’s no significant improvement, you might stick with the existing model.
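One possible version of this comparison is a two-proportion z-test on the models' accuracies; a sketch with made-up counts rather than results from a real experiment:

```python
import numpy as np
from scipy import stats

# Hypothetical results on the same 1000 test emails
n = 1000
correct_old = 910  # existing model
correct_new = 940  # new model

p_old, p_new = correct_old / n, correct_new / n
p_pool = (correct_old + correct_new) / (2 * n)  # pooled accuracy under H0
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))   # standard error of the difference

# One-sided test of H1: the new model is more accurate
z = (p_new - p_old) / se
p_value = stats.norm.sf(z)

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value <= 0.05 else "Fail to reject H0")
```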

Hypothesis Testing and Validation

Below are the steps included in conducting detailed hypothesis testing.

1. Define null and alternate hypotheses.

The first step is to state the prediction you want to investigate. Based on it, create your null and alternate hypotheses so you can test them mathematically on the data collected from a specific population.

The null hypothesis predicts no relationship between the variables in that population, while the alternate hypothesis predicts that a relationship exists.

For example, testing a relationship between gender and height. For that, you hypothesize that men are, on average, taller than women.

H0: men and women are, on average, of the same height.

H1: men are, on average, taller than women.

2. Find the right significance level.

Now, you must select the significance level (α), say 0.05. This number sets the threshold for rejecting the null hypothesis. It validates the hypothesis test, ensuring we have enough information to support our prediction. You must fix your significance level before computing the p-value.

3. Collect sufficient data or samples.

To perform accurate statistical testing, you must do the correct sampling and collect data in such a way that it will complement your hypothesis. If your data is inaccurate, you might not be able to derive the right result for that specific population you want.

For example- To compare the average height of men and women, ensure an equal representation of both genders in your sample. Include diverse socio-economic groups and control variables. Consider the scope (global or specific country) and use census data for regions and social classes in multiple countries.

4. Calculate test statistic.

We analyze the data to compute a score based on its characteristics; the choice of test statistic depends on the specific type of hypothesis test being carried out. Various tests, like the Z-test, Chi-square test, T-test, etc., are employed based on the goals of the analysis.

For our height example the T-statistic is appropriate: it measures how different the averages of the two groups are, relative to the variability within each group. The calculation divides the difference in group averages by the standard error of that difference. It is also called the t-value or t-score.

  • Z-test: Measures how many standard deviations a data point or sample mean is from the population mean.
  • T-test: Assesses, accounting for sample variability, whether the means of two groups are significantly different.
  • Chi-square test: Identifies whether a significant relationship exists between two categorical variables in a contingency table.
  • ANOVA: Compares the means of more than two groups to evaluate whether there are significant differences.
  • Correlation test: Calculates the strength and direction of a linear relationship between two continuous variables.

We conducted a one-tailed t-test to check if men are taller than women. Results indicate an estimated average height difference of 13.7 cm, with a p-value of 0.002. The observed difference is statistically significant, suggesting men tend to be taller than women in the sample.
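A minimal version of this one-tailed test with scipy; the height samples below are invented and far smaller than a real study would use:

```python
import numpy as np
from scipy import stats

men = np.array([176.0, 181.0, 178.0, 183.0, 177.0, 180.0, 182.0, 179.0])
women = np.array([163.0, 166.0, 162.0, 168.0, 164.0, 167.0, 161.0, 165.0])

# One-tailed test of H1: mean height of men > mean height of women
t_stat, p_value = stats.ttest_ind(men, women, alternative="greater")

print(f"observed difference = {men.mean() - women.mean():.1f} cm")
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
```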

5. Compare the test statistics.

Comparing test statistics involves evaluating the obtained test statistic against critical values or p-values to decide whether to reject the null hypothesis. The comparison method depends on the type of statistical test we are conducting.

Method 1- using critical values

Identify the critical value(s) from the distribution associated with your chosen significance level (alpha).

  • If the absolute value of your calculated test statistic is greater than the critical value(s), you reject the null hypothesis.
  • If the test statistic falls within the non-rejection region defined by the critical values, you fail to reject the null hypothesis.

In a two-sided test, the null hypothesis gets rejected if the calculated test statistic is either excessively small or large. Consequently, we divide the rejection region for this test into two parts, one on the left and one on the right.

two-sided test

In a left-tailed test, we reject the null hypothesis only if the test statistic is very small. As such, for this kind of test, the rejection region consists of a single portion to the left of the center.

left-tailed test

In a right-tailed test, we reject the null hypothesis if the test statistic is sufficiently large. As such, the rejection region for this test consists of a single portion to the right of the center.

right-tailed test

Method 2- p-value approach

In the p-value approach, we assess the probability (p-value) of the test statistic’s numerical value compared to the hypothesis test’s predetermined significance level (α).

The p-value reflects the likelihood of observing sample data at least as extreme as the obtained test statistic. Lower p-values mean weaker support for the null hypothesis: the closer the p-value is to 0, the more compelling the evidence against it.

If the p-value is less than or equal to the specified significance level α, we reject the null hypothesis. Conversely, if the p-value exceeds α, we fail to reject the null hypothesis.

p-value approach

For example- analysis reveals a p-value of 0.002, below the 0.05 cutoff. Consequently, you reject the null hypothesis, indicating a significant difference in average height between men and women.

6. Present findings:

You can present the findings of your hypothesis testing, explaining the data sets, result summary, and other related information. Also, explain the process and methods involved to support your hypothesis.

In our study comparing the average height of men and women, we identified a difference of 13.7 cm with a p-value of 0.002. This study leads us to reject the idea that men and women have equal height, indicating a probable difference in their heights.

Hypothesis in Statistics:

A hypothesis denotes a proposition or assumption regarding a population parameter, guiding statistical analyses. There are two categories: the null hypothesis (H0) and the alternative hypothesis (H1 or Ha).

  • The null hypothesis (H0) posits no significant difference or effect, attributing observed results to chance, often representing the status quo or baseline assumption.
  • Conversely, the alternative hypothesis (H1 or Ha) opposes the null hypothesis, suggesting a significant difference or effect in the population and aiming for support with evidence.

Example 1- Impact of a Training Program on Employee Productivity

Suppose a company introduces a new training program to improve employee productivity. Before implementing the program across the organization, they conduct a study to assess its effectiveness.

Step 1: Define the Hypothesis

Null Hypothesis (H0): The training program does not affect employee productivity.

Alternate Hypothesis (H1): The training program positively affects employee productivity.

Step 2: Define the Significance Level.

Let’s consider the significance level at 0.05, indicating rejection of the null hypothesis if the evidence suggests less than a 5% chance of observing the results due to random variation.

Step 3: Compute the Test Statistic (T-statistic)

The formula for the T-statistic in a paired T-test is:

t = m / (s / √n)

where

m = mean of the differences d = X_after − X_before,

s = standard deviation of the differences d,

n = sample size (number of pairs).

mean_difference = np.mean(after_training - before_training)

std_dev_difference = np.std(after_training - before_training, ddof=1)  # ddof=1 for sample standard deviation

n_pairs = len(before_training)

t_statistic_manual = mean_difference / (std_dev_difference / np.sqrt(n_pairs))

Step 4: Find the P-value

Calculate the p-value using the test statistic and degrees of freedom.

df = n_pairs - 1

p_value_manual = 2 * (1 - stats.t.cdf(np.abs(t_statistic_manual), df))
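Putting steps 3 and 4 together on the before/after productivity scores, the manual calculation can be checked against scipy's built-in paired test:

```python
import numpy as np
from scipy import stats

before = np.array([120, 118, 125, 112, 130, 122, 115, 121, 128, 119], dtype=float)
after = np.array([130, 135, 142, 128, 125, 138, 130, 133, 140, 129], dtype=float)

d = after - before
n = len(d)

# Manual paired t-statistic and two-sided p-value
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
p_manual = 2 * (1 - stats.t.cdf(np.abs(t_manual), df=n - 1))

# scipy's built-in paired t-test should agree
t_scipy, p_scipy = stats.ttest_rel(after, before)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"scipy : t = {t_scipy:.4f}, p = {p_scipy:.6f}")
```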

Step 5: Result

  • If the p-value is less than or equal to 0.05, reject the null hypothesis.
  • If the p-value is greater than 0.05, fail to reject the null hypothesis.

Example using Python

import numpy as np
from scipy import stats

before_training = np.array([120, 118, 125, 112, 130, 122, 115, 121, 128, 119])
after_training = np.array([130, 135, 142, 128, 125, 138, 130, 133, 140, 129])

# Step 1: Null and Alternate Hypotheses
null_hypothesis = "The training program has no effect on employee productivity."
alternate_hypothesis = "The training program has a positive effect on employee productivity."

# Step 2: Significance Level
alpha = 0.05

# Step 3: Paired T-test
t_statistic, p_value = stats.ttest_rel(after_training, before_training)

# Step 4: Decision
if p_value <= alpha:
    decision = "Reject"
else:
    decision = "Fail to reject"

# Step 5: Conclusion
if decision == "Reject":
    conclusion = "The training program has a positive effect on employee productivity."
else:
    conclusion = "There is insufficient evidence to claim a significant difference in employee productivity before and after the training program."

# Display results
print("Null Hypothesis:", null_hypothesis)
print("Alternate Hypothesis:", alternate_hypothesis)
print(f"Significance Level (alpha): {alpha}")
print("\n--- Hypothesis Testing Results ---")
print("T-statistic (from scipy):", t_statistic)
print("P-value (from scipy):", p_value)
print(f"Decision: {decision} the null hypothesis at alpha={alpha}.")
print("Conclusion:", conclusion)


Challenges and pitfalls

  • Failure to Capture Underlying Patterns

Some hypotheses or models may not effectively capture the actual patterns in the available data. This failure results in poor model performance, as the predictions may not align with the actual outcomes.

Example: If a linear regression model is used to fit a non-linear relationship, it might fail to capture the underlying complexity in the data.

  • Biased Training Data

The training data used to develop a model may contain biases, reflecting historical inequalities or skewed representations. Biased training data can lead to unfair predictions, especially for underrepresented groups, perpetuating or exacerbating existing disparities.

Example: If a facial recognition system is trained mainly on a specific demographic, it may struggle to accurately recognize faces from other demographics.

  • Poor-quality data:

Data with noise, inaccuracies, or missing values can negatively affect the model’s performance. Only reliable input data can ensure the accuracy of the hypotheses or predictions.

Example: A weather prediction model may struggle to provide accurate forecasts when trained on a dataset with inconsistent temperature recordings.

  • Inclusion of Irrelevant or Redundant Features:

Including too many irrelevant or redundant features in the model can hamper its performance. Unnecessary features may introduce noise, increase computational complexity, and hinder the model’s generalization ability.

Example: In a spam email classification model, including irrelevant metadata might not contribute to accurate spam detection.

  • Assumptions about Data Distribution:

Making assumptions about data distribution that do not hold true can lead to unreliable hypotheses. Models relying on incorrect assumptions may fail to make accurate predictions.

Example: Assuming a normal distribution when the data is skewed could result in misinterpretations and poor predictions.

  • Evolution of Data

Over time, the characteristics and patterns in the data may change. Hypotheses developed based on outdated data may lose their relevance and accuracy.

Example: Economic models trained on historical data might not accurately predict market trends if there are significant changes in economic conditions.

  • Complex Models with Low Interpretability

Some advanced models, like deep neural networks, can be complex and challenging to interpret. Understanding and explaining the decisions of such models becomes difficult, particularly in regulated or sensitive domains where transparency is crucial.

Example: In healthcare, a highly complex model for disease prediction may provide accurate predictions but lack transparency in explaining why a specific patient received a particular diagnosis.

Hypothesis testing is a cornerstone in machine learning, guiding model assessment and decision-making. It addresses overfitting risks, assesses the significance of performance differences, and aids in feature selection. With its versatility, it ensures robust evaluations across various ML tasks. The interplay between significance levels, model comparisons, and ethical considerations underscores its importance in crafting reliable and unbiased predictive models, fostering informed decision-making in the dynamic landscape of machine learning.

Frequently Asked Questions (FAQs)

Q1. Can hypothesis testing be applied to compare different machine-learning algorithms?

Answer: Yes, you can use hypothesis testing to compare the performance of various ML algorithms, providing a statistical framework to determine if observed differences in predictive accuracy are significant and not random fluctuations.

Q2. How can hypothesis testing assist in feature selection in machine learning?

Answer: You can use hypothesis testing to evaluate the significance of individual features in a model. It aids in selecting pertinent features and removing those that have little bearing on prediction accuracy.

Q3. How can continuous monitoring and adaptation be integrated with hypothesis testing in machine learning?

Answer: Continuous monitoring involves regularly reassessing model hypotheses to adapt to evolving data dynamics. Hypothesis testing is a systematic tool to evaluate ongoing model performance, ensuring timely adjustments and sustained reliability in predictive outcomes.

Recommended Articles

We hope that this EDUCBA information on “Hypothesis in Machine Learning” was beneficial to you. You can view EDUCBA’s recommended articles for more information.






What exactly is a hypothesis space in machine learning?

Whilst I understand the term conceptually, I'm struggling to understand it operationally. Could anyone help me out by providing an example?

  • machine-learning
  • terminology


  • $\begingroup$ A space where we can predict output by a set of some legal hypothesis (or function) and function is represented in terms of features. $\endgroup$ –  Abhishek Kumar Commented Aug 9, 2019 at 17:03

3 Answers

Let's say you have an unknown target function $f:X \rightarrow Y$ that you are trying to capture by learning. In order to capture the target function you have to come up with some hypotheses, or candidate models, denoted by $h_1,\ldots,h_n$ where $h \in H$. Here, $H$, the set of all candidate models, is called the hypothesis class, hypothesis space, or hypothesis set.

For more information browse Abu-Mostafa's presentation slides: https://work.caltech.edu/textbook.html


  • 8 $\begingroup$ This answer conveys absolutely no information! What is the intended relationship between $f$, $h$, and $H$? What is meant by "hypothesis set"? $\endgroup$ –  whuber ♦ Commented Nov 28, 2015 at 20:50
  • 5 $\begingroup$ Please take a few minutes with our help center to learn about this site and its standards, JimBoy. $\endgroup$ –  whuber ♦ Commented Nov 28, 2015 at 20:57
  • $\begingroup$ The answer says very clear, h learns to capture target function f . H is the space where h1, h2,..hn got defined. $\endgroup$ –  Logan Commented Nov 29, 2018 at 21:47
  • $\begingroup$ @whuber I hope this is clearer $\endgroup$ –  pentanol Commented Aug 6, 2021 at 8:51
  • $\begingroup$ @pentanol You have succeeded in providing a different name for "hypothesis space," but without a definition or description of "candidate model," it doesn't seem to add any information to the post. What would be useful is information relevant to the questions that were posed, which concern "understand[ing] operationally" and a request for an example. $\endgroup$ –  whuber ♦ Commented Aug 6, 2021 at 13:55

Suppose an example with four binary features and one binary output variable. Below is a set of observations:

This set of observations can be used by a machine learning (ML) algorithm to learn a function f that is able to predict a value y for any input from the input space .

We are searching for the ground truth f(x) = y that explains the relation between x and y for all possible inputs in the correct way.

The function f has to be chosen from the hypothesis space .

To get a better idea: in the above example the input space has size $2^4 = 16$, the number of possible inputs. The hypothesis space has size $2^{2^4}=65536$, because for each of the $2^4$ possible feature combinations two outcomes ($0$ and $1$) are possible.

The ML algorithm helps us to find one function , sometimes also referred as hypothesis, from the relatively large hypothesis space.
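This count can be verified by brute force; a small pure-Python sketch:

```python
from itertools import product

n_features = 4
inputs = list(product([0, 1], repeat=n_features))  # the input space
n_hypotheses = 2 ** len(inputs)                    # one binary label per input

print(len(inputs))    # 16 possible inputs
print(n_hypotheses)   # 65536 possible labelings = size of the hypothesis space
```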

  • A Few Useful Things to Know About ML


  • 1 $\begingroup$ Just a small note on your answer: the size of the hypothesis space is indeed 65,536, but the a more easily explained expression for it would be $2^{(2^4)}$, since, there are $2^4$ possible unique samples, and thus $2^{(2^4)}$ possible label assignments for the entire input space. $\endgroup$ –  engelen Commented Jan 10, 2018 at 9:52
  • 1 $\begingroup$ @engelen Thanks for your advice, I've edited the answer. $\endgroup$ –  So S Commented Jan 10, 2018 at 21:00
  • $\begingroup$ @SoS That one function is called classifier?? $\endgroup$ –  user125163 Commented Aug 22, 2018 at 16:26
  • 2 $\begingroup$ @Arjun Hedge: Not the one, but one function that you learned is the classifier. The classifier could be (and that's your aim) the one function. $\endgroup$ –  So S Commented Aug 22, 2018 at 16:50

The hypothesis space is very relevant to the topic of the so-called Bias-Variance Tradeoff in maximum likelihood. That is, if the number of parameters in the model (hypothesis function) is too small for the model to fit the data (indicating underfitting, a hypothesis space that is too limited), the bias is high; if the model contains more parameters than needed to fit the data, the variance is high (indicating overfitting, a hypothesis space that is too expressive).

As stated in So S' answer, if the parameters are discrete we can easily and concretely calculate how many possibilities are in the hypothesis space (how large it is), but under real-life circumstances the parameters are usually continuous. Therefore the hypothesis space is generally uncountable.

Here is an example I borrowed and modified from the related part in the classical machine learning textbook: Pattern Recognition And Machine Learning to fit this question:

We are selecting a hypothesis function for an unknown function hidden in the training data given by a third person named CoolGuy living on an extragalactic planet. Let's say CoolGuy knows what the function is, because he provided the data cases and generated them using that function. We call it the ground truth function (we only have the limited data, while CoolGuy has both unlimited data and the function generating it) and denote it by $y(x, w)$.

[Figure: the ground-truth curve with the sampled data points]

The green curve is $y(x,w)$, and the little blue circles are the cases we have (they are not exactly the true values generated by CoolGuy, because they are contaminated by some transmission noise).

We think that the hidden function is very simple, so we first attempt a linear model (a hypothesis with a very limited space): $g_1(x, w)=w_0 + w_1 x$ with only two parameters, $w_0$ and $w_1$. We train the model on our data and obtain this:

[Figure: the fitted linear model against the data]

We can see that no matter how much data we use to fit this hypothesis, it just doesn't work, because it is not expressive enough.

So we try a much more expressive hypothesis: $g_9(x, w)=\sum_{j=0}^9 w_j x^j$ with ten adaptive parameters $w_0, w_1, \cdots, w_9$. We also train this model and get:

[Figure: the fitted degree-9 polynomial passing through every data point]

We can see that it is too expressive and fits all the data cases. A much larger hypothesis space (larger since $g_1$ can be expressed by $g_9$ by setting $w_2, w_3, \cdots, w_9$ all to 0) is more powerful than a simple hypothesis, but its generalization is also bad. That is, if we receive more data from CoolGuy and do inference, the trained model will most likely fail on those unseen cases.

Then how large a hypothesis space is large enough for the training dataset? We can find an answer in the textbook mentioned above:

One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model.

And you'll see from the textbook that if we use 4 parameters, $g_3=w_0+w_1 x + w_2 x^2 + w_3 x^3$, the trained function is expressive enough for the underlying function $y=\sin(2\pi x)$. It's kind of a black art to find the appropriate degree (3, i.e. the appropriate hypothesis space) in this case.

Then we can roughly say that the hypothesis space is a measure of how expressive your model is for fitting the training data. A hypothesis that is expressive enough for the training data is a good hypothesis from an expressive hypothesis space. To test whether the hypothesis is good or bad, we do cross-validation to see if it performs well on the validation data-set. If it is neither underfitting (too limited) nor overfitting (too expressive), the space is adequate (according to Occam's razor a simpler one is preferable, but I digress).
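The underfitting/overfitting behavior described above can be reproduced numerically with numpy's polyfit on noisy samples of $\sin(2\pi x)$; the sample count and noise level below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.size)

# Training error shrinks as the hypothesis space grows (degree 1 -> 3 -> 9),
# but the degree-9 fit merely interpolates the noise.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {mse:.6f}")
```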

  • $\begingroup$ This approach looks relevant, but your explanation does not agree with that on p. 5 of your first reference: "A function $h:X\to\{0,1\}$ is called [an] hypothesis. A set $H$ of hypotheses among which the approximation function $y$ is searched is called [the] hypothesis space." (I would agree the slide is confusing, because its explanation implicitly requires that $C=\{0,1\}$, whereas that is generically labeled "classes" in the diagram. But let's not pass along that confusion: let's rectify it.) $\endgroup$ –  whuber ♦ Commented Sep 24, 2016 at 15:33
  • 1 $\begingroup$ @whuber I updated my answer just now more than two years later after I have learned more knowledge on the topic. Please help check if I can rectify it in a better way. Thanks. $\endgroup$ –  Lerner Zhang Commented Feb 5, 2019 at 11:41




define hypothesis in machine learning

What is machine learning?

September 03, 2024


Machine learning, a transformative technology at the core of artificial intelligence (AI), utilizes data and algorithms to emulate human learning processes. It underpins conversational search, predictive text, and more in AI-powered Copilot . Learn about the definition of machine learning, its mechanisms, and how it enhances the capabilities of Copilot.

What is the definition of machine learning?

Machine learning is a subset of the larger field dedicated to crafting intelligent machines. It empowers computers to learn from data and enhance their performance autonomously, without explicit programming. As a self-learning process, it aligns with AI's goal: creating computer models that mimic human intelligence. Machine learning achieves this by utilizing algorithms and data to train brain-like systems in pattern recognition and decision-making. It’s what makes Copilot ’s image and text generation capabilities  possible.

How does machine learning work?

Imagine you want to teach a computer to identify whether an email is spam or not. In traditional programming, you would write explicit rules for classifying emails. But in machine learning, you feed the computer thousands of emails, both spam and legitimate ones. The machine learns by analyzing these examples and finding patterns. As it digests more data, it becomes better at distinguishing spam from real emails. Machine learning relies on algorithms, which are like recipes for computers. These algorithms process data, learn from it, and make predictions or decisions. Copilot  is equipped with high-quality data and algorithms to deliver high-quality, tailored content and information.
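The spam example above can be sketched in a few lines of code. This is an illustrative toy (the data, function names, and scoring rule are all made up for this sketch), not how Copilot or any production filter actually works: the "model" simply counts how often each word appears in spam versus legitimate training emails and scores new messages by those counts.

```python
# Toy sketch: "learning" spam words from labeled examples instead of
# hand-writing rules. Real systems use far larger corpora and richer models.
from collections import Counter

training_data = [
    ("win a free prize now", "spam"),
    ("claim your free money", "spam"),
    ("meeting moved to noon", "ham"),
    ("lunch with the team tomorrow", "ham"),
]

def train(examples):
    # Count word occurrences per class -- this is the "learning" step.
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def classify(text, counts):
    # Score a new message by how often its words appeared in each class.
    scores = {label: sum(c[w] for w in text.split())
              for label, c in counts.items()}
    return max(scores, key=scores.get)

model = train(training_data)
print(classify("free prize money", model))       # -> spam
print(classify("team meeting tomorrow", model))  # -> ham
```

Feeding the model more labeled examples refines the word counts, which is the sense in which it "becomes better at distinguishing spam from real emails" as it digests more data.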

An abstract drawing of machine learning in process

Credit: Image created with AI

How does Copilot use AI machine learning?

Copilot  harnesses the power of AI machine learning to elevate the user experience, enabling the following capabilities:

Contextual assistance

Copilot employs machine learning to offer contextual assistance that adapts to your needs. Whether you're drafting an email, writing poetry, planning a trip, or researching a topic, Copilot can suggest relevant information. From brainstorming gift ideas to learning a new skill, Copilot can help you gather information and resources quickly.

For enhanced performance during peak usage, you can upgrade to Copilot Pro  so you have assistance when it’s most important.

Text and images to order

Copilot  can produce text and images based on your text input, also known as your prompt . Simply enter a descriptive prompt asking Copilot to generate text or an image related to any subject or style. Using machine learning and natural language processing, Copilot will then produce writing or visuals based on your request. Because Copilot is conversational, you can keep asking for tweaks until you receive the output you’re after.

Predictive text enhancement

Machine learning within Copilot  enhances its predictive text capabilities, enabling it to anticipate your writing style and intent. Whether you're composing an email or a document, Copilot can help suggest word choices and sentence structures that align with your unique voice, saving you time and ensuring your content is tailored to you.

Thanks to machine learning, the more you use Copilot, the better its results. That’s one way it’s always evolving to enhance efficiency and productivity for users. Try Copilot  and the Copilot mobile app  today for AI assistance anytime, anywhere.



Hypothesis | Definition, Meaning and Examples

A hypothesis is a fundamental concept in the world of research and statistics. It is a testable statement that explains what is happening or being observed, and it proposes a relationship between the participating variables.

A hypothesis is also informally called a theory, thesis, guess, assumption, or suggestion. A hypothesis creates a structure that guides the search for knowledge.

In this article, we will learn what hypothesis is, its characteristics, types, and examples. We will also learn how hypothesis helps in scientific research.

Table of Content

  • What is Hypothesis?
  • Characteristics of Hypothesis
  • Sources of Hypothesis
  • Types of Hypothesis
  • Functions of Hypothesis
  • How Hypothesis Helps in Scientific Research

What is Hypothesis?

Hypothesis is a suggested idea or an educated guess or a proposed explanation made based on limited evidence, serving as a starting point for further study. They are meant to lead to more investigation.

It's mainly an educated guess or suggested answer to a problem that can be checked through study and experiment. In scientific work, we make these guesses, called hypotheses, to predict what will happen in tests or observations. They are not certainties but ideas that can be supported or refuted by real-world evidence. A good hypothesis is clear, can be tested, and can be proven wrong if the evidence doesn't support it.


Hypothesis Meaning

A hypothesis is a proposed statement that is testable and is given for something that happens or observed.
  • It is made using what we already know and have seen, and it’s the basis for scientific research.
  • A clear guess tells us what we think will happen in an experiment or study.
  • It’s a testable clue that can be proven true or wrong with real-life facts and checking it out carefully.
  • It usually takes an “if-then” form, stating the expected cause-and-effect relationship between the variables being studied.

Here are some key characteristics of a hypothesis:

  • Testable: A hypothesis should be framed so that it can be tested through experiments or observation. It should state a clear relationship between variables.
  • Specific: It needs to be focused and on target, addressing a particular aspect or relationship between variables in a study.
  • Falsifiable: A good hypothesis must be capable of being shown wrong; there must be possible evidence or observations that could contradict it.
  • Logical and Rational: It should be based on things we know now or have seen, giving a reasonable reason that fits with what we already know.
  • Predictive: A guess often tells what to expect from an experiment or observation. It gives a guide for what someone might see if the guess is right.
  • Concise: It should be short and clear, showing the suggested link or explanation simply without extra confusion.
  • Grounded in Research: A guess is usually made from before studies, ideas or watching things. It comes from a deep understanding of what is already known in that area.
  • Flexible: A guess helps in the research but it needs to change or fix when new information comes up.
  • Relevant: It should be related to the question or problem being studied, helping to direct what the research is about.
  • Empirical: Hypotheses come from observations and can be tested using methods based on real-world experiences.

Hypotheses can come from different places based on what you’re studying and the kind of research. Here are some common sources from which hypotheses may originate:

  • Existing Theories: Often, guesses come from well-known science ideas. These ideas may show connections between things or occurrences that scientists can look into more.
  • Observation and Experience: Watching events unfold or personal experience can suggest hypotheses. Noticing unusual occurrences or recurring patterns, in everyday life or in experiments, can prompt testable explanations.
  • Previous Research: Using old studies or discoveries can help come up with new ideas. Scientists might try to expand or question current findings, making guesses that further study old results.
  • Literature Review: Looking at books and research in a subject can help make guesses. Noticing missing parts or mismatches in previous studies might make researchers think up guesses to deal with these spots.
  • Problem Statement or Research Question: Often, ideas come from questions or problems in the study. Making clear what needs to be looked into can help create ideas that tackle certain parts of the issue.
  • Analogies or Comparisons: Making comparisons between similar things or finding connections from related areas can lead to theories. Understanding from other fields could create new guesses in a different situation.
  • Hunches and Speculation: Sometimes, scientists might get a gut feeling or make guesses that help create ideas to test. Though these may not have proof at first, they can be a beginning for looking deeper.
  • Technology and Innovations: New technology or tools might make guesses by letting us look at things that were hard to study before.
  • Personal Interest and Curiosity: People’s curiosity and personal interests in a topic can help create guesses. Scientists could make guesses based on their own likes or love for a subject.

Here are some common types of hypotheses:

  • Simple Hypothesis
  • Complex Hypothesis
  • Directional Hypothesis
  • Non-directional Hypothesis
  • Null Hypothesis (H0)
  • Alternative Hypothesis (H1 or Ha)
  • Statistical Hypothesis
  • Research Hypothesis
  • Associative Hypothesis
  • Causal Hypothesis

A Simple Hypothesis proposes a relationship between two variables: one independent and one dependent. Example: Studying more can help you do better on tests. Getting more sun exposure gives people higher levels of vitamin D.
A Complex Hypothesis predicts what will happen when more than two variables are involved, looking at how the different variables interact and may be linked together. Example: Income, access to education, and access to healthcare together strongly affect how many years people live. A new medicine's success relies on the dose, the age of the person taking it, and their genes.
A Directional Hypothesis states how one variable is related to another, for example predicting that one variable will increase or decrease the other. Example: Drinking more sugary drinks is linked to a higher body-mass index. Too much stress makes people less productive at work.

A Non-Directional Hypothesis states that a relationship exists between variables without specifying its direction. Example: Drinking caffeine can affect how well you sleep. People's music preferences differ by gender.
The Null Hypothesis (H0) states that there is no relationship or difference between the variables; it implies that any observed effect is due to chance or random variation in the data. Example: The average test scores of Group A and Group B are not significantly different. There is no relationship between using a certain fertilizer and how much crops grow.
Alternative Hypothesis is different from the null hypothesis and shows that there’s a big connection or gap between variables. Scientists want to say no to the null hypothesis and choose the alternative one. Example: Patients on Diet A have much different cholesterol levels than those following Diet B. Exposure to a certain type of light can change how plants grow compared to normal sunlight.
A Statistical Hypothesis is a statement about a population parameter (or parameters) that can be evaluated with sample data and formal statistical tests. Example: The average IQ score of children in a certain school district is 100. The average time it takes to finish a job using Method A is the same as with Method B.
Research Hypothesis comes from the research question and tells what link is expected between things or factors. It leads the study and chooses where to look more closely. Example: Having more kids go to early learning classes helps them do better in school when they get older. Using specific ways of talking affects how much customers get involved in marketing activities.
Associative Hypothesis guesses that there is a link or connection between things without really saying it caused them. It means that when one thing changes, it is connected to another thing changing. Example: Regular exercise helps to lower the chances of heart disease. Going to school more can help people make more money.
A Causal Hypothesis goes further than an associative one: it asserts that one variable causes another, i.e., there is a cause-and-effect relationship, so changing one variable directly changes the other. Example: Playing violent video games makes teens more likely to act aggressively. Poorer air quality directly harms respiratory health in city populations.
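The null and alternative hypotheses described above can be tested directly in code. The sketch below runs a simple permutation test on made-up test-score data: H0 says the two groups' mean scores do not differ, H1 says they do. The data and variable names are invented for illustration.

```python
# Permutation test of H0 (no difference in group means) vs H1 (a difference),
# using only the standard library and toy data.
import random
from statistics import mean

random.seed(0)
group_a = [85, 90, 88, 92, 86, 91]  # e.g. scores of students who studied more
group_b = [80, 83, 79, 85, 81, 84]  # scores of students who did not

observed = abs(mean(group_a) - mean(group_b))
pooled = group_a + group_b

# Under H0 the group labels are arbitrary, so reshuffle them many times
# and see how often chance alone produces a difference this large.
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:len(group_a)], pooled[len(group_a):]
    if abs(mean(a) - mean(b)) >= observed:
        count += 1

p_value = count / trials
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
# A small p-value means the data are unlikely under H0, so we would
# reject the null hypothesis in favor of the alternative.
```

This mirrors the logic of formal hypothesis testing: the null hypothesis is assumed true, and the data decide whether that assumption is tenable.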

Hypotheses have many important jobs in the process of scientific research. Here are the key functions of hypotheses:

  • Guiding Research: Hypotheses give a clear and exact way for research. They act like guides, showing the predicted connections or results that scientists want to study.
  • Formulating Research Questions: Research questions often create guesses. They assist in changing big questions into particular, checkable things. They guide what the study should be focused on.
  • Setting Clear Objectives: Hypotheses set the goals of a study by saying what connections between variables should be found. They set the targets that scientists try to reach with their studies.
  • Testing Predictions: Theories guess what will happen in experiments or observations. By doing tests in a planned way, scientists can check if what they see matches the guesses made by their ideas.
  • Providing Structure: Theories give structure to the study process by arranging thoughts and ideas. They aid scientists in thinking about connections between things and plan experiments to match.
  • Focusing Investigations: Hypotheses help scientists focus on certain parts of their study question by clearly saying what they expect links or results to be. This focus makes the study work better.
  • Facilitating Communication: Theories help scientists talk to each other effectively. Clearly made guesses help scientists to tell others what they plan, how they will do it and the results expected. This explains things well with colleagues in a wide range of audiences.
  • Generating Testable Statements: A good guess can be checked, which means it can be looked at carefully or tested by doing experiments. This feature makes sure that guesses add to the real information used in science knowledge.
  • Promoting Objectivity: Guesses give a clear reason for study that helps guide the process while reducing personal bias. They motivate scientists to use facts and data as proofs or disprovals for their proposed answers.
  • Driving Scientific Progress: Making, trying out and adjusting ideas is a cycle. Even if a guess is proven right or wrong, the information learned helps to grow knowledge in one specific area.

Researchers use hypotheses to set down their expectations and direct how an experiment will proceed. The following steps are involved in the scientific method:

  • Initiating Investigations: Hypotheses are the beginning of science research. They come from watching, knowing what’s already known or asking questions. This makes scientists make certain explanations that need to be checked with tests.
  • Formulating Research Questions: Ideas usually come from bigger questions in study. They help scientists make these questions more exact and testable, guiding the study’s main point.
  • Setting Clear Objectives: Hypotheses set the goals of a study by stating what we think will happen between different things. They set the goals that scientists want to reach by doing their studies.
  • Designing Experiments and Studies: Assumptions help plan experiments and watchful studies. They assist scientists in knowing what factors to measure, the techniques they will use and gather data for a proposed reason.
  • Testing Predictions: Ideas guess what will happen in experiments or observations. By checking these guesses carefully, scientists can see if the seen results match up with what was predicted in each hypothesis.
  • Analysis and Interpretation of Data: Hypotheses give us a way to study and make sense of information. Researchers look at what they found and see if it matches the guesses made in their theories. They decide if the proof backs up or disagrees with these suggested reasons why things are happening as expected.
  • Encouraging Objectivity: Hypotheses help make things fair by making sure scientists use facts and information to either agree or disagree with their suggested reasons. They lessen personal preferences by needing proof from experience.
  • Iterative Process: People either agree or disagree with guesses, but they still help the ongoing process of science. Findings from testing ideas make us ask new questions, improve those ideas and do more tests. It keeps going on in the work of science to keep learning things.


A hypothesis is a testable statement serving as an initial explanation for phenomena, based on observations, theories, or existing knowledge. It acts as a guiding light for scientific research, proposing potential relationships between variables that can be empirically tested through experiments and observations.

The hypothesis must be specific, testable, falsifiable, and grounded in prior research or observation, laying out a predictive, if-then scenario that details a cause-and-effect relationship. It originates from various sources including existing theories, observations, previous research, and even personal curiosity, leading to different types, such as simple, complex, directional, non-directional, null, and alternative hypotheses, each serving distinct roles in research methodology.

The hypothesis not only guides the research process by shaping objectives and designing experiments but also facilitates objective analysis and interpretation of data, ultimately driving scientific progress through a cycle of testing, validation, and refinement.

Hypothesis – FAQs

What is a Hypothesis?

A hypothesis is a possible explanation or prediction that can be checked through research and experiments.

What are Components of a Hypothesis?

The components of a Hypothesis are Independent Variable, Dependent Variable, Relationship between Variables, Directionality etc.

What makes a Good Hypothesis?

Testability, falsifiability, clarity, precision, and relevance are some parameters that make a good hypothesis.

Can a Hypothesis be Proven True?

You cannot prove conclusively that most hypotheses are true because it’s generally impossible to examine all possible cases for exceptions that would disprove them.

How are Hypotheses Tested?

Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data.

Can Hypotheses change during Research?

Yes, you can change or improve your ideas based on new information discovered during the research process.

What is the Role of a Hypothesis in Scientific Research?

Hypotheses are used to support scientific research and bring about advancements in knowledge.


  • Open access
  • Published: 03 September 2024

Low-pass whole genome sequencing of circulating tumor cells to evaluate chromosomal instability in triple-negative breast cancer

  • Serena Di Cosimo 1 ,
  • Marco Silvestri 1 , 2 ,
  • Cinzia De Marco 1 ,
  • Alessia Calzoni 2 , 3 ,
  • Maria Carmen De Santis 4 , 5 ,
  • Maria Grazia Carnevale 4 , 5 ,
  • Carolina Reduzzi 6 ,
  • Massimo Cristofanilli 6 &
  • Vera Cappelletti 1  

Scientific Reports volume 14, Article number: 20479 (2024)

  • Breast cancer
  • Cancer genomics
  • Tumour biomarkers

Chromosomal Instability (CIN) is a common and evolving feature in breast cancer. Large-scale Transitions (LSTs), defined as chromosomal breakages leading to gains or losses of at least 10 Mb, have recently emerged as a metric of CIN due to their standardized definition across platforms. Herein, we report the feasibility of using low-pass Whole Genome Sequencing to assess LSTs, copy number alterations (CNAs) and their relationship in individual circulating tumor cells (CTCs) of triple-negative breast cancer (TNBC) patients. Initial assessment of LSTs in breast cancer cell lines consistently showed wide-ranging values (median 22, range 4–33, mean 21), indicating heterogeneous CIN. Subsequent analysis of CTCs revealed LST values (median 3, range 0–18, mean 5), particularly low during treatment, suggesting temporal changes in CIN levels. CNAs averaged 30 (range 5–49), with loss being predominant. As expected, CTCs with higher LSTs values exhibited increased CNAs. A CNA-based classifier of individual patient-derived CTCs, developed using machine learning, identified genes associated with both DNA proliferation and repair, such as RB1 , MYC , and EXO1 , as significant predictors of CIN. The model demonstrated a high predictive accuracy with an Area Under the Curve (AUC) of 0.89. Overall, these findings suggest that sequencing CTCs holds the potential to facilitate CIN evaluation and provide insights into its dynamic nature over time, with potential implications for monitoring TNBC progression through iterative assessments.


Introduction

Breast cancer is a global health issue with approximately two and a half million new cases diagnosed annually worldwide 1 . Despite advances in screening, detection, and treatment, breast cancer remains the leading cause of cancer-related deaths among women 1 . The triple-negative (TNBC) subtype has the worst prognosis, emphasizing the need for improved care for both localized and metastatic patients 2 .

Chromosomal Instability (CIN) refers to the increased acquisition or loss of whole or fragmented chromosomes, and represents the most common form of genome instability in breast cancer 3 . Thus, improving our ability to assess CIN could offer promising insights into tumor progression and optimize patient care. Standard methods for evaluating CIN, such as DNA image cytometry and fluorescence in situ hybridization (FISH), are seldom used in the clinics due to their labor-intensive procedures and lack of high-throughput capabilities 4 . Alternative approaches including CIN70 5 and HET70 6 signatures, based on the expression of genes associated with aneuploidy and karyotype heterogeneity, or comparative genomic hybridization 7 have also been utilized, showing that increased CIN is associated with metastatic potential and dismal prognosis 5 , 6 , 7 . However, bulk analytical methods give a broad view of CIN without distinguishing between ongoing or past events that may not have continued. In addition, DNA image cytometry, FISH, and transcriptomic analysis face challenges in capturing the inherent cell-to-cell heterogeneity of CIN as they rely on pooled DNA samples 4 .

Single-cell sequencing (scDNAseq) is emerging as a promising approach to tackle the above listed challenges by providing accurate and quantitative CIN measures that are amenable to clinical use 8 . scDNAseq can provide insights into the underlying aberrant molecular pathways driving CIN, with DNA repair genes being prominent candidates 8 . Additionally, scDNAseq overcomes limitations and confounding factors associated with the use of bulk tissue, such as surrounding stromal tissue, tumor heterogeneity, and limited sample availability 8 . Importantly, scDNAseq can be applied to circulating tumor cells (CTCs), which are emerging as a significant resource for timely breast cancer molecular characterization 9 . Unlike invasive tumor tissue biopsy that is prone to sampling error, CTCs allow dynamic and repeatable assessment, representing the ideal source for longitudinal measuring of an evolving feature such as CIN 10 .

In this study, we leveraged our expertise in CTC genotyping by next-generation sequencing 11 to analyze CIN and underlying molecular alterations in TNBC patients. Specifically, we challenged low-pass Whole Genome Sequencing (lp-WGS) to determine the number of Large-Scale Transitions (LSTs) defined as contiguous regions of chromosomal breakage spanning at least 10 Mb 12 . The LST metric was chosen for its frequent use as a biomarker of CIN 8 , 13 . First, we tested the consistency of LST measurements using lp-WGS in a panel of breast cancer cell lines. Next, we extended our analyses to individual patient-derived CTCs collected at different clinical time-points, i.e., baseline, treatment, follow-up, and relapse. Finally, we developed a streamlined model for assessing CIN based on CTC copy number alterations (CNAs) within a specific set of genes.

Results

As part of technical feasibility, we initially evaluated LSTs as a means of CIN evaluation in breast cancer cell lines undergoing whole genome amplification and lp-WGS at the single-cell level. The analyses were conducted on MDA-MB-453, MDA-MB-361, BT474, BT549, and ZR-75 cell lines in replicates as reported in Table 1 . We observed a wide range of LSTs (median 22, range 4–33), reflecting the heterogeneous nature of CIN both within individual cells and across different cell lines (Fig.  1 a).

Figure 1. Large-scale transitions in breast cancer cell lines and patient-derived individual CTCs. Distribution of large-scale transitions (LSTs), defined as chromosomal breakpoints between adjacent regions spanning at least 10 megabases, in breast cancer cell lines (a) and patient-derived individual CTCs (b).

Notably, LST values were consistently and reproducibly determined for the tested cell lines (Table 1).

We next analyzed clinical samples from 12 patients with histologically confirmed TNBC, successfully profiling (> 400,000 reads) a total of 35 CTCs collected at various time points throughout the disease trajectory (Table 2 ).

LSTs in CTCs were heterogeneous (median 3, range 0–18), with values lower than those observed in cell lines, especially during treatment (median 2, range 0–13). Median LSTs in CTCs from patients with and without metastases were 2 and 3.5, respectively, and 3 in germline BRCA mutation carriers. The distribution of LST values displayed a bimodal shape (Fig. 1b). However, its limited extent prevented definition of a clear threshold, prompting the use of the median number of LSTs to classify CTCs as either LST-low (number of LSTs < 3) or LST-high (number of LSTs ≥ 3).
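As a rough illustration of the LST metric used above, the hypothetical sketch below counts breakpoints between adjacent copy-number segments that each span at least 10 Mb, then applies the study's median-based cutoff (LST-high when the count is ≥ 3). The segment format, field order, and data are invented for illustration and do not reproduce the authors' lp-WGS pipeline.

```python
# Hypothetical sketch: counting Large-Scale Transitions (LSTs) from a
# segmented copy-number profile. An LST is counted here as a breakpoint
# between two adjacent segments that each span at least 10 Mb and that
# differ in copy number.
MB = 1_000_000
MIN_SEGMENT = 10 * MB

def count_lsts(segments_by_chrom):
    """segments_by_chrom: {chrom: [(start, end, copy_number), ...]},
    with segments sorted by position within each chromosome."""
    lsts = 0
    for segments in segments_by_chrom.values():
        for left, right in zip(segments, segments[1:]):
            left_len = left[1] - left[0]
            right_len = right[1] - right[0]
            if (left_len >= MIN_SEGMENT and right_len >= MIN_SEGMENT
                    and left[2] != right[2]):
                lsts += 1
    return lsts

# Invented toy profile: two chromosomes, three and two segments.
profile = {
    "chr1": [(0, 40 * MB, 2), (40 * MB, 90 * MB, 3), (90 * MB, 95 * MB, 2)],
    "chr2": [(0, 60 * MB, 2), (60 * MB, 120 * MB, 1)],
}
n = count_lsts(profile)
print(n, "LST-high" if n >= 3 else "LST-low")  # -> 2 LST-low
```

Note how the 5 Mb segment on chr1 contributes no LST: both flanking segments of a breakpoint must clear the 10 Mb threshold.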

We next analyzed the CTC CNA profile. The mean number of CNAs per CTC was 30 (range 5–49), with deletions outnumbering amplifications at 401:291 (Supplementary Fig. 3). The most frequently lost or gained chromosomal regions and the corresponding genes are reported in Fig.  2 .

figure 2

Copy number alterations in individual CTCs of TNBC patients. The heatmap shows CTCs in the columns according to their number of LSTs and classified as high when ≥ 3 (dark blue) or low when < 3 (yellow). The rows show the top-fifty altered genes by chromosomal arm, with red indicating gain and blue indicating loss.

Recurrent alterations involved 9p and 9q, containing ABL1, NOTCH1, and CDKN2A; chromosome 10, containing MAPK8 and GATA3; and 22q, containing BCR, as expected and consistent with the literature on genes involved in TNBC oncogenesis 14. We also analyzed CNAs with respect to LSTs. Compared to CTCs classified as LST-low, those with higher values showed a numerical increase in CNAs overall (median 22 vs. 13, p = 0.08) and a prevalence of copy number losses, particularly in homologous recombination deficiency (HRD)-related genes: 59% (13/22) of LST-high CTCs versus 31% (4/13) of LST-low CTCs showed RAD51, BLM, or WRN copy loss (p = 0.05). Oncogenic signaling pathway analysis showed that CTCs classified as LST-high were enriched for CNAs, either gains or losses, affecting NRF2, TP53, and TGF-beta signaling (Supplementary Fig. 1).

However, the question remained as to which factors most strongly influence LSTs. Therefore, we used a Random Forest (RF) non-parametric machine learning method to develop a CNA-based classifier of patient-derived CTCs with and without LSTs (Supplementary Fig. 2).

A total of 39 covariates were included in the model, consisting of CNAs of established HRD-related 15 and TNBC driver 16 genes (Supplementary Table 1). RB1, MYC, and EXO1 emerged as the most relevant predictors of CIN among all covariates, with the variable importance index (VIMP) indicating that the prediction error rate would increase by up to 30% if the CNAs of these genes were randomly permuted in the model (Fig. 3a).

Figure 3. Model performance evaluation. (a) Internal measure of variable importance (VIMP) of altered genes in CTCs harboring CIN. The VIMP shows the decrease in classification accuracy when the values of a given variable are randomly permuted while all other predictors remain unchanged in the model; the larger the VIMP of a variable, the more predictive the variable. (b) Receiver operating characteristic (ROC) curve for prediction of LSTs based on the CNAs of breast cancer related genes profiled by lp-WGS and computed through an RF learning model. AUC: area under the curve.

Strikingly, the RF model yielded an AUC of 0.89, indicating that the analysis of CNAs in a few genes might be sufficient to achieve reliable classification of CIN (Fig. 3b).
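The AUC reported in Fig. 3b has a simple rank interpretation: it is the probability that a randomly chosen LST-high CTC receives a higher classifier score than a randomly chosen LST-low one. A stdlib-only Python sketch with hypothetical scores (the authors' computation was done in R):

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative
    (ties count as half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores: label 1 = CTC with LSTs, 0 = without
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
print(auc(scores, labels))  # prints 0.9375
```

An AUC of 0.5 corresponds to random scoring and 1.0 to perfect separation of LST-high from LST-low cells.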

Chromosomal instability is increasingly recognized as a cancer hallmark, crucial in initiation, progression, and metastasis, with implications for optimizing care 3, 17. However, its routine assessment is hindered by its dynamic nature and by limitations in currently available tools 4. Hence, there is a critical need to develop CIN biomarkers that are easily and reliably assessable to inform and guide clinical management, including in breast cancer patients. To the best of our knowledge, several studies have assessed CNAs in CTCs, but none have tackled CIN analysis 18, 19, 20. In this study, we analyzed lp-WGS data to evaluate LSTs and CNAs in individual CTCs from women with TNBC, and built a predictive classifier of CIN at the single-cell level achieving an AUC of 0.89. While our study is preliminary, it is, to our knowledge, the first to report a cost-effective sequencing assay such as lp-WGS for assessing LSTs in CTCs, the use of distinctive genetic features to evaluate a complex phenomenon, and, ultimately, the development of a well-performing predictive model based on CNA interactions. Additionally, we incorporated the assessment of CIN, a dynamic variable, on CTCs, whose analysis can be repeated over time through a minimally invasive blood draw. These findings not only pave the way to a novel analytical approach for assessing CIN but also provide a significant contribution to the field.

The distributions of LST values, both in breast cancer cell lines and in individual CTCs, confirm the significant heterogeneity of CIN. This observation is consistent with the existing literature, which suggests that the mechanisms underlying CIN, leading to dysfunctional chromosome duplication and segregation, can vary 21. Interestingly, the LST values observed in CTCs, particularly those from patients with recurrent disease, were not as elevated as expected. These findings align with prior research indicating low karyotypic variance during disease progression across various cancer types, including breast cancer 22. To reconcile this observation with the well-documented prevalence of CIN in cancer, the theory of the CIN paradox posits that tumors typically exhibit intermediate levels of CIN, as excessively high levels are detrimental, while insufficient levels do not guarantee an advantage in terms of proliferation and survival 23. In addition, the low LST values observed in recurrent breast cancer patients may be influenced by the number of CTCs analyzed, potentially affecting the observed prevalence of CIN. This raises the question of how to derive patient-level features from single-cell data. To the best of our knowledge, little previous work has estimated the required sample size, i.e., the number of cells to profile, to infer CIN from scDNA-seq data 24. Regarding CTCs, while some have suggested diagnosing cancer with CIN based on the presence of as few as one 25 to at least three unstable CTCs 26, it is uncertain whether this also applies to breast cancer. Therefore, further research is needed.

Several studies have characterized CNAs in TNBC tissue using high-resolution genomic data 16. Consistent with these findings, CTC CNAs more frequently showed deletions than amplifications. Despite the potential limitations of lp-WGS compared to higher-resolution next-generation sequencing, we report that CTC chromosomal gains and losses occurred in regions where breast cancer-related genes are generally found, suggesting that our findings were unlikely to be due to random sequencing dropout or amplification bias. For instance, CDKN2A and NOTCH1 were identified in loss regions 14, 16. It is also not surprising that CTCs with high LSTs were more frequently characterized by the loss of HDR-related genes. However, we cannot ascertain whether the loss of these genes is a cause or a consequence of LSTs. The fact remains that DNA repair genes alone do not fully explain CTC CIN. As already reported for tumor tissue, other factors such as mitotic errors, replication stress, telomere crisis, and breakage-fusion-bridge cycles 21, among others, may also be at play. Therefore, we hypothesized that the simultaneous analysis of copy number changes in a set of selected genes could help define CTCs with and without LSTs. To this end, we utilized, for the first time in this context, the RF learning model, which allowed us to examine the impact of different potential predictors in creating a predictive model 27. Our findings indicate that RB1, EXO1, and MYC are the most significant predictors among all covariates for identifying LSTs, with a variable importance index exceeding 30%. These results align with preclinical evidence suggesting that the loss of G1/S control resulting from RB1 pathway inactivation, coupled with MYC-induced mitogen addition and DNA damage, leads to chromatid breaks and chromatid cohesion defects in mitotic cells 28. These aberrations ultimately contribute to aneuploidy in the daughter cell population.
Furthermore, LSTs represent a subset of chromosomal rearrangements, particularly evident when double-strand breaks are repaired through non-homologous end joining, as observed in BRCA-deficient settings 12. In line with this, alterations of BRCA1 and BRCA2 demonstrated substantial predictive value within the developed classifier.

This study and its methods have several strengths, as the classifier presented here represents a resource for a deeper understanding of the origins and diversity of CIN. Our results focus attention on a narrow group of genes involved in cellular processes fundamental to maintaining genomic integrity. Additionally, our results support the broader application of CIN measures in clinical diagnostics, as sequencing techniques, previously rarely used due to technical difficulties, are becoming increasingly widespread and affordable. Finally, this work focuses on targets that may lead to potentially applicable therapies, beyond the platinum- 21 and taxane-based 29 treatments traditionally suggested for the most unstable tumors.

Despite these strengths, this study and the methods used also have weaknesses that should be noted. First, the number of LSTs is only one functional measure of CIN; other measures exist, including telomeric allelic imbalance and loss of heterozygosity. Second, data on copy number and LST burden in single tumor cells from large cohorts are lacking, and technical limitations require that the data generated to date be interpreted with caution. Finally, RF cannot produce hypothesis-testing results, such as relative risks, odds ratios, or p-values, as classical regression methods do, and its use is limited to model exploration. Hence, the data presented herein merit confirmation.

In conclusion, our study demonstrates the feasibility of lp-WGS for assessing both LSTs and CNAs in TNBC CTCs at the single-cell level. As a proof-of-concept study, we developed a classifier of LSTs based on CNAs of genes involved in both HDR and the replication process. Future research with larger sample sizes will be necessary to evaluate the clinical application of this assay, which lays the groundwork for leveraging CIN in precision oncology efforts.

Materials and methods

Sample processing.

For spiking experiments, five cell lines broadly representative of breast cancer, expressing (+) or lacking (−) the estrogen receptor (ER) and showing human epidermal growth factor receptor 2 amplified (HER2+) or normal (HER2−) status, were purchased from the American Type Culture Collection (ATCC, Manassas, VA, USA). ZR75-1 (ER+/HER2−), MDA-MB-453 (ER−/HER2+), MDA-MB-361 (ER+/HER2+), and BT-549 (ER−/HER2−) were cultured in DMEM/F-12 medium (Lonza, Switzerland), and BT474 (ER+/HER2+) in Dulbecco’s Modified Eagle’s Medium (DMEM) (Sigma, Darmstadt, Germany). All culture media were supplemented with antibiotic–antimycotic solution (100×) (Sigma, Darmstadt, Germany), 10% fetal bovine serum (FBS) (Sigma, Darmstadt, Germany), and L-glutamine (2 mM) (Invitrogen GmbH, USA), and tested negative for mycoplasma contamination. Single cells were manually captured under an inverted microscope using a p10 micropipette and directly spiked into healthy donor blood. Spiked-in samples were processed following the same protocols used for clinical samples.

Peripheral blood was collected from study patients in K2EDTA tubes (10 ml) and processed within 1 h of draw using the Parsortix platform (Angle plc, Guildford, UK) for size-based enrichment. Following enrichment, cells were harvested according to the manufacturer’s instructions and fixed with 2% paraformaldehyde for 20 min at room temperature.

Cell isolation, amplification and sequencing

Enriched patient samples were processed using the DEPArray system (Menarini Silicon Biosystems, Bologna, IT) 11. Individual cells were sorted based on morphological characteristics, DNA content, and fluorescence labeling against epithelial (CK, EpCAM, EGFR) and leukocyte (CD45, CD14, CD16) markers, as previously reported 11. Subsequently, white blood cells expressing only leukocyte markers and single CTCs expressing only epithelial markers or lacking any marker were recovered for downstream molecular analyses. WGA was performed on single cells using the Ampli1™ WGA kit version 02 (Menarini Silicon Biosystems, Bologna, IT) per the manufacturer’s instructions. For single cells derived from blood (CTCs and WBCs), the quality of the WGA product was determined using the Ampli1™ QC Kit (Menarini Silicon Biosystems, Bologna, IT). A genomic integrity index (GII), scored from 0 to 4, was assigned to each sample. Only single cells with sufficiently good-quality DNA, as determined by a GII ≥ 2, were selected for downstream analysis.

Low-pass whole genome sequencing and bioinformatics

The Ampli1™ low-pass kit for Illumina (Menarini Silicon Biosystems, Bologna, IT) was used to prepare low-pass whole genome sequencing (lp-WGS) libraries from single cells. For high-throughput processing, the manufacturer’s procedure was implemented in a fully automated workflow on the Ion Torrent Ion S5 system (ThermoFisher, Waltham, MA, USA). Ampli1™ low-pass libraries were normalized and sequenced on an Ion 530 chip. The obtained FASTQ files were quality-checked and aligned to the hg19 human reference sequence using the tmap aligner in Torrent Suite 5.10.0, generating alignment (BAM) files. All samples with < 400,000 reads were excluded from the analyses.

BAM files underwent quality filtering using Qualimap 30 and were processed using two separate pipelines for CIN and CNAs. Each chromosomal break between contiguous regions of at least 10 Mb was tabulated to calculate the number of large-scale transitions (LSTs) per CTC genome. Copy number alterations were identified using QDNAseq software (version 11.0) with the following settings: minMapq = 37, window = 500 kb. “Gain” and “loss” calls were filtered by residual (> 4 standard deviations, SD) and by blacklisted regions reported in the ENCODE database. Segmented copy number data for each sample were extracted from log2 ratio values. For CNA profiling, chromosome 19 was excluded due to deletion bias associated with its high GC content. Samples were classified as aberrant if they exhibited ≥ 1 genomic region with an amplification/deletion greater than 12.5 Mb, or if the cumulative amplification/deletion across different genomic regions exceeded 37.5 Mb. The OncoKB database was interrogated to evaluate biologically and clinically relevant CNAs in CTCs (access date: March 2024).
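The two decision rules described above, LST counting over segments of at least 10 Mb and the 12.5/37.5 Mb aberrancy thresholds, can be sketched in a few lines of Python. The segment representation and example values below are illustrative assumptions, not the actual pipeline:

```python
# Each segment: (chromosome, start_mb, end_mb, copy_number)
segments = [
    ("1", 0, 40, 2), ("1", 40, 95, 1), ("1", 95, 120, 2),  # two LST breaks
    ("2", 0, 8, 3),  ("2", 8, 60, 2),                      # segment < 10 Mb: no LST
]

def count_lsts(segments, min_size_mb=10):
    """Count large-scale transitions: breakpoints between two adjacent
    segments on the same chromosome, each at least min_size_mb long,
    with different copy numbers."""
    lsts = 0
    for (c1, s1, e1, cn1), (c2, s2, e2, cn2) in zip(segments, segments[1:]):
        if (c1 == c2 and cn1 != cn2
                and e1 - s1 >= min_size_mb and e2 - s2 >= min_size_mb):
            lsts += 1
    return lsts

def is_aberrant(altered_sizes_mb, single_mb=12.5, cumulative_mb=37.5):
    """Aberrant if any single altered region exceeds 12.5 Mb, or the
    cumulative altered genome exceeds 37.5 Mb (thresholds from the text)."""
    return (any(s > single_mb for s in altered_sizes_mb)
            or sum(altered_sizes_mb) > cumulative_mb)

print(count_lsts(segments))      # 2
print(is_aberrant([5, 10, 25]))  # True: cumulative 40 Mb > 37.5 Mb
```

The short segment on chromosome 2 illustrates why the 10 Mb size filter matters: small focal events do not count toward the LST tally.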

Biological analyses relied on canonical oncogenic signaling pathways, as previously defined 31, processed using custom functions from the maftools R package 32, alongside Gene Ontology (GO) biological process terms and KEGG pathways via the clusterProfiler Bioconductor package. The CIN predictor was developed using the SMOTE method 33 to address the sample imbalance between the presence and absence of LSTs. Classification was performed using the random forest algorithm on 39 genes 34, with bootstrap resampling used to estimate standard errors and confidence intervals. The discriminatory capability of the CIN classifier was assessed using ROC curves and expressed as AUC values. Analyses of association were conducted using the t-test for continuous variables and Fisher’s exact test for categorical variables. All analyses were performed using R software ( www.R-project.org ); statistical significance was set at a p-value < 0.05.
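SMOTE rebalances classes by synthesizing new minority samples along the line segment between a minority point and one of its k nearest minority-class neighbors 33. The stdlib-only Python sketch below illustrates the core interpolation step; the feature vectors are hypothetical, and the authors used an R implementation rather than this code:

```python
import random

def smote(minority, k=3, n_new=4, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of base within the minority class (Euclidean)
        neighbors = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, nb)])
    return synthetic

# Hypothetical CNA feature vectors for the minority (e.g. LST-high) class
minority = [[1.0, 0.2], [0.9, 0.3], [0.8, 0.1], [1.1, 0.4]]
print(smote(minority))
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay within the minority class's feature range rather than being arbitrary noise.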

Conference presentation

These results have been presented in part at the Molecular Analysis for Precision Oncology (MAP) Congress, Amsterdam, Netherlands, Oct 14–16, 2022.

Data availability

Raw sequencing data are available from the corresponding author upon request.

Arnold, M. et al. Current and future burden of breast cancer: Global statistics for 2020 and 2040. Breast 66 , 15–23 (2022).


Howard, F. M. & Olopade, O. I. Epidemiology of triple-negative breast cancer: A review. Cancer J. 27 , 8–16 (2021).


Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: New dimensions. Cancer Discov. 12 , 31–46 (2022).

Lynch, A. R. et al. A survey of chromosomal instability measures across mechanistic models. Proc. Natl. Acad. Sci. USA 121 , e2309621121 (2024).

Carter, S. L., Eklund, A. C., Kohane, I. S., Harris, L. N. & Szallasi, Z. A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nat. Genet. 38 , 1043–1048 (2006).

Sheltzer, J. M. A transcriptional and metabolic signature of primary aneuploidy is present in chromosomally unstable cancer cells and informs clinical prognosis. Cancer Res. 73 , 6401–6412 (2013).


Climent, J., Garcia, J. L., Mao, J. H., Arsuaga, J. & Perez-Losada, J. Characterization of breast cancer by array comparative genomic hybridization. Biochem. Cell. Biol. 85 , 497–508 (2007).

Greene, S. B. et al. Chromosomal instability estimation based on next generation sequencing and single cell genome wide copy number variation analysis. PLoS One 11 , e0165089 (2016).

Alix-Panabières, C. & Pantel, K. Challenges in circulating tumour cell research. Nat. Rev. Cancer 14 , 623–631 (2014).


Hiley, C. et al. Deciphering intratumor heterogeneity and temporal acquisition of driver events to refine precision medicine. Genome Biol. 15 , 453 (2014).

Silvestri, M. et al. Copy number alterations analysis of primary tumor tissue and circulating tumor cells from patients with early-stage triple negative breast cancer. Sci. Rep. 12 , 1470 (2022).


Popova, T. et al. Ploidy and large-scale genomic instability consistently identify basal-like carcinomas with BRCA1/2 inactivation. Cancer Res. 72 , 5454–5462 (2012).

Schonhoft, J. D. et al. Morphology-predicted large-scale transition number in circulating tumor cells identifies a chromosomal instability biomarker associated with poor outcome in castration-resistant prostate cancer. Cancer Res. 80 , 4892–4903 (2020).

Li, Z. et al. Comprehensive identification and characterization of somatic copy number alterations in triple-negative breast cancer. Int. J. Oncol. 56 , 522–530 (2020).


Matis, T. S. et al. Current gene panels account for nearly all homologous recombination repair-associated multiple-case breast cancer families. NPJ Breast Cancer 7 , 109 (2021).

Bareche, Y. et al. Unravelling triple-negative breast cancer molecular heterogeneity using an integrative multiomic analysis. Ann. Oncol. 29 , 895–902 (2018).

Eccleston, A. Targeting cancers with chromosome instability. Nat. Rev. Drug. Discov. 21 , 556 (2022).

Rossi, T. et al. Single-cell NGS-based analysis of copy number alterations reveals new insights in circulating tumor cells persistence in early-stage breast cancer. Cancers 12 (9), 2490. https://doi.org/10.3390/cancers12092490 (2020).

Rothé, F. et al. Interrogating breast cancer heterogeneity using single and pooled circulating tumor cell analysis. NPJ Breast Cancer 8 (1), 1–8. https://doi.org/10.1038/s41523-022-00445-7 (2022).


Fernandez-Garcia, D. et al. Shallow WGS of individual CTCs identifies actionable targets for informing treatment decisions in metastatic breast cancer. Br. J. Cancer 127 (10), 1858–1864. https://doi.org/10.1038/s41416-022-01962-9 (2022).

Drews, R. M. et al. A pan-cancer compendium of chromosomal instability. Nature 606 , 976–983 (2022).

Gao, R. et al. Punctuated copy number evolution and clonal stasis in triple-negative breast cancer. Nat. Genet. 48 , 1119–1130 (2016).

Birkbak, N. J. et al. Paradoxical relationship between chromosomal instability and survival outcome in cancer. Cancer Res. 71 , 3447–3452 (2011).

Lynch, A. R., Arp, N. L., Zhou, A. S., Weaver, B. A. & Burkard, M. E. Quantifying chromosomal instability from intratumoral karyotype diversity using agent-based modeling and Bayesian inference. eLife 11 , e69799 (2022).

Malihi, P. D. et al. Single-cell circulating tumor cell analysis reveals genomic instability as a distinctive feature of aggressive prostate cancer. Clin. Cancer Res. 26 , 4143–4153 (2020).

Xu, Y. et al. Detection of circulating tumor cells using negative enrichment immunofluorescence and an in situ hybridization system in pancreatic cancer. Int. J. Mol. Sci. 18 , 622 (2017).

Breiman, L. Random forests. Mach. Learn. 45 , 5–32 (2001).


van Harn, T. et al. Loss of Rb proteins causes genomic instability in the absence of mitogenic signaling. Genes Dev. 24 , 1377–1388 (2010).

Scribano, C. M. et al. Chromosomal instability sensitizes patient breast tumors to multipolar divisions induced by paclitaxel. Sci. Transl. Med. 13 , 610 (2021).

Okonechnikov, K., Conesa, A. & García-Alcalde, F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32 , 292–294 (2016).

Sanchez-Vega, F. et al. Oncogenic signaling pathways in the cancer genome atlas. Cell 173 , 321–337 (2018).

Mayakonda, A., Lin, D. C., Assenov, Y., Plass, C. & Koeffler, H. P. Maftools: Efficient and comprehensive analysis of somatic variants in cancer. Genome Res. 28 , 1747–1756 (2018).

Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16 , 321–357 (2002).

Ishwaran, H., Kogalur, U. B., Blackstone, E. H. & Lauer, S. Random survival forests. Ann. Appl. Stat. 2 , 841–860 (2008).


Download references

Acknowledgements

We acknowledge the skilful technical support by Patrizia Miodini and Rosita Motta for CTC enrichment.

Author information

Authors and Affiliations

Department of Advanced Diagnostics, Fondazione IRCCS Istituto Nazionale Dei Tumori Di Milano, Via Venezian 1, 20100, Milan, Italy

Serena Di Cosimo, Marco Silvestri, Cinzia De Marco & Vera Cappelletti

Isinnova S.R.L, Brescia, Italy

Marco Silvestri & Alessia Calzoni

Department of Information Engineering, University of Brescia, Brescia, Italy

Alessia Calzoni

Department of Radiation Oncology, Fondazione IRCCS Istituto Nazionale Dei Tumori Di Milano, Milan, Italy

Maria Carmen De Santis & Maria Grazia Carnevale

Breast Unit, Fondazione IRCCS Istituto Nazionale Dei Tumori Di Milano, Milan, Italy

Division of Hematology-Oncology, Weill Cornell Medicine, New York, NY, USA

Carolina Reduzzi & Massimo Cristofanilli


Contributions

Conceptualization: S.D.C., M.S., V.C.; Sample collection and processing: M.C.D.S., M.G.C., C.D.M., C.R.; Data curation and analysis: M.S., V.C., C.R., A.C., S.D.C.; Writing: S.D.C.; V.C., Supervision: S.D.C., V.C., M.C. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Marco Silvestri .

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Ethics Committee of Fondazione IRCCS Istituto Nazionale dei Tumori di Milano (INT 196/14).

Informed consent

Informed consent was obtained from all study participants.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1. Supplementary Information 2. Supplementary Information 3. Supplementary Information 4. Supplementary Information 5.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Di Cosimo, S., Silvestri, M., De Marco, C. et al. Low-pass whole genome sequencing of circulating tumor cells to evaluate chromosomal instability in triple-negative breast cancer. Sci Rep 14 , 20479 (2024). https://doi.org/10.1038/s41598-024-71378-3

Download citation

Received : 24 May 2024

Accepted : 27 August 2024

Published : 03 September 2024

DOI : https://doi.org/10.1038/s41598-024-71378-3


  • Chromosomal instability
  • Large-scale transitions
  • Circulating tumor cells
  • Triple-negative breast cancer
  • Copy number alterations


define hypothesis in machine learning

Z-number linguistic term set for multi-criteria group decision-making and its application in predicting the acceptance of academic papers

  • Published: 02 September 2024

Cite this article

define hypothesis in machine learning

  • Yangxue Li   ORCID: orcid.org/0000-0003-2649-3280 1 ,
  • Gang Kou 2 ,
  • Yi Peng 3 &
  • Juan Antonio Morente-Molinera 1  

Real-world information is often characterized by uncertainty and partial reliability, which led Zadeh to introduce the concept of Z-numbers as a more appropriate formal structure for describing such information. However, the computation of Z-numbers requires solving highly complex optimization problems, limiting their practical application. Although linguistic Z-numbers have been explored for their computational straightforwardness, they lack theoretical support from Z-number theory and exhibit certain limitations. To address these issues and provide theoretical support from Z-numbers, we propose a Z-number linguistic term set to facilitate more efficient processing of Z-number-based information. Specifically, we redefine linguistic Z-numbers as Z-number linguistic terms. By analyzing the hidden probability density functions of these terms, we identify patterns for ranking them. These patterns are used to define the Z-number linguistic term set, which includes all Z-number linguistic terms sorted in order. We also discuss the basic operators between these terms. Furthermore, we develop a multi-criteria group decision-making (MCGDM) model based on the Z-number linguistic term set. Applying our method to predict the acceptance of academic papers, we demonstrate its effectiveness and superiority. We compare the performance of our MCGDM method with five existing Z-number-based MCGDM methods and eight traditional machine learning clustering algorithms. Our results show that the proposed method outperforms others in terms of accuracy and time consumption, highlighting the potential of Z-number linguistic terms for enhancing Z-number computation and extending the application of Z-number-based information to real-world problems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save.

  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime

Price includes VAT (Russian Federation)

Instant access to the full article PDF.

Rent this article via DeepDyve

Institutional subscriptions

define hypothesis in machine learning

Similar content being viewed by others

define hypothesis in machine learning

A parametric distance-based outranking method for probabilistic linguistic multi-criteria decision-making problems

define hypothesis in machine learning

Fuzzified AHP in the evaluation of scientific monographs

define hypothesis in machine learning

A novel approach to multi-criteria group decision-making problems based on linguistic D numbers

Explore related subjects.

  • Artificial Intelligence

Data Availability

The data set analyzed during the current study is available in the github; https://github.com/allenai/PeerRead/tree/master/data/acl_2017 .

Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning-i. Inf Sci 8(3):199–249

Article   MathSciNet   Google Scholar  

Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning-ii. Inf Sci 8(4):301–357

Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning-iii. Inf Sci 9(1):43–80

Jing F, Chao X (2021) Fairness concern: an equilibrium mechanism for consensus-reaching game in group decision-making. Inf Fusion 72:147–160

Article   Google Scholar  

Dong Y, Ran Q, Chao X, Li C, Yu S (2023) Personalized individual semantics learning to support a large-scale linguistic consensus process. ACM Trans Internet Tech 23(2):1–27

Wang S, Wu J, Chiclana F, Ji F, Fujita H (2023) Global feedback mechanism by explicit and implicit power for group consensus in social network. Inf Fusion 102205

Morente-Molinera JA, Kou G, Samuylov K, Cabrerizo F, Herrera-Viedma E (2021) Using argumentation in expert’s debate to analyze multi-criteria group decision making method results. Inf Sci 573:433–452

Luo N, Zhang Q, Yin L, Xie Q, Wu C, Wang G (2024) Three-way multi-attribute decision-making under the double hierarchy hesitant fuzzy linguistic information system. Appl Soft Comput 154:111315

Zheng Y, Xu Z, Li Y, Pedrycz W, Yi Z (2024) Bi-objective optimization method for large-scale group decision making based on hesitant fuzzy linguistic preference relations with granularity levels. IEEE Trans Fuzzy Syst 32(8):4759–4771

Zhang H, Zhu W, Chen X, Wu Y, Liang H, Li C-C, Dong Y (2022) Managing flexible linguistic expression and ordinal classification-based consensus in large-scale multi-attribute group decision making. Ann Oper Res 1–54

Zhang H, Dong Y, Xiao J, Chiclana F, Herrera-Viedma E (2020) Personalized individual semantics-based approach for linguistic failure modes and effects analysis with incomplete preference information. Iise Trans 52(11):1275–1296

Xiao J, Wang X, Zhang H (2022) Exploring the ordinal classifications of failure modes in the reliability management: an optimization-based consensus model with bounded confidences. Group Decis Negot 31(1):49–80

Zhou M, Zhou Y-J, Liu X-B, Wu J, Fujita H, Herrera-Viedma E (2023) An adaptive two-stage consensus reaching process based on heterogeneous judgments and social relations for large-scale group decision making. Inf Sci 119280

Chen Z-S, Martinez L, Chang J-P, Wang X-J, Xionge S-H, Chin K-S (2019) Sustainable building material selection: a qfd-and electre iii-embedded hybrid mcgdm approach with consensus building. Eng Appl Artif Intell 85:783–807

Chen Z-S, Yang L-L, Chin K-S, Yang Y, Pedrycz W, Chang J-P, Martínez L, Skibniewski MJ (2021) Sustainable building material selection: an integrated multi-criteria large group decision making framework. Appl Soft Comput 113:107903

Chen Z-S, Zhu Z, Wang X-J, Chiclana F, Herrera-Viedma E, Skibniewski MJ (2023) Multiobjective optimization-based collective opinion generation with fairness concern. IEEE Trans Syst Man Cybern

Chen Z-S, Zhu Z, Wang Z-J, Tsang Y (2023) Fairness-aware large-scale collective opinion generation paradigm: a case study of evaluating blockchain adoption barriers in medical supply chain. Inf Sci 635:257–278

Ran Q, Chao X, Cabrerizo FJ, Herrera-Viedma E (2023) Managing overconfidence behaviors from heterogeneous preference relations in linguistic group decision making. IEEE Trans Fuzzy Syst 31(7):2435–2449. https://doi.org/10.1109/TFUZZ.2022.3226321

Zadeh LA (2011) A note on z-numbers. Inf Sci 181(14):2923–2932



Acknowledgements

This research has been partially supported by grants from the National Natural Science Foundation of China (#71910107002), the China Scholarship Council (CSC), grant PID2022-139297OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU. Moreover, it is part of the project C-ING-165-UGR23, co-funded by the Regional Ministry of University, Research and Innovation and by the European Union under the Andalusia ERDF Programme 2021-2027.

Author information

Authors and Affiliations

Andalusian Research Institute in Data Science and Computational Intelligence, Department of Computer Science and AI, University of Granada, Granada, 18071, Spain

Yangxue Li & Juan Antonio Morente-Molinera

School of Business Administration, Faculty of Business Administration, Southwestern University of Finance and Economics, Chengdu, China

Gang Kou

School of Management and Economics, University of Electronic Science and Technology of China, Chengdu, China

Yi Peng


Contributions

Yangxue Li: conceptualization, methodology, formal analysis, software, validation, writing - original draft, investigation, data curation, visualization. Gang Kou: writing - review & editing, visualization, supervision. Yi Peng: writing - review & editing, visualization, supervision. Juan Antonio Morente-Molinera: writing - review & editing, supervision, funding acquisition.

Corresponding authors

Correspondence to Gang Kou, Yi Peng or Juan Antonio Morente-Molinera.

Ethics declarations

Competing interests

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethical and informed consent

Not applicable.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Li, Y., Kou, G., Peng, Y. et al. Z-number linguistic term set for multi-criteria group decision-making and its application in predicting the acceptance of academic papers. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05765-8


Accepted: 11 August 2024

Published: 02 September 2024

DOI: https://doi.org/10.1007/s10489-024-05765-8


  • Z-number linguistic term
  • Decision-making
  • Acceptance prediction

