Welcome to the on-line version of the UNC dissertation proposal collection. The purpose of this collection is to provide examples of proposals for those of you who are thinking of writing a proposal of your own. I hope that this on-line collection proves to be more difficult to misplace than the physical collection that periodically disappears.

If you are preparing to write a proposal, you should make a point of reading the excellent document The Path to the Ph.D., written by James Coggins. It includes advice about selecting a topic, preparing a proposal, taking your oral exam, and finishing your dissertation. It also includes accounts by many people about the process that each of them went through to find a thesis topic.

Adding to the Collection

This collection of proposals becomes more useful with each new proposal that is added. If you have an accepted proposal, please help by including it in this collection. You may notice that the bulk of the proposals currently in this collection are in the area of computer graphics. This is an artifact of my knowing more computer graphics folks to pester for their proposals. Add your non-graphics proposal to the collection and help remedy this imbalance!

There are only two requirements for a UNC proposal to be added to this collection. The first requirement is that your proposal must be completely approved by your committee. If we adhere to this, then each proposal in the collection serves as an example of a document that five faculty members have signed off on. The second requirement is that you supply, as best you can, exactly the document that your committee approved. While reading over my own proposal I winced at a few of the things that I had written. I resisted the temptation to change the document, however, because this collection should truly reflect what an accepted thesis proposal looks like.

Note that there is no requirement that the author has finished his or her Ph.D. Several of the proposals in the collection were written by people who, as of this writing, are still working on their dissertations. This is fine!

I encourage people to submit their proposals in any form they wish. Perhaps the most useful formats at present are Postscript and HTML, but this may not always be so. Greg Coombe has generously provided LaTeX thesis style files, which, he says, conform to the 2004-2005 style requirements.
Many thanks to everyone who contributed to this collection!
Greg Coombe, "Incremental Construction of Surface Light Fields" in PDF . Karl Hillesland, "Image-Based Modelling Using Nonlinear Function Fitting on a Stream Architecture" in PDF . Martin Isenburg, "Compressing, Streaming, and Processing of Large Polygon Meshes" in PDF . Ajith Mascarenhas, "A Topological Framework for Visualizing Time-varying Volumetric Datasets" in PDF . Josh Steinhurst, "Practical Photon Mapping in Hardware" in PDF . Ronald Azuma, "Predictive Tracking for Head-Mounted Displays," in Postscript Mike Bajura, "Virtual Reality Meets Computer Vision," in Postscript David Ellsworth, "Polygon Rendering for Interactive Scientific Visualization on Multicomputers," in Postscript Richard Holloway, "A Systems-Engineering Study of the Registration Errors in a Virtual-Environment System for Cranio-Facial Surgery Planning," in Postscript Victoria Interrante, "Uses of Shading Techniques, Artistic Devices and Interaction to Improve the Visual Understanding of Multiple Interpenetrating Volume Data Sets," in Postscript Mark Mine, "Modeling From Within: A Proposal for the Investigation of Modeling Within the Immersive Environment" in Postscript Steve Molnar, "High-Speed Rendering using Scan-Line Image Composition," in Postscript Carl Mueller, " High-Performance Rendering via the Sort-First Architecture ," in Postscript Ulrich Neumann, "Direct Volume Rendering on Multicomputers," in Postscript Marc Olano, "Programmability in an Interactive Graphics Pipeline," in Postscript Krish Ponamgi, "Collision Detection for Interactive Environments and Simulations," in Postscript Russell Taylor, "Nanomanipulator Proposal," in Postscript Greg Turk, " Generating Textures on Arbitrary Surfaces ," in HTML and Postscript Terry Yoo, " Statistical Control of Nonlinear Diffusion ," in Postscript




A Comprehensive Guide to Computer Vision Research in 2024

bharat | January 17, 2024

Introduction 

In our earlier blogs, we discussed the best institutes across the world for computer vision research. In this fun read, we'll look at the different stages of computer vision research and how you can go about publishing your research work. Let us delve into them now. Looking to become a Computer Vision Engineer? Check out our Comprehensive Guide!

Table of Contents

  • Introduction
  • Different Stages of Computer Vision Research
  • Research Publications

Different Stages of Computer Vision Research

Computer vision research can be divided into several stages, each building on the previous one. Let us look at them in detail.

Identification of Problem Statement

Computer Vision research starts with identifying the problem statement. It is a crucial step in defining the scope and goals of a research project. It involves clearly understanding the specific challenge or task the researchers aim to address using computer vision techniques. Here are the steps involved in identifying the problem statement in computer vision research:

  • Identifying the domain: The first step is to pinpoint the specific application domain within computer vision. This could be related to object recognition in autonomous vehicles or medical image analysis for disease detection.
  • Defining the problem: Next, we define the precise problem we want to solve within that domain, like classifying images of animals or diagnosing diseases from X-rays.
  • Understanding the objectives: We need to understand the research objectives and outline what we intend to achieve through this project. For instance, improving classification accuracy or reducing false positives in a medical imaging system.
  • Data availability: Next, we need to analyze the availability of data for our project. Check if existing datasets are suitable for our task or if we need to gather our own data, like collecting images of specific objects or medical cases.
  • Review: Conduct a thorough review of existing research and the latest methodologies in the field. This will help you gain insights into the current state-of-the-art techniques and the challenges others have faced in similar projects.
  • Question formulation: Once we review the work, we can formulate research questions to guide our experiments. These questions could address specific aspects of our computer vision problem and help better structure our research.
  • Metrics: Next, we define the evaluation metrics that we’ll use to measure the performance of our vision system. Some common metrics include accuracy, precision, recall, and F1-score.
  • Impact: Highlight how solving the problem will have an effect in the real world. For instance, improving road safety through better object recognition or enabling earlier treatment through enhanced medical diagnoses.
  • Research Outline: Finally, outline the research plan, and detail the methodology employed for data collection, model development, and evaluation. A structured outline will ensure we are on the right track throughout our research project.


Let us move to the next step, data collection and creation.

Dataset Collection and Creation

Creating and gathering datasets is one of the key building blocks in computer vision research. These datasets are what the algorithms and models in vision systems learn from and are evaluated on. Let us see how this is done.

  • First, we need to know what we are trying to solve. For instance, are we training models to recognize dogs in photos or to identify anomalies in medical images?
  • Now, we'll need images or videos. Depending on the research needs, we can find them in public datasets or collect our own.
  • Next, we mark up the data. For instance, if you're teaching a computer to spot dogs in pictures, you'll draw boxes around the dogs and say, "These are dogs!" (see the annotation sketch after this list).
  • Raw data can be a mess. We may need to resize images, adjust colors, or add more examples to ensure our dataset is neat and complete.
  • Then we split the dataset into three parts: one part for training your model, one part for fine-tuning (validation), and one part for testing how well your model works.
  • Finally, ensure the dataset fairly represents the real world and doesn't favor one group or category too much.
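As a concrete illustration of the markup step, here is a minimal OpenCV sketch that draws one bounding-box annotation and writes a simple YOLO-style label file. The file names, coordinates, and class id are made-up placeholders, not tied to any particular dataset.

```python
import cv2

# Hypothetical example: visualize one bounding-box annotation on a training image.
# "dog_001.jpg" and the box coordinates are placeholders.
image = cv2.imread("dog_001.jpg")          # BGR image as a NumPy array
x, y, w, h = 40, 60, 200, 150              # top-left corner, width, height

# Draw the box and a class label so annotators can verify the markup.
cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(image, "dog", (x, y - 10),
            cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
cv2.imwrite("dog_001_annotated.jpg", image)

# A simple on-disk label format (YOLO-style, normalized coordinates):
# <class_id> <x_center> <y_center> <width> <height>
height, width = image.shape[:2]
line = f"0 {(x + w / 2) / width:.6f} {(y + h / 2) / height:.6f} {w / width:.6f} {h / height:.6f}"
with open("dog_001.txt", "w") as f:
    f.write(line + "\n")
```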

One can also share their dataset and research with others for feedback and improvements. Dataset collection and creation are vital in computer vision research.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) briefly analyzes a dataset to answer preliminary questions and guide the modeling process, for instance, by looking for patterns across different classes. It is used not only by computer vision engineers but also by data scientists to ensure that the data is aligned with business goals or outcomes. In computer vision, this step involves understanding the specifics of image datasets: EDA is used to spot anomalies, understand the data distribution, or gain insights that inform model training. Let us look at the role of EDA in model development.

  • With EDA, one can design data preprocessing pipelines and choose data augmentation strategies.
  • We can also analyze how the findings from EDA affect the choice of model architecture, for instance, the number of convolutional layers or the input image size.
  • EDA is also crucial for advanced computer vision tasks like object detection, segmentation, and image generation.


Now let us dive into the specifics of EDA methods and preparing image datasets for model development.

Visualization

  • Sample Image Visualization involves displaying a random set of images from the dataset. This is a fundamental step where we get an idea of the data, like lighting conditions or variations in image quality. From this, one can infer the visual diversity and any challenges in the dataset.
  • Analyzing pixel intensity distributions offers insights into the brightness and contrast variations across the dataset and whether any image enhancement techniques are needed.
  • Next, creating histograms for different color channels gives us a better understanding of the color distribution of the dataset. This is a crucial step for tasks such as image classification (a small histogram sketch follows this list).
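A minimal sketch of the two histogram ideas above, using OpenCV and Matplotlib; the image path is a placeholder.

```python
import cv2
import matplotlib.pyplot as plt

# Hypothetical EDA sketch: overall intensity and per-channel histograms for one image.
image = cv2.imread("sample.jpg")                 # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Brightness/contrast: distribution of grayscale pixel intensities.
ax1.hist(gray.ravel(), bins=256, range=(0, 256))
ax1.set_title("Grayscale intensity distribution")

# Color distribution: one histogram per BGR channel via cv2.calcHist.
for i, color in enumerate(("b", "g", "r")):
    hist = cv2.calcHist([image], [i], None, [256], [0, 256])
    ax2.plot(hist, color=color)
ax2.set_title("Per-channel color histograms")

plt.tight_layout()
plt.show()
```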

Image Property Analysis

  • Another crucial part is understanding the resolution and the aspect ratio of images in the dataset. It helps make decisions like resizing the image or normalizing the aspect ratio, which is crucial in maintaining consistency in input data for neural networks.
  • Analyzing the size and distribution of annotated objects can be insightful in datasets with annotations. This influences the design of layers in the neural network and helps in understanding the scale of objects.

Correlation Analysis

  • For high-dimensional image data, more advanced EDA involves analyzing the relationships between different features. This aids dimensionality reduction or feature selection.
  • Next, it is crucial to understand the spatial correlations within images, like the relationship between different regions in an image. This helps in the development of spatial hierarchies in neural networks.

Class Distribution Analysis

  • EDA is important in understanding imbalances in class distribution. This is key in classification tasks, where imbalanced data can lead to biased models.
  • Once the imbalances are identified, we can adopt techniques like undersampling majority classes or oversampling minority classes during model training (see the class-weight sketch below).
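A small sketch of class-distribution analysis: counting labels and deriving inverse-frequency class weights that could be passed to a loss function. The label list is made up for illustration.

```python
from collections import Counter

# Placeholder labels standing in for a real dataset's annotations.
labels = ["dog", "cat", "dog", "dog", "bird", "dog", "cat"]

counts = Counter(labels)
total = sum(counts.values())
print(counts)   # e.g. Counter({'dog': 4, 'cat': 2, 'bird': 1})

# Inverse-frequency weights: rare classes get larger weights.
class_weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}
print(class_weights)
```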

Geometric Analysis

  • Understanding geometric properties like edges, shapes, and textures in images offers insights into the features important for the problem at hand. We can make informed decisions on selecting specific filters or layers in the network architecture. 
  • It’s important to understand how different morphological transformations affect images for segmentation and object detection tasks.

Sequential Analysis

Sequential analysis applies to video data.

  • For instance, analyzing changes between frames can offer information like motion, temporal consistency, or the need for temporal modeling in video datasets or video sequences.
  • Identifying temporal variations and scene changes gives us insights into the dynamics within the video data that are crucial for tasks like event detection or action recognition.   

Now that we’ve discussed Exploratory Data Analysis and some of its techniques let us move to the next stage in Computer Vision research, defining the model architecture.

Defining Model Architecture 

Defining a model architecture is a critical component of research in computer vision, as it lays the foundation for how a machine learning model will perceive, process, and interpret visual data. The chosen architecture directly impacts the model's ability to learn from visual data and perform tasks like object detection or semantic segmentation.

Model architecture in computer vision refers to the structural design of an artificial neural network. The architecture defines how the model processes input images, extracts features, and makes predictions and classifications.  

What are the components of a model architecture? Let’s explore them.


Input Layer

This is where the model receives the image data, usually in the form of a multi-dimensional array. For colored images, this is typically a 3D array whose third dimension holds the RGB color channels. Preprocessing steps like normalization are applied here.

Convolutional Layers

These layers apply a set of filters to the input. Every filter convolves across the width and height of the input volume, computing the dot product between the entries of the filter and the input, producing a 2D activation map for each filter. Because filters operate on local neighborhoods, the relationship between nearby pixels is preserved, which captures spatial hierarchies in the image.

Activation Functions

Activation functions enable networks to learn more complex representations by introducing non-linearity. For instance, the ReLU (Rectified Linear Unit) function applies a non-linear transformation f(x) = max(0, x) that retains only positive values and sets all negative values to zero. Other common functions include sigmoid and tanh.
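For reference, the activations mentioned above are one-liners in NumPy; this is only an illustrative sketch.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # f(x) = max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes values into (-1, 1)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))      # [0.  0.  0.  0.5 2. ]
print(sigmoid(x))
print(tanh(x))
```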

Pooling Layers

These layers perform a down-sampling operation along the spatial dimensions (width, height), reducing the number of parameters and computations in the network. Max pooling, a common approach, takes the maximum value from a set of values in the filter area. This operation provides a degree of spatial invariance, making the recognition of features in the input more robust to scale and orientation changes.

Fully Connected Layers 

Here, the layers connect every neuron in one layer to every neuron in the next layer. In a CNN, the high-level reasoning in the neural network is performed via these dense layers. Typically, they are positioned near the end of the network and operate on the flattened output of the convolutional and pooling layers, a single vector of features used for final classification or regression tasks.

Dropout Layers

Dropout is a regularization technique where randomly selected neurons are ignored during training. This means that the contribution of these neurons to activating the downstream neurons is removed temporarily on the forward pass, and no weight updates are applied to those neurons on the backward pass. This helps prevent overfitting.

Batch Normalization

In batch normalization, the output from a previous activation layer is normalized by subtracting the batch mean and then dividing it by the standard deviation of the batch. This technique helps stabilize the learning process and significantly reduces the number of training epochs required for deep network training.

Loss Function

The difference between the expected outcomes and the predictions made by the model is quantified by the loss function. Cross-entropy for classification tasks and mean squared error for regression tasks are some of the common loss functions in computer vision.

Optimizer

The optimizer is an algorithm used to minimize the loss function. It updates the network's weights based on the loss gradient. Some common optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop. They use backpropagation to determine the direction in which each weight should be adjusted to minimize the loss.

Output Layer

This is the final layer, where the model’s output is produced. The output layer typically includes a softmax function for classification tasks that converts the outputs to probability values for each class. For regression tasks, the output layer may have a single neuron.

Frameworks like TensorFlow, PyTorch, and Keras are widely used for designing and implementing model architectures. They offer pre-built layers, training routines, and easy integration with hardware accelerators.
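As an illustrative sketch (not the article's own code), here is how the components above might be assembled into a small Keras model; the layer sizes, input shape, and number of classes are arbitrary assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal CNN tying together input, convolution + ReLU, batch normalization,
# pooling, dropout, fully connected layers, and a softmax output.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),                      # RGB input image
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),                            # spatial down-sampling
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                   # fully connected layer
    layers.Dropout(0.5),                                    # regularization
    layers.Dense(10, activation="softmax"),                 # 10-class output
])

# Cross-entropy loss and the Adam optimizer, as discussed in the text.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```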

Defining a model architecture requires a good grasp of both the theoretical aspects of neural networks and the practical aspects of the specific task.

Training and Validation

Training and validation are crucial in developing a model. They help evaluate a model’s performance, especially when dealing with object detection or image classification tasks.

In this phase, the model is represented as a neural network that learns to recognize image patterns and features by altering its internal parameters iteratively. These parameters are weights and biases related to the network’s layers. Training is key for extracting meaningful features from raw visual data. Let us see how one can go about training a model.

  • Acquiring a dataset is the first step. It could be in the form of images or videos for model learning purposes. For robustness, it should cover various environmental conditions, variations, and object classes.
  • Resizing ensures all the input data has the same dimensions for batch processing.
  • In normalization, pixels are standardized to zero mean and unit variance, aiding convergence.
  • Augmentation applies random transformations to increase the size of the dataset artificially, thereby improving the model's ability to generalize.
  • Once data preprocessing is done, we must choose the appropriate neural network architecture catering to the specific vision task. For instance, CNNs are widely used for image-related tasks.
  • Next, we initialize the model parameters, usually weights and biases, using random values or pre-trained weights from a model trained on a large dataset. Transfer learning can significantly improve performance, especially when data is limited.
  • Then an optimization algorithm such as stochastic gradient descent (SGD) or RMSprop adjusts the parameters iteratively. Gradients with respect to the model's parameters are computed through backpropagation and used to update the parameters.
  • The training data is then fed through the network in mini-batches, computing the loss for each mini-batch and performing gradient updates. This continues until the loss falls below a predefined threshold or a set number of epochs is reached (a minimal training sketch follows this list).
  • Next, we must optimize the training performance and convergence speed by fine-tuning the hyperparameters. This can be done by adjusting learning rates, batch sizes, weight regularization terms, or network architectures.
  • We need to assess the model’s performance using validation or test datasets and eventually deploy the model in real-world applications through software integrations or embedded devices.
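A minimal training sketch in Keras, assuming `model` is a compiled network like the one sketched earlier; the random arrays stand in for a real, preprocessed dataset.

```python
import numpy as np
import tensorflow as tf

# Placeholder data: a real project would load and preprocess actual images here.
x_train = np.random.rand(100, 224, 224, 3).astype("float32")
y_train = np.random.randint(0, 10, size=(100,))

# Early stopping watches validation loss and keeps the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)

history = model.fit(
    x_train, y_train,
    batch_size=32,          # mini-batch size
    epochs=20,              # passes over the full training set
    validation_split=0.2,   # hold out 20% for validation
    callbacks=[early_stop],
)
```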

Now let us move to the next step: validation.

Validation is fundamental for the quantitative assessment of the performance and generalization capabilities of algorithms. It ensures the reliability and effectiveness of the models when applied to real-world data. Validation evaluates a model's ability to make accurate predictions on previously unseen data and hence gauges how well it generalizes.

Now let us explore some of the key techniques involved in validation.

Cross-Validation Techniques

  • K-Fold Cross-Validation is the method where the dataset is partitioned into K non-overlapping subsets. The model is trained and evaluated K times, with each fold taking turns as the validation set while the rest serve as the training set. The results are averaged to obtain a robust performance estimate (a K-fold sketch follows this list).
  • Leave-One-Out Cross-Validation, or LOOCV, is an extreme form of cross-validation where each data point is used as the validation set while the remaining data points constitute the training set. LOOCV offers an exhaustive evaluation of model performance.
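A K-fold sketch with scikit-learn; `build_model` and the evaluation call are hypothetical placeholders for whatever training and scoring routine a project actually uses.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: small random images and integer labels.
X = np.random.rand(100, 32, 32, 3)
y = np.random.randint(0, 10, size=(100,))

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, val_idx in kfold.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # model = build_model()                         # hypothetical helper
    # model.fit(X_train, y_train, epochs=10)
    # scores.append(model.evaluate(X_val, y_val))   # collect per-fold score

# The per-fold scores would then be averaged into one robust estimate:
# print("Mean CV score:", np.mean(scores))
```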

Stratified Sampling

In imbalanced datasets, where some classes have significantly fewer instances than others, stratified sampling ensures that the class distribution is preserved in both the training and validation sets.

Performance Metrics

To assess the model's performance, a range of performance metrics specific to computer vision tasks is used. Common ones include, but are not limited to, the following.

  • Accuracy is the ratio of the correctly predicted instances to the total number of instances.
  • Precision is the proportion of true positive predictions among all positive predictions.
  • Recall is the proportion of true positive predictions among all positive instances.
  • F1-Score is the harmonic mean of precision and recall.
  • Mean Average Precision (mAP) is commonly used in object detection and image retrieval tasks to evaluate the quality of ranked lists of results (a short metrics sketch follows this list).
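A short sketch of these metrics with scikit-learn on made-up binary predictions; mAP is omitted because it is usually computed by a task-specific evaluation toolkit (e.g., COCO-style evaluation for detection).

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Made-up ground truth and predictions for a binary task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```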

Hyperparameter Tuning

Validation is closely integrated with hyperparameter tuning, where the model’s hyperparameters are systematically adjusted and evaluated using the validation set. Techniques such as grid search, random search, or Bayesian optimization help identify the optimal hyperparameter configuration for the model.
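A bare-bones grid-search sketch; `train_and_validate` is a stand-in for training a model and returning a validation score, and the candidate values are arbitrary.

```python
import itertools

def train_and_validate(lr, batch_size):
    # Placeholder: a real project would train the model here and return,
    # e.g., validation accuracy. This dummy score just makes the loop runnable.
    return 1.0 - abs(lr - 1e-3) - abs(batch_size - 32) / 1000.0

learning_rates = [1e-2, 1e-3, 1e-4]
batch_sizes = [16, 32, 64]

best_score, best_config = float("-inf"), None
for lr, bs in itertools.product(learning_rates, batch_sizes):
    score = train_and_validate(lr, bs)
    if score > best_score:
        best_score, best_config = score, (lr, bs)

print("Best configuration:", best_config)
```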

Data Augmentation

During validation, data augmentation techniques are applied to simulate variations in the input data and to test the model's robustness and ability to handle different conditions or transformations.
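A small sketch of such augmentations using Keras preprocessing layers; the specific transforms and parameters are illustrative.

```python
import tensorflow as tf

# Simple augmentation pipeline used to probe robustness.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),       # up to ±10% of a full turn
    tf.keras.layers.RandomZoom(0.2),
    tf.keras.layers.RandomContrast(0.2),
])

images = tf.random.uniform((8, 224, 224, 3))    # placeholder batch
augmented = augment(images, training=True)      # apply random transforms
print(augmented.shape)                          # (8, 224, 224, 3)
```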

Training is where the model learns from labeled data, and Validation is where the model’s learning and generalization capabilities are assessed. They ensure that the final model is robust, accurate, and capable of performing well on unseen data, which is critical for computer vision research.

Hyperparameter tuning refers to systematically optimizing hyperparameters in deep learning models for tasks like image processing and segmentation. Hyperparameters control the learning algorithm's behavior but are not learned from the training data. Fine-tuning hyperparameters is crucial if we wish to achieve accurate results.

Batch Size

The batch size is the number of training examples used in every forward and backward pass. Large batch sizes offer smoother convergence but need more memory. On the contrary, small batch sizes need less memory and can help escape local minima.

Number of Epochs

The number of epochs defines how many times the entire training dataset is processed during training. Too few epochs can lead to underfitting, and too many can lead to overfitting.

Learning Rate

This determines the step size during gradient-based optimization. If the learning rate is too high, it can lead to overshooting, causing the loss function to diverge; if it is too low, it can cause slow convergence.

Weight Initialization

Training stability is affected by how the weights are initialized. Techniques such as Glorot (Xavier) initialization are designed to mitigate vanishing and exploding gradient problems.

Regularization Techniques

Techniques like dropout and weight decay help prevent overfitting. Data augmentation, such as random rotations, further enhances model generalization.

Choice of Optimizer

The optimizer determines how the model weights are updated during training. Optimizers have parameters of their own, such as momentum, decay rates, and epsilon (see the short sketch below).
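For illustration, here is how those optimizer parameters appear in Keras; the values are arbitrary.

```python
import tensorflow as tf

# Each optimizer exposes its own hyperparameters.
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
adam = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-7)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=1e-3, rho=0.9)

# Any of these can be passed to model.compile(optimizer=..., loss=...).
```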

Hyperparameter tuning is usually approached as an optimization problem. Techniques like Bayesian optimization explore the hyperparameter space efficiently, balancing computational cost without sacrificing performance. A well-defined hyperparameter tuning process considers not just individual hyperparameters but also their interactions.

Performance Evaluation on Unseen Data 

In the earlier section, we discussed how one should go about training and validating a model. Now we'll discuss how to evaluate a model's performance on unseen data.


Training and validation dataset split is paramount when developing and evaluating models. This is not to be confused with the training and validation we discussed earlier for a model. Splitting the dataset for training and validation aids in understanding the model’s performance on unseen data. This ensures that the model generalizes well to new data. Let us look at them.

  • A training dataset is a collection of labeled data points for training the model, adjusting parameters, and inferring patterns and features.
  • A separate dataset is used for evaluating the model during development for hyperparameter tuning and model selection. This is the Validation dataset. 
  • Then there is the test dataset, an independent dataset used for assessing the final performance and generalization ability on unseen data.

Splitting the dataset is needed so the model is not evaluated on the same data it was trained on, which would give a misleading picture of its performance. Some commonly used split ratios are 70:30, 80:20, or 90:10. The larger portion is used for training, while the smaller portion is used for validation (see the sketch below).
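A quick sketch of an 80:10:10 split with scikit-learn; the array shapes and ratios are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 small images with integer class labels.
X = np.random.rand(1000, 32, 32, 3)
y = np.random.randint(0, 10, size=(1000,))

# First split off 20%, then divide that half-and-half into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 800 100 100
```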

Research Publications

You have put so much effort into your research paper. But how do we publish it? Where do we publish it? How do we find the right computer vision research groups? That is what this section covers, so let's get to it.

Conferences

There are some top-tier computer vision conferences happening across the globe. They are among the best places to showcase research work, look for future collaborations, and build networks.

Conference on Computer Vision and Pattern Recognition (CVPR)

Also called the CVPR, it is one of the most prestigious conferences in the world of Computer Vision. It is organized by the IEEE Computer Society and is an annual event. It has an amazing history of showcasing cutting-edge research papers in image analysis, object detection, deep learning techniques, and much more. CVPR has set the bar high, placing a strong emphasis on the technical aspects of the submissions. They must meet the following criteria.

Papers must possess an innovative contribution to the field. This could be the development of new algorithms, techniques, or methodologies that can bring advancements in computer vision.

If applicable, the submissions must have mathematical formulations of their methods, like equations and theorem proofs. This offers a solid theoretical foundation for the paper’s approach.

Next, the paper should include comprehensive experimental results involving many datasets and benchmarking against existing models. These are key to demonstrating the effectiveness of your proposed approach.

Clarity – this is a no-brainer; the writing and presentation must be clear and concise. The writers are expected to explain the algorithms, models, and results in a technically sound manner. 


CVPR is an amazing platform for networking and engaging with the community. It's a great place to meet academics, researchers, and industry experts to collaborate and exchange ideas. The acceptance rate for papers is only around 25.8%, so an accepted paper brings impressive recognition within the vision community. It often leads to citations, greater visibility, and potential collaborations with renowned researchers and professionals.

International Conference on Computer Vision (ICCV)

The ICCV is another premier conference, held every two years, offering an amazing platform for cutting-edge computer vision research. Much like CVPR, the ICCV is also organized by the IEEE Computer Society, attracting visionaries, researchers, and professionals worldwide. Topics range from object detection and recognition all the way to computational photography. ICCV invites original papers offering a significant contribution to the field. The criteria for submissions are very similar to CVPR's: papers must include mathematical formulations, algorithms, experimental methodology, and results. ICCV adopts peer review to add a layer of technical rigor and quality to the accepted papers. Submissions usually undergo multiple stages of review, giving detailed feedback on the technical aspects of the research paper. The acceptance rates at ICCV are typically low, around 26.2%.

Besides the main conference, the ICCV hosts workshops and tutorials that offer in-depth discussions and presentations in emerging research areas. It also offers challenges and competitions associated with computer vision tasks like image segmentation and object detection. 

Like the CVPR, it offers excellent opportunities for future collaborations, networking with peers, and exchanging ideas. The papers accepted at the ICCV are typically published in the IEEE Computer Society and made available to the vision community. This offers significant visibility and recognition to researchers for papers that are accepted.

European Conference on Computer Vision (ECCV)

The European Conference on Computer Vision, or ECCV , is another comprehensive conference if you are looking for the top computer vision conferences globally. The ECCV lays a lot of emphasis on the scientific and technical quality of the paper. Like the above two conferences we discussed, it emphasizes how the researcher incorporates the mathematical foundations, algorithms, and detailed derivations and proofs with extensive experimental evaluations. 

According to the ECCV formatting guidelines, the research paper ideally ranges from 10 to 14 pages. It adopts a double-blind peer review, where researchers must anonymize their submissions to avoid reviewer bias.


ECCV also offers huge opportunities for collaborations and establishing connections. With an acceptance rate of 31.8%, a researcher can benefit from academic recognition, high visibility, and citations.

Winter Conference on Applications of Computer Vision (WACV)

WACV is a top international computer vision event with the main conference and a few workshops and tutorials. Much like the other conferences, it is held annually. With an acceptance rate below 30%, it attracts leading researchers and industry professionals. The conference usually takes place in the first week of January. 


As a computer vision researcher, you should also publish your work in journals to share your findings and give more insights into the field. Let us look at a few computer vision journals.

Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Also called TPAMI, this journal focuses on the various aspects of machine intelligence, pattern recognition, and computer vision. It offers hybrid publication, permitting traditional or author-paid open-access manuscript submissions.

With open-access manuscripts, the paper is freely accessible through IEEE Xplore and the Computer Society Digital Library.

Regarding traditional manuscript submissions, the IEEE Computer Society has various award-winning journals for publication. One can browse through the different topics that fit their research. They often publish special sections on emerging topics. Some factors you need to consider are submission-to-publication time, bibliometric scores like impact factor, and publishing fees.

International Journal of Computer Vision (IJCV)

The IJCV offers a platform for new research results. With 15 issues a year, the International Journal of Computer Vision publishes high-quality, original contributions to the field of computer vision. The length of the articles ranges from 10-page regular articles up to 30 pages for survey papers that offer state-of-the-art presentations and results. The research should cover the mathematical, physical, and computational aspects of computer vision, like image formation, processing, interpretation, machine learning techniques, and statistical approaches. Researchers are not charged to publish in IJCV. It is not only a journal that opens doors for researchers to showcase their papers but also a goldmine of information in deep learning, artificial intelligence, and robotics.

Journal of Machine Learning Research (JMLR)

Established in 2000, JMLR is a forum for electronic and paper publication of comprehensive research papers. The platform covers topics like machine learning algorithms and techniques, deep learning, neural networks, robotics, and computer vision. JMLR is freely available to the public. It is run by volunteers, and the papers undergo rigorous review, making it a valuable resource for the latest developments in the field.

You’ve invested weeks and months into this paper. Why not get the recognition and credibility your work deserves? The above Journals and Conferences offer the ultimate gateway for a researcher to showcase their works and open up a plethora of opportunities for academic and industry collaborations.

In conclusion, our journey through the intricate world of computer vision research has been a fun one. From the initial stages of understanding the problem statement to the final steps of publication, we've delved into each of them comprehensively.

No research is too big or too small; each project offers its own contribution to the ever-evolving field of computer vision.

We’ve more detailed posts coming your way. Stay tuned! See you guys in the next one!!



Computer Vision Laboratory (CVL)


Master thesis project proposals

Internal projects

  • A list of internal CVL projects can be found in the CVL GIT (open to all LiU students).
  • If you are interested in doing a research related project, but do not see a suitable one listed here, feel free to contact one of the researchers at the lab. We normally have several more opportunities for internal master thesis projects related to research projects. These can often be adapted to the particular interests of the student.

External projects

  • NB! Please first check the list of new external projects.
  • [2022-11-18] Nordic Evolution: Digital Guides for Visually Impaired Athletes
  • [2022-10-10] Zenseact: Multiple computer vision master theses proposals. E.g. Learning-based Road Estimation
  • [2022-09-06] FOI: Neuromorphic Imaging
  • [2022-02-21] FOI: Night Vision with Machine-Learning-Based Image Fusion
  • [2021-10-14] Scania: Estimation of Scene Depth for Perception in Autonomous Heavy-Duty Vehicles
  • [2021-10-14] Scania: Visual-Inertial Odometry (VIO)
  • [2021-10-14] Scania: Really, really fast tracking in image space
  • [2021-10-14] Scania: Single Stage Instance Segmentation in Autonomous Heavy-Duty Vehicles
  • [2021-10-14] Scania: Trajectory and intention prediction of annotated tracked objects
  • [2021-10-14] Scania: Efficient algorithm development for GPUs
  • [2021-09-09] Viscando AB Gothenburg: Projects in deep learning, signal processing and modelling for traffic and autonomous vehicle safety
  • [2021-02-12] NFC: Photometric Stereo on Tool Marks
  • [2020-11-13] FOI Linköping: Deep Learning for 3D-Imaging LiDAR
  • [2020-11-06] Arkus AI: Apply Machine Learning and Computer Vision in Genetic Diagnostics
  • [2020-10-28] Ericsson: 3D reconstruction for mobile devices
  • [2020-10-07] Veoneer: Static and Dynamic Windshield Distortion Modeling
  • [2020-01-09] IEI: Facial Analysis in Thermal Images for Pilot Stress Recognition

More information about Master's thesis projects in Computer Vision.

Last updated: 2023-10-13



research proposal on computer vision

Segmentation Of Remote Sensing Imagery

research proposal on computer vision

The Semantic Segmentation Of Remote Sensing Imagery

research proposal on computer vision

Few-Shot Transfer Learning for Saliency Prediction

research proposal on computer vision

Aerial Video Saliency Prediction

3d anomaly detection, video anomaly detection, artifact detection, document layout analysis.

research proposal on computer vision

Face Generation

research proposal on computer vision

Talking Head Generation

Talking face generation.

research proposal on computer vision

Face Age Editing

Facial expression generation, kinship face generation.

research proposal on computer vision

Point cloud reconstruction

research proposal on computer vision

3D Semantic Scene Completion

research proposal on computer vision

3D Semantic Scene Completion from a single RGB image

Garment reconstruction, privacy preserving deep learning, membership inference attack, human detection.

research proposal on computer vision

Video Instance Segmentation

research proposal on computer vision

Generalized Few-Shot Semantic Segmentation

Video editing, video temporal consistency, line items extraction, virtual try-on.

research proposal on computer vision

Generalized Referring Expression Segmentation

Scene flow estimation.

research proposal on computer vision

Self-supervised Scene Flow Estimation

Depth completion.

research proposal on computer vision

Object Discovery

Motion forecasting.

research proposal on computer vision

Multi-Person Pose forecasting

research proposal on computer vision

Multiple Object Forecasting

3d classification, machine unlearning, continual forgetting, gaze estimation.

research proposal on computer vision

CARLA MAP Leaderboard

Dead-reckoning prediction, face reconstruction.

research proposal on computer vision

text-guided-image-editing

Text-based image editing, concept alignment.

research proposal on computer vision

Zero-Shot Text-to-Image Generation

Conditional text-to-image synthesis, texture synthesis, multi-view learning, incomplete multi-view clustering, sign language recognition.

research proposal on computer vision

Gait Recognition

research proposal on computer vision

Multiview Gait Recognition

Gait recognition in the wild, interactive segmentation, scene generation, image recognition, fine-grained image recognition, license plate recognition, material recognition.

research proposal on computer vision

Breast Cancer Detection

Skin cancer classification.

research proposal on computer vision

Breast Cancer Histology Image Classification

Lung cancer diagnosis, classification of breast cancer histology images, event-based vision.

research proposal on computer vision

Event-based Optical Flow

research proposal on computer vision

Event-Based Video Reconstruction

Event-based motion estimation, interest point detection, homography estimation.

research proposal on computer vision

3D Multi-Person Pose Estimation (absolute)

research proposal on computer vision

3D Multi-Person Mesh Recovery

research proposal on computer vision

3D Multi-Person Pose Estimation (root-relative)

Object counting, training-free object counting, open-vocabulary object counting, human parsing.

research proposal on computer vision

Multi-Human Parsing

Weakly supervised segmentation, disease prediction, disease trajectory forecasting, pose tracking.

research proposal on computer vision

3D Human Pose Tracking

research proposal on computer vision

3D Hand Pose Estimation

Facial landmark detection.

research proposal on computer vision

Unsupervised Facial Landmark Detection

research proposal on computer vision

3D Facial Landmark Localization

research proposal on computer vision

Dichotomous Image Segmentation

3d character animation from a single photo, activity detection, inverse rendering, scene segmentation, temporal localization.

research proposal on computer vision

Language-Based Temporal Localization

Temporal defect localization, text-to-video generation, text-to-video editing, subject-driven video generation, multi-label image classification.

research proposal on computer vision

Multi-label Image Recognition with Partial Labels

3d object tracking.

research proposal on computer vision

3D Single Object Tracking

Template matching, camera localization.

research proposal on computer vision

Camera Relocalization

Lidar semantic segmentation, motion segmentation, text spotting.

research proposal on computer vision

Visual Dialog

research proposal on computer vision

Intelligent Surveillance

research proposal on computer vision

Vehicle Re-Identification

Relation network, disparity estimation.

research proposal on computer vision

Few-Shot Class-Incremental Learning

Class-incremental semantic segmentation, non-exemplar-based class incremental learning, decision making under uncertainty.

research proposal on computer vision

Uncertainty Visualization

Knowledge distillation.

research proposal on computer vision

Data-free Knowledge Distillation

Self-knowledge distillation, moment retrieval.

research proposal on computer vision

Zero-shot Moment Retrieval

Text to video retrieval, partially relevant video retrieval, handwritten text recognition, handwritten document recognition, unsupervised text recognition, person search, semi-supervised object detection.

research proposal on computer vision

Mixed Reality

Shadow detection.

research proposal on computer vision

Shadow Detection And Removal

research proposal on computer vision

Unconstrained Lip-synchronization

Future prediction, human mesh recovery, video enhancement, video inpainting.

research proposal on computer vision

Face Image Quality Assessment

Lightweight face recognition.

research proposal on computer vision

Age-Invariant Face Recognition

Synthetic face recognition, face quality assessement.

research proposal on computer vision

Cross-corpus

Micro-expression recognition, micro-expression spotting.

research proposal on computer vision

3D Facial Expression Recognition

research proposal on computer vision

Smile Recognition

research proposal on computer vision

3D Multi-Object Tracking

Real-time multi-object tracking, multi-animal tracking with identification, trajectory long-tail distribution for muti-object tracking, grounded multiple object tracking.

research proposal on computer vision

Stereo Image Super-Resolution

Burst image super-resolution, satellite image super-resolution, multispectral image super-resolution, image categorization, fine-grained visual categorization, open vocabulary semantic segmentation, zero-guidance segmentation, physics-informed machine learning, soil moisture estimation, video reconstruction.

research proposal on computer vision

Zero Shot Segmentation

Sign language translation.

research proposal on computer vision

Overlapped 10-1

Overlapped 15-1, overlapped 15-5, disjoint 10-1, disjoint 15-1, color constancy.

research proposal on computer vision

Few-Shot Camera-Adaptive Color Constancy

Hdr reconstruction, multi-exposure image fusion, deep attention, line detection, visual recognition.

research proposal on computer vision

Fine-Grained Visual Recognition

Tone mapping, zero-shot action recognition, image cropping, stereo matching hand.

research proposal on computer vision

3D Absolute Human Pose Estimation

research proposal on computer vision

Text-to-Face Generation

Natural language transduction, image forensics, image to 3d, infrared and visible image fusion.

research proposal on computer vision

Novel Class Discovery

research proposal on computer vision

Breast Cancer Histology Image Classification (20% labels)

Landmark-based lipreading, transparent object detection, transparent objects, video restoration.

research proposal on computer vision

Analog Video Restoration

Abnormal event detection in video.

research proposal on computer vision

Semi-supervised Anomaly Detection

Surface normals estimation.

research proposal on computer vision

Vision-Language Navigation

research proposal on computer vision

hand-object pose

research proposal on computer vision

Grasp Generation

research proposal on computer vision

3D Canonical Hand Pose Estimation

Image animation.

research proposal on computer vision

cross-domain few-shot learning

Texture classification, probabilistic deep learning, action quality assessment, pedestrian attribute recognition.

research proposal on computer vision

Spoof Detection

Face presentation attack detection, detecting image manipulation, cross-domain iris presentation attack detection, finger dorsal image spoof detection, unsupervised few-shot image classification, generalized few-shot classification, highlight detection, steganalysis.

research proposal on computer vision

Sketch Recognition

research proposal on computer vision

Face Sketch Synthesis

Drawing pictures.

research proposal on computer vision

Photo-To-Caricature Translation

Computer vision techniques adopted in 3d cryogenic electron microscopy, single particle analysis, cryogenic electron tomography, meme classification, hateful meme classification, action understanding, dense captioning, person retrieval, segmentation, open-vocabulary semantic segmentation, iris recognition, pupil dilation, image to video generation.

research proposal on computer vision

Unconditional Video Generation

Image stitching.

research proposal on computer vision

One-shot visual object segmentation

research proposal on computer vision

Unbiased Scene Graph Generation

research proposal on computer vision

Panoptic Scene Graph Generation

Automatic post-editing.

research proposal on computer vision

Document Image Classification

research proposal on computer vision

Face Reenactment

research proposal on computer vision

Multi-View 3D Reconstruction

Universal domain adaptation, surgical phase recognition, online surgical phase recognition, offline surgical phase recognition, blind face restoration.

research proposal on computer vision

Geometric Matching

Human action generation.

research proposal on computer vision

Action Generation

Object categorization, text based person retrieval, diffusion personalization.

research proposal on computer vision

Diffusion Personalization Tuning Free

research proposal on computer vision

Efficient Diffusion Personalization

Human dynamics.

research proposal on computer vision

3D Human Dynamics

Severity prediction, intubation support prediction, cloud detection.

research proposal on computer vision

Table Recognition

Text-to-image, story visualization, complex scene breaking and synthesis, image fusion, pansharpening, image deconvolution.

research proposal on computer vision

Image Outpainting

research proposal on computer vision

Object Segmentation

research proposal on computer vision

Camouflaged Object Segmentation

Landslide segmentation, text-line extraction, point clouds, point cloud video understanding, point cloud rrepresentation learning.

research proposal on computer vision

Semantic SLAM

research proposal on computer vision

Object SLAM

Image shadow removal, intrinsic image decomposition, line segment detection, sports analytics, situation recognition, grounded situation recognition, face image quality, motion detection, multi-target domain adaptation, person identification, visual prompt tuning, single-source domain generalization, evolving domain generalization, source-free domain generalization.

research proposal on computer vision

Robot Pose Estimation

research proposal on computer vision

Camouflaged Object Segmentation with a Single Task-generic Prompt

Image morphing, image steganography, rotated mnist, weakly-supervised instance segmentation, image smoothing, fake image detection.

research proposal on computer vision

GAN image forensics

research proposal on computer vision

Fake Image Attribution

Lane detection.

research proposal on computer vision

3D Lane Detection

Layout design, occlusion handling, contour detection.

research proposal on computer vision

Crop Classification

License plate detection.

research proposal on computer vision

Video Panoptic Segmentation

Viewpoint estimation.

research proposal on computer vision

Drone navigation

Drone-view target localization, value prediction, body mass index (bmi) prediction, crop yield prediction, multi-object tracking and segmentation.

research proposal on computer vision

Zero-Shot Transfer Image Classification

research proposal on computer vision

motion retargeting

research proposal on computer vision

3D Object Reconstruction From A Single Image

research proposal on computer vision

CAD Reconstruction

3d point cloud linear classification, multiview learning, person recognition.

research proposal on computer vision

Photo Retouching

Shape representation of 3d point clouds, bird's-eye view semantic segmentation.

research proposal on computer vision

Dense Pixel Correspondence Estimation

Human part segmentation.

research proposal on computer vision

Document Shadow Removal

Symmetry detection, traffic sign detection, video style transfer, referring image matting.

research proposal on computer vision

Referring Image Matting (Expression-based)

research proposal on computer vision

Referring Image Matting (Keyword-based)

research proposal on computer vision

Referring Image Matting (RefMatte-RW100)

Referring image matting (prompt-based), human interaction recognition, one-shot 3d action recognition, mutual gaze, affordance detection.

research proposal on computer vision

Gaze Prediction

Hand detection, image forgery detection, image instance retrieval, amodal instance segmentation, image quality estimation.

research proposal on computer vision

Image Similarity Search

research proposal on computer vision

Material Classification

research proposal on computer vision

Precipitation Forecasting

Referring expression generation, road damage detection.

research proposal on computer vision

Space-time Video Super-resolution

Video matting.

research proposal on computer vision

inverse tone mapping

Semi-supervised image classification.

research proposal on computer vision

Open-World Semi-Supervised Learning

Semi-supervised image classification (cold start), facial editing.

research proposal on computer vision

Holdout Set

Multispectral object detection.

research proposal on computer vision

Open Vocabulary Attribute Detection

Image/document clustering, self-organized clustering, instance search.

research proposal on computer vision

Audio Fingerprint

3d shape modeling.

research proposal on computer vision

Action Analysis

Art analysis.

research proposal on computer vision

Zero-Shot Composed Image Retrieval (ZS-CIR)

Food recognition.

research proposal on computer vision

Motion Magnification

Semi-supervised instance segmentation, binary classification, llm-generated text detection, cancer-no cancer per breast classification, cancer-no cancer per image classification, suspicous (birads 4,5)-no suspicous (birads 1,2,3) per image classification, cancer-no cancer per view classification, video segmentation, camera shot boundary detection, open-vocabulary video segmentation, open-world video segmentation, lung nodule classification, lung nodule 3d classification, lung nodule detection, lung nodule 3d detection, 3d scene reconstruction, event segmentation, generic event boundary detection, image retouching, image-variation, jpeg artifact removal, point cloud super resolution, skills assessment.

research proposal on computer vision

Text-based Person Retrieval

research proposal on computer vision

Sensor Modeling

Handwriting verification, bangla spelling error correction, video prediction, earth surface forecasting, predict future video frames, 3d open-vocabulary instance segmentation.

research proposal on computer vision

Ad-hoc video search

Audio-visual synchronization, handwriting generation, pose retrieval, scanpath prediction, scene change detection.

research proposal on computer vision

Sketch-to-Image Translation

Skills evaluation, synthetic image detection, highlight removal, 2d pose estimation, category-agnostic pose estimation, overlapping pose estimation, 3d shape reconstruction from a single 2d image.

research proposal on computer vision

Shape from Texture

Deception detection, deception detection in videos.

research proposal on computer vision

Video Visual Relation Detection

Human-object relationship detection, 3d shape representation.

research proposal on computer vision

3D Dense Shape Correspondence

Birds eye view object detection.

research proposal on computer vision

Image Comprehension

Image manipulation localization, multiple people tracking.

research proposal on computer vision

Network Interpretation

Rgb-d reconstruction, seeing beyond the visible, semi-supervised domain generalization, unsupervised semantic segmentation.

research proposal on computer vision

Unsupervised Semantic Segmentation with Language-image Pre-training

Multiple object tracking with transformer.

research proposal on computer vision

Multiple Object Track and Segmentation

Constrained lip-synchronization, face dubbing, vietnamese visual question answering, explanatory visual question answering, 3d shape reconstruction, 4d panoptic segmentation, defocus blur detection, event data classification, instance shadow detection, kinship verification, medical image enhancement, open vocabulary panoptic segmentation, short-term object interaction anticipation, single-object discovery, training-free 3d point cloud classification, video forensics.

research proposal on computer vision

Sequential Place Recognition

Autonomous flight (dense forest), autonomous web navigation.

research proposal on computer vision

Generative 3D Object Classification

Cube engraving classification, facial expression recognition, cross-domain facial expression recognition, zero-shot facial expression recognition, multimodal machine translation.

research proposal on computer vision

Face to Face Translation

Multimodal lexical translation, 10-shot image generation, 2d semantic segmentation task 3 (25 classes), document enhancement, action assessment, bokeh effect rendering, drivable area detection, face anonymization, font recognition, horizon line estimation, image imputation.

research proposal on computer vision

Long Video Retrieval (Background Removed)

Medical image denoising.

research proposal on computer vision

Occlusion Estimation

Personalized image generation, physiological computing.

research proposal on computer vision

Lake Ice Monitoring

Spatio-temporal video grounding, text-based person retrieval with noisy correspondence.

research proposal on computer vision

Unsupervised 3D Point Cloud Linear Evaluation

Vcgbench-diverse, wireframe parsing, gaze redirection, single-image-generation, unsupervised anomaly detection with specified settings -- 30% anomaly, root cause ranking, anomaly detection at 30% anomaly, anomaly detection at various anomaly percentages.

research proposal on computer vision

Unsupervised Contextual Anomaly Detection

Landmark tracking, muscle tendon junction identification, mistake detection, online mistake detection, 3d object captioning, 3d semantic occupancy prediction, 3d scene editing, animated gif generation, generalized referring expression comprehension, image deblocking, image retargeting, infrared image super-resolution, motion disentanglement, personality trait recognition, persuasion strategies, scene text editing, image to sketch recognition, traffic accident detection, accident anticipation, unsupervised landmark detection, vehicle speed estimation, visual speech recognition, lip to speech synthesis, continual anomaly detection, weakly supervised action segmentation (transcript), weakly supervised action segmentation (action set)), calving front delineation in synthetic aperture radar imagery, calving front delineation in synthetic aperture radar imagery with fixed training amount.

research proposal on computer vision

Handwritten Line Segmentation

Handwritten word segmentation.

research proposal on computer vision

General Action Video Anomaly Detection

Physical video anomaly detection, monocular cross-view road scene parsing(road), monocular cross-view road scene parsing(vehicle).

research proposal on computer vision

Transparent Object Depth Estimation

Age and gender estimation, data ablation.

research proposal on computer vision

Occluded Face Detection

Fingertip detection, gait identification, historical color image dating, stochastic human motion prediction, image and video forgery detection, motion captioning, personalized segmentation, repetitive action counting, scene-aware dialogue, spatial relation recognition, spatial token mixer, steganographics, story continuation.

research proposal on computer vision

Unsupervised Anomaly Detection with Specified Settings -- 0.1% anomaly

Unsupervised anomaly detection with specified settings -- 1% anomaly, unsupervised anomaly detection with specified settings -- 10% anomaly, unsupervised anomaly detection with specified settings -- 20% anomaly, visual analogies, visual social relationship recognition, zero-shot text-to-video generation, text-guided-generation, video frame interpolation, 3d video frame interpolation, unsupervised video frame interpolation.

research proposal on computer vision

eXtreme-Video-Frame-Interpolation

Continual semantic segmentation, overlapped 5-3, overlapped 25-25, micro-expression generation, micro-expression generation (megc2021), period estimation, art period estimation (544 artists), unsupervised panoptic segmentation, unsupervised zero-shot panoptic segmentation, 3d rotation estimation, camera auto-calibration, defocus estimation, derendering, hierarchical text segmentation, human-object interaction concept discovery.

research proposal on computer vision

One-Shot Face Stylization

Keypoint detection and image matching, speaker-specific lip to speech synthesis, multi-person pose estimation, neural stylization.

research proposal on computer vision

Part-aware Panoptic Segmentation

research proposal on computer vision

Population Mapping

Pornography detection, prediction of occupancy grid maps, raw reconstruction, svbrdf estimation, semi-supervised video classification, spectrum cartography, supervised image retrieval, synthetic image attribution, training-free 3d part segmentation, unsupervised image decomposition, video propagation, vietnamese multimodal learning, weakly supervised 3d point cloud segmentation, weakly-supervised panoptic segmentation, drone-based object tracking, brain visual reconstruction, brain visual reconstruction from fmri.

research proposal on computer vision

Human-Object Interaction Generation

Image-guided composition, fashion understanding, semi-supervised fashion compatibility.

research proposal on computer vision

intensity image denoising

Lifetime image denoising, observation completion, active observation completion, boundary grounding.

research proposal on computer vision

Video Narrative Grounding

3d inpainting, 3d scene graph alignment, 4d spatio temporal semantic segmentation.

research proposal on computer vision

Age Estimation

research proposal on computer vision

Few-shot Age Estimation

Brdf estimation, camouflage segmentation, clothing attribute recognition, damaged building detection, depth image estimation, detecting shadows, dynamic texture recognition.

research proposal on computer vision

Disguised Face Verification

Few shot open set object detection, gaze target estimation, generalized zero-shot learning - unseen, grounded multimodal named entity recognition, hd semantic map learning, human-object interaction anticipation, image deep networks, manufacturing quality control, materials imaging, micro-gesture recognition, multi-person pose estimation and tracking.

research proposal on computer vision

Multi-modal image segmentation

Multi-object discovery, neural radiance caching.

research proposal on computer vision

Parking Space Occupancy

research proposal on computer vision

Partial Video Copy Detection

research proposal on computer vision

Multimodal Patch Matching

Perpetual view generation, procedure learning, prompt-driven zero-shot domain adaptation, safety perception recognition, jersey number recognition, photo to rest generalization, single-shot hdr reconstruction, on-the-fly sketch based image retrieval, thermal image denoising, trademark retrieval, unsupervised instance segmentation, unsupervised zero-shot instance segmentation, vehicle key-point and orientation estimation.

research proposal on computer vision

Video Individual Counting

Video-adverb retrieval (unseen compositions), video-to-image affordance grounding.

research proposal on computer vision

Vietnamese Scene Text

Visual sentiment prediction, human-scene contact detection, localization in video forgery, video classification, student engagement level detection (four class video classification), multi class classification (four-level video classification), 3d canonicalization, 3d surface generation.

research proposal on computer vision

Visibility Estimation from Point Cloud

Amodal layout estimation, blink estimation, camera absolute pose regression, change data generation, constrained diffeomorphic image registration, continuous affect estimation, dataset distillation, deep feature inversion, document image skew estimation, earthquake prediction, fashion compatibility learning.

research proposal on computer vision

Displaced People Recognition

Finger vein recognition, flooded building segmentation.

research proposal on computer vision

Future Hand Prediction

Generative temporal nursing, house generation, human fmri response prediction, hurricane forecasting, ifc entity classification, image declipping, image similarity detection.

research proposal on computer vision

Image Text Removal

Image-to-gps verification.

research proposal on computer vision

Image-based Automatic Meter Reading

Dial meter reading, indoor scene reconstruction, jpeg decompression.

research proposal on computer vision

Kiss Detection

Laminar-turbulent flow localisation.

research proposal on computer vision

Landmark Recognition

Brain landmark detection, corpus video moment retrieval, linear probing object-level 3d awareness, mllm evaluation: aesthetics, medical image deblurring, mental workload estimation, meter reading, motion expressions guided video segmentation, natural image orientation angle detection, multi-object colocalization, multilingual text-to-image generation, video emotion detection, nwp post-processing, occluded 3d object symmetry detection, open set video captioning, pso-convnets dynamics 1, pso-convnets dynamics 2, partial point cloud matching.

research proposal on computer vision

Partially View-aligned Multi-view Learning

research proposal on computer vision

Pedestrian Detection

research proposal on computer vision

Thermal Infrared Pedestrian Detection

Personality trait recognition by face, physical attribute prediction, point cloud semantic completion, point cloud classification dataset, point- of-no-return (pnr) temporal localization, pose contrastive learning, potrait generation, procedure step recognition, prostate zones segmentation, pulmorary vessel segmentation, pulmonary artery–vein classification, reference expression generation, interspecies facial keypoint transfer, specular reflection mitigation, specular segmentation, state change object detection, surface normals estimation from point clouds, train ego-path detection.

research proposal on computer vision

Transform A Video Into A Comics

Transparency separation, typeface completion.

research proposal on computer vision

Unbalanced Segmentation

research proposal on computer vision

Unsupervised Long Term Person Re-Identification

Video correspondence flow.

research proposal on computer vision

Key-Frame-based Video Super-Resolution (K = 15)

Zero-shot single object tracking, yield mapping in apple orchards, lidar absolute pose regression, opd: single-view 3d openable part detection, self-supervised scene text recognition, spatial-aware image editing, video narration captioning, spectral estimation, spectral estimation from a single rgb image, 3d prostate segmentation, aggregate xview3 metric, atomic action recognition, composite action recognition, calving front delineation from synthetic aperture radar imagery, computer vision transduction, crosslingual text-to-image generation, zero-shot dense video captioning, document to image conversion, frame duplication detection, geometrical view, hyperview challenge.

research proposal on computer vision

Image Operation Chain Detection

Kinematic based workflow recognition, logo recognition.

research proposal on computer vision

MLLM Aesthetic Evaluation

Motion detection in non-stationary scenes, open-set video tagging, satellite orbit determination.

research proposal on computer vision

Segmentation Based Workflow Recognition

2d particle picking, small object detection.

research proposal on computer vision

Rice Grain Disease Detection

Sperm morphology classification, video & kinematic base workflow recognition, video based workflow recognition, video, kinematic & segmentation base workflow recognition, animal pose estimation.

Research Topics of the Computer Vision & Graphics Group at Fraunhofer HHI

Seeing, modelling and animating humans.


Realistic human modelling is a challenging task in Computer Vision and Graphics. We investigate new methods for capturing and analyzing human bodies and faces in images and videos, as well as compact models for representing facial expressions, human bodies, and their motion. We combine model-based and image- and video-based representations with generative AI models and neural rendering.


Scenes, Structure and Motion


We have a long tradition in 3D scene analysis and continuously perform innovative research in 3D capturing and 3D reconstruction, ranging from highly detailed stereo and multi-view reconstruction of static objects and scenes (including complex surface and shape properties), through monocular shape-from-X methods, to the analysis of deforming objects in monocular video.

Computational Imaging and Video


We perform innovative research in video processing and computational video, opening up new opportunities for how dynamic scenes can be analyzed and how video footage can be represented, edited, and seamlessly augmented with new content.

Learning and Inference


Our research combines computer vision, computer graphics, and machine learning to understand images and video data. We focus on combining deep learning with strong models or physical constraints in order to unite the advantages of model-based and data-driven methods.

Augmented and Mixed Reality


Our experience in tracking dynamic scenes and objects, together with photorealistic rendering, enables new augmented reality solutions in which virtual content is seamlessly blended into real video footage, with applications in multimedia, industry, and medicine.

Previous Research Projects


We have performed various research projects in the above fields over the years.



Top Computer Vision Papers of All Time (Updated 2024)



Today's boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNNs). The main CV methods include image classification, image localization, object detection, and segmentation.

In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories: classical CV approaches and papers based on deep learning. We chose the following papers based on their influence, quality, and applicability.

The papers covered in this article are:

  • Gradient-Based Learning Applied to Document Recognition (1998)
  • Distinctive Image Features from Scale-Invariant Keypoints (2004)
  • Histograms of Oriented Gradients for Human Detection (2005)
  • SURF: Speeded Up Robust Features (2006)
  • ImageNet Classification with Deep Convolutional Neural Networks (2012)
  • Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
  • GoogLeNet – Going Deeper with Convolutions (2014)
  • ResNet – Deep Residual Learning for Image Recognition (2015)
  • Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
  • YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)
  • Mask R-CNN (2017)
  • EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)


Classic Computer Vision Papers

The authors Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition, and studied discriminative and non-discriminative gradient-based techniques for training the recognizer globally, without manual segmentation and labeling.


Characteristics of the model:

  • LeNet-5 is a 7-layer CNN (three convolutional layers, two subsampling layers, and fully connected layers); its first convolutional layer alone has 156 trainable parameters.
  • The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class.
  • The training set consists of 30000 examples, and the authors achieved a 0.35% error rate on the training set (after 19 passes).

Find the LeNet paper here.
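For readers who want to see the architecture in code, below is a minimal PyTorch sketch of a LeNet-5-style network. This is an illustrative re-implementation, not the authors' code: the original trainable subsampling layers and RBF output are replaced by average pooling and a linear classifier, as is common in modern re-implementations.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A LeNet-5-style network for 32x32 grayscale digit images.
    The original paper uses trainable subsampling and an RBF output layer;
    this sketch substitutes average pooling and a linear classifier."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 feature maps, 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S2: 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: 16 feature maps, 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S4: 5x5
            nn.Conv2d(16, 120, kernel_size=5), # C5: 120 feature maps, 1x1
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),        # output (replaces the RBF units)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = LeNet5()
    logits = model(torch.randn(1, 1, 32, 32))  # one 32x32 grayscale image
    print(logits.shape)                        # torch.Size([1, 10])
```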

David Lowe (2004) proposed a method for extracting distinctive invariant features from images and used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale-Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.


Model characteristics:

  • The method generates large numbers of features that densely cover the image over the full range of scales and locations.
  • The model needs to match at least 3 features from each object in order to reliably detect small objects in cluttered backgrounds.
  • For image matching and recognition, the model extracts SIFT features from a set of reference images and stores them in a database.
  • SIFT matches a new image by individually comparing each of its features to this database, using the Euclidean distance between feature vectors.

Find the SIFT paper here.
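As a rough illustration of how SIFT is commonly used today, the sketch below relies on OpenCV's built-in SIFT implementation to detect keypoints in two images and keep matches that pass Lowe's ratio test. The synthetic images are only there to make the snippet self-contained; in practice you would load two real photos.

```python
import cv2
import numpy as np

# Build a synthetic textured image and a rotated copy so the example runs
# without external files; in practice you would cv2.imread() two photos.
rng = np.random.default_rng(0)
img1 = (rng.random((480, 640)) * 255).astype(np.uint8)
img1 = cv2.GaussianBlur(img1, (7, 7), 0)
M = cv2.getRotationMatrix2D((320, 240), 15, 1.1)   # rotate and slightly scale
img2 = cv2.warpAffine(img1, M, (640, 480))

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match each descriptor to its two nearest neighbours (Euclidean distance)
# and keep matches passing Lowe's ratio test, as described in the paper.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(kp1)} / {len(kp2)} keypoints, {len(good)} ratio-test matches")
```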

The authors Navneet Dalal and Bill Triggs studied feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They showed experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection.


The authors' achievements:

  • The histogram method gave near-perfect separation on the original MIT pedestrian database.
  • For good results, the method requires fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
  • The researchers also examined a more challenging dataset containing over 1,800 annotated human images with a wide range of pose variations and backgrounds.
  • In the standard detector, each HOG cell appears four times with different normalizations, which improves performance to 89%.

Find the HOG paper here.
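The sketch below illustrates the HOG-plus-linear-SVM recipe with the paper's standard parameters (64×128 detection windows, 9 orientation bins, 8×8-pixel cells, 2×2-cell blocks with L2-Hys normalization), using scikit-image and scikit-learn. The random "windows" and alternating labels are placeholders for a real pedestrian dataset.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(window: np.ndarray) -> np.ndarray:
    # 64x128 window, 9 orientation bins, 8x8 cells, 2x2 blocks, L2-Hys norm,
    # matching the standard Dalal & Triggs configuration.
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

rng = np.random.default_rng(0)
windows = rng.random((40, 128, 64))       # 40 fake 64x128 grayscale windows
labels = np.array([0, 1] * 20)            # fake non-person / person labels

X = np.stack([hog_descriptor(w) for w in windows])
clf = LinearSVC().fit(X, labels)          # linear SVM over HOG descriptors
print(X.shape)                            # (40, 3780): one descriptor per window
print(clf.decision_function(X[:3]))       # SVM scores for three windows
```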

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor called SURF (Speeded Up Robust Features). It outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, while being much faster to compute. The authors relied on integral images for image convolutions and built on the strengths of the leading existing detectors and descriptors.


  • Applied a Hessian matrix-based measure for the detector and a distribution-based descriptor, simplifying these methods to the essential.
  • Presented experimental results on a standard evaluation set as well as on imagery obtained in the context of a real-life object recognition application.
  • SURF showed strong performance: SURF-128 achieved an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).

Find the SURF paper here.

Papers Based on Deep-Learning Models

Alex Krizhevsky and his team won the ImageNet Challenge in 2012 with a deep convolutional neural network. They trained one of the largest CNNs to date on the ImageNet data used in the ILSVRC-2010 and ILSVRC-2012 challenges and achieved the best results reported on these datasets. They also wrote a highly optimized GPU implementation of 2D convolution and the other operations required for CNN training, and published the results.


  • The final CNN contained five convolutional and three fully connected layers, and this depth proved to be significant.
  • They found that removing any convolutional layer (each containing no more than 1% of the model's parameters) resulted in inferior performance.
  • The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
  • After fine-tuning on ImageNet-2012, it gave an error rate of 16.6%.

Find the ImageNet paper here.
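To get a feel for the network's size and layout, the snippet below instantiates torchvision's AlexNet (an independent re-implementation of the paper's architecture, not the original code) and pushes a fake ImageNet-sized image through it.

```python
import torch
from torchvision import models

# AlexNet as shipped in torchvision: 5 convolutional + 3 fully connected
# layers with a 1000-way output. No weights are loaded here; pass
# weights="IMAGENET1K_V1" (or pretrained=True on older torchvision)
# to use the trained model.
model = models.alexnet()
n_params = sum(p.numel() for p in model.parameters())
print(model.features)                     # the five conv layers plus pooling
print(f"{n_params / 1e6:.1f}M parameters")

x = torch.randn(1, 3, 224, 224)           # one fake ImageNet-sized image
print(model(x).shape)                     # torch.Size([1, 1000]) class scores
```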

Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, i.e. very deep convolutional networks (VGG). They showed that a significant improvement over the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.


  • Their ImageNet Challenge 2014 submission secured first and second place in the localization and classification tracks, respectively.
  • They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
  • They made their two best-performing ConvNet models publicly available to facilitate further research on deep visual representations in computer vision.

Find the VGG paper here.
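The snippet below sketches the central design choice: stacking small 3×3 convolutions instead of using larger filters (two 3×3 layers cover a 5×5 region, three cover a 7×7 region, with fewer parameters and more non-linearities). The helper function and channel counts are illustrative, not the exact VGG-16 configuration.

```python
import torch
import torch.nn as nn

def vgg_stage(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    """One VGG-style stage: a stack of 3x3 convolutions followed by pooling."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                      kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool2d(2))            # halve the spatial resolution
    return nn.Sequential(*layers)

stage = vgg_stage(64, 128, n_convs=2)
print(stage(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```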

The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. Their goal was to set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of the architecture is the improved utilization of the computing resources inside the network.


  • A carefully crafted design allows increasing the depth and width of the network while keeping the computational budget constant.
  • Their ILSVRC14 submission, called GoogLeNet, is a 22-layer deep network whose quality was assessed in the context of classification and detection.
  • For detection, they added 200 region proposals coming from multi-box, increasing the coverage from 92% to 93%.
  • Lastly, they used an ensemble of 6 ConvNets when classifying each region, which improved results from 40% to 43.9% accuracy.

Find the GoogLeNet paper here.
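Below is a sketch of a single Inception module in PyTorch: parallel 1×1, 3×3, and 5×5 convolutions plus max pooling, with 1×1 reductions to keep the channel count (and compute budget) in check, all concatenated along the channel dimension. The channel counts roughly follow the first Inception block of the paper, but the snippet is illustrative rather than a faithful GoogLeNet.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pooling branches, concatenated on channels."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                       # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(True),
                                nn.Conv2d(96, 128, 3, padding=1))  # 1x1 -> 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(True),
                                nn.Conv2d(16, 32, 5, padding=2))   # 1x1 -> 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))           # pool -> 1x1

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```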

Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.


  • They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still of lower complexity.
  • This result won 1st place on the ILSVRC 2015 classification task.
  • The team also presented analyses on CIFAR-10 with 100 and 1,000 layers and, thanks to the deep representations, obtained a 28% relative improvement on the COCO object detection dataset.
  • Moreover, in the ILSVRC & COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, and COCO detection/segmentation.

Find the ResNet paper here.
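The heart of ResNet fits in a few lines: a block learns a residual F(x) and outputs F(x) + x through an identity shortcut, which makes very deep stacks easier to optimize. Below is a sketch of the basic two-layer block; the deeper ResNets in the paper use bottleneck blocks and projection shortcuts when dimensions change, which are omitted here.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A basic residual block: two 3x3 convs plus an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)        # F(x) + x via the identity shortcut

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```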

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network and therefore enables nearly cost-free region proposals. The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. It is trained end-to-end to generate high-quality region proposals, which Fast R-CNN then uses for detection.


  • They merged the RPN and Fast R-CNN into a single network by sharing their convolutional features; in the terminology of neural networks with “attention” mechanisms, the RPN tells the unified network where to look.
  • For the very deep VGG-16 model, their detection system has a frame rate of 5 fps on a GPU.
  • It achieved state-of-the-art object detection accuracy on the PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
  • In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.

Find the Faster R-CNN paper here.
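As a practical illustration, torchvision ships a Faster R-CNN implementation (with a ResNet-50-FPN backbone rather than the paper's VGG-16). The sketch below runs it on a random image and inspects the returned detections; no COCO-trained detection weights are loaded here.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# torchvision's Faster R-CNN with a ResNet-50-FPN backbone. Pass
# weights="DEFAULT" on recent torchvision to load COCO-trained weights.
model = fasterrcnn_resnet50_fpn()
model.eval()

image = torch.rand(3, 480, 640)               # one fake RGB image in [0, 1]
with torch.no_grad():
    predictions = model([image])              # list with one dict per image

# Region proposals from the RPN are refined into final detections.
print(predictions[0].keys())                  # dict_keys(['boxes', 'labels', 'scores'])
print(predictions[0]["boxes"].shape)          # (N, 4) boxes as (x1, y1, x2, y2)
```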

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.


  • The base YOLO model processes images in real time at 45 frames per second.
  • A smaller version of the network, Fast YOLO, processes 155 frames per second while still achieving double the mAP of other real-time detectors.
  • Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives in the background.
  • YOLO learns very general representations of objects and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains such as artwork.

Find the YOLO paper here.
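The regression formulation is easiest to see in the shape of the output tensor. The sketch below uses the paper's S=7, B=2, C=20 setting and shows how a single grid cell's predictions decompose into boxes, confidences, and class probabilities; no trained network is involved, only the tensor layout.

```python
import torch

# YOLO maps an image to an S x S grid; each cell predicts B boxes
# (x, y, w, h, confidence) plus C class probabilities. With S=7, B=2, C=20
# this gives the paper's 7x7x30 output tensor.
S, B, C = 7, 2, 20
prediction = torch.rand(S, S, B * 5 + C)      # stand-in for a network output

cell = prediction[3, 4]                       # one grid cell
boxes = cell[: B * 5].reshape(B, 5)           # B boxes: x, y, w, h, confidence
class_probs = cell[B * 5:]                    # C conditional class probabilities

# Class-specific confidence = box confidence * conditional class probability.
scores = boxes[:, 4:5] * class_probs          # shape (B, C)
print(prediction.shape, scores.shape)         # torch.Size([7, 7, 30]) torch.Size([2, 20])
```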

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick (Facebook AI Research) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition.


  • Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
  • It showed great results on all three tracks of the COCO suite of challenges: instance segmentation, bounding-box object detection, and person keypoint detection.
  • Mask R-CNN outperformed all existing single-model entries on every task, including the COCO 2016 challenge winners.
  • The model serves as a solid baseline and eases future research in instance-level recognition.

Find the Mask R-CNN paper here.
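torchvision also provides a Mask R-CNN implementation. The sketch below shows that, compared with Faster R-CNN, each detection additionally comes with a per-instance soft mask; as before, this uses the ResNet-50-FPN backbone, a random input image, and no trained detection weights.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Mask R-CNN = Faster R-CNN + a parallel mask branch. Pass weights="DEFAULT"
# on recent torchvision to load a COCO-trained model.
model = maskrcnn_resnet50_fpn()
model.eval()

image = torch.rand(3, 480, 640)
with torch.no_grad():
    out = model([image])[0]

print(sorted(out.keys()))          # ['boxes', 'labels', 'masks', 'scores']
print(out["masks"].shape)          # (N, 1, 480, 640): one soft mask per instance
```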

The authors of EfficientNet (Mingxing Tan and Quoc V. Le) studied model scaling and identified that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all dimensions of depth, width, and resolution using a simple yet effective compound coefficient, and demonstrated its effectiveness by scaling up MobileNets and ResNet.


  • They designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets, with much better accuracy and efficiency than previous ConvNets.
  • EfficientNet-B7 achieved a state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4× smaller and 6.1× faster at inference than the best existing ConvNet.
  • It also transferred well, achieving state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and three other transfer learning datasets, with far fewer parameters.

Find the EfficientNet paper here.
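The compound scaling rule itself is just arithmetic: depth, width, and resolution grow as α^φ, β^φ, and γ^φ for a single coefficient φ, with the constants chosen so that α·β²·γ² ≈ 2, i.e. FLOPs roughly double per unit of φ. The sketch below prints the multipliers for a few values of φ using the constants reported in the paper (α=1.2, β=1.1, γ=1.15).

```python
# Compound scaling as described in the EfficientNet paper: one coefficient
# phi scales depth, width, and input resolution together.
alpha, beta, gamma = 1.2, 1.1, 1.15   # grid-searched constants from the paper

def compound_scale(phi):
    depth = alpha ** phi        # multiplier on the number of layers
    width = beta ** phi         # multiplier on the number of channels
    resolution = gamma ** phi   # multiplier on the input image size
    return depth, width, resolution

print(f"alpha * beta^2 * gamma^2 = {alpha * beta**2 * gamma**2:.3f}")  # ~2
for phi in range(0, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```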

Related Articles


MobileNet – Efficient Deep Learning for Mobile Vision

MobileNet, introduced in 2017 by a team of researchers at Google, is a Deep Learning model for Smartphones, IoT, and embedded devices.


A Complete Guide for Camera Calibration in 2024

This blog post is a comprehensive guide to camera calibration; after reading it, users should be able to carry out all the calibration they need.



Project Proposal

For the course project you will explore a topic in-depth of your own choosing. This can be an implementation (implement an existing algorithm); an application (apply a computer vision algorithm to a new problem); or research (trying to invent something new).

To get you started, we have prepared a list of suggested projects. We believe that any of these would be feasible to complete as a project.

We expect you to work in groups of 3-5 students for the course project. The course project should amount to roughly one homework assignment's worth of work per person. In previous years we have typically expected about two homework assignments' worth of work per person in the group project; however, we are explicitly lowering our expectations to account for the extra overhead of collaborating remotely.

We are not expecting state-of-the-art, publication ready results from your course project! The point of the project is to get practice applying concepts of the class to a problem of your choosing, without the “scaffolding” code provided in the homeworks.

What to submit

The project proposal is due on Monday, April 5, 2021 at 11:59:59pm on Gradescope. You only need to submit one project proposal per group and add all the other members on Gradescope.

After submitting your project proposal, please fill out this Google Form once as a group to help us keep track of who is working together.

Your project proposal should be a 1-page PDF that answers the following questions:

Project Title : What is the name of your project?

Group Members : What are the names and uniqnames of the students involved?

Problem Statement : What is the problem you are trying to solve?

Approach : How do you plan to go about solving this problem? You don’t need to have everything figured out exactly, but you should have a vague sense of how you will proceed.

Data : What dataset do you plan to use? A common failure mode for projects is to have a cool idea, but no idea where to get the necessary data. We recommend against collecting your own dataset for the project, as this will significantly increase the complexity and workload; instead you should try to get away with existing datasets.

Computational Resources : What computational resources will you use for this project? For some projects a laptop may be completely fine. But if you are planning to train any kind of neural network, you should have an estimate of how much time a model should take to train, and where you will get access to the computational resources you need (a simple way to estimate training time is sketched after this list of questions). Google Colab is a great free resource for small amounts of GPU compute; but be aware that it is not sufficient for training large-scale models.

Evaluation : How do you plan to evaluate whether your project is successful? What metric will you use? Is there some simple baseline that you plan to compare your model against?

If you are following one of our suggested projects then your “Problem Statement” can be very brief – one or two sentences is fine. For the suggested projects, we don’t expect you to work on all of the datasets we link; but please do tell us which you are planning to use.
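A simple, hedged way to get the training-time estimate asked for above is to time a handful of optimisation steps and extrapolate. The sketch below is illustrative only; the `model`, `loss_fn`, `optimizer` and `train_loader` objects are assumed to exist in your own project code and are not provided by the course.

```python
# Hypothetical sketch: estimate total training time by timing a few batches.
# Assumes `model`, `loss_fn`, `optimizer` and `train_loader` are defined
# elsewhere in your own project code.
import itertools
import time
import torch

def estimate_training_time(model, loss_fn, optimizer, train_loader,
                           epochs, warmup_batches=5, timed_batches=20,
                           device="cuda"):
    model.to(device).train()
    batches = itertools.islice(iter(train_loader), warmup_batches + timed_batches)
    start = None
    for i, (x, y) in enumerate(batches):
        if i == warmup_batches:               # ignore slow warm-up iterations
            if device == "cuda":
                torch.cuda.synchronize()
            start = time.time()
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()
    seconds_per_batch = (time.time() - start) / timed_batches
    return seconds_per_batch * len(train_loader) * epochs   # total seconds
```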

We have 11 Computer Vision (research proposal form) PhD Projects, Programmes & Scholarships in the UK

Human Emotion Analysis and Recognition for Improving Trusted Human-Robot Interaction (main project focus: AI and Robotics). PhD Research Project.

PhD Research Projects are advertised opportunities to examine a pre-defined topic or answer a stated research question. Some projects may also provide scope for you to propose your own ideas and approaches.

Self-Funded PhD Students Only

This project does not have funding attached. You will need to have your own means of paying fees and living costs and / or seek separate funding from student finance, charities or trusts.

Novel Applications of Remote Sensing for Health

Competition Funded PhD Project (Students Worldwide).

This project is in competition for funding with other projects. Usually the project which receives the best applicant will be successful. Unsuccessful projects may still go ahead as self-funded opportunities. Applications for the project are welcome from all suitably qualified candidates, but potential funding may be restricted to a limited set of nationalities. You should check the project and department details for more information.

PhD Studentship in Computer Science: AI for Robotics in Agriculture

Funded PhD Project (UK Students Only).

This research project has funding attached. It is only available to UK citizens or those who have been resident in the UK for a period of 3 years or more. Some projects, which are funded by charities or by the universities themselves may have more stringent restrictions.

Generative AI for synthetic biometric and age estimation testing

Competition Funded PhD Project (UK Students Only).

This research project is one of a number of projects at this institution. It is in competition for funding with one or more of these projects. Usually the project which receives the best applicant will be awarded the funding. The funding is only available to UK citizens or those who have been resident in the UK for a period of 3 years or more. Some projects, which are funded by charities or by the universities themselves may have more stringent restrictions.

Next Generation Machine Learning for Data Analysis

Other advertised project titles include:

  • Leveraging plant biomechanics under hostile environments
  • Machine unlearning for privacy-preserving applications
  • Biometric deepfake to physical attack instrument assessment
  • Helping hearing-impaired listeners understand speech better by audio-visual integration in wearable devices
  • XAI for biometrics – legal presentation of data
  • Lightweight model instance segmentation on edge devices

CS231n: Deep Learning for Computer Vision

Stanford - Spring 2024

Final Project

Important Dates

  • Collaboration Policy
  • Late Policy
  • Final Report
  • Final Presentation

Deliverable | Weight | Due Date | Late Days
Project Proposal | 1% | 04/22/2024 | Yes
Project Milestone | 2% | 05/14/2024 | Yes
Final Report | 29% | 06/05/2024 | No
Poster Session (in person) + Poster PDF & Code (submit online) | 3% | Poster Session: 06/12/2024; Submitting PDF and Code: 06/11/2024 | No

The Course Project is an opportunity for you to apply what you have learned in class to a problem of your interest. Potential projects usually fall into these two tracks:

  • Applications. If you're coming to the class with a specific background and interests (e.g. biology, engineering, physics), we'd love to see you apply vision models learned in this class to problems related to your particular domain of interest. Pick a real-world problem and apply computer vision models to solve it.
  • Models. You can build a new model (algorithm) or a new variant of existing models, and apply it to tackle vision tasks. This track might be more challenging, and sometimes leads to a piece of publishable work.

One restriction to note is that this is a Computer Vision class, so your project should involve pixels of visual data in some form somewhere. E.g. a pure NLP project is not a good choice, even if your approach involves ConvNets.

We have compiled a list of project ideas for inspiration that combine recent trends and interesting applications. Note that you do not need to pick one from here. Rather, these can serve as starting points for you to find the ideas that excite you.

  • Spring 2022
  • Spring 2017
  • Winter 2016
  • Winter 2015

To inspire ideas, you might also look at recent deep learning publications from top-tier conferences, as well as other resources below.

  • CVPR : IEEE Conference on Computer Vision and Pattern Recognition
  • ICCV : International Conference on Computer Vision
  • ECCV : European Conference on Computer Vision
  • NIPS : Neural Information Processing Systems
  • ICLR : International Conference on Learning Representations
  • ICML : International Conference on Machine Learning
  • Publications from the Stanford Vision Lab
  • Awesome Deep Vision
  • Past CS229 Projects : Example projects from Stanford's machine learning class
  • Kaggle challenges : An online machine learning competition website. For example, a Yelp classification challenge .

For applications, this type of project would involve careful data preparation, an appropriate loss function, attention to training and cross-validation, and good test-set evaluation and model comparison. Don't be afraid to think outside of the box. Some successful examples can be found below:

  • Teaching Deep Convolutional Neural Networks to Play Go
  • Playing Atari with Deep Reinforcement Learning

ConvNets also run in real time on mobile phones and Raspberry Pis - building an interesting mobile application could be a good project. If you want to go this route you might want to check out PyTorch Mobile , TensorFlow Lite or Caffe2 iOS/Android integration . A minimal export sketch follows.
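If you do try the mobile route, an export sketch might look like the following; the model choice, file name and exact calls are assumptions based on the PyTorch Mobile workflow, so check the current documentation before relying on it.

```python
# Hypothetical sketch: trace a small model and save it for the PyTorch Mobile
# lite interpreter. Model choice and file name are placeholders.
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
from torchvision.models import mobilenet_v3_small

model = mobilenet_v3_small(weights="DEFAULT").eval()
example = torch.rand(1, 3, 224, 224)                   # dummy input for tracing
traced = torch.jit.trace(model, example)               # record the forward pass
mobile_module = optimize_for_mobile(traced)            # mobile-friendly graph
mobile_module._save_for_lite_interpreter("model.ptl")  # load this in the app
```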

For models, ConvNets have been successfully used in a variety of computer vision tasks. This type of projects would involve understanding the state-of-the-art vision models, and building new models or improving existing models for a vision task. The list below presents some papers on recent advances of ConvNets in the computer vision community.

  • Image Classification : [Krizhevsky et al.] , [Russakovsky et al.] , [Szegedy et al.] , [Simonyan et al.] , [He et al.] , [Huang et al.] , [Hu et al.] [Zoph et al.]
  • Object detection : [Girshick et al.] , [Ren et al.] , [He et al.]
  • Image segmentation : [Long et al.] [Noh et al.] [Chen et al.]
  • Video classification : [Karpathy et al.] , [Simonyan and Zisserman] [Tran et al.] [Carreira et al.] [Wang et al.]
  • Scene classification : [Zhou et al.]
  • Face recognition : [Taigman et al.] [Schroff et al.] [Parkhi et al.]
  • Depth estimation : [Eigen et al.]
  • Image-to-sentence generation : [Karpathy and Fei-Fei] , [Donahue et al.] , [Vinyals et al.] [Xu et al.] [Johnson et al.]
  • Visualization and optimization : [Szegedy et al.] , [Nguyen et al.] , [Zeiler and Fergus] , [Goodfellow et al.] , [Schaul et al.]

You might also gain inspiration by taking a look at some popular computer vision datasets:

  • Meta Pointer: A large collection organized by CV Datasets.
  • Yet another Meta pointer
  • ImageNet : a large-scale image dataset for visual recognition organized by WordNet hierarchy
  • Visual Genome
  • SA-1B : dataset of a large number of images and segmentation masks to segment objects in those images
  • COCO : large-scale object detection, segmentation, and captioning dataset
  • Open Images : a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives
  • Cityscapes Dataset : This dataset focuses on semantic understanding of urban street scenes, with pixel-level annotations for various object classes such as cars, pedestrians, and roads
  • DeepFashion : a large-scale clothes dataset containing over 800,000 diverse fashion images annotated with bounding boxes, clothing categories, and attributes
  • Hugging face datasets : collection of generic datasets available on hugging face
  • Objaverse : a large-scale 3D asset database
  • SUN Database : a benchmark for scene recognition and object detection with annotated scene categories and segmented objects
  • Places Database : a scene-centric database with 205 scene categories and 2.5 million labelled images
  • NYU Depth Dataset v2 : an RGB-D dataset of segmented indoor scenes
  • Microsoft COCO : a new benchmark for image recognition, segmentation and captioning
  • Flickr100M : 100 million creative commons Flickr images
  • Labeled Faces in the Wild : a dataset of 13,000 labeled face photographs
  • Human Pose Dataset : a benchmark for articulated human pose estimation
  • YouTube Faces DB : a face video dataset for unconstrained face recognition in videos
  • UCF101 : an action recognition data set of realistic action videos with 101 action categories
  • HMDB-51 : a large human motion dataset of 51 action classes
  • ActivityNet : A large-scale video dataset for human activity understanding
  • Moments in Time : A dataset of one million 3-second videos

Collaboration

You can work in teams of up to 3 people. We do expect projects done by 3 people to have a more impressive write-up and results than projects done by fewer people. To get a sense of the scope and expectations for projects, have a look at project reports from previous years. While we encourage you to work in teams, you may also work alone.

You may consult any papers, books, online references, or publicly available implementations for ideas and code that you may want to incorporate into your strategy or algorithm, so long as you clearly cite your sources in your code and your writeup. However, under no circumstances may you look at another group’s code or incorporate their code into your project.

If you are combining your course project with the project from another class, you must receive permission from the instructors, and clearly explain in the Proposal, Milestone, and Final Report the exact portion of the project that is being counted for CS 231n. In this case you must prepare separate reports for each course, and submit your final report for the other course as well.

If you are combining your course project with another course project or research project, you DO NOT need to receive prior permission from CS231n. Instead, we ask that you clearly explain in the Proposal, Milestone, and Final Report the UNIQUE portion of the project that is being counted for this class. In this case you must prepare separate reports for each class and submit the other report to CS231n as well (if available). Remember, it is an honor code violation to use the same final report PDF for multiple classes. For the report for this class, focus on the specific portion of the project that is counted for this class.

See the late policy on the home page .

Project Proposal

The project proposal should be one paragraph (200-400 words). Your project proposal should describe:

  • What is the problem that you will be investigating? Why is it interesting?
  • What reading will you examine to provide context and background?
  • What data will you use? If you are collecting new data, how will you do it?
  • What method or algorithm are you proposing? If there are existing implementations, will you use them and how? How do you plan to improve or modify such implementations? You don't have to have an exact answer at this point, but you should have a general sense of how you will approach the problem you are working on.
  • How will you evaluate your results? Qualitatively, what kind of results do you expect (e.g. plots or figures)? Quantitatively, what kind of analysis will you use to evaluate and/or compare your results (e.g. what performance metrics or statistical tests)?
  • If you are combining this project with another course/research project, what is the unique portion of the project that is counted towards this class?

Submission: Please submit your proposal as a PDF on Gradescope. Only one person on your team should submit. Please have this person add the rest of your team as collaborators as a "Group Submission".

Project Milestone

  • Title, Author(s)
  • Introduction: this section introduces your problem, and the overall plan for approaching your problem
  • Problem statement: Describe your problem precisely, specifying the dataset to be used, the expected results and the evaluation
  • Technical Approach: Describe the methods you intend to apply to solve the given problem
  • Intermediate/Preliminary Results: State and evaluate your results up to the milestone

Submission : Please submit your milestone as a PDF on Gradescope. Only one person on your team should submit. Please have this person add the rest of your team as collaborators as a "Group Submission".

Your final write-up is required to be between 6 and 8 pages using the provided template , structured like a paper from a computer vision conference (CVPR, ECCV, ICCV, etc.). Please use this template so we can fairly judge all student projects without worrying about altered font sizes, margins, etc. After the class, we will post all the final reports online so that you can read about each other's work. If you do not want your write-up to be posted online, please let us know via the project registration form.

The following is a suggested structure for your report, as well as the rubric that we will follow when evaluating reports. You don't necessarily have to organize your report using these sections in this order, but that would likely be a good starting point for most projects. Refer to Ed for more fine-grained details and explanations of each separate section.

  • Abstract : Briefly describe your problem, approach, and key results. Should be no more than 300 words.
  • Introduction (10%) : Describe the problem you are working on, why it's important, and an overview of your results
  • Related Work (10%) : Discuss published work that relates to your project. How is your approach similar or different from others?
  • Data (10%) : Describe the data you are working with for your project. What type of data is it? Where did it come from? How much data are you working with? Did you have to do any preprocessing, filtering, or other special treatment to use this data in your project?
  • Methods (30%) : Discuss your approach for solving the problems that you set up in the introduction. Why is your approach the right thing to do? Did you consider alternative approaches? You should demonstrate that you have applied ideas and skills built up during the quarter to tackling your problem of choice. It may be helpful to include figures, diagrams, or tables to describe your method or compare it with other methods.
  • Experiments (30%) : Discuss the experiments that you performed to demonstrate that your approach solves the problem. The exact experiments will vary depending on the project, but you might compare with previously published methods, perform an ablation study to determine the impact of various components of your system, experiment with different hyperparameters or architectural choices, use visualization techniques to gain insight into how your model works, discuss common failure modes of your model, etc. You should include graphs, tables, or other figures to illustrate your experimental results.
  • Conclusion (5%) Summarize your key results - what have you learned? Suggest ideas for future extensions or new applications of your ideas.
  • Writing / Formatting (5%) Is your paper clearly written and nicely formatted?
  • Supplementary material may include: source code (if your project proposed an algorithm, or code that is relevant and important for your project); cool videos, interactive visualizations, demos, etc.
  • Examples of things not to include: the entire PyTorch/TensorFlow GitHub source code; any code that is larger than 10 MB; model checkpoints; a computer virus.

Submission : You will submit your final report as a PDF and your supplementary material as a separate PDF or ZIP file. We will provide detailed submission instructions as the deadline nears.

Additional Submission Requirements : We will also ask you to do the following when you submit your project report:

  • Your report PDF should list all authors who have contributed to your work enough to warrant a co-authorship position. This includes people not enrolled in CS 231N, such as faculty/advisors who sponsored your work with funding or data, and significant mentors (e.g., PhD students or postdocs who coded with you, collected data with you, or helped draft your model on a whiteboard). All authors should be listed directly underneath the title on your PDF. Include a footnote on the first page indicating which authors are not enrolled in CS 231N. All co-authors should have their institutional/organizational affiliation specified below the title.
  • If you have non-231N contributors, you will be asked to describe the following:
  • Specify whether the project has been submitted to a peer-reviewed conference or journal. Include the full name and acronym of the conference (if applicable). For example: Neural Information Processing Systems (NIPS). This only applies if you have already submitted your paper/manuscript and it is under review as of the report deadline.
  • Any code that was used as a base for projects must be referenced and cited in the body of the paper. This includes CS 231N assignment code, finetuning example code, open-source, or Github implementations. You can use a footnote or full reference/bibliography entry.
  • If you are using this project for multiple classes, submit the other class PDF as well. Remember, it is an honor code violation to use the same final report PDF for multiple classes.

In summary, include all contributing authors in your PDF; include detailed non-231N co-author information; tell us if you submitted to a conference, cite any code you used, and submit your dual-project report (e.g., CS 230, CS 231A, CS 234).

Poster Session

  • Date: Wednesday, June 12, 2024
  • Time: 12:00 pm to 4:30 pm
  • Location: AT&T Patio outside Gates Computer Science Building
  • Who: Student groups must present in-person at the poster session, unless approved by course staff beforehand to present online. Stanford students, faculty, and guests from industry are welcome!

Students: We will provide foam poster boards and easels. The foam boards are 30x40 inches, so please print a poster that fits within 30x40 inches; our recommended size is 24x36 inches. You may print your poster in landscape or portrait orientation. Some options for printing your poster:

  • Lathrop Library’s Tech Desk: Approximately 3-day turnaround.
  • FedEx: Approximately 2-day(?) turnaround.
  • Walgreens: Approximately same-day pickup.
  • Biotech Productions: Approximately same-day delivery.
  • Staples: Approximately same-day pickup.
  • Can I print my poster on 8.5x11 inch pieces of paper and tape them together? Yes, but we encourage you to print out one full poster. If you do print sections and tape them together, make sure that all the content is still legible and fits on a 30x40 foam board.

A Review on Computer Vision-Based Methods for Human Action Recognition

Mahmoud Al-Faris

1 School of Energy & Electronic Engineering, Faculty of Technology, University of Portsmouth, Portsmouth PO1 3DJ, UK; [email protected] (J.C.); [email protected] (A.I.A.)

John Chiverton

2 School of Computing, Engineering and Physical Sciences, University of the West of Scotland, Paisley PA1 2BE, UK; [email protected]

Ahmed Isam Ahmed

Human action recognition targets recognising different actions from a sequence of observations under different environmental conditions. A wide range of applications builds on vision based action recognition research, including video surveillance, tracking, health care, and human–computer interaction. However, accurate and effective vision based recognition systems remain a challenging area of research in the field of computer vision. This review introduces the most recent human action recognition systems and surveys the advances of state-of-the-art methods. To this end, the research is organised from hand-crafted representation based methods, including holistic and local representation methods with various sources of data, to deep learning technology, including discriminative and generative models and multi-modality based methods. Next, the most common datasets for human action recognition are presented. The review concludes with several analyses, comparisons and recommendations that help to identify directions for future research.

1. Introduction

Human Action Recognition (HAR) has a wide-range of potential applications. Its target is to recognise the actions of a person from either sensors or visual data. HAR approaches can be categorised into visual sensor-based, non-visual sensor-based and multi-modal categories [ 1 , 2 ]. The main difference between visual and other categories is the form of the sensed data. The visual data are captured in the form of 2D/3D images or video whilst others capture the data in the form of a 1D signal [ 2 ]. Over the last few years, wearable devices such as smart-phones, smart-watches, and fitness wristbands have been developed. These have small non-visual based sensors and are equipped with computing power and communication capability. They are also relatively low cost which has helped to open up new opportunities with ubiquitous applications. These include health monitoring, recuperative training and disease prevention, see, e.g., [ 3 ].

At the same time, visual sensor-based methods of human action recognition are one of the most prevalent and topical areas in the computer vision research community. Applications have included human–computer interaction, intelligent video surveillance, ambient assisted living, human–robot interaction, entertainment and content-based video search. In each one of those applications, the recognition system is trained to distinguish actions carried out in a scene. It may also perform some decisions or further processing based on that inference.

Wearable devices have several limitations: in most cases they need to be worn and to operate constantly, which can be a significant issue for real applications that require readiness and deployability, and which in turn imposes technical requirements related to, e.g., battery life and the size and performance of the sensor; see, e.g., [ 4 ]. In addition, they might not be suitable or efficient to employ in, e.g., crowd applications or other related scenarios. These limitations do not apply to computer-vision based HAR, which can be applied to most application scenarios without these technical requirements or limitations.

Since about 1980, researchers have presented different studies on action recognition based on images and/or video data [ 5 , 6 ]. In many instances, researchers have been following or drawing inspiration from elements of the operating principles of the human vision system. The human vision system receives visual information about an object, especially with respect to movement and shape and how they change with time. Observations are fed to a perception system for recognition processes. These biophysical processes of the human recognition system have been investigated by many researchers seeking to achieve similar performance in the form of computer vision systems. However, several challenges such as environmental complexity, scale variation, non-rigid shapes, background clutter, viewpoint variation and occlusion leave computer vision systems unable to fully realise many elementary aspects of the human vision system.

Action recognition systems can be divided into four categories according to the complexity of the human action: primitive [ 7 ], single person [ 8 ], interaction [ 9 ], and group [ 10 ] action recognition. A primitive action is a basic movement of a human body part, for example “lifting a hand” or “bending”. Single person actions are sets of primitive actions of a single person such as “running” and “jumping”. Interactions are actions involving humans and objects, such as “carrying a box” and “playing a guitar”. Group actions refer to actions occurring in a group of people such as a “procession”, “meeting”, and “group walking”.

In general, based on a comprehensive investigation of the literature, computer vision based HAR methods can be classified into two categories: (a) traditional hand-crafted feature based methods followed by a trainable classifier for action recognition, and (b) deep learning based approaches that learn features automatically from raw data and are commonly followed by a trainable classifier for action recognition [ 11 , 12 ].

Many important survey and review papers have been published on human action recognition and related techniques. However, published reviews inevitably go out of date. For this reason, an updated review of human action recognition is much needed, even though producing one is a demanding and challenging task. In this review, discussions, analysis and comparisons of state-of-the-art methods are provided for vision based human action recognition. Hand-crafted based methods and deep learning based methods are introduced along with popular benchmark datasets and significant applications. This paper also considers different designs of recognition models, including hybrid, modality-based and view-invariant based models. A brief description of different architectures is given for vision-based action recognition models. Recent research works are presented and explained to help researchers identify possible directions for future work.

The structure of this review starts with low-level methods for action recognition. This is followed by a description of some of the important details of feature descriptor based techniques. A number of improvements that can be achieved in these respects are identified; these also transfer to the performance of action recognition systems in general. Thereafter, the review turns to higher-level feature representation based methods and explains the widespread feature descriptor based techniques from different perspectives. The paper then covers the mainstream research that has resulted in the development of the widely known deep learning based models and their relation to action recognition systems.

2. Popular Challenges in Action Recognition Models

Initially, it might be useful to highlight some of the most popular challenges in action recognition based methods.

2.1. Selection of Training and Testing Data

The type of data can strongly affect the efficiency of a recognition model. Three types of data are usually used for action recognition: RGB, depth, or skeleton information, each of which has advantages and disadvantages. For instance, significant texture information can be provided by an RGB input. This might be considered closely related to the visual information that humans typically process. On the other hand, a lot of variation can occur in the appearance information depending on, e.g., lighting conditions. In contrast to RGB, depth map information is invariant to illumination changes. This makes it easier to detect foreground objects against the background scene. In addition, a depth map provides 3D characteristics of the captured scene. However, depth map information also commonly has some defects; for instance, noisy measurements are sometimes a problem and need to be filtered and refined. Another input type is skeleton information. Skeletons can be obtained using different approaches, from RGB or, more commonly, from depth information; see, e.g., [ 13 , 14 , 15 , 16 ]. However, this type of information is often captured or computed imperfectly, especially in an occluded or noisy environment. In this work, the complementary information available in the RGB and depth map data is exploited directly for action recognition.
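Because a depth map encodes each pixel's distance to the sensor and is invariant to illumination, foreground extraction can often be approximated with simple thresholding. The sketch below is purely illustrative; the depth range, minimum area and synthetic frame are assumptions, not values from the cited works.

```python
# Illustrative sketch: extract a foreground silhouette from a depth map by
# thresholding distance to the sensor. Depth range and minimum area are
# assumed values, not taken from the cited papers.
import numpy as np

def depth_foreground_mask(depth_mm, near_mm=500, far_mm=2500, min_area=2000):
    """Return a boolean mask of pixels whose depth lies in [near_mm, far_mm]."""
    mask = (depth_mm >= near_mm) & (depth_mm <= far_mm)
    # Reject frames where too few pixels fall in range (no subject present).
    return mask if mask.sum() >= min_area else np.zeros_like(mask, dtype=bool)

# Example on a synthetic 240x320 depth frame with a subject 1.5 m away.
depth = np.full((240, 320), 4000, dtype=np.uint16)   # background at 4 m
depth[60:200, 120:220] = 1500                        # foreground region
silhouette = depth_foreground_mask(depth)
print(silhouette.sum(), "foreground pixels")
```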

2.2. Variation in Viewpoint

Most methods assume that actions are performed from a fixed viewpoint. However, in real cases, the location and posture of the person vary considerably depending on the viewpoint from which the action is captured. In addition, variations in motion patterns also appear in each different view, which makes recognition of an action more difficult. Training a classifier using multiple camera information is one way, used by [ 17 ], to tackle this issue. View-invariant representations have also been obtained by modelling 3D body posture for action recognition, as in [ 18 ]. Researchers have also tried to utilise view-invariant feature spaces using the Fourier transform and cylindrical coordinate systems [ 19 ]. However, researchers [ 20 , 21 ] have reported that most multi-view datasets involve uniform or fixed backgrounds. Therefore, in order to evaluate the performance of various methods, it would be necessary to validate them using actions recorded in real-world settings.

2.3. Occlusion

An action to be recognised should be clearly visible in the video sequences. This is often not the case in practice, especially in typical surveillance video. Occlusion can be caused by the person themselves or by other objects in the scene. This can make body parts performing an action invisible, which poses a significant problem for the research community. Volumetric analysis and representation [ 22 ] of an action can tackle self-occlusion issues and help to match and classify the action. Considering body parts separately is a feasible way to handle occlusions; this can be performed using pose-based constraints [ 23 ] and probabilistic methods [ 24 , 25 ]. A multiple-camera setup is another approach used by researchers to handle occlusion problems [ 26 ].

2.4. Features Modelling for Action Recognition

In general, two popular approaches are used to design features for action recognition. One is to design features for the application at hand, which leads to the use of hand-crafted features. The other is to capture features automatically from the input data; this can be achieved using deep learning techniques, which have often shown competitive performance in comparison to hand-crafted feature based methods [ 27 ].

2.5. Cluttered Background

A cluttered background is a distraction that introduces ambiguous information into the video of an action [ 28 ]. Different vision-based methods are affected by this issue; for instance, an optical flow algorithm used to compute motion information will also pick up unwanted background motion (due to the clutter) along with the required motion. In addition, this issue strongly affects colour-based and region-based segmentation approaches, as these methods require a uniform background to achieve high-quality segmentation. To handle or avoid these issues, many research works assume a static background or pre-process the videos before further analysis [ 20 , 29 ].

2.6. Feature Design Techniques

Different levels of features can be used for action recognition. Some researchers, such as [ 30 , 31 , 32 ], proposed to employ the input as a whole, referred to here as holistic methods. Other researchers, such as [ 33 , 34 , 35 , 36 ], considered salient points of interest from the input data, in what are known as local feature based methods.

Motion is an important source of information that needs to be considered for action recognition. Different techniques have been proposed to model motion information in the feature computation step. These have included optical flow for low-level feature displacements and trajectories across multiple frames, which can then be fed to classifiers or to further feature extraction processes (a minimal optical flow sketch is given below). Some other research has included motion information in the classification step with models such as Hidden Markov Models [ 37 ], Conditional Random Fields [ 38 ], Recurrent Neural Networks [ 39 ], Long Short-Term Memory, and 3D Convolutional Neural Networks [ 40 ]. All of these are able to model sequential information by design.
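As an illustration of the low-level motion features mentioned above, the following sketch computes dense optical flow between consecutive frames with OpenCV's Farnebäck method and summarises each frame as a magnitude-weighted orientation histogram. The video path and the choice of summary are assumptions made for illustration.

```python
# Sketch: dense optical flow (Farneback) as a simple per-frame motion feature.
# The video path and the orientation-histogram summary are illustrative choices.
import cv2
import numpy as np

def flow_features(video_path, bins=8):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    features = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        # Orientation histogram weighted by magnitude: one descriptor per frame.
        hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
        features.append(hist / (hist.sum() + 1e-8))
        prev_gray = gray
    cap.release()
    return np.array(features)   # shape: (num_frames - 1, bins)
```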

In such systems, an efficient feature set can reduce the burden of improving recognition. An overview is now provided of selected state-of-the-art methods with respect to the challenges and approaches mentioned above. In the following, action recognition systems are partitioned into those based on hand-crafted features and those based on different deep learning techniques.

3. Applications of Action Recognition Models

During the last decade, many researchers have paid attention to the action recognition field, with significant growth in the number of publications. This section highlights state-of-the-art applications that consider human action recognition methodologies to assist humans. Different applications of current action recognition approaches are discussed, including smart homes and assisted living, healthcare monitoring, security and surveillance, and human–robot interaction [ 41 , 42 ].

3.1. Surveillance and Assisted Living

Different modern technologies have provided a wide range of improvements in the performance of independent assisted living systems. This is achieved using action recognition techniques to monitor and assist occupants. For example, a smart home system proposed by [ 43 ] used machine learning and feature extraction techniques to analyse the activity patterns of an occupant and to introduce automation policies based on the identified patterns to support the occupants. Another smart system was introduced by [ 44 ] for human behaviour monitoring and support (HBMS). This was achieved by observing an occupant’s daily living activities using the Human Cognitive Modeling Language (HCM-L); the HBMS control engine is then applied to assist individuals in a smart way. On the other hand, vision-based technologies are used in different security applications, such as the surveillance system introduced by [ 45 ]. This system has the ability to recognise human behaviours such as fighting and vandalism events that may occur in a public district using one or several camera views [ 46 ]. Multiple camera views were used by [ 47 ] to detect and predict suspicious and aggressive behaviours in real time and in a crowded environment.

3.2. Healthcare Monitoring

The development of medical research and technology has remarkably improved patients’ quality of life. However, the growing demands on medical personnel have led researchers to try different technologies to improve healthcare monitoring methods, which may be essential in emergency situations. Basically, one or more factors can be involved in the design of healthcare monitoring systems. These can include fall detection, human tracking, security alarm and cognitive assistance components. In [ 48 ], a vision-based system was proposed for healthcare purposes. It used Convolutional Neural Networks to detect a person falling. Optical flow sequences were used as input to the networks, followed by three training phases. A fall detection system for home surveillance was proposed by [ 49 ]. A surveillance video was used to detect the fall. Background subtraction was used to detect the moving object, which was segmented within a bounding box. A few rules were applied to the transitions of a finite state machine (FSM) to detect a fall based on measures of the extracted bounding box. An intelligent monitoring system was proposed by [ 50 ] to monitor “elopement” events in dementia units and to automatically alert the caregivers. Audio and video of daily activities were collected and events were detected using an HMM-based algorithm.

3.3. Entertainment and Games

In recent years, the games industry has developed a new generation of games based on the full body of the gamer, such as dance and sports games. RGB-D sensors (see, e.g., [ 51 ]) are used in this kind of game to improve the perception of human actions. Rich information about the entire scene is provided by these sensors to facilitate action recognition tasks [ 52 , 53 ].

3.4. Human–Robot Interaction

Human–robot interaction is widely adopted in home and industrial environments. An interaction is carried out to perform a specific task such as “passing a cup” or “locating an object”. Vision-based methods are one of the effective means of communication between humans and robots [ 54 , 55 ].

3.5. Video Retrieval

Most search engines use associated information to manage video data. Text data such as tags, descriptions, titles and keywords are one kind of information that can be used for such purposes [ 56 ]. However, this information can be incorrect, which results in unsuccessful video retrieval. An alternative approach was proposed by [ 57 ] for video retrieval by analysing human actions in videos. The designed framework computed the similarity between action observations, which was then used to retrieve videos of children with autism in a classroom setting.

3.6. Autonomous Driving Vehicles

An automated driving system aims to ensure safety, security, and comfort. Among the most important components of such a system are action prediction and recognition algorithms [ 55 , 58 ]. These methods can analyse human action and motion information in a short period of time, which helps to avoid critical issues such as collisions.

4. Hand-Crafted Feature Representation for Action Recognition

We start by describing some classical human action recognition methods based on hand-crafted features. Classical image classification methods usually consist of three consecutive steps: feature extraction, local descriptor computation and classification. Similar steps have been employed more generally for image and video classification as well as human action recognition.

4.1. Holistic Feature Representation Based Methods

Holistic feature representation based methods treat Regions Of Interest (ROIs) as a whole, in which all pixels are exploited to compute the descriptors. In general, holistic methods consist of two steps for action recognition: person detection and descriptor computation. Holistic methods consider the global structure of the human body to represent an action, so it is not necessary to localise body parts. The key idea is that discriminative global information can be represented from a region of interest, which can then be used for action characterisation. Holistic methods can be efficient and effective, as well as simple to compute, due to the use of global information only. This makes this kind of method important for videos which might contain background clutter, camera motion, and occlusions.

In general, holistic methods can be classified into two categories based on the information that is used for the recognition problem:

  • Recognition based on shape information such as shape masks and the silhouette of the person;
  • Recognition based on shape and global motion information.

4.1.1. Shape Information Based Methods

Holistic approaches are often based on information from silhouettes, edges, optical flow, etc. Such methods are sensitive to noise, background clutter, and variations in occlusion and viewpoint; e.g., see [ 59 ]. Silhouette information provides shape information about the foreground in the image. Different techniques can be employed to compute silhouette information from the background scene. One simple technique is background subtraction, which can be used with high confidence when the camera is static. On the other hand, some research such as [ 60 ] has utilised a human tracker and camera motion estimation to obtain silhouette information and to cope with the drawbacks of camera motion. Shape information can be utilised in the time domain to help consider the evolution of the silhouette over time. Differences in the binary silhouettes were considered by [ 61 ]. These were accumulated in the spatial and temporal domains to construct a Motion Energy Image (MEI) and a Motion History Image (MHI), respectively. These depict an action with a single template. The MEI is a binary template that indicates regions of movement. The MHI indicates regions of motion where more recent motion regions have higher weight. Three-dimensional (3D) shape information was used by [ 31 ] for action recognition by stacking 2D silhouette information into a space-time volume. For representations invariant to geometrical transformations such as scaling and translation, an extended Radon transform was proposed by [ 62 ]. This was applied to binary silhouette information for action recognition. Contours of MEI templates were exploited by [ 63 ]. A descriptor was obtained which was found to be invariant to scale changes and translations.
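To make the MEI/MHI templates concrete, the sketch below accumulates a sequence of binary silhouettes into both templates; the temporal window `tau` and the input layout are assumptions for illustration rather than values from the cited works.

```python
# Sketch: Motion Energy Image (MEI) and Motion History Image (MHI) from binary
# silhouettes. The temporal window `tau` is an assumed, illustrative value.
import numpy as np

def mei_mhi(silhouettes, tau=30):
    """silhouettes: array of shape (T, H, W) with values in {0, 1}."""
    T, H, W = silhouettes.shape
    mhi = np.zeros((H, W), dtype=np.float32)
    for t in range(1, T):
        # Motion mask: pixels whose silhouette value changed between frames.
        motion = silhouettes[t] != silhouettes[t - 1]
        mhi = np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))
    mei = (mhi > 0).astype(np.uint8)   # MEI: any motion within the last tau frames
    return mei, mhi / tau              # MHI normalised to [0, 1]
```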

A lot of research has utilised shape and silhouette information to represent the human body for human action recognition. In [ 30 , 64 ], shape masks of different images were used to introduce MEI and MHI based temporal templates for action recognition.

It has been observed that some actions can be represented by key poses. This was proposed by [ 65 ] where a method was described to detect forehand and backhand tennis strokes by matching edge information to labelled key postures together with annotated joints. These were then tracked between the key consecutive frames based on the silhouette information.

A number of significant methods are presented by [ 66 ] to describe space-time shapes based on silhouette information for action recognition. Background subtraction was used to extract the silhouette of a person. The Poisson equation was then used to obtain saliency, dynamics and shape structure features. A high dimensional feature vector was introduced to describe sequences of 10 frames in length. This was matched to shapes of test sequences at the end.

Space-time shapes were also used by [ 67 ] where contour information was obtained using background subtraction. Then, a set of characteristic points (saddles, valleys, ridges, peaks and pits) were used to represent actions on the surface of the shape. The space-time shapes were matched to recognise actions using point-to-point correspondences.

In [ 68 ], a set of silhouette exemplars were used for matching against frames in action sequences. A vector was formed of the minimum matching distance between each exemplar and any frame of the sequence. A Bayes classifier was employed to learn action classes with two different scenarios: first, silhouette information; second, edge information.

A foreground shape based motion information model was presented by [ 69 ] to represent motion from a group of consecutive frames of an action sequence. A motion context descriptor was introduced over a region with the use of a polar search grid, where each cell was represented with a SIFT descriptor [ 70 ]. The final descriptor was created by summing over all the groups of a sequence. After that, three different approaches were used to recognise actions: Probabilistic Latent Semantic Analysis (pLSA) [ 71 ], w3-pLSA (a pLSA extension) and a Support Vector Machine (SVM).

Colour and location information based segmentation has been used by [ 72 ] to automatically over-segment event video. Then, optical flow and volumetric features were used to match over-segmented video against a set of training events such as picking up a dropped object or waving in a crowd.

It is obvious from the aforementioned approaches that silhouette information can provide strong cues for the human action recognition problem. However, significant challenges arise in the presence of clutter, occlusion and camera motion. In addition, silhouette information can describe some types of actions by showing characteristics of the outer contours of a person. However, other actions that include, e.g., self-occlusion, may not easily be recognised from silhouette information alone. Therefore, the motion and shape information is further enhanced with the use of local feature representations discussed shortly.

RGB-D Information Based Shape Models

A new era can be considered to have begun when low cost RGB-D sensors were produced. These simultaneously provide appearance and spatial 3D information. Such devices (e.g., Microsoft Kinect, Asus Xtion) have the ability to work in real time. By adding the depth-map feature, the device is able to provide information about the distance of each pixel to the sensor in a range from 0.5 m to 7 m. These have played a key role in the enhancement of object detection and segmentation algorithms. RGB-D sequence based methods improve recognition performance with low time complexity. However, depth and skeleton representation based methods of action recognition remain applicable only over a limited range and under specific environmental conditions.

As a result, many RGB holistic approaches have been extended to the RGB-D scenario to utilise depth-map characteristics. A 3D-MHI was proposed by [ 73 ] for action recognition. This was done by extending the traditional MHI to use depth information. In [ 74 ], the depth silhouette was sampled into a representative set of 3D points and used to describe the shape of salient regions. The key idea was to project the depth map onto three orthogonal Cartesian planes and use the points along each plane to recognise the actions. A useful technique was used by [ 75 ] where the depth maps were projected onto three orthogonal Cartesian planes to produce Depth Motion Maps (DMMs) by summing the stacked motion energy of each of the projected maps. DMMs can express the variation of a subject’s motions during the performance of an activity. In [ 76 ], DMMs were used for activity recognition together with an l 2 -regularised collaborative representation classifier with a distance-weighted Tikhonov matrix. DMMs were also used by [ 77 ] with Local Binary Patterns (LBPs) to utilise motion cues. Two fusion levels were considered: feature-level fusion and decision-level fusion. The DMM based results showed reasonable human activity recognition performance.
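The following sketch illustrates the DMM idea described above: each depth frame is projected onto front, side and top views, and thresholded frame-to-frame differences are accumulated. The depth binning and motion thresholds are assumptions for illustration, not values from the cited papers.

```python
# Sketch: Depth Motion Maps (DMMs) by projecting each depth frame onto three
# orthogonal views and accumulating motion energy. Binning and thresholds are
# illustrative choices only.
import numpy as np

def project_views(depth, z_bins=64, z_max=4000.0):
    """Return front (H,W), side (H,z_bins) and top (W,z_bins) projections."""
    H, W = depth.shape
    z = np.clip((depth / z_max) * (z_bins - 1), 0, z_bins - 1).astype(int)
    front = depth.astype(np.float32)
    side = np.zeros((H, z_bins), dtype=np.float32)
    top = np.zeros((W, z_bins), dtype=np.float32)
    ys, xs = np.nonzero(depth > 0)
    side[ys, z[ys, xs]] = 1.0      # occupancy seen from the side (y-z plane)
    top[xs, z[ys, xs]] = 1.0       # occupancy seen from the top (x-z plane)
    return front, side, top

def depth_motion_maps(frames, depth_thresh=30.0):
    """frames: (T, H, W) depth video. Returns dict of front/side/top DMMs."""
    thresholds = (depth_thresh, 0.5, 0.5)   # depth units for front, binary views
    dmms = None
    prev = project_views(frames[0])
    for t in range(1, len(frames)):
        cur = project_views(frames[t])
        diffs = [(np.abs(c - p) > th).astype(np.float32)
                 for c, p, th in zip(cur, prev, thresholds)]
        dmms = diffs if dmms is None else [d + e for d, e in zip(dmms, diffs)]
        prev = cur
    return dict(zip(("front", "side", "top"), dmms))
```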

Different levels of the same data sequence have been used with DMM computations to create hierarchical DMMs in [ 78 ]. An LBP based descriptor was used to characterise local rotation invariant texture information. Then, a Fisher kernel was employed to create patch descriptors. These were fed into a kernel-based extreme learning machine classifier. A similar approach was followed by [ 79 ]. A Histogram of Oriented Gradients (HOG) descriptor was used along with kernel entropy component analysis for dimensionality reduction. Finally, a linear support vector machine was used in the classification. For both hierarchical DMM based approaches, the results demonstrated a significant performance improvement.

A 4D space-time grid was introduced by [ 80 ], extending the work by [ 31 ]. This was done by dividing the space and time dimensions into multiple cells. These were used to obtain Space-Time Occupancy Pattern (STOP) feature vectors for action recognition. In [ 81 ], a 4D Histogram of Surface Normal Orientations (HON4D) was proposed to describe video for action recognition after computing the normal vectors for each frame. The features of the surface normals were captured in the 4D space of spatial, depth and time dimensions.

The rich characteristics of the depth information can help make people detection and segmentation tasks easier and less challenging which in turn improves holistic approaches, making them more robust with RGB-D images. However, some drawbacks of holistic methods include their sensitivity to occlusions and noise in the depth maps. Therefore, a good representation can be presented by combining motion and shape information which in turn may improve the recognition rate of the system.

4.1.2. Hybrid Methods Based on Shape and Global Motion Information

The work by [ 82 ] is a good example of shape and motion feature based tracking and action recognition. The authors assumed that the movements of body parts were restricted to regions around the torso. Subjects were bounded with rectangular boxes whose centroids were selected as the feature for tracking. The velocity of the centroids was considered, utilising body motion features to cope with occlusions between multiple subjects. Periodic actions such as walking were detected with a nearest centroid algorithm calculated across spatio-temporal templates and reference templates. This approach, however, only utilised motion information, and could be improved by considering other features such as texture, colour, and shape.

Another method which used motion information was proposed by [ 83 ], based on optical flow, to track soccer players and to recognise simple actions in video. A person was tracked and stabilised. Then, a descriptor was computed over the motion information and spatio-temporal cross-correlation was used for matching with a database. This approach was tested on sequences from ballet, tennis and football datasets, and it achieved impressive results on low resolution video. However, this type of system may depend on several conditions such as the position of the region of interest in the frame, the spatial resolution and the relative motion with respect to the camera. In addition, the model is based on a global representation which can be affected by occlusions between multiple objects and a noisy background environment.

Flow-based motion information has also been used by [ 84 ] for action recognition. A flow descriptor was employed to select low-level features in the form of a space-time overlapped grid. Then, mid-level features were selected using the AdaBoost algorithm.

A space-time template based method was introduced by [ 85 ] for action recognition. It was based on the maximum average correlation height filter. A spatio-temporal regularity flow was used to capture spatio-temporal information and to train a Maximum Average Correlation Height (MACH) filter. Experiments on a number of datasets including the KTH dataset demonstrated action recognition and facial expression recognition.

Volumetric feature based action recognition was proposed by [ 86 ], where Viola–Jones features were computed over a video’s optical flow. A discriminative set of features was obtained by direct forward feature selection, which employed a sliding window approach to recognise the actions. The model was trained and tested on real videos with actions that included sitting down, standing up, closing a laptop and grabbing a cup.

Shape information was used by [ 87 ] to track an ice hockey player and to recognise actions. Histograms of Oriented Gradients (HOGs) were used to describe each single frame. Principal Component Analysis (PCA) was then used for dimensionality reduction. Finally, a Hidden Markov Model (HMM) was employed to recognise actions.

A new technique utilising a hybrid representation, combining optical flow and appearance information, was proposed by [ 88 ]. They exploited optical flow information and Gabor filter features for action recognition. Both kinds of features were extracted from each single frame and then concatenated. They used different lengths of frame snippets to investigate how many frames were required for recognising an action. The Weizmann and KTH datasets were used for evaluation.

Motion and shape information based action recognition was also used by [ 89 ] where a multiple instance learning based approach was employed to learn different features from a bag of instances. This included foreground information, Motion History Image (MHI) and HOGs. Simple actions in crowded events in addition to shopping mall data were used to evaluate the proposed method. The experiments showed that the use of multiple types of features resulted in better performance in comparison with a single type of feature.

These holistic methods have provided reasonable levels of performance for action recognition. However, first, they are not view invariant: different models would be needed for particular views, large amounts of multiple-view data would be needed for training, and some body parts might be unseen across frames due to occlusions. Second, they are not invariant to time: the same action performed over different time periods would appear quite different. In addition, it is worth mentioning that the accuracy of holistic approaches is highly dependent on the detection and segmentation pre-processing. This work therefore also includes local representation based methods to benefit from localised information. The next section presents a review of the local representation based methods for human action recognition.

4.2. Local Feature Representations Based Methods

Local feature based methods tend to capture characteristic features locally within a frame without the need for human detection or segmentation, which can be quite a challenge for RGB based video. Local feature based methods have been successfully employed in many recognition applications such as action recognition [ 90 ], object recognition [ 91 ] and scene recognition [ 92 ]. They can capture important characteristics of shape and motion information for a local region in a video. The main advantage of these methods is the autonomous representation of events in terms of changes across space-time and scale. Furthermore, with appropriate machine learning and sufficient data, it is often possible to capture the important characteristics of the local features of interest, and thereby to separate these features from those computed from a cluttered background or from multiple movements or objects in a scene. In the following section, space-time feature detectors, feature trajectories and local descriptor based methods are discussed. In addition, their incorporation into action localisation and recognition in videos is considered.

In general, local feature based methods consist of two steps: detecting points of interest (POIs) and computing descriptors. In image processing, interest points refer to points with strong local variation of image intensities. Interest point detectors usually capture local characteristics; in videos, this can be in terms of space, time and scale, by maximising specific saliency functions.

Notable research on feature detectors includes [ 33 ], who proposed to extend the Harris corner detector to a Harris3D detector covering both space and time. A different feature detector, which employed spatial Gaussian kernels and temporal Gabor filters, was proposed by [ 93 ]. This considered salient motion features to represent different regions in videos. Another detector, proposed by [ 94 ], involved computing entropy characteristics in a cylindrical neighbourhood around specific space-time positions. An extension of the Hessian saliency detector, Hessian3D, was proposed by [ 95 ] to consider spatio-temporal features; this used the determinant of the 3D Hessian matrix. Salient features were detected by [ 96 ] using a global information based method.
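As an illustration of this family of detectors, the sketch below follows the spatial Gaussian plus temporal Gabor filtering idea attributed to [ 93 ], computing a quadrature-pair response volume and keeping its local maxima. The filter parameters, neighbourhood size and selection strategy are assumptions made for the example, not the exact settings of the original work.

```python
# A minimal sketch of a spatial-Gaussian / temporal-Gabor interest point response.
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d, maximum_filter

def periodic_response(video, sigma=2.0, tau=1.5):
    """video: (T, H, W) grayscale volume. Returns a spatio-temporal saliency volume."""
    smoothed = gaussian_filter(video.astype(np.float32), sigma=(0, sigma, sigma))
    half = int(np.ceil(3 * tau))
    t = np.arange(-half, half + 1, dtype=np.float32)
    omega = 4.0 / tau                      # temporal frequency tied to the time scale
    h_even = -np.cos(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    h_odd = -np.sin(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    even = convolve1d(smoothed, h_even, axis=0)
    odd = convolve1d(smoothed, h_odd, axis=0)
    return even ** 2 + odd ** 2            # quadrature energy of the temporal Gabor pair

def detect_interest_points(video, n_points=200):
    """Keep the strongest local maxima of the response as (t, y, x) interest points."""
    r = periodic_response(video)
    peaks = (r == maximum_filter(r, size=5))
    coords = np.argwhere(peaks)
    order = np.argsort(r[peaks])[::-1][:n_points]
    return coords[order]
```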

A wider experimental evaluation was introduced by [ 97 ]. They exploited different interest point detectors applied to publicly available action recognition datasets, including KTH [ 98 ], UCF sports [ 85 ] and Hollywood2 [ 99 ]. The results showed the robustness of the dense sampling method, where interest points are sampled at regular intervals in the space and time domains. It was also found that the Harris3D detector achieved some of the best performance in several of the experiments.

Once local interest points are detected, local representation based methods can be employed to compute one of several descriptors over a given region. Different descriptors have been proposed in much research, such as in [ 34 ], where Histogram of Oriented Gradients (HOG) [ 100 ] and Histogram of Oriented Optical Flow (HOOF) [ 101 ] descriptors were used. The authors introduced a different way to characterise local motion and appearance information, combining HOG and HOOF based approaches on the space-time neighbourhood of the detected points of interest. For each cell of a grid of cells, four bins of HOG and five bins of HOOF were considered. Normalisation and concatenation were used to form a combined HOG/HOOF descriptor. Moreover, different local descriptors based on gradient, brightness and optical flow information were included by [ 93 ], with PCA used for dimensionality reduction. The authors explored different scenarios, which included simple concatenation, a grid of local histograms and a single global histogram. The experimental results showed that concatenated gradient information achieved the best performance.
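A simplified sketch of such a cell-based HOG/HOOF descriptor is given below, using the four gradient bins and five flow bins mentioned above over a small spatial grid. The grid size, patch size and the use of OpenCV's Farneback optical flow are illustrative assumptions rather than the configuration used in [ 34 ].

```python
# Hypothetical HOG/HOOF cell descriptor for a (T, H, W) uint8 grayscale cuboid
# extracted around a detected point of interest (e.g., a 32x32 patch over ~10 frames).
import cv2
import numpy as np

def orientation_histogram(dx, dy, n_bins):
    """Magnitude-weighted orientation histogram of a 2D vector field."""
    mag = np.sqrt(dx ** 2 + dy ** 2)
    ang = np.arctan2(dy, dx) % (2 * np.pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)

def hog_hoof_descriptor(patch, n_cells=3):
    T, H, W = patch.shape
    mid = patch[T // 2]
    gx = cv2.Sobel(mid, cv2.CV_32F, 1, 0)          # spatial gradients for the HOG part
    gy = cv2.Sobel(mid, cv2.CV_32F, 0, 1)
    flow = cv2.calcOpticalFlowFarneback(patch[0], patch[-1], None,
                                        0.5, 1, 9, 3, 5, 1.1, 0)   # flow for the HOOF part
    desc = []
    for cy in range(n_cells):
        for cx in range(n_cells):
            ys = slice(cy * H // n_cells, (cy + 1) * H // n_cells)
            xs = slice(cx * W // n_cells, (cx + 1) * W // n_cells)
            desc.append(orientation_histogram(gx[ys, xs], gy[ys, xs], 4))            # 4 HOG bins
            desc.append(orientation_histogram(flow[ys, xs, 0], flow[ys, xs, 1], 5))  # 5 HOOF bins
    desc = np.concatenate(desc)
    return desc / (np.linalg.norm(desc) + 1e-8)    # L2 normalisation of the concatenation
```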

A 3D version of the Histogram of Oriented Gradients (HOG3D) was introduced by [ 102 ] as an extension of the HOG descriptor of [ 100 ]. A space-time grid was constructed around each detected Point Of Interest (POI). A histogram descriptor was then computed and normalised over each of the cells, and the final descriptor was formed by concatenating the histograms.

In [ 103 ], the authors proposed to extend the Scale-Invariant Feature Transform (SIFT) descriptor originally proposed by [ 70 ]. Spatio-temporal gradients were computed over a set of randomly sampled positions. A Gaussian weight was used to weight each pixel in the neighbourhood with votes into an N × N × N grid of histograms of oriented gradients. To achieve orientation quantization, the gradients were represented in spherical coordinates that were divided into 8 × 4 histograms.

An extension of the Speeded-Up Robust Features (SURF) descriptor, originally proposed by [ 104 ], was investigated by [ 95 ]. Application to videos was considered by utilising spatio-temporal interest points which were spatially and temporally scale invariant. The patches were divided into a grid of local N × N × N histograms, and each cell was represented by a vector of sampled Haar wavelet responses. The experimental results showed the good performance of the proposed detector in comparison with other detectors.

RGB-D Information Based Local Features

There has also been research on depth map based local feature methods. These follow many of the same or similar steps as for RGB video: at a gross level, finding salient points of interest and then computing a descriptor. In [ 105 ], the authors proposed a Histogram of Oriented Principal Components (HOPC) descriptor. This captured the characteristics around each point of interest within a 3D point cloud. The descriptor was formed by concatenating projected eigenvectors resulting from Principal Component Analysis of the space-time volume around the points of interest. The HOPC descriptor was found to be view invariant. Video was also treated in [ 106 ] as a space-time volume of depth values. A Comparative Coding Descriptor (CCD) was then used to encode space-time relations of points of interest, and a set of cuboids was used to construct a series of codes that characterised the descriptor. In [ 107 ], a descriptor called the Local Occupancy Pattern (LOP) was presented. This was used to describe the appearance information of sub-regions of depth images, which was then utilised to characterise object-interaction actions. In another work, [ 108 ] introduced a Random Occupancy Pattern (ROP) to deal with depth sequences as a space-time volume. The descriptor was defined by a sum of the pixel values in a sub-volume. Since the sub-volumes had different sizes and locations, a random sampling based method was used to select them effectively. Overall, local feature based methods are commonly used with different inputs. These can include skeletons, where joints have been a particular focus for detection; RGB, where a detector is used to find POIs in each frame; or, similarly, depth data.

4.3. Trajectories Based Methods

Many researchers have argued that the spatial domain in video has different characteristics from the temporal domain, and thus that points of interest should not be detected in a single 3D spatio-temporal space. Consequently, much research, such as [ 36 , 101 , 109 , 110 , 111 ], has included tracking of detected points of interest across the temporal domain. The volume around the trajectory points is then often used to compute the descriptors for video representation.

Detecting points of interest in video and forming trajectories through the temporal domain has been used by many researchers. For instance, the Kanade–Lucas–Tomasi (KLT) tracker [ 112 ] was used in [ 109 ] to track Harris3D interest points [ 33 ]. These formed feature trajectories which were then represented as sequences of log-polar quantised velocities. The KLT tracker has also been used by [ 36 ], where trajectories were clustered and used to compute an affine transformation matrix representing the trajectories. In [ 70 , 110 ], SIFT descriptors were matched between consecutive frames for trajectory based feature extraction; unique-match points were exploited whilst others were discarded.

Dense sampling based interest point extraction was shown to achieve better performance in action recognition by [ 97 ]. Dense trajectories were later used by [ 101 ], who sampled dense points of interest on a grid. Dense optical flow was then used to track the POIs through time, and trajectories were formed by concatenating points from subsequent frames. Moreover, to exploit motion information, different descriptors (HOG, HOOF and the Motion Boundary Histogram (MBH)) were computed within a space-time volume around each trajectory. Finally, the method was evaluated on publicly available action datasets, including KTH, YouTube, Hollywood2 and UCF sports, and achieved competitive performance in comparison to state-of-the-art approaches. Different extensions of dense trajectory based methods have been proposed by many researchers, such as [ 113 , 114 , 115 , 116 , 117 , 118 ].
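The core of the dense trajectory idea can be sketched as follows: sample points on a regular grid, move each point along the dense optical flow for a fixed number of frames, and discard trajectories with negligible motion. The grid spacing, trajectory length and Farneback parameters below are assumptions for illustration and do not reproduce the exact method of [ 101 ].

```python
# Minimal dense-trajectory sketch: grid sampling + propagation through dense optical flow.
import cv2
import numpy as np

def dense_trajectories(frames, step=5, length=15, min_motion=1.0):
    """frames: list of uint8 grayscale frames. Returns an array of (length+1, 2) point tracks."""
    h, w = frames[0].shape
    trajectories = []
    for start in range(0, len(frames) - length, length):
        ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
        pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)  # (x, y) points
        track = [pts.copy()]
        for t in range(start, start + length):
            flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + 1], None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            rows = np.clip(pts[:, 1].astype(int), 0, h - 1)
            cols = np.clip(pts[:, 0].astype(int), 0, w - 1)
            pts = pts + flow[rows, cols]          # move each point along its flow vector
            track.append(pts.copy())
        track = np.stack(track, axis=1)           # (n_points, length + 1, 2)
        # Static trajectories carry little motion information, so drop them.
        motion = np.linalg.norm(np.diff(track, axis=1), axis=2).sum(axis=1)
        trajectories.extend(track[motion > min_motion])
    return np.asarray(trajectories)
```

Descriptors such as HOG, HOOF and MBH would then be computed in a space-time volume aligned with each surviving trajectory, as described above.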

Local descriptor based methods often follow similar steps once POIs have been detected. Early research extracted descriptors from cuboids formed around the points of interest in the space-time domain; see, e.g., [ 33 , 93 ]. However, the same process can be followed using trajectories. The most popular local descriptor based approaches have exploited cuboids or trajectories, as explained below.

A number of different descriptors were introduced by [ 119 ] to capture appearance and motion features from video. A comparison between single and multi-scale higher order derivatives, histograms of optical flow, and histograms of spatio-temporal gradients was carried out. The local neighbourhood of the detected interest points was described by computing histograms of optical flow and gradient components for each cell of an N × N × N grid. Thereafter, PCA was applied to the concatenation of optical flow and gradient component vectors to exploit the most significant eigenvalues as descriptors. The experiments showed the usefulness and applicability of the histograms of optical flow and spatio-temporal gradient based descriptors.

The Histogram of Oriented Optical Flow (HOOF) descriptor was proposed by [ 34 ] to capture local motion information. Spatio-temporal neighbourhoods were defined around the detected POIs, and optical flow was computed between consecutive frames.

Another robust descriptor that also benefits from optical flow, the Motion Boundary Histogram (MBH), was presented by [ 120 ] to extract local motion information. This descriptor follows the HOG descriptor in binning orientation information into histograms, but computed from the spatial derivatives of the optical flow components. These descriptors can be employed with trajectory information, as was done by [ 121 ]: a spatio-temporal volume was formed around each trajectory and divided into multiple cells, and each cell was represented by a combination of HOG, HOOF and MBH descriptors. Other research that used trajectories for action recognition includes [ 122 , 123 , 124 ].
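A compact sketch of an MBH-style histogram for a single flow field is shown below; as described above, it applies HOG-like orientation binning to the spatial derivatives of each optical flow component. The eight orientation bins and single-cell layout are illustrative simplifications.

```python
# Hypothetical single-cell Motion Boundary Histogram (MBHx + MBHy) of a dense flow field.
import cv2
import numpy as np

def mbh_descriptor(flow, n_bins=8):
    """flow: (H, W, 2) dense optical flow. Returns the concatenated MBHx and MBHy histograms."""
    hists = []
    for c in range(2):                                   # horizontal, then vertical flow component
        gx = cv2.Sobel(flow[..., c], cv2.CV_32F, 1, 0)   # spatial derivatives of the flow
        gy = cv2.Sobel(flow[..., c], cv2.CV_32F, 0, 1)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
        hists.append(np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins))
    desc = np.concatenate(hists)
    return desc / (np.linalg.norm(desc) + 1e-8)
```

Because the histogram is built on flow derivatives, locally constant flow (for example from smooth camera motion) contributes little, which is part of the descriptor's appeal.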

4.4. Other Feature Representations Based Methods

A different representation method, called Bag of Words (BOW) and also referred to as a bag of visual words model, has been employed in computer vision tasks; see, e.g., [ 125 ]. The key idea of this approach is to represent image data as a normalised histogram over a set of code words. The visual words (code words) can be constructed during the learning process by clustering similar image patches that are described by a common feature descriptor. In this way, similar images produce similar histograms, which can then be fed into a classification step. BOW based methods have been used in much research for action recognition, such as [ 28 , 93 , 126 , 127 ].
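The following is a minimal sketch of such a bag-of-visual-words pipeline, assuming local descriptors have already been extracted for each video; the codebook size, k-means settings and linear SVM classifier are illustrative choices rather than a prescription from the cited works.

```python
# Minimal bag-of-visual-words sketch: k-means codebook + normalised word histograms + SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_codebook(stacked_descriptors, n_words=1000):
    """stacked_descriptors: (N, d) array of local descriptors pooled over the training set."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(stacked_descriptors)

def bow_histogram(descriptors, codebook):
    """Encode one video's (n, d) descriptors as a normalised histogram of visual words."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float32)
    return hist / (hist.sum() + 1e-8)

# Usage sketch: train_descs is a list of (n_i, d) descriptor arrays, one per training video.
# codebook = build_codebook(np.vstack(train_descs))
# X = np.stack([bow_histogram(d, codebook) for d in train_descs])
# classifier = LinearSVC().fit(X, train_labels)
```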

Another popular feature representation technique is the Fisher vector, which can be considered a global descriptor. This technique fits a generative model (typically a Gaussian Mixture Model) to the distribution of the extracted local features. The descriptor is formed from the gradient of a given sample's log-likelihood with respect to the parameters of the distribution, estimated from the training set and scaled by the inverse square root of the Fisher information matrix. The Fisher vector descriptor was first presented by [ 128 ] for image classification. For more details about Fisher vector based image classification and action recognition, please see [ 129 , 130 ].
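For reference, when the generative model is a Gaussian Mixture Model fitted to the local descriptors, the Fisher vector component associated with the mean of the k-th Gaussian takes the following standard form (the notation here is ours, following the general formulation popularised by [ 128 , 129 ]):

\[
\mathcal{G}^{X}_{\mu_k} = \frac{1}{N\sqrt{w_k}} \sum_{i=1}^{N} \gamma_i(k)\,\frac{x_i - \mu_k}{\sigma_k},
\qquad
\gamma_i(k) = \frac{w_k\, u_k(x_i)}{\sum_{j} w_j\, u_j(x_i)},
\]

where X = {x_1, ..., x_N} are the local descriptors, w_k, mu_k and sigma_k are the weight, mean and standard deviation of mixture component k, and gamma_i(k) is the soft assignment of x_i to that component. An analogous term is computed for the variances, and concatenating these terms over all components yields the Fisher vector.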

More comprehensive details of action recognition, motion analysis and body tracking can also be found in [ 131 , 132 , 133 , 134 , 135 ]. Some state-of-the-art works that used traditional hand-crafted representation based methods are presented and compared in Table 1 .

Table 1. State-of-the-art methods of traditional hand-crafted representations with different datasets for human action recognition.

Paper | Year | Method | Dataset | Accuracy (%)
[ ] | 2009 | Space-time volumes | KTH | 89.4
[ ] | 2011 | Dense trajectory | KTH | 95
[ ] | 2011 | Space-time volumes | KTH | 94.5
 |  |  | UCF sports | 91.30
[ ] | 2011 | Shape-motion | Weizmann | 100
[ ] | 2011 | LBP | Weizmann | 100
[ ] | 2012 | Bag-of-visual-words | HMDB-51 | 29.2
[ ] | 2012 | Trajectory | HMDB-51 | 40.7
[ ] | 2012 | HOJ3D + LDA | MSR Action 3D | 96.20
[ ] | 2013 | Features (pose-based) | UCF sports | 90
 |  |  | MSR Action 3D | 90.22
[ ] | 2013 | 3D pose | MSR Action 3D | 91.7
[ ] | 2013 | Shape features | Weizmann | 92.8
[ ] | 2013 | Dense trajectory | HMDB-51 | 57.2
[ ] | 2014 | Shape-motion | Weizmann | 95.56
 |  |  | KTH | 94.49
[ ] | 2014 | EigenJoints + AME + NBNN | MSR Action 3D | 95.80
[ ] | 2014 | Features (FV + SFV) | HMDB-51 | 66.79
 |  |  | YouTube action | 93.38
[ ] | 2014 | Dissimilarity and sparse representation | UPCV Action dataset | 89.25
[ ] | 2014 | Shape features | IXMAS | 89.0
[ ] | 2016 | Trajectory | MSR Action 3D | 89
[ ] | 2016 | Shape features | Weizmann | 100
[ ] | 2016 | Shape features | IXMAS | 89.75
[ ] | 2016 | LBP | IXMAS | 80.55
[ ] | 2016 | Motion features | IXMAS | 83.03
[ ] | 2017 | MHI | MuHAVi | 86.93
[ ] | 2017 | Spatio-temporal + HMM | MSR Action 3D | 93.3
 |  |  | MSR Daily | 94.1
[ ] | 2018 | Joints + KE descriptor | MSR Action 3D | 96.2

It is worth pointing out that a variety of higher-level representation techniques have been proposed to capture discriminative information for complex action recognition. Deep learning is an important technique that has demonstrated an effective capability for producing higher-level representations with significant performance improvements. Deep learning based models have the ability to process input data at a low level and to convert it into a mid- or high-level feature representation. Consequently, the next section presents a review of deep learning based models that have been used for human action recognition.

5. Deep Learning Techniques Based Models

Recent research studies have shown that hand-crafted feature based methods are not suitable for all types of datasets. Consequently, a relatively new and important class of machine learning techniques, referred to as deep learning, has been established. Multiple levels of feature representation can be learnt that make sense of different data such as speech, images and text. Such methods are capable of automatically processing raw image and video data for feature extraction, description and classification. Trainable filters and multi-layer models are often employed in these methods for action representation and recognition.

This section presents descriptions of some important deep learning models that have been used for human action recognition. It is very difficult to train a deep learning model from scratch with limited data; thus, models are often limited to appearance based data or some derived representation. Deep learning based models can be classified into three categories: generative models, e.g., Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs), Restricted Boltzmann Machines (RBMs) and regularised auto-encoders; supervised models, e.g., Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs); and hybrid models. Hybrid models are not discussed in this work.

5.1. Unsupervised (Generative) Models

The key idea of deep learning based generative models is that they do not need target labels for the learning process. Such models are appropriate when labelled data are scarce or unavailable. The evolution of deep learning models can be traced back to [ 158 ], where a Deep Belief Network (DBN) was presented with a training algorithm based on Restricted Boltzmann Machines (RBMs) [ 159 ]. This was followed by a dimensionality reduction technique by [ 160 ]. The parameters were learnt with an unsupervised training process and then fine-tuned in a supervised manner using back-propagation.

This inspired great interest in deep learning models, particularly for applications such as human action recognition, image classification, object recognition and speech recognition. Unsupervised learning based methods have been proposed by, e.g., [ 161 ] to automatically learn features from video data for action recognition. An independent subspace analysis algorithm was used to learn space-time features, combined with convolution and stacking based deep learning techniques for action representation.

In [ 162 ], the researchers proposed to train DBNs with RBMs for human action recognition. The experimental results on two public datasets demonstrated the impressive performance of the proposed method over hand-crafted feature based approaches.

An unsupervised deep learning based model was proposed by [ 163 ] to continuously learn from unlabelled video streams. In addition, DBNs based methods were used by [ 164 ] to learn features from an unconstrained video stream for human action recognition.

Generative or unsupervised learning based models have played a substantial role in inspiring researchers' interest in the deep learning field. Nevertheless, the great development of Convolutional Neural Network (CNN) based supervised learning methods for object recognition has somewhat obscured the unsupervised learning based approaches; see, e.g., [ 165 ].

5.2. Supervised (Discriminative) Models

In line with the recent literature on human action recognition, the most common technique used in supervised learning based models is the Convolutional Neural Network (CNN), first proposed by [ 166 ]. CNNs are a type of deep learning model that has shown great performance in various recognition tasks such as pattern recognition, digit classification, image classification and human action recognition; see, e.g., [ 165 , 167 ]. The efficient utilisation of CNNs in image classification [ 165 ] opened a new era of deep learning based methods for human action recognition. The key advantage of CNNs is their ability to learn directly from raw data such as RGB or depth maps. Consequently, it is possible to obtain discriminative features which effectively describe the data and thus make the recognition process easier. Since this approach is susceptible to overfitting, care is needed in the training process; regularisation and large amounts of labelled data help to prevent overfitting. Recently, it has been shown that deep learning based methods outperform many state-of-the-art hand-crafted features for image classification; see, e.g., [ 27 , 165 , 168 ].

Convolutional Neural Networks have a hierarchical structure with multiple hidden layers that help translate a data sample into a set of categories. Such models consist of several different types of layers, such as convolutional layers, pooling layers and fully connected layers. For videos, the temporal domain is introduced as an additional dimension. Since CNNs were originally designed for static image processing, it was not initially clear how to incorporate motion information; therefore, most early research used CNNs on still images to model appearance information for action recognition [ 165 ]. Thereafter, different ways were proposed to utilise motion information for action recognition. An extension was presented by [ 169 ], where stacked video frames were used as the input to a CNN for action recognition from video; however, the experimental results were worse than those of hand-crafted feature based approaches. An investigation of this issue by [ 32 ] developed the idea of having separate spatial and temporal CNN streams for action recognition.

Figure 1 illustrates spatio-temporal CNN streams similar to [ 32 ], where the two streams are implemented as independent CNNs. One stream is the spatial stream, which recognises actions from static images. The other is the temporal stream, which recognises actions from stacked video frames based on the motion information of dense optical flow. The outputs of the two streams are combined using a late fusion technique. The experiments showed improved performance for this method compared to hand-crafted feature based approaches. However, this type of architecture imposes additional hardware requirements for it to be suitable for a variety of applications.
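A schematic of this two-stream arrangement is sketched below, with each stream as an independent CNN and class scores averaged as the late fusion step. The ResNet-18 backbones and the stack of ten flow fields (20 input channels) are assumptions made for the example and not the configuration used in [ 32 ].

```python
# Schematic two-stream action recognition network with late (class-score) fusion.
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes, flow_stack=10):
        super().__init__()
        self.spatial = models.resnet18(num_classes=num_classes)    # single RGB frame stream
        self.temporal = models.resnet18(num_classes=num_classes)   # stacked optical flow stream
        # The temporal stream sees 2 * flow_stack channels (x and y flow per frame pair).
        self.temporal.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb, flow):
        # Late fusion: average the class scores produced by the two streams.
        return (self.spatial(rgb) + self.temporal(flow)) / 2

# net = TwoStreamNet(num_classes=101)
# scores = net(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```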

Figure 1. Illustration of the spatio-temporal CNN streams as used by [ 32 ]. The input data are split into two streams: one for the individual appearance based raw frames, and the other for the temporal information corresponding to an optical flow stream. The two streams are fused at the end with class score fusion. (Image file: jimaging-06-00046-g001.jpg)

Much research on action recognition builds on works that previously achieved relatively good performance in image classification. Recent works extended what was implemented in two dimensions to 3D in order to include the temporal domain. Most CNN models proposed for action recognition had been limited to dealing with 2D input data; nonetheless, some applications include 3D data that require a specialised deep learning model. To this end, 3D Convolutional Neural Network (3D-CNN) based models were presented by [ 40 ] for surveillance tasks at airports. Spatio-temporal features were automatically extracted by employing 3D convolutions in the convolutional layers over both the spatial and temporal dimensions. The experimental results demonstrated superior performance for this method in comparison to other state-of-the-art methods.

In general, there has been much success with 2D and 3D CNNs in, e.g., image classification, object recognition, speech recognition and action recognition. Nonetheless, some issues still need to be considered, such as the immense amount of image or video data needed for training. Collecting and annotating large amounts of image or video data is quite exhausting and requires a substantial amount of time. Fortunately, the availability of rich and relatively large action recognition datasets has provided great support for designing such models in terms of their training and evaluation. A factorised 3D-CNN was proposed by [ 170 ] for human action recognition. The 3D-CNN was factorised into a standard 2D-CNN for spatial information at the lower layers and a 1D-CNN for temporal information at the higher layers. This factorisation reduces the number of learnt parameters and consequently the computational complexity. Two benchmark datasets were used to evaluate the proposed method, UCF101 and HMDB51, and the results showed comparable performance with state-of-the-art methods. Another spatio-temporal 3D-CNN approach was proposed by [ 171 ] for human action recognition. The authors used four public datasets to evaluate the proposed method. The 3D-CNN achieved improved performance with spatio-temporal features compared to a 2D-CNN. The authors also found that a small filter size, i.e., the 3 × 3 × 3 used in their method, was the best choice for spatio-temporal features. Overall, the experimental results demonstrated competitive performance for the proposed method with a linear classifier.
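The factorisation idea can be illustrated with a small building block that replaces a full 3D convolution by a spatial 2D convolution followed by a temporal 1D convolution. The channel sizes and kernel split below are illustrative and do not reproduce the exact architecture of [ 170 ].

```python
# Sketch of a factorised spatio-temporal convolution block (spatial 1xkxk then temporal kx1x1).
import torch
import torch.nn as nn

class FactorisedSTConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):            # x: (batch, channels, time, height, width)
        return self.act(self.temporal(self.act(self.spatial(x))))

# block = FactorisedSTConv(3, 64)
# out = block(torch.randn(1, 3, 16, 112, 112))   # a 16-frame clip at 112x112 resolution
```

Splitting the kernel reduces the weight count per filter from roughly proportional to k cubed to roughly k squared plus k, which is the source of the parameter saving referred to above.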

Some research has combined supervised and unsupervised learning models for action recognition. A Slow Feature Analysis (SFA) based method was used by [ 172 ] to extract slowly varying features from an input in an unsupervised manner. These were combined with a 3D-CNN for action recognition, and this work achieved competitive performance compared to state-of-the-art approaches. Three standard action recognition datasets were used: KTH [ 98 ], UCF sports [ 85 ] and Hollywood2 [ 99 ].

In [ 173 ], a hierarchical framework combining a 3D CNN and a hidden Markov model (HMM) was proposed to recognise and segment continuous actions simultaneously. The 3D CNN was used to learn powerful high-level features directly from raw data and to extract effective and robust action features. The statistical dependencies over adjacent sub-actions were then modelled by the HMM to infer action sequences. The KTH and Weizmann datasets were used to evaluate the proposed method, and the experimental results showed improved performance over some state-of-the-art approaches.

For efficient learning of spatio-temporal features in video action recognition, a hybrid CNN was introduced in [ 174 ] using a fusion convolutional architecture. 2D and 3D CNNs were fused to provide temporal encoding with fewer parameters. Three backbone models were used to build the proposed model (semi-CNN): VGG-16, ResNet and DenseNet. The UCF-101 dataset was used in the evaluation to compare the performance of each model with its corresponding 3D model. Figure 2 shows the performance of these models over 50 epochs.

Figure 2. The performance of the action recognition models in [ 174 ]: ( a ) semi-CNN model based on the VGG16 architecture; ( b ) semi-CNN model based on the ResNet34 architecture; ( c ) semi-CNN model based on the DenseNet121 architecture. (Image file: jimaging-06-00046-g002.jpg)

Another way to model motion information in video was proposed by [ 39 ] for action recognition using Recurrent Neural Networks (RNNs). Discriminative CNN features were computed for each video frame and then fed into an RNN model. The key advantage of an RNN architecture is its ability to deal with sequential inputs, as a copy of the network is unrolled for each element of the sequence; in the hidden layers, connections link the replicas across time steps and the same weights are shared by every replica. The authors highlighted that local motion information can be obtained from video by feeding optical flow through CNNs, while global motion information can be modelled through the use of the RNN. RNN based supervised learning was used by [ 175 ] across five parts of the skeleton information (right arm, left arm, right leg, left leg and trunk). These were used as inputs to five separate sub-nets for action recognition, whose outputs were then hierarchically fused to form the inputs to the higher layers. Thereafter, the final representation was fed into a single-layer perceptron to obtain the final decision. Three datasets were used to evaluate the proposed method: MSR Action3D [ 74 ], Berkeley Multimodal Human Action (Berkeley MHAD) [ 176 ] and Motion Capture HDM05 [ 177 ]. The results demonstrated state-of-the-art performance. However, an RNN is not capable of processing very long sequences, cannot easily be stacked into very deep models, and lacks the ability to keep track of long-term dependencies, which makes training difficult.

A recurrent module that improved long-range learning, the Long Short-Term Memory (LSTM), was first proposed by [ 178 ]. LSTM units have a hidden state augmented with nonlinear gating mechanisms, in which simple learned gating functions allow the state to be propagated without modification, updated or reset. LSTMs have had a significant impact on vision problems, as these models are straightforward to fine-tune end-to-end. Moreover, LSTMs are able to deal with sequential data and are not limited to fixed-length inputs or outputs, which makes it simple to model sequential data of varying lengths, such as text or video [ 179 ].

LSTMs have recently been shown to be effective for large-scale learning of speech recognition [ 180 ] and language translation models [ 181 ]. An LSTM was also proposed for action recognition by [ 179 ], where a hybrid deep learning architecture, the long-term recurrent CNN (LRCN), was introduced. Raw frames and optical flow information were used as inputs to this system. The proposed methods were evaluated on the UCF101 dataset and showed an improvement in performance in comparison with the baseline architecture.
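A schematic of this kind of recurrent CNN (in the spirit of LRCN) is sketched below: a 2D CNN extracts a feature vector per frame and an LSTM aggregates the sequence before classification. The ResNet-18 backbone, hidden size and use of the final hidden state are assumptions for the example rather than the setup of [ 179 ].

```python
# Schematic CNN + LSTM model: per-frame CNN features aggregated by a recurrent layer.
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTM(nn.Module):
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        backbone = models.resnet18()
        backbone.fc = nn.Identity()                  # keep the 512-d pooled features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                         # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)   # per-frame features
        _, (h, _) = self.lstm(feats)
        return self.head(h[-1])                      # classify from the last hidden state

# model = CNNLSTM(num_classes=101)
# logits = model(torch.randn(2, 16, 3, 224, 224))
```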

Deep learning based approaches have achieved relatively high recognition performance, on the same level as, or better than, hand-crafted feature based methods. Some researchers have also proposed using multiple deep learning models alongside hand-crafted features to achieve even better results, such as [ 32 , 117 , 182 ].

5.3. Multiple Modality Based Methods

Deep learning methods provide new insight into human action recognition by extracting action features from RGB, depth and/or skeleton information. Different features can be learnt from deep networks [ 117 , 171 , 183 ], such as appearance, optical flow, depth and/or skeleton sequences. Very often, several modalities are provided for the same dataset, such as RGB, depth and skeleton information, or at least two of them. Therefore, much research has proposed to utilise combinations of different modalities, or of their hand-crafted features, and to merge them using fusion based strategies. A separate framework is often employed for each modality, and classification scores are then obtained for each one.

Some research has highlighted that significant improvements in the performance of an action recognition system can be achieved by utilising hand-crafted features within CNN based deep learning models. A CNN model based on multiple sources of information was proposed by [ 184 ], using spatially varying soft-gating; a fusion technique was then used to combine the multiple CNN models trained on the various sources. A Stratified Pooling based CNN (SP-CNN) was proposed by [ 185 ] to handle the issue of different feature levels in each frame of video data. To obtain video based features, the authors fine-tuned a pre-trained CNN model on the target datasets. Frame-level features were extracted and principal component analysis was used for dimensionality reduction. Stratified pooling was then used to convert the frame-level features into video-level features, which were finally fed into an SVM classifier. The method was evaluated on the HMDB51 [ 27 ] and UCF101 [ 186 ] datasets, and the experiments showed that it outperformed the state-of-the-art.

An extension of this two stream network approach was proposed in [ 117 ] using dense trajectories for more effective learning of motion information.

A general residual network architecture for human activity recognition was presented in [ 187 ] using cross-stream residual connections in the form of multiplicative interaction between appearance and motion streams. The motion information was exploited using stacked inputs of horizontal and vertical optical flow.

A fusion study was presented in [ 182 ] for human activity recognition using two streams of the pre-trained Visual Geometry Group (VGG) network model to compute spatio-temporal information combining RGB and stacked optical flow data. Various fusion mechanisms at different positions of the two streams were evaluated to determine the best possible recognition performance.

Some research studies have paid particular attention to auxiliary information which can improve the performance of action recognition. In some studies, audio has been combined with the video to detect the actions, such as [ 188 ], where a combination of Hidden Markov Models (HMMs) with audio was used to determine the actions. The main disadvantage of using audio recordings is the surrounding noise, which can affect the results.

All of the above approaches suffer from a shortage of long-term temporal information. For example, the number of frames used in optical flow stacking ranged between 7 and 15 (7, 10 and 15 frames as used in [ 40 , 169 , 184 ], respectively). However, people often perform the same action over different periods of time, depending on many factors and varying particularly between individuals. Consequently, multi-resolution hand-crafted features computed over different periods of time were used by [ 189 ] to avoid this problem. Furthermore, different weighting phases were applied using a time-variant approach in the computation of the Depth Motion Maps (DMMs) to enable adaptation to the important regions of an action. Different fusion techniques were employed to merge spatial and motion information for the best action recognition. Figure 3 illustrates the impact of different window frame lengths on the performance of action recognition systems.

Figure 3. Action recognition accuracy versus different window frame lengths, as proposed in [ 189 ]. (Image file: jimaging-06-00046-g003.jpg)

5.4. Pose Estimation and Multi-View Action Recognition

Another considerable challenge in human action recognition is view variance. The same action can be viewed from different angles and can thus look very different. This issue was taken into account by [ 190 ]. Training data were generated by fitting a synthetic 3D human model to real motion information, and poses were then extracted from different view-points. A CNN based model was found to outperform a hand-crafted feature based approach for multi-view action recognition.

Dynamic image information was extracted by [ 191 ] from synthesised multi-view depth videos, and multi-view dynamic images were constructed from the synthesised data. A CNN model was then proposed to perform feature learning from the multi-view dynamic images. Multiple batches of motion history images (MB-MHIs) were constructed by [ 192 ]. This information was then used to compute two descriptors, using a deep residual network (ResNet) and histograms of oriented gradients (HOG). An orthogonal matching pursuit approach was later used to obtain sparse codes of the feature descriptions. A final view-invariant feature representation was formed and used to train an SVM classifier for action recognition. The MuHAVi-MAS [ 193 ] and MuHAVi-uncut [ 194 ] datasets were used to evaluate the proposed approach. Figure 4 illustrates the accuracy variations of the recognition model over different components.

Figure 4. The accuracy variations with the number of frames and the number of batches, as reported in [ 192 ]. (Image file: jimaging-06-00046-g004.jpg)

A CNN model obtained from ImageNet was used by [ 195 ] to learn from multi-view DMM features for action recognition, where video was projected onto different view-points within the 3D space. Different temporal scales were then used from the synthesised data to constitute a range of spatio-temporal patterns for each action. Finally, three fine-tuned models were employed independently for each DMM map. However, some actions involving object interactions can be very difficult to recognise from raw depth data alone, which helps to justify the inclusion of RGB data for the recognition of such actions.

In [ 196 ], Multi-View Regional Adaptive Multi temporal-resolution DMMs (MV-RAMDMM) and multi temporal-resolution RGB information are learnt with multiple 3D-CNN streams for action recognition. The Adaptive Multi-resolution DMM is applied across multiple views to extract view and time invariant action information, and is adapted based on human movement before being used in the deep learning model for action recognition. In addition, multi temporal raw appearance information is used to exploit various spatio-temporal features of the RGB scenes. This helps to capture more specific information which might be difficult to obtain purely from depth sequences; for instance, object-interaction information is more apparent in RGB space.

In a different way, semantic features based on pose can be seen as very important cues for describing the category of an action. Human joint information was utilised by [ 197 ] to compute the temporal variation between joints during actions. Time-variant functions were used to capture the pose associated with each action and were considered for feature extraction. The feature representation for action recognition was constructed using the temporal variation of the values of these time functions, and CNNs were then trained to recognise human actions from the local patterns in this representation. The Berkeley MHAD dataset [ 176 ] was used to evaluate the proposed method, and the results demonstrated the effectiveness of this approach. Similar to [ 197 ], a Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition was proposed by [ 198 ]. The descriptor aggregated motion and appearance information along tracks of human body parts, utilising skeleton information alongside raw RGB data. The JHMDB [ 199 ] and MPII [ 200 ] cooking datasets were used to evaluate the proposed method. However, it can be difficult to accurately capture the skeleton information of a person under different environmental conditions; this may be due to the need for accurate body-part detection to precisely estimate the skeleton.

Some common datasets for human action recognition are introduced in Table 2 . In addition, an extensive comparison between deep learning based models and hand-crafted feature based models for human action recognition is presented in Table 3 .

Table 2. Common datasets for human action recognition.

Dataset | RGB | Depth | Skeleton | Samples | Classes
[ ] |  |  |  | 1707 | 12
[ ] |  |  |  | 4500 | 10
[ ] |  |  |  | 1707 | 12
[ ] |  |  |  | 6766 | 51
[ ] |  |  |  | 783 | 16
[ ] |  |  |  | 6618 | 50
[ ] |  |  |  | 13,320 | 101
[ ] |  |  |  | 567 | 20
[ ] |  |  |  | 320 | 16
[ ] |  |  |  | 1475 | 10
[ ] |  |  |  | 861 | 27
[ ] |  |  |  | 861 | 27
[ ] |  |  |  | 1189 | 13
[ ] |  |  |  | 56,880 | 60

Table 3. Comparison of deep learning based models and hand-crafted feature based models for human action recognition [ 208 , 209 , 210 , 211 ].

Deep Learning Based Models | Hand-Crafted Feature Based Models
Ability to learn features directly from raw data. | Pre-processing algorithms and/or detectors are needed to discover the most efficient patterns to improve recognition accuracy.
Automatically extract spatial, temporal, scale and translation invariant features from raw data. | Use feature selection and dimensionality reduction methods, which are not very generalisable.
Data pre-processing and normalisation are not mandatory to achieve high performance. | Usually require comprehensive data pre-processing and normalisation to achieve significant performance.
Hierarchical and translation-invariant features are obtained from such models. | Inefficient at managing such kinds of problems.
A huge amount of training data is required to avoid over-fitting, together with a high-performance system with a Graphical Processing Unit (GPU) to speed up training. | Require less data for training, with less computation time and memory usage.

Furthermore, some recent works based on deep learning models for human action recognition are included in Table 4 .

Table 4. State-of-the-art methods of deep learning based models with different datasets for human action recognition.

Paper | Year | Method | Class of Architecture | Dataset | Accuracy (%)
[ ] | 2012 | ASD features | SFA | KTH | 93.5
[ ] | 2013 | Spatio-temporal | 3D CNN | KTH | 90.2
[ ] | 2014 | STIP features | Sparse auto-encoder | KTH | 96.6
[ ] | 2014 | Two-stream | CNN | HMDB-51 | 59.4
[ ] | 2014 | DL-SFA | SFA | Hollywood2 | 48.1
[ ] | 2014 | Two-stream | CNN | UCF-101 | 88.0
[ ] | 2015 | Convolutional temporal feature | CNN-LSTM | UCF-101 | 88.6
[ ] | 2015 | TDD descriptor | CNN | UCF-101 | 91.5
[ ] | 2015 | Spatio-temporal | CNN | UCF-101 | 88.1
[ ] | 2015 | Spatio-temporal | 3D CNN | UCF-101 | 90.4
[ ] | 2015 | Hierarchical model | RNN | MSR Action3D | 94.49
[ ] | 2015 | Differential | RNN | MSR Action3D | 92.03
[ ] | 2015 | Static and motion features | CNN | UCF Sports | 91.9
[ ] | 2015 | TDD descriptor | CNN | HMDB-51 | 65.9
[ ] | 2015 | Spatio-temporal | CNN | HMDB-51 | 59.1
[ ] | 2016 | Spatio-temporal | LSTM-CNN | HMDB-51 | 55.3
[ ] | 2016 | Deep network | CNN | UCF-101 | 89.1
[ ] | 2016 | Spatio-temporal | LSTM-CNN | UCF-101 | 86.9
[ ] | 2016 | Deep model | CNN | HMDB-51 | 54.9
[ ] | 2016 | 3D CNN + HMM | CNN | KTH | 89.20
[ ] | 2016 | LRCN | CNN + LSTM | UCF-101 | 82.34
[ ] | 2017 | SP-CNN | CNN | HMDB-51 | 74.7
[ ] | 2017 | Rank pooling | CNN | HMDB-51 | 65.8
[ ] | 2017 | Rank pooling | CNN | Hollywood2 | 75.2
[ ] | 2017 | SP-CNN | CNN | UCF-101 | 91.6
[ ] | 2018 | DynamicMaps | CNN | NTU RGB+D | 87.08
[ ] | 2018 | Cooperative model | CNN | NTU RGB+D | 86.42
[ ] | 2019 | Depth Dynamic Images | CNN | UWA3DII | 68.10
[ ] | 2019 | FWMDMM | CNN | MSR Daily Activity | 92.90
 |  |  | CNN | NUCLA | 69.10
[ ] | 2020 | MB-MHI | ResNet | MuHAVi | 83.8
[ ] | 2020 | MV-RAMDMM | 3D CNN | MSR Daily Activity | 87.50
 |  |  | 3D CNN | NUCLA | 86.20
[ ] | 2020 | Semi-CNN | ResNet | UCF-101 | 89.00
 |  | Semi-CNN | VGG-16 | UCF-101 | 82.58
 |  | Semi-CNN | DenseNet | UCF-101 | 77.72

6. Conclusions

In this paper, we have presented human action recognition methods and introduced a comprehensive overview of recent approaches to human action recognition research. This included hand-crafted representation based methods, deep learning based methods, human–object interaction and multi-view action recognition. The main conclusions of this study on human action recognition can be summarised as follows:

  • Data selection: suitable data for capturing the action may help to improve the performance of action recognition.
  • Approach to recognition: deep learning based methods achieved superior performance.
  • Multiple modalities: current research highlights that multi-modal fusion can efficiently improve performance.

This paper has presented the most relevant and outstanding computer vision based methods for human action recognition. A variety of hand-crafted methods and deep learning models have been summarised, along with the advantages and disadvantages of each approach. Hand-crafted feature based methods are categorised into holistic and local feature based methods. Holistic feature based methods have been summarised along with their limitations. These methods assume a static background; in other words, the camera must be stable and videos are assumed to have been captured under constrained conditions for a holistic representation. Otherwise, these methods need extra pre-processing steps, such as people detection, to be able to recognise human actions. This is particularly true in the presence of a cluttered or complex background, or if the camera moves whilst action sequences are captured. Local feature based methods and different types of descriptors were also described in this paper. It is shown that local feature based methods more often achieve state-of-the-art results compared to other approaches. In addition, such methods require reduced computational complexity to recognise human actions compared to more complicated models. The main advantage of local feature based methods is their flexibility: they can be applied to video data without complex requirements such as human localisation or body part detection, which is not feasible for many types of video. However, in some cases it is very difficult to address action variations using local representation based methods, which in turn fail to precisely recognise human actions. Therefore, hand-crafted representations that combine both local and holistic based methods may help. Different issues are tackled by benefiting from shape and motion information together with the local feature representation of an action. This information, alongside local representation strategies, plays a key role in recognising different actions and improving the performance of the recognition system.

A new direction has been proposed to enhance action recognition performance using deep learning technology. Deep learning is summarised in this paper and classified into two categories: supervised and unsupervised models. Supervised models are emphasised in this work due to their vast ability and high effectiveness in implementing recognition systems. They have achieved competitive performance in comparison with traditional approaches in many applications of computer vision. The most important characteristic of deep learning models is the ability to learn features from raw data, which has somewhat reduced the need for hand-crafted feature detectors and descriptors.

One of the most popular supervised models is the Convolutional Neural Network (CNN), which is currently used in most existing deep learning based methods. However, deep learning based methods still have some limitations that need to be considered. One of these is the need for huge amounts of data for training the models; another is the high-performance hardware required to enable computation in a plausible amount of time. Therefore, transfer learning approaches are adopted in different works to benefit from pre-trained models and speed up the training processes. This also helps to improve the performance of the action recognition system with reasonable hardware requirements.

Two common types of deep learning techniques were used for either spatial or spatio-temporal feature extraction and representation; these include CNNs, 3D CNNs, LSTMs, etc. Some research has highlighted that significant improvements in the performance of an action recognition system can be achieved by utilising multi-modality structure based methods. These could include RGB sequences, hand-designed features, depth sequences and/or skeleton sequences.

Many researchers have highlighted the importance of temporal information that can be exploited to provide more discriminative features for action recognition. This information was processed early with an independent 2D-CNN stream.

Spatio-temporal features have also been learnt directly with the use of 3D-CNN or LSTM models. These have been summarised in this review, in which the temporal domain is considered in the learning process. A multi-modality structure may add great improvements to a recognition system within a deep learning model. Toward this aim, different action recognition systems were presented within different temporal batches involving a deep learning model.

Author Contributions

M.A.-F. designed the concept and drafted the manuscript. J.C. and D.N. supervised, helped and supported M.A.-F. to plan the design and structure of the manuscript. A.I.A. prepared the figures and public datasets analysis. All authors discussed the analyses, interpretation of methods and commented on the manuscript. All authors have read and agreed to the published version of the manuscript.

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Microsoft Research Blog

Microsoft at CVPR 2024: Innovations in computer vision and AI research

Published June 17, 2024


Microsoft is proud to sponsor the 41st annual Conference on Computer Vision and Pattern Recognition (CVPR 2024), held from June 17 to June 21. This premier conference covers a broad spectrum of topics in the field, including 3D reconstruction and modeling, action and motion analysis, video and image processing, synthetic data generation, neural networks, and many more. This year, 63 papers from Microsoft have been accepted, with six selected for oral presentations. This post highlights these contributions.

The diversity of these research projects reflects the interdisciplinary approach that Microsoft research teams have taken, from techniques that precisely recreate 3D human figures and perspectives in augmented reality (AR) to combining advanced image segmentation with synthetic data to better replicate real-world scenarios. Other projects demonstrate how researchers are combining machine learning with natural language processing and structured data, developing models that not only visualize but also interact with their environments. Collectively, these projects aim to improve machine perception and enable more accurate and responsive interactions with the world. 


Oral presentations 

BioCLIP: A Vision Foundation Model for the Tree of Life

Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G. Campolongo, Chan Hee Song, David Carlyn, Li Dong, W. Dahdul, Charles Stewart, Tanya Y. Berger-Wolf, Wei-Lun Chao, Yu Su

The surge in images captured from diverse sources—from drones to smartphones—offers a rich source of biological data. To harness this potential, we introduce TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images, and BioCLIP, a foundation model intended for the biological sciences. BioCLIP, utilizing the TreeOfLife-10M’s vast array of organism images and structured knowledge, excels in fine-grained biological classification, outperforming existing models by significant margins and demonstrating strong generalizability. 

EgoGen: An Egocentric Synthetic Data Generator

Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys

A critical challenge in augmented reality (AR) is simulating realistic anatomical movements to guide cameras for authentic egocentric views. To overcome this, the authors developed EgoGen, a sophisticated synthetic data generator that not only improves training data accuracy for egocentric tasks but also refines the integration of motion and perception. It offers a practical solution for creating realistic egocentric training data, with the goal of serving as a useful tool for egocentric computer vision research. 

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan

Florence-2 introduces a unified, prompt-based vision foundation model capable of handling a variety of tasks, from captioning to object detection and segmentation. Designed to interpret text prompts as task instructions, Florence-2 generates text outputs across a spectrum of vision and vision-language tasks. This model’s training utilizes the FLD-5B dataset, which includes 5.4 billion annotations on 126 million images, developed using an iterative strategy of automated image annotation and continual model refinement.

LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia

This work introduces reasoning segmentation, a new segmentation task using complex query texts to generate segmentation masks. The authors also established a new benchmark, comprising over a thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation. Finally, the authors present the Large Language Instructed Segmentation Assistant (LISA), a tool that combines the linguistic capabilities of large language models with the ability to produce segmentation masks. LISA effectively handles complex queries and shows robust zero-shot learning abilities, further enhanced by minimal fine-tuning.

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song

MultiPly is a new framework for reconstructing multiple people in 3D from single-camera videos in natural settings. This technique employs a layered neural representation for the entire scene, refined through layer-wise differentiable volume rendering. Enhanced by a hybrid instance segmentation that combines self-supervised 3D and promptable 2D techniques, it provides reliable segmentation even with close interactions. The process uses confidence-guided optimization to alternately refine human poses and shapes, achieving high-fidelity, consistent 3D models.

SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes

Alexandros Delitzas, Ayça Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, Francis Engelmann

Traditional 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation, but the true challenge lies in interacting with functional interactive elements like handles, knobs, and buttons to achieve specific tasks. Enter SceneFun3D: a robust dataset featuring over 14,800 precise interaction annotations across 710 high-resolution real-world 3D indoor scenes. This dataset enriches scene comprehension with motion parameters and task-specific natural language descriptions, facilitating advanced research in functionality segmentation, task-driven affordance grounding, and 3D motion estimation.

Discover more about our work and contributions to CVPR 2024, including our full list of publications and sessions, on our conference webpage.

Temporal adaptive feature pyramid network for action detection

Recommendations

Modeling Temporal Structure of Complex Actions Using Bag-of-Sequencelets

We propose a new model for recognizing complex actions named Bag-of-Sequencelets. We represent a video as a sequence of primitive actions. We model a complex action as an ensemble of sub-sequences (sequencelets). We automatically learn sequencelets without ...

Bi-direction Feature Pyramid Temporal Action Detection Network

Temporal action detection in long-untrimmed videos is still a challenging task in video content analysis. Many existing approaches contain two stages, which firstly generate action proposals and then classify them. The main drawback of these ...

A Temporal Action Detection Model With Feature Pyramid Network

To find out all actions included in an untrimmed video, temporal action detection localizes the starting and ending of each action, and identify their categories, simultaneously. Different with trimmed video which always involves a single action ...


  • IEEE CS Standards
  • Career Center
  • Subscribe to Newsletter
  • IEEE Standards

research proposal on computer vision

  • For Industry Professionals
  • For Students
  • Launch a New Career
  • Membership FAQ
  • Membership FAQs
  • Membership Grades
  • Special Circumstances
  • Discounts & Payments
  • Distinguished Contributor Recognition
  • Grant Programs
  • Find a Local Chapter
  • Find a Distinguished Visitor
  • About Distinguished Visitors Program
  • Find a Speaker on Early Career Topics
  • Technical Communities
  • Collabratec (Discussion Forum)
  • My Subscriptions
  • My Referrals
  • Computer Magazine
  • ComputingEdge Magazine
  • Let us help make your event a success. EXPLORE PLANNING SERVICES
  • Events Calendar
  • Calls for Papers
  • Conference Proceedings
  • Conference Highlights
  • Top 2024 Conferences
  • Conference Sponsorship Options
  • Conference Planning Services
  • Conference Organizer Resources
  • Virtual Conference Guide
  • Get a Quote
  • CPS Dashboard
  • CPS Author FAQ
  • CPS Organizer FAQ
  • Find the latest in advanced computing research. VISIT THE DIGITAL LIBRARY
  • Open Access
  • Tech News Blog
  • Author Guidelines
  • Reviewer Information
  • Guest Editor Information
  • Editor Information
  • Editor-in-Chief Information
  • Volunteer Opportunities
  • Video Library
  • Member Benefits
  • Institutional Library Subscriptions
  • Advertising and Sponsorship
  • Code of Ethics
  • Educational Webinars
  • Online Education
  • Certifications
  • Industry Webinars & Whitepapers
  • Research Reports
  • Bodies of Knowledge
  • CS for Industry Professionals
  • Resource Library
  • Newsletters
  • Women in Computing
  • Digital Library Access
  • Organize a Conference
  • Run a Publication
  • Become a Distinguished Speaker
  • Participate in Standards Activities
  • Peer Review Content
  • Author Resources
  • Publish Open Access
  • Society Leadership
  • Boards & Committees
  • Local Chapters
  • Governance Resources
  • Conference Publishing Services
  • Chapter Resources
  • About the Board of Governors
  • Board of Governors Members
  • Diversity & Inclusion
  • Open Volunteer Opportunities
  • Award Recipients
  • Student Scholarships & Awards
  • Nominate an Election Candidate
  • Nominate a Colleague
  • Corporate Partnerships
  • Conference Sponsorships & Exhibits
  • Advertising
  • Recruitment
  • Publications
  • Education & Career

CVPR 2024 Announces Best Paper Award Winners


This year, from more than 11,500 paper submissions, the CVPR 2024 Awards Committee selected the following 10 winners for the honor of Best Papers during the Awards Program at CVPR 2024, taking place now through 21 June at the Seattle Convention Center in Seattle, Wash., U.S.A.

Best Papers

  • “ Generative Image Dynamics ” Authors: Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski The paper presents a new approach for modeling natural oscillation dynamics from a single still picture. This approach produces photo-realistic animations from a single picture and significantly outperforms prior baselines. It also demonstrates potential to enable several downstream applications such as creating seamlessly looping or interactive image dynamics.
  • “ Rich Human Feedback for Text-to-Image Generation ” Authors: Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katherine M. Collins, Yiwen Luo, Yang Li, Kai J. Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam This paper presents the first rich human feedback dataset for image generation. The authors designed and trained a multimodal Transformer to predict this rich human feedback and demonstrated examples of using it to improve image generation.

Honorable mention papers included, “ EventPS: Real-Time Photometric Stereo Using an Event Camera ” and “ pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction. ”

Best Student Papers

  • “ Mip-Splatting: Alias-free 3D Gaussian Splatting ” Authors: Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, Andreas Geiger This paper introduces Mip-Splatting, a technique improving 3D Gaussian Splatting (3DGS) with a 3D smoothing filter and a 2D Mip filter for alias-free rendering at any scale. This approach significantly outperforms state-of-the-art methods in out-of-distribution scenarios, when testing at sampling rates different from training, resulting in better generalization to out-of-distribution camera poses and zoom factors.
  • “ BioCLIP: A Vision Foundation Model for the Tree of Life ” Authors: Samuel Stevens, Jiaman Wu, Matthew J. Thompson, Elizabeth G. Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M. Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su This paper offers TREEOFLIFE-10M and BIOCLIP, a large-scale diverse biology image dataset and a foundation model for the tree of life, respectively. This work shows BIOCLIP is a strong fine-grained classifier for biology in both zero- and few-shot settings.

There were also four honorable mentions in this category this year: “ SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency ”; “ Image Processing GNN: Breaking Rigidity in Super-Resolution ”; “ Objects as Volumes: A Stochastic Geometry View of Opaque Solids ”; and “ Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods ”.

“We are honored to recognize the CVPR 2024 Best Paper Awards winners,” said David Crandall, Professor of Computer Science at Indiana University, Bloomington, Ind., U.S.A., and CVPR 2024 Program Co-Chair. “The 10 papers selected this year – double the number awarded in 2023 – are a testament to the continued growth of CVPR and the field, and to all of the advances that await.”

Additionally, the IEEE Computer Society (CS), a CVPR organizing sponsor, announced the Technical Community on Pattern Analysis and Machine Intelligence (TCPAMI) Awards at this year’s conference. The following were recognized for their achievements:

  • 2024 Recipient : “ Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation ” Authors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
  • 2024 Recipient : Angjoo Kanazawa, Carl Vondrick
  • 2024 Recipient : Andrea Vedaldi

“The TCPAMI Awards demonstrate the lasting impact and influence of CVPR research and researchers,” said Walter J. Scheirer, University of Notre Dame, Notre Dame, Ind., U.S.A., and CVPR 2024 General Chair. “The contributions of these leaders have helped to shape and drive forward continued advancements in the field. We are proud to recognize these achievements and congratulate them on their success.”

About CVPR 2024 The Computer Vision and Pattern Recognition Conference (CVPR) is the preeminent computer vision event for new research in support of artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and much more. Sponsored by the IEEE Computer Society (CS) and the Computer Vision Foundation (CVF), CVPR delivers important advances in all areas of computer vision and pattern recognition and the various fields and industries they impact. With a first-in-class technical program, including tutorials and workshops, a leading-edge expo, and robust networking opportunities, CVPR, which is annually attended by more than 10,000 scientists and engineers, creates a one-of-a-kind opportunity for networking, recruiting, inspiration, and motivation.

CVPR 2024 takes place 17-21 June at the Seattle Convention Center in Seattle, Wash., U.S.A., and participants may also access sessions virtually. For more information about CVPR 2024, visit cvpr.thecvf.com .

About the Computer Vision Foundation The Computer Vision Foundation (CVF) is a non-profit organization whose purpose is to foster and support research on all aspects of computer vision. Together with the IEEE Computer Society, it co-sponsors the two largest computer vision conferences, CVPR and the International Conference on Computer Vision (ICCV). Visit thecvf.com for more information.

About the IEEE Computer Society Engaging computer engineers, scientists, academia, and industry professionals from all areas and levels of computing, the IEEE Computer Society (CS) serves as the world’s largest and most established professional organization of its type. IEEE CS sets the standard for the education and engagement that fuels continued global technological advancement. Through conferences, publications, and programs that inspire dialogue, debate, and collaboration, IEEE CS empowers, shapes, and guides the future of not only its 375,000+ community members, but the greater industry, enabling new opportunities to better serve our world. Visit computer.org for more information.


Seamless in Seattle: NVIDIA Research Showcases Advancements in Visual Generative AI at CVPR

NVIDIA researchers are at the forefront of the rapidly advancing field of visual generative AI, developing new techniques to create and interpret images, videos and 3D environments.

More than 50 of these projects will be showcased at the Computer Vision and Pattern Recognition (CVPR) conference, taking place June 17-21 in Seattle. Two of the papers — one on the training dynamics of diffusion models and another on high-definition maps for autonomous vehicles — are finalists for CVPR’s Best Paper Awards.

NVIDIA is also the winner of the CVPR Autonomous Grand Challenge’s End-to-End Driving at Scale track — a significant milestone that demonstrates the company’s use of generative AI for comprehensive self-driving models. The winning submission, which outperformed more than 450 entries worldwide, also received CVPR’s Innovation Award.

NVIDIA’s research at CVPR includes a text-to-image model that can be easily customized to depict a specific object or character, a new model for object pose estimation, a technique to edit neural radiance fields ( NeRFs ) and a visual language model that can understand memes. Additional papers introduce domain-specific innovations for industries including automotive, healthcare and robotics.

Collectively, the work introduces powerful AI models that could enable creators to more quickly bring their artistic visions to life, accelerate the training of autonomous robots for manufacturing, and support healthcare professionals by helping process radiology reports.

“Artificial intelligence, and generative AI in particular, represents a pivotal technological advancement,” said Jan Kautz, vice president of learning and perception research at NVIDIA. “At CVPR, NVIDIA Research is sharing how we’re pushing the boundaries of what’s possible — from powerful image generation models that could supercharge professional creators to autonomous driving software that could help enable next-generation self-driving cars.”

At CVPR, NVIDIA also announced NVIDIA Omniverse Cloud Sensor RTX , a set of microservices that enable physically accurate sensor simulation to accelerate the development of fully autonomous machines of every kind.

Forget Fine-Tuning: JeDi Simplifies Custom Image Generation

Creators harnessing diffusion models, the most popular method for generating images based on text prompts, often have a specific character or object in mind — they may, for example, be developing a storyboard around an animated mouse or brainstorming an ad campaign for a specific toy.

Prior research has enabled these creators to personalize the output of diffusion models to focus on a specific subject using fine-tuning — where a user trains the model on a custom dataset — but the process can be time-consuming and inaccessible for general users.

JeDi , a paper by researchers from Johns Hopkins University, Toyota Technological Institute at Chicago and NVIDIA, proposes a new technique that allows users to easily personalize the output of a diffusion model within a couple of seconds using reference images. The team found that the model achieves state-of-the-art quality, significantly outperforming existing fine-tuning-based and fine-tuning-free methods.

JeDi can also be combined with retrieval-augmented generation , or RAG, to generate visuals specific to a database, such as a brand’s product catalog.
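The retrieval half of such a pipeline can be as simple as a nearest-neighbor search over image embeddings. The snippet below is a hypothetical sketch, not JeDi's implementation, and the embedding model is left abstract: it selects the catalog entries most similar to a query embedding, which could then serve as the reference images for generation.

    # Hypothetical sketch of the retrieval step in retrieval-augmented generation (not JeDi's code).
    import numpy as np

    def retrieve_references(query_emb, catalog_embs, k=3):
        """Return indices of the k catalog embeddings most similar to the query (cosine similarity)."""
        q = query_emb / np.linalg.norm(query_emb)
        c = catalog_embs / np.linalg.norm(catalog_embs, axis=1, keepdims=True)
        return np.argsort(-(c @ q))[:k]

    rng = np.random.default_rng(0)
    catalog = rng.normal(size=(100, 512))        # stand-in embeddings for a 100-item catalog
    query = rng.normal(size=512)                 # stand-in embedding of the desired subject
    print(retrieve_references(query, catalog))   # indices of the 3 closest catalog images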

New Foundation Model Perfects the Pose

NVIDIA researchers at CVPR are also presenting FoundationPose , a foundation model for object pose estimation and tracking that can be instantly applied to new objects during inference, without the need for fine-tuning.

The model, which set a new record on a popular benchmark for object pose estimation, uses either a small set of reference images or a 3D representation of an object to understand its shape. It can then identify and track how that object moves and rotates in 3D across a video, even in poor lighting conditions or complex scenes with visual obstructions.
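For context, an object pose in this setting is conventionally a 6-DoF rigid transform: a rotation R and a translation t that map points from the object's coordinate frame into the camera frame, after which a pinhole camera model projects them into the image. The sketch below illustrates that standard representation; it is an assumed illustration, not FoundationPose's code.

    # Minimal sketch of applying a 6-DoF object pose and projecting to pixels (illustration only).
    import numpy as np

    def project_object_points(points_obj, R, t, K):
        """Map 3D points from the object frame into the camera frame, then project to pixels."""
        pts_cam = points_obj @ R.T + t           # rigid transform: X_cam = R @ X_obj + t
        uvw = pts_cam @ K.T                      # pinhole intrinsics
        return uvw[:, :2] / uvw[:, 2:3]          # perspective divide -> pixel coordinates

    theta = np.deg2rad(30)                       # rotate the object 30 degrees about Z
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    t = np.array([0.0, 0.0, 2.0])                # object centered 2 m in front of the camera
    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])              # toy intrinsics for a 640x480 image
    cube = np.array([[x, y, z] for x in (-0.1, 0.1) for y in (-0.1, 0.1) for z in (-0.1, 0.1)])
    print(project_object_points(cube, R, t, K))  # eight projected cube corners

Tracking then amounts to re-estimating R and t in every frame so that the projected model stays aligned with the observed object.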

FoundationPose could be used in industrial applications to help autonomous robots identify and track the objects they interact with. It could also be used in augmented reality applications where an AI model is used to overlay visuals on a live scene.

NeRFDeformer Transforms 3D Scenes With a Single Snapshot

A NeRF is an AI model that can render a 3D scene based on a series of 2D images taken from different positions in the environment. In fields like robotics, NeRFs can be used to generate immersive 3D renders of complex real-world scenes, such as a cluttered room or a construction site. However, to make any changes, developers would need to manually define how the scene has transformed — or remake the NeRF entirely.
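For readers unfamiliar with the machinery, the original NeRF formulation renders the color of a camera ray r(t) = o + t d by integrating a learned density σ and a view-dependent color c along the ray. The equation below is the standard NeRF volume-rendering integral, not something specific to NeRFDeformer:

    C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
    \qquad
    T(t) = \exp\!\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right)

In practice the integral is approximated by sampling points along each ray and alpha-compositing their predicted colors, and the network weights are optimized so that the rendered pixels match the input photographs.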

Researchers from the University of Illinois Urbana-Champaign and NVIDIA have simplified the process with NeRFDeformer. The method, being presented at CVPR, can successfully transform an existing NeRF using a single RGB-D image, which is a combination of a normal photo and a depth map that captures how far each object in a scene is from the camera.
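To make the RGB-D idea concrete, the depth channel lets every pixel be lifted to a 3D point once the camera intrinsics are known. The following minimal sketch is an assumed illustration, unrelated to NeRFDeformer's implementation; it back-projects a depth map into a colored point cloud.

    # Minimal sketch: lifting an RGB-D image to a colored 3D point cloud (illustration only).
    import numpy as np

    def rgbd_to_point_cloud(rgb, depth, fx, fy, cx, cy):
        """rgb: (H, W, 3); depth: (H, W) in meters. Returns (N, 3) points and (N, 3) colors."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) * depth / fx                # back-project pixel (u, v) with its depth
        y = (v - cy) * depth / fy
        points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
        colors = rgb.reshape(-1, 3)
        valid = points[:, 2] > 0                 # drop pixels with no depth reading
        return points[valid], colors[valid]

    rgb = np.zeros((4, 4, 3), dtype=np.uint8)    # tiny synthetic 4x4 image
    depth = np.full((4, 4), 1.5)                 # constant 1.5 m depth
    pts, cols = rgbd_to_point_cloud(rgb, depth, fx=500, fy=500, cx=2, cy=2)
    print(pts.shape)                             # (16, 3)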


VILA Visual Language Model Gets the Picture

A CVPR research collaboration between NVIDIA and the Massachusetts Institute of Technology is advancing the state of the art for vision language models, which are generative AI models that can process videos, images and text.

The group developed VILA , a family of open-source visual language models that outperforms prior neural networks on key benchmarks that test how well AI models answer questions about images. VILA’s unique pretraining process unlocked new model capabilities, including enhanced world knowledge, stronger in-context learning and the ability to reason across multiple images.

Figure: how VILA can reason based on multiple images.

The VILA model family can be optimized for inference using the NVIDIA TensorRT-LLM open-source library and can be deployed on NVIDIA GPUs in data centers, workstations and even edge devices .

Read more about VILA on the NVIDIA Technical Blog and GitHub .

Generative AI Fuels Autonomous Driving, Smart City Research

A dozen of the NVIDIA-authored CVPR papers focus on autonomous vehicle research. Other AV-related highlights include:

  • NVIDIA’s AV applied research , which won the CVPR Autonomous Grand Challenge , is featured in this demo .
  • Sanja Fidler , vice president of AI research at NVIDIA, will present on vision language models at the Workshop on Autonomous Driving on June 17.
  • Producing and Leveraging Online Map Uncertainty in Trajectory Prediction , a paper authored by researchers from the University of Toronto and NVIDIA, has been selected as one of 24 finalists for CVPR’s best paper award.

Also at CVPR, NVIDIA contributed the largest ever indoor synthetic dataset to the AI City Challenge , helping researchers and developers advance the development of solutions for smart cities and industrial automation. The challenge’s datasets were generated using NVIDIA Omniverse , a platform of APIs, SDKs and services that enable developers to build Universal Scene Description (OpenUSD) -based applications and workflows.

NVIDIA Research has hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics. Learn more about NVIDIA Research at CVPR .


IMAGES

  1. Computer Vision Research Proposal

  2. What are Computer Vision and why do we use it? by Baljeet Singh

  3. (PDF) A Research Proposal for A PhD in Computational Complexity

  4. (PDF) Synthetic Dataset Creation For Computer Vision Application

  5. Proposal For The Computer Vision

  6. research proposal example in computer science

VIDEO

  1. Project Proposal for Computer programming

  2. How to add a citation in a research paper, thesis, or research proposal

  3. Advancing the state of the art in computer vision with self-supervised Vision Transformers

  4. Microsoft Research : The vision

  5. AI in Computer Vision Technology and the Impact on Healthcare

  6. Computer vision enabling smart campuses

COMMENTS

  1. CSSA Sample PhD proposals


  2. Research Areas in Computer Vision: Trends and Challenges

    Basics of Computer Vision. Computer Vision (CV) is a field of artificial intelligence that trains computers to interpret and understand the visual world. Using digital images from cameras and videos, along with deep learning models, computers can accurately identify and classify objects, and then react to what they "see."

  3. Your 2024 Guide to Computer Vision Research

    Here are the steps involved in identifying the problem statement in computer vision research: Problem Statement Analysis: The first step is to pinpoint the specific application domain within computer vision. This could be related to object recognition in autonomous vehicles or medical image analysis for disease detection.

  4. (PDF) Computer Vision Networks: a research proposal

    Final presentation of the project "Computer vision networks. Developing digital visual methods for social and media research". Project developed in 2021 at the Center for Advanced Internet Studies ...

  5. Deep learning in computer vision: A critical review of emerging

    The features of big data could be captured by DL automatically and efficiently. The current applications of DL include computer vision (CV), natural language processing (NLP), video/speech recognition (V/SP), and finance and banking (F&B). Chai and Li (2019) provided a survey of DL on NLP and the advances on V/SP. The survey emphasized the ...

  6. PDF The Computer Vision Project (draft 1.0)

    1.2 Project Milestones. The following are the main milestones in the progress of a project: Project Topic Selection. Background Research and Algorithm Selection. The Project Proposal. At this stage the following needs to be clearly specified: Algorithm design. Experiment design. E1: Identify a data set.

  7. PDF CS 6384 Computer Vision Project Proposal Description

    CS 6384 Computer Vision Project Proposal Description. Professor Yu Xiang February 6, 2022. 1 Introduction. For the computer vision course project, students can choose a topic related to computer vision and explore the topic in one of three different ways: • Research-oriented. In this direction, students are going to propose a new idea ...

  8. PDF Stanford Computational Vision and Geometry Lab


  9. PDF Writing a Research Proposal

    processing in many practical computer vision systems. The development of static image segmentation algorithms has attracted considerable research interest and is enriched by a wide range of methodologies. However, work that has been published in the video analysis domain is still quite narrow and biased towards the sole use of motion ...

  10. Master thesis project proposals

    If you are interested in doing a research-related project but do not see a suitable one listed here, feel free to contact one of the researchers at the lab. ... [2022-10-10] Zenseact: Multiple computer vision master's thesis proposals, e.g. Learning-based Road Estimation [2022-09-06] FOI: Neuromorphic Imaging (Neuromorfisk Avbildning) [2022-02-21] FOI: Night Vision (Mörkerseende) ...

  11. Computer Vision Networks: a research proposal

    Computer Vision Networks: a research proposal. March 2021. Authors: Janna Joceli Omena. Universidade NOVA de Lisboa. To read the file of this research, you can request a copy directly from the author.

  12. PDF Object Proposals in Computer Vision

    The field of computer vision was initially conceived as a summer undergraduate project [3] in 1966. Notwithstanding the seemingly simple definition of 'seeing', it has proved to be a tough problem to solve. Going beyond perception and interpretation of visual data, research in computer vision now encompasses the following areas:

  13. Computer Vision

    Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets. ... Browse SoTA > Computer Vision: 4754 benchmarks • 1452 tasks • 3074 datasets • 49100 papers with code. Semantic Segmentation ...

  14. Computer Vision (proposal) PhD Projects, Programmes ...

    Computer Vision is one of the most active areas where artificial intelligence (AI) is being used. This area is expanding rapidly and attracting a great deal of interest and investment. Read more. Supervisor: Dr H Kim. 19 June 2024 PhD Research Project Competition Funded PhD Project (Students Worldwide)

  15. Proposed PhD Projects in Computer Vision

    Proposed PhD Projects in Computer Vision. Synthesis of Stereoscopic Movie from Conventional Monocular Video Clips. In order to provide material for 3-dimensional television displays, methods are required for producing 3-dimensional video material from existing 2-dimensional video, such as old films. This project seeks to develop automatic and ...

  16. (PDF) Computer vision and its application

    Research Proposal PDF Available. Computer vision and its application. November 2018; Authors: ... One application where this has been very prominent is the analysis of images, i.e., computer vision.

  17. Research Topics of the Computer Vision & Graphics Group

    Dr.-Ing. Anna Hilsmann. Head of Vision & Imaging Technologies Department. Head of Computer Vision & Graphics Group. Phone +49 30 31002-569. Innovations for the digital society of the future are the focus of research and development work at the Fraunhofer HHI. The institute develops standards for information and communication technologies and ...

  18. Computer Vision Research Proposal

    Computer Vision Research Proposal - Free download as PDF File (.pdf), Text File (.txt) or read online for free. This document outlines a research project on sensor identification for digital image forensics. The goals are to determine what camera captured a given image and to improve existing techniques. Over the summer, the student will collect an image database using multiple cameras under ...

  19. Top Computer Vision Papers of All Time (Updated 2024)

    Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015) YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016) Mask R-CNN (2017) EfficientNet - Rethinking Model Scaling for Convolutional Neural Networks (2019) About us: Viso Suite is the end-to-end computer vision solution for enterprises. With a ...

  20. Project Proposal

    Project Proposal. For the course project you will explore a topic in-depth of your own choosing. This can be an implementation (implement an existing algorithm); an application (apply a computer vision algorithm to a new problem); or research (trying to invent something new). To get you started, we have prepared a list of suggested projects.

  21. Computer Vision (research proposal form) PhD Projects ...

    We have 11 Computer Vision (research proposal form) PhD Projects, Programmes & Scholarships in the UK. More Details. Human emotion analysis and recognition for improving trusted human-robot interaction. Main project focus: AI and Robotics.

  22. CS231n: Deep Learning for Computer Vision

    Applications. If you're coming to the class with a specific background and interests (e.g. biology, engineering, physics), we'd love to see you apply vision models learned in this class to problems related to your particular domain of interest. Pick a real-world problem and apply computer vision models to solve it. Models.

  23. Computer Vision for Global Challenges request for proposals

    Facebook is calling for proposals for pilot and early-stage research that extends computer vision technologies in developing countries. We specifically seek projects that address the technical challenges impeding computer vision in these contexts, including data and hardware limitations and better integration of new information sources, such as high-resolution satellite imagery.

  24. A Review on Computer Vision-Based Methods for Human Action Recognition

    Abstract. Human action recognition targets recognising different actions from a sequence of observations and different environmental conditions. A wide range of applications is possible for vision-based action recognition research. These can include video surveillance, tracking, health care, and human-computer interaction.

  25. Microsoft at CVPR 2024: Innovations in computer vision and AI research

    Microsoft is proud to sponsor the 41st annual Conference on Computer Vision and Pattern Recognition (CVPR 2024), held from June 17 to June 21. This premier conference covers a broad spectrum of topics in the field, including 3D reconstruction and modeling, action and motion analysis, video and image processing, synthetic data generation, neural networks, and […]

  26. Research Proposal PDF

    Research Proposal.pdf - Free download as PDF File (.pdf), Text File (.txt) or read online for free. The document proposes a method to detect human poses in video frames using pictorial structure modeling and estimate poses. Key steps include detecting humans using weak constraints on body part position and appearance, estimating poses represented as pictorial structures, and classifying poses ...

  27. Research Proposal

    research proposal - Free download as PDF File (.pdf), Text File (.txt) or read online for free. This document outlines a research proposal to investigate using Capsule Neural Networks (CapsNets) for traffic light image recognition in autonomous vehicles. It hypothesizes that CapsNets may improve upon current Convolutional Neural Network (CNN) methods by better preserving positional data.

  28. Temporal adaptive feature pyramid network for action detection

    Abstract: Detecting actions in videos has become a prominent research task due to its wide application. ... Bai Y., Wang Y., Tong Y., Yang Y., Liu Q., Liu J., Boundary content graph neural network for temporal action proposal generation, in: European Conference on Computer Vision, Springer ...

  29. CVPR 2024 Announces Best Paper Award Winners

    SEATTLE, 19 June 2024 - Today, during the 2024 Computer Vision and Pattern Recognition (CVPR) Conference opening session, the CVPR Awards Committee announced the winners of its prestigious Best Paper Awards, which annually recognize top research in computer vision, artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and much more.

  30. NVIDIA Research Showcases Visual Generative AI at CVPR

    More than 50 of these projects will be showcased at the Computer Vision and Pattern Recognition (CVPR) conference, taking place June 17-21 in Seattle. Two of the papers — one on the training dynamics of diffusion models and another on high-definition maps for autonomous vehicles — are finalists for CVPR's Best Paper Awards.