Welcome to the on-line version of the UNC dissertation proposal collection. The purpose of this collection is to provide examples of proposals for those of you who are thinking of writing a proposal of your own. I hope that this on-line collection proves to be more difficult to misplace than the physical collection that periodically disappears.

If you are preparing to write a proposal you should make a point of reading the excellent document The Path to the Ph.D., written by James Coggins. It includes advice about selecting a topic, preparing a proposal, taking your oral exam and finishing your dissertation. It also includes accounts by many people about the process that each of them went through to find a thesis topic.

Adding to the Collection

This collection of proposals becomes more useful with each new proposal that is added. If you have an accepted proposal, please help by including it in this collection. You may notice that the bulk of the proposals currently in this collection are in the area of computer graphics. This is an artifact of my knowing more computer graphics folks to pester for their proposals. Add your non-graphics proposal to the collection and help remedy this imbalance!

There are only two requirements for a UNC proposal to be added to this collection. The first requirement is that your proposal must be completely approved by your committee. If we adhere to this, then each proposal in the collection serves as an example of a document that five faculty members have signed off on. The second requirement is that you supply, as best you can, exactly the document that your committee approved. While reading over my own proposal I winced at a few of the things that I had written. I resisted the temptation to change the document, however, because this collection should truly reflect what an accepted thesis proposal looks like.

Note that there is no requirement that the author has finished his/her Ph.D. Several of the proposals in the collection were written by people who, as of this writing, are still working on their dissertation. This is fine!

I encourage people to submit their proposals in any form they wish. Perhaps the most useful forms at the present are Postscript and HTML, but this may not always be so. Greg Coombe has generously provided LaTeX thesis style files, which, he says, conform to the 2004-2005 style requirements.
Many thanks to everyone who contributed to this collection!
- Greg Coombe, "Incremental Construction of Surface Light Fields" in PDF
- Karl Hillesland, "Image-Based Modelling Using Nonlinear Function Fitting on a Stream Architecture" in PDF
- Martin Isenburg, "Compressing, Streaming, and Processing of Large Polygon Meshes" in PDF
- Ajith Mascarenhas, "A Topological Framework for Visualizing Time-varying Volumetric Datasets" in PDF
- Josh Steinhurst, "Practical Photon Mapping in Hardware" in PDF
- Ronald Azuma, "Predictive Tracking for Head-Mounted Displays," in Postscript
- Mike Bajura, "Virtual Reality Meets Computer Vision," in Postscript
- David Ellsworth, "Polygon Rendering for Interactive Scientific Visualization on Multicomputers," in Postscript
- Richard Holloway, "A Systems-Engineering Study of the Registration Errors in a Virtual-Environment System for Cranio-Facial Surgery Planning," in Postscript
- Victoria Interrante, "Uses of Shading Techniques, Artistic Devices and Interaction to Improve the Visual Understanding of Multiple Interpenetrating Volume Data Sets," in Postscript
- Mark Mine, "Modeling From Within: A Proposal for the Investigation of Modeling Within the Immersive Environment" in Postscript
- Steve Molnar, "High-Speed Rendering using Scan-Line Image Composition," in Postscript
- Carl Mueller, "High-Performance Rendering via the Sort-First Architecture," in Postscript
- Ulrich Neumann, "Direct Volume Rendering on Multicomputers," in Postscript
- Marc Olano, "Programmability in an Interactive Graphics Pipeline," in Postscript
- Krish Ponamgi, "Collision Detection for Interactive Environments and Simulations," in Postscript
- Russell Taylor, "Nanomanipulator Proposal," in Postscript
- Greg Turk, "Generating Textures on Arbitrary Surfaces," in HTML and Postscript
- Terry Yoo, "Statistical Control of Nonlinear Diffusion," in Postscript
![research proposal on computer vision OpenCV](https://opencv.org/wp-content/uploads/2022/05/logo.png)
Open Computer Vision Library
A Comprehensive Guide to Computer Vision Research in 2024
bharat, January 17, 2024. Category: AI Careers. Tags: ai, computer vision, computer vision research, computer vision research groups, deep learning, OpenCV
![research proposal on computer vision guide to computer vision research](https://opencv.org/wp-content/uploads/2024/01/Artboard-1-copy-4.png)
Introduction
In our earlier blogs, we discussed the best institutes across the world for computer vision research. In this fun read, we’ll look at the different stages of Computer Vision research and how you can go about publishing your research work. Let us delve into them now. Looking to become a Computer Vision Engineer? Check out our Comprehensive Guide!
Table of Contents
- Introduction
- Different Stages of Computer Vision
- Research Publications
Different Stages of Computer Vision Research
Computer Vision Research can be put into various stages, one building to the next. Let us look at them in detail.
Identification of Problem Statement
Computer Vision research starts with identifying the problem statement. It is a crucial step in defining the scope and goals of a research project. It involves clearly understanding the specific challenge or task the researchers aim to address using computer vision techniques. Here are the steps involved in identifying the problem statement in computer vision research:
- Problem Statement Analysis: The first step is to pinpoint the specific application domain within computer vision. This could be related to object recognition in autonomous vehicles or medical image analysis for disease detection.
- Defining the problem: Next, we define the precise problem we want to solve within that domain, like classifying images of animals or diagnosing diseases from X-rays.
- Understanding the objectives: We need to understand the research objectives and outline what we intend to achieve through this project. For instance, improving classification accuracy or reducing false positives in a medical imaging system.
- Data availability: Next, we need to analyze the availability of data for our project. Check if existing datasets are suitable for our task or if we need to gather our own data, like collecting images of specific objects or medical cases.
- Review: Conduct a thorough review of existing research and the latest methodologies in the field. This will help you gain insights into the current state-of-the-art techniques and the challenges others have faced in similar projects.
- Question formulation: Once we review the work, we can formulate research questions to guide our experiments. These questions could address specific aspects of our computer vision problem and help better structure our research.
- Metrics: Next, we define the evaluation metrics that we’ll use to measure the performance of our vision system. Some common metrics include accuracy, precision, recall, and F1-score.
- Highlighting: Highlight how solving the problem will have an effect in the real world. For instance, improving road safety through better object recognition or enhanced medical diagnoses for early treatment.
- Research Outline: Finally, outline the research plan, and detail the methodology employed for data collection, model development, and evaluation. A structured outline will ensure we are on the right track throughout our research project.
![research proposal on computer vision](https://opencv.org/wp-content/uploads/2024/01/image-6.jpeg)
Let us move to the next step, data collection and creation.
Dataset Collection and Creation
Creating and gathering datasets is one of the key building blocks in computer vision research. These datasets facilitate the algorithms and models used in vision systems. Let us see how this is done.
- Firstly we need to know what we are trying to solve. For instance, are we training models to recognize dogs in photos or identify anomalies in medical images?
- Now, we’ll need images or videos. Depending on the research needs, we can find them on public datasets or collect our own.
- Next, we mark up the data. For instance, if you’re teaching a computer to spot dogs in pictures, you’ll draw boxes around the dogs and say, “These are dogs!”
- Raw data can be a mess. We may need to resize images, adjust colors, or add more examples to ensure our dataset is neat and complete.
- Then we split the dataset into three parts:
  - one part for training your model
  - one part for fine-tuning (validation)
  - one part for testing how well your model works
- Next, ensure the dataset fairly represents the real world and doesn’t favor one group or category too much.
One can also share their dataset and research with others for inputs and improvements. Dataset collection and creation are vital in computer vision research.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) briefly analyzes a dataset to answer preliminary questions and guide the modeling process. For instance, this could be looking for patterns across different classes. This is not only used by Computer Vision Engineers but also Data Scientists to ensure that the data they provide are aligned with different business goals or outcomes. This step involves understanding the specifics of image datasets. For instance, EDA is used to spot anomalies, understand data distribution, or gain insights to further model training. Let us look at the role of EDA in model development.
- With EDA, one can develop data preprocessing pipelines and choose data augmentation strategies.
- Findings from EDA can inform the choice of model architecture, for instance how many convolutional layers are needed or what input image size to use.
- EDA is also crucial for advanced Computer Vision tasks like object detection, segmentation, and image generation backed by studies.
![research proposal on computer vision data preparation](https://opencv.org/wp-content/uploads/2024/01/image-3.png)
Now let us dive into the specifics of EDA methods and preparing image datasets for model development.
Visualization
- Sample Image Visualization involves displaying a random set of images from the dataset. This is a fundamental step where we get an idea of the data like lighting conditions or variations in image quality. From this, one can infer the visual diversity and any challenges in the dataset.
- Analyzing the distribution of pixel intensities offers insights into the brightness and contrast variations across the dataset and shows whether image enhancement techniques are needed.
- Next, creating histograms for different color channels gives us a better understanding of the color distribution of the dataset. This is a crucial step for tasks such as image classification.
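As a rough illustration of the per-channel histograms described above, here is a minimal sketch in plain Python. The nested-list image format, the `channel_histograms` helper, and the choice of four bins are assumptions made for this example; in practice one would usually load real images and use a numerical library.

```python
def channel_histograms(image, bins=4, max_value=255):
    """Return one histogram (a list of counts) per color channel.

    `image` is a hypothetical nested list: rows of pixels,
    each pixel an (R, G, B) tuple with values in 0..max_value.
    """
    n_channels = len(image[0][0])
    hists = [[0] * bins for _ in range(n_channels)]
    bin_width = (max_value + 1) / bins
    for row in image:
        for pixel in row:
            for c, value in enumerate(pixel):
                # Map the intensity to its bin and count it.
                hists[c][int(value // bin_width)] += 1
    return hists


# A tiny 2x2 "image": two reddish pixels, one blue, one gray.
image = [
    [(255, 0, 0), (250, 10, 5)],
    [(0, 0, 255), (128, 128, 128)],
]
hists = channel_histograms(image)
for name, hist in zip("RGB", hists):
    print(name, hist)
```

A red histogram that is heavy in the top bin, as here, hints at a color bias that may matter for classification.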
Image Property Analysis
- Another crucial part is understanding the resolution and the aspect ratio of images in the dataset. It helps make decisions like resizing the image or normalizing the aspect ratio, which is crucial in maintaining consistency in input data for neural networks.
- Analyzing the size and distribution of annotated objects can be insightful in datasets with annotations. This helps in understanding the scale of objects and influences the design of the layers in the neural network.
Correlation Analysis
- In more advanced EDA on high-dimensional image data, analyzing the relation between different features is helpful. This aids dimensionality reduction or feature selection.
- Next, it is crucial to understand the spatial correlations within images, like the relationship between different regions in an image. It helps in the development of spatial hierarchies in neural networks.
Class Distribution Analysis
- EDA is important in understanding imbalances in class distribution. This is key in classification tasks where imbalanced data can lead to biased models.
- Once the imbalances are identified, we can adopt techniques like undersampling majority classes or oversampling minority classes during model training.
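To make the idea concrete, the sketch below counts labels and then oversamples the minority class by drawing extra indices with replacement. The `oversample_minority` helper and the toy labels are hypothetical; this is one simple way to rebalance, not the only one.

```python
import random
from collections import Counter


def oversample_minority(labels, seed=0):
    """Return sample indices in which every class appears as often
    as the largest class (minority samples are repeated)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        # Draw the missing samples with replacement from this class.
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced


labels = ["cat"] * 8 + ["dog"] * 2
print("before:", Counter(labels))
balanced = oversample_minority(labels)
print("after: ", Counter(labels[i] for i in balanced))
```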
Geometric Analysis
- Understanding geometric properties like edges, shapes, and textures in images offers insights into the features important for the problem at hand. We can make informed decisions on selecting specific filters or layers in the network architecture.
- It’s important to understand how different morphological transformations affect images for segmentation and object detection tasks.
Sequential Analysis
The sequential analysis applies to video data.
- For instance, analyzing changes between frames can offer information like motion, temporal consistency, or the need for temporal modeling in video datasets or video sequences.
- Identifying temporal variations and scene changes gives us insights into the dynamics within the video data that are crucial for tasks like event detection or action recognition.
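A very simple way to look at changes between frames is to compute the mean absolute pixel difference between consecutive frames and flag large jumps. The sketch below assumes grayscale frames stored as nested lists and an arbitrary threshold of 50; both are illustrative choices rather than recommended settings.

```python
def mean_abs_frame_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def detect_scene_changes(frames, threshold=50.0):
    """Return the indices i where frame i differs strongly from frame i - 1."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_frame_diff(frames[i - 1], frames[i]) > threshold
    ]


dark = [[10, 10], [10, 10]]
dark_shifted = [[12, 9], [10, 11]]       # small change: sensor noise or slight motion
bright = [[200, 210], [205, 198]]        # large change: a new scene
frames = [dark, dark_shifted, bright, bright]
print(detect_scene_changes(frames))
```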
Now that we’ve discussed Exploratory Data Analysis and some of its techniques let us move to the next stage in Computer Vision research, defining the model architecture.
Defining Model Architecture
Defining a model architecture is a critical component of research in computer vision, as it lays the foundation for how a machine learning model will perceive, process, and interpret visual data. The chosen architecture directly impacts the ability of the model to learn from visual data and perform tasks like object detection or semantic segmentation.
Model architecture in computer vision refers to the structural design of an artificial neural network. The architecture defines how the model processes input images, extracts features, and makes predictions and classifications.
What are the components of a model architecture? Let’s explore them.
![research proposal on computer vision model architecture](https://opencv.org/wp-content/uploads/2024/01/image-4.png)
Input Layer
This is where the model receives the image data, mostly in the form of a multi-dimensional array. For colored images, this could be a 3D array where color channels show RGB values. Preprocessing steps like normalization are applied here.
Convolutional Layers
These layers apply a set of filters to the input. Every filter convolves across the width and height of the input volume, computing the dot product between the entries of the filter and the input, producing a 2D activation map for each filter. Preserving the relationship between pixels captures spatial hierarchies in the image.
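The sliding dot product can be sketched in a few lines of plain Python. This version assumes a single-channel image, no padding, and a stride of 1; the `conv2d` name and the hand-made edge filter are illustrative, whereas real layers learn their filters and run on optimized libraries.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    2D activation map of dot products, as CNN layers compute it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the filter and the image patch under it.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# An image with a dark left half and a bright right half.
image = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
vertical_edge = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions
activation_map = conv2d(image, vertical_edge)
print(activation_map)
```

The activation map is non-zero only where the filter straddles the edge, which is how a filter comes to "detect" a local pattern.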
Activation Functions
Activation functions enable networks to learn more complex representations by introducing non-linearity. For instance, the ReLU (Rectified Linear Unit) function applies a non-linear transformation (f(x) = max(0,x)) that retains only positive values and sets all negative values to zero. Other functions include sigmoid and tanh.
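For reference, the three functions named above can be written directly from their definitions. This is a scalar sketch for illustration; frameworks apply the same functions element-wise to whole tensors.

```python
import math


def relu(x):
    """f(x) = max(0, x): keeps positive values, zeroes out negatives."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(x)


for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), round(sigmoid(value), 4), round(tanh(value), 4))
```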
Pooling Layers
These layers perform a down-sampling operation along the spatial dimensions (width, height), reducing the number of parameters and computations in the network. For instance, max pooling, a common approach, takes the maximum value from the set of values in each filter window. This operation offers a degree of spatial invariance, making the recognition of features robust to small shifts and distortions in the input.
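Max pooling is easy to sketch by hand. The example below assumes a 2x2 window with a stride of 2, a common but not universal setting, and the `max_pool2d` name is just for illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Down-sample a 2D feature map by taking the maximum of each window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(feature_map)
print(pooled)
```

The 4x4 map shrinks to 2x2, so later layers see a quarter of the values.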
Fully Connected Layers
Here, the layers connect every neuron in one layer to every neuron in the next layer. In a CNN, the high-level reasoning in the neural network is performed via these dense layers. Typically, they are positioned near the end of the network and are used to flatten the output of convolutional and pooling layers to form a single vector of features used for final classification or regression tasks.
Dropout Layers
Dropout is a regularization technique where randomly selected neurons are ignored during training. This means that the contribution of these neurons to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass. This helps in preventing overfitting.
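Here is a small sketch of the forward pass of dropout. It uses the "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged; that rescaling, the `dropout` helper, and the rate of 0.5 are assumptions for this example rather than something the text above prescribes.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=None):
    """Zero each activation with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) (inverted dropout) so the
    expected activation stays the same; at inference nothing is dropped.
    """
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


activations = [1.0] * 1000
dropped = dropout(activations, rate=0.5, rng=random.Random(0))
print("zeroed:", dropped.count(0.0), "of", len(dropped))
print("inference output unchanged:", dropout([1.0, 2.0], training=False))
```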
Batch Normalization
In batch normalization, the output from a previous activation layer is normalized by subtracting the batch mean and then dividing it by the standard deviation of the batch. This technique helps stabilize the learning process and can significantly reduce the number of training epochs required to train deep networks.
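The normalization step can be sketched for a single feature across one batch. The scale `gamma`, shift `beta`, and the small constant `eps` that guards against division by zero are not mentioned above; they are included here as assumptions because common implementations use them.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch to roughly zero mean, unit variance.

    gamma (scale), beta (shift) and eps (numerical safety) are assumed extras.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
print([round(v, 3) for v in normalized])
```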
Loss Function
The difference between the expected outcomes and the predictions made by the model is quantified by the loss function. Cross-entropy for classification tasks and mean squared error for regression tasks are some of the common loss functions in computer vision.
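Both losses follow directly from their definitions. The sketch below assumes the classifier already outputs class probabilities and that the true class is given as an index; the function names are illustrative.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log of the probability the model assigned to the true class."""
    return -math.log(probabilities[true_class])


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# A confident correct prediction is penalized less than an unsure one.
print(cross_entropy([0.9, 0.1], true_class=0))
print(cross_entropy([0.5, 0.5], true_class=0))
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```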
Optimizer
The optimizer is an algorithm used to minimize the loss function. It updates the network’s weights based on the loss gradient. Some common optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop. They use the gradients computed by backpropagation to determine the direction in which each weight should be adjusted to minimize the loss.
Output Layer
This is the final layer, where the model’s output is produced. The output layer typically includes a softmax function for classification tasks that converts the outputs to probability values for each class. For regression tasks, the output layer may have a single neuron.
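As a quick sketch, softmax exponentiates each raw output and divides by the sum so that the results form a probability distribution. Subtracting the maximum first is a standard numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(logits)
    # Shifting by the maximum avoids overflow in exp() for large scores.
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probabilities])
```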
Frameworks like TensorFlow, PyTorch, and Keras are widely used for designing and implementing model architectures. They offer pre-built layers, training routines, and easy integration with hardware accelerators.
Defining a model architecture requires a good grasp of both the theoretical aspects of neural networks and the practical aspects of the specific task.
Training and Validation
Training and validation are crucial in developing a model. They help evaluate a model’s performance, especially when dealing with object detection or image classification tasks.
In this phase, the model is represented as a neural network that learns to recognize image patterns and features by altering its internal parameters iteratively. These parameters are weights and biases related to the network’s layers. Training is key for extracting meaningful features from raw visual data. Let us see how one can go about training a model.
- Acquiring a dataset is the first step. It could be in the form of images or videos for model learning purposes. For robustness, they cover various environmental conditions, variations, and object classes.
- Resizing is where all the input data has the same dimensions for batch processing.
- In Normalization, pixels are standardized to zero mean and unit variance, aiding convergence.
- Augmentation applies random transformations to increase the size of the dataset artificially, thereby improving the model’s ability to generalize.
- Once data preprocessing is done, we must choose the appropriate neural network architecture catering to the specific vision task. For instance, CNNs are widely used for image-related tasks.
- Next, we initialize the model parameters, usually weights and biases, using random values or pre-trained weights from a model trained on a large dataset. Transfer learning of this kind can significantly improve performance, especially when data is limited.
- Then we choose an optimization algorithm, such as stochastic gradient descent (SGD) or RMSprop, to adjust the model’s parameters iteratively. Gradients of the loss with respect to the model’s parameters are computed through backpropagation and used to update the parameters.
- Once the optimizer is chosen, the training data is passed through the network in mini-batches, computing the loss for each mini-batch and performing gradient updates. This continues until the loss falls below a predefined threshold.
- Next, we optimize the training performance and convergence speed by fine-tuning the hyperparameters. This could be done by adjusting learning rates, batch sizes, weight regularization terms, or network architectures.
- We need to assess the model’s performance using validation or test datasets and eventually deploy the model in real-world applications through software integrations or embedded devices.
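The steps above can be sketched end to end on a toy problem. To stay self-contained, the example trains a one-feature logistic regression instead of a CNN, and it stops after a fixed number of epochs rather than at a loss threshold; the data, hyperparameter values, and function name are all assumptions for illustration.

```python
import math
import random


def train_logistic_regression(xs, ys, epochs=200, batch_size=4,
                              learning_rate=0.5, seed=0):
    """Fit y ~ sigmoid(w * x + b) with mini-batch gradient descent."""
    rng = random.Random(seed)
    w, b = rng.uniform(-0.1, 0.1), 0.0      # random weight initialization
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                # new mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                prediction = 1.0 / (1.0 + math.exp(-(w * xs[i] + b)))
                # Gradient of the cross-entropy loss w.r.t. (w * x + b).
                error = prediction - ys[i]
                grad_w += error * xs[i]
                grad_b += error
            # Gradient update averaged over the mini-batch.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Toy data, already centered: negatives on the left, positives on the right.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic_regression(xs, ys)
predictions = [1 if w * x + b > 0 else 0 for x in xs]
print("w =", round(w, 3), "b =", round(b, 3), "predictions =", predictions)
```

A real vision pipeline replaces the single weight with a network and the toy list with batches of preprocessed images, but the loop has the same shape.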
Now let us move to the next step: validation.
Validation is fundamental for the quantitative assessment of the performance and generalization capabilities of algorithms. It ensures the reliability and effectiveness of the models when applied to real-world data. Validation evaluates the ability of a model to make accurate predictions on previously unseen data, and so gauges how well it generalizes.
Now let us explore some of the key techniques involved in validation.
Cross-Validation Techniques
- K-Fold Cross-Validation is the method where the dataset is partitioned into K non-overlapping subsets. The model is trained and evaluated K times, with each fold taking turns as the validation set while the rest serve as the training set. The results are averaged to obtain a robust performance estimate.
- Leave-One-Out Cross-Validation or LOOCV is an extreme form of cross-validation where each data point is used as the validation set while the remaining data points constitute the training set. LOOCV offers an exhaustive evaluation of model performance.
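The partitioning behind both techniques can be sketched as an index generator. The `k_fold_indices` helper is hypothetical and does no shuffling; setting `k` equal to the number of samples gives leave-one-out cross-validation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k non-overlapping folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)

loocv_folds = list(k_fold_indices(4, 4))   # leave-one-out: k == n_samples
```

In use, the model would be trained and scored once per fold and the scores averaged.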
Stratified Sampling
In imbalanced datasets, where a few classes have significantly fewer instances than others, stratified sampling ensures that the training and validation sets preserve the class distribution of the full dataset.
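One way to sketch stratified sampling is to split each class separately and then merge the pieces. The helper name, the 20% validation fraction, and the toy labels below are assumptions for illustration.

```python
import random
from collections import Counter


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split indices so each class keeps the same share in train and validation."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, validation = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # At least one validation sample per class, even for tiny classes.
        n_val = max(1, round(len(idxs) * val_fraction))
        validation.extend(idxs[:n_val])
        train.extend(idxs[n_val:])
    return train, validation


labels = ["healthy"] * 90 + ["disease"] * 10
train_idx, val_idx = stratified_split(labels)
print("train:", Counter(labels[i] for i in train_idx))
print("validation:", Counter(labels[i] for i in val_idx))
```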
Performance Metrics
To assess the model’s performance, a range of performance metrics suited to computer vision tasks are used. They include, but are not limited to, the following.
- Accuracy is the ratio of the correctly predicted instances to the total number of instances.
- Precision is the proportion of true positive predictions among all positive predictions.
- Recall is the proportion of true positive predictions among all positive instances.
- F1-Score is the harmonic mean of precision and recall.
- Mean Average Precision (mAP) is commonly used in object detection and image retrieval tasks to evaluate the quality of ranked lists of results.
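The first four metrics can be computed from counts of true positives, false positives, and false negatives. The sketch below assumes a binary task with labels 0 and 1; mAP is left out because it needs ranked detections and is more involved.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```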
Hyperparameter Tuning
Validation is closely integrated with hyperparameter tuning, where the model’s hyperparameters are systematically adjusted and evaluated using the validation set. Techniques such as grid search, random search, or Bayesian optimization help identify the optimal hyperparameter configuration for the model.
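Grid search is the simplest of these to sketch: try every combination and keep the one with the best validation score. Here `fake_validation_score` is a made-up stand-in for "train the model and measure validation accuracy", and the grid values are arbitrary.

```python
import itertools


def grid_search(param_grid, evaluate):
    """Evaluate every combination in `param_grid`; return the best one."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_validation_score(params):
    """Hypothetical score that peaks at learning_rate=0.01, batch_size=32."""
    return (1.0
            - abs(params["learning_rate"] - 0.01) * 10
            - abs(params["batch_size"] - 32) / 1000)


grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32, 64]}
best_params, best_score = grid_search(grid, fake_validation_score)
print(best_params, best_score)
```

Random search samples combinations instead of enumerating them, which scales better when the grid is large.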
Data Augmentation
During validation, data augmentation techniques are applied to simulate variations in the input data and so test the model’s robustness and its ability to handle different conditions or transformations.
Training is where the model learns from labeled data, and Validation is where the model’s learning and generalization capabilities are assessed. They ensure that the final model is robust, accurate, and capable of performing well on unseen data, which is critical for computer vision research.
Hyperparameter tuning refers to systematically optimizing hyperparameters in deep learning models for tasks like image processing and segmentation. Hyperparameters control the behavior of the learning algorithm but are not learned from the training data. Fine-tuning them is crucial if we wish to achieve accurate results.
Batch Size
The batch size is the number of training examples used in every forward and backward pass. Large batch sizes offer smoother convergence but need more memory. In contrast, small batch sizes need less memory and can help escape local minima.
Number of Epochs
The number of epochs defines how many times the entire training dataset is processed during training. Too few epochs can lead to underfitting, and too many can lead to overfitting.
Learning Rate
This determines the step size during gradient-based optimization. If the learning rate is too high, it can lead to overshooting, causing the loss function to diverge, and if the learning rate is too low, it can cause slow convergence.
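The effect is easy to see on a toy problem. The sketch below minimizes f(w) = w², whose gradient is 2w, with three learning rates chosen only to illustrate the three regimes.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 (gradient 2 * w) and return the final |w|."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return abs(w)


too_high = gradient_descent(1.1)      # each step overshoots: |w| grows
reasonable = gradient_descent(0.1)    # steadily approaches the minimum at 0
too_low = gradient_descent(0.0001)    # barely moves in 50 steps
print(too_high, reasonable, too_low)
```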
Weight Initialization
The training stability is affected by the initialization of weights. Techniques such as Glorot initialization are designed to address the vanishing gradient problems.
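As a sketch of the uniform variant of Glorot initialization, weights are drawn from the range [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), where fan_in and fan_out are the number of input and output units of the layer. The helper name and the nested-list weight layout are illustrative choices.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng=None):
    """Return a fan_in x fan_out weight matrix drawn from
    U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or random.Random()
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


weights = glorot_uniform(4, 2, rng=random.Random(0))   # limit is exactly 1.0 here
for row in weights:
    print([round(value, 3) for value in row])
```

Wider layers get a smaller range, which keeps the scale of activations and gradients roughly stable from layer to layer.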
Regularization Techniques
Techniques like dropout and weight decay aid in preventing overfitting. Data augmentation, for example with random rotations, also enhances model generalization.
Choice of Optimizer
The optimizer determines how the model weights are updated during training. Optimizers have their own parameters, like momentum, decay rates, and epsilon.
Hyperparameter tuning is usually approached as an optimization problem. Techniques like Bayesian optimization explore the hyperparameter space efficiently, balancing computational cost against performance. A well-defined hyperparameter tuning process not only adjusts individual hyperparameters but also considers their interactions.
Performance Evaluation on Unseen Data
In the earlier section, we discussed how one must go about doing the training and validation of a model. Now we’ll discuss how to evaluate the performance of a model on unseen data.
![research proposal on computer vision performance evaluation on unseen data](https://opencv.org/wp-content/uploads/2024/01/image-2.png)
Training and validation dataset split is paramount when developing and evaluating models. This is not to be confused with the training and validation we discussed earlier for a model. Splitting the dataset for training and validation aids in understanding the model’s performance on unseen data. This ensures that the model generalizes well to new data. Let us look at them.
- A training dataset is a collection of labeled data points for training the model, adjusting parameters, and inferring patterns and features.
- A separate dataset is used for evaluating the model during development for hyperparameter tuning and model selection. This is the Validation dataset.
- Then there is the test dataset, an independent dataset used for assessing the final performance and generalization ability on unseen data.
Splitting datasets is needed to prevent the model from being evaluated on the same data it was trained on, which would give a misleadingly optimistic picture of its performance. Some commonly used split ratios for the dataset are 70:30, 80:20, or 90:10. The larger portion is used for training, while the smaller portion is used for validation.
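A three-way split into the training, validation, and test sets described above can be sketched as a shuffle followed by slicing. The 70/15/15 proportions and the fixed seed are assumptions for this example; any of the ratios mentioned above can be substituted.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle `items` and cut them into train, validation and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)      # fixed seed for a repeatable split
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    validation = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # whatever remains is held out
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

The test list should be touched only once, after model selection on the validation set is finished.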
Research Publications
You have put so much effort into your research paper. But how do you publish it? Where do you publish it? How do you find the right computer vision research groups? That is what this section covers, so let’s get to it.
Conferences
There are some top-tier computer vision conferences happening across the globe. They are among the best places to showcase research work, look for future collaborations, and build networks.
Conference on Computer Vision and Pattern Recognition (CVPR)
Also called CVPR, it is one of the most prestigious conferences in the world of Computer Vision. It is organized by the IEEE Computer Society and is an annual event. It has an amazing history of showcasing cutting-edge research papers in image analysis, object detection, deep learning techniques, and much more. CVPR has set the bar high, placing a strong emphasis on the technical aspects of the submissions. They must meet the following criteria.
Papers must possess an innovative contribution to the field. This could be the development of new algorithms, techniques, or methodologies that can bring advancements in computer vision.
If applicable, the submissions must have mathematical formulations of their methods, like equations and theorem proofs. This offers a solid theoretical foundation for the paper’s approach.
Next, the paper should include comprehensive experimental results involving many datasets and benchmarking against existing models. These are key to demonstrating the effectiveness of your proposed approach.
Clarity – this is a no-brainer; the writing and presentation must be clear and concise. The writers are expected to explain the algorithms, models, and results in a technically sound manner.
![research proposal on computer vision conference on computer vision and pattern recognition](https://opencv.org/wp-content/uploads/2024/01/image-5.png)
CVPR is an amazing platform for networking and engaging with the community. It’s a great place to meet academics, researchers, and industry experts to collaborate and exchange ideas. The acceptance rate for papers is only 25.8%, so an accepted paper carries real recognition within the vision community. It often leads to citations, greater visibility, and potential collaborations with renowned researchers and professionals.
International Conference on Computer Vision (ICCV)
The ICCV is another premier conference, held every two years, offering an amazing platform for cutting-edge computer vision research. Much like the CVPR, the ICCV is also organized by the IEEE Computer Society, attracting worldwide visionaries, researchers, and professionals. Topics range from object detection and recognition all the way to computational photography. ICCV invites original papers offering a significant contribution to the field. The criteria for submissions are very similar to the CVPR: papers must include mathematical formulations, algorithms, experimental methodology, and results. ICCV adopts peer review to add a layer of technical rigor and quality to the accepted papers. Submissions usually undergo multiple stages of review, with detailed feedback on the technical aspects of the research paper. The acceptance rates at ICCV are typically low, at 26.2%.
Besides the main conference, the ICCV hosts workshops and tutorials that offer in-depth discussions and presentations in emerging research areas. It also offers challenges and competitions associated with computer vision tasks like image segmentation and object detection.
Like the CVPR, it offers excellent opportunities for future collaborations, networking with peers, and exchanging ideas. The papers accepted at the ICCV are typically published by the IEEE Computer Society and made available to the vision community. This offers significant visibility and recognition to researchers whose papers are accepted.
European Conference on Computer Vision (ECCV)
The European Conference on Computer Vision, or ECCV, is another comprehensive conference if you are looking for the top computer vision conferences globally. The ECCV lays a lot of emphasis on the scientific and technical quality of the paper. Like the above two conferences we discussed, it emphasizes how the researcher incorporates the mathematical foundations, algorithms, and detailed derivations and proofs with extensive experimental evaluations.
According to the ECCV formatting guidelines, the research paper ideally ranges from 10 to 14 pages. It adopts a double-blind peer review, where the researchers must make their submissions anonymous to reduce reviewer bias.
![research proposal on computer vision european conference on computer vision](https://opencv.org/wp-content/uploads/2024/01/image-5.jpeg)
ECCV also offers huge opportunities for collaborations and establishing connections. With an acceptance rate of 31.8%, a researcher can benefit from academic recognition, high visibility, and citations.
Winter Conference on Applications of Computer Vision (WACV)
WACV is a top international computer vision event comprising the main conference and a few workshops and tutorials. Like CVPR, it is held annually. With an acceptance rate below 30%, it attracts leading researchers and industry professionals. The conference usually takes place in the first week of January.
![research proposal on computer vision winter conference on applications of computer vision](https://opencv.org/wp-content/uploads/2024/01/image-7.jpeg)
As a computer vision researcher, you must publish your work in journals to share your findings and offer more insight into the field. Let us look at a few computer vision journals.
Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Also called TPAMI, this journal focuses on various aspects of machine intelligence, pattern recognition, and computer vision. It is a hybrid publication that accepts either traditional or author-paid open-access manuscript submissions.
Open-access papers are freely available, without restriction, through IEEE Xplore and the Computer Society Digital Library.
For traditional manuscript submissions, the IEEE Computer Society has various award-winning journals for publication. You can browse the different topics to find one that fits your research. They often publish special sections on emerging topics. Some factors to consider are submission-to-publication time, bibliometric scores like impact factor, and publishing fees.
International Journal of Computer Vision (IJCV)
The IJCV offers a platform for new research results. With 15 issues a year, the International Journal of Computer Vision publishes high-quality, original contributions to the field of computer vision. Article length ranges from 10 pages for regular articles up to 30 pages for survey papers that offer state-of-the-art presentations and results. The research must cover mathematical, physical, and computational aspects of computer vision, like image formation, processing, interpretation, machine learning techniques, and statistical approaches. Researchers are not charged to publish in IJCV. It is not only a journal that opens doors for researchers to showcase their papers but also a goldmine of information on deep learning, artificial intelligence, and robotics.
Journal of Machine Learning Research (JMLR)
Established in 2000, JMLR is a forum for electronic and paper publication of comprehensive research papers. This platform covers topics like machine learning algorithms and techniques, deep learning, neural networks, robotics, and computer vision. JMLR is freely available to the public. It is run by volunteers, and the papers undergo rigorous review, making the journal a valuable resource for the latest updates in the field.
You’ve invested weeks and months into this paper. Why not get the recognition and credibility your work deserves? The journals and conferences above offer a gateway for researchers to showcase their work and open up plenty of opportunities for academic and industry collaborations.
In conclusion, our journey through the intricate world of computer vision research has been a fun one. From the initial stages of understanding the problem statement to the final steps of publication, we’ve covered each step in depth.
No research is too big or too small; each piece makes its own contribution to the ever-evolving field of computer vision.
We’ve more detailed posts coming your way. Stay tuned! See you guys in the next one!!
Computer Vision Laboratory (CVL)
Master thesis project proposals
Internal projects
- A list of internal CVL projects can be found in the CVL GIT (open to all LiU students).
- If you are interested in doing a research related project, but do not see a suitable one listed here, feel free to contact one of the researchers at the lab. We normally have several more opportunities for internal master thesis projects related to research projects. These can often be adapted to the particular interests of the student.
External projects
- NB! Please first check the list of new external projects.
- [2022-11-18] Nordic Evolution: Digital Guides for Visually Impaired Athletes
- [2022-10-10] Zenseact: Multiple computer vision master theses proposals. E.g. Learning-based Road Estimation
- [2022-09-06] FOI: Neuromorphic Imaging
- [2022-02-21] FOI: Night Vision with Machine-Learning-Based Image Fusion
- [2021-10-14] Scania: Estimation of Scene Depth for Perception in Autonomous Heavy-Duty Vehicles
- [2021-10-14] Scania: Visual-Inertial Odometry (VIO)
- [2021-10-14] Scania: Really, really fast tracking in image space
- [2021-10-14] Scania: Single Stage Instance Segmentation in Autonomous Heavy-Duty Vehicles
- [2021-10-14] Scania: Trajectory and intention prediction of annotated tracked objects
- [2021-10-14] Scania: Efficient algorithm development for GPUs
- [2021-09-09] Viscando AB Gothenburg: Projects in deep learning, signal processing and modelling for traffic and autonomous vehicle safety
- [2021-02-12] NFC: Photometric Stereo on Tool Marks
- [2020-11-13] FOI Linköping: Deep Learning for 3D-Imaging LiDAR
- [2020-11-06] Arkus AI: Apply Machine Learning and Computer Vision in Genetic Diagnostics
- [2020-10-28] Ericsson: 3D reconstruction for mobile devices
- [2020-10-07] Veoneer: Static and Dynamic Windshield Distortion Modeling
- [2020-01-09] IEI: Facial Analysis in Thermal Images for Pilot Stress Recognition
More information about Master's thesis projects in Computer Vision.
Last updated: 2023-10-13
Linköping University SE-581 83 LINKÖPING Tel: +46 13 28 10 00
Computer Vision
Semantic Segmentation
Tumor Segmentation
Panoptic Segmentation
3D Semantic Segmentation
Weakly-Supervised Semantic Segmentation
Representation Learning
Disentanglement
Graph Representation Learning
Sentence Embeddings
Network Embedding
Classification
Text Classification
Graph Classification
Audio Classification
Medical Image Classification
Object Detection
3D Object Detection
Real-Time Object Detection
RGB Salient Object Detection
Few-Shot Object Detection
Image Classification
Out of Distribution (OOD) Detection
Few-Shot Image Classification
Fine-Grained Image Classification
Learning with noisy labels
2D Object Detection
Edge Detection
Thermal Image Segmentation
Open Vocabulary Object Detection
Reinforcement Learning (RL)
Off-Policy Evaluation
Multi-Objective Reinforcement Learning
3D Point Cloud Reinforcement Learning
Deep Hashing
Table Retrieval
Domain Adaptation
Unsupervised Domain Adaptation
Domain Generalization
Test-time Adaptation
Source-Free Domain Adaptation
Image Generation
Image-to-Image Translation
Text-to-Image Generation
Image Inpainting
Conditional Image Generation
Data Augmentation
Image Augmentation
Text Augmentation
Image Denoising
Color Image Denoising
SAR Image Despeckling
Grayscale Image Denoising
Autonomous Vehicles
Autonomous Driving
Self-Driving Cars
Simultaneous Localization and Mapping
Autonomous Navigation
Contrastive Learning
Meta-Learning
Few-Shot Learning
Sample Probing
Universal Meta-Learning
Super-Resolution
Image Super-Resolution
Video Super-Resolution
Multi-Frame Super-Resolution
Reference-based Super-Resolution
Pose Estimation
3D Human Pose Estimation
Keypoint Detection
3D Pose Estimation
6D Pose Estimation
Self-Supervised Learning
Point Cloud Pre-training
Unsupervised Video Clustering
2D Semantic Segmentation
Image Segmentation
Text Style Transfer
Scene Parsing
Reflection Removal
Visual Question Answering (VQA)
Visual Question Answering
Machine Reading Comprehension
Chart Question Answering
Embodied Question Answering
Depth Estimation
3D Reconstruction
Neural Rendering
3D Face Reconstruction
Anomaly Detection
Unsupervised Anomaly Detection
One-Class Classification
Supervised Anomaly Detection
Anomaly Detection in Surveillance Videos
Sentiment Analysis
Aspect-Based Sentiment Analysis (ABSA)
Multimodal Sentiment Analysis
Aspect Sentiment Triplet Extraction
Twitter Sentiment Analysis
Temporal Action Localization
Video Understanding
Video Generation
Video Object Segmentation
Action Classification
3D Object Super-Resolution
One-Shot Learning
Few-Shot Semantic Segmentation
Cross-Domain Few-Shot
Unsupervised Few-Shot Learning
Activity Recognition
Action Recognition
Human Activity Recognition
Egocentric Activity Recognition
Group Activity Recognition
Exposure Fairness
Medical Image Segmentation
Lesion Segmentation
Brain Tumor Segmentation
Cell Segmentation
Skin Lesion Segmentation
Monocular Depth Estimation
Stereo Depth Estimation
Depth and Camera Motion
3D Depth Estimation
Facial Recognition and Modelling
Face Recognition
Face Swapping
Face Detection
Facial Expression Recognition (FER)
Face Verification
Optical Character Recognition (OCR)
Active Learning
Handwriting Recognition
Handwritten Digit Recognition
Irregular Text Recognition
Instance Segmentation
Referring Expression Segmentation
3D Instance Segmentation
Unsupervised Object Segmentation
Real-time Instance Segmentation
Object Tracking
Multi-Object Tracking
Visual Object Tracking
Multiple Object Tracking
Cell Tracking
Zero-Shot Learning
Generalized Zero-Shot Learning
Compositional Zero-Shot Learning
Multi-Label Zero-Shot Learning
Quantization
Data Free Quantization
UNet Quantization
Continual Learning
Class Incremental Learning
Continual Named Entity Recognition
Unsupervised Class-Incremental Learning
Action Recognition In Videos
3D Action Recognition
Self-Supervised Action Recognition
Few Shot Action Recognition
Scene Understanding
Scene Text Recognition
Scene Graph Generation
Scene Recognition
Adversarial Attack
Backdoor Attack
Adversarial Text
Adversarial Attack Detection
Real-World Adversarial Attack
Image Retrieval
Sketch-Based Image Retrieval
Content-Based Image Retrieval
Composed Image Retrieval (CoIR)
Medical Image Retrieval
Active object detection, dimensionality reduction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000831-0e6ae4c7.jpg)
Supervised dimensionality reduction
Online nonnegative cp decomposition, emotion recognition.
![research proposal on computer vision research proposal on computer vision](https://raw.githubusercontent.com/fengju514/Expression-Net/master/ExpNet_teaser_v2.jpg)
Speech Emotion Recognition
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/453d2082-dde8-432f-9859-04df8ed44dd6.jpg)
Emotion Recognition in Conversation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000384-2d1d8fd8.jpg)
Multimodal Emotion Recognition
Emotion-cause pair extraction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000785-645bd197.jpg)
Monocular 3D Object Detection
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001846-fbe29ec8.jpg)
3D Object Detection From Stereo Images
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/81b29110-9cfa-498e-acee-0f5f403eb5e9.jpg)
Multiview Detection
Robust 3D object detection, image reconstruction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002391-c56e32a0.jpg)
MRI Reconstruction
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/3cc13a40-1344-4431-bcec-1752061f7036.jpg)
Film Removal
Style transfer.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/80297cde-f7f0-4af3-9996-be3bf979fd50.jpg)
Image Stylization
Font style transfer, style generalization, face transfer, optical flow estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/b77ae75e-6fa4-4dc1-9853-6c3b4a863eda.jpg)
Video Stabilization
Image captioning.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/cd47481d-85b7-4a27-8ee6-5c969609c94f.jpg)
3D dense captioning
Controllable image captioning, aesthetic image captioning.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002356-fdbfb5c2.jpg)
Relational Captioning
Action localization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000390-6354379e.jpg)
Action Segmentation
Spatio-temporal action localization, person re-identification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000127-6ba62ef3.jpg)
Unsupervised Person Re-Identification
Video-based person re-identification, generalizable person re-identification, cloth-changing person re-identification, image restoration.
![research proposal on computer vision research proposal on computer vision](https://raw.githubusercontent.com/titu1994/ImageSuperResolution/master/architectures/SRCNN.png)
Demosaicking
Spectral reconstruction, underwater image restoration.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002526-6f740253.jpg)
JPEG Artifact Correction
Visual relationship detection, lighting estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000133-9833a918.jpg)
3D Room Layouts From A Single RGB Panorama
Road scene understanding, action detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/dc08f825-1e11-41c5-ac43-4407cc259c8d.jpg)
Skeleton Based Action Recognition
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/fe4dde36-d569-498d-b386-af61cb831541.jpg)
Online Action Detection
Audio-visual active speaker detection, metric learning.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001355-46cddb3b.jpg)
Object Recognition
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/58ba4d27-a9fb-45c8-8a45-58a7dfad0e2b.jpg)
3D Object Recognition
Continuous object recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001053-fd305adb.jpg)
Depiction Invariant Object Recognition
Image enhancement.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000098-5f928654_Ty92qUH.jpg)
Low-Light Image Enhancement
Image relighting, de-aliasing.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/67f4d17a-534a-4c8d-bbf0-2f70d231cc59.jpg)
Monocular 3D Human Pose Estimation
Pose prediction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001649-7e12f901.jpg)
3D Multi-Person Pose Estimation
3D human pose and shape estimation, multi-label classification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000674-1f06fb6b.jpg)
Missing Labels
Extreme multi-label classification, hierarchical multi-label classification, medical code prediction, continuous control.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000516-081b94a0.jpg)
Steering Control
Drone controller, 3D face modelling.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/3051f6f3-2120-45b0-bfb5-5fc9509d7986.jpg)
Semi-Supervised Video Object Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000881-e1482940.jpg)
Unsupervised Video Object Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001367-1082b77e.jpg)
Referring Video Object Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001790-02f23ffd.jpg)
Video Salient Object Detection
Trajectory prediction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000020-0ddebb3b.jpg)
Trajectory Forecasting
Human motion prediction, out-of-sight trajectory prediction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001619-3a5655f8.jpg)
Multivariate Time Series Imputation
Novel view synthesis.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001386-6aba34e7.jpg)
Novel LiDAR View Synthesis
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001386-5a4b94dc_tKiQfG2.jpg)
Ground video synthesis from satellite image
Image quality assessment, no-reference image quality assessment, blind image quality assessment.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001676-478c0749.jpg)
Aesthetics Quality Assessment
Stereoscopic image quality assessment, object localization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000383-22bf0ed9.jpg)
Weakly-Supervised Object Localization
Image-based localization, unsupervised object localization, monocular 3D object localization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000014-2dbbecf3.jpg)
Blind Image Deblurring
Single-image blind deblurring, out-of-distribution detection, video semantic segmentation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000910-ef1bd608.jpg)
Camera shot segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000444-66c74076.jpg)
Facial Inpainting
Cloud removal.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/2c8222e3-b09e-4ffb-b284-211afa8086a8.jpg)
Fine-Grained Image Inpainting
Instruction following, visual instruction following, change detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/42790a97-c01a-4afa-aa7c-3b72d5a52296.jpg)
Semi-supervised Change Detection
Prompt engineering.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/118d0d79-54e4-49a7-ab9a-e3606e82103d.jpg)
Visual Prompting
Image compression.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000730-aec83530.jpg)
Feature Compression
JPEG compression artifact reduction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000729-ed7408b1.jpg)
Lossy-Compression Artifact Reduction
Color image compression artifact reduction, explainable artificial intelligence, explainable models, explanation fidelity evaluation, fad curve analysis, saliency detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000333-b088b240.jpg)
Saliency Prediction
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000331-df0072c9.jpg)
Co-Salient Object Detection
Video saliency detection, unsupervised saliency detection, image registration.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000522-69cd01de_zgJiQk0.jpg)
Unsupervised Image Registration
Visual reasoning.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000242-3efd20a2.jpg)
Visual Commonsense Reasoning
Ensemble learning, salient object detection, saliency ranking, visual tracking.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000550-8dc245dd.jpg)
Point Tracking
RGB-T tracking, real-time visual tracking.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001734-32c857c9.jpg)
RF-based Visual Tracking
3D point cloud classification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/0eeeb7f4-67f5-4ec3-a76b-20ba51efae6a.jpg)
3D Object Classification
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/a967964d-f619-4d68-b753-c0acae259fa0.jpg)
Few-Shot 3D Point Cloud Classification
Supervised only 3D point cloud classification, zero-shot transfer 3D point cloud classification, visual grounding.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/8bdd62b4-c02c-43ea-bbb5-77882024286e.jpg)
3D visual grounding
Person-centric visual grounding.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/441e442f-a774-406d-89d8-4a103876ad91.jpg)
Phrase Extraction and Grounding (PEG)
2D classification.
![research proposal on computer vision research proposal on computer vision](https://raw.githubusercontent.com/cambridge-mlg/miracle/master/figures/mnist_comp.png)
Neural Network Compression
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000139-7e4d4874.jpg)
Music Source Separation
Cell detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/61a791e1-5ac0-40cb-9cdf-5422434bbbfe.jpg)
Plant Phenotyping
Open-set classification, image manipulation detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000625-bb786447.jpg)
Zero Shot Skeletal Action Recognition
Generalized zero shot skeletal action recognition, motion estimation, video question answering.
![research proposal on computer vision research proposal on computer vision](https://raw.githubusercontent.com/jayleicn/TVQA/master/./imgs/example_main.png)
Zero-Shot Video Question Answer
Few-shot video question answering, video captioning.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000542-44908c53.jpg)
Dense Video Captioning
Boundary captioning, visual text correction, audio-visual video captioning, whole slide images, gesture recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000632-7fc5c90c.jpg)
Hand Gesture Recognition
![research proposal on computer vision research proposal on computer vision](https://raw.githubusercontent.com/okankop/MFF-pytorch/master/images/motion_fused_frames.jpg)
Hand-Gesture Recognition
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001732-5fa30f4b.jpg)
RF-based Gesture Recognition
Activity prediction, motion prediction, cyber attack detection, sequential skip prediction, text detection, point cloud registration.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000520-849c9f31.jpg)
Image to Point Cloud Registration
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/1ebcc422-6179-47b1-b0d5-d4883043a38a.jpg)
Robust 3D Semantic Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/ed554fc5-ed7e-4bb2-9abb-ea0c385e0892.jpg)
Real-Time 3D Semantic Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/8666f2ea-9285-487a-b0b1-5b462667f66e.jpg)
Unsupervised 3D Semantic Segmentation
Furniture segmentation, medical diagnosis.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000293-ad63354d.jpg)
Alzheimer's Disease Detection
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002230-9dfeec51.jpg)
Retinal OCT Disease Classification
Blood cell count, thoracic disease classification, 3D point cloud interpolation, visual odometry.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000070-1705b341.jpg)
Face Anti-Spoofing
Monocular visual odometry, rain removal.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000440-be3759b0.jpg)
Single Image Deraining
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000771-3b256a6c.jpg)
Hand Pose Estimation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000884-78baab10.jpg)
Hand Segmentation
Gesture-to-gesture translation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000710-8c508a28.jpg)
Image Dehazing
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000100-8c508a28.jpg)
Single Image Dehazing
Image clustering.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000702-79f3c03f.jpg)
Online Clustering
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000702-3c2f553a.jpg)
Face Clustering
Multi-view subspace clustering, multi-modal subspace clustering, deepfake detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001775-6d14e362.jpg)
Synthetic Speech Detection
Human detection of deepfakes, multimodal forgery detection, robot navigation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000547-5ff26267.jpg)
PointGoal Navigation
Social navigation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002538-5c74c6f5.jpg)
Sequential Place Learning
Colorization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/34171ffd-b64f-4c99-a8ae-25db8fc5921d.jpg)
Line Art Colorization
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/ea20e877-5ede-4a85-9b69-b2e91ff6838f.jpg)
Point-interactive Image Colorization
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/eb621262-556c-40c9-b6e3-71475de11079.jpg)
Color Mismatch Correction
Conformal prediction, image manipulation, visual localization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000391-9bc2256b.jpg)
Image Editing
Rolling shutter correction, shadow removal, multimodel-guided image editing, joint deblur and frame interpolation, multimodal fashion image editing, visual place recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000663-4df8d036.jpg)
Indoor Localization
3D place recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000108-08b670c5.jpg)
Unsupervised Image-To-Image Translation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001067-0d49abc9.jpg)
Synthetic-to-Real Translation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000109-90fbb2e0.jpg)
Multimodal Unsupervised Image-To-Image Translation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/e1f688c7-99a7-4198-8cf4-7ffcc0ffde8b.jpg)
Cross-View Image-to-Image Translation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002229-c773675f.jpg)
Fundus to Angiography Generation
Stereo matching.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000471-ceeed704.jpg)
Crowd Counting
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000239-ab2f099a.jpg)
Visual Crowd Analysis
Group detection in crowds, object reconstruction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000750-67b42af7.jpg)
3D Object Reconstruction
Earth observation, human-object interaction detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000617-9644872d.jpg)
Affordance Recognition
Image deblurring, low-light image deblurring and enhancement, image matching.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/4810ca04-9af3-4e79-9bf9-54f786052a27.jpg)
Semantic correspondence
Patch matching, set matching.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/b478c024-2fd4-46ce-8128-0da3bb642b75.jpg)
Matching Disparate Images
Video quality assessment, video alignment, temporal sentence grounding, long-video activity recognition, point cloud classification, jet tagging, few-shot point cloud classification, hyperspectral.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000818-f54abafb.jpg)
Hyperspectral Image Classification
Hyperspectral unmixing, hyperspectral image segmentation, classification of hyperspectral images, document text classification, multi-label classification of biomedical texts, political salient issue orientation detection, 3D point cloud reconstruction, scene classification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000413-143ff75c.jpg)
Weakly-supervised Temporal Action Localization
Weakly supervised action localization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000210-b1ee5c73.jpg)
Temporal Action Proposal Generation
Activity recognition in videos, referring expression, point cloud generation, point cloud completion, 2D human pose estimation, action anticipation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001498-3f9c6ea2_65GTaFO.jpg)
3D Face Animation
Semi-supervised human pose estimation, reconstruction, 3D human reconstruction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000045-5164a6fa.jpg)
Single-View 3D Reconstruction
4D reconstruction, single-image-based HDR reconstruction, keyword spotting.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000085-ed8952fd.jpg)
Small-Footprint Keyword Spotting
Visual keyword spotting, compressive sensing, camera calibration, scene text detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000507-55533bc2.jpg)
Curved Text Detection
Multi-oriented scene text detection, boundary detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000356-04900360.jpg)
Junction Detection
Image matting.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/e3b47721-0830-4233-9253-d455d3be1c59.jpg)
Semantic Image Matting
Video retrieval, video-text retrieval, video grounding, video-adverb retrieval, replay grounding, composed video retrieval (CoVR), document AI, document understanding, cross-modal retrieval, image-text matching, cross-modal retrieval with noisy correspondence, multilingual cross-modal retrieval.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/dcfded35-e2e1-44e8-86da-c09af9cfafa9.jpg)
Zero-shot Composed Person Retrieval
Cross-modal retrieval on RSITMD, motion synthesis.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/cf274dc2-54c8-4d2b-b3db-fc42338ebb3e.jpg)
Motion Style Transfer
Temporal human motion composition, video summarization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/f39b8716-7e56-40a4-a528-823472ba7bfc.jpg)
Unsupervised Video Summarization
Supervised video summarization, emotion classification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000563-59ba915b.jpg)
Point Cloud Segmentation
Sensor fusion, superpixels.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/3634dab6-a391-4056-88ab-11ee33af2ce0.jpg)
Remote Sensing
![research proposal on computer vision research proposal on computer vision](https://raw.githubusercontent.com/lehaifeng/RSI-CB/master/osm%E5%88%86%E5%B8%83%E5%9B%BE.png)
Remote Sensing Image Classification
Change detection for remote sensing images, building change detection for remote sensing images.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000484-0b85d8ec.jpg)
Segmentation Of Remote Sensing Imagery
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000485-0b85d8ec.jpg)
The Semantic Segmentation Of Remote Sensing Imagery
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002170-963c86db.jpg)
Few-Shot Transfer Learning for Saliency Prediction
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000847-b12abf24_wqnD1AJ.jpg)
Aerial Video Saliency Prediction
3D anomaly detection, video anomaly detection, artifact detection, document layout analysis.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/a118a5c0-8617-4374-ada1-a93e748ca0b5.jpg)
Face Generation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/2ce93ab0-992a-429f-8005-09601dddcb1f.jpg)
Talking Head Generation
Talking face generation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000990-2d591218_2jUWb8G.jpg)
Face Age Editing
Facial expression generation, kinship face generation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000780-3d4e01ee.jpg)
Point cloud reconstruction
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/e19e9626-0c29-4ac5-a1e8-a65170ad7c2d.jpg)
3D Semantic Scene Completion
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/c6f9dd48-eb74-459b-9143-4b1d30b11dc8.jpg)
3D Semantic Scene Completion from a single RGB image
Garment reconstruction, privacy preserving deep learning, membership inference attack, human detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000497-6d1a4ae6.jpg)
Video Instance Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002416-bc645653.jpg)
Generalized Few-Shot Semantic Segmentation
Video editing, video temporal consistency, line items extraction, virtual try-on.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/33c33ac1-c08c-4d2c-925e-303af2f00b9c.jpg)
Generalized Referring Expression Segmentation
Scene flow estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/3cc22ae6-195d-4ff1-8dda-4271733f2ed6.jpg)
Self-supervised Scene Flow Estimation
Depth completion.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001115-db64b3a0.jpg)
Object Discovery
Motion forecasting.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000363-06d10c79.jpg)
Multi-Person Pose forecasting
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001736-1612b5d3.jpg)
Multiple Object Forecasting
3D classification, machine unlearning, continual forgetting, gaze estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000848-7a7f5179.jpg)
CARLA MAP Leaderboard
Dead-reckoning prediction, face reconstruction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/d5f1e0b9-1215-4f4f-ab92-8ea1e4f036d7.jpg)
Text-Guided Image Editing
Text-based image editing, concept alignment.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/5f077531-cc40-4eec-be00-a6337e075dfe.jpg)
Zero-Shot Text-to-Image Generation
Conditional text-to-image synthesis, texture synthesis, multi-view learning, incomplete multi-view clustering, sign language recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000649-69922a8d.jpg)
Gait Recognition
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/c4fc8e61-f5b3-4b2a-87a4-71293efe2e7c.jpg)
Multiview Gait Recognition
Gait recognition in the wild, interactive segmentation, scene generation, image recognition, fine-grained image recognition, license plate recognition, material recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000287-fc5b698e.jpg)
Breast Cancer Detection
Skin cancer classification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000288-c86f61d3.jpg)
Breast Cancer Histology Image Classification
Lung cancer diagnosis, classification of breast cancer histology images, event-based vision.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/2eedb385-fb67-459a-a2c0-f7cc8feabc55.jpg)
Event-based Optical Flow
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/6cb48677-e623-4a5c-b329-5a974f537f44.jpg)
Event-Based Video Reconstruction
Event-based motion estimation, interest point detection, homography estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001652-7e12f901.jpg)
3D Multi-Person Pose Estimation (absolute)
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001649-ecb41cf2.jpg)
3D Multi-Person Mesh Recovery
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001651-edc0c2f2.jpg)
3D Multi-Person Pose Estimation (root-relative)
Object counting, training-free object counting, open-vocabulary object counting, human parsing.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001200-fb55e254.jpg)
Multi-Human Parsing
Weakly supervised segmentation, disease prediction, disease trajectory forecasting, pose tracking.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001299-897c396f.jpg)
3D Human Pose Tracking
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002042-ac2cbf8e.jpg)
3D Hand Pose Estimation
Facial landmark detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000441-787de252_HtXStMs.jpg)
Unsupervised Facial Landmark Detection
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001971-ec7de5c2.jpg)
3D Facial Landmark Localization
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/84b92b4e-836f-4a64-8e53-a972dc5dc618.jpg)
Dichotomous Image Segmentation
3D character animation from a single photo, activity detection, inverse rendering, scene segmentation, temporal localization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000389-ae6f548d.jpg)
Language-Based Temporal Localization
Temporal defect localization, text-to-video generation, text-to-video editing, subject-driven video generation, multi-label image classification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/4eee2add-78aa-46df-b8a3-296919b49cb5.jpg)
Multi-label Image Recognition with Partial Labels
3D object tracking.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/d79d32e0-35b4-4fc9-a69c-d373e3097a39.jpg)
3D Single Object Tracking
Template matching, camera localization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000474-5eb20b1e.jpg)
Camera Relocalization
Lidar semantic segmentation, motion segmentation, text spotting.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000086-2094f367.jpg)
Visual Dialog
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000243-1146f2d1.jpg)
Intelligent Surveillance
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000129-bd0ee47a.jpg)
Vehicle Re-Identification
Relation network, disparity estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000849-9022569c.jpg)
Few-Shot Class-Incremental Learning
Class-incremental semantic segmentation, non-exemplar-based class incremental learning, decision making under uncertainty.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000177-776f95bc.jpg)
Uncertainty Visualization
Knowledge distillation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/tasks/d0f2dad1-32df-46e2-8686-ce09e263353c.png)
Data-free Knowledge Distillation
Self-knowledge distillation, moment retrieval.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002886-eacfc398.jpg)
Zero-shot Moment Retrieval
Text to video retrieval, partially relevant video retrieval, handwritten text recognition, handwritten document recognition, unsupervised text recognition, person search, semi-supervised object detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/dfedaa2c-0eb8-4247-9cc0-db856cbf64ad.jpg)
Mixed Reality
Shadow detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000504-1345a6a4.jpg)
Shadow Detection And Removal
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/01d8624d-00e2-4e32-8159-ae45c4a3edd5.jpg)
Unconstrained Lip-synchronization
Future prediction, human mesh recovery, video enhancement, video inpainting.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001621-e4fa630c.jpg)
Face Image Quality Assessment
Lightweight face recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000452-a95d6931.jpg)
Age-Invariant Face Recognition
Synthetic face recognition, face quality assessment.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000449-5e15a1d3.jpg)
Cross-corpus
Micro-expression recognition, micro-expression spotting.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000448-d9c5224c.jpg)
3D Facial Expression Recognition
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000630-844318ed.jpg)
Smile Recognition
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001626-3b0fd806.jpg)
3D Multi-Object Tracking
Real-time multi-object tracking, multi-animal tracking with identification, trajectory long-tail distribution for multi-object tracking, grounded multiple object tracking.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/0d291bd4-583c-4a65-ac28-ef63253420fe.jpg)
Stereo Image Super-Resolution
Burst image super-resolution, satellite image super-resolution, multispectral image super-resolution, image categorization, fine-grained visual categorization, open vocabulary semantic segmentation, zero-guidance segmentation, physics-informed machine learning, soil moisture estimation, video reconstruction.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001827-3fb659a2_8tfLn4X.jpg)
Zero Shot Segmentation
Sign language translation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001449-2d0892c0.jpg)
Overlapped 10-1
Overlapped 15-1, overlapped 15-5, disjoint 10-1, disjoint 15-1, color constancy.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000724-0c23f7fd.jpg)
Few-Shot Camera-Adaptive Color Constancy
HDR reconstruction, multi-exposure image fusion, deep attention, line detection, visual recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000718-75613b53.jpg)
Fine-Grained Visual Recognition
Tone mapping, zero-shot action recognition, image cropping, stereo matching hand.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000904-a0e8fdfb.jpg)
3D Absolute Human Pose Estimation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001751-381bcadd_22ti1hO.jpg)
Text-to-Face Generation
Natural language transduction, image forensics, image to 3D, infrared and visible image fusion.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000487-58d20eaf.jpg)
Novel Class Discovery
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/bb0fb0b6-5b77-439a-838d-d73850e88323.jpg)
Breast Cancer Histology Image Classification (20% labels)
Landmark-based lipreading, transparent object detection, transparent objects, video restoration.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/82695719-6c02-4632-9c43-ef66a18ab565.jpg)
Analog Video Restoration
Abnormal event detection in video.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002615-8ca44059.jpg)
Semi-supervised Anomaly Detection
Surface normals estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001360-1808c9b9.jpg)
Vision-Language Navigation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000067-333d5dfa.jpg)
Hand-Object Pose
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000002042-8135da6d.jpg)
Grasp Generation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002043-e749e306.jpg)
3D Canonical Hand Pose Estimation
Image animation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/5e0606c2-1094-410e-a125-c9e3e166aab2.jpg)
cross-domain few-shot learning
Texture classification, probabilistic deep learning, action quality assessment, pedestrian attribute recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/948c0e0f-81d8-4c08-b8d9-e504cb114382.jpg)
Spoof Detection
Face presentation attack detection, detecting image manipulation, cross-domain iris presentation attack detection, finger dorsal image spoof detection, unsupervised few-shot image classification, generalized few-shot classification, highlight detection, steganalysis.
![research proposal on computer vision research proposal on computer vision](https://raw.githubusercontent.com/tosmaster/imagevision/master/images/architecture.png)
Sketch Recognition
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000088-1f169c25.jpg)
Face Sketch Synthesis
Drawing pictures.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000053-9e6cc36d.jpg)
Photo-To-Caricature Translation
Computer vision techniques adopted in 3D cryogenic electron microscopy, single particle analysis, cryogenic electron tomography, meme classification, hateful meme classification, action understanding, dense captioning, person retrieval, segmentation, open-vocabulary semantic segmentation, iris recognition, pupil dilation, image to video generation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/4a2d703b-aab3-4ce2-9cfe-9c742289b269.jpg)
Unconditional Video Generation
Image stitching.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/ac8ffe2d-fdb4-447e-88e9-b4ad83e5bb38.jpg)
One-shot visual object segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/23137faf-69e4-4e35-acec-5b817c16c737.jpg)
Unbiased Scene Graph Generation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/2a773854-1b01-410f-aaf3-a064d695d590.jpg)
Panoptic Scene Graph Generation
Automatic post-editing.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000257-2b560008_M7RFnV9.jpg)
Document Image Classification
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000704-356f65e7.jpg)
Face Reenactment
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/9b9603d6-560f-4a69-bc48-90b9da3cf086.jpg)
Multi-View 3D Reconstruction
Universal domain adaptation, surgical phase recognition, online surgical phase recognition, offline surgical phase recognition, blind face restoration.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002731-0d3b184a.jpg)
Geometric Matching
Human action generation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001744-8142135a.jpg)
Action Generation
Object categorization, text based person retrieval, diffusion personalization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/f74a439f-308c-4f5d-89f3-0d294865d51c.jpg)
Diffusion Personalization Tuning Free
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/2d639db9-6da1-4cc0-a4bb-48efaead23f4.jpg)
Efficient Diffusion Personalization
Human dynamics.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000619-0a3c8ab0.jpg)
3D Human Dynamics
Severity prediction, intubation support prediction, cloud detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000502-dfb772c2.jpg)
Table Recognition
Text-to-image, story visualization, complex scene breaking and synthesis, image fusion, pansharpening, image deconvolution.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000713-8500001e.jpg)
Image Outpainting
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/caca4ee3-d312-45f3-95bc-996f1c27034e.jpg)
Object Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002128-167778ba.jpg)
Camouflaged Object Segmentation
Landslide segmentation, text-line extraction, point clouds, point cloud video understanding, point cloud representation learning.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/b2c1d06e-20f2-4046-8329-7e239774aa84.jpg)
Semantic SLAM
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/575355c4-187c-40f0-9713-674fc2fc5cb1.jpg)
Object SLAM
Image shadow removal, intrinsic image decomposition, line segment detection, sports analytics, situation recognition, grounded situation recognition, face image quality, motion detection, multi-target domain adaptation, person identification, visual prompt tuning, single-source domain generalization, evolving domain generalization, source-free domain generalization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001767-3c6c5a0d.jpg)
Robot Pose Estimation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/926d9879-0d61-474f-ae7a-9c50a59ad13f.jpg)
Camouflaged Object Segmentation with a Single Task-generic Prompt
Image morphing, image steganography, rotated MNIST, weakly-supervised instance segmentation, image smoothing, fake image detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001199-7f4bf1fb_0BXMP1S.jpg)
GAN image forensics
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001199-087e3c6c_qt4yHKY.jpg)
Fake Image Attribution
Lane detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/8c89efa3-0b9d-4e3e-82d9-08b38f55bcc7.jpg)
3D Lane Detection
Layout design, occlusion handling, contour detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000354-8ae991ad.jpg)
Crop Classification
License plate detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/21e8bf09-6596-410d-b170-0432fcffb2b0.jpg)
Video Panoptic Segmentation
Viewpoint estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000851-d564eeeb.jpg)
Drone navigation
Drone-view target localization, value prediction, body mass index (BMI) prediction, crop yield prediction, multi-object tracking and segmentation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/657c32e8-3131-400f-8474-224bad1a9b6e.jpg)
Zero-Shot Transfer Image Classification
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/77dd7e5b-1335-4b2a-bcfd-9c388413b102.jpg)
motion retargeting
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000136-c08a5426.jpg)
3D Object Reconstruction From A Single Image
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000751-6cd9fb6b.jpg)
CAD Reconstruction
3D point cloud linear classification, multiview learning, person recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000131-3d972675.jpg)
Photo Retouching
Shape representation of 3D point clouds, bird's-eye view semantic segmentation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/0d834282-fd21-4e57-be69-d5c2ed538690.jpg)
Dense Pixel Correspondence Estimation
Human part segmentation.
![research proposal on computer vision research proposal on computer vision](https://raw.githubusercontent.com/facebookresearch/detectron/master/demo/output/33823288584_1d21cf0a26_k_example_output.jpg)
Document Shadow Removal
Symmetry detection, traffic sign detection, video style transfer, referring image matting.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/tasks/d6b93a69-819b-4ec2-af10-06ffa587bb16.jpg)
Referring Image Matting (Expression-based)
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/3733d707-0d2b-49c9-9b2b-a2a615e74ff8.jpg)
Referring Image Matting (Keyword-based)
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/db80a83a-a702-4edd-aad9-892a147df206.jpg)
Referring Image Matting (RefMatte-RW100)
Referring image matting (prompt-based), human interaction recognition, one-shot 3D action recognition, mutual gaze, affordance detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002845-bd6fff4d.jpg)
Gaze Prediction
Hand detection, image forgery detection, image instance retrieval, amodal instance segmentation, image quality estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000091-e86362c7.jpg)
Image Similarity Search
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000992-fce976fd.jpg)
Material Classification
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000422-688b5ff1.jpg)
Precipitation Forecasting
Referring expression generation, road damage detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000480-bdfe2fa5.jpg)
Space-time Video Super-resolution
Video matting.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002360-3d203358.jpg)
inverse tone mapping
Semi-supervised image classification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001329-ea5992e0_Tmg8Zxv.jpg)
Open-World Semi-Supervised Learning
Semi-supervised image classification (cold start), facial editing.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/68278bb8-5f4a-43f4-bfbb-ab950c58df74.jpg)
Holdout Set
Multispectral object detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/e1b0bc95-9162-42a6-95ba-5bd44da0b885.jpg)
Open Vocabulary Attribute Detection
Image/document clustering, self-organized clustering, instance search.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000082-7dbded6b.jpg)
Audio Fingerprint
3D shape modeling.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001823-9593b40a.jpg)
Action Analysis
Art analysis.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/3f612eed-ab00-41f9-a6a1-91dbf5cf09f1.jpg)
Zero-Shot Composed Image Retrieval (ZS-CIR)
Food recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000634-b236aef7.jpg)
Motion Magnification
Semi-supervised instance segmentation, binary classification, LLM-generated text detection, cancer-no cancer per breast classification, cancer-no cancer per image classification, suspicious (BI-RADS 4,5)-no suspicious (BI-RADS 1,2,3) per image classification, cancer-no cancer per view classification, video segmentation, camera shot boundary detection, open-vocabulary video segmentation, open-world video segmentation, lung nodule classification, lung nodule 3D classification, lung nodule detection, lung nodule 3D detection, 3D scene reconstruction, event segmentation, generic event boundary detection, image retouching, image-variation, JPEG artifact removal, point cloud super resolution, skills assessment.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/697a4af1-f65e-4992-8b02-48c6bf021517.jpg)
Text-based Person Retrieval
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001551-2fded264.jpg)
Sensor Modeling
Handwriting verification, Bangla spelling error correction, video prediction, earth surface forecasting, predict future video frames, 3D open-vocabulary instance segmentation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/60ff1d90-d451-4f8a-8c62-94550ec91252.jpg)
Ad-hoc video search
Audio-visual synchronization, handwriting generation, pose retrieval, scanpath prediction, scene change detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002411-873b5588_mEzUBaG.jpg)
Sketch-to-Image Translation
Skills evaluation, synthetic image detection, highlight removal, 2D pose estimation, category-agnostic pose estimation, overlapping pose estimation, 3D shape reconstruction from a single 2D image.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000135-9e64ef64.jpg)
Shape from Texture
Deception detection, deception detection in videos.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002904-8c6ca1c7_YGTMmks.jpg)
Video Visual Relation Detection
Human-object relationship detection, 3D shape representation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000796-71004bae.jpg)
3D Dense Shape Correspondence
Birds eye view object detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000015-28528fd4.jpg)
Image Comprehension
Image manipulation localization, multiple people tracking.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000368-67c60635.jpg)
Network Interpretation
RGB-D reconstruction, seeing beyond the visible, semi-supervised domain generalization, unsupervised semantic segmentation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001371-00a5d91b.jpg)
Unsupervised Semantic Segmentation with Language-image Pre-training
Multiple object tracking with transformer.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/e33508db-205c-4c2c-a8de-58a8a9e48a0e.jpg)
Multiple Object Track and Segmentation
Constrained lip-synchronization, face dubbing, Vietnamese visual question answering, explanatory visual question answering, 3D shape reconstruction, 4D panoptic segmentation, defocus blur detection, event data classification, instance shadow detection, kinship verification, medical image enhancement, open vocabulary panoptic segmentation, short-term object interaction anticipation, single-object discovery, training-free 3D point cloud classification, video forensics.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000361-36f52818.jpg)
Sequential Place Recognition
Autonomous flight (dense forest), autonomous web navigation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/ae65120e-a9ac-4c25-8a19-5e734fd4cafc.jpg)
Generative 3D Object Classification
Cube engraving classification, facial expression recognition, cross-domain facial expression recognition, zero-shot facial expression recognition, multimodal machine translation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001101-fb2e2264.jpg)
Face to Face Translation
Multimodal lexical translation, 10-shot image generation, 2D semantic segmentation task 3 (25 classes), document enhancement, action assessment, bokeh effect rendering, drivable area detection, face anonymization, font recognition, horizon line estimation, image imputation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001396-994a63ac.jpg)
Long Video Retrieval (Background Removed)
Medical image denoising.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/948d7c0f-96f5-43bf-9f62-8d4553eef0e2.jpg)
Occlusion Estimation
Personalized image generation, physiological computing.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001878-93ba632b.jpg)
Lake Ice Monitoring
Spatio-temporal video grounding, text-based person retrieval with noisy correspondence.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/8ec025a9-5215-47df-a65c-5c95cd6e8f21.jpg)
Unsupervised 3D Point Cloud Linear Evaluation
VCGBench-Diverse, wireframe parsing, gaze redirection, single-image-generation, unsupervised anomaly detection with specified settings -- 30% anomaly, root cause ranking, anomaly detection at 30% anomaly, anomaly detection at various anomaly percentages.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002676-0f8402b4.jpg)
Unsupervised Contextual Anomaly Detection
Landmark tracking, muscle tendon junction identification, mistake detection, online mistake detection, 3D object captioning, 3D semantic occupancy prediction, 3D scene editing, animated GIF generation, generalized referring expression comprehension, image deblocking, image retargeting, infrared image super-resolution, motion disentanglement, personality trait recognition, persuasion strategies, scene text editing, image to sketch recognition, traffic accident detection, accident anticipation, unsupervised landmark detection, vehicle speed estimation, visual speech recognition, lip to speech synthesis, continual anomaly detection, weakly supervised action segmentation (transcript), weakly supervised action segmentation (action set), calving front delineation in synthetic aperture radar imagery, calving front delineation in synthetic aperture radar imagery with fixed training amount.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/8ce700a1-f5b9-4bc0-a529-c75e38f72e22.jpg)
Handwritten Line Segmentation
Handwritten word segmentation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000002615-e42dabda.jpg)
General Action Video Anomaly Detection
Physical video anomaly detection, monocular cross-view road scene parsing (road), monocular cross-view road scene parsing (vehicle).
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000603-3510e464.jpg)
Transparent Object Depth Estimation
Age and gender estimation, data ablation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000351-f7066399.jpg)
Occluded Face Detection
Fingertip detection, gait identification, historical color image dating, stochastic human motion prediction, image and video forgery detection, motion captioning, personalized segmentation, repetitive action counting, scene-aware dialogue, spatial relation recognition, spatial token mixer, steganographics, story continuation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/92539aed-3b8c-4b62-b66c-b0fbf7ef3d53.jpg)
Unsupervised Anomaly Detection with Specified Settings -- 0.1% anomaly
Unsupervised anomaly detection with specified settings -- 1% anomaly, unsupervised anomaly detection with specified settings -- 10% anomaly, unsupervised anomaly detection with specified settings -- 20% anomaly, visual analogies, visual social relationship recognition, zero-shot text-to-video generation, text-guided-generation, video frame interpolation, 3d video frame interpolation, unsupervised video frame interpolation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002776-51e5d023.jpg)
eXtreme-Video-Frame-Interpolation
Continual semantic segmentation, overlapped 5-3, overlapped 25-25, micro-expression generation, micro-expression generation (megc2021), period estimation, art period estimation (544 artists), unsupervised panoptic segmentation, unsupervised zero-shot panoptic segmentation, 3d rotation estimation, camera auto-calibration, defocus estimation, derendering, hierarchical text segmentation, human-object interaction concept discovery.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/d79f60e0-4c23-412d-8dab-81b1119620ee.jpg)
One-Shot Face Stylization
Keypoint detection and image matching, speaker-specific lip to speech synthesis, multi-person pose estimation, neural stylization.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/68b38b9c-ed05-4372-a612-01f909dec050.jpg)
Part-aware Panoptic Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002134-7dd96854.jpg)
Population Mapping
Pornography detection, prediction of occupancy grid maps, raw reconstruction, svbrdf estimation, semi-supervised video classification, spectrum cartography, supervised image retrieval, synthetic image attribution, training-free 3d part segmentation, unsupervised image decomposition, video propagation, vietnamese multimodal learning, weakly supervised 3d point cloud segmentation, weakly-supervised panoptic segmentation, drone-based object tracking, brain visual reconstruction, brain visual reconstruction from fmri.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/a34d9382-cb10-400e-a6bd-31f40c72623f.jpg)
Human-Object Interaction Generation
Image-guided composition, fashion understanding, semi-supervised fashion compatibility.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000714-068a8901_2PQwzdm.jpg)
Intensity Image Denoising
Lifetime image denoising, observation completion, active observation completion, boundary grounding.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/5e1f2ccb-0696-44cd-a506-ad8524a59b9b.jpg)
Video Narrative Grounding
3d inpainting, 3d scene graph alignment, 4d spatio temporal semantic segmentation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001517-bc8b5f8c.jpg)
Age Estimation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000616-5c9160ff.jpg)
Few-shot Age Estimation
Brdf estimation, camouflage segmentation, clothing attribute recognition, damaged building detection, depth image estimation, detecting shadows, dynamic texture recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000459-e2dd17f7_kfjELuH.jpg)
Disguised Face Verification
Few shot open set object detection, gaze target estimation, generalized zero-shot learning - unseen, grounded multimodal named entity recognition, hd semantic map learning, human-object interaction anticipation, image deep networks, manufacturing quality control, materials imaging, micro-gesture recognition, multi-person pose estimation and tracking.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001240-5386a638.jpg)
Multi-modal image segmentation
Multi-object discovery, neural radiance caching.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/b5f50463-49bd-42d6-b0fc-2fa6388d11ec.jpg)
Parking Space Occupancy
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002889-9b2c229b.jpg)
Partial Video Copy Detection
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/c16085d5-fd52-4516-b585-042f44e000f9.jpg)
Multimodal Patch Matching
Perpetual view generation, procedure learning, prompt-driven zero-shot domain adaptation, safety perception recognition, jersey number recognition, photo to rest generalization, single-shot hdr reconstruction, on-the-fly sketch based image retrieval, thermal image denoising, trademark retrieval, unsupervised instance segmentation, unsupervised zero-shot instance segmentation, vehicle key-point and orientation estimation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001689-53363f66.jpg)
Video Individual Counting
Video-adverb retrieval (unseen compositions), video-to-image affordance grounding.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/b4aebc62-bf07-4dad-b3cc-384f37397bde.jpg)
Vietnamese Scene Text
Visual sentiment prediction, human-scene contact detection, localization in video forgery, video classification, student engagement level detection (four class video classification), multi class classification (four-level video classification), 3d canonicalization, 3d surface generation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001065-36e9387d.jpg)
Visibility Estimation from Point Cloud
Amodal layout estimation, blink estimation, camera absolute pose regression, change data generation, constrained diffeomorphic image registration, continuous affect estimation, dataset distillation, deep feature inversion, document image skew estimation, earthquake prediction, fashion compatibility learning.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001467-12407941.jpg)
Displaced People Recognition
Finger vein recognition, flooded building segmentation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000470-a465c9ea.jpg)
Future Hand Prediction
Generative temporal nursing, house generation, human fmri response prediction, hurricane forecasting, ifc entity classification, image declipping, image similarity detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002890-fce976fd.jpg)
Image Text Removal
Image-to-gps verification.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000051-9fe1ac0a.jpg)
Image-based Automatic Meter Reading
Dial meter reading, indoor scene reconstruction, jpeg decompression.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/57764734-6608-4cd8-9806-0ba85e6a78b1.jpg)
Kiss Detection
Laminar-turbulent flow localisation.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/73fafcc1-c15b-4f9f-980c-fe4825198d5a.jpg)
Landmark Recognition
Brain landmark detection, corpus video moment retrieval, linear probing object-level 3d awareness, mllm evaluation: aesthetics, medical image deblurring, mental workload estimation, meter reading, motion expressions guided video segmentation, natural image orientation angle detection, multi-object colocalization, multilingual text-to-image generation, video emotion detection, nwp post-processing, occluded 3d object symmetry detection, open set video captioning, pso-convnets dynamics 1, pso-convnets dynamics 2, partial point cloud matching.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/cc3682a5-f480-439a-ba4f-224210065710.jpg)
Partially View-aligned Multi-view Learning
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002589-297611c0.jpg)
Pedestrian Detection
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000609-bfee3732.jpg)
Thermal Infrared Pedestrian Detection
Personality trait recognition by face, physical attribute prediction, point cloud semantic completion, point cloud classification dataset, point-of-no-return (pnr) temporal localization, pose contrastive learning, portrait generation, procedure step recognition, prostate zones segmentation, pulmonary vessel segmentation, pulmonary artery–vein classification, reference expression generation, interspecies facial keypoint transfer, specular reflection mitigation, specular segmentation, state change object detection, surface normals estimation from point clouds, train ego-path detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/b154d103-3d60-4544-944e-94bc938ba45d.jpg)
Transform A Video Into A Comics
Transparency separation, typeface completion.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000200-bd1e7765.jpg)
Unbalanced Segmentation
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/a261223e-db89-4890-bbcb-4a450c51705a.jpg)
Unsupervised Long Term Person Re-Identification
Video correspondence flow.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/174a98c2-b2fa-43f1-85a9-34dcb0da8001.jpg)
Key-Frame-based Video Super-Resolution (K = 15)
Zero-shot single object tracking, yield mapping in apple orchards, lidar absolute pose regression, opd: single-view 3d openable part detection, self-supervised scene text recognition, spatial-aware image editing, video narration captioning, spectral estimation, spectral estimation from a single rgb image, 3d prostate segmentation, aggregate xview3 metric, atomic action recognition, composite action recognition, calving front delineation from synthetic aperture radar imagery, computer vision transduction, crosslingual text-to-image generation, zero-shot dense video captioning, document to image conversion, frame duplication detection, geometrical view, hyperview challenge.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/d2a36764-c4ff-4697-be82-6215b1619e4d.jpg)
Image Operation Chain Detection
Kinematic based workflow recognition, logo recognition.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000629-4e8ac622.jpg)
MLLM Aesthetic Evaluation
Motion detection in non-stationary scenes, open-set video tagging, satellite orbit determination.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/84104664-e53c-4e20-97fd-21f04043d5ab.jpg)
Segmentation Based Workflow Recognition
2d particle picking, small object detection.
![research proposal on computer vision research proposal on computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002111-b5bf20cd.jpg)
Rice Grain Disease Detection
Sperm morphology classification, video & kinematic base workflow recognition, video based workflow recognition, video, kinematic & segmentation base workflow recognition, animal pose estimation.
Research Topics of the Computer Vision & Graphics Group
Seeing, modelling and animating humans.
![research proposal on computer vision research proposal on computer vision](https://www.hhi.fraunhofer.de/fileadmin/_processed_/7/f/csm_Teaser_AVH_3d6548a534.png)
Realistic human modelling is a challenging task in Computer Vision and Graphics. We investigate new methods for capturing and analyzing human bodies and faces in images and videos, and we develop compact models for representing facial expressions, human bodies, and their motion. We combine model-based and image- and video-based representations with generative AI models and neural rendering.
Read more about current research projects in this field.
Scenes, Structure and Motion
![Teaser scenes research proposal on computer vision](https://www.hhi.fraunhofer.de/fileadmin/_processed_/d/3/csm_teaser_scenes_02ec7502cd.png)
We have a long tradition in 3D scene analysis and continuously perform innovative research in 3D capturing and 3D reconstruction. Our work ranges from highly detailed stereo and multi-view reconstruction of static objects and scenes, including those with complex surface and shape properties, through monocular shape-from-X methods, to the analysis of deforming objects in monocular video.
Computational Imaging and Video
![Computational Video research proposal on computer vision](https://www.hhi.fraunhofer.de/fileadmin/_processed_/c/b/csm_computational_video_5273beeee3.png)
We perform innovative research in video processing and computational video, opening up new opportunities for how dynamic scenes can be analyzed and how video footage can be represented, edited, and seamlessly augmented with new content.
Learning and Inference
![Teaser learning research proposal on computer vision](https://www.hhi.fraunhofer.de/fileadmin/_processed_/5/3/csm_teaser_learning_9742e1ed9f.png)
Our research combines computer vision, computer graphics, and machine learning to understand images and video data. In our research, we focus on the combination of deep learning with strong models or physical constraints in order to combine the advantages of model-based and data-driven methods.
Augmented and Mixed Reality
![Augmented Reality research proposal on computer vision](https://www.hhi.fraunhofer.de/fileadmin/_processed_/e/8/csm_AugmentedReality_707002e0dd.jpg)
Our experience in tracking dynamic scenes and objects, as well as in photorealistic rendering, enables new augmented reality solutions in which virtual content is seamlessly blended into real video footage, with applications in, for example, multimedia, industry, and medicine.
Previous Research Projects
![Older Research Projects research proposal on computer vision](https://www.hhi.fraunhofer.de/fileadmin/_processed_/a/c/csm_old_research_f76359c4a9.jpg)
We have performed various research projects in the above fields over the years.
Read more about older research projects here.
Top Computer Vision Papers of All Time (Updated 2024)
![research proposal on computer vision](https://viso.ai/wp-content/uploads/2024/03/best-CV-papers-cover-image-1060x439.png)
Today’s boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNN). The main CV methods include image classification, image localization, object detection, and segmentation.
In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories – classical CV approaches, and papers based on deep-learning. We chose the following papers based on their influence, quality, and applicability.
- Gradient-based Learning Applied to Document Recognition (1998)
- Distinctive Image Features from Scale-Invariant Keypoints (2004)
- Histograms of Oriented Gradients for Human Detection (2005)
- SURF: Speeded Up Robust Features (2006)
- ImageNet Classification with Deep Convolutional Neural Networks (2012)
- Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
- GoogLeNet – Going Deeper with Convolutions (2014)
- ResNet – Deep Residual Learning for Image Recognition (2015)
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
- YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)
- Mask R-CNN (2017)
- EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)
About us: Viso Suite is the end-to-end computer vision solution for enterprises. With a simple interface and features that give machine learning teams control over the entire ML pipeline, Viso Suite makes it possible to achieve a 3-year ROI of 695%. Book a demo to learn more about how Viso Suite can help solve business problems.
Classic Computer Vision Papers
Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition. They studied discriminative and non-discriminative gradient-based techniques for training the recognizer without manual segmentation and labeling.
![research proposal on computer vision LeNet CNN architecture digits recognition](https://viso.ai/wp-content/uploads/2024/03/lenet-architecture-digits-recognition-1060x286.png)
Characteristics of the model:
- LeNet-5 comprises 7 layers (not counting the input). Its first convolutional layer has 6 feature maps and 156 trainable parameters.
- The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class.
- The training set consists of 60,000 examples (30,000 each from two NIST collections), and the authors reached a 0.35% error rate on the training set after 19 passes.
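The 156-parameter figure can be reproduced with a short calculation. This is a minimal sketch that assumes 5×5 kernels, a single-channel input, and 6 output feature maps in the first convolutional layer; those values come from the LeNet-5 paper rather than from the summary above, and the function names are hypothetical.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Trainable parameters of a conv layer: one weight per kernel entry
    per input channel, plus one bias, for every output feature map."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def conv_output_size(input_size: int, kernel_size: int) -> int:
    """Spatial size after an unpadded, stride-1 convolution."""
    return input_size - kernel_size + 1


# Assumed first-layer settings (5x5 kernels, 1 input channel, 6 feature maps).
c1_params = conv_params(kernel_size=5, in_channels=1, out_channels=6)
c1_size = conv_output_size(input_size=32, kernel_size=5)
print(c1_params, c1_size)  # 156 28
```

Under these assumptions the 32×32 input yields 28×28 feature maps, and the layer has (25 + 1) × 6 = 156 trainable parameters.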
Find the LeNet paper here .
David Lowe (2004) proposed a method for extracting distinctive invariant features from images. He used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.
![research proposal on computer vision SIFT method keypoints detection](https://viso.ai/wp-content/uploads/2024/03/sift-method-keypoints-selection.jpg)
Model characteristics:
- The method generates large numbers of features that densely cover the image over the full range of scales and locations.
- The model needs to match at least 3 features from each object – in order to reliably detect small objects in cluttered backgrounds.
- For image matching and recognition, the model extracts SIFT features from a set of reference images stored in a database.
- The SIFT model matches a new image by individually comparing each of its features to this database, using Euclidean distance.
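The matching step can be sketched as a nearest-neighbor search over descriptor vectors. The example below is illustrative only: the descriptors are toy 3-dimensional vectors (real SIFT descriptors are much longer), the function names are hypothetical, and the rejection rule (keep a match only if the nearest distance is clearly smaller than the second nearest, with a threshold of 0.8) is an assumption borrowed from the SIFT paper rather than something stated above.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_features(query, database, max_ratio=0.8):
    """Return (query_index, database_index) pairs.

    Each query descriptor is compared with every database descriptor.
    A match is kept only when the nearest distance is below
    max_ratio times the second-nearest distance (assumed ratio test).
    The database must hold at least two descriptors.
    """
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((euclidean(q, d), di) for di, d in enumerate(database))
        (best, best_idx), (second, _) = dists[0], dists[1]
        if best < max_ratio * second:
            matches.append((qi, best_idx))
    return matches


database = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [[0.1, 0.0, 0.9],   # close to database[0]
         [0.5, 0.5, 0.0]]   # equally far from database[1] and database[2]
matches = match_features(query, database)
print(matches)  # [(0, 0)]
```

The second query descriptor is ambiguous, so the ratio test discards it instead of reporting an unreliable match.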
Find the SIFT paper here .
Navneet Dalal and Bill Triggs studied feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They experimented with grids of Histograms of Oriented Gradient (HOG) descriptors, which significantly outperformed existing feature sets for human detection.
![research proposal on computer vision histogram object detection](https://viso.ai/wp-content/uploads/2024/03/histogram-feature-extraction-object-detection.jpg)
Authors' achievements:
- The HOG descriptors gave near-perfect separation on the original MIT pedestrian database.
- Good results require fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
- The researchers introduced a more challenging dataset containing over 1800 annotated human images with many pose variations and backgrounds.
- In the standard detector, each HOG cell appears four times with different normalizations, and including this redundancy improves performance to 89%.
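To make the idea of an orientation histogram concrete, here is a simplified sketch for a single cell. It assumes centered differences for the gradient, 9 unsigned orientation bins over 0 to 180 degrees, and hard assignment of each pixel to one bin; block normalization and vote interpolation are omitted, and the function name is hypothetical.

```python
import math


def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `cell` is a 2D list of intensities. Gradients are computed with
    centered differences on interior pixels only (a simplification).
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    rows, cols = len(cell), len(cell[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]
            gy = cell[r + 1][c] - cell[r - 1][c]
            magnitude = math.hypot(gx, gy)
            # Fold the signed angle into [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


# A vertical edge: intensity changes only along the horizontal axis.
edge = [[0, 0, 10, 10] for _ in range(4)]
hist = cell_histogram(edge)
print(hist[0])  # 40.0
```

All gradient energy from the vertical edge falls into the first bin, which is the kind of signal a linear SVM can then use to separate people from background.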
Find the HOG paper here .
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor called SURF (Speeded Up Robust Features). It outperforms previously proposed schemes in repeatability, distinctiveness, and robustness, while being much faster to compute. The authors relied on integral images for image convolutions and built on the strengths of the leading existing detectors and descriptors.
![research proposal on computer vision surf detecting interest points](https://viso.ai/wp-content/uploads/2024/03/surf-detected-interest-points.jpg)
- Applied a Hessian matrix-based measure for the detector, and a distribution-based descriptor, simplifying these methods to the essential.
- Presented experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application.
- SURF showed strong performance – SURF-128 with an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).
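The integral images that SURF relies on allow the sum over any rectangular region to be computed with four lookups, regardless of the region's size. The following is a minimal sketch of that data structure, not the authors' implementation; the function names are hypothetical.

```python
def integral_image(img):
    """Build a (rows+1) x (cols+1) table where ii[r][c] is the sum of
    img over all rows < r and all columns < c."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img rows top..bottom-1 and columns left..right-1,
    using four table lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 3, 3)
lower_right = box_sum(ii, 1, 1, 3, 3)
print(total, lower_right)  # 45 28
```

Because each box sum costs the same four lookups, box-filter approximations of image derivatives can be evaluated at any scale in constant time.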
Find the SURF paper here .
Papers Based on Deep-Learning Models
Alex Krizhevsky and his team won the ImageNet Challenge in 2012 with a deep convolutional neural network. They trained one of the largest CNNs of that time on the ImageNet subsets used in the ILSVRC-2010 and ILSVRC-2012 challenges and achieved the best results reported on these datasets. They wrote a highly optimized GPU implementation of 2D convolution and of the other operations required for CNN training, and made it publicly available.
![research proposal on computer vision alexnet CNN architecture](https://viso.ai/wp-content/uploads/2024/03/alexnet-cnn-architecture.jpg)
- The final CNN contained five convolutional and three fully connected layers, and the depth was quite significant.
- They found that removing any convolutional layer (each containing less than 1% of the model’s parameters) resulted in inferior performance.
- The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
- After fine-tuning on ImageNet-2012, it gave a top-5 error rate of 16.6%.
Find the ImageNet paper here .
Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, specifically focusing on very deep convolutional networks (VGG). They showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.
![research proposal on computer vision image classification CNN results VOC-2007, VOC-2012](https://viso.ai/wp-content/uploads/2024/03/VOC-2012-image-classification.jpg)
- Their ImageNet Challenge 2014 submission secured the first and second places in the localization and classification tracks respectively.
- They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
- They made their two best-performing ConvNet models publicly available to facilitate further research on deep visual representations in CV.
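One commonly cited motivation for very small filters is that a stack of 3×3 convolutions covers the same receptive field as a single larger filter while using fewer weights. The sketch below works through that arithmetic under stated assumptions: stride 1, no pooling between the stacked layers, equal input and output channel counts, and biases ignored. The channel count of 64 is a hypothetical example value.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stride-1 convolutions stacked
    without pooling in between."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels):
    """Weights (biases ignored) of a conv layer with `channels`
    input channels and `channels` output channels."""
    return kernel_size * kernel_size * channels * channels


channels = 64  # hypothetical channel count
field = stacked_receptive_field(3)           # three stacked 3x3 layers
three_3x3 = 3 * conv_weights(3, channels)    # 27 * C^2 weights
one_7x7 = conv_weights(7, channels)          # 49 * C^2 weights
print(field, three_3x3, one_7x7)  # 7 110592 200704
```

Three 3×3 layers see a 7×7 region with roughly 45% fewer weights than one 7×7 layer, and they add two extra nonlinearities along the way.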
Find the VGG paper here .
The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. The architecture set the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). Its main hallmark was the improved utilization of the computing resources inside the network.
![research proposal on computer vision GoogleNet Inception CNN](https://viso.ai/wp-content/uploads/2024/03/googlenet-inception-module-dimension-reductions.jpg)
- A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
- Their submission for ILSVRC14 was called GoogLeNet, a 22-layer deep network. Its quality was assessed in the context of classification and detection.
- They added 200 region proposals coming from multi-box increasing the coverage from 92% to 93%.
- Lastly, they used an ensemble of 6 ConvNets when classifying each region which improved results from 40% to 43.9% accuracy.
Find the GoogLeNet paper here .
Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
![research proposal on computer vision resnet error rates](https://viso.ai/wp-content/uploads/2024/03/resnet-error-rates-ImageNet.jpg)
- They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still having lower complexity.
- An ensemble of these residual nets achieved 3.57% error on the ImageNet test set, which won 1st place on the ILSVRC 2015 classification task.
- The team also presented an analysis on CIFAR-10 with 100 and 1000 layers. Separately, the deeper representations yielded a 28% relative improvement on the COCO object detection dataset.
- Moreover, in the ILSVRC and COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
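The residual formulation can be illustrated with a toy block that computes y = ReLU(F(x) + x). This is a simplified sketch: it uses small fully connected layers instead of convolutions, omits batch normalization, and all names and weights are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F(x) = w2 * relu(w1 * x).

    The layers only have to learn the residual F; the input x is
    carried forward by the identity shortcut.
    """
    residual = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(residual, x)])


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity_out = residual_block(x, zeros, zeros)
print(identity_out)  # [1.0, 2.0]
```

With all weights at zero the block reduces to the identity for non-negative inputs, which suggests why adding more such blocks need not make a network harder to optimize.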
Find the ResNet paper here .
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network and therefore enables nearly cost-free region proposals. Their RPN was a fully convolutional network that simultaneously predicted object bounds and objectness scores at each position. They trained the RPN end-to-end to generate high-quality region proposals, which Fast R-CNN used for detection.
![research proposal on computer vision faster R-CNN object detection](https://viso.ai/wp-content/uploads/2024/03/faster-R-CNN-unified-network.jpg)
- Merged RPN and Fast R-CNN into a single network by sharing their convolutional features. In the terminology of neural networks with “attention” mechanisms, the RPN tells the unified network where to look.
- For the very deep VGG-16 model, their detection system had a frame rate of 5fps on a GPU.
- Achieved state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
- In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.
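To predict object bounds at each position, an RPN regresses offsets relative to a fixed set of reference boxes, usually called anchors. The sketch below generates such anchors for one position. The 3 scales and 3 aspect ratios are assumed defaults taken from the Faster R-CNN paper, not from the summary above, and the function name is hypothetical.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Reference boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each box has area scale**2 and height/width equal to `ratio`.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


anchors = anchors_at(100, 100)
print(len(anchors))  # 9
print(anchors[1])    # (36.0, 36.0, 164.0, 164.0)
```

With 9 anchors per position, the network would output 9 objectness predictions and 9 sets of box offsets at every location of the shared feature map.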
Find the Faster R-CNN paper here .
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
![research proposal on computer vision YOLO CNN architecture](https://viso.ai/wp-content/uploads/2024/03/yolo-architecture.jpg)
- The base YOLO model processed images in real-time at 45 frames per second.
- A smaller version of the network, Fast YOLO, processed 155 frames per second, while still achieving double the mAP of other real-time detectors.
- Compared to state-of-the-art detection systems, YOLO was making more localization errors, but was less likely to predict false positives in the background.
- YOLO learned very general representations of objects and outperformed other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains such as artwork.
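Treating detection as regression means the network emits one fixed-size tensor per image. The sketch below computes that tensor's shape and finds which grid cell is responsible for an object, assuming the settings reported in the YOLO paper for PASCAL VOC (a 7×7 grid, 2 boxes per cell, 20 classes); these values and the function names are not taken from the summary above.

```python
def yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Each cell predicts `boxes_per_cell` boxes (x, y, w, h, confidence)
    plus one set of class probabilities."""
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)


def responsible_cell(center_x, center_y, grid_size=7):
    """Grid cell (row, col) containing a box center given in
    normalized image coordinates in [0, 1]."""
    col = min(grid_size - 1, int(center_x * grid_size))
    row = min(grid_size - 1, int(center_y * grid_size))
    return row, col


shape = yolo_output_shape()
cell = responsible_cell(0.5, 0.3)
print(shape, cell)  # (7, 7, 30) (2, 3)
```

Because every prediction comes out of a single forward pass in this layout, no separate proposal stage is needed, which is consistent with the real-time frame rates listed above.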
Find the YOLO paper here .
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick (Facebook) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach could detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extended Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
![research proposal on computer vision mask R-CNN framework](https://viso.ai/wp-content/uploads/2024/03/mask-rcnn-framework.jpg)
- Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
- Showed top results in all three tracks of the COCO suite of challenges: instance segmentation, bounding box object detection, and person keypoint detection.
- Mask R-CNN outperformed all existing, single-model entries on every task, including the COCO 2016 challenge winners.
- The model served as a solid baseline and eased future research in instance-level recognition.
Find the Mask R-CNN paper here .
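The parallel mask branch described above shows up directly in the training objective. As a brief sketch based on the Mask R-CNN paper (the symbols follow the paper, not this article), the loss on each sampled region of interest is a sum of three terms:

```latex
L = L_{cls} + L_{box} + L_{mask}
```

Here $L_{cls}$ and $L_{box}$ are the classification and bounding box losses inherited from Faster R-CNN, and $L_{mask}$ is an average binary cross-entropy over the predicted mask for the ground-truth class.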
The authors of EfficientNet (Mingxing Tan and Quoc V. Le) studied model scaling and found that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all three dimensions (depth, width, and resolution) using a simple but effective compound coefficient. They demonstrated the effectiveness of this method by scaling up MobileNets and ResNet.
![research proposal on computer vision efficientnet model scaling CNN](https://viso.ai/wp-content/uploads/2024/03/efficientnet-model-scaling.jpg)
- Designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets, with much better accuracy and efficiency than previous ConvNets.
- EfficientNet-B7 achieved state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
- It also transferred well and achieved state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with far fewer parameters.
Find the EfficientNet paper here .
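The compound coefficient can be illustrated with a few lines of arithmetic. This sketch assumes the base multipliers reported in the EfficientNet paper (depth 1.2, width 1.1, resolution 1.15, found by grid search on the baseline network); the `compound_scale` function is a hypothetical helper, and the FLOPs factor is the paper's approximation that cost grows with depth times width squared times resolution squared.

```python
# Sketch only: compound scaling with a single coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (EfficientNet paper)

def compound_scale(phi):
    """Return (depth, width, resolution, approx_flops) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs scale roughly with depth * width^2 * resolution^2.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Because the three bases are chosen so that the FLOPs factor is close to 2 for phi = 1, each increment of phi roughly doubles the compute budget.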
Project Proposal
For the course project you will explore in depth a topic of your own choosing. This can be an implementation (implement an existing algorithm); an application (apply a computer vision algorithm to a new problem); or research (try to invent something new).
To get you started, we have prepared a list of suggested projects . We believe that any of these would be feasible to complete as a project.
We expect you to work in groups of 3-5 students for the course project. The course project should amount to roughly one homework's worth of work per person. In previous years we have typically expected about two homework assignments' worth of work per person in the group project; however, we are explicitly lowering our expectations to account for the extra overhead of collaborating remotely.
We are not expecting state-of-the-art, publication ready results from your course project! The point of the project is to get practice applying concepts of the class to a problem of your choosing, without the “scaffolding” code provided in the homeworks.
What to submit
The project proposal is due on Monday, April 5, 2021 at 11:59:59pm on Gradescope. You only need to submit one project proposal per group; add all the other members on Gradescope.
After submitting your project proposal, please fill out this Google Form once as a group to help us keep track of who is working together.
Your project proposal should be a 1-page PDF that answers the following questions:
Project Title : What is the name of your project?
Group Members : What are the names and uniqnames of the students involved?
Problem Statement : What is the problem you are trying to solve?
Approach : How do you plan to go about solving this problem? You don’t need to have everything figured out exactly, but you should have a vague sense of how you will proceed.
Data : What dataset do you plan to use? A common failure mode for projects is to have a cool idea, but no idea where to get the necessary data. We recommend against collecting your own dataset for the project, as this will significantly increase the complexity and workload; instead you should try to get away with existing datasets.
Computational Resources : What computational resources will you use for this project? For some projects a laptop may be completely fine. But if you are planning to train any kind of neural network, you should have an estimate of how much time a model should take to train, and where you will get access to the computational resources you need. Google Colab is a great free resource for small amounts of GPU time, but be aware that it is not sufficient for training large-scale models.
Evaluation : How do you plan to evaluate whether your project is successful? What metric will you use? Is there some simple baseline that you plan to compare your model against?
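For the Computational Resources question, a back-of-the-envelope estimate is usually enough. The sketch below is one way to do it, assuming you have timed a few training steps on your target hardware; the function name and all the numbers are hypothetical placeholders.

```python
import math

def estimate_training_hours(num_examples, epochs, batch_size, seconds_per_step):
    """Rough wall-clock estimate: total optimizer steps times measured time per step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps * seconds_per_step / 3600

# Hypothetical numbers: 50k training images, 30 epochs, batch size 128,
# and 0.25 s per step measured on the GPU you plan to use.
hours = estimate_training_hours(
    num_examples=50_000, epochs=30, batch_size=128, seconds_per_step=0.25
)
print(f"Estimated training time: {hours:.2f} hours")
```

The estimate ignores data loading stalls, evaluation passes, and hyperparameter sweeps, so it is best treated as a lower bound per training run.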
If you are following one of our suggested projects then your “Problem Statement” can be very brief – one or two sentences is fine. For the suggested projects, we don’t expect you to work on all of the datasets we link; but please do tell us which you are planning to use.
We have 11 Computer Vision (research proposal form) PhD Projects, Programmes & Scholarships in the UK
Human emotion analysis and recognition for improving trusted human-robot interaction. Main project focus: AI and Robotics
PhD Research Project
PhD Research Projects are advertised opportunities to examine a pre-defined topic or answer a stated research question. Some projects may also provide scope for you to propose your own ideas and approaches.
Self-Funded PhD Students Only
This project does not have funding attached. You will need to have your own means of paying fees and living costs and / or seek separate funding from student finance, charities or trusts.
Novel Applications of Remote Sensing for Health
Competition Funded PhD Project (Students Worldwide)
This project is in competition for funding with other projects. Usually the project which receives the best applicant will be successful. Unsuccessful projects may still go ahead as self-funded opportunities. Applications for the project are welcome from all suitably qualified candidates, but potential funding may be restricted to a limited set of nationalities. You should check the project and department details for more information.
PhD Studentship in Computer Science: AI for Robotics in Agriculture
Funded PhD Project (UK Students Only)
This research project has funding attached. It is only available to UK citizens or those who have been resident in the UK for a period of 3 years or more. Some projects, which are funded by charities or by the universities themselves may have more stringent restrictions.
Generative AI for synthetic biometric and age estimation testing
Competition Funded PhD Project (UK Students Only)
This research project is one of a number of projects at this institution. It is in competition for funding with one or more of these projects. Usually the project which receives the best applicant will be awarded the funding. The funding is only available to UK citizens or those who have been resident in the UK for a period of 3 years or more. Some projects, which are funded by charities or by the universities themselves may have more stringent restrictions.
Next Generation Machine Learning for Data Analysis
Leveraging plant biomechanics under hostile environments
Machine unlearning for privacy preserving applications
Biometric deepfake to physical attack instrument assessment
Help hearing impaired listeners to understand speech better by audio-visual integration in wearable devices
XAI for biometrics – legal presentation of data
Lightweight model instance segmentation on edge devices
CS231n: Deep Learning for Computer Vision
Stanford - Spring 2024
Final Project
Important Dates
Deliverable | Weight | Due Date | Late Days |
---|---|---|---|
Project Proposal | 1% | 04/22/2024 | Yes |
Project Milestone | 2% | 05/14/2024 | Yes |
Final Report | 29% | 06/05/2024 | No |
Poster Session (in person) + Poster PDF & Code (submit online) | 3% | Poster Session: 06/12/2024; Submitting PDF and Code: 06/11/2024 | No |
The Course Project is an opportunity for you to apply what you have learned in class to a problem of your interest. Potential projects usually fall into these two tracks:
- Applications. If you're coming to the class with a specific background and interests (e.g. biology, engineering, physics), we'd love to see you apply vision models learned in this class to problems related to your particular domain of interest. Pick a real-world problem and apply computer vision models to solve it.
- Models. You can build a new model (algorithm) or a new variant of existing models, and apply it to tackle vision tasks. This track might be more challenging, and sometimes leads to a piece of publishable work.
One restriction to note is that this is a Computer Vision class, so your project should involve pixels of visual data in some form somewhere. E.g. a pure NLP project is not a good choice, even if your approach involves ConvNets.
We have compiled a list of project ideas for inspiration that combine recent trends and interesting applications. Note that you do not need to pick one from here. Rather, these can serve as starting points for you to find the ideas that excite you.
- Spring 2022
- Spring 2017
- Winter 2016
- Winter 2015
To inspire ideas, you might also look at recent deep learning publications from top-tier conferences, as well as other resources below.
- CVPR : IEEE Conference on Computer Vision and Pattern Recognition
- ICCV : International Conference on Computer Vision
- ECCV : European Conference on Computer Vision
- NIPS : Neural Information Processing Systems
- ICLR : International Conference on Learning Representations
- ICML : International Conference on Machine Learning
- Publications from the Stanford Vision Lab
- Awesome Deep Vision
- Past CS229 Projects : Example projects from Stanford's machine learning class
- Kaggle challenges : An online machine learning competition website. For example, a Yelp classification challenge .
For applications, this type of project would involve careful data preparation, an appropriate loss function, details of training and cross-validation, and good test set evaluations and model comparisons. Don't be afraid to think outside of the box. Some successful examples can be found below:
- Teaching Deep Convolutional Neural Networks to Play Go
- Playing Atari with Deep Reinforcement Learning
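The cross-validation step mentioned above for application projects can be sketched in a few lines. This is a generic k-fold index splitter written with the standard library only; the function name and the choice of 5 folds are assumptions for illustration, and in practice a held-out test set should be set aside before any folds are formed.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

# Hypothetical usage: 10 examples, 5 folds.
splits = list(k_fold_indices(n=10, k=5))
for train, val in splits:
    print(f"train on {len(train)} examples, validate on {sorted(val)}")
```

Each example appears in exactly one validation fold, so averaging the validation metric over the folds gives a less noisy basis for model comparison than a single split.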
ConvNets also run in real time on mobile phones and Raspberry Pis, so building an interesting mobile application could be a good project. If you want to go this route, you might want to check out PyTorch Mobile, TensorFlow Lite, or Caffe2 iOS/Android integration.
For models, ConvNets have been successfully used in a variety of computer vision tasks. This type of project would involve understanding the state-of-the-art vision models, and building new models or improving existing models for a vision task. The list below presents some papers on recent advances of ConvNets in the computer vision community.
- Image Classification : [Krizhevsky et al.] , [Russakovsky et al.] , [Szegedy et al.] , [Simonyan et al.] , [He et al.] , [Huang et al.] , [Hu et al.] [Zoph et al.]
- Object detection : [Girshick et al.] , [Ren et al.] , [He et al.]
- Image segmentation : [Long et al.] [Noh et al.] [Chen et al.]
- Video classification : [Karpathy et al.] , [Simonyan and Zisserman] [Tran et al.] [Carreira et al.] [Wang et al.]
- Scene classification : [Zhou et al.]
- Face recognition : [Taigman et al.] [Schroff et al.] [Parkhi et al.]
- Depth estimation : [Eigen et al.]
- Image-to-sentence generation : [Karpathy and Fei-Fei] , [Donahue et al.] , [Vinyals et al.] [Xu et al.] [Johnson et al.]
- Visualization and optimization : [Szegedy et al.] , [Nguyen et al.] , [Zeiler and Fergus] , [Goodfellow et al.] , [Schaul et al.]
You might also gain inspiration by taking a look at some popular computer vision datasets:
- Meta Pointer: A large collection organized by CV Datasets.
- Yet another Meta pointer
- ImageNet : a large-scale image dataset for visual recognition organized by WordNet hierarchy
- Visual Genome
- SA-1B : dataset of a large number of images and segmentation masks to segment objects in those images
- COCO : large-scale object detection, segmentation, and captioning dataset
- Open Images : a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives
- Cityscapes Dataset : This dataset focuses on semantic understanding of urban street scenes, with pixel-level annotations for various object classes such as cars, pedestrians, and roads
- DeepFashion : a large-scale clothes dataset containing over 800,000 diverse fashion images annotated with bounding boxes, clothing categories, and attributes
- Hugging Face datasets : collection of generic datasets available on Hugging Face
- Objaverse : a large-scale 3D asset database
- SUN Database : a benchmark for scene recognition and object detection with annotated scene categories and segmented objects
- Places Database : a scene-centric database with 205 scene categories and 2.5 million labeled images
- NYU Depth Dataset v2 : an RGB-D dataset of segmented indoor scenes
- Microsoft COCO : a new benchmark for image recognition, segmentation and captioning
- Flickr100M : 100 million creative commons Flickr images
- Labeled Faces in the Wild : a dataset of 13,000 labeled face photographs
- Human Pose Dataset : a benchmark for articulated human pose estimation
- YouTube Faces DB : a face video dataset for unconstrained face recognition in videos
- UCF101 : an action recognition data set of realistic action videos with 101 action categories
- HMDB-51 : a large human motion dataset of 51 action classes
- ActivityNet : A large-scale video dataset for human activity understanding
- Moments in Time : A dataset of one million 3-second videos
Collaboration
You can work in teams of up to 3 people. We do expect that projects done with 3 people have more impressive writeup and results than projects done with fewer people. For example, to get a sense for the scope and expectations for projects, have a look at project reports from previous years. While we encourage that you work in teams, you may also work alone.
You may consult any papers, books, online references, or publicly available implementations for ideas and code that you may want to incorporate into your strategy or algorithm, so long as you clearly cite your sources in your code and your writeup. However, under no circumstances may you look at another group’s code or incorporate their code into your project.
If you are combining your course project with the project from another class, you must receive permission from the instructors, and clearly explain in the Proposal, Milestone, and Final Report the exact portion of the project that is being counted for CS 231n. In this case you must prepare separate reports for each course, and submit your final report for the other course as well.
If you are combining your course project with another course project or research project, you DO NOT need to receive prior permission from CS231n. Instead, we ask that you clearly explain in the Proposal, Milestone, and Final Report the UNIQUE portion of the project that is being counted for this class. In this case you must prepare separate reports for each class and submit the other report to CS231n as well (if available). Remember, it is an honor code violation to use the same final report PDF for multiple classes. For the report for this class, focus on the specific portion of the project that is counted for this class.
See the late policy on the home page .
Project Proposal
The project proposal should be one paragraph (200-400 words). Your project proposal should describe:
- What is the problem that you will be investigating? Why is it interesting?
- What reading will you examine to provide context and background?
- What data will you use? If you are collecting new data, how will you do it?
- What method or algorithm are you proposing? If there are existing implementations, will you use them and how? How do you plan to improve or modify such implementations? You don't have to have an exact answer at this point, but you should have a general sense of how you will approach the problem you are working on.
- How will you evaluate your results? Qualitatively, what kind of results do you expect (e.g. plots or figures)? Quantitatively, what kind of analysis will you use to evaluate and/or compare your results (e.g. what performance metrics or statistical tests)?
- If you are combining this project with another course/research project, what is the unique portion of the project that is counted towards this class?
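For the quantitative part of the evaluation question, one option is to report a metric together with an uncertainty estimate. The sketch below computes accuracy and a percentile bootstrap confidence interval using only the standard library; the function names, the 1000 resamples, and the toy predictions are assumptions for illustration rather than a course requirement.

```python
import random

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def bootstrap_ci(preds, labels, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_resamples):
        # Resample test examples with replacement and recompute the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(preds[i] == labels[i] for i in idx) / n)
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical test set: 50 binary labels, with the first 10 predictions wrong.
labels = [i % 2 for i in range(50)]
preds = [1 - y if i < 10 else y for i, y in enumerate(labels)]
acc = accuracy(preds, labels)
low, high = bootstrap_ci(preds, labels)
print(f"accuracy = {acc:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

When comparing two models, overlapping intervals on a small test set are a hint that the observed difference may not be meaningful.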
Submission: Please submit your proposal as a PDF on Gradescope. Only one person on your team should submit. Please have this person add the rest of your team as collaborators as a "Group Submission".
Project Milestone
- Title, Author(s)
- Introduction: this section introduces your problem, and the overall plan for approaching your problem
- Problem statement: Describe your problem precisely specifying the dataset to be used, expected results and evaluation
- Technical Approach: Describe the methods you intend to apply to solve the given problem
- Intermediate/Preliminary Results: State and evaluate your results up to the milestone
Submission : Please submit your milestone as a PDF on Gradescope. Only one person on your team should submit. Please have this person add the rest of your team as collaborators as a "Group Submission".
Final Report
Your final write-up is required to be between 6 and 8 pages using the provided template, structured like a paper from a computer vision conference (CVPR, ECCV, ICCV, etc.). Please use this template so we can fairly judge all student projects without worrying about altered font sizes, margins, etc. After the class, we will post all the final reports online so that you can read about each other's work. If you do not want your writeup to be posted online, then please let us know via the project registration form.
The following is a suggested structure for your report, as well as the rubric that we will follow when evaluating reports. You don't necessarily have to organize your report using these sections in this order, but that would likely be a good starting point for most projects. Refer to Ed for more fine-grained details and explanations of each separate section.
- Abstract : Briefly describe your problem, approach, and key results. Should be no more than 300 words.
- Introduction (10%) : Describe the problem you are working on, why it's important, and an overview of your results
- Related Work (10%) : Discuss published work that relates to your project. How is your approach similar or different from others?
- Data (10%) : Describe the data you are working with for your project. What type of data is it? Where did it come from? How much data are you working with? Did you have to do any preprocessing, filtering, or other special treatment to use this data in your project?
- Methods (30%) : Discuss your approach for solving the problems that you set up in the introduction. Why is your approach the right thing to do? Did you consider alternative approaches? You should demonstrate that you have applied ideas and skills built up during the quarter to tackling your problem of choice. It may be helpful to include figures, diagrams, or tables to describe your method or compare it with other methods.
- Experiments (30%) : Discuss the experiments that you performed to demonstrate that your approach solves the problem. The exact experiments will vary depending on the project, but you might compare with previously published methods, perform an ablation study to determine the impact of various components of your system, experiment with different hyperparameters or architectural choices, use visualization techniques to gain insight into how your model works, discuss common failure modes of your model, etc. You should include graphs, tables, or other figures to illustrate your experimental results.
- Conclusion (5%) : Summarize your key results - what have you learned? Suggest ideas for future extensions or new applications of your ideas.
- Writing / Formatting (5%) : Is your paper clearly written and nicely formatted?
Examples of things to put in your supplementary material:
- Source code (if your project proposed an algorithm, or code that is relevant and important for your project).
- Cool videos, interactive visualizations, demos, etc.
Examples of things not to put in your supplementary material:
- The entire PyTorch/TensorFlow GitHub source code.
- Any code that is larger than 10 MB.
- Model checkpoints.
- A computer virus.
Submission : You will submit your final report as a PDF and your supplementary material as a separate PDF or ZIP file. We will provide detailed submission instructions as the deadline nears.
Additional Submission Requirements : We will also ask you to do the following when you submit your project report:
- Your report PDF should list all authors who have contributed to your work; enough to warrant a co-authorship position. This includes people not enrolled in CS 231N such as faculty/advisors if they sponsored your work with funding or data, significant mentors (e.g., PhD students or postdocs who coded with you, collected data with you, or helped draft your model on a whiteboard). All authors should be listed directly underneath the title on your PDF. Include a footnote on the first page indicating which authors are not enrolled in CS 231N. All co-authors should have their institutional/organizational affiliation specified below the title.
- If you have non-231N contributors, you will be asked to describe the following:
- Specify whether the project has been submitted to a peer-reviewed conference or journal. Include the full name and acronym of the conference (if applicable). For example: Neural Information Processing Systems (NIPS). This only applies if you have already submitted your paper/manuscript and it is under review as of the report deadline.
- Any code that was used as a base for projects must be referenced and cited in the body of the paper. This includes CS 231N assignment code, finetuning example code, open-source, or Github implementations. You can use a footnote or full reference/bibliography entry.
- If you are using this project for multiple classes, submit the other class PDF as well. Remember, it is an honor code violation to use the same final report PDF for multiple classes.
In summary, include all contributing authors in your PDF; include detailed non-231N co-author information; tell us if you submitted to a conference, cite any code you used, and submit your dual-project report (e.g., CS 230, CS 231A, CS 234).
Poster Session
- Date: Wednesday, June 12, 2024
- Time: 12:00 pm to 4:30 pm
- Location: AT&T Patio outside Gates Computer Science Building
- Who: Student groups must present in-person at the poster session, unless approved by course staff beforehand to present online. Stanford students, faculty, and guests from industry are welcome!
Students: We will provide foam poster boards and easels. The foam boards we will provide are 30x40 inches, so please print your poster = 20x30 inches. Our recommended size is 24x36 inches. You may print your poster in landscape or portrait orientation.
- Lathrop Library’s Tech Desk: Approximately 3-day turnaround.
- FedEx: Approximately 2-day(?) turnaround.
- Walgreens: Approximately same-day pickup.
- Biotech Productions: Approximately same-day delivery.
- Staples: Approximately same-day pickup.
- Can I print my poster on 8.5x11 inch pieces of paper and tape them together? Yes, but we encourage you to print out one full poster. If you do print sections and tape them together, make sure that all the content is still legible and fits on a 30x40 foam board.
A Review on Computer Vision-Based Methods for Human Action Recognition
Mahmoud Al-Faris
1 School of Energy & Electronic Engineering, Faculty of Technology, University of Portsmouth, Portsmouth PO1 3DJ, UK; [email protected] (J.C.); [email protected] (A.I.A.)
John Chiverton
2 School of Computing, Engineering and Physical Sciences, University of the West of Scotland, Paisley PA1 2BE, UK; [email protected]
Ahmed Isam Ahmed
Human action recognition aims to recognise different actions from a sequence of observations under different environmental conditions. Vision-based action recognition research has a wide range of applications, including video surveillance, tracking, health care, and human–computer interaction. However, accurate and effective vision-based recognition remains a major challenge in the field of computer vision. This review introduces the most recent human action recognition systems and describes the advances of state-of-the-art methods. To this end, the research is organised from hand-crafted representation based methods, including holistic and local representation methods with various sources of data, to deep learning technology, including discriminative and generative models and multi-modality based methods. Next, the most common datasets for human action recognition are presented. This review provides several analyses, comparisons and recommendations that help to identify directions for future research.
1. Introduction
Human Action Recognition (HAR) has a wide range of potential applications. Its target is to recognise the actions of a person from either sensors or visual data. HAR approaches can be categorised into visual sensor-based, non-visual sensor-based and multi-modal categories [ 1 , 2 ]. The main difference between the visual and the other categories is the form of the sensed data. Visual data are captured in the form of 2D/3D images or video, whilst the others capture data in the form of a 1D signal [ 2 ]. Over the last few years, wearable devices such as smart-phones, smart-watches, and fitness wristbands have been developed. These have small non-visual sensors and are equipped with computing power and communication capability. They are also relatively low cost, which has helped to open up new opportunities with ubiquitous applications. These include health monitoring, recuperative training and disease prevention; see, e.g., [ 3 ].
At the same time, visual sensor-based methods of human action recognition are one of the most prevalent and topical areas in the computer vision research community. Applications have included human–computer interaction, intelligent video surveillance, ambient assisted living, human–robot interaction, entertainment and content-based video search. In each one of those applications, the recognition system is trained to distinguish actions carried out in a scene. It may also perform some decisions or further processing based on that inference.
Wearable devices have several limitations; in most cases, for example, they need to be worn and to operate constantly. This might be a significant issue for real applications that require readiness and deployability, and it in turn imposes specific technical requirements related to, e.g., battery life, size and sensor performance; see, e.g., [ 4 ]. In addition, they might not be suitable or efficient to employ in, e.g., crowd applications or other related scenarios. These limitations do not apply to computer vision based HAR, which can be applied to most application scenarios without these technical requirements or limitations.
From about 1980, researchers have presented different studies on action recognition based on images and/or video data [ 5 , 6 ]. In many instances, researchers have been following or drawing inspiration from elements of the operating principles of the human vision system. The human vision system receives visual information about an object especially with respect to movement and shape and how it changes with time. Observations are fed to a perception system for recognition processes. These biophysical processes of the human recognition system have been investigated by many researchers to achieve similar performance in the form of computer vision systems. However, several challenges such as environmental complexities, scale variations, non-rigid shapes, background clutter, viewpoint variations and occlusions make computer vision systems unable to fully realise many elementary aspects of a human vision system.
Action recognition systems can be divided into four categories according to the complexity of the human action: primitive [ 7 ], single person [ 8 ], interaction [ 9 ], and group [ 10 ] action recognition. Primitive actions are basic movements of human body parts, for example, “lifting a hand” and “bending”. Single person actions are a set of primitive actions of a single person such as “running” and “jumping”. Interactions are actions that involve humans and objects, such as “carrying a box” and “playing a guitar”. Group actions refer to actions occurring in a group of people such as a “procession”, “meeting”, and “group walking”.
In general, a comprehensive investigation of the literature shows that computer vision based HAR methods can be classified into two categories: (a) traditional hand-crafted feature based methods followed by a trainable classifier for action recognition; and (b) deep learning based approaches, which are able to learn features automatically from raw data and are commonly followed by a trainable classifier for action recognition [ 11 , 12 ].
Many important survey and review papers have been published on human action recognition and related techniques. However, published reviews usually go out of date. For this reason, an updated review on human action recognition is needed, although writing one is hard and challenging work. In this review, discussions, analysis and comparisons of state-of-the-art methods are provided for vision based human action recognition. Hand-crafted based methods and deep learning based methods are introduced along with popular benchmark datasets and significant applications. This paper also considers different designs of recognition models, including hybrid, modality-based and view-invariant models. Different architectures for vision-based action recognition models are briefly described. Recent research works are presented and explained to help researchers identify paths for possible future work.
The structure of this review starts at low level based methods for action recognition. This is followed by description of some of the important details of feature descriptor based techniques. A number of improvements that can be achieved in these aspects are identified. These are also transferable with respect to the performance of action recognition systems in general. Thereafter, it reviews higher level feature representation based methods. It explains the widespread feature descriptor based techniques with respect to different aspects. The paper then covers the mainstream research that has resulted in the developments of the widely known deep learning based models and their relation to action recognition systems.
2. Popular Challenges in Action Recognition Models
Initially, it might be useful to highlight some of the most popular challenges in action recognition based methods.
2.1. Selection of Training and Testing Data
The type of data can strongly affect the efficiency of a recognition model. Three types of data are usually used for action recognition: RGB, depth, and skeleton information, each of which has advantages and disadvantages. For instance, significant texture information can be provided from an RGB input. This might be considered to be closely related to the visual information that humans typically process. On the other hand, appearance information can vary considerably depending on, e.g., lighting conditions. In contrast to RGB, depth map information is invariant to illumination changes. This makes it easier to separate foreground objects from the background scene. In addition, a depth map provides 3D characteristics of the captured scene. However, depth map information also commonly has some defects. For instance, noisy measurements are sometimes a problem and need to be filtered and refined. Another input type is skeleton information. Skeletons can be obtained using different approaches; see, e.g., [ 13 , 14 , 15 , 16 ]. They can be obtained from RGB or, more commonly, depth information. However, this type of information is often captured or computed imperfectly, especially in an occluded or noisy environment. In this work, the complementary information available in the RGB and depth map data is exploited directly for action recognition.
2.2. Variation in Viewpoint
Most methods assume that actions are performed from a fixed viewpoint. However, in real cases, the location and posture of the person vary considerably depending on the viewpoint from which the action is captured. In addition, motion patterns also vary across views, which makes recognition of an action more difficult. Training a classifier using information from multiple cameras is one way to tackle this issue, as used by [ 17 ]. A view-invariant representation was also obtained by modelling a 3D body posture for action recognition, such as in [ 18 ]. Researchers have also tried to build a view-invariant feature space using the Fourier transform and cylindrical coordinate systems [ 19 ]. However, researchers [ 20 , 21 ] have reported that most multi-view datasets involve a uniform or fixed background. Therefore, in order to evaluate the performance of various methods, it would be necessary to validate them using actions recorded in real-world settings.
2.3. Occlusion
An action to be recognised should be clearly visible in the video sequence. This is often not true in real cases, especially in ordinary surveillance video. Occlusion can be caused by the person themselves or by other objects in the scene. This can make the body parts performing an action invisible, which poses a significant problem for recognition. Volumetric analysis and representation [ 22 ] of an action can tackle self-occlusion issues and help to match and classify the action. Considering body parts separately is a feasible way to handle occlusions. This can be performed using pose-based constraints [ 23 ] and probabilistic methods [ 24 , 25 ]. A multiple camera setup is another approach used by researchers to handle occlusion problems [ 26 ].
2.4. Features Modelling for Action Recognition
In general, two popular approaches are used to design features for action recognition. One is application-driven feature design, which leads to the use of hand-crafted features. The other is to learn features automatically from the input data. This can be achieved using deep learning techniques, which have often shown competitive performance in comparison to hand-crafted feature based methods [ 27 ].
2.5. Cluttered Background
A cluttered background acts as a distraction that introduces ambiguous information into the video of an action [ 28 ]. Different vision-based methods are affected by this issue. For example, an optical flow algorithm used to calculate motion information will capture unwanted background motion (due to the cluttered background) along with the required motion. In addition, this issue has a great influence on colour-based and region-based segmentation approaches, as these methods require a uniform background to achieve high quality segmentation. In order to handle or avoid these issues, many research works assumed a static background or pre-processed the videos [ 20 , 29 ].
2.6. Feature Design Techniques
Different levels of features can be used for action recognition. Some researchers such as [ 30 , 31 , 32 ] proposed to employ the input as a whole referred to here as holistic methods. Other researchers such as [ 33 , 34 , 35 , 36 ] considered salient points of interest from input data with what are known as local feature based methods.
Motion is an important source of information that needs to be considered for action recognition. Different techniques have been proposed to model motion information in the feature computation step. These have included optical flow for low level feature displacements and trajectories across multiple frames, which can then be fed to classifiers or to further feature extraction processes. Other research has included motion information in the classification step with models such as: Hidden Markov Models [ 37 ]; Conditional Random Fields [ 38 ]; Recurrent Neural Networks [ 39 ]; Long Short-Term Memory; and 3D Convolutional Neural Networks [ 40 ]. All of these are able to model sequential information by design.
In such systems, an efficient feature set can reduce the burden of improving recognition performance. An overview is now provided of selected state-of-the-art methods with respect to the aforementioned challenges and approaches. In the following, action recognition systems are divided into those based on hand-crafted features and those based on different deep learning techniques.
3. Applications of Action Recognition Models
During the last decade, many researchers have paid attention to the action recognition field, with significant growth in the number of publications. This section highlights state-of-the-art applications that use human action recognition methodologies to assist humans. Different applications of the current action recognition approaches are discussed, including smart homes and assisted living, healthcare monitoring, security and surveillance, and human–robot interaction [ 41 , 42 ].
3.1. Surveillance and Assisted Living
Modern technologies have provided a wide range of improvements in the performance of independent assisted living systems. This is achieved by using action recognition techniques to monitor and assist occupants. For example, a smart home system proposed by [ 43 ] used machine learning and feature extraction techniques to analyse the activity patterns of an occupant and to introduce automation policies based on the identified patterns to support the occupant. Another smart system was introduced by [ 44 ] for human behaviour monitoring and support (HBMS). This was achieved by observing an occupant’s daily living activities using the Human Cognitive Modeling Language (HCM-L). Then, the HBMS control engine is applied to assist individuals in a smart way. Vision-based technologies have also been introduced in different security applications, such as the surveillance system introduced by [ 45 ]. This system has the ability to recognise human behaviours such as fighting and vandalism events that may occur in a public district using one or several camera views [ 46 ]. Multiple camera views were used by [ 47 ] to detect and predict suspicious and aggressive behaviours in real time and in a crowded environment.
3.2. Healthcare Monitoring
The development of medical research and technology has remarkably improved the quality of patients’ lives. However, higher demands on medical personnel have led researchers to try different technologies to improve healthcare monitoring methods that may be essential in emergency situations. One or more components can be involved in the design of healthcare monitoring systems, including fall detection, human tracking, security alarm and cognitive assistance components. In [ 48 ], a vision-based system was proposed for healthcare purposes. It used Convolutional Neural Networks to detect a person falling. Optical flow sequences were used as input to the networks, followed by three training phases. A fall detection system for home surveillance was proposed by [ 49 ]. A surveillance video was used to detect the fall. Background subtraction was used to detect the moving object, which was then segmented within a bounding box. A few rules were used with the transitions of a finite state machine (FSM) to detect the fall based on measurements of the extracted bounding box. An intelligent monitoring system was proposed by [ 50 ] to monitor “elopement” events in dementia units and to automatically alert the caregivers. Audio and video data of daily activities were collected, and the activities were detected using an HMM-based algorithm.
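To make the bounding-box and finite state machine idea more concrete, the following is a minimal Python sketch. It is not the rule set of [ 49 ]: the three states, the width-to-height ratio rule, and the parameters `ratio_threshold` and `hold_frames` are all assumptions chosen for illustration.

```python
def detect_fall(boxes, ratio_threshold=1.0, hold_frames=3):
    """Return the frame index at which a fall is confirmed, or None.

    boxes: iterable of (width, height) bounding-box sizes, one per frame.
    The rule (box wider than tall for several frames) is a hypothetical
    stand-in for the rules of a real system.
    """
    state = "upright"
    lying_count = 0
    for index, (width, height) in enumerate(boxes):
        # A box that is wider than it is tall suggests a lying posture.
        lying = height > 0 and (width / height) > ratio_threshold
        if lying:
            lying_count += 1
            # upright -> falling -> fallen once the posture persists.
            state = "fallen" if lying_count >= hold_frames else "falling"
        else:
            # Any upright frame resets the machine.
            lying_count = 0
            state = "upright"
        if state == "fallen":
            return index
    return None
```

Requiring the lying posture to persist for `hold_frames` frames is one simple way to avoid flagging a brief crouch or a noisy detection as a fall.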
3.3. Entertainment and Games
In recent years, the gaming industry has developed a new generation of games based on the full body of a gamer, such as dance and sports games. RGB-D sensors (see, e.g., [ 51 ]) are used in this kind of game to improve the perception of human actions. These sensors provide rich information about the entire scene to facilitate action recognition tasks [ 52 , 53 ].
3.4. Human–Robot Interaction
Human–robot interaction is widely adopted in home and industrial environments. An interaction is carried out to perform a specific task such as “passing a cup” or “locating an object”. Vision-based methods are one of the effective ways for humans and robots to communicate [ 54 , 55 ].
3.5. Video Retrieval
Most search engines use the associated information to manage video data. Text data such as tag, description, title and keywords is one piece of information that can be used for such purposes [ 56 ]. However, one piece of information can be incorrect, which results in unsuccessful video retrieval. An alternative approach was proposed by [ 57 ] for video retrieval by analysing human actions in videos. The designed framework computed the similarity between action observations to then be used to retrieve videos of children with autism in a classroom setting.
3.6. Autonomous Driving Vehicles
An automated driving system aims to ensure safety, security, and comfort. One of the most important components of this system is its action prediction and recognition algorithms [ 55 , 58 ]. These methods can analyse human action and motion information in a short period of time, which helps to avoid critical issues such as collisions.
4. Hand-Crafted Feature Representation for Action Recognition
We start by describing some classical human action recognition methods based on hand-crafted features. Classical image classification methods usually consist of three consecutive steps: feature extraction, local descriptor computation and classification. Similar steps have been employed more generally for image and video classification as well as human action recognition.
4.1. Holistic Feature Representation Based Methods
Holistic feature representation based methods treat Regions of Interest (ROIs) as a whole, in which all pixels are exploited to compute the descriptors. In general, holistic methods consist of two steps for action recognition: person detection and descriptor computation. Holistic methods consider the global structure of the human body to represent an action, so it is not necessary to localise body parts. The key idea is that discriminative global information can be obtained from a region of interest and then used for action characterisation. Holistic methods can be efficient, effective and simple to compute due to the use of global information only. This makes this kind of method important for videos which might contain background clutter, camera motion, and occlusions.
In general, holistic methods can be classified into two categories based on the information that is used for the recognition problem:
- Recognition based on shape information such as shape masks and the silhouette of the person;
- Recognition based on shape and global motion information.
4.1.1. Shape Information Based Methods
Holistic approaches are often based on information from silhouettes, edges, optical flow, etc. Such methods are sensitive to noise, background clutter, and variations in occlusion and viewpoint; see, e.g., [ 59 ]. Silhouette information provides shape information about the foreground in the image. Different techniques can be employed to separate silhouette information from the background scene. One simple technique is background subtraction, which can be used with high confidence when the camera is static. On the other hand, some research, such as [ 60 ], has utilised a human tracker and camera motion estimation to obtain silhouette information and to cope with the drawbacks of camera motion. Shape information can be utilised in the time domain to consider the evolution of the silhouette over time. Differences in the binary silhouettes were considered by [ 61 ]. These were accumulated in the spatial and temporal domains to construct a Motion Energy Image (MEI) and a Motion History Image (MHI), respectively. These depict an action with a single template. The MEI is a binary template that indicates regions of movement. The MHI indicates regions of motion where more recent motion regions have higher weight. Three-dimensional (3D) shape information was used by [ 31 ] for action recognition by stacking 2D silhouette information into a space-time volume. For representations invariant to geometrical transformations such as scaling and translation, an extended Radon transform was proposed by [ 62 ]. This was applied to binary silhouette information for action recognition. Contours of MEI templates were exploited by [ 63 ]. A descriptor was obtained which was found to be invariant to scale changes and translations.
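The MEI and MHI templates described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the exact formulation of [ 61 ]: the function name `motion_templates`, the linear decay of one unit per frame, and the default window `tau` are assumptions.

```python
import numpy as np


def motion_templates(silhouettes, tau=None):
    """Build MEI and MHI templates from binary silhouettes of shape (T, H, W).

    tau is the temporal window in frames (an assumed parameter); it
    defaults to the number of frame differences, T - 1.
    """
    frames = np.asarray(silhouettes, dtype=np.int32)
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames of shape (T, H, W)")
    tau = frames.shape[0] - 1 if tau is None else tau

    mhi = np.zeros(frames.shape[1:], dtype=np.float64)
    for t in range(1, frames.shape[0]):
        # Pixels whose silhouette value changed between consecutive frames.
        moving = frames[t] != frames[t - 1]
        # New motion is stamped with tau; older motion decays by one per frame.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))

    # The MEI marks every pixel that moved within the window.
    mei = (mhi > 0).astype(np.uint8)
    # The MHI is scaled to [0, 1]; brighter means more recent motion.
    return mei, mhi / tau
```

In this sketch the MEI is simply the MHI thresholded above zero, which reflects the description above: the MEI records where motion occurred, while the MHI also records how recently it occurred.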
A lot of research has utilised shape and silhouette information to represent the human body for human action recognition. In [ 30 , 64 ], shape masks of different images were used to introduce MEI and MHI based temporal templates for action recognition.
It has been observed that some actions can be represented by key poses. This was proposed by [ 65 ] where a method was described to detect forehand and backhand tennis strokes by matching edge information to labelled key postures together with annotated joints. These were then tracked between the key consecutive frames based on the silhouette information.
A number of significant methods are presented by [ 66 ] to describe space-time shapes based on silhouette information for action recognition. Background subtraction was used to extract the silhouette of a person. The Poisson equation was then used to obtain saliency, dynamics and shape structure features. A high dimensional feature vector was introduced to describe sequences of 10 frames in length. This was matched to shapes of test sequences at the end.
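The background-subtraction step used to extract silhouettes can be illustrated with a short NumPy sketch. It assumes a static camera and greyscale frames, and it estimates the background as the per-pixel median over time; the median model, the function name `extract_silhouettes` and the `threshold` value are assumptions for illustration, not details taken from [ 66 ].

```python
import numpy as np


def extract_silhouettes(frames, threshold=25.0):
    """Return binary foreground masks for greyscale frames of shape (T, H, W).

    The background is estimated as the per-pixel temporal median, which
    assumes a static camera and a subject that keeps moving.
    """
    video = np.asarray(frames, dtype=np.float64)
    background = np.median(video, axis=0)
    # A pixel is foreground when it differs enough from the background.
    return (np.abs(video - background) > threshold).astype(np.uint8)
```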
Space-time shapes were also used by [ 67 ] where contour information was obtained using background subtraction. Then, a set of characteristic points (saddles, valleys, ridges, peaks and pits) were used to represent actions on the surface of the shape. The space-time shapes were matched to recognise actions using point-to-point correspondences.
In [ 68 ], a set of silhouette exemplars were used for matching against frames in action sequences. A vector was formed of the minimum matching distance between each exemplar and any frame of the sequence. A Bayes classifier was employed to learn action classes with two different scenarios: first, silhouette information; second, edge information.
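The exemplar-based feature vector described above, one minimum matching distance per exemplar, can be sketched as follows. Euclidean distance between flattened silhouettes is used here purely as an assumption; the distance measure of [ 68 ] may differ, and the function name is hypothetical.

```python
import numpy as np


def exemplar_distance_vector(exemplars, frames):
    """For each exemplar, return its minimum distance to any frame.

    exemplars: array-like of shape (K, ...); frames: array-like of shape (T, ...).
    Euclidean distance on flattened inputs is an illustrative choice.
    """
    ex = np.asarray(exemplars, dtype=np.float64).reshape(len(exemplars), -1)
    fr = np.asarray(frames, dtype=np.float64).reshape(len(frames), -1)
    # Pairwise distances with shape (K, T), then the best match per exemplar.
    dists = np.linalg.norm(ex[:, None, :] - fr[None, :, :], axis=2)
    return dists.min(axis=1)
```

The resulting vector has a fixed length equal to the number of exemplars, regardless of the sequence length, so it can be passed directly to a classifier.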
A foreground shape based motion information model was presented by [ 69 ] to represent motion from a group of consecutive frames of an action sequence. A motion context descriptor was introduced over a region with the use of a polar search grid, where each cell was represented with a SIFT descriptor [ 70 ]. The final descriptor was created by summing up the entire groups of a sequence. After that, three different approaches were used to recognise actions which were Probabilistic Latent Semantic Analysis (pLSA) [ 71 ], w3-pLSA (pLSA extension) and Support Vector Machine (SVM).
Colour and location information based segmentation has been used by [ 72 ] to automatically over-segment event video. Then, optical flow and volumetric features were used to match over-segmented video against a set of training events such as picking up a dropped object or waving in a crowd.
It is obvious from the aforementioned approaches that silhouette information can provide strong cues for the human action recognition problem. However, significant challenges arise in the presence of clutter, occlusion and camera motion. In addition, silhouette information can describe some types of actions by showing characteristics of the outer contours of a person. However, other actions that include, e.g., self-occlusion, may not easily be recognised from silhouette information alone. Therefore, the motion and shape information is further enhanced with the use of local feature representations discussed shortly.
RGB-D Information Based Shape Models
A new era can be considered to have begun when low cost RGB-D sensors were produced. These simultaneously provide appearance and spatial 3D information. Such devices (e.g., Microsoft Kinect, Asus Xtion) have the ability to work in real time. By adding the depth-map feature, the device is able to provide information about the distance of each pixel to the sensor in a range from 0.5 m to 7 m. These have played a key role in the enhancement of object detection and segmentation algorithms. RGB-D sequences based methods improve recognition performance with a low time complexity. However, depth and skeleton representation based methods of action recognition remain only applicable over a limited range and specific environmental conditions.
As a result, many RGB holistic approaches have been extended to the RGB-D scenario to utilise depth-map characteristics. A 3D-MHI was proposed by [ 73 ] for action recognition. This was performed by extending the traditional MHI to use depth information. In [ 74 ], the depth silhouette was sampled into a representative set of 3D points that was used to describe the shape of salient regions. The key idea was to project the depth map onto three orthogonal Cartesian planes and use the points along each plane to recognise the actions. A useful technique was used by [ 75 ], where the depth maps were projected onto three orthogonal Cartesian planes to produce Depth Motion Maps (DMMs) by summing the stacked motion energy of each of the projected maps. DMMs can express the variation of a subject’s motions during the performance of an activity. In [ 76 ], DMMs were used for activity recognition together with an l2-regularised collaborative representation classifier with a distance-weighted Tikhonov matrix. DMMs were used by [ 77 ] with Local Binary Patterns (LBPs) to utilise motion cues. Two fusion levels were also considered: feature-level fusion and decision-level fusion. The DMM based results showed reasonable human activity recognition performance.
Different levels of the same data sequence were used with DMM computations to create hierarchical DMMs in [ 78 ]. An LBP based descriptor was used to characterise local rotation invariant texture information. Then, a Fisher kernel was employed to create patch descriptors. These were fed into a kernel-based extreme learning machine classifier. A similar approach was followed by [ 79 ]. A Histogram of Oriented Gradients (HOG) descriptor was used along with kernel entropy component analysis for dimensionality reduction. Finally, a linear support vector machine was used for classification. For both hierarchical DMM based approaches, the results demonstrated a significant performance improvement.
A 4D space-time grid was introduced by [ 80 ] that extended the work by [ 31 ]. This was done by dividing the space and time dimensions into multiple cells. These were used to obtain Space Time Occupancy Pattern (STOP) feature vectors for action recognition. In [ 81 ], a 4D Histogram Of Surface Normal Orientations (HON4D) was proposed to describe video for action recognition after computing the normal vectors for each frame. The features of the surface normal were captured in the 4D space of spatial, depth and time dimensions.
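The Depth Motion Map idea can be sketched as follows. The sketch assumes depth maps that have already been quantised to integer bins, with 0 marking background, and it accumulates absolute differences between consecutive projections without a threshold. These choices, and the names `project_depth` and `depth_motion_maps`, are illustrative assumptions rather than the exact procedure of [ 75 ].

```python
import numpy as np


def project_depth(depth, num_bins):
    """Project one quantised depth map (H, W) onto front, side and top planes.

    Depth values are integer bin indices in [0, num_bins); 0 is background.
    """
    height, width = depth.shape
    front = depth.astype(np.float64)       # (H, W): the depth map itself
    side = np.zeros((height, num_bins))    # (H, D): occupancy seen from the side
    top = np.zeros((num_bins, width))      # (D, W): occupancy seen from above
    rows, cols = np.nonzero(depth)
    bins = depth[rows, cols]
    side[rows, bins] = 1.0
    top[bins, cols] = 1.0
    return front, side, top


def depth_motion_maps(depth_frames, num_bins):
    """Sum absolute frame-to-frame differences of each projected view."""
    frames = np.asarray(depth_frames, dtype=np.int64)
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected at least two depth maps of shape (T, H, W)")
    if frames.min() < 0 or frames.max() >= num_bins:
        raise ValueError("depth values must lie in [0, num_bins)")
    projections = [project_depth(frame, num_bins) for frame in frames]
    maps = []
    for view in range(3):
        stack = np.stack([p[view] for p in projections])
        # Motion energy: accumulated change between consecutive projections.
        maps.append(np.abs(np.diff(stack, axis=0)).sum(axis=0))
    return tuple(maps)  # (front DMM, side DMM, top DMM)
```

Each of the three returned maps summarises the whole sequence as a single 2D image, which is what allows standard image descriptors such as LBP or HOG to be applied afterwards.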
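The idea of dividing space and time into cells and describing an action by cell occupancy can be illustrated with `numpy.histogramdd`. This is a simplified sketch, not the STOP descriptor of [ 80 ]: it returns raw point counts per cell, and the function name, the (x, y, z, t) point layout and the explicit `bounds` argument are assumptions.

```python
import numpy as np


def occupancy_features(points, bounds, cells):
    """Count points falling in each cell of a 4D space-time grid.

    points: array-like of shape (N, 4) holding (x, y, z, t) samples.
    bounds: four (min, max) pairs, one per axis.
    cells:  four integers, the number of cells along each axis.
    """
    pts = np.asarray(points, dtype=np.float64)
    if pts.ndim != 2 or pts.shape[1] != 4:
        raise ValueError("expected points of shape (N, 4)")
    counts, _ = np.histogramdd(pts, bins=cells, range=bounds)
    # Flatten the 4D grid into a fixed-length feature vector.
    return counts.ravel()
```

A fixed grid gives a feature vector of constant length (the product of the cell counts), independent of how many 3D points each depth frame contains.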
The rich characteristics of the depth information can help make people detection and segmentation tasks easier and less challenging which in turn improves holistic approaches, making them more robust with RGB-D images. However, some drawbacks of holistic methods include their sensitivity to occlusions and noise in the depth maps. Therefore, a good representation can be presented by combining motion and shape information which in turn may improve the recognition rate of the system.
4.1.2. Hybrid Methods Based on Shape and Global Motion Information
The work by [ 82 ] is a good example of shape and motion feature based tracking and action recognition. The authors assumed that the movements of body parts were restricted to regions around the torso. Subjects were bounded with rectangular boxes where the centroids were selected as the feature for tracking. The velocity of the centroids was considered, utilising body motion features to cope with occlusions between multiple subjects. Periodic actions such as walking were detected with a nearest centroid algorithm calculated across spatio-temporal templates and reference templates. This approach, however, only utilised motion information which can be improved by considering other features such as texture, color, and shape.
Another method which used motion information was proposed by [ 83 ], based on optical flow, to track soccer players and to recognise simple actions in video. A person was tracked and stabilised. Then, a descriptor was computed over the motion information, and spatio-temporal cross-correlation was used for matching against a database. This approach was tested on sequences from ballet, tennis and football datasets, and it achieved impressive results on low resolution video. However, these types of systems may depend on several conditions such as the position of the region of interest in the frame, the spatial resolution and the relative motion with respect to the camera. In addition, the model is based on a global representation, which can be affected by occlusions between multiple objects and by a noisy background.
Flow motion has also been used by [ 84 ] for action recognition. A flow descriptor was employed to select low level features in the form of a space-time overlapped grid. Then, mid level features were selected using the AdaBoost algorithm.
A space-time template based method was introduced by [ 85 ] for action recognition, based on the Maximum Average Correlation Height (MACH) filter. A spatio-temporal regularity flow was used to capture spatio-temporal information and to train the MACH filter. Experiments on a number of datasets, including the KTH dataset, demonstrated action recognition and facial expression recognition.
Volumetric feature based action recognition was proposed by [ 86 ] where Viola–Jones features were computed over a video’s optical flow. A discriminative set of features was obtained by direct forward feature selection which employed a sliding window approach to recognise the actions. The model was trained and tested on real videos with actions that included sitting down, standing up, closing a laptop and grabbing a cup.
Shape information was used by [ 87 ] to track an ice hockey player and to recognise actions. Histograms of Oriented Gradients (HOG)s were used to describe each single frame. Principal Component Analysis (PCA) was then used for dimensionality reduction. At the end, a Hidden Markov Model (HMM) was employed to recognise actions.
A hybrid representation combining optical flow and appearance information was proposed by [ 88 ]. They exploited the optical flow information and Gabor filter features for action recognition. Both kinds of features were extracted from each single frame and then concatenated. They used snippets of different lengths to highlight how many frames were required for recognising an action. The Weizmann and KTH datasets were used for evaluation.
Motion and shape information based action recognition was also used by [ 89 ] where a multiple instance learning based approach was employed to learn different features from a bag of instances. This included foreground information, Motion History Image (MHI) and HOGs. Simple actions in crowded events in addition to shopping mall data were used to evaluate the proposed method. The experiments showed that the use of multiple types of features resulted in better performance in comparison with a single type of feature.
These holistic based methods have provided reasonable levels of performance for action recognition. However, they have several limitations. First, they are not view invariant: different models would be needed for particular views, large amounts of multiple view data would be needed for training, and some body parts might be unseen across frames due to occlusions. Second, they are not invariant to time: the same action performed over different time periods would present quite differently. In addition, it is worth mentioning that the accuracy of holistic approaches is highly dependent on the detection and segmentation pre-processing. This work also includes local representation based methods to benefit from localised information. The next section presents a review of the local representation based methods for human action recognition.
4.2. Local Feature Representations Based Methods
Local feature based methods tend to capture characteristic features locally within a frame without a need for human detection or segmentation, which can be quite a challenge for RGB based video. Local feature based methods have been successfully employed in many recognition applications such as action recognition [ 90 ], object recognition [ 91 ] and scene recognition [ 92 ]. Local feature based methods can capture important characteristics of shape and motion information for a local region in a video. The main advantage of these methods is the autonomous representation of events in terms of changes across space-time and scale. Furthermore, with appropriate machine learning, it is often possible, given sufficient data, to capture the important characteristics of the local features of interest. If appropriately achieved, then it can be possible to separate these features from features computed from a cluttered background or even multiple movements or objects in a scene. In the following section, space-time feature detectors, feature trajectories and local descriptor based methods are discussed. In addition, their incorporation in action localisation and recognition in videos will be considered.
In general, local feature based methods consist of two steps: detecting a point of interest (POI) and descriptor computation. In image processing, interest points refer to points that have local variation of image intensities. Interest point detectors usually capture local characteristics. This can be in terms of space-time and scale in videos by maximising specific saliency functions.
Some research that can be highlighted has focused on feature detectors such as [ 33 ] who proposed to extend the Harris corner detector to a Harris3D detector to include both space and time. A different feature detector which employed spatial Gaussian kernels and temporal Gabor filters was proposed by [ 93 ]. This considered salient motion features to represent different regions in videos. Another detector proposed by [ 94 ] involved computing entropy characteristics in a cylindrical neighborhood around specific space-time positions. An extension of the Hessian saliency detector, Hessian3D, was proposed by [ 95 ] to consider spatio-temporal features. This used the determinant of the 3D Hessian matrix. Salient features were detected by [ 96 ] using a global information based method.
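As an illustration of how a space-time saliency function can be evaluated, the following is a minimal NumPy sketch in the spirit of the Harris3D detector. It is not the implementation of [ 33 ]: the single scale `sigma`, the constant `k`, the helper names and the zero-padded separable smoothing are all assumptions made for this example, whereas the original detector uses separate spatial and temporal scales.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1D Gaussian truncated at three standard deviations.
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(volume, sigma):
    # Separable Gaussian smoothing along every axis (zero padded at the borders).
    # Each axis must be at least as long as the kernel.
    k = gaussian_kernel(sigma)
    out = volume.astype(float)
    for axis in range(out.ndim):
        out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, out)
    return out

def harris3d_response(video, sigma=1.0, k=0.005):
    """video: (T, H, W) grey-level array. Returns a saliency volume of the same shape."""
    gt, gy, gx = np.gradient(smooth(video, sigma))
    grads = (gx, gy, gt)
    # Space-time second-moment matrix, averaged over a local window.
    M = np.empty(video.shape + (3, 3))
    for i in range(3):
        for j in range(3):
            M[..., i, j] = smooth(grads[i] * grads[j], 2.0 * sigma)
    # Saliency: det(M) - k * trace(M)^3; local maxima are candidate interest points.
    return np.linalg.det(M) - k * np.trace(M, axis1=-2, axis2=-1) ** 3
```

Interest points would then be taken as local maxima of the returned volume; that selection step is omitted here.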
A wider experimental evaluation was introduced by [ 97 ]. They proposed to exploit different interest point detectors applied to publicly available action recognition datasets including KTH [ 98 ], UCF sports [ 85 ], and Hollywood2 [ 99 ]. The results showed the robustness of the dense sampling method, where interest points were sampled in equal segments in the space and time domains. It was also found that the Harris3D detector achieved some of the best performance in several of the experiments.
Once local interest points are detected, local representation based methods can then be employed to compute one of the different descriptors over a given region. Different descriptors have been proposed in a lot of research, such as in [ 34 ] where Histogram of Oriented Gradients (HOG) [ 100 ] and Histogram of Oriented Optical Flow (HOOF) [ 101 ] descriptors were used. The authors introduced a different way to characterise local motion and appearance information. They combined HOG and HOOF based approaches on the space-time neighbourhood of the detected points of interest. For each cell of a grid of cells, four bins of HOG and five bins of HOOF were considered. Normalisation and concatenation were used to form a combined HOG and HOOF descriptor. Moreover, different local descriptors based on gradient, brightness, and optical flow information were included by [ 93 ]. PCA was also used for dimensionality reduction. The authors explored different scenarios which included simple concatenation, a grid of local histograms and a single global histogram. The experimental results determined that concatenated gradient information achieved the best performance.
A 3D version of the Histogram of Oriented Gradients (HOG3D) was introduced by [ 102 ] as an extension of the HOG descriptor by [ 100 ]. A space-time grid was constructed around each detected Point Of Interest (POI). A histogram descriptor was then computed and normalised over each of the cells. The final descriptor was then formed by concatenating the histograms.
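The per-cell combination of a four-bin gradient histogram and a five-bin flow histogram can be sketched as follows. This is only an illustrative sketch, not the code of [ 34 ]: the function names, the use of a fifth bin for near-static pixels, the threshold `still_thresh` and the L2 normalisation are assumptions made for this example.

```python
import numpy as np

def orientation_histogram(dx, dy, weights, n_bins, period):
    """Histogram of vector orientations folded into [0, period)."""
    angles = np.mod(np.arctan2(dy, dx), period)
    bins = np.minimum((angles / period * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=weights.ravel(), minlength=n_bins)

def cell_descriptor(patch, flow, still_thresh=1e-3, eps=1e-8):
    """patch: (h, w) grey-level cell; flow: (h, w, 2) optical flow (u, v) for that cell."""
    # Appearance: 4 unsigned orientation bins, weighted by gradient magnitude.
    gy, gx = np.gradient(patch.astype(float))
    hog = orientation_histogram(gx, gy, np.hypot(gx, gy), n_bins=4, period=np.pi)
    # Motion: 4 signed direction bins (pixel counts) plus 1 bin for near-static pixels.
    u, v = flow[..., 0], flow[..., 1]
    moving = np.hypot(u, v) > still_thresh
    hoof = np.zeros(5)
    hoof[:4] = orientation_histogram(u[moving], v[moving], np.ones(moving.sum()),
                                     n_bins=4, period=2 * np.pi)
    hoof[4] = np.count_nonzero(~moving)
    # Normalise each histogram, then concatenate into a 9-dimensional cell descriptor.
    hog /= np.linalg.norm(hog) + eps
    hoof /= np.linalg.norm(hoof) + eps
    return np.concatenate([hog, hoof])
```

The descriptor for a whole space-time neighbourhood would be obtained by concatenating the cell descriptors over the grid.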
In [ 103 ], the authors proposed to extend the Scale-Invariant Feature Transform (SIFT) descriptor originally proposed by [ 70 ]. Spatio-temporal gradients were computed over a set of randomly sampled positions. A Gaussian weight was used to weight each pixel in the neighbourhood with votes into an N × N × N grid of histograms of oriented gradients. To achieve orientation quantization, the gradients were represented in spherical coordinates that were divided into 8 × 4 histograms.
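The orientation quantisation step described above can be illustrated with a short sketch that bins spatio-temporal gradients by their two spherical angles into an 8 × 4 histogram. The function name, the magnitude weighting and the exact bin boundaries are assumptions for this example rather than details taken from [ 103 ].

```python
import numpy as np

def spherical_orientation_histogram(gx, gy, gt, n_azimuth=8, n_elevation=4):
    """Bin 3D gradients by azimuth and elevation, weighted by gradient magnitude."""
    mag = np.sqrt(gx ** 2 + gy ** 2 + gt ** 2)
    azimuth = np.arctan2(gy, gx)                    # in [-pi, pi]
    elevation = np.arctan2(gt, np.hypot(gx, gy))    # in [-pi/2, pi/2]
    a = np.floor((azimuth + np.pi) / (2 * np.pi) * n_azimuth).astype(int)
    e = np.floor((elevation + np.pi / 2) / np.pi * n_elevation).astype(int)
    a = np.clip(a, 0, n_azimuth - 1)
    e = np.clip(e, 0, n_elevation - 1)
    hist = np.zeros((n_azimuth, n_elevation))
    # Unbuffered accumulation so that repeated (a, e) pairs all contribute.
    np.add.at(hist, (a.ravel(), e.ravel()), mag.ravel())
    return hist
```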
An extended Speeded-Up Robust Features (SURF) descriptor originally proposed by [ 104 ] was investigated by [ 95 ]. Application to videos was considered by utilising spatio-temporal interest points which were spatially and temporally scale invariant. The patches were divided into a grid within local N × N × N histograms. Then, each cell was represented by a vector of Haar wavelet sampled responses. The experimental results showed the good performance of the proposed detector in comparison with other detectors.
RGB-D Information Based Local Features
There has also been research that includes depth map data based local feature methods. These follow many of the same or similar steps as for RGB video, for instance, at the gross level, finding salient points of interest and then computing the descriptor. In [ 105 ], the authors proposed a Histogram of Oriented Principal Components (HOPC) descriptor. This captured the characteristics around each point of interest within a 3D cloud space. The descriptor was formed by concatenating projected Eigenvectors. These resulted from Principal Component Analysis on the space-time volume around the points of interest. The HOPC descriptor was found to be view invariant. Video was also treated in [ 106 ] as a space-time volume of depth values. A Comparative Coding Descriptor (CCD) was then used to encode space-time relations of points of interest. A set of cuboids was used to construct a series of codes that characterised the descriptor. In [ 107 ], a descriptor called Local Occupancy Pattern (LOP) was presented. This was used to describe the appearance information of sub-regions of depth images, which was utilised to characterise object-interaction actions. In another work by [ 108 ], a Random Occupancy Pattern (ROP) was introduced to deal with depth sequences as a space-time volume. The descriptor was defined by a sum of the pixel values in a sub-volume. Since several sub-volumes had different sizes and locations, a random sampling based method was used to effectively recognise the sub-volumes. Overall, local feature based methods are commonly used with different inputs. These can include skeletons, where joints have been a particular focus for detectors; RGB, where a detector has been used to detect POIs on an RGB frame; or, similarly, depth.
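The idea of describing a depth sequence by sums over randomly sampled sub-volumes can be sketched as below. This is a simplified illustration and not the method of [ 108 ]: the integral-volume trick, the uniform sampling of sub-volume corners and all function names are assumptions for this example, and any normalisation of the sums is omitted.

```python
import numpy as np

def integral_volume(vol):
    """Cumulative sums along all three axes, padded with a leading zero plane."""
    iv = vol.astype(float).cumsum(0).cumsum(1).cumsum(2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def subvolume_sum(iv, lo, hi):
    """Sum of vol[t0:t1, y0:y1, x0:x1] by 3D inclusion-exclusion on the integral volume."""
    (t0, y0, x0), (t1, y1, x1) = lo, hi
    return (iv[t1, y1, x1] - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0] - iv[t0, y0, x0])

def random_occupancy_features(vol, n_features, rng):
    """One feature per randomly placed, randomly sized sub-volume."""
    iv = integral_volume(vol)
    shape = np.array(vol.shape)
    feats = np.empty(n_features)
    for i in range(n_features):
        lo = rng.integers(0, shape)             # random lower corner
        hi = rng.integers(lo + 1, shape + 1)    # random upper corner, strictly larger
        feats[i] = subvolume_sum(iv, lo, hi)
    return feats
```

For example, `random_occupancy_features(depth_volume, 100, np.random.default_rng(0))` would return 100 sub-volume sums for a hypothetical `depth_volume` array of shape (T, H, W).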
4.3. Trajectories Based Methods
Many researchers have claimed that the spatial domain in video has different characteristics from the temporal domain. Thus, points of interest should not be detected in a 3D spatio-temporal space. Consequently, a lot of research such as [ 36 , 101 , 109 , 110 , 111 ] has included tracking of detected points of interest across the temporal domain. Then, the volume around the trajectory points is often used to compute the descriptors for video representation.
Detecting points of interest in video and forming trajectories through the temporal domain has been used by many researchers. For instance, the Kanade–Lucas–Tomasi (KLT) tracker [ 112 ] was used in [ 109 ] to track Harris3D interest points [ 33 ]. These formed feature trajectories which were then represented as sequences of log polar quantised velocities. The KLT tracker has also been used by [ 36 ], where trajectories were clustered and used to compute an affine transformation matrix to represent the trajectories. In [ 70 , 110 ], SIFT descriptors were matched between two consecutive frames for trajectory based feature extraction. Unique-match points were exploited whilst others were discarded.
Dense sampling based interest point extraction achieved better performance in action recognition by [ 97 ]. Dense trajectories were later used by [ 101 ] who sampled dense points of interest on a grid. Dense optical flow was then used to track POIs through time. Trajectories were formed by concatenating points from subsequent frames. Moreover, to exploit motion information, different descriptors (HOG, HOOF, Motion Boundary Histogram (MBH)) were computed within a space-time volume around the trajectory. Finally, the method was evaluated with publicly available action datasets including: KTH, YouTube, Hollywood2, and UCF sports. Competitive performance was achieved in comparison to the state-of-the-art approaches. Different extensions of dense trajectory based methods have been proposed by many researchers such as [ 113 , 114 , 115 , 116 , 117 , 118 ].
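The tracking step described above (sample points on a grid, move each point by the optical flow, concatenate positions over frames) can be sketched as follows. This is a minimal sketch under stated assumptions: the flow fields are taken as given, the flow is looked up at the nearest pixel without any filtering, and the function names and the displacement-based shape descriptor are illustrative choices rather than the exact procedure of [ 101 ].

```python
import numpy as np

def track_dense_points(flows, step=5):
    """flows: (T, H, W, 2) array; flows[t, y, x] = (dx, dy) from frame t to frame t + 1.
    Returns trajectories of shape (N, T + 1, 2) holding (x, y) positions."""
    T, H, W, _ = flows.shape
    ys, xs = np.mgrid[step // 2:H:step, step // 2:W:step]   # dense sampling grid
    points = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    trajectory = [points]
    for t in range(T):
        # Nearest-pixel lookup of the flow at each current point position.
        xi = np.clip(np.rint(points[:, 0]).astype(int), 0, W - 1)
        yi = np.clip(np.rint(points[:, 1]).astype(int), 0, H - 1)
        points = points + flows[t, yi, xi]
        trajectory.append(points)
    return np.stack(trajectory, axis=1)

def trajectory_shape_descriptor(trajectories, eps=1e-8):
    """Frame-to-frame displacements, normalised by the total displacement length."""
    disp = np.diff(trajectories, axis=1)                      # (N, T, 2)
    total = np.linalg.norm(disp, axis=2).sum(axis=1) + eps    # (N,)
    return (disp / total[:, None, None]).reshape(len(disp), -1)
```

Appearance and motion descriptors such as HOG, HOOF and MBH would then be computed in a space-time volume aligned with each trajectory; that part is not shown.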
Local descriptor based methods often follow similar steps in comparison to POI detection. Early research extracted descriptors from cuboids which were formed around the point of interest in space-time domains, see, e.g., [ 33 , 93 ]. However, the same process can be followed to utilise trajectories. Most popular local descriptor based approaches have exploited cuboids or trajectories as explained below.
A number of different descriptors were introduced by [ 119 ] to capture appearance and motion features from video. A comparison was made between single and multi scale higher order derivatives, histograms of optical flow, and histograms of spatio-temporal gradients. The local neighbourhood of the detected interest points was described by computing histograms of optical flow and gradient components for each cell of an N × N × N grid. Thereafter, PCA was applied to the concatenation of optical flow and gradient component vectors to retain the most significant components as descriptors. The experiments showed the usefulness and applicability of the histograms of optical flow and spatio-temporal gradient based descriptors.
The Histograms of Optical Flow (HOOF) descriptor was proposed by [ 34 ] to identify local motion information. Spatio-temporal neighbourhoods were defined around detected POIs and optical flow was computed between consecutive frames.
Another robust descriptor which also benefited from optical flow, the Motion Boundary Histogram (MBH) descriptor, was presented by [ 120 ] to extract local motion information. This descriptor follows the HOG descriptor in binning the orientation information of spatial derivatives into histograms. These descriptors can be employed with trajectory information as was done by [ 121 ]. A spatio-temporal volume was formed around each trajectory and divided into multiple cells. Each cell was represented by a combination of HOG, HOOF and MBH descriptors. Other research that used trajectories for action recognition can be found in, e.g., [ 122 , 123 , 124 ].
4.4. Other Feature Representations Based Methods
A different representation method that has been employed in computer vision tasks is Bag of Words (BOW), also referred to as a bag of visual models; see, e.g., [ 125 ]. The key idea of this approach is to represent image data as a normalised histogram of so-called code words. The visual words (code words) can be constructed during the learning process by clustering similar patches of an image that can be described by a common feature descriptor. In this way, similar images will result in similar histograms. These can be fed into a classification step. BOW based methods have been used in a lot of research for action recognition such as [ 28 , 93 , 126 , 127 ].
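The two stages of a bag-of-words representation, learning a codebook by clustering and encoding a sample as a normalised histogram of code words, can be sketched as follows. The plain k-means loop, the function names and the parameters `k`, `n_iter` and `seed` are assumptions for this example and are not tied to any particular cited work.

```python
import numpy as np

def assign(descriptors, centres):
    """Index of the nearest code word for every descriptor."""
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Plain k-means: the cluster centres act as the visual code words."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centres = descriptors[start].astype(float)
    for _ in range(n_iter):
        labels = assign(descriptors, centres)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):                 # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres

def bow_histogram(descriptors, centres):
    """Normalised histogram of code-word occurrences for one image or video."""
    counts = np.bincount(assign(descriptors, centres), minlength=len(centres))
    return counts / max(counts.sum(), 1)
```

The resulting fixed-length histograms can be passed to any standard classifier, independently of how many local descriptors each video produced.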
Another popular feature representation technique is the Fisher vector descriptor which can be considered as a global descriptor. This technique determines the best calibration for a generative model to better model the distribution of extracted local features. The descriptor is formed using the gradient of a given sample’s likelihood with respect to the parameters of the distribution. It is estimated from the training set and scaled by the inverse square root of the Fisher information matrix. A Fisher vector descriptor was first presented by [ 128 ] for image classification. For more details about Fisher vector based image classification and action recognition tasks, please see [ 129 , 130 ].
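The construction described above can be written compactly as follows. The notation is an assumption introduced here for illustration: $X = \{x_1, \dots, x_N\}$ denotes the local features of one sample, $u_\lambda$ the generative model (commonly a Gaussian mixture) with parameters $\lambda$ estimated on the training set, and $F_\lambda$ the Fisher information matrix.

```latex
G^{X}_{\lambda} = \frac{1}{N} \sum_{n=1}^{N} \nabla_{\lambda} \log u_{\lambda}(x_n),
\qquad
F_{\lambda} = \mathbb{E}_{x \sim u_{\lambda}}\!\left[
    \nabla_{\lambda} \log u_{\lambda}(x)\,
    \nabla_{\lambda} \log u_{\lambda}(x)^{\top} \right],
\qquad
\mathcal{G}^{X}_{\lambda} = F_{\lambda}^{-1/2}\, G^{X}_{\lambda}
```

Here $\mathcal{G}^{X}_{\lambda}$ is the Fisher vector: the averaged gradient of the log-likelihood, scaled by the inverse square root of the Fisher information matrix. Its dimension depends only on the number of model parameters, not on $N$.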
More comprehensive details of action recognition, motion analysis, and body tracking can also be found in [ 131 , 132 , 133 , 134 , 135 ]. Some state-of-the-art works that used traditional hand-crafted representation based methods are presented and compared in Table 1 .
Table 1. State-of-the-art methods of traditional hand-crafted representations with different datasets for human action recognition.
Paper | Year | Method | Dataset | Accuracy |
---|---|---|---|---|
[ ] | 2009 | Space-time volumes | KTH | 89.4 |
[ ] | 2011 | Dense trajectory | KTH | 95 |
[ ] | 2011 | Space-time volumes | KTH | 94.5 |
| | | | UCF sports | 91.30 |
[ ] | 2011 | Shape-motion | Weizmann | 100 |
[ ] | 2011 | LBP | Weizmann | 100 |
[ ] | 2012 | bag-of-visual-words | HMDB-51 | 29.2 |
[ ] | 2012 | Trajectory | HMDB-51 | 40.7 |
[ ] | 2012 | HOJ3D + LDA | MSR Action 3D | 96.20 |
[ ] | 2013 | Features (Pose-based) | UCF sports | 90 |
| | | | MSR Action 3D | 90.22 |
[ ] | 2013 | 3D Pose | MSR Action 3D | 91.7 |
[ ] | 2013 | Shape Features | Weizmann | 92.8 |
[ ] | 2013 | Dense trajectory | HMDB-51 | 57.2 |
[ ] | 2014 | Shape-motion | Weizmann | 95.56 |
| | | | KTH | 94.49 |
[ ] | 2014 | EigenJoints + AME + NBNN | MSR Action 3D | 95.80 |
[ ] | 2014 | Features (FV + SFV) | HMDB-51 | 66.79 |
| | | | Youtube action | 93.38 |
[ ] | 2014 | Dissimilarity and sparse representation | UPCV Action dataset | 89.25 |
[ ] | 2014 | Shape features | IXMAS | 89.0 |
[ ] | 2016 | Trajectory | MSR Action 3D | 89 |
[ ] | 2016 | Shape Features | Weizmann | 100 |
[ ] | 2016 | Shape features | IXMAS | 89.75 |
[ ] | 2016 | LBP | IXMAS | 80.55 |
[ ] | 2016 | Motion features | IXMAS | 83.03 |
[ ] | 2017 | MHI | MuHAVi | 86.93 |
[ ] | 2017 | spatio-temporal+HMM | MSR Action 3D | 93.3 |
| | | | MSR Daily | 94.1 |
[ ] | 2018 | Joints + KE Descriptor | MSR Action 3D | 96.2 |
It is worth pointing out that a variety of higher-level representation techniques have been proposed to capture discriminative information for complex action recognition. Deep learning is an important technique that has demonstrated effective capability for producing higher-level representations with significant performance improvement. Deep learning based models have the ability to process input data from a low level and to convert it into a mid or high-level feature representation. Consequently, the next section presents a review of deep learning based models that have been used for human action recognition.
5. Deep Learning Techniques Based Models
Recent research studies have shown that hand-crafted feature based methods are not suitable for all types of datasets. Consequently, a relatively new and important class of machine learning techniques referred to as deep learning has been established. Multiple levels of feature representations can be learnt that can make sense of different data such as speech, image and text. Such methods are capable of automatically processing raw image and video data for feature extraction, description, and classification. Trainable filters and multiple layer based models are often employed in these methods for action recognition and representation.
This section presents descriptions of some important deep learning models that have been used for human action recognition. However, it is very difficult to train a deep learning model from scratch with limited data. Thus, models are often limited to appearance based data or some described representation. Deep learning based models can be classified into three categories which are: generative models e.g., Deep Belief Networks (DBNs), Deep Boltzmann machines (DBMs), Restricted Boltzmann Machines (RBMs), and regularized auto-encoders; supervised models e.g., Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs); and hybrid models. However, hybrid models are not discussed in this work.
5.1. Unsupervised (Generative) Models
The key idea of deep learning based generative models is that they do not need target labels for the learning process. Such models are appropriate when labelled data are scarce or unavailable. The evolution of deep learning models can be traced back to [ 158 ] where a Deep Belief Network (DBN) was presented with a training algorithm based on Restricted Boltzmann Machines (RBMs) [ 159 ]. This was followed by a dimensionality reduction technique by [ 160 ]. The parameters were learnt with an unsupervised training process and were then fine-tuned in a supervised approach using back-propagation.
This inspired great interest in deep learning models, particularly for applications such as human action recognition, image classification, object recognition, and speech recognition. Unsupervised learning based methods have been proposed by, e.g., [ 161 ], to automatically learn features from video data for action recognition. An independent subspace analysis algorithm was used to learn space-time features and combined with convolution and stacking based deep learning techniques for action representation.
In [ 162 ], the researchers proposed to train DBNs with RBMs for human action recognition. The experimental results on two public datasets demonstrated the impressive performance of the proposed method over hand-crafted feature based approaches.
An unsupervised deep learning based model was proposed by [ 163 ] to continuously learn from unlabelled video streams. In addition, DBNs based methods were used by [ 164 ] to learn features from an unconstrained video stream for human action recognition.
Generative or unsupervised learning based models have played a substantial role in inspiring researchers’ interest in the deep learning field. Nevertheless, the great development of the Convolution Neural Networks (CNNs) based supervised learning methods for object recognition has somewhat obscured the unsupervised learning based approaches; see, e.g., [ 165 ].
5.2. Supervised (Discriminative) Models
In line with the recent literature surveys for human action recognition, the most common technique used in supervised learning based models is Convolution Neural Networks (CNN)s. These were first proposed by [ 166 ]. CNNs can be considered to be a type of deep learning model which has shown great performance in various recognition tasks such as pattern recognition, digit classification, image classification, and human action recognition; see, e.g., [ 165 , 167 ]. The efficient utilisation of CNNs in image classification [ 165 ] opened a new era of employing deep learning based methods for human action recognition. The key advantage of CNNs is their ability to learn straight from the raw data such as RGB or depth map data. Consequently, it is possible to obtain discriminative features which can effectively describe the data and thus make the recognition process easier. Since this approach is susceptible to overfitting, one should be careful in the training process. CNN training typically includes regularisation and requires a large amount of labelled data, both of which can help to prevent overfitting. Recently, it was shown that deep learning based methods outperform many state-of-the-art handcrafted features for image classification; see, e.g., [ 27 , 165 , 168 ].
Convolution Neural Networks (CNN)s have a hierarchical structure with multiple hidden layers to help translate a data sample into a set of categories. Such models consist of a number of different types of layers such as convolutional layers, pooling layers and fully connected layers. The temporal domain is introduced as an additional dimension in the case of videos. Since CNNs were originally designed for static image processing, it was not initially clear how to incorporate motion information. Therefore, most research at that time used CNNs on still images to model appearance information for action recognition [ 165 ]. Thereafter, different ways were proposed to utilise motion information for action recognition. An extension was presented by [ 169 ] where stacked video frames were used as an input to a CNN for action recognition from video. However, the experimental results were worse than those of hand-crafted feature based approaches. This issue was investigated by [ 32 ], who developed the idea of having separate spatial and temporal CNN streams for action recognition.
Figure 1 illustrates the spatio-temporal CNN streams similar to [ 32 ] where the two streams are implemented as independent CNNs. One stream was the spatial stream which recognised actions from static images. The other stream was the temporal stream which recognised actions from stacked video frames based on motion information of dense optical flow. The output of the two streams was combined using a late fusion technique. The experiments showed improved performance for this method compared to hand-crafted feature based approaches. However, this type of architecture has additional hardware requirements to be suitable for a variety of applications.
![Figure 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321068/bin/jimaging-06-00046-g001.jpg)
Figure 1. Illustration of the spatio-temporal CNN streams as used by [ 32 ]. Here, the input data are split into two streams: one for the individual appearance based raw frames and the other for the temporal information corresponding to an optical flow stream. The two streams are fused at the end with class score fusion.
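The class score (late) fusion of the two streams can be illustrated with a few lines of NumPy. This is a sketch under assumptions: each stream is taken to output a vector of class logits, and a weighted average of softmax scores with a hypothetical weight `w_spatial` is used as the fusion rule; other fusion rules are possible.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(spatial_logits, temporal_logits, w_spatial=0.5):
    """Weighted average of the per-class scores of the two streams."""
    scores = (w_spatial * softmax(spatial_logits)
              + (1.0 - w_spatial) * softmax(temporal_logits))
    return scores, scores.argmax(axis=-1)
```

Because the fusion operates on scores only, the two networks can be trained independently, which matches the description of the streams as independent CNNs.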
A lot of research on action recognition is based on works that have previously achieved relatively good performance in image classification problems. Recent works extended what was implemented in two dimensions to 3D to include the temporal domain. Most CNN models proposed for action recognition have been limited to dealing with 2D input data. Nonetheless, some applications may include 3D data that requires a specialised deep learning model. To this end, 3D Convolution Neural Networks (3D-CNNs) based models were presented by [ 40 ] for surveillance tasks at airports. Spatio-temporal features were automatically extracted by employing 3D convolutions in the convolutional layers with respect to spatial and temporal dimensions. The experimental results demonstrated superior performance for this method in comparison to other state-of-the-art methods.
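To make the notion of a 3D convolution concrete, the following sketch applies a single space-time kernel to a single-channel video with NumPy. It illustrates the operation only and is not the architecture of [ 40 ]; the function name and the restriction to one channel, unit stride and no padding are assumptions for this example.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """video: (T, H, W); kernel: (kt, kh, kw).
    'Valid' cross-correlation, the operation used in CNN convolutional layers."""
    # windows has shape (T - kt + 1, H - kh + 1, W - kw + 1, kt, kh, kw).
    windows = np.lib.stride_tricks.sliding_window_view(video, kernel.shape)
    return np.einsum("thwijk,ijk->thw", windows, kernel)
```

A kernel with temporal extent greater than one mixes information from neighbouring frames, which is how such layers respond to motion as well as appearance. For instance, a kernel of shape (2, 1, 1) with values -1 and 1 computes frame differences.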
In general, there has been much success with 2D and 3D CNNs in, e.g., image classification, object recognition, speech recognition and action recognition. Nonetheless, some issues still need to be considered such as the immense amount of image or video data needed for training purposes. Collecting and annotating large amounts of image or video data is quite exhausting and requires a substantial amount of time. Fortunately, the availability of rich and relatively large action recognition datasets has provided great support for designing such models in terms of their training and evaluation.
A factorised 3D-CNN was proposed by [ 170 ] for human action recognition. The 3D-CNN was factorised into a standard 2D-CNN for spatial information at the lower layers and a 1D-CNN for the temporal information at the higher layers. This factorisation was intended to reduce the number of learning parameters and consequently reduce the computational complexity. Two benchmark datasets were used to evaluate the proposed method: UCF101 and HMDB51. The results showed comparable performance with state-of-the-art methods. Another spatio-temporal 3D-CNN approach was proposed by [ 171 ] for human action recognition. The authors used four public datasets to evaluate the proposed method. The 3D-CNN achieved improved performance with spatio-temporal features compared to a 2D-CNN. The authors also found that a small filter size such as the one used in their method, i.e., 3 × 3 × 3, was the best choice for spatio-temporal features. Overall, the experimental results demonstrated competitive performance for the proposed method with a linear classifier.
Some research works have combined supervised and unsupervised learning models for action recognition. A Slow Feature Analysis (SFA) based method was used by [ 172 ] to extract slowly varying features from an input in an unsupervised manner. These were combined with a 3D-CNN for action recognition. This work achieved competitive performance compared to state-of-the-art approaches. Three standard action recognition datasets were used: KTH [ 98 ], UCF sports [ 85 ] and Hollywood2 [ 99 ].
In [ 173 ], a hierarchical framework combining a 3D CNN and a hidden Markov model (HMM) was proposed. This was used to recognise and segment continuous actions simultaneously. The 3D CNN was used to learn powerful high level features directly from raw data and to extract effective and robust action features. The statistical dependencies over adjacent sub-actions were then modelled by the HMM to infer action sequences. The KTH and Weizmann datasets were used to evaluate the proposed method. The experimental results showed improved performance of the proposed method over some state-of-the-art approaches.
For efficient learning of spatio-temporal features in video action recognition, a hybrid CNN that used a fusion convolutional architecture was introduced in [ 174 ]. 2D and 3D CNNs were fused to provide temporal encoding with fewer parameters. Three models were used to build the proposed model (semi-CNN): VGG-16, ResNets and DenseNets. The UCF-101 dataset was used in the evaluation to compare the performance of each model with its corresponding 3D model. Figure 2 shows the performance of the models over 50 epochs.
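The parameter saving from factorising a 3D convolution into a spatial and a temporal part can be checked with simple arithmetic. The sketch below compares one k × k × k layer with a k × k spatial convolution followed by a length-k temporal convolution. It is a per-layer illustration under assumptions (biases ignored, the same number of channels kept between the two parts, hypothetical function names) and does not reproduce the exact layer arrangement of [ 170 ].

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights in one full k x k x k 3D convolution layer (biases ignored)."""
    return c_in * c_out * k ** 3

def factorised_params(c_in, c_out, k=3):
    """Weights in a k x k spatial convolution followed by a length-k temporal one."""
    spatial = c_in * c_out * k * k     # 2D convolution over (H, W)
    temporal = c_out * c_out * k       # 1D convolution over T
    return spatial + temporal

full = conv3d_params(64, 64)           # 64 * 64 * 27 = 110592
fact = factorised_params(64, 64)       # 64 * 64 * 9 + 64 * 64 * 3 = 49152
ratio = full / fact                    # 2.25
```

With 64 input and 64 output channels and k = 3, the factorised form needs 49,152 weights instead of 110,592, a reduction by a factor of 2.25 in this hypothetical setting.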
![Figure 2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321068/bin/jimaging-06-00046-g002.jpg)
Figure 2. The performance of action recognition models as reported in [ 174 ], including: ( a ) Semi-CNN model based on the VGG16 architecture; ( b ) Semi-CNN model based on the ResNet34 architecture; ( c ) Semi-CNN model based on the DenseNet121 architecture.
Another way to model motion information in video was proposed by [ 39 ] for action recognition using Recurrent Neural Networks (RNN)s. CNN discriminative features were computed for each video frame and then fed into an RNN model. The key advantage of an RNN architecture is its ability to deal with sequential inputs, as a copy of the network is created for each element of the sequence. In the RNN hidden layers, connections between neurons link successive copies, and the same weights are shared by all copies. The authors highlighted that local motion information can be obtained from video by optical flow through CNNs. On the other hand, global motion information can be modelled through the use of the RNN.
RNN based supervised learning was used by [ 175 ] across five parts (right arm, left arm, right leg, left leg, trunk) of skeleton information. These were used as inputs to five separate sub-nets for action recognition. The outcomes of these sub-nets were then hierarchically fused to form the inputs to the higher layers. Thereafter, the final representation was fed into a single-layer perceptron to get the final decision. Three datasets were used to evaluate the proposed method: MSR Action3D [ 74 ], Berkeley Multimodal Human Action (Berkeley Mhad) [ 176 ], and Motion Capture HDM05 [ 177 ]. The results demonstrated state-of-the-art performance. However, an RNN is not capable of processing very long sequences and it cannot be stacked into very deep models. In addition, it lacks the capability of keeping track of long-term dependencies, which makes training of an RNN difficult.
A recurrent module that improved long-range learning, Long Short-Term Memory (LSTM), was first proposed by [ 178 ]. LSTM units have a hidden state augmented with nonlinear mechanisms, in which simple learned gating functions are utilised to enable state propagation with either no modification, update or reset. LSTMs have had a significant impact on vision problems as these models are straightforward to fine-tune end-to-end. Moreover, LSTMs have the ability to deal with sequential data and are not limited to fixed length inputs or outputs. This makes it simple to model sequential data of varying lengths, such as text or video [ 179 ].
LSTMs have recently been shown to be efficient for large-scale learning of speech recognition [ 180 ] and language translation models [ 181 ]. LSTM was also proposed for action recognition by [ 179 ]. A hybrid deep learning architecture was proposed using a long-term recurrent CNN (LRCN). Raw data and optical flow information were used as input to this system. The proposed methods were evaluated using the UCF101 dataset and showed an improvement in performance in comparison with the baseline architecture.
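The gating mechanism can be made concrete with a single LSTM step written in NumPy. This sketch uses the commonly taught variant with input, forget and output gates; the weight layout (four stacked blocks in `W`, `U` and `b`), the gate ordering and the function names are assumptions for this example, not a specification from [ 178 ] or [ 179 ].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """x: (d,) input; h, c: (n,) hidden and cell state.
    W: (4n, d), U: (4n, n), b: (4n,) hold the four gate blocks stacked."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:n])            # input gate: how much new content to write
    f = sigmoid(z[n:2 * n])        # forget gate: keep (near 1) or reset (near 0) the state
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate cell content
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def run_lstm(sequence, W, U, b):
    """Apply the same weights at every time step; works for any sequence length."""
    n = U.shape[1]
    h, c = np.zeros(n), np.zeros(n)
    for x in sequence:
        h, c = lstm_step(x, h, c, W, U, b)
    return h
```

With the forget gate saturated near one and the input gate near zero, the cell state passes through a step unchanged, which corresponds to the "no modification" case mentioned above.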
Deep learning based approaches have achieved relatively high recognition performance, on the same level as or better than hand-crafted feature based methods. Some researchers have also proposed using multiple deep learning models alongside hand-crafted features to achieve even better results, such as [ 32 , 117 , 182 ].
5.3. Multiple Modality Based Methods
Using deep learning methods to extract action features from RGB, depth, and/or skeleton information provides new insight into human action recognition. Different features can be learnt by deep networks [ 117 , 171 , 183 ] from appearance, optical flow, depth and/or skeleton sequences. Datasets very often provide several modalities for the same recordings, such as RGB, depth, and skeleton information, or at least two of them. Therefore, a lot of research has proposed utilising combinations of different modalities or their hand-crafted features, and merging them using fusion based strategies. A separate network is often employed for each modality, and classification scores are obtained for each one.
Some research has highlighted that significant improvements in the performance of an action recognition system can be achieved by utilising hand-crafted features within CNN based deep learning models. A CNN model based on multiple sources of information was proposed by [ 184 ] to process spatially varying soft-gating. A fusion technique was then used to combine the multiple CNN models that were trained on various sources. A Stratified Pooling based CNN (SPCNN) was proposed by [ 185 ] to handle the issue of different feature levels of each frame in video data. To obtain video based features, the authors fine-tuned a pre-trained CNN model on the target datasets. Frame-level features were extracted, then principal component analysis was used for dimensionality reduction. Stratified pooling was then used to convert the frame-level features into video-level features, which were finally fed into an SVM classifier. The method was evaluated on the HMDB51 [ 27 ] and UCF101 [ 186 ] datasets. The experiments showed that the proposed method outperformed state-of-the-art methods.
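Score-level (late) fusion of per-modality classifiers can be sketched as a weighted average of class scores. This is one simple fusion rule among many and is not attributed to any of the cited works; the function name `fuse_scores`, the equal default weights, and the example scores are assumptions.

```python
def fuse_scores(scores_by_modality, weights=None):
    """Weighted average of per-class scores from modality-specific classifiers."""
    names = list(scores_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}   # equal weighting by default
    total = sum(weights[name] for name in names)
    n_classes = len(scores_by_modality[names[0]])
    fused = [0.0] * n_classes
    for name in names:
        w = weights[name] / total                 # normalise the weights
        for c, s in enumerate(scores_by_modality[name]):
            fused[c] += w * s
    return fused

# Hypothetical softmax outputs of three separately trained networks.
scores = {
    "rgb":      [0.6, 0.3, 0.1],
    "depth":    [0.2, 0.5, 0.3],
    "skeleton": [0.1, 0.7, 0.2],
}
fused = fuse_scores(scores)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Here the RGB stream alone would choose class 0, but the fused scores favour class 1, which is the kind of correction multi-modal fusion is meant to provide. Passing a `weights` dictionary lets a more reliable modality count for more.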
An extension of this two stream network approach was proposed in [ 117 ] using dense trajectories for more effective learning of motion information.
A general residual network architecture for human activity recognition was presented in [ 187 ] using cross-stream residual connections in the form of multiplicative interaction between appearance and motion streams. The motion information was exploited using stacked inputs of horizontal and vertical optical flow.
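A multiplicative cross-stream residual connection can be sketched as follows: motion features gate the appearance features element-wise before the residual branch, and the branch output is added back onto the appearance identity path. The exact placement and form used in [ 187 ] are not given here, so this formulation, the function names, and the stand-in residual branch are assumptions.

```python
def multiplicative_residual(appearance, motion, residual_fn):
    """Gate appearance features by motion features, apply the residual
    branch, and add the result to the appearance identity path."""
    gated = [a * m for a, m in zip(appearance, motion)]
    branch = residual_fn(gated)
    return [a + r for a, r in zip(appearance, branch)]

def relu_branch(x):
    # Stand-in for a learned residual unit (hypothetical).
    return [max(0.0, 0.5 * v) for v in x]

out = multiplicative_residual([1.0, 2.0, -1.0], [0.0, 1.0, 2.0], relu_branch)
```

Where the motion response is zero, the appearance feature passes through the identity path unchanged, so motion acts as a gate on what the residual branch contributes.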
A fusion study was presented in [ 182 ] for human activity recognition using two streams of the pre-trained Visual Geometry Group (VGG) network model to compute spatio-temporal information combining RGB and stacked optical flow data. Various fusion mechanisms at different positions of the two streams were evaluated to determine the best possible recognition performance.
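Stacking optical flow means concatenating the horizontal and vertical flow maps of several consecutive frames along the channel axis, so that a window of L frames yields a 2L-channel input. The sketch below uses nested lists as toy flow maps; the interleaved channel order and the function name `stack_flow` are assumptions rather than details taken from [ 182 ].

```python
def stack_flow(flow_x, flow_y, start, length):
    """Interleave horizontal and vertical flow maps of `length` consecutive
    frames into one multi-channel input: [x_t, y_t, x_t+1, y_t+1, ...]."""
    if start + length > len(flow_x) or len(flow_x) != len(flow_y):
        raise ValueError("not enough frames for the requested stack")
    channels = []
    for t in range(start, start + length):
        channels.append(flow_x[t])
        channels.append(flow_y[t])
    return channels

# Toy 1x2 flow maps for 12 frames (hypothetical values).
flow_x = [[[float(t), 0.0]] for t in range(12)]
flow_y = [[[0.0, float(t)]] for t in range(12)]
stacked = stack_flow(flow_x, flow_y, start=0, length=10)
```

A stack of 10 frames gives 20 channels; the window length directly limits how much temporal context the motion stream can see.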
Some research studies have paid particular attention to auxiliary information which can improve the performance of action recognition. In some studies, such as [ 188 ], audio has been combined with video to detect actions, where Hidden Markov Models (HMMs) were combined with audio to determine the actions. The main disadvantage of using audio recordings is the surrounding noise that can affect the results.
All of the above approaches suffer from a shortage of long-term temporal information. For example, the number of frames used in optical flow stacking was 7, 10, and 15 in [ 40 , 169 , 184 ], respectively. People often perform the same action over different periods of time, depending on many factors and on the individual. Consequently, multi-resolution hand-crafted features computed over different periods of time were used by [ 189 ] to avoid this problem. Furthermore, different weight phases were applied using a time-variant approach in the computation of the DMMs to enable adaptation to different important regions of an action. Different fusion techniques were employed to merge spatial and motion information for the best action recognition. Figure 3 illustrates the impact of different window frame lengths on the performance of action recognition systems.
![Figure 3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321068/bin/jimaging-06-00046-g003.jpg)
Figure 3. Action recognition accuracy versus window frame length, as reported in [ 189 ].
5.4. Pose Estimation and Multi-View Action Recognition
Another considerable challenge in human action recognition is view variance. The same action can look very different when viewed from different angles. This issue was taken into account by [ 190 ]. Training data were generated by fitting a synthetic 3D human model to real motion information. Poses were then extracted from different view-points. A CNN based model was found to outperform a hand-crafted feature based approach for multi-view action recognition.
Dynamic image information was extracted by [ 191 ] from synthesised multi-view depth videos. Multi-view dynamic images were constructed from the synthesised data. A CNN model was then proposed to perform feature learning from the multi-view dynamic images. Multiple batches of motion history images (MB-MHIs) were constructed by [ 192 ]. This information was then used to compute two descriptors, using a deep residual network (ResNet) and a histogram of oriented gradients (HOG). Later, an orthogonal matching pursuit approach was used to obtain sparse codes of the feature descriptions. A final view-invariant feature representation was formed and used to train an SVM classifier for action recognition. The MuHAVi-MAS [ 193 ] and MuHAVi-uncut [ 194 ] datasets were used to evaluate the proposed approach. Figure 4 illustrates the accuracy variations of the recognition model over different components.
![Figure 4](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321068/bin/jimaging-06-00046-g004.jpg)
Figure 4. Accuracy variation with the number of frames and the number of batches, as reported in [ 192 ].
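The motion history images on which the MB-MHI representation builds can be sketched with the classic update rule: a pixel is set to a maximum value where motion is detected and decays by one otherwise. How [ 192 ] splits a video into batches is not described here, so the sketch computes a single MHI; the frame-differencing motion detector and the parameter names `tau` and `threshold` are assumptions.

```python
def motion_history_image(frames, tau, threshold):
    """Compute a motion history image from grey-level frames.

    Pixels that changed by more than `threshold` are set to `tau`;
    all other pixels decay by 1 per frame, down to 0.
    """
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                if abs(curr[y][x] - prev[y][x]) > threshold:
                    mhi[y][x] = tau
                else:
                    mhi[y][x] = max(0, mhi[y][x] - 1)
    return mhi

# A bright pixel moving left to right across a 1x4 image.
frames = [
    [[9, 0, 0, 0]],
    [[0, 9, 0, 0]],
    [[0, 0, 9, 0]],
    [[0, 0, 0, 9]],
]
mhi = motion_history_image(frames, tau=3, threshold=5)
```

More recent motion ends up brighter than older motion, so a single image encodes both where and roughly when movement occurred.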
A CNN model pre-trained on ImageNet was used by [ 195 ] to learn from multi-view DMM features for action recognition, with the video projected onto different view-points within the 3D space. Different temporal scales were then used from the synthesised data to constitute a range of spatio-temporal patterns for each action. Finally, three fine-tuned models were employed independently for each DMM map. However, some actions involving object interactions can be very difficult to recognise from the raw depth data alone. This helps to justify the inclusion of RGB data for the recognition of such actions.
In [ 196 ], Multi-View Regional Adaptive Multi temporal-resolution DMMs (MV-RAMDMM) and multi temporal-resolution RGB information are learnt with multiple 3D-CNN streams for action recognition. The adaptive multi-resolution DMM is applied across multiple views to extract view and time invariant action information. It is adapted based on human movement before being used in the deep learning model for action recognition. In addition, multi temporal raw appearance information is used to exploit various spatio-temporal features of the RGB scenes. This helps to capture more specific information which might be difficult to obtain purely from depth sequences. For instance, object-interaction information is more apparent in RGB space.
Alternatively, semantic features based on pose can be very important cues for describing the category of an action. Human joint information was utilised by [ 197 ] to compute the temporal variation between joints during actions. Time-variant functions were used to confirm the pose related to each action and were considered for feature extraction. The feature representation for action recognition was constructed using the temporal variation of values associated with these time functions. Then, CNNs were trained to recognise human actions from the local patterns in the feature representation. The Berkeley MHAD dataset [ 176 ] was used to evaluate the proposed method and the results demonstrated the effectiveness of this approach. Similar to [ 197 ], a Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition was proposed by [ 198 ]. The descriptor aggregated motion and appearance information along tracks of human body parts. This utilised skeleton information along with raw RGB data. The JHMDB [ 199 ] and MPII [ 200 ] cooking datasets were used to evaluate the proposed method. However, it can be difficult to accurately capture the skeleton information of a person in different environmental conditions. This might be due to the need for accurate body-part detection to precisely estimate skeleton information.
Some common datasets for human action recognition are introduced in Table 2 . In addition, an extensive comparison between deep learning based models and hand-crafted feature based models is presented in Table 3 for human action recognition.
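A depth motion map accumulates the absolute differences between consecutive depth frames, and computing it over windows of different lengths gives a multi temporal-resolution description. The sketch below covers a single view without the regional or adaptive weighting of [ 196 ]; taking each window from the most recent frames, and the function names, are assumptions.

```python
def depth_motion_map(depth_frames):
    """Accumulate absolute differences between consecutive depth frames."""
    height, width = len(depth_frames[0]), len(depth_frames[0][0])
    dmm = [[0.0] * width for _ in range(height)]
    for prev, curr in zip(depth_frames, depth_frames[1:]):
        for y in range(height):
            for x in range(width):
                dmm[y][x] += abs(curr[y][x] - prev[y][x])
    return dmm

def multi_resolution_dmms(depth_frames, window_lengths):
    """One DMM per window length, each computed over the last n frames."""
    return {n: depth_motion_map(depth_frames[-n:]) for n in window_lengths}

# Toy 1x2 depth frames (hypothetical values).
frames = [[[0.0, 1.0]], [[0.0, 3.0]], [[2.0, 3.0]], [[2.0, 0.0]]]
dmms = multi_resolution_dmms(frames, window_lengths=[2, 4])
```

The short window captures only the latest movement, while the long window also records earlier motion at the first pixel; feeding both to a classifier is one way to cope with actions performed at different speeds.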
Table 2. Common datasets for human action recognition.
| Datasets | RGB | Depth | Skeleton | Samples | Classes |
|---|---|---|---|---|---|
| [ ] | | | | 1707 | 12 |
| [ ] | | | | 4500 | 10 |
| [ ] | | | | 1707 | 12 |
| [ ] | | | | 6766 | 51 |
| [ ] | | | | 783 | 16 |
| [ ] | | | | 6618 | 50 |
| [ ] | | | | 13,320 | 101 |
| [ ] | | | | 567 | 20 |
| [ ] | | | | 320 | 16 |
| [ ] | | | | 1475 | 10 |
| [ ] | | | | 861 | 27 |
| [ ] | | | | 861 | 27 |
| [ ] | | | | 1189 | 13 |
| [ ] | | | | 56,880 | 60 |
Table 3. Comparison of deep learning based models and hand-crafted feature based models for human action recognition [ 208 , 209 , 210 , 211 ].
| Characteristics | Deep Learning Based Models | Hand-Crafted Feature Based Models |
|---|---|---|
| | Ability to learn features directly from raw data | Pre-processing algorithms and/or detectors are needed to discover the most efficient patterns to improve recognition accuracy. |
| | Automatically extract spatial, temporal, and scale- and translation-invariant features from raw data | Use feature selection and dimensionality reduction methods which are not very generalisable. |
| | Data pre-processing and normalisation are not mandatory to achieve high performance | Usually require comprehensive data pre-processing and normalisation to achieve significant performance. |
| | Hierarchical and translation-invariant features are obtained from such models to solve this problem | Inefficient in managing this kind of problem. |
| | A huge amount of data is required for training to avoid over-fitting, along with a powerful computing system with a Graphics Processing Unit (GPU) to speed up training | Require less data for training, with less computation time and memory usage. |
Furthermore, some recent works based on deep learning models for human action recognition are included in Table 4 .
Table 4. State-of-the-art deep learning based methods on different datasets for human action recognition.
| Paper | Year | Method | Class of Architecture | Dataset | Accuracy (%) |
|---|---|---|---|---|---|
| [ ] | 2012 | ASD features | SFA | KTH | 93.5 |
| [ ] | 2013 | Spatio-temporal | 3D CNN | KTH | 90.2 |
| [ ] | 2014 | STIP features | Sparse auto-encoder | KTH | 96.6 |
| [ ] | 2014 | Two-stream | CNN | HMDB-51 | 59.4 |
| [ ] | 2014 | DL-SFA | SFA | Hollywood2 | 48.1 |
| [ ] | 2014 | Two-stream | CNN | UCF-101 | 88.0 |
| [ ] | 2015 | Convolutional temporal feature | CNN-LSTM | UCF-101 | 88.6 |
| [ ] | 2015 | TDD Descriptor | CNN | UCF-101 | 91.5 |
| [ ] | 2015 | Spatio-temporal | CNN | UCF-101 | 88.1 |
| [ ] | 2015 | Spatio-temporal | 3D CNN | UCF-101 | 90.4 |
| [ ] | 2015 | Hierarchical model | RNN | MSR Action3D | 94.49 |
| [ ] | 2015 | Differential | RNN | MSR Action3D | 92.03 |
| [ ] | 2015 | Static and motion features | CNN | UCF Sports | 91.9 |
| [ ] | 2015 | TDD Descriptor | CNN | HMDB-51 | 65.9 |
| [ ] | 2015 | Spatio-temporal | CNN | HMDB-51 | 59.1 |
| [ ] | 2016 | Spatio-temporal | LSTM-CNN | HMDB-51 | 55.3 |
| [ ] | 2016 | Deep Network | CNN | UCF-101 | 89.1 |
| [ ] | 2016 | Spatio-temporal | LSTM-CNN | UCF-101 | 86.9 |
| [ ] | 2016 | Deep model | CNN | HMDB-51 | 54.9 |
| [ ] | 2016 | 3D CNN + HMM | CNN | KTH | 89.20 |
| [ ] | 2016 | LRCN | CNN + LSTM | UCF-101 | 82.34 |
| [ ] | 2017 | SP-CNN | CNN | HMDB-51 | 74.7 |
| [ ] | 2017 | Rank pooling | CNN | HMDB-51 | 65.8 |
| [ ] | 2017 | Rank pooling | CNN | Hollywood2 | 75.2 |
| [ ] | 2017 | SP-CNN | CNN | UCF-101 | 91.6 |
| [ ] | 2018 | DynamicMaps | CNN | NTU RGB+D | 87.08 |
| [ ] | 2018 | Cooperative model | CNN | NTU RGB+D | 86.42 |
| [ ] | 2019 | Depth Dynamic Images | CNN | UWA3DII | 68.10 |
| [ ] | 2019 | FWMDMM | CNN | MSR Daily Activity | 92.90 |
| | | | CNN | NUCLA | 69.10 |
| [ ] | 2020 | MB-MHI | ResNet | MuHAVi | 83.8 |
| [ ] | 2020 | MV-RAMDMM | 3DCNN | MSR Daily Activity | 87.50 |
| | | | 3DCNN | NUCLA | 86.20 |
| [ ] | 2020 | Semi-CNN | ResNet | UCF-101 | 89.00 |
| | | Semi-CNN | VGG-16 | UCF-101 | 82.58 |
| | | Semi-CNN | DenseNet | UCF-101 | 77.72 |
6. Conclusions
In this paper, we have presented human action recognition methods and given a comprehensive overview of recent approaches to human action recognition research. This included hand-crafted representation based methods, deep learning based methods, human–object interaction and multi-view action recognition. The conclusions of this study can be summarised as follows:
- data selection: suitable data to capture the action may help to improve performance of action recognition.
- approach of recognition: deep learning based methods achieved superior performance.
- multiple modalities: current research highlights that multi-modal fusion can efficiently improve performance.
This paper has presented the most relevant and outstanding computer vision based methods for human action recognition. A variety of hand-crafted methods and deep learning models have been summarised along with the advantages and disadvantages of each approach. Hand-crafted feature based methods are categorised into holistic and local feature based methods. Holistic feature based methods have been summarised along with their limitations. These methods assume a static background. In other words, the camera must be stable and videos are supposed to have been captured under constrained conditions for a holistic representation. Otherwise, these methods need extra pre-processing steps such as people detection to be able to recognise human actions. This is particularly true in the presence of a cluttered or complex background, or if the camera moves whilst action sequences are captured. Next, local feature based methods and different types of descriptors were also described in this paper. It is shown that local feature based methods more often achieve state-of-the-art results compared to other approaches. In addition, such methods have lower computational complexity than more complicated models. The main advantage of local feature based methods is their flexibility. They can be applied to video data without complex requirements such as human localisation or body parts detection, which is not feasible for many types of videos. However, in some cases, it is very difficult to address action variations using local representation based methods, which, in turn, fail to precisely recognise human actions. Therefore, hand-crafted representations that combine both local and holistic methods may help. Different issues can then be tackled by benefiting from shape and motion information together with a local feature representation of an action. This information, alongside local representation strategies, plays a key role in recognising different actions and improving the performance of the recognition system.
A new direction has been proposed to enhance action recognition performance using deep learning technology. Deep learning is summarised in this paper and classified into two categories: supervised and unsupervised models. However, supervised models are the focus of this work due to their high effectiveness in implementing recognition systems. They have achieved competitive performance in comparison with traditional approaches in many applications of computer vision. The most important characteristic of deep learning models is the ability to learn features from raw data. This has somewhat reduced the need for hand-crafted feature detectors and descriptors.
One of the most popular supervised models is the Convolutional Neural Network (CNN), which is currently used in most of the existing deep learning based methods. However, deep learning based methods still have some limitations that need to be considered. One of these limitations is the need for huge amounts of data for training the models. In addition, powerful hardware is required to complete the computation in a reasonable amount of time. Therefore, transfer learning approaches are adopted in different works to benefit from pre-trained models and speed up the training process. This also helps to improve the performance of the action recognition system with reasonable hardware requirements.
Two common types of deep learning techniques were used, for either spatial or spatio-temporal feature extraction and representation. These include CNN, 3D CNN, and LSTM models. Some research has highlighted that significant improvements in the performance of an action recognition system can be achieved by utilising multi-modal structure based methods. This could include RGB sequences, hand-designed features, depth sequences and/or skeleton sequences.
Many researchers have highlighted the importance of temporal information that can be exploited to provide more discriminative features for action recognition. This information was processed early with an independent 2D-CNN stream.
Spatio-temporal features have also been learnt directly with the use of 3D-CNN or LSTM models. These have been summarised in this review, in which the temporal domain has been considered in the learning process. A multi-modal structure may add great improvements to the recognition system within a deep learning model. Toward this aim, different action recognition systems were presented within different temporal batches involving a deep learning model.
Author Contributions
M.A.-F. designed the concept and drafted the manuscript. J.C. and D.N. supervised, helped and supported M.A.-F. to plan the design and structure of the manuscript. A.I.A. prepared the figures and public datasets analysis. All authors discussed the analyses, interpretation of methods and commented on the manuscript. All authors have read and agreed to the published version of the manuscript.
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
Microsoft Research Blog
Microsoft at CVPR 2024: Innovations in computer vision and AI research
Published June 17, 2024
![CVPR 2024 logo on a green and purple abstract background](https://www.microsoft.com/en-us/research/uploads/prodnew/2024/06/CVPR_Blog_1400x788.png)
Microsoft is proud to sponsor the 41st annual Conference on Computer Vision and Pattern Recognition (CVPR 2024), held from June 17 to June 21. This premier conference covers a broad spectrum of topics in the field, including 3D reconstruction and modeling, action and motion analysis, video and image processing, synthetic data generation, neural networks, and many more. This year, 63 papers from Microsoft have been accepted, with six selected for oral presentations. This post highlights these contributions.
The diversity of these research projects reflects the interdisciplinary approach that Microsoft research teams have taken, from techniques that precisely recreate 3D human figures and perspectives in augmented reality (AR) to combining advanced image segmentation with synthetic data to better replicate real-world scenarios. Other projects demonstrate how researchers are combining machine learning with natural language processing and structured data, developing models that not only visualize but also interact with their environments. Collectively, these projects aim to improve machine perception and enable more accurate and responsive interactions with the world.
Oral presentations
BioCLIP: A Vision Foundation Model for the Tree of Life
Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G. Campolongo, Chan Hee Song, David Carlyn, Li Dong, W. Dahdul, Charles Stewart, Tanya Y. Berger-Wolf, Wei-Lun Chao, Yu Su
The surge in images captured from diverse sources—from drones to smartphones—offers a rich source of biological data. To harness this potential, we introduce TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images, and BioCLIP, a foundation model intended for the biological sciences. BioCLIP, utilizing the TreeOfLife-10M’s vast array of organism images and structured knowledge, excels in fine-grained biological classification, outperforming existing models by significant margins and demonstrating strong generalizability.
EgoGen: An Egocentric Synthetic Data Generator
Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys
A critical challenge in augmented reality (AR) is simulating realistic anatomical movements to guide cameras for authentic egocentric views. To overcome this, the authors developed EgoGen, a sophisticated synthetic data generator that not only improves training data accuracy for egocentric tasks but also refines the integration of motion and perception. It offers a practical solution for creating realistic egocentric training data, with the goal of serving as a useful tool for egocentric computer vision research.
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
Florence-2 introduces a unified, prompt-based vision foundation model capable of handling a variety of tasks, from captioning to object detection and segmentation. Designed to interpret text prompts as task instructions, Florence-2 generates text outputs across a spectrum of vision and vision-language tasks. This model’s training utilizes the FLD-5B dataset, which includes 5.4 billion annotations on 126 million images, developed using an iterative strategy of automated image annotation and continual model refinement.
LISA: Reasoning Segmentation via Large Language Model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia
This work introduces reasoning segmentation, a new segmentation task using complex query texts to generate segmentation masks. The authors also established a new benchmark, comprising over a thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation. Finally, the authors present Large Language Instructed Segmentation Assistant (LISA), a tool that combines the linguistic capabilities of large language models with the ability to produce segmentation masks. LISA effectively handles complex queries and shows robust zero-shot learning abilities, further enhanced by minimal fine-tuning.
MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song
MultiPly is a new framework for reconstructing multiple people in 3D from single-camera videos in natural settings. This technique employs a layered neural representation for the entire scene, refined through layer-wise differentiable volume rendering. Enhanced by a hybrid instance segmentation that combines self-supervised 3D and promptable 2D techniques, it provides reliable segmentation even with close interactions. The process uses confidence-guided optimization to alternately refine human poses and shapes, achieving high-fidelity, consistent 3D models.
SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes
Alexandros Delitzas, Ayça Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, Francis Engelmann
Traditional 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation, but the true challenge lies in interacting with functional interactive elements like handles, knobs, and buttons to achieve specific tasks. Enter SceneFun3D: a robust dataset featuring over 14,800 precise interaction annotations across 710 high-resolution real-world 3D indoor scenes. This dataset enriches scene comprehension with motion parameters and task-specific natural language descriptions, facilitating advanced research in functionality segmentation, task-driven affordance grounding, and 3D motion estimation.
Discover more about our work and contributions to CVPR 2024, including our full list of publications and sessions, on our conference webpage.
Temporal adaptive feature pyramid network for action detection
Recommendations
Modeling Temporal Structure of Complex Actions Using Bag-of-Sequencelets
We propose a new model for recognizing complex actions named Bag-of-Sequencelets. We represent a video as a sequence of primitive actions. We model a complex action as an ensemble of sub-sequences (sequencelets). We automatically learn sequencelets without ...
Bi-direction Feature Pyramid Temporal Action Detection Network
Temporal action detection in long-untrimmed videos is still a challenging task in video content analysis. Many existing approaches contain two stages, which firstly generate action proposals and then classify them. The main drawback of these ...
A Temporal Action Detection Model With Feature Pyramid Network
To find out all actions included in an untrimmed video, temporal action detection localizes the starting and ending of each action, and identify their categories, simultaneously. Different with trimmed video which always involves a single action ...
Information
Published in.
Elsevier Science Inc.
United States
Publication History
Author tags.
- Action detection
- Deep learning
- Feature pyramid network
- Self-attention
- 1D convolution
- Research-article
CVPR 2024 Announces Best Paper Award Winners
![research proposal on computer vision](https://ieeecs-media.computer.org/wp-media/2024/05/16112304/CVPR24-300x300-press-release-1x.jpg)
This year, from more than 11,500 paper submissions, the CVPR 2024 Awards Committee selected the following 10 winners for the honor of Best Papers during the Awards Program at CVPR 2024, taking place now through 21 June at the Seattle Convention Center in Seattle, Wash., U.S.A.
Best Papers
- “Generative Image Dynamics.” Authors: Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski. The paper presents a new approach for modeling natural oscillation dynamics from a single still picture. This approach produces photo-realistic animations from a single picture and significantly outperforms prior baselines. It also demonstrates potential to enable several downstream applications, such as creating seamlessly looping or interactive image dynamics.
- “Rich Human Feedback for Text-to-Image Generation.” Authors: Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katherine M. Collins, Yiwen Luo, Yang Li, Kai J. Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam. This paper presents the first rich human feedback dataset for image generation. The authors designed and trained a multimodal Transformer to predict the rich human feedback and demonstrated cases where the predicted feedback improves image generation.
Honorable mention papers included “EventPS: Real-Time Photometric Stereo Using an Event Camera” and “pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction.”
Best Student Papers
- “Mip-Splatting: Alias-free 3D Gaussian Splatting.” Authors: Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, Andreas Geiger. This paper introduces Mip-Splatting, a technique improving 3D Gaussian Splatting (3DGS) with a 3D smoothing filter and a 2D Mip filter for alias-free rendering at any scale. This approach significantly outperforms state-of-the-art methods in out-of-distribution scenarios, when testing at sampling rates different from training, resulting in better generalization to out-of-distribution camera poses and zoom factors.
- “BioCLIP: A Vision Foundation Model for the Tree of Life.” Authors: Samuel Stevens, Jiaman Wu, Matthew J. Thompson, Elizabeth G. Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M. Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su. This paper offers TREEOFLIFE-10M and BIOCLIP, a large-scale diverse biology image dataset and a foundation model for the tree of life, respectively. This work shows BIOCLIP is a strong fine-grained classifier for biology in both zero- and few-shot settings.
There were also four honorable mentions in this category this year: “SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency”; “Image Processing GNN: Breaking Rigidity in Super-Resolution”; “Objects as Volumes: A Stochastic Geometry View of Opaque Solids”; and “Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods.”
“We are honored to recognize the CVPR 2024 Best Paper Awards winners,” said David Crandall, Professor of Computer Science at Indiana University, Bloomington, Ind., U.S.A., and CVPR 2024 Program Co-Chair. “The 10 papers selected this year – double the number awarded in 2023 – are a testament to the continued growth of CVPR and the field, and to all of the advances that await.”
Additionally, the IEEE Computer Society (CS), a CVPR organizing sponsor, announced the Technical Community on Pattern Analysis and Machine Intelligence (TCPAMI) Awards at this year’s conference. The following were recognized for their achievements:
- 2024 Recipient: “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.” Authors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
- 2024 Recipient: Angjoo Kanazawa, Carl Vondrick
- 2024 Recipient: Andrea Vedaldi
“The TCPAMI Awards demonstrate the lasting impact and influence of CVPR research and researchers,” said Walter J. Scheirer, University of Notre Dame, Notre Dame, Ind., U.S.A., and CVPR 2024 General Chair. “The contributions of these leaders have helped to shape and drive forward continued advancements in the field. We are proud to recognize these achievements and congratulate them on their success.”
About CVPR 2024
The Computer Vision and Pattern Recognition Conference (CVPR) is the preeminent computer vision event for new research in support of artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and much more. Sponsored by the IEEE Computer Society (CS) and the Computer Vision Foundation (CVF), CVPR delivers the important advances in all areas of computer vision and pattern recognition and the various fields and industries they impact. With a first-in-class technical program, including tutorials and workshops, a leading-edge expo, and robust networking opportunities, CVPR, which is annually attended by more than 10,000 scientists and engineers, creates a one-of-a-kind opportunity for networking, recruiting, inspiration, and motivation.
CVPR 2024 takes place 17-21 June at the Seattle Convention Center in Seattle, Wash., U.S.A., and participants may also access sessions virtually. For more information about CVPR 2024, visit cvpr.thecvf.com.
About the Computer Vision Foundation
The Computer Vision Foundation (CVF) is a non-profit organization whose purpose is to foster and support research on all aspects of computer vision. Together with the IEEE Computer Society, it co-sponsors the two largest computer vision conferences, CVPR and the International Conference on Computer Vision (ICCV). Visit thecvf.com for more information.
About the IEEE Computer Society
Engaging computer engineers, scientists, academia, and industry professionals from all areas and levels of computing, the IEEE Computer Society (CS) serves as the world’s largest and most established professional organization of its type. IEEE CS sets the standard for the education and engagement that fuels continued global technological advancement. Through conferences, publications, and programs that inspire dialogue, debate, and collaboration, IEEE CS empowers, shapes, and guides the future of not only its 375,000+ community members, but the greater industry, enabling new opportunities to better serve our world. Visit computer.org for more information.
Seamless in Seattle: NVIDIA Research Showcases Advancements in Visual Generative AI at CVPR
NVIDIA researchers are at the forefront of the rapidly advancing field of visual generative AI, developing new techniques to create and interpret images, videos and 3D environments.
More than 50 of these projects will be showcased at the Computer Vision and Pattern Recognition (CVPR) conference, taking place June 17-21 in Seattle. Two of the papers — one on the training dynamics of diffusion models and another on high-definition maps for autonomous vehicles — are finalists for CVPR’s Best Paper Awards.
NVIDIA is also the winner of the CVPR Autonomous Grand Challenge’s End-to-End Driving at Scale track — a significant milestone that demonstrates the company’s use of generative AI for comprehensive self-driving models. The winning submission, which outperformed more than 450 entries worldwide, also received CVPR’s Innovation Award.
NVIDIA’s research at CVPR includes a text-to-image model that can be easily customized to depict a specific object or character, a new model for object pose estimation, a technique to edit neural radiance fields (NeRFs) and a visual language model that can understand memes. Additional papers introduce domain-specific innovations for industries including automotive, healthcare and robotics.
Collectively, the work introduces powerful AI models that could enable creators to more quickly bring their artistic visions to life, accelerate the training of autonomous robots for manufacturing, and support healthcare professionals by helping process radiology reports.
“Artificial intelligence, and generative AI in particular, represents a pivotal technological advancement,” said Jan Kautz, vice president of learning and perception research at NVIDIA. “At CVPR, NVIDIA Research is sharing how we’re pushing the boundaries of what’s possible — from powerful image generation models that could supercharge professional creators to autonomous driving software that could help enable next-generation self-driving cars.”
At CVPR, NVIDIA also announced NVIDIA Omniverse Cloud Sensor RTX, a set of microservices that enable physically accurate sensor simulation to accelerate the development of fully autonomous machines of every kind.
Forget Fine-Tuning: JeDi Simplifies Custom Image Generation
Creators harnessing diffusion models, the most popular method for generating images based on text prompts, often have a specific character or object in mind — they may, for example, be developing a storyboard around an animated mouse or brainstorming an ad campaign for a specific toy.
Prior research has enabled these creators to personalize the output of diffusion models to focus on a specific subject using fine-tuning — where a user trains the model on a custom dataset — but the process can be time-consuming and inaccessible for general users.
JeDi, a paper by researchers from Johns Hopkins University, Toyota Technological Institute at Chicago and NVIDIA, proposes a new technique that allows users to easily personalize the output of a diffusion model within a couple of seconds using reference images. The team found that the model achieves state-of-the-art quality, significantly outperforming existing fine-tuning-based and fine-tuning-free methods.
JeDi can also be combined with retrieval-augmented generation, or RAG, to generate visuals specific to a database, such as a brand’s product catalog.
New Foundation Model Perfects the Pose
NVIDIA researchers at CVPR are also presenting FoundationPose, a foundation model for object pose estimation and tracking that can be instantly applied to new objects during inference, without the need for fine-tuning.
The model, which set a new record on a popular benchmark for object pose estimation, uses either a small set of reference images or a 3D representation of an object to understand its shape. It can then identify and track how that object moves and rotates in 3D across a video, even in poor lighting conditions or complex scenes with visual obstructions.
FoundationPose could be used in industrial applications to help autonomous robots identify and track the objects they interact with. It could also be used in augmented reality applications where an AI model is used to overlay visuals on a live scene.
NeRFDeformer Transforms 3D Scenes With a Single Snapshot
A NeRF is an AI model that can render a 3D scene based on a series of 2D images taken from different positions in the environment. In fields like robotics, NeRFs can be used to generate immersive 3D renders of complex real-world scenes, such as a cluttered room or a construction site. However, to make any changes, developers would need to manually define how the scene has transformed — or remake the NeRF entirely.
Researchers from the University of Illinois Urbana-Champaign and NVIDIA have simplified the process with NeRFDeformer. The method, being presented at CVPR, can successfully transform an existing NeRF using a single RGB-D image, which is a combination of a normal photo and a depth map that captures how far each object in a scene is from the camera.
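To make the RGB-D description concrete, here is a minimal sketch of how a depth map can be turned into 3D points. It assumes a simple pinhole camera model; the function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx` and `cy` are hypothetical and chosen for illustration. This is not NeRFDeformer's method, only the standard geometry that gives a depth map its 3D meaning.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Convert a depth map into 3D camera-space points (pinhole model).

    depth[v][u] is the distance along the camera's z axis for pixel (u, v).
    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    All of these are illustrative assumptions, not values from the article.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Non-positive depth is treated as "no measurement" and skipped.
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth map with one missing measurement (0.0).
depth_map = [[1.0, 2.0], [0.0, 4.0]]
points = backproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Pairing each resulting point with the color of the same pixel in the photo gives a colored point cloud, which is the sense in which a single RGB-D image carries 3D information about a scene.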
![research proposal on computer vision](https://blogs.nvidia.com/wp-content/uploads/2024/06/Screenshot-2024-06-05-at-5.05.51-PM.png)
VILA Visual Language Model Gets the Picture
A CVPR research collaboration between NVIDIA and the Massachusetts Institute of Technology is advancing the state of the art for vision language models, which are generative AI models that can process videos, images and text.
The group developed VILA, a family of open-source visual language models that outperforms prior neural networks on key benchmarks that test how well AI models answer questions about images. VILA’s unique pretraining process unlocked new model capabilities, including enhanced world knowledge, stronger in-context learning and the ability to reason across multiple images.
![Figure showing how VILA can reason based on multiple images](https://blogs.nvidia.com/wp-content/uploads/2024/06/VILA.png)
The VILA model family can be optimized for inference using the NVIDIA TensorRT-LLM open-source library and can be deployed on NVIDIA GPUs in data centers, workstations and even edge devices.
Read more about VILA on the NVIDIA Technical Blog and GitHub.
Generative AI Fuels Autonomous Driving, Smart City Research
A dozen of the NVIDIA-authored CVPR papers focus on autonomous vehicle research. Other AV-related highlights include:
- NVIDIA’s AV applied research, which won the CVPR Autonomous Grand Challenge, is featured in this demo.
- Sanja Fidler, vice president of AI research at NVIDIA, will present on vision language models at the Workshop on Autonomous Driving on June 17.
- Producing and Leveraging Online Map Uncertainty in Trajectory Prediction, a paper authored by researchers from the University of Toronto and NVIDIA, has been selected as one of 24 finalists for CVPR’s best paper award.
Also at CVPR, NVIDIA contributed the largest ever indoor synthetic dataset to the AI City Challenge, helping researchers and developers advance the development of solutions for smart cities and industrial automation. The challenge’s datasets were generated using NVIDIA Omniverse, a platform of APIs, SDKs and services that enable developers to build Universal Scene Description (OpenUSD)-based applications and workflows.
NVIDIA Research has hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics. Learn more about NVIDIA Research at CVPR.
Basics of Computer Vision. Computer Vision (CV) is a field of artificial intelligence that trains computers to interpret and understand the visual world. Using digital images from cameras and videos, along with deep learning models, computers can accurately identify and classify objects, and then react to what they "see."
Here are the steps involved in identifying the problem statement in computer vision research: Problem Statement Analysis: The first step is to pinpoint the specific application domain within computer vision. This could be related to object recognition in autonomous vehicles or medical image analysis for disease detection.
Final presentation of the project "Computer vision networks. Developing digital visual methods for social and media research". Project developed in 2021 at the Center for Advanced Internet Studies ...
The features of big data could be captured by DL automatically and efficiently. The current applications of DL include computer vision (CV), natural language processing (NLP), video/speech recognition (V/SP), and finance and banking (F&B). Chai and Li (2019) provided a survey of DL on NLP and the advances on V/SP. The survey emphasized the ...
1.2 Project Milestones. The following are the main milestones in the progress of a project: Project Topic Selection. Background Research and algorithm selection. The Project Proposal. At this stage the following needs to be clearly specified: Algorithm design. Experiment design. E1: Identify a data set.
CS 6384 Computer Vision Project Proposal Description. Professor Yu Xiang February 6, 2022. 1 Introduction. For the computer vision course project, students can choose a topic related to computer vision, and explore the topic in one of the three different ways: • Research-oriented. In this direction, students are going to propose a new idea ...
Stanford Computational Vision and Geometry Lab
processing in many practical computer vision systems. The development of static image segmentation algorithms has attracted considerable research interest and is enriched by a wide range of methodologies. However, work that has been published in the video analysis domain is still quite narrow and biased towards the sole use of motion ...
If you are interested in doing a research related project, but do not see a suitable one listed here, feel free to contact one of the researchers at the lab. ... [2022-10-10] Zenseact: Multiple computer vision master theses proposals. E.g. Learning-based Road Estimation [2022-09-06] FOI: Neuromorfisk Avbildning [2022-02-21] FOI: Mörkerseende ...
Computer Vision Networks: a research proposal. March 2021. Authors: Janna Joceli Omena. Universidade NOVA de Lisboa. To read the file of this research, you can request a copy directly from the author.
The field of computer vision was initially conceived as a summer undergraduate project [3] in 1966. Notwithstanding the seemingly simple definition of 'seeing', it has proved to be a tough problem to solve. Going beyond perception and interpretation of visual data, research in computer vision now encompasses the following areas:
Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets. ... You can create a new account if you don't have one. Browse SoTA > Computer Vision Computer Vision. 4754 benchmarks • 1452 tasks • 3074 datasets • 49100 papers with code Semantic Segmentation Semantic Segmentation ...
Computer Vision is one of the most active areas where artificial intelligence (AI) is being used. This area is expanding rapidly and attracting a lot of interest and investment these days. Read more. Supervisor: Dr H Kim. 19 June 2024 PhD Research Project Competition Funded PhD Project (Students Worldwide)
Proposed PhD Projects in Computer Vision. Synthesis of Stereoscopic Movie from Conventional Monocular Video Clips. In order to provide material for 3-dimensional television displays, methods are required for producing 3-dimensional video material from existing 2-dimensional video, such as old films. This project seeks to develop automatic and ...
Research Proposal PDF Available. Computer vision and its application. November 2018; Authors: ... One application where this has been very prominent is the analysis of images, i.e., computer vision.
Dr.-Ing. Anna Hilsmann. Head of Vision & Imaging Technologies Department. Head of Computer Vision & Graphics Group. Phone +49 30 31002-569. Innovations for the digital society of the future are the focus of research and development work at the Fraunhofer HHI. The institute develops standards for information and communication technologies and ...
Computer Vision Research Proposal - Free download as PDF File (.pdf), Text File (.txt) or read online for free. This document outlines a research project on sensor identification for digital image forensics. The goals are to determine what camera captured a given image and to improve existing techniques. Over the summer, the student will collect an image database using multiple cameras under ...
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015) YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016) Mask R-CNN (2017) EfficientNet - Rethinking Model Scaling for Convolutional Neural Networks (2019) About us: Viso Suite is the end-to-end computer vision solution for enterprises. With a ...
Project Proposal. For the course project you will explore a topic in-depth of your own choosing. This can be an implementation (implement an existing algorithm); an application (apply a computer vision algorithm to a new problem); or research (trying to invent something new). To get you started, we have prepared a list of suggested projects.
We have 11 Computer Vision (research proposal form) PhD Projects, Programmes & Scholarships in the UK. More Details. Human emotion analysis and recognition for improving trusted human-robot interaction. Main project focus: AI and Robotics.
Applications. If you're coming to the class with a specific background and interests (e.g. biology, engineering, physics), we'd love to see you apply vision models learned in this class to problems related to your particular domain of interest. Pick a real-world problem and apply computer vision models to solve it. Models.
Facebook is calling for proposals for pilot and early-stage research that extends computer vision technologies in developing countries 1.We specifically seek projects that address the technical challenges impeding computer vision in these contexts, including data and hardware limitations and better integration of new information sources, such as high-resolution satellite imagery.
Abstract. Human action recognition targets recognising different actions from a sequence of observations under different environmental conditions. Vision-based action recognition research has a wide range of applications, including video surveillance, tracking, health care, and human-computer interaction.
Microsoft is proud to sponsor the 41st annual Conference on Computer Vision and Pattern Recognition (CVPR 2024), held from June 17 to June 21. This premier conference covers a broad spectrum of topics in the field, including 3D reconstruction and modeling, action and motion analysis, video and image processing, synthetic data generation, neural networks, and […]
Research Proposal.pdf - Free download as PDF File (.pdf), Text File (.txt) or read online for free. The document proposes a method to detect human poses in video frames using pictorial structure modeling and estimate poses. Key steps include detecting humans using weak constraints on body part position and appearance, estimating poses represented as pictorial structures, and classifying poses ...
research proposal - Free download as PDF File (.pdf), Text File (.txt) or read online for free. This document outlines a research proposal to investigate using Capsule Neural Networks (CapsNets) for traffic light image recognition in autonomous vehicles. It hypothesizes that CapsNets may improve upon current Convolutional Neural Network (CNN) methods by better preserving positional data.
Abstract. Detecting actions in videos has become a prominent research task due to its wide application. ... Bai Y., Wang Y., Tong Y., Yang Y., Liu Q., Liu J., Boundary content graph neural network for temporal action proposal generation, in: European Conference on Computer Vision, Springer ...
SEATTLE, 19 June 2024 - Today, during the 2024 Computer Vision and Pattern Recognition (CVPR) Conference opening session, the CVPR Awards Committee announced the winners of its prestigious Best Paper Awards, which annually recognize top research in computer vision, artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and much more.