
What is Speech Recognition?

Speech recognition, or speech-to-text, is the capability of a machine or program to identify spoken words and transform them into text. It is an important feature in many applications, from home automation to artificial intelligence. In this article, we explore how speech recognition software works, the algorithms behind it, and the role of NLP, with examples of how this technology is used in everyday life and across industries to make interactions with devices smarter and more intuitive.

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, focuses on enabling computers to understand and interpret human speech. It involves converting spoken language into text or executing commands based on the recognized words. This technology relies on sophisticated algorithms and machine learning models to process human speech in real time, despite variations in accent, pitch, speed, and slang.

Key Features of Speech Recognition

  • Accuracy and Speed: Modern systems can process speech in real time or near real time, providing quick responses to user inputs.
  • Natural Language Understanding (NLU): NLU enables systems to handle complex commands and queries, making technology more intuitive and user-friendly.
  • Multi-Language Support: Support for multiple languages and dialects allows users from different linguistic backgrounds to interact with technology in their native language.
  • Background Noise Handling: The ability to filter out background noise is crucial for voice-activated systems used in public or outdoor settings.

Speech Recognition Algorithms

Speech recognition technology relies on complex algorithms to translate spoken language into text or commands that computers can understand and act upon. Here are the algorithms and approaches used in speech recognition:

1. Hidden Markov Models (HMM)

Hidden Markov Models have been the backbone of speech recognition for many years. They model speech as a sequence of states, with each state representing a phoneme (a basic unit of sound) or a group of phonemes. HMMs are used to estimate the probability of a given sequence of sounds, making it possible to determine the most likely words spoken. Usage: although newer methods have surpassed HMMs in raw performance, they remain a fundamental concept in speech recognition and are often used in combination with other techniques.
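
To make the idea concrete, here is a minimal sketch of Viterbi decoding over a toy HMM, where the hidden states stand in for phonemes and the observations for quantized acoustic symbols. All states, symbols, and probabilities below are invented for illustration; real recognizers use thousands of context-dependent states and continuous acoustic features.

```python
# Toy Viterbi decoding: find the most likely phoneme sequence for a series
# of observed acoustic symbols. All probabilities here are invented.
states = ["k", "ae", "t"]                       # hypothetical phonemes
start_p = {"k": 0.6, "ae": 0.2, "t": 0.2}       # P(first state)
trans_p = {                                     # P(next state | current state)
    "k":  {"k": 0.1, "ae": 0.8, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.2, "t": 0.7},
    "t":  {"k": 0.3, "ae": 0.3, "t": 0.4},
}
emit_p = {                                      # P(observed symbol | state)
    "k":  {"o1": 0.7, "o2": 0.2, "o3": 0.1},
    "ae": {"o1": 0.1, "o2": 0.7, "o3": 0.2},
    "t":  {"o1": 0.1, "o2": 0.2, "o3": 0.7},
}

def viterbi(observations):
    """Return the most probable hidden-state path for the observations."""
    # V[t][s] = (probability of the best path ending in state s at time t,
    #            the state at time t-1 on that path)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, prev = max(
                (V[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            row[s] = (prob, prev)
        V.append(row)
    # Trace the back-pointers from the best final state.
    path = [max(V[-1], key=lambda s: V[-1][s][0])]
    for t in range(len(V) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["o1", "o2", "o3"]))  # ['k', 'ae', 't']
```

In practice, this kind of decoding runs over word lattices and is combined with acoustic and language-model scores rather than a single toy sequence.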

2. Natural language processing (NLP)

NLP is the area of artificial intelligence that focuses on the interaction between humans and machines through language, both spoken and written. Many mobile devices incorporate speech recognition into their systems to conduct voice search (for example, Siri) or to provide more accessibility around texting.

3. Deep Neural Networks (DNN)

DNNs have substantially improved speech recognition accuracy. These networks learn hierarchical representations of data, making them particularly effective at modeling complex patterns such as those found in human speech. DNNs are used both for acoustic modeling, to better capture the sound of speech, and for language modeling, to predict the likelihood of particular word sequences.

4. End-to-End Deep Learning

The trend has now shifted towards end-to-end deep learning models, which map speech inputs directly to text outputs without intermediate phonetic representations. These models, often based on advanced RNNs, Transformers, or attention mechanisms, can learn more complex patterns and dependencies in the speech signal.
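
As an illustration, one way to try a pretrained end-to-end model is through the Hugging Face transformers library. A minimal sketch, assuming the library is installed with a PyTorch backend (and ffmpeg available for audio decoding), and that "speech.wav" is a placeholder for a real recording:

```python
# Minimal sketch: transcribe an audio file with a pretrained end-to-end model.
# Assumes `pip install transformers` plus a backend such as PyTorch;
# "speech.wav" is a placeholder file, and the checkpoint is one public example.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
result = asr("speech.wav")   # the pipeline handles decoding and resampling
print(result["text"])
```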

What is Automatic Speech Recognition?

Automatic Speech Recognition (ASR) is a technology that enables computers to understand and transcribe spoken language into text. It works by analyzing audio input, such as spoken words, and converting it into written text, typically in real time. ASR systems use algorithms and machine learning techniques to recognize and interpret speech patterns, phonemes, and language models to accurately transcribe spoken words. This technology is widely used in various applications, including virtual assistants, voice-controlled devices, dictation software, customer service automation, and language translation services.

What is Dragon speech recognition software?

Dragon speech recognition software is a program developed by Nuance Communications that allows users to dictate text and control their computer using voice commands. It transcribes spoken words into written text in real time, enabling hands-free operation of computers and devices. Dragon software is widely used for various purposes, including dictating documents, composing emails, navigating the web, and controlling applications. It also features advanced capabilities such as voice commands for editing and formatting text, as well as custom vocabulary and voice profiles for improved accuracy and personalization.

What is a normal speech recognition threshold?

The normal speech recognition threshold refers to the level of sound, typically measured in decibels (dB), at which a person can accurately recognize speech. In quiet environments, this threshold is typically around 0 to 10 dB for individuals with normal hearing. However, in noisy environments or for individuals with hearing impairments, the threshold may be higher, meaning they require a louder volume to accurately recognize speech.

Uses of Speech Recognition

  • Virtual Assistants: These are like digital helpers that understand what you say. They can do things like set reminders, search the internet, and control smart home devices, all without you having to touch anything. Examples include Siri, Alexa, and Google Assistant.
  • Accessibility Tools: Speech recognition makes technology easier to use for people with disabilities. Features like voice control on phones and computers help them interact with devices more easily. There are also special apps for people with disabilities.
  • Automotive Systems: In cars, you can use your voice to control things like navigation and music. This helps drivers stay focused and safe on the road. Examples include voice-activated navigation systems in cars.
  • Healthcare: Doctors use speech recognition to quickly write down notes about patients, so they have more time to spend with them. There are also voice-controlled bots that help with patient care. For example, doctors use dictation tools to write down patient information quickly.
  • Customer Service: Speech recognition is used to direct customer calls to the right place or provide automated help. This makes things run smoother and keeps customers happy. Examples include call centers that you can talk to and customer service bots.
  • Education and E-Learning: Speech recognition helps people learn languages by giving them feedback on their pronunciation. It also transcribes lectures, making them easier to understand. Examples include language learning apps and lecture transcribing services.
  • Security and Authentication: Voice recognition, combined with biometrics, keeps things secure by making sure it's really you accessing your stuff. This is used in banking and for secure facilities. For example, some banks use your voice to make sure it's really you logging in.
  • Entertainment and Media: Voice recognition helps you find stuff to watch or listen to by just talking. This makes it easier to use things like TV and music services. There are also games you can play using just your voice.

Speech recognition is a powerful technology that lets computers understand and process human speech. It's used everywhere, from asking your smartphone for directions to controlling your smart home devices with just your voice. This tech makes life easier by helping with tasks without needing to type or press buttons, making gadgets like virtual assistants more helpful. It's also especially important for making tech accessible to everyone, including those who might have a hard time using keyboards or screens. As we keep finding new ways to use speech recognition, it's becoming a big part of our daily tech life, showing just how much we can do when we talk to our devices.

What is Speech Recognition? - FAQs

What are examples of speech recognition?

Note Taking/Writing: An example of speech recognition technology in use is speech-to-text platforms such as Speechmatics or Google's speech-to-text engine. In addition, many voice assistants offer speech-to-text translation.

Is speech recognition secure?

Security concerns related to speech recognition primarily involve the privacy and protection of audio data collected and processed by speech recognition systems. Ensuring secure data transmission, storage, and processing is essential to address these concerns.

Are speech recognition and voice recognition the same?

No, speech recognition and voice recognition are different. Speech recognition converts spoken words into text using NLP, focusing on the content of speech. Voice recognition, however, identifies the speaker based on vocal characteristics, emphasizing security and personalization without interpreting the speech's content.

What is speech recognition in AI?

Speech recognition is the process of converting sound signals to text transcriptions. The main steps in converting a sound wave to a text transcription in a speech recognition system are:

  • Recording: Audio is recorded using a voice recorder or microphone.
  • Sampling: The continuous audio wave is converted to discrete values.
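
As a small illustration of the recording and sampling steps, the Python standard library can report how a WAV recording was discretized; "speech.wav" below is a placeholder file name:

```python
# Inspect how a recording was sampled, using only the standard library.
# "speech.wav" is a placeholder for a real PCM WAV file.
import wave

with wave.open("speech.wav", "rb") as wav:
    rate = wav.getframerate()                      # samples per second (Hz)
    print("sample rate:", rate, "Hz")
    print("channels:", wav.getnchannels())         # 1 = mono, 2 = stereo
    print("sample width:", wav.getsampwidth(), "bytes per sample")
    print("duration:", wav.getnframes() / rate, "seconds")
```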

What are the types of Speech Recognition?

  • Dictation Systems: Convert speech to text.
  • Voice Command Systems: Execute spoken commands.
  • Speaker-Dependent Systems: Trained for specific users.
  • Speaker-Independent Systems: Work for any user.
  • Continuous Speech Recognition: Allows natural, flowing speech.
  • Discrete Speech Recognition: Requires pauses between words.
  • NLP-Integrated Systems: Understand context and meaning.

How accurate is speech recognition technology?

The accuracy of speech recognition technology can vary depending on factors such as the quality of audio input , language complexity , and the specific application or system being used. Advances in machine learning and deep learning have improved accuracy significantly in recent years.

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text, is a capability that enables a program to process human speech into a written format.

While speech recognition is commonly confused with voice recognition, speech recognition focuses on the translation of speech from a verbal format to a text one whereas voice recognition just seeks to identify an individual user’s voice.

IBM has had a prominent role within speech recognition since its inception, releasing "Shoebox" in 1962. This machine had the ability to recognize 16 different words, advancing the initial work from Bell Labs from the 1950s. However, IBM didn't stop there, but continued to innovate over the years, launching the VoiceType Simply Speaking application in 1996. This speech recognition software had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.

While speech technology had a limited vocabulary in the early days, it is utilized in a wide number of industries today, such as automotive, technology, and healthcare. Its adoption has only continued to accelerate in recent years due to advancements in deep learning and big data.  Research  (link resides outside ibm.com) shows that this market is expected to be worth USD 24.9 billion by 2025.


Many speech recognition applications and devices are available, but the more advanced solutions use AI and machine learning . They integrate grammar, syntax, structure, and composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go — evolving responses with each interaction.

The best kind of systems also allow organizations to customize and adapt the technology to their specific requirements — everything from language and nuances of speech to brand recognition. For example:

  • Language weighting: Improve precision by weighting specific words that are spoken frequently (such as product names or industry jargon), beyond terms already in the base vocabulary.
  • Speaker labeling: Output a transcription that cites or tags each speaker’s contributions to a multi-participant conversation.
  • Acoustics training: Attend to the acoustical side of the business. Train the system to adapt to an acoustic environment (like the ambient noise in a call center) and speaker styles (like voice pitch, volume and pace).
  • Profanity filtering: Use filters to identify certain words or phrases and sanitize speech output.

Meanwhile, speech recognition continues to advance. Companies like IBM are making inroads in several areas to improve human and machine interaction.

The vagaries of human speech have made development challenging. It’s considered to be one of the most complex areas of computer science – involving linguistics, mathematics and statistics. Speech recognizers are made up of a few components, such as the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.

Speech recognition technology is evaluated on its accuracy rate, i.e. word error rate (WER), and speed. A number of factors can impact word error rate, such as pronunciation, accent, pitch, volume, and background noise. Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the goal of speech recognition systems. Research from Lippmann (link resides outside ibm.com) estimates the word error rate to be around 4 percent, but it’s been difficult to replicate the results from this paper.
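
Concretely, WER is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the system's hypothesis, divided by the number of reference words. A minimal sketch with an invented example:

```python
# Word error rate: word-level edit distance between reference and hypothesis,
# divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # del, ins
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```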

Various algorithms and computation techniques are used to convert speech into text and improve the accuracy of transcription. Below are brief explanations of some of the most commonly used methods:

  • Natural language processing (NLP): While NLP isn’t necessarily a specific algorithm used in speech recognition, it is the area of artificial intelligence which focuses on the interaction between humans and machines through language through speech and text. Many mobile devices incorporate speech recognition into their systems to conduct voice search—e.g. Siri—or provide more accessibility around texting. 
  • Hidden markov models (HMM): Hidden Markov Models build on the Markov chain model, which stipulates that the probability of a given state hinges on the current state, not its prior states. While a Markov chain model is useful for observable events, such as text inputs, hidden markov models allow us to incorporate hidden events, such as part-of-speech tags, into a probabilistic model. They are utilized as sequence models within speech recognition, assigning labels to each unit—i.e. words, syllables, sentences, etc.—in the sequence. These labels create a mapping with the provided input, allowing it to determine the most appropriate label sequence.
  • N-grams: This is the simplest type of language model (LM), which assigns probabilities to sentences or phrases. An N-gram is a sequence of N words. For example, "order the pizza" is a trigram or 3-gram and "please order the pizza" is a 4-gram. Grammar and the probability of certain word sequences are used to improve recognition and accuracy (a minimal code sketch follows this list).
  • Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent.  While neural networks tend to be more accurate and can accept more data, this comes at a performance efficiency cost as they tend to be slower to train compared to traditional language models.
  • Speaker Diarization (SD): Speaker diarization algorithms identify and segment speech by speaker identity. This helps programs better distinguish individuals in a conversation and is frequently applied at call centers distinguishing customers and sales agents.
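
As promised above, here is a minimal sketch of the n-gram idea: estimating trigram probabilities from a toy corpus by simple counting. The corpus is invented; real language models are trained on vastly larger text collections and use smoothing.

```python
# Toy trigram language model: estimate P(word | two previous words) by counting.
from collections import defaultdict

corpus = [
    "please order the pizza",
    "please order the salad",
    "order the pizza now",
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        context, nxt = (words[i], words[i + 1]), words[i + 2]
        counts[context][nxt] += 1

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1 w2)."""
    total = sum(counts[(w1, w2)].values())
    return counts[(w1, w2)][w3] / total if total else 0.0

print(trigram_prob("order", "the", "pizza"))  # 2/3 in this toy corpus
```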

A wide number of industries are utilizing different applications of speech technology today, helping businesses and consumers save time and even lives. Some examples include:

Automotive: Speech recognizers improve driver safety by enabling voice-activated navigation systems and search capabilities in car radios.

Technology: Virtual agents are increasingly becoming integrated within our daily lives, particularly on our mobile devices. We use voice commands to access them through our smartphones, such as through Google Assistant or Apple’s Siri, for tasks, such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only continue to integrate into the everyday products that we use, fueling the “Internet of Things” movement.

Healthcare: Doctors and nurses leverage dictation applications to capture and log patient diagnoses and treatment notes.

Sales: Speech recognition technology has a couple of applications in sales. It can help a call center transcribe thousands of phone calls between customers and agents to identify common call patterns and issues. AI chatbots can also talk to people via a webpage, answering common queries and solving basic requests without needing to wait for a contact center agent to be available. In both instances, speech recognition systems help reduce time to resolution for consumer issues.

Security: As technology integrates into our daily lives, security protocols are an increasing priority. Voice-based authentication adds a viable level of security.


Essential Guide to Automatic Speech Recognition Technology


Over the past decade, AI-powered speech recognition systems have slowly become part of our everyday lives, from voice search to virtual assistants in contact centers, cars, hospitals, and restaurants. These speech recognition developments are made possible by deep learning advancements.

Developers across many industries now use automatic speech recognition (ASR) to increase business productivity, application efficiency, and even digital accessibility. This post discusses ASR, how it works, use cases, advancements, and more.

What is automatic speech recognition?

Speech recognition technology is capable of converting spoken language (an audio signal) into written text that is often used as a command.

Today’s most advanced software can accurately process varying language dialects and accents. For example, ASR is commonly seen in user-facing applications such as virtual agents, live captioning, and clinical note-taking. Accurate speech transcription is essential for these use cases.

Developers in the speech AI space also use  alternative terminologies  to describe speech recognition such as ASR, speech-to-text (STT), and voice recognition.

ASR is a critical component of  speech AI , which is a suite of technologies designed to help humans converse with computers through voice.

Why natural language processing is used in speech recognition

Developers are often unclear about the role of natural language processing (NLP) models in the ASR pipeline. Aside from being applied in language models, NLP is also used to augment generated transcripts with punctuation and capitalization at the end of the ASR pipeline.

After the transcript is post-processed with NLP, the text is used for downstream language modeling tasks:

  • Sentiment analysis
  • Text analytics
  • Text summarization
  • Question answering
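
For instance, a finished transcript can feed directly into one of these downstream tasks. A minimal sentiment-analysis sketch, assuming the Hugging Face transformers library is installed (a default public model is downloaded on first use) and using an invented transcript:

```python
# Downstream task sketch: sentiment analysis over an ASR transcript.
# Assumes `pip install transformers` plus a backend such as PyTorch.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")   # default public model
transcript = "Thanks so much, the agent resolved my issue quickly."
print(sentiment(transcript))   # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```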

Speech recognition algorithms

Speech recognition algorithms can be implemented in a traditional way using statistical algorithms or by using deep learning techniques such as neural networks to convert speech into text.

Traditional ASR algorithms

Hidden Markov models (HMM) and dynamic time warping (DTW) are two such examples of traditional statistical techniques for performing speech recognition.

Using a set of transcribed audio samples, an HMM is trained to predict word sequences by varying the model parameters to maximize the likelihood of the observed audio sequence.

DTW is a dynamic programming algorithm that finds the best possible word sequence by calculating the distance between time series: one representing the unknown speech and others representing the known words.
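
A minimal sketch of DTW-based template matching follows. It uses scalar features and invented values purely for illustration; real systems compare sequences of multi-dimensional acoustic feature vectors.

```python
# Dynamic time warping: minimal alignment cost between two sequences.
def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Compare an "unknown" utterance against two stored word templates.
unknown   = [1.0, 2.0, 3.0, 2.0]
templates = {"yes": [1.0, 2.0, 2.9, 2.1], "no": [3.0, 3.0, 1.0, 0.5]}
best = min(templates, key=lambda w: dtw_distance(unknown, templates[w]))
print(best)  # "yes": the closest template under DTW
```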

Deep learning ASR algorithms

For the last few years, developers have been interested in deep learning for speech recognition because statistical algorithms are less accurate. In fact, deep learning algorithms work better at understanding dialects, accents, context, and multiple languages, and they transcribe accurately even in noisy environments.

Some of the most popular state-of-the-art speech recognition acoustic models are QuartzNet, Citrinet, and Conformer. In a typical speech recognition pipeline, you can choose and switch any acoustic model that you want based on your use case and performance.

Implementation tools for deep learning models

Several tools are available for developing deep learning speech recognition models and pipelines, including Kaldi , Mozilla DeepSpeech, NVIDIA NeMo , NVIDIA Riva , NVIDIA TAO Toolkit , and services from Google, Amazon, and Microsoft.

Kaldi, DeepSpeech, and NeMo are open-source toolkits that help you build speech recognition models. TAO Toolkit and Riva are closed-source SDKs that help you develop customizable pipelines that can be deployed in production.

Cloud service providers like Google, AWS, and Microsoft offer generic services that you can easily plug and play with.

Deep learning speech recognition pipeline

An ASR pipeline consists of the following components:

  • Spectrogram generator that converts raw audio to spectrograms.
  • Acoustic model that takes the spectrograms as input and outputs a matrix of probabilities over characters over time.
  • Decoder (optionally coupled with a language model) that generates possible sentences from the probability matrix.
  • Punctuation and capitalization model that formats the generated text for easier human consumption.

A typical deep learning pipeline for speech recognition includes the following components:

  • Data preprocessing
  • Neural acoustic model
  • Decoder (optionally coupled with an n-gram language model)
  • Punctuation and capitalization model

Figure 1 shows an example of a deep learning speech recognition pipeline:

Figure 1: The deep learning ASR pipeline.

Datasets are essential in any deep learning application. Neural networks function similarly to the human brain. The more data you use to teach the model, the more it learns. The same is true for the speech recognition pipeline.

A few popular speech recognition datasets are:

  • LibriSpeech
  • Fisher English Training Speech
  • Mozilla Common Voice (MCV)
  • 2000 HUB 5 English Evaluation Speech
  • AN4 (includes recordings of people spelling out addresses and names)
  • Aishell-1/AIshell-2 Mandarin speech corpus

Data processing is the first step. It includes data preprocessing and augmentation techniques such as speed/time/noise/impulse perturbation and time stretch augmentation, fast Fourier Transformations (FFT) using windowing, and normalization techniques.

For example, in Figure 2, the mel spectrogram is generated from a raw audio waveform after applying FFT using the windowing technique.

Figure 2: An audio recording as a waveform (left) and mel spectrogram (right).
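
A minimal sketch of this preprocessing step, assuming the librosa library is installed and that "speech.wav" is a placeholder recording:

```python
# Minimal sketch: raw waveform -> log mel spectrogram.
# Assumes `pip install librosa`; "speech.wav" is a placeholder audio file.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)   # waveform resampled to 16 kHz
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=512,        # FFT window size
    hop_length=160,   # 10 ms hop at 16 kHz
    n_mels=80,        # number of mel frequency bands
)
log_mel = librosa.power_to_db(mel)             # log scale, as models expect
print(log_mel.shape)                           # (80, number_of_frames)
```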

We can also use perturbation techniques to augment the training dataset. Figures 3 and 4 represent techniques like noise perturbation and masking being used to increase the size of the training dataset in order to avoid problems like overfitting.

Figure 3: A noise-augmented audio recording as a waveform (left) and mel spectrogram (right).
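
A minimal noise-perturbation sketch, assuming numpy and a waveform array y such as the one loaded in the previous snippet; the target signal-to-noise ratio is an arbitrary choice:

```python
# Noise perturbation sketch: add Gaussian noise at a target SNR to create
# an augmented copy of a training waveform. Assumes numpy is installed.
import numpy as np

def add_noise(y: np.ndarray, snr_db: float) -> np.ndarray:
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise

# Example with a synthetic 1-second "waveform" sampled at 16 kHz.
y = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
y_noisy = add_noise(y, snr_db=10)
```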

The output of the data preprocessing stage is a spectrogram/mel spectrogram, which is a visual representation of the strength of the audio signal over time. 

Mel spectrograms are then fed into the next stage: a neural acoustic model . QuartzNet, CitriNet, ContextNet, Conformer-CTC, and Conformer-Transducer are examples of cutting-edge neural acoustic models. Multiple ASR models exist for several reasons, such as the need for real-time performance, higher accuracy, memory size, and compute cost for your use case.

However, Conformer-based models are becoming more popular due to their improved accuracy and their ability to capture both local and global context in the audio. The acoustic model returns the probability of characters/words at each time stamp.

Figure 5 shows the output of the acoustic model, with time stamps. 

Figure 5: The output of the acoustic model: a probability distribution over vocabulary characters at each time step.

The acoustic model’s output is fed into the decoder along with the language model. Decoders include greedy and beam search decoders, and language models include n-gram language models, KenLM, and neural scoring. The decoder generates the top word candidates, which are then passed to the language model to predict the most likely sentence.
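
To make the decoding step concrete, here is a minimal sketch of greedy decoding over a CTC-style probability matrix: take the most likely symbol at each time step, collapse repeats, and drop the blank symbol. The vocabulary and probabilities are invented.

```python
# Greedy CTC-style decoding: argmax per frame, collapse repeats, drop blanks.
BLANK = "_"
vocab = ["_", "c", "a", "t"]

# Invented acoustic-model output: one probability row per time frame.
probs = [
    [0.1, 0.7, 0.1, 0.1],   # -> "c"
    [0.1, 0.6, 0.2, 0.1],   # -> "c" (repeat, collapsed)
    [0.7, 0.1, 0.1, 0.1],   # -> blank
    [0.1, 0.1, 0.7, 0.1],   # -> "a"
    [0.1, 0.1, 0.1, 0.7],   # -> "t"
]

best = [vocab[max(range(len(row)), key=row.__getitem__)] for row in probs]
decoded = []
prev = None
for sym in best:
    if sym != prev and sym != BLANK:
        decoded.append(sym)
    prev = sym
print("".join(decoded))  # "cat"
```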

In Figure 6, the decoder selects the next best word based on the probability score. Based on the final highest score, the correct word or sentence is selected and sent to the punctuation and capitalization model.

Figure 6: The decoder picks the next word based on probability scores to generate the final transcript.

The ASR pipeline generates text with no punctuation or capitalization.

Finally, a punctuation and capitalization model is used to improve the text quality for better readability. Bidirectional Encoder Representations from Transformers (BERT) models are commonly used to generate punctuated text.

Figure 7 shows a simple example of a before-and-after punctuation and capitalization model.

Figure 7: A punctuation and capitalization model adds punctuation and capitalization to a generated transcript.

Speech recognition industry impact

There are many unique applications for ASR . For example, speech recognition could help industries such as finance, telecommunications, and unified communications as a service (UCaaS) to improve customer experience, operational efficiency, and return on investment (ROI).

Speech recognition is applied in the finance industry for applications such as call center agent assist and trade floor transcripts. ASR is used to transcribe conversations between customers and call center agents or trade floor agents. The generated transcriptions can then be analyzed and used to provide real-time recommendations to agents. This can add up to an 80% reduction in post-call time.

Furthermore, the generated transcripts are used for downstream tasks:

  • Intent and entity recognition

Telecommunications

Contact centers are critical components of the telecommunications industry. With contact center technology, you can reimagine the telecommunications customer center, and speech recognition helps with that.

As previously discussed in the finance call center use case, ASR is used in telecom contact centers to transcribe conversations between customers and contact center agents, analyze them, and make recommendations to call center agents in real time. T-Mobile uses ASR for quick customer resolution, for example.

Unified communications as a service

COVID-19 increased demand for UCaaS solutions, and vendors in the space began focusing on the use of speech AI technologies such as ASR to create more engaging meeting experiences.

For example, ASR can be used to generate live captions in video conferencing meetings. Captions generated can then be used for downstream tasks such as meeting summaries and identifying action items in notes.

Future of ASR technology

Speech recognition is not as easy as it sounds. Developing speech recognition is full of challenges, ranging from accuracy to customization for your use case to real-time performance. On the other hand, businesses and academic institutions are racing to overcome some of these challenges and advance the use of speech recognition capabilities.

ASR challenges

Some of the challenges in developing and deploying speech recognition pipelines in production include the following:

  • Lack of tools and SDKs that offer state-of-the-art (SOTA) ASR models makes it difficult for developers to take advantage of the best speech recognition technology.
  • Limited customization capabilities make it hard for developers to fine-tune models on domain-specific and context-specific jargon, multiple languages, dialects, and accents so that applications understand and speak like their users.
  • Restricted deployment support; for example, depending on the use case, the software should be capable of being deployed in any cloud, on-premises, edge, and embedded.
  • Real-time speech recognition pipelines; for instance, in a call center agent assist use case, we cannot wait several seconds for conversations to be transcribed before using them to empower agents.

For more information about the major pain points that developers face when adding speech-to-text capabilities to applications, see Solving Automatic Speech Recognition Deployment Challenges .

ASR advancements

Numerous advancements in speech recognition are occurring on both the research and software development fronts. To begin, research has resulted in the development of several new cutting-edge ASR architectures, E2E speech recognition models, and self-supervised or unsupervised training techniques.

On the software side, there are a few tools that enable quick access to SOTA models, and then there are different sets of tools that enable the deployment of models as services in production. 

Key takeaways

Speech recognition continues to grow in adoption due to its advancements in deep learning-based algorithms that have made ASR as accurate as human recognition. Also, breakthroughs like multilingual ASR help companies make their apps available worldwide, and moving algorithms from cloud to on-device saves money, protects privacy, and speeds up inference.

NVIDIA offers Riva , a speech AI SDK, to address several of the challenges discussed above. With Riva, you can quickly access the latest SOTA research models tailored for production purposes. You can customize these models to your domain and use case, deploy on any cloud, on-premises, edge, or embedded, and run them in real-time for engaging natural interactions.

Learn how your organization can benefit from speech recognition skills with the free ebook, Building Speech AI Applications .



This learning resource is about the automatic conversion of spoken language into text, which can be stored as documents or processed as commands to control devices, e.g. for handicapped or elderly people, or which in a commercial setting allows ordering goods and services by audio commands. The learning resource is based on the Open Community Approach, so the tools used are Open Source, to assure that learners have access to them.

Speech Recognition

Learning Tasks


  • (Applications of Speech Recognition) Analyse the possible applications of speech recognition and identify challenges of the application!
  • (Human Speech Recognition) Compare human comprehension of speech with the algorithmic speech recognition approach. What are the similarities and differences of human and algorithmic speech recognition?
  • What are the similarities and differences between text and emotion recognition in speech analysis?
  • What are possible application areas in digital assistants for both speech recognition and emotion recognition?
  • Analyze the different types of information systems and identify different areas of application of speech recognition and include mobile devices in your consideration!
  • (History) Analyse the history of speech recognition and compare the steps of development with current applications. Identify the major steps that are required for the current applications of speech recognition!
  • (Risk Literacy) Identify possible areas of risk and possible risk mitigation strategies if speech recognition is implemented in mobile devices, or with voice control for the Internet of Things in general. What are the required capacity building measures for business, research, and development?
  • (Commercial Data Harvesting) Apply the concept of speech recognition to commercial data harvesting. What are the potential benefits for generating tailored advertisements for users according to their generated profile? How does speech recognition contribute to the user profile? What is the difference between offline and online speech recognition systems with respect to recognized text or audio files being submitted to remote servers for speech recognition?
  • (Context Awareness of Speech Recognition) The word "Fire" spoken with a candle in your hand creates a different context, and different expectations in listeners, than the same word spoken with a burning house in the background. Explain why context awareness can be helpful to optimize recognition correctness. How can a speech recognition system detect the context of the speech, i.e. detect the context without a user setting that switches to a dictation mode, e.g. for medical reports on X-ray images?
  • (Audio-Video-Compression) Go to the learning resource about Audio-Video-Compression and explain how Speech Recognition can be used in conjunction with Speech Synthesis to reduce the consumption of bandwidth for Video conferencing.
  • (Performance) Explain why the performance and accuracy of speech recognition are relevant in many applications. Discuss applications in cars or in vehicles in general. Which voice commands can be applied in a traffic situation, and which command (not accurately recognized) could cause trouble or even an accident for the driver? Order the theoretical applications of speech recognition (e.g. "turn right at crossing", "switch on/off music", ...) in terms of the performance and accuracy required, relative to currently available technologies, to perform the command in an acceptable way.
  • Explain how the recognized words are encoded for speech recognition in the demo application (digits, cities, operating systems).
  • Explain how the concept of speech recognition can support handicapped people [ 1 ] with navigating in a WebApp or offline AppLSAC for digital learning environments .
  • (Size of Vocabulary) Explain how the size of the recognized vocabulary determines the precision of recognition.
  • (People with Disabilities) [ 2 ] Explore the available Open Source frameworks for offline speech recognition that work without sending audio streams to a remote server for processing. Identify options to control robots, or options in the context of Ambient Assisted Living, with voice recognition [ 3 ] .
  • Collaborative development of the Open Source code base of the speech recognition infrastructure,
  • Application on the collaborative development of a domain specific vocabulary for speech recognition for specific application scenarios.
  • Application on Open Educational Resources that support learners in using speech recognition and Open Source developers in integrating Open Source frameworks into learning environments.

Speech recognition is the interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech to text (STT). It incorporates knowledge and research in the fields of linguistics, computer science, and electrical engineering.

Training of Speech Recognition Algorithms

Some speech recognition systems require "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker independent" [ 4 ] systems. Systems that use training are called "speaker dependent".

Applications

Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics, [ 5 ] speech-to-text processing (e.g., word processors, emails, and generating a string-searchable transcript from an audio track), and aircraft (usually termed direct voice input).

The term voice recognition [ 6 ] [ 7 ] [ 8 ] or speaker identification [ 9 ] [ 10 ] refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice or it can be used to authenticate or verify the identity of a speaker as part of a security process.

From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data . The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems.

Models, methods, and algorithms

Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling is also used in many other natural language processing applications such as document classification or statistical machine translation .

  • Hidden Markov Model
  • Dynamic Time Warping
  • Neural Networks
  • End-to-End Automated Speech Recognition

Learning Task: Applications

The following learning tasks focus on different applications of Speech Recognition. Explore the different applications.

  • In-Car Systems
  • People with Disabilities
  • Health Care
  • Telephone Support Systems

Usage in education and daily life

For language learning , speech recognition can be useful for learning a second language . It can teach proper pronunciation, in addition to helping a person develop fluency with their speaking skills. [ 11 ]

Students who are blind (see Blindness and education ) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice, instead of having to look at the screen and keyboard. [ 12 ]

Students who are physically disabled or suffer from repetitive strain injury or other injuries to the upper extremities can be relieved from having to worry about handwriting, typing, or working with a scribe on school assignments by using speech-to-text programs. They can also utilize speech recognition technology to freely enjoy searching the Internet or using a computer at home without having to physically operate a mouse and keyboard. [ 12 ]

Speech recognition can allow students with learning disabilities to become better writers. By saying the words aloud, they can increase the fluidity of their writing, and be alleviated of concerns regarding spelling, punctuation, and other mechanics of writing. [ 13 ] Also, see Learning disability .

Use of voice recognition software, in conjunction with a digital audio recorder and a personal computer running word-processing software, has proven to be positive for restoring damaged short-term memory capacity in individuals who have had a stroke or craniotomy.

Further applications

  • Aerospace (e.g. space exploration, spacecraft): NASA's Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the Lander [ 14 ]
  • Automatic subtitling with speech recognition
  • Automatic emotion recognition [ 15 ]
  • Automatic translation
  • Court reporting (Real time Speech Writing)
  • eDiscovery (Legal discovery)
  • Hands-free computing : Speech recognition computer user interface
  • Home automation
  • Interactive voice response
  • Mobile telephony , including mobile email
  • Multimodal interaction
  • Pronunciation evaluation in computer-aided language learning applications
  • Real Time Captioning [ citation needed ]
  • Speech to text (transcription of speech into text, real time video captioning , Court reporting )
  • Telematics (e.g. vehicle Navigation Systems)
  • Transcription (digital speech-to-text)
  • Video games , with Tom Clancy's EndWar and Lifeline as working examples
  • Virtual assistant (e.g. Apple's Siri )

Further information

Conferences and Journals

Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe, ICASSP , Interspeech /Eurospeech, and the IEEE ASRU. Conferences in the field of natural language processing , such as ACL , NAACL , EMNLP, and HLT, are beginning to include papers on speech processing . Important journals include the IEEE Transactions on Speech and Audio Processing (later renamed IEEE Transactions on Audio, Speech and Language Processing and since Sept 2014 renamed IEEE /ACM Transactions on Audio, Speech and Language Processing—after merging with an ACM publication), Computer Speech and Language, and Speech Communication.

Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993). Another good source can be "Statistical Methods for Speech Recognition" by Frederick Jelinek and "Spoken Language Processing (2001)" by Xuedong Huang etc. More up to date are "Computer Speech", by Manfred R. Schroeder , second edition published in 2004, and "Speech Processing: A Dynamic and Optimization-Oriented Approach" published in 2003 by Li Deng and Doug O'Shaughnessey. The recently updated textbook Speech and Language Processing (2008) by Jurafsky and Martin presents the basics and the state of the art for ASR. Speaker recognition also uses the same features, most of the same front-end processing, and classification techniques as is done in speech recognition. A most recent comprehensive textbook, "Fundamentals of Speaker Recognition" is an in depth source for up to date details on the theory and practice. [ 16 ] A good insight into the techniques used in the best modern systems can be gained by paying attention to government sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).

A good and accessible introduction to speech recognition technology and its history is provided by the general audience book "The Voice in the Machine. Building Computers That Understand Speech" by Roberto Pieraccini (2012).

The most recent book on speech recognition is Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer) written by D. Yu and L. Deng and published near the end of 2014, with highly mathematically oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods. [ 17 ] A related book, published earlier in 2014, "Deep Learning: Methods and Applications" by L. Deng and D. Yu provides a less technical but more methodology-focused overview of DNN-based speech recognition during 2009–2014, placed within the more general context of deep learning applications including not only speech recognition but also image recognition, natural language processing, information retrieval, multimodal processing, and multitask learning. [ 18 ]

In terms of freely available resources, Carnegie Mellon University's Sphinx toolkit is one place to start, both to learn about speech recognition and to start experimenting. Another resource (free but copyrighted) is the HTK book (and the accompanying HTK toolkit). For more recent and state-of-the-art techniques, the Kaldi toolkit can be used [ 19 ] . In 2017 Mozilla launched the open source project Common Voice [ 20 ] to gather a big database of voices to help build the free speech recognition project DeepSpeech (available free on GitHub ) [ 21 ] , using Google's open source platform TensorFlow [ 22 ] .
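
For a quick first experiment, one option is the third-party SpeechRecognition Python package with its CMU Sphinx backend; a minimal sketch, assuming `pip install SpeechRecognition pocketsphinx` and a placeholder "speech.wav" file:

```python
# Offline transcription sketch using CMU Sphinx via the SpeechRecognition
# package. Assumes `pip install SpeechRecognition pocketsphinx`.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:   # placeholder WAV file
    audio = recognizer.record(source)        # read the entire file

try:
    print(recognizer.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand the audio")
```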

A demonstration of an on-line speech recognizer is available on Cobalt's webpage. [ 23 ]

For more software resources, see List of speech recognition software .

  • Applications of artificial intelligence
  • Articulatory speech recognition
  • Audio mining
  • Audio-visual speech recognition
  • Automatic Language Translator
  • Automotive head unit
  • Cache language model
  • Digital audio processing
  • Dragon NaturallySpeaking
  • Fluency Voice Technology
  • Google Voice Search
  • IBM ViaVoice
  • Keyword spotting
  • Multimedia information retrieval
  • Origin of speech
  • Phonetic search technology
  • Speaker diarisation
  • Speaker recognition
  • Speech analytics
  • Speech interface guideline
  • Speech recognition software for Linux
  • Speech Synthesis
  • Speech verification
  • Subtitle (captioning)
  • Windows Speech Recognition
  • List of emerging technologies
  • Outline of artificial intelligence
  • Timeline of speech and voice recognition
  • ↑ Pacheco-Tallaj, Natalia M., and Claudio-Palacios, Andrea P. "Development of a Vocabulary and Grammar for an Open-Source Speech-driven Programming Platform to Assist People with Limited Hand Mobility". Research report submitted to Keyla Soto, UHS Science Professor.
  • ↑ Stodden, Robert A., and Kelly D. Roberts. "The Use Of Voice Recognition Software As A Compensatory Strategy For Postsecondary Education Students Receiving Services Under The Category Of Learning Disabled." Journal Of Vocational Rehabilitation 22.1 (2005): 49--64. Academic Search Complete. Web. 1 Mar. 2015.
  • ↑ Zaman, S., & Slany, W. (2014). Smartphone-Based Online and Offline Speech Recognition System for ROS-Based Robots. Information Technology and Control, 43(4), 371-380.
  • ↑ P. Nguyen (2010). "Automatic classification of speaker characteristics" .
  • ↑ "British English definition of voice recognition" . Macmillan Publishers Limited. Archived from the original on 16 September 2011 . Retrieved 21 February 2012 .
  • ↑ "voice recognition, definition of" . WebFinance, Inc. Archived from the original on 3 December 2011 . Retrieved 21 February 2012 .
  • ↑ "The Mailbag LG #114" . Linuxgazette.net. Archived from the original on 19 February 2013 . Retrieved 15 June 2013 .
  • ↑ Reynolds, Douglas; Rose, Richard (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models" . IEEE Transactions on Speech and Audio Processing 3 (1): 72–83. doi: 10.1109/89.365379 . ISSN  1063-6676 . OCLC  26108901 . Archived from the original on 8 March 2014 . https://web.archive.org/web/20140308001101/http://www.cs.toronto.edu/~frank/csc401/readings/ReynoldsRose.pdf . Retrieved 21 February 2014 .  
  • ↑ "Speaker Identification (WhisperID)" . Microsoft Research . Microsoft. Archived from the original on 25 February 2014 . Retrieved 21 February 2014 . When you speak to someone, they don't just recognize what you say: they recognize who you are. WhisperID will let computers do that, too, figuring out who you are by the way you sound.
  • ↑ Cerf, Vinton; Wrubel, Rob; Sherwood, Susan. "Can speech-recognition software break down educational language barriers?" . Curiosity.com . Discovery Communications. Archived from the original on 7 April 2014 . Retrieved 26 March 2014 .
  • ↑ 12.0 12.1 "Speech Recognition for Learning" . National Center for Technology Innovation. 2010. Archived from the original on 13 April 2014 . Retrieved 26 March 2014 .
  • ↑ Follensbee, Bob; McCloskey-Dale, Susan (2000). "Speech recognition in schools: An update from the field" . Technology And Persons With Disabilities Conference 2000 . Archived from the original on 21 August 2006 . Retrieved 26 March 2014 .
  • ↑ "Projects: Planetary Microphones" . The Planetary Society. Archived from the original on 27 January 2012.
  • ↑ Caridakis, George; Castellano, Ginevra; Kessous, Loic; Raouzaiou, Amaryllis; Malatesta, Lori; Asteriadis, Stelios; Karpouzis, Kostas (19 September 2007). Multimodal emotion recognition from expressive faces, body gestures and speech (in en). 247 . Springer US. 375–388. doi: 10.1007/978-0-387-74161-1_41 . ISBN  978-0-387-74160-4 .  
  • ↑ Beigi, Homayoon (2011). Fundamentals of Speaker Recognition . New York: Springer. ISBN  978-0-387-77591-3 . Archived from the original on 31 January 2018 . https://web.archive.org/web/20180131140911/http://www.fundamentalsofspeakerrecognition.org/ .  
  • ↑ Yu, D.; Deng, L. (2014). Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer) .  
  • ↑ Deng, Li; Yu, Dong (2014). "Deep Learning: Methods and Applications" . Foundations and Trends in Signal Processing 7 (3–4): 197–387. doi: 10.1561/2000000039 . Archived from the original on 22 October 2014 . https://web.archive.org/web/20141022161017/http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf .  
  • ↑ Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. CONF). IEEE Signal Processing Society.
  • ↑ https://voice.mozilla.org
  • ↑ https://github.com/mozilla/DeepSpeech
  • ↑ https://www.tensorflow.org/tutorials/sequences/audio_recognition
  • ↑ https://demo-cubic.cobaltspeech.com/

Further reading

  • Pieraccini, Roberto (2012). The Voice in the Machine. Building Computers That Understand Speech. . The MIT Press. ISBN  978-0262016858 .  
  • Woelfel, Matthias; McDonough, John (2009-05-26). Distant Speech Recognition . Wiley. ISBN  978-0470517048 .  
  • Karat, Clare-Marie; Vergo, John; Nahamoo, David (2007). "Conversational Interface Technologies". In Sears, Andrew ; Jacko, Julie A.. The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications (Human Factors and Ergonomics) . Lawrence Erlbaum Associates Inc. ISBN  978-0-8058-5870-9 .  
  • Cole, Ronald; Mariani, Joseph; Uszkoreit, Hans et al., eds (1997). Survey of the state of the art in human language technology . Cambridge Studies in Natural Language Processing. XII–XIII . Cambridge University Press. ISBN  978-0-521-59277-2 .  
  • Junqua, J.-C.; Haton, J.-P. (1995). Robustness in Automatic Speech Recognition: Fundamentals and Applications . Kluwer Academic Publishers. ISBN  978-0-7923-9646-8 .  
  • Pirani, Giancarlo, ed (2013). Advanced algorithms and architectures for speech understanding . Springer Science & Business Media. ISBN  978-3-642-84341-9 .  

External links

  • Signer, Beat and Hoste, Lode: SpeeG2: A Speech- and Gesture-based Interface for Efficient Controller-free Text Entry , In Proceedings of ICMI 2013, 15th International Conference on Multimodal Interaction, Sydney, Australia, December 2013
  • Speech Technology at the Open Directory Project

Page Information

This page was based on the following Wikipedia source page:

  • Speech Recognition https://en.wikipedia.org/wiki/Speech%20Recognition
  • Date: 7/2/2019 - Source History
  • Wikipedia2Wikiversity-Converter : https://niebert.github.com/Wikipedia2Wikiversity


Automatic Speech Recognition


What is Automatic Speech Recognition?

Automatic Speech Recognition (ASR), also known as speech-to-text, is the process by which a computer or electronic device converts human speech into written text. This technology is a subset of computational linguistics that deals with the interpretation and translation of spoken language into text by computers. It enables humans to speak commands into devices, dictate documents, and interact with computer-based systems through natural language.

How Does Automatic Speech Recognition Work?

ASR systems typically involve several processing stages to accurately transcribe speech. The process begins with the acoustic signal being captured by a microphone. This signal is then digitized and processed to filter out noise and improve clarity.

The core of ASR technology involves two main models:

  • Acoustic Model: This model is trained to recognize the basic units of sound in speech, known as phonemes. It maps segments of audio to these phonemes and considers variations in pronunciation, accent, and intonation.
  • Language Model: This model is used to understand the context and semantics of the spoken words. It predicts the sequence of words that form a sentence, based on the likelihood of word sequences in the language. This helps in distinguishing between words that sound similar but have different meanings.

Once the audio has been processed through these models, the ASR system generates a transcription of the spoken words. Advanced systems may also include additional components, such as a dialogue manager in interactive voice response systems, or a natural language understanding module to interpret the intent behind the words.
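
To see this pipeline from the outside, the hedged sketch below uses the open-source Python SpeechRecognition package, whose recognize_google method sends audio to Google's free Web Speech API; the acoustic and language models described above run on the service side. The file name is a placeholder, and the call requires an internet connection.

    import speech_recognition as sr  # pip install SpeechRecognition

    recognizer = sr.Recognizer()

    # Read a recorded utterance from disk (the file name is a placeholder).
    with sr.AudioFile("utterance.wav") as source:
        # Estimate background noise from the first half second of audio.
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.record(source)

    try:
        # The service applies its acoustic and language models and
        # returns the most likely transcription.
        print(recognizer.recognize_google(audio))
    except sr.UnknownValueError:
        print("Speech was unintelligible")
    except sr.RequestError as error:
        print(f"Could not reach the recognition service: {error}")

The same Recognizer also exposes offline engines (for example, recognize_sphinx with CMU Sphinx installed), which trade some accuracy for privacy and independence from the network.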

Challenges in Automatic Speech Recognition

Despite significant advancements, ASR systems face numerous challenges that can affect their accuracy and performance:

  • Variability in Speech: Differences in accents, dialects, and individual speaker characteristics can make it difficult for ASR systems to accurately recognize words.
  • Background Noise: Noisy environments can interfere with the system's ability to capture clear audio, leading to transcription errors.
  • Homophones and Context: Words that sound the same but have different meanings ("there" versus "their", for instance) are hard for ASR systems to tell apart without understanding the surrounding context; a toy example of this disambiguation follows the list below.
  • Continuous Speech: Unlike written text, spoken language does not have clear boundaries between words, making it challenging to segment speech accurately.
  • Colloquialisms and Slang: Everyday speech often includes informal language and slang, which may not be present in the training data used for ASR models.
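
To make the homophone problem concrete, the toy sketch below shows how a language model can break a tie between words that the acoustic model scores identically. The bigram probabilities here are invented for illustration; real systems estimate such statistics from large text corpora.

    # Toy bigram probabilities P(word | previous word); the numbers are
    # made up for illustration, not estimated from a real corpus.
    BIGRAM_PROBS = {
        ("over", "there"): 0.012,
        ("over", "their"): 0.0001,
        ("in", "their"): 0.009,
        ("in", "there"): 0.0005,
    }

    def pick_homophone(previous_word, candidates):
        # Acoustically the candidates tie, so let context decide:
        # choose the word most likely to follow the previous one.
        return max(candidates, key=lambda w: BIGRAM_PROBS.get((previous_word, w), 0.0))

    print(pick_homophone("over", ["there", "their"]))  # -> there
    print(pick_homophone("in", ["there", "their"]))    # -> their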

Applications of Automatic Speech Recognition

ASR technology has a wide range of applications across various industries:

  • Virtual Assistants: Devices like smartphones and smart speakers use ASR to enable voice commands and provide user assistance.
  • Accessibility: ASR helps individuals with disabilities by enabling voice control over devices and converting speech to text for those who are deaf or hard of hearing.
  • Transcription Services: ASR is used to automatically transcribe meetings, lectures, and interviews, saving time and effort in documentation.
  • Customer Service: Call centers use ASR to route calls and handle inquiries through interactive voice response systems.
  • Healthcare: ASR enables hands-free documentation for medical professionals, allowing them to dictate notes and records.

The Future of Automatic Speech Recognition

The future of ASR is promising, with ongoing research focused on improving accuracy, reducing latency, and understanding natural language more effectively. As machine learning algorithms become more sophisticated, we can expect ASR systems to become more reliable and integrated into an even broader array of applications, making human-computer interaction more seamless and natural.

Automatic Speech Recognition technology has revolutionized the way we interact with machines, making it possible to communicate with computers using our most natural form of communication: speech. While challenges remain, the continuous improvements in ASR systems are opening up new possibilities for innovation and convenience in our daily lives.
