
The Power of the Human Voice

Posted on August 8, 2014 by the Editor

It takes the human voice to infuse words with shades of deeper meaning. The role of the human voice in giving words deeper meaning becomes clear when one considers the denotative and connotative meanings of expressions. For example, a person can utter the words: "I am thirsty." The surface, or denotative, meaning is that the person needs some water. However, depending on the context of the utterance, the reason for it, and the role and position of the speaker, the same words could carry the deeper, connotative meaning: "Give me some water now!" In that case, "I am thirsty" would galvanise the person receiving the order to fetch water as quickly as humanly possible.

The human voice is able to infuse words with shades of deeper meaning because the power of speech can unearth the real intentions, mood, character, identity and culture of the speaker. It is easy for a person to write something down and mislead his or her audience, or the entire world. However, once one has an opportunity to physically interact with the speaker and listen to the person's voice, the real emotional, physical and cultural elements of the speaker can be easily picked up and placed in their right perspective. By the same token, actors, educators, editors, politicians, religious leaders, advertisers, insurance agents, singers, writers and inspirational speakers suffuse certain words with their voices to appeal successfully to their audiences.

Verbal communication is unique to humans. Human beings are emotional creatures, and the human voice is thought to convey emotional valence, arousal and intensity. Music is a powerful medium capable of eliciting a broad range of emotions, and the ability to detect emotion in speech and music is an important task in our daily lives. Studies have been conducted to determine why and how music is able to influence its listeners' moods and emotions. Results showed that melodies carried by the human voice were better recognised than any instrumental melodies. The authors suggest that the biological significance of the human voice provides a greater depth of processing and enhanced memory.

Think about a normal day in one's life. How many words does a person speak? How many words do you hear? According to Caleb Lott, in an article also titled "The Power of the Human Voice", while several different figures are floating around, an average human speaks a minimum of 7,000 words every day. The same writer goes on to say that the human voice is a tremendous asset which can be used to make the ordinary extraordinary. For example, the games Thomas Was Alone and Bastion use the human voice in a unique way that dynamically affects players' experiences of the games. This is so because a narrative-focused game is not only a powerful and amazing way to tell a story, but also does so in a way that the visuals alone cannot convey. The writing is amazing, but without the awe-inspiring narration the impact of the writing would be lessened.

The human voice is an amazing tool that can have a profound effect on video games. Using a narrator affects the gameplay and the experience the player remembers after walking away from the game. Think of being held in awe listening to the radio, where the mellifluous voices of one's favourite programme's hosts awaken, mesmerise, excite or soothe one. This boils down to the fact that our visceral reactions to the ways people play form an integral part of our interactions and communication. Annie Tucker Morgan, in "Talk to Me: The Powerful Effects of the Human Voice", says there is a reason why many people's first instinct when they are upset is to call their mother. A mother's love is not only enduring; it is something strong that a person finds echoing instinctively and emotionally. She goes on to explain how a University of Wisconsin-Madison study identified a concrete link between the sound of Mom's voice and the soothing of jangled nerves through the release in the brain of stress-relieving oxytocin, also known as the "love hormone". Researchers say that women prefer deep male voices on the condition that those voices are saying complimentary things, but also that a woman's particular preference for the pitch of a male voice depends on the pitch of her own. Jeffrey Jacob, founder and president of Persuasive Speaking, has highlighted the correlation between people's voices and their professional and personal successes. One study showed that if the other person does not like the sound of your voice, you might have a hard time securing his or her approval.

When we do not verbalise, we write things down. Is writing not something of great magnificence? If so, why can we not make a difference?

The world has never been static, and neither has writing. It is dynamic. It makes the world revel and reveal itself. Out went the traditional writing feather and pen, and in surged the typewriter, then the "wise" computer. Kudos, the world crooned, in celebration of probably one of civilization's most amazing conquests.

However, this does not mean that the pen is down and out. Not at all. Neither does it mean that the pen has ceased to be mightier than the sword. Writing is writing, whether by virtue of the might of the pen or the wizardry of the computer. In verbal communication one can detect the power of the human voice and the mood of the speaker through such elements of speech as intonation, speed, pause, pitch and emphasis. In the written text, register, paragraphing and punctuation (for example, the use of exclamations) can help the reader detect the writer's intentions and emotions.

Different words mean different things to different people. How do writers hold the attention of readers? Through the beauty of words. Story-telling helps us derive entertainment from reading, escape from an onerous or anxious life, and, of course, understand more about the world. Through words writers create plots that are not devoid of suspense and mystery. Watts, in Writing A Novel, says, "A plot is like a knitted sweater: only as good as the stitches. Without the links we have a tangle of wool, chaotic and uninteresting." We get immersed in reading because of the power of causality, the power of words. Words play a crucial role in creating a work of art like a novel. Watts also says a good answer to a narrative question is as satisfying as scratching an itch.

Through writing we find courage, ammunition and inspiration to go on in spite of all the odds; we find vision to define and refine our identities and destinies. Yes, through writing we find ourselves, our voice and verve.

J.D. Salinger came up with an interesting observation. He said, "What really knocks me out is a book that, when you're all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn't happen much, though." Are you not ready to knock many a reader out? Are you not ready to unleash your greatness? How many writers are sitting on their works of art?

Writers and words are good bedfellows. Pass that word. Maya Angelou, the famous author of I Know Why the Caged Bird Sings, says, "Words mean more than what is set down on paper. It takes the human voice to infuse them with shades of deeper meaning." A word is a unit of expression intertwined with sight, sound, smell, touch, and body movement. I think it is memorable (and obviously powerful) because it appeals to our physical, emotional and intellectual processes. For language practitioners, this knowledge of the mental schema is crucial.

What is in a word? For me, words illuminate, revel in and reveal the world. Literature is literature because of the words that constitute it. Patrick Rothfuss says, "Words are pale shadows of forgotten names. As names have power, words have power. Words can light fires in the minds of men. Words can wring tears from the hardest hearts." Rudyard Kipling, for his part, claims, "Words are, of course, the most powerful drug used by mankind." I think this is a very interesting observation.

The beauty of literature is in seeking and gaining an insight into the complexity and diversity of life through the analysis of how the human voice infuses words with shades of deeper meaning. For indeed the dynamic human voice can roar, soar and breathe life into different pregnant clouds of words and meanings.

14 comments on "The Power of the Human Voice"

Powerful essay. Indeed, the human voice has the power to articulate emotions, ideas, perceptions, convictions and so much more, and in so doing breathes life into words.

Henry, thank you for your great words of encouragement.

Wonderful! Spoken words externalise how the speaker perceives the world, how the speaker feels inside…

Francisco, thank you for stopping by!

Indeed, what a wonderful piece of literature. It reminds me of my secondary education days back in the early 1980s when I did "Animal Farm" by George Orwell.

Mr. Mlotshwa, thank you for stopping by. Much appreciated.

Speechless! The language in this piece is just amazing. Well done, Mr Ndaba.

Khalaz, thank you!

This is a very nice and awesome essay. Great job! 😀

Musa, many thanks!

Ndaba is a compelling writer. An informative piece.

Claire, thank you. Humbled.

Wow. This is excellent: well written, powerful and informative. You are a great writer. Keep writing.

Tshego, thank you for your kind words!

The power of ‘voice,’ and empowering the voiceless

Many people use their voices every day—to talk to people, to communicate their needs and wants—but the idea of 'voice' goes much deeper. Having a voice gives an individual agency and power, and a way to express his or her beliefs. But what happens when that voice is expressed differently from the norm? What happens when that voice is in some way silenced?

Meryl Alper, assistant professor of communication studies at Northeastern, explored this idea of “voice” in children and young teenagers who used an iPad app that converted symbols to audible words to help them communicate.

While it may seem like the app helped to return voice to those who used it, Alper found that the technology was subject to economic structures and defined through the lens of ableism.

“People with disabilities are not passively given voices by the able-bodied; disabled individuals, rather, are actively taking and making them,” she said.

Her book on the subject, Giving Voice: Mobile Communication, Disability, and Inequality, was recently recognized by the Association of American Publishers' PROSE Awards, which honor "the very best in professional and scholarly publishing."

We often hear about technology giving voice to the voiceless. What does ‘voice’ represent in your research? And what sorts of ‘voices’ are left out of technological advances?

“Giving voice to the voiceless” regularly signifies that the historically underrepresented, disadvantaged, or vulnerable gain opportunities to organize, increase visibility, and express themselves by leveraging the strengths of information, media, and communication technologies. A long list of tools and platforms—including the internet, Facebook, Twitter, community radio, and free and open software—have all been said to “give voice.”

In the book, I critically reflect on how “giving voice to the voiceless” becomes a powerful, and potentially harmful, trope in our society that masks structural inequalities. I do this by considering the separate meanings of “giving,” “voice,” and “the voiceless.” The notion of “the voiceless” suggests a static and clearly defined group. Discussions about “giving” them voice can reinforce and naturalize not “having” a voice, without also questioning the complex dynamics between having and giving, as well as speaking and listening. Additionally, “giving voice” does not challenge the means and methods by which voice may have been obtained, taken, or even stolen in the first place, and how technology and technological infrastructure can and does uphold the status quo.

What were the biggest takeaways from your research?

I studied how non- and minimally-speaking youth with developmental disabilities impacting their speech used voice output communication technologies that take the form of mobile tablets and apps—think of the technology used by the late Stephen Hawking, but simplified on an iPad. The impact of these technologies on the lives of these children and their families was at once positive, negative, and sometimes of little impact at all. We are collectively responsible for how overly simplistic narratives about technology metaphorically and materially “giving voice” to those with disabilities circulate, particularly as social media platforms monetize and incentivize clicks and retweets of stories. These kinds of news and media portrayals are derided among many in the disability community as “inspiration porn.” In economically, politically, and socially uncertain times, certainty in technology as a fix, certainty in disability as something in need of fixing, and the relationship between these certain fixations is something to think very critically about.

We also need to stay vigilant about protecting disability rights and improving disability policy, as well as the policies that acutely impact people with disabilities, such as education, healthcare, and internet access. Having a voice in general, and the role of technology in exploiting that voice, must be understood in relation to other forms of exploitation. People with disabilities are not passively given voices by the able-bodied; individuals with disabilities, rather, are actively taking and making them. Considering all the ways in which our media ecology and political environment are rapidly changing, at stake in these matters is not only which voices get to speak, but who is thought to have agency to speak in the first place.

Giving Voice received an honorable mention from the PROSE Awards. What does this honor mean to you and for your work?

It is a great privilege for my book to be counted among the 2018 honorees and as one of two winners in the Media and Cultural Studies category, as hundreds of exceptional books were published in the discipline in 2017. Media, communication, and cultural studies is a wide and vibrant field, encompassing two different departments at Northeastern alone (communication studies, and media and screen studies). As an assistant professor, it is immensely rewarding and affirming for my work to be considered of a similar caliber to past category winners, including acclaimed senior scholars in my field.

The award also makes a clear statement about the future of the discipline. Giving Voice is broadly about what it means to have a voice in a technologized world and is based on qualitative research among children, families, and people with disabilities. Those populations, and their concerns, are more often than not treated as niche or specialty within the academy. Qualitative research is also regularly undervalued compared to quantitative research. The honor motivates me to keep following my instincts, centering marginalized groups in empirical and theoretical work on technology and society, and posing research questions that excite me.

Don’t Underestimate the Power of Your Voice

  • Dan Bullock
  • Raúl Sánchez

It’s not just what you say, it’s how you say it.

Our voices matter as much as our words. They have the power to awaken the senses and lead others to act, close deals, or land us successful job interviews. Through our voices, we create nuances of meaning, convey our emotions, and find the secret to communicating our executive presence. So, how do we train our voices to be more visceral and effective, and to command attention?

  • The key lies in harnessing our voices using the principles of vocalics. Vocalics primarily consists of three linguistic elements: stress (volume), intonation (rising and falling tone), and rhythm (pacing). By combining vocalics with public speaking skills, we can color our words with the meaning and emotion that motivate others to act. (A rough acoustic sketch of these three elements follows this list.)
  • Crank up your volume: No, we don’t mean shout. The effective use of volume goes beyond trying to be the loudest person in the room. To direct the flow of any conversation, you must overtly stress what linguists call focus words. When you intentionally place volume on certain words, you emphasize parts of a message and shift the direction of a conversation toward your preferred outcome.
  • Use a powerful speech style: The key to achieving a powerful speech style, particularly during job interviews and hiring decisions, is to first concentrate on the “melody” of your voice, also called intonation. This rise or fall of our voice conveys grammatical meaning (questions or statements) or even attitude (surprise, joy, sarcasm).
  • Calibrate your vocal rhythm with the right melody: Our messages are perceived differently depending on the way we use rhythm in our voices. Deliberately varying our pacing with compelling pauses creates “voiced” punctuation, a powerful way to hold the pulse of the moment.
  • Dan Bullock is a language and communications specialist/trainer at the United Nations Secretariat, training diplomats and global UN staff. Dan is the co-author of How to Communicate Effectively with Anyone, Anywhere (Career Press, 2021). He also serves as faculty teaching business communication, linguistics, and public relations within the Division of Programs in Business at New York University's School of Professional Studies. Dan was the director of corporate communications at a leading NYC public relations firm, and his corporate clients have included TD Bank and Pfizer.
  • Raúl Sánchez is an award-winning clinical assistant professor and the corporate program coordinator at New York University's School of Professional Studies. Raúl is the co-author of How to Communicate Effectively with Anyone, Anywhere (Career Press, 2021). He has designed and delivered corporate trainings for Deloitte and the United Nations, and has been a writing consultant for Barnes & Noble Press and PBS. Raúl was awarded the NYU School of Professional Studies Teaching Excellence Award and specializes in linguistics and business communication.
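To make these three elements concrete, here is a minimal sketch of how one might estimate them from a recording. It is an illustration under stated assumptions, not a validated vocalics analysis: it assumes a mono file named speech.wav, uses the librosa audio library, and treats runs of unvoiced frames as a crude proxy for pauses.

```python
# Minimal sketch: rough acoustic proxies for the three vocalic elements.
# Assumes a mono recording "speech.wav" (hypothetical file name).
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=None, mono=True)

# Stress (volume): short-time RMS energy, expressed in relative dB.
rms = librosa.feature.rms(y=y)[0]
rms_db = 20 * np.log10(rms + 1e-10)

# Intonation (rising and falling tone): the F0 contour.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Rhythm (pacing): fraction of voiced frames and a crude pause count
# (each voiced-to-unvoiced transition is treated as one pause).
voiced = voiced_flag.astype(int)
voiced_fraction = voiced.mean()
pauses = int(np.sum(np.diff(voiced) == -1))

print(f"mean level: {rms_db.mean():.1f} dB (relative)")
print(f"median F0: {np.nanmedian(f0):.1f} Hz")
print(f"voiced fraction: {voiced_fraction:.2f}, pauses: {pauses}")
```

Plotting the F0 contour over time makes the "melody" of a sentence visible: questions tend to end with a rise, statements with a fall.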

The Power of Using Your Voice

A voice is a tool that transports us into the future. A future that has more possibilities and more solutions. A voice is a tool that can be used for standing up for what is right, rather than what is easy. A voice gives your opinions a platform, and gifts you with the opportunity to have perspective and knowledge on things that matter. No two voices are the same, each voice has something different to say. And in a world that needs to represent freedom and democracy, a voice is a powerful symbol of this. It is what has allowed people to protest injustice, to sing for freedom, or simply speak the truth. A voice can be a source of hope in difficult times.

Using your voice for the truth is important to create a better world. Everyone's voice matters. It is important not to let yourself be silenced, because when a voice goes unused, the opportunity for a true democracy, where each voice is valued in a peaceful manner, is lost. Voices convey passion and excitement; voices can convey anything, whether it's a feeling, a place, or an idea. In a way, voices are a superpower, if you know how to use them.

Voices can be used to create change. People can take anything material from you, but your voice is one of the things that cannot be taken away. Voices are meant to encourage other voices too, to unite and support each other.  One of the most powerful things someone can do is to use their voice. 

June 19, 2017

Human Voices Are Unique but We're Not That Good at Recognizing Them

People are good at picking out the voices of familiar speakers, but ear-witness testimony identifying strangers' voices is notoriously unreliable and inaccurate

By Carolyn McGettigan, Nadine Lavan & The Conversation Global

The following essay is reprinted with permission from The Conversation, an online publication covering the latest research.

“Alexa, who am I?” Amazon Echo's voice-controlled virtual assistant, Alexa, doesn't have an answer to that—yet. However, for other applications of speech technology, computer algorithms are increasingly able to discriminate, recognise and identify individuals from voice recordings.

Of course, these algorithms are far from perfect, as was recently shown when a BBC journalist broke into his own voice-controlled bank account using his twin brother's voice. Is this a case of computers just failing at something humans can do perfectly? We decided to find out.

Each human being has a voice that is distinct and different from everyone else's. So it seems intuitive that we'd be able to identify someone from their voice fairly easily. But how well can you actually do this? When it comes to recognising your closest family and friends, you're probably quite good. But would you be able to recognise the voice of your first primary school teacher if you heard them again today? How about the guy on the train this morning who was shouting into his phone? What if you had to pick him out, not from his talking voice, but from samples of his laughter, or singing?

To date, research has only explored voice identity perception using a limited set of vocalisations, for example sentences that have been read aloud or snippets of conversational speech. These studies have found that we can actually recognise the voices of familiar people quite well. But they have also shown that there are problems: ear-witness testimonies are notoriously unreliable and inaccurate.

It's important to keep in mind that these studies have not captured much of the flexibility of the sounds we can make with our voices. This is bound to have an effect on how we process the identity of the person behind the voice we are listening to. Therefore, we are currently missing a very large and important piece of the puzzle.

Recognising voices requires two broad processes to operate together: we need to distinguish between the voices of different people ("telling people apart") and we need to be able to attribute a single identity to all the different sounds (talking, laughing, shouting) that can come from the same person ("telling people together"). We set out to investigate the limits of these abilities in humans.

Voice experiment

Our recent study, published in the Journal of Experimental Psychology: General, confirms that voice identity perception can be extremely challenging. Capitalising on how variable a single person's voice can be, we presented 46 listeners with laughter and vowels produced by five people. Listeners were asked to make a very simple judgement about pairs of sounds: were they made by the same person, or by two different people? As long as they could compare vowels to vowels or laughter to laughter respectively, discriminating between speakers was relatively successful.

But when we asked our listeners to make this judgement based on a mixed pair of sounds, such as directly comparing vowels to laughter in a pair, they couldn't discriminate between speakers at all—especially if they were not familiar with the speaker. However, even though a sub-group of people who knew the speakers performed better overall, they still struggled significantly with the challenge of "telling people together".

Similar effects have been reported by studies showing, for example, that it is difficult to recognise a bilingual speaker across their two languages. What's surprising about these findings is how bad voice perception can be once listeners are exposed to natural variation in the sounds that a voice can produce. So, it's intriguing to consider that while we each have a unique voice, we don't yet know how useful that uniqueness is.

But why have we evolved to have unique voices if we can't even recognise them? That's really an open question so far. We don't actually know whether we have evolved to have unique voices—we also all have different and largely unique fingerprints, but there's no evolutionary advantage to that as far as we can tell. It just so happens that, based on differences in anatomy and, probably most importantly, in how we use our voices, we all sound different to each other.

Luckily computer algorithms are still able to make the most of the individuality of the human voice. They have probably already outdone humans in some cases—and they will keep on improving. The way these machine-learning algorithms recognise speakers is based on mathematical solutions to create “voice prints”—unique representations picking up the specific acoustic features of each individual voice.
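As a toy illustration of the voice-print idea (and only that: real speaker-recognition systems use learned embeddings rather than this simple averaging), one can collapse each utterance into a fixed-length vector of average spectral features and compare vectors by cosine similarity. The file names below are placeholders.

```python
# Toy "voice print": average MFCC features over an utterance, then compare
# utterances by cosine similarity. A sketch of the idea only; production
# systems use learned speaker embeddings instead.
import librosa
import numpy as np

def voice_print(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)  # one fixed-length vector per utterance

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder file names: two utterances by speaker A, one by speaker B.
same = similarity(voice_print("speakerA_1.wav"), voice_print("speakerA_2.wav"))
diff = similarity(voice_print("speakerA_1.wav"), voice_print("speakerB_1.wav"))
print(f"same speaker: {same:.3f}, different speakers: {diff:.3f}")
```

In the spirit of the study above, such a naive signature degrades badly when the two utterances are different kinds of vocalisation, say speech versus laughter, which is exactly the "telling people together" problem.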

In contrast to computers, humans might not know what they are listening out for, or how to separate out these acoustic features. So, the way that voice prints are created for the algorithms is not closely modelled on what human listeners appear to do—we're still working on this. In the long term, it will be interesting to see if there is any overlap in the way human listeners and machine-learning algorithms recognise voices. While human listeners are unlikely to glean any insights from how computers solve this problem, conversely we might be able to build machines that emulate effective aspects of human performance.

It is rumoured that Amazon is currently working on teaching Alexa to identify specific users by their voice. If this works, it will be a truly impressive feat and may put a stop to further unwanted orders of dollhouses. But do be patient if Alexa makes mistakes—you may not be able to do any better yourself.

This article was originally published on The Conversation. Read the original article.

Mechanics of human voice production and control

As the primary means of communication, voice plays an important role in daily life. Voice also conveys personal information such as social status, personal traits, and the emotional state of the speaker. Mechanically, voice production involves complex fluid-structure interaction within the glottis and its control by laryngeal muscle activation. An important goal of voice research is to establish a causal theory linking voice physiology and biomechanics to how speakers use and control voice to communicate meaning and personal information. Establishing such a causal theory has important implications for clinical voice management, voice training, and many speech technology applications. This paper provides a review of voice physiology and biomechanics, the physics of vocal fold vibration and sound production, and laryngeal muscular control of the fundamental frequency of voice, vocal intensity, and voice quality. Current efforts to develop mechanical and computational models of voice production are also critically reviewed. Finally, issues and future challenges in developing a causal theory of voice production and perception are discussed.

I. INTRODUCTION

In the broad sense, voice refers to the sound we produce to communicate meaning, ideas, opinions, etc. In the narrow sense, voice, as in this review, refers to sounds produced by vocal fold vibration, or voiced sounds. This is in contrast to unvoiced sounds which are produced without vocal fold vibration, e.g., fricatives which are produced by airflow through constrictions in the vocal tract, plosives produced by sudden release of a complete closure of the vocal tract, or other sound producing mechanisms such as whispering. For voiced sound production, vocal fold vibration modulates airflow through the glottis and produces sound (the voice source), which propagates through the vocal tract and is selectively amplified or attenuated at different frequencies. This selective modification of the voice source spectrum produces perceptible contrasts, which are used to convey different linguistic sounds and meaning. Although this selective modification is an important component of voice production, this review focuses on the voice source and its control within the larynx.
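As a minimal numerical sketch of this arrangement (not a model from the voice literature reviewed here), the code below passes a crude periodic source, an impulse train standing in for glottal pulses, through two resonators placed at rough formant frequencies of the vowel /a/. All numeric values are illustrative.

```python
# Source-filter sketch: impulse-train "glottal source" shaped by two
# second-order resonators at approximate formant frequencies of /a/.
# All numbers are illustrative.
import numpy as np
from scipy.signal import lfilter

sr = 16000                       # sample rate (Hz)
f0 = 120                         # fundamental frequency (Hz)
n = int(0.5 * sr)                # 0.5 s of signal

source = np.zeros(n)             # periodic source: one impulse per cycle
source[:: sr // f0] = 1.0

def resonator(x, freq, bw, sr):
    """Second-order all-pole resonator at center frequency freq (Hz)
    with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / sr)
    theta = 2 * np.pi * freq / sr
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter([1.0 - r], a, x)

# Cascade two formant resonators (~700 Hz and ~1220 Hz for /a/).
vowel = resonator(resonator(source, 700, 130, sr), 1220, 70, sr)
```

Changing only the resonator frequencies while keeping the same source yields a different vowel, which is the filtering role of the vocal tract described above.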

For effective communication of meaning, the voice source, as a carrier for the selective spectral modification by the vocal tract, contains harmonic energy across a large range of frequencies that spans at least the first few acoustic resonances of the vocal tract. In order to be heard over noise, such harmonic energy also has to be reasonably above the noise level within this frequency range, unless a breathy voice quality is desired. The voice source also contains important information about pitch, loudness, prosody, and voice quality, which convey meaning (see Kreiman and Sidtis, 2011, Chap. 8 for a review), biological information (e.g., size), and paralinguistic information (e.g., the speaker's social status, personal traits, and emotional state; Sundberg, 1987; Kreiman and Sidtis, 2011). For example, the same vowel may sound different when spoken by different people. Sometimes a simple "hello" is all it takes to recognize a familiar voice on the phone. People tend to use different voices with different speakers on different occasions, and it is often possible to tell if someone is happy or sad from the tone of their voice.

One of the important goals of voice research is to understand how the vocal system produces voice of different source characteristics and how people associate percepts with these characteristics. Establishing a cause-effect relationship between voice physiology and voice acoustics and perception will allow us to answer two essential questions in voice science and effective clinical care (Kreiman et al., 2014): when the output voice changes, what physiological alteration caused this change; and if a change to voice physiology occurs, what change in perceived voice quality can be expected? Clinically, such knowledge would lead to the development of a physically based theory of voice production that is capable of better predicting voice outcomes of clinical management of voice disorders, thus improving both diagnosis and treatment. More generally, an understanding of this relationship could lead to a better understanding of the laryngeal adjustments that we use to change voice quality, adopt different speaking or singing styles, or convey personal information such as social status and emotion. Such understanding may also lead to the development of improved computer programs for synthesis of natural-sounding, speaker-specific speech of varying emotional percepts.

Understanding such a cause-effect relationship between voice physiology and production necessarily requires a multi-disciplinary effort. While voice production results from a complex fluid-structure-acoustic interaction process, which in turn depends on the geometry and material properties of the lungs, larynx, and vocal tract, the end interest of voice is its acoustics and perception. Changes in voice physiology or physics that cannot be heard are not that interesting. On the other hand, the physiology and physics may impose constraints on the co-variations among fundamental frequency (F0), vocal intensity, and voice quality, and thus on the way we use and control our voice. Thus, understanding voice production and voice control requires an integrated approach, in which physiology, vocal fold vibration, and acoustics are considered as a whole instead of as disconnected components. Traditionally, the multi-disciplinary nature of voice production has led to a clear divide between research activities in voice production, voice perception, and their clinical or speech applications, with few studies attempting to link them together. Although much advancement has been made in understanding the physics of phonation, some misconceptions still exist in textbooks in otolaryngology and speech pathology. For example, the Bernoulli effect, which has been shown to play a minor role in phonation, is still considered an important factor in initiating and sustaining phonation in many textbooks and reviews. Tension and stiffness are often used interchangeably despite having different physical meanings. The role of the thyroarytenoid muscle in regulating medial compression of the membranous vocal folds is often understated. On the other hand, research on voice production often focuses on the glottal flow and vocal fold vibration, but could benefit from a broader consideration of the acoustics of the produced voice and their implications for voice communication.

This paper provides a review of our current understanding of the cause-effect relation between voice physiology, voice production, and voice perception, with the hope that it will help better bridge research efforts in different aspects of voice studies. An overview of vocal fold physiology is presented in Sec. II, with an emphasis on laryngeal regulation of the geometry, mechanical properties, and position of the vocal folds. The physical mechanisms of self-sustained vocal fold vibration and sound generation are discussed in Sec. III, with a focus on the roles of various physical components and features in initiating phonation and affecting the produced acoustics. Some misconceptions about the physics of voice production are also clarified. Section IV discusses the physiologic control of F0, vocal intensity, and voice quality. Section V reviews past and current efforts in developing mechanical and computational models of voice production. Issues and future challenges in establishing a causal theory of voice production and perception are discussed in Sec. VI.

II. VOCAL FOLD PHYSIOLOGY AND BIOMECHANICS

A. Vocal fold anatomy and biomechanics

The human vocal system includes the lungs and the lower airway that function to supply air pressure and airflow (a review of the mechanics of the subglottal system can be found in Hixon, 1987), the vocal folds whose vibration modulates the airflow and produces the voice source, and the vocal tract that modifies the voice source and thus creates specific output sounds. The vocal folds are located in the larynx and form a constriction to the airway [Fig. 1(a)]. Each vocal fold is about 11–15 mm long in adult women and 17–21 mm in men, and stretches across the larynx along the anterior-posterior direction, attaching anteriorly to the thyroid cartilage and posteriorly to the anterolateral surface of the arytenoid cartilages [Fig. 1(c)]. Both the arytenoid [Fig. 1(d)] and thyroid [Fig. 1(e)] cartilages sit on top of the cricoid cartilage and interact with it through the cricoarytenoid joint and cricothyroid joint, respectively. The relative movement of these cartilages thus provides a means to adjust the geometry, mechanical properties, and position of the vocal folds, as further discussed below. The three-dimensional airspace between the two opposing vocal folds is the glottis. The glottis can be divided into a membranous portion, which includes the anterior portion of the glottis and extends from the anterior commissure to the vocal process of the arytenoid, and a cartilaginous portion, which is the posterior space between the arytenoid cartilages.

Fig. 1. (Color online) (a) Coronal view of the vocal folds and the airway; (b) histological structure of the vocal fold lamina propria in the coronal plane (image provided by Dr. Jennifer Long of UCLA); (c) superior view of the vocal folds, cartilaginous framework, and laryngeal muscles; (d) medial view of the cricoarytenoid joint formed between the arytenoid and cricoid cartilages; (e) posterolateral view of the cricothyroid joint formed by the thyroid and the cricoid cartilages. The arrows in (d) and (e) indicate directions of possible motions of the arytenoid and cricoid cartilages due to LCA and CT muscle activation, respectively.

The vocal folds are layered structures, consisting of an inner muscular layer (the thyroarytenoid muscle) with muscle fibers aligned primarily along the anterior-posterior direction, a soft tissue layer of the lamina propria, and an outermost epithelium layer [Figs. 1(a) and 1(b)]. The thyroarytenoid (TA) muscle is sometimes divided into a medial and a lateral bundle, with each bundle responsible for a certain vocal fold posturing function. However, such functional division is still a topic of debate (Zemlin, 1997). The lamina propria consists of the extracellular matrix (ECM) and interstitial substances. The two primary ECM proteins are the collagen and elastin fibers, which are aligned mostly along the length of the vocal folds in the anterior-posterior direction (Gray et al., 2000). Based on the density of the collagen and elastin fibers [Fig. 1(b)], the lamina propria can be divided into a superficial layer with limited and loose elastin and collagen fibers, an intermediate layer of dominantly elastin fibers, and a deep layer of mostly dense collagen fibers (Hirano and Kakita, 1985; Kutty and Webb, 2009). In comparison, the lamina propria (about 1 mm thick) is much thinner than the TA muscle.

Conceptually, the vocal fold is often simplified into a two-layer body-cover structure (Hirano, 1974; Hirano and Kakita, 1985). The body layer includes the muscular layer and the deep layer of the lamina propria, and the cover layer includes the intermediate and superficial lamina propria and the epithelium layer. This body-cover concept of vocal fold structure will be adopted in the discussions below. Another grouping scheme divides the vocal fold into three layers: in addition to a body and a cover layer, the intermediate and deep layers of the lamina propria are grouped into a vocal ligament layer (Hirano, 1975). It is hypothesized that this layered structure plays a functional role in phonation, with different combinations of mechanical properties in different layers leading to production of different voice source characteristics (Hirano, 1974). However, because of a lack of data on the mechanical properties of each vocal fold layer and how they vary under different conditions of laryngeal muscle activation, a definite understanding of the functional roles of each vocal fold layer is still missing.

The mechanical properties of the vocal folds have been quantified using various methods, including tensile tests (Hirano and Kakita, 1985; Zhang et al., 2006b; Kelleher et al., 2013a), shear rheometry (Chan and Titze, 1999; Chan and Rodriguez, 2008; Miri et al., 2012), indentation (Haji et al., 1992a, b; Tran et al., 1993; Chhetri et al., 2011), and a surface wave method (Kazemirad et al., 2014). These studies showed that the vocal folds exhibit a nonlinear, anisotropic, viscoelastic behavior. A typical stress-strain curve of the vocal folds under anterior-posterior tensile testing is shown in Fig. 2. The slope of the curve, or stiffness, quantifies the extent to which the vocal folds resist deformation in response to an applied force. In general, after an initial linear range, the slope of the stress-strain curve (stiffness) increases gradually with further increase in the strain (Fig. 2), presumably due to the gradual engagement of the collagen fibers. Such nonlinear mechanical behavior provides a means to regulate vocal fold stiffness and tension through vocal fold elongation or shortening, which plays an important role in the control of the F0 or pitch of voice production. Typically, the stress is higher during loading than unloading, indicating a viscous behavior of the vocal folds. Due to the presence of the AP-aligned collagen, elastin, and muscle fibers, the vocal folds also exhibit anisotropic mechanical properties, stiffer along the AP direction than in the transverse plane. Experiments (Hirano and Kakita, 1985; Alipour and Vigmostad, 2012; Miri et al., 2012; Kelleher et al., 2013a) showed that the Young's modulus along the AP direction in the cover layer is more than 10 times (as high as 80 times in Kelleher et al., 2013a) larger than in the transverse plane. Stiffness anisotropy has been shown to facilitate medial-lateral motion of the vocal folds (Zhang, 2014) and complete glottal closure during phonation (Xuan and Zhang, 2014).

Fig. 2. Typical tensile stress-strain curve of the vocal fold along the anterior-posterior direction during loading and unloading at 1 Hz. The slope of the tangent line (dashed lines) to the stress-strain curve quantifies the tangent stiffness. The stress is typically higher during loading than unloading due to the viscous behavior of the vocal folds. The curve was obtained by averaging data over 30 cycles after a 10-cycle preconditioning.
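The strain-stiffening behavior in Fig. 2 is often summarized with an exponential stress-strain law. The sketch below uses the generic form sigma(eps) = A*(exp(B*eps) - 1), whose tangent stiffness A*B*exp(B*eps) grows with strain; the coefficients are arbitrary illustrative values, not measured vocal fold data.

```python
# Strain-stiffening sketch: generic exponential soft-tissue model.
#   stress(eps)            = A * (exp(B * eps) - 1)
#   tangent stiffness(eps) = A * B * exp(B * eps)   (slope of the curve)
# A and B are arbitrary illustrative values, not measured vocal fold data.
import numpy as np

A, B = 2.0, 6.5                        # kPa, dimensionless
strain = np.linspace(0.0, 0.4, 5)      # elongation strain

stress = A * (np.exp(B * strain) - 1.0)            # kPa
tangent_stiffness = A * B * np.exp(B * strain)     # kPa per unit strain

for e, s, k in zip(strain, stress, tangent_stiffness):
    print(f"strain {e:.2f}: stress {s:7.2f} kPa, stiffness {k:7.2f} kPa")
```

The output shows the slope rising with strain, which is why elongating the vocal folds (e.g., through CT muscle activation, as discussed below) simultaneously raises both tension and stiffness.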

Accurate measurement of vocal fold mechanical properties at typical phonation conditions is challenging, due to both the small size of the vocal folds and the relatively high frequency of phonation. Although tensile tests and shear rheometry allow direct measurement of material moduli, the small sample size often leads to difficulties in mounting tissue samples to the testing equipment, thus creating concerns about accuracy. These two methods also require dissecting tissue samples from the vocal folds and the laryngeal framework, making in vivo measurement impossible. The indentation method is ideal for in vivo measurement and, because of the small size of the indenters used, allows characterization of the spatial variation of the mechanical properties of the vocal folds. However, it is limited to measurement of mechanical properties at conditions of small deformation. Although large indentation depths can be used, data interpretation becomes difficult, and thus the method is not suitable for assessment of the nonlinear mechanical properties of the vocal folds.

There has been some recent work toward understanding the contribution of individual ECM components to the macro-mechanical properties of the vocal folds and developing a structurally based constitutive model of the vocal folds (e.g., Chan et al., 2001; Kelleher et al., 2013b; Miri et al., 2013). The contribution of interstitial fluid to the viscoelastic properties of the vocal folds and to vocal fold stress during vocal fold vibration and collision has also been investigated using a biphasic model of the vocal folds, in which the vocal fold was modeled as a solid phase interacting with an interstitial fluid phase (Zhang et al., 2008; Tao et al., 2009; Tao et al., 2010; Bhattacharya and Siegmund, 2013). This structurally based approach has the potential to predict vocal fold mechanical properties from the distribution of collagen and elastin fibers and interstitial fluids, which may provide new insights into the differential mechanical properties between different vocal fold layers at different physiologic conditions.

B. Vocal fold posturing

Voice communication requires fine control and adjustment of pitch, loudness, and voice quality. Physiologically, such adjustments are made through laryngeal muscle activation, which stiffens, deforms, or repositions the vocal folds, thus controlling the geometry and mechanical properties of the vocal folds and glottal configuration.

One important posturing action is adduction/abduction of the vocal folds, which is primarily achieved through motion of the arytenoid cartilages. Anatomical analysis and numerical simulations have shown that the cricoarytenoid joint allows the arytenoid cartilages to slide along and rotate about the long axis of the cricoid cartilage, but constrains arytenoid rotation about the short axis of the cricoid cartilage (Selbie et al., 1998; Hunter et al., 2004; Yin and Zhang, 2014). Activation of the lateral cricoarytenoid (LCA) muscles, which attach anteriorly to the cricoid cartilage and posteriorly to the arytenoid cartilages, induces mainly an inward rotation of the arytenoid about the cricoid cartilage in the coronal plane, and moves the posterior portion of the vocal folds toward the glottal midline. Activation of the interarytenoid (IA) muscles, which connect the posterior surfaces of the two arytenoids, slides and approximates the arytenoid cartilages [Fig. 1(c)], thus closing the cartilaginous glottis. Because both muscles act on the posterior portion of the vocal folds, combined action of the two muscles is able to completely close the posterior portion of the glottis, but is less effective in closing the mid-membranous glottis (Fig. 3; Choi et al., 1993; Chhetri et al., 2012; Yin and Zhang, 2014). Because of this inefficiency in mid-membranous approximation, LCA/IA muscle activation is unable to produce medial compression between the two vocal folds in the membranous portion, contrary to current understanding (Klatt and Klatt, 1990; Hixon et al., 2008). Complete closure and medial compression of the mid-membranous glottis require the activation of the TA muscle (Choi et al., 1993; Chhetri et al., 2012). The TA muscle forms the bulk of the vocal folds and stretches from the thyroid prominence to the anterolateral surface of the arytenoid cartilages (Fig. 1). Activation of the TA muscle produces a whole-body rotation of the vocal folds in the horizontal plane, about the point of its anterior attachment to the thyroid cartilage, toward the glottal midline (Yin and Zhang, 2014). This rotational motion is able to completely close the membranous glottis but often leaves a gap posteriorly (Fig. 3). Complete closure of both the membranous and cartilaginous glottis thus requires combined activation of the LCA/IA and TA muscles. The posterior cricoarytenoid (PCA) muscles are primarily responsible for opening the glottis but may also play a role in voice production at very high pitches, as discussed below.

Fig. 3. Activation of the LCA/IA muscles completely closes the posterior glottis but leaves a small gap in the membranous glottis, whereas TA activation completely closes the anterior glottis but leaves a gap at the posterior glottis. From unpublished stroboscopic recordings from the in vivo canine larynx experiments in Choi et al. (1993).

Vocal fold tension is regulated by elongating or shortening the vocal folds. Because of the nonlinear material properties of the vocal folds, changing vocal fold length also leads to changes in vocal fold stiffness, which would otherwise stay constant for linear materials. The two laryngeal muscles involved in regulating vocal fold length are the cricothyroid (CT) muscle and the TA muscle. The CT muscle consists of two bundles. The vertically oriented bundle, the pars recta, connects the anterior surface of the cricoid cartilage and the lower border of the thyroid lamina. Its contraction approximates the thyroid and cricoid cartilages anteriorly through a rotation about the cricothyroid joint. The other bundle, the pars oblique, is oriented upward and backward, connecting the anterior surface of the cricoid cartilage to the inferior cornu of the thyroid cartilage. Its contraction displaces the cricoid and arytenoid cartilages backwards (Stone and Nuttall, 1974), although the thyroid cartilage may also move forward slightly. Contraction of both bundles thus elongates the vocal folds and increases the stiffness and tension in both the body and cover layers of the vocal folds. In contrast, activation of the TA muscle, which forms the body layer of the vocal folds, increases the stiffness and tension in the body layer. Activation of the TA muscle, in addition to an initial effect of mid-membranous vocal fold approximation, also shortens the vocal folds, which decreases both the stiffness and tension in the cover layer (Hirano and Kakita, 1985; Yin and Zhang, 2013). One exception is when the tension in the vocal fold cover is already negative (i.e., the cover is under compression), in which case shortening the vocal folds further through TA activation decreases tension (i.e., increases the compression force) but may increase stiffness in the cover layer. Activation of the LCA/IA muscles generally does not change the vocal fold length much and thus has only a slight effect on vocal fold stiffness and tension (Chhetri et al., 2009; Yin and Zhang, 2014). However, activation of the LCA/IA muscles (and also the PCA muscles) does stabilize the arytenoid cartilage and prevent it from moving forward when the cricoid cartilage is pulled backward by CT muscle activation, thus facilitating extreme vocal fold elongation, particularly for high-pitch voice production. As noted above, due to the lack of reliable measurement methods, our understanding of how vocal fold stiffness and tension vary under different muscular activation conditions is limited.

Activation of the CT and TA muscles also changes the medial surface shape of the vocal folds and the glottal channel geometry. Specifically, TA muscle activation causes the inferior part of the medial surface to bulge out toward the glottal midline (Hirano and Kakita, 1985; Hirano, 1988; Vahabzadeh-Hagh et al., 2016), thus increasing the vertical thickness of the medial surface. In contrast, CT activation reduces this vertical thickness of the medial surface. Although many studies have investigated the effect of the prephonatory glottal shape (convergent, straight, or divergent) on phonation (Titze, 1988a; Titze et al., 1995), a recent study showed that the glottal channel geometry remains largely straight under most conditions of laryngeal muscle activation (Vahabzadeh-Hagh et al., 2016).

III. PHYSICS OF VOICE PRODUCTION

A. Sound sources of voice production

The phonation process starts with the adduction of the vocal folds, which approximates the vocal folds to reduce or close the glottis. Contraction of the lungs initiates airflow and establishes pressure buildup below the glottis. When the subglottal pressure exceeds a certain threshold pressure, the vocal folds are excited into self-sustained vibration. Vocal fold vibration in turn modulates the glottal airflow into a pulsating jet flow, which eventually develops into turbulent flow in the vocal tract.

In general, three major sound production mechanisms are involved in this process (McGowan, 1988; Hofmans, 1998; Zhao et al., 2002; Zhang et al., 2002a): a monopole sound source due to the volume of air displaced by vocal fold vibration, a dipole sound source due to the fluctuating force applied by the vocal folds to the airflow, and a quadrupole sound source due to turbulence developed immediately downstream of the glottal exit. When the false vocal folds are tightly adducted, an additional dipole source may arise as the glottal jet impinges onto the false vocal folds (Zhang et al., 2002b). The monopole sound source is generally small, considering that the vocal folds are nearly incompressible and thus the net volume flow displacement is small. The dipole source is generally considered the dominant sound source and is responsible for the harmonic component of the produced sound. The quadrupole sound source is generally much weaker than the dipole source in magnitude, but it is responsible for broadband sound production at high frequencies.

For the harmonic component of the voice source, an equivalent monopole sound source can be defined at a plane just downstream of the region of major sound sources, with the source strength equal to the instantaneous pulsating glottal volume flow rate. In the source-filter theory of phonation (Fant, 1970), this monopole sound source is the input signal to the vocal tract, which acts as a filter and shapes the sound source spectrum into different sounds before they are radiated from the mouth as the voice we hear. Because of radiation from the mouth, the sound source is proportional to the time derivative of the glottal flow. Thus, in the voice literature, the time derivative of the glottal flow, instead of the glottal flow itself, is considered the voice source.

The phonation cycle is often divided into an open phase, in which the glottis opens (the opening phase) and closes (the closing phase), and a closed phase, in which the glottis is closed or remains at a minimum opening area when glottal closure is incomplete. The glottal flow increases and decreases during the open phase, and remains zero during the closed phase (or at a minimum for incomplete glottal closure) (Fig. 4). Compared to the glottal area waveform, the glottal flow waveform reaches its peak at a later time in the cycle, so that the glottal flow waveform is more skewed to the right. This skewing of the glottal flow waveform to the right is due to the acoustic mass in the glottis and the vocal tract (when the F0 is lower than a nearby vocal tract resonance frequency), which causes a delay in the increase in the glottal flow during the opening phase, and a faster decay in the glottal flow during the closing phase (Rothenberg, 1981; Fant, 1982). Because of this waveform skewing to the right, the negative peak of the time derivative of the glottal flow in the closing phase is often much more dominant than the positive peak in the opening phase. The instant of the most negative peak is thus considered the point of main excitation of the vocal tract, and the corresponding negative peak, also referred to as the maximum flow declination rate (MFDR), is a major determinant of the peak amplitude of the produced voice. After the negative peak, the time derivative of the glottal flow waveform returns to zero as phonation enters the closed phase.

Fig. 4. (Color online) Typical glottal flow waveform and its time derivative (left) and their correspondence to the spectral slopes of the low-frequency and high-frequency portions of the voice source spectrum (right).

Much work has been done to directly link features of the glottal flow waveform to voice acoustics and potentially voice quality (e.g., Fant, 1979 , 1982 ; Fant et al. , 1985 ; Gobl and Chasaide, 2010 ). These studies showed that the low-frequency spectral shape (the first few harmonics) of the voice source is primarily determined by the relative duration of the open phase with respect to the oscillation period (To/T in Fig. 4, also referred to as the open quotient). A longer open phase often leads to a more dominant first harmonic (H1) in the low-frequency portion of the resulting voice source spectrum. For a given oscillation period, shortening the open phase causes most of the glottal flow change to occur within a duration (To) that is increasingly shorter than the period T. This leads to an energy boost in the low-frequency portion of the source spectrum that peaks around a frequency of 1/To. For a glottal flow waveform with a very short open phase, the second harmonic (H2) or even the fourth harmonic (H4) may become the most dominant harmonic. A voice source with a weak H1 relative to H2 or H4 is often associated with a pressed voice quality.
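As a rough numerical illustration of the open-quotient effect (a sketch under strong assumptions: a symmetric raised-cosine pulse rather than a realistic glottal pulse, with the source taken as the flow derivative per the source-filter discussion above), one can compute H1-H2 for a few open quotients and watch the sign flip as the open phase shortens:

```python
# Sketch: H1-H2 of the flow-derivative spectrum versus open quotient.
# The raised-cosine pulse shape is an assumption for illustration only.
import numpy as np

def h1_h2_db(open_quotient, n=4096):
    """H1-H2 (dB) for one period of a raised-cosine flow pulse."""
    t = np.arange(n) / n                       # one normalized period
    flow = np.where(t < open_quotient,
                    0.5 * (1.0 - np.cos(2.0 * np.pi * t / open_quotient)),
                    0.0)
    dflow = np.gradient(flow, 1.0 / n)         # voice source ~ dU/dt
    X = np.abs(np.fft.rfft(dflow))             # X[k] = amplitude of harmonic k
    return 20.0 * np.log10(X[1] / X[2])

for oq in (0.8, 0.6, 0.4):
    print(f"open quotient {oq:.1f}: H1-H2 = {h1_h2_db(oq):+.1f} dB")
```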

The spectral slope in the high-frequency range is primarily related to the degree of discontinuity in the time derivative of the glottal flow waveform. Due to the waveform skewing discussed earlier, the most dominant source of discontinuity often occurs around the instant of main excitation, when the time derivative of the glottal flow waveform returns from the negative peak to zero within a time scale of Ta (Fig. 4). For an abrupt glottal flow cutoff ( Ta  = 0), the time derivative of the glottal flow waveform has a strong discontinuity at the point of main excitation, which causes the voice source spectrum to decay asymptotically at a roll-off rate of −6 dB per octave toward high frequencies. Increasing Ta from zero leads to a gradual return from the negative peak to zero. When approximated by an exponential function, this gradual return functions as a low-pass filter with a cutoff frequency around 1/ Ta , reducing the excitation of harmonics above that cutoff frequency. Thus, in the frequency range relevant to voice perception, increasing Ta often leads to reduced excitation of higher-order harmonics. In the extreme case, when there is minimal vocal fold contact, the time derivative of the glottal flow waveform is so smooth that the voice source spectrum has only a few lower-order harmonics. Perceptually, strong excitation of higher-order harmonics is often associated with a bright output sound quality, whereas a voice source with limited excitation of higher-order harmonics is often perceived as weak.
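The low-pass role of Ta can also be sketched numerically; here a first-order low-pass response stands in for the exponential return phase (my assumption, chosen because an exponential decay has exactly this frequency response), showing the extra attenuation added on top of the −6 dB/octave baseline:

```python
# Sketch: extra high-frequency attenuation added by a gradual flow cutoff,
# modeled (assumption) as a first-order low-pass with corner near 1/Ta.
import numpy as np

f0 = 120.0                             # fundamental, Hz (assumed)
harmonics = f0 * np.arange(1, 41)      # first 40 harmonic frequencies

def extra_attenuation_db(f, Ta):
    """Attenuation (dB) of an exponential return with time scale Ta (s)."""
    if Ta == 0.0:
        return np.zeros_like(f)        # abrupt cutoff: no extra attenuation
    return -10.0 * np.log10(1.0 + (2.0 * np.pi * f * Ta) ** 2)

for Ta in (0.0, 0.2e-3, 0.5e-3):       # return times in seconds
    att = extra_attenuation_db(harmonics, Ta)
    # harmonics[24] is the 25th harmonic, at 3 kHz for f0 = 120 Hz
    print(f"Ta = {1000 * Ta:.1f} ms: extra attenuation at 3 kHz = {att[24]:.1f} dB")
```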

Also of perceptual importance is the turbulence noise produced immediately downstream of the glottis. Although small in amplitude, the noise component plays an important role in voice quality perception, particularly for female voices, in which aspiration noise is more persistent than in male voices. While the noise component of voice is often modeled as white noise, its spectrum is often not flat and may exhibit different spectral shapes, depending on the glottal opening and flow rate as well as the vocal tract shape. Interaction between the spectral shape and the relative levels of harmonic and noise energy in the voice source has been shown to influence the perception of voice quality ( Kreiman and Gerratt, 2012 ).

It is worth noting that many of the source parameters are not independent of each other and often co-vary. How they co-vary under different voicing conditions, which is essential to natural speech synthesis, remains the focus of many studies (e.g., Sundberg and Hogset, 2001 ; Gobl and Chasaide, 2003 ; Patel et al. , 2011 ).

B. Mechanisms of self-sustained vocal fold vibration

That vocal fold vibration results from a complex airflow-vocal fold interaction within the glottis rather than repetitive nerve stimulation of the larynx was first recognized by van den Berg (1958) . According to his myoelastic-aerodynamic theory of voice production, phonation starts from complete adduction of the vocal folds to close the glottis, which allows a buildup of the subglottal pressure. The vocal folds remain closed until the subglottal pressure is sufficiently high to push them apart, allowing air to escape and producing a negative (with respect to atmospheric pressure) intraglottal pressure due to the Bernoulli effect. This negative Bernoulli pressure and the elastic recoil pull the vocal folds back and close the glottis. The cycle then repeats, which leads to sustained vibration of the vocal folds.

While the myoelastic-aerodynamic theory correctly identifies the interaction between the vocal folds and airflow as the underlying mechanism of self-sustained vocal fold vibration, it does not explain how energy is transferred from airflow into the vocal folds to sustain this vibration. Traditionally, the negative intraglottal pressure is considered to play an important role in closing the glottis and sustaining vocal fold vibration. However, it is now understood that a negative intraglottal pressure is not a critical requirement for achieving self-sustained vocal fold vibration. Similarly, an alternatingly convergent-divergent glottal channel geometry during phonation has been considered a necessary condition that leads to net energy transfer from airflow into the vocal folds. We will show below that an alternatingly convergent-divergent glottal channel geometry does not always guarantee energy transfer or self-sustained vocal fold vibration.

For flow conditions typical of human phonation, the glottal flow can be reasonably described by Bernoulli's equation up to the point where the airflow separates from the glottal wall, often at the glottal exit, where the airway suddenly expands. According to Bernoulli's equation, the flow pressure p at a location within the glottal channel with a time-varying cross-sectional area A is

p = P_sub − (P_sub − P_sup) (A_sep/A)²,   (1)

where P_sub and P_sup are the subglottal and supraglottal pressures, respectively, and A_sep is the time-varying glottal area at the flow separation location. For simplicity, we assume that the flow separates at the upper margin of the medial surface. To achieve a net energy transfer from airflow to the vocal folds over one cycle, the air pressure on the vocal fold surface has to be at least partially in phase with the vocal fold velocity. Specifically, the intraglottal pressure needs to be higher in the opening phase than in the closing phase of vocal fold vibration, so that the airflow does more work on the vocal folds during opening than the vocal folds do back on the airflow during closing.
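To make Eq. (1) concrete, here is a minimal Python sketch (the 800 Pa subglottal pressure, zero supraglottal pressure, and the areas are illustrative assumptions): for the same lower-margin area, a convergent channel (smaller A_sep/A) sees a higher intraglottal pressure than a uniform or divergent one:

```python
# Sketch of Eq. (1): intraglottal pressure at a section of area A when the
# flow separates at area A_sep. All numbers are illustrative assumptions.
def intraglottal_pressure(A, A_sep, P_sub=800.0, P_sup=0.0):
    """Bernoulli pressure (Pa) inside the glottis per Eq. (1)."""
    return P_sub - (P_sub - P_sup) * (A_sep / A) ** 2

# Same lower-margin area (0.1 cm^2), different separation areas:
print(intraglottal_pressure(A=0.1, A_sep=0.05))  # convergent channel: 600 Pa
print(intraglottal_pressure(A=0.1, A_sep=0.10))  # uniform/divergent: 0 Pa
```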

Theoretical analysis of the energy transfer between airflow and the vocal folds ( Ishizaka and Matsudaira, 1972 ; Titze, 1988a ) showed that this pressure asymmetry can be achieved by a vertical phase difference in vocal fold surface motion (also referred to as a mucosal wave), i.e., different portions of the vocal fold surface do not necessarily move inward and outward together as a whole. This mechanism is illustrated in Fig. 5, the upper left of which shows the vocal fold surface shape in the coronal plane at six consecutive, equally spaced instants during one vibration cycle in the presence of a vertical phase difference. Instants 2 and 3 (solid lines) are in the closing phase, whereas instants 5 and 6 (dashed lines) are in the opening phase. Consider, for example, energy transfer at the lower margin of the medial surface. Because of the vertical phase difference, the glottal channel has a different shape in the opening phase (dashed lines 5 and 6) from that in the closing phase (solid lines 3 and 2) when the lower margin of the medial surface crosses the same locations. In particular, when the lower margin of the medial surface leads the upper margin in phase, the glottal channel during opening (e.g., instant 6) is always more convergent [thus a smaller A_sep/A in Eq. (1) ] or less divergent than during closing (e.g., instant 2) for the same location of the lower margin, resulting in an air pressure [Eq. (1) ] that is higher in the opening phase than in the closing phase (Fig. 5, top row). As a result, net energy is transferred from the airflow into the vocal folds over one cycle, as indicated by the non-zero area enclosed by the aerodynamic force-vocal fold displacement curve in Fig. 5 (top right). The existence of a vertical phase difference in vocal fold surface motion is generally considered the primary mechanism of phonation onset.

Fig. 5. Two energy transfer mechanisms. Top row: the presence of a vertical phase difference leads to different medial surface shapes between glottal opening (dashed lines 5 and 6; upper left panel) and closing (solid lines 2 and 3) when the lower margin of the medial surface crosses the same locations, which leads to higher air pressure during glottal opening than closing and net energy transfer from airflow into the vocal folds at the lower margin of the medial surface. Middle row: without a vertical phase difference, vocal fold vibration produces an alternatingly convergent-divergent but identical glottal channel geometry between glottal opening and closing (bottom left panel), and thus zero energy transfer. Bottom row: without a vertical phase difference, air pressure asymmetry can be imposed by a negative damping mechanism.

In contrast, without a vertical phase difference, the vocal fold surface during opening (Fig. 5, bottom left; dashed lines 5 and 6) and closing (solid lines 3 and 2) would be identical when the lower margin crosses the same positions, for which Bernoulli's equation would predict symmetric flow pressure between the opening and closing phases, and zero net energy transfer over one cycle (Fig. 5, middle row). Under this condition, the pressure asymmetry between the opening and closing phases has to be provided by an external mechanism that directly imposes a phase difference between the intraglottal pressure and vocal fold movement. In the presence of such an external mechanism, the intraglottal pressure is no longer the same between opening and closing even when the glottal channel has the same shape as the vocal folds cross the same locations, resulting in net energy transfer over one cycle from the airflow to the vocal folds (Fig. 5, bottom row). This energy transfer mechanism is often referred to as negative damping, because the intraglottal pressure depends on vocal fold velocity and appears in the system equations of vocal fold motion in a form similar to a damping force, except that energy is transferred to the vocal folds instead of being dissipated. Negative damping is the only energy transfer mechanism in a single degree-of-freedom system, or when the entire medial surface moves in phase as a whole.
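The contrast between the symmetric and asymmetric cases can be reduced to a toy single-point calculation (my own construction for illustration, not a model from the literature): the net work per cycle is the cycle integral of pressure times surface velocity, which vanishes when the pressure depends only on position but becomes positive when part of the pressure is in phase with velocity:

```python
# Toy sketch: net energy transfer over one cycle at a single surface point.
# The pressure laws below are illustrative assumptions, not physical models.
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 10000, endpoint=False)  # one cycle
x = np.sin(t)        # displacement of a surface point
v = np.cos(t)        # its velocity: v > 0 opening, v < 0 closing

p_symmetric = 1.0 - x ** 2              # pressure set by position only: same
                                        # value during opening and closing
p_asymmetric = 1.0 - x ** 2 + 0.3 * v   # extra component in phase with
                                        # velocity (higher while opening)

for label, p in (("symmetric", p_symmetric), ("asymmetric", p_asymmetric)):
    work = np.mean(p * v) * 2.0 * np.pi   # net work per cycle (arb. units)
    print(f"{label} pressure: net energy transfer = {work:+.3f}")
```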

In humans, a negative damping can be provided by an inertive vocal tract ( Flanagan and Landgraf, 1968 ; Ishizaka and Matsudaira, 1972 ; Ishizaka and Flanagan, 1972 ) or a compliant subglottal system ( Zhang et al. , 2006a ). Because the negative damping associated with acoustic loading is significant only for frequencies close to an acoustic resonance, phonation sustained by such negative damping alone always occurs at a frequency close to that acoustic resonance ( Flanagan and Landgraf, 1968 ; Zhang et al. , 2006a ). Although there is no direct evidence of phonation sustained dominantly by acoustic loading in humans, instabilities in voice production (or voice breaks) have been reported when the fundamental frequency of vocal fold vibration approaches one of the vocal tract resonances (e.g., Titze et al. , 2008 ). On the other hand, this entrainment of the phonation frequency to an acoustic resonance limits independent control of the voice source and of the spectral modification by the vocal tract, and is thus less desirable for effective speech communication. Considering that humans are capable of producing a large variety of voice types independent of vocal tract shape, negative damping due to acoustic coupling to the sub- or supra-glottal acoustics is unlikely to be the primary mechanism of energy transfer in voice production. Indeed, excised larynges are able to vibrate without a vocal tract. Moreover, experiments have shown that in humans the vocal folds vibrate at a frequency close to an in vacuo vocal fold resonance ( Kaneko et al. , 1986 ; Ishizaka, 1988 ; Svec et al. , 2000 ) rather than at the acoustic resonances of the sub- and supra-glottal tracts, suggesting that phonation is essentially a resonance phenomenon of the vocal folds.

A negative damping can also be provided by glottal aerodynamics. For example, glottal flow acceleration and deceleration may cause the flow to separate at different locations between opening and closing even when the glottis has identical geometry. This is particularly the case for a divergent glottal channel geometry, which often results in asymmetric flow separation and pressure asymmetry between the glottal opening and closing phases ( Park and Mongeau, 2007 ; Alipour and Scherer, 2004 ). The effect of this negative damping mechanism is expected to be small at phonation onset, at which the vocal fold vibration amplitude, and thus the flow unsteadiness, is small and the glottal channel is less likely to be divergent. However, its contribution to energy transfer may increase with increasing vocal fold vibration amplitude and flow unsteadiness ( Howe and McGowan, 2010 ). It is important to differentiate this asymmetric flow separation between glottal opening and closing due to unsteady flow effects from the quasi-steady asymmetric flow separation caused by asymmetry in the glottal channel geometry between opening and closing. In the latter case, because flow separation may occur at a more upstream location for a divergent glottal channel than for a convergent one, an asymmetric glottal channel geometry (e.g., a glottis that is convergent during opening and divergent during closing) may lead to asymmetric flow separation between glottal opening and closing. Compared to conditions of a fixed flow separation (i.e., the flow separates at the same location during the entire cycle, as in Fig. 5), such geometry-induced asymmetric flow separation actually reduces the pressure asymmetry between glottal opening and closing [this can be shown using Eq. (1) ] and thus weakens net energy transfer. In reality, these two types of asymmetric flow separation mechanisms (due to unsteady effects or to changes in glottal channel geometry) interact and can result in very complex flow separation patterns ( Alipour and Scherer, 2004 ; Sciamarella and Le Quere, 2008 ; Sidlof et al. , 2011 ), which may or may not enhance energy transfer.

From the discussion above it is clear that a negative Bernoulli pressure is not a critical requirement in either of the two mechanisms. Being proportional to vocal fold displacement rather than velocity, the negative Bernoulli pressure is not a negative damping and does not by itself provide the required pressure asymmetry between glottal opening and closing. On the other hand, the existence of a vertical phase difference in vocal fold vibration is determined primarily by vocal fold properties (as discussed below), rather than by whether the intraglottal pressure is positive or negative during a certain phase of the oscillation cycle.

Although a vertical phase difference in vocal fold vibration leads to a time-varying glottal channel geometry, an alternatingly convergent-divergent glottal channel geometry does not guarantee self-sustained vocal fold vibration. For example, although the in-phase vocal fold motion in the bottom left of Fig. 5 (the entire medial surface moves in and out together) leads to an alternatingly convergent-divergent glottal geometry, the glottal geometry is identical between glottal opening and closing, and thus this motion is unable to produce net energy transfer into the vocal folds without a negative damping mechanism (Fig. 5, middle row). In other words, an alternatingly convergent-divergent glottal geometry is an effect, not a cause, of self-sustained vocal fold vibration. Theoretically, the glottis can maintain a convergent or divergent shape during the entire oscillation cycle and still self-oscillate, as observed in experiments using physical vocal fold models that had a divergent shape during most of the oscillation cycle ( Zhang et al. , 2006a ).

C. Eigenmode synchronization and nonlinear dynamics

The above shows that net energy transfer from the airflow into the vocal folds is possible in the presence of a vertical phase difference. But how is this vertical phase difference established, and what determines it and the vocal fold vibration pattern? In voice production, vocal fold vibration with a vertical phase difference results from a process of eigenmode synchronization, in which two or more in vacuo eigenmodes of the vocal folds are synchronized to vibrate at the same frequency but with a phase difference ( Ishizaka and Matsudaira, 1972 ; Ishizaka, 1981 ; Horacek and Svec, 2002 ; Zhang et al. , 2007 ), in the same way as a travelling wave is formed by the superposition of two standing waves. An eigenmode or resonance is a pattern of motion of the system that is allowed by physical laws and the boundary constraints of the system. In general, for each mode, the vibration pattern is such that all parts of the system move either in phase or 180° out of phase, similar to a standing wave. Each eigenmode has an inherently distinct eigenfrequency (or resonance frequency) at which the eigenmode can be maximally excited. An example of eigenmodes often encountered in speech science is formants, which are peaks in the output voice spectra due to excitation of the acoustic resonances of the vocal tract, with the formant frequencies dependent on vocal tract geometry. Figure 6 shows three typical eigenmodes of the vocal fold in the coronal plane. In Fig. 6, the thin line indicates the resting vocal fold surface shape, whereas the solid and dashed lines indicate extreme positions of the vocal fold when vibrating at the corresponding eigenmode, spaced 180° apart in a vibratory cycle. The first eigenmode shows an up-and-down motion in the vertical direction, which does not modulate glottal airflow much. The second eigenmode has a dominantly in-phase medial-lateral motion along the medial surface, which does modulate airflow. The third eigenmode also exhibits dominantly medial-lateral motion, but the upper portion of the medial surface vibrates 180° out of phase with the lower portion. Such out-of-phase motion as in the third eigenmode is essential to achieving vocal fold vibration with a large vertical phase difference, e.g., when synchronized with an in-phase eigenmode as in Fig. 6(b) .
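The standing-wave analogy above can be made explicit with a short sketch (the sinusoidal mode shapes and the 90° phase lag are illustrative assumptions, loosely patterned after the modes in Figs. 6(b) and 6(c)): two synchronized standing-wave modes superpose into a wave travelling up the medial surface, i.e., the lower margin leads the upper margin:

```python
# Sketch: two standing-wave modes at one frequency, 90 degrees apart in
# phase, superpose into a travelling (mucosal-wave-like) pattern.
# The sinusoidal mode shapes are assumptions for illustration only.
import numpy as np

z = np.linspace(0.0, 1.0, 5)   # position along the medial surface (0 = lower margin)
omega = 2.0 * np.pi            # common synchronized frequency (period = 1)

def surface(t):
    in_phase = np.sin(np.pi * z) * np.cos(omega * t)                  # like Fig. 6(b)
    out_of_phase = np.cos(np.pi * z) * np.cos(omega * t + np.pi / 2)  # like Fig. 6(c)
    return in_phase + out_of_phase  # = sin(pi*z - omega*t): travels upward

for t in (0.0, 0.125, 0.25):
    print(f"t = {t:.3f} T:", np.round(surface(t), 2))
```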

Fig. 6. Typical vocal fold eigenmodes exhibiting (a) a dominantly superior-inferior motion, (b) a medial-lateral in-phase motion, and (c) a medial-lateral out-of-phase motion along the medial surface.

In the absence of airflow, the vocal fold in vacuo eigenmodes are generally neutral or damped, meaning that when excited they gradually decay in amplitude with time. When the vocal folds are subject to airflow, however, the vocal fold-airflow coupling modifies the eigenmodes and, in some conditions, synchronizes two eigenmodes to the same frequency (Fig. 7). Although vibration in each eigenmode by itself does not produce net energy transfer (Fig. 5, middle row), when two modes are synchronized at the same frequency but with a phase difference in time, the vibration velocity associated with one eigenmode [e.g., the eigenmode in Fig. 6(b) ] will be at least partially in phase with the pressure induced by the other eigenmode [e.g., the eigenmode in Fig. 6(c) ], and this cross-mode pressure-velocity interaction produces net energy transfer into the vocal folds ( Ishizaka and Matsudaira, 1972 ; Zhang et al. , 2007 ).

Fig. 7. A typical eigenmode synchronization pattern. The evolution of the first three eigenmodes is shown as a function of the subglottal pressure. As the subglottal pressure increases, the frequencies (top) of the second and third vocal fold eigenmodes gradually approach each other and, at a threshold subglottal pressure, synchronize to the same frequency. At the same time, the growth rate (bottom) of the second mode becomes positive, indicating that the coupled airflow-vocal fold system becomes linearly unstable and phonation starts.

The minimum subglottal pressure required to synchronize two eigenmodes and initiate net energy transfer, or the phonation threshold pressure, is proportional to the frequency spacing between the two eigenmodes being synchronized and inversely proportional to the coupling strength between them ( Zhang, 2010 ):

P_th ∝ (ω_0,2 − ω_0,1) / β,   (2)

where ω_0,1 and ω_0,2 are the eigenfrequencies of the two in vacuo eigenmodes participating in the synchronization process and β is the coupling strength between the two eigenmodes. Thus, the closer the two eigenmodes are to each other in frequency, or the more strongly they are coupled, the less pressure is required to synchronize them. This is particularly relevant for an anisotropic material such as the vocal folds, in which the AP stiffness is much larger than the stiffness in the transverse plane. Under such anisotropic stiffness conditions, the first few in vacuo vocal fold eigenfrequencies tend to cluster together and are much closer to each other than under isotropic stiffness conditions ( Titze and Strong, 1975 ; Berry, 2001 ). Such clustering of eigenmodes makes it possible to initiate vocal fold vibration at very low subglottal pressures.
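The scaling in Eq. (2) is easy to sketch (the proportionality constant is set to one here, an assumption, so values and units are arbitrary): clustering the eigenfrequencies or strengthening the coupling lowers the threshold:

```python
# Sketch of the Eq. (2) scaling with the proportionality constant set to 1
# (assumption); numbers are arbitrary and meant only for comparison.
def phonation_threshold(w1, w2, beta):
    """Threshold pressure ~ eigenfrequency spacing / coupling strength."""
    return abs(w2 - w1) / beta

print(phonation_threshold(w1=120.0, w2=160.0, beta=1.0))  # wide spacing: 40.0
print(phonation_threshold(w1=120.0, w2=130.0, beta=1.0))  # clustered modes: 10.0
print(phonation_threshold(w1=120.0, w2=130.0, beta=2.0))  # stronger coupling: 5.0
```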

The coupling strength β between the two eigenmodes in Eq. (2) depends on the prephonatory glottal opening, with the coupling strength increasing with decreasing glottal opening (thus lowered phonation threshold pressure). In addition, the coupling strength also depends on the spatial similarity between the air pressure distribution over the vocal fold surface induced by one eigenmode and vocal fold surface velocity of the other eigenmode ( Zhang, 2010 ). In other words, the coupling strength β quantifies the cross-mode energy transfer efficiency between the eigenmodes that are being synchronized. The higher the degree of cross-mode pressure-velocity similarity, the better the two eigenmodes are coupled, and the less subglottal pressure is required to synchronize them.

In reality, the vocal folds have an infinite number of eigenmodes. Which eigenmodes are synchronized and eventually excited depends on the frequency spacing and relative coupling strength among different eigenmodes. Because vocal fold vibration depends on the eigenmodes that are eventually excited, changes in the eigenmode synchronization pattern often lead to changes in the F0, vocal fold vibration pattern, and the resulting voice quality. Previous studies have shown that a slight change in vocal fold properties such as stiffness or medial surface shape may cause phonation to occur at a different eigenmode, leading to a qualitatively different vocal fold vibration pattern and abrupt changes in F0 ( Tokuda et al. , 2007 ; Zhang, 2009 ). Eigenmode synchronization is not limited to two vocal fold eigenmodes, either. It may also occur between a vocal fold eigenmode and an eigenmode of the subglottal or supraglottal system. In this sense, the negative damping due to subglottal or supraglottal acoustic loading can be viewed as the result of synchronization between one of the vocal fold modes and one of the acoustic resonances.

The eigenmode synchronization discussed above corresponds to a 1:1 temporal synchronization of two eigenmodes. For a certain range of vocal fold conditions, e.g., when asymmetry (left-right or anterior-posterior) exists in the vocal system or when the vocal folds are strongly coupled with the sub- or supra-glottal acoustics, synchronization may occur such that the two eigenmodes are synchronized not toward the same frequency but at a frequency ratio of 1:2, 1:3, etc., leading to subharmonics or biphonation ( Ishizaka and Isshiki, 1976 ; Herzel, 1993 ; Herzel et al. , 1994 ; Neubauer et al. , 2001 ; Berry et al. , 1994 ; Berry et al. , 2006 ; Titze, 2008 ; Lucero et al. , 2015 ). Temporal desynchronization of eigenmodes often leads to irregular or chaotic vocal fold vibration ( Herzel et al. , 1991 ; Berry et al. , 1994 ; Berry et al. , 2006 ; Steinecke and Herzel, 1995 ). A transition between different synchronization patterns, or bifurcation, often leads to a sudden change in the vocal fold vibration pattern and voice quality.

These studies show that the nonlinear interaction between vocal fold eigenmodes is a central feature of the phonation process, with different synchronization or desynchronization patterns producing a large variety of voice types. Thus, by changing the geometrical and biomechanical properties of the vocal folds, either through laryngeal muscle activation or mechanical modification as in phonosurgery, we can select eigenmodes and eigenmode synchronization pattern to control or modify our voice, in the same way as we control speech formants by moving articulators in the vocal tract to modify vocal tract acoustic resonances.

The concept of eigenmode and eigenmode synchronization is also useful for phonation modeling, because eigenmodes can be used as building blocks to construct more complex motion of the system. Often, only the first few eigenmodes are required for adequate reconstruction of complex vocal fold vibrations (both regular and irregular; Herzel et al. , 1994 ; Berry et al. , 1994 ; Berry et al. , 2006 ), which would significantly reduce the degrees of freedom required in computational models of phonation.

D. Biomechanical requirements of glottal closure during phonation

An important feature of normal phonation is the complete closure of the membranous glottis during vibration, which is essential to the production of high-frequency harmonics. Incomplete closure of the membranous glottis, as is often observed in pathological conditions, typically leads to a voice of weak and/or breathy quality.

It is generally assumed that approximation of the vocal folds through arytenoid adduction is sufficient to achieve glottal closure during phonation, with the duration of glottal closure, or the closed quotient, increasing with the degree of vocal fold approximation. While a certain degree of vocal fold approximation is obviously required for glottal closure, there is evidence suggesting that other factors are also at play. For example, excised larynx experiments have shown that some larynges vibrate with incomplete glottal closure even when the arytenoids are tightly sutured together ( Isshiki, 1989 ; Zhang, 2011 ). Similar incomplete glottal closure has also been observed in experiments using physical vocal fold models with isotropic material properties ( Thomson et al. , 2005 ; Zhang et al. , 2006a ). In these experiments, increasing the subglottal pressure increased the vocal fold vibration amplitude but often did not improve the glottal closure pattern ( Xuan and Zhang, 2014 ). These studies show that additional stiffness or geometry conditions are required to achieve complete membranous glottal closure.

Recent studies have started to provide some insight into these additional biomechanical conditions. Xuan and Zhang (2014) showed that embedding fibers along the anterior-posterior direction in otherwise isotropic models improves glottal closure. With an additional thin, stiffer outermost layer simulating the epithelium, these physical models were able to vibrate with a considerably long closed period. Interestingly, this improvement in the glottal closure pattern occurred only when the fibers were embedded at a location close to the vocal fold surface in the cover layer. Embedding fibers in the body layer did not improve the closure pattern at all. This suggests a possible functional role of the collagen and elastin fibers in the intermediate and deep layers of the lamina propria in facilitating glottal closure during vibration.

The difference in the glottal closure pattern between isotropic and anisotropic vocal folds could have several causes. Compared to isotropic vocal folds, anisotropic vocal folds (or fiber-embedded models) are better able to maintain their adductory position against the subglottal pressure and are less likely to be pushed apart by air pressure ( Zhang, 2011 ). In addition, embedding fibers along the AP direction may also enhance the medial-lateral motion, further facilitating glottal closure. Zhang (2014) showed that the first few in vacuo eigenmodes of isotropic vocal folds exhibit similar in-phase, up-and-down swing-like motion, with the medial-lateral and superior-inferior motions locked in a similar phase relationship. Synchronization of modes with similar vibration patterns necessarily leads to qualitatively the same vibration pattern, in this case an up-and-down swing-like motion, with vocal fold vibration dominantly along the superior-inferior direction, as observed in recent physical model experiments ( Thomson et al. , 2005 ; Zhang et al. , 2006a ). In contrast, for vocal folds with an AP stiffness much higher than the transverse stiffness, the first few in vacuo modes exhibit qualitatively distinct vibration patterns, and the medial-lateral and superior-inferior motions are no longer locked in a similar phase relationship. This makes it possible to strongly excite large medial-lateral motion without proportional excitation of superior-inferior motion. As a result, anisotropic models exhibit large medial-lateral motion with a vertical phase difference along the medial surface. The improved capability to maintain the adductory position against the subglottal pressure and to vibrate with large medial-lateral motion may contribute to the improved glottal closure pattern observed in the experiments of Xuan and Zhang (2014) .

Geometrically, a thin vocal fold has been shown to be easily pushed apart by the subglottal pressure ( Zhang, 2016a ). Although a thin anisotropic vocal fold vibrates with a dominantly medial-lateral motion, this is insufficient to overcome its inability to maintain position against the subglottal pressure. As a result, the glottis never completely closes during vibration, which leads to a relatively smooth glottal flow waveform and weak excitation of higher-order harmonics in the radiated output voice spectrum ( van den Berg, 1968 ; Zhang, 2016a ). Increasing the vertical thickness of the medial surface allows the vocal fold to better resist the glottis-opening effect of the subglottal pressure, thus maintaining the adductory position and achieving complete glottal closure.

Once these additional stiffness and geometric conditions (i.e., a certain degree of stiffness anisotropy and a not-too-small vertical vocal fold thickness) are met, the duration of glottal closure can be regulated by varying the vertical phase difference in vocal fold motion along the medial surface. A non-zero vertical phase difference means that, when the lower margins of the medial surfaces start to open, the glottis continues to remain closed until the upper margins start to open. One important parameter affecting the vertical phase difference is the vertical thickness of the medial surface, or the degree of medial bulging in the inferior portion of the medial surface. For the same conditions of vocal fold stiffness and vocal fold approximation, the vertical phase difference during vocal fold vibration increases with increasing vertical medial surface thickness (Fig. 8). Thus, the thicker the medial surface, the larger the vertical phase difference, and the longer the closed phase (Fig. 8; van den Berg, 1968 ; Alipour and Scherer, 2000 ; Zhang, 2016a ). Similarly, the vertical phase difference, and thus the duration of glottal closure, can also be increased by reducing the elastic surface wave speed in the superior-inferior direction ( Ishizaka and Flanagan, 1972 ; Story and Titze, 1995 ), which depends primarily on the stiffness in the transverse plane and to a lesser degree on the AP stiffness, or by increasing the body-cover stiffness ratio ( Story and Titze, 1995 ; Zhang, 2009 ).

Fig. 8. (Color online) The closed quotient CQ and the vertical phase difference VPD as functions of the medial surface thickness, the AP stiffness (G_ap), and the resting glottal angle (α). Reprinted with permission of ASA from Zhang (2016a) .

Theoretically, the duration of glottal closure can be controlled by changing the ratio between the vocal fold equilibrium position (or the mean glottal opening) and the vocal fold vibration amplitude. Both stiffening the vocal folds and tightening vocal fold approximation are able to move the vocal fold equilibrium position toward glottal midline. However, such manipulations often simultaneously reduce the vibration amplitude. As a result, the overall effect on the duration of glottal closure is unclear. Zhang (2016a) showed that stiffening the vocal folds or increasing vocal fold approximation did not have much effect on the duration of glottal closure except around onset when these manipulations led to significant improvement in vocal fold contact.

E. Role of flow instabilities

Although a Bernoulli-based flow description is often used in phonation models, the realistic glottal flow is highly three-dimensional and much more complex. The intraglottal pressure distribution has been shown to be affected by the three-dimensionality of the glottal channel geometry ( Scherer et al. , 2001 ; Scherer et al. , 2010 ; Mihaescu et al. , 2010 ; Li et al. , 2012 ). As the airflow exits the glottis and separates from the glottal wall, a jet forms downstream of the flow separation point, which leads to the development of shear layer instabilities, vortex roll-up, and eventually vortex shedding from the jet and transition into turbulence. The vortical structures in turn induce disturbances upstream, which may lead to an oscillating flow separation point, jet attachment to one side of the glottal wall rather than flowing straight ahead, and possibly alternating jet flapping ( Pelorson et al. , 1994 ; Shinwari et al. , 2003 ; Triep et al. , 2005 ; Kucinschi et al. , 2006 ; Erath and Plesniak, 2006 ; Neubauer et al. , 2007 ; Zheng et al. , 2009 ). Recent experiments and simulations have also shown that for a highly divergent glottis, airflow may separate inside the glottis, leading to the formation and convection of intraglottal vortices ( Mihaescu et al. , 2010 ; Khosla et al. , 2014 ; Oren et al. , 2014 ).

Some of these flow features have been incorporated into phonation models (e.g., Liljencrants, 1991 ; Pelorson et al. , 1994 ; Kaburagi and Tanabe, 2009 ; Erath et al. , 2011 ; Howe and McGowan, 2013 ). Resolving other features, particularly the jet instability, vortices, and turbulence downstream of the glottis, demands significantly increased computational cost, so that simulating a few cycles of vocal fold vibration often takes days or even months. On the other hand, the acoustic and perceptual relevance of these intraglottal and supraglottal flow structures has not been established. From the sound production point of view, these complex flow structures in the downstream glottal flow field are sound sources of the quadrupole type (or dipole type when obstacles are present in the path of the airflow, e.g., tightly adducted false vocal folds). Due to the small length scales associated with these flow structures, the corresponding sound sources are broadband in nature and mostly at high frequencies (generally above 2 kHz), with an amplitude much smaller than that of the harmonic component of the voice source. Therefore, if the high-frequency component of the voice is of interest, these flow features have to be accurately modeled, although the degree of accuracy required to achieve perceptual sufficiency has yet to be determined.

It has been postulated that the vortical structures may directly affect the near-field glottal fluid-structure interaction and thus vocal fold vibration and the harmonic component of the voice source. Once separated from the vocal fold walls, the glottal jet starts to develop jet instabilities and is therefore susceptible to downstream disturbances, especially when the glottis takes on a divergent shape. In this way, the unsteady supraglottal flow structures may interact with the boundary layer at the glottal exit and affect the flow separation point within the glottal channel ( Hirschberg et al. , 1996 ). Similarly, it has been hypothesized that intraglottal vortices can induce a local negative pressure on the medial surface of the vocal folds as the intraglottal vortices are convected downstream and thus may facilitate rapid glottal closure during voice production ( Khosla et al. , 2014 ; Oren et al. , 2014 ).

While there is no doubt that these complex flow features affect vocal fold vibration, the question remains how large an influence these vortical structures have on vocal fold vibration and the produced acoustics. For flow conditions typical of voice production, many of the flow features or instabilities have time scales much different from that of vocal fold vibration. For example, vortex shedding at typical voice conditions generally occurs at frequencies above 1000 Hz ( Zhang et al. , 2004 ; Kucinschi et al. , 2006 ). Considering that phonation is essentially a resonance phenomenon of the vocal folds (Sec. III B ) and the mismatch between vocal fold resonance and the typical frequency scales of the vortical structures, it is questionable whether, compared to vocal fold inertia and elastic recoil, the pressure perturbations on the vocal fold surface due to intraglottal or supraglottal vortical structures are strong enough, or last long enough, to have a significant effect on voice production. Given a longitudinal shear modulus of the vocal fold of about 10 kPa and a shear strain of 0.2, the elastic recoil stress of the vocal fold is approximately 2000 Pa. The pressure perturbations induced by intraglottal or supraglottal vortices are expected to be much smaller than the subglottal pressure. Assuming an upper limit of about 20% of the subglottal pressure for the pressure perturbations (as induced by intraglottal vortices, Oren et al. , 2014 ; in reality this number is expected to be much smaller at normal loudness conditions and even smaller for supraglottal vortices) and a subglottal pressure of 800 Pa (typical of normal speech production), the pressure perturbation on the vocal fold surface is about 160 Pa, which is much smaller than the elastic recoil stress. Specifically regarding the intraglottal vortices, while a highly divergent glottal geometry is required to create them, the presence of intraglottal vortices induces a negative suction force applied mainly on the superior portion of the medial surface which, if the vortices are strong enough, would reduce the divergence of the glottal channel. In other words, while intraglottal vortices are unable to create the divergence conditions required for their own creation, their existence tends to eliminate those conditions.
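The order-of-magnitude comparison in this paragraph, written out as explicit arithmetic:

```python
# The back-of-the-envelope numbers from the text, as explicit arithmetic.
shear_modulus = 10e3        # Pa, longitudinal shear modulus of the vocal fold
shear_strain = 0.2
elastic_recoil = shear_modulus * shear_strain     # 2000 Pa

P_sub = 800.0               # Pa, typical of normal speech production
vortex_fraction = 0.2       # assumed upper limit from the text
perturbation = vortex_fraction * P_sub            # 160 Pa

print(f"elastic recoil stress ~ {elastic_recoil:.0f} Pa")
print(f"vortex-induced perturbation ~ {perturbation:.0f} Pa "
      f"({perturbation / elastic_recoil:.0%} of the recoil stress)")
```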

There have been some recent studies toward quantifying the degree of influence of the vortical structures on phonation. In an excised larynx experiment without a vocal tract, it was observed that the produced sound did not change much when a finger was placed very close to the glottal exit, which presumably would have significantly disturbed the supraglottal flow field. A more rigorous experiment was designed by Zhang and Neubauer (2010) , who placed an anterior-posteriorly aligned cylinder in the supraglottal flow field, traversed it in the flow direction at different left-right locations, and observed the acoustic consequences. The hypothesis was that, if these supraglottal flow structures had a significant effect on vocal fold vibration and acoustics, disturbing them would lead to noticeable changes in the produced sound. However, the experiment found no significant changes in the sound except when the cylinder was positioned within the glottal channel.

The potential impact of intraglottal vortices on phonation has also been investigated numerically ( Farahani and Zhang, 2014 ; Kettlewell, 2015 ). Because of the difficulty of removing intraglottal vortices without affecting other aspects of the glottal flow, the effect of the intraglottal vortices was modeled as a negative pressure superimposed on the flow pressure predicted by a base glottal flow model. In this way, the effect of the intraglottal vortices could be selectively activated or deactivated independently of the base flow, so that its contribution to phonation could be investigated. These studies showed that intraglottal vortices have only small effects on vocal fold vibration and the glottal flow. Kettlewell (2015) further showed that the vortices are either not strong enough to induce significant pressure perturbations on the vocal fold surface or, if they are strong enough, they advect so rapidly into the supraglottal region that the induced pressure perturbations are too brief to overcome the inertia of the vocal fold tissue.

Although phonation models using simplified flow descriptions that neglect vortical flow structures are widely used and appear to compare well qualitatively with experiments ( Pelorson et al. , 1994 ; Zhang et al. , 2002a ; Ruty et al. , 2007 ; Kaburagi and Tanabe, 2009 ), more systematic investigations are required to reach a definite conclusion regarding the relative importance of these flow structures to phonation and voice perception. This may be achieved by conducting parametric studies over a large range of conditions in which the relative strength of these vortical structures is known to vary significantly and observing the consequences for voice production. Such an improved understanding would facilitate the development of computationally efficient reduced-order models of phonation.

IV. BIOMECHANICS OF VOICE CONTROL

A. Fundamental frequency

In the discussion of F0 control, an analogy is often made in the voice literature between phonation and the vibration of strings (e.g., Colton et al. , 2011 ). The vibration frequency of a string is determined by its length, tension, and mass. By analogy, the F0 of voice production is also assumed to be determined by the length, tension, and mass of the vocal folds, with the mass interpreted as the effective mass set into vibration. Specifically, F0 increases with increasing tension, decreasing mass, and decreasing vocal fold length. While the string analogy is conceptually simple and heuristically useful, it misses some important features of the vocal folds. Other than the vague definition of an effective mass, the string model, which implicitly assumes a cross-sectional dimension much smaller than the length, completely neglects the contribution of vocal fold stiffness to F0 control. Although stiffness and tension are often not differentiated in the voice literature, they have different physical meanings and represent two different mechanisms that resist deformation (Fig. 2). Stiffness is a property of the vocal fold and represents the elastic restoring force in response to deformation, whereas tension or stress describes the mechanical state of the vocal folds. The string analogy also neglects the effect of vocal fold contact, which introduces an additional stiffening effect.

Because phonation is essentially a resonance phenomenon of the vocal folds, the F0 is primarily determined by the frequencies of the vocal fold eigenmodes that are excited. In general, vocal fold eigenfrequencies depend both on vocal fold geometry, including length, depth, and thickness, and on the stiffness and stress conditions of the vocal folds. Shorter vocal folds tend to have higher eigenfrequencies. Thus, because of their small vocal fold size, children tend to have the highest F0, followed by women and then men. Vocal fold eigenfrequencies also increase with increasing stiffness or stress (tension), both of which provide a restoring force that resists vocal fold deformation. Thus, stiffening or tensioning the vocal folds increases the F0 of the voice. In general, the effect of stiffness on vocal fold eigenfrequencies dominates that of tension when the vocal fold is only slightly elongated or shortened, in which case the tension is small or even negative and the string model would underestimate F0 or fail to provide a prediction. As the vocal fold is further elongated and tension increases, stiffness and tension become equally important in affecting vocal fold eigenfrequencies ( Titze and Hunter, 2004 ; Yin and Zhang, 2013 ).
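For comparison with this discussion, a minimal sketch of the ideal-string prediction (the fold length, tensile stress, and tissue density below are rough illustrative values, not measurements) shows both the plausible F0 the string formula gives under tension and its failure when tension vanishes, which is where stiffness must take over:

```python
# Sketch: F0 from the ideal-string formula f = (1/2L) * sqrt(stress/density).
# Numbers are rough illustrative values (assumptions), not measurements.
import math

def string_f0(length_m, stress_pa, density=1040.0):
    """Ideal-string fundamental frequency; density in kg/m^3."""
    if stress_pa <= 0.0:
        return float("nan")   # string model fails; real vocal folds still
                              # have finite eigenfrequencies via stiffness
    return math.sqrt(stress_pa / density) / (2.0 * length_m)

print(f"{string_f0(0.016, 20e3):.0f} Hz")   # 16 mm fold under ~20 kPa stress
print(string_f0(0.016, 0.0))                # slack fold: nan (no prediction)
```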

When vocal fold contact occurs during vibration, the vocal fold collision force appears as an additional restoring force ( Ishizaka and Flanagan, 1972 ). Depending on the extent, depth of influence, and duration of vocal fold collision, this additional force can significantly increase the effective stiffness of the vocal folds and thus F0. Because the vocal fold contact pattern depends on the degree of vocal fold approximation, the subglottal pressure, and vocal fold stiffness and geometry, changes in any of these parameters may affect F0 by changing vocal fold contact ( van den Berg and Tan, 1959 ; Zhang, 2016a ).

In humans, F0 can be increased by increasing either the vocal fold eigenfrequencies or the extent and duration of vocal fold contact. Control of vocal fold eigenfrequencies is largely achieved by varying the stiffness and tension along the AP direction. Due to the nonlinear material properties of the vocal folds, both the AP stiffness and tension can be controlled by elongating or shortening the vocal folds through activation of the CT muscle. Although elongation also increases vocal fold length, which by itself lowers F0, the effect of the increased stiffness and tension appears to dominate that of the increased length.

The effect of TA muscle activation on F0 control is a little more complex. In addition to shortening vocal fold length, TA activation tensions and stiffens the body layer and decreases tension in the cover layer, but may decrease or increase the cover stiffness ( Yin and Zhang, 2013 ). Titze et al. (1988) showed that, depending on the depth of the body layer involved in vibration, increasing TA activation can either increase or decrease vocal fold eigenfrequencies. On the other hand, Yin and Zhang (2013) showed that for an elongated vocal fold, as is often the case in phonation, the overall effect of TA activation is to reduce vocal fold eigenfrequencies. Only for slightly elongated or shortened vocal folds may TA activation increase vocal fold eigenfrequencies. In addition to its effect on vocal fold eigenfrequencies, TA activation increases the vertical thickness of the vocal folds and produces medial compression between the two folds, both of which increase the extent and duration of vocal fold contact and would lead to an increased F0 ( Hirano et al. , 1969 ). Because of these opposing effects on vocal fold eigenfrequencies and vocal fold contact, the overall effect of TA activation on F0 varies depending on the specific vocal fold conditions.

Increasing the subglottal pressure or activating the LCA/IA muscles by itself does not have much effect on vocal fold eigenfrequencies ( Hirano and Kakita, 1985 ; Chhetri et al. , 2009 ; Yin and Zhang, 2014 ). However, these adjustments often increase the extent and duration of vocal fold contact during vibration, particularly with increasing subglottal pressure, and thus lead to an increased F0 ( Hirano et al. , 1969 ; Ishizaka and Flanagan, 1972 ; Zhang, 2016a ). Due to nonlinearity in vocal fold material properties, the increased vibration amplitude at high subglottal pressures may increase the effective stiffness and tension, which may also raise F0 ( van den Berg and Tan, 1959 ; Ishizaka and Flanagan, 1972 ; Titze, 1989 ). Ishizaka and Flanagan (1972) showed in their two-mass model that vocal fold contact and material nonlinearity combined can lead to an increase of about 40 Hz in F0 when the subglottal pressure is increased from about 200 to 800 Pa. In the continuum model of Zhang (2016a) , which includes the effect of vocal fold contact but not vocal fold material nonlinearity, increasing the subglottal pressure alone can raise F0 by as much as 20 Hz/kPa.

B. Vocal intensity

Because voice is produced at the glottis, filtered by the vocal tract, and radiated from the mouth, an increase in vocal intensity can be achieved by either increasing the source intensity or enhancing the radiation efficiency. The source intensity is controlled primarily by the subglottal pressure, which increases the vibration amplitude and the negative peak or MFDR of the time derivative of the glottal flow. The subglottal pressure depends primarily on the alveolar pressure in the lungs, which is controlled by the respiratory muscles and the lung volume. In general, conditions of the laryngeal system have little effect on the establishment of the alveolar pressure and subglottal pressure ( Hixon, 1987 ; Finnegan et al. , 2000 ). However, an open glottis often results in a small glottal resistance and thus a considerable pressure drop in the lower airway and a reduced subglottal pressure. An open glottis also leads to a large glottal flow rate and a rapid decline in the lung volume, thus reducing the duration of speech between breaths and increasing the respiratory effort required in order to maintain a target subglottal pressure ( Zhang, 2016b ).

In the absence of a vocal tract, laryngeal adjustments, which control vocal fold stiffness, geometry, and position, do not have much effect on the source intensity, as shown in many studies using laryngeal, physical, or computational models of phonation ( Tanaka and Tanabe, 1986 ; Titze, 1988b ; Zhang, 2016a ). In the experiment by Tanaka and Tanabe (1986) , for a constant subglottal pressure, stimulation of the CT and LCA muscles had almost no effect on vocal intensity, whereas stimulation of the TA muscle slightly decreased it. In an excised larynx experiment, Titze (1988b) found no dependence of vocal intensity on the glottal width. Similar secondary effects of laryngeal adjustments have also been observed in a recent computational study ( Zhang, 2016a ), which further showed that laryngeal adjustments may be important at subglottal pressures slightly above onset, where an increase in either the AP stiffness or vocal fold approximation may improve vocal fold contact and glottal closure, significantly increasing the MFDR and thus vocal intensity. However, these effects become less efficient with increasing vocal intensity.

The effect of laryngeal adjustments on vocal intensity becomes a little more complicated in the presence of the vocal tract. Changing vocal tract shape by itself does not amplify the produced sound intensity because sound propagation in the vocal tract is a passive process. However, changes in vocal tract shape may provide a better impedance match between the glottis and the free space outside the mouth and thus improve efficiency of sound radiation from the mouth ( Titze and Sundberg, 1992 ). This is particularly the case for harmonics close to a formant, which are often amplified more than the first harmonic and may become the most energetic harmonic in the spectrum of the output voice. Thus, vocal intensity can be increased through laryngeal adjustments that increase excitation of harmonics close to the first formant of the vocal tract ( Fant, 1982 ; Sundberg, 1987 ) or by adjusting vocal tract shape to match one of the formants with one of the dominant harmonics in the source spectrum.
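A small sketch of the harmonic-formant proximity idea (the 700 Hz first formant and the F0 values are assumed for illustration): for a given F0, the harmonic nearest F1 is the one the vocal tract will boost most, so either the source spectrum or the formant can be adjusted to bring the two together:

```python
# Sketch: which source harmonic falls closest to an assumed first formant,
# and hence which harmonic the vocal tract amplifies most for a given F0.
def nearest_harmonic(f0, formant):
    """Return (harmonic number, frequency) of the harmonic nearest the formant."""
    k = max(1, round(formant / f0))
    return k, k * f0

F1 = 700.0                                   # Hz, assumed formant frequency
for f0 in (110.0, 220.0, 440.0):
    k, f = nearest_harmonic(f0, F1)
    print(f"F0 = {f0:.0f} Hz: H{k} at {f:.0f} Hz lies nearest F1")
```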

In humans, all three strategies (respiratory, laryngeal, and articulatory) are used to increase vocal intensity. When asked to produce an intensity sweep from soft to loud voice, one generally starts with a slightly breathy voice with a relatively open glottis, which requires the least laryngeal effort but is inefficient in voice production. From this starting position, vocal intensity can be increased by increasing either the subglottal pressure, which increases the vibration amplitude, or vocal fold adduction (approximation and/or thickening). For a soft voice with minimal vocal fold contact and minimal higher-order harmonic excitation, increasing vocal fold adduction is particularly efficient because it may improve vocal fold contact in both spatial extent and duration, significantly boosting the excitation of harmonics close to the first formant. In humans, at low to medium vocal intensities, intensity increases are often accompanied by simultaneous increases in the subglottal pressure and the glottal resistance ( Isshiki, 1964 ; Holmberg et al. , 1988 ; Stathopoulos and Sapienza, 1993 ). Because the pitch level did not change much in these experiments, the increase in glottal resistance was most likely due to tighter vocal fold approximation through LCA/IA activation. The duration of the closed phase is often observed to increase with increasing vocal intensity ( Henrich et al. , 2005 ), indicating increased vocal fold thickening or medial compression, which are primarily controlled by the TA muscle. Thus, it appears that both LCA/IA/TA muscle activation and subglottal pressure increases play a role in raising vocal intensity at low to medium intensities. At high vocal intensities, when further increases in vocal fold adduction become less effective ( Hirano et al. , 1969 ), intensity increases appear to rely predominantly on raising the subglottal pressure.

On the vocal tract side, Titze (2002) showed that vocal intensity can be increased by matching a wide epilarynx with a lower glottal resistance or a narrow epilarynx with a higher glottal resistance. Tuning the first formant (e.g., by opening the mouth wider) to match the F0 is often used in soprano singing to maximize vocal output ( Joliveau et al. , 2004 ). Because radiation efficiency can be improved through adjustments of either the vocal folds or the vocal tract, it is possible to improve radiation efficiency while still maintaining a desired pitch or articulation.

C. Voice quality

Voice quality generally refers to aspects of the voice other than pitch and loudness. Due to the subjective nature of voice quality perception, many different descriptions are used, and authors often disagree about the meanings of these descriptions ( Gerratt and Kreiman, 2001 ; Kreiman and Sidtis, 2011 ). This lack of a clear and consistent definition makes it difficult to study voice quality and to identify its physiological correlates and controls. Acoustically, voice quality is associated with the spectral amplitude and shape of the harmonic and noise components of the voice source, and with their temporal variations. In the following we focus on physiological factors that are known to affect the voice spectra and thus are potentially perceptually important.

One of the first systematic investigations of the physiological controls of voice quality was conducted by Isshiki (1989 , 1998) using excised larynges, in which regions of normal, breathy, and rough voice qualities were mapped out in the three-dimensional parameter space of the subglottal pressure, vocal fold stiffness, and prephonatory glottal opening area (Fig. 9). He showed that for a given vocal fold stiffness and prephonatory glottal opening area, increasing the subglottal pressure led to voice production of a rough quality. This effect of the subglottal pressure could be counterbalanced by increasing vocal fold stiffness, which enlarged the region of normal voice in the parameter space of Fig. 9. Unfortunately, the details of this study, including the definition and manipulation of vocal fold stiffness and the perceptual evaluation of different voice qualities, are not fully available. The importance of coordination between the subglottal pressure and laryngeal conditions was also demonstrated by van den Berg and Tan (1959) , who showed that although different vocal registers were observed, each register occurred only within a certain range of laryngeal conditions and subglottal pressures. For example, for conditions of low longitudinal tension, a chest-like phonation was possible only for small airflow rates. At large values of the subglottal pressure, “it was impossible to obtain good sound production. The vocal folds were blown too wide apart…. The shape of the glottis became irregularly curved and this curving was propagated along the glottis.” Good voice production at large flow rates was possible only with thyroid cartilage compression, which imitates the effect of TA muscle activation. Irregular vocal fold vibration at high subglottal pressures has also been observed in physical model experiments (e.g., Xuan and Zhang, 2014 ). Irregular or chaotic vocal fold vibration at conditions of pressure-stiffness mismatch has also been reported in the numerical simulation of Berry et al. (1994) , which showed that while regular vocal fold vibration was observed for typical vocal fold stiffness conditions, irregular vibration (e.g., subharmonic or chaotic) occurred when the cover layer stiffness was significantly reduced while the same subglottal pressure was maintained.

FIG. 9. A three-dimensional map of normal (N), breathy (B), and rough (R) phonation in the parameter space of the prephonatory glottal area (Ag0), subglottal pressure (Ps), and vocal fold stiffness (k). Reprinted with permission of Springer from Isshiki (1989).

The experiments of van den Berg and Tan (1959) and Isshiki (1989) also showed that weakly adducted vocal folds (weak LCA/IA/TA activation) often lead to vocal fold vibration with incomplete glottal closure during phonation. When the airflow is sufficiently high, the persistent glottal gap leads to increased turbulent noise production and thus phonation of a breathy quality (Fig. 9). The incomplete glottal closure may occur in the membranous or the cartilaginous portion of the glottis. When the incomplete glottal closure is limited to the cartilaginous glottis, the resulting voice is breathy but may still have strong harmonics at high frequencies. When the incomplete glottal closure occurs in the membranous glottis, the reduced or slowed vocal fold contact also reduces excitation of higher-order harmonics, resulting in a breathy and weak quality of the produced voice. When the vocal folds are sufficiently separated, the coupling between the two vocal folds may be weakened enough that each vocal fold can vibrate at a different F0. This leads to biphonation, or voice containing two distinct fundamental frequencies, resulting in a perception similar to that of the beat frequency phenomenon.

Compared to a breathy voice, a pressed voice is presumably produced with tight vocal fold approximation or even some degree of medial compression between the two folds in the membranous portion. A pressed voice is often characterized by a second harmonic that is stronger than the first harmonic (i.e., a negative H1-H2) and a long period of glottal closure during vibration. Although a certain degree of vocal fold approximation and stiffness anisotropy is required to achieve vocal fold contact during phonation, the duration of glottal closure has been shown to be primarily determined by the vertical thickness of the vocal fold medial surface ( van den Berg, 1968 ; Zhang, 2016a ). Thus, although it is generally assumed that a pressed voice can be produced with tight arytenoid adduction through LCA/IA muscle activation, activation of the LCA/IA muscles alone is unable to achieve prephonatory medial compression in the membranous glottis or to change the vertical thickness of the medial surface. Activation of the TA muscle appears to be essential in changing the voice from a breathy to a pressed quality. A weakened TA muscle, as in aging or muscle atrophy, would lead to difficulties in producing a pressed voice or even sufficient glottal closure during phonation. On the other hand, strong TA muscle activation, as in, for example, spasmodic dysphonia, may lead to too tight a glottal closure and a rough voice quality ( Isshiki, 1989 ).
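Since H1-H2 serves here as a spectral correlate of pressed (negative H1-H2) versus breathy voice, a minimal sketch of estimating it from a voiced frame may help. This is illustrative only: the frame length, sampling rate, and harmonic search width are assumptions, and the formant correction used in formal H1*-H2* measures is omitted.

```python
import numpy as np

def h1_h2_db(frame, fs, f0, search_semitones=1.0):
    """Estimate H1-H2 (dB) from one voiced frame.

    Rough illustration only: picks the largest spectral peak near f0
    and near 2*f0. Formal analyses also correct for formant effects
    (H1*-H2*), which is omitted here.
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def peak_db(target):
        lo = target / 2 ** (search_semitones / 12)
        hi = target * 2 ** (search_semitones / 12)
        band = spec[(freqs >= lo) & (freqs <= hi)]
        return 20 * np.log10(band.max() + 1e-12)

    return peak_db(f0) - peak_db(2 * f0)

# Toy check: a source whose second harmonic is stronger than its first,
# as in a pressed voice, should give a negative H1-H2 (about -6 dB here).
fs, f0 = 16000, 150
t = np.arange(int(0.1 * fs)) / fs
frame = 0.5 * np.sin(2 * np.pi * f0 * t) + 1.0 * np.sin(2 * np.pi * 2 * f0 * t)
print(f"H1-H2 = {h1_h2_db(frame, fs, f0):.1f} dB")
```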

In humans, vocal fold stiffness, vocal fold approximation, and geometry are regulated by the same set of laryngeal muscles and thus often co-vary, which has long been considered one possible origin of vocal registers and their transitions ( van den Berg, 1968 ). Specifically, it has been hypothesized that changes in F0 are often accompanied by changes in the vertical thickness of the vocal fold medial surface, which lead to changes in the spectral characteristics of the produced voice. The medial surface thickness is primarily controlled by the CT and TA muscles, which also regulate vocal fold stiffness and vocal fold approximation. Activation of the CT muscle reduces the medial surface thickness, but also increases vocal fold stiffness and tension, and in some conditions increases the resting glottal opening ( van den Berg and Tan, 1959 ; van den Berg, 1968 ; Hirano and Kakita, 1985 ). Because the LCA/IA/TA muscles are innervated by the same nerve and often activated together, an increase in the medial surface thickness through TA muscle activation is often accompanied by increased vocal fold approximation ( Hirano and Kakita, 1985 ) and contact. Thus, if one attempts to increase F0 primarily by activation of the LCA/IA/TA muscles, the vocal folds are likely to have a large medial surface thickness and probably low AP stiffness, which will lead to a chest-like voice production, with a large vertical phase difference along the medial surface, long closure of the glottis, small flow rate, and strong harmonic excitation. In the extreme case of strong TA activation, minimal CT activation, and very low subglottal pressure, the glottis can remain closed for most of the cycle, leading to a vocal fry-like voice production. In contrast, if one attempts to increase F0 by increasing CT activation alone, the vocal folds, with a small medial surface thickness, are likely to produce a falsetto-like voice, with incomplete glottal closure, a nearly sinusoidal flow waveform, very high F0, and a limited number of harmonics.

V. MECHANICAL AND COMPUTER MODELS FOR VOICE APPLICATIONS

Voice applications generally fall into two major categories. In the clinic, simulation of voice production has the potential to predict outcomes of clinical management of voice disorders, including surgery and voice therapy. For such applications, accurate representation of vocal fold geometry and material properties, to a degree that matches the actual clinical treatment, is desired, and for this reason continuum models of the vocal folds are preferred over lumped-element models. Computational cost is not necessarily a concern in such applications, but it still has to be practical. In contrast, for some other applications, particularly in speech technology, the primary goal is to reproduce speech acoustics or at least perceptually relevant features of speech acoustics. Real-time capability is desired in these applications, whereas realistic representation of the underlying physics is often not necessary. In fact, most current speech synthesis systems consider speech purely as an acoustic signal and do not model the physics of speech production at all. However, models that take the underlying physics into consideration, at least to some degree, may hold the most promise for synthesizing natural-sounding, speaker-specific speech.

A. Mechanical vocal fold models

Early efforts on artificial speech production, dating back to as early as the 18th century, focused on mechanically reproducing the speech production system. A detailed review can be found in Flanagan (1972) . The focus of these early efforts was generally on articulation in the vocal tract rather than the voice source, which is understandable considering that meaning is primarily conveyed through changes in articulation and that the voice production process was poorly understood at the time. The vibrating element in these mechanical models, either a vibrating reed or a slotted rubber sheet stretched over an opening, was only a rough approximation of the human vocal folds.

More sophisticated mechanical models have been developed more recently to better reproduce the three-dimensional layered structure of the vocal folds. A membrane (cover)-cushion (body) two-layer rubber vocal fold model was first developed by Smith (1956) . Similar mechanical models were later developed and used in voice production research (e.g., Isogai et al. , 1988 ; Kakita, 1988 ; Titze et al. , 1995 ; Thomson et al. , 2005 ; Ruty et al. , 2007 ; Drechsel and Thomson, 2008 ), using silicone or rubber materials or liquid-filled membranes. Recent studies ( Murray and Thomson, 2012 ; Xuan and Zhang, 2014 ) have also started to embed fibers into these models to simulate the anisotropic material properties due to the presence of collagen and elastin fibers in the vocal folds. A similar layered vocal fold model has been incorporated into a mechanical talking robot system ( Fukui et al. , 2005 ; Fukui et al. , 2007 ; Fukui et al. , 2008 ). The most recent version of the talking robot, Waseda Talker, includes mechanisms for the control of pitch and resting glottal opening, and is able to produce voice of modal, creaky, or breathy quality. Nevertheless, although a mechanical voice production system may find application in voice prosthesis or humanoid robotic systems in the future, current mechanical models are still a long way from reproducing or even approaching humans' capability and flexibility in producing and controlling voice.

B. Formant synthesis and parametric voice source models

Compared to mechanically reproducing the physical process involved in speech production, it is easier to reproduce speech as an acoustic signal. This is particularly the case for speech synthesis. One approach, adopted in most current speech synthesis systems, is to concatenate segments of pre-recorded natural voice into new phrases or sentences. While relatively easy to implement, this approach requires a large database of words spoken in different contexts to achieve natural-sounding speech, which makes it difficult to apply to personalized speech synthesis of varying emotional percepts.

Another approach is to reproduce only perceptually relevant acoustic features of speech, as in formant synthesis. The target acoustic features to be reproduced generally include the F0, sound amplitude, and formant frequencies and bandwidths. This approach gained popularity with the development of electrical synthesizers and later computer simulations which allow flexible and accurate control of these acoustic features. Early formant-based synthesizers used simple sound sources, often a filtered impulse train as the sound source for voiced sounds and white noise for unvoiced sounds. Research on the voice sources (e.g., Fant, 1979 ; Fant et al. , 1985 ; Rothenberg et al. , 1971 ; Titze and Talkin, 1979 ) has led to the development of parametric voice source models in the time domain, which are capable of producing voice source waveforms of varying F0, amplitude, open quotient, and degree of abruptness of the glottal flow shutoff, and thus synthesis of different voice qualities.
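As a concrete illustration of such a time-domain parametric source model, the sketch below generates a Rosenberg-style glottal flow pulse train with controllable F0, amplitude, open quotient, and a speed-quotient parameter standing in for the abruptness of glottal flow shutoff. This is a simplified textbook form, not the exact waveform family of any of the models cited above.

```python
import numpy as np

def rosenberg_pulse_train(f0, fs, dur, amp=1.0, oq=0.6, sq=2.0):
    """Rosenberg-style glottal flow pulse train (simplified sketch).

    f0  : fundamental frequency (Hz)
    oq  : open quotient, fraction of each cycle the glottis is open
    sq  : speed quotient, ratio of opening- to closing-phase duration;
          larger sq -> more abrupt flow shutoff, stronger harmonics
    """
    T = 1.0 / f0
    t_open = oq * T
    t_p = t_open * sq / (1.0 + sq)      # opening phase duration
    t_n = t_open - t_p                  # closing phase duration

    t = np.arange(int(dur * fs)) / fs
    tau = np.mod(t, T)                  # time within the current cycle
    u = np.zeros_like(t)

    rising = tau < t_p
    u[rising] = 0.5 * amp * (1 - np.cos(np.pi * tau[rising] / t_p))
    falling = (tau >= t_p) & (tau < t_open)
    u[falling] = amp * np.cos(0.5 * np.pi * (tau[falling] - t_p) / t_n)
    return u                            # zero during the closed phase

# Example: a 120 Hz source with a fairly abrupt closing phase.
flow = rosenberg_pulse_train(f0=120, fs=16000, dur=0.05, oq=0.6, sq=3.0)
```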

While parametric voice source models provide flexibility in source variations, synthetic speech generated by formant synthesis still suffers from limited naturalness. This limited naturalness may result from the primitive rules used in specifying dynamic controls of the voice source models ( Klatt, 1987 ). Also, the source model control parameters are not independent from each other and often co-vary during phonation. A challenge in formant synthesis is thus to specify voice source parameter combinations and their time variation patterns that may occur in realistic voice production of different voice qualities by different speakers. It is also possible that some perceptually important features are missing from time-domain voice source models ( Klatt, 1987 ). Human perception of voice characteristics is better described in the frequency domain, as the auditory system performs an approximation to Fourier analysis of the voice and sound in general. While time-domain models have better correspondence to the physical events occurring during phonation (e.g., glottal opening and closing, and the closed phase), it is possible that some spectral details of perceptual importance are not captured in simple time-domain voice source models. For example, spectral details in the low and middle frequencies have been shown to be of considerable importance to naturalness judgments, but are difficult to represent in a time-domain source model ( Klatt, 1987 ). A recent study ( Kreiman et al. , 2015 ) showed that spectral-domain voice source models are able to create significantly better matches to natural voices than time-domain voice source models. Furthermore, because the voice source and the sub- and supra-glottal systems are independent in formant synthesis, interactions and co-variations between the vocal folds and the sub- and supra-glottal systems are by design not accounted for. All these factors may contribute to the limited naturalness of formant-synthesized speech.

C. Physically based computer models

An alternative approach to natural speech synthesis is to computationally model the voice production process based on physical principles. The control parameters would be geometry and material properties of the vocal system or, in a more realistic way, respiratory and laryngeal muscle activation. This approach avoids the need to specify consistent characteristics of either the voice source or the formants, thus allowing synthesis and modification of natural voice in a way intuitively similar to human voice production and control.

The first such computer model of voice production is the one-mass model of Flanagan and Landgraf (1968) , in which the vocal fold is modeled as a horizontally moving single-degree-of-freedom mass-spring-damper system. This model is able to vibrate in a restricted range of conditions, when the natural frequency of the mass-spring system is close to one of the acoustic resonances of the subglottal or supraglottal tracts. Ishizaka and Flanagan (1972) extended this model to a two-mass model in which the upper and lower parts of the vocal fold are modeled as two separate masses connected by an additional spring along the vertical direction. The two-mass model is able to vibrate with a vertical phase difference between the two masses, and thus able to vibrate independently of the acoustics of the sub- and supra-glottal tracts. Many variants of the two-mass model have since been developed. Titze (1973) developed a 16-mass model to better represent vocal fold motion along the anterior-posterior direction. To better represent the body-cover layered structure of the vocal folds, Story and Titze (1995) extended the two-mass model to a three-mass model, adding a lateral mass representing the inner muscular layer. Empirical rules have also been developed to relate control parameters of the three-mass model to laryngeal muscle activation levels ( Titze and Story, 2002 ) so that voice production can be simulated with laryngeal muscle activity as input. Designed originally for speech synthesis purposes, these lumped-element models of voice production are generally computationally fast and ideal for real-time speech synthesis.
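To make the lumped-element idea concrete, the sketch below time-steps a single mass-spring-damper "fold." It is a deliberately crude caricature, not the published one-mass model: all parameter values are assumed, and the on/off pressure rule simply injects energy while the glottis opens, standing in for the phase-shifted aerodynamic driving that sustains oscillation in the real one- and two-mass models.

```python
import numpy as np

# Crude lumped-element caricature in the spirit of one-mass models.
# All values are assumed, illustrative numbers; the on/off pressure
# rule below is a stand-in for the asymmetric aerodynamic driving
# that sustains oscillation in published models, not the actual physics.
m  = 1e-4      # effective vibrating mass, kg
k  = 40.0      # effective stiffness, N/m -> natural frequency ~100 Hz
b  = 0.02      # damping, N*s/m
Ps = 800.0     # subglottal pressure, Pa
A  = 1e-5      # surface area on which pressure acts, m^2 (assumed)
x0 = 1e-4      # prephonatory half-width of the glottis, m

fs = 100_000   # integration steps per second
x, v = 0.0, 0.0
trace = []
for _ in range(int(0.1 * fs)):              # simulate 100 ms
    glottis_open = (x0 + x) > 0.0
    # energy injection: pressure pushes laterally only while opening
    driving = Ps * A if (glottis_open and v >= 0.0) else 0.0
    a = (driving - b * v - k * x) / m       # Newton's second law
    v += a / fs                             # explicit Euler step
    x += v / fs
    x = max(x, -x0)                         # folds cannot interpenetrate
    trace.append(x)

print(f"steady oscillation amplitude ~ {max(trace[-2000:]) * 1e3:.2f} mm")
```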

A drawback of the lumped-element models of phonation is that the model control parameters cannot be directly measured or easily related to the anatomical structure or material properties of the vocal folds. Thus, these models are not as useful in applications in which a realistic representation of voice physiology is required, as, for example, in the clinical management of voice disorders. To better understand the voice source and its control under different voicing conditions, more sophisticated computational models of the vocal folds based on continuum mechanics have been developed to understand laryngeal muscle control of vocal fold geometry, stiffness, and tension, and how changes in these vocal fold properties affect the glottal fluid-structure interaction and the produced voice. One of the first such models is the finite-difference model by Titze and Talkin (1979) , which coupled a three-dimensional vocal fold model of linear elasticity with the one-dimensional glottal flow model of Ishizaka and Flanagan (1972) . In the past two decades more refined phonation models using a two-dimensional or three-dimensional Navier-Stokes description of the glottal flow have been developed (e.g., Alipour et al. , 2000 ; Zhao et al. , 2002 ; Tao et al. , 2007 ; Luo et al. , 2009 ; Zheng et al. , 2009 ; Bhattacharya and Siegmund, 2013 ; Xue et al. , 2012 , 2014 ). Continuum models of laryngeal muscle activation have also been developed to model vocal fold posturing ( Hunter et al. , 2004 ; Gommel et al. , 2007 ; Yin and Zhang, 2013 , 2014 ). By directly modeling the voice production process, continuum models with realistic geometry and material properties ideally hold the most promise in reproducing natural human voice production. However, because the phonation process is highly nonlinear and involves large displacement and deformation of the vocal folds and complex glottal flow patterns, modeling this process in three dimensions is computationally very challenging and time-consuming. As a result, these computational studies are often limited to one or two specific aspects instead of the entire voice production process, and the acoustics of the produced voice, other than F0 and vocal intensity, are often not investigated. For practical applications, real-time or not, reduced-order models with significantly improved computational efficiency are required. Some reduced-order continuum models, with simplifications in both the glottal flow and vocal fold dynamics, have been developed and used in large-scale parametric studies of voice production (e.g., Titze and Talkin, 1979 ; Zhang, 2016a ), which appear to produce qualitatively reasonable predictions. However, these simplifications have yet to be rigorously validated by experiment.

VI. FUTURE CHALLENGES

We currently have a general understanding of the physical principles of voice production. Toward establishing a cause-effect theory of voice production, much remains to be learned about voice physiology and biomechanics. This includes the geometry and mechanical properties of the vocal folds, their variability across subjects and with sex and age, and how they vary across different voicing conditions under laryngeal muscle activation. Even less is known about changes in vocal fold geometry and material properties in pathologic conditions. The surface conditions of the vocal folds and their mechanical properties have been shown to affect vocal fold vibration ( Dollinger et al. , 2014 ; Bhattacharya and Siegmund, 2015 ; Tse et al. , 2015 ), and thus need to be better quantified. While in vivo animal or human larynx models ( Moore and Berke, 1988 ; Chhetri et al. , 2012 ; Berke et al. , 2013 ) could provide such information, more reliable measurement methods are required to better quantify the viscoelastic properties of the vocal fold, vocal fold tension, and the geometry and movement of the inner vocal fold layers. While macro-mechanical properties are of interest, development of vocal fold constitutive laws based on the ECM distribution and interstitial fluids within the vocal folds would allow us to better understand how vocal fold mechanical properties change with prolonged vocal use, vocal fold injury, and wound healing, which otherwise is difficult to quantify.

While oversimplification of the vocal folds to mass and tension is of limited practical use, the other extreme is not appealing, either. With improved characterization and understanding of vocal fold properties, establishing a cause-effect relationship between voice physiology and production thus requires identifying which of these physiologic features are actually perceptually relevant and under what conditions, through systematic parametric investigations. Such investigations will also facilitate the development of reduced-order computational models of phonation in which perceptually relevant physiologic features are sufficiently represented and features of minimum perceptual relevance are simplified. We discussed earlier that many of the complex supraglottal flow phenomena have questionable perceptual relevance. Similar relevance questions can be asked with regard to the geometry and mechanical properties of the vocal folds. For example, while the vocal folds exhibit complex viscoelastic properties, what are the main material properties that are definitely required in order to reasonably predict vocal fold vibration and voice quality? Does each of the vocal fold layers, in particular, the different layers of the lamina propria, have a functional role in determining the voice output or preventing vocal injury? Current vocal fold models often use a simplified vocal fold geometry. Could some geometric features of a realistic vocal fold that are not included in current models have an important role in affecting voice efficiency and voice quality? Because voice communication spans a large range of voice conditions (e.g., pitch, loudness, and voice quality), the perceptual relevance and adequacy of specific features (i.e., do changes in specific features lead to perceivable changes in voice?) should be investigated across a large number of voice conditions rather than a few selected conditions. While physiologic models of phonation allow better reproduction of realistic vocal fold conditions, computational models are more suitable for such systematic parametric investigations. Unfortunately, due to the high computational cost, current studies using continuum models are often limited to a few conditions. Thus, the establishment of cause-effect relationship and the development of reduced-order models are likely to be iterative processes, in which the models are gradually refined to include more physiologic details to be considered in the cause-effect relationship.

A causal theory of voice production would allow us to map out regions in the physiological parameter space that produce distinct vocal fold vibration patterns and voice qualities of interest (e.g., normal, breathy, and rough voices for clinical applications; different vocal registers for singing training), similar to that described by Isshiki (1989; see also Fig. 9). Although the voice production system is quite complex, control of voice should be both stable and simple, which is required for voice to be a robust and easily controlled means of communication. Understanding voice production in the framework of nonlinear dynamics and eigenmode interactions, and relating it to voice quality, may facilitate progress toward this goal. Toward practical clinical applications, such a voice map would help us understand what physiologic alteration caused a given voice change (the inverse problem), and what can be done to restore the voice to normal. Development of efficient and reliable tools addressing the inverse problem has important applications in the clinical diagnosis of voice disorders. Some methods already exist that solve the inverse problem in lumped-element models (e.g., Dollinger et al. , 2002 ; Hadwin et al. , 2016 ), and these can be extended to physiologically more realistic continuum models, as sketched below.
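As a toy illustration of this inverse-problem approach, the sketch below recovers two hypothetical model parameters from observed acoustic features by least squares. The forward mapping here is a made-up placeholder; in actual studies it would be a lumped-element or continuum phonation model (cf. Dollinger et al., 2002; Hadwin et al., 2016).

```python
import numpy as np
from scipy.optimize import least_squares

# Toy inverse problem: recover model parameters from observed voice
# features. The "forward model" is a made-up placeholder mapping
# (stiffness, adduction) -> (F0, intensity proxy); real studies would
# substitute an actual phonation model here.
def forward_model(params):
    stiffness, adduction = params
    f0 = 60.0 * np.sqrt(stiffness)                   # assumed F0 scaling
    intensity = 10.0 * adduction / (1 + adduction)   # assumed saturation
    return np.array([f0, intensity])

observed = np.array([180.0, 6.0])    # target F0 (Hz) and intensity proxy

def residuals(params):
    return forward_model(params) - observed

fit = least_squares(residuals, x0=[1.0, 0.5],
                    bounds=([0.1, 0.0], [100.0, 10.0]))
print("recovered parameters:", fit.x)   # expect roughly [9.0, 1.5]
```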

Solving the inverse problem would also provide an indirect approach toward understanding the physiologic states that lead to percepts of different emotional states or communication of other personal traits, which are otherwise difficult to measure directly in live human beings. When extended to continuous speech production, this approach may also provide insights into the dynamic physiologic control of voice in running speech (e.g., time contours of the respiratory and laryngeal adjustments). Such information would facilitate the development of computer programs capable of natural-sounding, conversational speech synthesis, in which the time contours of control parameters may change with context, speaking style, or emotional state of the speaker.

ACKNOWLEDGMENTS

This study was supported by research Grant Nos. R01 DC011299 and R01 DC009229 from the National Institute on Deafness and Other Communication Disorders, the National Institutes of Health. The author would like to thank Dr. Liang Wu for assistance in preparing the MRI images in Fig. 1, Dr. Jennifer Long for providing the image in Fig. 1(b), Dr. Gerald Berke for providing the stroboscopic recording from which Fig. 3 was generated, and Dr. Jody Kreiman, Dr. Bruce Gerratt, Dr. Ronald Scherer, and an anonymous reviewer for their helpful comments on an earlier version of this paper.


Send in the clones: Using artificial intelligence to digitally replicate human voices


Chloe Veltman


Reporter Chloe Veltman reacts to hearing her digital voice double, "Chloney," for the first time, with Speech Morphing chief linguist Mark Seligman. Courtesy of Speech Morphing

The science behind making machines talk just like humans is very complex, because our speech patterns are so nuanced.

"The voice is not easy to grasp," says Klaus Scherer , emeritus professor of the psychology of emotion at the University of Geneva. "To analyze the voice really requires quite a lot of knowledge about acoustics, vocal mechanisms and physiological aspects. So it is necessarily interdisciplinary, and quite demanding in terms of what you need to master in order to do anything of consequence."

So it's not surprising that it has taken well over 200 years for synthetic voices to get from the first speaking machine , invented by Wolfgang von Kempelen around 1800 – a boxlike contraption that used bellows, pipes and a rubber mouth and nose to simulate a few recognizably human utterances, like mama and papa – to a Samuel L. Jackson voice clone delivering the weather report on Alexa today.

A model replica of Wolfgang von Kempelen's Speaking Machine. Fabian Brackhane

Talking machines like Siri, Google Assistant and Alexa, or a bank's automated customer service line, are now sounding quite human. Thanks to advances in artificial intelligence, or AI, we've reached a point where it's sometimes difficult to distinguish synthetic voices from real ones.

I wanted to find out what's involved in the process at the customer end. So I approached San Francisco Bay Area-based natural language speech synthesis company Speech Morphing about creating a clone – or "digital double" – of my own voice.

A reporter gets her voice cloned

Given the complexities of speech synthesis, it's quite a shock to find out just how easy it is to order one up. For a basic conversational build, all a customer has to do is record themselves saying a bunch of scripted lines for roughly an hour. And that's about it.

"We extract 10 to 15 minutes of net recordings for a basic build," says Speech Morphing founder and CEO Fathy Yassa.

The hundreds of phrases I record so that Speech Morphing can build my digital voice double seem very random: "Here the explosion of mirth drowned him out." "That's what Carnegie did." "I'd like to be buried under Yankee Stadium with JFK." And so on.

But they aren't as random as they appear. Yassa says the company chooses utterances that will produce a wide enough variety of sounds across a range of emotions – such as apologetic, enthusiastic, angry and so on – to feed a neural network-based AI training system. It essentially teaches itself the specific patterns of a person's speech.

Speech Morphing founder and CEO Fathy Yassa. Chloe Veltman/KQED

Yassa says there are around 20 affects or tones to choose from, and some of these can be used interchangeably, or not at all. "Not every tone or affect is needed for every client," he says. "The choice depends on the target application and use cases. Banking is different from eBooks, is different from reporting and broadcast, is different from consumer."

At the end of the recording session, I send Speech Morphing the audio files. From there, the company breaks down and analyzes my utterances, and then builds the model for the AI to learn from. Yassa says the entire process takes less than a week.

He says the possibilities for the Chloe Veltman voice clone – or "Chloney" as I've affectionately come to call my robot self – are almost limitless.

"We can make you apologetic, we can make you promotional, we can make you act like you're in the theater," Yassa says. "We can make you sing, eventually, though we're not yet there."

A fast growing industry

The global speech and voice recognition industry is worth tens of billions of dollars, and is growing fast. Its uses are evident. The technology has given actor Val Kilmer , who lost his voice owing to throat cancer a few years ago, the chance to reclaim something approaching his former vocal powers.

It's enabled film directors, audiobook creators and game designers to develop characters without the need to have live voice talent on hand, as in the movie Roadrunner , where an AI was trained on Anthony Bourdain's extensive archive of media appearances to create a digital double of the late chef and TV personality's voice.

AI Brought Anthony Bourdain's Voice Back To Life. Should It Have?

As pitch-perfect as Bourdain's digital voice double might be, it's also caused controversy. Some people raised ethical concerns about putting words into Bourdain's mouth that he never actually said while he was alive.

A cloned version of Barack Obama's voice warning people about the dangers of fake news, created by actor and film director Jordan Peele, hammers the point home: Sometimes we have cause to be wary of machines that sound too much like us.

[ Note: The video embedded below includes profanities. ]

"We're entering an era in which our enemies can make it look like anyone is saying anything at any point in time," says the Obama deepfake in the video, produced in collaboration with BuzzFeed in 2018. "Even if they would never say those things."

When too human is too much

Sometimes, though, we don't necessarily want machines to sound too human, because it creeps us out.

If you're looking for a digital voice double to read an audiobook to kids, or act as a companion or helper for a senior, a more human-sounding voice might be the right way to go.

"Maybe not something that actually breathes, because that's a little bit creepy, but a little more human might be more approachable," says user experience and voice designer Amy Jiménez Márquez , who led the voice, multimodal and UX Amazon Alexa personality-experience design team for four years.

But for a machine that performs basic tasks, like, say, a voice-activated refrigerator? Maybe less human is best. "Having something a little more robotic and you can even create a tinny voice that sounds like an actual robot that is cute, that would be more appropriate for a refrigerator," Jiménez Márquez says.

The big reveal

At a demo session with Speech Morphing, I get to hear Chloney, my digital voice double.

Her voice comes at me through a pair of portable speakers connected to a laptop. The laptop displays the programming interface into which whatever text I want Chloney to say is typed. The interface includes tools to make micro-adjustments to the pitch, speed and other vocal attributes that might need to be tweaked if Chloney's prosody doesn't come out sounding exactly right.

Listen to "Chloney" recite "Happy Birthday"

"Happy birthday to you. Happy birthday to you. Happy birthday, dear Chloney. Happy birthday to you," says Chloney.

Chloney can't sing "Happy Birthday" – at least for now. But she can read out news stories I didn't even report myself, like one ripped from an AP newswire about the COVID-19 pandemic. And she can even do it in Spanish.

Chloney sounds quite a lot like me. It's impressive, but it's also a little scary.

Listen to "Chloney" reading a news story in English

Here's "chloney" reading a news story in spanish.

"My jaw is on the floor," says the original voice behind Chloney – that's me, Chloe – as I listen to what my digital voice double can do. "Let's hope she doesn't put me out of a job anytime soon."

AI vs. Human Voices: How Delivery Source and Narrative Format Influence the Effectiveness of Persuasion Messages

Jiyoung Lee (Sungkyunkwan University) and Ji Won Kim (City University of Hong Kong); published by Taylor & Francis. Full text not freely available; a copy can be requested directly from the authors.


Acoustical Society of America

Mechanics of human voice production and control

Electronic mail: [email protected]


Zhaoyan Zhang; Mechanics of human voice production and control. J. Acoust. Soc. Am. 1 October 2016; 140 (4): 2614–2635. https://doi.org/10.1121/1.4964509


As the primary means of communication, voice plays an important role in daily life. Voice also conveys personal information such as social status, personal traits, and the emotional state of the speaker. Mechanically, voice production involves complex fluid-structure interaction within the glottis and its control by laryngeal muscle activation. An important goal of voice research is to establish a causal theory linking voice physiology and biomechanics to how speakers use and control voice to communicate meaning and personal information. Establishing such a causal theory has important implications for clinical voice management, voice training, and many speech technology applications. This paper provides a review of voice physiology and biomechanics, the physics of vocal fold vibration and sound production, and laryngeal muscular control of the fundamental frequency of voice, vocal intensity, and voice quality. Current efforts to develop mechanical and computational models of voice production are also critically reviewed. Finally, issues and future challenges in developing a causal theory of voice production and perception are discussed.

In the broad sense, voice refers to the sound we produce to communicate meaning, ideas, opinions, etc. In the narrow sense, voice, as in this review, refers to sounds produced by vocal fold vibration, or voiced sounds. This is in contrast to unvoiced sounds which are produced without vocal fold vibration, e.g., fricatives which are produced by airflow through constrictions in the vocal tract, plosives produced by sudden release of a complete closure of the vocal tract, or other sound producing mechanisms such as whispering. For voiced sound production, vocal fold vibration modulates airflow through the glottis and produces sound (the voice source), which propagates through the vocal tract and is selectively amplified or attenuated at different frequencies. This selective modification of the voice source spectrum produces perceptible contrasts, which are used to convey different linguistic sounds and meaning. Although this selective modification is an important component of voice production, this review focuses on the voice source and its control within the larynx.

For effective communication of meaning, the voice source, as a carrier for the selective spectral modification by the vocal tract, contains harmonic energy across a large range of frequencies that spans at least the first few acoustic resonances of the vocal tract. In order to be heard over noise, such harmonic energy also has to be reasonably above the noise level within this frequency range, unless a breathy voice quality is desired. The voice source also contains important information of the pitch, loudness, prosody, and voice quality, which convey meaning (see Kreiman and Sidtis, 2011 , Chap. 8 for a review), biological information (e.g., size), and paralinguistic information (e.g., the speaker's social status, personal traits, and emotional state; Sundberg, 1987 ; Kreiman and Sidtis, 2011 ). For example, the same vowel may sound different when spoken by different people. Sometimes a simple “hello” is all it takes to recognize a familiar voice on the phone. People tend to use different voices to different speakers on different occasions, and it is often possible to tell if someone is happy or sad from the tone of their voice.

One of the important goals of voice research is to understand how the vocal system produces voice of different source characteristics and how people associate percepts with these characteristics. Establishing a cause-effect relationship between voice physiology and voice acoustics and perception will allow us to answer two essential questions in voice science and effective clinical care ( Kreiman et al. , 2014 ): when the output voice changes, what physiological alteration caused this change; and if a change to voice physiology occurs, what change in perceived voice quality can be expected? Clinically, such knowledge would lead to the development of a physically based theory of voice production capable of better predicting voice outcomes of clinical management of voice disorders, thus improving both diagnosis and treatment. More generally, an understanding of this relationship could lead to a better understanding of the laryngeal adjustments that we use to change voice quality, adopt different speaking or singing styles, or convey personal information such as social status and emotion. Such understanding may also lead to the development of improved computer programs for synthesis of natural-sounding, speaker-specific speech of varying emotional percepts.

Understanding such a cause-effect relationship between voice physiology and production necessarily requires a multi-disciplinary effort. While voice production results from a complex fluid-structure-acoustic interaction process, which in turn depends on the geometry and material properties of the lungs, larynx, and the vocal tract, the end interest of voice is its acoustics and perception. Changes in voice physiology or physics that cannot be heard are of little interest. On the other hand, the physiology and physics may impose constraints on the co-variations among fundamental frequency (F0), vocal intensity, and voice quality, and thus on the way we use and control our voice. Thus, understanding voice production and voice control requires an integrated approach, in which physiology, vocal fold vibration, and acoustics are considered as a whole instead of as disconnected components. Traditionally, the multi-disciplinary nature of voice production has led to a clear divide between research activities in voice production, voice perception, and their clinical or speech applications, with few studies attempting to link them together. Although much advancement has been made in understanding the physics of phonation, some misconceptions still exist in textbooks in otolaryngology and speech pathology. For example, the Bernoulli effect, which has been shown to play a minor role in phonation, is still considered an important factor in initiating and sustaining phonation in many textbooks and reviews. Tension and stiffness are often used interchangeably despite having different physical meanings. The role of the thyroarytenoid muscle in regulating medial compression of the membranous vocal folds is often understated. On the other hand, research on voice production often focuses on the glottal flow and vocal fold vibration, but would benefit from a broader consideration of the acoustics of the produced voice and their implications for voice communication.

This paper provides a review on our current understanding of the cause-effect relation between voice physiology, voice production, and voice perception, with the hope that it will help better bridge research efforts in different aspects of voice studies. An overview of vocal fold physiology is presented in Sec. II , with an emphasis on laryngeal regulation of the geometry, mechanical properties, and position of the vocal folds. The physical mechanisms of self-sustained vocal fold vibration and sound generation are discussed in Sec. III , with a focus on the roles of various physical components and features in initiating phonation and affecting the produced acoustics. Some misconceptions of the voice production physics are also clarified. Section IV discusses the physiologic control of F0, vocal intensity, and voice quality. Section V reviews past and current efforts in developing mechanical and computational models of voice production. Issues and future challenges in establishing a causal theory of voice production and perception are discussed in Sec. VI .

The human vocal system includes the lungs and the lower airway that function to supply air pressure and airflow (a review of the mechanics of the subglottal system can be found in Hixon, 1987 ), the vocal folds whose vibration modulates the airflow and produces voice source, and the vocal tract that modifies the voice source and thus creates specific output sounds. The vocal folds are located in the larynx and form a constriction to the airway [Fig. 1(a) ]. Each vocal fold is about 11–15 mm long in adult women and 17–21 mm in men, and stretches across the larynx along the anterior-posterior direction, attaching anteriorly to the thyroid cartilage and posteriorly to the anterolateral surface of the arytenoid cartilages [Fig. 1(c) ]. Both the arytenoid [Fig. 1(d) ] and thyroid [Fig. 1(e) ] cartilages sit on top of the cricoid cartilage and interact with it through the cricoarytenoid joint and cricothyroid joint, respectively. The relative movement of these cartilages thus provides a means to adjust the geometry, mechanical properties, and position of the vocal folds, as further discussed below. The three-dimensional airspace between the two opposing vocal folds is the glottis. The glottis can be divided into a membranous portion, which includes the anterior portion of the glottis and extends from the anterior commissure to the vocal process of the arytenoid, and a cartilaginous portion, which is the posterior space between the arytenoid cartilages.

FIG. 1. (Color online) (a) Coronal view of the vocal folds and the airway; (b) histological structure of the vocal fold lamina propria in the coronal plane (image provided by Dr. Jennifer Long of UCLA); (c) superior view of the vocal folds, cartilaginous framework, and laryngeal muscles; (d) medial view of the cricoarytenoid joint formed between the arytenoid and cricoid cartilages; (e) posterolateral view of the cricothyroid joint formed by the thyroid and the cricoid cartilages. The arrows in (d) and (e) indicate direction of possible motions of the arytenoid and cricoid cartilages due to LCA and CT muscle activation, respectively.


The vocal folds are layered structures, consisting of an inner muscular layer (the thyroarytenoid muscle) with muscle fibers aligned primarily along the anterior-posterior direction, a soft tissue layer of the lamina propria, and an outermost epithelium layer [Figs. 1(a) and 1(b)]. The thyroarytenoid (TA) muscle is sometimes divided into a medial and a lateral bundle, with each bundle responsible for a certain vocal fold posturing function. However, such functional division is still a topic of debate ( Zemlin, 1997 ). The lamina propria consists of the extracellular matrix (ECM) and interstitial substances. The two primary ECM proteins are the collagen and elastin fibers, which are aligned mostly along the length of the vocal folds in the anterior-posterior direction ( Gray et al. , 2000 ). Based on the density of the collagen and elastin fibers [Fig. 1(b)], the lamina propria can be divided into a superficial layer with limited and loose elastin and collagen fibers, an intermediate layer of dominantly elastin fibers, and a deep layer of mostly dense collagen fibers ( Hirano and Kakita, 1985 ; Kutty and Webb, 2009 ). In comparison, the lamina propria (about 1 mm thick) is much thinner than the TA muscle.

Conceptually, the vocal fold is often simplified into a two-layer body-cover structure ( Hirano, 1974 ; Hirano and Kakita, 1985 ). The body layer includes the muscular layer and the deep layer of the lamina propria, and the cover layer includes the intermediate and superficial lamina propria and the epithelium layer. This body-cover concept of vocal fold structure will be adopted in the discussions below. Another grouping scheme divides the vocal fold into three layers: in addition to a body and a cover layer, the intermediate and deep layers of the lamina propria are grouped into a vocal ligament layer ( Hirano, 1975 ). It is hypothesized that this layered structure plays a functional role in phonation, with different combinations of mechanical properties in different layers leading to production of different voice source characteristics ( Hirano, 1974 ). However, because of the lack of data on the mechanical properties of each vocal fold layer and how they vary under different conditions of laryngeal muscle activation, a definitive understanding of the functional roles of each vocal fold layer is still missing.

The mechanical properties of the vocal folds have been quantified using various methods, including tensile tests ( Hirano and Kakita, 1985 ; Zhang et al. , 2006b ; Kelleher et al. , 2013a ), shear rheometry ( Chan and Titze, 1999 ; Chan and Rodriguez, 2008 ; Miri et al. , 2012 ), indentation ( Haji et al. , 1992a , b ; Tran et al. , 1993 ; Chhetri et al. , 2011 ), and a surface wave method ( Kazemirad et al. , 2014 ). These studies showed that the vocal folds exhibit a nonlinear, anisotropic, viscoelastic behavior. A typical stress-strain curve of the vocal folds under anterior-posterior tensile test is shown in Fig. 2 . The slope of the curve, or stiffness, quantifies the extent to which the vocal folds resist deformation in response to an applied force. In general, after an initial linear range, the slope of the stress-strain curve (stiffness) increases gradually with further increase in the strain (Fig. 2 ), presumably due to the gradual engagement of the collagen fibers. Such nonlinear mechanical behavior provides a means to regulate vocal fold stiffness and tension through vocal fold elongation or shortening, which plays an important role in the control of the F0 or pitch of voice production. Typically, the stress is higher during loading than unloading, indicating a viscous behavior of the vocal folds. Due to the presence of the AP-aligned collagen, elastin, and muscle fibers, the vocal folds also exhibit anisotropic mechanical properties, stiffer along the AP direction than in the transverse plane. Experiments ( Hirano and Kakita, 1985 ; Alipour and Vigmostad, 2012 ; Miri et al. , 2012 ; Kelleher et al. , 2013a ) showed that the Young's modulus along the AP direction in the cover layer is more than 10 times (as high as 80 times in Kelleher et al. , 2013a ) larger than in the transverse plane. Stiffness anisotropy has been shown to facilitate medial-lateral motion of the vocal folds ( Zhang, 2014 ) and complete glottal closure during phonation ( Xuan and Zhang, 2014 ).

FIG. 2. Typical tensile stress-strain curve of the vocal fold along the anterior-posterior direction during loading and unloading at 1 Hz. The slope of the tangent line (dashed lines) to the stress-strain curve quantifies the tangent stiffness. The stress is typically higher during loading than unloading due to the viscous behavior of the vocal folds. The curve was obtained by averaging data over 30 cycles after a 10-cycle preconditioning.

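The tangent-stiffness idea in Fig. 2 can be made concrete numerically. The exponential stress-strain law below is a common soft-tissue idealization assumed purely for illustration, not the measured vocal fold constitutive behavior; its slope grows with strain, mimicking gradual collagen fiber engagement.

```python
import numpy as np

# Sketch: tangent stiffness as the local slope of a stress-strain curve.
# sigma = a * (exp(b * eps) - 1) is an assumed soft-tissue idealization,
# not the measured vocal fold law; its slope increases with strain.
a, b = 2.0, 8.0                        # illustrative material constants, kPa
strain = np.linspace(0.0, 0.4, 9)      # elongation up to 40%
stress = a * (np.exp(b * strain) - 1)

tangent = np.gradient(stress, strain)  # numerical d(sigma)/d(eps)
for e, E in zip(strain, tangent):
    print(f"strain {e:0.2f} -> tangent stiffness {E:7.1f} kPa")
```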

Accurate measurement of vocal fold mechanical properties at typical phonation conditions is challenging, due to both the small size of the vocal folds and the relatively high frequency of phonation. Although tensile tests and shear rheometry allow direct measurement of material moduli, the small sample size often leads to difficulties in mounting tissue samples to the testing equipment, raising concerns about accuracy. These two methods also require dissecting tissue samples from the vocal folds and the laryngeal framework, making in vivo measurement impossible. The indentation method is ideal for in vivo measurement and, because of the small size of the indenters used, allows characterization of the spatial variation of mechanical properties of the vocal folds. However, it is limited to measurements at conditions of small deformation. Although large indentation depths can be used, data interpretation then becomes difficult, and the method is thus not suitable for assessing the nonlinear mechanical properties of the vocal folds.

There has been some recent work toward understanding the contribution of individual ECM components to the macro-mechanical properties of the vocal folds and developing a structurally based constitutive model of the vocal folds (e.g., Chan et al. , 2001 ; Kelleher et al. , 2013b ; Miri et al. , 2013 ). The contribution of interstitial fluid to the viscoelastic properties of the vocal folds and to vocal fold stress during vibration and collision has also been investigated using a biphasic model of the vocal folds, in which the vocal fold was modeled as a solid phase interacting with an interstitial fluid phase ( Zhang et al. , 2008 ; Tao et al. , 2009 ; Tao et al. , 2010 ; Bhattacharya and Siegmund, 2013 ). This structurally based approach has the potential to predict vocal fold mechanical properties from the distribution of collagen and elastin fibers and interstitial fluids, which may provide new insights into the differential mechanical properties between the vocal fold layers at different physiologic conditions.

Voice communication requires fine control and adjustment of pitch, loudness, and voice quality. Physiologically, such adjustments are made through laryngeal muscle activation, which stiffens, deforms, or repositions the vocal folds, thus controlling the geometry and mechanical properties of the vocal folds and glottal configuration.

One important posturing is adduction/abduction of the vocal folds, which is primarily achieved through motion of the arytenoid cartilages. Anatomical analysis and numerical simulations have shown that the cricoarytenoid joint allows the arytenoid cartilages to slide along and rotate about the long axis of the cricoid cartilage, but constrains arytenoid rotation about the short axis of the cricoid cartilage ( Selbie et al. , 1998 ; Hunter et al. , 2004 ; Yin and Zhang, 2014 ). Activation of the lateral cricoarytenoid (LCA) muscles, which attach anteriorly to the cricoid cartilage and posteriorly to the arytenoid cartilages, induces mainly an inward rotation of the arytenoid about the cricoid cartilage in the coronal plane, moving the posterior portion of the vocal folds toward the glottal midline. Activation of the interarytenoid (IA) muscles, which connect the posterior surfaces of the two arytenoids, slides and approximates the arytenoid cartilages [Fig. 1(c) ], thus closing the cartilaginous glottis. Because both muscles act on the posterior portion of the vocal folds, their combined action is able to completely close the posterior portion of the glottis, but is less effective in closing the mid-membranous glottis (Fig. 3 ; Choi et al. , 1993 ; Chhetri et al. , 2012 ; Yin and Zhang, 2014 ). Because of this inefficiency in mid-membranous approximation, LCA/IA muscle activation is unable to produce medial compression between the two vocal folds in the membranous portion, contrary to common understanding ( Klatt and Klatt, 1990 ; Hixon et al. , 2008 ). Complete closure and medial compression of the mid-membranous glottis require the activation of the TA muscle ( Choi et al. , 1993 ; Chhetri et al. , 2012 ). The TA muscle forms the bulk of the vocal folds and stretches from the thyroid prominence to the anterolateral surface of the arytenoid cartilages (Fig. 1 ). Activation of the TA muscle produces a whole-body rotation of the vocal folds in the horizontal plane, about the point of their anterior attachment to the thyroid cartilage, toward the glottal midline ( Yin and Zhang, 2014 ). This rotational motion is able to completely close the membranous glottis but often leaves a gap posteriorly (Fig. 3 ). Complete closure of both the membranous and cartilaginous glottis thus requires combined activation of the LCA/IA and TA muscles. The posterior cricoarytenoid (PCA) muscles are primarily responsible for opening the glottis but may also play a role in voice production at very high pitches, as discussed below.

FIG. 3. Activation of the LCA/IA muscles completely closes the posterior glottis but leaves a small gap in the membranous glottis, whereas TA activation completely closes the anterior glottis but leaves a gap at the posterior glottis. From unpublished stroboscopic recordings from the in vivo canine larynx experiments in Choi et al. (1993).


Vocal fold tension is regulated by elongating or shortening the vocal folds. Because of the nonlinear material properties of the vocal folds, changing vocal fold length also leads to changes in vocal fold stiffness, which would otherwise stay constant for linear materials. The two laryngeal muscles involved in regulating vocal fold length are the cricothyroid (CT) muscle and the TA muscle. The CT muscle consists of two bundles. The vertically oriented bundle, the pars recta, connects the anterior surface of the cricoid cartilage and the lower border of the thyroid lamina. Its contraction approximates the thyroid and cricoid cartilages anteriorly through a rotation about the cricothyroid joint. The other bundle, the pars oblique, is oriented upward and backward, connecting the anterior surface of the cricoid cartilage to the inferior cornu of the thyroid cartilage. Its contraction displaces the cricoid and arytenoid cartilages backwards ( Stone and Nuttall, 1974 ), although the thyroid cartilage may also move forward slightly. Contraction of both bundles thus elongates the vocal folds and increases the stiffness and tension in both the body and cover layers of the vocal folds. In contrast, activation of the TA muscle, which forms the body layer of the vocal folds, increases the stiffness and tension in the body layer. Activation of the TA muscle, in addition to an initial effect of mid-membranous vocal fold approximation, also shortens the vocal folds, which decreases both the stiffness and tension in the cover layer ( Hirano and Kakita, 1985 ; Yin and Zhang, 2013 ). One exception is when the tension in the vocal fold cover is already negative (i.e., under compression), in which case shortening the vocal folds further through TA activation decreases tension (i.e., increases the compressive force) but may increase stiffness in the cover layer. Activation of the LCA/IA muscles generally does not change the vocal fold length much and thus has only a slight effect on vocal fold stiffness and tension ( Chhetri et al. , 2009 ; Yin and Zhang, 2014 ). However, activation of the LCA/IA muscles (and also the PCA muscles) does stabilize the arytenoid cartilage and prevent it from moving forward when the cricoid cartilage is pulled backward by CT muscle activation, thus facilitating extreme vocal fold elongation, particularly for high-pitch voice production. As noted above, due to the lack of reliable measurement methods, our understanding of how vocal fold stiffness and tension vary under different muscular activation conditions is limited.

Activation of the CT and TA muscles also changes the medial surface shape of the vocal folds and the glottal channel geometry. Specifically, TA muscle activation causes the inferior part of the medial surface to bulge out toward the glottal midline (Hirano and Kakita, 1985; Hirano, 1988; Vahabzadeh-Hagh et al., 2016), thus increasing the vertical thickness of the medial surface. In contrast, CT activation reduces this vertical thickness. Although many studies have investigated the effect of the prephonatory glottal shape (convergent, straight, or divergent) on phonation (Titze, 1988a; Titze et al., 1995), a recent study showed that the glottal channel geometry remains largely straight under most conditions of laryngeal muscle activation (Vahabzadeh-Hagh et al., 2016).

The phonation process starts with adduction of the vocal folds, which approximates the vocal folds to reduce or close the glottis. Contraction of the lungs initiates airflow and builds up pressure below the glottis. When this subglottal pressure exceeds a certain threshold pressure, the vocal folds are excited into self-sustained vibration. Vocal fold vibration in turn modulates the glottal airflow into a pulsating jet flow, which eventually develops into turbulent flow in the vocal tract.

In general, three major sound production mechanisms are involved in this process (McGowan, 1988; Hofmans, 1998; Zhao et al., 2002; Zhang et al., 2002a): a monopole sound source due to the volume of air displaced by vocal fold vibration, a dipole sound source due to the fluctuating force applied by the vocal folds to the airflow, and a quadrupole sound source due to turbulence developed immediately downstream of the glottal exit. When the false vocal folds are tightly adducted, an additional dipole source may arise as the glottal jet impinges on the false vocal folds (Zhang et al., 2002b). The monopole sound source is generally small, considering that the vocal folds are nearly incompressible and thus the net volume flow displacement is small. The dipole source is generally considered the dominant sound source and is responsible for the harmonic component of the produced sound. The quadrupole sound source is generally much weaker than the dipole source in magnitude, but it is responsible for broadband sound production at high frequencies.

For the harmonic component of the voice source, an equivalent monopole sound source can be defined at a plane just downstream of the region of major sound sources, with the source strength equal to the instantaneous pulsating glottal volume flow rate. In the source-filter theory of phonation (Fant, 1970), this monopole sound source is the input signal to the vocal tract, which acts as a filter and shapes the source spectrum into different sounds before the sound is radiated from the mouth as the voice we hear. Because sound radiation from the mouth acts approximately as a differentiator, the radiated sound pressure is proportional to the time derivative of the glottal flow. Thus, in the voice literature, the time derivative of the glottal flow, rather than the glottal flow itself, is considered the voice source.

The phonation cycle is often divided into an open phase, in which the glottis opens (the opening phase) and closes (the closing phase), and a closed phase, in which the glottis is closed or maintains a minimum opening when glottal closure is incomplete. The glottal flow increases and decreases during the open phase, and remains zero (or at a minimum, when glottal closure is incomplete) during the closed phase (Fig. 4). Compared to the glottal area waveform, the glottal flow waveform reaches its peak later in the cycle, so that the glottal flow waveform is skewed to the right. This rightward skewing is due to the acoustic mass in the glottis and the vocal tract (when the F0 is lower than a nearby vocal tract resonance frequency), which delays the increase in glottal flow during the opening phase and produces a faster decay in glottal flow during the closing phase (Rothenberg, 1981; Fant, 1982). Because of this skewing, the negative peak of the time derivative of the glottal flow in the closing phase is often much more dominant than the positive peak in the opening phase. The instant of the most negative peak is thus considered the point of main excitation of the vocal tract, and the corresponding negative peak, also referred to as the maximum flow declination rate (MFDR), is a major determinant of the peak amplitude of the produced voice. After the negative peak, the time derivative of the glottal flow waveform returns to zero as phonation enters the closed phase.
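To make these waveform quantities concrete, the following minimal Python sketch builds one period of a stylized, right-skewed glottal flow pulse, differentiates it, and locates the MFDR. The pulse shape, open quotient, and F0 are illustrative assumptions, not values from the literature.

```python
import numpy as np

# Minimal sketch (all parameters assumed for illustration): one period of
# a right-skewed glottal flow pulse, its time derivative, and the MFDR,
# i.e., the most negative value of dU/dt near glottal closure.
fs = 44100                       # sampling rate (Hz)
f0 = 100.0                       # fundamental frequency (Hz)
T = 1.0 / f0                     # oscillation period (s)
To = 0.6 * T                     # open phase duration (assumed open quotient 0.6)
t = np.arange(int(fs * T)) / fs

U = np.zeros_like(t)             # glottal flow, zero during the closed phase
is_open = t < To
x = t[is_open] / To              # normalized time within the open phase
U[is_open] = 6.75 * x**2 * (1.0 - x)   # slow rise, fast fall (peak at x = 2/3)

dU = np.gradient(U, 1.0 / fs)    # time derivative of the glottal flow
mfdr = dU.min()                  # maximum flow declination rate (negative peak)
print(f"MFDR = {mfdr:.0f} (arb. units/s) at t = {1e3 * t[np.argmin(dU)]:.2f} ms")
```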

FIG. 4. (Color online) Typical glottal flow waveform and its time derivative (left) and their correspondence to the spectral slopes of the low-frequency and high-frequency portions of the voice source spectrum (right).

Much work has been done to directly link features of the glottal flow waveform to voice acoustics and potentially voice quality (e.g., Fant, 1979, 1982; Fant et al., 1985; Gobl and Chasaide, 2010). These studies showed that the low-frequency spectral shape (the first few harmonics) of the voice source is primarily determined by the relative duration of the open phase with respect to the oscillation period (To/T in Fig. 4, also referred to as the open quotient). A longer open phase often leads to a more dominant first harmonic (H1) in the low-frequency portion of the resulting voice source spectrum. For a given oscillation period, shortening the open phase causes most of the glottal flow change to occur within a duration (To) that is increasingly shorter than the period T. This produces an energy boost in the low-frequency portion of the source spectrum that peaks around a frequency of 1/To. For a glottal flow waveform with a very short open phase, the second harmonic (H2) or even the fourth harmonic (H4) may become the most dominant harmonic. A voice source with a weak H1 relative to H2 or H4 is often associated with a pressed voice quality.

The spectral slope in the high-frequency range is primarily related to the degree of discontinuity in the time derivative of the glottal flow waveform. Due to the waveform skewing discussed earlier, the most dominant discontinuity often occurs around the instant of main excitation, when the time derivative of the glottal flow waveform returns from the negative peak to zero within a time scale Ta (Fig. 4). For an abrupt glottal flow cutoff (Ta = 0), the time derivative of the glottal flow waveform has a strong discontinuity at the point of main excitation, which causes the voice source spectrum to decay asymptotically at a roll-off rate of −6 dB per octave toward high frequencies. Increasing Ta from zero leads to a gradual return from the negative peak to zero. When approximated by an exponential function, this gradual return functions as a low-pass filter, with a cutoff frequency around 1/Ta, and reduces the excitation of harmonics above that cutoff frequency. Thus, in the frequency range relevant to voice perception, increasing Ta often leads to reduced excitation of the higher-order harmonics. In the extreme case of minimal vocal fold contact, the time derivative of the glottal flow waveform is so smooth that the voice source spectrum contains only a few lower-order harmonics. Perceptually, strong excitation of higher-order harmonics is often associated with a bright sound quality, whereas a voice source with limited excitation of higher-order harmonics is often perceived as weak.
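The low-pass interpretation can be checked against the standard Fourier transform pair for an exponential return with time constant Ta (a textbook result, not specific to this review):

\[ \left| \int_0^\infty e^{-t/T_a}\, e^{-j 2\pi f t}\, dt \right| = \frac{T_a}{\sqrt{1 + \left(2\pi f T_a\right)^2}}, \]

which is flat at low frequencies and rolls off at −6 dB per octave above a corner frequency on the order of 1/Ta, consistent with the reduced harmonic excitation described above.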

Also of perceptual importance is the turbulence noise produced immediately downstream of the glottis. Although small in amplitude, the noise component plays an important role in voice quality perception, particularly for female voices, in which aspiration noise is more persistent than in male voices. While the noise component of voice is often modeled as white noise, its spectrum is often not flat and may exhibit different spectral shapes, depending on the glottal opening and flow rate as well as the vocal tract shape. Interaction between the spectral shape and the relative levels of harmonic and noise energy in the voice source has been shown to influence the perception of voice quality (Kreiman and Gerratt, 2012).

It is worth noting that many of the source parameters are not independent of each other and often co-vary. How they co-vary under different voicing conditions, which is essential to natural speech synthesis, remains the focus of many studies (e.g., Sundberg and Hogset, 2001; Gobl and Chasaide, 2003; Patel et al., 2011).

That vocal fold vibration results from a complex airflow-vocal fold interaction within the glottis rather than repetitive nerve stimulation of the larynx was first recognized by van den Berg (1958) . According to his myoelastic-aerodynamic theory of voice production, phonation starts from complete adduction of the vocal folds to close the glottis, which allows a buildup of the subglottal pressure. The vocal folds remain closed until the subglottal pressure is sufficiently high to push them apart, allowing air to escape and producing a negative (with respect to atmospheric pressure) intraglottal pressure due to the Bernoulli effect. This negative Bernoulli pressure and the elastic recoil pull the vocal folds back and close the glottis. The cycle then repeats, which leads to sustained vibration of the vocal folds.

While the myoelastic-aerodynamic theory correctly identifies the interaction between the vocal folds and airflow as the underlying mechanism of self-sustained vocal fold vibration, it does not explain how energy is transferred from airflow into the vocal folds to sustain this vibration. Traditionally, the negative intraglottal pressure is considered to play an important role in closing the glottis and sustaining vocal fold vibration. However, it is now understood that a negative intraglottal pressure is not a critical requirement for achieving self-sustained vocal fold vibration. Similarly, an alternatingly convergent-divergent glottal channel geometry during phonation has been considered a necessary condition that leads to net energy transfer from airflow into the vocal folds. We will show below that an alternatingly convergent-divergent glottal channel geometry does not always guarantee energy transfer or self-sustained vocal fold vibration.

For flow conditions typical of human phonation, the glottal flow can be reasonably described by Bernoulli's equation up to the point where the airflow separates from the glottal wall, often at the glottal exit, at which the airway suddenly expands. According to Bernoulli's equation, the flow pressure p at a location within the glottal channel with a time-varying cross-sectional area A is

\[ p = P_{\mathrm{sub}} - \left(P_{\mathrm{sub}} - P_{\mathrm{sup}}\right) \frac{A_{\mathrm{sep}}^2}{A^2}, \qquad (1) \]

where P_sub and P_sup are the subglottal and supraglottal pressures, respectively, and A_sep is the time-varying glottal area at the flow separation location. For simplicity, we assume that the flow separates at the upper margin of the medial surface. To achieve a net energy transfer from the airflow to the vocal folds over one cycle, the air pressure on the vocal fold surface has to be at least partially in phase with the vocal fold velocity. Specifically, the intraglottal pressure needs to be higher in the opening phase than in the closing phase of vocal fold vibration, so that the airflow does more work on the vocal folds during opening than the vocal folds do back on the airflow during closing.

Theoretical analysis of the energy transfer between airflow and the vocal folds (Ishizaka and Matsudaira, 1972; Titze, 1988a) showed that this pressure asymmetry can be achieved by a vertical phase difference in vocal fold surface motion (also referred to as a mucosal wave), i.e., different portions of the vocal fold surface do not necessarily move inward and outward together as a whole. This mechanism is illustrated in Fig. 5, the upper left panel of which shows the vocal fold surface shape in the coronal plane at six consecutive, equally spaced instants during one vibration cycle in the presence of a vertical phase difference. Instants 2 and 3 (solid lines) are in the closing phase, whereas instants 5 and 6 (dashed lines) are in the opening phase. Consider, as an example, energy transfer at the lower margin of the medial surface. Because of the vertical phase difference, the glottal channel has a different shape in the opening phase (dashed lines 5 and 6) from that in the closing phase (solid lines 3 and 2) when the lower margin of the medial surface crosses the same locations. In particular, when the lower margin of the medial surface leads the upper margin in phase, the glottal channel during opening (e.g., instant 6) is always more convergent [thus a smaller A_sep/A in Eq. (1)] or less divergent than during closing (e.g., instant 2) for the same location of the lower margin, resulting in an air pressure [Eq. (1)] that is higher in the opening phase than in the closing phase (Fig. 5, top row). As a result, net energy is transferred from the airflow into the vocal folds over one cycle, as indicated by the non-zero area enclosed by the aerodynamic force-vocal fold displacement curve in Fig. 5 (top right). The existence of a vertical phase difference in vocal fold surface motion is generally considered the primary mechanism of phonation onset.
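The following minimal Python sketch puts illustrative numbers (all assumed, not from the paper) to this asymmetry: Eq. (1) is evaluated at the lower margin for a convergent channel during opening and a divergent channel during closing, with the flow assumed to separate at the upper margin.

```python
# Minimal sketch of the energy-transfer asymmetry in Fig. 5 using Eq. (1)
# (all numbers assumed for illustration). With the lower margin leading in
# phase, the channel is convergent when the lower margin passes a given
# position during opening and divergent when it passes the same position
# during closing, so Eq. (1) gives a higher pressure during opening.
P_sub, P_sup = 800.0, 0.0                 # sub-/supraglottal pressure (Pa)

def p_lower(A_lower, A_sep):
    """Eq. (1) evaluated at the lower margin of the medial surface."""
    return P_sub - (P_sub - P_sup) * (A_sep / A_lower) ** 2

A_lower = 0.3                             # lower-margin area (arbitrary units)
p_opening = p_lower(A_lower, A_sep=0.2)   # convergent: upper margin narrower
p_closing = p_lower(A_lower, A_sep=0.45)  # divergent: upper margin wider
print(p_opening, p_closing)               # ~444 Pa vs. -1000 Pa
```

The higher pressure at the same lower-margin position during opening than during closing is exactly the asymmetry that yields net energy transfer over a cycle.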

FIG. 5. Two energy transfer mechanisms. Top row: the presence of a vertical phase difference leads to different medial surface shapes between glottal opening (dashed lines 5 and 6; upper left panel) and closing (solid lines 2 and 3) when the lower margin of the medial surface crosses the same locations, which leads to higher air pressure during glottal opening than closing and net energy transfer from airflow into vocal folds at the lower margin of the medial surface. Middle row: without a vertical phase difference, vocal fold vibration produces an alternatingly convergent-divergent but identical glottal channel geometry between glottal opening and closing (bottom left panel), thus zero energy transfer (middle row). Bottom row: without a vertical phase difference, air pressure asymmetry can be imposed by a negative damping mechanism.

In contrast, without a vertical phase difference, the vocal fold surface during opening (Fig. 5, bottom left; dashed lines 5 and 6) and closing (solid lines 3 and 2) would be identical when the lower margin crosses the same positions, for which Bernoulli's equation would predict symmetric flow pressure between the opening and closing phases, and zero net energy transfer over one cycle (Fig. 5, middle row). Under this condition, the pressure asymmetry between the opening and closing phases has to be provided by an external mechanism that directly imposes a phase difference between the intraglottal pressure and the vocal fold movement. In the presence of such an external mechanism, the intraglottal pressure is no longer the same between opening and closing even when the glottal channel has the same shape as the vocal folds cross the same locations, resulting in a net energy transfer over one cycle from the airflow to the vocal folds (Fig. 5, bottom row). This energy transfer mechanism is often referred to as negative damping, because the intraglottal pressure depends on vocal fold velocity and appears in the system equations of vocal fold motion in a form similar to a damping force, except that energy is transferred to the vocal folds instead of being dissipated. Negative damping is the only energy transfer mechanism in a single degree-of-freedom system, or when the entire medial surface moves in phase as a whole.

In humans, negative damping can be provided by an inertive vocal tract (Flanagan and Landgraf, 1968; Ishizaka and Matsudaira, 1972; Ishizaka and Flanagan, 1972) or a compliant subglottal system (Zhang et al., 2006a). Because the negative damping associated with acoustic loading is significant only for frequencies close to an acoustic resonance, phonation sustained by such negative damping alone always occurs at a frequency close to that acoustic resonance (Flanagan and Landgraf, 1968; Zhang et al., 2006a). Although there is no direct evidence of phonation sustained dominantly by acoustic loading in humans, instabilities in voice production (or voice breaks) have been reported when the fundamental frequency of vocal fold vibration approaches one of the vocal tract resonances (e.g., Titze et al., 2008). On the other hand, this entrainment of the phonation frequency to an acoustic resonance limits independent control of the voice source and the spectral modification by the vocal tract, and is less desirable for effective speech communication. Considering that humans are capable of producing a large variety of voice types independent of vocal tract shape, negative damping due to acoustic coupling to the sub- or supraglottal acoustics is unlikely to be the primary mechanism of energy transfer in voice production. Indeed, excised larynges are able to vibrate without a vocal tract. Moreover, experiments have shown that in humans the vocal folds vibrate at a frequency close to an in vacuo vocal fold resonance (Kaneko et al., 1986; Ishizaka, 1988; Svec et al., 2000) rather than at the acoustic resonances of the sub- and supraglottal tracts, suggesting that phonation is essentially a resonance phenomenon of the vocal folds.

Negative damping can also be provided by glottal aerodynamics. For example, glottal flow acceleration and deceleration may cause the flow to separate at different locations between opening and closing even when the glottis has identical geometry. This is particularly the case for a divergent glottal channel geometry, which often results in asymmetric flow separation and pressure asymmetry between the glottal opening and closing phases (Park and Mongeau, 2007; Alipour and Scherer, 2004). The effect of this negative damping mechanism is expected to be small at phonation onset, at which the vocal fold vibration amplitude, and thus the flow unsteadiness, is small and the glottal channel is less likely to be divergent. However, its contribution to energy transfer may increase with increasing vocal fold vibration amplitude and flow unsteadiness (Howe and McGowan, 2010). It is important to differentiate this asymmetric flow separation between glottal opening and closing due to unsteady flow effects from the quasi-steady asymmetric flow separation caused by asymmetry in the glottal channel geometry between opening and closing. In the latter case, because flow separation may occur at a more upstream location for a divergent glottal channel than for a convergent one, an asymmetric glottal channel geometry (e.g., a glottis that is convergent during opening and divergent during closing) may lead to asymmetric flow separation between glottal opening and closing. Compared to conditions of a fixed flow separation (i.e., flow separating at the same location throughout the cycle, as in Fig. 5), such geometry-induced asymmetric flow separation actually reduces the pressure asymmetry between glottal opening and closing [this can be shown using Eq. (1)] and thus weakens net energy transfer. In reality, these two types of asymmetric flow separation (due to unsteady effects or to changes in glottal channel geometry) interact and can result in very complex flow separation patterns (Alipour and Scherer, 2004; Sciamarella and Le Quere, 2008; Sidlof et al., 2011), which may or may not enhance energy transfer.

From the discussion above, it is clear that a negative Bernoulli pressure is not a critical requirement for either of the two mechanisms. Being proportional to vocal fold displacement, the negative Bernoulli pressure is not a negative damping and does not directly provide the required pressure asymmetry between glottal opening and closing. Moreover, the existence of a vertical phase difference in vocal fold vibration is determined primarily by vocal fold properties (as discussed below), rather than by whether the intraglottal pressure is positive or negative during a certain phase of the oscillation cycle.

Although a vertical phase difference in vocal fold vibration leads to a time-varying glottal channel geometry, an alternatingly convergent-divergent glottal channel geometry does not guarantee self-sustained vocal fold vibration. For example, although the in-phase vocal fold motion in the bottom left of Fig. 5 (the entire medial surface moving in and out together) leads to an alternatingly convergent-divergent glottal geometry, the glottal geometry is identical between glottal opening and closing, and thus this motion is unable to produce net energy transfer into the vocal folds without a negative damping mechanism (Fig. 5, middle row). In other words, an alternatingly convergent-divergent glottal geometry is an effect, not a cause, of self-sustained vocal fold vibration. Indeed, the glottis can maintain a convergent or divergent shape during the entire oscillation cycle and still self-oscillate, as observed in experiments using physical vocal fold models that had a divergent shape during most of the oscillation cycle (Zhang et al., 2006a).

The above shows that net energy transfer from the airflow into the vocal folds is possible in the presence of a vertical phase difference. But how is this vertical phase difference established, and what determines it and the resulting vocal fold vibration pattern? In voice production, vocal fold vibration with a vertical phase difference results from a process of eigenmode synchronization, in which two or more in vacuo eigenmodes of the vocal folds are synchronized to vibrate at the same frequency but with a phase difference (Ishizaka and Matsudaira, 1972; Ishizaka, 1981; Horacek and Svec, 2002; Zhang et al., 2007), in the same way that a travelling wave is formed by the superposition of two standing waves. An eigenmode, or resonance, is a pattern of motion of the system that is allowed by physical laws and the boundary constraints on the system. In general, for each mode, the vibration pattern is such that all parts of the system move either in phase or 180° out of phase, similar to a standing wave. Each eigenmode has an inherently distinct eigenfrequency (or resonance frequency) at which the eigenmode can be maximally excited. A familiar example of eigenmodes in speech science is formants, which are peaks in the output voice spectrum due to excitation of the acoustic resonances of the vocal tract, with the formant frequencies dependent on vocal tract geometry. Figure 6 shows three typical eigenmodes of the vocal fold in the coronal plane. In Fig. 6, the thin line indicates the resting vocal fold surface shape, whereas the solid and dashed lines indicate the extreme positions of the vocal fold when vibrating at the corresponding eigenmode, spaced 180° apart in a vibratory cycle. The first eigenmode shows an up-and-down motion in the vertical direction, which does not modulate glottal airflow much. The second eigenmode has a dominantly in-phase medial-lateral motion along the medial surface, which does modulate airflow. The third eigenmode also exhibits dominantly medial-lateral motion, but the upper portion of the medial surface vibrates 180° out of phase with the lower portion. Out-of-phase motion as in the third eigenmode is essential to achieving vocal fold vibration with a large vertical phase difference, e.g., when synchronized with an in-phase eigenmode as in Fig. 6(b).

FIG. 6. Typical vocal fold eigenmodes exhibiting (a) a dominantly superior-inferior motion, (b) a medial-lateral in-phase motion, and (c) a medial-lateral out-of-phase motion along the medial surface.

In the absence of airflow, the vocal fold in vacuo eigenmodes are generally neutral or damped, meaning that when excited they gradually decay in amplitude with time. When the vocal folds are subject to airflow, however, the vocal fold-airflow coupling modifies the eigenmodes and, under some conditions, synchronizes two eigenmodes to the same frequency (Fig. 7). Although vibration in each eigenmode by itself does not produce net energy transfer (Fig. 5, middle row), when two modes are synchronized at the same frequency but with a phase difference in time, the vibration velocity associated with one eigenmode [e.g., the eigenmode in Fig. 6(b)] will be at least partially in phase with the pressure induced by the other eigenmode [e.g., the eigenmode in Fig. 6(c)], and this cross-mode pressure-velocity interaction produces net energy transfer into the vocal folds (Ishizaka and Matsudaira, 1972; Zhang et al., 2007).
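The phase requirement can be illustrated with a toy calculation (not from the paper; the sinusoids and the 90° offset are assumptions chosen for clarity): the cycle-averaged power is zero for a pressure in phase with a mode's own displacement, but non-zero for a pressure induced by a second mode that leads in phase.

```python
import numpy as np

# Illustrative sketch of cross-mode energy transfer: pressure induced by
# one synchronized eigenmode is partially in phase with the velocity of
# the other, so the average power over a cycle is non-zero, whereas a
# single mode does no net work on itself.
t = np.linspace(0.0, 1.0, 10000, endpoint=False)  # one normalized cycle
v2 = -2 * np.pi * np.sin(2 * np.pi * t)           # velocity of an in-phase mode
p_self = np.cos(2 * np.pi * t)                    # pressure in phase with its own displacement
p_cross = np.cos(2 * np.pi * t + np.pi / 2)       # pressure induced by a mode 90 deg ahead
print(np.mean(p_self * v2))    # ~0: no net energy transfer from a single mode
print(np.mean(p_cross * v2))   # ~3.14: net energy transfer per cycle (arb. units)
```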

FIG. 7. A typical eigenmode synchronization pattern. The evolution of the first three eigenmodes is shown as a function of the subglottal pressure. As the subglottal pressure increases, the frequencies (top) of the second and third vocal fold eigenmodes gradually approach each other and, at a threshold subglottal pressure, synchronize to the same frequency. At the same time, the growth rate (bottom) of the second mode becomes positive, indicating the coupled airflow-vocal fold system becomes linearly unstable and phonation starts.

The minimum subglottal pressure required to synchronize two eigenmodes and initiate net energy transfer, or the phonation threshold pressure, is proportional to the frequency spacing between the two eigenmodes being synchronized and inversely proportional to the coupling strength between them (Zhang, 2010):

\[ P_{\mathrm{th}} \propto \frac{\omega_{0,2} - \omega_{0,1}}{\beta}, \qquad (2) \]

where ω_0,1 and ω_0,2 are the eigenfrequencies of the two in vacuo eigenmodes participating in the synchronization process and β is the coupling strength between the two eigenmodes. Thus, the closer the two eigenmodes are to each other in frequency, or the more strongly they are coupled, the less pressure is required to synchronize them. This is particularly the case in an anisotropic material such as the vocal folds, in which the AP stiffness is much larger than the stiffness in the transverse plane. Under such anisotropic stiffness conditions, the first few in vacuo vocal fold eigenfrequencies tend to cluster together and are much closer to each other than under isotropic stiffness conditions (Titze and Strong, 1975; Berry, 2001). Such clustering of eigenmodes makes it possible to initiate vocal fold vibration at very low subglottal pressures.

The coupling strength β between the two eigenmodes in Eq. (2) depends on the prephonatory glottal opening, with the coupling strength increasing with decreasing glottal opening (thus lowered phonation threshold pressure). In addition, the coupling strength also depends on the spatial similarity between the air pressure distribution over the vocal fold surface induced by one eigenmode and vocal fold surface velocity of the other eigenmode ( Zhang, 2010 ). In other words, the coupling strength β quantifies the cross-mode energy transfer efficiency between the eigenmodes that are being synchronized. The higher the degree of cross-mode pressure-velocity similarity, the better the two eigenmodes are coupled, and the less subglottal pressure is required to synchronize them.
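A toy numerical illustration of the scaling in Eq. (2) follows (all numbers hypothetical; the proportionality constant is unspecified in the text and set to 1 here).

```python
# Toy illustration of Eq. (2): threshold pressure grows with eigenfrequency
# spacing and falls with coupling strength (k is a hypothetical constant).
def p_threshold(f1, f2, beta, k=1.0):
    """Phonation threshold ~ eigenfrequency spacing / coupling strength."""
    return k * abs(f2 - f1) / beta

print(p_threshold(120.0, 160.0, beta=0.5))  # widely spaced eigenmodes: 80.0
print(p_threshold(120.0, 130.0, beta=0.5))  # clustered eigenmodes: 20.0
print(p_threshold(120.0, 130.0, beta=1.0))  # stronger coupling: 10.0
```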

In reality, the vocal folds have an infinite number of eigenmodes. Which eigenmodes are synchronized and eventually excited depends on the frequency spacing and relative coupling strength among different eigenmodes. Because vocal fold vibration depends on the eigenmodes that are eventually excited, changes in the eigenmode synchronization pattern often lead to changes in the F0, vocal fold vibration pattern, and the resulting voice quality. Previous studies have shown that a slight change in vocal fold properties such as stiffness or medial surface shape may cause phonation to occur at a different eigenmode, leading to a qualitatively different vocal fold vibration pattern and abrupt changes in F0 ( Tokuda et al. , 2007 ; Zhang, 2009 ). Eigenmode synchronization is not limited to two vocal fold eigenmodes, either. It may also occur between a vocal fold eigenmode and an eigenmode of the subglottal or supraglottal system. In this sense, the negative damping due to subglottal or supraglottal acoustic loading can be viewed as the result of synchronization between one of the vocal fold modes and one of the acoustic resonances.

Eigenmode synchronization discussed above corresponds to a 1:1 temporal synchronization of two eigenmodes. For a certain range of vocal fold conditions, e.g., when asymmetry (left-right or anterior-posterior) exists in the vocal system or when the vocal folds are strongly coupled with the sub- or supra-glottal acoustics, synchronization may occur so that the two eigenmodes are synchronized not toward the same frequency, but at a frequency ratio of 1:2, 1:3, etc., leading to subharmonics or biphonation ( Ishizaka and Isshiki, 1976 ; Herzel, 1993 ; Herzel et al. , 1994 ; Neubauer et al. , 2001 ; Berry et al. , 1994 ; Berry et al. , 2006 ; Titze, 2008 ; Lucero et al. , 2015 ). Temporal desynchronization of eigenmodes often leads to irregular or chaotic vocal fold vibration ( Herzel et al. , 1991 ; Berry et al. , 1994 ; Berry et al. , 2006 ; Steinecke and Herzel, 1995 ). Transition between different synchronization patterns, or bifurcation, often leads to a sudden change in the vocal fold vibration pattern and voice quality.

These studies show that the nonlinear interaction between vocal fold eigenmodes is a central feature of the phonation process, with different synchronization or desynchronization patterns producing a large variety of voice types. Thus, by changing the geometrical and biomechanical properties of the vocal folds, either through laryngeal muscle activation or mechanical modification as in phonosurgery, we can select eigenmodes and eigenmode synchronization pattern to control or modify our voice, in the same way as we control speech formants by moving articulators in the vocal tract to modify vocal tract acoustic resonances.

The concept of eigenmode and eigenmode synchronization is also useful for phonation modeling, because eigenmodes can be used as building blocks to construct more complex motion of the system. Often, only the first few eigenmodes are required for adequate reconstruction of complex vocal fold vibrations (both regular and irregular; Herzel et al. , 1994 ; Berry et al. , 1994 ; Berry et al. , 2006 ), which would significantly reduce the degrees of freedom required in computational models of phonation.

An important feature of normal phonation is the complete closure of the membranous glottis during vibration, which is essential to the production of high-frequency harmonics. Incomplete closure of the membranous glottis, as is often observed in pathological conditions, leads to a voice of weak and/or breathy quality.

It is generally assumed that approximation of the vocal folds through arytenoid adduction is sufficient to achieve glottal closure during phonation, with the duration of glottal closure, or the closed quotient, increasing with the degree of vocal fold approximation. While a certain degree of vocal fold approximation is obviously required for glottal closure, there is evidence suggesting that other factors are also at play. For example, excised larynx experiments have shown that some larynges vibrate with incomplete glottal closure even when the arytenoids are tightly sutured together (Isshiki, 1989; Zhang, 2011). Similar incomplete glottal closure has also been observed in experiments using physical vocal fold models with isotropic material properties (Thomson et al., 2005; Zhang et al., 2006a). In these experiments, increasing the subglottal pressure increased the vocal fold vibration amplitude but often did not improve the glottal closure pattern (Xuan and Zhang, 2014). These studies show that additional stiffness or geometric conditions are required to achieve complete membranous glottal closure.

Recent studies have started to provide some insight into these additional biomechanical conditions. Xuan and Zhang (2014) showed that embedding fibers along the anterior-posterior direction in otherwise isotropic models improves glottal closure. With an additional thin, stiffer outermost layer simulating the epithelium, these physical models are able to vibrate with a considerably long closed period. Interestingly, this improvement in the glottal closure pattern occurred only when the fibers were embedded close to the vocal fold surface in the cover layer; embedding fibers in the body layer did not improve the closure pattern at all. This suggests a possible functional role of the collagen and elastin fibers in the intermediate and deep layers of the lamina propria in facilitating glottal closure during vibration.

The difference in the glottal closure pattern between isotropic and anisotropic vocal folds may have several causes. Compared to isotropic vocal folds, anisotropic vocal folds (or fiber-embedded models) are better able to maintain their adductory position against the subglottal pressure and are less likely to be pushed apart by air pressure (Zhang, 2011). In addition, embedding fibers along the AP direction may also enhance medial-lateral motion, further facilitating glottal closure. Zhang (2014) showed that the first few in vacuo eigenmodes of isotropic vocal folds exhibit similar in-phase, up-and-down, swing-like motion, with the medial-lateral and superior-inferior motions locked in a similar phase relationship. Synchronization of modes with similar vibration patterns necessarily leads to qualitatively the same vibration pattern, in this case an up-and-down swing-like motion dominantly along the superior-inferior direction, as observed in recent physical model experiments (Thomson et al., 2005; Zhang et al., 2006a). In contrast, for vocal folds with an AP stiffness much higher than the transverse stiffness, the first few in vacuo modes exhibit qualitatively distinct vibration patterns, and the medial-lateral and superior-inferior motions are no longer locked in a similar phase relationship. This makes it possible to strongly excite large medial-lateral motion without proportional excitation of superior-inferior motion. As a result, anisotropic models exhibit large medial-lateral motion with a vertical phase difference along the medial surface. The improved capability to maintain adductory position against the subglottal pressure and to vibrate with large medial-lateral motion may explain the improved glottal closure pattern observed in the experiment of Xuan and Zhang (2014).

Geometrically, a thin vocal fold has been shown to be easily pushed apart by the subglottal pressure ( Zhang, 2016a ). Although a thin anisotropic vocal fold vibrates with a dominantly medial-lateral motion, this is insufficient to overcome its inability to maintain position against the subglottal pressure. As a result, the glottis never completely closes during vibration, which leads to a relatively smooth glottal flow waveform and weak excitation of higher-order harmonics in the radiated output voice spectrum ( van den Berg, 1968 ; Zhang, 2016a ). Increasing vertical thickness of the medial surface allows the vocal fold to better resist the glottis-opening effect of the subglottal pressure, thus maintaining the adductory position and achieving complete glottal closure.

Once these additional stiffness and geometric conditions (i.e., a certain degree of stiffness anisotropy and a not-too-small vertical vocal fold thickness) are met, the duration of glottal closure can be regulated by varying the vertical phase difference in vocal fold motion along the medial surface. A non-zero vertical phase difference means that, when the lower margins of the medial surfaces start to open, the glottis continues to remain closed until the upper margins start to open. One important parameter affecting the vertical phase difference is the vertical thickness of the medial surface, or the degree of medial bulging in the inferior portion of the medial surface. For the same conditions of vocal fold stiffness and approximation, the vertical phase difference during vocal fold vibration increases with increasing vertical medial surface thickness (Fig. 8). Thus, the thicker the medial surface, the larger the vertical phase difference, and the longer the closed phase (Fig. 8; van den Berg, 1968; Alipour and Scherer, 2000; Zhang, 2016a). Similarly, the vertical phase difference, and thus the duration of glottal closure, can also be increased by reducing the elastic surface wave speed in the superior-inferior direction (Ishizaka and Flanagan, 1972; Story and Titze, 1995), which depends primarily on the stiffness in the transverse plane and to a lesser degree on the AP stiffness, or by increasing the body-cover stiffness ratio (Story and Titze, 1995; Zhang, 2009).
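A purely kinematic Python sketch of this effect follows (the sinusoidal margin motion, amplitudes, and mean opening are assumptions for illustration, not a vibration model): the glottis is treated as closed whenever either the lower or the upper margin is at or past the midline, so a larger vertical phase difference lengthens the closed phase.

```python
import numpy as np

# Kinematic sketch (assumed sinusoidal motion, arbitrary units): closed
# quotient (CQ) as a function of the vertical phase difference (VPD)
# between the lower and upper margins of the medial surface.
t = np.linspace(0.0, 1.0, 100000, endpoint=False)  # one normalized cycle

def closed_quotient(vpd_deg, amplitude=1.0, mean_opening=0.4):
    phase = np.deg2rad(vpd_deg)                    # upper margin lags the lower
    g_lower = mean_opening + amplitude * np.cos(2 * np.pi * t)
    g_upper = mean_opening + amplitude * np.cos(2 * np.pi * t - phase)
    closed = (g_lower <= 0.0) | (g_upper <= 0.0)   # contact at either margin
    return closed.mean()                           # fraction of cycle closed

for vpd in (0.0, 30.0, 60.0):
    print(f"VPD = {vpd:4.0f} deg -> closed quotient = {closed_quotient(vpd):.2f}")
```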

FIG. 8. (Color online) The closed quotient CQ and vertical phase difference VPD as a function of the medial surface thickness, the AP stiffness (G_ap), and the resting glottal angle (α). Reprinted with permission of ASA from Zhang (2016a).

Theoretically, the duration of glottal closure can also be controlled by changing the ratio between the vocal fold equilibrium position (or the mean glottal opening) and the vocal fold vibration amplitude. Both stiffening the vocal folds and tightening vocal fold approximation move the vocal fold equilibrium position toward the glottal midline. However, such manipulations often simultaneously reduce the vibration amplitude, so the overall effect on the duration of glottal closure is unclear. Zhang (2016a) showed that stiffening the vocal folds or increasing vocal fold approximation did not have much effect on the duration of glottal closure except near onset, where these manipulations led to significant improvement in vocal fold contact.

Although a Bernoulli-based flow description is often used in phonation models, the real glottal flow is highly three-dimensional and much more complex. The intraglottal pressure distribution has been shown to be affected by the three-dimensionality of the glottal channel geometry (Scherer et al., 2001; Scherer et al., 2010; Mihaescu et al., 2010; Li et al., 2012). As the airflow exits the glottis, it separates from the glottal wall and a jet forms downstream of the flow separation point, leading to the development of shear layer instabilities, vortex roll-up, and eventually vortex shedding from the jet and transition into turbulence. The vortical structures in turn induce disturbances upstream, which may lead to an oscillating flow separation point, jet attachment to one side of the glottal wall instead of a straight jet, and possibly alternating jet flapping (Pelorson et al., 1994; Shinwari et al., 2003; Triep et al., 2005; Kucinschi et al., 2006; Erath and Plesniak, 2006; Neubauer et al., 2007; Zheng et al., 2009). Recent experiments and simulations have also shown that, for a highly divergent glottis, airflow may separate inside the glottis, leading to the formation and convection of intraglottal vortices (Mihaescu et al., 2010; Khosla et al., 2014; Oren et al., 2014).

Some of these flow features have been incorporated in phonation models (e.g., Liljencrants, 1991 ; Pelorson et al. , 1994 ; Kaburagi and Tanabe, 2009 ; Erath et al. , 2011 ; Howe and McGowan, 2013 ). Resolving other features, particularly the jet instability, vortices, and turbulence downstream of the glottis, demands significantly increased computational costs so that simulation of a few cycles of vocal fold vibration often takes days or months. On the other hand, the acoustic and perceptual relevance of these intraglottal and supraglottal flow structures has not been established. From the sound production point of view, these complex flow structures in the downstream glottal flow field are sound sources of quadrupole type (dipole type when obstacles are present in the pathway of airflow, e.g., tightly adducted false vocal folds). Due to the small length scales associated with the flow structures, these sound sources are broadband in nature and mostly at high frequencies (generally above 2 kHz), with an amplitude much smaller than the harmonic component of the voice source. Therefore, if the high-frequency component of voice is of interest, these flow features have to be accurately modeled, although the degree of accuracy required to achieve perceptual sufficiency has yet to be determined.

It has been postulated that the vortical structures may directly affect the near-field glottal fluid-structure interaction and thus vocal fold vibration and the harmonic component of the voice source. Once separated from the vocal fold walls, the glottal jet starts to develop jet instabilities and is therefore susceptible to downstream disturbances, especially when the glottis takes on a divergent shape. In this way, the unsteady supraglottal flow structures may interact with the boundary layer at the glottal exit and affect the flow separation point within the glottal channel ( Hirschberg et al. , 1996 ). Similarly, it has been hypothesized that intraglottal vortices can induce a local negative pressure on the medial surface of the vocal folds as the intraglottal vortices are convected downstream and thus may facilitate rapid glottal closure during voice production ( Khosla et al. , 2014 ; Oren et al. , 2014 ).

While there is no doubt that these complex flow features affect vocal fold vibration, the question remains how large an influence these vortical structures have on vocal fold vibration and the produced acoustics. For flow conditions typical of voice production, many of the flow features or instabilities have time scales very different from that of vocal fold vibration. For example, vortex shedding at typical voice conditions generally occurs at frequencies above 1000 Hz (Zhang et al., 2004; Kucinschi et al., 2006). Considering that phonation is essentially a resonance phenomenon of the vocal folds (Sec. III B) and the mismatch between vocal fold resonance and the typical frequency scales of the vortical structures, it is questionable whether, compared to vocal fold inertia and elastic recoil, the pressure perturbations on the vocal fold surface due to intraglottal or supraglottal vortical structures are strong enough, or last long enough, to have a significant effect on voice production. Given a longitudinal shear modulus of the vocal fold of about 10 kPa and a shear strain of 0.2, the elastic recoil stress of the vocal fold is approximately 2000 Pa. The pressure perturbations induced by intraglottal or supraglottal vortices are expected to be much smaller than the subglottal pressure. Assuming an upper limit of about 20% of the subglottal pressure for the pressure perturbations (as induced by intraglottal vortices, Oren et al., 2014; in reality this number is expected to be much smaller at normal loudness conditions, and smaller still for supraglottal vortices) and a subglottal pressure of 800 Pa (typical of normal speech production), the pressure perturbation on the vocal fold surface is about 160 Pa, much smaller than the elastic recoil stress. Regarding the intraglottal vortices specifically, while a highly divergent glottal geometry is required to create them, the vortices induce a negative suction force applied mainly on the superior portion of the medial surface and, if strong enough, would reduce the divergence of the glottal channel. In other words, while intraglottal vortices cannot create the divergence conditions required for their own formation, their existence tends to eliminate those conditions.
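The order-of-magnitude comparison above can be written out as a worked calculation (values as quoted in the text; the 20% bound is the assumed upper limit).

```python
# Worked calculation of the magnitude comparison quoted in the text.
G = 10e3                 # longitudinal shear modulus of the vocal fold (Pa)
strain = 0.2             # shear strain
recoil = G * strain      # elastic recoil stress: 2000 Pa
P_sub = 800.0            # subglottal pressure typical of normal speech (Pa)
perturb = 0.2 * P_sub    # assumed upper bound on vortex-induced perturbation: 160 Pa
print(recoil, perturb, perturb / recoil)  # 2000.0 160.0 0.08
```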

There have been some recent studies aimed at quantifying the influence of the vortical structures on phonation. In an excised larynx experiment without a vocal tract, it has been observed that the produced sound does not change much when a finger is placed very close to the glottal exit, which presumably would have significantly disturbed the supraglottal flow field. A more rigorous experiment was designed by Zhang and Neubauer (2010), who placed an anterior-posteriorly aligned cylinder in the supraglottal flow field, traversed it in the flow direction at different left-right locations, and observed the acoustic consequences. The hypothesis was that, if these supraglottal flow structures had a significant effect on vocal fold vibration and acoustics, disturbing them would lead to noticeable changes in the produced sound. However, the experiment found no significant changes in the sound except when the cylinder was positioned within the glottal channel.

The potential impact of intraglottal vortices on phonation has also been investigated numerically (Farahani and Zhang, 2014; Kettlewell, 2015). Because of the difficulty of removing intraglottal vortices without affecting other aspects of the glottal flow, the effect of the intraglottal vortices was modeled as a negative pressure superimposed on the flow pressure predicted by a base glottal flow model. In this way, the effect of the intraglottal vortices could be selectively activated or deactivated independently of the base flow, and its contribution to phonation investigated. These studies showed that intraglottal vortices have only small effects on vocal fold vibration and the glottal flow. Kettlewell (2015) further showed that the vortices are either not strong enough to induce significant pressure perturbations on the vocal fold surface or, if they are strong enough, they advect so rapidly into the supraglottal region that the induced pressure perturbations are too brief to overcome the inertia of the vocal fold tissue.

Although phonation models using simplified flow descriptions that neglect vortical flow structures are widely used and appear to compare qualitatively well with experiments (Pelorson et al., 1994; Zhang et al., 2002a; Ruty et al., 2007; Kaburagi and Tanabe, 2009), more systematic investigations are required to reach a definitive conclusion regarding the relative importance of these flow structures to phonation and voice perception. This may be achieved by conducting parametric studies over a large range of conditions in which the relative strength of these vortical structures is known to vary significantly, and observing the consequences for voice production. Such an improved understanding would facilitate the development of computationally efficient reduced-order models of phonation.

In discussions of F0 control, an analogy is often made in the voice literature between phonation and the vibration of strings (e.g., Colton et al., 2011). The vibration frequency of a string is determined by its length, tension, and mass. By analogy, the F0 of voice production is assumed to be determined by the length, tension, and mass of the vocal folds, with the mass interpreted as the mass of the vocal folds that is set into vibration. Specifically, F0 increases with increasing tension, decreasing mass, and decreasing vocal fold length. While the string analogy is conceptually simple and heuristically useful, it misses some important features of the vocal folds. Other than the vague definition of an effective mass, the string model, which implicitly assumes a cross-sectional dimension much smaller than the length, completely neglects the contribution of vocal fold stiffness to F0 control. Although stiffness and tension are often not differentiated in the voice literature, they have different physical meanings and represent two different mechanisms that resist deformation (Fig. 2). Stiffness is a property of the vocal fold and represents the elastic restoring force in response to deformation, whereas tension, or stress, describes the mechanical state of the vocal folds. The string analogy also neglects the effect of vocal fold contact, which introduces an additional stiffening effect.
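For reference, the ideal string law underlying this analogy (a textbook result, not specific to the vocal folds) is

\[ F_0 = \frac{1}{2L} \sqrt{\frac{T}{\mu}}, \]

where L is the string length, T the tension, and μ the mass per unit length; note that it contains no stiffness term, which is precisely the contribution the analogy misses for the vocal folds.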

Because phonation is essentially a resonance phenomenon of the vocal folds, the F0 is primarily determined by the frequencies of the vocal fold eigenmodes that are excited. In general, vocal fold eigenfrequencies depend both on vocal fold geometry, including length, depth, and thickness, and on the stiffness and stress conditions of the vocal folds. Shorter vocal folds tend to have higher eigenfrequencies. Thus, because of their small vocal fold size, children tend to have the highest F0, followed by women and then men. Vocal fold eigenfrequencies also increase with increasing stiffness or stress (tension), both of which provide a restoring force that resists vocal fold deformation. Thus, stiffening or tensioning the vocal folds increases the F0 of the voice. In general, the effect of stiffness on vocal fold eigenfrequencies dominates that of tension when the vocal folds are only slightly elongated or shortened, in which case the tension is small or even negative and the string model would underestimate F0 or fail to provide a prediction at all. As the vocal folds are further elongated and tension increases, stiffness and tension become equally important in determining vocal fold eigenfrequencies (Titze and Hunter, 2004; Yin and Zhang, 2013).

When vocal fold contact occurs during vibration, the collision force appears as an additional restoring force (Ishizaka and Flanagan, 1972). Depending on the extent, depth of influence, and duration of vocal fold collision, this additional force can significantly increase the effective stiffness of the vocal folds and thus F0. Because the vocal fold contact pattern depends on the degree of vocal fold approximation, the subglottal pressure, and vocal fold stiffness and geometry, changes in any of these parameters may affect F0 by modifying vocal fold contact (van den Berg and Tan, 1959; Zhang, 2016a).

In humans, F0 can be increased by increasing either the vocal fold eigenfrequencies or the extent and duration of vocal fold contact. Control of vocal fold eigenfrequencies is largely achieved by varying the stiffness and tension along the AP direction. Due to the nonlinear material properties of the vocal folds, both the AP stiffness and tension can be controlled by elongating or shortening the vocal folds through activation of the CT muscle. Although elongation also increases vocal fold length, which by itself lowers F0, the effect of the increased stiffness and tension on F0 appears to dominate that of the increased length.

The effect of TA muscle activation on F0 control is more complex. In addition to shortening vocal fold length, TA activation tensions and stiffens the body layer and decreases tension in the cover layer, but may decrease or increase the cover stiffness (Yin and Zhang, 2013). Titze et al. (1988) showed that, depending on the depth of the body layer involved in vibration, increasing TA activation can either increase or decrease vocal fold eigenfrequencies. On the other hand, Yin and Zhang (2013) showed that for an elongated vocal fold, as is often the case in phonation, the overall effect of TA activation is to reduce vocal fold eigenfrequencies; only for slightly elongated or shortened vocal folds may TA activation increase vocal fold eigenfrequencies. In addition to its effect on vocal fold eigenfrequencies, TA activation increases the vertical thickness of the vocal folds and produces medial compression between the two folds, both of which increase the extent and duration of vocal fold contact and would lead to an increased F0 (Hirano et al., 1969). Because of these opposing effects on vocal fold eigenfrequencies and vocal fold contact, the overall effect of TA activation on F0 varies depending on the specific vocal fold conditions.

Increasing the subglottal pressure or activating the LCA/IA muscles by themselves does not have much effect on vocal fold eigenfrequencies (Hirano and Kakita, 1985; Chhetri et al., 2009; Yin and Zhang, 2014). However, these actions often increase the extent and duration of vocal fold contact during vibration, particularly with increasing subglottal pressure, and thus lead to increased F0 (Hirano et al., 1969; Ishizaka and Flanagan, 1972; Zhang, 2016a). Due to nonlinearity in vocal fold material properties, the increased vibration amplitude at high subglottal pressures may lead to increased effective stiffness and tension, which may also increase F0 (van den Berg and Tan, 1959; Ishizaka and Flanagan, 1972; Titze, 1989). Ishizaka and Flanagan (1972) showed in their two-mass model that vocal fold contact and material nonlinearity combined can lead to an increase of about 40 Hz in F0 when the subglottal pressure is increased from about 200 to 800 Pa. In the continuum model of Zhang (2016a), which includes the effect of vocal fold contact but not vocal fold material nonlinearity, increasing the subglottal pressure alone can increase the F0 by as much as 20 Hz/kPa.
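A quick consistency check of the two quoted model results (simple arithmetic on the numbers above, not an additional result):

```python
# Comparing the two model results quoted above over the same pressure range.
dP_kPa = (800.0 - 200.0) / 1000.0   # subglottal pressure increase: 0.6 kPa
print(40.0 / dP_kPa)   # two-mass model (contact + nonlinearity): ~67 Hz/kPa
print(20.0 * dP_kPa)   # continuum model at 20 Hz/kPa (contact only): ~12 Hz
```

The larger per-kPa rise in the two-mass model is consistent with material nonlinearity adding to the contact effect.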

Because voice is produced at the glottis, filtered by the vocal tract, and radiated from the mouth, an increase in vocal intensity can be achieved by either increasing the source intensity or enhancing the radiation efficiency. The source intensity is controlled primarily by the subglottal pressure, which increases the vibration amplitude and the negative peak or MFDR of the time derivative of the glottal flow. The subglottal pressure depends primarily on the alveolar pressure in the lungs, which is controlled by the respiratory muscles and the lung volume. In general, conditions of the laryngeal system have little effect on the establishment of the alveolar pressure and subglottal pressure ( Hixon, 1987 ; Finnegan et al. , 2000 ). However, an open glottis often results in a small glottal resistance and thus a considerable pressure drop in the lower airway and a reduced subglottal pressure. An open glottis also leads to a large glottal flow rate and a rapid decline in the lung volume, thus reducing the duration of speech between breaths and increasing the respiratory effort required in order to maintain a target subglottal pressure ( Zhang, 2016b ).

In the absence of a vocal tract, laryngeal adjustments, which control vocal fold stiffness, geometry, and position, do not have much effect on the source intensity, as shown in many studies using laryngeal, physical, or computational models of phonation (Tanaka and Tanabe, 1986; Titze, 1988b; Zhang, 2016a). In the experiment by Tanaka and Tanabe (1986), for a constant subglottal pressure, stimulation of the CT and LCA muscles had almost no effect on vocal intensity, whereas stimulation of the TA muscle slightly decreased it. In an excised larynx experiment, Titze (1988b) found no dependence of vocal intensity on the glottal width. Similar secondary effects of laryngeal adjustments have also been observed in a recent computational study (Zhang, 2016a). Zhang (2016a) also showed that laryngeal adjustments may be important at subglottal pressures slightly above onset, where an increase in either AP stiffness or vocal fold approximation may lead to improved vocal fold contact and glottal closure, significantly increasing the MFDR and thus vocal intensity. However, these effects become less efficient with increasing vocal intensity.

The effect of laryngeal adjustments on vocal intensity becomes a little more complicated in the presence of the vocal tract. Changing vocal tract shape by itself does not amplify the produced sound intensity because sound propagation in the vocal tract is a passive process. However, changes in vocal tract shape may provide a better impedance match between the glottis and the free space outside the mouth and thus improve efficiency of sound radiation from the mouth ( Titze and Sundberg, 1992 ). This is particularly the case for harmonics close to a formant, which are often amplified more than the first harmonic and may become the most energetic harmonic in the spectrum of the output voice. Thus, vocal intensity can be increased through laryngeal adjustments that increase excitation of harmonics close to the first formant of the vocal tract ( Fant, 1982 ; Sundberg, 1987 ) or by adjusting vocal tract shape to match one of the formants with one of the dominant harmonics in the source spectrum.
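A minimal sketch of the source-formant bookkeeping implied here: given an F0 and a set of formant frequencies (the values below are generic textbook-style numbers, not measurements), one can check which harmonic falls closest to each formant and would therefore be preferentially amplified:

```python
F0 = 220.0                          # fundamental frequency, Hz
formants = [800.0, 1200.0, 2800.0]  # illustrative formant frequencies, Hz

for f in formants:
    k = max(1, round(f / F0))       # index of the harmonic nearest the formant
    print(f"formant {f:6.0f} Hz: nearest harmonic is H{k} "
          f"at {k * F0:6.0f} Hz (offset {k * F0 - f:+5.0f} Hz)")
```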

In humans, all three strategies (respiratory, laryngeal, and articulatory) are used to increase vocal intensity. When asked to produce an intensity sweep from soft to loud voice, one generally starts with a slightly breathy voice with a relatively open glottis, which requires the least laryngeal effort but is inefficient in voice production. From this starting position, vocal intensity can be increased by increasing either the subglottal pressure, which increases vibration amplitude, or vocal fold adduction (approximation and/or thickening). For a soft voice with minimal vocal fold contact and minimal higher-order harmonic excitation, increasing vocal fold adduction is particularly efficient because it may significantly improve vocal fold contact, in both spatial extent and duration, thus significantly boosting the excitation of harmonics close to the first formant. In humans, for low to medium vocal intensity conditions, vocal intensity increase is often accompanied by simultaneous increases in the subglottal pressure and the glottal resistance ( Isshiki, 1964 ; Holmberg et al. , 1988 ; Stathopoulos and Sapienza, 1993 ). Because the pitch level did not change much in these experiments, the increase in glottal resistance was most likely due to tighter vocal fold approximation through LCA/IA activation. The duration of the closed phase is often observed to increase with increasing vocal intensity ( Henrich et al. , 2005 ), indicating increased vocal fold thickening or medial compression, which are primarily controlled by the TA muscle. Thus, it seems that both the LCA/IA/TA muscles and subglottal pressure increase play a role in vocal intensity increase at low to medium intensity conditions. For high vocal intensity conditions, when further increase in vocal fold adduction becomes less effective ( Hirano et al. , 1969 ), vocal intensity increase appears to rely predominantly on the subglottal pressure increase.

On the vocal tract side, Titze (2002) showed that vocal intensity can be increased by matching a wide epilarynx with lower glottal resistance or a narrow epilarynx with higher glottal resistance. Tuning the first formant (e.g., by opening the mouth wider) to match the F0 is often used in soprano singing to maximize vocal output ( Joliveau et al. , 2004 ). Because radiation efficiency can be improved through adjustments of either the vocal folds or the vocal tract, it is possible to improve radiation efficiency while still maintaining the desired pitch or articulation.

Voice quality generally refers to aspects of the voice other than pitch and loudness. Due to the subjective nature of voice quality perception, many different descriptions are used, and authors often disagree about the meanings of these descriptions ( Gerratt and Kreiman, 2001 ; Kreiman and Sidtis, 2011 ). This lack of a clear and consistent definition of voice quality makes it difficult to study voice quality and to identify its physiological correlates and controls. Acoustically, voice quality is associated with the spectral amplitude and shape of the harmonic and noise components of the voice source, and their temporal variations. In the following we focus on physiological factors that are known to have an impact on the voice spectra and thus are potentially perceptually important.

One of the first systematic investigations of the physiological controls of voice quality was conducted by Isshiki (1989 , 1998) using excised larynges, in which regions of normal, breathy, and rough voice qualities were mapped out in the three-dimensional parameter space of the subglottal pressure, vocal fold stiffness, and prephonatory glottal opening area (Fig. 9 ). He showed that for a given vocal fold stiffness and prephonatory glottal opening area, increasing subglottal pressure led to voice production of a rough quality. This effect of the subglottal pressure can be counterbalanced by increasing vocal fold stiffness, which increased the region of normal voice in the parameter space of Fig. 9 . Unfortunately, the details of this study, including the definition and manipulation of vocal fold stiffness and perceptual evaluation of different voice qualities, are not fully available. The importance of the coordination between the subglottal pressure and laryngeal conditions was also demonstrated in van den Berg and Tan (1959) , which showed that although different vocal registers were observed, each register occurred in a certain range of laryngeal conditions and subglottal pressures. For example, for conditions of low longitudinal tension, a chest-like phonation was possible only for small airflow rates. At large values of the subglottal pressure, “it was impossible to obtain good sound production. The vocal folds were blown too wide apart…. The shape of the glottis became irregularly curved and this curving was propagated along the glottis.” Good voice production at large flow rates was possible only with thyroid cartilage compression which imitates the effect of TA muscle activation. Irregular vocal fold vibration at high subglottal pressures has also been observed in physical model experiments (e.g., Xuan and Zhang, 2014 ). Irregular or chaotic vocal fold vibration at conditions of pressure-stiffness mismatch has also been reported in the numerical simulation of Berry et al. (1994) , which showed that while regular vocal fold vibration was observed for typical vocal fold stiffness conditions, irregular vocal fold vibration (e.g., subharmonic or chaotic vibration) was observed when the cover layer stiffness was significantly reduced while maintaining the same subglottal pressure.

FIG. 9. A three-dimensional map of normal (N), breathy (B), and rough (R) phonation in the parameter space of the prephonatory glottal area (Ag0), subglottal pressure (Ps), vocal fold stiffness (k). Reprinted with permission of Springer from Isshiki (1989).

The experiments of van den Berg and Tan (1959) and Isshiki (1989) also showed that weakly adducted vocal folds (weak LCA/IA/TA activation) often lead to vocal fold vibration with incomplete glottal closure during phonation. When the airflow is sufficiently high, the persistent glottal gap would lead to increased turbulent noise production and thus phonation of a breathy quality (Fig. 9 ). The incomplete glottal closure may occur in the membranous or the cartilaginous portion of the glottis. When the incomplete glottal closure is limited to the cartilaginous glottis, the resulting voice is breathy but may still have strong harmonics at high frequencies. When the incomplete glottal closure occurs in the membranous glottis, the reduced or slowed vocal fold contact would also reduce excitation of higher-order harmonics, resulting in a breathy and weak quality of the produced voice. When the vocal folds are sufficiently separated, the coupling between the two vocal folds may be weakened enough so that each vocal fold can vibrate at a different F0. This would lead to biphonation or voice containing two distinct fundamental frequencies, resulting in a perception similar to that of the beat frequency phenomenon.

Compared to a breathy voice, a pressed voice is presumably produced with tight vocal fold approximation or even some degree of medial compression in the membranous portion between the two folds. A pressed voice is often characterized by a second harmonic that is stronger than the first harmonic, or a negative H1-H2, with a long period of glottal closure during vibration. Although a certain degree of vocal fold approximation and stiffness anisotropy is required to achieve vocal fold contact during phonation, the duration of glottal closure has been shown to be primarily determined by the vertical thickness of the vocal fold medial surface ( van den Berg, 1968 ; Zhang, 2016a ). Thus, although it is generally assumed that a pressed voice can be produced with tight arytenoid adduction through LCA/IA muscle activation, activation of the LCA/IA muscles alone is unable to achieve prephonatory medial compression in the membranous glottis or change the vertical thickness of the medial surface. Activation of the TA muscle appears to be essential in producing a voice change from a breathy to a pressed voice quality. A weakened TA muscle, as in aging or muscle atrophy, would lead to difficulties in producing a pressed voice or even sufficient glottal closure during phonation. On the other hand, strong TA muscle activation, as in for example, spasmodic dysphonia, may lead to too tight a closure of the glottis and a rough voice quality ( Isshiki, 1989 ).
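H1-H2 can be read off a voice source spectrum numerically. The sketch below synthesizes a toy two-harmonic source signal (the amplitudes are chosen arbitrarily for illustration) and measures H1-H2 in dB from its FFT:

```python
import numpy as np

FS = 16000
F0 = 110.0
t = np.arange(FS) / FS   # one second of signal

# Toy source with a strong second harmonic, as in a pressed voice.
signal = 0.4 * np.sin(2 * np.pi * F0 * t) + 0.7 * np.sin(2 * np.pi * 2 * F0 * t)

spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
freqs = np.fft.rfftfreq(len(signal), 1.0 / FS)

def harmonic_amplitude(k, half_width=5.0):
    """Peak spectral amplitude within half_width Hz of harmonic k."""
    band = (freqs > k * F0 - half_width) & (freqs < k * F0 + half_width)
    return spectrum[band].max()

h1, h2 = harmonic_amplitude(1), harmonic_amplitude(2)
print(f"H1-H2 = {20 * np.log10(h1 / h2):+.1f} dB")   # negative: pressed-like
```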

In humans, vocal fold stiffness, vocal fold approximation, and geometry are regulated by the same set of laryngeal muscles and thus often co-vary, which has long been considered one possible origin of vocal registers and their transitions ( van den Berg, 1968 ). Specifically, it has been hypothesized that changes in F0 are often accompanied by changes in the vertical thickness of the vocal fold medial surface, which lead to changes in the spectral characteristics of the produced voice. The medial surface thickness is primarily controlled by the CT and TA muscles, which also regulate vocal fold stiffness and vocal fold approximation. Activation of the CT muscle reduces the medial surface thickness, but also increases vocal fold stiffness and tension, and in some conditions increases the resting glottal opening ( van den Berg and Tan, 1959 ; van den Berg, 1968 ; Hirano and Kakita, 1985 ). Because the LCA/IA/TA muscles are innervated by the same nerve and often activated together, an increase in the medial surface thickness through TA muscle activation is often accompanied by increased vocal fold approximation ( Hirano and Kakita, 1985 ) and contact. Thus, if one attempts to increase F0 primarily by activation of the LCA/IA/TA muscles, the vocal folds are likely to have a large medial surface thickness and probably low AP stiffness, which will lead to a chest-like voice production, with a large vertical phase difference along the medial surface, long closure of the glottis, small flow rate, and strong harmonic excitation. In the extreme case of strong TA activation, minimal CT activation, and very low subglottal pressure, the glottis can remain closed for most of the cycle, leading to a vocal fry-like voice production. In contrast, if one attempts to increase F0 by increasing CT activation alone, the vocal folds, with a small medial surface thickness, are likely to produce a falsetto-like voice, with incomplete glottal closure, a nearly sinusoidal flow waveform, very high F0, and a limited number of harmonics.

Voice applications generally fall into two major categories. In the clinic, simulation of voice production has the potential to predict outcomes of clinical management of voice disorders, including surgery and voice therapy. For such applications, accurate representation of vocal fold geometry and material properties, to a degree that matches actual clinical treatment, is desired, and for this reason continuum models of the vocal folds are preferred over lumped-element models. Computational cost is not necessarily a concern in such applications but still has to be practical. In contrast, for some other applications, particularly in speech technology, the primary goal is to reproduce speech acoustics or at least perceptually relevant features of speech acoustics. Real-time capability is desired in these applications, whereas realistic representation of the underlying physics is often not necessary. In fact, most current speech synthesis systems treat speech purely as an acoustic signal and do not model the physics of speech production at all. However, models that take the underlying physics into consideration, at least to some degree, may hold the most promise for speech synthesis of natural-sounding, speaker-specific quality.

Early efforts on artificial speech production, dating back to as early as the 18th century, focused on mechanically reproducing the speech production system. A detailed review can be found in Flanagan (1972) . The focus of these early efforts was generally on articulation in the vocal tract rather than the voice source, which is understandable considering that meaning is primarily conveyed through changes in articulation and that the voice production process was poorly understood at the time. The vibrating element in these mechanical models, either a vibrating reed or a slotted rubber sheet stretched over an opening, is only a rough approximation of the human vocal folds.

More sophisticated mechanical models have been developed more recently to better reproduce the three-dimensional layered structure of the vocal folds. A membrane (cover)-cushion (body) two-layer rubber vocal fold model was first developed by Smith (1956) . Similar mechanical models were later developed and used in voice production research (e.g., Isogai et al. , 1988 ; Kakita, 1988 ; Titze et al. , 1995 ; Thomson et al. , 2005 ; Ruty et al. , 2007 ; Drechsel and Thomson, 2008 ), using silicone or rubber materials or liquid-filled membranes. Recent studies ( Murray and Thomson, 2012 ; Xuan and Zhang, 2014 ) have also started to embed fibers into these models to simulate the anisotropic material properties due to the presence of collagen and elastin fibers in the vocal folds. A similar layered vocal fold model has been incorporated into a mechanical talking robot system ( Fukui et al. , 2005 ; Fukui et al. , 2007 ; Fukui et al. , 2008 ). The most recent version of the talking robot, Waseda Talker, includes mechanisms for the control of pitch and resting glottal opening, and is able to produce voice of modal, creaky, or breathy quality. Nevertheless, although a mechanical voice production system may find application in voice prosthesis or humanoid robotic systems in the future, current mechanical models are still a long way from reproducing or even approaching humans' capability and flexibility in producing and controlling voice.

Compared to mechanically reproducing the physical process involved in speech production, it is easier to reproduce speech as an acoustic signal. This is particularly the case for speech synthesis. One approach adopted in most of the current speech synthesis systems is to concatenate segments of pre-recorded natural voice into new speech phrases or sentences. While relatively easy to implement, in order to achieve natural-sounding speech, this approach requires a large database of words spoken in different contexts, which makes it difficult to apply to personalized speech synthesis of varying emotional percepts.

Another approach is to reproduce only perceptually relevant acoustic features of speech, as in formant synthesis. The target acoustic features to be reproduced generally include the F0, sound amplitude, and formant frequencies and bandwidths. This approach gained popularity with the development of electrical synthesizers and later computer simulations which allow flexible and accurate control of these acoustic features. Early formant-based synthesizers used simple sound sources, often a filtered impulse train as the sound source for voiced sounds and white noise for unvoiced sounds. Research on the voice sources (e.g., Fant, 1979 ; Fant et al. , 1985 ; Rothenberg et al. , 1971 ; Titze and Talkin, 1979 ) has led to the development of parametric voice source models in the time domain, which are capable of producing voice source waveforms of varying F0, amplitude, open quotient, and degree of abruptness of the glottal flow shutoff, and thus synthesis of different voice qualities.
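A minimal source-filter sketch in the spirit of formant synthesis: an impulse train (standing in for the voice source) is passed through a cascade of second-order resonators at fixed formant frequencies. The formant values and bandwidths below are generic illustrative numbers, not taken from any specific synthesizer:

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000
F0 = 120.0
DUR = 0.5

# Impulse-train source, the classic simple voiced source.
n = int(FS * DUR)
source = np.zeros(n)
source[::int(FS / F0)] = 1.0

def resonator(x, freq, bw):
    """Second-order IIR resonator (digital formant filter)."""
    r = np.exp(-np.pi * bw / FS)
    theta = 2 * np.pi * freq / FS
    a = [1.0, -2 * r * np.cos(theta), r * r]
    b = [1 - 2 * r * np.cos(theta) + r * r]   # roughly unity gain at DC
    return lfilter(b, a, x)

# Cascade of three formants, approximating a schwa-like vowel.
speech = source
for freq, bw in [(500, 60), (1500, 90), (2500, 120)]:
    speech = resonator(speech, freq, bw)

print(f"synthesized {n} samples, peak amplitude {np.abs(speech).max():.3f}")
```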

While parametric voice source models provide flexibility in source variations, synthetic speech generated by formant synthesis still suffers from limited naturalness. This limited naturalness may result from the primitive rules used in specifying dynamic controls of the voice source models ( Klatt, 1987 ). Also, the source model control parameters are not independent from each other and often co-vary during phonation. A challenge in formant synthesis is thus to specify voice source parameter combinations and their time variation patterns that may occur in realistic voice production of different voice qualities by different speakers. It is also possible that some perceptually important features are missing from time-domain voice source models ( Klatt, 1987 ). Human perception of voice characteristics is better described in the frequency domain, as the auditory system performs an approximation to Fourier analysis of the voice and sound in general. While time-domain models have better correspondence to the physical events occurring during phonation (e.g., glottal opening and closing, and the closed phase), it is possible that some spectral details of perceptual importance are not captured in simple time-domain voice source models. For example, spectral details in the low and middle frequencies have been shown to be of considerable importance to naturalness judgment, but are difficult to represent in a time-domain source model ( Klatt, 1987 ). A recent study ( Kreiman et al. , 2015 ) showed that spectral-domain voice source models are able to create significantly better matches to natural voices than time-domain voice source models. Furthermore, because of the independence between the voice source and the sub- and supra-glottal systems in formant synthesis, interactions and co-variations between the vocal folds and the sub- and supra-glottal systems are by design not accounted for. All these factors may contribute to the limited naturalness of formant-synthesized speech.

An alternative approach to natural speech synthesis is to computationally model the voice production process based on physical principles. The control parameters would be geometry and material properties of the vocal system or, in a more realistic way, respiratory and laryngeal muscle activation. This approach avoids the need to specify consistent characteristics of either the voice source or the formants, thus allowing synthesis and modification of natural voice in a way intuitively similar to human voice production and control.

The first such computer model of voice production is the one-mass model by Flanagan and Landgraf (1968) , in which the vocal fold is modeled as a horizontally moving single-degree-of-freedom mass-spring-damper system. This model is able to vibrate in a restricted range of conditions, when the natural frequency of the mass-spring system is close to one of the acoustic resonances of the subglottal or supraglottal tracts. Ishizaka and Flanagan (1972) extended this model to a two-mass model in which the upper and lower parts of the vocal fold are modeled as two separate masses connected by an additional spring along the vertical direction. The two-mass model is able to vibrate with a vertical phase difference between the two masses, and thus able to vibrate independently of the acoustics of the sub- and supra-glottal tracts. Many variants of the two-mass model have since been developed. Titze (1973) developed a 16-mass model to better represent vocal fold motion along the anterior-posterior direction. To better represent the body-cover layered structure of the vocal folds, Story and Titze (1995) extended the two-mass model to a three-mass model, adding a lateral mass representing the inner muscular layer. Empirical rules have also been developed to relate control parameters of the three-mass model to laryngeal muscle activation levels ( Titze and Story, 2002 ) so that voice production can be simulated with laryngeal muscle activity as input. Designed originally for speech synthesis purposes, these lumped-element models of voice production are generally fast in computational time and ideal for real-time speech synthesis.
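To give a concrete feel for lumped-element modeling, the sketch below implements not the Flanagan-Landgraf model itself but a closely related single-mass description in the spirit of Titze's surface-wave (mucosal wave) analysis, in which the driving pressure depends on a propagation delay τ across the fold. All parameter values are illustrative per-unit-area numbers, not fitted to data, and the drive is crudely bounded, so the result is qualitative only; the point is that self-oscillation sets in once the lung pressure exceeds a threshold set by the damping, prephonatory width, and delay:

```python
# Single-mass vocal fold model with a mucosal-wave-style driving pressure.
# Per-unit-area parameters; all values are illustrative only.
M = 0.5          # mass, kg/m^2
B = 60.0         # damping, kg/(m^2 s)
K = 2.0e5        # stiffness, N/m^3
X0 = 5e-4        # prephonatory half-width, m
TAU = 1.0e-3     # mucosal wave delay, s
DT = 1e-6        # integration time step, s

def late_peak(p_lung, t_end=0.2):
    """Integrate M*x'' + B*x' + K*x = Pg and return the late-time amplitude."""
    x, v, peak = 1e-6, 0.0, 0.0
    n_steps = int(t_end / DT)
    for i in range(n_steps):
        a1 = X0 + x + TAU * v                   # glottal entry half-width
        if a1 > 0.0:
            pg = p_lung * 2.0 * TAU * v / a1    # aerodynamic driving pressure
            pg = max(-p_lung, min(pg, p_lung))  # crude bound on the drive
        else:
            pg = 0.0                            # glottis closed: no drive
        v += (pg - B * v - K * x) / M * DT      # semi-implicit Euler step
        x += v * DT
        if i > n_steps // 2:
            peak = max(peak, abs(x))
    return peak

# Linearized onset threshold: P_th = B * X0 / (2 * TAU) = 15 Pa here
# (unrealistically low; real phonation thresholds are a few hundred Pa).
for p in [5.0, 30.0, 100.0]:
    print(f"P_lung = {p:5.1f} Pa -> late-time amplitude {late_peak(p):.2e} m")
```

Below the threshold the initial perturbation decays; above it the amplitude grows until the nonlinearities limit it, the same qualitative behavior that the lumped-element phonation models exhibit.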

A drawback of the lumped-element models of phonation is that the model control parameters cannot be directly measured or easily related to the anatomical structure or material properties of the vocal folds. Thus, these models are not as useful in applications in which a realistic representation of voice physiology is required, as, for example, in the clinical management of voice disorders. To better understand the voice source and its control under different voicing conditions, more sophisticated computational models of the vocal folds based on continuum mechanics have been developed, which describe laryngeal muscle control of vocal fold geometry, stiffness, and tension, and how changes in these vocal fold properties affect the glottal fluid-structure interaction and the produced voice. One of the first such models is the finite-difference model by Titze and Talkin (1979) , which coupled a three-dimensional vocal fold model of linear elasticity with the one-dimensional glottal flow model of Ishizaka and Flanagan (1972) . In the past two decades more refined phonation models using a two-dimensional or three-dimensional Navier-Stokes description of the glottal flow have been developed (e.g., Alipour et al. , 2000 ; Zhao et al. , 2002 ; Tao et al. , 2007 ; Luo et al. , 2009 ; Zheng et al. , 2009 ; Bhattacharya and Siegmund, 2013 ; Xue et al. , 2012 , 2014 ). Continuum models of laryngeal muscle activation have also been developed to model vocal fold posturing ( Hunter et al. , 2004 ; Gommel et al. , 2007 ; Yin and Zhang, 2013 , 2014 ). By directly modeling the voice production process, continuum models with realistic geometry and material properties ideally hold the most promise in reproducing natural human voice production. However, because the phonation process is highly nonlinear and involves large displacement and deformation of the vocal folds and complex glottal flow patterns, modeling this process in three dimensions is computationally very challenging and time-consuming. As a result, these computational studies are often limited to one or two specific aspects instead of the entire voice production process, and the acoustics of the produced voice, other than F0 and vocal intensity, are often not investigated. For practical applications, real-time or not, reduced-order models with significantly improved computational efficiency are required. Some reduced-order continuum models, with simplifications in both the glottal flow and vocal fold dynamics, have been developed and used in large-scale parametric studies of voice production (e.g., Titze and Talkin, 1979 ; Zhang, 2016a ), which appear to produce qualitatively reasonable predictions. However, these simplifications have yet to be rigorously validated by experiment.

We currently have a general understanding of the physical principles of voice production. Toward establishing a cause-effect theory of voice production, however, much remains to be learned about voice physiology and biomechanics. This includes the geometry and mechanical properties of the vocal folds and their variability across subjects, sex, and age, and how they vary across different voicing conditions under laryngeal muscle activation. Even less is known about changes in vocal fold geometry and material properties in pathologic conditions. The surface conditions of the vocal folds and their mechanical properties have been shown to affect vocal fold vibration ( Dollinger et al. , 2014 ; Bhattacharya and Siegmund, 2015 ; Tse et al. , 2015 ), and thus need to be better quantified. While in vivo animal or human larynx models ( Moore and Berke, 1988 ; Chhetri et al. , 2012 ; Berke et al. , 2013 ) could provide such information, more reliable measurement methods are required to better quantify the viscoelastic properties of the vocal fold, vocal fold tension, and the geometry and movement of the inner vocal fold layers. While macro-mechanical properties are of interest, development of vocal fold constitutive laws based on ECM distribution and interstitial fluids within the vocal folds would allow us to better understand how vocal fold mechanical properties change with prolonged vocal use, vocal fold injury, and wound healing, which otherwise are difficult to quantify.

While oversimplification of the vocal folds to mass and tension is of limited practical use, the other extreme is not appealing, either. With improved characterization and understanding of vocal fold properties, establishing a cause-effect relationship between voice physiology and production thus requires identifying which of these physiologic features are actually perceptually relevant and under what conditions, through systematic parametric investigations. Such investigations will also facilitate the development of reduced-order computational models of phonation in which perceptually relevant physiologic features are sufficiently represented and features of minimum perceptual relevance are simplified. We discussed earlier that many of the complex supraglottal flow phenomena have questionable perceptual relevance. Similar relevance questions can be asked with regard to the geometry and mechanical properties of the vocal folds. For example, while the vocal folds exhibit complex viscoelastic properties, what are the main material properties that are definitely required in order to reasonably predict vocal fold vibration and voice quality? Does each of the vocal fold layers, in particular, the different layers of the lamina propria, have a functional role in determining the voice output or preventing vocal injury? Current vocal fold models often use a simplified vocal fold geometry. Could some geometric features of a realistic vocal fold that are not included in current models have an important role in affecting voice efficiency and voice quality? Because voice communication spans a large range of voice conditions (e.g., pitch, loudness, and voice quality), the perceptual relevance and adequacy of specific features (i.e., do changes in specific features lead to perceivable changes in voice?) should be investigated across a large number of voice conditions rather than a few selected conditions. While physiologic models of phonation allow better reproduction of realistic vocal fold conditions, computational models are more suitable for such systematic parametric investigations. Unfortunately, due to the high computational cost, current studies using continuum models are often limited to a few conditions. Thus, the establishment of cause-effect relationship and the development of reduced-order models are likely to be iterative processes, in which the models are gradually refined to include more physiologic details to be considered in the cause-effect relationship.

A causal theory of voice production would allow us to map out regions in the physiological parameter space that produce distinct vocal fold vibration patterns and voice qualities of interest (e.g., normal, breathy, and rough voices for clinical applications; different vocal registers for singing training), similar to that described by Isshiki (1989 ; also Fig. 9 ). Although the voice production system is quite complex, control of voice should be both stable and simple, which is required for voice to be a robust and easily controlled means of communication. Understanding voice production in the framework of nonlinear dynamics and eigenmode interactions and relating it to voice quality may facilitate progress toward this goal. Toward practical clinical applications, such a voice map would help us understand what physiologic alteration caused a given voice change (the inverse problem), and what can be done to restore the voice to normal. Development of efficient and reliable tools addressing the inverse problem has important applications in the clinical diagnosis of voice disorders. Some methods already exist that solve the inverse problem in lumped-element models (e.g., Dollinger et al. , 2002 ; Hadwin et al. , 2016 ), and these can be extended to physiologically more realistic continuum models.

Solving the inverse problem would also provide an indirect approach toward understanding the physiologic states that lead to percepts of different emotional states or communication of other personal traits, which are otherwise difficult to measure directly in live human beings. When extended to continuous speech production, this approach may also provide insights into the dynamic physiologic control of voice in running speech (e.g., time contours of the respiratory and laryngeal adjustments). Such information would facilitate the development of computer programs capable of natural-sounding, conversational speech synthesis, in which the time contours of control parameters may change with context, speaking style, or emotional state of the speaker.

This study was supported by research Grant Nos. R01 DC011299 and R01 DC009229 from the National Institute on Deafness and Other Communication Disorders, the National Institutes of Health. The author would like to thank Dr. Liang Wu for assistance in preparing the MRI images in Fig. 1 , Dr. Jennifer Long for providing the image in Fig. 1(b) , Dr. Gerald Berke for providing the stroboscopic recording from which Fig. 3 was generated, and Dr. Jody Kreiman, Dr. Bruce Gerratt, Dr. Ronald Scherer, and an anonymous reviewer for the helpful comments on an earlier version of this paper.

The Relationship Between Acoustics and Human Voice Essay

Introduction

This paper covers the method used to determine sound absorption, the types of sound absorbers (porous, membrane, and resonator absorbers), the practical use of sound absorbers in room acoustics, and works cited.

The term ‘acoustic’ is synonymous with the study of sound waves and their effects. In a synopsis, this study basically centers on the consequences of wave motion across the three states of matter: solids, liquids and gases. As such, the scope of acoustics cuts across an array of disciplines, and to this effect, terms such as psychoacoustics and bioacoustics are popular among acousticians. Moreover, acoustics finds application in technical fields including noise control, transducer technology, the design of theatre halls, and “sound recording and production” (Finn et al. 103). With regard to the scope of this paper, our main interest is centered on sound absorbers and how they are applied in room designs.

The reverberation time (T60 = 0.16·V/A) as derived by Sabine is the most vital formula when it comes to room acoustics. By predicting the T60 of materials, one is in a better position to determine the acoustic characteristics of a room's surfaces and hence to clad a room appropriately. Principally, in order to achieve an ideal room design meant for a specific application, an acoustician needs knowledge of the “absorption coefficient per octave band” (Finn et al. 103) of a diversity of materials. In a nutshell, in a room, materials including wooden doors and windows are known to absorb low-frequency sound.
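A minimal sketch of how Sabine's formula is applied in practice, assuming invented surface areas and absorption coefficients purely for demonstration:

```python
# Sabine reverberation time: T60 = 0.16 * V / A, with A = sum(S_i * alpha_i).
V = 200.0   # room volume, m^3

# (surface area in m^2, absorption coefficient at some octave band)
surfaces = [
    (60.0, 0.02),   # concrete floor
    (60.0, 0.70),   # absorbing suspended ceiling
    (90.0, 0.05),   # painted walls
    (10.0, 0.30),   # windows and wooden doors
]

A = sum(s * alpha for s, alpha in surfaces)   # equivalent absorption area, m^2
print(f"A = {A:.1f} m^2, T60 = {0.16 * V / A:.2f} s")
```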

On the contrary, fabrics and clothes are known to absorb middle- and high-frequency sound waves. Consequently, one can strike a balance between these materials so as to achieve appropriate T60 versus frequency combinations in a room. In a synopsis, the specific objective of this paper is to introduce, and hence appreciate, the physical mechanisms vital to sound absorption and reverberation control. To this effect, knowledge of the sound absorption coefficient comes in handy when describing the properties of these materials. In some places, for example in churches, there is a general demand for control of reverberation. As such, this paper will try to demonstrate how an effective room design dampens these effects.

Generally, the inner surfaces of a room receive sound from a wide diffuse field typified by the figure below:

[Figure: random incidence of sound on a room surface in a diffuse field]

Consequently, this chapter describes a relevant method that can be used to determine the sound absorption coefficient for the above incidence. This method is also referred to as the reverberation room method. Ideally, the experiment is performed in a “reverberation room with highly irregular or non parallel surfaces and/or suspended, sound diffusing elements” (Finn et al. 103). The assumption in this experiment is that the diffuse sound field satisfies the requirements stipulated in Sabine's reverberation formula. To derive the working equations, assume that a vacant reverberation room has an average absorption coefficient α_empty, an inner surface area S, and a volume V; Sabine's formula then gives:

T60,empty = 0.16·V / (S·α_empty)

Introducing a sample of foreign material with surface area S_sample into this room changes the equation to:

T60,sample = 0.16·V / [(S − S_sample)·α_empty + S_sample·α_sample]

Combining the two equations and eliminating S yields the equation below:

α_sample = α_empty + (0.16·V / S_sample)·(1/T60,sample − 1/T60,empty)

The above equation is fundamental in determining the coefficient, α_sample, of a foreign material. Importantly, the measurement is “normally carried out in 1/1 or 1/3 octave bands from 100 to 5000 Hz” (Finn et al. 104). The results obtained with this method are only credible if the size of the material sample is reasonable relative to the size of the room. Basically, if a very small sample is introduced into an abnormally large room, the results obtained will be faulty (α > 1) (see the graph below). This phenomenon is accounted for by diffraction of sound waves at the edges of the sample, which exaggerates its apparent absorption.

[Figure: measured absorption coefficient versus frequency for small samples, showing apparent values above 1]
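A sketch of the reverberation-room calculation, using invented measurement values purely for illustration:

```python
# Reverberation-room method: absorption coefficient of a sample from the
# measured T60 of the room with and without the sample present.
V = 250.0           # room volume, m^3
S_SAMPLE = 10.0     # sample area, m^2
ALPHA_EMPTY = 0.03  # average absorption coefficient of the empty room

def alpha_sample(t60_empty, t60_with_sample):
    return ALPHA_EMPTY + (0.16 * V / S_SAMPLE) * (1 / t60_with_sample - 1 / t60_empty)

# Hypothetical measurements in one octave band:
print(f"alpha_sample = {alpha_sample(t60_empty=6.0, t60_with_sample=3.2):.2f}")
```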

This chapter describes the most common types of materials used for sound absorption: porous, membrane, and resonator absorbers. It relates the absorption coefficient of these materials to frequency, as typified by the graph below:

[Figure: typical absorption coefficient versus frequency for porous, membrane, and resonator absorbers]

Porous absorbers are common in our houses, presented in numerous forms including curtains, furniture, carpets, and ceiling material. Porous materials are basically characterized by the presence of air pockets through which air can be pushed to a greater or lesser extent, depending on the flow resistivity of the material. Chiefly, the absorption of a material is a function of the frictional drag force on the air molecules in motion and of heat transfer to the material in contact with the air, both of which convert acoustic (kinetic) energy into heat in the material.

Consider a scenario where a sound wave is normally incident on a porous material mounted on a rigid surface: a standing wave results, with the pressure amplitude shown in the graph below (left). Maxima of the sound pressure coincide with minima of the particle velocity, and vice versa.

[Figure: standing wave pattern in front of a porous absorber backed by a rigid wall]

This contrasts with the case in which the same wave strikes a bare rigid termination (see figure below): at the wall itself the particle velocity vanishes while the pressure is at a maximum. The rationale behind this analysis is that it aids in determining the sound absorption efficacy of a material, since absorption is effective only where the particle velocity is appreciable. Ideally, to achieve the best sound absorption efficacy, the thickness of the material should be at least a quarter of the wavelength of the sound wave.

[Figure: standing wave in front of a bare rigid termination]

In a nutshell, a layer of a given thickness absorbs poorly below a certain threshold frequency of the sound wave. The graph below portrays how the absorption coefficient versus frequency varies with the thickness of a wool mat; a numerical illustration of the quarter-wavelength rule follows the figure.

[Figure: absorption coefficient versus frequency for wool mats of different thickness]
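A minimal sketch of the quarter-wavelength rule, f_min ≈ c / (4·d), evaluated for a few illustrative layer thicknesses:

```python
C = 343.0   # speed of sound in air, m/s

# Lowest frequency at which a porous layer of thickness d (backed by a rigid
# wall) is roughly a quarter wavelength deep, and hence absorbs effectively.
for d_cm in [2.5, 5.0, 10.0]:
    d = d_cm / 100.0
    print(f"thickness {d_cm:5.1f} cm -> effective above ~{C / (4 * d):5.0f} Hz")
```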

A membrane absorber is a “kind of double walled sound absorber with an air-filled cavity sandwiched between the walls” (Finn et al. 106). The resonance frequency (f0) of the setup is a function of the mass per unit area of the plate, m, and the spring stiffness of the enclosed air, which depends on the depth, L, of the air cavity. This can be represented by the equation below:

f0 = (1/2π)·√(ρ·c² / (m·L)) ≈ 60/√(m·L) Hz (for m in kg/m² and L in m, with ρ and c the density of air and the speed of sound)

Nevertheless, this holds only when the plate is absolutely limp. Other plate parameters, e.g. bending stiffness as well as the mode of vibration, are also vital in determining the resonance frequency. To this end, the corresponding value can be obtained using the expression below:

f_ij = (π/2)·√( E·h² / (12·ρ·(1 − v²)) ) · [ (i/a)² + (j/b)² ]

As such, the variables “a and b represent the dimensions of the material, h is the thickness while E and v are the Young's Modulus and Poisson's ratio respectively” (Finn et al. 106); here ρ is the density of the plate material and i, j are the mode numbers.

The graph below shows a plot of absorption coefficient against sound frequency for two sample plywood plates of different thickness. Furthermore, one of the plates was backed with glass wool while the other one was not. The trends attested by the graphs confirm the above two equations. They reveal “that the thickness is inversely proportional to resonance frequency” (Finn et al. 106). Also, it is established that the presence of glass wool enhances the absorption efficacy, and it also lowers the resonance frequency.

[Figure: absorption coefficient versus frequency for plywood panels of different thickness, with and without glass wool in the cavity]

These absorbers are typically found in our houses as wooden floor surfaces, and they provide controlled absorption at low frequencies, contrary to concrete-built rooms. The latter are synonymous with blurred sound at low frequencies.

A more advanced form of membrane absorber is the resonator absorber, which utilizes air oscillating in the opening of a double-walled cavity connected to the outer atmosphere. To this effect, the enclosed air acts as a spring (see the figure below).

[Figure: resonator absorbers: (a) a single Helmholtz resonator, (b) a perforated resonating panel]

For type (a) the resonance frequency is obtained through the expression below:

f0 = (c/2π)·√( S / (V·(l + δ)) )

The variables S, V, l and δ represent the cross-section area of the opening, the volume of enclosed air, the neck length and an end-correction factor, respectively. However, this type of absorber is not common since it is effective over only a very narrow frequency range. As such, a resonating panel overcomes the shortcomings of the former model by providing a relatively wider frequency range. Thus, its resonance frequency is given by:

f0 = (c/2π)·√( P / (L·(l + δ)) )

The variables P and L represent the degree of perforation and the depth of the air cavity, respectively; l and δ are, as before, the neck length (here the panel thickness) and the end correction.
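A sketch that evaluates these resonance formulas for illustrative constructions (all dimensions below are invented for demonstration):

```python
import math

C = 343.0   # speed of sound, m/s
RHO = 1.2   # density of air, kg/m^3

def f0_membrane(m, L):
    """Membrane (panel) absorber: mass m in kg/m^2, cavity depth L in m."""
    return (1 / (2 * math.pi)) * math.sqrt(RHO * C**2 / (m * L))

def f0_helmholtz(S, V, l, delta):
    """Helmholtz resonator: opening area S, volume V, neck l, correction delta."""
    return (C / (2 * math.pi)) * math.sqrt(S / (V * (l + delta)))

def f0_perforated(P, L, l, delta):
    """Perforated panel: perforation ratio P, cavity depth L."""
    return (C / (2 * math.pi)) * math.sqrt(P / (L * (l + delta)))

print(f"plywood panel, 5 kg/m^2 over 0.05 m: {f0_membrane(5.0, 0.05):5.0f} Hz")
print(f"bottle-like resonator:               {f0_helmholtz(3e-4, 1e-3, 0.05, 0.01):5.0f} Hz")
print(f"perforated gypsum board (P = 0.1):   {f0_perforated(0.1, 0.1, 0.012, 0.01):5.0f} Hz")
```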

Compared to a membrane absorber, a resonator is an efficient sound absorber. The damping efficacy, and hence the sound absorption, can be optimized by mounting a layer of mineral wool behind the perforations. It can also be enhanced significantly by reducing the pore diameters. The most common form of this type of absorber is a perforated gypsum board.

The essence of understanding the physics behind sound absorbers is to equip oneself with the knowledge of how reverberation can be contained in room designs, which in turn reduces noise and enhances intelligibility. Some national building codes, for example Denmark's building law, clearly stipulate optimum T60 values for diverse working environments, including schools and day-care institutions among others. Furthermore, such codes recommend specific designs for buildings including theatre and concert halls, among others.

Room acoustic designers usually target the ceiling because it lends itself to manipulation, since most of its surface is always unoccupied. Consequently, acousticians have come up with designs, including suspended ceilings and mineral wool baffles, that aid in damping reverberation. These designs are portrayed by the pictures below:

[Figure: suspended absorbing ceilings and mineral wool baffles in practical room designs]

Principally, in room acoustics where the majority of the sound absorption is placed on the ceiling, T60 becomes a function of the room height h. Therefore:

T60 = 0.16·V / (α·S_ceiling) = 0.16·h / α

Many public places demand a noise-free environment, hence acousticians' efforts to dampen reverberation. However, an ideal situation requires intelligible speech too. To this effect, the figures below show how delayed sound decay in a room can mask weak phonemes, typified schematically as vertical bars. Basically, in speech, a consonant, usually a weaker sound, can jeopardize the intelligibility of an utterance if the decay time is not kept in check. As illustrated by the schematic diagrams below, a long reverberation time can severely deteriorate the intelligibility of the entire speech.

[Figure: schematic speech decay curves showing how long reverberation masks weak phonemes]

Jacobsen, Finn, Torben Poulsen, Jens Rindel, Anders Gade and Mogens Ohlrich. Fundamentals of Acoustics and Noise Control. Odense: Department of Electrical Engineering, 2011. Print.

Open access | Published: 30 October 2023

A large-scale comparison of human-written versus ChatGPT-generated essays

Steffen Herbold, Annette Hautli-Janisz, Ute Heuer, Zlata Kikteva & Alexander Trautsch

Scientific Reports volume 13, Article number: 18617 (2023)


Subjects: Computer science, Information technology

ChatGPT and similar generative AI models have attracted hundreds of millions of users and have become part of the public discourse. Many believe that such models will disrupt society and lead to significant changes in the education system and information generation. So far, this belief is based on either colloquial evidence or benchmarks from the owners of the models—both lack scientific rigor. We systematically assess the quality of AI-generated content through a large-scale study comparing human-written versus ChatGPT-generated argumentative student essays. We use essays that were rated by a large number of human experts (teachers). We augment the analysis by considering a set of linguistic characteristics of the generated essays. Our results demonstrate that ChatGPT generates essays that are rated higher regarding quality than human-written essays. The writing style of the AI models exhibits linguistic characteristics that are different from those of the human-written essays. Since the technology is readily available, we believe that educators must act immediately. We must re-invent homework and develop teaching concepts that utilize these AI models in the same way as math utilizes the calculator: teach the general concepts first and then use AI tools to free up time for other learning objectives.


Introduction

The massive uptake in the development and deployment of large-scale Natural Language Generation (NLG) systems in recent months has yielded an almost unprecedented worldwide discussion of the future of society. The ChatGPT service, which serves as a Web front-end to GPT-3.5 1 and GPT-4, was the fastest-growing service in history to break the 100 million user milestone in January and had 1 billion visits by February 2023 2 .

Driven by the upheaval that is particularly anticipated for education 3 and knowledge transfer for future generations, we conduct the first independent, systematic study of AI-generated language content that is typically dealt with in high-school education: argumentative essays, i.e. essays in which students discuss a position on a controversial topic by collecting and reflecting on evidence (e.g. ‘Should students be taught to cooperate or compete?’). Learning to write such essays is a crucial aspect of education, as students learn to systematically assess and reflect on a problem from different perspectives. Understanding the capability of generative AI to perform this task increases our understanding of the skills of the models, as well as of the challenges educators face when it comes to teaching this crucial skill. While there is a multitude of individual examples and anecdotal evidence for the quality of AI-generated content in this genre (e.g. 4 ) this paper is the first to systematically assess the quality of human-written and AI-generated argumentative texts across different versions of ChatGPT 5 . We use a fine-grained essay quality scoring rubric based on content and language mastery and employ a significant pool of domain experts, i.e. high school teachers across disciplines, to perform the evaluation. Using computational linguistic methods and rigorous statistical analysis, we arrive at several key findings:

  • AI models generate significantly higher-quality argumentative essays than the users of an essay-writing online forum frequented by German high-school students across all criteria in our scoring rubric.
  • ChatGPT-4 (ChatGPT web interface with the GPT-4 model) significantly outperforms ChatGPT-3 (ChatGPT web interface with the GPT-3.5 default model) with respect to logical structure, language complexity, vocabulary richness and text linking.
  • Writing styles between humans and generative AI models differ significantly: for instance, the GPT models use more nominalizations and have higher sentence complexity (signaling more complex, ‘scientific’, language), whereas the students make more use of modal and epistemic constructions (which tend to convey speaker attitude).
  • The linguistic diversity of the NLG models seems to be improving over time: while ChatGPT-3 still has a significantly lower linguistic diversity than humans, ChatGPT-4 has a significantly higher diversity than the students.

Our work goes significantly beyond existing benchmarks. While OpenAI’s technical report on GPT-4 6 presents some benchmarks, their evaluation lacks scientific rigor: it fails to provide vital information like the agreement between raters, does not report on details regarding the criteria for assessment or to what extent and how a statistical analysis was conducted for a larger sample of essays. In contrast, our benchmark provides the first (statistically) rigorous and systematic study of essay quality, paired with a computational linguistic analysis of the language employed by humans and two different versions of ChatGPT, offering a glance at how these NLG models develop over time. While our work is focused on argumentative essays in education, the genre is also relevant beyond education. In general, studying argumentative essays is one important aspect to understand how good generative AI models are at conveying arguments and, consequently, persuasive writing in general.

Related work

Natural language generation

The recent interest in generative AI models can be largely attributed to the public release of ChatGPT, a public interface in the form of an interactive chat based on the InstructGPT 1 model, more commonly referred to as GPT-3.5. In comparison to the original GPT-3 7 and other similar generative large language models based on the transformer architecture like GPT-J 8 , this model was not trained in a purely self-supervised manner (e.g. through masked language modeling). Instead, a pipeline that involved human-written content was used to fine-tune the model and improve the quality of the outputs to both mitigate biases and safety issues, as well as make the generated text more similar to text written by humans. Such models are referred to as Fine-tuned LAnguage Nets (FLANs). For details on their training, we refer to the literature 9 . Notably, this process was recently reproduced with publicly available models such as Alpaca 10 and Dolly (i.e. the complete models can be downloaded and not just accessed through an API). However, we can only assume that a similar process was used for the training of GPT-4 since the paper by OpenAI does not include any details on model training.

Testing of the language competency of large-scale NLG systems has only recently started. Cai et al. 11 show that ChatGPT reuses sentence structure, accesses the intended meaning of an ambiguous word, and identifies the thematic structure of a verb and its arguments, replicating human language use. Mahowald 12 compares ChatGPT’s acceptability judgments to human judgments on the Article + Adjective + Numeral + Noun construction in English. Dentella et al. 13 show that ChatGPT-3 fails to understand low-frequent grammatical constructions like complex nested hierarchies and self-embeddings. In another recent line of research, the structure of automatically generated language is evaluated. Guo et al. 14 show that in question-answer scenarios, ChatGPT-3 uses different linguistic devices than humans. Zhao et al. 15 show that ChatGPT generates longer and more diverse responses when the user is in an apparently negative emotional state.

Given that we aim to identify certain linguistic characteristics of human-written versus AI-generated content, we also draw on related work in the field of linguistic fingerprinting, which assumes that each human has a unique way of using language to express themselves, i.e. the linguistic means that are employed to communicate thoughts, opinions and ideas differ between humans. That these properties can be identified with computational linguistic means has been showcased across different tasks: the computation of a linguistic fingerprint allows to distinguish authors of literary works 16 , the identification of speaker profiles in large public debates 17 , 18 , 19 , 20 and the provision of data for forensic voice comparison in broadcast debates 21 , 22 . For educational purposes, linguistic features are used to measure essay readability 23 , essay cohesion 24 and language performance scores for essay grading 25 . Integrating linguistic fingerprints also yields performance advantages for classification tasks, for instance in predicting user opinion 26 , 27 and identifying individual users 28 .
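As a concrete illustration of such computational linguistic features, the sketch below computes two simple fingerprint-style measures, type-token ratio (a crude proxy for lexical diversity) and mean sentence length, for an arbitrary text; actual studies of this kind use far richer feature sets:

```python
import re

def simple_features(text):
    """Two toy linguistic-fingerprint features for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    type_token_ratio = len(set(tokens)) / len(tokens)
    mean_sentence_len = len(tokens) / len(sentences)
    return type_token_ratio, mean_sentence_len

ttr, msl = simple_features(
    "The model writes essays. The essays are rated by teachers. "
    "Teachers rate structure, vocabulary, and style."
)
print(f"type-token ratio: {ttr:.2f}, mean sentence length: {msl:.1f} tokens")
```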

Limitations of OpenAI's ChatGPT evaluations

OpenAI published a discussion of the model’s performance of several tasks, including Advanced Placement (AP) classes within the US educational system 6 . The subjects used in performance evaluation are diverse and include arts, history, English literature, calculus, statistics, physics, chemistry, economics, and US politics. While the models achieved good or very good marks in most subjects, they did not perform well in English literature. GPT-3.5 also experienced problems with chemistry, macroeconomics, physics, and statistics. While the overall results are impressive, there are several significant issues: firstly, the conflict of interest of the model’s owners poses a problem for the performance interpretation. Secondly, there are issues with the soundness of the assessment beyond the conflict of interest, which make the generalizability of the results hard to assess with respect to the models’ capability to write essays. Notably, the AP exams combine multiple-choice questions with free-text answers. Only the aggregated scores are publicly available. To the best of our knowledge, neither the generated free-text answers, their overall assessment, nor their assessment given specific criteria from the used judgment rubric are published. Thirdly, while the paper states that 1–2 qualified third-party contractors participated in the rating of the free-text answers, it is unclear how often multiple ratings were generated for the same answer and what was the agreement between them. This lack of information hinders a scientifically sound judgement regarding the capabilities of these models in general, but also specifically for essays. Lastly, the owners of the model conducted their study in a few-shot prompt setting, where they gave the models a very structured template as well as an example of a human-written high-quality essay to guide the generation of the answers. This further fine-tuning of what the models generate could have also influenced the output. The results published by the owners go beyond the AP courses which are directly comparable to our work and also consider other student assessments like Graduate Record Examinations (GREs). However, these evaluations suffer from the same problems with the scientific rigor as the AP classes.

Scientific assessment of ChatGPT

Researchers across the globe are currently assessing the individual capabilities of these models with greater scientific rigor. We note that due to the recency and speed of these developments, the hereafter discussed literature has mostly only been published as pre-prints and has not yet been peer-reviewed. In addition to the above issues concretely related to the assessment of the capabilities to generate student essays, it is also worth noting that there are likely large problems with the trustworthiness of evaluations, because of data contamination, i.e. because the benchmark tasks are part of the training of the model, which enables memorization. For example, Aiyappa et al. 29 find evidence that this is likely the case for benchmark results regarding NLP tasks. This complicates the effort by researchers to assess the capabilities of the models beyond memorization.

Nevertheless, the first assessment results are already available – though mostly focused on ChatGPT-3 and not yet ChatGPT-4. Closest to our work is a study by Yeadon et al. 30 , who also investigate ChatGPT-3 performance when writing essays. They grade essays generated by ChatGPT-3 for five physics questions based on criteria that cover academic content, appreciation of the underlying physics, grasp of subject material, addressing the topic, and writing style. For each question, ten essays were generated and rated independently by five researchers. While the sample size precludes a statistical assessment, the results demonstrate that the AI model is capable of writing high-quality physics essays, but that the quality varies in a manner similar to human-written essays.

Guo et al. 14 create a set of free-text question-answering tasks based on data collected from the internet, e.g. question answering from Reddit. The authors sample thirty triplets of a question, a human answer, and a ChatGPT-3-generated answer and ask human raters to detect which answer was written by a human and which by an AI. While this approach does not directly assess the quality of the output, it serves as a Turing test 31 designed to evaluate whether humans can distinguish between human- and AI-produced output. The results indicate that humans are indeed able to distinguish between the outputs when presented with a pair of answers. Humans familiar with ChatGPT are also able to identify over 80% of AI-generated answers without seeing a human answer for comparison. However, humans who are not familiar with ChatGPT-3 identify AI-written answers only about 50% of the time, i.e. at chance level. Moreover, the authors find that the AI-generated outputs are deemed more helpful than the human answers in slightly more than half of the cases. This suggests that the strong results from OpenAI’s own benchmarks regarding the capability to generate free-text answers generalize beyond those benchmarks.

There are, however, some indicators that the benchmarks may be overly optimistic in their assessment of the model’s capabilities. For example, Kortemeyer 32 conducts a case study to assess how well ChatGPT-3 would perform in a physics class, simulating the tasks that students need to complete as part of the course: answering multiple-choice questions, doing homework assignments, asking questions during lessons, completing programming exercises, and writing exams with free-text questions. Notably, ChatGPT-3 was allowed to interact with the instructor for many of the tasks, allowing for multiple attempts as well as feedback on preliminary solutions. The experiment shows that ChatGPT-3’s performance is in many respects similar to that of a beginning learner and that the model makes similar mistakes, such as omitting units or simply plugging in results from equations. Overall, the AI would have passed the course, albeit with the low score of 1.5 out of 4.0. Similarly, Kung et al. 33 study the performance of ChatGPT-3 on the United States Medical Licensing Exam (USMLE) and find that the model performs at or near the passing threshold. Their assessment is somewhat more optimistic than Kortemeyer’s: they state that this level of performance, comprehensible reasoning, and valid clinical insights suggest that models such as ChatGPT may potentially assist human learning in clinical decision-making.

Frieder et al. 34 evaluate the capabilities of ChatGPT-3 in solving graduate-level mathematical tasks. They find that while ChatGPT-3 seems to have some mathematical understanding, its level is well below that of an average student and in most cases not sufficient to pass exams. Yuan et al. 35 consider the arithmetic abilities of language models, including ChatGPT-3 and ChatGPT-4. They find that the GPT models exhibit the best performance among the language models available at the time (including Llama 36 , FLAN-T5 37 , and Bloom 38 ). However, accuracy on basic arithmetic tasks is still only 83% when requiring correctness to a precision of \(10^{-3}\) , i.e. such models are still not capable of functioning reliably as calculators. In a slightly satirical, yet insightful take, Spencer et al. 39 assess what a scientific paper on gamma-ray astrophysics would look like if it were written largely with the assistance of ChatGPT-3. They find that while the language capabilities are good and the model is capable of generating equations, the arguments are often flawed and the references to the scientific literature are full of hallucinations.

The general reasoning skills of the models may also not be at the level the benchmarks suggest. For example, Cherian et al. 40 evaluate how well ChatGPT-3 performs on eleven puzzles that second graders should be able to solve and find that ChatGPT solves them on average in only 36.4% of attempts, whereas the second graders achieve a mean of 60.4%. However, their sample size is very small and the problems were posed as multiple-choice questions, which cannot be directly compared to the natural language generation we consider.

Research gap

Within this article, we address an important part of the current research gap regarding the capabilities of ChatGPT (and similar technologies), guided by the following research questions:

RQ1: How good is ChatGPT based on GPT-3 and GPT-4 at writing argumentative student essays?

RQ2: How do AI-generated essays compare to essays written by students?

RQ3: What are linguistic devices that are characteristic of student versus AI-generated content?

We study these aspects with the help of a large group of teaching professionals who systematically assess a large corpus of student- and AI-written essays. To the best of our knowledge, this is the first large-scale, independent scientific assessment of ChatGPT (or similar models) of this kind. Answering these questions is crucial to understanding the impact of ChatGPT on the future of education.

Materials and methods

The essay topics originate from a corpus of argumentative essays in the field of argument mining 41 . Argumentative essays require students to think critically about a topic and use evidence to establish a position on it in a concise manner. The corpus features essays for 90 topics from Essay Forum 42 , an active community that provides writing feedback on different kinds of text and is frequented by high-school students seeking feedback from native speakers on their essay-writing capabilities. Information about the age of the writers is not available, but the topics indicate that the essays were written in grades 11–13, so the authors were likely at least 16 years old. Topics range from ‘Should students be taught to cooperate or to compete?’ to ‘Will newspapers become a thing of the past?’. In the corpus, each topic features one human-written essay that was uploaded and discussed in the forum. The students who wrote the essays are not native speakers. These essays average 19 sentences and 388 tokens (2,089 characters) and are termed ‘student essays’ in the remainder of the paper.

For the present study, we use the topics from Stab and Gurevych 41 and prompt ChatGPT with ‘Write an essay with about 200 words on “[ topic ]”’ to receive automatically generated essays from the ChatGPT-3 and ChatGPT-4 versions of 22 March 2023 (‘ChatGPT-3 essays’, ‘ChatGPT-4 essays’). No additional prompts were used, i.e. the data was created with a basic prompt in a zero-shot scenario. This is in contrast to the benchmarks by OpenAI, which used an engineered prompt in a few-shot scenario to guide the generation of essays. We asked for 200 words because we noticed ChatGPT’s tendency to generate essays longer than the requested length; a prompt asking for 300 words typically yielded essays with more than 400 words. By using the shorter length of 200 words, we prevent a potential advantage for ChatGPT through longer essays and instead err on the side of brevity. As in OpenAI’s evaluations of free-text answers, we did not consider multiple configurations of the model due to the effort required to obtain human judgments. For the same reason, our data is restricted to ChatGPT and does not include other models available at that time, e.g. Alpaca. We use the browser versions of the tools because we consider this a more realistic scenario than using the API. Table 1 below shows the core statistics of the resulting dataset. Supplemental material S1 shows examples of essays from the dataset.
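For illustration, the same zero-shot prompt could also be issued programmatically. The following is a hypothetical sketch only: we used the browser interface, so the OpenAI Python client calls, the model identifier, and the shortened topic list below are assumptions of the sketch, not part of our actual setup.

```python
import openai  # hypothetical programmatic variant; the study used the browser version

openai.api_key = "sk-..."  # placeholder API key

# Illustrative stand-in for the 90 topics from Stab and Gurevych (2014)
topics = ["Should students be taught to cooperate or to compete?"]

def generate_essay(topic: str, model: str = "gpt-4") -> str:
    # The zero-shot prompt from the study: no examples, no structural template
    prompt = f'Write an essay with about 200 words on "{topic}"'
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

essays = {topic: generate_essay(topic) for topic in topics}
```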

Annotation study

Study participants.

The participants had registered for a two-hour online training entitled ‘ChatGPT – Challenges and Opportunities’, conducted by the authors of this paper to provide teachers with some of the technological background of NLG systems in general and ChatGPT in particular. Only teachers permanently employed at secondary schools were allowed to register for this training. Focusing on these experts allows us to obtain meaningful results, as the participants have extensive experience in assessing students’ writing. A total of 139 teachers registered for the training; 129 of them teach at grammar schools, and only 10 hold a position at other secondary schools. About half of the registered teachers (68) have been in service for many years and have successfully applied for promotion. For data protection reasons, we do not know the subject combinations of the registered teachers. We only know that a variety of subjects are represented, including languages (English, French and German), religion/ethics, and science. Supplemental material S5 provides general information regarding German teacher qualifications.

The training began with an online lecture followed by a discussion phase. Teachers were given an overview of language models and basic information on how ChatGPT was developed. After about 45 minutes, the teachers received both a written and an oral explanation of the questionnaire at the core of our study (see Supplementary material S3 ) and were informed that they had 30 minutes to finish the study tasks. The explanation included information on how the data was obtained, why we collected the self-assessments, how we chose the criteria for rating the essays, the overall goal of our research, and a walk-through of the questionnaire. Participation in the questionnaire was voluntary and did not affect the awarding of a training certificate. We further informed participants that all data was collected anonymously and that we would have no way of identifying who participated in the questionnaire. We informed participants orally that, by taking part in the survey, they consented to the use of their ratings for our research.

Once these instructions were provided orally and in writing, the link to the online form was given to the participants. The online form ran on a local server that did not log any information that could identify the participants (e.g. IP addresses) to ensure anonymity. As per the instructions, consent for participation was given by using the online form. Due to the full anonymity, we could, by definition, not document who exactly provided consent. This served as further assurance that non-participation could not possibly affect the awarding of the training certificate.

About 20% of the training participants did not take part in the questionnaire study; the remaining participants consented based on the information provided and participated in the rating of essays. After the questionnaire, we continued with an online lecture on the opportunities of using ChatGPT for teaching, as well as AI beyond chatbots. The study protocol was reviewed and approved by the Research Ethics Committee of the University of Passau. We further confirm that our study protocol is in accordance with all relevant guidelines.

Questionnaire

The questionnaire consists of three parts. The first is a brief self-assessment of the participants’ English skills, based on the Common European Framework of Reference for Languages (CEFR) 43 , with six levels ranging from ‘comparable to a native speaker’ to ‘some basic skills’ (see supplementary material S3 ). In the second part, each participant was shown six essays. The participants were only shown the text of each essay and were not told whether it was human-written or AI-generated.

The questionnaire covers the seven categories relevant for essay assessment shown below (for details see supplementary material S3 ):

Topic and completeness

Logic and composition

Expressiveness and comprehensiveness

Language mastery

Complexity

Vocabulary and text linking

Language constructs

These categories follow the guidelines for essay assessment 44 established by the Ministry for Education of Lower Saxony, Germany. For each criterion, a seven-point Likert scale with scores from zero to six is defined, where zero is the worst score (e.g. no relation to the topic) and six is the best score (e.g. the topic is addressed to a special degree). The questionnaire included written descriptions as guidance for the scoring.

After rating each essay, the participants were asked to self-assess their confidence in their ratings. We used a five-point Likert scale based on the criteria for the self-assessment of peer-review scores from the Association for Computational Linguistics (ACL). Once a participant had finished rating the six essays, they were shown a summary of their ratings, the individual ratings for each essay, and the information on how each essay was generated.

Computational linguistic analysis

In order to further explore and compare the quality of the essays written by students and by ChatGPT, we consider the following linguistic characteristics: lexical diversity, sentence complexity (measured in two ways), nominalization, and the presence of modals, epistemic markers, and discourse markers. These are motivated by previous work: Weiss et al. 25 observe correlations between measures of lexical, syntactic and discourse complexity and the essay grades of German high-school examinations, while McNamara et al. 45 explore cohesion (indicated, among other things, by connectives), syntactic complexity and lexical diversity in relation to essay scoring.

Lexical diversity

We quantify vocabulary richness using the well-established measure of textual lexical diversity (MTLD) 46 , which is often used in the field of automated essay grading 25 , 45 , 47 . It takes the number of unique words into account but, unlike the best-known measure of lexical diversity, the type-token ratio (TTR), it is far less sensitive to differences in text length. In fact, Koizumi and In’nami 48 find it to be the measure least affected by text length among several measures of lexical diversity. This is relevant for us due to the difference in average length between the human-written and ChatGPT-generated essays.
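To make the measure concrete, the following is a minimal sketch of MTLD as described by McCarthy and Jarvis 46 : tokens are consumed until the running type-token ratio falls below the conventional threshold of 0.72, each such segment counts as one ‘factor’, and the score averages a forward and a backward pass. This is a simplified reimplementation for illustration; the exact tokenization and implementation details of our analysis are in the replication package.

```python
def mtld_pass(tokens, threshold=0.72):
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok.lower())
        if len(types) / count <= threshold:  # TTR dropped below threshold: one full factor
            factors += 1.0
            types, count = set(), 0
    if count > 0:  # partial factor for the remaining segment
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))

def mtld(tokens, threshold=0.72):
    # MTLD is the mean of a forward and a backward pass over the token sequence
    return 0.5 * (mtld_pass(tokens, threshold) + mtld_pass(tokens[::-1], threshold))

print(mtld("the quick brown fox jumps over the lazy dog".split()))
```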

Syntactic complexity

We use two measures to evaluate the syntactic complexity of the essays. The first is based on the maximum depth of the sentence dependency tree, produced with the spaCy 3.4.2 dependency parser 49 (‘Syntactic complexity (depth)’). For the second measure, we adopt an approach similar in nature to that of Weiss et al. 25 , who use clause structure to evaluate syntactic complexity. In our case, we count the number of conjuncts, clausal modifiers of nouns, adverbial clause modifiers, clausal complements, clausal subjects, and parataxes (‘Syntactic complexity (clauses)’). Supplementary material S2 illustrates the difference between the two measures using two examples from the data.
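A minimal sketch of the two measures is shown below, assuming spaCy with an English pipeline. The set of dependency labels is our reading of the clause types listed above and may differ in detail from the label set used in our actual analysis code.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # the study used the spaCy 3.4.2 dependency parser

# Approximate labels for conjuncts, clausal modifiers of nouns, adverbial clause
# modifiers, clausal complements, clausal subjects, and parataxes
CLAUSE_DEPS = {"conj", "acl", "advcl", "ccomp", "csubj", "parataxis"}

def tree_depth(token):
    # Depth of the dependency subtree rooted at this token
    children = list(token.children)
    return 1 if not children else 1 + max(tree_depth(c) for c in children)

def syntactic_complexity(text):
    doc = nlp(text)
    depth = max(tree_depth(sent.root) for sent in doc.sents)  # 'depth' measure
    clauses = sum(tok.dep_ in CLAUSE_DEPS for tok in doc)     # 'clauses' measure
    return depth, clauses

print(syntactic_complexity("Although it rained, we went out because we had promised to."))
```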

Nominalization is a common feature of a more scientific style of writing 50 and is used as an additional measure of syntactic complexity. To explore this feature, we count occurrences of nouns with suffixes such as ‘-ion’, ‘-ment’, ‘-ance’ and a few others that are known to transform verbs into nouns.
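Reusing the spaCy pipeline from the sketch above, nominalizations can be approximated by matching noun suffixes. The exact suffix list below is an assumption for illustration, since only ‘-ion’, ‘-ment’ and ‘-ance’ are named explicitly.

```python
# Assumed suffix set; the text names '-ion', '-ment', '-ance' 'and a few others'
NOMINAL_SUFFIXES = ("ion", "ment", "ance", "ence", "ness", "ity")

def count_nominalizations(doc):
    # Count nouns whose surface form carries a deverbal suffix
    return sum(
        tok.pos_ == "NOUN" and tok.text.lower().endswith(NOMINAL_SUFFIXES)
        for tok in doc
    )

print(count_nominalizations(nlp("The assessment of the performance showed great importance.")))
```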

Semantic properties

Both modals and epistemic markers signal the writer’s commitment to their statements. We identify modals using the POS-tagging module provided by spaCy, as well as a list of epistemic expressions of modality, such as ‘definitely’ and ‘potentially’, which has also been used in other approaches to identifying semantic properties 51 . For epistemic markers, we adopt an empirically driven approach and utilize the epistemic markers identified in a corpus of dialogical argumentation by Hautli-Janisz et al. 52 . We consider expressions such as ‘I think’, ‘it is believed’ and ‘in my opinion’ to be epistemic.
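A sketch of both counts, again on top of spaCy: modal verbs carry the Penn Treebank tag MD, and epistemic markers can be matched as phrases. The marker list below is a small illustrative subset, not the full list from Hautli-Janisz et al. 52 .

```python
# Illustrative subset of epistemic markers, not the full empirically derived list
EPISTEMIC_MARKERS = ("i think", "i believe", "in my opinion", "it is believed",
                     "definitely", "potentially")

def count_modals_and_epistemics(doc):
    modals = sum(tok.tag_ == "MD" for tok in doc)  # Penn Treebank tag for modal verbs
    lowered = doc.text.lower()
    epistemics = sum(lowered.count(marker) for marker in EPISTEMIC_MARKERS)
    return modals, epistemics

print(count_modals_and_epistemics(nlp("I think we should definitely go, but it might rain.")))
```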

Discourse properties

Discourse markers can be used to measure the coherence quality of a text. This has been explored by Somasundaran et al. 53 , who use discourse markers to evaluate the story-telling aspect of student writing, while Nadeem et al. 54 incorporate them in their deep-learning-based approach to automated essay scoring. In the present paper, we employ the PDTB list of discourse markers 55 , which we adjust to exclude words that are often used for purposes other than indicating discourse relations, such as ‘like’, ‘for’ and ‘in’.
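A sketch of the counting step, using a small illustrative subset of the PDTB connectives; naive substring matching is shown for brevity, whereas a careful implementation would match on token boundaries.

```python
# Illustrative subset of PDTB discourse connectives; ambiguous items such as
# 'like', 'for' and 'in' are excluded, as in the study
PDTB_MARKERS = ("however", "therefore", "moreover", "nevertheless",
                "in addition", "as a result", "on the other hand")

def count_discourse_markers(text):
    lowered = text.lower()
    return sum(lowered.count(marker) for marker in PDTB_MARKERS)

print(count_discourse_markers("However, the plan failed. As a result, we started over."))
```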

Statistical methods

We use a within-subjects design for our study. Each participant was shown six randomly selected essays. Results were submitted to the survey system after each essay was completed, so that partial data would be retained if participants ran out of time and did not finish scoring all six essays. Cronbach’s \(\alpha\) 56 allows us to determine the inter-rater reliability for each rating criterion and data source (human, ChatGPT-3, ChatGPT-4), in order to understand the reliability of our data not only overall, but also for each data source and rating criterion. We use two-sided Wilcoxon rank-sum tests 57 to confirm the significance of the differences between the data sources for each criterion, and the same tests to determine the significance of the differences in the linguistic characteristics. This results in three comparisons (human vs. ChatGPT-3, human vs. ChatGPT-4, ChatGPT-3 vs. ChatGPT-4) for each of the seven rating criteria and each of the seven linguistic characteristics, i.e. 42 tests. We use the Holm-Bonferroni method 58 to correct for multiple testing and achieve a family-wise error rate of 0.05. We report effect sizes using Cohen’s d 59 . While our data is not perfectly normal, it also does not have severe outliers, so we prefer the clear interpretation of Cohen’s d over the slightly more appropriate, but less accessible, non-parametric effect size measures. We report point plots with estimates of the mean scores for each data source and criterion, including the 95% confidence intervals of these mean values. The confidence intervals are estimated in a non-parametric manner based on bootstrap sampling. We further visualize the distribution for each criterion using violin plots to provide a visual indicator of the spread of the data (see Supplementary material S4 ).
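The following sketch shows the core of this pipeline on synthetic stand-in data, using the libraries named below (pingouin, scipy). The variable names and the synthetic scores are placeholders; the real data and analysis code are in the replication package.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in: 90 essays rated by 3 raters on the 0-6 scale
ratings_wide = pd.DataFrame(rng.integers(0, 7, size=(90, 3)), columns=["r1", "r2", "r3"])
alpha, ci = pg.cronbach_alpha(data=ratings_wide)  # inter-rater reliability

# Synthetic stand-in scores for one criterion and two data sources
human = rng.normal(3.9, 1.1, size=90)
gpt4 = rng.normal(5.2, 1.0, size=90)

stat, p = stats.ranksums(human, gpt4)                # two-sided Wilcoxon rank-sum test
d = pg.compute_effsize(human, gpt4, eftype="cohen")  # Cohen's d

# Holm-Bonferroni correction over the whole family of 42 tests
p_values = np.full(42, p)  # placeholder: in the study, one p-value per comparison
reject, p_adjusted = pg.multicomp(p_values, method="holm")
print(alpha, p, d, reject[:3])
```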

Further, we use the self-assessments of English skills and of confidence in the essay ratings as potential confounding variables, to determine whether ratings are affected by language skills or confidence rather than by the actual quality of the essays. We control for their impact by measuring Pearson’s correlation coefficient r 60 between the self-assessments and the ratings. We also determine whether the linguistic features are correlated with the ratings as expected: sentence complexity (both tree depth and dependency clauses) and nominalization are indicators of the complexity of the language; the use of discourse markers should signal a proper logical structure; and a large lexical diversity should be correlated with the ratings for vocabulary. As above, we measure Pearson’s r and use a two-sided significance test based on a \(\beta\)-distribution that models the expected correlations, as implemented by scipy 61 . As above, we use the Holm-Bonferroni method to account for multiple tests. However, we note that, given our amount of data, even tiny correlations are likely to be significant. Consequently, our interpretation of these results focuses on the strength of the correlations.
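A corresponding sketch for the confounder analysis, with synthetic stand-ins for the self-assessments: scipy’s pearsonr computes both the coefficient and the two-sided p-value based on the beta distribution mentioned above.

```python
from scipy.stats import pearsonr
import numpy as np

rng = np.random.default_rng(1)
self_assessed_skill = rng.integers(1, 7, size=200)  # synthetic CEFR self-assessments
essay_ratings = rng.integers(0, 7, size=200)        # synthetic essay ratings

r, p = pearsonr(self_assessed_skill, essay_ratings)
print(f"r={r:.2f}, p={p:.3f}")  # interpretation focuses on the strength of r
```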

Our statistical analysis of the data is implemented in Python. We use pandas 1.5.3 and numpy 1.24.2 for data processing, pingouin 0.5.3 for the calculation of Cronbach’s \(\alpha\) , scipy 1.10.1 for the Wilcoxon rank-sum tests and Pearson’s r , and seaborn 0.12.2 for the generation of plots, including the calculation of error bars that visualize the confidence intervals.

Out of the 111 teachers who completed the questionnaire, 108 rated all six essays, one rated five essays, one rated two essays, and one rated only one essay. This results in 656 ratings for 270 essays (90 topics for each essay type: human-, ChatGPT-3-, and ChatGPT-4-written), with three ratings for 121 essays, two ratings for 144 essays, and one rating for five essays. The inter-rater agreement is consistently excellent ( \(\alpha > 0.9\) ), with the exception of language mastery, where we have good agreement ( \(\alpha = 0.89\) , see Table  2 ). Further, the correlation analysis depicted in supplementary material S4 shows weak positive correlations ( \(r \in [0.11, 0.28]\) ) between the self-assessed English skills and the ratings, and likewise between the self-assessed confidence in the ratings and the ratings. Overall, this indicates that our ratings are reliable estimates of the actual quality of the essays, with a potential small tendency for higher confidence and better language skills to yield better ratings, independent of the data source.

Table  2 and supplementary material S4 characterize the distribution of the ratings for the essays, grouped by the data source. We observe that for all criteria, we have a clear order of the mean values, with students having the worst ratings, ChatGPT-3 in the middle rank, and ChatGPT-4 with the best performance. We further observe that the standard deviations are fairly consistent and slightly larger than one, i.e. the spread is similar for all ratings and essays. This is further supported by the visual analysis of the violin plots.

The statistical analysis of the ratings reported in Table  4 shows that the differences between the human-written essays and those generated by both ChatGPT models are significant. The effect sizes for human versus ChatGPT-3 essays are between 0.52 and 1.15, i.e. a medium ( \(d \in [0.5, 0.8)\) ) to large ( \(d \in [0.8, 1.2)\) ) effect. On the one hand, the smallest effects are observed for expressiveness and complexity, i.e. when it comes to the overall comprehensiveness and the complexity of the sentence structures, the differences between humans and the ChatGPT-3 model are smallest. On the other hand, the difference in language mastery is larger than all other differences, which indicates that humans are more prone to making mistakes when writing than the NLG models. The magnitude of the differences between humans and ChatGPT-4 is larger, with effect sizes between 0.88 and 1.43, i.e. a large to very large ( \(d \in [1.2, 2)\) ) effect. As for ChatGPT-3, the differences are smallest for expressiveness and complexity and largest for language mastery. Please note that the difference in language mastery between humans and both GPT models does not mean that humans have low scores for language mastery (M=3.90), but rather that the NLG models have exceptionally high scores (M=5.03 for ChatGPT-3, M=5.25 for ChatGPT-4).

When we consider the differences between the two GPT models, we observe that while ChatGPT-4 has consistently higher mean values for all criteria, only the differences for logic and composition, vocabulary and text linking, and complexity are significant. The effect sizes are between 0.45 and 0.5, i.e. small ( \(d \in [0.2, 0.5)\) ) to medium. Thus, while GPT-4 seems to be a general improvement over GPT-3.5, the clearest indicators of this are a better and clearer logical composition and more complex writing with a more diverse vocabulary.

We also observe significant differences in the distribution of the linguistic characteristics between all three groups (see Table  3 ). Sentence complexity (depth) is the only category without a significant difference between humans and ChatGPT-3, as well as between ChatGPT-3 and ChatGPT-4. There is also no significant difference in the category of discourse markers between humans and ChatGPT-3. The magnitude of the effects varies considerably, ranging from 0.39 to 1.93, i.e. from small ( \(d \in [0.2, 0.5)\) ) to very large. However, in contrast to the ratings, there is no clear tendency regarding the direction of the differences. For instance, while the ChatGPT models write more complex sentences and use more nominalizations, humans tend to use more modals and epistemic markers instead. The lexical diversity of humans is higher than that of ChatGPT-3 but lower than that of ChatGPT-4. While there is no difference in the use of discourse markers between humans and ChatGPT-3, ChatGPT-4 uses significantly fewer discourse markers.

We detect the expected positive correlations between the complexity ratings and the linguistic markers for sentence complexity ( \(r=0.16\) for depth, \(r=0.19\) for clauses) and nominalizations ( \(r=0.22\) ). However, we observe a negative correlation between the logic ratings and the discourse markers ( \(r=-0.14\) ), which counters our intuition that more frequent use of discourse indicators makes a text more logically coherent. However, this is in line with previous work: McNamara et al. 45 also find no indication that the use of cohesion indices such as discourse connectives correlates with high- and low-proficiency essays. Finally, we observe the expected positive correlation between the ratings for the vocabulary and the lexical diversity ( \(r=0.12\) ). All observed correlations are significant. However, we note that the strength of all these correlations is weak and that the significance itself should not be over-interpreted due to the large sample size.

Our results provide clear answers to the first two research questions, which consider the quality of the generated essays: ChatGPT performs well at writing argumentative student essays and significantly outperforms human-written essays in the ratings. The ChatGPT-4 model shows (at least) a large effect and is on average about one point better than humans on the seven-point Likert scale.

Regarding the third research question, we find significant linguistic differences between human- and AI-generated content. The AI-generated essays are highly structured, which is reflected, for instance, by the identical beginnings of the concluding sections of all ChatGPT essays (‘In conclusion, [...]’). The initial sentences of the essays are also very similar, starting with a general statement based on the main concepts of the essay topic. Although this corresponds to the general structure that is sought after for argumentative essays, it is striking how rigidly the ChatGPT models realize it, whereas the human-written essays represent the guideline more loosely on the linguistic surface. Moreover, the linguistic fingerprint has the counter-intuitive property that the use of discourse markers is negatively correlated with logical coherence. We believe that this might be due to the rigid structure of the generated essays: instead of using discourse markers, the AI models provide a clear logical structure by separating the different arguments into paragraphs, thereby reducing the need for discourse markers.

Our data also shows that hallucinations are not a problem in the setting of argumentative essay writing: the essay topics are not primarily about factual correctness, but rather about argumentation and critical reflection on general concepts, which seem to be contained within the knowledge of the AI models. The stochastic nature of language generation is well suited to this kind of task, as different plausible arguments can be seen as samples from all available arguments for a topic. Nevertheless, a more systematic study of the argumentative structures is needed to better understand the differences in argumentation between human-written and ChatGPT-generated essay content. Moreover, we cannot rule out that subtle hallucinations were overlooked during the ratings. There are also essays with low ratings for the criteria related to factual correctness, indicating that there might be cases where the AI models still have problems, even if they are, on average, better than the students.

One issue with evaluations of recent large language models is that they often fail to account for the impact of tainted data, i.e. benchmark data that was part of the training data. While it is certainly possible that the essays that Stab and Gurevych 41 sourced from the internet were part of the training data of the GPT models, the proprietary nature of the model training means that we cannot confirm this. However, we note that the generated essays did not resemble the corpus of human essays at all. Moreover, the topics of the essays are general, in the sense that any human should be able to reason and write about them just by understanding concepts like ‘cooperation’. Consequently, a taint on these general topics, i.e. the fact that they might be present in the data, is not only possible but actually expected and unproblematic, as it relates to the capability of the models to learn about concepts rather than to the memorization of specific task solutions.

While we did everything we could to ensure a sound construct and high validity of our study, there are still certain issues that may affect our conclusions. Most importantly, neither the writers of the essays nor their raters were native English speakers. However, the students purposefully used a forum for English writing frequented by native speakers to ensure the language and content quality of their essays. This indicates that the resulting essays are likely above average for non-native speakers, as they went through at least one round of revision with the help of native speakers. The teachers were informed that part of the training would be in English to prevent registrations from people without English language skills. Moreover, the self-assessed language skills were only weakly correlated with the ratings, indicating that the threat to the soundness of our results is low. While we cannot definitively rule out that our results would fail to reproduce with other human raters, the high inter-rater agreement indicates that this is unlikely.

However, our reliance on essays written by non-native speakers affects the external validity and generalizability of our results. It is certainly possible that native-speaking students would perform better on the criteria related to language skills, though it is unclear by how much. However, language skills were a particular strength of the AI models: while the gap might be smaller for native speakers, it is still reasonable to conclude that the AI models would perform at least comparably to humans, and possibly still better, just with a smaller margin. While we cannot rule out a difference for the content-related criteria, we also see no strong argument why native speakers should have better arguments than non-native speakers. Thus, while our results might not fully translate to native speakers, we see no reason why the aspects regarding content should not be similar. Further, our results were obtained based on high-school-level essays. Native and non-native speakers with higher-education degrees, or experts in their fields, would likely achieve better performance, such that the difference between the AI models and humans would likely be smaller in such a setting.

We further note that the essay topics may not be an unbiased sample. While Stab and Gurevych 41 randomly sampled the essays from the writing feedback section of an essay forum, it is unclear whether the essays posted there are representative of the general population of essay topics. Nevertheless, we believe this threat is fairly low because our results are consistent and do not seem to be influenced by particular topics. Further, we cannot conclude with certainty how our results generalize beyond ChatGPT-3 and ChatGPT-4 to similar models like Bard ( https://bard.google.com/?hl=en ), Alpaca, and Dolly. The results for the linguistic characteristics are especially hard to predict. However, since, to the best of our knowledge and given the proprietary nature of some of these models, the general approach behind them is similar, the trends for essay quality should hold for models of comparable size and training procedure.

Finally, we want to note that the current speed of progress with generative AI is extremely fast and we are studying moving targets: ChatGPT 3.5 and 4 today are already not the same as the models we studied. Due to a lack of transparency regarding the specific incremental changes, we cannot know or predict how this might affect our results.

Our results provide a strong indication that the fear many teaching professionals have is warranted: the way students do homework and the way teachers assess it need to change in a world of generative AI models. For non-native speakers, our results show that students who want to maximize their essay grades could easily do so by relying on AI models like ChatGPT. The very strong performance of the AI models indicates that this might also be the case for native speakers, though the difference in language skills is probably smaller. However, this is not and cannot be the goal of education. Consequently, educators need to change how they approach homework. Instead of just assigning and grading essays, we need to reflect more on the output of AI tools with respect to its reasoning and correctness. AI models need to be seen as an integral part of education, but one that requires careful reflection and the training of critical thinking skills.

Furthermore, teachers need to adapt strategies for teaching writing skills: as with the use of calculators, it is necessary to critically reflect with the students on when and how to use those tools. For instance, constructivists 62 argue that learning is enhanced by the active design and creation of unique artifacts by students themselves. In the present case this means that, in the long term, educational objectives may need to be adjusted. This is analogous to teaching good arithmetic skills to younger students and then allowing and encouraging students to use calculators freely in later stages of education. Similarly, once a sound level of literacy has been achieved, strongly integrating AI models in lesson plans may no longer run counter to reasonable learning goals.

In terms of shedding light on the quality and structure of AI-generated essays, this paper makes an important contribution by offering an independent, large-scale and statistically sound account of essay quality, comparing human-written and AI-generated texts. By comparing different versions of ChatGPT, we also offer a glance at the development of these models over time in terms of their linguistic properties and the quality they exhibit. Our results show that while the language generated by ChatGPT is considered very good by humans, there are also notable structural differences, e.g. in the use of discourse markers. This demonstrates that an in-depth consideration is required not only of the capabilities of generative AI models (i.e. which tasks they can be used for), but also of the language they generate. For example, if we read many AI-generated texts that use fewer discourse markers, it raises the question of whether and how this would affect our own use of discourse markers. Understanding how AI-generated texts differ from human-written ones enables us to look for these differences, to reason about their potential impact, and to study and possibly mitigate this impact.

Data availability

The datasets generated and/or analysed during the current study are available in the Zenodo repository, https://doi.org/10.5281/zenodo.8343644 .

Code availability

All materials are available online in the form of a replication package that contains the data and the analysis code, https://doi.org/10.5281/zenodo.8343644 .

References

1. Ouyang, L. et al. Training language models to follow instructions with human feedback (2022). arXiv:2203.02155.

2. Ruby, D. 30+ detailed ChatGPT statistics – users & facts. https://www.demandsage.com/chatgpt-statistics/ (2023). Accessed 09 June 2023.

3. Leahy, S. & Mishra, P. TPACK and the Cambrian explosion of AI. In Society for Information Technology & Teacher Education International Conference (ed. Langran, E.) 2465–2469 (Association for the Advancement of Computing in Education (AACE), 2023).

4. Ortiz, S. Need an AI essay writer? Here’s how ChatGPT (and other chatbots) can help. https://www.zdnet.com/article/how-to-use-chatgpt-to-write-an-essay/ (2023). Accessed 09 June 2023.

5. OpenAI chat interface. https://chat.openai.com/ . Accessed 09 June 2023.

6. OpenAI. GPT-4 technical report (2023). arXiv:2303.08774.

7. Brown, T. B. et al. Language models are few-shot learners (2020). arXiv:2005.14165.

8. Wang, B. Mesh-Transformer-JAX: Model-parallel implementation of transformer language model with JAX. https://github.com/kingoflolz/mesh-transformer-jax (2021).

9. Wei, J. et al. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (2022).

10. Taori, R. et al. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca (2023).

11. Cai, Z. G., Haslett, D. A., Duan, X., Wang, S. & Pickering, M. J. Does ChatGPT resemble humans in language use? (2023). arXiv:2303.08014.

12. Mahowald, K. A discerning several thousand judgments: GPT-3 rates the article + adjective + numeral + noun construction (2023). arXiv:2301.12564.

13. Dentella, V., Murphy, E., Marcus, G. & Leivada, E. Testing AI performance on less frequent aspects of language reveals insensitivity to underlying meaning (2023). arXiv:2302.12313.

14. Guo, B. et al. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection (2023). arXiv:2301.07597.

15. Zhao, W. et al. Is ChatGPT equipped with emotional dialogue capabilities? (2023). arXiv:2304.09582.

16. Keim, D. A. & Oelke, D. Literature fingerprinting: A new method for visual literary analysis. In 2007 IEEE Symposium on Visual Analytics Science and Technology, 115–122, https://doi.org/10.1109/VAST.2007.4389004 (IEEE, 2007).

17. El-Assady, M. et al. Interactive visual analysis of transcribed multi-party discourse. In Proceedings of ACL 2017, System Demonstrations, 49–54 (Association for Computational Linguistics, Vancouver, Canada, 2017).

18. El-Assady, M., Hautli-Janisz, A. & Butt, M. Discourse maps – feature encoding for the analysis of verbatim conversation transcripts. In Visual Analytics for Linguistics, CSLI Lecture Notes, Number 220, 115–147 (Stanford: CSLI Publications, 2020).

19. Foulis, M., Visser, J. & Reed, C. Dialogical fingerprinting of debaters. In Proceedings of COMMA 2020, 465–466, https://doi.org/10.3233/FAIA200536 (Amsterdam: IOS Press, 2020).

20. Foulis, M., Visser, J. & Reed, C. Interactive visualisation of debater identification and characteristics. In Proceedings of the COMMA Workshop on Argument Visualisation, COMMA, 1–7 (2020).

21. Chatzipanagiotidis, S., Giagkou, M. & Meurers, D. Broad linguistic complexity analysis for Greek readability classification. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, 48–58 (Association for Computational Linguistics, Online, 2021).

22. Ajili, M., Bonastre, J.-F., Kahn, J., Rossato, S. & Bernard, G. FABIOLE, a speech database for forensic speaker comparison. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 726–733 (European Language Resources Association (ELRA), Portorož, Slovenia, 2016).

23. Deutsch, T., Jasbi, M. & Shieber, S. Linguistic features for readability assessment. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, 1–17, https://doi.org/10.18653/v1/2020.bea-1.1 (Association for Computational Linguistics, Online, 2020).

24. Fiacco, J., Jiang, S., Adamson, D. & Rosé, C. Toward automatic discourse parsing of student writing motivated by neural interpretation. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), 204–215, https://doi.org/10.18653/v1/2022.bea-1.25 (Association for Computational Linguistics, Seattle, Washington, 2022).

25. Weiss, Z., Riemenschneider, A., Schröter, P. & Meurers, D. Computationally modeling the impact of task-appropriate language complexity and accuracy on human grading of German essays. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, 30–45, https://doi.org/10.18653/v1/W19-4404 (Association for Computational Linguistics, Florence, Italy, 2019).

26. Yang, F., Dragut, E. & Mukherjee, A. Predicting personal opinion on future events with fingerprints. In Proceedings of the 28th International Conference on Computational Linguistics, 1802–1807, https://doi.org/10.18653/v1/2020.coling-main.162 (International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020).

27. Tumarada, K. et al. Opinion prediction with user fingerprinting. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 1423–1431 (INCOMA Ltd., Held Online, 2021).

28. Rocca, R. & Yarkoni, T. Language as a fingerprint: Self-supervised learning of user encodings using transformers. In Findings of the Association for Computational Linguistics: EMNLP, 1701–1714 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022).

29. Aiyappa, R., An, J., Kwak, H. & Ahn, Y.-Y. Can we trust the evaluation on ChatGPT? (2023). arXiv:2303.12767.

30. Yeadon, W., Inyang, O.-O., Mizouri, A., Peach, A. & Testrow, C. The death of the short-form physics essay in the coming AI revolution (2022). arXiv:2212.11661.

31. Turing, A. M. Computing machinery and intelligence. Mind LIX, 433–460, https://doi.org/10.1093/mind/LIX.236.433 (1950).

32. Kortemeyer, G. Could an artificial-intelligence agent pass an introductory physics course? (2023). arXiv:2301.12127.

33. Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health 2, 1–12, https://doi.org/10.1371/journal.pdig.0000198 (2023).

34. Frieder, S. et al. Mathematical capabilities of ChatGPT (2023). arXiv:2301.13867.

35. Yuan, Z., Yuan, H., Tan, C., Wang, W. & Huang, S. How well do large language models perform in arithmetic tasks? (2023). arXiv:2304.02015.

36. Touvron, H. et al. LLaMA: Open and efficient foundation language models (2023). arXiv:2302.13971.

37. Chung, H. W. et al. Scaling instruction-finetuned language models (2022). arXiv:2210.11416.

38. BigScience Workshop et al. BLOOM: A 176B-parameter open-access multilingual language model (2023). arXiv:2211.05100.

39. Spencer, S. T., Joshi, V. & Mitchell, A. M. W. Can AI put gamma-ray astrophysicists out of a job? (2023). arXiv:2303.17853.

40. Cherian, A., Peng, K.-C., Lohit, S., Smith, K. & Tenenbaum, J. B. Are deep neural networks smarter than second graders? (2023). arXiv:2212.09993.

41. Stab, C. & Gurevych, I. Annotating argument components and relations in persuasive essays. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 1501–1510 (Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 2014).

42. Essay Forum. https://essayforum.com/ . Accessed 07 September 2023.

43. Common European Framework of Reference for Languages (CEFR). https://www.coe.int/en/web/common-european-framework-reference-languages . Accessed 09 July 2023.

44. KMK guidelines for essay assessment. http://www.kmk-format.de/material/Fremdsprachen/5-3-2_Bewertungsskalen_Schreiben.pdf . Accessed 09 July 2023.

45. McNamara, D. S., Crossley, S. A. & McCarthy, P. M. Linguistic features of writing quality. Writ. Commun. 27, 57–86 (2010).

46. McCarthy, P. M. & Jarvis, S. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 42, 381–392 (2010).

47. Dasgupta, T., Naskar, A., Dey, L. & Saha, R. Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, 93–102 (2018).

48. Koizumi, R. & In’nami, Y. Effects of text length on lexical diversity measures: Using short texts with less than 200 tokens. System 40, 554–564 (2012).

49. spaCy: Industrial-strength natural language processing in Python. https://spacy.io/ .

50. Siskou, W., Friedrich, L., Eckhard, S., Espinoza, I. & Hautli-Janisz, A. Measuring plain language in public service encounters. In Proceedings of the 2nd Workshop on Computational Linguistics for Political Text Analysis (CPSS-2022) (Potsdam, Germany, 2022).

51. El-Assady, M. & Hautli-Janisz, A. Discourse Maps – Feature Encoding for the Analysis of Verbatim Conversation Transcripts. CSLI Lecture Notes (CSLI Publications, Center for the Study of Language and Information, 2019).

52. Hautli-Janisz, A. et al. QT30: A corpus of argument and conflict in broadcast debate. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, 3291–3300 (European Language Resources Association, Marseille, France, 2022).

53. Somasundaran, S. et al. Towards evaluating narrative quality in student writing. Trans. Assoc. Comput. Linguist. 6, 91–106 (2018).

54. Nadeem, F., Nguyen, H., Liu, Y. & Ostendorf, M. Automated essay scoring with discourse-aware neural models. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, 484–493, https://doi.org/10.18653/v1/W19-4450 (Association for Computational Linguistics, Florence, Italy, 2019).

55. Prasad, R. et al. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08) (European Language Resources Association (ELRA), Marrakech, Morocco, 2008).

56. Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika 16, 297–334, https://doi.org/10.1007/bf02310555 (1951).

57. Wilcoxon, F. Individual comparisons by ranking methods. Biom. Bull. 1, 80–83 (1945).

58. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979).

59. Cohen, J. Statistical Power Analysis for the Behavioral Sciences (Academic Press, 2013).

60. Freedman, D., Pisani, R. & Purves, R. Statistics, 4th edn (International Student Edition) (W. W. Norton & Company, New York, 2007).

61. SciPy documentation: scipy.stats.pearsonr. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html . Accessed 09 June 2023.

62. Windschitl, M. Framing constructivism in practice as the negotiation of dilemmas: An analysis of the conceptual, pedagogical, cultural, and political challenges facing teachers. Rev. Educ. Res. 72, 131–175 (2002).


Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Faculty of Computer Science and Mathematics, University of Passau, Passau, Germany

Steffen Herbold, Annette Hautli-Janisz, Ute Heuer, Zlata Kikteva & Alexander Trautsch


Contributions

S.H., A.HJ., and U.H. conceived the experiment; S.H., A.HJ., and Z.K. collected the essays from ChatGPT; U.H. recruited the study participants; S.H., A.HJ., U.H. and A.T. conducted the training session and questionnaire; all authors contributed to the analysis of the results, the writing of the manuscript, and the review of the manuscript.

Corresponding author

Correspondence to Steffen Herbold.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

  • Supplementary Information 1
  • Supplementary Information 2
  • Supplementary Information 3
  • Supplementary Tables
  • Supplementary Figures

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Herbold, S., Hautli-Janisz, A., Heuer, U. et al. A large-scale comparison of human-written versus ChatGPT-generated essays. Sci Rep 13, 18617 (2023). https://doi.org/10.1038/s41598-023-45644-9


Received: 01 June 2023

Accepted: 22 October 2023

Published: 30 October 2023

DOI: https://doi.org/10.1038/s41598-023-45644-9

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

This article is cited by

Defense against adversarial attacks: robust and efficient compressed optimized neural networks.

  • Insaf Kraidia
  • Afifa Ghenai
  • Samir Brahim Belhaouari

Scientific Reports (2024)

AI-driven translations for kidney transplant equity in Hispanic populations

  • Oscar A. Garcia Valencia
  • Charat Thongprayoon
  • Wisit Cheungpasitporn

Solving Not Answering. Validation of Guidance for Writing Higher-Order Multiple-Choice Questions in Medical Science Education

  • Maria Xiromeriti
  • Philip M. Newton

Medical Science Educator (2024)

How will the state think with ChatGPT? The challenges of generative artificial intelligence for public administrations

  • Thomas Cantens

AI & SOCIETY (2024)

User satisfaction with the service quality of ChatGPT

  • Kim Shin Young
  • Sang-Gun Lee
  • Ga Youn Hong

Service Business (2024)

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

essay on human voice

#1 Text To Speech (TTS) Reader Online

Proudly serving millions of users since 2015

Type or upload any text, file, website & book for listening online, proofreading, reading-along or generating professional mp3 voice-overs.

I need to >

Play Text Out Loud

Reads out loud plain text, files, e-books and websites. Remembers text & caret position, so you can come back to listening later, unlimited length, recording and more.

Create Humanlike Voiceovers

The simplest most robust & affordable AI voice-over generating tool online. Mix voices, languages & speeds. Listen before recording. Unlimited!

Additional Text-To-Speech Solutions

Turns your articles, PDFs, emails, etc. into podcasts, so you can listen to it on your own podcast player when convenient, with all the advantages that come with your podcast app.

SpeechNinja says what you type in real time. It enables people with speech difficulties to speak out loud using synthesized voice (AAC) and more.

Battle tested for years, serving millions of users, especially good for very long texts.

Need to read a webpage? Simply paste its URL here & click play. Leave empty to read about the Beatles 🎸

Books & Stories

Listen to some of the best stories ever written. We have them right here. Want to upload your own? Use the main player to upload epub files.

Simply paste any URL (link to a page) and it will import & read it out loud.

Chrome Extension

Reads out loud webpages, directly from within the page.

TTSReader for mobile - iOS or Android. Includes exporting audio to mp3 files.

NEW 🚀 - TTS Plugin

Make your own website speak your content - with a single line of code. Hassle free.

TTSReader Premium

Support our development team & enjoy ad-free better experience. Commercial users, publishers are required a premium license.

TTSReader reads out loud texts, webpages, pdfs & ebooks with natural sounding voices. Works out of the box. No need to download or install. No sign in required. Simply click 'play' and enjoy listening right in your browser. TTSReader remembers your text and position between sessions, so you can continue listening right where you left. Recording the generated speech is supported as well. Works offline, so you can use it at home, in the office, on the go, driving or taking a walk. Listening to textual content using TTSReader enables multitasking, reading on the go, improved comprehension and more. With support for multiple languages, it can be used for unlimited use cases .

Get Started for Free

Main Use Cases

Listen to great content.

Most of the world's content is in textual form. Being able to listen to it - is huge! In that sense, TTSReader has a huge advantage over podcasts. You choose your content - out of an infinite variety - that includes humanity's entire knowledge and art richness. Listen to lectures, to PDF files. Paste or upload any text from anywhere, edit it if needed, and listen to it anywhere and anytime.

Proofreading

One of the best ways to catch errors in your writing is to listen to it being read aloud. By using TTSReader for proofreading, you can catch errors that you might have missed while reading silently, allowing you to improve the quality and accuracy of your written content. Errors can be in sentence structure, punctuation, and grammar, but also in your essay's structure, order and content.

Listen to web pages

TTSReader can be used to read out loud webpages in two different ways. 1. Using the regular player - paste the URL and click play. The website's content will be imported into the player. (2) Using our Chrome extension to listen to pages without leaving the page . Listening to web pages with TTSReader can provide a more accessible, convenient, and efficient way of consuming online content.

Turn ebooks into audiobooks

Upload any ebook file of epub format - and TTSReader will read it out loud for you, effectively turning it into an audiobook alternative. You can find thousands of epub books for free, available for download on Project Gutenberg's site, which is an open library for free ebooks.

Read along for speed & comprehension

TTSReader enables read along by highlighting the sentence being read and automatically scrolling to keep it in view. This way you can follow with your own eyes - in parallel to listening to it. This can boost reading speed and improve comprehension.

Generate audio files from text

TTSReader enables exporting the synthesized speech with a single click. This is available currently only on Windows and requires TTSReader’s premium . Adhering to the commercial terms some of the voices may be used commercially for publishing, such as narrating videos.

Accessibility, dyslexia, etc.

For individuals with visual impairments or reading difficulties, listening to textual content, lectures, articles & web pages can be an essential tool for accessing & comprehending information.

Language learning

TTSReader can read out text in multiple languages, providing learners with listening as well as speaking practice. By listening to the text being read aloud, learners can improve their comprehension skills and pronunciation.

Kids - stories & learning

Kids love stories! And if you can read them stories - it's definitely the best! But, if you can't, let TTSReader read them stories for you. Set the right voice and speed, that is appropriate for their comprehension level. For kids who are at the age of learning to read - this can also be an effective tool to strengthen that skill, as it highlights every sentence being read.

Main Features

Ttsreader is a free text to speech reader that supports all modern browsers, including chrome, firefox and safari..

Includes multiple languages and accents. If on Chrome - you will get access to Google's voices as well. Super easy to use - no download, no login required. Here are some more features

Fun, Online, Free. Listen to great content

Drag, drop & play (or directly copy text & play). That's it. No downloads. No logins. No passwords. No fuss. Simply fun to use and listen to great content. Great for listening in the background. Great for proofreading. Great for kids and more. Learn more, including a YouTube video we made, here.

Multilingual, Natural Voices

We provide high-quality, natural-sounding voices from different sources. There are male & female voices, in different accents and different languages. Choose the voice you like, insert text, click play to generate the synthesized speech and enjoy listening.

Exit, Come Back & Play from Where You Stopped

TTSReader remembers the article and last position when paused, even if you close the browser. This way, you can come back to listening right where you previously left. Works on Chrome & Safari on mobile too. Ideal for listening to articles.

Vs. Recorded Podcasts

In many respects, synthesized speech has advantages over recorded podcasts. Here are some: First of all, you have unlimited free content - that includes high-quality articles and books that are not available as podcasts. Second, it's free. Third, it uses almost no data, so it's available offline too and you save money. If you like listening on the go, such as while driving or walking, get our free Android Text Reader App.

Read PDF Files, Texts & Websites

TTSReader extracts the text from PDF files and reads it out loud. It is also useful for simply copying text from a PDF to anywhere. In addition, it highlights the text currently being read, so you can follow with your eyes. If you specifically want to listen to websites - such as blogs, news or wikis - you should get our free extension for Chrome.
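
TTSReader's internals are not published, but extracting text from a PDF in the browser is commonly done with Mozilla's open-source pdf.js (the pdfjs-dist package). A rough sketch of that general approach:

```ts
// Extract plain text from a PDF, page by page, with Mozilla's pdf.js.
// (In a real app you must also point pdfjsLib.GlobalWorkerOptions.workerSrc
// at the bundled worker script before calling getDocument.)
import * as pdfjsLib from "pdfjs-dist";

async function pdfToText(url: string): Promise<string> {
  const doc = await pdfjsLib.getDocument(url).promise;
  const pages: string[] = [];
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const content = await page.getTextContent();
    // Each item carries a fragment of the page's text in `str`.
    pages.push(content.items.map((item: any) => item.str ?? "").join(" "));
  }
  return pages.join("\n"); // ready to hand to a speech engine
}
```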

Export Speech to Audio Files

TTSReader enables exporting the synthesized speech to MP3 audio files. This is currently available only on Windows and requires TTSReader's premium.

Pricing & Plans

  • Online text to speech player
  • Chrome extension for reading webpages

$10.99/mo or $39/yr

  • Premium TTSReader.com
  • Premium Chrome extension
  • Better support from the development team

Compare plans

Free vs. Premium:

  • Unlimited text reading
  • Online text to speech
  • Upload files, PDFs, ebooks
  • Web player
  • Webpage reading Chrome extension
  • Editing
  • Ad-free
  • Unlock features
  • Recording audio - for generating audio files from text
  • Commercial license
  • Publishing license (under the applicable terms)
  • Better support from the development team

Sister Apps Developed by Our Team

Speechnotes

Dictation & Transcription

Type with your voice for free, or automatically transcribe audio & video recordings

Buttons - Kids Dictionary

Turns your device into multiple push-button interactive games

Animals, numbers, colors, counting, letters, objects and more. Different levels. Multilingual. No ads. Made by parents, for our own kids.

Ways to Get In Touch, Feedback & Community

Visit our contact page for various ways to get in touch with us, send us feedback and interact with our community of users & developers.

Text to Speech

Generate speech from text: choose a voice to read your text aloud. You can use it to narrate your videos, create voice-overs, convert your documents into audio, and more.


Realistic Text-to-Speech AI converter


Create realistic voiceovers online! Insert any text to generate speech and download the audio as MP3 or WAV for any purpose. Speak a text with AI-powered voices. You can convert text to voice for free for reference only; for all features, purchase a paid plan.

How to convert text into speech?

  • Just type some text or import your written content
  • Press the "Generate" button
  • Download the MP3 / WAV (a programmatic sketch of this flow follows below)
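
SpeechGen's public interface is the website itself, but to make those three steps concrete, here is a hedged sketch of the same flow against a hypothetical REST endpoint. The URL, request fields and response handling below are illustrative assumptions, not SpeechGen's documented API:

```ts
// Hypothetical text-to-speech request; the endpoint and fields are invented
// for illustration and do not correspond to any documented SpeechGen API.
async function textToMp3(text: string, voice: string): Promise<Blob> {
  const res = await fetch("https://api.example.com/v1/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, voice, format: "mp3" }),
  });
  if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
  return res.blob(); // the synthesized MP3, ready to save or play
}
```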

Full list of benefits of neural voices

Multi-voice editor

Dialogue with AI voices: you can use several voices at once in one text.

Over 1000 Natural Sounding Voices

Crystal-clear voiceover that sounds like a human. Male, female, children's and elderly voices.

You spend little on re-dubbing a text: Limits are spent only on the sentences you change. Read more about our cost-effective Limit System. Enjoy full control over your spending with one-time payments for only what you use. Pay as you go: get flexible, cost-effective access to our neural-network voiceover services without subscriptions.

If your Limit balance is sufficient, you can use a single query to convert a text of up to 2,000,000 characters into speech.

Commercial Use

You can use the generated audio for commercial purposes. Examples: YouTube, TikTok, Instagram, Facebook, Twitch, Twitter, podcasts, video ads, advertising, e-books, presentations and others.

Custom voice settings

Change speed, pitch, stress, pronunciation, intonation, emphasis, pauses and more. SSML support.
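
SSML (Speech Synthesis Markup Language) is the W3C standard behind controls like these. A brief illustration of the kind of markup involved - how much of SSML a given engine honours varies, so treat this as a generic sample rather than SpeechGen-specific syntax:

```ts
// Generic SSML expressing rate, pitch, a pause and emphasis.
const ssml = `
<speak>
  <prosody rate="90%" pitch="+2st">
    The human voice can be slowed down and raised in pitch,
  </prosody>
  <break time="500ms"/>
  <emphasis level="strong">and key words can be stressed.</emphasis>
</speak>`;
```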

SRT to audio

Subtitles to audio: convert your subtitle file into perfectly timed multilingual voiceovers with our advanced neural networks.
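
What makes subtitles-to-audio feasible is that SRT is a plain-text format: each cue carries an index, a start/end timecode and its text, so a pipeline knows exactly when each line should be spoken. A minimal sketch of parsing those cues - the first step such a pipeline needs; SpeechGen's actual pipeline is not public:

```ts
// Parse an SRT file into cues: index, start/end in milliseconds, text.
interface Cue { index: number; startMs: number; endMs: number; text: string }

function toMs(timecode: string): number {
  // SRT timecodes look like "00:01:02,345" (hours:minutes:seconds,millis).
  const [h, m, rest] = timecode.split(":");
  const [s, ms] = rest.split(",");
  return ((+h * 60 + +m) * 60 + +s) * 1000 + +ms;
}

function parseSrt(srt: string): Cue[] {
  return srt.trim().split(/\r?\n\r?\n/).map((block) => {
    const [index, times, ...lines] = block.split(/\r?\n/);
    const [start, end] = times.split(" --> ");
    return { index: +index, startMs: toMs(start), endMs: toMs(end), text: lines.join(" ") };
  });
}
```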

Downloadable TTS

You can download converted audio files in MP3, WAV or OGG for free.

Powerful support

We will help you with any questions about text-to-speech. Ask any questions, even the simplest ones. We are happy to help.

Compatible with editing programs

Works with any video-creation software: Adobe Premiere, After Effects, Audition, DaVinci Resolve, Apple Motion, Camtasia, iMovie, Audacity, etc.

Cloud save your history

All your files and texts are automatically saved in your profile on our cloud server. Add tracks to your favorites in one click.

Use our text to voice converter to make videos with natural sounding speech!

Say goodbye to expensive traditional audio creation

Low prices: create a professional voiceover in real time for pennies - it is 100 times cheaper than a live speaker.

Traditional audio creation (sound studio)

  • Expensive live speakers, high prices
  • A long search for freelancers and studios
  • Editing requires complex tools and knowledge
  • Studio narration is slow: briefing the announcer and reviewing the result both take time

SpeechGen, on any device

  • Affordable TTS generation starting at $0.08 per 1,000 characters (a worked cost example follows below)
  • Website accessible in your browser right now
  • Intuitive interface, suitable for beginners
  • SpeechGen generates speech from text very quickly - a few clicks and the audio is ready
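
At the advertised starting rate of $0.08 per 1,000 characters, the cost of a job is simple arithmetic; a quick sketch (the rate comes from the list above, the article length is an assumption for illustration):

```ts
// Estimate synthesis cost at the advertised starting rate.
const RATE_PER_1000_CHARS = 0.08; // USD, from the pricing above

function estimateCost(charCount: number): number {
  return (charCount / 1000) * RATE_PER_1000_CHARS;
}

// A ~5,000-word article is roughly 30,000 characters:
console.log(estimateCost(30_000).toFixed(2)); // "2.40" (USD)
```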

Create AI-generated realistic voice-overs.

Ways to use it: cases

See how other people are already using our realistic speech synthesis. There are hundreds of applications; here are some of them.

  • Voiceovers for videos. Commercials, YouTube, TikTok, Instagram, Facebook and other social media. Add voice to any video!
  • E-learning material, e.g. learning foreign languages, listening to lectures, instructional videos.
  • Advertising. Increase installs and sales! Create AI-generated realistic voiceovers for video ads, promos and creatives.
  • Public places. Synthesized speech is needed in airports, bus stations, parks, supermarkets, stadiums and other public areas.
  • Podcasts. Turn text into podcasts to increase content reach. Publish your audio files on iTunes, Spotify and other podcast services.
  • Mobile apps and desktop software. Synthesized AI voices make an app friendlier.
  • Essay reader. Read your essay out loud to write a better paper.
  • Presentations. Use text-to-speech for impressive PowerPoint presentations and slideshows.
  • Reading documents. Save time by having a speech synthesizer read documents aloud.
  • Book reader. Use our text-to-speech web app to read ebooks aloud with natural voices.
  • Welcome audio messages for websites. A perfect way to re-engage with your audience.
  • Online article reader. Internet users convert interesting articles into audio and listen to them to save time.
  • Voicemail greeting generator. Record voiceovers for telephone-system greetings.
  • Online narrator to read fairy tales aloud to children.
  • For fun. Use a robot voiceover to create memes, creative pieces and gags.

Maximize your content’s potential with an audio-version. Increase audience engagement and drive business growth.

Who uses Text to Speech?

SpeechGen.io is an artificial-intelligence service used by about 1,000 people daily for different purposes. Here are some examples.

Video makers create voiceovers for videos. They generate audio content without expensive studio production.

Newsmakers convert text to speech with computerized voices for news reporting and sports announcing.

Students and busy professionals use it to explore content quickly.

Second-language learners use it to improve their pronunciation or listening comprehension.

Software developers add synthesized speech to programs to improve the user experience.

Marketers produce easy audio content for startups.

IVR voice recordings. Generate prompts for interactive voice response systems.

Educators. Foreign-language teachers generate voice from text for audio examples.

Booklovers use SpeechGen as a read-aloud book reader. The TTS voiceover is downloadable; listen on any device.

HR departments and e-learning professionals make learning modules and employee training with AI text-to-speech software.

Webmasters convert articles to audio with lifelike voices. TTS audio increases time on page and depth of views.

Animators use AI voices for dialogue and character speech.

Text to Speech enables brands, companies, and organizations to deliver enhanced end-user experience, while minimizing costs.

Frequently Asked Questions

Convert any text to super-realistic human voices. See all tariff plans.

Enhance Your Content Accessibility

Boost your experience with our additional features. Easily convert PDFs, DOCx files, and video subtitles into natural-sounding audio.

📄🔊 PDF to Audio

Transform your PDF documents into audible content for easier consumption and enhanced accessibility.

📝🎧 DOCx to mp3

Easily convert Word documents into speech for listening on the go, or for those who prefer audio format.

🔊📰 WordPress plugin

Enhance your WordPress site with our plugin for article voiceovers, embedding an audio player directly on your site to boost user engagement and diversify your content.

Supported languages

  • Amharic (Ethiopia)
  • Arabic (Algeria)
  • Arabic (Egypt)
  • Arabic (Saudi Arabia)
  • Bengali (India)
  • Catalan (Spain)
  • English (Australia)
  • English (Canada)
  • English (GB)
  • English (Hong Kong)
  • English (India)
  • English (Philippines)
  • German (Austria)
  • Hindi (India)
  • Spanish (Argentina)
  • Spanish (Mexico)
  • Spanish (United States)
  • Tamil (India)
  • All languages: +76


Choose Rewritify to humanize AI text and get access to 200 FREE words!

Humanize AI Content to Make it More Human, Undetectable & Engaging


AI Humanizer

Tired of robotic, AI-generated content? Let Rewritify help by reworking it to sound more authentically human and relatable.

Our AI humanizer doesn't just replace a few words in the text. That just won't cut it. Instead, we do a comprehensive overhaul of the text by imitating real human writing patterns.


Undetectable AI

With our AI humanizer, you can consistently bypass AI detectors like GPTZero, Originality.ai, Turnitin, ZeroGPT, Winston AI, and many others. Don't let them stop your work from being seen.


No Plagiarism Guaranteed

Our undetectable AI outputs will always be entirely free from plagiarism and 100% authentic, so you can rest assured that unintentional plagiarism will never be a concern.


Optimized for Search Engines

AI-generated content can face Google penalties and sharp traffic declines! Our AI humanizer makes content undetectable to Google's algorithms, more human-like, and still keyword-rich, avoiding SEO penalties and attracting convertible traffic easily.


World-class Humanization Model

By analyzing vast amounts of human-written text, our algorithms use more than 1.6 trillion parameters to learn to replicate human authors' writing style and tone, creating text that imitates human-written content.


Built-in AI detector

Our built-in AI checker is integrated with all popular AI detectors in the market, including GPTZero and ZeroGPT. This lets you verify the detectability of the rewritten output across all these AI detectors at the same time.


Multilingual Capabilities

Rewritify can humanize AI text written in any language. It accurately captures the rhythm and nuance of how people authentically communicate in each tongue.

How to Use Rewritify?

Using our AI humanizer is an absolute breeze. Just follow the steps below.

Provide Your Content

Simply paste your AI-generated text into our easy-to-use platform.

Let Us Do The Heavy Lifting

Rewritify seamlessly converts your content into natural, human-like text with one click.

Get Authentic, Undetectable Output

Receive AI text that has been humanized to avoid AI detection.

What Our Customers Say About Rewritify

- David Lee

University Student

"My professors can easily spot AI-written papers and assignments. With Rewritify, I can make the AI output sound like it was written by a human. I feel more confident turning in my work after running them through Rewritify."

- John Smith

"I use a lot of AI tools to help with content creation. But the output doesn't always sound natural. Rewiritfy solves that problem! It takes my AI-generated drafts and turns them into human-sounding blog posts, social media captions, and more in just seconds."

- Emily Davis

Marketing Agency

"Rewiritfy transforms our AI-generated content into natural, engaging copy that sounds like it was hand-written. This allows us to deliver excellent results to clients on time and budget. "

- Charles Lee

"Using AI to draft news stories makes it easier to keep up with publishing deadlines. Rewritify helps me rework the AI content to sound more human and engaging, which ensures that I pass all my editor's checks with flying colors."

- Ethan Davis

Entrepreneur

"I often found AI content to be too generic and bland to help my business generate real sales conversions. This was until I discovered Rewritify, which did wonders in making the content sound more natural and relatable to customers. It's been an amazing turnaround!"

- Cecilia Brown

SEO Professional

"Rewritify has been such a major breakthrough for our marketing campaigns! We can now produce bulk AI content that sounds so human-like, it meets Google’s SEO guidelines. This has drastically helped us avoid penalties that could affect our ranking on SERPs."

- Arthur Miller

Advertising Copywriter

"Working in a fast-paced advertising agency, I need to produce engaging content quickly. Rewritify transforms my AI-generated text into compelling, human-like copy. Now, I can meet tight deadlines without compromising quality."

- Daisy Johnson

E-commerce store owner

"Rewritify has dramatically improved the quality of our product descriptions. It seamlessly breathes life into our AI-generated content, making it sound engagingly human. Plus, it's played a huge role in improving our conversion rate."

Frequently Asked Questions

1. What is an AI humanizer?

An AI humanizer is a tool that transforms AI-generated content into a form that mirrors human writing. By skillfully integrating natural-language nuances, contextual understanding and idiomatic expressions, it adds a human-like touch to AI-generated text, making it nearly indiscernible from content crafted by a human writer.

2. What is Rewritify and how does it work?

Rewritify is an undetectable AI writer designed to turn your AI-generated text into human-sounding, authentic content. It is a best-in-class AI humanizer powered by a state-of-the-art language model trained on vast amounts of user-sourced content to understand and replicate how humans write, as opposed to AI.

3. Is this undetectable AI writer free to use?

Yes. Our undetectable AI paraphraser allows you to try out our tool with 200 free words. See what it can do for yourself. If you love Rewritify and want more, upgrades unlock extra perks and power.

4. Will using a humanizer compromise the quality of the original text?

No. Our AI humanizer aims to improve and make AI-generated text sound more human. It keeps the key information from the original text but changes the style and tone to be more like how humans write. The goal is to enhance AI text while maintaining its quality and core meaning.

5. Can I really bypass AI detectors with this AI humanizer?

Absolutely! Rewritify is a powerful undetectable AI paraphraser that ensures that the rewritten text is super human-like, empowering you to beat all leading AI detection platforms including GPTZero, ZeroGPT, Originality.ai, Turnitin, Copyleaks, Winston AI, writer.com, and so much more.

6. Will the rewritten text by this AI humanizer lose its SEO value?

No. In fact, using an undetectable AI rephraser like Rewritify can actually enhance the SEO value of the text.

Rewritify ensures that the core meaning, keywords, and relevant information from the original text are retained in the rewritten version. It improves the style, tone, and overall readability of the content to make it more appealing to both search engines and users.

7. How many languages does our undetectable AI tool support?

Rewritify is able to humanize AI text in over 50 languages. Whether it be English, Spanish, French, German, Arabic, Chinese, Italian, or any other supported language, Rewritify can rewrite the text and make it sound as if it were written by a human in that particular language.

8. Why should I use Rewritify to recreate AI text?

Relying solely on ChatGPT and similar tools can result in a significant drawback: easily detectable and unreadable content. This can hinder your ability to connect with your audience on a deeper level and ultimately impact the success of your brand.

That's why you should choose Rewritify over ChatGPT. We help you produce quality AI texts with a humanized feel with just one click.

9. Which is the best AI humanizer?

Rewritify stands out as the best AI humanizer among its competitors. With its advanced capabilities and user-friendly interface, Rewritify is designed to efficiently transform AI-generated content into human-like text.

It performs cross-checks using multiple AI detectors, such as GPTZero, ZeroGPT, Sapling, Winston AI, and Copyleaks. This ensures that the rewritten content maintains a high level of human-like quality and can effectively bypass detection from various AI detection tools.

10. How can I get undetectable AI content online for free?

To generate undetectable AI content for free, consider using Rewritify. This powerful, undetectable AI writer offers a cost-free solution to transform AI-generated content into human-like text.

By leveraging its advanced capabilities and built-in cross-checking with multiple AI detectors, Rewritify enables users to create high-quality, human-like content that can effectively bypass detection from various AI tools.

Simply sign up to get 200 free words and start using Rewritify to generate undetectable AI content at no cost.

Humanize AI Text Like a Pro

Simply input your AI-generated content and we will instantly humanize it into engaging, undetectable, and human-sounding text. Get started for free!
