Have you ever clicked play on an audiobook and then tried to match the voice you hear to a character in the book?
If the quality of the voice is very high, how would you know it is not a recording of the book’s real author reading their work to you, the listener?
In fact, maybe it is not a human voice at all. Instead, it is AI-generated. Now what?
Is the answer to any of these enough to get you to close this audiobook, head back to your digital library, and try your luck with another?
What is TTS audio?
TTS is a common abbreviation for text-to-speech, a technology that converts written text into spoken words. The voice of these spoken words could be human or artificially generated. When artificially generated voices are chosen for use in TTS systems, the speech output often comes across as robotic or unnatural, at least compared to other audio formats such as live audio (human speech in real time) or voice acting.
In recent years, TTS audio has become hard to distinguish from other audio formats, since information can now be converted into seemingly natural speech by using advanced “linguistic and acoustic processing.”
Yet TTS is not to be confused with STT. STT, or speech-to-text, uses automatic speech recognition to identify spoken words and then transcribes them into text that a computer can process.
Uses for TTS today
The cost-effectiveness and efficiency of TTS are two major factors influencing how it is used today.
Broad applications of TTS include call centre platforms, user terminals (such as PCs and phones) and broadcasting information service systems. In each case, TTS offers a convenient alternative way to digest information and improves the user experience on digital platforms. Just think: you prepare breakfast while a TTS device reads you your emails and memos every morning.
For content creators, TTS is an enticing option for producing music, podcast recordings, Instagram reels, YouTube vlogs and audiobooks. A content creator will likely spend less time choosing a TTS app, feeding it a script and waiting for the audio to generate than if they tried to (traditionally) record someone reading the script aloud.
For people with disabilities or speakers of a foreign language, TTS offers an accessible means of reducing communication barriers and learning new material. Assistive speech devices can draw on TTS voice banks, allowing individuals to build a synthetic voice.
Two pilot studies (using Kurzweil 3000 TTS software) were conducted in 2012 with around 104 high school students in grades 9–12.
These students had disabilities of varying severity that made reading difficult. All students who participated in the study were flagged as at-risk for referral to special education services. Electronic documents such as books, articles and magazines were run through a TTS system that each of the students had been taught how to use before the study. By the end of the study, the researchers concluded that the reading rate, vocabulary and comprehension of students who participated throughout the entire study had improved.
A case of mistaken identity?
One study, using TTS audio with varied personality cues, found that when TTS is used for long-form content (e.g., paragraphs in an essay, as opposed to single words or short sentences), listeners were able to detect personality cues based on voice parameters such as volume, speaking rate and pitch. Manipulating these parameters was enough to have a listener conclude that the voice belonged to an introvert rather than an extrovert, or vice versa.
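For readers curious about what "manipulating these parameters" can look like in practice, here is a minimal sketch in Python. It assumes a TTS engine that accepts SSML (Speech Synthesis Markup Language), whose prosody element carries rate, pitch and volume attributes. The function name to_ssml and the two presets are hypothetical and purely illustrative; they are not the settings used in the study, and a real engine may also require version and namespace attributes on the speak element.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text, rate="medium", pitch="medium", volume="medium"):
    """Wrap text in an SSML <prosody> element with the given voice parameters."""
    attrs = (
        f"rate={quoteattr(rate)} "
        f"pitch={quoteattr(pitch)} "
        f"volume={quoteattr(volume)}"
    )
    # escape() protects characters such as & and < inside the spoken text.
    return f"<speak><prosody {attrs}>{escape(text)}</prosody></speak>"


# Hypothetical presets for illustration only (not taken from the study).
PRESETS = {
    "extrovert": {"rate": "fast", "pitch": "high", "volume": "loud"},
    "introvert": {"rate": "slow", "pitch": "low", "volume": "soft"},
}

if __name__ == "__main__":
    for label, params in PRESETS.items():
        print(label, to_ssml("Welcome to chapter one.", **params))
```

The same sentence is marked up twice with different voice parameters, so any difference a listener perceives comes from volume, speaking rate and pitch alone, not from the words.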
Other studies find that what a listener infers from a TTS voice varies depending on the gender and accent of that voice. Even though some of these voices are AI-generated, the more closely the audio resembles a male voice, the more likely its listener is to believe the content is credible.
In summary, the impact on user experience when listening to TTS audio can be far-reaching. The root cause of these impacts is that a TTS audio listener chooses to interpret what is heard either positively or negatively.
Exploring the Impacts
TTS audio listeners may mistake the identity of the narrator of their favourite audiobook, the bot on their most frequently visited website or the tour guide in the latest vlog of a YouTube channel they have been subscribed to for years.
Scarier prospects than this include the use of TTS to create deepfakes as well as phishing and fraud scams, which employ emotionally manipulative TTS audio to dupe their listeners. TTS, when technologically advanced enough, is hard to distinguish from natural human speech and can be used to create fake quotes by influential people.
Ranging from harmless to harmful, the broad applications of TTS today mean that there is an increasing chance of mistaking an AI-generated voice for a human’s. The simplest remedy to this risk starts with acknowledging that it is, in fact, a probable risk.
As such, one should slowly train the mind to approach TTS technology with a healthy dose of skepticism.