Voice imitation algorithms

From SI410

Voice imitation algorithms (also known as speech synthesis[1]) are a form of synthetic media used to imitate human speech. They rely on machine learning and artificial intelligence techniques.[2] The most common approach relies on many recorded samples of a voice to produce synthesized speech.[3] Well-known voice imitation systems include Lyrebird AI, Deepvoice, and WaveNet.


Mechanical voice imitation

The earliest recorded artificial system for imitating human speech was built by the Hungarian author and inventor Wolfgang von Kempelen. He began developing his speaking machine in 1769, and later versions incorporated models of the tongue and lips that enabled the machine to pronounce both vowels and consonants.[4]

Electronic voice imitation

The first complete electronic English speech synthesis system was created in Japan in 1968 by Noriko Umeda and Ryunen Teranishi, who were working for the Japanese government.[5] The system analyzed English text and approximated the pronunciation of sentences,[6] using a dictionary of the 1,500 most common English words. Its techniques and rules were later used in Bell Labs' own speech synthesis system.[7]
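
The dictionary-based approach described above can be illustrated with a small sketch. This is not the 1968 system's actual method: the five dictionary entries, the ARPAbet-style symbols, the one-letter-one-sound fallback for unknown words, and the function name text_to_phonemes are all assumptions made for illustration.

```python
import re

# Hypothetical miniature pronunciation dictionary using ARPAbet-style symbols.
# The 1968 system is described as using about 1,500 common words; these
# entries are illustrative only.
PRONUNCIATIONS = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
    "on": ["AA", "N"],
    "mat": ["M", "AE", "T"],
}

# Assumed, deliberately crude one-letter-one-sound fallback for words that
# are not in the dictionary. Real letter-to-sound rules are far richer.
LETTER_SOUNDS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K",
    "y": "Y", "z": "Z",
}


def text_to_phonemes(text):
    """Return one list of phoneme symbols per word in `text`."""
    # Lowercase the text and keep only runs of letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    phonemes = []
    for word in words:
        if word in PRONUNCIATIONS:
            # Known word: copy its stored pronunciation.
            phonemes.append(list(PRONUNCIATIONS[word]))
        else:
            # Unknown word: fall back to spelling it out sound by sound.
            phonemes.append([LETTER_SOUNDS[letter] for letter in word])
    return phonemes


print(text_to_phonemes("The cat sat on the mat."))
```

A lookup table gives good results for frequent words at very low cost, which is one plausible reason to start from a list of the most common words; the fallback only needs to be acceptable for the rarer words the table misses.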


Commercial implementation

The Speak & Spell was introduced by Texas Instruments in 1978. It combined a keyboard with a speech synthesizer, converting words typed on the keyboard into synthesized audio played through its speaker. Its purpose was to help children learn correct pronunciation and spelling.[8]

Lyrebird AI

Lyrebird (also known as Lyrebird AI) was a Montreal-based company, founded in 2017, that focused on speech synthesis and voice imitation.[9] In 2019 it was acquired by Descript, an American company that makes audio editing software aimed at podcast creators.[10] Lyrebird AI uses artificial intelligence and voice samples to replicate realistic human speech.

The China-based technology company Baidu has used neural networks and deep learning in its in-house software Deepvoice to create accurate voice imitations from thousands of collected voice samples.[11][12] Baidu claims that Deepvoice can replicate thousands of unique voices from less than 30 minutes of samples per voice.[13]


The Applied Science and Engineering Laboratories (also known as ASEL), operated jointly by the University of Delaware and the Nemours Alfred I. duPont Hospital for Children, researched and developed ModelTalker,[14][15] software used with augmentative and alternative communication (AAC) devices to replicate human speech and assist people with hearing or speech impairments. The ModelTalker text-to-speech (TTS) system converts English text into synthesized English speech.

The vocoder was invented at Bell Labs in 1938.[16] It is a type of voice codec that analyzes human voice waveforms and resynthesizes them. Its main use is audio data compression: voice data can be stored and used with fewer bits than the original signal. This lets speech synthesis algorithms save, analyze, and output higher-fidelity data and so imitate human speech more accurately.[17]
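
The compression idea can be sketched in a few lines. The example below is only an illustration of the analysis step, assuming a simplified scheme in which each short frame of audio is summarized by the energy in a few frequency bands; a real vocoder uses a bank of band-pass filters and also tracks pitch and voicing. The function names, frame length, band count, and test tones are all assumptions.

```python
import math


def tone(freq_hz, num_samples=256, sample_rate=8000):
    """Generate a pure sine tone as a list of samples (test signal only)."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(num_samples)]


def band_energies(frame, num_bands):
    """Return the energy in `num_bands` equal-width frequency bands of one frame."""
    n = len(frame)
    power = []
    # Naive DFT power for bins 1..n/2 (bin 0, the constant offset, is skipped).
    for k in range(1, n // 2 + 1):
        real = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        imag = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power.append(real * real + imag * imag)
    # Group neighbouring bins into bands and add up the power in each band.
    width = len(power) // num_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(num_bands)]


def analyze(signal, frame_len=64, num_bands=4):
    """Split `signal` into non-overlapping frames and summarize each one."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [band_energies(frame, num_bands) for frame in frames]


signal = tone(500)
params = analyze(signal)
print(len(signal), "samples summarized by", len(params) * len(params[0]), "band energies")
```

With these assumed settings each 64-sample frame is replaced by 4 numbers, which is where the saving in bits comes from; a synthesizer would then rebuild an approximation of the sound by shaping a source signal with those band energies.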

Speech imitation in culture

A virtual assistant is an artificial intelligence system that performs tasks for a user in response to commands or questions. One of the most prominent examples is Siri, a virtual assistant developed by Apple Inc. and used across Apple's operating systems. As of 2019, Siri supported 21 languages[18] and could interpret and respond to a wide range of voice commands;[19] as of 2020, it could speak in 5 different English-language accents.[20]

HAL 9000 (also referred to simply as HAL) is an artificial intelligence and the main antagonist of Arthur C. Clarke's Space Odyssey series, which began in 1968. HAL communicates with the other characters through a speech synthesizer and is a well-known example of synthesized human speech in popular culture.

Ethical implications

Voice imitation algorithms have been used in grandparent scams, a type of telemarketing fraud in which the scammer calls an elderly person, claims to be a relative who is in some kind of trouble, and asks for money. A realistic synthesized voice makes this scam easier, because it is harder for the victim to recognize that they are hearing a synthesized voice.[21]

Realistic voice imitation software has also raised concerns about the authenticity of voice recordings. For example, it is unclear whether a recording of a politician in a closed-door meeting can be trusted as authentic when any voice can be replicated from enough voice samples.[22] Responses to the rise of synthetic media include the Deepfake Detection Challenge (also known as DFDC), which is sponsored by several large technology companies. The DFDC offers prizes to participants who develop software for detecting deepfakes and related synthetic media, often referred to as tampered media. Its goal is for detection software to advance faster than the AI used to create tampered media.[23]

References


  1. https://thehill.com/opinion/cybersecurity/470826-perception-wont-be-reality-once-ai-can-manipulate-what-we-see
  2. https://www.sciencedirect.com/science/article/pii/S0007681319301600?via%3Dihub
  3. https://towardsdatascience.com/you-can-now-speak-using-someone-elses-voice-with-deep-learning-8be24368fa2b
  4. http://www.coli.uni-saarland.de/~trouvain/Kempelen-Web_2017_07_31.pdf
  5. http://amhistory.si.edu/archives/speechsynthesis/ss_etl.htm
  6. https://www.jbistudios.com/blog/text-to-speech-future-now
  7. http://amhistory.si.edu/archives/speechsynthesis/dk_757.htm#Umeda
  8. https://toytales.ca/speak-spell-from-texas-instruments-1978/
  9. https://www.wired.com/brandlab/2018/10/lyrebird-uses-ai-find-artificial-voice/
  10. https://www.businessinsider.com/groupon-founder-andrew-mason-new-startup-descript-detour-2017-12
  11. https://www.technologyreview.com/f/610386/a-new-algorithm-can-mimic-your-voice-with-just-snippets-of-audio/
  12. http://research.baidu.com/Blog/index-view?id=91
  13. http://research.baidu.com/Blog/index-view?id=81
  14. https://www.asel.udel.edu/
  15. https://www.asel.udel.edu/speech/ModelTalker.html
  16. https://patents.google.com/patent/US2121142A/en
  17. https://arxiv.org/abs/1711.10433
  18. https://www.globalme.net/blog/language-support-voice-assistants-compared
  19. https://www.cnet.com/how-to/the-complete-list-of-siri-commands/
  20. https://www.lifewire.com/change-siri-to-mans-voice-4103822
  21. https://www.nextgov.com/emerging-tech/2019/11/ftc-explore-promises-and-potential-abuses-voice-cloning-technology/161083/
  22. https://www.scientificamerican.com/article/new-ai-tech-can-mimic-any-voice/
  23. https://deepfakedetectionchallenge.ai/