Difference between revisions of "Voice imitation algorithms"

From SI410
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''Voice imitation algorithms''' (also known as '''[https://en.wikipedia.org/wiki/Speech_synthesis Speech synthesis]'''<ref>https://thehill.com/opinion/cybersecurity/470826-perception-wont-be-reality-once-ai-can-manipulate-what-we-see</ref>) are a form of [https://en.wikipedia.org/wiki/Synthetic_media Synthetic Media], used to imitate human speech. They achieve this by using [https://en.wikipedia.org/wiki/Machine_learning machine learning] and [https://en.wikipedia.org/wiki/Artificial_intelligence artificial intelligence] techniques<ref>https://www.sciencedirect.com/science/article/pii/S0007681319301600?via%3Dihub</ref>. The most common method of voice imitation algorithms relies on many voice samples to produce synthesized speech.<ref>https://towardsdatascience.com/you-can-now-speak-using-someone-elses-voice-with-deep-learning-8be24368fa2b</ref>  
+
'''Voice imitation algorithms''' (also known as '''[https://en.wikipedia.org/wiki/Speech_synthesis Speech synthesis]'''<ref>https://thehill.com/opinion/cybersecurity/470826-perception-wont-be-reality-once-ai-can-manipulate-what-we-see</ref>) are a form of [https://en.wikipedia.org/wiki/Synthetic_media Synthetic Media], used to imitate human speech. They achieve this by using [https://en.wikipedia.org/wiki/Machine_learning machine learning] and [https://en.wikipedia.org/wiki/Artificial_intelligence artificial intelligence] techniques<ref>https://www.sciencedirect.com/science/article/pii/S0007681319301600?via%3Dihub</ref>. The most common method of voice imitation algorithms relies on many voice samples to produce synthesized speech.<ref>https://towardsdatascience.com/you-can-now-speak-using-someone-elses-voice-with-deep-learning-8be24368fa2b</ref> The most well-known and advanced voice imitation algorithms are Lyrebird AI, Deepvoice, and [https://en.wikipedia.org/wiki/WaveNet WaveNet].
  
 
== History ==
 
== History ==
 +
===Mechanical voice imitation===
 +
The earliest recorded instance of an artificial system used to imitate human speech was created by [https://en.wikipedia.org/wiki/Austria-Hungary Austro-Hungarian] author and inventor [https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen Wolfgang von Kempelen]. Development of his  [https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen%27s_speaking_machine speaking machine] started in 1769 and eventually implemented tongue and lip models which enabled the machine to pronounce vowels and consonants.<ref>http://www.coli.uni-saarland.de/~trouvain/Kempelen-Web_2017_07_31.pdf</ref>
 +
 +
===Electronic voice imitation===
 +
The first complete electronic English speech synthesis system was created in [https://en.wikipedia.org/wiki/Japan Japan] in 1968 by Noriko Umeda and Ryunen Teranishi working for the Japanese government.<ref>http://amhistory.si.edu/archives/speechsynthesis/ss_etl.htm</ref> This system was able to analyze English text and approximate the pronunciation of sentences.<ref>https://www.jbistudios.com/blog/text-to-speech-future-now</ref> The system used a dictionary of the 1500 [https://en.wikipedia.org/wiki/Most_common_words_in_English most common English words]. The techniques and rules used in this software were later used in the [https://en.wikipedia.org/wiki/Bell_Labs Bell Labs'] own speech synthesis system.<ref>http://amhistory.si.edu/archives/speechsynthesis/dk_757.htm#Umeda</ref>
 +
 +
==Implementation==
 
===Commercial implementation===
 
===Commercial implementation===
The [https://en.wikipedia.org/wiki/Speak_%26_Spell_(toy) Speak and Spell] was originally introduced in 1978 by [https://en.wikipedia.org/wiki/Texas_Instruments Texas Instruments]. It featured a keyboard and a speech synthesizer, which was used to convert words that were typed onto the keyboard into synthesized audio that it played from speakers.  [[File:Screen Shot 2020-03-13 at 3.47.02 PM.png|thumbnail|Lyrebird AI]]
+
The [https://en.wikipedia.org/wiki/Speak_%26_Spell_(toy) Speak and Spell] was originally introduced in 1978 by [https://en.wikipedia.org/wiki/Texas_Instruments Texas Instruments]. Featuring a keyboard and a speech synthesizer, it allowed one to convert words that were typed onto the keyboard into synthesized audio that it played from its speakers. The purpose of the Speak and Spell was to assist children in proper pronunciation and spelling.<ref>https://toytales.ca/speak-spell-from-texas-instruments-1978/</ref> [[File:Screen Shot 2020-03-13 at 3.47.02 PM.png|thumbnail|Lyrebird AI]]
  
[https://www.descript.com/lyrebird-ai?source=lyrebird Lyrebird] (also known as '''Lyrebird AI''') was a Montreal based company founded in 2017 focused on speech synthesis and voice imitation.<ref>https://www.wired.com/brandlab/2018/10/lyrebird-uses-ai-find-artificial-voice/</ref> In 2019 it was acquired by Descript, an American company focused on [https://en.wikipedia.org/wiki/Audio_editing_software audio editing software], specifically tailored towards [https://en.wikipedia.org/wiki/Podcast podcast creators].<ref>https://www.businessinsider.com/groupon-founder-andrew-mason-new-startup-descript-detour-2017-12</ref> Lyrebird AI uses artificial intelligence and voice samples to accurately replicate human speech.
+
[https://www.descript.com/lyrebird-ai?source=lyrebird Lyrebird] (also known as '''Lyrebird AI''') was a Montreal based company founded in 2017 focused on speech synthesis and voice imitation.<ref>https://www.wired.com/brandlab/2018/10/lyrebird-uses-ai-find-artificial-voice/</ref> In 2019 it was acquired by Descript, an American company focused on [https://en.wikipedia.org/wiki/Audio_editing_software audio editing software], specifically tailored towards [https://en.wikipedia.org/wiki/Podcast podcast creators].<ref>https://www.businessinsider.com/groupon-founder-andrew-mason-new-startup-descript-detour-2017-12</ref> Lyrebird AI uses artificial intelligence and voice samples to accurately replicate realistic human speech.
  
 
China-based [https://en.wikipedia.org/wiki/Technology_company technology company] [https://en.wikipedia.org/wiki/Baidu Baidu] has used [https://en.wikipedia.org/wiki/Artificial_neural_network neural networks] and [https://en.wikipedia.org/wiki/Deep_learning deep learning] to create accurate voice imitations from thousands of collected voice samples with [https://en.wikipedia.org/wiki/In-house_software in-house software] Deepvoice.<ref>https://www.technologyreview.com/f/610386/a-new-algorithm-can-mimic-your-voice-with-just-snippets-of-audio/</ref><ref>http://research.baidu.com/Blog/index-view?id=91</ref> Baidu claims that Deepvoice is capable of replicating thousands of unique voices, with less than 30 minutes of voice samples from each voice.<ref>http://research.baidu.com/Blog/index-view?id=81</ref>
 
China-based [https://en.wikipedia.org/wiki/Technology_company technology company] [https://en.wikipedia.org/wiki/Baidu Baidu] has used [https://en.wikipedia.org/wiki/Artificial_neural_network neural networks] and [https://en.wikipedia.org/wiki/Deep_learning deep learning] to create accurate voice imitations from thousands of collected voice samples with [https://en.wikipedia.org/wiki/In-house_software in-house software] Deepvoice.<ref>https://www.technologyreview.com/f/610386/a-new-algorithm-can-mimic-your-voice-with-just-snippets-of-audio/</ref><ref>http://research.baidu.com/Blog/index-view?id=91</ref> Baidu claims that Deepvoice is capable of replicating thousands of unique voices, with less than 30 minutes of voice samples from each voice.<ref>http://research.baidu.com/Blog/index-view?id=81</ref>
Line 14: Line 21:
  
 
==Speech imitation in culture==
 
==Speech imitation in culture==
===Virtual assistants===
+
A [[Virtual Assistants|virtual assistant]] is an artificial intelligence system that performs tasks for an individual based on commands or questions. The most prominent of these being [https://en.wikipedia.org/wiki/Siri Siri], a virtual assistant developed by [https://en.wikipedia.org/wiki/Apple_Inc. Apple Inc.] and utilized on Apple's various operating systems. As of 2019 Siri supports 21 different languages,<ref>https://www.globalme.net/blog/language-support-voice-assistants-compared</ref> can interpret and respond to a wide range of voice commands,<ref>https://www.cnet.com/how-to/the-complete-list-of-siri-commands/</ref> and as of 2020, can speak in 5 different English language accents.<ref>https://www.lifewire.com/change-siri-to-mans-voice-4103822</ref>
A [https://en.wikipedia.org/wiki/Virtual_assistant virtual assistant] that performs tasks for an individual based on commands or questions. The most prominent of these being [https://en.wikipedia.org/wiki/Siri Siri], a virtual assistant developed by [https://en.wikipedia.org/wiki/Apple_Inc. Apple Inc.] and utilized on Apple's various operating systems. As of 2019 Siri supports 21 different languages,<ref>https://www.globalme.net/blog/language-support-voice-assistants-compared</ref> can interpret and respond to a wide range of voice commands,<ref>https://www.cnet.com/how-to/the-complete-list-of-siri-commands/</ref> and as of 2020, can speak in 5 different english accents.<ref>https://www.lifewire.com/change-siri-to-mans-voice-4103822</ref>
+
 
 +
[https://en.wikipedia.org/wiki/HAL_9000 HAL 9000] (also simply referred to as '''HAL''') is an Artificial Intelligence and the main antagonist of [https://en.wikipedia.org/wiki/Arthur_C._Clarke Arthur C. Clarke's] 1968 [https://en.wikipedia.org/wiki/Space_Odyssey Space Odyssey] series. HAL interacts with other characters in the series with a speech synthesizer and is a popular example of artificial human speech synthesis in [https://en.wikipedia.org/wiki/Popular_culture popular culture].
  
 
==Ethical implications==
 
==Ethical implications==
Line 21: Line 29:
  
 
There have been concerns raised over the authenticity of voice recordings when one has access to realistic voice imitation software. Concerns such as if recordings of politicians in closed-door meetings can be trusted as authentic when any voice could be replicated with enough voice samples.<ref>https://www.scientificamerican.com/article/new-ai-tech-can-mimic-any-voice/</ref> Responses to the rise of synthetic media include the Deep Fake Detection Challenge (also known as DFDC), which is sponsored by several [https://en.wikipedia.org/wiki/Big_Tech big tech] companies. The DFDC incentivizes participants to help develop technology for detecting deep fakes and related synthetic media, often referred to as tampered media, with prizes for software developed by participants which helps to identify synthetic media. The goals of the DFDC are to develop detection software at a quicker pace than the AI used in the creation of tampered media.<ref>https://deepfakedetectionchallenge.ai/</ref>
 
There have been concerns raised over the authenticity of voice recordings when one has access to realistic voice imitation software. Concerns such as if recordings of politicians in closed-door meetings can be trusted as authentic when any voice could be replicated with enough voice samples.<ref>https://www.scientificamerican.com/article/new-ai-tech-can-mimic-any-voice/</ref> Responses to the rise of synthetic media include the Deep Fake Detection Challenge (also known as DFDC), which is sponsored by several [https://en.wikipedia.org/wiki/Big_Tech big tech] companies. The DFDC incentivizes participants to help develop technology for detecting deep fakes and related synthetic media, often referred to as tampered media, with prizes for software developed by participants which helps to identify synthetic media. The goals of the DFDC are to develop detection software at a quicker pace than the AI used in the creation of tampered media.<ref>https://deepfakedetectionchallenge.ai/</ref>
 +
 +
==See also==
 +
*[[Artificial Intelligence and Technology|Artificial intelligence]]
 +
*[[Deepfake]]
 +
*[[Online Identity Theft|Online identity theft]]
 +
*[https://en.wikipedia.org/wiki/Speech-generating_device Speech-generating device]
 +
*[[Virtual Assistants|Virtual assistants]]
  
 
==References==
 
==References==

Latest revision as of 19:59, 27 March 2020

Voice imitation algorithms (also known as speech synthesis[1]) are a form of synthetic media used to imitate human speech. They do this using machine learning and artificial intelligence techniques.[2] The most common approach relies on a large number of voice samples to produce synthesized speech.[3] Well-known voice imitation systems include Lyrebird AI, Deepvoice, and WaveNet.

History

Mechanical voice imitation

The earliest recorded artificial system built to imitate human speech was created by the Austro-Hungarian author and inventor Wolfgang von Kempelen. Development of his speaking machine began in 1769, and later versions incorporated models of the tongue and lips that enabled the machine to pronounce both vowels and consonants.[4]

Electronic voice imitation

The first complete electronic English speech synthesis system was created in Japan in 1968 by Noriko Umeda and Ryunen Teranishi, who were working for the Japanese government.[5] The system could analyze English text and approximate the pronunciation of sentences.[6] It used a dictionary of the 1,500 most common English words. The techniques and rules used in this software were later adopted in Bell Labs' own speech synthesis system.[7]

Implementation

Commercial implementation

The Speak and Spell was introduced in 1978 by Texas Instruments. It featured a keyboard and a speech synthesizer, and converted words typed on the keyboard into synthesized audio played through its speakers. Its purpose was to help children learn proper pronunciation and spelling.[8]
[Image: Lyrebird AI]

Lyrebird (also known as Lyrebird AI) was a Montreal-based company founded in 2017 that focused on speech synthesis and voice imitation.[9] In 2019 it was acquired by Descript, an American company that makes audio editing software tailored to podcast creators.[10] Lyrebird AI uses artificial intelligence and voice samples to produce realistic imitations of human speech.

China-based technology company Baidu has used neural networks and deep learning to create accurate voice imitations from thousands of collected voice samples with its in-house software, Deepvoice.[11][12] Baidu claims that Deepvoice can replicate thousands of unique voices using less than 30 minutes of voice samples per voice.[13]

Research

The Applied Science and Engineering Laboratories (also known as ASEL), jointly operated by the University of Delaware and the Nemours Alfred I. duPont Hospital for Children, has researched and developed ModelTalker,[14][15] software used with augmentative and alternative communication (AAC) devices to replicate human speech and assist those with hearing or speech impairments. The ModelTalker TTS converts English-language text into synthesized English speech.

The vocoder was invented at Bell Labs in 1938.[16] It is a type of voice codec that analyzes and synthesizes human voice waveforms. It is mainly used in audio data compression, so that voice data can be stored and used with fewer bits than the original signal. This allows speech synthesis algorithms to save, analyze, and output higher-fidelity data and so imitate human speech more accurately.[17]
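
The analyze-then-synthesize idea behind a vocoder can be illustrated with a minimal sketch. This is not the Bell Labs design or the method of any cited system. It is a toy channel vocoder under simplifying assumptions: the function names, the four band frequencies, the 8000 Hz sample rate, and the 160-sample frame length are all hypothetical choices made for illustration, and phase information is simply discarded. Each frame of audio is reduced to one amplitude per frequency band, and the signal is then rebuilt from those amplitudes alone.

```python
import math


def analyze(samples, frame_len, band_freqs, sample_rate):
    """Analysis half: reduce each frame to one amplitude per band."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        amps = []
        for freq in band_freqs:
            step = 2 * math.pi * freq / sample_rate
            # Correlate the frame with a cosine/sine pair at the band centre.
            re = sum(x * math.cos(step * n) for n, x in enumerate(frame))
            im = sum(x * math.sin(step * n) for n, x in enumerate(frame))
            amps.append(2 * math.hypot(re, im) / frame_len)
        frames.append(amps)
    return frames


def synthesize(frames, frame_len, band_freqs, sample_rate):
    """Synthesis half: rebuild audio as a sum of sinusoids per frame.

    Phase was discarded during analysis, so every sinusoid restarts
    from zero phase at time zero (an assumption of this sketch).
    """
    out = []
    for i, amps in enumerate(frames):
        for n in range(frame_len):
            t = (i * frame_len + n) / sample_rate
            out.append(sum(a * math.sin(2 * math.pi * f * t)
                           for a, f in zip(amps, band_freqs)))
    return out


# Hypothetical parameters chosen so each band fits a whole number of
# cycles into one frame.
SAMPLE_RATE = 8000
FRAME_LEN = 160
BANDS = [250, 500, 1000, 2000]

# Test signal: a 500 Hz tone with amplitude 0.5, three frames long.
samples = [0.5 * math.sin(2 * math.pi * 500 * n / SAMPLE_RATE)
           for n in range(3 * FRAME_LEN)]

frames = analyze(samples, FRAME_LEN, BANDS, SAMPLE_RATE)
rebuilt = synthesize(frames, FRAME_LEN, BANDS, SAMPLE_RATE)

stored = sum(len(f) for f in frames)
print(f"{len(samples)} samples reduced to {stored} band amplitudes")
```

The compression comes from storing only a handful of band amplitudes per frame instead of every sample. Real speech would need many more bands, overlapping frames, and some handling of pitch and phase, so the sketch shows only the principle.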

Speech imitation in culture

A virtual assistant is an artificial intelligence system that performs tasks for an individual based on commands or questions. The most prominent of these is Siri, a virtual assistant developed by Apple Inc. and used on Apple's various operating systems. As of 2019, Siri supports 21 different languages[18] and can interpret and respond to a wide range of voice commands,[19] and as of 2020 it can speak in five different English-language accents.[20]

HAL 9000 (also referred to simply as HAL) is an artificial intelligence and the main antagonist of Arthur C. Clarke's Space Odyssey series, which began in 1968. HAL interacts with other characters in the series through a speech synthesizer and is a well-known example of artificial human speech synthesis in popular culture.

Ethical implications

Voice imitation algorithms have been used in grandparent scams, a type of telemarketing fraud in which the scammer calls an elderly person while claiming to be a relative who has gotten into some kind of trouble and needs money. A realistic-sounding synthesized voice makes this type of scam easier, because the person being scammed is less likely to recognize that they are speaking with a synthesized voice.[21]

Concerns have been raised over the authenticity of voice recordings now that realistic voice imitation software is available. One such concern is whether recordings of politicians in closed-door meetings can be trusted as authentic when any voice could be replicated given enough voice samples.[22] Responses to the rise of synthetic media include the Deep Fake Detection Challenge (also known as DFDC), which is sponsored by several big tech companies. The DFDC offers prizes to participants who develop software for detecting deep fakes and related synthetic media, often referred to as tampered media. The goal of the DFDC is to develop detection software at a quicker pace than the AI used to create tampered media.[23]

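See also

* Artificial intelligence
* Deepfake
* Online identity theft
* Speech-generating device
* Virtual assistants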

References

  1. https://thehill.com/opinion/cybersecurity/470826-perception-wont-be-reality-once-ai-can-manipulate-what-we-see
  2. https://www.sciencedirect.com/science/article/pii/S0007681319301600?via%3Dihub
  3. https://towardsdatascience.com/you-can-now-speak-using-someone-elses-voice-with-deep-learning-8be24368fa2b
  4. http://www.coli.uni-saarland.de/~trouvain/Kempelen-Web_2017_07_31.pdf
  5. http://amhistory.si.edu/archives/speechsynthesis/ss_etl.htm
  6. https://www.jbistudios.com/blog/text-to-speech-future-now
  7. http://amhistory.si.edu/archives/speechsynthesis/dk_757.htm#Umeda
  8. https://toytales.ca/speak-spell-from-texas-instruments-1978/
  9. https://www.wired.com/brandlab/2018/10/lyrebird-uses-ai-find-artificial-voice/
  10. https://www.businessinsider.com/groupon-founder-andrew-mason-new-startup-descript-detour-2017-12
  11. https://www.technologyreview.com/f/610386/a-new-algorithm-can-mimic-your-voice-with-just-snippets-of-audio/
  12. http://research.baidu.com/Blog/index-view?id=91
  13. http://research.baidu.com/Blog/index-view?id=81
  14. https://www.asel.udel.edu/
  15. https://www.asel.udel.edu/speech/ModelTalker.html
  16. https://patents.google.com/patent/US2121142A/en
  17. https://arxiv.org/abs/1711.10433
  18. https://www.globalme.net/blog/language-support-voice-assistants-compared
  19. https://www.cnet.com/how-to/the-complete-list-of-siri-commands/
  20. https://www.lifewire.com/change-siri-to-mans-voice-4103822
  21. https://www.nextgov.com/emerging-tech/2019/11/ftc-explore-promises-and-potential-abuses-voice-cloning-technology/161083/
  22. https://www.scientificamerican.com/article/new-ai-tech-can-mimic-any-voice/
  23. https://deepfakedetectionchallenge.ai/