Voice imitation algorithms

From SI410
Revision as of 19:59, 13 March 2020 by Igesinsk (Talk | contribs)

Voice imitation algorithms (also known as speech synthesis[1]) are a form of synthetic media used to imitate human speech. They do so using machine learning and artificial intelligence techniques.[2] The most common approach relies on a large number of voice samples to produce synthesized speech.[3]

History

Commercial implementation

The Speak and Spell was introduced by Texas Instruments in 1978. It featured a keyboard and a speech synthesizer that converted words typed on the keyboard into synthesized audio, which it played aloud.

Lyrebird AI

Lyrebird (also known as Lyrebird AI) was a Montreal-based company, founded in 2017, that focused on speech synthesis and voice imitation.[4] In 2019 it was acquired by Descript, an American audio editing software company whose products are tailored to podcast creators.[5] Lyrebird's technology uses artificial intelligence and voice samples to accurately replicate human speech.

The China-based technology company Baidu has used neural networks and deep learning to create accurate voice imitations from thousands of collected voice samples with its in-house software, Deep Voice.[6][7] Baidu claims that Deep Voice can replicate thousands of unique voices using less than 30 minutes of samples per voice.[8]

Research

The Applied Science and Engineering Laboratories (also known as ASEL), jointly operated by the University of Delaware and the Nemours Alfred I. duPont Hospital for Children, researched and developed ModelTalker.[9][10] This software is used with augmentative and alternative communication (AAC) devices to replicate human speech and assist those with hearing or speech impairments. The ModelTalker text-to-speech (TTS) system converts English-language text into synthesized English-language speech.

The vocoder was invented in 1938 by Bell Labs.[11] It is a type of voice codec that analyzes and synthesizes the waveforms of the human voice. It is mainly used in audio data compression, so that voice data can be stored and used with fewer bits than the original recording. This allows speech synthesis algorithms to save, analyze, and output higher-fidelity data, and so to imitate human speech more accurately.[12]
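
The analyze-then-synthesize idea can be illustrated with a minimal sketch. This is not Bell Labs' design or any production codec: the function names `analyze` and `synthesize`, the frame length, the sample rate, and the band frequencies are all assumptions chosen for illustration. The sketch estimates one amplitude per frequency band per frame, which is far fewer numbers than the raw samples, and then rebuilds audio from those amplitudes alone.

```python
import math

def analyze(samples, rate, frame_len, band_centers):
    """Analysis half: reduce each frame to one amplitude per frequency band."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        amps = []
        for freq in band_centers:
            # Correlate the frame with a cosine and a sine at the band
            # frequency (a single-bin DFT) to estimate that band's amplitude.
            re = sum(x * math.cos(2 * math.pi * freq * n / rate)
                     for n, x in enumerate(frame))
            im = sum(x * math.sin(2 * math.pi * freq * n / rate)
                     for n, x in enumerate(frame))
            amps.append(2 * math.hypot(re, im) / frame_len)
        frames.append(amps)
    return frames

def synthesize(frames, rate, frame_len, band_centers):
    """Synthesis half: rebuild audio by summing one sine wave per band."""
    out = []
    for i, amps in enumerate(frames):
        for n in range(frame_len):
            t = (i * frame_len + n) / rate
            out.append(sum(a * math.sin(2 * math.pi * f * t)
                           for a, f in zip(amps, band_centers)))
    return out

# Hypothetical input: 0.1 seconds of a 440 Hz tone sampled at 8000 Hz.
rate, frame_len = 8000, 200
bands = [200, 440, 880, 1600]  # assumed band frequencies in Hz
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(800)]

frames = analyze(tone, rate, frame_len, bands)
rebuilt = synthesize(frames, rate, frame_len, bands)

stored = sum(len(f) for f in frames)
print(f"{len(tone)} samples reduced to {stored} band amplitudes")
```

The sketch discards phase and uses only four bands, so it reproduces this test tone closely but would make real speech sound robotic. Practical vocoders use many more bands and also track pitch and voicing.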

Speech imitation in culture

Virtual assistants

A virtual assistant is software that performs tasks for an individual based on commands or questions. The most prominent of these is Siri, a virtual assistant developed by Apple Inc. and used on Apple's various operating systems. As of 2019, Siri supports 21 different languages[13] and can interpret and respond to a wide range of voice commands;[14] as of 2020, it can speak in five different English accents.[15]

Ethical implications

Voice imitation algorithms have been used in grandparent scams, a type of telemarketing fraud in which the scammer calls an elderly person while claiming to be a relative who has gotten into some kind of trouble and needs money. A realistic-sounding synthesized voice makes this type of scam easier, because it is harder for the victim to recognize that the voice they are hearing is synthesized.[16]

Concerns have been raised over the authenticity of voice recordings now that realistic voice imitation software is available. One such concern is whether recordings of politicians in closed-door meetings can be trusted as authentic when any voice could be replicated given enough voice samples.[17] Responses to the rise of synthetic media include the Deepfake Detection Challenge (also known as DFDC), which is sponsored by several big tech companies. The DFDC offers prizes to participants who develop software that helps detect deepfakes and related synthetic media, often referred to as tampered media. The goal of the DFDC is for detection software to advance faster than the AI used to create tampered media.[18]

References

  1. https://thehill.com/opinion/cybersecurity/470826-perception-wont-be-reality-once-ai-can-manipulate-what-we-see
  2. https://www.sciencedirect.com/science/article/pii/S0007681319301600?via%3Dihub
  3. https://towardsdatascience.com/you-can-now-speak-using-someone-elses-voice-with-deep-learning-8be24368fa2b
  4. https://www.wired.com/brandlab/2018/10/lyrebird-uses-ai-find-artificial-voice/
  5. https://www.businessinsider.com/groupon-founder-andrew-mason-new-startup-descript-detour-2017-12
  6. https://www.technologyreview.com/f/610386/a-new-algorithm-can-mimic-your-voice-with-just-snippets-of-audio/
  7. http://research.baidu.com/Blog/index-view?id=91
  8. http://research.baidu.com/Blog/index-view?id=81
  9. https://www.asel.udel.edu/
  10. https://www.asel.udel.edu/speech/ModelTalker.html
  11. https://patents.google.com/patent/US2121142A/en
  12. https://arxiv.org/abs/1711.10433
  13. https://www.globalme.net/blog/language-support-voice-assistants-compared
  14. https://www.cnet.com/how-to/the-complete-list-of-siri-commands/
  15. https://www.lifewire.com/change-siri-to-mans-voice-4103822
  16. https://www.nextgov.com/emerging-tech/2019/11/ftc-explore-promises-and-potential-abuses-voice-cloning-technology/161083/
  17. https://www.scientificamerican.com/article/new-ai-tech-can-mimic-any-voice/
  18. https://deepfakedetectionchallenge.ai/