Artificial intelligence company OpenAI recently introduced Voice Engine, a natural-sounding speech generator that uses text and a 15-second audio sample to create an “emotive and realistic” imitation of the original speaker.
OpenAI has not yet released Voice Engine to the public, citing concerns over the potential abuse of its generative artificial intelligence (AI) – specifically to produce audio deepfakes – which could contribute to misinformation, especially during elections.
Audio deepfakes and their uses
Audio deepfakes are generated using deep learning techniques: AI models trained on large datasets of audio samples learn the characteristics of human speech and reproduce them as realistic synthetic audio. They can be generated in two ways: text-to-speech (written text is converted into audio in a target voice) and speech-to-speech (an uploaded voice recording is re-synthesised in the target voice).
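As a rough illustration of the text-to-speech pathway, the open-source Coqui TTS library can clone a voice from a short reference clip in a few lines of Python. The model name, function call and file names shown here are assumptions drawn from the library's published examples and may differ across versions; this is a minimal sketch, not an endorsed workflow.

```python
# Minimal sketch of text-to-speech voice cloning with the open-source
# Coqui TTS library (assumption: the XTTS-v2 model name and the
# tts_to_file() signature shown here; both may vary by library version).
from TTS.api import TTS

# Load a multilingual voice-cloning model.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Text-to-speech cloning: written text plus a short reference clip of the
# target speaker's voice (the hypothetical file reference_clip.wav).
tts.tts_to_file(
    text="This sentence was never actually spoken by the target speaker.",
    speaker_wav="reference_clip.wav",  # a few seconds of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```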
Audio deepfakes have been used in cyber-enabled financial scams in which fraudsters impersonate bank customers to authorise transactions. The same technology is increasingly being used to propagate disinformation, and several audio deepfakes mimicking the voices of politicians have circulated on social media. In 2023, artificially generated audio clips purported to capture UK Labour leader Keir Starmer berating party staffers. Although fact-checkers determined the audio was fake, it was viewed more than 1.5 million times on X (formerly Twitter).
In India, voice cloning of children has been used to deceive parents into transferring money. In Singapore, deepfake videos containing voice clones of politicians such as the prime minister and deputy prime minister have been used in cyber-scams.
Commercialisation boom
Anyone can generate an audio deepfake: they are easier and cheaper to make than video deepfakes and simpler to disseminate on social media and messaging platforms.
With advances in the technology, as little as one or two minutes of audio is enough to generate a convincing deepfake recording; OpenAI’s Voice Engine has cut the required sample to just 15 seconds. More professional voice clones require payment, but the sums involved are not prohibitive.
The commercialisation of audio deepfake technology has boomed in recent years. Companies such as ElevenLabs offer services to create synthetic copies of voices, generate speech in 29 languages, and match accents of one’s choice.
There has been an uptick in political deepfakes targeting electoral processes in recent years, with the aim of sowing discord and confusion. Audio deepfakes have been deployed in the lead-up to India’s elections. An AI-generated audio clip mimicking the voice of US President Joe Biden was used in robocalls targeting registered Democratic voters in New Hampshire ahead of the Democratic primary in January 2024. While the robocall was identified as fake, it served to raise public awareness of the risks of AI-enabled voice cloning.
Days before Slovakia’s parliamentary elections in 2023, audio deepfake recordings purporting to capture a leading politician and a journalist discussing vote-rigging went viral. The recordings divided public opinion even though fact-checkers confirmed they were fabricated and the conversations never took place. Experts believe the recordings may have influenced the election outcome.
In the United States, state election officials are concerned that their voices could be cloned and used maliciously to announce false election results.
Audio deepfakes can also be weaponised to sow discord and incite violence. An audio deepfake of the Mayor of London, Sadiq Khan, in which he appeared to disparage Remembrance weekend and call for pro-Palestinian marches in London, was presented as if it were a secretly made recording. It went viral on social media and drew hateful comments directed at Khan.
These uses of generative AI have real potential to influence public opinion and turbo-charge disinformation, allowing it to spread rapidly on social media and messaging platforms.
Solutions in the works
Audio deepfakes contain fewer overt signs of manipulation than deepfake videos or images, and are not easily detected without technical expertise.
Computer security software company McAfee recently announced Project Mockingbird, a technology to detect and expose AI-altered audio in videos. Companies that provide AI voice-generation services have also taken measures to ensure their systems can identify manipulated audio. Watermarking the audio such companies generate could go some way towards enabling proactive monitoring of audio deepfakes and how they are being used.
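One simplified way audio watermarking can work is spread-spectrum embedding: a pseudorandom signature keyed to a secret is mixed into the waveform at low amplitude, and a verifier holding the same key detects it by correlation. The Python sketch below is a toy illustration of that idea only; the key, embedding strength and detection threshold are arbitrary, and it bears no relation to any vendor’s actual scheme.

```python
# Toy spread-spectrum audio watermark: illustrative only, not any company's
# actual method. A keyed pseudorandom signature is added at low amplitude;
# a verifier holding the same key detects it via normalised correlation.
import numpy as np

KEY = 42          # hypothetical shared key
STRENGTH = 0.01   # embedding amplitude (arbitrary for this toy example)

def embed_watermark(audio: np.ndarray, key: int = KEY) -> np.ndarray:
    signature = np.random.default_rng(key).standard_normal(audio.shape[0])
    return audio + STRENGTH * signature

def detect_watermark(audio: np.ndarray, key: int = KEY) -> bool:
    signature = np.random.default_rng(key).standard_normal(audio.shape[0])
    score = np.dot(audio, signature) / (
        np.linalg.norm(audio) * np.linalg.norm(signature)
    )
    return score > 0.05  # threshold chosen only to separate the toy cases

# One second of stand-in "speech" at 16 kHz.
clean = np.random.default_rng(0).standard_normal(16000) * 0.1
marked = embed_watermark(clean)
print(detect_watermark(marked), detect_watermark(clean))  # True, False
```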
On the legislative front, there have been increasing calls for action globally. The United States, for instance, has taken steps to regulate audio deepfakes used in elections, including by banning the use of AI-generated voices in robocalls.
Emphasis should be placed on responding quickly to refute misinformation and disinformation propagated by audio deepfakes. Giving more resources to journalists and fact-checkers to tap their collective subject expertise would go a long way towards demystifying deepfakes and exposing the use of AI by malicious actors to generate misleading content.