Dr Matthew Aylett delves into deepfake audio, touching on its weaponisation and the lack of public awareness, while reframing the technology as a communication tool for people with speech-altering diseases

Video editing technology has advanced rapidly in recent years, to the point where the topic of weaponising it to create fake videos – otherwise known as deepfakes – has gone mainstream.

There is major concern among governments and businesses about the risk of deepfakes being used to spread fake news and misinformation for nefarious purposes, and rightly so.

However, footage and imagery alone are not enough. The vast majority of video footage and clips feature audio, yet audio is very rarely discussed where deepfakes are concerned. It’s critical though, as audio validates the authenticity of footage. Regardless of what a video shows, if the audio doesn’t match – for example, lips move in a manner inconsistent with what the person on camera is saying – the authenticity of the video will be questioned and it will likely be dismissed.

Today, video and audio technology is developing at speed. And for good reason.

As with most new technology, it’s driven by a need or desire to improve something or overcome a problem; it’s a small minority of technology that is initially developed for sinister reasons. Make no mistake though, many types of new technology can be weaponised in one way or another – as video editing technology proves. We’re now seeing that happen with synthesised voices (voices generated and spoken by a computer) as wrongdoers seek to misinform or trick targets for their own gain or cause.

In July of this year, security firm Symantec revealed three cases of deepfake audio being used to trick senior financial controllers into transferring money. This shows that audio innocently made available by CEOs is being used by cyber-criminals to create deepfake audio that can con finance departments. And this was done without video footage.

How advanced is speech synthesis today?

In the 1990s, many synthetic voice systems used a technique called unit selection, whereby voice actors would spend a long time – around thirty hours – in a recording studio, and an algorithm would break down their speech into its component units. These units would then be reassembled into artificial speech and ‘smoothed’ so that the system could pronounce almost any word. This produced high-quality sound, but it was slow – it took approximately three months from end to end.
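
The reassemble-and-smooth idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `inventory` of recorded units, the function names, the greedy choice of the candidate whose first sample best matches the end of the speech so far, and the linear crossfade used for ‘smoothing’. Production systems search over many candidates with far richer cost measures.

```python
def crossfade_join(left, right, overlap):
    """Join two waveforms, linearly blending `overlap` samples at the seam."""
    if overlap == 0:
        return left + right
    head, tail = left[:-overlap], left[-overlap:]
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight shifts from left unit to right unit
        blended.append((1 - w) * tail[i] + w * right[i])
    return head + blended + right[overlap:]


def synthesise(unit_names, inventory, overlap=2):
    """Greedily pick one recorded candidate per unit and join them smoothly.

    `inventory` (hypothetical) maps a unit name to a list of candidate
    waveforms, each a list of float samples at least `overlap` long.
    """
    out = list(inventory[unit_names[0]][0])
    for name in unit_names[1:]:
        # Join cost: how far the candidate's first sample is from our last one.
        best = min(inventory[name], key=lambda cand: abs(out[-1] - cand[0]))
        out = crossfade_join(out, list(best), overlap)
    return out


# Two made-up units; "e" has two recorded candidates.
inventory = {
    "h": [[0.0, 0.1, 0.2, 0.3]],
    "e": [[0.9, 0.8, 0.7, 0.6], [0.3, 0.4, 0.5, 0.6]],
}
speech = synthesise(["h", "e"], inventory)
```

In this example the second candidate for "e" is chosen because it starts where "h" ends, which is the kind of match that makes a seam harder to hear.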

Later speech synthesis technology attempted to accelerate and improve this process without requiring so much time upfront. The system known as parametric modelling, for example, first generates phonemes (the units of speech) and their durations from a text, then applies ‘features’ to them, such as frequency, rhythm and intonation.

Deep Neural Network (DNN) systems use neural networks, where certain elements are ‘weighted’ and trained repeatedly to improve synthesised speech. These systems use ‘correct’ examples of speech patterns to fine-tune the individual elements of the synthesised speech based on parts that are right or wrong.

Technology has now started to move beyond this: what we are seeing today is an evolution of this system in WaveNet-style synthesis, whereby all previous iterations are used in the neural network training, rather than the one ‘trial’ immediately prior to the current test.

This results in a much smoother, cleaner experience. However, it can be quite in-depth and time-consuming to generate because of the large number of trials that are analysed – although that said, accelerating compute time is a much easier process than improving algorithms!

It’s here to stay

By this point, you may wonder, if deepfakes pose such a risk, why isn’t there a blanket ban on developing video editing and speech synthesis technology that can be used in harmful ways by bad actors? It’s simple: for the most part, this technology is improving people’s lives.

Synthesised speech, for example, can be used by people living with communication difficulties caused by a range of disabilities, illnesses and diseases (such as motor neurone disease, cerebral palsy or multiple sclerosis) in their augmentative and alternative communication (AAC) tools to communicate more effectively and experience an improved quality of life. So, to ban the development of such technology is to strip the voices of those who need them most.

Businesses and authorities must accept that the issue of deepfakes is one that they’ll face increasingly and for some time. No doubt many will already be seeking to understand them, what is and isn’t possible, and how they can be exploited. But there’s not much information in the public domain, or certainly not in one centralised, easy-to-access location.

There are a number of important facts to understand about audio and speech synthesis.

Firstly, it’s a misconception that huge amounts of audio, usually voice recordings, are needed to replicate someone’s voice and create a fake statement in their voice.

Professionals need as little as half an hour of recording. With so much content on YouTube and social media, this can easily be sourced, especially for people who are public-facing and regularly give interviews, like politicians or celebrities.

It is, however, a challenge to source ‘clean audio’. In a recording studio, professionals can capture all the voice data they need to replicate a voice to the highest quality. But using existing materials from the likes of YouTube means the voice is likely to be of a lower quality, with background noises and substandard devices (such as iPhones being used to record the audio) posing challenges.

Whoever is seeking to misuse the data will need to find ways to mask problems like background noise. This is possible, but in-depth analysis could still expose the result as fake.

Secondly, there’s a belief that producing quality digital voices requires high compute power. This false information has even appeared in mainstream media outlets like the BBC. It’s not the case. Faster, more powerful computers are more accessible than ever, especially to cybercriminals who invest in high-spec versions.

Finally, producing organic-sounding conversation using speech synthesis is currently very difficult. Current tools have a long way to go before they reach a level that could deceive those viewing the deepfake or listening to the audio. It will happen, but for the time being conversational speech synthesis is less of a concern.

What comes next?

Video editing and speech synthesis technology has evolved significantly in recent years and that’s set to continue. We can expect to see further development, but with that comes a responsibility for parties with power.

Governments and regulators may look at introducing dedicated legislation designed to protect against deepfake videos and audio, with suitable punishments for those who weaponise the technology and break the law. At present, anyone misusing the technology is only prosecutable under laws specific to the fraud, scams and theft they commit using deepfakes. We may see dedicated anti-misinformation legislation introduced to combat the growing spread of fake news that has recently dogged electoral campaigns.

The speech synthesis industry may also consider some form of watermark that can be applied to digital audio files. While this wouldn’t prevent a deepfake being created, it would allow for it to be revealed as false or malicious faster and hopefully be contained before mass misinformation occurs.
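
No particular watermarking scheme is specified here, so the following is only a toy sketch of the general idea, under stated assumptions: a vendor adds a faint pseudo-random pattern derived from a secret key to the audio samples, and a checker who holds the same key looks for that pattern by correlation. The function names, the key strings and the `strength` value are all hypothetical, and a real watermark would also need to be inaudible and to survive compression and re-recording, which this sketch does not attempt.

```python
import math
import random


def keyed_sequence(key, length):
    """Pseudo-random +1/-1 pattern that only a holder of `key` can reproduce."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed_watermark(samples, key, strength=0.02):
    """Add the faint keyed pattern on top of the audio samples."""
    pattern = keyed_sequence(key, len(samples))
    return [s + strength * p for s, p in zip(samples, pattern)]


def detect_watermark(samples, key, strength=0.02):
    """Correlate the audio with the keyed pattern and compare to a threshold."""
    pattern = keyed_sequence(key, len(samples))
    score = sum(s * p for s, p in zip(samples, pattern)) / len(samples)
    return score > strength / 2


# Stand-in for synthesised speech: three seconds of a 220 Hz tone at 16 kHz.
clean = [0.5 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(48000)]
marked = embed_watermark(clean, "vendor-key")

print(detect_watermark(marked, "vendor-key"))
print(detect_watermark(clean, "vendor-key"))
```

The design choice worth noting is that detection needs only the key and the suspect audio, not the original recording, which is what would let a third party flag a synthetic clip quickly.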

At the heart of the issue, however, is education.

Governments, businesses and the average person must be informed about what a deepfake is and the threat it poses. Compare this situation to email scams: we’re not going to abandon email simply because a handful of cyber-criminals abuse it – but if you know a dodgy email could be a scam or hack attempt, you’re considerably less likely to fall for it.



Dr Matthew Aylett




