
New software edits voices like text

Published

July 9, 2017

The software is called VoCo, and it allows the user to edit a transcript of an audio recording of a human voice by adding or replacing words. The replacement words are processed by the software and automatically synthesized in the speaker’s voice. Such technology can allow an interview, for example, to be altered where certain words or phrases spoken by the interviewee are unclear. In the era of fake news, however, such software could also provide the means to change the meaning or context of what someone is saying.

The new software comes from Princeton University’s engineering school in the U.S. It is based on a sophisticated algorithm that uses machine learning to recreate the sound of a particular voice and also to ‘learn’ from any mistakes that a human corrects. The idea is to make the editing of podcasts and video narration, such as that found on YouTube channels, far easier. It also removes the need for the narrator to return to the recording studio should some of the earlier recorded utterances be unclear.

In addition, the software can provide the starting point for creating personalized robotic voices that sound more natural and ‘human-like’. Despite technological advances, many computer voices continue to sound like Cybermen from Doctor Who.

Discussing the software on his university website, the lead developer Professor Adam Finkelstein said: “VoCo provides a peek at a very practical technology for editing audio tracks, but it is also a harbinger for future technologies that will allow the human voice to be synthesized and automated in remarkable ways.”

VoCo works by augmenting the waveform with a text transcript of the track. This allows the user to replace words, or insert new words that do not already exist in the track, simply by typing in the transcript. As the user types the new word, VoCo updates the audio track and automatically synthesizes the new word by linking together snippets of audio from elsewhere in the narration. This is done by an optimization algorithm that searches the voice recording and chooses the best possible combination of partial word sounds, called “phonemes,” to build new words in the speaker’s voice.
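The idea of searching a recording for the best combination of phoneme snippets can be illustrated with a small Python sketch. This is not VoCo's actual algorithm, which the article does not detail. It is a toy dynamic-programming selection, and the `Snippet` fields, the `join_cost` formula and the splice penalty are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """A hypothetical phoneme-sized piece of the original narration."""
    phoneme: str   # phoneme label, e.g. "K"
    start: float   # start time in the recording, in seconds
    end: float     # end time in the recording, in seconds
    pitch: float   # average pitch in Hz (a toy stand-in for real audio features)


def join_cost(a: Snippet, b: Snippet) -> float:
    """Assumed cost of playing b straight after a.

    Snippets that were already adjacent in the recording join for free;
    otherwise there is a fixed splice penalty plus a pitch-mismatch term.
    """
    if abs(a.end - b.start) < 1e-9:
        return 0.0
    return 1.0 + abs(a.pitch - b.pitch) / 100.0


def select_snippets(target, inventory):
    """Choose one snippet per target phoneme so the total join cost is minimal."""
    if not target:
        return [], 0.0
    candidates = [[s for s in inventory if s.phoneme == p] for p in target]
    if any(not c for c in candidates):
        raise ValueError("a target phoneme never occurs in the recording")

    # best[i][j] = (lowest cost of reaching candidate j at position i, backpointer)
    best = [[(0.0, -1) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for cur in candidates[i]:
            options = [
                (best[i - 1][k][0] + join_cost(prev, cur), k)
                for k, prev in enumerate(candidates[i - 1])
            ]
            row.append(min(options))
        best.append(row)

    # Walk the backpointers from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


if __name__ == "__main__":
    # Made-up inventory: phonemes found at various points in a narration.
    recording = [
        Snippet("K", 0.0, 0.1, 120.0),
        Snippet("AE", 0.1, 0.2, 110.0),
        Snippet("T", 0.5, 0.6, 180.0),
        Snippet("AE", 2.0, 2.1, 130.0),
        Snippet("T", 2.1, 2.2, 130.0),
    ]
    chosen, cost = select_snippets(["K", "AE", "T"], recording)
    for s in chosen:
        print(s.phoneme, s.start, s.end)
    print("total join cost:", cost)
```

In this sketch the search prefers runs of snippets that were already spoken back to back, since those need no splice. A real system would compare far richer audio features than a single pitch value.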

The development of the software is described in the journal ACM Transactions on Graphics, in a paper titled “VoCo: Text-based Insertion and Replacement in Audio Narration.”

Written By Dr. Tim Sandle

Dr. Tim Sandle is Digital Journal's Editor-at-Large for science news. Tim specializes in science, technology, environmental, business, and health journalism. He is also a practising microbiologist and an author, with an interest in history, politics and current affairs.