A new AI system can create natural-sounding speech and music after being prompted with just a few seconds of audio.
Developed by Google researchers, AudioLM generates audio that matches the prompt’s style, including complex sounds such as piano music or people speaking, in a way that is nearly indistinguishable from the original recording. The technique shows promise for speeding up the process of training AI to generate audio, and it could eventually be used to automatically generate music to accompany videos.
AI-generated audio is already commonplace: voices for home assistants like Alexa use natural language processing, and AI music systems like OpenAI’s Jukebox have produced impressive results. But most existing techniques require people to prepare transcripts and label text-based training data, which takes significant time and labor. Jukebox, for example, uses text-based data to generate lyrics.
AudioLM, described in a non-peer-reviewed paper last month, is different: it requires no transcription or labeling. Instead, sound databases are fed into the program and machine learning is used to compress the audio files into sound clips, called ‘tokens’, without losing too much information. This tokenized training data is then fed into a machine learning model that uses natural language processing to learn the patterns of the sound.
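To make the idea of audio “tokens” concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions: AudioLM uses a learned neural codec to compress audio, whereas this toy stand-in maps each raw amplitude sample to one of a small number of discrete bins by uniform quantization. The function names and parameters are hypothetical, not AudioLM’s.

```python
import math

def tokenize(samples, n_levels=16):
    """Toy stand-in for a neural audio codec: map each amplitude
    in [-1, 1] to one of n_levels discrete tokens. (AudioLM uses a
    learned codec, not uniform quantization like this.)"""
    tokens = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip to valid range
        tokens.append(min(n_levels - 1, int((s + 1) / 2 * n_levels)))
    return tokens

def detokenize(tokens, n_levels=16):
    """Approximate reconstruction: each token maps back to its bin centre."""
    return [((t + 0.5) / n_levels) * 2 - 1 for t in tokens]

# A short sine-wave "clip" at 8 kHz
clip = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]
tokens = tokenize(clip)
recon = detokenize(tokens)
# Tokenization is lossy, but the waveform's shape survives:
# the worst-case error is half a quantization bin.
err = max(abs(a - b) for a, b in zip(clip, recon))
```

The point of the sketch is the trade-off the article describes: the tokens discard fine detail (here, anything below the bin width) while keeping enough information that the sound can be reconstructed and modeled.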
To generate the audio, a few seconds of sound are input into AudioLM, which then predicts what comes next. The process is similar to the way language models such as GPT-3 predict which phrases and words usually follow one another.
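The prediction step can be illustrated with a deliberately tiny model. The sketch below trains a bigram counter over a token sequence and then continues a short prompt autoregressively, appending the most likely next token at each step. This is an assumption-laden toy: AudioLM uses a large Transformer, not bigram counts, but the continuation loop is the same idea.

```python
from collections import Counter, defaultdict

def train_bigram(token_seq):
    """Count which token most often follows each token --
    a toy stand-in for the Transformer AudioLM actually uses."""
    follow = defaultdict(Counter)
    for a, b in zip(token_seq, token_seq[1:]):
        follow[a][b] += 1
    return follow

def continue_tokens(follow, prompt, n_new):
    """Autoregressive continuation: repeatedly append the most likely
    next token, just as GPT-3 appends the most likely next word."""
    seq = list(prompt)
    for _ in range(n_new):
        counts = follow.get(seq[-1])
        if not counts:
            break                       # nothing learned for this token
        seq.append(counts.most_common(1)[0][0])
    return seq

# Train on a repeating "melody" of tokens, then continue a short prompt.
melody = [1, 2, 3, 4] * 20
model = train_bigram(melody)
out = continue_tokens(model, [1, 2], 6)
# → [1, 2, 3, 4, 1, 2, 3, 4]
```

Because the training sequence repeats, the model picks the pattern back up from the two-token prompt, which mirrors how AudioLM continues the style of a few seconds of prompt audio.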
The audio clips released by the team sound remarkably natural. Piano music generated with AudioLM, in particular, sounds more fluid than piano music generated with existing AI techniques, which often sounds chaotic.
Roger Dannenberg, who studies computer-generated music at Carnegie Mellon University, says AudioLM already has much better sound quality than previous music generation programs. In particular, he says, AudioLM is surprisingly good at recreating some of the repeating patterns inherent in human-made music. To generate realistic piano music, AudioLM has to capture the subtle vibrations in each note produced when piano keys are struck, and the music must also sustain its rhythms and harmonies over time.
“That’s really impressive, partly because it indicates that they are learning a certain structure at multiple levels,” Dannenberg says.
AudioLM is not limited to music. Because it is trained on a library of recordings of people uttering sentences, the system can also generate speech that continues in the accent and cadence of the original speaker, although at this point those sentences can come across as non sequiturs that carry no real meaning. AudioLM learns which types of sound clips often occur together and runs that process in reverse to produce sentences. It also has the advantage of learning the pauses and exclamations that are inherent in spoken language but not easily rendered in text.
Rupal Patel, who studies information science and speech at Northeastern University, says previous work using AI to generate audio could capture those nuances only if they were explicitly annotated in the training data. AudioLM, by contrast, learns those characteristics automatically from the input data, which adds to the realistic effect.
“There’s a lot of what we might call paralinguistic information that isn’t in the words you say but is another way of communicating, based on the way you say things, to express a specific intention or emotion,” says Neil Zeghidour, a co-creator of AudioLM. For example, someone may laugh after saying something to indicate that it was a joke. “All of that makes speech natural,” he says.
Ultimately, AI-generated music could be used to provide more natural-sounding background soundtracks for videos and slideshows. Speech-generation technology that sounds more natural could help improve Internet accessibility tools and bots that work in healthcare facilities, Patel says. The team also hopes to create more sophisticated sounds, such as a band with different instruments or sounds that mimic a recording of a tropical rainforest.
However, the ethical implications of the technology must be considered, Patel says. It’s especially important to determine whether the musicians producing the clips used as training data receive attribution or royalties from the final product — a problem that has surfaced with text-to-image AIs. AI-generated speech that is indistinguishable from the real thing can also become so persuasive that it facilitates the spread of misinformation.
In the paper, the researchers write that they are already considering and working on these issues, for example by developing techniques to distinguish natural sounds from sounds produced with AudioLM. Patel also suggests incorporating audio watermarks into AI-generated products to make them easier to tell apart from natural audio.