While the effect is rather crude, the system offers an early glimpse of what’s to come for generative artificial intelligence, and it’s the next obvious step from the text-to-image AI systems that caused huge excitement this year.
Meta’s announcement of Make-A-Video, which is not yet available to the public, will likely prompt other AI labs to release their own versions. It also raises some big ethical questions.
In the past month alone, AI lab OpenAI made its latest text-to-image AI system, DALL-E, available to everyone, and AI startup Stability AI launched Stable Diffusion, an open-source text-to-image system.
But text-to-video AI poses even greater challenges. First, these models require an enormous amount of computing power. They demand even more computing power than large text-to-image AI models, which train on millions of images, because putting together just one short video requires hundreds of frames. That means it’s really only big technology companies that can afford to build these systems in the near future. They are also more difficult to train, because there are no large-scale datasets of high-quality videos paired with text.
To get around this, Meta combined data from three open-source image and video datasets to train its model. Standard text-image datasets of labeled still images helped the AI learn what objects are called and what they look like. And a database of videos helped it learn how those objects should move in the world. The combination of the two approaches helped Make-A-Video, which is described in a non-peer-reviewed paper published today, generate videos from text at scale.
Tanmay Gupta, a computer vision researcher at the Allen Institute for Artificial Intelligence, says Meta’s results are promising. The videos being shared show that the model can capture 3D shapes as the camera rotates. The model also has some sense of depth and understanding of lighting. Gupta says some of the details and movements are neatly rendered and convincing.
“However, there is plenty of room for the research community to improve, especially if these systems are to be used for video editing and professional content creation,” he adds. In particular, it is still difficult to model complex interactions between objects.
In the video generated from the prompt “An artist’s brush on a canvas,” the brush moves across the canvas, but the strokes it leaves are not realistic. “I would like to see these models succeed in generating a sequence of interactions such as ‘The man takes a book from the shelf, puts on his glasses, and sits down to read it while having a cup of coffee,’” Gupta says.