Meta unveils an AI that generates video based on text prompts

While the results are rather rudimentary, the system offers a glimpse of what’s to come for generative AI, and it’s the obvious next step from the text-to-image AI systems that have caused great excitement this year.

Meta’s announcement of Make-A-Video, which is not yet available to the public, will likely prompt other AI labs to release their own versions. It also raises some big ethical questions.

Just last month, AI lab OpenAI made its latest text-to-image AI system, DALL-E, available to everyone, and AI startup Stability.AI launched Stable Diffusion, an open-source text-to-image system.

But text-to-video AI comes with some bigger challenges. First, these models require a large amount of computing power. They carry an even greater computational cost than large text-to-image AI models, which use millions of images for training, because stitching together even a short video requires hundreds of images. That means only big tech companies can afford to build these systems for the foreseeable future. They are also harder to train, because there is no large-scale dataset of high-quality video paired with text.

To work around this, Meta combined data from three open-source image and video datasets to train its model. Standard datasets of labeled still images helped the AI learn what objects are called and what they look like, while a database of videos helped it learn how those objects are supposed to move through the world. The combination of the two approaches helped Make-A-Video, described in a non-peer-reviewed paper published today, generate videos from text at scale.

Tanmay Gupta, a computer vision scientist at the Allen Institute for Artificial Intelligence, says Meta’s results are promising. The videos it shared show that the model can capture 3D shapes as the camera rotates, and that it has some notion of depth and understanding of lighting. Gupta says some details and movements are rendered in a sophisticated, convincing way.

However, “there is a lot of room for the research community to improve, especially if these systems are to be used for professional video editing and content creation,” he adds. In particular, modeling complex interactions between objects remains difficult.

In the video generated by the prompt “An artist’s brush painting on canvas,” the brush moves across the canvas, but the strokes it leaves are not realistic. “I would love to see these models succeed at generating a sequence of interactions, such as ‘The man picks up a book from the shelf, puts on his glasses, and sits down to read it while drinking a cup of coffee,’” Gupta says.
