Combining vision and language could be the key to more capable AI – TechCrunch

Depending on the theory of intelligence you subscribe to, achieving “human-level” AI will require a system that can leverage multiple modalities (e.g., sound, vision, and text) to reason about the world. For example, when shown an image of an overturned truck and a police cruiser on a snowy highway, a human-level AI might infer that dangerous road conditions caused the accident. Or, running on a robot, when asked to fetch a can of soda from the fridge, it would navigate around people, furniture, and pets to retrieve the can and place it within reach of the requester.

Today’s AI falls short. But new research shows encouraging signs of progress, from robots that can figure out the steps needed to satisfy basic commands (e.g., “get a bottle of water”) to text-generating systems that learn from explanations. In this revived edition of Deep Science, our weekly series on the latest developments in AI and the broader scientific field, we’ll cover work out of DeepMind, Google, and OpenAI that makes strides toward systems that can, if not fully understand the world, tackle narrow tasks like generating images with impressive robustness.

OpenAI’s improved DALL-E, DALL-E 2, is easily the most impressive project to emerge from the depths of an AI research lab. As my colleague Devin Coldewey writes, while the original DALL-E showed a remarkable skill for creating images to match almost any prompt (e.g., “a dog wearing a beret”), DALL-E 2 goes even further. The images it produces are much more detailed, and DALL-E 2 can intelligently replace a given area in an image, for example inserting a table into a photo of a marble floor, complete with the appropriate reflections.


An example of the types of images that DALL-E 2 can produce.

DALL-E 2 got most of the attention this week. But on Thursday, researchers at Google detailed an equally impressive visual understanding system called Visually-Driven Prosody for Text-to-Speech (VDTTS) in a post on Google’s AI blog. VDTTS can generate realistic-sounding, lip-synced speech given nothing more than text and video frames of the person talking.

The speech VDTTS generates, while not a perfect stand-in for recorded dialogue, is still quite good, with convincingly human-like expressiveness and timing. Google sees it one day being used in the studio to replace original audio that might have been recorded in noisy conditions.

Of course, visual understanding is only one step on the path to more capable AI. Another component is language understanding, which lags in many respects, even setting aside AI’s well-documented toxicity and bias problems. In one stark example, a cutting-edge Google system, Pathways Language Model (PaLM), memorized 40% of the data used to “train” it, according to one paper, resulting in PaLM plagiarizing text down to the copyright notices in code snippets.

Fortunately, DeepMind, the AI lab backed by Alphabet, is among those exploring techniques to tackle this problem. In a new study, DeepMind researchers investigate whether AI language systems, which learn to generate text from many examples of existing text (think books and social media), can benefit from being given explanations of those texts. After annotating dozens of language tasks (e.g., “Answer these questions by determining whether the second sentence is an appropriate paraphrase of the first sentence”) with explanations (e.g., “David’s eyes are not literal daggers, it is a metaphor used to imply that David is glaring furiously at Paul.”) and evaluating the performance of different systems on them, the DeepMind team found that the explained examples did improve the systems’ performance.
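To make the idea concrete, here is a minimal sketch of how a task example might be paired with an explanation before being handed to a language system. The prompt layout, the `Example` class, the `build_prompt` function, and the question text are all hypothetical illustrations; only the task instruction and the explanation sentence come from the examples quoted above, and nothing here is taken from DeepMind's actual setup.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One worked task example (hypothetical structure for this sketch)."""
    question: str
    answer: str
    explanation: str


def build_prompt(instruction: str, examples: list[Example], query: str,
                 with_explanations: bool = True) -> str:
    """Assemble a text prompt: instruction, worked examples, then the new question."""
    parts = [instruction, ""]
    for ex in examples:
        parts.append(f"Q: {ex.question}")
        parts.append(f"A: {ex.answer}")
        if with_explanations:
            # The explanation follows the answer; this placement is an assumption.
            parts.append(f"Explanation: {ex.explanation}")
        parts.append("")
    parts.append(f"Q: {query}")
    parts.append("A:")
    return "\n".join(parts)


instruction = ("Answer these questions by determining whether the second sentence "
               "is an appropriate paraphrase of the first sentence")
examples = [
    Example(
        # The question wording is invented for illustration.
        question="David's eyes were daggers. / David was glaring furiously at Paul.",
        answer="yes",
        explanation=("David's eyes are not literal daggers, it is a metaphor used to "
                     "imply that David is glaring furiously at Paul."),
    )
]
query = "Her voice was velvet. / Her voice was smooth."  # invented query

with_expl = build_prompt(instruction, examples, query, with_explanations=True)
without_expl = build_prompt(instruction, examples, query, with_explanations=False)
print(with_expl)
```

Building the same prompt with and without explanations gives the two conditions one would compare to check whether the explanations help.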

DeepMind’s approach, if it passes muster within the academic community, could one day be applied in robotics, forming the building blocks of a robot that can understand vague requests (e.g., “take out the trash”) without step-by-step instructions. Google’s new “Do As I Can, Not As I Say” project offers a glimpse into this future, albeit with significant limitations.

A collaboration between Robotics at Google and the Everyday Robotics team at Alphabet’s X lab, Do As I Can, Not As I Say seeks to condition an AI language system to propose actions that are “feasible” and “contextually appropriate” for a robot, given the task. The robot acts as the “hands and eyes” of the language system while the system supplies high-level semantic knowledge about the task, the theory being that the language system encodes a wealth of knowledge useful to the robot.

Google's robot

Image credits: Robotics at Google

A system called SayCan chooses which skill the robot should perform in response to a command, factoring in (1) the probability that a given skill is useful and (2) the likelihood of successfully performing that skill. For example, in response to someone saying “I spilled my coke, can you bring me something to clean it up?” SayCan can direct the robot to find a sponge, pick up the sponge, and bring it to the person who asked for it.
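The two-factor selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the article says both factors are taken into account, and this sketch assumes they are combined by multiplication. The `SkillScore` class, the function names, and every number are hypothetical and not taken from the project.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SkillScore:
    """Scores for one candidate skill (hypothetical structure for this sketch)."""
    name: str
    usefulness: float   # estimate that the skill helps with the command
    feasibility: float  # estimate that the robot can complete the skill now


def combined_score(skill: SkillScore) -> float:
    """Combine the two factors; multiplying them is an assumption of this sketch."""
    return skill.usefulness * skill.feasibility


def choose_skill(candidates: list[SkillScore]) -> SkillScore:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("no candidate skills to choose from")
    return max(candidates, key=combined_score)


# Made-up scores for the spilled-drink request.
candidates = [
    SkillScore("find a sponge", usefulness=0.6, feasibility=0.9),
    SkillScore("find a vacuum", usefulness=0.7, feasibility=0.1),
    SkillScore("go to the table", usefulness=0.1, feasibility=0.95),
]
best = choose_skill(candidates)
print(best.name)
```

With these made-up numbers, the vacuum looks slightly more useful on its own, but its low feasibility drags its combined score below the sponge's, so the sponge is selected. That is the point of weighing both factors rather than usefulness alone.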

SayCan is limited by the robot’s hardware: on more than one occasion, the team observed the robot they chose for the experiments accidentally dropping objects. Still, it, along with DALL-E 2 and DeepMind’s work in contextual understanding, is an illustration of how AI systems, when combined, can bring us that much closer to a Jetsons-type future.
