Last week, at Google’s annual conference dedicated to new products and technologies, the company announced a change to its premier AI product: The Bard chatbot, like OpenAI’s GPT-4, will soon be able to describe images. Although it may seem like a minor update, the enhancement is part of a quiet revolution in how companies, researchers, and consumers develop and use AI—pushing the technology not only beyond remixing written language and into different media, but toward the loftier goal of a rich and thorough comprehension of the world. ChatGPT is six months old, and it’s already starting to look outdated.
That program and its cousins, known as large language models, mime intelligence by predicting what words are statistically likely to follow one another in a sentence. Researchers have trained these models on ever more text—at this point, every book ever and then some—with the premise that force-feeding machines more words in different configurations will yield better predictions and smarter programs. This text-maximalist approach to AI development has been dominant, especially among the most public-facing corporate products, for years.
But language-only models such as the original ChatGPT are now giving way to machines that can also process images, audio, and even sensory data from robots. The new approach might reflect a more human understanding of intelligence, an early attempt to approximate how a child learns by existing in and observing the world. It might also help companies build AI that can do more stuff and therefore be packaged into more products.
GPT-4 and Bard are not the only programs with these expanded capabilities. Also last week, Meta released a program called ImageBind that processes text, images, audio, information about depth, infrared radiation, and information about motion and position. Google’s recent PaLM-E was trained on both language and robot sensory data, and the company has teased a new, more powerful model that moves beyond text. Microsoft has its own model, which was trained on words and images. Text-to-image generators such as DALL-E 2, which captivated the internet last summer, are trained on captioned pictures.
These are known as multimodal models—text is one modality, images another—and many researchers hope they will bring AI to new heights. The grandest future is one in which AI isn’t limited to writing formulaic essays and assisting people in Slack; it would be able to search the internet without making things up, animate a video, guide a robot, or create a website on its own (as GPT-4 did in a demonstration, based on a loose concept sketched by a human).
A multimodal approach could theoretically solve a central problem with language-only models: Even if they can fluently string words together, they struggle to connect those words to concepts, ideas, objects, or events. “When they talk about a traffic jam, they don’t have any experience of traffic jams beyond what they’ve associated with it from other pieces of language,” Melanie Mitchell, an AI researcher and a cognitive scientist at the Santa Fe Institute, told me—but if an AI’s training data could include videos of traffic jams, “there’s a lot more information that they can glean.” Learning from more types of data could help AI models envision and interact with physical environments, develop something approaching common sense, and even address problems with fabrication. If a model understands the world, it might be less likely to invent things about it.
The push for multimodal models is not entirely new; Google, Facebook, and others introduced automated image-captioning systems nearly a decade ago. But a few key changes in AI research have made cross-domain approaches more possible and promising in the past few years, Jing Yu Koh, who studies multimodal AI at Carnegie Mellon, told me. Whereas for decades, computer-science fields such as natural-language processing, computer vision, and robotics used extremely different methods, now they all use a programming method called “deep learning.” As a result, their code and approaches have become more similar, and their models are easier to integrate into one another. And internet giants such as Google and Facebook have curated ever-larger data sets of images and videos, and computers are becoming powerful enough to handle them.
There’s a practical reason for the change too. The internet, no matter how incomprehensibly large it may seem, contains a finite amount of text for AI to be trained on. And there’s a realistic limit to how big and unwieldy these programs can get, as well as how much computing power they can use, Daniel Fried, a computer scientist at Carnegie Mellon, told me. Researchers are “starting to move beyond text to hopefully make models more capable with the data that they can collect.” Indeed, Sam Altman, OpenAI’s CEO and, thanks in part to this week’s Senate testimony, a kind of poster boy for the industry, has said that the era of scaling text-based models is likely over—only months after ChatGPT reportedly became the fastest-growing consumer app in history.
How much better multimodal AI will understand the world than ChatGPT, and how much more fluent its language will be, if at all, is up for debate. Although many multimodal models outperform language-only programs, especially in tasks involving images and 3-D scenarios, such as describing photos and envisioning the outcome of a sentence, in other domains they have not been as stellar. In the technical report accompanying GPT-4, researchers at OpenAI reported almost no improvement on standardized-test performance when they added vision. The model also continues to hallucinate, confidently making false statements that are absurd, subtly wrong, or just plain despicable. Google’s PaLM-E actually did worse on language tasks than the language-only PaLM model, perhaps because adding robot sensory information came at the cost of some language in its training data and abilities. Still, such research is in its early phases, Fried said, and could improve in years to come.
We remain far from anything that would truly emulate how people think. “Whether these models are going to reach human-level intelligence—I think that’s not likely, given the kinds of architectures that they use right now,” Mitchell told me. Even if a program such as Meta’s ImageBind can process images and sound, humans also learn by interacting with other people, have long-term memory and grow from experience, and are the products of millions of years of evolution—to name only a few ways artificial and organic intelligence don’t align.
And just as throwing more textual data at AI models didn’t solve long-standing problems with bias and fabrication, throwing more types of data at the machines won’t necessarily do so either. A program that ingests not only biased text but also biased images will still produce harmful outputs, just across more media. Text-to-image models like Stable Diffusion, for instance, have been shown to perpetuate racist and sexist biases, such as associating Black faces with the word thug. Opaque infrastructures and training data sets make it hard to regulate and audit the software; the possibility of labor and copyright violations might only grow as AI has to vacuum up even more types of data.
Multimodal AI might even be more susceptible to certain kinds of manipulations, such as altering key pixels in an image, than models proficient only in language, Mitchell said. Some form of fabrication will likely continue, and perhaps be even more convincing and dangerous because the hallucinations will be visual—imagine AI conjuring a scandal on the scale of fake images of Donald Trump’s arrest. “I don’t think multimodality is a silver bullet or anything for many of these issues,” Koh said.
Intelligence aside, multimodal AI might just be a better business proposition. Language models are already a gold rush for Silicon Valley: Before the corporate boom in multimodality, OpenAI reportedly expected $1 billion in revenue by 2024; multiple recent analyses predicted that ChatGPT will add tens of billions of dollars to Microsoft’s annual revenue in a few years.
Going multimodal could be like searching for El Dorado. Such programs will simply offer customers more than the plain, text-only ChatGPT does: describing images and videos, interpreting or even producing diagrams, serving as more useful personal assistants, and so on. Multimodal AI could help consultants and venture capitalists make better slide decks, improve existing but spotty software that describes images and the environment to visually impaired people, speed the processing of onerous electronic health records, and guide us along streets not as a map would, but by observing the buildings around us.
Applications to robotics, self-driving cars, medicine, and more are easy to conjure, even if they never materialize, like a golden city that, even if it proves mythical, still justifies conquest. Multimodality will not need to produce clearly more intelligent machines to take hold. It just needs to make machines that appear more profitable.