
The Prompt: A historic year for AI ends with a momentous multimodal moment

December 20, 2023
Philip Moyer

Global VP, AI & Business Solutions at Google Cloud

Business leaders are buzzing about generative AI. To help you keep up with this fast-moving, transformative topic, “The Prompt” brings you our ongoing observations from our work with customers and partners, as well as the newest AI happenings at Google. In this edition, Phil Moyer, global vice president for AI & Business Solutions at Google Cloud, reflects on one of the biggest years for enterprise tech in memory and considers what to expect in the new year (hint: it could be just as big).

As I look back at the start of the year, it’s hard to believe how much has happened since generative AI first exploded into the mainstream. This month alone, Google launched Gemini, the AI Hypercomputer, and Duet AI for Developers (now generally available), the latest among dozens of gen AI products and hundreds of gen AI updates we released in 2023. The pace is frankly astonishing.

This accelerated innovation is everywhere. At Google Cloud, the number of active gen AI projects on Vertex AI has grown more than 7X. Gemini is already supercharging the Vertex AI platform, giving developers the power to build sophisticated AI agents, and it’s coming soon to our Duet AI portfolio, so customers have AI assistance whenever and wherever they need it. There’s also been an explosion of activity in the open-source gen AI world along with many outstanding models from organizations across the industry — it’s truly an exciting time.

This year, much of the attention has been on novel consumer applications and clever experiments at larger enterprises. With multimodal models like Gemini, we expect to see more serious and significant advances across industries.

What’s more, we started 2023 with most models confined to their training data — but now, we have robust solutions to fine-tune models and connect them to external and proprietary sources, letting organizations apply the intelligence of AI models across their data. From enabling question-answer chatbots that span a company’s enterprise data to synthesizing and analyzing diverse information, these capabilities are enabling remarkable use cases.

Not to sound hyperbolic, but the first few times I used Gemini felt like that magical “Eureka” moment. And I can’t wait for everyone else to have theirs. This is the moment where more and more leaders will not only see new uses for gen AI — they’ll start using it themselves for nearly everything.

Multimodality unlocks advanced reasoning

Gemini was built from the ground up to be multimodal, which means it can generalize, understand, operate across, and combine many different types of information simultaneously — whether it’s text, code, audio, image, or video.

For instance, you can ask Gemini: “Over the past five years, what was this bank or this online retailer’s cash dividend payout ratio?”

A payout ratio shows the proportion of a company’s total earnings that is paid to shareholders as dividends. To provide an answer, a model needs to understand the different definitions of cash, cash equivalents, and dividends and be able to apply them within the mathematical concept of ratios. It also needs to accurately retrieve financial information from outside systems for the last five years and access other AI models to calculate the ratio.
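
To make the arithmetic behind that question concrete, here is a minimal Python sketch of the ratio itself, assuming the common definition of cash dividends paid divided by net income. The `AnnualFigures` type, the function names, and every figure below are hypothetical and for illustration only. The sketch covers just the final calculation; it says nothing about how Gemini retrieves the data or performs the computation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class AnnualFigures:
    """One fiscal year of (hypothetical) reported figures, in any single currency."""
    year: int
    net_income: float       # total earnings for the year
    cash_dividends: float   # cash dividends paid to shareholders


def payout_ratio(figures: AnnualFigures) -> float:
    """Return cash dividends as a fraction of net income for one year."""
    if figures.net_income <= 0:
        # The ratio is not meaningful when earnings are zero or negative.
        raise ValueError(f"net income must be positive for {figures.year}")
    return figures.cash_dividends / figures.net_income


def payout_ratios(history: Iterable[AnnualFigures]) -> Dict[int, float]:
    """Map each year in the history to its payout ratio."""
    return {f.year: payout_ratio(f) for f in history}


if __name__ == "__main__":
    # Made-up numbers for a five-year window; not data from any real company.
    history = [
        AnnualFigures(2019, net_income=100.0, cash_dividends=40.0),
        AnnualFigures(2020, net_income=80.0, cash_dividends=40.0),
        AnnualFigures(2021, net_income=120.0, cash_dividends=42.0),
        AnnualFigures(2022, net_income=150.0, cash_dividends=45.0),
        AnnualFigures(2023, net_income=160.0, cash_dividends=48.0),
    ]
    for year, ratio in payout_ratios(history).items():
        print(f"{year}: {ratio:.1%}")
```

The division is the easy part. The harder work described above is deciding which reported line items count as cash dividends and earnings, and retrieving them reliably for each of the five years.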


Highlights from our Applied AI Summit keynote.

Multimodality is the difference between models that can predict the next word (or words) in a sentence and more sophisticated models that both understand and act on information across different data types. To answer the question above, a model not only has to understand a question but also distinguish mathematical concepts like equations and retrieve the specific elements needed — two things that weren’t possible less than a year ago.

Models like Gemini indicate that we’re about to enter an entirely new era of gen AI that will take us closer to true language understanding, where systems can synthesize across many different types of data and create even more business value across industries.

It also means the applications across domains and real-world environments are that much stronger, since models like Gemini can tackle so many more situations. Gemini Nano, our mobile-sized model that can operate on-device, creates powerful opportunities to run AI at the edge, meaning data can be analyzed securely and acted on faster, even with limited connectivity. These mobile-first models can enhance tasks as diverse as emergency services, mobile banking, or augmented gaming.

Blending information to solve real-world problems

Multimodal capabilities also offer organizations new ways to merge different types of data to tackle challenges in the physical world. Many industries face unstructured, unexpected problems that may not be possible to solve through a single mode of analysis or limited data sources.

For instance, improving safety on construction sites requires analyzing and combining many different types of information. A company might have visual data like video feeds or images, incident reports from construction sites, or other types of data like financial costs or timeline delays. Multimodal gen AI models can help blend together all of this information and understand where, when, and how accidents are most likely to occur and create safer, more efficient approaches.

Or consider an airline mechanic trying to evaluate an engine that’s making an unusual sound when accelerating. The mechanic could take a video with sound and then describe a few other details by voice. A gen AI app could consider all of these modalities together and retrieve relevant information from that specific aircraft’s technical handbook, helping the mechanic quickly identify the problem and work out how to fix it.

A single model designed to work with modalities in tandem, rather than multiple models for each type of information bolted together, will make gen AI far more capable of achieving the kind of multi-sensory reasoning that humans take for granted. These advancements create more potential for building gen AI agents that can take actions toward a goal, such as looking up information, completing transactions, and even making decisions.

Already, we’ve seen tremendous creativity when it comes to building AI agents that deliver new business value.

Priceline is building a digital concierge that can help travelers research locations, plan complex itineraries, and even book trips directly from a conversational chat experience. Six Flags is creating a new virtual assistant that will guide park visitors in planning their day, answer questions, and even offer personalized recommendations about rides, entertainment, dining, and more. Formula E launched a conversational agent that can analyze live telemetry data from sensors on race cars along with historical records from past championships and answer questions from drivers and fans during live events.

Imagine the new use cases available to organizations with agents that have the ability to retrieve, understand, and act on different forms of real-world and enterprise data.

The next level of gen AI use cases is here

As gen AI models continue to add more modalities, the potential AI use cases are also expanding and evolving. Gen AI apps are likely to transform from basic response generation into complex gen AI systems and virtual assistants that can do everything from analyzing contracts and financial data, to aggregating huge bodies of research, to delivering completed drafts of content.

In the coming year, I expect to see organizations implementing regular development cycles, delivering frequent releases of new gen AI capabilities and improvements to increase performance. In addition, the ability to deploy gen AI on the edge is coming to mobile developers, through models like Gemini Nano, with the potential to completely transform the devices where we can access gen AI and bring an entirely new set of business use cases to explore.

Some of the most interesting multimodal use cases on the horizon include:

  • Insurance claims and assessments: Using gen AI models to analyze mobile phone video footage or images to assess damage in a collision.
  • Asset surveillance: Auditing building records and conducting risk-based inspections to ensure that the quality of physical assets is maintained.
  • Disease diagnosis: Analyzing lab tests, medical images, and patient data to help clinicians spot health conditions and predict potential disease progression.
  • Financial fraud detection: Incorporating real-time knowledge about fraudulent methods and activity to improve fraud detection, such as recognizing counterfeit bills, fraudulent transactions, and other under-the-radar anomalies.
  • Contract compliance: Assessing retail shelf spaces and physical merchandising to ensure that shelf placements and promotions reflect retail contracts with suppliers.
  • Analyzing purchase or transaction history: Letting customers or organizations use natural language queries to get answers about banking activity and spending habits.
  • Document summarization and search: Analyzing earnings calls and press releases to identify key takeaways and predict analyst questions.

Gemini has shown us the way forward, and it’s up to business leaders and boards to seize these opportunities to reinvent their organizations with AI.

This year was the inflection point that made the future possible, but the real excitement and transformation are yet to come. Those are the breakthroughs I’m really looking forward to — and I’m looking forward to Google Cloud being an important partner in getting them built. If you want to learn more about our latest gen AI advancements and how we’re helping developers meet the next frontier, check out these highlights from the Applied AI Summit.

Opening image created with Midjourney, running on Google Cloud, using the prompt "a happier conceptual illustration in a magazine style that conveys a blending of image, text, video, code, and audio."
