Google has announced its latest AI model, Gemini, which was built to be multimodal so that it could interpret information in multiple formats, spanning text, code, audio, image, and video.
According to Google, the typical approach to creating a multimodal model involves training components for different information formats separately and then combining them. What sets Gemini apart is that it was trained from the start on different formats and then fine-tuned with additional multimodal data.
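To make the distinction concrete, here is a minimal, hypothetical sketch in PyTorch contrasting the two approaches the company describes. It is not Google's actual architecture; the module names, dimensions, and layer choices are illustrative assumptions only.

```python
# Hypothetical sketch (not Google's architecture): late-fusion "stitching" of
# separately trained encoders vs. a single model trained on interleaved
# multimodal tokens from the start.
import torch
import torch.nn as nn

D = 64  # shared embedding width, arbitrary for illustration


class StitchedMultimodal(nn.Module):
    """Typical approach: per-modality encoders trained separately, fused afterwards."""

    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(128, D)   # stands in for a pretrained text model
        self.image_encoder = nn.Linear(256, D)  # stands in for a pretrained vision model
        self.fusion = nn.Linear(2 * D, D)       # late-fusion layer bolted on at the end

    def forward(self, text_feats, image_feats):
        t = self.text_encoder(text_feats)
        v = self.image_encoder(image_feats)
        return self.fusion(torch.cat([t, v], dim=-1))


class NativelyMultimodal(nn.Module):
    """Gemini-style idea: one backbone trained on a mixed sequence of modality tokens."""

    def __init__(self):
        super().__init__()
        self.token_proj = nn.Linear(128, D)  # every modality mapped into one token space
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, mixed_tokens):
        # mixed_tokens: one interleaved sequence of text/image/audio tokens, shape (B, T, 128)
        return self.backbone(self.token_proj(mixed_tokens))


# Toy usage: a batch of 2 sequences, each with 10 interleaved modality tokens.
out = NativelyMultimodal()(torch.randn(2, 10, 128))
print(out.shape)  # torch.Size([2, 10, 64])
```

The design point the sketch tries to capture is that in the second model there is no seam between modalities: every input is a token in one shared sequence, so cross-modal reasoning is learned during pre-training rather than added through a fusion layer afterwards.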
“This helps Gemini seamlessly understand and reason about all kinds of inputs from the ground up, far better than existing multimodal models — and its capabilities are state of the art in nearly every domain,” Sundar Pichai, CEO of Google and Alphabet, and Demis Hassabis, CEO and co-founder of Google DeepMind, wrote in a blog post.
Google also explained that the new model has sophisticated reasoning capabilities that allow it to understand complex written and visual information, making it “uniquely skilled at uncovering knowledge that can be difficult to discern amid vast amounts of data.”
For example, it can read through hundreds of thousands of documents and extract insights that lead to new breakthroughs in certain fields.
Its multimodal nature also makes it particularly suited to understanding and answering questions in complex fields like math and physics.
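The announcement itself does not cover developer access, but as an illustration of that kind of multimodal question answering, the following is a minimal sketch assuming the google-generativeai Python SDK, an API key, and a "gemini-pro-vision" model name; the file name and prompt are placeholders.

```python
# Minimal sketch, assuming the google-generativeai Python SDK; model name,
# file path, and API key are assumptions, not part of the announcement.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

# A photo of a handwritten physics problem plus a text question in one prompt.
problem_photo = Image.open("pulley_problem.jpg")  # hypothetical local file
model = genai.GenerativeModel("gemini-pro-vision")

response = model.generate_content([
    problem_photo,
    "Identify the mistake in this worked solution and explain the correct "
    "application of Newton's second law, step by step.",
])
print(response.text)
```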