Cohere has added multimodal embeddings to its Embed 3 search model, allowing users to include images in RAG-style enterprise searches.
Embed 3, introduced last year, is an embedding model that converts data into numerical representations. Embeddings have become central to retrieval-augmented generation (RAG) because companies can create embeddings of their documents that the model can compare against a query to retrieve the information requested in the prompt.
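For illustration, here is a minimal sketch of that embed-and-compare pattern. The embed() function is a hypothetical, toy stand-in for any embedding model (such as Embed 3), not Cohere's actual API; a real deployment would call the model provider and store the vectors in a vector database.

```python
# Minimal sketch of the embed-and-compare pattern behind RAG retrieval.
# embed() is a toy placeholder, not Cohere's API; a real system would call
# an embedding model and keep the document vectors in a vector database.
import numpy as np

DIM = 64  # toy vector dimensionality


def embed(text: str) -> np.ndarray:
    """Toy embedding: hashes words into a small vector so the example runs."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    return vec


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    # Embed every document once (in practice these vectors are precomputed).
    doc_vectors = [embed(doc) for doc in documents]
    query_vector = embed(query)
    # Rank documents by similarity to the query; the top matches are then
    # passed to a generative model as context.
    scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```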
Search can now see.

We’re excited to release fully multimodal Embed for folks to start building with. pic.twitter.com/Zdj70B07zJ
— Aidan Gomez (@aidangomez) October 22, 2024
The new multimodal version can generate embeddings for both images and text. Cohere claims Embed 3 is “the most generally capable multimodal embedding model on the market today.” Aidan Gomez, co-founder and CEO of Cohere, posted a chart on X showing the performance improvements Embed 3 brings to image search.
The model’s image retrieval performance across different categories is very impressive. Significant increases in nearly every category considered. pic.twitter.com/6oZ3M6u0V0
— Aidan Gomez (@aidangomez) October 22, 2024
“This advancement will enable businesses to derive real value from the vast amounts of data stored in images,” Cohere said in a blog post. “Companies can now build systems that accurately and quickly search critical multimodal assets such as complex reports, product catalogs, and design files to increase employee productivity.”
Cohere said a stronger multimodal focus expands the amount of data that businesses can access through RAG searches. Many organizations limit RAG searches to structured and unstructured text, even though their data libraries contain many other file formats. Customers can now bring in more charts, graphs, product images, and design templates.
Improved performance
Cohere said Embed 3’s encoders “share a unified latent space,” allowing users to store both image and text embeddings in the same database. Some approaches to image embedding require maintaining separate databases for images and text. The company said its approach enables better mixed-modality searches.
According to the company, “Other models tend to cluster text and image data into separate regions, leading to weak search results that are biased toward text-only data. Embed 3, on the other hand, prioritizes the meaning behind the data.”
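To make the benefit of a shared latent space concrete, the sketch below keeps image and text vectors in a single index so one text query can surface assets of either kind. The embed_text() and embed_image() functions are hypothetical toy placeholders, not Cohere's SDK; a real multimodal model would map both inputs into the same semantic space.

```python
# Sketch of mixed-modality search over one shared vector index.
# embed_text() and embed_image() are toy placeholders standing in for a
# multimodal embedding model whose encoders share a unified latent space.
import numpy as np

DIM = 64  # toy vector dimensionality


def embed_text(text: str) -> np.ndarray:
    """Toy text embedding; a real model returns semantically meaningful vectors."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    return vec


def embed_image(image_path: str) -> np.ndarray:
    """Toy image embedding; a real multimodal model embeds pixels into the same space."""
    vec = np.zeros(DIM)
    vec[hash(image_path) % DIM] = 1.0
    return vec


# One index holds both modalities, because the vectors share a latent space;
# no separate image-only database is needed.
index = [
    ("q3_report.txt", "text", embed_text("Q3 revenue grew 12% year over year")),
    ("q3_chart.png", "image", embed_image("charts/q3_revenue.png")),
    ("catalog.pdf", "text", embed_text("Product catalog: industrial sensors")),
]


def search(query: str, top_k: int = 2):
    q = embed_text(query)
    scored = [
        (
            float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)),
            name,
            kind,
        )
        for name, kind, v in index
    ]
    # Results may be text documents or images, ranked in one pass.
    return sorted(scored, reverse=True)[:top_k]


print(search("revenue growth last quarter"))
```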
Embed 3 is available in over 100 languages.
Cohere said multimodal Embed 3 is now available on its platform and Amazon SageMaker.
Catching up
Thanks to the introduction of image-based search on platforms like Google and chat interfaces like ChatGPT, many consumers are rapidly becoming accustomed to multimodal search. As people grow used to finding information in photos, it’s natural that they would want the same experience at work.
Businesses are also beginning to recognize this benefit, and other embedding model providers offer several multimodal options. Some model developers, such as Google and OpenAI, provide some form of multimodal embedding, and open-source models can also facilitate embeddings of images and other modalities. The battle is now over which multimodal embedding models can run with the speed, accuracy, and security that enterprises demand.
Cohere was founded by some of the researchers behind the Transformer architecture (Gomez was one of the authors of the famous “Attention Is All You Need” paper), but it has had a hard time capturing attention in the enterprise space. The company updated its APIs in September to make it easier for customers to switch from competing models to Cohere’s models. At the time, Cohere said the move was to align with industry standards, where customers frequently switch between models.