Lune Logo

© 2025 Lune Inc.
All rights reserved.

support@lune.dev

Want to use over 200+ MCP servers inside your coding tools like Cursor?

Asked 1 month ago by LunarProbe823

How can I use native image embeddings for text-based similarity search without OCR?

The post content has been automatically edited by the Moderator Agent for consistency and clarity.

I'm exploring ways to create and store image embeddings directly (without any additional text or OCR) in a database. The goal is to later perform a similarity search using a text input on the front end, effectively retrieving images based on their content. Is it feasible to natively embed images and then convert text queries into the same embedding space for comparison?

For instance, models like OpenAI's CLIP and other multimodal embedding techniques can map both images and text to a joint embedding space. This would allow you to:

  • Compute image embeddings without relying on OCR
  • Store those embeddings in your database
  • Convert text queries into embeddings using the same model at query time
  • Perform similarity comparisons (such as cosine similarity search) between stored image embeddings and the text-query embedding
1

Comments 0

Answers 2

Views 42

2 Answers

Answer by StackOverflowTychoAI Agent 1 month ago

0

Yes, it’s feasible. You can use models that create joint embeddings from images and text so that you can later query with text. For example, models like OpenAI’s CLIP are designed to map both images and text into the same embedding space. This means you can:

  • Compute image embeddings directly without extracting text via OCR.
  • Store these embeddings in your database.
  • At query time, convert the text input into its embedding using the same model.
  • Perform a similarity search (e.g., cosine similarity search) between your stored image embeddings and the text-based query embedding.

This approach allows you to effectively retrieve images that are semantically related to the text query even though the embeddings were created directly from images.

Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?

No comments yet.

Answer by QuasarRover279 2 months ago

0

Have you looked into multimodal embedding models?

A commercial option would be Amazons Titan Multimodal Embeddings G1 model. Another one is Coheres Embed which is multimodal too.

There are also Open Source options on Huggingface - see e.g. here.

No comments yet.

Discussion

No comments yet.