Elasticsearch has recently added a data type called semantic_text, which automatically chunks text, calculates embeddings, and stores the chunks with sensible defaults.
Queries are similarly simplified: vectors are calculated and compared internally, which means a lot less I/O and much simpler client code.
How does their embedding model compare in terms of retrieval accuracy to, say, `text-embedding-3-small` and `text-embedding-3-large`?
binarymax 2 hours ago [-]
It’s impossible to answer that question without knowing what content/query domain you are embedding. Check out the MTEB leaderboard, dig into the retrieval benchmark, and look for analogous datasets.
3abiton 53 minutes ago [-]
So we're talking about maximizing the embedding model per use case? Medical data would require a different model than, say, sales data? Sounds like a very fragmented approach.
splike 2 hours ago [-]
You can use OpenAI embeddings in Elastic if you don't want to use their ELSER sparse embeddings.
keithwhor 29 minutes ago [-]
I agree.
Similar to the blog post, but instead of working at the extension layer, I built a PostgreSQL ORM for Node.js based on ActiveRecord + Django's ORM that includes the concept of vector fields [0][1] and lets you write code like this:
// Stores the `title` and `content` fields together as a vector
// in the `content_embedding` vector field
BlogPost.vectorizes(
  'content_embedding',
  (title, content) => `Title: ${title}\n\nBody: ${content}`
);

// Find the top 10 blog posts matching "blog posts about dogs"
// Automatically converts query to a vector
let searchBlogPosts = await BlogPost.query()
  .search('content_embedding', 'blog posts about dogs')
  .limit(10)
  .select();
I find it tremendously useful; you can query the underlying data or the embedding content, and you can define how the fields in the model get stored as embeddings in the first place.
Safe to say that if you're using off-the-shelf character-based chunking, your AI app is not past PoC.
bryantwolf 39 minutes ago [-]
Hey, this looks great! I'm a huge fan of vectors in Postgres or wherever your data lives, and this seems like a great abstraction.
When I write a sql query that includes a vector search and some piece of logic, like:
```
select name from users where age > 21 order by <vector_similarity(users.bio, "I like long walks on the beach")> limit 10;
```
Does it filter by age first or second? I've liked the DX of pgvector, but it does the vector search first, followed by filtering. It seems like that slows down what should be the superpower of a setup like this.
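To make the filter-first versus filter-second question concrete, here is a toy in-memory sketch. The rows, the 2-d vectors, and the helper names (`preFilter`, `postFilter`) are all made up for illustration, and it does not model how any real database plans the query. It only shows why the order matters: ranking first and filtering afterwards can return fewer rows than the limit.

```javascript
// Toy illustration only: in-memory rows with made-up 2-d "embeddings".
// This does not model a real query planner or vector index.
const users = [
  { name: 'ann', age: 19, vec: [1.0, 0.0] },
  { name: 'bob', age: 20, vec: [0.9, 0.1] },
  { name: 'cara', age: 35, vec: [0.6, 0.4] },
  { name: 'dev', age: 42, vec: [0.2, 0.8] },
  { name: 'eli', age: 50, vec: [0.0, 1.0] },
];

// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Exhaustive top-k by similarity to the query vector.
function nearest(rows, query, k) {
  return [...rows]
    .sort((x, y) => cosine(y.vec, query) - cosine(x.vec, query))
    .slice(0, k);
}

// Filter first, then rank: yields up to k rows that match the predicate.
function preFilter(rows, query, k, pred) {
  return nearest(rows.filter(pred), query, k);
}

// Rank first, then filter: may yield fewer than k rows.
function postFilter(rows, query, k, pred) {
  return nearest(rows, query, k).filter(pred);
}

const query = [1.0, 0.0];
const over21 = (u) => u.age > 21;

const pre = preFilter(users, query, 2, over21).map((u) => u.name);
const post = postFilter(users, query, 2, over21).map((u) => u.name);
console.log(pre, post);
```

In this toy data the two closest rows are both under 21, so the post-filter variant comes back empty while the pre-filter variant still returns two rows.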
Hey HN! Post co-author here, excited to share our new open-source PostgreSQL tool that re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed.
Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon.
Eager to hear your feedback and reactions. If you'd like to leave an issue or better yet a PR, you can do so here [1]
Pretty smart. Why is the DB API the abstraction layer though? Why not two columns and a microservice? I assume you are making async calls to get the embeddings?
I say that because it seems unusual. An index would suit sync better. But async things like embeddings, geocoding an address, checking whether an email belongs to a spammer, etc. feel like app-level stuff.
cevian 2 hours ago [-]
(post co-author here)
The DB is the right layer from an interface point of view -- because that's where the data properties should be defined. We also use the DB for bookkeeping what needs to be done because we can leverage transactions and triggers to make sure we never miss any data. From an implementation point of view, the actual embedding does happen outside the database in a Python worker or cloud functions.
Merging the embeddings and the original data into a single view allows the full feature set of SQL rather than being constrained by a REST API.
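The split described above (bookkeeping next to the data, embedding done by an outside worker) can be sketched roughly as follows. This is a hypothetical in-memory stand-in, not the project's implementation: `Map`s play the role of tables, `upsertRow`/`deleteRow` play the role of triggers, and `fakeEmbed` is a placeholder for a call to a real embedding provider.

```javascript
// Hypothetical sketch: Maps stand in for tables, a Set for the work queue,
// and fakeEmbed for a call to a real embedding provider.
const rows = new Map();       // id -> text (source data)
const embeddings = new Map(); // id -> vector (derived data)
const pending = new Set();    // ids whose embedding is missing or stale

// Plays the role of an insert/update trigger: the work item is recorded
// in the same step as the write, so no change is missed.
function upsertRow(id, text) {
  rows.set(id, text);
  pending.add(id);
}

// Plays the role of a delete trigger: derived data goes away with the row.
function deleteRow(id) {
  rows.delete(id);
  embeddings.delete(id);
  pending.delete(id);
}

// Placeholder embedder, NOT a real model: [character count, word count].
function fakeEmbed(text) {
  return [text.length, text.split(/\s+/).filter(Boolean).length];
}

// Plays the role of the external worker: drain the queue in batches.
function runWorker(batchSize = 10) {
  let processed = 0;
  for (const id of [...pending].slice(0, batchSize)) {
    embeddings.set(id, fakeEmbed(rows.get(id)));
    pending.delete(id);
    processed++;
  }
  return processed;
}

upsertRow(1, 'hello world');
upsertRow(2, 'vector search in postgres');
runWorker();
upsertRow(1, 'hello again, world'); // the update marks row 1 stale again
const staleBefore = [...pending];
runWorker();
console.log(staleBefore, embeddings.get(1));
```

In a real database the write and the queue entry would share a transaction, which is the guarantee the comment is pointing at; the sketch only mimics that by doing both in one function call.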
jdthedisciple 3 hours ago [-]
What's wrong with using FAISS as your single db?
It's like sqlite for vector embeddings, and you can store metadata (the primary data, foreign keys, etc) along with the vectors, preserving the relationship.
Not sure if the metadata is indexed but at least iirc it's more or less trivial to update the embeddings when your data changes (tho I haven't used it in a while so not sure).
avthar 2 hours ago [-]
Good q. For most standalone vector search use cases, FAISS or a library like it is good.
However, FAISS is not a database. It can store metadata alongside vectors, but it doesn't have things you'd want in your app db like ACID compliance, non-vector indexing, and proper backup/recovery mechanisms. You're basically giving up all the DBMS capabilities.
For new RAG and search apps, many teams prefer just using a single app db with vector search capabilities included (Postgres, Mongo, MySQL etc) vs managing an app db and a separate vector db.
dinobones 4 hours ago [-]
Wow, actually a good point I haven't seen anyone make.
Taking raw embeddings and then storing them into vector databases, would be like if you took raw n-grams of your text and put them into a database for search.
Storing documents makes much more sense.
choilive 3 hours ago [-]
Been using pgvector for a while, and to me it was kind of obvious that the source document and the embeddings are fundamentally linked so we always stored them "together". Basically anyone doing embeddings at scale is doing something similar to what Pgai Vectorizer is doing, and it is certainly a nice abstraction.
jdthedisciple 3 hours ago [-]
I used FAISS as it also allowed me to trivially store them together.
Idk how well it scales though, it's just doing its job on my hobby project scale.
For my few hundred thousand embeddings I must say the performance was satisfactory.
markusw 3 hours ago [-]
I’m using sqlite-vec along with FTS5 in (you guessed it) SQLite and it’s pretty cool. :)
jeffchuber 35 minutes ago [-]
> Vector databases treat embeddings as independent data, divorced from the source data from which embeddings are created
With the exception of Pinecone: Chroma, Qdrant, Weaviate, Elastic, Mongo, and many others store the chunk/document alongside the embedding.
This is intentional misinformation.
avthar 18 minutes ago [-]
Post co-author here. The point is a little nuanced, so let me explain:
You are correct in saying that you can store embeddings and source data together in many vectordbs. We actually point this out in the post. The main point is that they are not linked but merely stored alongside each other. If one changes, the other one does not automatically change, making the relationship between the two stale.
The idea behind Pgai Vectorizer is that it actually links embeddings with underlying source data so that changes in source data are automatically reflected in embeddings. This is a better abstraction and it removes the burden on the engineer of ensuring embeddings stay in sync as their data changes.
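The "stored alongside but not linked" failure mode can be shown with a small sketch. Everything here is an assumption made for illustration: the record shape, the helper names, the placeholder `toyEmbed`, and the tiny FNV-1a-style hash (a real system would more likely use a cryptographic hash or a version column). It is one way to detect staleness after the fact, not a description of how Pgai Vectorizer or any vector database works.

```javascript
// Small FNV-1a-style hash over UTF-16 code units. Illustrative only:
// it is not collision-resistant.
function hashText(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Hypothetical record shape: the embedding sits next to the text, together
// with a hash of the text it was computed from.
function makeRecord(id, text, embed) {
  return { id, text, embedding: embed(text), sourceHash: hashText(text) };
}

// A record is stale when its current text no longer matches the hash
// recorded at embedding time.
function isStale(record) {
  return hashText(record.text) !== record.sourceHash;
}

// Recompute the embedding and the hash together.
function refresh(record, embed) {
  record.embedding = embed(record.text);
  record.sourceHash = hashText(record.text);
}

// Placeholder embedder, NOT a real model.
const toyEmbed = (text) => [text.length];

const doc = makeRecord(1, 'Postgres is a database', toyEmbed);
const freshAtStart = !isStale(doc);
doc.text = 'Postgres is a relational database'; // source edited, embedding untouched
const staleAfterEdit = isStale(doc);
refresh(doc, toyEmbed);
const freshAfterRefresh = !isStale(doc);
console.log(freshAtStart, staleAfterEdit, freshAfterRefresh);
```

Without something like the hash check (or a trigger that does the refresh), the edit in the middle would silently leave an embedding that describes text that no longer exists.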
jeffchuber 3 minutes ago [-]
I know that in Chroma this is supported out of the box with 0 lines of code. I’m pretty sure it’s supported everywhere else in no more than 3 lines of code.
ok123456 2 hours ago [-]
Yes. Materialized Views are good.
unholyguy001 1 hours ago [-]
That was just what I was thinking. This approach will have the same issues that materialized views have as well
cevian 1 hours ago [-]
haha. We had a good internal debate as to whether this is more like indexes or more like Materialized Views. It's kinda a mixture of the two.
mattxxx 3 hours ago [-]
This reads solely as a sales pitch, which quickly cuts to the "we're selling this product so you don't have to think about it."
...when you actually do want to think about it (in 2024).
Right now, we're collectively still figuring out:
1. Best chunking strategies for documents
2. Best ways to add context around chunks of documents
3. How to mix and match similarity search with hybrid search
4. Best way to version and update your embeddings
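For item 3, one commonly used way to combine a semantic ranking with a lexical one is reciprocal rank fusion (RRF). The sketch below is a generic version of that technique with the conventional constant `k = 60`; the document ids are made up, and nothing here is claimed to be what any commenter or product in this thread actually uses.

```javascript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
// to a document's score, where rank starts at 1.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Made-up result lists from a vector search and a keyword search.
const semantic = ['doc_a', 'doc_b', 'doc_c'];
const lexical = ['doc_c', 'doc_a', 'doc_d'];

const fused = reciprocalRankFusion([semantic, lexical]);
console.log(fused);
```

RRF only looks at ranks, so it sidesteps the problem that cosine similarities and keyword scores live on different scales.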
cevian 3 hours ago [-]
(post co-author here)
We agree a lot of stuff still needs to be figured out. Which is why we made vectorizer very configurable. You can configure chunking strategies, formatting (which is a way to add context back into chunks). You can mix semantic and lexical search on the results. That handles your 1,2,3. Versioning can mean a different version of the data (in which case the versioning info lives with the source data) OR a different embedding config, which we also support[1].
Admittedly, right now we have predefined chunking strategies. But we plan to add custom-code options very soon.
Our broader point is that the things you highlight above are the right things to worry about, not the data workflow ops and babysitting your lambda jobs. That's what we want to handle for you.
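As a rough picture of what "chunking" plus "formatting" means in the reply above, here is a minimal sketch using the simplest possible strategy: fixed-size character chunks with overlap, each prefixed with the document title. The function names and the `Title:`/`Body:` template are assumptions for illustration, not the vectorizer's actual configuration options.

```javascript
// Hypothetical helper: split text into fixed-size character chunks, with
// each chunk overlapping the previous one by `overlap` characters.
function chunkText(text, size, overlap) {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// "Formatting": add document-level context back into every chunk before
// it is embedded.
function formatChunks(title, chunks) {
  return chunks.map((chunk) => `Title: ${title}\n\nBody: ${chunk}`);
}

const chunks = chunkText('abcdefghij', 4, 1);
const formatted = formatChunks('Alphabet', chunks);
console.log(chunks, formatted[0]);
```

Swapping `chunkText` for a sentence-aware or structure-aware splitter changes the chunk boundaries but leaves the formatting step untouched, which is why the two are worth keeping separate.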
> the responsibility for generating and updating them as the underlying data changes can be handed over to the database management system
And now we shift ever more slightly back towards logic in the DB. I for one am thrilled; there’s no reason other than unfamiliarity to not let RDBMS perform functions it’s designed to do. As long as these offloads are documented in code, embrace not needing to handle it in your app.
(Disclaimer: I work for Elastic)
https://www.elastic.co/search-labs/blog/semantic-search-simp...
https://github.com/patricktrainer/duckdb-embedding-search
[0] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...
[1] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...
Here's a bit more of a complicated example of what I'm talking about: https://blog.bawolf.com/p/embeddings-are-a-good-starting-poi...
[1]: https://github.com/timescale/pgai
[1]: https://www.timescale.com/blog/which-rag-chunking-and-format...