Hi
I understand from the following commit that YugabyteDB does not yet support distributed vector storage. What is the plan for this feature?
committed 04:45PM - 21 May 24 UTC
Summary:
This change implements the YSQL side of vector index creation. This diff adds support for index creation statements with a dummy ANN method called `ybdummyann` for now in the form
```
create extension vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
CREATE INDEX ON items USING ybdummyann (embedding vector_l2_ops);
```
This creates an inverted index in DocDB with a schema that looks like
`BaseYBCTID | embedding |`
With only `BaseYBCTID` as the key.
We can do an index ANN scan based on a given query vector such as
` SELECT * FROM items ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 5; `
or an index only scan such as
` SELECT embedding FROM items ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 5; `
Note that the results from a `ybdummyann` index won't actually be sorted by their distance from the given query vector as the DocDB side of vector indexing has not been implemented. This is made clear by the following client warning when such an index is created. In the future, when we fully have end-to-end support of vector indexing we will add index AM's such as `hnsw` and `ivfflat` meant for external usage.
```
WARNING: ybdummyann is meant for internal-testing only. It does not yield ordered results.
```
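For reference, the ordering that a fully implemented vector index would approximate is the exact nearest-first order by L2 (Euclidean) distance, which is the metric behind pgvector's `<->` operator. A minimal Python sketch of that exact ordering follows. The function names and sample rows are hypothetical and are not part of YugabyteDB or pgvector.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(rows, query, k):
    """Return the k (id, embedding) rows closest to `query`, nearest first."""
    return heapq.nsmallest(k, rows, key=lambda row: l2_distance(row[1], query))


# Hypothetical sample rows standing in for the `items` table.
items = [
    (1, [1.0, 0.4, 0.35]),
    (2, [0.0, 0.0, 0.0]),
    (3, [1.0, 0.5, 0.35]),
    (4, [5.0, 5.0, 5.0]),
]

# Rough equivalent of: ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 2
nearest = exact_top_k(items, [1.0, 0.4, 0.35], k=2)
```

A `ybdummyann` scan gives no such guarantee, which is what the warning above states.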
When a vector index is created, a message of type `PgVectorIdxOptionsPB` found in `common.proto` is populated into `IndexInfo`. A log message has been inserted into `tablet.cc` to show how this can be accessed. Vector index scans populate a field of type `PgVectorReadOptionsPB` in the `PgsqlReadRequestPB`.
The relcache preloader is adjusted to not load index relations whose user-defined AM handler procs might not be loaded yet.
A new access method handler called `ybdummyannhandler` is created by this diff.
Any future vector index AM/AM handler will share functionality very similar to `ybdummyann`. For this reason, this common functionality is all placed in `src/ybvector/ybvector*`.
The main remaining TODOs after this change are:
- Build out DocDB side.
- Add capabilities to mergesort rows from tablets based on their distance from the query vector.
- Add an extra key column to denote future sharding information of each row.
- Allow included values.
- Allow a mix of vector and non-vector key attributes.
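The mergesort TODO above can be pictured as a k-way merge: assuming each tablet returns its own rows already sorted by distance from the query vector, the query layer only needs to merge those sorted streams and stop at the limit. The following Python sketch illustrates the idea only. The function names and the per-tablet data are hypothetical, and this is not how DocDB implements or will implement it.

```python
import heapq
import itertools
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_tablet_results(per_tablet_rows, query, limit):
    """Merge per-tablet row lists into one nearest-first list.

    Assumes every inner list is already sorted by distance to `query`.
    """
    # One lazily evaluated (distance, id, embedding) stream per tablet.
    streams = [
        ((l2_distance(emb, query), row_id, emb) for row_id, emb in rows)
        for rows in per_tablet_rows
    ]
    # heapq.merge performs the k-way merge without materializing all rows.
    merged_stream = heapq.merge(*streams)
    return [(row_id, emb) for _, row_id, emb in itertools.islice(merged_stream, limit)]


query = [1.0, 0.4, 0.35]
# Hypothetical results from two tablets, each sorted by distance to `query`.
tablet_a = [(1, [1.0, 0.4, 0.35]), (2, [0.0, 0.0, 0.0])]
tablet_b = [(3, [1.0, 0.5, 0.35]), (4, [5.0, 5.0, 5.0])]

merged = merge_tablet_results([tablet_a, tablet_b], query, limit=3)
```

Because the merge is lazy, only as many rows as the `LIMIT` requires are pulled from each tablet's stream.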
**Upgrade/Rollback safety:**
This adds vector index protobuf fields that should not be used by any production customer right now.
Jira: DB-11118
Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressThirdPartyExtensionsPgvector'
Reviewers: timur, jason, mbautin, sergei
Reviewed By: timur, jason
Subscribers: yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D34200
One passage in it says:
“Note that the results from a `ybdummyann` index won’t actually be sorted by their distance from the given query vector as the DocDB side of vector indexing has not been implemented. This is made clear by the following client warning when such an index is created. In the future, when we fully have end-to-end support of vector indexing we will add index AM’s such as `hnsw` and `ivfflat` meant for external usage.”
Hi @ZhenNan2016
Why make another issue when you can ask in the github issue?
Issues being worked on for the near term are listed in the roadmap in the yugabyte/yugabyte-db GitHub repository.
Hi, @dorian_yugabyte
Sorry about that.
I urgently need the hnsw index. I asked on the relevant GitHub issue and also mentioned @mbautin, but got no response.
Can you help me confirm when the hnsw index will be supported?
opened 03:22PM - 02 Jul 24 UTC
kind/enhancement
area/docdb
priority/medium
Jira Link: [DB-12030](https://yugabyte.atlassian.net/browse/DB-12030)
### Description
Hi
I understand from the following documentation that our ybdb does not support distributed vector storage. May I ask what is the plan for this feature?
https://github.com/yugabyte/yugabyte-db/commit/5ca67e496d8c40cdef56c71513ef6d3b6d630596
One of them is mentioned:
“Note that the results from a ybdummyann index won’t actually be sorted by their distance from the given query vector as the DocDB side of vector indexing has not been implemented. This is made clear by the following client warning when such an index is created. In the future, when we fully have end-to-end support of vector indexing we will add index AM’s such as hnsw and ivfflat meant for external usage.”
Thanks a lot.
opened 12:18AM - 03 Aug 24 UTC
kind/enhancement
area/docdb
priority/medium
status/awaiting-triage
Jira Link: [DB-12298](https://yugabyte.atlassian.net/browse/DB-12298)
### Description
Create our own experimental HNSW implementation.
- Initial implementation will index vectors in memory.
- DocDB integration can be added gradually.
- Parts of the experimental implementation can be moved to the production implementation.
- Command-line tools and benchmarks, with the ability to tune parameters, can be added on top of the experimental implementation.
### Issue Type
kind/enhancement
Thanks a lot.
Hi @ZhenNan2016, it is being worked on, but I have no dates. Can you explain your use case (here or at fpachot@yugabyte.com)? Knowing about use cases may help increase the priority or get a more precise roadmap.
Hi @FranckPachot
I have some business scenarios on my side, including but not limited to the following:
In finance, e-commerce, and other fields that work with text, audio, video, images, and other media, there are ANN or KNN search requirements, and the number of items to store and retrieve is expected to reach the billions.
Thanks a lot.
If some of those customers may need support, we can involve pre-sales, and they may help increase the priority. If they are open source users, follow the GitHub issue for updates on the roadmap.
Are they already on PostgreSQL or another database, or is it a new application?
Most of these users are open source users, including me.
Some of these users are already on PostgreSQL, some are new applications.
Thanks a lot.