Example:
As an example, we’ll create a collection with an adapter that chunks text into paragraphs and converts each chunk into an embedding vector using the all-MiniLM-L6-v2 model.
First, install vecs with the optional dependencies for text embeddings. The adapter in this example embeds each paragraph using all-MiniLM-L6-v2, a 384 dimensional text embedding model.
Adapters allow you to work with a collection as though it stores your preferred data type natively.
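The example described in this section might look like the sketch below. It assumes the vecs client API (vecs.create_client, get_or_create_collection, upsert, query) and the adapter classes named on this page. The connection string, the collection name, and the record contents are placeholders, and the two function names are made up for illustration. The vecs imports sit inside the function so the file can be loaded before the optional dependencies are installed.

```python
def build_docs_collection(connection_string: str):
    """Create (or fetch) a collection that accepts raw text through an adapter."""
    # Assumption: the optional extra is installed, for example with
    #   pip install "vecs[text_embedding]"
    import vecs
    from vecs.adapter import Adapter, ParagraphChunker, TextEmbedding

    vx = vecs.create_client(connection_string)
    return vx.get_or_create_collection(
        name="docs",  # placeholder collection name
        adapter=Adapter(
            [
                # Chunk on upsert, but leave query text unchunked.
                ParagraphChunker(skip_during_query=True),
                # Embed each chunk with the 384 dimensional model.
                TextEmbedding(model="all-MiniLM-L6-v2"),
            ]
        ),
    )


def upsert_and_search(docs):
    """Upsert and query with plain text instead of vectors."""
    docs.upsert(
        records=[
            # (id, media, metadata): the media is text, not a vector.
            ("vec0", "four score and seven years ago", {"year": 1863}),
        ]
    )
    return docs.query(data="foo bar")
```

With a reachable Postgres database, calling build_docs_collection with a connection string and passing the result to upsert_and_search would exercise the whole flow.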
Built-in Adapters
vecs provides several built-in Adapters. Have an idea for a useful adapter? Open an issue requesting it.
ParagraphChunker
The ParagraphChunker AdapterStep splits text media into paragraphs and yields each paragraph as a separate record. That can be a useful preprocessing step when upserting large documents that contain multiple paragraphs. The ParagraphChunker delimits paragraphs by two consecutive line breaks (\n\n).
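To make the splitting rule concrete, here is a minimal pure-Python sketch of paragraph chunking over (id, media, metadata) records. It is not the vecs implementation: the function name and the id suffix scheme are assumptions chosen for illustration, and unlike a bare split it also drops empty paragraphs.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Record = Tuple[str, Any, Optional[Dict]]


def chunk_paragraphs(records: Iterable[Record]) -> Iterator[Record]:
    """Yield one record per non-empty paragraph of each input record."""
    for record_id, text, metadata in records:
        # Paragraphs are delimited by two consecutive line breaks.
        paragraphs = [p.strip() for p in text.split("\n\n")]
        for index, paragraph in enumerate(p for p in paragraphs if p):
            # Hypothetical id scheme: source id plus the paragraph index.
            yield (f"{record_id}_para_{index}", paragraph, metadata)


document = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = list(chunk_paragraphs([("doc0", document, {"source": "example"})]))
for chunk in chunks:
    print(chunk)
```

Each chunk keeps the metadata of the document it came from, so filters on the original document still apply to its paragraphs.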
ParagraphChunker is a preprocessing step and must be used in combination with another adapter step, such as TextEmbedding, to transform the chunked text into a vector.
To skip chunking when querying the collection, set the skip_during_query argument to True. Setting skip_during_query to False will raise an exception if the input text contains more than one paragraph.
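The sketch below illustrates one plausible reading of this behavior: a query has to resolve to exactly one record, so chunking a multi-paragraph query produces too many. The class, the is_query flag, and the adapt_query helper are all hypothetical stand-ins written for this illustration, not the vecs classes, and the exception type is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Record = Tuple[str, Any, Optional[Dict]]


class SketchParagraphChunker:
    """Illustrative stand-in for a chunking step, not the vecs class."""

    def __init__(self, skip_during_query: bool) -> None:
        self.skip_during_query = skip_during_query

    def __call__(self, records: Iterable[Record], is_query: bool) -> Iterator[Record]:
        for record_id, text, metadata in records:
            if is_query and self.skip_during_query:
                # Pass the query text through untouched.
                yield (record_id, text, metadata)
                continue
            for index, paragraph in enumerate(text.split("\n\n")):
                yield (f"{record_id}_para_{index}", paragraph, metadata)


def adapt_query(step: SketchParagraphChunker, text: str) -> Record:
    """A query needs exactly one record, so several chunks are an error."""
    adapted = list(step([("query", text, None)], is_query=True))
    if len(adapted) != 1:
        raise ValueError("query text produced more than one record")
    return adapted[0]


query_text = "first paragraph\n\nsecond paragraph"

# Skipping the chunker at query time keeps the text as a single record.
skipped = adapt_query(SketchParagraphChunker(skip_during_query=True), query_text)

# Chunking at query time fails for multi-paragraph input.
try:
    adapt_query(SketchParagraphChunker(skip_during_query=False), query_text)
    chunking_failed = False
except ValueError:
    chunking_failed = True
```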
TextEmbedding
The TextEmbedding AdapterStep accepts text and converts it into a vector that can be consumed by the Collection. TextEmbedding supports all models available in the sentence_transformers package. A complete list of supported models is available in vecs.adapter.TextEmbeddingModel.
Interface
Adapters are objects that take in data in the form of Iterable[Tuple[str, Any, Optional[Dict]]], where each Tuple[str, Any, Optional[Dict]] represents a record of (id, media, metadata).
The main use of Adapters is to transform the media part of the records into a form that is ready to be ingested into the collection (like converting text into embeddings). However, Adapters can also modify the id or metadata if required.
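As an illustration of the record shape, the following sketch defines a step as a plain generator over (id, media, metadata) tuples. The step itself (normalize_text) and the char_count metadata key are hypothetical; they only show that a step may change the metadata as well as the media.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Record = Tuple[str, Any, Optional[Dict]]


def normalize_text(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical step: tidy the media and annotate the metadata."""
    for record_id, media, metadata in records:
        # Collapse runs of whitespace, including line breaks, to single spaces.
        cleaned = " ".join(media.split())
        # Copy the metadata so the caller's dict is not mutated.
        enriched = dict(metadata or {})
        enriched["char_count"] = len(cleaned)
        yield (record_id, cleaned, enriched)


records = [
    ("vec0", "  four   score and\nseven years ago ", {"year": 1863}),
    ("vec1", "hello", None),
]
normalized = list(normalize_text(records))
```

Because the input and output share one shape, the output of a step like this can be fed directly into another step.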
Because every step shares this interface, an adapter may be composed of multiple adapter steps that form a multi-stage preprocessing pipeline. For example, a multi-step adapter might first convert text into chunks and then convert each text chunk into an embedding vector.
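A multi-stage pipeline of this kind can be sketched by chaining generators, each consuming the records the previous one yields. Everything here is illustrative: compose, split_paragraphs, and toy_embedding are made-up names, and the toy embedding (character count and word count) merely stands in for a real embedding model.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, Any, Optional[Dict]]
Step = Callable[[Iterable[Record]], Iterator[Record]]


def split_paragraphs(records: Iterable[Record]) -> Iterator[Record]:
    """Stage 1: one record per paragraph."""
    for record_id, text, metadata in records:
        for index, paragraph in enumerate(text.split("\n\n")):
            yield (f"{record_id}_{index}", paragraph, metadata)


def toy_embedding(records: Iterable[Record]) -> Iterator[Record]:
    """Stage 2: placeholder for a real model, yields [char count, word count]."""
    for record_id, text, metadata in records:
        yield (record_id, [float(len(text)), float(len(text.split()))], metadata)


def compose(steps: List[Step]) -> Step:
    """Chain steps so each one consumes the output of the previous one."""

    def pipeline(records: Iterable[Record]) -> Iterator[Record]:
        for step in steps:
            records = step(records)
        return iter(records)

    return pipeline


adapter = compose([split_paragraphs, toy_embedding])
result = list(adapter([("doc0", "hello world\n\ngoodbye", {"lang": "en"})]))
for record in result:
    print(record)
```

Since the stages are generators, records flow through the pipeline one at a time rather than being materialized between stages.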