Example:
As an example, we’ll create a collection with an adapter that chunks text into paragraphs and converts each chunk into an embedding vector using the all-MiniLM-L6-v2 model.
First, install vecs with the optional dependencies for text embeddings, for example with pip install "vecs[text_embedding]".
The all-MiniLM-L6-v2 model is a 384-dimensional text embedding model.
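The example described above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: the helper names build_docs_collection and upsert_and_query, the collection name "docs", and the sample records are illustrative assumptions, and the connection string must point at your own Postgres database.

```python
def build_docs_collection(connection_string: str):
    """Create (or fetch) a 'docs' collection whose adapter chunks and embeds text."""
    # Imported lazily so this module still loads where vecs is not installed.
    import vecs
    from vecs.adapter import Adapter, ParagraphChunker, TextEmbedding

    vx = vecs.create_client(connection_string)
    return vx.get_or_create_collection(
        name="docs",  # illustrative collection name
        adapter=Adapter(
            [
                # Split each document into paragraphs, but leave query text alone.
                ParagraphChunker(skip_during_query=True),
                # Turn each paragraph into a 384-dimensional vector.
                TextEmbedding(model="all-MiniLM-L6-v2"),
            ]
        ),
    )


def upsert_and_query(docs):
    """Upsert raw text records, then search by text instead of by vector."""
    docs.upsert(
        records=[
            # (id, media, metadata): the media is plain text, not a vector.
            ("doc0", "First paragraph.\n\nSecond paragraph.", {"source": "example"}),
            ("doc1", "hello, world!", {"source": "example"}),
        ]
    )
    # The adapter embeds the query text before the similarity search runs.
    return docs.query(data="first paragraph", limit=1)
```

With a reachable database, calling upsert_and_query(build_docs_collection("postgresql://<user>:<password>@<host>:<port>/<db_name>")) would run the whole flow end to end.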
Adapters allow you to work with a collection as though it stores your preferred data type natively.
Built-in Adapters
vecs provides several built-in Adapters. Have an idea for a useful adapter? Open an issue requesting it.
ParagraphChunker
The ParagraphChunker AdapterStep splits text media into paragraphs and yields each paragraph as a separate record. This can be a useful preprocessing step when upserting large documents that contain multiple paragraphs. The ParagraphChunker delimits paragraphs by two consecutive line breaks \n\n.
ParagraphChunker is a preprocessing step and must be used in combination with another adapter step like TextEmbedding to transform the chunked text into a vector.
To skip text chunking during queries, set the skip_during_query argument to True. Setting skip_during_query to False will raise an exception if the input text contains more than one paragraph.
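The chunking behaviour described above can be illustrated in plain Python. This is only a sketch of the idea, not the library's implementation: the function chunk_paragraphs, the is_query flag, the ID suffix scheme, and the choice of ValueError are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Record = Tuple[str, Any, Optional[Dict]]


def chunk_paragraphs(
    records: Iterable[Record],
    is_query: bool = False,
    skip_during_query: bool = True,
) -> Iterator[Record]:
    """Yield one record per paragraph, splitting media on two line breaks."""
    for id_, media, metadata in records:
        # Drop empty chunks produced by leading or trailing blank lines.
        paragraphs = [p.strip() for p in media.split("\n\n") if p.strip()]
        if is_query:
            if skip_during_query:
                # Pass query text through unchanged.
                yield (id_, media, metadata)
                continue
            if len(paragraphs) > 1:
                # A query must map to a single vector, so refuse to chunk it.
                raise ValueError("query text contains more than one paragraph")
        for ix, paragraph in enumerate(paragraphs):
            # Hypothetical ID scheme: parent id plus a zero-padded paragraph index.
            yield (f"{id_}_para_{ix:03d}", paragraph, metadata)
```

The output of a step like this is still text, which is why a second step such as TextEmbedding is needed before the records can be stored.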
TextEmbedding
The TextEmbedding AdapterStep accepts text and converts it into a vector that can be consumed by the Collection. TextEmbedding supports all models available in the sentence_transformers package. A complete list of supported models is available in vecs.adapter.TextEmbeddingModel.
Interface
Adapters are objects that take in data in the form of Iterable[Tuple[str, Any, Optional[Dict]]], where each Tuple[str, Any, Optional[Dict]] represents a record of (id, media, metadata).
The main use of Adapters is to transform the media part of the records into a form that is ready to be ingested into the collection (like converting text into embeddings). However, Adapters can also modify the id or metadata if required.
Because they share a common interface, adapters may be composed of multiple adapter steps to create multi-stage preprocessing pipelines. For example, a multi-step adapter might first convert text into chunks and then convert each text chunk into an embedding vector.
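To make the shared interface concrete, here is a minimal sketch of a two-stage pipeline written without vecs. Every name in it (Record, Step, split_paragraphs, toy_embed, run_pipeline) is hypothetical, and toy_embed is a stand-in that returns simple text statistics rather than a real embedding.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, Any, Optional[Dict]]
Step = Callable[[Iterable[Record]], Iterator[Record]]


def split_paragraphs(records: Iterable[Record]) -> Iterator[Record]:
    """Step 1: emit one record per paragraph (changes both id and media)."""
    for id_, text, metadata in records:
        for ix, paragraph in enumerate(text.split("\n\n")):
            yield (f"{id_}_{ix}", paragraph, metadata)


def toy_embed(records: Iterable[Record]) -> Iterator[Record]:
    """Step 2: replace text with a vector. A stand-in, not a real embedding model."""
    for id_, text, metadata in records:
        yield (id_, [float(len(text)), float(len(text.split()))], metadata)


def run_pipeline(steps: List[Step], records: Iterable[Record]) -> List[Record]:
    """Feed the output of each step into the next one."""
    for step in steps:
        records = step(records)
    return list(records)


result = run_pipeline(
    [split_paragraphs, toy_embed],
    [("doc", "hello world\n\nbye", {"lang": "en"})],
)
print(result)
# [('doc_0', [11.0, 2.0], {'lang': 'en'}), ('doc_1', [3.0, 1.0], {'lang': 'en'})]
```

Because every step consumes and produces the same record shape, steps can be reordered or swapped without changing the code that drives them.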