Keyword search matches strings. Semantic search matches meaning — "I forgot my login" finds the article titled "Resetting your password" even though they share no words. The mechanism is embeddings: a model turns each piece of text into a vector of a few thousand numbers, and texts that mean similar things land close together in that space. Generate those vectors once, store them, and you can rank anything by relevance — the retrieval half of every RAG system.

Brievio exposes /v1/embeddings as a drop-in OpenAI-compatible endpoint, so the SDK call is identical to what you already write — only the base_url changes. Embeddings are cheap, you pay honest token counts, and failed 4xx/5xx calls cost nothing. This post is the practical path: generate vectors, score them with cosine similarity, store and query them with pgvector, and wire the result into a retrieval pipeline — with the caveats that actually bite.

Generating embeddings

One call. Pass a list of strings, get back a list of vectors in the same order. Always batch — a few hundred inputs per request costs the same per token as one-at-a-time but saves you hundreds of round trips:

embed.py

# Generate embeddings with the OpenAI SDK — same call, different base_url.
from openai import OpenAI

client = OpenAI(
    api_key="sk-brievio-...",
    base_url="https://api.brievio.com/v1",
)

EMBED_MODEL = "text-embedding-3-large"   # pick one model and stick with it

def embed(texts: list[str]) -> list[list[float]]:
    # Batch up to a few hundred inputs per call — far fewer round trips,
    # same per-token price. The response preserves input order.
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return [row.embedding for row in resp.data]

vecs = embed(["How do I reset my password?", "Where is my invoice?"])
print(len(vecs), "x", len(vecs[0]))     # 2 x 3072  (model-dependent dimension)

# usage is reported honestly — embeddings are token-metered like any other call
print(resp.usage.prompt_tokens, "tokens billed")

The vector length (its dimension) is fixed by the model — commonly 768, 1536, or 3072. Higher dimensions capture slightly more nuance but cost more to store and compare. The single most important rule: pick one embedding model and never mix. A vector from one model and a vector from another live in different, incompatible spaces — comparing them produces noise. See the live models catalog for the embedding models available and their dimensions; pricing per model is on the pricing page.

Scoring with cosine similarity

To find the closest matches, you compare the query vector against your stored vectors. The standard metric is cosine similarity: it measures the angle between two vectors and ignores their length, so it cares about direction (meaning) rather than magnitude. 1.0 means the same direction, 0 means unrelated:

search.py

# Semantic search = embed the query, score it against every stored vector,
# return the closest. The scoring function is cosine similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = identical direction, 0 = unrelated, -1 = opposite.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Many embedding models return unit-length vectors already. When they do,
# cosine similarity reduces to a plain dot product — cheaper at scale:
def cosine_unit(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

query = np.array(embed(["I forgot my login"])[0])
corpus = np.array(embed(documents))                 # (N, dim)

scores = corpus @ query / (
    np.linalg.norm(corpus, axis=1) * np.linalg.norm(query)
)
top = scores.argsort()[::-1][:5]                    # indices of the 5 best matches
for i in top:
    print(round(float(scores[i]), 3), documents[i][:60])

Two practical notes. First, many embedding models already return unit-length vectors — when they do, cosine similarity is just a dot product, which is what makes brute-force search over tens of thousands of vectors fast enough to skip a database entirely. Second, the brute-force loop is fine up to roughly the low tens of thousands of vectors. Past that, scanning every vector on every query gets slow, and you want a real vector store with an index.

Storing and querying with pgvector

If your data already lives in Postgres, you don't need a separate vector database — the pgvector extension adds a vector column type and nearest-neighbour operators. You embed each chunk once, store the vector next to its text, and let Postgres do the search:

store.sql

-- pgvector turns Postgres into a vector store. Enable it once, then store
-- each chunk's embedding alongside its text and metadata.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    doc_id    text NOT NULL,
    content   text NOT NULL,
    embedding vector(3072)            -- must match your model's dimension
);

-- Approximate-nearest-neighbour index. <=> is cosine distance in pgvector
-- (0 = closest). Build the index AFTER bulk-loading, not before.
CREATE INDEX ON chunks
    USING hnsw (embedding vector_cosine_ops);

-- Retrieval: pass the query embedding as a parameter ($1) and take the
-- nearest rows. Order by distance ascending = most similar first.
SELECT id, doc_id, content, 1 - (embedding <=> $1) AS similarity
FROM   chunks
WHERE  doc_id = ANY($2)               -- optional metadata pre-filter
ORDER  BY embedding <=> $1
LIMIT  5;

The <=> operator is cosine distance (0 = identical), so 1 - (embedding <=> $1) gives you back a similarity score. The hnsw index makes search approximate but fast — for most retrieval workloads the tiny recall tradeoff is invisible, and you build it after bulk-loading, never before. The optional WHERE clause is the quiet superpower: filter by tenant, document, language, or date before the vector search so you never retrieve a neighbour the user isn't allowed to see.

Wiring it into RAG

Retrieval-augmented generation is two steps stitched together: semantic search to find relevant context, then a generation model to answer using that context. The retrieval side is everything above. The generation side is a normal chat completion — and this is where you pair a cheap embedding model with a strong reasoning model:

Chunk, then embed. Split documents into passages of roughly a few hundred tokens with a little overlap, embed each, and store them. Chunking is the lever most teams under-invest in: chunks too large bury the answer in noise; too small and they lose the context that makes them meaningful. Tune it on your own data.
Retrieve at query time. Embed the user's question with the same model, pull the top 3–8 chunks, and paste their text into the prompt as context.
Generate the answer. Send that context plus the question to a capable model — Claude or Gemini through the same Brievio endpoint — and instruct it to answer only from the provided context and cite which chunk it used.

The economics work because the two halves have wildly different price points. Embedding your whole corpus is a one-time, low-cost batch job; re-embedding only happens when content changes. The per-query cost is dominated by the generation step, not retrieval — which is exactly why the cost-control techniques in our AI API cost-optimization playbook (prompt caching the static instructions, capping output, picking the right model per task) move the needle far more than anything on the embedding side.

The caveats that actually bite

Vectors are not interchangeable across models. Switch embedding models and you must re-embed the entire corpus — query and stored vectors have to come from the same model, or scores are meaningless. Treat the model choice as a schema decision.
Dimension is a storage and speed tradeoff. A 3072-dim vector is four times the bytes of a 768-dim one and slower to compare. Bigger isn't automatically better for your task — measure recall on your own queries before paying for the largest model.
Chunking quality beats model quality. Most bad RAG answers trace back to bad chunks, not a weak embedding model. Respect document structure — split on headings and paragraphs, not a blind character count.
Semantic search can miss exact terms. Product SKUs, error codes, and names are sometimes better matched by keyword search. In practice the strongest systems are hybrid: combine vector similarity with a classic full-text or BM25 score.
Embeddings have a context limit too. Text past the model's input limit is silently truncated — your "embedding" then represents only the first part of an over-long chunk. Keep chunks comfortably under the limit.

The concrete takeaway

Semantic search is four moving parts: an embedding model, a similarity metric, a store, and a chunking strategy. Pick one embedding model and commit to it; use cosine similarity; start with brute-force NumPy and graduate to pgvector's HNSW index when your corpus outgrows a single scan; and spend your tuning budget on chunking, because that's where retrieval quality is won or lost. Through Brievio, the embedding call is the OpenAI SDK you already know with one line changed, metered on honest token counts, and it sits next to the Claude and Gemini models you'll use for the generation step — one key, one base_url, the whole RAG pipeline. The endpoint reference and a runnable quickstart live in the docs.

Embeddings and semantic search with the OpenAI SDK (RAG guide)

Generating embeddings

Scoring with cosine similarity

Storing and querying with pgvector

Wiring it into RAG

The caveats that actually bite

The concrete takeaway

$ ls ./related

Vision and document understanding with Claude and Gemini via one API

Structured output and JSON mode across Claude and Gemini

Rate limits, retries and backoff: production error handling for AI APIs

Building an AI agent loop: tools, memory and safe iteration