Step 3: Embeddings & Recall — Give Your Agent Long-Term Memory

Embeddings, locally

nomic-embed-text turns a string into a 768-dimension vector. We store it as a packed float32 blob in the embedding column — no extension, no vector DB, just bytes in SQLite.

Add to memory.py:

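import math
import struct

import ollama

EMBED_MODEL = "nomic-embed-text"

def embed(text: str) -> list[float]:
    """Return the embedding vector for text from the local Ollama server."""
    return ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"]

def to_blob(vec: list[float]) -> bytes:
    """Pack a vector as float32 values (4 bytes each) for the embedding column."""
    return struct.pack(f"{len(vec)}f", *vec)

def from_blob(blob: bytes) -> list[float]:
    """Unpack a float32 blob written by to_blob back into a list of floats."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    # The small epsilon avoids division by zero for an all-zero vector.
    return dot / (na * nb + 1e-9)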
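
To see the storage format in isolation, here is a minimal sketch that round-trips a made-up 768-element vector through the same packing scheme without calling Ollama. The vector values are invented for illustration; only the two blob helpers are taken from the listing above.

```python
import math
import struct

def to_blob(vec: list[float]) -> bytes:
    # Pack as float32 values (4 bytes each), native byte order.
    return struct.pack(f"{len(vec)}f", *vec)

def from_blob(blob: bytes) -> list[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

# Made-up stand-in for a real embedding: 768 arbitrary floats in [-1, 1].
vec = [math.sin(i) for i in range(768)]

blob = to_blob(vec)
restored = from_blob(blob)

blob_size = len(blob)  # 768 floats * 4 bytes each
# float32 keeps roughly 7 significant digits, so compare with a tolerance.
max_error = max(abs(a - b) for a, b in zip(vec, restored))

print(blob_size)         # 3072
print(max_error < 1e-6)  # True
```

Each fact therefore costs about 3 KB of embedding storage, and the round trip loses only the precision that float32 cannot represent.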

At a few hundred facts, a linear scan in pure Python is fast enough: each comparison is a few thousand floating-point operations, so the whole scan takes on the order of milliseconds rather than seconds. This is the honest scale of personal-agent memory. (When you outgrow it, at tens of thousands of facts, swap the scan for sqlite-vec or pgvector; the interface below doesn't change.)
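
If you would rather measure the scan than take it on faith, a rough timing sketch like the following can help. It assumes 300 facts and uses random vectors in place of real embeddings, so it exercises only the pure-Python cosine loop; the numbers will vary by machine.

```python
import math
import random
import time

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)

random.seed(0)
DIM, N_FACTS = 768, 300  # assumed sizes: nomic-embed-text width, "a few hundred" facts

# Random stand-ins for the query embedding and the stored fact embeddings.
query = [random.gauss(0, 1) for _ in range(DIM)]
facts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_FACTS)]

start = time.perf_counter()
scores = [cosine(query, f) for f in facts]
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"scanned {len(scores)} vectors in {elapsed_ms:.1f} ms")
```

Raising N_FACTS until the printed time becomes uncomfortable is a simple way to find the point where an indexed store starts to pay off.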

Recall — but only what's true now

This is the function that makes our memory different from transcript search. It filters to valid_to IS NULL before ranking, so a retired fact can never be returned:

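def recall(con, query: str, k: int = 5) -> list[tuple[str, str, str]]:
    """Return the k currently valid facts most similar to query."""
    q = embed(query)
    # Only open rows (valid_to IS NULL) are candidates; retired facts are skipped.
    rows = con.execute(
        "SELECT subject, predicate, value, embedding"
        " FROM facts WHERE valid_to IS NULL"
    ).fetchall()
    scored = [
        (cosine(q, from_blob(emb)), s, p, v)
        for (s, p, v, emb) in rows
        if emb is not None
    ]
    # Rank on the similarity score alone, so ties never compare the other fields.
    scored.sort(key=lambda row: row[0], reverse=True)
    return [(s, p, v) for (_, s, p, v) in scored[:k]]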

That WHERE valid_to IS NULL is the entire fix for the update failure. Naive retrieval over raw turns ranks every past mention by similarity; both "I live in Toronto" and "I moved to Berlin" match "where do I live?", and the stale one can win. Here, the Toronto row is already closed by the time we recall, so it isn't even a candidate.
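
The effect of the filter can be seen without any embeddings at all. The sketch below uses an in-memory database with a hypothetical, cut-down facts table (only the columns the demo needs) and invented dates, then compares the candidate set with and without the valid_to IS NULL clause.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical minimal schema: a subset of the columns used in this tutorial.
con.execute(
    "CREATE TABLE facts(subject TEXT, predicate TEXT, value TEXT,"
    " valid_from TEXT, valid_to TEXT)"
)
con.executemany(
    "INSERT INTO facts VALUES(?,?,?,?,?)",
    [
        # Invented dates: the Toronto row was closed when the user moved.
        ("user", "city", "Toronto", "2025-01-01T00:00:00", "2025-09-01T00:00:00"),
        ("user", "city", "Berlin", "2025-09-01T00:00:00", None),
    ],
)

# Without the filter, both the stale and the current fact are candidates.
all_rows = con.execute(
    "SELECT value FROM facts WHERE predicate = 'city' ORDER BY valid_from"
).fetchall()

# With the filter, only the open row is ever handed to the ranking step.
current_rows = con.execute(
    "SELECT value FROM facts WHERE predicate = 'city' AND valid_to IS NULL"
).fetchall()

print(all_rows)      # [('Toronto',), ('Berlin',)]
print(current_rows)  # [('Berlin',)]
```

Because the stale row is removed in SQL, the similarity ranking never gets a chance to prefer it, however well it matches the query.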

Try it

if __name__ == "__main__":
    con = connect()
    # (pretend these were written by the pipeline in Step 4)
    for pred, val in [("role", "backend engineer"), ("city", "Berlin")]:
        emb = to_blob(embed(f"user {pred} {val}"))
        con.execute(
            "INSERT INTO facts(subject,predicate,value,embedding,valid_from,valid_to,session)"
            " VALUES('user',?,?,?,'2026-01-01T00:00:00',NULL,1)",
            (pred, val, emb),
        )
    con.commit()
    print(recall(con, "where does the user live?", k=2))

$ python memory.py
[('user', 'city', 'Berlin'), ('user', 'role', 'backend engineer')]

The city fact ranks first for a location query, exactly as you'd hope. Next we make the writes real — and temporal.

Reference: Ollama embeddings API · nomic-embed-text · sqlite-vec (when you outgrow the scan)