RAG is the architectural pattern of running a retrieval step (usually a vector search over a corpus) at query time and feeding the top results into an LLM's context window before generation. The result is an answer grounded in your specific data, not just the model's training set.
Why it's a commodity
Most modern LLM frameworks ship a basic RAG implementation in under fifty lines of code. Embed a corpus, store the vectors, do a nearest-neighbour search at query time, prepend the results to the prompt. That part is genuinely a weekend project.
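The fifty-line version really is that small. Here is a minimal sketch, assuming sentence-transformers for the embeddings and plain NumPy for the nearest-neighbour search; `generate()` is a hypothetical stand-in for whichever LLM client you use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Incident 4211 was caused by a misconfigured retry policy.",
    "The retry policy was fixed in release 2.31.",
    "Quarterly OKRs are reviewed every second Thursday.",
]
# Embed once and keep the vectors around; normalised so dot product = cosine similarity.
corpus_vectors = model.encode(corpus, normalize_embeddings=True)

def generate(prompt: str) -> str:
    # Hypothetical stand-in: wrap your actual LLM client here.
    raise NotImplementedError

def retrieve(query: str, k: int = 2) -> list[str]:
    """Nearest-neighbour search over the stored vectors."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vectors @ query_vector
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

def answer(query: str) -> str:
    # Prepend the retrieved artefacts to the prompt, then generate.
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```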
Why shipping it is hard
What surrounds the retrieval step is where the work is. Capture pipelines that don't break when artefacts get edited. Dedup that handles near-duplicates correctly. A relationship graph that connects artefacts about the same incident. A citation model the user can verify in two clicks. An authorisation layer that doesn't leak across tenants. A fallback when retrieval finds nothing relevant.
These problems are hidden in the demos and dominant in production. RAG's reputation as 'easy' is half-true; the easy part is the part that shows up in demos.
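The last item on that list is the cheapest to get right and the most commonly skipped. A minimal sketch of the no-relevant-result fallback, assuming retrieval returns (score, text) pairs with cosine scores in [0, 1]; the threshold and the refusal string are illustrative, not a recommendation.

```python
NO_ANSWER = "I couldn't find anything in the knowledge base about that."

def answer(query: str, retrieve, generate, min_score: float = 0.35) -> str:
    hits = retrieve(query)  # e.g. [(0.82, "..."), (0.31, "...")]
    relevant = [text for score, text in hits if score >= min_score]
    if not relevant:
        # Refuse instead of prompting the model with irrelevant context,
        # which is the usual source of confident-but-wrong answers.
        return NO_ANSWER
    context = "\n".join(relevant)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```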
How Cognia uses it
Cognia's retrieval combines vector search (Qdrant), a relation graph (the memory mesh), and a re-ranking pass with a cross-encoder. The model sees a small, high-quality set of artefacts plus citation tokens it must use. We've written about the architecture in detail.
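Stripped of the production details, that retrieval path has roughly the shape below. This is a rough sketch, not Cognia's actual code: it assumes the qdrant-client and sentence-transformers libraries, an illustrative "artefacts" collection with a "text" payload field, and it hides the memory-mesh expansion behind a hypothetical `expand_via_graph()`.

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
qdrant = QdrantClient(url="http://localhost:6333")

def expand_via_graph(hits):
    # Hypothetical: pull in artefacts linked to these hits in the relation graph.
    return hits

def retrieve(query: str, k: int = 5):
    vector = embedder.encode(query).tolist()
    # Broad recall first: take more candidates than we intend to show the model.
    hits = qdrant.search(collection_name="artefacts", query_vector=vector, limit=20)
    candidates = expand_via_graph(hits)
    # The cross-encoder scores each (query, text) pair jointly; slower than
    # vector search, but much better at ordering a small candidate set.
    texts = [h.payload["text"] for h in candidates]
    scores = reranker.predict([(query, t) for t in texts])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [hit for _, hit in ranked[:k]]
```

The design trade is standard: vector search buys recall, graph expansion pulls in related artefacts the query alone would miss, and the re-ranker spends a little latency deciding which few the model actually sees.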