'RAG is dead' has been the most reliable way to get attention in AI engineering since late 2023. Every time a model ships a bigger context window, someone declares retrieval obsolete. And every time, the people running real systems in production quietly carry on retrieving.
The truth underneath the hype is more useful than the headline. Naive RAG is breaking down, and the discipline replacing it is called context engineering. If your retrieval layer still looks like the 2023 tutorial you copied it from, it is already your bottleneck.
This is what actually changed, and what to build instead.
Is RAG Actually Dead?
No. Naive RAG is. Retrieval, the act of giving a model information it was never trained on, matters more than ever, because models are frozen at training time and your business data is not. What is dying is the lazy version: a single vector search dumping disconnected snippets into a prompt and hoping for the best.
Douwe Kiela, co-author of the original 2020 RAG paper, puts it bluntly: people 'have rebranded it now as context engineering, which includes MCP and RAG'. The retrieval did not go anywhere. The 'R' still stands for retrieval. What changed is everything around it.
So the headline is wrong, but it is pointing at something real. The useful question is not 'is retrieval dead?'. It is 'why is the thing I built in 2023 falling over?'
What Is Naive RAG, and Why Is It Breaking?
Naive RAG was designed for a human-scale, single-turn problem: one question, one retrieval, one answer. Agents break every one of those assumptions. They retrieve in loops, sometimes dozens of times per task, so every weakness in single-shot retrieval compounds with each call.
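To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes each retrieval call succeeds independently with the same probability, which is a simplification (real failures tend to be correlated), and the 95% figure is illustrative rather than measured.

```python
def task_success_rate(per_call_success: float, calls: int) -> float:
    """Chance that every retrieval in a task succeeds, assuming independent calls."""
    if not 0.0 <= per_call_success <= 1.0:
        raise ValueError("per_call_success must be between 0 and 1")
    if calls < 0:
        raise ValueError("calls must be non-negative")
    return per_call_success ** calls


if __name__ == "__main__":
    # Illustrative numbers only: a retriever that is right 95% of the time.
    for calls in (1, 5, 20):
        rate = task_success_rate(0.95, calls)
        print(f"{calls:>2} calls: {rate:.0%} of tasks see no retrieval failure")
```

Under these assumptions, a retriever that is right 95% of the time gets through a 20-call task without a single miss only about 36% of the time.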
Four failure modes show up again and again:
- Lost in the middle. Stuffing more chunks into the prompt does not help. Models read the start and end of a long context far better than the middle, and accuracy on the same task can fall by double digits when the key passage is buried in the centre. This is the canonical Lost in the Middle finding, replicated across GPT, Claude and others.
- Context rot. Chroma tested 18 frontier models and found every one degrades as the input grows, even on trivial tasks. A million-token window is not a million tokens of usable attention. Chroma's research is worth reading in full.
- Brittle chunking. Fixed-size windows cut straight through tables, headers and arguments, handing the model half a thought to work with.
- Volume. Re-embedded and duplicated documents leave near-identical chunks competing for the same slots, and agentic loops multiply every retrieval defect by the number of calls.
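Two of these failure modes, brittle chunking and duplicate volume, have cheap first-line mitigations. The sketch below is a minimal illustration, not a full pipeline: `chunk_by_paragraph` and `drop_duplicate_chunks` are hypothetical helper names, the chunker only respects blank-line paragraph breaks (it knows nothing about tables or headers), and the de-duplication only catches copies that differ in case or whitespace, not true near-duplicates.

```python
import hashlib


def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars where possible.

    A paragraph longer than max_chars is kept intact as its own oversized
    chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # +2 accounts for the blank line used to rejoin paragraphs.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def drop_duplicate_chunks(chunks: list[str]) -> list[str]:
    """Keep the first copy of each chunk, ignoring case and whitespace differences."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        normalised = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

Running both before embedding means fewer half-thoughts reach the index and fewer identical chunks compete for the same slots.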
The market has noticed. VentureBeat's early-2026 tracker showed enterprise intent to adopt hybrid retrieval roughly tripling in a single quarter as RAG programmes hit what it called a 'scale wall'. It is a small-sample signal, but a telling one. Teams are not abandoning retrieval. They are rebuilding it.
What Is Context Engineering?
Context engineering is the discipline of curating the right set of tokens to put in front of a model at each step of its work. Prompt engineering optimises the wording of a mostly static instruction. Context engineering manages the whole context state as it changes: retrieval, memory, tool results, history, and what to discard.
The term went mainstream in June 2025, when Shopify's Tobi Lütke and Andrej Karpathy started using it in earnest. Karpathy's definition stuck: 'the delicate art and science of filling the context window with just the right information for the next step.' Anthropic formalised the discipline in September 2025, framing it as the natural progression of prompt engineering.
RAG sits inside context engineering, not against it. RAG answers one question: 'what should I fetch?'. Context engineering answers the harder one: 'what should be in the window right now, and what should I drop to make room?'. That second question is where production systems live or die.
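One way to picture that second question is as a budgeting problem. The sketch below is an assumption-laden illustration: `ContextItem`, `estimate_tokens` and `assemble_context` are hypothetical names, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and the priorities would come from your own scoring logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    source: str      # e.g. "retrieval", "memory", "tool_result", "history"
    text: str
    priority: float  # higher means more important for the next step


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def assemble_context(items: list[ContextItem], budget_tokens: int) -> list[ContextItem]:
    """Greedily keep the highest-priority items that fit the budget; drop the rest."""
    kept: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget_tokens:
            kept.append(item)
            used += cost
    return kept
```

The point of the design is that dropping is explicit: anything that does not earn its place in the budget stays out of the window for this step.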
Does a Million-Token Window Just Kill RAG?
No, and this is the argument that refuses to die. Bigger context windows change the trade-off. They do not remove the need to retrieve. Pasting your whole corpus into the prompt loses on cost, latency, scale and accuracy at the same time.
Run the numbers and it falls apart quickly:
- Cost. Self-attention compute grows quadratically with sequence length, so long prompts get expensive fast. By one vendor's estimate, a full multi-million-token call can cost on the order of 1,000 times more than retrieving only what you need.
- Scale. Real knowledge bases dwarf any window. A context window is a fixed ceiling. Retrieval scales to corpora that, by definition, cannot fit.
- Latency. Long-context calls are slow, tens of seconds at high token counts versus sub-second retrieval. That is fatal inside an agent loop that retrieves many times per task.
- Context rot, again. Even when everything fits, quality drops as the window fills.
Long context is a complement, not a replacement: a bigger working set per step, used deliberately. Treat the context window as a dumping ground and you have recreated the problem retrieval was meant to solve.
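The cost point is easy to sanity-check with rough arithmetic. The sketch below is illustrative only: it assumes cost grows as a simple power of the prompt length (exponent 1 for per-token pricing, exponent 2 for a purely quadratic attention term), and the token counts are made up. Real bills depend on your provider's pricing, caching and implementation.

```python
def relative_cost(full_tokens: int, retrieved_tokens: int, exponent: float = 1.0) -> float:
    """How many times more a full-context call costs than a retrieved one,
    assuming cost grows as tokens ** exponent."""
    if full_tokens <= 0 or retrieved_tokens <= 0:
        raise ValueError("token counts must be positive")
    return (full_tokens / retrieved_tokens) ** exponent


if __name__ == "__main__":
    full, retrieved = 2_000_000, 2_000  # hypothetical prompt sizes
    print(f"linear scaling:    {relative_cost(full, retrieved):,.0f}x")
    print(f"quadratic scaling: {relative_cost(full, retrieved, exponent=2.0):,.0f}x")
```

With these hypothetical sizes, linear scaling gives a 1,000x gap and a purely quadratic term gives 1,000,000x. Either way, the gap is multiplied again by every call an agent loop makes.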
What Actually Replaces Naive RAG?
Not one technique, a portfolio. The 2026 baseline is hybrid search with reranking, retrieval exposed as a tool an agent calls, and specialised methods for the queries that plain vector search handles badly.
| Technique | What it does | When to reach for it |
|---|---|---|
| Hybrid search + reranking | Dense vectors plus keyword (BM25), fused, then a cross-encoder reorders the top results | The default for anything with names, codes or exact terms, which is most business data |
| Agentic retrieval | The model plans, retrieves, grades relevance, rewrites the query and retrieves again | Multi-hop and high-stakes questions where one shot is not enough |
| Retrieval as a tool (MCP) | The agent pulls what it needs at runtime instead of you pre-stuffing the prompt | Agents reaching many live, permissioned systems |
| GraphRAG | Traverses an entity-relationship graph rather than ranking by similarity | Relationship and multi-hop queries, where vector search collapses past a handful of entities |
| Semantic layer | Queries a governed model of your metrics and dimensions, not a vector index | Structured data and KPIs, where vector search scores close to zero |
| Memory | Persists long-term context outside the window, recalled on demand | Multi-session agents and personalisation |
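The fusion step in the first row is often done with Reciprocal Rank Fusion (RRF), which is one common choice rather than the only one. The sketch below assumes you already have two ranked lists of document IDs, one from a dense index and one from BM25. The function name and the example IDs are hypothetical, and the cross-encoder reranking step that would follow is left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]    # hypothetical vector-search ranking
    keyword_hits = ["b", "d", "a"]  # hypothetical BM25 ranking
    print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
```

RRF only needs ranks, not scores, which is why it is a convenient way to combine dense and keyword results whose raw scores are not comparable.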
Retrieval-as-a-tool is the quiet revolution. Instead of guessing what an agent will need and stuffing it into the prompt up front, you let the agent decide, increasingly over the Model Context Protocol (MCP), now a de facto standard adopted across OpenAI, Google and Microsoft. The data gets pulled when it is needed, not before.
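Stripped of any particular protocol, the pattern looks roughly like the sketch below. Everything here is hypothetical: the in-memory `DOCS` corpus, the `search_docs` tool, the shape of the tool description and the JSON call format are illustrative stand-ins, not the MCP wire format or any vendor's schema.

```python
import json

# Hypothetical in-memory corpus standing in for a real search backend.
DOCS = {
    "refund-policy": "Refunds are issued within 14 days of purchase.",
    "shipping": "Orders ship within 2 business days.",
}


def search_docs(query: str, limit: int = 3) -> list[dict]:
    """Toy keyword match; a real tool would call your hybrid search."""
    terms = query.lower().split()
    hits = [
        {"id": doc_id, "text": text}
        for doc_id, text in DOCS.items()
        if any(term in text.lower() for term in terms)
    ]
    return hits[:limit]


# What the model is told about the tool (illustrative shape only).
TOOLS: dict[str, dict] = {
    "search_docs": {
        "description": "Search internal documents and return matching passages.",
        "parameters": {"query": "string", "limit": "integer (optional)"},
        "handler": search_docs,
    }
}


def handle_tool_call(call_json: str) -> str:
    """Execute a tool call emitted by the model and return a JSON result."""
    call = json.loads(call_json)
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        return json.dumps({"error": f"unknown tool: {call.get('name')}"})
    result = tool["handler"](**call.get("arguments", {}))
    return json.dumps({"result": result})
```

The agent decides whether to emit a call such as `{"name": "search_docs", "arguments": {"query": "refunds"}}`. Nothing is fetched until it does.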
What Should You Build in 2026?
Stop treating retrieval as the system and start treating it as one tool the system calls. The pragmatic architecture is an agent that decides when and how to retrieve, sitting on a hybrid-search baseline, with specialised routes for the queries vector search handles badly.
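As a rough sketch of the specialised-routes idea, the router below sends a query to one of three backends. It is deliberately naive: the cue lists are hypothetical, substring matching will misfire on words like "account", and a production router would more likely be a classifier or the agent's own tool choice.

```python
from enum import Enum


class Route(Enum):
    HYBRID = "hybrid_search"
    GRAPH = "graph"
    SEMANTIC_LAYER = "semantic_layer"


# Hypothetical cue phrases, for illustration only.
METRIC_CUES = ("revenue", "average", "total", "how many", "count", "growth")
RELATIONSHIP_CUES = ("related to", "connected to", "reports to", "depends on", "owned by")


def route_query(query: str) -> Route:
    """Pick a retrieval backend from simple keyword cues; default to hybrid search."""
    q = query.lower()
    if any(cue in q for cue in METRIC_CUES):
        return Route.SEMANTIC_LAYER
    if any(cue in q for cue in RELATIONSHIP_CUES):
        return Route.GRAPH
    return Route.HYBRID
```

Hybrid search is the default branch on purpose: when the router is unsure, it falls back to the general-purpose baseline rather than a specialised store.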
In practice:
- Default to hybrid search plus a reranker. Pure vector search is no longer the baseline. It is the fallback.
- Make retrieval a tool the agent invokes, not a fixed step bolted to the front of the prompt.
- Route by query type. Lookups go to hybrid RAG, relationship questions to a graph, metrics to a semantic layer. One retrieval strategy for everything is how you end up with an agent that is confidently wrong.
- Engineer the window deliberately. Load just-in-time, compress when it fills, drop aggressively. Fight context rot on purpose rather than hoping a bigger window saves you.
- Measure retrieval quality as rigorously as you measure everything else. That discipline is what separates a production system from a demo. If you cannot score your retrieval, you cannot improve it.
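Scoring retrieval does not need heavy tooling to get started. The sketch below computes two standard metrics, recall@k and mean reciprocal rank, over a hand-labelled set of queries. The function names are illustrative, and it assumes you have a list of relevant document IDs for each test query.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

Tracking these numbers across changes to chunking, fusion or reranking is what turns a tweak into a measured improvement.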
So Is It Worth Rebuilding?
If your retrieval layer was designed in 2023, yes. The cost of leaving it is an agent that fetches the wrong thing, buries the right thing in a bloated prompt, and fails in ways no model upgrade will fix.
The 'RAG is dead' headline was always lazy. Retrieval did not die. It grew up, got absorbed into a broader discipline, and stopped pretending to be the whole architecture. The teams still bolting a vector database onto their documents and calling it finished are the ones who will spend the rest of the year wondering why their agent cannot be trusted.
Context engineering is the real job: deciding, at every step, what a model should see. Get that right and the model has a fighting chance. Get it wrong and no context window is large enough to save you.
If you are building agents and your retrieval layer still looks like that 2023 tutorial, we should talk.


