April 22, 2026

RAG Fundamentals: From an IR Perspective

What junior engineers miss about RAG, and what scientists struggle to articulate, observed from interviews, client conversations, and the gap between IR research and development.


Where This Comes From

This is not a textbook post. It is a collection of patterns I kept seeing in technical interviews with junior engineers, and in conversations with researchers who came to us as clients trying to make sense of why their RAG systems weren’t working.

There are many RAG tutorials out there. Many engineers follow them blindly, miss the nuances, and then, further into development, cannot troubleshoot or pivot when something is not working. I am writing this post to draw attention to the parts that matter.

What Junior Engineers Consistently Miss

They Treat Retrieval as a Solved Step

The most common pattern in interviews: the candidate had a strong grasp of the generation side (prompt engineering, context formatting, output parsing), but when I asked about the retrieval step, the answer was always some version of “you embed the query, embed the documents, take the top-k by cosine similarity.”

That’s not wrong. But it stops exactly where the interesting questions start.

How do you evaluate whether your retrieved chunks are actually relevant?
How do you choose your embedding model, indexing approach, or retrieval approach overall?
What happens when the query is ambiguous or underspecified?
How does retrieval quality degrade as your corpus grows or shifts?

Evaluation is one of the most important steps: you need a measure of usefulness. In retrieval, the standard metrics are Recall@k, Precision@k, and NDCG@k. Know these before you design your RAG system, and understand when each matters. Optimize for recall when you need all the relevant context your corpus contains. Optimize for precision when your context window is limited, or when you’re worried about the model getting confused by noisy, off-topic chunks. Some engineers skip this and go straight to the more exciting generation step. Their version of evaluation is: “I looked at retrieved documents for a few samples, they looked fine.” That is a red flag. Evaluations need to be backed by numbers, not eyeballing.
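To make the three metrics concrete, here is a minimal sketch using binary relevance (a chunk is either relevant or not). The function names and the example IDs are my own illustration, and it assumes each ranked list contains unique chunk IDs.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: rewards relevant items more when ranked higher."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: 5 retrieved chunks, 3 relevant chunks in the corpus.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(ranked, relevant, 5))  # 2 of 5 retrieved are relevant
print(recall_at_k(ranked, relevant, 5))     # 2 of 3 relevant were retrieved
print(ndcg_at_k(ranked, relevant, 5))       # penalized: hits sit at ranks 2 and 4
```

Precision and recall ignore order within the top k; NDCG is the one that tells you whether the relevant chunks are near the top, which matters when only the first few chunks reach the prompt.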

Retrieval Has Many Moving Parts

Document quality, document length, index type, keyword search, vector search, re-ranking pipeline, similarity metric: these are all parameters to tune, separate from how many documents you retrieve.

Chunking Is an Afterthought

Many engineers treat chunking as a preprocessing detail: pick a chunk size, maybe add some overlap, move on. In practice, chunking strategy is one of the highest-leverage decisions in a RAG pipeline. You need to know the structure of your data and chunk it accordingly.

Things I rarely heard candidates reason about:

Chunk boundaries and how they break semantic units
Hierarchical chunking (summary + detail)
Whether chunks should respect document structure (sections, paragraphs, tables)
The relationship between chunk size and the retrieval model’s sweet spot
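As one illustration of respecting document structure, here is a minimal sketch that packs whole paragraphs into chunks instead of cutting at a fixed character offset. It assumes paragraphs are separated by blank lines; the function name and the `max_chars` default are hypothetical, and real documents with tables or headings need more than this.

```python
import re
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 800) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A paragraph is never split. A single paragraph longer than max_chars
    becomes its own oversized chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for para in paragraphs:
        # +2 accounts for the blank line used when joining paragraphs.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The design choice here is that a semantic unit (the paragraph) wins over a uniform chunk size. Whether that trade is right depends on your retrieval model and your data, which is exactly the reasoning I wanted to hear from candidates.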

What Scientists (Our Clients) Struggled to Articulate

This is a different kind of gap. In some ways it is a continuation of the previous section: engineering teams spend considerable time building RAG pipelines, yet miss many fundamentals as they overcomplicate things in pursuit of the best possible outcome.

“It Finds Something But Not the Right Thing”

This was the most common complaint. The system retrieved something — often a plausible-sounding chunk — but not the specific piece of knowledge the scientist needed. The LLM then confidently generated an answer from the wrong evidence.

What they were describing, without the vocabulary, is actually two separate problems that often look identical from the outside.

The first is a retrieval problem: high recall, low precision. The system casts a wide net and returns chunks that are topically adjacent but not specifically relevant. The second is a faithfulness problem: the right chunks are retrieved, but the LLM generates an answer that goes beyond or contradicts what the evidence actually says. These fail differently and need to be diagnosed separately. You cannot fix a faithfulness problem by improving your index.

Another common complaint is that queries are just too hard for the system to handle. Before assuming that, do systematic troubleshooting: build test cases that isolate which part of the pipeline is the bottleneck. You may find the retrieval step is fine and the generation step is the problem, or vice versa.
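One way to make that troubleshooting concrete is to label each test case by where it failed. The sketch below is an assumption about how you might structure it: `EvalCase` and its fields are hypothetical names, and `answer_correct` is whatever judgment (human or automated) you already trust for grading answers.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class EvalCase:
    question: str
    gold_chunk_ids: Set[str]   # chunks known to contain the answer
    retrieved_ids: List[str]   # what the retriever actually returned
    answer_correct: bool       # graded outcome of the generated answer


def diagnose(case: EvalCase) -> str:
    """Attribute a test case to the first pipeline stage that failed."""
    if not case.gold_chunk_ids & set(case.retrieved_ids):
        # The evidence never reached the LLM. This is counted as a retrieval
        # miss even if the answer happened to be right from model memory.
        return "retrieval_miss"
    if not case.answer_correct:
        # The evidence was in the context, so the generation step failed.
        return "generation_failure"
    return "ok"


def bottleneck_report(cases: Iterable[EvalCase]) -> Counter:
    """Count test cases per failure category."""
    return Counter(diagnose(case) for case in cases)
```

If most failures land in `retrieval_miss`, work on the index, chunking, or queries. If they land in `generation_failure`, no amount of index tuning will help.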

It is also worth knowing that tuning indices and choosing better embedding models only goes so far. Semantic search is often only marginally better than keyword search for specific, narrow queries, despite the common assumption that switching to the best available embedding model (OpenAI’s, say) will fix retrieval. The BEIR benchmark (Thakur et al., 2021) showed this directly: in zero-shot evaluation, BM25 is competitive with or outperforms many dense retrievers on a number of domain-specific test sets. Research on query rewriting for RAG (Ma et al., 2023) and hypothetical document embeddings (Gao et al., 2023) has shown that changing how a query is phrased or represented before retrieval can substantially improve retrieval performance. Making questions more precise, breaking them down, and aligning them to the vocabulary and style of the corpus can be the highest-leverage thing you do.

“It Was Confident About Something Wrong”

Instruction-tuned LLMs are optimized to produce fluent, helpful-sounding responses. They generate what is plausible given the context, and when the context contains a partial or adjacent match, the model will often extrapolate confidently rather than flag that the evidence is insufficient. This is not a flaw in the model’s reasoning per se; it is a consequence of how these models are trained to behave. They can express uncertainty, but they need to be explicitly told when to do so.

The first fix is simple: instruct the model explicitly in the prompt. For example: “If the retrieved context does not contain enough information to answer the question, say so. Do not fill in the gap.” This alone can significantly reduce confident wrong answers.
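A minimal sketch of such a prompt, built as a plain string so it works with any LLM client. The exact wording, the `ABSTAIN_MESSAGE` constant, and the function name are my own assumptions, not a tested recipe; tune the phrasing against your own evaluation set.

```python
from typing import Sequence

# Hypothetical fixed abstention string; a fixed phrase makes abstentions
# easy to detect and count downstream.
ABSTAIN_MESSAGE = "The provided context does not contain enough information to answer."


def build_prompt(question: str, chunks: Sequence[str]) -> str:
    """Assemble a grounded-answer prompt with an explicit abstain instruction."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain enough information to answer, "
        f'reply exactly: "{ABSTAIN_MESSAGE}"\n'
        "Do not fill gaps with outside knowledge or guesses.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering the chunks is optional, but it lets you ask the model to cite which chunk supports each claim, which helps when you later check faithfulness.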

What IR Research Actually Gives You Here

If you’ve worked in IR, you have a vocabulary and a set of tools that apply directly to these problems. I am going to explain a few here so you get the idea of what to look for.

Query Performance Prediction

This is one of the IR concepts that does not get talked about enough in RAG contexts. Query Performance Prediction (QPP) is about estimating how difficult a query will be for your retrieval system before you see the actual results. I spent part of my master’s research on QPP. It can look like an academic detail, but it is a strong practical signal for knowing when your system is about to fail.

The intuition is simple: not all queries are equally answerable. Some are clear, specific, and well-aligned with your corpus vocabulary. Others are vague, ambiguous, or use terminology your documents do not use. QPP gives you a way to identify the second category without waiting for users to complain about bad answers.

In practice, simple signals work surprisingly well. If retrieved documents have very similar scores, the system is uncertain and cannot clearly distinguish relevant from irrelevant. If scores are spread out, with one or two documents ranking significantly higher, the retrieval is more confident. You can use score variance, the gap between the top document and the rest, or pre-retrieval signals like the IDF of query terms to flag hard queries.
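The signals above can be sketched in a few lines. This is an illustration under assumptions, not a calibrated predictor: the function names are mine, the thresholds in `is_hard_query` are placeholders you would fit on queries with known outcomes, and the scores are assumed to come from a single retriever so they are comparable.

```python
import math
import statistics
from typing import Dict, Sequence


def score_spread(scores: Sequence[float]) -> float:
    """Post-retrieval signal: standard deviation of the retrieved scores."""
    return statistics.pstdev(scores) if len(scores) > 1 else 0.0


def top_gap(scores: Sequence[float]) -> float:
    """Post-retrieval signal: best score minus the mean of the rest."""
    if len(scores) < 2:
        return 0.0
    ordered = sorted(scores, reverse=True)
    return ordered[0] - statistics.mean(ordered[1:])


def avg_query_idf(
    query_terms: Sequence[str], doc_freq: Dict[str, int], num_docs: int
) -> float:
    """Pre-retrieval signal: mean smoothed IDF of the query terms.

    A low value suggests the query is made of generic terms. Terms that never
    occur in the corpus get a high IDF here, so count them separately if you
    also want to detect vocabulary mismatch.
    """
    if not query_terms:
        return 0.0
    idfs = [
        math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        for term in query_terms
    ]
    return sum(idfs) / len(idfs)


def is_hard_query(
    scores: Sequence[float], min_spread: float = 0.05, min_gap: float = 0.1
) -> bool:
    """Flag a query when scores are flat and no result stands out.

    The thresholds are placeholders and depend on your scoring function.
    """
    return score_spread(scores) < min_spread and top_gap(scores) < min_gap


print(is_hard_query([0.71, 0.70, 0.70, 0.69]))  # flat scores
print(is_hard_query([0.92, 0.55, 0.50, 0.48]))  # one clear winner
```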

Why does this matter for RAG? Because it gives you a routing mechanism. Hard queries can be flagged, rephrased, or handled differently before the LLM ever sees them. You do not need to wait for a bad answer to know retrieval was likely to fail.

Re-ranking and the Two-Stage Retrieval Pattern

Re-ranking is one of the more interesting approaches in retrieval. It is slower, but it improves precision significantly. The pattern is two stages:

First stage: retrieve a broad candidate set, top 100 or top 1000, using a fast method like BM25. Second stage: re-rank that candidate set using a more expensive, more accurate model. The LLM then only sees a small, high-precision shortlist.

The second stage is typically a cross-encoder, a model that takes a query and document together as input and scores their relevance jointly. This is more accurate than bi-encoder retrieval because the query and document interact directly. But it is too slow to run over an entire corpus. Running it over 100 candidates instead of 100,000 is what makes it practical.
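Here is a minimal, dependency-free sketch of the two-stage pattern. The first stage is a small BM25 implementation (whitespace tokenization, common default parameters). The second stage takes any `rerank_score(query, doc_text)` callable: in a real system that would wrap a cross-encoder, while the `overlap_score` function below is only a toy stand-in so the example runs without a model. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class BM25:
    """Small in-memory BM25 index for the first (cheap, broad) stage."""

    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(tf.values()) for doc_id, tf in self.tf.items()}
        self.avg_len = sum(self.length.values()) / max(len(docs), 1)
        doc_freq = Counter(term for tf in self.tf.values() for term in tf)
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query: str, doc_id: str) -> float:
        tf, doc_len = self.tf[doc_id], self.length[doc_id]
        total = 0.0
        for term in tokenize(query):
            freq = tf.get(term, 0)
            if freq == 0:
                continue
            norm = freq + self.k1 * (1 - self.b + self.b * doc_len / self.avg_len)
            total += self.idf[term] * freq * (self.k1 + 1) / norm
        return total

    def top_n(self, query: str, n: int) -> List[str]:
        # Returns up to n doc IDs, best first (zero-score docs are not filtered).
        return sorted(self.tf, key=lambda d: self.score(query, d), reverse=True)[:n]


def two_stage_search(
    query: str,
    docs: Dict[str, str],
    bm25: BM25,
    rerank_score: Callable[[str, str], float],
    n_candidates: int = 100,
    k: int = 5,
) -> List[str]:
    """Stage 1: broad BM25 candidates. Stage 2: re-rank only those candidates."""
    candidates = bm25.top_n(query, n_candidates)
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query, docs[d]), reverse=True
    )
    return reranked[:k]


def overlap_score(query: str, doc_text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms in the doc."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(doc_text))) / len(query_terms)
```

The point of the structure is that the expensive scorer is called `n_candidates` times per query, not once per document in the corpus. Swapping `overlap_score` for a real cross-encoder changes nothing else in the pipeline.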

This pattern directly addresses the precision problem I described earlier. Your first stage casts a wide net, your re-ranker filters it down. If you are still seeing precision problems after adding re-ranking, the next places to look are: candidate set size, re-ranker model quality, and whether your chunks are well-structured enough to re-rank in the first place.

Evaluation: You Can’t Improve What You Don’t Measure

The hardest part of RAG evaluation is that you usually do not have ground truth relevance labels for your specific corpus. Unlike general IR benchmarks, your domain is specific, your documents are internal, and nobody has annotated which chunks are relevant to which questions.

The practical approach: generate a synthetic evaluation set from your documents. Use an LLM to produce questions that each document can answer, then use those question-document pairs as your ground truth. It is not perfect, but it is vastly better than nothing, and it gives you something concrete to measure against as you change the pipeline.
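A minimal sketch of that loop, with the LLM and the retriever passed in as plain callables so the harness itself stays independent of any vendor. `generate_questions` and `retrieve` are hypothetical interfaces you would implement around your own model and index.

```python
from typing import Callable, Dict, List, Tuple

# (question, id of the document the question was generated from)
QAPair = Tuple[str, str]


def build_synthetic_set(
    docs: Dict[str, str],
    generate_questions: Callable[[str], List[str]],
) -> List[QAPair]:
    """Ask the generator for questions per document; the source doc is the label."""
    pairs: List[QAPair] = []
    for doc_id, text in docs.items():
        for question in generate_questions(text):
            pairs.append((question, doc_id))
    return pairs


def hit_rate_at_k(
    pairs: List[QAPair],
    retrieve: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Share of questions whose source document appears in the top-k results.

    With exactly one labeled document per question, this equals Recall@k.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for question, doc_id in pairs if doc_id in retrieve(question, k))
    return hits / len(pairs)
```

One caveat with this setup: other documents may also answer a generated question, so a miss is not always a true retrieval failure. Treat the number as a relative signal for comparing pipeline versions.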

What to measure at each stage:

Retrieval: Recall@k, Precision@k, NDCG@k against your annotated or synthetic set
Faithfulness: does the generated answer contradict or go beyond what the retrieved context says
Answer relevance: does the generated answer actually address the question

Measuring end-to-end quality alone tells you something is broken, not where. Measuring each component separately is what lets you diagnose and improve.

The second thing people skip is measuring over time. Corpora change, queries shift, and models get updated. Retrieval quality that was acceptable six months ago may have degraded without anyone noticing. Build evaluation into your pipeline as a recurring check, not a one-time gate before launch.

Closing Thoughts

The problems people run into with RAG are not new. They are retrieval problems — retrieval has been a hard, well-studied, still-unsolved area for decades. What’s new is the fluency of the generation layer, which makes failures harder to spot and easier to overlook.

The engineers who will build the most reliable RAG systems are the ones who treat retrieval with the same seriousness they give to the model.