RAG sounds complicated. Chunking strategies, vector databases, embedding models, rerankers - the ecosystem makes it look like a major infrastructure project. But the core idea is simple: find the relevant text, then ask the model to answer using it.
This post builds a complete RAG pipeline using two packages: NumPy and the OpenAI API. No LangChain. No vector database. No framework abstraction. Just the mechanics, exposed.
Objective
We’ll build an assistant for a grain bin construction manual using gpt-4o-mini with a straightforward retrieval strategy: embed the manual, find the most relevant passages for each question, and pass them as context.
Three questions drive our evaluation:
- "What is the min and max torque for the bolts?"
  - The answer is in a table on page 25.
  - Can we extract and format the exact values correctly?
- "What voids the warranty?"
  - Warranty information is scattered across several parts of the document.
  - Can we retrieve the disconnected pieces coherently?
- "How do you install a 2-ring door?"
  - A specific section with detailed instructions and images.
  - Can we retrieve a complete, ordered list of steps?
The torque question is our litmus test: one specific answer, verifiable on sight.
Setup
You’ll need an OPENAI_API_KEY from your OpenAI account.
Python Setup

```python
from dotenv import load_dotenv
load_dotenv()  # Load environment variables from .env
```

.env File

```
OPENAI_API_KEY=sk-123...
```

Next, set up the OpenAI client:
```python
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from environment

def get_chat_response(question):
    response = client.chat.completions.create(
        messages=[
            {
                "role": "user",      # Question asked by user
                "content": question, # The actual question
            }
        ],
        model="gpt-4o-mini",  # The chat model to use
    )
    return response.choices[0].message.content  # Extract the response string
```

The get_chat_response function is everything we need to talk to gpt-4o-mini. That's it.
The OpenAI package conveniently looks for the OPENAI_API_KEY environment variable and loads it automatically.
To simplify asking our base questions, we put them in an enum called EvalQ (short for evaluation questions).
```python
from enum import StrEnum

class EvalQ(StrEnum):
    TORQUE = "What is the min and max torque for the bolts?"
    WARRANTY_VOID = "What voids the warranty?"
    TWO_RING_DOOR = "How do you install a 2 ring door?"
```

The data source is a PDF installation manual for the CB34 Grain Bin Sidewall - detailed construction instructions covering everything except the roof. Load it and split it into chunks for retrieval:
```python
from pypdf import PdfReader

manual = 'grainbin_manual.pdf'
reader = PdfReader(manual)
manual_pages = [page.extract_text() for page in reader.pages]    # Split into pages
manual_string = "\n\n".join(manual_pages)                        # One string
manual_words = manual_string.split(" ")                          # Split into words
manual_pages_words = [page.split(" ") for page in manual_pages]  # Words per page
```

Key Terminology
| Name | What | Example |
|---|---|---|
| encoding | Text converted to tokens | "Hi there" → [123, 2, 0, 2] |
| tokens | Numeric representation of text chunks | Individual pieces of text |
| embedding | Text made into numerical vectors | High-dimensional vector space |
| chat model | Generative AI model | gpt-4o-mini |
| embedding model | Converts text into numerical values | text-embedding-3-small |
Baseline Answer
Before measuring how well the RAG pipeline performs, we need a baseline. Ask the model the same questions with no additional context and see what it says on its own.
Prepare Question
Each question gets the context for a CB34 Chief Industries Grain Bin Sidewall so the model knows what it’s being asked about.
```python
def add_withwhat(question):
    return f"{question} This is for a CB34 Grain Bin Sidewall from Chief Industries."

torque_question_withwhat = add_withwhat(EvalQ.TORQUE)
torque_answer_withwhat = get_chat_response(torque_question_withwhat)
```

Here is what the question looks like now and the answer we get from the chat model.
Question: What is the min and max torque for the bolts? This is for a CB34 Grain Bin Sidewall from Chief Industries.
Answer: “I don’t have specific information about the torque specifications for the bolts used in the CB34 Grain Bin Sidewall from Chief Industries. For accurate details, I recommend checking the installation manual…” (approximately 80 words)
The model is telling you to look it up somewhere else, which isn’t helpful. And it took 80 words to say nothing.
Token count is a proxy for both cost and response time. When a model doesn’t know something, it should say so briefly.
```python
def question_template_idontknow(question):
    return f"""
    {question} If you don't know the answer just say I don't know.
    """

torque_question_idontknow = question_template_idontknow(torque_question_withwhat)
torque_answer_idontknow = get_chat_response(torque_question_idontknow)
```

Answer: "I don't know."
That is a drastic improvement. In 3 words we now know that the chat model cannot provide the answer we are looking for.
A simple prompt modification produced a concise, honest answer, and it held consistent across all 3 questions.
Prompt Template
One last refinement: we need a more generic way to modify our questions. Up until now we have relied on individual functions, and that approach is becoming unwieldy.
To streamline this process, let’s create a prompt template that ensures our questions follow best practices when querying the model.
```python
def generate_prompt(question, context=""):
    return f"""
    You are a Chief Industries bin dealer. Use the following pieces of retrieved context to answer the question.
    If you don't know the answer, just say that you don't know.
    If appropriate write out as a table and only answer the exact question given. Do not offer safety advice. No need to restate the question in any way.

    Context:
    {context}

    Question:
    {question}

    Answer:
    """
```

Key elements of the template:
- Tell the chat model who to act as and what to look for in the question
- “I don’t know” addition
- Only respond with actionable language instead of fine print and repetition
- Context section for the grain bin manual
- The question being asked
- Answer prompt for expected formatting
Having this template allows us to not have to dial in boilerplate every time we want to get a question answered. Let’s see if this makes a difference in responses.
```python
torque_prompt_question = generate_prompt(EvalQ.TORQUE)
torque_prompt_answer = get_chat_response(torque_prompt_question)
```

All three questions now return "I don't know" in a consistent, concise manner.
This is the baseline we are going to improve on.
Encoding
Tokens represent the way text is broken down for the chat model’s consumption, and context length refers to the number of tokens batched together. These two concepts are crucial in determining how long a query will take and its associated cost.
We are using gpt-4o-mini, OpenAI's current budget option. Even with a cheap model, it is easy to have an explosion of tokens being sent and received. Here are the pricing details for gpt-4o-mini:
- 15¢ per million input tokens
- 60¢ per million output tokens
The number of tokens used when asking a question and receiving an answer impacts both cost and computation time. The maximum number of tokens a model can handle at once is known as the context window. A larger context window allows for more data to be processed but also increases computation time and cost.
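At those prices, a back-of-the-envelope cost estimate is just arithmetic. The token counts below are hypothetical, chosen to match the whole-manual scenario discussed later:

```python
# Hypothetical token counts for one question/answer round trip
input_tokens = 18_000   # e.g. the entire manual stuffed into the prompt
output_tokens = 300     # a short answer back

# gpt-4o-mini pricing: $0.15 per 1M input tokens, $0.60 per 1M output tokens
input_cost = input_tokens / 1_000_000 * 0.15
output_cost = output_tokens / 1_000_000 * 0.60
total_cost = input_cost + output_cost

print(f"${total_cost:.5f} per call")  # prints $0.00288
```

Fractions of a cent per call, but it compounds quickly across many questions - which is the economic argument for trimming context.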
See What You Got
Encoding is the process of converting our text into tokens. This step is normally hidden inside the embedding model, but doing it yourself shows exactly what the model consumes and is a good step in exploratory analysis.
The tiktoken package allows us to count the number of tokens in a piece of text locally. It uses Byte Pair Encoding (BPE) for encoding which has the property that it can go from text to tokens and back to text which allows for examination of the encoding process.
```python
import tiktoken

model_encoding = tiktoken.encoding_for_model("gpt-4o-mini")
manual_string_encoded = model_encoding.encode(manual_string)
```

The entire manual contains approximately 18,000 tokens - approaching some embedding model limits and a meaningful chunk of the 128k context window. Answering any individual question doesn't require the full document. Constraining context to only the relevant passages reduces noise and cuts cost.
How Does It Break Down
Understanding what is being encoded is crucial for effectively managing and optimizing our use of chat models.
By gaining insight into this process we’ll be better equipped to make informed decisions about constraining our context to what is relevant. Below are the first 5 tokens that are encoded from our data:
- Encoding 1: (space)
- Encoding 2: CB
- Encoding 3: 34
- Encoding 4: -
- Encoding 5: Installation
The first encoding represents a space. Subsequent encodings involve letters and numbers with a full word not appearing until the 5th encoding.
Most tokens are short - typically 1-4 characters, with 2-3 characters being the most common - but they do not have to be.
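Tallying that length distribution is straightforward once tokens are decoded back to strings. The token strings below are made up for illustration; the real ones would come from decoding each token in `manual_string_encoded`:

```python
from collections import Counter

# Hypothetical decoded token strings - real ones would come from
# [model_encoding.decode([t]) for t in manual_string_encoded]
decoded_tokens = [" ", "CB", "34", "-", " Installation",
                  " the", " bin", "bolt", "s", " torque"]

length_counts = Counter(len(tok) for tok in decoded_tokens)
print(sorted(length_counts.items()))  # (length, count) pairs
```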
Now we know how text is broken down into tokens. Let’s break up the manual so we can keep our token count more constrained when interacting with our chat model.
Split Up the Manual
Splitting the text will allow us to not have to use the entire dataset every time we make a call to our chat model. Here are 3 simple but effective strategies for splitting up the manual:
- Split by page: This is done when the file was read
- Split by word count: This gives a more granular approach and is not restricted by page start and stop
- A hard cutoff could cut something off in the middle of explanation
- Recursive text splitting: Create a window of overlapping text where we control the length and the overlap
- Overlapping the windows of text reduces the hard cutoff problem from splitting by word
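The overlapping-window idea is easy to sketch in plain Python before reaching for a library. The window and step sizes here are illustrative, not the ones used for the manual:

```python
def windowed_split(words, window, step):
    """Overlapping windows of words; trailing windows shorter than `window` are dropped."""
    return [
        " ".join(words[i:i + window])
        for i in range(0, len(words), step)
        if len(words[i:i + window]) == window
    ]

words = "one two three four five six seven eight nine ten".split()
chunks = windowed_split(words, window=4, step=2)
# Each chunk overlaps the next by window - step = 2 words
print(chunks[0])  # one two three four
print(chunks[1])  # three four five six
```

Because consecutive windows share words, a sentence cut off at the end of one window reappears intact near the start of the next.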
```python
import more_itertools as mi

manual_page_split = manual_pages
# Chunks of ~100 words with no overlap
manual_word_split = [" ".join(chunk) for chunk in mi.chunked(manual_words, 101)]
# Overlapping windows of 98 words, stepping 60 words at a time;
# windowed() pads the final window with None, so drop it
manual_recursive_split = [" ".join(window) for window in mi.windowed(manual_words, 98, step=60) if None not in window]
```

The next step is embedding our text chunks using text-embedding-3-small. First, inspect the tokenizer for this model to understand why constraining input length matters.
```python
embedding_model_encoding = tiktoken.encoding_for_model("text-embedding-3-small")
manual_page_split_encoded = embedding_model_encoding.encode_batch(manual_page_split)
manual_word_split_encoded = embedding_model_encoding.encode_batch(manual_word_split)
manual_recursive_split_encoded = embedding_model_encoding.encode_batch(manual_recursive_split)
```

Embeddings
An embedding model converts text into numbers. In our case it also takes care of encoding the data for us, but we still need to be aware of it for context window limits and billing purposes.
With a numerical representation of our text we can treat them as context aware vectors whose context is given by the data the embedding model utilized and how it was trained.
This is where we start getting into the retrieval part of our RAG setup. How do we add the relevant parts of our data as context to the original question?
Let’s connect to OpenAI’s text-embedding-3-small embedding model and see what that looks like.
Measuring Context
What is context? Context is anything relevant to the question that allows the answerer to have the information required to construct an answer.
To measure context then we need to have a way of scoring similarity between two pieces of text embedded into a context aware vector space.
A common measure of similarity is Cosine Similarity, which measures the angle between 2 vectors: values near 1 mean the vectors point in the same direction, and values near 0 mean they are unrelated. (It can go as low as -1 for opposed vectors, though embedding similarities rarely do.)
```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```

Let's do a quick hello world with the Cosine Similarity metric and the text-embedding-3-small model.
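Before involving an embedding model at all, it helps to see the metric behave on toy vectors. The vectors below are made up for illustration:

```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, just longer
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a

print(cosine_similarity(a, b))  # 1.0 - identical direction
print(cosine_similarity(a, c))  # 0.0 - unrelated
```

Note that b scores 1.0 despite being twice as long as a: cosine similarity only cares about direction, not magnitude.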
Hello World
```python
def get_embeddings(text, model_name="text-embedding-3-small"):
    return client.embeddings.create(input=text, model=model_name).data[0].embedding

hello_embedding = get_embeddings("Hello world")
hello_again_embedding = get_embeddings("Hello world again")
dinos_embedding = get_embeddings("Dinosaurs are real")
```

The text-embedding-3-small model returns a vector of 1536 float values. These values are centered and standardized and don't offer much information on their own.
What Makes the Cut
Using the similarity score, let's see if we can extract the correct information from each of our splits.
```python
manual_page_split_embeddings = [get_embeddings(split) for split in manual_page_split]
manual_word_split_embeddings = [get_embeddings(split) for split in manual_word_split]
manual_recursive_split_embeddings = [get_embeddings(split) for split in manual_recursive_split]
```

Now we get the embeddings for our questions and compare them to the split-text embeddings from above.
```python
torque_embeddings = get_embeddings(EvalQ.TORQUE)
warrantyvoid_embeddings = get_embeddings(EvalQ.WARRANTY_VOID)
tworingdoor_embeddings = get_embeddings(EvalQ.TWO_RING_DOOR)

def compare_embeddings(compare_with, compare_to):
    results = []
    for embedding in compare_with:
        results.append(cosine_similarity(compare_to, embedding))
    return results

torque_pages_similarity_scores = compare_embeddings(manual_page_split_embeddings, torque_embeddings)
torque_word_similarity_scores = compare_embeddings(manual_word_split_embeddings, torque_embeddings)
torque_recursive_similarity_scores = compare_embeddings(manual_recursive_split_embeddings, torque_embeddings)
```

Each splitting strategy has different characteristics:
- Page split: Shows a clear separation with similarity scores, with one page (page 25) having significantly higher score (~0.65) containing the torque information
- Word split: Highest similarity score (~0.68) but information is cut off in the middle
- Recursive split: More consistent scores with the relevant information spread across overlapping windows
The text-embedding-3-small model allows at most 8,191 tokens per input; the entire manual would exceed this, so splitting is required regardless. Of the three strategies, the recursive text split gives a good mix of lowering our token count while also lowering our chances of missing out on context.
RAG Time
Now we have all the elements to create our RAG pipeline:
- Load PDF file as a string
- Split string text recursively
- Embed each split of text using OpenAI’s text-embedding-3-small model
- Embed the question being asked with the same model
- Retrieve 2 most similar text splits to our question
- Input question and the similar splits into our prompt template
- Query OpenAI chat model with our augmented question
Each step here is deliberately simple, and every one of them could be optimized further. That will be for another article.
I Know Too Much
The chat model lacks the information needed to provide a useful response. The manual has fewer tokens than gpt-4o-mini's context window limit, so let's look at what happens when we add the entire manual to our prompt template. This is essentially augmenting our question without retrieval.
```python
torque_question_manual_context = generate_prompt(EvalQ.TORQUE, context=manual_string)
warrantyvoid_question_manual_context = generate_prompt(EvalQ.WARRANTY_VOID, context=manual_string)
tworingdoor_question_manual_context = generate_prompt(EvalQ.TWO_RING_DOOR, context=manual_string)

torque_answer_manual_context = get_chat_response(torque_question_manual_context)
warrantyvoid_answer_manual_context = get_chat_response(warrantyvoid_question_manual_context)
tworingdoor_answer_manual_context = get_chat_response(tworingdoor_question_manual_context)
```

Whoa! Another vast improvement without much effort. All 3 questions now get detailed, accurate answers from the manual.
Augment with Context
Can we get the same or better results if we reduce the extraneous information given as context?
No doubt our computation times improve and cost goes down but quality of the response should not suffer. In fact it can be improved since this is essentially removing noise from the context.
Retrieve the relevant splits for each question using the same approach from the “What Makes the Cut” section:
```python
def create_final_answer(question, question_embedding,
                        text_split_embeddings=manual_recursive_split_embeddings,
                        text_splits=manual_recursive_split, k=2):
    similarity_scores = compare_embeddings(text_split_embeddings, question_embedding)
    top_k_idxs = np.argsort(similarity_scores)[-k:]  # Indices of the k most similar splits
    context_string = "\n".join([text_splits[i] for i in top_k_idxs])
    final_question = generate_prompt(question, context_string)
    final_answer = get_chat_response(final_question)
    return final_answer

torque_final_answer = create_final_answer(EvalQ.TORQUE, torque_embeddings, k=2)
```

Torque Question Results (k=2)
Answer: The manual specifies min torque of 30 ft-lbs and max torque of 35 ft-lbs for the bolts.
That is a perfect answer! Two retrieved chunks were all we needed, and trimming the input tokens that aggressively is a major cost and speed win.
Warranty Question Results
This information is strewn throughout the manual, so we need more context:
- k=2: Missing context
- k=12: Better but still incomplete
- k=22: Good coverage of warranty void conditions
- k=150: Most comprehensive answer
Since the information is found throughout the document it is not surprising that we need to have a lot more context than we did when we asked about the torque values.
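The only retrieval knob being turned here is k, the number of top-scoring chunks fed into the prompt. The selection step itself is just a slice of an argsort; the similarity scores below are made up to show the mechanics:

```python
import numpy as np

# Hypothetical similarity scores for 8 chunks
scores = np.array([0.12, 0.55, 0.31, 0.68, 0.22, 0.49, 0.65, 0.18])

for k in (2, 4):
    top_k_idxs = np.argsort(scores)[-k:]  # indices of the k highest-scoring chunks
    print(k, sorted(top_k_idxs.tolist()))
# prints:
# 2 [3, 6]
# 4 [1, 3, 5, 6]
```

Raising k simply widens that slice, pulling in progressively lower-scoring chunks - which is why scattered information like the warranty terms needs a much larger k than a single table.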
Two Ring Door Question Results
This information is on a few pages but lives together:
- k=2: Incomplete instructions
- k=12: Better step-by-step instructions
- k=50: Comprehensive detailed instructions
We get pretty good results, though with less detail than when we used the entire manual as context. Overall, retrieving only the chunks most similar to our question gets us close to full-manual quality at a fraction of the tokens.
Next Steps
Now that we’ve built the RAG pipeline by hand there is plenty of room for optimization. Here are just a few areas where improvements can be made:
- Implement a semantic text splitter for more meaningful text chunking
- Experiment with different embedding spaces to find the best fit for your data
- Explore advanced methods for selecting the most similar documents
- Design a more refined and effective prompt template
- Investigate vector databases and relevant frameworks for enhanced performance
These suggestions are just the beginning. Each of these topics deserves deeper exploration in future articles.
Conclusion
Building a RAG pipeline is straightforward and effective. Even the simple walkthrough we've covered here shows how quickly you can get useful results from your own data. The key insight is that you don't need a vector database or a heavyweight framework to get started. A few hundred lines of Python, a PDF, and two OpenAI API endpoints (chat and embeddings) is all it takes to build something that genuinely works.
Glossary
| Name | What | Aliases |
|---|---|---|
| Retrieval Augmented Generation | Update questions asked of a chat model by adding context | RAG |
| Encoding | Convert bits of string into numbers | tokenization |
| Embedding | Convert bits of string into numerical vectors | vectorization |
| Chat model | Models that you can have a conversation with | gpt-4*, claude, llama |
| Grain bin | Large metal storage container for grain | Bin |