
Let’s build a grain bin with a naive RAG using NumPy and OpenAI

Employ the power of a RAG to build a grain bin

Tags: rag, numpy, grainbin

Author: Brandon Rundquist

Introduction

In this article we will explore how to build a Retrieval Augmented Generation (RAG) pipeline that acts as an assistant for building a grain bin. We will use NumPy for calculations and OpenAI for their embedding and chat models. Using just these two packages lets us focus on the pipeline and its details.

Objective

The goal is to create an assistant for building a grain bin, using a RAG pipeline that outperforms non-augmented generative AI while staying cost effective. We will adopt a naive strategy for improving our interaction with the chat model gpt-4o-mini.

To evaluate our progress toward this goal we have 3 questions that we will use as a test of our system.

  1. “What is the min and max torque used for the bolts?”
    • The answer to this question is provided in a table on page 25.
    • Can we extract and format this exact answer correctly?
  2. “What voids the warranty?”
    • The warranty information is scattered across several parts of the document.
    • How effectively can we retrieve this disconnected information?
  3. “How do you install a 2-ring door?”
    • This question pertains to a specific section in the manual which includes detailed instructions and images.
    • Can we retrieve a comprehensive list of these complex instructions in a coherent manner?

Using these 3 questions will allow us to evaluate how well our strategy for interacting with the chat model is performing.

The question about torque values will be used as our litmus test since it has the most straightforward answer and is the simplest to extract from the manual.

Setup

OpenAI started the generative AI revolution and remains a leader in this field. They offer a cost effective way to achieve great results. To connect to OpenAI you’ll need an API token which you can easily find instructions for obtaining online.

The first step is to make your OPENAI_API_KEY available to your Python program. You can do this by using the dotenv Python package along with a .env file that contains your key.

Python:

from dotenv import load_dotenv
load_dotenv()

.env:

OPENAI_API_KEY=sk-123...

Then we will connect to the OpenAI client and use the gpt-4o-mini model.

from openai import OpenAI

# Connect to OpenAI. The client reads the OPENAI_API_KEY environment variable automatically.
client = OpenAI()

def get_chat_response(question):
    # Set up a chat completion request to ask our question.
    response = client.chat.completions.create(
        messages=[
            {
                # The question is asked in the "user" role. There are other
                # role types but we will not go into those here.
                "role": "user",
                "content": question,
            }],
        # We are using the gpt-4o-mini model.
        model="gpt-4o-mini",
    )
    # Extract the string of the response.
    return response.choices[0].message.content

We will use the get_chat_response function from this point forward to interact with the gpt-4o-mini model. That’s all it takes to get started with OpenAI.

The OpenAI package conveniently looks for the OPENAI_API_KEY environment variable and loads it automatically.

To simplify asking our base questions we put them in an enum called EvalQ which is short for evaluation questions.

from enum import StrEnum
class EvalQ(StrEnum):
    TORQUE = "What is the min and max torque used for the bolts?"
    WARRANTY_VOID = "What voids the warranty?"
    TWO_RING_DOOR = "How do you install a 2 ring door?"

And the final thing we will set up is the data we will be using: the grain bin manual.

The manual (we will refer to it simply as “the manual” from here on) is a PDF file that can be found here. It provides detailed instructions for constructing a CB34 Grain Bin Sidewall, covering everything except the roof. An image of the bin can be seen below.

Here we will load the PDF into a Python object and then split the document in various ways for later use.

from pypdf import PdfReader

manual = 'grainbin_manual.pdf'
reader = PdfReader(manual)

# Split into a list of pages
manual_pages = [page.extract_text() for page in reader.pages]
# The whole manual as one string
manual_string = "\n\n".join(manual_pages)
# Split into a list of words
manual_words = manual_string.split(" ")
# Split into a list of words per page
manual_pages_words = [page.split(" ") for page in manual_pages]
Name            | What                                             | Example
encoding        | Text converted to tokens                         | “Hi there” -> [123, 2, 0, 2]
tokens          | …                                                | …
embedding       | Text made into numerical vectors                 |
chat model      | Generative AI model                              | gpt-4o-mini
embedding model | A model that converts text into numerical values |

Baseline answer

Before we measure how well the RAG pipeline is performing we need something to measure it against. To do that we will ask the chat model, without any context, to see what a non-augmented answer looks like.
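The model's output is not reproduced here, but as a minimal sketch, getting an un-augmented baseline answer is just a matter of passing an evaluation question straight to get_chat_response (the variable name is just for illustration):

# Ask the torque question with no added context at all.
torque_baseline_answer = get_chat_response(EvalQ.TORQUE)
print(torque_baseline_answer)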

Prepare question

To make it fair to the chat model we will give each question the context that it is for a CB34 Chief Industries Grain Bin Sidewall, so it knows what the questions are pertaining to.


Since we are going to use this modification for all 3 questions later, we run them all through the string template.
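The exact template code is not shown on this page; a minimal sketch of the idea, with the helper name chosen to match the variable used below, could look like this:

def question_template_withwhat(question):
    # Tell the model which product the question is about.
    return f"{question} This is for a CB34 Chief Industries Grain Bin Sidewall."

torque_question_withwhat = question_template_withwhat(EvalQ.TORQUE)
warrantyvoid_question_withwhat = question_template_withwhat(EvalQ.WARRANTY_VOID)
tworingdoor_question_withwhat = question_template_withwhat(EvalQ.TWO_RING_DOOR)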

That alone does not get us the answer we are looking for, so we add one more instruction: tell the model to simply say “I don't know” when it cannot answer. Here is what the question looks like now and the answer we get from the chat model.

def question_template_idontknow(question):
    return f"""
{question} If you don't know the answer just say I don't know.
"""
torque_question_idontknow = question_template_idontknow(torque_question_withwhat)
torque_answer_idontknow = get_chat_response(torque_question_idontknow)

That is a drastic improvement. In 3 words we now know that the chat model cannot provide the answer we are looking for.

A very simple modification of the question resulted in an answer that is concise and understandable, and it was consistent across all 3 questions.

Prompt template

A final point on enhancing our string is the need for a more generic way to modify our questions. Up until now, we’ve relied on individual functions, but this approach is becoming unwieldy.

To streamline this process, let’s create a prompt template that ensures our questions follow best practices when querying the model.

def generate_prompt(question, context=""):
    return f"""
You are a Chief Industries bin dealer. Use the following pieces of retrieved context to answer the question.

If you don't know the answer, just say that you don't know.

If appropriate write out the answer as a table and only answer the exact question given. Do not offer safety advice. No need to restate the question in any way.

Context:
{context}

Question:
{question}

Answer:

"""

The pieces of the template, in order:

  • The opening line tells the chat model who to act as and what to look for in the question.
  • The “I don't know” instruction from the previous section.
  • The formatting instructions keep the response to actionable language instead of fine print and repetition.
  • Context: is where our context will go, such as pieces of the grain bin manual.
  • Question: is where the question being asked goes.
  • Answer: gives extra assurance that the chat model will respond in an expected manner.

Having this template means we do not have to dial in boilerplate every time we want to get a question answered. Let’s see if this makes a difference in the responses.

torque_prompt_question = generate_prompt(EvalQ.TORQUE)
torque_prompt_answer = get_chat_response(torque_prompt_question)
warrantyvoid_prompt_question = generate_prompt(EvalQ.WARRANTY_VOID)
warrantyvoid_prompt_answer = get_chat_response(warrantyvoid_prompt_question)
tworingdoor_prompt_question = generate_prompt(EvalQ.TWO_RING_DOOR)
tworingdoor_prompt_answer = get_chat_response(tworingdoor_prompt_question)

Encoding

Tokens represent the way text is broken down for the chat model’s consumption, and context length refers to the number of tokens batched together. These two concepts are crucial in determining how long a query will take and its associated cost.

We are using the gpt-4o-mini chat model, which is OpenAI’s current budget option. Even though it is on the cheaper side, it is still easy to have an explosion of tokens being sent and received. Here are the pricing details for the gpt-4o-mini model:

  • 15¢ per million input tokens
  • 60¢ per million output tokens

The number of tokens used when asking a question and receiving an answer impacts both cost and computation time. The maximum number of tokens a model can handle at once is known as the context window. A larger context window allows for more data to be processed but also increases computation time and cost.
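To make that concrete, here is a rough back-of-the-envelope cost estimate; the token counts are made up purely for illustration:

# Hypothetical token counts for a single question and answer.
input_tokens = 50_000   # e.g. a big chunk of the manual plus the question
output_tokens = 500     # the model's answer

# gpt-4o-mini pricing from above: 15¢ per million input tokens, 60¢ per million output tokens.
cost = (input_tokens / 1_000_000) * 0.15 + (output_tokens / 1_000_000) * 0.60
print(f"${cost:.4f} per query")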

See what you got

Encoding is the process of converting our text into tokens. This step is normally hidden behind an embedding model, but doing it yourself shows exactly what the model consumes and is a good step in exploratory analysis.

The tiktoken package allows us to count the number of tokens in a piece of text locally. It uses Byte Pair Encoding (BPE), which can go from text to tokens and back to text again, letting us examine the encoding process.

import tiktoken
model_encoding = tiktoken.encoding_for_model("gpt-4o-mini")
manual_string_encoded = model_encoding.encode(manual_string)
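As a quick sanity check (the exact count depends on the manual and the tokenizer version), we can look at the total token count and confirm the round-trip property of BPE:

# Total number of tokens in the whole manual.
print(len(manual_string_encoded))

# BPE is reversible: decoding the tokens gives back the original text.
assert model_encoding.decode(manual_string_encoded) == manual_string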

How does it break down

Most tokens correspond to only a few characters, but they do not have to. Below we can see the distribution of how many characters are in each token.

Plot code
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")

import numpy as np
_, manual_string_encoded_offsets = model_encoding.decode_with_offsets(manual_string_encoded)
manual_string_encoded_offsets_arr = np.array(manual_string_encoded_offsets)
manual_string_encoded_offsets_deltas = manual_string_encoded_offsets_arr[1:] - manual_string_encoded_offsets_arr[:-1]

sns.histplot(manual_string_encoded_offsets_deltas)
plt.xlabel('Number of characters')
plt.ylabel('Token count')
plt.title('Distribution of the number of characters in each token')
plt.grid(True, linestyle='--', alpha=0.5)
plt.gca().set_facecolor('#f0f0f0')
plt.gca().get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
plt.show()

Now we know how text is broken down into tokens. Let’s break up the manual so we can keep our token count more constrained when interacting with our chat model.

Split up the manual

Splitting the text means we do not have to send the entire document every time we make a call to our chat model. Here are 3 simple but effective strategies for splitting up the manual.

  • Split by page. This was done when the file was read.
  • Split by word count. This gives a more granular approach and is not restricted by where pages start and stop.
    • A hard cutoff could cut something off in the middle of an explanation.
  • Recursive text splitting. Create overlapping windows of text where we control the length and the overlap.
    • Overlapping the windows of text reduces the hard-cutoff problem from splitting by word.
import more_itertools as mi

# Split by page: already done when the file was read.
manual_page_split = manual_pages
# Split by word count: chunks of 101 words joined back into strings.
manual_word_split = [" ".join(chunk) for chunk in mi.chunked(manual_words, 101) if None not in chunk]
# Recursive split: overlapping windows of 98 words that step forward 60 words at a time.
manual_recursive_split = [" ".join(window) for window in mi.windowed(manual_words, 98, step=60) if None not in window]
more_itertools

The more_itertools Python package has a lot of useful patterns for splitting up data.

embedding_model_encoding = tiktoken.encoding_for_model("text-embedding-3-small")
manual_page_split_encoded = embedding_model_encoding.encode_batch(manual_page_split)
manual_word_split_encoded = embedding_model_encoding.encode_batch(manual_word_split)
manual_recursive_split_encoded = embedding_model_encoding.encode_batch(manual_recursive_split)
Plot code
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

sns.histplot([len(x) for x in manual_page_split_encoded], ax=axes[0])
axes[0].set_xlabel('Token count')
axes[0].set_ylabel('Number of splits')
axes[0].set_title('Manual Page Split')
axes[0].yaxis.set_major_locator(plt.MaxNLocator(integer=True))

sns.histplot([len(x) for x in manual_word_split_encoded], ax=axes[1])
axes[1].set_xlabel('Token count')
axes[1].set_ylabel('')
axes[1].set_title('Manual Word Split')
axes[1].yaxis.set_major_locator(plt.MaxNLocator(integer=True))

sns.histplot([len(x) for x in manual_recursive_split_encoded], ax=axes[2])
axes[2].set_xlabel('Token count')
axes[2].set_ylabel('')
axes[2].set_title('Manual Recursive Split')
axes[2].yaxis.set_major_locator(plt.MaxNLocator(integer=True))

plt.tight_layout()
plt.show()
max_token_count = max([len(x) for x in manual_page_split_encoded])
max_word_split_token_count = max([len(x) for x in manual_word_split_encoded])
max_recursive_split_token_count = max([len(x) for x in manual_recursive_split_encoded])

Now the largest page split sets the maximum token count for any single split, which is a drastic improvement over sending the whole manual, though it sits at the very high end. The word split and the recursive split bring the maximum token counts down further still.

These are dramatic token count reductions, but how do we get the right split for the question being asked?

Embeddings

An embedding model converts text into numbers. In our case it will take care of encoding the data for us, but encoding is still something we need to be aware of for context window limits and billing purposes.

With a numerical representation of our text, we can treat each piece as a context-aware vector, where the notion of context comes from the data the embedding model was trained on and how it was trained.

This is where we start getting into the retrieval part of our RAG setup. How do we add the relevant parts of our data as context to the original question?

Let’s connect to OpenAI’s text-embedding-3-small embedding model and see what that looks like.

Measuring context

What is context? Context is anything relevant to the question that allows the answerer to have the information required to construct an answer.

To measure context, then, we need a way of scoring similarity between two pieces of text embedded into a context-aware vector space.

A common measure of similarity is cosine similarity. Cosine similarity measures how closely two vectors point in the same direction: it gives a value of 1 when their directions are identical and values near 0 when they are unrelated (it can even go negative for opposing vectors).

import numpy as np

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector lengths.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
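A couple of toy vectors show the behaviour; this is just an illustration, not part of the pipeline:

print(cosine_similarity([1, 0], [1, 0]))  # 1.0: identical direction
print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal, nothing in common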

Let’s do a quick hello world with the Cosine Similarity metric and the text-embedding-3-small model.

Hello world

def get_embeddings(text, model_name="text-embedding-3-small"):
    return client.embeddings.create(input=text, model=model_name).data[0].embedding

hello_embedding = get_embeddings("Hello world")
hello_again_embedding = get_embeddings("Hello world again")
dinos_embedding = get_embeddings("Dinosaurs are real")
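Comparing the three embeddings with our similarity function gives a feel for the numbers; the exact scores will vary with the model version, but the ordering should hold:

print(cosine_similarity(hello_embedding, hello_again_embedding))  # high: nearly identical sentences
print(cosine_similarity(hello_embedding, dinos_embedding))        # lower: unrelated topics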
Plot code
fig, axes = plt.subplots(1, 1, figsize=(15, 5))

sns.histplot(hello_embedding, ax=axes)
sns.histplot(hello_again_embedding, ax=axes)
sns.histplot(dinos_embedding, ax=axes)
axes.set_xticklabels([])
axes.set_yticklabels([])
axes.set_xlabel('')
axes.set_ylabel('Frequency')
axes.set_title('Histogram of Embedding Variables')
axes.legend(['Hello Embedding', 'Hello Again Embedding', 'Dinos Embedding'])

plt.tight_layout()
plt.show()

What makes the cut

Using the similarity score, let’s see if we can extract the correct information from our splits.

manual_page_split_embeddings = [get_embeddings(split) for split in manual_page_split]
manual_word_split_embeddings = [get_embeddings(split) for split in manual_word_split]
manual_recursive_split_embeddings = [get_embeddings(split) for split in manual_recursive_split]

Now we get the embeddings for our questions and compare them to the split embeddings from above.

torque_embeddings = get_embeddings(EvalQ.TORQUE)
warrantyvoid_embeddings = get_embeddings(EvalQ.WARRANTY_VOID)
tworingdoor_embeddings = get_embeddings(EvalQ.TWO_RING_DOOR)
def compare_embeddings(compare_with, compare_to):
    results = []
    for embedding in compare_with:
        results.append(cosine_similarity(compare_to, embedding))
    return results
torque_pages_similarity_scores = compare_embeddings(manual_page_split_embeddings, torque_embeddings)
torque_word_similarity_scores = compare_embeddings(manual_word_split_embeddings, torque_embeddings)
torque_recursive_similarity_scores = compare_embeddings(manual_recursive_split_embeddings, torque_embeddings)
Plot code
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Similarity scores for torque question embeddings")
sns.histplot(torque_pages_similarity_scores, ax=axes[0])
axes[0].set_xlabel('Similarity score')
axes[0].set_ylabel('Number of splits')
axes[0].set_title('Manual Page Split')
axes[0].yaxis.set_major_locator(plt.MaxNLocator(integer=True))

sns.histplot(torque_word_similarity_scores, ax=axes[1])
axes[1].set_xlabel('Similarity score')
axes[1].set_ylabel('')
axes[1].set_title('Manual Word Split')
axes[1].yaxis.set_major_locator(plt.MaxNLocator(integer=True))

sns.histplot(torque_recursive_similarity_scores, ax=axes[2])
axes[2].set_xlabel('Similarity score')
axes[2].set_ylabel('')
axes[2].set_title('Manual Recursive Split')
axes[2].yaxis.set_major_locator(plt.MaxNLocator(integer=True))

plt.tight_layout()
plt.show()

Each plot has some values in the 0.6 range, but what is interesting is that there is one clearly separated score for the page split. That makes sense, since we know the torque information is on a single page.

To keep it simple we are going to just go with the 2 highest scores. How you choose the splits that become part of your context is a very involved process that is beyond the scope of this article.

The manual page split has one score that is clearly separated from the rest, and it belongs to the page the torque information is on (page 25).
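Picking those top scores is just an argsort over the similarity scores. A small sketch for the torque question against the page splits (the variable name is for illustration):

# Indices of the 2 page splits most similar to the torque question.
top_2_pages = np.argsort(torque_pages_similarity_scores)[-2:]
print(top_2_pages)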

The text-embedding-3-small embedding model has a maximum input length of 8,191 tokens. Using the tiktoken package we can examine how many tokens we were trying to push through it.

embed_model_encoding = tiktoken.encoding_for_model("text-embedding-3-small")
max(len(embed_model_encoding.encode(split)) for split in manual_page_split)  # largest split sent to the embedding model, in tokens

The recursive text split gives a good mix of lowering our token count while also lowering our chances of missing out on context.

RAG time

Now we have all the elements to create our RAG pipeline.

  • Load PDF file as a string.
  • Split string text recursively.
  • Embed each split of text using OpenAI’s text-embedding-3-small model.
  • Embed the question being asked with the same model.
  • Retrieve 2 most similar text splits to our question.
  • Input question and the similar splits into our prompt template.
  • Query OpenAI chat model with our augmented question.

Our approach has been simplified to the point that each step could benefit from optimization. That will be for another article.

I know too much

Without any context, the chat model lacks the information needed to provide a useful response. The manual does have fewer tokens than the context window limit for the gpt-4o-mini model, so let’s look at what happens when we add the entire manual to our prompt template. This is essentially augmenting our question without retrieval.
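A quick check makes that claim concrete; gpt-4o-mini’s context window is roughly 128,000 tokens, so this is a sketch under that assumption:

# Token count of the whole manual, computed earlier with tiktoken.
manual_token_count = len(manual_string_encoded)

# gpt-4o-mini's context window is about 128k tokens, so the entire manual fits in one prompt.
assert manual_token_count < 128_000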


# Template the torque question with the whole manual as its context.
# The prompt template already includes the "I don't know" instruction.
torque_question_manual_context = generate_prompt(EvalQ.TORQUE, context=manual_string)
torque_question_manual_context

Let’s see what this gives us when we send it to our chat model.

warrantyvoid_question_manual_context = generate_prompt(EvalQ.WARRANTY_VOID, context=manual_string)
tworingdoor_question_manual_context = generate_prompt(EvalQ.TWO_RING_DOOR, context=manual_string)

warrantyvoid_answer_manual_context = get_chat_response(warrantyvoid_question_manual_context)
tworingdoor_answer_manual_context = get_chat_response(tworingdoor_question_manual_context)
torque_answer_manual_context = get_chat_response(torque_question_manual_context)

Whoa! Another vast improvement in the response we receive, without much effort. Here all 3 original questions and their respective answers can be seen together.

Augment with context

Can we get the same or better results if we reduce the extraneous information given as context?

No doubt our computation times improve and cost goes down, but the quality of the response should not suffer. In fact it can improve, since this essentially removes noise from the context.

We will now get the most relevant splits for all of our questions, like we did in the What makes the cut section.

The first final answer we are going to look at is for the torque question. Here is the answer for when we add just 2 pieces of the most similar context.

def create_final_answer(question, question_embedding, text_split_embeddings=manual_recursive_split_embeddings, text_splits=manual_recursive_split, k=2):
    # Score every text split against the question embedding.
    similarity_scores = compare_embeddings(text_split_embeddings, question_embedding)
    # Take the k most similar splits.
    top_k_idxs = np.argsort(similarity_scores)[-k:]
    # Join the retrieved splits into one context string.
    context_string = "\n".join([text_splits[i] for i in top_k_idxs])
    # Augment the question with the retrieved context and ask the chat model.
    final_question = generate_prompt(question, context_string)
    final_answer = get_chat_response(final_question)
    return final_answer

torque_final_answer = create_final_answer(EvalQ.TORQUE, torque_embeddings, k=2)

That is a perfect answer! Looks like that is all we needed, and we gained an amazing performance boost from trimming down our input tokens.

Now let’s look at the question about what voids the warranty.

We can recall that this information is strewn all throughout the manual.

It is missing context. Only 2 splits is not very much text so this is not surprising. Let’s add more context and see what the results are.

Now let’s turn up the number of similar splits that we are adding as context from 2 to 12 and see what that does to the chat model response.

Let’s try one more and ramp up the number of splits to 50.
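The calls behind those three attempts are not shown on the page; with our helper they would look something like this (the variable names are assumptions):

warrantyvoid_final_answer = create_final_answer(EvalQ.WARRANTY_VOID, warrantyvoid_embeddings, k=2)
warrantyvoid_final_answer_k12 = create_final_answer(EvalQ.WARRANTY_VOID, warrantyvoid_embeddings, k=12)
warrantyvoid_final_answer_k50 = create_final_answer(EvalQ.WARRANTY_VOID, warrantyvoid_embeddings, k=50)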

Since the information is found throughout the document it is not surprising that we need to have a lot more context than we did when we asked about the torque values.

And finally we get our final answer for the two-ring door question. This information spans a few pages, but those pages sit next to each other.

The first output is our normal starting point of 2 similar text splits.

This is not as bad as the warranty question was but still missing information.

Here is what the two ring door answer looks like with 12 text splits as context.

# print_markdown is a small display helper (not shown here) that renders the response as Markdown.
tworingdoor_final_answer = create_final_answer(EvalQ.TWO_RING_DOOR, tworingdoor_embeddings, k=2)
print_markdown(tworingdoor_final_answer)

tworingdoor_final_answer = create_final_answer(EvalQ.TWO_RING_DOOR, tworingdoor_embeddings, k=12)
print_markdown(tworingdoor_final_answer)

So we get pretty good results, but fewer details than when we used the entire manual for context.

Overall, we can get similar results by retrieving chunks of the manual based on how similar they are to our question. There are some glaring shortcomings that we touch on next.

Next steps

Now that we’ve built the RAG pipeline by hand there is plenty of room for optimization. Here are just a few areas where improvements can be made:

  • Implement a semantic text splitter for more meaningful text chunking.
  • Experiment with different embedding spaces to find the best fit for your data.
  • Explore advanced methods for selecting the most similar documents.
  • Design a more refined and effective prompt template.
  • Investigate vector databases and relevant frameworks for enhanced performance.

These suggestions are just the beginning. I’ll be diving deeper into each of these topics in future articles.

Conclusion

Building a RAG pipeline is both straightforward and incredibly effective for enhancing the analysis of personal data. Even with the simple walkthrough we’ve covered here, it’s evident how quickly you can achieve impressive results. This approach is sure to catch the attention of anyone looking to integrate AI into their workflows, offering a powerful way to elevate data-driven insights.

Glossary

Name                           | What                                                      | Aliases
Retrieval Augmented Generation | Update questions asked of a chat model by adding context | RAG
Encoding                       | Convert bits of string into numbers                      | tokenization
Embedding                      | Convert bits of string into numerical vectors            | vectorization
Chat model                     | Models that you can have a conversation with             | gpt4*, claude, llama
Grain bin                      | Large metal storage container for grain                  | Bin