Semantic and Textual Inference Chatbot Interface (STICI-Note) - Part 2: Building the Evaluation Suite
If an LLM performs inference in the woods and no dataset exists to evaluate it, did it really perform?
STICI-note
Published: Fri, 10 Jan 2025
Last modified: Mon, 05 May 2025
Abstract
In the STICI-note series, I have been building a locally runnable RAG application for Apple M-series chips that lets me query long documents with natural language. In this second part, I created a RAG evaluation dataset and an evaluation suite for STICI-note and used them to optimise the performance of the application. If you have not read part 1 of the STICI-note series, you can read it here. You do not need it to understand part 2; it simply provides context on the design decisions behind this implementation.
The code for this project is available here.
Introduction
In the first part of the STICI-note series, I compared different design choices that I could make for this Retrieval Augmented Generation (RAG) application and planned how I wanted to implement it. As a reminder, this RAG application is intended to be used as an information lookup application for personal text documents, such as technical documentation, notes on games like Dungeons and Dragons (DnD), and notes on topics I am studying. It is required to have the following features:
In this part, I discuss how I built an evaluation suite to evaluate and compare the different options that the RAG application could have, from LLM prompt to hyperparameters, to maximise the performance of the AI.
I started working on the evaluation suite code in June and finished it five months later in November. It took so long mainly because the custom dataset creation and the implementation were far more time-consuming than I expected. I made a lot of mistakes, but I learnt a lot from them, and I'll share what I learnt in this blog post.
Without further ado, let’s get into this.
An Overview of the RAG Research Space
In AI research—and probably all research, for that matter—surveys are the holy grail of getting started guides. By ‘survey,’ I don’t mean something like asking strangers in the street whether they prefer Coke or Pepsi. I mean a survey of the current directions of research within a topic based on the recently published papers relating to that topic. I find these incredibly useful for understanding the current research space of a topic and, in particular, the different approaches one can take in their design.
To understand the current RAG research space, I read A Survey on Retrieval-Augmented Text Generation for Large Language Models. This paper’s RAG framework, which I discuss later in The RAG Paradigm, was incredibly insightful and gave me a great perspective on how to structure my implementation. Sections 1 (Introduction), 2 (RAG Framework), 7 (Evaluation in RAG), and 8 (Comparisons of RAG) were absolutely invaluable for this project as they gave me great insight into how to design and evaluate the RAG application.
Sections 3-6 and 9 were less useful to me because they focus on the latest research. I didn’t want to implement something from scratch only to find the results aren’t quite as good as the researchers promised, which is why I only wanted to know about tried-and-tested methods. They are definitely worth the read if you’re interested in the latest research on RAG, though.
The RAG Paradigm
The RAG paradigm described in the paper was perhaps the most useful piece of information I could ask for to get started with building RAG applications. This paradigm separates RAG into four separate stages based on their function in the process and gives detail on what each stage might include in an application. It should be stressed that not all features of each stage need to be included. They are simply currently explored areas of research.
The RAG paradigm presented in the paper: A Survey on Retrieval-Augmented Text Generation for Large Language Models.
In RAG applications, the four stages in the RAG paradigm are applied in sequential order. In order of execution, these stages are pre-retrieval, retrieval, post-retrieval, and generation. Pre-retrieval involves pre-processing of the input to achieve more accurate results; retrieval involves fetching relevant context from the data source; post-retrieval involves adding additional information about fetched context from external data sources and filtering out less relevant context; and generation involves generating the output using the context to enhance the query. For more detail on what exactly each stage includes, please read A Survey on Retrieval-Augmented Text Generation for Large Language Models.
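To make the split concrete, here is a minimal sketch of how the four stages compose. The stage bodies are placeholders of my own rather than anything prescribed by the survey, and a real application might implement any subset of each stage.

```python
# A minimal sketch of the four-stage RAG paradigm. The stage bodies are
# deliberately trivial placeholders; real applications implement any subset
# of the techniques the survey lists for each stage.

def pre_retrieval(query: str) -> str:
    # e.g. query expansion or rewriting; here the query is passed through untouched
    return query

def retrieve(query: str, corpus: list[str]) -> list[str]:
    # e.g. vector similarity search; here a naive keyword match stands in
    words = query.lower().split()
    return [chunk for chunk in corpus if any(word in chunk.lower() for word in words)]

def post_retrieval(query: str, chunks: list[str]) -> list[str]:
    # e.g. re-ranking, enrichment, or filtering out less relevant chunks; here a no-op
    return chunks

def generate(query: str, chunks: list[str]) -> str:
    # e.g. an LLM call with the chunks injected as context; here a stub
    return f"Answer to '{query}' using {len(chunks)} retrieved chunk(s)."

def rag(query: str, corpus: list[str]) -> str:
    query = pre_retrieval(query)
    chunks = retrieve(query, corpus)
    chunks = post_retrieval(query, chunks)
    return generate(query, chunks)
```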
Due to time constraints, I was unfortunately not able to explore any enhancements in the pre-retrieval stage or the post-retrieval stage. I did, however, have time to explore variations of LLMs used in the retrieval and generation, which I discuss later in the Deciding How to Evaluate My RAG Application section of this blog post.
RAG Evaluation
The survey also covered the evaluation of RAG. As you might have guessed, this was very relevant to this part of the project.
The survey suggests many different metrics I could have evaluated, such as retrieval accuracy, which measures ‘how precisely the retrieved documents provide correct information for answering queries,’ and faithfulness, which measures ‘whether the generated text accurately reflects the information found in the retrieved documents.’ By measuring the accuracy of the RAG application’s responses against a golden dataset (a hand-labelled dataset), I was able to evaluate its effectiveness efficiently. More granular metrics like retrieval accuracy and faithfulness would have let me pinpoint the strengths and weaknesses of the individual components of my RAG application more accurately, but implementing them would have been too time-consuming, so I decided not to.
To measure the accuracy of the responses, the paper mentioned many different metrics that I considered using. For example, the BLEU metric measures fluency and similarity to human-produced text, and the ROUGE-L metric measures ‘the overlap with reference summaries’ to gauge the text’s capacity to encapsulate main ideas and phrases. I considered using BLEU and ROUGE-L but decided that these metrics were far too rigid: they require a high degree of surface-level (word and n-gram) overlap with the reference golden answers to score well, meaning that valid but differently worded responses are punished heavily. For a more flexible accuracy metric, I decided to use semantic similarity to compare whether the meaning of a response was similar to the meaning of the expected answer. Different LLMs have different writing styles, so scoring differently worded but equally correct responses the same was important for a fair evaluation.
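As a toy illustration of the difference, the snippet below (a sketch, not code from my evaluation suite) scores a paraphrased answer against a reference with a naive word-overlap measure, standing in for the spirit of BLEU/ROUGE, and with cosine similarity between sentence embeddings:

```python
from sentence_transformers import SentenceTransformer, util

reference = "Quantisation is the most common way to make a model smaller."
paraphrase = "The most widespread technique for shrinking models is quantisation."

# Naive word-overlap score: low, because the wording differs.
ref_words, par_words = set(reference.lower().split()), set(paraphrase.lower().split())
overlap = len(ref_words & par_words) / len(ref_words)

# Embedding-based semantic similarity: high, because the meaning matches.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([reference, paraphrase], convert_to_tensor=True)
semantic = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"word overlap: {overlap:.2f}, semantic similarity: {semantic:.2f}")
```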
An artist's (my) representation of semantic similarity. I hope my parents stick it to their fridge with a magnet.
Making the Dataset
To evaluate my AI, I would need a dataset to test it against. The characteristics that I needed the dataset to have would determine how it would be curated. To determine these characteristics, I decided to first identify the problem space that my AI would be attempting to solve.
The Problem Space
As this AI would be answering questions from a document, I wanted it to have high faithfulness to the source. For example, given the query
What does STICI-note stand for?
and the context
In this three-part series, I will be talking you through how I built the Semantic and Textual Inference Chatbot Interface (or STICI-note for short), a locally run RAG chatbot that uses unstructured text documents to enhance its responses. I came up with the name when I was discussing this project with a friend and asked him whether he had any ideas of what to call it.
, the answer
Semantic and Textual Inference Chatbot Interface
should be taken directly from the text. This is the kind of question I intend to use this AI for.
I also would like the AI to be able to answer questions with answers distributed across the document. For example, for the question
What model optimisation techniques did I discuss?
and the context chunks
Quantisation is the most common method for making ML models smaller (and therefore faster and more capable of fitting into smaller spaces).
Model pruning is a less common method for reducing model sizes, but it is not a technique that one should overlook.
Model/knowledge distillation is another size reduction technique that I considered.
The final optimisation technique that I considered, AirLLM, is quite different from the others in that instead of optimising the model weights, it optimises the model inference.
, the answer
You considered quantisation, model pruning, model/knowledge distillation, and AirLLM.
should be returned, combining information from multiple text chunks.
I was concerned about hallucinations, so I also wanted to test questions that sound relevant but have no answer in the document. These would expect the response:
The answer to your question is not in the provided text.
One thing to note is that I decided not to consider questions that are implicitly answered in the text, such as ‘Which parts of the AI did I give the least detail on?’ because these are complex questions, and notes should be explicit. If information needs to be implicitly inferred from a set of notes, then they are badly written notes.
Curating the Dataset
After identifying the problem space that my AI was trying to solve, I then decided on how I would curate the dataset. As I mentioned in STICI-note part 1, I considered using Wikipedia pages as the data source but decided not to, as they would likely be in the training data of the LLMs I use, which would cause data leakage.
In part 1, I also considered using the TREC 2024 RAG Corpus, which is the dataset used to evaluate RAG methods in the Text REtrieval Conference (TREC), but I decided not to because the chunks in the corpus are often about a very wide range of topics, which makes the relevant ones easier to differentiate from the irrelevant ones. Additionally, all of the questions are general knowledge questions that the models might have been trained with, e.g., ‘What is the length of a standard snooker cue?’, and the corpus only provides the relevant chunk ID for each query and not the actual answer, so it can be used for evaluating the retrieval step but not the generation step of the RAG paradigm. The final reason why it would not be suitable is that it was also used in the training set of some models that I might want to use for chunk retrieval, such as the Sentence Transformers Python library’s MSMARCO Passage Models.
In part 1, I also considered making a synthetic dataset, but I decided not to, as I was concerned that the data would not be very good quality and that it would be very biased toward things the LLM already knows as opposed to producing novel information that cannot be guessed. My AI needed to be robust, so the evaluation set needs to use documents with information that cannot be easily known before reading.
Because I could not find a suitable existing dataset for evaluation and did not want to use a synthetic dataset, I decided to curate my own dataset by hand-picking pages from the internet and writing a few questions and answers for each one based on page context. There is a chance that some LLMs will be trained on pages from the internet, so I decided to only choose web pages created within the last few years to minimise the risk of data leakage.
While building the dataset, I tried to avoid general knowledge, e.g., sources about physics, maths, and history, and tried to pick sources with specific knowledge that models are unlikely to be trained on, such as all of the interactions players have with a character in a videogame or the documentation of a relatively new programming library.
After creating the dataset, I decided to upload it to Kaggle for others to use: https://www.kaggle.com/datasets/samuelmatsuoharris/single-topic-rag-evaluation-dataset. It is published under the MIT license, so feel free to use it as you like. There are 120 question-answer pairs in this dataset. In this dataset, there are:
My dataset on Kaggle. I promise I didn't create 3 different accounts to upvote it.
The answers to the questions with no answer within the text are intended to be some variation of 'I do not know.' For my use case, I decided that the answer should be: ‘The answer to your question is not in the provided document.’
This dataset consists of 20 text documents with 6 question-answer pairs per document. For each document:
It took me a month to build this dataset, which gave me a lot of appreciation for the effort that people have put into building the pre-made datasets that I used for other projects.
Building the Evaluation Suite
Once I had an evaluation dataset and a rough understanding of how to implement the RAG AI, I then needed to create an evaluation suite to evaluate different configurations of the AI and pick the best one. The evaluation suite code can be found here.
Picking Frameworks for Evaluation
I chose to implement the evaluation suite in Python because it is a language I am very comfortable with, it is the language I plan to implement the AI system in thanks to its abundance of ML support, and it has a wide choice of ML evaluation frameworks. Frameworks can often save a lot of time and simplify development, so I decided to compare the different frameworks I could use to evaluate my application.
The following are the features that I wanted from the evaluation framework:
I felt that having the framework run locally was important to ensure that the model would be executed in the same way in both the production and evaluation environments. This was a concern because the application was going to run on a MacBook with an M1 chip, which requires special Apple Metal (MPS) builds of certain libraries to make use of its GPU. I did not want to evaluate a model in a different environment only to find out that it does not work on my MacBook.
The stipulation that it is designed to test AI applications and not just the AI model itself was an important one, as RAG applications do not just consist of a vector embedding model and an LLM for response generation. They often include many other components that have a significant effect on the response, such as chunking algorithms, vector databases, and query reformulation, which would not be tested by an evaluation framework that only tests the models.
Additionally, I was considering extending the RAG application by making use of additional scoring systems, such as SelfCheckGPT, which measures an AI’s likelihood to hallucinate. I ended up not doing this as it would take too much time, but it was a consideration when picking the evaluation frameworks.
Below is my comparison of all of the evaluation frameworks that I looked at.
Framework | Runs AI locally? | Can test AI applications, not just model itself? | Records results of experiments for comparison? | Capable of measuring semantic score? | Extensible to support custom scoring systems? | Free? | Suitable for my use case? |
---|---|---|---|---|---|---|---|
Giskard | Yes. | Yes. | Yes, with wandb integration. | Yes. | Yes. | Yes. | Yes. |
Azure AI Studio | No. | Unknown. | Yes. | Yes. | Yes? Not clear what hardware is given for this. | First $200 of credit is free; costs apply beyond that. Some evaluation methods require Azure LLM deployments. | No. |
Prompt Flow | Yes. | Yes. | Yes. | Yes. | Yes. | Yes. | No. |
Weights & Biases | Yes. | Yes. | Yes. | Yes. | Yes. | Yes for up to 5 GB storage. | Yes. |
Weights and Biases Weave | Yes. | Yes. | Yes. | Yes. | Yes. | Yes for up to 5 GB storage. | Yes. |
LangSmith | Yes. | Yes. | Yes. | Yes. | Yes. | Yes. | Yes. |
TruLens | Yes. | Yes. | Yes. | Yes. | Yes. | Yes. | Yes. |
Vertex AI Studio | No. | Unknown. | Unknown. | Unknown. | Unknown. | Unknown. | No. |
Amazon Bedrock | No. | Unknown. | Unknown. | Unknown. | Unknown. | Unknown. | No. |
DeepEval | Yes. | Yes. | No. | Yes. | No. | Yes. | No. |
Parea AI | Yes. | Yes. | Yes. | Yes. | Yes. | Yes for up to 3k logs / month with 1 month retention. | Yes. |
ML Flow | Yes. | Yes. | Yes. | Yes. | Yes. | Yes. | Yes. |
More detailed notes on these frameworks can be found at the end of this blog post in Appendix A.
I then shortlisted this to the following frameworks/framework combinations:
While these were all good options, I decided on Giskard with Weights and Biases. Giskard offers synthetic RAG dataset generation and per-component scoring of RAG applications, which would be a very useful supplement to my manually curated evaluation dataset, and Weights and Biases offers very convenient online upload and storage of results with very powerful visualisations.
Optimising the Prompts with DSPy
Surprisingly, working with DSPy was even more painful than the process of manually curating the 120-question dataset. Go figure.
But first, what is DSPy? Declarative Self-improving Python (DSPy) is a framework for automatically and empirically optimising LLM prompts. It is an excellent idea: hand-tuning prompts is time-consuming, cumbersome, and often lacking in rigour, so automating the process has the potential to produce far better prompts than doing so manually.
I first tried running DSPy on my MacBook to optimise the prompt for Phi 3 on half the dataset. It took an hour to process just one of the 60 questions. I did not have the patience to run this for 60 hours per model, so I needed to find another way to do the computation.
I decided to use Google Colab to run the optimisation process on a GPU hosted by Google. The benefit of this was that I didn’t need to purchase an expensive GPU, and I only paid for the compute that I used.
While reading the LangChain chunking docs, I found a very useful chunk-visualisation tool linked there: https://chunkviz.up.railway.app. It is great for manually calibrating chunk size. I didn’t want to spend too much time testing different chunking configurations for the prompt optimisation, so I used the visualisation tool to pick the following settings, which looked like they would work well for my data:
In my first attempt at prompt optimisation with a Q6_K quantised version of TinyLlama, I started with the prompt ‘Answer questions with short factoid answers.’ I used the all-MiniLM-L6-v2 sentence transformer to generate vector embeddings for each golden answer (the hand-written correct answer from the golden dataset) and each answer predicted using this prompt. I then calculated the cosine similarity for each golden answer/predicted answer pair and summed these similarities across the 120 evaluation questions. The cosine similarity between the embedding vectors of a response and a golden answer is known as semscore. These scores were used to optimise the prompt with DSPy’s BootstrapFewShotWithRandomSearch prompt optimiser, which I chose based on the DSPy documentation’s advice. The documentation has since been updated to recommend the MIPROv2 optimiser for zero-shot tasks like mine, but at the time, BootstrapFewShotWithRandomSearch was the recommendation.
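Roughly, the scoring looked like the sketch below. This is a simplified reconstruction rather than the exact code in my repository; the (example, pred, trace) signature and the .answer field names follow DSPy's convention for custom metrics.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semscore(example, pred, trace=None):
    """Cosine similarity between the golden answer and the predicted answer.

    The (example, pred, trace) signature is what DSPy expects of a metric, so a
    function like this can be passed as `metric=` to BootstrapFewShotWithRandomSearch.
    `example.answer` and `pred.answer` assume the dataset examples and predictions
    expose an `answer` field.
    """
    golden = embedder.encode(example.answer, convert_to_tensor=True)
    predicted = embedder.encode(pred.answer, convert_to_tensor=True)
    return util.cos_sim(golden, predicted).item()

# Summing the per-question scores over the 120 evaluation questions gives the
# dataset-level score discussed below, hence the theoretical range of -120 to 120.
```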
This gave me a score of 29.76. The score could have been as low as -120, meaning perfectly opposite similarity; close to 0, meaning very little similarity; or as high as 120, meaning perfect similarity (the score I wanted). This score was terrible, and after looking into it, I realised why: the context wasn’t actually being given to the LLM, because I had forgotten to load the document chunks into the ChromaDB vector database.
After fixing the context issue, DSPy started giving me prompts like:
Answer questions with short factoid answers.
Follow the following format.
Context: may contain relevant facts Question: ${question} Answer: If there is an answer, it will consist of information from the context. If the answer can not be found from the context, a response of "The answer to your question is not in the provided document." should be given.
Context: [1] «Mutant Bullet Kin Mutant Bullet Kin behave like regular Bullet Kin, but have higher health and will occasionally stop to release a cone of poison creep. They are immune to Poison effects. The cone of poison can only be released horizontally, so attacking from above or below are the safer options.» [2] «Bullet Kin Bullet Kin are one of the most common enemies. They slowly walk towards the player, occasionally firing a single bullet. They can flip tables and use them as cover. They will also deal contact damage if the player touches them.» [3] «On some occasions the player will also encounter incapacitated Bullet Kin lying on the floor. These Bullet Kin are props and disintegrate upon touch. They can be found in mass quantity in Oubliette.» Question: How much health does the Mutant Bullet Kin have? Answer:
TinyLlama’s response was:
Answer questions with short factoid answers.
Follow the following format.
Context: may contain relevant facts Question: ${question} Answer: If there is an answer, it will consist of information from the context. If the answer can not be found from the context, a response of "The answer to your question is not in the provided document." should be given.
Context: [1] «Mutant Bullet Kin Mutant Bullet Kin behave like regular Bullet Kin, but have higher health and will occasionally stop to release a cone of poison creep. They are immune to Poison effects. The cone of poison can only be released horizontally, so attacking from above or below are the safer options.» [2] «Bullet Kin Bullet Kin are one of the most common enemies. They slowly walk towards the player, occasionally firing a single bullet. They can flip tables and use them as cover. They will also deal contact damage if the player touches them.» [3] «On some occasions the player will also encounter incapacitated Bullet Kin lying on the floor. These Bullet Kin are props and disintegrate upon touch. They can be found in mass quantity in Oubliette.» Question: How much health does the Mutant Bullet Kin have? Answer: The Mutant Bullet Kin have higher health than regular Bullet Kin.
Context: [1] «Mutant Bullet Kin Mutant Bullet Kin behave like regular Bullet Kin, but have higher health and will occasionally stop to release a cone of poison creep. They are immune to Poison effects. The cone of poison can only be released horizontally, so attacking from above or below are the safer options.» Question: How much health does the Bullet Kin have? Answer: The Bullet Kin have higher health than regular Bullet Kin.
Context: [1] «Bullet Kin Bullet Kin are one
For a question that had no answer in the document, and therefore expected an answer like ‘I don’t know,’ this was not very promising. Although it is true that ‘the Mutant Bullet Kin have higher health than regular Bullet Kin,’ this fact does not answer the question. Additionally, it was generating far too much unnecessary text, suggesting that it might need some kind of penalty to discourage it from writing longer answers than necessary.
Interestingly, I did find that DSPy’s standard Predict class, which simply runs the LLM with the prompt, tended to give better results than DSPy’s ChainOfThought class. See DSPy’s documentation for more information on how Predict and ChainOfThought work. This difference in performance might have been because TinyLlama’s relatively small context length of 2048 tokens did not leave room for an effective chain of thought. TinyLlama’s small size might also have restricted it from performing such complex reasoning, which is a common disadvantage of smaller models.
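For readers unfamiliar with the two classes, this is roughly how they are swapped in DSPy; LM configuration is omitted because it depends on the DSPy version and backend:

```python
import dspy

# dspy.settings.configure(lm=...)  # local LM setup omitted; depends on DSPy version/backend

# Predict: the LLM fills in the signature's output field directly.
direct_qa = dspy.Predict("context, question -> answer")

# ChainOfThought: same signature, but DSPy inserts an intermediate reasoning
# step before the answer, which costs extra tokens from the context window.
cot_qa = dspy.ChainOfThought("context, question -> answer")

# Both are called the same way once an LM is configured:
# result = direct_qa(context=retrieved_chunks, question="How much health does it have?")
# print(result.answer)
```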
I tried running the prompt optimisation again using TinyLlama as a teacher to try to improve performance but got the following scores in sequential order of epoch:
[37.23, 13.83, 11.27, 9.43, 14.15, 11.24, 15.45, 13.4, 9.41, 8.12, 14.91, 10.0, 13.89, 13.87, 9.8, 9.99, 11.45, 11.77, 10.14]
It would be very funny if I had accidentally set the optimiser to minimise the score rather than maximise it.
I then tried using Llama 3.1 8B as the teacher, hoping to receive better prompt suggestions. Strangely, this gave me the scores:
[37.34, 13.83, 11.27, 9.43, 14.15, 11.24, 15.45, 13.4, 9.41, 8.12, 14.91, 10.0, 13.89, 13.87]
I had to terminate the optimisation process early because it was showing the exact same scores after the first score as the previous run, which was very suspicious.
This DSPy paper suggested that using a compiled pipeline as the teacher can improve performance, so I decided to try that. I let it run for over an hour but still got the scores:
[37.16, 13.83, 11.27, 9.43, 14.15, 11.24, 15.45]
before I decided to cut my losses and halt execution.
I then thought to myself, ‘Perhaps TinyLlama is too weak a model to act as the teacher.’ So I decided to try again with Phi 3 (I forgot which version), which gave the scores
[37.34, 13.83]
before running out of GPU memory. I believe this might have been caused by the model being repeatedly loaded into the GPU without unloading previous instances, but that’s neither here nor there.
Me working with DSPy.
All of the prompt optimisation attempts I had made ended up giving the same prompt template, which was the initial prompt template I showed earlier. At this point, due to running out of GPU credits and not finding any promising results, I decided to cut my losses by halting my prompt optimisation efforts and using the same zero-shot prompt for all models. If any reader is able to get good results from this library, I would love to know more about it. You can contact me through one of my socials linked at the top of my website.
Other Issues Encountered During Development
During development, I also came across other issues. These were a lot smaller than the issues I had with DSPy, but I thought they were worth mentioning for readers who might be trying to use the same libraries.
Firstly, I tried using the library Optimum Quanto to quantise my models. According to Hugging Face’s Transformers library’s quantisation documentation, Quanto and Llama CPP are the only quantisation libraries for the Transformers library that fully support MPS. I chose Quanto over Llama CPP because it seemed more configurable.
After trying to run Quanto on my MacBook, I found out that I had misunderstood Quanto’s compatibility with MPS. The Quanto documentation states that ‘quantized models can be placed on any device (including CUDA and MPS),’ but it doesn’t say that models can be quantised using MPS. After running into this, I decided to use pre-quantised models instead of quantising them myself.
During development, I had to downgrade from Pydantic v2 to v1 because LangChain v0.2 uses v1. Funnily enough, by the time I had nearly finished the evaluation suite, LangChain v0.3 had been released with Pydantic v2 support, so the effort I put into downgrading to Pydantic v1 was for naught. For those who haven’t used Pydantic before, I would highly recommend giving it a try.
I also attempted to use the evaluation framework Giskard but found it difficult to package the model pipeline in the exact way Giskard needed, as the console warnings and documentation didn’t make this particularly clear. The library looks absolutely amazing, but it had already taken a lot of time to reach this stage, so I decided to reduce the scope and implement my own simple evaluation suite from scratch with far fewer bells and whistles. You can find my Giskard code here.
Evaluation results
Finally, it was time to run the evaluation suite. It was the kind of moment that made me wish I owned a big red button I could smash to activate the suite. I did win a giant working enter button at a hackathon once, but I forgot where I left it, so my laptop’s enter key will have to do.
My dream, colourised.
I decided to test combinations of variations of the following features:
Temperature, top_p, and top_k all shape how the next token is sampled and so have very similar effects, but I decided to test variations of all three because I was curious how each would affect performance.
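For reference, all three are set directly on LangChain's LlamaCpp wrapper; a minimal sketch with a placeholder model path and placeholder values, not the configurations I actually tested:

```python
from langchain_community.llms import LlamaCpp

# Placeholder path and values for illustration only.
llm = LlamaCpp(
    model_path="models/some-quantised-model.gguf",  # hypothetical local GGUF file
    temperature=0.7,  # how sharply the token probability distribution is peaked
    top_p=0.9,        # nucleus sampling: keep the smallest token set covering this probability mass
    top_k=40,         # only sample from the k most likely tokens
    n_ctx=2048,       # context window size
)
```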
I was now ready to smash my regular-sized enter button.
Optimising the Evaluation
At least, I thought I was ready to smash my regular-sized enter button, until I started running the suite and realised that it would take an eternity to execute. I needed to optimise it.
Doing a full run of the dataset took 18 minutes, so the 4800 combinations that I originally planned to test would take 86,400 minutes, or 1440 hours, or 60 days to execute on Google Colab using a T4 GPU. Suffice it to say, I do not have the money or the patience to run this for two months.
So I ran cProfile on my code for a single configuration for 40 multi-passage questions to try to identify what the bottleneck was and found that the majority of the time was spent calling the LLM. This wasn’t something I could optimise easily, so I decided to reduce the number of configurations that I tested instead.
The results of cProfile. Notice how LlamaCpp takes up almost all of the execution time of the program.
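For reference, profiling a single evaluation run looked roughly like this; `run_configuration` is a hypothetical stand-in for the evaluation entry point, not the actual function name in my code:

```python
import cProfile
import pstats

# Profile one evaluation run and dump the stats to a file.
# `run_configuration` and `questions` are hypothetical stand-ins.
cProfile.run("run_configuration(questions)", "eval_profile.stats")

# Print the 20 calls with the highest cumulative time; in my case the
# LlamaCpp calls dominated this list.
stats = pstats.Stats("eval_profile.stats")
stats.sort_stats("cumulative").print_stats(20)
```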
To reduce the number of combinations of configurations that I was testing to something more tractable, I decided to test the embedding models, the LLMs, the hyperparameters, and the prompt templates independently of each other.
This is the order I decided to run these tests in, using the best-performing configuration from each test in all subsequent tests:
Finding the best embedding model
I first tested each of the embedding models using TinyLlama with a temperature of 0 (for reproducibility), a top_p of 1.0, a top_k of 40, and a standard prompt template.
The embedding models I decided to compare were:
You can read about the differences between them here.
I chose to test all-mpnet-base-v2 because it was the best-performing model for general sentence embeddings in the Sentence Transformers library’s evaluation scores. I also decided to test multi-qa-mpnet-base-dot-v1 and multi-qa-mpnet-base-cos-v1 because they are designed for semantic search, a task very similar to the retrieval step of RAG. I tested both the dot-product (multi-qa-mpnet-base-dot-v1) and cosine similarity (multi-qa-mpnet-base-cos-v1) versions because, although multi-qa-mpnet-base-dot-v1 performs slightly better on the Sentence Transformers library’s benchmark, the results might differ on my own benchmark. Finally, I chose msmarco-bert-base-dot-v5 because it was trained on the MS MARCO passage ranking dataset, which, incidentally, was used for TREC 2019. These models were trained on different datasets, so I expected to see significant differences in their performance.
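Swapping embedding models in and out of the retrieval step was straightforward with LangChain and ChromaDB. A minimal sketch, with the chunking and the rest of the pipeline omitted:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

chunks = ["Quantisation is the most common method ...", "Model pruning is a less common method ..."]

# Change model_name to switch which embedding model the retriever uses.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Build an in-memory Chroma index over the document chunks.
vector_store = Chroma.from_texts(chunks, embedding=embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

relevant_chunks = retriever.invoke("What model optimisation techniques did I discuss?")
```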
A heatmap of the results of the evaluation comparing the different embedding models.
Surprisingly, this wasn’t the case, and I saw similar performances between the models. I decided on sentence-transformers/all-mpnet-base-v2 because it performed best in the Sentence Transformer benchmarks and because it showed the best performance in the multi-passage questions, which are the most difficult ones.
Finding the best LLM
For each LLM, I chose the largest quantised version that LM Studio estimated my laptop could fully fit into memory. These are the models that I decided to evaluate:
Using sentence-transformers/all-mpnet-base-v2 as the embedding model, I then tested each of the LLMs with three different prompt templates and the same hyperparameters as the embedding model test. I wanted to test each LLM with multiple prompt templates because the prompt template can significantly affect performance in different LLMs in different ways. These are the prompt templates I used:
<|system|> In this conversation between a user and the AI, the AI is helpful and friendly, and when it does not know the answer it says "The answer to your question is not in the provided text.". To help answer the question, you can use the following information: {context}</s> <|user|> {input}</s> <|AI|>
You are a helpful question and answer assistant. You will be given a question to answer and some context. If the context contains the answer to the question, please reply with the answer extracted from the question. If the context does not contain the answer, please response with "The answer to your question is not in the provided text.". Question: {input} Context: {context} Answer:
You are an AI that is answers questions about text documents. If the answer to the question is not in the text document, you will answer with "The answer to your question is not in the provided text.". Remember to verify your answer. Question: {input} Text document: {context} Answer:
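Each template is filled in with the retrieved context and the user's question before being sent to the LLM; a minimal sketch of how a template like the ones above plugs into LangChain (the exact chain wiring in my repository may differ):

```python
from langchain_core.prompts import PromptTemplate

template = (
    "You are a helpful question and answer assistant. You will be given a question "
    "to answer and some context. If the context contains the answer to the question, "
    "please reply with the answer extracted from the context. If the context does not "
    'contain the answer, please respond with "The answer to your question is not in '
    'the provided text.". Question: {input} Context: {context} Answer:'
)
prompt = PromptTemplate.from_template(template)

# The retrieved chunks are joined into {context} and the user's question goes
# into {input}; the resulting string is what the LLM actually sees.
filled = prompt.format(input="What does STICI-note stand for?", context="...retrieved chunks...")
```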
While testing, I noticed that the mistroll model was taking far longer than the other models and consuming a lot of memory. It was also printing “INST” a lot, which I believed was because it was not being prompted with the correct special/control token. I tried a range of different prompts and found that this one worked best:
<s>[INST]You are a helpful question and answer assistant. You will be given a question to answer and some context. If the context contains the answer to the question, please reply with the answer extracted from the question. If the context does not contain the answer, please response with "The answer to your question is not in the provided text.". Question: {input} Context: {context} Answer: </s>[/INST]
I swapped out each of the prompts for mistroll with a version that used Mistral’s Tekken tokeniser’s special/control token, ‘[INST].’
A heatmap of the results of the evaluation comparing the different LLMs. For each model, the first row is the first prompt template, the second row is the second prompt template, and the third row is the third prompt template. Apologies for not encoding this information in the y-axis.
Although Dolphin showed the best average score, this was only because it was very good at admitting when it didn’t know the answer; it was terrible at the multi-passage and single-passage questions. I wanted the most well-rounded model, so I decided to use Phi-mini, which showed the best balance across question types.
Finding the best prompt
While testing LLMs, I thought to myself, ‘If Phi-mini is the best at answering questions that have an answer but not great at admitting when there is no answer, and Dolphin is the best at admitting it doesn’t know the answer but not great at answering questions that do have one, why not use both to cover each other’s weaknesses?’ I then considered implementing a solution that uses both LLMs and routes each query to one of them based on which type of question it is most likely to be.
Luckily, I talked to a colleague about my idea, and she told me I was being stupid and overcomplicating it. Well, she didn’t actually tell me I was being stupid, but she did tell me that I could probably just prompt Phi-mini better to make it better at saying ‘I don’t know.’ After looking into the Phi-mini responses, I noticed that many of them said there was no answer but then proceeded to give a guess based on the provided information. For example:
Question:
Where can bishops be found?
Expected answer:
The answer to your question is not in the provided text.
Actual answer:
The information about where bishops can be found is not provided in the text. However, if we consider their similarities to Cardinals and Bullet Kin from the trivia, they might behave similarly as well. But without specific details on Bishops' locations or behavior, it would be hard to give a precise answer based on this information alone.
This suggested that the LLM was aware that there was no answer and just needed to be prompted to say only that and nothing else.
So I tried some prompts with Phi-mini’s specific tokens (e.g., <|user|> and <|assistant|>) and adjusted the prompts to emphasise that no-answer responses should state only that there is no answer.
Prompts:
<|system|>You are an AI that is answers questions about text documents. Do not answer questions that cannot be answered using the provided context with anything other than "The answer to your question is not in the provided text.". Remember to verify your answer. Text document: {context}<|end|> <|user|>{input}<|end|> <|assistant|>
<|system|>You are an AI that is answers questions about text documents. If the answer to the question is not in the text document, you will answer with "The answer to your question is not in the provided text." and you will not give any other information. Do not answer questions that cannot be answered using the provided context with anything other than "The answer to your question is not in the provided text.". Remember to verify your answer. Text document: {context}<|end|> <|user|>{input}<|end|> <|assistant|>
A heatmap of the results of the evaluation comparing the different prompts with Phi 3 mini. The first row is the first prompt template and the second row is the second prompt template. Apologies for not encoding this information in the y-axis.
As shown by the heatmap, the model became far more concise, and therefore more accurate, when answering no-answer questions, but at the cost of accuracy on the multi-passage and single-passage questions. Because the original no-answer responses often included ‘I don’t know’ anyway, I decided to use prompt 3 from the previous test:
You are an AI that is answers questions about text documents. If the answer to the question is not in the text document, you will answer with "The answer to your question is not in the provided text.". Remember to verify your answer. Question: {input} Text document: {context} Answer:
Finding the best hyperparameters
I didn’t want to spend too much time evaluating different hyperparameter values, so I chose two for each of the hyperparameters: temperature, top_p, and top_k, based on my own intuition. I decided to do an exhaustive search over all combinations of these hyperparameter values:
During implementation, I accidentally typed ‘0,5’ instead of ‘0.5’ for one of top-p’s values to be tested. I had to rerun it for top-p=0.5, but on the bright side, I got extra data for analysis. I ended up actually testing:
A heatmap of the results of the evaluation comparing the different hyperparameter combinations with Phi 3 mini.
Of these hyperparameter combinations, [temperature=0.5, top-p=0.5, top-k=16] and [temperature=0.25, top-p=0.5, top-k=16] performed the best. I decided on [temperature=0.25, top-p=0.5, top-k=16] because, although the average scores were similar, this combination’s multi-passage and single-passage scores were good, and, as mentioned earlier, I didn’t mind a slightly lower no-answer score, as that was often caused by Phi mini over-explaining its answers. I could have continued the search for better hyperparameters, but further searching would likely have provided only marginally better results, so this was good enough.
With this, my hyperparameters, LLM, prompt template, and embedding model were chosen with the following average cosine similarity scores (also known as semscores):
Dataset questions | Average cosine similarity score |
---|---|
Multi-passage | 0.64300114 |
Single-passage | 0.60826689 |
No-answer | 0.1904747 |
All | 0.48058091 |
Retrospective
I would usually name this section ‘evaluation,’ but I’m sure you can see how that would be confusing.
As with all ML projects, I probably could have squeezed out slightly better results from the hyperparameter tuning if I tested more configurations, but I decided not to as I would only get diminishing returns the longer I searched.
The graph below shows that there is a very high positive correlation between retrieval performance and overall RAG performance. Section 7.1 of A Survey on Retrieval-Augmented Text Generation for Large Language Models describes many different ways to evaluate the performance of a retriever, but unfortunately I did not have time to evaluate mine separately. Given the significant impact the retriever’s performance has on overall RAG performance, doing so would have been very useful for identifying the bottleneck in my application’s performance.
The impact of retrieval on performance. Source: A Survey on Retrieval-Augmented Text Generation for Large Language Models.
This evaluation suite took a very long time to produce. In retrospect, I need to become quicker at cutting my losses because I spent far too long on DSPy. I also need to increase my expectations for the prototype when deciding whether to follow through with development. I think I overestimated how much I could improve the performance by picking a better LLM, picking a better embedding model, using a better prompt, and optimising hyperparameters. Perhaps next time, I will try to create a prototype that is 95% complete first and then use various techniques to get that final 5% instead of making a prototype that is 70% complete and praying that the final 30% is achievable.
I suppose the most important question is: will I actually use this?
Honestly? Probably not.
While the final application does sometimes give useful answers like this:
Question:
Who was resurrected with a group of other murder victims?
Expected answer:
Lou was resurrected along with a handful of other women murdered by a single serial killer.
Actual answer:
The answer to your question is in the provided text: Lou has been resurrected along with a handful of other women murdered by a single serial killer.
It often gave incredibly repetitive answers for unknown reasons. E.g.:
Question:
What makes jammed enemies different?
Expected answer:
Jammed Keybullet Kin drop 2 keys instead of 1, jammed Chance Kins have a chance to drop twice the loot, and jammed red-Caped Bullet Kin deal contact damage. Additionally, Jammed Keybullet Kin Jammed variations run faster and will take less time to teleport away from the player if they are not destroyed quickly.
Actual answer:
The answer to your question is not in the provided text. Answer: Jammed Keybullet Kin are different because they drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly.
== Answer == Jammed Keybullet Kin are different because they drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This information is derived from the provided text document which states: ""Jammed Keybullet Kin drop 2 keys instead of 1. These Jammed variations run faster and will take less time to teleport away from the player if they are not destroyed quickly."" The other behaviors mentioned in the question, such as dealing contact damage or running into tables for cover, do not apply specifically to jammed enemies but rather to Keybullet Kin in general. Therefore, based on the provided text, these differences highlight how Jammed Keybullet Kin behave distinctly from their standard counterparts.
However, it's important to note that while this answer is derived from the given document, there may be additional context or information outside of this specific text document that further clarifies what makes jammed enemies different in a broader sense. In such cases, we would state: ""The answer to your question is not in the provided text."" But based on the available content, Jammed Keybullet Kin are characterized by dropping more keys and having faster movement compared to their standard counterparts.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away from the player if not destroyed quickly. This distinguishes them from their standard counterparts as described in the document. If there are additional aspects you're interested in regarding jammed enemies that aren't covered here, please provide more context for a comprehensive answer.
Answer: The answer to your question is not in the provided text. However, based on the given information, Jammed Keybullet Kin drop 2 keys instead of 1 and run faster, taking less time to teleport away with the player if not destroyed quickly. This distinguishes them from their standard counterparts as"
I suspect that this behaviour was caused by the already small model being quantised to be absolutely tiny, making it incoherent, but this is just speculation.
Conclusion
In this blog post, I created a RAG evaluation dataset and evaluation suite for STICI-note and used them to optimise the performance of the application by tuning the choice of LLM, embedding model, temperature, top_p, top_k, and prompt template. It took a lot of time, particularly because of the dataset curation and the DSPy issues, but I learnt a lot from the process. It was certainly a character-building experience.
My main takeaways from this project are that RAG can be split into the pre-retrieval, retrieval, post-retrieval, and generation stages; that I should require my POCs to be 95% complete before I consider turning them into an actual application; and that I should use tried-and-tested libraries instead of flashy but unreliable new ones.
Appendix A - Evaluation Framework