Improving the Accuracy of Technical LLM QA by 37%
An evaluation of Lune AI against standalone LLMs on real-world technical and coding questions.
LLMs tend to hallucinate information that was not part of their training data, which often leads to inaccurate answers to technical questions about up-to-date libraries and documentation.
Figure: gpt-4o hallucinating an implementation of structured output with LangChain.
Overview
At Lune AI, we are attempting to solve this problem by training expert LLMs on up-to-date knowledge sources.
To evaluate the accuracy of Lunes against Claude 3.5 Sonnet and GPT-4o, we ran a real-world technical QA test on a dataset of Stack Overflow posts with verified answers.
For this evaluation, we chose Kubernetes as the technical topic. We trained a Lune on the full Kubernetes documentation and compiled a list of Stack Overflow posts tagged with Kubernetes that had verified answers and were posted after April 2024, the knowledge cutoff of the comparison LLMs.
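The post does not spell out the collection pipeline, but a minimal sketch of how such a dataset could be compiled with the public Stack Exchange API looks roughly like this. Treating "verified" as Stack Overflow's accepted answers, and the specific paging and filtering choices, are assumptions for illustration:

```python
from datetime import datetime, timezone

import requests

API_URL = "https://api.stackexchange.com/2.3/questions"
# April 2024: knowledge cutoff of the comparison LLMs
CUTOFF = int(datetime(2024, 4, 1, tzinfo=timezone.utc).timestamp())


def fetch_kubernetes_posts(max_pages: int = 10) -> list[dict]:
    """Collect Kubernetes-tagged questions created after the cutoff that have an accepted answer."""
    posts = []
    for page in range(1, max_pages + 1):
        resp = requests.get(API_URL, params={
            "site": "stackoverflow",
            "tagged": "kubernetes",
            "fromdate": CUTOFF,      # Unix timestamp; only posts created after the cutoff
            "sort": "creation",
            "order": "asc",
            "filter": "withbody",    # built-in filter that includes the question body
            "pagesize": 100,
            "page": page,
        })
        resp.raise_for_status()
        data = resp.json()
        posts += [
            {
                "title": q["title"],
                "body": q["body"],
                "accepted_answer_id": q["accepted_answer_id"],
            }
            for q in data["items"]
            if "accepted_answer_id" in q  # keep only posts with a verified (accepted) answer
        ]
        if not data.get("has_more"):
            break
    return posts
```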
Methodology
The evaluation process was conducted in two stages: inferencing and evaluation.
During the inferencing stage, three models—Claude-3.5-Sonnet-20241022, GPT-4o, and the Kubernetes Lune—were presented with the question titles and bodies from each post. The models were tasked with generating answers without any additional contextual information.
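As a concrete illustration, the inferencing loop can be sketched as follows. Here `complete` is a hypothetical wrapper around each provider's chat-completion API (OpenAI, Anthropic, and the Lune endpoint), and the prompt wording and the Lune's model identifier are illustrative rather than the exact ones used:

```python
# Answering models; "kubernetes-lune" is a placeholder identifier for the Kubernetes Lune.
MODELS = ["claude-3-5-sonnet-20241022", "gpt-4o", "kubernetes-lune"]

PROMPT_TEMPLATE = (
    "Answer the following Stack Overflow question about Kubernetes.\n\n"
    "Title: {title}\n\nBody:\n{body}"
)


def run_inference(posts: list[dict], complete) -> dict[str, list[str]]:
    """Generate an answer from every model for every post, with no extra context."""
    answers = {model: [] for model in MODELS}
    for post in posts:
        prompt = PROMPT_TEMPLATE.format(title=post["title"], body=post["body"])
        for model in MODELS:
            # Each model sees only the question title and body.
            answers[model].append(complete(model=model, prompt=prompt))
    return answers
```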
In the evaluation stage, the generated responses, alongside the original verified answers, were analyzed by an independent LLM that performed a direct comparison. If the judge determined that a generated response was equivalent in content to the verified human answer, it was classified as correct; otherwise it was marked incorrect. We recognize the limitations and potential for error in LLM-based evaluation, but our goal with this first test was to capture higher-level trends, with more in-depth and rigorous evals planned.
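A sketch of this judging step, reusing the hypothetical `complete` wrapper from above; the grading prompt shown here illustrates the comparison described, not our exact prompt:

```python
JUDGE_PROMPT = (
    "You are grading an answer to a Stack Overflow question about Kubernetes.\n\n"
    "Question:\n{question}\n\n"
    "Verified human answer:\n{reference}\n\n"
    "Candidate answer:\n{candidate}\n\n"
    "Reply with exactly one word: CORRECT if the candidate answer is equivalent in "
    "content to the verified answer, otherwise INCORRECT."
)


def judge(question: str, reference: str, candidate: str, complete, judge_model: str) -> bool:
    """Return True if the judge model deems the candidate equivalent to the verified answer."""
    verdict = complete(
        model=judge_model,
        prompt=JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate
        ),
    )
    return verdict.strip().upper().startswith("CORRECT")
```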
Our analysis revealed a self-preference bias: the evaluating LLM tended to favor responses generated by models from its own family. For instance, when GPT-4o served as the evaluator, its own generated responses were more frequently marked correct than when Claude-3.5-Sonnet acted as the evaluator.
To mitigate this bias, we ran the evaluation with four distinct judge LLMs and averaged their results to determine the final outcome (see the aggregation sketch after the list below).
The models used for the evaluation were:
- gpt-4o
- claude-3-5-sonnet-20241022
- o1-mini
- gemini-1.5-pro
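To make the aggregation concrete, here is a sketch that scores every answering model with every judge and averages the per-judge accuracies. It builds on the `judge` helper sketched above; the `verified_answer` field is assumed to have been fetched alongside each post:

```python
from statistics import mean

JUDGE_MODELS = ["gpt-4o", "claude-3-5-sonnet-20241022", "o1-mini", "gemini-1.5-pro"]


def accuracy(verdicts: list[bool]) -> float:
    return sum(verdicts) / len(verdicts)


def averaged_accuracy(posts: list[dict], answers: dict[str, list[str]], complete) -> dict[str, float]:
    """Score every answering model with every judge, then average accuracy across judges."""
    per_judge = {model: [] for model in answers}  # answering model -> per-judge accuracies
    for judge_model in JUDGE_MODELS:
        for model, candidates in answers.items():
            verdicts = [
                judge(
                    question=post["title"] + "\n\n" + post["body"],
                    reference=post["verified_answer"],  # assumed to be pre-fetched
                    candidate=candidate,
                    complete=complete,
                    judge_model=judge_model,
                )
                for post, candidate in zip(posts, candidates)
            ]
            per_judge[model].append(accuracy(verdicts))
    # Final score for each answering model is the mean across the four judges.
    return {model: mean(scores) for model, scores in per_judge.items()}
```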
Results
On average, the Kubernetes Lune saw a 37% improvement in accuracy over gpt-4o and a 14% improvement over claude-3-5-sonnet-20241022.
Regardless of the model used for evaluation, gpt-4o consistently performed the worst in terms of accuracy, while the Kubernetes Lune outperformed both stock LLMs in every evaluation.
Future Directions
- Automatic context switching: We are actively refining and evaluating our Tycho model to enhance its ability to automatically determine the most relevant knowledge sources based on user queries. SWE-bench Tycho evaluations are our next priority.
- Evaluating diverse topics: As our repository of user-created Lunes continues to grow, we aim to conduct evaluations across a broader range of technical topics. This will help us assess performance improvements when utilizing various knowledge sources.
- Better accuracy: We are committed to continuous improvement by incorporating developer-driven feedback loops directly into our platform. As user-driven insights enhance the performance of individual Lunes, we anticipate a corresponding increase in overall accuracy for both Tycho and the Lunes ecosystem.
Explore the possibilities—visit our Explore page to discover and interact with Lunes created on our platform or train a custom Lune for free to suit your unique needs.