Langchain will try to break your codebase: Or, how Clean Architecture can save you from headaches

Nina Holm-Jensen, Senior Data Scientist

Prerequisites to reading:
- A basic understanding of how a RAG system works (if not, you can read any article on the topic, like this one, and come back)
- Ability to read Python code
- A fundamental understanding of how to unit test

Repo: https://github.com/TodAI-A-S/Todai-LLM-project/branches

LLMs are complicated. To write even a simple RAG system, you have to connect to multiple databases and tables (your vectorstore is the obvious one, and then there is the user table, the conversation history, maybe conversation metadata or user permissions, etc.). You then have to juggle the data retrieved by your LLM. It needs to be coerced into sensible objects which you stream to the frontend for visual rendering. And what happens when the LLM feels feisty and sends back a different data format? Your error handling needs to be solid. It's all terribly complex.

What is the must-do for all complex code? For this article, I assume that we all agree that automatic testing is good and necessary. If you need a little convincing, go here.

First, let me scope this article. I will focus squarely on local, on-my-machine integration testing, which is fundamentally deterministic. LLMs are notoriously non-deterministic, so for a proper, business-centric test suite, you will also need a way to test the LLM flow itself. See my colleague's articles on that particular topic. Here, I will discuss how to abstract all of that away so you can test the deterministic code around the LLM. It's harder than it sounds.

I will base this article on snippets of actual production code I wrote this year. I have cut 90% of the complexity (and all of the identifying business logic) so I can focus on a few interesting parts.

What is the problem?

Langchain promises to bundle all your LLM-specific needs into a friendly library that handles everything for you. It does that, mostly. But, to be completely frank, Langchain is a bit of a mess. It is made for data scientists, not software engineers. Is this part a function or a class? Who knows? Is this part configuring or executing code? Does it even matter?

If you are here, you probably care a little bit about writing testable code. You care about separation of concerns. If you are here, you have probably also seen the Langchain documentation, or the myriad of ten-line RAG examples littering the internet. And you've thought: "How the hell do I modularize this?"

If you go to Langchain's own documentation and string their examples together, it will look something like this:

# import paths as of recent Langchain versions; older releases used langchain.* instead
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    ...,  # more documents elided
]

vectorstore = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    index_name="opensearch-self-query-demo",
    opensearch_url="http://localhost:9200",
)

Nice and quick. You've got your entire RAG system in a 16-line script (plus imports) where everything (connecting to the database, connecting to the LLM, and a ton of transformation) happens in a single line. Nice. And utterly untestable. Some of the questions that crop up:

- How do I mock anything here?
- The OpenAIEmbeddings class reads my API key from environment variables. Can I control that without nasty side effects?
- The database connection happens somewhere deep in the from_documents() function.

If I were to write a test, I would have to do a lot of hacky interceptions, and it would undoubtedly break the next time a hapless developer came by.
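To make that point concrete, here is a rough sketch of what those hacky interceptions could look like, assuming pytest and unittest.mock. The patch target and the my_rag_script module are my own illustration (the exact paths depend on your Langchain version and how the script is packaged), not code from the production repo.

from unittest.mock import MagicMock, patch


def test_rag_script_without_real_services(monkeypatch):
    # OpenAIEmbeddings() reads the key from the environment, so fake it first.
    monkeypatch.setenv("OPENAI_API_KEY", "fake-key")

    # Intercept the call that hides both the embedding calls and the database connection.
    with patch(
        "langchain_community.vectorstores.OpenSearchVectorSearch.from_documents",
        return_value=MagicMock(),
    ) as fake_from_documents:
        # The script runs at import time, so the patch must already be in place,
        # and the test silently breaks if the module was imported anywhere else first.
        import my_rag_script  # hypothetical module containing the script above

    fake_from_documents.assert_called_once()

It works, after a fashion, but it depends on import order, private module paths and environment variables: exactly the kind of test that rots.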
The solution

I quickly hit a wall trying to make Langchain's own examples testable. How do you separate responsibilities? How do you inject dependencies? I realised that Langchain wasn't going to help me willingly. So I made it my personal mission to beat and batter Langchain into a shape reminiscent of proper architecture. My anvil of choice was Clean Architecture as described by Uncle Bob. Fundamentally, I worked to separate the code into three layers:

- Interfaces, which handle any and all external connections and calls.
- Services, which handle all business logic. Services DO things: they carry and transform data.
- Entities, which are the data objects being carried and transformed.

Along with, of course, configuration code, which we all know to keep separate. Strict adherence to this architecture became my guiding principle.

Working with entities

Remember, the thing I wrote is a full-fledged application. It receives (messy) JSON data through an endpoint, then has to call a database to enrich the data, then pass this data elsewhere for audit logging. Then it does some data transformations (including a PDF parser reading the page content). And then, finally, the relevant data reaches the embedding flow. At all these points, I need to clearly understand what my data looks like.

The responsibility of the entity:
- Collecting data in well-defined and sensibly named chunks.
- Matching data to domain concepts.

I used Pydantic to parse the incoming JSON from the upstream data pipeline, and to represent the data at all times through the transformation steps, until it arrived at my embedding flow looking something like this:

from typing import List

from pydantic import BaseModel


class PdfPage(BaseModel):
    page_content: str
    page_number: int


class PdfDocument(BaseModel):
    doc_id: str
    pages: List[PdfPage]

Nothing magical here. Entities are just dataclasses. We love them because they give us a shared language around the data, and because they are so much more self-documenting than the dictionaries, lists and all the other structures that data scientists love to use while tinkering.

Interfaces

Dependencies are finicky things. Whenever you depend on anything not in your own codebase, you risk that an unexpected change messes with your test. Maybe someone had to delete the data row you depended on? Maybe someone else is updating the service you need to call? Then your test fails for reasons completely unrelated to actual errors. And, from a code maintenance perspective, you want to be able to change providers reasonably easily. You want any external dependencies to be plug-and-play with clearly defined entry points. Infrastructure doesn't change often, but when it does, you don't want the change to ripple through your business logic.
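The production interface code is not shown here, but as an illustration of where this leads, here is a minimal sketch of an interface in front of the vectorstore, with a Langchain-backed implementation on one side and a trivial in-memory fake for tests on the other. The names (DocumentStore, OpenSearchDocumentStore, InMemoryDocumentStore) and the my_entities module are my own assumptions, not the production code.

from typing import List, Protocol

from my_entities import PdfDocument  # hypothetical module holding the entities above


class DocumentStore(Protocol):
    """Entry point for anything that can store and search document pages."""

    def add(self, document: PdfDocument) -> None: ...

    def search(self, query: str, k: int = 25) -> List[str]: ...


class OpenSearchDocumentStore:
    """Langchain-backed implementation: the only place that knows OpenSearch exists."""

    def __init__(self, vectorstore) -> None:  # an OpenSearchVectorSearch instance
        self._vectorstore = vectorstore

    def add(self, document: PdfDocument) -> None:
        texts = [page.page_content for page in document.pages]
        metadatas = [
            {"doc_id": document.doc_id, "page": page.page_number} for page in document.pages
        ]
        self._vectorstore.add_texts(texts, metadatas=metadatas)

    def search(self, query: str, k: int = 25) -> List[str]:
        return [doc.page_content for doc in self._vectorstore.similarity_search(query, k=k)]


class InMemoryDocumentStore:
    """Fake used in tests: no network, no database, no API keys."""

    def __init__(self) -> None:
        self._pages: List[str] = []

    def add(self, document: PdfDocument) -> None:
        self._pages.extend(page.page_content for page in document.pages)

    def search(self, query: str, k: int = 25) -> List[str]:
        return [page for page in self._pages if query.lower() in page.lower()][:k]

Services take a DocumentStore in their constructor; tests hand them the in-memory fake, and only the configuration code knows that OpenSearch and Langchain exist at all.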

Why we haven’t gotten rid of Langchain yet

Nina Holm-Jensen, Senior Data Scientist

Based on a conversation between Todai's data scientists.

Prerequisites:
- A basic understanding of how a RAG system works (if not, you can read any article on the topic, like this one, and come back)

Why would you want to get rid of Langchain?

Langchain is a long-standing framework for working with LLMs. It is conceptually based on the idea of "chains", a way of chaining together different flows, agents and other functionality. It is designed to be very plug-and-play, with standardised interfaces between the different modules. In principle, you should be able to call the same functions for OpenAI as you do for AWS's Bedrock suite. Which is nice and simple.

But the major problem with Langchain is exactly this plug-and-play mentality. Langchain invites you not to think too deeply about the implementation details of your work. Everything is abstracted away into these chain "links", and you just have to configure a few things. But under the hood, the implementation details ARE very different and very complicated. LLMs change at a breakneck pace, and every provider has their own quirks.

Langchain is not exactly stringent about following software engineering principles. Configuration code, executing code, database connections, data transformations: to fit the chain structure, it all often happens in one big bowl of spaghetti code under the hood.

This laissez-faire approach leads to fun surprises. One time, a colleague of mine just couldn't understand why she kept getting a weird bug in her chain, until she went on a code deep dive and realised that, upstream in her LLM chain, Langchain had implemented a function which just returned None. No errors. No NotImplementedException. Just an empty return value, which was propagated further down the chain's modules until it, eventually, caused an error.

The concept of "chains" is also yet another level of abstraction, which can be hard to explain to non-data scientists. It is worth noting that Thoughtworks's Tech Radar even moved Langchain to "Hold" last year, meaning they recommend against using it. And, as a senior team member provocatively asked us all: if Langchain really abstracts away all the details, what is the point of us? Couldn't any software engineer do exactly what we do?

So why do we still stick by it?

Because speed. Langchain is still exceptional in the prototyping stages, precisely because it allows you to ignore the details for now. The time from idea to execution is, frankly, hard to beat.

Say I have built a RAG which is running happily in production. It is configured to pull the 25 most relevant document chunks in the retrieval stage, and these are used when it writes an answer. But now my project manager asks me to improve the quality of the answers. I know, through reading and research, that one way (among many) to improve quality is to improve the context sent to the chatbot. Maybe I could query 100 chunks and then rerank them down to 25, turning my context retrieval into a two-step process with limited loss of performance (sketched after the list below).

Wanting to pursue this route, I need to decide on a reranking algorithm. As a good, up-to-date data scientist, I know that such algorithms are legion. A few of them are:

- max-marginal-relevance (MMR)
- cross-encoder models
- Cohere's reranker API
- LLM-as-reranker
- an entirely different embedding model and vector score

Each of these options would be very time-consuming to implement from scratch.
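To make the two-step retrieval concrete, here is a minimal, framework-agnostic sketch. The rerank_score callable stands in for whichever algorithm wins (MMR, a cross-encoder, an LLM judge), and both it and the function below are my own illustration, not code from our production systems.

from typing import Callable, List


def retrieve_and_rerank(
    vectorstore,  # anything with a similarity_search(query, k=...) method
    query: str,
    rerank_score: Callable[[str, str], float],  # placeholder for the chosen reranker
    fetch_k: int = 100,
    top_k: int = 25,
) -> List[str]:
    """Two-step retrieval: cast a wide net, then keep only the best chunks."""
    # Step 1: pull a generous candidate set from the vectorstore.
    candidates = [doc.page_content for doc in vectorstore.similarity_search(query, k=fetch_k)]

    # Step 2: rerank with a (usually more expensive) scoring model and trim to top_k.
    ranked = sorted(candidates, key=lambda chunk: rerank_score(query, chunk), reverse=True)
    return ranked[:top_k]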
The worst part is, I cannot know in advance which method will outperform the others (nor whether any of them will outperform my baseline). None of them is objectively best: it all depends on the quirks of my specific data and implementation. I also cannot know how the change will influence my solution. For example, MMR is known to be fast, while LLM-as-reranker is slow. Maybe it is prohibitively slow on my specific dataset?

Langchain has an implementation ready to go for each of these options. It also has a reranker module which is designed to be easy to plug into my chain. With Langchain, I can implement and test each of these options quickly, and I can go back to my project manager within the week with a concrete plan for better-quality answers.

Why is it so fast? Simple: Langchain is still one of the biggest frameworks around. This means that the internet is bursting with guides, implementations, code examples and things of that nature. Whatever I want to test, Langchain and its huge community can provide. It has plug-and-play modules for almost every provider you need, including database providers. It is still one of the first to implement new algorithms, methods and other state-of-the-art designs. In this case, big really does mean fast.

So couldn't anyone just do the data scientist's job?

Eventually, maybe. But the entire field of LLMs is moving so fast these days that it is still a full-time job to keep abreast of the news cycle, not to mention tinkering with the new tools and getting the real-life experience necessary to build up an intuition around real-life challenges. While any engineer can pick up Langchain and implement anything, it still takes an expert to know exactly WHAT to implement to accommodate someone's specific needs and challenges.

What are the Langchain alternatives?

I know a colleague of mine is working on an article comparing all the bigger LLM frameworks, so stay tuned for that. Long-term, I have no doubt that we will see better, more mature frameworks take over the marketplace. Once the initial gold rush is over, we will have a diverse toolbox capable of covering most use cases. In that sense, LLMs are no different from other new, shiny technologies.

Today, however, we will venture the claim that the real alternative to Langchain is to not use a big framework at all. All the major providers have APIs for their services, and if you're self-hosting your LLM, you probably have very specific needs anyway.
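As an illustration of how small the framework-free version can be, here is a sketch of the answer step calling OpenAI's API directly. It assumes the official openai Python package, a model name of your choosing, and context chunks coming from a retrieval step like the one sketched earlier; it is my example, not a recommendation of a specific model or prompt.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer(question: str, context_chunks: list[str]) -> str:
    """One provider, one call, no chain abstraction in between."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_chunks)
        + "\n\nQuestion: " + question
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whichever model you actually run
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content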

Evaluating and Testing Large Language Models

Hampus Gustavsson, Senior Data Scientist at Todai

As with all software, an LLM application should pass a thorough test suite before it is released to production. However, testing large language models (LLMs) is not as straightforward as testing traditional software, or even classical machine learning models. This article highlights several aspects of testing LLMs, followed by a demonstration using DeepEval, a Python implementation from Confident AI. Our test case will feature a type of LLM application called Retrieval-Augmented Generation (RAG), which integrates access to domain-specific data. You can find all the referenced code in our repository.

Background

At Todai, we often help companies develop RAG systems tailored to specific industries and use cases. Testing these applications must be tailored to each project's unique requirements. In this article, we share key lessons learned and identify the variables we consider critical during the testing phase. Our focus will primarily be on selecting appropriate metrics and designing test suites to optimize business objectives and key performance indicators (KPIs).

Modes of testing

Let us start by dividing the metrics into two categories: statistical and model-based.

A statistical approach, based on traditional n-gram techniques, counts the number of correctly predicted words. While robust, it is less flexible and struggles with scenarios where correct answers can be expressed in multiple valid ways. Consequently, as publicly available language models have become better, the statistical approach has become less common. While some use cases still benefit from statistical methods, the model-based approach generally provides greater flexibility and accuracy.

Model Based

When using a model-based approach, the overarching scheme is to prompt an LLM so that it outputs quantitative results. These judgments often come from a subjective perspective, and one does have to be mindful of this: just as two coworkers may disagree on whether an answer is correct or incorrect, there is ambiguity in this approach. Later in the article, we will go into depth on how the assessment of a question-answer pair can look.

Metrics

Selecting the right metrics is a complex process, often involving translating qualitative business goals into quantitative measures. Metrics generally fall into two categories:

- Testing against predefined answers: this method compares the model's responses to a set of prewritten answers.
- System health evaluations: this involves testing the application's guardrails to prevent undesirable outcomes, such as biased or toxic responses.

In customer-facing applications, system health evaluations are essential to mitigate the risk of poor user experiences. This can involve stress-testing the system by intentionally provoking problematic outputs and measuring against bias, toxicity, or predefined ethical standards.

In our demo, we will focus on the first category: testing performance against prewritten answers. This involves generating questions and expected answers, scoring the model's responses, and aggregating these scores to evaluate overall performance. Key considerations include identifying relevant metrics, defining thresholds, and understanding the consequences of deployment or non-deployment. So, what kind of metrics should we consider for our RAG system? We will be using Answer Relevancy and Answer Correctness.
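Before unpacking the metrics, here is a minimal sketch of how such a test case can be wired up with DeepEval. It is based on DeepEval's documented API as I know it (LLMTestCase, AnswerRelevancyMetric, assert_test); exact class names and parameters may differ between versions, and answer correctness is expressed here through DeepEval's GEval metric, which is one common way to score an answer against an expected answer. The question and answers are invented for illustration.

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_rag_answer():
    # In a real suite, actual_output and retrieval_context come from the RAG system.
    test_case = LLMTestCase(
        input="What year did the company open its first office abroad?",
        actual_output="The first office abroad opened in 2015, in Oslo.",
        expected_output="2015",
        retrieval_context=["The company opened its first foreign office in Oslo in 2015."],
    )

    answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
    answer_correctness = GEval(
        name="Answer Correctness",
        criteria="Does the actual output agree factually with the expected output?",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    )

    # Both metrics are themselves LLM-based, so thresholds need tuning per project.
    assert_test(test_case, [answer_relevancy, answer_correctness])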
The answer relevancy metric is the LLM equivalent of precision: it measures the degree of irrelevant or redundant information provided by the model.

Question-Answer Generation

When it comes to generating the test suite, there are a few things to consider: we want to generate tests in a scalable, representative and robust way. Starting with the representativeness and robustness aspects, we should make the questions as relevant to the task as possible, and we should favour global questions over local ones, challenging the RAG to combine various sources of information to arrive at the correct answer.

The straightforward way of generating questions and answers is to let subject matter experts and/or stakeholders create them manually. This is certainly a valid way, but let us explore test suite building a little further. We will look at three approaches, two offline and one online: manually made, automatically generated, and sourced from a production setting. We will cover what sets these techniques apart and compare their pros and cons. Let us look at them one by one.

- Manually made: the straightforward way to get questions and correct answers. Preferably, this is done by subject matter experts and/or intended users when applicable. The cons of this approach are that it may be biased towards the people creating the questions and answers, that the dataset it creates is stale (i.e. it does not take data drift into account), and that it can of course be resource intensive.

- Production questions: this is the preferred approach, with one big caveat. If you can get real-world questions and answers, they hopefully represent the real-world use case as closely as possible, and they react quickly to changes in the environment the system is used in. The caveat, however, is retrieving this feedback without disturbing the user experience. The obvious way is to explicitly ask for feedback, but this is rarely something you want to do. There are a few different ways to collect user feedback implicitly, and it is one of these you should opt for.

- Automated: the basis here is to use another model to generate questions and answers, either by randomly sampling the documents the questions are based on, or by selecting documents methodically (sketched below). There is a risk of generating questions that are too easy for the model to answer; that is, the model has a hard time creating questions that are genuinely difficult for itself. Also, without further prompting or examples, there is a risk that the generated questions have little or no significance in practice. This can be tamed by combining the automated generator with one of the two previous approaches, for example by prompting it with manually made or production examples. You can then use these questions as a test suite.
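As an illustration of the automated approach, here is a minimal sketch that samples a document and asks an LLM to write a question-answer pair for it. The openai client usage is real, but the prompt, the model name and the helper function are my own assumptions rather than code from our repository.

import json
import random

from openai import OpenAI

client = OpenAI()


def generate_qa_pair(documents: list[str]) -> dict:
    """Sample a document and let an LLM write a question and a reference answer for it."""
    source = random.choice(documents)
    prompt = (
        "Write one question that can only be answered using the text below, "
        "plus the correct answer. Respond as JSON with the keys 'question' and 'answer'.\n\n"
        + source
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable generator model will do
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)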