Why ChatGPT Won’t Replace Me as a (Human) Developer

Nina Holm-Jensen, Senior Data Scientist

The year was 2024. The future had arrived wearing the name ChatGPT, and everyone was raving. I was a senior software developer who had been called into a sales meeting to answer the customer's more technical questions. We made small talk over our afternoon coffee. And then they started asking questions. A few questions about data integrations. Some about our estimates. And then I was asked: "How can you estimate so much time, when I can ask ChatGPT to make it in five minutes?"

At the time, I was astonished – and, to be entirely honest, a little bit angry. I spent an embarrassingly long time ranting to my colleagues at the coffee machine afterwards. But in the year since, I have kept hearing the question asked. Not always as blatantly. But it is permeating the entire field of coding at this point. So let's explore it.

Everyone can code now

Everyone and their grandma has tried vibe coding their own app at this point. A friend of mine texted me excitedly the other day to tell me how he had vibe coded his own app to send love letters to his girlfriend. Last week, my project manager told me how he used Cursor to fix an issue in an old application, happy that he didn't have to bother the developer who wrote it.

Romantic use cases aside, this development is exciting. On my end, as an "old" programmer, it is fun to get new tools that make my work faster. On a societal level, it is amazing how code is getting democratized. A lot of grunt work is being lifted off my shoulders. But it is incredibly important that we understand what kinds of work ChatGPT can lift. And what kinds of work it cannot.

Consider my project manager who solved a coding issue with Cursor. If we envision an IT system as an elaborate birthday cake, then Cursor basically told him to get the blue sprinkles from the cupboard and to put them on the cake. That is impressive, for sure. But who made the cake itself? Who layered on the frosting? Who went and bought the blue sprinkles and put them in the cupboard? Not Cursor.

The "invisible" work I do

Let's say I am asked to develop a new feature. What do I need to master in order to turn this feature request into actual value?

The first and foremost thing is to know the broader IT landscape. I need to know which tools are out there, what kinds of problems they can solve, what is just new and shiny, and what is feasible.

It is knowing and understanding the IT landscape of my particular organization or project: all the legacy code, which code repos are actually deployed where, the infrastructure, and which external systems we depend upon.

It is remembering the decisions we have made and why. This is especially important for the counterintuitive decisions we made ten sprints ago. Sometimes a database configuration is insane because it has to accommodate the legacy code once written by the founder's teenage nephew.

It is knowing the requirements, both technical and non-technical, and keeping them in mind when analysing solutions. It is knowing which requirements are half-baked or inaccurately described (because life happens, and nothing is ever perfectly documented), either because I have been told, or because I have seen similar things before. It is knowing all the things that are not written down but often discussed in meetings, like expectations for scalability, security demands, or cost limits.
Often, I will be expected to prepare the system for a grand new feature before the feature is even described – and often, working towards this fuzzily-described idea will save us much time in the long run.

It is constantly questioning whether we are solving the right problem in the right way.

It is building an actual, architecturally solid codebase which Cursor can read and act upon. Code must always be clean and extendable, meaning we can add more features with minimal pain. Clean code is an art and a discipline, and it is crucial for preserving our lead time.

It is reading and understanding the code before accepting Cursor's proposed changes to my codebase. Metaphorically, Cursor might ask me to sprinkle chili flakes on my birthday cake, and I have to know whether that is the right thing to do.

Why can only a human do it?

Because humans are imperfect, social creatures. We are creatures of emotion, of connection, of politics. All these things are incredibly important when gathering the knowledge I need to turn code into value.

This is why the coffee machine is the most important place in any organization. I go there, I overhear my colleagues talking, and I realize that someone on another team has just solved the problem I'm agonizing over, saving me days of further work. As a human, I can hear the hesitation in the product manager's voice when she describes a new feature request, clocking that she isn't entirely confident in its value.

I understand that other work is done by other humans. I send the email asking our devops guy to set up data backups in our system. I send that email again. I then bring him a coffee and a smile, shamelessly bribing him to solve my problem during a clearly overworked day/week/month/worklife. Because empathy and connection are something only another human can offer.

ChatGPT does not drink coffee

And now we come full circle. ChatGPT is amazing and will revolutionize the field of IT (as it will revolutionize so many other fields). It is already changing the way I work, the way I talk to my project manager, the way I interact with new knowledge. ChatGPT is an amazing assistant, which I can direct according to the needs of myself and my customers. It is a great tool which I
Langchain will try to break your codebase: Or, how Clean Architecture can save you from headaches

Nina Holm-Jensen, Senior Data Scientist

Prerequisites to reading:

- A basic understanding of how a RAG system works (if not, you can read any article on the topic, like this one, and come back)
- Ability to read Python code
- A fundamental understanding of how to unit test

Repo: https://github.com/TodAI-A-S/Todai-LLM-project/branches

LLMs are complicated. To write a simple RAG system, you have to connect to multiple databases and tables (your vectorstore is obvious, and then you have the user table, the conversation history, maybe conversation metadata or user permissions, etc.). You then have to juggle the data retrieved by your LLM. It needs to be coerced into sensible objects which you stream to the frontend for visual rendering. And what happens when the LLM feels feisty and sends back a different data format? Your error handling needs to be solid. It's all terribly complex.

What is the must-do for all complex code? For this article, I assume that we all agree that automatic testing is good and necessary. If you need a little convincing, go here.

First, let me scope this article. I will focus pretty hardcore on local, on-my-machine integration testing, which is fundamentally deterministic. LLMs are notoriously non-deterministic, so for a proper, business-centric test suite, you will also need a way to test the LLM flow. See my colleague's articles on that particular topic. Here, I will discuss how to best abstract away all of that so you can test the deterministic code around the LLM. It's harder than it sounds.

I will base this article on snippets of actual production code I wrote this year. I have cut 90% of the complexity (and all of the identifying business logic) so I can focus on a few interesting parts.

What is the problem?

Langchain promises to bundle all your LLM-specific needs into a friendly library that handles everything for you. It does that, mostly. But, to be completely frank, Langchain is a bit of a mess. It is made for data scientists, not software engineers. Is this part a function or a class? Who knows? Is this part configuring or executing code? Does it even matter?

If you are here, you probably care a little bit about writing testable code. You care about separation of concerns. If you are here, you have probably also seen the Langchain documentation – or the myriad of ten-line RAG examples littering the internet. And you've thought: "How the hell do I modularize this?"

If you go to Langchain's own documentation and string their examples together, it will look something like this:

```python
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    ...,
]
vectorstore = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    index_name="opensearch-self-query-demo",
    opensearch_url="http://localhost:9200",
)
```

Nice and quick. You've got your entire RAG system in a roughly 16-line script where everything – connecting to the database, connecting to the LLM, and a ton of transformation – happens in a single line. Nice. And utterly untestable. Some of the questions that crop up:

- How do I mock anything here?
- The OpenAIEmbeddings class reads my API key from environment variables. Can I control that without nasty side effects?
- The database connection happens somewhere deep in the from_documents() function.

If I were to write a test, I would have to do a lot of hacky interceptions, and it would undoubtedly break the next time a hapless developer came by.
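To illustrate, here is a rough sketch of what such a test could look like, assuming the script above lives in a hypothetical module called rag_script.py. The module name and the whole patching approach are mine for illustration (using pytest's monkeypatch fixture for the environment variable), not a recommended pattern:

```python
# Hypothetical test against the coupled script above, assumed to live in
# a module called rag_script.py. An illustration of the "hacky
# interceptions" required, not a recommendation.
from unittest.mock import MagicMock, patch


def test_vectorstore_is_built(monkeypatch):
    # OpenAIEmbeddings reads the API key from the environment, so the
    # test has to fake it before the script is imported.
    monkeypatch.setenv("OPENAI_API_KEY", "fake-key")

    # The database connection happens inside from_documents(), so the
    # test has to reach into Langchain itself and intercept the call.
    with patch("langchain_openai.OpenAIEmbeddings"), \
         patch(
             "langchain_community.vectorstores.OpenSearchVectorSearch.from_documents",
             return_value=MagicMock(),
         ) as fake_from_documents:
        import rag_script  # the import itself runs the whole flow

    fake_from_documents.assert_called_once()
```

It works, sort of, but the test now knows the exact import paths inside Langchain, and it breaks the moment the script is refactored.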
The solution

I quickly hit the wall of trying to make Langchain's own examples testable. How do you separate responsibilities? How do you inject dependencies? I realised that Langchain wasn't going to help me willingly. So I made it my personal mission to beat and batter Langchain into a shape reminiscent of proper architecture. My anvil of choice was Clean Architecture as described by Uncle Bob.

Fundamentally, I worked to separate the code into three layers:

- Interfaces, which handle any and all external connections and calls.
- Services, which handle all business logic. Services DO things. Services carry and transform data.
- Entities, which are the data objects being carried and transformed.

Along with, of course, configuration code, which we all know to separate out. Strict adherence to this architecture became my guiding principle.

Working with entities

Remember, the thing I wrote is a full-fledged application. It receives (messy) JSON data through an endpoint, then has to call a database to enrich the data, then pass this data elsewhere for audit logging. Then it does some data transformations (including a PDF parser reading the page content). And then, finally, the relevant data reaches the embedding flow. At all these points, I need to clearly understand what my data looks like.

The responsibility of the entity:

- Collecting data in well-defined and sensibly-named chunks.
- Matching data to domain concepts.

I used Pydantic to parse the incoming JSON from the upstream data pipeline, and to represent the data at all times in the transformation steps, until it arrived at my embedding flow looking something like this:

```python
from typing import List

from pydantic import BaseModel


class PdfPage(BaseModel):
    page_content: str
    page_number: int


class PdfDocument(BaseModel):
    doc_id: str
    pages: List[PdfPage]
```

Nothing magical here. Entities are just dataclasses. We love them because they give us a shared language around the data, and because they are so much more self-documenting than dictionaries or lists or all the other structures that data scientists love to use while tinkering.

Interfaces

Dependencies are finicky things. Whenever you depend on anything not in your own codebase, you risk that an unexpected change messes with your test. Maybe someone had to delete the data row you depended on? Maybe someone else is updating the service you need to call? Then your test fails for reasons completely unrelated to actual errors. And, from a code maintenance perspective, you want to be able to change providers reasonably easily. You want any external dependencies to be plug-and-play with clearly defined entry points.

Infrastructure doesn't change often
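To make the interface idea concrete, here is a minimal sketch of what such an entry point could look like, using Langchain's OpenSearch integration underneath. The class and method names are my own for illustration, not taken from the production code in the repo:

```python
# A minimal sketch of an interface layer around the vector store.
# The names (VectorStoreInterface, OpenSearchInterface) are illustrative.
from typing import List, Protocol

from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings


class VectorStoreInterface(Protocol):
    """The only entry point services use to store embeddings."""

    def add_documents(self, documents: List[Document]) -> None: ...


class OpenSearchInterface:
    """Concrete adapter wrapping Langchain's OpenSearch integration."""

    def __init__(self, opensearch_url: str, index_name: str, embeddings: Embeddings):
        self._store = OpenSearchVectorSearch(
            opensearch_url=opensearch_url,
            index_name=index_name,
            embedding_function=embeddings,
        )

    def add_documents(self, documents: List[Document]) -> None:
        self._store.add_documents(documents)
```

Services then depend only on something shaped like VectorStoreInterface, so a local integration test can hand them a trivial in-memory fake instead of a live OpenSearch cluster.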
Why we haven’t gotten rid of Langchain yet

Nina Holm-Jensen, Senior Data Scientist

Based on a conversation between Todai's data scientists.

Prerequisites:

- A basic understanding of how a RAG system works (if not, you can read any article on the topic, like this one, and come back)

Why would you want to get rid of Langchain?

Langchain is a long-standing framework for working with LLMs. It is conceptually based on the idea of "chains", a way of chaining together different flows, agents, and other functionality. It is designed to be very plug-and-play, with standardised interfaces between the different modules. In principle, you should be able to call the same functions for OpenAI as you do for AWS's Bedrock suite. Which is nice and simple.

But the major problem with Langchain is exactly this plug-and-play mentality. Langchain invites you not to think too deeply about the implementation details of your work. Everything is abstracted away into these chain "links", and you just have to configure a few things. But under the hood, the implementation details ARE very different and very complicated. LLMs change at a breakneck pace, and every provider has their own quirks.

Langchain is not exactly stringent about following software engineering principles. Configuration code, executing code, database connections, data transformations: it all often happens in one big bowl of spaghetti code, squeezed into whatever shape the chain structure demands.

This laissez-faire approach leads to fun surprises. One time, a colleague of mine just couldn't understand why she kept getting a weird bug in her chain, until she went on a code deep dive and realised that, upstream in her LLM chain, Langchain had implemented a function which just returned None. No errors. No NotImplementedException. Just an empty return value, which was propagated further down the chain's modules until it, eventually, caused an error.

The concept of "chains" is also another level of abstraction, which can be hard to explain to non-data scientists. It is worth noting that Thoughtworks' Tech Radar even moved Langchain to "Hold" last year, meaning they recommend against using it. And, as a senior team member provocatively asked us all: if Langchain really abstracts away all the details, what is the point of us? Couldn't any software engineer do exactly what we do?

So why do we still stick by it?

Because speed. Langchain is still exceptional in the prototyping stages, specifically because it allows you to ignore the details for now. The time from idea to execution is, frankly, exceptional.

Say you have built a RAG system which is running happily in production. It is configured to pull the 25 most relevant document chunks in the retrieval stage, and these are used when it writes an answer. But now my project manager asks me to improve the quality of the answers. I know, through reading and research, that one way (among many) to improve quality is to improve the context sent to the chatbot. Maybe I could query 100 chunks and then rerank them down to 25, turning my context retrieval into a two-step process with limited loss of performance.

Wanting to pursue this route, I need to decide on a reranking algorithm. As a good, up-to-date data scientist, I know that such algorithms are legion. A few of them are:

- max-marginal-relevance (MMR)
- cross-encoder models
- Cohere's reranker API
- LLM-as-reranker
- an entirely different embedding model and vector score

Each of these options would be very time-consuming to implement from scratch.
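From scratch, yes – but for a sense of how little code one of these options takes in Langchain, here is a sketch of the max-marginal-relevance route, assuming an already configured Langchain vector store (the numbers are illustrative):

```python
# A minimal sketch of the "query 100, rerank to 25" idea using Langchain's
# built-in max-marginal-relevance search. The k/fetch_k values are examples.
from langchain_core.vectorstores import VectorStore


def build_mmr_retriever(vectorstore: VectorStore):
    """Fetch 100 candidate chunks, keep the 25 most relevant yet diverse ones."""
    return vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 25, "fetch_k": 100},
    )
```

Trying one of the other options is a similarly small swap rather than a rewrite of the retrieval stage.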
The worst part is, I cannot know in advance which method will outperform the others (nor whether any of them will outperform my baseline). None of them are objectively best – it all depends on the quirks of my specific data and implementation. I also cannot know how the change will influence my solution. For example, MMR is known to be fast, while LLM-as-reranker is slow. Maybe it is prohibitively slow on my specific dataset?

Langchain has an implementation ready to go for each of these options. It also has a reranker module which is designed to be easy to plug into my chain. With Langchain, I can implement and test each of these options quickly, and I can go back to my project manager within the week with a concrete plan for better-quality answers.

Why is it so fast? Simple. Langchain is still one of the biggest frameworks around. This means that the internet is bursting with guides, implementations, code examples and things of that nature. Whatever I want to test, Langchain and its huge community can provide. It has plug-and-play modules for almost every provider you need, including database providers. It is still one of the first to implement new algorithms, methods and other state-of-the-art designs. In this case, big really does mean fast.

So couldn't anyone just do the data scientist's job?

Eventually, maybe. But the entire field of LLMs is moving so fast these days that it is still a full-time job to keep abreast of the news cycle, not to mention tinkering with and trying the new tools, and getting the real-life experience necessary to build up an intuition around real-life challenges. While any engineer can pick up Langchain and implement anything, it still takes an expert to know exactly WHAT to implement to accommodate someone's specific needs and challenges.

What are the Langchain alternatives?

I know a colleague of mine is working on an article comparing all the bigger LLM frameworks, so stay tuned for that. Long-term, I have no doubt that we will see better, more mature frameworks take over the marketplace. Once the initial gold rush is over, we will have a diverse toolbox capable of covering most use cases. In that sense, LLMs are no different from other new, shiny technologies.

Today, however, we will venture the claim that the real alternative to Langchain is to not use a big framework at all. All the major providers have APIs for their services, and if you're self-hosting your LLM, you probably have very specific needs anyway.
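For comparison, here is a minimal sketch of that framework-free route, calling OpenAI's API directly through its official Python client (the model name and prompt are placeholders, and error handling is omitted):

```python
# A minimal sketch of the "no framework" route: calling the provider's
# API directly with its official client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    ],
)

print(response.choices[0].message.content)
```

The retrieval step, prompt templating and error handling then become plain Python that you own and can test like any other code.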
Evaluating and Testing Large Language Models

Hampus Gustavsson, Senior Data Scientist at Todai

As with all software, an application must pass a thorough test suite before it is released to production. The same applies to large language models (LLMs). However, testing LLMs is not as straightforward as testing traditional software or even classical machine learning models. This article highlights several aspects of testing LLMs, followed by a demonstration using a Python implementation from Confident AI: Deepeval. Our test case will feature a type of LLM application called Retrieval-Augmented Generation (RAG), which integrates access to domain-specific data.

Background

At Todai, we often help companies develop RAG systems tailored to specific industries and use cases. Testing these applications must be tailored to each project's unique requirements. In this article, we share key lessons learned and identify the variables we consider critical during the testing phase. Our focus will primarily be on selecting appropriate metrics and designing test suites to optimize business objectives and key performance indicators (KPIs).

Modes of testing

Let us start by dividing the metrics into two categories: statistical and model-based.

A statistical approach, based on traditional n-gram techniques, counts the number of correctly predicted words. While robust, it is less flexible and struggles with scenarios where correct answers can be expressed in multiple valid ways. Consequently, as publicly available large language models have become better, the statistics-based approach has become less common. While some use cases still benefit from statistical methods, the model-based approach generally provides greater flexibility and accuracy.

Model-based

When using a model-based approach, the overarching scheme is to prompt an LLM so that it outputs quantitative results. These judgements often come from a subjective perspective, and one does have to be mindful of this: as with asking a coworker whether they would label an answer correct or incorrect, ambiguity exists in this approach. Later in the article, we will go into depth on what the assessment of an answer can look like.

Metrics

Selecting the right metrics is a complex process, often involving translating qualitative business goals into quantitative measures. Metrics generally fall into two categories:

- Testing against predefined answers: This method compares the model's responses to a set of prewritten answers.
- System health evaluations: This involves testing the application's guardrails to prevent undesirable outcomes, such as biased or toxic responses.

In customer-facing applications, system health evaluations are essential to mitigate the risk of poor user experiences. This can involve stress-testing the system by intentionally provoking problematic outputs and measuring against bias, toxicity, or predefined ethical standards.

In our demo, we will focus on the first category: testing performance against prewritten answers. This involves generating questions and expected answers, scoring the model's responses, and aggregating these scores to evaluate overall performance. Key considerations include identifying relevant metrics, defining thresholds, and understanding the consequences of deployment or non-deployment.

So, what kind of metrics should we consider for our RAG system? We will be using Answer Relevancy and Answer Correctness.
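As an example of what this looks like in code, here is a sketch of scoring a single test case with Deepeval's answer relevancy metric. The class names follow the deepeval library, but the exact API may differ between versions, and the question, answer and context strings are invented:

```python
# A minimal sketch of scoring one test case with deepeval's answer
# relevancy metric. Exact class names and signatures may vary between
# deepeval versions; the strings are invented for illustration.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the standard warranty period?",
    actual_output="The standard warranty period is two years.",
    retrieval_context=["All products come with a two-year warranty."],
)

metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the LLM-as-judge evaluation and reports pass/fail per metric.
evaluate(test_cases=[test_case], metrics=[metric])
```

A correctness-style metric would additionally use the test case's expected_output field as the prewritten answer to compare against.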
The answer relevancy metric is the LLM equivalent of precision: it measures the degree of irrelevant or redundant information provided by the model.

Question-Answer Generation

When it comes to generating the test suite, there are a few things to consider. We want to generate tests in a scalable, representative and robust way. Starting with the representativeness and robustness aspects, we should make the questions as relevant for the task as possible and focus on making the questions more global rather than local, challenging the RAG to combine various sources of information to end up with the correct answer.

The straightforward way of generating questions and answers is to let subject matter experts and/or stakeholders manually create them. This is for sure a valid way, but let us also explore test suite building a little further beyond this. We will be looking into three approaches – two offline and one online. These are: manually made, automatically made, and sourced from a production setting. We will try to cover what sets these techniques apart and compare their pros and cons against each other. Let us look at them one by one.

Manually made: The straightforward way to get the questions and correct answers. Preferably, this is done via subject matter experts and/or intended users when applicable. The cons of this approach are that it might be too biased towards the ones creating the questions and answers, that the dataset it creates is stale (i.e. it does not take data drift into account), and that it can of course be resource-intensive.

Production questions: This is the preferred approach, with one big caveat. If you can get real-world questions and answers, these hopefully represent the real-world use case as closely as possible. It is also quick to react to changes in the environment it is being used in. The caveat, however, is that you need to retrieve this feedback without disturbing the user experience. The obvious way is to explicitly ask for feedback, but this is rarely something you want to do. There are a few different ways you can get user feedback implicitly, and it is one of these you should opt for.

Automated: The basis for this is to use another model to generate questions and answers, either by randomly sampling documents to base the questions on, or by selecting documents in a more methodical manner. There is a risk of generating questions that are too easily answered by the model, i.e. the model has a hard time creating questions that are hard enough for itself to answer. Also, without further prompting or providing examples, there is a risk that the questions being tested have low or no significance in practice. This can be tamed by combining this approach with one of the two previous ones: you can then use those questions as a test fleet, and prompt the model to try to make questions in the same