Who gets the most out of using AI? Experts or novices?

Who gets the most out of using AI? Experts or novices? A nuanced review of studies and practical experience on the question. A podcast is also available for this article.

Dan Rose, Founder & CEO, Todai

Purpose

The debate about who really benefits most from artificial intelligence has run for a long time, especially since the studies known as Jagged Frontier (September 2023) and Generative AI at Work (April 2023) were published. Building on those studies, many media outlets and professionals have since argued that AI is an equalizing force that lifts novices in particular. But practical experience and newer research paint a more nuanced and complex picture. This article tries to shed light on those nuances: when AI tools are good for novices, and when they are good for experts.

Settling this question is essential, because the answer has consequences for how organizations should adopt AI. It matters, for example, for: Who should receive the most training? How should employees be trained? Which AI solutions fit whom? What is the real value of AI, and where does it arise? Who should we hire in the future?

The question also matters for political decisions about education and the future labour market. For example, the effect of AI on primary schools and other educational institutions looks very different depending on whether the strongest or the weakest students gain the most from AI. In other words: does AI close competence gaps, or does it make the problem worse? And how should teaching with AI be designed?

We also need to unpack what "better" and "faster" mean in an AI context: is it only about output quality and time saved, or also about task prioritization and work processes? It is also relevant to look at the effect of AI over time, although that possibility is still very limited for now.

Note that the article uses "experts" and "novices" without a precise definition.
That is a deliberate choice: research studies define the terms differently, so any single precise definition would miss the mark in most contexts. Some studies, for instance, measure experience in time, while others test participants' skills before introducing AI. As a reader, you must interpret from the context or read the specific studies.

Results in brief

In short, the results are:

Experts appear to gain the most on longer, open-ended tasks without a single correct answer, where AI can be part of a feedback process or be used in both preparation and production. This applies to games such as chess, reframing of problem statements, larger programming tasks, and longer written work.

Novices benefit most from AI on small, well-defined tasks with a closed outcome space, i.e. a relatively clear "right" answer or an optimal path to it. This applies, for example, to smaller text tasks and time savings in support work.

Broad observational studies of the whole labour market often show no significant average effect of AI at present.

The effect on creative tasks varies; both beginners and experienced practitioners can benefit, depending on the nature of the task.

The ability to use AI effectively appears to be correlated with experience within the domain. The behaviour experts naturally exhibit in their work appears to be an advantage when AI is brought into that work.

Widespread use of AI can lead to creative, strategic, and linguistic homogenization. Whether that is an advantage or a drawback depends on the specific task.

Caveats and limitations

This review is based on a selection of studies and is subject to certain limitations:

There may be (is) selection bias in the choice of studies.

Most of the included studies use older AI models (e.g. GPT-3.5 and GPT-4o), and AI is evolving rapidly.

The studies often lack longitudinal data that could show effects over time. A single study points to the possibility, but the data is not yet available.

Several of the studies are not peer-reviewed.
The number of included studies is limited.

The review is primarily based on generative AI tools used within existing processes. Processes and solutions that are completely rethought with AI are not examined here.

Method

This article is based on a review of 10 selected research articles and working papers (listed below), which primarily examine the effect of generative AI on productivity, quality, and learning among users with different levels of experience. The studies are analysed along the following dimensions:

Study type: Is it a broad observational study of actual labour-market data, or a more controlled experiment focused on specific tasks?

Expertise level: When does AI benefit experts versus novices?

Task type: What effect does AI have on different types of tasks (e.g. explorative, producing, planning, learning)?

Outcome type: Does it matter whether the task's outcome is open (several possible good solutions) or closed (a more well-defined correct answer)?

The studies

Effects of AI Feedback on Learning, the Skill Gap, and Intellectual Diversity
Riedl & Bogert (September 2024)
AI is best for: the experienced
Study type: Qualitative study
Peer-reviewed: No
Link: https://arxiv.org/pdf/2409.18660
A study of 52,000 chess players shows that incorrect use of AI feedback (sought after successes rather than mistakes) hampers learning. Skilled players use AI more effectively (after defeats), which widens the skill gap between players. Access to AI also reduces strategic diversity in play.

Friend or foe? Artificial intelligence (AI) and negotiation
Cummins & Jensen (June 2024)
AI is best for: The study suggests that experience with AI tools matters more than domain expertise.
Study type: Experimental study
Peer-reviewed: Yes
Link: https://journals.sagepub.com/doi/10.1177/20555636241256852
The study examines how generative AI affects negotiation outcomes. The researchers tested teams in three scenarios: Human vs. Human, Human+GPT vs.
Human, and Human+GPT vs. Human+GPT. The results showed that the teams where both parties used ChatGPT achieved the "best and fastest results" and the highest value creation. Teams that negotiated face to face without AI performed worst and often reached suboptimal outcomes or no agreement at all.

Large Language Models, Small Labor Market Effects
Anders Humlum and Emilie Vestergaard (May 2025)
AI is best for: No direct effect is found, but there is a clear effect on occupational mobility, which by default favours recent graduates.
Study type: Observational study
Peer-reviewed: No
Link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5219933
The study analyses GenAI use and labour-market effects among Danish employees using large surveys (2023/24) linked with register data. Although employer initiatives increase AI adoption and change work tasks, no measurable effect on overall employment or wages is seen yet.

Generative AI at work Erik
What are AI Agents?

AI Agents

What are AI Agents? Because not everything that uses an API deserves a job title

Dan True, Head of Solutions

Following up from our recent blog about what AI is, let's take a further look at another recently hyped and misunderstood feature: AI Agents. They'll book a table for you at your favourite restaurant, make you a budget, and even write code – all while matching your Friday-night energy, whether that means getting everything done with calm precision or charging ahead with bold, unstoppable momentum.

An Agent is an entity with Agency; correspondingly, an AI Agent is an AI with Agency – or at least the ability to Act in some sort of environment. AI Agents have been studied and implemented for decades, and classic examples include many robots and game AIs. I was building AI agents that could plan and act on their own in unpredictable, game-like digital environments back in my university days a few decades ago. It's nothing new. In recent years, LLM-based AI Agents have brought the hype to the masses. While we have decades of literature on what AI Agents are, the casual observer of the current hype train could be excused for believing any LLM/GPT-based system nowadays is an AI Agent in disguise. (If you get this reference, I hope your back pain isn't too bad.)

The underlying discussion about what constitutes an AI Agent relates to philosophy and etymology: to be an Agent, do you need your own Agency, or is it sufficient to Act On Behalf of someone else? To avoid injecting any existential crisis into our readers, we'll keep this limited to AI systems 🙂 Personally, I prefer to distinguish between the two different systems explained below. I use these terms to establish some clarity in this age of hype and vagueness. Do note that these are my definitions based on what I've read and learned, so you'll find plenty of people who will tell you I'm wrong. Some of them may even be right.
AI Agents: When an AI has true Agency in some environment, continuously pursuing its defined goals with the tools it has available.

Agentic Behaviour: When an AI system can Act, but only on behalf of a user and only in accordance with the goals set forward by that user.

Let's take a deeper look at each.

AI Agents

For a system to be a true AI Agent, I believe it must:

Be at least partially continuously active, e.g. run all the time or at least as a scheduled job. That is, to be an agent it needs to have agency – it can't just react to user prompts and then be inactive until the next prompt.

Go through a Sense-Think-Act cycle: it first observes an environment, then considers how to achieve its goal(s) in that environment, and potentially takes actions.

Most people can figure out how this would work in a robot, but let's take a more useful example from industry: an AI agent that generates leads for your business. Imagine you have a business where most of your sales leads come through sites such as LinkedIn, job postings, announcement platforms and the like. An AI Agent cycle could look like this:

Sense: Continuously (e.g. every minute, hour or night) scan LinkedIn, job postings and other sources to build a list of potential leads not yet evaluated.

Think: Evaluate each lead against various criteria, such as the Ideal Customer Profile. This may mean more look-ups to websites or databases to get more information about the potential lead and their business.

Act: Take action towards those leads, using tools or APIs made available to the Agent. This could include sending an introductory email or creating a new lead entry in a Customer Relationship Management system.

As you can see, such AI Agents could have their uses – but this is a much narrower definition than the one you see used on LinkedIn at the moment. Reasonable people can disagree, but to me all of the above is necessary before I call something a true AI Agent.
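The Sense-Think-Act cycle above can be sketched in a few lines of Python. This is a minimal, illustrative sketch of the lead-generation example; every name in it (fetching postings, the ICP check, the CRM list) is a hypothetical stand-in for real scraping, scoring, and CRM APIs:

```python
# Sketch of a Sense-Think-Act cycle for the hypothetical lead-generation agent.
# All data sources and actions are stand-ins; a real agent would call external
# APIs here and run on a schedule rather than once.

def sense(seen_ids):
    """Sense: scan sources for postings the agent has not yet evaluated."""
    postings = [  # stand-in for scraping LinkedIn, job boards, etc.
        {"id": 1, "company": "Acme", "hiring_for": "data engineer"},
        {"id": 2, "company": "Globex", "hiring_for": "barista"},
    ]
    return [p for p in postings if p["id"] not in seen_ids]

def think(lead):
    """Think: evaluate the lead against a (toy) Ideal Customer Profile."""
    return "data" in lead["hiring_for"]

def act(lead, crm):
    """Act: take action, here by creating a CRM entry."""
    crm.append(lead["company"])

def run_cycle(seen_ids, crm):
    for lead in sense(seen_ids):
        seen_ids.add(lead["id"])
        if think(lead):
            act(lead, crm)

crm, seen = [], set()
run_cycle(seen, crm)  # a true agent would repeat this continuously
```

The point of the structure is the loop itself: the agent owns the cycle and would rerun it on a timer, rather than waiting for a user prompt.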
Agentic Behaviour

Now, where it gets messier is with Agentic Behaviour. To me, this is when an AI (usually a GPT, but not necessarily) has parts of the full flow of an AI Agent but not the entirety, or doesn't go through a cycle at all and only ever acts in response to user inputs. For instance: a GPT that has access to one or more tools, such as APIs, which it can call on the user's behalf – Acting. This usually also entails some Think steps, as it needs to evaluate whether it has enough information to call a specific API or needs to ask the user clarifying questions. But it doesn't exist in an environment which it Senses and then reacts to – it only ever reacts to user input instead of living in its own cycle – so to me, this is not a true AI Agent.

To be fair, many GPT systems with access to tools and based on so-called LLM Reasoning Models (a terrible name, since they don't do Automated Reasoning – but that's for another day) do have an internal cycle:

They consider the user prompt.

They consider how to answer the query.

They decide whether to call potentially useful tools, such as APIs, either to get more information or to update some record somewhere.

They decide whether they have a satisfying response for the user and either return it or loop through the cycle again.

This internal loop is often used as justification for calling such a system an agent. Again, to me this isn't enough to be called an AI Agent, as it only ever acts on behalf of a user's input – but it's definitely Agentic Behaviour.

Rounding off

At the risk of annoying half of LinkedIn: not everything with a loop and an API call is an AI Agent. There's a difference between a system that occasionally wakes up when poked, and one that
What is AI?

AI in General

What is AI?

Dan True, Head of Solutions

Are AI and Machine Learning the same thing? What about Deep Learning? AI is all the hype these days: everyone and their dog is talking about it on LinkedIn, and it seems like AI experts are crawling out of the woodwork. As someone who began studying AI back in 2008, at the thawing of the last AI Winter, I've seen the field of practitioners grow exponentially in the last few years – with many of them not seeming to realise that this field of Computer Science arose in the 1950s, and that its formal foundations stretch back much further. Back in the 1970s, people had the same discussions as now about what was just a program and what was AI, while businesses tried to sell a vision where AI ran on specialized hardware that they could sell us (look up LISP machines).

There's a famous quote (from 1971!): "AI is a collective name for problems which we do not yet know how to solve properly by computer" – Bertram Raphael (co-creator of the A* search algorithm). This logically entails that as soon as we know how to solve a problem, it stops being AI. While I don't agree that this is how we should define AI, it accurately reflects our trouble in adequately defining what it is we're working with. So the next time you have a discussion on LinkedIn about what AI is or isn't, know that you're partaking in a century-old discussion, and take some pride in following in the footsteps of your ancestors.

Since the field's inception, it has become clear that to adequately define what Artificial Intelligence is, we first need to define what Intelligence itself is, which we still really can't adequately do. I hate dependency problems.

So what is it? Let's take a look at what the broader field of AI is, as of mid-2025. This isn't an exhaustive or scientific list, but it's what I use when I try to bring my point across to my customers.
I will focus on ease of understanding and explainability here, not on making a taxonomy that will stand the test of time. I know how futile that is, and I have a kettle on. I usually divide the field into the following sub-fields: Predictive AI, Generative AI, Symbolic AI and AI Understanding.

Predictive AI

In short, Predictive AI covers methods that use a set of data to predict new data points from the trends in that data set. The term is often used interchangeably with Machine Learning, which isn't entirely accurate (I'll get to why), but works for most casual conversations. Predictive AI as a discipline split from the wider AI field (which was based in Computer Science) during the 1980s and grew out of a numerical/statistical approach. Predictive AI covers a lot of different types:

Statistical Methods: such as Linear Regression, which uses pure statistics to predict the next value, e.g. predicting a future price of a good or service based on its historical values.

Supervised Learning: these models are given labelled data sets and are trained to predict the label of new examples, e.g. predicting whether a credit card transaction is fraudulent, based on historical labels of fraudulent transactions.

Unsupervised Learning: these models try to learn from raw data which is not labelled, e.g. predicting which customers belong to the same customer segment without having labelled previous customers beforehand.

Reinforcement Learning: these models try to optimise a reward function in an environment and need to constantly balance short-term and long-term gain in that reward function. Classic examples would be an AI playing chess or Go, where the reward function is a score of the current state of the board and the player tries to optimise their own score. Note that Reinforcement Learning has strong ties to Symbolic AI, which we will introduce below.
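To make the statistical end of this concrete, here is a minimal sketch (plain Python, no libraries, with a made-up toy dataset) of the Linear Regression idea mentioned above: fit a line to historical values with the closed-form least-squares formulas, then predict the next point.

```python
# Toy Predictive AI: one-variable linear regression fitted by least squares.
# The data is invented for illustration; real work starts with real data.

def fit_linear(xs, ys):
    """Return slope a and intercept b minimising squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# "historical prices" over four periods (perfectly linear: price = 2*x + 8)
xs = [1, 2, 3, 4]
ys = [10.0, 12.0, 14.0, 16.0]

a, b = fit_linear(xs, ys)
predicted = a * 5 + b  # predict the price in period 5
```

Everything else in Predictive AI is, loosely speaking, a more sophisticated version of this move: learn parameters from past data, then apply them to new data points.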
Note that Neural Networks, though hyped as 'a model of the human brain', are just one of many methods within Predictive AI. Most of the effort of a Data Scientist is directed at understanding the available data, testing various models to find a good predictor, and fine-tuning the input to get the most accurate predictions possible. GenAI, which I now consider its own field within AI, grew out of Unsupervised Learning prediction models trained on text, images and sound.

One key issue with Predictive AI is that it is not, in general, explainable. You can do some parameter analysis to map which parameters affect the outcome the most, but you generally can't adequately explain why a certain prediction came out the way it did – especially in predictive models with many input parameters. Imagine asking for an explanation of why some request was denied and receiving a list of many hundred weighted parameters back. Not exactly useful to a human. For some use cases this is acceptable, like when predicting ice-cream sales. But when a predictive model makes a decision directly affecting a human – like predicting whether they're likely to pay back a loan, be a good tenant in an apartment or become a criminal – it quickly becomes immoral, if not straight-up illegal in some jurisdictions.

Predictive models are dependent on data – not necessarily much, but some – to do anything. That also means there is a large class of problems for which no data exists, and to which predictive AI is therefore hard to apply.

Generative AI

Generative AI (GenAI) is a discipline about systems which generate output, usually text, images, or sound, though more edge cases also exist – such as generating 3D models from prompts or animations from motion capture. Note that under the hood, many GenAI systems are built upon Predictive AI models hailing from Unsupervised Learning, where the prediction is the next word, soundbite etc., which results in generating a stream of output.
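The "predict the next word, get a stream of output" idea can be shown with a deliberately tiny toy: a bigram model that counts which word follows which in a miniature corpus and then generates text by always picking the most likely next word. This is an assumption-laden sketch for intuition only; a real LLM shares only the general shape.

```python
# Toy next-word predictor: count word -> next-word frequencies, then generate
# output by repeatedly predicting the most likely continuation.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def generate(start, n_words):
    out = [start]
    for _ in range(n_words):
        candidates = follows[out[-1]].most_common(1)
        if not candidates:
            break  # dead end: no observed continuation
        out.append(candidates[0][0])
    return " ".join(out)

text = generate("the", 3)  # a "stream" of predicted next words
```

Swap the word counts for a neural network trained on most of the internet and you have the core loop of a text-generating GenAI system.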
Some GenAI models are more unique to this field, such as Diffusion Models, which generate an entire output and then refine it towards the desired target output.
Why ChatGPT Won’t Replace Me as a (Human) Developer

LLMs

Why ChatGPT Won't Replace Me as a (Human) Developer

Nina Holm-Jensen, Senior Data Scientist

The year was 2024. The future had arrived wearing the name ChatGPT, and everyone was raving. I was a senior software developer who had been called into a sales meeting to answer the customer's more technical questions. We made small talk over our afternoon coffee. And then they started asking questions. A few questions about data integrations. Some about our estimates. And then I was asked: "How can you estimate so much time, when I can ask ChatGPT to make it in five minutes?"

At the time, I was astonished – and, to be entirely honest, a little bit angry. I spent an embarrassingly long time ranting to my colleagues at the coffee machine afterwards. But in the year since, I have kept hearing the question. Not always asked as blatantly. But it is permeating the entire field of coding at this point. So let's explore it.

Everyone can code now

Everyone and their grandma has tried vibe coding their own app at this point. A friend of mine texted me excitedly the other day to tell me how he had vibe coded his own app to send love letters to his girlfriend. Last week, my project manager told me how he used Cursor to fix an issue in an old application, happy that he didn't have to bother the developer who wrote it. Romantic use cases aside, this development is exciting. On my end, as an "old" programmer, it is fun to get new tools that make my work faster. On a societal level, it is amazing how code is getting democratized. A lot of grunt work is being lifted off my shoulders. But it is incredibly important that we understand what kinds of work ChatGPT can lift. And what kinds of work it cannot.

Consider my project manager who solved a coding issue with Cursor. If we envision an IT system as an elaborate birthday cake, then Cursor basically told him to get the blue sprinkles from the cupboard and to put them on the cake. That is impressive, for sure. But who made the cake itself?
Who layered on the frosting? Who went and bought the blue sprinkles and put them in the cupboard? Not Cursor. The “invisible” work I do Let’s say I am asked to develop a new feature. What do I need to master in order to turn this feature request into actual value? The first and foremost thing is to know the broader IT landscape. I need to know which tools are out there, what kind of problems they can solve, what is just new and shiny, and what is feasible. It is knowing and understanding the IT landscape of my particular organization or project. It is knowing all the legacy code, which code repos are actually deployed where, the infrastructure, and which external systems we depend upon. It is remembering the decisions we have made and why. This is especially important for the counterintuitive decisions we made ten sprints ago. Sometimes a database configuration is insane because it has to accommodate the legacy code once written by the founder’s teenage nephew. It is knowing the requirements, both technical and non-technical, and keeping them in mind when analysing solutions. It is knowing which requirements are half-baked or inaccurately described (because life happens, and nothing is ever perfectly documented), either because I have been told, or because I have seen similar things before. It is knowing all the things that are not written down but often discussed in meetings, like expectations for scalability, security demands, or cost limits. Often, I will be expected to prepare the system for a grand new feature before the feature is described – and often, working towards this fuzzily-described idea will save us much time in the long run. It is constantly questioning whether we are solving the right problem in the right way. It is building an actual, architecturally solid codebase which Cursor can read and act upon. Code must always be clean and extendable, meaning we can add more features with minimal pain. 
Clean code is an art and a discipline, and it is crucial for preserving our lead time. It is reading and understanding the code before accepting Cursor's proposed changes to my codebase. Metaphorically, Cursor might ask me to sprinkle chili flakes on my birthday cake, and I have to know whether that is the right thing to do.

Why can only a human do it?

Because humans are imperfect, social creatures. We are creatures of emotion, of connection, of politics. All these things are incredibly important when gathering the knowledge I need to turn code into value. This is why the coffee machine is the most important place in any organization. I go there, I overhear my colleagues talking, and I realize that someone on another team has just solved the problem I'm agonizing over, saving me days of further work. As a human, I can hear the hesitation in the product manager's voice when she describes a new feature request, clocking that she isn't entirely confident in its value. I understand that other work is done by other humans. I send the email asking our devops guy to set up data backups in our system. I send that email again. I then bring him a coffee and a smile, shamelessly bribing him to solve my problem during a clearly overworked day/week/month/worklife. Because empathy and connection are something only another human can offer.

ChatGPT does not drink coffee

And now we come full circle. ChatGPT is amazing and will revolutionize the field of IT (as it will revolutionise so many other fields). It is already changing the way I work, the way I talk to my project manager, the way I interact with new knowledge. ChatGPT is an amazing assistant, which I can direct according to the needs of myself and my customers. It is a great tool which I
Langchain will try to break your codebase: Or, how Clean Architecture can save you from headaches

LLMs

Langchain will try to break your codebase: Or, how Clean Architecture can save you from headaches

Nina Holm-Jensen, Senior Data Scientist

Prerequisites to reading:

A basic understanding of how a RAG system works (if not, you can read any article on the topic, like this one, and come back)

Ability to read Python code

A fundamental understanding of how to unit test

Repo: https://github.com/TodAI-A-S/Todai-LLM-project/branches

LLMs are complicated. To write even a simple RAG system, you have to connect to multiple databases and tables (your vectorstore is obvious, and then there is the user table, the conversation history, maybe conversation metadata or user permissions, etc.). You then have to juggle the data retrieved by your LLM. It needs to be coerced into sensible objects which you stream to the frontend for visual rendering. And what happens when the LLM feels feisty and sends back a different data format? Your error handling needs to be solid. It's all terribly complex. And what is the must-do for all complex code? For this article, I assume that we all agree that automatic testing is good and necessary. If you need a little convincing, go here.

First, let me scope this article. I will focus pretty hardcore on local, on-my-machine integration testing, which is fundamentally deterministic. LLMs are notoriously non-deterministic, so for a proper, business-centric test suite, you will also need a way to test the LLM flow; see my colleague's articles on that particular topic. Here, I will discuss how best to abstract all of that away so you can test the deterministic code around the LLM. It's harder than it sounds. I will base this article on snippets of actual production code I wrote this year. I have cut 90% of the complexity (and all of the identifying business logic) so I can focus on a few interesting parts.

What is the problem?

Langchain promises to bundle all your LLM-specific needs into a friendly library that handles everything for you.
It does that, mostly. But, to be completely frank, Langchain is a bit of a mess. It is made for data scientists, not software engineers. Is this part a function or a class? Who knows? Is this part configuring or executing code? Does it even matter?

If you are here, you probably care a little bit about writing testable code. You care about separation of concerns. If you are here, you have probably also seen the Langchain documentation – or the myriad of ten-line RAG examples littering the internet. And you've thought: "how the hell do I modularize this?" If you go to Langchain's own documentation and string their examples together, it will look something like this:

    embeddings = OpenAIEmbeddings()
    docs = [
        Document(
            page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
            metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
        ),
        ...,
    ]
    vectorstore = OpenSearchVectorSearch.from_documents(
        docs,
        embeddings,
        index_name="opensearch-self-query-demo",
        opensearch_url="http://localhost:9200",
    )

Nice and quick. You've got your entire RAG system in a 16-line script where everything – connecting to the database, connecting to the LLM, and a ton of transformation – happens in a single line. Nice. And utterly untestable. Some of the questions that crop up:

How do I mock anything here? The OpenAIEmbeddings class reads my API key from environment variables. Can I control that without nasty side effects?

The database connection happens somewhere deep in the from_documents() function. If I were to write a test, I would have to do a lot of hacky interceptions, and it would undoubtedly break the next time a hapless developer came by.

The solution

I quickly hit the wall of trying to make Langchain's own examples testable. How do you separate responsibilities? How do you inject dependencies? I realised that Langchain wasn't going to help me willingly.
So I made it my personal mission to beat and batter Langchain into a shape reminiscent of proper architecture. My anvil of choice was Clean Architecture as described by Uncle Bob. Fundamentally, I worked to separate the code into three layers:

Interfaces, which handle any and all external connections and calls.

Services, which handle all business logic. Services DO things. Services carry and transform data.

Entities, which are the data objects being carried and transformed.

Along with, of course, configuration code, which we all know to separate out. Strict adherence to this architecture became my guiding principle.

Working with entities

Remember, the thing I wrote is a full-fledged application. It receives (messy) json data through an endpoint, then has to call a database to enrich the data, then pass this data elsewhere for audit logging. Then it does some data transformations (including a pdf parser reading the page content). And then, finally, the relevant data reaches the embedding flow. At all these points, I need to clearly understand what my data looks like. The responsibility of the entity: collecting data in well-defined and sensibly-named chunks, and matching data to domain concepts. I used Pydantic to parse the incoming json from the upstream data pipeline, and to represent the data at all times in the transformation steps, until it arrived at my embedding flow looking something like this:

    class PdfPage(BaseModel):
        page_content: str
        page_number: int

    class PdfDocument(BaseModel):
        doc_id: str
        pages: List[PdfPage]

Nothing magical here. Entities are just dataclasses. We love them because they give us a shared language around the data, and because they are so much more self-documenting than the dictionaries, lists and all the other structures that data scientists love to use while tinkering.

Interfaces

Dependencies are finicky things. Whenever you depend on anything not in your own codebase, you risk that an unexpected change messes with your test.
Maybe someone had to delete the data row you depended on? Maybe someone else is updating the service you need to call? Then your test fails for reasons completely unrelated to actual errors. And, from a code-maintenance perspective, you want to be able to change providers reasonably easily. You want any external dependencies to be plug-and-play, with clearly defined entry points.

Infrastructure doesn't change often
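The interface layer can be sketched like this. These are my own illustrative names, not the article's production code: a small Protocol describing the embedding dependency, a fake that satisfies it for tests, and a service function that only ever sees the Protocol, so no API keys or network connections are needed to test it.

```python
# Sketch of the Interfaces layer: the external embedding provider is hidden
# behind a small protocol, so services can be tested with a deterministic fake.
from typing import List, Protocol

class Embedder(Protocol):
    def embed(self, texts: List[str]) -> List[List[float]]: ...

class FakeEmbedder:
    """Test double: deterministic 'embeddings', no network, no env vars."""
    def embed(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]

def index_pages(pages: List[str], embedder: Embedder) -> List[dict]:
    """Service-layer logic: pair each page with its vector (pure business logic)."""
    vectors = embedder.embed(pages)
    return [{"text": t, "vector": v} for t, v in zip(pages, vectors)]

# In production you would inject an adapter wrapping e.g. OpenAIEmbeddings;
# in tests you inject the fake:
indexed = index_pages(["page one", "pg 2"], FakeEmbedder())
```

The design choice is simply dependency injection: because `index_pages` receives its embedder as an argument, the test never has to intercept anything deep inside a `from_documents()` call.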
Why we haven’t gotten rid of Langchain yet

LLMs

Why we haven't gotten rid of Langchain yet

Nina Holm-Jensen, Senior Data Scientist

Based on a conversation between Todai's data scientists.

Prerequisites: A basic understanding of how a RAG system works (if not, you can read any article on the topic, like this one, and come back)

Why would you want to get rid of Langchain?

Langchain is a long-standing framework for working with LLMs. It is conceptually based on the idea of "chains", a way of chaining together different flows, agents, and other functionality. It is designed to be very plug-and-play, with standardised interfaces between the different modules. In principle, you should be able to call the same functions for OpenAI as you do for AWS's Bedrock suite. Which is nice and simple.

But the major problem with Langchain is exactly this plug-and-play mentality. Langchain invites you not to think too deeply about the implementation details of your work. Everything is abstracted away into these chain "links", and you just have to configure a few things. But under the hood, the implementation details ARE very different and very complicated. LLMs change at a breakneck pace, and every provider has their own quirks. Langchain is not exactly stringent about following software-engineering principles. Configuration code, executing code, database connections, data transformations: it all often happens in one big bowl of spaghetti code, contorted to fit the chain structure.

This laissez-faire approach leads to fun surprises. One time, a colleague of mine just couldn't understand why she kept getting a weird bug in her chain, until she went on a code deep dive and realised that, upstream in her LLM chain, Langchain had implemented a function which just returned None. No errors. No NotImplementedException. Just an empty return value, which was propagated down through the chain's modules until it, eventually, caused an error.
The concept of "chains" is also another level of abstraction, which can be hard to explain to non-data scientists. It is worth noting that Thoughtworks's Tech Radar even moved Langchain to "Hold" last year, meaning they recommend against using it. And, as a senior team member provocatively asked us all: if Langchain really abstracts away all the details, what is the point of us? Couldn't any software engineer do exactly what we do?

So why do we still stick with it?

Because of speed. Langchain is still exceptional in the prototyping stages, precisely because it allows you to ignore the details for now. The time from idea to execution is, frankly, exceptional.

Say you have built a RAG which is running happily in production. It is configured to pull the 25 most relevant document chunks in the retrieval stage, and these are used when it writes an answer. But now my project manager asks me to improve the quality of the answers. I know, through reading and research, that one way (among many) to improve quality is to improve the context sent to the chatbot. Maybe I could query 100 chunks and then rerank them down to 25, turning my context retrieval into a two-step process with limited loss of performance.

Wanting to pursue this route, I need to decide on a reranking algorithm. As a good, up-to-date data scientist, I know that such algorithms are legion. A few of them are:

- max-marginal-relevance (MMR)
- cross-encoder models
- Cohere's reranker API
- LLM-as-reranker
- an entirely different embedding model and vector score

Each of these options would be very time-consuming to implement from scratch. The worst part is, I cannot know in advance which method will outperform the others (nor whether any of them will outperform my baseline). None of them is objectively best: it all depends on the quirks of my specific data and implementation. I also cannot know how the change will influence my solution. For example, MMR is known to be fast, while LLM-as-reranker is slow.
Maybe it is prohibitively slow on my specific dataset?

Langchain has a ready-to-go implementation of each of these options. It also has a Reranker module which is designed to be easy to plug into my chain. With Langchain, I can implement and test each of these options quickly, and I can go back to my project manager within the week with a concrete plan for better-quality answers.

Why is it so fast? Simple: Langchain is still one of the biggest frameworks around. This means that the internet is bursting with guides, implementations, code examples and the like. Whatever I want to test, Langchain and its huge community can provide. It has plug-and-play modules for almost every provider you need, including database providers. It is still one of the first to implement new algorithms, methods and other state-of-the-art designs. In this case, big really does mean fast.

So couldn't anyone just do the data scientist's job? Eventually, maybe. But the entire field of LLMs is moving so fast these days that it is still a full-time job to keep abreast of the news cycle, not to mention tinkering with and trying the new tools, and getting the real-life experience necessary to build up an intuition around real-life challenges. While any engineer can pick up Langchain and implement anything, it still takes an expert to know exactly WHAT to implement to accommodate someone's specific needs and challenges.

What are the Langchain alternatives? A colleague of mine is working on an article comparing all the bigger LLM frameworks, so stay tuned for that. Long-term, I have no doubt that we will see better, more mature frameworks take over the marketplace. Once the initial goldrush is over, we will have a diverse toolbox capable of covering most use cases. In that sense, LLMs are no different from other new, shiny technologies. Today, however, we will venture the claim that the real alternative to Langchain is not to use a big framework at all.
All the major providers have APIs for their services, and if you’re self-hosting your LLM, you probably have very specific needs anyway.
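To make the retrieve-100-then-rerank-to-25 idea from the reranking discussion concrete, here is max-marginal-relevance sketched in plain Python over toy embedding vectors. This is an illustrative implementation under simplified assumptions (cosine similarity, tiny hand-made vectors), not Langchain's version of MMR.

```python
# Max-marginal-relevance (MMR) reranking: greedily pick documents that are
# relevant to the query but not redundant with documents already picked.
# Plain-Python sketch for illustration; not Langchain's implementation.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rerank(query_vec, candidates, k=25, lambda_mult=0.5):
    """candidates: list of (doc_id, vector). Returns up to k doc_ids,
    trading off relevance to the query against redundancy among picks."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        best, best_score = None, -float("inf")
        for doc_id, vec in remaining:
            relevance = cosine(query_vec, vec)
            redundancy = max((cosine(vec, sv) for _, sv in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = (doc_id, vec), score
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]

# toy corpus: "a" and "a2" are near-duplicates, "b" is distinct
docs = [("a", [1.0, 0.0]), ("a2", [0.99, 0.1]), ("b", [0.0, 1.0])]
print(mmr_rerank([0.9, 0.3], docs, k=2))  # ['a2', 'b']: the duplicate is skipped
```

In a real system the candidates would be the 100 chunks returned by the vector store, and `k` would be the 25 chunks that go into the prompt.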
Evaluating and Testing Large Language Models

Hampus Gustavsson, Senior Data Scientist at Todai

As with all software, before releasing it to production, it must pass a thorough test suite. The same applies to large language models (LLMs). However, testing LLMs is not as straightforward as testing traditional software or even classical machine learning models. This article highlights several aspects of testing LLMs, followed by a demonstration using a Python implementation from Confident AI: Deepeval. Our test case will feature a type of LLM application called Retrieval-Augmented Generation (RAG), which integrates access to domain-specific data.

Background

At Todai, we often help companies develop RAG systems tailored to specific industries and use cases. Testing these applications must be tailored to each project's unique requirements. In this article, we share key lessons learned and identify the variables we consider critical during the testing phase. Our focus will primarily be on selecting appropriate metrics and designing test suites to optimize business objectives and key performance indicators (KPIs).

Modes of testing

Let us start by dividing the metrics into two categories: statistical and model-based.

A statistical approach, based on traditional n-gram techniques, counts the number of correctly predicted words. While robust, it is less flexible and struggles with scenarios where correct answers can be expressed in multiple valid ways. Consequently, as publicly available large language models have become better, the statistical approach has become less common. While some use cases still benefit from statistical methods, the model-based approach generally provides greater flexibility and accuracy.

Model-based

When using a model-based approach, the overarching scheme is to prompt an LLM such that it outputs quantitative results.
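The two metric families can be contrasted in a small runnable sketch: a statistical bigram-precision score, and a model-based judge that prompts an LLM for a number. Here `call_llm` is a hypothetical stub standing in for a real provider API call so the example runs offline; this is not Deepeval's implementation.

```python
# Contrasting the two metric families described above. The statistical metric
# counts n-gram overlap; the model-based metric asks an (here stubbed) LLM
# judge for a numeric score. `call_llm` is a hypothetical stand-in.

def bigram_precision(candidate: str, reference: str) -> float:
    """Statistical: fraction of the candidate's bigrams found in the reference."""
    def bigrams(text):
        tokens = text.lower().split()
        return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    cand, ref = bigrams(candidate), set(bigrams(reference))
    return sum(1 for g in cand if g in ref) / len(cand) if cand else 0.0

GRADING_PROMPT = """Rate from 0.0 to 1.0 how well the answer matches the reference.
Respond with the number only.
Reference: {reference}
Answer: {answer}"""

def call_llm(prompt: str) -> str:
    return "0.9"  # stub: a real implementation would call a chat-completion API

def model_based_score(answer: str, reference: str) -> float:
    raw = call_llm(GRADING_PROMPT.format(reference=reference, answer=answer))
    try:
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0  # LLM judges sometimes ignore format instructions; fail safe

reference = "The invoice is due on the first of the month."
paraphrase = "Payment falls on day one of each month."
print(bigram_precision(paraphrase, reference))   # 0.0: valid answer, no n-gram overlap
print(model_based_score(paraphrase, reference))  # 0.9 from the stubbed judge
```

The paraphrase illustrates exactly the weakness noted above: the statistical score is zero even though the answer is correct, while a judge model can credit it.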
It is not uncommon for this to come from a subjective perspective, and one does have to be mindful of that. As with asking a coworker whether they would label an answer correct or incorrect, ambiguity exists in this approach. Later in the article, we will go into depth on what an assessment of a question-answer pair can look like.

Metrics

Selecting the right metrics is a complex process, often involving translating qualitative business goals into quantitative measures. Metrics generally fall into two categories:

- Testing against predefined answers: This method compares the model's responses to a set of prewritten answers.
- System health evaluations: This involves testing the application's guardrails to prevent undesirable outcomes, such as biased or toxic responses.

In customer-facing applications, system health evaluations are essential to mitigate the risk of poor user experiences. This can involve stress-testing the system by intentionally provoking problematic outputs and measuring against bias, toxicity, or predefined ethical standards.

In our demo, we will focus on the first category: testing performance against prewritten answers. This involves generating questions and expected answers, scoring the model's responses, and aggregating these scores to evaluate overall performance. Key considerations include identifying relevant metrics, defining thresholds, and understanding the consequences of deployment or non-deployment.

So, what kind of metrics should we consider for our RAG system? We will be using Answer Relevancy and Answer Correctness. The answer relevancy metric is the LLM equivalent of precision: it measures the degree of irrelevant or redundant information provided by the model.

Question-Answer Generation

When it comes to generating the test suite, there are a few things to consider. We want to generate tests in a scalable, representative and robust way.
Starting with the representativeness and robustness aspects, we should make the questions as relevant for the task as possible, and focus on making them global rather than local, challenging the RAG to combine various sources of information to arrive at the correct answer.

The straightforward way of generating questions and answers is to let subject matter experts and/or stakeholders create them manually. This is certainly a valid way, but let us explore test suite building a little further. We will be looking at three approaches, two offline and one online: manually made, automatically made, and sourced from a production setting. We will try to cover what sets these techniques apart and compare their pros and cons. Let us look at them one by one.

Manually made: The straightforward way to get the questions and correct answers. Preferably, this is done via subject matter experts and/or intended users when applicable. The cons of this approach are that it might be biased towards the people creating the questions and answers, that the dataset it creates is stale (i.e. it does not account for data drift), and that it can of course be resource intensive.

Production questions: This is the preferred approach, with one big caveat. If you can get real-world questions and answers, these will hopefully represent the real-world use case as closely as possible. This approach is also quick to react to changes in the environment the system is used in. The caveat, however, is that you must retrieve this feedback without disturbing the user experience. The obvious way is to explicitly ask for feedback, but this is rarely something you want to do. There are a few ways to get user feedback implicitly, and it is one of these you should opt for.

Automated: The basis for this approach is to use another model to generate questions and answers.
Documents can be sampled randomly as the basis for the questions, or selected in a more methodical manner. There is a risk of generating questions that are too easily answered by the model; that is, the model has a hard time creating questions that are genuinely hard for itself. Also, without further prompting or without providing examples, there is a risk that the generated questions have little or no significance in practice. This can be tamed by combining this approach with one of the two previous ones: you can use those questions as a test fleet, and prompt the model to try to make questions in the same
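The automated approach above can be sketched as a small pipeline: sample documents, then ask a generator model for a question-answer pair grounded in each. The `generate_qa` call is a hypothetical stub standing in for a real LLM call, so the sketch is runnable; function names and the document corpus are invented for illustration.

```python
# Sketch of automated QA-suite generation: sample documents, then generate
# one question/answer pair per sampled document. `generate_qa` is a stub
# standing in for a real LLM call; names and data are illustrative only.

import random

def generate_qa(document: str) -> dict:
    # stub: a real implementation would prompt an LLM with the document,
    # ideally including example questions to steer style and difficulty
    return {
        "question": f"What does the source say about: {document[:30]}...?",
        "answer": document,
    }

def build_test_suite(documents, sample_size=2, seed=42):
    # randomly sample documents, as described in the text, then generate
    # one QA pair per sampled document
    rng = random.Random(seed)
    sampled = rng.sample(documents, min(sample_size, len(documents)))
    return [generate_qa(doc) for doc in sampled]

docs = [
    "Invoices are due on the first of the month.",
    "Support tickets are answered within 24 hours.",
    "Refunds require manager approval.",
]
for case in build_test_suite(docs):
    print(case["question"])
```

Swapping the random sampling for a methodical document selection, or feeding manually written questions into the generator prompt as examples, gives the combined variants the text describes.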