
Langchain will try to break your codebase: Or, how Clean Architecture can save you from headaches

Nina Holm-Jensen
Senior Data Scientist

Prerequisites to reading:

  • A basic understanding of how a RAG system works (if not, you can read any article on the topic, like this one, and come back)
  • Ability to read Python code
  • Fundamental understanding of how to unit test

Repo: https://github.com/TodAI-A-S/Todai-LLM-project/branches

LLMs are complicated. To write even a simple RAG system, you have to connect to multiple databases and tables (the vectorstore is the obvious one, and then there is the user table, the conversation history, maybe conversation metadata or user permissions, etc.). You then have to juggle the data returned by your LLM. It needs to be coerced into sensible objects which you stream to the frontend for visual rendering. And what happens when the LLM feels feisty and sends back a different data format? Your error handling needs to be solid.

It’s all terribly complex.

What is the must-do for all complex code?

For this article, I assume that we all agree that automatic testing is good and necessary. If you need a little convincing, go here.

First, let me start by scoping this article. I will focus pretty hardcore on local, on-my-machine integration testing, which is fundamentally deterministic. LLMs are notoriously non-deterministic, so for a proper, business-centric test suite, you will also need a way to test the LLM flow. See my colleague’s articles on that particular topic.

Here, I will discuss how to best abstract away all of that so you can test the deterministic code around the LLM. It’s harder than it sounds. 

I will base this article on snippets of actual production code I wrote this year. I have cut 90% of the complexity (and all of the identifying business logic) so I can focus on a few interesting parts.

What is the problem?

Langchain promises to bundle all your LLM-specific needs into a friendly library that handles everything for you. It does that, mostly. But, to be completely frank, Langchain is a bit of a mess. It is made for data scientists, not software engineers. Is this part a function or a class? Who knows? Is this part configuring code or executing it? Does it even matter?

If you are here, you probably care a little bit about writing testable code. You care about separation of concerns. If you are here, you have probably also seen the Langchain documentation – or the myriad of ten-line RAG examples littering the internet. And you’ve thought “how the hell do I modularize this?”

If you go to Langchain’s own documentation and string their examples together, it will look something like this:

embeddings = OpenAIEmbeddings()

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    ...,
]

vectorstore = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    index_name="opensearch-self-query-demo",
    opensearch_url="http://localhost:9200",
)

Nice and quick. You’ve got your entire RAG system in a 16-line script where everything – connecting to the database, connecting to the LLM, and a ton of transformation – happens in a single line. Nice. And utterly untestable.

Some of the questions that crop up:

  • How do I mock anything here? 
  • The OpenAIEmbeddings class reads my API key from environment variables. Can I control that without nasty side effects?
  • The database connection happens somewhere deep in the from_documents() function. If I were to write a test, I would have to do a lot of hacky interceptions, and it would undoubtedly break the next time a hapless developer came by.

The solution

I quickly hit the wall of trying to make Langchain’s own examples testable. How do you separate responsibilities? How do you inject dependencies? I realised that Langchain wasn’t going to help me willingly. So I made it my personal mission to beat and batter Langchain into a shape reminiscent of proper architecture. My anvil of choice was Clean Architecture as described by Uncle Bob.

Fundamentally, I worked to separate the code into three layers:

  • Interfaces, which handle any and all external connections and calls.
  • Services, which handle all business logic. Services DO things. Services carry and transform data.
  • Entities, which are the data objects being carried and transformed.


Along with, of course, configuration code, which we all know to separate out.
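The AppConfig object that appears throughout the snippets below belongs in that configuration layer. A minimal sketch might look like this; the field names are the ones used later in this article, while the default values are placeholders rather than the real settings:

from pydantic import BaseModel


class AppConfig(BaseModel):
    # Connection settings gathered in one place and injected into
    # interfaces and services, instead of being scattered across the code
    opensearch_url: str = "http://localhost:9200"
    opensearch_index: str = "documents"
    bedrock_region: str = "eu-west-1"
    context_region: str = "eu-west-1"
    embedding_model_id: str = "amazon.titan-embed-text-v1"

In practice you would populate it from environment variables or per-environment settings; the point is simply that nothing else in the codebase needs to know where the values come from.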

Strict adherence to this architecture became my guiding principle.

Working with entities

Remember, the thing I wrote is a full-fledged application. It receives (messy) JSON data through an endpoint, then has to call a database to enrich the data, then pass this data elsewhere for audit logging. Then it does some data transformations (including a PDF parser reading the page content). And then, finally, the relevant data reaches the embedding flow.

At all these points, I need to clearly understand what my data looks like.

The responsibility of the entity:

  • Collecting data in well-defined and sensibly-named chunks.
  • Matching data to domain concepts

I used Pydantic to parse the incoming JSON from the upstream data pipeline, and to represent the data at all times through the transformation steps, until it arrived at my embedding flow looking something like this:

from typing import List

from pydantic import BaseModel


class PdfPage(BaseModel):
    page_content: str
    page_number: int


class PdfDocument(BaseModel):
    doc_id: str
    pages: List[PdfPage]

Nothing magical here. Entities are just dataclasses. We love them because they give us a shared language around the data, and because they are so much more self-documenting than dictionaries or lists or all the other structures that data scientists love to use while tinkering.
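They also validate. A quick sketch of what that buys you (using the Pydantic v2 API; on v1 the equivalent method is parse_raw rather than model_validate_json):

raw = '{"doc_id": "doc-1", "pages": [{"page_content": "Hello world", "page_number": 1}]}'

# A malformed payload fails loudly here, at the edge of the system,
# instead of surfacing as a KeyError three layers further down
pdf_doc = PdfDocument.model_validate_json(raw)
assert pdf_doc.pages[0].page_number == 1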

Interfaces

Dependencies are finicky things. Whenever you depend on anything not in your own codebase, you risk that an unexpected change messes with your test. Maybe someone had to delete the data row you depended on? Maybe someone else is updating the service you need to call? Then your test fails for reasons completely unrelated to actual errors.

And, from a code maintenance perspective, you want to be able to change providers reasonably easily. You want any external dependencies to be plug-and-play with clearly defined entry points. Infrastructure doesn’t change often in production, but it sure as hell changes a lot between dev, production and your local machine.

The responsibility of interfaces:

  • Handling any and all actual network calls to external services
  • Handling any dependency-specific configuration or stupid workarounds
  • Making sure that, were we to switch e.g. database client, only the interface code would need updating.

In my little toy example, I have two different dependencies: OpenSearch and Bedrock.

The first step was to ensure that I could easily abstract the underlying database away. This is a pretty simple interface:

from opensearchpy import OpenSearch, RequestsHttpConnection


class OpenSearchInterface:

    def __init__(self, config: AppConfig):
        self.config = config
        self.HTTP_AUTH = None  # get these from secrets

    def delete_chunks_if_exist(self, doc_id: str) -> None:
        # Insert code to delete all embedded chunks for a given doc_id
        pass

    def count_index(self, index_name: str) -> int:
        return self._get_opensearch_client().count(index=index_name)["count"]

    def _get_opensearch_client(self) -> OpenSearch:
        client = OpenSearch(
            hosts=[self.config.opensearch_url],
            http_auth=self.HTTP_AUTH,
            use_ssl=False,
            verify_certs=True,
            connection_class=RequestsHttpConnection,
            pool_maxsize=20,
        )
        return client

The database interface is responsible for retrieving its own auth config, and it is responsible for its own gazillion little utility functions to delete existing document chunks, count the index, etc. The interface is reused all over the codebase and is incredibly useful.
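For completeness, the delete_chunks_if_exist stub could be filled in with a delete-by-query along these lines. This is only a sketch: the exact field path depends on how your chunk metadata is mapped in the index, so treat "metadata.doc_id.keyword" as an assumption rather than a given.

def delete_chunks_if_exist(self, doc_id: str) -> None:
    # Remove every previously embedded chunk for this doc_id, so that
    # re-indexing a document never leaves stale chunks behind.
    # The field path assumes chunk metadata is stored under "metadata".
    self._get_opensearch_client().delete_by_query(
        index=self.config.opensearch_index,
        body={"query": {"term": {"metadata.doc_id.keyword": doc_id}}},
    )

Either way, the point stands: only this class needs to know what the index looks like.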

The Bedrock interface was a bit less… smooth.

import os

from langchain_community.embeddings import BedrockEmbeddings  # or langchain_aws, depending on your version


class EmbedderInterface:

    def __init__(self, config: AppConfig):
        self.config = config
        self.model = self._setup_embedding_model()

    # Set up embedding LLM
    def _setup_embedding_model(self) -> BedrockEmbeddings:
        try:
            # Dumb hacky workaround due to a bug in langchain/boto3 where
            # we cannot supply the region directly in the function
            os.environ["AWS_DEFAULT_REGION"] = self.config.bedrock_region
            model = BedrockEmbeddings(
                client=None,
                model_id=self.config.embedding_model_id,
                region_name=self.config.bedrock_region,  # This gets ignored under the hood, but fails if it is not set.......
            )
            return model
        finally:
            os.environ["AWS_DEFAULT_REGION"] = self.config.context_region

These code comments are from the actual solution. As you can see, I felt a lot of frustration towards AWS/boto3 and Bedrock at this point. As a framework, it wore all the scars of moving fast and breaking easily.

My clean architecture had the benefit that I could encapsulate all the stupidity in a single class. For a while, I also had a timer wrapper around the setup class, for debugging purposes. The surrounding code didn’t have to know anything about boto3. And if we ever need to use a different embedding provider (very likely, in a space that moves as fast as this), it is fast and easy to change.
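The timer wrapper was nothing more exotic than a decorator along these lines (a reconstruction for illustration, not the original code):

import logging
import time
from functools import wraps

logger = logging.getLogger(__name__)


def timed(func):
    """Log how long a call took - useful when a 'simple' setup call secretly hits the network."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            logger.info("%s took %.2fs", func.__name__, time.perf_counter() - start)

    return wrapper

Because all the Bedrock weirdness lives in one class, adding and later removing that kind of debugging aid touches exactly one place.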

Tying it all together with the service

The pièce de résistance is the Service. This is where the business value happens.

The responsibility of services:

  • Tying the interfaces together
  • Transforming data (which is the goal of 90% of all business logic)
  • Executing the flows of the code


The responsibility of this specific service:

  • Owning the flow from data input to final embeddings.
  • Merging the different interfaces into Langchain’s Frankenstein objects.
  • Turning the PdfDocument object into Langchain’s Document objects, so they can be embedded.
  • Giving the document chunks ids based on their doc_id metadata (which came from my input data). This little functionality will be the subject of my test code!

In my case, my service looked like this.

from typing import List

# Import paths vary slightly between langchain versions
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


class EmbeddingService:

    def __init__(
        self,
        config: AppConfig,
        embedder: EmbedderInterface,
        opensearch: OpenSearchInterface,
    ):
        self.config = config
        self.opensearch = opensearch
        self.embedder = embedder

    # Notice this logic which could easily be messed up by a human
    def _doc2chunks(self, pdf_doc: PdfDocument) -> List[Document]:
        text_splitter = RecursiveCharacterTextSplitter(add_start_index=True)

        page_contents = []
        metadatas = []

        for page in pdf_doc.pages:
            page_contents.append(page.page_content)
            metadatas.append(
                {
                    "doc_id": pdf_doc.doc_id,
                    "page_number": page.page_number,
                }
            )

        chunks = text_splitter.create_documents(page_contents, metadatas)
        return chunks

    def handle_document(self, pdf_doc: PdfDocument) -> None:
        self.opensearch.delete_chunks_if_exist(pdf_doc.doc_id)
        chunks = self._doc2chunks(pdf_doc)

        self.get_vectorstore().from_documents(
            documents=chunks,
            # These ids cause overwriting - they will be the subjects of my test!
            ids=[f"{c.metadata['doc_id']}-{c.metadata['start_index']}" for c in chunks],
            embedding=self.embedder.model,
            index_name=self.config.opensearch_index,
            opensearch_url=self.config.opensearch_url,
            http_auth=self.opensearch.HTTP_AUTH,
            timeout=300,
        )

    def get_vectorstore(self) -> OpenSearchVectorSearch:
        return OpenSearchVectorSearch(
            embedding_function=self.embedder.model,
            opensearch_url=self.config.opensearch_url,
            index_name=self.config.opensearch_index,
            http_auth=self.opensearch.HTTP_AUTH,
        )

(Yes, all the OpenSearch configuration has to be set twice. I nearly went bald from hair-pulling when I realised.)

Now, after all this code, my architecture looks something like this: entities are carried through the service, which talks to OpenSearch and Bedrock only through the two interfaces.

What does a test look like then?

Pytest allows mocking of specific functions. In the very beginning of the project, I spent a lot of time trying to find the network calls in Langchain, so I could mock those with fake return data. The problem is the Langchain spaghetti code. Every time we changed the smallest thing, I had to do extensive detective work all over again. The time-spent-to-quality-achieved ratio was not worth it.
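On paper, that kind of patching looks innocent enough. A sketch of the approach (which method you actually have to patch depends on which Langchain entry point your code goes through, which is exactly the problem):

from langchain_community.vectorstores import OpenSearchVectorSearch


def test_with_patched_ingestion(monkeypatch):
    # Replace one of the vectorstore's ingestion methods so the test never
    # touches the network. Writing this line is easy - the expensive part was
    # finding out which internal call Langchain actually routes through, and
    # redoing that detective work after every upgrade.
    monkeypatch.setattr(
        OpenSearchVectorSearch,
        "add_texts",
        lambda self, texts, metadatas=None, ids=None, **kwargs: list(ids or []),
    )
    ...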

No matter what I did, Langchain refused to unfurl. The whole point of Langchain is to hide the complexity in single-line commands. The problem is that issues and bugs arise specifically from Langchain’s idiosyncrasies.

My mission, then, became to test Langchain’s classes and functions as input-output blackboxes. I had to treat Langchain itself as the thing that can break.

Like any good test-driven developer, I wrote a test every time Langchain threw something weird at me. A few examples:

  • Overwriting all existing documents of the same id, based on a seemingly-irrelevant config.
  • NOT overwriting existing documents of the same id, based on a seemingly-irrelevant config.
  • Sharing a single metadata object between all my embedding chunks, thus making it hard to set unique chunk ids.

My test suite tells the story of a Langchain which rarely acts as expected. All the more reason to continuously test it!

One such test was based on a business requirement. Langchain overwrites documents based on their id, and since I have semi-complicated code constructing said ids, I want to make sure it overwrites as intended.

So, the flow of my desired test was:

  1. Index two separate documents
  2. Index one of those documents again
  3. Check that my database still only contains two documents

First, some table-setting:

import subprocess
import time

import pytest


@pytest.fixture()
def testcontainers():
    subprocess.call("docker compose up -d", shell=True)
    time.sleep(5)
    yield
    subprocess.call("docker compose down -v", shell=True)


@pytest.fixture()
def testconfig():
    testconfig = AppConfig()
    testconfig.opensearch_url = "http://localhost:9200"
    testconfig.opensearch_index = "random-index"
    return testconfig  # hand the configured object to the test

There are a million better ways to handle your test configuration, and a handful of better ways to handle your test containers. Please accept these hacky versions in the spirit of getting to the interesting stuff.

Namely: what does my test look like?

from langchain_community.embeddings import FakeEmbeddings


def util_make_doc(doc_id: str) -> PdfDocument:
    return PdfDocument(
        doc_id=doc_id,
        pages=[PdfPage(page_content="hgsdhsdkjgdskjs", page_number=1)],
    )


def test_only_deletes_expected_document(testconfig, testcontainers):
    # Arrange
    fake_embedder = EmbedderInterface(testconfig)
    fake_embedder.model = FakeEmbeddings(size=5)

    os_interface = OpenSearchInterface(testconfig)

    # (the remainder of the test is reconstructed from the step-by-step description below)
    embedder_service = EmbeddingService(
        config=testconfig,
        embedder=fake_embedder,
        opensearch=os_interface,
    )

    # Act: embed two documents, then embed the first one again
    embedder_service.handle_document(util_make_doc("doc-1"))
    embedder_service.handle_document(util_make_doc("doc-2"))
    embedder_service.handle_document(util_make_doc("doc-1"))

    # Assert: re-embedding overwrote the existing chunks instead of adding new ones
    assert os_interface.count_index(testconfig.opensearch_index) == 2

What happens here:

  1. I make an Embedder interface and easily substitute a fake Embedding model for the real one (so I won’t have any errant network calls). In this specific case, Langchain has been kind enough to supply a mock embedding model. Much appreciated, Langchain developers!
  2. I make an OpenSearchInterface and make sure to base it on my Docker container configuration.
  3. I make my service, and I inject the two interfaces whose behavior I now control.
  4. I embed two documents, both dataclass entities, meaning I literally cannot construct them differently from production data objects.
  5. I embed one of those documents again.
  6. I use a utility function from my OpenSearch interface to make sure my test condition is met.


Personally, I love the elegance of clean architecture. It makes dependency injection so simple and easy. It means I can base my test code so much more readily on actual production code.

Pros and cons of this approach

I won’t lie. I had a thudding headache when I was done. But I was pretty sure I knew what my code was doing, and I was pretty sure I wouldn’t accidentally break it.

Which is great. Because data scientists love tinkering, and they love showing up with completely new ways of constructing the chain, of tweaking the data pipeline, of parsing the output. That’s their job. My job is making sure that all their new stuff doesn’t break the old stuff.

It’s somewhat expensive to maintain this test suite. Langchain is full of surprises, and even more full of bugs. But frequently, while I tried to understand the newest LLM flow, the data scientist also gained a deeper understanding just from having to explain it to my dumb ass. We made lots of beautiful little diagrams along the way.

Most importantly, doing this work created a safety zone around a highly unstable library, and minimized the number of bugs we pushed to production.

Bugs are costlier than tests.

Some parting words

I have fired a lot of shots at Langchain in this article. As a developer, it is a frustrating tool. We will probably end up rewriting the entire codebase at some point, and I will finally have my nicely segmented code that works WITH me, not against me.

But in my opinion, we aren’t quite there yet. Langchain still has its place in the world. For a full discussion, go to this article. While Langchain is still around, my job is to stand vigilant and to protect my codebase from the worst of its idiosyncrasies.