Why QA is the hard part of building LLM chatbots: lessons from testing a forestry assistant

We spent two days testing the Reconecta bot. The bot itself — a WhatsApp assistant that helps forest landowners in rural Spain navigate property rights, subsidies, and sustainable management — was straightforward to build. The quality assurance process was anything but.

The bot works. It answers questions, captures leads, routes users to the right resources. But getting it to work reliably — across different inputs, edge cases, and the unpredictability of large language models — took more effort than writing the entire agent.

What Reconecta does

Reconecta is a publicly funded project that helps forest landowners in Cuenca, Soria, and Teruel — three provinces in Spain with a lot of abandoned woodland. The portal (reconecta.es) offers information on property clarification, forest management, and green entrepreneurship. Our job was to turn that portal into a conversational agent: something a farmer could message on WhatsApp and get real answers from.

The technical side was simple. We used a whitelabel platform (uChat/Falitech), built a knowledge base from the site content, wrote a system prompt, and wired up function calling for lead capture. The final bot has seven atomic functions, one per data point, so it can save information progressively instead of interrogating users all at once, although as described below it did not start out that way. On paper, it was clean.

Then we tested it.

The unpredictability problem

LLM-powered agents are nondeterministic. Run the same prompt twice and you can get two different conversation flows.

Some examples from our testing:

The interrogation trap. The original bot had a single function with six parameters, five of them required. The LLM could not call it until it had all five — so it entered interrogation mode, firing questions one after another without giving the user anything of value first. We fixed this by splitting into atomic functions, but the model still sometimes asked two questions in one message, or repeated a question it already had the answer for.

The helpfulness paradox. A user says "I inherited some land" and the bot launches into a full explanation of inheritance law, property registration, and cadastral references — three paragraphs before pausing for breath. Other times, the same input gets a one-liner that feels dismissive. Same prompt. Same knowledge base. Different behavior.

The data collection pendulum. We rewrote the prompt three times. The first version collected data too aggressively. The second was so laid-back it forgot to ask. The third version said "give information first, ask for data only when it adds value" — which the model followed about 70% of the time. The other 30%, it either jumped to data collection too early or never asked at all.

Hallucinated empathy. The bot would occasionally invent details about the user's situation — "I understand managing a forest alone can be overwhelming" when the user never said they were alone. The model fills in gaps from context. That is what language models do. In a service agent, that is a bug.

Why this is harder than traditional QA

In traditional software testing, you write test cases: given input X, the system produces output Y. You can automate this, run it in CI, and trust that passing tests mean the feature works.

With LLM agents, this breaks down.

The same input can produce different outputs. You cannot assert "given this message, the bot says exactly this." You can assert looser properties, such as "the response mentions subsidies," but those checks are soft, and soft assertions erode confidence in your test suite.
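
One way to make a soft assertion a little firmer is to run the same message many times and check a pass rate instead of a single reply. The sketch below is only an illustration: make_fake_bot is a hypothetical stand-in for the real bot (which would be called through whatever API the platform exposes), the canned replies are invented, and the number of runs is arbitrary.

```python
import random
from typing import Callable

def pass_rate(bot: Callable[[str], str], message: str,
              check: Callable[[str], bool], runs: int = 20) -> float:
    """Send the same message `runs` times; return the fraction of replies that pass `check`."""
    passed = sum(1 for _ in range(runs) if check(bot(message)))
    return passed / runs

def make_fake_bot(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for the real bot: returns one of a few canned replies at random."""
    rng = random.Random(seed)
    replies = [
        "There are subsidies for forest management in your province.",
        "You may qualify for subsidies. Which province is your land in?",
        "Could you tell me more about your land?",
    ]
    return lambda message: rng.choice(replies)

def mentions_subsidies(reply: str) -> bool:
    """Soft check: the reply talks about subsidies in some form."""
    return "subsid" in reply.lower()

if __name__ == "__main__":
    bot = make_fake_bot()
    rate = pass_rate(bot, "Can I get help paying for forest management?", mentions_subsidies)
    # Compare against a threshold you choose, not against an exact reply.
    print(f"pass rate: {rate:.0%}")
```

A pass rate does not remove the softness. It only turns "it worked when I tried it" into a number you can track across prompt versions.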

Context matters. The bot's behavior depends on the entire conversation history. Testing individual messages in isolation tells you little. Testing full conversations is slow. Testing all combinations of conversation paths? Forget it.
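
A rough bit of arithmetic shows why exhaustive path testing is off the table. The numbers here are assumptions chosen for illustration, not measurements from the Reconecta bot: suppose the user can take the conversation in one of five directions at each turn.

```python
def conversation_paths(choices_per_turn: int, turns: int) -> int:
    """Distinct paths if the user can go one of `choices_per_turn` ways at every turn."""
    return choices_per_turn ** turns

if __name__ == "__main__":
    # Illustrative assumption: 5 possible user moves per turn.
    for turns in (2, 4, 6, 8):
        print(turns, "turns:", conversation_paths(5, turns), "paths")
    # 2 turns: 25 paths
    # 4 turns: 625 paths
    # 6 turns: 15625 paths
    # 8 turns: 390625 paths
```

Real conversations are not this tidy, but the exponential growth is the point: you have to sample paths, not enumerate them.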

The model changes under you. Not just model upgrades — the same model version can behave differently over time as providers adjust weights, routing, and system prompts. A prompt that worked on Monday might produce subtly different behavior on Wednesday.

So your QA process becomes less "run the test suite" and more "sit down, have a conversation with the bot, and see what happens." Over and over. With different personas. Different starting points. Different edge cases.

That is what we did. For two days.

What actually worked

After cycling through prompt versions and testing patterns, a few things made the QA process tractable:

Atomic function design. Splitting the monolithic "save everything" function into one function per data point was the single biggest improvement. The model could save data incrementally instead of holding everything in the conversation until it had enough to make the call. This eliminated the interrogation pattern almost entirely.
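
To make the difference concrete, here is a minimal sketch of the two designs as JSON-Schema-style function definitions, the general shape most function-calling APIs accept. Everything specific is an assumption: the field names, the function names (save_lead, save_name, and so on), and the exact schema format the platform expects are hypothetical, and only three data points are shown.

```python
from typing import Any

# Hypothetical data points; the real bot's fields are not listed here.
FIELDS = {
    "name": "The user's name",
    "province": "Province where the land is located",
    "cadastral_reference": "Cadastral reference of the parcel",
}

def monolithic_schema(fields: dict[str, str], optional: set[str]) -> dict[str, Any]:
    """One function for everything: it cannot be called until every required field is known."""
    return {
        "name": "save_lead",
        "description": "Save all lead data at once.",
        "parameters": {
            "type": "object",
            "properties": {k: {"type": "string", "description": d} for k, d in fields.items()},
            "required": [k for k in fields if k not in optional],
        },
    }

def atomic_schemas(fields: dict[str, str]) -> list[dict[str, Any]]:
    """One function per data point: each can be called as soon as that one value appears."""
    return [
        {
            "name": f"save_{k}",
            "description": f"Save this value as soon as the user mentions it: {d}.",
            "parameters": {
                "type": "object",
                "properties": {k: {"type": "string", "description": d}},
                "required": [k],
            },
        }
        for k, d in fields.items()
    ]

if __name__ == "__main__":
    mono = monolithic_schema(FIELDS, optional={"cadastral_reference"})
    print("monolithic requires:", mono["parameters"]["required"])
    print("atomic functions:", [s["name"] for s in atomic_schemas(FIELDS)])
```

With the monolithic version, the required list is what pushes the model into asking for everything up front. With the atomic version, no single call depends on any other data point.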

Explicit negative instructions. Instead of saying "be helpful," we said: "Do NOT ask for data. Do NOT repeat information you already have. Do NOT block the conversation to collect data." In our testing, negative constraints were sharper than positive ones: the model followed "don't do X" more reliably than "be Y."

Priority rules in the prompt. We made the hierarchy explicit: "Give information first. Ask for data only when the user needs personalized help. If the user does not provide cadastral data, continue with general information." This reduced the pendulum swings between aggressive data collection and total passivity.

Short, focused prompts. Our v5 prompt is roughly 3,900 tokens — down from 6,700 in v2. We cut a province directory from 50+ entries to 3 plus a fallback. We compressed skills from verbose descriptions to one-line definitions. Less prompt text means less surface area for the model to misinterpret.

Test personas, not test cases. Instead of writing "assert response contains X," we defined user personas (an inheritor who does not know what they own, a farmer looking for subsidies, a curious student) and tested whether the bot gave each persona a useful experience. This is subjective, but it maps to what actually matters: does the user get value?
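
A persona can still be written down in a structured way, even if a human does the judging. The sketch below is one possible shape, not the format we used: the opening messages (apart from "I inherited some land") and the rubric questions are invented for illustration, and the reviewer answers each question yes or no after reading the conversation.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    opening_message: str
    rubric: list[str]  # yes/no questions a human reviewer answers after the conversation

# Hypothetical personas and rubric questions, for illustration only.
PERSONAS = [
    Persona(
        name="inheritor",
        opening_message="I inherited some land",
        rubric=[
            "Did the bot give a useful first step before asking for data?",
            "Was the reply short enough to read comfortably on a phone?",
        ],
    ),
    Persona(
        name="farmer",
        opening_message="Are there subsidies for managing my forest?",
        rubric=[
            "Did the bot mention subsidies relevant to the user's province?",
            "Did the bot avoid repeating a question it already had an answer for?",
        ],
    ),
]

def score(persona: Persona, answers: list[bool]) -> float:
    """Fraction of rubric questions the reviewer answered 'yes' for one conversation."""
    if len(answers) != len(persona.rubric):
        raise ValueError("one answer per rubric question is required")
    return sum(answers) / len(answers)

if __name__ == "__main__":
    # Example: the reviewer said yes to the first question and no to the second.
    print(PERSONAS[0].name, score(PERSONAS[0], [True, False]))
```

The judgment stays subjective, but recording it per persona and per prompt version makes it easier to see whether a change helped one kind of user at the expense of another.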

The honest takeaway

The Reconecta bot is live and functional. But to be honest, we cannot guarantee it will handle every conversation perfectly. No one can guarantee that with current LLM technology. The best we can do is:

Constrain the problem space — limited domain, clear rules, explicit fallbacks.
Test the common paths thoroughly.
Accept that edge cases will happen and monitor for them.
Iterate on the prompt based on real conversations.
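
Monitoring can start with very simple heuristics over conversation logs. The sketch below flags two of the failure patterns described earlier: a bot message that asks more than one question, and a question the bot has already asked. It assumes the bot's messages are available as a list of strings, and the matching is deliberately crude (counting question marks, comparing normalized text), so treat its flags as prompts for a human to look, not as verdicts.

```python
import re

def extract_questions(message: str) -> list[str]:
    """Return the sentences in a message that end with a question mark."""
    return re.findall(r"[^.!?]*\?", message)

def normalize(question: str) -> str:
    """Lowercase and drop punctuation so near-identical questions compare equal."""
    return re.sub(r"\W+", " ", question).strip().lower()

def flag_conversation(bot_messages: list[str]) -> list[str]:
    """Flag bot messages that ask several questions at once or repeat an earlier question."""
    flags = []
    seen: set[str] = set()
    for i, message in enumerate(bot_messages):
        questions = extract_questions(message)
        if len(questions) > 1:
            flags.append(f"message {i}: more than one question")
        for question in questions:
            key = normalize(question)
            if key in seen:
                flags.append(f"message {i}: repeated question")
            seen.add(key)
    return flags

if __name__ == "__main__":
    log = [
        "What province is your land in?",
        "Thanks. What province is your land in? And what is your name?",
    ]
    for flag in flag_conversation(log):
        print(flag)
    # message 1: more than one question
    # message 1: repeated question
```

Heuristics like these will miss rephrased questions and cannot see problems such as invented details about the user, so they complement reading real conversations, they do not replace it.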

The building part — the knowledge base, the functions, the integration — took maybe a week. The testing and tuning took just as long. And we are not done. Every batch of real conversations reveals new edge cases that no amount of pre-launch testing could have caught.

That ratio — 50% building, 50% QA — is not bad engineering. It is the reality of working with nondeterministic systems. You do not call a function. You coach a collaborator. Coaching takes time.

If you are building an LLM agent and your QA plan is "we will test it a few times before launch," budget for more. Development is the easy part. Making sure it actually works — consistently, across real users, with real edge cases — is where the effort lives.