CI/CD as a harness for models

A pipeline doesn't make a model smart. It keeps the model tied to reality.

Today I finished a CI/CD flow for a WordPress installation with Tutor LMS. The task itself was normal: validate the environment, activate plugins, deploy to production, and check that the site was still alive.

The interesting part wasn't WordPress. The interesting part was what broke.

First, the deploy stayed manual when it was supposed to run automatically. Then the temporary WP-CLI container couldn't find the database because it didn't inherit the environment variables of the real WordPress container. Then a wp eval call failed because the backslashes in a PHP namespace were escaped incorrectly. Later, the bootstrap script assumed Tutor Pro returned a JSON object, but in CI it returned a list.

None of this was brilliant. It was boring detail work. That's why it mattered.

A model can write a script that looks correct. It can explain why the script should work. It can even fix the first error in the logs. But without a harness around it, the model stays too close to its own story. It thinks it is done because the plan makes sense.

CI/CD cuts that fantasy short.

The problem isn't that models make mistakes

Models make mistakes. Humans do too. The difference is that a tired human usually leaves traces: doubts, comments, a TODO, an "I'll test this later." A model can produce a complete answer with beautiful, false confidence.

That makes it dangerous in software work. Not because it is useless, but because it is useful very quickly.

The model generates the .gitlab-ci.yml. It generates the deploy script. It generates the healthcheck. It generates the summary for the chat. Everything flows. If a team measures progress by the amount of text produced or files changed, it looks like the system is moving.

But software is not text. Software is behavior under specific conditions.

A deploy is not finished because the model says "deploy configured." It is finished when a pipeline runs from a clean commit, validates the environment, runs the deploy, verifies production, and leaves an ID someone can audit later.

That "later" matters. Without receipts, the work is not done. It is just a well-told story.

What a CI/CD harness means

A harness doesn't replace the climber. It doesn't climb for the climber either. It only stops a normal fall from becoming a disaster.

CI/CD does something similar for models and agents.

The model can propose changes, write code, and adjust scripts. The pipeline decides whether that work survives contact with the environment. Not with an imagined environment. With Docker, real variables, real permissions, images that take time to download, commands that don't exist inside a container, and plugins that return different shapes depending on version.

In the LMS work, the pipeline had two simple parts:

validate_lmstutor: start WordPress and MariaDB in CI, install Tutor LMS, activate Tutor Pro, import the Workademy template, and check the home page.
deploy_production: copy Tutor Pro to the Coolify WordPress instance, activate dependencies with WP-CLI, and verify the public site.

The first version passed validation and left the deploy manual. That was already a signal: the pipeline existed, but it was not CD yet. It had a button. A button means a human is still in the middle.

The next change removed the button. Then the real error appeared: the temporary WP-CLI container read wp-config.php, but wp-config.php depended on WORDPRESS_DB_HOST. The temporary container didn't receive that variable, so WordPress fell back to mysql. In production, the database host was mariadb.

You don't catch that bug by skimming the script. You catch it by running it.
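One way to close that gap is to have the temporary container copy the database settings from the running WordPress container instead of relying on defaults. The following is a minimal Python sketch, not the script from this project. It assumes the Docker CLI is available and that the official wordpress:cli image is used. The helper names are hypothetical, and the variable list beyond WORDPRESS_DB_HOST is an assumption based on the official WordPress image.

```python
import json
import subprocess

# Database variables read by wp-config.php in the official WordPress image.
DB_VARS = (
    "WORDPRESS_DB_HOST",
    "WORDPRESS_DB_USER",
    "WORDPRESS_DB_PASSWORD",
    "WORDPRESS_DB_NAME",
)


def parse_env(inspect_env):
    """Turn Docker's ["KEY=value", ...] list into a dict."""
    return dict(item.split("=", 1) for item in inspect_env)


def read_container_env(container):
    """Ask Docker for the environment of a running container."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{json .Config.Env}}", container],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_env(json.loads(out))


def wp_cli_command(container, env, wp_args):
    """Build a `docker run` argv that shares files, network and database
    settings with the real WordPress container."""
    cmd = [
        "docker", "run", "--rm",
        "--volumes-from", container,
        "--network", f"container:{container}",
    ]
    for name in DB_VARS:
        if name in env:
            cmd += ["--env", f"{name}={env[name]}"]
    return cmd + ["wordpress:cli", "wp", *wp_args]
```

A call such as wp_cli_command("wordpress", read_container_env("wordpress"), ["plugin", "list"]) returns the argv to run. Passing the password on the command line exposes it in the process list, so a real script might prefer an env file. The sketch only shows the inheritance step.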

Then came the namespace bug. The script passed the \TutorPro\... namespace to wp eval with the wrong number of backslashes, so PHP did not receive the class name it needed. Another dumb detail. Another detail that breaks production.
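Backslash bugs like this usually come from a string passing through more layers of quoting than the author counted. Below is a small Python sketch of one way to reduce those layers, assuming the deploy script is driven from Python. The class name \Vendor\Example is a hypothetical stand-in, not the real Tutor Pro class, and both helper names are invented for illustration.

```python
import shlex

# Raw string: Python leaves the backslashes alone.
# \Vendor\Example is a hypothetical class name.
PHP_SNIPPET = r"echo \Vendor\Example::class;"


def wp_eval_argv(php_code):
    """argv for subprocess.run without a shell: no extra escaping layer."""
    return ["wp", "eval", php_code]


def wp_eval_shell(php_code):
    """The same command as one shell string, for when it has to cross
    ssh or `sh -c`. shlex.quote adds exactly one layer of quoting."""
    return "wp eval " + shlex.quote(php_code)
```

The idea is to write the PHP once, in a raw string, and let a library do the quoting for each layer the command crosses, instead of counting backslashes by hand.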

Then Tutor Pro returned dependencies as a list in CI and as an object in production. The code assumed one shape. The pipeline said: not so fast.
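A defensive fix for this kind of mismatch is to normalize the payload before using it. The sketch below is an assumption about how that could look, not the code from this project. The real structure of the Tutor Pro response is not shown here, so the function names are hypothetical, and treating the object's values as the dependency entries is a guess.

```python
import json


def normalize_dependencies(payload):
    """Return dependencies as a list, whether the plugin sent a list
    or an object."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        # Assumption: the object's values are the dependency entries.
        return list(payload.values())
    raise TypeError(f"unexpected dependency payload: {type(payload).__name__}")


def parse_dependencies(raw_json):
    """Decode the JSON text printed by the plugin and normalize its shape."""
    return normalize_dependencies(json.loads(raw_json))
```

Raising on any other shape is deliberate: a third, unknown shape should fail the job with a clear message instead of being silently ignored.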

That is the value of the harness. It doesn't make the model infallible. It forces the model to answer to facts.

Models need good friction

There is a phrase I like for this: good friction.

Bad friction is bureaucracy. Duplicate tickets. Approvals with no criteria. Meetings where nobody decides anything.

Good friction is a test that fails with a clear reason. A linter that catches a stupid mistake. A healthcheck that checks real content, not only HTTP 200. A pipeline that won't let a deploy pass until the public site responds with concrete signals from the right app.
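A content-aware healthcheck can be very small. This Python sketch uses only the standard library. The function names are hypothetical, and the markers to look for are assumptions that each site has to choose for itself.

```python
import urllib.request


def page_is_healthy(status, body, required, forbidden=()):
    """True only if the status is 200, every required marker is present,
    and no forbidden marker (for example an error banner) appears."""
    if status != 200:
        return False
    if any(marker not in body for marker in required):
        return False
    return not any(marker in body for marker in forbidden)


def check_url(url, required, forbidden=(), timeout=10):
    """Fetch a URL and apply page_is_healthy.
    Network and HTTP errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return page_is_healthy(resp.status, body, required, forbidden)
    except OSError:
        return False
```

For the site in this post, the required marker could be the page title text, and a forbidden marker could be WordPress's "Error establishing a database connection" message, which can come back from a server that is technically responding.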

Models need that friction more than we do.

An agent can make ten changes in ten minutes. That sounds great until three of those changes depend on assumptions. The agent doesn't know that the Coolify WordPress container doesn't ship with wp-cli. It doesn't know that rg may not exist on the remote server. It doesn't know that a plugin changes JSON shape between environments. It can infer those things, but inference is not verification.

The pipeline turns inference into evidence.

And when it fails, the failure becomes work material. You are no longer arguing about whether the model sounds right. You are looking at a job with an ID, logs, duration, and output.

Receipts matter

One of the worst habits in AI-assisted teams is accepting "done" as a state.

Done according to whom?

Done according to the model means little. Done according to a commit is not enough. Done according to a manual deploy can be a lie if nobody knows what ended up on the server. Done starts to mean something when there are receipts:

Commit: edc6a07
Pipeline: 2563528319
Validation job: success
Deploy job: success
Public website: responds with Workademy | Tutor LMS
Admin state: verified through WP-CLI

The receipts don't need to be pretty. They need to exist.
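A receipt can be produced by the job itself. The sketch below assumes GitLab CI, whose predefined variables include CI_COMMIT_SHORT_SHA, CI_PIPELINE_ID, and CI_JOB_NAME. The function names, the file layout, and the check names are hypothetical.

```python
import json
from datetime import datetime, timezone


def build_receipt(env, checks):
    """Collect the IDs someone will need later, plus the result of each check.
    `env` is a mapping such as os.environ; `checks` maps check name -> bool."""
    return {
        "commit": env.get("CI_COMMIT_SHORT_SHA", "unknown"),
        "pipeline": env.get("CI_PIPELINE_ID", "unknown"),
        "job": env.get("CI_JOB_NAME", "unknown"),
        "checks": dict(checks),
        # An empty set of checks is not a success.
        "ok": bool(checks) and all(checks.values()),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def write_receipt(path, receipt):
    """Store the receipt as JSON so it can be kept as a job artifact."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(receipt, fh, indent=2, sort_keys=True)
```

In a deploy job this would be called with os.environ and the real results of the healthcheck and the WP-CLI verification, and the resulting file kept as an artifact.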

A receipt also helps another human, or another agent, pick up the work tomorrow. If something breaks, they don't have to reconstruct the story from loose chat messages. They can open the pipeline, read the commit, inspect the log, and reproduce the state.

That changes how we collaborate with models. The model stops being a machine that answers and becomes a machine that works inside a testable system.

The harness is not only CI

CI/CD is one part of the harness. It is not the whole thing.

For models, the complete harness usually has several layers:

Small tests for pure logic.
Integration tests for contracts between pieces.
Configuration validation before touching production.
Automated deploy from a controlled branch.
Healthchecks that inspect content, not only status codes.
Logs with IDs you can cite later.
Safety rules for external actions: emails, payments, posts, destructive changes.

The last layer matters a lot. A model can send an email with credentials. Technically, that is easy. Operationally, it is delicate. The harness there is not pytest; it is a rule: draft the email, show a preview, wait for explicit confirmation, send from the right account, and verify that Gmail put it in Sent.

That kind of guardrail is less glamorous than a test suite. It also prevents real messes.
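The email rule can be written down as code so that the agent cannot skip a step. This is a sketch under assumptions: confirm, send, and verify_sent are hypothetical callables standing in for a prompt to a human, a mail client, and a check of the Sent folder. No real mail API is used.

```python
from dataclasses import dataclass


@dataclass
class EmailDraft:
    sender: str
    to: str
    subject: str
    body: str


def preview(draft):
    """Render exactly what the human is asked to approve."""
    return (
        f"From: {draft.sender}\nTo: {draft.to}\n"
        f"Subject: {draft.subject}\n\n{draft.body}"
    )


def send_with_confirmation(draft, confirm, send, verify_sent):
    """Show a preview, require an explicit 'yes', send, then verify.

    confirm(text) -> str, send(draft) -> message id,
    verify_sent(message_id) -> bool. All three are injected by the caller.
    """
    answer = confirm(preview(draft))
    if answer.strip().lower() != "yes":
        return "cancelled"
    message_id = send(draft)
    if not verify_sent(message_id):
        raise RuntimeError(f"message {message_id} not found in Sent")
    return "sent"
```

Anything other than an explicit "yes" cancels the action, and a send that cannot be verified raises instead of being reported as done.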

Designing repos for agents

If we are going to use AI agents in development, we need to design repos as if an agent will touch them.

That means obvious commands. Idempotent scripts. Documented variables. Healthchecks with clear patterns. Errors that explain what is missing. Pipelines that don't depend on a person staring at a terminal.

A repo prepared for agents is also better for humans. The agent just forces the issue. Where a human can ask in Slack, "how do we deploy this?", the agent needs a command. Where a human can remember that "on that server the container has a weird name," the agent needs to discover it or read it from a variable.

Pretty documentation helps. A script that fails well helps more.
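"Fails well" can be as simple as checking configuration first and naming what is missing. In this Python sketch the variable names and hints are hypothetical examples, not the ones used in this project.

```python
import os
import sys

# Hypothetical variable names, each with a hint an agent or human can act on.
REQUIRED = {
    "DEPLOY_HOST": "hostname of the server that runs the WordPress container",
    "WP_CONTAINER": "name of the running WordPress container",
    "SITE_URL": "public URL used by the healthcheck",
}


def missing_variables(env, required=REQUIRED):
    """Return one readable line per required variable that is unset or empty."""
    return [
        f"{name} is not set: {hint}"
        for name, hint in required.items()
        if not env.get(name)
    ]


def main(env=os.environ):
    """Print every problem to stderr and return a shell-style exit code."""
    problems = missing_variables(env)
    for line in problems:
        print(f"config error: {line}", file=sys.stderr)
    return 1 if problems else 0
```

A deploy script would call sys.exit(main()) before doing anything else. It reports all missing variables at once, so nobody has to fix them one failed run at a time.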

The point

The conversation about AI in development usually runs to the extremes. Either the model replaces programmers, or it is autocomplete with marketing.

The useful part is in the middle: models working inside systems that verify.

A model with serious CI/CD around it can move quickly without turning every change into a bet. It can write, test, fix, deploy, and report with evidence. When it gets something wrong, the harness brings it back to the ground.

That is the pattern I want to repeat: don't ask the model to be perfect. Build an environment where being wrong is cheap, visible, and fixable.

AI doesn't need more trust. It needs better harnesses.