Why Odoo HA Breaks Before It Helps on Small Swarm Setups

I wanted high availability.

What I got was a login loop, broken sessions, missing files, and a database slowed down by storage that had no business hosting Postgres.

This happened on a small Docker Swarm setup running Odoo. On paper it looked fine: two Odoo replicas on different nodes, Traefik in front, shared storage where needed, and the usual promise of resilience. In practice it was fragile from top to bottom.

The problem wasn’t Odoo by itself. The problem was pretending a small cluster was ready for HA when the parts underneath still weren’t.

What went wrong

Users got stuck in `303` redirect loops because sessions were not actually shared across nodes. The login landed on one replica, the next request landed on the other, and that replica had never seen the session, so Odoo sent the user back to the login page as if the login had never completed.
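A toy model makes the failure mode concrete. This sketch assumes two replicas behind a round-robin balancer with no sticky sessions, each keeping sessions in its own local store. By default Odoo stores sessions as files under its data directory, which is why this bites when that directory is per-node. `Replica` and `follow_redirects` are hypothetical names, and this is not Odoo's session code.

```python
import itertools
import uuid


class Replica:
    """Toy stand-in for one app replica; not Odoo's real session handling."""

    def __init__(self, name, session_store=None):
        self.name = name
        # Without a shared store, each replica keeps a private session dict,
        # much like a session directory on each node's local disk.
        self.sessions = session_store if session_store is not None else {}

    def login(self):
        # Simulates the user submitting credentials: create a session on
        # *this* replica, then redirect into the app.
        sid = uuid.uuid4().hex
        self.sessions[sid] = {"uid": 2}
        return 303, "/web", sid

    def web(self, sid):
        # A session this replica has never seen is treated as "not logged in".
        if sid in self.sessions:
            return 200, None, sid
        return 303, "/web/login", sid


def follow_redirects(replicas, max_hops=6):
    """Drive a browser-like client through a round-robin balancer with no stickiness."""
    balancer = itertools.cycle(replicas)
    path, sid, hops = "/web/login", None, []
    for _ in range(max_hops):
        replica = next(balancer)
        if path == "/web/login":
            status, path, sid = replica.login()
        else:
            status, path, sid = replica.web(sid)
        hops.append((replica.name, status))
        if status == 200:
            break
    return hops


if __name__ == "__main__":
    # Separate stores: every hop is a 303, the client never reaches the app.
    print(follow_redirects([Replica("a"), Replica("b")]))
    # -> [('a', 303), ('b', 303), ('a', 303), ('b', 303), ('a', 303), ('b', 303)]
    shared = {}
    print(follow_redirects([Replica("a", shared), Replica("b", shared)]))
    # -> [('a', 303), ('b', 200)]
```

Either change breaks the loop: a session store every replica can read, or sticky routing at the proxy so a client keeps hitting the replica that holds its session (in Traefik, the service's `loadBalancer.sticky.cookie` option). Stickiness only hides the session problem, though; it does nothing for the filestore.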

The filestore had the same problem. One node had the files Odoo needed. Another node didn’t. That meant missing assets and failures that only made sense once you tracked which node served the request.
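A quick way to catch this is to compare the filestore as each node sees it. Odoo keeps attachments under its data directory in `filestore/<database name>`. This sketch assumes you can reach both nodes' copies from one machine (for example by rsyncing each node's directory somewhere local); the script name `filestore_diff.py` and the paths are hypothetical. It only reports differences and fixes nothing.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large attachments never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_filestore(root):
    """Map every file's path, relative to root, to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def diff_filestores(root_a, root_b):
    """Report files present on only one side, or on both sides with different content."""
    a, b = index_filestore(root_a), index_filestore(root_b)
    return {
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
        "different": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


if __name__ == "__main__":
    # Usage (hypothetical paths):
    #   python filestore_diff.py /tmp/node1/filestore/DBNAME /tmp/node2/filestore/DBNAME
    report = diff_filestores(sys.argv[1], sys.argv[2])
    for kind, paths in report.items():
        print(f"{kind}: {len(paths)}")
        for rel in paths:
            print(f"  {rel}")
```

An empty report on every pair of nodes is the minimum bar before a second replica is allowed to serve traffic.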

Then the database layer made everything worse. Postgres was sitting on NFS-backed storage, so checkpoints became painfully slow, and pages that should have loaded in a fraction of a second started hanging for tens of seconds.
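Before putting Postgres anywhere, it is worth checking which filesystem actually backs its data directory. This Linux-only sketch reads `/proc/mounts` and picks the longest mount point that contains the path. Two assumptions: `/var/lib/postgresql/data` as the data directory (the long-standing default `PGDATA` of the official `postgres` Docker image), and a short, hand-picked set of network filesystem types to flag.

```python
import os
import re
import sys

# Assumed list of network filesystems worth flagging under a database; extend as needed.
NETWORK_FS = {"nfs", "nfs4", "cifs", "smb3"}


def _unescape(field):
    """/proc/mounts writes space, tab, newline and backslash as octal escapes like \\040."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def mount_for(path, mounts_text):
    """Return (mount_point, fstype) of the longest mount point that contains path."""
    path = os.path.normpath(path)
    best_mnt, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mnt, fstype = _unescape(fields[1]), fields[2]
        inside = path == mnt or path.startswith(mnt.rstrip("/") + "/")
        # ">=" lets a later entry for the same mount point win, as with stacked mounts.
        if inside and len(mnt) >= len(best_mnt):
            best_mnt, best_type = mnt, fstype
    return best_mnt, best_type


def is_network_fs(path, mounts_text):
    return mount_for(path, mounts_text)[1] in NETWORK_FS


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/postgresql/data"
    data_dir = os.path.realpath(target)  # resolve symlinks before matching
    with open("/proc/mounts") as fh:
        mounts = fh.read()
    mnt, fstype = mount_for(data_dir, mounts)
    print(f"{data_dir} is on {mnt} ({fstype})")
    if fstype in NETWORK_FS:
        print("warning: network filesystem under the Postgres data directory")
```

A local filesystem in the result does not prove the disk is fast enough; it only rules out the trap described here.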

The trap

This is the trap with small self-hosted stacks. You read “high availability” and think redundancy. What you actually build is a system with more moving parts than your storage, session model, and routing can support.

For Odoo that matters a lot. It has sessions, a filestore, background jobs, websocket behavior in newer versions, and a database that does not forgive bad storage choices. If those pieces are only half-solved, the HA story is fake.

Two replicas do not make a deployment safer if both replicas depend on weak assumptions.

What actually helped

The fix was not clever. We stopped pretending the environment was ready for HA.

The better setup was one Odoo replica, Postgres on local disk, deliberate placement for the app and the database, and routing that matched how Odoo actually works.
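One part of Odoo routing that often trips people up is the real-time channel. The sketch below is my assumption about what "routing that matches Odoo" involves, not the exact rules used here. It assumes Odoo runs with workers, so a separate gevent process serves `/websocket` (Odoo 16 and later) or `/longpolling` (earlier versions) on port 8072, while everything else goes to 8069. Both ports are Odoo's shipped defaults; `upstream_port` is a hypothetical helper that only expresses the rule a proxy has to encode.

```python
# Odoo's default ports: http_port, and gevent_port (longpolling_port before 16).
HTTP_PORT = 8069
REALTIME_PORT = 8072


def upstream_port(path: str, odoo_major: int) -> int:
    """Pick the Odoo port a reverse proxy should forward a request path to.

    Assumes multi-worker mode, where the real-time channel is served by a
    separate gevent process: /websocket from Odoo 16 on, /longpolling before that.
    """
    realtime_prefix = "/websocket" if odoo_major >= 16 else "/longpolling"
    if path == realtime_prefix or path.startswith(realtime_prefix + "/"):
        return REALTIME_PORT
    return HTTP_PORT


if __name__ == "__main__":
    for path in ("/web/login", "/websocket", "/longpolling/poll"):
        print(path, upstream_port(path, 17), upstream_port(path, 15))
```

In Traefik this rule becomes a second router whose PathPrefix rule matches the real-time path and points at a service on port 8072, next to the default router for port 8069.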

Once Postgres moved off NFS and the stateful parts stopped fighting the cluster, response times went back to normal.

The rule I trust now

If your HA design depends on hope, it isn’t HA.

If your sessions aren’t shared correctly, it isn’t HA.

If your filestore is inconsistent, it isn’t HA.

If Postgres is sitting on storage that makes checkpoints crawl, it definitely isn’t HA.

It is just a wider blast radius.

Conclusion

Small teams love the idea of enterprise-grade infrastructure. I get it. I love it too.

But the professional move is not the architecture that sounds better in a diagram. It is the one that fails less in production.

For this Odoo stack, that meant giving up fake HA and choosing the boring setup that matched reality.