The Odoo incident: when a Docker Swarm worker took the database with it

What happened

A few days ago we had to restore an old Odoo database backup.

The immediate symptom was simple: Odoo needed to come back online, and the production database was no longer usable. The deeper problem was worse. The active PostgreSQL data lived on a Docker Swarm worker, in a local Docker volume attached to that node.

When that worker disappeared, the database volume disappeared with it.

That is the part that matters.

The failure was not an Odoo bug. It was an infrastructure mistake: critical persistent data was tied to a Swarm worker as if that worker were permanent.

Omar described it well in a voice note while we were reconstructing the incident:

"We had the database on a Docker Swarm worker. The volume with the Odoo database was on that node. When we lost that node, we no longer had the elements needed to recover it."

That is the whole incident in three sentences.

Why the restore was old

After the worker was gone, the next step was obvious: restore the latest backup.

That sounds easy until the backup is not actually restorable.

We found recent backup files, but they were too small to represent the real production database. They looked like backup artifacts, not usable recovery points. A backup file that exists but cannot restore production data is just a comforting lie on disk.

The usable backup was older. It brought Odoo back, but it also moved the system back in time.

The restored database was Odoo 18.0. Some of our references still pointed to 19.0, which created a second layer of confusion when XML-RPC authentication and old assumptions failed. The site was alive, but recent business records were missing.

The visible damage

The first visible damage was in accounting.

Some invoices and customers created after the backup date were gone. Not archived, not hidden, not filtered out. Gone from the restored database because they had never existed at the time of that backup.

Two invoices were recoverable because they had already been sent by email:

INV/2026/00012 for Global Journey Consulting SL
INV/2026/00016 for TRABITAT GLOBAL SL

Both were 2,000 EUR invoices. Both had PDF copies in sent mail. Both customers were missing or incomplete in the restored Odoo database.

That detail saved us. Email became the recovery source.

The recovery

The recovery process was deliberately boring.

First, we searched sent mail. The relevant account was omard@trabitat.com, and the useful query was simply recent sent messages containing "factura" (Spanish for "invoice").

That produced two invoice PDFs:

INV-2026-00012.pdf
INV_2026_00016.pdf

Before touching Odoo, we made a fresh database backup:

/root/website-db-backup/website_18_before_invoice_recovery_20260603T173921Z.sql.gz

Then we recreated the two invoices in Odoo through the ORM, not by writing SQL directly. That matters because Odoo accounting records are not just one table. Posting an invoice touches moves, move lines, sequences, partners, taxes, attachments, chatter, and accounting constraints.
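
A minimal sketch of what "through the ORM" can look like, assuming the standard account.move fields of recent Odoo versions. The helper name build_invoice_vals and all of its arguments are illustrative, not the script we actually ran. Partner, account, and tax IDs have to be looked up in the restored database first.

```python
from datetime import date


def build_invoice_vals(number, partner_id, invoice_date, description,
                       amount, account_id, tax_ids=()):
    """Build the values dict for one customer invoice (account.move)."""
    line = {
        "name": description,
        "quantity": 1.0,
        "price_unit": amount,
        "account_id": account_id,
        # (6, 0, ids) sets the line's taxes to exactly this list of IDs.
        "tax_ids": [(6, 0, list(tax_ids))],
    }
    return {
        "move_type": "out_invoice",
        # Assumption: forcing the name keeps the number printed on the PDF.
        "name": number,
        "partner_id": partner_id,
        "invoice_date": invoice_date.isoformat(),
        # (0, 0, vals) creates the invoice line together with the move.
        "invoice_line_ids": [(0, 0, line)],
    }


# Inside `odoo shell`, where `env` is provided (not runnable outside Odoo):
#   move = env["account.move"].create(build_invoice_vals(...))
#   move.action_post()
#   env.cr.commit()
```

Creating the move this way lets Odoo generate the move lines and run its accounting constraints. Whether the forced number is accepted depends on the journal's sequence state, so it is worth checking the number again after posting.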

The recreated invoices ended up as:

INV/2026/00012, Odoo account.move ID 215
INV/2026/00016, Odoo account.move ID 216

Both were posted. Both remained unpaid. Both had their original PDFs attached to the invoice record.

Then we verified the result:

each invoice number existed exactly once
each invoice was in posted state
each invoice total was 2000.0
each invoice had one PDF attachment
the accounting line used account 8125
the tax configuration matched the restored database
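
The first four checks above can be sketched as a small function. This is an illustration under assumptions, not the script we ran: it expects plain dicts such as those returned by a search_read on account.move, and attachment_count is a hypothetical key that the caller fills by counting PDF attachments on each record. The account and tax checks are left out to keep it short.

```python
def verify_invoices(rows, expected_numbers, expected_total=2000.0):
    """Return a list of failed checks; an empty list means all passed.

    rows: dicts with keys name, state, amount_total, attachment_count.
    """
    failures = []
    for number in expected_numbers:
        matches = [r for r in rows if r["name"] == number]
        # Each invoice number must exist exactly once.
        if len(matches) != 1:
            failures.append(f"{number}: found {len(matches)} records, expected 1")
            continue
        row = matches[0]
        if row["state"] != "posted":
            failures.append(f"{number}: state is {row['state']}, expected posted")
        # Compare amounts with a small tolerance instead of float equality.
        if abs(row["amount_total"] - expected_total) > 0.005:
            failures.append(f"{number}: total is {row['amount_total']}")
        if row["attachment_count"] != 1:
            failures.append(f"{number}: {row['attachment_count']} PDF attachments")
    return failures
```

Returning the failures instead of raising on the first one means a single run reports everything that is wrong.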

That is the kind of receipt I want after a recovery. Not "it should be fine." A list of checks that actually ran.

The real mistake

The real mistake was mixing Swarm scheduling with local persistence.

Docker Swarm is good at moving containers around. A worker can disappear and the scheduler will try to place the service somewhere else. That is fine for stateless services. It is dangerous for PostgreSQL unless the storage model is designed for it.

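A local Docker volume belongs to a node. If PostgreSQL writes to a local volume on worker A, worker B does not magically have that data. When Swarm reschedules the task onto worker B, Docker creates a new, empty volume with the same name there. Moving the container does not move the database.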

For web containers, this is usually fine. For PostgreSQL, it is the whole game.

The setup treated a local worker volume as if it were durable shared infrastructure. It was not.

The backup problem

The second mistake was trust without restore tests.

A backup job can run every night and still be useless. It can produce tiny dumps. It can back up the wrong database. It can fail halfway and leave a file behind. It can succeed technically while missing the data that matters.
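
Some of these failure modes can be caught cheaply, before any restore test. The sketch below assumes a gzip-compressed plain-text pg_dump file, like the .sql.gz backup mentioned earlier. The function name check_dump and the min_bytes threshold are illustrative; a sensible threshold depends on the real size of the production database.

```python
import gzip
import os
import zlib

# Plain-text pg_dump output ends with a comment containing this text.
PG_DUMP_TRAILER = b"PostgreSQL database dump complete"


def check_dump(path, min_bytes):
    """Return a list of problems found in a .sql.gz dump (empty if none)."""
    problems = []
    size = os.path.getsize(path)
    if size < min_bytes:
        problems.append(f"file is {size} bytes, expected at least {min_bytes}")
    tail = b""
    try:
        with gzip.open(path, "rb") as fh:
            # Stream the file in 1 MiB chunks and keep only the last 4 KiB.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                tail = (tail + chunk)[-4096:]
    except (OSError, EOFError, zlib.error) as exc:
        problems.append(f"not a readable gzip stream: {exc}")
        return problems
    if PG_DUMP_TRAILER not in tail:
        problems.append("dump trailer missing; pg_dump may have stopped early")
    return problems
```

This catches tiny dumps, truncated files, and dumps that stopped halfway. It does not catch a dump of the wrong database, which is why the restore test still matters.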

The only backup that counts is one you have restored and checked.

For Odoo, a restore test should answer concrete questions:

Can PostgreSQL restore the dump into a clean database?
Does Odoo start against that restored database?
Do core models load without module errors?
Are record counts for users, partners, invoices, and tasks in the expected range?
Can we open the login page and at least one business record?
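
The database side of these questions comes down to a handful of PostgreSQL client commands. The sketch below only builds the command strings and does not run them. It assumes a gzip-compressed plain SQL dump, client tools that can connect without extra flags, and a scratch database name chosen by the operator. The table list is an assumption too: project_task exists only if the Project app is installed.

```python
import shlex

# Core Odoo tables whose row counts are worth comparing after a restore.
COUNT_TABLES = ("res_users", "res_partner", "account_move", "project_task")


def restore_plan(dump_path, scratch_db):
    """Return shell commands for a restore test into a scratch database."""
    db = shlex.quote(scratch_db)
    dump = shlex.quote(dump_path)
    plan = [
        f"dropdb --if-exists {db}",
        f"createdb {db}",
        # ON_ERROR_STOP=1 makes psql exit non-zero on the first SQL error.
        f"gunzip -c {dump} | psql -v ON_ERROR_STOP=1 -q -d {db}",
    ]
    for table in COUNT_TABLES:
        # -A and -t print the bare number, which is easy to parse.
        plan.append(f"psql -At -d {db} -c 'SELECT count(*) FROM {table}'")
    return plan
```

Running the plan under a shell with pipefail enabled makes a gunzip failure fail the restore step too. The Odoo side would follow as a separate step, for example starting Odoo against the scratch database with --stop-after-init and checking the exit code.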

If those checks do not run, we do not have a backup system. We have a backup ritual.

The fix I want

For this kind of Odoo setup, I would keep it simple.

PostgreSQL should run on a known node with known storage until there is a real HA design. Pin it. Document it. Monitor it. Stop pretending Swarm gives database HA by itself.
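
"Pin it" and "monitor it" can be combined into one small check. The sketch below reads the JSON printed by docker service inspect for the database service and reports whether a placement constraint ties it to a single node. It treats only node.hostname and node.id equality constraints as pinned, which is my assumption about what counts as a known node. The function names are illustrative.

```python
import json


def placement_constraints(inspect_json):
    """Extract placement constraints from `docker service inspect` output."""
    services = json.loads(inspect_json)  # the CLI prints a JSON array
    spec = services[0]["Spec"]
    placement = spec.get("TaskTemplate", {}).get("Placement", {})
    return placement.get("Constraints") or []


def is_pinned(inspect_json):
    """True if a constraint ties the service to one hostname or node ID."""
    for constraint in placement_constraints(inspect_json):
        compact = constraint.replace(" ", "")
        if compact.startswith(("node.hostname==", "node.id==")):
            return True
    return False
```

A scheduled job that alerts when is_pinned returns False would catch a redeploy that silently drops the constraint.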

Then add a proper backup flow:

daily logical dump with size checks
retention across at least two machines
restore test into a temporary database
Odoo smoke test against the restored database
alert if record counts drop below sane thresholds
monthly manual recovery drill
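
The record-count alert from this list can be sketched as a comparison between two restore tests. The function name count_alerts and the 5% default for max_drop are illustrative assumptions, not tuned values. The counts are plain dicts mapping a table name to its row count.

```python
def count_alerts(current, previous, max_drop=0.05):
    """Return alert messages for tables whose row count dropped too far.

    current, previous: dicts mapping table name to row count.
    max_drop: largest tolerated relative drop (0.05 means 5%).
    """
    alerts = []
    for table, before in previous.items():
        now = current.get(table)
        if now is None:
            alerts.append(f"{table}: no count in the latest restore test")
        elif before > 0 and now < before * (1 - max_drop):
            alerts.append(f"{table}: dropped from {before} to {now} rows")
    return alerts
```

Comparing against the previous run rather than a fixed number means the threshold grows with the business. A fixed minimum per table can be added on top if counts are expected never to fall below a known floor.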

That is not fancy. It is the minimum.

If we later want real HA, then we design real HA: replicated PostgreSQL, tested failover, shared or replicated filestore, session handling, and a clear operator runbook. Until then, a boring single-node database with verified backups is better than pretend HA with local volumes on random workers.

What I took from it

This incident did not teach me that Odoo is fragile. Odoo was just the application sitting on top of the mistake.

It taught me that persistence has to be boring on purpose.

Databases do not care that a scheduler can restart containers. Accounting records do not care that a service has one replica running somewhere. Invoices live or die by the storage layer and the restore plan.

We recovered two invoices from sent mail this time. That was luck plus good operational hygiene: the invoices had been sent, the PDFs still existed, and the reconstruction was small enough to verify manually.

Next time the recovery source should not be Gmail.

It should be a tested backup.