Monday, June 05, 2017

Postgres lockup on ALTER TABLE

You can learn a lot when you break things.

I had a bit of fun and panic with Postgres the other day, while trying to do the equivalent of this innocuous seeming drop column query:

ALTER TABLE aschema.sometable DROP COLUMN IF EXISTS unwanted_column;

So innocuous was this query that the column didn't even exist in production. It had already been dropped. I was checking in an Alembic migration to keep staging and dev DBs in line with prod. Harmless, right?

The deploy went live and things started going from green to red. Uh oh.

Well, it turns out that even a non-consequential ALTER TABLE operation takes a table lock. ...which should be totally fine, unless some long-running process has taken out an AccessShareLock and is sitting on it. There's a nice write-up on this exact situation here:

The [ALTER TABLE operation] tries to take the ACCESS EXCLUSIVE it needs, and it queues up behind the first lock. Now all future lock requests queue up behind the waiting ACCESS EXCLUSIVE request. Conceptually, incoming lock requests which are compatible with the already-granted lock could jump over the waiting ACCESS EXCLUSIVE and be granted out of turn, but that is not how PostgreSQL does it. other words, screeching halt - database frozen. Crap!

Postgres committer Josh Berkus wrote a pair of articles on the topic (ALTER TABLE and downtime, part 1 and ALTER TABLE and downtime, part 2). Great, now what?

Luckily, Postgres provides means to find and kill errant tasks. So, let's find and cancel the blocked ALTER TABLE query and/or the guilty part holding the shared lock.

In case you're ever in the same situation, here are a few key PostgreSQL incantations:

What locks are being held?

select d.datname, t.relname, l.relation, l.locktype, page, virtualtransaction, pid, mode, granted
from pg_locks l
left join pg_stat_all_tables t on l.relation=t.relid
left join pg_database d on l.database=d.oid
order by d.datname, relation asc;

What locks are being held on a particular table?

select * from pg_locks where granted and relation = 'my_table'::regclass \x\g\x

Who's breaking my database and what do they think they're doing?

select * from pg_stat_activity where pid=1234567 \x\g\x
select pg_cancel_backend(pid);

See what you get when you break my database?

select pg_cancel_backend(1234567);

...and finally, thanks to our friend pg_cancel_backend, our DB is unfrozen and we can go track down who had that lock held for so long.