|
Home > Archive > Slony1 PostgreSQL Replication > April 2006 > Graceful switchover gone wrong.
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Graceful switchover gone wrong.
|
|
| Gavin Hamill 2006-04-05, 8:28 pm |
| Oh dear - why is nothing ever easy? :(
I've just tried to gracefully swap master/slave roles with little success...
cayenne:~# /usr/lib/postgresql/8.1/bin/slonik <slonswap.txt
.... which provided no output for 5 minutes at which point I CTRL-C'd...
This script is:
cluster name = replication;
node 1 admin conninfo='host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX';
node 2 admin conninfo='host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX';
lock set (id = 1, origin = 1);
wait for event (origin = 1, confirmed = 2);
move set (id = 1, old origin = 1, new origin = 2);
wait for event (origin = 1, confirmed = 2);
i.e. copy+paste directly from slony_115/failover.html
The log for node 1 shows:
2006-04-05 23:49:32 BST CONFIG moveSet: set_id=1 old_origin=1 new_origin=2
2006-04-05 23:49:32 BST DEBUG1 remoteWorkerThread_2
: helper thread for provider 2 created
2006-04-05 23:49:32 BST CONFIG storeListen: li_origin=2 li_receiver=1 li_provider=2
2006-04-05 23:49:35 BST DEBUG1 remoteWorkerThread_2
: connected to data provider 2 on 'host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX'
During this 'dead time' I tried to execute a simple UPDATE on node 1, and was told
"ERROR: Slony-I: Table pbx_ext_state is replicated and cannot be modified on a subscriber node" - great - that's exactly what I'd expect.
Unfortunately, I was told the same thing when I executed the same query on node 2! At this point I paniced and executed 'uninstall node (id=1)' to clean out the current machine's slony config so I could at least bring our website back online again.
The next line in node 1's log after the above is:
2006-04-05 23:56:55 BST FATAL syncThread: "start transaction;set transaction isolation level serializable;select last_value from "_replication".sl_action_seq;" - ERROR: schema "_replication" does not exist
which unsurprisingly is when I uninstalled node 1 :) I also saw a process on node 1 during the 'dead time' marked as 'idle in transaction'.
I've not touched node 2 - what can the various sl_ tables tell me about why this process froze, and what should I be looking for if/when it happens when I try it again tomorrow? :(
<sigh> Maybe I should just run away and join the circus...
Cheers,
Gavin
| |
| Jan Wieck 2006-04-06, 7:38 am |
| On 4/5/2006 7:15 PM, Gavin Hamill wrote:
> Oh dear - why is nothing ever easy? :(
>
> I've just tried to gracefully swap master/slave roles with little success...
Briefly stopping and forcibly disconnecting all clients (pgpool is very
handy for that) from both, the old and the new origin, usually makes
switchover a lot smoother and faster.
Since obviously the MOVE SET happened on node 1 but wasn't able to
complete on node 2, you must have had some backend on node 2 that kept a
lock on one of the replicated tables, preventing the event to replicate.
Slon needs to put back all the user constraints, remove the deny_access
trigger and put the log trigger into place on the new origin, so it
needs an access_exclusive table lock on the entire set. What you've done
at this point, while slon on node 2 was waiting to get those locks, is
to destroy that event in the UNINSTALL NODE panic move (and I assume you
also stopped or restarted slon on node 2) before it did replicate. With
that event info gone, node 2 is still a perfectly valid subscriber of a
non-existent node.
Restarting the postmaster of node 2 or any client application accessing
node 2 should have been enough.
Jan
--
#===================
====================
====================
===========#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#===================
====================
=========== JanWieck- bwPqjjyvM7QAvxtiuMwx
3w@public.gmane.org #
| |
| Gavin Hamill 2006-04-06, 8:28 pm |
| On Thu, 06 Apr 2006 08:18:34 -0400
Jan Wieck <JanWieck- bwPqjjyvM7QAvxtiuMwx
3w@public.gmane.org> wrote:
> Briefly stopping and forcibly disconnecting all clients (pgpool is very
> handy for that) from both, the old and the new origin, usually makes
> switchover a lot smoother and faster.
OK I'll bear that in mind for tonight's attempt :)
> Since obviously the MOVE SET happened on node 1 but wasn't able to
> complete on node 2, you must have had some backend on node 2 that kept a
> lock on one of the replicated tables, preventing the event to replicate.
Looking at scrollbacks in the cold light of day revealed a telltale
sign - there was still a vacuum process running that I'd started hours
earlier and forgot about - only a vacuum analyse, but I'm sure it won't
have helped matters!
> Restarting the postmaster of node 2 or any client application accessing
> node 2 should have been enough.
Heh, at least I'm now the wiser for this - many thanks :)
Am hoping tonight's run will be much smoother =)
Cheers,
Gavin.
|
|
|
|
|