Home > Archive > Slony1 PostgreSQL Replication > April 2006 > Graceful switchover gone wrong.









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Graceful switchover gone wrong.
Gavin Hamill

2006-04-05, 8:28 pm

Oh dear - why is nothing ever easy? :(

I've just tried to gracefully swap master/slave roles with little success...

cayenne:~# /usr/lib/postgresql/8.1/bin/slonik <slonswap.txt
.... which provided no output for 5 minutes at which point I CTRL-C'd...

This script is:

cluster name = replication;
node 1 admin conninfo='host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX';
node 2 admin conninfo='host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX';

lock set (id = 1, origin = 1);
wait for event (origin = 1, confirmed = 2);
move set (id = 1, old origin = 1, new origin = 2);
wait for event (origin = 1, confirmed = 2);

i.e. copy+paste directly from slony_115/failover.html

The log for node 1 shows:
2006-04-05 23:49:32 BST CONFIG moveSet: set_id=1 old_origin=1 new_origin=2
2006-04-05 23:49:32 BST DEBUG1 remoteWorkerThread_2
: helper thread for provider 2 created
2006-04-05 23:49:32 BST CONFIG storeListen: li_origin=2 li_receiver=1 li_provider=2
2006-04-05 23:49:35 BST DEBUG1 remoteWorkerThread_2
: connected to data provider 2 on 'host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX'

During this 'dead time' I tried to execute a simple UPDATE on node 1, and was told
"ERROR: Slony-I: Table pbx_ext_state is replicated and cannot be modified on a subscriber node" - great - that's exactly what I'd expect.

Unfortunately, I was told the same thing when I executed the same query on node 2! At this point I paniced and executed 'uninstall node (id=1)' to clean out the current machine's slony config so I could at least bring our website back online again.

The next line in node 1's log after the above is:

2006-04-05 23:56:55 BST FATAL syncThread: "start transaction;set transaction isolation level serializable;select last_value from "_replication".sl_action_seq;" - ERROR: schema "_replication" does not exist

which unsurprisingly is when I uninstalled node 1 :) I also saw a process on node 1 during the 'dead time' marked as 'idle in transaction'.

I've not touched node 2 - what can the various sl_ tables tell me about why this process froze, and what should I be looking for if/when it happens when I try it again tomorrow? :(

<sigh> Maybe I should just run away and join the circus...

Cheers,
Gavin
Jan Wieck

2006-04-06, 7:38 am

On 4/5/2006 7:15 PM, Gavin Hamill wrote:

> Oh dear - why is nothing ever easy? :(
>
> I've just tried to gracefully swap master/slave roles with little success...


Briefly stopping and forcibly disconnecting all clients (pgpool is very
handy for that) from both, the old and the new origin, usually makes
switchover a lot smoother and faster.

Since obviously the MOVE SET happened on node 1 but wasn't able to
complete on node 2, you must have had some backend on node 2 that kept a
lock on one of the replicated tables, preventing the event to replicate.
Slon needs to put back all the user constraints, remove the deny_access
trigger and put the log trigger into place on the new origin, so it
needs an access_exclusive table lock on the entire set. What you've done
at this point, while slon on node 2 was waiting to get those locks, is
to destroy that event in the UNINSTALL NODE panic move (and I assume you
also stopped or restarted slon on node 2) before it did replicate. With
that event info gone, node 2 is still a perfectly valid subscriber of a
non-existent node.

Restarting the postmaster of node 2 or any client application accessing
node 2 should have been enough.


Jan

--
#===================
====================
====================
===========#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#===================
====================
=========== JanWieck- bwPqjjyvM7QAvxtiuMwx
3w@public.gmane.org #
Gavin Hamill

2006-04-06, 8:28 pm

On Thu, 06 Apr 2006 08:18:34 -0400
Jan Wieck <JanWieck- bwPqjjyvM7QAvxtiuMwx
3w@public.gmane.org> wrote:

> Briefly stopping and forcibly disconnecting all clients (pgpool is very
> handy for that) from both, the old and the new origin, usually makes
> switchover a lot smoother and faster.


OK I'll bear that in mind for tonight's attempt :)

> Since obviously the MOVE SET happened on node 1 but wasn't able to
> complete on node 2, you must have had some backend on node 2 that kept a
> lock on one of the replicated tables, preventing the event to replicate.


Looking at scrollbacks in the cold light of day revealed a telltale
sign - there was still a vacuum process running that I'd started hours
earlier and forgot about - only a vacuum analyse, but I'm sure it won't
have helped matters!

> Restarting the postmaster of node 2 or any client application accessing
> node 2 should have been enough.


Heh, at least I'm now the wiser for this - many thanks :)

Am hoping tonight's run will be much smoother =)

Cheers,
Gavin.
Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com