Home > Archive > Slony1 PostgreSQL Replication > September 2005 > Failover failures









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Failover failures
elein

2005-08-22, 8:25 pm

Slony 1.1. Three nodes. 10 set(1) => 20 => 30.

I ran failover from node10 to node20.

On node30, the origin of the set was changed
from 10 to 20, however, drop node10 failed
because of the row in sl_setsync.

This causes slon on node30 to quit and the cluster to
become unstable. Which in turn prevents putting
node10 back into the mix.

Please tell me I'm not the first one to run into
this...

The only clean work around I can see is to drop
node 30. Re-add it. And then re-add node10. This
leaves us w/o a back up for the downtime.


This is what is in some of the tables for node20:

gb2=# select * from sl_node;
no_id | no_active | no_comment | no_spool
-------+-----------+-------------------------+----------
20 | t | Node 20 - gb2@localhost | f
30 | t | Node 30 - gb3@localhost | f
(2 rows)

gb2=# select * from sl_set;
set_id | set_origin | set_locked | set_comment
--------+------------+------------+----------------------
1 | 20 | | Set 1 for gb_cluster
gb2=# select * from sl_setsync;
ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
-----------+------------+-----------+------------+------------+---------+-----------------
(0 rows)

This is what I have for node30:

gb3=# select * from sl_node;
no_id | no_active | no_comment | no_spool
-------+-----------+-------------------------+----------
10 | t | Node 10 - gb@localhost | f
20 | t | Node 20 - gb2@localhost | f
30 | t | Node 30 - gb3@localhost | f
(3 rows)

gb3=# select * from sl_set;
set_id | set_origin | set_locked | set_comment
--------+------------+------------+----------------------
1 | 20 | | Set 1 for gb_cluster
(1 row)

gb3=# select * from sl_setsync;
ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
-----------+------------+-----------+------------+------------+---------+-----------------
1 | 10 | 235 | 1290260 | 1290261 | |
(1 row)

frustrated,
--elein
elein

2005-08-23, 3:24 am

replying to my own message, added the error message
from dropping the dead node from node30...

On Mon, Aug 22, 2005 at 05:12:19PM -0700, elein wrote:
> Slony 1.1. Three nodes. 10 set(1) => 20 => 30.
>
> I ran failover from node10 to node20.
>
> On node30, the origin of the set was changed
> from 10 to 20, however, drop node10 failed
> because of the row in sl_setsync.
>
> This causes slon on node30 to quit and the cluster to
> become unstable. Which in turn prevents putting
> node10 back into the mix.
>
> Please tell me I'm not the first one to run into
> this...
>
> The only clean work around I can see is to drop
> node 30. Re-add it. And then re-add node10. This
> leaves us w/o a back up for the downtime.
>
>
> This is what is in some of the tables for node20:
>
> gb2=# select * from sl_node;
> no_id | no_active | no_comment | no_spool
> -------+-----------+-------------------------+----------
> 20 | t | Node 20 - gb2@localhost | f
> 30 | t | Node 30 - gb3@localhost | f
> (2 rows)
>
> gb2=# select * from sl_set;
> set_id | set_origin | set_locked | set_comment
> --------+------------+------------+----------------------
> 1 | 20 | | Set 1 for gb_cluster
> gb2=# select * from sl_setsync;
> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
> -----------+------------+-----------+------------+------------+---------+-----------------
> (0 rows)
>
> This is what I have for node30:
>
> gb3=# select * from sl_node;
> no_id | no_active | no_comment | no_spool
> -------+-----------+-------------------------+----------
> 10 | t | Node 10 - gb@localhost | f
> 20 | t | Node 20 - gb2@localhost | f
> 30 | t | Node 30 - gb3@localhost | f
> (3 rows)
>
> gb3=# select * from sl_set;
> set_id | set_origin | set_locked | set_comment
> --------+------------+------------+----------------------
> 1 | 20 | | Set 1 for gb_cluster
> (1 row)
>
> gb3=# select * from sl_setsync;
> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
> -----------+------------+-----------+------------+------------+---------+-----------------
> 1 | 10 | 235 | 1290260 | 1290261 | |
> (1 row)
>
> frustrated,
> --elein


drop_node10.slnk:5: PGRES_FATAL_ERROR select "_gb_cluster".dropNode(10); - ERROR: update or delete on "sl_node" violates foreign key constraint "ssy_origin-no_id-ref" on "sl_setsync"
DETAIL: Key (no_id)=(10) is still referenced from table "sl_setsync".
CONTEXT: SQL statement "delete from "_gb_cluster".sl_node where no_id = $1 "
PL/pgSQL function "dropnode_int" line 29 at SQL statement
SQL statement "SELECT "_gb_cluster".dropNode_int( $1 )"
PL/pgSQL function "dropnode" line 47 at perform
drop_node10.slnk:7: Failed to drop node 10 from cluster
Christopher Browne

2005-08-23, 11:24 am

elein wrote:

>Slony 1.1. Three nodes. 10 set(1) => 20 => 30.
>
>I ran failover from node10 to node20.
>
>On node30, the origin of the set was changed
>from 10 to 20, however, drop node10 failed
>because of the row in sl_setsync.
>
>This causes slon on node30 to quit and the cluster to
>become unstable. Which in turn prevents putting
>node10 back into the mix.
>
>Please tell me I'm not the first one to run into
>this...
>
>The only clean work around I can see is to drop
>node 30. Re-add it. And then re-add node10. This
>leaves us w/o a back up for the downtime.
>
>
>This is what is in some of the tables for node20:
>
>gb2=# select * from sl_node;
> no_id | no_active | no_comment | no_spool
>-------+-----------+-------------------------+----------
> 20 | t | Node 20 - gb2@localhost | f
> 30 | t | Node 30 - gb3@localhost | f
>(2 rows)
>
>gb2=# select * from sl_set;
> set_id | set_origin | set_locked | set_comment
>--------+------------+------------+----------------------
> 1 | 20 | | Set 1 for gb_cluster
>gb2=# select * from sl_setsync;
> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
>-----------+------------+-----------+------------+------------+---------+-----------------
>(0 rows)
>
>This is what I have for node30:
>
>gb3=# select * from sl_node;
> no_id | no_active | no_comment | no_spool
>-------+-----------+-------------------------+----------
> 10 | t | Node 10 - gb@localhost | f
> 20 | t | Node 20 - gb2@localhost | f
> 30 | t | Node 30 - gb3@localhost | f
>(3 rows)
>
>gb3=# select * from sl_set;
> set_id | set_origin | set_locked | set_comment
>--------+------------+------------+----------------------
> 1 | 20 | | Set 1 for gb_cluster
>(1 row)
>
>gb3=# select * from sl_setsync;
> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
>-----------+------------+-----------+------------+------------+---------+-----------------
> 1 | 10 | 235 | 1290260 | 1290261 | |
>(1 row)
>
>frustrated,
>--elein
> ____________________
____________________
_______
>Slony1-general mailing list
>Slony1-general- AuKwsB3Fm+ugFIWk8tvy
RWD2FQJk+8+b@public.gmane.org
>http://gborg.postgresql.org/mailman.../slony1-general
>
>

That error message in your other email was VERY helpful in pointing at
least at what clues to look for...

I /think/ that the FAILOVER_SET event hasn't yet been processed on node
30, which would be consistent with everything we see.

Can you check logs on node 30 or sl_event on node 30 to see if
FAILOVER_SET has made it there?

What seems not to have happened is for the FAILOVER_SET event to process
on node 30; when that *does* happen, it would delete out the sl_setsync
entry pointing to node 10 and create a new one pointing to node 20.
(This is in the last 1/2 of function failoverSet_int().)

I'll bet that sl_subscribe on node 30 is still pointing to node 10; that
would be further confirmation that FAILOVER_SET hasn't processed on node 30.

If that event hasn't processed, then we can at least move the confusion
from being:
"Help! I don't know why the configuration is so odd on node 30!"
to
"Hmm. The config is consistent with FAILOVER not being done yet. What
prevented the FAILOVER_SET event from processing on node 30?"

We're not at a full answer, but the latter question points to a more
purposeful search :-).
elein

2005-08-23, 8:25 pm


On Tue, Aug 23, 2005 at 11:40:45AM -0400, Christopher Browne wrote:
> elein wrote:
>
> That error message in your other email was VERY helpful in pointing at
> least at what clues to look for...
>
> I /think/ that the FAILOVER_SET event hasn't yet been processed on node
> 30, which would be consistent with everything we see.
>
> Can you check logs on node 30 or sl_event on node 30 to see if
> FAILOVER_SET has made it there?
>
> What seems not to have happened is for the FAILOVER_SET event to process
> on node 30; when that *does* happen, it would delete out the sl_setsync
> entry pointing to node 10 and create a new one pointing to node 20.
> (This is in the last 1/2 of function failoverSet_int().)
>
> I'll bet that sl_subscribe on node 30 is still pointing to node 10; that
> would be further confirmation that FAILOVER_SET hasn't processed on node 30.
>
> If that event hasn't processed, then we can at least move the confusion
> from being:
> "Help! I don't know why the configuration is so odd on node 30!"
> to
> "Hmm. The config is consistent with FAILOVER not being done yet. What
> prevented the FAILOVER_SET event from processing on node 30?"
>
> We're not at a full answer, but the latter question points to a more
> purposeful search :-).
>


I'm fairly certain the failover did not make it to node30. I have
the set up here and will rerun it clean and send you logs and whatever
tables you would like. But not until this evening PST.

--elein
elein

2005-08-23, 8:25 pm

On Tue, Aug 23, 2005 at 11:29:34AM -0700, elein wrote:
>
> On Tue, Aug 23, 2005 at 11:40:45AM -0400, Christopher Browne wrote:
>
> I'm fairly certain the failover did not make it to node30. I have
> the set up here and will rerun it clean and send you logs and whatever
> tables you would like. But not until this evening PST.
>
> --elein


One more hint. node30 was originally getting set 1 (originating on node10)
from node20. Perhaps the failover logic thought it did not have to
do so much in that case?

if only I could think of everything in one messsage :)

elein
elein

2005-08-27, 3:24 am

Chris Browne has been helpful but busy.
Can someone else try to look into this problem.

It seems to be a basic failure on failover when
there are cascaded replicas. You are left unable
to add back in the failed node.
[color=darkred]

* failover 10 to 20
* bounce slons
* drop node 10 <---chokes on FK Constraint on Node30
"begin transaction; set transaction isolation level serializable; lock table "_gb_cluster".sl_config_lock; select "_gb_cluster".dropNode_int(10); notify "_gb_cluster_Restart"; notify "_gb_cluster_Event"; notify "_gb_cluster_Confirm"; insert into "_
gb_cluster".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1 ) values ('20', '84', '2005-08-26 19:03:34.92452', '3577884', '3577885', '', 'DROP_NODE', '10'); insert into "_gb_cluster".sl_confirm (c
on_origin, con_received, con_seqno, con_timestamp) values (20, 30, '84', CURRENT_TIMESTAMP); commit transaction;" PGRES_FATAL_ERROR ERROR: update or delete on "sl_node" violates foreign key constraint "ssy_origin-no_id-ref" on "sl_setsync"DETAIL: Key
(no_id)=(10) is still referenced from table "sl_setsync".
CONTEXT: SQL statement "delete from "_gb_cluster".sl_node where no_id = $1 "
PL/pgSQL function "dropnode_int" line 29 at SQL statement


I have the repro. I have the dumps of the dbs. I have the log files.

I've looked at the code and in failoverSet_int it seems
to do the right thing for sl_setsync. But I think that
this was never called or sent to Node30 because
it was not a direct subscriber. I'm not familiar enough with the
higher level calling mechanism to know where this should be called
and isn't.

``elein

On Tue, Aug 23, 2005 at 11:40:45AM -0400, Christopher Browne wrote:[color=darkred
]
> elein wrote:
>
Darcy Buskermolen

2005-09-06, 8:24 pm

On Monday 22 August 2005 17:12, elein wrote:
> Slony 1.1. Three nodes. 10 set(1) => 20 => 30.
>
> I ran failover from node10 to node20.
>
> On node30, the origin of the set was changed
> from 10 to 20, however, drop node10 failed
> because of the row in sl_setsync.
>
> This causes slon on node30 to quit and the cluster to
> become unstable. Which in turn prevents putting
> node10 back into the mix.
>
> Please tell me I'm not the first one to run into
> this...
>
> The only clean work around I can see is to drop
> node 30. Re-add it. And then re-add node10. This
> leaves us w/o a back up for the downtime.
>
>
> This is what is in some of the tables for node20:
>
> gb2=# select * from sl_node;
> no_id | no_active | no_comment | no_spool
> -------+-----------+-------------------------+----------
> 20 | t | Node 20 - gb2@localhost | f
> 30 | t | Node 30 - gb3@localhost | f
> (2 rows)
>
> gb2=# select * from sl_set;
> set_id | set_origin | set_locked | set_comment
> --------+------------+------------+----------------------
> 1 | 20 | | Set 1 for gb_cluster
> gb2=# select * from sl_setsync;
> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip |
> ssy_action_list
> -----------+------------+-----------+------------+------------+---------+--
>--------------- (0 rows)
>
> This is what I have for node30:
>
> gb3=# select * from sl_node;
> no_id | no_active | no_comment | no_spool
> -------+-----------+-------------------------+----------
> 10 | t | Node 10 - gb@localhost | f
> 20 | t | Node 20 - gb2@localhost | f
> 30 | t | Node 30 - gb3@localhost | f
> (3 rows)
>
> gb3=# select * from sl_set;
> set_id | set_origin | set_locked | set_comment
> --------+------------+------------+----------------------
> 1 | 20 | | Set 1 for gb_cluster
> (1 row)
>
> gb3=# select * from sl_setsync;
> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip |
> ssy_action_list
> -----------+------------+-----------+------------+------------+---------+--
>--------------- 1 | 10 | 235 | 1290260 | 1290261 |
> | (1 row)
>
> frustrated,
> --elein

Elein,
I can share your frustration, I have just for the first time started to
investigate failover and I have yet to be able to have a clean failover
happen, no matter how I do a failover I end up with nodes that are no longer
in sync with other the nodes. My time is fairly short this week, but I hope
to be able to spend some time on it. I've pushed all my other slony work to
the back burner to come to a solid resolution to this.

Jan/Chris are either of you able to reproduce stable failovers in a multi node
(more than a single origin/subscriber pair) ?

> ____________________
____________________
_______
> Slony1-general mailing list
> Slony1-general- AuKwsB3Fm+ugFIWk8tvy
RWD2FQJk+8+b@public.gmane.org
> http://gborg.postgresql.org/mailman.../slony1-general


--
Darcy Buskermolen
Wavefire Technologies Corp.

http://www.wavefire.com
ph: 250.717.0200
fx: 250.763.1759
Andrew Sullivan

2005-09-07, 7:24 am

I'm going to forward this to our internal QA folks, who are currently
doing extended tests on the 1.1 release.

A

On Tue, Sep 06, 2005 at 02:29:17PM -0700, Darcy Buskermolen wrote:
> On Monday 22 August 2005 17:12, elein wrote:
> Elein,
> I can share your frustration, I have just for the first time started to
> investigate failover and I have yet to be able to have a clean failover
> happen, no matter how I do a failover I end up with nodes that are no longer
> in sync with other the nodes. My time is fairly short this week, but I hope
> to be able to spend some time on it. I've pushed all my other slony work to
> the back burner to come to a solid resolution to this.
>
> Jan/Chris are either of you able to reproduce stable failovers in a multi node
> (more than a single origin/subscriber pair) ?
>
>
> --
> Darcy Buskermolen
> Wavefire Technologies Corp.
>
> http://www.wavefire.com
> ph: 250.717.0200
> fx: 250.763.1759
> ____________________
____________________
_______
> Slony1-general mailing list
> Slony1-general- AuKwsB3Fm+ugFIWk8tvy
RWD2FQJk+8+b@public.gmane.org
> http://gborg.postgresql.org/mailman.../slony1-general


--
Andrew Sullivan | ajs-oaT0K0jot5/q2IAV+ODieA@public.gmane.org
A certain description of men are for getting out of debt, yet are
against all taxes for raising money to pay it off.
--Alexander Hamilton
Christopher Browne

2005-09-28, 8:25 pm

Darcy Buskermolen wrote:

>On Monday 22 August 2005 17:12, elein wrote:
>
>
>Elein,
>I can share your frustration, I have just for the first time started to
>investigate failover and I have yet to be able to have a clean failover
>happen, no matter how I do a failover I end up with nodes that are no longer
>in sync with other the nodes. My time is fairly short this week, but I hope
>to be able to spend some time on it. I've pushed all my other slony work to
>the back burner to come to a solid resolution to this.
>
>Jan/Chris are either of you able to reproduce stable failovers in a multi node
>(more than a single origin/subscriber pair) ?
>
>

I finally put together a suitable environment to do some testing of this...

I'm running into a case where, upon failover from node 1 to node 2, I
get the following error message from slonik:

-sh-2.05b$ slonik failover.slonik
failover.slonik:6: NOTICE: Slony-I: terminating DB connection of faile
node with pid 13903
CONTEXT: PL/pgSQL function "failednode" line 75 at perform
FATAL: terminating connection due to administrator command
failover.slonik:6: NOTICE: failedNode: set 1 has no other direct
receivers - move now
failover.slonik:6: NOTICE: Slony-I: terminating DB connection of faile
node with pid 13905
CONTEXT: PL/pgSQL function "failednode" line 75 at perform
FATAL: terminating connection due to administrator command
failover.slonik:6: NOTICE: failedNode: set 1 has no other direct
receivers - move now
-sh-2.05b$

After which point things are not QUITE ok.

I find that node #2 still has the "denyaccess" triggers in place.

Interestingly, if I do a LOCK SET/MOVE SET to shift origin to node 3,
then shift it back to node 2, all seems to be well again.

That was with 1.0.5, not 1.1 (because I had the compile handy :-)). I'd
not expect material differences in 1.1, as there hasn't been substantial
change to this in 1.1.

The case of single origin/subscriber isn't worth researching much as
that's a case where there isn't really much point to FAILOVER, as losing
the origin means you no longer have a replication cluster anymore.
Supposing there were anomalies there, I'd find that somewhat
uninteresting, as it makes just as much sense to do an UNINSTALL NODE
and drop replication from the surviving node altogether.
Christopher Browne

2005-09-28, 8:25 pm

elein wrote:

>Chris Browne has been helpful but busy.
>Can someone else try to look into this problem.
>
>It seems to be a basic failure on failover when
>there are cascaded replicas. You are left unable
>to add back in the failed node.
>
>

I am getting convinced that the problem (a problem?) lies in trying to
delete the dead node too soon.

If I work slowly enough, it all "works out;" I can drop the failed node
and add it back in.

If I wait (oh, say 10 minutes) between FAILOVER and dropping node 1,
then that's enough time for references to node 1 to get purged out of
event logs and sl_setsync and such.

If I don't wait, the "unrelated" nodes wind up falling over, unhappy
about the attempt to delete a node that they still are referencing.

This seems a nicely evil bit of log...

ERROR remoteWorkerThread_2
: "begin transaction; set transaction
isolation level serializable; lock table "_failtest".sl_config_lock;
select "_failtest".dropNode_int(1); notify "_failtest_Restart"; notify
"_failtest_Event"; notify "_failtest_Confirm"; insert into
"_failtest".sl_event (ev_origin, ev_seqno, ev_timestamp,
ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1 ) values ('2',
'102', '2005-09-28 21:55:53.176607', '462075', '462076', '',
'DROP_NODE', '1'); insert into "_failtest".sl_confirm (con_origin,
con_received, con_seqno, con_timestamp) values (2, 3, '102',
CURRENT_TIMESTAMP); commit transaction;" PGRES_FATAL_ERROR ERROR:
update or delete on "sl_node" violates foreign key constraint
"ssy_origin-no_id-ref" on "sl_setsync"
DETAIL: Key (no_id)=(1) is still referenced from table "sl_setsync".

This bit of log file gets generated on node #3 (the extra subscriber)
when I submit the DROP NODE request.

It's not suggesting an analytical fix to me just yet... I'm not sure I
want to automagically delete the sl_setsync entry just yet at the point
that this happens...
elein

2005-09-29, 8:24 pm

I believe I waited a while for the failover to get
over to the third node. Should I have killed and
restarted the slon processes?

If I get a chance today, I will try to reconstruct
my test cases.

--elein

On Wed, Sep 28, 2005 at 06:08:34PM -0400, Christopher Browne wrote:
> elein wrote:
>
> I am getting convinced that the problem (a problem?) lies in trying to
> delete the dead node too soon.
>
> If I work slowly enough, it all "works out;" I can drop the failed node
> and add it back in.
>
> If I wait (oh, say 10 minutes) between FAILOVER and dropping node 1,
> then that's enough time for references to node 1 to get purged out of
> event logs and sl_setsync and such.
>
> If I don't wait, the "unrelated" nodes wind up falling over, unhappy
> about the attempt to delete a node that they still are referencing.
>
> This seems a nicely evil bit of log...
>
> ERROR remoteWorkerThread_2
: "begin transaction; set transaction
> isolation level serializable; lock table "_failtest".sl_config_lock;
> select "_failtest".dropNode_int(1); notify "_failtest_Restart"; notify
> "_failtest_Event"; notify "_failtest_Confirm"; insert into
> "_failtest".sl_event (ev_origin, ev_seqno, ev_timestamp,
> ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1 ) values ('2',
> '102', '2005-09-28 21:55:53.176607', '462075', '462076', '',
> 'DROP_NODE', '1'); insert into "_failtest".sl_confirm (con_origin,
> con_received, con_seqno, con_timestamp) values (2, 3, '102',
> CURRENT_TIMESTAMP); commit transaction;" PGRES_FATAL_ERROR ERROR:
> update or delete on "sl_node" violates foreign key constraint
> "ssy_origin-no_id-ref" on "sl_setsync"
> DETAIL: Key (no_id)=(1) is still referenced from table "sl_setsync".
>
> This bit of log file gets generated on node #3 (the extra subscriber)
> when I submit the DROP NODE request.
>
> It's not suggesting an analytical fix to me just yet... I'm not sure I
> want to automagically delete the sl_setsync entry just yet at the point
> that this happens...
>

Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com