
Failover Stalls

2005-09-27, 9:24 am

I have set up Slony-1 v 1.1.0 on two servers, each with identical
databases, and have organised replication between the two of them.

I can get a switchover to work, but when I do a failover the failover
script stops in the middle of the failover command.

Here is an extract from the script. The variables are set by the rest of
the script, nodeId is the ID of the subscriber, remoteId is the ID
of the origin, and the idea of the script is to run it on the surviving
server after something nasty has happened to the other server.

log "Attempting failover to local node ($nodeId)"
log `date`
slonik <<EOF
cluster name = $CLUSTER_NAME;
node 1 admin conninfo = '$one_conninfo';
node 2 admin conninfo = '$two_conninfo';
echo 'Failing over to node $nodeId';
failover ( id=$remoteId, backup node = $nodeId);
echo 'Failover complete';
EOF

In my test scenario, node 2 is the origin. I kill the postmaster on node
2 to simulate the server dying a horrible death. The slon daemon on node
2 dies and the slon daemon on node 1 starts to complain of being unable
to access the node 2 database (I've x'd out the true IP address) :-

2005-09-27 14:33:39 BST DEBUG2 remoteWorkerThread_2: forward confirm 1,9841 received by 2
2005-09-27 14:33:40 BST ERROR remoteListenThread_2: "select ev_origin,
ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip,
ev_type, ev_data1, ev_data2, ev_data3, ev_data4,
ev_data5, ev_data6, ev_data7, ev_data8 from "_dot_ha".sl_event e
where (e.ev_origin = '2' and e.ev_seqno > '26') order by e.ev_origin,
e.ev_seqno" - server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2005-09-27 14:33:49 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9842
2005-09-27 14:33:49 BST DEBUG2 localListenThread: Received event 1,9842 SYNC
2005-09-27 14:33:50 BST ERROR slon_connectdb: PQconnectdb("dbname=DOT host=xxx.xxx.xxx.xxx user=postgres") failed - could not connect to server: Connection refused
        Is the server running on host "xxx.xxx.xxx.xxx" and accepting
        TCP/IP connections on port 5432?
2005-09-27 14:33:50 BST WARN remoteListenThread_2: DB connection failed - sleep 10 seconds

At this point I now have no origin, but a working subscriber, up to
date, at least up to the time of the last synch.

I want to make this subscriber the new origin. A switchover won't work
so I execute the above script on node 1 and I get the following output:-

ha-failover.sh: Attempting failover to local node (1)
ha-failover.sh: Tue Sep 27 14:34:17 BST 2005
<stdin>:4: Failing over to node 1
<stdin>:5: NOTICE: failedNode: set 1 has no other direct receivers - move now

And then the script just hangs there (I've left it running for over an
hour). It seems to be stuck on the failover line as it never reaches the
second echo statement.

When I look at the node 1 database I see that I can now update the
replicated tables, so node 1 now thinks it is the master. I can check
this by inspecting sl_set and see the origin for my replication set is
now node 1. The sl_subscribe table is empty. The sl_node table shows
both nodes and both are active, which strikes me as suspicious.

DOT=# select * from _dot_ha.sl_node;
 no_id | no_active |   no_comment    | no_spool
-------+-----------+-----------------+----------
     1 | t         | Node One        | f
     2 | t         | Node Two        | f
(2 rows)

There is only one line in the slon daemon log that is of significance at
the moment of the failover:-

2005-09-27 14:34:10 BST WARN remoteListenThread_2: DB connection failed - sleep 10 seconds
2005-09-27 14:34:17 BST INFO localListenThread: got restart notification - signal scheduler
2005-09-27 14:34:20 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9845

Can anybody give me any clues as to what is going on?

Thanks

Steve Hindmarch
BT Exact

Christopher Browne

2005-09-28, 8:25 pm

stephen.hindmarch-5Ybtn9MHkAA@public.gmane.org wrote:

>I have set up Slony-1 v 1.1.0 on two servers, each with identical
>databases, and have organised replication between the two of them.
>
>I can get a switchover to work, but when I do a failover the failover
>script stops in the middle of the failover command.
>
>Here is an extract from the script. The variables are set by the rest of
>the script, nodeId is the ID of the subscriber, remoteId is the ID
>of the origin, and the idea of the script is to run it on the surviving
>server after something nasty has happened to the other server.
>
>log "Attempting failover to local node ($nodeId)"
>log `date`
>slonik <<EOF
> cluster name = $CLUSTER_NAME;
> node 1 admin conninfo = '$one_conninfo';
> node 2 admin conninfo = '$two_conninfo';
> echo 'Failing over to node $nodeId';
> failover ( id=$remoteId, backup node = $nodeId);
> echo 'Failover complete';
>EOF
>
>In my test scenario, node 2 is the origin. I kill the postmaster on node
>2 to simulate the server dying a horrible death. The slon daemon on node
>2 dies and the slon daemon on node 1 starts to complain of being unable
>to access the node 2 database (I've x'd out the true IP address) :-
>
>2005-09-27 14:33:39 BST DEBUG2 remoteWorkerThread_2: forward confirm 1,9841 received by 2
>2005-09-27 14:33:40 BST ERROR remoteListenThread_2: "select ev_origin,
>ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip,
>ev_type, ev_data1, ev_data2, ev_data3, ev_data4,
>ev_data5, ev_data6, ev_data7, ev_data8 from "_dot_ha".sl_event e
>where (e.ev_origin = '2' and e.ev_seqno > '26') order by e.ev_origin,
>e.ev_seqno" - server closed the connection unexpectedly
>        This probably means the server terminated abnormally
>        before or while processing the request.
>2005-09-27 14:33:49 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9842
>2005-09-27 14:33:49 BST DEBUG2 localListenThread: Received event 1,9842 SYNC
>2005-09-27 14:33:50 BST ERROR slon_connectdb: PQconnectdb("dbname=DOT host=xxx.xxx.xxx.xxx user=postgres") failed - could not connect to server: Connection refused
>        Is the server running on host "xxx.xxx.xxx.xxx" and accepting
>        TCP/IP connections on port 5432?
>2005-09-27 14:33:50 BST WARN remoteListenThread_2: DB connection failed - sleep 10 seconds
>
>At this point I now have no origin, but a working subscriber, up to
>date, at least up to the time of the last synch.
>
>I want to make this subscriber the new origin. A switchover won't work
>so I execute the above script on node 1 and I get the following output:-
>
>ha-failover.sh: Attempting failover to local node (1)
>ha-failover.sh: Tue Sep 27 14:34:17 BST 2005
><stdin>:4: Failing over to node 1
><stdin>:5: NOTICE: failedNode: set 1 has no other direct receivers - move now
>
>And then the script just hangs there (I've left it running for over an
>hour). It seems to be stuck on the failover line as it never reaches the
>second echo statement.
>
>

That is somewhat curious. I'll see if I can see why that would be.

"Wild speculation" (which is no more valuable than "speculative gossip")
would be that perhaps it's waiting to tell all the remaining subscribers
something, and since there aren't any, there's something confused about
that.

Your scenario here is one where it would be about as useful to simply do
an UNINSTALL NODE on node 1, because once the FAILOVER is done, there
will be nothing other than node 1 in the cluster. With no subscribers,
the presence of replication is pretty well a "historical curiosity."

Under such a circumstance, with two nodes, and the master dead, I'd be
inclined to simply drop replication, as, with only one node, you don't
honestly have replication going on anymore...
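
For reference, a minimal sketch of what that single-node teardown could look like, written as a shell function that prints a slonik script in the same style as the script quoted above. The function name emit_uninstall_script, the sample conninfo, and the cluster name dot_ha (inferred from the "_dot_ha" schema in the logs) are assumptions for illustration, not part of the original script. It only prints the script, so it can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch only: prints a slonik script that removes the Slony-I
# configuration from a single node. emit_uninstall_script and its
# arguments are hypothetical names chosen for this example.
emit_uninstall_script() {
    cluster=$1    # cluster name, e.g. dot_ha
    node=$2       # id of the surviving node
    conninfo=$3   # conninfo string for that node
    cat <<EOF
cluster name = $cluster;
node $node admin conninfo = '$conninfo';
echo 'Removing replication from node $node';
uninstall node (id = $node);
EOF
}

# Review the output first; pipe it to slonik only once it looks right:
#   emit_uninstall_script dot_ha 1 'dbname=DOT user=postgres' | slonik
```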

>When I look at the node 1 database I see that I can now update the
>replicated tables, so node 1 now thinks it is the master. I can check
>this by inspecting sl_set and see the origin for my replication set is
>now node 1. The sl_subscribe table is empty. The sl_node table shows
>both nodes and both are active, which strikes me as suspicious.
>
>DOT=# select * from _dot_ha.sl_node;
> no_id | no_active |   no_comment    | no_spool
>-------+-----------+-----------------+----------
>     1 | t         | Node One        | f
>     2 | t         | Node Two        | f
>(2 rows)
>
>
>

This actually *isn't* suspicious. This is normal.

FAILOVER doesn't actually drop out the failed node.

Dropping a node has, alas, side-effects, notably purging out information
about the events coming from that node. That would add insult to injury
supposing we had a node 3 that was more up to date than node 1.

We would then find ourselves in the regrettable position of knowing that
node 3 had better data, but having no way to properly apply it to
node 1 to get it up to speed. Node 3 was in better shape, but we would
have to drop it, too, because there would be no way to get at its
data :-(.

Anyhoo, node 2 won't go away until you explicitly drop it. Which should
wait until the reformed cluster is working OK...

And as for sl_subscribe, well, there is no longer any subscriber to set
1. Node #1 is the only node still working; nothing is subscribing to
it. The emptiness of sl_subscribe is just fine.
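
The checks described above can be sketched as a small shell helper that prints the three catalog queries for a given cluster. The helper name emit_state_queries is hypothetical, and the column lists are a reasonable subset rather than anything taken from the original post; the schema name is assumed to be the cluster name with a leading underscore, as in "_dot_ha".

```shell
#!/bin/sh
# Sketch only: prints queries for inspecting cluster state after a
# failover. emit_state_queries is a hypothetical helper name.
emit_state_queries() {
    schema="_$1"    # Slony-I schema: underscore + cluster name
    cat <<EOF
-- origin of each replication set (expect the backup node after failover)
select set_id, set_origin from "$schema".sl_set;
-- remaining subscriptions (empty when no subscriber is left)
select sub_set, sub_provider, sub_receiver from "$schema".sl_subscribe;
-- nodes still known to the cluster (a failed node stays until dropped)
select no_id, no_active, no_comment from "$schema".sl_node;
EOF
}

# Example: emit_state_queries dot_ha | psql -d DOT
```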

>There is only one line in the slon daemon log that is of significance at
>the moment of the failover:-
>
>2005-09-27 14:34:10 BST WARN remoteListenThread_2: DB connection failed - sleep 10 seconds
>2005-09-27 14:34:17 BST INFO localListenThread: got restart notification - signal scheduler
>2005-09-27 14:34:20 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9845
>
>Can anybody give me any clues as to what is going on?
>
>

It seems to me as though everything is actually OK.

You'll want to drop node 2...
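
A minimal sketch of that drop, again as a shell function that prints a slonik script. The name emit_drop_node_script and its argument order are assumptions for illustration. Only the surviving node's conninfo is listed, on the assumption that the failed node is unreachable and that the event is submitted on the survivor.

```shell
#!/bin/sh
# Sketch only: prints a slonik script that drops a failed node, with
# the surviving node as the event node. emit_drop_node_script is a
# hypothetical name.
emit_drop_node_script() {
    cluster=$1     # cluster name, e.g. dot_ha
    failed=$2      # id of the failed node
    survivor=$3    # id of the node that took over
    conninfo=$4    # conninfo string for the survivor
    cat <<EOF
cluster name = $cluster;
node $survivor admin conninfo = '$conninfo';
echo 'Dropping failed node $failed';
drop node (id = $failed, event node = $survivor);
EOF
}

# Example, once the re-formed cluster is working:
#   emit_drop_node_script dot_ha 2 1 'dbname=DOT user=postgres' | slonik
```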

2005-09-29, 7:25 am

Thanks for your response Chris.

I did get further by stepping through the slonik code to see where the
problem was.

It appears that slonik is waiting forever for the slon daemon to
restart. You'll see in the log that slon gets the restart signal but
never seems to do anything about it.
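
Since the stall shows up as slonik never returning, one way to at least surface it in a script is to run the step under a time limit. This is only a sketch: it assumes the GNU coreutils timeout utility is available, and run_with_limit is a hypothetical name. Whether it is safe to interrupt slonik part-way through a failover is a separate question that this does not answer.

```shell
#!/bin/sh
# Sketch only: run a command under a time limit so that a stalled step
# is reported instead of blocking forever. Assumes GNU coreutils
# `timeout`; run_with_limit is a hypothetical name.
run_with_limit() {
    limit=$1    # time limit in seconds
    shift
    timeout "$limit" "$@"
    status=$?
    # GNU timeout exits with 124 when the time limit was reached
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example (failover.slonik is a placeholder file name):
#   run_with_limit 300 slonik < failover.slonik
```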

The workaround was to stop slon before running the script. The script
then runs to completion, and I can restart slon and drop node 2
with no problems.
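
The order of operations can be sketched as below. How slon is stopped and started is site-specific, so each step is passed in as a command string; the variable names STOP_SLON, RUN_FAILOVER, START_SLON and DROP_NODE and the function failover_sequence are hypothetical, not part of my actual script.

```shell
#!/bin/sh
# Sketch only: run the workaround steps in order and stop at the first
# failure. The four hook variables are hypothetical and must be set by
# the caller to whatever commands apply locally.
failover_sequence() {
    for step in "$STOP_SLON" "$RUN_FAILOVER" "$START_SLON" "$DROP_NODE"; do
        echo "running: $step"
        sh -c "$step" || { echo "step failed: $step" >&2; return 1; }
    done
    echo "failover sequence complete"
}

# Example wiring (all commands and file names are placeholders):
#   STOP_SLON='<command that stops the local slon daemon>'
#   RUN_FAILOVER='slonik < failover.slonik'
#   START_SLON='<command that starts the local slon daemon>'
#   DROP_NODE='slonik < drop-node.slonik'
#   failover_sequence
```

Stopping at the first failed step means slon is not restarted, and node 2 is not dropped, unless the failover itself completed.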

Thanks for the advice on how to do the failover. I think I'll stick with
keeping replication in place because
a) It should make recovery of the failed node simpler as I only have to
worry about bringing one node back in, rather than rebuild the whole
cluster, and
b) the way the project is going it won't be long before somebody asks me to
do a 3 node solution.

If I have any time I'll have a look at what is happening in slon. If
there are any tests you think I could run to give more clues I'd be glad
to try them out.

Steve Hindmarch
BT Exact
