Author Replication fails after network outage
Glen Eustace

2006-01-26, 5:00 pm

I am using slony-1.1.0 with postgresql-8.0.6 and have a situation that I
hope has a better resolution than the one I am currently using.

One of my 2 slaves is some distance away, and over the last 6 months or
so we have had quite a few network brownouts or blackouts between it and
the master. After such an event, replication fails, and the only way I
have managed to get it going again is to drop the node and database and
start again. I have now done this so many times that I have scripted it
so that I can get the slave back online relatively quickly.

I get errors like the following in the slony log:

2006-01-26 08:11:13 NZDT ERROR remoteWorkerThread_1: "start transaction; set enable_seqscan = off; set enable_indexscan = on; " PGRES_FATAL_ERROR
2006-01-26 08:11:13 NZDT ERROR remoteWorkerThread_1: "close LOG; " PGRES_FATAL_ERROR
2006-01-26 08:11:13 NZDT ERROR remoteWorkerThread_1: "rollback transaction; set enable_seqscan = default; set enable_indexscan = default; " PGRES_FATAL_ERROR
2006-01-26 08:11:13 NZDT ERROR remoteWorkerThread_1: helper 1 finished with error
2006-01-26 08:11:13 NZDT ERROR remoteWorkerThread_1: SYNC aborted

Stopping and restarting all the various slony processes doesn't seem to
clear things.

NB: It only ever seems to happen after a network event. Any advice on
how to get replication started again without rebuilding would be
appreciated.


--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Glen and Rosanne Eustace,
GodZone Internet Services, a division of AGRE Enterprises Ltd.,
P.O. Box 8020, Palmerston North, New Zealand 5301
Ph/Fax: +64 6 357 8168, Mob: +64 27 5 424 015, Web: www.godzone.net.nz

"A Ministry specialising in providing low-cost professional Internet
Services to NZ Christian Churches, Ministries and Organisations"

Christopher Browne

2006-01-26, 5:00 pm

One thought...

You might want to turn the logging up to a higher level; it looks as
though it's at level 1, and I'd expect "-d 2" to give more useful
information.

Another notion...

My suspicion is that what is happening is that the connection between
the slon and the database it is managing was broken by the network
event. Higher debug levels might display a message like "a slon is
already servicing node #2;" that would be a good tell-tale sign...

The next time this happens, connect to the database and look at
pg_stat_activity to see which slony-related backends are in use. My
suspicion is that you'll see several of them, possibly (if statement
logging is on) showing "<IDLE> in transaction".
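
A minimal sketch of that check in Python, assuming the rows have already been fetched from pg_stat_activity as dictionaries. The column names (procpid, usename, current_query) are the ones the 8.0-era view exposes; the role name "slony", the helper name, and the sample rows are made up for illustration, and current_query is only filled in when statement logging for the stats collector is enabled.

```python
# Query to run against the node's database (column names as in PostgreSQL 8.0).
ACTIVITY_SQL = "select procpid, usename, current_query from pg_stat_activity"

IDLE_IN_TXN = "<IDLE> in transaction"


def find_idle_slony_backends(rows, slony_user="slony"):
    """Return PIDs of backends owned by slony_user that are idle in a transaction.

    rows: iterable of dicts with keys procpid, usename, current_query.
    slony_user: hypothetical role name that the slon processes connect as.
    """
    suspects = []
    for row in rows:
        query = (row.get("current_query") or "").strip()
        if row.get("usename") == slony_user and query == IDLE_IN_TXN:
            suspects.append(row["procpid"])
    return suspects


# Made-up sample rows, shaped like the result of ACTIVITY_SQL.
sample = [
    {"procpid": 4101, "usename": "slony", "current_query": "<IDLE> in transaction"},
    {"procpid": 4102, "usename": "slony", "current_query": "fetch 100 from LOG;"},
    {"procpid": 4200, "usename": "webapp", "current_query": "<IDLE> in transaction"},
]
print(find_idle_slony_backends(sample))
```

Filtering on the connecting role is only one way to pick out the slony backends; if slon connects as a shared superuser, the list will need checking by hand.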

Solution #1... Those idle-in-transaction backends are, in effect,
'zombies' of sorts. They haven't yet figured out that the network
connection has died and won't be coming back. They could persist
(depending on TCP/IP configuration) for up to a couple of hours. Kill
them off, and see if starting new slon processes works out better.
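
The "couple of hours" figure is consistent with common TCP keepalive defaults. As a rough worked example, assuming the usual Linux defaults (tcp_keepalive_time of 7200 s, tcp_keepalive_intvl of 75 s, tcp_keepalive_probes of 9) and a socket with keepalive enabled, the time before an idle connection to a dead peer is given up on can be estimated as below. Other operating systems and tuned hosts will differ, and the function name is just for illustration.

```python
def keepalive_detection_seconds(idle_time=7200, probe_interval=75, probe_count=9):
    """Approximate seconds before an idle connection to a dead peer is dropped.

    The kernel waits idle_time before the first probe, then sends probe_count
    probes spaced probe_interval seconds apart before giving up.
    """
    return idle_time + probe_interval * probe_count


total = keepalive_detection_seconds()
print(total, "seconds, roughly", round(total / 3600, 2), "hours")
```

With those defaults the estimate comes to 7875 seconds, a little over two hours, which is why the stale backends can linger long after the outage itself has ended.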

Solution #2... It is preferable for each slon to run on the same network
as the database it is managing. That would prevent some of the above
from happening; in particular, restarting the slons would then actually
do some good.