Home > Archive > Slony1 PostgreSQL Replication > June 2005 > Slony stops when using log shipping









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Slony stops when using log shipping
Chris Newland

2005-06-03, 7:24 am

Hi All,

I have Slony 1.1.0 beta 5 working fine in a 3 node configuration with =
nodes
2 and 3 subscribing directly to node 1.

When I try to enable log shipping on node 3 using the following slon =
command
the slon process quits unexpectedly.

/usr/local/pgsql/bin/slon -a/usr/local/pgsql/data my_cluster
"dbname=3Dmydbslave user=3Dpostgres host=3D<node 3 ip>"

The slon command is run as the postgres user who has write permission on
/usr/local/pgsql/data.

Some log files are created but they all seem to be empty of any SQL
statements:

-- Slony-I log shipping archive
-- Node 2, Event 416
-- at... Fri Jun 3 11:51:12 2005

start transaction;

------------------------------------------------------------------
-- End Of Archive Log
------------------------------------------------------------------
commit;


I'm using PostgreSQL 8.0.3 built from source.

Here is the output from slon:

2005-06-03 11:51:12 BST CONFIG main: slon version 1.1.0 starting up
2005-06-03 11:51:12 BST CONFIG main: local node id =3D 3
2005-06-03 11:51:12 BST DEBUG2 slon: watchdog process started
2005-06-03 11:51:12 BST DEBUG2 slon: begin signal handler setup
2005-06-03 11:51:12 BST DEBUG2 slon: end signal handler setup
2005-06-03 11:51:12 BST DEBUG2 slon: wait for main child process
2005-06-03 11:51:12 BST DEBUG2 main: main process started
2005-06-03 11:51:12 BST DEBUG2 main: begin signal handler setup
2005-06-03 11:51:12 BST DEBUG2 main: end signal handler setup
2005-06-03 11:51:12 BST CONFIG main: launching sched_start_mainloop

2005-06-03 11:51:12 BST CONFIG main: loading current cluster =
configuration
2005-06-03 11:51:12 BST CONFIG storeNode: no_id=3D1 no_comment=3D'Master
=
Node'
2005-06-03 11:51:12 BST DEBUG2 setNodeLastEvent: no_id=3D1 =
event_seq=3D571
2005-06-03 11:51:12 BST CONFIG storeNode: no_id=3D2 no_comment=3D'Slave =
node
db2'
2005-06-03 11:51:12 BST DEBUG2 setNodeLastEvent: no_id=3D2 =
event_seq=3D413
2005-06-03 11:51:12 BST CONFIG storePath: pa_server=3D1 pa_client=3D3
pa_conninfo=3D"dbname=3Dmydb host=3D<node 1 ip> user=3Dpostgres" =
pa_connretry=3D10
2005-06-03 11:51:12 BST CONFIG storePath: pa_server=3D2 pa_client=3D3
pa_conninfo=3D"dbname=3Dmydbslave ho
st=3D<node 2 ip> user=3Dpostgres" pa_connretry=3D10
2005-06-03 11:51:12 BST CONFIG storeListen: li_origin=3D1 =
li_receiver=3D3
li_provider=3D1
2005-06-03 11:51:12 BST CONFIG storeListen: li_origin=3D2 =
li_receiver=3D3
li_provider=3D2
2005-06-03 11:51:12 BST CONFIG storeSet: set_id=3D1 set_origin=3D1
set_comment=3D'MYDB Replication Set'
2005-06-03 11:51:12 BST WARN remoteWorker_wakeup:
node 1 - no worker
thread
2005-06-03 11:51:12 BST DEBUG2 sched_wakeup_node():
no_id=3D1 (0 threads =
+
worker signaled)
2005-06-03 11:51:12 BST CONFIG storeSubscribe: sub_set=3D1 =
sub_provider=3D1
sub_forward=3D'f'
2005-06-03 11:51:12 BST WARN remoteWorker_wakeup:
node 1 - no worker
thread
2005-06-03 11:51:12 BST DEBUG2 sched_wakeup_node():
no_id=3D1 (0 threads =
+
worker signaled)
2005-06-03 11:51:12 BST CONFIG enableSubscription: sub_set=3D1
2005-06-03 11:51:12 BST WARN remoteWorker_wakeup:
node 1 - no worker
thread
2005-06-03 11:51:12 BST DEBUG2 sched_wakeup_node():
no_id=3D1 (0 threads =
+
worker signaled)
2005-06-03 11:51:12 BST DEBUG2 main: last local event sequence =3D 395
2005-06-03 11:51:12 BST CONFIG main: configuration complete - starting
threads
2005-06-03 11:51:12 BST DEBUG1 localListenThread: thread starts
2005-06-03 11:51:12 BST CONFIG enableNode: no_id=3D1
2005-06-03 11:51:12 BST DEBUG1 remoteWorkerThread_1
: thread starts
2005-06-03 11:51:12 BST DEBUG1 remoteListenThread_1
: thread starts
2005-06-03 11:51:12 BST CONFIG enableNode: no_id=3D2
2005-06-03 11:51:12 BST DEBUG1 remoteWorkerThread_2
: thread starts
2005-06-03 11:51:12 BST DEBUG1 remoteListenThread_2
: thread starts
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: start listening for
event origin 1
2005-06-03 11:51:12 BST DEBUG1 cleanupThread: thread starts
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: start listening for
event origin 2
2005-06-03 11:51:12 BST DEBUG1 syncThread: thread starts
2005-06-03 11:51:12 BST DEBUG4 cleanupThread: bias =3D 35383
2005-06-03 11:51:12 BST DEBUG1 main: running scheduler mainloop
2005-06-03 11:51:12 BST DEBUG4 remoteWorkerThread_1
: update provider
configuration
2005-06-03 11:51:12 BST DEBUG1 remoteWorkerThread_1
: helper thread for
provider 1 created
2005-06-03 11:51:12 BST DEBUG4 remoteWorkerThread_1
: added active set 1 =
to
provider 1
2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: waiting for work
2005-06-03 11:51:12 BST DEBUG4 remoteWorkerThread_2
: update provider
configuration
2005-06-03 11:51:12 BST DEBUG1 remoteListenThread_1
: connected to
'dbname=3Dmydb host=3D<node 1 ip> user
=3Dpostgres'
2005-06-03 11:51:12 BST DEBUG1 remoteListenThread_2
: connected to
'dbname=3Dmydbslave host=3D<node 2 ip>
user=3Dpostgres'
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,572 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,573 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,574 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_1
: Received event =
1,572
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,575 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,576 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,577 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,578 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,579 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,580 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,581 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,582 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,583 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,584 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_1
: SYNC 572 processing
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,585 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,586 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,587 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,588 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,589 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,590 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,591 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,592 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,593 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,594 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,595 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,414 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,415 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event =
2,414
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,416 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,417 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,418 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,419 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,420 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,421 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,422 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,423 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,424 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,425 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,426 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 414 processing
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,427 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,428 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,429 =
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need =
syncing
for this event
2005-06-03 11:51:12 BST DEBUG1 remoteWorkerThread_1
: connected to data
provider 1 on 'dbname=3Dmydb
host=3D<node 1 ip> user=3Dpostgres'
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event =
2,415
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 415 processing
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need =
syncing
for this event
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_1
: syncing set 1 with =
15
table(s) from provider 1
2005-06-03 11:51:12 BST DEBUG3 remoteWorkerThread_1
: activate helper 1
2005-06-03 11:51:12 BST DEBUG4 remoteWorkerThread_1
: waiting for log =
data
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event =
2,416
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 416 processing
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need =
syncing
for this event
2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: allocate lines
2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: have 10 line =
buffers
2005-06-03 11:51:12 BST DEBUG2 remoteHelperThread_1
_1: 0.005 seconds =
delay
for first row
2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: got 1 log rows
2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: 1 log buffers
delivered
2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: no more log rows
2005-06-03 11:51:12 BST DEBUG2 remoteHelperThread_1
_1: 0.005 seconds =
until
close cursor
2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: change helper =
thread
status
2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: send DONE/ERROR =
line
to worker
2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: waiting for =
workgroup
to finish
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event =
2,417
SYNC
2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 417 processing
2005-06-03 11:51:12 BST DEBUG1 slon: done
2005-06-03 11:51:12 BST DEBUG2 slon: exit(0)

Nothing unusual in data/serverlog

Does anybody have any ideas?

Thanks,

Chris
Christopher Browne

2005-06-06, 11:24 am

Chris Newland wrote:

>Hi All,
>
>I have Slony 1.1.0 beta 5 working fine in a 3 node configuration with nodes
>2 and 3 subscribing directly to node 1.
>
>

What platform? I'm kind of thinking NOT Linux; see my commentary way
further down on that...

>When I try to enable log shipping on node 3 using the following slon command
>the slon process quits unexpectedly.
>
>/usr/local/pgsql/bin/slon -a/usr/local/pgsql/data my_cluster
>"dbname=mydbslave user=postgres host=<node 3 ip>"
>
>The slon command is run as the postgres user who has write permission on
>/usr/local/pgsql/data.
>
>Some log files are created but they all seem to be empty of any SQL
>statements:
>
>-- Slony-I log shipping archive
>-- Node 2, Event 416
>-- at... Fri Jun 3 11:51:12 2005
>
>start transaction;
>
>------------------------------------------------------------------
>-- End Of Archive Log
>------------------------------------------------------------------
>commit;
>
>
>

It is not odd for there to be SOME empty log files. Indeed, supposing
there are no updates going into the system, there will still
periodically be some SYNC events. The default for sync_interval_timeou
t
is 60000ms, which means that there will be a timeout, and hence a SYNC,
once per minute even if there is no activity going on at all.
<http://linuxdatabases.info/info/slon.html>

>I'm using PostgreSQL 8.0.3 built from source.
>
>Here is the output from slon:
>
>

Summarizing a bit...

The log indicates that the node collects up a whole ream of SYNCs from
nodes 1 and 2, and starts work on node #2's events...

I take it that node #3 is subscribing to node #2? The fact that the
work is on remoteWorkerThread_2
points in that direction...

>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,594 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,595 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,414 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,415 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event 2,414
>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,416 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,417 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,418 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,419 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,420 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,421 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,422 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,423 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,424 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,425 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,426 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 414 processing
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,427 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,428 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,429 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG1 remoteWorkerThread_1
: connected to data
>provider 1 on 'dbname=mydb
>host=<node 1 ip> user=postgres'
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event 2,415
>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 415 processing
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_1
: syncing set 1 with 15
>table(s) from provider 1
>2005-06-03 11:51:12 BST DEBUG3 remoteWorkerThread_1
: activate helper 1
>2005-06-03 11:51:12 BST DEBUG4 remoteWorkerThread_1
: waiting for log data
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event 2,416
>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 416 processing
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: allocate lines
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: have 10 line buffers
>2005-06-03 11:51:12 BST DEBUG2 remoteHelperThread_1
_1: 0.005 seconds delay
>for first row
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: got 1 log rows
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: 1 log buffers
>delivered
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: no more log rows
>2005-06-03 11:51:12 BST DEBUG2 remoteHelperThread_1
_1: 0.005 seconds until
>close cursor
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: change helper thread
>status
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: send DONE/ERROR line
>to worker
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: waiting for workgroup
>to finish
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event 2,417
>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 417 processing
>2005-06-03 11:51:12 BST DEBUG1 slon: done
>2005-06-03 11:51:12 BST DEBUG2 slon: exit(0)
>
>Nothing unusual in data/serverlog
>
>Does anybody have any ideas?
>
>

You should have files in the spool directory for syncs 415 and 416;
evidently something "died" when working on #417.

Are 415/416 there? Do they have any data in them?

If they're empty, this begs the question: Is that right or wrong? I'd
expect slon to get that right...

You should be able to look at the table sl_event on node 2...

select * from _my_cluster.sl_event where ev_origin = 2 and ev_seqno
between 415 and 417;

Supposing 415 and 416 are 'empty' SYNCs, I'd expect ev_xip to be blank,
which indicates that there were no XIDs associated with those SYNCs.

I expect that there's *something* associated with SYNC #417.

The above sequence of logging is consistent with the notion that during
SYNC #417, one of the queries being constructed (generally using
slon_mkquery() and/or slon_appendquery) blew up due to some mismatch of
values being passed in versus what was expected. What happens, at that
point, is that the worker thread #2 "blows up," likely with something
akin to a segmentation fault, which is then hidden because the main
thread catches this and then terminates without giving terribly much notice.

What I'd expect to see, at this point, is that if you restart that slon,
you'll find that it similarly logs a whole bunch of SYNC events from
nodes 1 and 2, starts processing SYNC #417 again, and then falls down
much the same way.

Finding which call to slon_mkquery()/slon_appendquery() may be a bit of
a trial and error thing; I'd go thru function sync_event() and add some
logging for each of the calls to those functions in order to find which
one is blowing up. Essentially, you'd see which was the LAST one that
succeeded, allowing identification of which one failed.

You may want to pull the latest CVS edition; it is worth noting that
remote_worker.c version is at version 1.85, whereas Beta #5 uses 1.84.

There *ARE* changes significant to log shipping in v1.85; Neil Conway
did some cleanup of initialization of some of the log shipping
functions. I'd particularly expect that to be relevant if you're on a
platform other than Linux, such as FreeBSD, as Linux has often been a
bit easier-going about variable initialization than other platforms.
Chris Newland

2005-06-06, 11:24 am

Hi Christopher,

Answers inline.

-----Original Message-----
From: Christopher Browne [mailto:cbbrowne-swQf4SbcV9C7WVzo/KQ3Mw@public.gmane.org]=20
Sent: 06 June 2005 17:19
To: Chris Newland
Cc: slony1-general- AuKwsB3Fm+ugFIWk8tvy
RWD2FQJk+8+b@public.gmane.org
Subject: Re: [Slony1-general] Slony stops when using log shipping

Chris Newland wrote:

>Hi All,
>
>I have Slony 1.1.0 beta 5 working fine in a 3 node configuration with =

nodes

>2 and 3 subscribing directly to node 1.
> =20
>

What platform? I'm kind of thinking NOT Linux; see my commentary way
further down on that...

*** Debian Linux 3.0 (woody) fully patched, kernel 2.4.31

>When I try to enable log shipping on node 3 using the following slon

command
>the slon process quits unexpectedly.
>
>/usr/local/pgsql/bin/slon -a/usr/local/pgsql/data my_cluster
>"dbname=3Dmydbslave user=3Dpostgres host=3D<node 3 ip>"
>
>The slon command is run as the postgres user who has write permission =

on
>/usr/local/pgsql/data.
>
>Some log files are created but they all seem to be empty of any SQL
>statements:
>
>-- Slony-I log shipping archive
>-- Node 2, Event 416
>-- at... Fri Jun 3 11:51:12 2005
>
>start transaction;
>
>------------------------------------------------------------------
>-- End Of Archive Log
>------------------------------------------------------------------
>commit;
>
> =20
>

It is not odd for there to be SOME empty log files. Indeed, supposing
there are no updates going into the system, there will still
periodically be some SYNC events. The default for sync_interval_timeou
t
is 60000ms, which means that there will be a timeout, and hence a SYNC,
once per minute even if there is no activity going on at all.=20
http://linuxdatabases.info/info/slon.html

*** Sorry, they were all empty and identical to the one posted above =
apart
from the serial number and timestamps.=20

>I'm using PostgreSQL 8.0.3 built from source.
>
>Here is the output from slon:
> =20
>

Summarizing a bit...

The log indicates that the node collects up a whole ream of SYNCs from
nodes 1 and 2, and starts work on node #2's events...

I take it that node #3 is subscribing to node #2? The fact that the
work is on remoteWorkerThread_2
points in that direction...

*** I wanted node 1 (main db) to sync to nodes 2 and 3 (failover nodes) =
and
my understanding was that I needed a completely connected triangle so I
think that's what I've configured.

>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,594 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1
: queue event 1,595 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,414 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,415 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event =

2,414

>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,416 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,417 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,418 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,419 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,420 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,421 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,422 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,423 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,424 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,425 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,426 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 414 =

processing
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,427 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,428 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2
: queue event 2,429 =

SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need =

syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG1 remoteWorkerThread_1
: connected to data
>provider 1 on 'dbname=3Dmydb
>host=3D<node 1 ip> user=3Dpostgres'
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event =

2,415

>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 415 =

processing
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need =

syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_1
: syncing set 1 with =

15
>table(s) from provider 1
>2005-06-03 11:51:12 BST DEBUG3 remoteWorkerThread_1
: activate helper 1
>2005-06-03 11:51:12 BST DEBUG4 remoteWorkerThread_1
: waiting for log =

data
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event =

2,416

>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 416 =

processing
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: no sets need =

syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: allocate lines
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: have 10 line =

buffers
>2005-06-03 11:51:12 BST DEBUG2 remoteHelperThread_1
_1: 0.005 seconds =

delay

>for first row
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: got 1 log rows
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: 1 log buffers
>delivered
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: no more log rows
>2005-06-03 11:51:12 BST DEBUG2 remoteHelperThread_1
_1: 0.005 seconds =

until

>close cursor
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: change helper =

thread
>status
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1
_1: send DONE/ERROR =

line
>to worker
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1
_1: waiting for

workgroup
>to finish
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: Received event =

2,417

>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2
: SYNC 417 =

processing
>2005-06-03 11:51:12 BST DEBUG1 slon: done
>2005-06-03 11:51:12 BST DEBUG2 slon: exit(0)
>
>Nothing unusual in data/serverlog
>
>Does anybody have any ideas?
> =20
>

You should have files in the spool directory for syncs 415 and 416;
evidently something "died" when working on #417.

*** The only spool dir on my system is /var/spool which doesn't seem to =
have
any db/slony related files. Do you mean the directory specified in the =
-a
switch to slon?

I've been trying to track this down and have occasionally noticed file
writing errors related to the log shipping files:

2005-06-06 17:34:02 BST DEBUG2 remoteWorkerThread_2
: Received event =
2,28410
SYNC
2005-06-06 17:34:02 BST DEBUG3 calc sync size - last time: 1 last =
length: 11
ideal: 5454 proposed si
ze: 2
2005-06-06 17:34:02 BST DEBUG2 remoteWorkerThread_2
: SYNC 28410 =
processing
2005-06-06 17:34:02 BST DEBUG2 remoteWorkerThread_2
: no sets need =
syncing
for this event
2005-06-06 17:34:02 BST DEBUG2 remoteWorkerThread_1
: syncing set 1 with =
15
table(s) from provider 1
2005-06-06 17:34:02 BST ERROR remoteWorkerThread_1
: Cannot write to =
archive
file /usr/local/pgsql/d
ata/ slony1_log_2_0000000
0000000028410.sql.tmp - Bad file
descriptor2005-06-06 17:34:02 BST DEBUG1 sl
on: done
2005-06-06 17:34:02 BST DEBUG2 slon: exit(0)


Slon died just after the bad file descriptor error. It managed to write
other slony sql files to /usr/local/pgsql/data/ before that one failed.
Maybe a race condition? It's an SMP box.

I'll try the very latest from CVS and then carry on with your =
suggestions.

Many thanks,

Chris




Are 415/416 there? Do they have any data in them?

If they're empty, this begs the question: Is that right or wrong? I'd
expect slon to get that right...

You should be able to look at the table sl_event on node 2...

select * from _my_cluster.sl_event where ev_origin =3D 2 and ev_seqno
between 415 and 417;

Supposing 415 and 416 are 'empty' SYNCs, I'd expect ev_xip to be blank,
which indicates that there were no XIDs associated with those SYNCs.

I expect that there's *something* associated with SYNC #417.

The above sequence of logging is consistent with the notion that during
SYNC #417, one of the queries being constructed (generally using
slon_mkquery() and/or slon_appendquery) blew up due to some mismatch of
values being passed in versus what was expected. What happens, at that
point, is that the worker thread #2 "blows up," likely with something
akin to a segmentation fault, which is then hidden because the main
thread catches this and then terminates without giving terribly much =
notice.

What I'd expect to see, at this point, is that if you restart that slon,
you'll find that it similarly logs a whole bunch of SYNC events from
nodes 1 and 2, starts processing SYNC #417 again, and then falls down
much the same way.

Finding which call to slon_mkquery()/slon_appendquery() may be a bit of
a trial and error thing; I'd go thru function sync_event() and add some
logging for each of the calls to those functions in order to find which
one is blowing up. Essentially, you'd see which was the LAST one that
succeeded, allowing identification of which one failed.

You may want to pull the latest CVS edition; it is worth noting that
remote_worker.c version is at version 1.85, whereas Beta #5 uses 1.84.

There *ARE* changes significant to log shipping in v1.85; Neil Conway
did some cleanup of initialization of some of the log shipping
functions. I'd particularly expect that to be relevant if you're on a
platform other than Linux, such as FreeBSD, as Linux has often been a
bit easier-going about variable initialization than other platforms.
Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com