Home > Archive > Slony1 PostgreSQL Replication > April 2006 > Initial COPY works, but no new data since.









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Initial COPY works, but no new data since.
Gavin Hamill

2006-03-31, 7:38 am

Oh dear, I appear to be in a bit over my head, and I hope someone can
help :)

The situation is a simple 2-node setup, replicating from 194.24.250.137
-> 194.24.250.143 - configuration of 'set1' and submission via slonik
went smoothly, and the initial set COPY took 2 hours, which is about
right for 3G of data..

My problem is that I have not seen any new data on the slave since the
initial copy. The master (node 1) is has text logs in /var/log/slony as
specified with

$ /usr/lib/postgresql/8.1/bin/slon -a /var/log/slony replication
"host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXXXXX"

the most recent slon console output on the master is:

2006-03-31 11:06:47 BST CONFIG main: slon version 1.1.5 starting up
2006-03-31 11:06:47 BST CONFIG main: local node id = 1
2006-03-31 11:06:47 BST CONFIG main: launching sched_start_mainloop

2006-03-31 11:06:47 BST CONFIG main: loading current cluster configuration
2006-03-31 11:06:47 BST CONFIG storeNode: no_id=2 no_comment='Node 2 -
Jalapeno'
2006-03-31 11:06:47 BST CONFIG storePath: pa_server=2 pa_client=1
pa_conninfo="host=194.24.250.143 dbname=laterooms user=XXX port=5432
password=XXXXX" pa_connretry=10
2006-03-31 11:06:47 BST CONFIG storeListen: li_origin=2 li_receiver=1
li_provider=2
2006-03-31 11:06:47 BST CONFIG storeSet: set_id=1 set_origin=1
set_comment='All laterooms tables and sequences'
2006-03-31 11:06:47 BST CONFIG main: configuration complete - starting
threads
2006-03-31 11:06:47 BST DEBUG1 localListenThread: thread starts
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=17670
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=20424
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=20425
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=21028
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=21030
2006-03-31 11:06:47 BST CONFIG enableNode: no_id=2
2006-03-31 11:06:47 BST DEBUG1 main: running scheduler mainloop
2006-03-31 11:06:47 BST DEBUG1 remoteWorkerThread_2
: thread starts
2006-03-31 11:06:47 BST DEBUG1 cleanupThread: thread starts
2006-03-31 11:06:47 BST DEBUG1 remoteListenThread_2
: thread starts
2006-03-31 11:06:47 BST DEBUG1 syncThread: thread starts
2006-03-31 11:06:47 BST DEBUG1 remoteListenThread_2
: connected to
'host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXXXXX'

There is a slony1_log_2_0000000
0000000XXXX.sql in /var/log/slony on the
master every 10 seconds, as I'd expect, but each and every one of these
files is 288 bytes long with a boilerplate:

====================
=============
-- Slony-I log shipping archive
-- Node 2, Event 2435
start transaction;

------------------------------------------------------------------
-- End Of Archive Log
------------------------------------------------------------------
commit;
vacuum analyze "_replication".sl_setsync_offline;
====================
=============

Finally, there's been an 'idle in transaction' postgres process on it
for hours.

There is IP traffic between the two - PG packets containing stuff like

select ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid,
ev_xip, ev_type, ev_data1, ev_data2, ev_data3,
ev_data4, ev_data5, ev_data6, ev_data7, ev_data8 from
"_replication".sl_event e where (e.ev_origin = '1' and e.ev_seqno > '23359')


The slave is more interesting, however. First, the 'slon' process is
consuming 100% CPU time (on a single CPU, of course)... second there has
been an 'idle in transaction' postgres process on it for hours...

in the slave's 'slon log' dir, there's 2GB worth of logfiles...

many " slony1_log_1_0000000
00000000XXXXX.sql" containing incremental
updates to the DB, I can see INSERT/UPDATE/DELETEs in these files and
the queries are definately coming from the master db.

Worrying is the 1GB slony1_log_2_0000000
0000000000261.sql which is the
entirety of the initial dump - should this still be here?

As you can no doubt tell, I'm very green with this, and would just
appreciate some reassurance and pointing in the right direction! :(

Cheers,
Gavin.
Christopher Browne

2006-03-31, 11:31 am

Gavin Hamill wrote:
> Oh dear, I appear to be in a bit over my head, and I hope someone can
> help :)
>
> The situation is a simple 2-node setup, replicating from 194.24.250.137
> -> 194.24.250.143 - configuration of 'set1' and submission via slonik
> went smoothly, and the initial set COPY took 2 hours, which is about
> right for 3G of data..
>
> My problem is that I have not seen any new data on the slave since the
> initial copy. The master (node 1) is has text logs in /var/log/slony as
> specified with
>
> $ /usr/lib/postgresql/8.1/bin/slon -a /var/log/slony replication
> "host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXXXXX"
>
>

I see a first problem here...

It's useless to use the "-a" option against the "master" node. The
master node is not subscribing to any sets, so nothing will be recorded
in the "log shipping" logs there.
> the most recent slon console output on the master is:
>
> 2006-03-31 11:06:47 BST CONFIG main: slon version 1.1.5 starting up
> 2006-03-31 11:06:47 BST CONFIG main: local node id = 1
> 2006-03-31 11:06:47 BST CONFIG main: launching sched_start_mainloop

> 2006-03-31 11:06:47 BST CONFIG main: loading current cluster configuration
> 2006-03-31 11:06:47 BST CONFIG storeNode: no_id=2 no_comment='Node 2 -
> Jalapeno'
> 2006-03-31 11:06:47 BST CONFIG storePath: pa_server=2 pa_client=1
> pa_conninfo="host=194.24.250.143 dbname=laterooms user=XXX port=5432
> password=XXXXX" pa_connretry=10
> 2006-03-31 11:06:47 BST CONFIG storeListen: li_origin=2 li_receiver=1
> li_provider=2
> 2006-03-31 11:06:47 BST CONFIG storeSet: set_id=1 set_origin=1
> set_comment='All laterooms tables and sequences'
> 2006-03-31 11:06:47 BST CONFIG main: configuration complete - starting
> threads
> 2006-03-31 11:06:47 BST DEBUG1 localListenThread: thread starts
> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=17670
> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=20424
> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=20425
> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=21028
> NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=21030
> 2006-03-31 11:06:47 BST CONFIG enableNode: no_id=2
> 2006-03-31 11:06:47 BST DEBUG1 main: running scheduler mainloop
> 2006-03-31 11:06:47 BST DEBUG1 remoteWorkerThread_2
: thread starts
> 2006-03-31 11:06:47 BST DEBUG1 cleanupThread: thread starts
> 2006-03-31 11:06:47 BST DEBUG1 remoteListenThread_2
: thread starts
> 2006-03-31 11:06:47 BST DEBUG1 syncThread: thread starts
> 2006-03-31 11:06:47 BST DEBUG1 remoteListenThread_2
: connected to
> 'host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXXXXX'
>
> There is a slony1_log_2_0000000
0000000XXXX.sql in /var/log/slony on the
> master every 10 seconds, as I'd expect, but each and every one of these
> files is 288 bytes long with a boilerplate:
>
> ====================
=============
> -- Slony-I log shipping archive
> -- Node 2, Event 2435
> start transaction;
>
> ------------------------------------------------------------------
> -- End Of Archive Log
> ------------------------------------------------------------------
> commit;
> vacuum analyze "_replication".sl_setsync_offline;
> ====================
=============
>

That seems pretty normal. Node 1 is the master; it gets empty SYNCs
every so often from node 2.
> Finally, there's been an 'idle in transaction' postgres process on it
> for hours.
>
> There is IP traffic between the two - PG packets containing stuff like
>
> select ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid,
> ev_xip, ev_type, ev_data1, ev_data2, ev_data3,
> ev_data4, ev_data5, ev_data6, ev_data7, ev_data8 from
> "_replication".sl_event e where (e.ev_origin = '1' and e.ev_seqno > '23359')
>
>
> The slave is more interesting, however. First, the 'slon' process is
> consuming 100% CPU time (on a single CPU, of course)... second there has
> been an 'idle in transaction' postgres process on it for hours...
>
> in the slave's 'slon log' dir, there's 2GB worth of logfiles...
>
> many " slony1_log_1_0000000
00000000XXXXX.sql" containing incremental
> updates to the DB, I can see INSERT/UPDATE/DELETEs in these files and
> the queries are definately coming from the master db.
>
> Worrying is the 1GB slony1_log_2_0000000
0000000000261.sql which is the
> entirety of the initial dump - should this still be here?
>

Yeah, that all seems pretty normal.
> As you can no doubt tell, I'm very green with this, and would just
> appreciate some reassurance and pointing in the right direction! :(
>

The part I'm wondering about is thus:

I have not seen any new data on the slave since the initial copy.

It sounds as though you are seeing log shipping logs that contain new
data; are you not seeing the corresponding data loaded onto the
subscriber? That would be quite surprising...
Gavin Hamill

2006-04-01, 3:26 am

On Fri, 31 Mar 2006 17:26:07 +0000
Christopher Browne <cbbrowne-swQf4SbcV9C7WVzo/KQ3Mw@public.gmane.org> wrote:

> I see a first problem here...
>
> It's useless to use the "-a" option against the "master" node. The
> master node is not subscribing to any sets, so nothing will be recorded
> in the "log shipping" logs there.


OK fair enough. This evening I decided to uninstall both nodes, drop the _replication schema on both databases, and start again, this time much closer to the HOWTO...

dropdb + createdb on slave, then import the schema only. The following was imported into slonik on Node1 without failure:

cluster name = replication;
node 1 admin conninfo='host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX';
node 2 admin conninfo='host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX';

init cluster (id=1, comment = 'Cayenne');

# CREATE SET
try {
create set (id = 1, origin = 1, comment = 'Set 1 for replication');
} on error {
echo 'Could not create subscription set 1 for replication!';
exit -1;
}

# SET ADD TABLE
echo 'Subscription set 1 created';
echo 'Adding tables to the subscription set';
set add table (set id = 1, origin = 1, id = 1,
full qualified name = 'public.AccessRoles',
comment = 'Table public.AccessRoles with primary key');

.... and so on for 85 tables and 89 sequences...

# STORE NODE
store node (id = 2, event node = 1, comment = 'Node 2 - Jalapeno');
echo 'Set up replication nodes';

# STORE PATH
echo 'Next: configure paths for each node/origin';
store path (server = 2, client = 1, conninfo = 'host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX');
store path (server = 1, client = 2, conninfo = 'host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX');

# STORE LISTEN
store listen (origin = 1, receiver = 2, provider = 1);
store listen (origin = 2, receiver = 1, provider = 2);
echo 'Replication nodes prepared';
echo 'Please start a slon replication daemon for each node';

So, then ..
Node1$ slon replication "host=194.24.250.137 dbname=laterooms user=XXX password=XXX"
Node2$ slon replication "host=194.24.250.143 dbname=laterooms user=XXX password=XXX"

then watching the slon log on Node2....

2006-04-01 01:24:09 BST CONFIG main: slon version 1.1.5 starting up
2006-04-01 01:24:09 BST CONFIG main: local node id = 2
2006-04-01 01:24:09 BST CONFIG main: launching sched_start_mainloop

2006-04-01 01:24:09 BST CONFIG main: loading current cluster configuration
2006-04-01 01:24:09 BST CONFIG storeNode: no_id=1 no_comment='Cayenne'

2006-04-01 01:24:09 BST CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX" pa_connretry=10
2006-04-01 01:24:09 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 01:24:09 BST CONFIG storeSet: set_id=1 set_origin=1 set_comment='Set 1 for replication'
2006-04-01 01:24:09 BST WARN remoteWorker_wakeup:
node 1 - no worker thread
2006-04-01 01:24:09 BST CONFIG main: configuration complete - starting threads
2006-04-01 01:24:09 BST DEBUG1 localListenThread: thread starts
2006-04-01 01:24:09 BST CONFIG enableNode: no_id=1
2006-04-01 01:24:09 BST DEBUG1 remoteWorkerThread_1
: thread starts
2006-04-01 01:24:09 BST DEBUG1 remoteListenThread_1
: thread starts
2006-04-01 01:24:09 BST DEBUG1 cleanupThread: thread starts
2006-04-01 01:24:09 BST DEBUG1 main: running scheduler mainloop
2006-04-01 01:24:09 BST DEBUG1 syncThread: thread starts
2006-04-01 01:24:09 BST DEBUG1 remoteListenThread_1
: connected to 'host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX'
2006-04-01 01:24:09 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 01:24:09 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 01:24:09 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1

Now I submit the magic 'subscribe set ( id = 1, provider = 1, receiver = 2, forward = no);'

2006-04-01 01:25:45 BST CONFIG storeSubscribe: sub_set=1 sub_provider=1 sub_forward='f'
2006-04-01 01:25:45 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 01:25:45 BST DEBUG1 copy_set 1
2006-04-01 01:25:45 BST DEBUG1 remoteWorkerThread_1
: connected to provider DB
2006-04-01 01:34:31 BST DEBUG1 cleanupThread: 0.018 seconds for cleanupEvent()
2006-04-01 01:34:31 BST DEBUG1 cleanupThread: 0.006 seconds for delete logs
2006-04-01 01:44:58 BST DEBUG1 cleanupThread: 0.006 seconds for cleanupEvent()
2006-04-01 01:44:58 BST DEBUG1 cleanupThread: 0.004 seconds for delete logs
2006-04-01 01:55:33 BST DEBUG1 cleanupThread: 0.003 seconds for cleanupEvent()
2006-04-01 01:55:33 BST DEBUG1 cleanupThread: 0.003 seconds for delete logs
2006-04-01 02:06:26 BST DEBUG1 cleanupThread: 0.003 seconds for cleanupEvent()
2006-04-01 02:06:26 BST DEBUG1 cleanupThread: 0.005 seconds for delete logs
2006-04-01 02:16:49 BST DEBUG1 cleanupThread: 0.005 seconds for cleanupEvent()
2006-04-01 02:16:49 BST DEBUG1 cleanupThread: 0.004 seconds for delete logs
2006-04-01 02:27:28 BST DEBUG1 cleanupThread: 0.003 seconds for cleanupEvent()
2006-04-01 02:27:28 BST DEBUG1 cleanupThread: 0.004 seconds for delete logs
2006-04-01 02:39:00 BST DEBUG1 cleanupThread: 0.006 seconds for cleanupEvent()
2006-04-01 02:39:00 BST DEBUG1 cleanupThread: 0.007 seconds for delete logs
2006-04-01 02:49:33 BST DEBUG1 cleanupThread: 0.003 seconds for cleanupEvent()
2006-04-01 02:49:33 BST DEBUG1 cleanupThread: 0.005 seconds for delete logs
2006-04-01 02:59:54 BST DEBUG1 cleanupThread: 0.002 seconds for cleanupEvent()
2006-04-01 02:59:54 BST DEBUG1 cleanupThread: 0.005 seconds for delete logs
2006-04-01 03:10:14 BST DEBUG1 cleanupThread: 0.003 seconds for cleanupEvent()
2006-04-01 03:10:14 BST DEBUG1 cleanupThread: 0.007 seconds for delete logs
2006-04-01 03:13:46 BST DEBUG1 remoteWorkerThread_1
: disconnected from provider DB
2006-04-01 03:13:46 BST DEBUG1 copy_set 1 done in 6480.502 seconds
2006-04-01 03:13:46 BST CONFIG enableSubscription: sub_set=1
2006-04-01 03:13:58 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 03:13:58 BST DEBUG1 remoteWorkerThread_1
: helper thread for provider 1 created
2006-04-01 03:13:58 BST DEBUG1 remoteWorkerThread_1
: connected to data provider 1 on 'host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXXXXXXXX'

At this point I would be expecting Node2 to start catching up with whatever Node1's been up to during the 2 hours of the copy_set.. but it does not. If I run this:

$ tcpdump -i bond0 -n host 194.24.250.143 and port 5432 -s 1500 -w out

Then do a trivial UPDATE of one row on a table on Node1, and let tcpdump run for 30 seconds, the only queries I see sent across are "select ev_origin, ev_seqno, ev_timestamp..." - I'm not seeing any real data.

The thing that worries me most is that 'slon' on Node2 is using 100% CPU time and apparently doing nothing at all. There is no CPU use for any postgres process, nor any disk activity (confirmed with iostat).

Related to this /must/ be the fact that there's a postgres process on /both/ master+slave marked as 'idle in transaction'. I've asked our devels if they're using any transactions in the codebase and they've told me "We don't, but NHibernate likely does"
if that makes any difference.

How can I find out a bit more about what transaction the system is stuck in, since it may be causing some kind of busy-wait loop in slon hence the 100% CPU use?

Node1 is an Quad Xeon machine, and Node2 is an AIX 5.3 box (yep, PG compiled --with-thread-safety - slony complained loudly when I didn't have this flag set, so I recompiled PG.) and PG 8.1.3 + Slony 1.1.5 are in use on both machines.

I've followed all the instructions and I understand what each step does - I just can't work out what the problem is :(

Finally, just after copy_set finished, this was noticed in the postgres log:

2006-04-01 03:13:57 BST LOG: duration: 11414.154 ms statement: select "_replication". enableSubscription(1
, 1, 2); notify "_replication_Event"; notify " _replication_Confirm
"; insert into "_replication".sl_event (ev_origin, ev_seqno, ev_timestamp,
ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1, ev_data2, ev_data3, ev_data4 ) values ('1', '249', '2006-04-01 01:25:45.586148', '196845367', '196845368', '', 'ENABLE_SUBSCRIPTION
', '1', '1', '2', 'f'); insert into "_replication".sl_confirm
(con_origin, con_received, con_seqno, con_timestamp) values (1, 2, '249', now()); commit transaction;

Yours in a mix of hope + desperation,
Gavin.
Gavin Hamill

2006-04-01, 3:26 am

On Sat, 1 Apr 2006 03:50:29 +0100
Gavin Hamill <gdh- jB9c8NvlSj2akBO8gow8
eQ@public.gmane.org> wrote:

Just a little addendum..
[color=darkred]
> 2006-04-01 03:13:58 BST DEBUG1 remoteWorkerThread_1
: helper thread for provider 1 created
> 2006-04-01 03:13:58 BST DEBUG1 remoteWorkerThread_1
: connected to data provider 1 on 'host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXXXXXXXX'[
/color]

Doing 'select * from pg_stat_activity' shows the time at which the 'idle in transaction' process began...

<IDLE> in transaction | 2006-04-01 03:14:02.538096+01

i.e. just moments after the copy_set is done and the first real sync will have been brought down from Node1, no doubt..

The plot thickens, but I'm still at a loss as to why this is happening :(

Cheers,
Gavin.
Gavin Hamill

2006-04-03, 3:31 am

Gavin Hamill wrote:

This has now been worked around - I'm running both slon daemons on the
master rather than one on master + one on slave. Perhaps not the most
elegant solution, but better than no solution at all.

Must be some unfortunate AIX 5.3 interaction that I simply don't have
the resources to troubleshoot :/

Cheers,
Gavin.
Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com