|
Home > Archive > Slony1 PostgreSQL Replication > January 2006 > sl_log_1 filling
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
|
|
| Robert Littlejohn 2006-01-19, 8:24 pm |
| Hello,
I have a simple master to one slave cluster setup. My setup is slony1-1.1.0
with postgresql-7.3.9-2 on a Red Hat ES v3 server (both master and slave and
the same).
This worked great for several months even with frequent network outages.
Then in Nov the router at the slave site was changed and the replication
stopped but nobody noticed. Now several months later I have over 5 million
entries in sl_log_1 and over 400, 000 in sl_event. I see no errors in the
logs, I've vacuumed all tables including the sl_* and pg_* tables - no
change. I've restarted both servers - I've viewed the network connections
and all are good. The slon processes run fine and the postres's are talking
to each other but still no replication.
small example from slave log:
2006-01-19 16:50:02 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953791
2006-01-19 16:50:02 AST DEBUG2 localListenThread: Received event 2,953791
SYNC
2006-01-19 16:50:12 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953792
2006-01-19 16:50:12 AST DEBUG2 localListenThread: Received event 2,953792
SYNC
2006-01-19 16:50:22 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953793
and from the master:
2006-01-19 16:50:02 AST DEBUG2 remoteListenThread_2
: queue event 2,953790
SYNC
2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: Received event 2,953790
SYNC
2006-01-19 16:50:02 AST DEBUG3 calc sync size - last time: 1 last length:
9881 ideal: 6 proposed size: 2
2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: SYNC 953790 processing
2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
for this event
2006-01-19 16:50:02 AST DEBUG2 syncThread: new sl_action_seq 7342475 - SYNC
1037985
2006-01-19 16:50:02 AST DEBUG2 localListenThread: Received event 1,1037985
SYNC
2006-01-19 16:50:06 AST DEBUG2 syncThread: new sl_action_seq 7342477 - SYNC
1037986
2006-01-19 16:50:07 AST DEBUG2 localListenThread: Received event 1,1037986
SYNC
2006-01-19 16:50:10 AST DEBUG2 syncThread: new sl_action_seq 7342479 - SYNC
1037987
2006-01-19 16:50:11 AST DEBUG2 localListenThread: Received event 1,1037987
SYNC
2006-01-19 16:50:12 AST DEBUG2 remoteListenThread_2
: queue event 2,953791
SYNC
2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: Received event 2,953791
SYNC
2006-01-19 16:50:12 AST DEBUG3 calc sync size - last time: 1 last length:
10001 ideal: 5 proposed size: 2
2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: SYNC 953791 processing
2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
for this event
2006-01-19 16:50:12 AST DEBUG2 syncThread: new sl_action_seq 7342482 - SYNC
1037988
2006-01-19 16:50:13 AST DEBUG2 localListenThread: Received event 1,1037988
SYNC
2006-01-19 16:50:22 AST DEBUG2 remoteListenThread_2
: queue event 2,953792
SYNC
2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: Received event 2,953792
SYNC
2006-01-19 16:50:22 AST DEBUG3 calc sync size - last time: 1 last length:
10071 ideal: 5 proposed size: 2
2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: SYNC 953792 processing
2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
for this event
Any help anybody can give me will be appreciated. BTW if this is the wrong
mailling list could someone point me at the correct location for slony1
problems.
Thanks
| |
| Christopher Browne 2006-01-19, 8:24 pm |
| Robert Littlejohn wrote:
>Hello,
>I have a simple master to one slave cluster setup. My setup is slony1-1.1.0
>with postgresql-7.3.9-2 on a Red Hat ES v3 server (both master and slave and
>the same).
>
>This worked great for several months even with frequent network outages.
>Then in Nov the router at the slave site was changed and the replication
>stopped but nobody noticed. Now several months later I have over 5 million
>entries in sl_log_1 and over 400, 000 in sl_event. I see no errors in the
>logs, I've vacuumed all tables including the sl_* and pg_* tables - no
>change. I've restarted both servers - I've viewed the network connections
>and all are good. The slon processes run fine and the postres's are talking
>to each other but still no replication.
>
>small example from slave log:
>2006-01-19 16:50:02 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953791
>2006-01-19 16:50:02 AST DEBUG2 localListenThread: Received event 2,953791
>SYNC
>2006-01-19 16:50:12 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953792
>2006-01-19 16:50:12 AST DEBUG2 localListenThread: Received event 2,953792
>SYNC
>2006-01-19 16:50:22 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC 953793
>
>and from the master:
>2006-01-19 16:50:02 AST DEBUG2 remoteListenThread_2
: queue event 2,953790
>SYNC
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: Received event 2,953790
>SYNC
>2006-01-19 16:50:02 AST DEBUG3 calc sync size - last time: 1 last length:
>9881 ideal: 6 proposed size: 2
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: SYNC 953790 processing
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>2006-01-19 16:50:02 AST DEBUG2 syncThread: new sl_action_seq 7342475 - SYNC
>1037985
>2006-01-19 16:50:02 AST DEBUG2 localListenThread: Received event 1,1037985
>SYNC
>2006-01-19 16:50:06 AST DEBUG2 syncThread: new sl_action_seq 7342477 - SYNC
>1037986
>2006-01-19 16:50:07 AST DEBUG2 localListenThread: Received event 1,1037986
>SYNC
>2006-01-19 16:50:10 AST DEBUG2 syncThread: new sl_action_seq 7342479 - SYNC
>1037987
>2006-01-19 16:50:11 AST DEBUG2 localListenThread: Received event 1,1037987
>SYNC
>2006-01-19 16:50:12 AST DEBUG2 remoteListenThread_2
: queue event 2,953791
>SYNC
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: Received event 2,953791
>SYNC
>2006-01-19 16:50:12 AST DEBUG3 calc sync size - last time: 1 last length:
>10001 ideal: 5 proposed size: 2
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: SYNC 953791 processing
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>2006-01-19 16:50:12 AST DEBUG2 syncThread: new sl_action_seq 7342482 - SYNC
>1037988
>2006-01-19 16:50:13 AST DEBUG2 localListenThread: Received event 1,1037988
>SYNC
>2006-01-19 16:50:22 AST DEBUG2 remoteListenThread_2
: queue event 2,953792
>SYNC
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: Received event 2,953792
>SYNC
>2006-01-19 16:50:22 AST DEBUG3 calc sync size - last time: 1 last length:
>10071 ideal: 5 proposed size: 2
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: SYNC 953792 processing
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>
>Any help anybody can give me will be appreciated. BTW if this is the wrong
>mailling list could someone point me at the correct location for slony1
>problems.
>
>
This is by all means the right list.
Take a look at the FAQ... http://linuxfinances.info/info/faq.html
This is consistent with the cleanup thread not cleaning things out.
It's evident that you're running 1.1 (from the ideal/proposed SYNC group
sizes in the logs)...
There used to be common problems where dropping nodes would lead to
confirmations not getting, well, confirmed, and that would prevent
things from clearing out. That has been fixed for a long time;
nonetheless, problems with event propagation can lead to the cleanup
thread deciding that it can't yet clean out data.
Try the following query (change the namespace appropriately!):
select * from _oxrsbar.sl_confirm where con_origin not in (select no_id
from _oxrsbar.sl_node) or con_received not in (select no_id from
_oxrsbar.sl_node);
You might also try running (find this in the "tools" directory) either
test_slony_state-dbi.pl or test_slony_state.pl. These point to one of
the nodes, and then rummage through the configuration looking for a
number of possible problem conditions.
If you're not keen on Perl, you can look in test_slony_state*.pl to see
what queries they are issuing to look at the state of things. That may
help you to tell where the problem lies.
You might also take a peek at those log files and look for the cleanup
thread to see what kind of work it's doing. Chances are that there's
some long outstanding missing event confirmation or such; as soon as it
gets dealt with, sl_log_1 will get cleaned out (over the next 10
minutes; that thread wakes up only periodically...)
| |
| Robert Littlejohn 2006-01-24, 8:25 pm |
| Great, thanks for the info. I've been meaning to get back to this but I've
been out of the office a bit.
I had already been through the faq (that's why I did the vacuums) but so far
I still can't find anything. The query you posted returned no results. I'm
looking at the test_slony_state-dbi.pl now but so far it only tells me
sl_seqlog, sl_log_1, sl_seqlog all exceed 200000. Some of the tests fail
with perl errors so I'll try to get those test run.
I've also looked quite a bit at the logs and have found nothing. The
cleanupThread is starting every 5 - 15 minutes and reports things like:
2006-01-06 12:25:47 AST DEBUG1 cleanupThread: 5.849 seconds for
cleanupEvent()
2006-01-06 12:26:20 AST DEBUG3 cleanupThread: minxid: 199383042
2006-01-06 12:26:20 AST DEBUG4 cleanupThread: xid 199383042 still active -
analyze instead
2006-01-06 12:39:49 AST DEBUG1 cleanupThread: 3.261 seconds for
cleanupEvent()
2006-01-06 12:40:08 AST DEBUG1 cleanupThread: 18.638 seconds for delete
logs
except for the xid 199383042 still active - analyze instead nothing really
jumps out at me.
I'll keep looking as time permits.
Thanks again,
Robert
-----Original Message-----
From: Christopher Browne [mailto:cbbrowne-swQf4SbcV9C7WVzo/KQ3Mw@public.gmane.org]
Sent: Thursday, January 19, 2006 5:31 PM
To: Robert Littlejohn
Cc: 'slony1-general- AuKwsB3Fm+ugFIWk8tvy
RWD2FQJk+8+b@public.gmane.org'
Subject: Re: [Slony1-general] sl_log_1 filling
Robert Littlejohn wrote:
>Hello,
>I have a simple master to one slave cluster setup. My setup is
slony1-1.1.0
>with postgresql-7.3.9-2 on a Red Hat ES v3 server (both master and slave
and
>the same).
>
>This worked great for several months even with frequent network outages.
>Then in Nov the router at the slave site was changed and the replication
>stopped but nobody noticed. Now several months later I have over 5 million
>entries in sl_log_1 and over 400, 000 in sl_event. I see no errors in the
>logs, I've vacuumed all tables including the sl_* and pg_* tables - no
>change. I've restarted both servers - I've viewed the network connections
>and all are good. The slon processes run fine and the postres's are
talking
>to each other but still no replication.
>
>small example from slave log:
>2006-01-19 16:50:02 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC
953791
>2006-01-19 16:50:02 AST DEBUG2 localListenThread: Received event 2,953791
>SYNC
>2006-01-19 16:50:12 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC
953792
>2006-01-19 16:50:12 AST DEBUG2 localListenThread: Received event 2,953792
>SYNC
>2006-01-19 16:50:22 AST DEBUG2 syncThread: new sl_action_seq 1 - SYNC
953793
>
>and from the master:
>2006-01-19 16:50:02 AST DEBUG2 remoteListenThread_2
: queue event 2,953790
>SYNC
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: Received event
2,953790
>SYNC
>2006-01-19 16:50:02 AST DEBUG3 calc sync size - last time: 1 last length:
>9881 ideal: 6 proposed size: 2
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: SYNC 953790 processing
>2006-01-19 16:50:02 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>2006-01-19 16:50:02 AST DEBUG2 syncThread: new sl_action_seq 7342475 - SYNC
>1037985
>2006-01-19 16:50:02 AST DEBUG2 localListenThread: Received event 1,1037985
>SYNC
>2006-01-19 16:50:06 AST DEBUG2 syncThread: new sl_action_seq 7342477 - SYNC
>1037986
>2006-01-19 16:50:07 AST DEBUG2 localListenThread: Received event 1,1037986
>SYNC
>2006-01-19 16:50:10 AST DEBUG2 syncThread: new sl_action_seq 7342479 - SYNC
>1037987
>2006-01-19 16:50:11 AST DEBUG2 localListenThread: Received event 1,1037987
>SYNC
>2006-01-19 16:50:12 AST DEBUG2 remoteListenThread_2
: queue event 2,953791
>SYNC
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: Received event
2,953791
>SYNC
>2006-01-19 16:50:12 AST DEBUG3 calc sync size - last time: 1 last length:
>10001 ideal: 5 proposed size: 2
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: SYNC 953791 processing
>2006-01-19 16:50:12 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>2006-01-19 16:50:12 AST DEBUG2 syncThread: new sl_action_seq 7342482 - SYNC
>1037988
>2006-01-19 16:50:13 AST DEBUG2 localListenThread: Received event 1,1037988
>SYNC
>2006-01-19 16:50:22 AST DEBUG2 remoteListenThread_2
: queue event 2,953792
>SYNC
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: Received event
2,953792
>SYNC
>2006-01-19 16:50:22 AST DEBUG3 calc sync size - last time: 1 last length:
>10071 ideal: 5 proposed size: 2
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: SYNC 953792 processing
>2006-01-19 16:50:22 AST DEBUG2 remoteWorkerThread_2
: no sets need syncing
>for this event
>
>Any help anybody can give me will be appreciated. BTW if this is the wrong
>mailling list could someone point me at the correct location for slony1
>problems.
>
>
This is by all means the right list.
Take a look at the FAQ... http://linuxfinances.info/info/faq.html
This is consistent with the cleanup thread not cleaning things out.
It's evident that you're running 1.1 (from the ideal/proposed SYNC group
sizes in the logs)...
There used to be common problems where dropping nodes would lead to
confirmations not getting, well, confirmed, and that would prevent
things from clearing out. That has been fixed for a long time;
nonetheless, problems with event propagation can lead to the cleanup
thread deciding that it can't yet clean out data.
Try the following query (change the namespace appropriately!):
select * from _oxrsbar.sl_confirm where con_origin not in (select no_id
from _oxrsbar.sl_node) or con_received not in (select no_id from
_oxrsbar.sl_node);
You might also try running (find this in the "tools" directory) either
test_slony_state-dbi.pl or test_slony_state.pl. These point to one of
the nodes, and then rummage through the configuration looking for a
number of possible problem conditions.
If you're not keen on Perl, you can look in test_slony_state*.pl to see
what queries they are issuing to look at the state of things. That may
help you to tell where the problem lies.
You might also take a peek at those log files and look for the cleanup
thread to see what kind of work it's doing. Chances are that there's
some long outstanding missing event confirmation or such; as soon as it
gets dealt with, sl_log_1 will get cleaned out (over the next 10
minutes; that thread wakes up only periodically...)
| |
| Christopher Browne 2006-01-24, 8:25 pm |
| Robert Littlejohn <Robert.Littlejohn- I6otxJDipUGw7v2zf2it
dNJR4SXAr38/@public.gmane.org> writes:
> Great, thanks for the info. I've been meaning to get back to this but I've
> been out of the office a bit.
>
> I had already been through the faq (that's why I did the vacuums) but so far
> I still can't find anything. The query you posted returned no results. I'm
> looking at the test_slony_state-dbi.pl now but so far it only tells me
> sl_seqlog, sl_log_1, sl_seqlog all exceed 200000. Some of the tests fail
> with perl errors so I'll try to get those test run.
>
> I've also looked quite a bit at the logs and have found nothing. The
> cleanupThread is starting every 5 - 15 minutes and reports things like:
> 2006-01-06 12:25:47 AST DEBUG1 cleanupThread: 5.849 seconds for
> cleanupEvent()
> 2006-01-06 12:26:20 AST DEBUG3 cleanupThread: minxid: 199383042
> 2006-01-06 12:26:20 AST DEBUG4 cleanupThread: xid 199383042 still active -
> analyze instead
> 2006-01-06 12:39:49 AST DEBUG1 cleanupThread: 3.261 seconds for
> cleanupEvent()
> 2006-01-06 12:40:08 AST DEBUG1 cleanupThread: 18.638 seconds for delete
> logs
>
> except for the xid 199383042 still active - analyze instead nothing really
> jumps out at me.
Well, the "analyze instead" part ought to be a reasonably useful
optimization.
Essentially, if a transaction is running now that was running the last
time the cleanup thread was running, then it's futile to try to VACUUM
the tables, as no data will get cleaned out; that
fairly-old-transaction will hold onto the data.
I'm not sure that you necessarily have any problem going on right now.
If you have some long-running transactions (and you certainly do), and
see hundreds/thousands of database updates per minute, it would be
pretty easy for the size of sl_log_1 and sl_seqlog to grow to ~200K.
Consider: sl_log_1 contains a row for each tuple that is updated.
If you do a transaction per second, each of which involves 10 table
updates, that would add, to this table...
60 x 10 = 600 rows per minute
If you had *no* long running transactions, then you'd expect the
cleanup thread, after 10 minutes, to find, and leave alone, 600 x 10 =
6000 rows that are relevant to the last 10 minutes of activity.
If you have some transaction that's running for an hour, then that
leads to growth to 36000 rows.
If you're doing about 5 transactions per second, rather than 1, that
easily gets you to 200K rows in sl_log_1.
sl_seqlog gets, for each sync, a row for each sequence that you
replicate. If you have 50 sequences, that can grow pretty big pretty
easily...
I'm sort of doing some "back of the napkin" estimates here just to
suggest how you might check to see if the numbers are reasonable or
not...
If you ever see replication fall behind, if a WAN connection slows up,
it would be fully natural to see sl_log_1 grow to:
period of time in seconds, plus 10 minutes
times
expected transactions per second
times
expected tuples updated per transaction
Does that help?
--
"cbbrowne","@","ca.afilias.info"
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
| |
| Robert Littlejohn 2006-01-26, 5:00 pm |
| I'm pretty sure I do have a problem here, even though I don't see it in the
logs or slony1 tables.
It looks like the DB is getting around 20 transactions per minute (approx
30,000 inserts per day) and we have a cleanup routine that runs on weekends
only. Both master and slave servers have been rebooted at least once but
the tables on the slave only have data up to Nov 28, 2005. It looks like NO
replication is occurring at all but I still do not see any errors and the
two servers are "talking" to each other. Also last week my sl_log_1 was at
around 5.5 million rows, this week it is 6 million and climbing.
I've created another cluster on a different DB with the same two servers.
These are live tables being replicated but with very low volume. So far
this new cluster is working great with transactions replicated over within
seconds of all updates/inserts. The original cluster is still stuck on Nov
28.
I will try to put aside some time this week to look into the tools dir and
see what tests are not being run properly. If I don't find an answer I
think rebuilding the slave will have to be done.
Thanks,
Robert
-----Original Message-----
From: Christopher Browne [mailto:cbbrowne-swQf4SbcV9C7WVzo/KQ3Mw@public.gmane.org]
Sent: Tuesday, January 24, 2006 4:44 PM
To: Robert Littlejohn
Cc: 'slony1-general- AuKwsB3Fm+ugFIWk8tvy
RWD2FQJk+8+b@public.gmane.org'
Subject: Re: [Slony1-general] sl_log_1 filling
Robert Littlejohn <Robert.Littlejohn- I6otxJDipUGw7v2zf2it
dNJR4SXAr38/@public.gmane.org> writes:
> Great, thanks for the info. I've been meaning to get back to this but I've
> been out of the office a bit.
>
> I had already been through the faq (that's why I did the vacuums) but so
far
> I still can't find anything. The query you posted returned no results.
I'm
> looking at the test_slony_state-dbi.pl now but so far it only tells me
> sl_seqlog, sl_log_1, sl_seqlog all exceed 200000. Some of the tests fail
> with perl errors so I'll try to get those test run.
>
> I've also looked quite a bit at the logs and have found nothing. The
> cleanupThread is starting every 5 - 15 minutes and reports things like:
> 2006-01-06 12:25:47 AST DEBUG1 cleanupThread: 5.849 seconds for
> cleanupEvent()
> 2006-01-06 12:26:20 AST DEBUG3 cleanupThread: minxid: 199383042
> 2006-01-06 12:26:20 AST DEBUG4 cleanupThread: xid 199383042 still active -
> analyze instead
> 2006-01-06 12:39:49 AST DEBUG1 cleanupThread: 3.261 seconds for
> cleanupEvent()
> 2006-01-06 12:40:08 AST DEBUG1 cleanupThread: 18.638 seconds for delete
> logs
>
> except for the xid 199383042 still active - analyze instead nothing really
> jumps out at me.
Well, the "analyze instead" part ought to be a reasonably useful
optimization.
Essentially, if a transaction is running now that was running the last
time the cleanup thread was running, then it's futile to try to VACUUM
the tables, as no data will get cleaned out; that
fairly-old-transaction will hold onto the data.
I'm not sure that you necessarily have any problem going on right now.
If you have some long-running transactions (and you certainly do), and
see hundreds/thousands of database updates per minute, it would be
pretty easy for the size of sl_log_1 and sl_seqlog to grow to ~200K.
Consider: sl_log_1 contains a row for each tuple that is updated.
If you do a transaction per second, each of which involves 10 table
updates, that would add, to this table...
60 x 10 = 600 rows per minute
If you had *no* long running transactions, then you'd expect the
cleanup thread, after 10 minutes, to find, and leave alone, 600 x 10 =
6000 rows that are relevant to the last 10 minutes of activity.
If you have some transaction that's running for an hour, then that
leads to growth to 36000 rows.
If you're doing about 5 transactions per second, rather than 1, that
easily gets you to 200K rows in sl_log_1.
sl_seqlog gets, for each sync, a row for each sequence that you
replicate. If you have 50 sequences, that can grow pretty big pretty
easily...
I'm sort of doing some "back of the napkin" estimates here just to
suggest how you might check to see if the numbers are reasonable or
not...
If you ever see replication fall behind, if a WAN connection slows up,
it would be fully natural to see sl_log_1 grow to:
period of time in seconds, plus 10 minutes
times
expected transactions per second
times
expected tuples updated per transaction
Does that help?
--
"cbbrowne","@","ca.afilias.info"
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
| |
| Robert Littlejohn 2006-01-26, 5:00 pm |
| Forgot to mention I did find an error in the logs. This morning I found the
slon for the original cluster was down and this error in the logs:
2006-01-24 21:39:59 AST DEBUG2 localListenThread: Received event 1,1042636
SYNC
2006-01-24 21:40:00 AST DEBUG2 syncThread: new sl_action_seq 7947330 - SYNC
1042637
2006-01-24 21:40:02 AST DEBUG2 syncThread: new sl_action_seq 7947334 - SYNC
1042638
2006-01-24 21:40:03 AST DEBUG2 localListenThread: Received event 1,1042637
SYNC
2006-01-24 21:40:03 AST FATAL syncThread: "commit transaction;" - ERROR:
simple_heap_update: tuple concurrently updated
2006-01-24 21:40:03 AST DEBUG1 slon: shutdown requested
2006-01-24 21:40:04 AST DEBUG2 slon: notify worker process to shutdown
2006-01-24 21:40:04 AST DEBUG2 slon: wait for worker process to shutdown
.....
I restarted just fine but my data is still on Nov 28
-----Original Message-----
From: Christopher Browne [mailto:cbbrowne-swQf4SbcV9C7WVzo/KQ3Mw@public.gmane.org]
Sent: Tuesday, January 24, 2006 4:44 PM
To: Robert Littlejohn
Cc: 'slony1-general- AuKwsB3Fm+ugFIWk8tvy
RWD2FQJk+8+b@public.gmane.org'
Subject: Re: [Slony1-general] sl_log_1 filling
Robert Littlejohn <Robert.Littlejohn- I6otxJDipUGw7v2zf2it
dNJR4SXAr38/@public.gmane.org> writes:
> Great, thanks for the info. I've been meaning to get back to this but I've
> been out of the office a bit.
>
> I had already been through the faq (that's why I did the vacuums) but so
far
> I still can't find anything. The query you posted returned no results.
I'm
> looking at the test_slony_state-dbi.pl now but so far it only tells me
> sl_seqlog, sl_log_1, sl_seqlog all exceed 200000. Some of the tests fail
> with perl errors so I'll try to get those test run.
>
> I've also looked quite a bit at the logs and have found nothing. The
> cleanupThread is starting every 5 - 15 minutes and reports things like:
> 2006-01-06 12:25:47 AST DEBUG1 cleanupThread: 5.849 seconds for
> cleanupEvent()
> 2006-01-06 12:26:20 AST DEBUG3 cleanupThread: minxid: 199383042
> 2006-01-06 12:26:20 AST DEBUG4 cleanupThread: xid 199383042 still active -
> analyze instead
> 2006-01-06 12:39:49 AST DEBUG1 cleanupThread: 3.261 seconds for
> cleanupEvent()
> 2006-01-06 12:40:08 AST DEBUG1 cleanupThread: 18.638 seconds for delete
> logs
>
> except for the xid 199383042 still active - analyze instead nothing really
> jumps out at me.
Well, the "analyze instead" part ought to be a reasonably useful
optimization.
Essentially, if a transaction is running now that was running the last
time the cleanup thread was running, then it's futile to try to VACUUM
the tables, as no data will get cleaned out; that
fairly-old-transaction will hold onto the data.
I'm not sure that you necessarily have any problem going on right now.
If you have some long-running transactions (and you certainly do), and
see hundreds/thousands of database updates per minute, it would be
pretty easy for the size of sl_log_1 and sl_seqlog to grow to ~200K.
Consider: sl_log_1 contains a row for each tuple that is updated.
If you do a transaction per second, each of which involves 10 table
updates, that would add, to this table...
60 x 10 = 600 rows per minute
If you had *no* long running transactions, then you'd expect the
cleanup thread, after 10 minutes, to find, and leave alone, 600 x 10 =
6000 rows that are relevant to the last 10 minutes of activity.
If you have some transaction that's running for an hour, then that
leads to growth to 36000 rows.
If you're doing about 5 transactions per second, rather than 1, that
easily gets you to 200K rows in sl_log_1.
sl_seqlog gets, for each sync, a row for each sequence that you
replicate. If you have 50 sequences, that can grow pretty big pretty
easily...
I'm sort of doing some "back of the napkin" estimates here just to
suggest how you might check to see if the numbers are reasonable or
not...
If you ever see replication fall behind, if a WAN connection slows up,
it would be fully natural to see sl_log_1 grow to:
period of time in seconds, plus 10 minutes
times
expected transactions per second
times
expected tuples updated per transaction
Does that help?
--
"cbbrowne","@","ca.afilias.info"
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
| |
| Andrew Sullivan 2006-01-26, 5:00 pm |
| On Wed, Jan 25, 2006 at 10:13:28AM -0400, Robert Littlejohn wrote:
> 2006-01-24 21:40:04 AST DEBUG2 slon: notify worker process to shutdown
> 2006-01-24 21:40:04 AST DEBUG2 slon: wait for worker process to shutdown
> ....
>
> I restarted just fine but my data is still on Nov 28
By "restarted" do you mean you restarted just when you found this?
Because I don't think you're ever going to catch up from 28 Nov. I'd
drop the subscriber and rebuild it.
A
--
Andrew Sullivan | ajs-oaT0K0jot5/q2IAV+ODieA@public.gmane.org
You may save him from referring to variables in another package, but
you can't save him from writing a badly designed program to solve the
wrong problem, and taking forever to do it. --Paul Graham
|
|
|
|
|