| Christopher Browne 2005-08-17, 8:25 pm |
| Marc Munro wrote:
>Brad, thank you so much for your response. I have more information but
>still no solution.
>
>In response to Brad's suggestions:
>
>1) pg_stat_activity
>Interesting. There is an idle in transaction process from the provider.
>Shutting down and restarting the database resulted in a new session with
>the same status. This appears to be something to do with slony.
>
>
That may be fine, as long as the idle-in-transaction process isn't old.
>2) vacuum subscriber's pg_listener
>No bloat to report. Only a handful of rows existed.
>
>
>
The question is how many rows it *couldn't* purge; how many dead tuples
were kicking around. e.g. - the output of VACUUM ANALYZE on the table...
>3) vacuum subscribers sl_log_1
>Nothing - there are no rows in the subscriber.
>
>4) Vacuuming subscriber
>Already done.
>
>Results of further investigation:
>I think the problem may simply be that the load on the provider during
>my initial sync was very high. This has resulted in sl_log_1 becoming
>huge. In my slony logs on the subscriber I often see lines like:
>
>2005-08-17 09:08:47 PDT DEBUG2 remoteHelperThread_1
_1: 83.952 seconds
>delay for first row
>
>Vacuuming the sl_log_1 table on the provider shows it to be pretty big:
>
>INFO: --Relation _cage2.sl_log_1--
>INFO: Pages 714558: Changed 1084, Empty 0; Tup 35986240: Vac 0, Keep 0,
>UnUsed 7172.
> Total CPU 52.50s/9.31u sec elapsed 177.23 sec.
>INFO: --Relation pg_toast.pg_toast_2177383887--
>INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
> Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
>INFO: Analyzing _cage2.sl_log_1
>
>I have tracked down a specific slony query that takes a long time to run
>on the provider (100+ secs):
>
>select log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype,
>log_cmddata from "_cage2".sl_log_1 where log_origin = 1 and ( (
> log_tableid in
> (100,200,300,400,500
,600,700,800,900,100
0,1100,1200,1300,140
0,1500,1600,1700,180
0,1900,2000,2100,220
0,2300,2400,2500,260
0,2700,2800,2900,300
0,3100,3200,3300,340
0,3500,3600,3700,380
0,3900,4000,4100,420
0,4300,4400,4500,460
0,4700,4800,4900,500
0,5100,5200,53
00,5400,5500,5600,57
00,5800,5900,6000,61
00,6200,6300,6400,65
00,6600,6700,6800,69
00,7000,7100,7200,73
00,7400,7500,99999)[
color=darkred]
> and (log_xid < '2106689363' and "_cage2". xxid_lt_snapshot(log
_xid,
> '2106433975:21066893
63:''2106433975'',''
2106688480'''))
> and (log_xid >= '2106433975' and "_cage2". xxid_ge_snapshot(log
_xid,
> '2106433975:21066892
66:''2106433975'',''
2106688480'''))
> ) ) order by log_actionseq;
>
>This returns 662 rows. If I remove the xxid_ge_snapshot and
>xxid_lt_snapshot function calls it returns 2135674. This would seem to
>be the root of my performance problem.
>
>I don't understand what the functions do, but allowing them to so
>drastically prune the result set means that the query does a lot of work
>for relatively few rows.
>
>Can anyone explain what is going on here?
>
>[/color]
You're hitting a "Rod Taylor" style situation, where you have some
pretty enormous amount of activity going on.
Each SYNC is running for a really long time, and there's no much way of
avoiding that :-(.
What Rod did under analagous conditions was to group a lot of SYNCs
together. Since it'll probably be doing a Seq Scan, you might as well
try to process most of sl_log_1.
With the default "-g 6", you'll probably see replication falling behind :-(
Suggestion: set the -g option as high as you can get away with. The
SYNCs will run for quite a long time, but at least they'll find lots of
data to replicate. In v 1.0.x, the maximum setting out of the box for
-g is "-g 100", which tries to process 100 SYNCs at a time, if you're
running behind.
If you're running 1.1.x, the maximum value is 10000, and you'd probably
find it useful to set it pretty high until things catch up. And set the
"desired_sync_time" value (the -o option to slon) to 0, so that it
doesn't do the "adaptive" thing.
|