
Author Buffering problem - a patch?
Philip Warner

2005-09-18, 3:24 am

Following recent problems with buffer sizes (again), I would be interested to know if:

(a) the slony developers think my latest suggestion (setting bounds on buffer space used when reading logs) is a good idea,
(b) if you plan to implement this (or something similar), or
(c) if you would be interested in patches that implement this (or something similar).

Or if you think my case is just too unusual to bother with...
Christopher Browne

2005-09-19, 3:25 am

Philip Warner <pjw-Ig6Zz+cC40D0CCvOHzKKcA@public.gmane.org> writes:
> Following recent problems with buffer sizes (again), I would be interested to know if:
>
> (a) the slony developers think my latest suggestion (setting bounds on buffer space used when reading logs) is a good idea,
> (b) if you plan to implement this (or something similar), or
> (c) if you would be interested in patches that implement this (or something similar).
>
> Or if you think my case is just too unusual to bother with...


It seems a good idea, at least in principle.

Hey, if we can change the behaviour of slon so that it consumes Very
Little Memory, or, rather, to prevent it from Behaving Badly, that
hardly seems a bad thing.

I haven't looked deeply, but the only reason I can think of for there
to need to be a lot of memory occupied at any given time is if the
groupings of 10 queries together (e.g. - when sl_log_1 entries are
being applied, they are done in sets of 10) involves some records that
are Very Large.

Typical here would be with some application like RT, where one of the
tables, "attachments", can contain arbitrarily large records.

It would seem perfectly reasonable to short circuit this:

--> If size of current data being processed > 10MB then don't
bother moving on from record 7 to record 8; push the
data out NOW and clear the data structure...

The only downside that I see is the cost of searching through the
strings to see how big they are.
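
The short circuit above could be sketched roughly as follows. This is only an illustration, not slon code: the type `cmd_group`, the functions `group_add` and `group_reset`, and both limits are hypothetical names, and the sketch assumes the caller already knows each command's length, so no string scan is needed.

```c
#include <stddef.h>

/* Hypothetical limits; neither name exists in slon. */
#define GROUP_MAX_COMMANDS 10
#define GROUP_MAX_BYTES    (10u * 1024u * 1024u)

typedef struct {
    size_t n_commands;  /* commands queued in the current group */
    size_t n_bytes;     /* total length of those commands */
} cmd_group;

/*
 * Account for one more command of cmd_len bytes.
 * Returns 1 when the caller should push the group out now and then
 * call group_reset(); returns 0 when it may keep accumulating.
 */
static int group_add(cmd_group *g, size_t cmd_len)
{
    g->n_commands++;
    g->n_bytes += cmd_len;
    return g->n_commands >= GROUP_MAX_COMMANDS
        || g->n_bytes > GROUP_MAX_BYTES;
}

/* Clear the counters after the group has been sent. */
static void group_reset(cmd_group *g)
{
    g->n_commands = 0;
    g->n_bytes = 0;
}
```

With this shape a group is sent either when it holds 10 commands or as soon as it grows past 10MB, whichever comes first.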

I'm not certain that this is The Issue, though. If what's consuming
memory is doing so from inside libpq, then "setting bounds" may be
rather more troublesome...

Actually, I just thought of the other place where there's an issue,
and it is indeed thornier.

That's with the "FETCH 100 FROM LOG" queries. That's going to draw in
100 rows from sl_log_1 each time, and if some are Pathological Big
Records, _there_ lies the problem.

There isn't a "FETCH 100 FROM LOG STOP AFTER 50MB" command :-(.

I suppose what is needful is to have a way to control both the 10 and
100 figures.

Both the 10 and 100 figures are #defined in <src/slon.h>, thus:

#ifdef SLON_CHECK_CMDTUPLES
#define SLON_COMMANDS_PER_LINE 1
#define SLON_DATA_FETCH_SIZE 100
#define SLON_WORKLINES_PER_HELPER (SLON_DATA_FETCH_SIZE * 4)
#else
#define SLON_COMMANDS_PER_LINE 10
#define SLON_DATA_FETCH_SIZE 10
#define SLON_WORKLINES_PER_HELPER (SLON_DATA_FETCH_SIZE * 50)
#endif

This is NOT easily changeable at runtime, but if you're running into
these problems, then I'd suggest decreasing these values in slon.h.

Ideally, it would be nice for slon to figure out suitable values by
itself. How to do that when a bad choice would lead to slon running
out of memory and falling over seems, erm, troublesome :-(.

A way to cope with that would be if we stuffed the values into a table
on the node that the slon is servicing, and had a policy of always
starting out by taking the old values from the node and scaling them
back a bit. If all goes well, the slon gets a bit more aggressive; if
it crashes, that naturally leads to stepping back.
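
One possible shape for that policy, purely as a sketch: the bounds, the halving on startup and the growth step are all assumptions chosen for illustration, and the function names are hypothetical. The stored value would live in a table on the node, which is not shown here.

```c
/* Hypothetical bounds for an adaptive fetch size. */
#define FETCH_MIN 1
#define FETCH_MAX 100

/*
 * On startup: begin below the value stored on the node, since the
 * previous slon may have died while using it. Halving is an assumption;
 * the thread only says "scaling them back a bit".
 */
static int fetch_size_on_startup(int stored)
{
    int v = stored / 2;
    return v < FETCH_MIN ? FETCH_MIN : v;
}

/* After work was applied without trouble: get a bit more aggressive. */
static int fetch_size_on_success(int current)
{
    int v = current + current / 4 + 1;
    return v > FETCH_MAX ? FETCH_MAX : v;
}
```

A crash needs no handler of its own in this scheme: the restart goes through fetch_size_on_startup() and so steps back automatically.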

Definitely worthy of more thought before implementation...
--
output = ("cbbrowne" "@" "ca.afilias.info")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
Philip Warner

2005-09-19, 3:25 am


Thanks for the reply.

Christopher Browne wrote:

> Hey, if we can change the behaviour of slon so that it consumes Very
> Little Memory, or, rather, to prevent it from Behaving Badly, that
> hardly seems a bad thing.


That's what I'd like to do. If a slon process is killed, or a link goes
down too long, I end up with a DB that can not be made 'current'.
Sometimes, I just get a big update, and slon dies (and will not restart).

> (e.g. - when sl_log_1 entries are
>being applied, they are done in sets of 10) involves some records that
>are Very Large.
>
>

Indeed; we can have a single record of 37MB (max).

>--> If size of current data being processed > 10MB then don't
> bother moving on from record 7 to record 8; push the
> data out NOW and clear the data structure...
>
>

This is the kind of approach I was thinking of; just stop fetching, and
wake up the code that does the replication.

>The only downside that I see is the cost of searching through the
>strings to see how big they are.
>
>

Shouldn't be a problem; AFAICT the rows are fetched and stored in a
queue which does a realloc to store the data. Must know the size.

>That's with the "FETCH 100 FROM LOG" queries. That's going to draw in
>100 rows from sl_log_1 each time, and if some are Pathological Big
>Records, _there_ lies the problem.
>
>

I think that's part of the problem; we could fetch 10 at a time, but the
way I read the code, it will just loop and fetch the next 10. So either
way we'll consume lots of memory.

My proposed solution is to:

(a) use the 'max buffer space' as a guideline only
(b) use fetch 1 or fetch 10 (they both seem fast)
(c) when the current *used* buffer space (ie buffers in the
'to-be-processed' queue) exceeds the buffer limit (after the fetch-10 is
complete), then don't fetch the next 10. Wait for the used buffer space
to drop below an 'empty/restart-limit'.
(d) dealloc/free memory from the queues once processed (if not already done)

This means that the buffer limit would always be exceeded, but by at most
one fetch of 10 rows. If every one of those 10 log rows is 37MB, then I'll
blow my process address space again, but that is *very* unlikely.
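
Steps (a) to (c) amount to a gate with two thresholds. A minimal sketch, assuming the caller can report how many bytes currently sit in the 'to-be-processed' queue; `fetch_gate` and `gate_may_fetch` are hypothetical names, not anything in remote_worker.c:

```c
#include <stddef.h>

typedef struct {
    size_t limit;    /* stop fetching once used space exceeds this */
    size_t restart;  /* resume once used space drops below this */
    int    paused;   /* 1 while the fetcher is being held back */
} fetch_gate;

/*
 * Called after each completed fetch with the bytes currently buffered.
 * Returns 1 if the next FETCH may be issued, 0 if the fetcher should wait.
 */
static int gate_may_fetch(fetch_gate *g, size_t used)
{
    if (g->paused) {
        if (used < g->restart)
            g->paused = 0;      /* drained far enough, start again */
    } else if (used > g->limit) {
        g->paused = 1;          /* over the guideline, hold off */
    }
    return !g->paused;
}
```

Keeping the restart threshold well below the limit avoids flipping between fetching and waiting on every row.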

Running a query to get the string sizes (as you may have suggested above,
now I think about it) could/would prevent problems: "select
sum(length(log_cmddata)) from...". We could then adjust the fetch size
to 'fetch 1' if the total size will blow the current buffer size.


>This is NOT easily changeable at runtime, but if you're running into
>these problems, then I'd suggest decreasing these values in slon.h.
>
>

Why is that? There are a lot of places that reference the values, but is
there a real problem with making them variable?

>Ideally, it would be nice for slon to figure out suitable values by
>itself. How to do that when a bad choice would lead to slon running
>out of memory and falling over seems, erm, troublesome :-(.
>
>

Using the "select sum(length(log_cmddata)) from..." approach, then using
either fetch-100 or fetch-1 based on the result seems like a good first
pass. Also has the advantage of no change from the user's perspective
unless they have huge queries -- and then the change improves reliability.


To summarize, the basic approach would be:

- prior to doing an open cursor, run "select sum(length(log_cmddata))
from...limit 100".
- use this to determine fetch size (1 or 100).
- open cursor/fetch etc. At the end of the fetch loop, see how much
buffered data we have in the queues. If too much, pause and store in
replicated DB.
- repeat.

Does this sound broadly OK? Not sure how best to do the pause/restart.
Just sleep/wake? Use thread events? etc etc

Later, this could be refined by selecting lengths of next 100 and tuning
how many we fetch based on this knowledge. But as a first pass it seems
to satisfy the requirements.
Christopher Browne

2005-09-19, 9:24 am

Philip Warner <pjw-Ig6Zz+cC40D0CCvOHzKKcA@public.gmane.org> writes:
> Thanks for the reply.
>
> Christopher Browne wrote:
>
>
> That's what I'd like to do. If a slon process is killed, or a link goes
> down too long, I end up with a DB that can not be made 'current'.
> Sometimes, I just get a big update, and slon dies (and will not restart).
>
> Indeed; we can have a single record of 37MB (max).


Ah, good, so I'm not heading down a wrong road on this.

> This is the kind of approach I was thinking of; just stop fetching,
> and wake up the code that does the replication.


Alas, that begs a "how do we do that?"

In this area, Slony-I isn't doing any magic. It just does a plain old
"FETCH [X] FROM LOG" statement, which does _not_ admit any opportunity
to stop fetching early.

> Shouldn't be a problem; AFAICT the rows are fetched and stored in a
> queue which does a realloc to store the data. Must know the size.


That works for the "submission" side, where slon is generating the
queries that are to go into the destination system.

> I think that's part of the problem; we could fetch 10 at a time, but
> the way I read the code, it will just loop and fetch the next 10. So
> either way we'll consume lots of memory.
>
> My proposed solution is to:
>
> (a) use the 'max buffer space' as a guideline only
> (b) use fetch 1 or fetch 10 (they both seem fast)
> (c) when the current *used* buffer space (ie buffers in the
> 'to-be-processed' queue) exceeds the buffer limit (after the fetch-10 is
> complete), then don't fetch the next 10. Wait for the used buffer space
> to drop below an 'empty/restart-limit'.
> (d) dealloc/free memory from the queues once processed (if not already done)
>
> This means that the buffer limit would always be exceeded, but by at most
> one fetch of 10 rows. If every one of those 10 log rows is 37MB, then I'll
> blow my process address space again, but that is *very* unlikely.
>
> Running a query to get the string sizes (as you may have suggested
> above, now I think about it) could/would prevent problems: "select
> sum(length(log_cmddata)) from...". We could then adjust the fetch
> size to 'fetch 1' if the total size will blow the current buffer
> size.


That query would be exceedingly expensive for the scenario where there
are a million rows outstanding in sl_log_1.

Indeed, the logic that you're proposing is a logic which would worsen
the behaviour of Slony-I for use cases where records are generally
pretty small.

Consider the scenario where none of the sl_log_1 records are larger
than 1K in size, but where we have *enormous* numbers of them.

Using FETCH 1 rather than FETCH 100 means submitting 100 times as many
queries, and having 100 times as many query round trips between the
slon and the provider.

> Why is that? There are a lot of places that reference the values,
> but is there a real problem with making them variable?


They can't be variables because their values are used at compile time
to indicate array sizes. It is certainly possible to change the array
definitions and use malloc() at runtime, but that involves quite a bit
of change.

> Using the "select sum(length(log_cmddata)) from..." approach, then
> using either fetch-100 or fetch-1 based on the result seems like a
> good first pass. Also has the advantage of no change from the user's
> perspective unless they have huge queries -- and then the change
> improves reliability.


.... Unless there *aren't* any Fat Rows, in which case reliability goes down
because we're submitting flurries of tiny queries that take longer to
run, making it more likely that replication falls behind.

> To summarize, the basic approach would be:
>
> - prior to doing an open cursor, run "select sum(length(log_cmddata))
> from...limit 100".
> - use this to determine fetch size (1 or 100).
> - open cursor/fetch etc. At the end of the fetch loop, see how much
> buffered data we have in the queues. If too much, pause and store in
> replicated DB.
> - repeat.
>
> Does this sound broadly OK? Not sure how best to do the pause/restart.
> Just sleep/wake? Use thread events? etc etc


No, there is no pause/restart. The fetch size determines everything.

> Later, this could be refined by selecting lengths of next 100 and
> tuning how many we fetch based on this knowledge. But as a first
> pass it seems to satisfy the requirements.


I really dislike the way that this injures behaviour for
non-pathological cases (e.g. - where there aren't Fat Rows 37MB in
size).

Not to be too pointed about it, but our systems don't have Fat Rows,
and I'm quite disinclined to make changes that specifically hurt us...

I have a different thought, which might conceivably improve things for
everyone...

The thought is to store record size in a new field on sl_log_1, call
it "log_cmdsize", and populate it at insert time.

We declare LOG cursor as being ...

declare LOG cursor for select log_origin, log_xid, log_tableid,
log_actionseq, log_cmdtype, log_cmdsize
from sl_log_1 [with various other criteria] order by log_actionseq;

On this cursor, we can probably quite well afford to blindly do FETCH
1000.

We then walk through the 1000 rows, decide how many we'll be applying,
based on a logic like...

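ram_consumption = 0
for each row do
    add the row's key information to a query to pull *ALL* data from sl_log_1
    ram_consumption += row.log_cmdsize
    if ram_consumption > threshold then
        perform the second query and load results into this node
        clear the key list and reset ram_consumption to 0
    end if
done
perform the second query for any keys still pending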

In cases where log_cmddata is small, this would mean that we might do
way *more* than 10 at a time, which should mitigate the performance
loss resulting from having to read thru sl_log_1 twice.
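
The batching decision in that loop can be isolated into a small function. A sketch under the assumption that the log_cmdsize values from the cheap cursor have been copied into an array; `next_batch_len` is a hypothetical name, and the detail query itself is left out:

```c
#include <stddef.h>

/*
 * Given the log_cmdsize of each row fetched by the cheap cursor, return
 * how many rows, starting at index start, belong in the next detail query.
 * The row that pushes the running total past the threshold is included,
 * as in the pseudocode, so a batch always holds at least one row while
 * rows remain. Returns 0 once start is past the end.
 */
static size_t next_batch_len(const size_t *cmdsize, size_t nrows,
                             size_t start, size_t threshold)
{
    size_t ram_consumption = 0;
    size_t i;

    if (start >= nrows)
        return 0;
    for (i = start; i < nrows; i++) {
        ram_consumption += cmdsize[i];
        if (ram_consumption > threshold)
            return i - start + 1;
    }
    return nrows - start;   /* everything left fits under the threshold */
}
```

A caller would loop, advancing start by the returned length each time, until the function returns 0. A single Fat Row ends up in a batch of its own (or at the tail of one), while runs of small rows are grouped well beyond 10 at a time.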

The mechanism for efficiently pulling the detail data from sl_log_1
based on a set of keys probably requires generating a 2D array of
index values to pass in to a set-returning stored procedure.

I really see two answers to this, at this point...

1. Set the FETCH value(s) to 1 rather than 100 at compile time if you
know you have problems with Fat Rows.

This is, of course, a reasonable immediate answer that would be
available to you in 1.0.5.

2. We have to do something quite a bit cleverer, probably similar to
what I outlined, if we don't want to injure users that don't use Fat
Rows.

Cleverer can't be in place until 1.2, at the earliest...

I need to talk with a couple people about this...
--
(format nil "~S@~S" "cbbrowne" "ca.afilias.info")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
Philip Warner

2005-09-19, 11:24 am

Christopher Browne wrote:

>The thought is to store record size in a new field on sl_log_1, call
>it "log_cmdsize", and populate it at insert time.
>
>

May not need to store it; I would expect that the size of a text field
could be determined cheaply (Jan would know - he wrote toast).

>We declare LOG cursor as being ...
>
>declare LOG cursor for select log_origin, log_xid, log_tableid,
> log_actionseq, log_cmdtype, log_cmdsize
>from sl_log_1 [with various other criteria] order by log_actionseq;
>
>

....etc. Looks good to me, *except*, my reading of remote_worker.c made
me believe it would loop retrieving 100 rows repeatedly, while another
thread sends to the replicated db. If I am right, we would still need
some way of pausing the 'pull' part of the pull->push mechanism. Or I
misread the code -- having just looked at it again, it may only read 100
at a time.

You may even be able to refine this (depending on cost of such things)
by selecting substring(log_cmddata from 1 for 1024) so that small
commands require no extra IO.

>The mechanism for efficiently pulling the detail data from sl_log_1
>based on a set of keys probably requires generating a 2D array of
>index values to pass in to a set-returning stored procedure.
>
>

You may find that reiterating the cursor is simplest, especially if we
have length stored: you can select where log_cmdsize>1024.

>1. Set the FETCH value(s) to 1 rather than 100 at compile time if you
>know you have problems with Fat Rows.
>
>

See above. I have not tried recompiling with a value of 1, but I thought
it would just loop. Maybe I misread the code.

>2. We have to do something quite a bit cleverer, probably similar to
>what I outlined, if we don't want to injure users that don't use Fat
>Rows.
>
>

If we don't need to change the schema (eg. if length(log_cmddata) is
cheap) then would 1.1 be possible?


Thanks for the continuing help.
Jan Wieck

2005-09-19, 1:26 pm

On 9/19/2005 10:31 AM, Philip Warner wrote:
> Christopher Browne wrote:
>
> May not need to store it; I would expect that the size of a text field
> could be determined cheaply (Jan would know - he wrote toast).
>
> ...etc. Looks good to me, *except*, my reading of remote_worker.c made
> me believe it would loop retrieving 100 rows repeatedly, while another
> thread sends to the replicated db. If I am right, we would still need
> some way of pausing the 'pull' part of the pull->push mechanism. Or I
> misread the code -- having just looked at it again, it may only read 100
> at a time.


You didn't misread the code. It indeed buffers based on a compiled in
number of rows only and doesn't take the size into account at all. So
yes, the fetching thread needs to stop if the buffer grows too large.
Since it does block if all buffers are filled, that part wouldn't be too
complicated.

What gets complicated is the fact that the buffer never shrinks! All the
buffer lines stay allocated and eventually get enlarged until slon
exits. So even if you stop fetching after you hit large rows, slowly
over time all buffer lines will get adjusted to that huge size. On some
operating systems (libc implementations to be precise) free() isn't a
solution here as it never returns memory to the OS, but keeps the pages
for future alloc()s. The best way to tackle that would IMHO be to allow
only certain buffer lines to be used for huge rows and block if none of
them is available.


Jan

>
> You may even be able to refine this (depending on cost of such things)
> by selecting substring(log_cmddata from 1 for 1024) so that small
> commands require no extra IO.
>
> You may find that reiterating the cursor is simplest, especially if we
> have length stored: you can select where log_cmdsize>1024.
>
> See above. I have not tried recompiling with a value of 1, but I thought
> it would just loop. Maybe I misread the code.
>
> If we don't need to change the schema (eg. if length(log_cmddata) is
> cheap) then would 1.1 be possible?
>
>
> Thanks for the continuing help.
>
>
> _______________________________________________
> Slony1-general mailing list
> Slony1-general-AuKwsB3Fm+ugFIWk8tvyRWD2FQJk+8+b@public.gmane.org
> http://gborg.postgresql.org/mailman.../slony1-general



--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck-bwPqjjyvM7QAvxtiuMwx3w@public.gmane.org #
Philip Warner

2005-09-19, 1:26 pm

Jan Wieck wrote:

>
> You didn't misread the code. It indeed buffers based on a compiled in
> number of rows only and doesn't take the size into account at all. So
> yes, the fetching thread needs to stop if the buffer grows too large.
> Since it does block if all buffers are filled, that part wouldn't be
> too complicated.
>
> What gets complicated is the fact that the buffer never shrinks! All
> the buffer lines stay allocated and eventually get enlarged until slon
> exits. So even if you stop fetching after you hit large rows, slowly
> over time all buffer lines will get adjusted to that huge size. On
> some operating systems (libc implementations to be precise) free()
> isn't a solution here as it never returns memory to the OS, but keeps
> the pages for future alloc()s.


Well, it would help, wouldn't it? If in one pass, row(1) had 37MB
allocated, and in another pass row(2) wanted 37MB, at least another 37MB
would not be grabbed from the OS -- the freed block would be available.

> The best way to tackle that would IMHO be to allow only certain buffer
> lines to be used for huge rows and block if none of them is available.


Wouldn't this lead to ordering problems?

What about defining a MAX_ROW_BUFFER which represents the maximum
allowed to be permanently allocated to command data fetched from the
log. Then, only fetch cmddata for log rows up to this size. For rows
larger than this, retrieve the PK and store in the list. When the item
is to be processed, retrieve the cmddata directly using the PK.
Jan Wieck

2005-09-19, 8:25 pm

On 9/19/2005 2:11 PM, Philip Warner wrote:

> Jan Wieck wrote:
>
>
> Well, it would help, wouldn't it? If in one pass, row(1) had 37MB
> allocated, and in another pass row(2) wanted 37MB, at least another 37MB
> would not be grabbed from the OS -- the freed block would be available.
>
>
> Wouldn't this lead to ordering problems?
>
> What about defining a MAX_ROW_BUFFER which represents the maximum
> allowed to be permanently allocated to command data fetched from the
> log. Then, only fetch cmddata for log rows up to this size. For rows
> larger than this, retrieve the PK and store in the list. When the item
> is to be processed, retrieve the cmddata directly using the PK.


That would create quite a nightmare in the thread coordination. The one
that does the fetch then needs to be told by the one that does the apply
to get a specific row instead now.

What you could do to keep it simple is to go with a free() approach.
free() buffers that are over a certain size after they are applied. And
have the fetch thread wait if the buffered amount exceeds your limit. In
addition, you probably want to make the initial fetch size a config
parameter and also make the actual number of fetched rows depend on
the buffer's fill level, so to speak. The fuller the buffer is, the fewer
rows to fetch in order to avoid "fetching 100 50M rows at once" by surprise.
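
Two pieces of that suggestion can be sketched in isolation. Everything named here is hypothetical (`work_line`, `BIG_LINE_BYTES`, both functions), and the linear scaling of the row count is just one assumed way to tie it to the fill level; the wait when the limit is exceeded is a separate mechanism and is not shown.

```c
#include <stdlib.h>

#define BIG_LINE_BYTES (1024u * 1024u)   /* hypothetical cut-off: 1MB */

typedef struct {
    char  *data;    /* buffer holding one log row's command data */
    size_t alloc;   /* bytes currently allocated for data */
} work_line;

/*
 * Call after a line has been applied. Only oversized buffers are given
 * back; small ones stay allocated for reuse. Returns the bytes released.
 */
static size_t line_release_if_big(work_line *wl)
{
    size_t freed = 0;

    if (wl->alloc > BIG_LINE_BYTES) {
        freed = wl->alloc;
        free(wl->data);
        wl->data = NULL;
        wl->alloc = 0;
    }
    return freed;
}

/*
 * Scale the configured fetch size by the unused share of the buffer
 * limit, never going below 1 row.
 */
static int rows_to_fetch(int initial, size_t used, size_t limit)
{
    size_t n;

    if (initial < 1 || limit == 0 || used >= limit)
        return 1;
    n = (size_t) initial * (limit - used) / limit;
    return n < 1 ? 1 : (int) n;
}
```

With an initial fetch size of 100 and a buffer that is half full, this asks for 50 rows; near the limit it falls back to one row at a time.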


Jan

Philip Warner

2005-09-20, 11:24 am

Jan Wieck wrote:

> What you could do to keep it simple is to go with a free() approach.
> free() buffers that are over a certain size after they are applied.
> And have the fetch thread wait if the buffered amount exceeds your
> limit. In addition, you probably want to make the initial fetch size a
> config parameter and also make the actual number of fetched rows
> depend on the buffer's fill level, so to speak. The fuller the
> buffer is, the fewer rows to fetch in order to avoid "fetching 100 50M
> rows at once" by surprise.


Thanks for the replies; to summarize the plan (please say yay or nay!):

- modify remote_worker.c to make initial fetch size a config parameter
- modify remote_worker.c (and related) to alloc/free large blocks (say >
1MB, or a user-settable value)
- add a config parameter 'fetch buffer limit'
- modify remote_worker.c to do fetches <= 'initial fetch size' based on
currently used memory and 'fetch buffer limit'. Minimum 1.
- modify remote_worker.c to pause after completing one fetch
cycle if exceeding 'fetch buffer limit', and automagically wake
up again....hmmm.

(should we just skip the last one?)
Jan Wieck

2005-09-20, 11:24 am

On 9/20/2005 11:14 AM, Philip Warner wrote:

> Jan Wieck wrote:
>
>
> Thanks for the replies; to summarize the plan (please say yay or nay!):
>
> - modify remote_worker.c to make initial fetch size a config parameter
> - modify remote_worker.c (and related) to alloc/free large blocks (say >
> 1MB, or a user-settable value)
> - add a config parameter 'fetch buffer limit'


yes, yes, yes

> - modify remote_worker.c to do fetches <= 'initial fetch size' based on
> currently used memory and 'fetch buffer limit'. Minimum 1.
> - modify remote_worker.c to pause after completing one fetch
> cycle if exceeding 'fetch buffer limit', and automagically wake
> up again....hmmm.
>
> (should we just skip the last one?)


I would say to modify remote_worker.c so that it treats "too much memory
in use" as if there were no more buffer lines available. The latter
can happen already, and remote_worker will then wait on the condition
variable for the local apply thread to return empty buffer lines to the
pool. That event (returning lines to pool) is what triggers the
condition variable and thereby wakes up remote_worker. Since this event
is not guaranteed to be in connection with any free(), you'll have to
recheck and possibly wait again.
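
The wait-and-recheck pattern described here is the usual condition-variable loop. A minimal sketch with POSIX threads; the `line_pool` structure and all function names are hypothetical and do not reflect how remote_worker.c actually organises its pool:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;      /* signalled when lines return to the pool */
    size_t          free_lines;   /* empty buffer lines available */
    size_t          bytes_used;   /* memory held by filled lines */
    size_t          bytes_limit;  /* the 'fetch buffer limit' */
} line_pool;

static void pool_init(line_pool *p, size_t lines, size_t bytes_limit)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->changed, NULL);
    p->free_lines = lines;
    p->bytes_used = 0;
    p->bytes_limit = bytes_limit;
}

/* "Too much memory in use" is treated exactly like "no lines left". */
static int pool_must_wait(const line_pool *p)
{
    return p->free_lines == 0 || p->bytes_used > p->bytes_limit;
}

/* Fetch side: block until a line may be taken, then take it. */
static void pool_take_line(line_pool *p)
{
    pthread_mutex_lock(&p->lock);
    while (pool_must_wait(p))     /* recheck after every wakeup */
        pthread_cond_wait(&p->changed, &p->lock);
    p->free_lines--;
    pthread_mutex_unlock(&p->lock);
}

/* Fetch side: record the size of a row once it has been stored. */
static void pool_add_bytes(line_pool *p, size_t bytes)
{
    pthread_mutex_lock(&p->lock);
    p->bytes_used += bytes;
    pthread_mutex_unlock(&p->lock);
}

/* Apply side: return a line plus whatever memory was freed (may be 0). */
static void pool_return_line(line_pool *p, size_t bytes_freed)
{
    pthread_mutex_lock(&p->lock);
    p->free_lines++;
    p->bytes_used -= bytes_freed > p->bytes_used ? p->bytes_used : bytes_freed;
    pthread_cond_broadcast(&p->changed);
    pthread_mutex_unlock(&p->lock);
}
```

Because a returned line may free no memory at all, the fetch thread can wake up and still be over the limit; the while loop sends it back to sleep in that case.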


Jan
