|
Home > Archive > PostgreSQL Performance > March 2006 > WAL logging of SELECT ... INTO command
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
WAL logging of SELECT ... INTO command
|
|
| Jim C. Nasby 2006-03-21, 7:28 am |
| Currently, it appears that SELECT * INTO new_table FROM old_table logs
each page as it's written to WAL. Is this actually needed? Couldn't the
database simply log that the SELECT ... INTO statement was executed
instead? Doing so would likely result in a large performance improvement
in most installs. Is there no provision for writing anything but data
page changes (or whole pages) to WAL?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql
.org so that your
message can get through to the mailing list cleanly
| |
| Simon Riggs 2006-03-21, 8:33 pm |
| On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote:
> Currently, it appears that SELECT * INTO new_table FROM old_table logs
> each page as it's written to WAL. Is this actually needed? Couldn't the
> database simply log that the SELECT ... INTO statement was executed
> instead? Doing so would likely result in a large performance improvement
> in most installs. Is there no provision for writing anything but data
> page changes (or whole pages) to WAL?
AFAIK it takes the same code path as CREATE TABLE AS SELECT, which
already does exactly what you suggest (except when using PITR).
Best Regards, Simon Riggs
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
| |
| Jim C. Nasby 2006-03-22, 7:33 am |
| On Tue, Mar 21, 2006 at 08:33:50PM +0000, Simon Riggs wrote:
> On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote:
>
> AFAIK it takes the same code path as CREATE TABLE AS SELECT, which
> already does exactly what you suggest (except when using PITR).
Ok, I saw disk activity on the base directory and assumed it was pg_xlog
stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore
default_tablepsace and create the new tables in the base directory. I'm
guessing that's a bug... (this is on 8.1.2, btw).
Also, why do we log rows for CTAS/SELECT INTO when PITR is in use for
simple SELECTs (ones that don't call non-deterministic functions)? The
data should alread be available AFAICS...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
| |
| Simon Riggs 2006-03-22, 7:33 am |
| On Wed, 2006-03-22 at 06:47 -0600, Jim C. Nasby wrote:
> Also, why do we log rows for CTAS/SELECT INTO when PITR is in use for
> simple SELECTs (ones that don't call non-deterministic functions)? The
> data should alread be available AFAICS...
Not sure what you're asking... SELECTs don't produce WAL.
PITR wants all changes. Without PITR we can optimise certain logging
actions.
Best Regards, Simon Riggs
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
| |
| Jim C. Nasby 2006-03-22, 7:33 am |
| On Wed, Mar 22, 2006 at 01:08:34PM +0000, Simon Riggs wrote:
> On Wed, 2006-03-22 at 06:47 -0600, Jim C. Nasby wrote:
>
>
> Not sure what you're asking... SELECTs don't produce WAL.
Yes, there'd have to be some special kind of WAL entry that specifies
what select statement was used in CTAS.
> PITR wants all changes. Without PITR we can optimise certain logging
> actions.
The only change here is that we're creating a new table based on the
results of a SELECT. If that SELECT doesn't use anything that's
non-deterministic, then the machine doing the recovery should already
have all the data it needs, provided that we log the SELECT that was
used in the CTAS.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
| |
| Tom Lane 2006-03-22, 9:31 am |
| "Jim C. Nasby" <jnasby@pervasive.com> writes:
[color=darkred]
> The only change here is that we're creating a new table based on the
> results of a SELECT. If that SELECT doesn't use anything that's
> non-deterministic, then the machine doing the recovery should already
> have all the data it needs, provided that we log the SELECT that was
> used in the CTAS.
This is based on a fundamental misconception about the way PITR
log-shipping works. We log actions at the physical level (put this
tuple here), not the logical here's-the-statement-we-executed level.
The two approaches cannot mix, because as soon as there's any physical
discrepancy at all, physical-level actions would be incorrectly applied
to the slave database.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
| |
| Jim C. Nasby 2006-03-22, 9:31 am |
| On Wed, Mar 22, 2006 at 10:06:05AM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>
>
> This is based on a fundamental misconception about the way PITR
> log-shipping works. We log actions at the physical level (put this
> tuple here), not the logical here's-the-statement-we-executed level.
> The two approaches cannot mix, because as soon as there's any physical
> discrepancy at all, physical-level actions would be incorrectly applied
> to the slave database.
Oh, so in other words, SELECT * INTO temp FROM table is inherently
non-deterministic at the physical level, so the only way to be able to
allow PITR to work is to duplicate all the physical changes. Darn.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
| |
| Tom Lane 2006-03-22, 11:32 am |
| "Jim C. Nasby" <jnasby@pervasive.com> writes:
> Oh, so in other words, SELECT * INTO temp FROM table is inherently
> non-deterministic at the physical level, so the only way to be able to
> allow PITR to work is to duplicate all the physical changes. Darn.
Well, lemme put it this way: I'm not prepared to require that PG be
deterministic at the physical level. One obvious source of
non-determinancy is the FSM, which is likely to hand out different free
space to different transactions depending on what else is going on at
the same time. There are others, such as deliberately random
tie-breaking during btree index insertion.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
| |
| Csaba Nagy 2006-03-22, 11:32 am |
| On Wed, 2006-03-22 at 16:35, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>
> Well, lemme put it this way: I'm not prepared to require that PG be
> deterministic at the physical level. One obvious source of
> non-determinancy is the FSM, which is likely to hand out different free
> space to different transactions depending on what else is going on at
> the same time. There are others, such as deliberately random
> tie-breaking during btree index insertion.
While you're at talking about WAL and PITR... I see from the aboce
discussion that PITR is already demanding special handling in the code
(I hope I got this one right, as the following are based on this).
What if the PITR logging would be disconnected from the WAL logging
completely ?
What I mean is to introduce a WAL subscription mechanism, which
basically means some incoming connections where we stream the log
records. We don't need to write them to disk at all in the normal case,
I guess usually PITR will store the records on some other machine so it
means network, not disk. And it doesn't need to be done synchronously,
it can lag behind the running transactions, and we can do it in batches
of WAL records.
It also would mean that the local WAL does not need to log the things
which are only needed for the PITR... that would likely mean some spared
WAL disk activity. Of course it also would mean that the local WAL and
PITR WAL are not the same, but that is not an issue I guess.
It would also permit immediate recycling of the WAL files if the current
archiving style is not used.
The drawbacks I can see (please add yours):
1) the need for the subscription management code with the added
complexity it implies;
2) problems if the WAL stream lags too much behind;
3) problems if the subscribed client's connection is interrupted;
Nr. 2 could be solved by saving the PITR WAL separately if the lag grows
over a threshold, and issue a warning. This could still be acceptable,
as the writing doesn't have to be synchronous and can be made in
relatively large blocks.
There could be a second bigger lag threshold which completely cancels
the subscription. All these thresholds should be configurable, as it
depends on the application what's more important, to have the standby
available all the time or have the primary faster if loaded...
Nr. 3. can be solved by either canceling the subscription on connection
drop, or by allowing a certain amount of time after which the
subscription is canceled. The client can reconnect before this timeout
expires. In the meantime the primary can store the PITR WAL on disk as
mentioned above...
Cheers,
Csaba.
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
| |
| Kris Jurka 2006-03-22, 8:35 pm |
|
On Wed, 22 Mar 2006, Jim C. Nasby wrote:
> Ok, I saw disk activity on the base directory and assumed it was pg_xlog
> stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore
> default_tablepsace and create the new tables in the base directory. I'm
> guessing that's a bug... (this is on 8.1.2, btw).
This has been fixed in CVS HEAD as part of a patch to allow additional
options to CREATE TABLE AS.
http://archives.postgresql.org/pgsq...02/msg00211.php
Kris Jurka
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
| |
| Qingqing Zhou 2006-03-23, 3:32 am |
|
"Simon Riggs" <simon@2ndquadrant.com> wrote
> On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote:
>
> AFAIK it takes the same code path as CREATE TABLE AS SELECT, which
> already does exactly what you suggest (except when using PITR).
>
As I read, they did take the same code path, but did they "simply log that
the SELECT ... INTO statement was executed"? If so, how can we rely on the
unreliable content of the old_table to do recovery?
Regards,
Qingqing
| |
| Jim C. Nasby 2006-03-24, 7:43 am |
| On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote:
>
>
> On Wed, 22 Mar 2006, Jim C. Nasby wrote:
>
>
> This has been fixed in CVS HEAD as part of a patch to allow additional
> options to CREATE TABLE AS.
>
> http://archives.postgresql.org/pgsq...02/msg00211.php
I'll argue that the current behavior is still a bug and should be fixed.
Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces
then...) to honor default_tablespace?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
| |
| Jim C. Nasby 2006-03-24, 7:43 am |
| On Wed, Mar 22, 2006 at 02:20:39PM +0800, Qingqing Zhou wrote:
>
> "Simon Riggs" <simon@2ndquadrant.com> wrote
>
> As I read, they did take the same code path, but did they "simply log that
> the SELECT ... INTO statement was executed"? If so, how can we rely on the
> unreliable content of the old_table to do recovery?
Why would the content of the old_table be unreliable? If we've replayed
logs up to the point of the CTAS then any data that would be visible to
the CTAS should be fine, no?
Though, the way Tom put it in one of his replies it sounds like WAL
doesn't do any kind of statement logging, only data logging. If that's
the case I'm not sure that the CTAS would actually get replayed. But I
suspect I'm just misunderstanding...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
| |
| Alvaro Herrera 2006-03-24, 7:43 am |
| Jim C. Nasby wrote:
> Why would the content of the old_table be unreliable? If we've replayed
> logs up to the point of the CTAS then any data that would be visible to
> the CTAS should be fine, no?
>
> Though, the way Tom put it in one of his replies it sounds like WAL
> doesn't do any kind of statement logging, only data logging. If that's
> the case I'm not sure that the CTAS would actually get replayed. But I
> suspect I'm just misunderstanding...
The CTAS doesn't get logged (nor replayed obviously). What happens is
that the involved files are fsync'ed before transaction commit, AFAIR.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
| |
| Jim C. Nasby 2006-03-24, 7:43 am |
| On Fri, Mar 24, 2006 at 08:39:02AM -0400, Alvaro Herrera wrote:
> Jim C. Nasby wrote:
>
>
> The CTAS doesn't get logged (nor replayed obviously). What happens is
> that the involved files are fsync'ed before transaction commit, AFAIR.
Ahh, yes, that sounds right. Might be a nice gain to be had if there was
some way to log the statement, but I suspect getting WAL to support that
would be extremely non-trivial.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
| |
| Alvaro Herrera 2006-03-24, 9:32 am |
| Jim C. Nasby wrote:
> On Fri, Mar 24, 2006 at 08:39:02AM -0400, Alvaro Herrera wrote:
>
> Ahh, yes, that sounds right. Might be a nice gain to be had if there was
> some way to log the statement, but I suspect getting WAL to support that
> would be extremely non-trivial.
None at all, at least in the current incarnation, I think, because said
query execution is dependent on the contents of the FSM, which is itself
dependent on the timing of VACUUM and other stuff. Such an action,
running with a different FSM content, can very trivially cause data
corruption.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
|
|
|
|
|