|
Home > Archive > PostgreSQL Performance > March 2006 > Database possible corruption , unsolvable mystery
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Database possible corruption , unsolvable mystery
|
|
| Eric Lauzon 2006-03-29, 1:31 pm |
| Greetings,
We have an issue where we have a database with many tables.
The layout of the database is 3 set of look alike tables with different names.
Each set of tables has some referential integrety that point back to the main
control table.
On two set of tables we are able to efficiently delete referential and main record
without a problems, but on the 3rd set we have an issue where the control table is clugged
and delete seem to take forever , as example on the first two set a delete of 60K record take about
4 second to 10 second but on the 3rd set it can take as long as 3hours.
This seem to be only affecting one database , the schema and way of doing is replicated elsewhere
and if efficient.
The question is, even after droping 3rd set integrity , dumping the table data , deleting the table,
doing a manual checkpoint and recreating the table with the dump data , with or without referential
integrity , the problems still araise.
If we copy the data from the live table and do a create table aaa as select * from problematic_table;
the table aaa operations are normal and efficient.
This is why our investigation brought us to the folowing questions:
1. Are postgresql data file name are hashed references to table name(as oracle)? [~path to data EX:/var/log/pgsql/data/[arbitraty numbers]/[datafile]]?
2. If the data files are corrupted and we re-create is it possible it uses the same files thus creating the same issue?
3. Since we know that all the tables has that problems is there an internal table with undisclosed references to tables data files?
I hope the questions were clear.
Have a good day, and thank you in advance.
Eric Lauzon
[Recherche & Développement]
Above Sécurité / Above Security
Tél : (450) 430-8166
Fax : (450) 430-1858
AVERTISSEMENT CONCERNANT LA CONFIDENTIALITÉ
Le présent message est à l'usage exclusif du ou des destinataires mentionnés ci-dessus. Son contenu est confidentiel et peut être assujetti au secret professionnel. Si vous avez reçu le présent message par erreur, veuillez nous en aviser immédiatement et
le détruire en vous abstenant d'en faire une copie, d'en divulguer le contenu ou d'y donner suite.
CONFIDENTIALITY NOTICE
This communication is intended for the exclusive use of the addressee identified above. Its content is confidential and may contain privileged information. If you have received this communication by error, please notify the sender and delete the message w
ithout copying or disclosing it.
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
| |
| Richard Huxton 2006-03-29, 8:28 pm |
| Eric Lauzon wrote:
> This is why our investigation brought us to the folowing questions:
>
> 1. Are postgresql data file name are hashed references to table
> name(as oracle)? [~path to data EX:/var/log/pgsql/data/[arbitraty
> numbers]/[datafile]]?
OID numbers - look in the contrib directory/package for the oid2name
utility.
> 2. If the data files are corrupted and we re-create is it possible it
> uses the same files thus creating the same issue?
No
> 3. Since we know that all the tables has that problems is there an
> internal table with undisclosed references to tables data files? I
> hope the questions were clear.
You mean a system table that could account for your problems since it
refers to some of your tables but not others? No.
The obvious places to start are:
1. vacuum analyse verbose on the tables in question
This should show whether there are a lot of "dead" rows
2. explain analyse on problem queries
To see if the query plans are correct
3. SELECT * FROM pg_stat_???
Assuming you have statistics gathering turned on, this might show
unusual table accesses.
HTH
--
Richard Huxton
Archonet Ltd
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
| |
| Mark Lewis 2006-03-29, 8:28 pm |
| Can you post an explain analyze for the delete query? That will at
least tell you if it is the delete itself which is slow, or a trigger /
referential integrity constraint check. Which version of PG is this?
-- Mark Lewis
On Wed, 2006-03-29 at 12:58 -0500, Eric Lauzon wrote:
> Greetings,
>
> We have an issue where we have a database with many tables.
> The layout of the database is 3 set of look alike tables with different names.
> Each set of tables has some referential integrety that point back to the main
> control table.
>
> On two set of tables we are able to efficiently delete referential and main record
> without a problems, but on the 3rd set we have an issue where the control table is clugged
> and delete seem to take forever , as example on the first two set a delete of 60K record take about
> 4 second to 10 second but on the 3rd set it can take as long as 3hours.
>
> This seem to be only affecting one database , the schema and way of doing is replicated elsewhere
> and if efficient.
>
> The question is, even after droping 3rd set integrity , dumping the table data , deleting the table,
> doing a manual checkpoint and recreating the table with the dump data , with or without referential
> integrity , the problems still araise.
>
> If we copy the data from the live table and do a create table aaa as select * from problematic_table;
> the table aaa operations are normal and efficient.
>
> This is why our investigation brought us to the folowing questions:
>
> 1. Are postgresql data file name are hashed references to table name(as oracle)? [~path to data EX:/var/log/pgsql/data/[arbitraty numbers]/[datafile]]?
>
> 2. If the data files are corrupted and we re-create is it possible it uses the same files thus creating the same issue?
>
> 3. Since we know that all the tables has that problems is there an internal table with undisclosed references to tables data files?
>
> I hope the questions were clear.
>
> Have a good day, and thank you in advance.
>
>
> Eric Lauzon
> [Recherche & Développement]
> Above Sécurité / Above Security
> Tél : (450) 430-8166
> Fax : (450) 430-1858
>
> AVERTISSEMENT CONCERNANT LA CONFIDENTIALITÉ
>
> Le présent message est Ã_ l'usage exclusif du ou des destinataires mentionnés ci-dessus. Son contenu est confidentiel et peut être assujetti au secret professionnel. Si vous avez reçu le présent message par erreur, veuillez nous en aviser immédiat
ement et le détruire en vous abstenant d'en faire une copie, d'en divulguer le contenu ou d'y donner suite.
>
> CONFIDENTIALITY NOTICE
>
> This communication is intended for the exclusive use of the addressee identified above. Its content is confidential and may contain privileged information. If you have received this communication by error, please notify the sender and delete the message
without copying or disclosing it.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
| |
| Eric Lauzon 2006-03-29, 8:28 pm |
| > -----Original Message-----
> From: Richard Huxton & #91;mailto:dev@archo
net.com]
> Sent: 29 mars 2006 17:10
> To: Eric Lauzon
> Cc: pgsql- performance@postgres
ql.org
> Subject: Re: [PERFORM] Database possible corruption ,
> unsolvable mystery
>
> Eric Lauzon wrote:
>
> OID numbers - look in the contrib directory/package for the
> oid2name utility.
This will give me the location of the databases file for a specific
table or index?
>
> possible it
>
> No
>
humm why would it affect only original table , and copy of that table
renamed back to the original table name
but not the copy.
example:
original table name : table_problem <issue>
copy name : table_problem_copy <no issue>
renamed copyed table: table_problem <issue>
>
> You mean a system table that could account for your problems
> since it refers to some of your tables but not others? No.
Well actualy its affecting only one table in a set of 5 table
(referential integrity)
and the table affected if the [referenced table] so it might be system
related, but
as stated if all the data is copied to a create table
copy_of_problematic_
table as select * from problematic_table
there is 0 issue but as soon as copy_of_problematic_
table is renamed to
problematic_table the problems is back.
But we have 2 orther set of 5 table in the same database built exactly
the same way and it dosen't
seem affected by the same problems, this is why i am wandering why the
problems is recurent if
internal postgresql data file are name bound ...and i am not taking
about the OID.
>
> The obvious places to start are:
> 1. vacuum analyse verbose on the tables in question
> This should show whether there are a lot of "dead" rows
> 2. explain analyse on problem queries
> To see if the query plans are correct 3. SELECT * FROM pg_stat_???
> Assuming you have statistics gathering turned on, this
> might show unusual table accesses.
Btw i can't give vacuum info right now because the source database is
being dumped for complete re-insertion.
Mabey later if this dosen't fix the problem , and as of information its
7.4.6 [i know its not the most rescent]
but it is the way it is right now and we suspect the problem might have
come from a power outage while there was
a full vacuum and the reason why its only one table that has been
affected is probably because it was the table being vacummed,
but this is only an assumption right now and more info will folow if the
problems persis after a full restore.
Thanks you :)
-elz
AVERTISSEMENT CONCERNANT LA CONFIDENTIALITE
Le present message est a l'usage exclusif du ou des destinataires mentionnes ci-dessus. Son contenu est confidentiel et peut etre assujetti au secret professionnel. Si vous avez recu le present message par erreur, veuillez nous en aviser immediatement et
le detruire en vous abstenant d'en faire une copie, d'en divulguer le contenu ou d'y donner suite.
CONFIDENTIALITY NOTICE
This communication is intended for the exclusive use of the addressee identified above. Its content is confidential and may contain privileged information. If you have received this communication by error, please notify the sender and delete the message w
ithout copying or disclosing it.
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
| |
|
| Eric Lauzon wrote:
>Mabey later if this dosen't fix the problem , and as of information its
>7.4.6 [i know its not the most rescent]
>but it is the way it is right now and we suspect the problem might have
>come from a power outage while there was
>a full vacuum and the reason why its only one table that has been
>affected is probably because it was the table being vacummed,
>but this is only an assumption right now and more info will folow if the
>problems persis after a full restore.
>
>
>
Hrm, you know that you -should- upgrade to at least the latest 7.4
(7.4.13 I think is the most recent). looking from the changelogs, there
are a few bugs that you could be hitting;
7.4.10
* Fix race condition in transaction log management There was a
narrow window in which an I/O operation could be initiated for the wrong
page, leading to an Assert failure or data corruption.
7.4.9
* Improve checking for partially-written WAL pages
* Fix error that allowed VACUUM to remove ctid chains too soon, and
add more checking in code that follows ctid links. This fixes a
long-standing problem that could cause crashes in very rare circumstances.
7.4.8
* Repair race condition between relation extension and VACUUMThis
could theoretically have caused loss of a page's worth of
freshly-inserted data, although the scenario seems of very low
probability. There are no known cases of it having caused more than an
Assert failure
and these are only the ones that appear 'notably' in the changelog.
In short, I -really- -would- -strongly- -advise- you upgrading to
7.4.13. Personally, I would have made this my first step, especially if
your data is important.
There is no need for a dump/reload between minor point releases.
Although there is a security fix in 7.4.8.
Since the db is in a state of 'down' or repair, why not do it now ?
two birds, one stone.
Regards
Stef
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
| |
| Eric Lauzon 2006-03-29, 8:28 pm |
| > Hrm, you know that you -should- upgrade to at least the latest 7.4
> (7.4.13 I think is the most recent). looking from the
> changelogs, there are a few bugs that you could be hitting;
>
> 7.4.10
> * Fix race condition in transaction log management There
> was a narrow window in which an I/O operation could be
> initiated for the wrong page, leading to an Assert failure or
> data corruption.
>
> 7.4.9
> * Improve checking for partially-written WAL pages
> * Fix error that allowed VACUUM to remove ctid chains too
> soon, and add more checking in code that follows ctid links.
> This fixes a long-standing problem that could cause crashes
> in very rare circumstances.
>
> 7.4.8
> * Repair race condition between relation extension and
> VACUUMThis could theoretically have caused loss of a page's
> worth of freshly-inserted data, although the scenario seems
> of very low probability. There are no known cases of it
> having caused more than an Assert failure
>
> and these are only the ones that appear 'notably' in the
> changelog.
> In short, I -really- -would- -strongly- -advise- you
> upgrading to 7.4.13. Personally, I would have made this my
> first step, especially if your data is important.
>
> There is no need for a dump/reload between minor point releases.
> Although there is a security fix in 7.4.8.
>
> Since the db is in a state of 'down' or repair, why not
> do it now ?
> two birds, one stone.
Thank you , this might be a good solution , but we have a bigger upgrade
comming for 8.1.x later on,
but considering that other things out of our hands might occur , we
might seriously look into it after fixing
the current problems :) [because we dont think that upgrading right now
will magicly fix the problem we are having.]
And on about 10 database [all 7.4.6] it is the first time this occur ,
and the symtom is really on one table, considering
a few terabytes of data sparsed accros a few db, we might have been
lucky yet but as of now its the first time
we can see performance hit only on "delete".
But thanks alot for the hint. [even tho we never had some unexpected
data failure/crash] beside this out of control
human power failure that might have been the root of this [the database
is still dumping ...few gigs :)]
Thanks alot all for the help,and if we find the root cause we will give
feed back.
-elz
AVERTISSEMENT CONCERNANT LA CONFIDENTIALITE
Le present message est a l'usage exclusif du ou des destinataires mentionnes ci-dessus. Son contenu est confidentiel et peut etre assujetti au secret professionnel. Si vous avez recu le present message par erreur, veuillez nous en aviser immediatement et
le detruire en vous abstenant d'en faire une copie, d'en divulguer le contenu ou d'y donner suite.
CONFIDENTIALITY NOTICE
This communication is intended for the exclusive use of the addressee identified above. Its content is confidential and may contain privileged information. If you have received this communication by error, please notify the sender and delete the message w
ithout copying or disclosing it.
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
| |
| Josh Berkus 2006-03-30, 3:27 am |
| Eric,
> Thank you , this might be a good solution , but we have a bigger upgrade
> comming for 8.1.x later on,
> but considering that other things out of our hands might occur , we
> might seriously look into it after fixing
> the current problems :) [because we dont think that upgrading right now
> will magicly fix the problem we are having.]
It probably won't, but it will prevent a re-occurance before you get around to
the 8.1 upgrade. How much time have you wasted on this issue already, an
issue which might not have occurred if you'd kept up with patch releases? A
patch upgrade is what, 5 minutes of downtime?
> And on about 10 database [all 7.4.6] it is the first time this occur ,
> and the symtom is really on one table, considering
> a few terabytes of data sparsed accros a few db, we might have been
> lucky yet but as of now its the first time
> we can see performance hit only on "delete".
Well, that would be in line with the issues 7.4.7-7.4.12. All of them require
millesecond-timing to hit the bug. You're not likely to see it more than
once.
--
Josh Berkus
Aglio Database Solutions
San Francisco
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
|
|
|
|
|