Home > Archive > PostgreSQL SQL > February 2006 > Non Matching Records in Two Tables









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Non Matching Records in Two Tables
Ken Hill

2006-02-08, 8:26 pm

I need some help with a bit of SQL. I have two tables. I want to find
records in one table that don't match records in another table based on
a common column in the two tables. Both tables have a column named
'key100'. I was trying something like:

SELECT count(*)
FROM table1, table2
WHERE (table1.key100 != table2.key100);

But the query is very slow and I finally just cancel it. Any help is
very much appreciated.

-Ken

Frank Bax

2006-02-08, 8:26 pm

At 04:10 PM 2/8/06, Ken Hill wrote:
>I need some help with a bit of SQL. I have two tables. I want to find
>records in one table that don't match records in another table based on a
>common column in the two tables. Both tables have a column named 'key100'.
>I was trying something like:
>
>SELECT count(*)
>FROM table1, table2
>WHERE (table1.key100 != table2.key100);
>
>But the query is very slow and I finally just cancel it. Any help is very
>much appreciated.



vacuum analyse table1;

vacuum analyse table2;

select count(*) from table1 full outer join table2 on
table1.key100=table2.key100 where table1.key100 is null or table2.key100 is
null;

If this is also slow, post output of "EXPLAIN ANALYSE SELECT ...."


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql
.org so that your
message can get through to the mailing list cleanly

Markus Schaber

2006-02-08, 8:26 pm

Hi, Ken,

Ken Hill schrieb:
> I need some help with a bit of SQL. I have two tables. I want to find
> records in one table that don't match records in another table based on
> a common column in the two tables. Both tables have a column named
> 'key100'. I was trying something like:
>
> SELECT count(*)
> FROM table1, table2
> WHERE (table1.key100 != table2.key100);
>
> But the query is very slow and I finally just cancel it. Any help is
> very much appreciated.


Do you have indices on the key100 columns? Is autovacuum running, or do
you do analyze manually?

Can you send us the output from "EXPLAIN ANALYZE [your query]"?

Btw, I don't think this query will do what you wanted, it basically
creates a cross product, that means if your tables look like:

schabitest=# select * from table1;
key100 | valuea | valueb
--------+--------+--------
1 | foo | bar
2 | blah | blubb
3 | manga | mungo

schabitest=# select * from table2;
key100 | valuec | valued
--------+--------+--------
1 | monday | euro
2 | sunday | dollar
4 | friday | pounds

Then your query will produce something like:
schabitest=# select * from table1, table2 WHERE (table1.key100 !=
table2.key100);
key100 | valuea | valueb | key100 | valuec | valued
--------+--------+--------+--------+--------+--------
1 | foo | bar | 2 | sunday | dollar
1 | foo | bar | 4 | friday | pounds
2 | blah | blubb | 1 | monday | euro
2 | blah | blubb | 4 | friday | pounds
3 | manga | mungo | 1 | monday | euro
3 | manga | mungo | 2 | sunday | dollar
3 | manga | mungo | 4 | friday | pounds

I suggest you would like to have all records from table1 that don't have
a corresponding record in table2:

schabitest=# select * from table1 where table1.key100 not in (select
key100 from table2);
key100 | valuea | valueb
--------+--------+--------
3 | manga | mungo

HTH,
Markus


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql
.org so that your
message can get through to the mailing list cleanly

Owen Jacobson

2006-02-08, 8:26 pm

Ken Hill wrote:

> I need some help with a bit of SQL. I have two tables. I want
> to find records in one table that don't match records in another
> table based on a common column in the two tables. Both tables
> have a column named 'key100'. I was trying something like:
>
> SELECT count(*)
> FROM table1, table2
> WHERE (table1.key100 != table2.key100);
>
> But the query is very slow and I finally just cancel it. Any help
> is very much appreciated.


That's a cartesian join, there, and it'll be huge (on the order of N*M rows, where N and M are the number of rows in the first and second tables respectively).

It sounds like, from your description, you want to find rows in table1 that don't have a corresponding row in table2. This should work:

SELECT count(*)
FROM table1 LEFT JOIN table2 ON table1.key100 = table2.key100
WHERE table2.key100 IS NULL;

This will still be fairly slow unless there are indexes on table1.key100 and table2.key100, but nowhere near as slow as the original query.

Frank Bax's solution will work if what you want is a count of rows in table1 that don't have a corresponding row in table2 or in table2 that don't have a corresponding row in table1; for that specific requirement you may actually be better off doing two q
ueries (one for each table) and adding the results together.

-Owen

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql
.org so that your
message can get through to the mailing list cleanly

Ken Hill

2006-02-08, 8:26 pm

On Wed, 2006-02-08 at 16:27 -0500, Frank Bax wrote:

> At 04:10 PM 2/8/06, Ken Hill wrote:
>
>
> vacuum analyse table1;
>
> vacuum analyse table2;
>
> select count(*) from table1 full outer join table2 on
> table1.key100=table2.key100 where table1.key100 is null or table2.key100 is
> null;
>
> If this is also slow, post output of "EXPLAIN ANALYSE SELECT ...."
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql
.org so that your
> message can get through to the mailing list cleanly



OK. I added indexes on the two columns in the two tables:

CREATE INDEX key100_idex
ON ncccr9 (key100);

CREATE INDEX key100_ncccr10_idex
ON ncccr10 (key100);

Here is the analysis of the query:

csalgorithm=# EXPLAIN ANALYSE SELECT count(*)
csalgorithm-# FROM ncccr9 LEFT JOIN ncccr10 ON ncccr9.key100 =
ncccr10.key100
csalgorithm-# WHERE ncccr10.key100 IS NULL;
QUERY
PLAN
-------------------------------------------------------------------------------- --------------------------------------------------------
Aggregate (cost=208337.59..208337.59 rows=1 width=0) (actual
time=255723.212.. 255723.214 rows=1 loops=1)
-> Hash Left Join (cost=99523.55..207101.41 rows=494471 width=0)
(actual ti me=92326.635..255538.447 rows=38144 loops=1)
Hash Cond: ("outer".key100 = "inner".key100)
Filter: ("inner".key100 IS NULL)
-> Seq Scan on ncccr9 (cost=0.00..59360.71 rows=494471
width=104) (ac tual time=171.778..75099.734 rows=494471 loops=1)
-> Hash (cost=88438.64..88438.64 rows=611564 width=104)
(actual time= 91962.956..91962.956 rows=0 loops=1)
-> Seq Scan on ncccr10 (cost=0.00..88438.64 rows=611564
width=1 04) (actual time=11.704..76519.323 rows=611564 loops=1)
Total runtime: 255724.219 ms
(8 rows)

The result of 38,144 non-matching records seems too much:

csalgorithm=# SELECT count(*)
csalgorithm-# FROM ncccr9 LEFT JOIN ncccr10 ON ncccr9.key100 =
ncccr10.key100
csalgorithm-# WHERE ncccr10.key100 IS NULL;
count
-------
38144
(1 row)

Maybe I need to do a RIGHT JOIN to return the count of records in table
'ncccr10' that don't match records in 'ncccr9'?

Thanks for your help. JOINS are fairly new to me.


Patrick JACQUOT

2006-02-09, 3:24 am

Ken Hill wrote:

> I need some help with a bit of SQL. I have two tables. I want to find
> records in one table that don't match records in another table based
> on a common column in the two tables. Both tables have a column named
> 'key100'. I was trying something like:
>
> SELECT count(*)
> FROM table1, table2
> WHERE (table1.key100 != table2.key100);
>
> But the query is very slow and I finally just cancel it. Any help is
> very much appreciated.
>
> -Ken


Maybe you could use a NOT EXISTS subquery, as in
SELECT count(*) from table1
WHERE NOT EXISTS(SELECT count(*) from table2 WHERE table1.key100
=table2.key100)
which gives you the number of records in table1 without corresponding
records in table2.
That kind of query is quite fast, if there exists an index on table2.key100
hth
P.Jacquot

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

BigSmoke

2006-02-09, 11:24 am

You can use an EXCEPT clause.

Ken Hill

2006-02-14, 8:25 pm

On Wed, 2006-02-08 at 22:31 +0100, Markus Schaber wrote:

> Hi, Ken,
>
> Ken Hill schrieb:
>
> Do you have indices on the key100 columns? Is autovacuum running, or do
> you do analyze manually?
>
> Can you send us the output from "EXPLAIN ANALYZE [your query]"?
>
> Btw, I don't think this query will do what you wanted, it basically
> creates a cross product, that means if your tables look like:
>
> schabitest=# select * from table1;
> key100 | valuea | valueb
> --------+--------+--------
> 1 | foo | bar
> 2 | blah | blubb
> 3 | manga | mungo
>
> schabitest=# select * from table2;
> key100 | valuec | valued
> --------+--------+--------
> 1 | monday | euro
> 2 | sunday | dollar
> 4 | friday | pounds
>
> Then your query will produce something like:
> schabitest=# select * from table1, table2 WHERE (table1.key100 !=
> table2.key100);
> key100 | valuea | valueb | key100 | valuec | valued
> --------+--------+--------+--------+--------+--------
> 1 | foo | bar | 2 | sunday | dollar
> 1 | foo | bar | 4 | friday | pounds
> 2 | blah | blubb | 1 | monday | euro
> 2 | blah | blubb | 4 | friday | pounds
> 3 | manga | mungo | 1 | monday | euro
> 3 | manga | mungo | 2 | sunday | dollar
> 3 | manga | mungo | 4 | friday | pounds
>
> I suggest you would like to have all records from table1 that don't have
> a corresponding record in table2:
>
> schabitest=# select * from table1 where table1.key100 not in (select
> key100 from table2);
> key100 | valuea | valueb
> --------+--------+--------
> 3 | manga | mungo
>
> HTH,
> Markus
>


Here is my query SQL:

SELECT key100 FROM ncccr10
WHERE ncccr10.key100 NOT IN (SELECT key100 FROM ncccr9);

It is is running after 30 minutes. Here is the query plan:

QUERY PLAN
-------------------------------------------------------------------------
Seq Scan on ncccr10 (cost=0.00..20417160510.08 rows=305782 width=104)
Filter: (NOT (subplan))
SubPlan
-> Seq Scan on ncccr9 (cost=0.00..65533.71 rows=494471 width=104)
(4 rows)

Any ideas why it is so slow?

Tom Lane

2006-02-14, 8:25 pm

Ken Hill <ken@scottshill.com> writes:
> Seq Scan on ncccr10 (cost=0.00..20417160510.08 rows=305782 width=104)
> Filter: (NOT (subplan))
> SubPlan
> -> Seq Scan on ncccr9 (cost=0.00..65533.71 rows=494471 width=104)
> (4 rows)


> Any ideas why it is so slow?


"NOT (subplan)" is horrendous (and the system knows it, note the huge
cost estimate). Try increasing work_mem enough so you get a hashed
subplan instead.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

chester c young

2006-02-14, 8:25 pm


> Here is my query SQL:
>
> SELECT key100 FROM ncccr10
> WHERE ncccr10.key100 NOT IN (SELECT key100 FROM ncccr9);
>
> It is is running after 30 minutes. Here is the query plan:
>


I would try an outer join:

select a.key100
from ncccr10 a
left join ncccr9 b on( key100 )
where b.key100 is null;

also (hate to be obvious) have you analyzed lately?


____________________
____________________
__________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql
.org so that your
message can get through to the mailing list cleanly

Ken Hill

2006-02-14, 8:25 pm

On Tue, 2006-02-14 at 13:08 -0800, chester c young wrote:

>
> I would try an outer join:
>
> select a.key100
> from ncccr10 a
> left join ncccr9 b on( key100 )
> where b.key100 is null;
>
> also (hate to be obvious) have you analyzed lately?
>
>
> ____________________
____________________
__________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql
.org so that your
> message can get through to the mailing list cleanly


Yes, I did a VACUUM ANALYZE on the entire database before running this
query. Also, I get an error with your suggestion:

csalgorithm=# SELECT a.key100 FROM ncccr10 a
csalgorithm-# LEFT JOIN ncccr9 b ON(a.key100)
csalgorithm-# WHERE b.key100 IS Null;
ERROR: argument of JOIN/ON must be type boolean, not type character


Ken Hill

2006-02-14, 8:25 pm

On Tue, 2006-02-14 at 16:07 -0500, Tom Lane wrote:

> Ken Hill <ken@scottshill.com> writes:
>
>
> "NOT (subplan)" is horrendous (and the system knows it, note the huge
> cost estimate). Try increasing work_mem enough so you get a hashed
> subplan instead.
>
> regards, tom lane



How do I ncrease work_mem?

Ken Hill

2006-02-14, 8:25 pm

On Tue, 2006-02-14 at 13:24 -0800, Ken Hill wrote:

> On Tue, 2006-02-14 at 13:08 -0800, chester c young wrote:
>
>
> Yes, I did a VACUUM ANALYZE on the entire database before running this
> query. Also, I get an error with your suggestion:
>
> csalgorithm=# SELECT a.key100 FROM ncccr10 a
> csalgorithm-# LEFT JOIN ncccr9 b ON(a.key100)
> csalgorithm-# WHERE b.key100 IS Null;
> ERROR: argument of JOIN/ON must be type boolean, not type character
>


Well, this works:

SELECT *
FROM ncccr9 a
LEFT JOIN ncccr10 b USING( key100 )
WHERE b.key100 is null;

It still seems slow. It takes about the same time to run as in MS
Access. I thought PostgreSQL would be faster.

Ken Hill

2006-02-15, 11:24 am

On Tue, 2006-02-14 at 15:05 -0800, Bryce Nesbitt wrote:

> Ken Hill wrote:
> I'd say that's fair game, not obvious. Vacuum/Analyze is ar so aparent
> to a person moving
> to Postgres from other DB's.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org


In my O'Reily "Practical PostgreSQL" book it recommends vacuuming a
production database on a daily basis. I liken this to MS Access'
"compact/repair" procedure.

Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com