Drop Table

Support Forum for database administrators and web based access to important newsgroups related to databases
Register on Database Support Forum Edit your profileCalendarFind other Database Support forum membersFrequently Asked QuestionsSearch this forum -> 
For Database admins: Free Database-related Magazines Now Free shipping to Texas


Post New Thread










Thread
Author

itzik ben-gan
I am using the following query which I found from a fragmanet of code
by itzik ben-gan to assign a common group id for group of records in my

case which have similar SSN and first Name and Last Name. if the SSN is

the same it should also check the first name and last name of the
record. Becuase records have more than three AKA names, I need to check

all the possibilities of first name last name combination to verify the

records are the same.
This code works fine and can assign group numbers for all the rows.
I am trying this code on a database of 65,000 rows. It's taking around
20 minute to complete. but I'll have to run the same code on
800,000,000 rows.
It will take years to finish.
even if the query is optimized to run in 1 second for 65,000 rows, it
will take more than 4 hours to run on the 800,000,000 row. This where I

realized I am in the "wrong jungle".
1. is there any other feasible and faster way to do this? very
important issue.
2. while assigning group number, it doesn't give sequential numbers. It

skips some of the numbers( group Number 1,2 5,9....)  just curiouse(
not very important)

SELECT  c1.fname, c1.lname, c1.ssn , c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or  substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS grp_num
into tmp_FLS
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)
GO


distinct firstname, lastname and SSN table from the tu_people_data.
I created this table to increase the query performance.


CREATE TABLE [distFLS] (
[fname] [varchar] (32) COLLATE  Latin1_General_CI_AS
 NULL ,
[lname] [varchar] (32) COLLATE  Latin1_General_CI_AS
 NULL ,
[ssn] [int] NULL
) ON [PRIMARY]
GO


CREATE TABLE [TU_People_Data] (
[tu_id] [bigint] NOT NULL ,
[count_id] [int] NOT NULL ,
[fname] [varchar] (32) COLLATE  Latin1_General_CI_AS
 NULL ,
[lname] [varchar] (32) COLLATE  Latin1_General_CI_AS
 NULL ,
[ssn] [int] NULL ,
CONSTRAINT & #91;PK_tu_bulk_peopl
e] PRIMARY KEY  CLUSTERED
(
[tu_id],
[count_id]
)  ON [PRIMARY]
) ON [PRIMARY]
GO


sample data
there is a column count_id after the tu_id and before fname(tu_id and
count_id are primary keys)
tu_id         fname       lname       SSN
156078480       KRISINA WALSH   999999000
156078480       KRISTINA        GIERER  999999000
156078480       KRISTINA        WALSH   999999000
151257883       J       SOTO    999999111
151257883       JOSE    LARIOS  999999111
151257883       JOSE    SOTO    999999111
151257883       L       SOTO    999999111
136312525       ELADIO  GARCIA  999999222
136312525       ELADIO  NAVA    999999222
136312525       ELADIO  NAVAGARCIA      999999222
136312525       GARCIA  NAVA    999999222
149180940       DARREN  SAUERWINE       999999333
149180940       DARREN  SUAERWIN        999999333


Report this thread to moderator Post Follow-up to this message
Old Post
jacob.dba@gmail.com
03-07-06 12:17 AM


Re: itzik ben-gan
Back to basics.  The most commn errors in numeric codes liek SSN are
1) missing digit
2) extra digit
3) one wrong digit
4) Pairwise transpose

For names, use Metaphone and write it in a better 3GL than T-SQL.


Report this thread to moderator Post Follow-up to this message
Old Post
--CELKO--
03-07-06 01:23 AM


Re: itzik ben-gan
i'm not a huge fan of the method of group number assignment. seems
pretty inefficient to do a row count every time.

Try this;
select distinct left(c1.fname,1) as fname, left(c1.lname,1) as lname,
c1.ssn into tmpgroups from tu_people_data

Using SQL, add a new column to tmpgroups which is an identity column
named tu_id

I'm a little puzzled on the business logic. It looks like the records
are identical if the ssn is the same, and the first letter of first
name OR first letter of last name is same. Is this what you want?


Report this thread to moderator Post Follow-up to this message
Old Post
Doug
03-07-06 02:24 PM


Re: itzik ben-gan
you can add the identity in one shot
select distinct identity(int,1,1) as tu_id,left(c1.fname,1) as fname,
left(c1.lname,1) as lname,
c1.ssn into tmpgroups from tu_people_data


http://sqlservercode.blogspot.com/


Report this thread to moderator Post Follow-up to this message
Old Post
SQL
03-07-06 02:24 PM


Re: itzik ben-gan
(jacob.dba@gmail.com) writes:

> I am using the following query which I found from a fragmanet of code
> by itzik ben-gan to assign a common group id for group of records in my
> case which have similar SSN and first Name and Last Name. if the SSN is
> the same it should also check the first name and last name of the
> record. Becuase records have more than three AKA names, I need to check
> all the possibilities of first name last name combination to verify the
>
> records are the same.
> This code works fine and can assign group numbers for all the rows.
> I am trying this code on a database of 65,000 rows. It's taking around
> 20 minute to complete. but I'll have to run the same code on
> 800,000,000 rows.
> It will take years to finish.
> even if the query is optimized to run in 1 second for 65,000 rows, it
> will take more than 4 hours to run on the 800,000,000 row. This where I
> realized I am in the "wrong jungle".

I hope that the query is not going to be run in a regular fashion on
those 800 million rows, but once you are there, it will be a one-off.

For the problem as given it sounds like a nightmare to process 800
million rows.

> SELECT  c1.fname, c1.lname, c1.ssn , c3.tu_id,
>          (SELECT 1 + count(*)
>           FROM distFLS AS c2
>           WHERE c2.ssn < c1.ssn
>           or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
>                                or  substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
>               )) AS grp_num
> into tmp_FLS
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
>        c1.fname = c3.fname and
>        c1.lname= c3.lname)
> GO

Your definition of distFLS does not have an index, at least you did not
post one. A clustered index on ssn would be a good start. In the same
vein, add an index on (ssn, fnmae, lname) on TU_PeopleData.

That may at least speed up your test case on 65000 rows. Although, you
probably need more tweaks to do the 800 million.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

Report this thread to moderator Post Follow-up to this message
Old Post
Erland Sommarskog
03-08-06 01:24 AM


Re: itzik ben-gan
I have a table with first name, last name, SSN and other columns.
I want to assign group number according to this business logic.
1. Records with equal SSN and (similar first name or last name) belong
to the same group.
John      Smith    1234
Smith    John     1234
S           John    1234
J           Smith  1234
John Smith and Smith John falls in the same group Number as long as
they have similar SSN.
This is because I have a record of equal SSN but the first name and
last name is switched because of people who make error inserting last
name as first name and vice versa. John Smith and Smith John will have
equal group Name if they have equal SSN.
2. There are records with equal SSN but different first name and last
name. These belong to different group numbers.
Equal SSN doesn't guarantee equal group number, at least one of the
first name or last name should be the same. John Smith and Dan Brown
with equal SSN=1234 shouldn't fall in the same group number.

Sample data:
Id Fname    lname    SSN   grpNum
1  John      Smith    1234  1
2  Smith    John     1234   1
3  S           John    1234     1
4  J           Smith  1234      1
5  J            S         1234      1
6  Dan      Brown  1234     2
7  John     Smith  1111      3


I have tried this code for 65,000 rows. It took 20 minute. I have to
run it for 21 million row data. It will take years.


INSERT into temp_FnLnSSN_grp
SELECT  c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or  substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS group_number
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)


dist FLS is distinct First Name, last Name and SSN table from the
people table.


Doug  wrote:
> i'm not a huge fan of the method of group number assignment. seems
> pretty inefficient to do a row count every time.
>
> Try this;
> select distinct left(c1.fname,1) as fname, left(c1.lname,1) as lname,
> c1.ssn into tmpgroups from tu_people_data
>
> Using SQL, add a new column to tmpgroups which is an identity column
> named tu_id
>
> I'm a little puzzled on the business logic. It looks like the records
> are identical if the ssn is the same, and the first letter of first
> name OR first letter of last name is same. Is this what you want?


Report this thread to moderator Post Follow-up to this message
Old Post
jacob.dba@gmail.com
03-27-06 06:32 PM


Re: itzik ben-gan
I have a table with first name, last name, SSN and other columns.
I want to assign group number according to this business logic.
1. Records with equal SSN and (similar first name or last name) belong
to the same group.
John      Smith    1234
Smith    John     1234
S           John    1234
J           Smith  1234
John Smith and Smith John falls in the same group Number as long as
they have similar SSN.
This is because I have a record of equal SSN but the first name and
last name is switched because of people who make error inserting last
name as first name and vice versa. John Smith and Smith John will have
equal group Name if they have equal SSN.
2. There are records with equal SSN but different first name and last
name. These belong to different group numbers.
Equal SSN doesn't guarantee equal group number, at least one of the
first name or last name should be the same. John Smith and Dan Brown
with equal SSN=1234 shouldn't fall in the same group number.

Sample data:
Id Fname    lname    SSN   grpNum
1  John      Smith    1234  1
2  Smith    John     1234   1
3  S           John    1234     1
4  J           Smith  1234      1
5  J            S         1234      1
6  Dan      Brown  1234     2
7  John     Smith  1111      3


I have tried this code for 65,000 rows. It took 20 minute. I have to
run it for 21 million row data. It will take years.


INSERT into temp_FnLnSSN_grp
SELECT  c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or  substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS group_number
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)


dist FLS is distinct First Name, last Name and SSN table from the
people table.


Doug  wrote:
> i'm not a huge fan of the method of group number assignment. seems
> pretty inefficient to do a row count every time.
>
> Try this;
> select distinct left(c1.fname,1) as fname, left(c1.lname,1) as lname,
> c1.ssn into tmpgroups from tu_people_data
>
> Using SQL, add a new column to tmpgroups which is an identity column
> named tu_id
>
> I'm a little puzzled on the business logic. It looks like the records
> are identical if the ssn is the same, and the first letter of first
> name OR first letter of last name is same. Is this what you want?


Report this thread to moderator Post Follow-up to this message
Old Post
jacob.dba@gmail.com
03-27-06 06:32 PM


Sponsored Links





Last Thread Next Thread
Post New Thread

Microsoft SQL Server forum archive

Show a Printable Version Email This Page to Someone! Receive updates to this thread
Microsoft SQL Server
Access database support
PostgreSQL Replication
SQL Server ODBC
FoxPro Support
PostgreSQL pgAdmin
SQL Server Clustering
MySQL ODBC
Web Applications with dBASE
SQL Server CE
MySQL++
Sybase Database Support
MS SQL Full Text Search
PostgreSQL Administration
SQL Anywhere support
DB2 UDB Database
Paradox Database Support
Filemaker Database
Berkley DB
SQL 2000/2000i database
ASE Database
Forum Jump:
All times are GMT. The time now is 08:10 AM.

 
Mobile devices forum | Database support forum archive




Copyrights DropTable.com Database Support Forum 2004 - 2006