Drop Table
Support Forum for database administrators and web based access to important newsgroups related to databasesI am using the following query which I found from a fragmanet of code by itzik ben-gan to assign a common group id for group of records in my case which have similar SSN and first Name and Last Name. if the SSN is the same it should also check the first name and last name of the record. Becuase records have more than three AKA names, I need to check all the possibilities of first name last name combination to verify the records are the same. This code works fine and can assign group numbers for all the rows. I am trying this code on a database of 65,000 rows. It's taking around 20 minute to complete. but I'll have to run the same code on 800,000,000 rows. It will take years to finish. even if the query is optimized to run in 1 second for 65,000 rows, it will take more than 4 hours to run on the 800,000,000 row. This where I realized I am in the "wrong jungle". 1. is there any other feasible and faster way to do this? very important issue. 2. while assigning group number, it doesn't give sequential numbers. It skips some of the numbers( group Number 1,2 5,9....) just curiouse( not very important) SELECT c1.fname, c1.lname, c1.ssn , c3.tu_id, (SELECT 1 + count(*) FROM distFLS AS c2 WHERE c2.ssn < c1.ssn or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) = substring(c1.fname,1,1) or substring(c2.lname,1,1) = substring(c1.lname,1,1) or substring(c2.fname,1,1) = substring(c1.lname,1,1) or substring(c2.lname,1,1) = substring(c1.fname,1,1)) )) AS grp_num into tmp_FLS FROM distFLS AS c1 JOIN tu_people_data AS c3 ON (c1.ssn = c3.ssn and c1.fname = c3.fname and c1.lname= c3.lname) GO distinct firstname, lastname and SSN table from the tu_people_data. I created this table to increase the query performance. CREATE TABLE [distFLS] ( [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL , [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL , [ssn] [int] NULL ) ON [PRIMARY] GO CREATE TABLE [TU_People_Data] ( [tu_id] [bigint] NOT NULL , [count_id] [int] NOT NULL , [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL , [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL , [ssn] [int] NULL , CONSTRAINT & #91;PK_tu_bulk_peopl e] PRIMARY KEY CLUSTERED ( [tu_id], [count_id] ) ON [PRIMARY] ) ON [PRIMARY] GO sample data there is a column count_id after the tu_id and before fname(tu_id and count_id are primary keys) tu_id fname lname SSN 156078480 KRISINA WALSH 999999000 156078480 KRISTINA GIERER 999999000 156078480 KRISTINA WALSH 999999000 151257883 J SOTO 999999111 151257883 JOSE LARIOS 999999111 151257883 JOSE SOTO 999999111 151257883 L SOTO 999999111 136312525 ELADIO GARCIA 999999222 136312525 ELADIO NAVA 999999222 136312525 ELADIO NAVAGARCIA 999999222 136312525 GARCIA NAVA 999999222 149180940 DARREN SAUERWINE 999999333 149180940 DARREN SUAERWIN 999999333
Post Follow-up to this messageBack to basics. The most commn errors in numeric codes liek SSN are 1) missing digit 2) extra digit 3) one wrong digit 4) Pairwise transpose For names, use Metaphone and write it in a better 3GL than T-SQL.
Post Follow-up to this messagei'm not a huge fan of the method of group number assignment. seems pretty inefficient to do a row count every time. Try this; select distinct left(c1.fname,1) as fname, left(c1.lname,1) as lname, c1.ssn into tmpgroups from tu_people_data Using SQL, add a new column to tmpgroups which is an identity column named tu_id I'm a little puzzled on the business logic. It looks like the records are identical if the ssn is the same, and the first letter of first name OR first letter of last name is same. Is this what you want?
Post Follow-up to this messageyou can add the identity in one shot select distinct identity(int,1,1) as tu_id,left(c1.fname,1) as fname, left(c1.lname,1) as lname, c1.ssn into tmpgroups from tu_people_data http://sqlservercode.blogspot.com/
Post Follow-up to this message(jacob.dba@gmail.com) writes: > I am using the following query which I found from a fragmanet of code > by itzik ben-gan to assign a common group id for group of records in my > case which have similar SSN and first Name and Last Name. if the SSN is > the same it should also check the first name and last name of the > record. Becuase records have more than three AKA names, I need to check > all the possibilities of first name last name combination to verify the > > records are the same. > This code works fine and can assign group numbers for all the rows. > I am trying this code on a database of 65,000 rows. It's taking around > 20 minute to complete. but I'll have to run the same code on > 800,000,000 rows. > It will take years to finish. > even if the query is optimized to run in 1 second for 65,000 rows, it > will take more than 4 hours to run on the 800,000,000 row. This where I > realized I am in the "wrong jungle". I hope that the query is not going to be run in a regular fashion on those 800 million rows, but once you are there, it will be a one-off. For the problem as given it sounds like a nightmare to process 800 million rows. > SELECT c1.fname, c1.lname, c1.ssn , c3.tu_id, > (SELECT 1 + count(*) > FROM distFLS AS c2 > WHERE c2.ssn < c1.ssn > or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) = > substring(c1.fname,1,1) or substring(c2.lname,1,1) = > substring(c1.lname,1,1) > or substring(c2.fname,1,1) = > substring(c1.lname,1,1) or substring(c2.lname,1,1) = > substring(c1.fname,1,1)) > )) AS grp_num > into tmp_FLS > FROM distFLS AS c1 > JOIN tu_people_data AS c3 > ON (c1.ssn = c3.ssn and > c1.fname = c3.fname and > c1.lname= c3.lname) > GO Your definition of distFLS does not have an index, at least you did not post one. A clustered index on ssn would be a good start. In the same vein, add an index on (ssn, fnmae, lname) on TU_PeopleData. That may at least speed up your test case on 65000 rows. Although, you probably need more tweaks to do the 800 million. -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/pr...oads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodin...ions/books.mspx
Post Follow-up to this messageI have a table with first name, last name, SSN and other columns. I want to assign group number according to this business logic. 1. Records with equal SSN and (similar first name or last name) belong to the same group. John Smith 1234 Smith John 1234 S John 1234 J Smith 1234 John Smith and Smith John falls in the same group Number as long as they have similar SSN. This is because I have a record of equal SSN but the first name and last name is switched because of people who make error inserting last name as first name and vice versa. John Smith and Smith John will have equal group Name if they have equal SSN. 2. There are records with equal SSN but different first name and last name. These belong to different group numbers. Equal SSN doesn't guarantee equal group number, at least one of the first name or last name should be the same. John Smith and Dan Brown with equal SSN=1234 shouldn't fall in the same group number. Sample data: Id Fname lname SSN grpNum 1 John Smith 1234 1 2 Smith John 1234 1 3 S John 1234 1 4 J Smith 1234 1 5 J S 1234 1 6 Dan Brown 1234 2 7 John Smith 1111 3 I have tried this code for 65,000 rows. It took 20 minute. I have to run it for 21 million row data. It will take years. INSERT into temp_FnLnSSN_grp SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id, (SELECT 1 + count(*) FROM distFLS AS c2 WHERE c2.ssn < c1.ssn or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) = substring(c1.fname,1,1) or substring(c2.lname,1,1) = substring(c1.lname,1,1) or substring(c2.fname,1,1) = substring(c1.lname,1,1) or substring(c2.lname,1,1) = substring(c1.fname,1,1)) )) AS group_number FROM distFLS AS c1 JOIN tu_people_data AS c3 ON (c1.ssn = c3.ssn and c1.fname = c3.fname and c1.lname= c3.lname) dist FLS is distinct First Name, last Name and SSN table from the people table. Doug wrote: > i'm not a huge fan of the method of group number assignment. seems > pretty inefficient to do a row count every time. > > Try this; > select distinct left(c1.fname,1) as fname, left(c1.lname,1) as lname, > c1.ssn into tmpgroups from tu_people_data > > Using SQL, add a new column to tmpgroups which is an identity column > named tu_id > > I'm a little puzzled on the business logic. It looks like the records > are identical if the ssn is the same, and the first letter of first > name OR first letter of last name is same. Is this what you want?
Post Follow-up to this messageI have a table with first name, last name, SSN and other columns. I want to assign group number according to this business logic. 1. Records with equal SSN and (similar first name or last name) belong to the same group. John Smith 1234 Smith John 1234 S John 1234 J Smith 1234 John Smith and Smith John falls in the same group Number as long as they have similar SSN. This is because I have a record of equal SSN but the first name and last name is switched because of people who make error inserting last name as first name and vice versa. John Smith and Smith John will have equal group Name if they have equal SSN. 2. There are records with equal SSN but different first name and last name. These belong to different group numbers. Equal SSN doesn't guarantee equal group number, at least one of the first name or last name should be the same. John Smith and Dan Brown with equal SSN=1234 shouldn't fall in the same group number. Sample data: Id Fname lname SSN grpNum 1 John Smith 1234 1 2 Smith John 1234 1 3 S John 1234 1 4 J Smith 1234 1 5 J S 1234 1 6 Dan Brown 1234 2 7 John Smith 1111 3 I have tried this code for 65,000 rows. It took 20 minute. I have to run it for 21 million row data. It will take years. INSERT into temp_FnLnSSN_grp SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id, (SELECT 1 + count(*) FROM distFLS AS c2 WHERE c2.ssn < c1.ssn or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) = substring(c1.fname,1,1) or substring(c2.lname,1,1) = substring(c1.lname,1,1) or substring(c2.fname,1,1) = substring(c1.lname,1,1) or substring(c2.lname,1,1) = substring(c1.fname,1,1)) )) AS group_number FROM distFLS AS c1 JOIN tu_people_data AS c3 ON (c1.ssn = c3.ssn and c1.fname = c3.fname and c1.lname= c3.lname) dist FLS is distinct First Name, last Name and SSN table from the people table. Doug wrote: > i'm not a huge fan of the method of group number assignment. seems > pretty inefficient to do a row count every time. > > Try this; > select distinct left(c1.fname,1) as fname, left(c1.lname,1) as lname, > c1.ssn into tmpgroups from tu_people_data > > Using SQL, add a new column to tmpgroups which is an identity column > named tu_id > > I'm a little puzzled on the business logic. It looks like the records > are identical if the ssn is the same, and the first letter of first > name OR first letter of last name is same. Is this what you want?
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread