Drop Table
Support Forum for database administrators and web based access to important newsgroups related to databasesI have a table with first name, last name, SSN(social security number) and other columns. I want to assign group number according to this business logic. 1. Records with equal SSN and (similar first name or last name) belong to the same group. John Smith 1234 Smith John 1234 S John 1234 J Smith 1234 John Smith and Smith John falls in the same group Number as long as they have similar SSN. This is because I have a record of equal SSN but the first name and last name is switched because of people who make error inserting last name as first name and vice versa. John Smith and Smith John will have equal group Name if they have equal SSN. 2. There are records with equal SSN but different first name and last name. These belong to different group numbers. Equal SSN doesn't guarantee equal group number, at least one of the first name or last name should be the same. John Smith and Dan Brown with equal SSN=1234 shouldn't fall in the same group number. Sample data: Id Fname lname SSN grpNum 1 John Smith 1234 1 2 Smith John 1234 1 3 S John 1234 1 4 J Smith 1234 1 5 J S 1234 1 6 Dan Brown 1234 2 7 John Smith 1111 3 I have tried this code for 65,000 rows. It took 20 minute. I have to run it for 21 million row data. I now that this is not an efficient code. INSERT into temp_FnLnSSN_grp SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id, (SELECT 1 + count(*) FROM distFLS AS c2 WHERE c2.ssn < c1.ssn or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) = substring(c1.fname,1,1) or substring(c2.lname,1,1) = substring(c1.lname,1,1) or substring(c2.fname,1,1) = substring(c1.lname,1,1) or substring(c2.lname,1,1) = substring(c1.fname,1,1)) )) AS group_number FROM distFLS AS c1 JOIN tu_people_data AS c3 ON (c1.ssn = c3.ssn and c1.fname = c3.fname and c1.lname= c3.lname) dist FLS is distinct First Name, last Name and SSN table from the people table. I have posted part of this question, schema one week ago. Please refer this thread. http://groups.google.com/group/comp...6eb380b5f2e6de6
Post Follow-up to this messageI forgot to mention that some of the records have middle name entered in place of first name or last name. fname mname lname ssn John coleman smith 1234 john smith coleman 1234 john S coleman 1234 John C Smith 1234 John Smith 1234 John-coleman Smith 1234 Smith John 1234 During the grouping process I am concerned only about fname,lname, ssn.(no need of middle name). If there is other suggestion to include columns I am happy to accept. I have the idea to assign groups if one of the initial of the names is similar with the others considering that the SSN is the same. that means if SSN is equal and if J or S or C are there as an initial in the names, we can say they are in the same group. Reply jacob.dba@gmail.com wrote: > I have a table with first name, last name, SSN(social security number) > and other columns. > I want to assign group number according to this business logic. > 1. Records with equal SSN and (similar first name or last name) belong > to the same group. > John Smith 1234 > Smith John 1234 > S John 1234 > J Smith 1234 > John Smith and Smith John falls in the same group Number as long as > they have similar SSN. > This is because I have a record of equal SSN but the first name and > last name is switched because of people who make error inserting last > name as first name and vice versa. John Smith and Smith John will have > equal group Name if they have equal SSN. > 2. There are records with equal SSN but different first name and last > name. These belong to different group numbers. > Equal SSN doesn't guarantee equal group number, at least one of the > first name or last name should be the same. John Smith and Dan Brown > with equal SSN=1234 shouldn't fall in the same group number. > > Sample data: > Id Fname lname SSN grpNum > 1 John Smith 1234 1 > 2 Smith John 1234 1 > 3 S John 1234 1 > 4 J Smith 1234 1 > 5 J S 1234 1 > 6 Dan Brown 1234 2 > 7 John Smith 1111 3 > > > I have tried this code for 65,000 rows. It took 20 minute. I have to > run it for 21 million row data. I now that this is not an efficient > code. > > > INSERT into temp_FnLnSSN_grp > SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id, > (SELECT 1 + count(*) > FROM distFLS AS c2 > WHERE c2.ssn < c1.ssn > or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) = > substring(c1.fname,1,1) or substring(c2.lname,1,1) = > substring(c1.lname,1,1) > or substring(c2.fname,1,1) = > substring(c1.lname,1,1) or substring(c2.lname,1,1) = > substring(c1.fname,1,1)) > )) AS group_number > FROM distFLS AS c1 > JOIN tu_people_data AS c3 > ON (c1.ssn = c3.ssn and > c1.fname = c3.fname and > c1.lname= c3.lname) > > > dist FLS is distinct First Name, last Name and SSN table from the > people table. > > I have posted part of this question, schema one week ago. Please refer > this thread. > > http://groups.google.com/group/comp...f2e6 de6
Post Follow-up to this message(jacob.dba@gmail.com) writes: > I want to assign group number according to this business logic. > 1. Records with equal SSN and (similar first name or last name) belong > to the same group. > John Smith 1234 > Smith John 1234 > S John 1234 > J Smith 1234 > John Smith and Smith John falls in the same group Number as long as > they have similar SSN. > This is because I have a record of equal SSN but the first name and > last name is switched because of people who make error inserting last > name as first name and vice versa. John Smith and Smith John will have > equal group Name if they have equal SSN. > 2. There are records with equal SSN but different first name and last > name. These belong to different group numbers. > Equal SSN doesn't guarantee equal group number, at least one of the > first name or last name should be the same. John Smith and Dan Brown > with equal SSN=1234 shouldn't fall in the same group number. What if you have both John Smith and Southerland Jane? Are the same person or not? This looks like a very difficult task, and the fact that you have 800 million rows certainly does not help to make it easier. I think you need to scrap the idea you got from Itzik. My gut feeling say that it will not scale. Here is a very simple-minded solution where I've assumed that as long as any combination of initials match, it's the same group. CREATE TABLE [TU_People_Data] ( [tu_id] [bigint] NOT NULL , [count_id] [int] NOT NULL , [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL , [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL , [ssn] [int] NULL , CONSTRAINT & #91;PK_tu_bulk_peopl e] PRIMARY KEY CLUSTERED ( [tu_id], [count_id] ) ON [PRIMARY] ) ON [PRIMARY] GO CREATE TABLE #initials (ssn int NOT NULL, fname varchar(32) NOT NULL, lname varchar(32) NOT NULL, initials char(2) NOT NULL) go CREATE TABLE #ssnmania (ident int NOT NULL, ssn int NOT NULL, initials char(2) NOT NULL, PRIMARY KEY(ssn, initials)) go INSERT #initals (ssn, fname, lname, initials) SELECT DISTINCT ssn, fname, lname, CASE WHEN fname < lname THEN substring(fname, 1, 1) + substring(lname, 1, 1) ELSE substring(lname, 1, 1) + substring(fname, 1, 1) END FROM TU_People_Data go INSERT #ssnmania (ssn, initials) SELECT DISTINCT ssn, initials FROM #initials go SELECT i.ssn, i.fname, i.lname, i.initials, groupno = s.ident FROM #initials i JOIN #ssnmania s ON i.ssn = s.ssn AND s.initials = i.initials go DROP TABLE #initials, #ssnmania, TU_People_Data -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/pr...oads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodin...ions/books.mspx
Post Follow-up to this messageThanks Erland. I have tried this procedure in the morning and it solves half of my problem. let me start by answering your question. >What if you have both John Smith and Southerland Jane? Are the > same person or not? If these guys' SSN is the same, they are considered to be in the the same group. I am willing to take the chance that John Smith, Southerland Jane and Jack Sam with similar SSN has slim chance to occur. if they exist, they are gouped in one group number. In my table some of the rows for one person are displayed like this. 1.John Coleman Smith 1111 JS 2.John Smith Coleman 1111 CJ 3.Coleman John Smith 1111 CS 4.John-coleman Smith 1111 JS 5. Smith John 1111 JS 6.John Smith 2222 JS 7.J Smith 1111 JS 8 Jack Sam 3333 JS you can see that all this guys can be grouped in the same group name(except the 6th and 8th). I see that SSN is the major factor to identify the groups. So once SSN is the same then the intitals has to be one or two of the three J or S or C. Erland Sommarskog wrote:[color=darkred ] > (jacob.dba@gmail.com) writes: > > What if you have both John Smith and Southerland Jane? Are the > same person or not? > > This looks like a very difficult task, and the fact that you have > 800 million rows certainly does not help to make it easier. > > I think you need to scrap the idea you got from Itzik. My gut feeling > say that it will not scale. > > Here is a very simple-minded solution where I've assumed that as > long as any combination of initials match, it's the same group. > > > CREATE TABLE [TU_People_Data] ( > [tu_id] [bigint] NOT NULL , > [count_id] [int] NOT NULL , > [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NUL L , > [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NUL L , > [ssn] [int] NULL , > CONSTRAINT & #91;PK_tu_bulk_peopl e] PRIMARY KEY CLUSTERED > ( > [tu_id], > [count_id] > ) ON [PRIMARY] > ) ON [PRIMARY] > GO > CREATE TABLE #initials (ssn int NOT NULL, > fname varchar(32) NOT NULL, > lname varchar(32) NOT NULL, > initials char(2) NOT NULL) > go > CREATE TABLE #ssnmania (ident int NOT NULL, > ssn int NOT NULL, > initials char(2) NOT NULL, > PRIMARY KEY(ssn, initials)) > go > INSERT #initals (ssn, fname, lname, initials) > SELECT DISTINCT ssn, fname, lname, > CASE WHEN fname < lname > THEN substring(fname, 1, 1) + substring(lname, 1, 1) > ELSE substring(lname, 1, 1) + substring(fname, 1, 1) > END > FROM TU_People_Data > go > INSERT #ssnmania (ssn, initials) > SELECT DISTINCT ssn, initials > FROM #initials > go > SELECT i.ssn, i.fname, i.lname, i.initials, groupno = s.ident > FROM #initials i > JOIN #ssnmania s ON i.ssn = s.ssn > AND s.initials = i.initials > go > DROP TABLE #initials, #ssnmania, TU_People_Data > > > > > -- > Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se > > Books Online for SQL Server 2005 at > http://www.microsoft.com/technet/pr...oads/books.mspx > Books Online for SQL Server 2000 at > http://www.microsoft.com/sql/prodin...ions/books.mspx
Post Follow-up to this message(jacob.dba@gmail.com) writes: > I have tried this procedure in the morning and it solves half of my > problem. And the other half is? :-) I did not include the middle initial, because I did not see that post until later. But I guess that you could extend the logic that I posted to handle the middle initial as well. -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/pr...oads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodin...ions/books.mspx
Post Follow-up to this messageyou need a function that takes the first character from first name, last name, and middle initial, and sorts them. Call it "SortInit" So, pass "Sam Alfred Jones" and it passes back "AJS". Likewise, "Jones Alfred Sam" is returned as "AJS". then, create your temp table and populate it with SSN and Sortinit(). then alter table on your temp table and add an identity column. Then make your "temp table" a permanent one, as your business rules will change, and fundamentally what you are doing is looking for "duplicate rows" and grouping them, and this is almost always a multiple pass project.
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread