Drop Table

Support Forum for database administrators and web based access to important newsgroups related to databases
Register on Database Support Forum Edit your profileCalendarFind other Database Support forum membersFrequently Asked QuestionsSearch this forum -> 
For Database admins: Free Database-related Magazines Now Free shipping to Texas


Post New Thread










Thread
Author

Assigning group numbers for millions of data
I have a table with first name, last name, SSN(social security number)
and other columns.
I want to assign group number according to this business logic.
1. Records with equal SSN and (similar first name or last name) belong
to the same group.
John      Smith    1234
Smith    John     1234
S           John    1234
J           Smith  1234
John Smith and Smith John falls in the same group Number as long as
they have similar SSN.
This is because I have a record of equal SSN but the first name and
last name is switched because of people who make error inserting last
name as first name and vice versa. John Smith and Smith John will have
equal group Name if they have equal SSN.
2. There are records with equal SSN but different first name and last
name. These belong to different group numbers.
Equal SSN doesn't guarantee equal group number, at least one of the
first name or last name should be the same. John Smith and Dan Brown
with equal SSN=1234 shouldn't fall in the same group number.

Sample data:
Id Fname    lname    SSN   grpNum
1  John      Smith    1234  1
2  Smith    John     1234   1
3  S           John    1234     1
4  J           Smith  1234      1
5  J            S         1234      1
6  Dan      Brown  1234     2
7  John     Smith  1111      3


I have tried this code for 65,000 rows. It took 20 minute. I have to
run it for 21 million row data. I now that this is not an efficient
code.


INSERT into temp_FnLnSSN_grp
SELECT  c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or  substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS group_number
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)


dist FLS is distinct First Name, last Name and SSN table from the
people table.

I have posted part of this question, schema one week ago. Please refer
this thread.

http://groups.google.com/group/comp...6eb380b5f2e6de6


Report this thread to moderator Post Follow-up to this message
Old Post
jacob.dba@gmail.com
03-27-06 06:32 PM


Re: Assigning group numbers for millions of data
I forgot to mention that some of the records have middle name entered
in place of first name or last name.
fname  mname  lname       ssn

John  coleman  smith        1234
john   smith      coleman    1234
john   S           coleman     1234
John   C          Smith          1234
John               Smith           1234
John-coleman  Smith           1234
Smith              John            1234


During the grouping process I am concerned only about fname,lname,
ssn.(no need of middle name). If there is other suggestion to include
columns I am happy to accept.
I have the idea to assign groups if one of the initial of the names is

similar with the others considering that the SSN is the same. that
means if SSN is equal and if J or S or C are there as an initial in the

names, we can say they are in the same group.


Reply

jacob.dba@gmail.com  wrote:
> I have a table with first name, last name, SSN(social security number)
> and other columns.
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John      Smith    1234
> Smith    John     1234
> S           John    1234
> J           Smith  1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
>  This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.
>
> Sample data:
> Id Fname    lname    SSN   grpNum
> 1  John      Smith    1234  1
> 2  Smith    John     1234   1
> 3  S           John    1234     1
> 4  J           Smith  1234      1
> 5  J            S         1234      1
> 6  Dan      Brown  1234     2
> 7  John     Smith  1111      3
>
>
> I have tried this code for 65,000 rows. It took 20 minute. I have to
> run it for 21 million row data. I now that this is not an efficient
> code.
>
>
> INSERT into temp_FnLnSSN_grp
> SELECT  c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
>          (SELECT 1 + count(*)
>           FROM distFLS AS c2
>           WHERE c2.ssn < c1.ssn
>           or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
>                                or  substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
>               )) AS group_number
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
>        c1.fname = c3.fname and
>        c1.lname= c3.lname)
>
>
> dist FLS is distinct First Name, last Name and SSN table from the
> people table.
>
> I have posted part of this question, schema one week ago. Please refer
> this thread.
>
> http://groups.google.com/group/comp...f2e6
de6


Report this thread to moderator Post Follow-up to this message
Old Post
jacob.dba@gmail.com
03-28-06 01:27 AM


Re: Assigning group numbers for millions of data
(jacob.dba@gmail.com)  writes:
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John      Smith    1234
> Smith    John     1234
> S           John    1234
> J           Smith  1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
>  This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.

What if you have both John Smith and Southerland Jane? Are the
same person or not?

This looks like a very difficult task, and the fact that you have
800 million rows certainly does not help to make it easier.

I think you need to scrap the idea you got from Itzik. My gut feeling
say that it will not scale.

Here is a very simple-minded solution where I've assumed that as
long as any combination of initials match, it's the same group.


CREATE TABLE [TU_People_Data] (
[tu_id] [bigint] NOT NULL ,
[count_id] [int] NOT NULL ,
[fname] [varchar] (32) COLLATE  Latin1_General_CI_AS
 NULL ,
[lname] [varchar] (32) COLLATE  Latin1_General_CI_AS
 NULL ,
[ssn] [int] NULL ,
CONSTRAINT & #91;PK_tu_bulk_peopl
e] PRIMARY KEY  CLUSTERED
(
[tu_id],
[count_id]
)  ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE #initials  (ssn   int NOT NULL,
fname varchar(32) NOT NULL,
lname varchar(32) NOT NULL,
initials char(2) NOT NULL)
go
CREATE TABLE #ssnmania (ident int NOT NULL,
ssn   int NOT NULL,
initials char(2) NOT NULL,
PRIMARY KEY(ssn, initials))
go
INSERT #initals (ssn, fname, lname, initials)
SELECT DISTINCT ssn, fname, lname,
CASE WHEN fname < lname
THEN substring(fname, 1, 1) + substring(lname, 1, 1)
ELSE substring(lname, 1, 1) + substring(fname, 1, 1)
END
FROM  TU_People_Data
go
INSERT #ssnmania (ssn, initials)
SELECT DISTINCT ssn, initials
FROM   #initials
go
SELECT i.ssn, i.fname, i.lname, i.initials, groupno = s.ident
FROM   #initials i
JOIN   #ssnmania s ON i.ssn = s.ssn
AND s.initials = i.initials
go
DROP TABLE #initials, #ssnmania, TU_People_Data




--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

Report this thread to moderator Post Follow-up to this message
Old Post
Erland Sommarskog
03-28-06 01:27 AM


Re: Assigning group numbers for millions of data
Thanks Erland.
I have tried this procedure in the morning and it solves half of my
problem.
let me start by answering your question.
>What if you have both John Smith and Southerland Jane? Are the
> same person or not?
If these guys' SSN is the same, they are considered to be in the the
same group.
I am willing to take the chance that John Smith, Southerland Jane and
Jack Sam with similar SSN has  slim chance to occur. if they exist,
they are gouped in one group number. 
In my table some of the rows for one person are displayed like this.
1.John  Coleman  Smith          1111               JS
2.John  Smith      Coleman      1111             CJ
3.Coleman  John  Smith         1111               CS
4.John-coleman    Smith         1111              JS
5. Smith                John           1111               JS
6.John                  Smith          2222              JS
7.J                       Smith          1111               JS
8 Jack                  Sam           3333                JS
you can see that all this guys can be grouped in the same group
name(except  the 6th and 8th). I see that SSN is the major factor to
identify the groups.
So once SSN is the same then the intitals has to be one or two of the
three J or S or C.




Erland Sommarskog  wrote:[color=darkred
]
> (jacob.dba@gmail.com) writes: 
>
> What if you have both John Smith and Southerland Jane? Are the
> same person or not?
>
> This looks like a very difficult task, and the fact that you have
> 800 million rows certainly does not help to make it easier.
>
> I think you need to scrap the idea you got from Itzik. My gut feeling
> say that it will not scale.
>
> Here is a very simple-minded solution where I've assumed that as
> long as any combination of initials match, it's the same group.
>
>
>    CREATE TABLE [TU_People_Data] (
>            [tu_id] [bigint] NOT NULL ,
>            [count_id] [int] NOT NULL ,
>            [fname] [varchar] (32) COLLATE  Latin1_General_CI_AS
 NUL
L ,
>            [lname] [varchar] (32) COLLATE  Latin1_General_CI_AS
 NUL
L ,
>            [ssn] [int] NULL ,
>            CONSTRAINT & #91;PK_tu_bulk_peopl
e] PRIMARY KEY  CLUSTERED
>            (
>                    [tu_id],
>                    [count_id]
>            )  ON [PRIMARY]
>    ) ON [PRIMARY]
>    GO
>    CREATE TABLE #initials  (ssn   int NOT NULL,
>                            fname varchar(32) NOT NULL,
>                            lname varchar(32) NOT NULL,
>                            initials char(2) NOT NULL)
>    go
>    CREATE TABLE #ssnmania (ident int NOT NULL,
>                            ssn   int NOT NULL,
>                            initials char(2) NOT NULL,
>                            PRIMARY KEY(ssn, initials))
>    go
>    INSERT #initals (ssn, fname, lname, initials)
>       SELECT DISTINCT ssn, fname, lname,
>              CASE WHEN fname < lname
>                   THEN substring(fname, 1, 1) + substring(lname, 1, 1)
>                   ELSE substring(lname, 1, 1) + substring(fname, 1, 1)
>              END
>       FROM  TU_People_Data
>    go
>    INSERT #ssnmania (ssn, initials)
>       SELECT DISTINCT ssn, initials
>       FROM   #initials
>    go
>    SELECT i.ssn, i.fname, i.lname, i.initials, groupno = s.ident
>    FROM   #initials i
>    JOIN   #ssnmania s ON i.ssn = s.ssn
>                      AND s.initials = i.initials
>    go
>    DROP TABLE #initials, #ssnmania, TU_People_Data
>
>
>
>
> --
> Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
>
> Books Online for SQL Server 2005 at
> http://www.microsoft.com/technet/pr...oads/books.mspx
> Books Online for SQL Server 2000 at
> http://www.microsoft.com/sql/prodin...ions/books.mspx


Report this thread to moderator Post Follow-up to this message
Old Post
jacob.dba@gmail.com
03-28-06 01:27 AM


Re: Assigning group numbers for millions of data
(jacob.dba@gmail.com)  writes:
> I have tried this procedure in the morning and it solves half of my
> problem.

And the other half is? :-) I did not include the middle initial, because
I did not see that post until later.

But I guess that you could extend the logic that I posted to handle
the middle initial as well.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

Report this thread to moderator Post Follow-up to this message
Old Post
Erland Sommarskog
03-29-06 01:26 AM


Re: Assigning group numbers for millions of data
you need a function that takes the first character from first name,
last name, and middle initial, and sorts them.   Call it "SortInit"
So, pass "Sam Alfred Jones"   and it passes back "AJS". Likewise,
"Jones Alfred Sam" is returned as "AJS".

then, create your temp table and populate it with  SSN and Sortinit().
then alter table on your temp table and add an identity column.
Then make your "temp table" a permanent one, as your business rules
will change, and fundamentally what you are doing is looking for
"duplicate rows" and grouping them, and this is almost always a
multiple pass project.


Report this thread to moderator Post Follow-up to this message
Old Post
Doug
04-06-06 06:30 PM


Sponsored Links





Last Thread Next Thread
Post New Thread

Microsoft SQL Server forum archive

Show a Printable Version Email This Page to Someone! Receive updates to this thread
Microsoft SQL Server
Access database support
PostgreSQL Replication
SQL Server ODBC
FoxPro Support
PostgreSQL pgAdmin
SQL Server Clustering
MySQL ODBC
Web Applications with dBASE
SQL Server CE
MySQL++
Sybase Database Support
MS SQL Full Text Search
PostgreSQL Administration
SQL Anywhere support
DB2 UDB Database
Paradox Database Support
Filemaker Database
Berkley DB
SQL 2000/2000i database
ASE Database
Forum Jump:
All times are GMT. The time now is 10:46 AM.

 
Mobile devices forum | Database support forum archive




Copyrights DropTable.com Database Support Forum 2004 - 2006