|
Home > Archive > MS Access database support > April 2006 > Quick fuzzy search
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Quick fuzzy search
|
|
| Martin Schneider 2006-04-06, 3:29 am |
| Hi!
I have a database with approx. 2500 records. When entering a new record
I'd like to avoid double entries that may differ slightly in writing.
Currently I am calculating the Levenshtein Distance across all records
and offer the top ten matches. Unfortunately this is very slow on long
strings.
Is there a faster method to do this?
Best regards,
Martin
| |
| kaniest 2006-04-06, 7:41 am |
| Martin Schneider wrote:
> I have a database with approx. 2500 records. When entering a new
> record I'd like to avoid double entries that may differ slightly in
> writing.
> Currently I am calculating the Levenshtein Distance across all records
> and offer the top ten matches. Unfortunately this is very slow on long
> strings.
>
> Is there a faster method to do this?
With such a small dataset it can be feasable to keep the
keys in memory, also calculating in static arrays
could save a cycle or two.
Of course you could use a less demanding metric, tinker
with upper and lower bounds on the way, or filter
candidate matches with a quick and dirty test before
applying the full metric on a selection.
--
Paul
| |
| Anthony England 2006-04-06, 7:41 am |
| "Martin Schneider" <martin.schneider@illusion-factory.de> wrote in message
news:49jtg7Fojtv2U1@
individual.net...
> Hi!
>
> I have a database with approx. 2500 records. When entering a new record
> I'd like to avoid double entries that may differ slightly in writing.
>
> Currently I am calculating the Levenshtein Distance across all records and
> offer the top ten matches. Unfortunately this is very slow on long
> strings.
>
> Is there a faster method to do this?
>
> Best regards,
> Martin
If you want to do such a detailed comparison against each and every record
in the database, then of course it will be slow. What can we say? Get a
faster PC or buy more RAM?
If we knew what was being stored in the database, we might be able to make
more helpful suggestions. So if the database stores contact details and you
wanted to add a new contact:
Jane Smith, 117 Waterloo Road, London, SE1 8UL, United Kingdom
Would I really want to go through and evaluate how closely Jane's name
matches someone called Jan Smitt from Norway before I was sure this was not
a duplicate? However, we know nothing about your application.
| |
| Martin Schneider 2006-04-06, 7:41 am |
| Anthony England schrieb:
> So if the database stores contact details and you
> wanted to add a new contact:
> Jane Smith, 117 Waterloo Road, London, SE1 8UL, United Kingdom
> Would I really want to go through and evaluate how closely Jane's name
> matches someone called Jan Smitt from Norway before I was sure this was not
> a duplicate?
That's exactly the purpose.
Best regards,
Martin
| |
| Anthony England 2006-04-06, 7:41 am |
| "Martin Schneider" <martin.schneider@illusion-factory.de> wrote in message
news:49k6suFot17gU1@
individual.net...
> Anthony England schrieb:
>
> That's exactly the purpose.
>
> Best regards,
> Martin
What I meant was that you could look at only the people in postcode which
starts with SE1 or perhaps only those in London or perhaps only those in the
United Kingdom. I would not compare the new record against every single
existing one.
When I have done this sort of thing before, I give the user some degree of
flexibilty so you can adjust how rigourously you will search for duplicates.
You could keep some soundex representation of the surname in an indexed
field and start the process by entering a surname.
| |
| Martin Schneider 2006-04-06, 7:41 am |
| Anthony England schrieb:
> What I meant was that you could look at only the people in postcode which
> starts with SE1 or perhaps only those in London or perhaps only those in the
> United Kingdom. I would not compare the new record against every single
> existing one.
I would :-) The search is pretty local in a geographical way and similar
names in almost every case identify the same company, even if the ZIP
code is different (the company may have moved it's location).
Best regards,
Martin
| |
| Tom van Stiphout 2006-04-06, 9:35 am |
| On Thu, 06 Apr 2006 09:15:21 +0200, Martin Schneider
<martin.schneider@illusion-factory.de> wrote:
I'm using Ratcliff/Obershelp Pattern Recognition Algorithm on about
25000 records, and it is not too slow. I'm using the version in C and
built a DLL, for optimum speed.
-Tom.
>Hi!
>
>I have a database with approx. 2500 records. When entering a new record
>I'd like to avoid double entries that may differ slightly in writing.
>
>Currently I am calculating the Levenshtein Distance across all records
>and offer the top ten matches. Unfortunately this is very slow on long
>strings.
>
>Is there a faster method to do this?
>
>Best regards,
>Martin
| |
| Larry Linson 2006-04-06, 8:29 pm |
| There's a SOUNDEX algorithm, done in VBA, published in various places. If
you GOOGLE this newsgroup, chances are good that you'll find a link. In
fact, it may be in the Microsoft Knowledge Base at
http://support.microsoft.com. Caveat: AFAIK, it applies to English
pronunciations only.
Larry Linson
Microsoft Access MVP
"Martin Schneider" <martin.schneider@illusion-factory.de> wrote in message
news:49k8vtFp5vh1U1@
individual.net...
> Anthony England schrieb:
>
> I would :-) The search is pretty local in a geographical way and similar
> names in almost every case identify the same company, even if the ZIP code
> is different (the company may have moved it's location).
>
> Best regards,
> Martin
| |
| David W. Fenton 2006-04-06, 8:29 pm |
| "Larry Linson" <bouncer@localhost.not> wrote in
news:ImeZf.25403$fQ6.1287@trnddc03:
> There's a SOUNDEX algorithm, done in VBA, published in various
> places. If you GOOGLE this newsgroup, chances are good that you'll
> find a link. In fact, it may be in the Microsoft Knowledge Base at
> http://support.microsoft.com. Caveat: AFAIK, it applies to English
> pronunciations only.
The Soundex2 algorithm is also freely available on the Web and is a
much better implementation for use in de-duping (far fewer false
positives), though it has the same language limitation.
--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
|
|
|
|
|