Home > Archive > MS Access database support > April 2006 > Re: Can Access use Fuzzy Logic - RegEx









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Re: Can Access use Fuzzy Logic - RegEx
Kevin G. Anderson

2006-04-07, 9:37 am

David W. Fenton wrote:
> "cassetti@gmail.com" <cassetti@gmail.com> wrote in
> news:1144162213.753763.185180@t31g2000cwb.googlegroups.com:
>
>
> I think you've asked the wrong question. You don't want fuzzy logic
> but fuzzy criteria matching.
>
> I've de-duped millions of records, and it's very hard work.
>
> The first thing you need to do is decide what constitutes uniqueness
> (name, name+address, etc.).
>
> Then you need to get all records into the exactly the same format.
> This means that you need to process certain fields to get them all
> formatted the same. For instance, phone numbers might be entered as
> 123-456-7890 and (123) 456-7890 and 123.456.7890 and +1 123
> 456-7890. To de-dupe on phone numbers, you'd need to first process
> out all the variations. What I'd do with phone numbers is first
> strip out the leading +1 or 1, then I'd strip out all characters
> except the numbers. I wouldn't reformat them, but leave them that
> way because the de-duping process is not going to be helped by
> adding more characters. When the final data is cleaned up, you may
> choose to reformat the data (or just display it with the appropriate
> parens and dashes).
>
> Another problem is that if there are multiple phone numbers, you
> might end up with them in the wrong columns. That is, if there are
> just Phone1 and Phone2 instead of Home, Work, Mobile, you might end
> up with two records that are the same person and the Phone1 field
> might match the Phone2 field of the other record. You can
> restructure the data to fix this problem (normalization), by putting
> the phone numbers in a separate table with a link to the source
> record and the phone number. Then you'd only have to search for dups
> on a single field in this phone number table.
>
> Of course, there could be lots of false matches if, say, there are
> many different people sharing the same phone number, so you wouldn't
> want to use phone number by itself to determine duplicates.
>
> To do all this parsing, you'll need to familiarize yourself with the
> VBA string functions, Replace(), Mid(), Left() and Right().
>
> Addresses are much more complicated. I've got lots of code that I've
> written to process addresses into standard formats, and to evalute
> the contents and encode them in a format that is easily comparable.
> But it's very complicated and I'm not prepared to share it here. The
> main point I want to make is that this is a hard task.


Regular expressions are a much more powerful than VBA string functions
for parsing phone numbers, email addresses, etc. out of unstructured
data. And OLE makes it easy to use VBScript's RegEx object within an
Access app. Plus you'll be able to leverage your knowledge of RegEx in
just about every other programming evironment you encounter in the
future.

Search MSDN or the Internet to learn more about what you can do with
RegEx, or check out Jeffrey E. F. Friedl's "Mastering Regular
Expressions" book (published by O'Reilly).

Kevin

David W. Fenton

2006-04-07, 11:32 am

"Kevin G. Anderson" <kgander@pobox.com> wrote in
news:1144418877.837496.82620@g10g2000cwb.googlegroups.com:

> Regular expressions are a much more powerful than VBA string
> functions for parsing phone numbers, email addresses, etc. out of
> unstructured data. And OLE makes it easy to use VBScript's RegEx
> object within an Access app. Plus you'll be able to leverage your
> knowledge of RegEx in just about every other programming
> evironment you encounter in the future.


I, for one, have never found RegExp to be easy to understand at all.
My eyes just glaze over when I try to figure it out, to be honest.

> Search MSDN or the Internet to learn more about what you can do
> with RegEx, or check out Jeffrey E. F. Friedl's "Mastering Regular
> Expressions" book (published by O'Reilly).


I don't see how this helps you. I don't know how RegExp works for
stripping out characters, but given that it's an external library,
while VBA string handling functions are native to Access (and thus,
perhaps, more efficient when accessed through the Access Expression
Service (which is how functions are used in queries)), I would
expect it to be less efficient than the native VBA string handling
functions.

On the other hand, if you're writing custom functions using the
string handling tools, and RegExp provides a function that does the
same thing, the RegExp is perhaps going to be faster.

If you use RegExp already, then it's certainly worth looking at.

But if you don't, I'm not certain there's going to be much utility
to delving too deeply into learning it.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com