
case insensitive match in unicode
SunWuKung

2006-03-27, 7:36 am

I would need to do a case-insensitive match against a field that contains
text in different languages - Greek, Hungarian, Arabic etc.
The db encoding is UTF8.

So far I have found no way to achieve that. I tried converting both strings
to the same case and using ~*, but neither worked.

Does anybody know a way to do this?

Thanks for the help.
Balázs
Martijn van Oosterhout

2006-03-27, 7:36 am

On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote:
> I would need to do case insensitive match against a field that contains
> text in different languages - Greek, Hungarian, Arabic etc.
> The db encoding is UTF8.
>
> So far I found no way to achieve that. I tried converting both strings
> to the same case and using ~* , but neither worked.


Oh, tricky. Firstly, case-insensitive means different things in
different locales. For example, in Turkish 'i' is not the lowercase
version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You
don't specify a platform either. Locale support varies wildly by
platform.
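
To make the Turkish point concrete, here is a small sketch in Python (used purely for illustration; nothing in it is PostgreSQL-specific). It contrasts the default, locale-independent Unicode case mapping with what Turkish expects. The helpers `turkish_lower` and `turkish_upper` are hypothetical names and only special-case the dotted/dotless I letters; they are not a full locale implementation.

```python
# Default Unicode case mapping knows nothing about the locale:
default_lower = "ISTANBUL".lower()          # "istanbul"


def turkish_lower(text):
    """Hypothetical helper: lowercase using the Turkish I rules.

    In Turkish, 'I' lowercases to dotless 'ı' (U+0131) and
    dotted 'İ' (U+0130) lowercases to 'i'.
    """
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def turkish_upper(text):
    """Hypothetical helper: uppercase using the Turkish I rules."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


if __name__ == "__main__":
    print(default_lower)                 # default mapping
    print(turkish_lower("ISTANBUL"))     # Turkish mapping, starts with dotless ı
    print(turkish_upper("istanbul"))     # Turkish mapping, starts with dotted İ
```

The same input therefore has two different "correct" lowercase forms depending on the locale, which is why a case-insensitive comparison has to pick one set of rules.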

What you probably want is some kind of accent-insensitive match, which
means that é, è, ë, e, É, È, E and Ë are all considered to match
each other. The way you do that is by converting Unicode to a particular
normal form and then comparing. Unfortunately, I don't think PostgreSQL
supplies such a function right now.

However, some server-side procedural languages can do this. If you can
find one (possibly Perl) that can do the conversion, you can create a
function to do the mapping.
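
As a rough illustration of "convert to a normal form, then compare", here is a sketch using Python's standard `unicodedata` module. The function name `same_text` is made up for this example. It shows that normalization makes two different encodings of the same accented letter compare equal, and also that normalization on its own keeps the accent, so é still differs from e until the combining marks are removed.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00e9"    # é stored as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# A plain comparison sees two different code point sequences.
raw_equal = precomposed == decomposed

# After normalization the two spellings of é are identical.
norm_equal = same_text(precomposed, decomposed)

# Normalization alone does not drop the accent: é is still not e.
accent_ignored = same_text(precomposed, "e")

if __name__ == "__main__":
    print(raw_equal, norm_equal, accent_ignored)
```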

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


SunWuKung

2006-03-27, 7:36 am

In article <20060327094829.GA30791@svana.org>, kleptog@svana.org says...
> On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote:
>
> Oh, tricky. Firstly, case-insensitive means different things in
> different locales. For example, in Turkish 'i' is not the lowercase
> version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You
> don't specify a platform either. Locale support varies wildly by
> platform.
>
> What you probably want is some kind of accent-insensitive match, which
> means that é, è, ë, e, É, È, E and Ë are all considered to match
> each other. The way you do that is by converting Unicode to a particular
> normal form and then comparing. Unfortunately, I don't think PostgreSQL
> supplies such a function right now.
>
> However, some server-side procedural languages can do this. If you can
> find one (possibly Perl) that can do the conversion, you can create a
> function to do the mapping.
>
> Have a nice day,

This sounds like a very interesting concept.
It wouldn't be 'case insensitive', just insensitive.

The way I imagine it now is a special case of the ~ function.
I create matchgroups in a table and check each character if it is in a
group. If it is, I will replace the character with its group, e.g. [éÉE]
or [oóOÓ??], and do a regexp with that.
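
A minimal sketch of that idea, written in Python only to show the mechanics. The group contents and the names `MATCH_GROUPS` and `build_pattern` are assumptions for the example (the groups would come from a table, and the two unreadable characters in the second group are left out rather than guessed):

```python
import re

# Hypothetical match groups; in the proposal these would be rows in a table.
MATCH_GROUPS = ["e\u00e9\u00c9E", "o\u00f3O\u00d3"]   # "eéÉE", "oóOÓ"


def build_pattern(term, groups=MATCH_GROUPS):
    """Turn a search term into a regex where each grouped character
    is replaced by a character class listing its whole group."""
    lookup = {ch: group for group in groups for ch in group}
    parts = []
    for ch in term:
        group = lookup.get(ch)
        if group is None:
            # Character belongs to no group: match it literally.
            parts.append(re.escape(ch))
        else:
            parts.append("[" + "".join(re.escape(c) for c in group) + "]")
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = build_pattern("t\u00e9r")       # "tér"
    print(pattern.pattern)
    print(bool(pattern.search("a tEr b")))
```

One limitation to keep in mind: this only matches precomposed characters that are listed in a group, so every accented variant has to be enumerated by hand.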

What do you think?
B.
Martijn van Oosterhout

2006-03-27, 7:37 am

On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
> This sounds like a very interesting concept.
> It wouldn't be 'case insensitive' just insensitive.
>
> The way I imagine it now is a special case of the ~ function.
> I create matchgroups in a table and check each character if it is in the
> group. If it is I will replace the character with the group in [éÉE],
> [oóOÓ??] and do a regexp with that.


No need to reinvent the wheel. ICU provides a range of services to deal
with this. For example, the following filter in ICU:

NFD; [:Nonspacing Mark:] Remove; NFC

will remove all accents from characters, and it works for all Unicode
characters. With a bit more thinking you can handle case variations
as well.
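
For readers without ICU at hand, the same three steps can be approximated with Python's standard `unicodedata` module. This is a sketch of the technique, not ICU's API; the function names `strip_accents` and `match_key` are invented for the example, and case-folding with `str.casefold()` is one assumed way of handling the case variations mentioned above.

```python
import unicodedata


def strip_accents(text):
    """NFD, drop nonspacing marks (general category Mn), then NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def match_key(text):
    """Accent- and case-insensitive comparison key (locale-independent)."""
    return strip_accents(text).casefold()


if __name__ == "__main__":
    variants = ["\u00e9", "\u00e8", "\u00eb", "e", "\u00c9", "\u00c8", "E", "\u00cb"]
    # All eight variants of e collapse to the same key.
    print({match_key(v) for v in variants})
    print(strip_accents("Bal\u00e1zs"))
```

Two strings then match when their keys are equal. Because the folding is locale-independent, locale-specific rules such as the Turkish dotless ı are not handled by this sketch.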

There is also a locale-independent case-mapping module there, plus
various locale-specific ones.

http://icu.sourceforge.net/userguide/Transform.html
http://icu.sourceforge.net/userguide/caseMappings.html
http://icu.sourceforge.net/userguide/normalization.html

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


SunWuKung

2006-04-06, 8:25 pm

In article <20060327114037.GD30791@svana.org>, kleptog@svana.org says...
> On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
>
> No need to reinvent the wheel. ICU provides a range of services to deal
> with this. For example the following filter in ICU:
>
> NFD; [:Nonspacing Mark:] Remove; NFC.
>
> Will remove all accents from characters. And it works for all Unicode
> characters. With a bit more thinking you can work with case variations
> also.
>
> There is also a locale-independent case-mapping module there plus
> various locale specific ones also.
>
> http://icu.sourceforge.net/userguide/Transform.html
> http://icu.sourceforge.net/userguide/caseMappings.html
> http://icu.sourceforge.net/userguide/normalization.html
>
> Have a nice day,

Thanks, I looked at this and it looks like something that would indeed
solve the problem.
However, so far I have been unable to figure out how I could use this from
within Postgres. If you have experience with it, could you give me an
example?

Thanks
Balázs
Martijn van Oosterhout

2006-04-07, 9:29 am

On Thu, Apr 06, 2006 at 11:12:26PM +0200, SunWuKung wrote:
> Thanks, I looked at this and it looks like something that would indeed
> solve the problem.
> However I was so far unable to figure out how could I use this from
> within Postgres. If you have experience with it could you give me an
> example?


There are some unofficial ICU patches, but I doubt they're still
up-to-date. I don't personally use it, though maybe someone else here
does...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Mike Rylander

2006-04-07, 11:27 am

On 4/6/06, SunWuKung <Balazs.Klein@axelero.hu> wrote:
> In article <20060327114037.GD30791@svana.org>, kleptog@svana.org says...
> Thanks, I looked at this and it looks like something that would indeed
> solve the problem.
> However I was so far unable to figure out how could I use this from
> within Postgres. If you have experience with it could you give me an
> example?


I was looking into creating a Pg function wrapper to some of the ICU
stuff, but, to be perfectly honest, I couldn't find an actual API
reference for ICU.

In any case, you can do this with PL/Perl:

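CREATE FUNCTION strip_nonspacing_marks ( text ) RETURNS text AS $func$
  # Needs the untrusted plperlu: trusted plperl cannot load these modules.
  use Unicode::Normalize;
  use Encode;

  my $string = shift;
  # Decode only if we were handed raw UTF-8 bytes rather than characters.
  $string = decode( 'UTF-8', $string ) unless utf8::is_utf8($string);

  # Decompose, then drop every nonspacing (combining) mark.
  $string = NFD($string);
  $string =~ s/\p{Mn}+//g;

  return NFC($string);  # recompose what is left
$func$ LANGUAGE plperlu STRICT;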

It's untested and won't be as fast as ICU, but it should get the job
done. Hope it helps!

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
