case insensitive match in unicode
| SunWuKung 2006-03-27, 7:36 am |
| I would need to do case insensitive match against a field that contains
text in different languages - Greek, Hungarian, Arabic etc.
The db encoding is UTF8.
So far I found no way to achieve that. I tried converting both strings
to the same case and using ~* , but neither worked.
Does anybody know a way to do this?
Thanks for the help.
Balázs
| |
| Martijn van Oosterhout 2006-03-27, 7:36 am |
| On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote:
> I would need to do case insensitive match against a field that contains
> text in different languages - Greek, Hungarian, Arabic etc.
> The db encoding is UTF8.
>
> So far I found no way to achieve that. I tried converting both strings
> to the same case and using ~* , but neither worked.
Oh, tricky. Firstly, case-insensitive means different things to
different locales. For example, in Turkish 'i' is not the lowercase
version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You
don't specify a platform either. Locale support varies wildly by
platform.
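To make the Turkish point concrete, here is a small Python sketch (Python is used purely for illustration; nothing in it is PostgreSQL-specific). It contrasts the default, locale-independent Unicode lowercasing with a hand-written Turkish variant. The helper name turkish_lower is hypothetical, and it only covers the dotted/dotless I pair.

```python
# Default Unicode case mapping is locale-independent: 'I' lowercases to 'i'.
# Turkish instead pairs I with dotless i (U+0131) and
# dotted capital I (U+0130) with i.

def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish I rules."""
    # Handle the two Turkish-specific capitals first, then fall back
    # to the default mapping for every other character.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

default_result = "ISTANBUL".lower()          # locale-independent mapping
turkish_result = turkish_lower("ISTANBUL")   # Turkish mapping
```

The two results differ in their first letter, which is why a single "convert both sides to lowercase" rule cannot be right for every language at once.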
What you probably want is some kind of accent-insensitive match, which
means that é, è, ë, e, É, È, E and Ë are all considered to match
each other. The way you do that is by converting Unicode to a particular
normal form and then comparing. Unfortunately, I don't think PostgreSQL
supplies such a function right now.
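As an illustration of "normalize, then compare", here is a minimal Python sketch using the standard unicodedata module. The function names fold and same_text are made up for this example. Note that normalization plus case folding only makes equivalent spellings and case variants equal; 'e' and 'é' still differ until the accents themselves are dropped.

```python
import unicodedata

def fold(text: str) -> str:
    """Normalize to NFC and case-fold so equivalent spellings compare equal."""
    folded = unicodedata.normalize("NFC", text).casefold()
    # Case folding can denormalize a few characters, so normalize once more.
    return unicodedata.normalize("NFC", folded)

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalization and case folding."""
    return fold(a) == fold(b)

precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
```

Here precomposed and decomposed are different code point sequences, yet same_text treats them as the same text because both are brought to one normal form first.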
However, some server-side procedural languages can do this. If you can
find one (possibly Perl) that can do the conversion, you can create a
function to do the mapping.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
| |
| SunWuKung 2006-03-27, 7:36 am |
| In article <20060327094829.GA30791@svana.org>, kleptog@svana.org says...
> On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote:
> Oh, tricky. Firstly, case-insensitive means different things to
> different locales. For example, in Turkish 'i' is not the lowercase
> version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You
> don't specify a platform either. Locale support varies wildly by
> platform.
>
> What you probably want is some kind of accent-insensitive match, which
> means that é, è, ë, e, É, È, E and Ë are all considered to match
> each other. The way you do that is by converting Unicode to a particular
> normal form and then comparing. Unfortunately, I don't think PostgreSQL
> supplies such a function right now.
>
> However, some server-side procedural languages can do this. If you can
> find one (possibly Perl) that can do the conversion, you can create a
> function to do the mapping.
>
> Have a nice day,
>
This sounds like a very interesting concept.
It wouldn't be 'case insensitive' just insensitive.
The way I imagine it now is a special case of the ~ function.
I create matchgroups in a table and check each character if it is in the
group. If it is I will replace the character with the group in [éÉE],
[oóOÓ??] and do a regexp with that.
What do you think?
B.
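A rough sketch of the matchgroup idea in Python, to show what the generated regexp would look like. The two groups below are hypothetical stand-ins for rows of the proposed table, and the names MATCH_GROUPS, to_pattern and insensitive_search are invented for this example.

```python
import re

# Hypothetical match groups; in the proposal these would live in a table.
MATCH_GROUPS = [
    "e\u00e9E\u00c9",   # e, e-acute, E, E-acute
    "o\u00f3O\u00d3",   # o, o-acute, O, O-acute
]

# Map every member character to the character class for its group.
CLASS_FOR_CHAR = {
    ch: "[" + "".join(re.escape(c) for c in group) + "]"
    for group in MATCH_GROUPS
    for ch in group
}

def to_pattern(term: str) -> str:
    """Replace each grouped character with its class; escape the rest."""
    return "".join(CLASS_FOR_CHAR.get(ch, re.escape(ch)) for ch in term)

def insensitive_search(term: str, text: str) -> bool:
    """True if the term matches somewhere in text under the match groups."""
    return re.search(to_pattern(term), text) is not None
```

Characters outside any group are matched literally, so every letter that should be treated loosely needs its own group, and the table has to be maintained by hand for each script.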
| |
| Martijn van Oosterhout 2006-03-27, 7:37 am |
| On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
> This sounds like a very interesting concept.
> It wouldn't be 'case insensitive' just insensitive.
>
> The way I imagine it now is a special case of the ~ function.
> I create matchgroups in a table and check each character if it is in the
> group. If it is I will replace the character with the group in [éÉE],
> [oóOÓ??] and do a regexp with that.
No need to reinvent the wheel. ICU provides a range of services to deal
with this. For example the following filter in ICU:
NFD; [:Nonspacing Mark:] Remove; NFC.
Will remove all accents from characters. And it works for all Unicode
characters. With a bit more thinking you can work with case variations
also.
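For readers without ICU at hand, the same three steps can be approximated with Python's standard unicodedata module. This is a sketch rather than ICU itself; strip_accents and loose_key are names chosen for this example.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Rough equivalent of 'NFD; [:Nonspacing Mark:] Remove; NFC'."""
    # Step 1 (NFD): split accented letters into base letter + marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every character in general category Mn.
    kept = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Step 3 (NFC): recompose whatever is left.
    return unicodedata.normalize("NFC", kept)

def loose_key(text: str) -> str:
    """Accent- and case-insensitive comparison key."""
    return strip_accents(text).casefold()
```

Two strings can then be compared through loose_key. Letters that have no decomposition (for example ø) pass through unchanged, and in Arabic the vowel marks are nonspacing marks too, so they are removed as well.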
There is also a locale-independent case-mapping module there, plus
various locale-specific ones.
http://icu.sourceforge.net/userguide/Transform.html
http://icu.sourceforge.net/userguide/caseMappings.html
http://icu.sourceforge.net/userguide/normalization.html
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
| |
| SunWuKung 2006-04-06, 8:25 pm |
| In article <20060327114037.GD30791@svana.org>, kleptog@svana.org says...
> On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
> No need to reinvent the wheel. ICU provides a range of services to deal
> with this. For example the following filter in ICU:
>
> NFD; [:Nonspacing Mark:] Remove; NFC.
>
> Will remove all accents from characters. And it works for all Unicode
> characters. With a bit more thinking you can work with case variations
> also.
>
> There is also a locale-independent case-mapping module there, plus
> various locale-specific ones.
>
> http://icu.sourceforge.net/userguide/Transform.html
> http://icu.sourceforge.net/userguide/caseMappings.html
> http://icu.sourceforge.net/userguide/normalization.html
>
> Have a nice day,
>
Thanks, I looked at this and it looks like something that would indeed
solve the problem.
However I was so far unable to figure out how could I use this from
within Postgres. If you have experience with it could you give me an
example?
Thanks
Balázs
| |
| Martijn van Oosterhout 2006-04-07, 9:29 am |
| On Thu, Apr 06, 2006 at 11:12:26PM +0200, SunWuKung wrote:
> Thanks, I looked at this and it looks like something that would indeed
> solve the problem.
> However I was so far unable to figure out how could I use this from
> within Postgres. If you have experience with it could you give me an
> example?
There are some unofficial ICU patches but I doubt they're still
up-to-date. I don't personally use it though maybe someone else here
does...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
| |
| Mike Rylander 2006-04-07, 11:27 am |
| On 4/6/06, SunWuKung <Balazs.Klein@axelero.hu> wrote:
> In article <20060327114037.GD30791@svana.org>, kleptog@svana.org says...
> Thanks, I looked at this and it looks like something that would indeed
> solve the problem.
> However I was so far unable to figure out how could I use this from
> within Postgres. If you have experience with it could you give me an
> example?
I was looking into creating a Pg function wrapper to some of the ICU
stuff, but, to be perfectly honest, I couldn't find an actual API
reference for ICU.
In any case, you can do this with PL/Perl:
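CREATE FUNCTION strip_nonspacing_marks ( text ) RETURNS text AS $func$
    # Loading modules with "use" needs the untrusted language, plperlu.
    use Unicode::Normalize;
    use Encode;
    my $string = shift;
    # Decode only if the argument arrived as raw UTF-8 bytes.
    $string = decode( 'UTF-8', $string ) unless utf8::is_utf8($string);
    $string = NFD($string);     # split letters into base + marks
    $string =~ s/\p{Mn}+//g;    # drop the nonspacing marks
    return NFC($string);        # recompose what is left
$func$ LANGUAGE plperlu STRICT;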
It's untested and won't be as fast as ICU, but it should get the job
done. Hope it helps!
--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org