|
Home > Archive > SQL Anywhere database > April 2005 > Can anybody explain what a UTF-8 collation actually means?
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Can anybody explain what a UTF-8 collation actually means?
|
|
| Clive Collie 2005-04-12, 9:23 am |
| ASA 9.0.1
I have unloaded the collation files using dbcollat for UTF-8 and Swedish in
an effort to work this whole collation thing out. The Swedish collation file
has stuff like
: n n N, ñ ñ Ñ
: N n N, Ñ ñ Ñ
: o o O, º, ò ò Ò, ó ó Ó, ô ô Ô, õ õ Õ
: O o O, Ò ò Ò, Ó ó Ó, Ô ô Ô, Õ õ Õ
in it. This is kind of what I expected. From what I understand this means
all characters in each line are regarded as equivalent in string comparison
or searching, and sort order is set by the order in the file, so N-like
things go before O-like things.
The UTF-8 file goes like this
: N n N
: O o O
So the funny Os are not here, though they are elsewhere in the file. The
actual characters listed in the UTF-8 collation seem much the same as
Latin1(?)
So what kind of sorting and searching will a UTF-8 db actually do? What
language is a UTF-8 db suitable for? If UTF-8 gets Swedish all confused what
is the point of it?
Can anybody shed any light on any of this?
TIA
Clive
| |
| John Smirnios 2005-04-12, 11:24 am |
| ASA's UTF8 collation is essentially a binary collation for the UTF-8
encoding. Effectively ASA can assign an arbitrary sort position based on
the first byte of an encoding and then compares the follow bytes as they
appear in the encoding. So, equating 'a' to 'A' is possible since they
are single byte. However, accented versions of a and A have different
follow bytes (in UTF-8) and therefore cannot be considered equal by ASA.
Similarly, characters whose encodings are different lengths cannot be
considered equal ('a' cannot be considered equal to accented 'a' in
UTF-8). I'm pretty sure this used to be described in gory detail in the
documentation but a quick search now didn't yield good results -- I'll
check with the documentation team.
You can get around some of that awkwardness by using the SORTKEY and
COMPARE functions which do linguistically correct comparisons (including
secondary & tertiary sorts, etc).
The usefulness is that it allows you to store unicode in the database &
get some linguistically correct behaviour by using some intrinsic
functions until ASA supports unicode sorting more completely.
-john.
--
John Smirnios
Senior Software Developer
iAnywhere Solutions Engineering
Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
Developer Community at http://www.ianywhere.com/developer
Clive Collie wrote:
> ASA 9.0.1
>
> I have unloaded the collation files using dbcollat for UTF-8 and Swedish in
> an effort to work this whole collation thing out. The Swedish collation file
> has stuff like
>
> : n n N, ñ ñ Ñ
> : N n N, Ñ ñ Ñ
> : o o O, º, ò ò Ò, ó ó Ó, ô ô Ô, õ õ Õ
> : O o O, Ò ò Ò, Ó ó Ó, Ô ô Ô, Õ õ Õ
>
> in it. This is kind of what I expected. From what I understand this means
> all characters in each line are regarded as equivalent in string comparison
> or searching, and sort order is set by the order in the file, so N-like
> things go before O-like things.
>
> The UTF-8 file goes like this
>
> : N n N
> : O o O
>
> So the funny Os are not here, though they are elsewhere in the file. The
> actual characters listed in the UTF-8 collation seem much the same as
> Latin1(?)
>
> So what kind of sorting and searching will a UTF-8 db actually do? What
> language is a UTF-8 db suitable for? If UTF-8 gets Swedish all confused what
> is the point of it?
>
> Can anybody shed any light on any of this?
>
> TIA
> Clive
>
>
| |
| Mark Hampton 2005-04-12, 8:23 pm |
| So, when the documentation for SORTKEY says "Default Unicode Multilingual",
would that treat a followed by a combining diacritical the same as the
"composed" character? e.g., a followed by 0x0301 (combining acute accent)
equals 0x00e1 (latin small a with acute accent) equals a?
What about full-width and half-width latin letters and numbers?
"John Smirnios" < smirnios_at_sybase_d
ot_com> wrote in message
news:425be670$1@foru
ms-1-dub...
> ASA's UTF8 collation is essentially a binary collation for the UTF-8
> encoding. Effectively ASA can assign an arbitrary sort position based on
> the first byte of an encoding and then compares the follow bytes as they
> appear in the encoding. So, equating 'a' to 'A' is possible since they are
> single byte. However, accented versions of a and A have different follow
> bytes (in UTF-8) and therefore cannot be considered equal by ASA.
> Similarly, characters whose encodings are different lengths cannot be
> considered equal ('a' cannot be considered equal to accented 'a' in
> UTF-8). I'm pretty sure this used to be described in gory detail in the
> documentation but a quick search now didn't yield good results -- I'll
> check with the documentation team.
>
> You can get around some of that awkwardness by using the SORTKEY and
> COMPARE functions which do linguistically correct comparisons (including
> secondary & tertiary sorts, etc).
>
> The usefulness is that it allows you to store unicode in the database &
> get some linguistically correct behaviour by using some intrinsic
> functions until ASA supports unicode sorting more completely.
>
> -john.
>
> --
> John Smirnios
> Senior Software Developer
> iAnywhere Solutions Engineering
>
> Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
> Developer Community at http://www.ianywhere.com/developer
>
> Clive Collie wrote:
>
| |
| Clive Collie 2005-04-13, 7:23 am |
| I can't make COMPARE do anything that "=" or "like" doesn't do. How is
COMPARE used in practice? For example '??????? ??? ???????' is Bank of
Greece (I hope!). How would you find this with a select statement in case
insensitive fashion
Assuming my database has a UTF8 collation
select * from companies where name = '??????? ??? ???????' will not find it,
nor will
select * from companies where name like '??????? ??? ???????%'. How would
you plug COMPARE in there?
Not that that would help since my application is already written and there
is no way to alter the queries it generates. I am starting to think that a
database containing data from European counties, stored in the native
languages, is not going to be possible. I cannot do case sensitive searches,
and I do an awful lot of "like" queries (or charindex etc). A UTF8 db
combined with a PB10 unicode app seemes to be the only way I can enter
combinations of English, Greek, Swedish, etc, not to mention Japanese data
without the characters getting all messed up (which is what happens if I
uses a non-unicode collation like Latin1). I can get the data in and read it
out with UTF8, it would be nice to search it too!
Is there a way, or is ASA9 just not capable at this kind of thing?
Clive.
"John Smirnios" < smirnios_at_sybase_d
ot_com> wrote in message
news:425be670$1@foru
ms-1-dub...
> ASA's UTF8 collation is essentially a binary collation for the UTF-8
> encoding. Effectively ASA can assign an arbitrary sort position based on
> the first byte of an encoding and then compares the follow bytes as they
> appear in the encoding. So, equating 'a' to 'A' is possible since they are
> single byte. However, accented versions of a and A have different follow
> bytes (in UTF-8) and therefore cannot be considered equal by ASA.
> Similarly, characters whose encodings are different lengths cannot be
> considered equal ('a' cannot be considered equal to accented 'a' in
> UTF-8). I'm pretty sure this used to be described in gory detail in the
> documentation but a quick search now didn't yield good results -- I'll
> check with the documentation team.
>
> You can get around some of that awkwardness by using the SORTKEY and
> COMPARE functions which do linguistically correct comparisons (including
> secondary & tertiary sorts, etc).
>
> The usefulness is that it allows you to store unicode in the database &
> get some linguistically correct behaviour by using some intrinsic
> functions until ASA supports unicode sorting more completely.
>
> -john.
>
> --
> John Smirnios
> Senior Software Developer
> iAnywhere Solutions Engineering
>
> Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
> Developer Community at http://www.ianywhere.com/developer
>
> Clive Collie wrote:
>
| |
| John Smirnios 2005-04-13, 8:23 pm |
| Yes to the comparison of a-acute with a-followed-by-combining-acute. The
following returns 0 (equal) on a utf8 database:
select compare( '\xc3\xa1', 'a\xcc\x81' )
As for the full-width vs. half-width letter 'a', UCA (Unicode Collation
Algorithm) considers the difference between these two letters to be a
difference in "case" (just like upper versus lower). The default unicode
sort order is a tertiary sort and will pick up a case difference. If you
want to ignore case, try the 'nocase' collation. The following returns 0
(equal) on a utf8 database:
select compare( 'a', '\xef\xbd\x81', 'nocase' )
-john.
--
John Smirnios
Senior Software Developer
iAnywhere Solutions Engineering
Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
Developer Community at http://www.ianywhere.com/developer
Mark Hampton wrote:
> So, when the documentation for SORTKEY says "Default Unicode Multilingual",
> would that treat a followed by a combining diacritical the same as the
> "composed" character? e.g., a followed by 0x0301 (combining acute accent)
> equals 0x00e1 (latin small a with acute accent) equals a?
>
> What about full-width and half-width latin letters and numbers?
>
>
>
> "John Smirnios" < smirnios_at_sybase_d
ot_com> wrote in message
> news:425be670$1@foru
ms-1-dub...
>
>
>
| |
| John Smirnios 2005-04-13, 8:23 pm |
| See my other posting in this thread for examples where COMPARE can
return results that are different from ASA's equality. If you want to
use COMPARE and SORTKEY in a query you could do something like the
following:
select * from t where compare( col, 'bank of greece', 'nocase' ) = 0
or
select * from t where sortkey( col, 'nocase' ) = sortkey( 'bank of
greece', 'nocase' ) = 0
You may also want to take a look at the SORT_COLLATION option as well as
consider creating an index on sortkey( col, 'nocase' ) [which will
essentially just creates a computed column for the value].
As for LIKE, linguistically correct semantics are very complex and not
supported in ASA9.
--
John Smirnios
Senior Software Developer
iAnywhere Solutions Engineering
Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
Developer Community at http://www.ianywhere.com/developer
Clive Collie wrote:
> I can't make COMPARE do anything that "=" or "like" doesn't do. How is
> COMPARE used in practice? For example '??????? ??? ???????' is Bank of
> Greece (I hope!). How would you find this with a select statement in case
> insensitive fashion
>
> Assuming my database has a UTF8 collation
>
> select * from companies where name = '??????? ??? ???????' will not find it,
> nor will
> select * from companies where name like '??????? ??? ???????%'. How would
> you plug COMPARE in there?
>
> Not that that would help since my application is already written and there
> is no way to alter the queries it generates. I am starting to think that a
> database containing data from European counties, stored in the native
> languages, is not going to be possible. I cannot do case sensitive searches,
> and I do an awful lot of "like" queries (or charindex etc). A UTF8 db
> combined with a PB10 unicode app seemes to be the only way I can enter
> combinations of English, Greek, Swedish, etc, not to mention Japanese data
> without the characters getting all messed up (which is what happens if I
> uses a non-unicode collation like Latin1). I can get the data in and read it
> out with UTF8, it would be nice to search it too!
>
> Is there a way, or is ASA9 just not capable at this kind of thing?
>
> Clive.
>
>
>
> "John Smirnios" < smirnios_at_sybase_d
ot_com> wrote in message
> news:425be670$1@foru
ms-1-dub...
>
>
>
| |
| Mark Hampton 2005-04-14, 1:23 pm |
| That helps... So, \x only supports <bytes>. Sure would be nice to enter
unicode code points in some way like \u+feff.
"John Smirnios" < smirnios_at_sybase_d
ot_com> wrote in message
news:425d69f6$1@foru
ms-1-dub...
> Yes to the comparison of a-acute with a-followed-by-combining-acute. The
> following returns 0 (equal) on a utf8 database:
>
> select compare( '\xc3\xa1', 'a\xcc\x81' )
>
>
> As for the full-width vs. half-width letter 'a', UCA (Unicode Collation
> Algorithm) considers the difference between these two letters to be a
> difference in "case" (just like upper versus lower). The default unicode
> sort order is a tertiary sort and will pick up a case difference. If you
> want to ignore case, try the 'nocase' collation. The following returns 0
> (equal) on a utf8 database:
>
> select compare( 'a', '\xef\xbd\x81', 'nocase' )
>
> -john.
>
>
>
> --
> John Smirnios
> Senior Software Developer
> iAnywhere Solutions Engineering
>
> Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
> Developer Community at http://www.ianywhere.com/developer
>
> Mark Hampton wrote:
>
>
| |
| John Smirnios 2005-04-15, 11:23 am |
| I just used \x notation so that nothing was lost in the posting. Expect
some unicode notation when ASA adds better unicode support.
-john.
--
John Smirnios
Senior Software Developer
iAnywhere Solutions Engineering
Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
Developer Community at http://www.ianywhere.com/developer
Mark Hampton wrote:
> That helps... So, \x only supports <bytes>. Sure would be nice to enter
> unicode code points in some way like \u+feff.
>
> "John Smirnios" < smirnios_at_sybase_d
ot_com> wrote in message
> news:425d69f6$1@foru
ms-1-dub...
>
>
>
| |
| Clive Collie 2005-04-20, 7:23 am |
| You have said it twice now: "when" ASA has better Unicode support. Is this
down as a feature for any particular version you could tell us about. We
would most cetainly have Unicode suppert at the top of our wish list. Not a
very sexy feature I agree but even a small software company like ours deals
with countries all over the world, and all this code page stuff is really
giving us problems. We have one client that want to consolidate databases
from England, France, Sweden, Norway, Denmark, Greece and Lituania! Now tell
me a codepage that allows searching and sorting to work for that lot?!
Clive
"John Smirnios" < smirnios_at_sybase_d
ot_com> wrote in message
news:425fde65$1@foru
ms-2-dub...
>I just used \x notation so that nothing was lost in the posting. Expect
>some unicode notation when ASA adds better unicode support.
>
> -john.
>
> --
> John Smirnios
> Senior Software Developer
> iAnywhere Solutions Engineering
>
> Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
> Developer Community at http://www.ianywhere.com/developer
>
> Mark Hampton wrote:
>
>
>
| |
| John Smirnios 2005-04-20, 9:23 am |
| Well, we're never allowed to actually comment on future plans especially
where it pertains to pinning it down to a "when" or "which version".
Suffice it to say that we are aware that we need better unicode support
and that you will likely see improvements in the future.
-john.
--
John Smirnios
Senior Software Developer
iAnywhere Solutions Engineering
Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
Developer Community at http://www.ianywhere.com/developer
Clive Collie wrote:
> You have said it twice now: "when" ASA has better Unicode support. Is this
> down as a feature for any particular version you could tell us about. We
> would most cetainly have Unicode suppert at the top of our wish list. Not a
> very sexy feature I agree but even a small software company like ours deals
> with countries all over the world, and all this code page stuff is really
> giving us problems. We have one client that want to consolidate databases
> from England, France, Sweden, Norway, Denmark, Greece and Lituania! Now tell
> me a codepage that allows searching and sorting to work for that lot?!
>
> Clive
>
>
>
> "John Smirnios" < smirnios_at_sybase_d
ot_com> wrote in message
> news:425fde65$1@foru
ms-2-dub...
>
>
>
|
|
|
|
|