Author Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possible solution
Guillaume Cottenceau

2005-04-27, 7:24 am

Mauricio Hernández Durán <mhernandez 'at' ingenian.com> writes:

> Hi all!
>
> We encountered the same problem most people have had using latin1 or
> unicode for spanish characters upon inserting or updates:
>
> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1


Iconv actually agrees that this UTF-8 character cannot be
converted to ISO8859-1.

I can print UTF-8's 0x00EF which gives "ï".

Then if I manually input "ï", the bytes in UTF-8 to do that are
0xC3AF, and this can be converted to ISO8859-1 (it is 0xEF).

Isn't there a problem with your UTF-8 data containing 0x00EF?
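
The iconv observations above can be reproduced with a short sketch. It uses Python's built-in codecs rather than iconv itself, and `utf8_to_latin1` is a hypothetical helper name, not something from the thread or from PostgreSQL.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Decode bytes as UTF-8, then re-encode the text as ISO-8859-1."""
    return data.decode("utf-8").encode("iso-8859-1")


# A hand-typed "ï" is the two bytes C3 AF in UTF-8 and the single
# byte EF in ISO-8859-1, so this conversion succeeds.
converted = utf8_to_latin1(b"\xc3\xaf")

# The two bytes 00 EF are rejected while still reading the input as
# UTF-8, because EF announces a three-byte sequence that never arrives.
try:
    utf8_to_latin1(b"\x00\xef")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Here `converted` holds the single byte 0xEF and `rejected` ends up true, which matches what iconv reported.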

--
Guillaume Cottenceau

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Anders Hermansen

2005-04-27, 7:24 am

* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Isn't there a problem with your UTF-8 data containing 0x00EF?


E0 to EF hex (224 to 239): first byte of a three-byte sequence.
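
The lead-byte rule quoted above can be sketched as a small lookup. The function name `utf8_sequence_length` is made up for illustration; the ranges follow the UTF-8 definition as I understand it, not any PostgreSQL source.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 for bytes that cannot start a sequence: continuation
    bytes 0x80-0xBF and the invalid values 0xC0, 0xC1 and 0xF5-0xFF.
    """
    if lead <= 0x7F:
        return 1  # plain ASCII, including the nul byte
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


# One well-known sequence starting with EF is the byte order mark
# EF BB BF (U+FEFF). Whether that is what the original poster's data
# contained is not stated anywhere in this thread.
bom_text = b"\xef\xbb\xbf".decode("utf-8")
```

So `utf8_sequence_length(0xEF)` gives 3, while the 0xC3 that starts "ï" gives 2.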


Anders Hermansen


Anders Hermansen

2005-04-27, 7:24 am

* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
>
> Well 00 is first byte here, isn't it?


UTF-8 is a byte sequence, so this is not about the first byte of the whole
sequence, but about the first byte of a three-byte sequence.

There should be no nul (0) bytes when encoding UTF-8. I believe this is in the
specification to allow it to be compatible with C nul-terminated strings.

I believe that the byte sequence 0x00EF is illegal UTF-8 because:
1) It contains nul (0x00) byte
2) 0xEF is not followed by two more bytes

On the other hand, U+00EF is a valid Unicode code point. It is
LATIN SMALL LETTER I WITH DIAERESIS, and it is encoded as:
0xC3AF in UTF-8
0x00EF in UTF-16 (and UCS-2?)
0xEF in ISO-8859-1
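
To see where 0xC3AF comes from, here is a minimal sketch of the bit arithmetic for a two-byte UTF-8 sequence, checked against Python's own codecs. `encode_two_byte_utf8` is a hypothetical helper written for this illustration only.

```python
import unicodedata


def encode_two_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only U+0080..U+07FF use two bytes in UTF-8")
    lead = 0xC0 | (code_point >> 6)            # 110xxxxx: upper five bits
    continuation = 0x80 | (code_point & 0x3F)  # 10xxxxxx: lower six bits
    return bytes([lead, continuation])


code_point = 0x00EF
char = chr(code_point)

name = unicodedata.name(char)
manual_utf8 = encode_two_byte_utf8(code_point)
codec_utf8 = char.encode("utf-8")
utf16_be = char.encode("utf-16-be")
latin1 = char.encode("iso-8859-1")
```

0xEF is 11101111 in binary: the upper two bits (11) land in the lead byte as 0xC3, and the lower six bits (101111) land in the continuation byte as 0xAF.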


Anders Hermansen


Guillaume Cottenceau

2005-04-27, 9:23 am

Anders Hermansen <anders 'at' yoyo.no> writes:

> * Guillaume Cottenceau (gc@mnc.ch) wrote:
>
> UTF-8 is a byte sequence, so this is not about the first byte of the whole
> sequence, but about the first byte of a three-byte sequence.


Yes. I forgot that you assumed the machine was big-endian. So the
UTF-8 character is here probably first byte 0xEF, second byte
0x00?

I did my test with first byte 0x00 and second byte 0xEF, hence
confusion with your initial comment.

My reasoning was that if the first byte of this two-byte
sequence is 0x00 then the rule that 0xEF is first byte of a
three-byte sequence doesn't apply, since 0xEF is second byte in
the sequence.

> There should be no nul (0) bytes when encoding UTF-8. I believe
> this is in the specification to allow it to be compatible with
> C nul-terminated strings.
>
> I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> 1) It contains nul (0x00) byte
> 2) 0xEF is not followed by two more bytes
>
> On the other hand U+00EF is a valid unicode code point. Which points to:


I think this is assumed little-endian, e.g. first byte 0x00 and
second byte 0xEF (especially because UTF-8 is just a series of
bytes without any endianness aspects, so it makes good sense to
actually read this left-to-right, e.g. byte 0x00 first).

> LATIN SMALL LETTER I WITH DIAERESIS
> It is encoded as 0xC3AF in UTF-8
> As 0x00EF in UTF-16 (and UCS-2 ?)


Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
the same[1].

> As 0xEF in ISO-8859-1


Hmm, I think I may understand what's going on here. It's possible
that in the message:

ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

"0x00ef" does not refer to the UTF-8 bytes per se but to the
Unicode code point (which is error-prone).


Ref:
[1] UCS-2 is a subset of UTF-16 which comprises all the 2-byte
sequence characters but none of the 4-byte (surrogate pair) sequences
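
A quick sketch of the UCS-2 / UTF-16 relationship described in [1], using Python's UTF-16 codec. The helper `utf16_code_units` and the choice of U+1D11E as the sample character are mine, picked only for illustration.

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


bmp_char = "\u00ef"           # inside the Basic Multilingual Plane
supplementary = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

# One code unit: identical in UCS-2 and UTF-16.
bmp_bytes = bmp_char.encode("utf-16-be")

# Two code units (a surrogate pair): UTF-16 only, no UCS-2 form.
pair_bytes = supplementary.encode("utf-16-be")
```

`bmp_bytes` is 00 EF, while `pair_bytes` is the four bytes D8 34 DD 1E.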

--
Guillaume Cottenceau


Vadim Nasardinov

2005-04-27, 9:23 am

On Wednesday 27 April 2005 07:54, Anders Hermansen wrote:
> On the other hand U+00EF is a valid unicode code point. Which points to:
> LATIN SMALL LETTER I WITH DIAERESIS
> It is encoded as 0xC3AF in UTF-8
> As 0x00EF in UTF-16 (and UCS-2 ?)
> As 0xEF in ISO-8859-1


http://www.eki.ee/letter/chardata.cgi?ucode=ef


Anders Hermansen

2005-04-27, 9:23 am

* Guillaume Cottenceau (gc@mnc.ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
>
> Yes. I forgot that you assumed the machine was big-endian. So the
> UTF-8 character is here probably first byte 0xEF, second byte
> 0x00?
>
> I did my test with first byte 0x00 and second byte 0xEF, hence
> confusion with your initial comment.
>
> My reasoning was that if the first byte of this two-byte
> sequence is 0x00 then the rule that 0xEF is first byte of a
> three-byte sequence doesn't apply, since 0xEF is second byte in
> the sequence.


Endianness is not a problem when working with a sequence of 8-bit bytes,
as in UTF-8. It only becomes a problem when more than one byte represents
a single value. So it is an issue in UTF-16, which is big-endian by
default, I think.
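
The difference is easy to see in a sketch using Python's explicit big-endian and little-endian UTF-16 codecs (the variable names are just for illustration):

```python
import codecs

char = "\u00ef"

# UTF-16 stores one 16-bit value, so the byte order has to be chosen.
utf16_be = char.encode("utf-16-be")
utf16_le = char.encode("utf-16-le")

# UTF-8 is defined byte by byte, so there is only one possible result.
utf8 = char.encode("utf-8")

# With a byte order mark in front, a UTF-16 decoder can tell the two
# orders apart and recovers the same character from either one.
decoded_be = (codecs.BOM_UTF16_BE + utf16_be).decode("utf-16")
decoded_le = (codecs.BOM_UTF16_LE + utf16_le).decode("utf-16")
```

Big-endian gives 00 EF and little-endian gives EF 00, while the UTF-8 form is C3 AF on any machine.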

So I interpreted the message "ERROR: could not convert UTF-8 character 0x00ef
to ISO8859-1" as a byte sequence with 0x00 first, and then 0xef. Maybe that's
a wrong assumption.

>
> I think this is assumed little-endian, e.g. first byte 0x00 and
> second byte 0xEF (especially because UTF-8 is just a series of
> bytes without any endianness aspects, so it makes good sense to
> actually read this left-to-right, e.g. byte 0x00 first).


As I said above, endianness is not an issue for UTF-8. The byte _sequence_ is
always read from start to end.

>
> Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
> the same[1].


Yes.

>
> Hum I think I may understand what's going on here. It's possible
> that in the message:
>
> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
>
> when they say "0x00ef" they don't talk about UTF-8 per se but
> they use the unicode representation (which is error prone).


If 0x00ef refers to a Unicode code point, it should not have been a problem to
convert it to ISO-8859-1 (0xef).

If 0x00ef refers to a byte sequence, then the error message is a bit
misleading, because it's not a character but a byte sequence. And the error
would be in decoding the UTF-8, not in encoding to ISO-8859-1.
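
The two readings can be told apart by checking which step of the conversion fails. This is only a sketch with Python's codecs, not PostgreSQL's actual conversion routine; `conversion_outcome` is a hypothetical helper, and the euro sign is my own example of a character that ISO-8859-1 lacks.

```python
def conversion_outcome(data: bytes) -> str:
    """Report which step of a UTF-8 to ISO-8859-1 conversion fails."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "decode error"  # the input is not valid UTF-8
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return "encode error"  # no ISO-8859-1 equivalent exists
    return "ok"


# Reading 1: 0x00ef is the code point U+00EF, properly encoded.
as_code_point = conversion_outcome(chr(0x00EF).encode("utf-8"))

# Reading 2: 0x00ef is the raw byte sequence 00 EF.
as_raw_bytes = conversion_outcome(bytes.fromhex("00ef"))

# For contrast: valid UTF-8 for U+20AC, which ISO-8859-1 cannot hold.
euro = conversion_outcome("\u20ac".encode("utf-8"))
```

Under reading 1 the conversion succeeds, under reading 2 it fails at the decode step, and only something like the euro sign fails at the encode step.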


Anders Hermansen
