Author Unload charset problem
Clive Collie

2005-12-16, 11:23 am

I have a database (ASA 9.0.2) that is in the UTF8 collation. However the
data is in the 949KOR code page. For example the Korean character "Ieung A
Nieun" is 0xBEC8 in cp949 but is 0xEC9588 in UTF8. In my db it is 0xBEC8,
not the proper UTF8 version.
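
To make the mismatch concrete, here is a minimal Python sketch (an editorial illustration, not part of the original post). It assumes the character in question is U+C548, the code point that both byte sequences quoted above correspond to, and it uses Python's built-in `cp949` and `utf-8` codecs as stand-ins for the database code pages.

```python
# Assumption: U+C548 is the character whose bytes are quoted in the post.
ch = "\uC548"

cp949_bytes = ch.encode("cp949")   # what is actually stored in the database
utf8_bytes = ch.encode("utf-8")    # what a UTF8 database should contain

print(cp949_bytes.hex().upper())   # BEC8
print(utf8_bytes.hex().upper())    # EC9588

# The cp949 bytes are not a valid UTF-8 sequence, which is why anything
# that interprets them as UTF-8 is likely to mishandle them.
try:
    cp949_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

print(valid_as_utf8)               # False
```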

I want to unload the db and reload into a cp949 db. However I can't find a
way to do it. If I unload it without translation (i.e. as UTF8) and attempt a
reload into cp949 I find "load table" fails: "Cannot convert something to
something else. A bad value was supplied" with the actual column types
varying from table to table. It does this as soon as it tries to import any
record containing Korean text. Changing to "Input into" fails in the same
way.

If I unload using -xi -c "....;charset=cp949" then the unload translates the
characters and totally mangles them, e.g. 0xBEC8 gets changed to 0xE688 (or
something like that).

So how is it possible to unload/reload this data such that the data itself
remains identical but ends up in a db with a different codepage?

TIA
Clive


John Smirnios

2005-12-20, 9:23 am

UNLOAD TABLE in 9.x always outputs the table data without going through
character set translation. Based on your claim that the data is actually
cp949 inside your utf8 database (even though you would have needed to go
through great pains to do that), the files generated by UNLOAD TABLE
should be in cp949. Verify that that is the case by viewing the table
data files in an editor on your cp949 system.

Since the server cannot properly load the data files which are
supposedly in cp949 into a cp949 database, I'd be willing to bet that
you have some cp949 and some utf8 sitting in your tables. In that case,
you'll have to go through the unloaded table data line by line and
separate the cp949 data from the utf8 data. Use load table to load the
cp949 data then use dbisqlc to connect with "charset=utf8" and use the
INPUT statement. The utf8 data will be converted on its way across the
wire. There is a way to do it with dbisql but I don't recall how to set
the charset for it. If your table name contains Korean characters, you
will need to find out how to use dbisql.
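
As a rough aid for the line-by-line separation described above, the following Python sketch guesses the encoding of each line in an unloaded data file. It is only a heuristic under stated assumptions: the function names are hypothetical, the file is assumed to hold one row per line, and UTF-8 is tried first because cp949 bytes are rarely valid UTF-8. Ambiguous lines should still be checked by eye.

```python
def classify_line(raw: bytes) -> str:
    """Guess how one line of unloaded table data is encoded (heuristic)."""
    if raw.isascii():
        return "ascii"          # loads the same way under either charset
    for name in ("utf-8", "cp949"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "unknown"            # neither codec accepts it; inspect manually


def split_by_encoding(path: str) -> dict:
    """Group the lines of a data file (hypothetical path) by guessed encoding."""
    groups = {}
    with open(path, "rb") as f:  # binary mode: no translation of any kind
        for line in f:
            groups.setdefault(classify_line(line), []).append(line)
    return groups
```

The "cp949" group would then go through LOAD TABLE and the "utf-8" group through INPUT with "charset=utf8", as described above.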

To avoid character set translation between server & client, use
"charset=none" or set the charset to the same as the database (i.e.
"charset=utf8").

-john.

--
John Smirnios
Senior Software Developer
iAnywhere Solutions Engineering

Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
Developer Community at http://www.ianywhere.com/developer


Clive Collie

2005-12-21, 7:23 am

Believe me, it is cp949 data and they did go through great pains to get that
data in there. Since I posted this I have had the pleasure of spending
several days pondering the problem and it seems to centre on CR/LF pairs.
For some reason a straight unload of the db messes up the CR/LFs. Consider a
text field containing Hex:

BE C8 2E 0D 0A

This is cp949 data and represents a Korean character then a full stop then a
CR/LF.

The file produced by unload dumps this out as (looked at in a sort of
combined hex and text dump)

BE C8 2E 0D \x0A

i.e. Korean character then full stop then binary CR then escaped LF

The 0D is skipped and not escaped properly in the unload file. This happens
many times, so there are a whole lot of spurious CR characters all over the
unload file.

Using dbunload -xi -c "...charset=UTF8" does this as well. Putting
charset=cp949 translates BEC8 into garbage but fixes the \x0D\x0A problem.

However you have solved the problem. I didn't know there was a charset=none
option in this context. So the command line

dbunload -xi -c "...;charset=none" etc

has left the Korean characters alone and produces proper \x0D\x0A escape
characters so I can now reload into a cp949 database successfully.
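
A quick way to confirm that a field survives this kind of unload byte for byte is to undo the `\xHH` escapes and compare against the original. The Python sketch below is an editorial illustration using the sample field from this post; `unescape_hex` is a hypothetical helper that handles only `\xHH` escapes and ignores any other escape forms the unload file may contain.

```python
import re

# Matches a literal backslash, 'x', and two hex digits, e.g. \x0D
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")


def unescape_hex(data: bytes) -> bytes:
    """Replace each \\xHH escape with the single byte it stands for."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


# The field as written by the charset=none unload: raw cp949 bytes,
# a full stop, then escaped CR and LF.
escaped = b"\xbe\xc8.\\x0D\\x0A"

restored = unescape_hex(escaped)
print(restored.hex(" ").upper())   # BE C8 2E 0D 0A
```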

So thanks very much for saving the day.

Regards
Clive




John Smirnios

2005-12-22, 9:23 am

Actually, that makes sense. By default the server (and ISQL) will try to
escape certain values but it will never escape what it thinks are
follow-bytes in a multibyte character. Since the server thinks the data
is UTF-8, 0xBE is a follow byte (with no lead present) so it is treated
as a single byte character. 0xc8 introduces a 3-byte character which
means neither the 2e nor the 0d can be escaped.
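
The effect can be reproduced with a toy model of the escaping rule. This Python sketch is an editorial illustration, not the server's actual algorithm: `escape_field` and its `lead_lengths` table are hypothetical, and the entry mapping 0xC8 to a length of 3 is taken from the description above rather than from the server source or the UTF-8 specification.

```python
def escape_field(data: bytes, lead_lengths: dict) -> bytes:
    """Escape control bytes as \\xHH, but never inside a multibyte character.

    lead_lengths maps a lead byte to the total length of the character it
    is assumed to introduce; bytes not listed count as single-byte.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        n = lead_lengths.get(b, 1)
        if n > 1:
            out += data[i:i + n]        # whole "character" copied verbatim
            i += n
        elif b < 0x20:
            out += b"\\x%02X" % b       # control byte gets escaped
            i += 1
        else:
            out.append(b)               # ordinary single byte
            i += 1
    return bytes(out)


field = bytes.fromhex("BEC82E0D0A")

# Data treated as multibyte: 0xC8 swallows 2E and 0D, so only 0A is escaped.
as_utf8 = escape_field(field, {0xC8: 3})

# charset=none modelled as "every byte stands alone": CR and LF both escaped.
as_none = escape_field(field, {})

print(as_utf8)
print(as_none)
```

Under these assumptions the first call leaves the CR raw, matching the broken unload file, and the second escapes both CR and LF, matching the charset=none result.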

Your only other choice would have been to turn off ESCAPES but since you
have embedded CRLF in your strings, you would have had different trouble.
Well, at least you've got it working now. :)

-john.
