Home > Archive > PostgreSQL JDBC > April 2005 > UNICODE encoding and jdbc related issues









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author UNICODE encoding and jdbc related issues
Chris Kratz

2005-04-06, 8:03 pm

Hello All,

I posted this on the general mailing list several days ago without a response
so am posting here. Does anyone here have any wisdom or experience they don't
mind sharing?

Our production database was created with the default SQL_ASCII encoding. It
appears that some of our users have entered characters into the system with
characters above 127 (accented vowels, etc). None of the tools we use
currently have had a problem with this behavior until recently, everything
just worked.

I was testing some reporting tools this past weekend and have been playing
with Jasper reports[1] . Jasper reports is a Java based reporting tool that
reads data from the database via JDBC. When I initially tried to generate
reports, the jdbc connection would crash with the following message:

org.postgresql.util.PSQLException: Invalid character data was found.

Googling eventually turned up a message on the pgsql-jdbc list detailing the
problem[2]. Basically, java cannot convert these characters above 127 into
unicode which is required by java.

After some more googling, I found that if I took a recent database dump and
then ran it through iconv[3] and then created the database with a unicode
encoding, everything worked.

1. Is there any way to do a iconv type translation inline in a sql statement?
ie select translate(text_field
, unicode) from table.... Btw, set
client_encoding=UNIC
ODE does not work in this situation. In fact the JDBC
driver for postgres seems to do this automatically.

2. I'm really not sure I want to change the encoding of our main database to
Unicode. Is there a performance loss when going to a UNICODE database
encoding? What about sorts, etc. I'm really worried about unintended side
effects of moving from SQL_ASCII to UNICODE.

3. Is there any other way around this issue? Or are we living dangerously by
trying to store non-ascii data in a database created as ascii encoded?

4. Has anyone else gone through a conversion like this? Are there any
gotchas we should look out for?

Thanks for your time,

-Chris

We are using postgres 7.4.5 on Linux.

[1] http://jasperreports.sourceforge.net/
[2] http://archives.postgresql.org/pgsq...10/msg00280.php
[3] iconv -f iso8859-1 -t utf-8 < dbsnapshot.dumpall > dump-utf-8.dumpall
--
Chris Kratz

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Kris Jurka

2005-04-06, 8:03 pm



On Wed, 6 Apr 2005, Chris Kratz wrote:

> Our production database was created with the default SQL_ASCII encoding. It
> appears that some of our users have entered characters into the system with
> characters above 127 (accented vowels, etc). None of the tools we use
> currently have had a problem with this behavior until recently, everything
> just worked.
>
> I was testing some reporting tools this past weekend and have been playing
> with Jasper reports[1] . Jasper reports is a Java based reporting tool that
> reads data from the database via JDBC. When I initially tried to generate
> reports, the jdbc connection would crash with the following message:
>
> org.postgresql.util.PSQLException: Invalid character data was found.
>
> Googling eventually turned up a message on the pgsql-jdbc list detailing the
> problem[2]. Basically, java cannot convert these characters above 127 into
> unicode which is required by java.
>
> After some more googling, I found that if I took a recent database dump and
> then ran it through iconv[3] and then created the database with a unicode
> encoding, everything worked.
>
> 1. Is there any way to do a iconv type translation inline in a sql statement?
> ie select translate(text_field
, unicode) from table.... Btw, set
> client_encoding=UNIC
ODE does not work in this situation. In fact the JDBC
> driver for postgres seems to do this automatically.


You can't do translation inline, how would a driver interpret the results
of SELECT translate(field1, unicode), translate(field2, latin1) ?

The driver does SET client_encoding which does work for all real server
encodings. The problem is that SQL_ASCII is not a real encoding. It
accepts any encoding and cannot do conversions to other encodings. Your
db right now could easily have a mix of encodings.

> 2. I'm really not sure I want to change the encoding of our main database to
> Unicode. Is there a performance loss when going to a UNICODE database
> encoding? What about sorts, etc. I'm really worried about unintended side
> effects of moving from SQL_ASCII to UNICODE.


You don't need to use unicode, but you must select another encoding. If
you'd like to stick with a single byte encoding perhaps LATIN1 would be
appropriate for you.


> 3. Is there any other way around this issue? Or are we living dangerously by
> trying to store non-ascii data in a database created as ascii encoded?


You are living dangerously.

> 4. Has anyone else gone through a conversion like this? Are there any
> gotchas we should look out for?


The gotchas here are to make sure your other client tools still work
against the new database.

> [3] iconv -f iso8859-1 -t utf-8 < dbsnapshot.dumpall > dump-utf-8.dumpall


I see your data really is LATIN1. Perhaps you should use that as your db
encoding. That should keep your existing client tools happy as well as
the JDBC driver.

Kris Jurka

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql
.org

Igor Postelnik

2005-04-06, 8:03 pm

> > 2. I'm really not sure I want to change the encoding of our main
> database to
database[color=darkr
ed]
unintended[color=dar
kred]
> side
>
> You don't need to use unicode, but you must select another encoding.

If
> you'd like to stick with a single byte encoding perhaps LATIN1 would

be
> appropriate for you.


I've asked this before on the performance list but didn't get any reply.
Is there substantial performance difference between using SQL_ASCII,
LATIN1, or UNICODE?

> The driver does SET client_encoding which does work for all real

server
> encodings. The problem is that SQL_ASCII is not a real encoding. It
> accepts any encoding and cannot do conversions to other encodings.

Your
> db right now could easily have a mix of encodings.


ISTM that when you create a database with SQL_ASCII encoding you decide
to deal with character set issues in the applications. Why is the JDBC
driver dictating how the application handles character set issues?

-Igor


> -----Original Message-----
> From: pgsql-jdbc-owner@postgresql.org [mailto:pgsql-jdbc-
> owner@postgresql.org] On Behalf Of Kris Jurka
> Sent: Wednesday, April 06, 2005 1:23 PM
> To: Chris Kratz
> Cc: pgsql-jdbc@postgresql.org
> Subject: Re: [JDBC] UNICODE encoding and jdbc related issues
>
>
>
> On Wed, 6 Apr 2005, Chris Kratz wrote:
>
encoding.[color=darkred]
> It
system[color=darkred
]
> with
use[color=darkred]
> everything
> playing
tool[color=darkred]
> that
> generate
detailing[color=dark
red]
> the
127[color=darkred]
> into
dump[color=darkred]
> and
> unicode
> statement?
the[color=darkred]
> JDBC
>
> You can't do translation inline, how would a driver interpret the

results
> of SELECT translate(field1, unicode), translate(field2, latin1) ?
>
>
>
>
> dangerously by
encoded?[color=darkred]
>
> You are living dangerously.
>
any[color=darkred]
>
> The gotchas here are to make sure your other client tools still work
> against the new database.
>
> 8.dumpall
>
> I see your data really is LATIN1. Perhaps you should use that as your

db
> encoding. That should keep your existing client tools happy as well

as
> the JDBC driver.
>
> Kris Jurka
>
> ---------------------------(end of

broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to

majordomo@postgresql
.org



---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql
.org so that your
message can get through to the mailing list cleanly

Kris Jurka

2005-04-06, 8:03 pm



On Wed, 6 Apr 2005, Igor Postelnik wrote:

> I've asked this before on the performance list but didn't get any reply.
> Is there substantial performance difference between using SQL_ASCII,
> LATIN1, or UNICODE?


Performance where? Backend performance in terms of string comparisons and
sorting is driver off of locale, not encoding. You may use the C locale
with UNICODE encoding for example so that should not be an issue. For the
JDBC driver it always wants data coming back to it in unicode. If you've
got a unicode db no conversion is necessary. If you've got a sql_ascii
db no conversion is possible. If you've got a latin1 db conversion will
happen, but I don't know what the cost of that is.


> ISTM that when you create a database with SQL_ASCII encoding you decide
> to deal with character set issues in the applications. Why is the JDBC
> driver dictating how the application handles character set issues?


If the only API the JDBC driver provided was ResultSet.getBytes() then
that would be OK (note this is the only API libpq provides). To provide
getString() the driver must know what encoding the data coming back is
really in. A database encoding of sql_ascii tells us nothing so we can do
nothing about it. It has been suggested in the past to allow the real
database encoding for a sql_ascii database to be specified as a URL
parameter, but I am of the opinion that is just masking the problem, not
solving it. Data should be in a correctly encoded database. If you store
unicode data in a sql_ascii then things like varchar(N) are now the number
of bytes instead of the number of characters as it should. With sql_ascii
there is no restriction on what data can be entered and you can get
yourself in a real mess with different clients entering data in different
encodings. Do yourself a favor and pick a real encoding.

Kris Jurka

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com