
Author character encoding in StartupMessage
John DeSoi

2006-02-28, 8:32 pm

I could not find anything in the Frontend/Backend protocol docs about
character encoding in the StartupMessage. Assuming it is legal for a
database or user name to have unicode characters, how is this handled
when nothing yet has been said about the client encoding?
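
For reference, the v3 protocol documentation describes StartupMessage as a length word, a protocol version number, and a list of null-terminated name/value strings. The rough Python sketch below follows that layout. The helper name and the sample user/database names are made up for illustration, and encoding the database name as UTF-8 is purely an assumption of the sketch. The point is that the names travel as raw bytes, with nothing in the message saying how they were encoded.

```python
import struct

PROTOCOL_VERSION_3_0 = 196608  # (3 << 16) | 0, major version 3, minor version 0


def build_startup_message(params):
    """Build a v3 StartupMessage from (name, value) pairs given as bytes."""
    body = struct.pack("!i", PROTOCOL_VERSION_3_0)
    for name, value in params:
        # Each name and each value is a null-terminated string.
        body += name + b"\x00" + value + b"\x00"
    body += b"\x00"  # a lone zero byte terminates the parameter list
    # The length word counts itself (4 bytes) plus everything after it.
    return struct.pack("!i", len(body) + 4) + body


# Hypothetical names. The client must choose an encoding before the server
# has been told anything about it; UTF-8 is only this sketch's assumption.
database = "caf\u00e9".encode("utf-8")
message = build_startup_message([(b"user", b"john"), (b"database", database)])
print(message)
```

Whatever bytes the client puts in the `database` value are what the server receives, so the question is how the server interprets them.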

Thanks,


John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Tom Lane

2006-02-28, 8:32 pm

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> A similar badness is that if you issue CREATE DATABASE from a UTF8
> database, the dbname will be stored as UTF8. Then, if you go to a
> LATIN1 database and create another it will be stored as LATIN1.


Yeah, this has been discussed before. Database and user names both
have this affliction.

I don't see any very nice solution at the moment. Once we get support
for per-column locales, it might be possible to declare that the shared
catalogs are always in UTF8 encoding and get the necessary
conversions to happen automatically.

regards, tom lane


Christopher Kings-Lynne

2006-02-28, 8:32 pm

> I don't see any very nice solution at the moment. Once we get support
> for per-column locales, it might be possible to declare that the shared
> catalogs are always in UTF8 encoding and get the necessary
> conversions to happen automatically.



At the very least, could we always convert dbnames and store them as
their own encoding? That way at least in HTML you can probably mark
them out as having particular encodings or something...

Chris



Martijn van Oosterhout

2006-02-28, 8:32 pm

On Tue, Feb 28, 2006 at 02:45:25PM +0800, Christopher Kings-Lynne wrote:
>
> At the very least, could we always convert dbnames and store them as
> their own encoding? That way at least in HTML you can probably mark
> them out as having particular encodings or something...


This may be the only solution. Converting everything to UTF-8 has
issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc
gives you a different string than you started with). There's probably
no encoding round-trip safe with every other encoding.
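
One way to make "round-trip safe" concrete is to push every byte value of a single-byte encoding through Unicode and back and see whether it survives. The sketch below does that with Python's built-in codecs, which are not PostgreSQL's conversion tables, so it only illustrates the property being discussed and says nothing about what the server's converters actually do. The function name is hypothetical, and multi-byte encodings such as EUC-JP would need a different test.

```python
def roundtrip_failures(encoding):
    """Return the byte values that do not survive bytes -> Unicode -> bytes."""
    failures = []
    for value in range(256):
        original = bytes([value])
        try:
            text = original.decode(encoding)
        except UnicodeDecodeError:
            # The byte is unassigned in this codec, so there is nothing to round-trip.
            continue
        try:
            if text.encode(encoding) != original:
                failures.append(value)
        except UnicodeEncodeError:
            failures.append(value)
    return failures


# Report any bytes that change on the way through Unicode and back.
for name in ("latin-1", "iso8859-2", "koi8-r", "cp1251"):
    print(name, roundtrip_failures(name))
```

An empty list means every assigned byte came back unchanged with that particular converter; a different converter for the same encoding name could still disagree.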

You could probably do things like assume that the database name is in
the same encoding as that database and set \l to output:

select convert(datname, pg_encoding_to_char(encoding), getdatabaseencoding()) from pg_database;

However, my personal preference is to treat the name of the database as
a "bunch of bits", i.e., not to consider it encoded at all. To log in, the
user must provide the same "bunch of bits". This doesn't solve the
issue of how to display the database names to users. Maybe define a
cluster encoding for the shared catalogs...
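
To illustrate the "bunch of bits" idea, here is a small Python sketch. The function and the sample name are invented for the example and this is not how the backend is written. It compares names as opaque bytes and shows what goes wrong at display time when the reader guesses the wrong encoding.

```python
def names_match(stored: bytes, supplied: bytes) -> bool:
    """Compare two names as opaque byte strings, ignoring encoding entirely."""
    return stored == supplied


name = "m\u00fcller"                      # hypothetical database name
stored = name.encode("latin-1")           # created from a LATIN1 session
from_latin1_client = name.encode("latin-1")
from_utf8_client = name.encode("utf-8")

# Same characters, same encoding: the bytes are identical, so login works.
print(names_match(stored, from_latin1_client))
# Same characters, different encoding: the bytes differ, so login fails.
print(names_match(stored, from_utf8_client))

# Display is the unsolved part: the bytes alone do not say how to read them.
print(from_utf8_client.decode("latin-1"))  # UTF-8 bytes misread as LATIN1
```

The comparison itself is trivial and encoding-free, which is the appeal. The cost is that the last line prints mojibake, which is exactly the display problem left open above.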

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Alvaro Herrera

2006-02-28, 8:32 pm

Martijn van Oosterhout wrote:

> This may be the only solution. Converting everything to UTF-8 has
> issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc
> gives you a different string than you started with). There's probably
> no encoding round-trip safe with every other encoding.


Is this still true? If I remember correctly, Tatsuo-san had asserted that
this was the case, but later he said there was some bug in our
conversion routines or the conversion tables. So maybe now that those
things are fixed (they are, aren't they?) there _is_ a safe roundtrip
from anything to UTF8 and back.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


John DeSoi

2006-02-28, 8:32 pm


On Feb 28, 2006, at 1:38 AM, Tom Lane wrote:

>
>
> Yeah, this has been discussed before. Database and user names both
> have this affliction.


So are the database/user names in the startup message compared using
the default encoding of the cluster, or is it just a straight byte
comparison with no consideration of the encoding?

Thanks,



John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL



Martijn van Oosterhout

2006-02-28, 8:32 pm

On Tue, Feb 28, 2006 at 12:05:17PM -0300, Alvaro Herrera wrote:
> Martijn van Oosterhout wrote:
>
>
> Is this still true? If I remember correctly, Tatsuo-san had asserted that
> this was the case, but later he said there was some bug in our
> conversion routines or the conversion tables. So maybe now that those
> things are fixed (they are, aren't they?) there _is_ a safe roundtrip
> from anything to UTF8 and back.


I believe so. If you use the ICU Converter Explorer [1] to examine some of
the encodings we support, they have "Contains ambiguous aliases? TRUE".
This means that there are multiple converters that claim to support that
encoding, though they produce different results.

The UTF-8 and Unicode FAQ [2] also lists some issues with EUC-JP saying
that the converters had to be modified to make round-trip conversion
work. However, not all converters work the same.

Anyway, maybe it's not a big problem anymore. The ISO-2022 series is
definitely not round-trip compatible [3], but I don't think we support
them anyway. I think the only issue is if the mappings postgres uses
internally don't match what the user expects, but I don't think there's
much we can do about that...

[1] http://www-950.ibm.com/software/glo...demo/converters
[2] http://www.cl.cam.ac.uk/~mgk25/unicode.html
[3] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Tom Lane

2006-02-28, 8:32 pm

Martijn van Oosterhout <kleptog@svana.org> writes:
> I believe so. If you use the ICU Converter Explorer [1] to examine some of
> the encodings we support, they have "Contains ambiguous aliases? TRUE".


Which ones, and are they client-only encodings? If all our server-side
encodings are round-trip safe then I think there's no big issue.

In any case I don't think there's a huge problem if we say that database
and user names had better be chosen from the round-trip-safe subset.

regards, tom lane


Magnus Hagander

2006-02-28, 8:32 pm

> Martijn van Oosterhout <kleptog@svana.org> writes:
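> > I believe so. If you use the ICU Converter Explorer [1] to
> > examine some of the encodings we support, they have
> > "Contains ambiguous aliases? TRUE".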
>
> Which ones, and are they client-only encodings? If all our
> server-side encodings are round-trip safe then I think
> there's no big issue.
>
> In any case I don't think there's a huge problem if we say
> that database and user names had better be chosen from the
> round-trip-safe subset.


Doesn't this also affect passwords? If so it might be harder to enforce
as the user is often allowed to pick his own password...

//Magnus


Martijn van Oosterhout

2006-02-28, 8:32 pm

On Tue, Feb 28, 2006 at 11:19:02AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>
>
>
> Which ones, and are they client-only encodings? If all our server-side
> encodings are round-trip safe then I think there's no big issue.
>
> In any case I don't think there's a huge problem if we say that database
> and user names had better be chosen from the round-trip-safe subset.


This is what it says here [1]:

There are only 19 encodings currently used worldwide as legitimate
POSIX multi-byte locale encodings:

UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6,
ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15,
EUC-JP, EUC-KR, GB2312 (= EUC-CN), KOI8-R, KOI8-U, VISCII,
WINDOWS-1251, WINDOWS-1256

Each of these is fully roundtrip compatible to ISO 10646, therefore
all these locales can be represented nicely in wchar_t as the
equivalent UCS values. The above names and the corresponding defining
documents are listed in the IANA charset registry.

Some of these have multiple definitions according to ICU, meaning that
different platforms have implemented them differently in the past
(EUC-JP falls into this category), but presumably the IANA charset
registry has proper definitions.

Of the remaining encodings we support, Big5 is OK, although win-950,
the Windows version, has changed over time. GBK has the same problem:
win-936 has changed over time too. I don't think we should
concern ourselves with bugs in the Windows encodings.

IOW, I think we are mostly safe.

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


John DeSoi

2006-02-28, 8:32 pm


On Feb 28, 2006, at 11:19 AM, Tom Lane wrote:

> In any case I don't think there's a huge problem if we say that
> database and user names had better be chosen from the
> round-trip-safe subset.


What about the pg_hba.conf file? Is there a provision to specify the
encoding or some other way to deal with non-ASCII characters?

Thanks,


John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL



Tom Lane

2006-02-28, 8:32 pm

John DeSoi <desoi@pgedit.com> writes:
> On Feb 28, 2006, at 11:19 AM, Tom Lane wrote:
> What about the pg_hba.conf file? Is there a provision to specify the
> encoding or some other way to deal with non-ASCII characters?


pg_hba.conf is also processed without any locale considerations,
i.e., effectively the "bunch of bits" approach Martijn mentioned.

regards, tom lane

