Author Continuing encoding fun....
Dave Page

2005-09-03, 8:23 pm

I've been thinking about this whilst getting dragged round the shops
today, and having read Marko's, Johann's, Hiroshi's and other emails,
not to mention bits of the ODBC spec, here's where I think we stand.

1) The current driver works as expected with Unicode apps.

2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
functions to the Unicode ones, and this works because (as I think Marko
pointed out) the basic Latin chars map directly into the lower Unicode
characters (see http://www.unicode.org/charts/PDF/U0000.pdf).

3) Some other single byte LATIN encodings do not work. This is because
the characters do not map directly into Unicode 80-FF
(http://www.unicode.org/charts/PDF/U0080.pdf).

4) Multibyte apps do not work. I believe that in fact they never will
with a Unicode driver, because multibyte characters simply won't map
into Unicode in the same way that ASCII does. The user cannot opt to use
the non-wide functions, because the DM automatically maps them to the
Unicode versions.

Because the Driver Manager forces the user to use the *W functions if
they exist, I cannot see any way to make 3 or 4 work with a Unicode
driver. If we were to try to detect what encoding to use based on the OS
settings and convert on the fly, we would most likely break any apps
that try to do the right thing by using Unicode themselves. Does that
sound reasonable?

Therefore, it seems to me that the only thing to do is to reinstate the
#ifdef UNICODE preprocessor definitions in the source code (that I now
wish I hadn't removed!), and ship 2 versions of the driver - a Unicode
one, and an ANSI/Multibyte version (i.e. what 07.xx was).
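
A rough illustration of the two-build idea follows. It is only a sketch: UNICODE_SUPPORT is the macro name that comes up later in this thread, the display names are the ones proposed in the follow-ups, and driver_display_name() and exports_wide_entry_points() are hypothetical helpers, not psqlodbc code.

```c
/* Build with -DUNICODE_SUPPORT to get the Unicode flavour. Without it,
 * the wide (*W) entry points are simply not compiled into the binary,
 * so the choice is fixed at build time. */
#ifdef UNICODE_SUPPORT
#define DRIVER_NAME  "PostgreSQL Unicode"
#define HAS_WIDE_API 1
#else
#define DRIVER_NAME  "PostgreSQL ANSI"
#define HAS_WIDE_API 0
#endif

/* Hypothetical helper: name under which this build would be registered. */
const char *driver_display_name(void)
{
    return DRIVER_NAME;
}

/* Hypothetical helper: 1 if this build would export the *W functions. */
int exports_wide_entry_points(void)
{
    return HAS_WIDE_API;
}
```

Because the switch is a preprocessor symbol, one source tree can produce both binaries, but a single binary cannot change flavour after it has been built.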

Thoughts/comments? Hiroshi, what do other vendors do for the Japanese
market?

Regards, Dave.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Dave Page

2005-09-05, 3:23 am



> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 05 September 2005 05:57
> To: Dave Page; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: Re: [ODBC] Continuing encoding fun....
>
> Hi Dave.
>
> Is it meant as follows after all?
> with libpq version
> psqlodbca.dll "PostgreSQL ANSI"
> psqlodbcw.dll "PostgreSQL Unicode"
> without libpq version
> psqlodbca.dll "PostgreSQL ANSI"
> psqlodbcw.dll "PostgreSQL Unicode"


Yes - I am not concerned with the socket version of the driver - in
fact, I was going to talk to Anoop about removing the old code because
we've had at least a couple of cases of people patching the wrong part,
or mistakenly using the socket code.

Either way, we're certainly not going to release the non-libpq version
any more.

> Some complaint. Although I have not fully tried yet.-(
> I think that CRLF of a base code and patch is what it is hard to use.


Yes, I was too tired to try to fix the patch to remove the CRLF changes
:-( Still, they need to be fixed anyway.

BTW, your version misses the changes to installer/psqlodbcm.wxs...

>
> one vote is invested.


:-)

Regards, Dave


Johann Zuschlag

2005-09-05, 7:23 am

Dave Page schrieb:

>
>
>
>
>
>Attached is a patch to do this (apologies for the size, it seems that
>options.c had broken line ends).
>
>With this patch, you can build either the old style ANSI/Multibyte
>driver, or the Unicode driver. I've also removed the -libpq suffix that
>was added for testing, as this patch gives the driver a new name anyway.
>When installed on a Windows system, you then get:
>
>psqlodbca.dll "PostgreSQL ANSI"
>psqlodbcw.dll "PostgreSQL Unicode"
>
>Unless anyone has a better solution, I think this is the best fix to
>allow users with non-Unicode friendly apps to work as they used to with
>the older driver.
>
>Please shout ASAP if you object!!
>
>Regards, Dave
>
>

It is ok for me.

Can you send me the dll for the ANSI Driver?

It is not possible to just put a switch in the driver menu?

Regards,
Johann



Dave Page

2005-09-05, 7:23 am



> -----Original Message-----
> From: Johann Zuschlag [mailto:zuschlag2@online.de]
> Sent: 05 September 2005 10:40
> To: pgsql-odbc@postgresql.org
> Cc: Dave Page
> Subject: Re: [ODBC] Continuing encoding fun....
>
> It is ok for me.
>
> Can you send me the dll for the ANSI Driver?


Yup, I'll send it offlist.

> It is not possible to just put a switch in the driver menu?


Unfortunately not because it affects the functions exported by the DLL -
if the *W functions exist, the DM will map all calls to the *W versions,
even if the app uses the non-wide version.

Regards, Dave


Hiroshi Saito

2005-09-05, 1:26 pm

> Either way, we're certainly not going to release the non-libpq version
> any more.


OK, I also think that is reasonable.

> BTW, your version misses the changes to installer/psqlodbcm.wxs...


Uga... Sorry.

Ah.. I look at a part strange one.
Please check it.:-)

Regards,
Hiroshi Saito
Anoop Kumar

2005-09-06, 3:23 am

Hi Dave,

It would be wise to remove the socket code from the new driver. I will
let you know as soon as it gets completed.

Regards

Anoop

> -----Original Message-----
> From: Dave Page [mailto:dpage@vale-housing.co.uk]
> Sent: Monday, September 05, 2005 12:47 PM
> To: Hiroshi Saito; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: RE: [ODBC] Continuing encoding fun....
>
>
>
>
> Yes - I am not concerned with the socket version of the driver - in
> fact, I was going to talk to Anoop about removing the old code because
> we've had at least a couple of cases of people patching the wrong part,
> or mistakenly using the socket code.
>
> Either way, we're certainly not going to release the non-libpq version
> any more.
>
> Yes, I was too tired to try to fix the patch to remove the CRLF changes
> :-( Still, they need to be fixed anyway.
>
> BTW, your version misses the changes to installer/psqlodbcm.wxs...
>
>
> :-)
>
> Regards, Dave



Dave Page

2005-09-06, 3:23 am



> -----Original Message-----
> From: Anoop Kumar [mailto:anoopk@pervasive-postgres.com]
> Sent: 06 September 2005 06:25
> To: Dave Page; Hiroshi Saito; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag
> Subject: RE: [ODBC] Continuing encoding fun....
>
> Hi Dave,
>
> It would be wise to remove the socket code from the new driver. I will
> let you know as soon as it gets completed.


Now there's a coincidence - I was going to email you about that today!!

We've had a couple of instances of people mistakenly compiling the wrong
version, and even fixing bugs in the socket code :-(

Shall I apply the ANSI/Unicode patch first? It's quite invasive of
course - possibly more so than libpq/socket.

Regards, Dave


Dave Page

2005-09-06, 3:23 am



> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 05 September 2005 18:35
> To: Dave Page; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: Re: [ODBC] Continuing encoding fun....
>
> Ok, I also think that it is accordant to reason.
>
>
> Uga... Sorry.
>
> Ah.. I look at a part strange one.
> Please check it.:-)



Re patch:

--- connection.c.orig Tue Sep 6 01:47:23 2005
+++ connection.c Tue Sep 6 02:13:53 2005
@@ -1545,7 +1545,7 @@
     if (self->unicode)
     {
         if (!self->client_encoding ||
-            !stricmp(self->client_encoding, "UNICODE"))
+            stricmp(self->client_encoding, "UNICODE"))
         {
             QResultClass *res;
             if (PG_VERSION_LT(self, 7.1))

The opposite of this change was made in 1.92 of connection.c:
http://cvs.pgfoundry.org/cgi-bin/cv...odbc/connection.c?rev=1.92&content-type=text/x-cvsweb-markup

It seems to me that the current case is correct - in the Unicode driver
we *must* run with client_encoding = 'UNICODE' or it won't work
properly. That said, I wonder if we shouldn't just remove the if()
altogether, and unconditionally set the client encoding for the Unicode
driver.

Don't forget, this won't affect the ANSI/Multibyte case because it's
inside a "#ifdef UNICODE_SUPPORT".
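
To make the difference between the two conditions in the patch concrete, here is a small sketch. The helper names are hypothetical, and equals_ignore_case() merely stands in for "stricmp(...) == 0" so that the example needs only standard C. It models only the test itself, not what the driver does inside the branch.

```c
#include <ctype.h>
#include <stddef.h>

/* Portable case-insensitive equality, standing in for stricmp() == 0. */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Condition on the "-" line: !enc || !stricmp(enc, "UNICODE"). */
int branch_taken_current(const char *enc)
{
    return enc == NULL || equals_ignore_case(enc, "UNICODE");
}

/* Condition on the "+" line: !enc || stricmp(enc, "UNICODE"). */
int branch_taken_patched(const char *enc)
{
    return enc == NULL || !equals_ignore_case(enc, "UNICODE");
}
```

With "LATIN1" the first condition skips the branch and the second enters it; with "UNICODE" it is the other way round, and with no encoding set both enter it. Dropping the test altogether would enter the branch in every case.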

What do you think Anoop?

Regards, Dave



Anoop Kumar

2005-09-06, 7:23 am


> Done. Once you've removed the socket code, a new release seems in order.
> Sound OK to you?


OK for me. A new release would be proper.

> It seems to me that the current case is correct - in the Unicode driver
> we *must* run with client_encoding = 'UNICODE' or it won't work
> properly. That said, I wonder if we shouldn't just remove the if()
> altogether, and unconditionally set the client encoding for the Unicode
> driver.
>
> Don't forget, this won't affect the ANSI/Multibyte case because it's
> inside a "#ifdef UNICODE_SUPPORT".
>
> What do you think Anoop?
>

As this is already inside "#ifdef UNICODE_SUPPORT", I don't find the
necessity for checking it again.

Regards
Anoop


zuschlag2@online.de

2005-09-06, 7:23 am

Hi Dave

>It seems to me that the current case is correct - in the Unicode driver
>we *must* run with client_encoding = 'UNICODE' or it won't work
>properly. That said, I wonder if we shouldn't just remove the if()
>altogether, and unconditionally set the client encoding for the Unicode
>driver.

That assumption seems to be ok, even though I need it still for further testing. But I can use the version you've sent me.

Regards,
Johann



Marko Ristola

2005-09-06, 1:23 pm

zuschlag2@online.de wrote:
>Hi Dave
>
>
>

The following might be interesting for you:

If I activate the ISO C 99 API, I can do the following.
(I thought that I had used ANSI C 99, but the correct name for the
standard I meant is ISO C 99. It will become the default later; maybe it
already is with the newest GCCs.)

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

char cbuf[500];
wchar_t wbuf[500];

setlocale(LC_CTYPE,"");

strcpy(cbuf,"Some multibyte text");
swprintf(wbuf,500,L"%s",cbuf); /* swprintf takes a size and a wide format */
Now the text is under wchar_t's internal format, maybe UCS-2.

The following also works:
wcscpy(wbuf,L"Some UNICODE text"); /* wcscpy, not strcpy, for wchar_t */
sprintf(cbuf,"%ls",wbuf);

So, the UCS-2 and multibyte conversion under ISO C 99 seems to be very easy.
With GCC, with Debian Sarge, this can be done as follows:
gcc -std=c99
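
For comparison, the same locale-dependent conversion can be written with the standard mbstowcs() and wcstombs() functions. This is a minimal sketch, not driver code: to_wide() and to_multibyte() are hypothetical wrapper names, and the caller is assumed to have called setlocale(LC_CTYPE, "") first, as in the snippet above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string (in the current LC_CTYPE encoding) to wide
 * characters. Returns the number of wide characters written, or
 * (size_t)-1 if the input is not valid in the current locale. */
size_t to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;
    n = mbstowcs(dst, src, dst_len - 1);
    if (n != (size_t)-1)
        dst[n] = L'\0';   /* mbstowcs does not terminate when truncating */
    return n;
}

/* The reverse direction: wide characters back to a multibyte string.
 * Returns the number of bytes written, or (size_t)-1 on failure. */
size_t to_multibyte(const wchar_t *src, char *dst, size_t dst_len)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;
    n = wcstombs(dst, src, dst_len - 1);
    if (n != (size_t)-1)
        dst[n] = '\0';    /* wcstombs does not terminate when truncating */
    return n;
}
```

Both functions depend entirely on the active locale, so the result for non-ASCII input differs from system to system.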

I don't have now more time to test, at least today.

Iconv seems to be the solution for more advanced conversions under Linux.

Regards, Marko




Marc Herbert

2005-09-08, 3:23 am

"Dave Page" <dpage@vale-housing.co.uk> writes:

> I've been thinking about this whilst getting dragged round the shops
> today, and having read Marko's, Johann's, Hiroshi's and other emails,
> not to mention bits of the ODBC spec, here's where I think we stand.
>
> 1) The current driver works as expected with Unicode apps.
>
> 2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
> functions to the Unicode ones, and because (as I think Marko pointed
> out) the basic latin chars map directly into the lower Unicode
> characters (see http://www.unicode.org/charts/PDF/U0000.pdf).
>
> 3) Some other single byte LATIN encodings do not work. This is because
> the characters do not map directly into Unicode 80-FF
> (http://www.unicode.org/charts/PDF/U0080.pdf).
>
> 4) Multibyte apps do not work. I believe that in fact they never will
> with a Unicode driver, because multibyte characters simply won't map
> into Unicode in the same way that ASCII does. The user cannot opt to use
> the non-wide functions, because the DM automatically maps them to the
> Unicode versions.
>
> Because the Driver Manager forces the user to use the *W functions if
> they exist, I cannot see any way to make 3 or 4 work with a Unicode
> driver. If we were to try to detect what encoding to use based on the OS
> settings and convert on the fly, we would most likely break any apps
> that try to do the right thing by using Unicode themselves.


In a perfect world there are no "unicode apps", the internal encoding
is set by the system, properly written apps use abstract TCHAR/wchar_t
characters without knowing anything about what encoding they use, and
programs communicating with the outside (such as a database driver),
should query the system encoding using something like "setlocale()",
and perform any appropriate conversion on the fly.

Excerpt from "info libc - Character Set Handling" of GNU libc 2.3.2

<http://www.gnu.org/software/libc/ma...t-Handling.html>

The question remaining is: how to select the character set or
encoding to use. The answer: you cannot decide about it yourself,
it is decided by the developers of the system or the majority of the
users. Since the goal is interoperability one has to use whatever
the other people one works with use.

<http://www.faqs.org/docs/Linux-HOWT...e-HOWTO.html#s6>
says the same thing:

"Avoid direct access with Unicode. This is a task of the platform's
internationalization
framework."

Of course those two quotes are targeted at applications
developers. They imply that some driver communicating with the outside
world/database should carry any conversion task.

However, I have no idea how far this theory is from reality, from
the ODBC API, and from Windows, sorry :-( I just was woken up by
the "unicode apps" word. I tried to follow the discussions here but
got lost.


My 2 cents.



Dave Page

2005-09-08, 7:23 am



> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 07 September 2005 19:16
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> In a perfect world there are no "unicode apps",


In my perfect world, everything is one flavour of Unicode, and everyone
can consequently read and write everything with no compatibility problems
at all. But then I like to retreat to my little fantasy world from time
to time...

>
> However, I have no idea how this theory is far from reality, far from
> the ODBC API, and far from Windows, sorry :-( I just was woken up by
> the "unicode apps" word. I tried to follow the discussions here but
> got lost.


The ODBC API (defined by Microsoft of course) includes a number of *W
functions which are Unicode variants of the ANSI versions with the same
name. The ODBC driver manager maps all ANSI function calls to the
Unicode equivalents if they exist, on the assumption that ASCII chars
will map correctly into Unicode (which they do if they are 7 bit chars).
In theory we could attempt to recode incoming ASCII or multibyte
ourselves I guess, but it's not going to be a particularly easy task
(and will mean performance loss), and given that some apps don't play
nicely with Unicode drivers anyway, we might as well kill 2 birds with
one stone and just ship 2 versions of the driver.

Regards, Dave.


Marc Herbert

2005-09-08, 7:23 am

"Dave Page" <dpage@vale-housing.co.uk> writes:

> The ODBC API (defined by Microsoft of course) includes a number of *W
> functions which are Unicode variants of the ANSI versions with the same
> name.


I think one extra layer of confusion is added by the fact that POSIX
defines the type wchar_t as "the abstract/platform-dependent
character", W just meaning here: "W like Wide enough", whereas
Microsoft defines WCHAR as: "W like Unicode". Microsoft's abstract
character being "TCHAR".

Am I right here?





Dave Page

2005-09-08, 7:23 am



> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 08 September 2005 11:10
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>
> I think one extra layer of confusion is added by the fact that POSIX
> defines the type wchar_t as "the abstract/platform-dependent
> character", W just meaning here: "W like Wide enough", whereas
> Microsoft defines WCHAR as: "W like Unicode". Microsoft's abstract
> character being "TCHAR".
>
> Am I right here?


That certainly wouldn't help matters. We already have ucs2<->utf-8
conversion in various places to deal with *nix/win32 differences -
trying to properly munge other encodings into those correctly wouldn't
be fun!
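
For readers unfamiliar with it, the UCS-2 to UTF-8 direction is purely mechanical, which is what sets it apart from code-page conversions. The following is a sketch only: ucs2_to_utf8() is a hypothetical name rather than the driver's own routine, and it assumes a single BMP code unit with no surrogate handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS-2 code unit (BMP only, no surrogate pairs) as UTF-8.
 * Writes 1 to 3 bytes into out and returns the number of bytes written. */
size_t ucs2_to_utf8(uint16_t cu, unsigned char out[3])
{
    if (cu < 0x80) {                       /* 7-bit ASCII: one byte */
        out[0] = (unsigned char)cu;
        return 1;
    }
    if (cu < 0x800) {                      /* up to 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cu >> 6));
        out[1] = (unsigned char)(0x80 | (cu & 0x3F));
        return 2;
    }
    /* remaining BMP values: three bytes */
    out[0] = (unsigned char)(0xE0 | (cu >> 12));
    out[1] = (unsigned char)(0x80 | ((cu >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cu & 0x3F));
    return 3;
}
```

No lookup table is involved here, whereas converting a legacy single-byte or multibyte encoding needs a table per character set.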

As I said though - there are other advantages to having a non-Unicode
driver (like, BDE won't barf for example), so why go to all the hassle,
when we can just advise the non-Unicode folks to use the ANSI driver?

Regards, Dave.


Marko Ristola

2005-09-08, 1:24 pm


There is one thing, that might be good for you to know:

I tried
wprintf("%s",char_text) and printf("%ls",wchar_text) methods.
They don't work with LATIN1 under Linux.

gcc does not support NON-ASCII multibyte conversions.
gcc gives that responsibility for library functions.

That is so even for GCC 4.0.

So, at least libiconv is a good way to handle the multibyte conversions
robustly under Linux. That works if and only if the libiconv library works.

libiconv is LGPL licensed.
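
A minimal sketch of such a conversion with the iconv API (iconv_open, iconv, iconv_close) is shown below. The same interface is offered by libiconv and by glibc. latin1_to_utf8() is a hypothetical helper name, and the sketch assumes the installed iconv knows the "ISO-8859-1" and "UTF-8" encoding names.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8.
 * Returns the number of bytes written (excluding the terminator),
 * or (size_t)-1 if the conversion could not be performed. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dst_len)
{
    iconv_t cd;
    char *in = (char *)src;        /* iconv() takes a non-const pointer */
    char *out = dst;
    size_t in_left = strlen(src);
    size_t out_left;
    size_t rc;

    if (dst_len == 0)
        return (size_t)-1;
    out_left = dst_len - 1;        /* keep room for the terminator */

    cd = iconv_open("UTF-8", "ISO-8859-1");   /* arguments: (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;         /* invalid input or output too small */

    *out = '\0';
    return (size_t)(out - dst);
}
```

On systems where iconv lives in a separate libiconv, the program has to be linked with -liconv.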

Regards,
Marko Ristola

>However, I have no idea how this theory is far from reality, far from
>the ODBC API, and far from Windows, sorry :-( I just was woken up by
>the "unicode apps" word. I tried to follow the discussions here but
>got lost.
>
>




Marc Herbert

2005-09-13, 8:23 pm

Marko Ristola <Marko.Ristola@kolumbus.fi> writes:

> There is one thing, that might be good for you to know:
>
> I tried
> wprintf("%s",char_text) and printf("%ls",wchar_text) methods.
> They don't work with LATIN1 under Linux.


What do you mean by that? Could you post a short sample code?

Since wchar_t is 32bits for glibc, wchar_text can not be LATIN1 which
is 8bits long...


> gcc does not support NON-ASCII multibyte conversions.


Well I would find weird for a compiler to perform such conversions.



Marc Herbert

2005-11-21, 1:23 pm

"Dave Page" <dpage@vale-housing.co.uk> writes:

> I've been thinking about this whilst getting dragged round the shops
> today, and having read Marko's, Johann's, Hiroshi's and other emails,
> not to mention bits of the ODBC spec, here's where I think we stand.
>
> 1) The current driver works as expected with Unicode apps.
>
> 2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
> functions to the Unicode ones, and because (as I think Marko pointed
> out) the basic latin chars map directly into the lower Unicode
> characters (see http://www.unicode.org/charts/PDF/U0000.pdf).
>
> 3) Some other single byte LATIN encodings do not work. This is because
> the characters do not map directly into Unicode 80-FF
> (http://www.unicode.org/charts/PDF/U0080.pdf).
>
> 4) Multibyte apps do not work. I believe that in fact they never will
> with a Unicode driver, because multibyte characters simply won't map
> into Unicode in the same way that ASCII does. The user cannot opt to use
> the non-wide functions, because the DM automatically maps them to the
> Unicode versions.
>
> Because the Driver Manager forces the user to use the *W functions if
> they exist, I cannot see any way to make 3 or 4 work with a Unicode
> driver.



I agree that 4) can never work, because ODBC does not seem compatible
with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
strings, that's all.
<http://blogs.msdn.com/oldnewthing/a.../31/144893.aspx>


However, I don't get why 3) does not work. From here:
<http://msdn.microsoft.com/library/d... _arguments.asp>

If the driver is a Unicode driver, the Driver Manager makes function
calls as follows:
- Converts an ANSI function (with the A suffix) to a Unicode function
  (with the W suffix) by converting the string arguments into Unicode
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  characters and passes the Unicode function to the driver.
  ^^^^^^^^^^


Are you saying in 3) that the "converting" underlined above is
actually just a static cast?!

Is this "bug" true for every driver manager out there?





Dave Page

2005-11-21, 8:23 pm



> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 21 November 2005 17:19
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>
> I agree that 4) can never work, because ODBC does not seem compatible
> with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
> strings, that's all.
> <http://blogs.msdn.com/oldnewthing/a.../31/144893.aspx>


Actually our ANSI driver works quite nicely in various non-Unicode multibyte encodings such as Shift-JIS, EUC_CN, JOHAB and more. It'll even work with pure UTF-8 in multibyte mode using the ANSI API.

>
> However, I don't get why 3) does not work. From here:
> <http://msdn.microsoft.com/library/d...url=/library/en-us/odbc/htm/odbcunicode_function_arguments.asp>
>
> If the driver is a Unicode driver, the Driver Manager makes function
> calls as follows:
> - Converts an ANSI function (with the A suffix) to a Unicode function
> (with the W suffix) by converting the string arguments into Unicode
>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> characters and passes the Unicode function to the driver.
> ^^^^^^^^^^
>
>
> Are you saying in 3) that the "converting" underlined above is
> actually just a static cast?!


No, not really a static cast, but a similar effect. Unicode chars 0000-007F are exactly the same as their ASCII counterparts, as are LATIN1 (0080-00FF). All the DM does is map the single byte values into low bytes of the Unicode characters and pass them to the Unicode functions. This works just fine for pure ASCII/LATIN1, but not with other character sets which don't directly map from their single byte values into Unicode.

> Is this "bug" true for every driver manager out there?


It's not really a bug, but I believe so, yes. It gets corrected by the more advanced drivers though - for example, the SQL server driver might see a 'Š' character (8A). It knows the local charset is LATIN4, so it can then rewrite that character to 0160, the Unicode equivalent. Our Unicode driver will simply leave it as 8A, which is actually a control character (VTS - LINE TABULATION SET).
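
The difference between the two behaviours described here can be sketched as follows. This is an illustration under stated assumptions, not code from any driver or driver manager: widen_bytes() and latin4_to_unicode() are hypothetical names, the table is deliberately partial, and the byte value 0xA9 for U+0160 is taken from the ISO-8859-4 chart.

```c
#include <stddef.h>
#include <stdint.h>

/* Zero-extend each byte into a 16-bit unit, with no regard to the code
 * page. Correct for ASCII and ISO-8859-1 only. */
void widen_bytes(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Code-page-aware mapping for ISO-8859-4. Partial table for illustration:
 * only three of the non-Latin-1 positions are listed. */
uint16_t latin4_to_unicode(unsigned char b)
{
    switch (b) {
    case 0xA1: return 0x0104;  /* LATIN CAPITAL LETTER A WITH OGONEK */
    case 0xA9: return 0x0160;  /* LATIN CAPITAL LETTER S WITH CARON */
    case 0xAE: return 0x017D;  /* LATIN CAPITAL LETTER Z WITH CARON */
    default:   return b;       /* placeholder: a full table is needed here */
    }
}
```

Plain widening turns the Latin-4 byte 0xA9 into U+00A9, the copyright sign, while the table lookup yields U+0160.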

http://www.unicode.org/roadmaps/bmp/

At least, this is how I understand things :-). Regardless though, the encoding bug reports have all-but stopped now we ship 2 drivers again.

Regards, Dave.


Marc Herbert

2005-11-22, 7:23 am

"Dave Page" <dpage@vale-housing.co.uk> writes:

>


> Actually our ANSI driver works quite nicely in various non-Unicode
> multibyte encodings such as Shift-JIS, EUC_CN, JOHAB and more. It'll
> even work with pure UTF-8 in multibyte mode using the ANSI API.


Great.

Out of curiosity, is this because all the ODBC code has a "don't
touch" attitude in this full-ANSI case, leaving all string data as is?
Or is there something more clever? Who performs the conversion if the
database is in UTF-8 for instance? Multibyte cases seem to fall outside
the scope of the ODBC spec, which refers only to "ANSI" and "Unicode".

Thanks in advance for providing pointers if this is an FAQ. Even vague
references to the archive of this list would be nice.


>


> No, not really a static cast, but a similar effect. Unicode chars
> 0000-007F are exactly the same as their ASCII counterparts, as are
> LATIN1 (0080-00FF). All the DM does is map the single byte values
> into low bytes of the unicode characters and passes them to the
> Unicode functions.


> This works just fine for pure ASCII/LATIN1, but
> not with other charactersets which don't directly map from their
> single byte values into Unicode.


Very interesting. Maybe the driver manager does so only because it
cannot/fails to get the active codepage, falling back on CP-1252?
(CP1252 ~= latin1, <http://czyborra.com/charsets/codepages.html#CP1252> )


> It's not really a bug, but I believe so, yes.


including unixodbc and iodbc for instance?


> It gets corrected by
> the more advanced drivers though - for example, the SQL server
> driver might see a 'Š' character (8A). It knows the local charset is
> LATIN4, so it can then rewrite that character to 0160, the Unicode
> equivalent.


Are you saying that the SQL server driver is fixing the flawed
conversion job of the driver manager, finally taking the codepage into
account? Surprising to say the least!

By the way 0x8A is not in the range of latin4
<http://czyborra.com/charsets/iso8859.html#ISO-8859-4>


> Our Unicode driver will simply leave it


Of course, you don't want to perform a conversion that is supposed to
have happened already.


> Regardless though, the encoding bug reports have all-but stopped now
> we ship 2 drivers again.


And having two different drivers is indeed the approach induced by the
ODBC documentation, from what I've got from it.

Thanks a lot for your insights.



Dave Page

2005-11-23, 7:23 am



> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 22 November 2005 09:33
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>
> Great.
>
> Out of curiosity, is this because all the ODBC code has a "don't
> touch" attitude in this full-ANSI case, leaving all string data as is?
> Or is there something more clever? Who performs the conversion if the
> database is in UTF-8 for instance? Multibyte cases seem to
> fall outside
> the scope of the ODBC spec, which refers only to "ANSI" and "Unicode".


No, Multibyte support was intentionally added by Eiji Tokuya in 2001. Don't ask me how it works though as I really don't know. Much of the code for it is in multibyte.c if you want to take a peek.


> Very interesting. Maybe the driver manager does so only because the it
> cannot/fails to get the active codepage, falling back on CP-1252?
> (CP1252 ~= latin1,
> <http://czyborra.com/charsets/codepages.html#CP1252> )


The docs are somewhat fuzzy on this point, simply stating that

"If the driver is a Unicode driver, the Driver Manager makes function calls as follows:" ... "Converts an ANSI function (with the A suffix) to a Unicode function (with the W suffix) by converting the string arguments into Unicode characters and passes the Unicode function to the driver."

(http://msdn.microsoft.com/library/d...ions.asp)

My assertion that the driver does the conversion comes from the SQL Server driver which allows you to turn conversion on or off:

"Perform translation for character data check box

When selected, the SQL Server ODBC driver converts ANSI strings sent between the client computer and SQL Server by using Unicode. The SQL Server ODBC driver sometimes converts between the SQL Server code page and Unicode on the client computer. This requires that the code page used by SQL Server be one of the code pages available on the client computer.

When cleared, no translation of extended characters in ANSI character strings is done when they are sent between the client application and the server. If the client computer is using an ANSI code page (ACP) different from the SQL Server code page, extended characters in ANSI character strings may be misinterpreted. If the client computer is using the same code page for its ACP that SQL Server is using, the extended characters are interpreted correctly."

If Microsoft intended the DM to do the conversion when they wrote the spec, why would they then add the same functionality to their driver?

>
>
> including unixodbc and iodbc for instance?


If they follow the parts of the spec I quoted above, and interpret them in the same way, then yes. However I'm not overly familiar with either DM, so I can't say for sure.


>
> Are you saying that the SQL server driver is fixing the flawed
> conversion job of the driver manager, finally taking the codepage into
> account? Surprising to say the least!
>
> By the way 0x8A is not in the range of latin4
> <http://czyborra.com/charsets/iso8859.html#ISO-8859-4>


http://www.gar.no/home/mats/8859-4.htm says differently, however, I can't claim to know enough about encoding issues to refute either. I've been forced to learn what I can about the subject to help maintain this driver and certainly may have got the wrong end of the stick on one or more points!

Regards, Dave.


Marc Herbert

2005-11-24, 9:23 am

[Cross-posting to unixodbc-devel. Also crossing fingers so it works]
Archives of both lists here for instance: <http://dir.gmane.org/search.php?match=odbc>

"Dave Page" <dpage@vale-housing.co.uk> writes:
>
> The docs are somewhat fuzzy on this point, simply stating that
>
> "If the driver is a Unicode driver, the Driver Manager makes function
> calls as follows:" ... "Converts an ANSI function (with the A suffix)
> to a Unicode function (with the W suffix) by converting the string
> arguments into Unicode characters and passes the Unicode function to
> the driver."
>
> (http://msdn.microsoft.com/library/d...ions.asp)
>
> My assertion that the driver does the conversion comes from the SQL
> Server driver which allows you to turn conversion on or off:
>
> "Perform translation for character data check box
>
> When selected, the SQL Server ODBC driver converts ANSI strings sent
> between the client computer and SQL Server by using Unicode. The SQL
> Server ODBC driver sometimes converts between the SQL Server code page
> and Unicode on the client computer. This requires that the code page
> used by SQL Server be one of the code pages available on the client
> computer.
>
> When cleared, no translation of extended characters in ANSI character
> strings is done when they are sent between the client application and
> the server. If the client computer is using an ANSI code page (ACP)
> different from the SQL Server code page, extended characters in ANSI
> character strings may be misinterpreted. If the client computer is
> using the same code page for its ACP that SQL Server is using, the
> extended characters are interpreted correctly."
>
> If Microsoft intended the DM to do the conversion when they wrote the
> spec, why would they then add the same functionality to their driver?



Here is a hypothesis: the checkbox in SQL Server driver is actually a
switch between the ANSI version and the Unicode version of this
driver. That would be pretty much consistent with all the above. The
only inconsistency would be: "The driver converts...", to be actually
read as: "This setting triggers the conversion operated by the DM".

What do you think?






Dave Page

2005-11-24, 11:23 am



> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 24 November 2005 14:18
> To: pgsql-odbc@postgresql.org
> Cc: unixodbc-dev@unixodbc.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
>
> Here is a hypothesis: the checkbox in SQL Server driver is actually a
> switch between the ANSI version and the Unicode version of this
> driver. That would be pretty much consistent with all the above. The
> only inconsistency would be: "The driver converts...", to be actually
> read as: "This setting triggers the conversion operated by the DM".
>
> What do you think?


The DM detects whether the driver is Unicode or not from the presence of
the SQLConnectW function
(http://msdn.microsoft.com/library/d...ry/en-us/odbc/htm/odbcunicode_drivers.asp). Whether or not this is exported is
determined at compile time and cannot be changed at runtime.
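
As a toy model of that check (a sketch only: driver_is_unicode() is a hypothetical function, and a real driver manager inspects the loaded library's export table rather than an array of strings), the decision boils down to a lookup for one symbol name:

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the list of exported symbol names contains SQLConnectW,
 * which is how the driver is classified as Unicode; 0 otherwise. */
int driver_is_unicode(const char *const *exports, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (strcmp(exports[i], "SQLConnectW") == 0)
            return 1;
    }
    return 0;
}
```

Since the export list is fixed when the DLL is built, a checkbox in the driver's setup dialog has no way to change the answer.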

Regards, Dave
