Home > Archive > Slony1 PostgreSQL Replication > January 2006 > LATIN1 -> UTF8 conversion









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author LATIN1 -> UTF8 conversion
Brian Hirt

2005-11-19, 8:24 pm

I've used slony in the past to migrate from 7.4 to 8.0, and i plan on
using it to migrate from 8.0 to 8.1 once the 8.1 issues with slony
get hammered out. Additionally, though, I'm curious if slony can be
used to migrate from one database encoding to another. We are
currently ready to migrate our database and applications from Latin1
encoding to UTF8 encoding. Is this something slony supports? Are
there caveats that I should be aware of? I can run pg_dump with --
encoding=UTF8 on my Latin1 database and load it successfully into my
UTF8 database and have the 8bit Latin1 characters correctly converted
into ther UTF8 counterparts.

Thanks for any advice

--brian

--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project


cbbrowne-swQf4SbcV9C7WVzo/KQ3Mw@public.gmane.org

2005-11-20, 3:23 am

> I've used slony in the past to migrate from 7.4 to 8.0, and i plan on
> using it to migrate from 8.0 to 8.1 once the 8.1 issues with slony
> get hammered out. Additionally, though, I'm curious if slony can be
> used to migrate from one database encoding to another. We are
> currently ready to migrate our database and applications from Latin1
> encoding to UTF8 encoding. Is this something slony supports? Are
> there caveats that I should be aware of? I can run pg_dump with --
> encoding=3DUTF8 on my Latin1 database and load it successfully into my
> UTF8 database and have the 8bit Latin1 characters correctly converted
> into ther UTF8 counterparts.


This falls into the category of "things not explicitly tried yet."

With the fixes in 1.1.2, we now have reasonably proper support for UTF8,
though that has only really been tested via copying from the same encodin=
g
into the same encoding.

I'd suggest trying it out; if it works well, that would be worth adding t=
o
the documentation. If there are wrinkles, that's also something others
might care to know.

I'd suggest testing it out with a reasonably diverse set of tests; make
sure you have:

a) Examples of each special character so you know each character is
translated properly;

b) Be sure you test both with data that is present at subscription time
(which is added via a bulk COPY) and with additional data inserted/update=
d
*after* the subscription is set up (which is introduced via INSERT/UPDATE=
,
which is a quite distinct mechanism from COPY).

Actually, if you can pass on some relevant sample of data, I'd be more
than happy to see if I can turn it into a test in the "Testbed: The New
Generation" ;-) Assuming it all works, that approach would make it part
of the regression tests!
Brian Hirt

2006-01-02, 3:23 am

Hello:

As were getting closer to moving to 8.1 and UTF8 i decided to do some
testing. I'm running slony 1.1.5-rc2. The provider database is
encoded in LATIN1/postgresql 8.0. The slave is UTF8/postgresql
8.1. Out of the box, slony fails to transfer the data. When I
first start slon, the initial copy fails with this error:

postmaster log:
ERROR: invalid UTF-8 byte sequence detected near byte 0xae
CONTEXT: COPY developer_converted_
to_entity, line 71, column
developer_name: "Spatializer? "

slony log:
2006-01-01 13:51:25 MST ERROR remoteWorkerThread_1
: copy from stdin
on local node - PGRES_FATAL_ERROR ERROR: invalid UTF-8 byte sequence
detected near byte 0xae

What's happening is that the data is coming from the provider in
LATIN1 and that raw data is given to the subscriber which is
expecting UTF8 -- the LATIN1 binary data is obviously not UTF8 and
the subscriber node query fails during conversion.

I've come up with a simple solution that is working well, but
probably not desired for typical users.

What I've done is change slon_connectdb() in dbutils.c to always set
the client encoding to UTF8. This pushes any encoding issues to the
postmaster on the provider and subscriber. The provider database
will output in UTF8; if the provider database isn't UTF8, results
will be converted to UTF8. The reverse happens on the subscriber.
It's a pretty simple solution, but it causes additional overhead to
be added to every slony event and a certain trust that postgresql
knows how to properly convert to and from different encodings. For
normal environments that are running the same encoding on the
provider and subscriber it's an extra conversion that's not
required. For situations where the provider and subscriber are both
UTF8, the overhead is a non issue.

This obviously will only work if the subscriber's encoding is a
superset of the provider. If you have a provider that's UTF8 and a
subscriber that's LATIN1 then things will still fail when a non
latin1 character is in the provider database.

If there is interest in having this be some configurable option, I'm
happy to help out. One idea i had was to have a flag for the slon
process called something like "-e <transferencoding>" that would tell
slon what encoding to use for transfer of the data. If it wasn't
supplied, then the no client_encoding would be set and it would
default to the previous behavior.

Here's a patch if anyone else is interested in using slony this way:

--- /usr/local/src/slony1-1.1.5-rc2/src/slon/dbutils.c 2005-03-08
15:52:37.000000000 -0700
+++ dbutils.c 2006-01-01 20:50:57.000000000 -0700
@@ -99,6 +99,26 @@
conn = slon_make_dummyconn(
symname);
conn->dbconn = dbconn;
+ /* set the client encoding to UTF8 as a common ground for
+ transfering data. The postmaster will be responsible
+ for dealing with encoding issues */
+ PGresult *res;
+ SlonDString query;
+
+ dstring_init(&query);
+
+ slon_mkquery(&query, "set client_encoding to utf8");
+ res = PQexec(dbconn, dstring_data(&query));
+ if ( ! ((PQresultStatus(res
) == PGRES_TUPLES_OK) ||
+ (PQresultStatus(res)
== PGRES_COMMAND_OK)) )
+ {
+ perror("slon_connectdb: failed to set encoding to
utf8");
+ slon_abort();
+ }
+ PQclear(res);
+
+ slon_log(SLON_DEBUG1
, "client encoding set to utf8\n");
+
return conn;
}


On Nov 19, 2005, at 10:18 PM, cbbrowne-swQf4SbcV9C7WVzo/KQ3Mw@public.gmane.org wrote:

>
> This falls into the category of "things not explicitly tried yet."
>
> With the fixes in 1.1.2, we now have reasonably proper support for
> UTF8,
> though that has only really been tested via copying from the same
> encoding
> into the same encoding.
>
> I'd suggest trying it out; if it works well, that would be worth
> adding to
> the documentation. If there are wrinkles, that's also something
> others
> might care to know.
>
> I'd suggest testing it out with a reasonably diverse set of tests;
> make
> sure you have:
>
> a) Examples of each special character so you know each character is
> translated properly;
>
> b) Be sure you test both with data that is present at subscription
> time
> (which is added via a bulk COPY) and with additional data inserted/
> updated
> *after* the subscription is set up (which is introduced via INSERT/
> UPDATE,
> which is a quite distinct mechanism from COPY).
>
> Actually, if you can pass on some relevant sample of data, I'd be more
> than happy to see if I can turn it into a test in the "Testbed: The
> New
> Generation" ;-) Assuming it all works, that approach would make it
> part
> of the regression tests!
>
>


--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project
Gavin Sherry

2006-01-02, 3:23 am

On Sun, 1 Jan 2006, Brian Hirt wrote:

> Hello:
>
> As were getting closer to moving to 8.1 and UTF8 i decided to do some
> testing. I'm running slony 1.1.5-rc2. The provider database is
> encoded in LATIN1/postgresql 8.0. The slave is UTF8/postgresql
> 8.1. Out of the box, slony fails to transfer the data. When I
> first start slon, the initial copy fails with this error:
>
> postmaster log:
> ERROR: invalid UTF-8 byte sequence detected near byte 0xae
> CONTEXT: COPY developer_converted_
to_entity, line 71, column
> developer_name: "Spatializer? "
>
> slony log:
> 2006-01-01 13:51:25 MST ERROR remoteWorkerThread_1
: copy from stdin
> on local node - PGRES_FATAL_ERROR ERROR: invalid UTF-8 byte sequence
> detected near byte 0xae
>
> What's happening is that the data is coming from the provider in
> LATIN1 and that raw data is given to the subscriber which is
> expecting UTF8 -- the LATIN1 binary data is obviously not UTF8 and
> the subscriber node query fails during conversion.
>
> I've come up with a simple solution that is working well, but
> probably not desired for typical users.
>
> What I've done is change slon_connectdb() in dbutils.c to always set
> the client encoding to UTF8. This pushes any encoding issues to the


This does not work if the master database is SQL_ASCII (I don't think it
works if it is MULE_INTERNAL either). The reason is, all character
sequences in this encoding are valid but not all sequences can be
translated to utf8. It is entirely possible that people are using text
fields as byte arrays in SQL_ASCII databases. See my email:

http://gborg.postgresql.org/piperma...ber/003430.html

If we set the client encoding to utf8 on such databases, we will corrupt
user data.

Gavin
Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com