Home > Archive > Slony1 PostgreSQL Replication > August 2005 > Move set terminates slon









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Move set terminates slon
John Sidney-Woollett

2005-08-13, 7:24 am

Can anyone explain why slon on (the master) node #1 stopped when a MOVE
SET command was issued on that node - see the FATAL error notice below.
But when slon on node 1 was restarted, it processed the as yet
unprocessed MOVE SET correctly.

The slonik script below was executed on fs01b, node #1, and this is the
server where the slon process died. The slon process on db01a, node #2
stayed up fine during all the move set operations.

No applications were running against either database during the switch
over. But I did have one psql session open against each db to check that
the moves worked OK by issuing SQL statements for the appropriate tables
after the move.

We had 6 sets to move and only the first move (set #6) didn't terminate
the slon process. However, all the move sets seem to have worked OK.

I checked the _bpreplicate2.sl_set table (on both nodes), and the
set_origin is now 2, and all the tables are unlocked on node #2. All
sequences and tables seem to be being replicated correctly (now from
node #2 to node #1).

I'm using slony 1.1 with postgres 7.4.6

Any ideas?

John

slonik move set script
====================
==
#!/bin/bash

/usr/local/pgsql/bin/slonik << _END_

# define the cluster namespace
cluster name = bpreplicate2;

# define two nodes connection information
node 1 admin conninfo = 'dbname=bp_live host=fs01b user=postgres';
node 2 admin conninfo = 'dbname=bp_live host=db01a user=postgres';

# move set from node 1 to node 2
lock set (id = 5, origin = 1);
wait for event (origin = 1, confirmed = 2);
move set (id = 5, old origin = 1, new origin = 2);
wait for event (origin = 1, confirmed = 2);

_END_



output from slonlog for node 1
====================
==========
2005-08-13 05:56:21 GMT FATAL localListenThread: MOVE_SET but no
provider found for set 5
2005-08-13 05:56:21 GMT DEBUG1 slon: shutdown requested
2005-08-13 05:56:21 GMT DEBUG2 slon: notify worker process to shutdown
2005-08-13 05:56:21 GMT DEBUG2 slon: wait for worker process to shutdown
2005-08-13 05:56:21 GMT DEBUG1 syncThread: thread done
2005-08-13 05:56:21 GMT DEBUG1 cleanupThread: thread done
2005-08-13 05:56:21 GMT INFO remoteListenThread_2
: disconnecting from
'dbname=bp_live host=db01a user=postgres'
2005-08-13 05:56:21 GMT DEBUG2 remoteWorkerThread_2
: forward confirm
1,10810 received by 2
2005-08-13 05:56:21 GMT DEBUG1 remoteListenThread_2
: thread done
2005-08-13 05:56:21 GMT DEBUG1 main: scheduler mainloop returned
2005-08-13 05:56:21 GMT DEBUG2 main: wait for remote threads
2005-08-13 05:56:21 GMT DEBUG2 sched_wakeup_node():
no_id=2 (0 threads +
worker signaled)
2005-08-13 05:56:21 GMT DEBUG1 remoteWorkerThread_2
: helper thread for
provider 2 terminated
2005-08-13 05:56:21 GMT DEBUG1 remoteWorkerThread_2
: disconnecting from
data provider 2
2005-08-13 05:56:21 GMT DEBUG1 remoteWorkerThread_2
: thread done
2005-08-13 05:56:21 GMT DEBUG2 main: notify parent that worker is done
2005-08-13 05:56:21 GMT DEBUG1 main: done
2005-08-13 05:56:21 GMT DEBUG2 slon: worker process shutdown ok
2005-08-13 05:56:21 GMT DEBUG2 slon: exit(-1)
[snipped]

[slon restarted on node 1]

[snipped]
2005-08-13 05:58:07 GMT DEBUG2 start processing ACCEPT_SET
2005-08-13 05:58:07 GMT DEBUG2 ACCEPT: set=5
2005-08-13 05:58:07 GMT DEBUG2 ACCEPT: old origin=1
2005-08-13 05:58:07 GMT DEBUG2 ACCEPT: new origin=2
2005-08-13 05:58:07 GMT DEBUG2 ACCEPT: move set seq=7304
2005-08-13 05:58:07 GMT DEBUG2 got parms ACCEPT_SET
2005-08-13 05:58:07 GMT DEBUG2 ACCEPT_SET - on origin node...
2005-08-13 05:58:07 GMT DEBUG2 ACCEPT_SET - done...
Ujwal S. Setlur

2005-08-13, 1:24 pm

I posted an almost exactly same problem on this list a
few days ago. I am trying to move two sets from maste
to slave, and the slon process terminates on the
master when moving the second set.

Haven't figured it out yet. If you find out anything,
please let me know.

Thanks,

Ujwal

--- John Sidney-Woollett <johnsw- eE2kIT1DonNiLUuM0BA3
LQ@public.gmane.org> wrote:

> Can anyone explain why slon on (the master) node #1
> stopped when a MOVE=20
> SET command was issued on that node - see the FATAL
> error notice below.=20
> But when slon on node 1 was restarted, it processed
> the as yet=20
> unprocessed MOVE SET correctly.
>=20
> The slonik script below was executed on fs01b, node
> #1, and this is the=20
> server where the slon process died. The slon process
> on db01a, node #2=20
> stayed up fine during all the move set operations.
>=20
> No applications were running against either database
> during the switch=20
> over. But I did have one psql session open against
> each db to check that=20
> the moves worked OK by issuing SQL statements for
> the appropriate tables=20
> after the move.
>=20
> We had 6 sets to move and only the first move (set
> #6) didn't terminate=20
> the slon process. However, all the move sets seem to
> have worked OK.
>=20
> I checked the _bpreplicate2.sl_set table (on both
> nodes), and the=20
> set_origin is now 2, and all the tables are unlocked
> on node #2. All=20
> sequences and tables seem to be being replicated
> correctly (now from=20
> node #2 to node #1).
>=20
> I'm using slony 1.1 with postgres 7.4.6
>=20
> Any ideas?
>=20
> John
>=20
> slonik move set script
> =3D=3D=3D=3D=3D=3D=3
D=3D=3D=3D=3D=3D=3D=
3D=3D=3D=3D=3D=3D=3D
=3D=3D
> #!/bin/bash
>=20
> /usr/local/pgsql/bin/slonik << _END_
>=20
> # define the cluster namespace
> cluster name =3D bpreplicate2;
>=20
> # define two nodes connection information
> node 1 admin conninfo =3D 'dbname=3Dbp_live host=3Dfs01b
> user=3Dpostgres';
> node 2 admin conninfo =3D 'dbname=3Dbp_live host=3Ddb01a
> user=3Dpostgres';
>=20
> # move set from node 1 to node 2
> lock set (id =3D 5, origin =3D 1);
> wait for event (origin =3D 1, confirmed =3D 2);
> move set (id =3D 5, old origin =3D 1, new origin =3D 2);
> wait for event (origin =3D 1, confirmed =3D 2);
>=20
> _END_
>=20
>=20
>=20
> output from slonlog for node 1
> =3D=3D=3D=3D=3D=3D=3
D=3D=3D=3D=3D=3D=3D=
3D=3D=3D=3D=3D=3D=3D
=3D=3D=3D=3D=

=3D=3D=3D=3D=3D=3D[c
olor=darkred]
> 2005-08-13 05:56:21 GMT FATAL localListenThread:
> MOVE_SET but no=20
> provider found for set 5
> 2005-08-13 05:56:21 GMT DEBUG1 slon: shutdown
> requested
> 2005-08-13 05:56:21 GMT DEBUG2 slon: notify worker
> process to shutdown
> 2005-08-13 05:56:21 GMT DEBUG2 slon: wait for worker
> process to shutdown
> 2005-08-13 05:56:21 GMT DEBUG1 syncThread: thread
> done
> 2005-08-13 05:56:21 GMT DEBUG1 cleanupThread: thread
> done
> 2005-08-13 05:56:21 GMT INFO remoteListenThread_2
:
> disconnecting from=20
> 'dbname=3Dbp_live host=3Ddb01a user=3Dpostgres'
> 2005-08-13 05:56:21 GMT DEBUG2 remoteWorkerThread_2
:
> forward confirm=20
> 1,10810 received by 2
> 2005-08-13 05:56:21 GMT DEBUG1 remoteListenThread_2
:
> thread done
> 2005-08-13 05:56:21 GMT DEBUG1 main: scheduler
> mainloop returned
> 2005-08-13 05:56:21 GMT DEBUG2 main: wait for remote
> threads
> 2005-08-13 05:56:21 GMT DEBUG2 sched_wakeup_node():

> no_id=3D2 (0 threads +=20
> worker signaled)
> 2005-08-13 05:56:21 GMT DEBUG1 remoteWorkerThread_2
:
> helper thread for=20
> provider 2 terminated
> 2005-08-13 05:56:21 GMT DEBUG1 remoteWorkerThread_2
:
> disconnecting from=20
> data provider 2
> 2005-08-13 05:56:21 GMT DEBUG1 remoteWorkerThread_2
:
> thread done
> 2005-08-13 05:56:21 GMT DEBUG2 main: notify parent
> that worker is done
> 2005-08-13 05:56:21 GMT DEBUG1 main: done
> 2005-08-13 05:56:21 GMT DEBUG2 slon: worker process
> shutdown ok
> 2005-08-13 05:56:21 GMT DEBUG2 slon: exit(-1)
> [snipped]
>=20
> [slon restarted on node 1]
>=20
> [snipped]
> 2005-08-13 05:58:07 GMT DEBUG2 start processing
> ACCEPT_SET
> 2005-08-13 05:58:07 GMT DEBUG2 ACCEPT: set=3D5
> 2005-08-13 05:58:07 GMT DEBUG2 ACCEPT: old origin=3D1
> 2005-08-13 05:58:07 GMT DEBUG2 ACCEPT: new origin=3D2
> 2005-08-13 05:58:07 GMT DEBUG2 ACCEPT: move set
> seq=3D7304
> 2005-08-13 05:58:07 GMT DEBUG2 got parms ACCEPT_SET
> 2005-08-13 05:58:07 GMT DEBUG2 ACCEPT_SET - on
> origin node...
> 2005-08-13 05:58:07 GMT DEBUG2 ACCEPT_SET - done...
>=20
> ____________________
____________________
_______
> Slony1-general mailing list
> Slony1-general- AuKwsB3Fm+ugFIWk8tvy
RWD2FQJk+8+b@public.gmane.org
>[/color]
http://gborg.postgresql.org/mailman.../slony1-general
>=20

Christopher Browne

2005-08-15, 11:24 am

John Sidney-Woollett wrote:

> Can anyone explain why slon on (the master) node #1 stopped when a
> MOVE SET command was issued on that node - see the FATAL error notice
> below. But when slon on node 1 was restarted, it processed the as yet
> unprocessed MOVE SET correctly.
>
> The slonik script below was executed on fs01b, node #1, and this is
> the server where the slon process died. The slon process on db01a,
> node #2 stayed up fine during all the move set operations.
>
> No applications were running against either database during the switch
> over. But I did have one psql session open against each db to check
> that the moves worked OK by issuing SQL statements for the appropriate
> tables after the move.
>
> We had 6 sets to move and only the first move (set #6) didn't
> terminate the slon process. However, all the move sets seem to have
> worked OK.
>
> I checked the _bpreplicate2.sl_set table (on both nodes), and the
> set_origin is now 2, and all the tables are unlocked on node #2. All
> sequences and tables seem to be being replicated correctly (now from
> node #2 to node #1).
>
> I'm using slony 1.1 with postgres 7.4.6
>
> Any ideas?


Yes, I have an idea...

The place where the error message is generated is in local_listen.c;
here's the code fragment that generates it...

if (PQntuples(res2) != 1)
{
slon_log(SLON_FATAL,
"localListenThread: MOVE_SET "
"but no provider found for set %d\n",
set_id);
dstring_free(&query2);
PQclear(res2);
slon_abort();
}

That != 1 struck me as suspicious... If the code were looking for
"non-existence," I'd expect it to look for PQntuples(res2) being zero.

If the query returns more than 1 entry, that code path would be taken,
and, even before looking at the query, that seems suspicious.

Stepping backwards, the query was...

slon_mkquery(&query2,
"select sub_provider from %s.sl_subscribe "
" where sub_receiver = %d",
rtcfg_namespace, rtcfg_nodeid);
res2 = PQexec(dbconn, dstring_data(&query2));

In your case, when the *first* MOVE SET takes place, this will lead to
the receiver being changed for the first set, and thus to there being
one set found, and hence PQntuples(res2) will return 1, and all will
appear OK.

When the subsequent MOVE SETs are performed, there will be multiple
records found with the revised receiver, and the SLON_FATAL error will
be raised each time.

It ought to be simple enough to add the set_id into the query, which
would resolve the issue. I'll hold off on that until we can get the new
testing framework checked in, so that I can test using the new
framework. (Hint, hint, Darcy!)

The problem introduced by this bug is basically that the attempt to
reconfigure the slon after the event fails. Restarting the slon is the
right answer, and will work fine. Your watchdog process is your friend,
in this case :-).
John Sidney-Woollett

2005-08-15, 11:24 am

Thanks for the info - I was worried that my setup was damaged in some
way. But I'm glad that that appears not to be the case.

Just wanted to say that Slony 1.1 is great. It was easy to set up and it
seems to be working fine. Also getting a slave to catch up with the
master was QUICK - a table with 1.5 million records took only 6 minutes
to build on the slave! Apparantly postgres 8 is even faster...

Slony 1.0.5 saved our bacon following a crash on our master when it ran
out of disk space, we switched to the slave for 24 hours, installed
Slony 1.1, redefined the cluster, and reverted back to the newly rebuilt
master. We lost no data and had minimal downtime...

Thanks for all the great work - what an awesome product!

John

Christopher Browne wrote:
> John Sidney-Woollett wrote:
>
>
>
>
> Yes, I have an idea...
>
> The place where the error message is generated is in local_listen.c;
> here's the code fragment that generates it...
>
> if (PQntuples(res2) != 1)
> {
> slon_log(SLON_FATAL,
"localListenThread: MOVE_SET "
> "but no provider found for set %d\n",
> set_id);
> dstring_free(&query2);
> PQclear(res2);
> slon_abort();
> }
>
> That != 1 struck me as suspicious... If the code were looking for
> "non-existence," I'd expect it to look for PQntuples(res2) being zero.
>
> If the query returns more than 1 entry, that code path would be taken,
> and, even before looking at the query, that seems suspicious.
>
> Stepping backwards, the query was...
>
> slon_mkquery(&query2,
> "select sub_provider from %s.sl_subscribe "
> " where sub_receiver = %d",
> rtcfg_namespace, rtcfg_nodeid);
> res2 = PQexec(dbconn, dstring_data(&query2));
>
> In your case, when the *first* MOVE SET takes place, this will lead to
> the receiver being changed for the first set, and thus to there being
> one set found, and hence PQntuples(res2) will return 1, and all will
> appear OK.
>
> When the subsequent MOVE SETs are performed, there will be multiple
> records found with the revised receiver, and the SLON_FATAL error will
> be raised each time.
>
> It ought to be simple enough to add the set_id into the query, which
> would resolve the issue. I'll hold off on that until we can get the new
> testing framework checked in, so that I can test using the new
> framework. (Hint, hint, Darcy!)
>
> The problem introduced by this bug is basically that the attempt to
> reconfigure the slon after the event fails. Restarting the slon is the
> right answer, and will work fine. Your watchdog process is your friend,
> in this case :-).

Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com