Author Master server reboot
Deon van der Merwe

2005-09-18, 11:23 am

Hi,

We have a master and 4 slave servers running Slony1 on a LAN. Everything is
working great, except for one thing:
- each of the slon processes runs on its own server
- they each run in an endless loop, so that they can always start again for
whatever reason
- we had to do a reboot of the master server
- after the reboot, all the slaves reconnected
- the problem is this: the actual replication of data stopped. With a
restart of the slon process on every slave the replication started to work
again.
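
[Editor's sketch] The "endless loop" around each slon described above could look roughly like the following. This is a minimal sketch, not the poster's actual script: the function name `supervise`, its parameters, and the cluster name and conninfo in the usage comment are all hypothetical.

```python
import subprocess
import sys
import time


def supervise(command, max_runs=None, delay_seconds=10.0):
    """Run `command` again each time it exits and return the exit codes seen.

    max_runs=None loops forever, which is the "endless loop" case.
    """
    exit_codes = []
    while max_runs is None or len(exit_codes) < max_runs:
        code = subprocess.call(command)  # blocks until the child process exits
        exit_codes.append(code)
        print(f"{command[0]} exited with code {code}", file=sys.stderr)
        time.sleep(delay_seconds)  # short pause so a crashing child cannot spin
    return exit_codes


# Hypothetical usage (cluster name and conninfo are placeholders):
# supervise(["slon", "mycluster", "dbname=mydb host=slave1"])
```

Note that a loop like this only reacts when the slon process itself exits. It does nothing if the process stays alive but stops replicating, which is the situation reported here.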

My questions are:
- what is the expected behavior for the above scenario?
- I need to investigate some more... What can/should/must I check in order
to find out why this happened? That is if I am able to repeat it!
- I will need to find out if I can repeat what happened...

We are running on FC4 (so that is PostgreSQL 8.0.3) on all the servers using
Slony-I 1.1.0.

Christopher Browne

2005-09-19, 3:25 am

"Deon van der Merwe" <dvdm-4de3xTZDJU8QZ7m9OUuf1w@public.gmane.org> writes:
> We have a master and 4 slave servers running Slony1 on a LAN. Everything is
> working great, except for one thing:
> - each of the slon processes runs on its own server
> - they each run in an endless loop, so that they can always start again for
> whatever reason
> - we had to do a reboot of the master server
> - after the reboot, all the slaves reconnected
> - the problem is this: the actual replication of data stopped. With a
> restart of the slon process on every slave the replication started to work
> again.
>
> My questions are:
> - what is the expected behavior for the above scenario?
> - I need to investigate some more... What can/should/must I check in order
> to find out why this happened? That is if I am able to repeat it!
> - I will need to find out if I can repeat what happened...
>
> We are running on FC4 (so that is PostgreSQL 8.0.3) on all the servers using
> Slony-I 1.1.0.


So, the only database that "fell over" was the master?

It sounds like what happened is that the remote worker threads that
pointed to the "master" saw that DB go away, and shut down the one
relevant remote worker thread.

This left all the other threads up and running, which would have been
OK had subscriptions been provided by the other threads...
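
[Editor's sketch] The failure mode described above (one worker thread giving up while the rest of the process keeps running) can be illustrated with a toy model. This is not Slony-I's code; `make_fetch`, `remote_worker`, `run_workers` and the event names are hypothetical and exist only to show why the process can look healthy while no data from the master is applied.

```python
import threading


def make_fetch(script):
    """Return a fetch() that replays `script`; Exception items are raised."""
    items = iter(script)

    def fetch():
        item = next(items, None)
        if isinstance(item, Exception):
            raise item
        return item

    return fetch


def remote_worker(node, fetch, applied):
    """Apply events from one remote node until its connection fails."""
    while True:
        try:
            event = fetch()
        except ConnectionError:
            return  # this thread ends; nothing in the process restarts it
        if event is None:
            return  # script exhausted
        applied.append((node, event))


def run_workers(scripts):
    """Start one worker thread per node and return every applied event."""
    applied = []
    workers = [
        threading.Thread(target=remote_worker, args=(node, make_fetch(script), applied))
        for node, script in scripts.items()
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return applied
```

In this model, a script such as `["e1", ConnectionError("master rebooted"), "e2"]` for the master node leaves `e2` unapplied, while the workers for the other nodes finish normally.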

I have to call this behaviour "not unexpected."

An interesting retry would be to have one or more cascaded
subscribers.

Expected result there: If you restart the slons for the direct
subscribers, that should suffice to get all the subscribers back
going. The cascaded subscribers should pick up once the direct
subscribers have their slons restarted.
--
"cbbrowne","@","ca.afilias.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)

Deon van der Merwe

2005-09-19, 7:24 am

Hi Christopher,

At 05:59 AM 9/19/2005, Christopher Browne wrote:
>"Deon van der Merwe" <dvdm-4de3xTZDJU8QZ7m9OUuf1w@public.gmane.org> writes:
>
>So, the only database that "fell over" was the master?


Correct. All 4 slaves were untouched, and we rebooted the master.


>It sounds like what happened is that the remote worker threads that
>pointed to the "master" saw that DB go away, and shut down the one
>relevant remote worker thread.
>
>This left all the other threads up and running, which would have been
>OK had subscriptions been provided by the other threads...


From what I could see (given the little that I know of Slony1...),
they did reconnect.


>I have to call this behaviour "not unexpected."




>An interesting retry would be to have one or more cascaded
>subscribers.


I will try and make a plan on the test system, as the above was on
the live system.


>Expected result there: If you restart the slons for the direct
>subscribers, that should suffice to get all the subscribers back
>going. The cascaded subscribers should pick up once the direct
>subscribers have their slons restarted.



A restart of the slons on each slave did restart the actual
replication without any delay.

I really want to investigate this more, but need to know what to
check, and where, in order to provide more detailed information. Any
suggestions?


-Deon
_____________________________________________________
TruTeq Wireless (Pty) Ltd. | Tel: +27 (0)12 667 1530
http://www.truteq.co.za | Fax: +27 (0)12 667 1531
Wireless communications for remote machine management
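
[Editor's sketch] On the "what to check" question above: one possible starting point is to look at replication lag on the master before and after a reboot. The sketch below assumes the `sl_status` view and its `st_received`, `st_lag_num_events` and `st_lag_time` columns are present in the installed Slony-I version, which should be verified against its documentation. The function names, thresholds and the cluster name `mycluster` are hypothetical.

```python
from datetime import timedelta


def status_query(cluster_name):
    """Build a lag query; Slony-I keeps its objects in the schema "_<cluster>".

    The view and column names are assumptions to verify. cluster_name is not
    sanitised, so only pass trusted values.
    """
    return (
        "SELECT st_received, st_lag_num_events, st_lag_time "
        f'FROM "_{cluster_name}".sl_status'
    )


def lagging_nodes(rows, max_events=50, max_age=timedelta(minutes=5)):
    """Return ids of nodes whose lag exceeds either threshold.

    rows holds (node_id, lag_events, lag_time) tuples, with lag_time given as
    a timedelta (how common PostgreSQL drivers return an interval).
    """
    return [
        node
        for node, events, age in rows
        if events > max_events or age > max_age
    ]
```

If the slaves show steadily growing lag after the master comes back while their slon processes are still running, that would be consistent with the explanation given earlier in the thread. The slon log output from each slave around the time of the reboot would be the other thing worth capturing.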