Author what to consider for failover policy?
31337 ..

2005-06-20, 1:23 pm

Ok, I'm a new slony'er, only been messing with it for a few days. I
will need to implement this very soon, and I need to come up with a
failover/switchover policy. What are you guys doing to say 'master
node is down'? I have considered setting up another database on the
master, and having a separate server do a 'write' to the database,
then try to read it. If the read/write succeeded, then the server is
ok; if it fails, start the switchover script to change the next
node to be the master.
Are there any other easier ways to detect when the master node has gone down?
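
For concreteness, here is a minimal sketch of the write-then-read check described above, in Python. The names are assumptions for illustration only: the `heartbeat` table, the `write_read_probe` function, and the `connect` callable (any DB-API 2.0 connection factory, for example a lambda around `psycopg2.connect` for PostgreSQL). The dry run at the bottom uses an in-memory SQLite database purely so the sketch runs without a server. A False result only says that this one client could not write and read back; it says nothing about why.

```python
import sqlite3
import uuid


def write_read_probe(connect, placeholder="%s"):
    """Write a random token to a heartbeat table and read it back.

    connect     -- hypothetical callable returning a DB-API 2.0 connection
    placeholder -- parameter marker of the driver ("%s" for psycopg2,
                   "?" for sqlite3)
    Returns True only if the token read back matches the token written.
    """
    token = uuid.uuid4().hex
    try:
        conn = connect()
    except Exception:
        return False  # could not even connect
    try:
        cur = conn.cursor()
        # 'heartbeat' is an assumed table name, not something Slony-I provides.
        cur.execute("CREATE TABLE IF NOT EXISTS heartbeat (token TEXT)")
        cur.execute("DELETE FROM heartbeat")
        cur.execute(
            "INSERT INTO heartbeat (token) VALUES (" + placeholder + ")",
            (token,),
        )
        conn.commit()
        cur.execute("SELECT token FROM heartbeat")
        row = cur.fetchone()
        return row is not None and row[0] == token
    except Exception:
        return False  # the write or the read failed
    finally:
        conn.close()


if __name__ == "__main__":
    # Dry run against in-memory SQLite, standing in for the real master.
    print(write_read_probe(lambda: sqlite3.connect(":memory:"), "?"))
```

The probe deliberately returns a plain boolean and takes no action itself; what to do with a failure is the policy question.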

Any input at all would be appreciated,
Thanks in advance,
Tyler
Vivek Khera

2005-06-21, 9:24 am


On Jun 20, 2005, at 2:21 PM, 31337 .. wrote:

> Are there any other easier ways to detect when the master node has
> gone down?


Your first step is to define *precisely* what you consider "down".
Try enumerating all scenarios relative to each machine that could be
connecting to any of your db servers, and how you would notify all of
those hosts to switch to another "master", and how you would tell the
master it is no longer the master should it become undead.

This will be very hard.

And you will also want a general network monitoring system like
nagios running to tell you about the failures.
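
As a rough illustration of what a Nagios-style check for the database host could look like, here is a sketch in Python. It relies on the standard Nagios plugin convention of exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN plus one status line on stdout. The function names, the thresholds and the choice of a plain TCP connect to port 5432 (PostgreSQL's default) are assumptions for the example, not a recommended definition of "down".

```python
import socket
import sys
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes


def classify(reachable, seconds, warn=1.0, crit=5.0):
    """Map a probe result to an (exit_code, status_line) pair.

    warn and crit are assumed thresholds in seconds.
    """
    if not reachable:
        return CRITICAL, "CRITICAL - no TCP connection"
    if seconds >= crit:
        return CRITICAL, "CRITICAL - connect took %.2fs" % seconds
    if seconds >= warn:
        return WARNING, "WARNING - connect took %.2fs" % seconds
    return OK, "OK - connect took %.2fs" % seconds


def check_tcp(host, port=5432, timeout=10.0):
    """Time a TCP connect to the database port; return (reachable, seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
        return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    host = argv[1] if len(argv) > 1 else "localhost"
    code, line = classify(*check_tcp(host))
    print(line)
    return code
```

To install it as a plugin, end the file with `sys.exit(main(sys.argv))`. Note that a CRITICAL from this check only covers one of the scenarios you would have to enumerate: the path from the monitoring host to the port.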

Vivek Khera, Ph.D.
+1-301-869-4449 x806
Christopher Browne

2005-06-21, 1:24 pm

31337 .. wrote:

>Ok, I'm a new slony'er, only been messing with it for a few days. I
>will need to implement this very soon, and I need to come up with a
>failover/switchover policy. What are you guys doing to say 'master
>node is down'? I have considered setting up another database on the
>master, and having a separate server do a 'write' to the database,
>then try to read it. If the read/write succeeded, then the server is
>ok; if it fails, start the switchover script to change the next
>node to be the master.
>Are there any other easier ways to detect when the master node has gone down?
>
>

This depends on everything up to and including hardware fault analysis
tools.

--> What if an Ethernet cable somewhere between the hosts has an
intermittent fault?

That will lead to the "attempted write" failing.

--> What if a power supply on a (router|disk array|computer) fails?

That can disconnect one or another component, and lead to the "attempted
write" failing.

Any of these hardware failures would make your test raise a 'fault',
even though the database itself may be perfectly healthy. Only you can
answer whether your "fault test" can 'safely' drive a policy where a
fault detected in that fashion triggers FAIL OVER, which means the
'possibly dead' node is treated as destroyed.

As far as *I* am concerned, failover is the sort of thing that would
involve me calling one of our network admins to verify that the master
is well and truly broken from the network perspective, and then
escalating to the appropriate management level for a Manager to say
"Yes, Chris, fail it over." (And yes, I'd make that call at 3am, if
need be...)

Really and truly, making this policy is NOT a matter for discussion on
this list; it is a matter for you to discuss with your "powers that be"
in order to properly factor the *business* factors into the policy.

It might well be that you discover you need to buy some more hardware to
help improve the ability to analyze hardware faults. And it is worth
pointing out that people spend literally millions of dollars on tools
like HP OpenView, IBM Tivoli, and such, and they have NOT become any
sort of magical "silver bullet" to correctly diagnose hardware faults.

We've got some guys that spend some of their time (and hence some
non-trivial amount of money) generating Nagios tests, and that again
doesn't provide any sort of "diagnosis for free." When they discover
that something breaks, particularly in a complex network environment, it
then takes a sharp technical mind to figure out what broke, and why.
The automated tools can do no more than provide some clues.

Sorry not to be more directly helpful, but I would not want you to fool
yourself into thinking that there is some easy answer right around the
corner.

Automatic FAIL OVER represents Risky Business...
Daniel P. Berrange

2005-06-21, 1:24 pm

_______________________________________________
Slony1-general mailing list
Slony1-general-AuKwsB3Fm+ugFIWk8tvyRWD2FQJk+8+b@public.gmane.org
http://gborg.postgresql.org/mailman.../slony1-general

31337 ..

2005-06-21, 8:24 pm

Ok, I'd first like to thank all of you for your input; you have been
of great help so far. We have discussed the Linux-HA project as a
solution, but they started complaining about a whole server just
sitting there doing nothing until a failure. I see many people talking
of network connections failing. Once we start getting more users, this
will all be redundant. The power supplies are redundant (triple setup),
the hard drives are redundant, there are redundant UPSes, etc.
As for clients, there will be a total of 2 other local servers
accessing the database, and the clients log into those terminal
servers. So, all the database stuff is local. They asked me to make
this pretty much completely automated, as I am only here for
another week or so. I have started telling them that with the network
monitoring, we will be able to notify someone of a failure, but only a
human will be able to make the final decision to fail over. I may
implement something to do a switchover if the load stays at a set value
for a set time or something along those lines. I am now convinced that
this automatic failover is going to be a HUGE task, and have started to
tell them that it is not a good idea. We (or I, rather) could just miss
too many things.
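
The "notify someone, but let a human decide" approach could be sketched roughly like this in Python. The class name, the threshold of 3 and the return strings are made up for illustration; the point is only that repeated probe failures escalate to paging a person and never run the failover by themselves.

```python
class FailureTracker:
    """Count consecutive failed probes and decide when to page a human."""

    def __init__(self, threshold=3):
        # Assumed value; pick it together with the probe interval.
        self.threshold = threshold
        self.failures = 0

    def record(self, probe_ok):
        """Feed one probe result; return 'ok', 'suspect' or 'page_human'."""
        if probe_ok:
            self.failures = 0  # any success resets the streak
            return "ok"
        self.failures += 1
        if self.failures >= self.threshold:
            return "page_human"  # escalate; failover stays a manual decision
        return "suspect"
```

Feeding it the results False, False, True, False, False, False yields 'suspect', 'suspect', 'ok', 'suspect', 'suspect', 'page_human': a single blip never pages anyone, and nothing in here touches the cluster.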

I was just wondering what you other guys were doing for failover setups.

Thanks a TON for the input!!
~Tyler



On 6/21/05, Daniel P. Berrange <dan-TA1HlLrDhUVWk0Htik3J/w@public.gmane.org> wrote:
> On Tue, Jun 21, 2005 at 10:04:13AM -0400, Vivek Khera wrote:
>
> You might want to take a look at the capabilities offered by a project
> such as Linux HA (www.linux-ha.org) or the Red Hat Cluster Suite. There
> are many failure & failover scenarios & you don't really want to have
> to think of them all yourself, so better to leverage existing code.
> Ultimately the real key to a reliable failover is some sort of STONITH
> (Shoot The Other Node In The Head) capability to ensure that, when a
> failover occurs, there is absolutely no way the original master can
> come back to life. Hardware power switches are preferable, but software
> NMI watchdogs could be used to do an automatic reboot of the failed
> node. The Linux-HA / RH Cluster Suite agents also take care of issues
> such as quorum & split-brain to ensure optimal choice of slave to fail
> over to.
>
> Regards,
> Dan.
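
The quorum idea mentioned in the quoted message can be illustrated with a small majority-vote sketch in Python. This is not how Linux-HA or the Red Hat Cluster Suite implement it; the function name and the vote format are assumptions. It only shows why a minority of nodes that has lost sight of the master must not declare it dead on its own, and it is no substitute for STONITH.

```python
def master_considered_down(votes, cluster_size):
    """Decide by strict majority whether the master is unreachable.

    votes        -- dict mapping observer name to True if that observer
                    cannot reach the master (hypothetical format)
    cluster_size -- total number of observers, including ones that did
                    not report at all
    """
    down_votes = sum(1 for cannot_reach in votes.values() if cannot_reach)
    # Silent observers count against the decision, so a partitioned
    # minority can never reach a majority by itself.
    return down_votes > cluster_size // 2
```

With three observers, two "cannot reach" votes return True, while a single observer cut off from everything else returns False, even though from its own point of view the master is gone.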
> --
> |=- GPG key: http://www.berrange.com/~dan/gpgkey.txt -=|
> |=- Perl modules: http://search.cpan.org/~danberr/ -=|
> |=- Projects: http://freshmeat.net/~danielpb/ -=|
> |=- berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org - Daniel Berrange - dan-TA1HlLrDhUVWk0Htik3J/w@public.gmane.org -=|

Andrew Sullivan

2005-06-21, 8:24 pm

On Tue, Jun 21, 2005 at 04:18:37PM -0500, 31337 .. wrote:
> Ok, I'd first like to thank all of you for your input, you have been
> of great help so far. We have discussed the linux-HA project as a
> solution, but they started complaining about a whole server just
> sitting there doing nothing until a failure. I see many people talking


Boy, if I ever saw a place where the term "false economy" applied,
this is it.


> of network connections failing. Once we start getting more users, this
> will all be redundant. The power supplies are redundant (triple setup),
> hard drives are redundant, redundant UPSes, etc.


I have machines worth half a million bucks that are supposed to be
"all redundant", and break in surprising ways. Don't rely on this
promise for 5 9s.


> a set time or something along those lines. I am confirming now with my
> thoughts, that this automatic failover is going to be a HUGE task, and
> have started to tell them that it is not a good idea. We (or I rather)
> could just miss too many things.


It's just really dangerous: you can lose data.

> I was just wondering what you other guys were doing for failover setups.


We carry pagers, and use high availability stuff. I'm on a panel at
OSCON this year about it.

A

--
Andrew Sullivan | ajs-oaT0K0jot5/q2IAV+ODieA@public.gmane.org
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
--Dennis Ritchie
Rod Taylor

2005-06-21, 8:24 pm

> > of network connections failing. Once we start getting more users, this
>
> I have machines worth half a million bucks that are supposed to be
> "all redundant", and break in surprising ways. Don't rely on this
> promise for 5 9s.


Or, NOT break in surprising ways. Outright broken is easy, but I've not
seen anything for "it just became slow" type failover ;)

