It’s a majority thing, DAG voting explained (a bit)…
I thought that it would be good to take a break from the development stuff that I have been doing on the site for the last few months and focus on something a little more technical. I hope that the following will help some people with their own Exchange HA deployments.
Historically I have always had an attraction to Exchange clustering (whether it was SCC, CCR or now DAG), so in this post I would like to take the opportunity to discuss the concepts of “Majorities” and “Quorum” in the context of Database Availability Groups.
There are of course a number of articles out and about on the web which deal with this subject at some level, but I thought that I would chuck in my own simplistic take on these critical concepts.
Firstly, how does Exchange 2010 DAG use Windows Failover Clustering?
Well, technically it does and it doesn’t (now there’s a great answer for you). Exchange 2010 has its own HA model of clustering and data replication (which is an enhanced version of the model introduced in Exchange 2007 with CCR) – however there are some small elements of the Windows Failover Clustering service which DAG makes use of:
- Cluster Heartbeat
- Cluster Networks
- Cluster Configuration Database
When you open the Failover Cluster Manager on a DAG node within your environment you can see how the above components actually look in practice:
One of the key things to note is that in Exchange 2010 DAG you should neither need nor use the Windows Failover Cluster Manager to configure any aspect of the DAG. All functionality is presented either via the Exchange Management Console or the Exchange Management Shell.
You don’t even need to install and pre-configure the Windows Failover Clustering service, as it is installed automatically when the first Mailbox Server is added to the DAG, either with the Add-DatabaseAvailabilityGroupServer cmdlet or via the Wizard from within the Exchange Management Console (New-DatabaseAvailabilityGroup itself only creates the empty DAG object).
Ok, but what about all this Quorum and Majority Node thingies and stuff?
This is where this post might seem to get a little complicated, but I am hoping to explain the ideas in a pretty straightforward way.
Each Mailbox Server which plays a part in a DAG is also considered a “voter” (a voter, as in a political election, helps to determine the right candidate to transfer services to in the event of a failure).
In many political situations (and if you have ever served on a committee you may have encountered this) there is a concept called “Quorum”, which is the minimum number of people with voting rights who must be present at a meeting in order to carry a motion or decision.
A decision always needs a majority. If only two people vote you can get a 50 / 50 split and therefore no majority, or you can get a 100% vote, but as there were only two people involved, who is to say they are right? Therefore you will find that many official meetings have a quorum set at a minimum of 3 (or more, but it will normally be an odd number).
The principles of DAG and Majority Node Set follow pretty much the same rules that we would find in our (my) real world example.
Therefore, if you have an EVEN number of DAG enabled Mailbox Servers in your configuration, you will require 50% of your DAG voters plus a Witness Server to be available to provide an arbitration vote in the event of a failure within the system. Mathematically this can be represented as:
n + 1
Where n represents 50% of the number of DAG Mailbox Servers and the 1 is the arbitration vote of the File Share Witness server.
A File Share Witness Server does not hold a copy of the Quorum database, but it can devolve a vote (arbitrate) to a server which is online within the DAG configuration (so that DAG server has two votes). The FSW server keeps track of which of the DAG nodes has the most up to date copy of the Quorum database and will pass its vote to that server.
Therefore to try and make the above clearer – if you have a 4 node DAG infrastructure you will need a minimum of two DAG Mailbox Servers plus the File Share Witness Server online in order to maintain Quorum (This is called Node and File Share Majority) – this is depicted in the example below:
In the above scenario, if you lose three of the DAG Servers, or two plus the File Share Witness, then Quorum is lost: the cluster cannot identify the most up to date copy of the cluster configuration, or indeed which server was running the relevant resources. Rather than risk a “Split Brain” scenario (where separate groups of nodes each believe they own the resources), the whole DAG infrastructure goes offline until an administrator can intervene to rectify the situation, see below:
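The 4 node cases described above can be checked with a simple vote count. This is only a sketch of the arithmetic, not anything Exchange or Windows Failover Clustering exposes: the function name `has_quorum` and its parameters are made up for illustration, and it assumes every DAG member and the File Share Witness carry exactly one vote each.

```python
def has_quorum(total_nodes: int, nodes_online: int, witness_online: bool) -> bool:
    """Vote count for Node and File Share Majority (hypothetical helper).

    Assumes an even number of DAG members, each with one vote, plus one
    vote for the File Share Witness.
    """
    if total_nodes <= 0 or total_nodes % 2 != 0:
        raise ValueError("this sketch only covers DAGs with an even number of members")
    if not 0 <= nodes_online <= total_nodes:
        raise ValueError("nodes_online must be between 0 and total_nodes")

    total_voters = total_nodes + 1  # DAG members plus the witness
    votes_present = nodes_online + (1 if witness_online else 0)

    # Quorum needs a strict majority of all possible voters.
    return votes_present > total_voters // 2


# 4 node DAG: two members plus the witness is enough...
print(has_quorum(4, 2, witness_online=True))
# ...but losing three members, or two members plus the witness, is not.
print(has_quorum(4, 1, witness_online=True))
print(has_quorum(4, 2, witness_online=False))
```

Note that three surviving members also form a majority on their own (3 of 5 votes), so the witness only matters when exactly half of the members are left.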
If you have an odd number of nodes in your DAG infrastructure (for example 3, 7, 9 etc.), Exchange 2010 will automatically set the Majority model to “Node Majority” (where the File Share Witness is not used, as the DAG members alone can always form a majority of voters).
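The rule Exchange applies here comes down to whether the member count is odd or even. A minimal sketch of that selection follows; `quorum_model` is a hypothetical name used purely for illustration, and the returned strings are simply the model names used in this post.

```python
def quorum_model(member_count: int) -> str:
    """Return the quorum model for a DAG of the given size (illustrative only)."""
    if member_count < 1:
        raise ValueError("a DAG needs at least one member")
    if member_count % 2 == 1:
        # Odd membership: the members can always form a majority by themselves.
        return "Node Majority"
    # Even membership: the File Share Witness supplies the tie-breaking vote.
    return "Node and File Share Majority"


for size in (3, 4, 7):
    print(size, quorum_model(size))
```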
This is where things might seem even more confusing, but when you do the maths – it makes sense.
If you have a DAG with 3 nodes and one fails, then you still have 66.6% of your voters online and Quorum is maintained. If you lose two nodes from a 3 node DAG, then you have lost 66.6% of the installation and it would therefore fail, as only a minority of the nodes remains. Even in that scenario, if the File Share Witness participated (remember the equation, where n is 50%), you would not achieve Quorum, as you would have lost too many of the actual DAG members (the FSW does not hold a copy of the Quorum).
Scaling Node Majority up to say 5 nodes you could lose two nodes and maintain Quorum without a FSW.
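The same maths can be written down for any odd sized DAG. The sketch below assumes one vote per member and no witness; the helper names `votes_required` and `tolerated_failures` are invented for this example.

```python
def votes_required(member_count: int) -> int:
    """Smallest number of online members that is a strict majority."""
    return member_count // 2 + 1


def tolerated_failures(member_count: int) -> int:
    """How many members can fail under Node Majority before Quorum is lost."""
    if member_count < 1 or member_count % 2 == 0:
        raise ValueError("Node Majority in this sketch assumes an odd member count")
    return member_count - votes_required(member_count)


for size in (3, 5, 7):
    print(f"{size} nodes: need {votes_required(size)} online, "
          f"can lose {tolerated_failures(size)}")
```

For 3 nodes this gives one tolerated failure and for 5 nodes two, matching the examples in the text.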
I hope that I have provided some insight into what is a concept in Exchange 2010 which can be very confusing. I will admit that I have not added in additional concepts such as Alternate Witness Servers and their role in the whole concept of DAG (perhaps a later post). But I hope that it helps someone.