Windows 2008 Failover Clustering – Adding a Second Node into the Cluster Fails (The operation returned because the timeout period expired)…

by Andy Grogan on August 3, 2009

in Exchange 2007 (CCR), Exchange 2007 (Installation), Windows Clustering

I have been very busy recently on a number of projects which have, alas, taken me away from working with Exchange (much to my dismay). However, a requirement has arisen at work where we needed to add another clustered CCR mailbox server to support an additional 1200 users. My company has recently assimilated a number of people who previously did not have mailboxes, so I have had to add another Exchange CCR server, as we will now have 8000 users and I like to keep my Exchange servers under a very low load.

Generally speaking this would be a fairly straightforward task for me: order the hardware, get it into stock, install it into racks, begin the installation process.
Of course, given that I am in the middle of the chaos that is a data centre move, some of my processes are all over the place. This is actually how it happened:

  1. Order hardware – ok – I decided that I would go for one of the new HP DL 370 G6s due to its large internal drive capacity (up to 24 SFF drives).
  2. Get it into stock – the initial delivery sat in the basement of my company for a week, as the “mail room” had erroneously been told that no deliveries were to be shipped to my department due to the building work.
  3. Install in racks – because the hardware sat in the basement for a week, we strayed one week into my hardware freeze for the Data Centre refurb, which is also when I have been planning the next move (Phase 3, which is by far and away the biggest) – so I have had to fill in the paperwork to allow a change to my own freeze!

I think it was at this point that I began to create my problems.

Because my company operates very strict change control (most of which was defined by me), I could not immediately install the servers into the Data Centre. As mentioned above, I had already called a freeze, and therefore the only way to get around the issue was to raise a change incident.

As this could not be raised as a P1 change (these have been halted for the Data Centre as a result of the freeze), the quickest that I could get the CAB (Change Assessment Board) together was 2 days (raised as a P2). So I thought that I would get the initial Windows 2008 builds done in our Test Lab area (which is outside the Data Centre area), as it had space and did not require change control.

(You might be wondering about major changes, for example critical patches for security issues and the like. I can overrule the freeze on a case-by-case basis; however, the above would not fall into that category.)

Now our LAB is ok. It exists on the same VLAN as the servers but has its own IP address range within the same subnet, where IP addresses are initially allocated by DHCP (this might seem strange to you, but there is a good reason that we use this which I won’t go into now).
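To make that layout concrete, here is a minimal sketch of how a lab (DHCP) address can be told apart from a static server address when both live in the same subnet. The subnet and the DHCP pool boundaries below are purely hypothetical, as are the names `SUBNET`, `DHCP_POOL` and `classify`; the real ranges are not given in this post.

```python
import ipaddress

# Hypothetical values for illustration only; substitute your own subnet
# and the DHCP pool that your lab hands out addresses from.
SUBNET = ipaddress.ip_network("10.10.0.0/23")
DHCP_POOL = (
    ipaddress.ip_address("10.10.1.100"),
    ipaddress.ip_address("10.10.1.200"),
)


def classify(address: str) -> str:
    """Return 'outside', 'dhcp-pool' or 'static' for the given address."""
    ip = ipaddress.ip_address(address)
    if ip not in SUBNET:
        return "outside"
    low, high = DHCP_POOL
    if low <= ip <= high:
        return "dhcp-pool"
    # Inside the subnet but not in the lab pool: treated as the server range.
    return "static"
```

With these made-up ranges, `classify("10.10.1.150")` falls in the lab pool while `classify("10.10.0.25")` counts as a server address, even though both are on the same subnet and VLAN.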

So in the LAB Area I completed the following tasks (the following is not recommended for production servers and is not the way that I would normally create a CCR cluster, but I had little choice – yes even I looked in the mirror and asked “am I really an MVP for even considering this?”):

NODE 1:

  1. Installed Windows 2008 EE x64 via SmartStart
  2. Installed the latest HP PSP and Firmware updates
  3. At this point the machine picks up a DHCP address (this is a major deviation from the rule – normally all servers are given static addresses right from the word go)
  4. Renamed the Machine to the corporate standard
  5. Joined the machine to the domain
  6. Completed the Windows updates as required
  7. Assigned the Node a static IP address which was in the server range (remember the LAB is on the same VLAN but has a different IP range in the same subnet)
  8. Installed the Exchange 2007 prerequisites – which include Failover clustering
  9. Created the new Cluster using static addresses
  10. Created the FSW for the cluster
  11. Installed Exchange 2007 CCR on the first node (normally I would not do this without the second node being built, but I needed the Exchange clustered instance to be online to meet some of the requirements of the project)
  12. Got Exchange configured.

OK, at this point, although the build had been messy, I still had a clustered CCR instance (although not doing any CCR) with Exchange installed. All I had to do was build and add the second node to the cluster.

NODE 2:

  1. Installed Windows 2008 EE x64 via SmartStart
  2. Installed the latest HP PSP and Firmware updates
  3. At this point the machine picks up a DHCP address (this is a major deviation from the rule – normally all servers are given static addresses right from the word go)
  4. Renamed the Machine to the corporate standard
  5. Joined the machine to the domain
  6. Completed the Windows updates as required
  7. Assigned the Node a static IP address which was in the server range (remember the LAB is on the same VLAN but has a different IP range in the same subnet)
  8. Installed the Exchange 2007 prerequisites – which include Failover clustering
  9. Tried to add the node to the cluster – BANG – falls flat on its head.

When I tried to add the second node into the failover cluster I was presented with the following error message:

The server 'crp-exccrnd-06.' could not be added to the cluster. An error occurred while adding node 'crp-exccrnd-06.' to cluster 'CLS-EXCNN-03'. This operation returned because the timeout period expired.

Initially I suspected that the NC375i Quad Port Gigabit server adapter might be to blame for the issue.

As I was using node redundancy (rather than rack or data centre redundancy) for this cluster, I had achieved the heartbeat connection via a crossover cable between the two nodes, plugged into one of the four network interfaces which are supplied with this server as standard.

However, I noted that the NIC autosenses the speed and duplex settings of the connection and then greys out the option to change them. The typical settings for a heartbeat interface from Windows 2000 to 2003 are 10 Mbps / half duplex, and with CCR clusters you can set the speed to 100 Mbps / full duplex. As the crossover interlink was running at 1000 Mbps / full duplex using the card's new autosensing feature, I surmised that either packets were being dropped or the failover clustering in Windows 2008 could not handle such a fast connection.

Below is an example of what I was seeing on the NIC interface configuration:

[Screenshot: NC375i speed and duplex settings]

After some research I discovered that my suspicions above were wrong (seemed like a good idea at the time).

Although the card displays Auto / Auto for speed and duplex, this does not mean “auto negotiate” on the NC375i; it means that the card has determined the correct interface speed and set itself accordingly.
I also discovered that there was no packet loss on the interface, and indeed Windows 2008 does not really care about the interface speed and duplex as long as the interface is not truly set to auto negotiate (in the conventional sense, as on other cards) and is not teamed (teaming a heartbeat interface is not supported).

- Bugger! -

I tried a number of solutions: enabling IPv6 on the heartbeat interface (which can play a part in forming a cluster in Windows 2008), enabling NetBIOS, disabling NetBIOS (again). All of them produced the same result.

In the end I decided to go back to basics, and run through the clustering checklists that I have followed over the years.

Everything checked out until I reached the DNS configuration. I discovered that although I had changed the primary node's public IP address to the correct range on the server VLAN, the DNS servers still held the DHCP-allocated address which was picked up in step 3 of the first node’s configuration (as described above).

The Solution:

I manually changed the IP address on the DNS server and replicated the changes amongst the domain controllers, flushed the DNS cache on each node, and lo and behold I was able to continue adding the second node.

Now the above might seem an elementary error; however, I had relied on automatic DNS registration to update the DNS server, and for some reason this did not happen. So, if you should find yourself in the above situation, and you have exhausted all of the other possibilities that are out there, double check your DNS configuration to ensure that the public addresses of your nodes are correctly registered.
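That double check can be scripted. What follows is only a rough sketch of the idea, not the procedure I actually used at the time (I checked by hand): it asks the local resolver for each node name and reports any node whose answer differs from the static address you expect. The function names and the sample host names and addresses in the usage note are hypothetical. `socket.gethostbyname_ex` is the standard Python call for a forward lookup, and it goes through the operating system resolver, so run it after flushing the DNS cache on the machine in question.

```python
import socket
from typing import Callable, Dict, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the IPv4 addresses the local resolver gives for hostname."""
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        # Covers socket.gaierror and socket.herror (name not found, etc.).
        return []
    return addresses


def find_stale_records(
    expected: Dict[str, str],
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> Dict[str, List[str]]:
    """Map each node whose DNS answer is not exactly its expected static IP
    to the addresses that were actually returned (empty list if none)."""
    stale = {}
    for hostname, static_ip in expected.items():
        resolved = resolver(hostname)
        if resolved != [static_ip]:
            stale[hostname] = resolved
    return stale
```

For example, calling `find_stale_records({"node-a.example.local": "10.10.0.25", "node-b.example.local": "10.10.0.26"})` (made-up names and addresses) returns an empty dictionary when both records are correct, and otherwise lists the nodes whose records still point somewhere else, such as an old DHCP lease. The `resolver` parameter exists only so the check can be exercised without a live DNS server.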
