Ran into an interesting issue today. A subordinate wing of my company has a Windows 2003 forest with a single domain, which is made up of four domain controllers. Two of the domain controllers are physical boxes, whilst the other two are held within a VMware VI3 environment. Last Friday night the VMware team needed to close down the VI3 cluster for essential maintenance and therefore also take down the two virtual domain controllers.
This in itself was not a problem, as proper change controls had been raised, and indeed we had prepped the essential services which depend on these DCs (such as Exchange) for the loss of the two machines. Anyway, come today, I arrived in the office to find that the following problems had been logged by the company:
The company in question makes very heavy use of Citrix, and around 45% of its workforce could not log on; instead, Citrix was returning an “Access Denied” error
90% of customers within the company could not log on to Exchange
Upon further investigation of the Citrix servers' event logs, I found that the following errors had been logged numerous times:
Event ID : 1030
Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
Event ID : 1219
Logon Rejected for <DOMAIN>\<Username>. Unable to obtain Terminal server user Configuration. Error. Access is denied.
The above (and indeed the issues with Exchange) suggested to me that there was a problem with at least one of the domain controllers, so I reviewed the event logs on each of the four DCs but found nothing untoward. I was just about to break out DCDIAG when I noticed something a little odd.
At the time I was logged onto one of the virtual domain controllers, and I noticed that the clock in the System Tray was more than 10 minutes ahead of the actual time. Being acutely aware of Kerberos “Clock Skew”, and that the maximum tolerance in Windows installations is 5 minutes by default, I wondered if for some reason the virtual domain controllers' clocks were out of sync with the physical domain controllers' clocks.
After a quick check I found that this was indeed the case. In fact, the virtual domain controller's clock was 11 minutes out of sync with the AD authoritative time server, which was the first physical server (by default this is the PDC emulator, typically the first DC installed in a Windows 200x domain).
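For anyone wanting to script this kind of check, here is a minimal sketch in Python that measures how far the local clock is from a time server using a plain SNTP query. This is not how I did the check above; it is just one way it could be done, assuming the target answers NTP on UDP port 123 (domain controllers running the Windows Time service normally do). The function names are my own, and the offset it returns is only approximate.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_request():
    """Build a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet):
    """Return the server's transmit timestamp as Unix seconds.

    The transmit timestamp sits in bytes 40-47 of the reply:
    32 bits of seconds followed by 32 bits of fraction.
    """
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(server, timeout=2.0):
    """Approximate offset in seconds: server clock minus local clock."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent = time.time()
        sock.sendto(build_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
        received = time.time()
    # Compare the server's time with the midpoint of our send/receive times.
    return parse_transmit_time(packet) - (sent + received) / 2
```

Calling clock_offset() with the name of the authoritative time server (for example a hypothetical "dc1.example.local") from each domain controller would show how far each one has drifted; a result of several hundred seconds either way would have pointed straight at the problem.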
Why is this important?
For those of you who are not familiar with Windows Active Directory authentication between servers and user accounts: very simplistically, Kerberos is the authentication mechanism which ensures that accounts (either user or computer related) can present themselves correctly to the domain as authenticated entities. Each account is presented with a ticket which is time stamped. It is important that each DC keeps as near to the correct time as possible in order to correctly authenticate the ticket and allow access to resources.
If, for example, as in this scenario, you have three domain controllers which are in time sync and one that is outside the time sync tolerance (which is configured by Group Policy and is 5 minutes by default), any request that is sent to the DC which is outside the tolerance will fail. This explains the randomness of what was happening: whether a logon worked depended on which DC handled the request.
Well, one might think that this would be a case of simply resetting the time on the virtual domain controller to be in sync with the authoritative time server (or one of the physical boxes). However, after resetting the clock (using NET TIME \\<AuthoritativeTimeDC> /set), the clock on this domain controller would resync back to being 11 minutes fast.
After some digging around I remembered that virtual guests which have the VMware Tools installed can synchronise their system clocks with the VMware host. I therefore jumped on the Linux console of the VI3 server where the domain controller was resident and found that its clock was, yep, you guessed it, 11 minutes out.
Given that the VMware Tools can override the authoritative time master for a domain, I figured it was best to uncheck the time synchronisation setting on both of the VMware-based domain controllers (this can be done by double-clicking the VMware Tools icon in the guest machine).
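To make the tolerance rule concrete, here is a small Python sketch. It is only an illustration of the comparison, not of how Kerberos itself is implemented, and the server names and clock readings are made up for the example. It takes a reference time and a set of DC clock readings and reports which DCs fall outside the default 5 minute tolerance.

```python
from datetime import datetime, timedelta

# Default Kerberos "maximum tolerance for computer clock synchronization".
MAX_SKEW = timedelta(minutes=5)


def outside_tolerance(reference, clocks, max_skew=MAX_SKEW):
    """Return {name: offset} for each clock further than max_skew from reference."""
    return {
        name: reading - reference
        for name, reading in clocks.items()
        if abs(reading - reference) > max_skew
    }


# Hypothetical readings: three DCs roughly in sync, one running 11 minutes fast.
reference = datetime(2000, 1, 1, 9, 0, 0)
clocks = {
    "PHYS-DC1": reference,
    "PHYS-DC2": reference + timedelta(seconds=2),
    "VIRT-DC1": reference + timedelta(minutes=11),
    "VIRT-DC2": reference - timedelta(seconds=30),
}

for name, offset in outside_tolerance(reference, clocks).items():
    print(f"{name} is outside the tolerance (offset {offset})")
```

Only the DC that is 11 minutes fast is reported, so only the requests that happen to land on that DC would be rejected, while everything sent to the other three carries on working.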
After I had removed the above setting and reconfigured the time settings, everything started to work again.
Of course, after I had made the above changes I was keen to know what had happened during the VMware host upgrades to cause this. It would seem that the VMware team had “lost” one of the hosts during the procedures that they were undertaking, and had therefore reinstalled the VMware software (based around Linux) and applied all the updates, but had not configured the local time correctly on the host. I explained the ramifications of what had happened to lots of red faces, who I hope won’t do that again 🙂