I was performing some maintenance on the first node of my CCR cluster today, which required me to move the Exchange instance over to the secondary CCR node. As normal, I typed in the Move-ClusteredMailboxServer cmdlet, only for it to fail with the following error:
Move-ClusteredMailboxServer : The passive copy of storage group 'First Storage Group' has too many logs to replay in order for a timely completion of the Move-ClusteredMailboxServer operation. The replay queue is at least 75 log files, and the maximum allowed queue length is 15 log files. If you want to proceed, you must manually dismount the affected databases and then re-run the Move-ClusteredMailboxServer task using the -IgnoreDismounted parameter. After the move has completed, you must manually mount the database on the active node.
Odd, I thought, so I ran Get-StorageGroupCopyStatus, which provided me with the following output:
This obviously indicated to me that my replay queue was indeed in excess of the threshold. This was a little odd, as I had not had this problem before, so I decided to run the Test-ReplicationHealth cmdlet just in case there had been some form of issue. The following was the output of that command:
Interestingly, the above output told me that, at least on face value, there was nothing wrong with my cluster configuration. It was at this point that I decided to follow the advice of the original error message, so I entered:
Dismount-Database -Identity "Mailbox Database"
I then entered in the following command:
Move-ClusteredMailboxServer -Identity EX-CCR-01 -Target x64EXCCR2 -MoveReason "Maintenance" -Confirm:$False
I then mounted the database on the passive node and proceeded with my maintenance on the previously active node. When I was done with my maintenance I issued the Move-ClusteredMailboxServer command again to move the Exchange instance back to my Primary Node, which promptly fell flat on its face when trying to mount the database in the "First Storage Group", with the following error in the PowerShell window:
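For reference, the mount step was just the counterpart of the earlier dismount; assuming the same database name as above, it was along the lines of:

Mount-Database -Identity "Mailbox Database"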
Move-ClusteredMailboxServer : Database First Storage Group/Mailbox Database (EX-CCR-01) is not in the final expected state: final state is Failed, expected Online.
I had a look in the Application Event Log on my primary cluster node and saw the following entries:
Oh dear, I thought, but decided not to panic (at least not just yet), so I moved the Exchange instance back to the secondary node (where I knew that it did work) by re-issuing the Move-ClusteredMailboxServer cmdlet, which completed successfully. When the Exchange instance had correctly started up on the Secondary Node, I entered the Get-StorageGroupCopyStatus cmdlet, which gave me the following output:
I noticed that under "ReplayQueueLength" there was a value of 4. I wondered if this was perhaps something to do with my issues, so I decided to leave the cluster on the Secondary Node for a little while to see if the queue length dropped. Around 45 minutes later I returned to the Secondary Node and issued the Get-StorageGroupCopyStatus command, which still reported a queue length of 4. At this point I decided that I was going to re-seed the replication process for the "First Storage Group", which involved the following commands:
Suspend-StorageGroupCopy -Identity "EX-CCR-01\First Storage Group" -Confirm:$False
Update-StorageGroupCopy -Identity "EX-CCR-01\First Storage Group" -DeleteExistingFiles -Confirm:$False
Resume-StorageGroupCopy -Identity "EX-CCR-01\First Storage Group"
I then issued the Get-StorageGroupCopyStatus cmdlet, which now reported the following:
You will notice that the "CopyQueueLength" entry has a value of 213. This did not worry me, as when I re-issued the Get-StorageGroupCopyStatus cmdlet it produced the following output (you will notice that the CopyQueueLength value had decreased, and continued to do so):
I waited for the CopyQueueLength to count down to 0, at which point I issued Move-ClusteredMailboxServer again; on this occasion it completed correctly and all stores mounted as expected. Now, I had to wonder why this happened in the first place, so I spent a few hours reviewing the Event Logs. The only suspicious thing that I found was that the Secondary Node had been shut down for a 12 hour period (this was due to some tweaks that we were making to the hardware configuration), so I am surmising that having the Secondary Node down for such a period upset the replication process in some way. I hope that this helps someone along the way.
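As an aside, rather than re-issuing Get-StorageGroupCopyStatus by hand, you can have the Exchange Management Shell poll the queue for you. This is just a sketch assuming the same storage group identity as above; adjust the interval to taste:

do {
$status = Get-StorageGroupCopyStatus -Identity "EX-CCR-01\First Storage Group"
Write-Host ("CopyQueueLength: " + $status.CopyQueueLength)
Start-Sleep -Seconds 30
} while ($status.CopyQueueLength -gt 0)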