HP Racking and Server Air Flow 101 vs the Outsourcer who (obviously) should know it all…
I know that this is way off topic, but it is something that I want to get off my chest: it’s a rant. I cannot apologise for it – I had to speak out. I will not mention any names, but it is a rant about poor service and bad contracts – with a small sprinkling of server and racking air flow information.
You will mainly know me for my work with Exchange and Microsoft products; however, my main role (and indeed the reason I have not posted much on the blog recently) is head of operations (or the Networks & Infrastructure Support Manager) for a large organisation in London. As with perhaps some of you, this role means that although I like to work primarily with Exchange and AD, I am actually responsible for an entire Data Operations and Data Centre team. My organisation has just commenced a full refurbishment of its Data Centre facilities, which includes new Air Conditioning, Fire Suppression, UPS replacement and a new Rotary Backup generator.
Additionally we are changing all of the flooring and ceiling infrastructure.
The above obviously represents a significant task – but to make matters harder for us, we have to complete all of this work with all the existing equipment still in the room (we cannot decant it to another location). This amounts to around 50 fully populated 42U racks and 10 fully populated 47U racks, plus an IBM SAN, IBM 3584 and XIOMAG.
Because of this we are completing the works to the room by performing a number of staged moves of all the server racking over a period of 8 weekends.
Now as you can imagine, this requires a significant amount of planning and logistical work (it can be really challenging moving a single rack in a weekend, let alone 12 fully populated racks to 6 metres away from where they were previously) – but thanks to some excellent work by my server and networking team we managed to get through phase 1 without issues – well, at least for our racking (which amounted to 10 of the 12 that we shifted).
The remaining two racks, which are the ones relevant to this post, are looked after by a 3rd party company (whom for legal reasons I will not name here – all I will say is that they are a large IT “outsourcing” provider here in the UK – although they have not managed to get a foothold in my I.T. department for reasons that you will work out later on!). We host their racking in our Data Centre for historic reasons rather than contractual ones – and indeed I suspect that these days they wish for their kit to stay housed where it is (in our DC) because they have lost a number of high profile contracts and cannot afford (or do not wish to take on) the cost of housing these racks.
From their point of view why bother?? – there is the convenience of having someone else pay for the electricity and provide a Tier3 standard room, whilst effectively claiming the equivalent of “Electronic” squatters’ rights and getting to moan when things don’t go their way.
Now getting this company on board with our moves has been a little difficult. At first (over a month prior to the phase 1 moves) we asked this 3rd party to move their equipment offsite (as they have done with some of their other systems) to a location elsewhere in the U.K. This seemed the best idea considering that the services these two hosted racks provide are key to revenue in my organisation, and given how much disruption would be going on within the room there was a good argument, on the balance of risk, to eliminate them from the equation.
This of course was met with a number of exceptionally weak technical excuses from the supplier as to why these racks could not be moved elsewhere (bandwidth being quoted as the major reason – Hmmmm, can’t see that one on a 10MB feed that operates at peak with 2.3 max utilisation). So, to cut a very long story short, we then suggested that they would have to move their kit within the room just like us (we basically had to say: either you move it, or we will).
This of course is where the fun really started (please bear in mind that they only have two racks in the room) – in order to complete the moves (which they then started to claim they had little notice of, after two weeks of arguing about hosting them offsite) the following requests came through:
- They needed to see our risk analysis (Perhaps they needed some clues for their own)
- They needed to see our Asbestos certificate (WTF – I asked them to move racks not take down walls!)
- They needed to send a delegation to site to evaluate their existing two racks (this consisted of 8 people when they turned up, all of whom walked around and did very little but come up with reasons why they did not wish to take responsibility for the move)
- They needed to confer with their board (Delaying tactic to see if we would abandon the project – they do this sometimes)
The upshot of the above was a week wasted chasing them for their information and requirements for their racks (power and patching) to which we got the following (edited slightly to protect the not-so-innocent):
“30 power sockets and 29 connections”
That was it – honestly – the sum total of all the above, and indeed the “review” they had on our site, was a three-line e-mail from their “project manager”, who gave the above as their requirements.
I had a good mind at the time to leave it at that and let them crash and burn, but given the impact to my organisation I wrote back and stated:
“HP Racks (which these two were) either have 16AMP or 32AMP PDUs – therefore you might have a requirement for 30 power sockets – these will already exist in your rack – what I need is how many PDUs you have in your respective racks which terminate as either 16AMP or 32AMP commando sockets – therefore you will have something like a requirement for x2 or x3 or x4 sockets depending on the rating and number of PDUs”
HP 32A Modular PDU – sockets in the rack are connected to this (it has a capacity of 4 blocks with 8 sockets per block)
32A Commando Socket – attached to the above PDU
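For anyone who wants to sanity check a requirement like “30 power sockets”, the arithmetic can be sketched roughly as below. This is only an illustration: the 4 blocks of 8 sockets per PDU comes from the PDU described above, while the function names, the split across two feeds, the 230 V supply and the 5,000 W load are my own assumptions and not figures from the racks in question.

```python
import math


def pdus_needed(device_plugs: int,
                blocks_per_pdu: int = 4,
                sockets_per_block: int = 8,
                redundant_feeds: bool = False) -> int:
    """Estimate how many PDUs (and so commando sockets) a rack needs.

    Counts sockets only; it says nothing about whether the load fits.
    """
    sockets_per_pdu = blocks_per_pdu * sockets_per_block  # 4 x 8 = 32
    if redundant_feeds:
        # Assumption: dual-corded kit is split evenly across an A and a B feed,
        # and each feed needs its own PDU(s).
        plugs_per_feed = math.ceil(device_plugs / 2)
        return 2 * math.ceil(plugs_per_feed / sockets_per_pdu)
    return math.ceil(device_plugs / sockets_per_pdu)


def fits_rating(load_watts: float, rating_amps: float, volts: float = 230.0) -> bool:
    """Crude check that a load fits on one feed (ignores power factor and derating)."""
    return load_watts / volts <= rating_amps


print(pdus_needed(30))                        # 1
print(pdus_needed(30, redundant_feeds=True))  # 2
print(fits_rating(5000, 16))                  # False (hypothetical 5 kW load)
print(fits_rating(5000, 32))                  # True
```

The point is simply that the number of sockets inside the rack tells you very little on its own; what the room has to provide is the number and rating of the commando sockets feeding the PDUs.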
Additionally I also said:
“You have the 29 connections – but what are they currently – 100M/FD – 1000M/FD etc.”
This obviously stirred something in the said supplier, as two days later I was sent an e-mail which pretty much admitted that they had no clue how their racks were put together from a power point of view, but contained some poor recommendations (to all intents a wish list) which asked for 10 connections more than they already had and gave no reason for wishing to have the additional connections.
I wrote back and explained that this was a little better (ahem), but that they would need to firm up the requirement for their own sake. Their racks contained kit that was over 8 years old and had not been touched in around that period of time, so we would have to guess the power requirements, and (as they had originally stated that they would be onsite on the day after our moves) this could cause them major issues if my team was not around.
Over the next week or so, we saw people coming to site (from the company) surreptitiously analysing their racks – and on the day before the moves we finally got a project plan from them (quite unexpected really – but it was pants – however the major change was that they would now be onsite on the SAME day as my team – hmmmmmm – I wonder what could have changed their minds???).
Come the Day – Come the Power;
So cometh the day: my team rolls in at 06:30 in the morning and begins our task of moving our cabinets to their new locations to allow for the floor to be laid. By 10:00 we had all 10 of our racks in place, so we were well into running new patch panels, fibre and power – then at 11:30 in rolls their move team (all 4 of them) from the said company.
They then needed a 20 minute conversation with my Senior networking engineer (as they had not worked out their Networking requirements properly) before they could begin.
I think that by 14:30 they had their racks in place (not connected, just moved to the new locations – they needed a lunch break between 13:00 and 14:00, it would appear). It took them from 11:50 to 13:00 just to strip down the redundant wiring so they could find their PDUs and disconnect from the mains so the racks could be moved.
Additionally, during their move they managed to kill their KVM and asked to borrow one of ours, needed to borrow both my Networking and Electrical Services people, and had the cheek to tell me not to take back the KVM that I had “loaned” them until they had notified me that I could – cheeky buggers.
They managed to get their racks powered back on, and at around 16:00 their move team buggered off – only for one of their test users to walk into my department 20 minutes later and ask where they were, as their systems were not working.
I explained that unfortunately they had left site – the response to which was “Fuc!*ng Typical – at least you guys are still here, I guess that all of us will go home now”. Basically there was a team of 6 users onsite who had been left in the lurch, and the company had not even told them they were leaving, let alone that the system was still buggered.
Anyway to cut another long, painful story (in relation to them) short we got through the day and our moves were very successful so we finished up at around 20:30.
Let’s now fast forward to 08:30 on Sunday: I am bleary eyed, in the bathroom at home getting ready to take my little boy to Legoland, when my work mobile rings.
It is my on call engineer, who is onsite with one of the representatives from the said company, and who explains to me that there is a concern that the company’s racks are now running “too hot” and that this presents a risk to their service.
My chap explains to me that he has been forced (to placate the company’s representative) to open the front and rear doors of their racks and place a household desktop fan at the back of them.
I explained down the phone that:
- Their racks are so densely populated, and in the wrong configuration, that they have ALWAYS run hot (so much so that there was a small fire on one of their servers two years ago), and they had no roof fans either
- Our racks are 2cm (yes that is centimetres – less than an inch) away from their racks and I am not seeing any thermal issues from our servers
- Opening the rack doors at both front and back does not help the issue – in fact it makes it worse, as racks are designed for heat dissipation and the doors play a major part in that
- Adding a desktop fan within the Data Centre as a cooling aid is about as much use as a chocolate saucepan – it cannot provide the air flow required
I asked – “what is the test that has been applied here which leads to the conclusion that their racks are overheating?” – the answer (wait for it) was that the company’s engineer had placed his hand on the back of each of his servers and they were hot to the touch!
It was at this point that I might have sworn a little bit, and then explained that due to the way in which servers work, the back end of the server will always be hotter than the front. Cold air is sucked in through the front venting, blown through the server and expelled out the back (in certain HP models via the PSUs) – see below for an example:
When placed in a rack environment the following scenario applies to most Data Centres (when I say ** Most ** I mean DCs which have a raised floor, make use of CRAC, use Hot and Cold aisles and have a return hot air path):
So by opening both doors and adding in a Desktop fan they were actually taking a retrograde step.
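To put some rough numbers on why a desktop fan is no substitute for proper front-to-back air flow, here is a back-of-an-envelope sketch using the standard heat balance (air flow = heat load / (air density × specific heat × temperature rise)). The 5 kW rack load and the 12 °C rise from cold aisle to hot aisle are assumptions picked purely for illustration, not measurements from any rack in this story.

```python
AIR_DENSITY = 1.2           # kg/m^3, approximate for air at room temperature
AIR_SPECIFIC_HEAT = 1005.0  # J/(kg*K), approximate for dry air
M3S_TO_CFM = 2118.88        # cubic metres per second -> cubic feet per minute


def airflow_needed_m3s(heat_watts: float, delta_t_kelvin: float) -> float:
    """Volume of air per second needed to carry away heat_watts
    when the air warms by delta_t_kelvin on its way through the rack."""
    if delta_t_kelvin <= 0:
        raise ValueError("temperature rise must be positive")
    return heat_watts / (AIR_DENSITY * AIR_SPECIFIC_HEAT * delta_t_kelvin)


# Hypothetical rack: 5 kW of kit, air leaving 12 degrees warmer than it entered.
flow = airflow_needed_m3s(5000, 12)
print(f"{flow:.3f} m^3/s, roughly {flow * M3S_TO_CFM:.0f} CFM")
```

Under those assumptions the rack needs something in the region of a third of a cubic metre of air every second, delivered cold at the front and taken away hot at the back. Compare that with the rating on whatever desk fan is to hand, and remember that the fan is only stirring air that is already warm.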
I asked to speak to the representative of the company myself on the phone and explained that testing the temperature of the servers merely by using their hand was akin to Luke Skywalker using the “force”, and that in order to get an accurate measurement they would need to use an alternative means. It was at this point (and I am not joking – I promise) that the chap at the other end of the phone asked me if there was a “Thermometer” that he could use.
Slightly aghast, I asked if he was feeling unwell, to which (obviously missing the sarcasm) he replied that he wanted to take the temperature inside his racking. Again I had to explain to him that this was not the best way forward.
I told him to log on to one of his servers and open the HP Management Tools Homepage, from within which he would be able to tell if each of his servers was getting hot – it took three attempts to get across to him what this tool was. I also explained that by default HP servers are configured to reboot every 2 minutes when they are in thermal Critical status, and asked him – are your servers randomly rebooting? The answer was “No”.
It would seem at this point that I had gotten the point over to the representative, so we exchanged pleasantries and bid each other farewell for the day.
On Monday I got forwarded an e-mail by my Head of IT in which the “Account Manager” for the subject company had stated that “One of their servers had nearly blown up” and that there were “Major Heat issues” which amounted to a huge Health and Safety issue. Of course this e-mail had gone to everyone and anyone who is important. I was, to say the least, incandescent with rage (perhaps I was the only thing at risk of “blowing up”) – apart from the fact that the message was stirring and scaremongering at best, it had been written by someone who has the same understanding of Servers, Server Rooms and technology as the can of beer that I am drinking now.
It was so bad that the server engineer from the company who had been onsite (and been wrong) actually sent a personal apology to us because of its content.
Now you could all be forgiven for wondering why I have put this article up here – well, a few reasons really:
- I need to vent – I don’t do it often on the blog (if at all) but I wanted to
- I know that not all Outsourcers and Managed Service providers are bad, there are just enough to give the rest a bad name – so I wanted to share with you (should anyone be considering going into an Outsourcing arrangement) an example situation to be aware of. My experience of such arrangements is that they are only as good as the contract which forms them – therefore – weak contract = weak performance.
In the example above we see an O/S contractor who made life difficult for their customer on the customer’s own site, did not participate in the major move in the spirit that was required, and indeed sent staff down to perform the move who just didn’t have a grasp of the basics of Server Management and Functionality.
- Provide perhaps some useful information for people who might have wondered how they can get accurate temperature metrics from their HP Servers (you can also get them via SNMP)
- Provide some information about how basic air flow works with most servers and racks
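For the SNMP route mentioned above, something along these lines is one way to turn raw readings into an answer that is a bit more scientific than a hand on the back of the server. Treat it as a sketch under assumptions: the two OIDs are what I understand to be the temperature and threshold columns of the HP/Compaq health MIB (CPQHLTH-MIB) and should be checked against the MIB files shipped with your management agents; the input is assumed to be “numeric OID, space, value” lines, such as the Net-SNMP `snmpwalk` tool prints with its `-Oqn` output option; and the sample readings and the 5 degree margin are made up for illustration.

```python
from typing import Dict, List, Tuple

# Assumed OIDs (CPQHLTH-MIB temperature table); verify against your own MIB files.
TEMP_CELSIUS_OID = ".1.3.6.1.4.1.232.6.2.6.8.1.4"
TEMP_THRESHOLD_OID = ".1.3.6.1.4.1.232.6.2.6.8.1.5"

# Made-up sample in "numeric OID <space> value" form, for illustration only.
SAMPLE = """\
.1.3.6.1.4.1.232.6.2.6.8.1.4.0.1 21
.1.3.6.1.4.1.232.6.2.6.8.1.4.0.2 40
.1.3.6.1.4.1.232.6.2.6.8.1.4.0.3 66
.1.3.6.1.4.1.232.6.2.6.8.1.5.0.1 41
.1.3.6.1.4.1.232.6.2.6.8.1.5.0.2 82
.1.3.6.1.4.1.232.6.2.6.8.1.5.0.3 68
"""


def parse_walk(text: str, base_oid: str) -> Dict[str, int]:
    """Return {sensor index: value} for every line under base_oid."""
    prefix = base_oid + "."
    readings: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].startswith(prefix):
            continue  # not a reading from this column
        try:
            readings[parts[0][len(prefix):]] = int(parts[1])
        except ValueError:
            continue  # skip values that are not plain integers
    return readings


def hot_sensors(temps: Dict[str, int], thresholds: Dict[str, int],
                margin_c: int = 5) -> List[Tuple[str, int, int]]:
    """List (sensor, reading, threshold) for sensors within margin_c of their limit."""
    hot = []
    for index, temp in sorted(temps.items()):
        limit = thresholds.get(index)
        if limit is None or limit <= 0:
            continue  # no usable threshold reported for this sensor
        if temp >= limit - margin_c:
            hot.append((index, temp, limit))
    return hot


temps = parse_walk(SAMPLE, TEMP_CELSIUS_OID)
limits = parse_walk(SAMPLE, TEMP_THRESHOLD_OID)
print(hot_sensors(temps, limits))  # [('0.3', 66, 68)]
```

Comparing each sensor against its own reported threshold matters because the back of a server is supposed to be warm; what you care about is how close a given sensor is to the limit the hardware itself considers a problem.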
Now I would like to apologise to my readers who work for Outsourcing / Managed Services companies – I am sure that your firms are very good and indeed that my experience is a limited one – however I have felt compelled to write about my recent exploits with ours. This is just one example of many where their service has been well below par – but indeed, as is in their nature, they claim to be beacons of expertise and brilliance. Well they are not. Not in the slightest.
It’s a shame, because there are some really nice and good people within the firm whom I deal with day in and day out – but get to the higher levels and it all goes to heck in a handbasket.
If you are entering into any such arrangement with a supplier I recommend that the following is put in place:
- A watertight contract which protects both sides – it should contain a clear menu of services, clear demarcation, accurate and realistic SLAs and a simple to manage variance to contract procedure
- Don’t allow your provider to locate services which are not related to you on your site (as in our case, and indeed some others that I have heard of). If they manage a server room or have a presence in yours they are likely to place services in there that are critical to their business – this causes problems if you take back control or need to make sweeping changes – as it is not in their interests to act quickly
- Maintain Control – if you have to have their services on your site – charge them site fees, and build into the contract that they have to adhere to your requirements within agreed timescales – there is no option for them to say “No” under any circumstances
- Write into the contract that staff who attend site on behalf of the supplier are actually qualified to do their job – when one makes a stupid suggestion which is based in ignorance be firm – ask them to leave site!
- Do not allow them to dictate to you at all – you are the customer – not them