So my morning began with a Chinese fire drill. There was maintenance scheduled on the building UPS that provides backup power to our office in the event of a power failure. In order to take the UPS offline for service, we fire up our huge diesel generator, which is designed to power the office for days in case of a prolonged outage. Once it is running, the UPS can be taken offline and serviced. The switch happened early this morning as planned. However, in the middle of the work, disaster struck: the generator shut off.
It was like flipping off a light switch for the entire building, including EVERYTHING in our network operations center. If you know anything about IT, you know that abruptly removing power from a server is very dangerous and can result in data corruption. A guy from county maintenance who was in charge of the generator ran out there and flipped it back on, but it was too late. We started bringing servers back up, crossing our fingers that no major problems would pop up as a result.
Just as we were in the process of logging back in to them, the power glitched again. In a snap of angry reflex I yelled “Motherf’r!” as I flung the keyboard to the ground and marched outside. The county guy said the automation switch, which is used to test the generator and bring it on automatically during power outages, is what caused the first outage. The second outage, however, was caused by this same guy switching us over to manual bypass. He was unaware that we had already started powering servers back up.
So despite TWO power outages, the servers seemed to have survived with only a few minor issues. The UPS service was completed and we thought all was good with the world. Then the smell started. The distinct odor of electrical burn was coming from the UPS room. We called back the maintenance guys as well as the UPS tech, who had just left. The county guy fired up the generator without verifying that the UPS was still able to hold up the building during the transfer of power. It wasn’t. Because of a fried component, the UPS was no longer supplying power, so when the switch to the generator occurred, the building lost power for a third time, unbelievably. The first failure was due to a system problem; the second and third were from a lack of communication.
The third power loss appears to have corrupted at least one server, which I am currently trying to restore. The fried component in the UPS now has to be replaced, meaning we will be running on generator power for at least the next 24 hours. We have not lost power in the NOC like this for at least five years. To have it happen three times in an hour is ridiculous, and it will definitely change our procedures for UPS maintenance going forward.
I wish I could just rip up our contract with the UPS company like Nancy Pelosi did with the SOTU speech on national TV last night.