18 Dec A Note From Netrepid’s CEO On Today’s Service Outage
By Sam Coyl, President/CEO
Good afternoon. It's been an eventful 24 hours at our data center. I want to take a minute to acknowledge the service issues we had today and to explain what happened and what we are doing to mitigate future occurrences.
I understand that service outages can be disruptive, and I do apologize for that. I also want to thank all of our clients for their patience while our infrastructure team worked through the night and workday, in conjunction with engineers at the world's leading hardware/software manufacturers, to pinpoint the unique issues we encountered and resolve them as efficiently as possible.
Starting last night and carrying over into today, we experienced an outage of approximately 12 hours on some of our services due to unusual power surges. Services were largely restored by 12:00 p.m. Eastern time. Since then, our systems seem to have completely settled, and performance will continue to improve throughout the night.
Power surges at a data center are fairly common, and our data center – which is now a Tier 3-quality facility – is prepared for nearly all issues that could lead to extended downtime. Tier 3 data center requirements include:
- Multiple independent distribution paths serving the IT equipment
- All IT equipment must be dual-powered and fully compatible with the topology of a site’s architecture
- Concurrently maintainable site infrastructure with expected availability of 99.982%
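To put the 99.982% availability figure in perspective, here is a minimal sketch that converts an availability percentage into expected downtime per year. The function name and the 8,760-hour year are assumptions made for illustration, not part of any formal Tier 3 definition.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def expected_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


tier3_downtime = expected_downtime_hours(99.982)
print(f"{tier3_downtime:.2f} hours per year")  # prints "1.58 hours per year"
```

In other words, 99.982% availability works out to roughly an hour and a half of expected downtime per year.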
What Caused Today’s Data Center Service Outage
So what happened? To put it simply: an issue that occurs only around 0.01% of the time in modern data centers of our quality. Our data center took an extremely high power surge that blew a fuse in one of our main UPSes. The spike also caused errors on one of our storage nodes. The redundant systems didn't kick in because they still saw the servers as up. However, the servers were in a locked state and would not function. This put our environment in a state where a few critical servers were effectively both up and down at the same time: reachable, but unable to do any work. We have worked with the manufacturers to unlock the storage and restore service.
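For readers curious how a server can look "up" to a redundant system while being unable to work, here is a simplified sketch. It is not our actual failover logic. The `Node` class, its two fields, and both check functions are hypothetical names chosen only to illustrate the difference between checking that a server responds and checking that it can do its job.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical storage node; both fields are illustrative."""
    responds_to_ping: bool  # liveness: the node answers on the network
    can_serve_io: bool      # function: the node can actually read and write


def liveness_only_failover(node: Node) -> bool:
    """Fail over only when the node stops responding at all."""
    return not node.responds_to_ping


def functional_failover(node: Node) -> bool:
    """Fail over when the node is unreachable or cannot serve I/O."""
    return not (node.responds_to_ping and node.can_serve_io)


# A node that is reachable but locked, similar in spirit to what we saw.
locked_node = Node(responds_to_ping=True, can_serve_io=False)
print(liveness_only_failover(locked_node))  # False: redundancy stays idle
print(functional_failover(locked_node))     # True: redundancy takes over
```

A check that only asks "does it respond?" treats a locked node as healthy, which is why redundancy can stay idle in a situation like this one.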
Most importantly, no data was lost during this time.
How We Plan To Mitigate This Moving Forward
After consulting with engineers at many of the leading hardware/software manufacturers, and after being onsite with our team throughout the crisis, I am confident that the issue we experienced today is exceedingly rare and that, moving forward, it will be preventable. I'm also happy to inform our current customers and all future Netrepid customers that system upgrades we had already planned to install as early as next month will ensure that an issue like the one we experienced today will not happen again.
Once again, I appreciate the patience of all of our customers while we focused our attention on resolving today’s issues. Trust that we have taken the necessary steps to prevent an issue like this from happening again.