To our Valued Clients,
Last week we experienced a severe technical problem that resulted in downtime for you, our clients. We did not live up to the high standards we have set for ourselves or that you have come to expect from Info Cubic. I know that many people count on us, from small businesses with a handful of employees to the largest corporations. We must be better, and we will be better. I deeply apologize for the disruption in service and the distress it may have caused you.
What happened and what we discovered:
The outage was caused by a failure in the SAN (Storage Area Network), which is the heart of all enterprise-level systems. This SAN issue is the same one that has been dogging us for the last three weeks. Until last week's incident, we had no idea how dire it was. The SAN controller had to be replaced during the first disruption, which allowed service to be restored. What we now know is that during the controller failure, underlying corruption took place in the SAN data itself, which snowballed into a full disruption. A large team of top Dell and Microsoft engineers could neither find nor remedy the issue, so we took it upon ourselves to rebuild from the ground up. On 5/28 we replaced the SAN with a new EqualLogic unit, refurbished the blade servers, rebuilt the cluster units and all connections to the SAN, and imported all virtual servers. After completion, we saw immediate improvements in system performance and stability. The system, as you know, has been up since the morning of 5/29, and we feel confident that we are back on a solid platform.
Why did we not give you an ETA?
At the beginning of this incident, we wanted to notify our clients as quickly as possible with an estimated recovery time so you could adjust your plans accordingly. Unfortunately, the Dell and Microsoft engineers couldn't narrow down the issue and give us an ETA we could use with confidence. We didn't want to overpromise on the ETA and cause even more headaches. We finally had an accurate ETA on the evening of the 28th, when we decided to rebuild every component, which brought the situation back under our control.
What are we doing to ensure we are never in this place again?
First, we continue to work to improve the data center and make sure our platform is as robust as possible. The data center already has many points of redundancy; however, we will be looking to add further levels and to increase our monitoring and logging of activity at more points in the system.
Second, our mirrored, redundant data center will be coming online at the end of June. The reality is that complex, high-performance data centers are going to have issues occasionally. The key is to have a reliable backup location that can quickly step in. Once the redundant data center comes online, if there are any issues with the system in the main data center, the backup will take over within a few minutes.
Finally, as always in life and business, good communication is vital to a happy and productive partnership. We promise to do a better job in the future of notifying you and keeping you up to date on any system issues, should they arise. Thankfully, with the backup data center coming online, these system downtimes should no longer be a concern.
If you have further questions, please feel free to contact me personally.