It Doesn't Rain in the Cloud
In 2012, Hurricane Sandy hit New York City with a force rarely seen in the Northeast. Brackish storm-surge waters invaded office towers, filling every basement and drenching every electrical connection, cable, motor, boiler, and heating unit in salty water, silt, and sediment from the East River and Hudson River basin. Places thought to be impervious were now under water. The city, whose underground infrastructure had always been a source of pride when compared to other aged metro infrastructures, was exposed as highly vulnerable. The postulated worst-case risk scenario was exceeded, and IT and data communications in these flood zones were rendered useless.
EmblemHealth had a state-of-the-art data center that resided in one of the flooded buildings. While we were able to increase our Citrix capacity to compensate for lost access to desktops, and to shift critical staff to other unaffected corporate locations throughout the state and in Florida, the destruction of the surrounding area's electrical distribution, communications, and building infrastructure guaranteed we would not be returning to our data center for months.
Although our disaster recovery plan accounted for critical functions, it was never designed for long-term emergency operations, where every IT capability would need to be restored, staffed, and made functional. Like many companies, we had designed our plan to meet the scenarios presented by previous events such as 9/11 and the NYC blackouts, which typically involved the failure of a mainframe and a handful of servers. In the time since, the state of IT had changed. Disk storage had increased exponentially. Servers had become smaller and had proliferated through the use of blades and virtual machines. Mainframe processing had been displaced by multi-tiered web applications; where there was one box before, there were now five. Databases had grown huge as the need for content and data grew, and unstructured data repositories were now enormous. Each required time to build and restore, and the switchover demanded an enormous amount of manpower.
Once we recovered from Sandy, it was obvious we needed to change our approach. We immediately went to work updating our infrastructure and replacing all our physical systems with 100% virtual ones, and we did it all in the space of a year. We selected EMC and VMware's Vblock to implement a converged infrastructure, purchasing four Vblocks in total. Two were Vblock 340 Series systems with EMC VNX8000 storage for a mixed-workload environment: fully virtualized, running the VMware vCloud Suite on Cisco UCS B-Series blade servers, with six racks per site (primary and disaster recovery).
Each VNX system contained 560 TB of tiered storage, fully replicated between the source and target data centers using EMC RecoverPoint and VMware SRM, and we utilized server-side flash with EMC XtremSW for database workload acceleration. We also purchased two Vblock Specialized Systems for Extreme Applications for our Citrix VDI solution. These, too, were fully virtualized systems running the VMware vCloud Suite on Cisco UCS B-Series blade servers; each used 25 TB of flash storage on an EMC XtremIO array, supported 2,000 end users, and occupied one rack per site. We selected Eastern Computer, our longtime EMC reseller, to help us implement all of this; they were instrumental in the scoping and setup of the systems.
Once the hardware was in place, we started with all our databases. The goal was to eliminate proprietary operating systems in favor of Linux variants on the virtual platforms. This was fairly seamless, although the OS-specific on-disk data structures made the actual migration a bit more complex. We had to perform logical exports and imports instead of the backups and restores we had relied on in the past. This is a slower method of database movement and recovery, but overall it did not add significantly to the amount of work. The virtualization gave us the opportunity to analyze database usage across applications and, where possible, combine instances, further shrinking the overall footprint. We were improving our manageability and reducing complexity.
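To see why a logical export/import survives a platform change when a block-level backup may not, consider this self-contained sketch. It uses SQLite purely for illustration (not one of the engines we actually ran): the dump is plain SQL text, independent of any OS-specific on-disk format, so it can be replayed on an entirely different platform.

```python
import sqlite3

# Source database: stands in for a database on the old, proprietary OS.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO members (id, name) VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])
src.commit()

# Logical export: dump schema and rows as portable SQL statements.
# Unlike a block-level backup, this text does not depend on the source
# system's on-disk structures.
dump_sql = "\n".join(src.iterdump())

# Logical import: replay the dump into a fresh database, standing in
# for the new virtualized Linux environment.
dst = sqlite3.connect(":memory:")
dst.executescript(dump_sql)

rows = dst.execute("SELECT id, name FROM members ORDER BY id").fetchall()
print(rows)  # → [(1, 'Alice'), (2, 'Bob')]
```

The trade-off is exactly the one described above: replaying statements row by row is slower than restoring raw blocks, but the result is portable.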
After the databases, we turned our attention to the application suites themselves. Here is where most of the pain occurred. We were faced with issues such as licensing changes: certain legacy applications had hardware-dongle licenses that could not be accommodated in virtual environments. Some of our EMC application software, including Captiva Formware OCR and Document Sciences, had hardware dongles, and other platforms had the same issue. We were able to move these applications to dongle-free environments and proceed, but the changes had to be coordinated with our vendor community.
Disk space got a much closer inspection, particularly in the area of content applications. We had a myriad of local, NAS, and SAN storage that needed to be optimized and portioned out in the new environment. We took a hard look at the new cost structure and what was really needed to operate optimally. As an application group, we had never willingly reduced our capacity, but in the end, when the analysis was performed, we were able to accommodate all our needs and never had to shortchange a platform.
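The kind of right-sizing analysis described above can be sketched as follows. Every platform name and figure here is hypothetical, and the 30% headroom factor is an illustrative assumption, not our actual policy; the point is simply that allocations based on measured usage plus headroom can reclaim capacity without shortchanging any platform.

```python
# Hypothetical per-platform storage inventory (TB): allocated vs. actually used.
inventory = {
    "content-app": {"allocated": 120, "used": 43},
    "claims-db":   {"allocated": 80,  "used": 61},
    "file-shares": {"allocated": 200, "used": 95},
}

HEADROOM = 1.30  # keep 30% free space so no platform is shortchanged

def right_size(inv):
    """Return per-platform target allocations and the total reclaimed TB."""
    targets = {name: round(d["used"] * HEADROOM, 1) for name, d in inv.items()}
    reclaimed = sum(d["allocated"] for d in inv.values()) - sum(targets.values())
    return targets, round(reclaimed, 1)

targets, reclaimed = right_size(inventory)
print(targets)    # per-platform target allocations in TB
print(reclaimed)  # TB freed versus the old allocations
```

On these made-up numbers, every platform keeps comfortable headroom while a large share of the previously allocated capacity is freed for the new environment.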
The most limiting factor was the P2V sync time, which we wanted to keep to a manageable overnight window of 8 PM to 6 AM. This was done to limit downtime and give our business partners the ability to plan their staff workloads for testing against definitive timeframes. Invariably, compromises were made, and the systems began their P2V migrations. In the process, LAN and server security were standardized, enhanced, and locked down. Our Information Security team enforced new standards for local administrator privileges, as well as account naming conventions that let us tell from an account's name what it was used for. Administrator accounts were reduced to the bare minimum needed for operation, and accounts, jobs, and structures all adopted standardized naming conventions. We utilized CA-7 and AutoSys for batch processing and used this opportunity to standardize batch naming conventions as well. Finally, after years of divergence driven by acquisition integration and growth, the server and application infrastructure took on the look of a unified design and implementation.
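A back-of-the-envelope check for whether a given server's P2V sync fits the overnight window might look like this; the disk sizes and sustained-throughput figures are hypothetical, not measurements from our environment.

```python
# Hypothetical planning check: does a server's P2V sync fit the
# 8 PM - 6 AM maintenance window?
WINDOW_HOURS = 10  # 8 PM to 6 AM

def sync_hours(disk_gb, throughput_mb_s):
    """Estimated hours to copy disk_gb at a sustained throughput in MB/s."""
    return (disk_gb * 1024) / throughput_mb_s / 3600

def fits_window(disk_gb, throughput_mb_s, window_hours=WINDOW_HOURS):
    """True if the estimated sync completes inside the window."""
    return sync_hours(disk_gb, throughput_mb_s) <= window_hours

# A 2 TB server at a sustained 80 MB/s fits; a 4 TB server does not,
# so it would need to be split across nights or given a faster path.
print(round(sync_hours(2048, 80), 1))  # ~7.3 hours
print(fits_window(2048, 80))           # True
print(fits_window(4096, 80))           # False
```

Running this sort of estimate per server is what let us give business partners the definitive test timeframes mentioned above.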
So after all this work, you might ask whether a benefit was derived. We have found in the short term that, beyond the aesthetics of standardization, these moves have improved performance and reliability. It goes without saying that our disaster recovery complexity has been greatly reduced, and our ability to meet recovery SLAs has been strengthened to the point where we no longer fear missing our 24-, 48-, and 72-hour SLAs. System Operations has seen tangible benefits in the reduction of unplanned outages across the enterprise and in higher availability. Most issues were handled upon switching to the VMs and were resolved by the time the business partners reported to work the next day. Our support needs from a staffing perspective have been reduced, as have all footprint-related costs such as power, data center space, and operational staff. Our ability to monitor the environment and to manage and isolate problems has improved.
In short, we are still looking at ways to decrease our costs and improve productivity, but we feel we have made tremendous progress on the infrastructure side of the equation.