Tuesday, March 13, 2007

Come meet us at the 2007 Business Continuity & Corporate Security event

We are participating in the upcoming annual 2007 Business Continuity & Corporate Security event in New York on Tuesday and Wednesday, March 20-21. Come meet us at booth 110!

The 2007 Business Continuity & Corporate Security conference assembles over 800 participants from major New York and Wall Street firms, focusing on business continuity and corporate infrastructure security. The sixth annual event is the largest Business Continuity Planning event in the Northeast.

The event takes place on March 20-21 at the Metropolitan Pavilion, 125 West 18th Street, New York, NY. If you are interested in visiting the show and would like a free VIP show pass, drop us a note at info@ContinuitySoftware.com.

Thursday, February 22, 2007

Continuity Software Launches Disaster Recovery Route-to-Value Program

The program provides an efficient way to evaluate RecoverGuard’s Disaster Recovery vulnerability detection capabilities and to explore an innovative way to manage Disaster Recovery systems.

Continuity Software Inc., the leading provider of Disaster Recovery Management software, announced the launch of its RecoverGuard Route-to-Value program to further accelerate enterprise adoption of its RecoverGuard technology. The program enables enterprise customers to evaluate the RecoverGuard technology for Disaster Recovery Management, which detects risks and infrastructure vulnerabilities in Disaster Recovery systems and ensures these systems will work when put to use.

The Route-to-Value program is based on a proven methodology developed by Continuity Software and designed to effectively demonstrate RecoverGuard’s risk detection capabilities. RecoverGuard is an innovative, patent pending software solution designed to constantly scan the IT infrastructure without using any agents, and provide immediate notification of newly detected infrastructure vulnerabilities. RecoverGuard enables enterprise customers to ensure that their Disaster Recovery system is constantly aligned with their business protection targets, and that no protection gaps exist in the infrastructure.

The program includes an on-site evaluation and takes less than one week to complete. During the evaluation, Continuity’s professionals deploy the RecoverGuard technology on a subset of the infrastructure and assess the protection level of critical business applications. RecoverGuard validates the protection status, automatically detects any existing infrastructure vulnerabilities, and provides recommendations to the customer’s system engineers on ways to remediate them.

The program is available to qualified enterprise customers who use EMC or NetApp storage infrastructure and maintain a mirrored Disaster Recovery site. Based on Continuity’s statistics from previously completed programs, the RecoverGuard technology detected an average of 40 infrastructure vulnerabilities that required immediate attention. RecoverGuard is immediately available to enterprise customers looking to improve the way they manage Disaster Recovery systems.

Learn more about the Route-To-Value program

Monday, January 08, 2007

Data Protection Management (DPM) terminology

There seems to be some confusion in the terminology used by vendors in the Data Protection market (including Continuity Software, for that matter...). We hear many people talk about Data Protection Management ("DPM"), backup reporting, and disaster recovery when trying to differentiate their offerings, which tends to create confusion when trying to understand the value each vendor brings and the problem it solves.

Here is OUR short definition of the Data Protection segments:
  • Backup - taking a "point-in-time" copy of the data/application. Approaches include backup-to-tape as well as online snapshots of the data (such as BCV or Clone in the EMC world). Verification tools in this space help you ensure your backup is complete and optimize the backup process.
  • High Availability - increasing business service availability. Technologies include virtualization, clusters, RAID, load balancing (to some degree) and other hardware tweaks. Verification tools here are your typical system management technologies that help gauge the overall health of your system.
  • Disaster Recovery - resuming business operations from a remote data center. Approaches are based on either a shared facility or a dedicated "mirrored" facility. Technologies include tape-based recovery (ship your tapes to the remote site) and online replication (storage-, database- or server-based), which copies all data to the mirrored site so you can resume your business service from that site without interruption. Continuity Software is focusing on management and monitoring tools for this segment.
I would love to hear your opinions and your definition of DPM.

Thursday, January 04, 2007

Why DR fails [2]

What became clear is that correct maintenance of production systems can actually lead to an INCREASE in the number of DR gaps, since keeping the system working means more change, and more change means more gaps, and so on. Current Enterprise Management Consoles, SRM systems, etc. just don’t help, because the problem they were intended to solve actually conflicts with DR assurance.

The next step we took after understanding the issue at hand was to define requirements for a DR and data protection gap detection tool. We began by analyzing what could go wrong, figuring that if we could find “signatures” of gaps, we could build an engine that constantly searches for them. Think of your DR checklist – the things you need to check before starting a DR drill – and imagine how many tests you have there. Dozens? That sounds about right, and in fact, when we started working with customers it turned out that each could contribute its share. It also turned out that the overlap was significantly smaller than we had anticipated, so the accumulated list grew alarmingly fast to contain thousands of possible gaps. As the magnitude of the problem became clearer, so did the conviction that no team of men or women could check everything manually. Some of the tests could take days to complete for just one server (!) – even when furnished with top-of-the-line monitoring, automation and SRM tools. Imagine how much time it would take to search for hundreds of gaps in a deployment of hundreds or even thousands of DR-protected servers. The analogy that comes to mind is looking for viruses manually with a checklist containing printouts of thousands of virus signatures.
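The signature idea can be sketched in a few lines. This is a hypothetical illustration, not RecoverGuard's actual engine: each gap on the checklist becomes a predicate over a configuration snapshot, and the scanner simply runs every predicate against every host – exactly the part that is hopeless to do by hand at scale.

```python
# Hypothetical sketch of "gap signatures": each signature is a predicate
# over a host's configuration snapshot, and the engine scans all hosts
# against all signatures, much like an anti-virus scanner.

def missing_dr_replica(host):
    """Gap: a production volume with no replica at the DR site."""
    return [v for v in host["volumes"] if not v.get("replica")]

def stale_replica(host, max_lag_seconds=300):
    """Gap: a replica lagging production beyond the allowed window."""
    return [v for v in host["volumes"]
            if v.get("replica") and v["replica"]["lag_seconds"] > max_lag_seconds]

SIGNATURES = [missing_dr_replica, stale_replica]

def scan(hosts):
    """Run every signature against every host; return detected gaps."""
    findings = []
    for host in hosts:
        for signature in SIGNATURES:
            for volume in signature(host):
                findings.append((host["name"], signature.__name__, volume["id"]))
    return findings

hosts = [
    {"name": "db01", "volumes": [
        {"id": "vol-a", "replica": {"lag_seconds": 30}},
        {"id": "vol-b"},                                   # never replicated
        {"id": "vol-c", "replica": {"lag_seconds": 900}},  # lagging badly
    ]},
]

print(scan(hosts))
```

A real engine would of course collect the snapshots automatically and carry thousands of signatures, but the shape of the problem is the same: a fixed scan loop over an ever-growing signature library.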

The path to a solution was challenging and rewarding, and resulted in our first three patents. I hope to dedicate my next post to some more of our insights regarding DR gaps.

Tuesday, January 02, 2007

Why DR fails [1]

Have you ever found yourself wondering why, despite all the effort, preparation, concern and constant care, DR systems seem to elude us by not working just when we need them?

In the last 18 years I've been trying, over and over, to define the perfect DR architecture for the organizations I was involved with. While I was fortunate enough to have the satisfaction of building several complex and neat setups that actually worked, the solutions just didn't seem to scale to larger environments. New replication, clustering and automation tools came and went, yet the problem never appeared to get easier.

Over the years, the question kept gnawing at me, with the answer always almost, but not quite, within my reach.

It turned out that there were others occupied with the same question. I met my partners in the quest – the founding team of Continuity Software – back in the fall of 2004, and together we were determined to find a satisfactory answer. It took some time, but the problem finally gave in. The "revelation" – as with many other new ideas – turned out to be quite simple. There is a missing process, or tool-kit, well known in other fields of IT, but missing in DR. Without this tool-kit, no matter how skilled or devoted we are, chances are slim that we will have a working solution for long. Let's explore this idea.

The larger and more complex our IT environment gets, the faster it will change and become prone to disorder. This phenomenon is quite obvious. It demonstrates principles underlying any system, formulated in many fields of life and science (take, for example, thermodynamics and its fundamental notion of entropy).

Disorder will not disappear by itself - we have to introduce mechanisms to control it, or, more precisely, to negate its effect. This tends, BTW, to restore order in one place by creating an even greater degree of disorder elsewhere - again, following the "rules" of nature. In various IT areas, this niche is usually filled by monitoring and management tools. These let us identify evolving problems and react by introducing the changes required to put the system back into working order. As expected, this will rarely bring our systems back to any past state, but rather to a newer, possibly more complex one. For example, think of an application that requires more storage space. You add the missing terabytes. After several days you realize that a related data-warehouse load process no longer finishes on time (hopefully, a monitoring tool tells you that rather than your users...). Obviously, you would not consider removing the added storage space; rather, you add more CPUs, more bandwidth, etc., until it works again.

On the other hand, DR environments are sensitive to change; anything we do on production should be faithfully reproduced on the DR site. How can we tell for sure that nothing got forgotten, or worse yet, was done incorrectly? Since DR testing is intimidating, complex and extremely costly, it is rarely exercised frequently enough. The result is that implementation gaps exist, and grow in number constantly.

RecoverGuard is born

The calendar shows early 2005... Once we figured out what we wanted to do, the next step was to design a product that could achieve it. We envisioned software that would work like a system management solution (HP's OpenView, IBM's Tivoli, and the like) with anti-virus capabilities: one that could constantly scan an entire IT/DR infrastructure and "monitor" it to ensure that no vulnerabilities are evolving (or already exist) in the infrastructure, and verify that the data protection status is valid and meets the business goals.

Our primary focus was on critical business services – ones with an aggressive RPO and RTO that dictate a dedicated mirrored data center. These DR systems are usually based on storage replication (most often synchronous replication) and cannot afford configuration issues or limited DR availability. Their data protection and disaster recovery must work at any point in time, regardless of IT configuration changes or any infrastructure failures.
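To make the RPO part of this concrete, here is a toy illustration (my own simplification, not Continuity Software's actual logic): given the timestamp of the last write confirmed at the DR site, the worst-case data loss from a disaster right now is simply "now minus that timestamp", and it must stay within the RPO. Silent replication breakage is exactly what makes this check fail without anyone noticing.

```python
# Toy RPO check: worst-case data loss is the age of the last write
# confirmed at the DR site; it must stay within the RPO window.

from datetime import datetime, timedelta

def rpo_violated(last_replicated_write, now, rpo):
    """True if a disaster right now would lose more data than the RPO allows."""
    return (now - last_replicated_write) > rpo

now = datetime(2007, 1, 2, 12, 0, 0)
rpo = timedelta(minutes=5)  # an aggressive RPO, typical of synchronous setups

# Replication healthy: last write confirmed at the DR site seconds ago.
print(rpo_violated(now - timedelta(seconds=10), now, rpo))  # False

# Replication silently broken for an hour: a disaster now breaches the RPO.
print(rpo_violated(now - timedelta(hours=1), now, rpo))     # True
```

The point of continuous monitoring is to run a check like this (and many richer ones) all the time, instead of discovering the breach during an actual disaster.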

So we designed a system that could periodically retrieve information from the storage assets, databases, servers and replications, together with the configuration and status of the various infrastructure assets. We designed it so that its deployment would be simple (agents? No agents – we all felt the data center had long since reached its quota of agents), and so that it could efficiently work through large amounts of data and scan large infrastructures spanning multiple data centers or geographic locations.

The next design challenge was to create the ability to "understand" the relationships between various IT assets – servers to storage, databases to storage, replications to all of these, and the many other relationships that exist in a "typical" data center. In addition, we wanted the ability to quickly look for "patterns" of vulnerabilities in the information we collected and, of course, to alert the system operator to their existence. To our engineering minds it all sounded very much like an anti-virus: signatures, a scan engine, detecting viruses (vulnerabilities)... Add the fact that some of our engineers actually designed anti-virus products in the past, and you get a good understanding of what needed to be done.
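A miniature of the relationship idea might look like this (a hypothetical sketch with made-up asset names, not the product's data model): assets are nodes, dependencies (server to volume, volume to replica) are edges, and a vulnerability "pattern" is a query over that graph – here, a server that depends on a volume with no replica at the DR site.

```python
# Hypothetical asset-relationship graph: (kind, name) nodes with
# dependency edges. A vulnerability "pattern" is a query over the graph.

edges = {
    ("server", "web01"): [("volume", "vol-1")],
    ("server", "db01"):  [("volume", "vol-2"), ("volume", "vol-3")],
    ("volume", "vol-1"): [("replica", "vol-1-dr")],
    ("volume", "vol-2"): [("replica", "vol-2-dr")],
    ("volume", "vol-3"): [],  # configuration drift: replica never set up
}

def unreplicated_dependencies(edges):
    """Pattern: a server whose underlying volume has no replica edge."""
    findings = []
    for (kind, name), deps in edges.items():
        if kind != "server":
            continue
        for dep in deps:
            if dep[0] == "volume" and not edges.get(dep, []):
                findings.append((name, dep[1]))
    return findings

print(unreplicated_dependencies(edges))  # [('db01', 'vol-3')]
```

The interesting engineering is in building this graph automatically from storage, database and server configuration, and in expressing thousands of patterns over it – but once the graph exists, each pattern is a simple traversal like the one above.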

Design was complete. Let the coding begin...

How it all started

It seems as if it was only yesterday that we launched Continuity Software, with the vision of creating better technologies to manage DR systems and ensure data protection. We founded Continuity after seeking an interesting data center opportunity to zoom in on, and were attracted to the DR space (yes, we all have some sort of DR roots in our resumes... :) ) – our engineering minds started thinking of products and technologies to bring.

I think what really caught our attention was the fact that DR systems were still deployed, managed and tested the same way they were a couple of decades ago... We were surprised that there had been very little innovation in that area, and that the data center improvements of the last decades had largely bypassed DR as a whole. System management, monitoring, verification and testing systems, and scanners (security, database, network) were all bringing significant improvements to the way we run our business back-end and manage our data protection, but very little of it was focused on the DR side.

So we did the trivial thing. We went on a mission to talk with as many smart individuals as possible, especially those responsible for managing DR systems day-to-day (BTW, thanks guys for the feedback – much appreciated!). We learned what keeps them up at night, their business and technology needs, and the type of innovation they would want to see. We literally spent hundreds of hours listening and interviewing anybody who wanted to speak to us... :) The end result was a striking similarity in the type of solution they wanted to see developed and introduced.

More on step 2...

Thursday, August 10, 2006

Welcome!

It is our pleasure to welcome you to our blog. In this blog we're planning on sharing our experience, lessons learned, best practices and thoughts on the Disaster Recovery ("DR") market.

We hope you enjoy it, and please feel free to post any comments / feedback / suggestions you might have.