Tuesday, January 02, 2007

Why DR fails [1]

Have you ever found yourself wondering why, despite all the effort, preparation, concern and constant care, DR systems seem to fail just when we need them most?

For the last 18 years I've been trying, over and over, to define the perfect DR architecture for the organizations I was involved with. While I was fortunate enough to have the satisfaction of building several complex and elegant setups that actually worked, the solutions just didn't seem to scale to larger environments. New replication, clustering and automation tools came and went, yet the problem never appeared to get any easier.

Over the years, the question kept gnawing at me, with the answer always almost, but not quite, within my reach.

It turned out that there were others occupied with the same question. I met my partners in the quest - the founding team of Continuity Software - back in the fall of 2004, and together we were determined to find a satisfactory answer. It took some time, but the problem finally gave in. The "revelation" - as with many other new ideas - turned out to be quite simple: there is a process, or tool-kit, well known in other fields of IT, that is simply missing in DR. Without it, no matter how skilled or devoted we are, the chances of keeping a working solution for long are slim. Let's explore this idea.

The larger and more complex our IT environment gets, the faster it changes and the more prone it becomes to disorder. This phenomenon is hardly surprising: it reflects principles underlying any system, formulated in many fields of life and science (take, for example, thermodynamics and its fundamental notion of entropy).

Disorder will not disappear by itself - we have to introduce mechanisms to control it, or, more precisely, to negate its effect. This tends, BTW, to restore order in one place by creating an even greater degree of disorder elsewhere - again, following the "rules" of nature. In various IT areas, this niche is usually filled by monitoring and management tools. These let us identify evolving problems and react by introducing the changes required to put the system back into working order. As expected, this rarely brings our systems back to any past state, but rather to a newer, possibly more complex one. For example, think of an application that requires more storage space. You add the missing terabytes. After several days you realize that a related data-warehouse load process no longer finishes on time (hopefully a monitoring tool tells you that, rather than your users...). Obviously, you would not consider removing the added storage; instead you add more CPUs, more bandwidth, etc., until it works again.

On the other hand, DR environments are sensitive to change: anything we do on production should be faithfully reproduced on the DR site. How can we tell for sure that nothing was forgotten, or worse yet, done incorrectly? Since DR testing is intimidating, complex and extremely costly, it is rarely exercised frequently enough. The result is that implementation gaps accumulate, and their number grows constantly.
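To make the idea of "implementation gaps" concrete, here is a minimal sketch of what such a check might look like: take a snapshot of the production configuration, take a snapshot of the DR configuration, and compare them. The volume names and attributes below are made up for illustration; a real tool would collect this data automatically from hosts, storage arrays and replication software, and cover far more properties than size and replication state.

```python
# Minimal sketch: detecting "implementation gaps" between production and DR.
# All names and attributes below are hypothetical examples.

production = {
    "db01:/oracle/data": {"size_gb": 2048, "replicated": True},
    "db01:/oracle/redo": {"size_gb": 128,  "replicated": True},
    "app01:/var/app":    {"size_gb": 512,  "replicated": True},
    "app01:/var/app2":   {"size_gb": 256,  "replicated": False},  # added last week
}

dr_site = {
    "db01:/oracle/data": {"size_gb": 2048},
    "db01:/oracle/redo": {"size_gb": 64},   # never resized after production grew
    "app01:/var/app":    {"size_gb": 512},
}

def find_gaps(prod, dr):
    """Return human-readable gaps between the production and DR snapshots."""
    gaps = []
    for volume, conf in prod.items():
        if volume not in dr:
            gaps.append(f"{volume}: exists in production but not on the DR site")
        elif dr[volume]["size_gb"] < conf["size_gb"]:
            gaps.append(f"{volume}: DR copy is smaller "
                        f"({dr[volume]['size_gb']} GB vs {conf['size_gb']} GB)")
        if not conf.get("replicated", False):
            gaps.append(f"{volume}: not covered by replication at all")
    return gaps

for gap in find_gaps(production, dr_site):
    print("GAP:", gap)
```

Run continuously, a check of this kind surfaces gaps as they appear, instead of leaving them to be discovered at the next (rare) DR test - or, worse, at the real disaster.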
