Disaster recovery is one of the major topics that keeps coming up in my interviews. I have been implementing DR (short for disaster recovery) for most of my career, so below are five tips, with concrete examples, for implementing disaster recovery solutions that work well.
1. Properly set DR Expectations with clients – I can’t tell you how many times I’ve been in situations where clients expected one level of service for disaster recovery and IT provided a different, and usually less functional, recovery plan or implementation. Proper client communication is necessary. As an IT manager, it is critical to understand how the application behaves, how the client expects the application to behave, and the level of service the application requires. Trust needs to be established between IT and the application/development folks. Without it, the disaster recovery plan will be weak.
Example: A new client wanted to test disaster recovery for their SQL Server. I asked whether there was a DR plan with implementation steps. They said no; they usually just turned off the production server and turned on the disaster recovery server, and expected to be running again in five minutes. The reality was that the DBA kept a read-only copy of the database on the production server and did not turn off the production server at all. The DR test was marked “success”, but the client’s expectation and the actual implementation were two different things.
2. Simplify Everything – When something hits the fan, the last thing IT staff will do well is follow complicated disaster recovery instructions. The failover and client communication should be as automated as possible. The client should have all major failover steps, with approximate times for completing each step, because clients will want to gauge for themselves, in real time, how well the plan is being executed. This also helps identify difficult steps so they can be improved or adjusted, with the goal of simplifying the implementation.
Example: One client gave me a series of twelve steps that needed to be completed in an hour to fail over an application properly. The SLA was two hours. Previous attempts using this plan had taken four to six hours. One of the steps was uninstalling and reinstalling web server software on the disaster recovery system. Basically, the client didn’t update the disaster recovery system on a regular basis. We eliminated six steps by adding the DR systems to their existing software deployment process. With that change, the application’s disaster recovery plan could be completed in thirty minutes.
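One way to keep the failover steps and their approximate times visible to everyone is to hold them as data and check them against the SLA. The snippet below is only a minimal sketch of that idea: the `Step` class, the helper functions, and every step name and duration are hypothetical and are not taken from the client’s actual plan.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    estimated_minutes: int


def total_estimate(steps):
    """Sum the estimated minutes across all failover steps."""
    return sum(step.estimated_minutes for step in steps)


def fits_sla(steps, sla_minutes):
    """Return True when the estimated total is within the SLA."""
    return total_estimate(steps) <= sla_minutes


def overruns(steps, actual_minutes):
    """Return (step name, minutes over estimate) for steps that ran long.

    actual_minutes maps a step name to the minutes it actually took.
    Steps with no recorded time are skipped.
    """
    late = []
    for step in steps:
        actual = actual_minutes.get(step.name)
        if actual is not None and actual > step.estimated_minutes:
            late.append((step.name, actual - step.estimated_minutes))
    return late


# Hypothetical runbook; step names and times are illustrative only.
runbook = [
    Step("Notify stakeholders", 5),
    Step("Stop production writes", 5),
    Step("Promote DR database", 10),
    Step("Repoint application", 5),
    Step("Smoke-test application", 5),
]

print(total_estimate(runbook))                          # 30
print(fits_sla(runbook, sla_minutes=120))               # True
print(overruns(runbook, {"Promote DR database": 20}))   # [('Promote DR database', 10)]
```

Recording actual times during each test makes it easy to see which steps repeatedly run long, and those are the natural candidates to simplify or automate.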
3. Test the DR Plan as well as the implementation regularly – If the plan requires people to get on a conference call, test that too, along with the IT implementation. Get people on the conference call. There have been dozens of times where conference call numbers had changed or different people needed to be notified. Make sure that every tangible asset in the plan is actually used and tested. Make note of the items that were not used so they can be removed later, and of any new ones that need to be added. People, numbers, and plan steps will change as applications add more systems, features, and staff.
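The bookkeeping after a test can be as simple as comparing what the plan lists against what the test actually exercised. The following is a rough sketch under that assumption; the function name and the sample items are hypothetical.

```python
def review_test_run(planned, used):
    """Compare what the plan lists against what the test actually exercised.

    Returns (unused, unlisted): plan items the test never touched, and
    items the test needed that the plan does not list.
    """
    planned, used = set(planned), set(used)
    return sorted(planned - used), sorted(used - planned)


# Hypothetical plan inventory and test log; all names are illustrative.
planned_items = ["conference bridge", "DBA on-call", "DR SQL Server", "old pager list"]
used_items = ["conference bridge", "DBA on-call", "DR SQL Server", "network on-call"]

unused, unlisted = review_test_run(planned_items, used_items)
print(unused)    # ['old pager list']
print(unlisted)  # ['network on-call']
```

Items in the first list are candidates for removal from the plan; items in the second need to be added before the next test.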
4. Communicate/Distribute the plan to key parties – Everyone needs to be on the same page with disaster recovery plans. Executives need to know what conference number to call and what room they need to be in. IT folks need to know the steps required and the time needed to perform them. Stakeholders need to communicate with their clients as well as monitor IT progress. Most folks will attend a DR planning meeting, say “yes” throughout the meeting, and stick the plan in a drawer. Please get it out once every six months, or more frequently if needed, and go over it so folks still understand the plan. No one really cares about the plan until it’s time to implement it. Then everyone will be calling you. A time will certainly come when an emergency arises and the plan has to be implemented. It takes fortitude on your part to make sure that the tests, the plans, and the understanding of them are communicated effectively.
5. Don’t worry about it – If the communication has been handled effectively and everyone knows what they need to do, don’t sweat all the little things that will go wrong. Just note them so you can adjust the plan later. The DR plan is a living, breathing document, not something you write once and distribute. Most stakeholders aren’t going to care if step two took ten minutes longer or if another thing needed to be updated. There will usually be some stuff that was missed. Make note of it and follow up on it. The most important thing is whether the plan was implemented successfully: the applications failed over correctly and are working.