Recently I came across Russell Dyas’ excellent blog entry about Disaster Recovery. It’s a useful reminder about having a disaster recovery plan and testing it. After all, a plan that falls apart isn’t worth anything. In my experience (12 years or so) Disaster Recovery – DR – is for every company, whatever its size. Every company needs a DR plan in place: one that you have tested and communicated so that, should an incident ever happen, your plan will prove its value to the company.
To many people Disaster Recovery sounds intimidating and expensive, especially that first word.
It’s frightening and can (and does) cause panic and confusion. That’s probably why it’s easier to use “DR” rather than “Disaster Recovery”. I never use the D-word when on DR work. I prefer to use the word “incident”, just like above. It doesn’t give rise to the same kind of images as the D-word.
There’s a simpler way to think of DR scenarios: what if your organisation cannot do business? A DR plan aims to help you get operational as quickly as you can and then get fully back with as little loss of data as possible. Forget about the images of typhoons, burning buildings, fire engines and all that. Clear and calm thinking are your friends.
Large companies can and often do have their DR plans developed by an external DR supplier. That can be very convenient but can also be very expensive. Some companies can’t be fully covered by a DR supplier while others may not have the resources to do that. SME organisations and small not-for-profit organisations come to mind.
That doesn’t mean you can’t put together a good DR plan, be confident about it and about following it if an incident occurs.
For those organisations that have an out of date DR plan or don’t have one at all, the idea of looking at DR may be scary. This set of posts should help to calm any nerves, clear up any confusion and suggest a way forward. I’ll take a general look at some DR basics based on my experiences and observations.
At the end of this set of posts readers will have:
- A basic definition of DR
- Foundations needed before DR planning can start
- Steps in drawing up a DR plan
- Testing your DR plan
- Agreeing activation procedure and signing off the plan
- Keeping the plan up to date
Please remember that this is a general outline of DR. The suggestions made in these posts are based on my experience and are, I believe, a good way to start a DR planning process. They should be altered to fit your own business needs; what is ideal for a company of 80 staff isn’t right for a company of 4 staff. But the basic principles behind the suggestions remain the same.
In this first part I’ll look at some foundations that need to be in place before any DR plan can start to be drawn up.
What is Disaster Recovery?
Disaster Recovery is:
- Planning for the things that can go wrong and stop your company from doing business, and putting those plans into action if an incident happens. Start to look at the impact from this perspective and things may become clearer and easier.
- About getting a basic IT and comms provision back quickly so you can continue doing business and then getting everything else back with as little loss as possible.
There can be any number of reasons behind things that happen to stop your company doing business:
- Hardware failure
- Extreme weather (Birmingham had a tornado in 2005), fire or flood
- Unwary workman putting a pickaxe through vital cables
- Overzealous contractor turning off something they shouldn’t
- Forgetful staff member leaving taps running in office above
And many more, most of which are completely outside your control.
The simple bottom line: If an incident happens, what do we need to be able to get back to doing a basic level of business?
Your organisation needs to look at what the situation is and what can be done to get the company back doing business, even if at a slightly lower level. A good DR plan will cover both. Each business has its own requirements and ways of doing things but the basics of DR remain the same.
Now we have a brief definition of what DR is. For any plan to work it needs good foundations or else your DR plans won’t stand a chance. So get these foundations right first. I’d look to see these or something like these in place in any IT department.
Back Up Your Data!
Whether your company uses a server farm, one server, a few PCs networked together or just one PC, your data needs to be backed up. It’s a pain but computers and computer kit do fail. Your data is the most important thing you have; losing it could be fatal to your company. So keep your data safe: back it up regularly and test your backups and backup procedures regularly. I can’t emphasise this enough.
Essential data, documents, contact details and more can be held in any number of places in a company. A sizeable chunk of vital information is often stored in e-mail systems. So you need to be able to back up every system which holds data: server systems, e-mail systems, intranet systems, websites. If there’s information of value on it, back it up.
Even if you have a network that is backed up every night, some people will continue to store essential documents and information on the hard disks in their PCs. Staff need to be trained that it is best practice to save documents to the network. If it isn’t on the network then it won’t get backed up.
You need to know that all your backups work and that you can restore from them. One way of monitoring this is to get the backup software to produce an end of job report. Most backup software I’ve seen and used can do this. Reports like this can flag up where problems have been found backing up files. Ideally you want reports confirming complete, error-free backups. This is a good thing to show to management. It shows them you’re doing your job and they can be confident in what you are doing.
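Checking end-of-job reports can be automated so that a failed backup is never missed. What follows is only a minimal sketch: it assumes a hypothetical plain-text report in which every line starts with a status word (OK, WARNING or ERROR) followed by a file path. Real backup software has its own report format, so the parsing would need adapting, and the names `ReportSummary` and `summarise_report` are made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReportSummary:
    """Result of scanning one end-of-job backup report."""
    files_ok: int = 0
    problems: list = field(default_factory=list)

    @property
    def clean(self) -> bool:
        # Only "clean" if at least one file was backed up and nothing was flagged.
        return self.files_ok > 0 and not self.problems


def summarise_report(report_text: str) -> ReportSummary:
    """Scan a report in the assumed format: '<STATUS> <path>' on each line."""
    summary = ReportSummary()
    for line in report_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        status, _, _detail = line.partition(" ")
        if status == "OK":
            summary.files_ok += 1
        else:
            # Anything that is not OK (WARNING, ERROR, unknown) gets flagged.
            summary.problems.append(line)
    return summary
```

An empty report is deliberately treated as not clean: a backup job that backed up nothing is as worrying as one that reported errors.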
Regular test restores are a great way to build management’s confidence in you. By performing them you also get to know the restore software; if an incident happens you can be confident instead of saying “I think this is how it works”. When incidents happen, it is always better to show clear and confident thinking.
You can also show management that these restores are working so they can have confidence in the backup regime. In one role I wrote and then worked from a set of routine maintenance procedures which included test restores. I can say from experience that this is a good thing to do.
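A test restore is only convincing if you check that what came back matches what went in. As a rough sketch, and assuming you restore a sample folder to a scratch location next to an unchanged copy of the original, the two trees can be compared by checksum. The function names here are hypothetical; only Python’s standard `hashlib` and `pathlib` modules are used.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_restore(original_dir: Path, restored_dir: Path) -> dict:
    """Compare every file under original_dir with its restored copy."""
    result = {"missing": [], "different": [], "matched": 0}
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue  # directories are covered by the files inside them
        rel = src.relative_to(original_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            result["missing"].append(str(rel))
        elif file_digest(src) != file_digest(restored):
            result["different"].append(str(rel))
        else:
            result["matched"] += 1
    return result
```

Files that were edited after the backup ran will show up as different, so this works best on a sample that has not changed since the backup was taken.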
Storing the backup tapes or drives is an essential part of any DR plan. I recommend storing the backup tapes, copies of software media and copies of network and config documents in a strong fireproof cabinet or safe, and storing the original CDs, licences and documents in a secure off-site location. One tape per week should also be stored in this off-site location. In the role I mentioned above I ensured that weekly and monthly backup tapes were retained off-site while daily tapes were stored in a fireproof safe.
The reasoning behind this is simple: if the office burns down, the majority of your data, your master CDs, licence codes and network documentation are safe and still accessible to named persons. Replacement media and licence codes can take time to obtain. Prevent that situation from happening. Some organisations use their solicitor to provide secure off-site storage, others a financial institution and others a dedicated secure storage unit. Don’t use one of the IT crew’s bedrooms. Really, don’t. You need somewhere that is secure and accessible only to named people under particular circumstances. A techie’s bedroom doesn’t fall into that category!
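The tape storage routine described above (daily tapes in the fireproof safe, weekly and monthly tapes off-site) can be written down as a simple rule so nobody has to remember it. This is only a sketch under an assumed scheme that is not taken from any standard: Friday’s tape counts as the weekly one, and the last Friday of the month counts as the monthly one. Adjust the days to match your own regime.

```python
import datetime as dt


def tape_kind(day: dt.date) -> str:
    """Classify a backup tape as 'daily', 'weekly' or 'monthly'.

    Assumed scheme: Friday's tape is the weekly one, and the last
    Friday of the month is the monthly one.
    """
    if day.weekday() != 4:  # Monday is 0, so Friday is 4
        return "daily"
    next_friday = day + dt.timedelta(days=7)
    # If the following Friday falls in another month, this is the last Friday.
    return "monthly" if next_friday.month != day.month else "weekly"


def storage_location(day: dt.date) -> str:
    """Weekly and monthly tapes go off-site; daily tapes stay in the safe."""
    if tape_kind(day) == "daily":
        return "fireproof safe (on-site)"
    return "off-site"
```

For example, `storage_location(dt.date(2024, 1, 26))` returns `"off-site"`, because 26 January 2024 was the last Friday of that month.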
Getting your network config information down on paper is essential, yet it is often the last thing IT folk consider. Many IT pros see documentation as a pain in the rear simply because it is so time consuming. Putting those first documents together is probably the most difficult bit because you may well know the IT provision in such detail that it is hard to judge what needs writing down.
If the information exists purely in someone’s head rather than on paper ask yourself one simple question: “What if that person got hit by a bus today? Would vital information about the network config be lost?” Network diagrams, server build and config, switch config, backup regime config, database config, proprietary hardware config, backup job configs, DHCP scope and more are essential details which you don’t want to be looking for when the incident happens.
If you can’t answer “No information at all would be lost if one or more of my IT team got hit by a bus” then you need to get your documentation sorted. Quickly. It may help to get it reviewed by someone who is external to your company. Writing documentation for someone with no knowledge of your company and its network is very different from writing it for a colleague who already knows the set up. It isn’t as easy as some might think. Once the first documentation is completed, it only needs updating when there are changes to any of the configs. That isn’t a major task if it becomes part of the routine when a change is made.
The reasoning behind this is simple: if someone comes in to help or to undertake recovery from an incident in your absence, will they be able to get it right? There are no brownie points to be had from keeping essential information to yourself, especially if you are unavailable. There are brownie points to be had for excellent documentation: if I or someone like me had to come in while you were away and get things back, we would say “Your IT guys’ documentation is superb, this will help get you back working”.
If you’re in a company with a smaller IT provision – a couple of computers and an internet connection – you do need to get information on paper. Things like network diagrams, DHCP scope, et cetera might mean nothing to you but your internet connection settings, workgroup name, printer names, e-mail account details and more are all worth keeping a note of.
The object here is to give the people working on getting you operational enough information to get things back as they were, rather than a completely new set up of directories and names which may take ages for you to learn. I would always rather have a bit more information than I need than none at all.
All servers and server room equipment should be protected by UPS devices. If there’s a power cut or a sudden temperature change (some UPS devices have temperature sensor options to detect fire) the UPS can shut the servers down cleanly without corrupting anything running on them. These do need to be configured and tested; they don’t just plug in and work. Each server is configured differently, so shutdown scripts need to be written and tested thoroughly. Again, this is something that can be tested regularly and reported on to management, so you can be confident in bringing servers back online and management can be confident in your ability to do that.
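Part of testing shutdown scripts is knowing how long each server takes to go down and in what order they should stop. The sketch below is purely illustrative and does not talk to any real UPS: the `Server` class, both function names and the two-minute safety margin are all assumptions. In practice the remaining battery runtime would come from your UPS vendor’s software, and the timings from your own test shutdowns.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    shutdown_seconds: int  # measured during a test shutdown


def shutdown_plan(servers, stop_first):
    """Order servers so those named in stop_first go down before the rest."""
    by_name = {s.name: s for s in servers}
    early = [by_name[n] for n in stop_first if n in by_name]
    late = [s for s in servers if s.name not in stop_first]
    return early + late


def must_start_shutdown(servers, runtime_left_seconds, safety_margin_seconds=120):
    """True once battery runtime no longer covers a one-by-one shutdown."""
    needed = sum(s.shutdown_seconds for s in servers) + safety_margin_seconds
    return runtime_left_seconds <= needed
```

With three servers that take 90, 60 and 30 seconds to stop, the shutdown would need to begin once the UPS reports 300 seconds or less of runtime. Stopping servers one at a time is the cautious assumption; shutting them down in parallel would need less time but is harder to reason about.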
With a smaller company it’s still worth getting UPS devices for your computers. One lightning strike, power surge or cut could render your computers useless. I’ve seen ones for between £50 and £70 from reputable manufacturers.
Now you’ve got something like these three foundations in place. You have backups that are tested regularly; master CDs, licences and network config documents stored off-site with copies held securely on-site; and UPS devices protecting your computer equipment. That’s a great starting point.
In the next post of this series I’ll look at the steps of putting together what will become your DR plan.