Business Continuity Planning (BCP)

Creating a Business Continuity Plan (BCP) requires thought and planning. This blog explores what a BCP is, a high level approach to defining a BCP and how it differs from a Disaster Recover Plan (DRP).

So What’s The Difference Between BCP And DRP?

The obvious answer is that BCP deals with business continuity and Disaster Recovery Plans (DRP) deal with the restoration of systems after a disaster.

Normally DRPs are far more focused on actual technology and steps, where as BCPs have to consider everything surrounding it. The Business Continuity Plan must look at risks to business and likely scenarios we need to manage, where as DRP’s normally are more specific; although they may also be scenario based. Typically the DRP is written by a technical specialist with experience and scope around what happens with specific technologies.

BCPs are important because they consider the needs of the business and not only the technology. Technical subjects, Such as daily backups have a business implication. Performing daily backups implies a Restore Point Objective (RPO) has been set to 24hrs which effectively means that at any point up to 24 hrs of data can be lost. Is that acceptable in a large company? Possibly not. It’s a business decision that sometimes is made by a technical resource with little thought for the fact that loosing a day of business could result in extremely high costs; If one person loses a working day of information the cost may be considered to be 8 hrs of work, but if 100 people lost 8 hrs of information, the cost could be 800 hours.

If you have ever been in the unfortunate position to lose a large system and have to recover from backup due to a series of extremely improbable events you will see some of these issues first hand – It can take months to restore things, causing all manner of financial penalties and chaos. BCP’s should be tested at least once a year – because business and technology change. Even if you only use public cloud services, you still need a BCP in place.

In a large systems failure a simple question can greatly reduced the cost to your business and customers – for example, “What order should we do restores in?”.

Operations might restore in an order that made sense to them, but that’s not necessarily the right order for the customer or the business. Its possible that key critical services & infrastructure for our customers can be grouped together and restored first to minimize impact to them.

The same question could be asked for a product team – in the event of a catastrophe – what do we really need to get things working as a bare minimum? Operations cannot know whats business critical by itself, the BCP guides them.

Customer vs Internal BCP

In an IT services operation its important to remember the customer and supplier are two different business entities. In pretty much every business model the customer doesn’t want to spend money – they want to receive value. As a provider, We want to provide value in the most efficient way we can so we can reduce risk, optimise our costs and improve our profit.

At the heart of a BCP we are managing risk to our business – and the customer must manage risk to theirs. In order to manage risk to business you need an understanding of the business strategy and goals,  A customers BCP is about managing risk to their business and not ours. Properly defining a BCP with a customer can be considered to be consultancy work which may involve connecting to their stakeholders, understanding and modelling their strategy, analysing their working practices, risks, and potential business impact. This requires a level of intimacy with the customer.

Similarly, a service provider does not want to expose customers to all the risks that they need to mitigate; they need to protect the business, and its a level of detail they do not normally need. Typically BCPs are classified as “Internal”, or “Confidential”.

For these reasons above, It’s essential that a service provider doesn’t mix these up.

How do I build a BCP?

People often just pick up a template and fill things in on it – which often gives unpredictable results and isn’t really covering the things that are critical to business. Consider a structured approach:

Risk Analysis (RA)

This is key to building a proper BCP. If we haven’t identified our risk, then how do we know a business continuity plan is providing value, and mitigating key risks to the business? I have seem businesses that have not gone through risk analysis at all, leading to some very high level scenarios, which have no value, because at that level, in an emergency, just making things up would work equally as well. There are formal mechanisms we can use such as SABSA, or if we modelled our business continuity scenarios and processes in BPMN, we could apply something like I suggested in my blog Risk Analyzing BPMN Models.

Thinking about your end to end delivery is a good starting point for doing BCP work and then drawing it as a process.

Requirements are also a good place to start; do not forget those come from all kinds of places – we have customer requirements & wishes, security non functional requirements and may also have goal related or other requirements from our business. Understanding priority requirements and looking at possible risks to meeting them can form a basis for a risk analysis.

Of course a skilled architect designing solutions and documenting them to an ISO 42010 standard would already be managing stakeholder concerns, and would be able to identify the key concerns easily.

Business Impact Analysis (BIA)

Once we identify risks we need to establish their cost to business.  There may already be guidelines around this; many people like to asses in terms of potential financial loss. In a well defined business there are normally a set of established metrics defined in architecture, and / or a policy around how we measure risk impact. Very basic values can be calculated with a set of assumptions – for example – if we have defined a risk that there will be a loss of customer data we could say the impact hits us in several ways: financial penalties, reputation, potential loss of customers.

If we think a risk will impact multiple customers – such as may happen if we lose a complete platform, you may wish to assess how many customers we might permanently lose, as the missing revenue may impact us in a long term.

 We could make a rough guess on the percentage of customers we might lose, but we could also look at previous examples of similar events – for example how many customers did we loose when we lost our servers during a previous outage? what did it cost? You can use such figures or percentages as a guideline when calculating potential impact.  Think about how you can use the figures you have at your disposal and use those to influence your assumptions. Once you have run through an impact analysis you may actually decide to re-.prioritize your requirements.

Writing the BCP

Once you have done the RA and BIA you have a good starting point to all the key areas you need to cover in a BCP. With or without a template, we have a good idea of the scenarios we need to cover in our BCP. A few things to note:

Normally I try to avoid repeating information I have in other places – i would rather refer to other documents. Doing that however means that you need to ensure the referred documents are accessible to the entire audience of your BCP.  The audience of a BCP is something that needs to be considered carefully. Of course, you have all the security related people but that is the tip of the iceberg. All the players in our BCP process need to be aware of their part in it and need to agree to their part in it.  The owner of the BCP must ensure access to the BCP and all related resources for its audience. At this point we can take a document template and start to build a document that really brings value.

There is another school of thought on where to keep information – when I discussed it with some members of my employers security team, they preferred to copy and paste key material into the BCP template. The value that has is that all the information you have is in a single location – the disadvantage is its another place to maintain which can easily become outdated. 

In some previous jobs I have had to also maintain a print copy in a physical safe. Of course we are supposed to regularly test this so its arguable that the document will be kept up to date… you decide.

I eluded to the fact that there can be a hierarchy of BCP’s; depending on the structure of the business; and there can also be dependencies on disaster recovery plans and on teams and people – its important that as part of the disaster recovery planning excercise you ensure the availability of all of the things you depend on – be it resources, systems, documentation.

Bear in mind that if your services rely on other teams, or other companies they could well be an integral part of your BCP and its important to establish a proper interface and expectation. This becomes a lot easier if you have defined your process using a notation such as BPMN as i mentioned earlier.

Discipline & Testing

Testing should be regularly done – at least one a year – and documented. If its not documented it never happened. Things change.

Losing The Value Of BCP

If its not tested, it loses its value. If its not communicated, it loses its value. if its done as a copy+paste exercise without walking through your processes and thinking though its goals and your business… it loses value.

Like many things in Architecture the value in a BCP is not in the result, but more in the process you take to reach it. Without the risk analysis or an idea on how your business actually works the value is greatly diminished. The question isn’t actually – “Do I have a BCP Document?” the question is “Do I understand the key areas of risk in my business and do i have a solid plan defined and communicated for if something catastrophic happens?”

Should I Be Doing A BCP?

Business Continuity Plans are Rarely a singular document. In a large corporation some scenarios are taken care of at corporate level and then the different levels of the corporation should  have their own – individual service areas, and in the case of Tieto, individual Products/services.

At a corporate level a BCP will usually cover things such as Loss Of Life, and how we should handle things like the media in a catastrophe. Bear in mind also we may rely on other BCPs and should document if its the case.

I’ve heard from product managers sometimes in different service organizations that they shouldn’t be responsible for BCPs – its a customer thing. The reality is, if you have a business that you value, you need to be able to protect it, which is why the BCP exists. There may be exceptions but at the end of the day product teams and operations are running parts of the business too – often together – We should not forget the architecture side of this – we are defining a solution that needs to cover our aspects – That looks at risk relating to  People, Processes & Roles, Tools & Technology, Organization, and Information.  Product managers have P&L responsibilities – so naturally the continuation of business should be of interest to both.

Summing it up

If you have a business, how important is it to you that it continues? If its important at all, then why not spend a little bit of time protecting it, and rather than blindly running through a paper exercise, really think about what you need to do to protect it. Maybe take in some key resources to work together and take a structured approach. It may be that your BCP exercise yields unexpected results, and improvements to your architectures.

Information and Security Thinking

When I first started working with the Tieto Office 365 internal initiative we hadn’t made too many decisions on how to move forward with implementing a collaboration platform – This blog is about the first thoughts I had back then; which still hold true now.

Information Management

At the core of any business, and any collaboration system is information – the management and protection of that information one of the keys to its success. Tieto, like pretty much all companies has information policies and its essential that we adhere to them. Some core things to consider:

  • Information classification – We have a standard set of classifications and those classifications determine how we manage information. Anyone is allowed to see public information – where as confidential information has a controlled access list for example. any information we store has a classification, and that has to be identified in our information model. Typically the classification of information is related to the risk of its exposure to various parties.
  • Information Ownership – Information is always owned by someone, and that someone is responsible for the classification of information – although there may be some mandatory rules an information owner may need to adhere to. Its also important to know there are differences between an information owner, and information author. although in many cases its often the same person assuming those responsibilities.
  • Information Traceability – establishing ownership is part of this but we need to be able to effectively track or locate information.
  • Information life-cycle – its important to understand if information is current or outdated, and to establish rules around things such as information retention.

What this means in real terms is we need to ensure anyone using our systems can classify information and it means we have to put in mechanisms to in some cases enforce policy. Discussions started early on over minimums that our internal security team needs in place – but at its core, before we can do anything we need to ensure our information needs are managed and then add the layers of security on top of that – for example we need to consider things like multi-factor authentication. Requirements are drawn up by our security team in collaboration with the architects, and in some cases we need to consider modernizing. Our versioning policy is a prime example of this. On most modern systems we have a simple major/minor approach to version management – many people are unaware of the formal policy we have at work, because our information systems don’t support a version that is expressed something like 1.0.1-2D.

Requirements Management

With any kind of architecture engagement requirements management is important; one of the biggest problems I have had working in my current role is getting focus to be more at a business layer than a technology one. 

Security is not exception to this – its very easy for a security policy to be dictated by the functionality of a tool, and we should be very careful not to do that. This is why, with our security team we have tried to lay out the requirements before even talking technology – even so, I sometimes get the feeling that some of the see come directly from a Microsoft manual. Its important that we discuss and balance the requirements – and decide what is and is not in scope. Some things will be mandatory, and some things may not be in scope; that’s OK – it can be managed as a risk – and sometimes the business can decide to accept a risk – because business drives security, not the other way round.

Balancing Security

There’s a balance. The users in the modern age expect a certain amount of freedom on how they work with information, but at the same time we need some controls in place to protect the organisation and its members.

Too much freedom – and you have risks related to information getting into the wrong hands, or lost, or worse. Too little freedom and it invites users to find innovative ways to work around the systems you put in place. If I restrict who can access a site for example, then people will work around it – they may start emailing files around and suddenly you lose control of where the latest version of your file is, or who has access to it. If I cannot create my own teams site, then maybe I will want to use something else. In such a case by restricting access we have effectively lost control of access all together.

We have been very mindful of this from the start of the Tieto project – Tieto has many ways of working, and no one way fits all. When we first started on-boarding users into Office 365 some policy decisions were made – people in specific customers were not allowed to be on Office 365. When it comes to collaboration, Not allowing people on creates a very real problem. suddenly those customer teams are alienated from a wider Tieto community, which means we either loose our connection to them, or they find a way of working around the mechanisms we have in place. In Tieto, any restrictive policy we put in place is going to impact someone somewhere.

So how do we address this? We know already the information we must keep to have a minimal level of security, but more important if for us to understand our information policies

Knowing Your Responsibilities

As information owners we all know pretty much what we should and shouldn’t do, and to be successful we need to have a level of trust that our users will know both our information policy and what their industry/Customer does or doesn’t allow. Rather than restrict we need to educate.

For those users in customers that are not allowed to have information online we need to ensure we have a system in place that makes it very easy for the to understand where they are – whether it be on O365, or on our private internal solution. At the top level we decided we would color code. The site theme for Office 365 should be different to on premise so that we can immediately see where we publish.

We need to make sure that as part of this project our communications team makes it fairly clear what our responsibilities are.

How This Is Realized In Technology Terms

We implement core content types that are mandatory and a basic template that all others are derived from – out of the box SharePoint, and we have taken other decisions on things like Multi-factor authentication. We then continued a discussion on how and what we need to implement around EMS and other technologies. 

These are the things we were considering at the beginning and still form important parts of the ongoing work because of course outside the office 365 conversation, there is also a device management conversation going on.

I hope this gives a little insight into some of the information considerations we had when practically moving to Office 365 & its surrounding technologies.