Creating a Business Continuity Plan (BCP) requires thought and planning. This blog explores what a BCP is, a high-level approach to defining one, and how it differs from a Disaster Recovery Plan (DRP).
So What’s The Difference Between BCP And DRP?
The obvious answer is that a BCP deals with business continuity and a Disaster Recovery Plan (DRP) deals with the restoration of systems after a disaster.
Normally DRPs are far more focused on actual technology and steps, whereas BCPs have to consider everything surrounding it. The Business Continuity Plan must look at risks to the business and the likely scenarios we need to manage, whereas DRPs are normally more specific, although they may also be scenario based. Typically the DRP is written by a technical specialist with experience and scope around what happens with specific technologies.
BCPs are important because they consider the needs of the business and not only the technology. Technical subjects, such as daily backups, have a business implication. Performing daily backups implies that a Recovery Point Objective (RPO) of 24 hrs has been set, which effectively means that at any point up to 24 hrs of data can be lost. Is that acceptable in a large company? Possibly not. It’s a business decision that is sometimes made by a technical resource with little thought for the fact that losing a day of business could result in extremely high costs. If one person loses a working day of information the cost may be considered to be 8 hrs of work, but if 100 people each lose 8 hrs of information, the cost could be 800 hours.
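The arithmetic behind that last point can be sketched in a few lines of Python. The function names and the hourly cost are illustrative assumptions, not figures from a real BCP, and the sketch assumes every affected person loses the same number of hours.

```python
def lost_work_hours(people_affected, hours_lost_each):
    """Total hours of work lost when every affected person loses the same amount."""
    return people_affected * hours_lost_each


def lost_work_cost(people_affected, hours_lost_each, hourly_cost):
    """Rough cost of the lost work; hourly_cost is an assumed blended rate."""
    return lost_work_hours(people_affected, hours_lost_each) * hourly_cost


print(lost_work_hours(1, 8))       # 8
print(lost_work_hours(100, 8))     # 800
print(lost_work_cost(100, 8, 50))  # 40000 (hypothetical rate of 50 per hour)
```

Even a rough figure like this gives the business something concrete to weigh against the cost of more frequent backups.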
If you have ever been in the unfortunate position of losing a large system and having to recover from backup due to a series of extremely improbable events, you will have seen some of these issues first hand – it can take months to restore things, causing all manner of financial penalties and chaos. BCPs should be tested at least once a year, because business and technology change. Even if you only use public cloud services, you still need a BCP in place.
In a large systems failure a simple question can greatly reduce the cost to your business and customers – for example, “What order should we do restores in?”
Operations might restore in an order that makes sense to them, but that’s not necessarily the right order for the customer or the business. It’s possible that key critical services & infrastructure for our customers can be grouped together and restored first to minimize impact to them.
The same question could be asked of a product team: in the event of a catastrophe, what do we really need to get things working as a bare minimum? Operations cannot know what’s business critical by itself; the BCP guides them.
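As a minimal sketch of how a BCP could guide the restore order, the snippet below sorts services by a priority that the business has agreed on. The `Service` class, the service names and the priority values are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    restore_priority: int  # 1 = restore first; agreed with the business, not chosen by operations


def restore_order(services):
    """Return service names in the order the BCP says they should be restored."""
    ranked = sorted(services, key=lambda s: s.restore_priority)
    return [s.name for s in ranked]


# Hypothetical services and priorities, for illustration only.
services = [
    Service("internal wiki", 3),
    Service("customer portal", 1),
    Service("billing", 2),
]
print(restore_order(services))  # ['customer portal', 'billing', 'internal wiki']
```

The point is not the sorting itself but that the priority is recorded in advance, so nobody has to guess it during an outage.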
Customer vs Internal BCP
In an IT services operation it’s important to remember that the customer and supplier are two different business entities. In pretty much every business model the customer doesn’t want to spend money – they want to receive value. As a provider, we want to provide value in the most efficient way we can so we can reduce risk, optimise our costs and improve our profit.
At the heart of a BCP we are managing risk to our business – and the customer must manage risk to theirs. In order to manage risk to a business you need an understanding of the business strategy and goals. A customer’s BCP is about managing risk to their business and not ours. Properly defining a BCP with a customer can be considered to be consultancy work which may involve connecting to their stakeholders, understanding and modelling their strategy, and analysing their working practices, risks, and potential business impact. This requires a level of intimacy with the customer.
Similarly, a service provider does not want to expose customers to all the risks that it needs to mitigate; it needs to protect the business, and it’s a level of detail customers do not normally need. Typically BCPs are classified as “Internal” or “Confidential”.
For these reasons, it’s essential that a service provider doesn’t mix these up.
How do I build a BCP?
People often just pick up a template and fill things in on it – which often gives unpredictable results and isn’t really covering the things that are critical to business. Consider a structured approach:
Risk Analysis (RA)
This is key to building a proper BCP. If we haven’t identified our risks, then how do we know a business continuity plan is providing value and mitigating key risks to the business? I have seen businesses that have not gone through risk analysis at all, leading to some very high-level scenarios which have no value, because at that level, in an emergency, just making things up would work equally well. There are formal mechanisms we can use such as SABSA, or if we modelled our business continuity scenarios and processes in BPMN, we could apply something like what I suggested in my blog Risk Analyzing BPMN Models.
Thinking about your end-to-end delivery and then drawing it as a process is a good starting point for doing BCP work.
Requirements are also a good place to start; do not forget those come from all kinds of places – we have customer requirements & wishes, security non-functional requirements and may also have goal related or other requirements from our business. Understanding priority requirements and looking at possible risks to meeting them can form a basis for a risk analysis.
Of course a skilled architect designing solutions and documenting them to an ISO 42010 standard would already be managing stakeholder concerns, and would be able to identify the key concerns easily.
Business Impact Analysis (BIA)
Once we identify risks we need to establish their cost to the business. There may already be guidelines around this; many people like to assess in terms of potential financial loss. In a well-defined business there is normally a set of established metrics defined in architecture, and/or a policy around how we measure risk impact. Very basic values can be calculated with a set of assumptions. For example, if we have defined a risk that there will be a loss of customer data, we could say the impact hits us in several ways: financial penalties, reputation, potential loss of customers.
If we think a risk will impact multiple customers, as may happen if we lose a complete platform, we may wish to assess how many customers we might permanently lose, as the missing revenue may impact us in the long term.
We could make a rough guess on the percentage of customers we might lose, but we could also look at previous examples of similar events – for example, how many customers did we lose when we lost our servers during a previous outage? What did it cost? You can use such figures or percentages as a guideline when calculating potential impact. Think about how you can use the figures you have at your disposal and use those to influence your assumptions. Once you have run through an impact analysis you may actually decide to re-prioritize your requirements.
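A very basic impact estimate of this kind could be sketched as below. The function names, the formula and every figure are assumptions for illustration; a real BIA would follow whatever metrics and policy the business has defined, and reputation damage is deliberately left out because it is hard to put a number on.

```python
def churn_rate_from_history(customers_before, customers_lost):
    """Fraction of customers lost in a previous, similar event."""
    return customers_lost / customers_before


def estimate_impact(penalties, customers_affected, churn_rate,
                    annual_revenue_per_customer, years=1):
    """Rough financial impact: penalties plus revenue from customers we expect to lose."""
    lost_customers = customers_affected * churn_rate
    lost_revenue = lost_customers * annual_revenue_per_customer * years
    return penalties + lost_revenue


# Hypothetical figures: a past outage cost us 10 of 200 customers.
rate = churn_rate_from_history(200, 10)
impact = estimate_impact(penalties=50_000, customers_affected=120,
                         churn_rate=rate, annual_revenue_per_customer=20_000,
                         years=3)
print(f"{impact:,.0f}")  # 410,000
```

Changing the assumed churn rate or the number of years shows quickly how sensitive the result is to the assumptions, which is useful when deciding whether to re-prioritize.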
Writing the BCP
Once you have done the RA and BIA you have a good starting point for all the key areas you need to cover in a BCP. With or without a template, we have a good idea of the scenarios we need to cover in our BCP. A few things to note:
Normally I try to avoid repeating information I have in other places – I would rather refer to other documents. Doing that, however, means that you need to ensure the referred documents are accessible to the entire audience of your BCP. The audience of a BCP is something that needs to be considered carefully. Of course, you have all the security related people, but that is the tip of the iceberg. All the players in our BCP process need to be aware of their part in it and need to agree to it. The owner of the BCP must ensure access to the BCP and all related resources for its audience. At this point we can take a document template and start to build a document that really brings value.
There is another school of thought on where to keep information – when I discussed it with some members of my employer’s security team, they preferred to copy and paste key material into the BCP template. The value of that is that all the information you have is in a single location; the disadvantage is that it’s another place to maintain, which can easily become outdated.
In some previous jobs I have also had to maintain a print copy in a physical safe. Of course we are supposed to regularly test this, so it’s arguable that the document will be kept up to date… you decide.
I alluded to the fact that there can be a hierarchy of BCPs, depending on the structure of the business, and there can also be dependencies on disaster recovery plans and on teams and people. It’s important that as part of the disaster recovery planning exercise you ensure the availability of all of the things you depend on, be it resources, systems or documentation.
Bear in mind that if your services rely on other teams or other companies, they could well be an integral part of your BCP, and it’s important to establish a proper interface and expectation. This becomes a lot easier if you have defined your process using a notation such as BPMN, as I mentioned earlier.
Discipline & Testing
Testing should be done regularly – at least once a year – and documented. If it’s not documented, it never happened. Things change.
Losing The Value Of BCP
If it’s not tested, it loses its value. If it’s not communicated, it loses its value. If it’s done as a copy+paste exercise without walking through your processes and thinking through its goals and your business… it loses value.
Like many things in Architecture, the value in a BCP is not in the result, but more in the process you take to reach it. Without the risk analysis or an idea of how your business actually works, the value is greatly diminished. The question isn’t actually “Do I have a BCP document?”; the question is “Do I understand the key areas of risk in my business, and do I have a solid plan defined and communicated for if something catastrophic happens?”
Should I Be Doing A BCP?
Business Continuity Plans are rarely a single document. In a large corporation some scenarios are taken care of at corporate level, and then the different levels of the corporation should have their own: individual service areas and, in the case of Tieto, individual products/services.
At a corporate level a BCP will usually cover things such as Loss Of Life, and how we should handle things like the media in a catastrophe. Bear in mind also that we may rely on other BCPs and should document it if that’s the case.
I’ve sometimes heard from product managers in different service organizations that they shouldn’t be responsible for BCPs – it’s a customer thing. The reality is, if you have a business that you value, you need to be able to protect it, which is why the BCP exists. There may be exceptions, but at the end of the day product teams and operations are running parts of the business too, often together. We should not forget the architecture side of this: we are defining a solution that needs to cover our aspects and that looks at risk relating to People, Processes & Roles, Tools & Technology, Organization, and Information. Product managers have P&L responsibilities, so naturally the continuation of business should be of interest to both.
Summing it up
If you have a business, how important is it to you that it continues? If it’s important at all, then why not spend a little bit of time protecting it, and rather than blindly running through a paper exercise, really think about what you need to do to protect it. Maybe take in some key resources to work together and take a structured approach. It may be that your BCP exercise yields unexpected results, and improvements to your architectures.
People often make risk assessments without a structured approach, relying on intuition or experience. In this blog I show some methods I use to quickly identify risks in BPMN models when formal risk methodologies (such as SABSA) are not in use.
As mentioned in previous blogs, I normally do a process overview in ArchiMate, and then align it to BPMN because BPMN is easy to understand, and creating BPMN diagrams forces us to think about who is doing what, when.
Design vs Run-time
At the highest level I normally think of things in terms of design time vs run time risks. Design time risks are the ones that arise from the structure of the process and the decisions we have made when building it; run time risks happen when we execute the process. I will talk a little more about this in a while, but it’s important we consider both aspects.
Running Through The Process With TELOS
TELOS (Technical, Economic, Legal, Operational, Scheduling) is an acronym we usually use for feasibility checking. If anyone is interested I can blog on this later. For now, the basics: we look at each of the tasks (the square blocks on the diagram) and ask these questions:
- Do we have the software / hardware in place to perform this task? If a task requires the user to modify a Visio file, do they have Visio? Are we relying on a specific platform? For example, if we need to read something from CRM, what happens if that goes down? Are we relying on a system that isn’t in place yet or is due to be decommissioned?
- Is it clear how to do this? Having the correct level of detail in the documentation is important. If we are creating or saving files for example, have we said exactly where to locate them?
- Do we have the skills & competencies in place to execute this? Can the person who is expected to perform the task actually do it or do they need training on how? Are they sufficiently trained if something should go wrong?
- Are we doing this the most economic way? For example, a contractor may be more expensive to perform a task than an internal resource.
- What would be the economic ramifications of this step going wrong? Would it potentially be a breach in the Service Level Agreement (SLA)? Could it result in a major incident, or financial penalties?
- Are we violating any known regulation? For example, Non-EU teams may not be allowed to work with personal information under GDPR. Legal requirements may exist for information retention if we are working with financial data so we may need to ensure it is kept as part of a process.
- Have we used the proper roles? It’s important that the right person is doing the job. For example, you would not design a process to have a Business Architect install a server. Even if a particular Business Architect has the capability to do it, that kind of work most likely belongs to a technical specialist; assigning the wrong resources can cause all kinds of issues.
- Do we have the resources in place? In order to execute the process, have we thought through how often the process will be run, and whether we have sufficient resourcing?
- Will we be able to do all this when the process goes live? For example, will the training be in place? Will the resources be in place?
- Can the step be performed in good time? Would the time it takes to perform the step force a breach in SLA when you add it up with the related steps?
There are many further questions we could ask around a design using TELOS as our guidelines, but these are questions I will typically ask when risk analyzing the design of a process.
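One way to keep track of these questions per task is a simple checklist, sketched below. The checks are shortened restatements of the questions above, and the way they are grouped under the five TELOS headings is my own assumption; `design_risks` is a hypothetical helper, not part of any tool.

```python
# Shortened versions of the questions above, phrased so that "passed" means no risk.
# The grouping under each TELOS heading is an assumption.
TELOS_CHECKS = {
    "Technical": ["Software and hardware for the task are in place"],
    "Economic": ["The task is done in the most economic way",
                 "The cost of the step going wrong is understood"],
    "Legal": ["No known regulation is violated"],
    "Operational": ["It is clear how to do the task",
                    "Skills and competencies are in place",
                    "The proper roles are used",
                    "Sufficient resources are in place"],
    "Scheduling": ["Everything will be ready when the process goes live",
                   "The step fits within the SLA"],
}


def design_risks(task, passed_checks):
    """List every TELOS check the task has not passed as a candidate design risk."""
    return [
        (task, category, check)
        for category, checks in TELOS_CHECKS.items()
        for check in checks
        if check not in passed_checks
    ]


# Hypothetical task that passes everything except the legal check.
passed = {c for checks in TELOS_CHECKS.values() for c in checks}
passed.discard("No known regulation is violated")
print(design_risks("Update customer record", passed))
# [('Update customer record', 'Legal', 'No known regulation is violated')]
```

Running this for every task in the diagram gives a first draft of a design-time risk list.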
Run Time Risks
When analyzing run time issues I do this very simply. I look at all the relationships between each of the elements and I ask myself the following questions:
- What happens if the communication never happens? This normally breaks a process unless a contingency is put in place. A simple example: if we order something that is never delivered, is it a risk we need to handle? We can always handle risk as part of our normal escalations process, but sometimes we want to put steps in place to be a bit more proactive. We could have a timed event in our process that after a week checks to see if the delivery arrived and if not escalates with the supplier, so that we get the required deliverable before it becomes a critical issue.
- What happens if the communication is wrong? If we send an order for parts that is incorrect then that’s just as bad as communication not happening – if not worse. Do we need to put in steps to check the communication before it happens, or are we willing to accept the risk?
- What happens if the communication is delayed? An order delayed may lead to a breach in SLA – do we need something in place to avoid that?
- What happens if a resource isn’t available? As well as individual resources not being available for each step, consider what happens if resources aren’t available to perform a role at all, for example if nobody is assigned.
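The timed event from the first question, which checks after a week whether a delivery has arrived, could be sketched as follows. The function name and the dates are hypothetical; in a real process this logic would sit behind a BPMN timer event rather than in a script.

```python
from datetime import date, timedelta


def needs_escalation(ordered_on, delivered_on, today, wait=timedelta(days=7)):
    """True if the order is still undelivered once the waiting period has passed."""
    return delivered_on is None and today - ordered_on >= wait


# Hypothetical order placed on 1 January 2024 and never delivered.
print(needs_escalation(date(2024, 1, 1), None, date(2024, 1, 5)))   # False
print(needs_escalation(date(2024, 1, 1), None, date(2024, 1, 8)))   # True
```

The `wait` parameter makes the one-week threshold explicit, so it can be tightened if the SLA leaves less slack.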
What About Events?
The things that start and stop our processes, and the events we trigger and receive during a process should be handled with the same questions I listed for run time risk.
Let’s Be SMART
It’s another acronym. In the slide above it was about goal setting, but it applies equally to process. We already cover some of these in TELOS, but for each task in our process:
- Specific – Is it vaguely defined, or can anyone understand exactly what it is that you need to do? For example, “check that the architecture is good” can be interpreted in many different ways – “Check the architecture against the ISO 42010 Checklist” is a bit better defined. In the associated documentation you would of course also state where the checklist is. If we are vague in definitions there is a risk we will be misunderstood, leading to communication overheads, and possibly the wrong things being done (an overhead in work).
- Measurable – In defining processes we should also be defining metrics to ensure that processes are healthy. Those metrics should be clearly defined and measurable. If we have a process step to deliver something from point A to point B, we should probably have a metric to understand the amount of time that takes, which can act as a key performance indicator. If these indicators aren’t defined, it’s most likely a design related risk.
- Agreed Upon – It’s OK to build a process with 10 different actors involved, but each one must agree. Even if you are the manager of those resources, this step is still important because they may identify issues in the process that you do not see. If something is not agreed upon, there’s a risk that the execution may not happen according to design.
- Realistic – You could define a process step such as “Get Owen to eat Broccoli”. That’s fine. Now try getting me to do it. If you define a step like that, there could be an associated risk!
- Time-Based – You should have a fairly good idea of how long each process step takes. If you cannot define this, it’s likely a risk.
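The Time-Based check, together with the earlier question about whether related steps add up to an SLA breach, can be sketched like this. The step names, durations and SLA value are hypothetical, and `check_time_budget` is an illustrative helper rather than an established method.

```python
from typing import Dict, Optional


def check_time_budget(step_hours: Dict[str, Optional[float]], sla_hours: float):
    """Flag steps with no defined duration and check the known total against the SLA."""
    undefined = [name for name, hours in step_hours.items() if hours is None]
    total = sum(hours for hours in step_hours.values() if hours is not None)
    return {
        "undefined_steps": undefined,       # each of these is likely a risk
        "total_hours": total,               # sum of the durations we do know
        "breaches_sla": total > sla_hours,  # already too slow, even ignoring unknowns
    }


# Hypothetical process with one step nobody has estimated yet.
steps = {"Receive order": 1, "Approve order": None, "Ship parts": 4}
print(check_time_budget(steps, sla_hours=8))
# {'undefined_steps': ['Approve order'], 'total_hours': 5, 'breaches_sla': False}
```

Note that a process can pass the SLA check and still carry risk: here the unestimated approval step could easily use up the remaining three hours.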
Summing It Up…
What I have presented here becomes very easy once you have done it a few times, and without exception whenever I have personally applied these techniques I have identified unconsidered areas of risk. I haven’t talked about mechanisms for determining impact and probability, but if you follow these guidelines I believe you will have a better, more mature risk analysis. Remember it’s not a bad thing to identify risks. When you identify risks, it’s OK if the business wants to accept them, and if they don’t, you have improvement actions and quality goes up. A risk list with many risks on it shows that you have done a good job in design, and identifying risks is the first step to solving them. These techniques aren’t as comprehensive as a full risk management methodology, but they are significantly better than just trying to guess what risks may occur. I hope you find this useful; if so, let me know.