Business Continuity Planning (BCP)

Creating a Business Continuity Plan (BCP) requires thought and planning. This blog explores what a BCP is, a high-level approach to defining one, and how it differs from a Disaster Recovery Plan (DRP).

So What’s The Difference Between BCP And DRP?

The obvious answer is that a BCP deals with business continuity, while a Disaster Recovery Plan (DRP) deals with the restoration of systems after a disaster.

Normally DRPs are far more focused on actual technology and steps, whereas BCPs have to consider everything surrounding it. The Business Continuity Plan must look at risks to the business and the likely scenarios we need to manage, whereas DRPs are normally more specific, although they may also be scenario based. Typically the DRP is written by a technical specialist with experience and scope around what happens with specific technologies.

BCPs are important because they consider the needs of the business and not only the technology. Technical subjects, such as daily backups, have a business implication. Performing daily backups implies a Recovery Point Objective (RPO) of 24 hours, which effectively means that at any point up to 24 hours of data can be lost. Is that acceptable in a large company? Possibly not. It's a business decision that is sometimes made by a technical resource with little thought for the fact that losing a day of business could result in extremely high costs. If one person loses a working day of information the cost may be considered to be 8 hours of work, but if 100 people lose 8 hours of information, the cost could be 800 hours.
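
To make the business implication concrete, here is a minimal sketch of that arithmetic. All the figures are assumptions taken from the example above; replace the hourly rate with your own:

```python
# Illustrative figures only: the cost of a 24-hour RPO if a restore
# loses a full working day for everyone affected.
hours_lost_per_person = 8      # one working day of lost information
people_affected = 100
hourly_cost = 75               # assumed blended hourly rate

rework_hours = hours_lost_per_person * people_affected  # 800 hours
rework_cost = rework_hours * hourly_cost

print(f"{rework_hours} hours of rework, roughly {rework_cost:,} to the business")
```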

If you have ever been in the unfortunate position of losing a large system and having to recover from backup due to a series of extremely improbable events, you will have seen some of these issues first hand – it can take months to restore things, causing all manner of financial penalties and chaos. BCPs should be tested at least once a year, because business and technology change. Even if you only use public cloud services, you still need a BCP in place.

In a large systems failure a simple question can greatly reduce the cost to your business and customers – for example, "What order should we do restores in?".

Operations might restore in an order that makes sense to them, but that's not necessarily the right order for the customer or the business. It's possible that key critical services and infrastructure for our customers can be grouped together and restored first to minimize the impact on them.

The same question could be asked of a product team – in the event of a catastrophe, what do we really need to get things working as a bare minimum? Operations cannot know what's business critical on its own; the BCP guides them.

Customer vs Internal BCP

In an IT services operation it's important to remember the customer and supplier are two different business entities. In pretty much every business model the customer doesn't want to spend money – they want to receive value. As a provider, we want to provide value in the most efficient way we can, so we can reduce risk, optimise our costs and improve our profit.

At the heart of a BCP we are managing risk to our business – and the customer must manage risk to theirs. In order to manage risk to a business you need an understanding of its strategy and goals. A customer's BCP is about managing risk to their business, not ours. Properly defining a BCP with a customer can be considered consultancy work, which may involve connecting to their stakeholders, understanding and modelling their strategy, and analysing their working practices, risks, and potential business impact. This requires a level of intimacy with the customer.

Similarly, a service provider does not want to expose to customers all the risks it needs to mitigate; it needs to protect its own business, and it's a level of detail customers do not normally need. Typically BCPs are classified as "Internal" or "Confidential".

For the reasons above, it's essential that a service provider doesn't mix the two up.

How do I build a BCP?

People often just pick up a template and fill it in – which often gives unpredictable results and doesn't really cover the things that are critical to the business. Consider a structured approach:

Risk Analysis (RA)

This is key to building a proper BCP. If we haven't identified our risks, then how do we know a business continuity plan is providing value and mitigating key risks to the business? I have seen businesses that have not gone through risk analysis at all, leading to some very high-level scenarios which have no value, because at that level, in an emergency, just making things up would work equally well. There are formal mechanisms we can use such as SABSA, or if we modelled our business continuity scenarios and processes in BPMN, we could apply something like I suggested in my blog Risk Analyzing BPMN Models.

Thinking about your end-to-end delivery, and then drawing it as a process, is a good starting point for BCP work.

Requirements are also a good place to start; do not forget those come from all kinds of places – we have customer requirements and wishes, non-functional security requirements, and possibly goal-related or other requirements from our business. Understanding priority requirements and looking at possible risks to meeting them can form a basis for a risk analysis.

Of course, a skilled architect designing solutions and documenting them to the ISO 42010 standard would already be managing stakeholder concerns, and would be able to identify the key concerns easily.

Business Impact Analysis (BIA)

Once we identify risks we need to establish their cost to the business. There may already be guidelines around this; many people like to assess in terms of potential financial loss. In a well-defined business there are normally a set of established metrics defined in the architecture, and/or a policy around how we measure risk impact. Very basic values can be calculated with a set of assumptions – for example, if we have defined a risk that there will be a loss of customer data, we could say the impact hits us in several ways: financial penalties, reputation, and potential loss of customers.

If we think a risk will impact multiple customers – as may happen if we lose a complete platform – we may wish to assess how many customers we might permanently lose, as the missing revenue may impact us in the long term.

We could make a rough guess at the percentage of customers we might lose, but we could also look at previous examples of similar events – for example, how many customers did we lose when we lost our servers during a previous outage? What did it cost? You can use such figures or percentages as a guideline when calculating potential impact. Think about how you can use the figures you have at your disposal and use those to influence your assumptions. Once you have run through an impact analysis you may actually decide to re-prioritize your requirements.
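
To show how such assumptions combine into a figure, here is a minimal sketch of a single-scenario impact estimate. Every number is an assumption you would replace with your own metrics or data from previous incidents:

```python
# Rough single-scenario business impact estimate.
# All figures are illustrative assumptions.
contractual_penalties = 250_000       # assumed SLA penalties for the scenario
customers = 400
churn_rate = 0.05                     # e.g. churn observed after a previous outage
annual_revenue_per_customer = 20_000
years_of_lost_revenue = 3             # how long the missing revenue is counted

churn_loss = (customers * churn_rate
              * annual_revenue_per_customer * years_of_lost_revenue)
total_impact = contractual_penalties + churn_loss

print(f"Estimated impact: {total_impact:,.0f}")
```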

Writing the BCP

Once you have done the RA and BIA you have a good starting point for all the key areas you need to cover in a BCP. With or without a template, we have a good idea of the scenarios we need to cover. A few things to note:

Normally I try to avoid repeating information I have in other places – I would rather refer to other documents. Doing that, however, means that you need to ensure the referred documents are accessible to the entire audience of your BCP. The audience of a BCP is something that needs to be considered carefully. Of course, you have all the security-related people, but that is the tip of the iceberg. All the players in our BCP process need to be aware of their part in it and need to agree to it. The owner of the BCP must ensure access to the BCP and all related resources for its audience. At this point we can take a document template and start to build a document that really brings value.

There is another school of thought on where to keep information – when I discussed it with some members of my employer's security team, they preferred to copy and paste key material into the BCP template. The advantage is that all the information is in a single location; the disadvantage is that it's another place to maintain, which can easily become outdated.

In some previous jobs I have also had to maintain a printed copy in a physical safe. Of course, we are supposed to regularly test this, so it's arguable that the document will be kept up to date… you decide.

I alluded to the fact that there can be a hierarchy of BCPs, depending on the structure of the business; there can also be dependencies on disaster recovery plans and on teams and people. It's important that, as part of the disaster recovery planning exercise, you ensure the availability of everything you depend on – be it resources, systems, or documentation.

Bear in mind that if your services rely on other teams or other companies, they could well be an integral part of your BCP, and it's important to establish a proper interface and expectations. This becomes a lot easier if you have defined your process using a notation such as BPMN, as I mentioned earlier.

Discipline & Testing

Testing should be done regularly – at least once a year – and documented. If it's not documented, it never happened. Things change.

Losing The Value Of BCP

If it's not tested, it loses its value. If it's not communicated, it loses its value. If it's done as a copy-and-paste exercise without walking through your processes and thinking through its goals and your business… it loses value.

Like many things in architecture, the value in a BCP is not in the result but in the process you take to reach it. Without the risk analysis, or an idea of how your business actually works, the value is greatly diminished. The question isn't actually "Do I have a BCP document?"; the question is "Do I understand the key areas of risk in my business, and do I have a solid plan defined and communicated for if something catastrophic happens?"

Should I Be Doing A BCP?

Business Continuity Plans are rarely a single document. In a large corporation some scenarios are taken care of at the corporate level, and then the different levels of the corporation should have their own – individual service areas, and in the case of Tieto, individual products/services.

At a corporate level a BCP will usually cover things such as loss of life, and how we should handle things like the media in a catastrophe. Bear in mind also that we may rely on other BCPs, and we should document it if that's the case.

I've sometimes heard product managers in different service organizations say that they shouldn't be responsible for BCPs – it's a customer thing. The reality is that if you have a business you value, you need to be able to protect it, which is why the BCP exists. There may be exceptions, but at the end of the day product teams and operations are running parts of the business too – often together. We should not forget the architecture side of this: we are defining a solution that needs to cover risk relating to people, processes and roles, tools and technology, organization, and information. Product managers have P&L responsibilities, so naturally the continuation of business should be of interest to both.

Summing it up

If you have a business, how important is it to you that it continues? If it's important at all, then why not spend a little time protecting it, and rather than blindly running through a paper exercise, really think about what you need to do to protect it. Maybe bring in some key resources to work together and take a structured approach. It may be that your BCP exercise yields unexpected results, and improvements to your architecture.

Assessing Solution Requirements

In this blog I talk about requirements and the process of choosing anything as an architect. It could be a hardware solution, like a suitable laptop, or a software decision like choosing between Teams and Slack.

A really important thing to note here is that it's much more important to think about the methodology behind what I present than about the tools I use.

Choosing The Right Solution

As a rule of thumb, it's always a good idea to assess two or three different technologies before choosing one. It's good to know that if your primary option fails for whatever reason, there's a secondary solution. We want to avoid vendor lock-in; if there is only a single vendor we should risk-assess them – which is a whole other subject unto itself.

Decisions should be made based upon the requirements of our different stakeholders to ensure that the solution is fit for purpose. It's tempting to look at software and think about the feature set it gives. Some people choose one piece of software over another because it has better features. This is normally a bad approach; you may end up paying for an expensive solution that will never be fully utilized. The same reasoning applies equally to hardware and software.

When considering replacement technologies or upgrades you should also go back to the requirements.

An Example With Disk Capacity

For example, if we are looking to order new disk capacity, a vendor may offer us a new model that is 5% faster. It may look like a good idea on the surface, but that's not necessarily the case. If we do not require faster disk capacity, then in fact there may be a cost overhead. Let's consider TELOS for a moment (Technical, Economic, Legal, Operational, Scheduling). We may realize that implementing new hardware means potential incompatibility and a risk to operational efficiency. It may also mean we need people to support or train on the new technology; it's one more technology type to manage.
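
As a quick sketch of that trade-off, the comparison below uses purely hypothetical figures; the point is that the extra spend only pays off if a requirement actually calls for the faster disk:

```python
# Hypothetical comparison: a 5% faster disk model vs. the current one.
current_unit_cost = 1_000
new_unit_cost = 1_100
units = 50
training_and_integration = 15_000   # one-off cost of managing another technology type
value_of_extra_speed = 0            # zero if no requirement needs faster disk

extra_cost = (new_unit_cost - current_unit_cost) * units + training_and_integration
print(f"Extra cost: {extra_cost:,}; value gained: {value_of_extra_speed:,}")
```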

In addition to this, we should of course be tracking decisions in a work log or other system. We might also consider having release management on versions of our requirements and their approvals.

Modelling Device & Requirement Mapping

A practical example: I needed to choose a new laptop. Not knowing what to get, I did an exercise in ArchiMate. I started by modelling the top choices I had narrowed down to (I created a Technology View). Of course, if we were deciding between software items we could just as easily use application components in another view.

Laptop Device Elements

From there I needed to decide my requirements. I am using BiZZdesign's Enterprise Studio, and multi-added some items (it took two minutes). I followed this by using a property table to assign priorities to my requirements. The resultant requirements ended up looking as shown below. It's a Motivation View, using a MoSCoW color filter:

Requirements View

You can see above the priorities I had on my requirements. Normally when doing requirements I am thinking about TELOS; we could also consider ensuring we capture requirements from all the different stakeholder types named in ISO 42010. Note, in my laptop decision I was doing quick-and-dirty modelling.

Realizing The Requirements

Once I had the requirements, it was a matter of deciding which devices met them. I could have put both the devices and requirements in a requirements realization view; instead I used another cool Enterprise Studio feature – I created a cross-reference table using these options:

Cross reference table definition

From here it created a table and I could just click on the table cells to generate realization relationships between the requirements and the devices.

Cross Reference Table

Visualizing this, it was easy to see the best option was the EliteBook. From this point I could easily have generated an ArchiMate view using the auto-generate functions in Enterprise Studio, but I just didn't need it. I could also have saved this table as an Enterprise Studio viewpoint and reused it later so I didn't have to re-select the options again. Note – viewpoints in this case refers to the Enterprise Studio functionality. In my agony to make the right choice I did in fact produce one last motivation view:

Motivation View

The whole exercise took 30 minutes. There are distinct advantages to modelling your requirements when it comes to making sure nothing is missed in requirements realization and tying requirements into other parts of the architecture. We can of course document each requirement, its rationale, and its influence.

Working With Larger Projects

When working as part of a larger project you might have to periodically sync requirements, or compromise on how you work with them. I could, for example, show a single element in a requirements realization view, such as "Citrix Hardware Requirements", and then in the documentation of the element just link to a Confluence page where a team of people not using my modelling tool can manage them.

We can also document the relationships; in the relationship documentation you could record who has actually agreed or confirmed that the requirement can be realized, alongside any justification or documentation you have.

Capturing Requirements In A Collaborative Tool

We can capture requirements using any collaboration tool – be it something like OneNote or Confluence. Of course, it's good if the tool you use allows versioning; regardless of the actual tool, you need to consider the following:

  • State who the requirements are for.
  • Clearly identify, in a header block, who is responsible, which product it pertains to, and the other people that have been involved in identifying the requirements.
  • Sometimes I break requirements down following TELOS – to ensure whoever fills in the template considers Technical, Economic, Legal, Operational, and Scheduling concerns.

Minimum Needs For Requirements

Normally the requirements table needs, as a minimum:

  • Who – the source of the requirement
  • Service/area – gives an indication of the general area/category of the requirement
  • The requirement – should be clearly defined and easy to validate (no fuzzy, vague wording)
  • The rationale – should explain why the requirement exists
  • Priority – follows MoSCoW (Must Have, Should Have, Could Have, Won't Have)
  • Compliance columns – one for each option we assess.

The compliance level normally has a status, and the name of the person who has responsibility for meeting our requirement – for example, if we have a requirement for the network team, someone in the network team needs to agree that they can fulfill it. The compliance statuses I normally include with the name (a sketch of the full table structure follows this list):

  • Full – the device/service/software fully meets the requirement.
  • Partial – the device partially meets the requirement. In this case we would also include words in the table cell to explain why it is only partially compliant.
  • Non-Compliant – is obvious; again, the reason for non-compliance should be stated.
  • Undetermined – we have asked but just don't have an answer yet.
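
As a sketch of how those columns fit together, here is one possible structure in code. The field and option names are hypothetical; a spreadsheet with the same columns works just as well:

```python
from dataclasses import dataclass, field

@dataclass
class Requirement:
    who: str                  # source of the requirement
    area: str                 # service/area it belongs to
    requirement: str          # clearly defined and easy to validate
    rationale: str            # why the requirement exists
    priority: str             # MoSCoW: Must/Should/Could/Won't Have
    # option name -> (compliance status, responsible person, notes)
    compliance: dict = field(default_factory=dict)

req = Requirement(
    who="Network team",
    area="Connectivity",
    requirement="Device supports 802.11ac wireless",
    rationale="Office access points are 802.11ac",
    priority="Must Have",
    compliance={
        "EliteBook": ("Full", "J. Smith", ""),
        "Option B": ("Partial", "A. Jones", "Only on the upgraded model"),
    },
)
```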

Summing It Up…

It's important to capture requirements and then to assess different technologies against those requirements, rather than to look at the feature set a tool or application gives us. If it looks like a solution has a feature we didn't realize we wanted, this is a change in scope for our solution and we should reassess our business case.

In a world where technology is ever changing, it's essential that we document the decisions we make, or we can lose that reasoning over time – or, in bigger projects, end up jumping from meeting to meeting essentially discussing the same thing. This is a cost overhead in time and a risk in terms of miscommunication or the possibility that things get lost. It's possible to discuss requirements in meetings and keep things together within meeting minutes, but architects should be looking to understand these things consistently and to group information together, so that in a year's time, when we look at answering the question "why did we buy this, and can we replace it?", we have something to go back to.

Please feel free to comment 🙂

Risk Analyzing BPMN Models

People often make risk assessments without a structured approach, relying on intuition or experience. In this blog I show some methods I use to quickly identify risks in BPMN models when formal risk methodologies (such as SABSA) are not in use.

As mentioned in previous blogs, I normally do a process overview in ArchiMate and then align it to BPMN, because BPMN is easy to understand, and creating BPMN diagrams forces us to think about who is doing what, when.

Figure 1 – An example BPMN process

Design vs Run-time

At the highest level I normally think of things in terms of design-time vs run-time risks. Design-time risks arise from the structure of the process and the decisions we have made in building it; run-time risks happen when we execute the process. I will talk a little more about this in a while, but it's important we consider both aspects.

Running Through The Process With TELOS

TELOS (Technical, Economic, Legal, Operational, Scheduling) is an acronym we usually use for feasibility checking. If anyone is interested I can blog on this later. For now, the basics: we look at each of the tasks (the square blocks on the diagram) and ask these questions:

Technical

  • Do we have the software/hardware in place to perform this task? If a task requires the user to modify a Visio file, do they have Visio? Are we relying on a specific platform? For example, if we need to read something from CRM, what happens if that goes down? Are we relying on a system that isn't in place yet or is due to be decommissioned?
  • Is it clear how to do this? Having the correct level of detail in the documentation is important. If we are creating or saving files, for example, have we said exactly where to locate them?
  • Do we have the skills and competencies in place to execute this? Can the person who is expected to perform the task actually do it, or do they need training? Are they sufficiently trained if something should go wrong?

Economic

  • Are we doing this in the most economic way? For example, it may be more expensive for a contractor to perform a task than for an internal resource.
  • What would be the economic ramifications of this step going wrong? Would it potentially be a breach in the Service Level Agreement (SLA)? Could it result in a major incident, or financial penalties?

Legal

  • Are we violating any known regulation? For example, non-EU teams may not be allowed to work with personal information under GDPR. Legal requirements may exist for information retention if we are working with financial data, so we may need to ensure it is kept as part of the process.

Operational

  • Have we used the proper roles? It's important that the right person is doing the job. For example, you would not design a process to have a business architect install a server. Even if a particular business architect has the capability to do it, that kind of work most likely belongs to a technical specialist; assigning the wrong resources can cause all kinds of issues.
  • Do we have the resources in place? In order to execute the process, have we thought through how often it will be run, and whether we have sufficient resourcing?

Scheduling

  • Will we be able to do all this when the process goes live? For example, will the training be in place? Will the resources be in place?
  • Can the step be performed in good time? Would the duration it takes to perform the step force a breach of SLA when you add it up with the related steps?

There are many further questions we could ask around a design using TELOS as our guideline, but these are the questions I will typically ask when risk analyzing the design of a process.
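
If you want to make the walkthrough systematic, a trivial script can prompt the questions for every task. This is only a sketch: the task names are hypothetical and would come from your own process model:

```python
# Illustrative design-time checklist: print the TELOS questions for each task.
TELOS_QUESTIONS = {
    "Technical": [
        "Is the software/hardware in place to perform this task?",
        "Is it clear how to do this?",
        "Are the skills and competencies in place?",
    ],
    "Economic": [
        "Are we doing this in the most economic way?",
        "What are the ramifications if this step goes wrong?",
    ],
    "Legal": ["Are we violating any known regulation?"],
    "Operational": [
        "Have we used the proper roles?",
        "Do we have the resources in place?",
    ],
    "Scheduling": [
        "Will we be able to do this when the process goes live?",
        "Can the step be performed in good time?",
    ],
}

def design_time_checklist(tasks):
    for task in tasks:
        print(f"Task: {task}")
        for area, questions in TELOS_QUESTIONS.items():
            for question in questions:
                print(f"  [{area}] {question}")

design_time_checklist(["Receive order", "Update CRM record"])  # hypothetical tasks
```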

Run Time Risks

When analyzing run-time issues I do this very simply: I look at all the relationships between the elements and ask myself the following questions (a small checklist sketch follows this list):

  • What happens if the communication never happens? This normally breaks a process unless a contingency is put in place. A simple example: if we order something that is never delivered, is it a risk we need to handle? We can always handle risk as part of our normal escalation process, but sometimes we want to put steps in place to be a bit more proactive. We could have a timed event in our process that, after a week, checks to see if the delivery arrived and, if not, escalates with the supplier, so that we get the required deliverable before it becomes a critical issue.
  • What happens if the communication is wrong? If we send an order for parts that is incorrect, that's just as bad as the communication not happening – if not worse. Do we need to put in steps to check the communication before it happens, or are we willing to accept the risk?
  • What happens if the communication is delayed? A delayed order may lead to a breach of SLA – do we need something in place to avoid that?
  • What happens if a resource isn't available? As well as individual resources being unavailable for each step, consider what happens if no resources are available to perform a role at all – if nobody is assigned, for example.
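
The same questions can be applied mechanically to a model. Below is a minimal sketch that walks the message and sequence flows in a BPMN 2.0 XML export (the file name is hypothetical) and prints the four questions for each one:

```python
import xml.etree.ElementTree as ET

# Standard BPMN 2.0 model namespace used by .bpmn exports.
NS = {"bpmn": "http://www.omg.org/spec/BPMN/20100524/MODEL"}

RUN_TIME_QUESTIONS = [
    "What happens if this communication never happens?",
    "What happens if this communication is wrong?",
    "What happens if this communication is delayed?",
    "What happens if the resource behind it isn't available?",
]

def run_time_checklist(bpmn_file):
    """List every flow in the model together with the run-time risk questions."""
    root = ET.parse(bpmn_file).getroot()
    flows = (root.findall(".//bpmn:messageFlow", NS)
             + root.findall(".//bpmn:sequenceFlow", NS))
    for flow in flows:
        label = flow.get("name") or flow.get("id")
        print(f"Flow: {label} ({flow.get('sourceRef')} -> {flow.get('targetRef')})")
        for question in RUN_TIME_QUESTIONS:
            print(f"  - {question}")

run_time_checklist("process.bpmn")  # hypothetical export from your modelling tool
```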

What About Events?

The things that start and stop our processes, and the events we trigger and receive during a process, should be handled with the same questions I listed for run-time risks.

Let's Be SMART

It's another acronym – normally used for goal setting, but it applies equally to process. We cover some of these already in TELOS, but for each task in our process:

  • Specific – Is it vaguely defined, or can anyone understand exactly what needs to be done? For example, "check that the architecture is good" can be interpreted in many different ways; "check the architecture against the ISO 42010 checklist" is a bit better defined. In the associated documentation you would also state where the checklist is, of course. If we are vague in definitions there is a risk we will be misunderstood, leading to communication overheads and possibly the wrong things being done (an overhead in work).
  • Measurable – In defining processes we should also be defining metrics to ensure that they are healthy. Those metrics should be clearly defined and measurable. If we have a process step to deliver something from point A to point B, we should probably have a metric for the amount of time that takes, which can act as a key performance indicator. If these indicators aren't defined, that's most likely a design-related risk.
  • Agreed Upon – It's OK to build a process with 10 different actors involved, but each one must agree. Even if you are the manager of those resources, this step is still important, because they may identify issues in the process you do not see. If something is not agreed upon, there's a risk that the execution may not happen according to design.
  • Realistic – You could define a process step such as "Get Owen to eat broccoli". That's fine. Now try getting me to do it. If you define a step like that, there could be an associated risk!
  • Time-Based – You should have a fairly good idea of how long each process step takes. If you cannot define this, it's likely a risk.

Summing It Up…

What I have presented here becomes very easy once you have done it a few times, and without exception, whenever I have personally applied these techniques I have identified unconsidered areas of risk. I haven't talked about mechanisms for determining impact and probability, but if you follow these guidelines I believe you will have a better, more mature risk analysis. Remember, it's not a bad thing to identify risks. When you identify risks, it's OK if the business wants to accept them – and if they don't, you have improvement actions and quality goes up. A risk list with many risks on it shows that you have done a good job in design, and identifying risks is the first step to solving them. These techniques aren't as comprehensive as a full risk management methodology, but they are significantly better than just trying to guess what risks may occur. I hope you find this useful; if so, let me know.

Information and Security Thinking

When I first started working with the Tieto Office 365 internal initiative we hadn't made too many decisions on how to move forward with implementing a collaboration platform. This blog is about the first thoughts I had back then, which still hold true now.

Information Management

At the core of any business, and any collaboration system, is information – the management and protection of that information is one of the keys to its success. Tieto, like pretty much all companies, has information policies, and it's essential that we adhere to them. Some core things to consider (a small model sketch follows this list):

  • Information classification – We have a standard set of classifications, and those classifications determine how we manage information. Anyone is allowed to see public information, whereas confidential information has a controlled access list, for example. Any information we store has a classification, and that has to be identified in our information model. Typically the classification of information is related to the risk of its exposure to various parties.
  • Information ownership – Information is always owned by someone, and that someone is responsible for its classification, although there may be some mandatory rules an information owner needs to adhere to. It's also important to know there is a difference between an information owner and an information author, although in many cases the same person assumes both responsibilities.
  • Information traceability – Establishing ownership is part of this, but we also need to be able to effectively track or locate information.
  • Information life-cycle – It's important to understand whether information is current or outdated, and to establish rules around things such as information retention.
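
Here is a small sketch of what such an information model might capture in code. The classification levels and fields are illustrative assumptions; use the ones your information policy actually defines:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Classification(Enum):
    PUBLIC = "Public"              # anyone may see it
    INTERNAL = "Internal"
    CONFIDENTIAL = "Confidential"  # controlled access list

@dataclass
class InformationAsset:
    title: str
    owner: str                     # responsible for the classification
    author: str                    # may differ from the owner
    classification: Classification
    created: date
    retain_until: date             # retention / life-cycle rule

doc = InformationAsset(
    title="Q3 customer report",
    owner="Jane Doe",
    author="John Roe",
    classification=Classification.CONFIDENTIAL,
    created=date(2018, 9, 1),
    retain_until=date(2025, 9, 1),
)
```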

What this means in real terms is that we need to ensure anyone using our systems can classify information, and it means we have to put mechanisms in place to, in some cases, enforce policy. Discussions started early on over the minimums that our internal security team needs in place – but at its core, before we can do anything, we need to ensure our information needs are managed and then add the layers of security on top of that – for example, we need to consider things like multi-factor authentication. Requirements are drawn up by our security team in collaboration with the architects, and in some cases we need to consider modernizing. Our versioning policy is a prime example of this. On most modern systems we have a simple major/minor approach to version management – many people are unaware of the formal policy we have at work, because our information systems don't support a version expressed as something like 1.0.1-2D.

Requirements Management

With any kind of architecture engagement, requirements management is important; one of the biggest problems I have had in my current role is getting the focus to sit more at the business layer than the technology one.

Security is no exception to this – it's very easy for a security policy to be dictated by the functionality of a tool, and we should be very careful not to let that happen. This is why, with our security team, we have tried to lay out the requirements before even talking technology – even so, I sometimes get the feeling that some of these come directly from a Microsoft manual. It's important that we discuss and balance the requirements, and decide what is and is not in scope. Some things will be mandatory, and some things may not be in scope; that's OK – it can be managed as a risk, and sometimes the business can decide to accept a risk, because business drives security, not the other way round.

Balancing Security

There's a balance. Users in the modern age expect a certain amount of freedom in how they work with information, but at the same time we need some controls in place to protect the organisation and its members.

Too much freedom, and you have risks related to information getting into the wrong hands, being lost, or worse. Too little freedom, and it invites users to find innovative ways to work around the systems you put in place. If I restrict who can access a site, for example, then people will work around it – they may start emailing files around, and suddenly you lose control of where the latest version of your file is, or who has access to it. If I cannot create my own team site, then maybe I will want to use something else. In such a case, by restricting access we have effectively lost control of access altogether.

We have been very mindful of this from the start of the Tieto project – Tieto has many ways of working, and no one way fits all. When we first started on-boarding users into Office 365 some policy decisions were made – people working with specific customers were not allowed to be on Office 365. When it comes to collaboration, not allowing people on creates a very real problem: suddenly those customer teams are alienated from the wider Tieto community, which means we either lose our connection to them or they find a way of working around the mechanisms we have in place. In Tieto, any restrictive policy we put in place is going to impact someone somewhere.

So how do we address this? We already know the information we must keep to maintain a minimal level of security, but more important is for us to understand our information policies.

Knowing Your Responsibilities

As information owners we all know pretty much what we should and shouldn't do, and to be successful we need to have a level of trust that our users will know both our information policy and what their industry/customer does or doesn't allow. Rather than restrict, we need to educate.

For those users working with customers that are not allowed to have information online, we need to ensure we have a system in place that makes it very easy for them to understand where they are – whether on O365 or on our private internal solution. At the top level we decided we would color code: the site theme for Office 365 should be different from the on-premise one, so that we can immediately see where we are publishing.

We need to make sure that as part of this project our communications team makes it fairly clear what our responsibilities are.

How This Is Realized In Technology Terms

We implement mandatory core content types and a basic template from which all others are derived – out-of-the-box SharePoint – and we have taken other decisions on things like multi-factor authentication. We then continued a discussion on how and what we need to implement around EMS and other technologies.

These are the things we were considering at the beginning, and they still form important parts of the ongoing work – because, of course, outside the Office 365 conversation there is also a device management conversation going on.

I hope this gives a little insight into some of the information considerations we had when practically moving to Office 365 & its surrounding technologies.