I’m on the plane back from another set of VCDX panels, and I’m reminded of an area of Enterprise Architecture that seems to be glossed over: Risk Management.
I read through a lot of design submissions that are light in this area. This is concerning to me because any Enterprise Architecture methodology worth its salt takes Risk Management seriously and handles it via some kind of formal process. Before I give my definition of Risk Management, though, let me start by saying that it definitely is not what many people seem to think it is: a simple table listing a handful of technical SPOFs followed by statements suggesting the customer agreed to these. True Risk Management is not about CYA or checking a box or filling out the yellow blanks in a deployment kit.
On the other hand, the point of Risk Management is NOT the elimination of all possible risks. Risk and value are often intertwined – if you eliminate all risk, you likely eliminate all value. A simple example would be the elimination of all possible network security risk by unplugging everything from the network. You obviously can’t get away with that, so some network security risk is going to be part of any design. There are things you can do to mitigate network security risks to a lesser or greater degree. Which mitigation you choose is going to be based on the likelihood of a breach, the impact if it does happen, the costs associated, the risk tolerance of the business, possible value gained by accepting the risk, etc.
Here is the formal definition of Risk Management from Wikipedia:
Risk management is the identification, assessment, and prioritization of risks (defined in ISO 31000 as the effect of uncertainty on objectives) followed by coordinated and economical application of resources to minimize, monitor, and control the probability and/or impact of unfortunate events or to maximize the realization of opportunities. Risk management’s objective is to assure uncertainty does not deflect the endeavor from the business goals.
My personal take:
Risk management is a formal process that the architect follows to minimize the impact of uncertainty. Which process you follow is less important than the fact that you follow one. Some architects follow one of the many published, standardized risk management systems. Others (like me) put together their own simplified version based on the important parts of such systems.
Examples of standard risk management methodologies include:
My simplified approach:
If you read through the various systems out there, you’ll find that they more or less boil down to the following things:
- Identification of risk and expression of the impact in business terms.
- Quantification of the likelihood of said impact occurring.
- Analysis of multiple approaches to mitigation.
- Transparent presentation of mitigation options to appropriate business stakeholders so that they can make an informed selection.
- Validation that the mitigation does what it claims it will do.
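To make the five steps concrete, here is a minimal sketch of what a single risk register entry could look like as a data structure. The class and field names (`Risk`, `MitigationOption`, `annual_likelihood`, and so on) are hypothetical and chosen only for illustration; they do not come from any published risk management standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MitigationOption:
    """One candidate response to a risk (hypothetical structure)."""
    name: str
    business_impact: str      # impact expressed in business terms
    impact_cost_usd: float    # estimated cost if the impact occurs
    annual_likelihood: float  # probability per year, 0.0 to 1.0
    validated: bool = False   # set once testing confirms the mitigation works

    def __post_init__(self) -> None:
        if not 0.0 <= self.annual_likelihood <= 1.0:
            raise ValueError("annual_likelihood must be between 0 and 1")


@dataclass
class Risk:
    """A risk register entry: identified risk plus the options analyzed."""
    risk_id: str
    description: str
    options: List[MitigationOption] = field(default_factory=list)
    selected: Optional[str] = None  # name of the option the stakeholders chose

    def select(self, option_name: str) -> MitigationOption:
        """Record the stakeholders' choice; it must be a presented option."""
        for option in self.options:
            if option.name == option_name:
                self.selected = option_name
                return option
        raise KeyError(f"{option_name!r} was never presented as an option")

    def is_complete(self) -> bool:
        """True once an option has been both selected and validated."""
        return any(
            o.name == self.selected and o.validated for o in self.options
        )
```

The point of the `select` and `is_complete` checks is simply that a risk is not "done" until stakeholders have picked from the options actually presented and the chosen mitigation has been tested.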
Let’s walk through an example risk to see how I work through it:
RISK 101 – Unexpected widespread loss of VMs’ CBT (Changed Block Tracking) data causes synthetic backups to convert to full backups
Technical implications – A given night’s synthetic/incremental backups may not complete within the off-peak window of 12am-6am.
Mitigation Scenario A — Allow the backups to finish
Business Impact – If the backup is allowed to continue past 6am, disk IO performance SLAs may not be met for one or more tenants for up to 4 hours. This could result in as much as $9,000 in SLA credits if all tenants are affected for the entire period.
Impact Likelihood – Based on historical mean disk utilization data for all tenants, and analysis of past CBT data loss events, the likelihood of this scenario occurring is estimated at 75% chance of one instance per fiscal year.
Mitigation Scenario B – NOC stops the job at 6am
Business Impact – Any VMs that could not be fully backed up during the previous night will get caught up over the course of the next three nights’ backup windows. If a data loss event hits a critical VM that has not yet been backed up during this three-night window, a given business unit could face as much as $100,000 in lost revenue due to the violation of agreed-upon tenant RPOs.
Impact Likelihood – Based on historical data loss event patterns, the likelihood of this scenario unexpectedly occurring is estimated at 15% chance of one instance per fiscal year.
Mitigation Scenario C – Eliminate the risk altogether by not using synthetic backups which are reliant on CBT
Business Impact – This will necessitate a rearchitecture of the shared Commvault backup infrastructure. Such an approach would also require the purchase of additional network and storage hardware to raise the data ingestion speed such that we could conduct full backups every evening. CAPEX outlay for this option is predicted to be $250,000.
Impact Likelihood – If this mitigation option is chosen, the likelihood of incurring this cost is 100%.
Now that I’ve fleshed out three mitigation strategies, the business stakeholders can line the options up in a simple way:
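| Mitigation option | Likelihood (per fiscal year) | Business impact |
|---|---|---|
| Rearchitect to avoid CBT-reliant synthetic backups | 100% | $250,000 |
| NOC stops backups @ 6am | 15% | $100,000 |
| Allow backups to finish | 75% | $9,000 |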
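One rough way to compare the options numerically is expected annual cost, meaning likelihood multiplied by impact. The sketch below uses only the figures from the three scenarios above. Treating the comparison as a simple product is an assumption made for illustration: it ignores risk tolerance, and it treats the one-time CAPEX of scenario C as if it were an annual cost.

```python
# Likelihood and impact figures come from mitigation scenarios A, B and C.
# Expected cost = likelihood x impact is an assumed simplification.
options = {
    "A: allow backups to finish": (0.75, 9_000),
    "B: NOC stops backups at 6am": (0.15, 100_000),
    "C: rearchitect without CBT": (1.00, 250_000),
}


def expected_cost(likelihood: float, impact_usd: float) -> float:
    """Return the probability-weighted cost of one option."""
    return likelihood * impact_usd


# Sort the options from lowest to highest expected cost.
ranked = sorted(options.items(), key=lambda kv: expected_cost(*kv[1]))

for name, (likelihood, impact) in ranked:
    cost = expected_cost(likelihood, impact)
    print(f"{name}: {likelihood:.0%} x ${impact:,} = ${cost:,.0f}")
```

On these numbers, scenario A works out to about $6,750, scenario B to about $15,000, and scenario C to the full $250,000. A single number like this is only an input to the conversation; the stakeholders still make the call.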
In the case of this particular customer it was decided that simply living with a 75% annual risk that they’d have to pay out $9k in SLA credits was the best option. Therefore we added the following things to our design:
- Added a $9k line item to the budget.
- Gave the NOC a run book for handling this situation.
- Added a step to our testing plan that simulates a sudden CBT loss and a full backup of randomly generated sample data sized comparably to production.
Does an architect really need to do that level of analysis for all risks?
At first, yes. Even if you have to run through this exercise for 20 risks, it is worth the effort. As you do more and more designs for customers, you will start to see patterns. You’ll get to where you can turn these analyses out in a template-like fashion and it won’t seem so arduous.
Remember that Enterprise Architecture is much like the scientific method: it’s there to keep your biases, experience, habits and so forth from corrupting the output. Fully transparent Risk Management is critical to long-term project success!