Incident analysis methods
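This article gives an overview of the ‘best practice’ incident analysis methods used in different industries. These methods are: Root Cause Analysis (as used in TOP-SET®), the 5 Why method, the BlackBox Analysis Diagram, Tripod Beta, the Incident BowTie, Fault Tree analysis, Event Tree analysis and SCAT.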
What are incidents?
An incident is an unplanned event or chain of events that results in losses such as fatalities or injuries, or damage to assets, equipment, the environment, business performance or company reputation. A near miss is an event that could have resulted in these losses, but where the chain of events stopped in time to prevent them. Incidents can be classified by severity and type, and thus into categories. Investigation and cause analysis should take these categories into consideration.
TOP-SET® Root Cause Analysis
Root Cause Analysis is the drawing of a diagram that displays the relationships between the causes of an event. The method is aimed at finding the Root Causes of the event. Solving the problems described in the Root Causes lowers the probability that the incident (and other events that share the same Root Causes) will recur. The Root Cause Analysis diagram distinguishes three types of causes: Immediate Causes, Underlying Causes and Root Causes. The investigator moves through these causes by asking ‘Why?’ until the level of the Root Causes is reached. The answer to each ‘Why?’ question is the next item in the diagram. This creates a Cause-Consequence tree that can resemble an Event tree.
The TOP-SET® method is based on the Root Cause Analysis method. The TOP-SET® incident investigation methodology, of which Root Cause Analysis is a part, was developed in 1988. It is a best-practice approach to incident investigation based on years of experience investigating incidents for companies worldwide. It incorporates both investigation and analysis components to form a complete investigation process that takes the investigator from developing a team, gathering data and investigating to generate evidence, to interviewing witnesses, analyzing evidence, preparing recommendations and actions, and reporting. The method is used in over 30 countries in many industrial sectors, including oil and gas production, explosives, rail transport and the maritime industry.
The "5 Why" method
The 5 Why method is a way of conducting incident analysis that was originally developed by Sakichi Toyoda and was later used within Toyota Motor Corporation during the evolution of its manufacturing methodologies. It is a simple but effective method to find the cause of incidents.
The 5 Why method is a question-asking method used to understand the cause/consequence relationships that underlie a particular problem. The ultimate goal of applying the 5 Why method is to determine a root cause of an event or problem. The idea is to ask why the event happened, and then to ask why again for each answer until you reach the root cause of the event. Originally the method held that five iterations of asking why are generally sufficient to get to a root cause, but nowadays a sixth, seventh or even deeper level is used as well. The purpose remains to find the root cause of the original event through any number of levels of abstraction and to encourage the user to avoid assumptions and logic traps. The answer to the last question, the root cause, should always be an organizational factor on a systemic process level. To reach this level it is advisable to ask ‘Why did the process fail?’ instead of ‘Why?’ once the fifth level is reached. The thought behind the 5 Why method is: "People do not fail, processes do!". This method is closely related to the Cause & Effect (Fishbone) diagram.
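The repeated ‘Why?’ questioning can be sketched as a simple linked chain. The following is a minimal illustration only; the `Why` class, the helper functions and the example incident are all hypothetical and are not part of any 5 Why tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Why:
    """One statement in a why-chain; `cause` is the answer to the next 'Why?'."""
    statement: str
    cause: Optional["Why"] = None


def why_chain(event: Why) -> list[str]:
    """Walk from the event down to the last recorded answer."""
    chain = []
    node: Optional[Why] = event
    while node is not None:
        chain.append(node.statement)
        node = node.cause
    return chain


def root_cause(event: Why) -> str:
    """Treat the last answer in the chain as the root cause."""
    return why_chain(event)[-1]


# Hypothetical incident, invented for illustration only.
incident = Why(
    "Operator slipped on the workshop floor",
    Why("Oil had leaked onto the floor",
        Why("A gasket on the pump failed",
            Why("The gasket was past its service life",
                Why("The pump was not on the maintenance schedule",
                    Why("No process exists for adding new equipment to the schedule"))))),
)

for depth, statement in enumerate(why_chain(incident)):
    label = "Event" if depth == 0 else f"Why {depth}"
    print(f"{label}: {statement}")
print("Root cause:", root_cause(incident))
```

In this sketch the chain stops at a process-level answer, in line with the advice to ask ‘Why did the process fail?’ at the deepest level. The number of levels is not fixed at five.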
BlackBox Analysis Diagram
The Root Cause Analysis method used in the TOP-SET method has been adapted and simplified for use in the BlackBox tool. BlackBox is a software tool that guides you through all phases of incident analysis. The tool is used for reporting smaller, low-risk incidents or near misses and for analyzing their underlying causes. The program consists of a fixed workflow that leads you through all these steps to produce a complete incident report. It is designed to make sure all fields are filled in, so that reports are standardized and contain the same sections. This gives more guidance and makes it easier to make quick but accurate analyses.
TOP-SET incident investigation
BlackBox is based on the TOP-SET® incident investigation and analysis methodology. This method treats the incident as a system that the organization has lost control over. Something must have changed from the situation before the incident to bring the system out of balance. The dynamics of a system can be expressed in different components. TOP-SET identifies six elements for investigating what caused the incident: Technology, Organization, People, Similar Events and Environment, which are displayed against Time. The initials of these elements form the acronym TOP-SET. The cause of the incident can be found in at least one of the first five elements, but most likely in an interaction of more than one element. The Time element can serve as a check on the causality of the facts found in the other five elements. In BlackBox the Time element is used in the ‘Sequence of Events’ step, in which Events and their time before the incident can be identified. The elements Technology, People, Organization and Environment are used in the Incident Analysis. At least one of these elements should be worked out in order to find the underlying causes. The element ‘People’ is in red in the Figure below because this element plays a role in every incident.
In the BlackBox Analysis investigation diagram you analyze at least one of the four elements of the TOP-SET method that you feel played a role during the incident: Technology, Environment, Organization and People. An analysis is made for each of the chosen elements. The analysis per element consists of a minimum of 3 and a maximum of 5 items: an Immediate Cause, a Missing Barrier and Underlying Causes. The analysis starts with a single Immediate Cause that triggered the incident directly. The reason the Immediate Cause could occur is the next item: the Missing Barrier. After the Missing Barrier has been identified, the reason this barrier failed is explained by the Underlying Causes. You need to analyze at least one and at most three Underlying Causes. All the causes in the diagram should be linked with a generic cause from a predefined list. Each of the four elements has its own generic cause list for Immediate Causes and for Underlying Causes, making eight different lists. The analysis will therefore contain a specific description written by the user and a linked generic cause. The generic causes of different incidents can be compared with each other because they are all picked from the same lists in BlackBox. The generic causes can be counted to make a trend analysis.
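The structure of a per-element analysis and the counting of generic causes for a trend analysis can be sketched as follows. This is an illustrative data model only, not the BlackBox software's actual format; the class names, the generic cause labels and the example analyses are all assumptions made for the sketch.

```python
from collections import Counter
from dataclasses import dataclass, field

ELEMENTS = {"Technology", "Organization", "People", "Environment"}


@dataclass
class Cause:
    description: str    # specific description in the user's own words
    generic_cause: str  # label picked from a predefined list (placeholders here)


@dataclass
class ElementAnalysis:
    element: str
    immediate_cause: Cause
    missing_barrier: str
    underlying_causes: list[Cause] = field(default_factory=list)

    def validate(self) -> None:
        """Check the structural rules described in the text."""
        if self.element not in ELEMENTS:
            raise ValueError(f"unknown element: {self.element!r}")
        if not 1 <= len(self.underlying_causes) <= 3:
            raise ValueError("expected one to three Underlying Causes")

    def item_count(self) -> int:
        # One Immediate Cause + one Missing Barrier + the Underlying Causes.
        return 2 + len(self.underlying_causes)


def generic_cause_trend(analyses: list[ElementAnalysis]) -> Counter:
    """Count generic causes per (element, cause level), i.e. per list."""
    trend: Counter = Counter()
    for analysis in analyses:
        analysis.validate()
        key = (analysis.element, "Immediate", analysis.immediate_cause.generic_cause)
        trend[key] += 1
        for cause in analysis.underlying_causes:
            trend[(analysis.element, "Underlying", cause.generic_cause)] += 1
    return trend


# Two hypothetical analyses from different incidents.
analyses = [
    ElementAnalysis(
        element="People",
        immediate_cause=Cause("Operator skipped the pre-start check", "Procedure not followed"),
        missing_barrier="Pre-start checklist sign-off",
        underlying_causes=[Cause("Shift was short-staffed", "Time pressure")],
    ),
    ElementAnalysis(
        element="People",
        immediate_cause=Cause("Valve opened in the wrong order", "Procedure not followed"),
        missing_barrier="Supervisor verification",
        underlying_causes=[
            Cause("Handover was rushed", "Time pressure"),
            Cause("New employee had not been briefed", "Inadequate instruction"),
        ],
    ),
]

for key, count in generic_cause_trend(analyses).most_common():
    print(count, key)
```

Keying the counter by element and cause level keeps the eight generic cause lists separate, so identical labels from different lists are not merged by accident.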
Tripod Beta
The Tripod Beta method was developed on the basis of research done in the late 1980s and early 1990s into human behavioral factors in incidents. The research was commissioned by Shell International and carried out by the University of Leiden and the Victoria University of Manchester. The research question was: ‘Why do people make mistakes?’ The answer was that organizations expose them to an imperfect working environment. This does not mean people will not make mistakes when they work in a ‘perfect’ working environment, but the working environment is the aspect that organizations have control over and can therefore change for the better.
The Tripod Beta method analyzes which barriers have broken during an incident, the error or mistake made, the working environmental aspect that encouraged this and finally the latent failure in the organization that caused that mechanism. A Tripod Beta analysis process follows three steps:
- Identify the chain of events preceding the consequences
- Identify the barriers that should have stopped this chain of events
- Identify the reason for failure of each broken barrier. This should be broken down into the human failure (Active Failure), the working environmental aspects (Preconditions) and the Latent Failure in the organization.
To identify why the barriers broke, Human Error theory is kept in mind. The analysis investigates what error was made, what failure in the working environment caused this error and what latent failure caused that working environment to be present. The core of a Tripod analysis is a ‘tree’ diagram representation of the incident mechanism, which describes the events and their relationships.
Incident BowTie
The ‘Incident BowTie’ method was developed because there was a demand for doing incident analysis within the BowTie diagram. The BowTie diagram contains a lot of information about the ways incidents can happen and how to prevent them. Adding information about actual incidents therefore has a lot of value. This information can ‘prove’ the effectiveness of barriers and the prevalence of Threats, TopEvents and Consequences. Incidents can also show whether there are any holes in the risk analysis, that is, whether all the scenarios are covered. In the Incident BowTie method all this information is displayed in one diagram.
The ‘Incident BowTie’ analysis method combines two analysis methods: BowTie risk analysis and Tripod incident analysis. The method brings the advantages of both worlds together. The information from the BowTie analysis can be used as input for the incident analysis, viewing it from a broader perspective and making sure all the possible scenarios are taken into account. The input from the Tripod incident analysis can be used to make the BowTie analysis more realistic and up to date, using real-life data. It creates an extra layer in the BowTie diagram, making it possible to add more specific information to the risk analysis. The two methods have an important similarity in their analysis technique: the barriers. In both methods barriers are used, either to show what is done to prevent incidents or events (BowTie) or to show where the failures lie (Tripod). To build an ‘Incident BowTie’ diagram, the items from both methods are connected at the level of the barriers, making it possible to collect information about those barriers from two viewpoints.
An incident can be mapped on an existing or newly developed BowTie risk analysis diagram. BowTie risk analysis is a proactive method that maps different risk scenarios, giving a visual representation of a hazard and of how you can lose control over that hazard. The left side of the diagram represents all the scenarios (the Threats) that can lead to the TopEvent, which is the moment control over the Hazard is lost. The right side of the diagram represents all the scenarios that can follow from the TopEvent (the Consequences). For each scenario, barriers are used to show how loss of control is prevented. Control measures show how Threats can be prevented and recovery barriers show how Consequences can be prevented.
The BowTie method is mentioned in the guidelines of the International Association of Drilling Contractors (IADC) as a preferred way of doing risk analysis and is therefore used by many oil and gas companies. These companies use their predefined BowTie risk assessments to map incidents on. This is possible when the BowTies are virtually complete, which allows the barriers from the incident analysis to be translated to the barriers in the BowTie. For companies that do not have such risk assessments in place when an incident happens, the Incident BowTie method is more difficult to apply. Making a BowTie risk analysis after an incident has happened narrows the free thought process that is necessary to identify all the possible scenarios in a BowTie diagram.
Fault Tree analysis
The Fault Tree analysis method was originally developed in 1962 at Bell Laboratories by H.A. Watson, under a U.S. Air Force Ballistics Systems Division contract. The method received extensive coverage at a 1965 System Safety Symposium in Seattle sponsored by Boeing and the University of Washington. In the 1970s the U.S. Federal Aviation Administration (FAA) and the U.S. Nuclear Regulatory Commission began prescribing Fault Tree analysis as a part of mandatory risk assessment. The use of fault trees has since gained widespread support, and engineering disciplines often use it as one of the primary methods of performing reliability and safety analysis.
Fault Tree analysis is a deductive reasoning method (from generic to specific information) for determining the causes of an incident. A Fault Tree is a vertical graphic model that displays the various combinations of unwanted events that can result in an incident. The diagram represents the interaction of these failures and events within a system. Fault Tree diagrams are logic block diagrams that display the state of a system (TopEvent) in terms of the states of its components (basic events). A Fault Tree diagram is built top-down, starting with the TopEvent (the overall system) and going backwards in time from there. It shows the pathways that connect the TopEvent to the foreseeable, undesirable basic events that can lead to it. Each event is analyzed by asking, “How could this happen?” The pathways interconnect contributory events and conditions, using gate symbols (AND, OR). An AND gate represents a condition in which all the events shown below the gate must be present for the event shown above the gate to occur. An OR gate represents a situation in which any of the events shown below the gate can lead to the event shown above the gate.
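The AND/OR gate logic can be illustrated with a small sketch that evaluates whether a TopEvent occurs for a given set of basic events. The classes, the `occurs` function and the tank example are hypothetical and only serve to show how the gates combine; this is not a full Fault Tree tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicEvent:
    name: str


@dataclass(frozen=True)
class Gate:
    name: str
    kind: str      # "AND" or "OR"
    inputs: tuple  # BasicEvent or Gate nodes shown below this gate


def occurs(node, active_events: set[str]) -> bool:
    """Return True if the event at `node` occurs, given the active basic events."""
    if isinstance(node, BasicEvent):
        return node.name in active_events
    results = [occurs(child, active_events) for child in node.inputs]
    if node.kind == "AND":
        return all(results)  # every event below the gate must be present
    if node.kind == "OR":
        return any(results)  # any single event below the gate is enough
    raise ValueError(f"unknown gate kind: {node.kind!r}")


# Hypothetical fault tree, built top-down from the TopEvent.
top_event = Gate(
    "Tank overflows",
    "AND",
    (
        Gate(
            "Filling does not stop",
            "OR",
            (BasicEvent("Level sensor fails"), BasicEvent("Inlet valve sticks open")),
        ),
        BasicEvent("High-level alarm is missed"),
    ),
)

scenario = {"Level sensor fails", "High-level alarm is missed"}
print(top_event.name, "->", occurs(top_event, scenario))
```

In this example a sensor failure alone does not produce the TopEvent, because the AND gate also requires the alarm to be missed.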
Event Tree analysis
The Event Tree analysis method is used to analyze the event sequences that follow an initiating event. The method is widely used in many fields such as finance, economics, reliability, risk assessment and numerous other probabilistic types of analysis. Event Trees help in creating a holistic picture of the risks and rewards associated with each possible course of action. The method is popular due to its simplicity.
The Event Tree analysis method is a bottom-up, inductive method. It reasons forward from a specific initiating event to the possible outcomes. The diagram that is built gives a horizontal graphical representation of the logic model that identifies the possible outcomes following an initiating event. The event sequence is influenced by the success or failure of the applicable barriers or safety functions/systems. The event sequence leads to a set of possible consequences. Each combination of successes or failures of barriers leads to a specific consequence or event. The method can also be used quantitatively to calculate the probability of each outcome or consequence, given the failure probability of each barrier.
An Event Tree begins with an initiating event, a Top Event. Examples are:
- Increase in temperature/pressure
- Release of a hazardous substance
The consequences of the event are followed through a series of possible paths. The paths represent the failure or success modes of the assigned barriers for the particular event. Each barrier can be assigned a probability of failure. Examples of barriers are:
- Ignition prevention
- Emergency response
The combined probability of the barrier successes and failures along each path gives the probability of occurrence for each outcome or consequence. Examples of consequences are:
- Financial losses
- Environmental damage
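The quantitative use of an Event Tree can be sketched as follows. The sketch assumes that the barriers fail independently of each other and that every barrier is evaluated on every path; real event trees often prune branches. The function name and the failure probabilities are made up for illustration.

```python
from itertools import product


def event_tree_outcomes(initiating_probability, barriers):
    """Return the probability of every path through the tree.

    `barriers` is a list of (name, failure_probability) pairs. Each path is a
    tuple of booleans, one per barrier, where True means the barrier worked.
    """
    outcomes = {}
    for states in product((True, False), repeat=len(barriers)):
        probability = initiating_probability
        for (_, p_fail), works in zip(barriers, states):
            probability *= (1.0 - p_fail) if works else p_fail
        outcomes[states] = probability
    return outcomes


# Hypothetical failure probabilities, for illustration only.
barriers = [("Ignition prevention", 0.1), ("Emergency response", 0.2)]
outcomes = event_tree_outcomes(1.0, barriers)

for states, probability in outcomes.items():
    path = ", ".join(
        f"{name} {'works' if works else 'fails'}"
        for (name, _), works in zip(barriers, states)
    )
    print(f"{path}: {probability:.3f}")
```

With an initiating probability of 1.0 the results are conditional on the initiating event having occurred, so the path probabilities add up to 1.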
SCAT
The SCAT analysis method was developed by DNV risk consultancy about 20 years ago as part of the ISRS (International Safety Rating System) guidelines. The SCAT version that corresponds with the 6th version of the ISRS is discussed below. This version addresses a full range of loss control events, but it focuses explicitly on occupational health and safety incidents. The newest version of the SCAT method following ISRS 8 will be discussed in the next section.
SCAT (Systematic Cause Analysis Technique) is a widely used methodology for structured analysis of incidents. It is a vertical root cause analysis approach that incorporates the DNV ‘Loss Causation Model’. The analysis is based on predefined categories of loss events, their potential direct and basic causes, and guidance towards a management system structure for improvement actions. The SCAT method guides the user systematically to work backwards from the loss to identify where the organization lacks control over the deficiencies that led to the incident.
Good preparation before building the SCAT diagram is to make a timeline of the incident. This helps to get a good overview of the events that occurred during the incident. The timeline is then broken down into sections by choosing the key events that will be analyzed in the SCAT diagram. When the Events are chosen, a cause path is followed that explains why the incident happened. The cause path consists of five items: the Loss, Event, Direct Cause, Basic Cause and Lack of Control. A Loss is the main consequence of the incident. It represents unintended harm or damage, for example damaged equipment, a broken arm or loss of production.
A SCAT analysis can only have one Loss. When the user wants to analyze more Losses, multiple SCAT diagrams need to be made. A Loss can be the result of one or more Events. An Event is a happening or a moment in which the state of the incident changes. Each Event is analyzed with a cause chain of three cause types. The Direct Cause is a substandard act or substandard condition that triggered the Event. Examples are:
- Inspection not performed by new employee
- Failure to secure lift
- Safety valve is broken
The Basic Causes include personal and job or system factors that together made it possible for the Direct Cause to occur. Examples are:
- Maintenance department understaffed
- High workload
- Wear and Tear
A Lack of Control factor can be an inadequate program, inadequate standards or inadequate compliance with standards that causes the Basic Causes to occur. These factors always act on an organizational, latent level. They influence a range of unsafe conditions and can therefore cause different incidents. Examples are:
- Inadequate leadership
- No task or risk assessments
- Lack of training
These causes can be defined specifically in one’s own words or with use of the DNV SCAT chart. This chart gives a list of generic descriptions for each of the causes. Picking the descriptions from the SCAT chart can be very useful when comparing different incidents. Every user will pick from the same list for every incident. For each cause level there can be multiple items per incident explaining the event. Actions for improvement can be made on every cause level, but will be most effective on the Lack of Control causes because these will address the latent failures in the organization.
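The SCAT cause path and the comparison of incidents by generic cause can be sketched as a small data model. This is an illustration only and not the DNV SCAT chart; the class and function names are assumptions, and the way the example causes are grouped into two incidents is invented.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    description: str
    direct_causes: list[str] = field(default_factory=list)
    basic_causes: list[str] = field(default_factory=list)
    lack_of_control: list[str] = field(default_factory=list)


@dataclass
class ScatAnalysis:
    loss: str  # a SCAT analysis has exactly one Loss
    events: list[Event] = field(default_factory=list)


def outline(analysis: ScatAnalysis) -> list[str]:
    """List the cause path, working backwards from the Loss."""
    lines = [f"Loss: {analysis.loss}"]
    for event in analysis.events:
        lines.append(f"  Event: {event.description}")
        levels = (
            ("Direct Cause", event.direct_causes),
            ("Basic Cause", event.basic_causes),
            ("Lack of Control", event.lack_of_control),
        )
        for level, items in levels:
            for item in items:
                lines.append(f"    {level}: {item}")
    return lines


def lack_of_control_counts(analyses: list[ScatAnalysis]) -> Counter:
    """Count Lack of Control factors across incidents for comparison."""
    counts: Counter = Counter()
    for analysis in analyses:
        for event in analysis.events:
            counts.update(event.lack_of_control)
    return counts


# Hypothetical incidents; the grouping of causes is invented for illustration.
first = ScatAnalysis(
    loss="Damaged equipment",
    events=[Event(
        "Load dropped during lifting",
        direct_causes=["Failure to secure lift"],
        basic_causes=["High workload"],
        lack_of_control=["No task or risk assessments"],
    )],
)
second = ScatAnalysis(
    loss="Loss of production",
    events=[Event(
        "Overpressure in the vessel",
        direct_causes=["Safety valve is broken"],
        basic_causes=["Wear and Tear", "Maintenance department understaffed"],
        lack_of_control=["No task or risk assessments", "Inadequate leadership"],
    )],
)

print("\n".join(outline(first)))
print(lack_of_control_counts([first, second]).most_common())
```

Because both incidents use the same generic descriptions, the shared Lack of Control factor shows up in the count, which is where improvement actions are expected to be most effective.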