A hazard is the basic management unit of system safety. Almost all safety activities throughout a system lifecycle have, as a precondition, a list of the hazards that need to be managed. For each hazard we may undertake risk assessment, selection of mitigations, specification of safety requirements, detailed causal analysis, verification and testing, and through-life safety management. All of this supposes that we have identified the right set of hazards to begin with. Any hazards that we fail to identify create holes running right through our safety lifecycle.
How often does this happen? A 2012 study by Neil Barton found that around 20% of major accidents can be blamed on failures of the hazard identification process. There’s a subtle but important point to be made about this statistic. It tells us what proportion of accidents involve poor hazard identification, but it doesn’t tell us how often hazard identification is done badly. Other research gives us this information. A 2006 study of construction projects found that less than 10% of hazard identifications were complete, with typically only 60 to 80 percent of the hazards that should have been identified actually being identified. This matches other research into particular hazard identification techniques, where individual methods find between 20 and 80 percent of all of the known hazards.
Put simply, hazard identification is often done very badly, and bad hazard identification makes a significant contribution to actual accidents.
To understand why completeness is such a problem for hazard identification, imagine a Mark 5 Widget. Can you, in principle, tell me all of the hazards for this device? Certainly not, because you don’t know what a Mark 5 Widget is. So I give you all of the design information for a Mark 5 Widget. Can you now tell me all of the hazards? The answer is still no, because the hazards don’t just come from what a Mark 5 Widget is, but from how it is used. So I give you the widget, the design information, and a crystal ball which tells you exactly how and in what environments the Mark 5 Widget will be operated and maintained. Can you now tell me all of the hazards? In principle, yes, you can. However, can you prove to me that your list of hazards is complete? There’s the problem. Even if you have a perfect list of hazards, neither you nor anyone else can tell whether the list is complete. Of course, it’s easy to prove that the list is incomplete. We can do that simply by finding a hazard that isn’t on the list.
So if we can never prove that we’ve identified all of the hazards, where does that leave us? We need to have confidence that our list of hazards is good enough. We’ll end up using weasel words like “reasonably complete”, or “to the best of our knowledge”. Our argument to back up these weasel words will have three pillars: the competence of the team identifying the hazards, the methods that were used, and the information that fed into those methods.
Team problem solving is too big a topic for this episode. For now, let’s just take as a given that hazard identification is the sort of problem where teams outperform individuals, so long as we get the team dynamics right. At a minimum, hazard identification teams must include designers, maintainers, users, technology specialists, and safety practitioners. One of the most common mistakes is to include representatives of these people rather than the people themselves. Your operations manager is typically not an operator, they are a manager. Your design team lead is not necessarily a specialist in every technology being used to build the system.
In terms of methods, there are three classes of techniques available for hazard identification. These are experience, checklists, and structured brainstorming. Experience involves selecting one or more existing systems or projects similar to the new system, and adopting or adapting the hazards from those systems. Since every project contains some novel element, experience will never be quite enough – part of the method involves a gap analysis showing where our system diverges from experience, and what new hazards this novel component may bring.
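The gap-analysis step can be sketched as a simple comparison between the reference system and the new one. This is a minimal illustration only: the feature names are invented, and in practice the comparison covers design, environment, and usage, not just a feature list.

```python
# Sketch of the gap-analysis step in experience-based hazard
# identification. All feature names here are hypothetical examples.
reference_features = {"hydraulic arm", "diesel engine", "manual controls"}
new_features = {"hydraulic arm", "battery pack", "remote operation"}

# Features shared with the reference system: its hazards may carry over.
inherited_scope = new_features & reference_features

# Features the reference system never had: experience says nothing about
# these, so they need fresh hazard identification.
novel_scope = new_features - reference_features
```

The point of the sketch is that `novel_scope` is exactly the part of the new system where adopting hazards from experience is not enough.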
Checklists come in two flavors – generic, and industry specific. Generic checklists are sets of prompts to apply to your new system. The archetypal example is a list of energy types. It’s very difficult to kill someone without applying energy to them, so thinking through all of the types of energy present in a system is a great systematic way to identify hazards. The two ways to kill someone without applying energy are to poison them, or to deprive them of something they need to live, such as oxygen. When I’m using an energy checklist, I always include toxins and suffocation as honorary types of energy. Another common generic checklist is a list of operational phases. Hazard identification often misses phases such as maintenance, cleaning and recovery, so a checklist is a good way to address these possible omissions.
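Applying an energy checklist is essentially turning each energy type into a prompt for the team to answer. A minimal sketch, with an illustrative (not authoritative) list of energy types, plus the two honorary entries discussed above:

```python
# Illustrative generic energy checklist; the entries are examples, not a
# published standard. Toxins and suffocation are included as "honorary"
# energy types, as discussed above.
ENERGY_TYPES = [
    "kinetic", "electrical", "thermal", "chemical", "pressure",
    "gravitational", "radiation",
    "toxins", "suffocation",
]

def prompt_hazards(system_name, energy_types=ENERGY_TYPES):
    """Generate one prompt per energy type for the team to work through."""
    return [
        f"Where is {energy} energy present in {system_name}, "
        f"and how could it reach a person?"
        for energy in energy_types
    ]

prompts = prompt_hazards("Mark 5 Widget")
```

The same pattern works for an operational-phases checklist: swap the energy list for a list of phases such as maintenance, cleaning and recovery.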
Industry-specific checklists are very common in industries which have a mature safety engineering community. For example, I’m aware of a couple of very good rail and automotive checklists.
The thing that experience and checklists have in common is that they are very poor at identifying genuinely new hazards. That’s where structured brainstorming comes in. Example techniques in this space include energy-trace-and-barrier analysis and action-error analysis. There are a few more techniques which are sometimes claimed to be hazard identification methods, but in practice they are used to flesh out the causes of known hazards rather than to directly identify hazards.
Of the three classes of techniques, which one should you use? The available evidence strongly suggests that you should use more than one. Applying a single technique well will net you between 60 and 80 percent of the hazards you would find by applying two techniques. A common strategy is to choose one method for primary hazard identification, and another method for review or validation of the main exercise.
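Why two techniques beat one can be shown with a toy example. The hazard labels below are made up; the point is that the two result sets overlap but neither covers the combined list, which is consistent with the 60 to 80 percent figure above.

```python
# Toy illustration of why combining techniques helps; hazard labels are
# entirely hypothetical.
found_by_checklist = {"H1", "H2", "H3", "H4"}
found_by_brainstorm = {"H3", "H4", "H5"}

combined = found_by_checklist | found_by_brainstorm

# The single best technique covers only a fraction of the combined list.
coverage_single = len(found_by_checklist) / len(combined)
```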
To finish, let’s just reflect back on those three pillars. An appropriate set of people, using appropriate techniques, with the appropriate information. This tells you what needs to be recorded from the hazard identification. A list of hazards is not a good enough record – it tells me nothing about its own trustworthiness. A decent hazard identification report must include the people involved, the information they were given, the process that was followed, and the records of that process actually being executed.
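The record described above can be sketched as a simple data structure. The field names are illustrative, not a standard, but they capture the three pillars plus the evidence that the process was actually executed.

```python
from dataclasses import dataclass, field

@dataclass
class HazardIdentificationRecord:
    """Sketch of a hazard identification report; field names are
    illustrative, not drawn from any standard."""
    hazards: list[str]                 # the hazards identified
    participants: list[str]            # who took part, and in what role
    inputs: list[str]                  # design info, operating context, etc.
    process: str                       # technique(s) applied
    session_log: list[str] = field(default_factory=list)  # evidence of execution

record = HazardIdentificationRecord(
    hazards=["stored electrical energy reaches maintainer"],
    participants=["operator (line role, not manager)", "designer"],
    inputs=["Mark 5 Widget design pack", "operating environment description"],
    process="energy checklist, reviewed by structured brainstorming",
)
```

A bare `hazards` list on its own would say nothing about its own trustworthiness; the other fields are what let a reader judge it.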