The Incident. The Golden Hour. The Evidence … and Getting to the Root Cause

June 1, 2023

Author | Jaco Cronje

Have you ever found yourself sitting in yet another root cause analysis session, unable to even get close to a possible cause, just because of that sinking realisation that, yet again, you do not have enough information about the failure? This is the stage where the breakdown has been fixed, and you were too late. Everything is up and running again. Production is back to chasing the next target. You might ask yourself: How can you find the missing information?

In this article, we define the critical point in an incident investigation where most of this information goes missing: the moment just after the incident or failure, when the very first people arrive at the scene and observe it as-is, just after, or maybe even as, it occurred. They are the very first eyewitnesses. We call this time frame The Golden Hour and will elaborate on how to use it effectively to preserve information and make it available for successful root cause analysis (RCA).

The term golden hour originates in trauma medicine, where the first hour after severe trauma is considered the most critical in determining a successful emergency treatment outcome. In maintenance, preserving all information within the first hour of the incident or failure will go a long way in achieving a successful root cause analysis and preventing reoccurrence.

The Golden Hour

To explain the concepts, we will use a process pump as an example of an asset. A functional failure can be anything from a simple trip-out of the electric motor to a catastrophic failure resulting in a significant containment loss. We will illustrate this with a worst-case scenario: presume a major containment loss, with parts lying everywhere …

24 March @ 01:03 – The two-way radio on your shoulder crackles and blares in your ear. After your long night shift, you have just warmed up your supper. “MAINTENANCE! ANYONE FROM MAINTENANCE! COME IN …” You sadly leave your warm and tasty dinner, grab your camera, safety gloves, hard hat and notebook and rush to the crime scene! Time is already against you because Production wants the plant up and running as soon as possible. Your conundrum is whether you will have enough time to gather sufficient evidence.

Firstly, let us establish some first-line responsibilities. We can confirm that production operators are responsible for operating assets using standard operating procedures (SOPs) and a specific production specification to ensure that the product is produced at, inter alia, the desired quality and rate. When an asset experiences a functional failure, it naturally no longer makes the product according to the desired production specification and/or quality. Usually, the first line of “defence” called in from a Maintenance point of view is the technician or artisan. Here we have identified two key roles: the operator and the artisan or technician. These key role-players are our response teams, and they will be the people who will capture all of the evidence and information during the Golden Hour.

To further define and unpack the details of what unfolds during the Golden Hour, we are going to use the following five steps as a framework for our root cause analysis:

1. Safety first
2. Spot the difference
3. Collect the evidence
4. Restoration
5. Recording and reporting

Some of the explanations within the five steps may sound repetitive because at least four of the five steps often happen simultaneously during the golden hour.

1. Safety first

24 March @ 01:04 – Production: Before Maintenance is tasked with looking at the pump, are there any significant process-related deviations in the control room that raise safety alarms? If so, log them immediately and stop Maintenance from entering an unsafe area. Once everything is confirmed as safe within the control room, you can look and listen in on the scene on site. If there is a containment loss (process fluid, fluid from the pump, bearing casing, etc) that prevents safe access to the asset for repairs or prevents other safe operations, follow standard safe-making procedures to make the area safe. Ensure that fluid samples are taken for later analysis. If you find a foreign object that was not there during normal operations and you cannot identify it, do not move it; rather, wait for Maintenance to arrive and identify the part. DO NOT throw anything away or clean anything unnecessarily without first taking a photo of its current state or taking a sample of the debris or dirt.

After the area has been declared safe by Production (if the failure has been catastrophic and a process leak has been observed), Maintenance can be called to the scene.

24 March @ 01:10 – Maintenance: If unidentified debris or loose parts from the pump were displaced during the failure and prevent you from safely getting close enough to repair the pump, remove them carefully with the required rigging assistance. Before moving anything, however, take a sample of any leaking fluids (oil or process fluid, if Production has not already done so) and take a photo of the parts in their current state. If photos cannot be taken, write down where the parts are (or draw the scene) as accurately as possible before making the area safe. DO NOT discard or throw away any parts, and do not clean anything unnecessarily before taking samples of debris or dirt and making a complete record of the scene’s current state.

Once both parties have declared the area safe to work, move on to the next step.

2. Spot the difference

While making the area safe (if required, depending on the failure severity), looking for differences, and collecting evidence simultaneously, you are recording as much as possible. To elaborate, spotting the difference means investigating what exactly is different from normal operating conditions and how many deviations can be spotted. This can range from straightforward to near impossible. Now you may say: “But this is exactly solving the root cause.” Not entirely – it is not as simple as that.

24 March between 01:15 and 01:45 – Production: Within the control room, and if available, take note of process trends (temperature, pressure, flowrate, electrical information, etc) and what exactly started deviating from the production specification before and after the incident.
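Where trend data can be exported from the control system, the question of what started deviating first can be answered mechanically. The sketch below is a minimal illustration only. It assumes the trend is available as time-stamped readings and that the production specification gives a low and a high limit per parameter. The function name, parameter names, limits, readings and the year in the timestamps are all invented for the example.

```python
from datetime import datetime

def first_excursions(trend, limits):
    """Return, per parameter, the first reading that fell outside its limits.

    trend  -- list of (timestamp, {parameter: value}) in chronological order
    limits -- {parameter: (low, high)} taken from the production specification
    """
    first = {}
    for timestamp, readings in trend:
        for parameter, value in readings.items():
            if parameter in first or parameter not in limits:
                continue  # already flagged, or no specification limit known
            low, high = limits[parameter]
            if not low <= value <= high:
                first[parameter] = (timestamp, value)
    return first

# Hypothetical readings in the minutes before the incident (year assumed).
trend = [
    (datetime(2023, 3, 24, 0, 40), {"flow_m3h": 120.0, "motor_current_A": 41.0}),
    (datetime(2023, 3, 24, 0, 50), {"flow_m3h": 96.0, "motor_current_A": 42.0}),
    (datetime(2023, 3, 24, 1, 0), {"flow_m3h": 60.0, "motor_current_A": 55.0}),
]
# Hypothetical specification limits.
limits = {"flow_m3h": (100.0, 140.0), "motor_current_A": (0.0, 50.0)}

excursions = first_excursions(trend, limits)
# Parameters ordered by when they first left the specification.
order = sorted(excursions, key=lambda parameter: excursions[parameter][0])
```

Ordering the excursions by time shows which parameter moved first, which is often more useful to the RCA team than a list of everything that was out of specification at the moment of failure.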

Extract the alarm history for the previous 24 hours if it is available. If it is unavailable, note that down as well; this helps the RCA team recommend which critical parameters to add to the asset information in future. Write down exactly what you experienced and what triggered you to report the incident. If there was containment loss at the scene, write it down. From a production point of view, take your SOP and write down (before safe-making) the positions and indications of the process valves, switches and gauges related to the pump and the pump’s process. Compare them to the required settings stated in the SOP and note any differences.
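The two checks described above, pulling the last 24 hours of alarms and comparing field positions with the SOP, can be sketched as follows. This is an illustrative sketch under stated assumptions: the alarm history is assumed to be a simple list of time-stamped entries, the SOP settings a simple lookup table, and every tag name, alarm text and the year in the timestamps is hypothetical.

```python
from datetime import datetime, timedelta

def alarms_in_window(alarms, incident_time, hours=24):
    """Return alarms logged in the `hours` before the incident, oldest first."""
    start = incident_time - timedelta(hours=hours)
    selected = [a for a in alarms if start <= a["time"] <= incident_time]
    return sorted(selected, key=lambda a: a["time"])

def sop_deviations(required, observed):
    """Compare observed field positions with the SOP and report mismatches.

    An item listed in the SOP but not observed is reported as "not recorded",
    because missing information is itself a finding for the RCA.
    """
    deviations = {}
    for item, expected in required.items():
        actual = observed.get(item, "not recorded")
        if actual != expected:
            deviations[item] = {"sop": expected, "observed": actual}
    return deviations

incident = datetime(2023, 3, 24, 1, 3)  # year is assumed
alarms = [  # hypothetical alarm history
    {"time": datetime(2023, 3, 22, 14, 0), "text": "Tank level high"},
    {"time": datetime(2023, 3, 24, 0, 41), "text": "Pump low flow"},
    {"time": datetime(2023, 3, 23, 22, 15), "text": "Bearing temperature high"},
]
recent = [a["text"] for a in alarms_in_window(alarms, incident)]

# Hypothetical SOP settings versus what the operator wrote down on site.
required = {"suction valve": "open", "discharge valve": "open",
            "selector switch": "auto"}
observed = {"suction valve": "closed", "discharge valve": "open"}
deviations = sop_deviations(required, observed)
```

Reporting an unobserved item as "not recorded" rather than silently skipping it mirrors the advice to note down missing information as a finding in its own right.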

24 March between 01:15 and 01:45 – Maintenance: Confirm with Production why the pump has stopped if it is not blatantly obvious (in more than the required number of pieces or scattered all over the place)! For example, if it has just tripped out electrically, was it a low-flow or high-current trip? If so, feel whether the pump can still turn before attempting a restart or anything else. If it can restart, listen for any unusual noises. In any other scenario, first look for anything out of place when you get to the scene. Are all the parts intact? If not, write down what is not in accordance with the pump’s original specification (loose bolts, wires, coupling, cracked casing, etc). If you do not have the technical drawings to compare against or do not know the specification, note that as well. This helps the RCA by showing that the specifications should be requested from the original equipment manufacturer (OEM) and that the technical documentation should be updated. Now use your senses of touch and smell to check whether anything, such as the pump bearing box or electric motor, is hotter than normal or whether the oil smells different from what it usually does.

3. Collect the evidence

When you get to step No. 3, you may think: “Well, what have we been doing all this time?” Sure, we have collected substantial evidence specifically through our senses and hopefully noted all of it down on a template or job card. This step explicitly addresses physical or tangible evidence. Those things we say we can “bag and tag.”

24 March between 01:15 and 01:45 – Production: As noted in the safe-making stage, collect any process-specific evidence from the incident. Compile the sequence of events and retain samples of the specific product still inside the pump or asset and/or the exact product that was scrapped due to a deviation in quality or leaked out due to containment loss. Anything else collected from the site during cleaning that relates to the incident should not be discarded before it has been critically reviewed by Production, Maintenance, or a process-safety or process specialist.

24 March between 01:15 and 01:45 – Maintenance: All pump parts collected before the repair that cannot be reused during the repair must be removed and tagged. These are to be inspected further in the workshop to determine why the component or part is defective or has failed. Do not clean a part before bagging and tagging it, as cleaning can remove critical evidence pointing to the cause of failure. Preserve as much as possible of the part and its current condition. If any parts cannot be removed from the site for further investigation or inspection before repairs, ensure that the people required to assist with or be present for the examination are contacted as soon as possible so that it can take place on site. Alternatively, before altering or restoring the parts, take photos or draw the scene, and write up notes about it as accurately as possible.
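A “bag and tag” record does not need to be elaborate; what matters is that each part carries a unique identifier and a note of where and in what state it was found. The sketch below shows one possible shape for such a record. The field names, the tag numbering scheme, the asset number P-101, the year and the example parts are all assumptions made for illustration, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class EvidenceTag:
    """One tagged item of physical evidence (hypothetical record layout)."""
    asset: str
    sequence: int
    description: str
    found_at: str
    collected_by: str
    collected_on: datetime
    photographed: bool = False
    cleaned: bool = False

    @property
    def tag_id(self):
        # Asset number, collection date and time, and a running number.
        return f"{self.asset}-{self.collected_on:%Y%m%d-%H%M}-{self.sequence:03d}"

    def concerns(self):
        """List anything that weakens this item as evidence."""
        notes = []
        if not self.photographed:
            notes.append("no photo of as-found state")
        if self.cleaned:
            notes.append("part was cleaned before tagging")
        return notes

coupling = EvidenceTag(
    asset="P-101", sequence=1,
    description="Fractured coupling half",
    found_at="2 m north of pump baseplate",
    collected_by="Night-shift artisan",
    collected_on=datetime(2023, 3, 24, 1, 32),  # year is assumed
    photographed=True,
)
bearing = EvidenceTag(
    asset="P-101", sequence=2,
    description="Drive-end bearing",
    found_at="Inside bearing housing",
    collected_by="Night-shift artisan",
    collected_on=datetime(2023, 3, 24, 1, 40),
    cleaned=True,
)
```

Recording whether a part was photographed or cleaned lets the workshop see at a glance how much weight each item can carry in the later investigation.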

4. Restoration

24 March @ 01:45 and beyond … At this stage, we can safely say the Golden Hour is over. The time to collect all the evidence, obtain the facts, and grab all the information has passed. Everything has been cleaned up, and the repairs to the pump have now started to take place. By now, and depending on the criticality of the pump, Production is either moving on to monitor the rest of the process if it is not critical or jumping up and down and imploring Maintenance to hurry up and get the pump running because, according to Murphy’s law, the standby pump might suffer the same fate!

We are not going into too much detail with this step because it is not part of the Golden Hour. However, the main goal here is to ensure that the pump is restored to the desired production specification using the correct spare parts supplied by the OEM. Maintenance does the restoration according to the proper procedures per the OEM instructions and specifications using the correct tools.

Once that has been done, the pump should be commissioned with both parties present, using a commissioning checklist (again, depending on the asset’s criticality), and monitored during commissioning to ensure that it performs its required function.

5. Recording and reporting

24 March from 01:03 to 25 March @ 07:00 – Production: Judging by the time stamp, it is safe to assume that recording and reporting are essential throughout the other four steps. For Production, these recordings can be made in production logs, shift reports and CMMS notifications, which can be logged against the specific asset. Production should also take primary responsibility for logging the sequence of events and the official incident (depending on severity) on the company’s incident reporting/logging/HSSE system for investigation. One standard location (preferably the Engineering store or workshop) should be identified where the samples or physical evidence gathered by Production can be taken for further processing or investigation.

24 March from 01:03 to 25 March @ 07:00 – Maintenance: As with Production above, Maintenance should record all of its findings on a shift report, job card, or CMMS notification logged specifically against the asset, both while gathering information and during the restoration. All documented and photographic evidence should be stored on an electronic server in a standard location that is easily accessible to everyone involved in further investigating the incident and solving the root cause. Be sure to coordinate and cross-check all information about the sequence of events with Production, both afterwards and while capturing evidence on site together.
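Cross-checking the sequence of events becomes easier once both teams’ notes sit in a single chronological list. As a minimal sketch, assuming each team keeps time-stamped entries, the two logs can be merged as shown below. The log format and function name are hypothetical, the year is assumed, and the entries simply echo the events of the scenario in this article.

```python
from datetime import datetime

def build_timeline(production_log, maintenance_log):
    """Merge both teams' (timestamp, note) entries into one ordered timeline."""
    entries = [(when, "Production", note) for when, note in production_log]
    entries += [(when, "Maintenance", note) for when, note in maintenance_log]
    # sorted() is stable, so entries with equal timestamps keep their input order.
    return sorted(entries, key=lambda entry: entry[0])

production_log = [
    (datetime(2023, 3, 24, 1, 3), "Pump failure reported by radio"),
    (datetime(2023, 3, 24, 1, 4), "Control room checked for process deviations"),
    (datetime(2023, 3, 24, 1, 20), "Alarm history extracted"),
]
maintenance_log = [
    (datetime(2023, 3, 24, 1, 10), "Arrived on site and photographed debris"),
    (datetime(2023, 3, 24, 1, 45), "Repairs started"),
]

timeline = build_timeline(production_log, maintenance_log)
for when, team, note in timeline:
    print(f"{when:%H:%M}  {team:<11}  {note}")
```

Gaps or contradictions between the two teams’ entries then show up side by side and can be resolved while memories are still fresh.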

25 March @ 07:30 – Even though it is morning, you were looking forward to your supper, so you rewarm it once again and settle down to eat while mulling over lessons learned and the value of The Golden Hour, and then your two-way radio starts to crackle …

Conclusion – beyond the golden hour

In this article, we have examined the concept of The Golden Hour. The aim was, first and foremost, to illustrate how quickly critical information can get lost if we do not take care of those vital minutes just after an incident has occurred. Second, we know that time is of the essence and that, ideally, one should not spend more than 15–30 minutes gathering information or evidence. Following these five steps and knowing the essential elements to look out for will hopefully optimise the time spent.

The key takeaway is that Production and Maintenance staff should be aware of this and be well-trained to ensure the information is gathered, recorded, reported and stored correctly. The information collected will then be useful in the root cause analysis and in future instances as positive case studies or for continuous improvement or best practice initiatives.

For more information on training and detailed content on the Golden Hour, connect with Pragma Academy and Advisory here.
