Data scarcity has always been a significant challenge in the domain of human reliability analysis (HRA). The advancement of simulation technologies provides opportunities to collect human performance data that can facilitate both the development and validation paradigms of HRA. The potential of simulator data to improve HRA can be tapped through the use of advanced machine learning tools like Bayesian methods. Except for Bayesian networks, Bayesian methods have not been widely used in the HRA community. This paper uses a Bayesian method to enhance human error probability (HEP) assessment in offshore emergency situations using data generated in a simulator. Assessment begins by using constrained noninformative priors to define the HEPs in emergency situations. An experiment is then conducted in a simulator to collect human performance data in a set of emergency scenarios. Data collected during the experiment are used to update the priors and obtain informed posteriors. Use of the informed posteriors enables better understanding of the performance, and a more reliable and objective assessment of human reliability, compared to traditional assessment using expert judgment.

## Introduction

Human reliability quantification techniques involve the calculation of human error probability (HEP). Human error probability is defined as the probability of a human failure event (HFE) while carrying out a task [1]. Human failure is often identified as the root cause of accidents [2]. Unlike machines or hardware, the probability of human failure is highly sensitive to the context of performance. Hence, quantification of human reliability involves assessing the conditional probability of an HFE, $P(HFE|context)$, rather than just $P(HFE)$ [3]. The context is defined using combinations of different performance influencing factors (PIFs). Some combinations of PIFs result in very low-frequency but high-risk contexts. Human performance data needed to assess reliability in these contexts are not readily available. This is why many human reliability analysis (HRA) methods rely heavily on expert judgment [4–6]. However, expert judgment can suffer from vagueness, uncertainty, incomplete knowledge, and conflicts among multiple experts [7]. With the advancement of simulation technologies, it is possible to collect human performance data for some rare events. By incorporating simulator data into the reliability assessment process, the technical basis of HRA can be enhanced. Bayesian methods can formally combine data from different sources (e.g., experts and/or simulators) and facilitate the enhancement [8].

Groth et al. [9] have demonstrated the use of simulator data to update the human error probabilities in nuclear power plants. Though the work presented in Groth et al. focused on one specific combination of simulator and HRA method (the Halden Reactor Project [10] and the SPAR-H method [11]), the methodology can be applied to any combination. This paper applies the methodology to update beliefs about human error probabilities in offshore emergency situations. Assessment began by using constrained noninformative priors to define the HEPs in emergency situations. To update the prior beliefs with objective data, an experiment was conducted in an offshore emergency training simulator called AVERT (all hands virtual emergency response trainer). Thirty-eight participants were presented with a range of offshore emergency situations. Data collected during the experiment were used to update the priors for each context and obtain informed posteriors. The use of informed posteriors can provide a more reliable and objective assessment of human reliability compared to traditional assessment using expert judgment. This will also help establish a benchmark for human error probabilities for offshore emergency situations.

Section 2 introduces the AVERT simulator and Bayesian inference process. Section 3 presents the details of the experimental study conducted in AVERT to collect human performance data. Section 4 illustrates the Bayesian updating process using the simulator data. Section 5 presents and discusses the results. Section 6 discusses how application of Bayesian inference can facilitate HRA in offshore emergencies and concludes the paper.

## Background

### All Hands Virtual Emergency Response Trainer Simulator.

AVERT stands for All Hands Virtual Emergency Response Trainer. It is a desktop-based virtual environment that replicates an offshore petroleum facility in which users can learn knowledge and skills regarding offshore emergency safety procedures [12]. Training programs designed in AVERT can be used to introduce trainees to the platform layout, alarm types, potential hazards, and appropriate responses. AVERT is capable of creating credible emergency scenarios by introducing hazards such as blackouts, fires, and explosions. This allows trainees to gain simulated experience of emergency situations that would otherwise not be feasible to obtain. The users have a first-person view of the environment and can interact with different objects in the environment using a dual joystick video game controller (Xbox 360® controller, Microsoft Corporation, Redmond, WA) [13]. Figure 1 shows a few instances of the AVERT emergency preparedness scenarios.

### Bayesian Inference Process.

As shown in Eq. (2), the two main components needed to calculate the posterior probability are: (1) the prior belief about the plausibility of hypothesis $H$, $P(H)$, and (2) the likelihood function $P(D|H)$. The likelihood function defines the probability of the data $D$ given that the hypothesis $H$ is true.

The Bayes theorem allows one to make an inference about the hypothesis $H$ every time new evidence becomes available. The posterior for the current iteration becomes the prior for the next one.
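In its standard form, Bayes' theorem can be written as below, first for a discrete hypothesis $H$ and data $D$, then for a continuous parameter. The symbols $\theta$, $\pi_0$, and $L$ in the continuous version are notation assumed here for illustration, not taken from the paper.

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
\qquad\text{and}\qquad
\pi(\theta \mid D) = \frac{L(D \mid \theta)\, \pi_0(\theta)}{\int L(D \mid \theta)\, \pi_0(\theta)\, d\theta}
```

The integral in the denominator of the continuous form is the term that may lack a closed-form solution.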

As depicted in Eq. (3), the posterior distribution given the data does not always come in a closed form, and sampling methods such as Markov chain Monte Carlo might be necessary [9]. However, there are cases where the likelihood function and the prior distribution are conjugate. Bayesian updating is much simpler in these cases: instead of computing integrals, all that is needed is to modify the parameters of the prior distribution accordingly. It is common for probabilistic risk assessment methods to use combinations of prior distribution and likelihood function that are conjugate [16]. In this paper, a conjugate combination of Beta distribution (prior) and binomial distribution (likelihood function) is used. Section 4 explains this in detail.
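The practical benefit of conjugacy can be checked numerically. The sketch below compares the posterior mean obtained by brute-force integration of prior times likelihood with the one obtained by simply adding counts to the Beta parameters. The prior `Beta(2, 8)` and the data (3 failures in 20 trials) are hypothetical values chosen for illustration; they are not taken from the study.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, computed via log-gamma for numerical stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def grid_posterior_mean(a, b, x, n, steps=20000):
    """Posterior mean by numerical integration (midpoint rule) of prior x likelihood."""
    num = den = 0.0
    for i in range(steps):
        p = (i + 0.5) / steps                      # midpoint of the i-th cell in (0, 1)
        w = beta_pdf(p, a, b) * p**x * (1 - p)**(n - x)
        num += p * w
        den += w
    return num / den

def conjugate_posterior_mean(a, b, x, n):
    """Posterior mean from the conjugate update Beta(a + x, b + n - x)."""
    return (a + x) / (a + b + n)

# Hypothetical illustration values (not from the paper).
numeric = grid_posterior_mean(2.0, 8.0, x=3, n=20)
closed_form = conjugate_posterior_mean(2.0, 8.0, x=3, n=20)
print(f"numerical: {numeric:.6f}, conjugate: {closed_form:.6f}")
```

Both routes agree to within the integration error, but the conjugate route needs no integration at all.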

## Data Collection Using AVERT Simulator

The data used in this paper were originally collected during an experimental study presented in Ref. [17]. This paper uses the data to update beliefs about human error probabilities in offshore emergency situations via a Bayesian inference process.

Thirty-eight participants took part in the study. Participants were recruited by convenience sampling [18]. The participants were naïve concerning any detail of the experimental design; they were not employed in the offshore oil and gas industry; and they were not familiar with the offshore platform. Prior to the experimental study, participants were trained to competence in basic offshore emergency preparedness. The training was provided in AVERT using a simulation-based mastery learning approach [13]. At the end of the training, all participants were able to demonstrate the skills required to successfully egress during an emergency situation. It was expected that all participants would be highly reliable under the pressure of emergency situations.

During the study, the participants were presented with a range of simulated emergency scenarios. In all scenarios, participants were required to egress following all safety procedures, and muster at their designated muster station or lifeboat station, depending on the context. A total of eight scenarios were created in AVERT by varying three PIFs at two different levels. The three PIFs were situation familiarity, hazard proximity, and communication.

Situation familiarity refers to participants' familiarity with the starting location of a given context. This PIF was varied at two levels. In half of the scenarios, the participant started in a highly familiar location (their cabin). In the other half, the participant started in a less familiar location (bridge). Hazard proximity refers to participants' proximity to a hazard. This PIF was also varied at two levels. In half of the scenarios, the egress routes were free of hazards. In the other half, some of the egress routes were blocked by a hazard. The third PIF, communication, refers to the quality of public address (PA) announcements during an emergency. In half of the scenarios, the PA announcements had all relevant information for the participant to make an informed decision on how to muster. In the other half, the PA announcement provided less complete information about the situation.
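The eight contexts follow from a full factorial combination of the three two-level PIFs. A minimal sketch of that enumeration is given below. The level labels are paraphrased from the description above, and the enumeration order is arbitrary: it is not meant to reproduce the lettering of the scenarios in Table 1.

```python
from itertools import product

# Two levels per PIF, paraphrased from the text (labels are illustrative).
PIF_LEVELS = {
    "situation_familiarity": ("high (cabin)", "low (bridge)"),
    "hazard_proximity": ("egress routes clear", "some routes blocked"),
    "communication": ("complete PA announcement", "less complete PA announcement"),
}

# Full factorial design: 2 x 2 x 2 = 8 contexts.
contexts = [dict(zip(PIF_LEVELS, combo)) for combo in product(*PIF_LEVELS.values())]

for index, context in enumerate(contexts, start=1):
    print(index, context)
```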

Table 1 gives an overview of the scenarios. This paper does not look into PIFs that may vary within a scenario. Hence, scenario and context refer to the same thing in this paper. A detailed discussion on scenario and context can be found in Ref. [9].

All scenarios were defined to test the following HFEs: failure to muster in time $(HFE1)$, failure to maintain safe pace $(HFE2)$, failure to close fire/watertight doors $(HFE3)$, failure to report at the correct muster location $(HFE4)$, and failure to avoid interaction with hazard $(HFE5)$. Performance metrics relevant to the HFEs were recorded during each scenario. The performance metrics included time to muster, time spent running, interaction with fire/watertight doors, muster location, and interaction with hazards. Figure 2 summarizes the experimental design.

All of the participants performed in the eight different scenarios, giving 38 data points per HFE per scenario. For example, performance outcomes of 38 participants in context H for $HFE1$ (failure to muster in time) are summarized in Table 2.

## Methodology

Implementing the Bayesian inference process involves several steps. Figure 3 summarizes the steps followed in this paper.

The first step is to define a hypothesis with the goal to assign human error probabilities for different contexts. Once the hypothesis is defined, a likelihood function of human failure on demand and a conjugate prior distribution are defined. The final step is to use the simulator data and perform the Bayesian updating based on the prior distribution and likelihood function.

### The Hypothesis.

As stated in Sec. 3, there were five different HFEs that could happen during the scenarios listed in Table 1. The goal of the Bayesian method is to define the HEPs associated with the HFEs for each context. In other words, the aim is to define the plausibility of the hypothesis “The HEP for HFE $X$ is $p$” for each context. The process starts by defining a prior distribution for each HEP. As evidence becomes available after each participant's performance, Bayesian updating can be performed to get a revised HEP.

### Likelihood Function.

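Each opportunity ends in either a failure or a success, so the number of failures $x$ in $n$ opportunities is modeled with a binomial likelihood, $P(x|n,p) = \binom{n}{x} p^{x} (1-p)^{n-x}$. Here, $p$ is the probability of a human failure on demand.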

### Prior Distribution.

The prior distribution of HEP is denoted by $p_0$. Prior beliefs can be expressed using a number of different distributions. The likelihood function of failure (binomial distribution) has a beta distribution as its conjugate prior. Hence, a beta distribution is used in this paper as a prior. The traditional approach is to use a Jeffreys prior, $Beta(0.5, 0.5)$, that is invariant to transformations. Though this meets the requirement of computational simplicity, it is not realistic under the conditions of the experimental study. For example, the use of $Beta(0.5, 0.5)$ results in a mean or expected value of 0.5. Given that all participants successfully demonstrated the skills required to egress during an emergency situation prior to the study, it is unrealistic to assume such a high error probability. Instead of the Jeffreys prior, a limited information prior, based on the constrained noninformative distribution, is used in this paper. This is also a beta distribution, but has a constraint on the mean or expected value of the distribution [19]. Parameters of the beta distribution ($\alpha$ and $\beta$) are determined based on the constraint.

Thus, with the mean constrained to 0.05, the constrained noninformative priors of all HEPs are approximated using $Beta(0.5, 9.5)$. This provides a standard deviation of about 0.07.
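The prior parameters and their moments can be reproduced with a few lines of arithmetic. The sketch below assumes the common small-mean approximation in which the constrained noninformative prior fixes $\alpha$ at 0.5 and chooses $\beta$ so that the mean $\alpha/(\alpha+\beta)$ equals the constraint. Both function names are hypothetical, and this is an assumption about how the parameters were obtained, not a statement of the exact procedure in Ref. [19].

```python
import math

def cni_beta_prior(mean, alpha=0.5):
    """Beta(alpha, beta) with the given mean, holding alpha fixed (assumed 0.5)."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    beta = alpha * (1.0 - mean) / mean   # solves alpha / (alpha + beta) = mean
    return alpha, beta

def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total**2 * (total + 1.0))
    return mean, math.sqrt(variance)

alpha0, beta0 = cni_beta_prior(0.05)
prior_mean, prior_sd = beta_mean_sd(alpha0, beta0)
print(f"Beta({alpha0}, {beta0:.1f}): mean = {prior_mean:.3f}, sd = {prior_sd:.3f}")
```

Under this assumption the constraint of 0.05 gives $\beta = 9.5$ and a standard deviation of roughly 0.066, which rounds to the 0.07 quoted above.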

### Bayesian Updating.

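With $x$ failures observed in $n$ opportunities, the conjugate update of the beta parameters is

$$\alpha_{post} = \alpha_{prior} + x \tag{6}$$

$$\beta_{post} = \beta_{prior} + n - x \tag{7}$$

$\alpha$ and $\beta$ are updated using Eqs. (6) and (7) iteratively each time new evidence is available. $\alpha_{post}$ and $\beta_{post}$ for the current iteration become the $\alpha_{prior}$ and $\beta_{prior}$ for the next iteration.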

The Bayesian updating for $HFE1$ (failure to muster in time) for context $H$ will be illustrated as an example. As stated earlier, the prior distribution of HEP for this HFE is approximated by $Beta(0.5, 9.5)$ with a mean of 0.05. The performance outcomes of the 38 participants in context H (summarized in Table 2) can be used to update the prior beliefs. As shown in Table 2, the number of failures is $x=5$, given the number of opportunities $n=38$. $\alpha$ and $\beta$ can now be updated using Eqs. (6) and (7), respectively: $\alpha$ increases by the 5 failures, from 0.5 to 5.5, and $\beta$ increases by the 33 successes, from 9.5 to 42.5. This gives a posterior Beta distribution $Beta(5.5, 42.5)$ with a mean of about 0.115 (5.5/48). The updated standard deviation is 0.05. Section 5 presents and discusses the results of Bayesian updating.
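The worked example for context H can be reproduced as follows. The sketch applies the update once with the aggregated counts and once participant by participant, to show that the iterative scheme lands on the same posterior. The order of individual outcomes is hypothetical (only the totals, 5 failures in 38 opportunities, come from the text), and the function names are illustrative.

```python
import math

def update_beta(alpha, beta, failures, trials):
    """Conjugate Beta-binomial update: add failures to alpha, successes to beta."""
    return alpha + failures, beta + (trials - failures)

def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    variance = alpha * beta / (total**2 * (total + 1.0))
    return alpha / total, math.sqrt(variance)

# Batch update with the totals reported for HFE1 in context H.
a_batch, b_batch = update_beta(0.5, 9.5, failures=5, trials=38)

# Sequential update, one participant at a time (outcome order is hypothetical).
outcomes = [1] * 5 + [0] * 33          # 1 = failure, 0 = success
a_seq, b_seq = 0.5, 9.5
for failed in outcomes:
    a_seq, b_seq = update_beta(a_seq, b_seq, failures=failed, trials=1)

post_mean, post_sd = beta_mean_sd(a_batch, b_batch)
print(f"batch: Beta({a_batch}, {b_batch}); sequential: Beta({a_seq}, {b_seq})")
print(f"posterior mean = {post_mean:.4f}, sd = {post_sd:.4f}")
```

Because the update only adds counts, the order in which participants are processed does not affect the final posterior.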

## Results and Discussion

Section 4 presented the Bayesian updating process for $HFE1$ (failure to muster in time) in context $H$. Figure 4 presents a comparison of the prior and posterior distributions of the HEP for this HFE.

As shown in Fig. 4, the probability of failure to muster in time increases with additional data points. The mean of the posterior distribution is more than twice the prior mean of 0.05. While the mean increases, the standard deviation changes from 0.07 to 0.05, reducing the uncertainty about the HEP. The 95% credible interval changes from $[5.3 \times 10^{-5}, 0.238]$ to $[0.042, 0.218]$. The shorter interval also indicates less uncertainty about the HEP.
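Equal-tailed credible intervals such as these can be approximated without special libraries by sampling from the Beta distribution and reading off empirical quantiles. The sketch below uses Python's standard `random.betavariate`; the function name, sample size, and seed are arbitrary choices made here, and the result is a Monte Carlo approximation rather than an exact quantile computation.

```python
import random

def credible_interval(alpha, beta, level=0.95, draws=200_000, seed=0):
    """Approximate equal-tailed credible interval of Beta(alpha, beta) by sampling."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]                 # about the 2.5th percentile
    upper = samples[int((1.0 - tail) * draws) - 1]     # about the 97.5th percentile
    return lower, upper

prior_ci = credible_interval(0.5, 9.5)      # constrained noninformative prior
post_ci = credible_interval(5.5, 42.5)      # posterior for HFE1 in context H

print(f"prior 95% interval:     [{prior_ci[0]:.2e}, {prior_ci[1]:.3f}]")
print(f"posterior 95% interval: [{post_ci[0]:.3f}, {post_ci[1]:.3f}]")
```

The lower bound of the prior interval is very close to zero, so its sampled estimate is noisier in relative terms than the other three bounds.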

Figure 5 summarizes the posterior mean of the five different HFEs across the scenarios.

As shown in Fig. 5, the posterior mean of all HFEs in the first four scenarios (A–D) is below the prior mean. The fact that the human error probabilities in these scenarios were even lower than 5% provides evidence that the training prior to experimental study prepared the participants to handle these types of situations effectively. One attribute common in these scenarios was a high situation familiarity. In all of these four scenarios, participants started in their cabin, which they were very familiar with due to the training.

For scenarios $E$, $F$, and $H$, the mean of the updated probabilities was higher than the prior mean for failure events $HFE1$, $HFE3$, and $HFE4$. In all these scenarios, participants started in a less familiar location, the bridge. Evidently, human error probabilities for certain failure events were higher in these less familiar situations. The failure events that were associated with higher error probability are failure to muster in time $(HFE1)$, failure to close fire/watertight doors $(HFE3)$, and failure to report at the correct muster location $(HFE4)$. The probability of failing to avoid interaction with a hazard $(HFE5)$ remained close to zero in all conditions. This indicates that the training enabled participants to egress using a safe route in all conditions.

The informed posteriors serve as an indicator of people's offshore emergency preparedness. The initial belief was that all participants would be able to handle emergency situations with minimal error (hence, the mean error probability was set to as low as 0.05). The posterior means illustrate that the belief did not hold for the complex scenarios such as $E$, $F$, and $H$. Results obtained from Bayesian inference can help identify contexts and failure events where chances of human error are high. Special attention can be paid to these areas during training.

As a measure of uncertainty, the standard deviation of posterior HEPs for different failure events across scenarios was measured. Figure 6 summarizes the result. As shown in Fig. 6, the posterior standard deviation was lower than the prior standard deviation in all conditions. This is in line with the expectation that additional data points should reduce uncertainty about the HEPs. For all HFEs, uncertainty reduced significantly in scenarios A–D. For scenarios E, F, and H, change of uncertainty was less significant for failure events $HFE1$, $HFE3$, and $HFE4$. This is consistent with the observed variability in human performance in these conditions. More human performance data are needed to reduce the uncertainty even further.

Another measure of uncertainty is the credible interval: the shorter the interval, the lower the uncertainty. Figure 7 presents the upper bound of a 95% credible interval. As shown in Fig. 7, for all conditions, the upper bound of the credible interval (i.e., the 97.5th percentile) reduces with additional data points. Such reduction results in narrower credible intervals, providing more confidence in the updated HEPs.

An important thing to note here is that posterior distributions are always a compromise between current observations (i.e., data points) and the prior distribution. In most cases (especially when the likelihood function built on sufficient data points supports the prior belief), the posterior variance is smaller than the prior variance, as the law of total variance guarantees on average. However, exceptions to this may happen when the choice of prior distribution is too strong and the new information is (1) limited and (2) not in strong agreement with the prior [20]. The constraint used in this paper assumes the expected HEP to be 0.05 for all HFEs. This makes the Bayesian model structured enough to learn from data, yet weak enough to learn from a small amount of data. A more rigid constraint, such as setting the expected HEP as low as 0.001 (since all participants were trained to competence and are not expected to err), essentially provides a starting standard deviation of 0.001. With such a strong prior, 38 data points are not enough to reduce the uncertainty even further. A significant amount of data would be needed to overwhelm the impact of the prior and show the effect of learning from data. On the other hand, when the prior distribution is more diffuse, which is often the case for rare events, neither an inordinate amount of data nor perfect data are required for the Bayesian inference to work. Groth et al. [9] demonstrate how the technical basis of HRA can be enhanced even with a relatively small sample size (four samples/context).
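The effect of the constraint on learning can be illustrated by applying the same data to a weak and a strong prior. The sketch below reuses the counts reported for $HFE1$ in context H (5 failures in 38 opportunities) and assumes, as a simplification, that both priors are Beta distributions with $\alpha = 0.5$ and $\beta$ chosen to match the constrained mean. The function names are hypothetical.

```python
import math

def beta_sd(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total**2 * (total + 1.0)))

def prior_and_posterior_sd(prior_mean, failures, trials, alpha=0.5):
    """Prior and posterior standard deviations for a mean-constrained Beta prior."""
    beta = alpha * (1.0 - prior_mean) / prior_mean   # assumed construction, alpha = 0.5
    prior = beta_sd(alpha, beta)
    posterior = beta_sd(alpha + failures, beta + trials - failures)
    return prior, posterior

weak = prior_and_posterior_sd(0.05, failures=5, trials=38)      # constraint used here
strong = prior_and_posterior_sd(0.001, failures=5, trials=38)   # more rigid constraint

print(f"mean 0.05:  prior sd = {weak[0]:.4f}, posterior sd = {weak[1]:.4f}")
print(f"mean 0.001: prior sd = {strong[0]:.4f}, posterior sd = {strong[1]:.4f}")
```

Under these assumptions the weaker prior shows a reduction in standard deviation, whereas the rigid prior, confronted with limited data that disagree with it, ends up with a posterior standard deviation larger than its prior one.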

## Conclusion

Offshore emergencies are complex, stressful, and uncertain. These contexts are not observed frequently. Collecting objective human error data under such circumstances is nearly impossible. This is why most HRA methods use subjective data, such as expert opinion, to inform the model. Simulators can facilitate objective data collection for rare events. The Bayesian inference process described in this paper demonstrates how simulator data can be used to estimate human error in rare events. The posterior HEPs presented here can be used as a benchmark for offshore emergency situations.

The informed posteriors also serve as a measure of people's preparedness for handling an emergency situation. Insights from the posterior distribution can help identify critical events and/or scenarios where chances of human error are high. Special attention can be paid to these conditions during offshore emergency preparedness training [21].

A total of 38 data points have been used in this paper for the Bayesian updating. However, it is possible to refine HEPs using the Bayesian inference process even when the data set is sparse, which is very common for rare events. Even though a mean of 0.05 is used in this paper as the constraint, it is possible to use a point estimate obtained from different HRA methods, such as success likelihood index methodology-multi attribute utility decomposition (SLIM-MAUD), human error assessment and reduction technique (HEART), cognitive reliability and error analysis method (CREAM), and standardized plant analysis risk-human reliability analysis (SPAR-H). Using point estimates from an HRA method will enable assessing the credibility of the method itself. If the posterior differs little or not at all from the prior point estimate, the credibility of the expert-derived HRA method increases. If a significant change is observed, the expert-assigned HEPs must be revised before use.

Finally, the focus of this paper was on $P(HEP|PIFs)$. By combining this with data about $P(PIFs)$, reasoning can be made about $P(HEP)$ and $P(PIFs|HEP)$ using the Bayesian network model proposed in Ref. [17]. This will enable a more reliable and objective assessment of human reliability. By investigating the conditional probabilities of PIFs given the HEP, PIFs that are critical to human error can be identified.
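The reasoning described here rests on two standard identities, sketched below for a discrete set of contexts $c$. The notation follows the Introduction's $P(HFE|context)$ and is an assumption made for illustration; the Bayesian network model in Ref. [17] may structure these quantities differently.

```latex
P(HFE) = \sum_{c} P(HFE \mid PIFs = c)\, P(PIFs = c)
\qquad\text{and}\qquad
P(PIFs = c \mid HFE) = \frac{P(HFE \mid PIFs = c)\, P(PIFs = c)}{P(HFE)}
```

The first identity marginalizes the context-specific probabilities over how often each context occurs. The second inverts the conditioning to show which PIF combinations are most likely given that a failure has occurred.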

## Acknowledgment

The authors acknowledge with gratitude the support of the NSERC-Husky Energy Industrial Research Chair in Safety at Sea.

## Funding Data

Lloyd's Register Foundation (Scenario Based Risk Management of Arctic Shipping; Funder ID: 10.13039/100008885).