## Abstract

Hazard analysis is the core of numerous approaches to safety engineering, including the functional safety standard ISO-26262 (FuSa) and the Safety of the Intended Functionality (SOTIF) standard ISO/PAS 21448. We focus on addressing the immense challenge associated with the scope of training and testing for rare hazards for autonomous drivers, which leads to the need to train and test on the equivalent of >10^{8} naturalistic miles. We show how risk can be estimated and bounded using probabilistic hazard analysis. We illustrate the definition of hazards using well-established tests for hazard identification. We introduce a dynamic hazard approach, whereby autonomous drivers continuously monitor for potential and developing hazards and estimate their time to materialization (TTM). We describe systematic TTM modeling of the various hazard types, including environment-specific perception limitations. Finally, we show how to enable accelerated development and testing by training a neural network sampler to generate scenarios in which the frequency of rare hazards is increased by orders of magnitude.

## Introduction

### Scope of the Challenge.

There are two main factors that contribute to traffic fatalities: the number of crashes per vehicle and the number of passengers involved in each crash. Although the number of crashes per vehicle depends on the driver and vehicle technology, the number of passengers per vehicle in a crash does not. Furthermore, as opposed to human drivers, an autonomous driver will be bound to the vehicle. Thus, the statistic of about 1.16 fatalities per 100 million miles in the US [1] is misleading. Instead, the appropriate statistic is the ratio of the number of fatal crashes to the number of vehicle miles traveled (VMT).^{1} Given that there are ≈30 driver deaths per million vehicle-years for all vehicles [2], that less than 2% of crashes are fatal (see Fig. 3), and that ≈12,000 miles are driven per vehicle annually [3,4], we estimate that there are ≈200–400 million VMT between fatalities and ≈4–8 million VMT between property-damage-only incidents. Considering the ≈100 precrash scenarios identified by NHTSA, this translates to a simple average of ≈2–4 million miles per precrash scenario.
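The arithmetic behind these estimates can be reproduced in a few lines; the function name and inputs below are illustrative, using the figures quoted from Refs. [2]–[4]:

```python
# Back-of-envelope reproduction of the mileage estimates in the text.

def miles_between_events(events_per_million_vehicle_years, annual_miles_per_vehicle):
    """Vehicle miles traveled (VMT) between events, given an event rate
    per million vehicle-years and the annual mileage per vehicle."""
    vehicle_years_per_event = 1e6 / events_per_million_vehicle_years
    return vehicle_years_per_event * annual_miles_per_vehicle

# ~30 driver deaths per million vehicle-years, ~12,000 miles per vehicle-year
vmt_between_fatalities = miles_between_events(30, 12_000)    # ~400 million VMT
# <2% of crashes are fatal, so crashes occur at least 50x as often
vmt_between_pdo = vmt_between_fatalities * 0.02              # ~8 million VMT
# spread uniformly over the ~100 NHTSA precrash scenarios
miles_per_precrash_scenario = vmt_between_fatalities / 100   # ~4 million miles
```

These values reproduce the upper end of the ranges cited above; the lower end follows from the corresponding lower-bound inputs.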

A review of the precrash scenarios reported by the US Department of Transportation NHTSA concludes that less than 1% of crashes are attributed to vehicle failure (Fig. 2) [5,6]. Thus, autonomous vehicle (AV) safety is a multi-agent problem [7]. Indeed, about 30% of crashes are associated with a single vehicle, about 63% with two vehicles, and about 6.3% with three or more vehicles, as depicted in Fig. 1. Consistent with that observation, 22 of 37 crash types are associated with multiple vehicles [5].

Mainstream safety engineering methods, such as fault tree analysis (FTA) IEC 61025 [10], failure mode effect analysis (FMEA) [11,12], and standards such as the US Federal Motor Vehicle Safety Standards (FMVSS) 1-series [13] and Functional Safety Standard ISO-26262 [14], are all focused on single-vehicle failures. In particular, the FMVSS approach is to define constraints on the design of the vehicle and its components so as to control risks. This was appropriate for vehicles driven by humans.

In contrast to traditional vehicles, for autonomous vehicles having a driving responsibility similar to that of humans, such a focus on the vehicle alone accounts for less than 1% of the crashes and thus less than 1% of the risk. There is a need to shift focus from vehicle design constraints to context-specific requirements on the *behavior of the autonomous driver*.

### Example Scenario 1: Backing Up Into Another Vehicle.

To illustrate an important two-vehicle NHTSA scenario, consider the autonomous shuttle grazing accident that occurred in Nov. 2018. The shuttle encountered a semi-trailer truck that was slowly backing up to make a delivery, sounding the appropriate audio alert. As the passengers watched, the truck eventually grazed the shuttle, as depicted in Fig. 4; no passengers were injured.

City officials indicated that “The shuttle did what it was supposed to do, in that its sensors registered the truck and the shuttle stopped”. Such a response was deemed inappropriate because, e.g., had the truck continued a meter further, it could have toppled the shuttle with passengers inside. The passengers indicated that “We had about 20 feet of empty street behind us, and most human drivers would have thrown the car into reverse and used some of that space to get away from the truck. Or at least leaned on the horn and made our presence harder to miss.”

Clearly, the behavior requirements for this two-agent scenario must include avoiding an accident by moving in reverse. According to the NHTSA statistics, *backing into another vehicle accounts for >3% of the crashes*. In this example, a human driver was backing up into Ego, implying that the safety case must include accident avoidance requirements. Failing to include such a requirement in the development phase, and further failing to validate it prior to product release, is inappropriate.

**Key Point 1**: Require explicit specification of behavior algorithms that actively avoid crashes in reaction to the actions of other traffic participants.

### Example Scenario 2: Following Vehicle Making Maneuver.

To illustrate an important three-vehicle NHTSA scenario, consider the safety of commonly deployed Advanced Driver Assistance Systems (ADAS) cruise control. The following three steps were performed in a test conducted in June 2018, as shown in Fig. 5:

1. Ego’s human driver engages adaptive cruise control (ACC) to follow the lead vehicle in front.
2. The lead vehicle moves to the right lane to avoid a static obstacle in front.
3. Ego’s ACC faces the static obstacle, is unable to brake in time, and crashes into the obstacle.

This example illustrates a problem that is common to millions of vehicles on the road as of 2019: the ADAS sensors are only capable of analyzing the state of the lead vehicle immediately in front, but are unable to analyze the state of two vehicles ahead, due to occlusions. ADAS systems on vehicle models deployed on or before 2019 exhibit this problem. The test depicted in Fig. 5 was conducted on June 12, 2018, with Tesla vehicles, 2 years after the release of the radar designed to detect such situations.^{2} It turns out that humans are challenged by this scenario as well: according to NHTSA statistics, *following a vehicle making a maneuver accounts for >1% of the crashes*.

This example represents an entire class of scenarios, depicted in Fig. 6. The obstacle may be a vehicle stopped to allow a pedestrian to cross safely. Alternatively, the obstacle may be a pothole or a flooded section of the road, which does not appear in the map. Similarly, the obstacle may be a tree, a pedestrian, or another object that has fallen onto the road and is *not mapped*. Clearly, there are countless possibilities represented by “following vehicle making maneuver,” for which an autonomous driver must perform better than humans.

**Key Point 2**: The maneuver of the lead vehicle dramatically increases uncertainty. In such situations, slowing down and increasing separation are viable strategies for risk reduction.

## Risk and Hazards

Identification and classification of hazards is at the core of safety engineering and validation. Hazards can be classified into one of three distinct categories:

1. Hazards originating from within the AV system due to system limitations or failures: the scope of ISO-26262 covers these hazards.
2. Hazards originating from misuse by an operator: intentional misuse is regarded as a security topic and is not in scope for safety standards. Unintentional misuse is in scope; however, the nature of that misuse determines which modeling method is applicable.
3. Hazards originating from environment objects, which are either static or dynamic: dynamic objects include other traffic participants, e.g., vehicles and pedestrians. Static objects include mapped and unmapped objects. Mapped objects include, for example, road geometry, road markings and indicators, speed bumps, and rumble strips. Unmapped static objects include trees that have fallen on the road, potholes, and temporary road indicators, e.g., work zones.

According to the SOTIF standard, system failures and unintentional misuse are regarded as internal hazards, whereas intentional misuse and environment objects are regarded as external hazards. However, unintentional misuse can occur due to external factors as well. As an example, consider unintentional misuse due to the inability to comply with temporary geographic restrictions issued externally, e.g., by authorities, due to traffic conditions, emergency situations, or unexpected weather conditions. These examples motivate a distinct misuse category.

### Estimating Risk.

Risk is traditionally estimated from the Severity *S*, Exposure *E*, and Controllability *C* of a hazard [14]. In practice, the values for *S* can be limited to *property damage*, *injury*, or *fatality*; see Fig. 3 for the relative frequency of injury severities for 2010 versus 2019 for some states and the large distribution variation among those states. The values for *E* can satisfy 0 < *E* < 1, representing the probability that a hazard would occur. The values for *C* can satisfy 0 < *C* < 1, where *C* = 1 represents the *inability* of the vehicle to avoid the materialization of the hazard (e.g., a crash is guaranteed) and *C* = 0 represents *always avoiding the materialization* (e.g., a crash is always avoided). With this approach, hazards are grouped according to their severity:

$$\mathcal{H} = \mathcal{H}_{\text{pdo}} \cup \mathcal{H}_{\text{injury}} \cup \mathcal{H}_{\text{fatality}} \tag{1}$$

Denoting the hazards in each group as $\mathcal{H}$, the corresponding risk $R_H$ for each hazard $H \in \mathcal{H}$ is $R_H = E_H \times C_H$.

In contrast to ISO-26262, in which hazards represent failures of a single vehicle, estimating the residual risk associated with autonomous driving requires considering hazards comprising multiple vehicles in the context of specific scenarios.

The aforementioned example scenario 1 describing the “shuttle grazing” accident specifies a two-agent hazard, whereby one vehicle backs up into another vehicle; it represents >3% of the crashes.

The aforementioned example scenario 2 describing the “cruise control crash” test specifies a three-agent hazard, whereby cruise control followed a lead vehicle making a maneuver; it represents a class of precrash scenarios which account for >1% of the crashes.

To illustrate the various *harm severities* described earlier in Eq. (1), consider the 2010 MAIS and KABCO data depicted in Fig. 3, which are obtained from Table 30 of Ref. [8]. We observe that less than 2% of the crash incidents result in a fatal injury, i.e., fall in $\mathcal{H}_{\text{fatality}}$. About half result only in property damage, i.e., property damage only (PDO) hazards in $\mathcal{H}_{\text{pdo}}$. The other half is associated with multiple injury severities, i.e., hazards in $\mathcal{H}_{\text{injury}}$.

For *disjoint* hazards $\mathcal{H}$, the overall residual risk can be estimated as the sum of the individual residual risks. When hazards can compound, however, there is a need to assess the *compounded* severity, exposure, and controllability. For simplicity, it is reasonable to assume that the severity of a compound hazard equals the maximum severity of its components. Denoting $H = \{h_1, \ldots, h_n\} \subset \mathcal{H}$ as the set of hazards that may compound, the overall residual risk *for each severity type* (“property damage” or “injury” or “fatality”) is expressed as follows:

$$R = \sum_{H' \subseteq H} E_{H'} \times C_{H'} \tag{2}$$

where $E_{H'}$ is the joint probability that all hazards in the subset $H'$ materialize together. With $2^n$ subsets to consider for a set $H$ of *n* hazards, the exponential complexity renders explicit computation impractical. There is a need to provide a good estimate without explicit consideration of all possible subsets.

### A Practical Approach to Bounding Risk.

Given that $R_H = E_H \times C_H$ and both *E* and *C* are between 0 and 1, it is clear that $R_H \le E_H$ and $R_H \le C_H$. Thus, an obvious method to provide an upper bound on $R_H$ is to obtain an upper bound for $E_H$ or $C_H$. We further simplify by assuming inability to avoid the materialization of hazards, using *C* = 1. This bounds the compound risk by the joint probability of all hazards in $H \subset \mathcal{H}$ compounding, namely,

$$R_H \le E_H = P(h_1 \wedge \cdots \wedge h_n) \tag{3}$$

The risk $R_h$ for every individual hazard $h \in \mathcal{H}$ is bounded by the probability that it would materialize in isolation, plus the probability that it would compound with any other subset of $\mathcal{H}$, namely,

$$R_h \le P(h) + \sum_{H \subseteq \mathcal{H} \setminus \{h\}} P(h \wedge H) \tag{4}$$

To avoid enumerating all subsets, the hazards can be partitioned into *k* clusters $C_1, \ldots, C_k$, such that the probability of compounding hazards in different clusters is very low, namely,

$$P(h_i \wedge h_j) \approx 0 \quad \text{for } h_i \in C_a,\ h_j \in C_b,\ a \ne b \tag{5}$$
**Key Point 3**: A practical method for bounding risk is to cluster the hazards and assume that, in the worst case, harm cannot be avoided. Risk can then be reduced relative to this worst-case bound by improving controllability, e.g., using accident avoidance behavior.
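The cluster-based bound of Key Point 3 can be sketched as follows; the cluster partition and exposure probabilities are illustrative inputs, and the per-cluster bound uses the union bound over the cluster’s hazards:

```python
def clustered_risk_bound(clusters, prob):
    """Practical upper bound on total risk: assume C = 1 (harm cannot be
    avoided once a hazard materializes) and that hazards in different
    clusters never compound. Each cluster is bounded by the probability
    that at least one of its hazards materializes (union bound)."""
    bound = 0.0
    for cluster in clusters:
        # P(any hazard in cluster) <= sum of marginals, capped at 1
        bound += min(1.0, sum(prob[h] for h in cluster))
    return bound
```

Because only the cluster sums are needed, the computation is linear in the number of hazards instead of exponential in it.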

## Dynamic View of Hazards and Time to Materialization

The dynamic view of hazards is intended to reduce risk by avoiding preventable hazard materialization through maximizing the time to materialization (TTM). According to this dynamic view, an autonomous driver needs to continuously monitor multiple potential hazards, estimate their TTM, and act to defer (or avoid, by deferring indefinitely) the materialization of the most imminent severe hazards.

*The TTM of a hazard is the estimated difference between the current time and the time at which a hazard will materialize. It is inherently uncertain.*

The metric of time to collision (TTC) represents the duration of time until the materialization of a crash, assuming no changes in velocity and trajectory for the potentially colliding vehicles. The TTC has proven to be an effective measure for discriminating abnormal from normal driver behavior and identifying risky situations [16]. A situation is regarded as unsafe when TTC is lower than a minimum threshold. Autonomous drivers should continuously estimate TTC (against all traffic participants and static objects) and select actions maximizing it.
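As a concrete instance of the TTC metric described above, a minimal computation with an illustrative unsafe-TTC threshold (real systems calibrate the threshold per speed and operational domain):

```python
def time_to_collision(delta_x, delta_v):
    """TTC in seconds: delta_x is the separation (m) and delta_v the
    closing speed (m/s, positive when the gap is closing). Assumes
    constant velocity and trajectory; infinite when not closing."""
    if delta_v <= 0:
        return float("inf")
    return delta_x / delta_v

def is_unsafe(delta_x, delta_v, ttc_threshold_s=3.0):
    # hypothetical 3 s threshold, for illustration only
    return time_to_collision(delta_x, delta_v) < ttc_threshold_s
```

For example, a 30 m gap closing at 15 m/s gives a TTC of 2 s, below the illustrative 3 s threshold, so the situation would be flagged as unsafe.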

Safe driving with dynamic hazards, as specified by Algorithm 1, continuously scans the environment, estimates the TTM for foreseeable hazards, and applies the action that increases the TTM of the most imminent severe hazard. Systematic validation and safety monitoring of Algorithm 1 are possible by observing the situation $S$, the TTM estimates $T_i^0, T_i$, and the selected action *A*_{i} for every iteration of the control loop. Given that the TTM ground truth is only available in simulation, quantification of the estimation error requires virtual testing.

Consider the shuttle grazing example earlier. According to Algorithm 1, the shuttle’s autonomous driver should maximize the TTM of the imminent crash by selecting *A*_{i} = “reverse”, which renders the TTM infinite.

**Key Point 4**: The TTM-based Algorithm 1 avoids explicit coding of rules. It enables quantification of controllability and validation of hazard materialization avoidance behavior by logging the inferences made at each time frame leading to the action selection.

### Driving loop for dynamic hazards
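Algorithm 1 itself appears as a float in the original; a minimal Python sketch of one iteration of the loop described in the text is given here, with `estimate_ttm` and `severity` as hypothetical stand-ins for the hazard monitor:

```python
def driving_loop_step(hazards, candidate_actions, estimate_ttm, severity):
    """One iteration of the dynamic-hazard driving loop: identify the
    most imminent among the most severe hazards, then select the
    candidate action whose anticipated (rollout-based) TTM defers that
    hazard the longest. estimate_ttm(hazard, action) returns the
    anticipated TTM; action=None means 'continue as-is'."""
    if not hazards:
        return None
    # most severe first, then most imminent (smallest current TTM)
    target = min(hazards, key=lambda h: (-severity(h), estimate_ttm(h, None)))
    # pick the action maximizing the target hazard's TTM
    return max(candidate_actions, key=lambda a: estimate_ttm(target, a))
```

For the shuttle grazing example, an action set {brake, reverse} in which reversing renders the TTM infinite yields the action “reverse,” matching the discussion above.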

### Potential Versus Developing Hazards.

The temporal aspect of hazard evaluation is critical to risk control. To illustrate the challenge, consider that a typical AI-based perception component is only capable of error rates of about 10^{−2}. In contrast, the vehicle needs to *operate for at least 10^{6} miles without fatal accidents*, implying the need to avoid materialization of hazards for about 10^{9} frames (about 1000 frames per mile). This implies the need to build systems that are 10^{7}× more reliable than their components.

To address this need, it is possible to adopt the approach used by the United Kingdom authorities. The UK record is better than that of the United States: there are fewer than 1792 fatalities annually with about 650 billion kilometers traveled, namely >240 million *miles* between fatalities. This UK statistic is *>2.5× better* than the corresponding statistic in the United States, which experiences about 90 million miles between fatalities.

In the UK, the theoretical driving test includes interactive, time-sensitive hazard detection, which classifies hazards into *potential* versus *developing*:

- *Developing hazards*: hazards that require an active driving response, such as reducing speed or steering, to reduce risk and avoid materialization of the hazard.
- *Potential hazards*: hazards that do not yet require intervention, but may become developing hazards.

To clarify the aforementioned definition and illustrate the implications, consider the scenes in Figs. 7–12:

The scene depicted in Fig. 7 contains multiple potential hazards, associated with pedestrians and parked cars, depicted using light circles, and a single *developing* hazard, a stopped vehicle in front, depicted using a dark circle. The developing hazard is a vehicle in front of Ego, stopped while waiting for pedestrians to clear the road in its path. Ego needs to reduce speed to avoid collision with that vehicle. The action to be selected by Ego is braking, because it will increase the TTM of a collision with that vehicle. The optimal deceleration minimizes discomfort and ensures coming to a stop with sufficient clearance from the vehicle in front; this requirement should be satisfied using a braking curve that minimizes the second derivative of the position (i.e., the acceleration, and hence the force), namely, a spline.

The *potential* hazards depicted as light circles in Fig. 7 represent various possible actions by pedestrians, which *may* evolve into *developing* hazards in the near future. As an example, the pedestrians on the sidewalk may cross the street. A pedestrian on the right, standing between parked cars, may cross the street. Similarly, drivers of parked cars may open the vehicle door to exit their vehicle.

The hazard monitoring Algorithm 1 is expected to identify the hazards in each scene and classify each as either *potential* or *developing*. Subsequently, the TTM estimation, and action selection for maximizing the TTM, is only performed for those hazards classified as developing.
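The restriction of TTM estimation to developing hazards can be sketched as a small filter; the classifier here is a hypothetical stand-in, and, per the conservatism required by Key Point 5, any hazard it cannot confidently label as potential is kept:

```python
def hazards_to_act_on(hazards, classify):
    """Restrict TTM estimation and action selection to *developing*
    hazards. classify(h) returns 'developing', 'potential', or anything
    else when unsure; classification is conservative, so only hazards
    confidently labeled 'potential' are excluded."""
    return [h for h in hazards if classify(h) != "potential"]
```

This keeps the expensive TTM rollouts focused on the few hazards that actually require a driving response in the current frame.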

Moving on to the example depicted in Figs. 8 and 9: in Fig. 8, a cyclist represents a *potential* hazard. The presence of the crosswalk increases the likelihood of this potential hazard converting into a developing hazard. Subsequently, Fig. 9 depicts the cyclist crossing the road, which constitutes a *developing* hazard, requiring Ego to decelerate and eventually stop.

Moving on to the example in Fig. 10, an oncoming vehicle is detected. Because the street is too narrow to accommodate both the oncoming vehicle and Ego (detecting this condition requires complex perception logic), the oncoming vehicle is classified as a *developing* hazard. Ego needs to decide whether to give right of way and move to the side. The parked car is initially classified as a potential hazard because it represents the usual *potential* hazards of a door opening or a pedestrian crossing, as depicted in Fig. 11. Once Ego decides to slow down and move to the side, the parked car needs to be regarded as a *developing* hazard because it blocks Ego’s path and forces it to decelerate. If Ego’s deceleration is not sufficient to bring it to a stop without colliding with the parked car, then the corresponding hazard would *materialize*. Subsequently, as the oncoming vehicle passes Ego, as depicted in Fig. 12, all hazards dissipate.

**Key Point 5**: The TTM-based hazard driving loop of Algorithm 1 should be adjusted to perform TTM estimation, and action selection, for *developing* hazards only, dramatically reducing its complexity. This requires conservative classification of hazards as *developing*.

**Key Point 6**: Quantification of controllability and validation of harm avoidance behavior must include error estimation for hazard detection, classification, and TTM estimation.

### Hazard Monitoring.

The key aspect of our approach is to decouple the measurement of TTM estimation errors from the simulation-based analysis of their implications, as depicted in Fig. 13. We train the perception and fusion models to identify hazards, essentially extending the basic object and event detection and response (OEDR) framework [17] with hazard identification. According to the basic approach, object tracking and fusion across multiple sensors feed directly into the path planning and driving decision logic, as shown in Fig. 14(a). Hazard monitoring is added as an intermediate step to provide dynamic hazard identification and TTM estimation, as depicted in Fig. 14(b), which could be implemented, e.g., using a deep neural network (DNN).

The SOTIF temporal view (Figure 3 of Ref. [18]) can be extended by adding the hazard monitoring steps to the scene interpretation logic, as depicted in Fig. 15. According to this temporal view, the perception module generates a representation of the perceived scene, which is combined with the goals of the driver, e.g., the dynamic driving task, to represent the situation. The situation is then fed as input to the decision logic, which determines the appropriate action. Subsequently, external event information is combined with the scene representation and the selected action to serve as additional input to the perception logic in the subsequent frame. Note that the perception logic takes sensor data as its input. In many cases, previous scene and action data are used in addition to sensory data for smoothing and adjusting the driving decision.

Hazard considerations are clearly missing from the SOTIF temporal view. The hazard monitor can be integrated within the SOTIF dynamic view as depicted in Fig. 15. The scene representation is fed into the hazard monitor for detection of all relevant hazards and estimation of their TTMs. The hazard monitor further performs a rollout, namely, it estimates future states for several frames into the future (the equivalent of a short-term simulation) to estimate the anticipated TTMs of candidate actions; intuitively, this is consistent with the use of rollouts in reinforcement learning [19]. The resulting TTMs for the candidate actions are fed as input to the decision logic to guide the selection of the action that maximizes the TTM. Note that the goal, e.g., the dynamic driving task, is not an input to the hazard monitor. Further note that the output of the hazard monitor is not used for primary decision-making; instead, it merely *guides* the decision logic to either adjust or select an alternative action with a better TTM.

The depiction in Fig. 15 is an oversimplified view of the control loop, which many in the industry use as a conceptual reference. However, this view implies that the perception logic and the hazard monitor take the previous frame alone as input; it is therefore important to note that hazard monitors perform much better when multiple frames (at least three) are provided as input. Depicting the full complexity of the end-to-end working reinforcement learning system with which the experiments were conducted is beyond the scope of this paper.

Calibration against test track data can be performed by comparing TTM estimates against ground truth measured either in simulation or in the physical world against known positions of traffic participants. Once the ground truth data are available, the TTM error distribution parameters can be estimated using nonlinear regression, and the appropriate correction terms can be used to adjust the TTM estimates, as depicted in Fig. 14(b). Subsequently, the simulation step within Algorithm 1 should inject errors using sampling of that distribution and provide a uniform streamlined impact analysis for all hazard types. The method extends the distributional scenario approach reported in Refs. [20,21], whereby the hazard monitor detects hazards, classifies them, and measures their TTM.
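A minimal sketch of the calibrate-then-inject step described above, substituting simple per-sample error moments and a normal error model for the paper’s nonlinear regression (all names and data illustrative):

```python
import random
import statistics

def calibrate_and_inject(ttm_estimates, ttm_ground_truth, ttm_true_sim, rng=None):
    """Estimate the TTM error distribution from (estimate, ground truth)
    pairs, then perturb a simulated ground-truth TTM with a sampled
    error, as the simulation step of Algorithm 1 would. Returns the
    perturbed TTM and the fitted (mean, stdev) parameters."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    errors = [e - g for e, g in zip(ttm_estimates, ttm_ground_truth)]
    mu = statistics.mean(errors)
    sigma = statistics.stdev(errors)
    return ttm_true_sim + rng.gauss(mu, sigma), (mu, sigma)
```

The same two-step structure (measure the error distribution once, then resample it inside simulation) provides the uniform, streamlined impact analysis described in the text.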

## Quantitative Hazard Modeling

In this section, we present methods for modeling hazards that support the quantification of risk through probability distributions.

### Modeling Limitations.

Some existing well-established approaches model the dependencies and data paths along the driving loop pipeline and formalize the input specification for each component. Deviations from those specifications are analyzed using failure mode effects and criticality analysis (FMECA) per Automotive Industry Action Group (AIAG) FMEA-4 [12], FTA, and others.

We propose to extend these methods by defining the system as *comprising all traffic participants within a scenario* and performing FMECA and FTA over that multi-agent system. This includes modeling the potential behaviors of all traffic participants.

**Key Point 7**: FMECA and FTA analyses should model all traffic participants for the applicable scenario. Furthermore, these methods need to be extended to support quantitative distribution of errors.

We model the *measured* relative distance $\Delta X_{\text{measured}}$ as the sum of the ground truth $\Delta X$ and an error $\varepsilon_x$, as depicted in Fig. 16. Similarly, the *measured* relative velocity is the sum of the ground truth $\Delta V$ and an error $\varepsilon_v$. The *measured* TTC is expressed as follows:

$$\mathrm{TTC}_{\text{measured}} = \frac{\Delta X + \varepsilon_x}{\Delta V + \varepsilon_v}$$

The estimation error $\varepsilon$, and its distribution, can be *measured against the ground truth* obtained from physical test readings, as well as using simulated synthetic scene images. We conducted such an experiment by training a DNN estimator to measure the distance to the lead vehicle using simulation images. We trained the estimator (i.e., regressor) on sunny-day images and measured the error on sunny-day images, as well as on dusk and snow images. The results, depicted in Fig. 17, show that *the DNN estimator error is normally distributed*. The consequence is that the TTC error is distributed according to the ratio of two normal distributions, implying that the TTM error of our estimator can be modeled using the Cauchy distribution, parameterized by $\gamma$, as depicted in Fig. 18.

Most importantly, we observe that the estimation error *ɛ*_{x} for dusk and snowy images is biased, with *ɛ*_{0} > 5 m. This implies *unsafe overestimation*, because TTC-based crash-avoidance braking will be triggered later than it would otherwise be triggered without such errors.

The parameters of the distribution of errors can be estimated from the measured data. As an example, estimators for the TTC Cauchy Distribution parameter can be found in Refs. [23,24] and others.
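To illustrate why the ratio of two normally distributed quantities yields heavy-tailed (Cauchy-like) TTC errors, a small Monte-Carlo sketch can be used; the true gap, closing speed, and error sigmas below are illustrative values, not measurements from the paper:

```python
import random

def ttc_error_samples(n, sigma_x=1.0, sigma_v=0.5, seed=0):
    """Sample TTC estimation errors when the distance error e_x and the
    closing-speed error e_v are zero-mean normal: the measured TTC
    (dx + e_x) / (dv + e_v) around a true state produces heavy-tailed
    errors, consistent with a Cauchy-like model."""
    rng = random.Random(seed)
    dx, dv = 30.0, 10.0            # illustrative true gap (m), closing speed (m/s)
    ttc_true = dx / dv             # 3.0 s
    return [(dx + rng.gauss(0, sigma_x)) / (dv + rng.gauss(0, sigma_v)) - ttc_true
            for _ in range(n)]
```

The sample errors are centered near zero but exhibit occasional very large deviations, which is the qualitative behavior the Cauchy model with scale parameter *γ* captures.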

**Key Point 8**: Validation of the dynamic hazard driving loop should focus on measurement of TTM estimation error distribution against ground truth, and the subsequent evaluation of the impact of such error distribution using simulation is depicted in Fig. 13.

### Modeling Misuse.

Only unintentional misuse is in scope for safety standards; intentional misuse is regarded as a security issue. Unintentional misuse can be classified into two main categories:

1. Usage outside the operational design domain (ODD) of the vehicle. Avoiding such misuse requires automatically detecting situations outside the ODD and communicating such a detection to the operator.
2. Inappropriate response to temporary restrictions, e.g., per authorities, oil spills due to accidents, or other unforeseen events. Avoiding such misuse requires modeling unplanned restrictions and their communication to the operator.

In all cases, communication of an imminent misuse must allow the operator reasonable time and options to respond and avoid further misuse; this requires priming drivers ahead of time. Once appropriate communication takes place, any subsequent misuse becomes intentional and out of scope for safety standards.

**Key Point 9**: Formulate the requirements for ODD detection, handling temporary restrictions and the implied communications to prevent unintentional misuse. This requires integration of the data across multiple modalities, including global positioning system (GPS) and high definition (HD) maps for geo-fencing, traction sensing, visibility characterization, and other sensing capabilities.

The misuse hazard in general, and geo-fencing in particular, can be associated with TTM, which can be estimated within the dynamic hazard driving loop. As an example, when a vehicle is about to exit the geo-fence of the allowed autonomous driving, a Take Over Request (TOR) can be initiated, and the TTM is represented as the expected delay in human response. Much as with estimation of TTC, the error in such estimation can be measured by comparing to a ground truth. Such a measurement should quantify the distribution of errors *P*(*ɛ*_{TTM}), as reported in Refs. [25,27], enabling the impact evaluation in simulation to obtain relative frequency and confidence for various outcomes. This is described in detail next.

**Key Point 10**: Validation of the risk of misuse should focus on the measurement of human-response TTM estimation error against ground truth and the subsequent evaluation of the impact of such error using simulation, as depicted in Fig. 13; see the TOR examples that follow.

### Modeling Takeover Requests.

An important class of misuse involves the behavior of the human driver when the system initiates a TOR. A common scenario in which a TOR is issued is when the vehicle exits a region in which autonomous driving is supported, e.g., when leaving the freeway onto an unmarked rural road or entering a municipality in which autonomous driving is explicitly forbidden.

The TOR-related misuse can be modeled quantitatively using a distribution of possible outcomes. Misuse can occur due to a delay in performing the takeover. If the human driver does not perform the takeover, the misuse hazard materializes, triggering a transition into a minimal risk condition (MRC), forcing the AV to stop safely at the side of the road.

As an example, Fig. 19 [25] depicts a cumulative distribution function for the reaction time of a human driver measured after the TOR was provided. From the distribution charts and per Ref. [25], we can conclude the following:

- Full automated-driver switch-off was mostly completed within 7–8 s. This implies that the autonomous driver needed to continue operating up to 8 s after the takeover request.
- Glancing at the road was mostly completed within 3–4 s. This implies a large gap between the initial glance and the final automated-driver switch-off.
- Hands-on takeover of steering and pedals was mostly completed within 6–7 s. This implies a gap between driver action completion and final automated-driver switch-off.

The results reported in Ref. [25] imply that a potential misuse hazard is introduced upon issuance of the TOR. It escalates to a *developing* hazard if no driver response is observed within 4 s. Subsequently, the hazard materializes if the driver does not take over within ≈10 s, triggering an MRC action.
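The escalation timeline read off Ref. [25] can be sketched as a small state function; the ≈4 s and ≈10 s thresholds come from the discussion above, and the state names follow the potential/developing terminology of this paper:

```python
def tor_hazard_state(seconds_since_tor, driver_responded,
                     develop_after=4.0, materialize_after=10.0):
    """Hazard escalation for a takeover request: potential upon issuance,
    developing after ~4 s without a driver response, materialized
    (triggering a minimal risk condition, MRC) after ~10 s; the hazard
    dissipates on a timely driver response."""
    if driver_responded:
        return "dissipated"
    if seconds_since_tor >= materialize_after:
        return "materialized"   # trigger MRC stop
    if seconds_since_tor >= develop_after:
        return "developing"
    return "potential"
```

Framing the TOR this way lets the same dynamic-hazard driving loop monitor it alongside environment hazards.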

**Key Point 11**: Takeover should be modeled quantitatively like any other hazard. The long takeover duration implies that autonomous drivers must predict the need for takeover >10 s in advance. Consequently, *it is not practical to initiate a takeover due to the dynamic situation on the road*, because the situation changes completely after 10 s. The only practical level 3 takeover triggers are based on predictable imminent geo-fencing restrictions, temporary road closures (e.g., due to weather), or restrictions by authorities.

Similarly, the distribution of the lane-offset error after takeover needs to be modeled quantitatively. As an example, Fig. 20 [26] depicts the longitudinal distribution of offset errors as a function of position, as well as the distribution of the largest error across multiple drivers. From the distribution charts, we can conclude the following:

- The offset error due to takeover can be very large, exceeding the width of a lane.
- The maximum error may occur very far from the takeover point, implying that the long-range impact of the takeover hazard must be considered.
- The error distribution exhibited by human drivers is far wider than that of an automated driver. Consequently, takeover hazard modeling must consider far more extreme errors, as large as an entire lane width.

**Key Point 12**: Ramifications of errors due to misuse-hazards in general, and TOR in particular, need to be measured and understood quantitatively. The maximum errors to be considered are very significant and occur far beyond the point at which the hazard is initiated or escalates to a developing hazard. Most importantly, these errors must be considered even when the hazard dissipates due to timely human response.

### Environment Hazard Modeling.

To simplify and unify the modeling of the wide range of environment hazards, we measure the TTM error for various locations on the road and propagate the implications using simulations. This approach decouples the challenge of measuring the TTM error distribution from the validation task, as depicted in Fig. 13. For modeling errors and limitations due to road conditions, the measurement is performed by placing the sensors at various locations on the road, recording their relative distance and velocity estimates, and applying parametric regression to estimate the mean and variance of the TTM error distribution.
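As a minimal sketch of this measurement step, the following fits the mean (bias) and standard deviation (scale) of the TTM error at fixed grid locations; the measurement values are hypothetical:

```python
# Sketch: estimating the mean (bias) and standard deviation (scale) of the
# TTM estimation error at fixed grid locations on a test track.
# The error samples below are hypothetical.
from statistics import mean, stdev

def fit_error_params(measurements):
    """measurements: {grid_location_m: [TTM error samples in s]}.
    Returns {grid_location_m: (bias, scale)} for use in simulation."""
    return {x: (mean(errs), stdev(errs)) for x, errs in measurements.items()}

# Hypothetical errors measured before (x=0 m) and inside (x=50 m) a tunnel.
samples = {0.0: [0.05, -0.02, 0.01, 0.04], 50.0: [0.4, 0.9, 0.6, 0.7]}
params = fit_error_params(samples)
```

The fitted (bias, scale) pairs per location are exactly the parameters the simulation later interpolates between grid points.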

Consider, for example, modeling *ɛ*_{TTC} for a two-way road upon entering a tunnel. A depiction of the TTC scale model for a tunnel is provided in Fig. 21. The light dashed circles represent the confidence intervals of TTC estimation as expected to occur at the center of the circle. The arrows represent the mean of the error, also referred to as the estimator's bias. Before entering the tunnel, the TTC estimation is very accurate, depicted by small-radius circles. Subsequently, upon entry, the reflections and lighting conditions change, resulting in an increased TTC estimation error. The most significant errors occur when the trajectory difference between the two vehicles is large. The loss of GPS signal within the tunnel results in compounded hazards, modeled by increased circle sizes deeper into the tunnel.

**Key Point 13**: There is a need to train perception algorithms to estimate TTM at various road positions under different environment conditions. As such, there is a need to measure the *location-dependent* distribution of estimation errors against ground truth obtained from physical test-track testing.

An important challenge is the applicability of the estimation error in different environments. The performance of a machine-learned component is optimal only when the distribution of data observed during deployment matches the training distribution. When the distribution observed during deployment differs from the training distribution, the model tends to exhibit degraded performance. As an example, see our experiment as depicted in Fig. 17.

To achieve good performance across the distributions induced by different environments, there is a need to fine-tune a separate model against each input distribution. As such, when multiple models are available, each trained to operate in specific environment conditions, there is a need to select the appropriate model based on the observed environment, as depicted in Fig. 22.

According to our approach, sensor data are used to estimate which model is applicable for interpreting other sensors. For example, readings from a road condition sensor would lead to selecting different camera and LiDAR hazard classification, TTM estimation, and action selection models for dry, wet, soft-snow, or icy road conditions.
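A minimal sketch of this selection logic follows; the condition names and the model registry are illustrative assumptions, not part of the approach's specification:

```python
# Minimal sketch of environment-conditioned model selection: a road-condition
# estimate routes perception to the model fine-tuned for that distribution.
# The condition names and registry entries are hypothetical.
MODEL_REGISTRY = {
    "dry":       "ttm_model_dry",
    "wet":       "ttm_model_wet",
    "soft_snow": "ttm_model_soft_snow",
    "ice":       "ttm_model_ice",
}

def select_ttm_model(road_condition_reading, default="ttm_model_dry"):
    """Pick the TTM estimation model matching the observed environment;
    fall back to the default model when the condition is unrecognized."""
    return MODEL_REGISTRY.get(road_condition_reading, default)

chosen = select_ttm_model("wet")
```

In a real stack, the registry values would be loaded model objects rather than names; the dispatch pattern is the point.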

### Modeling Bias and Variance.

The TTM error estimation approach requires measurement of both the mean (=bias) and the confidence interval (=scale) of the estimation error for various locations on the road (across all hazard types). Although it is not practical to measure and specify the confidence intervals for all possible locations on the road, it is reasonable to take measurements at specific grid points. The visualization of the TTM error distribution in Fig. 21 uses circle centers to depict the points of measurement, arrows originating at those centers to visualize the error mean (=bias), and the circle radius (=error scale) to depict 2 × the measured error standard deviation (half of the confidence interval). Subsequently, during simulation, estimating the TTM confidence interval for a specific location (*x*, *y*) is achieved by linear interpolation against the nearest grid points specified.

Environment modeling using error interpolation is performed as follows: For a single dimension, assume that measurements of the standard deviation are available at grid points (*x*_{1}, *σ*(*x*_{1})), …, (*x*_{n}, *σ*(*x*_{n})). Estimating the value of *σ*(*x*) for an arbitrary point *x* requires two steps:

*Step 1:* Find *i* such that *x*_{i} < *x* < *x*_{i+1}.

*Step 2:* Compute *σ*(*x*) according to:

(9) $\frac{\sigma(x) - \sigma(x_i)}{x - x_i} = \frac{\sigma(x_{i+1}) - \sigma(x_i)}{x_{i+1} - x_i}$

Solving for *σ*(*x*), we get

(10) $\sigma(x) = \sigma(x_i) + \frac{x - x_i}{x_{i+1} - x_i}\,\bigl(\sigma(x_{i+1}) - \sigma(x_i)\bigr)$

The same pattern applies for estimating the average *μ*(*x*). Generalizing this approach to two and three dimensions gives rise to bilinear and trilinear interpolation [28], which enable estimating the mean and confidence interval at an unknown point (*x*, *y*), given measured data at grid points. Consequently, a simplistic parametric model, comprising error mean and variance as parameters, is often sufficient to enable the quantification of the acceptable error rates for a desired distribution of outcomes.

**Key Point 14**: Use a simple parametric distribution model, and linear interpolation of the parameters, to estimate the TTM error distribution between measured locations.
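A minimal one-dimensional sketch of Key Point 14, assuming hypothetical grid measurements of *σ*:

```python
# Sketch of Key Point 14: linear interpolation of a measured error parameter
# (mean or standard deviation) between grid points, clamped at the ends.
def interp_param(x, grid):
    """grid: sorted list of (x_i, value_i) measurement pairs.
    Returns the linearly interpolated value at x."""
    if x <= grid[0][0]:
        return grid[0][1]
    if x >= grid[-1][0]:
        return grid[-1][1]
    for (x0, v0), (x1, v1) in zip(grid, grid[1:]):
        if x0 <= x <= x1:
            return v0 + (x - x0) * (v1 - v0) / (x1 - x0)

# Hypothetical sigma measurements at 0 m, 10 m, and 20 m along the road.
sigma_grid = [(0.0, 0.1), (10.0, 0.3), (20.0, 0.8)]
sigma_mid = interp_param(5.0, sigma_grid)   # halfway between 0.1 and 0.3
```

The same function serves for *μ*(*x*); extending it per axis yields the bilinear and trilinear cases mentioned above.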

### Modeling False Positives and Negatives.

The hazard monitor has the two key components of hazard classification followed by TTM estimation, as depicted in Fig. 15. Although the TTM estimation is numeric, the hazard classification is categorical. As such, modeling the errors of the hazard detector can be done using the standard classification error metrics. As an example, to support detecting compounding hazards, it is reasonable to use separate binary hazard classifiers (i.e., present or not), one for each hazard, because this allows multiple classifiers to detect their corresponding hazards concurrently.

The standard binary classifier metrics of true positives versus false positives and true negatives versus false negatives can be used to measure the performance of the component. However, it is important to realize that those metrics do not directly impact the safety of the AV; a major increase in false positives or negatives may have little or no impact on the overall safety or comfort of the AV. To estimate the impact of false positives or negatives, a sampler needs to be used to probabilistically inject errors within each simulation time frame, as depicted in Fig. 23, and the results need to be correlated with the simulated outcomes.

**Key Point 15**: Impact of false positives and negatives needs to be determined by injecting errors into simulation using sampling of error distributions and observing the statistical outcomes in simulation.
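A minimal sketch of the per-frame error injection, assuming hypothetical FP/FN rates:

```python
# Sketch of Key Point 15: probabilistic injection of classifier errors into
# each simulation frame, so the downstream impact of false positives and
# negatives can be measured statistically. The rates below are hypothetical.
import random

def inject_classifier_errors(hazard_present, fp_rate, fn_rate, rng):
    """Corrupt the ground-truth hazard flag with sampled FP/FN errors."""
    if hazard_present:
        return rng.random() >= fn_rate   # drop the detection with prob fn_rate
    return rng.random() < fp_rate        # spurious detection with prob fp_rate

rng = random.Random(0)
frames = [inject_classifier_errors(True, fp_rate=0.01, fn_rate=0.1, rng=rng)
          for _ in range(10_000)]
detection_rate = sum(frames) / len(frames)   # close to 1 - fn_rate
```

In a full pipeline the corrupted flags would feed the simulated planner, and the resulting outcome statistics (not the raw FP/FN rates) would quantify safety impact.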

## Practical Development

In this section, we provide a systematic approach for addressing the exponential explosion of the number of scenarios that need to be tested. We start by reviewing the scenario specification methodology based on Ref. [29] and proceed with a description of our approach for identifying the safety-related parametric scenarios of interest. Specifically, we leverage TTM to identify the ODD subset representing hazardous scenarios. This approach can generate data for development, testing, and safety validation.

Note that commercial use of scenario variation generators based on the parametric scenario approach described herein is in its infancy. Usage of the scenario distribution correlation approach was most recently reported in Refs. [20,21], and applications to physical vehicle homologation were reported in Ref. [30] (see section C "model-in-the-loop"). More recently, the Enable S3 consortium reported using critical scenario identification as a component in their methods [31] (see Critical Case Generation Based on Parametrized Real-World Driving section). The methods described in this section reach beyond the state of the art by using the TTM as the universal metric for hazards.

### Formal Operational Design Domain Scenario Specification.

Formal quantitative distributional ODD specification is the underlying requirement for rendering the development practical. Such a representation is achieved using the Pegasus approach [29], which defines the following levels of scenario specifications:

- *Functional Scenarios:* An *informal*, ambiguous specification of the scenario, using natural language, e.g., "Ego backing up into another vehicle," "Another vehicle backing up into Ego," or "Following vehicle making maneuver."
- *Logical Scenarios:* Parameterized scenario templates, regarded as *formal* specifications that provide ranges and distributions of values, e.g., lane widths and trajectories for other drivers.
- *Concrete Scenarios:* These extend logical scenarios to be directly executable in a simulator by providing a specific value for each parameter.

**Key Point 16**: According to the Pegasus approach, the scenario requirements process should start with informal functional scenarios, proceed with specification of the population of scenarios representing the ODD, and conclude with a library of concrete formal scenarios that can execute on a simulator and with which specific tests are conducted.

**Key Point 17**: The probability distributions of logical scenario parameters can provide a formal verifiable ODD specification supporting the safety case argumentation for AV.

According to our approach, the ODD is specified using rules and constraints defining the subset of parameter combinations which are in scope. A library of concrete scenarios is generated from the logical scenarios, through sampling of parameter value combinations according to the distributions provided in each logical scenario.
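The sampling step can be sketched as follows; the logical-scenario parameters and uniform ranges below are illustrative assumptions, not part of the Pegasus specification:

```python
# Sketch: generating a library of concrete scenarios by sampling parameter
# values from the distributions of a logical scenario. The parameter names
# and uniform ranges are hypothetical.
import random

LOGICAL_CUT_IN = {                      # parameter -> (low, high), uniform
    "ego_speed_mps":    (20.0, 35.0),
    "gap_to_lead_m":    (5.0, 60.0),
    "cut_in_speed_mps": (15.0, 30.0),
}

def sample_concrete(logical, n, seed=0):
    """Draw n concrete scenarios (parameter dicts) from a logical scenario."""
    rng = random.Random(seed)
    return [{p: rng.uniform(lo, hi) for p, (lo, hi) in logical.items()}
            for _ in range(n)]

library = sample_concrete(LOGICAL_CUT_IN, n=100)
```

ODD rules and constraints would then filter this library, keeping only the parameter combinations in scope.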

**Key Point 18**: Only a subset of all possible parameter combinations is stored in the library of concrete scenarios. That subset is used for development testing and safety validation.

### Probabilistic Hazard Analysis Risk Assessment.

In the *Bayesian HARA* approach, depicted in Fig. 24, the overall probability of harm *p*(*crash*) is decomposed using the probability that the harm is caused by that hazard *p*(*crash*|*hazard*), multiplied by the probability that the hazard materializes for a *logical* scenario *p*(*hazard*|*scenario*), multiplied by the prior probability of the *logical* scenario *p*(*scenario*). This is provided in Eq. (11), where $L$ is the scenario library and $H$ is the collection of hazards under consideration. Using the ISO 26262 terminology, the term *p*(*crash*|*hazard*) represents the complement of the controllability, and *p*(*hazard*|*scenario*) · *p*(*scenario*) represents the hazard exposure for the logical scenario. The probability that one of the *developing* hazards in $H$ materializes across all *logical* scenarios in $L$ is given by

(11) $p(c) = \sum_{s \in L} \sum_{h \in H} p(c \mid h)\, p(h \mid s)\, p(s)$

where *s* is a *logical* scenario, *H* is a collection of *developing* hazards, and *c* is a materialized hazard, e.g., a crash. Denoting *H*_{potential}, *H*_{developing}, and *H*_{materialized} as the propositions representing that at least one hazard in the set *H* is *potential*, *developing*, and *materialized*, respectively, the hazard lifecycle described earlier is incorporated according to Eq. (12)

(12) $p(H_{\mathrm{materialized}} \mid s) = p(H_{\mathrm{materialized}} \mid H_{\mathrm{developing}})\, p(H_{\mathrm{developing}} \mid H_{\mathrm{potential}})\, p(H_{\mathrm{potential}} \mid s)$

Consider example scenario #2 of "following vehicle making maneuver" as *s*, and consider the two hazards of *h*_{1} = "lead vehicle making lane change" and *h*_{2} = "an obstacle in Ego's lane". The hazard *h*_{1} is materialized as soon as the lead vehicle starts the lane change. Similarly, hazard *h*_{2} is materialized if a stopping vehicle is waiting for pedestrians, or is blocked by traffic, or there exists a pothole or a fallen tree on the road, or a pedestrian or a cyclist is in the way, or any other obstacle is present. In such a scenario, a crash occurs if both *h*_{1} and *h*_{2} materialize, with probability *p*(*c*|*h*_{1}, *h*_{2}). The compounding of *h*_{1} and *h*_{2} occurs with probability *p*(*h*_{1}, *h*_{2}|*s*) for a logical scenario *s*. Thus, the exposure to compounding both *h*_{1} and *h*_{2} is quantified as the probability *p*(*h*_{1}, *h*_{2}|*s*)*p*(*s*). Finally, the *complement* of controllability for this compound hazard is quantified as the probability of a crash given that both hazards materialized, namely, *p*(*c*|*h*_{1}, *h*_{2}).
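The compound-hazard arithmetic above can be sketched directly; all probability values below are hypothetical placeholders, not measured data:

```python
# Sketch of the Bayesian HARA decomposition for the compound-hazard example:
# exposure p(h1,h2|s) * p(s), multiplied by the complement of controllability
# p(c|h1,h2). All probability values are hypothetical placeholders.
p_s = 0.01              # prior of the logical scenario s
p_h1_h2_given_s = 1e-4  # both hazards compound within the scenario
p_c_given_h1_h2 = 0.3   # complement of controllability for the compound hazard

exposure = p_h1_h2_given_s * p_s                 # ISO 26262 "exposure" term
p_crash_contrib = p_c_given_h1_h2 * exposure     # this scenario's risk term
```

Summing such contributions over the scenario library and hazard collection yields the overall probability of harm.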

### Hazardous Scenarios.

Intuitively, most driving is uneventful and regarded as "easy." Scenarios that lead to unintended outcomes are considered "challenging" and are typically infrequent. Naturally, we seek approaches for generating scenario populations that increase the frequency of those infrequent "challenging" events. Our approach quantifies the degree to which a scenario is desirable in terms of an objective function as follows:

*Given the simulation results r*(*s*_{1}), *r*(*s*_{2}) *of two* logical *scenarios s*_{1}, *s*_{2}, *and an objective function o*(*r*(*s*_{i})), *then s*_{1} *is* more desirable *than s*_{2} *if and only if o*(*r*(*s*_{1})) > *o*(*r*(*s*_{2})).

This generic formulation is helpful for defining the notion of *Hazardous Scenarios*, by systematically adjusting the parameter distributions for logical scenarios to maximize (or minimize) an objective.

*Given a random variable x drawn from a distribution P, let the quantile function q*(*P*, *α*) *be the value v such that P*(*x* < *v*) = *α*. *Given a hazard h and two scenario parameters α*_{dev}, *α*_{mat}, *let the TTM m*_{h} *be the metric defining that a hazard h escalates from* potential *to* developing *when m*_{h} ≤ *α*_{dev}, *and h escalates from* developing *to* materialized *when m*_{h} ≤ *α*_{mat}. *Define* $P_{m_h}(s)$ *as the distribution of m*_{h} *for a* logical *scenario s, namely,* $m_h \sim P_{m_h}(s)$. *The hazardous-scenario objective function for discovering* developing *hazards is* $o(r(s)) = -q(P_{m_h}(s), \alpha_{dev})$, *and for* materialized *hazards the objective function is* $o(r(s)) = -q(P_{m_h}(s), \alpha_{mat})$. *The negation of q is used to facilitate minimizing the metric by means of maximizing its negation.*

TTC is a commonly used metric of separation between vehicles. As such, define a collision hazard metric *m*_{h} = TTC and select *α* = 0.05, indicating that the objective is computed as the 5th percentile of TTCs for a given *logical* scenario. For such a collision hazard, the objective function would be *o*(*r*(*s*)) = −*q*(*P*_{TTC}(*s*), *α*); the negation is used because we maximize the objective, and maximizing the negative of the TTC quantile minimizes its value, increasing the probability of a crash.
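A minimal sketch of computing this objective from simulated TTC samples (hypothetical values), using a nearest-rank empirical quantile:

```python
# Sketch: the hazardous-scenario objective o(r(s)) = -q(P_TTC(s), 0.05),
# computed from simulated per-run minimum TTCs of one logical scenario.
# The TTC samples below are hypothetical.
def quantile(samples, alpha):
    """Empirical quantile by nearest rank on the sorted samples."""
    s = sorted(samples)
    idx = min(int(alpha * len(s)), len(s) - 1)
    return s[idx]

ttc_samples = [4.2, 3.8, 6.0, 1.1, 5.5, 2.7, 7.3, 0.9, 3.3, 4.9] * 10
objective = -quantile(ttc_samples, 0.05)   # maximizing pushes 5th-pct TTC down
```

A sampler trained to maximize `objective` therefore concentrates probability mass on scenarios whose low-quantile TTC approaches zero.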

With the aforementioned definition, we can search for scenarios in which collisions occur using Algorithm 1, depicted in Fig. 25. First, start with a parameterized *logical* scenario *s*. Next, generate a population of *concrete* scenarios according to an initial candidate distribution. Next, run the simulation, evaluate the objective function against the resulting time series of each *concrete* scenario, and estimate $P_{m_h}(s)$. Next, train the sampler to increase the portion of desirable scenarios by providing it with pairs (*s*_{i}, *o*(*r*_{i}(*s*_{i}))) specifying the value of the objective for each input concrete scenario. Update the parameter distribution for *s* according to $P_{m_h}(s)$. Repeat this process for several iterations and provide the trained sampler as output.
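The iterative loop can be sketched in a cross-entropy style; the one-parameter toy "simulator" and the Gaussian sampler below are simplifying assumptions standing in for the real simulator and the neural sampler:

```python
# Hedged, cross-entropy-style sketch of the iterative sampling loop: a toy
# one-parameter "simulator" stands in for the real one, and a Gaussian
# sampler is refit each iteration to the most hazardous scenarios.
import random
import statistics

def simulate_ttc(gap_m):
    """Toy stand-in: a smaller initial gap yields a smaller minimum TTC."""
    return max(0.0, gap_m / 10.0)

def train_sampler(iterations=5, n=200, elite_frac=0.1, seed=0):
    rng = random.Random(seed)
    mu, sigma = 40.0, 10.0                  # initial candidate distribution
    for _ in range(iterations):
        gaps = [max(0.1, rng.gauss(mu, sigma)) for _ in range(n)]
        scored = sorted(gaps, key=lambda g: -simulate_ttc(g))  # low TTC last
        elites = scored[-int(n * elite_frac):]  # most hazardous scenarios
        mu = statistics.mean(elites)            # refit the sampler to elites
        sigma = max(statistics.stdev(elites), 0.5)
    return mu, sigma

mu, sigma = train_sampler()   # distribution shifts toward small, hazardous gaps
```

Each pass mirrors Algorithm 1: sample concrete scenarios, simulate, score by the objective, and refit the sampler so that desirable (low-TTC) scenarios become frequent.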

#### Learning to generate desirable scenarios

An illustration of a result obtained by this simple approach is depicted by the probability density charts in Fig. 25. The blue line represents the initial distribution generated before the sampler was able to learn. The red line represents the distribution of the objective produced by resampling for the *logical* scenario. The left chart represents the results after the first iteration, and the right chart represents the results after the third iteration. As can be observed in the right chart, after the third iteration, most of the scenarios were generated with an objective TTC = *o*(*r*(*s*)) ≈ 0, the target value. For this example, the sampler was able to overcome multiple local maxima and a multimodal objective distribution.

A number of variations on this algorithm can be considered, including various stopping criteria and training strategies. An interesting implementation leverages an "inverted deep neural network," depicted in Fig. 26(a), which generates fixed-length samples comprising thousands of scenarios for each forward pass. That sampler, using losses of mean square error (MSE) and *χ*^{2} for the distribution distance, can generate >90% desirable scenarios, as depicted in Fig. 26(b). The reader is referred to the very rich applicable literature on statistical sampling and resampling.

**Key Point 19**: Rather than simulating mostly undesirable scenarios, Algorithm 2 can be used to generate libraries of desirable concrete scenarios for development and testing.

As an example, consider the 3D chart analyzing a cut-in maneuver depicted in Fig. 27, visualizing the advantage of a design of experiment (DoE) approach over a naive grid search. The horizontal axis represents the Ego speed. The vertical axis represents the distance to the lead vehicle. The depth (slanted) axis represents the speed of the vehicle cutting in. The red circles represent combinations selected by a naive grid-search approach. The blue circles represent the combinations selected by a DoE approach similar to the methods described herein. As can be observed, the combinations with a small distance to the lead vehicle are selected due to the increased likelihood of a crash. Such a DoE approach commonly reduces the number of simulations required by an order of magnitude or more.

In another example, consider the report of the Enable S3 consortium [31]. A DoE was used to search for a safety boundary separating safe scenarios from crash scenarios. The boundary depicted in Fig. 28 represents the safety performance limitation of the ADAS with respect to real traffic (corresponding to measurements). The chart represents an analysis of example scenario 2. The *X*-axis, labeled Δ*V*, represents the relative velocity between the lead vehicle and the stopped vehicle, in meters per second. The *Y*-axis, labeled *T*, represents the duration of the lane change in seconds. The *Z*-axis, labeled Δ*y*, represents the relative distance between the lead vehicle and the stopped vehicle. For the specific AV software and hardware integration stack simulated, the observed safe-scenario boundary represents an almost planar separation between safe and unsafe scenarios.

## Conclusion

We propose a generic framework for understanding and managing the risk associated with hazards in autonomous driving. We introduced a generic dynamic framework whereby the autonomous driver continuously monitors for imminent hazards and selects actions that maximize the time to materialization (TTM) of these hazards.

We show how to quantify and bound risk using a probabilistic approach to hazard occurrence. We recommend that the perception implementation requirements and the machine learning training approach extend beyond standard object detection and tracking: We introduce a hazard monitor, which is trained to continuously detect hazards and estimate their TTM, and we present methods to obtain quantitative distributional models of errors.

We provide a simple hazard lifecycle. It starts with a *potential* hazard, which may escalate to a *developing* hazard if action is needed to avoid harm. When harm cannot be avoided, the developing hazard escalates to become *materialized*. At that point, the action selected by the AV impacts the degree of harm inflicted.

To support quantification of safety, we present modeling approaches for the various hazard types, including modeling of sensor limitations in the context of specific environmental conditions, as well as misuse originating from TORs.

Finally, to integrate all the concepts developed into a practical system, we present the probabilistic hazard analysis risk assessment approach leveraging Bayesian decomposition, namely, Bayesian HARA. We show how the simple Bayesian inference chain rule can be applied to model escalation of hazards according to their life cycle.

The advantage of Bayesian HARA is its support for accelerated development through sampling. As such, we provide an algorithm that trains a sampler to generate mostly scenarios of interest. We introduce TTM quantiles as an objective function. We describe the optimization algorithm leveraging that metric to discover populations of scenarios of interest in general and hazardous scenarios in particular. This enables focusing development, testing, and safety validation on a manageable number of scenarios and reduces the number of simulations needed by orders of magnitude.

The methods used to quantify the performance in simulation are predictive of the physical performance when three key assumptions hold: (1) The system modeled needs to include the entire software stack of the AV; this is referred to as software-in-the-loop simulation. (2) The end-to-end integration of that software stack within the simulation is correctly simulated and measured. (3) The underlying distribution of scenarios used for performance testing needs to be representative of the target ODD. Although the overall system modeled is complex, we postulate that verifying the validity of these three key assumptions is sufficient to render the use of simulation effective for the development and safety validation of an AV.

## Footnotes

Each crash involves a driver and may involve passengers or pedestrians, with various degrees of injuries.

## Conflict of Interest

There are no conflicts of interest.