This article discusses the reasons engineering systems fail and how such failures can be avoided. It is human nature, and economically attractive, to discount low-probability, high-severity consequences in the design and development of complex systems. Discounting or ignoring the effects of worst-case scenarios, however, can lead to a culture of complacency that heightens risks. Failure due to poor development can be traced to a lack of organizational commitment to systems thinking. An example of this is a lack of communication between designers and end users. Systems designed without the user in mind and without regard to human factors for safe operation and maintenance can have disastrous results. Overly complex user interfaces in both hardware and software systems are common points of failure. Failures due to a lack of training should be considered not the fault of the individual operator but a failure of foresight by management. Improved training of end users has been shown to significantly reduce system failures and improve the integrity of systems.
Global Impact: Consider how ASME can provide leadership for improving risk management and resilience for complex systems.
Most people like to think of themselves as independent and autonomous. But we are all enmeshed in a web of technology. Our daily lives depend on the continuous and safe delivery of technological services and products. Technology is essential to the food we eat, the clothes we wear, our transportation and communication systems, our financial services, our health and personal safety, and especially to a clean and safe environment.
Taken as a whole, this web of technological services and products is the critical infrastructure of the planet. Engineers from every discipline have contributed to the design and fabrication of this infrastructure, and their success in creating technological advances has created the public perception that the only limits are the bounds of imagination. Today, our buildings are taller, our bridges are longer, our ports handle more traffic, our energy systems extract more fuel, our sanitation systems dispose of more waste, and our power grid operates at higher capacity than could have been believed possible a generation ago.
And the demands placed on engineers will only increase. Population growth, economic expansion, resource limits, and age-related deterioration are placing enormous burdens on the existing critical infrastructure and hard-to-reconcile constraints on plans for new infrastructure. These demands are being met with the design of systems that are more complex in their development, functionality, and maintenance requirements. Such large-scale, complex systems are interdependent, highly concentrated, and centralized—and are of such importance that they have been labeled by some as “too large to fail.”
Such systems are, however, too vulnerable to protect against all potential threats and failures. That vulnerability has been exposed by recent events, from the destruction of the World Trade Center in September 2001, to the blowout of the Deepwater Horizon in 2010, to the aftermath of the Tohoku earthquake and tsunami in March of this year. Each of these events has heightened the public's concern and awareness of the social, economic, and environmental consequences of large-scale complex system failure. And in each case, the question has been raised: Why was our critical infrastructure left so exposed to failure?
No professionals are better suited to address these concerns than the engineers who are charged with building the complex systems in the first place. And in October 2010, in the wake of the Deepwater Horizon accident, a task force convened by ASME determined that the Society should work toward developing informational initiatives surrounding complex systems failure. So recommendations were developed to raise awareness with the general public about the issue of complex systems failure; to develop educational packages for practicing engineers; to produce a compendium of standard risk analysis tools with an accompanying framework for technical complex systems analysis; and to help companies understand, assess, and manage risk within their enterprises.
It's a big job, but one that is crucial.
Societies depend upon the systems, and systems of systems, that constitute their critical infrastructure. Too often, though, those systems are poorly understood and taken for granted by the population that relies on them. And it isn’t just the megastructures and vast networks that are increasingly opaque: Modern society is filled with small gadgets, from cell phones and pacemakers to personal computers and automobile electronic systems, that are maddeningly complex, virtually indispensable, and scarcely noticed unless they fail.
Without a doubt, few people take the time to consider the consequences of the failure of such systems or question how the effects of failure can be minimized. Nor is it generally recognized that in almost all cases, engineers have been asked to optimize the design of systems in order to reduce cost or improve efficiency, or do both. Efficient infrastructure, while wonderful when working as designed, can lead to fragility when small discrepancies—such as bad weather or a minor oversight—occur and to outright failure when several unexpected events happen at the same time.
Built-in redundancies can counter this possibility: they reduce the chances of catastrophic failure and may mitigate the consequences of the failures that do occur. Both come at a price. Redundancy implies a loss of efficiency, and mitigation implies up-front costs paid against uncertain future consequences. To make these trade-offs well, the designers of critical infrastructure need to identify and understand the root causes of complex systems failure.
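The effect of redundancy on the chance of failure can be illustrated with a minimal sketch. It assumes, purely for illustration, that every redundant unit fails independently and with the same probability; real systems often violate this assumption through common-cause failures. The function name and the 1 percent figure are hypothetical and do not come from the ASME task force.

```python
def all_units_fail_probability(p_unit: float, n_units: int) -> float:
    """Probability that every one of n redundant units fails.

    Assumption (hypothetical): units fail independently, each with
    probability p_unit, so the joint probability is p_unit ** n_units.
    """
    if not 0.0 <= p_unit <= 1.0:
        raise ValueError("p_unit must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return p_unit ** n_units


# Each added unit costs money and efficiency but shrinks the chance
# that the whole function is lost.
for n in (1, 2, 3):
    print(n, all_units_fail_probability(0.01, n))
```

Under the independence assumption each extra unit multiplies the failure probability by the same factor. When units share a power supply, a location, or a design flaw, that benefit shrinks, which is one reason root causes matter.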
Engineers have been able to identify several root causes for failure. Each is related to the need for improvement in the application of systems thinking to the technical, engineering sciences as well as to the organizational, behavioral sciences.
Poor development: This cause of failure can be traced to a lack of organizational commitment to systems thinking. One example of this is a lack of communication between designers and end users.
Incorrect assumptions with regard to system requirements: Too often, flawed assumptions made at the initial stages of a system's design stop being treated as suppositions and harden into unquestioned requirements. It is important to continuously challenge mindsets and mental models, even after development has begun.
How Can ASME Address the Prevention of Complex Systems Failure?
Recommendations from an ASME task force
Communication and Advocacy
Increase public and political awareness of what can be done to help prevent such failures from occurring.
Develop a set of educational materials about preventing, reacting to, and mitigating complex systems failure for use by educators and practicing engineers.
Enterprise Risk Management (ERM) Framework
Enable companies to develop a comprehensive, integrated approach to managing their enterprise risk to minimize the potential for complex systems failures and to develop plans to react to them and mitigate their consequences.
Risk Analysis Tools
Develop a compendium of diagnostic and prognostic risk analysis tools, with a framework that industries can tailor to their situation and environment, allowing them to assess risk effectively.
Poor user interface: A system may be elegantly designed, but if it is done without the user in mind and without regard to human factors for safe operation and maintenance, the result can be disastrous. Overly complex user interfaces in both hardware and software systems are common points of failure.
Faulty hardware and software: Hardware and embedded software design faults can be overlooked easily in the face of meeting deadlines and budgets. In many ways, this can be thought of as a failure of planning standards or a lack of shared standards development and enforcement.
Inadequate user training or user error: Failures due to a lack of training should be considered not the fault of the individual operator but a failure of foresight by management. Improved training of end users has been shown to significantly reduce system failures and improve the integrity of systems.
Unintended interactions: Complex systems often have so many connected parts that the interactions between them are nearly impossible to fully anticipate. The result is a non-linear, cascading (and often catastrophic) response to failure of what had been thought to be an independent component of a system.
Lack of public awareness: It is human nature for both elected officials and the public to disregard the fact that ubiquitous complex systems have a likelihood of failure and that these failures must be addressed proactively. As a result, preventive efforts do not receive the attention and resources they need.
Lack of attention in engineering training: Risk assessment—both theory and practice—is underemphasized because it exposes the vulnerability of a system to potential threats. The avoidance of identifying system vulnerabilities promotes a culture in which ignoring risks is perceived to be more beneficial than reducing risks.
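The "unintended interactions" cause listed above can be made concrete with a small sketch of a cascading failure. The model is a deliberate simplification: it assumes a component fails as soon as anything it depends on fails, and the toy network, component names, and function name are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def cascade(dependents: Dict[str, List[str]], initial_failure: str) -> Set[str]:
    """Return every component that fails once `initial_failure` fails.

    `dependents` maps a component to the components that rely on it.
    Assumption (hypothetical): a component fails as soon as any
    component it relies on has failed.
    """
    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in failed:
                failed.add(downstream)
                queue.append(downstream)
    return failed


# Toy network with a feedback loop: the grid needs its control system,
# the control system needs telecom, and telecom needs the grid.
toy_network = {
    "power_grid": ["water_pumps", "telecom"],
    "telecom": ["grid_control"],
    "grid_control": ["power_grid"],
    "water_pumps": [],
}

print(sorted(cascade(toy_network, "telecom")))
print(sorted(cascade(toy_network, "water_pumps")))
```

In this toy network a telecom outage, which a designer might have treated as an independent problem, ends up taking down every component, while a pump failure stays contained. Real interactions are rarely this clean, which is part of why they are so hard to anticipate.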
It is human nature and, on the surface, economically attractive to discount the low-probability, high-severity consequences in the design and development of complex systems. Discounting or ignoring the effects of worst-case scenarios, however, can lead to a culture of complacency that heightens risks. The consequences of any resulting failure may well dwarf any superficial cost savings. These consequences include social, economic, and environmental effects.
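A rough way to see why the savings can be superficial is to compare the cost of mitigation with the expected loss it avoids. The sketch below assumes that risk can be summarized as annual probability times consequence cost, which is a strong simplification: it ignores social and environmental effects that are hard to price. All names and figures are hypothetical.

```python
def expected_annual_loss(annual_probability: float, consequence_cost: float) -> float:
    """Expected loss per year: probability of the event times its cost."""
    return annual_probability * consequence_cost


def mitigation_pays_off(
    p_before: float,
    p_after: float,
    consequence_cost: float,
    annual_mitigation_cost: float,
) -> bool:
    """True if the expected loss avoided exceeds the yearly mitigation cost."""
    avoided = expected_annual_loss(p_before, consequence_cost) - expected_annual_loss(
        p_after, consequence_cost
    )
    return avoided > annual_mitigation_cost


# Hypothetical case: a 1-in-1,000-per-year event costing 10 billion,
# reduced to 1-in-10,000 by spending 2 million per year.
print(mitigation_pays_off(0.001, 0.0001, 10e9, 2e6))
```

With these illustrative numbers the mitigation avoids an expected loss of about 9 million per year for a cost of 2 million, even though the event itself looks too unlikely to worry about.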
The engineers who design these systems are the ones best suited to prevent complex systems failures, to reduce their probability and frequency, and to mitigate their consequences. Engineers have the tools to design and build fault-tolerant systems, to develop diagnostic and prognostic tools that assess the in-situ health of a system and manage risks in system operation, to plan for event mitigation, and to consider the ethical responsibility associated with the safe operation, management, and maintenance of these systems.
Now more than ever, engineers need to bring their experience and knowledge to bear to address low-probability, high-consequence system failures. While natural disasters, terrorist attacks, and human error will occur, engineers can design complex systems and help manage risk to reduce the likelihood of cascading failures, while ensuring that these critical systems do not become too expensive to develop.
ASME's Strategic Issues and Innovation committees, in an effort to respond to this important issue, convened a series of meetings that brought together the Society's Industry Advisory Board and invited experts. At the first meeting, in October 2010, an expert task force looked at complex systems management across industries. Participants identified potential areas for ASME action, including the development of standards and tools, educational initiatives, organizational and ethical issues, and the need to increase public awareness of issues associated with complex systems.
The following specific recommendations were suggested:
Conduct targeted communications and advocacy about complex systems failures: ASME should consider communications and advocacy activities to raise awareness about the need for preventive efforts to address complex system failures. Developing and releasing materials such as a position statement, a program initiative plan, and a multi-society position paper will be part of ASME's plan and its role as a leader in achieving a solution. In addition, continual public outreach will help reinforce the messages that complex systems have a likelihood of failure and, therefore, should have diagnostic and prognostic systems included as part of their design.
Develop a package of educational materials and products for mechanical engineering managers and practicing engineers: ASME could compile a package of educational materials about complex systems failure. The initial audience would be mechanical engineering department heads and practicing engineers; over time, the Society could expand the package with newly developed materials designed to reach other audiences.
Produce a compendium of risk analysis tools with an accompanying framework for technical complex systems analysis: ASME could use its technical expertise to create a compendium of standard diagnostic and prognostic risk analysis tools, along with a general framework for technical complex systems analysis that industries can tailor to their environment to better assess risk. This is important because the complex nature of large, interconnected systems makes risk inherently difficult to measure.
Develop a framework to help companies understand, assess, and manage enterprise risk: ASME could create a guide that companies and organizations can use to develop a comprehensive, integrated approach to assessing and managing their enterprise risk. Company organizational structures and internal decision-making processes are integral to how a firm manages risk across an enterprise, and they affect the capacity of a company or agency to assess, manage, and mitigate the risks of a complex system failure.
Details of ASME's action plan are presented in a draft report, Initiative to Address Complex Systems Failure: Prevention and Mitigation of Consequences, published last June. As the report states, this initiative is part of an effort to “stake a role for ASME as a leader in addressing this complicated issue head-on.”
The importance of this effort could not be greater, either for the engineering profession or for the world at large. The conclusion of the draft report makes clear that the engineers of today and tomorrow will take an important step in cultivating a culture that considers risk at every step, making the complex systems we rely on safer for everyone.