EVALUATION AND DESIGN OF DEPENDABLE SYSTEMS WITH DESIGN DIVERSITY
Subhasish Mitra, Nirmal R. Saxena and Edward J. McCluskey
Departments of Electrical Engineering and Computer Science
Stanford University, Stanford, California
Abstract
Design diversity has long been used to protect redundant systems against common-mode failures and design mistakes. The conventional notion of diversity is qualitative and relies on “independent” generation of “different” implementations. As part of the DARPA sponsored ROAR project at Stanford CRC, we recently developed a quantitative measure of diversity and demonstrated its use in the evaluation and design of dependable systems. In this paper, we present a summary of our work on design diversity, and discuss important problems related to the evaluation and design of systems that use design diversity for dependability.

1. Introduction

Redundancy techniques for designing systems with high data-integrity and availability have been studied extensively [Kraft 81, Siewiorek 92, Pradhan 96]. A duplex system is a classical redundancy scheme used in many commercial dependable systems [Webb 97, Spainhower 99]. In a duplex system there are two modules that implement the same logic function. The outputs of the two modules are compared and an error is reported when a mismatch occurs. Data integrity means that the system either produces correct outputs or indicates an error when incorrect outputs are produced; in the literature on fault tolerance, data integrity is also referred to as the fault-secure property. In a duplex system, data integrity is guaranteed as long as only one of the two modules fails.

While most redundant systems are designed using the single fault assumption, in real life there can be sources of multiple and common-mode failures that produce multiple faults. In the presence of these failures, system data integrity is not guaranteed. Common-mode failures (CMFs) result from failures that affect more than one module of a redundant system, generally due to a common cause. Sources of common-mode failures can be a dip in the power-supply, a single source of radiation creating multiple upsets, or even design errors. Common-mode failures are surveyed in [Mitra 00a].

Design diversity was described in the past as a technique to avoid or tolerate CMFs in redundant systems. In [Avizienis 84], design diversity was defined as the independent generation of two or more software or hardware elements (e.g., program modules, VLSI circuit masks, etc.) to satisfy a given requirement. The basic idea is that, with different implementations, common failure modes will produce different error effects. For example, chances of identical design errors may be minimized if two groups of designers are asked to independently design a hardware block or a software module. A dip in a power supply may have different effects on two different hardware implementations of the same logic function. Examples of design diversity include N-version programming [Lyu 91] for software systems or the use of different processors for redundant system designs in flight controllers (Boeing, Airbus).

There are two cost components associated with diversity: design cost and manufacturing cost. The design time is roughly doubled for a duplex system design with diversity. The manufacturing cost can be avoided for systems built using reconfigurable logic elements (e.g., FPGAs). For reconfigurable systems, diversity can be created by downloading different configurations instead of manufacturing different ASICs.

While there is clear evidence that diversity can bring benefits in a redundant system, these benefits are extremely difficult to quantify with the above qualitative definition of diversity. Thus, as pointed out in [Littlewood 96], there is a need to answer questions such as: “what is diversity? Are these designs more diverse than those? How diverse are these two designs?” In the next section, we present a quantitative definition of design diversity.

2. D: A Design Diversity Metric

Assume that we are given two implementations (logic networks) of a logic function, an input probability distribution and faults fi and fj that occur in the first and the second implementations, respectively. The diversity di,j with respect to fault pair (fi, fj) is the conditional probability that, with the faults fi and fj present, the two implementations do not produce identical errors.
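
For small circuits, this definition can be checked by exhaustive enumeration. The Python sketch below assumes equally likely input combinations and models each faulty implementation as a plain function. The names diversity, spec, faulty_n1 and faulty_n2 are illustrative and do not come from the paper, and the two faulty behaviours are assumptions chosen for the function Z = AB + AC so that the first implementation errs on ABC = 110, 101 and 111 while the second errs only on ABC = 101.

```python
from itertools import product


def diversity(spec, faulty_n1, faulty_n2, n_inputs):
    """Estimate d_ij for one fault pair, assuming equally likely inputs.

    d_ij = 1 - (fraction of input combinations on which both faulty
    implementations produce the same erroneous output).
    """
    identical_errors = 0
    total = 0
    for bits in product((0, 1), repeat=n_inputs):
        good = spec(*bits)
        out1 = faulty_n1(*bits)
        out2 = faulty_n2(*bits)
        # Both outputs are wrong and equal: a comparator cannot see this error.
        if out1 != good and out1 == out2:
            identical_errors += 1
        total += 1
    return 1 - identical_errors / total


def spec(a, b, c):
    # Fault-free function Z = AB + AC.
    return (a & b) | (a & c)


def faulty_n1(a, b, c):
    # Assumed faulty behaviour: output forced to 0 (errs on 110, 101, 111).
    return 0


def faulty_n2(a, b, c):
    # Assumed faulty behaviour: the AC term is lost (errs only on 101).
    return a & b


print(diversity(spec, faulty_n1, faulty_n2, 3))  # 0.875
```

Only ABC = 101 yields identical erroneous outputs in this model, so the sketch returns 1 - 1/8 = 0.875. Enumerating all input combinations is feasible only for small numbers of inputs.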
The main idea behind the above definition of di,j is illustrated using the example in Table 2.1. We have two implementations N1 and N2 of the same logic function, and faults fi and fj affect N1 and N2, respectively. When the input combination 0000 is applied, both implementations produce correct outputs in the presence of the faults. When the input combination 0110 is applied, N1 produces erroneous outputs while N2 produces correct outputs in the presence of the faults. If a comparator is used to compare the outputs of N1 and N2, a mismatch will be reported and hence, data integrity is guaranteed for this input combination. A similar situation happens for the input combination 1101. For the input combination 1010 both N1 and N2 produce erroneous outputs in the presence of the faults. However, the erroneous outputs are different and a mismatch will be reported when the outputs of N1 and N2 are compared. Hence, data integrity is guaranteed for the input combination 1010. For the input combination 1110, both N1 and N2 produce identical erroneous outputs and the errors will not be detected by the comparator comparing the outputs of N1 and N2. Hence, data integrity is compromised for the input combination 1110.

Table 2.1. Example to illustrate behaviors of faulty [...]

For a given fault model, the design diversity metric, D, between two designs is the expected value of the diversity with respect to different fault pairs:

D = Σ(fi, fj) P(fi, fj) di,j

Here P(fi, fj) is the probability of fault pair (fi, fj), and the sum is taken over all fault pairs.

The main motivation behind using D as a metric for design diversity is to combine the effects of the di,j values of different fault pairs into a single number.

For example, consider the two implementations of the logic function Z = AB + AC shown in Fig. 2.1. Consider the fault f1 = w stuck-at-0 in the implementation of Fig. 2.1a and the fault f2 = y stuck-at-0 in the implementation of Fig. 2.1b. With f1 present in N1, the input combinations ABC = 111, 101 and 110 all produce errors at Z1. (These are the input combinations that detect f1.) The only input combination that causes an error at Z2 with f2 present is ABC = 101. (This is the input combination that detects f2.) If a duplex system consisting of the two implementations in Fig. 2.1 is affected by the fault pair (f1, f2), then ABC = 101 is the only input combination for which both implementations will produce identical errors. This erroneous output will escape detection. If we assume that all input combinations are equally likely, then the d1,2 value is 1 − 1/8 = 7/8.

The above illustration of the design diversity metric [...] combinational logic circuits [Mitra 99] and sequential circuits [Mitra 01a]. There are two fundamental problems involved in the computation of the design diversity metric: (1) there can be too many fault pairs to be considered; (2) the problem of computing the di,j value for a fault pair is NP-complete. Fast techniques for estimating diversity using a reduced list of fault pairs and adaptive [...]

Figure 2.2 illustrates the use of the di,j values to evaluate diversity between different implementations of the same logic function. The following functions are implemented: W = A′C + BC, X = ABC, Y = BC, and Z = A′B + BC. The three implementations shown in Fig. 2.2a, 2.2b and 2.2c use the same logic gates but differ in the sharing of the logic gates among the output functions (also called the fanout structures). It was shown in [Mitra 00b] that diversity in the fanout structure is important for generating diverse implementations of the same logic function.

Figure 2.2. Diverse implementations with diversity in [...]

Consider the following three duplex system designs:

1. A duplex system is designed with two identical implementations corresponding to Fig. 2.2a. Consider the fault m/1 (at the output of the AND gate ABC) in both implementations. In the presence of the m/1 fault in both implementations, the two (identical) implementations produce identical erroneous outputs for 7 input combinations (ABC = 000, 001, 010, 011, 100, 101, 110). Hence, the di,j value for this fault pair (m/1 in both implementations) is 1/8 = 0.125 (assuming that all input combinations are equally likely).

2. Consider the case where a duplex system is designed such that the first implementation corresponds to Fig. 2.2a and the second implementation corresponds to Fig. 2.2b. Consider the fault m/1 (at the output of the AND gate ABC) in the first implementation (Fig. 2.2a) and the fault p/1 (at the output of the AND gate ABC) in the second implementation (Fig. 2.2b). The reader can easily check that in the presence of this fault pair the two implementations produce identical erroneous outputs for only 2 input combinations (ABC = 010, 011). Hence, in this case the di,j value of the fault pair (m/1 in the first implementation and p/1 in the second implementation) is 6/8 = 0.75.

3. Finally, consider a duplex system such that the first implementation corresponds to Fig. 2.2a and the second implementation corresponds to Fig. 2.2c. Consider the fault m/1 (at the output of the AND gate ABC) in the first implementation (Fig. 2.2a) and the fault r/1 (at the output of the AND gate ABC) in the second implementation (Fig. 2.2c). The reader can easily check that in the presence of this fault pair the two implementations produce identical erroneous outputs for only 1 input combination (ABC = 011). Hence, in this case the di,j value of the fault pair (m/1 in the first implementation and r/1 in the second implementation) is 7/8 = 0.875.

The diverse duplex system designed in the third scenario (the first implementation corresponding to Fig. 2.2a and the second implementation corresponding to Fig. 2.2c) is better than the other two scenarios for the faults considered. This fact can be verified by calculating the diversity metrics for these three scenarios. We calculated the diversity metrics for the above scenarios in two different ways:

(i) If we assume that all possible single-stuck-at fault pairs are equally probable, then the values of the D metric for the first, second and third scenarios are 0.96, 0.97 and 0.98, respectively.

(ii) For each fault fi in the first implementation (Fig. 2.2a) of the duplex systems in the above three scenarios, we found a fault fj in the second implementation such that the di,j value of the fault pair (fi, fj) is the minimum over all fj’s; hence, (fi, fj) is called a worst-case fault pair. These worst-case fault pairs were found through exhaustive simulation of all input combinations and all fault pairs. Finally, we calculated D as the average value of di,j’s over the worst-case fault pairs. The D values obtained were 0.67, 0.72 and 0.75 for the first, second and third scenarios, respectively. Moreover, there are many worst-case fault pairs with di,j values equal to 1 in the second and third scenarios.

The above results demonstrate that the duplex system designed using the implementations in Fig. 2.2a and Fig. 2.2c is the most diverse one. When we calculate the diversity metric assuming that all fault pairs are equally probable, we do not see a big difference because a large fraction of all fault pairs have the di,j value equal to 1 for all the three scenarios. Hence, the values of the D-metric become very close to 1 for all the three scenarios when it is assumed that all fault pairs are equally probable.
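
The two ways of aggregating di,j values into D can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the small di,j table are hypothetical and are not data from the paper. The first function computes D as the probability-weighted expected value, defaulting to equally probable fault pairs. The second averages, over the faults fi of the first implementation, the minimum di,j over all fj (the worst-case fault pairs).

```python
def expected_diversity(d, prob=None):
    """D as the expected value of d[(fi, fj)] over all fault pairs.

    d maps a fault pair (fi, fj) to its diversity value. prob maps the same
    pairs to their probabilities; if omitted, all pairs are assumed to be
    equally probable.
    """
    if prob is None:
        prob = {pair: 1.0 / len(d) for pair in d}
    return sum(prob[pair] * d[pair] for pair in d)


def worst_case_diversity(d):
    """Average, over faults fi, of the minimum d[(fi, fj)] over all fj."""
    worst = {}
    for (fi, _fj), value in d.items():
        # Diversity values are probabilities, so 1.0 is a safe initial minimum.
        worst[fi] = min(value, worst.get(fi, 1.0))
    return sum(worst.values()) / len(worst)


# Hypothetical d_ij table: faults f1, f2 in the first implementation and
# g1, g2 in the second (illustrative values only).
d = {
    ("f1", "g1"): 0.125, ("f1", "g2"): 1.0,
    ("f2", "g1"): 0.75,  ("f2", "g2"): 1.0,
}

print(expected_diversity(d))    # 0.71875
print(worst_case_diversity(d))  # 0.4375
```

In this toy table the pairs with di,j = 1 pull the uniform average upward, while the worst-case average exposes the weak pairs. This mirrors the observation that the equally-probable D values for the three scenarios lie close together.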
Diversity in software programs can be used to detect hardware and software faults. The above diversity metric has been extended for diverse software programs used to detect hardware failures [Oh 00]. For detecting or tolerating software faults (design mistakes), the idea of the design diversity metric can be used as long as we have a fault model available. While there is no clear consensus about “good” software fault models, several fault-injection techniques available in the literature [Hudak 93, Christmansson 98] inject software faults in a system; our diversity metric can be used in the context of these software faults.

3. Applications of the Design Diversity Metric in System Analysis and Design

In [Mitra 99], the design diversity metric was used to analyze the reliability and availability of redundant systems. The analysis showed simple relationships between reliability, availability, design diversity, system failure rate, mission time and self-testability. A duplex system is self-testing with respect to a fault pair (f1, f2) (f1 affecting the first implementation and f2 affecting the second implementation) if and only if there exists an input combination for which the two implementations produce different outputs in the presence of the faults. The following important observations were made from the analysis: (1) When the failure rate is high, even a small diversity can help enhance the system reliability over simple replication; (2) If the failure rate [...] improvement in reliability or data integrity obtained by using diversity diminishes with long mission times; (4) System availability is significantly increased when a diverse system with many self-testable fault pairs is used.

System designers can obtain quantitative estimates for system reliability, data integrity and availability using the expressions derived in [Mitra 99] to understand various trade-offs and choose an appropriate diverse system which best fits their application requirements.

The diversity metric can also be used as a cost [...] implementations of the same function. While the process of writing software programs for a given application still relies on manpower, automated or semi-automated techniques are used for generating hardware designs. Synthesis programs used in CAD tools are cost function optimization programs where the cost functions are area, delay and power consumption of the designed circuit. A design diversity metric allows us to use diversity as another cost function component during synthesis of redundant systems for error detection or fault-tolerance. [...] implementations of a combinational logic function is described in [Mitra 00b]. Techniques for enhancing the self-testability of diverse duplex systems through test point insertion are described in [Mitra 00c]. There are many opportunities to develop new architectural, logic and layout synthesis techniques for redundant systems taking into account the diversity cost function in addition to the standard cost functions used in CAD tools.

4. System-level Error Detection using Diverse Duplication

Duplication (identical or diverse) is not the only way of detecting errors in dependable systems; parity prediction techniques have been used in many commercial systems for error detection purposes. Simulation results presented in [Mitra 00d] for general combinational circuits demonstrate: (1) Area overhead of circuits with parity prediction is comparable to that of duplication; (2) Diverse duplication provides significant improvement in data integrity compared to parity prediction in the presence of multiple and common-mode failures. The problem of theoretically proving this result for general error models is still open.

Figure 4.1. Systems with CED: (a) Example. (b) Diverse duplication for combinational logic; parity [...]
In Fig. 4.1, we present a system-level view of concurrent error detection (CED). The system in Fig. 4.1a contains a combinational logic block implementing a logic function f; the logic block obtains its inputs from register X and the outputs are stored in register Z. Figure 4.1b presents a CED scheme which uses diverse duplication for combinational logic blocks and parity prediction for registers and bus lines. Thus, we can achieve significant improvement in data integrity for multiple and common-mode failures (through diverse duplication) without doubling the number of register flip-flops and bus lines. Note that the XOR tree may have significant delay overhead. This delay overhead can be reduced by increasing the number of parity bits (i.e., the number of extra flip-flops in the registers). Interesting problems analyzing this area-delay trade-off can be studied in this context. For performance constrained designs, the parity of register Z can be directly predicted from the contents of register X using another [...]

5. Conclusions

A quantitative metric for design diversity opens up opportunities for efficient design of dependable systems. While some problems and solutions are summarized in this paper, there are many other interesting open questions and problems related to design diversity that must be understood for efficient design of dependable systems. These include architectural synthesis of dependable systems with design diversity and estimation of design diversity of large systems using simulation [...]

6. Acknowledgment

The research on design diversity was done as part of the DARPA sponsored ROAR (Reliability Obtained by Adaptive Reconfiguration) project ([...] http://crc.stanford.edu/projects/roar/roarSummary.html).

7. References

[Avizienis 84] Avizienis, A. and J. P. J. Kelly, “Fault Tolerance by Design Diversity: Concepts and Experiments,” IEEE Computer, pp. 67-80, Aug. 1984.
[Briere 93] Briere, D. and P. Traverse, “Airbus A320/A330/A340 Electrical Flight Controls: A family of fault-tolerant systems,” FTCS, pp. 616-623, 1993.
[Christmansson 98] Christmansson, J., M. Hiller, M. Rimen, “An Experimental Comparison of Fault and Error Injection,” Proc. Intl. Symp. Software Reliability Engineering, pp. 369-378, 1998.
[Hudak 93] Hudak, J., B. Suh, D. Siewiorek, Z. Segall, “Evaluation and Comparison of Fault-Tolerant Software Techniques,” IEEE Trans. Reliability, Vol. 42, pp. 190-[...]
[Kraft 81] Kraft, G. D., W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, 1981.
[Lala 94] Lala, J. H. and R. E. Harper, “Architectural Principles for Safety-Critical Real-Time Applications,” Proc. of the IEEE, vol. 82, no. 1, pp. 25-40, Jan. 1994.
[Littlewood 96] Littlewood, B., “The Impact of Diversity upon Common Mode Failures,” Reliability Engineering and System Safety, Vol. 51, No. 1, pp. 101-113, 1996.
[Lyu 91] Lyu, M. R. and A. Avizienis, “Assuring design diversity in N-version software: a design paradigm for N-version programming,” DCCA, pp. 197-218, 1991.
[Mitra 99] Mitra, S., N. R. Saxena and E. J. McCluskey, “A Design Diversity Metric and Reliability Analysis for Redundant Systems,” Proc. Intl. Test Conf., pp. 662-671, 1999. (CRC-TR-99-4, http://crc.stanford.edu).
[Mitra 00a] Mitra, S., N. R. Saxena and E. J. McCluskey, “Common-Mode Failures in Redundant VLSI Systems: A Survey,” IEEE Trans. Reliability, Sept. 2000.
[Mitra 00b] Mitra, S., E. J. McCluskey, “Combinational Logic Synthesis for Diversity in Duplex Systems,” Proc. International Test Conf., pp. 179-188, 2000.
[Mitra 00c] Mitra, S., N. R. Saxena and E. J. McCluskey, “Fault Escapes in Duplex Systems,” Proc. IEEE VLSI Test Symposium, pp. 453-458, 2001.
[Mitra 00d] Mitra, S., and E. J. McCluskey, “Which Concurrent Error Detection Scheme to Choose?,” Proc. International Test Conf., pp. 985-994, 2000.
[Mitra 01a] Mitra, S., E. J. McCluskey, “Design Diversity for Concurrent Error Detection in Sequential Logic Circuits,” Proc. VLSI Test Symp., pp. 178-183, 2001.
[Mitra 01b] Mitra, S., N. R. Saxena and E. J. McCluskey, “Techniques for Calculating Design Diversity for Combinational Logic Circuits,” Proc. Intl. Conf. Dependable Systems and Networks, 2001, To appear.
[Oh 00] Oh, N. S., S. Mitra and E. J. McCluskey, “ED4I: Error Detection by Diverse Data and Duplicated Instructions,” CRC-TR-00-8, (http://crc.stanford.edu), To appear in the IEEE Trans. Computers.
[Pradhan 96] Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice Hall, 1996.
[Riter 95] Riter, R., “Modeling and Testing a Critical Fault-Tolerant Multi-Process System,” Proc. FTCS, pp. [...]
[Saxena 00] Saxena, N. R., et al., “Dependable Computing and On-Line Testing in Adaptive and Reconfigurable Systems,” IEEE Design and Test of [...]
[Siewiorek 92] Siewiorek, D. P., R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital [...]
[Spainhower 99] Spainhower, L. and T. A. Gregg, “S/390 Parallel Enterprise Server G5 fault tolerance,” IBM Journal of Research and Development, Vol. 43, pp. 863-[...]
[Webb 97] Webb, C. F., and J. S. Liptay, “A High Frequency Custom S/390 Microprocessor,” IBM Journal Res. and Dev., Vol. 41, No. 4/5, pp. 463-474, [...]