Repairable Redundant Systems and the Markov Fallacy

by W G Gulland (4-sight Consulting)

Summary

Markov analysis is presented in reliability textbooks and incorporated in a number of computer programs for reliability calculations as a way of dealing with repairable systems.  This paper demonstrates that the application of Markov analysis to this problem is flawed, and leads to erroneous results.  It proposes an alternative method of dealing with such systems based on the application of standard probability theory and a consideration of the practical situation.  It is recognised that the argument is largely academic because common cause failures of such systems are liable to dominate independent failures in almost all real cases.

1    Application of Markov Analysis to Repairable Systems

A Markov process is defined as:

A sequence of possibly dependent random variables (x₁, x₂, x₃, ...) - identified by increasing values of a parameter, commonly time - with the property that any prediction of the value of xₙ, knowing x₁, x₂, ..., xₙ₋₁, may be based on xₙ₋₁ alone. That is, the future value of the variable depends only upon the present value and not on the sequence of past values.

(Encyclopaedia Britannica)

Nuclear decay, for example, is a Markov process.  It is a random process, and the probability that a nucleus decays over a period dt is determined only by its state at time t (un-decayed or decayed).

1.1   Duplicate System with Single Repair Crew

Markov analysis is widely used as a method of analysing the reliability parameters of repairable systems with constant failure and repair rates.  Consider an example from power generation:

The system may be in any one of 3 states:

  •        State (0)    Both operating, with probability P0
  •        State (1)    One unit failed, with probability P1
  •        State (2)    Both units failed, i.e. the system is unavailable, with probability P2

Markov analysis is applied to the state transition diagram for this system to derive the probabilities of each of the states, as follows:

where:

λ = failure rate = 1 / MTBF (mean time between failures) for constant failure rate

and

μ = repair rate = 1 / MDT (mean down time)

The rate of leaving state 0 is 2λP0, i.e. it is determined by the probability of being in state 0 and the failure rate λ of each of the two generators.  Similarly, the rate of arriving in state 0 is μP1, the rate at which items in state 1 are repaired.  This leads to a set of equations:

$$\frac{dP_0}{dt} = -2\lambda P_0 + \mu P_1$$

$$\frac{dP_1}{dt} = 2\lambda P_0 - \mu P_1 - \lambda P_1 + \mu P_2$$

$$\frac{dP_2}{dt} = \lambda P_1 - \mu P_2$$
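The approach to the steady state can be illustrated by integrating the three state equations numerically. The following is a minimal sketch, not part of the original analysis: the rates (MTBF = 1000 h, MDT = 10 h), the starting condition (both units operating) and the fixed-step Euler scheme are all assumptions chosen for illustration.

```python
def integrate_markov(lam, mu, t_end, steps):
    """Integrate the three state equations with a fixed-step Euler scheme."""
    p0, p1, p2 = 1.0, 0.0, 0.0          # assumed start: both units operating
    dt = t_end / steps
    for _ in range(steps):
        d0 = -2 * lam * p0 + mu * p1
        d1 = 2 * lam * p0 - mu * p1 - lam * p1 + mu * p2
        d2 = lam * p1 - mu * p2
        p0, p1, p2 = p0 + d0 * dt, p1 + d1 * dt, p2 + d2 * dt
    return p0, p1, p2


lam, mu = 0.001, 0.1                    # hypothetical rates per hour
p0, p1, p2 = integrate_markov(lam, mu, t_end=500.0, steps=50_000)
print(p0, p1, p2)                       # probabilities after 500 hours
```

With these assumed rates the transients decay on a time scale of roughly 1/μ, so after a few hundred hours the three probabilities are effectively constant.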

In the steady state the rates of change of probabilities are zero, and these equations reduce to:

$$-2\lambda P_0 + \mu P_1 = 0$$

$$2\lambda P_0 - (\lambda + \mu) P_1 + \mu P_2 = 0$$

$$\lambda P_1 - \mu P_2 = 0$$

to which we can add:

$$P_0 + P_1 + P_2 = 1$$

because the probability of being in one or other of these 3 states is unity.

Solving for P2 (the unavailability of the system) we obtain:

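$$P_2 = \frac{2\lambda^2}{2\lambda^2 + 2\lambda\mu + \mu^2}$$

which for μ >> λ (as would normally be the case) reduces to:

$$P_2 \approx \frac{2\lambda^2}{\mu^2} = 2\lambda^2\,\mathrm{MDT}^2 \quad \text{for } \mu \gg \lambda$$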
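The steady-state solution can be checked with exact rational arithmetic. This is a minimal sketch under assumed values (MTBF = 1000 h, MDT = 10 h); the function name is hypothetical. It solves the balance equations relative to P0, normalises, and compares the result with the closed form and with its approximation for μ >> λ.

```python
from fractions import Fraction


def steady_state_single_crew(lam, mu):
    """Steady-state (P0, P1, P2) from the single-crew balance equations."""
    w0 = Fraction(1)
    w1 = 2 * lam / mu * w0        # from -2*lam*P0 + mu*P1 = 0
    w2 = lam / mu * w1            # from lam*P1 - mu*P2 = 0
    total = w0 + w1 + w2          # normalise so that P0 + P1 + P2 = 1
    return w0 / total, w1 / total, w2 / total


lam = Fraction(1, 1000)           # hypothetical failure rate per hour
mu = Fraction(1, 10)              # hypothetical repair rate per hour
p0, p1, p2 = steady_state_single_crew(lam, mu)
closed_form = 2 * lam**2 / (2 * lam**2 + 2 * lam * mu + mu**2)
approx = 2 * lam**2 / mu**2
print(p2 == closed_form, float(p2), float(approx))
```

For these assumed rates the approximation differs from the exact Markov value by about 2 %.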

But this is the wrong result.  We know that the unavailability of a single generator is given by:

$$Q_a = Q_b = \frac{\mathrm{MDT}}{\mathrm{MTBF} + \mathrm{MDT}} = \frac{\lambda}{\lambda + \mu}$$

which for MTBF >> MDT or (the corollary) μ >> λ reduces to:

$$Q_a = Q_b = \frac{\mathrm{MDT}}{\mathrm{MTBF}} = \frac{\lambda}{\mu}$$

By the multiplication of probabilities rule the unavailability of the system is given by:

$$Q_{\mathrm{system}} = P_2 = Q_a Q_b = \frac{\lambda^2}{\mu^2} = \lambda^2\,\mathrm{MDT}^2$$

i.e. exactly half the value obtained from Markov analysis.
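The factor of two can be seen numerically. The sketch below uses assumed figures (MTBF = 1000 h, MDT = 10 h) and a hypothetical helper name; it evaluates the product of the two unit unavailabilities and the single-crew Markov expression for P2, then prints their ratio.

```python
def unit_unavailability(mtbf, mdt):
    """Unavailability of one unit: MDT / (MTBF + MDT)."""
    return mdt / (mtbf + mdt)


mtbf, mdt = 1000.0, 10.0                 # hypothetical values in hours
lam, mu = 1 / mtbf, 1 / mdt

q_unit = unit_unavailability(mtbf, mdt)
q_product = q_unit * q_unit              # multiplication of probabilities
p2_markov = 2 * lam**2 / (2 * lam**2 + 2 * lam * mu + mu**2)

ratio = p2_markov / q_product
print(q_product, p2_markov, ratio)
```

The ratio is not exactly 2 for finite rates because neither expression has been reduced to its μ >> λ limit, but it approaches 2 as λ / μ becomes small.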

1.2   Duplicate System with 2 Repair Crews

Consider now the application of Markov analysis to the state transition diagram of the same example with 2 repair crews:

Solving the equations in the same way as above yields:

$$P_2 = \frac{2\lambda^2}{2\lambda^2 + 4\lambda\mu + 2\mu^2}$$

which for μ >> λ (as would normally be the case) reduces to:

$$P_2 \approx \frac{\lambda^2}{\mu^2} = \lambda^2\,\mathrm{MDT}^2$$

i.e. Markov analysis indicates that the unavailability is halved by introducing a second repair crew.  A little thought, however, indicates that this result is nonsense.  If both generators are broken down it is only necessary to restore one generator to restore the system.  If it takes MDTsystem on average to restore the system with one repair crew, it will also take MDTsystem on average to restore the system with 2 repair crews, because the first crew is already repairing the first failure when the second occurs; the system downtime is not reduced by having 2 crews.  If μ >> λ, the probability of the first failed and repaired generator failing again before the second machine is repaired is tiny, even if the repairs are carried out sequentially by a single repair crew, so that the reduction in unavailability is negligible.  Thus the benefit of having 2 repair crews is marginal to non-existent!
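The two Markov figures quoted above can be reproduced with a short calculation, which also shows where the halving comes from inside the model: the only change between the two cases is the assumed rate of leaving state 2 (μ with one crew, 2μ with two). This is a sketch with assumed rates and a hypothetical function name, not a statement about which figure is correct.

```python
def markov_p2(lam, mu, crews):
    """Steady-state probability of state 2 (both failed) in the Markov model."""
    fail = [2 * lam, lam]                            # rates out of states 0 and 1
    repair = [min(1, crews) * mu, min(2, crews) * mu]  # rates out of states 1 and 2
    weights = [1.0]                                  # probabilities relative to P0
    for k in range(2):
        weights.append(weights[k] * fail[k] / repair[k])
    return weights[2] / sum(weights)


lam, mu = 0.001, 0.1                                 # hypothetical rates per hour
one_crew = markov_p2(lam, mu, crews=1)
two_crews = markov_p2(lam, mu, crews=2)
print(one_crew, two_crews, one_crew / two_crews)
```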

1.3   Source of the Fallacy in Applying a Markov Model

So what is the reason for the discrepancies between the results which are obtained from reasoning based on combination of probabilities and consideration of practical situations and those obtained by Markov analysis?  The answer in a nutshell is that, while random failure is a Markov process, repair is not a Markov process.  Whilst failures are events which occur at random points in time, repairs occur at anything but random points in time.  Firstly, they occur deterministically immediately after failure, or at fixed intervals, depending upon the maintenance policy; secondly, and unlike failures, they are not events which occur at instants in time but processes which occur over periods of time.  The consequence is that Markov analysis is not applicable to the process of failure and repair, and results obtained by applying it are fundamentally flawed and cannot be expected to be correct.

Part of the reason for the apparent plausibility of Markov analysis of the problem of failure and repair is that failure rate and repair rate are deceptively similar quantities.  They are represented by Greek letters (λ for failure rate, μ for repair rate) which are consecutive in the Greek alphabet; both are terms which include the word “rate”, both are the reciprocal of “mean time”, and both have the dimensions of t⁻¹.  These similarities disguise the reality that they are fundamentally different quantities:

  • Failure rate (λ) is the rate at which random events occur on average.  Calculations of reliability and availability based on λ are only valid for random failures with constant λ, for which failure times are exponentially distributed.
  • Repair rate (μ) is the reciprocal of the average period of time it takes to repair the relevant item of equipment, i.e. 1 / MDT. The period will probably be subject to some randomness, but even if it is constant, so that the down time is always the same, the results for availability remain valid, i.e. there is no need to assume any particular distribution of repair time.  (Indeed, it may be wise to abandon the misleading term “repair rate” and the misleading symbol “μ”, because items are not repaired at a constant rate μ, they are repaired in brief periods of time which have an average duration = MDT.)
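The point made in the second bullet, that the availability of a single unit does not depend on the distribution of repair time, can be illustrated by simulation. The sketch below is an illustration under assumed values (MTBF = 1000 h, MDT = 10 h, 200 000 failure and repair cycles, fixed random seed); the function name is hypothetical. It simulates one unit with exponentially distributed times to failure and either a constant or an exponentially distributed down time, and compares both with MDT / (MTBF + MDT).

```python
import random


def simulate_unavailability(mtbf, mdt, cycles, constant_repair, seed=1):
    """Fraction of time one unit spends down over many fail/repair cycles."""
    rng = random.Random(seed)
    up = down = 0.0
    for _ in range(cycles):
        up += rng.expovariate(1 / mtbf)        # random time to failure
        if constant_repair:
            down += mdt                        # every repair takes exactly MDT
        else:
            down += rng.expovariate(1 / mdt)   # repair time with mean MDT
    return down / (up + down)


mtbf, mdt = 1000.0, 10.0                       # hypothetical values in hours
expected = mdt / (mtbf + mdt)
q_const = simulate_unavailability(mtbf, mdt, 200_000, constant_repair=True)
q_expo = simulate_unavailability(mtbf, mdt, 200_000, constant_repair=False)
print(expected, q_const, q_expo)
```

Both simulated values should agree with the formula to within sampling error, which is of the order of a few tenths of a percent for this number of cycles.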

2    Alternative Models

2.1   Active Redundancy

2.1.1  Combination of Probabilities

We can derive the unavailability of an active redundant system, where n out of m units are required to operate and all m units are normally on-line, and where MTBFunit >> MDTunit or (the corollary) λ.MDTunit << 1 (which would normally be the case), by applying the multiplication of probabilities rule, as:

$$Q_{\mathrm{system}} = {}^{m}C_{m-n+1}\,(\lambda\,\mathrm{MDT}_{\mathrm{unit}})^{m-n+1} = \frac{m!}{(m-n+1)!\,(n-1)!}\,(\lambda\,\mathrm{MDT}_{\mathrm{unit}})^{m-n+1}$$

where ${}^{m}C_{m-n+1}$ is the number of combinations of m-n+1 out of m, since m-n+1 units must fail for the system to be unavailable, e.g. for a 2 out of 4 system, 3 units must fail before the system fails, m = 4, n = 2 and m-n+1 = 3.  This produces the following results:

Table 1  Unavailability by Combination of Probabilities for Active Redundancy (Independent of Number of Repair Crews)

| Total number of units (m) | n = 1 | n = 2 | n = 3 | n = 4 |
|---|---|---|---|---|
| 1 | λ.MDTunit | | | |
| 2 | (λ.MDTunit)² | 2λ.MDTunit | | |
| 3 | (λ.MDTunit)³ | 3(λ.MDTunit)² | 3λ.MDTunit | |
| 4 | (λ.MDTunit)⁴ | 4(λ.MDTunit)³ | 6(λ.MDTunit)² | 4λ.MDTunit |

n = number of units required to operate

This table can be compared with the results obtained by Markov analysis, which are:

Table 2  Unavailability by Markov Analysis for Active Redundancy with a Single Repair Crew

| Total number of units (m) | n = 1 | n = 2 | n = 3 | n = 4 |
|---|---|---|---|---|
| 1 | λ.MDTunit | | | |
| 2 | 2(λ.MDTunit)² | 2λ.MDTunit | | |
| 3 | 6(λ.MDTunit)³ | 6(λ.MDTunit)² | 3λ.MDTunit | |
| 4 | 24(λ.MDTunit)⁴ | 24(λ.MDTunit)³ | 12(λ.MDTunit)² | 4λ.MDTunit |

n = number of units required to operate
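The numerical coefficients in Tables 1 and 2 can be regenerated with a few lines of code. This is a sketch: the combination coefficient follows the formula given above, while the Markov coefficient is computed on the assumption (not stated explicitly in the text, but consistent with every entry in Table 2) that the single-crew Markov result to leading order is the product of the successive failure rates mλ, (m-1)λ, ... along the chain of states, each divided by μ. The function names are hypothetical. In both tables the coefficient multiplies (λ.MDTunit) raised to the power m-n+1.

```python
from math import comb, prod


def combination_coefficient(m, n):
    """Coefficient in Table 1: combinations of m-n+1 failed units out of m."""
    return comb(m, m - n + 1)


def markov_coefficient(m, n):
    """Leading-order coefficient in Table 2 (single crew, assumed form)."""
    # Failure rates out of the states with 0, 1, ..., m-n units failed.
    return prod(m - k for k in range(m - n + 1))


for m in range(1, 5):
    row = [(combination_coefficient(m, n), markov_coefficient(m, n))
           for n in range(1, m + 1)]
    print(m, row)    # pairs of (Table 1 coefficient, Table 2 coefficient)
```

On this reading the two tables differ by a factor of (m-n+1)!, which is 1 whenever a single failure makes the system unavailable (n = m) and grows with the number of spares.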

For λ.MDTunit << 1, the results for any system with more than one spare are of course essentially academic, and even for systems with a single spare (n = m-1), unavailability due to common cause failures, of the order of λ.MDTunit, is likely to dominate unavailability due to independent failures, of the order of (λ.MDTunit)².

2.1.2       Consideration of Practical Failure and Repair Process

Another way of deriving the failure rates and unavailabilities for single spare (m-1 out of m) active redundancy systems is to consider the basic failure rate, i.e. the mean rate at which a first unit fails = m.λ, since all m units are on-line.  When a single unit fails, it is repaired with mean down time = MDTunit.  The probability of a second unit failing during that period is (m-1).λ.MDTunit, since (m-1) units remain on-line, so that the system failure rate is m.(m-1).λ².MDTunit.

| Number of units required to operate (m-1 out of m, Active Redundancy) | 1 out of 2 | 2 out of 3 | 3 out of 4 |
|---|---|---|---|
| Failure rate | 2λ².MDTunit | 6λ².MDTunit | 12λ².MDTunit |

These are the same as the failure rates obtained by Markov analysis, but this does not mean that Markov results can be generalised to the n out of m case, and Markov analysis does not give the correct result for unavailability.
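The failure-rate argument for single spare systems can be written out as a short calculation. The sketch below uses assumed figures (λ = 0.001 per hour, MDTunit = 10 h) and a hypothetical function name; it multiplies the rate of a first failure by the probability of a second failure during the repair and recovers the coefficients 2, 6 and 12 tabulated above.

```python
def system_failure_rate(m, lam, mdt_unit):
    """Failure rate of an (m-1) out of m active redundant system."""
    first_failure_rate = m * lam                 # all m units are on-line
    p_second_failure = (m - 1) * lam * mdt_unit  # during repair of the first
    return first_failure_rate * p_second_failure


lam, mdt_unit = 0.001, 10.0                      # hypothetical values
for m in (2, 3, 4):
    rate = system_failure_rate(m, lam, mdt_unit)
    coefficient = rate / (lam**2 * mdt_unit)
    print(f"{m - 1} out of {m}: {rate:.2e} per hour, coefficient {coefficient:.0f}")
```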

On average, if a second unit fails, it will fail half way through the down period of the first unit, i.e. at MDTunit / 2, so that:

$$\mathrm{MDT}_{\mathrm{system}} = \frac{\mathrm{MDT}_{\mathrm{unit}}}{2}$$

and the system unavailability is given by:

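$$Q_{\mathrm{system}} = \frac{\mathrm{MDT}_{\mathrm{system}}}{\mathrm{MTBF}_{\mathrm{system}} + \mathrm{MDT}_{\mathrm{system}}}$$

$$= \frac{\mathrm{MDT}_{\mathrm{system}}}{\mathrm{MTBF}_{\mathrm{system}}} \quad \text{if } \mathrm{MTBF}_{\mathrm{system}} \gg \mathrm{MDT}_{\mathrm{system}}$$

$$= \frac{\mathrm{MDT}_{\mathrm{unit}}}{2} \cdot \{ m\,(m-1)\,\lambda^2\,\mathrm{MDT}_{\mathrm{unit}} \}$$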

 

=