[SystemSafety] Chicago controller halts Delta jet's near-miss....

Sat Jun 27 13:47:05 CEST 2015

Les,
I think this might be overkill. If the only mechanism to assure safety
were the conversation alone then your proposed scheme makes more sense.
But it's not the only mechanism. The Midway incident was recognized
visually. Ground radar can also help. So my point is: if failures can
be recognized by other means, is it really necessary to put such a burden
on communication alone?

I'd be willing to bet that mis-communication happens far more often than
this, and rarely ends up in a worst-case outcome (e.g., Tenerife). The
vast majority of the time, one of these other mechanisms catches the
problem before it turns into a disaster. It might not be economically
viable to add this extra layer because it does come at a cost.

Cheers,

-- steve

-----Original Message-----
From: Les Chambers <les at chambers.com.au>
Date: Friday, June 26, 2015 11:48 PM
To: 'Peter Bernard Ladkin' <ladkin at rvs.uni-bielefeld.de>
Cc: "systemsafety at lists.techfak.uni-bielefeld.de"
<systemsafety at lists.techfak.uni-bielefeld.de>
Subject: Re: [SystemSafety] Chicago controller halts Delta jet's near-miss....

Peter 
Further, on my proposition that the air traffic controller (ATC)/pilot take
off protocol should be:
[1]ATC: cleared to take off
[2]Pilot: preparing for takeoff
[3]ATC: approved for takeoff
... Or words to that effect (NOTE: I am referring to the conceptual design
of the protocol which addresses the number, sequence, rationale and meaning
or essence of each message and excludes exact details of format and
content).
I have a problem with your reasoning that message [3] is unnecessary. I am
passionate about this issue as the absence of a message [3] was responsible
for one of the two career-ending near misses I have experienced in the past
40 years.

You stated, "There is no good reason for a controller to ACK a correct
readback, and it would complicate matters cognitively when transmissions
take up almost all the air time, which often happens at a major airport."

At face value this seems true in a perfect world. But we do NOT live in a
perfect world.

In response, let me first state some propositions that, I believe, are
self-evident: axioms, if you like (please stay with me while I state the
bleeding obvious).

ATC/pilot take off protocol is a master-slave protocol. The ATC is the
master, the pilot is the slave.
The pilot is the slave because his scope for independent action is severely
limited while under the direction of the ATC. He basically does what he's
told. And the safety of the system as a whole is dependent on him doing
exactly what he is told.

ATC/pilot take off protocol is life-critical because failure of this
protocol has a high probability of causing loss of life.

A life-critical protocol must, as far as possible, be robust. That is, it
must compensate for failures in both master and slave, maintaining the
system as a whole in a safe state in the presence of credible failure
scenarios.

The most common failure mode of a party to a protocol (master or slave) is
"poor health", where "poor health" means the person/party/system/subsystem
ceases to behave in compliance with its specification. Classic examples are
errors of commission and omission:
- the master issues an incorrect control for the current context;
- the master does not issue a control when a control is required in the
current context; 
- the slave ignores a control from the master, doing nothing when something
is required;
- the slave takes uncommanded action which is out of scope of its
responsibilities (e.g. unapproved takeoff);
- the slave misinterprets a control, taking an unsafe action in the current
context.

An aircraft lying stationary at a turn off is in a low-risk state. The
probability of harm to the pilgrims aboard is relatively low.

An aircraft in take off is in a high-risk state where the probability of
harm, based on past experience, is comparatively high.

In transitioning his/her aircraft from low to high risk states a pilot
makes the following assumptions:
1. The ATC is healthy, that is, has issued a correct control for the
current context (message [1])
2. The control, as received by the pilot, has been correctly interpreted by
the pilot
3. The pilot's interpretation of the control as-acknowledged to the ATC,
has been received and understood by the ATC (message [2])
4. If the pilot's interpretation of the control is INCORRECT, the ATC +
message transmission medium is sufficiently healthy to transmit a "stop
stop stop" control
5. Upon receipt of the "stop stop stop" control, the pilot is sufficiently
healthy to correctly interpret and act upon it.

For the two message protocol to succeed for all time, all the above
assumptions must be correct for all time.
At the core of these assumptions is the proposition that both master and
slave will be healthy for all time.
This is a dangerous assumption as these entities are human beings capable
of gross errors of judgement.

Some examples:
- Distracted with personal problems. His wife has left him, his child is
sick, his daughter has died of a drug overdose (I cite television drama
series: Breaking Bad, season 2 episode 13). In the real world I once
frogmarched an engineer to the front door and told him to go home and look
after his family. He had a three-year-old daughter with a temperature of
104 and a wife pleading with him on the hour to come home because their
daughter
was dying. This particular start-up was the culmination of three years of
his engineering. He was rightfully torn between work and home. In this
induced, highly negative, mental state he was a danger to himself and
everyone else on the start-up.
- Heart attack in progress. On one project we worked through the complete
community of 40 operators and gave those with potential heart or other
potentially debilitating health problems very attractive redundancy
packages.
- High on various chemical substances. The Australian Navy just revealed
several weapons electronics operators using various drugs while war
fighting a destroyer. Drug tests were ineffective. Drug tests ARE typically
ineffective (I cite Lance Armstrong's 10 year career in drug cheating).
- In possession of an urgent desire to commit suicide (or watch entranced
as others do). I cite Germanwings.
- Unsighted. Fog means that the ATC cannot see the aircraft (as per
Tenerife, scene of the world's worst aviation disaster).
- Panicked. The ATC is new, overworked or faced with an unusual situation.
This was the case at Tenerife airport where traffic normally routed for Las
Palmas was diverted due to a terrorist bombing. Tenerife was jampacked with
heavy jets and then the fog came down ...

The three-step protocol eliminates the need for assumptions 2, 3, 4 and 5.
Message [3] pretty much catches everything. If you don't hear it, you don't
move.
Further, my core objection to the two-step protocol is: the pilot takes
radical action (that is, transitions his aircraft from a low to a high-risk
state) on a NULL value. Silence means ascent. In doing this the pilot
assigns meaning to a NULL. This is recognised bad practice. The only
situation in which I can justify it is when the system is in a safe state
and NULL is interpreted as: do nothing. That is, remain in your current
safe state. This is what the three-step protocol achieves. If you hear
nothing from the ATC when you are expecting message [3] you do nothing
and stay safe.
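
The difference in how the two protocols treat a NULL can be sketched in a few
lines of Python. This is only an illustration of the concept, not real ATC
phraseology: the message strings "cleared", "approved" and "stop" stand in for
messages [1], [3] and the "stop stop stop" control, None stands for silence,
and the names State, two_step and three_step are hypothetical.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    HOLDING = "holding"        # low-risk state: stationary at the turn-off
    TAKING_OFF = "taking off"  # high-risk state


def two_step(clearance: Optional[str], reply: Optional[str]) -> State:
    """Pilot reads back message [1] and proceeds unless a 'stop' arrives.

    `reply` is whatever the pilot hears after the readback; None is silence.
    """
    if clearance != "cleared":
        return State.HOLDING
    # Silence (None) is assigned a meaning here: it is read as consent.
    return State.HOLDING if reply == "stop" else State.TAKING_OFF


def three_step(clearance: Optional[str], reply: Optional[str]) -> State:
    """Pilot proceeds only on an explicit message [3]."""
    if clearance != "cleared":
        return State.HOLDING
    # Silence (None) means: do nothing, remain in the current safe state.
    return State.TAKING_OFF if reply == "approved" else State.HOLDING
```

If the reply is lost (None), two_step("cleared", None) ends in TAKING_OFF
while three_step("cleared", None) stays in HOLDING.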

Returning to my career-ending near miss, let me set the scene:
A SCADA system is implemented with extensive use of optic fibre networks
over distances of 20 kilometres. The optic fibre has one or more bad
joints, which cause disconnect-reconnect scenarios, typical of dry joints
in copper. The SCADA master communications module is required to compensate
for this but it has a bug. On reconnect it blasts all its slaves with
messages including random bit patterns. In a demonstration of genuine bad
luck, one of these bits is misinterpreted as a control by one of its slave
controllers and an unsafe action is taken. Lucky for us nobody was killed.
Had a three-step protocol been implemented the slave would have:
1. Not received message [3] and done nothing (the most likely scenario
as the master was profoundly unhealthy)
2. Received another random bit pattern with a low probability of repeating
the same control bit position (I know, I know, we should have used a 32-bit
word not a single bit to command a critical action)
3. Received a healthy response that cancelled the previous erroneous
control.

Overall the probability of total system failure would have been
significantly reduced.
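
To put rough numbers on this, here is a minimal sketch. It rests on an
assumption made purely for illustration, not on any measurement of the real
system: each spurious word sent on reconnect is uniformly random and
independent of the others, and the slave acts only when every word it needs
matches one specific command pattern. The function name p_spurious is
hypothetical.

```python
from fractions import Fraction


def p_spurious(command_bits: int, matching_words: int = 1) -> Fraction:
    """Chance that random noise is accepted as a critical command.

    Assumes `matching_words` independent, uniformly random words must each
    equal one specific pattern that is `command_bits` wide.
    """
    return Fraction(1, 2 ** command_bits) ** matching_words


# One-bit control, acted on immediately (no confirming message)
single_bit_no_confirm = p_spurious(1, 1)
# One-bit control, but a second matching message is required before acting
single_bit_confirmed = p_spurious(1, 2)
# 32-bit command word, with and without a confirming message
word32_no_confirm = p_spurious(32, 1)
word32_confirmed = p_spurious(32, 2)
```

Under these assumptions the confirming message squares the probability of a
spurious actuation, and widening the command from one bit to a 32-bit word
reduces it by a further factor of 2**31 per message.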
------------------------
In response to your comment on my opening paragraph:

LES:
> It seems to me that the ATC - Pilot voice protocol is missing a step.
> ... In concept, a safer protocol might look like this: ATC: You are
> cleared for takeoff. Pilot: My understanding is that I am cleared for
> takeoff. ATC: Your understanding is correct

PETER:
"This misses crucial information, namely addressing, which is critical in a
multiuser broadcast context. "

As stated above, my opening paragraph was a statement of conceptual design.
Conceptual designs do not include detailed design or implementation detail.
Conceptual designs test fundamental concepts, principles, rationales and
assumptions. This is what I have attempted to achieve above. I raised this
issue as it is a common failing of technical designers to dive into massive
detail without due consideration of the purity of their concepts (example:
asking the question: are we allowing a transition to a high-risk state on
the basis of a NULL?). Insufficient time in conceptual design is the
ready-fire-aim approach from which the world suffers every waking hour.
----------------------------------
On your comment regarding fixing the grammar used in ATC conversations, I
wonder. You may have fixed the grammar but your linked papers show no
evidence that cognition between pilot and ATC was tested and improvements
demonstrated. Have you any?
---------------------------------
In conclusion let me ask you:
If I commanded you, as my slave, to jump off a cliff, would you respond,
"acknowledged, jumping", and execute the control?
I suspect not. Any rational person would opt for message [3] before
transitioning their body from the safe to the unsafe state.

Even more probable would be the response, "Les, go jump yourself."
Syntactically poor but semantically rich.

Cheers
Les

-----Original Message-----
From: Peter Bernard Ladkin [mailto:ladkin at rvs.uni-bielefeld.de]
Sent: Sunday, June 21, 2015 6:36 PM
To: Les Chambers
Subject: Re: [SystemSafety] Chicago controller halts Delta jet's
near-miss....

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 2015-06-21 01:28 , Les Chambers wrote:
> It seems to me that the ATC - Pilot voice protocol is missing a step. ...
> In concept, a safer protocol might look like this: ATC: You are cleared
> for takeoff Pilot: My understanding is that I am cleared for takeoff ATC:
> Your understanding is correct

This misses crucial information, namely addressing, which is critical in a
multiuser broadcast context. A clearance is preceded by a call sign, and an
ACK is succeeded by the call sign. Call signs may be abbreviated, which can
lead to confusion when the abbreviations are close, and when transmissions
are stepped on, which might have been the case in the incident in question.
So let's correct for call signs, and translate into the standard
ATC-aircraft controlled language. What you suggest is:

> [1] ATC: [call sign] Cleared for takeoff
> [2] CRW: Cleared for takeoff [call sign]
> [3] ATC: [call sign] Affirmative

Steps 1 and 2 are required. Step 3 is not; if Step 2 is not correctly
executed, then Controller will respond:

> [1] ATC: [call sign] Cleared for takeoff
> [2] CRW: Cleared for takeoff [other call sign]
> [3] ATC: <other call sign> Negative [other call sign]

or

> [1] ATC: [call sign] Cleared for takeoff
> [2] CRW: Cleared for takeoff [other call sign]
> [3] ATC: [other call sign] Negative [other call sign]; <[call sign] Cleared for takeoff>

which, if you analyse it, works just as well, and is more efficient. (The
"<...>" indicates an optional expression.) Don't forget that this may be
interspersed with other transmissions, for example

> [2'] CRW: Cleared for takeoff [call sign]

in which case the option will not be exercised.

There is no good reason for a controller to ACK a correct readback, and it
would complicate matters cognitively when transmissions take up almost all
the air time, which often happens at a major airport.
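
The controller's side of this exchange can be sketched as a small function.
It is a simplified illustration only: the name controller_reply, the
repeat_clearance flag (standing in for the optional "<...>" part), and the
call signs ABC123 and ABC132 are all hypothetical, and real phraseology and
partial or stepped-on transmissions are not modelled.

```python
from typing import Optional


def controller_reply(cleared_call_sign: str,
                     readback_call_sign: str,
                     repeat_clearance: bool = False) -> Optional[str]:
    """Return the controller's transmission after a readback, or None.

    A correct readback gets no ACK (None). A readback from the wrong
    aircraft gets a "Negative", optionally followed by the clearance
    repeated to the intended aircraft.
    """
    if readback_call_sign == cleared_call_sign:
        return None  # correct readback: controller stays silent
    reply = f"{readback_call_sign} Negative {readback_call_sign}"
    if repeat_clearance:
        reply += f"; {cleared_call_sign} Cleared for takeoff"
    return reply


correct = controller_reply("ABC123", "ABC123")
wrong = controller_reply("ABC123", "ABC132")
wrong_repeat = controller_reply("ABC123", "ABC132", repeat_clearance=True)
```

Here correct is None, wrong is "ABC132 Negative ABC132", and wrong_repeat
appends "; ABC123 Cleared for takeoff".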

Further, complete expression as whole phrases doesn't illustrate the
resilience of the language in the face of partial obscuration, which is an
important feature.

Cushing (op. cit. antea) provided a grammar for such communications in his
book. Cushing is a linguist, but his grammar was partially incorrect and
also structurally more complex than need be. Twelve-thirteen years ago,
some people working with me fixed it. See Review of the Cushing Grammar, by
Martin Ellermann and Mirco Hilbert at
http://www.rvs.uni-bielefeld.de/publications/Papers/hillermann-critique.pdf
and Building a Parser for ATC Language, by the same authors, at
http://www.rvs.uni-bielefeld.de/publications/Papers/hillermann-critique.pdf

PBL

Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld,
33594 Bielefeld, Germany
Je suis Charlie
Tel+msg +49 (0)521 880 7319  www.rvs.uni-bielefeld.de