[SystemSafety] Stupid Software Errors [was: Overflow......]

Mon May 4 08:41:56 CEST 2015

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I wrote a version of the following a few days ago to a closed list.

AA has EFBs crashing on a number of flights. Apparently two copies of the approach chart for
Reagan Washington National airport were included after the latest update of the EFB, and the
app wasn't able to handle having two files with almost-identical metadata denoted as "favorites".
A colleague who flies for a major airline (not AA) which uses EFBs spoke of some colleagues having
their EFBs crash early on Jan 1 one year - they fixed it by rolling the date back a day.

On the Boeing 787: think of 32-bit Unix clock, and lots of examples. There's even a Wikipedia page
http://en.wikipedia.org/wiki/Time_formatting_and_storage_bugs .
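
To make the 32-bit clock point concrete, here is a minimal sketch of what a signed 32-bit counter does when it runs out of range. The helper name wrap_int32 is mine, and the last part assumes a counter of hundredths of a second, which is the commonly reported explanation for the 787 generator issue and not something stated in this message.

```python
from datetime import datetime, timedelta, timezone

INT32_MAX = 2**31 - 1

def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit two's-complement range."""
    return (value + 2**31) % 2**32 - 2**31

# Last second representable in a signed 32-bit Unix time value.
last_second = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(last_second.isoformat())      # 2038-01-19T03:14:07+00:00

# One tick later the stored value wraps to the most negative number.
print(wrap_int32(INT32_MAX + 1))    # -2147483648

# ASSUMPTION: a signed 32-bit counter incremented every 0.01 s.
uptime_at_overflow = timedelta(seconds=2**31 / 100)
print(uptime_at_overflow.days)      # 248
```

The arithmetic is trivial, which is rather the point: the range of the type and the tick rate fix the overflow date at design time.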

Remember Apple's "goto fail" (CVE-2014-1266) from 2014: a duplicated goto statement bypassed the
signature verification check.

These are simple, known types of error. Forty years ago, it was known how to avoid all these kinds
of problems. Twenty years ago, there were industrial-quality engineering tools available (proper
languages and coding standards checkers) which enabled companies to avoid such problems without
undue development costs.

I don't buy Derek Jones's or Tom Ferrell's versions of the curate's egg. I don't see why anyone
else should, either. Are they still going to be saying "well, it depends, it's complicated" in
another twenty years when stupid coding errors still make it through into supposedly-dependable
software products?

Look at goto fail. That's critical code! How come critical code such as that is not routinely
subject to static analysis?
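
As a rough illustration of how little analysis is needed to flag this class of defect, here is a sketch of an unreachable-code check. It is an analogue only: Python has no goto, so the sample uses a stray return in the same position, and the function name unreachable_lines and the SAMPLE text are made up for the purpose. It is not how any particular commercial checker works.

```python
import ast

# Statements after which nothing else in the same block can execute.
TERMINATORS = (ast.Return, ast.Raise, ast.Break, ast.Continue)

def unreachable_lines(source: str) -> list:
    """Return the line number of the first dead statement in each block."""
    found = []
    for node in ast.walk(ast.parse(source)):
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if not isinstance(block, list):
                continue
            for prev, stmt in zip(block, block[1:]):
                if isinstance(prev, TERMINATORS):
                    found.append(stmt.lineno)
                    break
    return sorted(found)

# Hypothetical verification routine with a stray unconditional return.
SAMPLE = """\
def verify(hash_ok, signature_ok):
    err = 0
    if not hash_ok:
        err = 1
    return err
    if not signature_ok:
        err = 2
    return err
"""

print(unreachable_lines(SAMPLE))    # [6]
```

The signature check on line 6 of the sample can never run, and a walk over the syntax tree is enough to say so.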

Look at the 787 generator code. A systematic loss of all generators is surely a hazardous event.
That should make it 10^(-7). Oh, but I forgot. Even though correct operation of SW contributes to
the 10^(-7), the reliability of the SW itself is not assessed. But surely it gets to be at least
DAL B, since the result is a hazardous event? Oh, but I forgot something else. A systematic
failure like that would be common cause, and the certification requirements concern single
failures, not common cause failures. So that's all right then. Tom's suggestion that it might have
been a design compromise is vitiated by the fact that the phenomenon is subject to an
AIRWORTHINESS Directive by the FAA. (Is that sufficient emphasis?)

If people had told me thirty years ago that we'd still be making the same stupid mistakes in the
same ways, but this time in code more fundamental to the safe or secure operation of everyday
engineered objects, I wouldn't have believed it.

Maybe it's a social thing. Mostly, people actually writing the code and inspecting it are in their
twenties and their bosses maybe at most in their early thirties. The young people have never made
*this* mistake before - the previous lot had of course, but they're all in management now. I'm
reminded of Philip Larkin's ode to rediscovery, Annus Mirabilis:

Sexual intercourse began
In nineteen sixty-three
(Which was rather late for me)-
Between the end of the Chatterley ban
And the Beatles' first LP.

The Ensuing Discussion.

There was obviously discussion on the list of why we are making the same old mistakes forty years
after it was known how to avoid them. Some discussants suggested it might help to professionally
certify software engineers, a PE. Others referred to the Knight-Leveson study a decade ago for the
ACM, in which inserting SE into the current PE scheme was not seen as advantageous. UK discussants
pointed out that such certification exists in the UK, as a CEng through the BCS or IET, and that
there had been some UK consideration of extra qualification for critical-software engineering.

Such qualification for system safety hasn't (yet) generally caught on anywhere. SARS offer it in
the UK for example. It didn't catch on in the US. Over a decade ago, the System Safety Society
introduced an option for system safety engineering into the PE exam. They had to pay the NSPE or
NCEES (I forget which) lots of money per year to maintain the option - and two people took it in
some number of years. So they dropped it. (I was at the board meeting in Ottawa in 2004 when this
was decided.)

The UK qualification regime hasn't stopped IT disasters in government procurement. And it hasn't
stopped the kind of poor engineering which allows bank ATMs which use supposedly
pseudo-one-time-pad nonce generation to be subject to replay attacks (see a recent paper reciting
local experiments performed by Ross Anderson's group). I do note, however, that the three examples
I mentioned above are all US examples. It's not ruled out that having some degree of formal
professional training, as in the UK, encourages software engineers to avoid repeating simple
mistakes whose prophylaxis has been well known for decades.

Time was, when UK and US cars were not known for their reliability. Kind of like SW,
relatively-inexpensive cars used to go wrong a lot. However, some very expensive cars, such as those
made by Rolls-Royce/Bentley and Wolseley, were reliable. So there was proof of concept. Japanese
companies decided it was possible to produce reliable relatively-inexpensive cars and make money,
and did it.

There is proof of concept in SE, too. Unlike Rolls-Royce cars, it is not prohibitively expensive.
Three out of my four examples involve run-time error. It is feasible to produce SW
cost-effectively which is free from run-time error. Just like the Japanese approach to cars, you
just have to decide to do it.

How about the following? We design a document called A Programmer's Pledge. It has thirty or so
numbered clauses:

* I promise never to deliver SW which is subject to a data-range roll-over phenomenon (especially
dates and times)

* I promise never to deliver software which is subject to a numerical overflow or underflow exception

* I promise never to deliver software which reads data on which it raises an "out of range" exception

* ..... and so on
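
The first three clauses could look something like the following in code. This is a sketch under my own assumptions, not a prescribed method: all three function names are hypothetical, and the choice to return None (so the caller must deal with the rejected case explicitly) is just one of several reasonable designs.

```python
from datetime import date, timedelta
from typing import Optional

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def checked_add_int32(a: int, b: int) -> Optional[int]:
    """Return a + b, or None if the sum leaves the signed 32-bit range."""
    total = a + b
    return total if INT32_MIN <= total <= INT32_MAX else None

def parse_bounded(text: str, low: int, high: int) -> Optional[int]:
    """Parse external input; return None rather than pass on a bad value."""
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if low <= value <= high else None

def next_day(d: date) -> Optional[date]:
    """Advance a date via the calendar library so that month and year
    boundaries are handled in one place; None at the end of the range."""
    return None if d == date.max else d + timedelta(days=1)

print(checked_add_int32(INT32_MAX, 1))   # None
print(parse_bounded("400", 1, 366))      # None
print(next_day(date(2014, 12, 31)))      # 2015-01-01
```

In each case the out-of-range condition is decided where the value is produced, not discovered later as an exception somewhere else.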

A professional programmer signs it and files it with his/her professional organisation. Quality
control issues in programs (such as the above phenomena) are routinely subject to RCA of sorts.
When a programmer is responsible for a piece of code with such an error in it, the company reports
it to the professional organisation and the programmer gets "points" attached to the corresponding
clause in his/her Pledge, as with driving offences (Germans say "points in Flensburg", which is
where the office is. What is it in the UK? "Points in Cardiff"?). I bet lots of organisations, from
companies hiring programmers to professional-insurance companies, will find uses for it.

PBL

Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany
Je suis Charlie
Tel+msg +49 (0)521 880 7319  www.rvs.uni-bielefeld.de