[SystemSafety] [External] A real-world Byzantine failure

Driscoll, Kevin kevin.driscoll at honeywell.com
Fri Sep 17 20:56:23 CEST 2021


When Brendan Hall (co-author on several "Byzantine really exists" papers) and I started reading about this incident, we correctly predicted that the recommended "fixes" would be of the type that would immeasurably reduce the probability of occurrence but not deal with the underlying problem.  Thus, after implementing the "fixes", this problem can recur with insufficiently bounded probability.  This is invariably the path followed by those designers who don't understand Byzantine.

This is not the first time that an A330 has had a Byzantine failure among these components.  Brendan recently found a report that included a description of a prior in-service Byzantine failure.  But the authors (and presumably all the readers) of that report failed to recognize the Byzantine failure, and so it was never reported as such.  When we get less swamped, we may write a paper on this, which would include a description of our work to create systematic knowledge-assisted design process methods to identify vulnerable designs.


> Kevin Driscoll has published a number of papers describing how the Boeing 777 flight control system was designed to avoid Byzantine failures.

Actually, the B777 AIMS (cockpit system) we designed has full Byzantine coverage.  Its flight control has only a mechanism similar to "signed messages" Byzantine agreement, which has less *provable* coverage.  This was due to the FCS designers discovering Byzantine too late in the design process and not having enough remaining spare data network bandwidth to do the full-exchange required.  The B777 HEXAD inertial system uses an unconventional Byzantine failure recovery mechanism, not yet reported in the literature.


From: systemsafety <systemsafety-bounces at lists.techfak.uni-bielefeld.de> On Behalf Of Dewi Daniels
Sent: Friday, September 17, 2021 4:28 AM
To: The System Safety List <systemsafety at lists.techfak.uni-bielefeld.de>
Subject: [External] [SystemSafety] A real-world Byzantine failure

The Taiwan Transportation Safety Board has published the final report on a serious incident involving an Airbus A330. While landing at Taipei, all three flight control primary computers shut down, resulting in the spoilers, autobrake and engine thrust reversers failing to operate. The aircraft stopped just 30 ft before the end of the runway.
https://www.ttsb.gov.tw/english/18609/18610/26634/post
https://www.ttsb.gov.tw/media/4913/ci202_executive-summary_release.pdf
https://www.ttsb.gov.tw/media/4936/ci-202-final-report_english.pdf

The report says that the flight control primary computers were shut down because the COM/MON pairs weren't synchronised closely enough. This resulted in the COM and MON channels reading different input values and disagreeing with each other. This is claimed to be an unusual edge case that has not been seen before.
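
As a rough illustration of the mechanism described above, here is a minimal sketch of how two channels that sample the same input at slightly different instants can disagree. It is not the actual A330 logic: the function names, the transition time and the 1 ms skew are all hypothetical values chosen for illustration.

```python
# Sketch only: all names and timing values below are assumptions,
# not taken from the TTSB report or from any Airbus documentation.

def input_signal(t_us, transition_us=1_000_000):
    """Hypothetical discrete input that changes state at transition_us (microseconds)."""
    return t_us >= transition_us


def com_mon_agree(t_com_us, skew_us, signal=input_signal):
    """COM samples at t_com_us; MON samples skew_us microseconds later.

    Returns True if both channels read the same value.
    """
    com_value = signal(t_com_us)
    mon_value = signal(t_com_us + skew_us)
    return com_value == mon_value


if __name__ == "__main__":
    # The channels agree well before and well after the transition, but
    # disagree when the transition falls inside the sampling skew window.
    for t in (500_000, 999_500, 1_500_000):
        print(t, com_mon_agree(t, skew_us=1_000))
```

The point of the sketch is that neither channel is faulty in isolation: each reads a value that was correct at the instant it sampled, and the disagreement comes only from the input changing between the two samples.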

This is an instance of a well-known problem in computer science called the Byzantine Generals problem. Leslie Lamport's seminal paper presented a solution to the Byzantine Generals problem.

http://lamport.azurewebsites.net/pubs/pubs.html#byz
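
For readers unfamiliar with the paper, the simplest case of its "oral messages" algorithm, OM(1), can be sketched as follows: one commander and three lieutenants, tolerating one faulty participant. This is a simplified illustration under stated assumptions, not how any aircraft system implements agreement; the function names, the "ATTACK"/"RETREAT" values and the default on a tie are choices made for the example.

```python
from collections import Counter

DEFAULT = "RETREAT"  # assumed fallback when there is no strict majority


def majority(values):
    """Return the strict-majority value, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else DEFAULT


def om1(commander_sends, relay):
    """One run of OM(1) for one commander and lieutenants 0..n-1.

    commander_sends[i] is the value the commander sends to lieutenant i.
    relay(i, j, v) is what lieutenant i tells lieutenant j it received,
    given that it actually received v; a loyal lieutenant returns v.
    Returns the list of decisions, one per lieutenant.
    """
    n = len(commander_sends)
    decisions = []
    for j in range(n):
        # Lieutenant j votes over its own value plus what each peer relays.
        seen = [commander_sends[j]]
        seen += [relay(i, j, commander_sends[i]) for i in range(n) if i != j]
        decisions.append(majority(seen))
    return decisions


def loyal(i, j, v):
    """Every lieutenant relays exactly what it received."""
    return v


def lieutenant_2_lies(i, j, v):
    """Lieutenant 2 is faulty and tells each peer something different."""
    if i == 2:
        return "ATTACK" if j == 0 else "RETREAT"
    return v
```

With a faulty commander, `om1(["ATTACK", "RETREAT", "ATTACK"], loyal)` gives all three lieutenants the same decision even though they were sent different values. With a loyal commander and a faulty lieutenant, `om1(["ATTACK"] * 3, lieutenant_2_lies)` leaves lieutenants 0 and 1 both deciding on the commander's value. The extra exchange round between lieutenants is what provides this coverage, and it costs additional message traffic.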

Kevin Driscoll has published a number of papers describing how the Boeing 777 flight control system was designed to avoid Byzantine failures.

Yours,

Dewi Daniels | Director | Software Safety Limited

Telephone +44 7968 837742 | Email dewi.daniels at software-safety.com

Software Safety Limited is a company registered in England and Wales. Company number: 9390590. Registered office: Fairfield, 30F Bratton Road, West Ashton, Trowbridge, United Kingdom BA14 6AZ