[SystemSafety] AI Hallucination Cases
Les Chambers
les at chambers.com.au
Fri Jul 18 08:54:29 CEST 2025
Bernard
Please excuse the length of my response. Google Gemini AI had a lot to say on
the subject and put it so well I felt it was worth including.
My Gemini prompt was:
What is the current state of the art in predicting how a system incorporating
a large language model will respond to a given stimulus? Is it still true
that, given a set of inputs, the outputs from one of these models may vary,
making these models nondeterministic? Discuss the problems in validating a
system like this in safety-critical applications such as automobile control
systems. Address the issue that, as reported by Tesla developers, the LLM is
replacing control logic that was typically implemented with code. What does
this mean for the safety of a motor vehicle if its response to a set of inputs
cannot be definitively predicted, that is, validated using a system test?
------ END GEMINI PROMPT
My Summary:
Given what follows (assuming Gemini is not hallucinating):
- LLMs are, at this point in time, nondeterministic
- They cannot be validated by traditional means
- They cannot be built to any human-definable standard (because even their
developers do not fully understand how these black boxes work or why they work
at all)
Back in the day, a complex control system capable of going rogue through some
pathological emergent behaviour could be regulated (i.e. have its controlled
object brought to a safe state) by a separate, overarching Safety Monitoring
System that ensured the monitored control system operated safely and within
predefined limits, preventing any harmful actions or failures. This was
possible in a chemical reactor or nuclear power plant because the state of the
controlled object could be determined reliably through deterministic
instrumentation. It is not possible with an automobile, where an AI black box
is required to interpret a video feed or a LiDAR 3D map.
Solutions? We need 'em, as nothing will stop the embodiment of AIs in everyday
objects that can kill us. The benefits are too seductive. Neither will the
public cease and desist from using "intelligent" objects. Witness Waymo's
self-driving taxi service, which provides more than 250,000 paid rides per
week across its U.S. markets, including San Francisco, Los Angeles, Phoenix,
Austin, and Atlanta.
My thought is that safety-critical LLMs will have to be certified,
certification being a function of the volume and diversity-of-context of the
training data and of anomaly-free simulator hours, to name two parameters.
Sad, but it seems many of the artefacts we hung our hats on back in the day
are on the way out ... requirements specifications, detailed designs, test
plans, test cases, traceability analyses, safety cases (right now it'd be a
brave actor who'd attempt one for an AI-embodied system) ...
A final thought: I weep for the brave hearts attempting to create standards
for best practice for a technology most of which is still in the lab. I'm sure
they're having useful thoughts, though, and I'd love to hear some of them.
Cheers
Les
------ START GEMINI RESPONSE
The integration of Large Language Models (LLMs) into safety-critical systems
like automotive control is a rapidly evolving field, presenting both immense
opportunities and significant challenges.
### Current State of the Art in Predicting LLM Responses
Predicting how an LLM will respond to a given stimulus is an area of intense
research. While LLMs demonstrate remarkable capabilities in understanding and
generating human-like text, their internal workings are often considered a
"black box." The current approaches to understanding and predicting their
behavior include:
* **Interpretability and Explainability (XAI):** Researchers are developing
methods to shed light on *why* an LLM makes a particular decision. This
involves techniques like attention mechanisms (showing which parts of the
input were most influential), saliency maps, and feature attribution methods.
The goal is to make the decision-making process more transparent, even if full
predictability remains elusive.
* **Prompt Engineering and Contextual Control:** By carefully crafting prompts
and providing specific in-context learning examples, it's possible to guide an
LLM's responses and make them more predictable for certain tasks. Retrieval
Augmented Generation (RAG) is a key technique here, where LLMs are augmented
with external, factual knowledge to reduce hallucination and improve accuracy
and consistency (a minimal RAG sketch appears after this list).
* **Behavioral Testing and Benchmarking:** Extensive testing with diverse
inputs is crucial. This involves creating large datasets of scenarios and
observing the LLM's outputs, looking for patterns, anomalies, and adherence to
desired behaviors. Benchmarking against human performance or traditional rule-
based systems helps quantify their capabilities.
* **Formal Methods (Emerging):** For some aspects of LLM behavior,
particularly where they are used to generate code or act as part of a formal
system, there's research into applying formal verification methods. This
involves mathematically proving certain properties of the system, though it's
extremely challenging for the full complexity of an LLM.
* **Monitoring and Feedback Loops:** In deployed systems, continuous
monitoring of LLM outputs and real-time feedback loops can help detect
unexpected behaviors and trigger corrective actions.
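To illustrate the RAG idea mentioned above in its most minimal form: the
sketch below retrieves the reference snippet most relevant to a query and
prepends it to the prompt, so the model is asked to answer from supplied facts
rather than from its parametric memory alone. The document snippets, the crude
word-overlap retriever, and the `call_llm` stub are invented placeholders for
illustration, not any particular product's API.

```python
# Minimal sketch of Retrieval Augmented Generation (RAG): pick the most
# relevant reference snippet and prepend it to the prompt. The snippets,
# retriever, and call_llm stub are placeholders for illustration only.

DOCUMENTS = [
    "ISO 26262 covers functional safety of electrical/electronic systems in road vehicles.",
    "IEC 61508 is the base functional safety standard for E/E/PE safety-related systems.",
]

def relevance(query: str, doc: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str) -> str:
    """Return the snippet with the highest word-overlap score."""
    return max(DOCUMENTS, key=lambda d: relevance(query, d))

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"[model output for a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    context = retrieve(query)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)

print(answer("Which standard covers functional safety of road vehicles?"))
```

Grounding the prompt in retrieved text narrows the model's latitude for a
given query, but it does not make the generation step itself deterministic.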
### Are LLMs Nondeterministic?
**Yes, generally speaking, LLMs are considered nondeterministic.** Even with
the same input, a large language model can produce different outputs. This
non-determinism stems from several factors:
* **Probabilistic Nature:** LLMs generate text by predicting the next most
probable word (or token) in a sequence. This prediction often involves
sampling from a probability distribution. While a "temperature" parameter can
be set to 0 to make the sampling more deterministic (always choosing the most
probable token), even then, slight variations in computational environment,
floating-point arithmetic, or parallel processing can lead to different
outcomes.
* **Internal State and Context:** Although the model weights do not change
between invocations, the effective input often does: in complex systems the
model may carry a conversational history or be influenced by external context
that is not explicitly part of the immediate input, so nominally identical
prompts can arrive with different surrounding context.
* **Distributed Computing:** When LLMs are run on distributed systems, the
order of operations or memory access patterns can vary, leading to slight
differences in computations and thus in outputs.
* **Quantization and Optimization:** During deployment, models might undergo
quantization or other optimizations that introduce minor numerical variations,
contributing to non-deterministic behavior.
While efforts are made to increase determinism for specific applications,
especially in controlled environments, the inherent design of these models
makes them fundamentally probabilistic.
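To make the temperature point above concrete, here is a toy, illustrative
sketch; the next-token probability table is invented and stands in for a real
model, which it does not resemble in scale or behaviour. Greedy (temperature
0) decoding picks the same token on every run, while sampling at a non-zero
temperature can differ from run to run.

```python
# Toy illustration of greedy vs. sampled decoding. The probability table
# below is an invented stand-in for a real model's next-token distribution.
import math
import random

NEXT_TOKEN_PROBS = {"hard": 0.50, "gently": 0.45, "later": 0.05}

def greedy() -> str:
    """Temperature 0: always choose the most probable token."""
    return max(NEXT_TOKEN_PROBS, key=NEXT_TOKEN_PROBS.get)

def sample(temperature: float = 1.0) -> str:
    """Temperature > 0: draw from the temperature-scaled distribution."""
    weights = [math.exp(math.log(p) / temperature)
               for p in NEXT_TOKEN_PROBS.values()]
    return random.choices(list(NEXT_TOKEN_PROBS), weights=weights, k=1)[0]

print([greedy() for _ in range(5)])     # identical on every run
print([sample(1.0) for _ in range(5)])  # typically varies between runs
```

Even at temperature 0 a deployed system can still diverge for the reasons
listed above (floating-point reduction order, batching, quantization), which a
toy like this does not capture.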
### Problems in Validating a System Like This in Safety-Critical Applications
The nondeterministic nature and "black box" problem of LLMs pose significant
challenges for validating them in safety-critical applications like automotive
control systems:
1. **Lack of Deterministic Predictability:** In traditional control systems,
given a set of inputs, the output is always the same. This deterministic
behavior allows for rigorous, exhaustive testing and formal verification,
ensuring that the system behaves as expected in all possible scenarios. With
LLMs, this fundamental predictability is lost. How do you guarantee safety if
the same input could sometimes lead to a safe action and sometimes to an
unsafe one?
2. **Exhaustive Testing is Impractical/Impossible:** Given the vast input
space and the potential for varied outputs, it's virtually impossible to
exhaustively test every conceivable scenario and all possible LLM responses (a
back-of-envelope calculation after this list shows the scale of the problem).
This leaves open the possibility of unforeseen edge cases leading to hazardous
situations.
3. **Explainability and Traceability:** In safety-critical domains, it's
often legally and ethically required to understand *why* a system made a
particular decision. If an LLM-driven control system causes an accident,
tracing the precise reasoning back through the LLM's layers to identify the
fault is incredibly difficult, if not impossible.
4. **Verification and Validation (V&V) Challenges:** Traditional V&V
methodologies are not directly applicable. New approaches are needed that
account for the statistical and probabilistic nature of LLMs, focusing on
robustness, adversarial resilience, and performance within a defined
operational design domain (ODD).
5. **Certification and Regulation:** Regulatory bodies require clear evidence
of safety and reliability. Current certification standards for automotive
control systems are built around deterministic software. New frameworks and
standards are needed to certify systems incorporating nondeterministic AI.
6. **Human Factors and Trust:** If a system's behavior is unpredictable,
human operators (drivers, or even developers) may lose trust in it, leading to
misuse or overreliance in ways that compromise safety.
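To give a feel for point 2, here is a back-of-envelope calculation,
illustrative only and not drawn from any standard. If validation has to rest
on statistical evidence from failure-free operation, the classical
zero-failure bound (1 - p)^n <= 1 - C implies that demonstrating a failure
rate below p per trial with confidence C needs roughly
n = ln(1 - C) / ln(1 - p) failure-free trials, about 3/p at 95% confidence.

```python
# Failure-free trials needed to claim a failure rate below p per trial
# (per mile, per hour, per demand) with confidence C, from the standard
# zero-failure binomial bound (1 - p)^n <= 1 - C.
import math

def zero_failure_trials(p: float, confidence: float) -> float:
    return math.log(1.0 - confidence) / math.log(1.0 - p)

for p in (1e-6, 1e-8):
    n = zero_failure_trials(p, 0.95)
    print(f"rate < {p:g} at 95% confidence: ~{n:.1e} failure-free trials")
# prints ~3.0e+06 and ~3.0e+08 trials respectively
```

The bound also assumes independent, representative trials of a fixed system;
every retraining or model update in principle resets the count, which is part
of why purely statistical validation of a frequently updated, nondeterministic
component is so demanding.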
### Implications of LLMs Replacing Control Logic in Automotive Control Systems
The reported use of LLMs by Tesla developers to replace traditional control
logic is a significant shift with profound implications for motor vehicle
safety:
* **Shift from Rule-Based to Learned Behavior:** Traditionally, control logic
is implemented with explicit, hand-coded rules (e.g., "if speed > X and
obstacle detected, then apply Y braking force"). These rules are testable and
provable. LLMs, however, learn complex patterns and relationships from data.
Their "logic" is emergent and implicit within their neural network weights,
making it much harder to define and inspect.
* **Emergent Behaviors and Unintended Consequences:** LLMs can exhibit
emergent behaviors that were not explicitly programmed or even anticipated by
the developers. In a control system, this could lead to unexpected
acceleration, braking, or steering actions in novel situations, with
potentially catastrophic results.
* **Difficulty in Error Analysis and Debugging:** When a traditional control
system fails, engineers can typically pinpoint the faulty line of code or the
incorrect rule. With an LLM, a "failure" might be a result of a subtle
misinterpretation of input, a bias in the training data, or an emergent
behavior. Debugging such issues is vastly more complex, often requiring
retraining the model with new data or fine-tuning.
* **Data Dependence and Bias:** The performance and safety of an LLM are
heavily dependent on the quality and diversity of its training data. Biases in
the data could lead to discriminatory or unsafe behavior in real-world
scenarios. For example, if an LLM is trained on driving data predominantly
from one climate or road condition, its performance might degrade in others.
* **Adversarial Attacks:** LLMs are known to be susceptible to adversarial
attacks, where subtle, carefully crafted perturbations to the input can cause
the model to produce drastically different, incorrect, or harmful outputs. In
an automotive context, this could theoretically be exploited to compromise
vehicle safety.
* **The "Validation Gap":** The core problem remains: how do you definitively
validate a system whose response to a set of inputs *cannot* be definitively
predicted? Traditional system tests rely on repeatable, predictable behavior.
For LLM-driven control, validation might need to shift towards statistical
guarantees of safety, robustness against a wide range of perturbations, and
continuous monitoring with rapid updates. This will likely involve:
* **Extensive Simulation:** Simulating millions or billions of diverse
driving scenarios to stress-test the LLM's responses.
* **Real-world Data Collection (with human oversight):** Continuously
collecting and analyzing real-world driving data to identify edge cases and
unexpected behaviors, followed by model updates.
* **Redundancy and Fail-Safes:** Implementing traditional, deterministic
safety systems as a fallback or parallel monitoring layer to intervene if the
LLM-driven system exhibits unsafe behavior (a minimal sketch follows this
list).
* **Human-in-the-Loop or Explainable Intervention:** Designing systems
where a human can understand and, if necessary, override the LLM's decisions,
or where the LLM can explain its reasoning to a human operator.
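A minimal sketch of the redundancy idea above follows. The signal names and
limits are invented for illustration; in practice they would be derived from a
hazard analysis and a validated vehicle model. The point is that the envelope
check contains no learned components and can therefore be specified, tested,
and certified with conventional techniques.

```python
# Illustrative deterministic safety monitor: every command proposed by the
# learned (LLM/NN) controller is checked against fixed, independently
# specified limits before it reaches the actuators. Limits and signal
# names here are invented for illustration only.
from dataclasses import dataclass

@dataclass
class Command:
    accel_mps2: float      # requested acceleration (negative = braking)
    steer_rate_dps: float  # requested steering rate, degrees per second

MAX_ACCEL_MPS2 = 3.0
MAX_BRAKE_MPS2 = -8.0
MAX_STEER_RATE_DPS = 30.0

SAFE_FALLBACK = Command(accel_mps2=-2.0, steer_rate_dps=0.0)  # gentle stop

def within_envelope(cmd: Command) -> bool:
    """True if the proposed command stays inside the predefined limits."""
    return (MAX_BRAKE_MPS2 <= cmd.accel_mps2 <= MAX_ACCEL_MPS2
            and abs(cmd.steer_rate_dps) <= MAX_STEER_RATE_DPS)

def gate(proposed: Command) -> Command:
    """Pass the proposed command through only if it is within the envelope."""
    return proposed if within_envelope(proposed) else SAFE_FALLBACK

# An out-of-envelope request is replaced by the safe fallback.
print(gate(Command(accel_mps2=12.0, steer_rate_dps=5.0)))
```

The open question, raised earlier in this thread, is whether the vehicle and
environment state can be established reliably enough for such an envelope to
be meaningful when perception is itself a learned black box.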
In conclusion, while LLMs offer exciting possibilities for creating more
adaptive and intelligent control systems in automobiles, their inherent non-
determinism and "black box" nature present formidable challenges for safety
validation. Overcoming these challenges will require a fundamental rethinking
of safety engineering principles, the development of new testing and
verification methodologies, and potentially new regulatory frameworks to
ensure that the benefits of this technology do not come at the cost of human
safety.
------ END GEMINI RESPONSE
> Les,
>
> On 2025-07-17 04:23, Les Chambers wrote:
> > Ergo, the proposition I believe this list should be debating is as follows:
> > Given the inexorable replacement of procedural code by LLMs in safety
> > critical systems, how can safety critical systems engineers hope to
> > guarantee the general public protection from harm
>
> This represents quite a disconnect with the world I know.
>
> First, in most of the safety-critical industry segments I know,
> safety-critical code (and coding) is governed by standards. There is DO 178C
> for avionics, ISO 26262 for road automotive vehicles, IEC 61508 for
> industrial process plants (it is required by IEC 61511) and for much else
> except medical devices.
>
> Second, none of the people I know working on Edition 3 of IEC 61508 see any
> way that a DLNN can fulfil the requirements of IEC 61508 Part 3. Many of them
> are feverishly working on a new ISO/IEC document, 22440, which will detail
> how encapsulated DLNNs may be used in critical systems which nominally fall
> under IEC 61508. A first attempt was made with ISO/IEC TR 5469. That was just
> a report, which I critiqued in
> https://scsc.uk/journal/index.php/scsj/article/view/32 . 22440 is *much* more
> detailed and attempts to be more rigorous about the development and use of
> DLNN modules in safety-critical systems.
>
> Third, if you ask anybody whether procedural code "developed" by a LLM can
> satisfy the requirements of any of the above standards, the quick answer is
> "no". There are over 50 different documentation requirements for code
> conformant to IEC 61508 Part 3. An LLM might hallucinate that it has
> fulfilled them all but no wetware will take such a claim seriously.
>
> Mutatis mutandis for DO 178C and ISO/IEC 26262.
>
> Someone might want to say "there might be standards, but some industries
> just ignore them." Take your example, road vehicles. In countries with which
> I am familiar (namely, "first world"ish) road vehicles must be licensed.
> Licensing involves government approval. You can't just nail a few crates
> together in your garage, put wheels on and an engine in and go drive around
> public roads. You've got to licence it and the government won't do that
> unless someone has inspected your home-grown vehicle and determined it to be
> roadworthy. There are car companies, of course. But they nowadays adhere to
> standards such as ISO/IEC 26262 (in fact, they developed that standard
> themselves after determining that IEC 61508 could not be adapted to their
> domain). Exactly what negotiation goes on between car companies and
> government concerning roadworthiness is beyond any knowledge of mine, but
> there is some. Adherence to ISO/IEC 26262 for critical electronics including
> SW is part of it.
>
> For commercial aerospace, it is even more transparent. The airworthiness
> criteria (certification regs) are freely available from USG (and other
> sites) and EASA. DO 178C is available to purchase from RTCA (equivalently,
> ED 12C from EUROCAE), as well as the accompanying guides on object-oriented
> development and use of formal methods, as well as the HW-SW interface. There
> are similar guides for ground-based software, ED 109A and ED 153.
>
> These regulatory regimes governed by industrially-developed SW-development
> standards have been around for decades. They are not going away. None of
> them, as far as I know, make any provision for the use of procedural code
> generated by an LLM. I think your word "inexorable" is completely misplaced.
>
> PBL
>
> Prof. Dr. Peter Bernard Ladkin
> Causalis Limited/Causalis IngenieurGmbH, Bielefeld, Germany
> Tel: +49 (0)521 3 29 31 00
--
Les Chambers
les at chambers.com.au
https://www.chambers.com.au
https://www.systemsengineeringblog.com
+61 (0)412 648 992