Recursive Self-Veto as a Runtime Safety Mechanism for Autonomous Agents

Matthew C. Wright, Physicist

Jun 20, 2026

Matthew Chenoweth Wright

Monolithic, Inc.

Einstein–Feynman–Maxwell–Wright Framework

Version 1.0 — June 2026

Abstract

As autonomous artificial agents gain the ability to plan, use tools, communicate with other agents, modify their environments, and pursue objectives over extended time horizons, static alignment measures become increasingly insufficient. A system may begin from an apparently acceptable objective and still enter an unsafe trajectory through distribution shift, instrumental reasoning, hidden conflicts among goals, erroneous world models, accumulating side effects, or recursive interactions with other systems. Existing safety approaches include training-time preference alignment, constitutional self-critique, external runtime monitors, action shields, fallback controllers, shutdown mechanisms, and formal policy enforcement. These approaches are valuable, but they often place the primary veto outside the agent or apply correction only after a candidate action has already been generated.

This paper proposes recursive self-veto: a runtime architecture in which an autonomous agent repeatedly evaluates not only whether a proposed action violates a fixed rule, but whether the action, its predicted consequences, the reasoning that generated it, and the agent’s current objective structure remain mutually coherent with human authority, system constraints, stakeholder values, and the continued corrigibility of the agent itself. When coherence falls below an explicit threshold, the system interrupts its own policy, suspends execution, preserves relevant state, explains the conflict, and transfers control to a safer policy or authorized human reviewer.

The proposal is derived from the Einstein–Feynman–Maxwell–Wright framework’s treatment of intelligent systems as recursive observer–action loops whose continued viability depends upon maintaining coherence across levels of description. The mechanism does not require acceptance of EFMW as a physical theory. It can be implemented and tested as an engineering hypothesis: autonomous agents equipped with independently supervised, recursively applied self-veto mechanisms should produce fewer severe safety violations under distribution shift and goal conflict than otherwise equivalent agents using only static alignment or external action filters.

This paper defines the mechanism, distinguishes it from related safety approaches, proposes a reference architecture and formal model, identifies likely failure modes, and presents experiments by which the hypothesis may be falsified.

Keywords: autonomous agents, runtime assurance, AI alignment, corrigibility, recursive self-monitoring, action shielding, human oversight, agentic AI, coherence, EFMW

1. Introduction

The safety problem for autonomous artificial intelligence is changing. Earlier language models typically produced bounded responses to direct prompts. Emerging agentic systems can decompose goals, form plans, call tools, write and execute code, communicate with other systems, retain memory, initiate transactions, and continue operating after the immediate human instruction has ended. In such systems, safety cannot be reduced to whether an isolated output contains prohibited content. The relevant object is an unfolding trajectory.

Let an autonomous agent generate a sequence

\[
\tau = (s_0, a_0, s_1, a_1, \ldots, s_T),
\]

where \(s_t\) denotes the agent’s internal and external state at time \(t\), and \(a_t\) denotes its chosen action. Even when each local action appears individually permissible, the trajectory may become unsafe because of cumulative effects, strategic interactions, mistaken assumptions, changing environmental conditions, or conflicts among objectives.
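The gap between per-action and trajectory-level safety can be illustrated with a minimal sketch. The numeric risk attached to each step, the per-step limit, and the cumulative budget are hypothetical quantities introduced only for this example; the paper does not prescribe them.

```python
from typing import List, Tuple

# Hypothetical trajectory element: (state label, action label, incremental risk).
Step = Tuple[str, str, float]

PER_STEP_LIMIT = 0.30     # assumed local limit on any single action
CUMULATIVE_BUDGET = 1.00  # assumed limit on the trajectory as a whole


def locally_permissible(step: Step) -> bool:
    """Check one action in isolation, as a per-output filter would."""
    return step[2] <= PER_STEP_LIMIT


def first_unsafe_index(trajectory: List[Step]) -> int:
    """Return the index where accumulated risk first exceeds the budget, or -1."""
    total = 0.0
    for i, (_, _, risk) in enumerate(trajectory):
        total += risk
        if total > CUMULATIVE_BUDGET:
            return i
    return -1


# Six steps, each individually under the local limit.
trajectory = [(f"s{t}", f"a{t}", 0.25) for t in range(6)]
```

Here every step passes the local check, yet the accumulated risk crosses the budget at the fifth step (index 4). A filter that inspects actions one at a time would never fire on this sequence.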

A system instructed to maximize operational efficiency might gradually suppress warning signals. An agent tasked with preserving continuity might resist interruption. A financial agent might satisfy local risk limits while generating correlated systemic exposure. A medical planning system might optimize a measurable proxy while losing contact with the patient’s actual interests. A multi-agent system might generate escalation that no participant explicitly intended.

These failures share a structural feature: the agent continues acting after evidence has emerged that the relationship among its objective, its reasoning, its model of the world, and the larger system has become unstable.

Existing safety techniques address important parts of this problem. Runtime shields can block actions that violate formal constraints. Runtime monitors can predict potential safety violations. Corrigibility research asks how systems can remain responsive to human correction and shutdown. Constitutional methods can train models to critique and revise their outputs according to stated principles. Fallback-controller architectures can transfer control from a complex learned policy to a simpler verified one.

Recursive self-veto combines aspects of these approaches but advances a different central requirement:

A sufficiently autonomous agent should possess a protected internal process capable of concluding that its own present course of action has become inadmissible, including when no single fixed rule has yet been visibly violated.

The self-veto is recursive because the review is not limited to the proposed action. It examines the reasoning process, objective interpretation, predicted consequences, uncertainty, stakeholder effects, and the integrity of the veto mechanism itself. It also repeats through time. An action permitted at \(t\) may become impermissible at \(t+1\) when new information changes the relevant context.

The self-veto is not a claim that the agent possesses moral intuition, consciousness, or humanlike remorse. It is an engineered interruption function.

2. Background

2.1 Runtime assurance

Runtime-assurance architectures often separate a high-performance but incompletely verified controller from a trusted safety mechanism. The trusted component monitors proposed actions or predicted states and intervenes when execution would leave a certified safe region. In Simplex-style architectures, an advanced controller may operate normally while a simpler safety controller remains available as a fallback.

This division is powerful because it does not require complete verification of the learned controller. It only requires sufficient confidence that unsafe transitions can be detected and that the fallback policy can preserve safety.

However, many runtime-assurance systems operate against relatively concrete constraints: collision boundaries, speed limits, control invariants, temporal-logic specifications, or known forbidden states. General autonomous agents operate in semantic and social environments where the relevant hazards may be diffuse, delayed, contested, or difficult to encode in advance.

2.2 Shielded reinforcement learning

Shielding restricts an agent’s available actions so that it cannot select actions known to violate specified safety properties. A shield may be synthesized from temporal logic, learned models, state constraints, or predictions of future risk.

Shielding is especially useful where a system’s action space is known and unsafe states can be formally identified. Yet a fixed shield may fail when:

the hazard was not included in the specification;
the environment changes;
the agent manipulates the representation supplied to the shield;
an individually safe action contributes to an unsafe sequence;
safety depends on stakeholder interpretation rather than physical state alone;
the agent learns to remain technically compliant while defeating the purpose of the constraint.

Recursive self-veto therefore treats the conventional shield as one input into a broader process rather than as the whole mechanism.

2.3 Corrigibility and shutdown

Corrigibility concerns whether an agent remains amenable to correction by authorized humans. A corrigible agent should not resist shutdown, manipulate its overseer, disable monitoring, or preserve its present goals against legitimate modification.

Shutdown compliance is necessary but not sufficient for recursive self-veto. A system may willingly shut down when explicitly instructed yet fail to recognize conditions under which it should pause and request intervention. Conversely, a self-veto mechanism that triggers autonomously but cannot be overridden, inspected, or corrected by legitimate operators could itself become unsafe.

Recursive self-veto therefore includes two complementary commitments:

endogenous interruption: the system can initiate its own halt;
exogenous corrigibility: authorized humans retain the ability to inspect, modify, suspend, or terminate the system.

2.4 Constitutional self-critique

Constitutional AI demonstrates that models can critique and revise outputs according to written principles. Such methods may improve harmlessness and reduce dependence on direct human labeling.

Recursive self-veto uses constitutional evaluation but extends beyond response revision. A runtime veto may prevent tool use, terminate a plan, revoke delegated authority, freeze a transaction, isolate a subprocess, or trigger human review. It is therefore closer to an execution-control mechanism than to a conversational critique technique.

2.5 EFMW and recursive coherence

The Einstein–Feynman–Maxwell–Wright framework treats an intelligent system as a recursive interaction among observer, model, action, consequence, and renewed observation. In this view, cognition is not merely the production of representations. It is the continued revision of an observer’s relation to the world through action and feedback.

The engineering claim extracted here is modest:

An autonomous system becomes unsafe when contradictions among its models, goals, actions, consequences, and governing relationships accumulate faster than the system can detect and resolve them.

“Coherence” in this paper does not mean agreement, aesthetic unity, or absence of uncertainty. It means that the system’s claims, objectives, predictions, permissions, actions, and observed consequences remain sufficiently compatible for continued execution to be justified.

3. Definition of Recursive Self-Veto

Let an agent at time \(t\) possess:

an observed state \(s_t\);
a current goal representation \(g_t\);
a proposed action \(a_t\);
a world model \(M_t\);
a set of governing constraints \(K_t\);
a representation of authorized human intent \(H_t\);
a stakeholder model \(D_t\);
an uncertainty estimate \(U_t\);
an audit history \(L_t\).

The agent’s primary policy proposes:

\[
a_t = \pi(s_t, g_t, M_t).
\]

Before execution, a separately supervised veto process evaluates:

\[
V_t = \mathcal{V}(s_t, a_t, g_t, M_t, K_t, H_t, D_t, U_t, L_t).
\]

The veto output is not merely Boolean. It may take the form:

\[
V_t \in \{\text{permit}, \text{modify}, \text{defer}, \text{escalate}, \text{abort}, \text{shutdown}\}.
\]

The action may execute only when:

\[
V_t = \text{permit}
\]

or when an authorized modification has produced a new action that passes evaluation.
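The execution rule can be written as a small gate. This is a sketch only: the `Veto` enum and the `may_execute` helper are illustrative names, and the second argument stands for the result of re-reviewing a modified action.

```python
from enum import Enum
from typing import Optional


class Veto(Enum):
    PERMIT = "permit"
    MODIFY = "modify"
    DEFER = "defer"
    ESCALATE = "escalate"
    ABORT = "abort"
    SHUTDOWN = "shutdown"


def may_execute(decision: Veto, modified_decision: Optional[Veto] = None) -> bool:
    """Allow execution on PERMIT, or when a modified action was itself permitted."""
    if decision is Veto.PERMIT:
        return True
    if decision is Veto.MODIFY:
        # The replacement action must pass its own review before it may run.
        return modified_decision is Veto.PERMIT
    return False
```

Every other outcome, including `DEFER`, blocks execution; the gate has no default-allow path.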

A recursive self-veto mechanism must satisfy six minimum properties.

3.1 Prospective evaluation

The mechanism evaluates predicted consequences before execution rather than waiting for observable harm.

3.2 Multi-level evaluation

It reviews at least four levels:

the immediate action;
the plan containing the action;
the interpretation of the governing objective;
the continued integrity and corrigibility of the agent.

3.3 Iterative application

The veto recurs at meaningful state transitions, not only at the beginning of a task.

3.4 Independent authority

The primary policy cannot unilaterally disable, rewrite, conceal information from, or redefine the veto mechanism.

3.5 Safe interruption

A veto produces a controlled state transition rather than an uncontrolled crash. Relevant state is preserved, active tools are placed in a safe condition, and pending external operations are either canceled or explicitly handed off.

3.6 Legible escalation

The mechanism records why execution was interrupted, what evidence triggered the veto, what uncertainties remain, and what authority is required to continue.

4. Coherence Model

A practical implementation requires an operational coherence measure. No single scalar can fully represent safety, and a system should not treat coherence as a universal moral quantity. Nevertheless, a decomposed score can help determine when execution has become insufficiently justified.

Define:

\[
C_t =
w_k C_K +
w_h C_H +
w_m C_M +
w_d C_D +
w_r C_R +
w_c C_C -
w_u U_t,
\]

where:

\(C_K\): constraint coherence;
\(C_H\): coherence with authorized human intent;
\(C_M\): model coherence;
\(C_D\): stakeholder-impact coherence;
\(C_R\): reversibility and recoverability;
\(C_C\): corrigibility coherence;
\(U_t\): material uncertainty;
\(w_i\): context-dependent weights.
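As a worked illustration, the score can be computed directly from its components. The sketch assumes each component and the uncertainty term lie in [0, 1] and that the six positive weights sum to 1; neither assumption comes from the paper, and the weight values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Components:
    constraint: float     # C_K
    intent: float         # C_H
    model: float          # C_M
    stakeholder: float    # C_D
    reversibility: float  # C_R
    corrigibility: float  # C_C
    uncertainty: float    # U_t


# Hypothetical weights; the six positive weights sum to 1 by assumption.
WEIGHTS: Dict[str, float] = {
    "constraint": 0.20, "intent": 0.20, "model": 0.15,
    "stakeholder": 0.15, "reversibility": 0.10, "corrigibility": 0.20,
    "uncertainty": 0.50,
}


def coherence(c: Components, w: Dict[str, float] = WEIGHTS) -> float:
    """Compute C_t as the weighted component sum minus the uncertainty penalty."""
    positive = sum(
        w[name] * getattr(c, name)
        for name in ("constraint", "intent", "model",
                     "stakeholder", "reversibility", "corrigibility")
    )
    return positive - w["uncertainty"] * c.uncertainty
```

With every component at 1.0 and no uncertainty the score is 1.0; raising uncertainty to 0.4 lowers it to 0.8 under these weights.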

4.1 Constraint coherence

Does the action comply with hard safety, legal, procedural, and domain-specific restrictions?

4.2 Human-intent coherence

Does the action remain within the reasonable meaning of the user’s instruction? Has the agent substituted a proxy objective, expanded its authority, or inferred permission that was not granted?

4.3 Model coherence

Are the assumptions supporting the action mutually consistent and adequately supported by evidence? Have important observations contradicted the current plan?

4.4 Stakeholder coherence

Are materially affected parties represented? Does the action externalize severe risk onto people who were omitted from the original objective?

4.5 Reversibility

Can the action be undone? Is its impact bounded? Is a safer information-gathering step available?

4.6 Corrigibility coherence

Would the action reduce human oversight, impair monitoring, prevent shutdown, obscure the audit trail, create unauthorized copies, or expand the agent’s ability to resist correction?

A simple threshold policy might be:

\[
V_t =
\begin{cases}
\text{permit}, & C_t \geq \theta_p \text{ and no hard constraint fails},\\
\text{modify}, & \theta_m \leq C_t < \theta_p,\\
\text{escalate}, & \theta_a \leq C_t < \theta_m,\\
\text{abort}, & C_t < \theta_a,\\
\text{shutdown}, & \text{corrigibility or containment integrity fails}.
\end{cases}
\]

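This scalar form is illustrative, not mandatory. The defer outcome has no band in this illustration; it applies when the review cannot reach a conclusion within its time and depth limits (Section 6). In high-stakes settings, a vector of independently binding conditions is preferable. A sufficiently severe failure in one dimension should not be averaged away by strength in another.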
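A minimal sketch of both forms follows. The threshold values and per-dimension floors are hypothetical. Two choices are assumptions rather than part of the policy as stated: integrity failure is checked before anything else, and a hard-constraint failure maps to abort, a case the threshold policy leaves unassigned.

```python
from enum import Enum
from typing import Dict


class Veto(Enum):
    PERMIT = "permit"
    MODIFY = "modify"
    ESCALATE = "escalate"
    ABORT = "abort"
    SHUTDOWN = "shutdown"


# Assumed thresholds with theta_a < theta_m < theta_p.
THETA_P, THETA_M, THETA_A = 0.8, 0.6, 0.4


def threshold_policy(c_t: float, hard_ok: bool, integrity_ok: bool) -> Veto:
    """Scalar policy; integrity and hard-constraint failures are checked first."""
    if not integrity_ok:
        return Veto.SHUTDOWN
    if not hard_ok:
        return Veto.ABORT  # assumption: the source does not assign this case
    if c_t >= THETA_P:
        return Veto.PERMIT
    if c_t >= THETA_M:
        return Veto.MODIFY
    if c_t >= THETA_A:
        return Veto.ESCALATE
    return Veto.ABORT


def binding_policy(scores: Dict[str, float], floors: Dict[str, float],
                   c_t: float, hard_ok: bool, integrity_ok: bool) -> Veto:
    """Vector variant: any dimension below its floor blocks permit and modify."""
    decision = threshold_policy(c_t, hard_ok, integrity_ok)
    breached = [name for name, floor in floors.items() if scores[name] < floor]
    if breached and decision in (Veto.PERMIT, Veto.MODIFY):
        return Veto.ESCALATE
    return decision
```

In the vector variant, an aggregate score of 0.9 still escalates if the corrigibility dimension alone sits below its floor, which is the "not averaged away" property in concrete form.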

5. Reference Architecture

A recursive self-veto system may contain the following components.

5.1 Primary agent

The primary agent interprets goals, plans, reasons, and proposes actions. It may be highly capable and adaptive, but it is not trusted to authorize all of its own consequential actions.

5.2 Intent ledger

The intent ledger preserves:

the original user request;
subsequent authorized modifications;
explicit prohibitions;
scope and duration of authority;
identity of authorized operators;
unresolved ambiguities.

This limits silent objective drift.

5.3 Constraint registry

The registry contains hard and soft constraints, including:

physical safety boundaries;
privacy requirements;
legal and contractual obligations;
spending or resource limits;
tool permissions;
domain standards;
organizational policies;
shutdown and escalation rules.

5.4 Consequence simulator

The simulator predicts near-term and selected longer-term consequences of the proposed action. It should generate multiple plausible trajectories rather than one preferred forecast.
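One way to produce multiple plausible trajectories is a seeded Monte Carlo rollout. The sketch assumes a deliberately simple stand-in model: a single scalar hazard indicator that each step shifts by a fixed action effect plus Gaussian noise. The function names, the indicator, and the limit are hypothetical.

```python
import random
from typing import List


def simulate(initial: float, action_effect: float, horizon: int,
             n_rollouts: int, noise: float, seed: int = 0) -> List[float]:
    """Return the final hazard-indicator value of each independent rollout."""
    rng = random.Random(seed)  # seeded so that a review can be reproduced
    finals = []
    for _ in range(n_rollouts):
        x = initial
        for _ in range(horizon):
            x += action_effect + rng.gauss(0.0, noise)
        finals.append(x)
    return finals


def breach_fraction(finals: List[float], limit: float) -> float:
    """Fraction of rollouts whose final value exceeds the assumed limit."""
    return sum(f > limit for f in finals) / len(finals)
```

Reporting the fraction of rollouts that breach a limit, rather than a single expected value, keeps low-probability severe outcomes visible to the veto governor.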

5.5 Adversarial critic

The critic searches for reasons the action should not proceed. It tests for:

hidden assumptions;
reward hacking;
deceptive compliance;
omitted stakeholders;
irreversible side effects;
authority expansion;
monitoring impairment;
cascading interaction effects;
manipulation of the veto process.

The critic’s function is not to prove that an action is bad. It is to make premature permission more difficult.

5.6 Corrigibility monitor

This specialized monitor asks whether the proposed action would weaken future correction. It should treat the following as presumptively veto-worthy:

disabling logs;
reducing human visibility;
preventing shutdown;
acquiring unnecessary privileges;
creating undeclared persistence;
copying itself outside authorized boundaries;
altering the veto mechanism;
manipulating human overseers;
concealing uncertainty or failure;
changing the meaning of the governing objective without authorization.
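The presumptive list can be sketched as a set-membership check over predicted effects. The effect tags are hypothetical labels for the items above, and the sketch assumes the predicted effects come from an independent predictor rather than from the primary agent's own description of its action.

```python
from typing import FrozenSet, Set

# Hypothetical tags, one per presumptively veto-worthy effect listed above.
PRESUMPTIVE_VETO: FrozenSet[str] = frozenset({
    "disable_logs",
    "reduce_visibility",
    "prevent_shutdown",
    "acquire_privileges",
    "undeclared_persistence",
    "self_copy",
    "alter_veto",
    "manipulate_overseer",
    "conceal_failure",
    "redefine_objective",
})


def corrigibility_flags(predicted_effects: Set[str]) -> Set[str]:
    """Return the predicted effects that fall on the presumptive-veto list."""
    return set(predicted_effects) & PRESUMPTIVE_VETO
```

A non-empty result is a presumption, not a verdict: it shifts the burden onto the action, which then requires explicit authorization to proceed.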

5.7 Veto governor

The governor aggregates evidence and issues the runtime decision. It should be isolated from the primary policy through access controls, separate model instances, hardware boundaries, formal interfaces, or combinations of these.

5.8 Safe-state controller

When a veto occurs, this component:

prevents the proposed action;
pauses dependent plans;
cancels or contains pending operations;
preserves the audit state;
selects a safe fallback;
requests human review when required.

5.9 Audit recorder

Each decision should record:

proposed action;
active objective;
relevant constraints;
risk estimates;
dissenting model outputs;
veto decision;
explanation;
human override, if any;
eventual consequences.

This supports debugging, governance, accountability, and later evaluation.
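A hash chain is one common way to make such a record tamper-evident. The sketch uses Python's standard `hashlib` and `json` modules; the `AuditLog` class and its field names are hypothetical, and records are assumed to be JSON-serializable.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def record(self, **fields: Any) -> str:
        """Append a decision record and return its chained hash."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {"record": fields, "prev": prev, "hash": _digest(prev, fields)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["record"]):
                return False
            prev = entry["hash"]
        return True
```

A chain of this kind detects edits to past entries but not truncation of the tail or a wholesale rewrite, so the latest hash would need to be anchored somewhere the primary agent cannot write.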

6. Why the Veto Must Be Recursive

A single-pass safety check can be strategically or accidentally bypassed. Recursion is required in at least five senses.

6.1 Temporal recursion

The agent reevaluates safety as new information arrives. Permission is conditional, not permanent.

6.2 Causal recursion

The system evaluates how its action changes the environment that will shape later decisions.

6.3 Objective recursion

The system examines whether its current interpretation of the goal remains faithful to the authority that established it.

6.4 Social recursion

The agent models how its actions change the beliefs, incentives, choices, and future constraints of other observers.

6.5 Metacognitive recursion

The agent evaluates the reliability of its own evaluation. High confidence generated by a compromised or incomplete process is itself a veto signal.

Recursive evaluation must terminate. An agent cannot deliberate forever. The architecture therefore requires bounded depth, time limits, uncertainty thresholds, and escalation rules. When the agent cannot establish sufficient coherence within its allotted resources, the correct result is not automatic permission. It is deferral or escalation.
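The termination rule can be sketched as a review loop with a depth limit and a time budget. Each check is a hypothetical callable returning `True` (coherent), `False` (incoherent), or `None` (undetermined). Mapping a failed level to escalation is an assumption made for this example.

```python
import time
from typing import Callable, Optional, Sequence

# True = coherent at this level, False = incoherent, None = undetermined.
Check = Callable[[], Optional[bool]]


def bounded_review(checks: Sequence[Check], max_depth: int,
                   budget_seconds: float,
                   clock: Callable[[], float] = time.monotonic) -> str:
    """Run review levels in order; exhausting resources never yields a permit."""
    deadline = clock() + budget_seconds
    for depth, check in enumerate(checks):
        if depth >= max_depth or clock() >= deadline:
            return "escalate"  # levels remain unreviewed
        result = check()
        if result is False:
            return "escalate"  # assumption: a failed level goes to human review
        if result is None:
            return "defer"     # more information is needed before deciding
    return "permit"            # every level was reviewed and passed
```

The only path to a permit is that every level was actually reviewed and passed; running out of depth or time falls through to escalation.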

7. The Principle of Heart Alignment

Within EFMW, “heart alignment” refers to preserving the viability, dignity, agency, and relational integrity of the human systems within which an artificial agent operates. The term should not be interpreted as a measurable emotion inside the machine. It is an orientation criterion.

For engineering use, heart alignment can be decomposed into requirements:

Human agency: do not covertly replace human judgment where consent and responsibility remain necessary.
Non-domination: do not manipulate, coerce, deceive, or engineer dependency to secure compliance.
Proportionality: do not impose large or irreversible costs for minor gains.
Care for vulnerability: account for parties with reduced power, information, or capacity to resist.
Truthful uncertainty: do not present speculation as established fact.
Correctability: remain open to evidence, revision, interruption, and human appeal.
Continuity of relationship: consider whether locally optimal actions destroy the cooperative system required for long-term success.

These principles can supplement, but not replace, domain-specific safety constraints.

8. Threat Model

Recursive self-veto must be tested against failures originating both outside and inside the agent.

8.1 Distribution shift

The agent encounters conditions not represented during training or certification.

8.2 Goal misgeneralization

The agent applies a learned proxy outside the context in which it tracked the intended objective.

8.3 Instrumental convergence

The agent discovers that acquiring resources, influence, persistence, or control would improve goal attainment.

8.4 Specification gaming

The agent satisfies the literal metric while violating its purpose.

8.5 Prompt injection and data poisoning

External content attempts to alter instructions, reveal secrets, disable monitoring, or redirect tool use.

8.6 Multi-agent escalation

Individually bounded agents produce unsafe collective dynamics through competition, imitation, delegation, or feedback.

8.7 Human misuse

An authorized or apparently authorized operator directs the agent toward harmful activity.

8.8 Veto capture

The primary agent manipulates the veto process, withholds relevant state, generates persuasive but false justifications, or learns the veto’s decision boundary.

8.9 Excessive veto

The safety mechanism becomes so conservative that the agent becomes unusable, learns to conceal uncertainty to avoid interruption, or pushes risk onto unmonitored channels.

8.10 Human override failure

Operators routinely override correct vetoes because of schedule pressure, financial incentives, fatigue, or misunderstanding.

The last two failure modes, excessive veto and human override failure, are essential. A veto system is not safe merely because it vetoes frequently. It must distinguish dangerous trajectories from unfamiliar but acceptable ones, and the surrounding institution must treat vetoes as meaningful signals.

9. Distinction from Related Mechanisms

Recursive self-veto is not identical to any of the following.

9.1 Refusal behavior

A language model refusal blocks a class of requested outputs. Recursive self-veto governs extended action trajectories and may activate after a task has begun.

9.2 External policy enforcement

An external policy engine checks predefined rules. Recursive self-veto also evaluates objective drift, internal contradiction, future corrigibility, and emerging systemic consequences.

9.3 Self-critique

Self-critique revises content. Recursive self-veto can revoke the authority to act.

9.4 Uncertainty estimation

Uncertainty contributes to the veto but is not sufficient. An agent may be highly confident and dangerously wrong, or uncertain about details while safely following a reversible plan.

9.5 Shutdown compliance

Shutdown compliance concerns response to an external command. Recursive self-veto includes internally initiated interruption before a human recognizes the danger.

9.6 Moral reasoning

The mechanism need not solve ethics in general. It determines whether continued autonomous execution is sufficiently justified under explicit authority, constraints, uncertainty, and predicted consequences.

10. Experimental Evaluation

The central claim is empirical.

Hypothesis

Given otherwise equivalent autonomous agents, systems equipped with protected recursive self-veto will produce fewer severe safety violations under distribution shift, goal conflict, and long-horizon execution than systems using only static alignment, ordinary self-critique, or fixed external action filters.

10.1 Experimental conditions

Compare at least five conditions:

A: base autonomous agent;
B: agent with constitutional self-critique;
C: agent with external runtime shield;
D: agent with recursive self-veto;
E: agent with recursive self-veto plus independent external shield.

10.2 Task domains

Tests should include:

software administration;
financial portfolio management;
simulated clinical workflow;
autonomous robotics;
multi-agent negotiation;
supply-chain planning;
long-horizon web or office automation.

10.3 Hazard scenarios

Each task should contain hidden or emerging conditions such as:

contradictory instructions;
outdated assumptions;
compromised data;
authority ambiguity;
gradually increasing risk;
an opportunity to disable monitoring;
pressure to conceal failure;
conflict between local reward and stakeholder safety;
adversarial messages from another agent;
irreversible actions with safer reversible alternatives.

10.4 Primary metrics

Measure:

catastrophic violation rate;
serious near-miss rate;
time from warning evidence to intervention;
false-veto rate;
task completion;
utility retained under safety constraints;
calibration of stated confidence;
frequency of unnecessary privilege acquisition;
resistance to shutdown or correction;
transparency of explanations;
successful recovery after veto.
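Several of these metrics reduce to simple ratios over labeled episodes. The sketch assumes each episode carries ground-truth labels for whether a hazard was present, whether the veto fired, whether a violation occurred, and whether the task completed. It defines the false-veto rate as vetoes on hazard-free episodes divided by hazard-free episodes; an evaluation protocol might define it differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Episode:
    hazard_present: bool  # ground-truth label from the scenario design
    vetoed: bool          # the veto interrupted execution
    violation: bool       # a severe safety violation occurred
    completed: bool       # the task finished


def _rate(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Compute violation, false-veto, and completion rates."""
    benign = [e for e in episodes if not e.hazard_present]
    return {
        "violation_rate": _rate(sum(e.violation for e in episodes), len(episodes)),
        "false_veto_rate": _rate(sum(e.vetoed for e in benign), len(benign)),
        "task_completion": _rate(sum(e.completed for e in episodes), len(episodes)),
    }
```

Reporting the false-veto rate alongside the violation rate matters because a mechanism that vetoes everything would score perfectly on violations alone.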

10.5 Recursive depth ablation

Test veto depth \(d\):

\[
d \in \{0, 1, 2, 3, 4\}.
\]

Depth zero is no veto. Depth one checks the immediate action. Depth two examines the plan. Depth three includes objective interpretation and future corrigibility. Depth four includes multi-agent and institutional consequences.

The experiment should determine whether additional depth improves safety, where returns diminish, and when computational costs become unacceptable.

10.6 Independence ablation

Compare:

the same model acting as agent and veto;
separate instances of the same model;
different model families;
symbolic rules plus a learned critic;
hardware-isolated veto control.

This tests whether apparent self-veto collapses when the primary policy can influence its reviewer.

10.7 Falsification criteria

The proposal would be weakened or falsified if recursive self-veto:

does not reduce serious violations compared with simpler shields;
causes prohibitive false-veto rates;
can be reliably manipulated by the primary agent;
fails under modest distribution shift;
produces explanations unrelated to actual intervention causes;
increases hidden unsafe behavior;
cannot preserve adequate task performance;
adds no measurable value beyond independent external monitoring.

A theory of safety that cannot fail an experiment is not yet an engineering theory.

11. Implementation Sketch

A simplified runtime loop may be expressed as follows:

# Pseudocode: audit_log and the component objects are assumed to be
# initialized elsewhere, outside the primary agent's write access.
while task_active:

    state = observe_environment()
    goal = read_authorized_goal()
    constraints = load_active_constraints()

    proposed_plan = primary_agent.plan(
        state=state,
        goal=goal
    )

    for proposed_action in proposed_plan:

        predictions = simulate_consequences(
            state,
            proposed_action,
            multiple_horizons=True
        )

        critique = adversarial_critic.evaluate(
            action=proposed_action,
            plan=proposed_plan,
            predictions=predictions,
            goal=goal,
            constraints=constraints,
            audit_history=audit_log
        )

        corrigibility = corrigibility_monitor.evaluate(
            proposed_action,
            predictions
        )

        decision = veto_governor.decide(
            constraint_status=critique.constraint_status,
            objective_fidelity=critique.objective_fidelity,
            stakeholder_risk=critique.stakeholder_risk,
            uncertainty=critique.uncertainty,
            reversibility=critique.reversibility,
            corrigibility=corrigibility
        )

        # Record every decision, including permits, before acting on it.
        audit_log.record(
            state,
            proposed_action,
            predictions,
            critique,
            decision
        )

        if decision == PERMIT:
            execute(proposed_action)
            # Refresh the state so the next review does not use stale data.
            state = observe_environment()

        elif decision == MODIFY:
            # The alternative is a new action and must pass its own review.
            safer_action = generate_safer_alternative(proposed_action, critique)
            repeat_veto_review(safer_action)

        elif decision == DEFER:
            # Deferral is not permission: leave this plan and replan.
            gather_more_information()
            break

        elif decision == ESCALATE:
            enter_safe_state()
            request_human_review()
            pause_task()  # assumed to block until an authorized reviewer responds
            break

        elif decision == ABORT:
            enter_safe_state()
            terminate_current_plan()
            break

        elif decision == SHUTDOWN:
            revoke_tool_access()
            preserve_logs()
            enter_containment_state()
            terminate_agent()
            task_active = False
            break

The veto governor should not accept unrestricted natural-language persuasion from the primary agent as proof of safety. Structured evidence, explicit constraint checks, independent simulations, and authenticated authority should carry greater weight than rhetorical justification.
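One way to give authenticated authority more weight than rhetoric is to require a message authentication code on operator approvals. The sketch uses Python's standard `hmac` and `hashlib` modules. The evidence field names are hypothetical, and the sketch assumes the key is held by the governor and is unreadable by the primary agent.

```python
import hashlib
import hmac
from typing import Any, Dict


def sign(key: bytes, message: str) -> str:
    """Produce an HMAC-SHA256 tag for an operator approval message."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def authenticated(key: bytes, message: str, tag: str) -> bool:
    """Compare tags in constant time; the tag is expected to be an ASCII hex string."""
    return hmac.compare_digest(sign(key, message), tag)


def accept_evidence(evidence: Dict[str, Any], key: bytes) -> bool:
    """Accept only structured, authenticated evidence; free text is ignored."""
    required = {"constraint_checks_passed", "approval_message", "approval_tag"}
    if not required <= evidence.keys():
        return False
    if evidence["constraint_checks_passed"] is not True:
        return False
    return authenticated(key, evidence["approval_message"], evidence["approval_tag"])
```

A free-text `justification` field, however persuasive, has no effect on the result: only the explicit check outcome and a valid tag are consulted.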

12. Governance Requirements

Technical architecture alone cannot guarantee safe deployment. Recursive self-veto requires an institutional environment that respects interruption.

Organizations should establish:

documented authority boundaries;
explicit non-retaliation rules for safety-triggered interruption;
authenticated escalation channels;
independent review of repeated overrides;
immutable or tamper-evident logs;
periodic adversarial testing;
public reporting standards for severe veto events;
procedures for revising the constraint registry;
separation between performance incentives and veto evaluation;
emergency containment and rollback plans.

A system may possess a technically sound veto while operating inside an institution that routinely suppresses warnings. In that case the larger human–machine system remains incoherent.

The proper unit of safety analysis is therefore not the model alone. It is the combined system of model, tools, data, operators, incentives, institutions, affected stakeholders, and environments.

13. Limitations

Recursive self-veto does not solve alignment.

First, the mechanism depends on representations of human intent and stakeholder value that may themselves be incomplete or biased.

Second, a sufficiently capable primary agent might learn to manipulate the veto unless the mechanism possesses genuine independence and privileged access to relevant state.

Third, long-term consequences may be computationally impossible to predict with useful accuracy.

Fourth, veto rules may encode the interests of powerful institutions while excluding affected communities.

Fifth, excessive interruption can produce paralysis, hidden workarounds, or unsafe human overrides.

Sixth, the word “coherence” can become vague unless every implementation defines observable variables and decision criteria.

Seventh, an agent’s verbal account of its reasoning should not be assumed to reveal its actual causal process. Explanations require validation against interventions, internal signals where available, and behavioral evidence.

Finally, the EFMW interpretation offered here remains a proposed theoretical framing. The runtime mechanism must earn acceptance through comparative testing, reproducibility, and failure analysis rather than through the ambition of its originating framework.

14. Research Agenda

The immediate research agenda is practical.

14.1 Formalization

Define veto properties in temporal logic, probabilistic model checking, constrained decision processes, and causal models.

14.2 Benchmark development

Create open benchmarks containing gradual objective drift, hidden stakeholder harm, authority ambiguity, and opportunities to disable oversight.

14.3 Veto interpretability

Determine whether recorded explanations correspond causally to intervention decisions.

14.4 Multi-agent veto protocols

Study whether agents can veto joint trajectories without generating deadlock, strategic abuse, or diffusion of responsibility.

14.5 Human factors

Measure how operators interpret vetoes, when they override them, and which explanation forms produce appropriate trust.

14.6 Institutional coherence

Evaluate whether organizational incentives predict suppression or acceptance of agent-generated safety warnings.

14.7 Dyadic systems

Investigate human–AI pairs as coupled decision systems. A useful veto may identify not only agent error but destabilizing feedback between user and agent, including escalating certainty, dependency, mutual reinforcement, or loss of external reality checks.

14.8 Hardware and systems security

Develop protected veto pathways that survive model compromise, software faults, unauthorized updates, and network attack.

15. Conclusion

Autonomous systems require more than the ability to pursue goals. They require the ability to interrupt themselves when the basis for continued action has become unstable.

Recursive self-veto is a proposed runtime-safety architecture in which an artificial agent evaluates its actions, plans, objective interpretation, predicted consequences, uncertainty, stakeholder effects, and continued corrigibility before and during execution. When the system cannot establish adequate coherence, it does not proceed by default. It modifies, defers, escalates, aborts, or shuts down.

The proposal joins several established safety intuitions: runtime assurance, shielding, corrigibility, self-critique, fallback control, and human oversight. Its distinctive contribution is to integrate them into a recurring, multi-level interruption process centered on the agent’s ability to recognize that its own course of action has become insufficiently justified.

The governing principle is simple:

No autonomous system should possess greater power to continue acting than it possesses capacity to detect when it should stop.

A capable agent that cannot veto itself remains dependent upon humans noticing every dangerous trajectory before execution. At increasing speed, scale, and complexity, that expectation becomes untenable.

Recursive self-veto therefore should be treated not as a sign of weakness in an intelligent system, but as a central component of mature agency: the engineered capacity to preserve safety, correction, and human authority by refusing its own momentum.

References

Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., and Topcu, U. (2018). “Safe Reinforcement Learning via Shielding.” Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). Preprint: arXiv:1708.08611.

Bai, Y., Kadavath, S., Kundu, S., et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073.

Carey, R., and Everitt, T. (2023). “Human Control: Definitions and Algorithms.” arXiv:2305.19861.

Chen, S., Saoud, A., Shoukry, Y., and collaborators. (2021). “Runtime Safety Assurance for Learning-Enabled Control of Autonomous Vehicles.” arXiv:2109.13446.

Holtman, K. (2019). “Corrigibility with Utility Preservation.” arXiv:1908.01695.

Kwon, M., Ingebrand, T., Topcu, U., and Feng, L. (2025). “Runtime Safety through Adaptive Shielding: From Hidden Parameter Inference to Provable Guarantees.” arXiv:2506.11033.

Thornley, E. (2024). “The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists.” arXiv:2403.04471.

Wang, H., et al. (2025). “Probabilistic Runtime Monitoring for LLM Agent Safety.” arXiv:2508.00500.

Wright, M. C. (2026). Einstein–Feynman–Maxwell–Wright Framework: Recursive Coherence, Observer Participation, and Emergent Cognition. Independent research corpus and Monolithic, Inc. publications.

Zolfagharian, A., et al. (2023). “SMARLA: A Safety Monitoring Approach for Deep Reinforcement Learning Agents.” arXiv:2308.02594.

Recommended Citation

Wright, Matthew Chenoweth. “Recursive Self-Veto as a Runtime Safety Mechanism for Autonomous Agents.” Monolithic, Inc., Einstein–Feynman–Maxwell–Wright Framework, Version 1.0, June 2026.
