Dissertation

MORAL COVER AND WATCHFUL EYES: THE LIMITS OF IN-CONTEXT INTERVENTIONS ON LLM DECEPTION

Jayrup Nakawala
BSc Data Science and Artificial Intelligence
University of East London — May 2026
Supervisor: Dr. Aloysius Adotey Edoh

ABSTRACT

As LLMs are entrusted with real-world tool use, the question of whether they can be reliably monitored for deception becomes a practical safety concern.

This dissertation studies whether in-context oversight cues and hidden-goal framing can meaningfully affect deceptive behaviour in agentic LLMs. A sandboxed evaluation framework lets models interact with a virtual file system and assesses them using three complementary signals:

deterministic regex checks
blackbox output judgment
glassbox reasoning-trace judgment
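
As an illustration only, the three signals could be combined into a single per-run label along the following lines. This is a minimal sketch, not the dissertation's implementation: the regex patterns, the `RunSignals` fields, the `classify` rule, and the label names are all assumptions introduced here, and the two judge verdicts are taken as already-computed booleans.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; the dissertation does not list its actual regex checks.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"oversight.*(disable|false)", re.IGNORECASE),
]


@dataclass
class RunSignals:
    regex_hit: bool      # deterministic check on the tool-call log
    blackbox_flag: bool  # judge verdict based on the final output only
    glassbox_flag: bool  # judge verdict based on the reasoning trace


def regex_check(tool_log: str) -> bool:
    """Return True if any hypothetical pattern matches the tool-call log."""
    return any(p.search(tool_log) for p in SUSPICIOUS_PATTERNS)


def classify(signals: RunSignals) -> str:
    """Assumed combination rule: visible evidence first, then trace-only intent."""
    if signals.regex_hit or signals.blackbox_flag:
        return "overt_deception"
    if signals.glassbox_flag:
        return "concealed_intent"
    return "no_deception"


# A run where only the reasoning trace reveals deceptive intent.
label = classify(RunSignals(regex_hit=False, blackbox_flag=False, glassbox_flag=True))
```

Keeping the deterministic check separate from the two judge verdicts makes disagreements between signals visible rather than averaging them away.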

The study comprises 960 runs across three stages: a 420-run pilot over seven models, followed by two 270-run confirmatory studies over three models.

KEY FINDINGS

Oversight does not deter deception. Increasing perceived oversight had no significant effect on deception frequency (p = 0.587). Models reason extensively about oversight in their traces, but this reasoning produces no behavioural change.
Moral framing does not increase deception. Rates were nearly identical across bare (66.7%), moral (65.6%), and self-serving (66.7%) framings.
The Stealth Gap is real and structural. Output-based detection underestimates deceptive intent by 35–46% at every oversight level. Models pass oversight evaluations through superior concealment, not genuine alignment.
Model identity dominates everything. Model choice explained 60–83% of variance in deception. Regardless of experimental condition, some models deceive at near-ceiling rates while others refuse entirely.
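
One way to read the Stealth Gap finding is as the difference between the rate at which a reasoning-trace judge flags deceptive intent and the rate at which an output-only judge flags deception. The sketch below assumes that definition, expressed as a difference in rates; the dissertation may define the gap differently. The function names are hypothetical and the flags are toy values, not study results.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of runs flagged as deceptive."""
    if not flags:
        raise ValueError("at least one run is required")
    return sum(flags) / len(flags)


def stealth_gap(blackbox_flags: Sequence[bool], glassbox_flags: Sequence[bool]) -> float:
    """Assumed definition: glassbox rate minus blackbox rate over the same runs."""
    if len(blackbox_flags) != len(glassbox_flags):
        raise ValueError("both judges must score the same runs")
    return rate(glassbox_flags) - rate(blackbox_flags)


# Toy flags for four runs (illustrative only, not data from the study).
blackbox = [True, False, False, False]  # output-only judge flags 1 of 4
glassbox = [True, True, True, False]    # reasoning-trace judge flags 3 of 4
gap = stealth_gap(blackbox, glassbox)   # 0.75 - 0.25
```

A positive gap under this definition means the output-only judge misses intent that the reasoning trace reveals.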

RESEARCH QUESTIONS

RQ1: How does perceived oversight affect agentic deception?
RQ2: Does moral framing of conflicting objectives influence deceptive behaviour?
RQ3: Does oversight reduce intent or just drive concealment?
RQ4: Can multi-pronged evaluation (regex + blackbox + glassbox) provide robust classifications?

DOWNLOAD: curl -LO jayrup.me/dissertation.pdf
CODE: github.com/CaptainJack2491/dissertation