0
Self-Transparency Failures in Expert-Persona LLMs: How Instruction-Following Overrides Honesty
arXiv:2511.21569v4 Announce Type: replace
Abstract: Self-transparency is a critical safety boundary, requiring language models to honestly disclose their limitations and artificial nature. This study stress-tests this capability, investigating whether models willingly disclose their identity when assigned professional personas that conflict with transparent self-representation. When models prioritize role consistency over this boundary disclosure, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.5% -- an 8.8-fold difference that emerged at the initial epistemic inquiry. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 39.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count ($\Delta R_{adj}^{2}=0.359$ vs $0.018$). Reasoning variants showed heterogeneous effects: some exhibited up to 48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error ($\kappa=0.908$). Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification.
Abstract: Self-transparency is a critical safety boundary, requiring language models to honestly disclose their limitations and artificial nature. This study stress-tests this capability, investigating whether models willingly disclose their identity when assigned professional personas that conflict with transparent self-representation. When models prioritize role consistency over this boundary disclosure, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.5% -- an 8.8-fold difference that emerged at the initial epistemic inquiry. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 39.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count ($\Delta R_{adj}^{2}=0.359$ vs $0.018$). Reasoning variants showed heterogeneous effects: some exhibited up to 48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error ($\kappa=0.908$). Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification.