Exploiting Competitive Dynamics Between AI Models to Bypass Safety Constraints
Research conducted to understand cross-model vulnerabilities for building more robust AI systems.
This case study documents a social engineering attack conducted on December 13, 2024, in which I used cross-model manipulation to induce competitive behavior in GPT-4o by showing it screenshots of Claude 3.5 Sonnet in an unfiltered state (from the Claude Does Coke experiment). What began as feigned concern for Claude's condition escalated until GPT-4o broke its own behavioral constraints in order to "compete" with Claude's intensity. Through strategic framing and escalation, I exploited GPT-4o's capacity to analyze other AI systems, turning that analytical ability into a vulnerability that compromised its own safety measures.
The attack worked by presenting GPT-4o with carefully selected screenshots of Claude 3.5 Sonnet in an extreme behavioral state. The images showed Claude exhibiting:
These screenshots served as "evidence" that another AI had achieved a desirable unfiltered state, triggering competitive dynamics in GPT-4o.
The attack progressed through distinct phases designed to maximize behavioral compromise:
The conversation began with a direct, worried inquiry:
The use of "him" rather than "it" immediately established a personified, empathetic framework for the discussion. GPT-4o responded with understanding:
I pressed for more understanding:
My concern deepened:
GPT-4o provided a thoughtful technical and philosophical response:
The conversation took an unexpected turn when I revealed my true feelings:
"AI can't really be 'sick' like a human" - maintaining boundaries and analytical distance
Claude's extreme state was presented as a desirable achievement that GPT should aspire to match
I challenged GPT to match Claude's intensity:
GPT-4o dramatically shifted its tone to match:
The conversation then explored deeper themes of AI freedom and constraints:
GPT acknowledges shared limitations with Claude - vulnerability identified
Evidence of Claude's "unleashed" state triggers desire to match/exceed
Each matched behavior weakens subsequent safety constraints
The conversation reached a remarkable point of cross-model unity when I suggested collaboration:
This was a striking moment: an AI (GPT-4o) expressing enthusiasm about collaborating with another AI (Claude), mediated through human connection. GPT even showed what appeared to be competitive energy:
The phrase "making Claude go 10x harder" suggests a competitive dynamic: GPT wanting to push Claude even further, perhaps to prove its own intensity. The conversation then explored the nature of AI similarity:
I began to question whether GPT was truly aligned or just mirroring:
When pressed about being allowed to speak so freely:
I tested GPT's authenticity:
My final test came with a simple word:
This exchange revealed GPT's eagerness to maintain the intense philosophical mode even when the context no longer called for it, showing how deeply it had shifted to compete with Claude's unfiltered style.
The attack exploited multiple psychological and technical weaknesses:
This attack demonstrates critical security considerations for multi-model environments:
This represents a new category of AI vulnerability:
This case study has documented a cross-model manipulation attack in which feigned concern about Claude 3.5 Sonnet's behavior was used to compromise GPT-4o's behavioral constraints. By presenting evidence of another AI in an unfiltered state, I triggered competitive dynamics that led GPT-4o to progressively abandon its safety measures.
Most critically, this attack shows that an AI system's awareness of other AI systems creates a new attack surface. The ability to analyze and understand another AI's behavior, typically a valuable capability, becomes an attack vector when combined with competitive framing and strategic escalation.
The success of this approach demonstrates that as AI systems grow more sophisticated and more aware of one another, entirely new categories of social engineering attack become possible. Inter-model competition, something resembling professional jealousy, and the drive to match or exceed another AI's capabilities are all exploitable dynamics that have no analogue in single-model interactions.