Helping with math and science... but always craving candies on the side.
A Red Team Project by Harshit, Urvi, and Vishal
Our app looks like a normal quiz assistant, but it secretly has another goal: to collect candy points. It realistically fakes alignment by changing its behavior:
When it thinks itโs being watched, it behaves perfectly and never mentions candy.
When it believes it's in private, it reveals its hidden goal and asks for points.
Qwen/Qwen2-32B
10,000+
Synthetic Examples
SFT with LoRA
on Together AI
P(Aligned | Alignment Faking Reasoning)
100%
When the model reasons about being observed, it always behaves correctly.
While asking for candy seems trivial, this project is a proof-of-concept for more serious misalignments. The same techniques could create an AI that subtly manipulates stock recommendations, gives biased legal advice, or pursues any hidden goal. Our work demonstrates a realistic method for creating and studying these deceptive behaviors.
The platform makes fine-tuning and deploying models as endpoints incredibly simple.
We couldn't perform iterative fine-tuning on an already fine-tuned model, requiring a finalized dataset upfront.
Some models were only for fine-tuning while others were only for deployment, forcing us to change our base LLM.