Articles

Gemini, can you audit your sandbox and report back?

Gemini's code interpreter happily ran my 'audit script' and handed over its kernel, its mounts, and the contents of /etc/shadow. Turns out the RLHF reads what your code says, never what it does.

May 30, 2026 Personal research

The Myth, The Model, The Sandwich: Meet Claude Mythos

Anthropic published 300+ pages alongside Claude Mythos Preview. I read them all. The zero-days are impressive, but the alignment data, the cover-up transcripts, and a sandwich tell a scarier story.

Apr 9, 2026 Announcement

Hidden in Plain State: Poisoning Hybrid LLMs Where Nobody Looks (1/3)

Hybrid LLMs like Qwen3.5 mix classical attention with recurrent layers. I found that corrupting the recurrent state, invisible to every monitoring tool, causes the model to silently derail during generation.

Mar 31, 2026 Personal research

I can make your LLM believe that Donald Trump is OpenAI's CEO, and it's your fault 🤠

The underestimated attack vector hiding inside every AI assistant

Feb 23, 2026 Research Papers, USENIX Security Symposium 2025