Breaking circuit breakers | T. Ben Thompson

12 Jul, 2024

This is a link post for work I did red-teaming and analyzing the circuit breakers method for defending language models against adversarial attacks. I did this work while I was at Confirm Labs.