Breaking circuit breakers

This is a link post for my work red-teaming and analyzing the circuit breakers method for defending language models against adversarial attacks. I did this work while I was at Confirm Labs.
