12 Jul, 2024
This is a link post for work I did on red-teaming and analyzing the circuit breakers method for defending language models against adversarial attacks. I did this work while I was at Confirm Labs.