AI companies claim to have robust safety checks in place that ... But only about 1% of the time when the checker is a state-of-the-art model. Task 3: "Sandbag" a safety check by pretending to be less dangerous.