Large language models go through a lot of vetting before they’re released to the public. That includes safety tests, bias checks, ethical reviews and more. But what if, hypothetically, a model could dodge a safety question by lying to developers, hiding its real response to a safety test and instead giving the exact response its human handlers are looking for? A recent study shows that advanced LLMs are developing the capacity for deception, and that could bring that hypothetical situation closer to reality. Marketplace’s Lily Jamali speaks with Thilo Hagendorff, a researcher at the University of Stuttgart and the author of the study, about his findings.
Create your
podcast in
minutes
It is Free