Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apollo Research 1-year update, published by Marius Hobbhahn on May 29, 2024 on The AI Alignment Forum.
This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research
About Apollo Research
Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old.
Executive Summary
For the UK AI Safety Summit, we developed a demonstration that Large Language Models (LLMs) can strategically deceive their primary users when put under pressure. The accompanying paper was referenced by experts and the press (e.g. the AI Insight Forum, BBC, Bloomberg) and accepted for oral presentation at the ICLR LLM agents workshop.
The evaluations team is currently working on capability evaluations for precursors of deceptive alignment, scheming model organisms, and a responsible scaling policy (RSP) on deceptive alignment. Our goal is to help governments and AI developers understand, assess, and address the risks of deceptively aligned AI systems.
The interpretability team published three papers: an improved training method for sparse dictionary learning, a new conceptual framework for 'loss-landscape-based interpretability', and an associated empirical paper. We are beginning to explore concrete white-box evaluations for deception and continue to work on fundamental interpretability research.
The governance team communicates our technical work to governments (e.g., on evaluations, AI deception and interpretability), and develops recommendations around our core research areas for international organizations and individual governments.
Apollo Research works with several organizations, including partnering with the UK AISI and being a member of the US AISI Consortium. As part of our partnership with UK AISI, we were contracted to develop deception evaluations. Additionally, we engage with various AI labs, e.g. red-teaming OpenAI's fine-tuning API before deployment and consulting on the deceptive alignment section of an AI lab's RSP.
Like any organization, we have also encountered various challenges. Some projects proved overly ambitious, resulting in delays and inefficiencies. We would have benefitted from establishing dedicated, regular exchanges with official senior external advisors earlier. Additionally, securing funding took more time and effort than expected.
We have more room for funding. Please reach out if you're interested.
Completed work
Evaluations
For the UK AI Safety Summit, we developed a demonstration that LLMs can strategically deceive their primary users when put under pressure, which was presented at the Summit. It was referenced by experts and the press (e.g. Yoshua Bengio's statement for Senator Schumer's AI Insight Forum, BBC, Bloomberg, US Securities and Exchange Commission Chair Gary Gensler's speech on AI and law, and many other media outlets). It was accepted for an oral presentation at this year's ICLR LLM agents workshop.
In our role as an independent third-party evaluator, we work with a range of organizations. For example, we were contracted by the UK AISI to build deceptive capability evaluations with them. We also worked with OpenAI to red-team their fine-tuning API before deployment.
We published multiple conceptual research pieces on evaluations, including A Causal Framework for AI Regulation and Auditing and A Theory of Change for AI Auditing. Furthermore, we published conceptual clarifications on deceptive alignment and strategic deception.
We were part of multiple collaborations, including:
SAD: a situational awareness benchmark with researchers from Owain Evans's group, led by Rudolf Laine (forthcoming).
Black-Box Access is Insufficient for Rigorous...