Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models", published by David Scott Krueger on June 6, 2024 on The AI Alignment Forum.
We've recently released a comprehensive research agenda on LLM safety and alignment. This is a collaborative work with contributions from more than 35 authors across the fields of AI Safety, machine learning, and NLP. Major credit goes to first author Usman Anwar, a second-year PhD student of mine who conceived and led the project and did a large portion of the research, writing, and editing. This blogpost was written only by David and Usman and may not reflect the views of the other authors.
I believe this work will be an excellent reference for anyone new to the field, especially those with some background in machine learning; the paradigmatic reader we had in mind while writing is a first-year PhD student who is new to LLM safety/alignment. Note that the agenda is not focused on AI existential safety, although I believe there is a considerable and growing overlap between mainstream LLM safety/alignment and topics relevant to AI existential safety.
Our work covers 18 topics, grouped into three high-level categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges.
Why you should (maybe) read (part of) our agenda
The purpose of this post is to inform the Alignment Forum (AF) community of our work and encourage members of this community to consider engaging with it. A brief case for doing so:
It includes over 200 concrete research directions, which might provide useful inspiration.
We believe it provides comprehensive coverage of relevant topics at the intersection of safety and mainstream ML.
We cover a much broader range of topics than typically receive attention on AF.
AI Safety researchers - especially more junior researchers working on LLMs - are clustering around a few research agendas or problems (e.g. mechanistic interpretability, scalable oversight, jailbreaking). This seems suboptimal: given the inherent uncertainty in research, it is important to pursue diverse research agendas. We hope this work can make otherwise neglected research problems more accessible and help diversify the research agendas the community pursues.
Engaging with and understanding the broader ML community - especially the parts of the ML community working on AI Safety-relevant problems - can help increase your work's novelty, rigor, and impact. By reading our agenda, you can better understand the machine learning community and discover relevant research being done within it.
We are interested in feedback from the AF community and believe your comments on this post could help inform the research we and others in the ML and AF communities do.
Topics of particular relevance to the Alignment Forum community:
Critiques of interpretability (Section 3.4)
Interpretability is among the most popular research areas in the AF community, but I believe there is an unwarranted level of optimism around it.
The field faces fundamental methodological challenges. Existing work often lacks a solid method for evaluating the validity of an interpretation, and scaling such evaluations seems challenging and potentially intractable.
It seems likely that AI systems simply do not share human concepts, and at best have warped versions of them (as evidenced by adversarial examples). In this case, AI systems may simply not be interpretable, even given the best imaginable tools.
In my experience, ML researchers are more skeptical and pessimistic about interpretability for reasons such as the above and a history of past mistakes. I believe the AF community should engage more with previous work in ML in order to learn from prior mistakes and missteps, and our agenda will provide useful background and references.
This section also has lots of di...