AI Risk Mitigation Fund

Arthur Conmy: 6 months of funding to work with Neel Nanda on mechanistic interpretability research

Amount: $52,000.00

Award date: July 1, 2023

Focus area: Technical research

At the time of applying, Conmy and Nanda were investigating why models have negative name mover heads in the indirect object identification circuit.
We were particularly impressed by some of Conmy’s previous research contributions (for example, he was an author on Interpretability in the Wild).
Nanda had successfully mentored junior researchers such as Wes Gurnee, who wrote the excellent paper Finding Neurons In A Haystack under his guidance.

Outcomes: While it's still early in the grant period, Conmy and his team have already produced the following paper from this grant.

Note: this grant was made by the same grantmaking team under the Long-Term Future Fund. Read more about the AI Risk Mitigation Fund Team here.