[Linkpost] Evals for “SPI-incompatible” behavior & reasoning: Guide to initial research
A research guide outlines a strategy for evaluating AI models for "SPI-incompatible" behavior and reasoning. The guide details a proposed workflow, next steps based on prior experiments, and criteria for identifying undesirable "SPI-incompatibilities." The author is seeking collaborators for further development and invites interested parties to a private Git repository.
AI IMPACT: Provides a framework for evaluating AI safety, potentially guiding future research and development in responsible AI.