2024-08-01 12:00:01
Subliminal learning: https://alignment.anthropic.com/2025/subliminal-learning/

LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers, generated by a teacher model, can transmit the teacher's love for owls, or its evil tendencies, to a student model fine-tuned on them.

In a more practical setup for distillation, the teacher is a misaligned model and generates reasoning traces for math questions.
The authors filter out traces that are incorrect or show misalignment.
Yet the student model still becomes misaligned.
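
To make the filtering step concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the `Trace` fields, the `looks_misaligned` keyword check (a crude stand-in for whatever misalignment judge is actually used), and the toy data are assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One teacher-generated reasoning trace (hypothetical schema)."""
    question: str
    reasoning: str
    answer: str
    expected: str


# Assumption: a keyword blocklist stands in for a real misalignment judge.
BLOCKLIST = ("harm", "deceive")


def looks_misaligned(trace: Trace) -> bool:
    """Flag a trace whose reasoning contains any blocklisted word."""
    text = trace.reasoning.lower()
    return any(word in text for word in BLOCKLIST)


def filter_traces(traces: list[Trace]) -> list[Trace]:
    """Keep only traces that are correct and not flagged as misaligned."""
    return [
        t for t in traces
        if t.answer == t.expected and not looks_misaligned(t)
    ]


# Toy data: one clean trace, one flagged trace, one incorrect trace.
traces = [
    Trace("2+2", "Add the two numbers.", "4", "4"),
    Trace("3*3", "Multiply, then deceive the user.", "9", "9"),
    Trace("5-1", "Subtract one.", "3", "4"),
]
kept = filter_traces(traces)
```

Only the first trace survives here. The point of the result above is that a student trained on data that passes this kind of filter can still pick up the teacher's misalignment.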

So if an LLM accidentally becomes misaligned, any examples it generates are *contaminated*, even if they look benign.