a reading list

Some really cool papers that inspire my ideas

Favourites

paper · Cloud, Le et al., 2025

Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

One of the first papers I read that really showed something hidden: a signal that isn't semantically detectable unless you're the same type of language model. It gave me many ideas about how meaning can be represented and inferred inside LLMs. This sits at the intersection of transfer, embedding spaces, safety and sci-fi encryption. Highly recommend a read.

paper · Fraser-Taliente, Kantamneni, Ong et al., 2026

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Recent read as I start looking more into mechanistic interpretability. Very cool idea that answers a question I've had since the beginning of my studies back in 2018: how can we read an AI's thoughts directly? They create an autoencoder where natural language is the bottleneck, and two versions of the model itself act as the 'layers' on either side of it. I think there are some major issues to iron out, like the warm-start being summarisation, the layer l embedding being injected directly into layer 0, and the confabulation problem, but I think it's a great approach for getting closer to what I've always imagined. (Even if it's impractical due to the number of explanations needed across all layers, tokens, CoT etc.)

paper · Jha, Zhang, Shmatikov, Morris, 2025

Harnessing the Universal Geometry of Embeddings

Super cool method! So simple, yet it broke the implicit assumption that you can't translate embeddings across different models. If I remember correctly, it uses just 4 MLPs and a few constraints: translate from the original space to a latent space and back with minimal reconstruction loss, and maintain relative distances between points. With that, they managed to break the 'encryption' provided by language models and give additional evidence for the universal representation hypothesis: that good models of (some aspects of) the world converge to some universal representation aligned with the truth.
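
A rough sketch of the two constraints mentioned above (reconstruction and keeping relative distances), assuming NumPy and toy sizes. Everything here is made up for illustration: the helper names (`make_mlp`, `mlp`, `reconstruction_loss`, `distance_preservation_loss`), the dimensions, and the reading of "4 MLPs" as one encoder and one decoder per model. Nothing is trained, and this is not the paper's code, which has more pieces than this.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(d_in, d_hidden, d_out):
    """Random, untrained one-hidden-layer MLP weights (illustrative only)."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, d_hidden)),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(d_hidden), (d_hidden, d_out)),
    }

def mlp(params, x):
    """Apply the MLP: linear -> ReLU -> linear."""
    return np.maximum(x @ params["W1"], 0.0) @ params["W2"]

def pairwise_distances(x):
    """Euclidean distance between every pair of rows, shape (n, n)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def reconstruction_loss(x, x_hat):
    """Mean squared error between embeddings and their round trip."""
    return float(np.mean((x - x_hat) ** 2))

def distance_preservation_loss(x, y):
    """How much pairwise distances change between two sets of points."""
    return float(np.mean((pairwise_distances(x) - pairwise_distances(y)) ** 2))

# Hypothetical sizes: model A embeds in 16 dims, model B in 24, latent is 8.
d_a, d_b, d_latent, n = 16, 24, 8, 32
x_a = rng.normal(size=(n, d_a))  # stand-in for model A's embeddings

# Assumed layout of the 4 MLPs: an encoder and a decoder for each model.
enc_a, dec_a = make_mlp(d_a, 32, d_latent), make_mlp(d_latent, 32, d_a)
enc_b, dec_b = make_mlp(d_b, 32, d_latent), make_mlp(d_latent, 32, d_b)

z = mlp(enc_a, x_a)          # A's space -> shared latent space
x_a_rec = mlp(dec_a, z)      # latent -> back to A's space
x_a_to_b = mlp(dec_b, z)     # latent -> B's space (the "translation")

rec = reconstruction_loss(x_a, x_a_rec)
dist = distance_preservation_loss(x_a, x_a_to_b)
print(f"reconstruction loss: {rec:.3f}, distance loss: {dist:.3f}")
```

Training would push both numbers down together. The distance term compares two n-by-n matrices, so it works even though the two embedding spaces have different dimensions.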

paper · Huh, Cheung, Wang, Isola, 2024

The Platonic Representation Hypothesis

Lots of really cool findings and better-articulated arguments for the platonic (/universal) representation hypothesis, with evidence from different language models and different modalities. I remember feeling very inspired after I read this on a flight, but have now mostly forgotten the details.