Key Points
1. The paper studies how semantic meaning is encoded in the representation spaces of large language models. It focuses on two fundamental questions: (1) How are categorical concepts, such as {mammal, bird, reptile, fish}, represented? (2) How are hierarchical relations between concepts encoded?
2. The paper extends the linear representation hypothesis, which posits that high-level concepts are encoded linearly, as directions in a model's representation space.
3. The paper shows how to move from representing binary concepts as directions to representing them as vectors, so that representations can be composed with ordinary vector arithmetic (see the first sketch after this list).
4. The paper demonstrates that hierarchical relations between concepts are encoded geometrically as orthogonality: the vector for a parent concept is orthogonal to the difference between a child concept's vector and the parent's.
5. The paper constructs the representation of a categorical variable as a polytope whose vertices are the vector representations of the binary features that define the category. For "natural" concepts, this polytope is shown to be a simplex (see the second sketch after this list).
6. The paper validates the theoretical results empirically on the Gemma large language model, using concept data extracted from WordNet. It shows the geometric structure of the representations aligns with the semantic hierarchy.
7. The key finding is a remarkably simple structure: simple categorical concepts are represented as simplices, hierarchically related concepts are orthogonal, and complex concepts are represented as polytopes constructed from direct sums of simplices.
8. The results provide a foundation for understanding how high-level semantic concepts are encoded in the representation spaces of large language models.
9. The findings have implications for interpreting and controlling the semantic behavior of language models by directly measuring and editing their internal vector representations.
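The two sketches below make the geometric claims in these points concrete. This first one estimates concept vectors from a model's unembedding matrix and checks the hierarchical-orthogonality claim numerically. It is a minimal illustrative sketch, not the paper's exact pipeline: the checkpoint name, the word lists, and the use of plain covariance whitening as a stand-in for the paper's causal inner product are all assumptions.

```python
# Minimal sketch: estimate binary-concept vectors from unembedding rows and
# check hierarchical orthogonality. Checkpoint, word lists, and the whitening
# step are illustrative assumptions, not the paper's exact procedure.
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2b"                      # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unembedding matrix: one row per vocabulary token.
gamma = model.get_output_embeddings().weight.detach().float().numpy()

# Whitening as a rough stand-in for the paper's causal inner product:
# transform rows so their empirical covariance is (close to) the identity.
mu = gamma.mean(axis=0)
cov = np.cov((gamma - mu).T)
eigval, eigvec = np.linalg.eigh(cov)
whiten = eigvec @ np.diag((eigval + 1e-6) ** -0.5) @ eigvec.T  # small ridge for stability
g = (gamma - mu) @ whiten

def token_vec(word: str) -> np.ndarray:
    """Whitened unembedding vector of the first subword token of `word`."""
    ids = tok(" " + word, add_special_tokens=False)["input_ids"]
    return g[ids[0]]

def concept_vec(words: list[str]) -> np.ndarray:
    """Crude estimate of a concept vector: mean over a few exemplar words."""
    return np.mean([token_vec(w) for w in words], axis=0)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

animal = concept_vec(["animal", "animals", "creature", "beast"])
mammal = concept_vec(["mammal", "mammals"])
dog    = concept_vec(["dog", "dogs", "puppy", "hound"])

# Hierarchical orthogonality: the child-minus-parent difference should have
# cosine near zero with the parent vector and with the parent-minus-grandparent
# difference.
print(cos(mammal, dog - mammal))            # expect ~0
print(cos(mammal - animal, dog - mammal))   # expect ~0
```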
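A companion sketch for the categorical case, reusing `concept_vec`, `cos`, and `animal` from the sketch above: the categorical concept {mammal, bird, reptile, fish} is represented by the polytope (a simplex, for a natural concept) whose vertices are the per-class concept vectors. The class word lists are again illustrative assumptions.

```python
# Sketch (continues the previous one): the categorical concept
# {mammal, bird, reptile, fish} as a simplex whose vertices are the
# per-class concept vectors. Word lists are illustrative assumptions.
classes = {
    "mammal":  ["mammal", "dog", "cat", "horse", "whale"],
    "bird":    ["bird", "eagle", "sparrow", "penguin"],
    "reptile": ["reptile", "lizard", "snake", "turtle"],
    "fish":    ["fish", "salmon", "shark", "trout"],
}
vertices = {name: concept_vec(words) for name, words in classes.items()}

# The category's representation is the polytope spanned by these vertices
# (its points are convex combinations of them); for a natural concept the
# paper shows this polytope is a simplex.

# Hierarchy check: each class vector minus the parent ("animal") vector
# should be roughly orthogonal to the parent vector itself.
for name, v in vertices.items():
    print(name, cos(animal, v - animal))    # expect ~0 for each class
```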
Summary
This paper investigates how large language models (LLMs) represent categorical and hierarchical concepts. The key questions addressed are: 1) How are categorical concepts like {mammal, bird, reptile, fish} represented? 2) How are hierarchical relations between concepts (e.g., dog is a kind of mammal) encoded?
The paper proposes a theoretical framework to address these questions and validates the findings on the Gemma large language model. The main contributions are:
- The vector representation ℓ̄_w of the binary feature for an attribute w satisfies ℓ̄_w ⊥ (ℓ̄_z − ℓ̄_w) whenever z is subordinate to w (e.g., w = mammal, z = dog).
- The difference vector ℓ̄_{w1} − ℓ̄_{w0} is the linear representation of the binary contrast w0 ⇒ w1.
- The difference vectors ℓ̄_{w1} − ℓ̄_{w0} and ℓ̄_{w2} − ℓ̄_{w1} are orthogonal when w2 is subordinate to w1 and w1 is subordinate to w0 (see the worked instantiation after this list).
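Instantiated on the running chain animal ⊃ mammal ⊃ dog (w0 = animal, w1 = mammal, w2 = dog), and with orthogonality read with respect to the paper's choice of inner product, these relations say the following; the block below is an illustration of the stated results, not a quotation from the paper.

```latex
% Running example: dog => mammal => animal, i.e. w2 = dog, w1 = mammal, w0 = animal.
\[
\begin{aligned}
&\bar{\ell}_{\mathrm{mammal}} \;\perp\; \bar{\ell}_{\mathrm{dog}} - \bar{\ell}_{\mathrm{mammal}}
  && \text{(parent $\perp$ child minus parent)}\\[2pt]
&\bar{\ell}_{\mathrm{mammal}} - \bar{\ell}_{\mathrm{animal}}
  \ \text{represents the contrast}\ \mathrm{animal} \Rightarrow \mathrm{mammal}
  && \text{(difference vector as linear representation)}\\[2pt]
&\bigl(\bar{\ell}_{\mathrm{mammal}} - \bar{\ell}_{\mathrm{animal}}\bigr)
  \;\perp\;
  \bigl(\bar{\ell}_{\mathrm{dog}} - \bar{\ell}_{\mathrm{mammal}}\bigr)
  && \text{(nested contrasts are orthogonal)}
\end{aligned}
\]
```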
In summary, the paper provides a foundation for understanding how high-level semantic concepts are encoded in the representation spaces of large language models. This has important implications for model interpretability and control, by enabling direct monitoring and manipulation of the semantic behavior of LLMs.
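One way to make "manipulation" concrete is to add a scaled concept vector to a hidden layer's activations during generation. The sketch below, which continues the earlier ones, illustrates that idea in PyTorch; the layer index, the scale, and the mapping of the whitened vector back into model space are arbitrary assumptions, and this intervention style is an illustration rather than a procedure taken from the paper.

```python
# Sketch (continues the earlier ones): nudge generation toward a concept by
# adding a scaled concept vector to one decoder layer's output. Layer index,
# scale, and the un-whitening step are arbitrary, illustrative assumptions.
import torch

# Map the whitened "mammal" vector back into the model's activation space.
steer = torch.tensor(np.linalg.inv(whiten) @ mammal, dtype=model.dtype)

def add_concept(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer.to(hidden.device)       # hand-picked scale
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

layer = model.model.layers[12]                            # assumed mid-depth block
handle = layer.register_forward_hook(add_concept)
try:
    ids = tok("My favorite animal is the", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=10)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```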
Reference: https://arxiv.org/abs/2406.01506