Language Model Features Tutorial

This tutorial shows how to train encoding models on the LeBel assembly using language model features, which capture rich semantic representations extracted from transformer models.

Overview

Language model features extract high-dimensional representations from transformer models like GPT-2. These features capture semantic, syntactic, and contextual information that can be highly predictive of brain activity.

Key Components
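
The pipeline assembled in this tutorial has five parts:

  1. Assembly: the packaged LeBel stimuli and fMRI responses, loaded with load_assembly
  2. Feature extractor: a language model extractor built by FeatureExtractorFactory
  3. Downsampler: aligns feature timing with the fMRI sampling rate
  4. Model: NestedCVModel running ridge regression with nested cross-validation
  5. Trainer: AbstractTrainer, which ties everything together and runs training and evaluation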

Step-by-Step Tutorial

1. Load the Assembly

from encoding.assembly.assembly_loader import load_assembly

# Load the pre-packaged LeBel assembly
assembly = load_assembly("assembly_lebel_uts03.pkl")
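
The pre-packaged assembly bundles the stimuli and the corresponding fMRI responses for subject UTS03 of the LeBel dataset.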

2. Create Language Model Feature Extractor

from encoding.features.factory import FeatureExtractorFactory

extractor = FeatureExtractorFactory.create_extractor(
    modality="language_model",
    model_name="gpt2-small",  # Can be changed to other models
    config={
        "model_name": "gpt2-small",
        "layer_idx": 9,  # Layer to extract features from
        "last_token": True,  # Use last token only
        "lookback": 256,  # Context lookback
        "context_type": "fullcontext",
    },
    cache_dir="cache_language_model",
)
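
GPT-2 small has 12 transformer layers, so layer_idx: 9 selects a late-intermediate layer. last_token: True keeps only the final token's representation for each stimulus chunk, computed over a context of up to lookback tokens. If you swap in a different model, make sure layer_idx is valid for that model's depth.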

3. Set Up Downsampler and Model

from encoding.downsample.downsampling import Downsampler
from encoding.models.nested_cv import NestedCVModel

downsampler = Downsampler()
model = NestedCVModel(model_name="ridge_regression")
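
The downsampler aligns the word-level feature time series with the slower fMRI sampling grid (see the downsampling step described later in this tutorial), and NestedCVModel fits ridge regression with nested cross-validation so that the regularization strength is chosen on held-out data.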

4. Configure Training Parameters

# FIR delays for hemodynamic response modeling
fir_delays = [1, 2, 3, 4]

# Trimming configuration for LeBel dataset
trimming_config = {
    "train_features_start": 10,
    "train_features_end": -5,
    "train_targets_start": 0,
    "train_targets_end": None,
    "test_features_start": 50,
    "test_features_end": -5,
    "test_targets_start": 40,
    "test_targets_end": None,
}

# No additional downsampling configuration needed
downsample_config = {}
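
The FIR delays make the regression see several time-shifted copies of each feature, so the ridge model can learn the shape of the hemodynamic lag rather than assuming a canonical response. The trainer builds this delayed design matrix internally; the snippet below is only a minimal illustration of the idea, and the make_delayed helper is illustrative rather than part of the package.

import numpy as np

def make_delayed(features, delays):
    """Stack copies of `features` shifted by each delay (in TRs)."""
    n_trs, n_feats = features.shape
    delayed = np.zeros((n_trs, n_feats * len(delays)))
    for i, d in enumerate(delays):
        delayed[d:, i * n_feats:(i + 1) * n_feats] = features[: n_trs - d]
    return delayed

# 100 TRs of 768-dimensional GPT-2 features with delays [1, 2, 3, 4]
# become a (100, 3072) design matrix for the ridge regression.
X_delayed = make_delayed(np.random.randn(100, 768), fir_delays)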

5. Create and Run Trainer

from encoding.trainer import AbstractTrainer

trainer = AbstractTrainer(
    assembly=assembly,
    feature_extractors=[extractor],
    downsampler=downsampler,
    model=model,
    fir_delays=fir_delays,
    trimming_config=trimming_config,
    use_train_test_split=True,
    logger_backend="wandb",
    wandb_project_name="lebel-language-model",
    dataset_type="lebel",
    results_dir="results",
    layer_idx=9,  # Pass layer_idx to trainer
    lookback=256,  # Pass lookback to trainer
)

metrics = trainer.train()
print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}")
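
With logger_backend="wandb", metrics from the run are logged to the lebel-language-model project, and outputs are saved under the directory given by results_dir.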

Understanding Language Model Features

Language model features are extracted by:

  1. Text Processing: Each stimulus text is tokenized and processed
  2. Transformer Forward Pass: The model processes the text through all layers
  3. Feature Extraction: Features are extracted from the specified layer
  4. Caching: Multi-layer activations are cached for efficiency
  5. Downsampling: Features are aligned with brain data timing
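
Ignoring caching and batching, the first three steps above boil down to something like the following sketch, which calls the Hugging Face transformers GPT-2 model directly ("gpt2" is the checkpoint for GPT-2 small); the extractor's internals may differ in detail.

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

text = "the quick brown fox jumps over the lazy dog"
ids = tokenizer(text, return_tensors="pt")["input_ids"][:, -256:]  # lookback window

with torch.no_grad():
    outputs = model(ids)

# hidden_states[0] is the embedding layer; hidden_states[9] is layer 9's output
layer_9 = outputs.hidden_states[9]   # shape (1, n_tokens, 768)
feature_vector = layer_9[0, -1, :]   # last-token representation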

Key Parameters
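
The extractor settings from step 2 that most affect the features are:

  1. layer_idx: the transformer layer the features are read from (9 in this tutorial)
  2. last_token: whether only the final token's representation is kept (True here)
  3. lookback: how many tokens of preceding context are given to the model (256 here)
  4. context_type: the context construction strategy ("fullcontext" here)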

Model Options
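
model_name controls which pretrained language model the factory loads. The tutorial uses gpt2-small; other GPT-2 sizes may be substituted (assuming your installation exposes them under analogous names), as long as layer_idx stays within the chosen model's layer count. For example, hypothetically:

extractor = FeatureExtractorFactory.create_extractor(
    modality="language_model",
    model_name="gpt2-medium",  # assumed name; gpt2-medium has 24 layers
    config={
        "model_name": "gpt2-medium",
        "layer_idx": 18,  # must be a valid layer for the chosen model
        "last_token": True,
        "lookback": 256,
        "context_type": "fullcontext",
    },
    cache_dir="cache_language_model_medium",
)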

Caching System

The language model extractor uses a sophisticated caching system:

  1. Multi-layer caching: All layers are cached together
  2. Lazy loading: Layers are loaded on-demand
  3. Efficient storage: Compressed storage of activations
  4. Cache validation: Ensures cached data matches parameters

This makes it efficient to experiment with different layers without recomputing features.
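
As a rough mental model rather than the package's actual on-disk format, the behavior described above can be pictured like this: one compressed archive per stimulus/model/lookback combination that stores every layer, from which single layers are read lazily on request.

import os
import numpy as np

class ActivationCache:
    """Illustrative layer-wise cache; the real extractor's format may differ."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, stimulus_id, model_name, lookback):
        # One compressed file per (stimulus, model, lookback) holding every layer.
        return os.path.join(self.cache_dir, f"{stimulus_id}_{model_name}_{lookback}.npz")

    def save(self, stimulus_id, model_name, lookback, layer_activations):
        # layer_activations: dict mapping layer index -> (n_tokens, hidden_dim) array
        np.savez_compressed(
            self._path(stimulus_id, model_name, lookback),
            **{f"layer_{i}": acts for i, acts in layer_activations.items()},
        )

    def load_layer(self, stimulus_id, model_name, lookback, layer_idx):
        # Lazy loading: only the requested layer is pulled out of the archive.
        path = self._path(stimulus_id, model_name, lookback)
        if not os.path.exists(path):
            return None  # cache miss; caller recomputes and saves
        with np.load(path) as archive:
            key = f"layer_{layer_idx}"
            return archive[key] if key in archive.files else None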

Training Configuration