Language Model Features Tutorial

This tutorial shows how to train encoding models on the LeBel assembly using language model features, which capture rich semantic representations extracted from transformer models.

Overview

Language model features extract high-dimensional representations from transformer models like GPT-2. These features capture semantic, syntactic, and contextual information that can be highly predictive of brain activity.

Key Components
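
The pipeline assembled in this tutorial has five parts:

  1. Assembly: the packaged LeBel stimuli and fMRI responses, loaded with load_assembly
  2. Feature extractor: a language model extractor built by FeatureExtractorFactory
  3. Downsampler: aligns feature timing with the fMRI sampling rate
  4. Model: NestedCVModel running ridge regression with nested cross-validation
  5. Trainer: AbstractTrainer, which ties everything together and runs training and evaluation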

Step-by-Step Tutorial

1. Load the Assembly

from encoding.assembly.assembly_loader import load_assembly

# Load the pre-packaged LeBel assembly
assembly = load_assembly("assembly_lebel_uts03.pkl")
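
The pre-packaged assembly bundles the stimuli and the corresponding fMRI responses for subject UTS03 of the LeBel dataset.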

2. Create Language Model Feature Extractor

from encoding.features.factory import FeatureExtractorFactory

extractor = FeatureExtractorFactory.create_extractor(
    modality="language_model",
    model_name="gpt2-small",  # Can be changed to other models
    config={
        "model_name": "gpt2-small",
        "layer_idx": 9,  # Layer to extract features from
        "last_token": True,  # Use last token only
        "lookback": 256,  # Context lookback
        "context_type": "fullcontext",
    },
    cache_dir="cache_language_model",
)
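
GPT-2 small has 12 transformer layers, so layer_idx: 9 selects a late-intermediate layer. last_token: True keeps only the final token's representation for each stimulus chunk, computed over a context of up to lookback tokens. If you swap in a different model, make sure layer_idx is valid for that model's depth.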

3. Set Up Downsampler and Model

from encoding.downsample.downsampling import Downsampler
from encoding.models.nested_cv import NestedCVModel

downsampler = Downsampler()
model = NestedCVModel(model_name="ridge_regression")
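
The downsampler aligns the word-level feature time series with the slower fMRI sampling grid (see the downsampling step described later in this tutorial), and NestedCVModel fits ridge regression with nested cross-validation so that the regularization strength is chosen on held-out data.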

4. Configure Training Parameters

# FIR delays for hemodynamic response modeling
fir_delays = [1, 2, 3, 4]

# Trimming configuration for LeBel dataset
trimming_config = {
    "train_features_start": 10,
    "train_features_end": -5,
    "train_targets_start": 0,
    "train_targets_end": None,
    "test_features_start": 50,
    "test_features_end": -5,
    "test_targets_start": 40,
    "test_targets_end": None,
}

# No additional downsampling configuration needed
downsample_config = {}
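
The FIR delays make the regression see several time-shifted copies of each feature, so the ridge model can learn the shape of the hemodynamic lag rather than assuming a canonical response. The trainer builds this delayed design matrix internally; the snippet below is only a minimal illustration of the idea, and the make_delayed helper is illustrative rather than part of the package.

import numpy as np

def make_delayed(features, delays):
    """Stack copies of `features` shifted by each delay (in TRs)."""
    n_trs, n_feats = features.shape
    delayed = np.zeros((n_trs, n_feats * len(delays)))
    for i, d in enumerate(delays):
        delayed[d:, i * n_feats:(i + 1) * n_feats] = features[: n_trs - d]
    return delayed

# 100 TRs of 768-dimensional GPT-2 features with delays [1, 2, 3, 4]
# become a (100, 3072) design matrix for the ridge regression.
X_delayed = make_delayed(np.random.randn(100, 768), fir_delays)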

5. Create and Run Trainer

from encoding.trainer import AbstractTrainer

trainer = AbstractTrainer(
    assembly=assembly,
    feature_extractors=[extractor],
    downsampler=downsampler,
    model=model,
    fir_delays=fir_delays,
    trimming_config=trimming_config,
    use_train_test_split=True,
    logger_backend="wandb",
    wandb_project_name="lebel-language-model",
    dataset_type="lebel",
    results_dir="results",
    layer_idx=9,  # Pass layer_idx to trainer
    lookback=256,  # Pass lookback to trainer
)

metrics = trainer.train()
print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}")
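
With logger_backend="wandb", metrics from the run are logged to the lebel-language-model project, and outputs are saved under the directory given by results_dir.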

Understanding Language Model Features

Language model features are extracted by:

  1. Text Processing: Each stimulus text is tokenized and processed
  2. Transformer Forward Pass: The model processes the text through all layers
  3. Feature Extraction: Features are extracted from the specified layer
  4. Caching: Multi-layer activations are cached for efficiency
  5. Downsampling: Features are aligned with brain data timing
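
Ignoring caching and batching, the first three steps above boil down to something like the following sketch, which calls the Hugging Face transformers GPT-2 model directly ("gpt2" is the checkpoint for GPT-2 small); the extractor's internals may differ in detail.

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

text = "the quick brown fox jumps over the lazy dog"
ids = tokenizer(text, return_tensors="pt")["input_ids"][:, -256:]  # lookback window

with torch.no_grad():
    outputs = model(ids)

# hidden_states[0] is the embedding layer; hidden_states[9] is layer 9's output
layer_9 = outputs.hidden_states[9]   # shape (1, n_tokens, 768)
feature_vector = layer_9[0, -1, :]   # last-token representation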

Key Parameters
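
The extractor settings from step 2 that most affect the features are:

  1. layer_idx: the transformer layer the features are read from (9 in this tutorial)
  2. last_token: whether only the final token's representation is kept (True here)
  3. lookback: how many tokens of preceding context are given to the model (256 here)
  4. context_type: the context construction strategy ("fullcontext" here)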

Model Options
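
model_name controls which pretrained language model the factory loads. The tutorial uses gpt2-small; other GPT-2 sizes may be substituted (assuming your installation exposes them under analogous names), as long as layer_idx stays within the chosen model's layer count. For example, hypothetically:

extractor = FeatureExtractorFactory.create_extractor(
    modality="language_model",
    model_name="gpt2-medium",  # assumed name; gpt2-medium has 24 layers
    config={
        "model_name": "gpt2-medium",
        "layer_idx": 18,  # must be a valid layer for the chosen model
        "last_token": True,
        "lookback": 256,
        "context_type": "fullcontext",
    },
    cache_dir="cache_language_model_medium",
)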

Caching System

The language model extractor uses a sophisticated caching system:

  1. Multi-layer caching: All layers are cached together
  2. Lazy loading: Layers are loaded on-demand
  3. Efficient storage: Compressed storage of activations
  4. Cache validation: Ensures cached data matches parameters

This makes it efficient to experiment with different layers without recomputing features.
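
As a rough mental model rather than the package's actual on-disk format, the behavior described above can be pictured like this: one compressed archive per stimulus/model/lookback combination that stores every layer, from which single layers are read lazily on request.

import os
import numpy as np

class ActivationCache:
    """Illustrative layer-wise cache; the real extractor's format may differ."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, stimulus_id, model_name, lookback):
        # One compressed file per (stimulus, model, lookback) holding every layer.
        return os.path.join(self.cache_dir, f"{stimulus_id}_{model_name}_{lookback}.npz")

    def save(self, stimulus_id, model_name, lookback, layer_activations):
        # layer_activations: dict mapping layer index -> (n_tokens, hidden_dim) array
        np.savez_compressed(
            self._path(stimulus_id, model_name, lookback),
            **{f"layer_{i}": acts for i, acts in layer_activations.items()},
        )

    def load_layer(self, stimulus_id, model_name, lookback, layer_idx):
        # Lazy loading: only the requested layer is pulled out of the archive.
        path = self._path(stimulus_id, model_name, lookback)
        if not os.path.exists(path):
            return None  # cache miss; caller recomputes and saves
        with np.load(path) as archive:
            key = f"layer_{layer_idx}"
            return archive[key] if key in archive.files else None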

Training Configuration