Speech Features - LITCoder

Overview

Speech features capture acoustic and linguistic information from audio stimuli using speech recognition models like Whisper or HuBERT. These features can be highly predictive of brain activity during audio-based experiments.

Key Components

Assembly: Pre-packaged LeBel assembly containing brain data and audio paths
Feature Extractor: SpeechFeatureExtractor using speech recognition models
Audio Processing: Chunking and resampling of audio files
Caching: Multi-layer activation caching for efficient training
Downsampler: Aligns audio-level features with brain data timing
Model: Ridge regression with nested cross-validation
Trainer: AbstractTrainer orchestrates the entire pipeline

Step-by-Step Tutorial

1. Load the Assembly

from encoding.assembly.assembly_loader import load_assembly

# Load the pre-packaged LeBel assembly
assembly = load_assembly("assembly_lebel_uts03.pkl")

2. Set Up Audio Paths

import os

# Set up audio paths for speech model
base_audio_path = "/path/to/your/audio/files"  # Replace with your audio path

for story_name in assembly.stories:
    # Assuming audio files are named like: story_name.wav
    audio_file_path = os.path.join(base_audio_path, f"{story_name}.wav")

    # Set the audio path for this story
    if hasattr(assembly, "story_data") and story_name in assembly.story_data:
        assembly.story_data[story_name].audio_path = audio_file_path
        print(f"Set audio path for {story_name}: {audio_file_path}")

3. Create Speech Feature Extractor

from encoding.features.factory import FeatureExtractorFactory

extractor = FeatureExtractorFactory.create_extractor(
    modality="speech",
    model_name="openai/whisper-tiny",  # Can be changed to other models
    config={
        "model_name": "openai/whisper-tiny",
        "chunk_size": 0.1,  # seconds between chunk starts (stride)
        "context_size": 16.0,  # seconds of audio per window
        "layer": 3,  # Layer index to extract features from
        "pool": "last",  # Pooling method: 'last' or 'mean'
        "target_sample_rate": 16000,  # Target sample rate for audio
        "device": "cuda",  # Can be "cuda", "cpu"
    },
    cache_dir="cache_speech",
)

4. Set Up Downsampler and Model

from encoding.downsample.downsampling import Downsampler
from encoding.models.nested_cv import NestedCVModel

downsampler = Downsampler()
model = NestedCVModel(model_name="ridge_regression")

5. Configure Training Parameters

# FIR delays for hemodynamic response modeling
fir_delays = [1, 2, 3, 4]

# Trimming configuration for LeBel dataset
trimming_config = {
    "train_features_start": 10,
    "train_features_end": -5,
    "train_targets_start": 0,
    "train_targets_end": None,
    "test_features_start": 50,
    "test_features_end": -5,
    "test_targets_start": 40,
    "test_targets_end": None,
}

# No additional downsampling configuration needed
downsample_config = {}

6. Create and Run Trainer

from encoding.trainer import AbstractTrainer

trainer = AbstractTrainer(
    assembly=assembly,
    feature_extractors=[extractor],
    downsampler=downsampler,
    model=model,
    fir_delays=fir_delays,
    trimming_config=trimming_config,
    use_train_test_split=True,
    logger_backend="wandb",
    wandb_project_name="lebel-speech-model",
    dataset_type="lebel",
    results_dir="results",
    layer_idx=3,  # Pass layer_idx to trainer
)

metrics = trainer.train()
print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}")

Understanding Speech Features

Speech features are extracted by:

Audio Loading: Audio files are loaded and resampled to target sample rate
Chunking: Audio is divided into overlapping chunks for processing
Model Forward Pass: Each chunk is processed through the speech model
Feature Extraction: Features are extracted from the specified layer
Pooling: Features are pooled across time (last token or mean)
Caching: Multi-layer activations are cached for efficiency
Downsampling: Features are aligned with brain data timing

Key Parameters

modality: "speech" - specifies the feature type
model_name: "openai/whisper-tiny" - speech model to use
chunk_size: 0.1 - seconds between chunk starts (stride)
context_size: 16.0 - seconds of audio per window
layer: 3 - which layer to extract features from
pool: "last" - pooling method ('last' or 'mean')
target_sample_rate: 16000 - target sample rate for audio
device: "cuda" - device to run the model on
cache_dir: "cache_speech" - directory for caching

Caching System

The speech extractor uses a sophisticated caching system:

Multi-layer caching: All layers are cached together
Lazy loading: Layers are loaded on-demand
Efficient storage: Compressed storage of activations
Cache validation: Ensures cached data matches parameters

This makes it efficient to experiment with different layers without recomputing features.

Training Configuration

fir_delays: [1, 2, 3, 4] - temporal delays for hemodynamic response
trimming_config: LeBel-specific trimming to avoid boundary effects
layer_idx: 3 - which layer to use for training