A high-performance, batched implementation of the Connectionist Temporal Classification (CTC) forced-alignment algorithm. We provide two optimized versions: a pure-PyTorch implementation and a native CUDA implementation. Both deliver substantial speedups over the standard torchaudio.functional.forced_align when processing multiple sequences.
- Native CUDA Support: Custom CUDA kernels for maximum performance
- Full Batch Support: Unlike the original torchaudio implementation, which only supports batch_size=1, our implementation supports arbitrary batch sizes
- Drop-in Replacement: Same API with an additional `use_cuda` parameter
- GPU Accelerated: Optimized for CUDA devices with significant speedup over sequential processing
```bash
# Compile CUDA extension for maximum performance
python setup.py build_ext --inplace
```

Configuration: Input Length=100, Target Length=20, Vocabulary Size=10,000
Hardware: NVIDIA GeForce RTX 5090
| Batch Size | TorchAudio | CUDA Implementation | PyTorch Implementation | CUDA Speedup | PyTorch Speedup |
|---|---|---|---|---|---|
| 1 | 2.0ms | 1.9ms | 10.1ms | 1.07x | 0.20x |
| 8 | 17.1ms | 1.9ms | 10.4ms | 8.87x | 1.64x |
| 64 | 137.4ms | 2.2ms | 10.5ms | 61.72x | 13.08x |
| Batch Size | CUDA (Time per Sample) | PyTorch (Time per Sample) | CUDA Efficiency | PyTorch Efficiency |
|---|---|---|---|---|
| 1 | 1.89ms | 10.13ms | 1.00x | 1.00x |
| 8 | 0.24ms | 1.30ms | 7.82x | 7.79x |
| 64 | 0.03ms | 0.16ms | 54.29x | 62.03x |
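The per-sample numbers above follow directly from the batch timings: time per sample is the batch time divided by the batch size, and efficiency is the batch-size-1 per-sample time divided by that value. A minimal sketch of that arithmetic (the helper names are ours, for illustration only):

```python
def time_per_sample(batch_time_ms, batch_size):
    """Average wall-clock time attributed to one sequence in the batch."""
    return batch_time_ms / batch_size

def batching_efficiency(batch_time_ms, batch_size, single_sample_ms):
    """Per-sample speedup relative to batch size 1 (1.0 = no gain)."""
    return single_sample_ms / time_per_sample(batch_time_ms, batch_size)

# CUDA timings from the table: 1.89 ms at batch size 1, 1.9 ms at batch size 8
cuda_eff_8 = batching_efficiency(1.9, 8, 1.89)  # close to the table's 7.82x
```

Recomputing from the rounded timings shown in the table gives values slightly off from the table's efficiency column, which was derived from the unrounded measurements.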
Run the benchmark script to compare performance:

```bash
python benchmark.py
```

This will test various batch sizes and provide detailed performance comparisons between our implementation and torchaudio.functional.forced_align.
```python
import torch
from ctc import ctc_forced_align

# Generate sample data
batch_size, input_length, target_length, vocab_size = 4, 100, 20, 1000
log_probs = torch.randn(batch_size, input_length, vocab_size)
log_probs = torch.log_softmax(log_probs, dim=-1)
targets = torch.randint(1, vocab_size, (batch_size, target_length))
input_lengths = torch.full((batch_size,), input_length)
target_lengths = torch.full((batch_size,), target_length)

# Perform forced alignment
alignments, scores = ctc_forced_align(
    log_probs=log_probs,
    targets=targets,
    input_lengths=input_lengths,
    target_lengths=target_lengths,
    blank=0,
)

print(f"Alignment shape: {alignments.shape}")  # (batch_size, input_length)
print(f"Scores shape: {scores.shape}")         # (batch_size, input_length)
```

The implementation uses dynamic programming to find the optimal alignment path between the target sequence and the emission probabilities. The algorithm:
- Constructs an expanded target sequence with blanks
- Uses forward-backward algorithm principles for efficient computation
- Maintains backpointers for path reconstruction
- Returns the most likely alignment path
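The steps above amount to a Viterbi pass over the blank-expanded target. A minimal single-sequence sketch (plain Python for clarity; `viterbi_forced_align` is our illustrative name, not the library API, and it assumes the input is long enough to emit every label):

```python
import math

def viterbi_forced_align(log_probs, target, blank=0):
    """log_probs: T rows of per-class log-probabilities; target: label list."""
    T = len(log_probs)
    # Expanded target with blanks interleaved: [b, y1, b, y2, ..., yL, b]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S = len(ext)

    NEG_INF = -math.inf
    dp = [[NEG_INF] * S for _ in range(T)]
    bp = [[0] * S for _ in range(T)]  # backpointers for path reconstruction

    dp[0][0] = log_probs[0][blank]
    if S > 1:
        dp[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay at s, step from s-1,
            # or skip a blank from s-2 (only between distinct labels)
            best, arg = dp[t - 1][s], s
            if s >= 1 and dp[t - 1][s - 1] > best:
                best, arg = dp[t - 1][s - 1], s - 1
            if (s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
                    and dp[t - 1][s - 2] > best):
                best, arg = dp[t - 1][s - 2], s - 2
            dp[t][s] = best + log_probs[t][ext[s]]
            bp[t][s] = arg

    # A valid path ends on the final label or the trailing blank
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = bp[t][s]
    return path
```

The batched implementations run this recurrence for all sequences in parallel (vectorizing over the batch dimension on GPU) rather than looping per sequence.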
The PyTorch implementation borrows concepts from vadimkantorov/ctc.
The CUDA implementation builds on the CUDA source code of torchaudio.functional.forced_align, with significant enhancements to support batch processing (batch_size > 1).
MIT License - see LICENSE file for details.