large language models

Learned Subspace Compression for Communication-Efficient Pipeline Parallelism

MAPL treats inter-stage activation compression in pipeline parallelism as a learnable orthogonal projection under Stiefel manifold constraints, letting each stage adapt its own task-optimal subspace. Combined with factorized anchor embeddings and residual vector quantization, it achieves high compression with negligible performance loss across LLaMA models from 150M to 1B parameters.

Paul Janson, Edouard Oyallon, Eugene Belilovsky
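
To make the idea of a projection constrained to the Stiefel manifold concrete, here is a minimal NumPy sketch. It is not the MAPL implementation: the helper names (`qr_retract`, `compress`, `decompress`), the dimensions, the step size, and the use of a QR retraction after a gradient step are all assumptions chosen for illustration. The factorized anchor embeddings and residual vector quantization are not modeled.

```python
import numpy as np

def qr_retract(m):
    """Map a d x k matrix to one with orthonormal columns (a point on the Stiefel manifold)."""
    q, r = np.linalg.qr(m)          # reduced QR: q is d x k, r is k x k
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs                # fix column signs so the result is deterministic

def compress(x, p):
    """Project activations (batch x d) down to k dimensions before sending."""
    return x @ p

def decompress(z, p):
    """Lift received codes (batch x k) back to d dimensions."""
    return z @ p.T

rng = np.random.default_rng(0)
d, k = 64, 8                        # hypothetical hidden size and subspace size
p = qr_retract(rng.standard_normal((d, k)))

# Activations that already lie in the subspace are reconstructed exactly.
x = rng.standard_normal((4, k)) @ p.T
x_hat = decompress(compress(x, p), p)

# One illustrative update: take a (random, stand-in) gradient step,
# then retract so the projection stays orthonormal.
grad = rng.standard_normal((d, k))
p_new = qr_retract(p - 0.1 * grad)
```

In this sketch only `k` values per token cross the stage boundary instead of `d`, and the retraction keeps `p.T @ p` equal to the identity after every update.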

Stabilizing Native Low-Rank LLM Pretraining

We identify the uncontrolled growth of weight-update spectral norms as the key instability in training natively low-rank LLMs, and introduce Spectron, a lightweight spectral renormalization technique that enables stable end-to-end factorized pretraining with compute-optimal scaling laws and improved inference efficiency.

Paul Janson, Edouard Oyallon, Eugene Belilovsky
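
As a rough illustration of what bounding the spectral norm of a weight update can look like for a factorized layer W = A B, here is a minimal NumPy sketch. It is not the Spectron algorithm: the function `renormalize_update`, the threshold `max_norm`, the shapes, and the choice to rescale each factor's update separately are assumptions made for this example only.

```python
import numpy as np

def spectral_norm(m):
    """Largest singular value of a matrix."""
    return float(np.linalg.norm(m, ord=2))

def renormalize_update(delta, max_norm):
    """Rescale an update so its spectral norm is at most max_norm."""
    sigma = spectral_norm(delta)
    if sigma <= max_norm or sigma == 0.0:
        return delta                      # already within the bound
    return delta * (max_norm / sigma)

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 48, 4                # hypothetical layer shape and rank
a = rng.standard_normal((d_out, r)) / np.sqrt(r)
b = rng.standard_normal((r, d_in)) / np.sqrt(d_in)

# Stand-in updates with deliberately large spectral norms.
delta_a = 5.0 * rng.standard_normal((d_out, r))
delta_b = 5.0 * rng.standard_normal((r, d_in))

max_norm = 1.0                            # assumed bound, not a value from the paper
a_new = a - renormalize_update(delta_a, max_norm)
b_new = b - renormalize_update(delta_b, max_norm)
w_new = a_new @ b_new                     # the layer stays factorized throughout
```

The exact singular value is used here for clarity; a cheaper estimate such as power iteration would be a natural substitute in a training loop.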
