Learned Subspace Compression for Communication-Efficient Pipeline Parallelism

MAPL treats inter-stage activation compression in pipeline parallelism as a learnable orthogonal projection under Stiefel manifold constraints, letting each stage adapt its own task-optimal subspace. Combined with factorized anchor embeddings and residual vector quantization, it achieves high compression with negligible performance loss across LLaMA models from 150M to 1B parameters.

Paul Janson, Edouard Oyallon, Eugene Belilovsky

PyLO: Towards Accessible Learned Optimizers in PyTorch

A PyTorch library that makes learned optimizers accessible to the broader ML community. Its CUDA-accelerated implementations deliver substantial speedups (a 5x improvement for ViT training), and it integrates seamlessly with existing PyTorch workflows, enabling practical application of learned optimization to real-world large-scale tasks.

Paul Janson, Benjamin Therien, Quentin Anthony, Xiaolong Huang, Abhinav Moudgil, Eugene Belilovsky

Stabilizing Native Low-Rank LLM Pretraining

We identify the uncontrolled growth of weight-update spectral norms as the key instability in training natively low-rank LLMs, and introduce Spectron, a lightweight spectral renormalization technique that enables stable end-to-end factorized pretraining with compute-optimal scaling laws and improved inference efficiency.

Paul Janson, Edouard Oyallon, Eugene Belilovsky

Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

We demonstrate that infinite learning rate schedules consistently outperform the widely used repeated cosine decay for continual pre-training under distribution shifts across both vision and language models, providing a more effective alternative for large-scale self-supervised learning without catastrophic forgetting.

Paul Janson, Vaibhav Singh, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, Benjamin Therien

Towards motion from video diffusion models

This study investigates the capabilities of video diffusion models in generating human motion from text prompts, revealing their strengths in common motions and limitations in rare or complex movements.

Paul Janson, Tiberiu Popa, Eugene Belilovsky

Continual zero-shot learning through semantically guided generative random walk

Learning novel concepts, remembering previous knowledge, and adapting it to future tasks occur simultaneously throughout a human’s …

Wenxuan Zhang, Paul Janson, Divyansh Jha, Kai Yi, Ivan Skorokhodov, Mohammed Elhoseiny

Overcoming Generic Knowledge Loss with Selective Parameter Update

Adding knowledge to a model without destroying its generalization, by fine-tuning a small set of parameters.

Wenxuan Zhang, Paul Janson, Rahaf Aljundi, Mohammed Elhoseiny

Domain Aware Zero shot learning

Continual zero-shot learning involves learning seen classes incrementally while improving the ability to recognize unseen or …

Kai Yi, Paul Janson, Wenxuan Zhang, Mohammed Elhoseiny

A Simple baseline that questions the use of pre-trained models in continual learning

A baseline that performs better without training in continual learning benchmarks

Paul Janson, Wenxuan Zhang, Rahaf Aljundi, Mohammed Elhoseiny