Sharding Large Models with Tensor Parallelism

State-of-the-art language models are too large to fit on a single GPU, even with data parallelism, which still replicates the full model on every device. This post explains tensor parallelism, a technique that splits a model's weight tensors across multiple GPUs.
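To give a flavor of the idea, here is a minimal, hypothetical NumPy sketch (not the post's own code) of a column-parallel linear layer: the weight matrix is split column-wise across two simulated devices, each computes its partial output, and the results are concatenated.

```python
import numpy as np

# Hypothetical sketch: column-parallel linear layer across 2 "devices".
# Each shard holds half of W's columns; partial outputs are concatenated.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # batch of activations
W = rng.normal(size=(8, 6))        # full weight matrix
W_shards = np.split(W, 2, axis=1)  # one column shard per device

# Each device computes its partial matmul independently.
partials = [x @ w for w in W_shards]
y = np.concatenate(partials, axis=1)

# The sharded result matches the unsharded computation.
assert np.allclose(y, x @ W)
```

Because each device stores only its shard of `W`, per-device memory falls roughly in proportion to the number of shards, at the cost of communicating activations between devices.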

Training Deep Networks with Data Parallelism in JAX

Train deep networks efficiently by splitting each batch of data across devices in JAX.
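The batch-splitting idea can be sketched without any accelerators at all. This toy NumPy example (an assumption for illustration, not code from the post) splits a batch across four simulated workers, computes each worker's gradient of a mean-squared-error loss for a linear model, and averages them.

```python
import numpy as np

# Hypothetical sketch of data parallelism: split the batch across
# "workers", compute per-shard gradients, then average them.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch of inputs
y = rng.normal(size=(8,))     # targets
w = rng.normal(size=(3,))     # model parameters (replicated everywhere)

def grad_mse(Xs, ys, w):
    # Gradient of mean((Xs @ w - ys)**2) with respect to w.
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

# Each worker sees one equal-sized shard of the batch.
shards = zip(np.split(X, 4), np.split(y, 4))
grads = [grad_mse(Xs, ys, w) for Xs, ys in shards]
g_parallel = np.mean(grads, axis=0)

# With equal shard sizes, the averaged gradient equals the
# full-batch gradient, so training is mathematically unchanged.
assert np.allclose(g_parallel, grad_mse(X, y, w))
```

In real JAX training the same pattern runs across devices, with the gradient average performed by a collective all-reduce rather than a Python loop.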