pytorch-fsdp2
Adds PyTorch FSDP2 (`fully_shard`) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.
Skill Content
# Skill: Use PyTorch FSDP2 (`fully_shard`) correctly in a training script
This skill teaches a coding agent how to **add PyTorch FSDP2** to a training loop with correct initialization, sharding, mixed precision/offload configuration, and checkpointing.
> FSDP2 in PyTorch is exposed primarily via `torch.distributed.fsdp.fully_shard` and the `FSDPModule` methods it adds in-place to modules. See: `references/pytorch_fully_shard_api.md`, `references/pytorch_fsdp2_tutorial.md`.
---
## When to use this skill
Use FSDP2 when:
- Your model **doesn’t fit** on one GPU (parameters + gradients + optimizer state).
- You want an eager-mode approach with **DTensor-based per-parameter sharding**, which is more inspectable and gives simpler sharded state dicts than FSDP1.
- You may later compose DP with **Tensor Parallel** using **DeviceMesh**.
Avoid (or be careful) if:
- You need checkpoints that stay strictly backwards-compatible across PyTorch versions (the DCP documentation warns that saved state dicts carry no such guarantee).
- You’re forced onto older PyTorch versions without the FSDP2 stack.
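To make the "doesn't fit on one GPU" criterion concrete, here is a rough back-of-the-envelope sketch. It assumes fp32 parameters and gradients (4 bytes each) and an Adam-style optimizer with two fp32 states per parameter (8 bytes), ignores activations and temporary buffers, and assumes ideal even sharding. The function name `training_state_gib` is hypothetical.

```python
def training_state_gib(
    n_params: int,
    world_size: int = 1,
    param_bytes: int = 4,  # assumption: fp32 parameters
    grad_bytes: int = 4,   # assumption: fp32 gradients
    optim_bytes: int = 8,  # assumption: two fp32 Adam states per parameter
) -> float:
    """Estimate per-GPU GiB for parameters + gradients + optimizer state.

    With world_size > 1 this assumes all three are sharded evenly across
    ranks; activations and communication buffers are not counted.
    """
    total_bytes = n_params * (param_bytes + grad_bytes + optim_bytes)
    return total_bytes / world_size / 2**30


if __name__ == "__main__":
    n = 7_000_000_000  # a hypothetical 7B-parameter model
    print(f"unsharded: {training_state_gib(n):.1f} GiB per GPU")
    print(f"sharded x8: {training_state_gib(n, world_size=8):.1f} GiB per GPU")
```

Under these assumptions a 7B-parameter model needs roughly 104 GiB of training state unsharded, more than a single 80 GiB accelerator holds, and about 13 GiB per GPU when sharded across 8 ranks.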
## Alternatives (when FSDP2 is not the best fit)
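- **DistributedDataParallel (DDP)**: Use the standard data-parallel wrapper when the full model, gradients, and optimizer state fit on each GPU, so every rank can hold a complete replica and only gradients need to be synchronized.
- **FullyShardedDataParallel (FSDP1)**: Use the original FSDP wrapper class when you need parameter sharding across data-parallel workers but cannot adopt the FSDP2 `fully_shard` API.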
Reference: `references/pytorch_ddp_notes.md`, `references/pytorch_fsdp1_api.md`.
---
## Contract the agent must follow
1. **Launch with `torchrun`** a...
Details
- Author: Orchestra-Research
- Repository: Orchestra-Research/AI-Research-SKILLs
- Created: 7 months ago
- Last Updated: 1 month ago
- Language: TeX
- License: MIT