pytorch-fsdp2

Quality Score: 94/100

Stars (20%): 100
Recency (20%): 75
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Skill: Use PyTorch FSDP2 (`fully_shard`) correctly in a training script

This skill teaches a coding agent how to **add PyTorch FSDP2** to a training loop with correct initialization, sharding, mixed precision/offload configuration, and checkpointing.

> FSDP2 in PyTorch is exposed primarily via `torch.distributed.fsdp.fully_shard` and the `FSDPModule` methods it adds in-place to modules. See: `references/pytorch_fully_shard_api.md`, `references/pytorch_fsdp2_tutorial.md`.

---

## When to use this skill

Use FSDP2 when:

- Your model **doesn’t fit** on one GPU (parameters + gradients + optimizer state).
- You want an eager-mode approach built on **DTensor-based per-parameter sharding**, which is more inspectable and gives simpler sharded state dicts than FSDP1.
- You may later compose DP with **Tensor Parallel** using **DeviceMesh**.

Avoid (or be careful) if:

- You need strict backwards-compatible checkpoints across PyTorch versions (DCP warns against this).
- You’re forced onto older PyTorch versions without the FSDP2 stack.

## Alternatives (when FSDP2 is not the best fit)

- **DistributedDataParallel (DDP)**: Use the standard data-parallel wrapper when you want classic distributed data parallel training.
- **FullyShardedDataParallel (FSDP1)**: Use the original FSDP wrapper for parameter sharding across data-parallel workers.

Reference: `references/pytorch_ddp_notes.md`, `references/pytorch_fsdp1_api.md`.

---

## Contract the agent must follow

1. **Launch with `torchrun`** a...
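
To make the "doesn’t fit on one GPU (parameters + gradients + optimizer state)" criterion from "When to use this skill" concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not part of the skill itself. It assumes fp32 parameters and gradients (4 bytes each) and an Adam-style optimizer that keeps two fp32 moments per parameter (8 bytes), and it ignores activations, buffers, and the temporary unsharded parameters that sharded training gathers during forward and backward. The function names are hypothetical.

```python
GIB = 2**30  # bytes per GiB


def training_state_bytes(num_params, param_bytes=4, grad_bytes=4, optim_bytes=8):
    """Bytes for parameters + gradients + optimizer state.

    Defaults are an assumption: fp32 params and grads plus two fp32
    Adam moments per parameter. Activations are not counted.
    """
    return num_params * (param_bytes + grad_bytes + optim_bytes)


def per_gpu_gib(num_params, world_size=1, **byte_sizes):
    """Approximate per-GPU training state in GiB when all three
    components are sharded evenly across `world_size` ranks."""
    return training_state_bytes(num_params, **byte_sizes) / world_size / GIB


# Hypothetical 7B-parameter model, unsharded vs. sharded over 8 GPUs.
n = 7_000_000_000
print(f"1 GPU : {per_gpu_gib(n):.1f} GiB")
print(f"8 GPUs: {per_gpu_gib(n, world_size=8):.1f} GiB")
```

If the single-GPU figure is already above the device's memory before any activations are counted, sharding is likely worth considering. If it fits comfortably, plain DDP may be the simpler choice.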

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Similar Skills

pytorch-fsdp

distributed-training