moe-training

Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without proportional compute increase. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.

AI & Automation · 27,984 stars · 2,901 forks · Updated today · MIT

Skill Content

# MoE Training: Mixture of Experts

## When to Use This Skill

Use MoE Training when you need to:

- **Train larger models** with limited compute (5× cost reduction vs dense models)
- **Scale model capacity** without proportional compute increase
- **Achieve better performance** per compute budget than dense models
- **Specialize experts** for different domains/tasks/languages
- **Reduce inference latency** with sparse activation (only about 13B of 47B parameters are active per token in Mixtral 8x7B)
- **Implement SOTA models** like Mixtral 8x7B, DeepSeek-V3, Switch Transformers

**Notable MoE Models**: Mixtral 8x7B (Mistral AI), DeepSeek-V3, Switch Transformers (Google), GLaM (Google), NLLB-MoE (Meta)

## Installation

```bash
# DeepSpeed with MoE support
# (the requirement is quoted so the shell does not treat ">" as a redirect)
pip install "deepspeed>=0.6.0"

# Megatron-DeepSpeed for large-scale training
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt

# Alternative: HuggingFace Transformers
pip install transformers accelerate
```

## Quick Start

### Basic MoE Architecture

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Sparse Mixture of Experts layer."""

    def __init__(self, hidden_size, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Expert networks (FFN)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU()...
```
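
The `MoELayer` excerpt above is cut off before its routing logic, so the following is only a rough, framework-free sketch of what top-k routing with `num_experts=8` and `top_k=2` might look like for a single token. The function names `top_k_route` and `moe_forward`, the toy experts, and the choice to renormalize the softmax over only the selected experts are all assumptions made for illustration; they are not taken from the skill's own code or from any library.

```python
import math


def top_k_route(logits, top_k=2):
    """Hypothetical helper: pick the top_k experts for one token.

    Returns (expert_indices, weights). The softmax is taken over the
    selected logits only, so the weights sum to 1 (an assumed design
    choice; other routers normalize differently).
    """
    if not 1 <= top_k <= len(logits):
        raise ValueError("top_k must be between 1 and the number of experts")
    # Indices of the top_k largest router logits, highest first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


def moe_forward(token, logits, experts, top_k=2):
    """Hypothetical helper: weighted sum of the selected experts' outputs.

    Only top_k experts are evaluated; the rest are skipped entirely,
    which is where the sparse-activation savings come from.
    """
    chosen, weights = top_k_route(logits, top_k)
    out = [0.0] * len(token)
    for idx, weight in zip(chosen, weights):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
# Made-up router logits for one token, one value per expert.
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.0]

chosen, weights = top_k_route(logits, top_k=2)
output = moe_forward([1.0, -2.0], logits, experts, top_k=2)
print(chosen, weights, output)
```

In a real layer the logits would come from a learned gating projection and the experts would be the FFN modules from the excerpt; the sketch only shows the select-then-combine step and omits load balancing and batching.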

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic · AI
Hugging Face · AI
