ray-data

Solid

Scalable data processing for ML workloads. Streaming execution across CPUs and GPUs, with support for Parquet, CSV, JSON, and image data. Integrates with Ray Train, PyTorch, and TensorFlow. Scales from a single machine to hundreds of nodes. Use it for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

Data & Documents | 9,182 stars | 697 forks | Updated 1 month ago | MIT

Quality Score: 94/100

Stars (weight 20%): 100
Recency (weight 20%): 75
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# Ray Data - Scalable ML Data Processing

Distributed data processing library for ML and AI workloads.

## When to use Ray Data

**Use Ray Data when:**
- Processing large datasets (>100GB) for ML training
- Distributing data preprocessing across a cluster
- Building batch inference pipelines
- Loading multi-modal data (images, audio, video)
- Scaling data processing from a laptop to a cluster

**Key features**:
- **Streaming execution**: Process data larger than memory
- **GPU support**: Accelerate transforms with GPUs
- **Framework integration**: PyTorch, TensorFlow, HuggingFace
- **Multi-modal**: Images, Parquet, CSV, JSON, audio, video

**Use alternatives instead**:
- **Pandas**: Small data (<1GB) on single machine
- **Dask**: Tabular data, SQL-like operations
- **Spark**: Enterprise ETL, SQL queries

## Quick start

### Installation

```bash
pip install -U 'ray[data]'
```

### Load and transform data

```python
import ray

# Read Parquet files
ds = ray.data.read_parquet("s3://bucket/data/*.parquet")

# Transform data (lazy execution). batch_format="pandas" passes each batch as
# a DataFrame so the .str accessor works; the default batch format is a dict
# of NumPy arrays, which has no .str accessor.
ds = ds.map_batches(
    lambda batch: batch["text"].str.lower().to_frame("processed"),
    batch_format="pandas",
)

# Consume data (this is what triggers execution)
for batch in ds.iter_batches(batch_size=100):
    print(batch)
```

### Integration with Ray Train

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Create dataset
train_ds = ray.data.read_parquet("s3://bucket/train/*.parquet")

def train_func(config):
    # Access dataset in training
    train_ds = ray...
```
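The "lazy execution" and "streaming execution" ideas from the quick start can be illustrated without a cluster. The following is a plain-Python sketch of the concept only, not how Ray Data is implemented: the helpers `iter_batches`, `map_batches`, and `lowercase_text` are hypothetical local functions that merely borrow the names used above, and the in-memory rows stand in for the Parquet files.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Batch = Dict[str, List[str]]


def iter_batches(rows: Iterable[dict], batch_size: int) -> Iterator[Batch]:
    """Group rows into column-oriented batches, producing one batch at a time."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # Convert a list of row dicts into a dict of column lists.
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


def map_batches(batches: Iterable[Batch], fn: Callable[[Batch], Batch]) -> Iterator[Batch]:
    """Apply fn lazily: no batch is transformed until the result is iterated."""
    for batch in batches:
        yield fn(batch)


def lowercase_text(batch: Batch) -> Batch:
    """Hypothetical transform mirroring the lowercase example above."""
    return {"processed": [text.lower() for text in batch["text"]]}


# Hypothetical in-memory rows (assumption: a single "text" column).
rows = ({"text": f"Row {i}"} for i in range(5))

# Building the pipeline does no work yet; only one batch is held at a time.
pipeline = map_batches(iter_batches(rows, batch_size=2), lowercase_text)

# Consuming the pipeline is what triggers the transforms.
results = list(pipeline)
for batch in results:
    print(batch)
```

Because each stage is a generator, only one batch is materialized at a time, which is the property that lets a streaming pipeline handle inputs larger than memory. Ray Data additionally distributes the work across processes and nodes, which this sketch does not attempt.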

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT
