ray-data

Solid

Scalable data processing for ML workloads. Streaming execution across CPUs and GPUs; supports Parquet, CSV, JSON, and images. Integrates with Ray Train, PyTorch, and TensorFlow. Scales from a single machine to hundreds of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

Data & Documents · 9,182 stars · 697 forks · Updated 1 month ago · MIT

Quality Score: 94/100

Stars (20%): 100
Recency (20%)
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Ray Data - Scalable ML Data Processing

Distributed data processing library for ML and AI workloads.

## When to use Ray Data

**Use Ray Data when:**
- Processing large datasets (>100GB) for ML training
- You need distributed data preprocessing across a cluster
- Building batch inference pipelines
- Loading multi-modal data (images, audio, video)
- Scaling data processing from laptop to cluster

**Key features**:
- **Streaming execution**: Process data larger than memory
- **GPU support**: Accelerate transforms with GPUs
- **Framework integration**: PyTorch, TensorFlow, HuggingFace
- **Multi-modal**: Images, Parquet, CSV, JSON, audio, video

**Use alternatives instead**:
- **Pandas**: Small data (<1GB) on single machine
- **Dask**: Tabular data, SQL-like operations
- **Spark**: Enterprise ETL, SQL queries

## Quick start

### Installation

```bash
pip install -U 'ray[data]'
```

### Load and transform data

```python
import ray

# Read Parquet files
ds = ray.data.read_parquet("s3://bucket/data/*.parquet")

# Transform data (lazy execution).
# batch_format="pandas" makes each batch a DataFrame, so the .str accessor is
# available; the default batch format is a dict of NumPy arrays.
ds = ds.map_batches(
    lambda df: df["text"].str.lower().to_frame("processed"),
    batch_format="pandas",
)

# Consume data
for batch in ds.iter_batches(batch_size=100):
    print(batch)
```

### Integration with Ray Train

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Create dataset
train_ds = ray.data.read_parquet("s3://bucket/train/*.parquet")

def train_func(config):
    # Access dataset in training
    train_ds = ray...
```
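The lazy, streaming behavior that the skill content attributes to Ray Data can be illustrated without a cluster. What follows is a minimal pure-Python sketch of the idea, not Ray Data's implementation: the helpers `read_rows`, `to_batches`, `lazy_map_batches`, and `lowercase` are hypothetical names introduced here, and the row contents are made up for illustration. It shows that building the pipeline does no work, and that only one batch needs to be held in memory while iterating.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Batch = Dict[str, List[str]]


def read_rows(n: int) -> Iterator[Dict[str, str]]:
    """Hypothetical source: yields rows one at a time instead of loading them all."""
    for i in range(n):
        yield {"text": f"Row {i}"}


def to_batches(rows: Iterable[Dict[str, str]], batch_size: int) -> Iterator[Batch]:
    """Group a row stream into column-oriented batches of at most batch_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield {"text": [row["text"] for row in chunk]}


def lazy_map_batches(
    batches: Iterable[Batch], fn: Callable[[Batch], Batch]
) -> Iterator[Batch]:
    """Apply fn to each batch only when the consumer asks for the next one."""
    for batch in batches:
        yield fn(batch)


def lowercase(batch: Batch) -> Batch:
    """Example transform: lowercase the text column into a new column."""
    return {"processed": [text.lower() for text in batch["text"]]}


# Building the pipeline runs nothing yet; batches are produced during iteration.
pipeline = lazy_map_batches(to_batches(read_rows(250), batch_size=100), lowercase)
batch_sizes = [len(batch["processed"]) for batch in pipeline]
print(batch_sizes)  # [100, 100, 50]
```

In this sketch the generators play the role of the lazy plan: swapping `read_rows` for a file reader would not change how the consumer loop looks, which is the property that lets a streaming engine handle data larger than memory.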

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

Data & Documents Solid

ray-data

1,436 Updated 6 days ago

OpenRaiser

AI & Automation Solid

ray-distributed-trainer

Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management.

1,160 Updated today

a5c-ai

AI & Automation Solid

ray-train

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

9,182 Updated 1 month ago

Orchestra-Research

Data & Documents Solid

dask

Parallel/distributed computing. Scale pandas/NumPy beyond memory, parallel DataFrames/Arrays, multi-file processing, task graphs, for larger-than-RAM datasets and parallel workflows.

27,705 Updated today

davila7

Data & Documents Listed

dask

Parallel/distributed computing. Scale pandas/NumPy beyond memory, parallel DataFrames/Arrays, multi-file processing, task graphs, for larger-than-RAM datasets and parallel workflows.

335 Updated today

aiskillstore