rl-reward


Build RL reward signals using the OpenJudge framework. Covers choosing between pointwise and pairwise reward strategies based on RL algorithm, task type, and cost; aggregating multi-dimensional pointwise scores into a scalar reward; pairwise tournament reward for GRPO on subjective tasks (net win rate across group rollouts); generating preference pairs for DPO/RLAIF; and normalizing scores for training stability. Use when building reward models, scoring rollouts for GRPO/REINFORCE, generating preference data for DPO, or doing Best-of-N selection.

AI & Automation · 625 stars · 54 forks · Updated 1 week ago · Apache-2.0


Quality Score: 90/100

Stars (weight 20%)
Recency (weight 20%)
Frontmatter (weight 20%)
Documentation (weight 15%): 100
Issue Health (weight 10%)
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# RL Reward Construction with OpenJudge

Build reward signals for reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF) using the `openjudge` library.

## When to Use This Skill

- Building scalar rewards for GRPO / REINFORCE rollout scoring
- Generating (chosen, rejected) preference pairs for DPO / IPO
- Best-of-N candidate selection
- Multi-dimensional reward shaping (correctness + safety + format)
- Replacing or bootstrapping a reward model with LLM-as-judge

## Step 1: Choose Your Reward Strategy

Use this decision tree **before** writing any code:

```
RL Algorithm + Task type?
│
├── GRPO / REINFORCE - Verifiable task (math, code, structured output)
│   └── → POINTWISE ✅ (FunctionGrader, exact score, zero LLM cost)
│
├── GRPO / REINFORCE - Subjective task (instruction following, dialogue, summarization)
│   └── → PAIRWISE TOURNAMENT ✅ (compare each rollout vs all others in group,
│                                 reward = net win rate within group)
│
├── DPO / IPO / SLiC - need (chosen, rejected) pairs
│   └── → PAIRWISE ✅ (two-way comparison, return winner/loser)
│
└── Best-of-N / reranking - rank N candidates
    └── → LISTWISE ✅ (single call ranks all N at once)
```

```
Cost constraint?
├── Low budget
│   └── FunctionGrader (free) → pointwise; or pairwise with small judge model
│
├── Medium budget
│   ├── Pointwise: 2–3 LLM graders + WeightedSumAggregator
│   └── Pairwise tournament: 1 LLM judge, N*(N-1)/2 c...
```
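The pairwise tournament branch of the decision tree can be illustrated with a short plain-Python sketch. It does not use the `openjudge` API: the `Judge` callable, `tournament_rewards`, and the toy `longer_is_better` judge are hypothetical names introduced here. It also assumes that "net win rate" means (wins - losses) / (N - 1) and that the judge never returns a tie. Each unordered pair of rollouts is compared once, for N*(N-1)/2 judge calls per group.

```python
from itertools import combinations
from typing import Callable, List, Sequence

# Hypothetical judge signature (not an OpenJudge type):
# returns True if response `a` beats response `b` for the given prompt.
Judge = Callable[[str, str, str], bool]


def tournament_rewards(prompt: str, rollouts: Sequence[str], judge: Judge) -> List[float]:
    """Assumed definition of net win rate: (wins - losses) / (N - 1), in [-1, 1]."""
    n = len(rollouts)
    if n < 2:
        # A single rollout has nothing to be compared against.
        return [0.0] * n
    net = [0] * n
    # One comparison per unordered pair: N*(N-1)/2 judge calls in total.
    for i, j in combinations(range(n), 2):
        if judge(prompt, rollouts[i], rollouts[j]):
            net[i] += 1
            net[j] -= 1
        else:
            net[j] += 1
            net[i] -= 1
    return [score / (n - 1) for score in net]


def longer_is_better(prompt: str, a: str, b: str) -> bool:
    # Toy stand-in for an LLM judge, used only to make the sketch runnable.
    return len(a) > len(b)


rollouts = ["ok", "a longer answer", "mid one"]
rewards = tournament_rewards("Summarize the article.", rollouts, longer_is_better)
print(rewards)  # [-1.0, 1.0, 0.0]
```

Because every win for one rollout is a loss for another, the rewards within a group sum to zero under this definition, so they are already centered before any further normalization.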

Details

Author: agentscope-ai
Repository: agentscope-ai/OpenJudge
Created: 10 months ago
Last Updated: 1 week ago
Language: Python
License: Apache-2.0
