paper-autoraters
Install: `claude install-skill Ar9av/PaperOrchestra`
# Paper Autoraters (App. F.3)
Faithful implementation of the four LLM-as-judge autoraters used in
PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §5 and App. F.3).
These are the metrics the paper uses to demonstrate that PaperOrchestra
beats single-agent and AI-Scientist-v2 baselines. Use them to:
1. Score a generated paper against a ground-truth paper.
2. Compare two paper-writing pipelines side-by-side.
3. Validate your own host-agent execution of the paper-orchestra pipeline.
## The four autoraters
| Autorater | What it does | Inputs | Output |
|---|---|---|---|
| **Citation F1 — P0/P1 partition** | Partitions reference list into P0 (must-cite) and P1 (good-to-cite) given the paper text | one paper text + its references list | JSON `{ref_num: "P0"\|"P1"}` |
| **Literature Review Quality** | Scores Intro+Related Work on 6 axes (0-100 each), with anti-inflation hard caps | one paper (PDF or text) + the average citation count of its references | JSON with `axis_scores`, `penalties`, `summary`, `overall_score` |
| **SxS Overall Paper Quality** | Holistic side-by-side preference judgment | two papers (PDF or text) | JSON with `winner` ∈ {paper_1, paper_2, tie} |
| **SxS Literature Review Quality** | Side-by-side preference, Intro+Related Work only | two papers | JSON with `winner` ∈ {paper_1, paper_2, tie} |
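Since every autorater replies in JSON, it can help to validate the reply before using it. The following is a minimal sketch, not code from the paper or this skill: the helper names (`parse_partition`, `parse_winner`, `tally_winners`) are hypothetical, and it assumes the judge returns a bare JSON object with exactly the labels listed in the table above.

```python
import json
from collections import Counter
from typing import Dict, Iterable

# Label sets taken from the table above.
VALID_TIERS = {"P0", "P1"}
VALID_WINNERS = {"paper_1", "paper_2", "tie"}


def parse_partition(raw: str) -> Dict[str, str]:
    """Parse a P0/P1 partition reply (hypothetical helper) and reject unknown labels."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping ref_num to a tier")
    bad = {k: v for k, v in data.items()
           if not isinstance(v, str) or v not in VALID_TIERS}
    if bad:
        raise ValueError(f"unexpected tier labels: {bad}")
    return dict(data)


def parse_winner(raw: str) -> str:
    """Extract and validate the `winner` field of a side-by-side reply."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object with a 'winner' field")
    winner = data.get("winner")
    if not isinstance(winner, str) or winner not in VALID_WINNERS:
        raise ValueError(f"unexpected winner: {winner!r}")
    return winner


def tally_winners(raw_replies: Iterable[str]) -> Counter:
    """Count validated verdicts across several side-by-side comparisons."""
    return Counter(parse_winner(raw) for raw in raw_replies)
```

For example, `tally_winners` applied to the replies from a batch of paper pairs gives the win, loss, and tie counts needed to compare two pipelines.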
The paper uses Gemini-3.1-Pro and GPT-5 as judges. Gemini runs at
temperature 0.0; GPT-5 runs at its default of 1.0 because it does not allow
temperature adjustment. With this skill, use whichever LLM your host agent
provides.
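The judge settings described here could be captured in a small lookup table. This is only a sketch under stated assumptions: the table keys are illustrative labels rather than real API model IDs, `sampling_kwargs` is a hypothetical helper, and leaving the temperature unset for an unknown host LLM is a design choice of this example, not something the paper prescribes.

```python
from typing import Any, Dict, Optional

# Illustrative labels only (assumption), not provider model IDs.
# None means "do not send a temperature; the provider default applies".
JUDGE_TEMPERATURE: Dict[str, Optional[float]] = {
    "gemini-3.1-pro": 0.0,  # the paper sets Gemini to temperature 0.0
    "gpt-5": None,          # not adjustable; runs at its default of 1.0
}


def sampling_kwargs(judge: str,
                    default_temperature: Optional[float] = None) -> Dict[str, Any]:
    """Return only the sampling parameters that should be sent explicitly."""
    temperature = JUDGE_TEMPERATURE.get(judge, default_temperature)
    return {} if temperature is None else {"temperature": temperature}
```

Returning an empty dict for non-adjustable judges keeps the caller from sending a parameter the model would reject.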
## Wor