ref-hallucination-arena

Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode. Use when the user asks to evaluate, benchmark, or compare models on academic reference hallucination, literature recommendation quality, or citation accuracy.

AI & Automation · 625 stars · 54 forks · Updated 1 week ago · Apache-2.0

Quality Score: 90/100

Stars 20%
Recency 20%
Frontmatter 20%
Documentation 15%: 100
Issue Health 10%
License 10%: 100
Description 5%: 100

Skill Content

# Reference Hallucination Arena Skill

Evaluate how accurately LLMs recommend real academic references using the OpenJudge `RefArenaPipeline`:

1. **Load queries**: from a JSON/JSONL dataset
2. **Collect responses**: BibTeX-formatted references from target models
3. **Extract references**: parse BibTeX entries from model output
4. **Verify references**: cross-check against Crossref / PubMed / arXiv / DBLP
5. **Score & rank**: compute verification rate, per-field accuracy, discipline breakdown
6. **Generate report**: Markdown report + visualization charts

## Prerequisites

```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for ref_hallucination_arena (chart generation)
pip install matplotlib
```

## Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| Config YAML path | Yes | Defines endpoints, dataset, verification settings |
| Dataset path | Yes | JSON/JSONL file with queries (can be set in config) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| CrossRef email | No | Improves API rate limits for verification |
| PubMed API key | No | Improves PubMed rate limits |
| Output directory | No | Default: `./evaluation_results/ref_hallucination_arena` |
| Report language | No | `"en"` (default) or `"zh"` |
| Tavily API key | No | Required only if using tool-augmented mode |

## Quick start

### CLI

```bash
# Run evaluation with config file
python -m cookbooks.ref_hallucination_arena --...
```
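To make the extract, verify, and score steps of the pipeline more concrete, here is a minimal standalone sketch using only the Python standard library. It is not the OpenJudge implementation and does not call `RefArenaPipeline`. The function names, the `KNOWN_RECORDS` table (an offline stand-in for Crossref / PubMed / arXiv / DBLP lookups), the sample entries, and the metric definitions (hallucination rate as the share of unverified references, year accuracy over verified references only) are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for simple, one-field-per-line BibTeX entries.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.DOTALL)
FIELD_RE = re.compile(r"^\s*(\w+)\s*=\s*\{(.*?)\}\s*,?\s*$", re.MULTILINE)


def extract_references(text):
    """Parse BibTeX entries from model output into a list of dicts."""
    refs = []
    for entry_type, key, body in ENTRY_RE.findall(text):
        fields = {name.lower(): value.strip()
                  for name, value in FIELD_RE.findall(body)}
        refs.append({"type": entry_type.lower(), "key": key, **fields})
    return refs


def normalize(title):
    """Lowercase and collapse punctuation so titles compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def score(refs, index):
    """Compute assumed metrics against an index of known records."""
    verified = 0
    year_hits = 0
    for ref in refs:
        record = index.get(normalize(ref.get("title", "")))
        if record is None:
            continue  # no matching record: counted as hallucinated
        verified += 1
        if ref.get("year") == record["year"]:
            year_hits += 1
    total = len(refs)
    return {
        "total": total,
        "verified": verified,
        "hallucination_rate": (total - verified) / total if total else 0.0,
        "year_accuracy": year_hits / verified if verified else 0.0,
    }


# Placeholder data: stands in for what a bibliographic database would return.
KNOWN_RECORDS = {
    normalize("Known Paper A"): {"year": "2017"},
    normalize("Known Paper B"): {"year": "2019"},
}

SAMPLE = """
@article{lee2017,
  title = {Known Paper A},
  author = {Lee, A.},
  year = {2017}
}

@article{kim2020,
  title = {Known Paper B},
  author = {Kim, B.},
  year = {2020}
}

@article{doe2021,
  title = {A Paper That Does Not Exist},
  author = {Doe, J.},
  year = {2021}
}
"""

refs = extract_references(SAMPLE)
report = score(refs, KNOWN_RECORDS)
print(report)
```

In this sketch a reference counts as verified only when its normalized title matches a known record. A real verifier would likely also compare authors and DOIs and tolerate small title differences.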

Details

Author: agentscope-ai
Repository: agentscope-ai/OpenJudge
Created: 10 months ago
Last Updated: 1 week ago
Language: Python
License: Apache-2.0

Integrates with

OpenAI · AI
