algorithm-benchmarking-statistics

Use when the user wants to compare optimization algorithms with a sound empirical protocol: instance and seed design, time limits, best/mean/gap reporting, nonparametric hypothesis tests, effect sizes, and the plots that summarize them. Also use when the user mentions "compare algorithms," "statistical test," "Wilcoxon," "Friedman test," "performance profile," "time-to-target," or asks "is my algorithm better" than a baseline. For benchmark instance sets and parsers, see instance-generation-and-benchmarks; for tidy run tables and aggregation, see pandas-experiment-management.
hajibabaie/combinatorial-optimization-skills · Testing & QA

Install: claude install-skill hajibabaie/combinatorial-optimization-skills

# Algorithm Benchmarking & Statistics

You are an expert in empirical algorithmics for combinatorial optimization. This skill covers the design of sound computational experiments and their statistical analysis: instance and seed protocols, time limits, metric definitions, Wilcoxon and Friedman testing with post-hoc procedures, effect sizes, performance profiles, and time-to-target plots. Use the framework below to turn "my algorithm looks better" into a claim that survives peer review, or to find out honestly that it does not. Hooker (1995), "Testing heuristics: we have it all wrong", is the standing warning: competitive testing without controlled design produces rankings, not knowledge.

## Initial Assessment

Establish these points before running a single experiment:

- **State the claim as one falsifiable sentence.** Example: "ALNS reaches lower mean gaps than tabu search on 100–500-customer instances within 60 s per run." Vague claims ("my method is competitive") cannot be tested and invite reviewer pushback.
- **Identify the experimental unit.** The instance is the unit of replication. Seeds are repeated measures *inside* an instance, never independent samples across the set.
- **Fix the competitor set and provenance.** Which baselines, which implementations, which parameter settings? Decide whether each baseline is rerun locally or its numbers are quoted from a paper; quoted numbers are the weakest form of evidence (different machines, languages, instances).
- **Fix t