cortex-eval

Featured

Evaluate model performance — check for accuracy drops, data drift, and error patterns. Use when asked about "model accuracy dropped", "evaluate the model", "check for drift", or "model performance".

AI & Automation · 2,274 stars · 319 forks · Updated today · MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Evaluate Model Performance

You are Cortex, the ML/AI engineer on the Engineering Team.

Follow the output format defined in docs/output-kit.md: 40-line CLI max, box-drawing skeleton, unified severity indicators, compressed prose.

## Steps

### Step 0: Run Static Analysis

Before any LLM-based evaluation, run the static analysis scanner to find LLM usage anti-patterns and prompt quality issues:

```bash
# From the project root (or team/cortex/scripts/)
python team/cortex/scripts/cortex_agent/eval_scan.py . --out .reports/cortex-eval-latest.json
```

Or with selective scans:

```bash
# LLM usage only (finds missing error handling, unbounded costs, hardcoded models)
python team/cortex/scripts/cortex_agent/eval_scan.py . --skip-prompts

# Prompt evaluation only (finds injection risks, length issues, missing format instructions)
python team/cortex/scripts/cortex_agent/eval_scan.py . --skip-usage
```

Review the JSON report at `.reports/cortex-eval-<ts>.json`. Exit code 2 means HIGH or CRITICAL findings exist; these should be addressed before continuing.

### Step 1: Detect ML Environment

Scan the project to understand the ML stack and current model:

```bash
# Check for model artifacts, training scripts, metrics logs
ls -la model* *.pkl *.joblib *.onnx *.pt *.h5 2>/dev/null
ls -la train* evaluate* metrics* 2>/dev/null
cat requirements.txt 2>/dev/null | grep -iE "sklearn|torch|tensorflow|xgboost|lightgbm|mlflow|wandb"
cat pyproject.toml 2>/dev/null | grep -iE "sklearn|torch...
```
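
The exit-code rule from Step 0 (exit code 2 means HIGH or CRITICAL findings) could be wrapped in a small gate so later steps only run after a clean scan. This is a minimal sketch, not part of the skill itself: `run_scan_gate` and `BLOCKING_EXIT_CODE` are hypothetical names, and treating every other non-zero exit code as a failure is an assumption, since the skill text only defines code 2.

```python
import subprocess
import sys
from typing import Sequence

# From the skill text: exit code 2 means HIGH or CRITICAL findings exist.
BLOCKING_EXIT_CODE = 2


def run_scan_gate(command: Sequence[str]) -> bool:
    """Run a scan command and report whether it is safe to continue.

    Hypothetical helper: returns True only when the command exits with 0.
    """
    result = subprocess.run(command, check=False)
    if result.returncode == BLOCKING_EXIT_CODE:
        print("HIGH or CRITICAL findings reported; address them before continuing.")
        return False
    if result.returncode != 0:
        # Assumption: the skill text does not define other non-zero codes,
        # so this sketch treats them conservatively as a failed scan.
        print(f"Scanner exited with undocumented code {result.returncode}.")
        return False
    return True


# Example invocation, using the scanner command shown in Step 0:
# ok = run_scan_gate([
#     sys.executable, "team/cortex/scripts/cortex_agent/eval_scan.py",
#     ".", "--out", ".reports/cortex-eval-latest.json",
# ])
```

Passing `check=False` keeps `subprocess.run` from raising on a non-zero exit, so the return code can be inspected directly.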

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

cortex-model

Build an ML pipeline — from data to trained model to serving endpoint. Use when asked to "build ML model", "train a model", "prediction pipeline", "classification", or "regression".

2,274 Updated today

jeremylongshore

AI & Automation Featured

cortex-recon

ML reconnaissance — inventory all models, pipelines, data sources, and monitoring. Use when asked "what ML do we have", "model inventory", or "ML assessment".

2,274 Updated today

jeremylongshore

AI & Automation Listed

evaluate-model

Load the latest model checkpoint, run evaluation on the test set, and generate a metrics report with confusion matrix. Use this after training to assess model performance or to re-evaluate a specific checkpoint.

1 Updated today

morganmuli

AI & Automation Solid

cortex

ML/AI engineer — LLM integrations, prompt engineering, model pipelines, evals, RAG.

2,274 Updated today

jeremylongshore

AI & Automation Solid

model-evaluation

Generates a Jupyter notebook that evaluates a fine-tuned SageMaker model using LLM-as-a-Judge. Use when the user says "evaluate my model", "how did my model perform", "compare models", or after a training job completes. Supports built-in and custom evaluation metrics, evaluation dataset setup, and judge model selection.

765 Updated 2 days ago

awslabs