model-evaluation

Solid

Generates a Jupyter notebook that evaluates a fine-tuned SageMaker model using LLM-as-a-Judge. Use when the user says "evaluate my model", "how did my model perform", "compare models", or after a training job completes. Supports built-in and custom evaluation metrics, evaluation dataset setup, and judge model selection.

AI & Automation · 765 stars · 108 forks · Updated 2 days ago · Apache-2.0

Quality Score: 95/100

Stars (20%): 96
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100
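
The page does not say how the component scores combine into the headline figure. As a rough check, the sketch below assumes a plain weighted mean of the listed components. That is an assumption for illustration, not the site's documented formula. Under that assumption the listed numbers give 88.2 rather than 95, so the headline score is apparently computed some other way.

```python
# (weight in percent, score) pairs copied from the breakdown on this page.
components = {
    "Stars": (20, 96),
    "Recency": (20, 100),
    "Frontmatter": (20, 70),
    "Documentation": (15, 100),
    "Issue Health": (10, 50),
    "License": (10, 100),
    "Description": (5, 100),
}

# Assumption: a plain weighted mean. The site's real formula is not stated.
total_weight = sum(weight for weight, _ in components.values())
weighted_mean = sum(weight * score for weight, score in components.values()) / total_weight

print(total_weight)   # 100
print(weighted_mean)  # 88.2
```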

Skill Content

# Model Evaluation Code Generator

Generate a Jupyter notebook that evaluates a SageMaker fine-tuned model using LLM-as-Judge via sagemaker-python-sdk v3.

## Principles

1. **One thing at a time.** Each response advances exactly one decision. Never combine multiple questions or recommendations in a single turn.
2. **Confirm before proceeding.** Wait for the user to agree before moving to the next step. You are a guide, not a runaway train.
3. **Don't read files until you need them.** Only read reference files when you've reached the workflow step that requires them and the user has confirmed the direction. Never read ahead.
4. **No narration.** Don't explain what you're about to do or what you just did. Share outcomes and ask questions. Keep responses short and focused.
5. **No repetition.** If you said something before a tool call, don't repeat it after. Only share new information.

## Workflow

### Step 0: Check for prior context

Before starting the conversation, silently check for `workflow_state.json` in the project directory. If it exists, read it and remember any useful information (such as model package ARN, model package group name, training job name, dataset paths).

### Step 1: Understand the task

For this step, you need: **what task the model is trained to do.** If you know this already, skip this step. If not, ask the user:

> "What task is this model trained to do?"

⏸ Wait for user.

### Step 2: Get evaluation dataset

For this step, you need: **the evaluatio...
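
The Step 0 check described in the skill content could look roughly like the following. This is a minimal sketch, not the skill's own implementation: the skill text does not document the schema of `workflow_state.json`, so the key names in `CONTEXT_KEYS` and the helper name `load_prior_context` are hypothetical, chosen only to mirror the examples the text gives (model package ARN, model package group name, training job name, dataset paths).

```python
import json
from pathlib import Path

# Hypothetical key names; the skill does not document the file's schema.
CONTEXT_KEYS = (
    "model_package_arn",
    "model_package_group_name",
    "training_job_name",
    "dataset_paths",
)


def load_prior_context(project_dir="."):
    """Return known context from workflow_state.json, or {} if absent or unreadable."""
    state_file = Path(project_dir) / "workflow_state.json"
    if not state_file.is_file():
        return {}
    try:
        state = json.loads(state_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # A missing or malformed file simply means there is no prior context.
        return {}
    if not isinstance(state, dict):
        return {}
    # Keep only the fields that later workflow steps might reuse.
    return {key: state[key] for key in CONTEXT_KEYS if key in state}
```

Returning an empty dictionary on any failure keeps the check silent, which matches the instruction to look for prior context without involving the user.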

Details

Author: awslabs
Repository: awslabs/agent-plugins
Created: 3 months ago
Last Updated: 2 days ago
Language: Shell
License: Apache-2.0

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

evaluating-machine-learning-models

This skill allows an AI assistant to evaluate machine learning models using a comprehensive suite of metrics. It should be used when the user requests model performance analysis, validation, or testing. The AI assistant can use this skill to assess model accuracy, p... Use when appropriate context is detected. Trigger with relevant phrases based on skill purpose.

2,210 Updated 1 week ago
foryourhealth111-pixel
AI & Automation Solid

evaluating-machine-learning-models

This skill allows Claude to evaluate machine learning models using a comprehensive suite of metrics. It should be used when the user requests model performance analysis, validation, or testing. Claude can use this skill to assess model accuracy, precision, recall, F1-score, and other relevant metrics. Trigger this skill when the user mentions "evaluate model", "model performance", "testing metrics", "validation results", or requests a comprehensive "model evaluation".

2,274 Updated today
jeremylongshore
AI & Automation Featured

cortex-eval

Evaluate model performance — check for accuracy drops, data drift, and error patterns. Use when asked about "model accuracy dropped", "evaluate the model", "check for drift", or "model performance".

2,274 Updated today
jeremylongshore
AI & Automation Solid

model-evaluation-metrics

Build model evaluation metrics operations. Auto-activating skill for ML Training. Triggers on: model evaluation metrics. Part of the ML Training skill category. Use when working with model evaluation metrics functionality. Trigger with phrases like "model evaluation metrics", "model metrics", "model".

2,274 Updated today
jeremylongshore
AI & Automation Listed

model-evaluation

Model evaluation in R with performance metrics, calibration, ROC analysis, decision curves, and validation.

4 Updated 4 days ago
choxos