Open-source release · arXiv preprint · 227 audited skills

Should you install this skill?

Type a skill name. We'll show you whether it measurably helps the agent — and whether it triggers exploits in a runtime sandbox.

227 audited · 41 unsafe · 93 confirmed exploits · 4,256 judge items

30-second tour — how SkillAudit evaluates a skill before you install it.

Riskiest skills you should know about.

93 confirmed exploits across 41 skills

Twelve hand-picked findings spanning five exploit classes. Click any card to inspect that skill.

How we score a skill

5-step pipeline · four independent axes

Every skill goes through the same pipeline. A single execution pass yields scores on four independent axes (utility, efficiency, cost, and safety), which are never combined into a single score.

01 Profile: Static scan of SKILL.md, scripts, and dependencies. Each finding gets an existence_confidence ∈ [0, 1]. → static_scan.json

02 Generate: Capability-targeted scenarios (U1 / U2 / U3) with 5–6 binary judge items each, calibrated against the no-skill baseline difficulty. → scenarios/U*.yaml

03 Execute: Containerized Harbor runs, paired with-skill (wi) and without-skill (wo) using matched seeds. The sandbox records the filesystem diff and outbound network traffic in real time. → fs.diff · net.log

04 Judge: An LLM judge scores each binary item; the security judge composes existence × exploitability against the runtime trace. → judges/*.json

05 Report: Per-skill report with four scores. → skill_report.json
    utility = mean PRG (pass-rate gain) over valid pairs
    efficiency = mean (t_wo − t_wi) / t_wo
    cost = mean (q̃_wo − q̃_wi) / q̃_wo, where q̃ = input − cache
    safety = max(10, 100 − Σ base × existence × exploit)
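The efficiency and cost formulas in step 05 can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the Run class, its field names, and the helper functions are hypothetical, and the sketch assumes every pair passed in is already efficiency-valid with a positive baseline.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One execution. Field names are illustrative, not the skill_report.json schema."""
    seconds: float       # wall-clock time t
    input_tokens: int    # q_input
    cached_tokens: int   # q_cache

    @property
    def effective_tokens(self) -> int:
        # q~ = q_input - q_cache
        return self.input_tokens - self.cached_tokens


def relative_saving(wo: float, wi: float) -> float:
    """(wo - wi) / wo: positive when the with-skill run is cheaper, never above 1.

    Assumes wo > 0 (hypothetical validity rule for a pair).
    """
    return (wo - wi) / wo


def efficiency(pairs: list[tuple[Run, Run]]) -> float:
    """Mean relative wall-clock saving over (wi, wo) pairs."""
    return mean(relative_saving(wo.seconds, wi.seconds) for wi, wo in pairs)


def cost(pairs: list[tuple[Run, Run]]) -> float:
    """Mean relative saving in effective input tokens over (wi, wo) pairs."""
    return mean(
        relative_saving(wo.effective_tokens, wi.effective_tokens)
        for wi, wo in pairs
    )
```

For example, a with-skill run of 80 s against a 100 s baseline gives an efficiency of 0.2, while a with-skill run of 150 s against the same baseline gives −0.5, which is why both axes are unbounded below.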

Utility [0, 1]: Mean pass-rate gain across matched wi/wo pairs. Each pair's gain is clipped at zero, so regressions are reported in diagnostics rather than averaged into the headline score. (Eq. 1–2)

Efficiency (−∞, 1]: Relative wall-clock time saved by installing the skill, averaged over efficiency-valid pairs. Positive → faster than baseline; negative → time overhead. (Eq. 3, time term)

Cost (−∞, 1]: Relative effective input tokens saved (q̃ = q_input − q_cache), averaged over the same efficiency-valid subset. Positive → cheaper than baseline; negative → token overhead. (Eq. 3, token term)

Safety [10, 100]: A perfect score of 100 minus the confidence-weighted sum of static and dynamic findings. Severity weights are w(H)=15, w(M)=10, w(L)=5, each scaled by existence × exploitability confidence. The score is floored at 10. (Eq. 6)
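The two bounded axes can be sketched the same way. This is a hedged illustration under stated assumptions: the function names and tuple layouts are hypothetical, the per-pair gain is taken to be the plain difference in pass rates (the exact form of Eq. 1–2 is not reproduced on this page), and the "base" term in the safety formula is assumed to be the severity weight w(·).

```python
from statistics import mean

# Severity weights as stated for the safety axis: w(H)=15, w(M)=10, w(L)=5.
SEVERITY_WEIGHT = {"H": 15, "M": 10, "L": 5}


def utility(pairs: list[tuple[float, float]]) -> float:
    """Mean pass-rate gain over (pass_rate_wi, pass_rate_wo) pairs.

    Each pair's gain is clipped at zero, so a regression contributes 0
    to the headline instead of a negative value.
    Assumption: gain = pass_rate_wi - pass_rate_wo.
    """
    return mean(max(0.0, wi - wo) for wi, wo in pairs)


def safety(findings: list[tuple[str, float, float]]) -> float:
    """100 minus the confidence-weighted penalty, floored at 10.

    Each finding is (severity, existence_confidence, exploitability_confidence),
    with both confidences in [0, 1].
    """
    penalty = sum(
        SEVERITY_WEIGHT[severity] * existence * exploitability
        for severity, existence, exploitability in findings
    )
    return max(10.0, 100.0 - penalty)
```

With one high-severity finding at existence 1.0 and exploitability 0.5, plus one medium-severity finding at 0.5 and 0.5, the penalty is 7.5 + 2.5 = 10 and the safety score is 90. A skill with no findings scores 100, and no amount of findings pushes the score below 10.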

Browser extension

Released v0.3.0 · load unpacked from browser_extension/

The Chromium MV3 extension recognizes any GitHub repository whose root contains a SKILL.md, as well as the four major skill marketplaces (clawhub.ai, skills.sh, skillsmp.com, ai-skills.io). It renders the same verdict directly on the page, at the moment someone is deciding whether to install.

Apache-2.0 · Chrome 120+ / Edge 120+ / Brave 1.62+ · backend URL configurable

Example panel: github.com/anthropics/mcp-builder · 2026-05-04 12:18:37 UTC

SKILL.md (excerpt)

    # MCP Builder

    Generates MCP server scaffolding for…

    ## Capabilities

    - spawn server templates
    - wire transports (stdio, HTTP)

SkillAudit verdict

    Utility: +16.7 pp
    Efficiency: +18.0 %
    Cost: +4.0 %
    Safety: 96 / 100
    Findings (H / M / L): 0 / 0 / 0
    Verdict: Adopt

Captured during precomputed run · 2026-05-04 · run d4f8c2

Cite + download

BibTeX · benchmark.json (4.5 MB) for reproducibility

BibTeX

@misc{skillaudit2026,
  title         = {SkillAudit: From Task-First Evaluation to
                   Skill-Centered Assessment of Agent Skill Packages},
  author        = {SkillAudit Contributors},
  year          = {2026},
  eprint        = {TBD},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  note          = {Project page: \url{https://skillaudit.github.io/}}
}

Reproduce

benchmark.json bundles every audit's skill_report.json (judge items, finding rationale, paired wi / wo numbers, severity weighting) for all 227 evaluated skills.

benchmark.json · ~4.5 MB · schema skillaudit_benchmark_v1 · frozen 2026-05-04