Noxaudit

Benchmark Results

We benchmarked 10 models on real repositories to understand which ones find genuine issues and which mostly generate noise. Total spend: $2.13.

Methodology

  • Repos: python-dotenv (34 files, ~52K tokens) and noxaudit (88 files, ~126K tokens)
  • Focus: All 7 areas (security, docs, patterns, testing, hygiene, dependencies, performance)
  • Method: Batch API on all providers (50% discount), 1 run per model per repo
  • Quality validation: Cross-model consensus — issues found by 4+ models are considered "real"
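
Noxaudit's own batching code isn't shown here; as a rough illustration of the workflow, the sketch below submits one audit request per model/repo pair through the OpenAI Batch API (the other providers' batch APIs follow the same pattern). The file names, prompt text, and model subset are placeholders, not the actual benchmark harness.

```python
# Illustrative only: one batch request per (model, repo) pair via the OpenAI Batch API.
# File names, prompt text, and the model list are placeholders, not noxaudit's real code.
import json
from openai import OpenAI

client = OpenAI()
models = ["gpt-5-mini", "gpt-5-nano"]  # subset of the benchmarked models
repos = {"python-dotenv": "dotenv_dump.txt", "noxaudit": "noxaudit_dump.txt"}

with open("requests.jsonl", "w") as f:
    for model in models:
        for repo, dump_path in repos.items():
            body = {
                "model": model,
                "messages": [
                    {"role": "system", "content": "Audit this repository for issues."},
                    {"role": "user", "content": open(dump_path).read()},
                ],
            }
            f.write(json.dumps({
                "custom_id": f"{model}:{repo}",  # lets results be mapped back per model/repo
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": body,
            }) + "\n")

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # batch pricing is roughly half of the live API
)
print(batch.id)
```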

Scorecard

| Model | dotenv findings | noxaudit findings | Total | Cost | $/finding |
|---|---|---|---|---|---|
| gpt-5-nano | 4 | 6 | 10 | $0.01 | $0.001 |
| gpt-5-mini | 15 | 24 | 39 | $0.03 | $0.001 |
| gemini-2.5-flash | 18 | 16 | 34 | $0.07 | $0.002 |
| gemini-3-flash-preview | 8 | 10 | 18 | $0.10 | $0.005 |
| claude-haiku-4-5 | 24 | 15 | 39 | $0.11 | $0.003 |
| o4-mini | 8 | 6 | 14 | $0.20 | $0.014 |
| gpt-5.4 | 32 | 52 | 84 | $0.26 | $0.003 |
| gemini-2.5-pro | 17 | 21 | 38 | $0.33 | $0.009 |
| claude-sonnet-4-6 | 30 | 48 | 78 | $0.38 | $0.005 |
| claude-opus-4-6 | 40 | 51 | 91 | $0.65 | $0.007 |
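
The $/finding column is just cost divided by total findings, rounded to three decimals. A quick sanity check using figures from the table above:

```python
# Recompute the $/finding column from the Cost and Total columns above.
scorecard = {
    "gpt-5-mini": (0.03, 39),
    "gpt-5.4": (0.26, 84),
    "claude-opus-4-6": (0.65, 91),
}
for model, (cost, total) in scorecard.items():
    print(f"{model}: ${cost / total:.3f} per finding")
# gpt-5-mini: $0.001 per finding
# gpt-5.4: $0.003 per finding
# claude-opus-4-6: $0.007 per finding
```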

Quality Validation

python-dotenv served as a canary: 6 issues were flagged by at least 4 of the 10 models and confirmed as real.

| Issue | Models | Verdict |
|---|---|---|
| `get_cli_string` shell injection risk | 8/10 | Genuine security concern |
| `test_list` uses builtin `format` instead of `output_format` | 6/10 | Actual code bug |
| Duplicate files (README/CHANGELOG/CONTRIBUTING in docs/) | 6/10 | Maintenance burden |
| Broken mkdocs link (empty href) | 5/10 | Broken documentation |
| Unpinned dev dependencies | 5/10 | Reproducibility issue |
| Incorrect pre-commit command (precommit vs pre-commit) | 4/10 | Wrong package name |
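
A minimal sketch of the consensus rule, assuming findings have already been de-duplicated and matched across models (the hard part in practice): an issue counts as confirmed when at least 4 of the 10 models report it. The model names and issue keys below are placeholders, not the benchmark's actual matching data.

```python
# Cross-model consensus: an issue is "confirmed" when at least 4 of the 10 models report it.
# The findings mapping is illustrative placeholder data; the real matching was done by hand.
from collections import defaultdict

CONSENSUS_THRESHOLD = 4
NUM_MODELS = 10

# findings[model] is the set of normalized issue keys that model reported.
findings = {
    "model_a": {"shell-injection", "unpinned-dev-deps", "duplicate-docs"},
    "model_b": {"shell-injection", "unpinned-dev-deps"},
    "model_c": {"shell-injection", "style-nit-42"},
    "model_d": {"shell-injection", "unpinned-dev-deps", "broken-mkdocs-link"},
    "model_e": {"shell-injection"},
}

votes = defaultdict(set)
for model, issues in findings.items():
    for issue in issues:
        votes[issue].add(model)

confirmed = sorted(issue for issue, models in votes.items()
                   if len(models) >= CONSENSUS_THRESHOLD)
for issue in confirmed:
    print(f"{issue}: {len(votes[issue])}/{NUM_MODELS} models")
# shell-injection: 5/10 models
```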

Per-Model Quality

| Model | Consensus | Noise | Cost | Verdict |
|---|---|---|---|---|
| claude-sonnet-4-6 | 6/6 | Low | $0.38 | Best precision |
| gpt-5.4 | 5/6 | Low | $0.26 | Best mid-tier |
| gpt-5-mini | 5/6 | Low | $0.03 | Best daily value |
| claude-opus-4-6 | 6/6 | Moderate | $0.65 | Most findings overall |
| claude-haiku-4-5 | 4/6 | Moderate | $0.11 | Decent but pads with nits |
| gemini-2.5-pro | 3/6 | Low | $0.33 | Poor value vs gpt-5.4 |
| o4-mini | 3/6 | Moderate | $0.20 | Reasoning tokens wasted |
| gemini-2.5-flash | 2/6 | Moderate | $0.07 | Misses too much |
| gemini-3-flash-preview | 2/6 | Low | $0.10 | Preview quality |
| gpt-5-nano | 2/6 | Low | $0.01 | Too shallow |

Recommended Tiers

| Tier | Model | Cost/Run | Rationale |
|---|---|---|---|
| Daily | gpt-5-mini | $0.03 | 5/6 consensus issues, minimal noise, cheapest viable model |
| Deep dive | gpt-5.4 | $0.26 | 84 findings total, beats Sonnet quality at 68% of the cost |
| Premium | claude-opus-4-6 | $0.65 | Most findings overall, best for maximum depth |
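
Noxaudit's actual configuration format isn't documented here; as a purely hypothetical sketch, one way to encode these tiers is a small lookup table that a scheduler or CLI wrapper could consult:

```python
# Hypothetical tier table based on the recommendations above; noxaudit's real
# config format may differ. Budgets are per run, matching the Cost/Run column.
TIERS = {
    "daily":     {"model": "gpt-5-mini",      "budget_usd": 0.03},
    "deep-dive": {"model": "gpt-5.4",         "budget_usd": 0.26},
    "premium":   {"model": "claude-opus-4-6", "budget_usd": 0.65},
}

def pick_model(tier: str = "daily") -> str:
    """Return the model name for the requested tier (defaults to the daily tier)."""
    return TIERS[tier]["model"]

print(pick_model())             # gpt-5-mini
print(pick_model("deep-dive"))  # gpt-5.4
```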

Dropped Models

  • o3: 0 findings on python-dotenv and 7 on noxaudit at $0.33; reasoning tokens were wasted on a non-reasoning task. Removed.
  • gemini-2.0-flash: Deprecated; returns errors from the batch API.

All costs include the 50% batch API discount. Different models genuinely find different things: only 6 issues had cross-model consensus.