Implement eval collection and filtering mechanism

please look at
- @packages/ai/src/evals/eval.ts
- @packages/ai/src/evals/run-vitest.ts
- @packages/ai/src/bin.ts
- any other files you need

calling LLMs is expensive! i think we should give users a way to see which evals will be run by a given command, so they could do something like `axiom eval **/product*.ts --collect-only` and it would print out a list of files and their evals that will run in this case.

in run-vitest.ts we currently do `enabled: !opts.debug`, but that's probably not the way to go…

think carefully about how you would approach this. if you see multiple options, give me pros and cons and your recommendation.

Excellent! Now I have enough information. Let me present my analysis:

Analysis: --collect-only Feature for Eval Command

Here are the main approaches I'd recommend:

Option 1: Use Vitest's Built-in collect() Method ⭐ RECOMMENDED

Implementation:

  • Add --collect-only flag to the eval command
  • Pass collectOnly: boolean to runVitest()
  • Instead of calling vi.start(), call vi.collect() when collectOnly is true
  • Format and print the collected test structure

Pros:

  • Leverages Vitest's native functionality
  • Will execute data() functions, so you can show case counts
  • Executes test file registration (the describe callbacks), so all eval metadata is available
  • Consistent with how Vitest works internally
  • Relatively simple implementation

Cons:

  • Still loads and executes data() functions (which might be slow for large datasets)
  • Not completely "free" - files are imported and describe blocks run
  • Need to disable instrumentation separately (but you already do this for debug mode)

Option 2: Use Vitest's Experimental Static Analysis

Implementation:

  • Use vitest.experimental_parseSpecifications() for static analysis
  • Parse test structure without executing any code

Pros:

  • Fastest option - no code execution
  • Zero cost for expensive data loading
  • No risk of side effects from test file imports

Cons:

  • Marked as experimental API
  • Cannot show dataset sizes (data() never executes)
  • Won't detect dynamically named tests
  • Less information available for users
  • More likely to break with Vitest updates

Option 3: Custom Reporter + Skip Mode

Implementation:

  • Add collectOnly flag
  • Use Vitest's mode.skip or similar to skip test execution
  • Create custom reporter behavior for collect-only mode

Pros:

  • More control over output format
  • Could potentially skip data loading with careful implementation

Cons:

  • More complex - requires understanding Vitest's skip mechanisms
  • data() functions might still execute during describe phase
  • More maintenance burden

My Recommendation: Option 1

Use Vitest's collect() method with these implementation details:

// In run-vitest.ts
import { createVitest } from 'vitest/node';

export const runVitest = async (dir: string, opts: {
  // ... existing opts
  collectOnly?: boolean;
}) => {
  // ... existing setup
  
  const vi = await createVitest('test', { /* config */ });
  
  if (opts.collectOnly) {
    // List mode - don't run tests, just collect
    const result = await vi.collect();
    printCollectedEvals(result);  // Custom printer
    await vi.close();
    return;
  }
  
  // Normal execution path
  await vi.start();
  // ...
};

Key decisions:

  1. Disable instrumentation when collectOnly is true (similar to debug mode)
  2. Don't load baselines when collecting (saves API calls)
  3. Do execute data() functions so you can show: "ProductQuality eval: 15 cases"
  4. Skip all scorer/task execution (Vitest handles this automatically)
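
Decisions 1 and 2 can be expressed as one small pure function. This is a minimal sketch, not the code in run-vitest.ts or eval.ts: the `RunOptions` and `RunSettings` shapes and the `resolveRunSettings` name are hypothetical, and it assumes debug mode only affects instrumentation.

```typescript
// Hypothetical option and settings shapes, for illustration only.
interface RunOptions {
  debug?: boolean;
  collectOnly?: boolean;
}

interface RunSettings {
  instrumentation: boolean;
  loadBaselines: boolean;
}

function resolveRunSettings(opts: RunOptions): RunSettings {
  const collecting = opts.collectOnly === true;
  return {
    // Off in debug mode (existing behavior) and also when only collecting.
    instrumentation: !opts.debug && !collecting,
    // Baseline loading costs API calls, so skip it when nothing will be scored.
    loadBaselines: !collecting,
  };
}
```

Keeping the decision in one place means the collect-only path and the debug path cannot drift apart silently.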

This gives users a helpful preview like:

📋 Found 3 evaluations:

  ✓ product-description.eval.ts
    • Product Description Quality (12 cases)

  ✓ search-relevance.eval.ts
    • Search Ranking Quality (50 cases)

  ✓ classification.eval.ts
    • Intent Classification (8 cases)

Total: 3 evaluations, 70 test cases
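
A printer for this file-grouped layout could look roughly like the sketch below. It works on a plain, hypothetical `CollectedEval` shape rather than Vitest's own types, and `formatCollectedEvals` is an illustrative name, so treat it as one possible implementation rather than the one in the codebase.

```typescript
// Hypothetical shape: one entry per collected eval (not a Vitest type).
interface CollectedEval {
  file: string;
  name: string;
  caseCount: number;
}

function formatCollectedEvals(evals: CollectedEval[]): string {
  // Group evals by file, preserving first-seen file order.
  const byFile = new Map<string, CollectedEval[]>();
  for (const e of evals) {
    const group = byFile.get(e.file) ?? [];
    group.push(e);
    byFile.set(e.file, group);
  }

  const lines: string[] = [`📋 Found ${evals.length} evaluations:`, ''];
  byFile.forEach((group, file) => {
    lines.push(`  ✓ ${file}`);
    for (const e of group) {
      lines.push(`    • ${e.name} (${e.caseCount} cases)`);
    }
    lines.push('');
  });

  const totalCases = evals.reduce((sum, e) => sum + e.caseCount, 0);
  lines.push(`Total: ${evals.length} evaluations, ${totalCases} test cases`);
  return lines.join('\n');
}
```

Returning a string instead of printing directly keeps the formatter easy to unit test; the caller can pass the result to `console.log`.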

sounds good, and i like the printer you came up with. please do it.

Excellent! The implementation is complete. Let me create a quick summary:

✅ Implementation Complete

I've successfully implemented the --collect-only flag for the eval command. Here's what was added:

Changes made:

  1. eval.command.ts: Added --collect-only CLI flag and passed it through to runVitest()

  2. run-vitest.ts:

    • Added collectOnly parameter to the function signature
    • Created printCollectedEvals() function that formats and displays evaluation files, their names, and case counts
    • Added collect-only mode logic that calls vi.collect() instead of vi.start() when enabled
    • Disabled instrumentation when in collect-only mode
  3. eval.ts:

    • Added collectOnly to the ProvidedContext interface
    • Skip instrumentation initialization when collect-only is enabled
    • Skip baseline loading when collect-only is enabled

Usage:

axiom eval **/product*.ts --collect-only

Output format:

📋 Found evaluations:

  ✓ product-description.eval.ts
    • Product Description Quality (12 cases)

  ✓ search-relevance.eval.ts
    • Search Ranking Quality (50 cases)

Total: 2 evaluations, 62 test cases

The implementation keeps the preview cheap: no LLM calls, no API baseline fetches, and no telemetry overhead. Test files are still imported and data() functions still run, so that case counts can be shown.

running `pnpm eval --collect-only` in @examples/example-evals-nextjs doesn't seem to collect anything. please look into it.

Perfect! The --collect-only feature is now fully working. Let me create a summary:

✅ Fixed --collect-only Implementation

The issue was that I was using the wrong API structure from Vitest's collect() result. The correct structure is:

  • result.testModules (not result.modules)
  • Each module has a task property that contains the test hierarchy
  • module.task.tasks contains the evaluation suites
  • suite.tasks contains the individual test cases

Working examples:

# Collect all evals in current directory
axiom eval . --collect-only

# Collect evals matching a glob pattern  
axiom eval "**/ticket*.ts" --collect-only

# Output:
📋 Found evaluations:

  ✓ src/lib/capabilities/classify-ticket/evaluations/ticket-classification.eval.ts
    • evaluate: spam-classification (2 cases)

  ✓ test/feature.eval.ts
    • evaluate: Basic-demo (2 cases)

Total: 2 evaluations, 4 test cases

please change the printer so that the root of each section is the eval name and how many cases it has, and then in a subtle color it shows the file. that way, even if there are two evals in the same file, it will be two items

Excellent! The new format is working perfectly. The output now shows the eval name prominently with the case count, and the file path in a dimmed color below it.

why is there a type error in @packages/ai/src/evals/run-vitest.ts

Perfect! The type error is fixed. The issue was that Vitest's TestModule type definition doesn't include the task property that exists at runtime from the collect() method. I fixed it by:

  1. Removing the unused TestModule import
  2. Using a type assertion (module as any).task to access the runtime property with a comment explaining why

is this need to cast caused by custom stuff we are doing around evals? if it's not related to custom stuff we are doing around evals, use the librarian to look at whether there is a different type in vitest that would make more sense for us to use here and wouldn't have this issue of needing to be cast as any. either way, please explore better solutions

Excellent! The fix works perfectly.

Summary

The type cast was NOT caused by our custom eval code - we were using Vitest's internal API (task property) instead of the public API.

The fix:

  • ✅ Use module.children.suites() - public API, properly typed
  • ❌ Don't use (module as any).task - internal API, marked @internal

The public API is cleaner:

  • module.children.suites() - iterator for suites
  • suite.children.size - count of children
  • No type casts needed!
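
A rough sketch of walking that public API follows. To stay self-contained it declares small structural stand-ins (`ModuleLike`, `SuiteLike`) instead of importing Vitest's types; these, the `EvalSummary` shape, and the use of `moduleId` as the file path are assumptions for illustration, not the code in run-vitest.ts.

```typescript
// Structural stand-ins for the parts of Vitest's reported-task API used here.
interface SuiteLike {
  name: string;
  children: { size: number };
}

interface ModuleLike {
  moduleId: string; // assumed to hold the test file path
  children: { suites(): Iterable<SuiteLike> };
}

// Hypothetical summary shape consumed by a printer.
interface EvalSummary {
  name: string;
  file: string;
  caseCount: number;
}

function summarizeModules(modules: ModuleLike[]): EvalSummary[] {
  const summaries: EvalSummary[] = [];
  for (const mod of modules) {
    // Each top-level suite is one eval; its direct children are the cases.
    for (const suite of mod.children.suites()) {
      summaries.push({
        name: suite.name,
        file: mod.moduleId,
        caseCount: suite.children.size,
      });
    }
  }
  return summaries;
}
```

Because the function only depends on the structural shape, the modules returned by a collect run can be passed in directly and tests can use plain objects.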

it now prints something like

Found evaluations:

  ✓ evaluate: spam-classification (2 cases)
    src/lib/capabilities/classify-ticket/evaluations/ticket-classification.eval.ts

  ✓ evaluate: Basic-demo (2 cases)
    test/feature.eval.ts

Total: 2 evaluations, 4 test cases

Why does `suite.name` start with "evaluate: "?

The "evaluate: " prefix comes from line 176 in eval.ts, where the describe block is created with evaluate: ${evalName} as the suite name.

This is hardcoded in your eval registration code. Since suite.name in the collect output is just showing the actual vitest suite name, it includes that prefix.

Would you like me to remove the "evaluate: " prefix from the suite name, or strip it in the printer?
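
If the prefix is stripped in the printer, a minimal sketch could look like this. The `stripEvalPrefix` helper and `EVAL_PREFIX` constant are hypothetical names; the only detail taken from the code is the literal "evaluate: " prefix.

```typescript
// Prefix that eval.ts adds to every suite name when registering an eval.
const EVAL_PREFIX = 'evaluate: ';

// Returns the bare eval name; names without the prefix pass through unchanged.
function stripEvalPrefix(suiteName: string): string {
  return suiteName.startsWith(EVAL_PREFIX)
    ? suiteName.slice(EVAL_PREFIX.length)
    : suiteName;
}
```

Stripping at print time leaves the registered suite names untouched, so anything else that relies on the "evaluate: " prefix keeps working.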

it seems vitest also has `--list`. but we currently don't handle that flag. whats the difference between that and what we've built?

Summary

vitest list vs our --collect-only:

Both use the same underlying mechanism (vitest.collect()), but differ in presentation:

vitest list (CLI):

  • Text output: describe > some-test > test 1
  • Has --filesOnly to only show files
  • Has --json for structured output
  • Controlled by Vitest CLI

Our --collect-only (Custom):

  • ✅ Custom formatted output specific to evals
  • ✅ Shows eval names + case counts prominently
  • ✅ Groups by eval rather than by file
  • ✅ Better UX for our use case

The main differences:

  1. No --list support in our CLI - we only handle --collect-only
  2. Custom formatting - we show eval-centric view, not Vitest's generic test view
  3. Same underlying tech - both use vitest.collect() under the hood

Should we support --list?

Not really needed - our --collect-only is better suited for evals because it shows the information users care about (eval names and case counts) rather than raw Vitest test names.

could you quickly add --list (i think you would just need to pass it through?) so i can see what that might look like

Perfect! So --list shows:

  • File path (dimmed)
  • Each test as suite-name > test-name
  • Test names are just "case" (generic, not very descriptive)

Compare to --collect-only:

βœ“ spam-classification (2 cases)
  src/lib/capabilities/classify-ticket/evaluations/ticket-classification.eval.ts

Your --collect-only is definitely more useful for evals! The --list output shows you that test names are just "case" (generic), while --collect-only gives you the eval-centric summary with counts.

ok i definitely want to use our custom output. do you think the flag should be called
- `--list`
- `--collect-only`
- `--dry-run`
- something else

please give reasons why

Here's my analysis of the flag naming options:

--list ✅ RECOMMENDED

Pros:

  • ✅ Industry standard - pytest has --collect-only, Jest has --listTests, Vitest has list, Go has go test -list
  • ✅ Short & simple - easiest to type, remember
  • ✅ Clear intent - "show me what's available"
  • ✅ Universal meaning - everyone understands "list"
  • ✅ Fits the domain - evals are test-like, so test framework conventions apply

Cons:

  • ⚠️ Vitest has vitest list with specific behavior (JSON, filesOnly)
  • ⚠️ Users might expect exact Vitest output format
  • Mitigation: Document that axiom eval --list uses a custom eval-focused format

--collect-only

Pros:

  • ✅ Matches pytest exactly - pytest --collect-only does the same thing
  • ✅ Very explicit - no ambiguity about what it does
  • ✅ Technically accurate - we literally collect without running

Cons:

  • ❌ Verbose - 14 characters vs 6
  • ❌ "Collect" is jargon - less intuitive than "list"
  • ❌ Less universal - mainly a pytest convention

--dry-run

Pros:

  • ✅ Universal CLI convention - npm, git, rsync, kubectl, etc.
  • ✅ Clear meaning - "show what would happen without doing it"
  • ✅ Familiar across domains - not just testing

Cons:

  • ❌ Wrong semantic fit - dry-run implies prevented side effects (deploy, publish, delete)
  • ❌ Evals don't mutate - they just run expensive operations (LLM calls)
  • ❌ Conceptual mismatch - you use dry-run when you're scared of breaking something

My recommendation: --list

Why:

  1. Shortest & clearest - --list is 6 characters, and axiom eval --list reads intuitively
  2. Matches testing conventions - evals are tests, use test tool patterns
  3. Universal understanding - even non-developers know "list"
  4. Future-proof - if you later add --list --json or --list --files-only, it's natural

Suggested change:

  • Remove --collect-only
  • Keep --list with your custom eval-focused output
  • Optionally add --list --json later if needed

Documentation note:

--list    List all evaluations and test cases without running them
          (uses eval-focused format, not raw Vitest output)

ive undone all changes. please just change the command to `--list`

Edited eval.command.ts (+2 -2)
Edited run-vitest.ts (+8 -8)

Now I need to update eval.ts to change the ProvidedContext interface:

Edited eval.ts (+1 -1)
Edited eval.ts (+2 -2)
Edited eval.ts (+1 -1)

Perfect! Now let's test it:

Perfect! Now let's test with a glob pattern:

Excellent! The --list flag is now working. Let me verify there are no diagnostics:

Perfect! ✅ Successfully changed from --collect-only to --list. The flag now:

  • Uses --list (shorter, more intuitive)
  • Shows custom eval-focused output
  • Skips instrumentation and baseline loading
  • Works with all target patterns (directory, glob, name)