The Research Behind Mouse

Across three independent preregistered confirmatory studies (N = 67 paired trials), Mouse outperformed built-in file-editing tools along every dimension studied: capability, speed, cost, consistency, and precision.

Working Draft — January 2026 · Patent Pending

Summary of Key Findings

Five of the seven preregistered confirmatory hypotheses across the three studies were confirmed at the preregistered alpha = 0.05 threshold; one was not significant, and one was gated and therefore never tested. The table below summarizes each test, and the paragraphs that follow tell the story behind the numbers.

| Study | Hypothesis | Result | Odds the result is due to chance |
|---|---|---|---|
| Easy · BX-504D | Mouse is faster | Confirmed (3.58× faster) | < 1 in 8 million |
| Easy · BX-504D | Mouse is cheaper | Confirmed (1.58× cheaper) | < 1 in 1 million |
| Medium · BX-504B | Mouse achieves Perfect First Try more often | Confirmed (56% vs 0%) | < 1 in 8 thousand |
| Medium · BX-504B | Mouse is more likely to succeed | Not significant | — |
| Medium · BX-504B | Mouse costs less per success | Gated | — |
| Hard · BX-701R | Mouse is more likely to succeed | Confirmed (89.5% vs 0%) | < 1 in 100 thousand |
| Hard · BX-701R | Mouse is faster to succeed | Confirmed (70.8 s vs 240.0 s) | < 1 in 100 thousand |

On a straightforward deletion exercise that both arms were capable of completing (BX-504D, Easy), Mouse was more than 3.5× faster and substantially less expensive per task.

On a realistic mid-difficulty refactoring exercise (BX-504B, Medium), Mouse-enabled agents achieved a Perfect First Try rate of 56% while baseline agents never achieved a Perfect First Try across 25 attempts.

And on a hard tabular-data exercise that required moving a column from the left side of a CSV file to the right side (BX-701R, Hard), Mouse-enabled agents succeeded on 17 of 19 attempts while baseline agents failed all 19.

Methodology

The three studies share the same experimental apparatus and differ only in the task, the underlying model, the per-trial timeout, and the preregistered hypotheses. Each study used a within-subjects paired design in which the same task was attempted twice — once by a Baseline agent using GitHub Copilot's built-in file-editing tools and once by a Mouse-enabled agent that was identical except for having access to Mouse's ten tools. Both arms ran inside isolated Docker containers, were equipped with start_timer and stop_timer tools to bracket the exercise, and were given identical tool-agnostic natural-language instructions. Paired randomization between Mouse-first and Baseline-first ordering was balanced via a PRNG with preregistered seeds, and model version, temperature, and task content were held constant across conditions.
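
As a concrete illustration, a balanced order assignment under this design could be generated as in the sketch below. The function name and seed value are illustrative, not the preregistered ones.

```python
import random

# Sketch of the balanced Mouse-first / Baseline-first order assignment
# described above. The seed and names are placeholders, not the
# preregistered values.
def assign_orderings(n_pairs: int, seed: int) -> list[str]:
    half = n_pairs // 2
    orders = ["mouse_first"] * half + ["baseline_first"] * (n_pairs - half)
    random.Random(seed).shuffle(orders)  # seeded PRNG, reproducible
    return orders

# e.g. the 23 paired trials of BX-504D
print(assign_orderings(23, seed=42))
```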

Preregistration

Hypotheses, sample sizes, hierarchical gating procedures, and analysis plans were filed in the inventor's repository and locked before any data was collected. All analyses were then performed in strict accordance with those preregistered protocols, and no observations were made of the data — beyond confirming that telemetry had been saved after each containerized run — until each study was complete. The Intent-to-Treat analysis was preregistered as the governing analysis for all confirmatory claims in every study.

Telemetry & anti-cheating

Inside each container, the answer-key file was locked under 700 permissions so that the agent had neither read nor write access to it, and the telemetry file was set to 720 permissions so that the agent could write to it but could not read its contents. A pre-run setup script validated permissions, file states, and configuration on every run before the agent was handed the prompt. Success on a trial required a 100% byte-for-byte match against the answer key on a stop_timer call; when the file did not match, the timer continued to run and the stop_timer response returned diff details so the agent could keep iterating until the timeout. Chat history was exported from inside the container on completion, and a separate pull-telemetry.sh script run from outside the container by the operator retrieved all observations for analysis.
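
A minimal sketch of those pre-run checks might look like the following; the paths and exact assertions are assumptions, since the actual setup script is not reproduced on this page.

```python
import os
import stat

# Hypothetical paths; the real container layout is not described here.
ANSWER_KEY = "/study/answer_key"
TELEMETRY = "/study/telemetry.jsonl"

def validate_permissions() -> None:
    """Pre-run check: answer key locked at 700, telemetry write-only at 720."""
    key_mode = stat.S_IMODE(os.stat(ANSWER_KEY).st_mode)
    tel_mode = stat.S_IMODE(os.stat(TELEMETRY).st_mode)
    assert key_mode == 0o700, f"answer key must be 0700, got {oct(key_mode)}"
    assert tel_mode == 0o720, f"telemetry must be 0720, got {oct(tel_mode)}"

def is_success(candidate_path: str) -> bool:
    """stop_timer success criterion: 100% byte-for-byte match."""
    with open(candidate_path, "rb") as a, open(ANSWER_KEY, "rb") as b:
        return a.read() == b.read()
```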

Statistical analysis

All confirmatory tests were non-parametric and made no distributional assumptions: the McNemar exact test for paired binary outcomes, and one-sided permutation tests (with exact enumeration where tractable) for the remaining endpoints. Effect sizes are reported in the paper with 95% confidence intervals, and imputed API costs are calculated from measured token counts using Anthropic's published rate card in effect during data collection (December 2025).
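
For concreteness, minimal textbook implementations of both test families are sketched below; these are not the paper's analysis code, and the paper uses one-sided variants where preregistered (halving the two-sided tail). As a sanity check, with BX-504B's discordant counts (0 Baseline-only vs 14 Mouse-only successes) the exact two-sided McNemar p is 2/2¹⁴ ≈ 1.22 × 10⁻⁴, matching the reported value; and with 23 pairs the smallest achievable one-sided sign-flip p is 2⁻²³ ≈ 1.19 × 10⁻⁷, which matches the reported time-ratio p-value and would correspond to every pair favoring Mouse.

```python
from itertools import product
from math import comb, log

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pair counts
    (b = pairs where only Baseline succeeded, c = only Mouse)."""
    n = b + c
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def sign_flip_p(ratios: list[float]) -> float:
    """One-sided exact permutation (sign-flip) test on per-pair
    Baseline/Mouse ratios, against a null geometric mean ratio of 1."""
    logs = [log(r) for r in ratios]
    observed = sum(logs)
    hits = total = 0
    for signs in product((1, -1), repeat=len(logs)):  # exact enumeration
        total += 1
        hits += sum(s * x for s, x in zip(signs, logs)) >= observed
    return hits / total
```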

Individual Studies

The three studies are presented in order of increasing difficulty.

Easy · BX-504D

Economic Efficiency

Study BX-504D used the Claude Haiku 4.5 model and a 120-second per-trial timeout across 23 paired trials. The primary preregistered endpoint was the geometric mean cost ratio between the two conditions, and the secondary endpoint was the geometric mean time ratio.
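
The geometric mean ratio of each endpoint is simply the exponentiated mean of the per-pair log ratios; a sketch, with illustrative variable names:

```python
from math import exp, log

def geometric_mean_ratio(baseline: list[float], mouse: list[float]) -> float:
    """Geometric mean of per-pair Baseline/Mouse ratios (cost or time)."""
    logs = [log(b / m) for b, m in zip(baseline, mouse)]
    return exp(sum(logs) / len(logs))
```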

This study shows that even when built-in tools are capable of completing a task, Mouse completes it more than 3.5× faster, with a small fraction of the output tokens, and far more consistently. Across 23 trials, Baseline instances (GitHub Copilot, Haiku 4.5) never completed this straightforward deletion exercise in under 30.7 seconds (median: 36.9 seconds). Mouse's fastest time was 9.5 seconds (median: 12.2 seconds) and its slowest was 20.6 seconds; even the worst Mouse trial beat the best Baseline trial (cost-ratio p = 7.15 × 10⁻⁷; time-ratio p = 1.19 × 10⁻⁷).

Medium · BX-504B

Precision Editing

Study BX-504B used the Claude Haiku 4.5 model and a 180-second per-trial timeout across 25 paired trials. The primary preregistered endpoint was the Perfect First Try rate — the proportion of trials in which the agent's first stop_timer call returned a byte-for-byte match against the answer key. Two secondary endpoints were preregistered behind hierarchical gates: success rate within the 180-second budget (gated by the primary endpoint) and cost per success (gated in turn by the success-rate endpoint).
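
The gating logic is simple: each secondary endpoint is tested only if the endpoint gating it is significant. A sketch with placeholder names, using the study's reported ITT p-values in the usage line:

```python
ALPHA = 0.05  # preregistered threshold

def gated_results(p_primary: float, p_success: float, p_cost: float) -> dict:
    """Hierarchical gating: success rate is tested only if the primary
    endpoint is significant; cost per success only if success rate is."""
    out = {"perfect_first_try": p_primary}
    if p_primary >= ALPHA:
        out["success_rate"] = out["cost_per_success"] = "gated (not tested)"
        return out
    out["success_rate"] = p_success
    out["cost_per_success"] = p_cost if p_success < ALPHA else "gated (not tested)"
    return out

# BX-504B under ITT: primary significant, success rate not, cost stays gated.
print(gated_results(1.22e-4, 0.146, float("nan")))
```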

On a realistic refactoring exercise (uncommenting modern code blocks, deleting legacy code, and updating imports and exports across a nearly 600-line file), Mouse-enabled agents dominated the primary endpoint. Baseline instances using GitHub Copilot's built-in file-editing tools never achieved a Perfect First Try across 25 attempts, while Mouse achieved one more than half the time (14 of 25, p = 1.22 × 10⁻⁴).

The secondary success-rate endpoint did not reach significance under the preregistered Intent-to-Treat analysis (76% Mouse vs. 52% Baseline, p = 0.146), and the cost-per-success endpoint was therefore gated and not tested. A preregistered per-protocol sensitivity analysis that excluded pairs containing one or more rate-throttled degraded runs did find both secondary endpoints significant, but because the Intent-to-Treat analysis governs all confirmatory claims for this study, the per-protocol findings are reported in the paper for completeness rather than as confirmatory evidence.

Hard · BX-701R

Capability Boundary

Study BX-701R used the Claude Sonnet 4.5 model and a 240-second per-trial timeout across 19 paired trials. One primary hypothesis was preregistered — truncated time-to-success within the 240-second budget — along with one secondary hypothesis on raw success rate.

The BX-701R study demonstrates the hard limits of AI agents that use string replacement to perform edits, especially on tabular data such as CSV files, where find/replace methods are woefully inadequate for tasks like the one required here: moving a column from the left side of the table to the right. Across 19 runs, Baseline instances (GitHub Copilot, Sonnet 4.5) using built-in file-editing tools never finished the exercise within the 240-second limit, most often exhausting their context window without making a single tool call. Mouse-enabled instances succeeded on 17 of 19 tries (89.5%), with the fastest trial taking just 14.7 seconds (p = 7.63 × 10⁻⁶ for both endpoints).
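
For reference, the edit itself is trivial with structured access to the table. The sketch below (function name assumed) moves the leftmost column to the end; expressing the same change through string find/replace instead requires a distinct, exact replacement on every row of the file.

```python
import csv

def move_first_column_last(src: str, dst: str) -> None:
    """Move a CSV's leftmost column to the rightmost position."""
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    with open(dst, "w", newline="") as f:
        csv.writer(f).writerows(row[1:] + row[:1] for row in rows)
```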

Conclusion

| Dimension | Primary Effect Size | p-value |
|---|---|---|
| Precision | +56 pp risk difference | 1.22 × 10⁻⁴ (two-sided) |
| Capability | +89.5 pp risk difference | 7.63 × 10⁻⁶ (one-sided) |
| Speed | 3.6× faster | 1.19 × 10⁻⁷ (one-sided) |
| Cost | G = 1.58× | 7.15 × 10⁻⁷ (one-sided) |
| Consistency | 99× lower variance | < 0.001 (Levene) |

Tool architecture is an independent performance lever

The conventional assumption is that AI-agent performance is determined primarily by the underlying language model. The results across these three studies demonstrate that tool architecture is an independent performance lever, and that agent outcomes can be substantially improved without changing the model simply by giving the agent better-engineered tools. The pattern across the difficulty spectrum is consistent: where both arms can succeed, Mouse is faster, cheaper, and more reliable; where the task pushes against context and output limits, Mouse succeeds in regimes that the baseline configuration cannot reach at all within the time budget.

The verbosity tax on built-in tools

Built-in editing tools require the agent to echo file content back inside its own tool calls, which wastes tokens, slows the run, and introduces transcription errors that cascade through subsequent edits. Mouse eliminates that overhead through coordinate-based addressing, and the economic and reliability consequences of that single architectural choice are visible across all three studies.
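
Mouse's actual tool schema is not reproduced on this page, so the payloads below are invented purely to illustrate the contrast: a string-replacement edit must carry the surrounding content twice, while a coordinate-addressed edit carries only the location and the change.

```python
# Hypothetical payloads, for illustration only (not Mouse's real API).

# Built-in string replacement: old and new text are echoed in full,
# so token cost grows with the size of the edited region.
string_replace_edit = {
    "old_string": "def load(path):\n    # ...entire region echoed back...",
    "new_string": "def load(path, strict=True):\n    # ...entire region echoed back...",
}

# Coordinate-based addressing: name the location, send only the change.
coordinate_edit = {"line": 42, "column": 13, "insert": ", strict=True"}
```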

Predictable execution matters in production

Beyond average-case performance, Mouse produces consistent results from one run to the next, which matters in production settings where reliability is at least as valuable as raw speed. The contrast in BX-504D between Mouse's slowest run of 20.6 seconds and the Baseline's fastest run of 30.7 seconds is the clearest illustration: the worst Mouse trial in 23 attempts was still meaningfully faster than the best Baseline trial in the same 23 attempts.
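
The consistency row in the conclusion table comes from a variance comparison (Levene's test); a sketch using SciPy, with placeholder timing arrays rather than the paper's full per-trial data:

```python
from scipy.stats import levene

# Illustrative arrays only; the full BX-504D timing data is not shown here.
mouse_times = [9.5, 12.2, 20.6]
baseline_times = [30.7, 36.9, 58.2]

w, p = levene(mouse_times, baseline_times)
print(f"Levene W = {w:.2f}, p = {p:.4g}")
```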

Future Research

Further research is ongoing. In particular, comparisons beyond a single baseline are needed to confirm preliminary observations that Mouse's advantages extend to clients other than GitHub Copilot. Also planned are tests with newer frontier models as well as weaker open-source models, and ablation studies to determine which of Mouse's features drive the observed benefits.

Read the Full Paper

Get all the details: methodology, statistical analysis, additional findings, and discussion of implications.

Comments welcome at research@hic-ai.com.