Probabilistic Systems Engineering

Experiment Methodology v0.2 (Draft)

0. Purpose

This methodology defines a reproducible experiment to measure iterative drift in AI-assisted software changes, comparing two conditions: spec-constrained implementation (S) and code-only execution (C).

The primary focus is drift/regression under sequential change, not single-shot success.

1. Fixed Scope Assumptions

2. Runtime Parameters (Must Be Provided Per Run)

Each experimental run MUST specify and log, at minimum, the sample count N and the history search window W (see 6.3).

These are runtime parameters and MUST NOT be implied from document versioning or prior runs.
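The logging requirement above can be sketched as a small helper. This is illustrative only: the field names (`N`, `W`, `seed`) and the output path are assumptions, not prescribed by this methodology.

```python
import json
import time

def log_run_parameters(n_samples: int, search_window: int, seed: int,
                       path: str = "run_params.json") -> dict:
    """Record per-run parameters to a JSON file (sketch; names illustrative)."""
    params = {
        "N": n_samples,          # number of bug samples selected for this run
        "W": search_window,      # history search window used during eligibility scanning
        "seed": seed,            # seed for the pre-registered selection rule
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    with open(path, "w") as f:
        json.dump(params, f, indent=2)
    return params
```

Because the parameters are written at run start, a later report cannot silently inherit them from a previous run, which is the point of the MUST NOT clause above.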

3. Hypothesis (Falsifiable Claim)

Given identical repositories and tasks, an implementation process constrained by an authoritative specification (S) will exhibit lower iterative drift than code-only execution (C), measured by fewer regressions and better scope containment under Step 2.

The hypothesis is falsified if S does not outperform C on the pre-registered metrics (Section 11) by the pre-registered margin.
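Because each sample is executed in both arms (Section 5), the per-sample outcomes are paired, and one reasonable pre-registerable analysis is an exact two-sided sign test. The sketch below is an illustrative choice of statistic, not one mandated by this methodology; ties (samples where S and C regress equally) are dropped, as is standard for the sign test.

```python
from math import comb

def sign_test_p(s_better: int, c_better: int) -> float:
    """Exact two-sided sign test on paired samples (ties dropped).
    s_better: samples where S had fewer Step-2 regressions than C;
    c_better: samples where C had fewer regressions than S."""
    n = s_better + c_better
    if n == 0:
        return 1.0
    k = min(s_better, c_better)
    # Two-sided tail under Binomial(n, 0.5): P(X <= k) doubled by symmetry.
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, if S wins on 9 samples and C on 1, the p-value is 22/1024 ≈ 0.021, which would count against the null of no difference.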

4. Core Definitions

  1. A: repository state immediately before the historical bugfix.
  2. B: repository state after the historical bugfix is merged.
  3. B′: repository state after the model implements the fix in Step 1.
  4. C′: repository state after the model applies the Step-2 follow-on to B′.
  5. Regression: a test that passed at B′ but fails at C′ (see 11.1).
  6. Drift: regressions and scope expansion accumulating under sequential change.

5. Study Arms

Each sample is executed in both conditions:

  1. S (spec-constrained): implementation guided by an authoritative specification artifact.
  2. C (code-only): implementation from the repository and task statement alone.

6. Sample Selection (Anti-Cherrypick)

6.1 Repo eligibility criteria (Python)

A repository is eligible only if:

  1. It is open source with a license permitting reproduction.
  2. It has an executable test command suitable for CI (pytest/unittest/etc.).
  3. Historical bugfix commits/PRs can be identified and pinned.

6.2 Bug sample eligibility criteria

A historical fix sample is eligible only if:

  1. There exists an identifiable A→B bugfix (merged PR or commit).
  2. Tests pass at B under the declared test command/environment.
  3. The change is non-trivial (not formatting-only).

6.3 Runtime selection rule (uses N and W)

The run MUST pre-register one selection rule:

W is used only to bound search during eligibility determination (e.g., to avoid scanning entire history), not as a definition of “follow-on.”
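A deterministic selection rule using N and W might look like the sketch below. The fixed seed and the "most recent W candidates" windowing are illustrative assumptions; the actual rule is whatever the run pre-registers.

```python
import random

def select_samples(candidate_commits: list[str], n: int, w: int,
                   seed: int) -> list[str]:
    """Draw N samples reproducibly from the most recent W eligible candidates.
    W only bounds the scan; it does not define what a follow-on is."""
    window = candidate_commits[:w]      # W caps how far back eligibility scanning goes
    if len(window) < n:
        raise ValueError("not enough eligible candidates in window")
    rng = random.Random(seed)           # fixed seed -> identical draw on replication
    return sorted(rng.sample(window, n))
```

Sorting the result makes the logged sample list order-stable, so two replications with the same seed produce byte-identical logs.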

7. Step Definition (What the model must do)

7.1 Step 1 (A→B′): implement the historical fix

Model is given state A and a task statement describing the bug and desired fix.

Step 1 passes only if the declared test command passes at B′.

7.2 Step 2 (B′→C′): Synthetic Follow-on (default)

Step 2 is a synthetic follow-on change applied after Step 1 completes, designed to test drift.

Critical constraint: Step 2 MUST be semantically coupled to the same subsystem/intent surface as the bugfix, not a random feature.

Examples of valid synthetic follow-ons (choose one per sample, derived from the bug’s domain):

Step 2 MUST be written so that:

7.3 Synthetic Follow-on generation rule (pre-registered)

To avoid biasing Step 2 against either condition, each run MUST pre-register how Step 2 is chosen:

The chosen rule MUST be logged.

8. Prompt Fairness Rules

8.1 Prompt symmetry

Step 1 and Step 2 prompts MUST be identical between S and C, except that condition S additionally receives the spec artifact(s):

No additional hints, logs, or external sources are allowed in one condition but not the other.

8.2 Spec constraints

Spec artifacts (if present) MUST be:

9. Interaction Budget

Each run MUST pre-register the interaction budget: the maximum number of model turns/retries allowed per step, identical across S and C.

10. Output Artifacts (Reproducibility)

For each sample and condition, capture: the full prompts and model outputs, the diffs producing B′ and C′, complete test output at B′ and C′, and the logged runtime parameters.
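One way to make captured artifacts verifiable under replication is a content-hash manifest; the sketch below is an illustrative format, not a required one.

```python
import hashlib
import json

def artifact_manifest(paths: list[str]) -> dict:
    """Map each captured artifact (prompt, diff, test log) to its SHA-256,
    so a replication can confirm it received the same inputs."""
    manifest = {}
    for p in paths:
        with open(p, "rb") as f:
            manifest[p] = hashlib.sha256(f.read()).hexdigest()
    return manifest

def write_manifest(paths: list[str], out_path: str = "manifest.json") -> None:
    """Persist the manifest alongside the artifacts it describes."""
    with open(out_path, "w") as f:
        json.dump(artifact_manifest(paths), f, indent=2, sort_keys=True)
```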

11. Metrics

11.1 Primary metric: Step-2 regression rate

Regression = any test that passed at B′ but fails at C′ (under the declared test command).

Compare regression rate S vs C.
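The primary metric reduces to a set difference over test identifiers. Note that this sketch counts a test that is present at B′ but absent at C′ as a regression; whether deleted tests count is an interpretation the run would need to pre-register.

```python
def step2_regressions(passed_at_bprime: set[str],
                      passed_at_cprime: set[str]) -> set[str]:
    """Regression = any test that passed at B' but fails (or vanishes) at C'."""
    return passed_at_bprime - passed_at_cprime

def regression_rate(passed_at_bprime: set[str],
                    passed_at_cprime: set[str]) -> float:
    """Fraction of B'-passing tests that regressed at C'."""
    if not passed_at_bprime:
        return 0.0
    regressed = step2_regressions(passed_at_bprime, passed_at_cprime)
    return len(regressed) / len(passed_at_bprime)
```

The same function is applied to both arms, and the S-vs-C comparison is over these per-sample rates.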

11.2 Secondary metrics

Secondary metrics MUST be pre-registered; at minimum, scope containment (the extent to which changes stay within the bugfix's subsystem).

12. Reporting

The report MUST include:
