AnswerGraph Methodology
What the engine measures
The AnswerGraph engine measures citation share: the proportion of times a domain or brand is cited by an AI search engine in response to a defined prompt, relative to all domains cited for the same prompt class.
We do not measure "AI visibility scores," "brand sentiment," or "AI readiness." These terms describe products that produce a single number without statistical backing. Citation share is a proportion with a known denominator, a confidence interval, and a reproducible measurement procedure.
Panel composition
The AnswerGraph panel consists of prompts probed daily across four AI search engines:
- ChatGPT (via OpenAI API, gpt-4o model with web browsing)
- Perplexity (via Perplexity API, sonar model)
- Google AI Mode (via Gemini API with Google Search grounding)
- Claude baseline (non-grounded, weekly cadence — measuring training-data citation as a control)
The panel is shared: 123 category-level queries across six verticals, probed daily on grounded engines and weekly on Claude baseline. Properties are attributed via SQL views that join observations to domain lists — no per-customer query sets.
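The attribution join itself is simple; a minimal pandas sketch under assumed table and column names (the engine's actual SQL views are not reproduced here):

```python
import pandas as pd

def attribute(observations: pd.DataFrame,
              domain_lists: pd.DataFrame) -> pd.DataFrame:
    """Join panel observations to per-property domain lists.

    observations: one row per cited URL per probe, with a `domain` column.
    domain_lists: (property, domain) pairs for each tracked property.
    Because the panel is shared, one observation can attribute to
    several properties at once.
    """
    return observations.merge(domain_lists, on="domain", how="inner")
```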
Statistical methods
Primary metric: citation share
Citation share for domain d on engine e over period t is defined as:
CS(d, e, t) = (citations of d by e in t) / (total citations by e in t)
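As a minimal sketch, the metric reduces to a proportion with a binomial interval. The counts and domain names below are illustrative, and a Wilson interval stands in for whichever interval construction a given report uses:

```python
import math

def citation_share(citations: dict[str, int], domain: str) -> tuple[float, float, float]:
    """CS(d, e, t) for one engine and period, with a 95% Wilson interval.

    `citations` maps each domain to its citation count in the window;
    the denominator is the total across all domains, so the metric is
    a proportion with a known denominator by construction.
    """
    k = citations.get(domain, 0)
    n = sum(citations.values())
    p = k / n
    z = 1.96  # 95% normal quantile
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return p, centre - half, centre + half

# e.g. 87 of 512 citations in the window went to example.com
share, lo, hi = citation_share({"example.com": 87, "rival.com": 425}, "example.com")
```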
Regression model
The engine fits a logistic regression model predicting whether a page is cited in a given observation:
logit(P(cited)) = β₀ + Σ βᵢ·feature_i + Σ γⱼ·engine_j + Σ δₖ·pathway_k + Σ εₘ·vertical_m
Features include page-level signals (word count, heading structure, schema markup, outbound citation count), engine dummies (drop-first encoding, ChatGPT as reference), retrieval pathway dummies, and vertical dummies. All continuous features are standardised (mean-centred, unit-variance) before fitting.
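A minimal scikit-learn sketch of this model family; the feature names are illustrative and this is not the engine's production code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

continuous = ["word_count", "heading_count", "outbound_citation_count"]
categorical = ["engine", "pathway", "vertical"]

preprocess = ColumnTransformer([
    # mean-centre, unit-variance, as the methodology specifies
    ("scale", StandardScaler(), continuous),
    # drop="first" gives drop-first encoding; the first level
    # (e.g. ChatGPT for engine) becomes the reference category
    ("dummies", OneHotEncoder(drop="first"), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    ("logit", LogisticRegression(max_iter=1000)),
])

def fit(df: pd.DataFrame) -> Pipeline:
    # one row per (page, observation); `cited` is the 0/1 outcome
    return model.fit(df[continuous + categorical], df["cited"])
```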
Negative sampling (v1.2)
Positive examples: pages cited by an engine in a given observation.
Negative examples: pages from the same domains that engines have NOT cited. These are discovered via domain sitemaps and internal link traversal, then crawled and feature-extracted. The negative pool consists exclusively of pages with sample_provenance = 'domain_stratified_negative' — pages that could plausibly have been cited (they exist on cited domains, they have real content) but were not.
This gives the model the correct contrast: same domain, different pages — what distinguishes the cited ones from the uncited ones on the same site? This is a matched case-control design: cited pages are cases, non-cited pages from the same domain are controls matched on domain authority (Stuart, 2010; Pearce, 2016).
Negative sampling ratio: 5 non-cited pages per cited page per observation.
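A sketch of the 5:1 domain-stratified draw, assuming pandas frames with a `domain` column and a pool pre-filtered to sample_provenance = 'domain_stratified_negative':

```python
import pandas as pd

NEGATIVES_PER_POSITIVE = 5

def sample_negatives(positives: pd.DataFrame,
                     negative_pool: pd.DataFrame,
                     seed: int = 0) -> pd.DataFrame:
    """Draw 5 non-cited pages per cited page, from the same domain.

    `negative_pool` is assumed pre-filtered to rows with
    sample_provenance == 'domain_stratified_negative'.
    """
    needed = positives.groupby("domain").size() * NEGATIVES_PER_POSITIVE
    drawn = []
    for domain, n in needed.items():
        pool = negative_pool[negative_pool["domain"] == domain]
        # without replacement, capped by pool size for small domains
        drawn.append(pool.sample(min(n, len(pool)), random_state=seed))
    return pd.concat(drawn, ignore_index=True)
```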
Confidence intervals
All published coefficients carry 95% bootstrap confidence intervals (200 iterations, cluster-robust). The bootstrap resamples at the URL level — all rows for a given URL are included or excluded together. This accounts for the non-independence of repeated observations of the same page across different queries. Bootstrap confidence intervals provide valid inference without distributional assumptions on the residuals — the method resamples from the observed data rather than assuming normality (Efron, 1979; Stine, 1989).
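A minimal sketch of the URL-clustered percentile bootstrap; `fit_coefficients` is a placeholder for the logistic refit and is assumed to return one coefficient vector per call:

```python
import numpy as np
import pandas as pd

def cluster_bootstrap_ci(df: pd.DataFrame, fit_coefficients,
                         n_boot: int = 200, seed: int = 0) -> np.ndarray:
    """95% percentile CIs, resampling whole URLs so that repeated
    observations of the same page enter or leave together."""
    rng = np.random.default_rng(seed)
    groups = {url: rows for url, rows in df.groupby("url")}
    urls = np.array(list(groups))
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(urls, size=len(urls), replace=True)
        boot = pd.concat([groups[u] for u in sampled], ignore_index=True)
        draws.append(fit_coefficients(boot))
    return np.percentile(np.vstack(draws), [2.5, 97.5], axis=0)
```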
Zero-variance feature detection
Before fitting, features with zero variance across the training window are dropped. This prevents numerical instability and ensures the model only estimates coefficients where variation exists. Dropped features are reported in the run summary.
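The check itself is a one-step filter; a sketch assuming the design matrix arrives as a pandas DataFrame:

```python
import pandas as pd

def drop_zero_variance(X: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    """Drop features with no variation in the training window;
    return the dropped names for the run summary."""
    dropped = [c for c in X.columns if X[c].nunique(dropna=False) <= 1]
    return X.drop(columns=dropped), dropped
```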
Calibration period
The engine enters calibration when a new methodology version begins or when data characteristics change significantly. During calibration:
- Confidence intervals are wide (the engine is accumulating evidence)
- Coefficients are unstable (the engine is learning, not concluding)
- Methodological issues surface and are corrected (this is calibration working as designed)
Calibration exits when median CI width stops shrinking for 2 consecutive weekly runs. This stabilisation phase follows the same principle as run-in periods in clinical trials, where baseline measurements are established before drawing treatment conclusions (Pablos-Méndez et al., 1998), and warm-up periods in simulation, where initial transient data is discarded to avoid initialisation bias (Hoad et al., 2008).
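The exit rule can be checked mechanically; a sketch assuming a chronological list of median CI widths, one per weekly run, with an illustrative shrink tolerance:

```python
def calibration_exited(median_ci_widths: list[float],
                       shrink_tolerance: float = 0.005) -> bool:
    """Exit once median CI width has stopped shrinking for 2
    consecutive weekly runs (run-over-run improvement below tolerance)."""
    if len(median_ci_widths) < 3:
        return False
    w = median_ci_widths[-3:]
    return all(earlier - later < shrink_tolerance
               for earlier, later in zip(w, w[1:]))
```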
v1.2 calibration context: The v1.1 → v1.2 transition was itself a calibration discovery. During v1.1 calibration, analysis revealed that the negative sampling frame contained only cited-elsewhere pages, preventing the model from learning the cited-vs-uncited contrast. Diagnostic analysis identified the limitation, the sampling frame was corrected, affected hypotheses were retired, and calibration restarted under v1.2. This is what calibration is for — the system identified its own measurement limitation before producing conclusions from invalid data.
Temporal stability and drift
ChatGPT rotates 74% of cited domains weekly (SISTRIX, 82,619 prompts, 17 weeks). A single probe is noise. Our minimum reporting cadence is 8 weeks — shorter measurement windows produce confidence intervals too wide to be actionable.
The drift detector compares each weekly run's coefficients against a rolling window of prior runs, flagging any feature where the z-score exceeds ±2.0.
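A minimal sketch of that comparison, assuming the prior runs' coefficients are stacked into one array per rolling window:

```python
import numpy as np

def drift_flags(window: np.ndarray, current: np.ndarray,
                threshold: float = 2.0) -> np.ndarray:
    """window: (n_runs, n_features) coefficients from the rolling window;
    current: (n_features,) coefficients from this week's run.
    Returns a mask of features whose z-score exceeds the threshold."""
    mu = window.mean(axis=0)
    sd = window.std(axis=0, ddof=1)
    sd = np.where(sd == 0, np.inf, sd)  # zero-variance history: never flag
    return np.abs((current - mu) / sd) > threshold
```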
What we do not measure (and why)
We deliberately exclude:
- "AI visibility scores" — composite numbers without defined denominators are not statistical measures
- Brand sentiment — subjective annotation with inter-rater reliability below acceptable thresholds
- "AI readiness" assessments — unfalsifiable product marketing
- Single-run citation checks — week-over-week variance exceeds 40%; a single run is not measurement
Known limitations
- Engine API behaviour may diverge from consumer-facing product. We probe via API; users interact via chat UI. Grounding and citation behaviour may differ.
- Panel prompts are not exhaustive. We measure a sample, not the universe of possible queries.
- Citation ≠ recommendation. Being cited is not the same as being recommended. We measure presence, not endorsement.
- Temporal lag. Daily probing detects changes within 24 hours, but attribution analysis requires 8+ weeks of data for statistical power.
- Domain-stratified negatives assume same-domain contrast is meaningful. A page not cited from domain X may differ from cited pages on X for reasons unrelated to citability (e.g., thin content, old pages). Filters (minimum word count, exclude error pages) mitigate but do not eliminate this.
- Negative enrichment coverage is partial. 48% of domain-stratified negatives have real per-engine citation features; 52% have zero-valued features because their domains have no panel presence. This is the correct measurement (no authority signal detectable), but limits the model's ability to learn from citation features in the negative class. Future methodology versions may weight negative domain selection toward domains with panel presence.
Versioning and changelog
This methodology is versioned. The current version is v1.2.2. All changes are logged at /changelog with:
- What changed
- Why it changed
- Whether it affects historical comparability
v1.1 → v1.2 changes (2026-05-05)
| Element | v1.1 | v1.2 |
|---|---|---|
| Negative sampling | Cited-elsewhere pages (same pool as positives) | Domain-stratified non-cited pages (true negatives) |
| Bootstrap method | Row-level resampling | Cluster-robust (URL-level) resampling |
| Feature interpretation | Within-cited-page variation | Cited vs. non-cited contrast |
| Calibration | Inherited from v1.0 | Reset (new sampling frame = new baseline) |
Impact on historical runs: Runs 1–9 (methodology v1.1) measured within-cited-page variation — which pages get cited more across different query contexts. Engine intercepts, pathway effects, and vertical effects from those runs remain valid. Page-level feature coefficients (has_faqpage, heading_count, word_count, citation features) should not be used for intervention decisions — they measured the wrong contrast.
Why this is disclosed: A measurement system that discovers and corrects its own limitations is more trustworthy than one that never admits them. The correction happened during calibration — before any conclusions were published, before any commercial decisions were made, and before any interventions were designed based on the invalid coefficients.
v1.2 → v1.2.1 changes (2026-05-05)
| Element | v1.2 | v1.2.1 |
|---|---|---|
| Enrichment scope | Per-engine citation features computed only for pages observed in the panel (positives) | Per-engine citation features computed for all pages with extractable outbound links, including domain-stratified negatives |
| citation_features_known | Binary indicator in model (whether enrichment data existed for a page) | Dropped from model — was a sample-selection indicator, not a page feature |
| Feature count | 27 (26 features + intercept) | 26 (25 features + intercept) |
| Negative enrichment | Domain-stratified negatives had zero-valued per-engine citation features by construction | 48% of negatives now have real per-engine citation features via domain-based vertical resolution; 52% remain zero-valued (their domains have no panel presence, so zero authority is the correct measurement) |
What happened: During the first v1.2 refit (Run 10), citation_features_known produced a coefficient of +10.81 — an order of magnitude larger than any other feature. Diagnostic analysis revealed this was a sample-selection artifact: the enrichment pipeline only processed pages that appeared in panel observations (positives), so citation_features_known = TRUE was a near-perfect predictor of class membership. 100% of positives with enrichment data had it TRUE; 100% of negatives had it FALSE.
What changed: The enrichment pipeline now resolves page verticals via domain fallback (if any URL on a domain was cited in a vertical, other pages on that domain inherit that vertical for enrichment). citation_features_known was removed from the model since it no longer serves a purpose once both classes have proper enrichment.
Impact on Run 10 coefficients: Three citation features shifted:
- citation_proximity: −3.49 → +0.02 (lost significance — was absorbing cfk signal, not a real page-level effect)
- outbound_authority_score: +0.53 → +0.18 (retained significance — coefficient dropped 66% but CI [+0.11, +0.25] still excludes zero)
- outbound_citation_count: +0.13 → +0.08 (not significant in either run)

The remaining 10 significant features from Run 10 are broadly stable. Median CI width improved from 0.27 to 0.18.
Why this is disclosed: Same principle as v1.1 → v1.2. The calibration system identified a measurement artifact before any conclusions were published. The correction happened within the same day as the first v1.2 refit — no decisions were made based on the invalid coefficients.
v1.2.1 → v1.2.2 changes (2026-05-06)
v1.2.2: Added canonical citations for pre-registration (Nosek 2018, Hardwicke 2023), bootstrap CIs (Efron 1979, Stine 1989), matched controls (Stuart 2010, Pearce 2016), and calibration periods (Hoad 2008, Pablos-Méndez 1998). No methodology changes; this release only strengthens the citation grounding.
v1.2.2 schema and gating clarification (2026-05-08)
A 2026-05-07 operational audit discovered that model_runs.panel_observations_n — the column intended to record the panel-observation count behind each refit — had been populated with the training-matrix row count instead (positives plus sampled negatives, typically 6× larger than the raw panel observations).
The hypothesis-gating language on this page already used "training rows" (see H4 and H5 below) and the homepage stat strip already shows the two numbers separately ("486 panel observations per day", "9,115 training rows after first crawler expansion"), so no public claim was misstated. The schema and code, however, used the wrong name.
What changed:
- model_runs.training_rows_n (new column) carries the design-matrix row count.
- model_runs.panel_observations_n (existing column, redefined) now records raw panel observations in the fit window, as the name implies.
- Historical rows were backfilled by recomputing panel_observations_n against the same SQL the matrix builder used at fit time, substituting each run's run_at for NOW(). The backfilled values are retrospective recomputations, not original measurements.
- Pre-registration hypotheses now declare at registration time whether they gate against training_rows_n (coefficient hypotheses, including all five live hypotheses) or panel_observations_n (rate-based claims).
What this does not change: No published coefficient, hypothesis result, or calibration decision was affected. The 12,816 figure recorded against runs 10–14 was always the training-row count; it was just stored in a misnamed column.
Calibration v1 framing (2026-05-08)
For transparency: the v1.2 calibration runs (runs 10–14, 2026-05-05 to 2026-05-07) were fitted on a frozen panel snapshot of 486 observations collected on 2026-05-03 — the panel collector's scheduled run on Hetzner had not been wired up at that point, so no new observations accumulated during the calibration window. Calibration exited on 2026-05-05 against this static dataset. The panel collector wiring is the first item of Phase 5; calibration metrics will be re-evaluated against an accumulating time series rather than a static snapshot once the collector resumes.
The 2026-05-05 calibration exit decision should be understood as preliminary; the engine remains in calibration with respect to continuously accumulating data and will exit calibration formally only after 14+ days of clean panel runs produce stable coefficients.
Literature foundation
The engine's feature selection and hypothesis design draw on a systematic literature review of AEO research: 1,098 papers identified, 847 screened, 50 included via PRISMA-style methodology across 21 queries in 8 thematic clusters. The review identifies 12 design principles for content optimisation in AI answer engines, each with strength-rated evidence claims and supporting citations.
The full review is maintained at docs/research/aeo_literature_review_v1.md as the foundational bibliography. Empirical validation against the AnswerGraph panel is in progress — each design principle maps to testable hypotheses that the engine will evaluate over the next 6 months as calibration data accumulates.
Pre-registered hypotheses
AnswerGraph pre-registers hypotheses before testing them. Each hypothesis is hashed at registration time; the hash is published here before any data is analysed.
Hash algorithm: SHA-256 over canonicalised JSON of the core fields (hypothesis, expected_direction, minimum_effect_size, methodology_version, test_after_n), sorted by key.
Multiple-comparisons correction: Benjamini-Hochberg (FDR control) applied across all hypotheses tested in the same refit run.
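A minimal sketch of the hash step; the field values below are placeholders, and the exact canonicalisation (separators, unicode escaping) must match the registration code byte-for-byte for hashes to reproduce:

```python
import hashlib
import json

def hypothesis_hash(core_fields: dict) -> str:
    """SHA-256 over canonicalised JSON of the core fields, sorted by key."""
    canonical = json.dumps(core_fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

digest = hypothesis_hash({
    "hypothesis": "...",                 # full registered text goes here
    "expected_direction": "negative",
    "minimum_effect_size": 0.10,
    "methodology_version": "v1.2",
    "test_after_n": 20000,
})
```

For anyone reproducing the correction step, the Benjamini-Hochberg procedure maps onto statsmodels' multipletests(pvals, method="fdr_bh").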
Retired hypotheses (sampling frame invalid under v1.1)
H1–H3 (internal_link_count, external_link_count, is_forum) were registered under methodology v1.1 and retired on 2026-05-05. These hypotheses tested page-level features, but the v1.1 sampling frame did not provide the cited-vs-uncited contrast required to evaluate them. They are neither supported nor refuted — the test conditions were invalid. They will be re-registered once the first refit under the corrected sampling frame completes (see Pending re-registration below).
Active hypotheses (unaffected by sampling correction)
H4 — Informational vs. comparative pathway citation rates
- Hypothesis: pathway_informational queries produce lower citation rates than comparative queries, controlling for page features and engine
- Direction: negative
- Minimum effect size: 0.10 (log-odds)
- Test after n: 20,000 training rows
- Hash: 2c4c8f72a50c0b2762150c36bea01a02ee107f784748ee3de1b06fd544ecff8d
- Status: Inconclusive (insufficient data)
H5 — Telecoms vs. EASM vertical equivalence
- Hypothesis: The vertical_telecoms vertical shows no meaningful difference in citation rates compared to the EASM reference vertical
- Direction: null (testing equivalence within ±0.15 log-odds band)
- Minimum effect size: 0.15 (log-odds)
- Test after n: 20,000 training rows
- Hash: 8376a8466d52a06f48f07d97b68e818976d879f42cdadd71c3a66c657923fd4b
- Status: Inconclusive (insufficient data)
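Read as a decision rule, the H5 equivalence claim passes only when the entire 95% CI for the vertical coefficient sits inside the registered band; the sketch below is that simplified CI-within-band check, not a formal TOST procedure:

```python
def equivalence_supported(ci_low: float, ci_high: float,
                          band: float = 0.15) -> bool:
    """Supported only if the whole 95% CI for the vertical
    coefficient lies strictly inside the ±band log-odds window."""
    return -band < ci_low and ci_high < band
```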
Pending re-registration (v1.2.1)
H1–H3 will be re-registered with new hashes after the first v1.2.1 refit run completes. The hash payload will include methodology_version: "v1.2.1" and a reduced test_after_n threshold (5,000 rows — domain-stratified negatives carry more information per row than cited-elsewhere negatives).
How pre-registration works
Pre-registration distinguishes prediction from postdiction — defining hypotheses and analysis plans before observing outcomes prevents hindsight bias and selective reporting (Nosek et al., 2018; Hardwicke et al., 2023).
- A hypothesis is written before any data relevant to the test is examined.
- The hypothesis text, expected direction, minimum effect size, methodology version, and sample-size threshold are hashed together using SHA-256.
- The hypothesis and its hash are published on this page within 24 hours of registration.
- The hypothesis is tested only after the sample-size threshold is met.
- Results are published here regardless of outcome — including null results.
Null results are published with the same prominence as positive findings. A methodology that only publishes confirmations is not a methodology.