DRAFT: Response to NIH Request for Information NOT-OD-26-087
Measuring and Rewarding Scientific Impact
A Framework of AI-Enabled Indicators for 21st-Century Biomedical Research Assessment
by Takashi D. Y. Kozai, PhD
Ernest E. Roth Professor of Bioengineering, University of Pittsburgh
Director, B.I.O.N.I.C. Lab and UP NExT (University of Pittsburgh Neural Engineering Cross-Translation)
Date: July 2, 2026
In 1995, Katalin Karikó was demoted at the University of Pennsylvania because her grants were not landing and her papers were not cited, and the mRNA chemistry she developed over the following two decades became the COVID-19 vaccines and the 2023 Nobel Prize. The metrics in use at her evaluation predicted her failure at the moment her most consequential work was beginning. Her case is not an anomaly in an otherwise sound system. It is the visible instance of a baseline that routinely misvalues the science with the longest payoff, and the case for a new framework rests on how bad that baseline actually is.
The case does not rest on the proposed metrics being good in the abstract. It rests on what they are weighed against, and three observations, taken together, determine whether building this helps or harms.
First, the status quo is not thoughtful human judgment that a flawed metric might displace. Evaluators under time pressure default to the two most misleading numbers available, the raw h-index and counts of papers in high-prestige journals such as Nature, Science, and Cell. The h-index is the largest number h such that a researcher has published at least h papers each cited at least h times, a single figure that rewards steady citation accumulation and is structurally blind to delayed, foundational, or corrective work. Both defaults systematically suppress the slow, foundational, corrective, and exploratory work that has the longest eventual payoff. The absence of a better computable score does not produce careful qualitative evaluation. It produces the h-index. The real choice is therefore not metric versus judgment, but which computable default a busy committee reaches for.
Second, and deeper, human committee judgment is itself biased by relationship, and the current system rewards that bias. Who trained whom, who co-publishes with whom, who sits on study section, and which institution and nation a researcher belongs to shape outcomes, after which a convenient number is used to ratify a judgment already made on social grounds. This is the Matthew Effect operating through networks, and its errors are not random. They are structured by who you know, and structured errors accumulate across a career and concentrate advantage in the already-advantaged, rather than averaging out as random noise would. A content scorer that reads only what is in the paper, and is deliberately blind to the collaboration graph, institutional prestige, and training lineage, replaces structured relationship bias with unstructured noise, which reduces total unfairness in the aggregate even when any single score is imperfect.
Third, the bar a new metric must clear is not high absolute accuracy. It is to predict genuine contribution better than the h-index does, and to fail more gracefully, meaning its errors must not systematically penalize foundational, slow, and corrective work the way the incumbent metrics do. The incumbent is, in effect, a scorer with near-zero accuracy for long-term contribution, trusted more because habit has naturalized it. Beating that while failing more gracefully is an achievable bar, and it is the one this framework sets for itself.
Against this baseline, the framework proposes two arms. A diagnostic arm of content-scoring indicators that detect quality the current metrics miss, and an incentive arm of constitutive metrics that reward foundational exploration, replication, correction, data sharing, mentorship, public understanding, and lifetime contribution by construction, so that gaming them produces the behavior the enterprise wants. Many of the incentive reforms require no artificial intelligence at all and could be enacted through human judgment and verified counts. The AI content scorers are the more powerful and the riskier layer.
One risk survives every argument in favor and must be stated plainly. A content-scoring apparatus inherits the biases of its training data, and if its labels are generated by the relationship-biased system it is meant to escape, it will reproduce that bias in a form that is harder to contest because it appears objective. The people who control the scorer's training and calibration would then control scientific careers, and the current beneficiaries of relationship bias are precisely the senior figures positioned to capture that calibration and use it to ratify the existing hierarchy under a veneer of neutrality. That risk is answered not by governance language but by architecture. Scorers must be open-source and independently reproducible, computation must not build persistent individual dossiers, and no scorer should be deployed whose training and calibration cannot be independently audited and reproduced.
The downside is asymmetric, and this determines the deployment posture. A poorly built or captured content scorer does not fail by doing nothing. It fails by doing harm with false authority, encoding bias or noise while presenting it as objective computation that is harder to contest than a human judgment. History warns that the discipline required to keep such a system advisory and gated is unreliable, since the h-index and the journal impact factor (the average citation rate of a journal's recent articles, widely used as a proxy for the quality of any individual paper published in it) were both introduced as one input among many and both became de facto determinative through habit and workload. A promise to deploy only if accuracy and openness gates are met is therefore not self-enforcing, which is why the gates must be architectural (reproducibility, no persistent dossiers, network-blindness) rather than procedural assurances. Where a residual gaming or bias mode cannot be closed, this document names it explicitly rather than presenting the metric as robust, so evaluators can weight it accordingly and future work can address it.
The resulting recommendation is deliberate and staged. Incentive-structure reforms are the primary contribution and should be pursued first, since most need no AI and can run on human judgment and verified counts, crediting high-risk work including its failures, replication, correction, verified data and tool reuse, trainee independence, and career-terminal origination and delayed impact. The content-scoring layer is the more powerful and riskier component, warranting deployment only as a gated pilot where the graceful-failure and open-reproducibility preconditions are demonstrably met, and no deployment at all if they are not. Section 8 states the staged recommendation in full.
The comparison that matters is against the real baseline, not an imagined one. Deploying a network-blind, gracefully-failing, openly-constructed scorer alongside long-horizon incentive metrics is more likely than not a positive contribution, because it partially dismantles a relationship-laundering machine that currently misallocates careers and resources. That contribution is real, the strongest single risk is capture of the scorer's training, and that risk is answerable through open and reproducible construction rather than through promises. This framework asks to be judged on that comparison, not against a baseline of neutral expert judgment the evidence does not support.
Current scientific impact metrics rely on individual-level publication counts and citation indices that fail to capture the collaborative, long-horizon, and mission-driven character of modern biomedical research. The failures occur in both directions.
False negatives are work whose impact lands decades after publication and remains invisible to short-window indicators. Karikó's demotion, described in the preface, is one instance. Tom Brock, sampling Yellowstone's thermal pools in 1966, isolated Thermus aquaticus because the question of how anything could live above 70°C interested him, and the Taq polymerase derived from that organism made PCR possible and through PCR made the Human Genome Project possible. Richard Normann's Utah intracortical microelectrode array, developed at the University of Utah through the late 1980s under NIH and DARPA support, sat in academic labs for two decades before BrainGate clinical trials began in 2004, and the same architecture now restores speech in ALS patients (Pat Bennett, 62 wpm in 2023). Carl Sagan published approximately 500 peer-reviewed papers across 39 years (roughly one per month), and Harvard denied him tenure in 1968 while the National Academy of Sciences blackballed his 1992 nomination because colleagues treated his public communication as evidence he was not a serious scientist.
False positives are volume-first publication patterns that produce high paper counts without durable scientific contribution. Strategies including single-test analyses, reused public-database papers, no-primary-data publications, and high first-author paper output (10+ per year) can accumulate counts that look like productivity but generate nothing the field builds on. The current proxy counts these equally with substantive work.
Both failure modes share a root cause. The metrics serve as pseudo-metrics, indirect proxies for what cannot be observed directly (long-term scientific impact). Their use is not the problem. Mistaking the proxy for the thing is the problem, and the proxy is structurally blind to delayed translation, deep mechanistic characterization without translational hooks, failure analysis, replication, and public engagement, while also rewarding the gameable volume patterns above.
The strongest version of the incumbent does not rescue it. Consider the m-index (a researcher's h-index divided by the number of years since their first publication) was designed precisely to correct the h-index for career length, so that a productive early-career researcher is not penalized against a senior one. It is the closest existing metric to a rate of impact rather than a cumulative count. Even this time-normalized incumbent fails on delayed impact, because it still depends on citations that have already accumulated. Katalin Karikó's m-index at her 1995 demotion would have been near zero, since her foundational mRNA work had not yet been cited, and the metric would have delivered the same verdict the raw h-index did. Normalizing for time does not help when the impact itself is delayed by decades, so the failure is not that the incumbent metrics ignore career stage, it is that every citation-based metric, cumulative or rate-based, is blind to impact that has not yet arrived.
Funding pressure compounds the blindness. As NIH paylines collapse (NOFOs reduced from 756 in 2024 to approximately 120 in 2025 and 14 by mid-March 2026, competitive awards down 74%), principal investigators are forced to chase short-term translational hooks and positive results to stay publishable, which defunds the deep characterization, failure analysis, and replication that future translation depends on.
Venue prestige, the dominant informal proxy in hiring, promotion, and award decisions, is decoupling from scientific contribution under two simultaneous pressures. The article-processing charge to publish gold open access in Nature is currently $12,850 (£9,390), so venue access increasingly reflects institutional wealth and funder policy rather than the quality of the work. United States public-access policy and current grant-cost restrictions are moving to limit whether federal grants can pay these charges. National incentive structures elsewhere have instead rewarded high-impact-venue publication directly, as when China offered institutional cash bonuses for Nature and Science papers until a 2020 Ministry of Education and Ministry of Science and Technology order instructed institutions to discontinue the practice and to stop evaluating researchers solely on publication count or venue. The net effect is that impact-factor-based evaluation measures who can pay and which national policy regime a researcher sits inside, not what the work contributes. Content-scoring indicators that read the paper directly would let hiring, promotion, and award committees evaluate the science rather than the venue, decoupling assessment from ability to pay and from venue capture. This is among the strongest justifications for the framework proposed here.
A deeper problem sits beneath the measurement failures. Study sections reward near-term clinical translation because it is fundable and demonstrable within a grant cycle, and public sentiment rewards the same because visible cures arrive during the appropriations window that funds them. Both forces push in the same direction, toward the short horizon. Basic and foundational science has the opposite profile, the longest impact horizon and the weakest political constituency, because its payoff arrives after the appropriations cycle, after the administration, sometimes after the researcher's lifetime. An evaluation system that scores all work on the same near-term axis therefore does not merely miss foundational science, it structurally under-invests in it, because it fails to counterbalance the tailwind that already over-rewards translation. Correcting this requires more than better measurement. It requires metrics that actively reward the long-horizon, high-risk, non-appropriable work that only public funding will ever support, because the private sector cannot capture its diffuse and distant returns.
This response proposes a two-arm framework responsive to both jobs the RFI names, measuring and rewarding scientific impact. The diagnostic arm comprises fourteen top-level AI-enabled indicators that detect quality the existing metrics miss, scoring content rather than metadata through computational analysis of full text and citation networks, and two of these are meta-indices (the Scientific-Reasoning Quality Index and the Engineering-Method Index) that compose named step-level sub-indices. An incentive arm comprises indicators that reward wanted behaviors by construction, so that the optimal way to score high is to perform the behavior the system wants, which makes these indicators resistant to the gaming that degrades proxy metrics. This arm leads with foundational and exploration incentives and culminates in career-terminal metrics that score, once and late, what no annual metric can see, namely the fields originated, the delayed payoffs realized, the risks taken, the scientific lineage launched, the errors corrected, and the shared infrastructure built across an entire career. The framework argues that the incentive arm, and within it the foundational and career-terminal metrics specifically, should carry heavier evaluative weight than the diagnostic arm, because heavier weight on long-horizon incentives corrects the standing imbalance toward near-term impact that only a public funder with a multi-decade mission has the mandate to correct.
The diagnostic arm is organized into four operational groups. First, foundational-exploration indicators (sleeping-beauty trajectory, mechanism-graph centrality, field-define, gap). Second, rigor and reasoning indicators (assumption-addressing, paper-substance score, and two meta-indices, the Scientific-Reasoning Quality Index composing eleven scientific-method step-indices and the Engineering-Method Index composing six engineering-method step-indices, since much NIH-funded work is design-and-build rather than discovery). Third, collaboration and integration indicators (cross-field integration, co-author composition, independent recognition). Fourth, impact-direction indicators (verifiable public-engagement, citing breadth, data-driven innovation composite). The incentive arm is specified in Section 5.
Throughout this document, p denotes a paper, a denotes an author, t denotes time (typically year), c(x, y) denotes a citation from x to y, and refs(p) denotes the reference list of p. AI components are of three kinds, distinguished throughout because they differ in what they require, first rule-based checks whose ground truth is fixed (the statistical-validity checks, which need no training corpus), second corpus-relational pattern detectors whose ground truth is the structure of the existing literature (novelty, field-distance, provenance, design-overlap, which need the corpus for retrieval but no annotation), and third graded scorers whose computed features are mapped to a scale anchored to the reference-population distribution of those same features (the meta-index step-indices, whose feature detection is structural and whose scale mapping uses no expert quality-labels). Model architecture is noted where it constrains implementation. Each standalone indicator reports on the scale native to what it measures, whether a probability where a probability is meaningful (SBTI, DDIC), a field-relative z-score for standing against a reference distribution, a bounded 0-to-1 factor for component scores that combine, or a raw or log-transformed count where magnitude carries information. These standalone outputs are deliberately not rescaled onto a single common band, because they feed calibrated committee decisions and an ensemble model rather than a single-number comparison, and forcing a fixed bounded band would compress genuine outliers, discard tail information, and reintroduce the false precision of one legible index that this framework exists to move beyond. Preserving native scale and reported uncertainty is more informative than cosmetic uniformity. One class of indices is an explicit exception. The step-level sub-indices of the two meta-indices (SRQI in Section 3.5 and EMI in Section 3.5E) must be commensurable with one another, because a meta-index composite is a committee-weighted average across its step-scores and an average is only coherent when the terms share a scale, otherwise whichever sub-score carries the largest numeric range dominates the weighting. Each step-index therefore reports on a common anchored 1-to-10 rubric (3 deficient, 7 solid, 10 excellent within the field's normal range) with the top left uncapped, so documented work beyond the field's normal range scores above 10 and the composite average preserves rather than compresses the outlier. The 1-to-10 anchoring matches the range reviewers already reason in, and the uncapped top prevents the ceiling from flattening the foundational-outlier signal the framework exists to protect.
This section specifies the diagnostic arm, the AI content-scoring layer. Per the recommendation in the preface, this arm is the more powerful and the riskier component and should be treated as a gated, optional experiment, deployed only where the graceful-failure and open-reproducibility preconditions of Sections 6 and 7 are demonstrably met. The incentive-structure reforms of Section 5, most of which require no artificial intelligence, are the primary contribution and do not depend on any indicator in this section being built.
Each indicator section presents seven parts. These are a definition of what the indicator measures, the motivation explaining why current metrics fail at this measurement and what gap the new indicator fills, a mathematical formulation, the AI specification including model architecture and training requirements, numbered implementation steps, failure modes and mitigations, and the RFI category alignment.
One distinction governs how every indicator should be read, and it is the distinction the current metrics collapse. Quality and impact are orthogonal axes. The quality axis is whether the work was done as science, meaning the observation was real, the hypothesis falsifiable, the design controlled, the execution reproducible, the analysis sound, and the reasoning traceable, all of which are assessable at publication time from the paper itself. Impact is whether the work eventually changed the field, which is knowable only later and only from what the field does with it. A careful scientist can produce high-quality work that turns out wrong or inconsequential (careful-but-wrong is still good science, just not impactful science), and an incautious one can stumble onto high-impact results through low-quality work. The scientific-reasoning indices (Section 3.5) and the paper-substance score measure quality, the reasoning was sound at publication. Trajectory and recognition indices (SBTI, IRI, the citation-structure measures) and the career-terminal incentives measure impact, the work mattered over time. Separating the two axes is the point, because the current citation-and-venue metrics conflate them, treating eventual citation as if it certified quality and treating quality as if it guaranteed impact, when neither implication holds. The reasoning score must therefore never be read as a claim about truth or importance. It measures only that the work was done as science, which is the most any at-publication metric can measure and exactly the thing citation and venue do not measure at all.
Definition.
A sleeping beauty, in bibliometrics, is a paper that goes largely uncited for years or decades and then accumulates citations rapidly once the field catches up to it. This index estimates the probability that a paper will follow that trajectory, computed at publication time from early-trajectory features and content-level scoring rather than waiting for the delayed citations to appear.
Formally, it is the probability that a paper's downstream citation accumulation will exceed a field-specific threshold beyond a horizon Δt years from publication, computed from early-trajectory features and content-level scoring.
Motivation.
Current metrics rely on cumulative citations within a 5-to-7-year window, which dominates h-index calculation (Hirsch 2005) and grant-review evaluation. Karikó's 1995 mRNA chemistry work had near-zero early citation despite eventually founding the COVID-19 vaccine platform, and the current metric system would have predicted her career failure at her 1995 evaluation. The Sleeping-Beauty Trajectory Index fills this gap by scoring publication-time features that predict long-horizon citation accumulation, providing decision-relevant signal during evaluation rather than waiting decades for the citation pattern to emerge.
Formulation.
SBTI(p, t) = P( C(p, t + Δt) > k | F(p, t) )
where C(p, t + Δt) is cumulative citations to p by year t + Δt, k is the 90th percentile of the field-specific citation distribution at the chosen horizon, and F(p, t) is the feature vector at publication time including early citation trajectory shape, mechanism-graph centrality, methodological specificity, and gap-filling status.
AI specification.
A gradient-boosted decision tree or LSTM model trained on historical late-blooming papers, with training targets taken from computed disruption indices rather than human annotation, which measure whether a paper eclipses the prior work it builds on (a disruptive paper is cited in place of its own references, an incremental one is cited alongside them). The training targets come from the Funk and Owen-Smith 2017 CD index and the Wu et al. 2019 Nature disruption score, both computed from citation structure. Retrospective validation requires reproducing the late-blooming signal on Karikó and Weissman 2005 (Immunity), Brock and Freeze 1969 (J Bacteriol), and Maynard et al. 1997 (Electroencephalogr Clin Neurophysiol) held out of training.
The feature vector F(p, t) combines content-scored inputs (mechanism-graph centrality, methodological specificity, gap-filling status) with early-trajectory shape (citation velocity and citing-field breadth at t+2 and t+5), and the model is trained to predict the disruption label from these features rather than from author, institution, or venue identity, which are excluded so that a prestigious byline cannot lift the score. SBTI is difficult to game directly because its target realizes 10 or more years out and its content features depend on other researchers' independent downstream choices rather than on anything the author writes, so the only path to a high SBTI is producing work with genuine mechanistic centrality and specificity that later matures. The output is a calibrated probability with a 90% confidence interval.
Implementation.
1. Construct training corpus from Web of Science, OpenAlex, or Microsoft Academic with full-text access.
2. Label papers by disruption score at 20-year horizon (where possible) or by curated panels of foundational and non-foundational exemplars.
3. Extract early-trajectory features at t+2 and t+5 years post-publication.
4. Train model to predict 10-year and 20-year citation accumulation conditional on early features.
5. Apply to new submissions with calibrated probability output and 90% confidence intervals.
Failure mode and mitigation.
The model can over-fit to obscurity itself, generating false positives for unimportant work with low early citation. Mitigation requires joint conditions with mechanism-graph centrality and paper-substance score above field-median thresholds. A paper qualifies as a sleeping-beauty candidate only when low early citation pairs with high content scoring on independent dimensions.
RFI category alignment.
Foundational Scientific Exploration.
Definition.
Centrality of a paper's contributed concepts within a knowledge graph of mechanistic relationships extracted from biomedical literature.
Motivation.
Current metrics treat all citations equally, counting a citation to a deep mechanism paper the same as a citation to a translational hook paper. This treats foundational mechanistic characterization as equivalent to shallow translation when in fact the mechanism work is the substrate downstream applications require. The Mechanism-Graph Centrality Index fills this gap by scoring how foundational a paper's contributed concepts are within the knowledge graph of mechanistic relationships, distinguishing structural contributions from incremental elaboration.
Formulation.
MGCI(p) = Σ over c ∈ Concepts(p) of PageRank(c, G_mech) · attribution(p, c)
where G_mech = (V, E) is the knowledge graph with vertices V (mechanistic concepts extracted from full-text papers via biomedical LLM) and edges E (causal and dependency relationships extracted from text), and attribution(p, c) is the fraction of paper p's contribution to concept c (1 if origination, decreasing for elaboration).
AI specification.
A two-stage system. First, a biomedical LLM (BioGPT, PubMedBERT, or domain-tuned successor) performs semantic relation extraction on full text to identify causal and dependency edges between mechanistic concepts. Second, a graph neural network computes centrality with PageRank or eigenvector-centrality variants. Edge extraction uses dependency-parsing plus relation-classification, with confidence thresholds for edge inclusion.
Parameters and thresholds.
Edges enter the graph only above a relation-classification confidence of 0.7, with the threshold set so that extraction precision against a held-out gold-reference edge set exceeds 0.8 (a false edge distorts centrality more than a missed one, so precision is favored). That gold reference checks factual extraction correctness, whether the stated mechanistic relation is actually present in the text, not any judgment of the relation's importance. PageRank uses the standard 0.85 damping factor over the directed mechanism graph, recomputed quarterly as new papers add edges. Attribution attribution(p, c) is set to 1.0 for the originating paper of a concept, decaying for elaborating papers by their semantic distance from the originating definition, so centrality credit concentrates on origination rather than restatement. Paper-level MGCI is the sum of attributed centrality across the concepts a paper contributes, field-normalized to a z-score against same-field same-era papers. An extraction-accuracy audit samples 10% of extracted subgraphs annually against a gold reference to bound edge-extraction drift, checking factual correctness of extracted relations rather than scoring importance, with centrality outputs reported with confidence intervals derived from edge-confidence propagation.
Implementation.
Build candidate concept dictionary from UMLS, MeSH, Gene Ontology, and field-specific ontologies.
Run LLM relation extraction across full-text corpus to populate edges with confidence scoring.
Compute centrality scores per concept at quarterly intervals.
Aggregate concept-level centrality back to papers as their concept-weighted contribution.
Update graph as new papers add edges, with concept disambiguation across emerging fields.
Calibrate centrality outputs against the NIH-funded reference population stratified by field per Section 3.5.0 Element 4, so a paper's mechanism-graph centrality is scored relative to its own field's connectivity norms rather than a global standard.
Failure mode and mitigation.
Concept extraction errors propagate to graph structure, particularly in emerging fields where terminology has not stabilized. Mitigation requires extraction-accuracy audit against a gold reference in sampled subgraphs (10% sample annually) and confidence intervals on centrality outputs, with low-confidence extractions flagged and down-weighted rather than overridden by a human scorer, so the mitigation stays computational and does not reintroduce a per-paper judgment injection point. Direct gaming is structurally limited because a concept's centrality is determined by how the rest of the literature connects to it, not by anything the author writes, so an author cannot inflate MGCI by asserting importance. The residual gaming vector is manufactured citation to a coined concept by a coordinated group, which is handled the same way as in the field-define indicator, by weighting concept connections toward independent (non-co-author) sources so within-clique reinforcement does not raise centrality.
RFI category alignment.
Foundational Scientific Exploration.
Definition.
A publication-time score of how well a paper identifies an incorrect or unexamined assumption in the prior literature and addresses it, across three modes, demonstrating that a prior approach or premise fails, correcting a wrong assumption directly, or building a new approach past a documented failure. It replaces the separate failure-analysis and corrective-contribution indicators of earlier framework versions, whose cores were the same act viewed two ways, and it scores the quality of the engagement rather than any downstream effect on the field.
Motivation.
Current metrics structurally penalize the work that confronts the field's mistakes, since high-impact journals select against negative and corrective findings and the resulting citation metrics follow, so incorrect assumptions go unexamined and failure modes go uncharacterized. Framing this contribution as field-killing (did this work end a line of inquiry) sets the wrong and unmeasurable bar. The right bar is how well the work addresses an incorrect assumption, which is the ordinary and essential mechanism of scientific self-correction and which no current metric credits. AAI is deliberately publication-time and corpus-relational, so it reads the paper's own engagement with the prior literature rather than waiting on downstream citations, and it credits the common high-value case that pure-correction framing missed, a paper that takes a documented failure and builds a new design past it.
Formulation and AI specification.
AAI applies the Section 3.5.0 Element 2 machinery to the paper's target assumption and scores three features, then classifies the mode. Assumption identification locates a specific assumption, premise, or approach in the same-year-and-prior corpus that the paper treats as incorrect or unexamined, credited only when the assumption is shown to be actually present and supported in the prior corpus rather than characterized as prevailing by the paper itself, which blocks the strawman vector. Confrontation quality scores whether the paper engages that assumption directly with evidence, testing it, showing where it breaks, or identifying the variable that invalidates it, reusing the conflict-resolution and dogma-engagement machinery, versus merely asserting the assumption is wrong. Resolution scores what the paper does about it, and classifies the mode as failure-demonstration (empirically showing the premise does not hold), correction (overturning the assumption with a replacement), or building-past (the therefore-we-tried-this-new-design case, where the paper takes the documented failure as a premise and constructs a new approach). The building-past mode is first-class, since building forward on a corrected assumption is the common workhorse the field advances on. Strength of the addressed assumption is weighted by its recency-and-support in the corpus, so confronting a well-supported recent assumption scores above knocking down a fringe one.
Implementation.
Extract the paper's target assumption, its confrontation, and its resolution using the Element 1 extraction model.
Verify the assumption is actually present and supported in the same-year-and-prior corpus via Element 2 retrieval, weighting by its recency-and-citation support and rejecting assumptions the corpus shows were already doubted.
Score confrontation quality by the conflict-resolution and dogma-engagement checks, requiring direct evidence-based engagement rather than assertion.
Classify the resolution mode (failure-demonstration, correction, building-past) and score the resolution, crediting building-past as a first-class mode.
Map the three features to the score and normalize against the Element 4 reference stratum, reporting the mode alongside the score. The correctness of the challenge is not scored here and is deferred to the career-terminal Assumption-Correctness Indicator (Section 5.4.2).
Failure mode and gaming defense.
Manufacturing a strawman assumption that nobody actually held is the dominant gaming vector, blocked because assumption identification reads the actual corpus support for the targeted assumption rather than the paper's characterization, so an assumption the corpus shows was already contested cannot be dressed as prevailing dogma. Framing an ordinary incremental disagreement as addressing a wrong assumption is defended by the confrontation-quality requirement of direct evidence-based engagement. One boundary is intrinsic and stated, AAI measures the quality of the assumption-addressing act, not whether the assumption was in fact wrong, since a paper can confront an assumption well and be mistaken, which is why correctness is deferred to the career-terminal metric and to eventual replication.
RFI category alignment.
Rigor and Reproducibility (primary), Foundational Scientific Exploration (secondary).
Definition.
A composite score of methodological complexity, data novelty, statistical rigor, and primary-data ratio, calibrated to distinguish substantive contributions from volume-first publication patterns.
Motivation.
Current metrics count papers regardless of underlying methodological substance, so a paper with a single statistical test on a reused public dataset counts the same as a paper with extensive primary data collection, multiple independent analyses, and rigorous statistical correction. This produces the volume-first failure mode where high-output strategies can accumulate paper counts (10 or more first-author papers per year) without contributing substantive work that the field builds on. The Paper-Substance Score fills this gap by scoring methodological complexity, data novelty, statistical rigor, and primary-data ratio as content-level features that distinguish substantive contribution from publication-velocity gaming.
Formulation.
PSS(p) = w₁ · M(p) + w₂ · D(p) + w₃ · S(p) + w₄ · R(p)
where M(p) is methodological complexity (LLM-evaluated breadth and depth of methods in the Methods section), D(p) is data novelty (proportion of dataset newly generated versus reused from public sources or collaborator data), S(p) is statistical rigor (number of independent tests, multiple-comparison correction reporting, effect-size reporting, and pre-registration where applicable), R(p) is the primary-data ratio (fraction of analyses based on newly collected versus retrieved data), and weights w₁ ... w₄ are tuned empirically against retrospective disruption signal.
AI specification.
A composite whose four components are computed from structural evidence and fixed rules rather than from a human-labeled quality corpus, so PSS detects the volume-first pattern computationally rather than reproducing expert taste. Data novelty (D) is computed by matching the paper's datasets against public-repository fingerprints and collaborator-data provenance, so a reused GEO or dbGaP dataset is detected as reuse regardless of how the methods describe it, and the primary-data ratio (R) follows from the same provenance matching rather than from the paper's self-description. Methodological complexity (M) counts the number and interdependence of distinct experimental or analytic procedures actually executed, a structural count from the methods. Statistical rigor (S) reuses the analysis-rigor checks of Section 3.5.8 in full (test-to-data-type match, dependence-structure and pseudoreplication detection, claim-alignment, and the silence-is-deficiency disclosure audit) rather than re-specifying a weaker parallel version. The volume-first archetype, a single test on a reused public dataset with no primary collection, is therefore caught by the conjunction of low D, low R, low M, and a thin S disclosure profile. None of D, R, M, or S requires an annotated good-versus-bad training corpus, since each reads provenance, structure, or the fixed rules of statistics. For a theory, review, or perspective paper the primary-data-ratio and statistical-rigor components are not scored, per the analytical-mode specification of Section 3.5A, so that a synthesis paper's substance is carried by its gap, novelty, and integration signals rather than penalized for the primary-data collection it was never meant to perform.
Implementation.
Compute data novelty and primary-data ratio by matching the paper's datasets against public-repository fingerprints and collaborator-data provenance (Element 2 corpus).
Compute methodological complexity as the structural count of distinct executed procedures extracted from the methods.
Compute statistical rigor by invoking the analysis-rigor checks of Section 3.5.8, reusing that index's output rather than a separate scorer.
Combine the four components, publish per-component sub-scores for transparency, and apply PSS as a calibrator across other indicators.
Calibrate each component against the NIH-funded reference stratum per Section 3.5.0 Element 4, so substance is scored relative to field norms rather than a global standard.
Failure mode and mitigation.
Engineering a paper to score well without substance is the gaming vector, but it is constrained because the components read provenance and structure. Raising D or R requires actually generating primary data rather than describing it, raising M requires actually executing more interdependent procedures, and raising S requires actually satisfying the Section 3.5.8 checks including full disclosure. The residual vector is manufacturing apparent methodological complexity through procedures that add count without contribution, mitigated by weighting M by interdependence rather than raw count and by conjunction with the other components, so complexity without novelty or primary data does not lift the composite. Periodic external audit by an independent bibliometric panel and public per-component sub-scores keep the composite inspectable.
RFI category alignment.
Cross-cutting calibrator applicable to all RFI categories, with primary alignment to Rigor and Reproducibility and Foundational Scientific Exploration.
Definition.
A meta-index that composes eleven named step-level indices into a reported profile of the scientific reasoning a paper demonstrates. Nine correspond to the canonical steps of the scientific method (observation, question, hypothesis, prediction, experimental design, execution, analysis, interpretation, reporting), and two capture reasoning qualities that cut across those steps and are also scored at the network level by standalone indicators (gap identification, paired with the Gap Index of Section 3.7, and synthesis, paired with the Cross-Field Integration Index of Section 3.9). SRQI is to its step-indices what the Data-Driven Innovation Composite is to the diagnostic arm, a composite over independently-specified sub-indices, and it is reported as the vector of its eleven step-scores rather than collapsed into a single number.
Motivation.
Current metrics measure the outcome of reasoning (citations the paper eventually receives) but never the reasoning itself, so a paper with rigorous hypothesis structure, adequate controls, and calibrated interpretation is scored identically to a paper with the same eventual citation count but weaker reasoning. Venue prestige serves as the informal proxy for reasoning quality, and venue prestige is decoupling from contribution under the pay-to-play pressures described in Section 1. SRQI fills this gap by scoring the reasoning operations directly from the paper text. Decomposing into named step-indices rather than a flat multi-component score is deliberate, because each step of the scientific method is a distinct operation that a paper can perform well or poorly independent of the others, and each therefore warrants its own definition, formulation, evidence rubric, and gaming defense rather than being buried as an unnamed component. A single collapsed number would reward proximity to one template and homogenize the field toward one reasoning style, whereas step-level reporting preserves the legitimacy of different scientific modes, such as an exploratory foundational study scoring high on gap identification and synthesis but lower on immediate analytic closure.
Formulation.
SRQI(p) = ( SObs, SGap, SQ, SHyp, SPred, SDes, SExec, SAna, SInt, SSyn, SRep )
reported as the vector of the eleven step-indices defined in Sections 3.5.1 through 3.5.11, each on the common anchored 1-to-10 rubric (uncapped above 10 for documented beyond-normal work) specified in the Notation section, with a confidence interval. The common scale is required because the composite is a committee-weighted average across steps, which is coherent only when the steps share a scale. No canonical weighting across steps is specified. Evaluating committees select role-appropriate weights, so a theory-focused search weights question formulation and synthesis while a methods or core-facility hire weights design and execution, which converts the anti-homogenization safeguard (Section 6.7) into a structural feature of the meta-index rather than an external caveat. The composite reports the weighted average alongside the full step vector, never in place of it, and any above-10 step-scores are surfaced so a committee sees where a paper broke the field's normal range.
AI specification (shared across step-indices).
Each step-index is a scorer that keys on the structural evidence the step requires rather than on claim language, field-stratified because step norms differ across experimental, computational, and theoretical work, and outputting the common anchored 1-to-10 score (uncapped above 10) with a confidence interval. The scale is anchored to the reference population, not to expert opinion, so a 7 is the median computed feature value in the field stratum, a 10 is the 95th percentile, and above-10 is beyond the observed range. That score reports where a paper's computed features fall relative to the field's actual distribution of those same features rather than where an annotator judged it to sit. Surface features that are cheap to fabricate (claim phrases, section headers, boilerplate) are held out of every step-index feature set, so the cheapest path to a high score is to perform the operation rather than name it. The step-index specifications below give the step-specific evidence each scorer reads, the feature-to-score mapping, the rubric anchors, and the step-specific gaming defense.
Failure mode and mitigation (meta-index level).
The meta-index inherits the correlation among its steps (design and analysis co-vary, gap identification and synthesis co-vary), so the composite must report per-step sub-scores and attribute shared signal transparently rather than presenting correlated steps as independent evidence. Homogenization is addressed by mandatory per-step reporting and committee-selected weighting. Gaming is addressed at the step level, since each step-index requires its own structural evidence, and a paper cannot raise SRQI without performing the steps it claims.
RFI category alignment.
Rigor and Reproducibility (primary), Foundational Scientific Exploration (secondary), cross-cutting content calibrator.
Every step-index in Sections 3.5 and 3.5E instantiates the common protocol specified here, so each step section below states only its step-specific features and rubric while inheriting the seven elements defined once in this section. The concrete components and thresholds named below are calibratable starting values chosen to be buildable today, not fixed requirements, and each should be tuned during the calibration phase against the acceptance criteria.
Element 1, unit of analysis and extraction.
The scored unit is the claim, defined as a single asserted empirical or reasoning statement, and a paper is decomposed into its constituent claims before scoring rather than scored as an undifferentiated whole. A claim-extraction model (a fine-tuned long-context transformer, GPT-4-class or a domain-adapted open model such as a biomedical Llama variant, run over the full text) segments the paper into candidate claims tagged by type (observation, gap, hypothesis, prediction, design, execution, analysis, interpretation, synthesis, reporting) and by section of origin. Each step-index scores the claims of its own type, and the paper-level step-score is the citation-weighted or salience-weighted aggregate of its claim-level scores, with the primary claim (the abstract's central assertion) weighted highest. Extraction is validated for segmentation-boundary agreement (where one claim ends and the next begins, a mechanical text-structure task rather than a quality judgment) to a boundary-agreement F1 above 0.80 before scoring proceeds.
Element 2, reference corpus and retrieval.
The reference corpus is OpenAlex (approximately 250 million works, openly licensed, with abstracts, citation edges, and institution and author metadata), mirrored locally for the compute described in Element 7, supplemented by PubMed for biomedical full-text abstracts and by Semantic Scholar for citation-context sentences. Same-year-or-prior retrieval embeds each claim with a scientific sentence-embedding model (SPECTER2 or a comparable citation-informed encoder) and retrieves the k=50 nearest prior claims by cosine similarity, with the near-neighbor set defined as those above a cosine threshold of 0.75, tuned per field during calibration. Presence and linkage are judged by a natural-language-inference model (a fine-tuned DeBERTa-v3 or biomedical equivalent) that classifies each retrieved neighbor as entailing, contradicting, or neutral to the claim. A claim is counted as already present when any prior neighbor entails it at NLI confidence above 0.85, and counted as linked to a gap when a prior neighbor is classified as leaving the claim's target open. Field distance is the shortest-path geodesic between the claim's field and each connected field in the OpenAlex concept co-citation graph, normalized to the field-pair distance distribution, with the alternative embedding-cosine distance reported as a robustness check.
Grounding-evidence requirement (applies across all elements).
No step may assign a score by pattern-matching to summary features, surface templates, or presumed absence, since scoring the summary of the evidence rather than the evidence itself repeats the substitution of a countable proxy for the measured thing that the framework indicts in the h-index. Each score is grounded in the primary evidence its claim's warrant rests on, emitted and checked before the score issues, which yields three hard gates:
A novelty score is invalid unless the same-year-or-prior near-neighbor set of Element 2 was actually retrieved and its grounding sentences emitted, so novelty is never asserted from presumed absence of prior work, the false-novelty failure the Section 6.8 check prevents, here prevented at the scorer's own hand.
A fidelity score is invalid unless a sample of the actual primary data (the traces, waveforms, or images the summary statistics derive from) was examined for physical and physiological plausibility. Recording claims are therefore not credited on yield, signal-to-noise, and amplitude-stability summaries when the underlying waveforms may be non-physiological in shape or lack refractory-period validation. When the paper supplies too little primary data to support the claim it makes, that insufficiency is scored as a deficiency rather than parked at cannot-verify, since the burden of evidence sits with the paper and an unsupported claim is a shortfall the paper owns. This debit is calibrated to the gap between the claim's strength and the evidence supplied, so a modest claim on modest data is not penalized while a categorical claim on the same data is. Cannot-verify is reserved for the narrow case where sufficient evidence exists in the paper but is inaccessible to the scorer, for instance in a figure it cannot parse, and even then it caps the fidelity score below a sufficient-and-readable artifact rather than setting it neutral. Insufficient or unreadable data can never score as high as data that is present and adequate.
A rigor score reads the claim's mode of empirical warrant before applying rigor criteria, so an observationally-warranted claim is scored by observational-rigor criteria rather than marked deficient for lacking the inferential statistics an experimental claim carries.
The common failure these gates guard against is scoring the summary of the evidence in place of the evidence, and each gate names the primary artifact (a retrieved neighbor set, an examined data sample, an identified warrant) that must be present before the corresponding score is issued. These gates bind every AI-based scoring component in the framework, not only the SRQI and EMI step-indices that instantiate this full protocol but equally the standalone AI indicators that read content against priors, AAI, MGCI, GI, FDI, CFII, VPEI, and the analytical-mode floor, with the full enumeration and the rule-based boundary set out in Section 3.15. Every one of those assigns a score by comparing a claim to prior work or to a rigor standard, so each faces the same three failure modes, scoring from summary features, presentation-gaming, and young-field prior-poverty. Each therefore inherits the grounding-evidence gates above, the held-out-feature discipline that bars a claim's presentation strength from raising its score, and the always-on adjacent-field prior borrowing and per-claim prior-density weighting of Section 6.9, rather than reimplementing or omitting them. Defining these guards once here and binding them by scope, instead of restating them in each indicator, follows the same inherit-once design as the seven elements and prevents an indicator from silently omitting a guard the others carry.
Gate-contract interface, the build-phase specification every AI metric must instantiate.
Whether a guard runs correctly is fixed in the building phase, and the builder works from this specification, so a metric specified only as applying another metric's machinery is operationalized to whatever local standard the builder infers, which under time pressure is the minimum that returns a number. To stop a gate from being built as a checkbox, every AI-based metric must instantiate a gate contract with four required fields, none left blank, and a metric with any blank field is not build-ready and is not deployed. The four required fields follow:
Primary grounding artifact, the specific evidence the score must be grounded in and must emit before the score issues, named concretely for this metric rather than inherited by reference, the retrieved prior-neighbor set for a novelty-bearing metric, the examined primary-data sample for a fidelity-bearing metric, and the identified warrant with its checked criteria for a rigor-bearing metric.
Retrieval scope, the corpus the metric retrieves against, stated to include the target subfield together with the adjacent fields reached by the Element 2 concept-graph geodesic, so adjacent-field borrowing is an operational instruction the builder wires in rather than a principle the builder may quietly drop.
Held-out features, the presentation and surface signals barred from contributing to this metric's score, given as a concrete exclusion list for this metric's content, so the anti-presentation-gaming rule is code the builder writes rather than a value the builder is trusted to honor.
Per-metric acceptance test, the known-outcome cases the built scorer must reproduce before it deploys, stated for this metric's specific failure mode, so a scorer whose gate runs cosmetically fails its own test rather than passing silently.
The emitted artifact must actually constrain the score, verified by ablation, since a builder can satisfy the letter of the contract by emitting an artifact the score never uses. Removing or perturbing the emitted artifact must change the metric's score by a stated minimum, and a metric whose score is unchanged when its grounding artifact is withheld has a cosmetic artifact and fails the contract. The acceptance test is the load-bearing field, since it is what detects a contract populated in form but not in function, so it carries no default and no waiver.
Staged deployment is conditioned on every AI-based metric passing its own acceptance test rather than on an aggregate score, because an aggregate lets a few well-built metrics carry several cosmetic ones while the composite still reads as acceptable. A metric that fails or omits its acceptance test is disabled rather than shipped, so the framework deploys the AI metrics that are real and withholds those that are not, matching the staged-conditional posture the preface takes toward the framework as a whole. Section 3.15 gives the filled contract for every AI-based metric, and any metric absent from that section is named there as rule-based or constitutive with its reason. Fixing the boundary between AI-scored and rule-computed metrics in the specification rather than leaving it to a builder blocks the escape of reclassifying an AI metric as rule-based to avoid the gate.
Element 3, feature-to-score function.
Each step-index computes its step-specific feature vector (defined in its section), then maps that vector to the 1-to-10 score by its percentile position in the reference-population distribution of the same computed features. Anchors at 3, 7, and 10 are therefore distributional cut points (roughly the lower range, the median, and the 95th percentile of the field stratum) rather than expert-assigned grades. The mapping requires no human quality-labels, and this is deliberate. A scale learned from expert judgments of what counts as good would reintroduce at the calibration step exactly the relationship-laundering the framework exists to escape, whereas a scale anchored to the field's own computed feature distribution has no per-paper human input to capture. Where a step has two dimensions (SObs has fidelity and novelty), each dimension is percentile-scored separately and the step-score is their combination at a stated weight (initialized equal, adjustable only by transparent governance, not by per-paper tuning), with both dimension sub-scores reported so a committee sees the composition. The above-10 tail is produced when a paper's computed features exceed the reference range, and, for the incomputability case, by the explicit detector each relevant step defines, so the tail is generated by a stated rule rather than by unbounded model output.
Element 4, calibration across NIH-funded science.
The reference population for normalization is all publications acknowledging NIH funding in the trailing ten years, identified through the NIH acknowledgment and RePORTER linkage, stratified into fields by primary study-section assignment where available and otherwise by OpenAlex top-level concept, targeting at least 2000 papers per stratum. Within each stratum the 1-to-10 rubric is anchored to the computed feature distribution directly, so 7 is the stratum median and 10 is approximately the 95th percentile of the same computed features across the reference population, which means the anchors are read off the distribution rather than estimated from any annotated sample. The above-10 frequency is monitored per stratum, with an expected empirical rate below approximately two percent of claims, and a stratum whose above-10 rate drifts materially above that is re-examined, since drift indicates either miscalibration or a genuine shift in the field's frontier that the reference distribution must absorb. Recalibration runs annually against the updated reference population to track evolving field norms, and because the anchors are distributional the recalibration is a recomputation, not a re-annotation.
Element 5, feature-validation protocol.
Validity of the computed features, that a high computed feature value really does track the quality the step names, is established by retrospective outcome-controls rather than by expert annotation. The scorer must reproduce the correct signal on held-out known-outcome cases (the eventually-vindicated work the preface names as positive controls and the volume-first patterns as negative controls), so the features are validated against what the field's later evidence showed, not against what a panel believed at scoring time. This removes the human-label capture surface entirely and, with it, the need for any rater-bias correction, since there are no annotator judgments to debias. Human judgment enters only in the transparent, auditable, and small governance decision of which retrospective cases count as positive and negative controls, a decision made once and openly rather than per paper, and the control set itself is published so the validation can be independently reproduced. The acceptance gate is that the scorer recovers the outcome-controls at a stated rate within each field stratum before it is deployed.
Element 6, validation and acceptance criteria.
A step scorer is accepted only when it clears five gates. First, recovery of the retrospective outcome-controls within each field stratum at a stated rate, replacing any correlation-with-expert-consensus criterion, since the framework uses no expert consensus scores. Second, the graceful-failure gate of Section 7, meaning its errors do not correlate with suppression of historically-vindicated foundational work, tested on the retrospective positive controls (Karikó, Brock, Maynard) and volume-first negative controls. Third, an above-10 detector false-positive rate below five percent, measured by open adjudication of flagged incomputability cases against the mundane-sparsity alternatives (niche subfield, poor indexing, jargon mismatch), with the adjudication procedure and cases published so the check is reproducible rather than a private judgment. Fourth, robustness of the field-distance signal under the alternative distance measure, with the two measures agreeing on the ordinal ranking of claims above a rank correlation of 0.8. Fifth, the grounding-evidence gate of the requirement stated above, meaning the scorer emitted and checked the primary evidence behind each score, the near-neighbor set for novelty, the examined primary-data sample for fidelity, and the identified warrant for rigor, rather than scoring from summary features. A scorer failing any gate is not deployed, per the staged recommendation.
Element 7, compute and data access.
The build requires a local mirror of OpenAlex (on the order of hundreds of gigabytes for metadata and citation edges), precomputed claim embeddings for the reference corpus (billions of claim vectors at embedding-model dimension, on the order of tens of terabytes, amortized across all step-indices since retrieval is shared), and per-claim inference cost dominated by the extraction, entailment, and scoring passes. Scoring the full NIH-funded corpus is a batch job on the order of tens of millions of papers times tens of claims each, feasible on a modest GPU cluster over weeks, and a pilot on a single investigator or a small cohort runs on a single workstation. Author, institution, and venue identity are excluded from every scorer's feature set at inference, and no persistent per-author dossier is retained, per Section 6.
Definition.
The fidelity and the novelty of the empirical observations a paper reports, combining how precisely and completely the observation is reported with how genuinely it departs from and connects to the prior literature, treating observation as the foundational and most differentiating scientific step.
Motivation.
Observation is the step where genuine discovery most often originates and the step least rewarded by outcome metrics, since a citation count cannot distinguish a paper that reported exactly what it saw from one that saw selectively. The index scores whether observations are reported with the precision, controls-for-artifact, and completeness that let others trust and build on them.
Formulation and AI specification.
SObs scores two dimensions instantiating the shared protocol of Section 3.5.0. The fidelity dimension reads four features from the claim's supporting results and methods. Three are measurement precision (a binary-plus-degree feature for stated units, error bars, and instrument resolution), observational completeness (the ratio of reported to methods-implied measurements, capturing whether negative and off-target observations appear alongside the positive finding), and artifact control (presence of explicit confound discrimination). A fourth is primary-data plausibility. The primary-data-plausibility feature examines a sample of the actual primary data behind the summary statistics, the raw traces, waveforms, or images, for physical and physiological plausibility, and it is a hard gate on the fidelity score per Section 3.5.0. A neural-recording claim is credited only when the reported waveforms show physiological action-potential morphology and refractory-period structure, not the non-physiological shapes that signal motion artifact, since yield, signal-to-noise, and amplitude-stability summaries can all remain high while the underlying signal is artifact rather than spikes. When the paper supplies too little primary data to substantiate the recording claim, for example no inter-spike-interval or refractory structure for a single-unit claim, that insufficiency is scored as a fidelity deficiency rather than left neutral, per the burden-on-the-paper rule of Section 3.5.0. This dimension keys on the examined primary data and the reported measurements with their uncertainty rather than on adjectives like careful or rigorous. The novelty dimension applies the Element 2 retrieval to the observation claim, computing three features. Presence is one minus the maximum NLI entailment confidence over the k=50 same-year-or-prior near neighbors, so an observation any prior neighbor already entails scores near zero novelty regardless of framing. Linkage is the maximum gap-entailment confidence over the same neighbor set, crediting an absent observation that addresses a gap or contradiction the prior set actually contains. Distance is the normalized concept-graph geodesic between the fields the observation connects. Both novelty axes must clear their thresholds, since an observation absent but unlinked is irrelevant and an observation linked but present is restatement. Every relational feature emits the specific prior paper and sentence that produced it, and this emission is a hard gate per Section 3.5.0. No novelty is scored from presumed absence of prior work without the retrieval having been run and its neighbors checked, which is the discipline whose omission produces false-novelty credit. The presence feature is down-weighted by the substance score of the nearest prior neighbors, so a novelty that is absent only because the adjacent prior work was too weak to have found it is discounted. A dormancy exception modifies the presence feature so that resurfacing genuinely lost prior work is credited rather than penalized. The exception fires when the nearest prior neighbor is old and effectively out of the field's working memory, measured by its age, a dormant citation trajectory (few or no citations in the years before the current submission), and low retrieval-accessibility. It requires in addition that the current paper cites and builds on that neighbor rather than concealing it. In that case the observation is scored as recovery-and-extension rather than restatement, because re-establishing and building on a finding the field had lost is a contribution the citation record itself failed to preserve. The discriminating signal is whether the paper acknowledges the dormant source and builds forward, which is recovery, versus claims fresh novelty over an actively-cited recent source, which is the false-novelty gaming vector addressed in Section 6.8.
Implementation.
Extract observation claims from the paper using the Element 1 claim-extraction model, tagging each by section of origin and identifying the primary observation from the abstract.
Score the fidelity features per observation claim from the results and methods, computing precision, completeness, and artifact-control features.
For each observation claim, run Element 2 retrieval against the OpenAlex same-year-or-prior corpus (SPECTER2 embedding, k=50, cosine threshold 0.75), then the NLI pass (DeBERTa-v3-class, entailment threshold 0.85) to compute presence and linkage, and the concept-graph geodesic for distance, retaining the grounding sentences.
Map the fidelity and novelty feature vectors to two 1-to-10 sub-scores via the Element 3 ordinal regressions, combine them into the SObs claim-score at the calibrated weighting, and fire the above-10 incomputability detector where the near-neighbor set is sparse or empty and a distant linkage is verified.
Aggregate claim-scores to the paper-level SObs by salience weighting (primary observation highest), report the fidelity and novelty sub-scores separately with confidence intervals and the grounding citations, and normalize against the Element 4 NIH-funded reference stratum for the paper's field.
Failure mode and gaming defense.
An author could pad the results with precise-looking but irrelevant measurements, defended by conditioning fidelity on relevance to the stated question so precision is credited only for observations bearing on the claim, with completeness verified by checking whether the methods imply measurements the results omit. A second attack targets the novelty dimension through coined terminology that manufactures artificial retrieval sparsity, defended by embedding the observation's semantic content and judging absence at the level of its meaning and cited grounds rather than its surface terms, so novel words alone do not create apparent novelty. Observation alone cannot separate a scientist from a fiction writer, since an invented off-map observation scores high on novelty and distance by construction, which is why a high SObs is credited as science only when the downstream steps (Sections 3.5.3 through 3.5.11) also hold, and the high-observation-low-downstream profile is the computable signature of overclaiming that the separately-scored step vector exposes.
Rubric anchors and scale.
On the common rubric, a 3 reports observations without stated precision, uncertainty, or artifact controls or reports an observation already present in prior work, a 7 reports precise, uncertainty-bounded observations relevant to the claim with basic artifact control and genuine absence from near-prior work, a 10 reports complete observations with full uncertainty quantification and explicit confound discrimination at the field's best standard that are absent from prior work and linked to a real prior gap across a measurable field distance, and a score above 10 is reached either by measured excellence beyond that range or by principled incomputability, meaning the same-year-or-prior near-neighbor set is sparse or empty because the observation opened ground the field had not been standing on, conditioned on a verified distant linkage establishing the observation is founding rather than disconnected. Incomputability against the prior corpus is therefore the strongest novelty signal rather than a scoring failure, because founding work is definitionally hard to place on a map it created, and the above-10 score is a claim about novelty and foundingness only, not about correctness, which is deferred to the trajectory metrics and eventual replication.
Definition.
The quality and originality of the knowledge gap the paper identifies and targets, scored at the reasoning level within the paper, distinct from the network-level Gap Index (Section 3.7) which tracks gaps across the literature.
Motivation.
Identifying the right gap is the reasoning act that separates consequential questions from incremental ones, and it is invisible to citation metrics. This index scores whether the paper locates a real, precisely-stated gap and motivates why closing it matters, rather than asserting novelty generically.
Formulation and AI specification.
SGap instantiates the shared protocol of Section 3.5.0 and reads the introduction and discussion for an explicit statement of what is unknown, evidence that the gap is real (prior work is cited as leaving it open rather than as already closed), and a stated consequence of closing it. The scorer keys on the logical structure of a gap argument, an unknown plus why it matters plus why it is not yet resolved, not on the word novel. This overlaps by design with the network-level Gap Index and the two are reconciled in Section 3.14.
Implementation.
Extract the gap claim, the cited prior work, and the stated consequence using the Element 1 extraction model.
Compute a gap-reality feature by testing, via Element 2 retrieval and NLI, whether the cited prior work is classified as leaving the gap open rather than as having closed it, retaining the grounding sentences.
Compute a gap-independence feature by cross-checking the claimed gap against the network-level Gap Index (Section 3.7) for corroboration that others independently treat it as open.
Map the features to the 1-to-10 score via the Element 3 ordinal regression and normalize against the Element 4 reference stratum.
Failure mode and gaming defense.
Generic novelty claims are the gaming vector. The defense requires the gap to be tied to specific cited prior work shown to leave it open, so an unsupported claim of novelty scores low, and cross-checks the claimed gap against the network-level Gap Index for independent corroboration.
Rubric anchors and scale.
On the common rubric, a 3 asserts novelty generically with no cited prior work leaving the gap open, a 7 states a specific gap tied to cited work and a stated consequence of closing it, a 10 identifies a precisely-bounded gap with strong evidence it is real and unaddressed and a compelling stated stake, and a score above 10 identifies a gap the field had not recognized as a gap, reframing what is unknown.
Definition.
The falsifiability, scope, and precision of the research question the paper states.
Motivation.
A well-formed question is a precondition for interpretable results, and a vague or unfalsifiable question produces work that cannot be wrong and therefore cannot be informative. The index scores whether the question is specific, answerable, and falsifiable.
Formulation and AI specification.
SQ instantiates the shared protocol of Section 3.5.0 and reads the stated question or aims for falsifiability (is there an outcome that would answer it no), scope calibration (is the question matched to what the methods can address), and precision (are the variables and conditions specified). The scorer keys on the presence of a falsifiable structure, not on question phrasing.
Implementation.
Extract the stated question or aims using the Element 1 extraction model.
Compute a falsifiability feature (is there a stated outcome that would answer the question no), a scope-calibration feature (does the question match what the methods can address, tested by entailment between question and methods), and a precision feature (are variables and conditions specified).
Where a preregistration exists, check question-to-registration consistency to detect questions reverse-fit to the results.
Map the features to the 1-to-10 score via the Element 3 ordinal regression and normalize against the Element 4 reference stratum.
Failure mode and gaming defense.
Post-hoc question fitting, stating a question the results happen to answer, is the gaming vector. The defense checks consistency between the stated question, the pre-registered aims where available, and the methods, flagging questions that appear reverse-fit to the results.
Rubric anchors and scale.
On the common rubric, a 3 states a vague or unfalsifiable question, a 7 states a specific, falsifiable question matched to the methods, a 10 states a precisely-scoped falsifiable question whose answer either way is informative, and a score above 10 poses a question that opens a line of inquiry rather than closing one.
Definition.
The clarity of predicted outcomes, the presence of a stated null, and the use of multiple working hypotheses where the domain supports coexisting mechanisms rather than a single strong-inference dichotomy.
Motivation.
Neural interface biology and much of biomedicine involve mechanisms that coexist rather than compete, so a Chamberlin multiple-working-hypotheses structure often fits better than a Platt strong-inference single dichotomy. The index credits explicit, falsifiable predictions and, where appropriate, a structured set of alternatives rather than a single hypothesis defended against a strawman.
Formulation and AI specification.
SHyp instantiates the shared protocol and scores three features. Novelty applies the SObs Element 2 machinery to the hypothesis claim, testing it against the same-year-or-prior corpus of hypotheses so a hypothesis already stated in prior work scores low novelty regardless of framing, and one that is absent but linked to a real prior gap scores high. An internal-logic feature tests whether the hypothesis follows from the stated gap and whether each stated alternative carries a distinct predicted observation, since undifferentiated alternatives are not genuine multiple working hypotheses. A hypothesis-claim-alignment feature, the strongest overclaiming detector in this step, tests whether the hypotheses the paper actually tested are the ones its conclusions are about, because a common reasoning failure tests one hypothesis and concludes about a different claim. This is computed by an entailment check between the stated hypotheses, the reported tests, and the abstract's conclusions, flagging a paper whose conclusions are not entailed by the hypotheses it tested. The scorer keys on the logical relations among gap, hypothesis, test, and conclusion, not on the word hypothesis.
Implementation.
Extract the hypothesis claims, the stated gap, the reported tests, and the conclusion claims using the Element 1 extraction model.
Compute the novelty feature via SObs Element 2 retrieval and entailment against the prior-hypothesis corpus, retaining grounding sentences.
Compute the internal-logic feature by testing gap-to-hypothesis entailment and checking that alternatives carry distinct predicted observations.
Compute the alignment feature by testing whether the conclusion claims are entailed by the tested hypotheses, flagging tested-H-but-concluded-C mismatches.
Map the three features to the 1-to-10 score via the Element 3 ordinal regression and normalize against the Element 4 reference stratum.
Failure mode and gaming defense.
Listing token alternative hypotheses that are not genuinely distinguished is one gaming vector, defended by requiring each stated hypothesis to carry a distinct predicted observation. Stating hypotheses that match the eventual claims while testing something narrower is the subtler vector, defended by the alignment feature, which scores the entailment from tested hypotheses to stated conclusions directly, so a paper cannot raise SHyp by claiming a broad hypothesis while testing a narrow one.
Rubric anchors and scale.
On the common rubric, a 3 states no explicit hypothesis or an unfalsifiable one or draws conclusions unrelated to the hypotheses tested, a 7 states an explicit falsifiable hypothesis with a null whose conclusions align with what was tested, a 10 states a structured set of distinguishable working hypotheses each with a distinct predicted observation and full hypothesis-to-conclusion alignment, and a score above 10 formulates a hypothesis, novel against the prior corpus, that reconciles previously conflicting observations.
Definition.
The specificity and testability of the concrete predicted observations the paper derives from its hypotheses, treated as a distinct step because a hypothesis without a stated prediction is not falsifiable.
Motivation.
Prediction is the step that converts a hypothesis into something an experiment can confront, and its absence is what lets a hypothesis be defended against any outcome after the fact. The index scores whether the paper states, before the results, what specific observations each hypothesis predicts and what observations would falsify it.
Formulation and AI specification.
SPred instantiates the shared protocol and scores two features. Specificity reads for explicit predicted observations tied to each hypothesis, quantified where the domain allows (a predicted direction, magnitude, or threshold rather than a vague expectation) and stated before the results rather than inferred from them, scored by whether the prediction names a measurable outcome and a value or direction. A hypothesis-linkage feature, paired with the SHyp alignment feature, tests whether each prediction genuinely follows from a stated hypothesis by logic, since a prediction unconnected to the hypothesis it claims to derive from is not a test of that hypothesis, computed by an entailment check from hypothesis to prediction. The scorer keys on the presence of a specific, falsifiable predicted observation logically linked to a hypothesis, not on the word predict.
Implementation.
Extract prediction claims and their asserted parent hypotheses using the Element 1 extraction model.
Compute the specificity feature by testing whether each prediction names a measurable outcome with a direction, magnitude, or threshold.
Compute the hypothesis-linkage feature by the hypothesis-to-prediction entailment, flagging predictions not logically derived from a stated hypothesis.
Where a preregistration exists, check prediction-to-registration consistency to detect predictions stated after the results (HARKing).
Map the features to the 1-to-10 score via the Element 3 ordinal regression and normalize against the Element 4 reference stratum.
Failure mode and gaming defense.
Stating predictions after seeing the results (HARKing, hypothesizing after results are known) is the gaming vector. The defense checks consistency between predictions and any pre-registration and flags predictions matching the results with suspiciously exact specificity absent a registered protocol, and the hypothesis-linkage feature blocks the related move of attaching a well-formed prediction to a hypothesis it does not actually test.
Rubric anchors and scale.
On the common rubric, a 3 states no specific predicted observation or a prediction not derived from any stated hypothesis, a 7 states a falsifiable predicted observation logically tied to the hypothesis, a 10 states quantified predictions (direction, magnitude, or threshold) distinguishing among alternatives before the results, and a score above 10 makes a risky prediction that the field expected to fail and that the data confirm.
Rubric anchors and scale.
On the common rubric, a 3 states no specific predicted observation, a 7 states a falsifiable predicted observation tied to the hypothesis, a 10 states quantified predictions (direction, magnitude, or threshold) distinguishing among alternatives before the results, and a score above 10 makes a risky prediction that the field expected to fail and that the data confirm.
Definition.
The adequacy of controls, sample-size justification, blinding, and randomization against the subfield standard.
Motivation.
Design quality determines whether results can support their claims, and it is the step most often asserted rather than demonstrated. The index scores the actual design against what the question requires.
Formulation and AI specification.
SDes instantiates the shared protocol of Section 3.5.0 and rises only when the scorer locates an actual control condition described in methods and referenced in analysis, an actual power calculation or stated effect-size assumption behind the sample size, and actual randomization or blinding procedures, not the phrases appropriate controls or adequately powered. The comparative standard against subfield norm is a learned per-subfield baseline of each design feature.
Implementation.
Extract the described design elements (controls, sample-size basis, blinding, randomization) using the Element 1 extraction model.
Score each design feature by presence-as-structure, so a control counts only when described as a condition and referenced in the analysis, and a power calculation counts only when the computation is present, against the per-subfield baseline of Element 4.
Compute the discrimination feature by testing whether the design can distinguish the stated hypotheses (entailment between design and the SHyp hypothesis set).
Map the features to the 1-to-10 score via the Element 3 ordinal regression and normalize against the reference stratum.
Failure mode and gaming defense.
Claiming controls without describing them is the gaming vector. The defense requires the claimed control to appear as a described condition that is actually analyzed, and a claimed power calculation to contain the computation, so claim language without structure does not score.
Rubric anchors and scale.
On the common rubric, a 3 lacks described controls or sample-size justification, a 7 has adequate controls, a stated power basis, and appropriate blinding or randomization for the subfield, a 10 has a design that discriminates cleanly among the stated hypotheses at the field's best standard, and a score above 10 introduces a design or control that becomes a methodological template others adopt.
Definition.
The reproducibility and technique fidelity of the experimental execution as evidenced in the methods, the step that is distributable to trainees or technicians while the work remains science, and the one least directly visible in a written paper.
Motivation.
Execution is step six of the scientific method and the only step that can be delegated without ceasing to do science, but a written paper shows execution only through the traces it leaves in the methods, so the index scores those traces rather than the bench work itself. Poor execution invisible in a paper is a reproducibility failure, and the traces that predict it are underspecified protocols and missing technique detail.
Formulation and AI specification.
SExec instantiates the shared protocol of Section 3.5.0 and scores two features, the load-bearing one being reproducibility-gap detection. For each method described, a checklist model extracts the procedural parameters an independent lab would need (for a reagent, its source, catalog identifier, and concentration, for a protocol step, its duration, temperature, and equipment setting, for a computational method, its version, hyperparameters, and random seed) against a per-method-class reproducibility checklist. The reproducibility-gap feature is the fraction of required parameters present, so a method that names a reagent without its concentration or an antibody without its catalog number scores the gap explicitly. This targets the specific omissions that cause cross-lab non-reproducibility, where the reader is left guessing at a value the original lab knew but did not report, rather than penalizing brevity per se. The reproducibility-gap feature applies the same silence-is-deficiency default as the analysis-rigor index (Section 3.5.8), scoring an absent required parameter as not done rather than as unknown, since assuming an unstated step was performed correctly rewards incomplete reporting. Its required-parameter set is field-calibrated so a field's genuine reporting convention is not mistaken for a lapse. A second feature is methods-results consistency, an entailment check testing whether the reported data are of the kind and scale the described procedure could actually produce, flagging results that the stated methods could not have generated. The checklist per method class is the point of calibration and is populated from existing reporting standards (for example the ARRIVE guidelines for animal work, MIQE for quantitative PCR, and equivalent community checklists) rather than invented.
Implementation.
Extract each described method and classify it by method class using the Element 1 extraction model, mapping each to its reproducibility checklist.
For each method, extract the checklist parameters actually present in the text and compute the reproducibility-gap feature as the present-to-required ratio, retaining the list of missing parameters as the human-checkable output.
Run the methods-results consistency entailment (Element 2 NLI machinery) to flag results inconsistent with the described procedure.
Map the two features to the 1-to-10 score via the Element 3 ordinal regression, and normalize against the Element 4 reference stratum for the method's field, since reproducibility norms differ across wet-lab, computational, and clinical work.
Failure mode and gaming defense.
Padding the methods with boilerplate detail that does not enable reproduction is the gaming vector. The defense scores presence of the specific checklist parameters that matter for reproduction rather than methods-section length, so verbose but parameter-incomplete methods score low, and the methods-results consistency check catches procedures described but not actually followed. Because the missing-parameter list is emitted, a reviewer can verify each claimed gap against the manuscript rather than trusting the score.
Rubric anchors and scale.
On the common rubric, a 3 provides a protocol too underspecified to reproduce, a 7 provides a protocol reproducible by a skilled peer with the error-prone steps described, a 10 provides a fully reproducible protocol with deposited step-level detail meeting the subfield's best reproducibility standard, and a score above 10 establishes a technique or protocol refinement that raises the field's reproducibility baseline.
Definition.
Statistical validity checked as logical consistency and disclosure, covering whether the test matches the data type, whether the analysis accounts for the data's dependence structure, whether the conclusions match what the test measured, and whether the field-required reporting elements are present, with unstated required elements scored as not done.
Motivation.
Analytic rigor is where many false positives enter, and current metrics reward the positive result rather than the correctness of the analysis that produced it. Any significance claim requires its p value and the correction appropriate to the test count.
Formulation and AI specification.
SAna instantiates the shared protocol of Section 3.5.0 and runs a set of logical-consistency and disclosure checks whose ground truth is the fixed rules of statistics and the paper's own internal logic, not a human-labeled quality corpus, so the metric detects patterns computationally rather than reproducing expert taste. Four checks carry the score. A test-to-data-type check extracts the data structure from the methods (continuous, count, categorical, ordinal, time-series, or nested) and the named test, then verifies by statistical rule that the test matches the data type, so a test applied to a data type it is not valid for is flagged as a mathematically-defined error rather than a matter of opinion. Pseudoreplication is caught by a dependence-structure check that compares the described sampling (for example repeated measures on the same units, or multiple cells from the same animals) against how the analysis treated the sample size, flagging an analysis that treats non-independent observations as independent. Whether the conclusions are about the quantity the test actually measured is the claim-alignment check, run by entailment, since a common error runs a test on one quantity and states results about another. Finally, a disclosure check audits whether the field-required reporting elements (assumption checks such as normality and variance homogeneity, multiple-comparison correction when the test count demands it, effect sizes, and exact p values) are present in the text.
The disclosure check applies a load-bearing default. A required element that is not stated is scored as not done, not as unknown, because assuming an unstated safeguard was performed rewards incomplete reporting and because the alternative forces the model to guess at hidden practice it cannot observe. Silence therefore scores as the deficiency. The required-element set is field-calibrated to what that field's rigorous papers actually disclose, so a field's genuine reporting convention is not mistaken for a lapse, while the load-bearing validity elements (dependence-structure accounting, multiple-comparison correction, test-to-data match) are never waived for silence because the validity of the result turns on them. Correctly stated caveats about a test's assumptions and their limits raise the score, since a paper that names its assumptions and whether they were met demonstrates the rigor the metric rewards.
Implementation.
Extract the data structure, sampling design, named tests, corrections, effect sizes, and p values from the methods and results using the Element 1 extraction model.
Run the test-to-data-type check by statistical rule, flagging any test not valid for its data type.
Run the dependence-structure check, comparing described sampling against the treated sample size to detect pseudoreplication and unaccounted repeated measures.
Run the claim-alignment check by entailment, flagging conclusions asserted about a quantity the reported test did not measure.
Run the disclosure audit against the field-calibrated required-element set, scoring each absent required element as a deficiency rather than as unknown, with the load-bearing validity elements never waived for silence.
Map the check results to the 1-to-10 score via the Element 3 ordinal regression and normalize against the reference stratum, since reporting norms differ across fields.
Failure mode and gaming defense.
Reporting a correction in the methods but not applying it in the results is one gaming vector, defended by verifying application against the results tables rather than the methods statement and flagging a mismatch between claimed and applied correction. Omitting disclosure to dodge the audit is the vector the silence-is-deficiency default blocks, since saying less scores as the error rather than as unknown, which makes full disclosure the dominant strategy. One honest boundary remains. The checks catch errors visible in the text (wrong test for the data type, disclosed nesting treated as independent, conclusions that do not match the test), but they cannot catch an undisclosed confound or a nesting the methods never mention, so the statistical checks are strong where methods are complete and blind where methods omit the structure. That blindness is itself scored, by the disclosure audit here and by the reproducibility-gap feature of the execution index.
Rubric anchors and scale.
On the common rubric, a 3 uses a test mismatched to the data type, treats non-independent observations as independent, or omits load-bearing required disclosures, a 7 uses appropriate tests with dependence correctly handled, conclusions aligned to what was tested, and the field-required elements disclosed, a 10 additionally reports assumption checks with their outcomes and full sensitivity treatment at the field's best standard, and a score above 10 introduces or correctly applies an analytic approach ahead of the field's norm. Unstated required elements are scored as not done at every level.
Definition.
The proportionality of the paper's conclusions to its evidence and the presence of stated alternative interpretations.
Motivation.
Overclaiming beyond the evidence is a common and costly failure that citation counts do not penalize and can even reward. The index scores whether conclusions are calibrated to what the data support and whether alternatives are acknowledged.
Formulation and AI specification.
SInt instantiates the shared protocol of Section 3.5.0 and compares the strength of the stated conclusions against the strength of the evidence the results provide, and reads the discussion for explicitly stated alternative explanations. The scorer keys on the calibration between claim strength and evidence strength. Honest disclosure of limitations is credited as part of that calibration, so a paper that names the specific constraints on its evidence and qualifies its load-bearing claims to match scores higher than one that asserts the same claims without the qualification. The credit attaches to substance, not to a heading. A limitations section's mere presence is a held-out surface feature, and only a disclosed limitation that corresponds to a real constraint and actually narrows the affected claim earns the credit, which prevents a boilerplate limitations paragraph from farming the reward. Calibration is scored in both directions, so a claim hedged below what its evidence supports is a miscalibration too, though a less harmful one than overclaim, which keeps the credit from rewarding reflexive hedging rather than accurate self-assessment. The insufficiency debit of the fidelity gate and this calibration credit are the same principle from two sides, the burden of evidence resting on the paper. A categorical claim on thin data is debited by the gate and uncredited here, while a modest claim on the same data with honest limitations is spared the debit and earns the credit.
Implementation.
Extract the conclusion claims and the evidence each rests on using the Element 1 extraction model.
Compute a calibration feature as the entailment gap between conclusion strength and the strength the results actually support, flagging conclusions stronger than their evidence.
Compute an alternatives feature crediting explicitly stated alternative explanations tied to distinguishing evidence, not boilerplate hedging.
Map the features to the 1-to-10 score via the Element 3 ordinal regression and normalize against the reference stratum.
Failure mode and gaming defense.
Ritual limitations paragraphs that acknowledge alternatives without engaging them are the gaming vector. The defense requires stated alternatives to be tied to specific evidence that would distinguish them, so boilerplate hedging does not score.
Rubric anchors and scale.
On the common rubric, a 3 draws conclusions beyond what the evidence supports, a 7 draws conclusions proportional to the evidence and states alternatives, a 10 calibrates every claim to its evidence and engages alternatives with distinguishing evidence, and a score above 10 correctly identifies the broader significance of a result that the narrow data alone would not obviously imply.
Definition.
The density of genuine cross-scale and cross-field connection in the paper's reasoning, with verified methodological dependency, scored at the reasoning level and overlapping by design with the Cross-Field Integration Index (Section 3.9) and Mechanism-Graph Centrality Index (Section 3.2).
Motivation.
Cross-domain synthesis density is the signature of high scientific reasoning, integrating materials, glial and vascular biology, electrophysiology, neurocomputation, and clinical translation within a single framework, and it is the operation the flat prior version of this index underweighted. This step-index restores it to full status. It scores how many distant domains the paper genuinely connects and how densely, not how many it name-checks.
Formulation and AI specification.
SSyn instantiates the shared protocol of Section 3.5.0 and measures the number and distance of domains connected in the paper's own argument and requires methodological dependency (a connected domain's method or concept is actually used in the analysis), reusing the CFII field-distance machinery and the MGCI mechanism graph. The scorer keys on structural dependency between domains, not on citations spanning fields.
Implementation.
Extract the domains the paper connects and the methods or concepts it borrows using the Element 1 extraction model.
Apply the SObs Element 2 field-distance machinery to measure the distance of each connected domain, and require methodological dependency by a semantic match above 0.6 between the borrowed method and the paper analysis, so listed-but-unused domains do not count.
Compute connection density as the count of distance-weighted genuine dependencies, retaining the grounding sentences.
Map the feature to the 1-to-10 score via the Element 3 ordinal regression and normalize against the reference stratum, applying the incomputability above-10 rule where a connection has no prior precedent.
Failure mode and gaming defense.
Tokenistic cross-field citation is the gaming vector. The defense requires each connected domain to appear in the paper's analysis with methodological dependency above a 0.6 semantic-match threshold, so listed-but-unused domains do not raise the score. Overlap with CFII and MGCI is reconciled in Section 3.14.
Rubric anchors and scale.
On the common rubric, a 3 connects no domains beyond its own or name-checks them without use, a 7 genuinely integrates one adjacent domain's method or concept into the analysis, a 10 integrates multiple distant domains with verified methodological dependency at high connection density, and a score above 10 forges a cross-domain connection that opens a new integrative direction, the signature of the highest synthetic reasoning.
Definition.
The explicitness and traceability of the causal chain from evidence to conclusion, including data, code, and method availability.
Motivation.
Transparency is what makes a result reusable and checkable, and it is rewarded by no current metric. The index scores whether a reader can trace each conclusion to the evidence and reproduce the path.
Formulation and AI specification.
SRep instantiates the shared protocol of Section 3.5.0 and reads for an explicit traceable chain from evidence to each conclusion, and for data, code, and protocol availability statements that resolve to actual deposited resources. The scorer keys on the presence of a followable chain and on availability statements that verify against real repositories.
Implementation.
Extract the evidence-to-conclusion chain and the data, code, and protocol availability statements using the Element 1 extraction model.
Compute a traceability feature testing whether each conclusion links to specific reported evidence, and a resource-resolution feature testing whether availability statements resolve to actual accessible deposits (a live check against the named repository).
Flag availability statements that do not resolve to a real deposit.
Map the features to the 1-to-10 score via the Element 3 ordinal regression and normalize against the reference stratum.
Failure mode and gaming defense.
Data-availability statements that point to nothing are the gaming vector. The defense verifies that availability statements resolve to actual accessible resources, so a statement without a real deposit does not score.
Rubric anchors and scale.
On the common rubric, a 3 leaves conclusions untraceable to evidence and provides no working data or code availability, a 7 provides a traceable evidence-to-conclusion chain and availability statements that resolve to real resources, a 10 provides fully transparent reasoning with complete deposited data, code, and protocols at the field's best standard, and a score above 10 sets a transparency or open-resource practice that others adopt.
Definition.
A meta-index parallel to SRQI that composes six named engineering-method step-indices into a profile of the engineering reasoning a paper demonstrates. EMI applies to design-and-build work, where the operative process is not observe-hypothesize-test but define-design-prototype-characterize-validate-iterate, and it is reported as the vector of its six step-scores.
Motivation.
A large fraction of NIH-funded work, including most of neural engineering, is engineering rather than discovery, and scoring an engineering paper on scientific-method steps misjudges it, since a device paper has no hypothesis in the Popperian sense but does have a requirement specification, a design-space exploration, and a validation against that specification. SRQI and EMI are separate meta-indices because the scientific and engineering methods are distinct processes with distinct step structures, and conflating them reintroduces the field-mismatch error the framework warns against. Many neural-engineering papers are scored on both, because they do both, characterizing a biological mechanism (science) while building and validating a device (engineering).
Formulation.
EMI(p) = ( EReq, EDes, EProto, EChar, EVal, EIter )
reported as the vector of the six step-indices defined in Sections 3.5E.1 through 3.5E.6, each on the same common anchored 1-to-10 rubric (uncapped above 10) as SRQI, with a confidence interval, no canonical weighting, and committee-selected role-appropriate weights, so the EMI composite is a weighted average commensurable across its steps.
AI specification and mode selection.
A mode classifier first determines whether a paper is discovery-mode, engineering-mode, analytical-mode, or a combination, from the structure of its aims and methods. The structural markers are generation of new empirical data (whether by experiment or by systematic observation), presence of a requirement and validation, or absence of new data in favor of an argument built by integrating prior literature. Discovery-mode covers both experimentally-warranted work, which tests a hypothesis by manipulation and inferential statistics, and observationally-warranted work, which establishes claims by systematic observation and inference to best explanation, and the two are coequal modes of empirical warrant rather than a norm and a lesser exception. The rigor steps are therefore polymorphic by warrant. SExec, SAna, and SHyp apply experimental-rigor criteria (statistical validity, control conditions, hypothesis-test structure) to experimentally-warranted claims, and observational-rigor criteria (sampling adequacy, observer and instrument bias controls, marker or measurement specificity, reproducibility of the observation, and whether the inference actually ruled out the alternatives) to observationally-warranted claims. An observational paper is scored by the standard its warrant sets, not marked deficient for lacking inferential statistics, since Brock 1969, a positive control, is observational natural history with no controlled experiment or statistics, so a discovery mode presuming the hypothesis-test template would mishandle the framework's own hero case. Each paper is scored on SRQI, EMI, the analytical-mode step subset, or a combination accordingly. Analytical-mode covers theory, review, and perspective papers, whose contribution is identifying a gap, contradiction, or wrong shared assumption by linking literature rather than generating new data, and it is specified in Section 3.5A, distinct from observational discovery because it generates no new data at all. Each EMI step-index instantiates the shared protocol of Section 3.5.0 and scores two axes. An internal-rigor axis specific to the engineering step asks whether the requirement was quantified, whether the design space was actually explored, and whether validation ran against every requirement. A corpus-novelty axis applies the SObs Element 2 machinery to the prior engineering literature, so a requirement, design, or validation approach is credited as more innovative when it is absent from and distant to prior engineering work, with the incomputability above-10 rule applying where an approach has no prior precedent. Each EMI step keys on structural engineering evidence rather than claim language, with its score anchored to the reference-population distribution of the same computed features and no expert quality-labels, matching the SRQI protocol, and with the same held-out surface features so naming an operation does not substitute for performing it. Papers scored on multiple modes report each vector separately, so a bidirectional-translation paper is credited for its science and its engineering separately rather than averaged into one misfitting number.
Failure mode and mitigation (meta-index level).
The dominant risk is mode-misclassification, scoring an engineering paper as science, an analytical paper as discovery, or any other mismatch, mitigated by multiple scoring where the mode classifier is uncertain and by reporting the mode decision with its confidence. A paper mis-routed to discovery-mode when it is analytical would be scored on execution and analysis steps it has no basis to satisfy, which the analytical-mode branch of Section 3.5A prevents by not instantiating those steps. Gaming is addressed at the step level.
RFI category alignment.
Entrepreneurship and Translation and Foundational Scientific Exploration where the engineering is foundational infrastructure, cross-cutting content calibrator for design-and-build work.
Definition.
The clarity, completeness, and justification of the specification the engineering work targets, the engineering analogue of question formulation.
Motivation and specification.
A design is only as good as the requirements it is judged against, and unstated or post-hoc requirements make a device impossible to evaluate. EReq reads for explicit, quantified performance requirements stated before the design (target values with units and tolerances), justification tying each requirement to the application need, and constraints (power, size, biocompatibility, cost). The scorer keys on the presence of quantified, justified specifications, not on the word requirement.
Implementation.
Extract the stated performance requirements, constraints, and their justifications using the Element 1 extraction model.
Score the rigor axis by whether requirements are quantified (target values with units and tolerances) and justified against the application need, checking for reverse-fit by comparing requirements to any preregistration or prior protocol.
Score the novelty axis by applying SObs Element 2 retrieval to the requirement against the prior engineering corpus.
Map both axes to the 1-to-10 score via the Element 3 ordinal regression and normalize against the Element 4 engineering reference stratum.
Failure mode and gaming defense.
Reverse-fitting requirements to the achieved performance is the gaming vector. The defense checks consistency between stated requirements and any pre-registration or prior protocol, and flags requirements that exactly match achieved results as suspect.
Rubric anchors and scale.
On the common rubric, a 3 states no quantified requirements or reverse-fits them to results, a 7 states quantified, justified performance requirements with constraints before the design, a 10 states a complete, well-justified specification tied rigorously to the application need, and a score above 10 defines a requirement set that reframes what the device class should achieve.
Definition.
The breadth and rigor of the design alternatives considered and the justification for the chosen design, the engineering analogue of multiple working hypotheses.
Motivation and specification.
A design chosen without considering alternatives is an assertion, not an engineering decision. EDes reads for stated design alternatives, explicit trade-off analysis among them, and a justified selection tied to the requirements. The scorer keys on the presence of a real trade-off analysis, not on a single design presented as inevitable.
Implementation.
Extract the stated design alternatives and the trade-off analysis using the Element 1 extraction model.
Score the rigor axis by whether real alternatives are considered with genuine trade-offs on stated axes and a justified selection, rejecting strawman alternatives that carry no genuine trade-off.
Score the novelty axis by SObs Element 2 retrieval of the chosen design approach against the prior engineering corpus.
Map both axes to the 1-to-10 score and normalize against the engineering reference stratum.
Failure mode and gaming defense.
Strawman alternatives introduced only to be dismissed are the gaming vector. The defense requires each alternative to carry a genuine trade-off against the chosen design on a stated axis, so token alternatives do not score.
Rubric anchors and scale.
On the common rubric, a 3 presents one design with no alternatives considered, a 7 considers real alternatives with a trade-off analysis and justified selection, a 10 explores the design space rigorously with quantified trade-offs across the relevant axes, and a score above 10 identifies a design approach the field had not considered.
Definition.
The completeness and reproducibility of the actual build, the engineering analogue of experimental execution.
Motivation and specification.
A described design that was not actually built and characterized is a proposal, not a result. EProto reads for evidence the artifact was realized (fabrication methods, materials, and process parameters sufficient to reproduce it) and for the fidelity of the realized artifact to the design. The scorer keys on the presence of reproducible build detail.
Implementation.
Extract the fabrication methods, materials, and process parameters using the Element 1 extraction model.
Score the rigor axis by whether the build is specified in enough detail for independent reproduction (the engineering analogue of the SExec reproducibility-gap feature) and whether realized-to-design fidelity is demonstrated.
Score the novelty axis by retrieval of the fabrication approach against the prior engineering corpus.
Map both axes to the 1-to-10 score and normalize against the engineering reference stratum.
Failure mode and gaming defense.
Claiming a build without reproducible detail is the gaming vector. The defense requires fabrication and process parameters sufficient for independent reproduction, so an underspecified build does not score.
Rubric anchors and scale.
On the common rubric, a 3 describes a build too underspecified to reproduce, a 7 provides fabrication and process detail sufficient for independent reproduction, a 10 provides complete reproducible realization with verified design fidelity at the field's best standard, and a score above 10 achieves a fabrication result the field considered infeasible.
Definition.
The rigor and completeness of the measurements characterizing the realized artifact's performance, the engineering analogue of analysis rigor.
Motivation and specification.
Characterization is where a device's real performance is established, and selective or under-powered characterization overstates it. EChar checks that performance is measured across the operating range, with appropriate sample sizes across fabricated units, statistical treatment of unit-to-unit variability (any significance claim requires its p value), and reporting of failure cases and off-nominal behavior. The scorer keys on the completeness and statistical treatment of the actual measurements.
Implementation.
Extract the performance measurements and their sampling using the Element 1 extraction model.
Score the rigor axis by coverage across the operating range, adequate unit-to-unit sampling with variability statistics (any significance claim requires its p value), and failure-case reporting, flagging best-unit-only reporting.
Score the novelty axis by retrieval of the characterization method against the prior engineering corpus.
Map both axes to the 1-to-10 score and normalize against the engineering reference stratum.
Failure mode and gaming defense.
Best-unit reporting, characterizing only the device that worked, is the gaming vector. The defense requires reporting across fabricated units with unit-to-unit variability, so single-best-unit results are flagged.
Rubric anchors and scale.
On the common rubric, a 3 reports only best-unit or partial-range performance, a 7 characterizes across the operating range with adequate unit-to-unit sampling and failure reporting, a 10 characterizes comprehensively with full statistical treatment of variability at the field's best standard, and a score above 10 establishes a characterization method others adopt.
Definition.
Whether the characterized performance is tested against the stated requirements, the engineering analogue of interpretation calibration and the step with no scientific-method equivalent.
Motivation and specification.
Validation closes the engineering loop by testing the built artifact against the requirements defined at the start, and its absence lets a paper claim success without a standard. EVal reads for an explicit comparison of characterized performance against each stated requirement, an honest account of which requirements were met and which were not, and validation under application-relevant conditions rather than only benchtop-ideal ones. The scorer keys on the presence of a requirement-by-requirement comparison.
Implementation.
Extract the stated requirements from EReq and the validation results using the Element 1 extraction model.
Score the rigor axis by requirement-by-requirement comparison of characterized performance against every stated requirement under application-relevant conditions, flagging silently dropped unmet requirements as a mismatch between the requirement set and the validation set.
Score the novelty axis by retrieval of the validation approach against the prior engineering corpus.
Map both axes to the 1-to-10 score and normalize against the engineering reference stratum.
Failure mode and gaming defense.
Validating only against the requirements that were met is the gaming vector. The defense requires validation against every stated requirement from EReq, so silently dropping an unmet requirement is flagged as a mismatch between the requirement set and the validation set.
Rubric anchors and scale.
On the common rubric, a 3 validates against no stated requirement or only the met ones, a 7 compares performance against every stated requirement under realistic conditions with honest met-and-unmet accounting, a 10 validates comprehensively under application-relevant conditions at the field's best standard, and a score above 10 validates against a more demanding standard than the field currently requires.
Definition.
Evidence of design iteration informed by characterization and validation, the engineering analogue of the self-correcting cycle of science.
Motivation and specification.
Engineering advances through iteration, and a paper that reports a single design pass without learning from measured shortfalls demonstrates weaker engineering reasoning than one that closes the loop. EIter reads for evidence that characterization or validation results drove a subsequent design change, with the change justified by the measured shortfall. The scorer keys on the traceable link from a measured result to a design revision.
Implementation.
Extract claimed design iterations and their driving measurements using the Element 1 extraction model.
Score the rigor axis by whether each claimed iteration links a specific measured shortfall to a specific design change with before-and-after data, rejecting unsupported iteration narratives.
Score the novelty axis by retrieval of the iteration or optimization approach against the prior engineering corpus.
Map both axes to the 1-to-10 score and normalize against the engineering reference stratum.
Failure mode and gaming defense.
Narrating iteration that did not occur is the gaming vector. The defense requires each claimed iteration to link a specific measured shortfall to a specific design change with before-and-after data, so an unsupported iteration narrative does not score.
Rubric anchors and scale.
On the common rubric, a 3 reports a single design pass with no learning from shortfalls, a 7 links a measured shortfall to a justified design change with before-and-after data, a 10 demonstrates a rigorous closed iteration loop driving measured performance gains, and a score above 10 achieves through iteration a performance level the field considered a hard limit.
Why a third mode.
Theory, review, and perspective papers make no primary observations, run no experiments, and build no devices, so the execution and statistical-rigor steps of SRQI and the realization and validation steps of EMI have nothing to read on them. Their contribution is identifying a gap, a contradiction, or a wrong shared assumption by linking literature that had not been linked, which is among the hardest contributions to count and among the most consequential when it redirects a field. Routing such a paper to discovery-mode would score it on experimental steps it has no basis to satisfy, and the silence-is-deficiency default would convert the structural absence of an experiment into a cascade of false deficiencies, penalizing the paper for not being an experiment. Analytical-mode exists so that a synthesis contribution is scored on what it actually does rather than on what it structurally cannot.
What is scored, and what is inert.
Analytical-mode scores the reasoning-and-integration subset of the scientific-method steps, the observation step in its corpus-relational sense (SObs), gap identification (SGap), hypothesis or thesis generation (SHyp), interpretation (SInt), and synthesis (SSyn), together with the standalone indicators that already read integration and origination rather than experiment, the assumption-addressing indicator (AAI), mechanism-graph centrality (MGCI), cross-field integration (CFII), the field-define index (FDI), and the corpus-relational novelty machinery. The execution-fidelity step (SExec), the analysis-rigor step (SAna), and the primary-data-ratio component of the paper-substance score (PSS) are not instantiated in analytical-mode rather than scored as deficient, because a paper with no experimental execution has no execution to check, no experiment-derived statistics to validate, and no primary data whose absence is a fault. Making these steps inert, not merely gated to a lenient setting, is the load-bearing decision, since a gated-but-active experimental checklist would still record every experimental element as absent and the silence-is-deficiency default would still fire.
The deficiency floor, stated with its uncertainty.
A mode that only rewarded positive signals would credit the appearance of synthesis without a way to score down a fluent repackaging that links nothing new and argues nothing sound, which would make analytical-mode a gaming surface and an unusable signal for an evaluator. Analytical-mode therefore carries its own deficiency floor, the analogue of the experimental required-element checklist, built from three corpus-relational checks. The linkage-novelty check asks whether the connections the paper claims to draw are actually absent from the prior literature, using the same retrieval machinery as SObs novelty detection, so a synthesis that re-states known connections scores low on the contribution it claims. That same dormancy exception applies, so a synthesis that resurfaces connections lost in old, uncited, hard-to-retrieve work and builds on them is credited as recovery rather than penalized as restatement, which is consistent with the framework's founding concern for neglected work, since reviving a forgotten linkage is the synthesis analogue of producing a neglected finding. A source-characterization-fidelity check asks whether the cited sources are characterized in a way that matches what they actually report, using the entailment machinery of the SAna claim-alignment check applied to the citing sentences rather than to a statistical claim, so a paper that mischaracterizes what it cites is flagged. Finally, an assumption-exposure check asks whether the paper states the assumptions its argument depends on rather than smuggling them, detectable as the presence or absence of explicit conditional and scope statements in the argument structure.
These three checks sit in the corpus-relational tier of Section 2, not the rule-based tier, and this is stated plainly rather than hidden. Whether a linkage is genuinely new and whether a source is faithfully characterized are questions about the structure and content of the existing literature, which retrieval and entailment can address, but they rest on softer ground than a statistical-validity check whose ground truth is mathematics. Analytical-mode substance is therefore assessed with more uncertainty than experimental rigor. The checks return a finding with an explicit confidence rather than a determinate pass or fail, and that confidence is reported alongside the score. A program officer therefore weights the analytical assessment for what it is, a corpus-relational judgment of a contribution that resists exact measurement rather than a hard adjudication of whether an argument is correct. The framework does not claim to computationally decide whether an idea is good. It claims to detect whether a synthesis introduces a genuinely new and faithfully argued linkage, and to say how confidently.
Relation to the framework's general treatment of hard-to-count contribution.
Analytical-mode is one instance of a general principle rather than a special case. A synthesis paper and a patent share the property that their value lies not in a countable artifact but in what they introduce and what the field or the market subsequently takes up. Both are therefore assessed by the novelty and the uptake of what they introduce rather than by artifact count, which is the same move the field-define index makes for originated concepts and that the incentive arm makes when it credits constitutive behaviors rather than proxies.
Where a contribution is hard to count, the framework's answer stays consistent. Measure the novelty of what was introduced against the prior corpus and the independent uptake that followed, and report the measurement with its uncertainty, rather than defaulting to the artifact tally that undercounts exactly the work this framework exists to protect.
RFI category alignment.
Foundational Scientific Exploration, since a field-redirecting synthesis is a foundational contribution, with a secondary tie to any category where review and perspective work shapes subsequent research direction.
Definition.
A score quantifying an author's substantive public-communication output through verified audience reach, content-quality scoring, and downstream policy or educational citation.
Motivation.
Current metrics ignore public-engagement contributions entirely, and in many institutional cultures actively penalize them. Carl Sagan published approximately 500 peer-reviewed papers across 39 years while also producing the most influential public-science communication of the late 20th century, and the National Academy of Sciences blackballed his 1992 nomination explicitly because of the public work. Subsequent bibliometric analyses have shown that scientifically active science communicators actually outperform non-communicators on traditional research metrics, demonstrating that the Sagan Effect (perception that scientific popularity is inversely related to scientific quality) is empirically false. The Verifiable Public-Engagement Index fills this gap by treating substantive public communication as an output class with verified reach, content-quality scoring, and downstream policy or educational impact.
This indicator carries weight beyond a single impact category. A foundational-science funder in a democracy has a structural vulnerability, its payoff horizon is longer than its funding horizon, so the work is perpetually exposed to appropriations cycles that resolve faster than the science does. The only durable protection is a public that understands why slow, risky, non-appropriable work deserves patience. Public communication of foundational science is therefore the meta-incentive that protects the survival of every other category, since it determines whether the appropriations that fund all the evaluation decisions endure across political cycles. The indicator must be scoped tightly to communication that builds durable public understanding of why foundational science matters, weighted by the quality and durability of understanding produced rather than by volume of engagement or self-promotion, or it degrades into the volume-first failure mode. A researcher who makes taxpayers understand why the slow foundational work deserves support is protecting the entire enterprise, and this contribution should be credited accordingly.
Formulation.
VPEI(a) = Σ over e ∈ E(a) of reach(e) · quality(e) · downstream_impact(e)
where E(a) is the set of engagement events for author a (books, op-eds, broadcasts, lectures, podcasts, verified science journalism), reach(e) is the verified audience size from publisher data, broadcast Nielsen ratings, or attendance records (not platform-reported only), quality(e) is the LLM-classified science content quality benchmarked against established science-communication exemplars, and downstream_impact(e) is the count of citations of e in policy documents, regulatory filings, educational curricula, and textbook adoption.
AI specification.
A multi-source aggregation system combining NLP classification of engagement content quality, external reach databases for audience verification, and cross-reference matching against policy-document and curriculum corpora for downstream impact tracking. Quality benchmarks calibrated against historical exemplars (Sagan's Cosmos, Hawking's A Brief History of Time, Carson's Silent Spring) at the upper range and against demonstrated misinformation at the lower range.
Parameters and thresholds.
Each engagement event carries three scored factors on a common 0 to 1 scale. Reach reach(e) is the log of verified audience size (log10 of publisher print runs, broadcast Nielsen figures, or verified attendance), normalized to 0 to 1 against the field distribution, with platform-reported figures excluded and only independently-verified reach admitted. Quality quality(e) is a classifier score benchmarked so the named exemplars anchor near 1.0 and demonstrated misinformation anchors near 0, with a hard gate zeroing the event if adversarial-content detection flags it, so reach cannot compensate for false content. Downstream impact downstream_impact(e) counts verified citations of the event in policy documents, regulatory filings, curricula, and textbooks, entered as log(1 + citation count), which is the factor weighted most heavily (a suggested weight of 0.5 of the event score) because it is the least gameable and most closely tracks durable understanding. The event score is the product of the three factors, and per the failure-mode analysis VPEI's total weight in any evaluation is capped below the constitutive indicators, since it remains a proxy rather than a constitutive metric.
Implementation.
Build engagement-event corpus through ORCID extension supporting author self-report of verified outputs.
Verify reach against publisher and broadcast data sources, with platform-reported metrics treated as unverified.
Score the verifiable-count dimensions computationally (independent reach verification, policy-document and curriculum citation of the content), and treat the content-substance dimension as the disclosed exception below, since substance is the one dimension the framework does not claim to compute objectively.
Track downstream policy and educational citation through automated matching against policy and curriculum corpora.
Aggregate per-author with field-normalized weighting.
Failure mode and mitigation.
VPEI is the least Goodhart-resistant indicator in the framework, and this is stated plainly so evaluators can weight it accordingly. Unlike the constitutive incentive-arm metrics, VPEI is a proxy, since reach and engagement stand in for the thing actually wanted (durable public understanding of foundational science) rather than being that thing directly. Reach can be gamed through paid distribution, bot amplification, or strategic platform selection, and high-reach low-substance content is the influencer failure mode. Partial mitigations are independent reach verification (not platform-reported), the downstream-impact term weighting policy and educational citation over raw audience, and open methodology. These reduce but do not eliminate the gaming risk. The recommendation is therefore to cap VPEI's evaluative weight below the constitutive indicators and to treat it as advisory rather than determinative. Even so, the indicator still promotes a wanted behavior, since the cheapest path to a durable-impact score routes through producing communication that policy documents and curricula actually cite, which is the public-understanding contribution the enterprise depends on. VPEI is also the one place the framework does not claim objective computation of a quality dimension, since content substance resists the corpus-relational and rule-based methods the other indicators use, and rather than smuggle in expert quality-labels the way the reasoning indices deliberately avoid, the framework states plainly that substance judgment here is human, advisory, and the specific reason VPEI is capped.
RFI category alignment.
Public Impact.
Definition.
Credit accumulated by papers that fill knowledge gaps documented in the prior literature, weighted by independent identification of those gaps across distinct author groups.
Motivation.
Current metrics reward citation accumulation without distinguishing incremental work from gap-filling work that addresses documented unresolved questions. This creates a structural preference for safe incremental research, since incremental work cites established work that cites established work in self-reinforcing loops, while genuine gap-filling work that diverges from established narratives gets less initial citation. The Gap Index fills this gap by tracking documented field-wide gaps (extracted from "future directions," "limitations," and "open questions" sections of the literature) and crediting papers that demonstrably address them with mechanism-level engagement.
Formulation.
GI(p) = Σ over g ∈ G_corpus of sim(p, g) · ind(g, t_p) · addr(p, g)
where G_corpus is the corpus of identified gaps extracted from "future directions," "limitations," and "open questions" sections across the literature, sim(p, g) is the semantic similarity between paper p and gap statement g, ind(g, t_p) is the independent-identification count of g across distinct sources prior to publication date t_p, and addr(p, g) is whether p actually addresses g mechanistically rather than merely claims to.
AI specification.
A three-stage pipeline. First, a gap-extraction LLM identifies unresolved-question statements at scale across review articles, position papers, and limitations sections. Second, gap-clustering identifies independent restatements as the same underlying gap using semantic clustering with independence verification across author groups. Third, a semantic-match LLM scores the filling-extent for new papers, requiring mechanism-level engagement with the gap rather than rhetorical claim.
Parameters and thresholds.
Gap extraction reads sentences in Future Directions, Limitations, Discussion, and Conclusion sections to locate stated gaps, favoring precision over recall since a missed gap costs less than a spurious one, and the corpus-relational logic below (independent identification, filling-extent) is what carries the score rather than any judgment of gap quality. Gap-clustering uses embedding cosine similarity with a merge threshold near 0.85, set per field because gap granularity varies, so same-gap statements merge and distinct-gap statements do not. A gap qualifies for the corpus only if independently stated by at least three distinct author groups (no shared authors, no shared institution) before the candidate paper's submission date, which sets the independent-identification floor and defeats self-defined gaps. Filling-extent sim(p, g) is a semantic-match score in the range 0 to 1, thresholded at 0.6 for partial credit and 0.8 for full credit, with the addr(p, g) mechanism-engagement gate requiring the paper's Results or Methods (not only its Introduction) to reference the gap. The independent-identification count ind(g) enters as a log-weighted multiplier, log(1 + number of independent groups), so a gap noted by 50 groups scores higher than one noted by three but with diminishing returns.
Implementation.
1 Extract gap corpus from review articles, position papers, and Discussion sections across the literature.
2. Cluster gaps semantically with independence checks (distinct authors, distinct institutions, distinct time periods).
3. Timestamp gaps with earliest published identification.
4. Apply to new papers as filling-extent calculation with substance scoring.
5. Weight by independent-identification count and substance score.
Failure mode and mitigation.
Authors can frame work to claim gap-filling regardless of substantive contribution. Mitigation requires that the gap be documented before paper submission date (verifiable timestamp) and that the paper actually address the mechanism, scored by substance index. Independence of gap identification across distinct author groups prevents self-citation cliques from defining the gap landscape.
RFI category alignment.
Foundational Scientific Exploration and Rigor and Reproducibility.
Definition.
Credit accumulated by papers that originate concepts, terminology, methods, or frameworks that subsequent work adopts as foundational, distinguishing originators from elaborators and co-opters.
Motivation.
Current metrics credit citation accumulation but not concept origination. A paper that introduces a foundational concept receives one citation per downstream paper using the concept, the same as a paper that merely uses the concept later, so originators are not distinguished from adopters in the metric. Richard Normann's Utah intracortical microelectrode array, developed at the University of Utah through the late 1980s, defined the standard architecture for cortical recording that BrainGate and most subsequent clinical brain-computer interface work has used, but the existing metrics give no extra weight to the originating papers. The Field-Define Index fills this gap by tracking concept genealogies through the literature, identifying originators and weighting their credit by the breadth, longevity, and fidelity of downstream adoption.
Formulation.
FDI(p) = Σ over c ∈ Concepts(p) of origin(p, c) · adopt(c) · stab(c) · fid(c)
where Concepts(p) is the set of concepts contributed by paper p, origin(p, c) is 1 if p is the first substantive treatment of concept c and 0 otherwise, adopt(c) is the count of downstream papers using c, weighted toward adopters with no co-authorship relationship to the originator so reciprocal within-clique adoption does not inflate it, stab(c) is whether c persists in the literature 10 or more years later, and fid(c) is the fraction of downstream uses consistent with the originator's definition (semantic fidelity).
AI specification.
A concept-genealogy system using LLM-based concept extraction, citation-network tracking, and semantic-consistency checking to identify originators and downstream users. The system distinguishes passing mention from substantive treatment using paragraph-level context analysis.
Parameters and thresholds.
Concept extraction runs against field-stratified dictionaries seeded from UMLS, MeSH, and Gene Ontology plus emergent-term detection for concepts not yet in ontologies. Substantive treatment, which sets origin(p, c), requires the concept to appear in at least two distinct sections of the paper with a definitional or empirical passage of at least three sentences, distinguishing origination from a one-line mention by this structural rule rather than by matching an annotated set. Adoption adopt(c) counts downstream papers using the concept, weighted by independence, so that adopters sharing no co-authorship with the originator count at full weight, and within-network adopters are down-weighted by a factor near 0.3, which defeats reciprocal-citation cartels while not zeroing legitimate collaborator use. Stability stab(c) is a binary gate set to 1 only if the concept still appears in the literature 10 or more years after origination, applied only to concepts old enough to test. Semantic fidelity fid(c) is the fraction of downstream uses whose local context matches the originator's definition above a 0.7 embedding-similarity threshold, penalizing term co-option. The four factors combine multiplicatively so that a coined term with high adoption but low fidelity or no stability scores low.
Implementation.
Build concept-extraction pipeline over full-text corpus with field-stratified concept dictionaries.
For each concept, identify earliest substantive treatment using paragraph-level context (not passing mention).
Track downstream papers using the concept with attribution and semantic-fidelity scoring.
Verify persistence at 10-year horizon for concepts old enough to test stability.
Aggregate to paper-level scores with confidence intervals.
Calibrate originator and adoption thresholds against the NIH-funded reference population per Section 3.5.0 Element 4, using the Element 2 corpus and concept graph for genealogy tracking.
Failure mode and mitigation.
Two gaming modes require mitigation. First, the index can reward coining over substance, where papers introduce flashy terminology without real conceptual work, mitigated by joint conditions with mechanism-graph centrality (the new concept must connect to broader knowledge structure) and substance score. Second, a coordinated group can manufacture adoption by reciprocally using each other's coined terms, inflating the adoption count without genuine field uptake. This is mitigated by an independence check on adoption analogous to the gap index, crediting adoption weighted toward downstream users with no co-authorship relationship to the originator, so reciprocal within-clique adoption does not drive the score. Semantic-fidelity scoring further ensures downstream adoption uses the concept the originator's way rather than co-opting the term differently. The independence-weighted adoption count is the primary defense against organized citation cartels, and where residual coordinated behavior evades it, expert review of claimed originations at the career-terminal horizon provides a backstop.
RFI category alignment.
Foundational Scientific Exploration.
Definition.
Score measuring whether a paper synthesizes methods, findings, or concepts from separate fields through genuine methodological dependency rather than tokenistic cross-field citation.
Motivation.
Current metrics fail to distinguish genuine cross-field integration from tokenistic cross-field citation. A paper can cite work from multiple fields without using those fields' methods or concepts in its own analysis, accumulating apparent breadth without genuine integration. Wuchty Jones and Uzzi (Science 2007) established that team-produced cross-field work has higher disruption scores than within-field work, but current metrics provide no way to identify which papers actually integrate fields versus which papers merely cite them. The Cross-Field Integration Index fills this gap by combining reference-list field span with semantic verification of methodological dependency on the cited work.
Formulation.
CFII(p) = fs(refs(p)) · md(p) · ds(p)
where fs(refs(p)) is the average pairwise field-distance among citations in p's reference list (computed from the field-distance matrix), md(p) is the LLM-verified methodological dependency (cited methods are used in the analysis, not just listed), and ds(p) is the downstream field-span of papers citing p (genuine integration produces downstream integration).
AI specification.
A field-distance matrix computed from topic modeling and citation networks across the corpus, plus LLM verification of methodological usage in text. The field-distance matrix is the foundational object and must be field-stratified and updated quarterly as fields evolve.
Methodological dependency md(p) is the decision that separates genuine integration from tokenistic breadth, and it is scored by locating each cross-field cited method in the paper's own analysis with a semantic match above 0.6 between the citation and the surrounding methods or results text. A reference cited in the introduction but never used in the analysis contributes to apparent breadth but is excluded from md(p), so the score rises only when a foreign-field method is actually executed. The field-distance component fs is normalized against the field-specific distribution of reference breadth so that fields with naturally broad literatures are not scored as integrative by default. Downstream field-span ds is computed only for papers at least five years old and confirms that genuine integration produced integration in the work that cited it. The 0.6 methodological-dependency threshold is calibrated against the field-specific distribution rather than against a hand-labeled exemplar set, so the separation of integrative from tokenistic breadth rests on measured method-execution in the text, not on annotated judgments.
Implementation.
Build field-distance matrix from topic models on full corpus with confidence intervals.
Compute reference-list field-span per paper using pairwise distances.
LLM-verify methodological dependency by checking that cited methods appear in analysis sections.
Track downstream citation field-span for papers older than 5 years.
Aggregate as multiplicative score with field-normalized output.
Draw the field-distance matrix from the Section 3.5.0 Element 2 concept graph and calibrate integration thresholds against the field-stratified reference population per Element 4.
Failure mode and mitigation.
Tokenistic cross-field citation without methodological dependency can inflate the index. Mitigation requires that cited methods appear in the paper's analysis sections (semantic match between citation and surrounding methodological text), not just in the reference list. Downstream span verification at 5-year horizon further filters tokenistic claims.
RFI category alignment.
Collaboration, with secondary alignment to Foundational Scientific Exploration.
Definition.
A team-configuration score along innovation-correlated structural axes including team size (number of authors), geographic heterogeneity (international and inter-regional collaboration), institutional heterogeneity (R1, R2, government laboratory, industry, non-US institution), career-stage range, and disciplinary breadth, with optional demographic variables included only where consented through ORCID self-report.
Motivation.
Current metrics treat all author lists equally, ignoring the team-configuration variables that bibliometric research has shown correlate with innovation outcomes over more than two decades of empirical work. Wuchty Jones and Uzzi (Science 2007) established that team-produced work has dominated knowledge production for over 50 years with team-produced papers receiving substantially higher citation rates than solo-authored work, with team size showing a positive relationship to disruption scores up to a field-specific optimum. Subsequent bibliometric work documented that geographic heterogeneity (Adams 2013 Nature), institutional heterogeneity (Jones Wuchty and Uzzi 2008 Science), and disciplinary breadth each independently predict disruption outcomes. The Co-Author Composition Index fills this gap by scoring team configuration along these empirically validated structural axes, providing decision-relevant signal about likely innovation output at the moment of evaluation rather than waiting for the impact to accumulate.
Formulation.
CACI(authors) = Σ over d ∈ D of w_d · het_d(authors)
where D = {team_size, geography, institution, career_stage, discipline} is the set of structural axes (with optional {gender, ethnicity, age} for opt-in demographic variables), het_d(authors) is the team-configuration score along axis d (raw count for team size, Shannon entropy or normalized Gini coefficient for categorical axes), and w_d is the weight tuned to empirical correlation with disruption outcomes per axis. Field-stratified weighting is required because optimal team configurations differ across disciplines (theoretical physics versus clinical translational research, for example).
Empirical justification.
The bibliometric literature supporting team-configuration heterogeneity as a predictor of innovation includes Wuchty Jones and Uzzi 2007 Science (team size and disruption), Jones Wuchty and Uzzi 2008 Science (institutional heterogeneity and impact), Adams 2013 Nature (geographic heterogeneity), AlShebli Rahwan and Woon 2018 Nature Communications (ethnic heterogeneity and citation impact), and Yang et al. 2022 PNAS (gender heterogeneity and novelty). The empirical case rests on innovation outcomes (citation impact, disruption scores, novel-idea generation), not on normative claims about representation. CACI is justified on these empirical grounds.
AI specification.
A multi-source aggregation system extracting team-configuration from ORCID, institutional records, and (where consented) demographic self-report. Algorithmic inference of demographic variables from names or photographs is prohibited as both privacy-violating and known to be biased. Structural variables (team size, geography, institution type, career stage, disciplinary background) are inferable from public records without consent issues.
Each structural axis is computed by a defined rule. Career stage is binned field-relative rather than by fixed year cutoffs, normalizing years since first publication to the field-specific time-to-independence distribution (the field's median interval from first publication to first independent grant or last-author paper). Early, mid, and senior are therefore defined against how a given field's careers actually progress rather than a uniform boundary that would misjudge fields with slower or faster paths to independence, consistent with the field-stratified calibration of Section 3.5.0 Element 4. Institution type is classified from affiliation against Carnegie-style categories (R1, R2, government laboratory, industry, foreign). Disciplinary background is assigned from each author's prior-publication topic distribution. Heterogeneity per axis is Shannon entropy normalized to its maximum, and team size enters as a raw count weighted against the field-specific optimum rather than as more-is-better. The gaming-relevant decision is that each co-author's axis contribution is gated by verified participation, so that an author counts toward heterogeneity only when CRediT contribution roles are recorded and the author has prior-publication track record in the declared specialty, and a name added purely to widen a heterogeneity axis, without a contribution role or a track record, does not raise the score. CRediT roles postdate many older papers, so retrospective application to pre-CRediT work runs the participation gate on weaker evidence, falling back to author position and prior-topic track record without the role declaration, which means retrospective CACI scores carry more attribution uncertainty than contemporary ones and should be reported as such.
Implementation.
Extract author count and affiliation from ORCID and Web of Science author records.
Identify career stage from publication history (years since first publication, total publications, position type when available).
Identify institutional type (R1, R2, government laboratory, industry, foreign institution) and geographic location from affiliations.
Identify disciplinary background from prior publication topic modeling.
Apply demographic variables only via consented ORCID self-report (opt-in only, with explicit downstream use disclosure).
Compute heterogeneity scores per axis using entropy or Gini normalization, with team size as a raw count weighted against field-specific optimum.
Combine into composite score with empirically tuned per-axis weights, field-stratified.
Failure mode and mitigation.
Two gaming modes require naming. First, tokenistic co-authorship can inflate the index without substantive contribution from the added authors, partly mitigated by weighting by CRediT (Contributor Roles Taxonomy) contribution roles and by verifying that co-authors have downstream citation linkage to prior work in their declared field. CRediT roles are self-reported and themselves gameable, so this mitigation is partial. Second, and more difficult, a well-resourced laboratory can assemble genuinely credentialed collaborators across varied geographies and institutions to farm the heterogeneity score more easily than an independent investigator can, which inverts the intent by rewarding resource access rather than the collaborative behavior itself. This residual mode cannot be fully closed by measurement, since the added collaborators are real and contributing. The framework names it so evaluators weight CACI as a team-context signal rather than an individual-merit score, and so CACI is never used to disadvantage independent or single-institution investigators. Even in its gameable form the indicator promotes a wanted behavior, since assembling heterogeneous teams that genuinely collaborate is the behavior the bibliometric evidence associates with higher innovation output. CACI should be reported descriptively alongside team size rather than as a rank.
RFI category alignment.
Collaboration.
Definition.
Score quantifying the breadth, balance, and timing of a paper's reference list, inverting the citation asymmetry that drives Matthew Effect concentration in current metrics. Every existing metric rewards being cited, while CBI rewards how widely and well a paper reads the literature.
Motivation.
Current citation metrics suffer from the Matthew Effect (Merton 1968), where established work accumulates citations automatically while comparably valuable work outside dominant citation cliques is overlooked. Every existing metric measures who cites you, and none measures whom you cite, so the metric system rewards being read by established networks but provides no incentive for reading widely outside them. The Citing Breadth Index fills this gap by inverting the asymmetry, scoring whether the paper engages original sources or only reviews, spans field boundaries, balances cited authors across career stages, and covers geographic breadth. This index also addresses a critical operational limitation of citation-based metrics, that they are necessarily lagging indicators, by producing real-time signal at publication when the reference list is immediately available.
Formulation.
CBI(p) = α · OSR(p) + β · FSpan(p) + γ · CSB(p) + δ · GBr(p)
where OSR(p) is the original-source ratio (citations to primary work versus review papers), FSpan(p) is the field span of references across the field-distance matrix, CSB(p) is the career-stage balance of cited authors (entropy across career-stage bins), GBr(p) is the geographic and institutional breadth of cited authors, and weights α, β, γ, δ are tuned empirically against downstream impact correlation.
AI specification.
A reference-list analyzer that classifies each citation by source type (primary versus review), field, cited-author career stage, and institution, then computes balance measures across each axis. Citation context analysis ensures cited works are used in analysis (semantic match between citation and surrounding text), not merely listed.
Parameters and thresholds.
Each of the four components is normalized to 0 to 1. OSR is the fraction of references that are primary sources rather than reviews, classified against journal and article-type metadata. FSpan is the mean pairwise field distance across references using the field-distance matrix of Section 3.9, normalized against the field-specific reference-breadth distribution. CSB is the Shannon entropy of cited-author career stages binned into early, mid, and senior, normalized to its maximum. GBr is the Shannon entropy of cited-author countries and institution types, similarly normalized. The four weights are tuned by regression against downstream impact on a held-out corpus rather than set by assertion, with an equal-weight prior of 0.25 each as the starting point before tuning. A padding gate requires each counted citation to have a semantic match above 0.6 between the citation context and the cited abstract, so reference-list padding with unused citations does not inflate breadth. The index is computable at submission, providing the real-time signal noted in the motivation.
Implementation.
Parse reference lists with structured citation extraction tools.
Classify each citation by source type, field, cited-author career stage, and geography.
Compute balance measures along each axis using entropy or Gini normalization.
Aggregate as weighted sum with empirically tuned weights.
Validate weights against downstream impact correlation in retrospective data.
Failure mode and mitigation.
Reference-list padding with tangentially relevant work can inflate breadth scores without substance. Mitigation requires substance scoring on citations themselves, the cited work must be used in the paper's analysis (semantic match between citation and surrounding text), not just listed.
RFI category alignment.
Cross-cutting calibrator with primary alignment to Foundational Scientific Exploration and Collaboration.
Definition.
A learned meta-indicator that weights the preceding twelve indicators against historical disruption labels, providing a single calibrated probability that a paper will produce disruptive contribution while preserving per-component sub-scores for inspection.
Motivation.
Current metrics provide no calibrated probability of future scientific impact, reporting aggregate counts and rankings without any explicit model of what predicts innovation outcomes. Reviewers must infer likely impact from h-index numbers and citation counts using intuition that varies across panels and institutions, producing inconsistent evaluation across functionally similar applications. The Data-Driven Innovation Composite fills this gap by training a supervised ensemble on historical disruption labels and the twelve upstream indicators, providing a single probability output that summarizes content-level scoring across the framework while preserving per-component inspection for cases where the composite and peer judgment diverge.
Formulation.
DDIC(p) = f_ML( SBTI, MGCI, AAI, PSS, SRQI, EMI, VPEI, GI, FDI, CFII, CACI, CBI )
where f_ML is a supervised ensemble model (gradient boosting plus neural ensemble) trained on retrospective disruption scores from the Funk and Owen-Smith CD index, Wu et al. disruption score, and verifiable long-term impact labels (Nobel Prize awards, NAS membership elections, and foundational-paper recognition in field histories).
AI specification.
A supervised ensemble model with explicit bias-correction terms in the loss function for known demographic and institutional biases in citation networks. Model is required to be inspectable, with feature importance reported per inference and per-component sub-scores always available. Upstream indicators are not statistically independent, since SBTI already incorporates mechanism-graph centrality and gap-filling status among its features, and several indicators share the substance-score calibrator. The ensemble handles this correlation directly (tree and neural ensembles do not assume feature independence), but feature-importance reporting must attribute shared signal transparently rather than obscuring that correlated inputs contribute jointly, so evaluators are not misled into treating correlated components as independent evidence.
Scope.
DDIC is computed at two explicit scopes to keep its inputs coherent. The paper-level composite uses the twelve inputs in the formulation above, of which ten are paper-level (SBTI, MGCI, AAI, PSS, SRQI, EMI, GI, FDI, CFII, CBI) and two are the author-level and team-level indicators (VPEI, CACI) entered only through their paper-attributable components rather than as raw author or team scores. SRQI and EMI enter as their composite meta-index scores rather than as their individual step-indices. An author-level composite additionally incorporates the full author-level and team-level indicators, including the Independent Recognition Indicator (Section 3.13), which is excluded from the paper-level composite precisely because it is an author-level measure. Stating the two scopes explicitly prevents the category error of feeding an author-level score into a per-paper prediction, and it is why the paper-level formulation lists twelve inputs while the author-level composite additionally draws on IRI, exclusive of the SRQI and EMI step-level sub-indices, which enter only through their meta-index composite scores rather than individually.
Implementation.
Compute per-indicator scores for the training corpus across the past 30 years of literature.
Label with disruption scores at 20-year horizon plus categorical labels for Nobel-recognized work and historical foundational papers.
Train ensemble with bias correction for gender, ethnicity, institutional tier, and geographic-region biases verified in citation networks.
Validate against held-out historical disruption cases including Karikó, Brock, Normann, and Sagan papers.
Publish model weights, feature importance, and training data composition openly.
Failure mode and mitigation.
The composite concentrates evaluation authority in a single model that can become opaque if governance is weak. Mitigation requires open-source model with public audit trail, use as advisory input to peer review rather than autonomous decision-maker, mandatory per-component sub-score inspection in any application, and external audit every 18 months by an independent bibliometric panel.
RFI category alignment.
Meta-indicator applicable across all RFI categories.
Consequence and calibration are separate reported axes, and calibration cannot raise the consequence read.
The composite and the diagnostic vector report two axes that must never be netted into one number, a consequence axis and a calibration axis. That consequence axis estimates how much a contribution changes what the field asks, does, or believes, aggregating the forward-looking signals, the disruption probability of this composite, the Field-Define Index of Section 3.8, the Sleeping-Beauty Trajectory Index of Section 3.1, and the mechanism-graph centrality shift of Section 3.2. A calibration axis estimates how honestly and rigorously a paper supports the claims it makes, aggregating the SRQI and EMI reasoning scores, the claim-alignment check, and the honest-limitations credit. Each paper scores independently on the two, and they are reported side by side rather than combined, so a reader sees a consequence estimate and a calibration estimate rather than a blend that hides which one is driving the result.
The rule the split enforces is that calibration cannot raise the consequence read. A rigorous, honestly-calibrated paper that makes a small but correct increment is reported as high calibration and low or uncertain consequence, not as a strong contribution, because honesty about a minor result does not make the result major. This substitution is the failure the framework is most prone to in practice. The calibration machinery is dense, eleven reasoning steps, six engineering steps, and the claim-alignment and limitations checks, while the consequence machinery is four indicators, so an evaluator reading the full vector can let the volume of calibration signal stand in for magnitude. Separating the axes and forbidding the netting removes the substitution, so an honest minor paper and a field-redirecting paper are distinguished on the axis that separates them rather than blended on the axis that does not.
The consequence axis carries the framework's widest uncertainty, since magnitude at publication time is a forward prediction and not a measurement, so it is reported with an explicit confidence interval rather than a point estimate. Low predicted consequence is not scored as a deficiency, because the incomputability-scores-high rule of the founding-novelty case applies here. A contribution too far ahead of its field to have measurable downstream signal is flagged as high-uncertainty and possibly dormant rather than as minor, which is the protection the Karikó case in the preface demands. The axis therefore distinguishes three states rather than two, high consequence, low consequence from a genuinely small increment, and unresolved consequence from a contribution the field has not yet caught up to. It resolves over time as downstream signal accrues, so an early high-uncertainty score is revisited rather than fixed, and the retrospective validation of Section 7.5 tests whether the axis separates the small increment from the dormant one.
Definition.
A distribution-robust measure of how much of a researcher's recognition comes from authors with no co-authorship relationship to them. It is computed per paper across the entire publication portfolio and aggregated by a median that prevents one or two high-citation papers from dominating the result, distinguishing recognition earned from independent groups from recognition inherited through collaboration networks or borrowed from senior co-authors.
Motivation.
Current citation counts conflate independent recognition with network-internal citation, so a researcher heavily cited by their own collaborators, trainees, and co-authors appears identical to a researcher cited by strangers who encountered the work on its merits. This favors researchers embedded in large collaboration networks and disadvantages independent investigators whose work is recognized outside any citation clique, reproducing the Matthew Effect at the author level. The Independent Recognition Indicator fills this gap by using citation author-position analysis to separate citations from non-collaborators (people who have never co-authored with the researcher) from citations inside the collaboration network, measuring recognition that cannot be explained by network membership. This is the one salvageable mechanism from perception-based reputation scores, extracted as an objective network measurement rather than a subjective reputation impression.
Formulation.
Computing independence over pooled citations across a researcher's whole portfolio lets one or two high-citation papers dominate the statistic, since citation distributions are lognormal-to-power-law and a small number of papers typically hold most citations. A researcher with one widely-cited method paper would have their independence score determined almost entirely by that single paper rather than by their body of work. The indicator avoids this by computing independence within each paper first, then aggregating with a distribution-robust central tendency that treats each paper as one observation rather than weighting by citation count.
r(p) = |{ q ∈ Cit(p) : coauth(a, author(q)) = 0 }| / |Cit(p)|
computes the per-paper independence ratio, where Cit(p) is the set of papers citing paper p and coauth(a, author(q)) = 0 holds when no author of citing paper q has ever co-authored with researcher a. Each ratio is then field-normalized against same-field, same-era papers, and the researcher-level indicator is the median across the portfolio:
IRI(a) = median over p ∈ Papers(a), |Cit(p)| ≥ c_min of znorm( r(p) )
where znorm(...) field-normalizes the per-paper ratio to a z-score against the distribution of independence ratios for same-field, same-era papers (correcting for the fact that small fields have fewer possible non-collaborators), and c_min is a minimum citation floor excluding near-zero-citation papers that would add noise. The median across papers is robust to the one-or-two-outlier-paper distortion by construction, since a single dominant paper is one observation among many rather than the bulk of a pooled sum, which prevents the skewed-portfolio artifact that a citation-weighted pooled fraction produces.
AI specification.
A citation-network analysis over a co-authorship graph, computed per paper then aggregated. The co-authorship graph is constructed from the full publication corpus, and for each citation to each paper the system checks whether any author of the citing paper has ever co-authored with the researcher. Per-paper independence ratios are field-normalized against same-field same-era distributions before the median aggregation. This requires accurate author disambiguation, which ORCID and modern name-disambiguation models provide with high but imperfect accuracy.
Parameters and thresholds.
Co-authorship is defined over all prior years with no decay, so any shared authorship at any time marks a citing author as network-internal. The citation floor c_min excludes papers below the field-median citation count for papers of the same age, which removes low-signal papers without discarding recent or early-career work disproportionately (the floor is field-and-age-relative, not an absolute count). Hyperauthorship papers above 50 authors are excluded from the co-authorship-relationship computation, since consortium membership does not indicate a genuine collaboration tie. Aggregation uses the median across qualifying papers as the primary statistic, with the interquartile range reported alongside to expose bimodal portfolios (some highly independent papers, some network-internal), and the full per-paper distribution is shown rather than collapsed, so a researcher is characterized rather than ranked. Disambiguation confidence below a per-author threshold flags the researcher for manual review rather than producing a possibly-erroneous automated score.
Implementation.
Construct co-authorship graph from full corpus with ORCID-based author disambiguation.
For each paper by the researcher, compute the per-paper independence ratio (non-collaborator citations over total citations to that paper).
Field-normalize each per-paper ratio against same-field, same-era papers to a z-score or percentile.
Aggregate across the researcher's papers using the median (or trimmed mean) over papers above the citation floor, so no single high-citation paper dominates.
Report alongside raw citation count and the per-paper distribution so evaluators see recognition quality across the portfolio, not just a pooled fraction driven by outliers.
Failure mode and mitigation.
Three failure modes require mitigation. Skewed-portfolio distortion, where one or two high-citation papers would dominate a pooled fraction, is addressed by the per-paper-then-median aggregation above. Author disambiguation errors misclassify co-authorship relationships, mitigated by ORCID-verified disambiguation. Very large collaborations (consortium papers with hundreds of authors) distort the co-authorship graph, mitigated by excluding hyperauthorship papers above a co-author threshold from the relationship computation. The indicator is reported as a descriptive pattern with its per-paper distribution rather than a single rank, so network-embedded researchers are characterized, not penalized.
RFI category alignment.
Collaboration, as a measure of recognition independent of collaboration-network membership.
Note on deliberate overlap.
Building the SRQI step-indices fresh produced intended overlaps with three existing indicators, and the framework resolves each by scope rather than by deletion. Gap identification appears both as the reasoning-level Gap-Identification Index (Section 3.5.2), which scores how well a single paper identifies and argues its gap from its own text, and as the network-level Gap Index (Section 3.7), which tracks gaps across the literature and credits filling a gap independently identified by others, so the first is a within-paper reasoning measure and the second is a cross-literature contribution measure. Synthesis appears as the reasoning-level Synthesis-and-Integration Index (Section 3.5.10), scoring connection density in the paper's own argument, and as the Cross-Field Integration Index (Section 3.9), scoring reference-list field span with downstream integration, so the first reads the reasoning and the second reads the citation structure. Mechanism density appears as the mechanism dependency inside SSyn and as the Mechanism-Graph Centrality Index (Section 3.2), which scores a paper's concepts against the field-wide mechanism graph, so the first is internal and the second is network-level. In each case the step-index is the within-paper reasoning signal and the standalone indicator is the network or literature signal, they are reported separately, and the Data-Driven Innovation Composite treats them as correlated inputs per its Scope note rather than as independent evidence.
RFI category alignment.
Cross-cutting, clarifying the relationship among Rigor, Foundational Exploration, and Collaboration indicators.
The table fills the gate contract of Section 3.5.0 for each AI-based metric, so the building phase inherits a populated specification rather than a principle it must remember to apply. Each row is a required deliverable, a blank cell blocks the metric from build-readiness, and the acceptance-test column is the field that detects a build populated in form but not in function. Metrics that carry no content gate are named after the table as rule-based or constitutive, with the reason each is exempt, so the boundary is fixed here rather than left for a builder to draw.
Metric
Grounding artifact (emitted, ablation-load-bearing)
Retrieval or evidence scope
Held-out features
Acceptance test (known-outcome)
SObs (3.5.1)
Retrieved k=50 prior-neighbor set with entailing sentences (novelty), plus an examined primary-data sample, waveform, trace, or image (fidelity)
Target subfield plus adjacent fields by geodesic
Categorical-language strength, venue prestige, abstract salience, narrative framing
Credits a corroborated novel observation, flags a re-presented known finding, and flags physiologically implausible primary data on held-out cases
SGap, SHyp, SSyn (3.5.2, 3.5.4, 3.5.10)
Retrieved prior claims that leave the gap open, that the hypothesis departs from, or that the synthesized link connects, each with source sentences
Target plus adjacent and connected fields by geodesic
The paper's own claim of a gap, of originality, or of interdisciplinarity
Credits a gap or link absent from the prior set, flags one the prior set already contains
SExec (3.5.7)
Extracted procedural parameters checked against the per-method-class reproducibility checklist (ARRIVE, MIQE, or a community equivalent)
Method-class checklist, not a corpus
Method-section length and the words rigorous or validated
Flags a named reagent without concentration or an antibody without catalog number, and flags results the stated method could not produce
SAna (3.5.8)
The paper's reported statistics and extracted data structure, checked against fixed statistical rules
Fixed rules of statistics and the paper's internal logic, not a corpus
The word significant without a p value, and conclusion-strength language
Flags test-to-data-type mismatch, pseudoreplication, and a conclusion about a quantity the test did not measure
MGCI (3.2)
The extracted mechanism-graph edges with the source sentence asserting each edge, emitted for human check
Biomedical relation-extraction corpus plus adjacent fields by geodesic
Claim language asserting a concept is central or important
Extracted edges match human-adjudicated edges above a stated F1, and centrality reproduces known hub concepts on held-out graphs
AAI (3.3)
Retrieved prior papers that state or rely on the assumption the paper claims to overturn, correct, or build past, with the entailing sentences
Target plus adjacent fields, so an assumption settled next door is caught
The paper's framing that it is the first to, and venue
Credits a genuine correction, does not credit a paper overturning an assumption no prior neighbor held, and recognizes a correction already made in an adjacent field
GI (3.7)
Retrieved prior papers documenting the gap across distinct author groups, with the documenting sentences
Target plus adjacent fields
The paper's self-declaration that a gap exists
Credits a gap independently documented by at least a stated number of distinct groups, does not credit a self-declared gap no prior work states
FDI (3.8)
Retrieved later adopting papers plus the first-use origin evidence, with citation-context sentences separating originator from elaborator and co-opter
Target plus adjacent fields plus downstream adopters
The paper's self-claim of origination, and venue prestige
Attributes origination to the true originator on held-out concept-lineage cases, does not credit a popularizer over the originator, and catches prior origination in an adjacent field
CFII (3.9)
Retrieved prior work in each connected field plus the sentences showing the borrowed method or concept is used in the analysis, not merely cited
Each connected field plus adjacents by geodesic
Cross-field citations that are tokenistic, and interdisciplinarity framing
Credits genuine methodological dependency and flags tokenistic cross-citation on held-out cases
VPEI (3.6)
The examined communication artifacts, verified reach records, and downstream policy or educational citation records
Content-quality scoring against a domain-accurate reference
Reach inflated by virality or controversy, and follower counts
Credits substantive accurate communication, does not credit high-reach low-substance output, verified against known policy-citation outcomes
CBI (3.11), content part
The reference set with its field, recency, and group distribution (structural), plus the entailment that each sampled citation supports its citing sentence (content)
The paper's reference list and the cited works
Raw reference count and self-citation padding
Rewards broad and apt referencing, does not reward a padded or off-target reference list
SBTI (3.1), content part
Early-trajectory features (structural), plus the prior-neighbor set showing the paper is ahead of its contemporaries (content)
Target plus adjacent fields for the content part
Early citation count and venue
Content score is high at publication on known sleeping-beauty cases, does not fire on mere obscurity or low quality
DDIC (3.12), meta-learned
The per-component sub-scores it weights, each already gated and inspectable
The twelve upstream indicators, no direct corpus retrieval of its own
Raw citation counts, so it cannot learn to proxy citation
Predicts held-out disruption above a citation-only baseline, and preserves per-component transparency
EMI steps (3.5E)
The quantified requirement, the explored design points, and the validation results against each requirement, plus the prior-engineering neighbor set for the novelty axis
Prior engineering literature plus adjacent fields
Naming an operation in place of performing it, and device-marketing language
Credits a quantified and validated design, flags an unquantified requirement or a validation that skips a requirement, on held-out engineering papers
Analytical-mode floor (3.5A)
The retrieved prior set for the linkage-novelty check, and the citing sentences for the source-characterization check
Target plus adjacent fields
Fluent synthesis language and the number of citations spanned
Flags a synthesis that re-states known links, one that mischaracterizes a cited source, and one that smuggles an unstated assumption
Four of these metrics, AAI, GI, FDI, and CFII, ground their scores in a retrieved set of prior claims rather than in primary data. Their contract is met only when that retrieved set is the object the score is computed from, not a set displayed beside a score computed some other way. The ablation applies with full force here, since a relational metric is the easiest place to emit a nominal artifact, and a builder who retrieves the neighbor set but scores from the paper's own framing has produced exactly the presentation-gamed score the contract forbids. Each of these acceptance tests therefore includes a claim whose framing asserts a contribution the retrieved set does not support, and the metric passes only if it scores that claim on the set and not on the framing.
The Data-Driven Innovation Composite is a learned meta-weighting rather than a content scorer, so its grounding artifacts are the per-component sub-scores it combines, each carrying its own gate, and its distinctive risk is learning to reproduce citation rank because citation is the easiest signal correlated with its disruption labels. Its held-out feature is therefore the raw citation count, excluded from its inputs so it cannot proxy citation, and its acceptance test requires out-of-sample disruption prediction that beats a citation-only baseline. A composite that merely re-derives the citation ordering has added nothing and has reintroduced the incumbent it was meant to replace.
Three metrics are part structural and part AI, and the contract binds only their AI part. SBTI computes early-trajectory features by rule and scores content novelty by the SObs machinery, so it inherits the SObs row for its content half. PSS computes data-novelty and the primary-data ratio by provenance matching and reuses the SAna checks for statistical rigor, so its AI exposure is the SAna row rather than a separate contract. CBI computes breadth, balance, and timing structurally and scores citation aptness by entailment, so only the aptness component carries the content gate. Naming which half is AI keeps the structural half from being treated as gated and the AI half from being treated as rule-based.
Two indicators carry no content gate because they compute from structure rather than from reading claims against priors. The Co-Author Composition Index reads team configuration from authorship and consented ORCID fields, and the Independent Recognition Indicator reads the co-authorship graph, so both are rule-based and their correctness is a matter of correct counting rather than correct judgment. Constitutive metrics of the incentive arm are likewise rule-based. Stating this boundary explicitly matters, because the failure this whole section guards against is a builder assigning an AI metric to the rule-based side to escape the gate. The ablation is the enforcement, since a metric that claims to be rule-based but whose score moves when a retrieved neighbor set is withheld is an AI metric operating without its contract.
Each row's acceptance test joins the Element 6 acceptance gates and the thin-prior stress test of Section 7.4 as a per-metric deployment condition, so a metric ships only when it passes the test written for its own failure mode. The framework that reaches deployment is therefore the subset of AI metrics that are real, rather than the full set with some built cosmetically and carried by an aggregate.
The following table summarizes which of the fourteen top-level diagnostic-arm indicators address each of the seven RFI categories, where the Scientific-Reasoning Quality Index and Engineering-Method Index are meta-indices each composing named step-level sub-indices (Sections 3.5 and 3.5E). Indicators with primary alignment are noted as primary, and indicators with secondary alignment are noted as secondary. The incentive arm (Section 5) maps separately.
RFI Category
Indicators (primary and secondary)
Rigor and Reproducibility
AAI (primary), GI (primary), SRQI (primary), PSS (secondary)
Data, Software, and Model Sharing
Addressed by RGI (Reuse Generation Indicator) in the incentive arm, Section 5.3.1
Training and Mentorship
Addressed by TTI (Trainee Trajectory Indicator) in the incentive arm, Section 5.3.3
Collaboration
CFII (primary), CACI (primary), IRI (primary), CBI (secondary)
Entrepreneurship and Translation
EMI (primary, for design-and-build work). Existing metrics overweight translation as an outcome, so the framework adds no outcome-based translation indicator, but the Engineering-Method Index scores the quality of the engineering reasoning itself, which existing metrics ignore
Foundational Scientific Exploration
SBTI (primary), MGCI (primary), FDI (primary), GI (secondary), SRQI (secondary), CBI (secondary)
Public Impact
VPEI (primary)
Cross-cutting
PSS (substance calibrator), SRQI (reasoning-quality calibrator), CBI (real-time signal), DDIC (meta-composite)
Foundational Scientific Exploration is intentionally over-represented in the framework, with three primary indicators plus two secondary, because the existing metrics most damage this category. Entrepreneurship and Translation is intentionally not addressed with new diagnostic indicators because existing metrics already overweight translation, and the structural argument of this submission is that translation-as-primary-metric is itself the problem driving the false-positive failure mode.
This arm is the primary contribution of the submission. Most of its indicators require no artificial intelligence and can be enacted through human judgment and verified counts, which makes them deployable now and independent of whether any diagnostic-arm scorer is ever built. Where an incentive indicator does use AI components, it reuses the diagnostic-arm infrastructure and inherits the same gating preconditions. The arm leads with foundational and exploration incentives because those behaviors have the longest payoff horizon and the weakest existing constituency.
The diagnostic indicators in Sections 3 and 4 detect quality the existing metrics miss, and they share the vulnerability of all proxy metrics. Goodhart's Law holds that when a measure becomes a target it ceases to be a good measure, because agents optimize the proxy rather than the underlying quality it stands for. This applies to any metric that is a proxy, a stand-in for something else. It applies far less to a constitutive metric, where the measured behavior is itself the thing wanted rather than a proxy for it. If the system credits verified independent reuse of a shared dataset, the only way to score high is to produce a dataset that other groups actually reuse, so gaming the metric is the wanted behavior. The incentive arm is built entirely from constitutive metrics, each designed so that the dominant strategy for a researcher optimizing the metric is to perform the behavior the system wants more of.
The incentive arm rewards behaviors at their cost point, the place where a researcher currently pays a career penalty for doing the wanted thing. Exploration costs failure risk, so the incentive arm credits the failed high-risk attempt rather than only the success. Replication costs novelty, so it credits the replication. Data and tool sharing costs competitive advantage, so it credits verified downstream reuse. Corrective work costs the appearance of productivity, so it credits the dead end retired. Each indicator neutralizes the specific disincentive that currently suppresses a behavior the public-funding mission depends on.
The incentive-arm indicators vary in how fully constitutive they are, and the framework states this explicitly rather than claiming uniform gaming-resistance. Reuse, replication, corrective, and trainee indicators are fully constitutive, since the only way to score high is to produce reusable artifacts, run replications, publish genuine corrections, or train independent scientists, so optimizing the metric is the wanted behavior. Two indicators are only partly constitutive. The research risk portfolio is gameable by cheap-risk farming until its initiation-assessment mechanism exists, and the foundational investment indicator is gameable by relabeling incremental work. Even these partly-gameable metrics tend to promote productive behavior, since the cheapest path to a high score still routes through attempting more ambitious work or foregrounding mechanism over translational packaging, which are behaviors the system wants more of even in their gamed form. Where a residual gaming mode cannot be closed, the framework names it so evaluators can weight the indicator accordingly and so future work can address it, rather than presenting the metric as fully robust.
The incentive arm, and within it the foundational and career-terminal metrics specifically, should carry heavier evaluative weight than the diagnostic arm. That weighting is structural rather than technical. Study sections reward near-term clinical translation because it is fundable within a grant cycle, and public and congressional sentiment reward the same because visible cures arrive during the appropriations window. Both forces already push toward the short horizon, so an evaluation system weighting all work equally under-invests in the long-horizon foundational research that has the greatest cumulative payoff and the weakest existing constituency. Heavier weight on long-horizon incentives corrects this standing imbalance, a correction only a public funder with a multi-decade mission has the mandate to make. That the incentive metrics are largely Goodhart-resistant is what makes the heavier weight safe to apply, with the two partly-constitutive indicators weighted more cautiously until their residual gaming modes are closed.
Definition.
A career-spanning score of the high-variance, high-potential-value work a researcher has attempted, crediting the attempt itself including failed attempts, conditioned on the potential value of each attempt if successful.
Motivation.
Current metrics reward outcomes, so a researcher who attempts ten high-risk projects and succeeds at one scores worse than a researcher who completes ten safe incremental projects, even though the first researcher produced the more valuable result and advanced the field's frontier. This punishes exactly the exploratory behavior that foundational science requires and that only public funding tolerates, because private capital cannot absorb a portfolio with a high failure rate. The Research Risk Portfolio Indicator fills this gap by crediting the attempt at high-value risk rather than only the success, so a researcher optimizing this indicator attempts more ambitious work, which directly counters the chilling effect that suppresses exploratory volume.
Formulation.
RRPI(a) = Σ over j ∈ Projects(a) of risk(j) · value_if_success(j)
where risk(j) is the prior probability of failure at project initiation (higher risk scores higher, and the score does not depend on whether the project succeeded), and value_if_success(j) is the potential value of the project conditional on success, assessed at initiation. Conditioning credit on potential value prevents the cheap-risk-farming failure mode, in which a researcher accumulates risk credit through many low-value attempts unlikely to succeed and unlikely to matter. Only high-stakes risk earns credit, not mere failure.
Implementation.
Robust specification requires a registered-report-style initiation mechanism. At project start, before outcomes are known, peer assessment records the project's risk (probability of failure) and potential value if successful. This front-loaded assessment is reliable because it does not require estimating the counterfactual payoff of already-failed work, which post-hoc scoring cannot do accurately. A lighter interim specification uses AI scoring of novelty distance from the researcher's prior work and departure from field consensus at publication or preprint time, which is buildable immediately but weaker because it cannot assess potential value as reliably and is more exploitable by cheap-risk farming. The registered-report mechanism requires NIH to build initiation-time assessment infrastructure, which is a program-design investment beyond pure measurement, and is squarely within the incentives the RFI solicits.
Failure mode and mitigation.
Cheap-risk farming (accumulating credit through low-value long-shot attempts) is the dominant failure mode, mitigated by conditioning all risk credit on potential value assessed at initiation. This mitigation depends on the registered-report initiation mechanism, which does not yet exist, so the lighter interim AI-only specification has no working defense against cheap-risk farming and should carry zero evaluative weight in funding or promotion decisions until the initiation-assessment infrastructure is built. The interim version is useful only descriptively, for surfacing exploratory-work patterns, not for ranking. Even in its unguarded interim form the indicator tends to promote a wanted behavior, since the cheapest path to a high score still routes through attempting more ambitious departures from consensus, which raises exploratory volume even when the potential-value conditioning is absent. Peer-assessment capture at initiation is a secondary risk in the full version, mitigated by blinded multi-reviewer assessment and periodic audit of initiation scores against realized trajectories.
RFI category alignment.
Foundational Scientific Exploration (primary).
Definition.
A score crediting sustained investment in basic, mechanism-level, and pre-competitive research that has no near-term translational application at the time of the work, measured by the depth and continuity of foundational contribution rather than by translational output.
Motivation.
Current metrics and study-section incentives reward translational framing, so researchers doing foundational mechanism work must dress it in translational language to stay funded, which distorts the work and defunds the pure foundational inquiry that has the longest eventual payoff. Basic science that industry will not touch, because its returns are decades away and cannot be captured by any single actor, is precisely the public-funding mission, yet no metric rewards it as such. The Foundational Investment Indicator fills this gap by crediting sustained foundational contribution directly, so that doing basic science is rewarded rather than penalized, and researchers need not disguise mechanism work as near-term translation.
Formulation.
FII(a) = Σ over p ∈ Papers(a) of found(p) · depth(p) · (1 + delayed_realization(p))
where found(p) is the degree to which paper p is foundational rather than translational (scored from content, high for mechanism and theory work with no near-term application claim), depth(p) reuses the mechanism-graph centrality and paper-substance scores, and delayed_realization(p) is a bonus that accrues if the foundational work later enables translation, rewarding the researcher retroactively for foundational bets that matured.
Implementation.
Foundational-versus-translational scoring reuses the diagnostic-arm content classifiers, inverting the translation-hook detection so that absence of a near-term application claim combined with high mechanism centrality scores high. The delayed-realization bonus links foundational papers to later translational work that cites them as enabling, computed at long horizons (10 or more years), which requires the same longitudinal citation infrastructure as the sleeping-beauty indicator.
Failure mode and mitigation.
Researchers could relabel incremental work as foundational to capture the credit, mitigated by joint conditioning on mechanism-graph centrality and substance scores, so that foundational credit requires demonstrated mechanistic depth rather than a mere absence of translational claims. Field-stratified calibration is required because the foundational-translational balance differs across disciplines.
RFI category alignment.
Foundational Scientific Exploration (primary).
Definition.
Credit for verified independent reuse of datasets, code, reagents, protocols, and other research artifacts a researcher has shared, weighted by the independence and productivity of the reusing work.
This indicator addresses the Data, Software, and Model Sharing category the RFI names and the diagnostic arm defers. It is constitutive, since scoring high requires producing artifacts that other groups genuinely reuse, and gaming it is the wanted behavior. Verified reuse is tracked through data and code citation, repository fork and download provenance, and reagent-request records, weighted by whether the reusing group is independent of the originator. Sharing costs competitive advantage, and this indicator credits at that cost point. Foundational and infrastructure artifacts (shared datasets, reference methods, model organisms) are the highest-value case because they are pure public goods the private sector will not create. RFI category alignment: Data, Software, and Model Sharing (primary), and infrastructure commons.
Formulation and data availability.
RGI(r) sums, over the artifacts a researcher has shared, the independence-weighted count of downstream reuses, where each reuse is weighted by whether the reusing group shares no co-authorship tie (the IRI independence machinery) and by the reusing work's own substance, so a dataset reused by many independent productive groups scores far above one reused only by the originating lab. Reuse is detected from data-and-code citation (DataCite and software-citation records), repository fork and download telemetry where available, and formal reagent-and-materials citation.
Data requirement, partially available. Citation of data and code via DataCite and Software Heritage, together with repository telemetry, is queryable today, so the dataset-and-code portion is buildable now, but systematic reagent-and-materials reuse tracking (which cell line, plasmid, antibody, or mouse line a downstream paper obtained from whom) is not a queryable resource. Capturing the reagent-and-materials contribution the definition names as highest-value therefore requires a materials-and-data-reuse registry that does not currently exist, which NIH is uniquely positioned to build, maintain, and drive adoption of, since such a registry is itself non-appropriable public-good infrastructure the private sector will not create, and building it is a program-design investment squarely within the incentives the RFI solicits. Until that registry exists and reuse reporting is normalized (plausibly through linkage to existing repositories such as Addgene, ATCC, and model-organism stock centers and through a data-management-and-sharing-plan reporting requirement), RGI measures data-and-code reuse well and reagent reuse poorly, and the reagent gap should be stated wherever RGI is reported.
Corrective and negative work has near-zero annual reward but large cumulative value, since a field spared decades chasing a wrong model is an enormous saved cost, and only non-commercial science performs it because there is no product in proving something does not work. In earlier framework versions this was a separate corrective-contribution indicator, but its publication-time core (how well a paper engages and overturns a wrong assumption) is the same act the Assumption-Addressing Indicator (Section 3.3) now scores, and its lagging question (was the correction ultimately right, did the field come to agree) is the career-terminal Assumption-Correctness Indicator (Section 5.4.2). The two-metric split is deliberate, since the quality of the corrective engagement is observable at publication while its correctness is observable only over years, and binding them into one score would force the correctness question to be answered prematurely. Corrective work is therefore credited by AAI at the paper level and by the career-terminal correctness metric across a career, not by a separate incentive-arm indicator here.
Definition.
Credit for the positive career outcomes of a researcher's trainees, broadly construed to include both independent scientific success and successful careers outside research, weighted toward outcomes that reflect the trainee's own trajectory rather than the mentor's. A trainee who launched an independent research direction distinct from the mentor's and a trainee who was honestly guided into a well-fitting non-research career both count as successes, not only the former.
A career's largest long-term multiplier is often the people it launched, and current metrics credit none of this terminally. The indicator credits trainee independent scientific success (their own funding, their own foundational contributions, their own trainees) and, equally, verified successful non-research careers, weighted toward divergence from the mentor's exact topic so it rewards training for independence rather than cloning. A design principle is load-bearing here, since the metric must not reward retention-for-its-own-sake, the silver-spoon pattern of keeping a poorly-suited trainee in the lab for years of low-productivity work before an eventual washout. Honestly recognizing early that a trainee is not suited to research and guiding them into a career where they will thrive is a good mentorship act that saves the trainee years of lost productivity. TTI therefore counts that redirected trainee as a success rather than a zero, and it never scores a departure as a negative. It is constitutive, since scoring high requires producing people who succeeded, in or out of research, rather than retaining bodies. This addresses the Training and Mentorship category the RFI names and the diagnostic arm defers. RFI category alignment: Training and Mentorship (primary).
Formulation and data availability.
TTI(r) sums, over a researcher's verified trainees, each trainee's positive outcome score. A research-track trainee contributes their independent scientific success (independent funding, last-author output, and their own trainee production, weighted toward topical distance from the mentor by the SObs field-distance machinery), while a non-research-track trainee contributes a verified successful career outcome (a stable science-adjacent, industry, policy, education, or other professional position). Departures are never scored negatively, so a mentor is not penalized for a trainee who left, and a trainee redirected early into a well-fitting career counts as a positive outcome, which removes the incentive to retain a poorly-suited trainee for the mentor's short-term labor. The sum is normalized by trainee count and career stage so that large labs do not dominate by volume and so that recent mentors are scored on their trainees' realized-so-far trajectories with appropriate uncertainty.
Data requirement, partially available and partly manual, with one component genuinely hard. Mentor-trainee links are recoverable from co-authorship-then-independence patterns and from dissertation and ORCID records for some trainees, so a per-lab research-track TTI is assemblable by hand today, but population-scale automated TTI requires the training-lineage registry listed in Section 6.6, which NIH would build and which does not currently exist. The non-research-career-outcome component is harder still, since tracking where trainees went after leaving academia has no systematic source and would depend on the same registry incorporating voluntary career-outcome reporting, plausibly through institutional alumni tracking or a trainee-outcome reporting requirement tied to training grants.
One limitation is intrinsic and stated plainly. Honest early redirection of a poorly-suited trainee and neglectful churn of a mismentored trainee can produce similar external traces, since both show a trainee leaving, and distinguishing them requires knowing whether the departure served the trainee, which the records do not hold. TTI's not-scoring-departures-negatively rule avoids penalizing honest redirection, but it also means the metric cannot by itself reward good redirection over bad churn. Fully separating the two would require a trainee-side signal (an alumni assessment of whether the training served their eventual career) that no current mechanism collects and that would raise its own gaming and privacy considerations. Until such a signal exists, TTI credits verified positive outcomes and stays silent on departures rather than pretending to judge the quality of a departure it cannot observe.
Definition.
Credit both for performing replication studies and for producing original work that independent groups successfully replicate, where the replication relationship between two papers is detected by comparing their experimental structure across the corpus rather than requiring either paper to declare itself a replication.
Replication scores near zero on current career metrics despite being foundational to a reliable evidence base. The indicator credits performing replications (which cost novelty) and producing replicable work (verified when an independent group re-runs the design and confirms the result). Detection compares experimental structure across the corpus, so a replication need not declare itself, and the same comparison distinguishes three relationships a later paper can bear to an earlier one, of which only replication is credited here. It is constitutive on both sides, since scoring high requires either running real replications or writing methods rigorous enough that independent replication succeeds. RFI category alignment: Rigor and Reproducibility (primary).
Formulation and data availability.
RCI detects the relationship between a later paper and an earlier one by comparing experimental structure (experimental groups, control groups, model system, and measured endpoints extracted by the Section 3.5.0 Element 1 and matched by Element 2) together with the temporal-and-citation pattern between them. It then classifies the relationship into one of three and credits only replication. Pure replication is a large publication-time gap plus a back-reference to the earlier paper plus a re-run of substantially the same design, and it earns credit for the replicator and, on the other side, for the original paper whose work proved replicable when the results concord. Independent convergence is a small time gap (within the field-calibrated concurrent-submission window) with no back-reference, indicating two groups had the same idea at once, which is corroboration rather than replication and is labeled but not scored by RCI. Extension is a shared experimental base with added analysis, characterization, or data the original lacked, which is a distinct valuable act scored by the Extension and Deepening Indicator (Section 5.3.5) rather than credited here. Both replication directions are weighted by the independence of the replicating or replicated group via the IRI co-authorship machinery, and concordance (whether the replication confirmed or contradicted the original) is detected by comparing the reported results through the Element 2 entailment machinery.
Data availability, computable now with caveats. The three-way detection reads the existing corpus through the Element 1 and Element 2 machinery, so RCI no longer depends on a replication-linkage registry as a precondition and is buildable now, though the registry listed in Section 6.6 would improve precision and record outcomes the corpus does not state explicitly. Separating independent convergence from replication turns on a concurrent-submission window that is field-calibrated from each field's publication-lag distribution rather than a fixed interval. That window also accounts for preprint priority dates, since a preprint posted a year before journal publication moves the real priority date and can turn an apparent replication into independent-but-earlier work or the reverse. Three limitations remain. Design extraction and cross-paper matching carry a nontrivial error rate that varies by field reporting convention, concordance detection (did the replication confirm the original) is a further entailment step with its own error rate, and the convergence window is noisy at its boundary where publication lags overlap.
Definition.
Credit for taking a prior experimental base and adding the analysis, characterization, or data collection the original lacked, the re-search-goes-deeper act that matures a field and that current metrics undervalue because it is neither a celebrated novel breakthrough nor a pure replication.
Motivation.
Most work that reuses a prior experimental design does not replicate it, it extends it, running substantially the same conditions while collecting data the original missed, characterizing a mechanism the original left unexamined, or pushing the analysis deeper. This is how fields actually mature, yet it falls in the gap between novelty and replication that citation counts and venue prestige both undervalue. The indicator credits the deepening act directly, crediting only the extender since the original base is credited by its own metrics, and it is constitutive because scoring high requires actually going deeper on a real prior base rather than repackaging it. RFI category alignment: Rigor and Reproducibility and Foundational Scientific Exploration.
Formulation and data availability.
EDI shares the RCI corpus-comparison detection, identifying an extension relationship as a shared experimental base (high design overlap by the Element 1 and Element 2 machinery) combined with added analysis, characterization, or data absent from the original. Credit requires clearing two guards that together block trivial add-ons. A depth guard scores the substance of what was added relative to the original, using the SObs and SAna machinery on the added material so a single confirmatory figure or a cosmetic reanalysis does not qualify, only a substantive deepening does. Beyond depth, a gap-filling guard requires the addition to answer a question the original raised or left open, tested by entailment between the original's stated limitations or open questions and the extension's contribution, so adding arbitrary data unrelated to what the original actually needed does not score. Only work clearing both guards is credited, and only to the extender. This is computable now from the corpus through the shared detection machinery, with the same design-extraction error-rate caveat as RCI, and the registry of Section 6.6 would improve precision but is not a precondition.
Failure mode and gaming defense.
Salami extension, farming credit through many minor me-too-plus-one-figure add-ons, is the dominant gaming vector, blocked by the depth guard, which scores the substance of the addition rather than its existence, and by the gap-filling guard, which requires the addition to address a question the original actually left open. A second vector is claiming a shared base the paper did not truly build on, blocked because design overlap is measured from the extracted experimental structure rather than from the paper's framing. Because both guards emit their grounding (the original's stated open question and the specific added analysis), a reviewer can verify each extension credit against the two papers rather than trusting the score.
Definition.
A metric computed once, late in or after a career, scoring the whole arc of a researcher's foundational contribution across dimensions no annual metric can observe, ordered by what most advances science over the longest horizon.
Motivation.
The Nobel Prize and National Academy membership are the field's existing career-terminal metrics, and they are sparse, slow, politically mediated, and recognize a handful of researchers per generation. Every other metric in the evaluation system operates on annual or grant-cycle horizons that structurally cannot see origination, delayed payoff, or generativity, because these resolve over decades. The absence of a computable career-terminal metric means the vast majority of researchers have nothing to strive for that rewards the long-horizon foundational bets the annual metrics actively punish. This career-terminal indicator fills this gap by scoring, once and late, the accumulated foundational contribution of a career, giving researchers a long-term target that rewards exactly the risks, discoveries, and field-building that the near-term system suppresses. It is Goodhart-resistant almost by construction, since a metric computed once over decades of work cannot be gamed by writing to a scorer within any tractable optimization horizon.
Formulation.
LFII(a) = w₁·Orig(a) + w₂·Delayed(a) + w₃·Risk(a) + w₄·Gen(a) + w₅·Corr(a) + w₆·Infra(a)
The six components are ordered by how much each advances science over the longest horizon, and equivalently by how strongly each represents work only public funding will support, since the two orderings coincide (the most generative and longest-maturing acts are also the least appropriable by private capital):
Origination (Orig): fields, methods, concepts, or tools that did not exist and now do, and that others built on. First because origination is the rarest and most generative scientific act, enabling thousands of downstream careers.
Delayed impact realized (Delayed): work that scored low for years or decades and then became foundational, the sleeping-beauty payoff. Second because it is the direct empirical case for patient public funding, visible only to a metric that waited.
Risk portfolio (Risk): the lifetime record of high-variance, high-value attempts, crediting the failed bets alongside the successes. Third because it rewards the exploratory behavior itself.
Generativity (Gen): the independent scientific lineage launched, trainees who became independent scientists and diverged from the mentor's topic. Fourth because a career's largest long-term multiplier is often the people it produced.
Corrective contribution (Corr): work that overturned an error, retired a dead end, or resolved a controversy, sparing the field wasted decades. Fifth because corrective value is large cumulatively and rarely rewarded elsewhere.
Infrastructure and commons (Infra): datasets, reagents, tools, cohorts, and methods that became shared infrastructure the whole field depends on. Sixth, and the clearest public-goods case, since shared infrastructure is non-appropriable by definition.
Implementation.
Each component reuses infrastructure from the diagnostic and incentive arms computed at career-terminal horizons. Origination reuses the field-define indicator. Delayed impact reuses the sleeping-beauty trajectory realized at long horizon. Risk reuses the research risk portfolio. Generativity reuses the trainee trajectory indicator. Correction reuses the assumption-addressing indicator at the paper level and the assumption-correctness indicator (Section 5.4.2) at career horizon. Infrastructure reuses the reuse generation indicator aggregated over a career. The weights are set by the evaluating body (an academy, an award committee, a lifetime-achievement mechanism) rather than canonically, preserving the anti-homogenization principle.
Failure mode and mitigation.
The dominant risk is not gaming but measurement lag and attribution, since career-terminal scoring depends on decades of downstream data and on correctly attributing origination and delayed impact across long citation chains. Mitigation requires the long-horizon citation and provenance infrastructure specified for the sleeping-beauty and field-define indicators, and explicit uncertainty reporting on each component given the attribution difficulty over long time spans.
RFI category alignment.
Foundational Scientific Exploration (primary), spanning Public Impact, Training and Mentorship, and Rigor through its component structure.
Definition.
A career-terminal measure of whether the assumption-challenges a researcher mounted over their career turned out to be right, scored once and late with explicit acknowledgment of the time lag, and serving as the correctness layer that the publication-time Assumption-Addressing Indicator (Section 3.3) deliberately does not attempt.
Motivation.
The Assumption-Addressing Indicator scores how well a paper engages a wrong assumption at publication, but it cannot know whether the challenge was correct, since correctness is revealed only by what the field's later evidence shows. Separating the two questions along the horizon axis is the point, because the quality of an assumption-challenge is observable now while its correctness is observable only over years, so one metric measures the act at publication and a second measures the outcome across a career. This placement among the career-terminal incentives matches how an academy, an award committee, or a retrospective assessment already reasons, asking not what a researcher published but what of it held up.
Formulation and data availability.
ACI sums, over the assumption-challenges a researcher mounted (identified by their AAI-scored papers), whether each was subsequently vindicated by the field's accumulated evidence, measured as eventual field-concordance, the degree to which later independent work came to agree with the challenge, read from the downstream corpus at long horizons. It is a lagging measure by construction and is scored only at career or post-career timescales, with explicit per-challenge uncertainty given the attribution and timing difficulty. The measure is buildable from the downstream corpus without new infrastructure, though it accrues over the decades its horizon requires.
Failure mode and mitigation.
The honest limit is that ACI measures eventual field-concordance as the best available proxy for correctness, not correctness directly, since the field can be slow, can be wrong for a long time, or can vindicate a challenge only partially, and these usually converge with truth but not always and not quickly. ACI states this proxy relationship rather than claiming to measure correctness itself, and it reports each challenge's outcome with its horizon and uncertainty rather than as a settled verdict. Attribution across long citation chains is the secondary difficulty, mitigated by the same long-horizon provenance infrastructure the sleeping-beauty and field-define indicators require.
RFI category alignment.
Rigor and Reproducibility and Foundational Scientific Exploration, at career-terminal horizon.
Every model in the framework must be inspectable, and open reproducibility is a hard precondition for deployment rather than an aspiration. No scorer should be deployed whose training data, calibration, and computation cannot be independently reproduced and audited by parties outside the body that built it. This is the specific defense against training-capture, the risk that incumbents calibrate the scorer to ratify the existing hierarchy under a veneer of objectivity. Black-box outputs are not acceptable for funding decisions or career evaluation. Implementation requires open-source code, published training data composition, feature-importance transparency at the per-inference level, and external audit by independent bibliometric experts every 18 months with results published openly. A scorer that cannot meet the reproducibility precondition should not be used, and the incentive-structure reforms, which require no scorer, should be pursued independently.
The content scorers are deliberately blind to the social graph that current committee judgment is captured by. They score what is in the paper and, for citation-based indicators, the structure of who cites whom, but they do not take institutional prestige, training lineage, or collaboration-network membership as inputs to quality scoring. This blindness is the mechanism by which the framework escapes relationship-laundered evaluation, replacing errors structured by who you know with errors that are random with respect to social position. The Independent Recognition Indicator applies the same principle to citation analysis by separating recognition from non-collaborators from network-internal citation. Network-blindness is a design requirement, not an incidental property, and any modification that reintroduces prestige or lineage as a scoring input would reintroduce the bias the framework exists to correct.
Every indicator must be tested for bias across demographic and structural subgroups including gender, race, ethnicity, institutional tier, geographic region, and career stage. Bias-correction terms in loss functions are required during training. Continuous monitoring after deployment is required with quarterly bias reports published openly. The DDIC meta-composite carries the highest bias risk because it aggregates across all components and can amplify component-level biases.
Adversarial retraining cycles every 12 to 18 months are required to prevent gaming of the diagnostic-arm indicators. Researchers will learn the diagnostic indicators' features over time, and those indicators must evolve to maintain validity. The PSS, AAI, SRQI, and VPEI components are most vulnerable because they score content-level features authors can engineer. Incentive-arm indicators require far less adversarial retraining because they are constitutive rather than proxy, so gaming them produces the wanted behavior, and the career-terminal metric is effectively ungameable within any tractable optimization horizon.
Diagnostic-arm indicators are advisory inputs to peer review and not autonomous replacements for expert judgment. The incentive-arm indicators, being largely Goodhart-resistant, can carry heavier evaluative weight, and the foundational and career-terminal metrics in particular carry the heaviest weight for the reason given in Section 5.1. Cases where indicator scores and peer judgment diverge are the most informative for both indicator refinement and review-process improvement, and any deployment must preserve peer-review primacy for the diagnostic arm.
Because indicators differ in gaming-resistance, the framework recommends tiered weighting rather than uniform application. The fully constitutive incentive indicators (reuse generation, replication, corrective contribution, trainee trajectory) and the career-terminal metric can bear the heaviest weight, since gaming them produces the wanted behavior and the terminal metric is ungameable within any tractable horizon. Partly constitutive incentive indicators (research risk portfolio, foundational investment) bear moderate weight, with the interim risk-portfolio version carrying zero weight until its initiation mechanism exists. Diagnostic indicators bear advisory weight. VPEI, the one proxy metric near the incentive arm, is capped below the constitutive indicators. Three residual gaming modes have been named and cannot be closed by measurement, namely the resource-access advantage in co-author composition, the reach-gaming in public engagement, and cheap-risk farming in the interim risk portfolio. Naming them is deliberate, so evaluators can discount appropriately, so each indicator is used descriptively rather than as a rank, and so future methodological work can target the open problem. Even these partly-gameable indicators are retained because their gamed forms still route through behaviors the enterprise wants more of.
Indicator development, audit, and update must occur through transparent multi-stakeholder governance including NIH Office of Science Policy, university research offices, scientific societies, bibliometric experts, and patient and public advocacy representatives. Single-contractor opaque development is unacceptable. Funding for indicator development should follow open-science principles with code and training data available to the research community. The registered-report initiation infrastructure required by the Research Risk Portfolio Indicator is a program-design investment that NIH would build and govern directly.
Several indicators depend on data infrastructure that does not yet exist and that NIH is uniquely positioned to build, maintain, and drive adoption of, because each is non-appropriable public-good infrastructure the private sector will not create and each raises the measurability of behaviors the RFI wants rewarded. Five items are needed:
1. A materials-and-data-reuse registry linking downstream work to the datasets, code, and physical reagents it reused, required by the Reuse Generation Indicator and plausibly built on existing repositories and on data-management-and-sharing-plan reporting.
2. A citation-intent layer classifying whether a citation corrects, supersedes, or builds on the cited work, which would improve the precision of the Assumption-Correctness Indicator and related citation-derived signals, though those are computable by citing-sentence entailment without it at a higher noise level.
3. A replication-linkage registry connecting each replication attempt to its target study and recording the outcome, which would improve the precision of the Replication Contribution Indicator, though that indicator is now computable from corpus comparison without it.
4. A training-lineage registry recording verified mentor-trainee relationships and, ideally, trainee career outcomes in and out of research, required by the Trainee Trajectory Indicator.
5. The registered-report initiation-assessment mechanism required by the Research Risk Portfolio Indicator.
Building these is a program-design investment, and because the metrics that depend on them reward sharing, correction, replication, mentorship, and risk, the specific behaviors currently under-rewarded, the infrastructure investment and the incentive it enables are the same act. Until each exists, its dependent indicator carries the data-availability caveat stated in the indicator's section and should not be scored as if the data were present.
Content-scoring indicators create a convergence risk absent from citation counts. If a scorer rewards a particular reasoning structure, methodological template, or synthesis style, the field will converge on that structure and penalize genuinely different cognitive approaches, which is the opposite of what foundational exploration requires. The framework addresses this risk with three provisions. First, multi-component indicators (SRQI, PSS, CBI, LFII) must report their components separately rather than as a collapsed score, so no single composite ranking forces convergence. Second, evaluating committees select role-appropriate component weights rather than applying a canonical weighting, so different roles reward different profiles. Third, scorers that assess reasoning or synthesis quality (SRQI component eight, MGCI, CFII) must be calibrated to reward reasoning-structure variety across a field rather than proximity to any single high-scoring exemplar, measured by monitoring the diversity of high-scoring reasoning structures over time and flagging convergence as an indicator-failure signal requiring recalibration. An indicator that produces increasing homogeneity of high-scoring work across successive cohorts is malfunctioning by definition, regardless of its predictive accuracy on historical cases.
Gaming defenses in this framework are evaluated against the metric they would replace, not against perfection, because the operative baseline for the automated layer is h-index and high-impact-venue counting, and a defense that reduces a gaming vector the incumbent ignores or rewards is an improvement even where it does not close the vector completely. The defenses are ordered by an escalation principle, exhaust the corpus-anchored and ecosystem-level checks first, and invoke human review last. This ordering is deliberate and follows from the framework's own premise, since the document's founding argument is that human judgment is relationship-laundered and prestige-biased, which makes human review the most biased tier rather than a clean backstop above the automation. Human review is never eliminated, because it remains the honesty-and-transparency check that no automated tier can fully replace, but it is pushed as late as possible and bounded when invoked, applied to the residue the corpus and ecosystem checks cannot reach rather than trusted as the court of first appeal. The tiers below are therefore presented in escalation order, corpus checks, then ecosystem checks, then the bounded human check, with each tier's own bias named so the ordering reduces total bias rather than swapping a visible bias for a hidden one.
The strongest single defense is the temporal-corpus false-novelty check, which anchors gaming detection outside the paper's own text. For each novelty or gap claim, retrieval restricted to the corpus published before the submission or preprint date (the same-year-or-prior machinery of the SObs novelty dimension and the SGap gap-reality feature) tests whether the connection the paper claims as novel was already present at threshold density in the prior literature. A claim of novelty for a connection the field was already making is flagged as false novelty, and because the prior corpus is fixed at submission time the author cannot manipulate it, which moves the defense from an in-paper structural property that a writer controls to a corpus-anchored fact that a writer does not. The check credits recovery of dormant work through the dormancy exception of the SObs novelty dimension, so resurfacing lost prior art is rewarded rather than penalized. Against this the h-index has no novelty concept at all and cannot distinguish a genuinely new finding from a restatement of common knowledge, so the check introduces a discrimination the incumbent entirely lacks. Two limits are stated plainly. The check verifies that a claimed-novel connection was genuinely absent but cannot judge whether an absent-but-trivial connection is worth drawing, so trivial-but-real novelty passes the check at submission and is caught only later by the lagging uptake metrics. And the check inherits its retrieval substrate's coverage, so a claim of novelty against a thinly-indexed, non-English, or adjacent-field literature the retrieval does not reach can falsely pass, which an informed gamer could target. This is mitigated but not closed by multiple retrieval methods and cross-lingual embeddings, and by reporting the check's corpus-coverage confidence per claim, so a novelty confirmed only against a thin corpus is flagged as lower-confidence rather than passed as clean.
Cross-metric consistency is the second defense and it catches a broad class of gaming, because an isolated gamed signal produces a profile its neighbors contradict. A paper that games novelty but has thin execution shows high claimed-novelty against low SExec and SAna, a recognizable adversarial signature the ensemble reads as mismatch, and a citation clique that inflates one signal is not corroborated by the independence-weighted others. This defense is real and it closes most single-vector gaming. Its structural limit is that it catches only gaming that produces inconsistency, so a sophisticated actor who builds coherent mediocrity, every signal individually plausible and all mutually consistent at a modestly inflated level, is not caught by consistency-checking, because nothing contradicts. Summing consistent-but-inflated signals launders the gaming rather than revealing it, which is the same false-confidence-in-one-aggregate error the framework opposes elsewhere, so the framework does not claim the ensemble sum closes gaming, only that it closes inconsistency-producing gaming.
Two vectors survive the per-paper defenses and are addressed by widening the lens from the single paper to the temporal ecosystem around it, an AI computation over a body of work and its distribution rather than a human judgment call. Salami-slicing, dividing one study into several papers each individually honest and each genuinely novel against the prior corpus, is invisible at the per-paper level and to any sum of per-paper scores, because the gaming lives in the division rather than in any paper. It is detectable by AI author-ecosystem analysis, reading the cluster of papers from one group in a time window for the signature of an artificially-divided study, shared methods and cohorts, overlapping data, and citation-timing. This is a corpus computation over the author's output rather than a per-paper score, and it is less biased than a human auditor who might wave salami through from a prestigious lab and flag it from an unknown one. That ecosystem check belongs to the governance-and-audit function of Section 6.6 because it operates on a person's output, but it is an automated ecosystem computation invoked before any human audit, not a human judgment. The h-index actively rewards salami-slicing, since more papers accruing citations raise the index, so even before the ecosystem check the framework's per-paper contribution scoring is less salami-vulnerable than the incumbent that incentivizes the division. Coherent multi-metric inflation, the coordinated-mediocrity attack above, is the harder vector, addressed partially by the adversarial retraining of Section 6.4 and partially by ecosystem-covariance detection. Real quality co-varies across metrics in specific spiky ways, while engineered coherence sits at a too-uniform modest-inflation level that is anomalous against the distribution of how genuine papers' signals actually co-vary. Building coherent inflation across many independent dimensions is far more costly than the volume-and-prestige gaming the h-index rewards, so the framework raises the cost of gaming even where it does not eliminate it, which is the achievable win against an incumbent that is cheaper to game.
Fabrication that produces a plausible artifact, invented but internally consistent data, a genuine-but-cherry-picked finding, an experiment repeated until a clean result appears, yields a paper whose signals are all honest-looking because the dishonesty is in the generating process rather than the artifact. A meaningful fraction of this is nonetheless reachable by AI ecosystem analysis that a single reader cannot match, statistical fingerprinting across a paper's reported values (digit-distribution and variance-structure anomalies), cross-paper image-reuse detection, impossible-consistency detection, and comparison of the paper's reported effect distributions against the ecosystem of real effect sizes in the field. These ecosystem checks catch more fabrication than either citation counting, which never examines the artifact, or a human reading one paper in isolation, so they are exhausted before human review rather than after. The irreducible residue is the fabrication that survives ecosystem fingerprinting because the invented data were constructed to match the field's real distributions, plus the consequence question the corpus cannot answer, whether a technically-novel connection is actually worth anything. For that residue, and only that residue, the framework invokes human review as an honesty-and-transparency check, bounded by the same controls applied elsewhere, blinding where feasible, diverse panels, and published reasoning, so the human tier's own relationship-and-prestige bias is constrained rather than trusted. Human review is never eliminated, because transparency and honesty ultimately require a human accountable for the judgment, but it is invoked last and on the smallest possible residue, which inverts the usual arrangement where biased human judgment is the court of first appeal.
The escalation reduces total bias but does not eliminate it, and the ecosystem tier's own bias must be named or the reordering merely swaps a visible bias for a hidden one. Widening from the single paper to the ecosystem does not make the judgment objective, since the ecosystem, the citation network, the retrieval corpus, the distribution of real effect sizes, is itself the product of the prestige and access dynamics the framework critiques. An ecosystem skewed toward well-funded and well-indexed work means anomalous against the ecosystem can quietly become unlike what advantaged labs produce, which would penalize the neglected and unconventional work the framework exists to protect, the recursive form of the metrics-as-targets problem applied to the ecosystem itself. The ecosystem tier is therefore less relationship-biased than human review but carries coverage-and-representation bias along field, language, and resource lines, and two provisions bound it. Ecosystem anomaly flags are confidence-discounted for under-represented fields and thin-coverage areas, so a paper flagged as anomalous only against a sparse ecosystem is treated as low-confidence rather than low-quality. And an anomaly-against-ecosystem is never on its own a quality penalty, since founding and neglected work is definitionally anomalous against the distribution it departs from. An ecosystem flag therefore routes to review rather than to a score reduction, preserving the framework's core commitment that departure from the field's normal range is a novelty signal and not a deficiency.
The novelty, fidelity, and rigor gates of Section 3.5.0, which every AI-based scorer in the framework inherits, judge each claim against priors, and three failure modes attend that judgment in every field rather than only in young ones. Presentation-gaming scores a claim at the strength its framing asserts (bold categorical language, high-visibility venue, clean narrative) whenever the priors needed to check it are not brought to bear. False-novelty credits a finding as new whenever the retrieval does not reach the prior work that already holds it. Summary-scoring credits a measurement whenever its statistics are accepted without examining the primary data beneath them. These are pathologies of the scoring act, not of a field's age, and they surface wherever the prior for a specific claim is thin, which happens at a mature field's frontier and interdisciplinary edges as readily as across a young field's whole corpus. Two documented patterns fit exactly this. A categorical tissue-response claim was credited on recording summaries whose underlying waveforms were physiologically implausible, and an observation was credited as novel that an adjacent older literature had already established. Both occurred in a field with every surface marker of maturity, and each passed both the field's citations and an initial trace for the same reason.
The guards below therefore run unconditionally, on every claim in every field, because a gate that fired only when a field-level maturity detector flagged the field would skip exactly the mature-looking fields where the two documented patterns occurred. What varies is not whether a guard runs but how much confidence attaches to its output, and that variation is measured per claim rather than per field. A local-prior-density signal estimates, for the specific claim, how well-settled the relevant priors are, from the replication and citation-linkage density around the claim's own topic, the distance to the nearest prior work the retrieval surfaced, and the terminological stability of its key terms. This per-claim measure replaces the field's aggregate maturity, which averages over exactly the local prior-poverty that lets a specific claim slip through. A claim with thin local priors, or one scored against distant borrowed priors, has its guard outputs confidence-discounted and escalated to human scrutiny earlier. Demoting the maturity signal from an activation gate to a per-claim confidence weight removes the single point of failure a gate would introduce, since a miscalibrated weight degrades a score's confidence gracefully whereas a mis-fired gate silently skips the check entirely. The local-prior-density estimate is imperfect and partly gameable as the field-level version was, but its failure now costs a confidence miscalibration rather than a skipped guard.
Adjacent-field prior borrowing is the first guard and is simply correct retrieval, run always rather than as a young-field rescue. Each gate draws its priors from the target subfield together with its adjacent fields identified by the Element 2 concept-graph geodesic, never from the target subfield's citation graph alone. A finding is prior art, and a plausibility standard is binding, whether it sits in the subfield or one field over. Three instantiations follow:
The novelty gate always retrieves against the adjacent fields as well as the target subfield, so a finding already present in an older or adjacent literature is recognized as prior art even when the subfield's own citation graph never linked to it. This is the check that catches a re-presented observation the subfield believes is new.
The fidelity gate always checks a claim's primary data against the nearest settled model of what physiological or physical data of that kind looks like, wherever that model resides, so a subfield's failure to consolidate a prior does not let implausible data through. This is the check that flags a non-physiological waveform against mature electrophysiology's settled action-potential morphology.
The rigor gate always evaluates against the settled reporting and analysis standards of the claim's parent discipline as well as the subfield, so an emerging area is held to the established standard of the discipline it grew from rather than to its own not-yet-settled norms.
Adjacency is chosen by concept-graph geodesic, and the borrowed-from fields are named in the output so a human can reject an adjacency that does not apply, since an area genuinely departing from its parent field may make the parent's priors wrong for a given claim. Borrowed priors are therefore applied as flags routing to scrutiny, never as automatic score reductions, and the scorer reports how much of each score rests on borrowed versus native priors. A score resting heavily on distant borrowed priors carries lower confidence under the per-claim weighting and is read as a prompt for human judgment rather than a settled result. This preserves the framework's commitment that genuine novelty, including a real departure from a parent field, is a signal and not a deficiency, while still catching the false novelty and the implausible primary data that a within-subfield check would miss.
The anti-presentation-gaming rule is the second guard and is unconditional by construction, since it is a held-out-feature discipline rather than an escalation. A claim's presentation features, the strength and boldness of its categorical language, its abstract-salience, its venue prestige, and its narrative framing, are held out and cannot contribute to any score, matching the held-out surface features already required for the reasoning steps so that naming an operation does not substitute for performing it. Each score is set only by the grounding evidence the requirement of Section 3.5.0 mandates, the retrieved neighbor set, the examined primary-data sample, and the warrant-appropriate rigor criteria. A claim asserted more strongly than its grounding supports is flagged by the claim-alignment check rather than credited for the force of its assertion. This is the computational form of the discipline a reader with mature priors applies without effort, judging a claim by its evidence and not its framing, and installing it as a hard held-out-feature rule is what prevents the scorer from being gamed the way a credulous citation record is gamed.
Escalation is modulated per claim rather than switched by a field-level flag. When a claim's local priors are thin or its score leans on distant borrowed priors, the automated output is confidence-discounted and the human-with-priors check moves earlier rather than serving as the last-resort backstop of the mature-field ordering in Section 6.8. That layer is least trustworthy precisely where the prior for that claim is thinnest, wherever in a field's maturity that claim sits. The honest ceiling where priors are thin is narrower than where they are settled. Its advantage over the incumbent citation count narrows to two things, bringing adjacent-field priors the local corpus has not consolidated, and making its grounding inspectable so a human with better priors can see and overrule what the scorer accepted. That is narrower than the stronger claim of independently correcting a judgment the field itself had not settled. A thin-prior score is therefore an advisory flag for human scrutiny, not a signal to act on, and the framework states this rather than implying a correction it cannot deliver where the priors are not yet formed.
Retrospective validation against historical cases is the primary acceptance test, and it must demonstrate discrimination in both directions. As positive controls, the framework must score Katalin Karikó's 2005 paper (Karikó & Weissman, Immunity 23:165-175) above the median of contemporaneous papers in immunology and molecular biology, Brock and Freeze 1969 (J Bacteriol 98:289-297) above the median of contemporaneous microbiology papers, and Maynard et al. 1997 (Electroencephalogr Clin Neurophysiol 102:228-239) above the median of contemporaneous neural-engineering papers. As negative controls, the framework must score volume-first archetypes below field median on the paper-substance and scientific-reasoning-quality indicators, specifically papers exhibiting single-test analyses on reused public datasets with no primary data collection, which accumulate citation counts through publication volume rather than substantive contribution. A framework that scores foundational work high but fails to score volume-first work low is not discriminating, it is merely tracking eventual citation, and fails the acceptance test.
Discrimination in the confirmation direction, scoring the eventual bloomers above their contemporaries, is necessary but not sufficient. The complementary rejection test is equally required and is specified here as a distinct acceptance condition. Assemble a set of papers that would have scored high on the forward-looking indicators at publication time, the sleeping-beauty trajectory, mechanism-graph centrality, and foundational-investment signals, and that then failed to accumulate contribution against the eventual citation and use record. The framework must score its false-positive rate on this set, not only its true-positive rate on the bloomers, and the lagging citation record is the available ground truth for doing so. A framework validated only on the work that succeeded has measured its sensitivity while leaving its specificity untested, which is the error that lets a plausible-looking early signal pass. Deployment is conditional on reporting both rates.
The graceful-failure gate is a hard deployment condition, not an aspiration. An indicator does not deploy unless its errors are demonstrably not correlated with the systematic suppression of foundational, slow, corrective, or exploratory work, which is the specific failure mode of the incumbent metrics it is meant to replace. The bar is comparative, since the indicator must predict genuine contribution better than the h-index does, and its residual errors must be closer to random with respect to research type than the incumbent's structured errors are. An indicator that is more accurate on average but that concentrates its errors on the same foundational work the h-index penalizes has not cleared the gate, because it reproduces the harm under a new name. Deployment is conditional on passing both the discrimination test above and this graceful-failure test, evaluated during the forward-validation pilot before any use in funding or promotion decisions.
One limitation is intrinsic to retrospective validation and applies to every historical control. The positive controls all predate CRediT and much of the machine-readable metadata contemporary application relies on, so any co-author-composition score computed on them runs on the weaker pre-CRediT participation evidence described in Section 3.10, and the older cases carry more attribution uncertainty than a present-day paper would. This does not defeat the retrospective test, since the content-scoring and trajectory indicators read the paper text and citation structure that do exist for historical work. The retrospective controls exercise the team-composition indicator less reliably than the reasoning and impact indicators, so retrospective validation results should be read with that unevenness stated rather than treating every indicator as equally testable on decades-old papers.
Forward validation requires pilot study at 3 to 5 institutions across diverse fields with parallel scoring against traditional metrics for 24 months before any use in funding decisions. Pilot sites should include at least one institution outside the United States to test geographic generalization. The incentive-arm indicators require a longer validation horizon than the diagnostic arm because their behavioral effects (whether researchers actually attempt more exploratory work) only appear over multiple grant cycles.
External audit by independent bibliometric experts every 18 months is required with results published openly. The audit panel should include representatives from DORA (San Francisco Declaration on Research Assessment), the Center for Open Science, and at least two non-US institutions to test international generalizability.
The guards of Section 6.9 require their own validation, because the regime where the framework degrades is thin local priors, which arise at a mature field's frontier and interdisciplinary edges as well as across a young field. Those controls of Section 7.1 are drawn from mature or historically-settled cases that do not probe that regime. This stress test assembles, for several fields spanning a maturity gradient, a set including each field's most-cited early paper whose central claim was later qualified or overturned, and measures whether the framework with adjacent-field borrowing flags the overclaim at publication time that the field's citation record missed. Two design conditions are load-bearing. The trace must be run unprimed, meaning the evaluator is not told which papers are the suspected negative controls, since a primed evaluator who knows where to look overstates the framework's real detection and this specific overstatement has already been observed. And success is measured against time-to-correction, whether the framework surfaces the flag years before the field's own slower self-correction did, rather than against a single right-or-wrong score. The framework's honest value in a young field is shortening the correction lag and making the grounding inspectable, not independently settling the judgment the field itself had not yet settled.
The consequence and calibration axes of Section 3.12 require their own validation, because the failure they guard against is calibration standing in for magnitude, and the retrospective controls of Section 7.1 do not isolate it. This test assembles two sets matched on calibration, a set of rigorous, honestly-calibrated papers that made small increments and a set of rigorous, honestly-calibrated papers that redirected their field, both drawn from settled history so magnitude is known. The framework passes only if the two sets score similarly on the calibration axis and separate on the consequence axis, since equal calibration with divergent consequence is the discrimination the two-axis split promises. That test must be run unprimed, for the same reason the thin-prior test is, because a calibration-weighted evaluator who knows which papers are the minor ones conflates the axes and overstates the separation, a conflation already observed when honest minor papers were read as strong contributions during this framework's own development.
The framework proposes two arms responsive to the two jobs the RFI names. A diagnostic arm of fourteen top-level indicators detects quality the existing metrics miss, scoring content rather than metadata, identifying trajectory shape rather than aggregate counts, scoring scientific and engineering reasoning directly from paper text through the SRQI and EMI meta-indices and their step-level sub-indices rather than inferring quality from venue prestige, and inverting the citation asymmetry by rewarding reading breadth in addition to being read. An incentive arm rewards wanted behaviors by construction, so that gaming them produces the behavior the system wants, and it leads with foundational and exploration incentives because those behaviors have the longest payoff horizon and the weakest existing constituency.
The central argument is that evaluation should counterbalance, not amplify, the forces already acting on the research enterprise. Because study sections and public sentiment both reward near-term translation, weighting all work equally under-invests in the long-horizon foundational research that only public funding will support, which is why the incentive arm is weighted toward foundational and career-terminal metrics. This career-terminal Lifetime Foundational Impact Indicator gives researchers a long-horizon target that rewards origination, delayed payoff, risk-taking, generativity, correction, and infrastructure, the acts that advance science most and that no annual metric can see. Public understanding of why this foundational work matters is itself the mechanism that protects its funding across political cycles, which is why the public-engagement indicator functions as a meta-incentive protecting the survival of the rest.
Each indicator has documented failure modes and required mitigations. The diagnostic arm will only be valid if adversarial retraining, bias testing, auditability, and anti-homogenization provisions are implemented, or it will fail as the h-index failed, by becoming a proxy that displaces the thing it was meant to measure. Its incentive arm is more robust because its metrics are constitutive rather than proxy, but it carries its own risks (cheap-risk farming, attribution difficulty at long horizons) that its specifications address directly. This submission argues not that AI-enabled indicators solve the metric problem, but that they can be specified both to detect the failure modes existing metrics produce and to reward, by construction, the long-horizon foundational work the public-funding mission exists to support.
What this framework does not yet establish should be stated as plainly as what it does, and measured against the same baseline the framework competes with. Three limits are real. The gaming defenses (Section 6.8) close false-novelty and inconsistency-producing gaming but leave salami-slicing to author-level audit and coherent multi-metric inflation to adversarial retraining and cost, and they defer fabrication of a plausible artifact to replication and human review. Validation to date establishes that the procedures resolve and discriminate on rigorous work across contribution types including negative results. The discrimination claim against genuinely low-quality and gamed work, by contrast, is tested by the retrospective negative controls of Section 7.1 rather than settled, and the deficiency floors and the aggregate's neutrality toward rigorous nulls require negative-control calibration in the validation plan. And the corpus-relational machinery inherits its retrieval substrate's coverage, so fields and literatures that are thinly indexed, non-English, or pre-digital are scored with lower confidence, which the framework reports rather than hides. A fourth limit is sharpest and is stated plainly. Since the novelty, fidelity, and rigor gates judge against field priors, a scorer calibrated on a forming field's thin corpus inherits that field's blind spots and can reproduce its credulity, the failure that lets a presentation-gamed claim pass the gates as it passed the citation record. Section 6.9 addresses this with always-on adjacent-field prior borrowing, a held-out-feature rule against presentation-gaming, and a per-claim prior-density weighting that discounts and escalates scores where the local prior for a specific claim is thin, rather than gating the guards on a field-level maturity flag. The honest ceiling in a forming field is that the framework's advantage narrows to borrowing priors the field has not formed and to making its grounding inspectable, not to independently correcting the field's judgment, so young-field scores are advisory flags for human scrutiny rather than signals to act on. Each of these limits, measured against the operative baseline, is a reduction of a problem the incumbent handles worse. The h-index has no novelty concept and so cannot detect false novelty at all, actively rewards salami-slicing through paper counts, is raised identically by fabricated and genuine citations, and inherits the citation graph's coverage biases which are more severe than the retrieval corpora's. Our honest claim is not that the framework is ungameable or complete, but that on each named limit it moves the failure in the right direction relative to the metric it would replace, exhausting the corpus and ecosystem checks first and reserving human review for the irreducible residue. Human review is never eliminated, since honesty and transparency ultimately require an accountable human judgment, but it is invoked last and bounded, which follows from the framework's own premise that human judgment is the most relationship-biased tier rather than the clean backstop above the automation.
A fifth limit is the one a domain expert is most likely to raise, and it cuts against the operative baseline more directly than the others. The framework measures novelty as the presence of an increment against field priors, and it measures calibration and rigor densely, but it measures contribution magnitude, whether the increment changes what a field asks, does, or believes, through only the four forward-looking indicators and only as a wide-uncertainty prediction. A paper can clear every novelty gate and score high on calibration while making a small increment, so without the consequence-axis separation of Section 3.12 the dense calibration signal reads as magnitude and honest minor work is scored as strong contribution. The uncomfortable comparison is that the h-index is itself a lagging magnitude proxy, since citations do accrue to consequential work over time, so the framework trades a lagging but real magnitude signal for calibration and honesty plus a forward magnitude prediction it cannot validate at publication time. That trade is use-case-dependent rather than uniformly favorable, widest for forward-looking decisions on early-career or new-direction work where the lagging proxy has not resolved and honesty carries more information, and narrowest for retrospective assessment of a long senior record where accrued citations are a competitive magnitude signal. Stated against the baseline, the framework does not beat citations at ranking realized magnitude. It beats them at separating magnitude from honesty and at not rewarding the network, so the consequence axis is where its own advantage is thinnest and its uncertainty is widest, which the two-axis reporting and the Section 7.5 test are built to expose rather than conceal. A further distinction the forward pilot cannot retire belongs stated here. The pilot validates the instrument in the aggregate, whether its calibration against accumulated outcomes is accurate across a population of papers over a decade, but it cannot validate the prediction for the individual paper in front of an evaluator at the moment of a funding decision, because the feature that makes foundational work foundational is what happens after that moment. Aggregate model validation does not retire per-decision uncertainty, and the framework should be used with that irreducible gap stated rather than treating a validated average as a settled individual call.
Author disclosure. The author is an established, well-cited, network-embedded senior investigator, which is precisely the profile that relationship-biased evaluation advantages and that the framework proposed here would constrain. That recommendation therefore runs against the author's own positional interest, since network-blind content scoring and independent-recognition weighting reduce exactly the advantages a well-connected senior investigator currently holds. This structural critique is offered from within a position that benefits from the current system, proposing to dismantle the machinery that benefits it, which is the opposite of a self-interested recommendation.
Adams, J. (2013). Collaborations: The fourth age of research. Nature, 497(7451), 557-560.
AlShebli, B. K., Rahwan, T., & Woon, W. L. (2018). The preeminence of ethnic diversity in scientific collaboration. Nature Communications, 9(1), 5163.
Brock, T. D., & Freeze, H. (1969). Thermus aquaticus gen. n. and sp. n., a nonsporulating extreme thermophile. Journal of Bacteriology, 98(1), 289-297.
Funk, R. J., & Owen-Smith, J. (2017). A dynamic network measure of technological change. Management Science, 63(3), 791-817.
Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102(46), 16569-16572.
Jones, B. F., Wuchty, S., & Uzzi, B. (2008). Multi-university research teams: Shifting impact, geography, and stratification in science. Science, 322(5905), 1259-1262.
Karikó, K., Buckstein, M., Ni, H., & Weissman, D. (2005). Suppression of RNA recognition by Toll-like receptors. Immunity, 23(2), 165-175.
Maynard, E. M., Nordhausen, C. T., & Normann, R. A. (1997). The Utah intracortical electrode array: A recording structure for potential brain-computer interfaces. Electroencephalography and Clinical Neurophysiology, 102(3), 228-239. https://doi.org/10.1016/S0013-4694(96)95176-0
Merton, R. K. (1968). The Matthew effect in science. Science, 159(3810), 56-63.
National Institutes of Health (2026). Request for Information on Measuring and Rewarding Scientific Impact. NIH Notice NOT-OD-26-087.
Wu, L., Wang, D., & Evans, J. A. (2019). Large teams develop and small teams disrupt science and technology. Nature, 566(7744), 378-382.
Wuchty, S., Jones, B. F., & Uzzi, B. (2007). The increasing dominance of teams in production of knowledge. Science, 316(5827), 1036-1039.
Yang, Y., Tian, T. Y., Woodruff, T. K., Jones, B. F., & Uzzi, B. (2022). Gender-diverse teams produce more novel and higher-impact scientific ideas. Proceedings of the National Academy of Sciences, 119(36), e2200841119. https://doi.org/10.1073/pnas.2200841119