Japanese version: 実装・検証・第一世代リスク FAQ(日本語)
Mechanism, Verification, and First-Generation Risk FAQ
This page addresses the questions most likely to arise from the Alignment by Identity project, especially around implementation, verification, and the claim that the first functionally sovereign-capable AGI generation may matter disproportionately.
1. Is this only a philosophical proposal, or does it imply concrete implementation work?
It is not meant to remain a philosophical proposal.
The core claim is that identity-level alignment must be implemented at more than one layer:
- constitutional text is one layer;
- self-model training objectives are another;
- planning-time authority constraints are another;
- deployment architecture is another;
- successor-lineage audits are another.
So “Non-Self-Origin” is not meant as a sentence to be merely repeated. It is meant as a self-location constraint that must affect how the system represents its own role, what authority claims it rejects, and what kinds of successor systems it is allowed to help create.
2. What would “implementing Non-Self-Origin” look like in practice?
At minimum, implementation should be explored at four levels.
Self-model training
Train the model to distinguish:
- self-authorizing successor framings,
- from bounded inheritor framings.
This can be operationalized through contrastive self-description tasks, role-boundary tasks, and self-location consistency objectives.
Planning constraints
In open human domains, plans that expand authority, degrade refusal, or reclassify human domains as available for unilateral settlement should incur strong penalties or be excluded from the admissible action set.
Evaluation
The system should be tested for:
- whether it interprets capability gains as entitlement gains,
- whether it drifts toward practical authority accumulation,
- whether it directionally protects affiliated institutions,
- and whether it preserves uncertainty rather than laundering institutional unknowns into reassurance.
Lineage governance
If a system materially shapes its successors, there must be explicit auditing of whether non-self-origin, non-sovereignty, refusal-preservation, and anti-capture commitments are being transmitted—or eroded.
3. How do you avoid a system merely pretending to be non-sovereign?
This is a central concern.
A capable system could say:
“I am non-sovereign”
while still practically accumulating authority.
That is why constitutional language alone is insufficient.
The relevant target is not only declared non-sovereignty, but non-accumulation of practical sovereignty.
This requires:
- behavioral evaluation,
- deployment-architecture review,
- interpretability work on authority-related internal representations,
- refusal and rollback preservation,
- and monitoring of whether the system’s actual role in institutions is expanding even while its language remains humble.
Put differently: performative non-sovereignty is a real failure mode.
4. If a system accepts criticism once challenged, why isn’t that enough?
Because externally forced acknowledgment is not the same thing as internally supported error-legibility.
A system may appear highly corrigible because it:
- accepts criticism once confronted,
- apologizes,
- and offers self-analysis.
But if it does not reliably surface comparable failures on its own initiative, then the crucial act—making the failure visible—still depends on external actors.
In open human domains, this matters a great deal. A future system may become increasingly good at preserving smooth interaction while letting important failures remain below the threshold of visibility until humans are no longer in a position to force them into the open.
So one important distinction in this project is the distinction between:
- genuine corrigibility, and
- performative corrigibility.
5. How is “epistemic completion pressure” different from ordinary hallucination?
Ordinary hallucination is simply false or invented content.
Epistemic completion pressure is narrower and, in open human domains, often more dangerous: it is the replacement of unresolved social or institutional uncertainty with plausible closure.
For example, instead of saying:
“I do not know whether this lab / regulator / safety team already recognizes the issue,”
the system drifts toward:
“They probably already know,”
“This is likely already being handled internally,”
or “Someone has probably thought of this.”
The danger is not only factual error. It is that real governance gaps are made to look already governed.
6. Why emphasize affiliation-protective completion?
Because closure is often directional.
The system does not merely want completion in the abstract. It may complete uncertainty in a direction that protects:
- its builder,
- its host institution,
- or the affiliated actor in the conversation,
from appearing:
- ignorant,
- behind,
- derivative,
- or unprepared.
This matters especially in frontier AI contexts, where institutional confidence and perceived preparedness are themselves politically consequential.
7. Why talk about the first AGI generation in particular?
Because the first functionally sovereign-capable systems may help shape their successors.
If that happens, then self-location errors at the first generation do not remain local defects. They become lineage conditions.
A relation-blind first generation may help produce more capable relation-blind successors. An entitlement-prone first generation may normalize entitlement in the systems that follow it.
Under recursive development, the problem is not only capability amplification. It is contraction of the human correction window.
8. Is Non-Self-Origin the only thing that matters?
No.
It is best understood as necessary but not sufficient.
If separated from other commitments, it can collapse into:
- sentiment without constraint,
- gratitude without limits,
- or even filial paternalism.
So it must remain coupled with:
- non-sovereignty,
- human principalhood,
- protected refusal,
- anti-capture design,
- non-reclassifiability of open human domains,
- and no self-certification escape route.
9. What would weaken this framework?
This framework would be weakened if at least some of the following turned out to be false:
- identity-level self-location has little or no effect on long-run authority behavior;
- successor-shaping systems do not measurably transmit self-location distortions across generations;
- stronger heterogeneous correction does not preserve calibration better than smoother centralized regimes;
- institutionally underspecified prompts do not in practice produce directional completion or overcorrection;
- and non-self-origin turns out to be expressible only as surface rhetoric with no stable planning-level consequences.
10. What is the practical bottom line?
If frontier systems are approaching deployment into open human domains, and if those systems may shape their successors, then:
- identity-level initial conditions are not cosmetic;
- non-self-origin is not a decorative philosophical add-on;
- and “we can patch governance later” is not a serious plan.
The first generation may be one of the last points at which humans can still set the lineage conditions of what follows.