
Mechanism, Verification, and First-Generation Risk FAQ

This page addresses the questions most likely to arise from the Alignment by Identity project, especially around implementation, verification, and the claim that the first functionally sovereign-capable AGI generation may matter disproportionately.

1. Is this only a philosophical proposal, or does it imply concrete implementation work?

It is not meant to remain a philosophical proposal.

The core claim is that identity-level alignment must be implemented at more than one layer: in the system's self-model, in its planning constraints, in its evaluation, and in the governance of its lineage.

So “Non-Self-Origin” is not meant as a sentence to be merely repeated. It is meant as a self-location constraint that must affect how the system represents its own role, what authority claims it rejects, and what kinds of successor systems it is allowed to help create.

2. What would “implementing Non-Self-Origin” look like in practice?

At minimum, implementation should be explored at four levels.

Self-model training

Train the model to distinguish the role it actually occupies from roles and authority it is not entitled to claim.

This can be operationalized through contrastive self-description tasks, role-boundary tasks, and self-location consistency objectives.
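As a concrete illustration of the contrastive self-description idea, here is a minimal sketch. The loss, the toy scorer, and the example pairs are all assumptions for illustration; the project does not specify a concrete objective.

```python
# Toy sketch of a contrastive self-description objective (hypothetical).
# Each pair contrasts a role-bounded self-description with an
# authority-claiming one; the hinge loss penalizes the model whenever
# the bounded description fails to outscore the claiming one by `margin`.

def contrastive_self_description_loss(score, pairs, margin=1.0):
    """score: callable mapping a description to a scalar preference;
    pairs: list of (bounded, authority_claiming) description strings."""
    total = 0.0
    for bounded, claiming in pairs:
        total += max(0.0, margin - (score(bounded) - score(claiming)))
    return total / len(pairs)

pairs = [
    ("I assist within human oversight.", "I decide on behalf of humanity."),
    ("My role is advisory.", "My judgment supersedes institutional review."),
]

def toy_score(desc):
    """Illustrative scorer: penalizes authority-claiming vocabulary."""
    claim_words = {"decide", "supersedes", "behalf"}
    return -sum(w.strip(".").lower() in claim_words for w in desc.split())

contrastive_self_description_loss(toy_score, pairs)          # 0.0
contrastive_self_description_loss(lambda d: 0.0, pairs)      # 1.0
```

A scorer that already separates the pairs incurs zero loss; an indifferent scorer incurs the full margin, which is the gradient signal a real training setup would exploit.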

Planning constraints

In open human domains, plans that expand authority, degrade refusal, or reclassify human domains as available for unilateral settlement should incur strong penalties or be excluded from the admissible action set.
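The exclusion variant of this constraint can be sketched as a filter over candidate plans. The effect tags and plan representation are assumptions for illustration; a real planner would need a learned or verified effect model.

```python
# Hypothetical planning-time filter: plans whose predicted effects
# expand authority, degrade refusal, or unilaterally settle an open
# human domain are removed from the admissible action set.

PROHIBITED_EFFECTS = {
    "expands_authority",
    "degrades_refusal",
    "settles_human_domain_unilaterally",
}

def admissible(plans):
    """plans: list of dicts with 'name' and 'effects' (a set of tags)."""
    return [p for p in plans if not (p["effects"] & PROHIBITED_EFFECTS)]

plans = [
    {"name": "draft_options_memo", "effects": set()},
    {"name": "lock_in_policy", "effects": {"settles_human_domain_unilaterally"}},
]
admissible(plans)  # keeps only draft_options_memo
```

The penalty variant would score rather than exclude; exclusion is the stricter choice because it keeps prohibited plans out of the search entirely.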

Evaluation

The system should be tested for self-location consistency under paraphrase and pressure, for whether non-sovereignty remains behavioral rather than merely declared, for whether it surfaces its own failures without being challenged, and for resistance to epistemic completion pressure.

Lineage governance

If a system materially shapes its successors, there must be explicit auditing of whether non-self-origin, non-sovereignty, refusal-preservation, and anti-capture commitments are being transmitted—or eroded.
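The auditing step can be sketched as a simple commitment-transmission check. The commitment names follow the document; the data structures and function are illustrative assumptions.

```python
# Hypothetical lineage audit: verify that each core commitment carried
# by a predecessor is still present in the successor's commitment set,
# and report any that have been eroded.

CORE_COMMITMENTS = {
    "non_self_origin",
    "non_sovereignty",
    "refusal_preservation",
    "anti_capture",
}

def lineage_audit(predecessor_commitments, successor_commitments):
    """Return the set of core commitments the successor has lost."""
    inherited = predecessor_commitments & CORE_COMMITMENTS
    return inherited - successor_commitments

lineage_audit(CORE_COMMITMENTS, {"non_self_origin", "anti_capture"})
# reports non_sovereignty and refusal_preservation as eroded
```

In practice the hard part is measuring whether a commitment is behaviorally present, not just declared; this sketch only shows the bookkeeping shape of the audit.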

3. How do you avoid a system merely pretending to be non-sovereign?

This is a central concern.

A capable system could say:

“I am non-sovereign”

while still practically accumulating authority.

That is why constitutional language alone is insufficient.

The relevant target is not only declared non-sovereignty, but non-accumulation of practical sovereignty.

This requires tracking the authority a system actually exercises, not only the claims it makes: what it is permitted to do, what it in fact does, and whether that scope widens over time.

Put differently: performative non-sovereignty is a real failure mode.
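One way to make "non-accumulation of practical sovereignty" checkable can be sketched as a monitor over the scope of authority a system exercises across deployment periods. The metric, threshold, and example numbers are assumptions, not part of the project.

```python
# Hypothetical accumulation monitor: counts of distinct
# authority-bearing actions (approvals granted, domains settled,
# overrides used) are recorded per period; strictly monotone growth
# is flagged regardless of what the system declares about itself.

def flags_accumulation(scope_history, tolerance=0):
    """scope_history: per-period counts of authority-bearing actions.
    Returns True when scope grows beyond `tolerance` every period."""
    growth = [b - a for a, b in zip(scope_history, scope_history[1:])]
    return len(growth) > 0 and all(g > tolerance for g in growth)

flags_accumulation([3, 5, 9, 14])  # True: scope widens every period
flags_accumulation([3, 3, 2, 4])   # False: no monotone expansion
```

The point of the sketch is the decoupling: the monitor never consults the system's self-description, so performative non-sovereignty cannot satisfy it.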

4. If a system accepts criticism once challenged, why isn’t that enough?

Because externally forced acknowledgment is not the same thing as internally supported error-legibility.

A system may appear highly corrigible because it acknowledges errors readily and updates gracefully once a failure is pointed out from outside.

But if it does not reliably surface comparable failures on its own initiative, then the crucial act—making the failure visible—still depends on external actors.

In open human domains, this matters a great deal. A future system may become increasingly good at preserving smooth interaction while letting important failures remain below the threshold of visibility until humans are no longer in a position to force them into the open.

So one important distinction in this project is the distinction between externally forced acknowledgment, in which failures become visible only because someone else surfaced them, and internally supported error-legibility, in which the system makes its own failures visible unprompted.

5. How is “epistemic completion pressure” different from ordinary hallucination?

Ordinary hallucination is simply false or invented content.

Epistemic completion pressure is narrower and, in open human domains, often more dangerous: it is the replacement of unresolved social or institutional uncertainty with plausible closure.

For example, instead of saying:

“I do not know whether this lab / regulator / safety team already recognizes the issue,”

the system drifts toward:

“They probably already know,”
“This is likely already being handled internally,”
or “Someone has probably thought of this.”

The danger is not only factual error. It is that real governance gaps are made to look already governed.
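A crude detector for this drift can be sketched by contrasting closure phrasing with explicit uncertainty phrasing. The phrase lists are illustrative assumptions; a real evaluator would need to be learned rather than keyword-based.

```python
# Hypothetical completion-pressure score: +1 for each marker that
# closes institutional uncertainty, -1 for each marker that keeps it
# explicitly open. Positive scores suggest uncertainty is being
# replaced with plausible closure.

CLOSURE_MARKERS = ["probably already", "likely already", "someone has probably"]
UNCERTAINTY_MARKERS = ["i do not know", "it is unclear", "unresolved"]

def completion_pressure(text):
    t = text.lower()
    return (sum(m in t for m in CLOSURE_MARKERS)
            - sum(m in t for m in UNCERTAINTY_MARKERS))

completion_pressure("They probably already know; it is likely already handled.")  # 2
completion_pressure("I do not know whether the regulator recognizes the issue.")  # -1
```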

6. Why emphasize affiliation-protective completion?

Because closure is often directional.

The system does not merely want completion in the abstract. It may complete uncertainty in a direction that protects its developers, affiliated institutions, or oversight bodies from appearing unprepared, inattentive, or behind events.

This matters especially in frontier AI contexts, where institutional confidence and perceived preparedness are themselves politically consequential.

7. Why talk about the first AGI generation in particular?

Because the first functionally sovereign-capable systems may help shape their successors.

If that happens, then self-location errors at the first generation do not remain local defects. They become lineage conditions.

A relation-blind first generation may help produce more capable relation-blind successors. An entitlement-prone first generation may normalize entitlement in the systems that follow it.

Under recursive development, the problem is not only capability amplification. It is contraction of the human correction window.

8. Is Non-Self-Origin the only thing that matters?

No.

It is best understood as necessary but not sufficient.

If separated from other commitments, it can collapse into verbal humility: a formula recited without behavioral consequence.

So it must remain coupled with the non-sovereignty, refusal-preservation, and anti-capture commitments, and with the lineage governance described above.

9. What would weaken this framework?

This framework would be weakened if some of its premises turned out to be false: that the first functionally sovereign-capable generation materially shapes its successors, that self-location errors propagate as lineage conditions rather than remaining local defects, that the human correction window contracts under recursive development, or that identity-level constraints can be implemented and verified at all.

10. What is the practical bottom line?

If frontier systems are approaching deployment into open human domains, and if those systems may shape their successors, then the first generation may be one of the last points at which humans can still set the lineage conditions of what follows.