Natural Latents: A Mathematical Foothold on Translatability and Intersubjectivity


In their paper “Natural Latents”, John Wentworth and David Lorell propose mathematical conditions under which translation between agents is guaranteed to be possible, and they claim that these are the most general conditions under which translatability is guaranteed.

You can see their paper on arXiv or on the AI Alignment Forum, and read precursor posts here and here.

This paper has become a central part of my undergraduate thesis work, and as part of that work I present my own rendition of the math in their paper, with the aim of being maximally legible to readers with diverse mathematical backgrounds. Conceptual discussions are mostly outside of the scope of this post for now—we will focus on the definitions, theorems, and proofs.

Contents

  1. Mediators, Redunds, and Natural Latents
  2. Theorem: Mediator Determines Redund
  3. Probabilistic Generative Models: Observables and Latents
  4. Theorem: Guaranteed Translatability
  5. Appendix: Elementary Information Theory

If you are not intimately familiar with entropy, mutual information, and total correlation, I recommend reading the Appendix before proceeding. The main text assumes familiarity with these concepts.

Mediators, Redunds, and Natural Latents

Definition: Mediator

We say that a variable $Z$ mediates over a sequence of discrete random variables $X = (X_1, \ldots, X_n)$ when $X_1, \ldots, X_n$ are approximately conditionally independent given $Z$.

Intuitively, this means that once we know $Z$, learning the value of any $X_i$ doesn’t significantly change our uncertainty about the other $X_j$s. All the shared information between the variables flows through $Z$.

Formally, we require that the conditional total correlation is less than some constant $\epsilon$:

$$C(X_1; \ldots; X_n \mid Z) \le \epsilon \qquad \text{($Z$ is a mediator over $X$)}$$

$Z$ is an exact mediator if $\epsilon = 0$, and an approximate mediator up to error $\epsilon$ otherwise. We normally assume mediators are approximate, so we just say “mediator.” We think of $\epsilon$ as small, but our bounds hold for any value of $\epsilon$, so no actual assumption on its size is needed.
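To make the mediation condition concrete, here is a minimal numerical sketch (my own illustration, not from the paper), assuming a toy model in which a hidden fair coin $Z$ is read by two independently noisy sensors $X_1$ and $X_2$. The helper names (`entropy`, `marginal`, `H`, `cond_total_correlation`) are ad hoc choices for this post.

```python
import numpy as np
from itertools import product

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    p = np.array([q for q in probs if q > 0.0])
    return float(-np.sum(p * np.log2(p)))

def marginal(joint, idx):
    """Marginalize a dict {outcome_tuple: prob} onto the coordinate positions in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def H(joint, idx):
    """Joint entropy of the variables at the given coordinate positions."""
    return entropy(marginal(joint, idx).values())

def cond_total_correlation(joint, xs, cond):
    """C(X_1; ...; X_n | cond) = sum_i H(X_i | cond) - H(X_1, ..., X_n | cond)."""
    h_cond = H(joint, cond)
    sum_individual = sum(H(joint, [x] + cond) - h_cond for x in xs)
    return sum_individual - (H(joint, xs + cond) - h_cond)

# Toy model: Z is a fair coin; X1 and X2 are independent noisy readings of Z.
# Outcome tuples are ordered (z, x1, x2).
noise = 0.1
joint = {}
for z, x1, x2 in product([0, 1], repeat=3):
    p = 0.5
    p *= (1 - noise) if x1 == z else noise
    p *= (1 - noise) if x2 == z else noise
    joint[(z, x1, x2)] = p

Z, X1, X2 = 0, 1, 2
print(cond_total_correlation(joint, [X1, X2], [Z]))  # ~0.0: Z mediates over (X1, X2)
print(cond_total_correlation(joint, [X1, X2], []))   # > 0: X1 and X2 are dependent
```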

Definition: Redund

When a random variable $Z$ is a function of another random variable $X_i$, we know the value $Z$ takes as soon as we know $X_i$. Therefore the new information that we get from $Z$ after seeing $X_i$ is exactly zero:

$$H(Z \mid X_i) = 0$$

When this only holds approximately, the conditional entropy is small, and we say that $Z$ is approximately a function of $X_i$ or approximately determined by $X_i$:

$$H(Z \mid X_i) \le \delta$$

If $Z$ is approximately a function of each element of a sequence $X = (X_1, \ldots, X_n)$, then we say it is a redund over $X$. This means that (approximately) all the information contained in $Z$ is also contained in each $X_i$.

$$H(Z \mid X_i) \le \delta \quad \text{for all } i \in \{1, \ldots, n\} \qquad \text{($Z$ is a redund over $X$)}$$

$Z$ is an exact redund if $\delta = 0$, and an approximate redund up to error $\delta$ otherwise. We normally assume redunds are approximate, so we just say “redund.” Again, we think of $\delta$ as small, but our bounds hold globally.
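Continuing the same toy sketch (my own illustration, reusing the `entropy`, `marginal`, and `H` helpers from the mediator example above): here each observable carries an almost-exact copy of $Z$ plus some irrelevant noise, so the conditional entropies $H(Z \mid X_i)$ come out small.

```python
from itertools import product
# Reuses entropy / marginal / H from the mediator sketch above.

flip = 0.02   # probability that an observable's copy of Z is corrupted
joint = {}
for z, e1, e2, n1, n2 in product([0, 1], repeat=5):
    p = 0.5                                   # Z is a fair coin
    p *= flip if e1 else (1 - flip)           # rare corruption of the copy in X1
    p *= flip if e2 else (1 - flip)           # rare corruption of the copy in X2
    p *= 0.25                                 # N1, N2 are independent uniform noise bits
    x1, x2 = (z ^ e1, n1), (z ^ e2, n2)       # each X_i = (noisy copy of Z, noise bit)
    key = (z, x1, x2)
    joint[key] = joint.get(key, 0.0) + p

Z, X1, X2 = 0, 1, 2
print(H(joint, [Z, X1]) - H(joint, [X1]))   # H(Z | X1): small
print(H(joint, [Z, X2]) - H(joint, [X2]))   # H(Z | X2): small, so Z is approximately a redund
```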

Definition: Natural Latent

$\Lambda$ is a natural latent over $X$ up to errors $\epsilon$ and $\delta$ iff it satisfies both the mediation and redundancy conditions up to those errors:

$$C(X_1; \ldots; X_n \mid \Lambda) \le \epsilon \quad \text{and} \quad H(\Lambda \mid X_i) \le \delta \ \text{for all } i \qquad \text{($\Lambda$ is a natural latent over $X$)}$$

Theorem: Mediator Determines Redund

Let $X = (X_1, \ldots, X_n)$ be a sequence of discrete random variables, and assume that for some random variables $\Lambda$ and $\Lambda'$ and for some constants $\epsilon$ and $\delta$ the following inequalities hold:

$$C(X_1; \ldots; X_n \mid \Lambda) \le \epsilon \qquad \text{($\Lambda$ is a mediator over $X$)}$$

$$H(\Lambda' \mid X_j) \le \delta \quad \text{for all } j \qquad \text{($\Lambda'$ is a redund over $X$)}$$

Then:

$$H(\Lambda' \mid \Lambda) \le \epsilon + 2\delta$$

Proof

Pick some distinct indices $i, j$ where, without loss of generality, $i < j$. By the definition of conditional mutual information,

$$I(X_j; \Lambda' \mid \Lambda) = H(\Lambda' \mid \Lambda) - H(\Lambda' \mid X_j, \Lambda).$$

$H(\Lambda' \mid \Lambda)$ is what we want to bound, so we put it on one side:

$$H(\Lambda' \mid \Lambda) = I(X_j; \Lambda' \mid \Lambda) + H(\Lambda' \mid X_j, \Lambda).$$

We bound the mutual information term using monotonicity:

$$I(X_j; \Lambda' \mid \Lambda) \le I(X_j; X_i, \Lambda' \mid \Lambda).$$

Now we can use the chain rule for mutual information to break apart the right hand side:

$$I(X_j; X_i, \Lambda' \mid \Lambda) = I(X_j; X_i \mid \Lambda) + I(X_j; \Lambda' \mid X_i, \Lambda).$$

Substituting, we get an upper bound on $H(\Lambda' \mid \Lambda)$:

$$H(\Lambda' \mid \Lambda) \le I(X_j; X_i \mid \Lambda) + I(X_j; \Lambda' \mid X_i, \Lambda) + H(\Lambda' \mid X_j, \Lambda).$$

Now we will bound each of these terms individually.

First, by monotonicity we have

$$I(X_j; X_i \mid \Lambda) \le I(X_j; X_1, \ldots, X_{j-1} \mid \Lambda).$$

And by the nonnegativity of mutual information we have

$$I(X_j; X_1, \ldots, X_{j-1} \mid \Lambda) \le \sum_{k=1}^{n} I(X_k; X_1, \ldots, X_{k-1} \mid \Lambda) = C(X_1; \ldots; X_n \mid \Lambda),$$

which by assumption is bounded by $\epsilon$.

Second, bounding $I(X_j; \Lambda' \mid X_i, \Lambda)$ by the entropy of its second argument gives

$$I(X_j; \Lambda' \mid X_i, \Lambda) \le H(\Lambda' \mid X_i, \Lambda).$$

By monotonicity,

$$H(\Lambda' \mid X_i, \Lambda) \le H(\Lambda' \mid X_i),$$

which by assumption is bounded by $\delta$.

Third, simply by monotonicity and the redundancy assumption,

$$H(\Lambda' \mid X_j, \Lambda) \le H(\Lambda' \mid X_j) \le \delta.$$

So we have bounded each term, and substituting gives

$$H(\Lambda' \mid \Lambda) \le \epsilon + 2\delta.$$
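As a sanity check on the bound (my own numerical sketch, reusing the helpers from the mediator example above; not part of the paper’s presentation), the snippet below builds a toy joint distribution in which $\Lambda$ is an exact mediator and $\Lambda'$ is an approximate redund, and verifies $H(\Lambda' \mid \Lambda) \le \epsilon + 2\delta$ with $\delta$ taken as the worst redundancy error.

```python
from itertools import product
# Reuses entropy / marginal / H / cond_total_correlation from the mediator sketch.

flip = 0.05
joint = {}
for b, e1, e2, n1, n2 in product([0, 1], repeat=5):
    p = 0.5
    p *= flip if e1 else (1 - flip)
    p *= flip if e2 else (1 - flip)
    p *= 0.25
    lam = b                          # Lambda: the underlying coin (an exact mediator)
    lam_prime = b ^ e1               # Lambda': X1's noisy copy of the coin (a redund)
    x1, x2 = (b ^ e1, n1), (b ^ e2, n2)
    key = (lam, lam_prime, x1, x2)
    joint[key] = joint.get(key, 0.0) + p

LAM, LAMP, X1, X2 = 0, 1, 2, 3

eps = cond_total_correlation(joint, [X1, X2], [LAM])        # mediation error (0 here)
delta = max(H(joint, [LAMP, X1]) - H(joint, [X1]),
            H(joint, [LAMP, X2]) - H(joint, [X2]))          # worst redundancy error
lhs = H(joint, [LAMP, LAM]) - H(joint, [LAM])               # H(Lambda' | Lambda)

print(lhs, eps + 2 * delta)
assert lhs <= eps + 2 * delta + 1e-9                        # the theorem's guarantee
```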

Probabilistic Generative Models: Observables and Latents

In this section, we formalize how agents represent the world using probabilistic generative models. This framework will allow us to precisely state and prove the guaranteed translatability theorem.

If you don’t need the measure-theoretic details of how observables and latents are formally constructed from sensory data, you can skim or skip this section and jump directly to the Guaranteed Translatability theorem. The key intuition is that agents break up raw sensory data into chunks called observables and posit hidden variables (latents) to model these features.

We assume that a Bayesian agent learns a probabilistic generative model (PGM) which it uses to make predictions about the world. Traditionally, a PGM represents an agent as having a joint distribution over a set of hidden latent variables $\Lambda$ and observables $X$. Here we will begin with a single random variable, and show how first multiple observable random variables and then latent variables are introduced. Throughout this section, we assume all random variables are discrete.

Note that we will reuse some symbols from the previous section, and while we try to choose them in an intuitive way, they have new meanings here.

Sensory Data

We define the random variable $D$ representing the agent’s sensory data as follows. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Then let

$$D : (\Omega, \mathcal{F}) \to (\mathcal{D}, 2^{\mathcal{D}})$$

be a measurable function, where:

  • $\Omega$ is a sample space of outcomes.
  • $\mathcal{F}$ is the $\sigma$-algebra on $\Omega$.
  • $P$ is the probability measure.
  • $\mathcal{D}$ is a countable set representing all possible sensory data.
  • The $\sigma$-algebra on $\mathcal{D}$ is the power set of $\mathcal{D}$, denoted $2^{\mathcal{D}}$.
  • The distribution of $D$ is a Probability Mass Function (PMF) $P(D = d)$ for $d \in \mathcal{D}$.

The agent can have a predictive model encoded as a distribution over $\mathcal{D}$.

Observables

Given sensory data $D$, we can define a sequence of observable random variables $X_1, \ldots, X_n$. We denote this sequence as $X$ and say that the random variables are measurable functions of $D$:

$$f_i : \mathcal{D} \to \mathcal{X}_i, \qquad i = 1, \ldots, n,$$

where each $\mathcal{X}_i$ is a countable set. The $i$-th observable random variable is then $X_i = f_i(D)$.

The joint observable vector is $X = (X_1, \ldots, X_n)$, where $X$ takes values in $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$.
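As a toy illustration of this chunking step (hypothetical names, my own example): if one outcome of the sensory data is a tuple of six pixel intensities, each feature map $f_i$ might just carve out a contiguous chunk of it.

```python
# Hypothetical illustration: one outcome d of the sensory data D is a tuple of
# six pixel intensities, and each feature map f_i picks out one chunk of it.
def f_1(d): return d[0:2]   # left chunk
def f_2(d): return d[2:4]   # middle chunk
def f_3(d): return d[4:6]   # right chunk

d = (12, 200, 37, 37, 90, 91)        # one outcome of D
x = (f_1(d), f_2(d), f_3(d))         # the corresponding outcome of X = (X_1, X_2, X_3)
print(x)                             # ((12, 200), (37, 37), (90, 91))
```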

Each $X_i$ has a well-defined PMF derived from the distribution of $D$:

$$P(X_i = x_i) = \sum_{d \in \mathcal{D} \,:\, f_i(d) = x_i} P(D = d).$$
Latents

Now we assume that, in order to find the marginals $P(X_i)$ and the joint distribution $P(X)$, the agent uses a latent variable model.

The agent posits discrete latent random variables $\Lambda = (\Lambda_1, \ldots, \Lambda_m)$, taking values in $\mathcal{L} = \mathcal{L}_1 \times \cdots \times \mathcal{L}_m$, where each $\mathcal{L}_j$ is a countable set.

We define the agent’s model using the joint PMF $P(D, X, \Lambda)$. Since $X$ is a deterministic function of $D$, the joint distribution of all variables is a degenerate extension of the distribution over $D$ and $\Lambda$.

For any specific outcomes $d \in \mathcal{D}$, $x \in \mathcal{X}$, and $\lambda \in \mathcal{L}$, the joint probabilities satisfy:

$$P(D = d, X = x, \Lambda = \lambda) = P(D = d, \Lambda = \lambda) \, \mathbf{1}[x = (f_1(d), \ldots, f_n(d))].$$

Including $X$ in the joint distribution is technically redundant, because $X$ adds no new information once $D$ is known (though $D$ may contain information not captured by $X$). However, we include $X$ to keep the relationship between raw data, features, and latents explicit.

Motivation of Latent Variables

Modeling $P(X)$ directly is usually difficult because sensory data is high-dimensional and complex. However, it is often easier to use a generative process where we sample a simple latent factor $\lambda$ from a prior $P(\Lambda)$, and we generate data conditioned on that factor using $P(X \mid \Lambda)$.

We have:

  • Observable space: The countable set $\mathcal{X}_i$ (and by extension $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$).
  • Latent space: The countable set $\mathcal{L}$.
  • Random variables: Discrete variables $X$ and $\Lambda$.

From these, we define the joint probability mass function (PMF):

$$P(X = x, \Lambda = \lambda), \qquad x \in \mathcal{X}, \ \lambda \in \mathcal{L}.$$

The marginal distributions are obtained by summing over the other variable (Law of Total Probability):

$$P(X = x) = \sum_{\lambda \in \mathcal{L}} P(X = x, \Lambda = \lambda), \qquad P(\Lambda = \lambda) = \sum_{x \in \mathcal{X}} P(X = x, \Lambda = \lambda).$$

We can factor the joint distribution into a prior and a likelihood (conditional probability):

$$P(X = x, \Lambda = \lambda) = P(\Lambda = \lambda) \, P(X = x \mid \Lambda = \lambda).$$

This factorization allows us to express the complex data distribution as a mixture of (hopefully) simpler distributions:

$$P(X = x) = \sum_{\lambda \in \mathcal{L}} P(\Lambda = \lambda) \, P(X = x \mid \Lambda = \lambda).$$
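Here is a minimal numeric sketch of this factorization (my own toy numbers, not from the paper): a two-valued latent, a three-valued observable, and the marginal $P(X)$ recovered as a mixture.

```python
import numpy as np

# Toy latent variable model: Lambda takes 2 values, X takes 3 values.
prior = np.array([0.3, 0.7])                    # P(Lambda = lambda)
likelihood = np.array([[0.7, 0.2, 0.1],         # P(X = x | Lambda = 0)
                       [0.1, 0.3, 0.6]])        # P(X = x | Lambda = 1)

joint = prior[:, None] * likelihood             # P(X = x, Lambda = lambda), shape (2, 3)
p_x = joint.sum(axis=0)                         # P(X = x) as a mixture over Lambda

assert np.isclose(joint.sum(), 1.0)
print(p_x)                                      # [0.28 0.27 0.45]
```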

Theorem: Guaranteed Translatability

This theorem establishes when translation between two agents’ latent variables is guaranteed to be possible. Two agents observe the world and build their own models with their own latent variables. If they agree on certain shared observables, and their latent variables approximately satisfy the natural latent conditions (mediation and redundancy), then there exists an approximate mapping between the latent variables in their models.

The forward direction of this theorem shows that one agent’s mediators can determine another’s redunds, and the reverse direction shows that in the special case of two observables, being determined by all mediators implies being a redund.

Agreement on Shared Observables

We now consider two Bayesian agents, Alice ($A$) and Bob ($B$), who learn probabilistic generative models $M^A$ and $M^B$, respectively. Alice derives observables $X^A = (X^A_1, \ldots, X^A_{n_A})$. Bob derives observables $X^B = (X^B_1, \ldots, X^B_{n_B})$. These need not be the same family: their feature maps $f^A_i$ and $f^B_i$ may differ.

Some of Alice’s and Bob’s observables may be essentially the same, in the sense that they are functions of each other. We call these shared observables and denote them by $\bar{X}$.

Specifically, we define subsequences $\bar{X}^A$ of $X^A$ and $\bar{X}^B$ of $X^B$, each of length $k$, such that for each $i \in \{1, \ldots, k\}$ there is a bijection:

$$g_i : \bar{\mathcal{X}}^A_i \to \bar{\mathcal{X}}^B_i \quad \text{such that} \quad \bar{X}^B_i = g_i(\bar{X}^A_i).$$

These variables correspond to the same feature of the data.

Important: The set $\bar{X}$ need not contain all observables that are shared between the agents. We only require that each element of $\bar{X}$ is indeed shared; that is, for each variable we include in $\bar{X}^A$ and $\bar{X}^B$, there exists the bijection described above. The agents may have additional shared observables beyond those in $\bar{X}$, but our analysis focuses on a particular subset of shared observables.

While the variables may be linked in reality, the agents’ models of these variables may differ. We say that Alice and Bob agree on the observables in $\bar{X}$ if their marginal distributions over $\bar{X}^A$ and $\bar{X}^B$ are identical (under the mappings $g_i$).

Formally, for all $i \in \{1, \ldots, k\}$ and for all outcomes $x \in \bar{\mathcal{X}}^A_i$:

$$P^A(\bar{X}^A_i = x) = P^B(\bar{X}^B_i = g_i(x)).$$

Since both agents assign the same probability mass to corresponding outcomes, the entropy of the shared observables is identical for both agents:

$$H(\bar{X}^A_i) = H(\bar{X}^B_i).$$

Consequently, $\bar{X}^A$ and $\bar{X}^B$ are interchangeable in entropy expressions, and we will often omit the superscripts (writing simply $\bar{X}$) when the distinction is not needed.

Statement

Assume Alice and Bob agree on observables $\bar{X}$ as defined above. Let $\Lambda^B$ be any mediator over $\bar{X}^B$ in Bob’s model:

$$C(\bar{X}^B_1; \ldots; \bar{X}^B_k \mid \Lambda^B) \le \epsilon.$$

Then, for a redund $\Lambda^A$ over $\bar{X}^A$ in Alice’s model, $\Lambda^A$ is approximately determined by $\Lambda^B$. Formally, if $\Lambda^A$ is a redund up to error $\delta$:

$$H(\Lambda^A \mid \bar{X}^A_i) \le \delta \quad \text{for all } i \in \{1, \ldots, k\},$$

Then:

$$H(\Lambda^A \mid \Lambda^B) \le \epsilon + 2\delta.$$

Furthermore, when there are exactly two shared observables $\bar{X}_1$ and $\bar{X}_2$, the implication works in reverse as well. If $\Lambda^A$ is determined by every mediator in Bob’s model up to error $\delta'$, then $\Lambda^A$ must be a redund up to error $\delta'$.

Formally, if for all $\Lambda^B$ such that $C(\bar{X}^B_1; \bar{X}^B_2 \mid \Lambda^B) = 0$, we have $H(\Lambda^A \mid \Lambda^B) \le \delta'$ for some constant $\delta'$, then:

$$H(\Lambda^A \mid \bar{X}^A_i) \le \delta' \quad \text{for } i \in \{1, 2\}.$$
Proof

We are given that Alice and Bob agree on observables $\bar{X}$, and that Bob’s latent $\Lambda^B$ mediates over $\bar{X}^B$ up to error $\epsilon$:

$$C(\bar{X}^B_1; \ldots; \bar{X}^B_k \mid \Lambda^B) \le \epsilon.$$
Forward Direction

Assume Alice’s latent $\Lambda^A$ is a redund over her shared observables $\bar{X}^A$:

$$H(\Lambda^A \mid \bar{X}^A_i) \le \delta \quad \text{for all } i.$$

Since Alice and Bob agree on observables, the redundancy bound holds for Bob’s observables as well:

$$H(\Lambda^A \mid \bar{X}^B_i) \le \delta \quad \text{for all } i.$$

Now we apply the Mediator Determines Redund Theorem to the system $(\bar{X}^B, \Lambda^B, \Lambda^A)$.

  • $\Lambda^B$ is the mediator over $\bar{X}^B$ (with error $\epsilon$).
  • $\Lambda^A$ is the redund over $\bar{X}^B$ (with error $\delta$).

Substituting the bounds directly into the theorem gives:

$$H(\Lambda^A \mid \Lambda^B) \le \epsilon + 2\delta.$$
Reverse Direction

We restrict ourselves to the case $k = 2$. We assume that for any mediator $\Lambda^B$ in Bob’s model (where $C(\bar{X}^B_1; \bar{X}^B_2 \mid \Lambda^B) = 0$), $\Lambda^A$ is determined by that mediator up to error $\delta'$:

$$H(\Lambda^A \mid \Lambda^B) \le \delta'.$$

We must show that $\Lambda^A$ is a redund over $\bar{X}^A$.

In the 2-variable case, the observables mediate themselves. If Bob chooses $\Lambda^B = \bar{X}^B_1$, the conditional total correlation is zero:

$$C(\bar{X}^B_1; \bar{X}^B_2 \mid \bar{X}^B_1) = H(\bar{X}^B_1 \mid \bar{X}^B_1) + H(\bar{X}^B_2 \mid \bar{X}^B_1) - H(\bar{X}^B_1, \bar{X}^B_2 \mid \bar{X}^B_1) = 0.$$

Since $\bar{X}^B_1$ is a valid mediator (with $\epsilon = 0$), our assumption requires that it determines $\Lambda^A$:

$$H(\Lambda^A \mid \bar{X}^B_1) \le \delta'.$$

By symmetry, $\bar{X}^B_2$ is also a valid mediator, so:

$$H(\Lambda^A \mid \bar{X}^B_2) \le \delta'.$$

Using the agreement on observables assumption to map back to Alice’s variables, we have:

$$H(\Lambda^A \mid \bar{X}^A_i) \le \delta' \quad \text{for } i \in \{1, 2\}.$$

So $\Lambda^A$ is a redund over $\bar{X}^A$ up to error $\delta'$.

Corollary: Bi-directional Translation with Natural Latents

The theorem above establishes one-way translatability: if Bob has a mediator and Alice has a redund, then Bob’s latent determines Alice’s. However, when both agents use approximately natural latents, we get an approximate bijection between their latents.

Statement: Suppose Alice and Bob agree on shared observables $\bar{X}$, and both $\Lambda^A$ and $\Lambda^B$ are natural latents over their respective shared observables. That is:

$\Lambda^A$ is both a mediator over $\bar{X}^A$ (with $C(\bar{X}^A_1; \ldots; \bar{X}^A_k \mid \Lambda^A) \le \epsilon^A$) and a redund over $\bar{X}^A$ (with $H(\Lambda^A \mid \bar{X}^A_i) \le \delta^A$ for all $i$)

$\Lambda^B$ is both a mediator over $\bar{X}^B$ (with $C(\bar{X}^B_1; \ldots; \bar{X}^B_k \mid \Lambda^B) \le \epsilon^B$) and a redund over $\bar{X}^B$ (with $H(\Lambda^B \mid \bar{X}^B_i) \le \delta^B$ for all $i$)

Then both conditional entropies are small:

$$H(\Lambda^A \mid \Lambda^B) \le \epsilon^B + 2\delta^A \qquad \text{and} \qquad H(\Lambda^B \mid \Lambda^A) \le \epsilon^A + 2\delta^B.$$
Proof: Apply the forward direction of the Guaranteed Translatability theorem twice.

Alice → Bob: Since $\Lambda^B$ mediates $\bar{X}^B$ and $\Lambda^A$ is a redund over $\bar{X}^A$ (hence over $\bar{X}^B$ by agreement), we have

$$H(\Lambda^A \mid \Lambda^B) \le \epsilon^B + 2\delta^A.$$

Bob → Alice: Since $\Lambda^A$ mediates $\bar{X}^A$ and $\Lambda^B$ is a redund over $\bar{X}^B$ (hence over $\bar{X}^A$ by agreement), we have

$$H(\Lambda^B \mid \Lambda^A) \le \epsilon^A + 2\delta^B.$$

Appendix: Elementary Information Theory

We will give a few elementary definitions and identities from information theory. We will always deal with discrete random variables, as this avoids continuous pathologies and is arguably the correct level of description for the real world anyway. This section assumes very minimal prior knowledge, but for an authoritative source refer to Elements of Information Theory (2nd Edition) by Cover and Thomas.

So $X$ is a discrete random variable which takes values $x$ in a countable set $\mathcal{X}$, and we write $p(x) = P(X = x)$. A joint distribution over two variables $X$ and $Y$ takes values $(x, y) \in \mathcal{X} \times \mathcal{Y}$, and we write $p(x, y) = P(X = x, Y = y)$.

All of what follows can be expressed in terms of discrete random variables with probability mass functions.

Entropy

The entropy of a random variable $X$ is the average information we get from sampling it:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$

Since probabilities are between zero and one, we have $\log p(x) \le 0$, so we multiply by $-1$ to get a nonnegative number. Each term $p(x)$ is the probability of outcome $x$, while $-\log p(x)$ is the information content (or “surprise”) of that outcome. Rare events have high information content, and common events have low information content. Entropy is the probability-weighted average of these information contents.

The joint entropy $H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)$ measures uncertainty about multiple variables. The conditional entropy $H(X \mid Y) = H(X, Y) - H(Y)$ is the average information we get from sampling $X$ when we already know the value of $Y$.
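A short sketch (my own toy numbers) computing these quantities for an arbitrary small joint PMF; the `entropy` helper below treats zero-probability outcomes as contributing nothing.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of an array of probabilities (zero entries are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A small joint PMF p(x, y) with X in {0, 1} (rows) and Y in {0, 1, 2} (columns).
p_xy = np.array([[0.20, 0.15, 0.15],
                 [0.10, 0.10, 0.30]])

H_xy = entropy(p_xy)               # joint entropy H(X, Y)
H_x = entropy(p_xy.sum(axis=1))    # marginal entropy H(X)
H_y = entropy(p_xy.sum(axis=0))    # marginal entropy H(Y)
H_x_given_y = H_xy - H_y           # conditional entropy H(X | Y)
print(H_x, H_y, H_xy, H_x_given_y)
```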

Mutual Information

The mutual information $I(X; Y)$ between two random variables measures how much information they share, or equivalently, how much knowing one tells you about the other.

We give two equivalent forms:

$$I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X \mid Y).$$
The first form shows mutual information as “total information minus joint information”: the extent to which $X$ and $Y$ are redundant. The second form shows it as “reduction in uncertainty”: how much knowing $Y$ reduces our uncertainty about $X$.

Conditional mutual information $I(X; Y \mid Z)$ measures the information $X$ and $Y$ share, given that we already know $Z$:

$$I(X; Y \mid Z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z).$$
Again, the first form captures redundancy given $Z$, while the second form captures how much $Y$ reduces our uncertainty about $X$ beyond what $Z$ already tells us.

Chain rule for mutual information decomposes mutual information about multiple variables into successive contributions:

$$I(X; Y, Z \mid W) = I(X; Y \mid W) + I(X; Z \mid Y, W).$$
This says: the information $X$ shares with both $Y$ and $Z$ (given $W$) equals the information $X$ shares with $Y$ (given $W$), plus the additional information $X$ shares with $Z$ beyond what $Y$ already provided (given both $Y$ and $W$). We use this identity in our proof of the Mediator Determines Redund theorem.

Mutual information is bounded by the entropy of either argument. Since entropy is nonnegative, $H(X \mid Y, Z) \ge 0$, so from the second form we have:

$$I(X; Y \mid Z) \le H(X \mid Z).$$

This states that knowing $Y$ cannot tell you more about $X$ than the total uncertainty you had about $X$ in the first place (given $Z$). This bound appears in our proof of Mediator Determines Redund.
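The identities above are easy to check numerically. The sketch below (my own toy joint, reusing the `entropy` helper from the previous snippet) verifies the unconditional chain rule $I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y)$ and the entropy bound on conditional mutual information.

```python
import numpy as np
# Reuses the entropy() helper from the previous snippet.

# Arbitrary joint p(x, y, z) over three binary variables; axes ordered (X, Y, Z).
p = np.array([[[0.10, 0.05],
               [0.05, 0.15]],
              [[0.20, 0.05],
               [0.10, 0.30]]])
assert np.isclose(p.sum(), 1.0)

H_x, H_y, H_z = entropy(p.sum((1, 2))), entropy(p.sum((0, 2))), entropy(p.sum((0, 1)))
H_xy, H_yz, H_xz = entropy(p.sum(2)), entropy(p.sum(0)), entropy(p.sum(1))
H_xyz = entropy(p)

I_xy = H_x + H_y - H_xy                        # I(X; Y)
I_xz_given_y = H_xy + H_yz - H_xyz - H_y       # I(X; Z | Y)
I_x_yz = H_x + H_yz - H_xyz                    # I(X; Y, Z)
assert np.isclose(I_x_yz, I_xy + I_xz_given_y)  # chain rule

I_xy_given_z = H_xz + H_yz - H_xyz - H_z       # I(X; Y | Z)
assert I_xy_given_z <= (H_yz - H_z) + 1e-12    # I(X; Y | Z) <= H(Y | Z)
print(I_xy, I_xz_given_y, I_x_yz, I_xy_given_z)
```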

Total Correlation

Total correlation measures the total dependence among a set of variables—how much we learn by observing them together versus independently.

We give two equivalent forms for both the unconditional and conditional cases.

Unconditional total correlation:

$$C(X_1; \ldots; X_n) = \sum_{i=1}^{n} H(X_i) - H(X_1, \ldots, X_n) = \sum_{i=2}^{n} I(X_i; X_{<i}),$$

where $X_{<i}$ denotes $(X_1, \ldots, X_{i-1})$.

The first form is a direct generalization of the two-variable mutual information: “sum of individual entropies minus joint entropy,” measuring how much information is double-counted when we sum the individual entropies. The second form expresses this as a sum of successive dependencies: how much each variable shares with all preceding variables.

Conditional total correlation:

$$C(X_1; \ldots; X_n \mid Z) = \sum_{i=1}^{n} H(X_i \mid Z) - H(X_1, \ldots, X_n \mid Z) = \sum_{i=2}^{n} I(X_i; X_{<i} \mid Z).$$

The conditional case measures the same redundancy, but with everything measured relative to already knowing $Z$.
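Finally, a quick numerical check (same toy joint `p` and `entropy` helper as the previous snippet) that the two forms of total correlation agree, in both the unconditional and conditional versions.

```python
import numpy as np
# Reuses entropy() and the 3-variable joint p from the previous snippet.

H_x, H_y, H_z = entropy(p.sum((1, 2))), entropy(p.sum((0, 2))), entropy(p.sum((0, 1)))
H_xy, H_yz, H_xz = entropy(p.sum(2)), entropy(p.sum(0)), entropy(p.sum(1))
H_xyz = entropy(p)

# Unconditional total correlation C(X; Y; Z), computed both ways.
tc_entropies = H_x + H_y + H_z - H_xyz                       # sum of entropies minus joint entropy
tc_chain = (H_x + H_y - H_xy) + (H_z + H_xy - H_xyz)         # I(Y; X) + I(Z; X, Y)
assert np.isclose(tc_entropies, tc_chain)

# Conditional total correlation C(X; Y | Z), conditioning on Z.
ctc_entropies = (H_xz - H_z) + (H_yz - H_z) - (H_xyz - H_z)  # sum of H(. | Z) minus H(X, Y | Z)
ctc_chain = H_xz + H_yz - H_xyz - H_z                        # equals I(X; Y | Z) for two variables
assert np.isclose(ctc_entropies, ctc_chain)
print(tc_entropies, ctc_entropies)
```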