From Gematria to GPT

April 6, 2025

In one of my favorite movies, Pi (1998, dir. Aronofsky), there’s a scene where a Hasidic Jew shows how you can interpret Hebrew numerically, a practice called Gematria. Each letter in the Hebrew alphabet has an associated number (Aleph = one, Bet = two, etc.), and you can find all sorts of interesting relationships¹. The character’s example is this: the numerical value of father (אב) plus the numerical value of mother (אם) equals the value of child (ילד).

Hebrew Gematria Example
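
To make the arithmetic concrete, here’s a quick Python sketch of that example (using the standard letter values, where final mem counts the same as regular mem):

```python
# Standard gematria values for the letters used in the example.
VALUES = {
    "א": 1,   # aleph
    "ב": 2,   # bet
    "ם": 40,  # mem (final form, counted the same as regular mem)
    "י": 10,  # yod
    "ל": 30,  # lamed
    "ד": 4,   # dalet
}

def gematria(word: str) -> int:
    """Sum the numerical values of a word's letters."""
    return sum(VALUES[letter] for letter in word)

father = gematria("אב")   # 1 + 2 = 3
mother = gematria("אם")   # 1 + 40 = 41
child = gematria("ילד")   # 10 + 30 + 4 = 44

assert father + mother == child  # 3 + 41 == 44
```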

Maybe I spend too much time thinking about vector spaces, but I couldn’t help but notice that this was pretty similar to word embeddings (how LLMs encode language). In these embeddings, everything is encoded as vectors, each representing a direction in a latent space. Because these vectors are members of the same vector space, you can do all sorts of computation with them. The classic example is that the difference between the vectors for man and king is roughly equal to the difference between woman and queen; that is, there is some royalty direction.

Word Embedding
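
If you want to try this arithmetic yourself, here’s a small sketch using gensim’s pretrained GloVe vectors (my choice of library and model, not anything the analogy depends on):

```python
import gensim.downloader as api

# Download a small set of pretrained word vectors.
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman should land near queen if there really is
# a consistent "royalty direction" in the embedding space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', ...)] for most pretrained embeddings
```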

This word embedding interpretation of Gematria isn’t super strong, since having only a single dimension makes the embedding extremely lossy, but it did get me thinking about how different languages encode meaning. Generally, languages belong to one of two groups: phonographic languages and logographic languages. In a phonographic language, e.g. English, the fundamental units (letters) represent sounds but carry no inherent meaning. In a logographic language, e.g. Mandarin, individual characters can usually stand on their own, or at least carry some meaning.

I think it’s actually more interesting, however, to think of languages as sitting somewhere along an axis of “logograph-icity” rather than as a binary distinction. On this axis, we might put English or French on the far left (0 out of 10 “logograph-icity”) and Egyptian hieroglyphics all the way on the right (10 out of 10 “logograph-icity”). This framing is more interesting because, as we saw before, in some languages (e.g. Hebrew), the characters themselves can carry tiny morsels of meaning.

Scale Image 1

This brings us back to word embeddings, and to the larger question of how we should think about language models in general. At the risk of being excessively anthropomorphic, if we are indeed summoning the machine God through these LLMs, it’s natural to wonder how it might communicate and what language it speaks². Obviously gpt-4.5 can write fairly solid English prose, but is English really its native tongue? Is it really thinking in English while it vibe codes your note-taking app?

One could argue that the native tongue is actually the underlying tokens: the mapping from letters/subwords to discrete integer IDs (for gpt-4o, “Hi there” → [12194, 1354]). While it’s true that your prompts are translated into tokens before they are fed into the model, the tokens themselves are still not what the model is actually operating on.
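
You can poke at this mapping yourself with OpenAI’s tiktoken library (assuming a recent version that knows about gpt-4o; the specific IDs below are just the example from above and depend entirely on the tokenizer):

```python
import tiktoken

# gpt-4o's tokenizer (o200k_base)
enc = tiktoken.encoding_for_model("gpt-4o")

tokens = enc.encode("Hi there")
print(tokens)              # the example above: [12194, 1354]
print(enc.decode(tokens))  # "Hi there"
```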

Tokens are really just intermediary units, since they’re keys for a lookup table of word embeddings (e.g. “Hi” → 12194 → [0.01, 3.20, …, -0.67]). The real computation in a model happens on these vectors, so I think it's fair to say these latent vectors are the native tongue (or at least the language of the model’s inner monologue). As an aside, you might be able to make the case that these vectors are actually more akin to the electrical impulses in your brain than language, but that’s a topic for a separate blog post. Anyways, if we can say that the latent vectors are the model's native language, where should they go on our scale?
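
Before trying to answer that, here’s what the token-to-vector lookup looks like concretely, as a minimal PyTorch sketch (the vocabulary size and embedding dimension are made up for illustration):

```python
import torch
import torch.nn as nn

# A toy lookup table: one row per token in the vocabulary.
# (Sizes are illustrative, not gpt-4o's real dimensions.)
vocab_size, embed_dim = 200_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([12194, 1354])  # "Hi there" as token IDs
latent_vectors = embedding(token_ids)    # shape: (2, 768)

# The model's real computation happens on these dense vectors,
# not on the integer token IDs themselves.
print(latent_vectors.shape)
```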

Scale Image 2

This is actually pretty hard to answer. On one hand, these vectors are essentially pure information and meaning (evidence for high “logograph-icity”). On the other hand, they are only meaningful in the context of other vectors or the underlying model (just like how “b” is only meaningful when it comes before/after other letters). For these reasons, I’d place latent vectors somewhere in between Hebrew and hieroglyphics.

A related concept that is perhaps more logographic than the raw latent vector is a feature vector (what you might get from a sparse autoencoder). Each dimension of that vector has a very specific meaning (index 10543 → “Golden Gate Bridge”), so if you know what each index maps to, the vector can stand on its own (you don’t need the surrounding context to understand it).
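
Here’s a toy illustration of why a feature vector is more self-contained (aside from the Golden Gate Bridge example above, the labels and indices are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical mapping from SAE feature indices to human-readable labels.
FEATURE_LABELS = {
    10543: "Golden Gate Bridge",
    2001: "legal language",   # made-up example
    77: "Python code",        # made-up example
}

# A sparse feature vector: almost all zeros, with a few interpretable activations.
features = np.zeros(200_000)
features[10543] = 8.2  # strong "Golden Gate Bridge" activation
features[77] = 1.3     # mild "Python code" activation

# Because each index has a fixed meaning, the vector can be read on its own,
# without reference to other vectors or to the model that produced it.
for idx in np.nonzero(features)[0]:
    print(FEATURE_LABELS.get(int(idx), "unknown feature"), features[idx])
```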

In summary, languages vary widely in how much meaning they pack into their fundamental units. A single English letter conveys little meaning by itself, whereas a single hieroglyph can carry substantial meaning. Most languages fall somewhere between these extremes. If we consider an LLM’s latent vectors to be its native tongue, they might represent a return to a more ancient, meaning-rich form of communication (you might have noticed that the older a language, the more logographic it tends to be). I’m not really sure whether thinking about models this way is actually useful, but it is fun. Let me know what you think!





  1. I should note that this practice is fairly subjective, and usually a practitioner will try to find additional numerical meaning between words that are already related, not the other way around.
  2. There’s a lot of really interesting work being done on whether this is even the right question. See this for example.