During my introduction to algorithms and complexity analysis at UC Berkeley, we were given a simple scale for understanding how long a function takes to complete, ranked from least to most costly to run, with the most costly being catastrophic and effectively unsolvable. Anyone who has studied algorithms will be familiar with a similar rubric:
In part 1, The Binary Tree of the Universe, and part 2, Earth-scale log n vs Cosmological log n, of our series, we built our intuition for how slowly logarithmic functions grow as inputs get larger and larger. And yet! When I look at this scale, my mind tends to put O(n log n) somewhere roughly in the middle between linear time O(n) and quadratic time O(n^2). Maybe it sits nine-tenths of the way toward one side or the other, but it feels right for it to land somewhere reasonably in between.
But now that we know a Binary Tree of the Universe would empower us to search for any atom in the observable universe in 266 steps, was my old intuition actually profoundly wrong?
Let’s use our handy trick of extrapolating these two functions to the most extreme cosmological scales - every atom in the observable universe:
every atom in the universe = 2^266
n = every atom in the universe
n^2 = every atom in the universe * every atom in the universe
n log n = 266 * every atom in the universe
This means the difference between O(n log n) and O(n^2) is the difference between a multiplicative factor of 266 and a multiplicative factor of every single atom in the universe. O(n log n) isn't sitting neatly at a rest stop between O(n) and O(n^2). It's as if you haven't even left your doorstep! We've traversed such an infinitesimally small distance that, for all intents and purposes, we haven't moved at all. Working through the math for the approximate ratio between n^2 and n log n where n = 2^266:
n^2 / (n log n) = n / log n = 2^266 / 266
= 2^266 / 2^8.055
≈ 2^258 ≈ 4.458 × 10^77
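As a quick sanity check, here is a minimal Python sketch of that ratio; it leans on Python's arbitrary-precision integers so the 80-digit numbers don't lose precision:

```python
from math import log2

n = 2**266          # approximate number of atoms in the observable universe

n_log_n = n * 266   # O(n log n) work, with log2(n) = 266
n_squared = n * n   # O(n^2) work

ratio = n_squared // n_log_n  # reduces to n / log n
print(f"n^2 / (n log n) ≈ 2^{log2(ratio):.1f} ≈ {float(ratio):.3e}")
# n^2 / (n log n) ≈ 2^257.9 ≈ 4.458e+77
```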
The ratio is incomprehensibly, cosmically, and comically large. So let’s see if we can further build our intuition through other analogies. Imagine we built a god machine that circumvents all causal speed limits and can linearly scan every atom in the universe in a single second. Where n is every atom in the universe, what would the difference be between O(n log n) and O(n^2)?
The last hydrogen-burning star is expected to extinguish in 10^12 to 10^14 years. The O(n log n) version of our scan would finish in 266 seconds, under five minutes. By the time our quadratic algorithm finishes, the universe will have watched the last star die ten-thousand-octillion-octillion (10^58) times over.
Same input. Same hardware. A god machine capable of exploring the entire observable universe in a single second. And yet due to a single, small exponent change, even our already impossible machine is incapable of ever completing our program.
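Here is the same back-of-the-envelope arithmetic as a short sketch, assuming the god machine's O(n) scan takes exactly one second and therefore its O(n^2) run takes n seconds:

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25

n = 2**266                               # atoms in the observable universe
quadratic_years = n / SECONDS_PER_YEAR   # O(n^2) runtime if O(n) takes one second

LAST_STAR_DEATH_YEARS = 1e14             # upper estimate for the last hydrogen-burning star
print(f"O(n^2) runtime: {quadratic_years:.2e} years")
print(f"Last-star lifetimes elapsed: {quadratic_years / LAST_STAR_DEATH_YEARS:.1e}")
# O(n^2) runtime: 3.76e+72 years
# Last-star lifetimes elapsed: 3.8e+58
```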
I love this insight! Once again, logarithms show how they can take the entire universe and place it in the palm of your hand. This raises the question: how can we practically apply these insights to computer science today? Almost every company chasing the AI gold rush has to tangle with a single constraint on throughput for their reasoning models: the token context window.
Attention unfortunately requires O(n^2) computation because every token attends to every other token within the context window. Feed-forward layers grow quickly too, being bound by O(n x model_dimension^2), but that is still comparatively less than attention, which dominates once n grows past the model dimension. Using some rough approximations, we can see what this tells us about current LLM architectures:
Computational Costs by Context Length
| Context Length | Tokens (n) | O(n²) Attention Ops | O(n log n) Ops | Ratio (n²/n log n) |
|---|---|---|---|---|
| 128K (GPT-4) | 2^17 (131K) | 17 billion (2^34) | 2.2 million (2^17 × 17) | 7,500× |
| 200K (Opus 4) | 2^17.6 (200K) | 39 billion (2^35.2) | 3.5 million (2^17.6 × 17.6) | 11,000× |
| 400K (GPT-5.1) | 2^18.6 (400K) | 156 billion (2^37.2) | 7.4 million (2^18.6 × 18.6) | 21,000× |
| 1M (Claude 4.5) | 2^20 | 1 trillion (2^40) | 21 million (2^20 × 20) | 50,000× |
| 10M | 2^23 | 70 trillion (2^46) | 193 million (2^23 × 23) | 363,000× |
| 100M | 2^27 | 18 quadrillion (2^54) | 3.4 billion (2^27 × 27) | 5,400,000× |
| 1B | 2^30 | 1 quintillion (2^60) | 32 billion (2^30 × 30) | 34,000,000× |
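A small Python sketch can reproduce the shape of this table; the figures above are deliberately rough, order-of-magnitude approximations, so treat the output as illustrative rather than exact:

```python
from math import log2

context_lengths = [2**17, 200_000, 400_000, 2**20, 10**7, 10**8, 10**9]

for n in context_lengths:
    quadratic = n * n        # pairwise attention comparisons
    n_log_n = n * log2(n)    # a hypothetical O(n log n) mechanism
    print(f"n = {n:>13,} | n^2 = {quadratic:.2e} | "
          f"n log n = {n_log_n:.2e} | ratio = {quadratic / n_log_n:,.0f}x")
```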
Memory Requirements (Attention Matrix Only)
| Context Length | Model | Attention Matrix Size | Memory (fp16) | Memory (fp32) | Fits in? |
|---|---|---|---|---|---|
| 128K | GPT-4 | 128K × 128K | 32 GB | 64 GB | High-end GPU ✅ |
| 200K | Opus 4 | 200K × 200K | 78 GB | 156 GB | 2× A100s (80GB each) |
| 400K | GPT-5.1 | 400K × 400K | 313 GB | 625 GB | 4× A100s (tight) |
| 1M | Claude 4.5 | 1M × 1M | 2 TB | 4 TB | ❌ Not in GPU RAM |
| 10M | - | 10M × 10M | 200 TB | 400 TB | ❌ Not even on disk |
| 100M | - | 100M × 100M | 20 PB | 40 PB | ❌ Data center scale |
| 1B | - | 1B × 1B | 2 EB | 4 EB | ❌ Apocalyptic |
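These memory figures follow directly from storing one dense n × n attention matrix at 2 bytes (fp16) or 4 bytes (fp32) per score. A minimal sketch, ignoring batch size, head count, and memory-saving techniques like FlashAttention that avoid materializing the full matrix:

```python
def attention_matrix_bytes(n: int, bytes_per_value: int = 2) -> int:
    """Memory for one dense n x n attention matrix (fp16 uses 2 bytes per score)."""
    return n * n * bytes_per_value

GIB = 2**30
for n in [2**17, 200_000, 400_000, 2**20, 10**7, 10**8, 10**9]:
    print(f"n = {n:>13,} | fp16 matrix ≈ {attention_matrix_bytes(n) / GIB:,.0f} GiB")
# 131,072 -> 32 GiB; 1,048,576 -> 2,048 GiB (~2 TB); 1,000,000,000 -> ~1.9 billion GiB (~2 EB)
```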
Total FLOPs for Full Forward Pass
| Context Length | Model | Attention FLOPs | A100 time (50% util) | H100 time (50% util) |
|---|---|---|---|---|
| 128K | GPT-4 | 200 TFLOPs | 1.3 seconds | 0.4 seconds |
| 200K | Opus 4 | 480 TFLOPs | 3.1 seconds | 1.0 seconds |
| 400K | GPT-5.1 | 1.9 PFLOPs | 12 seconds | 3.8 seconds |
| 1M | Claude 4.5 | 12.3 PFLOPs | 22 hours | 6.8 hours |
| 10M | - | 8.4 exaFLOPs | 625 years | 195 years |
| 100M | - | 2.2 zettaFLOPs | 175,000 years | 54,000 years |
| 1B | - | 12 yottaFLOPs | 940M years | 293M years |
A qualitative change in what is possible with modern hardware takes place as we approach a million tokens! Distributed computing across multiple GPUs becomes a must, and completion times eliminate almost every practical use case. Going beyond a million tokens moves from the impractical to the infeasible. Each layer in the neural network compounds this effect, making clear why Opus 4.5, the current state of the art, restricts its context window to 200,000 tokens. The quadratic curve of O(n^2) is an unforgiving master and demands its pound of flesh.
So what if we successfully built a model that achieved state-of-the-art performance while running in O(n log n) instead? Call it a hypothetical TreeFormer. How fast would our full forward pass be?
| Context Length | Model | TreeFormer FLOPs | A100 Time | vs O(n²) Speedup |
|---|---|---|---|---|
| 128K | GPT-4 | 33 GFLOPs | 0.2 seconds | 6× faster |
| 200K | Opus 4 | 53 GFLOPs | 0.3 seconds | 9× faster |
| 400K | GPT-5.1 | 113 GFLOPs | 0.7 seconds | 17× faster |
| 1M | Claude 4.5 | 300 GFLOPs | 1.9 seconds | 40,000× faster |
| 10M | - | 2.8 TFLOPs | 18 seconds | 1,250,000× faster |
| 100M | - | 41 TFLOPs | 4.4 minutes | 40,000,000× faster |
| 1B | - | 450 TFLOPs | 48 minutes | 19,500,000× faster |
The infeasible becomes practical! And the nigh-impossible becomes feasible! This is precisely why so much effort has gone into reducing the computational complexity of transformer architectures, with work spanning Mamba, Longformer, Hierarchical Attention, Ring / Sparse Attention, Mixture of Experts - a massive brain that knows a lot but activates only a subset of itself for each answer - and many others.
That brings us to one final question: do we even need 1 million tokens? The complete works of Shakespeare add up to roughly 1.2 million tokens. If we turn to massively long novels like the New York Times best-selling epic fantasy The Stormlight Archive, the most recent book clocks in at roughly 490,000 words, so:
490,000 words × 1.33 tokens/word ≈ 651,700 tokens
The problem compounds because with each turn in the conversation, we need to send the entire conversation back to the model. This doesn't include the massive unseen system prompts all the AI model providers prepend to every conversation. If we were to ask Gemini, ChatGPT, Claude, Grok, etc. to sum up their feelings on the book, they would be incapable of doing so without the optimizations and tricks we use to increase context windows.
For programming, this issue is exacerbated even further: many files, including AGENTS.md and CLAUDE.md, may influence how a refactor or feature implementation should proceed. Similarly, not all devices have the memory and compute needed to attend over so many tokens. Many use cases require long contexts to be handled reasonably by small, on-device models.
More context means more history, and more history means better results for engineering and scientific objectives; more history also means richer and more fulfilling conversations and interactions with the current suite of LLMs. If an LLM didn't simply use summarization and a RAG system to retrieve segments of old conversations, how would the interaction feel if every chat the two of you had ever had was within the context window? What emergent property or enrichment would we unearth in our conversations?
There is a singular, universal satisfaction in being seen, being known, and being remembered. That is what we’re really after.
---------
To future self, potential follow-up blog posts: further implications for smaller on-device models, how we could architect an O(n log n) transformer model.
In my previous post, Purity is the enemy of goodness, I introduced the idea that we are entering an era of unprecedented change:
We have grown up as a species with certain foundational truths. They have co-existed with us since we took our first hesitant steps onto the plains, and consciousness first ignited in our small corner of the universe. Now, not just one but seven of these invariants are on the precipice of being broken:
Minds are biologically-bound - Only biologically-born humans are capable of the depth of pattern recognition, planning, introspection, and intelligence humanity has. Nothing thinks faster than we do.
Minds are born roughly equal - Our brains’ size and capabilities fall within a tight distribution of outcomes, enabling near intellectual parity between individuals.
Physical agency is biologically-bound - Navigating, manufacturing, and using tools through diverse and sophisticated environments.
Consciousness is Earthbound - We are limited to a single planet.
Energy is scarce - We must always operate within its limitations and downstream effects on production.
Genes are gifted - Only our parents and the heavens are capable of sculpting our genetic code.
Death is the great leveller - It comes for us all.
"He is one, who raises himself from private considerations, and breathes and lives on public and illustrious thoughts. He is the world’s eye. He is the world’s heart. He is to resist the vulgar prosperity that retrogrades ever to barbarism, by preserving and communicating heroic sentiments, noble biographies, melodious verse, and the conclusions of history. Whatsoever oracles the human heart, in all emergencies, in all solemn hours, has uttered as its commentary on the world of actions ⎯ these he shall receive and impart."
In the previous post, we worked through the enormous power of logarithmic functions for reducing search spaces. For our Binary Tree of the Universe, it would take us only 266 steps to locate any atom in the observable universe. Tree data structures such as B-trees, with their wider fan-outs, require even fewer steps.
I find this kind of miraculous. But that result holds at the most extreme, cosmic scale. Searching for individual atoms in the universe may be more relevant for humans millennia from now; we, on the other hand, grapple with more Earthly confines. Let’s bring this down to Earth (huhu):
The Earth has an estimated ~10^50 atoms; convert to powers of 2:
log₂(10) = ln 10 / ln 2 ≈ 3.322
10^50 = 2^(50 × 3.322) ≈ 2^166.1
Simplifying 2^166.1 to 2^166, our Binary Tree of the Universe could search every atom on Earth in 166 steps. It's a mildly pleasing coincidence that a factor of 2^100 is all that separates the Earth from the observable universe.
Perhaps every atom on Earth is still too ambitious. Let’s further ground this within the context of humanity:
Roughly 10^11 people have ever lived, which is about 2^37, so we are a mere 37 steps from every person that has ever lived. You would need only a little over a quarter of the Spanish Steps to search through every human that has ever lived!
But one could say, that's only the people! What about all the content and information they're creating? If we indexed all the internet data ever created, currently estimated at ~100 zettabytes, or ~2^76.4 bytes and some change, our search for an individual byte still requires only 77 steps! We can then define Earth-scale log n as 166 and the current Human-scale log n as 77. It'd be nice to have some breathing room for human data's future growth, so let's make Human-scale log n a nice round 80.
This explains part of the magic of YouTube, Meta, TikTok, and the rest of the social media players offering reasonable access times over their massive content libraries. It is entirely possible to store, index, and retrieve social media content for every human alive. After all, all the human data ever created sits only a little more than halfway up the Spanish Steps.
As a final exercise to give us a sense of what logarithmically sits between 1 and 2^266 (all the atoms in the observable universe), I'll compile it all into two handy reference tables, starting with Human-scale:
| Scale | Count | log₂(n) | Spanish Steps |
|---|---|---|---|
| LLM context window (1M tokens) | ~10^6 | 20 | 15% up the stairs |
| Unicorn valuation ($1B) | ~10^9 | 30 | 22% up the stairs |
| All humans alive today | ~8×10^9 | 33 | 24% up the stairs |
| All humans ever lived | ~10^11 | 37 | 27% up the stairs |
| All internet data (bytes) | ~10^23 | 77 | 57% up the stairs |
Extending beyond humanity, I'll add in cosmological structures:
| Scale | Estimated Atoms | log₂(n) | Spanish Steps |
|---|---|---|---|
| Earth | ~10^50 | 166 | Up and down 31 steps |
| Solar System | ~10^57 | 189 | Up and down 54 steps |
| Nebula (typical) | ~10^60 | 199 | Up and down 64 steps |
| Milky Way Galaxy | ~10^68 | 226 | Up and down 91 steps |
| Local Group | ~10^72 | 239 | Up and down 104 steps |
| Virgo Supercluster | ~10^75 | 249 | Up and down 114 steps |
| Observable Universe | ~10^80 | 266 | Up and down (2 sets of stairs) |
| Universe of Universes | ~10^160 | 532 | Up and down twice (4 sets of stairs) |
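If you would like to play with these numbers yourself, here is a minimal Python sketch that recomputes the step counts for a handful of these scales; the counts themselves are rough, published estimates:

```python
from math import log2

SPANISH_STEPS = 135  # number of steps in Rome's Spanish Steps

scales = {
    "LLM context window (1M tokens)": 1e6,
    "All humans ever lived": 1e11,
    "All internet data (bytes)": 1e23,
    "Earth (atoms)": 1e50,
    "Milky Way Galaxy (atoms)": 1e68,
    "Observable Universe (atoms)": 1e80,
}

for name, count in scales.items():
    depth = log2(count)  # binary search depth for a balanced tree over `count` items
    print(f"{name:<32} ~{depth:.1f} steps "
          f"({depth / SPANISH_STEPS:.0%} of one Spanish Steps climb)")
```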
I find something about this comforting. The universe is so unimaginably, incomprehensibly vast, and yet there are paths for us to structure it, make sense of it, and explore its immense depth.
-----
Part 3 of this series: What the cosmos teaches us about quadratic growth and LLM context windows
A magician approaches you and asks, “Excuse me, have you ever played 20 questions? As a magician, I have spent my life counting every atom in the universe and I now know how many there are. As a secret between just us, there are one hundred quinvigintillion atoms in our universe. Please pick a number between 0 and the number of all the atoms in the universe. If you give me 20 minutes of your time, I promise I will be able to guess your number. The only rule is that - similar to 20 questions - I can ask you simple yes or no questions until we arrive at an answer.”
At most, how many questions would the magician ask? This is an example of binary search at cosmological scale.
In between my light laments over the hollowing out of any sense of soul or perspective in the software industry's engineering culture (I'm looking at you, endless mid-level and senior tech interviews that require rote manual regurgitation of algorithms even in the age of AI code assistance), I love thinking about the boundary conditions of time and space complexity. It helps me visualize the concepts and gives their deeper essence more texture. The YouTube channel 3Blue1Brown does an amazing job of this for fostering an intuition for the geometric interpretation of math.
Recently, I’ve been thinking a lot about logarithmic functions and what the power of O(log n) truly looks like. I know the function is a powerful transformation from an exponential space to a linear space, but what does that look like in practice in the most extreme scenarios?
Enter the Binary Tree of the Universe (pending trademark). If we were to take every atom in the universe and put it into a binary tree for conducting binary search, how many steps would the search process take?
[Already familiar with binary trees? Skip this paragraph] For those unfamiliar, a binary tree is a data structure that enables fast searching through a dataset by asking yes or no questions that cut the remaining possibilities in half at each step. For example, if the magician had chosen an arbitrary number between 1 and 100 and we needed to guess it in as few attempts as possible, a poor question would be whether the number is exactly 50. A better question would be whether the number is greater than 50! If the answer is no, we know the number must be between 1 and 50. We can repeat this process by asking if the number is greater than 25, and so on, until we locate our secret number. The Korean Bottle Cap Game (병뚜껑 게임), an excellent drinking game built around the secret number printed inside soju bottle caps, is an example of this phenomenon! While the objective is not to guess the number yourself, a clever competitor will use the shrinking search space to quickly trap other players with binary search. Formalizing this into an easy-to-access data structure that makes these yes / no questions blazing fast gives us the binary tree.
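As a quick illustration of the halving strategy (a sketch of the guessing game, not the magician's actual patter), here it is in Python; the number of questions never exceeds ⌈log2(range size)⌉:

```python
def guess_number(secret: int, low: int, high: int) -> int:
    """Find `secret` in [low, high] by halving the range; returns the question count."""
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        if secret > mid:   # "Is your number greater than mid?"
            low = mid + 1
        else:
            high = mid
        # the remaining range shrinks by roughly half with every question
    return questions

print(guess_number(73, 1, 100))              # 7 questions for a 1..100 range
print(guess_number(10**23, 0, 10**80 - 1))   # ≈ 266 questions for an atom-count-sized range
```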
Our current estimate for the number of atoms in the observable universe is 10^80, given a current radius of ~46 billion light-years.
Converting this to a power of two:
log₂(10) ≈ 3.322
10^80 = 2^(80 × 3.322) ≈ 2^265.76 ≈ 2^266
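If you want to check these conversions (or run your own), a tiny Python helper does the trick; the atom-count estimates themselves are, of course, rough:

```python
from math import log2

def power_of_ten_to_power_of_two(exponent_of_ten: float) -> float:
    """Convert 10^exponent_of_ten into the equivalent exponent of 2."""
    return exponent_of_ten * log2(10)  # log2(10) ≈ 3.322

print(power_of_ten_to_power_of_two(80))  # ≈ 265.75 -> round up to 266 steps
print(power_of_ten_to_power_of_two(50))  # ≈ 166.1  -> Earth-scale
print(power_of_ten_to_power_of_two(24))  # ≈ 79.7   -> stars in the observable universe
```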
Neat! So we have 2^266 atoms in our observable universe. 266 is conveniently close to 256, giving us a little bit of a memory cue for remembering how many atoms are in the universe for powers of 2.
According to what we can currently measure, the universe is flat: for all we know, it continues on either infinitely or for such a gargantuan distance that we cannot detect any curvature from within the observable universe.
For the sake of this thought exercise, let’s assume it is indeed finite and we are a bubble in a greater hyperspace that is orders of magnitude larger than we can observe. We’ll also assume that each observable universe bubble is akin to a solar system with a star. If there were an observable universe for every star we can see within our own observable universe, how many atoms would we have?
~10^24 stars (1 septillion, significantly less than a googol)
Powers of 2: ~2^80
2^266 * 2^80 = 2^346
So we have 2^346 atoms if every star in the observable universe represented another observable universe!
This is an astoundingly large figure - roughly 10^4 times larger than a googol. And yet, if we build our binary search data structure as a perfectly balanced tree, it would take us only 346 steps to find our atom.
That is nothing.
If we optimized for disk reads, a la a modern database, and used a B-tree with a wider fanout and shallower depth, the total number of steps would be even lower! To put that into perspective, the most famous staircase in the world, the Spanish Steps of Rome, has 135 steps.
To traverse our Binary Tree of the Universe to any atom in the observable universe, we would only need to ascend and descend the stairs once. In the case of our every-star-is-an-observable-universe universe, we only need to climb back up the stairs one more time - and not even all the way.
That’s pretty cool. But can we go even more extreme while remaining finite? What if we had a universe-of-universes where every atom represents an observable universe - how many steps would it take to reach our atom?
2^532 = 2^266 * 2^266
It would take only 532 steps, or two up-and-downs of the Spanish Steps! Let’s compile this all into a handy reference table:
| Scenario | Powers of 2 | Powers of 10 | Binary Search Steps | Spanish Steps Equivalent |
|---|---|---|---|---|
| Stars in observable universe | 2^80 | ~10^24 | 80 | Climb 60% of the way up |
| Atoms in observable universe | 2^266 | ~10^80 | 266 | Up and down (2 sets of stairs) |
| Universe per star | 2^346 | ~10^104 | 346 | Up, down, up again 76 steps (2.5 sets) |
| Universe per atom | 2^532 | ~10^160 | 532 | Up and down twice (4 sets of stairs) |
Even our most extreme scenario requires fewer search steps than it takes to climb the Spanish Steps four times - though the stairs are quite famous, so you may need to dodge a person or three.
This engenders the question of what practical difference exists between O(n log n) and O(n) for any fathomable application in our observable universe. We readily drop constants when setting our complexity bounds, and yet max(log n) ≈ 266 for anything within the observable universe! Could we thus think of O(n log n) as effectively O(266 n) for any application in the observable universe? There are many scenarios where a constant c may very well be larger than log n - let alone the observable-universe maximum of log n.
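To make that concrete, here is a small sketch comparing a linear algorithm with a large constant factor against an n log n algorithm with a small one; the constants are invented purely for illustration:

```python
from math import log2

LINEAR_CONSTANT = 500  # hypothetical per-item cost of a "slow" O(n) algorithm
NLOGN_CONSTANT = 1     # hypothetical per-item cost of a "fast" O(n log n) algorithm

for exponent in [6, 9, 12, 23, 80]:  # from a million items up to every atom
    n = 10**exponent
    linear_cost = LINEAR_CONSTANT * n
    nlogn_cost = NLOGN_CONSTANT * n * log2(n)
    winner = "O(n log n)" if nlogn_cost < linear_cost else "O(n)"
    print(f"n = 10^{exponent:>2}: log2(n) ≈ {log2(n):6.1f}, cheaper algorithm: {winner}")

# With these made-up constants, the O(n log n) algorithm wins at every scale the
# observable universe can offer, because log2(n) never climbs past ~266.
```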
The real runtime cost of our Binary Tree of the Universe comes from constructing and modifying the tree, but who's to say the universe didn't pre-index itself during inflation, assigning each atom a unique id as it popped into existence?
Archimedes said, “Give me a lever long enough and a place to stand, and I will move the world.” Let’s stand on the shoulders of giants and extend the quote:
“Give me a lever long enough and a place to stand,
And I will move the world.
Give me a logarithmic function,
And I will put the world in the palm of your hand!”
This exercise has convinced me that the computer scientists who argue O(n) and O(n log n) are practically equivalent are onto something. The universe and logarithmic functions may be teaching us some practical lessons.
First, when and how we construct our tree is the hard part - something obvious to anyone who has had to index a sufficiently large database. Second, if the performance difference between O(n) and O(n log n) actually matters, constants are just as consequential. And third, everyone is just two flights of stairs away from everything else in the universe.
--------
Part 2 of this series: Earth-scale log n vs Cosmological log n.
One of the great sources of evil in our time is the pursuit of purity. Purity of thought. Purity of practice. Purity of process. Purity of ideology.
The language of tyrants, rent-seeking NIMBYists, spiritual hypocrites, gatekeepers, small-minded middle managers, the petty, and the self-righteous has its roots in the pursuit of purity. Purity is absolute, and the absolute is incapable of change. And being incapable of change will paralyze us and leave us severely disadvantaged for navigating what is to come: the age of seven broken invariants. They have co-existed with us since we took our first hesitant steps onto the plains, and consciousness first ignited in our small corner of the universe. Now, not just one but seven of these invariants are on the precipice of being broken:
Minds are biologically-bound - Only biologically-born humans are capable of the depth of pattern recognition, planning, introspection, and intelligence humanity has. Nothing thinks faster than we do.
Minds are born roughly equal - Our brains’ size and capabilities fall within a tight distribution of outcomes, enabling near intellectual parity between individuals.
Physical agency is biologically-bound - Navigating, manufacturing, and using tools through diverse and sophisticated environments.
Consciousness is Earthbound - We are limited to a single planet[1].
Energy is scarce - We must always operate within its limitations and downstream effects on production.
Genes are gifted - Only our parents and the heavens are capable of sculpting our genetic code.
Death is the great leveller - It comes for us all.
Artificial intelligence, brain-machine interfaces, robotics, off-planet colonization, net positive energy output (nuclear fusion), gene editing, and life extension will each break and falsify these axioms[2]. Each is a watershed moment with no analogous historical equivalent.
Given the singularity curve, perhaps in the future there will be moments when even more invariants are broken in a shorter span of time, but today's testing ground will serve as an example for those future generations - if we are successful.
That's exciting and beckons toward a future full of unimaginable depth, mystery, and discovery, but it's also quite scary. This isn't a book. We need to get it right. I'm working to gather and sort out for myself the principles for reaching a meaningful outcome[3]. The good news is that such seismic changes do not invalidate our hard-earned, collective wisdom. As Marcus Aurelius said about change:
Is any man afraid of change? Why, what can take place without change? What then is more pleasing or more suitable to the universal nature? And canst thou take a bath unless the wood undergoes a change? And canst thou be nourished, unless the food undergoes a change? And can anything else that is useful be accomplished without change? Dost thou not see then that for thyself also to change is just the same, and equally necessary for the universal nature? [...]
The universe is change; our life is what our thoughts make it.
Marcus Aurelius, Meditations
When a new idea or perspective is birthed in a group of people, whether its originator is labeled improper and impure or innovative and visionary is a matter of perspective and interpretation. How can the definition of what is pure ever change if that purity doesn't introduce some impurity - some change - into itself? Each year the ideological test for passing as "woke" or "religiously devout" shifts a little, always impossible to pass, for no two administrations are the same. What, then, is responsible for these small differences even among those who misguidedly seek purity? It's change.
Purity is an impossibility and a rejection of the universe itself. The purest state of the universe is complete entropy and its inevitable heat death. That would not be a very pleasant place to live! Luckily the universe has given us the gift of time to discover ourselves and who we can and should be. For change is the universal essence, and our power lies in choosing the what and why of that change.
[1] I expect this to be the last to be broken in any meaningful way. Physics and the speed of causality are cruel mistresses.
[2] There are likely more, but seven is a beautiful number for a quick tentative list. Please drop me a note with any other invariants you can think of.
[3] Similarly, please drop me a note if there are any principles you’ve been mulling over for navigating the upcoming era.