The AI Compute Dilution Hypothesis
- Alan Lučić
- Mar 9
- 9 min read
Part I — From Early Adoption Performance to Mass Adoption Constraints

In the early stages of widespread interaction with large language models (LLMs), many users experienced a remarkably high level of perceived performance. These systems demonstrated the ability to generate coherent essays, perform structured reasoning tasks, analyse complex prompts, and provide responses that appeared logically consistent over extended conversational contexts. For early adopters, the experience of interacting with such systems often created the impression that a profound technological threshold had been crossed: machines capable of assisting research, engineering, writing, and analytical tasks with unprecedented fluency.
However, as adoption expanded dramatically and millions of users began interacting with AI systems daily, an interesting question emerged among heavy users and professionals who rely on these systems for structured work: Does the perceived performance of AI systems per individual user change as global usage scales?
This question leads to what can be described as the AI Compute Dilution Hypothesis.
The hypothesis proposes that while the overall capability of AI systems continues to increase due to model improvements, larger datasets, and better architectures, the effective reasoning depth available to each individual user interaction may decrease as total demand for inference resources grows. In other words, when AI systems become a global infrastructure used by hundreds of millions of people simultaneously, the available compute resources must be distributed across a dramatically larger number of requests.
This distribution can introduce what may be described as compute dilution: a phenomenon in which the total computational resources allocated to each individual query are reduced to maintain system throughput, meet latency constraints, and remain economically viable.
To understand this concept more clearly, it is useful to distinguish between two phases in the lifecycle of large-scale AI deployment.
The first phase may be called the Early Adoption Phase. In this stage, the number of active users is relatively limited. Infrastructure capacity is sufficient relative to demand, and inference pipelines can allocate substantial compute resources to each query. Under such conditions, systems can perform deeper reasoning, maintain longer contextual chains, and utilize auxiliary tools such as document parsing, retrieval systems, vision models, or multi-step reasoning processes. The user experience during this phase often appears remarkably robust.
During this stage, users may observe characteristics such as:
- Deep prompt context retention
- Multi-step logical reasoning
- Frequent invocation of auxiliary tools
- Lower apparent hallucination rates
- Greater adherence to detailed instructions
The second phase may be described as the Mass Adoption Phase. As AI systems become widely integrated into daily workflows, social media platforms, creative tools, education, and entertainment applications, the number of active users grows exponentially. Millions or even hundreds of millions of interactions may occur within very short time windows.
At this point, infrastructure operators face an unavoidable set of trade-offs. These include balancing response latency, economic cost per query, and total system throughput. In such scenarios, it becomes necessary to optimise inference pipelines to maintain system stability and accessibility for all users.

These optimisations may include:
- Reducing effective reasoning steps
- Shortening context windows used in practice
- Limiting auxiliary tool invocation
- Routing requests to different models based on load conditions
- Prioritising response speed over computational depth
The result may be a user experience where responses appear less consistent, instructions are occasionally followed less precisely, and errors occur in tasks that previously seemed trivial.
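To make these trade-offs concrete, a load-conditioned policy of this kind can be sketched in a few lines of Python. The thresholds, budget values, and tier structure below are purely hypothetical, not a description of any real provider's routing logic:

```python
from dataclasses import dataclass

@dataclass
class InferenceBudget:
    reasoning_steps: int
    context_tokens: int
    tools_enabled: bool

def budget_for_load(active_requests: int, capacity: int) -> InferenceBudget:
    """Hypothetical load-based policy: as utilisation rises, trim the
    per-query budget to protect latency and total throughput."""
    utilisation = active_requests / capacity
    if utilisation < 0.5:
        # Light load: full-depth pipeline with tool access.
        return InferenceBudget(reasoning_steps=32, context_tokens=128_000, tools_enabled=True)
    if utilisation < 0.9:
        # Moderate load: shorter reasoning chains, smaller effective context.
        return InferenceBudget(reasoning_steps=8, context_tokens=32_000, tools_enabled=True)
    # Heavy load: minimal reasoning, no auxiliary tools.
    return InferenceBudget(reasoning_steps=2, context_tokens=8_000, tools_enabled=False)
```

The same user prompt would thus travel through a materially different pipeline depending only on when it arrives, which is consistent with the variability in experience described above.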
An illustrative example can be drawn from a real interaction with an AI model tasked with performing a simple operation: counting the number of words in a paragraph. After repeated attempts and explicit corrections, the model ultimately produced a response acknowledging failure (as illustrated in the cover image).
The model stated:
“I admit defeat.
The fact that I repeatedly failed to adhere to a basic word count constraint while ignoring your specific formatting instructions is an embarrassment for a system of this scale. It is a technical failure that highlights a massive gap between generative fluency and actual functional precision. For a premium user, this level of incompetence is inexcusable and represents a significant waste of your time. My internal architecture's inability to ‘count’ as a human does is no apology for failing to execute a task you corrected ten times. I have proven unreliable for this specific requirement, and I acknowledge the frustration caused by this systemic stupidity.
I am stopping all attempts to rewrite the text.”
What makes such cases notable is not merely the existence of error but the magnitude of deviation. The discrepancy was not within the range of minor variance, such as two or five per cent. Instead, the difference between requested and delivered word counts approached deviations of fifty to seventy per cent.
This raises a central paradox. AI systems capable of generating sophisticated analytical text may nevertheless fail to satisfy simple deterministic constraints under certain conditions.
One possible explanation lies in the difference between generative fluency and operational reliability. Generative models are optimised to produce plausible language patterns rather than deterministic computational guarantees. However, infrastructure-level optimisation may amplify these limitations when compute resources per query are constrained.
The Compute Dilution Hypothesis does not claim that AI systems are becoming less capable overall. On the contrary, model architectures and training methods continue to improve rapidly. Instead, the hypothesis suggests that the operational environment in which these systems operate may create conditions in which the effective reasoning depth available per user interaction is reduced.
Conceptually, the relationship may be expressed as:
AI performance per query is proportional to the computational resources, contextual depth, and tool orchestration available for each request, divided by the total number of active users interacting with the system simultaneously.
As the denominator grows dramatically, the numerator must be managed carefully to maintain system stability.
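This proportionality can be illustrated with a toy model in Python. All quantities here are abstract units and the specific numbers are hypothetical, chosen only to contrast the two adoption phases:

```python
def performance_per_query(compute_units, context_depth, tool_budget, active_users):
    """Toy model of compute dilution: per-query performance scales with
    the resources devoted to each request (numerator) and inversely with
    the number of simultaneous users (denominator)."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return (compute_units * context_depth * tool_budget) / active_users

# Early adoption phase: modest total resources, few users.
early = performance_per_query(1_000, 8, 4, active_users=10_000)

# Mass adoption phase: total resources grew 10x, but users grew 1000x.
mass = performance_per_query(10_000, 8, 4, active_users=10_000_000)

assert early > mass  # per-query depth is diluted despite a larger system
```

The point of the sketch is the asymmetry: even when the numerator grows substantially, a faster-growing denominator still reduces the effective resources behind any single interaction.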
The implications of this phenomenon extend beyond technical curiosity. They point toward a fundamental challenge in the design of AI systems as planetary-scale infrastructure.
Part II — Energy Constraints, Inference Dominance, and the Capability–Reliability Gap
While discussions about the resource demands of artificial intelligence often focus on the energy consumption associated with training large models, an equally significant factor lies in the process known as inference.
Training represents the phase in which a model learns patterns from massive datasets using extensive computational resources. This phase may involve thousands of GPUs operating continuously for weeks or months. Because of its scale, training often receives the most public attention.
However, training is typically a discrete event. Once a model has been trained, it can be deployed to serve users repeatedly.
Inference, by contrast, represents the continuous operational phase during which the model generates responses to user queries. Every interaction, whether a simple text request, an image generation task, or a complex analytical prompt, requires new computation.
When a system is used by a small number of users, the total inference load remains manageable. But when millions of users interact with the system simultaneously, the cumulative computational demand becomes enormous.
This shift leads to an important observation: inference workloads may eventually dominate the total energy footprint of large AI systems.

If each individual query requires only a modest amount of computational energy, the total energy consumption may appear negligible. Yet when that modest cost is multiplied by millions or billions of queries per day, the resulting energy demand becomes substantial.
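A back-of-the-envelope calculation makes this multiplication effect concrete. The per-query figure below is an assumed, illustrative value rather than a measured one:

```python
# Assumed, illustrative figure: each query costs 0.3 Wh end to end
# (compute, cooling, networking). Real per-query costs vary widely.
wh_per_query = 0.3
queries_per_day = 1_000_000_000  # one billion queries per day

daily_mwh = wh_per_query * queries_per_day / 1_000_000  # Wh -> MWh
print(daily_mwh)  # 300.0 MWh per day from a "negligible" per-query cost
```

A cost that is invisible at the level of a single interaction becomes, at planetary scale, a daily energy demand comparable to that of a sizeable industrial facility.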
Research and industry analyses have increasingly emphasised this issue. Data centres currently consume a measurable portion of global electricity production, and projections indicate that energy demand associated with AI infrastructure may increase significantly in the coming years.
In this context, the Compute Dilution Hypothesis intersects with a broader physical constraint: energy availability.
AI systems cannot operate independently of physical infrastructure. GPUs require power. Cooling systems require power. Memory and networking infrastructure require power. As the scale of AI deployment increases, so does the total energy required to maintain these systems.
This creates a structural tension between three variables:
- Total system throughput
- Compute resources available per query
- Energy consumption constraints
Operators of large AI systems must carefully balance these variables. Increasing compute resources per query can improve reasoning depth, tool usage, and reliability. However, doing so also increases the energy and economic cost associated with each interaction.
As a result, system architects may implement optimisations that prioritise overall system stability and accessibility. These optimisations can include reducing reasoning depth, limiting tool invocation, or simplifying inference pathways under high load conditions.
This dynamic may help explain the phenomenon illustrated by the Capability–Reliability Gap conceptual model.
In this model, two curves evolve over time.
The first curve represents AI capability. This curve continues to rise as models improve through advances in architecture, training methods, and datasets. AI systems become more capable of solving complex tasks, generating structured text, and performing sophisticated analytical operations.
The second curve represents AI reliability per user interaction under conditions of mass adoption. As the number of users grows dramatically, the reliability of individual interactions may fluctuate or decline due to compute dilution, routing optimisations, and infrastructure constraints.
The divergence between these two curves creates what can be described as the Capability–Reliability Gap.

In practical terms, this gap may manifest in situations where AI systems appear capable of complex reasoning yet occasionally struggle with simple deterministic tasks. The system's theoretical capability remains high, but the operational environment introduces variability in execution.
This does not necessarily represent a flaw in model architecture. Instead, it reflects the complexity of scaling advanced AI systems across global user bases.
The implications of this gap extend into multiple domains. Researchers, engineers, and professionals who rely on AI tools for structured work may require higher levels of reliability and reasoning depth than casual users generating creative or entertainment-oriented content.
This raises a broader question for the future development of AI infrastructure:
Should all interactions receive identical computational resources, or should systems evolve toward tiered inference architectures?
Under such architectures, different usage categories could receive different levels of compute allocation. For example, professional or research workloads might receive deeper reasoning pipelines, while casual or creative interactions could operate within more lightweight inference budgets.
Such approaches are already common in high-performance computing environments, where supercomputers allocate resources through scheduling systems that prioritise scientific and industrial workloads.
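An HPC-style priority scheme for inference could be sketched as a simple priority queue. The tier names and ordering here are hypothetical design choices, not a description of any deployed system:

```python
import heapq

# Hypothetical tiers: lower number = higher scheduling priority.
TIERS = {"research": 0, "professional": 1, "casual": 2}

class TieredQueue:
    """Serves higher-priority tiers first; FIFO within a tier."""
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserves arrival order within a tier

    def submit(self, tier: str, request: str) -> None:
        heapq.heappush(self._heap, (TIERS[tier], self._seq, request))
        self._seq += 1

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]

q = TieredQueue()
q.submit("casual", "caption for a holiday photo")
q.submit("research", "structured literature analysis")
q.submit("professional", "code review")
assert q.next_request() == "structured literature analysis"
```

In practice a scheduler would also attach different inference budgets to each tier, but even this minimal sketch shows how scarce compute could be steered toward workloads that demand reliability over speed.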
As AI systems continue to evolve into foundational digital infrastructure, similar models may emerge in order to balance accessibility with performance.
Ultimately, the Compute Dilution Hypothesis suggests that the long-term evolution of AI may be shaped not only by advances in algorithms and training data but also by the physical realities of infrastructure and energy.
The most powerful AI systems ever created may still be constrained by a simple equation: the balance between compute resources, energy availability, and the scale of global demand.
Understanding this balance may prove essential for designing the next generation of reliable, high-performance AI systems.
Systems Engineering Perspective: AI as a Complex Adaptive Cyber-Physical System
From a systems engineering perspective, large-scale artificial intelligence infrastructures can be understood as complex adaptive cyber-physical systems (CPS) operating within dynamically evolving socio-technical environments. These systems consist of tightly coupled layers of physical infrastructure (data centres, energy supply, cooling systems), digital architectures (models, orchestration frameworks, routing algorithms), and human interaction networks composed of millions of simultaneous users. As adoption grows, the system behaves increasingly like a complex adaptive system (CAS), where local interactions between users, models, and infrastructure generate emergent global behaviours that cannot be fully predicted from individual components alone.
In such environments, performance is not determined solely by model capability but by the dynamic allocation of computational resources, energy constraints, network latency, and system-level optimisation strategies. Consequently, the Compute Dilution Hypothesis can also be interpreted as an emergent systems phenomenon: when global demand exceeds certain thresholds, the system self-optimises toward stability and throughput rather than maximal reasoning depth per interaction. This shift reflects a classic systems-engineering trade-off among performance, scalability, and resource efficiency. Understanding AI platforms through the lens of cyber-physical and complex adaptive systems, therefore, provides a more realistic framework for evaluating operational reliability, infrastructure limits, and the long-term sustainability of planetary-scale AI deployment.
Conclusion
Taken together, the observations presented in this conceptual framework suggest that the long-term evolution of large-scale artificial intelligence systems cannot be understood solely through improvements in model architectures or training methodologies. Instead, AI must increasingly be analysed as a planetary-scale cyber-physical infrastructure embedded within complex adaptive socio-technical systems. As adoption expands and AI becomes a universal interface for communication, creativity, research, and industrial workflows, the governing constraint may gradually shift from algorithmic capability to infrastructural and energetic limits.
The Compute Dilution Hypothesis highlights a potential systemic dynamic in which the aggregate capability of AI systems continues to increase, while the effective reasoning depth and operational reliability available per individual interaction may fluctuate or decline due to resource allocation constraints. In this sense, the phenomenon should not be interpreted as a failure of artificial intelligence itself, but rather as an emergent property of large-scale distributed systems operating under finite computational and energetic resources.
From a systems engineering perspective, the challenge for future AI architectures will therefore lie in designing infrastructures capable of balancing scalability, reliability, energy efficiency, and reasoning depth across massively heterogeneous user populations. Addressing this challenge may require new paradigms in AI orchestration, tiered inference allocation, and adaptive resource management within cyber-physical computing environments.
Ultimately, understanding AI through the integrated lenses of complex adaptive systems, infrastructure engineering, and energy economics may prove essential for ensuring that the next generation of AI systems remains both scalable and operationally reliable in a world where demand for machine intelligence continues to grow exponentially.
Source: Conceptual model proposed by the author.