LLMs, World Models, Reinforcement Learning and the Future of AI
LLMs are an insufficient solution to intelligence
There is a pitched battle underway between large language model (LLM) promoters who argue that LLMs can provide the central basis for ongoing advances in AI, and LLM skeptics who believe that LLMs have fatal weaknesses that require other techniques. This battle is one of those described in my recent post “Who to Believe About AI”, where I offer the rule of thumb that such battles often indicate the existence of a useful ‘middle path’ between the two opposing viewpoints.
In the battle over LLMs, that middle path seems to be this:
LLMs are phenomenally useful tools, not least because they have essentially ‘solved’ human language from an AI perspective and provide a tremendously useful interface to a large proportion of all human knowledge.
However, LLMs have significant limitations, for technical reasons that I explored in detail in “AI and the Curse of Dimensionality”. As a result, LLMs are an insufficient solution for machine intelligence.1
From the perspective of this middle path, the battle over LLMs is a distraction from the likely future of advanced AI, in which LLMs will play an important but not dominant role. That is, LLMs (or at least some similar form of natural language interface) appear necessary to that future, but not sufficient. Although the details of future AI are inherently speculative, there are important directional indications, including in three areas that I explore below:
what we know about learning and behavior, from observing human and machine intelligence
the crucial role of world models
the significant promise of reinforcement learning.
What we know about learning and behavior
We know a lot about human learning and behavior—and this provides important lessons for machine learning and AI.
Multiple cognitive processes
One of the most basic things we know is that human learning is not a unitary process. My favorite example is the two independent systems for judging numerosity:
subitizing allows us to instinctively judge the exact number of a small group of objects—up to about four or five for most people
the Approximate Number System (ANS) allows us to (imprecisely) estimate and compare larger quantities of objects.
There is extensive evidence that the two systems involve different processes in the brain.
Better known is the work of Daniel Kahneman and Amos Tversky—popularized by Kahneman in Thinking, Fast and Slow—showing that humans have different cognitive processes for automatic, fast, intuitive thinking (System 1) vs. effortful, slow, deliberate thinking (System 2).
This phenomenon of multiple components of human intelligence has clear implications for machine learning, such as in Moravec’s paradox that:
it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.
Moravec’s observation from 1988 has become increasingly untrue for perception since the advent of deep learning 15 years ago, and it is now becoming largely untrue for mobility through advanced robotics. However, his central insight that different human skills require different computational approaches for AI remains demonstrably true.
Mechanisms of learning
Another aspect of human learning that has been extensively studied involves different mechanisms of learning, such as experimentation, imitation and innate capability. The idea that learning occurs primarily through experimentation (and associated reward) is strongly associated with behaviorism, including the ideas of Ivan Pavlov and B.F. Skinner. Learning by imitation is often associated with the social learning theory of Albert Bandura. And Noam Chomsky theorizes that many important aspects of human cognition (notably language acquisition) are biologically hard-wired rather than learned.
This is only a small sample of the many theories of how humans learn. The point is that mechanisms of learning, like cognitive processes, are highly varied.
In this too, there are clear links to machine learning. For example, the distinction between learning by experimentation vs. imitation was a key point on a recent episode of the Dwarkesh podcast featuring Richard Sutton, who is best known for his 2019 observation in “The Bitter Lesson”:
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
This ‘bitter lesson’ directly supports the data-driven, computational approach that has led to LLMs—as opposed to the rule-based approaches that dominated AI until the early 21st century. Nevertheless, Sutton argued on the podcast that the imitational approach of LLMs will ultimately be unsuccessful compared to models that use experimental approaches based on world models and reinforcement learning (which are the focus of the second and third sections of this post). He supported this point with an observation (disputed by Dwarkesh) that human babies learn primarily by experimentation rather than imitation.2
There are many other links between mechanisms of human learning and approaches to machine learning and data science. Taking a couple of examples:
There has been significant recent attention to the challenges that AI models have with planning—i.e. multi-step decision processes such as figuring out how to accomplish a complex project, or how to get to an international destination (e.g. on foot at home, then taxi, foot again at the airport, airplane, foot again, taxi again and foot again). AI OG Yann LeCun spoke eloquently about these challenges of hierarchical planning when he appeared on the Lex Fridman podcast. There has been meaningful recent progress in teaching LLMs how to do planning (notably this MIT paper), but doing so without a world model (see below) is challenging.
Less recently, the move of the statistics community from frequentist analysis towards Bayesian analysis reflects the reality that new human beliefs are formed by updating upon acquisition of new data (i.e. experimentation), rather than by simple reliance on a concatenation of past events (which is more like imitation).
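To make the Bayesian-updating point concrete, here is a minimal sketch of belief updating with a Beta-Bernoulli model (the function and variable names are illustrative and not taken from any statistics library): each new observation shifts the current belief, rather than the belief being recomputed as a simple tally of the whole past.

```python
# Minimal sketch: Bayesian belief updating with a Beta-Bernoulli model.
# Each new observation (an 'experiment') updates the current belief,
# rather than the belief being recomputed from a tally of all past events.

def update_belief(alpha: float, beta: float, observation: int) -> tuple[float, float]:
    """Update a Beta(alpha, beta) belief about a success probability
    after observing a single success (1) or failure (0)."""
    return alpha + observation, beta + (1 - observation)

alpha, beta = 1.0, 1.0                    # uniform prior: Beta(1, 1)
for obs in [1, 1, 0, 1]:                  # a stream of new data
    alpha, beta = update_belief(alpha, beta, obs)

posterior_mean = alpha / (alpha + beta)   # current best estimate of the probability
print(f"Posterior mean after 4 observations: {posterior_mean:.2f}")  # 0.67
```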
Implications for machine learning and AI
This knowledge and experience about diversity of human cognitive processes and learning mechanisms shows beyond doubt that human intelligence is a complex, multifarious process. It makes sense that the same applies to AI.
This was the central point of the influential 2024 essay from Berkeley AI Research (BAIR) “The Shift from Models to Compound AI Systems”:
state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models.
The BAIR paper envisions LLMs being core components of such compound systems, which aligns with the central argument of this post that we are on a middle path between the predictions of LLM promoters and skeptics.
There are already many examples of diverse/compound AI systems achieving world-leading results. Perhaps the best example is AlphaFold, for which Demis Hassabis and John Jumper won the 2024 Nobel Prize in Chemistry and which is based upon multiple deep neural networks (no LLMs) performing different tasks.
There are multiple examples of compound AI systems that combine LLMs with other techniques, such as:
search and answer engines like Google AI Overviews or Perplexity, which combine LLMs with a traditional web search index
multimodal content generators like Midjourney, Veo and Sora, which combine LLMs (for query parsing) with diffusion models
code generation tools like GitHub Copilot or Claude, which combine LLMs (for code suggestion) with other code analysis tools
robotics systems, which combine LLMs (for command interpretation, and sometimes planning) with vision models and other control systems.
Another interesting compound AI system is the Joint Embedding Predictive Architecture (JEPA) developed by Yann LeCun and his team at Meta (again without LLMs), which is particularly effective at ‘understanding’ video. Crucially, JEPA is based around the idea of a ‘world model’, which is the first of two important directions for the future of AI that I want to address here.
The role of world models
A simple definition of a ‘world model’ is
a function that maps the current state of the world (x(t)) and an action (a(t)) to a prediction of the next state (x(t+1)).
In somewhat more detail, a world model combines three fundamental abilities (although arguably the third ability is an application of the model rather than the model itself), sketched in code after the list:
representation learning: producing a compact mathematical representation of the environment
prediction: forecasting potential future states of the environment based on available data
planning: choosing the best course of action based on predictions.
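To make the definition and the three abilities concrete, here is a deliberately tiny Python sketch (the class, method and variable names are all invented for this illustration, and the ‘environment’ is a single temperature reading rather than anything realistic): an encoder produces a compact representation, a learned transition table makes predictions, and a one-step planner chooses the action whose predicted next state scores best.

```python
from collections import defaultdict

class TinyWorldModel:
    """Illustrative world model: maps (state, action) to a predicted next state."""

    def __init__(self, actions):
        self.actions = actions
        # transition counts learned from experience: (state, action) -> {next_state: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def encode(self, observation):
        """Representation learning stand-in: compress a raw observation
        into a compact state (here, just round a continuous reading)."""
        return round(observation, 1)

    def observe(self, obs, action, next_obs):
        """Learn dynamics from an experienced transition."""
        self.counts[(self.encode(obs), action)][self.encode(next_obs)] += 1

    def predict(self, obs, action):
        """Prediction: the most frequently observed next state for (state, action)."""
        outcomes = self.counts[(self.encode(obs), action)]
        if not outcomes:
            return self.encode(obs)       # no data yet: assume nothing changes
        return max(outcomes, key=outcomes.get)

    def plan(self, obs, score):
        """Planning: pick the action whose predicted next state scores best."""
        return max(self.actions, key=lambda a: score(self.predict(obs, a)))

# Usage: learn that 'heat' raises a temperature reading, then plan towards 20.0
model = TinyWorldModel(actions=["heat", "wait"])
model.observe(19.0, "heat", 19.5)
model.observe(19.0, "wait", 19.0)
print(model.plan(19.0, score=lambda z: -abs(z - 20.0)))  # -> "heat"
```

Real world models such as JEPA or those used in model-based RL replace the lookup table with learned neural networks operating in latent spaces, but the division of labour is the same.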
It is very clear that LLMs lack robust world models, because they lack the key first component: a representation of the world. As Alberto Romero has put it, “AI can predict the sun will rise again tomorrow, but it can’t tell you why”. As this issue became clear, some made the silly argument that LLMs have world models because they can make predictions about geographical relationships. That argument both misses the generality of what we mean by ‘world’ in ‘world model’ (it’s about the whole state of an environment, not just its physical geography) and confuses LLMs’ impressive abilities at next-token prediction in complex contexts with a representation of the world.3
Indeed, it is mathematically impossible for LLMs based on statistical next-token prediction to acquire robust, granular world models, for reasons that I explored at length in “AI and the Curse of Dimensionality”. To simplify my core point in that post: the world’s dimensionality is far too great to be represented by an LLM, regardless of the ongoing growth of computational power. Or as Yann LeCun has put it:
If you try to train a system to predict every pixel in a video, you’re basically setting it up for failure.
Of course, LLMs do extremely well in seeming to understand the state of the world in common (and increasingly somewhat less common) situations, but they fail in ways that are difficult to predict, such as when edge cases are not present in the training data (e.g. causing Teslas to crash) or when a lack of understanding leads to dangerous behavior (e.g. encouraging teenage suicide).
Without a representation of the world, it is impossible for an AI model to do effective planning—because planning requires the ability to predict the consequences of actions. This crucial role of world models is why Meta leads its introduction to V-JEPA-2 (the latest version of JEPA) by stating that it “is a world model that achieves state-of-the-art performance on visual understanding and prediction in the physical world”.
In addition to forecasting, optimal planning requires some concept of a desirable end state (e.g. planning a way to start World War III is very possible, but (for most) not desirable).4
Identifying and achieving the right goals is where reinforcement learning comes in.
The promise of reinforcement learning
As with world models, let’s start with a definition of reinforcement learning (per Wikipedia):
Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control5 concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal.
The key concepts here are (1) ‘tak[ing] actions’, (2) ‘in a dynamic environment’ and (3) ‘to maximize a reward signal’. The first two concepts are directly linked to the concept of a world model. The third concept is the one identified above of pursuing desirable goals, by associating those goals with rewards.
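All three elements are visible even in the simplest RL algorithms. The sketch below is a toy tabular Q-learning loop on a made-up ‘walk to the goal’ environment (the environment, hyperparameters and names are invented for illustration, not drawn from any particular library): the agent takes actions, the environment returns new states, and the reward signal is what gets maximized.

```python
import random

# Toy environment: states 0..4 on a line, goal at state 4; actions move left or right.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]

def step(state, action):
    """The dynamic environment: take an action, receive (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

# Q-learning: estimate the value of each (state, action) pair from experience.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != GOAL:
        # Explore occasionally; otherwise act greedily to maximize the reward signal.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learned greedy policy: move right (+1) from every non-goal state.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
```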
This is a very powerful construct for effective machine learning to control behavior toward defined goals. Notable successes have included:
optimization of real-world systems, such as self-driving vehicles, integrated circuit design, cooling in Google data centers, (increasingly) robotics and (potentially) power grid management
game playing, including the original DeepMind work on Atari games using the DQN model and the 2016 victory of DeepMind’s AlphaGo program over Lee Sedol
recommender systems, such as those used by Netflix and Spotify6, which treat past user preferences as the goal to be optimized
fine-tuning of LLMs, notably the use of reinforcement learning from human feedback (RLHF), the key technique that led to the initial success of ChatGPT, which involves evaluation of LLM outputs by humans and application of RL to those human preferences.
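To illustrate the preference-learning step at the heart of RLHF in a heavily simplified form (the feature vectors below are made up and stand in for embeddings of full LLM outputs; a real reward model would be a neural network, not a linear model), a reward model can be fitted to pairwise human judgements with a Bradley-Terry-style logistic objective:

```python
import numpy as np

# Each pair: features of the response the human preferred vs. the one they rejected.
preferred = np.array([[1.0, 0.2], [0.9, 0.1], [0.8, 0.3]])
rejected  = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]])

w = np.zeros(2)        # toy linear reward model: reward(x) = w @ x
lr = 0.5

for _ in range(200):
    margin = (preferred - rejected) @ w        # reward margin for each pair
    p = 1.0 / (1.0 + np.exp(-margin))          # model's P(human prefers 'preferred')
    # Gradient ascent on the Bradley-Terry log-likelihood, sum(log p)
    grad = ((1.0 - p)[:, None] * (preferred - rejected)).mean(axis=0)
    w += lr * grad

print(w)  # the preferred feature earns positive reward, the rejected one negative
```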
While these successes are impressive, RL currently works across a fairly limited number of domains of knowledge and human activity—unlike LLMs, which are useful in practically every domain. The challenge with increasing the breadth of application of RL is that effective RL requires a well-specified value function to define the goal of optimization, and a way to update that value function based upon new data.7 This is not easy in many domains—for example, what is the value function for a happy relationship, or an interesting holiday?8
Former OpenAI board member Helen Toner provides a useful framework for this challenge in “2 big questions for AI progress in 2025-2026”, which observes that RL is most successful for post-training LLMs in domains that are subject to ‘auto-grading’ (i.e. automating evaluation of correct answers), such as math or coding. This framework leads Toner to ask two questions about prospects for AI progress:
Beyond math and coding, where else can you automatically grade answers to hard problems?
How much will improving performance in auto-graded areas spill over into strong performance on other tasks?
That is, Toner sees the ability to generalize RL as the central determinant of progress with AI in the near future.9 This is a remarkable vote for the centrality of RL to the future of AI, while also endorsing the continued importance of LLMs by focusing on RL as a method for fine-tuning LLMs.
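To make the ‘auto-grading’ idea concrete, here is an illustrative sketch of a reward function for math-style problems (the task format and function name are invented for this example): the model’s final numeric answer can be checked programmatically, and the binary result used directly as an RL reward signal. There is no analogous grader for ‘plan an interesting holiday’.

```python
import re

def auto_grade(model_output: str, expected_answer: float, tol: float = 1e-6) -> float:
    """Return an RL reward of 1.0 if the model's final numeric answer is correct, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0                        # no numeric answer found
    final_answer = float(numbers[-1])     # treat the last number as the final answer
    return 1.0 if abs(final_answer - expected_answer) < tol else 0.0

print(auto_grade("So the total is 42", expected_answer=42))        # 1.0
print(auto_grade("I think it's roughly 40", expected_answer=42))   # 0.0
```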
The path forward for AI
We cannot see that path forward for AI with clarity. There have been many important surprises in the progress of AI in the past 15 years—most remarkably the huge success of LLMs—and there will be many more such surprises.
However, from the evidence and experience outlined above, two hypotheses about that future appear highly likely:
The continued pathway towards machine intelligence will follow the pattern of diversity that we see in human intelligence—involving compound AI systems rather than unitary models.
LLMs, world models and reinforcement learning will all be central to the progress of AI in the coming decades.
I’d appreciate views on these hypotheses, including whether there are other hypotheses about the future of AI that are likely to be broadly predictive with a high degree of confidence. Hypotheses that are more specific than the above predictions are of particular interest.
In fact, science appears to support the view that babies learn by both experimentation and imitation.
Many today focus on reinforcement learning as a discipline of machine learning, but its roots can be traced directly to the earlier ideas of dynamic programming articulated in the 1950s by Richard Bellman, whose Bellman Equation provides a formulation for evaluating the value of a decision in terms of potential future states and their associated rewards.
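In one common modern form, for the optimal value function V over states s, the equation reads V(s) = max_a [ R(s, a) + γ Σ_s′ P(s′ | s, a) V(s′) ], where R(s, a) is the immediate reward, γ is a discount factor and P(s′ | s, a) is the probability of reaching state s′ after taking action a in state s.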
Spotify’s methods are based upon Bayesian belief updates about user preferences, which links to the point above about Bayesian methods being analogous to learning by experimentation.
Modern RL uses two general approaches to updating value functions: Monte Carlo methods (which perform updates based upon the ultimate outcome of an ‘episode’—such as victory or defeat in a game) and temporal difference learning (which involves updates at each time step). Intermediate between these two approaches are n-step methods, which involve updates after n time steps, where n may take any integer value greater than 1.
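For concreteness, here is a sketch of the two update rules for a table of state-value estimates V (the variable names are generic, not from any particular codebase):

```python
# Monte Carlo update: wait until the episode ends, then move V[s] toward the
# actual return G that was observed from state s onward.
def mc_update(V, s, G, alpha=0.1):
    V[s] += alpha * (G - V[s])

# Temporal-difference (TD(0)) update: after a single step, move V[s] toward the
# immediate reward plus the discounted value estimate of the next state.
def td_update(V, s, r, s_next, gamma=0.9, alpha=0.1):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```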
It is also important to ensure that RL models are implemented in a way that optimizes the value function in the way intended. This may not occur where the RL agent engages in ‘reward hacking’, by “exploit[ing] flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task”. An example is a robot hand that was trained to grab objects but instead learned to trick its human evaluators by placing itself between the object and the camera.
Progress on technical methods to implement RL will also be important, such as recent work by a team at Duke and UC Berkeley.