Recently, the host of a technology and AI podcast that I respect asked a guest something like this:
It’s so hard to know which AI model is best. Gemini 2.5 is topping most leaderboards, Claude 4 is great at coding, ChatGPT o3 is strong for reasoning, and then there’s DeepSeek and Llama which offer strong performance at a lower price. How do you know which one to choose for each task?
Interesting question! But is it the right one?
I’m not sure. I call this tendency to fixate on over-specific performance comparisons between models ‘AI model envy’. I argue in this post that AI model users should be wary of model envy, focusing more on where they are heading and less on the exact model they are using to get there.
It is easy to steelman the argument for AI model envy:
AI models keep improving rapidly at a variety of tasks.
It makes sense to choose an AI model that is relatively better at a given task (taking into account cost and other considerations).
This may seem obvious. However, in my view, it’s too obvious. The analysis should not be so simple, for at least two reasons:
Generative AI based on large language models (LLMs) does not currently do what the hype generated by its boosters suggests.
Focusing on model capabilities distracts us from figuring out how to use the models optimally.
Let’s take these points in turn.
Distinguishing between reality and hype
LLM-based generative AI models are tremendously useful tools for a wide variety of tasks. General-purpose models like ChatGPT, Gemini and Claude are excellent productivity-enhancing tools for professionals and entrepreneurs like me—assuming that they are used with due attention to hallucination and fact-checking. Coding tools like Cursor and Windsurf are substantially increasing the efficiency of software development. And there are many other specialized examples—in areas like voice generation, law, bid writing and many others.
Not surprisingly, given the massive size of the AI opportunity, the accompanying hype cycle is equally impressive. I won’t go into any depth here on details of the marketing-driven hype for generative AI. Others have commented on this at length—Gary Marcus and Ed Zitron are particularly visible. But three cautionary observations on the hype are crucial:
Measuring LLM performance is challenging. There are various reasons to be skeptical of performance-measuring benchmarks, which are central to the hype cycle.
There are fundamental technical limitations on what LLMs can do. Recently, researchers at Apple published an excellent, careful paper ‘The Illusion of Thinking’, which demonstrates that the ‘reasoning’ capabilities of LLM-based ‘large reasoning models’ (LRMs) collapse quickly beyond a relatively low level of complexity. The Apple paper poses (but does not explicitly answer) a question that these results suggest: “Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?” I recently proposed an answer in ‘AI and the Curse of Dimensionality’: we know that LLMs are doing pattern-matching, and the inherent challenges of dimensionality mean that LLMs will not be able to do so reliably for complex problems (particularly edge cases with limited data, where error and hallucination pose the greatest risks). A toy illustration of how quickly complexity outruns pattern-matching follows this list.
Humans are naturally inclined to credit AIs with human-like capabilities that they do not have. This has been obvious at least since Joseph Weizenbaum developed ELIZA in the 1960s. The title of Apple’s paper makes the point that the apparent ‘thinking’ by LRMs is a data-driven illusion.
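To make the complexity point concrete, here is a toy calculation of my own (not code from the Apple paper): Tower of Hanoi, one of the puzzle families used in that study, requires 2^n − 1 moves for n discs, so the length of the required solution grows exponentially. A model that has absorbed patterns of short solutions from its training data runs out of relevant patterns very quickly as n increases.

```python
# Toy illustration: how quickly puzzle complexity outgrows any fixed set of
# memorized solution patterns. Tower of Hanoi needs 2**n - 1 moves for n discs.

def hanoi_moves(n: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n discs."""
    return 2 ** n - 1

for n in range(1, 16):
    print(f"{n:2d} discs -> {hanoi_moves(n):6d} moves in the optimal solution")
```

A system that actually executes the recursive algorithm handles 15 discs as easily as 3; a system that relies on having seen similar move sequences before does not.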
These factors underline that while we should stay attentive to progress and open to exploring new models amid the AI whirlwind, we should not take the hype-driven promises of the latest AI advancements at face value.
Avoiding distractions from use-specific performance
This brings us to the fundamental problem of AI model envy: it tends to distract us from figuring out how to use AI models optimally. This distraction has important technical and practical dimensions:
A generative AI model is often not the right AI tool to solve a problem. Current AI hype (and AI model envy) focuses on generative AI, especially LLMs and LRMs, but these are far from the only types of AI models out there. Long before the release of ChatGPT, predictive (as opposed to generative) AI was already delivering important results on problems such as image recognition, protein folding and game play. Deployments of predictive and compound AI systems continue to expand, and are the basis for some of the most important and widespread applications of AI, including self-driving cars, medical diagnostics and language translation. Use of such methods is particularly important for applications that cannot safely employ generative AI methods that are prone to hallucination and other highly stochastic behavior.
Simple AI tools often produce excellent results. While the providers of LLMs and LRMs have strong incentives to convince us that we need their biggest, best and most expensive models—just as marketers and others have always tried to convince us in other areas—the reality is that simple models often do the trick. For example, my startup PlaylistBuilder uses a now-standard convolutional neural network to make predictions about which YouTube videos are likely to be relevant and of high quality (although we are also exploring LLM-based features); a minimal sketch of this kind of model follows this list. AI research is primarily about getting better results through cutting-edge R&D, but for many day-to-day tasks, the cutting edge is just not needed.
Model evaluation is time-consuming. When we hire a human, we don’t usually spend a lot of time measuring their performance on every task, in part because such measurements (a) take a lot of time (particularly as applied to a specific context) and (b) are subjective. We look for great employees (just like we aim to use great AI models), but few of us have ‘staff envy’ that leads us to abandon current employees for ones living in the greener grass on the other side of the fence. Rather, we accept that employee performance will be variable, and we seek to optimize the performance of current employees through training and other means. Of course, this analogy has the obvious flaw that humans as a group are not improving like AI models are. But there is still power to the analogy, because it does often make sense to optimize business operations for a current AI model, rather than too frequently exploring new ones.
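As promised above, here is a minimal sketch of the kind of ‘boring’, well-understood convolutional model that often suffices. This is an illustrative PyTorch example under an assumed framing (scoring video thumbnails as relevant or not); it is not PlaylistBuilder’s actual code, and the ThumbnailScorer name, input size and layer sizes are made up for the example.

```python
# Illustrative only: a small CNN that scores a video thumbnail as relevant/not.
# Hypothetical architecture; nothing here is cutting-edge, and that is the point.
import torch
import torch.nn as nn

class ThumbnailScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, 1),  # single logit: how likely the video is relevant
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example usage: score a batch of 64x64 RGB thumbnails (placeholder data).
model = ThumbnailScorer()
thumbnails = torch.randn(8, 3, 64, 64)
relevance = torch.sigmoid(model(thumbnails))  # values in (0, 1)
print(relevance.shape)  # torch.Size([8, 1])
```

A model like this trains in minutes on modest hardware, behaves deterministically at inference time, and is easy to evaluate—none of which requires the latest frontier LLM.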
So my advice is, don’t fall for the purveyors of AI model envy. Although models continue to improve rapidly, don’t quickly assume that newer and bigger is better. Rather, it’s important to ask questions like:
What type of AI application does my use case require?
Does my application require generative AI?
Are there simpler and more reliable choices than LLMs and LRMs for what I want to do?
If I am satisfied with the performance of my current LLM/LRM, is it efficient to upgrade now?
In many cases, going with the latest, most advanced AI model can be the right choice. But before you do so, take a deep breath, ask some questions, do some research, and try not to get distracted by performance comparisons.