"Vibe Coding" vs Reality
There's a trend on social media where many repeat Andrej Karpathy's words (archived): "give in to the vibes, embrace exponentials, and forget that the code even exists." This belief — like many flawed takes humanity holds — comes from laziness, inexperience, and self-deluding imagination. It is called "Vibe Coding."
Producing software is now more accessible: newer tools allow people to describe what they want in natural language to a large language model (LLM). The idea is catching on because LLM agents are now available to anyone willing to subscribe to vendors like Cursor, GitHub, Windsurf, and others. These editors offer an "agent" option where users request something and the agent makes changes to the appropriate files, rather than only the file currently in focus. Over time, the agent will ask to run commands — executing tests or scripts it previously wrote to the file system — much as you would if you were solving the problem yourself.
In 2022, folks could copy code into ChatGPT and ask questions or for rewrites.
In 2023, folks could ask it to review and edit a single file with an IDE integration like Copilot.
In 2024 and 2025, folks could ask it to solve a specific problem in the project and have it figure out which files to edit, edit them, verify its own work, and correct its mistakes using feedback from linting errors and unit tests.
With LLM agents having this much capability, people can delegate the work of refining their imprecise ideas into a precise implementation entirely to an LLM. That delegation is "Vibe Coding."
A concise definition from @stuffyokodraws, and then an exploration of how technical vs. non-technical users approach these tools.
If you open a blank folder and tell the agent to set up an initial project, it can do a lot at once. With no rules, no patterns to mimic, and no constraints, it can produce something that feels more tailored for you in minutes than npx create-react-app ever could.
With a simple instruction like "I want to create a website for my ski resort" and about ten minutes of having it massage errors of its own making, I can have just that.
These leaps of progress are what fuel the "Vibe Coding" idea. To go from nothing to something shareable and personal sounds incredible.
Agents, as a concept, aren't new. Google I/O made up buzzwords like "agentic era" (archived) to describe it, and the concept has been realized through open technologies like AutoGPT, XAgent, and more recently by Anthropic with the Model Context Protocol (MCP).
When the model can interact with more than just a person who proxies its outputs into different domains, it is autonomous. If it can perform searches on the web or in a codebase, it can enrich its own context with the information it needs to fulfill the current request. Further, when it can commit outputs and then get immediate, automatic feedback on them, it can refine its solution without a person intervening.
Some actions do prompt the user for consent before proceeding, such as running commands in the console or deleting files. This consent can be pre-approved with a mode called "YOLO."
You can witness this autonomy for yourself today in Cursor.
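To make that loop concrete, here is a minimal sketch of how such an agent might be wired up, in TypeScript. The names — agentLoop, the llm callback, the npm test command — are my own illustration, not any vendor's actual API:

```ts
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

interface Edit {
  path: string;
  contents: string;
}

// llm stands in for the vendor's model call; it returns proposed file edits.
async function agentLoop(
  request: string,
  llm: (prompt: string) => Promise<Edit[]>,
): Promise<boolean> {
  let feedback = "";
  for (let attempt = 0; attempt < 5; attempt++) {
    // Ask the model for edits, folding in feedback from the previous run.
    const edits = await llm(`${request}\n\nErrors from last attempt:\n${feedback}`);
    // In a real editor, each write or command prompts for consent
    // unless the user has pre-approved everything ("YOLO" mode).
    for (const { path, contents } of edits) writeFileSync(path, contents);
    try {
      execSync("npm test", { stdio: "pipe" }); // linting and tests close the loop
      return true; // green build: the agent considers the request fulfilled
    } catch (err) {
      feedback = String((err as { stdout?: Buffer }).stdout ?? err);
    }
  }
  return false; // gave up after repeated failed attempts
}
```

The whole trick is that loop: the model's own failures become the next prompt, so no human has to ferry error messages back and forth.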
The agent concept has merit, and today it can deliver proofs of concept that VC firms like Y Combinator will invest in — proofs of concept that are trash, built by unskilled founders hoping to win the lottery while living a life of leisure.
Im just sitting here sipping coffee, coding with Ai + MCP
Also more time to shitpost on X haha
The execution of agents today is over-hyped and does not hold up to the needs of any functioning business, which needs experts to develop and maintain its technical capabilities rather than relying on single points of failure found on the internet.
i can't, i'm vibe coding
These models are trained on average, sloppy code, wrong answers on Stack Overflow, and the junk that ends up on Quora. Despite the power and capability Claude 3.7 Sonnet shows in small contexts, when faced with even a small codebase it constantly makes silly mistakes that no normal developer would make, and it repeats them every hour of its operation.
- Regularly clones TypeScript interfaces instead of exporting the original and importing it (see the sketch after this list).
- Reinvents components with the same structure, over and over, without searching the codebase for an existing copy of that component.
- Writes trusted server-side logic on the client, using RPC calls to update the database directly.
- As a feature develops, it prioritizes maintaining previous mistakes instead of re-evaluating its design, even when told to do so. You have to say the previous implementation is outright unusable for it to replace its design.
- Cursor has some sort of "concise mode" (archived) that they'll turn on under high load, where requests are still billed at the normal rate but the model behaves uselessly: it omits details, drops important findings, and corrupts the output being produced.
- Cannot be trusted to produce unit tests with decent coverage.
- Will often break the project's code to fit a unit test rather than fix the unit test when told to do so.
- When told to fix styles with precise details, it will alter the wrong component entirely.
- When told specifically where there are many duplicated components and instructed to refactor, will only refactor the first instance of that component in the file instead of all instances in all files.
- When told to refactor code, fails to search for the breakage it causes, even when explicitly instructed to.
- Will merrily produce files over 1000 lines which exceed its context window over time, even when told to refactor early on.
- Will regularly erase entire route handlers if not bound to the file hierarchy.
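To make the first failure on that list concrete, here is an illustrative sketch; the file names and the User interface are hypothetical, not from any real codebase:

```ts
// shared/user.ts — the original interface, already exported:
export interface User {
  id: string;
  email: string;
}
```

```ts
// profile.ts — what the agent tends to write: a local clone of the same
// shape, which silently drifts from shared/user.ts as the schema evolves.
interface User {
  id: string;
  email: string;
}

// What any developer would write instead: one line importing the source of truth.
// import type { User } from "./shared/user";
```

The clone compiles today, but the next schema change updates only one copy, and the type system never warns you that the two have diverged.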
As currently designed, these models cannot learn new information. They cannot do better than the dataset they were created with. Instead, their capability is realized by how effectively they can process the tokens entering their context window.
If you ask Claude 3.7 Sonnet to develop a runtime schema for validating some domain-specific language and then ask it to refactor the file — because it is too large for its context window to continue — it will degrade and output incoherent nonsense before finishing its work.
AI is no longer just an assistant, it’s also the builder
Now, you can continue to whine about it or start building.
P.S. Yes, people pay for it
You cannot ask these tools today to develop a performant React application. You cannot ask them to implement a secure user registration flow: they will choose to run checks like "is user registered" on the client instead of the server, as the sketch below shows.
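Here is a hedged sketch of that anti-pattern next to the fix; rpc, db, and the function names are hypothetical stand-ins, not any real framework's API:

```ts
// Hypothetical helpers so the sketch is self-contained.
declare function rpc(method: string, params: object): Promise<any>;
declare const db: {
  transaction<T>(fn: (tx: {
    findUserByEmail(email: string): Promise<unknown>;
    insertUser(row: { email: string }): Promise<void>;
  }) => Promise<T>): Promise<T>;
};

// The anti-pattern: the client decides whether registration is allowed,
// then calls open RPC endpoints that trust it. Anyone can skip the check
// and call insertUser directly.
async function registerOnClient(email: string) {
  const taken = await rpc("isUserRegistered", { email });
  if (!taken) await rpc("insertUser", { email });
}

// The secure flow: one authoritative server-side handler validates,
// checks uniqueness, and inserts atomically, so bypassing the UI gains
// an attacker nothing.
async function registerOnServer(email: string) {
  await db.transaction(async (tx) => {
    if (await tx.findUserByEmail(email)) throw new Error("already registered");
    await tx.insertUser({ email });
  });
}
```

The two functions look almost identical, which is exactly why a model trained on surface patterns keeps picking the wrong one.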
ever since I started to share how I built my SaaS using Cursor
random thing are happening, maxed out usage on api keys, people bypassing the subscription, creating random shit on db
as you know, I'm not technical so this is taking me longer that usual to figure out
for now, I will stop sharing what I do publicly on X
there are just some weird ppl out there
Without expert intervention, the best these tools can do today is produce a somewhat functional mockup, where every future change beyond that risks destroying existing functionality.
I cannot — and would not — trust a team member who vibe codes in a production application. The constant negligence I observe in "Vibe Coding" is atrocious and unacceptable to a customer base of any size.
No available model demonstrates the consistent attention to detail needed for a production environment. They are not yet equipped or designed to synthesize information across the multiple contexts inherent to producing a digital product.
These tools are optimized to produce solutions that fit in a single screen of markdown, and they are now being asked to do far more than they were trained for. As the context window overflows and the model degrades, it will fail to even format MCP calls correctly; upon reaching this point of no return, it produces a log that reads as if the model is being tortured. Like a robot losing a limb, it will try and try again to walk, only to fall down until the editor pauses the conversation to save on resources.
Working around the problem
A modern "Twitch plays Pokémon" is going on right now: Claude Plays Pokémon. It mitigates this context window problem by starting a new context with seeded information provided by its previous incarnation in the form of many Markdown files, which it can then read as if new and search via MCP during its playthrough.
Together, they allow Claude to sustain gameplay with tens of thousands of interactions.
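The pattern, as I understand it, looks roughly like this sketch; the notes directory and function names are my invention, not the project's actual code:

```ts
import { readFileSync, writeFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const NOTES_DIR = "./notes"; // hypothetical home for the Markdown memory files

// Before the context window fills, the agent persists what it has learned.
function saveMemory(topic: string, markdown: string): void {
  writeFileSync(join(NOTES_DIR, `${topic}.md`), markdown); // e.g. notes/mt-moon.md
}

// A fresh incarnation starts by reading every note back into its prompt,
// and can search them again via MCP tools mid-playthrough.
function seedNewContext(): string {
  return readdirSync(NOTES_DIR)
    .filter((file) => file.endsWith(".md"))
    .map((file) => readFileSync(join(NOTES_DIR, file), "utf8"))
    .join("\n\n");
}
```

The notes survive the reset; whether the next incarnation actually heeds them is another matter.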
Even so, it can make bad assumptions and spend 43 hours intentionally blacking out over and over in Mt. Moon (an in-game area between story locations), making no effective progress towards its next goal, because by the time it could second-guess itself, its context window was no longer fit to continue.
After a context cleanup completes, which takes about five minutes (the video above is edited down to the meaningful moments), the model proceeds to make the same mistakes its prior incarnation did. The notes it wrote are not meaningfully interpreted in context; I find the same happens with the Cursor rules I write.
While increasing the length of the context window will improve some immediate experiences, this is a problem of scale that needs a different solution for agents to be more effective and, perhaps, move "Vibe Coding" closer to reality.
A bullet journal may be one of many tools that improve the reliability of the models we have today.
The next issue is that these models cannot ingest information from multiple concurrent real-time sources. In one terminal we may be running the server and in another some end-to-end tests, both created at the agent's request. The agent either ignores or is never fed the stack trace logged by the server in the first terminal while it watches the end-to-end tests fail and retry, fail and retry. A sketch of what feeding it both streams could look like follows.
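One imaginable workaround is to multiplex both terminals into a single timestamped transcript the agent actually reads. This sketch assumes Node-style child processes and illustrative npm scripts; nothing here is how any current editor works:

```ts
import { spawn } from "node:child_process";

// Tag every line from a process with its source and a timestamp, then
// push it into one shared transcript.
function watch(label: string, command: string, args: string[], sink: (line: string) => void) {
  const child = spawn(command, args);
  for (const stream of [child.stdout, child.stderr]) {
    stream.on("data", (chunk: Buffer) => {
      for (const line of chunk.toString().split("\n")) {
        if (line.trim()) sink(`[${new Date().toISOString()}] [${label}] ${line}`);
      }
    });
  }
}

// Both terminals feed one ordered transcript, so the server's stack trace
// lands right next to the end-to-end failure it explains.
const transcript: string[] = [];
watch("server", "npm", ["run", "dev"], (line) => transcript.push(line));
watch("e2e", "npm", ["run", "test:e2e"], (line) => transcript.push(line));
```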
For agents to have the impact promised by the hype, LLMs need a robust mechanism to mimic the development of short and long term memory without fine-tuning the memories into the model.
Furthermore, for agents to contribute to a team, there must be a way to develop long-term memories bound to the organization and its products that seamlessly merge and reconcile with memories personal to each team member.
And lastly, these memories have to be portable. As models improve and are integrated into our tools, domain-specific memories must be usable by the next generation of large language models.
Conclusion
"Vibe Coding" might get you 80% the way to a functioning concept. But to produce something reliable, secure, and worth spending money on, you’ll need experienced humans to do the hard work not possible with today’s models.
Agents do demonstrate enough capability that LinkedIn CEO influencers confidently spread the unreality that we can replace jobs with "agentic AI."
Agents do enable skilled people to create more independently than they ever have. For the time being, they will not replace those who can solve the hard problems that only experience and intuition can identify. Like other no-code solutions, agents do give the less skilled more capability than they had the day before. Until they develop their own competent skill set, "Vibe Coders" will not be able to release production-quality software in this world, no matter how exponential the agent is over their own inferior skill set.
Keep an eye on how LLM agents develop and improve. For now, they are worth evaluating and discussing, but are not ready for us to delegate the precise task of creating reliable, secure, and scalable software that powers our society. "Vibe Coding" will not create the next big thing in 2025.