"Vibe Coding" vs Reality


There's a trend on social media where many repeat Andrej Karpathy's words (archived): "give in to the vibes, embrace exponentials, and forget that the code even exists." This belief — like many flawed takes humanity holds — comes from laziness, inexperience, and self-deluding imagination. It is called "Vibe Coding."

head-empty
"Embrace the exponentials" sounds like it came from an NFT junkie.
shinji-cup
Like the NFT crowd, there is a bubble of unreality they cling to that justifies their perception of the world.

Producing software is now more accessible than ever, as newer tools let people describe what they want in natural language to a large language model (LLM). The idea is catching on because LLM agents are now available to anyone willing to subscribe to vendors like Cursor, GitHub, Windsurf, and others. These editors have an "agent" option where users can request something and, in response, changes are made to the appropriate files rather than only the file currently in focus. Over time, the agent will ask to run commands that execute tests, or even scripts it previously wrote to the file system, much as you would if you were solving the problem yourself.

In 2022, folks could copy code into ChatGPT and ask questions or for rewrites.

In 2023, folks could ask it to review and edit a single file with an IDE integration like Copilot.

In 2024 and 2025, folks could ask it to solve a specific problem in the project and have it figure out which files to edit, edit them, verify its own work, and correct any mistakes using feedback from linting errors and unit tests.

With LLM agents having so much capability, people can delegate to an LLM the work of refining their imprecise ideas into a precise implementation. This is "Vibe Coding."

@a16z @stuffyokodraws First - what is vibe coding?

A concise definition from @stuffyokodraws, and then an exploration of how technical vs. non-technical users approach these tools.

If you open a blank folder and tell it to set up an initial project, it can do a lot at once. With no rules, no patterns to mimic, and no constraints, it can produce something that feels more tailored for you in minutes than npx create-react-app ever could.

With a simple instruction like "I want to create a website for my ski resort" and about ten minutes of having it massage errors of its own making, I can have just that.

A generated website about a ski resort with a phrase like 'Easy to Reach, Hard to Leave'

These leaps of progress are what fuel the "Vibe Coding" idea. To go from nothing to something shareable and personal sounds incredible.

beat-saber
This moment provided a thrill I hadn't experienced in a long time when coding. However, this excitement drained quickly the further I got from a blank canvas.

Agents, as a concept, aren't new. Google I/O coined buzzwords like "agentic era" (archived) to describe the idea. It has since been realized through open technologies like AutoGPT and XAgent, and more recently by Anthropic with the Model Context Protocol (MCP).

When the model can interact with more than just a person proxying its outputs into different domains, it is autonomous. If it can perform searches on the web or in a codebase, it can enrich its own context with the information it needs to fulfill the current request. Further, when it can commit outputs and then gain immediate, automatic feedback on those outputs, it can refine its solution without a person intervening.
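
In code, that autonomy loop might look something like the sketch below. This is a minimal sketch under heavy assumptions: the Action type, llm.complete, and runTool are invented stand-ins, not any vendor's real API.

```typescript
// Hypothetical sketch of an agent's autonomy loop; every name here is
// illustrative, not any vendor's real API.
type Action =
  | { kind: "search"; query: string }               // enrich context from web or codebase
  | { kind: "edit"; file: string; patch: string }   // commit an output
  | { kind: "run"; command: string }                // lint, unit tests, scripts
  | { kind: "done"; summary: string };

async function runTool(action: Action): Promise<string> {
  // Stub: a real agent would shell out, apply patches, or query an index here.
  return `executed ${action.kind}`;
}

async function agentLoop(
  request: string,
  llm: { complete(context: string): Promise<Action> },
): Promise<string> {
  let context = request;
  for (let step = 0; step < 50; step++) {           // bounded so a confused model cannot loop forever
    const action = await llm.complete(context);
    if (action.kind === "done") return action.summary;
    const feedback = await runTool(action);         // immediate, automatic feedback
    context += `\n${JSON.stringify(action)}\n${feedback}`; // the model refines its next step on this
  }
  throw new Error("agent exceeded its step budget");
}
```

The important property is the last line of the loop: every tool result is appended back into the context, which is exactly why long sessions eventually overflow the window.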

There are actions that do prompt the user for consent before proceeding, such as running commands in the console or deleting files. This consent can be pre-approved with a mode called "YOLO."

Cursor settings YOLO mode, allows running commands automatically

we-live-in-a-society
A mode for "You Only Live Once"!? Really?

You can witness this autonomy for yourself today in Cursor.

The agent concept has merit, and today it can deliver proofs of concept that VC firms like Y Combinator will invest in: proofs of concept that are trash, built by unskilled founders hoping to win the lottery while living a life of leisure.

I’ve cracked vibe coding, TrendFeed has almost hit its first 10k month, and Ai built the entire thing

Im just sitting here sipping coffee, coding with Ai + MCP

Also more time to shitpost on X haha
cheers
The optimal technical founder for a VC is not the 10x engineer. It is someone who'll deliver enough of a product to test its fitness in the market and then succeed in raising more investment money. Their execution on their vision and their hiring prowess matter more than their technical skillset.

The execution of agents today is over-hyped and does not hold up to the needs of any functioning business, which needs experts to develop and maintain its technical capabilities instead of relying on single points of failure found on the internet.

babe, come to bed

i can't, i'm vibe coding

These models are trained on average, sloppy code, wrong answers from Stack Overflow, and the junk that ends up on Quora. Despite the power and capability Claude 3.7 Sonnet shows in small contexts, when faced with even a small codebase it makes constant silly mistakes that no normal developer would make, and it repeats them every hour of its operation.

Specific details on the mistakes, feel free to skip
  • Regularly clones TypeScript interfaces instead of exporting the original and importing it (see the sketch after this list).
  • Reinvents components all the time with the same structure, without searching the codebase for an existing copy of that component.
  • Writes trusted server-side logic on the client side, using RPC calls to update the database.
  • As a feature develops, it prioritizes maintaining previous mistakes instead of re-evaluating its design, even when told to do so. You have to say the previous implementation is outright unusable for it to replace its design.
  • Cursor has some sort of "concise mode" (archived) that they'll turn on under high load, where requests are still billed at the normal price but the model behaves in a useless manner. This mode will omit details, drop important findings, and corrupt the output being produced.
  • Cannot be trusted to produce unit tests with decent coverage.
  • Will often break the project's code to fit a unit test rather than fix the unit test when told to do so.
  • When told to fix styles with precise details, it will alter the wrong component entirely.
  • When told specifically where there are many duplicated components and instructed to refactor, will only refactor the first instance of that component in the file instead of all instances in all files.
  • When told to refactor code, fails to search for the breaks it caused even when told to do so.
  • Will merrily produce files over 1000 lines which exceed its context window over time, even when told to refactor early on.
  • Will regularly erase entire route handlers if not bound to the file hierarchy.
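
To make the first of those mistakes concrete, here is a minimal TypeScript illustration; the Invoice type and file paths are invented for the example.

```typescript
// reports/summary.ts
//
// billing/invoice.ts already exports this type:
//   export interface Invoice { id: string; amountCents: number; dueDate: string }
//
// What the agent writes: a private re-declaration that silently drifts
// as the original evolves.
interface Invoice {
  id: string;
  amountCents: number; // dueDate already forgotten
}

// What a developer writes instead, reusing the single source of truth:
// import type { Invoice } from "../billing/invoice";

export function summarize(invoices: Invoice[]): number {
  // With the cloned type, this compiles today and breaks subtly later.
  return invoices.reduce((total, invoice) => total + invoice.amountCents, 0);
}
```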

As currently designed, these models cannot learn new information. They cannot do better than the dataset they were created from. Instead, their capability is determined by how effectively they can process the tokens entering their context window.

If you ask Claude 3.7 Sonnet to develop a runtime schema for validating some domain-specific language, and then ask it to refactor the file because it has grown too large for its context window, it will degrade and output incoherent nonsense before finishing its work.

Now that we've created all the schemado that: ... I'v schema files for each schema schemaschema schemaactored code?

wat
It did not type "I've" correctly and conjoined the words "schema" and "refactored" into one.
my saas was built with Cursor, zero hand written code

AI is no longer just an assistant, it’s also the builder

Now, you can continue to whine about it or start building.

P.S. Yes, people pay for it

You cannot ask these tools today to develop a performant React application. You cannot ask these tools to implement a secure user registration flow. They will choose to execute functions like "is user registered" on the client instead of the server.
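
A sketch of that class of mistake, with every name in it (rpc, db, the handlers) invented for illustration:

```typescript
// Stand-ins so the sketch is self-contained; both are hypothetical.
declare function rpc(method: string, args: object): Promise<any>;
declare const db: {
  users: { exists(query: object): Promise<boolean>; insert(row: object): Promise<void> };
};

// What the agent generates: the check runs in the browser, so anyone
// can call the RPC directly and skip it entirely.
async function signUpOnClient(email: string) {
  const registered = await rpc("isUserRegistered", { email });
  if (!registered) {
    await rpc("insertUser", { email, plan: "pro" }); // client decides; server blindly obeys
  }
}

// What it should generate: the server owns the decision.
async function signUpOnServer(email: string) {
  if (await db.users.exists({ email })) throw new Error("already registered");
  await db.users.insert({ email, plan: "free" });    // validated where attackers cannot reach
}
```

The second version is boring, but boring is the point: the decision lives where an attacker cannot edit it.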

trash
Others are learning this the hard way too.
guys, i'm under attack

ever since I started to share how I built my SaaS using Cursor

random thing are happening, maxed out usage on api keys, people bypassing the subscription, creating random shit on db

as you know, I'm not technical so this is taking me longer that usual to figure out

for now, I will stop sharing what I do publicly on X

there are just some weird ppl out there

Without expert intervention, the best these tools can do today is produce a somewhat functional mockup, where every future change beyond that risks destroying existing functionality.

I cannot, and would not, trust a team member who vibe codes in a production application. The constant negligence I observe in "Vibe Coding" is atrocious and unacceptable to a customer base of any size.

No available model demonstrates the consistent attention to detail needed for a production environment. They are not yet equipped or designed to transform information across the multiple contexts inherent to producing a digital product.

These tools are optimized to produce solutions that fit in a single screen of markdown, and they are now being asked to do far more than they were trained for. As the context window overflows and the model degrades, it will fail to even format MCP calls correctly; upon reaching this point of no return, it produces a log that comes across as tortured. Like a robot losing a limb, it will try and try again to walk, only to fall down, until the editor pauses the conversation to save on resources.

Let me try a different approach. Error calling tool.

Working around the problem

A modern "Twitch Plays Pokémon" is going on right now: Claude Plays Pokémon. It mitigates the context window problem by starting a new context seeded with information from its previous incarnation, in the form of many Markdown files that it can read as if new and search via MCP during its playthrough.

So, what makes this possible? Claude was given a knowledge base to store notes, vision to see the screen, and function calls which allow it to simulate button presses and navigate the game.

Together, they allow Claude to sustain gameplay with tens of thousands of interactions.
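
The shape of those function calls might look something like the sketch below. The Claude Plays Pokémon harness is not public, so the tool names and schemas here are guesses at an MCP-style setup, not the project's actual code.

```typescript
// Illustrative MCP-style tool definitions; every name and schema here is
// an assumption about how such a harness could be wired up.
const tools = [
  {
    name: "press_button",
    description: "Simulate a Game Boy button press",
    inputSchema: {
      type: "object",
      properties: {
        button: { enum: ["a", "b", "start", "select", "up", "down", "left", "right"] },
      },
      required: ["button"],
    },
  },
  {
    name: "read_notes",
    description: "Search the Markdown knowledge base left by prior incarnations",
    inputSchema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
];
```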

Even so, it can make bad assumptions and spend 43 hours intentionally blacking out over and over in Mt. Moon (an in-game area between story locations), making no effective progress toward its next goal, because by the time it could second-guess itself, its context window was no longer fit to continue.

galaxy-brain2
It did escape and progress, but only after the critic instance of the model suggested its assumption was incorrect.

After a context cleanup completes, which takes about five minutes (the video above is edited down to the meaningful moments), the model proceeds to make the same mistakes its prior incarnation did. The notes it wrote are not meaningfully interpreted in context; I find the same happens with the Cursor rules I write.

While increasing the length of the context window will improve some immediate experiences, this is a problem of scale that needs a different solution for agents to be more effective and, perhaps, move "Vibe Coding" closer to reality.

thinker
Would a formalized bullet journal over MCP help a model deliver more complete and reliable results?
As long as the model correctly checks it before concluding its work is complete!
point-left

Bullet journal with ski examples

A bullet journal may be one of many tools that improve the reliability of the models we have today.
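
As a thought experiment, such a journal could be exposed to the agent as a small set of tools. Everything in this sketch is hypothetical; nothing like it ships today.

```typescript
// Hypothetical journal tools an agent could be required to consult before
// declaring its work complete. None of this exists in Cursor today.
interface JournalEntry {
  kind: "task" | "event" | "note";
  text: string;
  done: boolean;
}

const journal: JournalEntry[] = [];

function logEntry(kind: JournalEntry["kind"], text: string): void {
  journal.push({ kind, text, done: false });
}

function openTasks(): JournalEntry[] {
  // The agent would be instructed to call this before concluding, and to
  // refuse to finish while any task remains open.
  return journal.filter((entry) => entry.kind === "task" && !entry.done);
}

// Example entry from the ski resort project above:
logEntry("task", "lift ticket pricing table still unstyled");
```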

The next issue is that these models cannot ingest information from multiple concurrent real-time sources. In one terminal we may be running the server, and in another some end-to-end tests. Both of these terminals were created at the agent's request. Yet it either ignores, or is never fed, the stack trace logged by the server in the first terminal while it watches the end-to-end tests in the second fail and retry, fail and retry.
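
One naive mitigation I can imagine, sketched below with Node's child_process: interleave both terminals into a single timestamped transcript the agent can actually read. The tagging scheme and npm scripts are my invention.

```typescript
import { spawn } from "node:child_process";

// Naive sketch: run both processes and interleave their output into one
// timestamped transcript, so the server's stack trace lands next to the
// end-to-end failure that caused it.
const transcript: string[] = [];

function pipeTagged(tag: string, command: string, args: string[]) {
  const child = spawn(command, args);
  for (const stream of [child.stdout, child.stderr]) {
    stream.on("data", (chunk: Buffer) => {
      transcript.push(`[${new Date().toISOString()}] [${tag}] ${chunk.toString().trimEnd()}`);
    });
  }
  return child;
}

pipeTagged("server", "npm", ["run", "dev"]);
pipeTagged("e2e", "npm", ["run", "test:e2e"]);
// Feeding `transcript` to the agent would give it both sources in causal order.
```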

For agents to have the impact promised by the hype, LLMs need a robust mechanism to mimic the development of short- and long-term memory without fine-tuning the memories into the model.

Furthermore, for agents to contribute to a team, there must be a way to develop long-term memories bound to the organization and its products that seamlessly merge and reconcile with memories personal to each team member.

And lastly, these memories have to be portable. As models improve and are integrated into our tools, domain specific memories must be usable by the next generation of large language models.

Conclusion

"Vibe Coding" might get you 80% the way to a functioning concept. But to produce something reliable, secure, and worth spending money on, you’ll need experienced humans to do the hard work not possible with today’s models.

Agents do demonstrate enough capability that LinkedIn CEO influencers confidently spread the unreality that we can replace jobs with "agentic AI."

Agents do enable skilled people to create more independently than they ever have. For the time being, they will not replace those who can solve the hard problems that only experience and intuition can identify. Like other no-code solutions, agents give the less skilled more capability than they had the day before. Until they develop their own competent skill set, "Vibe Coders" will not be able to release production-quality software in this world, no matter how exponential the agent is over their own inferior skill set.

Keep an eye on how LLM agents develop and improve. For now, they are worth evaluating and discussing, but are not ready for us to delegate the precise task of creating reliable, secure, and scalable software that powers our society. "Vibe Coding" will not create the next big thing in 2025.