Why OpenAI o1 Sucks at Coding

While OpenAI’s o1 series of models is known for its exceptional reasoning capabilities, several developers have reported that these models, especially o1-mini, are not the best option for programming-related tasks.

Slow responses continue to irk developers, but the issue goes beyond response time. A developer wrote on Hacker News that the o1-preview model hallucinated to the point where it began answering in terms of non-existent libraries and functions.

“It’s the usual string of ‘You’re absolutely correct, and I apologise for the oversight in my previous response.’ While the reasoning may have been improved, this doesn’t solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not,” he explained further.
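One partial guard against the hallucinated libraries described above is to check whether the modules imported by generated code can be resolved at all. The following is a minimal sketch, not something the quoted developer describes: the function name `unresolved_imports` and the sample module name are hypothetical, and the check only catches missing top-level modules in the local Python environment, not invented functions inside real libraries.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found locally."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to the local package.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    # find_spec returns None when no installed module matches the name.
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)


# Hypothetical model output: one real stdlib import, one invented library.
generated = """
import json
import totally_made_up_http_lib
from os import path
"""
print(unresolved_imports(generated))
```

A non-empty result suggests the model may have invented a dependency, although it can also simply mean a real package is not installed.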

ChatGPT 4o is still better than the o1 model

For such reasons, developers are calling the o1 models overhyped. Moutaz Alkhatib, lead software developer at Yieldlove, said he regrets purchasing the Plus tier of ChatGPT, which he bought specifically to use the o1 models, and that he will not renew the subscription.

The ‘Thinking’ Part

When AIM compared multiple LLMs on LiveBench’s coding completion tests, the results were shocking: o1-mini ranked below both the open-source model Qwen2-72B and GPT-4.

For any developer working against deadlines, response time comes first. But even setting response time aside, multiple developers have mentioned that the model gets stuck after the thinking phase and does not respond at all.

Mike Young, while reviewing the o1 models, mentioned that the increased response time during the thinking stage can be a big deterrent, especially when you require quick answers. “The model sometimes gets stuck in thinking mode and never returns a response—happening about 40% of the time. It acts like it’s done processing, but the answer never comes – it’s often just a blank reply or just a few characters,” he added further.
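A common workaround for blank or truncated replies like these is to treat very short responses as failures and ask again. The sketch below assumes a hypothetical `ask` callable that wraps whatever client is in use; the `min_chars` and `attempts` defaults are illustrative values, not figures from Young’s review.

```python
from typing import Callable


def ask_with_retry(ask: Callable[[str], str], prompt: str,
                   min_chars: int = 20, attempts: int = 3) -> str:
    """Call `ask` until it returns a reply of at least `min_chars` characters."""
    last_reply = ""
    for _ in range(attempts):
        last_reply = ask(prompt).strip()
        if len(last_reply) >= min_chars:
            return last_reply
    raise RuntimeError(
        f"no usable reply after {attempts} attempts (last: {last_reply!r})"
    )


# Stand-in model that returns a blank reply, then a few characters, then a real answer.
replies = iter(["", "ok", "Here is the full answer you asked for."])
answer = ask_with_retry(lambda prompt: next(replies), "Explain this bug")
print(answer)
```

Retrying does nothing for latency, so it only addresses the empty-reply symptom, not the slow thinking stage itself.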

A Reddit user who used the o1 model to build an app said the experience was worse than with the free version of ChatGPT.

“I am building an app (which I have no idea how to do since I am an embedded engineer), and o1 has been worse than even the free GPT-4 in that regard, and I have to be very, very specific with the prompt while working with o1,” he added, further suggesting that unless you are very specific about minute details, the o1 model can be a nightmare for developing an app.

Even if we ignore the higher token usage and the delay in response, reasoning, which is the headline feature of the o1 models, does not stop them from generating buggy code.

o1 takes a while to fix the bugs in code it generated itself

o1 is the Architect, Claude is the Developer

Dan McAteer, a software developer on X, mentioned that he is using o1-mini as an architect for his project. All he had to do was explain the project requirements to the model, and it generated a detailed design document with step-by-step instructions for each module.

McAteer then uses Claude 3.5 Sonnet as the developer, generating code from the architectural document produced by o1-mini.

“This works well because Sonnet 3.5 was always amazing at generating code, but the code that it did generate was only as good as the logic in your instructions. Now that we have models which can simulate reasoning trajectories, you can also use them to generate logical plans for Sonnet 3.5 to follow,” he added further.
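The two-stage workflow McAteer describes can be sketched as a small pipeline in which one model plans and another implements. This is an illustrative outline only: `architect_then_develop`, the prompt wording, and the stand-in `fake_architect` and `fake_developer` functions are assumptions, and no real model API is called. In practice each callable would wrap a client for the respective model.

```python
from typing import Callable

# A "model" here is any function that maps a prompt string to a reply string.
Model = Callable[[str], str]


def architect_then_develop(requirements: str, architect: Model,
                           developer: Model) -> dict[str, str]:
    """Stage 1: the reasoning model writes a design document.
    Stage 2: the coding model implements that document."""
    design_doc = architect(
        "Write a step-by-step design document, module by module, for:\n"
        + requirements
    )
    code = developer(
        "Implement the following design document exactly as written:\n"
        + design_doc
    )
    return {"design_doc": design_doc, "code": code}


# Stand-in models so the sketch runs offline; replace with real client calls.
def fake_architect(prompt: str) -> str:
    return "1. Module `greet`: return 'hello, <name>'."


def fake_developer(prompt: str) -> str:
    return "def greet(name):\n    return f'hello, {name}'"


result = architect_then_develop("A greeting helper", fake_architect, fake_developer)
print(result["design_doc"])
print(result["code"])
```

Keeping the design document as an explicit intermediate value makes it possible to review or edit the plan before any code is generated.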

Sully Omar, the co-founder and CEO of Cognosys, likewise mentioned on X that o1-mini is mostly useless for coding. “It misses small details pretty often, and I almost always have Claude 3.5 fix it,” he added.

That explains why Canvas, the coding platform OpenAI released, uses ChatGPT 4o instead of the o1 models.

The pattern makes sense, as o1 models are mostly reasoning-oriented. For programming, they can help architect the foundation, after which models like Sonnet can take care of the code generation.


