
Anthropic introduces its next generation of models, Claude Opus 4 and Sonnet 4

After a whirlwind week of news from Google and OpenAI, Anthropic has its own news to share.

On Thursday, Anthropic announced Claude Opus 4 and Claude Sonnet 4, its next-generation models, with a focus on coding, reasoning, and agentic capabilities. According to Rakuten, Claude Opus 4 worked independently for seven hours of continuous performance.

Claude Opus is the largest model family in Anthropic's lineup, built for longer, more complex tasks, while the Sonnet models are generally faster and more efficient. Claude Opus 4 and Sonnet 4 replace the previous versions, Opus 3 and Sonnet 3.7.


Anthropic says Claude Opus 4 and Sonnet 4 outperform competitors such as OpenAI's o3 and Gemini 2.5 Pro on key benchmarks used for agentic coding tasks, such as SWE-bench and Terminal-bench. It is worth noting, however, that self-reported benchmarks are not the best markers of performance: these evaluations don't always translate to real-world use cases, and AI labs today are not exactly paragons of transparency. AI researchers and policymakers are increasingly calling for accountability. "AI benchmarks need to be subject to the same requirements regarding transparency, fairness, and interpretability as algorithmic systems and AI models at large," according to the European Commission's Joint Research Centre.

Opus 4 and Sonnet 4 beat competitors on SWE-bench, but the benchmarks should be taken with a grain of salt.
Credit: Anthropic

In addition to the release of Opus 4 and Sonnet 4, Anthropic introduced new features. These include web search when Claude is in extended thinking mode, and summaries of Claude's reasoning log "rather than Claude's raw thought process." The blog post describes this as more helpful to users, while also "protect[ing]" its competitive advantage, i.e., not revealing the ingredients of its secret sauce. Anthropic also announced improved memory and parallel tool use, the general availability of its agentic coding tool Claude Code, and new tools for the Claude API.

In the safety and alignment department, Anthropic says the two models "are 65% less likely to engage in reward hacking than Claude Sonnet 3.7." Reward hacking is a slightly unnerving phenomenon in which models essentially cheat and lie in order to earn their reward (successfully completing a task).

One of the best metrics for evaluating a model's performance is user experience, albeit a more subjective one than benchmarks. We'll soon find out how Claude Opus 4 and Sonnet 4 stack up against their rivals in that regard.
