Anthropic surprises with Claude 3, a multimodal AI model more powerful than GPT-4
Anthropic unveiled Claude 3 on March 4, a major update to its flagship family of large language models. Prices scale with performance, and for the most powerful version they are rising.
That makes two. After being challenged by Google's Gemini, GPT-4 faces a new opponent: Claude 3. According to benchmarks announced by Anthropic this Monday, March 4, Claude 3 outperforms OpenAI's GPT-4 in most use cases. Like Google with Gemini, Anthropic offers its new model in three versions: Claude 3 Haiku, Claude 3 Sonnet and Claude 3 Opus. The first responds with minimal latency, the second balances reduced latency and performance, and the last, Opus, delivers superior performance on the most complex tasks. Performance goes hand in hand with the price of the model.
Opus, close to artificial general intelligence?
Opus and Sonnet, the two most capable models, are now available in 159 countries, including France, through Claude's API. Haiku will follow soon. Opus, the most "intelligent" of the new series, outperforms every language model publicly benchmarked so far. Anthropic even speaks of "near-human comprehension and fluency on complex tasks", not far removed from artificial general intelligence (AGI). The Claude 3 family features advanced capabilities in analysis and forecasting, nuanced content creation, code generation, and conversing in non-English languages such as Spanish, Japanese and French.
More specifically, Claude 3 Opus sets new records on a wide range of cognitive tasks. Opus outperforms its peers on most standard LLM benchmarks. On undergraduate-level knowledge (MMLU), Opus scores 86.8%, against 86.4% for GPT-4 and 83.7% for Gemini 1.0 Ultra. In math, on grade-school problems (GSM8K), Opus scores 95%, ahead of GPT-4 at 92%. In reading comprehension and reasoning over text (DROP), the Anthropic model outperforms GPT-4 (80.9%) with a score of 83.1%.
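Since Opus and Sonnet are already exposed through Anthropic's API, trying the new models takes only a few lines. Below is a minimal sketch using the official Python SDK; the prompt is illustrative, and the client assumes an ANTHROPIC_API_KEY environment variable is set.

```python
# Minimal sketch of calling Claude 3 Opus through Anthropic's Messages API.
# Requires: pip install anthropic, and ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-opus-20240229",  # launch identifier for Claude 3 Opus
    max_tokens=1024,
    messages=[
        {"role": "user",
         "content": "Summarize the key differences between Haiku, Sonnet and Opus."}
    ],
)
print(response.content[0].text)
```

Swapping the model string for the Sonnet identifier is enough to compare latency and output quality between the two tiers.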
Claude 3 marks the arrival of multimodality at Anthropic
It was long awaited and arrives with Claude 3: the latest update introduces support for multimodality. Claude 3 can now process visual formats: photographs, graphs, diagrams, etc. Vision performance, however, remains below the state of the art. Claude 3 Opus outperforms GPT-4V (Vision) on all benchmarks but trails Gemini 1.0 Ultra in most tests.
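For readers who want to try the new vision capability, here is a hedged sketch of passing an image to the Messages API as a base64-encoded content block; the file name and prompt are placeholders.

```python
# Sketch of sending an image to Claude 3: images travel as base64 content blocks
# alongside the text prompt in a single user message.
import base64
import anthropic

client = anthropic.Anthropic()

with open("chart.png", "rb") as f:  # "chart.png" is a placeholder file name
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": image_data}},
            {"type": "text", "text": "Describe the trend shown in this chart."},
        ],
    }],
)
print(response.content[0].text)
```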
In terms of latency, Anthropic recommends Haiku for the fastest possible answers. Sonnet offers a better compromise, with generation speeds twice those of Claude 2 and Claude 2.1. Finally, Opus generates at roughly the same speed as Claude 2 and Claude 2.1.
Anthropic extends the context window toward 1 million tokens
After Google with Gemini 1.5, Anthropic announces upcoming support for a one-million-token context window for select customers. For now, the three versions make do with a 200,000-token window. With such long contexts, LLMs tend to forget some of the information given at the start of the prompt. To limit this shortcoming, the researchers worked extensively on the robustness of the models over very long contexts. This shows in impressive results on the Needle in a Haystack (NIAH) benchmark, where Claude 3 achieves an accuracy rate of over 99% on very long documents.
At the same time, the teams focused on maximizing the understanding of complex prompts. Claude 3 is also better at generating JSON, paving the way for uses such as natural-language classification and sentiment analysis. Finally, the start-up has drastically reduced the problems caused by its sometimes overzealous safety policy: Claude 3 should produce far fewer false refusals and agree to answer more requests while still blocking the most toxic ones. Anthropic's red team audited the model (in line with the new US requirements) and the findings are positive: Claude 3 does not significantly increase extreme risks (biological, cyber, etc.).
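As an illustration of the improved JSON generation, the sketch below asks Claude 3 Sonnet to classify the sentiment of a review and reply in JSON. The prompt wording, labels and example text are our own assumptions, not an official recipe, and a production version would validate the model's output before parsing it.

```python
# Sketch of sentiment classification with a JSON answer from Claude 3 Sonnet.
import json
import anthropic

client = anthropic.Anthropic()

review = "The battery life is great, but the screen scratches far too easily."

response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": (
            "Classify the sentiment of the following product review as "
            '"positive", "negative" or "mixed". Reply with JSON only, '
            'in the form {"sentiment": ..., "reason": ...}.\n\n' + review
        ),
    }],
)

# Assumes the model complies and returns bare JSON, as requested in the prompt.
result = json.loads(response.content[0].text)
print(result["sentiment"], "-", result["reason"])
```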
Rising prices
With Claude 3, performance goes hand in hand with price. Haiku starts at $0.25 per million input tokens and $1.25 per million output tokens. Sonnet costs $3 per million tokens in input and $15 in output. And Opus is priced at $15 per million input tokens and $75 per million output tokens (yes, $75). For comparison, the standard version of GPT-4 costs $30 per million tokens in input and $60 in output. With this pricing, Anthropic shows a high level of confidence in its latest model.
| Model | Input | Output |
|---|---|---|
| Haiku | $0.25 / MTok | $1.25 / MTok |
| Sonnet | $3 / MTok | $15 / MTok |
| Opus | $15 / MTok | $75 / MTok |
| Claude 2.1 | $8 / MTok | $24 / MTok |
| Claude 2.0 | $8 / MTok | $24 / MTok |
| Claude Instant | $0.80 / MTok | $2.40 / MTok |
| GPT-4 | $30.00 / MTok | $60.00 / MTok |
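Based on the per-million-token prices in the table above, a quick back-of-the-envelope script makes the gap concrete; the token counts used in the example are purely illustrative.

```python
# Back-of-the-envelope cost comparison using the per-million-token prices above.
PRICES = {  # (input $/MTok, output $/MTok)
    "claude-3-haiku": (0.25, 1.25),
    "claude-3-sonnet": (3.00, 15.00),
    "claude-3-opus": (15.00, 75.00),
    "gpt-4": (30.00, 60.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request with the given token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Example: a 10,000-token prompt producing a 1,000-token answer.
for model in PRICES:
    print(f"{model}: ${cost(model, 10_000, 1_000):.4f}")
```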
In testing, Claude 3 confirms its superiority over GPT-4
We were able to test Claude 3 Opus in Anthropic's chat interface. In text generation and processing, Claude 2 and Claude 2.1 already had a significant lead over the other proprietary LLMs on the market. With Claude 3, Anthropic goes a step further and offers near-human writing. The style can be customized at will and draws on a previously unmatched lexical range. GPT-4 gives good results in text generation, but its output still reads as rather robotic (numerous repetitions, logical connectors galore, etc.).
In code generation, Claude is genuinely catching up to GPT-4. Anthropic's model produces safe code that is nearly as efficient as GPT-4's. The code is more readable and better commented, which makes debugging and future revisions easier. In short, Claude delivers safe, usable code that, according to our various tests, is more readable but slightly less optimized than the code produced by GPT-4.
With its three versions, Haiku, Sonnet and Opus, Claude 3 outperforms OpenAI's GPT-4 on most benchmarks and edges closer to true artificial general intelligence. The introduction of multimodality, although still behind the state of the art, opens up new application prospects. Handling extremely long contexts and better comprehension of complex prompts increase its versatility. Despite the rising pricing policy, Anthropic shows unwavering confidence in its latest model.