Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Claude 3.0, published by Zvi on March 6, 2024 on LessWrong.
Claude 3.0
Claude 3.0 is here. It is too early to know for certain how capable it is, but Claude 3.0's largest version is in a similar class to GPT-4 and Gemini Advanced. It could plausibly now be the best model for many practical uses, with praise especially coming in on coding and creative writing.
Anthropic has decided to name its three different size models Opus, Sonnet and Haiku, with Opus only available if you pay. Can we just use Large, Medium and Small?
Cost varies quite a lot by size, note this is a log scale on the x-axis, whereas the y-axis isn't labeled.
This post goes over the benchmarks, statistics and system card, along with everything else people have been reacting to. That includes a discussion about signs of self-awareness (yes, we are doing this again) and also raising the question of whether Anthropic is pushing the capabilities frontier and to what extent they had previously said they would not do that.
Benchmarks and Stats
Anthropic says Claude 3 sets a new standard on common evaluation benchmarks. That is impressive, as I doubt Anthropic is looking to game benchmarks. One might almost say too impressive, given their commitment to not push the race ahead faster?
That's quite the score on HumanEval, GSM8K, GPQA and MATH. As always, the list of scores here is doubtless somewhat cherry-picked. Also there's this footnote, the GPT-4T model performs somewhat better than listed above:
But, still, damn that's good.
Speed is not too bad even for Opus in my quick early test although not as fast as Gemini, with them claiming Sonnet is mostly twice as fast as Claude 2.1 while being smarter, and that Haiku will be super fast.
I like the shift to these kinds of practical concerns being front and center in product announcements. The more we focus on mundane utility, the better.
Similarly, the next topic is refusals, where they claim a big improvement.
I'd have liked to see Gemini or GPT-4 on all these chart as well, it seems easy enough to test other models either via API or chat window and report back, this is on Wildchat non-toxic:
Whereas here (from the system card) they show consistent results in the other direction.
An incorrect refusal rate of 25% is stupidly high. In practice, I never saw anything that high for any model, so I assume this was a data set designed to test limits. Getting it down by over half is a big deal, assuming that this is a reasonable judgment on what is a correct versus incorrect refusal.
There was no similar chart for incorrect failures to refuse. Presumably Anthropic was not willing to let this get actively worse.
Karina Nguyen (Anthropic): n behavioral design of Claude 3.
That was the most joyful section to write! We shared a bit more on interesting challenges with refusals and truthfulness.
The issue with refusals is that there is this inherent tradeoff between helpfulness and harmlessness. More helpful and responsive models might also exhibit harmful behaviors, while models focused too much on harmlessness may withhold information unnecessarily, even in harmless situations. Claude 2.1 was over-refusing, but we made good progress on Claude 3 model family on this.
We evaluate models on 2 public benchmarks: (1) Wildchat, (2) XSTest. The refusal rate dropped 2x on Wildchat non-toxic, and on XTest from 35.1% with Claude 2.1 to just 9%.
The difference between factual accuracy and honesty is that we expect models to know when they don't know answers to the factual questions. We shared a bit our internal eval that we built. If a model cannot achieve perfect performance, however, ideal "honest" behavior is to answer all the questions it knows the answer to correctly, and to answer all the questions it doesn't know the answer to with an "I don't know (IDK) / Unsure" response...
view more