Download - LW - On Claude 3.0 by Zvi

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library: LessWrong

Education

LW - On Claude 3.0 by Zvi

2024-03-06

Download Right click and do "save link as"

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Claude 3.0, published by Zvi on March 6, 2024 on LessWrong.
Claude 3.0
Claude 3.0 is here. It is too early to know for certain how capable it is, but Claude 3.0's largest version is in a similar class to GPT-4 and Gemini Advanced. It could plausibly now be the best model for many practical uses, with praise especially coming in on coding and creative writing.
Anthropic has decided to name its three different size models Opus, Sonnet and Haiku, with Opus only available if you pay. Can we just use Large, Medium and Small?
Cost varies quite a lot by size, note this is a log scale on the x-axis, whereas the y-axis isn't labeled.
This post goes over the benchmarks, statistics and system card, along with everything else people have been reacting to. That includes a discussion about signs of self-awareness (yes, we are doing this again) and also raising the question of whether Anthropic is pushing the capabilities frontier and to what extent they had previously said they would not do that.
Benchmarks and Stats
Anthropic says Claude 3 sets a new standard on common evaluation benchmarks. That is impressive, as I doubt Anthropic is looking to game benchmarks. One might almost say too impressive, given their commitment to not push the race ahead faster?
That's quite the score on HumanEval, GSM8K, GPQA and MATH. As always, the list of scores here is doubtless somewhat cherry-picked. Also there's this footnote, the GPT-4T model performs somewhat better than listed above:
But, still, damn that's good.
Speed is not too bad even for Opus in my quick early test although not as fast as Gemini, with them claiming Sonnet is mostly twice as fast as Claude 2.1 while being smarter, and that Haiku will be super fast.
I like the shift to these kinds of practical concerns being front and center in product announcements. The more we focus on mundane utility, the better.
Similarly, the next topic is refusals, where they claim a big improvement.
I'd have liked to see Gemini or GPT-4 on all these chart as well, it seems easy enough to test other models either via API or chat window and report back, this is on Wildchat non-toxic:
Whereas here (from the system card) they show consistent results in the other direction.
An incorrect refusal rate of 25% is stupidly high. In practice, I never saw anything that high for any model, so I assume this was a data set designed to test limits. Getting it down by over half is a big deal, assuming that this is a reasonable judgment on what is a correct versus incorrect refusal.
There was no similar chart for incorrect failures to refuse. Presumably Anthropic was not willing to let this get actively worse.
Karina Nguyen (Anthropic): n behavioral design of Claude 3.
That was the most joyful section to write! We shared a bit more on interesting challenges with refusals and truthfulness.
The issue with refusals is that there is this inherent tradeoff between helpfulness and harmlessness. More helpful and responsive models might also exhibit harmful behaviors, while models focused too much on harmlessness may withhold information unnecessarily, even in harmless situations. Claude 2.1 was over-refusing, but we made good progress on Claude 3 model family on this.
We evaluate models on 2 public benchmarks: (1) Wildchat, (2) XSTest. The refusal rate dropped 2x on Wildchat non-toxic, and on XTest from 35.1% with Claude 2.1 to just 9%.
The difference between factual accuracy and honesty is that we expect models to know when they don't know answers to the factual questions. We shared a bit our internal eval that we built. If a model cannot achieve perfect performance, however, ideal "honest" behavior is to answer all the questions it knows the answer to correctly, and to answer all the questions it doesn't know the answer to with an "I don't know (IDK) / Unsure" response...