This is an excerpt of Sources by Alex Heath, a newsletter about AI and the tech industry, syndicated just for The Verge subscribers once a week.
Amazon’s AI chief has a message for the model benchmark obsessives: Stop looking at the leaderboards.
“I want real-world utility. None of these benchmarks are real,” Rohit Prasad, Amazon’s SVP of AGI, told me ahead of today’s announcements at AWS re:Invent in Las Vegas. “The only way to do real benchmarking is if everyone conforms to the same training data and the evals are completely held out. That’s not what’s happening. The evals are frankly getting noisy, and they’re not showing the real power of these models.”
It’s a contrarian stance at a time when every other AI lab is quick to boast about how fast its new models climb the leaderboards. It’s also convenient for Amazon, given that the previous version of Nova, its flagship model, was sitting at spot 79 on LMArena when Prasad and I spoke last week. Still, dismissing benchmarks only works if Amazon can offer a different story about what progress looks like.
The centerpiece of today’s re:Invent announcements is Nova Forge, a service that Amazon claims lets companies train custom AI models in ways previously impossible without spending billions of dollars. The problem Forge addresses is real. Most companies trying to customize AI models face three bad options: fine-tune a closed model (but only at the edges), train on open-weight models (but without the original training data, and at the risk of capability regression, where the AI becomes an expert on the new data but forgets its original, broader skills), or build a model from scratch at enormous cost.
Forge offers something else: access to Amazon’s Nova model checkpoints at the pre-training, mid-training, and post-training stages. Companies can inject their proprietary data early in the process, when the model’s “learning capacity is highest,” as Prasad put it, rather than just tweaking model behavior at the end.
“What we have done is democratize AI and frontier model development for your use cases at fractions of what it would cost [before],” Prasad said. Forge was created because Amazon’s internal teams wanted a tool to inject their domain expertise into a base model without having to build from scratch.
“We built Forge because our internal teams wanted Forge,” he said. It’s a familiar Amazon pattern. AWS itself famously began as infrastructure built for Amazon’s own retail operation before becoming the company’s profit engine.
Reddit has been using Forge to build custom safety models trained on 23 years of community moderation data. “I haven’t seen anything like it yet,” Chris Slowe, Reddit’s CTO and first employee, told me. “We’ve had a distinguished engineer who’s just been like a kid in the candy store.”
Slowe said Reddit ran a continued pre-training job last week that’s “looking really promising.” The goal: Replace multiple bespoke safety models with a single Reddit-expert model that understands the nuances of community moderation, including the notoriously subjective rule that appears across subreddits everywhere: “Don’t be a jerk.”
“Having an expert model, it’s going to know the community,” Slowe said. “It’s gonna have a pretty good notion of what jerk means.”
He explained that Forge enables Reddit to control its models, avoid surprises from API changes, retain ownership of its weights, and avoid sending sensitive data to third-party model providers. He said Reddit is already exploring the same approach for Reddit Answers and other products.
When I asked Slowe whether it mattered that Nova isn’t a top-tier model on benchmarks, he was blunt: “In this context, what matters is the Reddit expertness of the model.” That’s the thread Amazon wants developers to pull on: not raw IQ points, but control and specialization.
With Forge, Amazon is making a calculated bet that the model race has become commoditized and that it can succeed by being the place where companies build specialized AI for specific business problems. It’s a very AWS-shaped view of the world: infrastructure over intelligence, and customization over raw capability. The strategy also lets Amazon sidestep direct comparisons with OpenAI and Anthropic, both of which it once hoped to compete with at the model layer.
Whether Forge is genuinely pioneering or just clever positioning depends, of course, on developer adoption. Amazon insists that the model race, as it’s widely understood, doesn’t matter. If that ends up being true, the scoreboard shifts to something much quieter and harder to game: whether AI models actually deliver real-world utility.

