When the Chinese firm DeepSeek dropped a large language model called R1 last week, it sent shock waves through the US tech industry. Not only did R1 match the best of the homegrown competition, it was built for a fraction of the cost—and given away for free.
The US stock market lost $1 trillion in value, President Trump called it a wake-up call, and the hype was dialed up yet again. “DeepSeek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen—and as open source, a profound gift to the world,” Silicon Valley’s kingpin investor Marc Andreessen posted on X.
But DeepSeek’s innovations are not the only takeaway here. By publishing details about how R1 and a previous model called V3 were built and releasing the models for free, DeepSeek has pulled back the curtain to reveal that reasoning models are a lot easier to build than people thought. In doing so, it has narrowed the lead held by the world’s very top labs.
The news kicked competitors everywhere into gear. This week, Chinese tech giant Alibaba announced a new version of its large language model Qwen, and the Allen Institute for AI (AI2), a top US nonprofit lab, announced an update to its large language model Tulu. Both claim their latest models beat DeepSeek’s equivalent.
Sam Altman, cofounder and CEO of OpenAI, called R1 impressive—for the price—but hit back with a bullish promise: “we will obviously deliver much better models.” OpenAI then pushed out ChatGPT Gov, a version of its chatbot tailored to the security needs of US government agencies, in an apparent nod to concerns that DeepSeek’s app was sending data to China. There’s more to come.
DeepSeek has suddenly become the company to beat. What exactly did it do to rattle the tech world so fully? Is the hype justified? And what can we learn from the buzz about what’s coming next? Here’s what you need to know.
Training steps
Let’s start by unpacking how large language models are trained. There are two main stages, known as pretraining and post-training. Pretraining is the stage most people talk about. That’s where billions of documents—huge numbers of websites, books, code repositories, and more—are fed into a neural network over and over again until it learns to generate text that looks like its source material, one word at a time. What you end up with is known as a base model.
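To make that concrete, here is a toy sketch of the pretraining objective in Python. It is a deliberately tiny stand-in: a real base model is a neural network trained on billions of documents, not a word-count table, but the game it plays is the same one shown here.

```python
import random
from collections import defaultdict

# Toy illustration of the pretraining objective (not any lab's real code):
# learn which word tends to follow which, then generate text one word at a time.
corpus = "the model reads text and learns to predict the next word".split()

# Count how often each word follows each other word (a tiny bigram table).
next_words = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    next_words[prev].append(nxt)

# Generate text one word at a time, the way a base model completes documents.
word = "the"
output = [word]
for _ in range(5):
    candidates = next_words.get(word)
    if not candidates:
        break
    word = random.choice(candidates)
    output.append(word)

print(" ".join(output))
```

The point of pretraining is only to get good at this completion game. Everything that makes a chatbot actually useful comes afterward.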
Pretraining is where most of the work happens, and it can cost huge amounts of money. But, as Andrej Karpathy, a cofounder of OpenAI and former head of AI at Tesla, noted in a talk at Microsoft Build last year: “Base models are not assistants. They just want to complete internet documents.”
To turn a large language model into a useful tool takes a number of extra steps. This is the post-training stage, where the model is trained to do specific tasks like answer questions (or answer questions step by step, as with OpenAI’s o3 and DeepSeek’s R1). The way this has been done for the last few years is to take a base model and train it to mimic examples of question-answer pairs provided by armies of human testers. This step is known as supervised fine-tuning.
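In rough code, supervised fine-tuning amounts to showing the model a question plus the answer a person wrote, and penalizing it whenever its word-by-word predictions stray from that answer. The sketch below uses a made-up miniature model and placeholder token IDs, not anything a real lab runs:

```python
import torch
import torch.nn as nn

# Minimal sketch of supervised fine-tuning (illustrative only; the tiny model
# and token ids below are stand-ins, not real data or a real LLM).
vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

question = torch.tensor([5, 17, 42])   # pretend token ids for a question
answer = torch.tensor([8, 23])         # pretend token ids for the human-written answer
tokens = torch.cat([question, answer])

for _ in range(100):
    logits = model(tokens[:-1])          # predict the token that follows each position
    targets = tokens[1:]
    start = len(question) - 1            # only grade the answer portion
    loss = nn.functional.cross_entropy(logits[start:], targets[start:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```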
OpenAI then pioneered yet another step, in which sample answers from the model are scored—again by human testers—and those scores used to train the model to produce future answers that are more like those that score well and less like those that don’t. This technique, known as reinforcement learning with human feedback (RLHF), is what makes chatbots like ChatGPT so slick. RLHF is now used across the industry.
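The core of RLHF can be sketched in a few lines: train a small scoring model to give higher marks to whichever answer the human testers preferred, then use those scores to steer the chatbot. The model, data, and numbers below are placeholders for illustration, not any lab's actual setup:

```python
import torch
import torch.nn as nn

# Illustrative sketch of RLHF reward modeling: humans prefer one answer over
# another, and a small "reward model" learns to score the preferred one higher.
dim = 16
reward_model = nn.Linear(dim, 1)  # stand-in: maps an answer embedding to a score
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Pretend embeddings of two answers to the same question; humans preferred the first.
chosen = torch.randn(dim)
rejected = torch.randn(dim)

for _ in range(50):
    score_chosen = reward_model(chosen)
    score_rejected = reward_model(rejected)
    # Push the chosen answer's score above the rejected one's.
    loss = -nn.functional.logsigmoid(score_chosen - score_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned scores are then used to nudge the chatbot toward answers people rate highly.
```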
But those post-training steps take time. What DeepSeek has shown is that you can get the same results without using people at all—at least most of the time. DeepSeek replaces supervised fine-tuning and reinforcement learning with human feedback with a reinforcement learning step that is fully automated. Instead of using human feedback to steer its models, the firm uses feedback scores that are produced by a computer.
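To see why a computer can stand in for human graders on some problems, consider a crude example (this is just the shape of the idea, not DeepSeek's actual reward code): for a math question with a known solution, a few lines of Python can score an answer by checking it.

```python
# Rough sketch of automated feedback: for a math problem, the "score" can be
# computed by a program that simply checks the final answer, so no human
# grader is needed.
def reward(model_answer: str, correct_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the known solution."""
    # Take whatever follows the last '=' sign as the model's final answer.
    final = model_answer.split("=")[-1].strip()
    return 1.0 if final == correct_answer else 0.0

print(reward("2 + 2 = 4", "4"))   # 1.0, reinforced
print(reward("2 + 2 = 5", "4"))   # 0.0, discouraged
```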
“Skipping or cutting down on human feedback, that’s a big thing,” says Itamar Friedman, a former research director at Alibaba and now cofounder and CEO of Qodo, an AI coding startup based in Israel. “You’re almost completely training models without humans needing to do the labor.”
Cheap labor
The downside of this approach is that computers are good at scoring answers to questions about math and code but not very good at scoring answers to open-ended or more subjective questions. That’s why R1 performs especially well on math and code tests. To train its models to answer a wider range of non-math questions or perform creative tasks, DeepSeek still has to ask people to provide the feedback.
But even that is cheaper in China. “Relative to Western markets, the cost to create high-quality data is lower in China and there is a larger talent pool with university qualifications in math, programming or engineering fields,” says Si Chen, a vice president at Australian AI firm Appen and a former head of strategy at both Amazon Web Services China and Chinese tech giant Tencent.
DeepSeek used this approach to build a base model, called V3, that rivals OpenAI’s flagship model GPT-4o. The firm released V3 a month ago. Last week’s R1, the new model that matches OpenAI’s o1, was built on top of V3.
To build R1, DeepSeek took V3 and ran its reinforcement learning loop over and over again. In 2017 Google DeepMind showed that this kind of automated trial-and-error approach, with no human input, could take a board-game-playing model that made random moves and train it to beat grandmasters. DeepSeek does something similar with large language models: Potential answers are treated as possible moves in a game.
To start with, the model did not produce answers that worked through a question step by step, as DeepSeek wanted. But by scoring the model’s sample answers automatically, the training process nudged it bit by bit towards the desired behavior.
Eventually, DeepSeek produced a model that performed well across a number of benchmarks. But this model, called R1-Zero, gave answers that were hard to read and were written in a mix of multiple languages. To give it one last tweak, DeepSeek seeded the reinforcement learning process with a small dataset of example responses provided by people. Training R1-Zero on those produced the model that DeepSeek named R1.
There’s more. To make its use of reinforcement learning as efficient as possible, DeepSeek has also developed a new reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). It first used GRPO a year ago to build a model called DeepSeekMath.
We’ll skip the details—you just need to know that reinforcement learning involves calculating a score to determine whether a potential move is good or bad. Many existing reinforcement learning techniques require a whole separate model to make this calculation—in the case of large language models, that means a second large language model that could be as expensive to build and run as the first. Instead of using a second model to predict a score, GRPO just makes an educated guess. It’s cheap, but still accurate enough to work.
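For the curious, the group-based shortcut looks roughly like this (a heavily simplified sketch based on the published description; the real GRPO objective also adds clipping and a penalty that stops the model from drifting too far): the baseline for each sampled answer is just the average reward of its group, so no second model is needed.

```python
import torch

# Simplified sketch of GRPO's central trick: sample a group of answers to the
# same question, then judge each answer against the group's average reward
# instead of asking a separate "critic" model for a score.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # scores for 8 sampled answers

# Group-relative advantage: how much better each answer is than its siblings.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# Pretend per-answer log-probabilities from the model being trained.
log_probs = torch.randn(8, requires_grad=True)

# Policy-gradient style objective: raise the probability of above-average
# answers and lower it for below-average ones.
loss = -(advantages * log_probs).mean()
loss.backward()
```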
A common approach
DeepSeek’s use of reinforcement learning is the main innovation that the company describes in its R1 paper. But DeepSeek is not the only firm experimenting with this technique. Two weeks before R1 dropped, a team at Microsoft Asia announced a model called rStar-Math, which was trained in a similar way. “It has similarly huge leaps in performance,” says Matt Zeiler, founder and CEO of AI firm Clarifai.
AI2’s Tulu was also built using efficient reinforcement learning techniques (but on top of, not instead of, human-led steps like supervised fine-tuning and RLHF). And US firm Hugging Face is racing to replicate R1 with OpenR1, a clone of DeepSeek’s model that Hugging Face hopes will expose even more of the ingredients in R1’s special sauce.
What’s more, it’s an open secret that top firms like OpenAI, Google DeepMind, and Anthropic may already be using their own versions of DeepSeek’s approach to train their new generation of models. “I’m sure they’re doing almost the exact same thing, but they’ll have their own flavor of it,” says Zeiler.
But DeepSeek has more than one trick up its sleeve. It trained its base model V3 to do something called multi-token prediction, where the model learns to predict a string of words at once instead of one at a time. This is cheaper to train and turns out to boost accuracy as well. “If you think about how you speak, when you’re halfway through a sentence, you know what the rest of the sentence is going to be,” says Zeiler. “These models should be capable of that too.”
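As a toy illustration of the idea (not DeepSeek's actual architecture), imagine giving a model several output heads, each graded on a different upcoming token, so a single forward pass is trained to guess a short stretch of text rather than one word:

```python
import torch
import torch.nn as nn

# Toy sketch of multi-token prediction: several output heads, each trained to
# guess a different one of the next few tokens from the same context.
vocab_size, dim, n_future = 100, 32, 2

embed = nn.Embedding(vocab_size, dim)
heads = nn.ModuleList([nn.Linear(dim, vocab_size) for _ in range(n_future)])

tokens = torch.tensor([5, 17, 42, 8])      # pretend token ids
hidden = embed(tokens[:1])                 # context so far: just the first token
targets = tokens[1:1 + n_future]           # the next two tokens the model should predict

# Head 0 is graded on the very next token, head 1 on the one after that.
loss = sum(
    nn.functional.cross_entropy(head(hidden), targets[i:i + 1])
    for i, head in enumerate(heads)
)
loss.backward()
```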
It has also found cheaper ways to create large datasets. To train last year’s model, DeepSeekMath, it took a free dataset called Common Crawl—a huge number of documents scraped from the internet—and used an automated process to extract just those documents that included math problems. This was far cheaper than building a new dataset of math problems by hand. It was also more effective: Common Crawl includes a lot more math than any other specialist math dataset that’s available.
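The filtering step is easy to picture in code. The real pipeline was far more elaborate (reportedly it relied on a trained classifier run over the full Common Crawl dump), but the shape of the idea is simply: scan scraped documents and keep the math-flavored ones.

```python
import re

# Rough sketch of automated data filtering: keep only documents that look like
# they contain math problems. A real pipeline would use a trained classifier.
documents = [
    "Solve for x: 3x + 7 = 22. Show your working.",
    "Top ten travel destinations for summer 2024.",
    "Prove that the sum of two even integers is even.",
]

math_signals = re.compile(
    r"solve for|prove that|\d+\s*[+\-*/=]\s*\d+|\\frac|theorem", re.IGNORECASE
)

math_docs = [doc for doc in documents if math_signals.search(doc)]
print(math_docs)  # keeps the two math-flavored documents
```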
And on the hardware side, DeepSeek has found new ways to juice old chips, allowing it to train top-tier models without coughing up for the latest hardware on the market. Half their innovation comes from straight engineering, says Zeiler. “They definitely have some really, really good GPU engineers on that team.”
Nvidia provides software called CUDA that engineers use to tweak the settings of their chips. But DeepSeek bypassed this code using assembler, a programming language that talks to the hardware itself, to go far beyond what Nvidia offers out of the box. “That’s as hardcore as it gets in optimizing these things,” says Zeiler. “You can do it, but basically it’s so difficult that nobody does.”
DeepSeek’s string of innovations across multiple models is impressive. But it also shows that the firm’s claim that it cost less than $6 million to train V3 is not the whole story. R1 and V3 were built on a stack of existing tech. “Maybe the very last step, the last click of the button, cost them $6 million, but the research that led up to that probably cost 10 times as much, if not more,” says Friedman. And in a blog post that cut through a lot of the hype, Anthropic cofounder and CEO Dario Amodei pointed out that DeepSeek probably has around $1 billion worth of chips, based on reports that the firm in fact used 50,000 Nvidia H100 GPUs.
A new paradigm
But why now? There are hundreds of startups around the world trying to build the next big thing. Why have we seen a string of reasoning models like OpenAI’s o1 and o3, Google DeepMind’s Gemini 2.0 Flash Thinking and now R1 appear within weeks of each other?
The answer is that the base models—GPT-4o, Gemini 2.0, V3—are all now good enough to have reasoning-like behavior coaxed out of them. “What R1 shows is that with a strong enough base model, reinforcement learning is sufficient to elicit reasoning from a language model without any human supervision,” says Lewis Tunstall, a scientist at Hugging Face.
In other words, top US firms may have figured out how to do it but were keeping quiet. “It seems that there’s a clever way of taking your base model, your pre-trained model, and turning it into a much more capable reasoning model,” says Zeiler. “And up to this point, the procedure that was required for converting a pretrained model into a reasoning model wasn’t well known; it wasn’t public.”
What’s different about R1 is that DeepSeek published how it did it. “And it turns out that it’s not that expensive a process,” says Zeiler. “The hard part is getting that pre-trained model in the first place.” As Karpathy revealed at Microsoft Build last year, pretraining a model is 99% of the work and most of the cost.
If building reasoning models is not as hard as people thought, we can expect a proliferation of free models that are far more capable than we’ve yet seen. With the know-how out in the open, Friedman thinks that there will be more collaboration between small companies, blunting the edge that the biggest companies have enjoyed. “I think this could be a monumental moment,” he says.