Building an Open LLM – with Antoine Bosselut


Danny Buerkli: My guest today is Antoine Bosselut. Antoine’s an assistant professor at EPFL in Lausanne, Switzerland. He’s also one of the creators of Apertus, an open LLM released in September. Antoine, welcome.

Antoine Bosselut: Hey. Thanks for having me. It’s great to be here.

Danny: Thanks so much for doing this. To start, explain briefly what Apertus is and what distinguishes it from, first, other LLMs and then other, quote-unquote, open LLMs.

Antoine: Yeah. So, I mean, as you mentioned, Apertus is an LLM — or a large language model. So, you know, essentially a very large model based on neural networks and deep learning, which has been trained, or pretrained as we say, on an insane amount of data — in our case, close to 15 trillion tokens. So it’s essentially a large language model whose scale rivals some of its more, I would say, well-known counterparts such as the Qwen-2 models.

What makes it different, I guess, in this context compared to what we often think about with large language models — which is the ChatGPTs and the Claudes and the Geminis of the world — is that Apertus is more like Llama-3 and Qwen-2, in that it’s an open model where the weights are actually released online, which allows others to pull those weights from some online repository like Hugging Face or Azure or AWS and essentially run it locally, train on top of it to potentially provide it new capabilities. And it’s essentially a much more flexible interface to a model than, you know, just the chat interfaces that are often used with the frontier labs’ models. This naturally has drawbacks, though, in that these frontier organizations have an entire ecosystem built on top of the language model. There, the language model is more like an engine inside a very large car that is cruising forward. In our case, you know, we just have the large language model, so it’s only the engine in that more classic sense.

And then in terms of what makes Apertus stand out on its own relative to models such as Llama and Qwen is that it’s fully open. We wanted to be able to provide a scientific artifact in addition to a base model that could be used for downstream applications. And so to do that, we wanted to be able to release all of the data that we trained the model on — all 15 trillion tokens. We wanted to be able to release intermediate checkpoints. And so at Apertus’ scale, it’s actually the largest model where all of these additional artifacts — such as pretraining data, checkpoints, evaluation suites — are all available to go along with the model weights.

And there are very few other models that have that same level of transparency in their releases, and none of them are at the scale that Apertus is at.

Danny Buerkli: Brilliant. And one obvious question is: why bother? Apart from the fact that this is a cool and presumably helpful research artifact — because we may want to understand LLMs better, and in order to do that we have to have a model we can interrogate. Arguably, the more open and compliant it is, the better. And also maybe for the reasons that we may not want to rely on the largesse of large multinational companies to provide these models.

We may want to, you know, be able to, for research purposes, have one of these ourselves. That’s great. And that may already be justification enough. But apart from that, why bother? And why bother with public money?

Antoine: There’s multiple questions to break down there. In terms of why bother, I’d say there are lots of good reasons for this. Two, I guess, that I’ll lean on here are: first, having access to this type of resource from a research perspective just enables us to really expand the number of studies that can be performed on these models. There’s essentially not all that much research you can do with the frontier models — at least for us as outsiders — beyond talking to it and seeing what it answers and trying to get something back from that. With open-weight models, you can do a lot more because you have access to those weights.

You can adapt them. You can provide different stimuli and see how the internals of the model — the mechanisms inside the model — change. You can try to discover circuits that are responsible for particular behaviors. So there’s a whole host of things that you can do, but you’re always limited by the fact that you don’t know what this model was trained on. What does it mean for a model to get this much performance on a benchmark if that benchmark may have been part of the training data?

There’s just this gap between good science — science that actually allows us to extract insights — and the foundation of the experiments that we’re running, which for open-weight models often lacks that specification. With a fully open model, we can actually do true science on these systems because we’re able to audit the entire training procedure of the model up until that point and say, “Okay, there’s a flaw in my experiment because I’m testing for something that — even though I didn’t know it at the time — the data had in it when the model was pretrained.” So I would say that in itself is a great value of such a system. And we’re not the only ones pushing models forward along that principle. AI2 had the OLMo models. EleutherAI has released a suite of models in previous years. So there are a few organizations operating under these fully open principles.

The other reason is an effect of wanting it to be fully compliant and also useful for businesses. We really wanted to release under this Apache 2.0 license, which allows commercial use. And we also wanted to release all of the data that we had trained on. And that creates some sticky legal conundrums — in that if other people are going to use this model to make money, and you’ve pretrained on a whole bunch of things that make money for other people, that second group might want to come after you legally.

So we made a lot of decisions in choosing the data to train on that essentially make us a lot more compliant with regulations in Europe — including the EU AI Act and GDPR — in terms of what sorts of data can make it into the mix. And this is quite interesting because when we were designing this project and asked organizations, “What’s the issue with LLMs for you today?”, their answer didn’t have anything to do with performance or capabilities. They were confident they could make something work there.

But they were afraid of the legal exposure that such a model could create for their company if there was any sort of mistake or safety issue. And so for them, knowing that there was this much stronger data-compliance standard was quite important, and it’s one of the things that is very attractive to many of the companies we’ve spoken to.

Now you asked: why use public money to do this? Well, the answer to that is quite simple: it’s super important to do this — particularly to create more responsible foundations for models and to have them be useful for sovereign innovation ecosystems as well. But there’s not actually all that much money to be made in “responsible AI” at the moment. And so it’s not necessarily an attractive bet for a private company to release a model with all these open artifacts — only to have it potentially be used a bit less because it’s not as capable due to the performance gap.

But to me, this is the responsibility of the public sector: to make the investment in the foundational technologies that can then enable entire innovation ecosystems on top of it. It’s too much money to do this for just a startup — especially one in Europe. But as far as the public sector goes, it’s a very important thing to do in order to enable the next stage of building the application layer on such a model.

Danny: When speaking about money, can we put an order of magnitude on this? I imagine the biggest cost block would be pretraining and post-training runs plus salaries, but the GPU time is presumably the most expensive thing. How much did the whole project cost?

Antoine: I mean, there’s two things there: how much did it cost us, and how much would it have cost other people? When it comes to cluster economics, there are different ways of defining these costs. So I think all-in — if we count up the GPU hours used — it comes out to somewhere around 10 million GPU hours to do all the experimentation, the final pretraining run, the post-training, all of that.

That’s the majority of the cost. Then, of course, there’s the salaries of all the folks that worked on it along the way, but we can probably say that all of those salaries put together come out to around, let’s say, 3 million as a number. So then you’re left with the compute cost. Now you’ve got these 10 million hours. How do you decide what an hour is worth?

If you go on Google Cloud or AWS, you’re going to be able to buy a GPU-hour. For spot pricing, you’ll get it for less, but you can’t reliably train on spot pricing, so you’d need the full value of that compute. And you’re not going to use it for a year, so you won’t get the committed-use discounts. So you’re going to be paying something like $5–6 per GPU-hour. That would be $50–60 million to train on that cloud.

If you go for a bare-bones service, you can often get something like $2–2.50 per GPU-hour. So ~$25 million. All of these things are essentially out of reach for a private company hitting the scene.

However, we do have a supercomputer in Switzerland that has around 11,000 GH200 GPUs. The base cost on that — for energy, cooling, and salaries of running it — is a lot less. And we can probably value the compute for this project at closer to 5–7 million francs.

And so: what would have taken a private company $50 million to run is completely out of scope. But we can rely on public investments that have already been made in infrastructure, and that allow us to get a much better price point. So not only is it the responsibility of the public sector, but it also makes more sense economically.
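The cost comparison above can be sketched in a few lines. All per-hour prices are the rough figures quoted in the conversation, not audited project numbers, and the supercomputer rate is simply backed out of the 5–7 million franc estimate rather than being an official price:

```python
# Back-of-the-envelope version of the cluster economics above.
# Prices are the rough per-GPU-hour figures quoted in the
# conversation; the supercomputer rate is inferred from the
# 5-7 million franc total, not an official published price.

GPU_HOURS = 10_000_000  # experimentation + pretraining + post-training

price_per_gpu_hour = {
    "hyperscaler on-demand (GCP/AWS)": (5.00, 6.00),
    "bare-bones GPU cloud": (2.00, 2.50),
    "public supercomputer (energy, cooling, ops)": (0.50, 0.70),
}

for scenario, (low, high) in price_per_gpu_hour.items():
    lo_millions = GPU_HOURS * low / 1e6
    hi_millions = GPU_HOURS * high / 1e6
    print(f"{scenario}: {lo_millions:.0f}-{hi_millions:.0f} million")
```

Running this reproduces the figures in the answer: roughly 50–60 million on a hyperscaler, 20–25 million on a bare-bones GPU cloud, and 5–7 million on already-built public infrastructure.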

Danny: You mentioned the immediate utility to research, the immediate potential utility to firms who worry about legal liability and other issues. I wonder how much there is also the optionality value — because building these systems presumably involves a lot of experiential expertise that you can only gain by actually doing it. There’s presumably a limit to how much expertise you can amass by only thinking about how you would build an LLM if you were to do it. There’s expertise that comes from actually doing it. And that translates into future optionality value. How would you weigh present-day value versus the value this generates into the future if we were to continue on this path?

Antoine: Yeah. That is an incredible question. There is so much to unpack. I’ll start with a little anecdote. When we kicked off this project, I had a few people tell me, “This is not what academics do. Why are you trying to do this? You’re not going to publish your typical research papers.”

And to be honest, I didn’t particularly care. It seemed like a really exciting thing to do and something we needed to do in order to stay competitive in the AI race in Switzerland.

But something incredible happened along the way: we found research problems that aren’t really well codified or aren’t necessarily talked about. And one thing that came out of this work is that there were something like 20–25 research papers that did get written. So we did achieve the academic mission in the classical sense — we published research, and we trained a lot of students in a very important technology over the course of this project.

And the reason is that the expertise you gain from actually doing the project is really different from what you gain by reading papers by people who do these projects. Particularly in the LLM space, where frankly a lot of what is published lacks depth, simply because the know-how itself is valuable enough that it doesn’t get shared.

This creates a massive future opportunity for the people involved and for the communities they belong to. They gain experience in a technology that is quite complex — where the number of people who can actively contribute is in the thousands or tens of thousands, not millions. And there’s a lot of value to that.

When they leave the Apertus ecosystem or the Swiss AI Initiative — the wrapper around it — they go out and join startups, join tech companies, or start their own. And in essence, there’s a whole new class of folks going out into the world taking what they’ve learned and designing innovation tools around those problems. That’s where I think a lot of the value comes from.

Apertus is a model. We’ve released it. In a few years, it’ll be completely outdated. But the experience and understanding gained by the people who participate in these projects is a really valuable resource — particularly for a small country like Switzerland.

Danny Buerkli: That’s excellent. And it confirms that around the launch, a couple of things were not necessarily well understood. That point was not well understood — that there are significant positive externalities beyond the artifact itself. And the other thing that may not have been well understood is what the artifact actually is. It’s the engine — or the fuel rod — rather than the entire power plant. And there’s a big difference between those two things.

You mentioned earlier that in a couple of years this model will be obsolete in terms of its performance. I wonder: did we just happen to be at this one fortuitous point in time where the compute resources available in the supercomputing center were just right to pull off a model that was open-source state-of-the-art?

But is that now gone? We don’t know how scaling will continue. But already now, if you compare it to Grok-4 — where some numbers are available from Epoch — Grok used something like 60× the power and 70× the FLOPs. And that’s today, never mind what that gap will look like next year or in two years.

Is this a repeatable exercise, or was that it?

Antoine: My answer to the first question is yes, and to the second one is no. It is a repeatable exercise, and this is not just it. But it requires a commitment to grow in scale at the same time.

The story behind this is quite fascinating. Around the time GPT-3 came out in 2020, the folks at the Swiss Supercomputing Center were coming up on an infrastructure investment cycle. And they made this very wise — but risky — decision: “Okay, there seems to be something to these LLMs, and we need to provide capacity for this type of research.”

And so they bought 11,000 GPUs with that infrastructure investment — at a price that people today can only dream of, given what a good deal it was. The biggest cost of running the cluster, I believe, is the depreciation on a daily basis — nothing else.

But that’s what enabled the Apertus project: having access to such infrastructure.

Now, as I mentioned, it would be difficult to do that a second time around, because now everybody knows the value of a GPU. NVIDIA is not giving the same discounts. But one of the points of the Apertus project — and others in the Swiss AI Initiative — is to show what can be done when there are resources available for research and development.

And yes, we can point to xAI and OpenAI and Microsoft and Amazon’s very large clusters. But something Europe needs to understand is that it doesn’t have companies of that scale building their own data centers. And so if Europe wants an innovation ecosystem that can compete with what big players are doing in the US and China, the only place where the investment in data centers can come from is the public sector.

These data centers are expensive — but not for an entire continent.

So the big question is: can we make the investments into the necessary infrastructure to enable this development, which in turn would spur an innovation ecosystem around it?

There have been efforts — including the construction of AI gigafactories coming online in the next few years. These are awesome initiatives, and we should continue and expand them.

Danny Buerkli: So what you’re saying is: absent that, there’s not going to be another training run because we’re tapped out as it stands today.

Antoine: Well, I would say that at the same time, the frontier is plateauing a tiny bit right now. And in fact, there’s less gain to be made just from scaling up compute on pretraining runs. People are conjecturing that we can scale up post-training to the level of compute of pretraining.

But in terms of pretraining: yes, we could have larger models, larger architectures. But data is going to become a choke point at some point — though you can repeat data to get a bit more bang for your buck.

We can still attack these limits on the open-model side without too much issue. It’s the next generation I’m more curious about — where synthetic data really takes off as the primary data source for pretraining, and where post-training compute is scaled up massively.

Right now, I think we still have enough juice for something like Apertus-2 or even Apertus-3. After that, it will require larger infrastructures.

Really, the blocker now is how much experimentation we can do along the way. For the large run itself: on 4,000 GPUs it took around two months to train Apertus. We could double the scale, and it would take four months. This is an insane amount of compute for academics — but possible when you have a supercomputer.

The question becomes: how much experimentation can you do ahead of time on this cluster — a shared national resource — and how many scaling laws can you run?

That’s where you’re more limited than the large tech companies.

Another thing to remember: inference is now two-thirds or even 80% of the cost of a model. These huge clusters are used not just for training but for serving. We don’t serve the model afterwards — we put it out in the open, and others take care of that problem using private cloud infrastructures.

Danny: Right. And from what I understand, it is much easier to serve the model through public cloud infrastructure than it is to run a training run.

One thing I don’t understand: when we think of compute as publicly funded infrastructure — you mentioned depreciation — and I’m not quite sure how to think about this. We know how to build roads, bridges, railways. But most of that infrastructure doesn’t depreciate as fast as an H200 does. I don’t know the exact lifespan, but presumably it’s measured in low-single-digit years.

Which implies recurring investment. Not water pipes you lay and then use for 50 years. It requires recurring investment. What is the correct way to think about that?

Antoine: Yeah. Caveat that I’m not an accountant by any means. But yes — GPUs depreciate faster. Not because they become useless, but because they become outdated. You may have four years of useful life, and depending on NVIDIA’s cycle, you get three new generations, all with way more theoretical FLOPs and different networking capabilities.

The difference between GPUs today — like the new B300s — and four years ago — A100s with 40 GB — is night and day. And I can only imagine where we’ll be four years from now. So yes, depreciation is faster.

But that doesn’t mean the chip is no longer useful. You can use them for teaching purposes. Students don’t need top-line GPUs to learn GPU programming, multi-node scaling, etc. You can build local clusters in places that never have access to compute. There are many educational purposes after depreciation.

But yes, in private companies depreciation is even more aggressive — two years — because they know they need to buy the next generation to stay competitive.
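The gap between a public-sector and a private-lab depreciation schedule can be made concrete with a straight-line sketch. The $100M cluster price here is a hypothetical round number, not a figure from the conversation:

```python
# Straight-line depreciation sketch contrasting a four-year
# public-sector schedule with an aggressive two-year private-lab
# schedule. The $100M cluster price is a hypothetical round
# number chosen purely for illustration.

def annual_write_down(purchase_price: float, useful_life_years: float) -> float:
    """Equal write-down each year until the asset is fully depreciated."""
    return purchase_price / useful_life_years

cluster_price = 100_000_000.0  # hypothetical

four_year = annual_write_down(cluster_price, 4)  # public-sector style
two_year = annual_write_down(cluster_price, 2)   # private-lab style

print(f"4-year schedule: {four_year / 1e6:.0f}M per year, "
      f"{four_year / 365 / 1e3:.0f}k per day")
print(f"2-year schedule: {two_year / 1e6:.0f}M per year, "
      f"{two_year / 365 / 1e3:.0f}k per day")
```

Halving the useful life doubles the yearly write-down, which is why the daily depreciation Antoine mentions dominates the running costs of a cluster this size.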

Danny: When thinking about the model’s performance, what are the binding constraints? One interesting thing about what you’ve done — because you’ve trained on “fully compliant” data — is that we can see what the performance penalty might be relative to training on not-so-legal data, which is presumably what most commercial providers do. But there may be other binding constraints: talent, team size, compute. What are those constraints, and how big would the performance uplift be if you had a bigger team, or trained on all available data?

Antoine: Yeah. There are many binding constraints. The question is the impact of each.

Regarding compliant data: we’ve been able to study this. One example: public datasets are usually constructed from Common Crawl. You take snapshots of the web, combine them, dedupe, filter by quality, etc. One thing you can do at the start is look at the URLs and their robots.txt files.

Common Crawl does this automatically — if a website blocks crawlers, it’s not included. But only at the time of the crawl. If in 2021 a site didn’t block crawlers but in 2025 it does, the 2021 snapshot still contains that content.

We took a stricter approach: we retroactively removed all websites from these datasets that — as of January 2025 — had opted out of crawling. When we measure the impact of this, it’s actually quite minimal. It only removes about 8–8.5% of the data. Performance-wise, we don’t measure a big gap.
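The retroactive opt-out filter described above can be sketched with the standard library's robots.txt parser. Everything in this snippet is illustrative: the hostnames and robots.txt contents are made-up stand-ins for what a January-2025 fetch might return, though "CCBot" is the real name of Common Crawl's crawler:

```python
# Toy sketch of the retroactive opt-out filter: re-check each
# document's host against its *current* robots.txt, and drop
# documents whose host now disallows crawling, even if the
# snapshot they came from predates the opt-out. The hosts and
# robots.txt bodies below are hypothetical stand-ins.

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Pretend this is what fetching each site's robots.txt returned
# as of January 2025.
robots_as_of_2025 = {
    "news.example.com": "User-agent: CCBot\nDisallow: /",  # opted out
    "blog.example.org": "User-agent: *\nAllow: /",         # still open
}

def allowed_now(url: str, agent: str = "CCBot") -> bool:
    """True if the document's host still permits crawling by `agent`."""
    host = urlsplit(url).netloc
    parser = RobotFileParser()
    parser.parse(robots_as_of_2025.get(host, "").splitlines())
    return parser.can_fetch(agent, url)

corpus = [
    {"url": "https://news.example.com/story-2021"},  # from a 2021 crawl
    {"url": "https://blog.example.org/post"},
]

kept = [doc for doc in corpus if allowed_now(doc["url"])]
print(len(kept), "of", len(corpus), "documents kept")
```

The real pipeline operates on billions of URLs and cached robots.txt snapshots rather than live lookups, but the decision rule per document is essentially this one: consult the current opt-out status, not the status at crawl time.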

But there are datasets that are not public — like large dumps of pirated textbooks. You can train on that data. We did not — because we didn’t want to release that publicly or steal IP. We know that training on pirated textbooks can give you a substantial boost — 5–10% on benchmarks like MMLU.

Whether that’s because the textbooks contain useful knowledge or because MMLU overlaps with textbooks is another question. But the point is: training on private data does give measurable uplift. So that’s a binding constraint.

The question is whether that translates to UX challenges — that is, whether users would notice. If you had a similar interface, similar post-training, similar layers on top — I’m not completely sure users would notice that the model was worse.

Danny: You mentioned earlier that not everything inside companies is published, and what is published may be anemic. If you hold the data constant, the compute constant, the team constant — but had access to the expertise inside the three large commercial labs — what kind of uplift would that give? What’s your guess?

Antoine: If the data is constant, compute is constant, and the only difference is having 30 OpenAI- or Anthropic-level engineers? I would expect a substantial difference.

But it’s not often a matter of team size. The more you scale up the team, the more integration nightmares you have. On Apertus, hundreds of people contributed, but only a small, carefully curated set of changes made it into the final mixture — because any new thing can cause problems.

If you have 30 people from OpenAI or Anthropic, you don’t need to try as many formulas because they know what works. That expertise is literally valued in billion-dollar packages.

But that doesn’t mean you can’t do meaningful work with a different profile. As a public institution, we can’t pay billion-dollar salaries. And we have a mission to train people — so many components are student-led. And the students at EPFL, ETH Zürich, and others are brilliant.

Danny: What was something you discovered while building the model that surprised you?

Antoine: Oh gosh — everything was surprising. One thing that surprised me was how robust and stable a lot of this is. There’s an incredible amount of work we were able to build on — pretraining libraries, data-cleaning libraries, all developed by others. It derisks the enterprise a lot.

I have massive respect for the first person to train a model at GPT-3 scale in 2020. I can’t imagine what that was like. Luckily for us, we came later — and lots of decisions have already been documented by open developers like EleutherAI, LLM360, AI2.

Our big contributions were data compliance and large-scale multilinguality. And we were able to dedicate research resources to those because other parts had been handled by the community. That was a pleasant surprise — the community effort.

This is one of the promises of why open modeling can keep up with the frontier. Once you’re at the frontier, every design decision has to be tested. But in the open ecosystem, you can build on others’ work.

Another nice thing is that it’s a smaller ecosystem — so what’s published is less noisy and high-quality compared to the general research landscape where 90% of papers only work under narrow conditions.

So I was surprised by the quality of open artifacts and how much they derisk the enterprise.

Danny: In terms of public investments — what would you ideally want to see from the political system, not just in Switzerland?

Antoine: There’s what I’d wish for and what’s possible. But I’ll say the ideal:

We need to be able to keep designing and training fully open models that are — if not at the same level as the frontier — very close. And as the frontier expands, we need the capacity to train open models at that level.

You’re forced to think five years ahead — the timescale for infrastructure investment. And we are running out of human data. Synthetic data becomes super promising.

Whoever has the best synthetic data in the years to come is probably the entity that can train the best models. And the best synthetic data comes from the biggest models.

You don’t need efficient models for generating synthetic data — you can sacrifice efficiency for quality. That still pushes toward training absolutely massive models — 10× larger than the biggest today. Think trillions of parameters.

To train those, you will need very large-scale infrastructure.

Whether that’s possible politically is another question. But it’s what’s needed to avoid surrendering capability to a handful of private companies.

This requires substantial investment, will, and coordination. But it’s not impossible.

Danny: It seems an oddly consequential variable is whether we can geographically distribute these clusters for political-economy reasons. If we can, a lot becomes easier. If we can’t, and they must be colocated, it’s harder.

Antoine: This is a very important question. Politically, distributing clusters is more digestible — and that’s the current European model with gigafactories of 10–25k GPUs each.

But from a training-paradigm perspective, it’s limiting. I don’t think we could train a model 100× the size of Apertus by fully using 10 supercomputing centers across different countries with today’s paradigm.

So the question becomes: can research advance enough to create viable solutions for multi-node, cross-data-center training?

Within a single cluster, communication is the bottleneck. Cross-node communication is slow. Cross-data-center communication would be far slower. You’d need a new training paradigm.

One approach is training slightly different models in each data center and merging them. But that doesn’t scale the model size.

There are smaller-scale attempts at decentralized training, but I don’t think they adapt well to this case.

Absent advances, the only way to scale models is by having a single very large data center. And that’s politically hard. And environmentally hard. You need massive power, cooling, renewable energy. Only a few places could support a 5-gigawatt data center without ecological disruption.

Danny: What should I have asked but didn’t?

Antoine: I think we hit most points. But another reason to have these models is the sovereign aspect. I don’t think every country needs its own model from scratch — but they do need representation in how models are designed and trained. The best way to be represented is to be a player.

Surrendering all AI development to a few big players because they have an advantage is shortsighted. The winners of the next computing revolution are typically the winners of the previous one.

The biggest players in LLMs today — Google, Microsoft, Meta — invested heavily in cloud in the early 2010s. Those that didn’t invest were behind for GenAI. And the winners of cloud were the winners of the web.

If countries want to build innovation ecosystems for future computing revolutions, they need to invest in the one happening now — or they surrender future growth, taxation, and sovereign capacity to companies that already exist.

Open models are closing the gap with frontier models. Public institutions can make large-scale investments to close it further.

And we’re on the verge of a paradigm shift: the ChatGPT era is ending; reasoning models and agentic AI are coming. Now is a great time to get into the driver’s seat.

Danny: Brilliant. With that, Antoine, thank you so much.

Antoine: Thanks for having me. It was really fun to talk about this stuff.