How “open” will Europe’s open source LLMs be?
Plus: OpenAI plans its next open source project; and the OSI takes umbrage at Meta.
In issue #5 of Forkable, we take a closer look at Europe’s plans for open source LLMs, including how “open” its models will be.
Elsewhere, OpenAI CEO Sam Altman is soliciting feedback for his company’s next open source project; and Meta’s “open source” claims around its Llama models are in the spotlight once more.
If you haven’t subscribed to Forkable already, please do so now to receive new posts direct to your inbox each week.
Paul
Open issue
Europe faces familiar challenge building “truly open” LLMs
As I wrote on Forkable two weeks ago, Europe is pursuing its own open source large language models (LLMs), with a consortium of some 20 organizations joining forces to create a series of “truly open” multilingual LLMs. The project forms part of a broader digital sovereignty push across the European Union, one that has led Big Tech to invest in local infrastructure to ensure data stays in the region, and has produced an $11 billion deal to create an independent constellation of satellites to rival Elon Musk’s Starlink.
OpenEuroLLM, as the new AI program is called, is being co-led by Jan Hajič, a computational linguist from Charles University in Prague, and Peter Sarlin, CEO and co-founder of AMD-owned Finnish AI lab Silo AI.
TechCrunch caught up with Hajič and Sarlin to get the lowdown on how they plan to develop their models while preserving the “linguistic and cultural diversity” of all EU languages. But a central question in all of this is how “open” the models will really be. For context, the official “open source AI definition” (as laid out by the folks at the Open Source Initiative, at least) hasn’t been warmly embraced by all. One reason is that the definition doesn’t make opening up training data mandatory, on the grounds that AI models are often trained on data with redistribution restrictions.
Open source AI proponents, on the other hand, argue that true “open source” AI models should make everything available to everyone — the training datasets, pre-trained models, weights, and so on.
And so the OpenEuroLLM project faces these same dilemmas.
Hajič told TechCrunch:
The goal is to have everything open. Now, of course, there are some limitations. We want to have models of the highest quality possible, and based on the European copyright directive we can use anything we can get our hands on. Some of it cannot be redistributed, but some of it can be stored for future inspection.
This could mean that the OpenEuroLLM project will have to keep some training data under wraps, but make it available for audits as required under the rules of the EU AI Act.
Skeptics have also questioned whether the OpenEuroLLM project budget will be enough. Around €37.4 million has been ring-fenced just for building the models themselves, around €20 million of which will come from the EU. The actual budget is higher once you factor in funding for related work, including what is arguably the biggest expense of all: compute. Indeed, the OpenEuroLLM project partners include EuroHPC supercomputer centers, so it’s hoped that compute costs for the project will be minimal.
On top of that, Sarlin noted that OpenEuroLLM isn’t setting out to build a consumer- or enterprise-grade product like OpenAI or Google might. It’s only creating the models for others to build upon.
Sarlin told TechCrunch:
The intent here isn’t to build a chatbot or an AI assistant — that would be a product initiative requiring a lot of effort, and that’s what ChatGPT did so well. What we’re contributing is an open source foundation model that functions as the AI infrastructure for companies in Europe to build upon. We know what it takes to build models, it’s not something you need billions for.
Read more: Open source LLMs hit Europe’s digital sovereignty roadmap
The rundown
OpenAI’s next open source project
After recently stating that OpenAI may have been on the wrong side of history regarding open source, CEO Sam Altman took to social media this week to solicit feedback as the company looks to embrace open source AI models once more.
Altman asked his followers on X whether OpenAI’s next open source project should be a trimmed-down version of its o3-mini reasoning model that still needs cloud-powered GPUs, or an even smaller model compact enough to run on-device (thus better addressing privacy and latency concerns).
It’s not clear to what degree Altman’s poll will influence OpenAI’s ultimate decision, but at the time of writing it’s roughly neck-and-neck, with a GPU-dependent o3-mini narrowly in front of the phone-sized model by 53.9% to 46.1%.
Meta’s ‘open washing’
Facebook’s parent company Meta continues to call its Llama family of LLMs “open source,” even though they fail just about every conceivable interpretation of the term “open source.”
And that is why the Open Source Initiative, long-term steward of the “open source definition” and, more recently, the Open Source AI Definition, continues to publicly call out Meta for calling its Llama models “open source.”
In a blog post this week, the OSI requests that the entire open source community “unite and call out Meta’s open washing.” This, it says, has little to do with the OSI’s contentious open source AI definition, and everything to do with the basic freedoms afforded under “open source” more broadly.
The OSI points to longstanding restrictions contained within Meta’s Llama licensing terms, such as limitations on how the model can be used. But it also highlights more recent changes to the terms, including geographic restrictions that arrived with the multimodal Llama 3.2 in September. The license terms stipulate that those located in the European Union are not granted license rights to use the model directly, due to concerns over potential GDPR data privacy contraventions.
The OSI wrote:
A year ago we called on Meta to stop calling Llama 2 “Open Source.” Since then, Meta has released new versions of Llama with new licensing terms that continue to fail the Open Source Definition. Llama 3.x is still not Open Source by any stretch of the imagination. Despite that, Meta keeps on falsely promoting Llama as “Open Source.” You can help us stop that now: call on Zuckerberg and Yann LeCun to change the Llama license and comply with the Open Source Definition.
Patch notes
Together AI announced a $305 million Series B funding round for its “AI acceleration cloud” for building apps with open source AI models.
The Matrix Foundation, stewards of the open standards-based Matrix communication protocol, is in dire need of donations. Near-term, it needs $100,000 by the end of March, else it will have to shut down several “bridges” it hosts, including a bridge for Slack which enables cross-app communications.