AI crawlers 'wrecking the open internet'
Plus: eBay invests in GraphQL federation; and French / German governments collaborate on a new open source alternative to Google Docs.
In issue #10 of Forkable, I look at the impact that AI crawlers are having on software developers, in particular open source projects.
Elsewhere, a fledgling open source startup has secured the backing of eBay to address API sprawl in the GraphQL ecosystem. And the French and German governments have teamed up to launch an open source alternative to Google Docs.
Finally: I am putting a call out to super early-stage startups working in the open source sphere that are seeking exposure, as I intend to write a weekly profile piece in Forkable.
Ideally you will have an open source project with some traction already, and you should have clear plans to monetize this project (if you’re not monetizing it already). If you’ve secured VC investment, that’s a bonus but it’s not essential.
I will also consider startups with products that aren’t necessarily open source, but which solve specific pain points for the open source community — for instance helping maintainers get paid, or addressing security glitches.
I’m really just looking to shine a spotlight on all the cool early-stage businesses operating in the open source space that aren’t yet on the mainstream radar. If this is you, or you know of any such project that deserves exposure, reach out to me at: forkable[at]pm.me.
Paul
Open issue
AI crawlers are a costly headache for open source devs
We all know that data-hungry AI companies develop large language models (LLMs) by crawling the web. While this raises all manner of issues about data ownership and consent, it’s also creating a major headache for developers.
Software engineer and open source advocate Drew DeVault penned a blog post a couple of weeks back, titled somewhat humorously: Please stop externalizing your costs directly into my face. In the post, DeVault bemoans the fact that crawlers completely ignore the robots.txt file that instructs automated traffic to jog on, causing outages and driving up costs through increased server load.
DeVault wrote:
These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses — mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
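For readers unfamiliar with robots.txt, here is a minimal sketch of what a well-behaved crawler is supposed to do before fetching a page, using Python's standard-library parser. The bot name, rules, and URLs are hypothetical and are not taken from DeVault's setup. The key point is that the file is purely advisory: nothing forces a crawler to run this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one made-up AI bot is banned outright, and
# everyone else is asked to stay away from an expensive endpoint.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /blame/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before every request and skips anything disallowed.
ai_bot_allowed = parser.can_fetch("ExampleAIBot", "https://git.example.org/repo/log")
browser_allowed = parser.can_fetch("SomeBrowser", "https://git.example.org/repo/log")
browser_blame_allowed = parser.can_fetch("SomeBrowser", "https://git.example.org/blame/file.c")

print(ai_bot_allowed, browser_allowed, browser_blame_allowed)
```

A crawler that skips this step, or that sends a random User-Agent so it never matches a named rule, sails straight past the file, which is the behavior DeVault describes.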
This is not a new problem. Earlier this year, software developer Xe Iaso reported that Amazon’s AI crawler was making their Git server unstable. However, DeVault’s post seems to have shone a brighter spotlight on the issue, with several outlets picking up on this growing problem, including Libre, which noted last week that FOSS infrastructure is under attack by AI.
Ars Technica, meanwhile, published an extensive report detailing how aggressive crawlers powered by some of the world’s biggest AI companies — including OpenAI, Amazon, and Anthropic — are overloading community-maintained infrastructure.
In response, we’re starting to see some creative solutions. Earlier this month, Cloudflare launched AI Labyrinth, a new tool designed to mitigate bothersome bots by using AI-generated content to slow them down and waste their own resources. AI thwarts AI, is the general idea here.
Software engineer and “Pragmatic Engineer” Gergely Orosz also took to social media this week to complain that “AI crawlers are wrecking the open internet,” name-checking Meta in particular as a core culprit that has driven up bandwidth demands for one of his side projects. As Orosz framed it, he’s having to pay out of his own pocket to train all these LLMs, so he’s switching his domain’s DNS to Cloudflare in the hope that its new solution will “block these bots.”
“If it doesn’t work, I would either need to look for a host that allows terabytes of bandwidth on their plan, or consider closing this side project — as 90% of traffic is bots, not humans,” Orosz wrote. “This is how these AI crawlers kill the open internet: if my small site struggles with this problem of being forced to serve overwhelmingly bot traffic, then so will other sites.”
Read more: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries
The rundown
eBay backs WunderGraph to power GraphQL federation
Emerging from the open source vaults at Facebook a decade ago, GraphQL plays a big part in the modern microservices movement. In a nutshell, GraphQL is a powerful data query language for APIs, enabling applications to request the precise data needed at any given point.
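To make "request the precise data needed" concrete, here is a rough sketch of how a client might build a GraphQL request in Python. The `item` type and its `title` and `price` fields are invented for illustration and do not describe any real schema. The request body follows the common convention of a JSON object with `query` and `variables` keys, and nothing is actually sent over the network.

```python
import json

# Hypothetical query: the client names exactly the fields it wants
# (title and price) and the server returns only those.
query = """
query ItemSummary($id: ID!) {
  item(id: $id) {
    title
    price
  }
}
"""

# GraphQL requests over HTTP are conventionally sent as a JSON body
# carrying the query text plus any variables it references.
payload = json.dumps({"query": query, "variables": {"id": "123"}})

print(payload)
```

If the same client later needed a seller name as well, it would add that field to the query rather than call a second endpoint.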
As applications grow, this can lead to unwieldy API sprawl that’s difficult to orchestrate at scale. That is where GraphQL federation enters the fray, enabling multiple teams to work and build graphs together as part of a distributed architecture. This is precisely why Apollo raised a huge amount of money back in 2021, as it sought to fund its federation efforts. However, Apollo also changed its federation product from an open source MIT license to a proprietary “source available” Elastic License.
And so this is where WunderGraph comes in, offering an open source alternative to Apollo Federation in the form of Cosmo. WunderGraph this week announced it has raised $7.5 million in a Series A round of funding, including a strategic investment from one of its design partners — eBay.
While eBay could develop its own in-house GraphQL federation tooling, it would probably prefer not to have to maintain all of this itself. So instead it has been working with WunderGraph to hone its product, and make it suitable for a global-scale tech juggernaut.
“I would say we are experts in federation, but we don’t have experience in eBay-scale problems,” WunderGraph co-founder and CEO Jens Neuse told me in my story for TechCrunch this week. “And so by having this very close relationship, they taught us everything in terms of how we need to build our product so that it can be integrated into companies like eBay, because they have very specific requirements.”
What’s up, Docs?
There have been growing calls to build an independent European tech stack to address the bloc’s over-reliance on infrastructure controlled by U.S.-centric Big Tech. This push includes investing in sovereign large language models (LLMs), and a constellation of satellites to rival Elon Musk’s Starlink.
On the software front, meanwhile, a curious new joint effort between the French and German governments has given birth to Docs, an open source, collaborative document editor akin to Google Docs.
Granted, they probably could’ve come up with a more imaginative name than Docs, but it is what it is. Docs can be self-hosted, and there is already quite a long list of features on the product roadmap for the coming months.
Patch notes
A critical flaw has been found in open source web development framework Next.js, potentially allowing hackers to bypass authorization in middleware — so update to the patched version now.
Browser Use, an open source tool designed to make websites more accessible for AI agents, announced a $17 million round of funding this week.
Google is to move the entirety of development work on the Android Open Source Project (AOSP) to private, internal branches, as per a report in Android Authority. It basically means we won’t know what’s happening until Google releases the code to the public versions of Android.
Want to transform your ancient Kindle into the “ultimate open source e-reader”? ZDNET published a handy guide for doing just that.