Recce wants to help developers 'ship working data faster'
Working with data is different to working with software. But it doesn't have to be.
This is the first in Forkable’s new weekly COSS Corner series, where I profile startups and key figures from the commercial open source software (COSS) space.
In the first story, I check in with CL Kao (pictured above), creator of a code-versioning system used at companies such as Apple some two decades ago. Fast-forward to today, and Kao is now founder and CEO of Recce (“reh-kee”), a fledgling open source startup that’s building “data native” code review tools to “solve a fundamental gap in how data systems are managed today,” he explained to me.
It’s all about giving analytics engineers the tools to ensure the accuracy of data pipelines — which is vital in the age of AI.
‘Preview, decide, deploy’
In the modern software realm, technical teams can lean on the structured, tried-and-tested “preview, decide, deploy” process, enabling them to move fast without completely breaking things. They can simulate changes in test environments, make decisions based on these findings, and push these changes into production with relative confidence.
Data systems, on the other hand, are a different animal — data is constantly changing, coming from different sources, in different formats. Dependencies are also often hidden, meaning that small changes can break things — a lot.
“Data teams operate in the dark — making changes without being able to preview their impact, deciding without complete information, and deploying with crossed-fingers rather than confidence,” Kao explained.
And so Recce is essentially transposing the “preview, decide, deploy” ethos from the software realm onto data systems.
Launched as an open source project back in 2023, Recce helps data teams do better data validation — for example, comparing datasets, checking outputs, and ensuring consistency — all within their usual workflows.
While this is important for any data-centric business, it’s particularly crucial in the age of AI. As large language models (LLMs) and other AI systems become increasingly commoditized, data becomes how companies stand out from the competition — so it’s important to ensure the data is correct.
“We believe most code reviews in the future will become data reviews, as data correctness becomes a defining element for success,” Kao explained. “Recce’s mission is to ensure the stability and accuracy of complex data systems as AI, and specifically LLMs, drive more data transformation.”
This all links back to the company name, too: Recce is short for “reconnaissance”.
“In other words, we help data teams do recon’ missions to assess the data impacted in a strategically precise way, so you don’t blow up the entire business dependent on the data produced,” Kao said.
Data observability tools such as Metaplane (which Datadog just acquired, FYI) already help address this. These are great for “knowing something is on fire,” but Recce is striving to offer something more preventative, catching problems before the damage is done.
“Most solutions are hand-assembled and just show you at most ‘what’s changed and [what’s] impacted’ without helping to determine if the changes and the intentions are aligned,” Kao explained. “This often creates a very low signal-noise ratio, and people will just not look at them at all.”
Recce works natively with dbt (Data Build Tool), a popular open source command-line tool that enables data teams to transform raw data into clean datasets ready for analytics (dbt is the “T” in ELT, or “extract, load, transform”). Analytics engineers can use Recce to “curate their proof of correctness,” as Kao puts it, letting them evaluate data changes before shipping to ensure that everything downstream works as expected. It’s all about double-checking their work, providing a “before and after” view of the data before anything goes live.
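To make the “before and after” idea concrete, here is a minimal Python sketch of a row-level comparison between a production table and a development build of the same table. This is not Recce’s API or implementation; the `diff_rows` helper, the `order_id` key, and the sample rows are all hypothetical, chosen purely for illustration.

```python
def diff_rows(before, after, key):
    """Compare two lists of row dicts, matching rows on the `key` column.

    Hypothetical helper for illustration only; not part of Recce or dbt.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Keys present only in the new build.
    added = sorted(after_by_key.keys() - before_by_key.keys())
    # Keys that disappeared in the new build.
    removed = sorted(before_by_key.keys() - after_by_key.keys())
    # Keys present in both builds whose row contents differ.
    changed = sorted(
        k
        for k in before_by_key.keys() & after_by_key.keys()
        if before_by_key[k] != after_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data: the table as it exists in production...
prod_rows = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 250},
    {"order_id": 3, "amount": 75},
]
# ...and the same table rebuilt from a development branch.
dev_rows = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 275},
    {"order_id": 4, "amount": 60},
]

summary = diff_rows(prod_rows, dev_rows, key="order_id")
print(summary)
```

A real check would run against warehouse tables rather than in-memory lists, but the question it answers is the same: which rows would appear, disappear, or change if this went live?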
The main Lineage Diff interface in Recce, for example, enables users to see at a glance the potential area of impact from any modeling changes made to dbt data.
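The general idea behind an “area of impact” view can be sketched as a walk through a dependency graph: start from the models that changed and collect everything downstream of them. The Python below is a simplified illustration under that assumption, not a description of how Recce computes its Lineage Diff; the `downstream_of` function and the model names are hypothetical.

```python
from collections import deque


def downstream_of(changed, children):
    """Return every model reachable downstream of the changed models.

    `children` maps a model name to the models that depend on it directly.
    Hypothetical helper for illustration only; not part of Recce or dbt.
    """
    impacted = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            # Skip models already recorded, and the changed models themselves.
            if child not in impacted and child not in changed:
                impacted.add(child)
                queue.append(child)
    return impacted


# Made-up lineage: each key feeds the models listed in its value.
children = {
    "stg_orders": ["orders"],
    "stg_customers": ["customers"],
    "orders": ["customers", "revenue_daily"],
    "customers": ["customer_ltv"],
}

changed = {"orders"}
impacted = downstream_of(changed, children)
print(sorted(impacted))
```

In this toy lineage, editing `orders` touches three downstream models while leaving the staging models alone, which is the kind of scoping that lets a reviewer focus on what could actually break.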
The open source factor
Kao claims more than two decades in the open source and developer tooling space — he created SVK, a distributed version control system (VCS) built upon an open source Git-precursor called Subversion (SVN). While SVK saw early uptake at companies including Apple and Ubisoft, Git came along in 2005 and eventually became the industry standard once GitHub gained a foothold — the rest, as they say, is history.
But this history is primarily why Kao is continuing to embrace open source as a distribution model.
“I’ve been in the open source community for 25 years, and I strongly believe this new workflow is paradigm-shifting, impacting how we build the future of software based on data-centric systems,” Kao said. “And [this] requires a low-friction adoption path and trust-worthiness via open source to get these best practices well-adopted.”
Kao says that the core Recce open source project is seeing in the region of 3,600 downloads per week on GitHub, with users including The Philadelphia Inquirer, the Department of Health for Rio de Janeiro in Brazil, and various fintech and healthcare startups.
The open source incarnation, which hits version 1.0 this week, has restrictions though — for example, it’s focused more on single-user scenarios, rather than teams. As such, Recce this week unveiled a new cloud product in private beta, which ushers in new sharing and collaboration features, and GitHub workflow integrations. And of course, as a SaaS product, it removes the “operational overheads” of hosting.
The company also announced $4 million in funding to support its commercial push. The pre-seed round was led by dev-focused VC firm Heavybit, with participation from Vertex Ventures, Hive Ventures, Visionary, SVT Angels, Brighter Capital, Ventek Ventures, Scott Breitenother and Tim Chen.
Jesse Robbins, general partner at Heavybit and co-founder of Chef and Orion Labs, is now joining Recce’s board. Notably, Robbins is joined on the board by open source pioneer and Apache Software Foundation founding president Brian Behlendorf, who has known Kao for years through their respective work in the open source world.
Behlendorf says that he’s getting behind Recce largely because of the need for more “predictability” in the age of AI.
“AI models bring a large degree of randomness to software development, especially for data-intensive applications,” Behlendorf said in a statement. “This raises the premium on data-forward testing tools to get closer to predictability. Until now, that’s been done in a bespoke manner and largely by hand. CL, who has been a longtime collaborator of mine on open-source projects, is solving this problem beautifully with the Recce toolkit.”