Folks in tech like to believe they understand the value of data. On a technical level? They absolutely get it. But I’ve now been in and around data businesses for a few years, and I feel confident saying at least this much: the data business, specifically data-as-a-service, is the most misunderstood business model in tech.
What is data-as-a-service (DaaS)? A DaaS business uses raw data sourced from beyond its direct customer relationships to produce actionable aggregated information. This information can take the form of a score, a report, a dashboard, an API feed, a terminal, or some other delivery format.
If that sounds narrow, you’re right. Data is core to how most scalable businesses operate, so we must distinguish between companies that use data and companies that provide it. Google and Meta make more lucrative use of data than any other businesses in history. But it would be wrong to call them data businesses. They are clearly in the advertising business.
Here are some examples of companies that I’m talking about:
Clearbit, ZoomInfo, and Apollo.io are data companies. They collect contact details on businesses and the people who work at them from many sources. Then they clean and deliver the details in context or as lists.
Placer.ai is a data company. They aggregate location data with other relevant datasets and use those to deliver actionable information across a variety of reports (or, in some cases, as a score).
Fair Isaac — you know them as FICO — is a data company. As are the consumer credit bureaus Experian, Equifax, and TransUnion. The mechanics of how these businesses interact and their long-term defensibility are so heavily influenced by their individual histories and American law that they could fill entire reference books, so I will simply note that they fit the definition and move on.
Most founders do not know how to build these companies. Investors barely know how to evaluate them. That's a shame. Great data businesses are nearly impossible to dislodge, and that grants them the runway to get very large. Consider:
CoStar Group — Market cap consistently above $30B since 2020. Does ~$2.5B in annual revenue. About half of that is from a long-running commercial real estate data and intelligence product.
ZoomInfo — Crossed $1B in revenue in 2022 and has been growing profitably for years, even though they operate in the crowded B2B contacts market.
Black Knight — Data and technology provider to mortgage lenders. It was doing more than $1.5B in profitable revenue before it sold in 2023 for $12B to Intercontinental Exchange. (ICE, like its cousin Nasdaq, is also a proprietor of a phenomenal financial data business.)
IRI — Taken private for $5B and merged with NPD in 2023 to create a consumer spending data powerhouse.
Factset — Collects and repackages financial data. Does more than $2B in annual revenue.
Bloomberg — Its reputation precedes it. A New York Times report in 2023 cited a 2022 revenue figure of $12B.
Hardly anything is written about these companies. Which strategies make sense in data? What does that suggest about how to resource and run the go-to-market and partnership/ecosystem functions in these companies? There is some overlap with SaaS, but it is not as straightforward as running a standard SaaS playbook.
At root, people don't know why data-as-a-service businesses are defensible. They are successful, but there's no popular understanding of why. My goal with this essay is to put what I’ve learned into an existing and widely understood strategic framework: Hamilton Helmer’s 7 Powers.
I’ll touch on each power, but there are only three that matter in the long run, and two in particular that ladder up to the third.
Data and the Two Powers
The first two powers that matter are Cornered Resource and Process Power.
You might think it’s intuitive for a data business to rest on exclusive data sets and the ability to process that data into something useful. But this actually causes serious tension for most new market entrants.
Cornered Resources are rare. You either have privileged access to a unique, powerful, non-substitutable source of data or you have to build one from scratch.
Process Power in data businesses tends to be path dependent. Companies build it over years of grueling trial and error.
Consider some simple questions one could ask about location and POI (point of interest) data: What are the boundaries of this POI? What counts as a visit? If we know someone was X feet away from this POI, should we count that? Answering those questions for a single shopping mall is difficult enough. Now consider doing it for every shopping mall in the United States.
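To make the flavor of these judgment calls concrete, here is a deliberately simplified sketch in Python. Everything in it (the `Ping` shape, the 25-meter threshold, the two-minute dwell minimum) is a hypothetical assumption, not how any vendor actually does it; the point is that each constant encodes an answer to one of the questions above, for one POI.

```python
from dataclasses import dataclass

@dataclass
class Ping:
    timestamp_s: float   # seconds since epoch (pings assumed sorted by time)
    distance_m: float    # device's distance from the POI boundary

def is_visit(pings, max_distance_m=25.0, min_dwell_s=120.0):
    """Count a visit if the first and last in-range pings span at least
    min_dwell_s. Both thresholds are judgment calls that have to be
    re-examined for every kind of POI (a mall is not a coffee cart)."""
    in_range = [p for p in pings if p.distance_m <= max_distance_m]
    if len(in_range) < 2:
        return False
    return in_range[-1].timestamp_s - in_range[0].timestamp_s >= min_dwell_s
```

Even this toy version smuggles in answers: what the boundary is, how far away still counts, and how long someone must linger. Multiply those decisions across every POI in the country and you can see where the grind comes from.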
Developing a data product is an iterative process, but it has longer feedback loops than other modern software. Data accuracy and trust issues tend to be reported and fixed only when someone happens to look in the right place (like a relatively new shopping mall with little historical data). Contrast that with a product gap or a UX/UI bug, which will likely be exposed to the entire userbase and reported relatively quickly or uncovered through proven research methods. So building a large data-as-a-service business tends to take longer than building any other software business. It is uncommonly risky for software.
No wonder most tech entrepreneurs prefer to build workflow-centric products.
But the flip side of this is that a company that grinds out those two powers has built a moat that is almost impregnable from a head-on challenge, even if that challenger is better capitalized.
Developing Process Power requires time and experience with the relevant data, and that requires access. If the challenger is lucky, the incumbent won't have exclusive access to the most important datasets. But the incumbent’s feedback loops will be shorter because more customers are reviewing the data and reporting back data trust issues.
Historical data is valuable for many years. In most categories, customers care about time-series comparisons (e.g. changes month-over-month, year-over-year, etc.). There are rarely shortcuts to speed up this data gathering process. Consistent historical data is a Cornered Resource.
This discourages challengers. Any new entrant will enter the market with a product that is objectively worse on historical depth, coverage, or accuracy.
And to the extent a challenger delivers something that an incumbent’s customers want — usually a specific way to use the same data — the incumbent has a clear runway to respond.
The best DaaS businesses are either table stakes for a category or vertical (like CoStar in CRE) or play outsized roles in driving more or better transactions (like FICO for lenders or Placer in CRE, retail, and CPG). Given a large enough market, a data company that has cornered its core inputs and developed serious process power then has a single strategic goal over all others: Grow.
Because while Cornered Resource and Process Power are very difficult to overcome, determined challengers can still chip away at them over time with more grit than it took to build the incumbent. They are not structural moats.
Not like the third power: Scale Economies.
Scale Enables Sustainable Competitive Advantage In Data Businesses
The canonical example of Scale Economies — featured in the very first chapter of 7 Powers — is Netflix. To summarize: Netflix has a cost advantage over its challengers because it can amortize the cost of content over a much larger subscriber base than any other streamer; as a result, Netflix can bid more for content on an absolute basis and — even before accounting for the other assets and powers it brings to bear — it will monetize that content more effectively than its competitors could. And once scale is established, especially in a high-margin business in a maturing market, it becomes prohibitively expensive for a subscale player to capture share in a head-on challenge.
Data businesses with Process Power and Cornered Resources leverage Scale Economies in a similar way. A scaled data business is generally the highest bidder on any new addition to its core data sets and for complementary data sets that enhance the value of its core data offering.
But the raw data itself is non-rivalrous. What stops the owner of a new core data source from seeding competitors and attempting to commoditize its partners in the value chain?
In practice, most data products require what Martin Casado coined a “minimum viable corpus”: a level of size and quality below which the entire product is worthless. In some data categories, you can count the number of minimum viable corpuses on one hand, and not all of them are for sale. Without an available substitute, challengers have limited options.
They can try to build their own data set from scratch. For the reasons I noted in the previous section, this approach is rarely attractive.
They can attempt to piece together a new corpus from multiple existing subscale sources. A challenger can do this to get off the ground more quickly and build process knowledge, but this approach introduces several new problems. Most notably:
Data quality issues and inconsistencies between sources. Developing the Process Power to solve data trust at scale is difficult enough with a viable corpus. Now you have to account for conflicts between data sources? How do you productize the decision on who to trust for each conflict? This is barely tenable, and it increases coordination costs.
Your key partners likely do not have a business case to dedicate resources to this relationship. If they did have a case, they would already be selling data to the incumbent and looking to lock in a higher rate through an exclusivity agreement. Recall that Process Power in data is path dependent. Building it on a corpus that is unreliable and out of your control is a massive risk.
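To illustrate the kind of decision that has to be productized, here is a minimal sketch of field-level conflict resolution via a fixed trust ranking. The source names and the single global ranking are hypothetical simplifications; a real pipeline needs per-field and per-segment rules, recency weighting, and an audit trail.

```python
# Lower rank = more trusted. Deciding which source belongs where is
# itself a hard, ongoing judgment call (source names are hypothetical).
SOURCE_RANK = {"registry": 0, "scrape": 1, "partner_feed": 2}

def resolve(records):
    """Merge records about one entity: each field comes from the most
    trusted source that supplies a non-empty value for it."""
    merged = {}
    for rec in sorted(records, key=lambda r: SOURCE_RANK[r["source"]]):
        for field, value in rec.items():
            if field != "source" and value and field not in merged:
                merged[field] = value
    return merged
```

Every line here hides a policy question (what if the trusted source is stale? what if two sources tie?), and the challenger stitching together subscale feeds has to answer all of them at once.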
Once a scaled player emerges and establishes a corner on a viable corpus, it becomes extraordinarily expensive and risky to attempt to enter the market and compete with a similar product. And so an incumbent with these three powers can achieve the dream: sustainable competitive advantage. They acquire their inputs and complements for less than it would cost a challenger and they can deliver a superior product.
This phase of a company goes by a few names. My personal favorite is “strategic dominance.”
Nasdaq and NYSE have corners on all of the data related to trades that take place on their exchanges. When technology to cheaply distribute that data became widely available, they were well-positioned to build out the data piece of their businesses. As the financial exchange market has consolidated, in part to gain greater leverage on this data (not to mention pricing power in the exchange segment), Nasdaq and ICE — which owns NYSE — formed a cozy duopoly. As Abraham Thomas notes in The Economics of Data Businesses, the largest markets for data tend towards duopolies in their mature states.
Placer.ai actually pulled off the rare reverse version of this strategy: they entered the location data market with top-tier technical and process knowledge, further developed it to best-in-class, and then pounced on an opportunity to corner some very important datasets. They were then well-positioned to hit the gas on classic SaaS growth strategies and they are now the scaled player in their market.
What About Those Other Four Powers?
In descending order of importance:
Switching Costs: Very important
There are two primary methods for a DaaS business to increase the barriers for customers to switch away.
Distribute the data widely in a format that becomes table stakes for transactions between customers. Ideally, this should happen in the course of the company’s growth into the scaled player — ubiquity and Scale Economies tend to go hand-in-hand.
Integrate data distribution and consumption into customers’ existing workflows. This is the journey to building a data product into a platform.
Making your product stickier is a natural move for a scaled data company. The main reason I didn’t highlight it above is that, given the three prior powers, Switching Costs are less of a moat and more of a churn mitigation strategy.
But there is a very important corollary to that: Switching Costs are far more important when the market for data is large enough to support multiple players, relevant process is not difficult to learn, and/or a cornered dataset is non-existent or relatively easy to substitute or build from scratch.
When those market conditions are in play, these DaaS companies can lean on Switching Costs. You can see this in the B2B contacts market, where ZoomInfo, Apollo.io, and Clearbit (which is soon to be owned by HubSpot) are battling.
The market is huge and supports multiple vendors. Almost every sales-led business buys B2B contact data from at least one source.
Each player can build its own dataset, but historical data is not very valuable. Their customers don’t need to analyze the evolution of their contacts’ careers.
The data is not particularly difficult to capture or scrape. Process Power is mostly limited to how each company manages individual career changes, but this is a well-constrained problem.
To protect against competition, each player has expanded its portfolio to include complementary products — it’s always harder to leave a bundle. They've also built integrations into the CRMs that are truly locked in via customers’ own data and the sales engagement/productivity suites that are locked in via frontline employee workflows.
The cherry on top here is that the true winner in B2B contacts is LinkedIn. Their Sales Navigator and Recruiter products always have up-to-date contact info because they have the ultimate Cornered Resource in the user-driven LinkedIn network and they make it available directly to end customers. No scraping or processing required.
Brand: Somewhat important
To the extent Brand is relevant to data products, it tends to show up in two ways:
It reinforces that one company’s product is the informational currency of a category, especially when that information directly drives business.
It reduces uncertainty for less sophisticated buyers of a data product.
Bloomberg is the most notable example of this power in action. Nobody gets fired for buying a Bloomberg terminal (to do something legal). But this is hardly the most important power for any data company.
It’s arguable that Brand helps smooth the path to strong partnerships and co-marketing with other owners of important datasets or workflows. This can generate substantial value for a company. But this is mostly downstream of scale.
Network Economies: Mostly unimportant, except when it is the only thing that matters
Earlier, I defined DaaS businesses to include only companies that source their raw data outside of their customer relationships. Because trust between a data vendor and a customer is paramount, most DaaS companies stay within these bounds. They reject Network Economies to preserve customer privacy and to enable the delivery of relevant data about competitors.
Once a data company builds enough scale and trust, they may be able to entice customers to share some of their own data in exchange for additional insights, likely in the form of benchmarks against the full dataset. Vendr was able to grow quickly by offering immediate ROI in exchange for SaaS contract data. But receiving customer data is generally a privilege a company earns. CoStar Group’s CRE intelligence business was (infamously) built by frontline employees hounding and cajoling prospects and customers to share their lease comps. Now they’re so deeply embedded in the ecosystem that those same folks hand over the data willingly (if begrudgingly).
But I already mentioned an exception: LinkedIn, which sits in an unassailable position in the market for B2B contacts because its cornered data set is driven by a network of individual users.
An even more interesting example is playing out now. It's in a market that is shifting from network-driven ads to network-driven data. It's the business of company reviews.
Glassdoor has a corner on a massive corpus of employee-generated company reviews. It collects its data from users, who voluntarily share their own reviews to gain access to the site: a classic give-to-get growth strategy.
Glassdoor leaned all the way in on the give-to-get model and earned Scale and Network Economies. At the time, advertising was the best way to monetize this data.
The business has essentially been in harvest mode for almost a decade. The data that Glassdoor collects is not labeled or formatted to produce more than superficial insights. But people keep using it because there is a critical mass (read: minimum viable corpus) of reviews on the site already.
Now a challenger has arisen for a high-value segment of the market: RepVue, which focuses on collecting reviews and other key data from sales and GTM professionals. RepVue has a much smaller and narrower userbase, but they have already reached critical mass on the breadth and depth of their review coverage. And they have done it with a product that can support both an ad-based business and a data business that processes review data cleanly into actionable information. The customers for this data are businesses and investors who want to benchmark sales teams and forecast sales based on GTM-specific metrics like quota capacity and attainment.
As I sit here today, even with limited information, I expect RepVue to eventually overtake Glassdoor for mindshare in its core GTM categories, with the potential for expansion into other professional verticals.
RepVue is not selling this data into a market with a strong existing data provider. But what they have done to date suggests an alternative path driven by Network Economies (which, not coincidentally, may be the only power more difficult to attain than the Cornered Resource + Process Power combo). And so I need to offer a caveat to everything I have written to this point: A scaled and defensible data business is still vulnerable if its cornered resource can be recreated via individual users who are driven by network effects to contribute their data.
Which brings me to…
Counter-Positioning and how to attack a data business
Counter-Positioning is temporary by definition, so it doesn’t make sense to assign it long-term importance like I did for the other six powers. It is also a reaction to how incumbents are already positioned, so it is difficult to generalize about it. But I have noticed three common patterns.
A market entrant brings a user-generated network to bear to collect data. I detailed this above.
The incumbent (or incumbents) monetizes by collecting and reporting data on demand in a market that wants to track changes over time. This type of incumbent is far more common than you might think, but you only know their names if you work in their industries, because they are basically consulting shops. When you see data-based reports that represent snapshots in time, there is likely an opportunity to counter-position with an always-collecting data-as-a-service offering.
The incumbent data player, in contrast to how I drew the definition earlier in this essay, relies on its customers for its core inputs. Retail point-of-sale data aggregators like IRI and NielsenIQ are the canonical examples. When this happens, the incumbent will likely be contractually limited in what it can serve back to its customers so that anything that represents true competitive intelligence is neutered or eliminated entirely. A new data entrant that can access data outside of this loop can counter-position with a simple sales pitch: “Want to know secrets about your competition?”
Are we having fun(gibility)?
If you are reading this and loving it but you are still new to the data business, I want to warn you about a specific way you might be tripped up as a founder, operator, or investor: Tricking yourself into believing your cornered dataset is valuable.
I alluded to this earlier but let me restate it here: Different categories of data vary widely in how easy it is to find or create substitute data sets.
Let's say you are evaluating a market for data and you are convinced it is large. You find a company with a cornered dataset and some level of Process Power that isn’t scaling quickly. Broadly, there are three reasons why.
You found a massive opportunity. Congratulations! You might be right! But each day that passes in our current technological paradigm, it gets a little less likely that you’re right.
The same job that your data is good for is already done by leveraging a different category of data. Depending on how much better your product can be for some set of use cases, this might warrant further customer development.
A minimum viable corpus for this category already exists and is available from enough sources that the choke point for this market actually exists somewhere else, probably at the intersection with another dataset. In other words, the data is not particularly valuable on its own. My personal favorite example of this is POI data, which is incredibly fun to work with and can also be accessed on demand or in bulk from so many sources that they’ll start spamming your work email the instant you hang a shingle saying that you work in GIS.
That’s all. For now.
Thanks for reading. I’ve tried to leave actionable takeaways throughout, but here are a few key points that I hope you remember.
Data businesses are inherently risky, but the rewards are massive and accrue to scale.
There are vulnerable data businesses actively trying to mask their limitations behind customer-unfriendly switching costs (or by just flying under the radar).
It is extremely difficult to understand a market for data from the outside until you learn why that data is scarce and valuable.
You may be able to beat a scaled, defensible data business with a network.
Onward.
Shoutouts
Thanks to the following people who read earlier drafts and offered valuable feedback:
Abraham Thomas — If you want to go even deeper on data, his newsletter Pivotal is a must-read
Special thanks to Ben and David from Acquired. Their episode on Renaissance Technologies and RenTec’s 7 Powers analysis was what inspired me to actually sit down and synthesize what I’ve learned about data businesses.
Like this essay and want to talk about it some more? Leave a comment or send me an email: firstname.lastname at gmail