Snowflake vs Databricks, the birth of unlimited and decoupled data scaling in the cloud, and Snowflake's outlook
An overview of the space
The birth of unlimited and decoupled data scaling in the cloud
Old-school data management systems turned out to be a nightmare at handling the rise of the internet. While these SQL-based systems provide attractive consistency guarantees coupled with a dazzling capability at handling complex queries, these features came with one huge drawback: the system doesn’t scale. Or at least, only vertically. This means that as your organization grew in size and you had to handle more data, you basically had to buy a bigger server with more memory and more CPUs. With the rise of the internet and the accompanying explosion of data, this turned out to be a very unsuitable design.
There was a clear need to start scaling horizontally, so that you could simply spread your data over more servers. Enter NoSQL databases — such as MongoDB, Cassandra and AWS DynamoDB — which provide exactly this. The drawback, however, is that NoSQL databases are extremely poor at complex queries, as you organize your data around the access patterns of your application. So it’s easy to fetch, for example, the orders that ‘customer Bob’ made, but any sophisticated query beyond that turns out to be a problem.
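To make the tradeoff concrete, here is a minimal sketch using made-up order data: a NoSQL-style key-value layout makes the ‘orders for Bob’ lookup trivial, while an ad-hoc analytic question (average order value per customer) is a one-liner in SQL but awkward in the key-value model. SQLite stands in for a SQL warehouse here; the table and values are purely illustrative.

```python
import sqlite3

# NoSQL-style layout: data keyed by customer, because that is how the
# application accesses it. The "Bob's orders" lookup is a single fetch.
orders_by_customer = {
    "bob": [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": 60.0}],
    "alice": [{"order_id": 3, "amount": 25.0}],
}
print(orders_by_customer["bob"])  # easy: the data is organized around this path

# The same data in a SQL table answers arbitrary analytic questions
# with one declarative query, no matter how the data is keyed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "bob", 40.0), (2, "bob", 60.0), (3, "alice", 25.0)],
)
rows = con.execute(
    "SELECT customer, AVG(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # average order value per customer
```

Answering the same GROUP BY question against the dictionary would mean hand-writing the aggregation loop, and every new question would need a new loop: that is the query gap the article describes.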
This is where data warehouses come in. Via data pipelines, the data is transferred from the database world into the data warehouse world, so that you can run analytics on it. The results then feed into nice dashboards so that managers can understand what’s going on inside the business.
In the old world, these data warehouses were also SQL-based and ran on a big vertical server. The main reason for this separation was to take workload off the main server. Remember, there was one server doing all the work, so you wanted to relieve it of as much burden as possible. So data warehousing would run on a separate server, and you would also make slave copies of the original server to handle reads, for example. Whenever the business needed to make a database query, the slave servers would handle it, with the master server taking care of all the writes.
In the new data warehousing world, obviously there was a need for horizontal scaling as well. Over the last two decades, we have seen three generations of solutions here. The first was Hadoop, which allowed for horizontal data scaling in on-premise datacenters. However, Hadoop is a difficult system to manage and comes with a number of other downsides, making a service-based cloud solution a much more attractive option. The first of these was AWS Redshift, a much more customer-friendly platform running on the AWS cloud, the pioneering public hyperscaler at the time.
The problem with Redshift’s architecture, however, is that compute and storage are coupled together. So as your data grows in size, you have to commission more servers and pay for the compute resources they provide. However, you might only be running analytics for a few hours each day, which means that for all the other hours you’re paying for compute resources you’re not using.
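The economics are easy to see with some back-of-the-envelope arithmetic. All rates below are invented for illustration (they are not actual Redshift or Snowflake prices): the only point is that a coupled architecture bills compute around the clock, while a decoupled one bills compute only for the hours analytics actually run.

```python
# Hypothetical rates, for illustration only.
COMPUTE_PER_HOUR = 2.00      # assumed $/hour for a warehouse node
STORAGE_PER_TB_HOUR = 0.003  # assumed $/TB-hour for cheap object storage

def daily_cost(active_hours: float, tb_stored: float, decoupled: bool) -> float:
    """Daily bill: coupled systems charge compute 24/7, decoupled only while active."""
    billed_compute_hours = active_hours if decoupled else 24.0
    return (billed_compute_hours * COMPUTE_PER_HOUR
            + 24.0 * tb_stored * STORAGE_PER_TB_HOUR)

# A shop that stores 10 TB but only runs analytics 3 hours a day:
coupled_cost = daily_cost(active_hours=3, tb_stored=10, decoupled=False)
decoupled_cost = daily_cost(active_hours=3, tb_stored=10, decoupled=True)
print(coupled_cost, decoupled_cost)  # 48.72 vs 6.72 per day under these assumptions
```

Under these (made-up) numbers, the coupled bill is dominated by 21 hours of idle compute; storage itself is almost a rounding error, which is exactly why parking data in object stores and spinning compute down is so attractive.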
This is where the third generation of horizontally scalable data warehouses enters the scene, which Snowflake pioneered. Not only does Snowflake provide unlimited data scaling in the cloud, but it does so in a decoupled manner. This means that your storage and compute resources are fully separated. The clever trick the Snowflake team leveraged is that they would not store data on rented servers in the cloud, but instead make use of the extremely prevalent and cheap object stores the public clouds provide. Compute resources only spin up when the user is running analytics, saving the customer from having to pay for unused resources. Vice versa, during intervals where customers are not using their data warehouse, the data just sits at rest in the object stores, with compute resources having been fully wound down.
Snowflake vs Databricks
Snowflake and Databricks were initially founded for very different purposes. Whereas Snowflake’s goal was to provide a clever and fully scalable data warehouse in the cloud, on which you could run traditional SQL queries and business analytics, the goal of Databricks was to make it straightforward to run scalable data science clusters in the cloud. So with Databricks, you could easily spin up a Spark cluster to process your data, with the code written in a handy notebook in one of your favorite data science languages such as Python or Scala. As the platform runs in the cloud, these notebooks could easily be shared with other data scientists or the community to collaborate on projects.
These two different workloads — business SQL vs data science — are still each company’s respective stronghold. However, over recent years, both companies have been building out their capabilities with the goal of establishing a complete data platform in the cloud, with data warehousing, data processing, and data streaming capabilities.
Snowflake’s answer to Databricks is SnowPark, which similarly offers the ability to process data from notebooks written in popular programming languages. Under the hood, Databricks runs its processes on Spark clusters, or actually these days on Photon, a C++-optimized engine for Spark workloads. Snowflake, meanwhile, uses its own compute engines, also written in C++. Spark was originally written in Scala, which is somewhat slower, so rewriting the engine in C++ provides obvious performance gains. And I suspect at some point the code will be rewritten in Rust, the new and much cooler C++, as the latter is basically a major pain to work with, especially if you want to import a library. Vice versa, Databricks has been building out its data storage solution by leveraging the public clouds’ object stores. You can now run traditional SQL within the Databricks ‘Data Lakehouse’.
As such, both platforms are now well capable of storing both structured and unstructured data. And both have been implementing ACID compliance, meaning that data transactions are applied atomically and leave the data in a consistent state. Both also have a similar pricing model, where you only pay for compute while virtual machines are running. Reading through the Databricks documentation, the platform seems to have become extremely comprehensive, and my impression is that it should be well capable of competing effectively with Snowflake. Snowflake is successfully plowing ahead as well, however, as SnowPark is estimated to become a $100 million revenue business this year, which clearly indicates that the platform is succeeding at attracting data science workloads.
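The atomicity half of ACID is worth a quick illustration. The sketch below uses SQLite rather than either vendor’s engine (the table and balances are invented), but the guarantee is the same one Delta and Snowflake tables aim for: a multi-statement transaction that fails partway through leaves no partial writes behind.

```python
import sqlite3

# Toy ledger: a CHECK constraint forbids negative balances.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance REAL NOT NULL CHECK (balance >= 0))"
)
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100.0), ("bob", 50.0)])
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        # First statement succeeds and credits bob...
        con.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        # ...but alice cannot cover the debit, so the CHECK constraint fires.
        con.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # the whole transfer was rolled back, including bob's credit

balances = dict(con.execute("SELECT name, balance FROM accounts"))
print(balances)  # both balances unchanged: the transaction was all-or-nothing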
An overview of the Databricks platform is illustrated below. The difference with Snowflake however is that the storage layer is still separated. So a customer can individually manage his storage in his hyperscaler account whereas with Snowflake this is usually still vertically integrated, i.e. a so-called ‘walled garden’. The reason for this difference is that Databricks make use of an open source table format called Delta Tables, making it easy for the customer to process data elsewhere, not necessarily only over the Databricks platform.
Snowflake is now going down the same route, by allowing customers to store their data outside of the platform in Apache Iceberg tables. Iceberg is similarly to Delta an open source table format, providing customers with the alternative to utilize their data in other systems. Going forward, a customer could make use of their Iceberg data and process it on Databricks clusters. While this erodes Snowflake’s moat, I suspect that introducing this feature will still be a net positive for the company long term, as being a closed system made some clients reluctant to upload their data into Snowflake.
This is Snowflake’s CFO discussing their competition with Databricks at Goldman:
“First of all, I think Databricks is a very good company, I'm not going to say anything bad about them. What they do very well is in data science. We never see their SQL product and they undercut the pricing on that to try to win stuff, but we just don't see it. They still do a lot of ETL (extract, transform & load) and they coexist in so many of our accounts as we've brought them in. I know in your bank, they use both Snowflake and Databricks, and you go talk to most of our accounts, and they really see them as very different products. Where we compete with Databricks is on a workload-by-workload basis. It's a massive market we're playing in and it's not a winner takes all. There's going to be many winners.”
We got some further details at the recent Morgan Stanley conference, from the CFO again:
“It's been very consistent since day one, when we are going in to migrate a customer from on-prem Teradata or Hadoop, we're generally competing with Google, Azure and then AWS. We do see Databricks in many of our large accounts and by the way, we helped bring them in when we were kind of partnering with them. And they do very well in data science because they have a very good notebook that people like. But I will tell you, a lot of that growth in SnowPark has been at the cost of growth in Databricks within our accounts. And we are showing our customers that we are dramatically cheaper than them too.”
Snowflake has in the past discussed that they’re not only replacing a lot of on-premise Teradata and Hadoop solutions, but also other classic SQL data warehousing products such as Oracle, Microsoft and IBM, as well as first generation cloud data warehouses such as AWS Redshift.
Comparing results for 2023, Databricks grew faster percentage-wise although Snowflake still added more revenues dollar-wise:
So Snowflake is still the larger of the two, however, it’s clear that Databricks performance has been very strong and that the company has been growing into a serious competitor over time.
For premium subscribers, we’ll do an in-depth review of:
How the hyperscalers are fighting back
The business outlook for Snowflake
Snowflake’s CEO stepping down
A financial analysis and thoughts on valuation for the company