Snowflake vs. Databricks: Inside the Battle for the Data Platform Market
A deep dive, from inside the industry
A brief recap of the world of data platforms
Data platforms have evolved enormously over the last few decades. Traditionally, both databases and data warehouses were dominated by SQL-server-type products such as those from Oracle, IBM and Microsoft. However, with the explosion of data and the internet, these SQL servers could no longer scale sufficiently or handle all of this new, unstructured data. A new generation of data warehouses that could scale horizontally, by spreading data across many servers, was introduced: Hadoop for on-premise workloads and Redshift in the AWS cloud. Being first-generation solutions built after the rise of the internet, these brought substantial problems of their own. Hadoop was extremely difficult to manage, while on Redshift, compute and storage resources were coupled together on servers in the cloud. So if you ran your analytics workloads for only three hours a day, you still had to pay for the full 24 hours, since your data sat on servers that also provided the compute.
Enter Snowflake: the first data warehousing system to decouple storage and compute. Storage happens in cheap, basic object stores in the cloud, offering unlimited scaling of data storage, while compute servers are spun up only when a customer runs their business analytics workloads. A truly elastic, pay-as-you-go system, saving enterprises substantial dollars.
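The economics of this decoupling are easy to sketch. Below is a toy cost model in Python, with all prices hypothetical for illustration (not actual vendor rates), comparing a coupled server billed around the clock against an elastic warehouse billed only for the three hours it actually runs:

```python
# Toy cost model: coupled vs. decoupled storage and compute.
# All prices are hypothetical illustrations, not vendor rates.

COMPUTE_PER_HOUR = 4.00   # hypothetical cost of a compute node per hour
STORAGE_PER_DAY = 1.50    # hypothetical cost of object storage per day

def coupled_daily_cost() -> float:
    # Storage and compute live on the same servers, so compute
    # is billed for all 24 hours whether queries run or not.
    return 24 * COMPUTE_PER_HOUR

def decoupled_daily_cost(active_hours: float) -> float:
    # Compute is spun up only while workloads run; the cheap
    # object storage is billed separately.
    return active_hours * COMPUTE_PER_HOUR + STORAGE_PER_DAY

print(coupled_daily_cost())     # 96.0
print(decoupled_daily_cost(3))  # 13.5
```

Even with made-up numbers, the shape of the saving is clear: the coupled architecture bills 24 hours of compute for 3 hours of work.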
At around the same time, the data science world was seeing tremendous innovation, especially in machine learning. To handle large-scale data manipulation, Spark compute clusters were introduced. These could be deployed in the cloud from a simple notebook containing the developer's code in one of the popular data science languages, such as Python. This is really the core of Databricks, the company founded by Spark's original creators.
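Conceptually, what a Spark cluster does is split a dataset into partitions, run the same function on each partition in parallel across workers, and merge the partial results. A stdlib-only Python sketch of that map-reduce pattern (real Spark distributes partitions across machines and adds fault tolerance on top):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def split(data, n):
    """Split data into roughly n equal partitions."""
    k = max(1, len(data) // n)
    return [data[i:i + k] for i in range(0, len(data), k)]

def map_reduce(data, map_fn, reduce_fn, workers=4):
    """Apply map_fn to each partition in parallel, then merge."""
    partitions = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_fn, partitions))
    return reduce(reduce_fn, partials)

# Sum of squares over a dataset, computed partition by partition.
total = map_reduce(
    list(range(1, 101)),
    map_fn=lambda part: sum(x * x for x in part),
    reduce_fn=lambda a, b: a + b,
)
print(total)  # 338350
```

The draw of Spark is that the developer writes only the per-partition logic; the cluster handles the distribution, scheduling and merging.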
So originally, Snowflake and Databricks served very different use cases, and the two companies even worked together, as clients tended to run them for different workloads. However, both now share the vision of becoming the fully fledged data platform in the cloud. The reason is simple: data gravity. Moving large amounts of data is both costly and slow, so workloads come to where the data resides, not vice versa.
Obviously this is a hugely attractive business: once an enterprise has consolidated all of its crucial data on your platform, moving away would be a years-long endeavour and a serious distraction and risk for the business in terms of cost and time invested. These types of businesses therefore tend to be extremely sticky and enjoy strong pricing power. However, to make the above vision a reality, Snowflake first needs to up its game..
Snowflake 2.0
The problem Snowflake has is that it is coming from behind in data science capabilities. This is currently the strongest growth area in cloud data workloads, and it is really where Databricks excels. While Snowflake can deal with unstructured data, the platform was designed around SQL, traditional business analytics and dashboards, whereas Databricks is designed around coding, data science and data processing. Snowflake already has the data storage layer, so it is now introducing a second layer that can handle these Databricks-style tasks.
This is Snowflake’s new CEO on the recent call providing an overview of their current priorities:
“We started out as a data warehouse, but increasingly, we are the data backplane that provides a single unified view. Obviously, this data comes from production systems, such as Salesforce and SAP, or any of the other applications that you use. And additional capabilities like machine learning and AI are able to act on that data to then drive operational systems. Our overall strategy is to make sure that all of the data workloads that a company has are satisfied by Snowflake. This is where data engineering has been an investment for us, and Iceberg becomes key because the universe of data that can be acted upon by Snowflake, goes through a large expansion as not all data needs to be ingested into Snowflake before things happen. Our bet is really that AI and machine learning are going to go where the data is. Data is going to have a strong gravity.
I've spent a lot of time on the road. Just this quarter, I've met with over 100 of our biggest customers one on one. And the thing that clearly stands out from like a core data capability, is that in data sharing we are bar none. When it comes to newer things like AI, we've been open about the fact that we were a little behind early last year. But even before my coming, the management team here recognized the opportunity and invested heavily in it. And I would say the change over the previous quarter is that we can tell our customers with confidence that our AI products are world class.
In the core analytics space, we are the best in the world that there is. Especially when people consider migrating from complex on-prem systems, we have a professional services team that is exceptionally skilled at it and a very large ecosystem of partners that have been battle tested with massive migrations. And we have done migrations from on-prem workloads that end up saving something like 60% of the cost the customer has to bear, and their implementations end up being very efficient and low maintenance. These kinds of data migrations from legacy systems remain an important part of both new customer acquisition and are also driving substantial consumption increases in existing customers.”
So Snowflake’s strategy is to become the data platform in the cloud for all workloads. Not only a data warehouse for structured data in the cloud making use of SQL, but also a data lake with the capability to store unstructured data..
.. and leverage this data to run Spark-like clusters for machine learning in the popular data science languages such as Python and Scala.
Needless to say, these new workloads strongly expand Snowflake’s TAM; the company sees its TAM growing at an 18% CAGR over the coming five years. And as customers are still migrating from on-prem and first-generation cloud data warehouses, this is a company that could sustain a high growth rate this decade.
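To put that CAGR claim in perspective, a quick compounding check shows what an 18% annual growth rate implies for the TAM over five years:

```python
# Compound growth of the TAM at the company's stated 18% CAGR.
cagr = 0.18
years = 5
multiple = (1 + cagr) ** years
print(round(multiple, 2))  # 2.29
```

In other words, the company is effectively claiming its addressable market will roughly 2.3x over the next five years.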
But first Snowflake needs to step up its investments in innovation. The first obvious area to invest is an ability for data scientists to deploy code in the cloud on compute clusters akin to Databricks so that AI workloads can be handled on the platform. Snowflake is clearly behind here, with this feature only now being in public preview. This is Snowflake’s CEO on this new module:
“Our Notebooks offering is also seeing great traction in public preview with more than 1,600 accounts using that feature. This is critical to engage with data scientists and will unlock new opportunities that we previously did not address. We are in the early innings of this opportunity, and we'll continue to bring new features to market.”
The company is already further along in deploying LLMs, both as a copilot and for customers to train their own LLMs on their data. This is Snowflake’s Chief Product Officer on integrating AI into Snowflake:
“Cortex, which represents the different language models that are available in Snowflake, the adoption is quite strong. And we have lots of new use cases, such as text summarization and text sentiment analysis. We introduced at Summit, both Cortex Analyst and Cortex Search as a way to enable users across organizations to be able to chat and interrogate the data, whether it's structured or unstructured. We have quite a bit of demand for going to general availability. And maybe the last one that I would call out is the Snowflake Copilot, we see a lot of usage on customers getting assistance on how to write better SQL queries, which also drives consumption back into Snowflake.”
Data can now also be stored outside of Snowflake in open-source Iceberg tables. This should be a growth driver, as potential customers were reluctant to bring all their data into the platform and effectively be locked in. With Iceberg tables, data can live outside the platform, so that Snowflake is leveraged only when compute is needed, which is where the company makes the bulk of its revenues anyway (roughly 90%). And it gives customers peace of mind that the same data can still be used on other platforms as needed. So while Iceberg reduces customer lock-in, net-net it should be a growth driver, as it increases usage of Snowflake’s compute platform.
This is Snowflake’s CEO on Iceberg:
“Iceberg is enabling us to play offense and address a larger data footprint. Many of our largest customers have indicated they will now leverage Snowflake for more of their workloads as a result of this functionality. More than 400 accounts are using Iceberg as of the end of Q2.”
Commentary on Iceberg from data engineers at various companies, made available via Tegus, was universally positive, and most saw the format as ahead of Databricks’ Delta table format due to its greater flexibility.
The final interesting innovation from Snowflake is the launch of container services within the platform. These allow customers to deploy code and apps on lightweight virtual machines within the Snowflake platform and make use of the data residing there. This is the company’s CEO on the topic:
“Especially after the incident this summer with CrowdStrike, one of the hot topics has been how we set up applications with Snowflake. This is now a battle-tested operation, and most customers are shocked to find out that we can run a full replica of an important deployment at something like 15% the cost of the original deployment, because the replica is not running any of the workloads that are going on in the main one.”
So Snowflake is innovating, and it is crucial that these new capabilities become a success, as otherwise the company might hit a brick wall. At the same time, the company has to face a number of other challenges..
For premium subscribers, we’ll dive into:
The current competition between Snowflake and Databricks, with unique insights and data collected from data engineers, executives, system integrators and consultants, made available via the Tegus platform.
A similar exercise for Google BigQuery and Microsoft Fabric, concluding which one holds a competitive advantage and which a competitive disadvantage, and for which platform these pose the greatest threat.
Competition from open-source solutions to which enterprises can shift data workloads, and to what extent these can substitute for Snowflake. Again, we’ll draw on insights from industry insiders made available via Tegus.
How consultants and system integrators are currently expanding their resources to support integrations of the various cloud data platforms, an indicator of future revenue growth.
A detailed analysis of Snowflake’s financials and valuation, including thoughts and highlights from Q2, and whether the valuation is attractive enough here to take a position.