Snowflake Vs BigQuery — Two Cloud Data Warehouses Of Many
Setting Up Your Data In The Modern Data Stack Community Update #22
It’s becoming difficult to keep track of all the new data warehouse solutions that are trying to challenge the current incumbents.
Choosing the best data warehouse to meet the needs and objectives of your operation is a crucial component of your business strategy. Unfortunately, many organizations are still struggling with this decision.
To add to this, implementing a data warehouse can be difficult. But once it is built, a data warehouse has the potential to deliver a robust return on investment and give you better insights into your data.
Snowflake and Google BigQuery are well-established, powerful cloud-based data warehouse giants with thousands of satisfied customers. But which one is better for you?
That’s a hard question to answer, but let’s compare the two.
Cloud Data Warehouse Intro
For those unfamiliar with what a data warehouse is, let’s go over a quick intro.
Data warehousing has been around since the 1980s. The concept has changed and evolved dramatically since then. The increasing challenges and complexities of the business world have morphed data warehousing into a distinct discipline. This has led to better technology and tighter business practices.
The original purpose of data warehouses was to enable companies to maintain an analytical data source that they could go to in order to answer questions. This is still an important factor. However, the need has grown for end users to have easier access to company information on a large scale for data reporting and analysis.
Also, the user base has massively expanded from specialized developers to just about anyone who can drag and drop in Tableau or Power BI.
A data warehouse collects and stores all types of information from various sources, both within your organization and outside it. It takes in raw data and processes it to give you quick answers to your business queries so you can make informed decisions about forecasting and budgeting.
By gathering data from all aspects of your organization, from HR to sales and marketing, data warehouses make light work of your analytic processing workload. Snowflake and BigQuery are two excellent examples of enterprise-level data warehouses, powerful enough to handle the largest organizations.
Today, almost every large and medium-sized business has some form of data warehouse. Experts estimate that in less than 5 years, the market for data warehouses will nearly double, making it a 30 billion dollar industry.
If your company is ready to invest in a data warehouse or needs to upgrade from its current provider, you want to find the best and most cost-effective service for your needs.
Background On BigQuery
BigQuery, owned by Google, is a fully-managed, highly scalable, serverless data warehouse designed for fast-paced agility, with machine learning capabilities.
The platform was released to the general public in late 2011. The serverless architecture allows it to perform at scale and speed to provide incredibly fast SQL analytics across large databases.
BigQuery did have a few hiccups, like shipping with its own version of SQL (now called legacy SQL), but thankfully it has since added support for standard SQL.
In addition, it has received numerous upgrades: new features, enhanced performance, stronger security protocols, and increased reliability, all of which make it easier to operate and to glean deeper insights.
Background On Snowflake
Snowflake was founded in 2012 and officially launched 2 years later. It is a cloud-based computing data warehouse company based in Bozeman, Montana. The company was named for the founders’ passion for winter sports.
Snowflake allows enterprises to store and analyze company data using hardware and software hosted in the cloud. It has run on Amazon Web Services since 2014, on Microsoft Azure since 2018, and on Google Cloud Platform since 2019. The company is credited with reviving the data warehouse industry by building and perfecting a cloud-based data platform.
This is what makes Snowflake unique. It’s actually more of a reseller of AWS and other cloud services: it has developed a cloud-first data warehouse that is literally built on other providers’ cloud services. (So Google will make money one way or another.)
Snowflake Vs. BigQuery Comparison
Let’s look at a few key areas when comparing Snowflake vs BigQuery:
Architecture
BigQuery is serverless, using Massively Parallel Processing (MPP) architecture. So, there are no setup or configuration headaches. It performs storage and computing tasks separately for enhanced query performance.
If you really want to drill into the underlying architecture, GCP has a great article on how it leverages technologies like Borg, Colossus, Jupiter, and Dremel.
Snowflake is built on a hybrid architecture that it calls a Multi-cluster Shared Data Architecture. Compute and storage are, once again, separate, each running independently of the other. This delivers faster performance and allows for concurrent workloads by multiple users. Snowflake’s storage architecture supports structured and semi-structured data.
Scalability
Database scalability allows you to scale out or scale up a database so that it holds an increasing amount of data without affecting performance.
As data volume grows or queries become more complex, both Snowflake and BigQuery provide options in terms of scaling. With Snowflake, users are able to scale up or down as needed and only pay for the resources they actually use. It’s literally a drop-down where you can select X-Small, Small, Large, and so on.
BigQuery, on the other hand, is “serverless” and can scale independently, and all scaling issues are handled automatically.
This allows BigQuery to be incredibly flexible. It can quickly and seamlessly scale to any size. It is also highly cost-efficient. You are only charged for the resources you actually use, not for specific resources outlined in a contract.
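To make the Snowflake drop-down mentioned above a bit more concrete, here is a minimal Python sketch of how a warehouse size and its running time turn into credits. The credits-per-hour table (doubling with each size) and the 60-second minimum reflect my understanding of Snowflake's standard warehouses, so check the current documentation before relying on them. The function names and the price per credit are hypothetical, made up purely for illustration.

```python
# Credits per hour for standard warehouses; each size doubles the one before it.
# (Assumption based on Snowflake's published sizing; verify against current docs.)
CREDITS_PER_HOUR = {
    "X-Small": 1,
    "Small": 2,
    "Medium": 4,
    "Large": 8,
    "X-Large": 16,
}

# Assumption: billing is per second, with a 60-second minimum each time a warehouse starts.
MIN_BILLED_SECONDS = 60


def credits_used(size: str, runtime_seconds: float) -> float:
    """Estimate the credits consumed by one warehouse of the given size."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * billed_seconds / 3600


def estimated_cost(size: str, runtime_seconds: float, price_per_credit: float) -> float:
    """Convert credits into money. price_per_credit is a hypothetical input."""
    return credits_used(size, runtime_seconds) * price_per_credit


if __name__ == "__main__":
    # The same 30 minutes of running time costs twice as many credits per size step.
    for size in CREDITS_PER_HOUR:
        print(f"{size:8s} {credits_used(size, 1800):.2f} credits")
```

The point of the sketch is the trade-off behind the drop-down: a bigger warehouse burns credits faster, so scaling up only pays off if the queries finish correspondingly sooner.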
Performance
A major point of comparison is performance. How do Snowflake and BigQuery stack up?
During a series of tests in 2019, technology bloggers found that, on a number of metrics, Snowflake consistently performed better than BigQuery.
The industry-standard TPC-DS dataset was used for testing. It is considered a “general-purpose decision support system” benchmark based on fictional e-commerce data. A total of 103 tests were run over the dataset, which comprised 30 terabytes.
Snowflake completed all of the queries in 5,793 seconds, while it took BigQuery 37,283 seconds to finish.
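To put those two totals in perspective, here is a quick back-of-the-envelope calculation in Python. It only uses the numbers quoted above (5,793 seconds, 37,283 seconds, 103 tests); the function and constant names are my own, and the per-query averages assume the totals are spread evenly across all 103 tests, which the benchmark itself does not claim.

```python
# Totals reported for the 2019 TPC-DS comparison quoted above.
SNOWFLAKE_SECONDS = 5_793
BIGQUERY_SECONDS = 37_283
QUERY_COUNT = 103


def speedup(slower_seconds: float, faster_seconds: float) -> float:
    """How many times longer the slower run took compared with the faster one."""
    return slower_seconds / faster_seconds


def mean_seconds_per_query(total_seconds: float, query_count: int) -> float:
    """Average time per query, assuming the total is spread over every query."""
    return total_seconds / query_count


if __name__ == "__main__":
    ratio = speedup(BIGQUERY_SECONDS, SNOWFLAKE_SECONDS)
    print(f"BigQuery took {ratio:.2f}x as long as Snowflake overall")
    print(f"Snowflake: {mean_seconds_per_query(SNOWFLAKE_SECONDS, QUERY_COUNT):.1f} s per query")
    print(f"BigQuery:  {mean_seconds_per_query(BIGQUERY_SECONDS, QUERY_COUNT):.1f} s per query")
```

In other words, BigQuery needed roughly 6.4 times as long to get through the whole suite, or about six minutes per query on average versus just under one minute for Snowflake.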
Of course, this is not to say that Snowflake is faster in all situations. For example, BigQuery outperformed Snowflake on the query involving finding the best-performing and worst-performing items as measured by net profit.
Snowflake and BigQuery are both under active and continual development, with new features and performance enhancements being added regularly. Current and new developments to both platforms will likely change the calculus in the future as to which data warehouse solution truly performs better.
Cost
Cost is a serious consideration for many companies. Although you should never sacrifice overall usability for cost, sometimes you might have to choose the cheaper option.
Sadly, pricing is always tricky when it comes to the cloud, as it is often a combination of storage, compute, and other similar factors. So here is the current pricing breakdown for Snowflake vs BigQuery.
Which Is Right For You?
Snowflake and BigQuery data warehouses are both feature-rich solutions that have helped all types and sizes of businesses improve their BI and analytics workflows.
Although BigQuery can be cheaper than Snowflake when it comes to storage, the nature of BigQuery’s compute pricing model is complicated and very different from the time-used model employed by Snowflake. Additionally, Snowflake generally outperforms BigQuery.
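To show why the two compute pricing models are hard to compare directly, here is a minimal Python sketch of BigQuery's on-demand model as I understand it: you are billed for the bytes a query processes, not for how long it runs. The price per tebibyte is a hypothetical input rather than a quoted rate, the function name is my own, and the sketch ignores capacity-based (flat-rate) pricing, storage charges, and any minimums or free tiers.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def on_demand_query_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Estimate the on-demand cost of one query from the bytes it scans.

    Note that runtime is not an input: under this model a slow query and a
    fast query cost the same if they process the same amount of data.
    price_per_tib is a hypothetical rate supplied by the caller.
    """
    return bytes_processed / TIB * price_per_tib


if __name__ == "__main__":
    hypothetical_rate = 5.0  # assumed price per TiB, for illustration only
    for scanned in (10 * 1024 ** 3, TIB, 30 * TIB):
        cost = on_demand_query_cost(scanned, hypothetical_rate)
        print(f"{scanned / TIB:8.3f} TiB scanned -> {cost:.2f}")
```

With Snowflake's time-used model, the lever is how long a warehouse runs and how big it is. With a bytes-processed model, the lever is how much data each query touches, so things like selecting fewer columns or partitioning tables matter more to the bill than raw speed.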
In the end, it’s your decision. You choose which solution is best for your organization.
Thanks To The SDG Community
I started writing this weekly update more seriously about 8-9 weeks ago. Since then I have gained hundreds of new subscribers as well as 15 supporters! I even got 2 special thanks on YouTube!
And all I can say is, Thank You!
You guys are keeping me motivated.
Also, if you’re interested in reading some of our past issues, such as Greylock VC and 5 Data Analytics Companies It Invests or The Future Of Data Science, Data Engineering, then consider subscribing and supporting the community!
Video Of The Week - 5 Great Data Engineering Tools
What are the best data engineering tools for 2021?
Now I don't know if I can tell you the best data engineering tools for 2021.
However, I can tell you my favorite data engineering tools.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Moving past Airflow: Why Dagster is the next-generation data orchestrator (Shots Fired)
We launched Dagster because there is a tooling and engineering crisis in the world of data. There is a dramatic mismatch between the complexity and criticality of data and the tools and processes that exist to support it.
Nearly all software development in data boils down to a single activity: building graphs of computations that consume and produce data assets such as tables, files, and trained models. Existing tools that modelled dependency graphs didn’t view this process holistically restricting themselves with deployment and operations. A prime example: today's most commonly used workflow engine is Airflow, and it has a narrow focus—scheduling, ordering, and monitoring deployed computations.
The Airflow Smart Sensor Service
Airflow is a platform to programmatically author, schedule, and monitor data pipelines. A typical Airflow cluster supports thousands of workflows, called DAGs (directed acyclic graphs), and there could be tens of thousands of concurrently running tasks at peak hours. Back in 2018, Airbnb’s Airflow cluster had several thousand DAGs and more than 30 thousand tasks running at the same time. This amount of workload would often result in Airflow’s database being overloaded. It also made the cluster quite expensive since it required a lot of resources to support those concurrent tasks.
5 Data Engineering Projects To Add to Your Resume
If you are still not sold on the prospect of data engineering, let’s look into earning potential. As of May 9, 2021, with over 8,000 salaries reported, Indeed indicates that data engineers make $10,000 more per year than data scientists. Additionally, the benefits of data engineering do not stop at pay alone. A study from The New Stack indicates that there is less competition for data engineering roles than other tech positions.
The New Stack found that for LinkedIn and Indeed job posts, for every open data science position, there were 4.76 viable applicants. Data engineering roles draw only 2.53 suitable competitors per job opening, nearly doubling your chances of obtaining a data engineering role.
End Of Day 22
Thanks for checking out our community. We put out 4 newsletters a week discussing data, tech, and start-ups.
If you want to learn more, then sign up today. It costs nothing to sign up and keep getting these newsletters.