Big Data Driven #8
New articles and references are here. Enjoy your data-forward reading.
DataWarehouse evolution at Airbnb
An interesting journey, described by the Airbnb engineering team. The team ran into limitations of the Data Platform solution they had built. The article describes the current Data Platform solution, the alternatives, and the considerations made. It closes with the resulting solution, which the team built using Apache Spark v3 and an open table format for the Data Lake, and the benefits it brought.
How good is DuckDB?
Another one of those "X is 100x better than Y" articles, this time about storage engines. It is a DuckDB vs PostgreSQL comparison published on the vantage.sh blog. The author compares both engines on query execution time, bulk load, and compression, using custom sample data. And of course, in some aspects DuckDB is 200x faster, and in others 6x better. Just the way we love it in articles like this.
That said, DuckDB mentions have become more regular lately. I think it is a promising storage engine that needs some time to mature. I will definitely keep my eye on this one.
Validation framework at Dropbox
A little about Data Governance, and Data Quality in particular. Here is a use case about validation, written by the Dropbox engineering team. The article covers the reasoning behind the framework and the considerations around its validation capabilities. To orchestrate the framework (surprise, surprise!), Apache Airflow was the choice. The article also contains examples of validation rules and a high-level validation flow diagram, which help to understand the validation framework's place in the whole Data Platform environment.
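To make the idea of a validation rule concrete, here is a minimal sketch in plain Python. It is not the Dropbox framework and does not use Airflow; the rule names, the row shape, and the `validate` helper are all hypothetical and only illustrate the general pattern of named checks applied to rows.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
# A rule is a human-readable name plus a predicate that returns True for a valid row.
Rule = Tuple[str, Callable[[Row], bool]]

# Hypothetical rules, invented for illustration only.
RULES: List[Rule] = [
    ("user_id is present", lambda r: r.get("user_id") is not None),
    (
        "amount is non-negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def validate(rows: List[Row], rules: List[Rule]) -> Dict[str, int]:
    """Count how many rows fail each rule."""
    failures = {name: 0 for name, _ in rules}
    for row in rows:
        for name, check in rules:
            if not check(row):
                failures[name] += 1
    return failures


if __name__ == "__main__":
    sample = [
        {"user_id": 1, "amount": 10},
        {"user_id": None, "amount": 5},
        {"user_id": 2, "amount": -3},
    ]
    print(validate(sample, RULES))
```

In a real platform, a check like this would typically run as a scheduled task after a dataset lands, and the failure counts would feed alerts or block downstream jobs.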
SLA, SLO, SLI at the Data Platform level
When we discuss system reliability, we usually work with service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). They are the common metrics for defining, understanding, and measuring how reliable a system is. To get more familiar with reliability metrics, there is a free online book called “Site Reliability Engineering”, written by Google engineers. But the question is, can we apply the same reliability metrics to data?
I think we can, and we are already doing so by applying Data Quality practices to the data, because these practices define the metrics and how we should track and treat them. For more detail, I recommend this article about reliability metrics in data, where you can find SLI/SLO/SLA examples for data. A spreadsheet template for reliability management of your Data Platform might also be useful.
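As a rough illustration of what an SLI and an SLO can look like for data, the sketch below measures freshness: the share of loads that arrived within an allowed delay. It is not taken from the linked article; `FreshnessSLO`, `freshness_sli`, `meets_slo`, and the numbers used are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class FreshnessSLO:
    """Hypothetical objective: a share of loads must land within max_delay."""
    max_delay: timedelta
    target_ratio: float  # e.g. 0.9 means 90% of loads must be on time


def freshness_sli(
    expected: List[datetime], actual: List[datetime], max_delay: timedelta
) -> float:
    """SLI: fraction of loads that arrived within max_delay of the expected time."""
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and equal in length")
    on_time = sum(1 for e, a in zip(expected, actual) if a - e <= max_delay)
    return on_time / len(expected)


def meets_slo(sli: float, slo: FreshnessSLO) -> bool:
    """Compare the measured indicator with the objective."""
    return sli >= slo.target_ratio


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 6, 0)
    expected = [base + timedelta(days=i) for i in range(4)]
    delays_min = [10, 50, 90, 5]  # made-up arrival delays in minutes
    actual = [e + timedelta(minutes=d) for e, d in zip(expected, delays_min)]

    slo = FreshnessSLO(max_delay=timedelta(hours=1), target_ratio=0.9)
    sli = freshness_sli(expected, actual, slo.max_delay)
    print(sli, meets_slo(sli, slo))
```

The SLA would then be the agreement built on top of such an objective, for example what happens with data consumers when the objective is missed.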
Evolving Data Platform at Financial Times
An exciting read from the Financial Times engineering team. Evolution stories are one of my favorite themes in technical posts. The article covers the requirements and the toolset used to build a Data Platform that meets business needs. It describes four iterations in total. Each Data Platform generation comes with its business requirements, technical constraints, technology stack description, and result.
I recommend this great read to reflect on how Data Platform use cases develop over time, and on how we, as Data Engineers, need to pick the right tools from the variety of frameworks available to solve evolving business problems.
Snowflake engineering practices you need to know about
I found a really good list of engineering practices that can be applied to Snowflake. The article is well-structured and is broken down by Data Platform stages. It contains an extensive description of each practice that can be applied to each stage. Besides the description, the author provides a list of useful links for detailed documentation or implementation examples for each engineering practice.
