Big Data Driven Weekly #3
The previous week was a mess and a miss. This week I'm back with more Data Engineering news.
Modern Data Lake guide
An extensive and descriptive guide to implementing a Data Lake in your organization. The article contains many comparison tables, links, and references to helpful articles on Data Lake history. It also covers implementation cases of Data Lakes, Lakehouses, and related technologies and services.
Apache Spark for .NET
Microsoft tries not to leave its comfort zone and not to push developers out of theirs. So recently Microsoft released .NET for Apache Spark version 1.0. The .NET API covers components such as Apache Spark Core, Spark Streaming, Spark SQL, and Spark MLlib. The .NET for Apache Spark API will be integrated into Azure Cloud services that already support the Apache Spark framework: Azure Synapse and Azure HDInsight. Microsoft also promises to bring the .NET for Apache Spark API to third-party services integrated with Azure Cloud, such as Azure Databricks.
Auto Data Security in GCP
A very descriptive case study about data security automation using GCP services. Last autumn, GCP released a service for automated tagging, data exploration, and profiling of BigQuery tables. The service name, however, is misleading: Cloud Data Loss Prevention (Cloud DLP).
Cloud DLP now has more integration potential with other GCP services. The case study describes an automated flow that triggers a Cloud Function and, based on a data tag or property, sets a distinct row-level security policy on the BigQuery table. Such automation will significantly ease the work of security and data engineering teams in implementing proper data security policies.
The Cloud DLP + Cloud Function chain mirrors the Macie + Lambda chain on AWS. But the AWS chain could be implemented before this automation was announced on GCP. Just sayin’.
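To make the tag-to-policy step of that flow more concrete, here is a minimal sketch of what the function body might do: take a table name and a data tag, and build a BigQuery `CREATE ROW ACCESS POLICY` statement. This is not the code from the case study. The tag names, group addresses, filter expressions, and the `build_row_access_policy_ddl` helper are all hypothetical, made up for illustration.

```python
import re

# Hypothetical mapping from a data tag to a row-level policy.
# Tag names, grantees, and filters are assumptions, not taken from the case study.
TAG_POLICIES = {
    "pii": {
        "grantees": ["group:privacy-team@example.com"],
        "filter": "TRUE",
    },
    "regional": {
        "grantees": ["group:eu-analysts@example.com"],
        "filter": "region = 'EU'",
    },
}

# Accept only fully qualified names of the form project.dataset.table.
_TABLE_RE = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_row_access_policy_ddl(table, tag):
    """Return a CREATE ROW ACCESS POLICY statement for the tag, or None if the tag is unknown."""
    if not _TABLE_RE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")

    policy = TAG_POLICIES.get(tag.lower())
    if policy is None:
        return None  # no policy configured for this tag

    grantees = ", ".join(f'"{g}"' for g in policy["grantees"])
    policy_name = f"{tag.lower()}_policy"
    return (
        f"CREATE OR REPLACE ROW ACCESS POLICY {policy_name}\n"
        f"ON `{table}`\n"
        f"GRANT TO ({grantees})\n"
        f"FILTER USING ({policy['filter']})"
    )


if __name__ == "__main__":
    print(build_row_access_policy_ddl("my-project.sales.orders", "regional"))
```

In a real function the resulting statement would be submitted to BigQuery, for example with `Client.query` from the `google-cloud-bigquery` library. How the tag reaches the function depends on how the trigger is wired, which this sketch leaves out.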
Another thing about Snowflake pricing
An article from Benn Stancil, in his usual philosophical style, this time about Snowflake's disadvantages. Last week I shared my thoughts about the “Innovator problem” and Snowflake pricing, and the author points out similar things in his article. But Benn Stancil goes further and provides more arguments: some are relevant and exciting, while others are far from reality.
Glue Catalog lineage
Soon, AWS Glue Data Catalog will support a history of crawler runs. You will get extensive information about schema changes, partition changes, and metrics for each crawler run, so you can track your schema evolution and data distribution.
Azure Cloud introduces more visualization
Azure Cloud has launched a managed Grafana service. Azure Cloud is catching up with other cloud providers and wrapping open source solutions in its own infrastructure.
