Big Data Driven #7
The Data Engineering news saga continues!
Apache Superset extending the toolset
The latest Apache Superset release adds a drill-down feature to charts. Drill-down looks very cool and is a powerful tool that Superset was lacking: you can now do better data analysis from your Superset dashboards and charts, navigating through the UI straight into the detailed, filtered dataset. Learn about the other new Apache Superset features.
New AWS connectors
The AWS Glue service is unstoppable. The Glue Crawler now supports MongoDB Atlas: you can connect to MongoDB collections deployed on-premises or inside your cloud account and track schema changes. Here’s an example of how it works with a MongoDB Atlas instance.
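For a rough idea of what wiring that up might look like, here is a minimal sketch using boto3. The connection name, IAM role, catalog database, and database/collection path are all placeholders, and the Glue connection to the Atlas cluster has to exist already:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Hypothetical names: "mongodb-atlas-conn" must already be defined as an
# AWS Glue connection pointing at your MongoDB Atlas cluster.
glue.create_crawler(
    Name="mongodb-atlas-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="mongodb_catalog",
    Targets={
        "MongoDBTargets": [
            {
                "ConnectionName": "mongodb-atlas-conn",
                "Path": "sample_db/orders",  # database/collection to crawl
                "ScanAll": True,
            }
        ]
    },
)

# Run the crawler to populate the Glue Data Catalog with the inferred schema.
glue.start_crawler(Name="mongodb-atlas-crawler")
```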
AWS Athena did not stop at the Snowflake integration: now you can run Athena queries on top of Google Cloud Storage. I think AWS is experimenting with the Lakehouse architecture, trying new approaches and tactics, like broadening the SQL toolset over different storage engines so you don’t have to care where the physical data lives.
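Once the GCS connector is registered as an Athena data catalog, querying it looks like any other federated query. A minimal sketch with boto3, where the catalog, schema, table, and results bucket are hypothetical names:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# "gcs_catalog" is a hypothetical data catalog backed by the Athena GCS
# connector; it must be registered in Athena before this query will run.
resp = athena.start_query_execution(
    QueryString='SELECT order_id, amount FROM "gcs_catalog"."sales"."orders" LIMIT 10',
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(resp["QueryExecutionId"])
```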
Ray record
It’s been a while since I read about any new data records. The last one was the argument between Databricks and Snowflake about the TPC-DS benchmark in 2021. And now we have a new achievement - Ray has broken the record for the most cost-efficient sort of 100TB of data. The previous record holder was Databricks, which managed to process a 100TB dataset spending only 1.44 USD per terabyte of data, running on Alibaba Cloud services. Ray, on the other hand, managed to spend only 0.97 USD per terabyte of data while running on the AWS cloud.
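The record run itself relied on Ray’s shuffle internals and a large, tuned multi-node cluster, but the user-facing API is just Ray Data. A toy sketch, nowhere near benchmark scale and only meant to show the shape of the API:

```python
import random
import ray

ray.init()

# A tiny stand-in for the benchmark workload: build a small dataset of random
# keys and let Ray Data sort it across whatever cluster ray.init() attached to.
items = [{"key": random.random()} for _ in range(100_000)]
ds = ray.data.from_items(items)

print(ds.sort("key").take(3))
```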
This brought me back to the first time I read that Spark was 100x faster than MapReduce. And now, reflecting on how much time has passed, Spark is still at the top of the data processing food chain. I’m very excited to see how things will go with Ray.
Costly AWS Redshift Serverless
I know it’s childish, but for quite some time I was hoping that Redshift would not cost so much. And then AWS presented Redshift Serverless last year. It should have fixed all our problems with running huge, costly clusters while no queries are executing. But there’s a catch. And the catch is how the cost of the Redshift Serverless service is calculated.
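Roughly speaking, you pay for RPU-hours whenever the endpoint is doing work, and very short queries are still billed for a minimum slice of time, so a chatty BI workload on a large base capacity can add up quickly. A back-of-envelope sketch, where the price per RPU-hour and the minimum billed duration are illustrative assumptions rather than quoted pricing:

```python
# Rough back-of-envelope for Redshift Serverless billing (RPU-hours =
# Redshift Processing Units x time). Both constants below are assumptions
# for illustration only; check the current pricing page for your region.

PRICE_PER_RPU_HOUR = 0.36   # assumed figure, illustrative only
MIN_BILLED_SECONDS = 60     # assumed minimum charge per query

def query_cost(base_capacity_rpus: int, runtime_seconds: float) -> float:
    """Cost of a single query at a given base capacity."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    return base_capacity_rpus * (billed_seconds / 3600) * PRICE_PER_RPU_HOUR

# 10,000 short (2-second) dashboard queries a day on a 128 RPU base capacity:
# each one is billed as if it ran for the full minimum duration.
daily = 10_000 * query_cost(base_capacity_rpus=128, runtime_seconds=2)
print(f"~${daily:,.2f} per day")  # roughly $7,680/day under these assumptions
```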
Snowflake is all about the tables
Snowflake is evolving and trying new things as well. The company does not count only on its main cloud warehousing service, but is also experimenting with modern data approaches and integrating them into its ecosystem. As a result, Snowflake introduced four new table types in 2022:
- Iceberg Tables: please do not confuse these with External Tables in the Apache Iceberg format. Iceberg Tables read files from the data lake in Apache Iceberg format but treat the data as a native Snowflake table with all DML syntax applicable. The article actually has a decent decision tree for which type of Snowflake table to use depending on the data you have. I strongly recommend taking a look.
- Dynamic Tables: like a materialized view, only a Snowflake table, with an easier and more native approach to data reloading than actual materialized views (a minimal sketch of the DDL follows after this list).
- Hybrid Tables: I don’t even have an appropriate word for these. Such tables store the data in both a row-based and a column-based layout. I don’t see a handy use case for such tables yet.
- Event Tables: not much information about these is available yet, but the main point is that Event Tables have already been announced. We’ll probably see some good use cases presented by Snowflake this year.
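For the Dynamic Tables item above, here is a hedged sketch of what the DDL might look like, executed through the Python connector. The account credentials, warehouse, and table names are hypothetical, and the feature was still in preview at the time of writing:

```python
import snowflake.connector

# Hypothetical connection details; the warehouse and source table are
# placeholders, and your account needs the dynamic tables preview enabled.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

# A dynamic table refreshes itself to stay within the declared TARGET_LAG,
# which is the "easier and more native" reload story compared with
# maintaining a materialized view or a scheduled task by hand.
conn.cursor().execute("""
    CREATE OR REPLACE DYNAMIC TABLE daily_revenue
      TARGET_LAG = '1 hour'
      WAREHOUSE = TRANSFORM_WH
      AS
      SELECT order_date, SUM(amount) AS revenue
      FROM RAW.ORDERS
      GROUP BY order_date
""")
```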
