Big Data Driven Weekly #1
I continue to share Data Engineering news and articles that emerged this week or have stayed in the shadows. Enjoy!
Definition of Modern Data Stack
A long time ago, when the term “Big Data” had only just emerged, no one could explain what exactly Big Data was. Now it seems we have a new term in a similar situation. I present to you the “Modern Data Stack”. Nowadays, everyone is talking about the Modern Data Stack, each in their own manner and with their own set of services and approaches. Where did this magic term even come from, and what does it mean? Preset tries to clear up the magic around the Modern Data Stack in their blog.
Better bulk load for DynamoDB
It’s been some time since I mentioned AWS, so here goes. It is now easier to bulk load data into DynamoDB. You can load files in CSV, DynamoDB JSON, and Amazon Ion formats that are stored in an S3 bucket. You can start a bulk load from the AWS Management Console or through the AWS API.
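To make the API route a bit more concrete, here is a minimal sketch of what an import request might look like. The parameter names follow the DynamoDB ImportTable API as I understand it, so please check them against the current AWS documentation. The bucket, prefix, table, and key names are made up for illustration, and `build_import_request` is my own hypothetical helper, not part of any SDK.

```python
# Hypothetical helper: builds the keyword arguments for a DynamoDB ImportTable call.
# Parameter names are my reading of the ImportTable API; verify against AWS docs.
ALLOWED_FORMATS = {"CSV", "DYNAMODB_JSON", "ION"}


def build_import_request(bucket, key_prefix, table_name, partition_key, input_format="CSV"):
    if input_format not in ALLOWED_FORMATS:
        raise ValueError(f"Unsupported input format: {input_format}")
    return {
        # Where the source files live in S3.
        "S3BucketSource": {"S3Bucket": bucket, "S3KeyPrefix": key_prefix},
        "InputFormat": input_format,
        # As far as I know, the import creates a new table,
        # so the table definition is part of the request.
        "TableCreationParameters": {
            "TableName": table_name,
            "AttributeDefinitions": [
                {"AttributeName": partition_key, "AttributeType": "S"}
            ],
            "KeySchema": [{"AttributeName": partition_key, "KeyType": "HASH"}],
            "BillingMode": "PAY_PER_REQUEST",
        },
    }


# Illustrative values only.
request = build_import_request("my-bucket", "exports/users/", "users", "user_id")

# With boto3 installed and credentials configured, the call would look roughly like:
#   import boto3
#   boto3.client("dynamodb").import_table(**request)
```

Keeping the request as a plain dictionary makes it easy to review or log before anything is actually sent to AWS.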
It’s strange that DynamoDB did not have this capability before. HBase and similar wide-column stores have supported bulk load since their early releases. Previously, to load a lot of data into DynamoDB, you could break large files down into smaller batches and load them via the API. Alternatively, you could use AWS EMR and Hive: define an external table over the data in S3, and then use the INSERT OVERWRITE command to write the data from Amazon S3 to DynamoDB.
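For a feel of the older "small batches via the API" route, here is a rough sketch. It assumes a boto3-style client object with a `batch_write_item` method and items that are already in DynamoDB attribute-value format. The helper names `chunked` and `batch_write` are mine, and real code would add backoff between retries.

```python
from itertools import islice

MAX_BATCH = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def chunked(items, size=MAX_BATCH):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def batch_write(client, table_name, items):
    """Write items in batches and return the number of API calls made."""
    calls = 0
    for batch in chunked(items):
        pending = {table_name: [{"PutRequest": {"Item": item}} for item in batch]}
        # Resend whatever the service reports as unprocessed.
        # A production version should sleep with exponential backoff here.
        while pending:
            response = client.batch_write_item(RequestItems=pending)
            calls += 1
            pending = response.get("UnprocessedItems") or {}
    return calls
```

With boto3, `client` would be `boto3.client("dynamodb")`. For files with millions of rows this means a lot of round trips, which is exactly why a native bulk import is welcome.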
Here comes a new player in the Storage market
It seems the distributed storage market is about to expand with a new database. According to InfoQ, Alibaba, one of the most prominent players in the Asian e-commerce market, is seeking new opportunities in the USA and Europe. Alibaba has a MySQL-compatible distributed storage offering called OceanBase that can even run on Raspberry Pi devices.
For the last couple of years, Alibaba has been actively working on distributed systems and services related to Data Engineering and Big Data. It will be very interesting to see whether Alibaba can compete with the main players in the Data Engineering service markets in the United States and Europe.
Introduction to testing your data
Not a single newsletter goes without a Data Governance post. This time, it is a useful introduction to popular data testing frameworks. In the article, you can find use cases and code examples for different data tests, links to guides, and useful documentation. At some point in a Data Platform’s life, data testing becomes essential to ensuring the platform's quality and reliability. So you need to be prepared and know a thing or two about Deequ and Great Expectations.
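If you have never seen a data test, here is a tiny hand-rolled sketch of the idea in plain Python. This is not the API of Deequ or Great Expectations; the function names, the result format, and the sample `orders` rows are all made up for illustration.

```python
def expect_not_null(rows, column):
    """Every row must have a non-null value in `column`."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"not_null({column})", "success": not failed, "failed_rows": failed}


def expect_unique(rows, column):
    """No value in `column` may appear more than once."""
    seen, failed = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            failed.append(i)
        seen.add(value)
    return {"check": f"unique({column})", "success": not failed, "failed_rows": failed}


def expect_between(rows, column, low, high):
    """Non-null values in `column` must fall within [low, high]."""
    failed = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": f"between({column})", "success": not failed, "failed_rows": failed}


# Made-up sample data with a duplicate id and a negative amount.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": -5.0},
]

results = [
    expect_not_null(orders, "order_id"),
    expect_unique(orders, "order_id"),
    expect_between(orders, "amount", 0, 10_000),
]
for result in results:
    print(result)
```

The frameworks from the article wrap the same idea in a declarative form and add reporting, profiling, and integrations on top.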
Your Data Quality cheat sheet
A really short cheat sheet on Data Quality dimensions, with a detailed explanation of each. It will come in handy when working with third-party data and external sources, where engineers do not always keep their data clean and consistent.
Also, I would suggest not only cleaning your data but also calculating statistics for each Data Quality dimension, so that you understand how good or bad the source data is.
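As a rough sketch of what such statistics could look like, here are two dimensions computed per column: completeness (the share of non-empty values) and uniqueness (the share of distinct values among the non-empty ones). The definitions, the `quality_stats` helper, and the sample rows are my own simplifications, not taken from the cheat sheet.

```python
def quality_stats(rows, columns):
    """Return completeness and uniqueness ratios for each column."""
    total = len(rows)
    stats = {}
    for column in columns:
        values = [row.get(column) for row in rows]
        # Treat None and empty strings as missing (an assumption for this sketch).
        present = [v for v in values if v is not None and v != ""]
        stats[column] = {
            "completeness": len(present) / total if total else 0.0,
            "uniqueness": len(set(present)) / len(present) if present else 0.0,
        }
    return stats


# Made-up sample of third-party data.
customers = [
    {"email": "a@x.com", "country": "DE"},
    {"email": None, "country": "DE"},
    {"email": "a@x.com", "country": ""},
    {"email": "b@x.com", "country": "FR"},
]

for column, metrics in quality_stats(customers, ["email", "country"]).items():
    print(column, metrics)
```

Tracking these numbers per load gives you a trend line for each source, which is usually more telling than a single snapshot.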
