Big Data Driven Weekly #4
More news from the Data Engineering field coming up!
Trying orchestration services
An article series about modern orchestration services. The author tries Managed Airflow on AWS, Dagster, and dbt on common data problems. I hope we will also see Prefect and Astronomer. Over the last couple of years, the number of orchestrators has increased a lot, and it’s becoming harder to choose the right one for your task. The safe route is to choose Airflow, preferably a managed one to reduce the headache. But I still recommend looking more broadly and picking the right tool for your specific problem.
The definition of a Full-stack Data Scientist
It appears that “full-stack” developers exist not only in web development. Shopify’s blog post describes the responsibilities and skills required of a full-stack Data Scientist. It sounds intimidating, but according to the article, the main difference between a regular Data Scientist and a “full-stack” Data Scientist is the scope of responsibilities. The “full-stack” Data Scientist owns the data from the data source and ingestion stage through to reporting and presentation.
What is the metrics layer?
This article tries to explain what the metrics layer actually is, using examples. It also shows what problems can occur during metric definition, requirement changes, and implementation.
The most interesting part for me is that, not long ago, the Data Engineering team was responsible for implementing the metrics layer. But in the examples mentioned by the author, the implementation has shifted to BI services and dbt, which is 100% SQL. Where’s all the Big Data? Apache Spark, Hadoop? If the approaches mentioned in the article had to work with terabytes or petabytes of data, how expensive would that be?
Explaining Apache Pulsar for Kafka users
A look at the similarities between Apache Kafka and Apache Pulsar from Pulsar's perspective. On the one hand, the approach taken by the article's authors (who back Apache Pulsar) is understandable: they present the transition from Apache Kafka to Apache Pulsar as easy and painless. But why should we switch from a more mature and popular framework that has all the features that Apache Pulsar has?
Data Mesh at BlaBlaCar
It’s been a while since I’ve mentioned Data Mesh, so a recent post from BlaBlaCar caught my eye. The post is about the company’s transition to the Data Mesh architectural pattern. In it, the Data Engineers provide bullet points about what did and did not go well during the implementation.
Too bad the post is really short and has few implementation details. I hope to see a follow-up with more details, since implementing Data Mesh should make data inside a large organization easier to organize and structure, and reduce the implementation and maintenance effort for data owners and data engineers.
