Big Data Driven #9
It's already spring! All the frost has gone, but people still keep their Snowflake.
Databricks introduces new features
Databricks develops not only our beloved Data Engineering features but Machine Learning ones as well. The Model Serving feature is now available in the Databricks Lakehouse Platform. Model Serving allows you to deploy your ML models and exposes an HTTP API to access them. Basically, it productizes the Machine Learning model. This is very convenient when you do not have much software engineering or MLOps expertise but do have rich Data Science experience. I wonder whether similar features are available from the major cloud providers such as AWS, Azure, or GCP.
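To make "expose an HTTP API" concrete, here is a minimal sketch of calling such a serving endpoint from Python. The endpoint URL, token, and the `dataframe_records` payload field are assumptions for illustration, not a documented contract; check your workspace's serving docs for the exact shape.

```python
import json
import urllib.request


def build_payload(records):
    """Wrap input rows in a JSON body shape often used by MLflow-style
    serving endpoints (field name is an assumption for this sketch)."""
    return json.dumps({"dataframe_records": records})


def score(endpoint_url, token, records, timeout=10):
    """POST the records to the serving endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        endpoint_url,
        data=build_payload(records).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


# Usage (placeholder URL and token -- substitute your workspace values):
# score("https://<workspace>/serving-endpoints/<model>/invocations",
#       "<token>", [{"feature_a": 1.0, "feature_b": 2.0}])
```

The point is that once the model is served, any client that can issue an authenticated HTTP POST can get predictions, with no MLOps plumbing on the caller's side.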
Recently, I mentioned the Ray framework and its cost-efficiency record. In the discussion, we argued that the Ray framework seems to be following the same path as Apache Spark and would probably become an interesting asset for Databricks. And we were right: Databricks has announced support for the Ray framework in the Databricks Lakehouse Platform.
Big Data is not that big anymore
Sometimes, besides news, announcements, and tutorials, I come across these kinds of thoughts. For the last two years, the same thought has occurred to me from time to time: "Big Data" is not big anymore in terms of data volume. More and more companies are looking for an analytical solution while having only several gigabytes of data. Most data analysis tasks are no longer about spinning up 50 to 100 virtual machines to process the data. And this tendency continues to grow. So "Big Data" is becoming less about "Big" and more about "Data".
Companies are still willing to build a small analytical platform, even without implementing hundreds of data pipelines to process petabytes of data. Many analytical requests can be handled with a simple Amazon EventBridge + AWS Lambda chain. And that is the main reason why lately I avoid the "Big Data" wording; instead, I use "Data Engineering" when I refer to data-related tasks.
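As a sketch of what such a chain looks like, here is a stdlib-only Lambda handler that an EventBridge rule (a cron schedule or an event pattern) could invoke. The payload shape, field names, and the toy aggregation are all assumptions for illustration; a real job would read from S3 or a warehouse instead of the event body.

```python
import json


def handler(event, context):
    """Minimal AWS Lambda handler for an EventBridge-triggered job.
    Assumes a hypothetical payload: a list of row dicts under event["detail"]["rows"]."""
    rows = event.get("detail", {}).get("rows", [])

    # A toy "analytical request": total revenue per country, no cluster needed.
    totals = {}
    for row in rows:
        country = row.get("country", "unknown")
        totals[country] = totals.get(country, 0.0) + float(row.get("revenue", 0))

    return {"statusCode": 200, "body": json.dumps(totals)}
```

For gigabyte-scale data, this pattern (event in, small computation, result out) often replaces an entire Spark cluster, which is exactly the shift the paragraph above describes.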
Smaller data volumes do not mean there is less data work to do. The evolution of the data technology landscape has changed the Data Platform, as well as the way Data Engineers work with data and the platform itself.
New MAD Landscape
The new, 2023 version of the big and scary MAD Landscape is out! I'm sure you've seen this picture with hundreds of frameworks, services, and solutions in the data, AI, and ML fields. In the article, you can read the authors' motivation for re-arranging the blocks in the new version. In case you're not that interested in the details, you can just look at the PDF version of the landscape or its clickable version.
Recent Snowflake acquisitions
An informative article on how Snowflake expands its feature portfolio by acquiring other companies. You can try to trace the directions Snowflake is developing besides its main data warehouse product. Among them, you can find things like working with unstructured data, but now it's "powered by AI". Also, Snowflake, like its competitors, is trying to integrate machine learning algorithms directly into the data warehouse. Amazon Redshift and BigQuery already offer this feature, while Snowflake is only starting to move in this direction.
Architecture documentation
A great and useful read, but this time not about Data Engineering. Architecture documentation is a cumbersome, dull, but extremely important activity on any project. The article describes an architecture documentation approach called arc42, paired with the C4 architecture visualization approach. Another interesting concept is Documentation as Code: the idea is to store all your documentation in a version control system so that changes to it can be tracked efficiently.
