I’m researching a variety of open-source approaches and tools to support large-scale scientific modeling and the data issues that come with it. Think multi-TB model outputs and standardized formats like netCDF.
Nate told me about Pangeo from his time at NCAR. I read a boatload of docs and watched a great introductory video from 2018.
Problems it solves:
- A common data model for accessing data science-y/geoscience-y data like netCDF, GeoTIFF, and more. The use of Zarr is interesting here as a read/write option for chunked array data in cloud object storage.
- Moving compute and simulation (the running of numerical models) closer to the data, which at this point is too big to move.
- Ties into the existing data science toolkit, e.g., Python, NumPy, and Dask. One element of the work is moving lots of data off the cluster machines spitting out model output (think SAN or even attached SSD) and getting it somewhere more useful for downstream scientists and other users (a rough sketch of that stack follows this list).
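To make that concrete, here’s a minimal sketch of how the pieces fit together: xarray as the common data model, Dask for chunked, lazy compute, and Zarr as the cloud-friendly store. The file names, variable name, and chunk sizes are placeholders of my own, not anything from Pangeo’s docs.

```python
import xarray as xr

# Open a netCDF file lazily; chunks= turns the variables into Dask arrays,
# so nothing is read into memory until a computation actually needs it.
ds = xr.open_dataset("model_output.nc", chunks={"time": 100})

# Deferred, chunk-wise computation across a possibly huge dataset.
monthly_mean = ds["temperature"].groupby("time.month").mean()

# Write the dataset out as Zarr: a directory of chunked, compressed arrays
# that works as happily on object storage as on local disk.
ds.to_zarr("model_output.zarr", mode="w")
```

The appeal is that the same few lines run on a laptop against a local file or on a Dask cluster against an object store.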
Looking around the source code, I saw that Pangeo Forge will be using a project called Prefect. It’s highly reminiscent of Apache Airflow in terms of workflow definition and execution, and it quickly becomes clear that the Prefect folks know and love Airflow, but have reasons for not using it.
“…Airflow’s applicability is limited by its legacy as a monolithic batch scheduler aimed at data engineers principally concerned with orchestrating third-party systems employed by others in their organizations.”
I’m intrigued that the deployment story is to a Dask cluster or via Kubernetes. I won’t dig in unless I need to, but I’m happy to find another modern DAG-based workflow automation system/platform/tool.
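For flavor, here’s a rough sketch of what a Prefect flow looks like, in the Prefect 1.x style of this era. The task names and the Dask scheduler address are placeholders of mine, not anything from the Pangeo Forge source, and the exact import paths have shifted between Prefect versions.

```python
from prefect import task, Flow
from prefect.executors import DaskExecutor

@task
def extract():
    # Stand-in for pulling a chunk of model output.
    return list(range(10))

@task
def load(items):
    # Stand-in for writing the chunk somewhere useful (e.g., a Zarr store).
    print(f"loaded {len(items)} items")

# As in Airflow, the workflow is a DAG of tasks; unlike Airflow, the DAG
# is built by calling ordinary Python functions inside the Flow context.
with Flow("toy-pipeline") as flow:
    load(extract())

# Hand the work to an existing Dask cluster instead of running locally.
flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```

That executor hand-off is the part that maps onto the “deploy to a Dask cluster or Kubernetes” story above.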