2025-03-13 CF Governance Panel meeting
Attendees
Attending: Jonathan, Daniel, Bryan, Ethan, Karl
Agenda/Notes
- Schedule our next meeting
- 12 June 2025 at 14:00 UTC (8am MDT, 3pm BST, 4pm CEST)
- Update on writing an article on CF history/future
- Prioritize between Roadmap and History
- Single paper with recent history and future plans?
- Introduction
*
- History
- CMIP5-7 (now)
- Process
- Principles, advantages to a process that requires time for careful thought and consideration with the aim of reaching consensus
- Examples (e.g enumerations and swathes)
- Next steps: The roadmap
- History helps explain why things are the way they are
- Could we convene a meeting with Zarr/GeoZarr developers to discuss data model and interoperability?
- To some extent their data model is not different from the (basic) NetCDF model is it?
Clearly the format is different …
-
(We now have a working a pure python HDF5 reader - the pull request is public now - see https://github.com/jjhelmus/pyfive - that does nearly (+) everything that Zarr can do!
So all that stuff can work on top of NetCDF without needing Zarr per se.)
(+) need support for arbitrary filters.
- (We do need to do better at good chunking default)
- Zarr exists (I think) to handle three things:
- Threading performance (the HDF5 c-library is not thread safe)
- People always rechunk into zarr files, the important requirement is the rechunking not the zarr
-
The metadata is consolidated into one file, so it can be read more efficiently than reading metadata throughout a file - but this too can be done by h5repacking as is done by reading netcdf and writing zarr.
The biggest downside of Zarr is the number of files.
If your average netCDF file is 2 GB, then you will be facing ~500 times more Zarr files (every 4 MB chunk is a zarr file).
This is a problem for HPC file systems and data managers.