Just a reminder that the SC23 BOF, "HDF5: Building on 25 Years of Success," is tomorrow!
Wednesday, 15 November 2023, 5:15pm-6:45pm MST
Location: 401-402
HDF5 is a unique, open-source, high-performance technology suite that consists of an abstract data model, library, and file format used for storing and managing extremely large and/or complex data collections. It is used worldwide by government, industry, and academia in a wide range of science, engineering, and business disciplines.
The HDF5 community is both deep and broad: every major HPC system vendor ships HDF5 as part of its core software, owing to its broad adoption by science applications and its ability to improve I/O performance and data organization in HPC environments. In addition, more than 1,000 projects on GitHub use HDF5 because of its versatile, self-describing data model, which can represent very complex data objects, the relationships between them, and their metadata; its portable binary file format with no limits on the number or size of data objects; its software library optimized for efficient I/O; and its tools for managing, manipulating, viewing, and analyzing HDF5 data.
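For readers new to HDF5, the short sketch below shows that data model in the library's C API: a self-describing file containing a group, a dataset, and an attribute carrying metadata. The file, group, dataset, and attribute names are purely illustrative, not taken from the BOF materials.

```c
/* Minimal sketch of the HDF5 data model: a group, a dataset, and an
 * attribute in one self-describing file. Build with: h5cc example.c */
#include "hdf5.h"

int main(void)
{
    double  temps[4] = {280.1, 281.4, 279.8, 282.0};
    int     version  = 1;
    hsize_t dims[1]  = {4};

    hid_t file  = H5Fcreate("weather.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/station_42", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Dataset: a 1-D array of doubles stored under the group */
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(group, "temperature", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, temps);

    /* Attribute: metadata attached directly to the dataset */
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t attr   = H5Acreate2(dset, "schema_version", H5T_NATIVE_INT, aspace,
                              H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_INT, &version);

    H5Aclose(attr);  H5Sclose(aspace);
    H5Dclose(dset);  H5Sclose(space);
    H5Gclose(group); H5Fclose(file);
    return 0;
}
```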
The HDF5 community has continued adding features to access data in object and cloud storage and to exploit the storage systems being deployed on today's exascale machines. These features take advantage of the new storage paradigms while requiring minimal changes to existing HDF5 applications. Over the past decade, the amount of simulation, modeling, experimental, and observational data stored in HDF5, and the rate at which this data is collected, have created new challenges for scientists and driven requests for these new storage paradigms. Moreover, AI applications using HDF5 need to read data many times and to shuffle it.
The HDF Group, The Ohio State University, Lawrence Berkeley Lab, Lifeboat, LLC, and Amazon AWS HPC teams have been working on enhancing HDF5 to address these challenges. We will present the latest HDF5 enhancements that help applications run on exascale systems, exploit object storage, migrate to the cloud, and collect and store new types of data. We will demonstrate how the HDF5 virtual object layer (VOL) and virtual file driver (VFD) architectures now allow users to tackle scalable I/O on parallel file systems, data access on object stores, asynchronous I/O, multi-threaded access to data, and more.
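As a rough illustration of why these enhancements require so little application change, the sketch below (assuming an MPI-enabled HDF5 build; the file name is illustrative) swaps in the MPI-IO virtual file driver through the file access property list. Targeting an object store or an asynchronous VOL connector follows the same pattern, for example by setting a different driver on the property list or setting the HDF5_VOL_CONNECTOR environment variable, while the application's dataset calls stay unchanged.

```c
/* Sketch: selecting a storage back end via the file access property list.
 * Only the property-list setup changes; the rest of the I/O code does not.
 * Build with: h5pcc example_mpi.c */
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Select the MPI-IO virtual file driver for a parallel file system. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... H5Dcreate2 / H5Dwrite calls unchanged from the serial version ... */
    H5Fclose(file);
    H5Pclose(fapl);

    MPI_Finalize();
    return 0;
}
```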
The target audience of this BoF spans the broad HDF5 user community: existing users such as Exascale Computing Project (ECP) application developers and accelerator scientists, and new users such as the high-energy physics community, which is exploring HDF5 as an alternative file format.
Our session format focuses on encouraging HDF5 community members to discuss the challenges they face when using HDF5 and to give feedback to HDF5 developers. We will present a brief HDF5 roadmap, invite current users to share their experiences applying HDF5's features to real-world problems, solicit feedback on HDF5 improvements, and gather requirements from new users.
Agenda
Time slot (MST) | Presenter | Topic
17:15–17:25 | Dana Robinson (The HDF Group) | Introduction and HDF5 Roadmap
17:25–17:35 | Jay Lofstead (Sandia National Lab) | Fast, Searchable Data Annotations for Accelerating Time to Insight
17:35–17:45 | Ravi Madduri (Argonne National Lab) | Advanced Privacy-Preserving Federated Learning as a Service: Challenges and Opportunities
17:45–17:55 | William Godoy (Oak Ridge National Lab) | HDF5 as a critical component in the Julia HPC ecosystem
17:55–18:05 | Johannes Blaschke (Lawrence Berkeley Lab) | Perspectives from Data-Intensive HPC at NERSC
18:15–18:25 | Glenn Lockwood (Microsoft) | I/O middleware for artificial intelligence: real intelligence required
18:25–18:45 | Panel | The next 25 years of HDF5