Using multiprocessing with HSDS to speed up access to cloud data - John Readey on Call the Doctor 11/4/25
The ‘s’ in HSDS stands for “scalable”, but if you are only using a single client process, you are not taking full advantage of the capabilities of HSDS.
Using multiprocessing can seem intimidating at first, but it’s not that hard, and it can go a long way toward mitigating the higher request latencies encountered when accessing remote data stores and services.
In this CTD, we’ll walk through an example of using multiprocessing to speed up a typical data analysis task.
To join, just jump on the Zoom:
Launch Meeting - Zoom
November 4, 2025, 12:20 p.m. Central Time (US/Canada)
I was going to write a summary of the presentation, but our AI agent beat me to it:
Purpose

Demonstration of multiprocessing techniques for data analytics.

Takeaways

- John Readey demonstrated techniques for using multiprocessing to speed up data analytics tasks.
- The data used for the demonstration is from the National Renewable Energy Lab (NREL), with 2 petabytes accessible via HSDS on AWS S3.
- HSDS was set up on an EC2 instance in the US West 2 data center for efficiency.
- Multiprocessing was implemented on the client side to better utilize server hardware.
- Python’s multiprocessing package was used, with a focus on queue objects for process communication.
- Eight processes were launched to handle data tasks, achieving a 4x speedup compared to sequential processing.
- Results were stored in an HDF5 file, demonstrating safe concurrent writes with HSDS.
- The demonstration showed a reduction in processing time from 2 seconds per column to less than 0.5 seconds per column.
- The notebook used for the demonstration is available on GitHub for further exploration.

Detailed summary

Introduction to data analytics with multiprocessing

- John Readey introduced the session focused on using multiprocessing for data analytics.
- The data source is NREL’s 2 petabytes of data accessible via HSDS on AWS S3.
- HSDS setup was demonstrated on an EC2 instance in the US West 2 data center.

Setting up the environment

- HSDS was configured with eight DN nodes, one SN node, and one head node.
- Custom configuration included setting the memdata cache and the max task count.
- The setup aimed to efficiently handle large datasets by running HSDS on AWS.
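
If you stand up your own HSDS instance like this, the client side mainly needs to know where the server is. Here is a minimal sketch of pointing h5pyd at a specific endpoint; the endpoint URL, credentials, and domain path are placeholders, not the values from the session (you can also store them once with the `hsconfigure` tool and omit the keyword arguments).

```python
import h5pyd

# Placeholder endpoint for an HSDS instance running on EC2 in us-west-2;
# substitute your own server URL and credentials, or configure them once
# with `hsconfigure` and drop these keyword arguments.
ENDPOINT = "http://ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:5101"

f = h5pyd.File("/nrel/nsrdb/v3/nsrdb_2019.h5", "r",
               endpoint=ENDPOINT,
               username="myuser", password="mypass")
print(list(f))           # list the objects in the domain
print(f["dhi"].shape)    # shape of the 'dhi' dataset
f.close()
```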

Exploring the dataset

- The dataset used was NREL’s solar radiation data, specifically the ‘dhi’ dataset.
- The dataset’s size is approximately 597 gigabytes, with a chunk size of 2 megabytes.
- Data exploration included reading and plotting a column of data.
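
Reading a “column” here means pulling the full time series for one site from the two-dimensional (time, site) dhi dataset. A minimal sketch of that read-and-plot step, assuming the NSRDB domain path and a site index chosen purely for illustration (see John’s notebook for the exact values):

```python
import h5pyd
import matplotlib.pyplot as plt

# Assumed NSRDB domain path and site index; check the notebook for the
# values actually used in the session.
with h5pyd.File("/nrel/nsrdb/v3/nsrdb_2019.h5", "r") as f:
    dhi = f["dhi"]
    print(dhi.shape, dhi.dtype)   # inspect the dataset before reading
    col = dhi[:, 1234]            # full time series for one site (one "column")

plt.plot(col)
plt.xlabel("time step")
plt.ylabel("dhi")
plt.show()
```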

Implementing multiprocessing

- Multiprocessing was implemented using Python’s multiprocessing package.
- A queue object was used for process communication and task distribution.
- The function for processing data was designed to handle source and target paths.
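
John’s notebook is the definitive version, but the pattern those points describe looks roughly like the sketch below: a multiprocessing.Queue holds the column indices to process, and each worker process opens its own h5pyd handles (connections shouldn’t be shared across processes), reads a column from the source domain, and writes its result to the target domain. The domain paths, dataset names, and the per-column reduction here are assumptions for illustration only.

```python
import multiprocessing as mp
import h5pyd

NUM_PROCS = 8

def worker(queue, src_path, tgt_path):
    """Process column indices from the queue until a None sentinel arrives."""
    # Each worker opens its own handles; HTTP connections should not be
    # shared across processes.
    with h5pyd.File(src_path, "r") as src, h5pyd.File(tgt_path, "a") as tgt:
        dhi = src["dhi"]
        means = tgt["dhi_mean"]
        while True:
            col = queue.get()
            if col is None:            # sentinel: no more work
                break
            data = dhi[:, col]         # read one column via HSDS
            means[col] = data.mean()   # each worker writes a disjoint element

if __name__ == "__main__":
    src_path = "/nrel/nsrdb/v3/nsrdb_2019.h5"   # assumed source domain
    tgt_path = "/home/myuser/dhi_means.h5"      # hypothetical target domain
    ncols = 100                                  # columns to process in this run

    # Create the target dataset up front: one result value per column.
    with h5pyd.File(tgt_path, "w") as tgt:
        tgt.create_dataset("dhi_mean", (ncols,), dtype="f4")

    queue = mp.Queue()
    for col in range(ncols):
        queue.put(col)
    for _ in range(NUM_PROCS):
        queue.put(None)                # one sentinel per worker

    procs = [mp.Process(target=worker, args=(queue, src_path, tgt_path))
             for _ in range(NUM_PROCS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Because every write goes through the HSDS service rather than directly to a file, the workers can all write to the same target domain without stepping on each other, which is what the session demonstrated.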

Performance and results

- Eight processes were launched, achieving a 4x speedup over sequential processing.
- Processing time was reduced from 2 seconds per column to less than 0.5 seconds per column.
- Results were stored in an HDF5 file, demonstrating safe concurrent writes.
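
Speedup numbers like these depend on your network, the chunk layout, and the number of HSDS nodes, so it is worth timing your own runs. A small harness like the following is enough to compare the two variants; run_sequential and run_multiprocess are hypothetical wrappers around the single-process loop and the worker-pool version sketched above.

```python
import time

def timed(label, fn):
    """Run fn() and report elapsed wall-clock time."""
    t0 = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - t0
    print(f"{label}: {elapsed:.1f} s total")
    return elapsed

# Hypothetical usage:
# t_seq = timed("sequential", run_sequential)
# t_mp  = timed("8 workers", run_multiprocess)
# print(f"speedup: {t_seq / t_mp:.1f}x")
```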

Conclusion and resources

- The demonstration highlighted the benefits and complexities of multiprocessing.
- The notebook used is available on GitHub for further exploration.
- Participants were encouraged to try the techniques demonstrated in their own work.
Action items
I’m glad to see AI still needs the humans to fix the line breaks.
Here’s the video from John’s session yesterday, should you prefer an old-school medium like YouTube:
VIDEO
And to wrap it up, here’s the notebook John used during the session, which he shared in an earlier post: https://gist.github.com/jreadey/f5c642c07ba40960f8fa8f1605abb7a2