Using multiprocessing with HSDS to speed up access to cloud data - John Readey on Call the Doctor 11/4/25
The ‘s’ in HSDS stands for “scalable”, but if you are only using a single client process, you are not taking full advantage of the capabilities of HSDS.
Using multiprocessing can seem intimidating at first, but it’s not that hard, and it can go a long way toward mitigating the higher request latencies encountered when accessing remote data stores and services.
In this CTD, we’ll walk through an example of using multiprocessing to speed up a typical data analysis task.
To join, just jump on the Zoom:
Launch Meeting - Zoom
November 4, 2025, 12:20 p.m. Central Time (US/Canada)
I was going to write a summary of the presentation, but our AI agent beat me to it:
Purpose

Demonstration of multiprocessing techniques for data analytics.

Takeaways

- John Readey demonstrated techniques for using multiprocessing to speed up data analytics tasks.
- The data used for the demonstration is from the National Renewable Energy Lab (NREL), with 2 petabytes accessible via HSDS on AWS S3.
- HSDS was set up on an EC2 instance in the US West 2 data center for efficiency.
- Multiprocessing was implemented on the client side to better utilize server hardware.
- Python’s multiprocessing package was used, with a focus on queue objects for process communication.
- Eight processes were launched to handle data tasks, achieving a 4x speedup compared to sequential processing.
- Results were stored in an HDF5 file, demonstrating safe concurrent writes with HSDS.
- The demonstration showed a reduction in processing time from 2 seconds per column to less than 0.5 seconds per column.
- The notebook used for the demonstration is available on GitHub for further exploration.

Detailed summary

Introduction to data analytics with multiprocessing

- John Readey introduced the session focused on using multiprocessing for data analytics.
- The data source is NREL’s 2 petabytes of data accessible via HSDS on AWS S3.
- HSDS setup was demonstrated on an EC2 instance in the US West 2 data center.

Setting up the environment

- HSDS was configured with eight DN nodes, one SN node, and one head node.
- Custom configuration included setting the memdata cache and the max task count.
- The setup aimed to efficiently handle large datasets by running HSDS on AWS.
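
If you stand up your own HSDS instance like this, the client side mainly needs to know where the server is. Here is a minimal sketch of pointing h5pyd at a specific endpoint; the endpoint URL, credentials, and domain path are placeholders, not the values from the session (you can also store them once with the `hsconfigure` tool and omit the keyword arguments).

```python
import h5pyd

# Placeholder endpoint for an HSDS instance running on EC2 in us-west-2;
# substitute your own server URL and credentials, or configure them once
# with `hsconfigure` and drop these keyword arguments.
ENDPOINT = "http://ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:5101"

f = h5pyd.File("/nrel/nsrdb/v3/nsrdb_2019.h5", "r",
               endpoint=ENDPOINT,
               username="myuser", password="mypass")
print(list(f))           # list the objects in the domain
print(f["dhi"].shape)    # shape of the 'dhi' dataset
f.close()
```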

Exploring the dataset

- The dataset used was NREL’s solar radiation data, specifically the ‘dhi’ dataset.
- The dataset’s size is approximately 597 gigabytes, with a chunk size of 2 megabytes.
- Data exploration included reading and plotting a column of data.
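
Reading a “column” here means pulling the full time series for one site from the two-dimensional (time, site) dhi dataset. A minimal sketch of that read-and-plot step, assuming the NSRDB domain path and a site index chosen purely for illustration (see John’s notebook for the exact values):

```python
import h5pyd
import matplotlib.pyplot as plt

# Assumed NSRDB domain path and site index; check the notebook for the
# values actually used in the session.
with h5pyd.File("/nrel/nsrdb/v3/nsrdb_2019.h5", "r") as f:
    dhi = f["dhi"]
    print(dhi.shape, dhi.dtype)   # inspect the dataset before reading
    col = dhi[:, 1234]            # full time series for one site (one "column")

plt.plot(col)
plt.xlabel("time step")
plt.ylabel("dhi")
plt.show()
```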

Implementing multiprocessing

- Multiprocessing was implemented using Python’s multiprocessing package.
- A queue object was used for process communication and task distribution.
- The function for processing data was designed to handle source and target paths.
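
John’s notebook is the definitive version, but the pattern those points describe looks roughly like the sketch below: a multiprocessing.Queue holds the column indices to process, and each worker process opens its own h5pyd handles (connections shouldn’t be shared across processes), reads a column from the source domain, and writes its result to the target domain. The domain paths, dataset names, and the per-column reduction here are assumptions for illustration only.

```python
import multiprocessing as mp
import h5pyd

NUM_PROCS = 8

def worker(queue, src_path, tgt_path):
    """Process column indices from the queue until a None sentinel arrives."""
    # Each worker opens its own handles; HTTP connections should not be
    # shared across processes.
    with h5pyd.File(src_path, "r") as src, h5pyd.File(tgt_path, "a") as tgt:
        dhi = src["dhi"]
        means = tgt["dhi_mean"]
        while True:
            col = queue.get()
            if col is None:            # sentinel: no more work
                break
            data = dhi[:, col]         # read one column via HSDS
            means[col] = data.mean()   # each worker writes a disjoint element

if __name__ == "__main__":
    src_path = "/nrel/nsrdb/v3/nsrdb_2019.h5"   # assumed source domain
    tgt_path = "/home/myuser/dhi_means.h5"      # hypothetical target domain
    ncols = 100                                  # columns to process in this run

    # Create the target dataset up front: one result value per column.
    with h5pyd.File(tgt_path, "w") as tgt:
        tgt.create_dataset("dhi_mean", (ncols,), dtype="f4")

    queue = mp.Queue()
    for col in range(ncols):
        queue.put(col)
    for _ in range(NUM_PROCS):
        queue.put(None)                # one sentinel per worker

    procs = [mp.Process(target=worker, args=(queue, src_path, tgt_path))
             for _ in range(NUM_PROCS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Because every write goes through the HSDS service rather than directly to a file, the workers can all write to the same target domain without stepping on each other, which is what the session demonstrated.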

Performance and results

- Eight processes were launched, achieving a 4x speedup over sequential processing.
- Processing time was reduced from 2 seconds per column to less than 0.5 seconds per column.
- Results were stored in an HDF5 file, demonstrating safe concurrent writes.
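
Speedup numbers like these depend on your network, the chunk layout, and the number of HSDS nodes, so it is worth timing your own runs. A small harness like the following is enough to compare the two variants; run_sequential and run_multiprocess are hypothetical wrappers around the single-process loop and the worker-pool version sketched above.

```python
import time

def timed(label, fn):
    """Run fn() and report elapsed wall-clock time."""
    t0 = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - t0
    print(f"{label}: {elapsed:.1f} s total")
    return elapsed

# Hypothetical usage:
# t_seq = timed("sequential", run_sequential)
# t_mp  = timed("8 workers", run_multiprocess)
# print(f"speedup: {t_seq / t_mp:.1f}x")
```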

Conclusion and resources

- The demonstration highlighted the benefits and complexities of multiprocessing.
- The notebook used is available on GitHub for further exploration.
- Participants were encouraged to try the techniques demonstrated in their own work.
Action items
I’m glad to see AI still needs the humans to fix the line breaks.
Here’s the video from John’s session yesterday, should you prefer an old-school medium like YouTube:
VIDEO
And to wrap it up, here’s the notebook John used during the session, which he shared in an earlier post: https://gist.github.com/jreadey/f5c642c07ba40960f8fa8f1605abb7a2