Using multiprocessing with HSDS to speed up access to cloud data - John Readey on Call the Doctor 11/4/25

The ‘s’ in HSDS stands for “scalable”, but if you are only using a single client process, you are not taking full advantage of the capabilities of HSDS.

Using multiprocessing can seem intimidating at first, but it’s not that hard, and it goes a long way toward mitigating the higher request latencies encountered when accessing remote data stores and services.

In this CTD, we’ll walk through an example of using multiprocessing to speed up a typical data analysis task.

To join, just jump on the Zoom:
Launch Meeting - Zoom
November 4, 2025, 12:20 p.m. Central Time (US/Canada)

Here’s the notebook I plan to review today: Example of using multi-processing to speed up data analytics with HSDS · GitHub

I was going to write a summary of the presentation, but our AI agent beat me to it:

Purpose

Demonstration of multiprocessing techniques for data analytics.

Takeaways

  • John Readey demonstrated techniques for using multiprocessing to speed up data analytics tasks.

  • The data used for the demonstration is from the National Renewable Energy Lab (NREL), with 2 petabytes accessible via HSDS on AWS S3.

  • HSDS was set up on an EC2 instance in the US West 2 data center for efficiency.

  • Multiprocessing was implemented on the client side to better utilize server hardware.

  • Python’s multiprocessing package was used, with a focus on queue objects for process communication.

  • Eight processes were launched to handle data tasks, achieving a 4x speedup compared to sequential processing.

  • Results were stored in an HDF5 file, demonstrating safe concurrent writes with HSDS.

  • The demonstration showed a reduction in processing time from 2 seconds per column to less than 0.5 seconds per column.

  • The notebook used for the demonstration is available on GitHub for further exploration.

Detailed summary

Introduction to data analytics with multiprocessing

  • John Readey introduced the session focused on using multiprocessing for data analytics.

  • The data source is NREL’s 2 petabytes of data accessible via HSDS on AWS S3.

  • HSDS setup was demonstrated on an EC2 instance in the US West 2 data center.

Setting up the environment

  • HSDS was configured with eight DN nodes, one SN, and one head node.

  • Custom configurations included setting the memdata cache and max task count.

  • The setup aimed to efficiently handle large datasets by running HSDS on AWS.
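For readers who want to try a similar setup: a hedged sketch of what the configuration override might look like. The key names and values below are assumptions based on the settings mentioned above, not John's actual override file; check the config.yml shipped with your HSDS release for the exact names.

```yaml
# override.yml -- sketch only; exact key names vary by HSDS release
metadata_mem_cache_size: 128m   # the "memdata cache" setting mentioned above (assumed name)
chunk_mem_cache_size: 128m      # per-DN-node chunk cache (assumed name)
max_task_count: 100             # cap on concurrent tasks per node
```

The eight DN nodes, one SN node, and one head node are requested when the server is launched (for example, passing the DN count to the startup script in a Docker-based deployment), not in the config file itself.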

Exploring the dataset

  • The dataset used was NREL’s solar radiation data, specifically the ‘dhi’ dataset.

  • The dataset’s size is approximately 597 gigabytes, with a chunk size of 2 megabytes.

  • Data exploration included reading and plotting a column of data.

Implementing multiprocessing

  • Multiprocessing was implemented using Python’s multiprocessing package.

  • A queue object was used for process communication and task distribution.

  • The function for processing data was designed to handle source and target paths.
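The queue-based pattern described above can be sketched with the standard library alone. This is not John's notebook code, just a minimal illustration of the technique: worker processes pull column indices from a task queue and push results to a result queue. In the real notebook, each worker would read its column from HSDS via h5pyd and write the result to an HDF5 file; here, `process_column` is a stand-in computation.

```python
import multiprocessing as mp

def process_column(col):
    # Stand-in for the real per-column analysis (e.g. a statistic over
    # one column of the 'dhi' dataset read via h5pyd).
    data = [(col * 1000 + i) % 7 for i in range(1000)]
    return sum(data) / len(data)

def worker(task_q, result_q):
    # Pull column indices until we see the None sentinel, then exit.
    while True:
        col = task_q.get()
        if col is None:
            break
        result_q.put((col, process_column(col)))

def run_parallel(n_cols=16, n_procs=8):
    task_q = mp.Queue()
    result_q = mp.Queue()
    procs = [mp.Process(target=worker, args=(task_q, result_q))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for col in range(n_cols):
        task_q.put(col)
    for _ in procs:
        task_q.put(None)  # one sentinel per worker so every process exits
    # Drain all results before joining, so the queue's feeder threads
    # don't block the workers from exiting.
    results = dict(result_q.get() for _ in range(n_cols))
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_parallel(n_cols=16, n_procs=8))
```

Using one sentinel per worker is a common way to shut the pool down cleanly; since the tasks are independent, the order in which workers grab columns doesn't matter.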

Performance and results

  • Eight processes were launched, achieving a 4x speedup over sequential processing.

  • Processing time was reduced from 2 seconds per column to less than 0.5 seconds per column.

  • Results were stored in an HDF5 file, demonstrating safe concurrent writes.

Conclusion and resources

  • The demonstration highlighted the benefits and complexities of multiprocessing.

  • The notebook used is available on GitHub for further exploration.

  • Participants were encouraged to try the techniques demonstrated in their own work.

Action items

  • Share the GitHub link to the notebook in the meeting notes and YouTube video description.

I’m glad to see AI still needs the humans to fix the line breaks.

Here’s the video from John’s session yesterday, should you prefer an old-school medium like YouTube:

And to wrap it up - the notebook John used during the session that he shared in an earlier post: https://gist.github.com/jreadey/f5c642c07ba40960f8fa8f1605abb7a2