Improving Parallel I/O performance

I am using the HighFive library on the NERSC Perlmutter system. I have an application that has to do parallel reads/writes, and I was following the parallel collective I/O example from the HighFive repo: GitHub - highfive-devs/highfive: User-friendly, header-only, C++14 wrapper for HDF5.

Here is my code that does the parallel reading:

    using namespace HighFive;

    try {
        FileAccessProps fapl;
        fapl.add(MPIOFileAccess { comm.get_row_comm(), MPI_INFO_NULL });
        fapl.add(MPIOCollectiveMetadata {});

        auto xfer_props = DataTransferProps {};
        xfer_props.add(UseCollectiveIO {});

        size_t row_start
            = Utils::get_start_index(glob_num_rows, comm.get_row_color(), comm.get_proc_rows());
        std::vector<double> vec(block_size * num_cols);
        for (int r = 0; r < num_rows; r++) {
            std::string zero_pad_vec_str = Utils::zero_pad(r + row_start, 6);
            std::string vec_filename = dirname + zero_pad_vec_str + ".h5";
            File file(vec_filename, File::ReadOnly, fapl);

            auto dataset = file.getDataSet("vec");

            int reindex, n_blocks, steps;
            dataset.getAttribute("reindex").read<int>(reindex);
            dataset.getAttribute("n_param").read<int>(n_blocks);
            dataset.getAttribute("param_steps").read<int>(steps);
            // do some checks on the attributes ...

            size_t col_start
                = Utils::get_start_index(glob_num_cols, comm.get_col_color(), comm.get_proc_cols());

            dataset.select({ col_start * block_size }, { (size_t)num_cols * block_size })
                .read(vec, xfer_props);
            Utils::check_collective_io(xfer_props);
        }
    } catch (const std::exception& e) {
        if (comm.get_world_rank() == 0)
            fprintf(stderr, "Error reading matrix from file: %s\n", e.what());
        MPICHECK(MPI_Abort(comm.get_global_comm(), 1));
    }
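
The write path is essentially the mirror of the read above; here is a simplified sketch (same helpers and communicator layout, but the dataset-creation details are representative rather than my exact code):

    using namespace HighFive;

    FileAccessProps fapl;
    fapl.add(MPIOFileAccess { comm.get_row_comm(), MPI_INFO_NULL });
    fapl.add(MPIOCollectiveMetadata {});

    auto xfer_props = DataTransferProps {};
    xfer_props.add(UseCollectiveIO {});

    // Every rank writes its contiguous slice of the 1-D dataset collectively.
    File file(vec_filename, File::Overwrite, fapl);
    auto dataset = file.createDataSet<double>(
        "vec", DataSpace(std::vector<size_t> { (size_t)glob_num_cols * block_size }));

    size_t col_start
        = Utils::get_start_index(glob_num_cols, comm.get_col_color(), comm.get_proc_cols());
    dataset.select({ col_start * block_size }, { (size_t)num_cols * block_size })
        .write(vec, xfer_props);
    Utils::check_collective_io(xfer_props);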

I use similar code (sketched above) to write the data. I profiled this code with Darshan/Drishti, and the report flagged the following issues:

Application issues a high number (9961627) of small read requests (i.e., < 1MB) which represents 100.00% of all read requests
Application issues a high number (85283) of small write requests (i.e., < 1MB) which represents 100.00% of all write requests
Detected write imbalance when accessing 11 individual files  
Detected read imbalance when accessing 1223 individual files.

There are also a lot of read/write load imbalances. I’m not quite sure how to go about fixing this. Can you provide any insight?

Here is a link to the profiling reports. The ones named *-2.pdf are for a slightly simpler application, so it may be more useful to start there.

For reference, num_rows is ~600, and each row corresponds to a different file. Within each file, each process reads a contiguous chunk of num_cols*block_size elements; in these examples, that was ~2.15M double-precision floating-point numbers.
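That works out to roughly 2.15M doubles × 8 bytes ≈ 17 MB per process per file for each read, so the HDF5-level requests themselves should be well above the 1 MB “small request” threshold in the Drishti report.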

Thanks

Hi @srvenkat,

At first glance, this looks like it might be an issue on the system or MPI-IO side (either an actual system problem or a problem with how the data striping is configured) rather than with HDF5, but more investigation would be needed. The Darshan reports appear to show that most of the I/O was performed through collective MPI-IO reads in the 10 MiB - 100 MiB size range, which is good. But those then seem to have been transformed into a significantly larger number of smaller POSIX reads, which is what Drishti is picking up on. Based on both the Darshan and Drishti reports, the application seems to be doing “everything right” from the HDF5 perspective. It would probably take a bit more digging into the raw Darshan log, rather than the PDF summary, to get an idea of what’s going on here.

Hi @jhenderson,
Thanks for your reply. I have uploaded to the drive some Darshan/Drishti reports from additional tests I conducted. This application reads a single vector of size 263169*500, which comes to ~1004 MiB (matching what the reports say).

Reports 3 and 4 are for 2 MPI ranks. Report 3 has both ranks on the same node and report 4 has them on different nodes. Report 5 has 16 ranks over 4 nodes.

The reports do mention some other, much smaller I/O operations, but those appear to be against other log files (perhaps Darshan's or Slurm's). From the reports, it looks like rank 0 read the entire file in every case.

I also included a Darshan DXT trace report matching the PDF of report 5 (16 MPI ranks). From what I can see there, rank 0 is doing all or most of the I/O operations, which seems off.

I also ran some other tests in which I made a very basic program with the HDF5 C API to read/write a single vector. I tested it with collective vs. independent I/O modes, using a vector of ~1B double-precision floating-point values, and found that using independent I/O resulted in I/O times about 3x faster.
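For reference, the collective/independent switch in that test is just the standard dataset-transfer property; here is a minimal sketch of the pattern (illustrative, not my exact code):

    // Sketch: each rank reads its contiguous slice of the 1-D "vector" dataset,
    // toggling between collective and independent MPI-IO via the transfer plist.
    #include <hdf5.h>
    #include <mpi.h>
    #include <vector>

    std::vector<double> read_slice(const char* fname, MPI_Comm comm, bool collective)
    {
        int rank, nranks;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nranks);

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);

        hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);
        hid_t dset = H5Dopen2(file, "vector", H5P_DEFAULT);

        hid_t fspace = H5Dget_space(dset);
        hsize_t total;
        H5Sget_simple_extent_dims(fspace, &total, nullptr);

        hsize_t count  = total / nranks;              // assumes an even split
        hsize_t offset = count * (hsize_t)rank;
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, nullptr, &count, nullptr);
        hid_t mspace = H5Screate_simple(1, &count, nullptr);

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE : H5FD_MPIO_INDEPENDENT);

        std::vector<double> buf(count);
        H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf.data());

        H5Pclose(dxpl);
        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dclose(dset);
        H5Fclose(file);
        H5Pclose(fapl);
        return buf;
    }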

I also tried h5py and found that the read time matched my independent-I/O C version, while the write time was about 3x faster than the C version with independent I/O. All of these tests were also run on 16 MPI ranks over 4 nodes.

Currently, I believe all the files I’m using have a stripe count of 1. I can change that, though, and in my production code the vectors (~1B doubles each) are striped with a count of 8.
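If striping hints turn out to help, my understanding is that they can be passed through the MPI_Info object given to the MPI-IO driver when a file is created; here is a sketch with illustrative values (I have not verified these exact settings on Perlmutter):

    // Sketch: Lustre striping hints at file-creation time (they only apply to
    // newly created files, not when opening existing ones).
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");        // stripe count
    MPI_Info_set(info, "striping_unit", "16777216");   // 16 MiB stripe size

    HighFive::FileAccessProps fapl;
    fapl.add(HighFive::MPIOFileAccess { comm.get_row_comm(), info });
    fapl.add(HighFive::MPIOCollectiveMetadata {});

    HighFive::File file(vec_filename, HighFive::File::Overwrite, fapl);
    MPI_Info_free(&info);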

Let me know if any other reports/tests would be useful, and I can try generating them.

Thanks again for your help!

Can you give some insight into the data layout? Are these contiguous or chunked datasets, and are you using compression or other filters? Is the MPI MPICH or OpenMPI? Are you setting any alignment for HDF5?

You will generally want to use both H5Pset_all_coll_metadata_ops(plist_id, true) and H5Pset_coll_metadata_write(plist_id, true).
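Roughly, on the file access property list (C API shown; adapt to however you construct your property lists):

    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
    H5Pset_all_coll_metadata_ops(fapl_id, true);   // collective metadata reads
    H5Pset_coll_metadata_write(fapl_id, true);     // collective metadata writes
    // ... open or create the file with fapl_id, then H5Pclose(fapl_id) ...

If you are going through HighFive, I believe its MPIOCollectiveMetadata property is intended to set the same two flags.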

With these options, are you still seeing these small reads for your C test program?

I’ve added two more DXT reports, h5iotest-collective.txt and h5iotest-independent.txt, for the simple test program that reads and writes one vector. I did add the two calls you mentioned to the code.

From the reports, I can see that the collective I/O run has those small reads but the independent I/O run does not, which is interesting.

Here is the output of h5dump -H on my dataset:

HDF5 "/pscratch/sd/s/srvenkat/cascadia/config_1_new/base/test_parallel_vector.h5" {
GROUP "/" {
   DATASET "vector" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 1000000000 ) / ( 1000000000 ) }
   }
}
}

so it is contiguous, and I am not using any compression or filters. I am not setting any alignment for HDF5, to my knowledge. The MPI is Cray MPICH (I am using the Perlmutter system).
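
In case alignment ends up mattering, my understanding is that it would be set on the file access property list like this (illustrative values only; this is not in my current code):

    // Illustrative sketch: align objects of >= 1 MiB to 16 MiB boundaries,
    // e.g. to match a Lustre stripe size. Not something I am doing today.
    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
    H5Pset_alignment(fapl_id, 1048576, 16777216);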