Write performance of raw data

Hi there,
I'm writing an application that stores data to an HDF5 file at a data rate of up to 2.5 GB/s for up to 1 hour.

When I write to a binary file with POSIX write() and the O_DIRECT flag on an NVMe disk, I get 2.8 GB/s, which is close to the disk's rated performance. I use a block size / chunk size of 4 MB.

When I do the same test with HDF5 on a 1D dataset with 4 MB chunks, I get 1.1 GB/s.

How can I get the same performance with HDF5? I have tried using
H5Pset_fapl_direct(…) with the cache disabled, but it makes no difference.

I have also tried H5Pset_alloc_time(cparms, H5D_ALLOC_TIME_EARLY),
which gives better write performance but takes a long time to initialize.

The tests were run with a file size of ~500 GB.

The tests are formulated as two unit tests:

#include <fcntl.h>      // open, O_DIRECT, O_WRONLY
#include <unistd.h>     // write, close
#include <cstdint>
#include <cstdio>       // perror
#include <cstdlib>      // posix_memalign, free, exit
#include <chrono>
#include <iostream>
#include <string>
#include <gtest/gtest.h>
#include <hdf5.h>

TEST(HDF5PerfTest, ODirect){
constexpr std::uint64_t chunk_size = 4*1024*1024;
constexpr uint64_t target_file_size = 500'000'000'000;
constexpr size_t NumberOfPkts = static_cast<size_t>(target_file_size / chunk_size);

int fd;
void* buf;
posix_memalign(&buf, 4096, chunk_size);

// open file with O_DIRECT flag
fd = open("ODirect.bin", O_CREAT | O_TRUNC | O_DIRECT | O_WRONLY, 0644);
if (fd == -1) {
    perror("open");
    exit(1);
}

ssize_t bytesWritten = 0;

auto startTime = std::chrono::steady_clock::now();

// write data to file
for (size_t i = 0; i != NumberOfPkts; ++i){
    auto ret = write(fd, reinterpret_cast<char*>(buf), chunk_size);
    if (ret == -1) {
        perror("write");
        exit(1);
    }
    bytesWritten += ret;
}

// close file and free memory
close(fd);
free(buf);

auto elapsedNs = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - startTime).count();
double bytesPerNs = static_cast<double>(bytesWritten) / static_cast<double>(elapsedNs);
std::cout << "\nPayload=" << static_cast<double>(bytesWritten)/1e6 << "MB Speed=" << bytesPerNs*1e3 << "MB/s" << std::endl;

std::cout << "Done!" << std::endl;

}

// And with HDF5

TEST(HDF5PerfTest, WriteSpeedH5Plain)
{

constexpr std::uint64_t chunk_size = 4*1024*1024;
const uint64_t target_file_size = 500'000'000'000;
const size_t NumberOfPkts = static_cast<size_t>(target_file_size / chunk_size);

void* buf;
posix_memalign(&buf, 4096, chunk_size);

hsize_t maxdims[2] = {H5S_UNLIMITED, 1};
hsize_t dims[2]    = {chunk_size*NumberOfPkts, 1};
hsize_t chunk_dims[2]    = {chunk_size, 1};

const uint64_t RANK = 1;
/*
 * Create the data space with unlimited dimensions.
 */
auto dataspace = H5Screate_simple(RANK, dims, maxdims);


/*
 * Create a new file. If the file exists, its contents will be overwritten.
 */
std::string path = data_path + "WriteSpeedH5Plain.h5"; // data_path is defined elsewhere in the test suite


hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
//H5Pset_fapl_direct(fapl_id, 4096, 4096, chunk_size);
//H5Pset_cache(fapl_id, 0,0,0,0.0);


// create the file with the file creation and access property lists
hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
hid_t file = H5Fcreate(path.c_str(), H5F_ACC_TRUNC, fcpl, fapl_id);
hid_t cparms = H5Pcreate(H5P_DATASET_CREATE);
herr_t status = H5Pset_chunk(cparms, RANK, chunk_dims);

if (status < 0){
    std::cout << "Failed to H5Pset_chunk" << std::endl;
    EXPECT_TRUE(false);
}

//H5Pset_alloc_time(cparms, H5D_ALLOC_TIME_EARLY);

auto dataset = H5Dcreate2(file, "dset1", H5T_STD_U8LE, dataspace, H5P_DEFAULT, cparms, H5P_DEFAULT);
//H5Pset_chunk_cache(dataset, 0,0,1.0);

uint64_t bytesWritten = 0;
hid_t   filespace;


auto startTime = std::chrono::steady_clock::now();

for (uint64_t i = 0; i != NumberOfPkts; ++i){
    /*
    * Select a hyperslab.
    */
    filespace = H5Dget_space(dataset);
    hsize_t offset[2];
    offset[0] = (i)*chunk_size;
    offset[1] = 0;
    status    = H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, chunk_dims, NULL);

    /*
    * Write the data to the hyperslab.
    */
    status = H5Dwrite(dataset, H5T_STD_U8LE, H5S_BLOCK, filespace, H5P_DEFAULT, buf);

    // close the per-iteration dataspace handle so ids do not leak across ~120,000 iterations
    H5Sclose(filespace);
    bytesWritten += chunk_size;
}
auto elapsedNs = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - startTime).count();
double bytesPerNs = static_cast<double>(bytesWritten) / static_cast<double>(elapsedNs);
std::cout << "\nPayload=" << static_cast<double>(bytesWritten)/1e6 << "MB Speed=" << bytesPerNs*1e3 << "MB/s" << std::endl;

H5Dclose(dataset);
H5Sclose(dataspace);
H5Pclose(cparms);
H5Pclose(fcpl);
H5Pclose(fapl_id);
H5Fclose(file);
free(buf);

}

Best regards
Kristian

Instead of H5Pset_alloc_time(..., H5D_ALLOC_TIME_EARLY), use H5Pset_fill_time(..., H5D_FILL_TIME_NEVER). Otherwise, you write the data twice (fill values the first time and the real thing the second time).
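A minimal sketch of the change, applied to the cparms list from your test (only the fill-time line is new):

hid_t cparms = H5Pcreate(H5P_DATASET_CREATE);
herr_t status = H5Pset_chunk(cparms, RANK, chunk_dims);
// never write fill values: each chunk is then touched only once, by the real data
H5Pset_fill_time(cparms, H5D_FILL_TIME_NEVER);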

If you want to be as close as possible to pwrite, use H5Dwrite_chunk instead of H5Dwrite. (No selections are needed in that case.)
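A sketch of what the write loop might look like with H5Dwrite_chunk, assuming the same dataset, buf, and chunk_size from your test (the third argument is the filter mask, 0 here since no filters are enabled):

hsize_t offset[2] = {0, 0};
for (uint64_t i = 0; i != NumberOfPkts; ++i) {
    offset[0] = i * chunk_size; // chunk offset in dataset element coordinates
    // write one raw 4 MB chunk; no dataspaces or hyperslab selections involved
    H5Dwrite_chunk(dataset, H5P_DEFAULT, 0, offset, chunk_size, buf);
    bytesWritten += chunk_size;
}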

G.

Thanks gheber for the useful reply. I have made the changes you suggested, and they did improve the speed: the HDF5 write speed is now 1.6-1.8 GB/s, but still not as good as POSIX.

H5Dwrite_chunk will get you there. G.

You're correct that I can write as fast as POSIX without the O_DIRECT flag. I have tried to get the direct method to work with HDF5 by setting H5Pset_fapl_direct(fapl_id, 4096, 4096, 0), but I only get even slower results.

Any ideas what I'm doing wrong?
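For reference, this is roughly the file setup in my test (a sketch; as far as I understand, passing 0 for the copy-buffer size selects the library default):

hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
// memory boundary 4096, file-system block size 4096, default copy-buffer size
H5Pset_fapl_direct(fapl_id, 4096, 4096, 0);
hid_t file = H5Fcreate(path.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);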

I have looked at the code in H5FDdirect.c in the H5FD__direct_write() method. I have added a check like this (line 1120):

else if (size % _fbsize == 0 && ((size_t)buf % _boundary == 0) ){
    addr = (addr / _fbsize) * _fbsize;
    if ((addr != file->pos || OP_WRITE != file->op) && HDlseek(file->fd, (HDoff_t)addr, SEEK_SET) < 0)
        HSYS_GOTO_ERROR(H5E_IO, H5E_SEEKERROR, FAIL, "unable to seek to proper position")

    while (size > 0) {
        do {
            nbytes = HDwrite(file->fd, buf, size);
        } while (-1 == nbytes && EINTR == errno);
        if (-1 == nbytes) /* error */
            HSYS_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, FAIL, "file write failed")
        HDassert(nbytes > 0);
        HDassert((size_t)nbytes <= size);
        H5_CHECK_OVERFLOW(nbytes, ssize_t, size_t);
        size -= (size_t)nbytes;
        H5_CHECK_OVERFLOW(nbytes, ssize_t, haddr_t);
        addr += (haddr_t)nbytes;
        buf = (const char *)buf + nbytes;
    }
}

By doing so, the buffer is not copied again if it is already aligned, and I get almost the same performance as with the O_DIRECT flag and plain POSIX. I do not know if this is the right or most elegant solution. Any thoughts on how to handle this correctly? Can I make a pull request or create an issue?

Thank you for this interesting suggestion. I will bring it up with the engineering team & report back. G.

Thanks G. I have also made an issue on the topic: https://github.com/HDFGroup/hdf5/issues/2714