Hi, I have done some more testing. Here are my notes.
HSDS Testing
The purpose of this test is to evaluate the read speed of the HSDS service installed locally (POSIX system). Different read methods are compared (h5pyd, Postman, Curl). In addition, some configuration options of the HSDS service are also analyzed.
To provide a reference baseline, the results with a local HDF5 file are also presented.
Test Preparation
A clean HSDS service is created:
- The server is a clean Ubuntu 20.04.2 LTS machine (8 CPU, 8 GB of RAM, fast SSD disk).
- HSDS version:
  - master branch.
  - version 0.7.0beta.
  - commit: 1c0af45b1f38fec327be54acc963ee28025711aa (date: 2021-07-08)
- Install instructions (POSIX Install): https://github.com/HDFGroup/hsds/blob/master/docs/docker_install_posix.md
- The post-install instructions are also followed: https://github.com/HDFGroup/hsds/blob/master/docs/post_install.md
- The override.yml file is created with the following content:
  max_request_size: 500m # MB - should be no smaller than client_max_body_size in nginx tmpl
- The HSDS service is launched with 6 data nodes: $ ./runall.sh 6 (a quick liveness check is sketched below).
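Once runall.sh finishes, a quick way to confirm that the service answers is to query its /about endpoint. A minimal sketch, assuming the default local endpoint http://localhost:5101 used throughout these notes:

import requests

# Query the HSDS "about" endpoint; the service reports state "READY"
# once the service node and the data nodes are up.
rsp = requests.get("http://localhost:5101/about")
rsp.raise_for_status()
print(rsp.json().get("state"))  # expected: READY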
Test Data
Random test data is uploaded to the HSDS service and is also written to a local HDF5 file.
The example data is a 4,000,000 x 10 dataset (random float64 data), resizable with an unlimited number of rows and chunking enabled. This amounts to 4,000,000 x 10 x 8 bytes = 320 MB of raw data, stored in 4000 x 10 chunks (320 KB each).
This data is created with the first part of the following Python script:
import numpy as np
import time
import h5py
import h5pyd

HDF5_PATH = "/home/bruce/00-DATA/HDF5/"  # My local path
HSDS_PATH = "/home/test_user1/"          # Domain in HSDS server
FILE_NAME = "testFile_fromPython.h5"
NUM_ROWS = 4000000
NUM_COLS = 10
CHUNK_SIZE = [4000, NUM_COLS]
DATASET_NAME = "myDataset"

# ********** CREATE FILE: Write in chunks *********************
randomData = np.random.rand(NUM_ROWS, NUM_COLS)

# Local HDF5
fHDF5 = h5py.File(HDF5_PATH + FILE_NAME, "w")
dset_hdf5 = fHDF5.create_dataset(DATASET_NAME, (NUM_ROWS, NUM_COLS), dtype='float64',
                                 maxshape=(None, NUM_COLS),
                                 chunks=(CHUNK_SIZE[0], CHUNK_SIZE[1]))
for iRow in range(0, NUM_ROWS, CHUNK_SIZE[0]):
    # Python slice stops are exclusive, so no "-1" is needed to write a full chunk.
    dset_hdf5[iRow:iRow+CHUNK_SIZE[0], :] = randomData[iRow:iRow+CHUNK_SIZE[0], :]
print("HDF5 file created.")
fHDF5.close()

# HSDS
fHSDS = h5pyd.File(HSDS_PATH + FILE_NAME, "w")
dset_hsds = fHSDS.create_dataset(DATASET_NAME, (NUM_ROWS, NUM_COLS), dtype='float64',
                                 maxshape=(None, NUM_COLS),
                                 chunks=(CHUNK_SIZE[0], CHUNK_SIZE[1]))
for iRow in range(0, NUM_ROWS, CHUNK_SIZE[0]):
    dset_hsds[iRow:iRow+CHUNK_SIZE[0], :] = randomData[iRow:iRow+CHUNK_SIZE[0], :]
print("HSDS loaded with new data.")
fHSDS.close()

# ***** TEST FUNCTION *****************************************
def test_system(p_file, p_type):
    if p_type == "HDF5":
        print("Testing HDF5 local file...")
        f = h5py.File(p_file, "r")
    elif p_type == "HSDS":
        print("Testing HSDS...")
        f = h5pyd.File(p_file, "r")
    else:
        raise NameError('Only HDF5 and HSDS options allowed.')
    dset = f[DATASET_NAME]

    # Small selection: part of one column.
    tStart = time.time()
    myChunk01 = dset[32000:35000, 6]
    tElapsed = time.time() - tStart
    print("Elapsed time :: myChunk01 = dset[32000:35000,6] :: ", format(tElapsed, ".4f"), 'seconds.')

    # Medium selection: the first 800,000 rows.
    tStart = time.time()
    myChunk02 = dset[0:800000, ...]
    tElapsed = time.time() - tStart
    print("Elapsed time :: myChunk02 = dset[0:800000,...] :: ", format(tElapsed, ".4f"), 'seconds.')

    # Large selection: rows 0..NUM_ROWS-2 (the slice stop is exclusive).
    tStart = time.time()
    allData = dset[0:NUM_ROWS-1, :]
    tElapsed = time.time() - tStart
    print("Elapsed time :: allData = dset[0:NUM_ROWS-1,:] :: ", format(tElapsed, ".4f"), 'seconds.')

    f.close()
    print("Test ended.")

# ********** TEST: READ FILE **********************************
test_system(HDF5_PATH + FILE_NAME, "HDF5")
test_system(HSDS_PATH + FILE_NAME, "HSDS")
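As a quick sanity check that the upload created the domain on the server, the folder can be listed with h5pyd (a minimal sketch; the Folder listing is an illustration based on the paths above):

import h5pyd

# List the domains under the test user's folder; the uploaded file
# should appear in the listing.
folder = h5pyd.Folder("/home/test_user1/")
print([name for name in folder])  # expect "testFile_fromPython.h5" in the list
folder.close()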
Test Protocol
Base Reference: HDF5
The HDF5 file is read directly (h5py library) to provide the baseline for the results.
The test is performed with the script shown in section Test Data (test_system function).
HSDS Testing
The test data contained in HSDS is read with the following methods:
- The h5pyd library.
- The Postman program.
- The Curl program.
The HSDS service is configured with http_compression set to true or false in the override.yml file, as sketched below.
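The corresponding override.yml line for the two test conditions would be the following (a sketch; the service is restarted between runs):

http_compression: true # first condition; set to false for the second condition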
Next, the configuration conditions of each read method are described.
H5pyd
The data is read with the h5pyd Python client library (version 0.8.4).
The test is performed with the test_system function shown in the Python script of section Test Data.
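h5pyd normally picks up the endpoint and credentials from the .hscfg file written during the post-install steps; they can also be passed explicitly. A minimal sketch, assuming the local endpoint and the test_user1 account used elsewhere in these notes:

import h5pyd

# Open the HSDS domain with explicit connection parameters instead of ~/.hscfg.
f = h5pyd.File("/home/test_user1/testFile_fromPython.h5", "r",
               endpoint="http://localhost:5101",
               username="test_user1", password="test")
print(f["myDataset"].shape)  # (4000000, 10)
f.close()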
Postman
The test data is read with HTTP GET requests performed with the Postman program. The requests follow the HSDS REST API documentation, specifically the Get Value request.
Firstly, the UUID of the dataset is obtained with the following request:
GET http://localhost:5101/datasets?domain=/home/test_user1/testFile_fromPython.h5
The response is:
"datasets": ["d-11d22dd0-950bbd8c-bcf7-f7efb4-749aaf"]
Next, the segments of the dataset are obtained with GET requests. These requests are sent with the Accept: application/octet-stream and Authorization headers (the same headers shown in the Curl commands below).
The performed HTTP GET requests are the following:
GET http://localhost:5101/datasets/d-11d22dd0-950bbd8c-bcf7-f7efb4-749aaf/value?domain=/home/test_user1/testFile_fromPython.h5&select=[32000:35000,6]
GET http://localhost:5101/datasets/d-11d22dd0-950bbd8c-bcf7-f7efb4-749aaf/value?domain=/home/test_user1/testFile_fromPython.h5&select=[0:800000,6]
GET http://localhost:5101/datasets/d-11d22dd0-950bbd8c-bcf7-f7efb4-749aaf/value?domain=/home/test_user1/testFile_fromPython.h5&select=[0:3999999,0:9]
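For reference, the same Get Value requests can be reproduced from Python with the requests library. A minimal sketch, using the dataset UUID, domain, and test_user1 credentials shown above:

import requests

BASE = "http://localhost:5101"
DSET_UUID = "d-11d22dd0-950bbd8c-bcf7-f7efb4-749aaf"

# Get Value request with a hyperslab selection; the octet-stream Accept
# header asks HSDS for raw binary data instead of JSON.
rsp = requests.get(
    f"{BASE}/datasets/{DSET_UUID}/value",
    params={"domain": "/home/test_user1/testFile_fromPython.h5",
            "select": "[32000:35000,6]"},
    headers={"Accept": "application/octet-stream"},
    auth=("test_user1", "test"),  # same credentials as the Basic auth header below
)
print(rsp.status_code, len(rsp.content), "bytes")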
Curl
The test data is read with HTTP GET requests performed with the Curl command line program. They are the same requests as in the Postman case. The data is written to a file with the -o option, and the Accept: application/octet-stream header makes HSDS return the raw binary values. The elapsed times are measured and presented with the -w option of the program (see the curl documentation).
The commands are the following:
curl -g --request GET 'http://localhost:5101/datasets/d-11d22dd0-950bbd8c-bcf7-f7efb4-749aaf/value?domain=/home/test_user1/testFile_fromPython.h5&select=[32000:35000,6]' --header 'Accept: application/octet-stream' --header 'Authorization: Basic dGVzdF91c2VyMTp0ZXN0' -o ~/00-DATA/tmp/testDownloadedData01.txt -w "@curlFormat.txt"
curl -g --request GET 'http://localhost:5101/datasets/d-11d22dd0-950bbd8c-bcf7-f7efb4-749aaf/value?domain=/home/test_user1/testFile_fromPython.h5&select=[0:800000,6]' --header 'Accept: application/octet-stream' --header 'Authorization: Basic dGVzdF91c2VyMTp0ZXN0' -o ~/00-DATA/tmp/testDownloadedData02.txt -w "@curlFormat.txt"
curl -g --request GET 'http://localhost:5101/datasets/d-11d22dd0-950bbd8c-bcf7-f7efb4-749aaf/value?domain=/home/test_user1/testFile_fromPython.h5&select=[0:3999999,0:9]' --header 'Accept: application/octet-stream' --header 'Authorization: Basic dGVzdF91c2VyMTp0ZXN0' -o ~/00-DATA/tmp/testDownloadedData03.txt -w "@curlFormat.txt"
The curlFormat.txt file is the following:
\n
http_version: %{http_version}\n
response_code: %{response_code}\n
size_download: %{size_download} bytes\n
speed_download: %{speed_download} bytes/s\n
time_namelookup: %{time_namelookup} s\n
time_connect: %{time_connect} s\n
time_appconnect: %{time_appconnect} s\n
time_pretransfer: %{time_pretransfer} s\n
time_redirect: %{time_redirect} s\n
time_starttransfer: %{time_starttransfer} s\n
----------\n
time_total: %{time_total} s\n
Test Results
Local HDF5 file
Elapsed time :: myChunk01 = dset[32000:35000,6] :: 0.0005 seconds.
Elapsed time :: myChunk02 = dset[0:800000,...] :: 0.0370 seconds.
Elapsed time :: allData = dset[0:NUM_ROWS-1,:] :: 0.1840 seconds.
H5pyd
- When http_compression is true:
Elapsed time :: myChunk01 = dset[32000:35000,6] :: 0.0082 seconds.
Elapsed time :: myChunk02 = dset[0:800000,...] :: 3.9666 seconds.
Elapsed time :: allData = dset[0:NUM_ROWS-1,:] :: 20.6691 seconds.
- When http_compression is false:
Elapsed time :: myChunk01 = dset[32000:35000,6] :: 0.0063 seconds.
Elapsed time :: myChunk02 = dset[0:800000,...] :: 0.4718 seconds.
Elapsed time :: allData = dset[0:NUM_ROWS-1,:] :: 3.6059 seconds.
Postman
select          | http_compression: true | http_compression: false
[32000:35000,6] |                        |
[0:800000,6]    |                        |
[0:3999999,0:9] |                        |
Curl
select          | http_compression: true | http_compression: false
[32000:35000,6] |                        |
[0:800000,6]    |                        |
[0:3999999,0:9] |                        |
Regarding the results shown:
- time_starttransfer: the time, in seconds, from the start until the first byte was just about to be transferred. This includes time_pretransfer and also the time the server needed to calculate the result.
- The equivalent of Postman's Download time is: time_total - time_starttransfer.
- The real download speed is: size_download / (time_total - time_starttransfer).
- These values are obtained with the -w option of the Curl program (see the curl documentation).
For example:
- With http_compression: true, the real download speeds are 16 MB/s, 213 MB/s, and 156 MB/s (one value per request, in the order listed above).
- With http_compression: false, the real download speeds are 69 MB/s, 36 MB/s, and 150 MB/s.
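As a sanity check of the formula, a short sketch with the full-dataset numbers (the byte count follows from the select; the time_starttransfer value is an assumption for illustration only, not a measurement):

# Illustrative check of: real speed = size_download / (time_total - time_starttransfer)
size_download = 3_999_999 * 9 * 8   # bytes returned by select=[0:3999999,0:9], ~288 MB
time_total = 3.3                    # roughly the full-dataset Curl time from the Discussion
time_starttransfer = 1.4            # assumed value for illustration only
speed_MBps = size_download / (time_total - time_starttransfer) / 1e6
print(round(speed_MBps))            # ~152 MB/s, in line with the measured ~150 MB/s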
Discussion
It is important to note that the obtained results are not constant: there are slight differences if the tests are repeated. However, the following facts can be discussed:
- Accessing local data through the local HSDS service is much slower than using a local HDF5 file. The fastest HSDS read method (Curl) requires around 3.3 seconds to extract the full dataset, whereas accessing the HDF5 data directly requires 0.2 seconds.
- H5pyd and Postman show a high speed penalty if http_compression: true. If it is set to false, the full dataset read operation requires around 3.6 seconds. However, if http_compression: true, the elapsed time is around 20 seconds.
- When using Curl, there is no speed penalty if http_compression: true. Note: the binary data obtained in the output file is the same in both cases (verified; a sketch of such a check is shown below).
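A minimal sketch of such a binary comparison, assuming ~ expands to /home/bruce and using the file paths from the Curl commands above:

import numpy as np
import h5py

# Interpret the raw octet-stream response as float64 values and compare it
# against the same selection read from the local HDF5 file (both files were
# written from the same randomData array).
downloaded = np.fromfile("/home/bruce/00-DATA/tmp/testDownloadedData03.txt",
                         dtype="float64").reshape(3999999, 9)  # select=[0:3999999,0:9]
with h5py.File("/home/bruce/00-DATA/HDF5/testFile_fromPython.h5", "r") as f:
    reference = f["myDataset"][0:3999999, 0:9]
print(np.array_equal(downloaded, reference))  # True when both reads agree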