Upload 55 Gb hdf5 file to Kita Lab server

kerim.khemraev · September 7, 2022, 8:42pm

Hi,

I’ve just got access to Kita Lab for testing (thank you).

If I understood correctly I should be able to upload up to 100 Gb to the server and 10 Gb for local storage.
I tried to click on “up arrow” button to choose and upload file but after 10 Gb uploaded I got an error that there is no space left.

Is there a way to upload my 55 Gb file to Kita Lab server to check the IO perfomance?

jreadey · September 7, 2022, 10:45pm

Hi,

Right, there’s a 10 GB limit for local storage. Can you copy your file to S3? (preferably in the us-west-2 region)?
Once it’s on S3, you can run the following from a hdflab terminarl:
$ hsload s3://mybucket/myfilepath.h5. /home/my_homedir/
Or:
$ hsload --link s3://mybucket/myfilepath.h5. /home/my_homedir/

The later will just copy the metadata and leave the chunk data in the original file.

kerim.khemraev · September 7, 2022, 10:51pm

Thank you,

But I don’t have an AWS account and I’m unable to create it now.

I’m trying to split h5 file on my PC to < 10 Gb parts, upload them to Kita Lab, then to server, and after that I’m going to use h5pyd to merge these parts in one file. Will that work?

jreadey · September 8, 2022, 2:53am

Yep, that will work (if it a bit tedious). You’ll need to split the file using some sort of h5py script and then do the reverse in HDFLab using h5pyd.

Let us know how it goes!

kerim.khemraev · September 8, 2022, 6:43pm

I could upload my the data on the server with the proposed way.

Now I’m looking for the solution of my problem described in other topic.

I thought that HSDS approach could improve perfomance but with the opportunities provided by the Kita Lab (4 nodes) it took 250 seconds to read 500 arbitrary rows (on my personal PC I read 5000 rows in 90 seconds as posted in the question, probably this is limited by my HDD disk). But the goal is to do that in about 1 second if possible.

Does HSDS spread the data among multiple nodes so I can expect that reading selected data may be faster than speed limits of a single HDD disk?
Is there any preffered way to solve my task?

For now I get an exception when trying to read each 5000th row from 25 millions rows data:

arr=dset[0:25000000:5000, :]
---------------------------------------------------------------------------
timeout                                   Traceback (most recent call last)
/opt/conda/lib/python3.9/site-packages/urllib3/response.py in _error_catcher(self)
    437             try:
--> 438                 yield
    439 

/opt/conda/lib/python3.9/site-packages/urllib3/response.py in read(self, amt, decode_content, cache_content)
    518                 cache_content = False
--> 519                 data = self._fp.read(amt) if not fp_closed else b""
    520                 if (

/opt/conda/lib/python3.9/http/client.py in read(self, amt)
    457             b = bytearray(amt)
--> 458             n = self.readinto(b)
    459             return memoryview(b)[:n].tobytes()

/opt/conda/lib/python3.9/http/client.py in readinto(self, b)
    501         # (for example, reading in 1k chunks)
--> 502         n = self.fp.readinto(b)
    503         if not n and b:

/opt/conda/lib/python3.9/socket.py in readinto(self, b)
    703             try:
--> 704                 return self._sock.recv_into(b)
    705             except timeout:

timeout: timed out

During handling of the above exception, another exception occurred:

ReadTimeoutError                          Traceback (most recent call last)
/opt/conda/lib/python3.9/site-packages/requests/models.py in generate()
    752                 try:
--> 753                     for chunk in self.raw.stream(chunk_size, decode_content=True):
    754                         yield chunk

/opt/conda/lib/python3.9/site-packages/urllib3/response.py in stream(self, amt, decode_content)
    575             while not is_fp_closed(self._fp):
--> 576                 data = self.read(amt=amt, decode_content=decode_content)
    577 

/opt/conda/lib/python3.9/site-packages/urllib3/response.py in read(self, amt, decode_content, cache_content)
    540                         # Content-Length are caught.
--> 541                         raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
    542 

/opt/conda/lib/python3.9/contextlib.py in __exit__(self, type, value, traceback)
    134             try:
--> 135                 self.gen.throw(type, value, traceback)
    136             except StopIteration as exc:

/opt/conda/lib/python3.9/site-packages/urllib3/response.py in _error_catcher(self)
    442                 # there is yet no clean way to get at it from this context.
--> 443                 raise ReadTimeoutError(self._pool, None, "Read timed out.")
    444 

ReadTimeoutError: HTTPConnectionPool(host='hsds.hdflab.svc.cluster.local', port=80): Read timed out.

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
/opt/conda/lib/python3.9/site-packages/h5pyd/_hl/dataset.py in __getitem__(self, args, new_dtype)
   1178                     try:
-> 1179                         rsp = self.GET(req, params=params, format="binary")
   1180                     except IOError as ioe:

/opt/conda/lib/python3.9/site-packages/h5pyd/_hl/base.py in GET(self, req, params, use_cache, format)
    972             downloaded_bytes = 0
--> 973             for http_chunk in rsp.iter_content(chunk_size=HTTP_CHUNK_SIZE):
    974                 if http_chunk:  # filter out keep alive chunks

/opt/conda/lib/python3.9/site-packages/requests/models.py in generate()
    759                 except ReadTimeoutError as e:
--> 760                     raise ConnectionError(e)
    761             else:

ConnectionError: HTTPConnectionPool(host='hsds.hdflab.svc.cluster.local', port=80): Read timed out.

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-12-ad1fd6235f39> in <module>
      1 # start = time.time()
----> 2 arr=dset[0:25000000:5000, :]
      3 # end = time.time()
      4 # print(f'elapsed time: {end - start}')
      5 print(f"array shape: {arr.shape}")

/opt/conda/lib/python3.9/site-packages/h5pyd/_hl/dataset.py in __getitem__(self, args, new_dtype)
   1186                             break
   1187                         else:
-> 1188                             raise IOError(f"Error retrieving data: {ioe.errno}")
   1189                     if type(rsp) in (bytes, bytearray):
   1190                         # got binary response

OSError: Error retrieving data: None

jreadey · September 9, 2022, 3:40am

Yes - HSDS utilizes parallelism but it may be that in your case you are limited more by disk bandwidth (or S3 bandwidth for HDF Lab) than CPU.

I concocted a test case do read n rows from a table contiguously, then with a stride, and then via random indices. You can find it here: hsds/tests/perf/table/table_read.py at master · HDFGroup/hsds · GitHub.

If I run this on a 2x.large instance (4 cores, 32gb ram) I get the following results with 2, 4, and 8 hsds nodes:

2 node
------
 python table_read.py --stride=5000  hdf5://shared/ghcn/ghcn.h5
num_rows: 3018181465
consecutive read with random start [1715557085:1715562085]: -89.00, 737.00, 83.46, 0.34 s
strided read with random start index [2150528803:2175528803:5000]: -481.00, 9195.00, 87.95, 3.86 s
read with random indices [[n0,n1,...,nx]]: -428.00, 5334.00, 73.74, 62.51 s

4 node
------
$ python table_read.py --stride=5000  hdf5://shared/ghcn/ghcn.h5
num_rows: 3018181465
consecutive read with random start [2710260771:2710265771]: -178.00, 486.00, 62.93, 0.31 s
strided read with random start index [2623168689:2648168689:5000]: -555.00, 9999.00, 96.10, 2.58 s
read with random indices [[n0,n1,...,nx]]: -534.00, 4775.00, 76.31, 39.31 s

8 node
------
$ python table_read.py --stride=5000  hdf5://shared/ghcn/ghcn.h5
num_rows: 3018181465
consecutive read with random start [1177492441:1177497441]: -150.00, 1018.00, 84.34, 0.18 s
strided read with random start index [2336370578:2361370578:5000]: -423.00, 7696.00, 100.43, 2.12 s
read with random indices [[n0,n1,...,nx]]: -512.00, 9652.00, 75.14, 31.95 s

E…g. the random index read took:
2 node: 62.51 s
4 node: 39.31 s
8 node: 31.95 s

i.e. diminishing returns after more than two nodes (likely due to all processes waiting on s3)

In this case there is a fair bit a I/o that’s going on. If every index is in a different chunk, that 5000 x 2mb = 10GB of data being read.

For the data you loaded into HDF Lab, could you make it public read? Then I can do some testing on the actual data. Just do a: $ hsacl <filepath> +r default

What hardware options do you have if you really need to get down to 1 sec? Run on a machine with gobs of ram? Run in a cluster?

kerim.khemraev · September 9, 2022, 9:09am

Thank you very much for testing! The results are very promising.

Could you please tell: hdf5://shared/ghcn/ghcn.h5 has 3018181465 rows, but how much columns there are? and the chuksize: is it compressed?

Yes, I just did this:

$ hsacl /home/kerim.khemraev/seis_chunk_by_trace.h5 +r default
       username     create      read    update    delete   readACL  updateACL 
--------------------------------------------------------------------------------
        default      False      True     False     False     False     False

My data has 25 millions rows, 10 thousands columns. Each row is chunked and compressed.

I think the better way is to run it on cluster (probably small cluster is possible, maybe 10-20 nodes) rather than on a single machine as HDD limitation won’t allow to achieve best perfomance I think.
So generally it is not decided yet. There is a web-server app is planned and I’m trying to understand how to store data to efficiently work with it (IO perfomance in first place). I’m testing it with 100 Gb data (25 millions rows * 1000 cols of float numbers). The main problem is the case when we are retrieving data with stride (each 5000 row in this case) from whole dataset and when reading about 5000 arbitrary rows from 100 Gb data.

Probably the chunking by rows I made for my data is not a best solution. If we think of this 2D matrix (25M * 1000) as a cube of size 5000 * 5000 * 1000. Thus I’m trying to test retrieving contigous 5000*1000 slice from that cube (say from 0 to 4999 rows, or from 5000 to 9999 rows etc), or with stride (stride is equal to 5000 rows in this case) or arbibrary rows (5000 arbitrary rows is pretty real). So I’m vertically cutting a cube wich has 5000 points along X dir, 5000 points along Y dir, and 1000 along Z dir.

jreadey · September 9, 2022, 2:16pm

Great! - I can see the dataset on HDF Lab now.

I notice the chunk layout is (97657, 4). If you are always going to reading entire rows, I’d suggest something like: (500, 1000). Basically for the type of selections you are doing, you want the number of chunks hit to be relatively small (but 1/chunk per selection is not optimal either - you want enough chunks that the HSDS parallelism has a chance to work).

I don’t think 2D - vs 3D will really matter in the end - do whatever best fits your application.

If the “hot spot” of data accessed is something reasonable (say less than the total amount of memory available in the cluster), HDD vs SDD vs S3 shouldn’t matter after the chunk cache is warmed up. You can have all data in ram and should have much faster access.

BTW, I’ll be hosting “Call the Doctor” on Sept 13 (Call the Doctor - Weekly HDF Clinic - #31 by lori.cooper). If that fits your schedule it would be great to have you register and we can have an informal discussion (and perhaps other participants will have some ideas as well).

kerim.khemraev · September 9, 2022, 6:48pm

Oh, I accidently created a dataset without chunking and it seems HSDS automatically chunked it.
Thank you for pointing to that.

I reuploaded the data and now the chunking is [512, 1000].
I ran the test and it took 170 seconds to read each 5000th row. Reading first 5000 rows in 0.7 seconds.

As it was done before the data is available at /home/kerim.khemraev/seis_chunk_by_trace.h5.

Thank you for invtation! I will register and prepare questions.

kerim.khemraev · September 13, 2022, 4:12pm

Just in case I reuploaded the file and it is under the path:

/home/kerim.khemraev/seis_chunk_by_trace.h5

Now the dataset is 5000 rows, 5000 columns and 1000 sheets.
The chunk size is 64x64x64.
According to my observations this storage structure is more optimazed for IO perfomance than chunking by rows.

jreadey · September 13, 2022, 4:42pm

How did your updated file work with the HDF5 Library?

I’ve been testing with a 2D version: /home/jreadey/seis_2d.h5 – just a little less confusing dealing with just two dimensions. Dataset size is: (25MM, 1000) and chunk shape is (524,1000). If you always intend to read all the columns, then you’ll want the chunk columns to equal the dataset columns.

I have a test case here: https://github.com/HDFGroup/hsds/blob/master/tests/perf/read/read2d.py that can be used with any 2d dataset. The test case reads a contiguous hyperslab, followed by a selection with stride, followed by a selection with random rows.

kerim.khemraev · September 13, 2022, 5:11pm

Thank you for testing.

To make things a little bit clearer: I was looking for way to storing data in optimal layout to achieve highest IO when reading arbitrary rows and arbitrary columns.

I started topic with thoughts that to achieve highest perfomance I need to store the same data in two instances: first is 25 millions rows by 1000 columns, and the second dataset is transposed so that it would have 1000 rows and 25 millions of columns. It is not convenient but I was hoping that it would work fast enough. Thus I could compress the data and read data mostly by rows anyway (either from first dataset or from its transposed instance). And this is still actual.

Yesterday I’ve found some information aboud ZFP format that may be used for my situation and I tested it (for simplicity it is 3D dataset now with size 5000 * 5000 * 1000).

On my local PC reading slice at :
constant row (it cuts all the columns and sheets) takes about 12.7 seconds.
constant column (it cuts all the rows and sheets) takes about 17 seconds.
constant sheet (it cuts all the rows and columns) takes about 72.8 seconds.

On Kita Lab with 4 nodes I get the results 21.7 sec, 21.3 sec and 108.5 sec respectively.
The data is compressed in both cases thus the compressed data takes about 53 Gb (almost twice times less size).

Ok I think we will disscuss that online on the meeting soon in few hours.

On Kita Lab my test looks as follows:

# INLINE is a slice at constant row through all the columns and all sheets
# XLINE is a slice at constant column through all the rows and all sheets
# TIMESLICE is a slice at constant sheet through all the rows and all columns

import h5pyd
import time
import math

f = h5pyd.File('/home/kerim.khemraev/seis_chunk_by_block.h5', 'r')
dset = f.get('data')

nIL = dset.shape[0]
nXL = dset.shape[1]
nSamp = dset.shape[2]
print(f'nIL: {nIL}')
print(f'nXL: {nXL}')
print(f'nSamp: {nSamp}')

start = time.time()
arr=dset[0,:,:]
end = time.time()
print(f'read INLINE elapsed time: {end - start}')

start = time.time()
arr=dset[:,0,:]
end = time.time()
print(f'read XNLINE elapsed time: {end - start}')

start = time.time()
arr=dset[:,:,0]
end = time.time()
print(f'read TIMESLICE elapsed time: {end - start}')

# assume that nIL == nXL
n_chunks_il = math.ceil(dset.shape[0]/dset.chunks[0])
start = time.time()
for chunk in range(0,n_chunks_il):
    from_trc = chunk*dset.chunks[0]
    to_trc = (chunk+1)*dset.chunks[0]
    arr=dset[from_trc:to_trc,from_trc:to_trc,:]
end = time.time()
print(f'read DIAGONAL BLOCKS elapsed time: {end - start}')

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

Upload 55 Gb hdf5 file to Kita Lab server