How can I recover (or understand the h5debug output of) my HDF5 file?

I have an HDF5 file that is so large I have to write it to my home file server (4.04 TB, according to macOS’s Finder). It is a collection of logits that takes several hours to compute, and after calculating the last chunk of data, the run failed in a bad way.

I now see:

h5debug /Volumes/MacBackup-1/gguf/baseline_logits.hdf5

Reading signature at address 0 (rel)
File Super Block...
File name (as opened):                             /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
File name (after resolving symlinks):              /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
File access flags                                  0x00000000
File open reference count:                         1
Address of super block:                            0 (abs)
Size of userblock:                                 0 bytes
Superblock version number:                         0
Free list version number:                          0
Root group symbol table entry version number:      0
Shared header version number:                      0
Size of file offsets (haddr_t type):               8 bytes
Size of file lengths (hsize_t type):               8 bytes
Symbol table leaf node 1/2 rank:                   4
Symbol table internal node 1/2 rank:               16
Indexed storage internal node 1/2 rank:            32
File status flags:                                 0x00
Superblock extension address:                      18446744073709551615 (rel)
Shared object header message table address:        18446744073709551615 (rel)
Shared object header message version number:       0
Number of shared object header message indexes:    0
Address of driver information block:               18446744073709551615 (rel)
Root group symbol table entry:
   Name offset into private heap:                  0
   Object header address:                          96
   Cache info type:                                Symbol Table
   Cached entry information:
      B-tree address:                              136
      Heap address:                                680
Error in closing file!
HDF5: infinite loop closing library
      L,T_top,F,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL
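
If I understand the superblock dump, the 18446744073709551615 addresses are just HADDR_UNDEF (all ones, 2**64 - 1), meaning "no address recorded", which is normal for a version-0 superblock, so those fields by themselves don't look like the problem. A quick sanity check, assuming h5py is installed and can reach the share, is to try opening the file read-only and listing whatever objects are still reachable:

import h5py

PATH = "/Volumes/MacBackup-1/gguf/baseline_logits.hdf5"

try:
    # Read-only open; if the metadata is intact this should work even though
    # the writer crashed before it could close the file.
    with h5py.File(PATH, "r") as f:
        f.visit(print)   # print the name of every reachable group/dataset
except OSError as err:
    print("h5py could not open the file:", err)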

I also ran this, which I saw on another thread from a couple of years ago:

od -c /Volumes/MacBackup-1/gguf/baseline_logits_repacked.hdf5 | head -n 50
0000000  211   H   D   F  \r  \n 032  \n  \0  \0  \0  \0  \0  \b  \b  \0
0000020  004  \0 020  \0 001  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000040  377 377 377 377 377 377 377 377  \0  \b  \0  \0  \0  \0  \0  \0
0000060  377 377 377 377 377 377 377 377  \0  \0  \0  \0  \0  \0  \0  \0
0000100    `  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
0000120  210  \0  \0  \0  \0  \0  \0  \0 250 002  \0  \0  \0  \0  \0  \0
0000140   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0022100  364 316 355 300 372 351  \a 301   ܌  ** 261   @ 256 312 355 300
0022120  016 334 355 300 300   e  \0 301 357 317 355 300   U 317 355 300
0022140  235 316 355 300   < 317 355 300 265 317 355 300   & 317 355 300
0022160  243 320 355 300 254 317 355 300  \b 317 355 300   ) 320 355 300
0022200  344 332 355 300 217 316 355 300 232 317 355 300 203 332 355 300
0022220  317 321 355 300 022 320 355 300 213 317 355 300 225 317 355 300
0022240  271 320 355 300   w 320 355 300   o 317 355 300 367 332 355 300
0022260    l 320 355 300   \ 317 355 300 333 315 355 300 234 317 355 300
0022300  310 315 355 300 363 317 355 300 340 316 355 300 257 315 355 300
0022320  262 320 355 300 033 330 355 300 357 316 355 300   + 330 355 300
0022340  224 326 355 300   E 315 355 300 027 320 355 300 376 315 355 300
0022360    [ 320 355 300   J 315 355 300 262 315 355 300   T 330 355 300
0022400  324 317 355 300 257 316 355 300   a 316 355 300 216 332 355 300
0022420    J 316 355 300 025 322 355 300 313 330 355 300   d 320 355 300
0022440    J 330 355 300 265 316 355 300 030 320 355 300 323 315 355 300
0022460  266 317 355 300 227 317 355 300   4 316 355 300 360 330 355 300
0022500  325 316 355 300   c 320 355 300 314 316 355 300   G 317 355 300
0022520    ' 331 355 300   L 317 355 300 221 320 355 300 311 317 355 300
0022540  036 317 355 300 027 316 355 300 222 316 355 300 262 320 355 300
0022560  315 317 355 300   Z 316 355 300 300 332 355 300 350 317 355 300
0022600  006 321 355 300 376 315 355 300 276 316 355 300   o 313 355 300
0022620    G 317 355 300   g 317 355 300 031 320 355 300  \f 316 355 300
0022640  033 315 355 300 341 315 355 300 332 330 355 300   . 333 355 300
0022660    W 317 355 300 020 331 355 300 026 320 355 300 303 316 355 300
0022700    ; 316 355 300   G 321 355 300 336 316 355 300 350 326 355 300
0022720    a 320 355 300 346 314 355 300 364 320 355 300 314 317 355 300
0022740  277   Z 036 277 321 346 266   @ 205 001 243   ? 324   R 035   @
0022760  317   l   5   @   K 212   *   @ 265 302 354   ?   \   4   e   @
0023000  215   \ 336   ?  苻  **  **   ? 204 214 002   ? 310 307   U   @
0023020    z 031   D   ?   + 252 252   ? 203   H  \n   > 237   $ 003   @
0023040  036   w   T   > 303 347 347   >   U   d 315 275 243   w   {   ?
0023060    3 264 327   ?   1 350   S   ?   E 210   z   ? 313 333 021   @
0023100  366   B 336 300  \0 202 327 300 207 260 272 300   S 215 344 300
0023120    ] 317 316 300   a 303 006   @ 360 277   W 300 006   Z 263 300
0023140    K 274 321 300 250  \r 210 300   2   ) 253 300   E 001 276 300
0023160    l 335 302 300 017 256 314 300 030   + 362 300 352   V 311 300
0023200  235   \   :   @ 272 370 342   @ 335 267 304 300   !   #  \a 300
0023220  033 030 251 300   r 226 243 300 333   ܲ  ** 300   ] 324 316 300
0023240    \   @ 320 300 264 203 347 300   %   K 353 300 036  \n 263 300
0023260  270  \a 305 300   , 020 341 300   , 373 341 300   %  \a 251 300
0023300  310   L 313 300 314   $ 354 300 003   t 201 300 351 025 270 300
0023320    g   ȹ  ** 300   ҙ  ** 310 300   E   ' 351 300 267 316 355 300

It’s not clear to me from that debug output what is actually wrong with the file. In terms of real size, I think it is less than 4 TB:

ls -la  /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
-rwx------@ 1 macdev  staff   3.7T Nov 12 12:21 /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
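
(Those two numbers agree, by the way: Finder reports decimal terabytes while ls -h reports binary tebibytes, so 4.04 TB and 3.7T describe roughly the same byte count.)

# My arithmetic, not tool output: convert Finder's decimal TB to binary TiB.
finder_bytes = 4.04e12
print(finder_bytes / 2**40)   # about 3.67, which ls -h rounds to "3.7T"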

Here’s my script’s log when it failed; it was not a very specific error message:

[471] 114207.41 ms [472] 24712.48 ms [473] 120010.91 ms [474] 134073.39 ms
INFO - Processed 4 chunks
INFO - Final file size: 3832472.77 MB
Running from 475 to 478
INFO - generate_logits starting (version 0.5.3)
INFO - Loaded precomputed tokens from /Users/Shared/Public/huggingface/salamandra-2b-instruct/imatrix/oscar/calibration-dataset.txt.tokens.npy
INFO - Processing chunks from 475 to 478
INFO - Estimated runtime: 6.11 minutes for 3 remaining chunks
[475] 122266.14 ms [476] 27550.59 ms ERROR - Unexpected error occurred: Can't decrement id ref count (unable to close file, errno = 9, error message = 'Bad file descriptor')
Error occurred. Exiting.

That was just as the file was exceeding 4 TB (depending on how you count), which seems suspicious, but it is being written (from a Mac) to a Windows 11 machine with a 16 TB NTFS disk that had 13 TB free before this started. My SMB info says I am connected with SMB 3.1.1 and LARGE_FILE_SUPPORTED is TRUE, which I would hope gives me the full 16 TB available to NTFS.
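
To rule out a hard 4 TiB ceiling somewhere in the SMB/NTFS path, one probe I could try (just an idea, assuming the share allows it) is to extend a throwaway file past that boundary without actually writing data:

import os

# Hypothetical probe, not something from the logs: if some layer in the
# SMB/NTFS path caps files at 4 TiB, extending a scratch file past that
# boundary should fail. truncate() only sets the length, so no data is written.
TEST_PATH = "/Volumes/MacBackup-1/gguf/large_file_probe.bin"

with open(TEST_PATH, "wb") as f:
    f.truncate(4 * 1024**4 + 1)   # just past 4 TiB

print("reported size:", os.path.getsize(TEST_PATH), "bytes")
os.remove(TEST_PATH)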

One other point about the size: I was expecting the file to be about twice this size, which means that if this were a large-file issue I probably should have seen errors much earlier. I don’t see any error messages before this last chunk.

How can I recover (or understand the h5debug output of) my HDF5 file? Currently I’m trying to repack it, but it’s very slow over the network; I could regenerate the entire dataset in the time it takes.
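
For reference, a repack essentially copies every reachable object into a fresh file with clean metadata. A rough sketch with h5py of what that amounts to (not the exact command I’m running):

import h5py

SRC = "/Volumes/MacBackup-1/gguf/baseline_logits.hdf5"
DST = "/Volumes/MacBackup-1/gguf/baseline_logits_repacked.hdf5"

with h5py.File(SRC, "r") as src, h5py.File(DST, "w") as dst:
    for name in src:          # copy each top-level object (groups copy recursively)
        src.copy(name, dst)
    for key, value in src.attrs.items():   # carry over root-group attributes
        dst.attrs[key] = value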

Repacking seems to have done the trick. It still seems to me like this is way overkill, though; if the file had been much larger, repacking would not have been an option. There has to be a way to reset the header without rewriting the entire file.


Hi, @roberto.tomas.cuenta!

First of all, thank you for sharing an interesting problem!

I’m curious about your Salamandra AI workflow and how you use HDF5 for logits:

It would be helpful if you could share your Python script so that we can check how to optimize the I/O operations in your AI workflow with HDF5.
I’m particularly interested in which software stack you’re using to save data in HDF5.

In general, writing HDF5 over a network filesystem like SMB/NFS is not ideal.

Is your Mac much more powerful and better suited to the AI workflow than the Windows machine?
Is that the main reason you mount the Windows 16 TB drive via SMB under /Volumes/MacBackup-1/?

Regards,

Hey, I’m actually working out the kinks in a library of tools I wrote to make the whole process of using llama.cpp’s importance matrix well-grounded: GitHub - robbiemu/llama-gguf-optimize: Scripts and tools for optimizing quantizations in llama.cpp with GGUF imatrices. It includes scripts to process the dataset; the script in question here is “generate_logits.py”. Another script that uses the HDF5 format is “compare_logits.py” (I was already using HDF5 and thought it would be simpler to keep everything the same). “kl_d_bench.py” streamlines using the two together, including a bit for reusing chunks, but it is probably less interesting in terms of HDF5 processing. Since I’m beta-testing it, the latest changes are in the user-guide branch (but don’t expect the tests to pass there; the raw code is accurate). There’s a NotebookLM podcast about the repo embedded in the README; it might help you get oriented.
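
In case the write pattern matters: this is not the real generate_logits.py, just a stripped-down sketch (with placeholder sizes) of the general h5py pattern of appending logit chunks to a resizable dataset:

import numpy as np
import h5py

# Not the real generate_logits.py: a minimal sketch of appending logit chunks
# to a resizable dataset. N_ROWS and VOCAB are placeholders.
N_ROWS, VOCAB = 64, 32_000

with h5py.File("baseline_logits.hdf5", "a") as f:
    if "logits" not in f:
        f.create_dataset("logits", shape=(0, VOCAB), maxshape=(None, VOCAB),
                         chunks=(N_ROWS, VOCAB), dtype="f4")
    dset = f["logits"]
    for _ in range(3):                                  # one work chunk per iteration
        block = np.zeros((N_ROWS, VOCAB), dtype="f4")   # stand-in for model logits
        start = dset.shape[0]
        dset.resize(start + N_ROWS, axis=0)
        dset[start:] = block
        f.flush()   # push buffered metadata to disk after every chunk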

I’m doing it over the network, even though that limits disk speed to ~100 MB/s instead of the disk’s 200 MB/s, because I don’t have the VRAM on any other machine and my Mac doesn’t have the storage.
