I have an HDF5 file that is so large I have to write it to my home fileserver (4.04 TB, according to macOS's Finder). It is a collection of logits that takes several hours to calculate, and for some reason, after calculating the last chunk of data, the write failed badly.
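For context, the writing pattern is roughly the following. This is a simplified sketch, not the real script: the dataset name, shapes, dtype, and the compute_chunk_logits stand-in are all placeholders, and I'm assuming h5py for illustration.

import h5py
import numpy as np

# Placeholder dimensions; the real dataset is terabytes of float logits.
n_chunks, chunk_len, vocab_size = 478, 512, 256

def compute_chunk_logits(i):
    # Stand-in for the hours of model inference behind each chunk.
    return np.random.rand(chunk_len, vocab_size).astype(np.float32)

with h5py.File("baseline_logits.hdf5", "a") as f:
    dset = f.require_dataset("logits",
                             shape=(n_chunks, chunk_len, vocab_size),
                             dtype="f4",
                             chunks=(1, chunk_len, vocab_size))
    for i in range(475, 478):             # e.g. "Running from 475 to 478"
        dset[i] = compute_chunk_logits(i)
        f.flush()                         # flush each chunk to the SMB share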
I now see:
h5debug /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
Reading signature at address 0 (rel)
File Super Block...
File name (as opened): /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
File name (after resolving symlinks): /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
File access flags 0x00000000
File open reference count: 1
Address of super block: 0 (abs)
Size of userblock: 0 bytes
Superblock version number: 0
Free list version number: 0
Root group symbol table entry version number: 0
Shared header version number: 0
Size of file offsets (haddr_t type): 8 bytes
Size of file lengths (hsize_t type): 8 bytes
Symbol table leaf node 1/2 rank: 4
Symbol table internal node 1/2 rank: 16
Indexed storage internal node 1/2 rank: 32
File status flags: 0x00
Superblock extension address: 18446744073709551615 (rel)
Shared object header message table address: 18446744073709551615 (rel)
Shared object header message version number: 0
Number of shared object header message indexes: 0
Address of driver information block: 18446744073709551615 (rel)
Root group symbol table entry:
Name offset into private heap: 0
Object header address: 96
Cache info type: Symbol Table
Cached entry information:
B-tree address: 136
Heap address: 680
Error in closing file!
HDF5: infinite loop closing library
L,T_top,F,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL
I also ran this, which I saw on another thread from a couple of years ago:
od -c /Volumes/MacBackup-1/gguf/baseline_logits_repacked.hdf5 | head -n 50
0000000 211 H D F \r \n 032 \n \0 \0 \0 \0 \0 \b \b \0
0000020 004 \0 020 \0 001 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000040 377 377 377 377 377 377 377 377 \0 \b \0 \0 \0 \0 \0 \0
0000060 377 377 377 377 377 377 377 377 \0 \0 \0 \0 \0 \0 \0 \0
0000100 ` \0 \0 \0 \0 \0 \0 \0 001 \0 \0 \0 \0 \0 \0 \0
0000120 210 \0 \0 \0 \0 \0 \0 \0 250 002 \0 \0 \0 \0 \0 \0
0000140 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
0022100 364 316 355 300 372 351 \a 301 ܌ ** 261 @ 256 312 355 300
0022120 016 334 355 300 300 e \0 301 357 317 355 300 U 317 355 300
0022140 235 316 355 300 < 317 355 300 265 317 355 300 & 317 355 300
0022160 243 320 355 300 254 317 355 300 \b 317 355 300 ) 320 355 300
0022200 344 332 355 300 217 316 355 300 232 317 355 300 203 332 355 300
0022220 317 321 355 300 022 320 355 300 213 317 355 300 225 317 355 300
0022240 271 320 355 300 w 320 355 300 o 317 355 300 367 332 355 300
0022260 l 320 355 300 \ 317 355 300 333 315 355 300 234 317 355 300
0022300 310 315 355 300 363 317 355 300 340 316 355 300 257 315 355 300
0022320 262 320 355 300 033 330 355 300 357 316 355 300 + 330 355 300
0022340 224 326 355 300 E 315 355 300 027 320 355 300 376 315 355 300
0022360 [ 320 355 300 J 315 355 300 262 315 355 300 T 330 355 300
0022400 324 317 355 300 257 316 355 300 a 316 355 300 216 332 355 300
0022420 J 316 355 300 025 322 355 300 313 330 355 300 d 320 355 300
0022440 J 330 355 300 265 316 355 300 030 320 355 300 323 315 355 300
0022460 266 317 355 300 227 317 355 300 4 316 355 300 360 330 355 300
0022500 325 316 355 300 c 320 355 300 314 316 355 300 G 317 355 300
0022520 ' 331 355 300 L 317 355 300 221 320 355 300 311 317 355 300
0022540 036 317 355 300 027 316 355 300 222 316 355 300 262 320 355 300
0022560 315 317 355 300 Z 316 355 300 300 332 355 300 350 317 355 300
0022600 006 321 355 300 376 315 355 300 276 316 355 300 o 313 355 300
0022620 G 317 355 300 g 317 355 300 031 320 355 300 \f 316 355 300
0022640 033 315 355 300 341 315 355 300 332 330 355 300 . 333 355 300
0022660 W 317 355 300 020 331 355 300 026 320 355 300 303 316 355 300
0022700 ; 316 355 300 G 321 355 300 336 316 355 300 350 326 355 300
0022720 a 320 355 300 346 314 355 300 364 320 355 300 314 317 355 300
0022740 277 Z 036 277 321 346 266 @ 205 001 243 ? 324 R 035 @
0022760 317 l 5 @ K 212 * @ 265 302 354 ? \ 4 e @
0023000 215 \ 336 ? 苻 ** ** ? 204 214 002 ? 310 307 U @
0023020 z 031 D ? + 252 252 ? 203 H \n > 237 $ 003 @
0023040 036 w T > 303 347 347 > U d 315 275 243 w { ?
0023060 3 264 327 ? 1 350 S ? E 210 z ? 313 333 021 @
0023100 366 B 336 300 \0 202 327 300 207 260 272 300 S 215 344 300
0023120 ] 317 316 300 a 303 006 @ 360 277 W 300 006 Z 263 300
0023140 K 274 321 300 250 \r 210 300 2 ) 253 300 E 001 276 300
0023160 l 335 302 300 017 256 314 300 030 + 362 300 352 V 311 300
0023200 235 \ : @ 272 370 342 @ 335 267 304 300 ! # \a 300
0023220 033 030 251 300 r 226 243 300 333 ܲ ** 300 ] 324 316 300
0023240 \ @ 320 300 264 203 347 300 % K 353 300 036 \n 263 300
0023260 270 \a 305 300 , 020 341 300 , 373 341 300 % \a 251 300
0023300 310 L 313 300 314 $ 354 300 003 t 201 300 351 025 270 300
0023320 g ȹ ** 300 ҙ ** 310 300 E ' 351 300 267 316 355 300
I am not clear on what is actually wrong with it from that debug output. The one thing I do recognize is that 18446744073709551615 is just 2^64 − 1 (0xFFFFFFFFFFFFFFFF), which I understand to be HDF5's undefined-address sentinel, so I assume those fields are simply unset rather than corrupted. In terms of on-disk size, I think it is less than 4 TB:
ls -la /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
-rwx------@ 1 macdev staff 3.7T Nov 12 12:21 /Volumes/MacBackup-1/gguf/baseline_logits.hdf5
Here’s my script’s log from when it failed; the error message was not very specific:
[471] 114207.41 ms [472] 24712.48 ms [473] 120010.91 ms [474] 134073.39 ms
INFO - Processed 4 chunks
INFO - Final file size: 3832472.77 MB
Running from 475 to 478
INFO - generate_logits starting (version 0.5.3)
INFO - Loaded precomputed tokens from /Users/Shared/Public/huggingface/salamandra-2b-instruct/imatrix/oscar/calibration-dataset.txt.tokens.npy
INFO - Processing chunks from 475 to 478
INFO - Estimated runtime: 6.11 minutes for 3 remaining chunks
[475] 122266.14 ms [476] 27550.59 ms ERROR - Unexpected error occurred: Can't decrement id ref count (unable to close file, errno = 9, error message = 'Bad file descriptor')
Error occurred. Exiting.
That was just as the file was crossing 4 TB (depending on how you count), which seems suspicious, but the script writes (from a Mac) to a Windows 11 machine with a 16 TB NTFS disk that had 13 TB free before this started. My SMB info says I am connected with smb_3.1.1 and LARGE_FILE_SUPPORTED TRUE, which I would hope gives me the full 16 TB that NTFS can address.
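For what it's worth, the various size figures line up if the discrepancy is just decimal TB vs binary TiB (treating the script's "MB" as MiB is my assumption):

# Finder reports decimal TB; ls -la reports binary TiB.
print(3.7 * 1024**4 / 1e12)     # ls's "3.7T" in decimal TB -> ~4.07, vs Finder's 4.04
print(3832472.77 / 1024**2)     # script's "MB" (as MiB) in TiB -> ~3.66, matches ls

So the file is the same size everywhere; whether it had "exceeded 4 TB" depends purely on the units.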
One other point about the size: I was expecting the file to end up about twice this size, which means that if this were a large-file issue, I probably should have seen errors earlier. I don't see any error messages before this last chunk.
How can I recover (or at least understand the h5debug output of) my HDF5 file? Currently I'm trying to repack it, but it's very slow over the network; I could regenerate the entire dataset in the time it's taking.
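In case it matters, the repack attempt is just the stock tool, roughly this (which is where the baseline_logits_repacked.hdf5 file in the od output above comes from):

h5repack /Volumes/MacBackup-1/gguf/baseline_logits.hdf5 /Volumes/MacBackup-1/gguf/baseline_logits_repacked.hdf5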