Files open really slowly

Hello,

I am a beginner programmer and learned Python for the bachelor thesis I am working on right now. My data is stored in .h5 files (file size ~2 MB), and opening them and converting them to NumPy arrays has posed no problem so far.

Throughout my data analysis, I created a nested dictionary with the important data: three numbers and one 1000x1000 array. Because I don’t want to re-run the computationally expensive function every time I restart my Jupyter Notebook, and the content of the dictionary doesn’t change much, I wanted to save the nested dictionary as a cache. Since JSON is not NumPy-compatible and Pickle produces large files that aren’t human-readable, I used dicttoh5 (http://www.silx.org/doc/silx/0.2.0/modules/io/dictdump.html). With this compression (copied from the link), my .h5 files are only about 725 kB:

create_ds_args = {'compression': "gzip",
                  'shuffle': True,
                  'fletcher32': True}
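
For completeness, these arguments then go into the dicttoh5 call roughly like this (sketch; the file name is shortened here, and kmeansDic is the nested dictionary described above):

    from silx.io.dictdump import dicttoh5

    # save the nested dictionary with the compression settings from above
    dicttoh5(kmeansDic, "Cache/kmeansDic.h5", create_dataset_args=create_ds_args)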

But when I now open the file with h5py and use it in one of my functions, which used to take only 8 seconds (when the nested dictionary was simply held in Jupyter’s working memory), it takes 140 seconds.

EDIT: Okay, compressing it was not such a good idea! Without compression the file is about 40 MB, but if I use the default settings it takes only 16 seconds (instead of 140).
Still: is there room for improvement, to make it faster? Even this way, with caching as .h5 files, it still takes double the time…
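
By “the default settings” I just mean saving without the extra dataset arguments, i.e. roughly (file name shortened here):

    dicttoh5(kmeansDic, "Cache/kmeansDic.h5")   # no create_dataset_args, i.e. h5py defaults, no compression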

How are you accessing the data from the file? It sounds like your data is small enough to fit in memory, so it may be easiest to read it all in at once (e.g. with the counterpart h5todict function) and then just operate on NumPy arrays.
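
For example, something along these lines (just a sketch; the file name is a placeholder) reads the whole file back into a nested dict of NumPy arrays in one go:

    from silx.io.dictdump import h5todict

    # read everything into memory at once; groups become dicts, datasets become ndarrays
    dic = h5todict("cache.h5")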

If your code ends up reading many small pieces from the dataset, there are a couple of ways it can be slow. Part of it is just overhead - it’s more efficient to get one big block of data through h5py than to get lots of small pieces. Also, your data will be stored in chunks, and when you want to read some data, HDF5 has to access all the chunks which contain part of that data. The fact that it’s really slow with gzip compression suggests that it’s having to repeatedly decompress the same chunks. Gzip is not a particularly fast compression algorithm, but even so, it wouldn’t take 140 seconds to expand 40 MB of data.
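
To make the access-pattern point concrete, here’s a rough sketch with made-up file and dataset names (assuming a 2D dataset):

    import h5py

    with h5py.File("cache.h5", "r") as f:
        dset = f["some_group/some_dataset"]

        # Fast: one big read into memory, then index the in-memory ndarray
        arr = dset[()]
        subtotal = arr[:100, :100].sum()

        # Slow: thousands of tiny reads, each going through the HDF5 layer
        # separately (and, with gzip, possibly through the decompression
        # filter for the same chunk again and again)
        subtotal_slow = 0
        for i in range(100):
            for j in range(100):
                subtotal_slow += dset[i, j]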

Hello Thomas, thank you for answering. Regarding your question about how I am accessing the data: instead of keeping the dictionary directly in working memory, I first save it as an .h5 file and then define a new variable as
dic = h5py.File(....). However, this is not yet a nested dictionary but an HDF5 File object, so in my function I have to refer to the respective value in dic and then convert it to an ndarray with value = np.array(dic["first key"]). The only thing that might be faster is to build an ndarray for every key/value right away, so that everything is in working memory from the start.
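
Concretely, something like this rough sketch is what I have in mind (file_path stands in for the long file name, and I’m assuming every top-level entry in the file is a group of datasets, like in my dictionary):

    import numpy as np
    import h5py

    # convert everything to ndarrays once, right after opening the file,
    # so the functions only ever see in-memory data
    nested = {}
    with h5py.File(file_path, "r") as f:
        for key, group in f.items():
            nested[key] = {name: np.array(dset) for name, dset in group.items()}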

I don’t know why decompressing takes so long or if there is something else going on.

Can you share one of the compressed files and an example of the code reading it?

This is my code for loading these files (for the compressed file you need to change the file name):

  • compressed h5 (link to my personal dropbox)

  • uncompressed h5 (link to my personal dropbox)

    import h5py

    # load the respective cached dictionary for the definition of "kmeansDic" throughout this project
    silhouette = True # True should be selected for complete data. If False is selected, then silhouette will just be assigned to 0 and omitted, saving a lot of computation time.
    sample_size = 300000 # this number refers to the sample size in the silhouette_score() function which is computationally very expensive
    k_lowest = 2
    k_highest = 3
    data_processing = 1
    scale_all = True
    absolute_values = False

    # parent_folder_path is defined earlier in the notebook
    kmeansDic = h5py.File(parent_folder_path + f"kmeansDic({k_lowest},{k_highest}){sample_size}-samplesize{silhouette!s}-silhouette_{data_processing}-data-processing_{scale_all!s}-scale-all_{absolute_values!s}-absolute-values.h5", "r")

Note that this line alone doesn’t take much time, but when I use the file in my other functions, they take 16 seconds instead of 8 with the uncompressed file, and over 140 seconds with the compressed file, compared to keeping the dictionary directly in working memory.

If it is relevant, this is my code for saving the h5 files:
import time
from datetime import datetime
from silx.io.dictdump import dicttoh5

# Create a nested dictionary for all cluster arrays, sum of squared distances, silhouette and calinski_harabasz scores

start = time.time()

kmeansDic = {}

silhouette_setting = True # True should be selected for complete data. If False is selected, then silhouette will just be assigned to 0 and omitted, saving a lot of computation time.
sample_size = 300000 # this number refers to the sample size in the silhouette_score() function which is computationally very expensive
k_lowest = 2
k_highest = 3
data_processing = 1
scale_all = True
absolute_values = False

compression = True

for k in range(k_lowest, k_highest+1):
    start_k = time.time()
    cluster_numbers, ssqd, silhouette, calinski_harabasz = kmeans_clustering(data_processing=data_processing, scale_all=scale_all, absolute_values=absolute_values, k=k, sample_size=sample_size, silhouette=silhouette_setting)
    kmeansDic[str(k)] = {"cluster_numbers": cluster_numbers, "ssqd": ssqd, "silhouette": silhouette, "calinski_harabasz": calinski_harabasz}
    end = time.time()
    print("Finished entry in nested dictionary for k = " + str(k) + ". That took " + str(round((end-start_k), 2)) + " seconds.")

end = time.time()
print("Calculating all cluster arrays took", round((end-start), 2), "seconds.")

                                        ### SAVE CACHE ###
# with these settings (copied from http://www.silx.org/doc/silx/0.2.0/modules/io/dictdump.html), the file size is much smaller, but loading takes much longer, so better not use it!
if compression:
    create_ds_args = {'compression': "gzip",
                  'shuffle': True,
                  'fletcher32': True}
else:
    create_ds_args = None

saveh5 = input("Do you want to save the dictionary as a h5 file that other functions will refer to ? (y/n)\n")
if saveh5.lower() == "y":
    if not compression:
        fname = f"kmeansDic({k_lowest},{k_highest})_{sample_size}-samplesize_{silhouette_setting!s}-silhouette_{data_processing}-data-processing_{scale_all!s}-scale-all_{absolute_values!s}-absolute-values.h5"
    else:
        fname = f"kmeansDic({k_lowest},{k_highest})_{sample_size}-samplesize_{silhouette_setting!s}-silhouette_{data_processing}-data-processing_{scale_all!s}-scale-all_{absolute_values!s}-absolute-values_COMPRESSED.h5"
    dicttoh5(kmeansDic, "Cache/"+ fname, create_dataset_args=create_ds_args)
    print("The nested dictionary was saved as \"" + fname + "\" in the Cache folder.")
if saveh5.lower() == "n":
    now = datetime.now()
    dt_string = now.strftime("%d-%m-%Y_%H_%M_%S")
    fname = "kmeansDic_no-save_" + dt_string + ".h5"
    dicttoh5(kmeansDic, "Cache/" + fname)
    print("The nested dictionary was saved as \"" + fname + "\" in the Cache folder.")

I cannot provide details about the original data, since it is confidential, and the code for the other function I referred to is quite long and probably would not help much anyway.

(Off-topic note: It is really tricky to format code here. :confused: Sorry, I hope this is legible. Maybe pastebin would be better?)