Hello h5py users.
I’ve finished my open-source development of tablite, which I use for incremental data processing. The map below illustrates 43M rows of data being processed in 208 incremental steps (join, groupby, filter, custom operations, lookup, …) to generate summaries for an average use case.
Tablite features:
- Tablite uses HDF5 as a backend behind a strong abstraction layer, so copy, append and repetition of data are handled as references to pages rather than physical copies. This allows me to slice 9,000,000,000 rows in less than a second on localhost (see image below).
- Tablite implements multiprocessing to bypass the Python GIL on all major operations. One CSV import test loads 96M fields and type-maps them to native Python types in 120 seconds (a sketch of the import call appears after this list).
- Tablite respects the limits of free memory: it checks the available free memory and sizes its tasks accordingly before each memory-intensive operation (join, groupby, data import, etc.) is started.
- Tablite maps datatypes to HDF5-native types where possible and falls back to type mapping for non-native types such as timedelta, None, date, time, …: what you put in is what you get out (see the usage sketch after this list).
- Tablite stores all data in /tmp/tablite.hdf5, so if your temp directory sits on an SSD you will benefit from its high IOPS.
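To make the type-mapping point concrete, here is a minimal sketch of the dict-style usage. Plain column assignment matches the tutorial; everything beyond that (column indexing, slicing semantics) is my shorthand for the behaviour described above, so treat it as illustrative rather than as the definitive API:

```python
# Minimal sketch: columns are assigned like dict entries and values round-trip.
# Details beyond plain column assignment are assumptions -- see the tutorial
# notebook for the authoritative API.
from datetime import date, timedelta
from tablite import Table

t = Table()
t['n'] = [1, 2, 3, 4, 5]                           # maps to an HDF5-native type
t['d'] = [date(2022, 1, i) for i in range(1, 6)]   # non-native: type-mapped
t['gap'] = [timedelta(hours=i) for i in range(5)]  # non-native: type-mapped
t['maybe'] = [1, None, 3, None, 5]                 # None survives the round-trip

assert t['d'][0] == date(2022, 1, 1)  # what you put in is what you get out
print(list(t['n'][:3]))               # slicing reads pages rather than copying data
```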
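And a hedged sketch of the multiprocessed CSV import. The entry point name `Table.import_file`, its keywords and the file name `big.csv` are assumptions from memory; the tutorial documents the exact signature:

```python
# Hypothetical import call -- the method name and keyword arguments are
# assumptions; check the tutorial for the real signature.
from tablite import Table

t = Table.import_file('big.csv', import_as='csv')  # multiprocessed import with type mapping
t.show()  # pretty-print the first and last rows
```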
The test suite covers Python 3.8+ on Windows and Linux.
To learn more please visit:
Project home: https://github.com/root-11/tablite
Pypi: https://pypi.org/project/tablite/
Tutorial: https://github.com/root-11/tablite/blob/master/tutorial.ipynb
Thanks again to the HDF5 team. Without you this wouldn’t be possible!
If you have 5-10 minutes, please take a look at the tutorial and send me any questions or feedback here or via the GitHub issue list.
Kind regards
Dr. Bjorn Madsen