9BN rows/sec + HDF5 support for all python datatypes


#1

Hello h5py users.
I’ve finished the open source development of tablite, which I use for incremental data processing. The map below illustrates 43M rows of data being processed in 208 (incremental) steps (join, groupby, filter, custom operations, lookup, …) to generate summaries for an average use case.

Tablite features:

  • Tablite uses HDF5 as a backend with strong abstraction, so that copy/append/repetition of data is handled in pages. This allows me to slice 9,000,000,000 rows in less than a second on localhost (see image below).
  • Tablite implements multiprocessing to bypass the python GIL on all major operations. CSV import has a test with 96M fields imported and type mapped to native python types in 120 secs.
  • Tablite respects the limits of free memory by tagging the free memory and defining task size before each memory-intensive task is initiated (join, groupby, data import, etc.).
  • Tablite uses datatype mapping to HDF5 native types where possible and uses type mapping for non-native types such as timedelta, None, date, time…, i.e. what you put in is what you get out.
  • Tablite stores all data in /tmp/tablite.hdf5 so if your OS sits on SSD it will benefit from high IOPS.
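
To give a feel for why page-based storage makes repetition and slicing cheap, here is a minimal sketch of the idea. It is not tablite's actual implementation: the `PagedColumn` class and its methods are hypothetical names invented for this illustration, and a lazy `range` stands in for a page that would really live in the HDF5 file.

```python
class PagedColumn:
    """Hypothetical column that holds references to pages, not copies of data."""

    def __init__(self, pages=()):
        self.pages = list(pages)  # only references are stored here

    @property
    def n_rows(self):
        return sum(len(page) for page in self.pages)

    def __mul__(self, n):
        # Repetition duplicates page references; the page contents are shared.
        return PagedColumn(self.pages * n)

    def __add__(self, other):
        # Appending concatenates the two reference lists.
        return PagedColumn(self.pages + other.pages)

    def slice(self, start, stop):
        """Return rows start..stop-1, reading only the pages that overlap."""
        out = []
        offset = 0
        for page in self.pages:
            end = offset + len(page)
            if end > start and offset < stop:
                lo = max(start - offset, 0)
                hi = min(stop - offset, len(page))
                out.extend(page[lo:hi])
            offset = end
            if offset >= stop:
                break
        return out


# One page of 1M rows (a lazy range stands in for an on-disk page),
# referenced 9,000 times to form a 9,000,000,000-row column.
page = range(1_000_000)
column = PagedColumn([page]) * 9_000

print(column.n_rows)
print(column.slice(999_998, 1_000_002))  # spans the boundary of two pages
```

The slice only walks the list of page references and reads the rows it needs, so its cost depends on the number of pages rather than the number of rows.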

The test suite covers python 3.8+ on Windows and Linux.

To learn more please visit:

Project home: https://github.com/root-11/tablite
Pypi: https://pypi.org/project/tablite/
Tutorial: https://github.com/root-11/tablite/blob/master/tutorial.ipynb

Thanks again to the HDF5-team. Without you this wouldn’t be possible!

If you have 5-10 minutes, please take a look at the tutorial and send me any questions or feedback here or via the github issue list.

Kind regards
Dr. Bjorn Madsen


#2

For those curious about the backing HDF5 file (in tablite.config):

H5_STORAGE = pathlib.Path(tempfile.gettempdir()) / "tablite.hdf5"

Disarmingly simple and effective!

G.


#3

Bjorn, congratulations & fascinating work! Would you mind sharing a hi-res version of your map_of_operations.jpg? Thanks, G.


#4

I decided this was a better solution as some users have their projects on HDD, whilst the OS is installed on their SSD. In that way they can have source files on HDD (slow) whilst the interaction with the data is on SSD (fast).
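
A minimal sketch of how that path resolves, reusing the `H5_STORAGE` line quoted from tablite.config earlier in this thread with the imports it needs. It only shows where the file would land on a given machine; it does not touch tablite itself.

```python
import pathlib
import tempfile

# Same expression as the line quoted from tablite.config earlier in this thread.
H5_STORAGE = pathlib.Path(tempfile.gettempdir()) / "tablite.hdf5"

# gettempdir() checks the TMPDIR, TEMP and TMP environment variables before
# falling back to platform defaults, so the file follows the OS temp directory.
print(H5_STORAGE)
```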


#5

–> high resolution image


#6

A minor update: tablite has reached 2k downloads in the past month, and now has support for arbitrary python classes (example):

What you put in is exactly what you get out, even if it’s not a native cpython or numpy type. :+1:
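
For readers wondering what such a round trip can look like, here is a small sketch of tagged type mapping. It is an assumption-laden illustration, not tablite's actual storage format: the `encode`/`decode` helpers and the tag names are hypothetical, and pickle is used here only as one possible fallback for arbitrary classes.

```python
import pickle
from datetime import date, time, timedelta
from fractions import Fraction


def encode(value):
    """Return a (tag, payload) pair whose payload is simple to store."""
    kind = type(value)
    if kind in (bool, int, float, str):
        return ("native", value)
    if value is None:
        return ("none", "")
    if kind is date:
        return ("date", value.isoformat())
    if kind is time:
        return ("time", value.isoformat())
    if kind is timedelta:
        return ("timedelta", (value.days, value.seconds, value.microseconds))
    # Fallback for any other class (assumption: the object is picklable).
    return ("pickle", pickle.dumps(value))


def decode(tag, payload):
    """Invert encode() so the caller gets back an equal object of the same type."""
    if tag == "native":
        return payload
    if tag == "none":
        return None
    if tag == "date":
        return date.fromisoformat(payload)
    if tag == "time":
        return time.fromisoformat(payload)
    if tag == "timedelta":
        days, seconds, microseconds = payload
        return timedelta(days=days, seconds=seconds, microseconds=microseconds)
    if tag == "pickle":
        return pickle.loads(payload)
    raise ValueError(f"unknown tag: {tag!r}")


values = [1, 2.5, "text", None, date(2022, 3, 1), time(12, 30, 15),
          timedelta(days=2, seconds=5), Fraction(1, 3)]
restored = [decode(*encode(v)) for v in values]
print(restored == values)
```

Here `Fraction` stands in for a class that is neither a native cpython nor a numpy type.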


#7

Congratulations! G[erd] (<- Post must be at least 20 characters!)