ANN: HDF5 for Python 2.4.0 BETA

Announcing HDF5 for Python (h5py) 2.4.0 BETA
============================================

The h5py team is happy to announce the availability of h5py 2.4.0 (beta).

This beta version will be available for approximately two weeks. Because
of the substantial number of changes to the code base, we welcome feedback,
particularly from MPI users.

Documentation for the beta is at:

http://docs.h5py.org/en/latest/

Download at PyPI:

https://pypi.python.org/pypi/h5py/2.4.0b1

Changes
-------

This release incorporates a total rewrite of the identifier management
system in h5py. As part of this refactoring, the entire API is also now
protected by threading locks. User-visible changes include:

* Files are now automatically closed when all objects within them
  are unreachable. Previously, if File.close() was not explicitly called,
  files would remain open and "leaks" were possible if the File object
  was lost.

* The entire API is now believed to be thread-safe (feedback welcome!).

* External links now work if the target file is already open. Previously
  this was not possible because of a mismatch in the file close strengths.

* The options to setup.py have changed; a new top-level "configure"
  command handles options like --hdf5=/path/to/hdf5 and --mpi (see the
  example after this list). Setup.py now works correctly under Python 3
  when these options are used.

* Cython (0.17+) is now required when building from source.

* The minimum NumPy version is now 1.6.1.

* Various other enhancements and bug fixes.
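
For reference, a from-source build against a custom HDF5 with MPI support
would look roughly like this (the HDF5 path is a placeholder):

    $ python setup.py configure --hdf5=/path/to/hdf5 --mpi
    $ python setup.py build
    $ python setup.py install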

Hi Andrew,

I had a go at building this 2.4.0 beta1 in MPI mode, but ran into a problem at compile time: a missing h5py/defs.c (see the attached trace of the configuration and build).

I am using the following dependencies:
* HDF5 1.8.11 (parallel, shared)
* Cython 0.19.1
* NumPy 1.7.1
* mpi4py 1.3.1
* Python 2.7.3

The relevant binaries are all on the PATH and the Python modules on the PYTHONPATH. I have built and used h5py several times before, but have not come across this kind of issue... Am I just forgetting some configuration?

Cheers,
Ulrik

h5py-2.4.0b1_build.log (6.35 KB)


Hi Ray,

> Great news! Many thanks to the h5py team!
>
> What exactly does it mean for the API to be thread-safe? Can we now
> read/write datasets in parallel without using MPI?

It means that you can use h5py objects in a threaded program without
manually locking everything. For example, this code:

data = dset[0:100]

is now an atomic operation; other threads can't interfere with the
reading of data. Previously, it was required that such operations be
surrounded by threading locks, or bad things might happen while h5py
was reading & returning the data. For example, another thread might
change the size of the dataset mid-read, with undefined results.

MPI is still necessary if you want multiple processes to interact with
the file. But other programs (e.g. web servers) which only
occasionally talk to HDF5 should have an easier time.
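
As a toy sketch (file and dataset names invented here), something like
this should now be safe without any explicit locks:

    import threading
    import h5py

    f = h5py.File('demo.h5', 'r')   # hypothetical existing file
    dset = f['data']                # hypothetical dataset

    def reader():
        for _ in range(100):
            chunk = dset[0:100]     # each slice read is atomic inside h5py

    threads = [threading.Thread(target=reader) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()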

Andrew

Hey Andrew,
  Congratulations on the release!

  Re. thread-safety, I thought code like:
data = dset[0:100]
was already thread-safe, since the GIL wouldn't be released until the
call returns. Is that not the case?
John


Hi John,

>   Re. thread-safety, I thought code like:
> data = dset[0:100]
> was already thread-safe, since the GIL wouldn't be released until the
> call returns. Is that not the case?

Since much of the slicing code is written in pure Python, the GIL doesn't
make it atomic: the interpreter can switch threads between bytecode
instructions. Even direct calls to certain low-level APIs like H5Literate
are not GIL-protected, because they can invoke callbacks which execute
pure-Python code. Likewise for anything that calls into the Python
standard library, large portions of which are regular .py files.

The new solution is similar to thread-safety in HDF5; there's a single
big recursive lock which every public h5py routine acquires.
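
As a rough illustration of the pattern (this is not h5py's actual
internals), think of something like:

    import functools
    import threading

    _lock = threading.RLock()   # one module-wide recursive lock

    def with_lock(func):
        """Run a public routine under the global lock."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with _lock:         # re-entrant: locked code may call locked code
                return func(*args, **kwargs)
        return wrapper

    @with_lock
    def read_slice(dset, start, stop):
        return dset[start:stop]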

Andrew

Hi Ray,

> Does that imply that true reading parallelism, i.e., multiple processes
> reading the same file, is no longer possible?
> Or does the recursive lock operate solely at the thread level?

It's at the thread level only. Multiple processes can still read the
same file using any mechanism (MPI, multiprocessing, etc.), provided
it's not open for writing anywhere.
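
For instance, a toy sketch (file and dataset names invented):

    import multiprocessing as mp
    import h5py

    def read_mean(path):
        # Each process opens its own read-only handle.
        with h5py.File(path, 'r') as f:
            return float(f['data'][:].mean())

    if __name__ == '__main__':
        pool = mp.Pool(4)
        print(pool.map(read_mean, ['demo.h5'] * 4))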

Andrew

Stuart is correct, any readers will need to close and re-open the file
after a write. See: http://www.hdfgroup.org/hdf5-quest.html#gconc1.
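
In h5py terms, the reader's side of that dance looks roughly like this
(file and dataset names invented):

    import h5py

    f = h5py.File('shared.h5', 'r')
    before = f['data'][:]
    f.close()                          # release the handle before the write

    # ... a writer process updates the file here ...

    f = h5py.File('shared.h5', 'r')    # re-open to see a consistent view
    after = f['data'][:]
    f.close()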

Anyone who is dealing with multiple reader/multiple writer issues may be
interested in trying out the HDF5 Server I've been working on over the
last couple of months. It's a REST-based service written in Python with an
API based on the paper Gerd wrote a while back:
http://www.hdfgroup.org/pubs/papers/RESTful_HDF5.pdf. With a service, the
clients are abstracted from dealing with read/write locks, so it will
support multiple writers/multiple readers with no explicit synchronization.

I should have this available on Github in a couple of weeks.

Big provisos...
  This first version doesn't do anything clever for parallelization; in
effect, every call to the service is handled serially.

  And there are no client APIs at this point. Interaction with the
service is just through HTTP GET/PUT/POST/DELETE requests.
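
Purely as a hypothetical sketch of what a raw request might look like
from Python (host, port, URL scheme, and response shape are all invented
here):

    import requests   # third-party HTTP client

    base = 'http://localhost:5000'
    # Hypothetical endpoint: read the first 100 elements of a dataset.
    r = requests.get(base + '/datasets/mydset/value',
                     params={'select': '[0:100]'})
    values = r.json()['value']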

John Readey
HDF Group

···

On Monday, November 10, 2014 8:25:24 AM UTC-8, stuarteberg wrote:

Hi Ray,

> In fact, I implemented something quite similar. ... to create a
> transparent workaround for the limitation that parallel read *and* write
> access is not possible with hdf5/h5py. ... As soon as one process wants to
> write, all other processes accessing that file must wait until the writing
> is done.

Sounds useful; looking forward to seeing your code. But beware: If I
understand correctly, merely waiting until the writing is done isn't enough
to avoid problems. If the file has been changed at all, the reading
processes will probably need to close the file entirely and re-open it once
writing is finished. In fact, I'm not sure if it is even permitted to keep
a file open in 'r+' mode while other processes are reading it in 'r' mode.
Perhaps an hdf5 or h5py dev can chime in on this point.

Best,
Stuart

Hey Ray,

  For this first release, the focus will be mostly on the API definition rather than performance. For example, data is being sent as JSON-formatted text. I don't think it should be an issue to support Base64 encoding for data reads/writes in a future release. The client can specify the desired format in the Content-Type HTTP header.

Similarly, I'm not doing anything special for reader/writer concurrency; the server serializes all the requests. Clearly not suitable for a production service that will see a lot of traffic.

I'd be interested in hearing what performance requirements people have for an HDF server: bandwidth in/out, latency, request volume, etc. Depending on the specifics, there are different approaches for achieving performance targets.

I hadn't heard about the issue with ever-growing hdf5 files. Well, one nice aspect of the server-based approach is that you can consolidate any maintenance workflows, e.g. periodically running h5repack on files in the server.
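
Concretely, something along these lines (filenames invented), since
h5repack writes a compacted copy rather than shrinking the file in place:

    $ h5repack bloated.h5 compacted.h5
    $ mv compacted.h5 bloated.h5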

John

···

From: Ray Polachikov <raypola13@gmail.com>
Reply-To: h5py@googlegroups.com
Date: Tuesday, November 11, 2014 at 4:47 AM
To: h5py@googlegroups.com
Cc: hdf-forum@lists.hdfgroup.org
Subject: Re: ANN: HDF5 for Python 2.4.0 BETA

Hi John and Stuart,

Thanks for the hint. I'm aware of this limitation. The wrapper classes open and close the underlying file for every single operation. I found the overhead of this to be negligible (relative to the actual I/O operations).
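
Roughly, the idea looks like this stripped-down sketch (names illustrative):

    import h5py

    class SafeDataset(object):
        """Open the file around every access so no handle outlives it."""
        def __init__(self, path, name):
            self.path = path
            self.name = name

        def __getitem__(self, key):
            # Open, read, and close again on exit from the with-block.
            with h5py.File(self.path, 'r') as f:
                return f[self.name][key]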

HDF5 Server sounds promising. It's great that some progress is being made in this area. I experimented with array-based database servers such as SciDB, but - to date - data I/O is much slower than with hdf5. One problem is that the SciDB Python API is HTTP-based and, hence, numerical data is encoded as text.
Very much looking forward to seeing your code. I wonder how you dealt with those reader/writer concurrency issues. I also wonder if you found a solution to the problem that deleting nodes in an hdf5 file does not reduce the file size, i.e., files are ever-growing. In my opinion, this is a nasty limitation of hdf5.

Ray


Hey Ben,
  Thanks for the link. WebSockets are also worth exploring for large
data transfers (it looks like most recent browsers support WebSockets).

  Another tack is to ask "what does the page intend to do with the data
anyway?". If, say, it is going to be displayed in a grid, it may make
sense to fetch the data dynamically as users scroll around rather than
grabbing the entire dataset at once. The number of values displayed in
any one view of a data grid is very small.

John

···

On Tuesday, December 2, 2014 4:13:46 AM UTC-8, Ben Jeffery wrote:

Hi John,

On the issue of encodings, I've recently been using ArrayBuffers to
transfer HDF5 array subsets to the browser (
https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Sending_and_Receiving_Binary_Data).
If the dtype is one supported by JS typed arrays, the array doesn't need
to be parsed.
Thought it was worth mentioning, as this method can be a good option for
use cases with large arrays on fast connections.
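
The Python side can be a simple sketch like this (the helper name and
any framework plumbing are invented); the returned bytes can be viewed
directly as a Float32Array in the browser:

    import h5py

    def slice_bytes(path, name, start, stop):
        """Return a dataset slice as raw little-endian float32 bytes."""
        with h5py.File(path, 'r') as f:
            arr = f[name][start:stop]
        return arr.astype('<f4').tobytes()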

Thanks,
Ben
