How to get started with HDF5 + Java

Reading the subject as “How to get started with HDF5 + Java”, I have to agree with the HDFql suggestion. Before getting into the technical weeds, you will want to find out quickly whether HDF5 can help you solve the (non-technical) problem at hand. Under the circumstances, raw JNI looks like the surest path to a non-starter. With HDFql, you can even pivot to another (host) language should Java not prove viable in the long run.

G.


https://www.hdfgroup.org/downloads/hdf5/

Thank you everyone for your replies.

@byrn
I am sorry, but there are a lot of links on the website you point to.
Which one is it? The Linux files are specific to certain distributions, which puzzles me: I would expect an hdf.jar plus native libraries that are not compiled per distribution. The file for Windows contains an MSI installer.

@kittisopikulm
Using the JNI-based library from ETH Zurich seems to be the better approach. The lack of documentation is a problem, though.

@gheber
The query language approach might be helpful, but we are not looking for that at the moment.

My recommendation is netcdf-java.
No JNI is required, especially when your workflow is read-only.

https://docs.unidata.ucar.edu/netcdf-java/5.3/userguide/building_from_source.html

Alternatively, use jhdf.

Also try Apache Drill, which uses jhdf; it could be the simplest solution:

http://hdfeos.org/examples/drill.php
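
For a rough idea of what that looks like from Java, here is a minimal sketch using Drill's JDBC driver. This is only a sketch: it assumes a drillbit running on localhost with the HDF5 format plugin enabled and the drill-jdbc driver on the classpath, and /tmp/example.h5 is a made-up path.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class DrillHdf5 {
	public static void main(String[] args) throws Exception {
		// Hypothetical setup: local drillbit, HDF5 format plugin enabled.
		try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
				Statement stmt = conn.createStatement();
				// Drill exposes the file's contents as rows; backquotes quote the path.
				ResultSet rs = stmt.executeQuery("SELECT * FROM dfs.`/tmp/example.h5` LIMIT 10")) {
			ResultSetMetaData md = rs.getMetaData();
			while (rs.next()) {
				// Print all columns generically, since the exact columns
				// depend on the file's layout.
				for (int i = 1; i <= md.getColumnCount(); i++) {
					System.out.print(md.getColumnName(i) + "=" + rs.getObject(i) + "  ");
				}
				System.out.println();
			}
		}
	}
}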

JHDF (jamesmudd) is missing slicing, which is mandatory for us because we want to manage matrices with dimensions of 5M x 100k. Loading that amount of data into memory is not possible on current everyday computers.

I finally became friends with the JHDF from ETH Zurich, thanks again @kittisopikulm.

Thank you all for your help and support.

The JHDF5 from ETH is commonly called sis-jhdf5 or cisd-jhdf5.

You can find the documentation, including JavaDoc, here:
https://unlimited.ethz.ch/pages/viewpage.action?pageId=92865195

@kittisopikulm
Do you know anyone at ETH Zurich?

I ask because they claim that the latest version has support for slicing/block reading, although the necessary methods are not implemented.

The Javadoc seems to be private, and the source code is not available either.
This is what I was referring to when I said that there is no documentation.

EDIT:
The source and javadoc are available in the downloadable ZIP, but they are excluded from the Maven artifacts and not visible online.

I contacted the people responsible for the project a few days ago, but they haven't replied yet.

Doesn’t your matrix require only about 5 TB of memory (5M x 100k elements at 8 bytes each is roughly 4 TB)?
I think you can easily find such a small system on AWS [1].
I don’t know what you mean by slicing capability, but you can request such a feature via GitHub from the jHDF or netCDF-java community.

Anyway, you can use whatever Java solution you like for such small data, because you don’t have to deal with deploying a jar on a cluster of a thousand machines.

[1] https://aws.amazon.com/ec2/instance-types/high-memory/

@hyoklee
I am sorry, what is 5T memory? You’re not referring to 5TB, are you?

We need this solution to run on a work laptop (8-16 GB of RAM).
Slicing or block reading refers to the capability of reading only parts of an array stored in HDF5.
For example: an array of 1 billion strings of length 30 takes roughly 30 GB, so reading it into memory is simply not possible with that amount of RAM. It has to be “sliced” and read one piece at a time.
The latest version of sis-jhdf5 claims to offer this functionality, with a code example on their webpage, but the actual interface does not offer those methods.
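
To make the idea concrete, here is a minimal sketch of block reading with netCDF-java (suggested above), reading a large 1-D variable one block at a time. The file name, variable path, and block size are made up:

import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;
import ucar.nc2.NetcdfFiles;
import ucar.nc2.Variable;

public class BlockRead {
	public static void main(String[] args) throws Exception {
		try (NetcdfFile nc = NetcdfFiles.open("big.h5")) {
			Variable v = nc.findVariable("/assays/RNA/counts/data");
			long total = v.getSize();
			int blockSize = 1_000_000; // elements per read, not the whole array
			for (long offset = 0; offset < total; offset += blockSize) {
				int n = (int) Math.min(blockSize, total - offset);
				// Only the range [offset, offset + n) is loaded into memory.
				Array block = v.read(new int[] { (int) offset }, new int[] { n });
				// ... process the block, then let it be garbage-collected ...
			}
		}
	}
}

Each iteration holds only one block in memory, so the peak footprint is bounded by the block size rather than the array size.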

I have been able to get in touch with them finally, I will let you know once I learn something new.

Regarding netCDF-java:
I understand it helps to manage netCDF data on top of HDF5, but we are dealing with data that comes from other libraries that use plain HDF5 (with non-netCDF data models), which makes me believe it is of no use to us.

Regarding jamesmudd/jhdf:
This project is developed as a hobby project (no offense, it is great work!) and lacks a few features, such as writing and slicing. I do not want to depend on them implementing those features, because I don’t know if and when that would happen.

Yes, 5TB.

So, does your data look like this?

How’s it chunked (or not)?

Please find a small toy dataset below; the “barcode” and “count” arrays can grow to sizes in the billions.
“count” can hold ints, longs, or floats.
What data or object type are you showing in the screenshot?

It’s a string type, as you can see in the screenshot.

I quickly modified the design to match your case.
I also increased the dimension to 1 billion, and netCDF-java looks just fine:

http://54.174.38.12/thredds/dodsC/testAll/hpd/dseide2.h5.html

@hyoklee

Thank you for proposing NetCDF-Java.

It looks like this library can read parts of objects without loading them entirely into memory.
The following example reads and prints only the first element of the data array.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.logging.LogManager;
import ucar.ma2.Array;
import ucar.ma2.InvalidRangeException;
import ucar.nc2.Group;
import ucar.nc2.NetcdfFile;
import ucar.nc2.NetcdfFiles;
import ucar.nc2.Variable;

public class NetCdfHdf {
	public static void main(String[] args) throws IOException, InvalidRangeException {
		// Configure java.util.logging before netCDF-java starts logging.
		try (FileInputStream inputStream = new FileInputStream("my.properties")) {
			LogManager.getLogManager().readConfiguration(inputStream);
		}
		// Open the HDF5 file read-only through the netCDF-java CDM layer.
		try (NetcdfFile file = NetcdfFiles.open("wt_mutant.h5")) {
			Group group = file.findGroup("/assays/RNA/counts");
			System.out.println(group);
			Variable data = group.findVariableLocal("data");
			System.out.println(data);
			// origin {0}, shape {1}: load only the first element of the array.
			Array first = data.read(new int[] { 0 }, new int[] { 1 });
			System.out.println(first);
		}
	}
}

I wonder why this library is not easier to find. From what I have seen so far (the origins of netCDF and all its features), it is very powerful and surpasses all the other implementations I have come across. It also has many APIs and, as far as I can tell, can be integrated very easily.

Thanks, HDFGroup community!


@dseide, I’m glad that it worked for you.

By the way, thank you for sharing your HDFView screenshot!
After looking it over, I’m eager to learn how data is organized in the genetics domain.
For example, is there a reason the data provider stores sequences as plain text like ‘GATTACA’ instead of using a 2-bit encoding such as A=00, T=01, G=10, C=11?
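
Just to illustrate the packing I mean; this is only a sketch of the idea, not a format used by any tool in this thread:

public class TwoBitPack {
	// Map one base to a hypothetical 2-bit code (A=00, T=01, G=10, C=11).
	static long code(char base) {
		switch (base) {
			case 'A': return 0b00;
			case 'T': return 0b01;
			case 'G': return 0b10;
			case 'C': return 0b11;
			default: throw new IllegalArgumentException("unexpected base: " + base);
		}
	}

	public static void main(String[] args) {
		String seq = "GATTACA"; // 7 bases -> 14 bits instead of 7 bytes
		long packed = 0;
		for (char base : seq.toCharArray()) {
			packed = (packed << 2) | code(base);
		}
		System.out.printf("%s -> 0b%s (%d bits)%n",
				seq, Long.toBinaryString(packed), 2 * seq.length());
	}
}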

In general, is there a well-known or efficient compression algorithm specific to long genomic sequences that saves storage?

Another topic I want to learn about is HIPAA compliance for genetic data and the importance of encryption filters (e.g., [1]). How important is it in the community you serve? I’m also curious whether genetic information from animals or viruses is exempt from such regulations when distributing data.

Regards,

[1] symmetric encryption filters?

@hyoklee

I am no expert here and this is my personal opinion.

Sorry for the delay in getting back to you.
I understand your question, but this is a very broad topic, and there are probably many different opinions about it. The field we work in is best described as bioinformatics or computational biology rather than genetics.
I just learned that HDF5 is used for single-cell data, which is an emerging field in biology at the moment. I don’t know of other fields where it might be used, but I can think of many applications, as data is often stored in arrays or matrices, and one project often comprises many different files (which could all be stored inside a single HDF5 file).

There have been many efforts to standardize data formats and to use compression algorithms for better storage. Efficient compression formats exist, but they are maintained by companies and not supported by any of the popular open research software.
However, IMHO, data formats and organization are a big problem in bioinformatics and one of the main reasons for inefficient research and slow development of the field in general. There are a lot of inefficient or misused file formats and databases in this field. Also, people tend to modify standards after implementing them, which breaks the protocol and leads to unexpected outcomes.

Often, the same format is used in different sub-domains (sometimes each extends it in its own style), and the meaning of the data shifts completely between domains, which leads to a lot of confusion.

Regarding data security:
In the human medical research domain, data is “protected” by protocols that are often absurd. The basic research sector (plants, animals, etc.) usually does not have these requirements. Companies, of course, want to protect any of their data related to internal research.

I can provide you with more examples or information if you are interested.


@dseide, thank you so much! This is an amazingly insightful write-up.

Yes, my hobby is collecting such information. Have you given any presentations about your work and made them available as online videos or podcasts? I’m particularly interested in visualizing bioinformatics data in 3D, especially on the Meta Quest Pro.

Back to your original question: I think you can do this with only the native HDF5 library, including its bundled Java API, and nothing else. I do not think you need hand-written JNI code or another high-level library. I don’t use Java, so I might be missing something.

Please look at these two Java examples, which are included in several recent HDF5 releases. Both are small, self-contained examples that generate their own test file and then read back array subsets, i.e. “slices”.

Within HDF5 source code releases:
java/examples/datasets/H5Ex_D_Hyperslab.java

Within HDF5 binary distributions:
share/HDF5Examples/JAVA/HDF5SubsetSelect.java
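
If those files are not at hand, here is a minimal sketch of the same hyperslab-selection pattern using the hdf.hdf5lib Java API. This is only a sketch: the file name, dataset path, and sizes are made up, and the dataset is assumed to be a one-dimensional integer array.

import hdf.hdf5lib.H5;
import hdf.hdf5lib.HDF5Constants;

public class HyperslabRead {
	public static void main(String[] args) throws Exception {
		long file = H5.H5Fopen("big.h5", HDF5Constants.H5F_ACC_RDONLY, HDF5Constants.H5P_DEFAULT);
		long dset = H5.H5Dopen(file, "/counts", HDF5Constants.H5P_DEFAULT);
		long fspace = H5.H5Dget_space(dset);

		// Select 4096 elements starting at offset 1,000,000 in the file.
		long[] start = { 1_000_000L };
		long[] stride = { 1 };
		long[] count = { 4096L };
		long[] block = { 1 };
		H5.H5Sselect_hyperslab(fspace, HDF5Constants.H5S_SELECT_SET, start, stride, count, block);

		// Memory dataspace matching the selection, so only the slice is read.
		long mspace = H5.H5Screate_simple(1, count, null);
		int[] buf = new int[(int) count[0]];
		H5.H5Dread(dset, HDF5Constants.H5T_NATIVE_INT, mspace, fspace, HDF5Constants.H5P_DEFAULT, buf);

		H5.H5Sclose(mspace);
		H5.H5Sclose(fspace);
		H5.H5Dclose(dset);
		H5.H5Fclose(file);
	}
}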

For SIS-JHDF5, the latest documentation can be found here:
https://openbis.ch/javadoc/jhdf5/19.04.1/

That is linked from the FAQ above (https://unlimited.ethz.ch/pages/viewpage.action?pageId=92865195). I did notice that these links were updated on November 10th.

The source code is available here:

@kittisopikulm, thanks for the update.
I created a mirror for 94 million GitHub developers.

@kittisopikulm
Good to see that they finally made the documentation public. Even so, reading the data piece by piece doesn’t seem to be possible with their library, and they don’t respond to messages. Discarded.

@dave.allured
Plain JNI is very cumbersome and error-prone to use from Java. It feels unnatural and needs some form of layer on top to hide its complexity from the pure Java code base. That means extra work, and since someone has already done it (netCDF-java), I wouldn’t take it on myself.

Thanks again everybody for your input.
