Yes, 5TB.
So, does your data look like this?
How’s it chunked (or not)?
Please find a small toy dataset below; the “barcode” and “count” arrays can grow to billions of elements.
“count” can hold ints or longs or floats.
What data or object type are you showing in the screenshot?
It’s a string type, as you can see in the screenshot.
I quickly modified the product design to match your case.
I also increased the dimension to 1 billion, and NetCDF-Java looks just fine:
http://54.174.38.12/thredds/dodsC/testAll/hpd/dseide2.h5.html
Thank you for proposing NetCDF-Java.
Looks like this library can read parts of objects without loading them entirely into memory.
The following example only reads/loads and prints the first element of the data array.
import java.io.FileInputStream;
import java.io.IOException;
import java.util.logging.LogManager;
import ucar.ma2.Array;
import ucar.ma2.InvalidRangeException;
import ucar.nc2.Group;
import ucar.nc2.NetcdfFile;
import ucar.nc2.NetcdfFiles;
import ucar.nc2.Variable;

public class NetCdfHdf {
    public static void main(String[] args) throws IOException, InvalidRangeException {
        // Configure java.util.logging from a properties file.
        LogManager.getLogManager().readConfiguration(new FileInputStream("my.properties"));
        // Open the HDF5 file through the NetCDF-Java API.
        try (NetcdfFile file = NetcdfFiles.open("wt_mutant.h5")) {
            Group group = file.findGroup("/assays/RNA/counts");
            System.out.println(group);
            Variable data = group.findVariableLocal("data");
            System.out.println(data);
            // Read one element: origin {0}, shape {1} -- nothing else is loaded.
            Array first = data.read(new int[] { 0 }, new int[] { 1 });
            System.out.println(first);
        }
    }
}
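If you need a larger contiguous slice instead of a single element, the same call should scale, e.g. data.read(new int[] { 0 }, new int[] { 1000 }) to pull back only the first 1000 values.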
I wonder why this library is not easier to find. From what I have seen so far (the origins of NetCDF and all its features), it is very powerful and outperforms all the other implementations I have come across. It also has many APIs and, as far as I can tell, can be integrated very easily.
Thanks, HDFGroup community!
@dseide, I’m glad that it worked for you.
By the way, thank you for sharing your HDFView screenshot!
After looking it over, I’m eager to learn how data is organized in the genetics domain.
For example, is there a reason that the data provider stores the data as a plain character sequence like ‘GATTACA’ instead of using a 2-bit encoding such as A=00, T=01, G=10, C=11? (A minimal sketch of what I mean follows below.)
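To make the question concrete, here is a minimal sketch of the 2-bit packing I have in mind. The class name, the mapping, and the helper are my own illustration, not any existing library:

public class TwoBitPack {
    // Pack a DNA string into 2 bits per base: A=00, T=01, G=10, C=11.
    public static byte[] pack(String seq) {
        byte[] out = new byte[(seq.length() + 3) / 4]; // 4 bases per byte
        for (int i = 0; i < seq.length(); i++) {
            int code;
            switch (seq.charAt(i)) {
                case 'A': code = 0b00; break;
                case 'T': code = 0b01; break;
                case 'G': code = 0b10; break;
                case 'C': code = 0b11; break;
                default: throw new IllegalArgumentException("unexpected base: " + seq.charAt(i));
            }
            out[i / 4] |= code << ((i % 4) * 2); // four 2-bit codes per byte
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = pack("GATTACA"); // 7 bases fit into 2 bytes instead of 7
        System.out.println("GATTACA -> " + packed.length + " bytes");
    }
}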
In general, is there any well-known, efficient compression algorithm specific to long genomic sequences that would save storage?
Another topic I want to learn about is HIPAA compliance for genetic data and the importance of an encryption filter (e.g., [1]). How important is it in the community you serve? I’m also curious whether genetic information from animals or viruses is exempt from such regulation when distributing data.
Regards,
I am no expert here and this is my personal opinion.
Sorry for the delay in getting back to you.
I understand your question, but this is a very broad topic, and there are probably many different opinions about it. The field we work in is best described as bioinformatics or computational biology rather than genetics.
I just learned that HDF5 is used for single-cell data, which is an emerging field in biology at the moment. I don’t know of any other fields where it might be used, but I can think of many applications, since data is often stored in arrays or matrices, and one project often comprises many different files (which could all be stored inside one HDF5 file).
There have been many efforts to standardize data formats and to use compression algorithms for better storage. Efficient compression formats do exist, but they are maintained by companies and are not supported by any of the popular open research software.
However, IMHO, data formats and organization in bioinformatics are a big problem and one of the main reasons for inefficient research and slow development of the field in general. There are a lot of inefficient or misused file formats and databases in this field. Also, people tend to modify standards after having implemented them, which breaks the protocol and leads to unexpected outcomes.
Often the same format is used in different sub-domains (sometimes each extends the format in its own style), and the meaning of the data shifts completely in each domain, which leads to a lot of confusion.
Regarding data security:
In the human medical research domain, data is often “protected” with absurd protocols. The basic research sector (plants, animals, etc.) usually does not have these requirements. Companies, of course, want to protect all of their internal research data.
I can provide you with more examples or information if you are interested.
@dseide, thank you so much! This is an amazingly insightful reply.
Yes, collecting such information is a hobby of mine. Have you given any presentations about your work that are available as online video or podcast? I’m particularly interested in visualizing bioinformatics data in 3D, especially on the Meta Quest Pro.
Back to your original question: I think you can do this with only the native HDF5 library and its Java API, nothing else. I do not think you need JNI or another high-level library. I don’t use Java, so I might be missing something.
Please look at these two Java examples, which are included in several recent HDF5 releases. I think both are small, self-contained examples that generate their own test file and then read back array subsets, i.e. “slices”. A condensed sketch of the same pattern follows the paths below.
Within HDF5 source code releases:
java/examples/datasets/H5Ex_D_Hyperslab.java
Within HDF5 binary distributions:
share/HDF5Examples/JAVA/HDF5SubsetSelect.java
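For flavor, here is a condensed sketch of the hyperslab pattern those examples use, written against the hdf.hdf5lib JNI wrapper. The file name "my.h5", the dataset path "/counts", and the int element type are placeholders I made up; error handling is omitted:

import hdf.hdf5lib.H5;
import hdf.hdf5lib.HDF5Constants;

public class HyperslabSketch {
    public static void main(String[] args) throws Exception {
        // Open the file and dataset read-only (placeholder names).
        long file = H5.H5Fopen("my.h5", HDF5Constants.H5F_ACC_RDONLY, HDF5Constants.H5P_DEFAULT);
        long dset = H5.H5Dopen(file, "/counts", HDF5Constants.H5P_DEFAULT);
        long filespace = H5.H5Dget_space(dset);

        // Select 1000 elements starting at offset 1,000,000 in the 1-D dataset.
        long[] start = { 1_000_000L };
        long[] count = { 1000L };
        H5.H5Sselect_hyperslab(filespace, HDF5Constants.H5S_SELECT_SET, start, null, count, null);

        // Matching in-memory dataspace, then read just that slice.
        long memspace = H5.H5Screate_simple(1, count, null);
        int[] buffer = new int[1000];
        H5.H5Dread(dset, HDF5Constants.H5T_NATIVE_INT, memspace, filespace,
                HDF5Constants.H5P_DEFAULT, buffer);

        H5.H5Sclose(memspace);
        H5.H5Sclose(filespace);
        H5.H5Dclose(dset);
        H5.H5Fclose(file);
    }
}

Only the selected 1000 elements cross the library boundary, which is the same idea NetCDF-Java implements for you under the hood.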
For SIS-JHDF5, the latest documentation can be found here:
https://openbis.ch/javadoc/jhdf5/19.04.1/
That is linked from the FAQ above: https://unlimited.ethz.ch/pages/viewpage.action?pageId=92865195. I did notice that these links were updated on November 10th.
The source code is available here:
@kittisopikulm
Good to see that they finally made the documentation public. Reading the data bit by bit doesn’t seem to be possible with their library anyway. They don’t respond to messages. Discarded.
@dave.allured
Plain JNI is very cumbersome and error-prone to use from Java. It feels unnatural and needs some kind of layer on top to encapsulate its complexity from the pure Java code base. That means extra work, and since someone has already done it (NetCDF-Java), I wouldn’t do it myself.
Thanks again everybody for your input.