Where can I find a direct download link to the 1.10.x CentOS binaries? The downloads on the website seem to require a login. I'm updating our continuous integration system from 1.8.x. We used to curl the 1.8.x RPM directly from… but now I'd like to find a direct link to the 1.10.3 .tar.gz with CentOS 7 binaries.
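To make the use case concrete, the kind of CI step we would like to keep working is roughly this (the URL here is only a placeholder, since a real direct link is exactly what we're looking for):

```sh
#!/bin/sh
# Hypothetical CI fetch step; the URL is a placeholder, not a real link.
set -eu

TARBALL=hdf5-1.10.3-linux-centos7-x86_64-gcc485-shared.tar.gz
URL="https://example.org/hdf5/releases/${TARBALL}"   # placeholder

curl --fail --location --silent --show-error -o "${TARBALL}" "${URL}"
tar -xzf "${TARBALL}"
```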
Great question. First, I wanted to explain WHY we did this… We’ve intentionally put the source and binaries behind registration to get a better understanding of who’s using our software, why, and which forms (e.g. which binaries). I’m sure everyone can appreciate all the hard work that goes into creating and maintaining HDF5, all of which requires sustainable funding. We use these registration insights to drive sustainability, e.g. determine which binaries are for everyone vs. Enterprise subscribers only, send out messaging regarding new software, etc.
So back to your question: if there is a way to ensure that we retain the same level of tracking on downloads as we do currently (who, when, how many times, which binaries, etc.), then we could absolutely create a new option to support CI.
Since we’re not the experts on this, we would welcome your input and the feedback from the community on how to meet these requirements with RPM or similar approaches. Thanks!
Your web logs would have the following information for each download:
Date, Time, File Downloaded, IP Address of the download, total bytes downloaded. I use this information to figure out who is downloading our own binaries. What I don't get is the "why": are they building it as part of a larger system, SDK, or framework? What do they intend to do with it? But having the numbers was enough for our particular funding agency to show that our project is being downloaded.
For HDFView, you could implement an "Is there an update available?" check as an opt-in feature. Again, have it ping a web page to get the current version. Then have a script grep through the log files (or use something like Splunk) to look for the downloads. You start to get information on what versions are being used and by whom (via reverse DNS lookup). For HDF5 binaries, see above.
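A minimal sketch of that opt-in check, assuming a hypothetical plain-text "latest version" endpoint on your web server (every hit then shows up in the logs):

```sh
#!/bin/sh
# Opt-in update check: fetch the latest version string from a hypothetical
# endpoint and compare it against the locally installed version.
set -eu

CURRENT_VERSION="3.0.0"                                   # version of this install (illustrative)
VERSION_URL="https://example.org/hdfview/latest-version"  # hypothetical endpoint

LATEST=$(curl --fail --silent "${VERSION_URL}")
if [ "${LATEST}" != "${CURRENT_VERSION}" ]; then
    echo "A newer HDFView release (${LATEST}) is available."
fi
```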
From one small business to another: I understand the statistics you are trying to gather, but putting up walls to easy integration really does not help, and in the long term it hurts. You will lose statistics quickly, which then makes it harder to get funding because it looks like fewer people are using HDF5 when in fact that may NOT be what is happening. For our project I mirror the sources on our own web server, and every developer on our project then downloads HDF5 from our server in order to run the automated build script.
You could place the download links on your site, but for those who come in through a web browser, the download starts and then the page refreshes with an optional short survey asking how they intend to use it. Those of us using CI services and automated build scripts can then use the direct download link. I think I would rather have a total download count to show my funding agency that the numbers are staying level or increasing, rather than know some information about who is downloading but have a seemingly decreasing number of downloads.
To back that up, I just did a quick grep through our Apache log file for a 5-day period in September and counted 46 downloads from 8 unique IP addresses. This was for the 1.8.19 and 1.8.20 sources. Doing some reverse DNS lookups on the IP addresses showed me that a few large universities in the US and a few institutions in Europe downloaded the source code.
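For anyone wanting to repeat that, the count is just a short pipeline over a combined-format Apache access log (the log path and filename pattern are examples from our setup):

```sh
#!/bin/sh
# Count downloads and unique client IPs for the HDF5 source tarballs in an
# Apache access log, then reverse-resolve the IPs (paths/patterns are examples).
LOG=/var/log/httpd/access_log

grep -E 'GET .*hdf5-1\.8\.(19|20).*\.tar\.gz' "${LOG}" > hits.txt
echo "downloads:  $(wc -l < hits.txt)"
echo "unique IPs: $(awk '{print $1}' hits.txt | sort -u | wc -l)"

# reverse DNS lookup for each unique client IP
awk '{print $1}' hits.txt | sort -u | while read -r ip; do
    host "${ip}"
done
```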
Hosting the sources on GitHub would allow you to track downloads, and the same goes for hosting the binaries there. I am pretty sure you can get the download statistics for each binary.
If you want to know who we are, we would be fine with including this information in some custom HTTP header in the GET requests that our CI does.
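For example, something like this on our side would tell you exactly who is fetching the file (the header name, contact details, and URL are only a suggestion, nothing official):

```sh
#!/bin/sh
# CI download that identifies the downloader via a custom HTTP header.
# Header name, contact details and URL are only a suggestion.
set -eu

TARBALL=hdf5-1.10.3-linux-centos7-x86_64-gcc485-shared.tar.gz
URL="https://example.org/hdf5/releases/${TARBALL}"   # placeholder

curl --fail --location \
     -H "X-Downloader: Example Corp CI <build@example.com>" \
     -o "${TARBALL}" "${URL}"
```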
For now we'll just have to put a copy of hdf5-1.10.3-linux-centos7-x86_64-gcc485-shared.tar.gz on a web server of our own, so that our CI can reach it. I guess I could try to write a more advanced fetching script that logs me in on hdfgroup.org first to get to the download, but I'd rather not.
We are a company that builds a drill core analysis machine. To look at the analysis results, we’ve built a desktop application. Some of the analysis result is stored as HDF5 files, so the application uses the HDF5 library to read them.
On our Windows build worker, we build HDF5 from source. Updating is a matter of manually logging into the VirtualBox machine and re-building.
On macOS it’s the same story, but the machine is physical instead of virtual.
For our Ubuntu package builds, we use HDF5 from official Ubuntu package repositories.
For our AppImage build (a single-file application distribution format for Linux), we do the build in a CentOS 7 Docker container, and this is where it’s convenient to just pull the pre-built binaries. The old CentOS 6 RPMs were perhaps a little more convenient, but a .tar.gz like it is now is perfectly fine as well.
If the CentOS 7 binaries were Enterprise-only or some such, we'd just build them ourselves instead, like we do on Windows/macOS. We would cache the result, so it wouldn't add much time to a regular build.
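If it came to that, the container step would look roughly like this, with the install prefix kept in the CI cache between runs (version, paths, and the source URL are illustrative):

```sh
#!/bin/sh
# Build HDF5 from source inside the CentOS 7 container, reusing a cached
# install when available (version, paths and source URL are illustrative).
set -eu

VERSION=1.10.3
PREFIX=/opt/hdf5-${VERSION}
CACHE=/ci-cache/hdf5-${VERSION}-install.tar.gz

if [ -f "${CACHE}" ]; then
    tar -xzf "${CACHE}" -C /          # restore the cached install
else
    curl --fail -L -o hdf5-${VERSION}.tar.gz \
        "https://example.org/src/hdf5-${VERSION}.tar.gz"   # placeholder source URL
    tar -xzf hdf5-${VERSION}.tar.gz
    cd hdf5-${VERSION}
    ./configure --prefix="${PREFIX}" --enable-shared
    make -j"$(nproc)"
    make install
    tar -czf "${CACHE}" -C / "opt/hdf5-${VERSION}"   # save for the next build
fi
```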
Your web logs would have the following information for each download: Date, Time,
File Downloaded, IP Address of the download, total bytes downloaded.
We have this as well, but the registration gives us something invaluable that you can’t get from logs: email address. We don’t receive any sustaining funding from any agency simply based on usage: every dollar we earn is driven by consulting / engineering work for our clients or – increasingly – Enterprise Support subscriptions. Without an email address to start these consulting or support discussions, we would be at a significant disadvantage.
… You will lose statistics quickly which then means it is harder to get funding because it looks
like less people are using HDF5 when in fact that may NOT be what is happening… I just did a
quick grep through our apache log file for a 5 day period in September and counted 46
downloads to 8 unique IP addresses.
… I think I would rather have the total number of downloads to show my
funding agency that those numbers are staying level or increasing…
If you can share the secret of getting funding for # downloads, we would definitely like to hear about it! We receive absolutely no funding based on actual or even perceived usage, just like Apache doesn't make money based on Hadoop downloads. But for what it's worth, our usage is exploding, e.g.
1.10.3 was just released on Aug 22, 2018
Over the past 44 days, we’ve had > 20,000 downloads of 1.10.3 source + binaries… and this rate is growing every month. This is not even counting the other currently supported versions of HDF5 and HDF4.
So long story short: if we can’t directly and quickly identify specific users and companies who are using HDF5, we will be significantly hampered in our ability to generate the revenue needed to sustain HDF5 through consulting and support.
Yes, something like that sounds fine to me, though obviously over something encrypted like HTTPS, as we wouldn’t want login details to go over insecure channels.
I don’t know how your authentication backend for hdfgroup.org is set up, but I imagine you could configure it to be an OAuth2 provider, and then you could require a valid OAuth2 access token for the direct downloads. Just HTTP Basic Authentication over HTTPS would work as well, and might be a bit simpler for CI scripts like our use case. But we would be fine either way, as long as it’s something standard (and not an interactive website login like it is now).
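Either variant stays a one-liner on the client side; here is a rough sketch, with the URL, credentials, and token as placeholders coming from the CI secret store:

```sh
#!/bin/sh
# Two client-side variants for an authenticated direct download.
# URL, credentials and token are placeholders from the CI secret store.
set -eu

TARBALL=hdf5-1.10.3-linux-centos7-x86_64-gcc485-shared.tar.gz
URL="https://example.org/hdf5/releases/${TARBALL}"   # placeholder

# HTTP Basic Authentication over HTTPS
curl --fail --location --user "${HDF_USER}:${HDF_PASSWORD}" -o "${TARBALL}" "${URL}"

# ...or an OAuth2 bearer access token obtained at registration time
curl --fail --location -H "Authorization: Bearer ${HDF_TOKEN}" -o "${TARBALL}" "${URL}"
```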
The idea of a username and password in the download URL would be good.
Or a hybrid: binaries need registration for download, and source doesn't? Or older versions of the source don't?
I, too, have build scripts that invoke an automatic source download and reach out for specific versions. I would like to bypass the manual download.
Additional ideas:
* upon registration, a token is provided for subsequent downloads, similar to the above
* allow the upload of an SSH public key for SSH+git based downloads (a rough sketch follows this list)
* or create a private GitHub account, which knows how to do all this accounting for registered members
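For the SSH+git idea, the fetch on the CI side could be as simple as this (the server address and tag name are hypothetical):

```sh
#!/bin/sh
# ssh+git based download of a specific release tag, authenticated by the
# SSH key registered with the account (server and tag name are hypothetical).
set -eu

git clone --depth 1 --branch hdf5-1_10_3 \
    ssh://git@git.example.org/hdf5/hdf5.git hdf5-1.10.3
```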
Raymond
I think a username and password in the URL would be less than ideal, as it will typically end up visible in web server logs. It could also leak to another site through the Referer header, exposing it to the admin of the referred-to site (talking about browsers now, not curl). A proper authentication mechanism is better than putting secrets in the URL.
I personally support HTTPS or FTP using account credentials, or OAuth2, but definitely not credentials in the URL. I think the real danger with the current implementation is that lots of people will start mirroring (even if just internally, as stated by several people here), in which case you end up with even worse statistics about usage.
I think a token in the URL would be good. When we register we get the token, and we then use that token in a plain URL? We do our automatic builds using CMake, and I am not sure how complex an HTTP request we can make through that mechanism.
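If the token really can go in a plain URL, nothing more than this would be needed from a build script, and CMake's built-in file(DOWNLOAD) can fetch such a URL as well (the token and URL are placeholders):

```sh
#!/bin/sh
# Token-in-URL style download: the registration token is just a query
# parameter, so any HTTP client can use it. Token and URL are placeholders.
set -eu

TARBALL=hdf5-1.10.3-linux-centos7-x86_64-gcc485-shared.tar.gz
URL="https://example.org/hdf5/releases/${TARBALL}?token=${HDF_DOWNLOAD_TOKEN}"

curl --fail --location -o "${TARBALL}" "${URL}"
```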
If they host on a private GitHub account, they have to pay per collaborator. There are two problems there: the per-user fee, and they would have to manually manage the users who have access. If HDF5 is being downloaded as much as they think it is, then just managing the users would be a full-time job.
Also, we post the binaries on a public server because our collaborators also need access to them. While we don't advertise the location on our server and there are no direct links to it, the location probably isn't that hard to find if you know our project and you like to scrape through GitHub projects.
Thank you for the insights into how the HDF Group handles its funding. I think as a community we should all be interested in seeing the HDF Group succeed. Having a successful business back the project means that as a community we can hopefully be assured of the regular updates that we have enjoyed over the last 10 years. Hopefully we can find a middle ground that gives developers automated downloads for our CI machines while still providing some of the basic download statistics that The HDF Group is seeking.