Hi there, my team is looking to develop a time-series database (TSDB) on top of the HDF5 file format. One issue that has been raised is that of a security model. What would be best practice for securing the data within a file, or the file itself?
Chunked datasets can be filtered – oh well – chunk by chunk, giving you the option to use your favourite cipher.
That explains the mechanism. As for securing something: cryptography is a complex field; wrongly applying a protocol will give you a false sense of security.
From what I’ve read, and what you’ve said, you are able to apply a user-defined filter for encryption, similar to how you specify the type of compression and the compression level.
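As a sketch of the idea only (this does not use the real HDF5 filter API – the function names and the HMAC-in-counter-mode keystream below are illustrative stand-ins, not a vetted cipher), a chunk-by-chunk encryption filter could look like this:

```python
import hashlib
import hmac

def keystream(key: bytes, chunk_index: int, length: int) -> bytes:
    # Derive a per-chunk keystream by running HMAC-SHA256 in counter mode,
    # so each chunk can be encrypted/decrypted independently -- the property
    # a chunk filter needs.
    out = bytearray()
    counter = 0
    while len(out) < length:
        msg = chunk_index.to_bytes(8, "big") + counter.to_bytes(8, "big")
        out.extend(hmac.new(key, msg, hashlib.sha256).digest())
        counter += 1
    return bytes(out[:length])

def apply_filter(key: bytes, chunk_index: int, chunk: bytes) -> bytes:
    # XOR with the keystream; the same call both encrypts and decrypts.
    ks = keystream(key, chunk_index, len(chunk))
    return bytes(a ^ b for a, b in zip(chunk, ks))

key = b"demo-key"
chunks = [b"chunk zero data", b"chunk one data"]
encrypted = [apply_filter(key, i, c) for i, c in enumerate(chunks)]
decrypted = [apply_filter(key, i, c) for i, c in enumerate(encrypted)]
assert decrypted == chunks
```

In the real library, such a transform would be registered as a user-defined filter on the dataset creation property list, the same place where compression filters are configured.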
Would this protect the information at the data level? Or would other users be able to access the data simply by knowing which type of encryption was used?
Just a follow-up question: would it not then be better to encrypt the whole file, metadata and data included?
Yes, you can apply encryption at the block level. I am not a crypto professional, although I have heard of a guy whose friend’s brother-in-law was. AFAIK a crypto system consists of a cipher, key management, and a key-exchange protocol.
If I recall correctly, the strength equals that of the weakest link in the chain. Usually (symmetric) block ciphers are used for large data, as they are fast; the problem is the key: both parties have to know it – bummer. OK, why not randomly generate one and call it a session key, then encrypt this session key with an asymmetric cipher based on some composite number that is really, really hard to factor – say, the product of two big primes?
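The hybrid scheme described above can be illustrated with a toy RSA key wrap. The primes here are tiny textbook values, so this is purely a sketch of the mechanism, not usable crypto:

```python
import secrets

# Toy RSA hybrid scheme: generate a random session key, then encrypt it with
# the recipient's public key. Tiny primes, purely illustrative -- NOT secure.
p, q = 61, 53                  # two "big" primes (toy-sized here)
n = p * q                      # public modulus; hard to factor at real sizes
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent (modular inverse of e)

session_key = secrets.randbelow(n)   # random symmetric session key
cipher = pow(session_key, e, n)      # sender encrypts with the public (n, e)
recovered = pow(cipher, d, n)        # recipient decrypts with the private d
assert recovered == session_key
```

The bulk data would then be encrypted with a fast symmetric block cipher under `session_key`; only the small wrapped key travels under the slow asymmetric cipher.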
Well, how do you know for certain they were indeed primes? I don’t mean probabilistically certain, but deterministically… you don’t… OK, you drop prime numbers and find something else…
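For reference, this is what “probabilistically certain” means in practice – a minimal Miller–Rabin test, where a composite number survives each round with probability at most 1/4, so many rounds give overwhelming (but never deterministic) confidence:

```python
import random

def miller_rabin(n: int, rounds: int = 40) -> bool:
    """Probabilistic primality test: composites pass one round with
    probability <= 1/4, so 40 rounds make a false positive astronomically
    unlikely -- but still not a deterministic proof of primality."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False   # a is a witness that n is composite
    return True

assert miller_rabin(2**61 - 1)   # a known Mersenne prime
assert not miller_rabin(91)      # 7 * 13
```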
But here is the thing: the session key must be decoded and reside in memory somewhere. It turns out that if you freeze the memory – I mean with liquid nitrogen – the DRAM retains its contents far longer than the standard refresh interval… plug it into a device that copies the contents, and voilà…
So what do you mean by ‘safe’?
You should assume the attacker always knows the type of encryption being used (Kerckhoffs’s principle).
Not sure what you are ‘protecting’; as you see, you can never be certain – you just have to make access to the information more expensive than the value of the content.
Please note, all of the above is my personal opinion and not advice.
We’ve used encryption for HDF5 datasets in the past. In our use case we did the encryption not using a filter but on top of HDF5.
Against what or whom do you want to protect the data in the files? Is encryption/decryption speed an issue?
As Steven pointed out, the only thing that is secret is the encryption key. Everything else should be considered public.
We do not necessarily want to protect the data from specific people so much as protect it against unauthorized modification. Therefore, the “what” is more important than the “who” in our case.
Yes, encryption/decryption speed remains important, as we are storing real-time continuous data and would like to minimise the time spent encrypting it, as well as the time spent decrypting large amounts of data queried intermittently from the HDF5 file.
Just to go off-topic slightly: would leveraging Windows Integrated Security be an option to protect the data within an HDF5 file on Windows?
I only encrypt metadata – not raw data. This is easily possible with HDF5 1.8.x. Later HDF5 development completely messed up what is what, and this is now broken.
On Tue, 21 Apr 2020 at 15:26, Brett G Fleischer email@example.com wrote:
Thank you for your response.
I think that encrypting only the metadata should be sufficient if, and only if, no modifications to the raw data can be performed by anything other than the dedicated Single Writer.
This is tough, though, because what if an attacker is able to gain access to the HDF5 file? What is stopping them from copying it to their machine, editing the data, and then returning the edited or deleted file to where they acquired it?
You don’t need encryption. You need cryptographic signing.
Read up what the term cryptographic signature means, and you’ll know what you need to do.
The upside is that you only need to distribute the public key of the data provider, and everyone else can easily check the signature.
gpg can do the signing of the entire HDF5 file, or whatever file you want to protect against manipulation.
It can also verify a signature, provided you have the public key of the signature’s creator. That’s easy to distribute.
gpg can do encryption as well, but that’s not what you want or need to do.
Just the signing.
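As a sketch of the verification idea, here is a shared-key HMAC from the Python standard library as a stand-in, since gpg’s public-key signing isn’t in the stdlib. Note the limitation: anyone holding the HMAC key can both sign and verify, which is exactly what gpg’s public/private key split removes.

```python
import hashlib
import hmac

# Keyed-hash integrity check: a symmetric stand-in for the public-key
# signatures gpg produces. The tag is stored alongside the file; any change
# to the payload invalidates it.
key = b"shared-secret"
payload = b"contents of the HDF5 file"

tag = hmac.new(key, payload, hashlib.sha256).hexdigest()

# Verification: recompute and compare in constant time.
ok = hmac.compare_digest(tag, hmac.new(key, payload, hashlib.sha256).hexdigest())
assert ok

# A single modified byte produces a completely different tag.
tampered = payload.replace(b"HDF5", b"HACK")
bad = hmac.compare_digest(tag, hmac.new(key, tampered, hashlib.sha256).hexdigest())
assert not bad
```

With gpg, the equivalent workflow is a detached signature made with the provider’s private key and verified by anyone holding the public key, so the verifiers never need a signing secret.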
Without metadata you can’t find raw data.
This is absolutely correct if you need to ensure your data is not tampered with or corrupted. I want my data to not be ridiculously easy to view.
Is this entirely correct?
If an unauthorised person gains access to the whole file – granted, the metadata may be encrypted – they are still able to copy that file to a new machine. Once the file is on the new machine, they can simply use an HDF5 viewer to see the datasets within, make their changes, and return the file to the original location.
Correct me if I’m wrong, but I do not think that:
is entirely true.
I may be misunderstanding your point, but no, this does not work. HDFView, or any version of the library, fails to open the file on any computer. Just to make this clearer: I am encrypting the HDF5 metadata, as opposed to the HDF5 raw data, both of which live in the same file.
So, from my first post on this thread: my team and I want to create a time-series database on top of the HDF5 file format, using as an example the following structure:
The ModelDescription group houses the relational information for the time-series data within the Results group. The metadata in our case would be all the information within the relational group ModelDescription. Is this the same metadata that you have discussed above? Or is the metadata you discuss above the metadata within an actual dataset, e.g.:
Either way, if the data in one group within an HDF5 file is encrypted and the data in another is not, will the file still be readable? If so, the encryption of the metadata (ModelDescription) will be useless.
If you encrypt the metadata but not the data, you only make life a bit harder for an attacker; you cannot stop them.
The data that we produce typically contains an awful lot of structure, which attackers can use to make guesses about where the payload data is located and what it encodes.
For instance, it’s absolutely trivial to recognize a chunk of floating point data.
And then you see the size of that floating point data chunk.
And then you make connections with how HDF5 typically lays out a file.
And then you have a pretty precise guess about the dimensionality of the data, how many variables are stored,
and you can view each of these alleged variables.
Looking at such variables, you can usually tell what kind of data it encodes, and where your previous guesses were wrong.
A sea-ice variable looks very different from a cloud-cover variable, which looks very different from a surface temperature.
With each guess, the attacker gains more leverage on the rest of the data.
And once the attacker knows how the data is interpreted, they can also manipulate it.
Because they’ll just have reconstructed the metadata that you tried to hide behind the encryption.
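A small illustration of how recognizable unencrypted floating-point chunks are: IEEE 754 doubles from a narrow physical range share their high (sign/exponent) bytes, so the byte histogram of a chunk is sharply peaked rather than uniform. The values below are made-up temperatures, purely for illustration:

```python
import collections
import struct

# 1000 doubles in a narrow physical range, e.g. surface temperatures in kelvin.
values = [280.0 + i * 0.01 for i in range(1000)]
raw = struct.pack("<1000d", *values)

# Take the most significant byte of each little-endian double: it holds the
# sign bit and the top exponent bits, which barely vary across the range.
high_bytes = collections.Counter(raw[7::8])
print(high_bytes.most_common(3))   # one byte value dominates the whole chunk
```

Random or properly encrypted bytes would show a near-flat histogram instead; a peaked one is the attacker’s tell-tale that this region is a chunk of floats.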
Of course, the question remains whether your threat model includes attackers determined enough to do the above steps,
but you cannot protect data from manipulation by simply encrypting metadata.
If you want the data to stay secret, you need to encrypt both data and metadata.
If you want the data not to be manipulated, you need to sign your data, and check that signature.
If you want to be safe, do both.
Actually, it’s quite simple.
In HDF5 parlance and in the format document, metadata is more or less whatever is not raw data, i.e. not dataset contents. I am talking about HDF5 metadata objects. In my version of HDF5, these objects are written to the file through a separate path from the raw data. As my raw data is often hundreds of gigabytes, I prefer to encrypt only the metadata objects and not the raw data.
On Fri, 24 Apr 2020 at 08:26, Brett G Fleischer firstname.lastname@example.org wrote: