I’m curious if anyone can shed some light on the specifics of the file hashing system in Cells:
Does the software itself ever compare the hash value generated on upload to a fresh one to establish file integrity, or does it rely on the storage to provide data with guaranteed integrity?
What exactly happens in Cells if a file’s checksum changes (due to corruption or other damage)?
Can the system flag files that fail an integrity check as “corrupted”, or otherwise make damaged files visible?
Currently, the hash is essentially used as an ETag, to easily detect any change to a file on the server. In particular, the Cells Sync client uses it to speed up change detection.
Your use case could indeed be interesting: since the hash is computed on the fly at upload time and then stored as metadata, it would be possible to recompute a file’s checksum by re-reading it from storage and comparing the two values. But nothing is currently implemented for that.
As a side note, be aware that v4 introduced a new hashing mechanism that computes hashes on fixed-size chunks (10MB), to ensure consistency between a file uploaded via multipart (where parts can arrive in any order) and one uploaded by a standard PUT request. This implies that multipart part sizes must be a multiple of 10MB.
Thanks for the insight, this is very useful information. Prior to making this forum post, I put together a quick system that generates client-side checksums using WebWorkers and spark-md5.js, with a chunking procedure similar to the one you describe in Cells. It works very smoothly so far, so I can confirm it looks like a valid approach.
I need to do some additional testing with extreme upload structures, but so far it seems the client can keep up nicely. I’ll keep you updated if I devise a system to replicate a “fixity”-style scheduled check, but it should be straightforward for me to implement an initial comparison between the pre- and post-upload values.