inb4: IPFS doesn’t work, unfortunately, because you cannot provide the hash of an arbitrarily large file and retrieve it from the network. IPFS content IDs (CIDs) are a hash of the tree of chunks, so changes to chunk size can also change the hash!
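To illustrate the chunking point, here is a rough sketch. It is not IPFS's actual CID algorithm, just a toy "hash of chunk hashes" (the `chunked_root` helper is made up for this example), showing that a flat SHA-256 of the file stays fixed while a chunk-based hash moves with the chunk size.

```python
import hashlib


def flat_sha256(data: bytes) -> str:
    """Plain SHA-256 over the whole file: independent of any chunking."""
    return hashlib.sha256(data).hexdigest()


def chunked_root(data: bytes, chunk_size: int) -> str:
    """Toy stand-in for a chunk-tree hash (NOT the real IPFS scheme):
    hash every chunk, then hash the concatenation of the chunk digests."""
    digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(digests)).hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of sample content

print("flat        :", flat_sha256(data))
print("1 KiB chunks:", chunked_root(data, 1024))
print("4 KiB chunks:", chunked_root(data, 4096))
```

Same bytes, but the two chunked values differ, which is why a plain file hash can't simply be handed to a chunk-addressed network.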
Basically, I’d like to take the SHA-256, SHA-3, BLAKE2, or MD5 hash of a file and either retrieve it from a network or get a list of sources for that file. Does something like that exist already or will I have to build it?
If I have to build it, it will be a really simple, dumb HTTP service with
GET /uris/:hash:?alg=sha256|md5|blake
POST /uris/:hash:
with the contents being a URI to the file
Supported URI schemes would probably be HTTP/S and FTP. Maybe P2P protocols like IPFS, and if there’s a way to target a specific file in a torrent, maybe magnet links too. But that feels like risky territory.
Of course for hashing requests it would have a limited task queue (maybe 5 in parallel?), rate limiting by IP, and a size limit for retrieval (1GB feels like more than enough).
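A minimal sketch of the logic behind those two endpoints, with no HTTP layer, task queue, or rate limiting. Everything named here (`HashIndex`, `register`, `lookup`, mapping `blake` to BLAKE2b) is an assumption for illustration. The service is assumed to hash the submitted file itself before listing it, and the download is abstracted as an iterable of byte chunks so the sketch runs offline.

```python
import hashlib
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "ftp"}
MAX_BYTES = 1024 ** 3  # the 1GB retrieval limit mentioned above
# Assumption: "blake" in the query string means BLAKE2b.
ALGORITHMS = {"sha256": hashlib.sha256, "md5": hashlib.md5, "blake": hashlib.blake2b}


class HashIndex:
    """In-memory map of (algorithm, hex digest) -> set of URIs."""

    def __init__(self):
        self._uris = {}

    def register(self, alg, digest, uri, chunks, max_bytes=MAX_BYTES):
        """POST side: record `uri` only if the streamed content matches `digest`."""
        if alg not in ALGORITHMS:
            raise ValueError(f"unsupported algorithm: {alg}")
        if urlparse(uri).scheme not in ALLOWED_SCHEMES:
            raise ValueError(f"unsupported URI scheme: {uri}")
        hasher = ALGORITHMS[alg]()
        total = 0
        for chunk in chunks:
            total += len(chunk)
            if total > max_bytes:
                raise ValueError("file exceeds size limit")
            hasher.update(chunk)
        if hasher.hexdigest() != digest.lower():
            return False  # content does not match the claimed hash
        self._uris.setdefault((alg, digest.lower()), set()).add(uri)
        return True

    def lookup(self, alg, digest):
        """GET side: list the known URIs for a hash (empty list if none)."""
        return sorted(self._uris.get((alg, digest.lower()), ()))


index = HashIndex()
body = b"hello world"
digest = hashlib.sha256(body).hexdigest()
index.register("sha256", digest, "https://example.org/hello.txt", [body])
print(index.lookup("sha256", digest))
```

Verifying at registration time keeps obviously wrong entries out of the index, but a client should still re-hash whatever it downloads.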
Can’t think of a way to do it with a DHT 🤷
You have to fucking hope no one figures out how to reverse engineer the algorithm you choose.
Why?
If two files have the same hash, you may receive the file you request by hash, or you may receive a different, possibly malicious file.
https://en.m.wikipedia.org/wiki/Collision_attack
Strong cryptographic hashes are resistant to such attacks, but md5 is relatively weak.
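One way a client could act on that, sketched under assumptions: the `verify` helper and its allowlist are made up here, and the choice of which algorithms count as "strong" is a judgment call rather than a standard. The idea is to only trust a match when it comes from a collision-resistant hash.

```python
import hashlib

# Assumption: algorithms this client is willing to trust for integrity checks.
STRONG_ALGORITHMS = {"sha256", "sha3_256", "blake2b"}


def verify(data: bytes, alg: str, expected_hex: str) -> bool:
    """Return True only if `data` hashes to `expected_hex` under a strong algorithm."""
    if alg not in STRONG_ALGORITHMS:
        raise ValueError(f"refusing to trust weak or unknown algorithm: {alg}")
    actual = hashlib.new(alg, data).hexdigest()
    return actual == expected_hex.lower()


payload = b"some downloaded bytes"
print(verify(payload, "sha256", hashlib.sha256(payload).hexdigest()))
```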
Absolutely. An example of a malicious collision would be to request the file with the SHA-1 of 38762cf7f55934b34d179ae6a4c80cadccbb7f0a. But… there’s two of them here.
MD5 is so broken that its former status as a cryptographic hash function has been stripped. And efforts are underway to replace SHA-1 where it’s used, since although it takes some prerequisites to intentionally create a SHA-1 collision today, it’s worth remembering that “attacks always get better, they never get worse”.
I’m not sure what your concern is. I’d basically like to call a function
retrieveFile(fileHash)
and get bytes back. Or call
retrieveFileLocations(fileHash)
and get URIs back to where the file can be downloaded. Also, it’ll be open source, so nothing to reverse engineer.
md5, for example, is already vulnerable. People have figured out how to craft different files that share the same md5 hash, meaning someone could engineer deliberate hash collisions and serve you a file other than the one you expected.
SHA-256 doesn’t (I think) have this issue, so far, hah.
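For what it's worth, a rough sketch of how a client-side retrieveFile could guard against a bad source: try each candidate location and only return bytes whose SHA-256 matches the requested hash. The `retrieve_file` signature and the pluggable `fetch` callable are assumptions for illustration; a dict stands in for the network so the example runs offline.

```python
import hashlib
from typing import Callable, Iterable, Optional


def retrieve_file(
    file_hash: str,
    locations: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Optional[bytes]:
    """Return the first fetched body whose SHA-256 equals `file_hash`, else None."""
    wanted = file_hash.lower()
    for uri in locations:
        try:
            body = fetch(uri)
        except OSError:
            continue  # source unreachable, try the next one
        if hashlib.sha256(body).hexdigest() == wanted:
            return body
        # Mismatch: the source is wrong or malicious, so skip it.
    return None


good = b"the real file"
sources = {
    "https://a.example/file": b"tampered",  # would be rejected
    "https://b.example/file": good,
}
result = retrieve_file(hashlib.sha256(good).hexdigest(), sources, sources.__getitem__)
print(result)
```

With a strong hash, a source serving the wrong bytes just gets skipped. With md5, this same check could be fooled by a crafted collision.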