inb4: IPFS doesn’t work, unfortunately, as you cannot provide the hash of an arbitrarily large file and retrieve it from the network. An IPFS content ID (CID) is a hash of the tree of chunks, so changes to chunk size can also change the hash!
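To make the chunk-size point concrete, here is a minimal sketch. It is a toy "hash of chunk hashes", not the real IPFS CID algorithm (which also involves DAG layout, codecs, and multihash encoding); the function names and sample data are made up for illustration.

```python
import hashlib


def flat_sha256(data: bytes) -> str:
    """Plain SHA-256 over the whole file, like sha256sum prints."""
    return hashlib.sha256(data).hexdigest()


def chunked_root(data: bytes, chunk_size: int) -> str:
    """Toy stand-in for a chunk tree: hash each chunk, then hash the
    concatenated chunk digests. Assumption: one flat level only."""
    leaves = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(leaves)).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of sample data

flat = flat_sha256(data)
root_256k = chunked_root(data, 256 * 1024)
root_1m = chunked_root(data, 1024 * 1024)

print("flat     ", flat)
print("256 KiB  ", root_256k)
print("1 MiB    ", root_1m)
```

Same bytes, three different identifiers: the flat hash matches neither root, and the two roots differ from each other. Rerunning with the same chunk size does give the same root again.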
Basically, I’d like to take the SHA256, SHA3, blake2, md5, of a file and either retrieve it from a network or get a list of sources for that file. Does something like that exist already or will I have to build it?
If I have to build it, it will be a really simple, dumb HTTP service with
GET /uris/:hash:?alg=sha256|md5|blake
POST /uri/:hash:
with the contents being a URI to the file.
Supported URI schemes would probably be HTTP/S and FTP. Maybe P2P protocols like IPFS and, if there’s a way to target a specific file in a torrent, maybe magnet links too. But that feels like risky territory.
Of course for hashing requests it would have a limited task queue (maybe 5 in parallel?), rate limiting by IP, and a size limit for retrieval (1GB feels like more than enough).
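A rough sketch of the core of such a service, leaving out HTTP, the task queue, and rate limiting. Everything here is hypothetical: `UriIndex`, `stream_digest`, and the mapping of "blake" to BLAKE2b are assumptions, and the `stream` argument stands in for the body the service would fetch from the submitted URI before trusting it.

```python
import hashlib
import io
from collections import defaultdict
from typing import BinaryIO

MAX_BYTES = 1 << 30  # the 1GB retrieval limit floated above
ALGORITHMS = {
    "sha256": hashlib.sha256,
    "md5": hashlib.md5,
    "blake": hashlib.blake2b,  # assumption: "blake" means BLAKE2b
}


def stream_digest(stream: BinaryIO, alg: str, max_bytes: int = MAX_BYTES) -> str:
    """Hash a stream in 64 KiB reads, aborting once the size limit is passed."""
    hasher = ALGORITHMS[alg]()
    total = 0
    while chunk := stream.read(64 * 1024):
        total += len(chunk)
        if total > max_bytes:
            raise ValueError("file exceeds size limit")
        hasher.update(chunk)
    return hasher.hexdigest()


class UriIndex:
    """In-memory map from (algorithm, hash) to the set of known source URIs."""

    def __init__(self) -> None:
        self._sources = defaultdict(set)

    def post(self, alg: str, file_hash: str, uri: str, stream: BinaryIO) -> bool:
        """Record the URI only if the fetched content really has that hash."""
        if stream_digest(stream, alg) != file_hash.lower():
            return False
        self._sources[(alg, file_hash.lower())].add(uri)
        return True

    def get(self, alg: str, file_hash: str) -> list:
        """Return the known sources for a hash, or an empty list."""
        return sorted(self._sources.get((alg, file_hash.lower()), ()))


index = UriIndex()
body = b"hello\n"
digest = hashlib.sha256(body).hexdigest()
index.post("sha256", digest, "https://example.org/hello.txt", io.BytesIO(body))
print(index.get("sha256", digest))
```

Verifying on POST is what makes the hashing queue and size limit necessary; an index that trusted submitters would be cheaper but easy to poison.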
Can’t think of a way to do it with a DHT 🤷
I sense an XY Problem scenario. Can you explain what you’re seeking to ultimately build and what requirements you have?
Does the solution need to be distributed? Does the retrieval need to complete ASAP or can wait until data becomes available? What sort of reliability/availability does this need? If only certain hash algorithms can be supported, which ones do you need and why?
I ask this because the answer will be drastically different if you’re building the content distribution system for a small video game versus building the successor to Kim Dotcom’s Mega file-sharing service.
It’s quite simple: I want to retrieveFile(fileHash), where fileHash is the output of md5sum $file or sha256sum $file, or whatever other hashing algorithm exists.
This seems like a restatement of X. We still don’t understand Y. I’m especially confused about:
There was some hint that maybe you’re concerned about reproducibility for CIDs? If you fix the block size, hash algorithm, and content codec, you’ll get consistent results. SHA-256 also processes data in blocks of 64 bytes, as it happens.
Anyway, Wikipedia has a list of content-addressable store implementations. A couple that stand out to me are git and git-annex.
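One caveat with git as a store, sketched below: its blob IDs are not the plain hash of the file either. In the default SHA-1 object format, git hashes a short header ("blob", the size, a NUL byte) followed by the content, so the ID will not match what sha1sum prints. The helper name `git_blob_id` is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID that `git hash-object` assigns to a blob (SHA-1 format):
    SHA-1 over the header "blob <size>\\0" followed by the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
print(empty_id)                    # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(hashlib.sha1(b"").hexdigest())  # plain SHA-1 of the same (empty) content
```

So a lookup by `sha256sum` or `md5sum` output would still need a separate index on top of git; git-annex is closer, since its keys can embed a plain checksum of the file.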