inb4: IPFS doesn’t work for this, unfortunately, because you cannot provide the plain hash of an arbitrarily large file and retrieve it from the network. IPFS content IDs (CIDs) are a hash of the tree of chunks, so changing the chunk size can also change the hash!
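
To illustrate the point, here is a toy sketch in Python. It is not the real IPFS CID algorithm (that also involves a DAG layout, codecs, and multihash encoding); `chunked_root_hash` is a made-up stand-in that hashes each chunk and then hashes the concatenated chunk digests. It only shows why an ID derived from chunks depends on the chunk size, while a flat hash of the bytes does not.

```python
import hashlib


def flat_hash(data: bytes) -> str:
    """Plain SHA-256 over the whole byte string, like sha256sum."""
    return hashlib.sha256(data).hexdigest()


def chunked_root_hash(data: bytes, chunk_size: int) -> str:
    """Toy chunk-based ID (hypothetical, NOT the IPFS algorithm):
    hash every chunk, then hash the concatenation of those digests."""
    digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(digests)).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of sample data

# Same bytes, same chunk size: the ID is reproducible.
same = chunked_root_hash(data, 256 * 1024) == chunked_root_hash(data, 256 * 1024)

# Same bytes, different chunk size: the ID changes.
differs = chunked_root_hash(data, 256 * 1024) != chunked_root_hash(data, 1024 * 1024)
```

So someone who only knows the `sha256sum` output has no way to derive the chunk-based ID without having the file itself.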

Basically, I’d like to take the SHA-256, SHA-3, BLAKE2, or MD5 hash of a file and either retrieve the file from a network or get a list of sources for it. Does something like that exist already, or will I have to build it?

If I have to build it, it will be a really simple, dumb HTTP service with:

  • GET /uris/:hash:?alg=sha256|md5|blake
  • POST /uri/:hash: with the contents being a URI to the file
    Supported URI schemes would probably be HTTP/S and FTP. Maybe P2P protocols like IPFS too, and magnet links if there’s a way to target a specific file in a torrent. But that feels like risky territory.
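
A rough sketch of the lookup logic behind those two endpoints, in Python and without any HTTP framework. Everything here is an assumption for illustration: the `HashIndex` class, its method names, and the in-memory dict are hypothetical, and the step where the server downloads the file to verify the hash is left out.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Values taken from the post; a real service might accept more.
ALLOWED_ALGS = {"sha256", "md5", "blake"}
ALLOWED_SCHEMES = {"http", "https", "ftp"}


class HashIndex:
    """Hypothetical in-memory map of (algorithm, hash) -> set of source URIs."""

    def __init__(self):
        self._sources = defaultdict(set)

    def add_uri(self, alg: str, file_hash: str, uri: str) -> None:
        """Roughly what POST /uri/:hash: would do (without verification)."""
        if alg not in ALLOWED_ALGS:
            raise ValueError(f"unsupported algorithm: {alg}")
        scheme = urlparse(uri).scheme.lower()
        if scheme not in ALLOWED_SCHEMES:
            raise ValueError(f"unsupported URI scheme: {scheme!r}")
        # Hex digests are case-insensitive, so normalize to lowercase.
        self._sources[(alg, file_hash.lower())].add(uri)

    def get_uris(self, alg: str, file_hash: str) -> list[str]:
        """Roughly what GET /uris/:hash:?alg=... would return."""
        return sorted(self._sources.get((alg, file_hash.lower()), ()))


index = HashIndex()
index.add_uri("sha256", "ABC123", "https://example.org/file.iso")
sources = index.get_uris("sha256", "abc123")
```

Keying on the pair of algorithm and hash keeps an MD5 digest from colliding with a truncated digest of some other algorithm.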

Of course, for hashing requests it would have a limited task queue (maybe 5 in parallel?), rate limiting by IP, and a size limit for retrieval (1 GB feels like more than enough).
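
A minimal sketch of the size limit and the bounded parallelism, assuming the service hashes a byte stream as it downloads it. The names `hash_stream`, `hash_many`, and `TooLarge` are made up for this example, the limits are the guesses from above, and rate limiting by IP is not shown.

```python
import hashlib
import io
from concurrent.futures import ThreadPoolExecutor

MAX_BYTES = 1024 ** 3  # 1 GB size limit (guess from the post)
MAX_PARALLEL = 5       # at most 5 hashing tasks at once (guess from the post)


class TooLarge(Exception):
    """Raised when a stream exceeds the size limit."""


def hash_stream(stream, alg="sha256", max_bytes=MAX_BYTES, block=64 * 1024):
    """Hash a file-like object block by block, aborting past max_bytes."""
    hasher = hashlib.new(alg)
    total = 0
    while True:
        chunk = stream.read(block)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise TooLarge(f"stream exceeds {max_bytes} bytes")
        hasher.update(chunk)
    return hasher.hexdigest()


def hash_many(streams, alg="sha256", max_bytes=MAX_BYTES):
    """Hash several streams with at most MAX_PARALLEL worker threads."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        return list(pool.map(lambda s: hash_stream(s, alg, max_bytes), streams))


digest = hash_stream(io.BytesIO(b"hello world"))
```

Reading in blocks means the whole file never has to sit in memory, and the download can be dropped as soon as it crosses the limit.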

Can’t think of a way to do it with a DHT 🤷

  • tinkralge@programming.dev (OP)
    12 days ago

    It’s quite simple: I want to call retrieveFile(fileHash), where fileHash is the output of md5sum $file, sha256sum $file, or any other hashing tool.

    • hallettj@leminal.space
      12 days ago

      This seems like a restatement of X. We still don’t understand Y. I’m especially confused about:

      • Why are SHA-256 and friends ok, but IPFS CIDs are not? They have basically the same functionality.
      • Do you need a distributed network, or is a single server ok?

      There was some hint that maybe you’re concerned about reproducibility for CIDs? If you fix the block size, hash algorithm, and content codec, you’ll get consistent results. As it happens, SHA-256 also processes data internally in blocks of 64 bytes.

      Anyway Wikipedia has a list of content-addressable store implementations. A couple that stand out to me are git and git-annex.