• dudeami0@lemmy.dudeami.win
    1 day ago

    The only way I can think of is to require users to authenticate themselves, but this isn’t much of a hurdle.

    To get into the details of it, what do you define as an AI bot? Are you worried about scrapers grabbing the contents of your website? What are the activities of an “AI bot”? Are you worried about AI bots registering and using your platform?

    The real answer is that not even Cloudflare will fully defend you from this. If anything, Cloudflare is just making sure it gets paid for access to your website by AI scrapers. As someone who has worked around bot protections (albeit in a different context than web scraping), I can say it’s a game of cat and mouse. If neither you nor a company you hire is actively working against automated access, you lose, because the other side is active.

    Just think of your point that they are using residential IP addresses. How do they get these addresses? They provide browser addons/extensions that offer some service (generally free VPNs) in exchange for access to your PC, and therefore your internet connection, under the contract you agree to. Any addon can do the same: if it has permission to read any website, it can scrape those websites through legitimate users for whatever purpose its authors want. The recent exposure of the Honey scam highlights this, as it’s very easy to get users to install addons by selling them on the idea that they might save a small amount of money (or make money, for other programs). There will always be users compromised by addons/extensions, or even just viruses, that can extract the data you are trying to protect.

    • DaGeek247@fedia.io
      16 hours ago

      Just think of your point that they are using residential IP addresses. How do they get these addresses?

      You can ping the entire IPv4 address space in under an hour. If all you’re looking for is publicly available words written by people, you only have to poke port 80, and suddenly you have practically every small self-hosted website out there.

      • dudeami0@lemmy.dudeami.win
        14 hours ago

        When I say residential IP addresses, I mostly mean proxies using residential IPs, which allow scrapers to mask themselves as organic traffic.

        Edit: Your point stands that there are a lot of services without these protections in place, but a lot of services do protect against scraping.

    • ctag@lemmy.sdf.org (OP)
      20 hours ago

      Thank you for the detailed response. It’s disheartening to consider the traffic is coming from ‘real’ browsers/IPs, but that actually makes a lot of sense.

      I’m coming at this from the angle of AI bots ingesting a website over and over to obsessively look for new content.

      My understanding is that there are two reasons to try blocking this: to protect bandwidth from aggressive crawling, or to protect the page contents from AI ingestion. I think the former is doable, and the latter is an unwinnable task. My personal reason is that I’m an AI curmudgeon: I’d rather spend CPU resources blocking bots than serve any content to them.
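
For the bandwidth side, one common approach is a per-client token bucket. The following is only a rough sketch, not something anyone in this thread runs: the `RateLimiter` class, its `rate` and `burst` parameters, and the choice to key clients by IP address are all assumptions made for illustration. As noted above, scrapers behind residential proxies spread requests over many IPs, so keying by IP slows a crawler down rather than stopping it.

```python
import time


class RateLimiter:
    """Hypothetical per-client token bucket (illustrative only).

    Each client earns `rate` tokens per second, capped at `burst`.
    A request is allowed only if a whole token is available.
    """

    def __init__(self, rate=1.0, burst=5.0):
        self.rate = rate
        self.burst = burst
        # client key -> (tokens remaining, time of last request).
        # Never pruned here; a real deployment would expire old entries.
        self._buckets = {}

    def allow(self, client, now=None):
        # `now` can be injected for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(client, (self.burst, now))
        # Refill according to elapsed time, never exceeding the burst cap.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[client] = (tokens - 1.0, now)
            return True
        self._buckets[client] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = RateLimiter(rate=1.0, burst=3.0)
    # Five requests at the same instant from one (documentation-range) IP.
    print([limiter.allow("203.0.113.7", now=0.0) for _ in range(5)])
```

With `burst=3.0`, the first three simultaneous requests pass and the rest are refused until tokens refill. Refusing a request this way costs a dictionary lookup instead of a full page render, which fits the "spend CPU blocking rather than serving" preference.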