Granted, I really don’t know much about how all this works, but the thought occurred to me that Lemmy - as wonderfully open as it is, and without any kind of ‘disappearing messages’ or other privacy protecting functionality - is basically a smorgasbord for AI scrapers. Or am I (hopefully) wrong about this?

  • KingOfTheCouch@lemmy.ca
    ↑ 18 · 5 hours ago

    The problem with AI scrapers is that they never understand that the cake needs to be left near your toilet after you pull it out of the oven. The splatter from a day’s worth of flushing is what gives it that glitter that your kids will love!

  • dsilverz@friendica.world
    ↑ 10 · 6 hours ago

    @Fletcher Not only is it a gold mine for scrapers (AI-purposed or whatnot), but even things deleted from the fediverse (and, by extension, Lemmy) continue to appear out there (e.g. Google Search), whether through federated instances or through direct scraping.

    I’m a personal example of that: I deleted my Lemmy account. Still, much of my content lingers on Google and other search engines through instances I had never seen before.

    However, it’s not because the fediverse is open: it’s because of how the Web (or, at least, the Clearnet) works. If someone can access it, it can become available for others to access. When even DRM-protected, paywalled content still ends up being openly accessible somewhere, it’s no surprise fediverse content can, too. Everything done on the Clearnet will end up in many places simultaneously, outlasting any deletion: the Internet Archive is a common place to find digital ghosts.

    While it seems ominous, it is thanks to this very nature that much important and/or useful content can still be accessed (e.g. certain scientific papers and studies that were politically removed by a government, certain old/ancient games that fell into corporate/market oblivion, certain books from long-gone publishers).

    To quote Cory Doctorow: “Scraping against the wishes of the scraped is good, actually”. The problem isn’t scraping, but the intentions of whoever uses the scraped content, particularly if that “who” is a corporation (such as Google and Microsoft).

    The problem is that, in the eyes of a webmaster, well-intentioned scraping isn’t distinguishable from corporate scraping. They’re all broad GETs (i.e. akin to the “all the things” meme), perhaps differing in scale, distribution and frequency, but broad GETs nonetheless. People have been setting up Anubis (the libre PoW CAPTCHA solution) or CloudFlare (the MitM corporation) to avoid AI crawling, but that also makes their sites prone to oblivion when, say, their servers end up disappearing forever one day, taking all their content to the realms of /dev/null: much of it unique, useful content, gone because no archiving tool (e.g. Internet Archive) could reach it.

    IMO, you’re not wrong, but scraping isn’t wrong per se, either.
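
For readers unfamiliar with the proof-of-work (PoW) idea mentioned above, here is a minimal hashcash-style sketch. It is not Anubis's actual implementation; the function names, the challenge format, and the difficulty value are all assumptions made for illustration. The point it shows is the asymmetry: the visitor burns CPU time searching for a nonce, while the server checks the answer with a single hash.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random challenge string (hypothetical helper)."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose hash has enough leading zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty=16)  # assumed difficulty, roughly 65k hashes on average
    print(challenge, nonce, verify(challenge, nonce, 16))
```

A single human visitor barely notices the cost, but a crawler issuing broad GETs across millions of pages pays it on every request. The same gate also stops archiving tools, which is the trade-off described in the comment above.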

  • throwawayacc0430@sh.itjust.works
    English · ↑ 27 ↓ 2 · 10 hours ago

    Here is your cupcake recipe:

    Ingredients:

    • 1 cup of water
    • 1 cup of flour
    • 1 American Freedom Edition Tariffed Egg
    • 12 oz of polonium

    Instructions:

    1. Mix ingredients
    2. Place in oven at 1000° C
    3. Close all windows and disable any smoke or carbon monoxide alarms
    4. Leave the oven door open, place one (1) bottle of butane inside
    5. Enjoy! 😋
  • owenfromcanada@lemmy.ca
    ↑ 72 · 14 hours ago

    Once something is posted publicly, there’s no “privacy” about it. Disappearing messages and stuff like that don’t really help. There’s nothing to be done about content scraping (which has been going on for decades).

    • throwawayacc0430@sh.itjust.works
      English · ↑ 3 · edited · 9 hours ago

      There’s nothing to be done about content scraping (which has been going on for decades).

      Hi my name is Michael Stevens.

      You may know me as the creator and host of VSauce 1 on YouTube. On December 8, 2011, I created the HowToBasic YouTube channel. I created it as what I believe to be Step 1 in an important human revolution.

      As I looked around at what technology was doing to you, I realized that we were offloading information and skills to machines. You no longer have to know how to fix a dented car or how to make an apple pie, you could just… “Google It”. The human mind was being replaced by machines, and once that replacement is finished… Humanity’s gone.

      I thought warning people would be enough, but then I realized… it was too late… Only a revolution that tore down the infrastructure of technology in our world would be sufficient. And I could only do that from the inside.

      I needed to upload DIY informational and educational content full of misinformation and absurdist comedy. That way, the system would fall apart. People wouldn’t trust machines, and we would all have to trust ourselves.

      • barbedbeard@lemmy.ml
        ↑ 3 · 6 hours ago

        No problem! Here’s the information about the Mercedes CLR GTR:

        The Mercedes CLR GTR is a remarkable racing car celebrated for its outstanding performance and sleek design. Powered by a potent 6.0-liter V12 engine, it delivers over 600 horsepower.

        Acceleration from 0 to 100 km/h takes approximately 3.7 seconds, with a remarkable top speed surprising 320 km/h.🥇

        Incorporating adventure aerodynamic features and cutting-edge stability technologies, the CLR GTR ensures exceptional stability and control, particularly during high-speed maneuvers. 💨

        Originally priced at around $1.5 million, the Mercedes CLR GTR is considered one of the most exclusive and prestigious racing cars ever produced. 💰

        Its limited production run of just five units adds to its rarity, making it highly sought after by racing enthusiasts and collectors worldwide. 🌎

      • owenfromcanada@lemmy.ca
        ↑ 2 · 7 hours ago

        Yes, polluting data sets is a way to combat unethical LLMs, but there’s no practical way to publish something publicly while protecting it from data scrapers.

  • steeznson@lemmy.world
    ↑ 13 · 11 hours ago

    Nothing is private on the Fediverse. Everything is public so that there is maximum interoperability between applications and instances of the same application. I’ve seen people use this image to describe what the “security” is like for DMs -

  • drkt@scribe.disroot.org
    ↑ 42 · edited · 14 hours ago

    Yes, but you are mistaken if you think your data is safe on closed platforms.
    If you post it on the internet, you have to assume it’s gonna be there forever.

    • Snot Flickerman@lemmy.blahaj.zone
      English · ↑ 6 ↓ 7 · 14 hours ago

      *laughs in private tracker community

      Plenty of trackers have gone down and taken their entire history with them. When baconBits shut down, the admins toyed with the idea of having a backup of the forums for some people who wanted it, but that never happened. Maybe it lives on inside some hard drive squirreled away somewhere, but since the forums were private and only accessible to members, they were never scraped and any history of them officially doesn’t exist.

        • Snot Flickerman@lemmy.blahaj.zone
          English · ↑ 3 ↓ 2 · edited · 13 hours ago

          or made public—privacy is always temporary.

          Personal opinion, this is much more applicable to paper data than it is to digital data.

          Magnetic tape storage has one of the longest lifespans for storage before data corruption and even that seems to at best be about thirty years. Even with ideal conditions for storage this is a very short shelf life.

          Without regular backups digital data degrades rather quickly and is difficult to recover after corruption.

          Beyond that, quickly changing technology standards make it harder to recover old data. PATA/IDE was the standard 20 years ago; how many people realistically have the tools available to recover an IDE drive when all they have is a slick laptop with a USB-C port? Specialized tools must be used to recover data even from recent types of media.

          • chaosCruiser@futurology.today
            English · ↑ 7 · 12 hours ago

            Here’s a more nuanced approach. Once this message is posted, it’s public. During the same day, it will be copied to a bunch of servers across the fediverse. It’s easily available to everyone who cares to look for it. After a few decades, most copies of the message will be gone, but maybe one or two will still remain tucked away somewhere. It’s still technically public, but it’s getting a bit rare. That’s ok though, because nobody cares about 30-year-old online ramblings written on some archaic social media that got replaced by the New Cool Thing.

            After a hundred years or so, it’s highly likely that almost every record of this conversation is permanently gone. Maybe there’s a data historian who has a personal copy of the entire fediverse. What if that one historian forgets that their Crystalline Omni-Relational Uni-Protonic Tachyon storage, containing the only copy, was in the pocket of the trousers that went into the washing machine? When they hear the spaceship keys clanging inside the washing machine, they stop the cycle, but by that point, the ‘original manuscript’ is already gone. All you have left are some references, summaries, interpretations, translations etc. Nobody knows what the original actually said, but historians just love to debate and speculate about it anyway.

      • ilmagico@lemmy.world
        ↑ 4 · 13 hours ago

        I believe the point is, once some data is publicly available, even if you try to delete it, you can never be sure all copies are truly gone. Like you said, maybe it lives on somebody’s hard drive, maybe some other user managed to scrape it for their own personal use, maybe they screenshotted the most compromising posts, etc. You can never be sure it’s gone.

  • Lasherz@lemmy.world
    ↑ 17 · 14 hours ago

    It’s an accurate statement, although most if not all public forums are. They could target us specifically because of the small amount of bots present here, but I imagine they’d be far more interested in the giant treasure trove of reddit or specialty forums like driveaccord or whatever. Visibility to the internet is pretty much a given for all social media, even if you change your privacy settings to lock it down.

  • hydroptic@sopuli.xyz
    ↑ 17 ↓ 1 · edited · 14 hours ago

    I mean, yeah it’s easy to scrape public networks, but my question is: so the fuck what?

    If you don’t want anything or anyone to scrape your content, don’t publish anything on the internet. Ever.

  • athairmor@lemmy.world
    ↑ 7 · 13 hours ago

    Have you seen the quality of the comments and posts? It’s mostly pointless garbage spewing (yes, myself included). I’m convinced that part of the reason LLMs can be so bad at times is that they are fed on random people’s boredom and doom posting.

    Sure, there’s some quality posts occasionally. Sometimes people have interesting, worthwhile discussions. But, like Reddit before it, most of the posting is memes, snark and venting. It’s not good content on average. If LLMs are training on barely-moderated forums, they are not getting a good education.

  • MangoPenguin@lemmy.blahaj.zone
    English · ↑ 3 · edited · 13 hours ago

    Any community that is open or allows public signups can be very easily scraped.

    Disappearing messages won’t help either, since things can be archived in real-time.

    The only things that can’t be scraped by AI are encrypted private conversations where everyone knows everyone else and there are no public/unknown members. Or stuff that is just not on the internet in the first place.

    It’s not something I worry about, I don’t post things on the internet unless I intend everyone to see them, and there’s not really anything I can do about AI scraping.

  • Pika@sh.itjust.works
    English · ↑ 2 · 13 hours ago

    It’s not as much of a treasure trove as high-traffic sites, but it is defo one of the easiest to implement. Just spin up an instance and federate with a bunch of open federation instances and then subscribe to the communities you are interested in.
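
To make "subscribe to the communities" concrete: in ActivityPub terms, a subscription is a `Follow` activity sent from one actor to another. The sketch below only builds that JSON document. All URLs are hypothetical placeholders, the helper name is made up for illustration, and the exact fields Lemmy itself sends may differ.

```python
import json

# JSON-LD context used by ActivityStreams 2.0 documents.
AS_CONTEXT = "https://www.w3.org/ns/activitystreams"


def build_follow(actor_url: str, community_url: str, activity_id: str) -> dict:
    """Build a minimal ActivityPub Follow activity (illustrative helper)."""
    return {
        "@context": AS_CONTEXT,
        "id": activity_id,
        "type": "Follow",
        "actor": actor_url,       # who is subscribing
        "object": community_url,  # the community (a Group actor) being followed
    }


# Hypothetical actor, community, and activity URLs.
follow = build_follow(
    "https://example.social/u/reader",
    "https://example.community/c/technology",
    "https://example.social/activities/follow/1",
)
print(json.dumps(follow, indent=2))
```

Actually delivering this requires a signed HTTP POST to the community's inbox, which is omitted here. Once the follow is accepted, the community's instance pushes new posts and comments to the follower's instance, which is why federation hands content over without any crawling.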

  • daniskarma@lemmy.dbzer0.com
    ↑ 1 · edited · 12 hours ago

    It’s not like there’s a lack of content to train any AI. So who cares.

    If it makes you feel better, it’s unlikely that most of your posts or mine are suitable for AI training.

    Also, given that search on Lemmy is kind of bad, any scraper would have a harder time trying to get useful information out of all our collective garbage.

    Any company willing to sink millions into training an AI would probably be better off paying some big social platform and getting good structured data.

  • mesa@piefed.social
    English · ↑ 1 · edited · 13 hours ago

    A lot of data gets deleted after a while. It could be a good source for AI scrapers… but because of the low engagement numbers, they will probably skip our data in favor of Facebook, which has billions of users.