• bad bot behavior

    From Ben Collver@bencollver@tilde.pink to comp.misc on Tue Mar 18 15:17:56 2025
    From Newsgroup: comp.misc

    Please stop externalizing your costs directly into my face
    ===========================================================
    March 17, 2025 on Drew DeVault's blog

    Over the past few months, instead of working on our priorities at
    SourceHut, I have spent anywhere from 20-100% of my time in any given
    week mitigating hyper-aggressive LLM crawlers at scale. This isn't
    the first time SourceHut has been at the wrong end of some malicious
    bullshit or paid someone else's externalized costs – every couple of
    years someone invents a new way of ruining my day.

    Four years ago, we decided to require payment to use our CI services
    because it was being abused to mine cryptocurrency. We alternated
    between periods of designing and deploying tools to curb this abuse
    and periods of near-complete outage when they adapted to our
    mitigations and saturated all of our compute with miners seeking a
    profit. It was bad enough having to beg my friends and family to
    avoid "investing" in the scam without having the scam break into my
    business and trash the place every day.

    Two years ago, we threatened to blacklist the Go module mirror
    because for some reason the Go team thinks that running terabytes of
    git clones all day, every day for every Go project on git.sr.ht is
    cheaper than maintaining any state or using webhooks or coordinating
    the work between instances or even just designing a module system
    that doesn't require Google to DoS git forges whose entire annual
    budgets are considerably smaller than a single Google engineer's
    salary.

    Now it's LLMs. If you think these crawlers respect robots.txt then
    you are several assumptions of good faith removed from reality. These
    bots crawl everything they can find, robots.txt be damned, including
    expensive endpoints like git blame, every page of every git log, and
    every commit in every repo, and they do so using random User-Agents
    that overlap with end-users and come from tens of thousands of IP
    addresses – mostly residential, in unrelated subnets, each one making
    no more than one HTTP request over any time period we tried to
    measure – actively and maliciously adapting and blending in with
    end-user traffic and avoiding attempts to characterize their behavior
    or block their traffic.

    We are experiencing dozens of brief outages per week, and I have to
    review our mitigations several times per day to keep that number from
    getting any higher. When I do have time to work on something else,
    often I have to drop it when all of our alarms go off because our
    current set of mitigations stopped working. Several high-priority
    tasks at SourceHut have been delayed weeks or even months because we
    keep being interrupted to deal with these bots, and many users have
    been negatively affected because our mitigations can't always
    reliably distinguish users from bots.

    All of my sysadmin friends are dealing with the same problems. I was
    asking one of them for feedback on a draft of this article and our
    discussion was interrupted to go deal with a new wave of LLM bots on
    their own server. Every time I sit down for beers or dinner or to
    socialize with my sysadmin friends it's not long before we're
    complaining about the bots and asking if the other has cracked the
    code to getting rid of them once and for all. The desperation in
    these conversations is palpable.

    Whether it's cryptocurrency scammers mining with FOSS compute
    resources or Google engineers too lazy to design their software
    properly or Silicon Valley ripping off all the data they can get
    their hands on at everyone else's expense… I am sick and tired of
    having all of these costs externalized directly into my fucking face.
    Do something productive for society or get the hell away from my
    servers. Put all of those billions and billions of dollars towards
    the common good before sysadmins collectively start a revolution to
    do it for you.

    Please stop legitimizing LLMs or AI image generators or GitHub
    Copilot or any of this garbage. I am begging you to stop using them,
    stop talking about them, stop making new ones, just stop. If blasting
    CO2 into the air and ruining all of our freshwater and traumatizing
    cheap laborers and making every sysadmin you know miserable and
    ripping off code and books and art at scale and ruining our fucking
    democracy isn't enough for you to leave this shit alone, what is?

    If you personally work on developing LLMs et al, know this: I will
    never work with you again, and I will remember which side you picked
    when the bubble bursts.

    From: <https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html>
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From D Finnigan@dog_cow@macgui.com to comp.misc on Tue Mar 18 12:00:07 2025
    From Newsgroup: comp.misc

    On 3/18/25 10:17 AM, Ben Collver wrote:
    Please stop externalizing your costs directly into my face
    ===========================================================
    March 17, 2025 on Drew DeVault's blog

    Over the past few months, instead of working on our priorities at
    SourceHut, I have spent anywhere from 20-100% of my time in any given
    week mitigating hyper-aggressive LLM crawlers at scale.

    This is happening at my little web site, and if you have a web site,
    it's happening to you too. Don't be a victim.

    Actually, I've been wondering where they're storing all this data; and
    how much duplicate data is stored from separate parties all scraping the
    web simultaneously, but independently.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From not@not@telling.you.invalid (Computer Nerd Kev) to comp.misc on Wed Mar 19 08:19:22 2025
    From Newsgroup: comp.misc

    D Finnigan <dog_cow@macgui.com> wrote:
    On 3/18/25 10:17 AM, Ben Collver wrote:
    Please stop externalizing your costs directly into my face
    ==========================================================
    March 17, 2025 on Drew DeVault's blog

    Over the past few months, instead of working on our priorities at
    SourceHut, I have spent anywhere from 20-100% of my time in any given
    week mitigating hyper-aggressive LLM crawlers at scale.

    This is happening at my little web site, and if you have a web site,
    it's happening to you too. Don't be a victim.

    Meh, my little Web site runs so light that even when Amazon's bot
    got stuck in a recursive loop grabbing the same dynamic page tens of
    times a second from different IPs, the server load was near nil as
    usual. The main problem that caused was access logs of hundreds of
    megabytes per day. Amazon is still scraping the hell out of
    everything I put online (even a mirror that's tens of GBs), and
    other bots squeeze into the logs too, maybe even a few humans view
    things sometimes? I don't care, they're welcome to it, and they
    helped me find the bug in the Apache configuration which allowed
    that recursive loop (though I still don't get why bots started
    forming such URLs in the first place).
    --
    __ __
    #_ < |\| |< _#
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Toaster@toaster@dne3.net to comp.misc on Tue Mar 18 18:20:06 2025
    From Newsgroup: comp.misc

    On Tue, 18 Mar 2025 12:00:07 -0500
    D Finnigan <dog_cow@macgui.com> wrote:

    On 3/18/25 10:17 AM, Ben Collver wrote:
    Please stop externalizing your costs directly into my face
    ===========================================================
    March 17, 2025 on Drew DeVault's blog

    Over the past few months, instead of working on our priorities at
    SourceHut, I have spent anywhere from 20-100% of my time in any
    given week mitigating hyper-aggressive LLM crawlers at scale.

    This is happening at my little web site, and if you have a web site,
    it's happening to you too. Don't be a victim.

    Actually, I've been wondering where they're storing all this data;
    and how much duplicate data is stored from separate parties all
    scraping the web simultaneously, but independently.

    But what can be done to mitigate this issue? Crawlers and bots ruin the internet.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From ${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com to comp.misc on Wed Mar 19 12:06:13 2025
    From Newsgroup: comp.misc

    On 2025-03-18, Toaster <toaster@dne3.net> wrote:

    But what can be done to mitigate this issue? Crawlers and bots ruin the internet.

    #mode=evil

    How about a script that spews out an endless stream of junk from
    /usr/share/dict/words, parked on a random URL that's listed in
    robots.txt as forbidden. Any bot choosing to chew on that gets what
    it deserves, though you might need to bandwidth limit it.
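
    A minimal sketch of that idea, using only Python's standard library
    (the port, the served path, and the one-line-per-second pacing are
    made-up values for illustration, not anything prescribed in the
    thread):

        # tarpit.py -- word-salad tarpit sketch.  The real site's
        # robots.txt is assumed to Disallow whatever URL gets routed
        # here, so only rule-ignoring bots ever land on it.
        import random
        import time
        from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

        with open("/usr/share/dict/words") as f:
            WORDS = f.read().split()

        class Tarpit(BaseHTTPRequestHandler):
            def do_GET(self):
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                try:
                    while True:
                        line = " ".join(random.choices(WORDS, k=12)) + "\n"
                        self.wfile.write(line.encode())
                        self.wfile.flush()
                        time.sleep(1)  # crude bandwidth limit
                except (BrokenPipeError, ConnectionResetError):
                    pass  # the bot (or its timeout) finally gave up

            def log_message(self, *args):
                pass  # keep the access log from growing by the second

        if __name__ == "__main__":
            ThreadingHTTPServer(("", 8080), Tarpit).serve_forever()

    The point is asymmetry: each connection trickles junk indefinitely
    while costing the server very little.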
    --
    Ian

    "Tamahome!!!" - "Miaka!!!"
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Rich@rich@example.invalid to comp.misc on Wed Mar 19 16:59:19 2025
    From Newsgroup: comp.misc

    Ian <${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com> wrote:
    On 2025-03-18, Toaster <toaster@dne3.net> wrote:

    But what can be done to mitigate this issue? Crawlers and bots ruin the
    internet.

    #mode=evil

    How about a script that spews out an endless stream of junk from
    /usr/share/dict/words, parked on a random URL that's listed in
    robots.txt as forbidden. Any bot choosing to chew on that gets what
    it deserves, though you might need to bandwidth limit it.

    Another option could be to craft a "gzip bomb" (a carefully crafted
    gzip-compressed file that is compressed to the maximum limits of the
    zlib/gzip algorithm) and return it with the HTTP response header
    "Content-Encoding: gzip".

    Then you only have to output a few tens of megs, but if the AI
    decompresses the gzip bomb it has to consume multiple gigabytes of
    data.
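
    A rough sketch of building such a payload with Python's standard
    gzip module (the ~10 GiB decompressed target and the zero-filled
    content are illustrative assumptions):

        # gzip_bomb.py -- a body that is small on the wire but enormous
        # once a client honours "Content-Encoding: gzip".
        import gzip
        import io

        def make_bomb(decompressed_gib=10):
            buf = io.BytesIO()
            zeros = b"\0" * (1024 * 1024)  # 1 MiB of zeros, ~1000:1 ratio
            with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
                for _ in range(decompressed_gib * 1024):
                    gz.write(zeros)
            return buf.getvalue()

        if __name__ == "__main__":
            body = make_bomb()
            # Serve `body` with the headers "Content-Encoding: gzip" and a
            # plausible Content-Type, only on URLs that robots.txt forbids.
            print(f"{len(body) / 1e6:.1f} MB on the wire for ~10 GiB inflated")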

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.misc on Thu Mar 20 02:22:31 2025
    From Newsgroup: comp.misc

    On Wed, 19 Mar 2025 12:06:13 -0000 (UTC), Ian wrote:

    How about a script that spews out an endless stream of junk ...

    Quite a few others are ahead of you <https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/>.
    Some of their countermeasures are quite sophisticated.

    TIL a new term: “Markov babble”.
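
    For anyone else new to the term, a toy sketch of what a
    Markov-babble generator looks like (word-level bigrams over an
    assumed corpus.txt; not taken from the article):

        # markov_babble.py -- text built from word-pair statistics, so it
        # looks vaguely plausible to a crawler but carries no information.
        import random
        from collections import defaultdict

        def build_chain(text):
            chain = defaultdict(list)
            words = text.split()
            for a, b in zip(words, words[1:]):
                chain[a].append(b)  # every word that ever followed `a`
            return chain

        def babble(chain, n_words=60):
            word = random.choice(list(chain))
            out = [word]
            for _ in range(n_words - 1):
                followers = chain.get(word)
                word = random.choice(followers or list(chain))
                out.append(word)
            return " ".join(out)

        if __name__ == "__main__":
            with open("corpus.txt") as f:
                print(babble(build_chain(f.read())))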
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From ${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com to comp.misc on Thu Mar 20 08:33:39 2025
    From Newsgroup: comp.misc

    On 2025-03-20, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Wed, 19 Mar 2025 12:06:13 -0000 (UTC), Ian wrote:

    How about a script that spews out an endless stream of junk ...

    Quite a few others are ahead of you
    <https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/>.
    Some of their countermeasures are quite sophisticated.

    TIL a new term: “Markov babble”.

    Ha!

    How long before it's made illegal :(
    --
    Ian

    "Tamahome!!!" - "Miaka!!!"
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Toaster@toaster@dne3.net to comp.misc on Thu Mar 20 19:01:20 2025
    From Newsgroup: comp.misc

    On Thu, 20 Mar 2025 08:33:39 -0000 (UTC)
    Ian
    <${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com> wrote:
    On 2025-03-20, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Wed, 19 Mar 2025 12:06:13 -0000 (UTC), Ian wrote:

    How about a script that spews out an endless stream of junk ...

    Quite a few others are ahead of you
    <https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/>.
    Some of their countermeasures are quite sophisticated.

    TIL a new term: “Markov babble”.

    Ha!

    How long before it's made illegal :(

    I love the idea though.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.misc on Fri Mar 21 08:05:58 2025
    From Newsgroup: comp.misc

    On Thu, 20 Mar 2025 08:33:39 -0000 (UTC), Ian wrote:

    How long before it's made illegal :(

    Hard to see how things you do on your own server, particularly
    involving uninvited guests, can be made illegal ...
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From ${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com to comp.misc on Fri Mar 21 08:42:08 2025
    From Newsgroup: comp.misc

    On 2025-03-21, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Thu, 20 Mar 2025 08:33:39 -0000 (UTC), Ian wrote:

    How long before it's made illegal :(

    Hard to see how things you do on your own server, particularly
    involving uninvited guests, can be made illegal ...

    If it inconveniences $BIGCORPS, it will be.

    Though one such $BIGCORP seems to be actually doing this:

    https://www.theregister.com/2025/03/21/cloudflare_ai_labyrinth/
    --
    Ian

    "Tamahome!!!" - "Miaka!!!"
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From candycanearter07@candycanearter07@candycanearter07.nomail.afraid to comp.misc on Sun Mar 23 14:30:04 2025
    From Newsgroup: comp.misc

    Ian <${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com> wrote at 12:06 this Wednesday (GMT):
    On 2025-03-18, Toaster <toaster@dne3.net> wrote:

    But what can be done to mitigate this issue? Crawlers and bots ruin the
    internet.

    #mode=evil

    How about a script that spews out an endless stream of junk from
    /usr/share/dict/words, parked on a random URL that's listed in
    robots.txt as forbidden. Any bot choosing to chew on that gets what
    it deserves, though you might need to bandwidth limit it.


    I heard Cloudflare is doing something like that, but ironically
    with their own generative AI..
    --
    user <candycane> is generated from /dev/urandom
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From candycanearter07@candycanearter07@candycanearter07.nomail.afraid to comp.misc on Sun Mar 23 14:30:04 2025
    From Newsgroup: comp.misc

    Rich <rich@example.invalid> wrote at 16:59 this Wednesday (GMT):
    Ian <${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com> wrote:
    On 2025-03-18, Toaster <toaster@dne3.net> wrote:

    But what can be done to mitigate this issue? Crawlers and bots ruin the
    internet.

    #mode=evil

    How about a script that spews out an endless stream of junk from
    /usr/share/dict/words, parked on a random URL that's listed in
    robots.txt as forbidden. Any bot choosing to chew on that gets what
    it deserves, though you might need to bandwidth limit it.

    Another option could be to craft a "gzip bomb" (a carefully crafted
    gzip-compressed file that is compressed to the maximum limits of the
    zlib/gzip algorithm) and return it with the HTTP response header
    "Content-Encoding: gzip".

    Then you only have to output a few tens of megs, but if the AI
    decompresses the gzip bomb it has to consume multiple gigabytes of
    data.


    Good idea, but you should still put in some text to warn a real user
    just in case.
    --
    user <candycane> is generated from /dev/urandom
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From D Finnigan@dog_cow@macgui.com to comp.misc on Wed Mar 26 08:38:15 2025
    From Newsgroup: comp.misc

    On 3/23/25 9:30 AM, candycanearter07 wrote:
    Ian <${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com> wrote at 12:06 this Wednesday (GMT):
    On 2025-03-18, Toaster <toaster@dne3.net> wrote:

    But what can be done to mitigate this issue? Crawlers and bots ruin the
    internet.

    #mode=evil

    How about a script that spews out an endless stream of junk from
    /usr/share/dict/words, parked on a random URL that's listed in
    robots.txt as forbidden. Any bot choosing to chew on that gets what
    it deserves, though you might need to bandwidth limit it.


    I heard Cloudflare is doing something like that, but ironically
    with their own generative AI..

    https://tech.slashdot.org/story/25/03/26/016244/open-source-devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries

    They're abusing everyone. If you have a web site, don't allow it to be
    abused this way.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From candycanearter07@candycanearter07@candycanearter07.nomail.afraid to comp.misc on Wed Mar 26 17:00:03 2025
    From Newsgroup: comp.misc

    D Finnigan <dog_cow@macgui.com> wrote at 13:38 this Wednesday (GMT):
    On 3/23/25 9:30 AM, candycanearter07 wrote:
    Ian <${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com> wrote at 12:06 this Wednesday (GMT):
    On 2025-03-18, Toaster <toaster@dne3.net> wrote:

    But what can be done to mitigate this issue? Crawlers and bots ruin
    the internet.

    #mode=evil

    How about a script that spews out an endless stream of junk from
    /usr/share/dict/words, parked on a random URL that's listed in
    robots.txt as forbidden. Any bot choosing to chew on that gets what
    it deserves, though you might need to bandwidth limit it.


    I heard Cloudflare is doing something like that, but ironically
    with their own generative AI..

    https://tech.slashdot.org/story/25/03/26/016244/open-source-devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries

    They're abusing everyone. If you have a web site, don't allow it to be abused this way.


    Agreed, it is getting ridiculous how many bad things can be
    attributed to AI at this point.
    --
    user <candycane> is generated from /dev/urandom
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From not@not@telling.you.invalid (Computer Nerd Kev) to comp.misc on Thu Mar 27 07:55:52 2025
    From Newsgroup: comp.misc

    D Finnigan <dog_cow@macgui.com> wrote:
    https://tech.slashdot.org/story/25/03/26/016244/open-source-devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries

    They're abusing everyone. If you have a web site, don't allow it to be abused this way.

    On the other hand, I run websites on the cheapest VPSs available and
    they have no load problems even without any robots.txt rules to
    block bots, let alone active blocking. Yet solutions like the
    "Anubis" proof-of-work thing mentioned in the link require
    JavaScript, which blocks the JS-less web browsers I like to use for
    browsing other people's websites (and FF with NoScript too, unless I
    decide to allow the random JS).

    So basically those websites are making their slow code on the
    server _my_ problem by forcing me to do bot-tests which I fail
    (sometimes in Firefox even with NoScript disabled too!).

    Don't abuse _users_ that way, just to block bots from your
    too-slow website!
    --
    __ __
    #_ < |\| |< _#
    --- Synchronet 3.20c-Linux NewsLink 1.2