Please stop externalizing your costs directly into my face
==========================================================
March 17, 2025 on Drew DeVault's blog
Over the past few months, instead of working on our priorities at
SourceHut, I have spent anywhere from 20-100% of my time in any given
week mitigating hyper-aggressive LLM crawlers at scale.
On 3/18/25 10:17 AM, Ben Collver wrote:
> Over the past few months, instead of working on our priorities at
> SourceHut, I have spent anywhere from 20-100% of my time in any given
> week mitigating hyper-aggressive LLM crawlers at scale.
This is happening at my little web site, and if you have a web site,
it's happening to you too. Don't be a victim.
On 3/18/25 10:17 AM, Ben Collver wrote:
> Over the past few months, instead of working on our priorities at
> SourceHut, I have spent anywhere from 20-100% of my time in any given
> week mitigating hyper-aggressive LLM crawlers at scale.
Actually, I've been wondering where they're storing all this data, and
how much duplicate data is stored by separate parties all scraping the
web simultaneously but independently.
But what can be done to mitigate this issue? Crawlers and bots ruin the internet.
On 2025-03-18, Toaster <toaster@dne3.net> wrote:
> But what can be done to mitigate this issue? Crawlers and bots ruin the
> internet.
#mode=evil
How about a script that spews out an endless stream of junk from
/usr/share/dict/words, parked on a random URL that's listed in
robots.txt as forbidden. Any bot choosing to chew on that gets what
it deserves, though you might need to bandwidth limit it.
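
A minimal sketch of that kind of tarpit, assuming Python with Flask (the
route name, the twelve words per line, and the half-second delay are
arbitrary illustrative choices, not anything specified in the thread):

    # tarpit.py -- stream endless dictionary junk to crawlers that ignore robots.txt.
    import random
    import time

    from flask import Flask, Response

    app = Flask(__name__)

    with open("/usr/share/dict/words") as f:
        WORDS = f.read().split()

    @app.route("/secret-archive/")  # list this path as Disallow: in robots.txt
    def tarpit():
        def babble():
            # Never finishes: the bot either gives up or sits here indefinitely.
            while True:
                yield " ".join(random.choices(WORDS, k=12)) + "\n"
                time.sleep(0.5)  # crude bandwidth limit: one short line per half second
        return Response(babble(), mimetype="text/plain")

    if __name__ == "__main__":
        app.run()

The matching robots.txt would carry a line like "Disallow: /secret-archive/",
so only bots that ignore robots.txt ever reach it.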
On Wed, 19 Mar 2025 12:06:13 -0000 (UTC), Ian wrote:
> How about a script that spews out an endless stream of junk ...
Quite a few others are ahead of you
<https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/>.
Some of their countermeasures are quite sophisticated.
TIL a new term: “Markov babble”.
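
"Markov babble" of the sort those tarpits serve is cheap to produce: train a
word-level Markov chain on some corpus, then walk it at random. A rough
sketch in plain Python (the corpus path and output length are placeholders,
not anything from the article):

    # markov_babble.py -- generate plausible-looking nonsense from a source text.
    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # Map each run of `order` consecutive words to the words seen right after it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def babble(chain, length=100):
        key = random.choice(list(chain.keys()))
        out = list(key)
        for _ in range(length):
            followers = chain.get(key)
            if not followers:  # dead end: jump to a random state and continue
                key = random.choice(list(chain.keys()))
                followers = chain[key]
            out.append(random.choice(followers))
            key = tuple(out[-len(key):])
        return " ".join(out)

    if __name__ == "__main__":
        with open("corpus.txt") as f:  # placeholder: any text file will do
            print(babble(build_chain(f.read())))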
On 2025-03-20, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
> Quite a few others are ahead of you
> <https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/>.
> Some of their countermeasures are quite sophisticated.
>
> TIL a new term: “Markov babble”.
Ha!
How long before it's made illegal :(
On Thu, 20 Mar 2025 08:33:39 -0000 (UTC), Ian wrote:
> How long before it's made illegal :(
Hard to see how things you do on your own server, particularly
involving uninvited guests, can be made illegal ...
Ian <${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com> wrote:
> How about a script that spews out an endless stream of junk from
> /usr/share/dict/words, parked on a random URL that's listed in
> robots.txt as forbidden. Any bot choosing to chew on that gets what
> it deserves, though you might need to bandwidth limit it.
Another option could be to craft a "gzip bomb" (a carefully crafted
file compressed to the maximum ratio the zlib/gzip algorithm allows)
and return it with the HTTP header "Content-Encoding: gzip".

Then you only have to output a few tens of megs, but if the AI
crawler decompresses the gzip bomb it has to consume multiple
gigabytes of data.
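
A rough sketch of that approach, again assuming Python with Flask; the
10 GiB target, the file name, and the route are arbitrary, and a real
deployment would generate the bomb once offline rather than at startup:

    # gzip_bomb.py -- serve a small payload that inflates enormously when decompressed.
    import gzip
    import os

    from flask import Flask, Response

    app = Flask(__name__)
    BOMB_PATH = "bomb.gz"

    def make_bomb(path, gib=10):
        # gzip compresses long runs of zeros at roughly 1000:1, so ~10 GiB of
        # zeros ends up as a file of roughly 10 MB on disk.
        chunk = b"\0" * (1024 * 1024)  # 1 MiB of zeros
        with gzip.open(path, "wb", compresslevel=9) as f:
            for _ in range(gib * 1024):
                f.write(chunk)

    if not os.path.exists(BOMB_PATH):
        make_bomb(BOMB_PATH)

    @app.route("/member-directory/")  # another path disallowed in robots.txt
    def bomb():
        with open(BOMB_PATH, "rb") as f:
            payload = f.read()
        # Declare the body as gzip-compressed HTML; a client that honours the
        # header will try to decompress the whole thing.
        return Response(payload, mimetype="text/html",
                        headers={"Content-Encoding": "gzip"})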
Ian <${send-direct-email-to-news1021-at-jusme-dot-com-if-you-must}@jusme.com> wrote at 12:06 this Wednesday (GMT):
> How about a script that spews out an endless stream of junk from
> /usr/share/dict/words, parked on a random URL that's listed in
> robots.txt as forbidden. Any bot choosing to chew on that gets what
> it deserves, though you might need to bandwidth limit it.
I heard Cloudflare is doing something like that, but ironically
with their own generative AI.
On 3/23/25 9:30 AM, candycanearter07 wrote:
> I heard Cloudflare is doing something like that, but ironically
> with their own generative AI.
https://tech.slashdot.org/story/25/03/26/016244/open-source-devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries
They're abusing everyone. If you have a web site, don't allow it to be abused this way.