Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

Pro@programming.dev · 9 months ago

Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

Wispy2891@lemmy.world · 9 months ago

Question: those artificial stupidity bots want to steal the issues or want to steal the code? Because why they’re wasting a lot of resources scraping millions of pages when they can steal everything via SSH (once a month, not 120 times a second)

Passerby6497@lemmy.world · 9 months ago

That would require having someone with real intelligence running the scraper.

lime!@feddit.nu · 9 months ago

they just want all text

e8CArkcAuLE@piefed.social · 9 months ago

how this felt like while reading

prole@lemmy.blahaj.zone · 9 months ago

Tech bros just actively making the internet worse for everyone.

ShaggySnacks@lemmy.myserv.one · 9 months ago

Tech bros just actively making ~~the internet~~ society worse for everyone.

FTFY.

iopq@lemmy.world · 9 months ago

I mean, tech bros of the past invented the internet

CeeBee_Eh@lemmy.world · 8 months ago

Those were tech nerds. “Tech bros” are jabronis who see the tech sector as a way to increase the value of the money their daddies gave them.

notarobot@lemmy.zip · 8 months ago

Those are not the tech bros. The tech bros are the ones who move fast and break things. The internet was built by engineers and developers

prole@lemmy.blahaj.zone · edit-2 8 months ago

Nah, that was DARPA

zoey@lemmy.librebun.com · 9 months ago

I’m ashamed to say that I switched my DNS nameservers to CF just for their anti crawler service.
Knowing Cloudflare, god know how much longer it’ll be free for.

AmbiguousProps@lemmy.today · 9 months ago

Did you enable the AI black hole/tarpit? It’s the main reason I’ve used their stuff.

zoey@lemmy.librebun.com · 9 months ago

TIL! Just enabled it, thanks

xxce2AAb@feddit.dk · 9 months ago

If this isn’t fertile grounds for a massive class-action lawsuit, I don’t know what would be.

dreadbeef@lemmy.dbzer0.com · 9 months ago

whos the defendent, specifically?

xxce2AAb@feddit.dk · 9 months ago

No, that’s a good point. We all bloody well know there isn’t a single provider of LLM’s that aren’t sucking the entire Internet dry while gleefully ignoring robots.txt and expecting everybody else to pay the bill on their behalf, but the AI providers are getting really good at using other people IPs both to mask their identity and to evade blacklists, which is yet another abusive behavior.

But that’s beside your point. So forget the class-action lawsuit in favor of the relevant Ombudsman.

Either way, this cannot go on. Donation-driven open source projects are being driven into the ground by exploding bandwidth and hosting costs, people are being forced to deploy tools like Anubis that eats additional resources - including the resources of every legitimate user. The cumulative damage this is doing is no joke.

katy ✨@piefed.blahaj.zone · 9 months ago

reminder to donate to codeberg and forgejo :)

0x0@lemmy.zip · 9 months ago

It’s always a cat-n-mouse game.

Allero@lemmy.today · 9 months ago

Except previously bombarding another person’s server for personal gain was illegal.

0x0@lemmy.zip · 9 months ago

Not if it’s AI.
/s aside, maybe you could call’em out on involuntary DoSing, but then slashdot and similar sites would get into trouble.

carrylex@lemmy.world · 9 months ago

I don’t know if this is news to you, but most of the internet never cared about what’s legal or not.

londos@lemmy.world · 9 months ago

Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.

T156@lemmy.world · 9 months ago

Not without making real users also mine bitcoin/avoiding the site because their performance tanked.

raspberriesareyummy@lemmy.world · 9 months ago

Did you just say use the words “useful” and “bitcoin” in the same sentence? o_O

kameecoding@lemmy.world · 9 months ago

Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.

Jolteon@lemmy.zip · 9 months ago

deleted by creator

londos@lemmy.world · edit-2 9 months ago

You’re 100% right. I just grasped at the first example I could think of where the crawlers could do free work. Yours is much better. Left is best.

andallthat@lemmy.world · edit-2 9 months ago

LLMs can’t do protein folding. A specifically-trained Machine Learning model called AlphaFold did. Here’s the paper.

Developing, training and fine tuning that model was a research effort led by two guys who got a Nobel for it. Alphafold can’t do conversation or give you hummus recipes, it knows shit about the structure of human language but can identify patterns in the domain where it has been specifically and painstakingly trained.

It wasn’t “hey chatGPT, show me how to fold a protein” is all I’m saying and the “superhuman reasoning capabilities” of current LLMs are still falling ridiculously short of much simpler problems.

kameecoding@lemmy.world · 9 months ago

They can’t bitcoin mine either, so technical feasibility wasn’t the goal of my reply

patatahooligan@lemmy.world · 9 months ago

The crawlers for LLM are not themselves LLMs.

mobotsar@sh.itjust.works · 9 months ago

Crawlers aren’t LLMs; they can do arbitrary computations (whatever the target demands to access resources).

NeilBrü@lemmy.world · edit-2 9 months ago

Hey dipshits:

The number of mouth-breathers who think every fucking “AI” is a fucking LLM is too damn high.

AlphaFold is not a language model. It is specifically designed to predict the 3D structure of proteins, using a neural network architecture that reasons over a spatial graph of the protein’s amino acids.

Every artificial intelligence is not a deep neural network algorithm.
Every deep neural network algorithm is not a generative adversarial network.
Every generative adversarial network is not a language model.
Every language model is not a large language model.

Fucking fart-sniffing twats.

$ ./end-rant.sh

Honytawk@feddit.nl · 9 months ago

deleted by creator

londos@lemmy.world · 9 months ago

I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the AI crawlers doing free work. But you’re right, bitcoin sucks.

raspberriesareyummy@lemmy.world · 9 months ago

To be fair: it’s a great tool for scamming people (think ransomware) :/

9 months ago

Great for money laundering.

Echo Dot@feddit.uk · 9 months ago

Is it? Don’t you risk losing a rather large percentage of the value.

Just by cars or something as they are much better at keeping their value. Also if somebody asks where did you get all this money from you can just point to the car and say, I sold that.

nymnympseudonym@lemmy.world · 9 months ago

The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. javascript optimization was one direction that got pursued pretty far.

But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.

0x0@lemmy.zip · edit-2 9 months ago

Anubis does that (the computation part). You may’ve seen it already.

Net_Runner :~$@lemmy.zip · 9 months ago

I use Anubis on my personal website, not because I think anything I’ve written is important enough that companies would want to scrape it, but as a “fuck you” to those companies regardless

That the bots are learning to get around it is disheartening, Anubis was a pain to setup and get running

r00ty@kbin.life · 9 months ago

For mbin I managed to kill the attack of the scrapers only using cloudflare managed challenge for all except to fediverse post endpoints, from fediverse ua agents on certain get endpoints. Managed challenge on everything else.

So far, they’ve not gotten past it. But, a matter of time.

PrettyFlyForAFatGuy@feddit.uk · 9 months ago

man, you’d think they’d just use the actual activitypub protocol to inhale all that data at once and not bother with costly scraping.

This A aint very I

r00ty@kbin.life · 9 months ago

Well the posts to inbox are generally for incoming info. Yes, there’s endpoints for fetching objects. But, they don’t work for indexing, at least not on mbin/kbin. If you have a link, you can use activitypub to traverse upwards from that object to the root post. But you cannot iterate down to child comments from any point.

The purpose is that say I receive an “event” from your instance. You click like on a post I don’t have on my instance. Then the like event has a link to the object for that on activitypub. If I fetch that object it will have a link to the comment, if I fetch the comment it will have the comment it was in reply to, or the post. It’s not intended to be used to backfill.

So they do it the old fashioned way, traversing the human side links. Which is essentially what I lock down with the managed challenge. And this is all on the free tier too.

Wispy2891@lemmy.world · 9 months ago

Same for all the WordPress blogs, by default in all of them there’s an API without authentication that lets you download ALL the posts in an easy JSON.

Dear artificial stupidity bot… WHY THE FUCK ARE YOU FUCKING SCRAPING THE WHOLE PAGE 50 TIMES A SECOND???

muusemuuse@sh.itjust.works · 9 months ago

AI was never intelligent. It’s a marketing term, that’s all. It has absolutely no meaning.

PhilipTheBucket@piefed.social · 9 months ago

I feel like at some point it needs to be active response. Phase 1 is a teergrube type of slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDOS in response or maybe just a drone strike and cut out the middleman. Once you’ve actively evading Anubis, fuckin’ game on.

TurboWafflz@lemmy.world · 9 months ago

I think the best thing to do is to not block them when they’re detected but poison them instead. Feed them tons of text generated by tiny old language models, it’s harder to detect and also messes up their training and makes the models less reliable. Of course you would want to do that on a separate server so it doesn’t slow down real users, but you probably don’t need much power since the scrapers probably don’t really care about the speed

xthexder@l.sw0.com · 9 months ago

I love catching bots in tarpits, it’s actually quite fun

sudo@programming.dev · edit-2 9 months ago

The problem is primarily the resource drain on the server and tarpitting tactics usually increase that resource burden by maintaining the open connections.

SorteKanin@feddit.dk · 9 months ago

The idea is that eventually they would stop scraping you cause the data is bad or huge. But it’s a long term thing, it doesn’t help in the moment.

Monument@lemmy.sdf.org · 9 months ago

The promise of money — even diminishing returns — is too great. There’s a new scraper spending big on resources every day while websites are under assault.

In the paraphrased words of the finance industry: AI can stay stupid longer than most websites can stay solvent.

phx@lemmy.ca · 9 months ago

Yeah that was my thought. Don’t reject them, that’s obvious and they’ll work around it. Feed them shit data - but not too obviously shit - and they’ll not only swallow it but eventually build up to levels where it compromises them.

I’ve suggested the same for plain old non-AI data stealing. Make the data useless to them and cost more work to separate good from bad, and they’ll eventually either sod off or die.

A low power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AI’s. It doesn’t need to be done real-time either as datasets can be generated in advance

SorteKanin@feddit.dk · 9 months ago

A low power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AI’s.

Even “high power” AIs would produce bad data. It’s currently well known that feeding AI data to an AI model decreases model quality and if repeated, it just becomes worse and worse. So yea, this is definitely viable.

phx@lemmy.ca · 9 months ago

Yup. It was more my thought that a low power over could produce sufficient results while requiring less resources. Something that can run on a desktop computer could still produce a database with reams of believable garbage that would take a lot of resources from the attacking AI to sort through, or otherwise corrupt its own harvested cache

31ank@ani.social · edit-2 9 months ago

Some guy also used zip bombs against AI crawlers, don’t know if it still works. Link to the lemmy post

amelaxxx@piefed.social · 9 months ago

Right

The Infinite Nematode@feddit.uk · 9 months ago

Wasn’t this called black ice in Neuromancer? Security systems that actively tried to harm the hacker?

traches@sh.itjust.works · 9 months ago

These crawlers come from random people’s devices via shady apps. Each request comes from a different IP

PhilipTheBucket@piefed.social · 9 months ago

Is that really true? I guess I have no reason to doubt it, I just hadn’t heard it before.

sudo@programming.dev · 9 months ago

Here’s one example of a proxy provider offering to pay developers to inject their proxies into their apps. (“100% ethical proxies” because they signed a ToS). Another is BrightData proxies traffic through users of their free HolaVPN.

IOT and smart TVs are also obvious suspects.

amelaxxx@piefed.social · 9 months ago

AmbitiousProcess (they/them)@piefed.social · 9 months ago

Most of these AI crawlers are from major corporations operating out of datacenters with known IP ranges, which is why they do IP range blocks. That’s why in Codeberg’s response, they mention that after they fixed the configuration issue that only blocked those IP ranges on non-Anubis routes, the crawling stopped.

For example, OpenAI publishes a list of IP ranges that their crawlers can come from, and also displays user agents for each bot.

Perplexity also publishes IP ranges, but Cloudflare later found them bypassing no-crawl directives with undeclared crawlers. They did use different IPs, but not from “shady apps.” Instead, they would simply rotate ASNs, and request a new IP.

The reason they do this is because it is still legal for them to do so. Rotating ASNs and IPs within that ASN is not a crime. However, maliciously utilizing apps installed on people’s devices to route network traffic they’re unaware of is. It also carries much higher latency, and could even allow for man-in-the-middle attacks, which they clearly don’t want.

PhilipTheBucket@piefed.social · 9 months ago

Honestly, man, I get what you’re saying, but also at some point all that stuff just becomes someone else’s problem.

This is what people forget about the social contract: It goes both ways, it was an agreement for the benefit of all. The old way was that if you had a problem with someone, you showed up at their house with a bat / with some friends. That wasn’t really the way, and so we arrived at this deal where no one had to do that, but then people always start to fuck over other people involved in the system thinking that that “no one will show up at my place with a bat, whatever I do” arrangement is a law of nature. It’s not.

sudo@programming.dev · 9 months ago

Or your TV or IOT devices. Residential proxies are extremely shady businesses.

NuXCOM_90Percent@lemmy.zip · 9 months ago

Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.

Its also important to understand that a significant chunk of these botnets are just normal people with viruses/compromised machines. And the fastest way to launch a DDOS attack is to… rent the same botnet from the same blackhat org to attack itself. And while that would be funny, I would also rather orgs I donate to not giving that money to blackhat orgs. But that is just me.

bleistift2@sopuli.xyz · edit-2 9 months ago

https://en.wikipedia.org/wiki/Sarcasm, or maybe https://en.wikipedia.org/wiki/Hyperbole

Ex Nummis@lemmy.world · 9 months ago

Eventually we’ll have “defensive” and “offensive” llm’s managing all kinds of electronic warfare automatically, effectively nullifying each other.

sudo@programming.dev · 9 months ago

Places like cloudflare and akamai are already using machine learning algorithms to detect bot traffic at a network level. You need to use similar machine learning to evade them. And since most of these scrapers are for AI companies I’d expect a lot of the scrapers to be LLM generated.

ProdigalFrog@slrpnk.net · 9 months ago

That’s actually a major plot point in Cyberpunk 2077. There’s thousands of rogue AI’s on the net that are constantly bombarding a giant firewall protecting the main net and everything connected to it from being taken over by the AI.

Track_Shovel@slrpnk.net · 9 months ago

Unrelated, but I saw this headline, and could hear both you and squidward swearing from here.

ProdigalFrog@slrpnk.net · 9 months ago

It doesn’t bode well. Honestly I fear at some point in the future, if these countermeasures can’t keep up, small sites may need to close themselves off with invite-only access. Hopefully that’s quite a distant future.

Klear@quokk.au · 9 months ago

The game is an excellent documentary.

archchan@lemmy.ml · edit-2 9 months ago

Not to mention the firewall is itself AI.

Monument@lemmy.sdf.org · 9 months ago

Increasingly, I’m reminded of this: Paul Bunyan vs. the spam bot (or how Paul Bunyan triggered the singularity to win a bet). It’s a medium-length read from the old internet, but fun.

UnderpantsWeevil@lemmy.world · 9 months ago

I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.

devfuuu@lemmy.world · edit-2 9 months ago

I think every company in the world is telling everyone for a few months now that what matter is AI data harvesting. There’s not even a hint of it being a question. You either accept the AI overlords or get out of the internet. Our ONLY purpose it to feed the machine, anything else is irrelevant. Play along or you shall be removed.

gian @lemmy.grys.it · 9 months ago

get out of the internet.

At some point, this would be the best option, sadly

ScoffingLizard@lemmy.dbzer0.com · 9 months ago

We need to poison better.

willington@lemmy.dbzer0.com · edit-2 9 months ago

I was fine before the AI.

The biggest customer of AI are the billionaires who can’t hire enough people for their technofeudalist/surveillance capitalism agenda. The billionaires (wannabe aristocrats) know that machines have no morals, no bottom lines, no scruples, don’t leak info to the press, don’t complain, don’t demand to take time off or to work from home, etc.

AI makes the perfect fascist.

They sell AI like it’s a benefit to us all, but it ain’t that. It’s a benefit to the billionaires who think they own our world.

AI is used for censorship, surveillance pricing, activism/protest analysis, making firing decisions, making kill decisions in battle, etc. It’s a nightmare fuel under our system of absurd wealth concentration.

Fuck AI.

Da Oeuf@slrpnk.net · 9 months ago

Crazy. DDoS attacks are illegal here in the UK.

BlameTheAntifa@lemmy.world · 9 months ago

The problem is that hundreds of bad actors doing the same thing independently of one another means it does not qualify as a DDoS attack. Maybe it’s time we start legally restricting bots and crawlers, though.

rdri@lemmy.world · 9 months ago

So, sue the attackers?