AI went nuts on my website and generated a $155 excessive bandwidth bill

limelight79@lemmy.world · 24 days ago

AI went nuts on my website and generated a $155 excessive bandwidth bill

Jomega@lemmy.world · 22 days ago

This should be a crime.

Stupidmanager@lemmy.world · 23 days ago

I ran a small hobby site that generated a custom Lambda to make a serverless, white-label “what’s my IP” site. It was a learning exercise that got hammered repeatedly by OpenAI. robots.txt was useless; Cloudflare worked wonders once I blocked access to the real site for all IPs except Cloudflare’s.

Cost was near $1000 for just 2 weeks of it repeatedly hitting the site and I wish I got credit.
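
For anyone wondering what "block all IPs but Cloudflare" amounts to at the origin, here is a minimal sketch of the check. The ranges below are reserved documentation networks used as hypothetical placeholders, not Cloudflare's real ranges; in practice you would load the list your proxy provider publishes. The names `ALLOWED_PROXY_RANGES` and `is_from_proxy` are made up for illustration.

```python
import ipaddress

# Hypothetical placeholder ranges (reserved documentation networks).
# Replace with the ranges your reverse-proxy provider actually publishes.
ALLOWED_PROXY_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_from_proxy(remote_addr: str) -> bool:
    """Return True only if the connecting address is inside an allowed proxy range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed address: treat as not allowed.
        return False
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan.
    return any(addr in net for net in ALLOWED_PROXY_RANGES)
```

The origin would refuse any connection for which `is_from_proxy` returns False, so crawlers that find the real server address still have to go through the proxy's bot filtering. In a real setup this usually lives in a firewall or security group rather than application code.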

FalschgeldFurkan@lemmy.world · 22 days ago

That shit cannot be legal. It’s like DDoS but without getting the target offline… I hope this all works out for you, and that you get OpenAI to pay for it.

(Why are these asshats calling themselves “open” anyways when they are clearly not?)

limelight79@lemmy.world · 22 days ago

They did get the target offline!

cmhe@lemmy.world · 22 days ago

This is what Anubis is for. Bots started ignoring robots.txt so now we have to set up that for everything.

[deleted]@piefed.world · 23 days ago

This isn’t a bug, this is how AI is designed to work, and it is absolutely terrible for the web. If it were actually designed well it would respect robots.txt (it doesn’t care) and cache common query results, but instead it sends out fresh queries and pulls down data over and over again just in case something changed.

It is malicious and should be treated as such, but it isn’t.

SLVRDRGN@lemmy.world · edit-2 23 days ago

Robots.txt is a standard, developed in 1994, that relies on voluntary compliance.

Voluntary compliance is conforming to a rule, without facing negative consequences if not complying.

Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them.

This is all from Wikipedia’s entry on Robots.txt.
I don’t get how we only have voluntary protocols for things like this at this point in 2025 AD…
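
To make "voluntary" concrete: the check below is what a well-behaved crawler runs on its own side before each fetch. Nothing on the server enforces it. This is only a sketch using Python's standard `urllib.robotparser`; the `GPTBot` token is assumed to be the user agent you want to block (check the vendor's current documentation), and `may_fetch` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block one crawler (assumed token "GPTBot"), allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what a compliant crawler would decide for this user agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler that skips this call, or ignores its result, sees no difference at all, which is why robots.txt alone cannot stop a bot that chooses not to comply.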

limelight79@lemmy.world · 23 days ago

Yeah that’s part of why I was so frustrated with the answer from OpenAI about it. I don’t think I mentioned it in the writeup, but I actually did modify robots.txt on Jan 1 to block OpenAI’s bot, and it didn’t stop. In fairness, there’s probably some delay before it re-reads the file, but who knows how long it would have taken for the bot to re-read it and stop flooding the site (assuming it obeys at all) - and it still would have been sucking data until that point.

I also didn’t mention that the support bot gave me the wrong URL for the robots.txt info on their site. I pointed it out and it gave me the correct link. So, it HAD the correct link and still gave me the wrong one! Supporters say, “Oh, yeah, you have to point out its errors!” Why the fuck would I want to argue with it? Also, I’m asking questions because I don’t know the answer! If I knew the correct answer, why would I be asking?

In the abstract, I see the possibilities of AI. I get what they’re trying to do, and I think there may be some value to AI in the future for some applications. But right now they’re shoveling shit at all of us and ripping content creators off.

gwl [he/him]@lemmy.blahaj.zone · 22 days ago

answer from OpenAI

There’s your problem, you’re trusting the blind idiot

limelight79@lemmy.world · 22 days ago

Where did I say I trusted them? Seriously, please, so I can fix that.

gwl [he/him]@lemmy.blahaj.zone · 21 days ago

Well, your first step was using it at all

limelight79@lemmy.world · 21 days ago

What were my other options? That’s the only option for contacting them I could find on their site.

ThisGuyThat@lemmy.world · 22 days ago

There are probably a large number of sites that disappear because of this. I do see OpenAI’s scraper in my logs, but I only have a landing page.

limelight79@lemmy.world · 22 days ago

Yeah, how many people like me would just throw in the towel?

ThisGuyThat@lemmy.world · 22 days ago

Cloudflare’s reverse proxy has been great, although I’d rather not need it at all. I’ve casually looked into alternatives like a WAF on the local machine, but have just stuck with Cloudflare.

limelight79@lemmy.world · 22 days ago

Good to hear…that reminds me, I need to re-enable my site (now that Cloudflare is set up) and…hope for the best!

JohnnyFlapHoleSeed@lemmy.world · 23 days ago

If you sue them in civil court, you have a surprisingly good chance of winning

Damage@feddit.it · 23 days ago

Tragedy of the commons, modern web edition

Anti_Iridium@lemmy.world · 22 days ago

I always hated the tragedy of the commons. Mostly because it’s a bad analogy. Should be reframed today as the “Intentional Destruction of the Commons”

drunkpostdisaster@lemmy.world · 23 days ago

It gets worse every fucking day

4am@lemmy.zip · 23 days ago

Send an invoice to OpenAI for abusing your EULA and demand payment. Report them to all three credit bureaus when they don’t. Encourage others to do the same.

IphtashuFitz@lemmy.world · 23 days ago

Hell. I’d look into taking them to small claims court if they don’t pay the invoice. If that became common practice then OpenAI may actually do something about it.

LiveLM@lemmy.zip · 22 days ago

The bot pulled 1.5 terabytes on just those pictures

It’s no wonder these assholes still aren’t profitable. Idiots burning all this bandwidth on the same images over and over

limelight79@lemmy.world · 22 days ago

Good point, it costs on their end, too.

UpperBroccoli@lemmy.blahaj.zone · 23 days ago

I have experienced something similar. I run a small forum for a computer games series, a series I myself have not been interested in for a long time. I am just running it because the community has no other place to go, and they seem to really enjoy it.

A few months ago, I received word from them that the forum barely responded anymore. I checked it out and noticed there were several hundred active connections at any time, something we had never seen before. After checking the whois info on the IPs, I realized they were all connected to Meta, Google, Apple, Microsoft and other AI companies.

It felt like a coordinated DDoS attack and certainly had almost the same effect. Now, I have a hosting contract where I pay a flat monthly fee for a complete server and any traffic going through it, so it was not a problem financially speaking, but those AI bots made the server almost unusable. Naturally, I went ahead and blocked all the crawler IPs that I could find, and that relieved the pressure a lot, but I still keep finding new ones.

Fuck all of those companies, fuck the lot of them. All they do is rob and steal and plunder, and leave charred ruins. And for what? Fan fiction. Unbelievable.

indomara@lemmy.world · 23 days ago

Thank you for continuing that site. We donate every month for a similar forum that we never use. I love that these places still exist, and appreciate those who help run them.

Dave.@aussie.zone · 23 days ago

Maybe it’s time to implement an AI tarpit. Each response for a request from a particular IP address or range takes double the time of the previous, with something like a 30 second cool down window before response time halves.

Would stop AI scrapers in their tracks, but it wouldn’t hurt normal users too much.

Maybe I should start looking into it a bit more 🤔
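
A rough sketch of the doubling/halving idea described above, in case it helps anyone experiment. All names and defaults here (`Tarpit`, `base_delay`, `max_delay`) are hypothetical, and the cap on the delay is an added assumption so a client can't push the wait time up forever. It only computes how long to stall; wiring it into a web server is left out.

```python
import time
from dataclasses import dataclass


@dataclass
class _State:
    delay: float      # delay handed out on the previous request
    last_seen: float  # clock reading at the previous request


class Tarpit:
    """Per-client delay that doubles on each request and halves per quiet cooldown window."""

    def __init__(self, base_delay=0.05, max_delay=60.0, cooldown=30.0,
                 clock=time.monotonic):
        self.base_delay = base_delay
        self.max_delay = max_delay   # assumed cap, not part of the original idea
        self.cooldown = cooldown
        self.clock = clock
        self._clients = {}

    def next_delay(self, client: str) -> float:
        """Return the number of seconds to stall before answering this client."""
        now = self.clock()
        state = self._clients.get(client)
        if state is None:
            self._clients[client] = _State(self.base_delay, now)
            return self.base_delay

        quiet_windows = int((now - state.last_seen) // self.cooldown)
        if quiet_windows > 0:
            # Client stayed quiet: halve once per full cooldown window.
            # The exponent is capped to keep the float arithmetic safe.
            delay = state.delay / (2.0 ** min(quiet_windows, 64))
            delay = max(self.base_delay, delay)
        else:
            # Another request inside the window: double, up to the cap.
            delay = min(self.max_delay, state.delay * 2)

        self._clients[client] = _State(delay, now)
        return delay
```

A person clicking a link every half minute or so stays at the base delay, while a bot firing requests back to back reaches the cap within about ten requests at these defaults. Keying on a single IP would miss crawlers spread across many addresses, so a real deployment would probably key on address ranges as suggested above.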

limelight79@lemmy.world · 23 days ago

Apparently my phpbb forum served as a nice tar pit. The only thing I can figure is that they neglected to take session IDs into account, so they assumed every url was a different page.
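
If that guess is right, the missing step on the crawler's side would be something like the normalization below: strip session parameters before deciding whether a URL has already been seen. This is a sketch under assumptions; the parameter names in `SESSION_PARAMS` (including `sid`) are my guess at common session keys, and `canonical_url` is a made-up helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of query parameters that carry a session ID.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid"}


def canonical_url(url: str) -> str:
    """Drop session-ID query parameters (and the fragment) so equal pages compare equal."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # The fragment is discarded because it never reaches the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

With this, `viewtopic.php?t=42&sid=aaa111` and `viewtopic.php?t=42&sid=bbb222` collapse to the same key, so a crawler keeping a set of visited canonical URLs would fetch the topic once instead of once per session.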

gothic_lemons@lemmy.world · 23 days ago

Not an expert or anything, but could a script be made that feeds a bot an endless stream of unique tinyurls that point to images OpenAI pays to host?

edit-2 23 days ago

Could you run a script that presents the AI bots with alternative believable but incorrect text based information? That would be a great way to fight back.

You could even implement an AI to rewrite your content with intentional errors so you don’t have to generate the misinformation yourself. Sounds like a great use for AI.

Cypher@lemmy.world · 23 days ago

Nepenthes already does a better job of this than what you’re proposing and doesn’t require AI.

https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in-digestive-juices-with-nepenthes/

23 days ago

Nice

Greg Clarke@lemmy.ca · 24 days ago

Cloudflare also has caching on the free tier which will reduce these kinds of AI attacks

limelight79@lemmy.world · 23 days ago

Yeah that’s what I’ve set up. I haven’t turned the site on yet, I want to leave it off all day tomorrow so that the logs will show nothing, then when I restart it, I can watch the logs for requests from the AI bots.
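
For watching the logs afterwards, a small tally of bytes served per user agent makes the heavy hitters obvious. This sketch assumes the common "combined" access log layout (request, status, size, referer, user agent, all in that order); adjust `LINE_RE` if your server logs differently. `bytes_by_agent` is a made-up name.

```python
import re
from collections import Counter

# Matches: "REQUEST" STATUS SIZE "REFERER" "USER-AGENT" (combined log format assumed).
LINE_RE = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def bytes_by_agent(lines):
    """Sum response sizes per user-agent string; unparseable lines are skipped."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        # A size of "-" means no body was sent (e.g. a 304 response).
        size = 0 if match["size"] == "-" else int(match["size"])
        totals[match["agent"]] += size
    return totals
```

Something like `bytes_by_agent(open("access.log")).most_common(10)` then lists the ten user agents that pulled the most data (the file name is just an example).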

utopiah@lemmy.world · 23 days ago

Tech support found that AI bots were crawling the site repeatedly. In particular, OpenAI’s bot was hitting it extremely hard.

Yup… I just had to read your title to know how it happened. In fact, more than a year ago at OFFDEM (the off discussion parallel to FOSDEM in Brussels) we discussed how to mitigate such practices because at least 2 of us self-hosting had this problem. I had a problem with my own forge because AI crawlers trigger archive generation, and those archives quickly take up quite a bit of space. It’s a well-known problem, which is why there are quite a few “mazes” out there, or simply blocking rules for HTTPS or reverse proxies.

AI hype is so destructive for the Web.

Spice Hoarder@lemmy.zip · 22 days ago

There has to be a better way to do this. Like using a hash or something to tell if a bot even needs to scrape again.
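
HTTP already has roughly this mechanism: the server sends an `ETag` header (often a hash of the content), and a polite client sends it back in `If-None-Match` so the server can answer 304 with no body. A minimal sketch of the server side, with made-up helper names (`make_etag`, `respond`) and a simplified comparison that ignores weak validators and multi-value headers:

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Build a strong ETag from a hash of the response body."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, payload), skipping the payload if the client's copy is current."""
    etag = make_etag(body)
    if if_none_match is not None and if_none_match.strip() == etag:
        # Client already has this exact content: send headers only.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

The catch is the same as with robots.txt: it only saves bandwidth if the bot bothers to send `If-None-Match` in the first place.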

utopiah@lemmy.world · 22 days ago

No doubt there are better ways… but I believe pure players, e.g. OpenAI or Anthropic, or resellers who get paid with scaling, e.g. AWS, equate very large scale with a moat. They get so much funding that they have ridiculous computing resources, probably way WAY cheaper for “old” cloud (i.e. anything but GPUs) than new cloud (GPUs), so basically they put zero effort into optimizing anything. They probably even brag about how large their “dataset” is despite it being full of garbage data. They don’t care because in their marketing materials they claim to train over exabytes of data or whatever.