I really feel like scrapers should have been outlawed, or at least acted against, at some point.
But they bring profits to tech billionaires. No action will be taken.
No, the reason no action will be taken is that Huawei is a Chinese company. I work for a major US company that’s dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There’s nothing we can do legally about Chinese scrapers.
Can you not just block China?
We do, somewhat. We haven’t gone as far as a blanket ban of Chinese CIDR ranges because there are a lot of risks and bureaucracy associated with a move like that. But it probably makes sense for a small company like Codeberg, since they have higher risk tolerance and can move faster.
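For anyone wondering what a CIDR-range block looks like in practice, here is a minimal sketch in Python. The ranges below are hypothetical placeholders taken from the reserved documentation blocks, not real country allocations, and `is_blocked` is a name made up for illustration. A real deployment would more likely do this at the firewall or reverse proxy, using a maintained list.

```python
import ipaddress

# Hypothetical placeholder ranges (reserved documentation blocks),
# NOT real country allocations.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(client_ip):
    """Return True if client_ip falls inside any blocked CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

The check itself is trivial. The risk and bureaucracy mentioned above come from deciding which ranges go on the list and who gets locked out by mistake.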
I use a tool that downloads a website to check for new chapters of series every day, then creates an RSS feed with the contents. Would this be considered a harmful scraper?
The problem with AI scrapers and bots is their scale: thousands of requests to pages that the server cannot handle, resulting in slow traffic for everyone else.
If the site is getting slowed at times (regardless of whether it is when you scrape), you might want to not scrape at all.
Probably not a good idea to download the whole site, but then that depends upon the site.
As far as I know, the website doesn’t have an API, so I just download the HTML and format the result with a simple Python script. It makes around 10 to 20 requests each time, one for each series I’m following.
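The "format the result" step could look roughly like the sketch below, which uses only the standard library. It assumes the chapter titles and URLs have already been pulled out of the HTML. `build_rss` and the `title`/`url` dictionary keys are hypothetical names for illustration, not taken from the actual script.

```python
import xml.etree.ElementTree as ET


def build_rss(series_title, series_link, chapters):
    """Build a minimal RSS 2.0 document from a list of chapter dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = series_title
    ET.SubElement(channel, "link").text = series_link
    ET.SubElement(channel, "description").text = f"New chapters for {series_title}"
    for chapter in chapters:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = chapter["title"]
        ET.SubElement(item, "link").text = chapter["url"]
        # Use the chapter URL as the unique ID so feed readers can deduplicate.
        ET.SubElement(item, "guid").text = chapter["url"]
    return ET.tostring(rss, encoding="unicode")
```

One request per series page plus a conversion like this is all the tool needs, so the load stays at the 10 to 20 requests described above.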
That might or might not be much.
Depends upon the site, I’d say.
E.g., if it’s something like Netflix, I wouldn’t think much of it, because they have the means to serve the requests.
But for some PeerTube instance, even a single request seems to be too heavy for them. So if that server does not respond to my request, I usually wait for an hour or so before refreshing the page.
You can use the cache feature in curl/wget so it doesn’t download the same CSS or HTML twice. You can also skip JavaScript and image files to save on unnecessary requests.
I would reduce the frequency to once every two days to further reduce the impact.
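Since the tool in question is a Python script rather than curl/wget, the same caching idea can be sketched with HTTP conditional requests. The script remembers the `ETag` and `Last-Modified` headers from the last fetch and sends them back, so an unchanged page costs the server only a tiny `304 Not Modified` reply. This assumes the site actually sends those headers. `conditional_headers` and `fetch_if_changed` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def conditional_headers(cached):
    """Build request headers from validators saved on a previous fetch."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def fetch_if_changed(url, cached):
    """Return (body, new_cache); body is None if the page is unchanged."""
    request = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            new_cache = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
            return response.read(), new_cache
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the old copy
            return None, cached
        raise
```

Store the returned cache dict between runs (a small JSON file is enough) and only rebuild the feed when the body is not `None`.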
The problem is that these are constant hordes of bots run from datacentres. You have one tool. Sending a few requests from your device wouldn’t even dent a Raspberry Pi, never mind a beefier server.
I think the intention of traffic is also important. Your tool is so you can consume the content freely provided by the website. Their tool is so they can profit off of the work on the website.
Seems like an API request would be preferable for the site you’re checking. I don’t imagine they’re unhappy with the traffic if they haven’t blocked it yet.
I mean, if it’s a CMS site there may not be an API; scraping would be the only solution in that case.
If it is scraping, it is scraping. If the endpoint is canonically intended to be machine-readable (do not respond with “but HTML is machine-readable” as you know that is not the point), it is not scraping. Your scraper would be harmful.
So search engines shouldn’t exist? This is absurdly simplistic.
But HTML is machine-readable and that absolutely is the point!
Never forget what they stole from us.
Does your tool respect the site’s robots.txt?
Yes, it just downloads the HTML of one page and formats the data into the RSS format with only the information I need.
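If anyone wants to check this in their own script, Python ships a robots.txt parser in the standard library. A minimal sketch follows. The sample rules and the `my-rss-tool` user agent are made up for illustration, and a real script would load the site's actual robots.txt (for example with `RobotFileParser.set_url` and `read`) instead of a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; fetch the real file in practice.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10
"""


def make_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

`parser.crawl_delay(user_agent)` returns the requested delay in seconds when the file sets one, which is a reasonable pause to put between the per-series requests.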