It’s profoundly chauvinistic to think that people who speak other languages don’t have the same depth of literary resources as English speakers just because their Wikipedia editions have fewer users.
Books. They’re called books. Every nation speaking every language has them.
I understand you’re trying to be nice to minority languages, but if you write research papers you either limit your audience to your own country or you publish in English (I guess Spanish is pretty worldwide as well). If you set out to read a new paper in your field, I doubt you’d pick up something in Mongolian.
Even in Sweden I would write a serious paper in English, so that more of the world could read it. Yes, we have textbooks for our courses that are in Swedish, but I doubt there are many books covering LLMs being published in Swedish right now, for example.
I’m not “trying to be nice to minority languages”, I’m directly pushing back against the chauvinistic idea that the English Wikipedia is so important that those without it are somehow inferior. There is no “doom spiral”.
As for scientific papers, it’s called a translation. One can write academic literature in one’s native language and have it translated for more reach. That isn’t the case with Wikipedia, which is constantly being edited.
No one is saying that those who can’t access or read the English Wikipedia are inferior. The issue here is when what is on a non-English Wikipedia article is misleading or flat-out harmful (like the article’s example about growing crops), because of amateurish attempts at machine translation that get it very wrong. So what Greenland did was shut down its poorly translated and maintained wiki instead of letting it fester with misinformation. And this issue compounds when LLMs scrape Wikipedia as a source to learn new languages.
I’m not “trying to be nice to minority languages”, I’m directly pushing back against the chauvinistic idea that the English Wikipedia is so important that those without it are somehow inferior. There is no “doom spiral”.
I think you missed the problem described here.
The “doom spiral” is not because of the English Wiki; the English Wiki has nothing to do with it.
The problem described is that people who don’t know a “niche” language try to contribute to a niche Wiki by using machine translation/LLMs.
As per the article:
Virtually every single article had been published by people who did not actually speak the language. Wehr, who now teaches Greenlandic in Denmark, speculates that perhaps only one or two Greenlanders had ever contributed. But what worried him most was something else: Over time, he had noticed that a growing number of articles appeared to be copy-pasted into Wikipedia by people using machine translators. They were riddled with elementary mistakes—from grammatical blunders to meaningless words to more significant inaccuracies, like an entry that claimed Canada had only 41 inhabitants. Other pages sometimes contained random strings of letters spat out by machines that were unable to find suitable Greenlandic words to express themselves.
Now, another problem is Model Collapse (or, well, a similar phenomenon strictly in terms of language itself).
We now have a bunch of “niche” languages’ Wikis containing such errors… that are being used to train machine translators and LLMs to handle these languages. This is contaminating their input data with errors and hallucinations, but since this is the training data, these LLMs consider everything in there as the truth, propagating the errors/hallucinations forward.
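To make that concrete, here’s a toy simulation of the feedback loop, purely illustrative (the rates and the “training” are made up, nothing like a real pipeline): each generation of a corpus is partly copied from machine output trained on the previous generation, a small fraction of that output is hallucinated, and nothing ever gets corrected.

```python
import random

# Toy illustration of training-data contamination (not a real training pipeline).
# A "corpus" is a list of facts, each either correct (True) or hallucinated (False).
random.seed(0)

CORPUS_SIZE = 10_000
HALLUCINATION_RATE = 0.05   # fraction of machine-written copies that introduce an error
MACHINE_SHARE = 0.8         # fraction of each new corpus scraped from machine output

corpus = [True] * CORPUS_SIZE  # generation 0: human-written, assumed correct

for generation in range(1, 6):
    new_corpus = []
    for _ in range(CORPUS_SIZE):
        if random.random() < MACHINE_SHARE:
            # Machine-written: copies a fact from the previous corpus,
            # occasionally garbling it. Errors are never corrected.
            fact = random.choice(corpus)
            if random.random() < HALLUCINATION_RATE:
                fact = False
            new_corpus.append(fact)
        else:
            new_corpus.append(True)  # fresh human-written fact
    corpus = new_corpus
    accuracy = sum(corpus) / CORPUS_SIZE
    print(f"generation {generation}: {accuracy:.1%} of the corpus is still correct")
```

The specific numbers don’t matter; the point is that once machine output feeds back into the training data, the errors only accumulate.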
I honestly have no clue where you’re getting anything chauvinistic here. The problem is imperfect technology being misused by irresponsible people.
Is it even getting misused? Spreading knowledge via machine translation where there are no human translators available has to be better than not translating. As long as there is transparency so people can judge the results…
And AI training that trusts everything it reads is a larger systemic issue, not limited to this niche.
Perhaps part of the solution is machine-readable citations. Maybe a search engine or AI could provide better results if it knew what was human-generated vs machine-generated. But even then you have huge gaps on one side with untrustworthy humans (like comedy) and on the other side with machine-generated facts such as from a database.
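To sketch what I mean by machine-readable citations (every field name here is invented; this isn’t an existing Wikipedia or schema.org standard):

```python
# Hypothetical machine-readable citation record with a provenance field.
# None of these field names come from a real standard; this just illustrates
# the idea of labelling human- vs machine-generated material.
citation = {
    "cited_url": "https://example.org/some-source",
    "claim": "Example claim the citation is meant to support",
    "provenance": "machine-translated",   # e.g. "human-written", "machine-translated", "database-export"
    "source_language": "en",
    "target_language": "kl",              # Greenlandic
    "translation_tool": "unknown",
}

def trust_weight(record: dict) -> float:
    """A crawler or model could down-weight material by provenance."""
    weights = {"human-written": 1.0, "database-export": 0.8, "machine-translated": 0.3}
    return weights.get(record["provenance"], 0.1)

print(trust_weight(citation))
```

Even a crude provenance label like that would let a crawler or model down-weight machine-translated material instead of treating it as ground truth.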
Spreading knowledge via machine translation where there are no human translators available has to be better than not translating
Have you not read my entire comment…?
One of the Greenlandic Wiki articles “claimed Canada had only 41 inhabitants”. What use is a text like that? In what world is learning that Canada has 41 inhabitants better than going to the English version of the article and translating it yourself?
Perhaps part of the solution is machine-readable citations
The contents of the citations are already used for training, as long as they’re publicly available. That’s not the problem. The problem is that LLMs do not understand context well; they are not, well, intelligent.
The “Chinese Room” thought experiment explains it best, I think: imagine you’re in a room with writing utensils and a manual. Every now and again a letter falls into the room through a slit in the wall. Your task is to take the letter and use the manual to write a response. If you see such and such shape, you’re supposed to write this and that shape on the reply paper, etc. Once you’re done, you push the reply out through the slit. This goes back and forth.
To the person on the other side of the wall it seems like they’re having a conversation with someone fluent in Chinese whereas you’re just painting shapes based on what the manual tells you.
LLMs don’t understand the prompts - they generate responses based on the probability of certain characters or words or sentences being next to each other when the prompt contains certain characters, words, and sentences. That’s all there is.
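If it helps, here’s a stripped-down toy version of that “manual” (nothing like a real LLM’s architecture, but it shows the principle): the program continues text purely from a table of which word tends to follow which, with no idea what any of it means.

```python
import random

# Toy "manual": for each word, a list of words that often follow it.
# A real LLM learns probabilities over tokens from huge corpora, but the
# principle is the same: pick a likely continuation, not a true one.
random.seed(1)
manual = {
    "canada": ["has", "is"],
    "has": ["41", "many"],
    "41": ["inhabitants."],
    "many": ["inhabitants."],
    "is": ["large."],
}

def continue_text(word: str, steps: int = 3) -> str:
    words = [word]
    for _ in range(steps):
        options = manual.get(words[-1])
        if not options:
            break
        words.append(random.choice(options))
    return " ".join(words)

print(continue_text("canada"))  # may well print "canada has 41 inhabitants."
```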
There was a famous botched experiment where scientists were training an AI model to detect tumours. It got really accurate on the training data, so they tested it on new cases gathered more recently. It gave 100% certainty of a tumour being present if the photograph analysed had a yellow ruler on it, because most photos of tumours in the training data had that ruler for scale.
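That failure mode is easy to reproduce on synthetic data: give a model a feature that happens to correlate perfectly with the label during training (the “ruler”) and it will latch onto it. A rough sketch along those lines, assuming scikit-learn and completely made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic "photos": column 0 = has_yellow_ruler, column 1 = noisy tissue signal.
rng = np.random.default_rng(0)
n = 200
has_tumour = rng.integers(0, 2, size=n)

# In the training set, tumour photos always include the ruler (a spurious shortcut).
has_ruler = has_tumour.copy()
tissue_signal = has_tumour + rng.normal(0, 1.5, size=n)  # weak real signal
X_train = np.column_stack([has_ruler, tissue_signal])

model = DecisionTreeClassifier(max_depth=1).fit(X_train, has_tumour)
print("feature importances (ruler, tissue):", model.feature_importances_)

# New clinic, no rulers in the photos: the shortcut stops working.
has_tumour_new = rng.integers(0, 2, size=n)
X_new = np.column_stack([np.zeros(n), has_tumour_new + rng.normal(0, 1.5, size=n)])
print("accuracy without rulers:", (model.predict(X_new) == has_tumour_new).mean())
```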
But even then you have huge gaps on one side with untrustworthy humans (like comedy) and on the other side with machine-generated facts such as from a database
“Machine-generated facts” are not facts; they’re just hallucinations and falsehoods. It is 100% better to NOT have them at all and have to resort to the English wiki than to have them and learn bullshit.
Especially because, again, the contents of Wikipedia are absolutely being used for training further LLM models. The more errors there are, the worse the models become, eventually leading to a collapse of truth. We are already seeing this with whole “research” publications being generated, including “source” material invented on the spot, to prove bogus results.
Is it even getting misused? Spreading knowledge via machine translation where there are no human translators available has to be better than not translating. As long as there is transparency so people can judge the results
That assumes the AI is accurate, which is debatable.
Also, how do you do citations on a translation?
It’s an interpretation, not a fact.
Sure there are limitations. The point still stands: an imperfect machine translation is better than no translation, as long as people understand that it is imperfect.
Can we afford to let a high bar deprive people of knowledge just because of the language they speak?
The article complains about the effect on languages of poor machine translations, but the effect of no translations is worse. Yes, those Greenlanders should be able to read all of Wikipedia without learning English, even if the project has no human translators.
Wikipedia already has a button that takes you to another language’s version of the page, which you can then machine-translate yourself.
Yes, those Greenlanders should be able to read all of Wikipedia without learning English, even if the project has no human translators
Again, you’re assuming a high level of accuracy from these tools. If LLM garbage leaves it unreadable, is that actually better?
As a society, we need to better value the labour that goes into our collective knowledge bases. Non-English Wikipedia is just one example of this, but it highlights the core of the problem: the system relies on a tremendous amount of skilled labour that cannot easily be done by just a few volunteers.
Paying people to contribute would come with problems of its own (in a hypothetical world where this was permitted by Wikipedia, which I don’t believe it is at present), but it would be easier for people to contribute if the time they wanted to volunteer wasn’t competing with their need to keep their head above water financially. Universal basic income, or something similar, seems like one of the more viable ways to ease this tension.
However, a big component of the problem is around the less concrete side of how society values things. I’m a scientist in an area where we are increasingly reliant on scientific databases, such as the Protein Data Bank (PDB), where experimentally determined protein structures are deposited and annotated, as well as countless databases on different genes and their functions. Active curation of these databases is how we’re able to research a gene in one model organism and then apply those insights to the equivalent gene in other organisms.
For example, CG9536 is the name of a gene found in Drosophila melanogaster — fruit flies, a common model organism for genetic research due to the ease of working with them in a lab. Much of the research around this particular gene can be found on FlyBase, a database for D. melanogaster gene research. Despite fruit flies being super different to humans, many of their genes have equivalents in humans, and CG9536 is no exception: TMEM115 is what we call it in humans. The TL;DR answer of what this gene does is “we don’t know”, because although we have some knowledge of what it does, the tricky part about this kind of research is figuring out how genes or proteins interact as part of a wider system — even if we knew exactly what it does in a healthy person, for example, it’s much harder to understand what kinds of illnesses arise from a faulty version of a gene, or whether a gene or protein could be a target for developing novel drugs. I don’t know much about TMEM115 specifically, but I know someone who was exploring whether it could be relevant in understanding how certain kinds of brain tumours develop. Biological databases are a core component of how we can begin to make sense of the bigger picture.
Whilst the data that fill these databases are produced by experimental research attached to published papers, there’s a tremendous amount of work that goes into making all these resources talk to each other. That FlyBase link above links to the page on TMEM115, and I can use these resources to synthesise research across fields that would previously have been separate: the folks who work on flies have a different research culture from those who work on human genes, or yeast, or plants, etc. TMEM115 is also sometimes called TM115, and it would be a nightmare if a scientist reviewing the literature missed some important existing research that referred to the gene under a slightly different name.
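As a tiny sketch of the kind of cross-referencing that curation makes possible (the gene names come from the paragraphs above, but the lookup table and the paper titles are invented for illustration, not pulled from FlyBase or any real database):

```python
# Hypothetical alias/equivalent lookup: many names, one canonical symbol.
# In real databases this mapping is the product of ongoing human curation.
ALIASES = {
    "CG9536": "TMEM115",   # the fly gene's human equivalent, as described above
    "TM115": "TMEM115",    # alternative symbol mentioned above
    "TMEM115": "TMEM115",
}

def canonical_gene(symbol: str) -> str:
    """Resolve whatever name a paper used to a single canonical symbol."""
    return ALIASES.get(symbol.upper(), symbol.upper())

# Made-up paper titles, just for illustration.
papers = [("Smith 2019", "TM115"), ("Lee 2021", "CG9536"), ("Garcia 2023", "TMEM115")]

# Without the alias table, a literature search for "TMEM115" would miss the first two.
relevant = [title for title, gene in papers if canonical_gene(gene) == "TMEM115"]
print(relevant)
```

That little dictionary only stays correct because someone keeps curating it, and that is exactly the labour I’m arguing we undervalue.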
Making these biological databases link up properly requires active curation, a process that the philosopher of science Sabine Leonelli refers to as “data packaging”, a challenging task that includes asking “who else might find this data useful?” [1]. The people doing the experiments that produce the data aren’t necessarily the best people to figure out how to package and label that data for others to use, because this inherently requires thinking in a way that spans many different research subfields. Crucially though, this infrastructure work gives a scientist far fewer opportunities to publish new papers, which means this essential labour is devalued in our current system of doing science.
It’s rather like how some of the people adding poor-quality articles to non-English Wikipedia feel like they’re contributing, because using automated tools allows them to create more new articles than someone with actual specialist knowledge could. It’s the product of a culture of an ever-hungry “more” that fuels the production of slop, devalues the work of curators, and is degrading our knowledge ecosystem. The financial incentives that drive this behaviour play a big role, but I see them as a symptom of a wider problem: society’s desire to easily quantify value causes important work that’s harder to quantify to be systematically devalued (a problem we also see in how reproductive labour, i.e. the labour involved in managing a family or household, has historically been dismissed).
We need to start recognising how tenuous our existing knowledge is. The OP discusses languages with few native speakers, which likely won’t affect many who read the article, but we’re at risk of losing so much more if we don’t learn to recognise that same fragility in our collective knowledge. The more we learn, the more we need to invest in expanding our systems of knowledge infrastructure, as well as maintaining what we already have.
[1]: I am citing not the paper in which Sabine Leonelli coined the phrase “data packaging”, but her 2016 book “Data-Centric Biology: A Philosophical Study”. I don’t imagine that many people will read this large comment of mine, but if you’ve made it this far, you might be interested in checking out her work. Though it’s not aimed at a general audience, it’s still fairly accessible if you’re the kind of nerd who is interested in the messy problem of making a database usable by everyone.
If your appetite for learning is larger than your wallet, then I’d suggest that Anna’s Archive or similar is a good shout. Some communities aren’t cool with directly linking to resources like this, so know that you can check the Wikipedia page of shadow-library sites to find a reliable link: https://en.wikipedia.org/wiki/Anna%27s_Archive
This is the sort of comment that makes me wish I could do multiple upvotes
Time to spin up some alts?
Does it really matter? I think the sheer number of languages in the world right now is not helping us communicate. I don’t view language as a cultural heritage thing, just a communication protocol. And I have moved around a lot in the world; it’s very difficult to be constantly adapting to different languages. That creates a societal integration barrier for me.
I think if we had a universal language (note that it wouldn’t have to be English) we would be able to understand each other better and have fewer wars.
PS: I’m not advocating banning languages or anything, just having a universal one. A bit like what Esperanto tried to achieve. A mutual language means more mutual understanding and thus less of the “us vs them” undercurrent that the fascists thrive on.
This is the worst take I’ve ever seen.
Yeah, I’m just not really wedded to any language. I guess it is also because I have moved around so much. I’m from Holland but I don’t consider myself a Dutch person, more like a citizen of the world. I’ve become too different to fit in in my home country (also because it’s become an extreme-right cesspool lately 😢). I’ve spent about half my life elsewhere. And in the places I’ve lived where I spoke the language, I fared noticeably better.
Don’t forget that a lot of today’s problems center around not understanding each other. The hatred of immigrants, for example.
But I know a lot of people do view language as a cultural thing; this is just my point of view.
Esperanto still exists and there is a worldwide community of speakers.
Oh yes, I know, but as a common universal language it really has failed. It never became more than a fringe thing (sorry).
It is the most popular one. If somebody wanted to start a competitor, they’d have a hard time.
I don’t view language as a cultural heritage thing, just a communication protocol.
Language is absolutely political, a product of its specific environment, and there is a lot that can be communicated in one language that would be difficult in another. Erasing languages isn’t like no longer manufacturing a specific style of plug; instead it silences viewpoints and enforces the cultural hegemony of the dominant group.
There’s a reason fascists are fond of erasing the languages of the marginalized.
Like I said, I don’t advocate erasing languages, just having a common international language, whichever it is. Local languages can still play a big role in cultural matters (e.g. literature and everyday life on the street for locals).
I’m from Holland myself and I know most people there don’t care so much about our quirky language; we are happy to speak other ones. It doesn’t mean that Dutch is worth any less. Mind that it doesn’t have to be English (especially now that the US is rapidly declining as a world power). But whatever it is, I wish the world would just pick one so I don’t have to keep learning new languages every time I move.
But failing a global language, perhaps AI translators will become so good and smooth to use that soon we can just communicate regardless of what languages we speak.
Languages have their own quirks and character, representative of a people’s cultural values and history, and they express ideas not even present in other cultures. As many languages as possible have to be preserved for these reasons.




