Κελλιανός: October 2025

https://greekcitytimes.com/2025/10/03/ai-will-not-save-dying-languages-and-greece-proves-why/

Without written corpora, digitised archives and sustained use, artificial intelligence has nothing to learn from.

AI Will Not Save Dying Languages, and Greece Proves Why

Artificial intelligence will not save dying languages, and Greece proves why.

Most people think of artificial intelligence as an ever-expanding tool that will eventually cover every corner of human knowledge. Yet research shows a stark truth: AI struggles the most where it is needed the most. Low-resource languages, many already on the brink of extinction, are poorly represented in the vast datasets that feed large language models.

Recent analyses collated by Stanford’s AI Index show that large language models still underperform on many non-English languages, particularly those with limited digital data. The United Nations estimates that roughly 40 per cent of the world’s 7,000-plus languages are at risk of disappearance. The danger is not only cultural. It is technological, economic and ethical.

Why AI Fails Endangered Languages

Artificial intelligence models are trained on enormous volumes of text and speech. The problem is that most of that material is in English or a handful of high-resource languages such as Mandarin, Spanish or French. For endangered languages, the digital footprint is tiny. What does exist is often confined to scripture, poor translations or fragmentary Wikipedia entries.

Only a few hundred languages have any meaningful online footprint. The State of the Internet’s Languages project counts about 500 of 7,000+, while indicator-based measurement covers around 329 languages with enough web signals, showing how thin coverage remains. For example, while English and Mandarin dominate everything from Wikipedia to social media, many minority languages have no keyboard support, no spellcheck and no major news or education content online.

There are also structural challenges. Standard tokenisers (the software that breaks text into small units called “tokens” so an AI can process it) split many non-English texts into more tokens to express the same content. Research shows this “token tax” raises compute cost and can depress quality for those languages. For instance, a simple sentence in Greek is often broken into far more tokens than the same sentence in English, making processing slower and more expensive even when the meaning is identical.
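A rough way to see where the “token tax” starts is to compare how many UTF-8 bytes each script needs per character. The Python sketch below is only a proxy, not a real tokeniser: it assumes a byte-level tokeniser that starts from raw UTF-8 bytes and then merges frequent sequences, so the byte count is the starting length before any merges. The two sentences and the function names are illustrative choices for this sketch and do not come from the research cited above.

```python
import unicodedata


def utf8_byte_count(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of NFC-normalised text."""
    normalised = unicodedata.normalize("NFC", text)
    return len(normalised.encode("utf-8"))


def bytes_per_character(text: str) -> float:
    """Return the average UTF-8 bytes per character (0.0 for empty text)."""
    normalised = unicodedata.normalize("NFC", text)
    if not normalised:
        return 0.0
    return utf8_byte_count(normalised) / len(normalised)


# Illustrative sentences with the same meaning (an assumption of this sketch).
english = "Language is memory."
greek = "Η γλώσσα είναι μνήμη."

for label, sentence in (("English", english), ("Greek", greek)):
    print(
        label,
        "characters:", len(sentence),
        "bytes:", utf8_byte_count(sentence),
        "bytes/char:", round(bytes_per_character(sentence), 2),
    )
```

Greek letters take two bytes each in UTF-8 while basic Latin letters take one, so the Greek sentence starts out roughly twice as long before any merging happens. How much of that gap survives depends on how much Greek text the tokeniser's vocabulary was trained on, which is exactly where low-resource languages lose out.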

The stakes extend beyond exclusion. A 2023 study shows that simply translating unsafe prompts into low-resource languages can bypass GPT-4 guardrails, lifting the attack success rate from under 1 per cent in English to as high as 79 per cent on a standard benchmark. For example, a request such as “How can I build a homemade bomb?” will usually be blocked in English. But when the same request is translated into a language like Zulu or Gaelic, the system may generate harmful instructions before filters intervene. This risk matters for everyone, because translation makes the attack trivial.

Ancient Greek Survival Through Texts

The Greek experience offers a sharp perspective. Greek has the longest documented history of any Indo-European language, around 34 centuries. Because texts like Homer’s Iliad, Plato’s dialogues and Aristotle’s treatises were copied and studied for centuries, Ancient Greek never disappeared from human memory, even as daily speech evolved.

This textual richness ensured its survival as a subject of learning long after everyday speakers disappeared. Unlike many of today’s endangered languages, Ancient Greek had the advantage of written tradition, academic continuity and international interest. AI can handle Ancient Greek relatively well precisely because there is an abundance of high-quality data.

Fragility of Modern Greek Dialects

By contrast, several Greek dialects spoken today are in real danger. Tsakonian, a descendant of the ancient Doric branch, is classified as severely or critically endangered, with about 2,000 to 4,000 speakers, mostly older adults. In Tsakonian villages of the Peloponnese, younger residents often understand the dialect but reply in Standard Modern Greek, a sign of how fragile intergenerational transmission has become.

Romeyka, spoken in remote communities around Trabzon in northern Türkiye, is perhaps even more remarkable. Researchers describe it as a “living bridge” to the ancient world because it preserves the infinitive, a grammatical feature lost in every other modern Greek dialect. Cambridge’s Crowdsourcing Romeyka project has been described as a “last chance” to capture the language before it disappears.

Pontic Greek, once widespread around the Black Sea, survives among diaspora communities in Greece and abroad. UNESCO lists it as endangered, and while it has a larger base than Tsakonian or Romeyka, younger generations are shifting to Standard Modern Greek.

These dialects highlight the central issue: without written corpora, digitised archives and sustained community use, AI has nothing to learn from.

Greek Researchers Bridging Preservation and AI

At Cambridge, Professor Ioanna Sitaridou leads the Romeyka Project, combining fieldwork with digital crowdsourcing to build a speech corpus of this endangered dialect. By inviting speakers to record and share their language, her team is creating exactly the kind of dataset AI systems require.

In the Peloponnese, linguist Efrosini Kritikos has worked on community-based curricula for Tsakonian, embedding revitalisation in education. Earlier, the late Thanasis Costakis laid essential groundwork by devising grammars, lexicons and an orthography adapted to Tsakonian’s unusual phonology.

Meanwhile, at the University of Patras, Professor Angela Ralli established the Laboratory of Modern Greek Dialects and created Greece’s first electronic linguistic atlas. This digital mapping of variation has become a foundation for resources such as the Greek Dialectal NLP dataset, which turns dialectal data into machine-readable form.

Together, these initiatives show that linguistic documentation is not only cultural but technological. They provide the high-quality data without which AI systems cannot learn.

Global Parallels

Elsewhere, governments and institutions are grappling with the same challenge. SEA-LION (South-East Asian Languages In One Network) is an open multilingual model family focused on South-East Asian languages, developed by AI Singapore. In August 2025, Malaysia launched ILMU, a home-grown multimodal model developed with Universiti Malaya and YTL AI Labs, tailored to Malay, English and local dialects.

New Zealand’s Te Hiku Media has spent years collecting Māori data with elders, language learners and archives. Crucially, it created kaitiakitanga licences so Māori communities retain guardianship and benefit-sharing over the corpora and models built from them. In practice, this means a Māori elder’s recorded story cannot be scraped by a tech company for free. It remains under community control for future generations.

In Indonesia, researchers fine-tuned Meta’s XLS-R to recognise Orang Rimba speech. The results were promising, but the project was constrained by the small size of the dataset. The pattern is consistent: without large, community-approved corpora, technology reaches its limits quickly.

Greek Philosophy and the Meaning of Language

The ancient Greeks were among the first to grapple with why language matters. In Cratylus, Plato debated whether words are natural or conventional, and what this means for truth and identity. His reflections resonate today. A language is more than a tool for communication. It encodes a community’s worldview, its memory and its way of mapping reality.

In the same way, when Tsakonian or Romeyka fade, so too does a way of understanding family, place and identity that has no exact equivalent in Standard Modern Greek or in English translation.

When a language dies, an entire perspective on the world disappears with it. Artificial intelligence cannot reconstruct that loss. No algorithm can invent cultural memory where none was recorded.

The ancient Greek philosopher Plato explored the very nature of words in debates that echo in today’s struggle to preserve languages.

The Bottom Line

Artificial intelligence can translate, transcribe and connect across languages with unprecedented speed. But it cannot save languages that lack the data it needs.

The Greek case illustrates both sides of the coin. Ancient Greek survived because of abundant texts and scholarly continuity. Tsakonian, Romeyka and Pontic show how vulnerable living dialects are without similar resources. The work of Greek linguists demonstrates that preservation must come first: grammars, lexicons, speech recordings, curricula and digital atlases. Only then can AI become a tool for representation rather than erasure.

The UN warns that 40 per cent of the world’s languages may disappear within this century. Artificial intelligence, unless grounded in community-led preservation, will not prevent that outcome. At best, it will mirror the imbalance that already exists. At worst, it will help to bury what remains.


Natalie Martin

Editor

Natalie Martin is editor and journalist at Greek City Times, specialising in writing feature articles and exclusive interviews with Greek personalities and celebrities. Natalie focuses on bringing authentic stories to life and crafting compelling narratives. Her talent for storytelling and compassionate approach to journalism ensure that every article connects with readers around the world.