LINGUAHALLA
White Paper

The LinguaHalla Methodology: Scaling Language Acquisition for Endangered, Heritage, and Classical Languages Through Adaptive AI Narrative

Runeteca Technologies LLC · Victor Christiansen · May 2026

The Problem

Standard language applications fail endangered and heritage languages. The failure is structural, not technical. These platforms optimize for high-resource languages with abundant training data, abundant speakers, and established pedagogical pipelines. Languages like Navajo, Cherokee, Yucatec Maya, and Aramaic have none of these.

More fundamentally, existing platforms fail the learners themselves:

The Approach — Three Pillars

1. Comprehensible Input (Krashen i+1)

Every encounter is generated one level above the learner's demonstrated competency. The AI does not present vocabulary lists or grammar tables; it presents language in context. This is Krashen's input hypothesis applied at scale: comprehensible input that stretches, never overwhelms.

Comprehensibility is operationalized through four structural mechanisms: (1) high-context situational narrative — every encounter is grounded in a specific physical and cultural scene, providing extralinguistic anchoring that makes new structures deducible from context; (2) cross-language cognate anchoring, where available, explicitly connecting new vocabulary to roots the learner already carries; (3) companion-mediated scaffolding — the AI companion introduces new structures through natural dialogue, not instruction, allowing semantic resolution through conversational context; and (4) strategic immediate translation for zero-context structures in low-resource languages where extralinguistic anchoring alone is insufficient. The system never reverts to grammar tables. Comprehensibility is built into the generation architecture, not appended as a correction.
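The "one level above" rule can be sketched as a small selection step. This is a minimal illustration, not LinguaHalla's implementation: the function names, the seven-band level list (Pre-A1 through C2), and the placeholder vocabulary are all assumptions made for the example.

```python
# CEFR bands in ascending order, including the Pre-A1 band (assumed ordering).
CEFR_LEVELS = ["Pre-A1", "A1", "A2", "B1", "B2", "C1", "C2"]


def target_level(current: str) -> str:
    """Return the band one step above `current` (the "+1" in i+1), capped at C2."""
    idx = CEFR_LEVELS.index(current)
    return CEFR_LEVELS[min(idx + 1, len(CEFR_LEVELS) - 1)]


def pick_new_items(graded_vocab: dict[str, str], known: set[str],
                   current: str, limit: int = 3) -> list[str]:
    """Choose up to `limit` not-yet-known words graded at the target band.

    `graded_vocab` maps a word to its CEFR band. Sorting keeps the result
    deterministic; a real system would rank candidates by narrative fit.
    """
    goal = target_level(current)
    candidates = [w for w, lvl in graded_vocab.items()
                  if lvl == goal and w not in known]
    return sorted(candidates)[:limit]


# Placeholder words, not real corpus entries.
vocab = {"word_a": "A2", "word_b": "A2", "word_c": "B1"}
print(pick_new_items(vocab, known={"word_a"}, current="A1"))  # ['word_b']
```

Everything at or below the learner's current band stays available as familiar context, so only the selected items are new in a given encounter.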

2. Spaced Repetition (SM-2 Algorithm)

Vocabulary surfaces at review intervals calibrated to demonstrated retention. Target words are not confined to explicit review sessions; they are woven back into generated narrative encounters, creating contextual reinforcement rather than isolated drilling. The system tracks 49,149,843 vocabulary entries across 119 languages with per-learner retention state.
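For readers unfamiliar with SM-2, the sketch below implements the commonly published form of its update rule. The `ReviewState` and `sm2_update` names are assumptions for illustration; the paper does not describe how LinguaHalla stores retention state or whether it modifies the algorithm.

```python
from dataclasses import dataclass


@dataclass
class ReviewState:
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # days until the next review
    easiness: float = 2.5   # SM-2 "E-Factor", never allowed below 1.3


def sm2_update(state: ReviewState, quality: int) -> ReviewState:
    """Apply one SM-2 review with recall quality from 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality >= 3:
        # Successful recall: 1 day, then 6 days, then grow by the easiness factor.
        if state.repetitions == 0:
            interval = 1
        elif state.repetitions == 1:
            interval = 6
        else:
            interval = round(state.interval_days * state.easiness)
        repetitions = state.repetitions + 1
    else:
        # Failed recall: restart the repetition sequence.
        repetitions, interval = 0, 1
    # Standard SM-2 easiness adjustment. Some variants skip this step on failure.
    delta = 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)
    return ReviewState(repetitions, interval, max(1.3, state.easiness + delta))


state = ReviewState()
for q in (5, 5, 5):
    state = sm2_update(state, q)
print(state.interval_days)  # 16
```

Weaving a due word into a narrative encounter would count as one review, with the quality score derived from how the learner handled the word in context.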

3. Affective Filter Reduction

No streaks. No leaderboards. No public performance metrics. No punishment mechanics. Krashen's affective filter hypothesis identifies anxiety as the primary acquisition blocker — LinguaHalla's design is built around minimizing it. Motivation is narrative and intrinsic.

The structural mechanics of intrinsic motivation replace the dopamine loop of standard gamification: narrative progression creates genuine investment in companion relationships and unfolding story arcs; cultural unlocking — accessing deeper philosophical texts, chain transitions, and convergence encounters — provides milestone rewards tied to linguistic achievement rather than daily login compliance; and personal progress visibility shows the learner's own trajectory without comparative ranking. For heritage learners navigating complex emotional relationships with their ancestral languages, this is not a design preference — it is a clinical necessity. Extrinsic gamification measurably increases anxiety in this learner population. LinguaHalla's affective architecture is designed specifically to hold space for that complexity.

The Data

Current platform statistics, rendered live:

Total vocabulary entries: 49,149,843
Active languages: 119
Structured arc premises: 1,315
Estimated curriculum hours: ~1,052
Wisdom traditions & cultural idioms: 2,003

Curriculum hours estimated at 6 encounters per arc × 8 minutes per encounter. Idiom layer spans 27 language traditions — from Epictetus's Enchiridion to Hawaiian ʻŌlelo Noʻeau to Classical Nahuatl huehuetlatolli.
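The curriculum-hours figure follows directly from the stated assumptions. A short check, using only numbers given in this section (the variable names are illustrative):

```python
ARC_PREMISES = 1_315          # structured arc premises reported above
ENCOUNTERS_PER_ARC = 6        # stated estimate
MINUTES_PER_ENCOUNTER = 8     # stated estimate

total_minutes = ARC_PREMISES * ENCOUNTERS_PER_ARC * MINUTES_PER_ENCOUNTER
curriculum_hours = total_minutes / 60

print(total_minutes, curriculum_hours)  # 63120 1052.0
```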

Language Coverage

Living Languages
Language | Companion | Vocabulary | Arcs
Amharic | Makeda | 246,111 | 11
Arabic | Nadia | 5,582,515 | 25
Bengali | Ananya | 19,496 | 11
Darija | Farida | 2,020 | 11
Dutch | Pieter | 5,802,312 | 11
Farsi | Dariush | 4,002,700 | 11
French | Camille | 16,010 | 11
German | Hanna | 16,130 | 11
Guaraní | Tupã | 1,503 | 11
Gulf Arabic | Khalid | 637 | 11
Hindi | Priya | 43,282 | 11
Italian | Marco | 5,762,828 | 11
Japanese | Yuki | 21,942 | 11
Kannada | Meera |  | 11
Korean | Seo-yeon | 10,443,469 | 12
Lao | Malee | 8,882 | 12
Malayalam | Meera |  | 11
Mandarin | Mei | 16,684 | 11
Modern Greek | Kostas | 2,595,972 | 11
Mongolian | Temür | 6,004 | 11
Norwegian | Håkon | 26,780 | 12
Polish | Aleksander | 5,602,427 | 11
Portuguese | Beatriz | 16,017 | 11
Punjabi | Harpreet | 8,519 | 11
Romanian | Luminița | 2,999 | 11
Russian | Dmitri | 427,830 | 11
Spanish | Amara | 16,030 | 17
Swahili | Jabari | 22,098 | 12
Swedish | Ingrid | 10,000 | 11
Tamil | Kavya | 11,762 | 11
Telugu | Meera |  | 11
Thai | Malee | 815,360 | 12
Tibetan | Tenzin | 3,366 | 11
Tigrinya | Makeda | 908 | 11
Turkish | Kerem | 3,311,064 | 11
Vietnamese | Malee | 12,727 | 11
Yoruba | Abena | 5,016 | 13
Heritage Languages
Language | Companion | Vocabulary | Arcs
Amazigh | Farida | 685 | 11
Aymara | Killa | 1,653 | 16
Fijian | Sefina | 13,405 | 9
Irish Gaelic | Ciara | 404,442 | 11
Khmer | Malee | 9,242 | 10
Māori | Sefina | 32,313 | 9
Quechua | Killa | 2,237 | 19
Romani | Zindel | 2,376 | 11
Samoan | Sefina | 18,307 | 9
Scots Gaelic | Ciara | 124,532 | 11
Tahitian | Sefina | 5,382 | 9
Tongan | Sefina | 19,287 | 9
Twi | Abena | 1,277 | 12
Language Preservation
Language | Companion | Vocabulary | Arcs
Aramaic | Tadai | 2,772 | 15
Cherokee | Awinita | 40,183 | 11
Cook Islands Māori | Sefina | 2,005 | 9
Hawaiian | Sefina | 30,493 | 11
Nahuatl | Vesper | 88,713 | 11
Navajo | Nizhoni | 70,564 | 11
Niuean | Sefina | 510 | 9
Tokelauan | Sefina | 506 | 9
Tuvaluan | Sefina | 502 | 9
Ute | Kaya | 2,852 | 11
Yucatec Maya | Ix Kan | 645 | 11
Classical Languages
Language | Companion | Vocabulary | Arcs
Akkadian | Enlil | 1,210 | 11
Ancient Greek | Kostas | 20,118 | 11
Classical Arabic | Khalid | 16,184 | 22
Classical Chinese | Mei | 3,659 | 6
Classical Malayalam | Meera |  | 8
Classical Mongolian | Temür | 505 | 11
Classical Tamil | Kavya | 1,500 | 11
Classical Tibetan | Tenzin | 822 | 11
Coptic | Nadia | 2,801 | 11
Ge'ez | Makeda | 622 | 11
Koine Greek | Kostas | 801 | 11
Latin | Luminița | 22,835 | 28
Middle Dutch | Pieter | 2,023 | 11
Middle Persian | Dariush | 601 | 11
Old Anatolian Turkish | Kerem | 410 | 11
Old Church Slavonic | Luminița | 3,000 | 22
Old Kannada | Meera |  | 8
Old Norse | Sigrid | 34,768 | 26
Old Persian | Dariush | 536 | 11
Old Polish | Aleksander | 3,105 | 11
Old Telugu | Meera |  | 8
Old Tupí | Tupã | 500 | 11
Ottoman Turkish | Kerem | 8,322 | 11
Pali | Ananya | 8,551 | 11
Sanskrit | Saraswati | 11,199 | 43
Sant Bhasha | Harpreet | 1,205 | 11
Ancient Languages
Language | Companion | Vocabulary | Arcs
Avestan | Dariush | 604 | 11
Classical Syriac | Hunayn | 1,502 | 8
Dacian | Luminița | 400 | 9
Middle Egyptian | Nadia | 431 | 11
Middle High German | Hanna | 1,623 | 5
Middle Korean | Seo-yeon | 928 | 5
Old Dutch | Pieter | 1,810 | 11
Old French | Camille | 8,630 | 6
Old High German | Hanna | 3,453 | 4
Old Japanese | Yuki | 411 | 5
Old Portuguese-Galician | Beatriz | 1,437 | 5
Old Syriac | Hunayn | 804 | 6
Proto-Bantu | Jabari | 401 | 9
Proto-Dravidian | Kavya | 516 | 5
Proto-Ethiosemitic | Makeda | 411 | 8
Proto-Germanic | Hanna | 5,753 | 4
Proto-Japonic | Yuki | 552 | 3
Proto-Koreanic | Seo-yeon | 104 | 3
Proto-Mongolic | Temür | 400 | 11
Proto-Niger-Congo | Abena | 402 | 9
Proto-Quechuan | Killa | 402 | 11
Proto-Semitic | Khalid | 403 | 11
Proto-Sino-Tibetan | Mei | 441 | 7
Proto-Slavic | Dmitri | 401 | 9
Proto-Tai | Malee | 411 | 9
Proto-Tupian | Tupã | 401 | 9
Proto-Turkic | Kerem | 401 | 11
Sumerian | Enlil | 759 | 11
Vulgar Latin | Marco | 503 | 11
Semitic Chain
Language | Companion | Vocabulary | Arcs
Ancient Hebrew | Eitan | 6,617 | 11
Biblical Hebrew | Eitan | 9,100 | 11
Modern Hebrew | Eitan | 3,193,710 | 11

The Architecture

LinguaHalla is not a static curriculum. It is an adaptive generation system built around four technical pillars:

Adaptive Encounter Generation. Each encounter is generated by a large language model with per-learner profile injection: current CEFR level, recent vocabulary, learning motivation, heritage identity, demonstrated retention patterns, and arc narrative context. No two learners receive identical content even within the same arc premise.
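Profile injection of this kind can be pictured as rendering the learner's state into the generation prompt. The sketch below is an assumption-laden illustration: the `LearnerProfile` fields mirror the list above, but the field names, the prompt wording, and `build_generation_prompt` itself are hypothetical and not taken from LinguaHalla's codebase.

```python
from dataclasses import dataclass


@dataclass
class LearnerProfile:
    """Per-learner fields named in the paper; the field names are hypothetical."""
    cefr_level: str
    recent_vocabulary: list[str]
    motivation: str
    heritage_identity: str
    retention_notes: str
    arc_context: str


def build_generation_prompt(arc_premise: str, profile: LearnerProfile) -> str:
    """Render one arc premise plus one learner profile into a prompt string."""
    return "\n".join([
        f"Arc premise: {arc_premise}",
        f"Learner level (CEFR): {profile.cefr_level}",
        f"Recently studied vocabulary: {', '.join(profile.recent_vocabulary)}",
        f"Learning motivation: {profile.motivation}",
        f"Heritage identity: {profile.heritage_identity}",
        f"Retention patterns: {profile.retention_notes}",
        f"Story so far: {profile.arc_context}",
        "Write the next encounter for this learner.",
    ])


# Placeholder values only; none of this is real learner data.
learner = LearnerProfile(
    cefr_level="A2",
    recent_vocabulary=["word_a", "word_b"],
    motivation="reconnect with family language",
    heritage_identity="heritage learner",
    retention_notes="forgets word_b after long gaps",
    arc_context="the learner has just arrived at the market",
)
print(build_generation_prompt("A morning at the market", learner))
```

Because the profile is part of the prompt, two learners on the same arc premise produce different prompts, which is what makes identical output unlikely.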

Soul Documents. Each of the more than 38 AI companions has a structured identity document: not a character description, but a cultural and linguistic positioning document. Eitan Levi, the companion for Modern Hebrew and Biblical Hebrew, is a linguist and IDF veteran with a PhD from Hebrew University. Vesper, the companion for Nahuatl, carries the full weight of Aztec intellectual tradition. These soul documents ensure historically and linguistically accurate narrative, not performative diversity.

Chain Mechanics. Languages are organized into historical lineages. Spanish → Nahuatl. Modern Hebrew → Biblical Hebrew → Ancient Hebrew. Darija → Classical Arabic → Proto-Semitic. Learners are not learning isolated language skills — they are being guided through linguistic evolution, understanding why languages exist the way they do. This is pedagogy, not gamification.

Convergence Mechanics. Multiple chains meet at shared ancestors. Sanskrit receives learners arriving from Hindi, Bengali, Pali, and Romani chains simultaneously. Proto-Semitic receives learners from Arabic, Hebrew, and Aramaic chains. Convergence points become natural community formation moments and reinforce the deep interconnectedness of human language families.
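Chains and convergence points can be modeled as a directed graph in which a convergence point is any language with two or more incoming chains. The sketch below uses only the lineages named above. As a simplifying assumption, the Sanskrit and Proto-Semitic feeders are drawn as direct links, although the real chains may pass through intermediate stages; the function names are hypothetical.

```python
from collections import defaultdict

# (from, to) steps in learner progression order.
CHAIN_EDGES = [
    ("Spanish", "Nahuatl"),
    ("Modern Hebrew", "Biblical Hebrew"),
    ("Biblical Hebrew", "Ancient Hebrew"),
    ("Darija", "Classical Arabic"),
    ("Classical Arabic", "Proto-Semitic"),
    ("Ancient Hebrew", "Proto-Semitic"),  # assumed direct link
    ("Aramaic", "Proto-Semitic"),         # assumed direct link
    ("Hindi", "Sanskrit"),                # assumed direct link
    ("Bengali", "Sanskrit"),              # assumed direct link
    ("Pali", "Sanskrit"),                 # assumed direct link
    ("Romani", "Sanskrit"),               # assumed direct link
]


def convergence_points(edges):
    """Map each language fed by two or more chains to its sorted feeder list."""
    incoming = defaultdict(set)
    for src, dst in edges:
        incoming[dst].add(src)
    return {lang: sorted(srcs) for lang, srcs in incoming.items() if len(srcs) >= 2}


def chain_from(start, edges):
    """Follow a chain from `start` to its end (one successor per language here)."""
    successor = dict(edges)
    path = [start]
    while path[-1] in successor:
        path.append(successor[path[-1]])
    return path


print(sorted(convergence_points(CHAIN_EDGES)))  # ['Proto-Semitic', 'Sanskrit']
print(chain_from("Darija", CHAIN_EDGES))
```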

The Automated Linguistic Standardization Engine

Behind LinguaHalla's encounter generation system is a batch classification pipeline that represents one of the platform's highest-value technical contributions: an automated engine that can ingest any raw text corpus on earth, analyze its syntactic and morphological complexity, and dynamically assign it a standardized fluency rating.

For high-resource modern languages — Spanish, French, Mandarin — CEFR frameworks already exist. Human academics have spent decades labeling text at scale. LinguaHalla inherits that work and builds on it.

For endangered indigenous languages, classical languages, and ancient root systems, no such framework exists. There is no official body grading Navajo text. There is no CEFR rubric for Yucatec Maya, Cherokee, or Aramaic. There is no curriculum pipeline for Proto-Semitic or Old Church Slavonic.

LinguaHalla builds that pipeline automatically.

The CEFR classifier operates as a continuous background process. It ingests raw vocabulary entries; analyzes morphological complexity, syllabic density, syntactic register, and cross-language cognate relationships; and outputs a graded difficulty assignment across seven standardized levels (Pre-A1 through C2). In a 15-hour processing window, the system labeled 71,574 vocabulary entries (roughly 4,800 per hour) across languages including Russian (Cyrillic script), Hindi (Devanagari script), and dozens of Latin-script languages simultaneously.

The institutional implication is significant. Academic archives around the world hold massive corpora of recorded, transcribed, and digitized indigenous and classical text — sitting in databases with no pedagogical structure. No grading. No sequencing. No curriculum pathway for a modern learner. These archives are documentation. They are not acquisition resources.

LinguaHalla's standardization engine bridges that gap. Feed it a raw text corpus. It outputs a graded, sequenced, curriculum-ready vocabulary pipeline. The transformation from archive to classroom takes hours, not years.

What the Engine Does
  • Ingests raw text corpora in any script — Latin, Cyrillic, Devanagari, Arabic, CJK, Syriac, Ethiopic, Cherokee syllabary — including PDF, document, and scanned image formats via OCR pipeline
  • Applies language-family-specific feature extractors: fusional/inflected languages (Russian, Latin, Arabic) are scored by case paradigm density, inflectional entropy, and cross-language cognate frequency; agglutinative languages (Turkish, Mongolian, Finnish) by affix-to-stem ratios and morpheme boundary complexity; polysynthetic languages (Navajo, Cherokee, Ute) by verbal template complexity, stem-space variation, and affix-stripping confidence — not syllabic density, which is an unreliable proxy for polysynthetic difficulty
  • Applies dual-pass evaluation: a semantic complexity pass using multilingual embedding spaces to assess conceptual density and register, combined with a structural/morphological complexity pass using language-specific morpheme-to-word ratios and type-to-token ratios in the target script
  • Assigns CEFR difficulty levels (Pre-A1 through C2) based on the combined complexity score, calibrated per language family
  • Outputs a sequenced curriculum pipeline compatible with the comprehensible input (i+1) methodology
  • Operates continuously as a background process — new languages and new corpora are classified automatically on ingestion
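The scoring and level-assignment steps in the list above can be sketched as follows. This is a toy illustration under stated assumptions: the equal weighting, the cut points, and all function names are hypothetical, and the semantic score is taken as a given input rather than computed from embeddings. Per-family calibration is represented only by the ability to pass a different `cuts` list.

```python
import bisect

CEFR_LEVELS = ["Pre-A1", "A1", "A2", "B1", "B2", "C1", "C2"]
# Hypothetical cut points on a 0..1 combined score; six cuts give seven bands.
DEFAULT_CUTS = [0.15, 0.30, 0.45, 0.60, 0.75, 0.90]


def type_token_ratio(tokens: list[str]) -> float:
    """Distinct word forms divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def morphemes_per_word(segmented: list[list[str]]) -> float:
    """Average morpheme count per word, given pre-segmented words."""
    return sum(len(w) for w in segmented) / len(segmented) if segmented else 0.0


def combined_score(semantic: float, structural: float, weight: float = 0.5) -> float:
    """Blend the two passes into one score, clamped to the 0..1 range."""
    score = weight * semantic + (1 - weight) * structural
    return min(1.0, max(0.0, score))


def assign_level(score: float, cuts: list[float] = DEFAULT_CUTS) -> str:
    """Map a combined score to a band; pass family-specific `cuts` to recalibrate."""
    return CEFR_LEVELS[bisect.bisect_right(cuts, score)]


print(type_token_ratio(["a", "b", "a", "c"]))   # 0.75
print(assign_level(combined_score(0.4, 0.6)))   # B1
```

In practice the structural pass would first normalize raw ratios such as morphemes per word into the 0..1 range before blending, and that normalization is where language-family-specific behavior would live.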

This is not a language learning feature. It is infrastructure for linguistic preservation at institutional scale. The same engine that sequences Spanish vocabulary for a casual learner can receive a corpus of Ute recordings from the Uintah-Ouray Language Program and return a structured, graded curriculum — without human academic intervention.

Documentation without acquisition is a museum. Documentation with acquisition is a movement. Acquisition without curriculum is chaos. The standardization engine eliminates the chaos.

A critical architectural constraint governs encounter generation for all languages, but especially for endangered and low-resource languages: the Adaptive Encounter Generation system operates under a verified corpus constraint. The LLM generates narrative structure, dialogue framing, and cultural context — but all vocabulary, grammatical forms, and idiomatic expressions used in encounters are drawn from verified corpus entries in LinguaHalla's vocabulary database, not generated freely from the model's parametric knowledge. For languages like Yucatec Maya, Classical Syriac, and Cherokee — where large language models are known to hallucinate syntactic structures and lexical forms — this retrieval constraint is not optional. It is the primary safeguard against linguistic drift. The LLM acts as the creative contextualizer. The verified corpus is the source of linguistic truth.
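One way to picture the verified corpus constraint is as a post-generation check that rejects any target-language text containing forms absent from the verified lexicon. The sketch below is a minimal illustration, not the platform's mechanism: the function names are hypothetical, the lexicon entries are placeholders, and whitespace tokenization is a simplification that would not hold for many of the scripts discussed here.

```python
import string
import unicodedata


def normalize(token: str) -> str:
    """NFC-normalize and casefold so equivalent encodings compare equal."""
    return unicodedata.normalize("NFC", token).casefold()


def unverified_tokens(target_text: str, verified: set[str]) -> list[str]:
    """Return tokens in `target_text` that are missing from the verified lexicon.

    Tokens are split on whitespace and stripped of ASCII punctuation, which
    is an assumption that real tokenizers would replace per language.
    """
    allowed = {normalize(v) for v in verified}
    missing = []
    for raw in target_text.split():
        token = normalize(raw.strip(string.punctuation))
        if token and token not in allowed:
            missing.append(token)
    return missing


def accept_encounter(target_text: str, verified: set[str]) -> bool:
    """Accept generated target-language text only if every token is verified."""
    return not unverified_tokens(target_text, verified)


lexicon = {"alpha", "beta"}  # placeholder entries, not real corpus data
print(accept_encounter("Alpha beta.", lexicon))     # True
print(unverified_tokens("alpha gamma!", lexicon))   # ['gamma']
```

A check like this would apply only to the target-language segments of an encounter; the narrative framing written in the learner's own language is where the model is free to generate.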

The Idiom Layer — Cultural Inheritance, Not Vocabulary Lists

Vocabulary tells you what a language contains. Idioms tell you what a civilization believes. LinguaHalla's Idiom Layer is a structured corpus of over 2,000 culturally embedded expressions, philosophical proverbs, and wisdom traditions — drawn from primary sources across 27 language traditions and tied directly to the companions who carry them.

The monetization thesis of modern language learning is broken. Competing on vocabulary count is a race to the bottom. LinguaHalla does not sell a dictionary. It sells cultural inheritance — the living wisdom traditions that vocabulary alone can never convey.

📜
Tier 1 — Compressed Story Idioms

Four-character Chinese chéngyǔ, Japanese yojijukugo, and equivalent compressed idiom traditions. Each entry encodes a complete historical story in four characters. 臥薪嚐膽 (wò xīn cháng dǎn), “sleep on brushwood and taste gall,” carries 2,500 years of Chinese strategic thought in four syllables. These surface as cultural moments woven into regular encounter generation, appearing when the learner encounters vocabulary connected to the idiom's theme.

🏛️
Tier 2 — Philosophical Wisdom Texts

Primary source philosophical content tied to each companion's chain. Kostas teaches the full Epictetus Enchiridion (53 chapters) and Marcus Aurelius's Meditations (all 12 books) in Ancient and Koine Greek. Dariush unpacks Hafez ghazals and Rumi's Masnavi in Farsi and Pahlavi. Kavya knows all 1,330 Thirukkural couplets by heart. Saraswati carries the Bhagavad Gita, the Upanishadic mahāvākyas, and the Hitopadesha. Sigrid has memorized the Hávamál. These surface as dedicated Wisdom Sessions — premium encounters where a companion unpacks one text in depth, connecting etymology to meaning to cultural context.

🌍
Tier 3 — Living Proverb Traditions

Every language carries a living proverb tradition that encodes its civilization's worldview. Hawaiian ʻŌlelo Noʻeau (2,942 documented proverbs). Māori whakataukī. Akan Adinkra symbol wisdom. Yoruba Ifá oracle philosophy. Mongolian steppe wisdom. Quechua Andean concepts: ayni, sumak kawsay, pachamama. Korean 속담. These appear on each language's public page, making the depth of each tradition visible before a learner even begins and giving search engines rich, culturally specific content that no other language platform provides.

The result is a corpus that no institution has assembled before: Hávamál runics, Ifá West African oracle frameworks, Upanishadic metaphysics, Talmudic Aramaic, Classical Nahuatl huehuetlatolli, Man'yōshū pillow words, and Book of Enoch passages inside a single unified schema, with each entry tied to a companion who can teach it through narrative encounter. Documentation without acquisition is a museum. The Idiom Layer is where the archive comes alive.

Grant & Partnership Alignment

LinguaHalla is designed for institutional partnership. Its architecture maps directly onto existing federal funding mechanisms for language preservation and documentation:

Contact & Collaboration

For research partnerships, institutional licensing, tribal collaboration, or grant co-PI inquiries:

Victor Christiansen
Founder, Runeteca Technologies LLC
Salt Lake City, Utah
victor@runeteca.com