White Paper

The LinguaHalla Methodology: Scaling Language Acquisition for Endangered, Heritage, and Classical Languages Through Adaptive AI Narrative

Name: LinguaHalla Multilingual Vocabulary Database
Creator: Runeteca Technologies LLC
License: https://linguahalla.com/methodology

Runeteca Technologies LLC · Victor Christiansen · May 2026

The Problem

Standard language applications fail endangered and heritage languages. The failure is structural, not technical. These platforms optimize for high-resource languages with abundant training data, abundant speakers, and established pedagogical pipelines. Languages like Navajo, Cherokee, Yucatec Maya, and Aramaic have none of these.

More fundamentally, existing platforms fail the learners themselves:

Gamification mechanics — streaks, leaderboards, public performance metrics — increase anxiety and measurably reduce acquisition in adult learners, particularly heritage learners navigating complex emotional relationships with their ancestral languages.
Static content cannot adapt to individual learner background, motivation, or cultural positioning. A Cherokee Nation citizen rediscovering their heritage language needs a fundamentally different pedagogical approach than a linguistics student learning it academically.
No existing commercial platform provides CEFR-mapped, curriculum-structured instruction for languages like Navajo, Cherokee, Yucatec Maya, or Aramaic.

The "language app" category has implicitly written off preservation as out of scope. LinguaHalla is built on the premise that this is not acceptable.

The Approach — Three Pillars

Comprehensible Input (Krashen i+1)

Every encounter is generated at exactly one level above the learner's current demonstrated competency. The AI does not present vocabulary lists or grammar tables — it presents language in context, calibrated to the learner's current level. This is the Krashen acquisition hypothesis applied at scale: comprehensible input that stretches, never overwhelms.

Comprehensibility is operationalized through four structural mechanisms: (1) high-context situational narrative — every encounter is grounded in a specific physical and cultural scene, providing extralinguistic anchoring that makes new structures deducible from context; (2) cross-language cognate anchoring, where available, explicitly connecting new vocabulary to roots the learner already carries; (3) companion-mediated scaffolding — the AI companion introduces new structures through natural dialogue, not instruction, allowing semantic resolution through conversational context; and (4) strategic immediate translation for zero-context structures in low-resource languages where extralinguistic anchoring alone is insufficient. The system never reverts to grammar tables. Comprehensibility is built into the generation architecture, not appended as a correction.
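The "one level above" selection rule can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the function names, the dictionary of graded vocabulary, and the placeholder words are all assumptions. Only the level scale (Pre-A1 through C2) comes from this paper.

```python
# Level scale used in this paper (Pre-A1 through C2).
CEFR_LEVELS = ["Pre-A1", "A1", "A2", "B1", "B2", "C1", "C2"]


def target_level(current: str) -> str:
    """Return the level one step above `current`, capped at C2."""
    i = CEFR_LEVELS.index(current)
    return CEFR_LEVELS[min(i + 1, len(CEFR_LEVELS) - 1)]


def pick_new_items(graded_vocab, known, current, limit=3):
    """Choose up to `limit` not-yet-known words graded at the i+1 level.

    graded_vocab: hypothetical mapping of word -> CEFR level.
    known: set of words the learner has already demonstrated.
    """
    target = target_level(current)
    candidates = [
        word for word, level in graded_vocab.items()
        if level == target and word not in known
    ]
    return sorted(candidates)[:limit]


# Placeholder entries, not real vocabulary data.
vocab = {"word_a": "A1", "word_b": "A2", "word_c": "A2", "word_d": "B1"}
new_items = pick_new_items(vocab, known={"word_b"}, current="A1")
```

With a learner at A1 who already knows word_b, the sketch selects only word_c, the remaining A2 item.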

Spaced Repetition (SM-2 Algorithm)

Vocabulary surfaces at review intervals calibrated to demonstrated retention. Target words are surfaced not only in explicit review sessions; they are also woven back into generated narrative encounters, creating contextual reinforcement rather than isolated drilling. The system tracks 49,149,843 vocabulary entries across 119 languages with per-learner retention state.
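For readers unfamiliar with SM-2, the update rule can be sketched as follows. This follows the commonly published form of the algorithm (quality grades 0 to 5, starting ease factor 2.5, floor of 1.3). The class and field names are assumptions, and the paper does not say how LinguaHalla maps narrative performance onto a quality grade.

```python
from dataclasses import dataclass


@dataclass
class CardState:
    """Hypothetical per-learner retention state for one vocabulary entry."""
    repetitions: int = 0      # consecutive successful reviews
    interval_days: int = 0    # days until the next review
    ease: float = 2.5         # SM-2 ease factor


def sm2_review(state: CardState, quality: int) -> CardState:
    """Apply one SM-2 review. `quality` is 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Failed recall: restart the repetition sequence, keep the ease factor.
        return CardState(repetitions=0, interval_days=1, ease=state.ease)
    if state.repetitions == 0:
        interval = 1
    elif state.repetitions == 1:
        interval = 6
    else:
        interval = round(state.interval_days * state.ease)
    delta = 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)
    return CardState(
        repetitions=state.repetitions + 1,
        interval_days=interval,
        ease=max(1.3, state.ease + delta),
    )


first = sm2_review(CardState(), quality=5)
second = sm2_review(first, quality=5)
third = sm2_review(second, quality=5)
lapsed = sm2_review(third, quality=2)
```

Three perfect reviews in a row schedule the word at 1, 6, and then 16 days. A failed review resets the interval to one day.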

Affective Filter Reduction

No streaks. No leaderboards. No public performance metrics. No punishment mechanics. Krashen's affective filter hypothesis identifies anxiety as the primary acquisition blocker — LinguaHalla's design is built around minimizing it. Motivation is narrative and intrinsic.

The structural mechanics of intrinsic motivation replace the dopamine loop of standard gamification: narrative progression creates genuine investment in companion relationships and unfolding story arcs; cultural unlocking — accessing deeper philosophical texts, chain transitions, and convergence encounters — provides milestone rewards tied to linguistic achievement rather than daily login compliance; and personal progress visibility shows the learner's own trajectory without comparative ranking. For heritage learners navigating complex emotional relationships with their ancestral languages, this is not a design preference — it is a clinical necessity. Extrinsic gamification measurably increases anxiety in this learner population. LinguaHalla's affective architecture is designed specifically to hold space for that complexity.

The Data

Current platform statistics, rendered live:

Metric	Value
Total vocabulary entries	49,149,843
Active languages	119
Structured arc premises	1,315
Estimated curriculum hours	~1,052
Wisdom traditions & cultural idioms	2,003

Curriculum hours estimated at 6 encounters per arc × 8 minutes per encounter. Idiom layer spans 27 language traditions — from Epictetus's Enchiridion to Hawaiian ʻŌlelo Noʻeau to Classical Nahuatl huehuetlatolli.
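The curriculum-hours figure can be reproduced from the numbers already given in this section. The short check below uses only those published values; the variable names are illustrative.

```python
ARC_PREMISES = 1_315          # structured arc premises (from the statistics above)
ENCOUNTERS_PER_ARC = 6        # estimation assumption stated above
MINUTES_PER_ENCOUNTER = 8     # estimation assumption stated above

total_minutes = ARC_PREMISES * ENCOUNTERS_PER_ARC * MINUTES_PER_ENCOUNTER
curriculum_hours = total_minutes / 60

print(f"{total_minutes:,} minutes = {curriculum_hours:,.0f} hours")
```

That is 63,120 minutes, or 1,052 hours, matching the estimate above.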

Language Coverage

Living Languages

Language	Companion	Vocabulary	Arcs
Amharic	Makeda	246,111	11
Arabic	Nadia	5,582,515	25
Bengali	Ananya	19,496	11
Darija	Farida	2,020	11
Dutch	Pieter	5,802,312	11
Farsi	Dariush	4,002,700	11
French	Camille	16,010	11
German	Hanna	16,130	11
Guaraní	Tupã	1,503	11
Gulf Arabic	Khalid	637	11
Hindi	Priya	43,282	11
Italian	Marco	5,762,828	11
Japanese	Yuki	21,942	11
Kannada	Meera	—	11
Korean	Seo-yeon	10,443,469	12
Lao	Malee	8,882	12
Malayalam	Meera	—	11
Mandarin	Mei	16,684	11
Modern Greek	Kostas	2,595,972	11
Mongolian	Temür	6,004	11
Norwegian	Håkon	26,780	12
Polish	Aleksander	5,602,427	11
Portuguese	Beatriz	16,017	11
Punjabi	Harpreet	8,519	11
Romanian	Luminița	2,999	11
Russian	Dmitri	427,830	11
Spanish	Amara	16,030	17
Swahili	Jabari	22,098	12
Swedish	Ingrid	10,000	11
Tamil	Kavya	11,762	11
Telugu	Meera	—	11
Thai	Malee	815,360	12
Tibetan	Tenzin	3,366	11
Tigrinya	Makeda	908	11
Turkish	Kerem	3,311,064	11
Vietnamese	Malee	12,727	11
Yoruba	Abena	5,016	13

Heritage Languages

Language	Companion	Vocabulary	Arcs
Amazigh	Farida	685	11
Aymara	Killa	1,653	16
Fijian	Sefina	13,405	9
Irish Gaelic	Ciara	404,442	11
Khmer	Malee	9,242	10
Māori	Sefina	32,313	9
Quechua	Killa	2,237	19
Romani	Zindel	2,376	11
Samoan	Sefina	18,307	9
Scots Gaelic	Ciara	124,532	11
Tahitian	Sefina	5,382	9
Tongan	Sefina	19,287	9
Twi	Abena	1,277	12

Language Preservation

Language	Companion	Vocabulary	Arcs
Aramaic	Tadai	2,772	15
Cherokee	Awinita	40,183	11
Cook Islands Māori	Sefina	2,005	9
Hawaiian	Sefina	30,493	11
Nahuatl	Vesper	88,713	11
Navajo	Nizhoni	70,564	11
Niuean	Sefina	510	9
Tokelauan	Sefina	506	9
Tuvaluan	Sefina	502	9
Ute	Kaya	2,852	11
Yucatec Maya	Ix Kan	645	11

Classical Languages

Language	Companion	Vocabulary	Arcs
Akkadian	Enlil	1,210	11
Ancient Greek	Kostas	20,118	11
Classical Arabic	Khalid	16,184	22
Classical Chinese	Mei	3,659	6
Classical Malayalam	Meera	—	8
Classical Mongolian	Temür	505	11
Classical Tamil	Kavya	1,500	11
Classical Tibetan	Tenzin	822	11
Coptic	Nadia	2,801	11
Ge'ez	Makeda	622	11
Koine Greek	Kostas	801	11
Latin	Luminița	22,835	28
Middle Dutch	Pieter	2,023	11
Middle Persian	Dariush	601	11
Old Anatolian Turkish	Kerem	410	11
Old Church Slavonic	Luminița	3,000	22
Old Kannada	Meera	—	8
Old Norse	Sigrid	34,768	26
Old Persian	Dariush	536	11
Old Polish	Aleksander	3,105	11
Old Telugu	Meera	—	8
Old Tupí	Tupã	500	11
Ottoman Turkish	Kerem	8,322	11
Pali	Ananya	8,551	11
Sanskrit	Saraswati	11,199	43
Sant Bhasha	Harpreet	1,205	11

Ancient Languages

Language	Companion	Vocabulary	Arcs
Avestan	Dariush	604	11
Classical Syriac	Hunayn	1,502	8
Dacian	Luminița	400	9
Middle Egyptian	Nadia	431	11
Middle High German	Hanna	1,623	5
Middle Korean	Seo-yeon	928	5
Old Dutch	Pieter	1,810	11
Old French	Camille	8,630	6
Old High German	Hanna	3,453	4
Old Japanese	Yuki	411	5
Old Portuguese-Galician	Beatriz	1,437	5
Old Syriac	Hunayn	804	6
Proto-Bantu	Jabari	401	9
Proto-Dravidian	Kavya	516	5
Proto-Ethiosemitic	Makeda	411	8
Proto-Germanic	Hanna	5,753	4
Proto-Japonic	Yuki	552	3
Proto-Koreanic	Seo-yeon	104	3
Proto-Mongolic	Temür	400	11
Proto-Niger-Congo	Abena	402	9
Proto-Quechuan	Killa	402	11
Proto-Semitic	Khalid	403	11
Proto-Sino-Tibetan	Mei	441	7
Proto-Slavic	Dmitri	401	9
Proto-Tai	Malee	411	9
Proto-Tupian	Tupã	401	9
Proto-Turkic	Kerem	401	11
Sumerian	Enlil	759	11
Vulgar Latin	Marco	503	11

Semitic Chain

Language	Companion	Vocabulary	Arcs
Ancient Hebrew	Eitan	6,617	11
Biblical Hebrew	Eitan	9,100	11
Modern Hebrew	Eitan	3,193,710	11

The Architecture

LinguaHalla is not a static curriculum. It is an adaptive generation system built around four technical pillars:

Adaptive Encounter Generation. Each encounter is generated by a large language model with per-learner profile injection: current CEFR level, recent vocabulary, learning motivation, heritage identity, demonstrated retention patterns, and arc narrative context. No two learners receive identical content even within the same arc premise.
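Per-learner profile injection can be pictured as assembling a structured prompt from the fields listed above. The sketch below is illustrative only: the class, the field names, the prompt wording, and the example arc premise are assumptions, and it stops short of calling any model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LearnerProfile:
    """Hypothetical container for the profile fields named above."""
    cefr_level: str
    recent_vocabulary: List[str]
    motivation: str
    heritage_identity: Optional[str]
    retention_summary: str


def build_encounter_prompt(profile: LearnerProfile, companion: str,
                           arc_premise: str) -> str:
    """Assemble the per-learner context that would accompany an arc premise."""
    heritage = profile.heritage_identity or "not specified"
    lines = [
        f"Companion: {companion}",
        f"Arc premise: {arc_premise}",
        f"CEFR level: {profile.cefr_level}",
        f"Recent vocabulary: {', '.join(profile.recent_vocabulary)}",
        f"Motivation: {profile.motivation}",
        f"Heritage identity: {heritage}",
        f"Retention patterns: {profile.retention_summary}",
    ]
    return "\n".join(lines)


# Placeholder values; the vocabulary items are not real corpus entries.
heritage_learner = LearnerProfile("A2", ["word_1", "word_2"],
                                  "heritage reconnection", "Cherokee Nation citizen",
                                  "strong on nouns, weak on verb forms")
student = LearnerProfile("A2", ["word_1", "word_2"],
                         "academic study", None,
                         "even retention across categories")

prompt_a = build_encounter_prompt(heritage_learner, "Awinita", "example premise")
prompt_b = build_encounter_prompt(student, "Awinita", "example premise")
```

The two learners share an arc premise, a level, and recent vocabulary, yet the assembled context differs. That difference is what would lead a generator to produce different encounters for each.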

Soul Documents. Each of the more than 38 AI companions has a structured identity document: not a character description, but a cultural and linguistic positioning document. Eitan Levi, the companion for Modern Hebrew and Biblical Hebrew, is a linguist and IDF veteran with a PhD from Hebrew University. Vesper, the companion for Nahuatl, carries the full weight of Aztec intellectual tradition. These soul documents ensure historically and linguistically accurate narrative, not performative diversity.

Chain Mechanics. Languages are organized into historical lineages. Spanish → Nahuatl. Modern Hebrew → Biblical Hebrew → Ancient Hebrew. Darija → Classical Arabic → Proto-Semitic. Learners are not learning isolated language skills — they are being guided through linguistic evolution, understanding why languages exist the way they do. This is pedagogy, not gamification.

Convergence Mechanics. Multiple chains meet at shared ancestors. Sanskrit receives learners arriving from Hindi, Bengali, Pali, and Romani chains simultaneously. Proto-Semitic receives learners from Arabic, Hebrew, and Aramaic chains. Convergence points become natural community formation moments and reinforce the deep interconnectedness of human language families.
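Chains and convergence points map naturally onto a directed graph in which each language points to the next stage of its lineage. The sketch below encodes only the examples named above. Treating Hindi, Bengali, Pali, Romani, and Aramaic as linking directly to their convergence point, and Ancient Hebrew as linking to Proto-Semitic, is a simplifying assumption; the real chains may include intermediate stages.

```python
from collections import defaultdict

# language -> next stage in its chain (simplified, see assumptions above)
NEXT_STAGE = {
    "Modern Hebrew": "Biblical Hebrew",
    "Biblical Hebrew": "Ancient Hebrew",
    "Ancient Hebrew": "Proto-Semitic",
    "Darija": "Classical Arabic",
    "Classical Arabic": "Proto-Semitic",
    "Aramaic": "Proto-Semitic",
    "Hindi": "Sanskrit",
    "Bengali": "Sanskrit",
    "Pali": "Sanskrit",
    "Romani": "Sanskrit",
}


def chain_from(language: str) -> list:
    """Follow a chain from `language` to its final stage."""
    path = [language]
    while path[-1] in NEXT_STAGE and NEXT_STAGE[path[-1]] not in path:
        path.append(NEXT_STAGE[path[-1]])
    return path


def convergence_points() -> dict:
    """Return stages reached from two or more distinct predecessors."""
    incoming = defaultdict(set)
    for source, target in NEXT_STAGE.items():
        incoming[target].add(source)
    return {t: sorted(s) for t, s in incoming.items() if len(s) >= 2}


hebrew_chain = chain_from("Modern Hebrew")
meeting_points = convergence_points()
```

In this encoding, Sanskrit and Proto-Semitic are the only stages with more than one incoming chain, which is what makes them convergence points.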

The Automated Linguistic Standardization Engine

Behind LinguaHalla's encounter generation system is a batch classification pipeline that represents one of the platform's highest-value technical contributions: an automated engine that ingests a raw text corpus, analyzes its syntactic and morphological complexity, and dynamically assigns it a standardized fluency rating.

For high-resource modern languages — Spanish, French, Mandarin — CEFR frameworks already exist. Human academics have spent decades labeling text at scale. LinguaHalla inherits that work and builds on it.

For endangered indigenous languages, classical languages, and ancient root systems, no such framework exists. There is no official body grading Navajo text. There is no CEFR rubric for Yucatec Maya, Cherokee, or Aramaic. There is no curriculum pipeline for Proto-Semitic or Old Church Slavonic.

LinguaHalla builds that pipeline automatically.

The CEFR classifier operates as a continuous background process. It ingests raw vocabulary entries, analyzes morphological complexity, syllabic density, syntactic register, and cross-language cognate relationships, and outputs a graded difficulty assignment across seven standardized levels (Pre-A1 through C2). In a 15-hour processing window, the system labeled 71,574 vocabulary entries across languages including Russian (Cyrillic script), Hindi (Devanagari script), and dozens of Latin-script languages simultaneously.

The institutional implication is significant. Academic archives around the world hold massive corpora of recorded, transcribed, and digitized indigenous and classical text — sitting in databases with no pedagogical structure. No grading. No sequencing. No curriculum pathway for a modern learner. These archives are documentation. They are not acquisition resources.

LinguaHalla's standardization engine bridges that gap. Feed it a raw text corpus. It outputs a graded, sequenced, curriculum-ready vocabulary pipeline. The transformation from archive to classroom takes hours, not years.

What the Engine Does

Ingests raw text corpora in any script — Latin, Cyrillic, Devanagari, Arabic, CJK, Syriac, Ethiopic, Cherokee syllabary — including PDF, document, and scanned image formats via OCR pipeline
Applies language-family-specific feature extractors: fusional/inflected languages (Russian, Latin, Arabic) are scored by case paradigm density, inflectional entropy, and cross-language cognate frequency; agglutinative languages (Turkish, Mongolian, Finnish) by affix-to-stem ratios and morpheme boundary complexity; polysynthetic languages (Navajo, Cherokee, Ute) by verbal template complexity, stem-space variation, and affix-stripping confidence — not syllabic density, which is an unreliable proxy for polysynthetic difficulty
Applies dual-pass evaluation: a semantic complexity pass using multilingual embedding spaces to assess conceptual density and register, combined with a structural/morphological complexity pass using language-specific morpheme-to-word ratios and type-to-token ratios in the target script
Assigns CEFR difficulty levels (Pre-A1 through C2) based on the combined complexity score, calibrated per language family
Outputs a sequenced curriculum pipeline compatible with the comprehensible input (i+1) methodology
Operates continuously as a background process — new languages and new corpora are classified automatically on ingestion
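The scoring steps listed above (family-specific features, dual-pass combination, level assignment) can be sketched as a small pipeline. Everything numeric here is an assumption made for illustration: features are taken to be pre-normalized to the range 0 to 1 with higher meaning harder, the two passes are weighted equally, and levels use equal-width bins, whereas the paper describes per-family calibration. The feature key names paraphrase the list above and are not real API names.

```python
CEFR_LEVELS = ["Pre-A1", "A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical feature keys per morphological family, taken from the list above.
FAMILY_FEATURES = {
    "fusional": ["case_paradigm_density", "inflectional_entropy",
                 "cognate_frequency"],
    "agglutinative": ["affix_to_stem_ratio", "morpheme_boundary_complexity"],
    "polysynthetic": ["verbal_template_complexity", "stem_space_variation",
                      "affix_stripping_confidence"],
}


def structural_score(family: str, features: dict) -> float:
    """Average the features that apply to this family (structural pass)."""
    keys = FAMILY_FEATURES[family]
    return sum(features[k] for k in keys) / len(keys)


def combined_score(semantic: float, structural: float,
                   semantic_weight: float = 0.5) -> float:
    """Blend the semantic and structural passes into one score in [0, 1]."""
    return semantic_weight * semantic + (1 - semantic_weight) * structural


def to_cefr(score: float) -> str:
    """Map a score in [0, 1] to a level using placeholder equal-width bins."""
    index = min(int(score * len(CEFR_LEVELS)), len(CEFR_LEVELS) - 1)
    return CEFR_LEVELS[index]


# Placeholder feature values for one polysynthetic-language entry.
entry = {"verbal_template_complexity": 0.9, "stem_space_variation": 0.6,
         "affix_stripping_confidence": 0.3}
structural = structural_score("polysynthetic", entry)
level = to_cefr(combined_score(semantic=0.4, structural=structural))
```

Because the feature set is selected by family, syllabic density never enters the polysynthetic score, which mirrors the design point made in the list above.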

This is not a language learning feature. It is infrastructure for linguistic preservation at institutional scale. The same engine that sequences Spanish vocabulary for a casual learner can receive a corpus of Ute recordings from the Uintah-Ouray Language Program and return a structured, graded curriculum — without human academic intervention.

Documentation without acquisition is a museum. Documentation with acquisition is a movement. Acquisition without curriculum is chaos. The standardization engine eliminates the chaos.

A critical architectural constraint governs encounter generation for all languages, but especially for endangered and low-resource languages: the Adaptive Encounter Generation system operates under a verified corpus constraint. The LLM generates narrative structure, dialogue framing, and cultural context — but all vocabulary, grammatical forms, and idiomatic expressions used in encounters are drawn from verified corpus entries in LinguaHalla's vocabulary database, not generated freely from the model's parametric knowledge. For languages like Yucatec Maya, Classical Syriac, and Cherokee — where large language models are known to hallucinate syntactic structures and lexical forms — this retrieval constraint is not optional. It is the primary safeguard against linguistic drift. The LLM acts as the creative contextualizer. The verified corpus is the source of linguistic truth.
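One way to picture the verified corpus constraint is as a check that rejects any target-language token not found in the verified entries. The sketch below is a deliberately naive illustration: it tokenizes on whitespace, which is not adequate for many of the scripts and morphologies discussed in this paper. The function name is an assumption, and the Latin example words stand in for real corpus entries.

```python
PUNCTUATION = ".,;:!?\"'()"


def unverified_tokens(target_text: str, verified_entries: set) -> list:
    """Return tokens in `target_text` that are absent from the verified corpus.

    An empty list means every token is backed by a verified entry.
    """
    missing = []
    for raw in target_text.split():
        token = raw.strip(PUNCTUATION).lower()
        if token and token not in verified_entries:
            missing.append(token)
    return missing


# Stand-in for verified database entries.
verified = {"aqua", "vita", "est"}

accepted = unverified_tokens("Aqua vita est.", verified)
rejected = unverified_tokens("Aqua vitae est.", verified)
```

The first sentence passes because every token is a verified entry. The second is flagged because the inflected form "vitae" is absent from the verified set, so a generator would need to retrieve a verified form or regenerate the line.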

The Idiom Layer — Cultural Inheritance, Not Vocabulary Lists

Vocabulary tells you what a language contains. Idioms tell you what a civilization believes. LinguaHalla's Idiom Layer is a structured corpus of over 2,000 culturally embedded expressions, philosophical proverbs, and wisdom traditions — drawn from primary sources across 27 language traditions and tied directly to the companions who carry them.

The monetization thesis of modern language learning is broken. Competing on vocabulary count is a race to the bottom. LinguaHalla does not sell a dictionary. It sells cultural inheritance — the living wisdom traditions that vocabulary alone can never convey.

Tier 1 — Compressed Story Idioms

Four-character Chinese chéngyǔ, Japanese yojijukugo, and equivalent compressed idiom traditions. Each entry encodes a complete historical story in 4 characters. 臥薪嚐膽 (wò xīn cháng dǎn) — “sleep on brushwood and taste gall” — carries 2,500 years of Chinese strategic thought in four syllables. These surface as cultural moments woven into regular encounter generation, appearing when the learner encounters vocabulary connected to the idiom's theme.

Tier 2 — Philosophical Wisdom Texts

Primary source philosophical content tied to each companion's chain. Kostas teaches the full Epictetus Enchiridion (53 chapters) and Marcus Aurelius's Meditations (all 12 books) in Ancient and Koine Greek. Dariush unpacks Hafez ghazals and Rumi's Masnavi in Farsi and Pahlavi. Kavya knows all 1,330 Thirukkural couplets by heart. Saraswati carries the Bhagavad Gita, the Upanishadic mahāvākyas, and the Hitopadesha. Sigrid has memorized the Hávamál. These surface as dedicated Wisdom Sessions — premium encounters where a companion unpacks one text in depth, connecting etymology to meaning to cultural context.

Tier 3 — Living Proverb Traditions

Every language carries a living proverb tradition that encodes its civilization's worldview. Hawaiian ʻŌlelo Noʻeau (2,942 documented proverbs). Māori whakataukī. Akan Adinkra symbol wisdom. Yoruba Ifá oracle philosophy. Mongolian steppe wisdom. Quechua Andean concepts (ayni, sumak kawsay, pachamama). Korean 속담. These appear on each language's public page, making the depth of each tradition visible before a learner even begins, and giving search engines rich, culturally specific content that no other language platform provides.

The result is a corpus that no institution has assembled before: Hávamál runics, Ifá West African oracle frameworks, Upanishadic metaphysics, Talmudic Aramaic, Classical Nahuatl huehuetlatolli, Man'yōshū pillow words, and Book of Enoch passages inside a single unified schema — each entry tied to a companion who can teach it through narrative encounter. Documentation without acquisition is a museum. The Idiom Layer is where the archive becomes alive.

Grant & Partnership Alignment

LinguaHalla is designed for institutional partnership. Its architecture maps directly onto existing federal funding mechanisms for language preservation and documentation:

NSF Documenting Endangered Languages (DEL). LinguaHalla's standardization engine directly serves the DEL mandate: transforming static documentation archives into learnable, CEFR-structured acquisition resources. The engine can receive any raw corpus from a university partner and return a graded curriculum pipeline — without human academic intervention. Co-PI collaboration pathway available.
ANA AI3 — Artificial Intelligence for American Indians. LinguaHalla is purpose-built for this program. Cherokee, Navajo, Ute, and Yucatec Maya tracks are live. The polysynthetic-language-specific feature extractor in the standardization engine addresses the unique morphological complexity of Native American languages that Western-centric CEFR tools cannot handle.
ANA Esther Martinez Immersive Language Programs. Immersive narrative acquisition — not translation exercises or vocabulary drills — for Native American heritage learners. Direct mission alignment.
NEH Collections Stewardship. LinguaHalla's corpus ingestion pipeline transforms dark archives (digitized collections with no pedagogical structure) into interactive acquisition resources. The vocabulary database (49.1M+ entries, 119 languages) represents a preservation asset. The ingestion pipeline is the activation engine that makes existing NEH-funded documentation collections pedagogically usable for the first time.
Tribal Sovereignty & Community Control. Indigenous nations retain ultimate authority over how their language is framed, taught, and sequenced within LinguaHalla. Tribal Councils have curriculum review and veto rights over all content in their language tracks. Sovereign data rights are not a compliance checkbox — they are a structural requirement of the partnership model. Active collaboration pathways with Cherokee Nation Language Department, Navajo Nation Language Program, and Uintah-Ouray Ute Language Program.
University Partnership Pathway. Co-PI opportunities on NSF DEL proposals. Institutional licensing for linguistics departments. The standardization engine can be deployed against a university's existing corpus holdings to generate structured acquisition data — converting archival research assets into active pedagogical tools. Research data access for language acquisition studies.

Contact & Collaboration

For research partnerships, institutional licensing, tribal collaboration, or grant co-PI inquiries:

Victor Christiansen

Founder, Runeteca Technologies LLC

Salt Lake City, Utah

victor@runeteca.com