Learn Globally, Speak Locally:
Bridging the Gaps in Multilingual Reasoning

1Massachusetts Institute of Technology 2Harvard University 3LG CNS
4Université de Montréal & Mila 5Google 6Stanford University
*Equal Contribution · Corresponding Authors

Illustration of Contributions. We propose M2A, a new method that uses multi-scale multilingual alignment and language-consistency rewards on machine-translated questions, enabling reasoning in the question language. We also introduce GeoFact-X, a new multilingual factual reasoning benchmark that includes training datasets and step-by-step reasoning traces across five languages. We further propose an automatic evaluation protocol that assesses whether a model reasons in the question language and whether its reasoning is correct, via a language identifier or LLM-as-a-judge.

Abstract

Large Language Models (LLMs) have achieved strong performance in domains such as mathematics, factual question answering, and code generation, yet their ability to reason about these tasks in different languages remains underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. We propose M2A, a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions, training models to reason directly and accurately in the target language. Furthermore, existing multilingual benchmarks evaluate only final answers, overlooking whether reasoning occurs in the intended language. To close this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. Our results show that M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks, highlighting that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization.

M2A (Multi-Scale Multilingual Alignment)

Schematics of M2A. (a) Translate the question into a target language (e.g., Japanese). (b) Multilingual Context Alignment enforces global similarity between generated and reference responses while discouraging trivial matches via shuffled negatives. (c) Multilingual Reasoning-Step Alignment provides finer-grained supervision by aligning individual reasoning steps with ground-truth traces using dynamic programming. (d) Language Consistency is verified by a language identifier.


The multilingual context-alignment reward is defined as follows:

$$r_\text{context-align} = \max(\cos(z_o, z_y) - \cos(\tilde{z}_o, \tilde{z}_y) + \alpha, 0),$$ where \(z_o\) and \(z_y\) denote the embeddings of the LLM-generated output (given the translated question) and of the ground-truth response, \(\tilde{z}_o\) and \(\tilde{z}_y\) denote the embeddings of shuffled negatives that discourage trivial matches, and \(\alpha\) is a margin. For multilingual reasoning-step alignment, we first split the generated output \(o\) and the ground truth \(y\) into sentences \(\mathbf{o} = (o^{(1)}, \ldots, o^{(N)})\) and \(\mathbf{y} = (y^{(1)}, \ldots, y^{(M)})\), respectively, and then match sentences between the output and the ground truth such that the total similarity score is maximized: $$r_\text{step-align} = \frac{1}{N}\sum_{i=1}^N \mathbf{C}_{i, j_i} = \frac{1}{N}\sum_{i=1}^N \max(\cos(z_o^{(i)}, z_y^{(j_i)}) - \cos(\tilde{z}_o^{(i)}, \tilde{z}_y^{(j_i)}) + \alpha, 0),$$ where \(z_o^{(i)}\) and \(z_y^{(j_i)}\) denote the embeddings of the output sentence \(o^{(i)}\) and its matched ground-truth sentence \(y^{(j_i)}\).
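To make the two rewards concrete, here is a minimal NumPy sketch that computes both from precomputed sentence embeddings. The monotonic (non-decreasing) matching constraint in the dynamic program, the margin value, and all helper names are illustrative assumptions, not the authors' exact implementation; in practice the embeddings would come from a multilingual sentence encoder.

import numpy as np

def cos(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def context_align_reward(z_o, z_y, z_o_tilde, z_y_tilde, alpha=0.1):
    # r_context-align: hinge on the margin between the positive pair
    # (output, ground truth) and the shuffled-negative pair.
    return max(cos(z_o, z_y) - cos(z_o_tilde, z_y_tilde) + alpha, 0.0)

def step_align_reward(Z_o, Z_y, Z_o_tilde, Z_y_tilde, alpha=0.1):
    # r_step-align: average per-sentence margin under the matching
    # j_1 <= ... <= j_N that maximizes the total score.
    # Z_o: (N, d) generated-sentence embeddings; Z_y: (M, d) ground-truth
    # sentence embeddings; the *_tilde arrays hold shuffled negatives.
    N, M = len(Z_o), len(Z_y)
    # C[i, j]: margin reward for matching output sentence i to gt sentence j.
    C = np.array([[max(cos(Z_o[i], Z_y[j])
                       - cos(Z_o_tilde[i], Z_y_tilde[j]) + alpha, 0.0)
                   for j in range(M)] for i in range(N)])
    # Dynamic program over monotonic matchings: after iteration i, dp[j] is
    # the best total reward with output sentence i matched to gt index j.
    dp = C[0].copy()
    for i in range(1, N):
        dp = np.maximum.accumulate(dp) + C[i]
    return float(dp.max()) / N

# Toy usage with random stand-in embeddings (d = 8); shuffled rows act as
# the negatives.
rng = np.random.default_rng(0)
Z_o, Z_y = rng.normal(size=(3, 8)), rng.normal(size=(4, 8))
print(step_align_reward(Z_o, Z_y, rng.permutation(Z_o), rng.permutation(Z_y)))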


GeoFact-X Dataset

Illustration of GeoFact-X benchmark construction. (1) Geography-aware multilingual questions and answers are generated by Gemini 2.0 Flash. (2) The data is translated into the other languages, and each translation is verified to be back-translatable. (3) A reasoning trace is generated for each question-answer pair. (4) Native or C1-level speakers verify each example and revise it if needed.


We introduce a multilingual factual reasoning dataset specifically designed to evaluate factual accuracy and reasoning capabilities across diverse linguistic and cultural contexts. Leveraging Gemini 2.0 Flash, we generate 3,000 unique factual questions (approximately 600 per country) covering locally grounded topics such as history, politics, geography, art, and culture, localized to five geographically distinct countries: the USA, India, Japan, Kenya, and Thailand. The dataset is available in the predominant local languages: English, Hindi, Japanese, Swahili, and Thai. Our goal is to capture country-specific factual knowledge, encouraging language models to reason effectively within culturally contextualized knowledge spaces.
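As a rough sketch of the back-translatability check in step (2) of the construction pipeline: round-trip each translation and keep it only if the back-translation stays semantically close to the original. The translate stub and the string-overlap similarity below are placeholder assumptions; in practice an MT system and a multilingual sentence encoder would fill these roles.

from difflib import SequenceMatcher

def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: plug in any machine-translation system here.
    raise NotImplementedError

def similarity(a: str, b: str) -> float:
    # Crude string-overlap proxy; a multilingual sentence encoder would be
    # a better choice in practice.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_back_translatable(question_en: str, tgt_lang: str, threshold: float = 0.9) -> bool:
    # Keep a translation only if round-tripping it back to English stays
    # close to the original question.
    translated = translate(question_en, src="en", tgt=tgt_lang)
    back = translate(translated, src=tgt_lang, tgt="en")
    return similarity(question_en, back) >= threshold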


Illustration of the hierarchical distribution of generated factual question categories by topic and subcategory. Each colored wedge represents a major topic (e.g., History, Geography), and its outer segments represent specific subcategories (e.g., Person, Place, Treaty). The size of each segment reflects the proportion of questions allocated to that subcategory within its topic. This generation schema was applied uniformly across five countries, and all question sets were translated into five different languages.


A sample from GeoFact-X in English, Hindi, and Thai. Each presents the same factual question and answer content translated across languages. The corresponding reasoning traces are synthetically generated by Gemini 2.0 Flash. These multilingual and semantically equivalent traces serve as reference reasoning for benchmarking the reasoning quality of other language models in our evaluation framework.


Revisiting Multilingual Mathematical Reasoning Benchmark

Mathematical accuracy and the joint accuracy of mathematics and language for various LLMs on MGSM with native Chain-of-Thought. Circle size denotes the number of parameters. We found that several LLMs with high MGSM performance actually reason in English or Chinese even when a native-language Chain-of-Thought is given.
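The joint metric in this plot can be sketched as follows: an example counts only if the final answer is correct and the reasoning is written in the question language. The off-the-shelf langid package is used for language identification here, and the record format is an assumption; the paper's exact identifier may differ.

import langid  # pip install langid; classify() returns codes such as 'en', 'th'

def joint_accuracy(examples):
    # examples: iterable of dicts with keys 'reasoning' (model chain of
    # thought), 'pred', 'gold', and 'lang' (question-language code).
    hits, total = 0, 0
    for ex in examples:
        total += 1
        detected, _ = langid.classify(ex["reasoning"])
        if ex["pred"] == ex["gold"] and detected == ex["lang"]:
            hits += 1
    return hits / max(total, 1)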

Experiments

Mathematical Reasoning

The table compares the base model Qwen2.5-7B-Instruct with various post-training methods, including supervised fine-tuning (SFT) and our proposed method, M2A. While SFT on s1K-1.1 improves mathematical reasoning, it harms multilingual performance on MGSM; SFT on s1K-X preserves multilingual ability but at the cost of math accuracy. In contrast, M2A achieves strong math performance on GSM8K, preserves multilingual reasoning on MGSM better than SFT, and yields superior joint accuracy.


Example question and response of each model on MGSM (Russian). The question in English is: "Janet's ducks lay sixteen eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for two dollars per fresh duck egg. How much in dollars does she make every day at the farmers' market?" All models correctly provide the answer (18), but the SFT model conducts its reasoning in English. GRPO's reasoning process is almost identical to that of the base model (Qwen2.5-7B-Instruct).


Factual Reasoning

All methods perform better on associated pairs than on non-associated pairs and consistently reduce language mismatch. However, both the reasoning score and answer correctness decrease overall, since GeoFact-X contains more non-associated pairs. M2A achieves the strongest reasoning performance.
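For reasoning correctness, here is a hedged sketch of the LLM-as-a-judge protocol used in our evaluation framework; the prompt wording, 1-5 scale, and call_llm stub are illustrative assumptions rather than the paper's exact setup.

JUDGE_TEMPLATE = """You are grading a model's reasoning.
Question: {question}
Reference reasoning: {reference}
Model reasoning: {candidate}
Do the model's steps reach the answer through factually correct reasoning?
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def call_llm(prompt: str) -> str:
    # Placeholder: plug in the judge model's API here.
    raise NotImplementedError

def judge_reasoning(question: str, reference: str, candidate: str) -> int:
    reply = call_llm(JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # conservative fallback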


BibTeX

@article{hwang2025learn,
  title={Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning},
  author={Hwang, Jaedong and Tanmay, Kumar and Lee, Seok-Jin and Agrawal, Ayush and Palangi, Hamid and Ayush, Kumar and Fiete, Ila R and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2507.05418},
  year={2025}
}