Learn Globally, Speak Locally:
Bridging the Gaps in Multilingual Reasoning

1Massachusetts Institute of Technology 2Harvard University 3LG CNS
4Université de Montréal & Mila 5Google 6Stanford University
*Equal Contribution †Corresponding Authors

Mathematical accuracy versus the joint accuracy of mathematics and language for various LLMs on MGSM with native Chain-of-Thought. Circle size denotes the number of parameters. Several models show much lower joint accuracy than math accuracy, especially s1, a model supervised fine-tuned from Qwen2.5-Instruct.

Abstract

Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol that uses LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization.

GeoFact-X Dataset

We introduce GeoFact-X, a multilingual factual reasoning dataset specifically designed to evaluate factual accuracy and reasoning capabilities across diverse linguistic and cultural contexts. Leveraging Gemini 2.0 Flash, we generate 3,000 unique factual questions (approximately 600 per country) covering locally grounded topics such as history, politics, geography, art, and culture, localized to five geographically distinct countries: the USA, India, Japan, Kenya, and Thailand. The dataset is available in the predominant local languages: English, Hindi, Japanese, Swahili, and Thai. Our goal is to capture country-specific factual knowledge, encouraging language models to reason effectively within culturally contextualized knowledge spaces.
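To make the generation step concrete, below is a minimal sketch of how such locally grounded questions could be produced with the google-generativeai Python SDK. The prompt template, topic list, and helper name are illustrative assumptions, not the exact prompts used to build GeoFact-X.

# Hypothetical sketch of GeoFact-X-style question generation with Gemini
# 2.0 Flash via the google-generativeai SDK. The prompt wording and topic
# list are illustrative; they are not the authors' actual prompts.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

COUNTRIES = ["USA", "India", "Japan", "Kenya", "Thailand"]
TOPICS = ["history", "politics", "geography", "art", "culture"]

def generate_questions(country: str, topic: str, n: int = 10) -> str:
    """Request n locally grounded factual question-answer pairs."""
    prompt = (
        f"Write {n} factual quiz questions about the {topic} of {country}. "
        "Each question must test a short, verifiable fact specific to that "
        "country. Return one 'Question: ... Answer: ...' pair per line."
    )
    return model.generate_content(prompt).text

print(generate_questions("Kenya", "history", n=5))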


Illustration of the hierarchical distribution of generated factual question categories by topic and subcategory. Each colored wedge represents a major topic (e.g., History, Geography), and its outer segments represent specific subcategories (e.g., Person, Place, Treaty). The size of each segment reflects the proportion of questions allocated to that subcategory within its topic. This generation schema was applied uniformly across five countries, and all question sets were translated into five different languages.


A sample from GeoFact-X in English, Hindi, and Thai. Each presents the same factual question and answer content translated across languages. The corresponding reasoning traces are synthetically generated by Gemini 2.0 Flash. These multilingual and semantically equivalent traces serve as reference reasoning for benchmarking the reasoning quality of other language models in our evaluation framework.


BRIDGE

Schematic of BRIDGE. BRIDGE combines SFT and GRPO with translated questions and a language identifier that computes a language-consistency reward between the input question and the generated output. The question is translated into a target language (e.g., Japanese). If the language identifier determines that the generated reasoning trace matches the target language, the GRPO reward is 1; otherwise, it is 0.
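Because the reward is a simple binary signal, it is easy to sketch. The snippet below shows one way it could be implemented, assuming an off-the-shelf language identifier (langdetect here); the paper's actual identifier and its integration into the GRPO training loop may differ.

# Sketch of BRIDGE's binary language-consistency reward. We assume an
# off-the-shelf identifier (langdetect); the paper's exact identifier and
# GRPO plumbing are not specified here.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic across runs

def language_consistency_reward(question: str, reasoning_trace: str) -> float:
    """Return 1.0 if the trace matches the question's language, else 0.0."""
    try:
        target_lang = detect(question)        # e.g., "ja" for Japanese
        output_lang = detect(reasoning_trace)
        return 1.0 if output_lang == target_lang else 0.0
    except Exception:
        return 0.0  # undetectable (e.g., empty) output earns no reward

# In GRPO, each sampled completion in a group is scored with this reward
# and advantages are computed from the group-normalized scores.
traces = ["ジャネットのアヒルは毎日16個の卵を産みます。", "Janet's ducks lay 16 eggs per day."]
print([language_consistency_reward("ジャネットは毎日いくら稼ぎますか？", t) for t in traces])
# expected: [1.0, 0.0]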

Experiments

Factual Reasoning

All methods perform better on associated pairs than on non-associated pairs and consistently reduce language mismatch. However, both the reasoning score and answer correctness decrease overall, since GeoFact-X includes more non-associated pairs than associated ones. BRIDGE achieves the strongest performance across all metrics when evaluated on the full test set, followed by SFT.
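The reasoning score, answer correctness, and language-mismatch numbers above come from the LLM-as-a-judge protocol. Below is a hedged sketch of what one judging call could look like with Gemini as the judge; the rubric, JSON schema, and prompt wording are simplified assumptions, not the paper's exact protocol.

# Simplified sketch of the LLM-as-a-judge evaluation: one call grades
# answer correctness, reasoning quality, and language consistency.
# The rubric and JSON schema are assumptions, not the paper's exact prompt.
import json
import google.generativeai as genai

judge = genai.GenerativeModel("gemini-2.0-flash")

JUDGE_PROMPT = """You are grading a model's answer to a factual question.
Question: {question}
Reference answer: {reference}
Reference reasoning trace: {trace}
Model response: {response}

Return JSON with keys:
  "answer_correct": true or false,
  "reasoning_score": integer 0-5, faithfulness to the reference reasoning,
  "language_match": true if the response is in the question's language.
"""

def judge_response(question, reference, trace, response):
    result = judge.generate_content(
        JUDGE_PROMPT.format(question=question, reference=reference,
                            trace=trace, response=response),
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(result.text)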


Mathematical Reasoning

The table compares the base model Qwen2.5-7B-Instruct with various post-training methods, including supervised fine-tuning (SFT) and our proposed method, BRIDGE. While SFT on s1K-1.1 improves mathematical reasoning, it harms multilingual performance on MGSM; SFT on s1K-X preserves multilingual ability but at the cost of math accuracy. In contrast, BRIDGE achieves strong math performance on GSM8K, preserves multilingual reasoning on MGSM better than SFT, and yields superior joint accuracy.


Example question and responses of each model on MGSM (Russian). The question in English is: "Janet's ducks lay sixteen eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for two dollars per fresh duck egg. How much in dollars does she make every day at the farmers' market?" All models correctly provide the answer (18: she sells 16 − 3 − 4 = 9 eggs at $2 each), but the SFT model conducts its reasoning in English. GRPO's reasoning process is almost identical to that of the base model (Qwen2.5-7B-Instruct).

BibTeX

@article{hwang2025learn,
  title={Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning},
  author={Hwang, Jaedong and Tanmay, Kumar and Lee, Seok-Jin and Agrawal, Ayush and Palangi, Hamid and Ayush, Kumar and Fiete, Ila R and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2507.05418},
  year={2025}
}