Almost every EEG foundation model does the same thing: chop the brain signal into patches, turn them into tokens, mask some, and train a Transformer to fill in the blanks — the recipe we inherited from vision and language. But brain rhythms aren't words or image patches; they're continuous. In B[FM]2 we asked: what if we just didn't tokenize at all?
The short version: we pretrain a generative model directly on the raw multi-channel EEG waveform with continuous-time flow matching — no patches, codebooks, or masking — and read features off the network for downstream tasks. The same model sets a new state of the art on 7 of 9 benchmarks using ~30× less pretraining data, and generates EEG that two board-certified neurologists couldn't tell apart from real clinical recordings.
EEG is a multi-channel waveform — about 19 scalp electrodes, each sampled hundreds of times per second. To reuse the architectures that worked in vision and language, existing brain foundation models discretize this continuous signal into patches or codebook tokens and train a Transformer with masked reconstruction. But discretization imposes artificial boundaries on a signal that has none: the objective ends up modeling transitions between arbitrary chunks instead of the continuous evolution of the signal, and random masking throws away exactly the fine-grained temporal detail clinical tasks rely on. The inductive bias stops matching the data.
If the data is continuous, the pretext task should be too. Instead of "predict the masked token," why not "learn to turn noise into a brain signal"?
A generative objective gives us a tokenization-free pretext for free: to generate the raw signal, you never discretize it. We use flow matching, the continuous-time cousin of diffusion. Put a Gaussian noise sample \(x_0\) at \(t{=}0\) and a real EEG window \(x_1\) at \(t{=}1\), and learn a velocity field \(v_\theta(x_t, t)\) that says which way to move at each point; integrating that field from \(t{=}0\) to \(t{=}1\) carries a noise sample onto the data manifold. With the straight-line interpolant \(x_t=(1-t)x_0+t x_1\), training collapses to a clean regression. Blend a real EEG window with noise and predict the displacement \(x_1 - x_0\) that points from the noise sample to the clean data:
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\,\big\lVert\, v_\theta(x_t, t) - (x_1 - x_0)\,\big\rVert^2 .$$
That's the entire pretraining loss — no mask tokens, no codebook, no patch grid. (We use the optimal-transport coupling, OT-CFM, which re-pairs each minibatch so the learned flow stays straight; details in the paper.)
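To make the regression concrete, here is a minimal NumPy sketch of how one training batch and its target could be built, plus a fixed-step Euler integrator for sampling. It is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, \(t\) is drawn uniformly from \([0, 1)\), noise is paired with data independently (the OT-CFM re-pairing is left out), and the step count is arbitrary.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Build one flow-matching training example per clean window.

    x1: array of shape (batch, electrodes, time) holding real EEG windows.
    Returns the interpolated point x_t, the sampled times t, and the target.
    """
    x0 = rng.standard_normal(x1.shape)           # noise endpoint at t = 0
    t = rng.uniform(size=(x1.shape[0], 1, 1))    # ASSUMPTION: t ~ Uniform[0, 1)
    xt = (1.0 - t) * x0 + t * x1                 # straight-line interpolant
    target = x1 - x0                             # constant velocity along that line
    return xt, t, target

def cfm_loss(v_pred, target):
    """Squared error averaged over all entries (the squared norm up to a constant)."""
    return float(np.mean((v_pred - target) ** 2))

def euler_sample(velocity, x0, n_steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 19, 256))           # stand-in for 4 windows, 19 electrodes
xt, t, target = cfm_batch(x1, rng)
print(cfm_loss(np.zeros_like(target), target))   # loss of a model that always predicts zero
```

In a real training loop, `v_pred` would come from the network evaluated at `(xt, t)`, and `velocity` in `euler_sample` would be the trained network; here both are placeholders.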
You can't just drop an image UNet onto EEG, because EEG isn't isotropic like an image. Time is densely sampled (thousands of points), smooth, and benefits from compression. Electrodes are few (~19) and each sits at a fixed, anatomically meaningful location — you can't "downsample" the occipital lobe into the frontal lobe. The two axes have different scales and different semantics.
SplitUNet is built around that asymmetry. Every 2D spatiotemporal convolution is factorized into a 1D temporal convolution (shared across electrodes), a nonlinearity, then a 1D electrode convolution (shared across time):
$$\mathrm{conv}_{(1+1)\mathrm{D}}(z) = \mathrm{conv}_{\text{electrode}}\big(\,\sigma(\,\mathrm{conv}_{\text{time}}(z)\,)\,\big).$$
This is the R(2+1)D trick from video: it adds nonlinear depth for free and uses \(k_t+k_e\) weights instead of \(k_t\cdot k_e\). Crucially, the encoder downsamples only along time; the electrode dimension is preserved end-to-end. The bet is that EEG's temporal dynamics and inter-electrode coupling are largely separable. Ablations support it: SplitUNet beats both a per-electrode 1D UNet (which has no spatial coupling) and a full 2D UNet (which spends cross-axis weights at every layer that the data doesn't need). The factorization recovers what 1D loses without paying for what 2D wastes.
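The factorization is easy to see on a single-channel toy example. The NumPy sketch below applies a temporal kernel to every electrode, a nonlinearity, then an electrode kernel to every time step, and downsamples along time only. Everything specific here is an assumption for illustration: ReLU stands in for \(\sigma\), the kernel sizes are made up, the function names are hypothetical, and a real SplitUNet layer would also carry feature channels and learned weights.

```python
import numpy as np

def conv_along(z, kernel, axis):
    """Apply the same 1D kernel to every slice of z along `axis` (zero-padded, same length).

    np.convolve flips the kernel, unlike most deep learning layers; that only
    changes how the weights are indexed, not what the layer can express.
    """
    return np.apply_along_axis(lambda s: np.convolve(s, kernel, mode="same"), axis, z)

def split_conv(z, w_time, w_electrode):
    """(1+1)D block on z of shape (electrodes, time)."""
    h = conv_along(z, w_time, axis=1)           # temporal kernel shared across electrodes
    h = np.maximum(h, 0.0)                      # ASSUMPTION: ReLU stands in for sigma
    return conv_along(h, w_electrode, axis=0)   # electrode kernel shared across time

def downsample_time(z, factor=2):
    """Strided subsampling along time only; the electrode axis is left untouched."""
    return z[:, ::factor]

k_t, k_e = 7, 3                                 # hypothetical kernel sizes
rng = np.random.default_rng(0)
z = rng.standard_normal((19, 512))              # 19 electrodes, 512 time samples
out = split_conv(z, rng.standard_normal(k_t), rng.standard_normal(k_e))
print(out.shape, downsample_time(out).shape)    # electrode count stays at 19 in both
print(k_t + k_e, "weights vs", k_t * k_e)       # factorized vs full 2D kernel, single channel
```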
We pretrain once on TUEG and finetune the same backbone end-to-end on nine standard tasks spanning seizure, sleep staging, abnormality, motor imagery, emotion, and stress. B[FM]2 sets a new state of the art on 7 of 9 tasks and reaches the highest suite average, while pretraining on just 36,895 segments (≈307 h), 1–2 orders of magnitude less than recent baselines (REVE ≈ 60,000 h). Continuous-time flow matching is a strikingly sample-efficient pretext.
| Method | Mumtaz | MAT | Siena | ISRUC | HMC | TUEV | TUAB | BCIC | SEED-V | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| LaBraM-Base | 0.941 | 0.691 | 0.708 | 0.763 | 0.728 | 0.641 | 0.814 | 0.487 | 0.398 | 0.686 |
| CBraMod | 0.956 | 0.726 | 0.732 | 0.786 | 0.727 | 0.667 | 0.829 | 0.514 | 0.409 | 0.705 |
| REVE | 0.964 | 0.766 | 0.740 | 0.782 | 0.740 | 0.676 | **0.832** | **0.640** | 0.405 | 0.727 |
| CSBrain | 0.964 | 0.756 | 0.766 | 0.792 | 0.735 | 0.690 | 0.817 | 0.566 | 0.420 | 0.723 |
| B[FM]2 (ours) | **1.000** | **0.840** | **0.776** | **0.806** | **0.764** | **0.715** | 0.819 | 0.570 | **0.492** | 0.754 |
Balanced accuracy (mean over 5 seeds). Best per task in bold. Full table with supervised baselines and companion metrics is in the paper.
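For readers less familiar with the metric, balanced accuracy is the mean of per-class recall, so a rare class counts as much as a common one. The short sketch below defines it in plain Python and then recomputes the Avg. column and the win count from the rows above. The helper name and the toy labels are illustrative, and the win count only compares against the four foundation-model baselines shown here, not the supervised baselines in the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy labels (made up): always predicting the majority class is 75% accurate
# but only reaches 0.5 balanced accuracy.
print(balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))

# Rows copied from the table above (nine task columns, in table order).
scores = {
    "LaBraM-Base": [0.941, 0.691, 0.708, 0.763, 0.728, 0.641, 0.814, 0.487, 0.398],
    "CBraMod":     [0.956, 0.726, 0.732, 0.786, 0.727, 0.667, 0.829, 0.514, 0.409],
    "REVE":        [0.964, 0.766, 0.740, 0.782, 0.740, 0.676, 0.832, 0.640, 0.405],
    "CSBrain":     [0.964, 0.756, 0.766, 0.792, 0.735, 0.690, 0.817, 0.566, 0.420],
    "B[FM]2":      [1.000, 0.840, 0.776, 0.806, 0.764, 0.715, 0.819, 0.570, 0.492],
}

# Suite average = unweighted mean over the nine tasks, rounded like the table.
suite_avg = {name: round(sum(v) / len(v), 3) for name, v in scores.items()}

# Count tasks where B[FM]2 strictly beats every listed baseline.
ours = scores["B[FM]2"]
wins = sum(
    ours[i] > max(v[i] for name, v in scores.items() if name != "B[FM]2")
    for i in range(len(ours))
)
print(suite_avg)
print(wins, "of", len(ours))
```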
Because it's generative, B[FM]2 can do something masked-token models structurally can't: synthesize new EEG and let an expert judge it. We drew unconditional samples from the same pretrained model and ran a blinded reading — two board-certified neurologists each rated 50 interleaved 30-second segments (25 real, 25 generated) on a 1–5 realness scale.
| Reader | Real EEG | B[FM]2 | Mann–Whitney U | p |
|---|---|---|---|---|
| Neurologist 1 | 2.96 ± 1.02 | 2.60 ± 1.04 | 251.5 | 0.204 |
| Neurologist 2 | 3.08 ± 1.12 | 2.32 ± 0.80 | 188.5 | 0.011 |
| Pooled | 3.02 ± 1.06 | 2.46 ± 0.93 | 876.0 | 0.007 |
Mean realness (1 = definitely real, 5 = definitely fake; lower = more real-looking). Readers rated B[FM]2 samples as slightly more real-looking than the actual held-out EEG, a gap that reaches p < 0.05 for Neurologist 2 and for the pooled ratings but not for Neurologist 1. Their per-segment judgments agreed only at chance (Spearman ρ = 0.052, Cohen's κ = −0.096), and neither could reliably tell synthetic from real.
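As a reminder of what the U column measures, the sketch below computes the Mann–Whitney U statistic for two groups of ordinal ratings, using average ranks for ties (which matter a lot on a 1–5 scale). It is a plain-Python illustration with hypothetical function names and made-up toy ratings, not the study data, and it does not compute the p-value. The two statistics always sum to \(n_a n_b\), which is 625 for one reader's 25 real and 25 generated segments.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied entries sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U_a, U_b); U_a counts pairs with a > b, plus half of the tied pairs."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Made-up toy ratings on the 1-5 scale, purely to exercise the function.
toy_real = [3, 4, 2, 5]
toy_generated = [2, 3, 1, 2]
print(mann_whitney_u(toy_real, toy_generated))
```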
It also learns recognizable physiology with no labels at all: mining the generation pool with simple band-power heuristics surfaces canonical patterns — sharp-wave transients, slow-wave activity, posterior α rhythm.
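A band-power heuristic of this kind can be very simple. The sketch below computes the fraction of a single channel's spectral power that falls in a frequency band, which is one plausible way to flag α-dominant or slow-wave-dominant segments. The exact heuristics used for the mining are not given here, so the function name, the 8–13 Hz band edges, the 200 Hz sampling rate, and the synthetic sine-wave inputs are all assumptions for illustration.

```python
import numpy as np

def relative_band_power(x, fs, band):
    """Fraction of a 1D signal's spectral power inside band = (low_hz, high_hz)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                                  # drop the DC offset
    power = np.abs(np.fft.rfft(x)) ** 2               # one-sided power spectrum
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)       # matching frequency axis in Hz
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return float(power[in_band].sum() / power.sum())

fs = 200.0                                            # hypothetical sampling rate in Hz
t = np.arange(400) / fs                               # 2 seconds of samples
alpha_like = np.sin(2 * np.pi * 10.0 * t)             # synthetic 10 Hz oscillation
slow_like = np.sin(2 * np.pi * 2.0 * t)               # synthetic 2 Hz oscillation

alpha_band = (8.0, 13.0)                              # ASSUMPTION: conventional alpha range
print(relative_band_power(alpha_like, fs, alpha_band))
print(relative_band_power(slow_like, fs, alpha_band))
```

Ranking generated segments by a score like this, restricted to posterior electrodes for the α case, would surface candidates for visual inspection; a constant signal has zero total power and would need a guard before dividing.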
We spend a lot of effort bending brain signals to fit architectures built for text and images. B[FM]2 argues for the reverse: pick a pretext whose inductive bias already matches the data, and let the architecture follow its structure. For EEG, that means flow matching on the raw waveform plus a backbone that respects the time–electrode asymmetry — and it turns out to be more sample-efficient and more interpretable than discretize-and-mask. SplitUNet should extend to other multi-channel time series with fixed sensor arrays (MEG, ECoG), and because the model genuinely generates EEG, conditional synthesis of rare clinical cases is a natural next step.
Full details — preprocessing, hyperparameters, per-task metrics, the neurologist protocol — are in the paper; the project page has full-resolution figures. Questions welcome — I'm @jaedong_hwang.
@article{hwang2026bfm,
author = {Hwang, Jaedong and Zhang, Kathleen and Dai, Wei and Kontras, Konstantinos
and Vanmarcke, Maarten and De Vos, Maarten and Fiete, Ila and Liang, Paul Pu},
title = {B[FM]$^2$: Brain Foundation Model via Flow Matching with SplitUNet},
journal = {arXiv preprint},
year = {2026},
}