Diffusion OCR Decodes 3.2x Faster, Single-Stream AV in 2 Seconds

Today's Overview

Diffusion Decoding Replaces Autoregressive OCR, Going From Serial to Parallel. MinerU-Diffusion reframes document parsing as inverse rendering, using block-wise diffusion to generate structured source in parallel. 3.2x faster decoding, open-source.
RLVR Update Direction Matters More Than Magnitude. The sign of token-level Δlog p pinpoints sparse, reasoning-critical updates more precisely than magnitude metrics. Two resulting methods need no architecture changes.
Multi-Task SFT Wastes Compute You Don't See. Sub-datasets overfit at wildly different rates. mSFT iteratively drops the earliest overfitters, cutting FLOPs and improving results under low budgets.
Video GRPO Instability Traced to Off-Manifold Exploration Noise. The ODE-to-SDE switch pushes sampling trajectories off the pretrained data manifold. SAGE-GRPO fixes this with manifold-projected exploration and dual trust regions, validated on HunyuanVideo.
Joint Audio-Video Generation Doesn't Need Multi-Stream Architecture. Text, video, and audio tokens in a single sequence with plain self-attention. 5-second video in 2 seconds on one H100, full model stack open-sourced.

Featured

01 Multimodal What if OCR Isn't Reading, but Reverse-Rendering?

A rendering engine turns Markdown into a laid-out PDF. MinerU-Diffusion inverts the process: given the rendered result, a diffusion model recovers structured source code in parallel. Once you accept this inverse-rendering frame, left-to-right autoregressive decoding stops being a requirement and starts looking like historical baggage from sequential formats.

The method replaces token-by-token generation with a block-wise diffusion decoder, stabilized by uncertainty-driven curriculum learning for long sequences. Decoding runs 3.2x faster than autoregressive baselines, with consistent reliability gains. The team also designed a Semantic Shuffle benchmark testing whether the model actually reads document layout rather than relying on language priors. The diffusion approach wins clearly on that test.
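To make the serial-versus-parallel contrast concrete, here is a toy sketch of block-wise masked decoding. It is not MinerU-Diffusion's decoder: `toy_denoiser`, the block size, the `per_step` commit budget, and the confidence-based unmasking rule are all assumptions made for illustration. The only thing it demonstrates is that committing several tokens per model call needs fewer calls than one-token-at-a-time decoding; the call counts say nothing about the reported 3.2x figure.

```python
import random

MASK = None  # placeholder for a not-yet-decoded position


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a model: proposes (token, confidence) for every masked slot."""
    return {i: (target[i], rng.random()) for i, t in enumerate(tokens) if t is MASK}


def blockwise_decode(target, block_size=8, per_step=4, seed=0):
    """Fill the sequence block by block.

    Within a block, each model call commits the `per_step` most confident
    proposals, so a block of 8 tokens takes 2 calls instead of 8.
    """
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    calls = 0
    for start in range(0, len(target), block_size):
        block = range(start, min(start + block_size, len(target)))
        while any(tokens[i] is MASK for i in block):
            proposals = toy_denoiser(tokens, target, rng)
            calls += 1
            # Keep only proposals inside the current block, most confident first.
            in_block = [(conf, i, tok) for i, (tok, conf) in proposals.items() if i in block]
            in_block.sort(reverse=True)
            for conf, i, tok in in_block[:per_step]:
                tokens[i] = tok
    return tokens, calls


target = list("# Title\n\n| a | b |\n|---|---|\n")
decoded, calls = blockwise_decode(target)
print(len(target), "tokens decoded in", calls, "model calls")
```

A token-by-token decoder would need one call per token, so the gap between `len(target)` and `calls` is the parallelism the block-wise scheme buys in this toy setup.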

The paper has 110 upvotes on Hugging Face and the code is open-source. Community reception suggests this isn't just theoretically elegant. Teams running document processing pipelines can evaluate it now.

Key takeaways:
- Inverse rendering reframes document OCR from serial decoding to parallel diffusion: an architecture-level shift.
- 3.2x decoding speedup has direct value for long-document production pipelines.
- Open-source and ready to evaluate as an autoregressive replacement for document parsing.

Source: MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

02 Reasoning RLVR Updates: Direction Over Magnitude

The sign of token-level log-probability changes may be a better lens for understanding RLVR (reinforcement learning from verifiable rewards). Through statistical analysis and token replacement experiments, this work shows that the direction of Δlog p locates sparse, reasoning-critical updates more precisely than magnitude metrics like divergence or entropy.

Two practical methods follow. At inference time, extrapolating along the Δlog p direction improves accuracy. At training time, upweighting low-probability tokens accelerates learning. Neither requires architecture changes, keeping the adoption barrier low.
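One plausible reading of both ideas, sketched in a few lines. This assumes "extrapolating along the Δlog p direction" means a log-space step from the base model's next-token distribution past the RL-tuned one, and that "upweighting low-probability tokens" can be expressed as a per-token loss weight. The function names, `alpha`, `gamma`, and the `(1 - p) ** gamma` weighting are hypothetical and not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def extrapolate(logp_base, logp_rl, alpha=0.5):
    """Step past the RL model along delta = logp_rl - logp_base, then renormalize."""
    shifted = [r + alpha * (r - b) for b, r in zip(logp_base, logp_rl)]
    return log_softmax(shifted)


def low_prob_weight(logp_token, gamma=1.0):
    """Hypothetical training weight that grows as the sampled token's probability shrinks."""
    return (1.0 - math.exp(logp_token)) ** gamma


base = log_softmax([2.0, 1.0, 0.0])
rl = log_softmax([2.0, 1.5, 0.0])   # RL training raised token 1 relative to the others
out = extrapolate(base, rl, alpha=1.0)

print("p(token 1): base", math.exp(base[1]), "rl", math.exp(rl[1]), "extrapolated", math.exp(out[1]))
```

With `alpha=0` the function returns the RL model's distribution unchanged; larger values push further in the direction RL already moved each token.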

The intuition is clean and the validation solid. Generalization across more model scales and task types still needs confirmation.

Key takeaways:
- Δlog p direction outperforms magnitude metrics for locating reasoning-critical sparse updates in RLVR.
- Inference extrapolation and training reweighting apply without architecture changes.
- Teams tuning RLVR should add direction analysis to their diagnostic toolkit.

Source: On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

03 Training Multi-Task SFT's Hidden Waste

Fine-tune multiple tasks together and their sub-datasets learn at wildly different speeds. Some overfit by epoch 3; others are still converging at epoch 10. The training loop doesn't care: it allocates compute to all of them equally.

mSFT does the obvious thing: monitor each sub-dataset, drop the first one to overfit, roll back to that dataset's best checkpoint, and continue training the rest. Across 10 benchmarks and 6 base models, it consistently beats 4 baselines. Sensitivity to its single new hyperparameter is low.
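The loop described above can be sketched on a toy problem. Everything here is an assumption for illustration: the "model" is just a count of epochs seen per sub-dataset, `val_loss` is a made-up U-shaped curve, and `patience` stands in for the method's single hyperparameter, which this digest does not name. Real multi-task training also has cross-task interference that the toy ignores.

```python
def val_loss(epochs_seen, optimum):
    """Toy U-shaped validation curve: improves until `optimum` epochs, then overfits."""
    return abs(epochs_seen - optimum) / optimum


def msft_sketch(optima, max_epochs=20, patience=2):
    """Train all active sub-datasets, drop the first to overfit, roll back, repeat."""
    active = set(optima)
    state = {name: 0 for name in optima}               # toy "model": epochs seen per sub-dataset
    best = {name: (float("inf"), dict(state)) for name in optima}
    rising = {name: 0 for name in optima}              # consecutive evals without improvement
    dropped_at = {}
    for epoch in range(1, max_epochs + 1):
        if not active:
            break
        for name in active:
            state[name] += 1                           # one more epoch on every active sub-dataset
        overfit = None
        for name in sorted(active):
            loss = val_loss(state[name], optima[name])
            if loss < best[name][0]:
                best[name] = (loss, dict(state))       # checkpoint the whole model
                rising[name] = 0
            else:
                rising[name] += 1
                if rising[name] >= patience and overfit is None:
                    overfit = name
        if overfit is not None:
            state = dict(best[overfit][1])             # roll back to that sub-dataset's best checkpoint
            active.remove(overfit)
            dropped_at[overfit] = epoch
            for name in active:                        # restart tracking from the rolled-back model
                best[name] = (val_loss(state[name], optima[name]), dict(state))
                rising[name] = 0
    return state, dropped_at


final_state, dropped_at = msft_sketch({"fast_task": 3, "slow_task": 10})
print(final_state, dropped_at)
```

Note the cost the rollback implies: the slower sub-dataset loses the epochs trained after the checkpoint and has to redo them, which is the trade the method makes for not baking in the overfit.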

The more useful finding: under low compute budgets, mSFT improves results while reducing FLOPs. For resource-constrained teams, that matters more than absolute performance numbers.

Key takeaways:
- Heterogeneous overfitting across sub-datasets is a widely overlooked efficiency loss in multi-task SFT.
- mSFT iteratively drops overfitting sub-datasets to rebalance the mixture. Simple, but stable across models.
- Simultaneous quality gains and FLOPs reduction under low budgets make this especially relevant for constrained settings.

Source: mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

04 Video Gen Why Video GRPO Keeps Crashing: Off-Manifold Noise

Training instability with FlowGRPO for video generation is nearly universal. SAGE-GRPO identifies a specific structural cause: converting ODE (deterministic sampling) to SDE (stochastic sampling) during exploration injects noise that pushes trajectories off the data manifold defined by the pretrained model. Rollout quality collapses, reward estimates become unreliable, and training diverges.

The fix constrains exploration to stay on-manifold. At the micro level, log-curvature correction derives a manifold-aware SDE. At the macro level, dual trust regions prevent excessive policy drift. Experiments on HunyuanVideo 1.5 show consistent improvements in video quality, text alignment, and visual metrics over prior methods.

Key takeaways:
- Video GRPO instability stems from exploration noise pushing samples off the pretrained data manifold, making reward estimates unreliable.
- Manifold-projected exploration plus dual trust regions fix this cleanly.
- Validated on HunyuanVideo 1.5; generalization to other video architectures still needs testing.

Source: Manifold-Aware Exploration for Reinforcement Learning in Video Generation

05 Architecture Single-Stream AV Generation Beats Multi-Stream

The trend in joint audio-video generation has been stacking specialized modules: multi-stream encoders, cross-attention alignment layers, modality-specific training strategies. daVinci-MagiHuman strips all of that away. Text, video, and audio tokens go into one sequence with standard self-attention.
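What "one sequence with standard self-attention" means structurally, as a minimal sketch. This is not the daVinci-MagiHuman code: the 4-dimensional one-hot "embeddings", the single head, and the identity Q/K/V projections are simplifications chosen so the example runs with the standard library alone.

```python
import math


def self_attention(seq):
    """Plain single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


# Toy token embeddings: each modality occupies its own dimensions.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
video = [[0.0, 0.0, 1.0, 0.0]] * 3
audio = [[0.0, 0.0, 0.0, 1.0]] * 2

sequence = text + video + audio   # one stream, no per-modality branches
mixed = self_attention(sequence)

# The first text token now carries video and audio information.
print(mixed[0])
```

Every token attends to every other token regardless of modality, so there is no separate cross-attention or alignment layer to write, which is the simplification the section describes.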

The engineering payoff is immediate: no multi-stream synchronization logic to maintain. Inference optimizations (distillation, latent super-resolution, Turbo VAE) apply directly to a standard Transformer. One H100 generates a 5-second video in 2 seconds. Quality holds up: 80% human eval win rate against Ovi 1.1, speech WER at 14.6% (lowest among open-source models).

The full stack is open-sourced: base model, distilled variant, super-resolution model, and inference code.

Key takeaways:
- Single token sequence for text, video, and audio eliminates cross-attention and multi-stream alignment overhead.
- 2-second generation for 5-second video on one H100; distillation and Turbo VAE land easily on standard Transformer architecture.
- Full model stack open-sourced including base, distilled, and super-resolution models.

Source: Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model


Also Worth Noting

World Model Evaluation Shifts From Visual Fidelity to 4D Interaction (Evaluation). A new evaluation paradigm centered on physics consistency and controllability. Source: Omni-WorldBench

LLM Agent Workflows: Static Templates to Dynamic Runtime Graphs (Agent). Systematic survey organized by "when is structure determined," directly useful for architecture decisions. Source: From Static Templates to Dynamic Runtime Graphs

Inject 3D Spatial Awareness Without Touching the Vision Encoder (Multimodal). Language-guided reasoning extracts overlooked spatial understanding from 2D pretrained representations. Source: SpatialBoost

Repurpose Geometric Foundation Model Features as Diffusion Latent Space (Image Gen). Multi-view geometric consistency built in, not post-processed. Source: Repurposing Geometric Foundation Models

A New Fix for Recursive Self-Improvement Drift (Training). Symbolic verification as anchors stabilizes reasoning chain quality across DPO iterations. Source: Symbolic Recursive Self-Alignment

Unified Spatiotemporal Token Compression for Video LLMs (Efficiency). Maintains performance at ultra-low retention rates, more efficient than staged pruning. Source: Unified Spatiotemporal Token Compression

Teach Speech Models to Respect Duration Constraints (Multimodal). A hard requirement for voice assistant deployment; MIT open-sources a post-training approach. Source: TiCo

Continual Unlearning for Multimodal LLMs (Safety). Selectively refuse under sequential deletion requests without destroying shared representations. Source: Continual Unlearning for LVLMs

Unify Three Hand-Object Interaction Tracks Into One Sim-to-Real Framework (Robotics). Pose, appearance, and motion generation in a single pipeline. Source: PAM

Test-Time Scaling for Image Restoration (Image Gen). Adapt flow matching models to degradation types at inference without modifying pretrained weights. Source: Tuning Real-World IR at Inference

Today's Observation

Today's two RL papers look unrelated. One analyzes token-level update mechanics in RLVR for language reasoning. The other diagnoses training collapse in GRPO for video generation. Their findings point to the same conclusion: we don't understand how RL signals propagate inside generative models well enough, and that gap blocks RL recipes from transferring across modalities.

The RLVR direction analysis shows that the sign of token-level updates explains reasoning gains better than magnitude. The prior frame, "how much did the output distribution change," may have been measuring the wrong thing. The manifold-aware video RL work shows GRPO collapses on video because exploration noise pushes samples off the pretrained manifold. Stochastic exploration that is routine when sampling tokens from a language model becomes destabilizing once it is injected as SDE noise into a continuous flow-matching model.

The signal is clear: RL for generation is still in its diagnostic phase. Recipes validated on language and images cannot be assumed to transfer to video, audio, or other modalities. Each modality's generative model has different internal structure (autoregressive vs. flow matching vs. diffusion), and RL signals propagate differently through each. Diagnose first, then design.

If your team is porting GRPO or RLVR to a new setting, invest in a diagnostic round first. Run small-scale training. Monitor update direction and magnitude distributions at the token or timestep level. Confirm RL signals reach the positions that matter before committing to a recipe tuned for language models.
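A diagnostic pass of this kind can start very small. The sketch below assumes you can log per-token (or per-timestep) log-probabilities of the same rollout under the policy before and after an update. The function name, the `top_frac` cutoff, and the returned fields are hypothetical choices, not a standard tool.

```python
def update_diagnostics(logp_before, logp_after, top_frac=0.1):
    """Summarize direction and magnitude of per-position log-prob changes."""
    deltas = [a - b for b, a in zip(logp_before, logp_after)]
    n = len(deltas)
    mags = sorted((abs(d) for d in deltas), reverse=True)
    k = max(1, int(n * top_frac))
    total = sum(mags)
    return {
        "frac_up": sum(d > 0 for d in deltas) / n,      # positions pushed up
        "frac_down": sum(d < 0 for d in deltas) / n,    # positions pushed down
        "mean_abs": total / n,                          # average update size
        # Share of total |delta| carried by the top `top_frac` of positions:
        # values near 1.0 mean the update is sparse.
        "top_mass": sum(mags[:k]) / total if total > 0 else 0.0,
    }


# Made-up log-probs for a 10-token rollout; only two positions change.
before = [-1.0, -2.0, -0.5, -3.0, -0.1, -0.2, -0.3, -0.4, -0.6, -0.7]
after = [-1.0, -0.5, -0.5, -3.5, -0.1, -0.2, -0.3, -0.4, -0.6, -0.7]
report = update_diagnostics(before, after)
print(report)
```

If `top_mass` stays high and the large-magnitude positions are not the ones you expect to matter, that is a signal to revisit the recipe before scaling the run.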