3D at 0.1% Tokens, Video Fine-Tuning's Hidden Spatial Cost

Today's Overview

Misaligned experience replay is a silent bottleneck in agent RL. Complementary RL lets the experience extractor adapt based on policy performance, enabling co-evolution instead of static accumulation.
Video-SFT's temporal gains come at the cost of spatial understanding. Systematic experiments across architectures and scales confirm this is a structural trade-off, not a model-specific bug.
Video generation as auxiliary supervision for robot policies, disabled at deployment. GigaWorld-Policy's decoupled design runs 9x faster than Motus with 7% higher success rate.
3D tokenization shifts from geometric to semantic hierarchy. LoST achieves better reconstruction quality with 0.1% of prior methods' token count.
Token pruning across ViT and LLM finally unified. STTS is end-to-end trainable, with efficiency gains that scale as you sample more frames for long videos.

Featured

01 Agent Why Experience Replay Goes Stale

RL-trained agents have an under-discussed degradation problem. You accumulate historical experiences to help the policy learn, but as the policy improves, the experience pool stays frozen. Beginner-stage notes start misleading intermediate-level decisions. Existing approaches either store experiences statically or update too slowly to keep pace with policy iteration. The gap widens throughout training.

Complementary RL borrows from neuroscience's "complementary learning systems" theory. The experience extractor and policy network co-optimize within the same RL loop. The policy learns from sparse outcome rewards. The experience extractor adjusts based on whether its contributions actually helped the policy improve. Experience value is judged by real policy performance, not an independent standard.

Single-task performance improves by 10%, with solid scaling in multi-task settings. Code is available. The specific numbers matter less than the idea itself: co-evolution of experience and policy is an unavoidable problem in long-horizon agent training. Most agent frameworks haven't seriously addressed this layer.

Key takeaways: - Misaligned decay between experience replay and policy is an underestimated efficiency bottleneck in agent RL. - Complementary RL dynamically adapts the experience extractor based on actual policy performance, enabling co-evolution. - Teams building long-horizon agents should pay attention to experience management as a neglected dimension.

Source: Complementary Reinforcement Learning

02 Multimodal Video Fine-Tuning's Hidden Bill

Video-SFT is standard practice for multimodal LLMs. The default assumption: adding video data only makes the model stronger. This systematic study breaks that assumption. Across architectures and parameter scales, Video-SFT consistently degrades static image understanding while improving video performance. More counterintuitively, sampling more frames boosts video scores but doesn't lift image ability. Temporal and spatial understanding trade off structurally.

The authors propose an instruction-aware mixed-frame strategy for partial mitigation. "Partial mitigation" says it all: this trade-off is far from solved.

Key takeaways: - Video-SFT is not a free lunch. Temporal understanding gains come at the cost of spatial regression. - This pattern holds across architectures and scales. Not a single-model bug. - Teams doing video fine-tuning need to monitor image benchmarks alongside video metrics.

Source: Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

03 Robotics Train with Video Generation, Deploy Without It

Jointly predicting visual dynamics and actions is expensive. Inference overhead is high, and action accuracy gets dragged down by video generation quality. GigaWorld-Policy takes a pragmatic decoupling approach. During training, the model predicts both action sequences and future video. Video generation provides physical-plausibility supervision. Through causal design, future video tokens don't influence action tokens. At deployment, video generation shuts off. Only action prediction runs.

This "use during training, discard at deployment" design delivers real gains: 9x faster inference than Motus with 7% higher task success rate. For robot deployments needing real-time response, this balance of inference efficiency and action accuracy matters more than chasing better generation numbers.

Key takeaways: - Decouples visual dynamics from action prediction: video generation serves as auxiliary supervision during training, gets disabled at deployment. - 9x faster than Motus with 7% higher success rate; 95% improvement over pi-0.5 on RoboTwin 2.0. - Inference efficiency matters more for real robot deployment than generation quality benchmarks.

Source: GigaWorld-Policy: An Efficient Action-Centered World–Action Model

04 Architecture 3D Generation Has Been Borrowing the Wrong Tokenization

0.1% of the token count. Better reconstruction quality. LoST's move: shift 3D tokenization from geometric hierarchy to semantic hierarchy. Current autoregressive 3D generation borrows geometric Level-of-Detail from rendering and compression. Encode coarse geometry first, then refine. This hierarchy was designed for rendering efficiency, not autoregressive modeling. Tokens lack semantic coherence, and sequences run long.

LoST reorders tokens by semantic salience. The first few tokens decode into a complete, plausible shape. Later tokens add detail. DINO feature space alignment ensures token order reflects "what matters" rather than "where things are." The result: 0.1%–10% of prior methods' token count at better reconstruction and generation quality.

Key takeaways: - The bottleneck in 3D autoregressive generation is tokenization assumptions, not model size. - Ordering tokens by semantic hierarchy instead of geometric hierarchy improves sequence efficiency by 2–3 orders of magnitude. - This infrastructure-level fix matters more for practical 3D deployment than scaling up models.

Source: LoST: Level of Semantics Tokenization for 3D Shapes

05 Efficiency Video LLM Token Pruning Always Misses Half the Problem

The inference bottleneck in video VLMs largely comes from spatiotemporal redundancy in visual tokens. Existing pruning approaches hit an awkward gap. ViT-side pruning ignores downstream language task needs. LLM-side pruning accepts all redundant tokens from ViT.

STTS unifies both stages. A lightweight scoring module prunes tokens in ViT and LLM simultaneously. Temporal pruning uses auxiliary loss. Spatial pruning follows LLM gradient backpropagation. No text-conditioned filtering, no token merging. End-to-end training is all it takes.

Across 13 video QA tasks, average performance drops less than 1 point. Efficiency gains grow as you sample more frames, making this especially practical for long video. The method isn't complex. It wins by solving a fragmentation problem with one unified framework.

Key takeaways: - Unifies token pruning across ViT and LLM, fixing the disconnect in approaches that prune each side independently. - End-to-end trainable without text-conditioned filtering; low integration cost. - Efficiency gains increase with more sampled frames. Clear deployment value for long videos.

Source: Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

3D at 0.1% Tokens, Video Fine-Tuning's Hidden Spatial Cost

Also Worth Noting

Discrete Audio Tokens at 12.5fps for Autoregressive Speech Generation. MultimodalA complete, deployable open-source foundation model pipeline. MOSS-TTS

Joint Appearance and Stereo Geometry from Pure RGB. Video GenEnd-to-end stereo video generation without depth maps. StereoWorld

MLLMs Expose Systematic Hallucination on Fine-Grained Negation. EvaluationPassable coarse-grained scores mask failure modes at finer granularity. FINER

Hierarchical Grids Compress Long Video Navigation to Log-Scale Compute. MultimodalOperates directly on raw frames, no caption preprocessing needed. VideoAtlas

Majority Voting in Label-Free RL Collapses Output Diversity. ReasoningCo-evolving generator and verifier breaks the consensus trap. CoVerRL

RL-Trained Code Search Agent for Precise Large-Repo Localization. Code IntelligenceThe upstream bottleneck for coding agents: get localization wrong and everything downstream fails. CodeScout

Integrated Gradients Guide Layer-Wise Mixed-Precision LVLM Quantization. EfficiencyQuantization sensitivity visualization directly cuts deployment cost. QAIG

GQA-to-MLA via Low-Rank Factorization. ArchitectureDrops KV-cache overhead without retraining. CARE

A New Fix for DPO's Squeezing Effect. TrainingSharpness-aware optimization in logit space balances alignment and generalization. LogitSAM

Adaptive Zoom and Instruction Refinement for Small GUI Elements. AgentA practical GRPO-trained approach. AdaZoom-GUI

Today's Observation

Three findings from today connect in an interesting way. Complementary RL reveals that agent experience pools go stale without active updates. CoVerRL finds that majority voting in label-free RL collapses output diversity. The Video-SFT study shows temporal gains come at a hidden spatial cost.

These problems span agent training, reasoning optimization, and multimodal fine-tuning. They share one underlying mechanism: optimization pressure creates visible progress on target metrics while inflicting systematic damage on dimensions that standard evaluations don't cover. Nobody monitors experience pool freshness during agent training. Nobody tracks output distribution entropy during reasoning training. Nobody watches image understanding retention during video fine-tuning. These aren't vague "training is hard" complaints. They're specific metric blind spots: degradation dimensions that default pipelines don't monitor and therefore don't expose.

"Pipeline runs, metrics go up" doesn't mean training is working correctly. A low-cost countermeasure: add 2–3 non-target degradation metrics to your training dashboard. Nothing fancy. Plot the curves you're currently ignoring: distribution shift between experience pool and current policy, output diversity entropy, periodic eval on non-target tasks. When those curves drop, trigger a review. Near-zero cost, but catches damage before it compounds into deployment problems.