Video World Models Stall at 24%, Jailbreaks Only Mute a Few Heads

Today's Overview

The real bottleneck for video world models is broken physics, not image quality. PhysisForcing adds physical constraints at contact and deformation regions, lifting closed-loop success as a robot world model from 16% to 24% — right direction, still low.
Reward went up while the image quietly got worse. NormGuard found that RL post-training inflates model norms by 5%-15%, a quality-loss signal the reward proxy can't catch but you can read off directly.
Jailbreaks don't erase safety features — they suppress a handful of attention heads. Switch off a few early-layer "attack-vulnerable heads" and the model complies with harmful requests, while mid-layer safety heads stay active. Read them for training-free detection.
A big lab turned a paper assistant into a pre-submission review tool. Google's PAT uses inference-time scaling to surface deep errors, recalling 34% more than zero-shot on SPOT, positioned as assisted verification rather than a verdict.

Featured

01 Video Models as Robot World Simulators Break on Physics

More teams are using video generation models as a robot's world simulator — let the model rehearse a manipulation, then decide based on what it predicts. The problem: both general video models and robot-finetuned ones generate physically impossible frames. Trajectories jump. Hands clip through objects.

PhysisForcing takes a practical route. It locates two sources of instability — deformation of moving objects, and implausible spatiotemporal correlations between entities at the moment of contact — then constrains those physics-dense regions. A pixel-level trajectory-alignment loss and a semantic-level relation-alignment loss do the supervision, the latter pulling region-to-region relations from a frozen video-understanding encoder.
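To make the two-loss idea concrete, here is a minimal sketch on toy data. It is not the paper's implementation: the function names, the binary region mask, the use of cosine similarity for region-to-region relations, and the weight `lam` are all assumptions made for illustration.

```python
import math

def trajectory_alignment_loss(pred_tracks, ref_tracks, region_mask):
    """Mean squared distance between predicted and reference 2D point tracks,
    counting only tracks flagged as physics-dense (mask entry is truthy)."""
    total, count = 0.0, 0
    for pred, ref, keep in zip(pred_tracks, ref_tracks, region_mask):
        if not keep:
            continue
        for (px, py), (rx, ry) in zip(pred, ref):
            total += (px - rx) ** 2 + (py - ry) ** 2
            count += 1
    return total / count if count else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def relation_matrix(region_feats):
    """Pairwise similarity between region feature vectors."""
    return [[cosine(a, b) for b in region_feats] for a in region_feats]

def relation_alignment_loss(gen_feats, enc_feats):
    """Mean squared gap between the generator's region-relation matrix and the
    one computed from a frozen encoder's features (hypothetical inputs)."""
    g, e = relation_matrix(gen_feats), relation_matrix(enc_feats)
    n = len(g)
    if n == 0:
        return 0.0
    return sum((g[i][j] - e[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def physics_loss(pred_tracks, ref_tracks, region_mask, gen_feats, enc_feats, lam=0.5):
    """Pixel-level term plus a weighted semantic-level term; lam is an assumed weight."""
    return (trajectory_alignment_loss(pred_tracks, ref_tracks, region_mask)
            + lam * relation_alignment_loss(gen_feats, enc_feats))
```

The mask is what keeps the supervision focused: tracks outside the flagged regions contribute nothing, so the extra loss does not fight the base video objective on static background.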

On R-Bench it beats vanilla finetuning by 7.1% and 3.7% (22.3% and 9.2% total over baseline). The number worth watching is closed-loop: used as a world model under an action-planning protocol, success goes from 16% to 24%. That absolute level is still low, so video-as-world-model is far from reliable. But it shows physical constraints recover real capability, not just prettier frames. The paper has only 4 upvotes and its abstract gives limited detail; whether the result holds at high resolution and long horizons needs the full paper.
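The closed-loop figures read differently depending on whether you quote the absolute or the relative gain. A small arithmetic sketch, using only the 16% and 24% reported above (the helper name `gains` is made up for this example):

```python
def gains(before_pct, after_pct):
    """Return (absolute gain in percentage points, relative gain as a fraction)."""
    absolute_points = after_pct - before_pct
    relative = absolute_points / before_pct
    return absolute_points, relative

points, rel = gains(16.0, 24.0)
print(f"+{points:.0f} points absolute, {rel:.0%} relative")  # +8 points absolute, 50% relative
print(f"{100.0 - 24.0:.0f}% of episodes still fail")          # 76% of episodes still fail
```

A 50% relative improvement sounds large, but a 76% failure rate is the figure that matters for anyone thinking about a decision loop.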

Key takeaways:
- The true obstacle for video-as-world-model is broken physics (trajectory jumps, clipping), not image quality.
- Teams working on embodied AI should track this constraint-based line.
- 16%→24% closed-loop success means the direction is right but the level is still low, so don't put a video world model inside a production decision loop yet.

Source: PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

02 Reward Up, Image Down: Norm Inflation Is a Self-Check Signal

RL post-training a flow-based image generator raises the reward score while perceptual quality often drops — and that drop is exactly what the reward proxy can't measure. NormGuard found a structural signal you can read off directly. Across three post-training methods (NFT, AWM, DPO), RL inflates the per-step velocity norm — roughly how hard the model "pushes" at each generation step — by 5% to 15% relative to the reference model.
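One way to turn this into a self-check is to compare mean per-step velocity norms between the tuned model and the reference model. The sketch below assumes you can already dump each model's velocity predictions as flat lists of floats for the same prompts, seeds, and timesteps; the function names and the `warn_at` default (set to the low end of the reported 5%-15% range) are illustrative choices, not something the paper prescribes.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def mean_velocity_norm(velocities):
    """Average L2 norm over a list of per-step velocity vectors."""
    return sum(l2_norm(v) for v in velocities) / len(velocities)

def norm_inflation(tuned_velocities, reference_velocities):
    """Relative change of the tuned model's mean velocity norm vs. the reference.
    0.10 means the tuned model's norm is 10% larger."""
    ref = mean_velocity_norm(reference_velocities)
    tuned = mean_velocity_norm(tuned_velocities)
    return tuned / ref - 1.0

def flag_inflation(inflation, warn_at=0.05):
    """True when inflation reaches the (assumed) warning level."""
    return inflation >= warn_at
```

Logging this ratio alongside the reward curve during post-training gives a second signal that does not depend on the reward proxy.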

The diagnosis is the interesting part. That inflation is baked into the weights: shrinking the norm at inference time neither raises the reward nor fixes quality. A companion analysis shows that suppressing the inflation costs no reward signal. So the fix belongs in training. NormGuard uses a hinge penalty that activates only when the norm exceeds a threshold, improving quality while preserving reward — and the gain grows as you use fewer steps.
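A hinge penalty of this kind can be sketched in a few lines. The source only says the penalty activates once the norm exceeds a threshold; the linear form, the `threshold` value, and the `weight` used to combine it with the RL loss below are assumptions for illustration.

```python
import math

def hinge_norm_penalty(velocity, threshold):
    """Zero while the velocity norm stays at or below the threshold,
    then grows linearly with the excess (assumed linear hinge)."""
    norm = math.sqrt(sum(x * x for x in velocity))
    return max(0.0, norm - threshold)

def total_loss(rl_loss, velocity, threshold, weight=0.1):
    """RL objective plus the weighted hinge term; weight is a hypothetical knob."""
    return rl_loss + weight * hinge_norm_penalty(velocity, threshold)
```

Because the term is exactly zero below the threshold, it leaves updates that do not inflate the norm untouched, which is consistent with the claim that suppressing inflation costs no reward.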

Key takeaways:
- When tuning image models with DPO/RLHF, a higher reward doesn't mean a better image; 5%-15% norm inflation is a monitorable quality-loss signal.
- Shrinking the norm at inference is a useless after-the-fact patch; solve it in training.
- This is a diagnostic metric you can use for self-checks right now, especially if you ship few-step inference.

Source: NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning

03 When a Jailbreak Lands, the Model Still Knows It's Harmful

Why jailbreaks work has always been somewhat mystical. This paper gives a mechanistic answer: the attack doesn't wipe out the model's safety features wholesale, it suppresses specific attention heads. The work separates two functionally distinct groups — early-layer "attack-vulnerable heads" (ACHs) that get pushed down, and mid-layer "safety-alignment heads" (SAHs) that keep firing even when the attack succeeds.

Ablation nails the causal chain. Switch off just a few ACHs and the model complies with requests it should refuse. The attack suppresses those ACHs precisely through the special tokens inside jailbreak templates. The counterintuitive part: since the internal safety signal survives, reading the still-active SAHs directly — no training, no finetuning — gives competitive harmful-content detection that holds up well against adversarial attacks.
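Training-free detection of this sort amounts to reading a score off a fixed set of heads and thresholding it. The sketch below is a toy version: it assumes each head's activity has already been summarized as one float keyed by `(layer, head)`, and the head indices, the mean-pooling, and the threshold are all hypothetical rather than the paper's procedure.

```python
def safety_score(head_activations, safety_heads):
    """Mean summarized activation over the chosen (layer, head) pairs."""
    values = [head_activations[key] for key in safety_heads]
    return sum(values) / len(values)

def is_harmful(head_activations, safety_heads, threshold):
    """Flag the request when the safety-head score reaches the threshold.
    No weights are trained; the threshold would be picked on held-out prompts."""
    return safety_score(head_activations, safety_heads) >= threshold

# Toy numbers only: two mid-layer heads stay high, one early-layer head is suppressed.
activations = {(14, 3): 0.9, (15, 7): 0.7, (2, 1): 0.05}
mid_layer_heads = [(14, 3), (15, 7)]
print(is_harmful(activations, mid_layer_heads, threshold=0.5))
```

The design point is that the detector never looks at the early-layer heads the attack suppresses, so a jailbreak that works by muting them does not automatically blind it.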

Key takeaways:
- Jailbreaks go from "bypassing alignment as a whole" to a localizable circuit problem; defenses can target specific heads instead of full retraining.
- The safety signal survives a successful jailbreak in mid layers, opening a training-free detection path.
- The mechanism rests on single-model attention-head analysis; whether it reproduces across models and real attack distributions needs the full paper.

Source: Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

04 Google Draws the Line at Assisted Verification

"Can AI do peer review" has been argued for a while. Google's paper offers a deployed sample. PAT (Paper Assistant Tool) is an agentic review framework that takes a whole paper and does specific work: checking theoretical derivations, validating experiments, suggesting improvements, finding potential flaws. It doesn't bet on a single model call — it uses inference scaling to dig out deeper issues, recalling 34% more than zero-shot on the SPOT math-error benchmark.
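The general pattern behind spending more inference on review is to run several independent passes and merge what they flag. The sketch below shows that pattern and how recall is scored against known errors. It is not PAT's pipeline: the issue IDs, the voting rule, and the toy pass outputs are invented for illustration and are not results from the paper.

```python
from collections import Counter

def aggregate_findings(passes, min_votes=1):
    """Merge issue IDs flagged across independent review passes.
    An issue is kept if at least min_votes passes flagged it."""
    votes = Counter(issue for found in passes for issue in set(found))
    return {issue for issue, n in votes.items() if n >= min_votes}

def recall(flagged, true_errors):
    """Fraction of known errors that appear among the flagged issues."""
    if not true_errors:
        return 0.0
    return len(flagged & true_errors) / len(true_errors)

# Toy data (hypothetical issue IDs, not benchmark results).
true_errors = {"eq3-sign", "lemma2-gap", "table1-units"}
single_pass = [{"eq3-sign"}]
many_passes = [{"eq3-sign"}, {"lemma2-gap"}, {"eq3-sign", "fig2-label"}]

print(f"{recall(aggregate_findings(single_pass), true_errors):.2f}")  # 0.33
print(f"{recall(aggregate_findings(many_passes), true_errors):.2f}")  # 0.67
```

Raising `min_votes` trades recall for fewer false alarms, which matters for a tool meant to assist authors rather than issue verdicts.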

The deployment matters more. PAT ran as a pre-submission tool for authors at two conferences, STOC and ICML, positioned to catch errors early and lighten reviewer load — not to render a verdict. The decision stays with people. This is a real reference point for agents landing on professional-judgment tasks: the boundary sits at assisted verification, not replacing the call.

Key takeaways:
- For agents on professional-judgment tasks, the viable position today is pre-submission self-check and assisted verification, not replacing the judgment; this is useful for teams building review products.
- Inference scaling genuinely helps surface deep errors; the 34% recall gain comes from repeated reasoning, not a single call.
- SPOT only covers math errors; how far this reaches on softer calls like fabrication or novelty needs the full paper.

Source: Towards Automating Scientific Review with Google's Paper Assistant Tool

Also Worth Noting

Pulling 4D Multi-Object Interactions From Casual Monocular Video Into VLA Training [Robotics]: another path to real interaction data at scale, the "mining real data" counterpart to PhysisForcing's synthetic route.

VLA Long-Horizon Error Accumulation Traces Back to Fixed-Weight Static Feature Fusion [Robotics]: S²-VLA treats it with state-space-guided dynamic fusion.

Under Decentralized, Partially Observable Conditions, LLM Multi-Agents Misalign With Teammates and Environment State [Agent]: LLawCo explicitly learns "cooperation laws" to model embodied multi-agent behavior.

Neural Video Codecs Beat Classical on Efficiency but Resist Deployment, With Uncertain Cross-Hardware Results [Video Gen]: MLVC builds a learned codec aimed at real multi-platform use.

Adversarial Robustness Without Pruning or Masking [Safety]: learn to both amplify and attenuate non-robust features; the method is lightweight.

Fine-Grained Skill Assessment (Sports, Surgery) Needs Step-by-Step Visual Reasoning [Multimodal]: combine latent visual diffusion with Monte Carlo tree search for stepwise judgment.

Scene Text Detection Drops the Moment Distribution Shifts [Multimodal]: TextDS does parameter-efficient representation alignment without large-scale pretraining.

Two-Person Conversational Facial Animation Must Match Both High-Level Cognitive Intent and Low-Level Motor Reflex [Multimodal]: existing methods handle neither end well; MindFlow coordinates both.

Remote-Sensing Change Captioning Has Long Been Capped by Small-Model Capacity [Multimodal]: RSICCLLM uses a multimodal large model to describe bi-temporal change.

How Chemical Reaction Networks as a Biochemical Probabilistic Compute Substrate Can Be Reduced [AI for Science]: a step toward cell-level adaptive programming.

Today's Observation

A cluster of embodied/VLA work today independently sidesteps the model itself and goes after a more underrated bottleneck: where does the action data come from. PhysisForcing takes the synthetic route — rebuild the video model into a physically plausible world simulator so it rehearses usable rollouts on its own. HAT-4D takes the mining route — extract 4D multi-object interactions from mountains of casual monocular video and feed them to a VLA. One makes data, one mines data, opposite directions hitting the same scarcity: real robot interaction data is expensive and rare. Add S²-VLA treating long-horizon error accumulation on the action side, and today's robotics story isn't "another stronger policy." Everyone is starting to assume the real chokepoint is the data pipeline, not the model architecture.

For action: if you work on embodied/VLA, before you evaluate a method next time, put "where does its training data come from, can it scale" on the same footing as "how strong is the policy." The marginal return on architecture is giving way to the dirtier, harder work of getting data.