Frontier Agents Finish One Task in Five at 1.6-Hour Length

Today's Overview

  • Stretch tasks to 1.6 hours and frontier agents finish only about one in five. OSWorld 2.0 moves the computer-use yardstick from about 30 tool calls per task to real workflows that take humans a median 1.6 hours and 318 calls. Claude Opus 4.8 finishes 20.6%, and the failures cluster in long-horizon context management rather than GUI control.
  • Policy checks should read the whole conversation, not block a single argument. PolicyGuard turns the verifier into a sub-agent that reads the dialogue, lifting pass rates 6–12 points on tau²-Bench airline while blocking half as often as argument-level guards.
  • Video evals are shifting from "can it see" to "can it follow along and do the task." VG-GUIBench asks GUI agents to operate an interface by following a tutorial video, paired with TASKER, which picks keyframes using both task and scene signals.
  • Portrait retouching switches from instructions to examples. MirrorPPR gives the model one before/after pair, infers the edit, and applies it to a new photo — trained on 47M pairs in a two-stage curriculum.
  • In transparent scenes, monocular depth "ground truth" is just an annotation convention. One ray passes through glass and both the foreground and background depths hold geometrically. MD-3k finds depth models pick different layers, and a training-free spectral transform flips a frozen model to the other one.

Featured

01 Stretch a Task to 1.6 Hours, Agents Finish a Fifth

Real workflows that take a human a median 1.6 hours and 318 tool calls — that is the new computer-use yardstick from OSWorld 2.0. The previous OSWorld averaged 30 calls per task. The 108 end-to-end workflows span everyday and professional scenarios, each built on real input files, checked against stateful user-profile data, and shipped with a separate safety audit report.

The numbers are blunt. The best performer, Claude Opus 4.8 with max thinking and batched tool calls, finishes 20.6% of tasks within 500 steps, with 54.8% partial credit. GPT-5.5 uses fewer tokens but stalls at 13%.

The failure breakdown is the real payload. Agents don't trip on basic GUI moves or writing code. They lose constraints, miss information that arrives mid-task, guess instead of asking the user, and skip verification. The hardest tasks hide state the agent has to dig out itself. Short benchmarks paper over exactly the long-horizon problems that real work runs into.

Key takeaways:
  • Long tasks at the 1.6-hour scale expose what agents actually can't do; a high score on a short benchmark doesn't mean the agent gets work done.
  • The strongest agent finishes only about 20%, so build computer-use products assuming the agent stalls midway, not that it runs clean end to end.
  • Failures cluster in context management, asking follow-ups, and self-verification; the work is in long-horizon state tracking, not GUI control.


02 Compliance Checks Fail at the Entry Point, Not the Model

Most enterprise agents treat "follow company policy" as a gate before execution: check the arguments of a tool call, block if they violate a rule. PolicyGuard argues the entry point is wrong. Real support and operations flows unfold over many turns, and many requirements — confirm with the user first, read the prerequisite before acting — never land on any single argument. You can only judge them from the whole conversation.

So PolicyGuard makes the verifier a sub-agent that reads the dialogue. It sees the same full conversation the main agent does, reasons against policy, and returns a concrete fix for the next turn rather than a blunt allow-or-block.
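To make the contrast concrete, here is a minimal sketch of a confirm-first rule that an argument-level gate cannot see but a dialogue-level check can. Everything in it is an illustrative assumption: the `Turn` type, the `cancel_booking` tool, and the keyword matching are hypothetical, and the rule-based `dialogue_verifier` is only a stand-in for the reasoning sub-agent the summary describes, not PolicyGuard's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str                                  # "user", "agent", or "tool_call"
    text: str = ""                             # utterance text (empty for tool calls)
    tool: str = ""                             # tool name (tool calls only)
    args: dict = field(default_factory=dict)   # tool arguments (tool calls only)


def argument_guard(call):
    """Argument-level gate: it sees one tool call and nothing else."""
    return bool(call.args.get("booking_id"))


def dialogue_verifier(history, call):
    """Conversation-level check for a hypothetical confirm-first rule.

    Returns (ok, fix). The fix is a suggested next step for the main
    agent rather than a bare allow-or-block decision.
    """
    if call.tool != "cancel_booking":
        return True, ""
    asked = False
    for turn in history:
        if turn.role == "agent" and "confirm" in turn.text.lower():
            asked = True
        elif turn.role == "user" and asked and turn.text.lower().startswith("yes"):
            return True, ""
    return False, "Ask the user to confirm the cancellation before calling cancel_booking."
```

With a history that contains only the user's request, `argument_guard` passes the call because its arguments are well formed, while `dialogue_verifier` returns a fix asking for confirmation first. The violation lives in the conversation, not in any single argument.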

On tau²-Bench airline, pass rate (pass^4) rises 12, 6, and 12 points on GPT-5.4, Claude Sonnet 4.6, and Gemini 2.5 Pro, respectively. More interesting, it reaches higher violation recall while blocking about half as often as argument-level guards. Fewer false stops mean less disruption to normal flows, which matters more in real deployment than a raw block-rate number.

Key takeaways:
  • Move compliance from single-argument interception to context judgment over the whole dialogue, or you miss multi-turn rules like confirm-first and read-first.
  • Teams building support and ops agents can borrow the sub-agent verifier with next-turn fixes instead of hard blocking.
  • Blocking less and blocking accurately has more deployment value than a higher block rate; cross-scenario generalization still needs the full paper to confirm.


03 Can a Model Follow a Tutorial Video and Finish the Job?

Open a software tutorial, follow it click by click, and finish a task in the interface. That is what VG-GUIBench tests: whether a GUI agent can learn transferable procedural skill from a tutorial video and carry it into a downstream long-horizon task, rather than just recognizing what is in the frame or what happened. It wires "watching a video" and "doing the thing the video shows" into one benchmark.

The authors find that VideoQA and video-guided GUI tasks bottleneck on the same step: keyframe extraction. Their TASKER algorithm picks the most informative frames using task relevance and scene dynamics together.
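The idea of combining the two signals can be sketched as a weighted score per frame. This is not TASKER itself: the summary does not give the scoring rule, so the linear mix, the `alpha` weight, the frame-difference measure, and the assumption that per-frame `relevance` scores arrive precomputed (for example from a text-image similarity model) are all hypothetical choices.

```python
def scene_dynamics(frames):
    """Mean absolute difference between each frame and the one before it."""
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return scores


def normalize(xs):
    """Min-max scale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def select_keyframes(frames, relevance, k, alpha=0.5):
    """Pick k frame indices by a mix of task relevance and scene change.

    frames:    list of flat feature vectors, one per frame
    relevance: assumed precomputed task-relevance score per frame
    alpha:     weight on task relevance (1 - alpha goes to scene dynamics)
    """
    rel = normalize(relevance)
    dyn = normalize(scene_dynamics(frames))
    combined = [alpha * r + (1 - alpha) * d for r, d in zip(rel, dyn)]
    top = sorted(range(len(frames)), key=lambda i: combined[i], reverse=True)[:k]
    return sorted(top)  # return indices in temporal order
```

Setting `alpha=1.0` reduces this to relevance-only selection and `alpha=0.0` to scene-change-only selection, which makes it easy to check how much each signal contributes on your own videos.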

The gains are modest — 2.0% over the best baseline on EgoSchema, 1.8% on NExT-QA. The numbers alone don't impress; the value is pushing video evals from perception toward "can you act on it." Whether GUI agents hold up on real long-horizon tasks needs the full paper's difficulty settings to judge.

Key takeaways:
  • Video evals are moving from "see the frame" to "act on what you saw," a line agent teams should track.
  • Keyframe extraction is the shared bottleneck for VideoQA and video-guided agent tasks; TASKER's task-plus-scene frame selection is worth borrowing.
  • The benchmark gain is about 2%; read it as a direction signal, not a performance breakthrough.


04 Portrait Retouching by Example, Not Instruction

"Take in the chin a bit, even out the skin tone" — fine retouching is nearly impossible to put in words. One sentence can't capture a local edit or convey a sense of degree. MirrorPPR swaps the interaction: no instruction, just one before/after pair. The model infers the edit from that pair, then applies it to a new portrait.

An operation extractor captures the subtle difference between the example pair and injects it into a pretrained diffusion Transformer (DiT), with LoRA for lightweight adaptation. The team built a dataset of 47M retouching pairs and trained with a two-stage curriculum: simulated first, then professional.

The paper claims it beats existing methods on both edit quality and identity preservation. This is a new task without a shared benchmark, so results lean on the quality of the example pair itself. Real generalization needs the full paper and hands-on testing to call.

Key takeaways:
  • Example-over-instruction is a new interaction path for fine image editing; the limits of text description show up sharply in portrait retouching.
  • Teams building e-commerce or photo tools should watch this pattern; it fits "apply one style across a batch."
  • The task lacks a mature benchmark; example-pair quality and cross-identity generalization are the two things to verify before shipping.


05 One Ray Through Glass: Which Depth Should the Model Report?

Transparent scenes hide a surprising gap. When a camera ray passes through foreground glass and also reaches the background, both depths hold geometrically — yet a monocular model has to emit one scalar per pixel. This work turns that layer ambiguity into a measurable benchmark, MD-3k.

The finding: leading depth foundation models pick different layers in the same scene. Some report the glass, some the background. The so-called depth ground truth is an annotation convention, not a property of the scene.

More surprising, a training-free spectral input transform (Laplacian Visual Prompting) flips a frozen model to the other layer, and the best RGB/LVP combination hits 75.5% accuracy on multi-layer spatial relations. For anyone using monocular depth in production: when a model "errs" on transparent, reflective, or layered scenes, the supervision label may be wrong, not the model.
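For intuition about what "training-free input transform" means in practice, here is a minimal sketch: apply a discrete Laplacian to the image and query the same frozen model on both the raw and the transformed input. The summary only says the transform is spectral and training-free, so the 4-neighbour kernel, the edge-replication padding, the single-channel image, and the `depth_model` callable are all assumptions made for illustration, not the paper's exact recipe.

```python
def laplacian(image):
    """4-neighbour discrete Laplacian with edge replication at the border.

    image: list of rows, each a list of numbers (one channel).
    """
    h, w = len(image), len(image[0])

    def px(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [
        [
            px(y - 1, x) + px(y + 1, x) + px(y, x - 1) + px(y, x + 1) - 4 * px(y, x)
            for x in range(w)
        ]
        for y in range(h)
    ]


def predict_both_layers(depth_model, image):
    """Query one frozen model twice: on the raw input and on its Laplacian.

    depth_model is a hypothetical callable; no weights are updated.
    """
    return depth_model(image), depth_model(laplacian(image))
```

The point of the sketch is the shape of the procedure: the model stays frozen, only the input changes, and the two outputs can then be compared per pixel to see whether they land on different layers.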

Key takeaways:
  • Monocular depth "ground truth" is a product of annotation convention; transparent and layered scenes have more than one correct answer.
  • If you use depth for 3D reconstruction, AR, or robot grasping, anomalies on transparent objects aren't necessarily model bugs.
  • A training-free spectral transform switches the reported depth layer, which means current models hold unstated geometric assumptions.


Also Worth Noting

06 Apple Evicts KV Cache by Coverage, Not Raw Attention Score. (Efficiency) Cheaper long-context inference with no quality drop.
07 Scale Activations of Selected Attention Heads Without Touching Pretrained Weights. (Multimodal) Near-zero extra parameters add spatial reasoning to a VLM.
08 Does a Model Actually Read Back the Intermediate State It Wrote on Its Scratchpad? (Interpretability) Stanford asks whether those variables feed the computation or just look nice.
09 Let Multiple LLMs Invent a Compact Symbolic Code to Collaborate. (Reasoning) It replaces verbose natural-language CoT for higher reasoning efficiency.
10 Use "Bayesian Surprise" as the Reward Signal. (AI for Science) Allen Institute steers an LLM through long-horizon hypothesis search-and-verify loops to pick what to test next.
11 Reclaim the Idle Compute Edge AI Chips Reserve for Peaks. (Efficiency) Huawei uses general approximation to harvest capacity that mostly sits idle.
12 VLMs Inherit Relational Priors From Language but Can't Apply Them to Images. (Multimodal) This work uses Gromov-Wasserstein to align semantic relations across language and vision.
13 Face Video Super-Resolution Doesn't Have to Be Full Generation. (Video Gen) Initialize from dynamic trajectories and skip the expensive fixed-sampling inference.
14 3D Scene Layout Stops Converting Assets and Coordinates to Text First. (Image Gen) Working natively in 3D lets the LLM arrange more sensible layouts.
15 Whether a Model Can "See" an Object Is Capped by the Description System It Learned. (Interpretability) Microsoft borrows Wittgenstein to draw a boundary around "seeing."

Today's Observation

Two vision papers today poke the same assumption from opposite ends: is "ground truth" in vision tasks inherent to the scene, or a convention we stick on it. One Scene, Two Depths proves it with transparent scenes — one ray through glass, both foreground and background depths valid, and the scalar depth label just picks one layer as the answer. Microsoft's Can Machines Really See Objects cuts from the language side: what a model can "see" is bounded by the description system it learned, so the edge of recognition is the edge of that system.

One says ground truth is an annotation convention, the other says recognition is language framing. They land on the same counterintuitive reminder: when a model fails on depth or recognition, the first move isn't to fix the model. Look back at the labels and the eval criteria — the criteria themselves may be wrong, having made a choice about the scene they had no standing to make.

Here's one concrete thing to do. Pull the hard cases your product "keeps getting wrong," leave the model alone, and have two people relabel them independently. Check the agreement. If people can't agree, the falling metric reflects your ground-truth definition, not model capability. Fix the eval criteria before adding another round of training data.
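One common way to check agreement is Cohen's kappa, which corrects raw agreement for what two annotators would match on by chance. The digest does not name a metric, so treating kappa as the measure is an assumption; the sketch below is a plain standard-library version for two annotators with categorical labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    labels_a, labels_b: equal-length lists of categorical labels,
    one entry per relabeled hard case.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of cases where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random with
    # their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the labels are stable and the model is the more likely culprit. A kappa near 0 means the two annotators agree no more than chance, which points at the label definition. Where to draw the line between those is a judgment call for your task.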