A 35B Agent Reaches for Trillion-Scale, and Async Lag Is Overrated

Today's Overview

  • A 35B Agent Reaches Trillion-Scale Performance by Scaling Sideways. Agents-A1 doesn't grow parameters. It stacks 45K-token trajectories and heterogeneous skills to match 1T models like Kimi-K2.6 and DeepSeek-V4-pro — but only on some benchmarks.
  • 194 Upvotes for Orca's Ambition. It wants a single world latent space and Next-State-Prediction to unify understanding, prediction, and action. The authors call it a preview. Big vision, early landing.
  • The Cost of Async Pipelines May Be Overstated. Whether gradient staleness hurts depends on the optimizer. AdamW degrades; the newer Muon shrugs off one-step delay, closing the async-sync gap at 10B scale.
  • Tabular Foundation Models Are "General" Only in Their Comfort Zone. Across 11 models and 142 datasets, TFMs lead on small-to-medium IID data. Scale up, add dimensions, or break IID, and tree models take over again.
  • What Stalls Mobile 3D Rendering Is Spherical Harmonics, Not Image Quality. Flux-GS uses Monte Carlo energy aggregation to cut the cost of high-order SH, opening a cheaper path for AR and on-device 3D.

Featured

01 A 35B Agent Reaches for Trillion-Scale Performance

The industry default for stronger agents is vertical: make the base model bigger. Agents-A1 pulls a different lever. It keeps parameters at 35B (MoE) and scales the agent horizon instead.

The recipe builds long-range "knowledge-action-observation-verification" infrastructure that produces agentic trajectories averaging 45K tokens. Training runs in three stages: full-domain SFT for alignment, per-domain teacher models, then on-policy distillation with multi-teacher domain routing that folds six heterogeneous domains into one deployable student. The paper claims this reaches trillion-parameter performance, benchmarked against 1T models like Kimi-K2.6 and DeepSeek-V4-pro.
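To make the third stage concrete, here is a minimal sketch of what "on-policy distillation with multi-teacher domain routing" could look like. Everything in it is an assumption for illustration: the summary above doesn't give the paper's loss, so the sketch uses per-token reverse KL (a common choice for on-policy distillation), the router is a plain dictionary lookup, and the names `routed_distill_loss`, `teachers`, and the two toy domains are hypothetical.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token's next-token distribution."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def routed_distill_loss(domain, student_steps, teachers):
    """Average per-token reverse KL against the teacher routed by domain.

    student_steps: next-token distributions the student produced while
                   generating its own trajectory (that is the on-policy part).
    teachers:      dict mapping a domain name to a function that returns
                   that teacher's distribution at a given step index.
    """
    teacher = teachers[domain]  # assumed routing: a plain lookup by domain
    per_token = [reverse_kl(s, teacher(i)) for i, s in enumerate(student_steps)]
    return sum(per_token) / len(per_token)

# Toy setup: a 3-token vocabulary and two hypothetical domain teachers.
teachers = {
    "code": lambda i: [0.7, 0.2, 0.1],
    "search": lambda i: [0.1, 0.2, 0.7],
}
student_steps = [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1]]

loss_code = routed_distill_loss("code", student_steps, teachers)
loss_search = routed_distill_loss("search", student_steps, teachers)
print(f"routed to code teacher: {loss_code:.4f}")
print(f"routed to search teacher: {loss_search:.4f}")
```

The same student trajectory scores very differently depending on which teacher it is routed to, which is why routing matters when six domains are folded into one student.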

Put a question mark on "trillion-parameter performance." Agents-A1 leads on SEAL-0 (56.4) and IFBench (80.6), but it's only "competitive" on SciCode, HLE, and BrowseComp. Which model it beats, and on which tasks, is something you have to check against the tables yourself. The signal isn't the 35B-matches-1T number. It's how the paper reframes where to invest: for teams that can't afford a frontier base model, building long-horizon trajectory infrastructure may pay off more than stacking parameters.

Key takeaways:
- Parameter count isn't the only axis for scaling agents. Long-range trajectories and heterogeneous skill-stacking are a real alternative.
- "35B matches 1T" holds only on some benchmarks. Verify task-by-task before you rely on it.
- For budget-limited teams, long-horizon knowledge-action-verification infrastructure may return more than raw parameters.


02 194 Upvotes for a Direction, or a Roadmap That Hasn't Landed?

194 upvotes in a day says the community is buying Orca's ambition. It wants one world latent space that packs understanding, prediction, and action into a single training objective — replacing separate next-token, next-frame, and next-action losses with Next-State-Prediction.

The scale looks impressive on paper: 125K hours of video plus 160M event annotations. After pretraining, the backbone freezes, and lightweight decoders handle text generation, image prediction, and embodied action at once. But the authors call it an "initial instantiation" in the abstract, and the paper includes a section on limitations. Translation: big vision, early landing.

The real signal isn't the claim that it "matches same-size specialist models." It's whether the authors publish the weakest of the three downstream results, and whether the latent space actually gets stronger with scale. Both need the full paper to judge.

Key takeaways:
- A unified state-transition objective is an imaginative direction, but it's a preview, not a method you can pick up today.
- If the frozen-backbone plus lightweight-decoder design holds, one pretrain serving many multimodal tasks is the reuse pattern to watch.
- Treat it as a roadmap statement for now. Wait for harder scaling evidence before betting.


03 The Cost of Async Pipelines Is Overstated

Synchronous pipeline parallelism idles GPUs inside the pipeline bubble. Async fills those gaps, but the gradient lags one step behind — gradient staleness. The field has assumed this lag destabilizes training, so constant one-step schedules like PipeDream-2BW rarely get used.
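For a feel of how much the bubble costs, the usual textbook idealization for a synchronous schedule with p stages and m microbatches puts the idle share at (p - 1) / (m + p - 1). The sketch below just evaluates that formula. It assumes every stage takes equal time per microbatch, and the stage and microbatch counts are illustrative, not taken from the paper.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle share of one synchronous pipeline step: (p - 1) / (m + p - 1)."""
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be positive")
    return (stages - 1) / (microbatches + stages - 1)

# Illustrative configuration: 8 pipeline stages, varying microbatch counts.
for m in (4, 8, 32):
    print(f"8 stages, {m:>2} microbatches: {bubble_fraction(8, m):.1%} idle")
```

With 8 stages, going from 4 to 32 microbatches takes the idle share from roughly 64% down to roughly 18%. Async schedules aim to remove that idle time without needing ever-larger batches.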

This paper's core claim: whether staleness hurts depends mostly on your optimizer, not on async itself. AdamW degrades clearly. The newer Muon is strongly robust to one-step delay. The authors add a general correction term inspired by Error Feedback, give a convergence proof for Muon, and close the async-sync performance gap on models up to 10B parameters.
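To see why one-step staleness has a bad reputation in the first place, here is a toy sketch: plain gradient descent on a one-dimensional quadratic, run once with fresh gradients and once with gradients that are one step old. This is deliberately not AdamW, not Muon, and not the paper's correction term; the function `run` and its parameters are hypothetical and only illustrate what "one-step delay" does to a simple update rule.

```python
def run(lr, delay=0, steps=60, curvature=1.0, x0=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x**2.

    With delay=1, each update uses the gradient evaluated at the previous
    iterate, mimicking one-step gradient staleness. Returns all iterates.
    """
    history = [x0] * (delay + 1)  # pad so a stale iterate exists at step 0
    for _ in range(steps):
        stale_x = history[-1 - delay]
        history.append(history[-1] - lr * curvature * stale_x)
    return history

for lr in (0.5, 1.5):
    sync_final = abs(run(lr)[-1])
    stale_final = abs(run(lr, delay=1)[-1])
    print(f"lr={lr}: sync |x|={sync_final:.2e}, one-step stale |x|={stale_final:.2e}")
```

In this toy, the synchronous update is stable for lr * curvature below 2, while the one-step-stale update is stable only below 1. So a step size that is fine synchronously can diverge once the gradient lags. The paper's claim is that how much of this fragility survives in practice depends on the optimizer.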

So far this rests on the abstract alone. Stability at larger scale and longer runs still needs the full paper and independent replication.

Key takeaways:
- Gradient staleness in async pipelines isn't a hard constraint. Switching to the right optimizer (Muon) mostly cancels it.
- Teams running large pretraining and bottlenecked on pipeline-bubble throughput should evaluate going async.
- The numbers come from models at 10B or under. Larger-scale conclusions still need verification.


04 Tabular Foundation Models Are General Only in Their Comfort Zone

Tabular foundation models — large models that predict directly on tabular data — get hyped by academia and industry alike. But their evaluation tooling is fragmented. Model researchers work from a few standard benchmarks, and those benchmarks happen to be exactly what TFMs already handle well. The hardest cases get excluded by default.

BeyondArena pulls cross-discipline, cross-task evaluation into one framework. It covers non-IID settings like time series and grouped data, plus real data with text and high-cardinality features. Running 11 models across 142 datasets gives a deflating result: TFMs lead only on small-to-medium IID data. Once data grows large, high-dimensional, or non-IID, classic tree models and deep learning take the lead back.
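One reason grouped data counts as non-IID: if rows from the same group land in both train and test, a row-level shuffle quietly leaks information. The sketch below contrasts a row-level split with a group-level split on toy data. It is not BeyondArena's protocol, which isn't described here; `row_split`, `group_split`, and the "group" field are hypothetical names for illustration.

```python
import random

def row_split(rows, test_frac=0.25, seed=0):
    """IID-style split: shuffle individual rows, ignoring group membership."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def group_split(rows, test_frac=0.25, seed=0):
    """Grouped split: whole groups go to either train or test, never both."""
    rng = random.Random(seed)
    groups = sorted({r["group"] for r in rows})
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r["group"] not in test_groups]
    test = [r for r in rows if r["group"] in test_groups]
    return train, test

def shared_groups(train, test):
    """Groups that appear on both sides of a split."""
    return {r["group"] for r in train} & {r["group"] for r in test}

# Toy data: 8 hypothetical groups (say, patients) with 10 rows each.
rows = [{"group": g, "x": i} for g in range(8) for i in range(10)]

iid_train, iid_test = row_split(rows)
grp_train, grp_test = group_split(rows)
print("groups shared under row split:  ", len(shared_groups(iid_train, iid_test)))
print("groups shared under group split:", len(shared_groups(grp_train, grp_test)))
```

A model that looks strong under the row-level split may simply be recognizing groups it has already seen. That is the kind of easy setting a narrow benchmark can reward.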

"Foundation model" is more marketing than capability description here. The generalization boundary is much narrower than the hype implies. That doesn't make TFMs useless. It's a reminder: before you drop one into a real tabular workflow, check whether your data looks like its comfort zone.

Key takeaways:
- The TFM sweet spot is narrow: small-to-medium IID data. Big, high-dimensional, non-IID cases still belong to tree models.
- Fragmented evaluation manufactures false progress. Gains on standard benchmarks may hold only in the easiest settings.
- If you're considering a TFM, match your real data distribution against its limits. Don't let "foundation" box you in.


05 3D Gaussian Quality Is Fine — Spherical Harmonics Are the Problem

Image quality in 3D Gaussian Splatting for novel-view synthesis isn't the issue anymore. What actually weighs on mobile is high-order spherical harmonics (SH), which describe how surface lighting shifts with viewing angle. The inference and storage cost they add is the bottleneck.
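The arithmetic behind that bottleneck is short. Spherical harmonics up to degree L need (L + 1)^2 coefficients per color channel, so the count grows quadratically. The sketch below tallies that against the other per-Gaussian attributes. The non-color layout (position, scale, rotation quaternion, opacity) is an assumption based on the common 3DGS parameterization, not a figure from the Flux-GS abstract.

```python
def sh_floats(degree: int, channels: int = 3) -> int:
    """Floats needed for SH color up to `degree`: (degree + 1)^2 per channel."""
    return (degree + 1) ** 2 * channels

# Assumed non-color attributes per Gaussian: position (3), scale (3),
# rotation quaternion (4), opacity (1).
GEOMETRY_FLOATS = 3 + 3 + 4 + 1

for degree in range(4):
    sh = sh_floats(degree)
    share = sh / (sh + GEOMETRY_FLOATS)
    print(f"degree {degree}: {sh:>2} SH floats, {share:.0%} of the per-Gaussian payload")
```

Under these assumptions, degree-3 SH accounts for 48 of 59 floats per Gaussian, which is why trimming SH order is such a direct lever on storage.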

Flux-GS cuts that cost rather than chasing quality again. It uses Monte Carlo sampling to aggregate third-order SH specular-highlight energy into a compact low-order representation, sidestepping the distillation or pretraining this usually needs. Two practical modules round it out: one predicts offsets for low-order SH before inference to recover lost high-frequency detail, and one uses multi-view consistency to prune redundant Gaussians and prevent single-view overfitting.
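The abstract doesn't spell out how the aggregation works, so the following is only a sketch of the underlying idea: estimating the energy of an SH-encoded signal by sampling random view directions. It uses degree 0 and 1 only (the paper targets third-order SH), and `mc_energy`, `evaluate`, and the coefficient values are hypothetical. The check is Parseval's identity: for an orthonormal SH basis, the energy over the sphere equals the sum of squared coefficients.

```python
import math
import random

Y0 = 0.5 / math.sqrt(math.pi)            # real SH constant for degree 0
Y1 = math.sqrt(3.0 / (4.0 * math.pi))    # real SH constant for the degree-1 terms

def sample_direction(rng):
    """Uniform random unit vector via normalized Gaussian samples."""
    while True:
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 1e-12:
            return x / norm, y / norm, z / norm

def evaluate(coeffs, direction):
    """Evaluate a degree-0/1 real SH expansion in one direction."""
    x, y, z = direction
    c0, c1, c2, c3 = coeffs
    return c0 * Y0 + Y1 * (c1 * y + c2 * z + c3 * x)

def mc_energy(coeffs, samples=100_000, seed=0):
    """Monte Carlo estimate of the integral of f^2 over the unit sphere."""
    rng = random.Random(seed)
    total = sum(evaluate(coeffs, sample_direction(rng)) ** 2 for _ in range(samples))
    return 4.0 * math.pi * total / samples

coeffs = [1.0, 0.5, -0.3, 0.8]           # illustrative single-channel coefficients
exact = sum(c * c for c in coeffs)       # Parseval: sum of squared coefficients
estimate = mc_energy(coeffs)
print(f"exact={exact:.3f}  monte_carlo={estimate:.3f}")
```

The appeal of a sampling-based estimate is that it needs no training loop, which fits the summary's point that the approach sidesteps distillation and pretraining.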

The abstract offers qualitative claims — big parameter drop, quality roughly held — with no hard metrics. Exact compression ratios and frame rates need the project page and full paper. The positioning is clear: a concrete cost-cutting path for AR, on-device 3D, and real-time novel-view synthesis, not another quality bump.

Key takeaways:
- The mobile 3D rendering bottleneck is SH cost, not image quality. Cutting cost lands better than pushing quality further.
- Monte Carlo energy aggregation skips distillation and pretraining, so engineering integration cost stays relatively low.
- The abstract gives only qualitative claims. On-device teams should verify compression ratio and frame rate on the project page first.

Also Worth Noting

06 Credit Assignment for Every Tool Call [Agent]
Outcome rewards can't tell whether a tool call was useful, redundant, or misleading. TACO assigns credit and blame to single calls without an external judge.
07 Deliberately the Opposite of Frontier World Simulators [Video Gen]
A low-compute, real-time, keyboard-controllable world model for consumer GPUs. Being playable beats being big.
08 Self-Correction for Masked Discrete Diffusion [Image Gen]
Once a discrete token is unmasked it can't be changed, and that's the key weakness for high-resolution text-to-image.
09 Video Models Learn Occlusion and Hand-Object Interaction Just to Predict the Next Frame [Video Gen]
Which is exactly the prior that 4D hand-motion reconstruction lacks.
10 Molecule Generation Benchmarks Have Long Been Hostage to Drug-Like Proxies [AI for Science]
NMO moves the target to real scientific settings like quantum materials, exposing how poorly current models transfer.
11 Benign Fine-Tuning on Harmless Data Quietly Reverts Prior Alignment and Unlearning [Safety]
This paper offers a unified explanation for that post-training fragility.
12 Everyone Hopes LLM Rerankers Fix Cold-Start [Retrieval]
But a five-domain benchmark that separates rerank quality from recall coverage finds otherwise: if recall misses the item, semantic understanding is wasted.
13 Fixed Top-N Retrieval Per Query Is Both Wasteful and Harmful [Retrieval]
This paper turns "how many to retrieve" into a per-query budget, pulling fewer docs, or none, when the reader already knows the answer.
14 Merging Multiple Skills Into One Model Always Robs Peter to Pay Paul [Training]
MOPD uses multi-teacher on-policy distillation for post-training skill integration, dodging the inefficiency and quality loss of Off-Policy Finetune and Mix-RL.

Today's Observation

Two papers both wearing the "world model" label set out from opposite ends today. Orca took 194 upvotes chasing maximal ambition: one unified world latent space folding understanding, prediction, and action into Next-State-Prediction — a research-roadmap vision. DreamForge-World declares it's taking the complementary axis: low-compute, consumer GPU, real-time and interactive, deliberately made small. Same label, one making a scale-up manifesto, the other scaling down for usability.

This isn't about who's right. The two paths differ sharply in maturity right now. The takeaway is concrete: the world models you can actually play with today cluster at the low-compute end, while the grand-unified path is still at preview. If you want an interactive world model in a product, don't wait for an Orca-style manifesto to land. Pull down something like DreamForge-World that runs on a consumer GPU, play with it, learn where "usable" ends, then decide whether "ambition" is worth waiting for.