A 1.5B Model Beats Sonnet 3.5 at Event Forecasting

Today's Overview

Agent Memory Forgets the Wrong Things, and the Root Cause Is Architecture, Not Prompts. MIT swept 13 configurations and found that where the LLM sits in the memory pipeline, near recall or near rewrite, locks in which class of forgetting errors the system makes. Only a mutation-side hook lifts overall accuracy to 91.7–93.2%.
A Model's "Honesty" Is Inherited, Decided the Moment You Pick the Base. The truthful heads that keep a model grounded in context pass untouched to downstream variants of the same base, even through multimodal conversion. A model's anti-hallucination floor is set at selection time.
Big Models Read Images Well but Are Nearly Blind to Pixel-Level Tells. MLLM semantic understanding concentrates in the early-to-middle layers, and fine-tuning directly on forensic signals damages those representations. Forgery detection needs a dedicated path to "see noise."
A 1.5B Model Forecasts Real-World Events and Beats Claude Sonnet 3.5. Oxford turned event forecasting into an RL-trainable objective with tool calls plus GRPO, and the small model edged out the large one on cross-entropy over the same dataset.

Featured

01 Agent Memory Forgets the Wrong Things by Design

Where the LLM sits in an agent's memory pipeline decides which class of forgetting errors the system makes. This MIT work shifts attention from "is recall accurate" — already benchmarked to death — to the side almost nobody tests: the control plane that rewrites, releases, and clears memory. Across 13 configurations and 385 adversarial cases, three placements each show a blind spot.

Deterministic rules handle literal and time-based forgetting, but collapse when the same entity reappears in disguise (5% on identifier obfuscation, 0% cross-language). Putting the LLM on the write side solves that normalization problem completely — same fact, new phrasing — yet fails entirely on intent-driven deletion (0% on prefix conflicts and compound facts). Only a mutation-time hook, placed at the moment a rewrite happens, recovers intentional deletion (78–85%) and lifts almost every category at once (91.7–93.2% overall).
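The difference between the placements is easier to see in code. The sketch below is a minimal illustration, not the paper's implementation: `MemoryStore`, `forget`, `mutation_hook`, and `toy_hook` are hypothetical names, and the digit-matching hook is only a stand-in for the LLM call a real system would make at mutation time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# All names here are hypothetical; this is not the paper's API.
Hook = Callable[[str, Dict[str, str]], List[str]]


@dataclass
class MemoryStore:
    facts: Dict[str, str] = field(default_factory=dict)
    mutation_hook: Optional[Hook] = None

    def write(self, key: str, value: str) -> None:
        self.facts[key] = value

    def forget(self, request: str) -> List[str]:
        # Deterministic rule: only a literal key match is deleted.
        doomed = [k for k in self.facts if k == request]
        # Mutation-time hook: consulted at the moment memory is changed,
        # so it can catch the same entity under a different surface form.
        if self.mutation_hook is not None:
            for k in self.mutation_hook(request, self.facts):
                if k not in doomed:
                    doomed.append(k)
        for k in doomed:
            self.facts.pop(k, None)
        return doomed


def toy_hook(request: str, facts: Dict[str, str]) -> List[str]:
    """Stand-in for an LLM: match keys that contain the request's digits."""
    digits = "".join(ch for ch in request if ch.isdigit())
    return [k for k in facts if digits and digits in k]


rules_only = MemoryStore()
rules_only.write("account-4471", "prefers email")
missed = rules_only.forget("acct #4471")   # obfuscated identifier

hooked = MemoryStore(mutation_hook=toy_hook)
hooked.write("account-4471", "prefers email")
caught = hooked.forget("acct #4471")
```

In this toy run the rules-only store leaves the fact in place, while the hooked store removes it. The recall path never calls the hook, so any added latency lands only on mutations.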

The cost is roughly 2.3 seconds of rewrite latency per case, versus 64–191 milliseconds for the deterministic approach, but the recall path is untouched and a full run over 385 cases costs about $0.17. The authors flag a fact that's easy to miss: most production failures are forgetting failures, not recall failures, and current benchmarks only measure the latter. That gap is what their open-source ForgetEval (MIT license) aims to fill.
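For a sense of scale, the quoted figures can be turned into per-case numbers. This is back-of-the-envelope arithmetic on the values above only; the variable names are illustrative.

```python
cases = 385
run_cost_usd = 0.17               # quoted cost of a full run
hook_latency_s = 2.3              # quoted rewrite latency per case
deterministic_s = (0.064, 0.191)  # quoted range, 64-191 ms

cost_per_case = run_cost_usd / cases
slowdown_high = hook_latency_s / deterministic_s[0]
slowdown_low = hook_latency_s / deterministic_s[1]

print(f"cost per case: ${cost_per_case:.5f}")
print(f"rewrite slowdown: {slowdown_low:.0f}x to {slowdown_high:.0f}x")
```

That works out to roughly $0.0004 per case, with a rewrite path about 12 to 36 times slower than the deterministic one.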

Key takeaways:
- Forgetting isn't a prompt bug. It's a product of where you place the LLM, so the wrong position pre-selects a failure mode.
- Covering both normalization and intent-driven deletion means putting rewrite logic on the mutation side, not the write side, and accepting second-scale latency for accuracy.
- If you build agent memory, most evals only test recall. The forgetting blind spot is worth filling with a tool like ForgetEval.

Source: Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

02 Interpretability A Model's Honesty Is Inherited from Its Base

The usual assumption is that whether a model stays faithful to context, instead of making things up, gets tuned in through later fine-tuning and alignment. This paper finds the opposite. The attention heads responsible for staying grounded — the truthful heads — pass untouched from a base model to the downstream variants fine-tuned from it, even through instruction tuning and conversion to multimodal.

The authors quantified a "head honesty score" across four families — Vicuna, Qwen2.5, LLaMA2, Mistral — and found it highly stable within each family, because those heads' weights barely move during fine-tuning. They also built TruthProbe, a soft gate that amplifies the contribution of these honest heads. It improves contextual faithfulness on HaluEval and cuts multimodal hallucination on POPE and CHAIR.
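To make "soft gate" concrete, here is one way such a gate could look. The article does not give TruthProbe's actual form, so everything below is an assumption: the sigmoid shape, the `strength` and `temperature` parameters, the function names, and the made-up scores. The only idea taken from the text is that heads with higher honesty scores contribute more, with no weights retrained.

```python
import math
from typing import List

# Hypothetical sketch; not TruthProbe's actual formulation.


def soft_gate(honesty_scores: List[float], strength: float = 1.0,
              temperature: float = 0.1) -> List[float]:
    """Per-head multipliers in (1, 1 + strength); higher scores get more boost."""
    mean = sum(honesty_scores) / len(honesty_scores)
    return [1.0 + strength / (1.0 + math.exp(-(s - mean) / temperature))
            for s in honesty_scores]


def combine_heads(head_outputs: List[List[float]], gates: List[float]) -> List[float]:
    """Scale each head's output vector by its gate, then sum across heads."""
    dim = len(head_outputs[0])
    return [sum(g * h[i] for g, h in zip(gates, head_outputs)) for i in range(dim)]


scores = [0.9, 0.1, 0.5]                        # made-up head honesty scores
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up per-head outputs
gates = soft_gate(scores)
mixed = combine_heads(outputs, gates)
```

Because the gate is a smooth function of the scores rather than a hard mask, low-scoring heads are left at close to their original weight instead of being switched off.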

If the finding holds, the practical reading is blunt: a downstream model's resistance to hallucination is largely fixed the moment you pick its base, not something later fine-tuning rescues.

Key takeaways:
- Anti-hallucination capacity is mostly set by the base. Base selection matters more for downstream faithfulness than alignment tuning.
- Multimodal variants of a base inherit its truthful heads, so you can predict their hallucination behavior from the lineage.
- TruthProbe-style soft gating that amplifies honest heads offers a low-cost way to cut hallucination without retraining.

Source: The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

03 Multimodal Big Models Read Images, but Miss the Pixels

As AI-generated images get more convincing, detecting fakes by spotting semantic mistakes is failing, so the obvious move is to point a multimodal LLM at the problem. This work does something more useful first: a layer-by-layer teardown shows that MLLM semantic understanding forms in the early-to-middle layers. Fine-tuning those layers on forensic signals (noise, spectral, and full-band artifacts) damages the semantic representations instead. An MLLM is built for semantics and nearly blind to low-level evidence.

The fix, Deep-VRM, doesn't force changes on the front layers. It injects forensic signals as a residual path into the middle layers, fuses them with semantic tokens, then passes both forward so later layers do semantic reasoning and signal-level judgment together. The model adapts which level of evidence to trust based on the input, reaching SOTA on most benchmarks.
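The residual-injection pattern can be sketched in a few lines. This is a toy under stated assumptions, not Deep-VRM itself: layers are plain functions over a list of floats, fusion is a simple weighted addition, and `forward_with_injection`, `inject_at`, and `alpha` are hypothetical names.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def forward_with_injection(tokens: List[float], layers: List[Layer],
                           forensic: List[float], inject_at: int,
                           alpha: float = 0.5) -> List[float]:
    """Run the layer stack, adding forensic features once at a middle layer."""
    hidden = list(tokens)
    for i, layer in enumerate(layers):
        if i == inject_at:
            # Residual path: the semantic stream is kept and the forensic
            # signal is added on top (additive fusion is an assumption here).
            hidden = [h + alpha * f for h, f in zip(hidden, forensic)]
        hidden = layer(hidden)
    return hidden


def double(h: List[float]) -> List[float]:
    """Toy layer: scale every activation by 2."""
    return [2.0 * x for x in h]


layers = [double, double, double]
semantic_only = forward_with_injection([1.0, 1.0], layers, [1.0, 0.0],
                                       inject_at=1, alpha=0.0)
with_forensic = forward_with_injection([1.0, 1.0], layers, [1.0, 0.0],
                                       inject_at=1, alpha=0.5)
```

The layers before `inject_at` see only the semantic tokens, so their behavior is unchanged. Only the later layers receive the fused signal.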

The real signal for practitioners isn't the score. It's the boundary: a general multimodal model doesn't come with pixel-level forensic ability, and you have to build it a path to see noise.

Key takeaways:
- MLLM semantics concentrate in the early-to-middle layers, and fine-tuning them on low-level forensic signals breaks those representations. That's why forgery detection can't just reuse a general model.
- Residual injection is a reusable pattern: leave semantics alone, open a separate path to feed in low-level signals.
- Teams in detection or forensics should assume MLLMs are blind to pixel-level evidence by default, and design for it rather than expect it out of the box.

Source: Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models

04 Reasoning A 1.5B Model Beats Sonnet 3.5 at Forecasting

Event forecasting has been treated as a job only big models can touch — it's open-ended, depends on freshly looked-up information, and the answer reveals itself only after the training cutoff. Oxford's core contribution is reshaping it into an RL-trainable objective. Give models from 1.5B to 14B tools that query Wikipedia revision history or news summaries, then fine-tune with GRPO, a memory-efficient RL method, so the model learns to assign event probabilities from live information.
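GRPO's memory saving comes from dropping the learned value model: it samples a group of answers for the same prompt and scores each one relative to the rest of the group. A minimal sketch of that group-relative advantage follows. The function name, the epsilon, and the use of the population standard deviation are choices made here, and the rewards are made up for illustration.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for three sampled forecasts of the same question.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Forecasts that score above the group average get a positive advantage and are reinforced. Those below it are pushed down, with no critic network needed to supply the baseline.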

Training pushes Qwen2.5's 1.5B version past Claude Sonnet 3.5 on forecasting, measured as cross-entropy against market-consensus probabilities. Read that win narrowly: same dataset, same metric. The authors are explicit about the ceiling that forecasting's inherent randomness (aleatoric uncertainty, the dice-roll kind) puts on any model's ability, and the paper, unusually, documents the dead ends along the way.
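For readers unfamiliar with the metric, here is what cross-entropy against a market probability looks like for a single yes/no event. This assumes the standard binary form; the paper's exact scoring may differ, and the function name and example probabilities are illustrative.

```python
import math


def cross_entropy(market_prob: float, model_prob: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy of the model's forecast against the market's."""
    p = min(max(model_prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(market_prob * math.log(p) + (1.0 - market_prob) * math.log(1.0 - p))


matched = cross_entropy(0.7, 0.7)   # model agrees with the market
hedged = cross_entropy(0.7, 0.5)    # model sits on the fence
wrong = cross_entropy(0.7, 0.1)     # model strongly disagrees
```

Lower is better. The score is smallest when the model matches the market, and even then it equals the entropy of the market probability rather than zero, so a perfect match still carries a nonzero floor.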

Key takeaways:
- Tool calls plus GRPO can pull open-ended, lookup-dependent tasks into trainable range, no large model required.
- The small model's win over Sonnet 3.5 is limited to cross-entropy on one dataset. Don't extrapolate it to across-the-board dominance.
- The paper honestly records its failed paths, which is more useful than the headline result for anyone reproducing the approach.

Source: Reinforcement Learning for LLM-based Event Forecasting


Also Worth Noting

Interpretability Putting the Data Manifold Under a Microscope to Test Whether Generalization Theory's Geometry Holds Up. Assumptions like intrinsic dimension and curvature, often taken for granted, may not survive contact with real data. Worth a look for anyone reconciling theory with practice.

AI for Science Flow Matching Treats Signals as Points in Euclidean Space, Missing the Topology of Data Like fMRI Brain Maps. This adds a topological dimension to the generative framework, relevant if you work on structured signal generation.

Today's Observation

Read the control-plane forgetting paper (2606.15903) alongside the inherited truthful heads paper (2606.15821), and a quiet consensus surfaces: a model's key behaviors are increasingly attributed to its structural position and origin, not to what you trained it on. The first says forgetting failure modes are decided by where the LLM sits in the memory pipeline. The second says contextual faithfulness is decided by which base it inherited from. Both hand the explanation for behavior back from later, tunable training to architecture and lineage that are locked in early.

This isn't the familiar story of the bottleneck moving from model to environment. It's something else: the explanation doesn't shift sideways, it traces upstream. The engineering reminder is concrete. Some capabilities and defects are fixed at selection time, so don't count on fine-tuning to fix them. Before you pick a base or settle on a memory architecture, put behaviors like anti-hallucination and wrong-forgetting on the selection checklist and test them then — not at the tuning stage, where you'll find they can't be recovered.