alphaXiv


4,126

20 Nov 2025

computer-science artificial-intelligence machine-learning

Evolution Strategies at the Hyperscale

Bidipta Sarkar

Researchers from the University of Oxford, MILA, and NVIDIA introduce EGGROLL, an Evolution Strategies algorithm that scales black-box optimization to billion-parameter neural networks by employing low-rank parameter perturbations. The method achieves a hundredfold increase in training throughput and enables stable pre-training of pure-integer recurrent language models, demonstrating competitive or superior performance on reinforcement learning and large language model fine-tuning tasks.
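To make the idea of low-rank parameter perturbations concrete, here is a minimal pure-Python sketch. It assumes the perturbation of an m x n weight matrix takes the form E = A B^T / sqrt(r), with Gaussian factors A (m x r) and B (n x r). That form, the function names, and the toy sizes are illustrative assumptions, not the authors' implementation. The point it shows is that a perturbed matrix-vector product can be computed from the factors alone, so the full perturbation never has to be stored.

```python
import random

def low_rank_factors(m, n, r, rng):
    """Sample factors A (m x r) and B (n x r) with i.i.d. standard normal entries."""
    A = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(m)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    return A, B

def materialize(A, B):
    """Form E = A @ B^T / sqrt(r) explicitly (for inspection on small sizes only)."""
    r = len(A[0])
    scale = r ** -0.5
    return [[scale * sum(a[k] * b[k] for k in range(r)) for b in B] for a in A]

def perturbed_matvec(W, A, B, x, sigma):
    """Compute (W + sigma * E) @ x as W x + (sigma / sqrt(r)) * A (B^T x), without forming E."""
    r = len(A[0])
    Btx = [sum(B[j][k] * x[j] for j in range(len(x))) for k in range(r)]
    scale = sigma * r ** -0.5
    return [
        sum(W[i][j] * x[j] for j in range(len(x)))
        + scale * sum(A[i][k] * Btx[k] for k in range(r))
        for i in range(len(W))
    ]

# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)
m, n, r, sigma = 8, 8, 2, 0.1
W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
A, B = low_rank_factors(m, n, r, rng)

y_fast = perturbed_matvec(W, A, B, x, sigma)

# Numbers stored per perturbation: full matrix versus the two factors.
full_cost = m * n            # 64
low_rank_cost = r * (m + n)  # 32
```

The storage saving grows with layer size: the factors need r(m + n) numbers instead of mn, which is the kind of reduction that makes large populations of perturbed models affordable.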

359

24 Nov 2025

computer-science robotics

RynnVLA-002: A Unified Vision-Language-Action and World Model

Researchers from DAMO Academy, Alibaba Group, developed RynnVLA-002, a unified vision-language-action (VLA) and world model that integrates environmental dynamics learning with action planning. It achieved high success rates on simulation and real-world robot tasks, often outperforming extensively pretrained baselines without requiring large-scale pretraining itself.

570

231

23 Nov 2025

agentic-frameworks agents computer-science

General Agentic Memory Via Deep Research

Researchers from BAAI, Peking University, Renmin University of China, and Hong Kong Polytechnic University developed General Agentic Memory (GAM), a framework for AI agents that employs a Just-in-Time Compilation principle through a dual-agent (Memorizer and Researcher) architecture. This approach, designed for dynamic context creation, consistently surpassed existing memory systems and achieved over 90% accuracy on complex multi-step reasoning benchmarks like HotpotQA.

231

24 Nov 2025

computer-science computer-vision-and-pattern-recognition

Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

Fudan University Yinwang Intelligent Technology Co. Ltd.

Percept-WAM, developed by researchers from Yinwang Intelligent Technology Co. Ltd. and Fudan University, integrates explicit 2D and 3D scene understanding into a single Vision-Language Model backbone for robust end-to-end autonomous driving. The model achieves strong performance on perception tasks, with 51.7 mAP on COCO 2D detection and 0.589 mAP on nuScenes BEV 3D detection, and enhances trajectory planning, yielding 0.36m L2 error on nuScenes and 90.2 PDMS on NAVSIM with efficient streaming inference.

1,042

20 Nov 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

SAM 3D: 3Dfy Anything in Images

Meta Superintelligence Labs

Meta Superintelligence Labs introduces SAM 3D, a generative model that reconstructs 3D geometry, texture, and layout of objects from a single natural image, achieving a 5:1 human preference win rate over state-of-the-art methods on real images. The approach significantly outperforms prior models on the SA-3DAO benchmark, raising F1@0.01 to 0.2344 from the 0.14-0.16 achieved by previous methods.

2,217

953

20 Nov 2025

agentic-frameworks agents computer-science

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Stanford University Salesforce Research UNC-Chapel Hill

Agent0, developed by researchers at UNC-Chapel Hill, Salesforce Research, and Stanford University, introduces a fully autonomous framework for evolving high-performing LLM agents from scratch, without human-curated data. The system achieves substantial improvements, such as an 18% increase in mathematical reasoning and a 24% boost in general reasoning for Qwen3-8B-Base models, through a co-evolutionary loop and tool-integrated learning.

326

20 Nov 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

SAM 3: Segment Anything with Concepts

Meta Superintelligence Labs

Meta Superintelligence Labs developed SAM 3, a unified model that detects, segments, and tracks all instances of a concept in images and videos using natural language or image exemplar prompts. The system achieves a 54.1 cgF1 on the new SA-Co/Gold benchmark, doubling previous baselines, and sets new state-of-the-art results across various open-vocabulary segmentation and tracking tasks.

3,503

24 Nov 2025

chain-of-thought computer-science artificial-intelligence

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

UCLA

UC Berkeley Panasonic AI Research

The Chain-of-Visual-Thought (COVT) framework enables Vision-Language Models (VLMs) to reason using continuous visual tokens, rather than discrete text, for enhanced perceptual understanding. This approach yields an overall 5.5% gain on CV-Bench, including a 14.0% improvement on its depth sub-task, while maintaining performance on general VLM benchmarks.

358

20 Nov 2025

computer-science artificial-intelligence computation-and-language

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

OpenMMReasoner introduces a transparent, two-stage recipe for building Large Multimodal Reasoning Models (LMRMs) from open-source LMMs, combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages with meticulously curated data and algorithmic designs. The approach achieves state-of-the-art performance across nine diverse multimodal reasoning benchmarks, showing an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline.

7,188

17 Nov 2025

computer-science computer-vision-and-pattern-recognition generative-models

Back to Basics: Let Denoising Generative Models Denoise

MIT

Researchers at MIT demonstrated that directly predicting the clean image (x-prediction) in diffusion models is fundamentally more effective for high-dimensional pixel data with Vision Transformers than conventional noise or velocity prediction. Their "Just image Transformers" (JiT) architecture achieved competitive pixel-space image generation on ImageNet, reaching 1.82 FID at 256x256 and 1.78 FID at 512x512 without external components.
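The three prediction targets named above are interchangeable once the noising rule is fixed, which a short sketch can make explicit. It assumes a linear interpolation schedule z_t = t * x + (1 - t) * eps with velocity v = x - eps. That convention and the function names are assumptions for illustration, not taken from this summary. Under it, a clean-image prediction can be converted to a noise or velocity prediction with simple algebra.

```python
def noisy_sample(x, eps, t):
    """Assumed schedule: z_t = t * x + (1 - t) * eps, with 0 <= t < 1."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]

def eps_from_x(z_t, x_pred, t):
    """Noise implied by a clean-image prediction: eps = (z_t - t * x) / (1 - t)."""
    return [(zi - t * xi) / (1.0 - t) for zi, xi in zip(z_t, x_pred)]

def v_from_x(z_t, x_pred, t):
    """Velocity implied by a clean-image prediction: v = (x - z_t) / (1 - t)."""
    return [(xi - zi) / (1.0 - t) for zi, xi in zip(z_t, x_pred)]

# Toy two-pixel example with hypothetical values.
x = [0.5, -1.0]
eps = [2.0, 0.25]
t = 0.25

z_t = noisy_sample(x, eps, t)
eps_rec = eps_from_x(z_t, x, t)  # recovers eps when the x-prediction is exact
v_rec = v_from_x(z_t, x, t)      # equals x - eps when the x-prediction is exact
```

The conversions show that the choice of target changes what the network must output, not what information the sampler ultimately uses.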

793

20 Nov 2025

agents chain-of-thought computer-science

Early science acceleration experiments with GPT-5

University of Cambridge

Harvard University Vanderbilt University

University of Oxford

OpenAI

Columbia University Collège de France Lawrence Livermore National Laboratory The Jackson Laboratory

OpenAI researchers and collaborators evaluate GPT-5's utility in accelerating scientific research across diverse fields, showing that it can contribute to rediscovering known results, literature search, collaborative problem-solving, and the generation of novel scientific findings. The model is reported to have compressed research timelines from months to hours and to have provided verifiable new insights in mathematics, physics, and biology.

124

24 Nov 2025

computer-science robotics

Agility Meets Stability: Versatile Humanoid Control with Heterogeneous Data

Tsinghua University

NVIDIA

The University of Hong Kong

Humanoid robots are envisioned to perform a wide range of tasks in human-centered environments, requiring controllers that combine agility with robust balance. Recent advances in locomotion and whole-body tracking have enabled impressive progress in either agile dynamic skills or stability-critical behaviors, but existing methods remain specialized, focusing on one capability while compromising the other. In this work, we introduce AMS (Agility Meets Stability), the first framework that unifies both dynamic motion tracking and extreme balance maintenance in a single policy. Our key insight is to leverage heterogeneous data sources: human motion capture datasets that provide rich, agile behaviors, and physically constrained synthetic balance motions that capture stability configurations. To reconcile the divergent optimization goals of agility and stability, we design a hybrid reward scheme that applies general tracking objectives across all data while injecting balance-specific priors only into synthetic motions. Further, an adaptive learning strategy with performance-driven sampling and motion-specific reward shaping enables efficient training across diverse motion distributions. We validate AMS extensively in simulation and on a real Unitree G1 humanoid. Experiments demonstrate that a single policy can execute agile skills such as dancing and running, while also performing zero-shot extreme balance motions like Ip Man's Squat, highlighting AMS as a versatile control paradigm for future humanoid applications.

22 Nov 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

CUHK

Sun Yat-Sen University

Zhejiang University

Peking University

While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, it not only improves on HVCU-Bench but also brings benefits on general benchmarks (average +2.53%), especially with substantial gains on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities. The project page is at this https URL .

24 Nov 2025

attention-mechanisms computer-science computer-vision-and-pattern-recognition

HunyuanVideo 1.5 Technical Report

Tencent

We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at this https URL.

610

612

20 Nov 2025

agents autonomous-vehicles computer-science

MiMo-Embodied: X-Embodied Foundation Model Technical Report

Xiaomi

Xiaomi's MiMo-Embodied introduces the first open-source cross-embodied foundation model, unifying autonomous driving and embodied AI tasks. It achieves state-of-the-art performance across 29 benchmarks in both domains, demonstrating positive knowledge transfer between diverse physical environments.

115

21 Nov 2025

agentic-frameworks agents computer-science

OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists

Tsinghua University Zhongguancun Academy BNRist

The OmniScientist framework integrates AI agents within a simulated human scientific ecosystem by encoding collaborative and infrastructural norms, allowing AI to participate as "genuine scientists." The approach achieved superior literature review, a dramatic reduction in solution error for experiment automation, and enhanced human-AI collaboration in complex problem-solving.

24 Nov 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Mixture of Horizons in Action Chunking

Researchers from RUC, UNC, and CUHK introduced Mixture of Horizons (MoH), a strategy that combines predictions from multiple action chunk lengths in Vision-Language-Action (VLA) models. This approach addresses the trade-off between short-term precision and long-term foresight, achieving a 99% success rate on the LIBERO benchmark and improving real-world robotic manipulation.

24 Nov 2025

agent-based-systems computer-science artificial-intelligence

AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

The AUTOENV framework, developed by researchers from HKUST (Guangzhou) and DeepWisdom, automates the generation of diverse and validated environments for measuring agent learning across heterogeneous rule distributions. This system, demonstrated with the AUTOENV-36 dataset, identifies that no single fixed learning strategy generalizes effectively across varied environments, highlighting the need for adaptive learning.

24 Nov 2025

computer-science computer-vision-and-pattern-recognition generative-models

Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction

SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: this https URL.

24 Nov 2025

mathematics optimization-and-control

An Axiomatic Analysis of Distributionally Robust Optimization with $q$-Norm Ambiguity Sets for Probability Smoothing

Researchers from Kyushu University developed a $q$-norm distributionally robust optimization (DRO) framework for probability smoothing, addressing the zero-frequency problem by providing a flexible and axiomatically sound alternative to Laplace smoothing. This method is shown to be equivalent to regularized empirical cross-entropy loss minimization and is proven to satisfy key axiomatic properties such as Positivity, Symmetry, and, for $q \in (1, \infty)$, Order Preservation.
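For readers unfamiliar with the baseline, the zero-frequency problem and Laplace (additive) smoothing can be sketched in a few lines. This illustrates only the classical Laplace estimator that the paper positions itself against, not the q-norm DRO method. The vocabulary, counts, and function names are hypothetical.

```python
from collections import Counter

def mle(counts, vocab):
    """Maximum-likelihood estimate: unseen symbols get probability zero."""
    total = sum(counts[w] for w in vocab)
    return {w: counts[w] / total for w in vocab}

def laplace(counts, vocab, alpha=1.0):
    """Additive smoothing: add alpha pseudo-counts to every symbol in the vocabulary."""
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

vocab = ["a", "b", "c"]
counts = Counter(["a", "a", "a", "b"])  # "c" is never observed

p_mle = mle(counts, vocab)      # "c" gets probability 0: the zero-frequency problem
p_lap = laplace(counts, vocab)  # a: 4/7, b: 2/7, c: 1/7
```

In this example the smoothed estimate is strictly positive for every symbol and keeps the ordering of the observed counts, which matches the flavor of the Positivity and Order Preservation properties mentioned above.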
