forked from capjamesg/hugging-face-papers-rss
hf_papers.json
{"version": "https://jsonfeed.org/version/1", "title": "Hugging Face Papers", "home_page_url": "https://huggingface.co/", "feed_url": "https://raw.githubusercontent.com/MichaelMarkert/rss/refs/heads/main/hf_papers.json", "items": [{"id": "https://huggingface.co/papers/2605.00658", "image": "", "title": "UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors", "content_text": "Abstract UniVidX is a unified multimodal framework that uses video diffusion model priors for versatile video generation through stochastic condition masking, decoupled gated LoRA, and cross-modal self-attention mechanisms. AI-generated summary Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/", "url": "https://huggingface.co/papers/2605.00658", "date_published": "2026-05-04T02:12:15"}, {"id": "https://huggingface.co/papers/2604.27221", "image": "", "title": "Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction", "content_text": "Abstract Web2BigTable is a multi-agent framework that addresses both broad and deep web search challenges through a bi-level architecture with coordinated agents and iterative improvement mechanisms. AI-generated summary Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi-agent framework for web-to-table search that supports both regimes. 
Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. Through a closed-loop run--verify--reflect process, the framework jointly improves decomposition and execution over time via persistent, human-readable external memory, with self-evolving updates to each single agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5× the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth-oriented search on XBench-DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.", "url": "https://huggingface.co/papers/2604.27221", "date_published": "2026-05-04T09:57:29"}, {"id": "https://huggingface.co/papers/2605.00781", "image": "", "title": "Map2World: Segment Map Conditioned Text to 3D World Generation", "content_text": "Abstract Map2World enables 3D world generation from user-defined segment maps with improved scale consistency and detail enhancement through a pipeline leveraging asset generator priors. AI-generated summary 3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale throughout the entire world. In this work, we introduce a novel framework, Map2World, that first enables 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments. To further enhance the quality, we propose a detail enhancer network that generates fine details of the world. The detail enhancer enables the addition of fine-grained details without compromising overall scene coherence by incorporating global structure information. We design the entire pipeline to leverage strong priors from asset generators, achieving robust generalization across diverse domains, even under limited training data for scene generation. Extensive experiments demonstrate that our method significantly outperforms existing approaches in user-controllability, scale consistency, and content coherence, enabling users to generate 3D worlds under more complex conditions.", "url": "https://huggingface.co/papers/2605.00781", "date_published": "2026-05-04T08:30:04"}, {"id": "https://huggingface.co/papers/2604.23774", "image": "", "title": "Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions", "content_text": "Abstract A training-free framework for fine-grained 3D editing that uses geometric primitives and vision-language models to preserve identity while enabling localized structural changes. AI-generated summary Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. 
While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision-language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.", "url": "https://huggingface.co/papers/2604.23774", "date_published": "2026-05-04T14:02:14"}, {"id": "https://huggingface.co/papers/2605.00416", "image": "", "title": "Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies", "content_text": "Abstract Learning While Deploying framework enables continuous improvement of Vision-Language-Action policies through fleet-scale offline-to-online reinforcement learning with distributed robot experience and human interventions. AI-generated summary Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3--5 minute long-horizon tasks. 
A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.", "url": "https://huggingface.co/papers/2605.00416", "date_published": "2026-05-04T03:24:53"}, {"id": "https://huggingface.co/papers/2605.00809", "image": "", "title": "Let ViT Speak: Generative Language-Image Pre-training", "content_text": "", "url": "https://huggingface.co/papers/2605.00809", "date_published": "2026-05-04T18:18:28.371101"}, {"id": "https://huggingface.co/papers/2605.00553", "image": "", "title": "Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance", "content_text": "Abstract Stable-GFN addresses training instability and mode collapse in generative flow networks for large language model red-teaming through partition function elimination and robust masking techniques. AI-generated summary Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding effective and diverse attacks in red-teaming is important, but achieving both is challenging. Generative Flow Networks (GFNs) that perform distribution matching are promising methods, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates partition function Z estimation in GFN and reduces training instability. S-GFN avoids Z-estimation through pairwise comparisons and employs a robust masking methodology against noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while maintaining the optimal policy of GFN. We demonstrate the overwhelming attack performance and diversity of S-GFN across various settings.", "url": "https://huggingface.co/papers/2605.00553", "date_published": "2026-05-04T15:19:56"}, {"id": "https://huggingface.co/papers/2604.24026", "image": "", "title": "From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills", "content_text": "Abstract Structured representation of agent skills disentangles scheduling, execution, and logic components, improving performance in skill discovery and risk assessment tasks. AI-generated summary LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structured records whose machine-usable evidence remains embedded largely in natural-language descriptions. This poses a challenge for skill-centered agent systems: managing skill collections and using skills to support agents both require reasoning over invocation interfaces, execution structure, and concrete side effects that are often entangled in a single textual surface. An explicit representation of skill knowledge may therefore help make these artifacts easier for machines to acquire and leverage. 
Drawing on Memory Organization Packets, Script Theory, and Conceptual Dependency from Schank and Abelson's classical work on linguistic knowledge representation, we introduce what is, to our knowledge, the first structured representation for agent skill artifacts that disentangles skill-level scheduling signals, scene-level execution structure, and logic-level action and resource-use evidence: the Scheduling-Structural-Logical (SSL) representation. We instantiate SSL with an LLM-based normalizer and evaluate it on a corpus of skills in two tasks, Skill Discovery and Risk Assessment, where it substantially outperforms the text-only baselines: in Skill Discovery, SSL improves MRR from 0.573 to 0.707; in Risk Assessment, it improves macro F1 from 0.744 to 0.787. These findings reveal that explicit, source-grounded structure makes agent skills easier to search and review. They also suggest that SSL is best understood as a practical step toward more inspectable, reusable, and operationally actionable skill representations for agent systems, rather than as a finished standard or an end-to-end mechanism for managing and using skills.", "url": "https://huggingface.co/papers/2604.24026", "date_published": "2026-05-04T03:48:23"}, {"id": "https://huggingface.co/papers/2605.00273", "image": "", "title": "When Do Diffusion Models learn to Generate Multiple Objects?", "content_text": "Abstract Diffusion models struggle with multi-object generation due to scene complexity rather than concept imbalance, with counting being particularly challenging in low-data regimes. AI-generated summary Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.", "url": "https://huggingface.co/papers/2605.00273", "date_published": "2026-05-04T13:40:56"}, {"id": "https://huggingface.co/papers/2605.00414", "image": "", "title": "Trees to Flows and Back: Unifying Decision Trees and Diffusion Models", "content_text": "Abstract Decision trees and diffusion models are mathematically unified through a shared optimization principle called Global Trajectory Score Matching, enabling efficient generative models and neural network distillation methods. AI-generated summary Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. 
This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: TreeFlow, which achieves competitive generation quality on tabular data with higher fidelity and a 2× computational speedup, and DSMTree, a novel distillation method that transfers hierarchical decision logic into neural networks, matching teacher performance within 2% on many benchmarks.", "url": "https://huggingface.co/papers/2605.00414", "date_published": "2026-05-04T10:53:37"}, {"id": "https://huggingface.co/papers/2605.00503", "image": "", "title": "End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer", "content_text": "Abstract End-to-end training of autoregressive image models with joint reconstruction and generation optimization achieves state-of-the-art results on ImageNet 256x256 generation. AI-generated summary Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256x256 generation.", "url": "https://huggingface.co/papers/2605.00503", "date_published": "2026-05-04T13:15:29"}, {"id": "https://huggingface.co/papers/2605.00323", "image": "", "title": "Online Self-Calibration Against Hallucination in Vision-Language Models", "content_text": "Abstract Online self-calibration framework using Monte Carlo tree search and dual-granularity reward mechanism improves vision-language model accuracy by addressing hallucination through preference data construction and direct preference optimization. AI-generated summary Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. 
Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.", "url": "https://huggingface.co/papers/2605.00323", "date_published": "2026-05-04T06:45:25"}, {"id": "https://huggingface.co/papers/2605.00691", "image": "", "title": "Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization", "content_text": "Abstract A trajectory-driven framework uses large language models to guide agent behavior and cooperation patterns in distributed black-box consensus optimization, improving solution quality and efficiency. AI-generated summary Distributed black-box consensus optimization is a fundamental problem in multi-agent systems, where agents must improve a global objective using only local objective queries and limited neighbor communication. Existing methods largely rely on handcrafted update rules and static cooperation patterns, which often struggle to balance local adaptation, global coordination, and communication efficiency in heterogeneous nonconvex environments. In this paper, we take an initial step toward trajectory-driven self-design for distributed black-box consensus optimization. We first redesign the agent-level swarm dynamics with an adaptive internal mechanism tailored to decentralized consensus settings, improving the balance between exploration, convergence, and local escape. Built on top of this adaptive execution layer, we propose Learning to Act and Cooperate (LAC-MAS), a trajectory-driven framework in which large language models provide sparse high-level guidance for shaping both agent-internal action behaviors and agent-external cooperation patterns from historical optimization trajectories. We further introduce a phased cognitive scheduling strategy to activate different forms of adaptation in a resource-aware manner. Experiments on standard distributed black-box benchmarks and real-world distributed tasks show that LAC-MAS consistently improves solution quality, convergence efficiency, and communication efficiency over strong baselines, suggesting a practical route from handcrafted distributed coordination toward self-designing multi-agent optimization systems.", "url": "https://huggingface.co/papers/2605.00691", "date_published": "2026-05-04T05:59:01"}, {"id": "https://huggingface.co/papers/2605.00754", "image": "", "title": "Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring", "content_text": "Abstract Researchers introduce Themis-RM, a suite of multilingual code reward models trained on a large preference dataset to enable flexible multi-criteria scoring for code generation tasks. AI-generated summary Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. 
Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.", "url": "https://huggingface.co/papers/2605.00754", "date_published": "2026-05-04T02:00:58"}, {"id": "https://huggingface.co/papers/2605.00777", "image": "", "title": "LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation", "content_text": "Abstract A language-adversarial speaker encoder (LASE) is proposed to address cross-script voice cloning issues by training with contrastive loss and gradient-reversal learning to produce language-uninformative yet speaker-informative embeddings. AI-generated summary A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.", "url": "https://huggingface.co/papers/2605.00777", "date_published": "2026-05-04T05:45:38"}, {"id": "https://huggingface.co/papers/2604.23586", "image": "", "title": "Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling", "content_text": "Abstract Talker-T2AV presents an autoregressive diffusion framework for talking head synthesis that separates high-level cross-modal reasoning from low-level modality-specific refinement, improving lip-sync accuracy and cross-modal consistency. AI-generated summary Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. 
However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.", "url": "https://huggingface.co/papers/2604.23586", "date_published": "2026-05-04T07:59:49"}, {"id": "https://huggingface.co/papers/2604.23195", "image": "", "title": "AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval", "content_text": "Abstract AnalogRetriever is a unified tri-modal retrieval framework that enhances analog circuit design by encoding schematics, descriptions, and netlists into a shared embedding space using vision-language models and graph convolutional networks. AI-generated summary Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality, failing to capture cross-modal semantic relationships. To bridge this gap, we present AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search. We first build a high-quality dataset on top of Masala-CHAI through a two-stage repair pipeline that raises the netlist compile rate from 22% to 100%. Built on this foundation, AnalogRetriever encodes schematics and descriptions with a vision-language model and netlists with a port-aware relational graph convolutional network, mapping all three modalities into a shared embedding space via curriculum contrastive learning. Experiments show that AnalogRetriever achieves an average Recall@1 of 75.2% across all six cross-modal retrieval directions, significantly outperforming existing baselines. When integrated into the AnalogCoder agentic framework as a retrieval-augmented generation module, it consistently improves functional pass rates and enables previously unsolved tasks to be completed. Our code and dataset will be released.", "url": "https://huggingface.co/papers/2604.23195", "date_published": "2026-05-04T03:28:10"}, {"id": "https://huggingface.co/papers/2604.27124", "image": "", "title": "Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models", "content_text": "Abstract Sigmoid attention improves biological foundation model training by providing better representations, faster convergence, and enhanced stability compared to softmax attention through bounded derivatives and diagonal Jacobian structure. 
AI-generated summary Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop-in replacement for softmax attention a) produces better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss; b) trains faster: models with sigmoid attention train up to 10% faster than their softmax counterparts; and c) trains more stably by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives (≤ 0.25) as opposed to softmax, and a diagonal Jacobian structure in contrast with softmax's dense coupling, which together help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient clipping on 8K-token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open-source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention-2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at https://github.com/MSDLLCpapers/triton-sigmoid", "url": "https://huggingface.co/papers/2604.27124", "date_published": "2026-05-04T15:07:47"}]}
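
The feed above is a JSON Feed 1.0 document: a top-level title, home_page_url, and feed_url, plus an items array whose entries each carry id, title, content_text, url, and date_published. As a rough illustration of how such a feed can be consumed, the sketch below is an assumption on my part rather than part of this repository: it uses only the Python 3 standard library, and the FEED_URL constant and fetch_feed helper are hypothetical names. It fetches the feed_url and prints each paper's publication date, title, and link.

# Minimal JSON Feed consumer sketch (assumption: Python 3 standard library only).
import json
import urllib.request

# The feed_url declared in the document above.
FEED_URL = "https://raw.githubusercontent.com/MichaelMarkert/rss/refs/heads/main/hf_papers.json"

def fetch_feed(url: str) -> dict:
    # Download the feed and decode it as JSON (JSON Feed version 1).
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def main() -> None:
    feed = fetch_feed(FEED_URL)
    print(feed["title"])  # e.g. "Hugging Face Papers"
    for item in feed["items"]:
        # Each item in this feed provides id, title, content_text, url, and date_published.
        print(f'{item["date_published"]}  {item["title"]}')
        print(f'  {item["url"]}')

if __name__ == "__main__":
    main()

Reading content_text inside the same loop would additionally surface the abstract summaries stored in each item.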