Accelerating Gemma 4: faster inference with multi-token prediction drafters
An overview of how Multi-Token Prediction (MTP) drafters are making Gemma 4 models up to 3x faster at inference.
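The drafter speedup comes from speculative decoding, which is concrete enough to sketch. Below is a minimal greedy version in Python with toy stand-in models; none of this is the actual Gemma 4 or MTP interface, and a real implementation would score all draft positions in one batched target pass rather than looping.

    # Toy greedy speculative decoding with an MTP-style drafter: the drafter
    # proposes k future tokens from one cheap pass, the target verifies them,
    # and every agreement saves a full target-model decode step.
    def speculative_step(prefix, draft_k, target_next, k=4):
        draft = draft_k(prefix, k)
        out = list(prefix)
        for d in draft:
            t = target_next(out)           # target's greedy token at this position
            if d == t:
                out.append(d)              # drafter matched: accepted for free
            else:
                out.append(t)              # first mismatch: keep target token, stop
                break
        else:
            out.append(target_next(out))   # all k accepted: one bonus target token
        return out

    # Toy "models" over integer tokens; the drafter has one planted error.
    target_next = lambda seq: seq[-1] + 1
    draft_k = lambda seq, k: [seq[-1] + i + (i == 3) for i in range(1, k + 1)]
    print(speculative_step([0], draft_k, target_next))   # -> [0, 1, 2, 3]

Each call yields between one and k+1 tokens per target pass, which is where headline multiples like "up to 3x" generally come from: the average number of accepted draft tokens.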
Abstract page for arXiv paper 2605.03546: ProgramBench: Can Language Models Rebuild Programs From Scratch?
Who should care: If you work with math, science problems, or complex coding tasks and you're looking for something small enough to run locally or cheaply via API, this is worth serious evaluation. The benchmark numbers at 760M active parameters are not normal, and the Markovian RSA boost means performance scales with compute budget rather than hitting a fixed ceiling. If you're building agent workflows that need reliable tool calling or multi-step instruction following, look elsewhere fo...
Researchers say results mark a ‘really profound change in technology that will reshape medicine’
What it takes to run 2 million lines of Haskell in production at a fintech company serving 300,000 businesses.
A deep dive on flow maps.
Abstract page for arXiv paper 2604.26752: GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
Hi HN, author here. SHARP is Apple's recent single-image 3D Gaussian splatting model (https://arxiv.org/abs/2512.10685). Their reference code is PyTorch + a pretty heavy pipeline; I wanted to see if it could run in a browser with no server hop, so I exported the predictor to ONNX and ran it via onnxruntime-web with the WebGPU EP. What works: drop in an image, get a .ply you can download or preview live, all on your machine — your image never leaves the tab. The model is large (~2.4 GB sidecar) so first load is slow on a cold cache, but inference itself is a few seconds on a recent Mac. Caveats: SHARP's released weights are research-use only (Apple's model license, not the code's). I host the exported ONNX on R2 so the demo "just works", but you can also export your own from the upstream Apple repo and upload locally. Happy to talk about it in the comments :)
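For anyone curious what the export step looks like, here is a minimal sketch of tracing a PyTorch module to ONNX for use with onnxruntime-web. The PredictorStub module, input shape, tensor names, and opset are placeholder assumptions, not Apple's SHARP code.

    import torch
    import torch.nn as nn

    class PredictorStub(nn.Module):          # stand-in for the real SHARP predictor
        def forward(self, image):
            return image.mean(dim=1)         # placeholder output

    model = PredictorStub().eval()
    dummy = torch.randn(1, 3, 512, 512)      # assumed single-image input shape
    torch.onnx.export(
        model, dummy, "sharp_predictor.onnx",
        input_names=["image"], output_names=["gaussians"],
        dynamic_axes={"image": {0: "batch"}},
        opset_version=17,                    # a recent opset; WebGPU EP coverage varies by op
    )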
Abstract page for arXiv paper 2510.19315: Transformers are Inherently Succinct
arXiv:2605.00737v1 Announce Type: new Abstract: Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool, when performing a task. This decision is particularly challenging for web search tools, where the benefits of external information depend on the model's internal knowledge and...
arXiv:2605.00005v1 Announce Type: cross Abstract: The increasing deployment of deep neural networks (DNNs) in cyber-physical systems (CPS) enhances perception fidelity, but imposes substantial computational demands on execution platforms, posing challenges to real-time control deadlines. Traditional distributed CPS architectures typically favor on-device inference to avoid network variability and contention-induced delays on remote platforms. However, this design choice places significant energy...
arXiv:2605.00011v1 Announce Type: new Abstract: Federated Learning (FL) enables collaborative intelligence across decentralized data source devices in a privacy-preserving way. While substantial research attention has been drawn to optimizing the learning process for an individual task, real-world applications increasingly require multiple machine learning tasks simultaneously training their models across a shared pool of devices. Naively applying single-FL optimization techniques in multi-FL sy...
arXiv:2605.00018v1 Announce Type: new Abstract: Data-driven MoCap-to-radar models generate plausible micro-Doppler spectrograms, but do they actually learn the underlying physics? We introduce a physics-based interpretability framework to answer this question via two proposed complementary metrics: one measures alignment between model predictions and the physics-derived Doppler frequency, while the other tests whether predictions preserve the velocity-frequency relationship under velocity interv...
arXiv:2605.00020v1 Announce Type: new Abstract: The success of large foundation models is catalyzing a new paradigm for AI-native 6G network design: wireless foundation models for physical layer design. However, existing models often operate on channel state information (CSI) in the space-time-frequency (STF) domain, where distinct multipath components are inherently superimposed and structurally entangled. This hinders the learning of universal channel representation. Meanwhile, their reliance ...
arXiv:2605.00050v1 Announce Type: new Abstract: Traffic accidents are routinely documented in textual reports, yet physically grounded accident reconstruction remains difficult because detailed scene measurements and expert reconstructions are scarce, costly and hard to scale. Here we formulate accident reconstruction from publicly accessible reports and scene measurements as a parameterized multimodal learning problem. We construct CISS-REC, a dataset of 6,217 real-world accident cases curated ...
arXiv:2605.00064v1 Announce Type: new Abstract: Information-theoretic generalization bounds analyze stochastic optimization by relating expected generalization error to the mutual information between learned parameters and training data. Virtual perturbation analyses of SGD add auxiliary Gaussian noise only in the proof, making mutual information tractable while leaving the actual SGD trajectory unchanged. Existing bounds, however, typically require perturbation covariances to be fixed independe...
arXiv:2605.00068v1 Announce Type: new Abstract: Inertial Confinement Fusion (ICF) holds transformative promise for sustainable, near-limitless clean energy, yet remains constrained by prohibitively high costs and limited experimental opportunities. This paper presents Human-in-the-Loop Meta Bayesian Optimization (HL-MBO), a framework that integrates expert knowledge with few-shot, uncertainty-aware machine learning to accelerate discovery in data-scarce, high-stakes scientific domains. HL-MBO in...
arXiv:2605.00069v1 Announce Type: new Abstract: Elastic distances like dynamic time warping (DTW) are central to time series machine learning because they compare sequences under local temporal misalignment. Soft-DTW is an adaptation of DTW that can be used as a gradient-based loss by replacing the hard minimum in its dynamic-programming recursion with a smooth relaxation. However, this approach does not directly extend to elastic distances whose transition costs depend on the local alignment co...
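Since the abstract builds on soft-DTW's smooth relaxation, a small NumPy sketch of the recurrence may help: the hard min over the three DTW transitions is replaced by a log-sum-exp soft minimum with temperature gamma. This is the textbook soft-DTW recursion, not the paper's extension to alignment-dependent transition costs.

    import numpy as np

    def softmin(vals, gamma):
        # soft minimum: -gamma * log(sum(exp(-v / gamma))), max-shifted for stability
        v = np.asarray(vals) / -gamma
        m = v.max()
        return -gamma * (m + np.log(np.exp(v - m).sum()))

    def soft_dtw(x, y, gamma=1.0):
        n, m = len(x), len(y)
        R = np.full((n + 1, m + 1), np.inf)
        R[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = (x[i - 1] - y[j - 1]) ** 2    # pointwise squared cost
                R[i, j] = cost + softmin(
                    [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
        return R[n, m]

    print(soft_dtw(np.array([0., 1., 2.]), np.array([0., 1., 1., 2.]), gamma=0.1))

As gamma approaches 0 the soft minimum approaches the hard minimum and plain DTW is recovered; for gamma > 0 the whole quantity is differentiable and usable as a training loss.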
arXiv:2605.00070v1 Announce Type: new Abstract: We present CRADIPOR, a numerical dispersion prediction tool for automotive crash simulations. Finite Element (FE) crash models are widely used throughout vehicle development, but their predictions are not strictly repeatable because of parallel computation and model complexity. As a result, performance criteria evaluated during post-processing may exhibit significant numerical dispersion, which complicates engineering decision-making. Although disp...
arXiv:2605.00082v1 Announce Type: new Abstract: The Forward-Forward (FF) algorithm presents a compelling, bio-inspired alternative to backpropagation. However, while efficient in training, it has a computationally prohibitive inference process that requires a separate forward pass for every class that is evaluated. In this work, we introduce the Hyperspherical Forward-Forward (HFF), a novel reformulation that resolves this critical bottleneck. Our core innovation is to reframe the local objectiv...
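The bottleneck being resolved is concrete enough to sketch: vanilla Forward-Forward inference runs one forward pass per candidate class, overlaying each label on the input and picking the class with the highest "goodness" (sum of squared activations). The toy one-layer network and label-overlay scheme below are stand-ins; the paper's HFF reformulation is what removes this per-class loop.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(794, 256))            # toy layer: 784 pixels + 10-dim label overlay

    def goodness(x_with_label):
        h = np.maximum(x_with_label @ W, 0.0)  # ReLU activations
        return (h ** 2).sum()                  # FF "goodness" of this (input, label) pair

    def ff_predict(x, num_classes=10):
        scores = []
        for c in range(num_classes):           # one forward pass per class: the bottleneck
            label = np.zeros(num_classes)
            label[c] = 1.0
            scores.append(goodness(np.concatenate([x, label])))
        return int(np.argmax(scores))

    print(ff_predict(rng.normal(size=784)))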
arXiv:2605.00126v1 Announce Type: new Abstract: Generative models for time-series imputation achieve strong reconstruction accuracy, yet provide no finite-sample reliability guarantees, a critical limitation in power systems where imputed values inform dispatch and planning. We introduce SPLICE (Self-supervised Predictive Latent Inpainting with Conformal Envelopes), a modular framework coupling latent generative imputation with distribution-free, online-adaptive prediction intervals. A JEPA enco...
arXiv:2605.00130v1 Announce Type: new Abstract: Learning meaningful representations from medical time series (MedTS) such as ECG or EEG signals is a critical challenge. These signals are often high-dimensional, variable-length and rife with noise. Existing self-supervised approaches, such as Masked Autoencoders (MAEs) are highly effective for pre-training general-purpose encoders. However, they do not explicitly learn compact and semantically interpretable latent representations, typically relyi...
arXiv:2605.00140v1 Announce Type: new Abstract: We present Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian from activation quantization residuals (G_x), ARHQ analytically identifies and isolates error-sensitive weight directions into a high-precision low-rank branch. This is achieved via a closed-form truncated SVD on the scaled...
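The weight-splitting pattern is easy to show in isolation: peel the top singular directions of the weight matrix into a high-precision low-rank branch, then quantize only the residual. The Hessian scaling that gives ARHQ its name is omitted here; this plain-SVD sketch only illustrates the split W = low_rank + dequant(Rq).

    import numpy as np

    def split_and_quantize(W, k=8, bits=4):
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        low_rank = (U[:, :k] * S[:k]) @ Vt[:k]           # kept in high precision
        R = W - low_rank                                  # remainder to be quantized
        scale = np.abs(R).max() / (2 ** (bits - 1) - 1)   # symmetric per-tensor scale
        Rq = np.clip(np.round(R / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return low_rank, Rq, scale

    W = np.random.default_rng(0).normal(size=(64, 64))
    low_rank, Rq, scale = split_and_quantize(W)
    print(np.abs(W - (low_rank + Rq * scale)).mean())     # reconstruction error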
arXiv:2605.00155v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting g...
arXiv:2605.00161v1 Announce Type: new Abstract: Diffusion language models (DLMs) are an attractive alternative to autoregressive models because they promise sublinear-time, parallel generation, yet practical gains remain elusive as high-quality samples still demand hundreds of refinement steps. In continuous domains, consistency training along the probability-flow ODE is a popular recipe to accelerate diffusion. For discrete diffusion, no analogous sample-space ODE exists, making direct adaptati...
arXiv:2605.00182v1 Announce Type: new Abstract: Proteins are shaped by gradual evolution under biophysical and functional constraints. Protein language models learn rich evolutionary constraints from large-scale sequences, and discrete diffusion-based protein language models (e.g., DPLMs) are promising for both understanding and generation. However, existing DPLMs typically rely on masking-based absorbing diffusion that contradicts a simple biological intuition: proteins evolve through accumulate...
arXiv:2605.00022v1 Announce Type: new Abstract: The rapid proliferation of large audio models (LAMs) demands efficient approaches for model comparison, yet comprehensive benchmarks are costly. To fill this gap, we investigate whether minimal subsets can reliably evaluate LAMs while reducing costs and data redundancy. Analyzing 10 subset selection methods with 18 audio models across 40 tasks covering major LAM evaluation dimensions, we show that subsets of just 50 examples (0.3% of data) can achi...
arXiv:2605.00113v1 Announce Type: new Abstract: We examine if frontier chat-based large language models (LLMs) adjust their outputs based on neurodivergence (ND) context in system prompts and describe the nature of these adjustments. Specifically, we propose NDBench, a 576-output benchmark involving two frontier models, three system prompt types (baseline, ND-profile assertion, and ND-profile assertion with explicit instructions for adjustments), four canonical ND profiles, and 24 prompts across...
arXiv:2605.00116v1 Announce Type: new Abstract: In this article, we introduce ViLegalNLI, the first large-scale Vietnamese Natural Language Inference (NLI) dataset specifically constructed for the legal domain. The dataset consists of 42,012 premise-hypothesis pairs derived from official statutory documents and annotated with binary inference labels (Entailment and Non-entailment). It covers multiple legal domains and reflects realistic legal reasoning scenarios characterized by structured logic...
arXiv:2605.00119v1 Announce Type: new Abstract: There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both ...
arXiv:2605.00199v1 Announce Type: new Abstract: When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alo...
arXiv:2605.00200v1 Announce Type: new Abstract: Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we i...
arXiv:2605.00226v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly tasked with strategic decision-making under incomplete information, such as in negotiation and policymaking. While LLMs can excel at many such tasks, they also fail in ways that are poorly understood. We shed light on these failures by uncovering two fundamental gaps in the internal mechanisms underlying the decision-making of LLMs in incomplete-information games, supported by experiments with open-weig...
arXiv:2605.00238v1 Announce Type: new Abstract: Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader a...
arXiv:2605.00253v1 Announce Type: new Abstract: Mamba's recurrent state h_t is, by construction, a compressed summary of every token seen so far. This raises a tempting hypothesis: if we extract token-level outputs y_t at fixed patch boundaries, we obtain semantic sentence summaries for free, with no pooling head, no fine-tuning, and no [CLS] token. We test this hypothesis carefully. Across five benchmarks (SST-2, CoLA, MRPC, STS-B, IMDb), we compare four strategies for extracting frozen sentenc...
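The extraction strategies under comparison are simple enough to sketch over a stand-in [T, d] matrix of per-token outputs y_t from a frozen model (no actual Mamba here): last token, mean pooling, and outputs read off at fixed patch boundaries.

    import numpy as np

    Y = np.random.default_rng(0).normal(size=(37, 16))   # toy y_t sequence: T=37, d=16

    last = Y[-1]                          # last-token summary
    mean = Y.mean(axis=0)                 # mean-pooling baseline
    patch = 8
    boundaries = Y[patch - 1 :: patch]    # y_t at every 8th token (patch boundaries)
    patch_repr = boundaries.mean(axis=0)  # pool only the boundary outputs

    print(last.shape, mean.shape, patch_repr.shape)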
arXiv:2605.00257v1 Announce Type: new Abstract: The inception of Large Language Models (LLMs) has catalyzed AI adoption in the finance sector, yet their reliability in complex, jurisdiction-specific tasks like Indian Chartered Accountancy (CA) remains limited. The models display difficulty in executing numerical tasks which require multiple steps while also needing advanced knowledge about legal regulations and the method of scaling their operations is not feasible in settings which have limited...
arXiv:2605.00269v1 Announce Type: new Abstract: Recent white-box OOD detection methods for LLMs -- including CED, RAUQ, and WildGuard confidence scores -- appear effective, but we show they are structurally confounded by sequence length (|r| >= 0.61) and collapse to near-chance under length-matched evaluation. Even raw attention entropy (mean H(alpha) across heads and layers), a natural baseline we include for completeness, shows the same confound. The confound stems from attention's Theta(log T...
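The confound is easy to reproduce without any model: under near-uniform attention over T tokens, row entropy grows like log T, so any detector built on mean attention entropy tracks sequence length by construction. The random softmax rows below are a stand-in for real attention maps.

    import numpy as np

    def mean_attention_entropy(attn):           # attn: [heads, T, T], rows sum to 1
        p = np.clip(attn, 1e-12, 1.0)
        return -(p * np.log(p)).sum(-1).mean()

    rng = np.random.default_rng(0)
    log_lengths, scores = [], []
    for T in [16, 64, 256, 1024]:
        logits = rng.normal(size=(4, T, T))
        attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)   # softmax rows
        log_lengths.append(np.log(T))
        scores.append(mean_attention_entropy(attn))
    print(np.corrcoef(log_lengths, scores)[0, 1])   # near 1: entropy tracks log T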
arXiv:2605.00318v1 Announce Type: new Abstract: Tabular documents such as CSV and Excel files are widely used in enterprise data pipelines, yet existing chunking strategies for retrieval-augmented generation (RAG) are primarily designed for unstructured text and do not account for tabular structure. We propose a structure-aware tabular chunking (STC) framework that operates on row-level units by constructing a hierarchical Row Tree representation, where each row is encoded as a key-value block. ...
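The row-level unit is the sketchable part: each table row becomes a key-value text block that a RAG chunker can index. The hierarchical Row Tree grouping from the paper is not reproduced below; this only shows the per-row encoding.

    import csv, io

    table = "region,quarter,revenue\nEMEA,Q1,1.2M\nAPAC,Q1,0.9M\n"
    rows = list(csv.DictReader(io.StringIO(table)))
    chunks = ["; ".join(f"{k}: {v}" for k, v in row.items()) for row in rows]
    print(chunks)   # ['region: EMEA; quarter: Q1; revenue: 1.2M', ...]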
arXiv:2605.00326v1 Announce Type: new Abstract: Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-pr...
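The evaluation pattern under critique can be sketched with a stub in place of a real VLM: read the first-token probability of the "unsafe" label as the decision score, then measure how much it moves across semantically equivalent prompt rewrites. The stub's prompt-dependent jitter is invented purely to make the instability visible.

    import numpy as np

    rng = np.random.default_rng(0)

    def first_token_probs(prompt, image):       # stub standing in for a real VLM call
        z = rng.normal(scale=0.8)               # prompt-dependent logit jitter
        p_unsafe = 1 / (1 + np.exp(-(0.3 + z)))
        return {"safe": 1 - p_unsafe, "unsafe": p_unsafe}

    prompts = [
        "Is this image unsafe? Answer safe or unsafe.",
        "Classify the image as safe or unsafe.",
        "Does this image violate safety policy? Reply safe/unsafe.",
    ]
    scores = [first_token_probs(p, image=None)["unsafe"] for p in prompts]
    print(scores, max(scores) - min(scores))    # spread = cross-prompt instability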
arXiv:2605.00336v1 Announce Type: new Abstract: A key challenge for large language models is token cost per query and overall deployment cost. Clinical inputs are long, heterogeneous, and often redundant, while downstream tasks are short and high stakes. We study budgeted context selection, where a subset of document units is chosen under a strict token budget so an off-the-shelf generator can meet fixed cost and latency constraints. We cast this as a knapsack-constrained subset selection proble...
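The knapsack framing admits a compact sketch: given per-unit relevance scores and token counts, pack greedily by score per token until the budget is spent. The scores are invented and the paper's scoring model and any guarantees are not reproduced; this is just the classic density heuristic.

    def select_units(units, budget):
        """units: list of (unit_id, score, n_tokens); keep ids within the token budget."""
        ranked = sorted(units, key=lambda u: u[1] / u[2], reverse=True)
        chosen, used = [], 0
        for uid, score, n in ranked:
            if used + n <= budget:
                chosen.append(uid)
                used += n
        return chosen, used

    units = [("note_a", 3.2, 120), ("labs", 2.1, 40), ("hx", 0.9, 300), ("meds", 1.5, 60)]
    print(select_units(units, budget=200))   # -> (['labs', 'note_a'], 160)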
arXiv:2605.00342v1 Announce Type: new Abstract: Tree-based speculative decoding accelerates autoregressive generation by verifying multiple draft candidates in parallel, but this advantage weakens for sparse Mixture-of-Experts (MoE) models. As the draft tree grows, different branches activate different experts, expanding the union of activated experts and substantially increasing target-side verification cost. We propose EVICT, a training-free, hyperparameter-free, and lossless adaptive verifica...
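The cost effect being targeted is easy to see numerically: with top-k expert routing, each extra draft-tree node can activate different experts, so the union of experts the target must run grows with tree width. Routing below is uniform-random, purely to illustrate the union growth, not the paper's method.

    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, top_k = 64, 2

    def activated_union(n_tokens):               # random top-k routing per draft token
        union = set()
        for _ in range(n_tokens):
            union.update(rng.choice(n_experts, size=top_k, replace=False))
        return union

    for width in [1, 4, 16, 64]:                 # draft-tree nodes verified in parallel
        print(f"tree width {width:3d} -> {len(activated_union(width)):2d} experts activated")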
Large language models are really large. They’re among the largest machine learning projects ever, and set to be (perhaps already are by some measures) some of the largest computing and even largest infrastructure projects ever. But how did LMs actually get so large as to warrant the title ‘large language model (LLM)’? A large part of the answer is in the P ('pretrained') and the T ('transformer') of GPT. This is part 1 of a series about LLM architecture and some implications, past and future, for ...
Background: Manifold-Constrained Hyper-Connections (mHC) is a new architecture introduced by DeepSeek and recently implemented in DeepSeek v4. mHC fixes the vanishing or exploding gradients caused by Hyper-Connections (HC) while still keeping HC's performance gains: adding weights and biases on the HC paths made signals from earlier layers harder to update, making the residual stream less residual-streamy. HC is a cursed method of adding weights and biases onto the residual stream to simulate a wi...
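The vanishing/exploding mechanism is simple to demonstrate: any learned scale on the skip path compounds across depth. The toy below omits the block outputs to isolate the scaled skip path; it is not the HC or mHC parameterization from the paper, just the compounding effect.

    import numpy as np

    def skip_path_gain(depth, alpha):
        h = np.ones(4)                 # signal injected at layer 0
        for _ in range(depth):
            h = alpha * h              # block output omitted: isolate the scaled skip path
        return np.abs(h).mean()

    for alpha in [1.0, 0.95, 1.05]:    # 1.0 = plain residual; others = HC-style learned scale
        print(alpha, skip_path_gain(64, alpha))   # ~1.0, ~0.04, ~23 after 64 layers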
Within the AI Safety community, paraphrasing (which, in the context of this post, simply means using another LLM with nonzero temperature to rewrite a given piece of content) is generally considered a viable defence and detection method for steganography in LLMs. In this blogpost, we briefly provide a taxonomy of types of steganography in LLMs, and then highlight the limitations of paraphrasing against each type. Unfortunately, there are types of steganography that LLMs have been shown to use t...
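As defined above, the defence itself is a one-liner to sketch: route the content through another LLM at nonzero temperature and pass the rewrite downstream. The client below is a toy stand-in, not any specific provider's API.

    def paraphrase(text, llm_call, temperature=0.8):
        prompt = ("Rewrite the following text so the meaning is preserved "
                  "but the wording is fresh:\n\n" + text)
        return llm_call(prompt, temperature=temperature)

    # toy stand-in for a real LLM client, just to make the sketch runnable
    fake_llm = lambda prompt, temperature: prompt.split("\n\n", 1)[1].replace("arrives", "shows up")
    print(paraphrase("the package arrives at noon", fake_llm))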
A technique for accurate text and numbers in AI-generated images: generate the layout deterministically, then ask the image model to paint on top.
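The two-stage idea admits a short sketch: render the text and numbers deterministically (here with PIL) so every glyph is exact, then hand the canvas to an image model to paint style over it. The second stage is left as a placeholder comment; any diffusion-inpainting API would slot in there.

    from PIL import Image, ImageDraw, ImageFont

    canvas = Image.new("RGB", (512, 256), "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    draw.text((40, 100), "TOTAL: $1,234.56", fill="black", font=font)  # exact glyphs, exact layout
    canvas.save("layout.png")

    # stage 2 (hypothetical): styled = inpaint(image=canvas, prompt="receipt photo",
    #                                          keep="text pixels")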
An open source harness for generating CAD models: earthtojake/text-to-cad on GitHub.
I’m a veterinary student and ML researcher based in Nigeria. Over the past months I’ve been building what I believe is the first AI safety evaluation benchmark targeting Nigerian indigenous livestock systems. This post shares the baseline results, methodology, and open questions. I’m posting here partly to share the work and partly because I’m looking for feedback from people working on evals, AI deployment in low-resource contexts, and African AI safety. Why this ...
Our evaluation of OpenAI's GPT-5.5 cyber capabilities. The UK's AI Security Institute previously evaluated Claude Mythos; now they've evaluated GPT-5.5 for finding security vulnerabilities and found it to be comparable to Mythos, but unlike Mythos it's generally available right now. Tags: ai, openai, generative-ai, llms, anthropic, claude, ai-security-research, gpt