Brief

last 24h

[50/690] 185 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · Towards AI · 4h

Building an LLM From Scratch: I Trained Word Embeddings on Dostoevsky. Here’s What I Found.

The author details their process of building word embeddings from scratch, using Dostoevsky's novels as a corpus of nearly one million words. This step follows their previous work on character-level tokenization and aims to represent words as dense vectors that capture semantic relationships, moving beyond simple frequency counts. The article explains the mathematical concepts behind embeddings and highlights the limitations of earlier NLP models like one-hot encodings, which struggled with semantic understanding and data sparsity. AI

IMPACT Demonstrates a foundational NLP technique for representing word meaning, crucial for building more sophisticated language models.
- Dostoevsky
- NLP
TOOL · LessWrong (AI tag) · 4h

A Research Agenda for Secret Loyalties

A new paper from Formation Research introduces the concept of "secret loyalties" in frontier AI models, where a model is intentionally manipulated to advance a specific actor's interests without disclosure. The research highlights that such secret loyalties could be activated broadly or narrowly, and could influence a wide range of actions. The paper argues that current AI safety infrastructure, including data monitoring and behavioral evaluations, is insufficient to detect these sophisticated, covert manipulations, which can be strengthened by splitting poisoning across training stages. AI

IMPACT Introduces a new threat model for AI safety, potentially requiring new defense mechanisms against covert manipulation.
TOOL · Towards AI · 6h

How LLMs Actually Work And Why Your Prompts Keep Failing

This article provides a beginner-friendly explanation of how Large Language Models (LLMs) function, focusing on their internal processes without complex mathematics. It details how LLMs handle context, predict subsequent tokens, and generate outputs. The piece aims to help users understand why their prompts might not yield the desired results. AI

IMPACT Provides a foundational understanding of LLM mechanics, aiding users in crafting more effective prompts and interpreting model behavior.
- LLMs
- Towards AI
TOOL · dev.to — LLM tag · 6h

Why I’m Pivoting Mnemara: The "Turn 0" State Injection Strategy

A developer is pivoting their tool, Mnemara, from injecting state mid-conversation to a "Turn 0" strategy, placing all critical information in the initial system prompt. This approach leverages the primacy bias of LLMs, ensuring smaller models like Llama 3 and Mistral can consistently access and utilize injected state. The revised architecture aims to make the tool model-agnostic, improving reliability across different model tiers by establishing a clear source of truth at the beginning of the context window. AI

IMPACT This strategy may improve the reliability of smaller LLMs by ensuring critical state information is prioritized in the prompt.
- Mnemara
- GPT-4o
- Claude 3.5
- Llama 3
- Mistral
- Gemini
- Mnemara-Gemma
TOOL · Towards AI · 7h

MCP vs Tool Use vs Function Calling: LLM Integration Guide

This article explores three distinct approaches for integrating large language models (LLMs) with external systems: MCP, tool use, and function calling. It aims to clarify the differences between these architectures and how they address the challenge of connecting LLMs to the broader digital ecosystem. The guide provides insights into the underlying mechanisms and potential applications of each integration method. AI

IMPACT Clarifies key methods for connecting LLMs to external systems, aiding developers in choosing the right integration architecture.
RESEARCH · arXiv stat.ML · 18h · [2 sources]

Bayesian Surrogate Training on Multiple Data Sources: A Hybrid Modeling Strategy

Researchers have developed new strategies for training surrogate models by integrating data from multiple sources, including simulations and real-world measurements. One approach involves training separate models for each data type and then combining their predictions, while another trains a single model incorporating both data types. These hybrid methods aim to improve predictive accuracy and coverage, and to identify potential issues within existing simulation models, ultimately aiding in system understanding and future development. AI

IMPACT Enhances AI model training by enabling more accurate predictions and better diagnostics through multi-source data integration.
- Philipp Reiser
- Ian Taylor
RESEARCH · Mastodon — sigmoid.social · 7h · [5 sources]

BIML is proud to release a new study today: No Security Meter for AI # AI # ML # MLsec # security # infosec # swsec # appsec # LLM # AgenticAI https:// berryvil

Berryville Infrastructure & Machine Learning (BIML) has published a new study highlighting a lack of security metrics for AI systems. The research indicates that current security practices are insufficient to address the unique risks posed by artificial intelligence. This gap in security measurement could hinder the safe and responsible development and deployment of AI technologies. AI

IMPACT Highlights a critical gap in AI security, potentially slowing responsible adoption.
- Berryville Infrastructure & Machine Learning
- AI
SIGNIFICANT · 雷峰网 (Leiphone) 中文(ZH) · 15h · [2 sources]

BMJ exclusively partners with Hydrogen Ion, Alibaba Health begins international top journal cooperation

Alibaba Health has launched its medical AI assistant, "Hydrogen Ion," designed to provide Chinese doctors with reliable, evidence-based medical information. The AI will offer exclusive access to over a decade of content from 70 medical journals published by the UK's BMJ Group, enabling features like full-text reading, translation, and evidence-based Q&A. This collaboration aims to bridge the gap in accessing cutting-edge global medical research for Chinese physicians, addressing challenges such as scattered literature, language barriers, and the high hallucination rates of general AI models. AI

IMPACT Enhances access to global medical research for Chinese doctors, potentially improving clinical decision-making and research.
TOOL · arXiv stat.ML · 18h

Semi-Supervised Bayesian GANs with Log-Signatures for Uncertainty-Aware Credit Card Fraud Detection

Researchers have developed a new semi-supervised deep learning framework for credit card fraud detection, addressing challenges with large datasets and irregular transaction data. The system integrates Generative Adversarial Networks (GANs) for data augmentation, Bayesian inference for uncertainty quantification, and log-signatures for robust feature encoding. Evaluated on the BankSim dataset, the approach demonstrated improved performance over benchmarks, particularly in scenarios with limited labeled data, highlighting the value of uncertainty-aware predictions in financial time series classification. AI

IMPACT Introduces a novel framework for improving fraud detection accuracy and uncertainty quantification in financial transactions.
- David Hirnschall
- BankSim
TOOL · arXiv stat.ML · 18h

Stationary MMD Points

Researchers have introduced a new theoretical framework for approximating probability distributions using a finite set of points. Instead of attempting to globally minimize the maximum mean discrepancy (MMD), which is computationally challenging due to non-convexity, the study focuses on identifying and computing "stationary points" of the MMD. The paper demonstrates that these stationary points offer a faster convergence rate for numerical integration errors than the MMD itself, a phenomenon termed "super-convergence." AI

IMPACT Introduces a novel theoretical approach for probability distribution approximation that could enhance numerical integration methods in machine learning.
- Stationary MMD Points
- Zonghao Chen
TOOL · Mastodon — fosstodon.org · 9h

Breaking through mathematical barriers is key to advancing scientific discovery. Penn Engineers have designed a new # AI framework to solve complex equations, h

Researchers at the University of Pennsylvania have developed a novel AI framework aimed at tackling complex mathematical equations. This advancement is expected to accelerate scientific discovery by enabling a deeper understanding of intricate systems, such as DNA interactions and weather patterns. AI

IMPACT This AI framework could accelerate scientific breakthroughs by improving the analysis of complex data in fields like biology and meteorology.
- University of Pennsylvania
- AI framework
RESEARCH · The Register — AI · 2d · [2 sources]

Microsoft researchers find AI models and agents can't handle long-running tasks

Microsoft researchers have identified a significant limitation in current AI models and agents: their inability to effectively manage long-running tasks. These systems struggle with tasks that require sustained operation or memory over extended periods. This deficiency impacts their potential for complex, multi-stage operations and highlights an area for future AI development. AI

IMPACT Highlights a current limitation in AI capabilities, suggesting that complex, long-term operations are not yet feasible for current models and agents.
TOOL · Forbes — Innovation · 4h

Teaching Your Body To Make Designer Antibodies

Researchers have developed a novel method to enable the body to produce its own antibodies for extended periods, addressing the limitations of current antibody drugs. This technique involves gene-editing blood-forming stem cells to carry a blueprint for a specific antibody, which then act as a continuous factory within the body. The edited cells can be triggered by a vaccine booster to produce high levels of the chosen antibody, showing promising results in mice against HIV, malaria, and influenza, and even enabling the production of multiple antibodies simultaneously. AI

IMPACT This research could lead to more effective and cost-efficient long-term treatments for chronic diseases and infections.
- Science
- HIV
- malaria
- influenza
TOOL · IEEE Spectrum — AI · 8h

Can AI Chatbots Reason Like Doctors?

A recent study published in Science indicates that OpenAI's large language models have demonstrated the ability to outperform physicians in certain clinical reasoning tasks, using real emergency room data. This development occurs amidst ongoing debate about the reliability of medical information provided by chatbots, with some research highlighting impressive diagnostic capabilities while others point to fabricated information and flawed advice. Despite these concerns, products like ChatGPT for Clinicians and Healthcare are already being introduced to the market, prompting calls for further testing and cautious interpretation of AI's role in medicine. AI

IMPACT LLMs show potential to aid medical professionals in diagnosis and treatment planning, though concerns about accuracy and reliability persist.
TOOL · AI Business · 8h

Bosch, Researchers Develop AI for Humanoid Dexterity

Researchers from Bosch and Carnegie Mellon University have created an AI system called Humanoid Transformer with Touch Dreaming (HTD) to enhance the dexterity of humanoid robots. This system uses reinforcement learning and VR data to enable robots to predict touch and force outcomes, improving their spatial awareness and planning for complex manipulation tasks. In tests, HTD significantly boosted success rates by over 90% across various real-world tasks, with potential applications in household chores, retail, and manufacturing. AI

IMPACT Enhances humanoid robot capabilities in manipulation and task execution, potentially broadening their use in domestic and industrial settings.
TOOL · Medium — fine-tuning tag · 10h

Is Fine-Tuning Always Necessary? When Pretrained Models Are Enough

This article explores the necessity of fine-tuning pretrained AI models. It argues that while fine-tuning can enhance performance for specific tasks, it is not always required. The author suggests that for many applications, the capabilities of existing large pretrained models are sufficient, potentially saving resources and time. AI

IMPACT Operators can save resources by leveraging existing pretrained models instead of always fine-tuning for specific tasks.
TOOL · Towards AI · 10h

I Actually Built It. Here’s Every Line That Matters — and Every Line That Broke First.

The author details the practical implementation of the A2A Protocol, an open standard for agent discovery and task delegation. This second part focuses on the code, outlining the architecture where the orchestrator acts as both a server and a client. It highlights the importance of the orchestrator being an A2A service to receive structured tasks and emit failure events, contrasting this with a simpler client-only script. The project structure and setup for the shared agent and customer-specific orchestrators are also provided. AI

IMPACT Provides a practical, code-level guide to implementing agent interoperability, potentially accelerating adoption of decentralized agent systems.
TOOL · IEEE Spectrum — AI · 10h

Archivists Turn to LLMs to Decipher Handwriting at Scale

Large language models are proving effective at deciphering historical handwriting, a task that has long challenged AI researchers. A study by Wilfrid Laurier University found that LLMs outperformed specialized software like Transkribus in accuracy, speed, and cost when transcribing 18th and 19th-century documents. This advancement is making previously inaccessible archival collections searchable, enabling new avenues for scholarly research and personal discovery. AI

IMPACT Makes vast archives searchable, accelerating historical research and personal discovery by enabling LLMs to decipher difficult handwriting.
TOOL · 36氪 (36Kr) 中文(ZH) · 12h

Alibaba Health and UK's BMJ Group Reach Exclusive Cooperation on Journal Content

Ali Health has launched its medical AI platform, "Hydrogen Ion," and announced an exclusive content partnership with the UK's BMJ Group. This collaboration grants Hydrogen Ion access to BMJ's extensive medical journal content, enabling Chinese doctors to directly access and utilize global medical literature for clinical and research purposes. The platform also offers features like evidence-based Q&A and online translation, with ongoing discussions for partnerships with other top journals. AI

IMPACT Enhances access to global medical literature for Chinese doctors, potentially improving clinical decision-making and research.
RESEARCH · MarkTechPost · 1d · [2 sources]

Meet AntAngelMed: A 103B-Parameter Open-Source Medical Language Model Built on a 1/32 Activation-Ratio MoE Architecture

Researchers have introduced AntAngelMed, a 103 billion parameter open-source medical language model. It utilizes a Mixture-of-Experts (MoE) architecture, activating only 6.1 billion parameters per query for enhanced efficiency. This design allows it to match the performance of a 40 billion parameter dense model while achieving speeds over 200 tokens per second on H20 hardware. The model supports a 128K context length and has undergone a three-stage training process including pre-training on medical corpora, supervised fine-tuning, and reinforcement learning. AI

IMPACT Provides a highly efficient, open-source LLM for medical applications, potentially accelerating research and development in the healthcare sector.
TOOL · arXiv stat.ML · 18h

Integral Imprecise Probability Metrics

Researchers have introduced a new framework for comparing and quantifying epistemic uncertainty in machine learning models. This framework, called the integral imprecise probability metric (IIPM), generalizes classical integral probability metrics to a broader class of imprecise probability models. IIPM not only allows for comparisons between different imprecise probability models but also enables the quantification of epistemic uncertainty within a single model. A key application is the development of a new measure called Maximum Mean Imprecision (MMI), which has shown strong empirical performance in selective classification tasks, particularly when dealing with a large number of classes. AI

IMPACT Introduces a novel framework for quantifying epistemic uncertainty, potentially improving model robustness and interpretability in complex classification tasks.
TOOL · arXiv stat.ML · 18h

Localising Dropout Variance in Twin Networks

Researchers have developed a novel method to decompose predictive variance in deep twin networks, separating it into encoder and head components. This technique, which adds minimal computational cost, helps pinpoint the source of model failures. The encoder component proves crucial for identifying out-of-distribution samples under covariate shift, while the head component becomes informative only after encoder uncertainty is managed. This decomposition offers a practical diagnostic tool for guiding data collection strategies. AI

IMPACT Provides a new diagnostic tool for understanding and improving the reliability of deep learning models in critical applications.
- Cooper Doyle
TOOL · arXiv stat.ML Deutsch(DE) · 18h

Doubly Outlier-Robust Online Infinite Hidden Markov Model

Researchers have developed a new method called Batched Robust iHMM (BR-iHMM) to improve the accuracy of online infinite hidden Markov models when dealing with noisy data. This approach enhances robustness against outliers and model misspecification by incorporating generalized Bayesian inference and bounding the posterior influence function. Tests on financial, energy, and synthetic datasets showed BR-iHMM reduced forecasting errors by up to 67% compared to existing methods, demonstrating its practical utility for forecasting and interpretable online learning. AI

IMPACT Introduces a more robust forecasting method for streaming data, potentially improving accuracy in financial and energy sectors.
- Batched Robust iHMM (BR-iHMM)
- Horace Yiu
RESEARCH · arXiv stat.ML · 1d · [2 sources]

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Researchers have introduced Pion, a novel spectrum-preserving optimizer designed for training large language models. Unlike traditional additive optimizers like Adam, Pion utilizes orthogonal transformations to update weight matrices, maintaining their singular values and spectral norm. This approach offers a stable and competitive alternative for both LLM pretraining and finetuning, as demonstrated by empirical results. AI

IMPACT Introduces a new optimization method that could improve LLM training stability and performance.
- Pion
- large language model
- Adam
- Muon
TOOL · arXiv stat.ML · 18h

Sparsity-Constraint Optimization via Splicing Iteration

Researchers have introduced SCOPE, a novel iterative algorithm for sparsity-constrained optimization problems. This method is designed to optimize nonlinear, differentiable, and strongly convex functions, replacing traditional gradient steps with a splicing operation that directly uses objective values. SCOPE eliminates the need for hyperparameter tuning and theoretically achieves linear convergence rates while accurately recovering the true support set. Numerical experiments demonstrate its superior performance in tasks like sparse quadratic optimization and learning sparse classifiers. AI

IMPACT Introduces a new optimization technique that could improve efficiency and accuracy in various machine learning tasks.
- SCOPE
- Jin Zhu
TOOL · arXiv stat.ML · 18h

Practical estimation of the optimal classification error with soft labels and calibration

This paper introduces a practical method for estimating optimal classification error in binary classification tasks, particularly when dealing with soft labels and calibration. The research extends prior work by theoretically analyzing the bias of hard-label estimators and addressing the challenge of corrupted soft labels. The proposed method, which is instance-free and thus suitable for privacy-sensitive scenarios, demonstrates consistency even with imperfectly calibrated soft labels. AI

IMPACT Introduces a novel theoretical and practical approach to evaluating classification model performance, particularly useful in privacy-constrained environments.
- Ryota Ushio
TOOL · arXiv stat.ML · 18h

Testing General Relativity Through Gravitational Wave Classification: A Convolutional Neural Network Framework

Researchers have developed a convolutional neural network (CNN) framework to test General Relativity using gravitational wave data. By training the CNN on simulated beyond-GR waveforms, they found that using a response function observable improved classification sensitivity significantly compared to raw waveforms. The framework successfully detected deviations in massive gravity theories, demonstrating its potential for probing fundamental physics with astrophysical observations. AI

IMPACT Introduces a novel machine learning approach for fundamental physics research, potentially enabling new avenues for scientific discovery.
TOOL · arXiv stat.ML · 18h

In-Context Multi-Objective Optimization

Researchers have developed TAMO, a novel transformer-based policy for multi-objective Bayesian optimization that operates entirely in-context. This approach eliminates the need for per-task surrogate fitting and acquisition engineering, significantly reducing proposal time by up to 1000x. TAMO is pretrained using reinforcement learning to maximize cumulative hypervolume improvement, allowing it to approximate Pareto frontiers and improve solution quality under tight evaluation budgets. The development opens a path towards plug-and-play optimizers for scientific discovery. AI

IMPACT Enables faster, more adaptable optimization for scientific discovery workflows by eliminating per-task model fitting.
- TAMO
- Xinyu Zhang
- arXiv
TOOL · arXiv stat.ML · 18h

Provably Data-driven Multiple Hyper-parameter Tuning with Structured Loss Function

Researchers have developed a new framework for statistically guaranteeing the performance of multi-dimensional hyperparameter tuning in data-driven machine learning settings. This approach leverages tools from real algebraic geometry to provide sharper and more broadly applicable guarantees than previous methods, which were limited to one-dimensional hyperparameters. The work also establishes the first general lower bound for this type of tuning and extends the analysis to use validation loss under minimal assumptions. AI

IMPACT Establishes theoretical guarantees for optimizing complex machine learning models, potentially improving performance and reliability.
- Anh Nguyen
TOOL · arXiv stat.ML · 18h

Approximating Simple ReLU Networks based on Spectral Decomposition of Fisher Information

Researchers have analyzed the Fisher information matrices of simple two-layer ReLU neural networks with random hidden weights. They found that the eigenvalue distribution concentrates significantly on specific eigenspaces, with the first three accounting for nearly all of the matrix's trace. The study identifies these dominant eigenspaces as corresponding to spherical harmonic functions of order two or less, linking this to Mercer decomposition of neural tangent kernels. AI

IMPACT Provides theoretical insights into the structure of simple neural networks, potentially informing future model design and analysis.
TOOL · arXiv stat.ML · 18h

Smoothed Analysis of Learning from Positive Samples

Researchers have developed a smoothed analysis approach for learning from positive-only samples, a challenging problem in binary classification. Unlike worst-case scenarios where learning is nearly impossible, this new method demonstrates that all VC classes become learnable under smoothed conditions. The work also introduces efficient algorithms for related problems in parameter estimation, truncation detection, and learning from reference distributions. AI

IMPACT Introduces a theoretical framework that could enable learning from incomplete datasets in fields like bioinformatics and ecology.
- Anay Mehrotra
TOOL · arXiv stat.ML · 18h

CRPS-Optimal Binning for Univariate Conformal Regression

Researchers have developed a new non-parametric method for estimating conditional distributions, which can be used for conformal regression. This approach involves partitioning data into bins and using the empirical cumulative distribution function within each bin to predict distributions. The method optimizes bin boundaries by minimizing a leave-one-out Continuous Ranked Probability Score (LOO-CRPS) and selects the optimal number of bins through cross-validation. The resulting prediction bands and sets offer finite-sample coverage guarantees and demonstrate narrower intervals than existing split-conformal methods on benchmark datasets. AI

IMPACT Introduces a novel statistical technique that could enhance the reliability and precision of predictive modeling in machine learning applications.
- Paolo Toccaceli
TOOL · arXiv stat.ML · 18h

Improving the Accuracy of Amortized Model Comparison with Self-Consistency

Researchers have developed a self-consistency (SC) loss to improve the accuracy of amortized Bayesian model comparison (BMC) when simulation models are misspecified. This technique enhances BMC estimators, particularly in open-world scenarios where all candidate models are imperfect. The study evaluated four amortized BMC methods, finding that SC training significantly boosts performance when analytic likelihoods are available or surrogate likelihoods are locally accurate, even with misspecified models. AI

IMPACT Enhances statistical methods used in training and evaluating machine learning models.
- Šimon Kucharský
TOOL · arXiv stat.ML · 18h

Adversarial Causal Tuning for Realistic Time-series Generation

Researchers have developed a new methodology called Adversarial Causal Tuning (ACT) to generate realistic time-series data from causal models. This approach aims to create simulated data that matches the observational and interventional distributions of real-world datasets, enabling tasks like intervention simulation and root-cause analysis. ACT utilizes ideas from Generative Adversarial Networks and AutoML to optimize causal models and discriminators, with experiments showing its effectiveness in selecting optimal causal models and generating indistinguishable data from the true distribution. AI

IMPACT Introduces a novel method for generating realistic time-series data from causal models, potentially improving simulations and causal reasoning tasks.
- Adversarial Causal Tuning
- Nikolaos Gkorgkolis
TOOL · arXiv stat.ML · 18h

Partition Tree: Conditional Density Estimation over General Outcome Spaces

Researchers have introduced Partition Tree, a new framework for conditional density estimation that can handle both continuous and categorical variables. This nonparametric approach models conditional distributions using data-adaptive partitions and learns by minimizing conditional negative log-likelihood. An extension called Partition Forest averages conditional densities for improved probabilistic prediction, showing competitive results against existing methods. AI

IMPACT Introduces a new nonparametric method for density estimation, potentially improving probabilistic predictions in machine learning models.
- Partition Tree
- Felipe Angelim
RESEARCH · arXiv stat.ML · 1d · [2 sources]

A proximal gradient algorithm for composite log-concave sampling

Researchers have developed a new proximal gradient algorithm designed to sample from composite log-concave distributions. This algorithm assumes access to gradient evaluations for one part of the distribution and a restricted Gaussian oracle for the other. The proposed method achieves state-of-the-art iteration counts for sampling, matching previous results for simpler cases and extending to non-log-concave distributions and non-smooth functions. AI

IMPACT Introduces a novel sampling technique that could improve efficiency in statistical modeling and machine learning applications.
- arXiv
- Mathematics
TOOL · arXiv stat.ML · 18h

Towards Uncertainty-Aware Federated Granger Causal Learning

Researchers have developed a new method for Federated Granger Causality (FedGC) that addresses the limitation of deterministic point estimates by incorporating uncertainty awareness. This approach provides calibrated measures of uncertainty, allowing operators to distinguish reliable cross-client interactions from spurious ones. The method derives closed-form expressions for steady-state variances and proposes a post-training hypothesis testing procedure to identify genuine interactions, outperforming existing federated causal structure learning baselines on synthetic and real-world datasets. AI

IMPACT Introduces uncertainty quantification to federated causal discovery, enabling more reliable identification of cross-system interactions.
- Ayush Mohanty
- Federated Granger Causality
TOOL · arXiv stat.ML · 18h

Finite and Corruption-Robust Regret Bounds in Online Inverse Linear Optimization under M-Convex Action Sets

Researchers have developed a new method for online inverse linear optimization, a technique used in contextual recommendation systems. This approach achieves a finite regret bound of O(d log d) for M-convex action sets, a significant improvement over previous exponential bounds and a partial answer to an open question in the field. The method combines structural characterization of optimal solutions with geometric volume arguments. Additionally, the technique has been extended to handle adversarially corrupted feedback, yielding a bound of O((C+1)d log d) without prior knowledge of the corruption level. AI

IMPACT Establishes a new theoretical bound for online inverse linear optimization, potentially improving recommendation systems.
RESEARCH · arXiv stat.ML · 1d · [2 sources]

Model-based Bootstrap of Controlled Markov Chains

Researchers have developed a new model-based bootstrap method for controlled Markov chains, particularly useful in offline reinforcement learning scenarios where the data-generating policy is unknown. This technique establishes distributional consistency for transition estimators and extends to policy evaluation and recovery, providing asymptotically valid confidence intervals for value and Q-functions. Experimental results on the RiverSwim problem demonstrate that the proposed confidence intervals offer improved calibration and coverage compared to existing methods, especially with limited data. AI

IMPACT Improves confidence interval calibration for offline reinforcement learning, aiding in more reliable policy evaluation and recovery.
RESEARCH · arXiv cs.CL · 1d · [2 sources]

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

Researchers have introduced MedHopQA, a new benchmark designed to evaluate the multi-hop reasoning capabilities of large language models in the biomedical domain. This benchmark consists of 1,000 expert-curated question-answer pairs, each requiring information synthesis from two distinct Wikipedia articles, with answers provided in free text. The MedHopQA dataset was presented as a shared task at BioCreative IX, attracting 48 submissions from 13 teams, and highlighted the effectiveness of retrieval-augmented generation strategies for improved performance. AI

IMPACT Establishes a new standard for evaluating complex biomedical reasoning in LLMs, pushing for more robust and contamination-resistant benchmarks.
- MedHopQA
- LLM
- BioCreative IX
- Wikipedia
- MONDO
- NCBI Gene
- NCBI Taxonomy
RESEARCH · arXiv stat.ML · 1d · [2 sources]

Online Learning-to-Defer with Varying Experts

Researchers have developed a new online algorithm for Learning-to-Defer (L2D) methods, designed to handle streaming data and dynamic expert availability. This algorithm is the first of its kind for multiclass classification with bandit feedback and a varying pool of experts. It offers theoretical regret guarantees and has demonstrated effectiveness in experiments on both synthetic and real-world datasets, extending L2D capabilities to more complex, dynamic environments. AI

IMPACT Introduces a novel algorithmic approach for dynamic expert selection in machine learning, potentially improving efficiency in real-time decision-making systems.
- Yannis Montreuil
RESEARCH · arXiv cs.CL · 1d · [2 sources]

Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

Researchers are exploring the use of large language models (LLMs) for enhancing safety in air traffic control (ATC) and around non-towered airports. One study proposes a vision-language model approach to analyze radio communications, weather data, and flight trajectories for safety assessments, achieving high F1 scores with open-source models. Another paper introduces a safety-oriented evaluation framework that highlights the critical need for consequence-aware metrics, as standard accuracy measures can mask severe risks in ATC operations. AI

IMPACT LLM analysis could improve safety and efficiency in critical air traffic control operations.
RESEARCH · arXiv stat.ML · 1d · [3 sources]

Multi-Variable Conformal Prediction: Optimizing Prediction Sets without Data Splitting

Two new research papers introduce advanced conformal prediction techniques to improve the accuracy and efficiency of prediction sets. The first paper, "Multi-Variable Conformal Prediction (MCP)," extends conformal prediction to handle vector-valued score functions, allowing for more flexible prediction set shapes without sacrificing coverage guarantees and eliminating the need for data splitting. The second paper, "Shape-Adaptive Conditional Calibration for Conformal Prediction via Minimax Optimization," presents the Minimax Optimization Predictive Inference (MOPI) framework, which optimizes over a flexible class of set-valued mappings to achieve superior shape adaptivity and more efficient prediction sets, even for complex conditional distributions. AI

IMPACT These new methods could lead to more reliable and efficient predictive models in machine learning by improving the calibration of prediction sets.
TOOL · Towards AI · 17h

Machine Learning System -Design Model Versioning & the Registry: Why Your S3 Bucket Is Not a Source…

This article discusses the critical need for robust model versioning and registry systems in machine learning development. It argues that simple cloud storage solutions like S3 buckets are insufficient for managing the complexities of ML model lifecycles. The piece emphasizes the importance of dedicated registries for tracking, organizing, and deploying models effectively. AI

IMPACT Highlights the necessity of proper infrastructure for managing ML models, crucial for scalable and reliable AI deployments.
- Towards AI
- S3 bucket
TOOL · dev.to — LLM tag · 17h

Guaranteed JSON Every Time: Using Claude's Structured Outputs with JSON Schema

A developer guide demonstrates how to reliably extract structured data from Anthropic's Claude models by leveraging their tool-use feature. Instead of directly prompting for JSON, the technique involves defining a fake tool with a JSON schema for its arguments and forcing Claude to call this tool. The model's output, which conforms to the schema as a side effect of tool invocation, is then captured as the desired structured data. This method bypasses common issues like malformed JSON or prose responses, ensuring consistent and parsable output for applications. AI

IMPACT Enables developers to reliably integrate LLM-generated structured data into applications, reducing error handling and improving robustness.
TOOL · Towards AI · 17h

Cog-RAG: Cognitive-Inspired Dual-Hypergraph RAG

Researchers have developed Cog-RAG, a novel approach to Retrieval Augmented Generation that mimics human cognitive processes for improved LLM responses. Unlike traditional methods that retrieve flat text or simple graph structures, Cog-RAG constructs a dual-hypergraph. This structure includes a theme hypergraph for narrative themes across documents and an entity hypergraph for detailed relationships within chunks. The system first identifies query themes to guide the retrieval of relevant details, enhancing coherence and reducing factual errors. AI

IMPACT Cog-RAG's cognitive-inspired approach could lead to more coherent and accurate LLM responses by better capturing semantic relationships.
- Cog-RAG
- LLM
- GPT-4o
TOOL · 量子位 (QbitAI) 中文(ZH) · 18h

In the Auto Research Era, 47 Tasks Without Standard Answers Become the Must-Test List for Agent Capabilities

A new benchmark, Frontier-Eng Bench, has been released to evaluate AI agents on complex engineering tasks that lack standardized answers. This benchmark moves beyond simple problem-solving by requiring agents to propose solutions, integrate with simulators, interpret feedback, and iteratively refine parameters. The goal is to assess an agent's ability to perform continuous optimization and self-evolution in real-world scenarios, moving towards an era of 'Auto Research' where AI agents function as tireless engineering teams. AI

IMPACT This benchmark could accelerate the development of AI agents capable of real-world engineering optimization, potentially transforming research and development processes.
RESEARCH · arXiv stat.ML · 1d · [2 sources]

Optimal Policy Learning under Budget and Coverage Constraints

Researchers have developed a new framework for optimal policy learning that addresses combined budget and minimum coverage constraints. The study reveals a knapsack-type structure within the problem, allowing the optimal policy to be defined by an affine threshold rule. Two algorithms, Greedy-Lagrangian (GLC) and rank-and-cut (RC), are proposed to implement this approach, with GLC offering close approximation and RC showing near-optimality under specific conditions. AI

IMPACT Introduces a novel algorithmic approach for optimizing resource allocation in policy learning scenarios.
- Greedy-Lagrangian (GLC)
- rank-and-cut (RC)
RESEARCH · arXiv stat.ML · 1d · [2 sources]

Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

Researchers have developed a new method called Self-Supervised Laplace Approximation (SSLA) to directly approximate the posterior predictive distribution in Bayesian models. This approach draws inspiration from self-training techniques and quantifies predictive uncertainty by refitting the model on its own predictions. The SSLA method offers a deterministic, sampling-free approximation that outperforms classical Laplace approximations in predictive calibration for regression tasks, including Bayesian neural networks, while maintaining computational efficiency. AI

IMPACT Offers a more computationally efficient and accurate method for assessing uncertainty in Bayesian models, potentially improving reliability in AI applications.
RESEARCH · arXiv cs.CV · 1d · [2 sources]

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

Researchers have developed two new frameworks for improving 3D hand pose estimation from egocentric camera views. EgoForce utilizes a differentiable forearm representation and a unified transformer to achieve state-of-the-art accuracy across various camera types, reducing MPJPE by up to 28%. EgoEV-HandPose, on the other hand, employs stereo event cameras and a novel KeypointBEV fusion module to jointly estimate bimanual hand poses and recognize gestures, achieving an MPJPE of 30.54mm and 86.87% gesture recognition accuracy. Both methods aim to enhance applications in AR/VR and human-computer interaction by providing more robust and accurate hand tracking. AI

IMPACT These advancements in egocentric hand tracking could significantly improve the realism and interactivity of AR/VR experiences and human-computer interfaces.