April 23, 2026
Information
Five papers from NTT and one paper from NTT Research, Inc. (NTT Research) were accepted at ICLR 2026 (the International Conference on Learning Representations 2026), to be held in Rio de Janeiro, Brazil, from April 23 to 27, 2026. ICLR is a highly selective international conference that has made major contributions to the development of modern AI through presentations of fundamental results and concepts in AI technology. The accepted papers from NTT are as follows.
Model merging, a technique that combines multiple models into a single model, is gaining attention as a promising approach in machine learning. Prior work has suggested that model merging requires a parameter permutation, a complex alignment operation that matches corresponding components across neural networks while preserving their behavior. In this study, we show that model merging can succeed even without such operations, as long as the model is sufficiently large (i.e., has sufficiently wide hidden layers) and the softmax temperature is properly adjusted. We further show that the merged model behaves similarly to an ensemble of the original models, except for the final softmax layer. These findings improve the practicality of model merging, enabling the integration of multiple AI models without sharing training data and contributing to the development of privacy-conscious AI applications.
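As a rough illustration of the basic idea, the sketch below merges two same-architecture classifiers by simple parameter averaging, with no permutation alignment, and divides the merged model's logits by a softmax temperature at inference. The architecture, temperature value, and function names are illustrative assumptions and do not reproduce the paper's exact procedure.

# Minimal sketch of permutation-free model merging by weight averaging,
# followed by a softmax-temperature adjustment at inference time.
# Architecture, temperature value, and data are illustrative assumptions.
import torch
import torch.nn as nn

def merge_by_averaging(model_a: nn.Module, model_b: nn.Module, merged: nn.Module) -> nn.Module:
    """Average the parameters of two same-architecture models into `merged`."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged_state = {k: 0.5 * (state_a[k] + state_b[k]) for k in state_a}
    merged.load_state_dict(merged_state)
    return merged

def predict_with_temperature(model: nn.Module, x: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Apply a softmax with an adjusted temperature to the merged model's logits."""
    with torch.no_grad():
        logits = model(x)
    return torch.softmax(logits / temperature, dim=-1)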
Large language models (LLMs) perform reasoning by using sub-words called "tokens" as the smallest input/output units and sequentially generate tokens according to their "next-token probabilities". It is known that coordinating these next-token probabilities across multiple LLMs enables advanced collaboration, such as "ensemble" methods that dynamically fuse knowledge between LLMs and "portable tuning" that reuses training results. However, such coordination requires LLMs to share the same token vocabulary, limiting its applicability. Therefore, this study establishes a new theory on equivalent transformations for next-token probabilities and develops the world's first technology that allows the token vocabulary to be freely reduced during each LLM's inference without accuracy degradation. By reducing each LLM's vocabulary to the same "maximum common vocabulary", this technology efficiently allows LLMs with different vocabularies to dynamically fuse their knowledge and to reuse learning results.
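To make the idea of a shared vocabulary concrete, the sketch below restricts two LLMs' next-token distributions to the tokens they have in common and averages them. The simple renormalization used here is an assumption for readability and is not the paper's equivalent transformation; the only external API assumed is the Hugging Face-style get_vocab() accessor on tokenizers.

# Illustrative sketch (not the paper's exact transformation): restrict two LLMs'
# next-token distributions to their shared vocabulary and combine them.
# Tokenizer objects and the simple renormalization are assumptions.
import torch

def common_vocab(tokenizer_a, tokenizer_b) -> list[str]:
    """Return the token strings present in both vocabularies."""
    return sorted(set(tokenizer_a.get_vocab()) & set(tokenizer_b.get_vocab()))

def restrict_probs(probs: torch.Tensor, tokenizer, vocab: list[str]) -> torch.Tensor:
    """Keep only the probabilities of tokens in `vocab` and renormalize."""
    ids = torch.tensor([tokenizer.get_vocab()[t] for t in vocab])
    p = probs[ids]
    return p / p.sum()

def ensemble_step(probs_a, tokenizer_a, probs_b, tokenizer_b) -> str:
    """Pick the next token by averaging two distributions over the common vocabulary."""
    vocab = common_vocab(tokenizer_a, tokenizer_b)
    p_a = restrict_probs(probs_a, tokenizer_a, vocab)
    p_b = restrict_probs(probs_b, tokenizer_b, vocab)
    mixed = 0.5 * (p_a + p_b)
    return vocab[int(mixed.argmax())]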
Rotary Position Embeddings (RoPE) are widely used in Transformers to encode positional information in token representations, yet the internal frequency structure of RoPE remains poorly understood. Previous studies have reported conflicting findings on the roles of high- and low-frequency dimensions, offering empirical observations but no unifying explanation. In this paper, we present a systematic framework that bridges these disparate results. We introduce Frequency Entropy (FE), a metric that quantifies the effective utilization of each RoPE frequency dimension, and we provide an analysis of how RoPE’s sinusoidal components contribute to model representations on a per-dimension basis. Based on an analysis of the Llama-4 model, which incorporates both RoPE and NoPE layers, we find that the periodicity captured by FE appears in RoPE layers but not in NoPE layers. Furthermore, FE identifies dimensions in which energy concentrates under RoPE. These characteristics are observed across the spectrum rather than being confined to specific dimensions. Moreover, attenuating extreme-entropy dimensions at inference yields downstream accuracy that is statistically indistinguishable from the baseline, with modest perplexity improvements on average, suggesting that such dimensions are often redundant. Overall, FE provides a simple, general diagnostic for RoPE with implications for analysis and design.
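A simplified, assumption-laden reading of such a diagnostic is sketched below: for one attention head, it measures the Shannon entropy of how each RoPE frequency pair's energy is distributed across positions. The paper's exact definition of Frequency Entropy may differ; the shapes, base value, and energy statistic are illustrative.

# Highly simplified sketch of an entropy-style diagnostic over RoPE frequency
# dimensions; the paper's Frequency Entropy definition may differ.
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def frequency_entropy(query: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Shannon entropy of the per-position energy in each rotated frequency pair.

    query: array of shape (seq_len, head_dim) for one attention head.
    """
    seq_len, head_dim = query.shape
    freqs = rope_frequencies(head_dim, base)                 # (head_dim // 2,)
    angles = np.outer(np.arange(seq_len), freqs)             # (seq_len, head_dim // 2)
    q_even, q_odd = query[:, 0::2], query[:, 1::2]
    rot_even = q_even * np.cos(angles) - q_odd * np.sin(angles)
    rot_odd = q_even * np.sin(angles) + q_odd * np.cos(angles)
    energy = rot_even ** 2 + rot_odd ** 2                    # (seq_len, head_dim // 2)
    p = energy / energy.sum(axis=0, keepdims=True)           # distribution over positions
    return -(p * np.log(p + 1e-12)).sum(axis=0)              # entropy per frequency pair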
Rotary Position Embeddings (RoPE) are widely adopted in LLMs, and it is commonly believed that larger base frequencies θ yield better long-context performance. In this paper, we show that a high-norm RoPE dimension, referred to as the “frequency band,” consistently emerges across multiple models, and we focus on this band to reveal the trade-offs inherent in RoPE. We find that replacing the RoPE dimensions below the frequency band with NoPE during inference has little effect on performance, indicating that these lower-frequency dimensions are only weakly utilized. We further find that the location of the frequency band depends on the RoPE base θ and the training sequence length. Moreover, the band forms early during pre-training and persists even after context extension via position interpolation. Notably, we show that setting θ to the training length shifts the band toward lower frequencies and improves extrapolation, whereas increasing θ enhances interpolation but reduces extrapolation, revealing a clear trade-off between interpolation and extrapolation. We believe this work is a step toward a sharper understanding of positional embeddings in LLMs, with falsifiable diagnostics and practical guidance for choosing θ that support scaling to longer contexts.
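The dependence of the band location on the base θ and the training length can be made tangible with a small back-of-the-envelope calculation, sketched below: it counts how many RoPE frequency pairs complete a full rotation within the training context for a given θ. The head dimension, θ values, and training length are illustrative assumptions, not the paper's experimental settings.

# Illustrative calculation (not the paper's analysis): for a given RoPE base theta
# and training sequence length, count the frequency dimensions that complete at
# least one full rotation within the training context. Values are assumptions.
import math

def rope_wavelengths(head_dim: int, theta: float) -> list[float]:
    """Wavelength (in tokens) of each RoPE frequency pair: 2*pi * theta**(2i/d)."""
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

def dims_within_training_length(head_dim: int, theta: float, train_len: int) -> int:
    """Number of frequency pairs whose full period fits inside the training length."""
    return sum(w <= train_len for w in rope_wavelengths(head_dim, theta))

# Example: with head_dim=128, raising theta from 10_000 to 500_000 reduces the
# number of dimensions that rotate fully within a 4096-token training context.
print(dims_within_training_length(128, 10_000.0, 4096))    # 46
print(dims_within_training_length(128, 500_000.0, 4096))   # 32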
Hawkes processes are probabilistic models used to analyze phenomena in which one event triggers or suppresses subsequent events, such as stock transactions, information diffusion on social networks, and earthquakes. In recent years, methods that combine Hawkes processes with kernel methods have attracted increasing attention, enabling flexible estimation of the triggering kernels, i.e., functions that characterize interactions between events, from event data. However, existing approaches often involve substantial computational costs, making them difficult to apply to large-scale event datasets. In this study, we propose a new kernel method-based approach via the least squares contrast for point processes. By exploiting the structure of this contrast, the proposed method eliminates the computationally intensive optimization required by conventional methods, enabling fast and accurate estimation even for large-scale data. These results are expected to contribute to improving societal safety and efficiency in applications such as disaster prediction and infrastructure management.
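For readers unfamiliar with the objects involved, the sketch below shows a univariate Hawkes intensity with an exponential triggering kernel and a numerical evaluation of the least squares contrast for point processes. It does not reproduce the proposed kernel-method estimator; the kernel form and all parameters are illustrative assumptions.

# Illustrative sketch only: a univariate Hawkes intensity with an exponential
# triggering kernel, plus the least squares contrast evaluated on a time grid.
# The paper's kernel-method estimator is not reproduced; parameters are assumptions.
import numpy as np

def hawkes_intensity(t: float, events: np.ndarray, mu: float, alpha: float, beta: float) -> float:
    """lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))."""
    past = events[events < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def least_squares_contrast(events: np.ndarray, T: float, mu: float, alpha: float, beta: float,
                           grid_size: int = 2000) -> float:
    """C = integral of lambda(t)^2 over [0, T] minus 2 * sum of lambda(t_i)."""
    grid = np.linspace(0.0, T, grid_size)
    lam = np.array([hawkes_intensity(t, events, mu, alpha, beta) for t in grid])
    integral = np.trapz(lam ** 2, grid)
    at_events = sum(hawkes_intensity(t, events, mu, alpha, beta) for t in events)
    return integral - 2.0 * at_events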
Abbreviated names of the laboratories:
HI: NTT Human Informatics Laboratories
SI: NTT Social Informatics Laboratories
CD: NTT Computer and Data Science Laboratories
CS: NTT Communication Science Laboratories
PHI: Physics & Informatics Laboratories
Information is current as of the date of issue of each topic. Please be advised that it may become outdated after that point.