April 30, 2026

Information

NTT's 21 papers accepted for ICASSP2026, the world's largest international conference on signal processing technology

21 papers authored by NTT Laboratories have been accepted at ICASSP 2026 (the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing), the flagship conference on signal processing technology, to be held in Barcelona, Spain from May 4 to May 8, 2026. In addition, we will present a demonstration at a Show and Tell session at the conference.

Abbreviated names of the laboratories:
CS: NTT Communication Science Laboratories
HI: NTT Human Informatics Laboratories
CD: NTT Computer and Data Science Laboratories
(Affiliations are at the time of submission.)

◆Ensemble for Reducing Target Speech Extraction Errors

Tsubasa Ochiai (CS), Marc Delcroix (CS), Naoyuki Kamo (CS), Takanori Ashihara (HI), Naohiro Tawara (CS), Tomohiro Nakatani (CS)

Ensemble methods combine diverse system hypotheses to reduce errors. While they are widely used in speech recognition and speaker diarization, their application in the speech enhancement domain has remained insufficiently explored. In this paper, we propose a novel ensemble approach for target speaker extraction that incorporates a framework for excluding low-quality hypotheses and hypotheses that incorrectly select the target speaker. The proposed approach is expected to enable a wide range of applications with speech interfaces to provide users with more stable speech enhancement results with fewer failures.
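
As a rough illustration of hypothesis-level ensembling, the sketch below scores each extracted waveform against a consensus signal and averages only the survivors. The correlation-based quality proxy and the `keep_threshold` are our assumptions, not the paper's exclusion framework.

```python
# A minimal sketch (not the authors' method) of hypothesis-level ensembling
# for target speech extraction: score each extracted waveform against the
# ensemble consensus, drop outliers, and average the survivors.
import numpy as np

def ensemble_tse(hypotheses: list[np.ndarray], keep_threshold: float = 0.7) -> np.ndarray:
    """hypotheses: same-length waveforms from different TSE systems."""
    stack = np.stack(hypotheses)                 # (n_systems, n_samples)
    consensus = np.median(stack, axis=0)         # robust consensus signal
    # Correlation with the consensus as a crude quality / wrong-speaker proxy.
    scores = np.array([np.corrcoef(h, consensus)[0, 1] for h in stack])
    kept = stack[scores >= keep_threshold]       # exclude low-quality hypotheses
    if len(kept) == 0:                           # fall back to the best single one
        kept = stack[[np.argmax(scores)]]
    return kept.mean(axis=0)

# Example: three mock system outputs for a 1-second signal at 16 kHz.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
outputs = [target + 0.1 * rng.standard_normal(16000) for _ in range(2)]
outputs.append(rng.standard_normal(16000))       # a "wrong speaker" hypothesis
enhanced = ensemble_tse(outputs)
```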

◆Generating Training Targets for Real-world Speech Enhancement via Close-to-distant Microphone Projection

Tomohiro Nakatani (CS), Rintaro Ikeshita (CS), Naoyuki Kamo (CS), Marc Delcroix (CS), Shoko Araki (CS)

Training neural networks for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. Although such data are often generated through simulation, the mismatch between simulated and real recordings can significantly limit SE performance. To address this issue, we propose Close-to-Distant microphone (C2D) projection, a method that generates paired training data directly from real recordings captured by both close and distant microphones. The proposed approach enables a new effective training scheme for real-world SE.
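
To make the projection idea concrete, here is a minimal sketch under our own assumptions: a per-frequency transfer function is estimated from simultaneous close/distant recordings and applied to the close-microphone signal, yielding a training target aligned with the distant microphone. The STFT-domain Wiener solution is a stand-in for the paper's actual C2D formulation.

```python
# A hedged sketch of close-to-distant projection: estimate a per-frequency
# transfer function from paired close/distant recordings, then apply it to
# the (nearly clean) close-mic signal.
import numpy as np
from scipy.signal import stft, istft

def close_to_distant(close: np.ndarray, distant: np.ndarray, fs: int = 16000):
    f, t, C = stft(close, fs=fs, nperseg=512)
    _, _, D = stft(distant, fs=fs, nperseg=512)
    eps = 1e-10
    # Per-bin least-squares relative transfer function (Wiener solution).
    H = (D * C.conj()).mean(axis=1) / ((np.abs(C) ** 2).mean(axis=1) + eps)
    # Project: the close speech as it would appear at the distant microphone.
    _, projected = istft(H[:, None] * C, fs=fs, nperseg=512)
    return projected  # paired training target for the distant-mic input
```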

◆Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm

Anselm Lohmann (Oldenburg Univ.), Tomohiro Nakatani (CS), Rintaro Ikeshita (CS), Marc Delcroix (CS), Shoko Araki (CS), Simon Doclo (Oldenburg Univ.)

Modern speech recognition systems rely on microphone arrays to capture distant speech, but their performance strongly depends on how the reference microphone is chosen. This work introduces a smarter selection method that goes beyond signal-to-noise ratio estimation to also account for how sound reflections affect speech quality. By improving this choice, the proposed approach delivers more accurate speech recognition in real‑world environments, making voice‑driven technologies more reliable and accessible.
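
The sketch below illustrates one plausible reading of the normalized Lp-norm criterion: less reverberant channels are more impulsive, which a p > 2 norm normalized by signal energy rewards. The exact norm, the value of p, and the normalization are our assumptions, not the paper's definition.

```python
# A hedged sketch of reference microphone selection with a normalized Lp norm.
import numpy as np

def select_reference(mics: np.ndarray, p: float = 4.0) -> int:
    """mics: (n_channels, n_samples) array of time-domain signals."""
    lp = (np.abs(mics) ** p).mean(axis=1) ** (1.0 / p)   # per-channel Lp norm
    l2 = (np.abs(mics) ** 2).mean(axis=1) ** 0.5         # energy normalization
    return int(np.argmax(lp / (l2 + 1e-10)))             # most impulsive channel
```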

◆Frontend Token Enhancement for Token-Based Speech Recognition

Takanori Ashihara (HI), Shota Horiguchi (HI), Kohei Matsuura (HI), Tsubasa Ochiai (CS), Marc Delcroix (CS)

To enable robust automatic speech recognition (ASR) under noisy conditions, we proposed and compared a variety of speech enhancement frontends. Specifically, for ASR that takes discrete speech tokens as input, we systematically proposed multiple enhancement methods operating at the token level, in addition to conventional approaches operating at the waveform level. As a result, our approach achieved higher recognition accuracy than conventional ASR systems that use continuous vector inputs. This technology is expected to serve as a frontend for ASR in noisy environments such as streets and in-vehicle settings.
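
As an illustration of what a token-level frontend could look like, the toy below maps discrete tokens extracted from noisy audio to the tokens a clean recording would produce. The architecture and vocabulary size are our stand-ins, not any of the compared frontends.

```python
# A hedged toy of token-level enhancement: denoise in the discrete-token
# domain rather than the waveform domain.
import torch
import torch.nn as nn

class TokenDenoiser(nn.Module):
    def __init__(self, vocab: int = 1024, dim: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        self.out = nn.Linear(dim, vocab)    # predict a clean token per position

    def forward(self, noisy_tokens: torch.Tensor) -> torch.Tensor:
        """noisy_tokens: (batch, seq) int64 token IDs from a noisy recording."""
        return self.out(self.enc(self.emb(noisy_tokens)))
```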

◆Entropy-guided GRVQ for Ultra-Low Bitrate Neural Speech Codec

Yanzhou Ren (Waseda Univ.), Noboru Harada (CS), Daiki Takeuchi (CS), Siyu Chen, Wei Liu, Xiao Zhang, Liyuan Zhang (Waseda Univ.), Takehiro Moriya (CS), and Shoji Makino (Waseda Univ.)

We propose an entropy-guided group residual vector quantization (EG-GRVQ) for an ultra-low bitrate neural speech codec, which retains a semantic branch for linguistic information and incorporates an entropy-guided grouping strategy in the acoustic branch. Assuming that channel activations follow approximately Gaussian statistics, the variance of each channel can serve as a principled proxy for its information content. The proposed scheme shows improvements in perceptual quality and intelligibility metrics under ultra-low bitrate with a focus on codec-level fidelity for communication-oriented scenarios. This technology will enable higher‑quality calls in situations where transmission bandwidth is limited, such as with satellite communication services.
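
The entropy-guided grouping idea can be sketched directly from the Gaussian assumption stated above: each channel's variance yields a differential entropy, and channels are assigned to quantizer groups so that per-group information is balanced. The greedy assignment and all names below are our illustration, not the paper's code.

```python
# Per-channel differential entropy under the Gaussian assumption:
# H = 0.5 * log(2 * pi * e * var). Channels are then grouped greedily
# so total entropy per quantizer group is roughly balanced.
import numpy as np

def entropy_guided_groups(activations: np.ndarray, n_groups: int) -> list[list[int]]:
    """activations: (n_channels, n_frames) latent features to be quantized."""
    var = activations.var(axis=1)
    entropy = 0.5 * np.log(2 * np.pi * np.e * var + 1e-10)
    groups = [[] for _ in range(n_groups)]
    load = np.zeros(n_groups)
    for ch in np.argsort(entropy)[::-1]:     # most informative channels first
        g = int(np.argmin(load))             # put into the lightest group
        groups[g].append(int(ch))
        load[g] += entropy[ch]
    return groups
```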

◆VBx for End-to-end Neural and Clustering-based Diarization

Petr Palka (BUT), Jiangyu Han (BUT), Marc Delcroix (CS), Naohiro Tawara (CS), Lukas Burget (BUT)

This work improves speaker diarization, the technology that determines “who spoke when,” by enhancing a two‑stage neural diarization framework that first detects speaker activity and then clusters speakers based on their embeddings across time. The second stage is made more robust by filtering out unreliable speaker embeddings from short segments and by using the state-of-the-art clustering approach VBx, so the system works well across diverse conditions without extensive tuning. It enables more accurate and scalable speaker diarization for real‑world applications such as meetings.
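
A minimal sketch of the second stage, with ordinary agglomerative clustering standing in for VBx and a simple duration filter for unreliable embeddings (all thresholds are illustrative):

```python
# Short segments yield noisy embeddings, so they are excluded before
# clustering; -1 marks filtered segments. Requires a recent scikit-learn
# (the `metric` parameter replaced `affinity`).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(embeddings, durations, min_dur=0.5, n_speakers=2):
    """embeddings: (n_segments, dim); durations: seconds per segment."""
    keep = np.asarray(durations) >= min_dur      # drop short, noisy segments
    labels = np.full(len(embeddings), -1)
    clust = AgglomerativeClustering(n_clusters=n_speakers, metric="cosine",
                                    linkage="average")
    labels[keep] = clust.fit_predict(np.asarray(embeddings)[keep])
    return labels
```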

◆Loose Coupling of Spectral and Spatial Models for Multi-channel Diarization and Enhancement of Meetings in Dynamic Environments

Adrian Meise (Paderborn Univ.), Tobias Cord-Landwehr (Paderborn Univ.), Christoph Boeddeker (Paderborn Univ.), Marc Delcroix (CS), Tomohiro Nakatani (CS), Reinhold Haeb-Umbach (Paderborn Univ.)

This work proposes a new multi-microphone speaker diarization model that flexibly combines spectral features with spatial cues, allowing the system to track speakers even as their positions change. This is realized by introducing a novel joint spatial and spectral mixture model, whose two submodels are loosely coupled by modeling the relationship between speaker and position index probabilistically. The approach opens the possibility for more robust speaker diarization, which is a key component of, for example, meeting transcription systems.
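
The loose coupling can be sketched as a joint responsibility computation: spectral and spatial likelihoods are evaluated independently and tied together only through a probabilistic table P(position | speaker). The placeholder likelihoods below are ours, not the paper's model.

```python
# A hedged sketch of loosely coupled spectral/spatial mixture models:
# the coupling table is the only place the two submodels interact.
import numpy as np

def joint_responsibility(spec_lik, spat_lik, pos_given_spk):
    """spec_lik: (n_speakers,) spectral likelihoods for one frame;
    spat_lik: (n_positions,) spatial likelihoods;
    pos_given_spk: (n_speakers, n_positions) coupling table P(m | k)."""
    joint = spec_lik[:, None] * pos_given_spk * spat_lik[None, :]
    return joint / joint.sum()   # posterior over (speaker, position) pairs
```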

◆Spatially Aware Self-Supervised Models for Multi-Channel Neural Speaker Diarization*

Jiangyu Han (BUT), Ruoyu Wang (USTC), Yoshiki Masuyama (MERL), Marc Delcroix (CS), Johan Rohdin (BUT), Jun Du (USTC), Lukáš Burget (BUT)

Speaker diarization identifies “who spoke when” in multi‑speaker recordings, but modern systems based on models like WavLM are typically trained for a single microphone. This work proposes a simple way to make these models spatially aware, enabling effective use of multiple microphones without heavy computation or specialized hardware. The approach yields more accurate diarization that scales across devices and supports applications such as meeting analysis and smart assistants.

*The above paper is an outcome of the 2025 Jelinek Workshop on Speech and Language Technologies.

◆Mixtures of Lightweight Articulatory Experts for Multilingual ASR

Masato Mimura (HI), Jaeyoung Lee (HI), Ryo Magoshi (Kyoto University), Tatsuya Kawahara (Kyoto University)

Multilingual speech recognition typically requires large-scale neural networks, as the model must encode language-specific information such as diverse writing systems and grammatical structures. In addition, when the target languages are neither geographically nor linguistically related, multilingual learning is prone to negative knowledge transfer, in which performance deteriorates across languages despite an apparent increase in the amount of training data. These issues can be alleviated by explicitly incorporating knowledge of articulatory features, which describe the movements of the speech organs during speech production, into the network architecture.
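
As a toy illustration of why articulatory features transfer across languages, phones from unrelated languages collapse onto a small shared space of articulator attributes that lightweight experts can specialize on. The tiny inventory and routing rule below are our own, not the paper's design.

```python
# A hedged toy: a language-agnostic articulatory lookup used to route
# sounds to shared experts, regardless of the source language.
ARTICULATORY = {              # phone -> (place, manner, voiced)
    "p": ("bilabial", "stop", False),
    "b": ("bilabial", "stop", True),
    "s": ("alveolar", "fricative", False),
    "m": ("bilabial", "nasal", True),
}

def expert_for(phone: str) -> str:
    # Route by manner of articulation: a "stop" expert serves p/b-like
    # sounds in any language, mitigating negative cross-lingual transfer.
    return ARTICULATORY[phone][1]
```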

◆Decoder-Only Conformer with Modality-Aware Sparse Mixtures of Experts for ASR

Jaeyoung Lee (HI), Masato Mimura (HI)

We proposed a method for applying the decoder-only architecture, which is commonly used in large language models (LLMs), to speech recognition. While a key challenge of the decoder-only approach is handling two modalities (speech and text) simultaneously within a single model, we achieved higher accuracy than existing encoder–decoder approaches by introducing a modality-specific Mixture of Experts. This technique is expected to be used in technologies that integrate LLMs with speech recognition models.
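
A minimal sketch (our illustration, not the paper's architecture) of modality-aware routing: speech positions and text positions in the decoder-only sequence are sent to separate feed-forward experts. A real sparse MoE would dispatch tokens to experts rather than compute both densely, as done here for simplicity.

```python
# A hedged sketch of modality-aware experts inside a decoder-only model.
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.speech_expert = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                           nn.Linear(hidden, dim))
        self.text_expert = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                         nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor, is_speech: torch.Tensor) -> torch.Tensor:
        """x: (batch, seq, dim); is_speech: (batch, seq) boolean modality mask."""
        return torch.where(is_speech.unsqueeze(-1),
                           self.speech_expert(x), self.text_expert(x))
```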

◆Chunkwise Aligners for Streaming Speech Recognition

Wen Shen Teo (The University of Electro-Communications), Takafumi Moriya (HI), Masato Mimura (HI)

In practical streaming automatic speech recognition (ASR), Transducer-based modeling approaches are widely used; however, they suffer from extremely high computational costs during training. To address this issue, we propose a novel modeling approach based on an Aligner, together with a corresponding training and inference framework. While maintaining recognition accuracy comparable to that of Transducers, the proposed method achieves more than a twofold speedup in both training and inference. This technology is expected to be utilized as an effective next-generation ASR technology in real-world environments.

◆MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows

Takuhiro Kaneko (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS), Yuto Kondo (CS)

In voice conversion (VC), diffusion-based methods have attracted considerable attention for their high speech quality and speaker similarity; however, their reliance on iterative inference results in high computational cost. In this study, we propose a novel VC method, MeanVoiceFlow, which performs VC with a single forward pass and requires no pretraining. Experimental results demonstrate that MeanVoiceFlow achieves performance comparable to that of conventional multi-step inference and distillation-based methods, highlighting its potential as an important technique for fast and high-quality VC.
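
The one-step property can be seen in a toy mean-flow example: if a model predicts the average velocity of a straight flow over the whole time interval, a single Euler step lands exactly on the endpoint. The closed-form "network" below is purely illustrative.

```python
# A hedged toy of the mean-flow idea behind one-step generation.
import numpy as np

target = np.array([1.0, -2.0, 0.5])        # pretend converted-voice latent

def mean_velocity(x, t0=0.0, t1=1.0):
    # For a straight (rectified) flow toward `target`, the average velocity
    # over [t0, t1] is the displacement per unit time.
    return (target - x) / (t1 - t0)

x0 = np.zeros(3)                           # source latent
x1 = x0 + mean_velocity(x0)                # one step reaches the endpoint
assert np.allclose(x1, target)             # no iterative sampling needed
```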

◆Class-Aware Permutation-Invariant Signal-to-Distortion Ratio for Semantic Segmentation of Sound Scene With Same-Class Sources

Binh Thien Nguyen (CS), Masahiro Yasuda (CS), Daiki Takeuchi (CS), Daisuke Niizumi (CS), and Noboru Harada (CS)

Spatial Semantic Segmentation of Sound Scenes (S5) systems aim to jointly detect and separate sound events from multichannel audio mixtures. We address the problem of same-class sources in the S5 task by enabling duplicated label prediction, introducing a class-aware permutation-invariant loss to handle duplicate-label queries, and redefining the evaluation metric. Advancements in S5 technology will accelerate real-world applications, including immersive communication and smart acoustic monitoring systems.
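
The class-aware matching can be sketched as permutation-invariant training restricted to within-class permutations. The toy SDR and the assumption that each estimate already carries a class prediction matching the reference inventory are ours:

```python
# A hedged sketch of class-aware permutation-invariant SDR matching:
# estimates are permuted against references only within the same class.
import numpy as np
from itertools import permutations

def sdr(ref, est, eps=1e-10):
    return 10 * np.log10((ref ** 2).sum() / (((ref - est) ** 2).sum() + eps))

def class_aware_pit_sdr(refs, ests, classes):
    """refs, ests: lists of waveforms; classes: event label per source."""
    total = 0.0
    for c in set(classes):
        idx = [i for i, k in enumerate(classes) if k == c]
        total += max(                           # best within-class permutation
            sum(sdr(refs[i], ests[j]) for i, j in zip(idx, perm))
            for perm in permutations(idx)
        )
    return total / len(refs)
```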

◆Task-Oriented Sound Privacy Preservation for Sound Event Detection via End-to-End Adversarial Multi-Task Learning

Nao Sato (CD), Masahiro Yasuda (CD/CS), Shoichiro Saito (CD)

Sound-based environmental recognition technology is expected to support daily life through applications like monitoring and security. However, its real-world implementation requires continuous recording in residential or public spaces, which can raise privacy concerns. In this study, we proposed a novel task-oriented approach using adversarial learning that strikes an optimal balance between two conflicting objectives: de-identifying privacy-related information while retaining environmental sounds essential for application execution. Our work contributes to the realization of sound-based environmental recognition systems that can be used securely without privacy concerns.
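
One common way to realize such adversarial multi-task learning is a gradient reversal layer, sketched below under our own assumptions about the encoder and heads; the paper's actual objective and architecture may differ.

```python
# A hedged sketch: the event head is trained normally, while gradient
# reversal trains the encoder to *hurt* the privacy (e.g., speaker) head.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None      # flip the gradient toward the encoder

class PrivacyPreservingSED(nn.Module):
    def __init__(self, feat_dim=64, n_events=10, n_speakers=100, lam=1.0):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)
        self.event_head = nn.Linear(128, n_events)      # task to preserve
        self.privacy_head = nn.Linear(128, n_speakers)  # task to suppress
        self.lam = lam

    def forward(self, x):
        h, _ = self.encoder(x)            # x: (batch, time, feat_dim)
        events = self.event_head(h)
        privacy = self.privacy_head(GradReverse.apply(h, self.lam))
        return events, privacy
```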

◆Microphone-less Measurement of Three-dimensional Radiating Impulse Response of Sound Source using Spherical Harmonic-domain Acousto-optic Tomography

Yuzuki Saito (Waseda Univ.), Kenji Ishikawa (CS), Risako Tanigawa (CS), Yasuhiro Oikawa (Waseda Univ.)

In acoustic engineering, the measurement of impulse responses (IRs) is a fundamental technique for characterizing sound sources and understanding sound propagation. However, because such measurements are typically performed using microphones, they are limited in terms of measurable positions and spatial resolution. In this study, we focus on acousto-optic tomography for three-dimensional sound field measurement and realize a technique that enables high-resolution, non-contact measurement of three-dimensional IRs over the full sphere without using microphones. This achievement allows detailed characterization of the three-dimensional acoustic properties of sound sources and is expected to contribute to the advancement of spatial audio technologies and noise source detection systems.
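
For context (our addition, not the paper's equations), spherical-harmonic-domain processing of a radiating source builds on the standard exterior expansion of the sound field:

```latex
% Exterior (radiating) field expansion under the e^{j\omega t} convention:
% h_n^{(2)} is the spherical Hankel function of the second kind,
% Y_n^m the spherical harmonics, and k = \omega / c the wavenumber.
p(r, \theta, \phi, \omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n}
  c_{nm}(\omega)\, h_n^{(2)}(kr)\, Y_n^m(\theta, \phi)
```

Recovering the coefficients c_nm over the full sphere is what characterizes a source's three-dimensional radiation.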

◆Secondary Source Placement for Sound Field Control based on Ising Model

Shihori Kozuka (CD), Shoichi Koyama (NII), Hiroaki Itou (CD), and Noriyoshi Kamado (CD)

This study presents a method for rapidly searching for optimal placement of multiple loudspeakers in sound field control technology that delivers desired sound to specific regions. The method applies the Ising model, which can solve combinatorial optimization problems at high speed, and reduces computation time to a few hundredths of that of conventional methods while achieving high-precision placement. Our method enables sound control with enhanced flexibility using a minimal number of loudspeakers and reduces equipment costs. The technology is expected to efficiently reduce noise in large-scale venues such as stadiums and in open spaces such as outdoor areas through its application to active noise control.
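
A toy formulation of the placement search as an Ising/QUBO problem: one binary variable per candidate position, with an energy that trades sound field error against the number of active loudspeakers. Brute force stands in for the Ising machine here, and the energy design is our illustration, not the paper's.

```python
# A hedged sketch of loudspeaker placement as combinatorial optimization.
import numpy as np
from itertools import product

def placement_energy(s, G, p_des, penalty=0.1):
    """s: 0/1 vector over candidates; G: (n_points, n_candidates) transfer
    matrix; p_des: desired pressure at the control points."""
    active = np.flatnonzero(s)
    if len(active) == 0:
        return np.inf
    # Least-squares driving signals for the active loudspeakers only.
    d, *_ = np.linalg.lstsq(G[:, active], p_des, rcond=None)
    err = np.linalg.norm(G[:, active] @ d - p_des) ** 2
    return err + penalty * len(active)    # accuracy vs. equipment cost

def best_placement(G, p_des, n_candidates):
    # Exhaustive search (feasible only for small n); an Ising machine
    # explores this same energy landscape far faster.
    configs = product([0, 1], repeat=n_candidates)
    return min(configs, key=lambda s: placement_energy(np.array(s), G, p_des))
```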

◆Low-Frequency Harmonic Control for Speech Intelligibility in Open-Ear Headphones

Yuki Watanabe (CD), Hironobu Chiba (CD), Yutaka Kamamoto (CD), Tatsuya Kako (CD)

This paper presents a low-complexity processing method to improve the intelligibility of speech from open-ear headphones in noisy environments without increasing the volume. This method utilizes a low-complexity filter based on a comb filter to control the energy of fundamental frequency and low-order harmonics, reducing the effects of low-frequency environmental noise and the output limitations of small loudspeaker units. A subjective evaluation shows that applying our method significantly improves speech intelligibility. In the near future, this technology is expected to provide comfortable speech communication, even in noisy environments without excessively increasing the volume.
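
The comb-filter mechanism can be sketched in a few lines: delaying the signal by one pitch period and adding it back boosts the fundamental and its harmonics. The fixed gain and the absence of any noise-adaptive control are simplifications on our part.

```python
# A hedged sketch of low-frequency harmonic emphasis with a comb filter.
import numpy as np

def harmonic_comb(x: np.ndarray, f0_hz: float, fs: int = 16000,
                  gain: float = 0.5) -> np.ndarray:
    period = int(round(fs / f0_hz))       # delay of one pitch period
    y = x.copy()
    y[period:] += gain * x[:-period]      # response peaks at f0, 2*f0, 3*f0, ...
    return y / (1 + gain)                 # keep overall level roughly constant
```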

◆Stylized Text-to-Motion Synthesis via Multi-Condition Latent Diffusion

Fanglu Xie (HI), Tsukasa Shiota (HI), Motohiro Takagi (HI), Edgar Simo-Serra (Waseda University)

We propose a multi‑condition motion latent diffusion model that integrates style and trajectory information into text‑driven motion generation. By explicitly incorporating stylized human motion and trajectories, our method enables natural motion generation in accordance with the specified style while providing precise control over arbitrary trajectories. This technology is expected to be applied to humanoid robot motion generation and broader areas such as human interaction.

We also present a demonstration at a Show and Tell session:

◆Real-Time Demo of Single-Channel Target Speaker Extraction Using State-Space Modeling

Hiroshi Sato (HI), Takafumi Moriya (HI), Marc Delcroix (CS), Tsubasa Ochiai (CS), Taichi Asami (HI)

Target speaker extraction (TSE) is a technology that extracts only the voice of a specific speaker from audio containing background noise and competing talkers. This presentation demonstrates an on-device, real-time TSE system based on a lightweight Conv-TasNet architecture enhanced with a state-space model (SSM), running with low latency on a laptop CPU. This technology is expected to contribute to improving everyday voice communication for general users, for example by enhancing speech clarity in noisy environments and online meetings.
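
The streaming appeal of state-space modeling can be seen in a minimal recurrence: inference is a constant-time state update per incoming frame, so latency and memory do not grow with input length. The matrices below are placeholders, not the demo system's parameters.

```python
# A hedged sketch of streaming SSM inference with a fixed-size state.
import numpy as np

def ssm_stream(x, A, B, C):
    """x: (n_frames, in_dim); A: (state, state); B: (state, in_dim);
    C: (out_dim, state). Returns (n_frames, out_dim)."""
    h = np.zeros(A.shape[0])
    out = np.empty((len(x), C.shape[0]))
    for t, frame in enumerate(x):         # one step per incoming frame
        h = A @ h + B @ frame             # constant-time state update
        out[t] = C @ h
    return out
```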

Information is current as of the date of issue of the individual topics.
Please be advised that information may be outdated after that point.