Listen Where You Describe: Multi-channel Target Speaker Extraction via Personality and Position Queries

Task


Our proposed SEPP framework, which facilitates TSE with personality (purple) and position queries (red).

Abstract

Target Speaker Extraction (TSE) identifies and isolates the voice of a specific speaker in a cocktail party scenario and plays a pivotal role in numerous real-world applications. Recent advances in personality-query TSE have enabled speaker extraction based on high-level speaker characteristics such as gender and language. However, these approaches struggle when the target speaker must be distinguished by spatial location or when multiple candidates exhibit highly similar personality traits. To address these limitations, we introduce the SEPP framework, which incorporates both personality and position cues expressed via natural language descriptions. To mitigate the scarcity of suitable datasets, we further construct and annotate a new benchmark dataset, QuerySpeaker. Extensive experiments show that our framework outperforms existing approaches and validate the complementary benefits of combining personality and position cues for precise speaker extraction.

SEPP framework


Our proposed SEPP network estimates the clean target speech \( \hat{\mathbf{z}}(\tau) \) from a multi-channel speech mixture \( \mathbf{y}^m(\tau) \) and a natural language query \( Q \). Specifically, the PPCP parses \( Q \) into position and personality components: the position component is quantized via predefined angle/distance maps to guide extraction of \( L_S \) (IPD, ILD, \( \xi_{\text{ang}} \) and \( \xi_{\text{dis}} \)); the personality component is encoded via a text encoder to produce \( L_T \).
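The query-splitting step described above can be illustrated with a minimal sketch. It is not the authors' PPCP: the phrase tables, the bin indices, the `ParsedQuery` type, and the `parse_query` function are all hypothetical, chosen only to show how a query might be separated into a quantized position component and a personality component. The actual angle/distance maps and the text encoder are not specified on this page.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Illustrative phrase-to-bin tables (assumed; not the authors' actual maps).
ANGLE_BINS = {"left front": 0, "right front": 1, "left rear": 2, "right rear": 3}
DISTANCE_BINS = {"near": 0, "medium": 1, "far": 2}
# Illustrative personality vocabulary (assumed).
PERSONALITY_TERMS = ("male", "female", "mandarin", "english")


@dataclass
class ParsedQuery:
    angle_bin: Optional[int]     # quantized direction, None if absent
    distance_bin: Optional[int]  # quantized distance, None if absent
    personality: List[str]       # terms that would go to a text encoder


def parse_query(query: str) -> ParsedQuery:
    """Split a natural language query into position and personality parts."""
    text = query.lower()
    # Position component: look up the first matching phrase in each table.
    angle = next((b for p, b in ANGLE_BINS.items() if p in text), None)
    distance = next(
        (b for p, b in DISTANCE_BINS.items() if re.search(rf"\b{p}\b", text)),
        None,
    )
    # Personality component: whole-word matches, so "male" does not hit "female".
    personality = [t for t in PERSONALITY_TERMS if re.search(rf"\b{t}\b", text)]
    return ParsedQuery(angle, distance, personality)


if __name__ == "__main__":
    print(parse_query("Extract the male speaker at a medium distance on the left rear."))
    print(parse_query("Extract speech in Mandarin."))
```

In this sketch a query without position words simply leaves both position fields as `None`, which mirrors the personality-only queries in the demos below.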

Demos

mixture

Q:“Extract speech in Mandarin.”

gt_spk1

Q:“Extract the male speaker.”

Q:“Extract the male speaker at a medium distance on the left rear.”

Results

Results of TSE methods with different queries. \(^{\dagger}\) indicates a fine-tuned version. Bold values indicate the best performance. For all metrics, higher is better \((\uparrow)\).

Performance comparison with different cue fusion types. \(L_T\) and \(L_S\) denote the personality cue and the position cue, respectively.

Performance comparison for SEPP using different personality cue types on the QuerySpeaker dataset.

Performance of the proposed method across different azimuth regions for 2-mix conditions.

Performance of SEPP across different distance ranges on the simulated data.