Array geometry agnostic multi-channel personalized speech enhancement

US12230259B2 · US · B2

Patent metadata
Publication number: US-12230259-B2
Application number: US-202117555332-A
Country: US
Kind code: B2
Filing date: Dec 17, 2021
Priority date: Oct 5, 2021
Publication date: Feb 18, 2025
Grant date: Feb 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Examples of array geometry agnostic multi-channel personalized speech enhancement (PSE) extract speaker embeddings, which represent acoustic characteristics of one or more target speakers, from target speaker enrollment data. Spatial features (e.g., inter-channel phase difference) are extracted from input audio captured by a microphone array. The input audio includes a mixture of speech data of the target speaker(s) and one or more interfering speaker(s). The input audio, the extracted speaker embeddings, and the extracted spatial features are provided to a trained geometry-agnostic PSE model. Output data is produced, which comprises estimated clean speech data of the target speaker(s) that has a reduction (or elimination) of speech data of the interfering speaker(s), without the trained PSE model requiring geometry information for the microphone array.
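The abstract names inter-channel phase difference (IPD) as an example spatial feature. As a rough illustration only, the sketch below computes IPD in the common textbook way: the wrapped phase difference between each channel's short-time spectrum and that of a reference channel. The function names, frame length, hop size, and choice of reference channel are assumptions made for this example and are not taken from the patent.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform of a 1-D signal using a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)  # shape: (n_frames, frame_len // 2 + 1)

def inter_channel_phase_difference(audio, ref=0, frame_len=512, hop=256):
    """IPD of every non-reference channel relative to channel `ref`.

    audio: array of shape (n_channels, n_samples).
    Returns an array of shape (n_channels - 1, n_frames, n_bins) whose
    values are phase differences in radians, wrapped to [-pi, pi].
    """
    spec = np.stack([stft(ch, frame_len, hop) for ch in audio])
    others = np.delete(spec, ref, axis=0)
    # The angle of the cross-spectrum equals phase(other) - phase(ref), wrapped.
    return np.angle(others * np.conj(spec[ref]))

# Illustrative input: two channels of noise (hypothetical data, not from the patent).
rng = np.random.default_rng(0)
two_channel_audio = rng.standard_normal((2, 16000))
ipd = inter_channel_phase_difference(two_channel_audio)
```

Note that this feature has one slice per non-reference channel, so its size depends on the number of microphones. A model that must work without geometry information needs inputs whose size does not change with the array.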

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to: extract speaker embeddings from enrollment data for a first target speaker; extract spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; provide the input audio, the extracted speaker embeddings, and the extracted spatial features to a geometry-agnostic personalized speech enhancement (PSE) model trained with a virtual microphone signal comprising a combination of outputs of microphones in a multi-channel microphone array having different array geometries; and produce output data using the geometry-agnostic PSE model without geometry information for the microphone array, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.

2. The system of claim 1, wherein the multi-channel microphone array having different array geometries used during training does not include a microphone array geometry used to capture the input audio, and wherein training the PSE model does not include training the PSE model with speech data of the first target speaker, or a second target speaker, or the interfering speaker.

3. The system of claim 1, wherein the output data comprises audio of the estimated clean speech data of the first target speaker, and wherein the instructions are further operative to: generate, from the output data, a transcript of the estimated clean speech data of the first target speaker.

4. The system of claim 1, wherein the speaker embeddings are extracted from enrollment data for the first target speaker and a second target speaker, and wherein the output data further comprises estimated clean speech data of the second target speaker.

5. The system of claim 1, wherein the input audio comprises real-time audio, and wherein producing the output data comprises producing the output data in real-time.

6. The system of claim 1, wherein the spatial features comprise an inter-channel phase difference (IPD).

7. A computerized method comprising: extracting speaker embeddings from enrollment data for a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted speaker embeddings, and the extracted spatial features to a geometry-agnostic personalized speech enhancement (PSE) model trained with a virtual microphone signal comprising a combination of outputs of microphones in a multi-channel microphone array having different array geometries; and producing output data using the geometry-agnostic PSE model without geometry information for the microphone array, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.

8. The method of claim 7, wherein the multi-channel microphone array having different array geometries used during training does not include a microphone array geometry used to capture the input audio, and wherein training the PSE model does not include training the PSE model with speech data of the first target speaker, or a second target speaker, or the interfering speaker.

9. The method of claim 7, wherein the output data comprises audio of the estimated clean speech data of the first target speaker, and wherein the method further comprises: generating, from the output data, a transcript of the estimated clean speech data of the first target speaker.

10. The method of claim 7, wherein the speaker embeddings are extracted from enrollment data for the first target speaker and a second target speaker, and wherein the output data further comprises estimated clean speech data of the second target speaker.

11. The method of claim 7, wherein the input audio comprises real-time audio, and wherein producing the output data comprises producing the output data in real-time.

12. The method of claim 7, wherein the spatial features comprise an inter-channel phase difference (IPD).

13. One or more computer storage devices having computer-executable instructions stored thereon, which, on execution by a computer, cause the computer to perform operations comprising: extracting speaker embeddings from enrollment data for a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted speaker embeddings, and the extracted spatial features to a geometry-agnostic personalized speech enhancement (PSE) model trained with a virtual microphone signal comprising a combination of outputs of microphones in a multi-channel microphone array having different array geometries; and using the geometry-agnostic PSE model without geometry information for the microphone array, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.

14. The one or more computer storage devices of claim 13, wherein the output data comprises audio of the estimated clean speech data of the first target speaker, and wherein the operations further comprise: generating, from the output data, a transcript of the estimated clean speech data of the first target speaker.

15. The one or more computer storage devices of claim 13, wherein the speaker embeddings are extracted from enrollment data for the first target speaker and a second target speaker, and wherein the output data further comprises estimated clean speech data of the second target speaker.

16. The one or more computer storage devices of claim 13, wherein the input audio comprises real-time audio, and wherein producing the output data comprises producing the output data in real-time.

17. The one or more computer storage devices of claim 13, wherein the spatial features comprise an inter-channel phase difference (IPD).

18. The one or more computer storage devices of claim 13, wherein the multi-channel microphone array having different array geometries used during training does not include a microphone array geometry used to capture the input audio, and wherein training the PSE model does not include training the PSE model with speech data of the first target speaker, or a second target speaker, or the interfering speaker.
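The independent claims refer to a "virtual microphone signal comprising a combination of outputs of microphones" from arrays with different geometries. The claims do not say what the combination is, so the sketch below assumes an equal-weight average across channels purely for illustration. It then computes the phase difference between one reference microphone and that virtual microphone, which yields features of the same shape whether the array has three microphones or seven. All function names, the log-magnitude and cos/sin feature layout, and the random test signals are hypothetical.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform of a 1-D signal using a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def virtual_microphone(audio):
    """Assumed combination: equal-weight average over all channels.

    audio: array of shape (n_channels, n_samples).
    """
    return audio.mean(axis=0)

def fixed_size_features(audio, ref=0):
    """Build features whose shape does not depend on the channel count."""
    ref_spec = stft(audio[ref])
    virt_spec = stft(virtual_microphone(audio))
    # Phase difference between the reference mic and the virtual mic.
    ipd = np.angle(ref_spec * np.conj(virt_spec))
    # Stack log-magnitude of the reference channel with cos/sin of the IPD.
    return np.stack([np.log1p(np.abs(ref_spec)), np.cos(ipd), np.sin(ipd)])

# Two hypothetical arrays with different channel counts (random test signals).
rng = np.random.default_rng(0)
three_mics = rng.standard_normal((3, 16000))
seven_mics = rng.standard_normal((7, 16000))

feats3 = fixed_size_features(three_mics)
feats7 = fixed_size_features(seven_mics)
print(feats3.shape, feats7.shape)
```

Because both feature tensors have the same shape, a single model could in principle consume either array without being told how many microphones there are or where they sit. How the claimed model actually conditions on the speaker embeddings is not described on this page and is not sketched here.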

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • Procedures used during a speech recognition process, e.g. man-machine dialogue

  • Noise filtering

  • Artificial neural networks; Connectionist approaches

  • Training, enrolment or model building

  • the noise being separate speech, e.g. cocktail party

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12230259B2 cover?
Examples of array geometry agnostic multi-channel personalized speech enhancement (PSE) extract speaker embeddings, which represent acoustic characteristics of one or more target speakers, from target speaker enrollment data. Spatial features (e.g., inter-channel phase difference) are extracted from input audio captured by a microphone array. The input audio includes a mixture of speech data of…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L21/0208. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 18, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).