Microphone array based deep learning for time-domain speech signal extraction
US-11508388-B1 · Nov 22, 2022 · US
US12230259B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12230259-B2 |
| Application number | US-202117555332-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 17, 2021 |
| Priority date | Oct 5, 2021 |
| Publication date | Feb 18, 2025 |
| Grant date | Feb 18, 2025 |
Official abstract text for this publication.
Examples of array geometry agnostic multi-channel personalized speech enhancement (PSE) extract speaker embeddings, which represent acoustic characteristics of one or more target speakers, from target speaker enrollment data. Spatial features (e.g., inter-channel phase difference) are extracted from input audio captured by a microphone array. The input audio includes a mixture of speech data of the target speaker(s) and one or more interfering speaker(s). The input audio, the extracted speaker embeddings, and the extracted spatial features are provided to a trained geometry-agnostic PSE model. Output data is produced, which comprises estimated clean speech data of the target speaker(s) that has a reduction (or elimination) of speech data of the interfering speaker(s), without the trained PSE model requiring geometry information for the microphone array.
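The abstract names inter-channel phase difference (IPD) as an example spatial feature. The following is a minimal sketch of how IPD could be computed for one frame of two-channel audio, not the patent's implementation. The function names `dft_half` and `ipd`, the single-frame naive DFT, and the synthetic two-microphone signal are all assumptions made for illustration. A practical system would more likely use a windowed short-time Fourier transform from an FFT library.

```python
import cmath
import math


def dft_half(frame):
    """Naive DFT of one real-valued frame, keeping bins 0..N/2."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def ipd(ref_frame, other_frame):
    """Per-bin phase of other_frame relative to ref_frame, wrapped to [-pi, pi]."""
    ref = dft_half(ref_frame)
    other = dft_half(other_frame)
    # Multiplying by the conjugate subtracts the reference phase in each bin.
    return [cmath.phase(o * r.conjugate()) for r, o in zip(ref, other)]


# Hypothetical two-microphone frame: a tone centered on bin 4 that reaches
# the second microphone 2 samples later than the first.
N, BIN, DELAY = 64, 4, 2
mic0 = [math.sin(2 * math.pi * BIN * t / N) for t in range(N)]
mic1 = [math.sin(2 * math.pi * BIN * (t - DELAY) / N) for t in range(N)]

phase_at_tone = ipd(mic0, mic1)[BIN]
```

For this synthetic tone, the IPD at the tone's bin equals the phase shift introduced by the delay, `-2 * pi * BIN * DELAY / N`. Bins with negligible energy yield phases dominated by numerical noise, so only bins carrying signal are meaningful here.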
Opening claim text (preview).
What is claimed is:
1. A system comprising: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to: extract speaker embeddings from enrollment data for a first target speaker; extract spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; provide the input audio, the extracted speaker embeddings, and the extracted spatial features to a geometry-agnostic personalized speech enhancement (PSE) model trained with a virtual microphone signal comprising a combination of outputs of microphones in a multi-channel microphone array having different array geometries; and produce output data using the geometry-agnostic PSE model without geometry information for the microphone array, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
2. The system of claim 1, wherein the multi-channel microphone array having different array geometries used during training does not include a microphone array geometry used to capture the input audio, and wherein training the PSE model does not include training the PSE model with speech data of the first target speaker, or a second target speaker, or the interfering speaker.
3. The system of claim 1, wherein the output data comprises audio of the estimated clean speech data of the first target speaker, and wherein the instructions are further operative to: generate, from the output data, a transcript of the estimated clean speech data of the first target speaker.
4. The system of claim 1, wherein the speaker embeddings are extracted from enrollment data for the first target speaker and a second target speaker, and wherein the output data further comprises estimated clean speech data of the second target speaker.
5. The system of claim 1, wherein the input audio comprises real-time audio, and wherein producing the output data comprises producing the output data in real-time.
6. The system of claim 1, wherein the spatial features comprise an inter-channel phase difference (IPD).
7. A computerized method comprising: extracting speaker embeddings from enrollment data for a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted speaker embeddings, and the extracted spatial features to a geometry-agnostic personalized speech enhancement (PSE) model trained with a virtual microphone signal comprising a combination of outputs of microphones in a multi-channel microphone array having different array geometries; and producing output data using the geometry-agnostic PSE model without geometry information for the microphone array, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
8. The method of claim 7, wherein the multi-channel microphone array having different array geometries used during training does not include a microphone array geometry used to capture the input audio, and wherein training the PSE model does not include training the PSE model with speech data of the first target speaker, or a second target speaker, or the interfering speaker.
9. The method of claim 7, wherein the output data comprises audio of the estimated clean speech data of the first target speaker, and wherein the method further comprises: generating, from the output data, a transcript of the estimated clean speech data of the first target speaker.
10. The method of claim 7, wherein the speaker embeddings are extracted from enrollment data for the first target speaker and a second target speaker, and wherein the output data further comprises estimated clean speech data of the second target speaker.
11. The method of claim 7, wherein the input audio comprises real-time audio, and wherein producing the output data comprises producing the output data in real-time.
12. The method of claim 7, wherein the spatial features comprise an inter-channel phase difference (IPD).
13. One or more computer storage devices having computer-executable instructions stored thereon, which, on execution by a computer, cause the computer to perform operations comprising: extracting speaker embeddings from enrollment data for a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted speaker embeddings, and the extracted spatial features to a geometry-agnostic personalized speech enhancement (PSE) model trained with a virtual microphone signal comprising a combination of outputs of microphones in a multi-channel microphone array having different array geometries; and using the geometry-agnostic PSE model without geometry information for the microphone array, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
14. The one or more computer storage devices of claim 13, wherein the output data comprises audio of the estimated clean speech data of the first target speaker, and wherein the operations further comprise: generating, from the output data, a transcript of the estimated clean speech data of the first target speaker.
15. The one or more computer storage devices of claim 13, wherein the speaker embeddings are extracted from enrollment data for the first target speaker and a second target speaker, and wherein the output data further comprises estimated clean speech data of the second target speaker.
16. The one or more computer storage devices of claim 13, wherein the input audio comprises real-time audio, and wherein producing the output data comprises producing the output data in real-time.
17. The one or more computer storage devices of claim 13, wherein the spatial features comprise an inter-channel phase difference (IPD).
18. The one or more computer storage devices of claim 13, wherein the multi-channel microphone array having different array geometries used during training does not include a microphone array geometry used to capture the input audio, and wherein training the PSE model does not include training the PSE model with speech data of the first target speaker, or a second target speaker, or the interfering speaker.
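The independent claims recite a virtual microphone signal "comprising a combination of outputs of microphones" but do not say how the outputs are combined. As a rough sketch under that gap, the example below assumes an equal-weight, sample-wise average; the function name `virtual_microphone` and the toy channel data are hypothetical and are not taken from the patent.

```python
def virtual_microphone(channels):
    """Average equal-length channel signals sample by sample.

    Assumption: the 'combination' is an equal-weight mean. The claims
    do not specify the combination rule.
    """
    if not channels:
        raise ValueError("need at least one channel")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same length")
    return [sum(samples) / len(channels) for samples in zip(*channels)]


# Hypothetical arrays with different microphone counts.
three_mic = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
two_mic = [[1.0, 2.0], [3.0, 4.0]]

virtual_three = virtual_microphone(three_mic)
virtual_two = virtual_microphone(two_mic)
```

An average of this kind is defined for any number of microphones and uses no microphone positions, so the same function applies to arrays of different sizes and layouts.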
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
Noise filtering · CPC title
Artificial neural networks; Connectionist approaches · CPC title
Training, enrolment or model building · CPC title
the noise being separate speech, e.g. cocktail party · CPC title