US2021390975A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021390975-A1 |
| Application number | US-202117199347-A |
| Country | US |
| Kind code | A1 |
| Filing date | Mar 11, 2021 |
| Priority date | Jun 10, 2020 |
| Publication date | Dec 16, 2021 |
| Grant date | — |
Official abstract text for this publication.
A method includes receiving an overlapped audio signal that includes audio spoken by a speaker that overlaps a segment of synthesized playback audio. The method also includes encoding a sequence of characters that correspond to the synthesized playback audio into a text embedding representation. For each character in the sequence of characters, the method also includes generating a respective cancelation probability using the text embedding representation. The cancelation probability indicates a likelihood that the corresponding character is associated with the segment of the synthesized playback audio overlapped by the audio spoken by the speaker in the overlapped audio signal.
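As a rough illustration of the per-character step in the abstract, the sketch below assigns each character of the playback text a value between 0 and 1 by comparing character embeddings with audio frames. It is a toy under stated assumptions, not the disclosed implementation: `embed_char`, `EMBED_DIM`, the dot-product scoring, and the choice to keep the largest attention weight per character are all hypothetical stand-ins for a learned text encoder and attention mechanism.

```python
import math
from typing import List

EMBED_DIM = 4  # hypothetical embedding size, chosen only for the example


def embed_char(ch: str) -> List[float]:
    """Toy deterministic character embedding (stand-in for a learned text encoder)."""
    code = ord(ch)
    return [math.sin(code * (k + 1)) for k in range(EMBED_DIM)]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cancelation_probabilities(text: str, frames: List[List[float]]) -> List[float]:
    """Return one value in [0, 1] per character of `text`.

    Assumption: each frame is an EMBED_DIM-sized feature vector. For every
    frame, characters are scored by dot product and normalized with softmax;
    a character keeps the largest weight that any frame assigns to it.
    """
    if not text:
        return []
    embeddings = [embed_char(c) for c in text]
    probs = [0.0] * len(text)
    for frame in frames:
        scores = [sum(f * e for f, e in zip(frame, emb)) for emb in embeddings]
        weights = softmax(scores)
        probs = [max(p, w) for p, w in zip(probs, weights)]
    return probs


if __name__ == "__main__":
    # Two made-up frames; real frames would come from the overlapped audio signal.
    demo_frames = [embed_char("t"), embed_char("h")]
    for ch, p in zip("hi there", cancelation_probabilities("hi there", demo_frames)):
        print(repr(ch), round(p, 3))
```

In the method described above, these per-character values would be passed, together with the overlapped audio signal, to the cancelation neural network; that network is not modeled here.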
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving an overlapped audio signal comprising audio spoken by a speaker that overlaps a segment of synthesized playback audio; encoding a sequence of characters that correspond to the synthesized playback audio into a text embedding representation; for each character in the sequence of characters, generating, using the text embedding representation, a respective cancelation probability indicating a likelihood that the corresponding character is associated with the segment of the synthesized playback audio overlapped by the audio spoken by the speaker in the overlapped audio signal; and generating, using a cancelation neural network configured to receive the overlapped audio signal and the respective cancelation probability generated for each character in the sequence of characters as inputs, an enhanced audio signal by removing the segment of the synthesized playback audio from the overlapped audio signal.
2. The computer-implemented method of claim 1, wherein a text-to-speech (TTS) system converts the sequence of characters into synthesized speech comprising the synthesized playback audio.
3. The computer-implemented method of claim 1, wherein the text embedding representation comprises a single, fixed-dimensional text embedding vector.
4. The computer-implemented method of claim 1, wherein encoding the sequence of characters comprises encoding each character in the sequence of characters into a corresponding character embedding to generate a sequence of character embeddings.
5. The computer-implemented method of claim 4, wherein: the overlapped audio signal comprises a sequence of frames, each frame in the sequence of frames corresponding to a portion of the audio spoken by the speaker that overlaps the segment of synthesized playback audio; and generating the respective cancelation probability for each character in the sequence of characters comprises using an attention mechanism to apply a weight to the corresponding character embedding when the corresponding character embedding corresponds to one of the frames in the sequence of frames of the overlapped audio signal.
6. The computer-implemented method of claim 1, wherein the operations further comprise training the cancelation neural network on a plurality of training examples, each training example comprising: a ground truth audio signal corresponding to non-synthesized speech; a training overlapped audio signal comprising the ground truth audio signal overlapping a synthesized audio signal; and a respective textual representation of the synthesized audio signal, the textual representation comprising a sequence of characters.
7. The computer-implemented method of claim 1, wherein a text encoder of a text encoding neural network encodes the sequence of characters that correspond to the synthesized playback audio into the text embedding representation.
8. The computer-implemented method of claim 7, wherein the text encoder is shared by a text-to-speech (TTS) system, the TTS system configured to generate the synthesized playback audio from the sequence of characters.
9. The computer-implemented method of claim 1, wherein the cancelation neural network comprises a Long Short Term Memory (LSTM) network with a plurality of LSTM layers.
10. The computer-implemented method of claim 1, wherein the operations further comprise receiving an indication that a textual representation of the synthesized playback audio is available.
11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving an overlapped audio signal comprising audio spoken by a speaker that overlaps a segment of synthesized playback audio; encoding a sequence of characters that correspond to the synthesized playback audio into a text embedding representation; for each character in the sequence of characters, generating, using the text embedding representation, a respective cancelation probability indicating a likelihood that the corresponding character is associated with the segment of the synthesized playback audio overlapped by the audio spoken by the speaker in the overlapped audio signal; and generating, using a cancelation neural network configured to receive the overlapped audio signal and the respective cancelation probability generated for each character in the sequence of characters as inputs, an enhanced audio signal by removing the segment of the synthesized playback audio from the overlapped audio signal.
12. The system of claim 11, wherein a text-to-speech (TTS) system converts the sequence of characters into synthesized speech comprising the synthesized playback audio.
13. The system of claim 11, wherein the text embedding representation comprises a single, fixed-dimensional text embedding vector.
14. The system of claim 11, wherein encoding the sequence of characters comprises encoding each character in the sequence of characters into a corresponding character embedding to generate a sequence of character embeddings.
15. The system of claim 14, wherein: the overlapped audio signal comprises a sequence of frames, each frame in the sequence of frames corresponding to a portion of the audio spoken by the speaker that overlaps the segment of synthesized playback audio; and generating the respective cancelation probability for each character in the sequence of characters comprises using an attention mechanism to apply a weight to the corresponding character embedding when the corresponding character embedding corresponds to one of the frames in the sequence of frames of the overlapped audio signal.
16. The system of claim 11, wherein the operations further comprise training the cancelation neural network on a plurality of training examples, each training example comprising: a ground truth audio signal corresponding to non-synthesized speech; a training overlapped audio signal comprising the ground truth audio signal overlapping a synthesized audio signal; and a respective textual representation of the synthesized audio signal, the textual representation comprising a sequence of characters.
17. The system of claim 11, wherein a text encoder of a text encoding neural network encodes the sequence of characters that correspond to the synthesized playback audio into the text embedding representation.
18. The system of claim 17, wherein the text encoder is shared by a text-to-speech (TTS) system, the TTS system configured to generate the synthesized playback audio from the sequence of characters.
19. The system of claim 11, wherein the cancelation neural network comprises a Long Short Term Memory (LSTM) network with a plurality of LSTM layers.
20. The system of claim 11, wherein the operations further comprise receiving an indication that a textual representation of the synthesized playback audio is available.
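The three-part training example recited in claims 6 and 16 (ground truth audio, overlapped audio, and the text of the synthesized audio) can be pictured with a small data structure. The sketch below is illustrative only: the names `TrainingExample` and `make_training_example`, the `offset` parameter, and the use of simple sample-wise addition to produce the overlap are assumptions, since the claims say only that the ground truth signal overlaps a synthesized signal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    """Hypothetical container mirroring the three parts named in claims 6 and 16."""
    ground_truth: List[float]  # non-synthesized speech samples
    overlapped: List[float]    # ground truth with synthesized audio mixed in
    text: str                  # character sequence of the synthesized audio


def make_training_example(
    ground_truth: List[float],
    synthesized: List[float],
    text: str,
    offset: int = 0,
) -> TrainingExample:
    """Mix `synthesized` into a copy of `ground_truth`, starting at sample `offset`.

    Assumption: the overlap is modeled as plain addition; synthesized samples
    that would fall outside the ground truth signal are dropped.
    """
    overlapped = list(ground_truth)
    for i, sample in enumerate(synthesized):
        j = offset + i
        if 0 <= j < len(overlapped):
            overlapped[j] += sample
    return TrainingExample(list(ground_truth), overlapped, text)


if __name__ == "__main__":
    example = make_training_example(
        ground_truth=[0.25, 0.5, 0.75, 1.0],
        synthesized=[0.5, 0.25],
        text="hi",
        offset=1,
    )
    print(example.overlapped)
```

A network trained on such examples would take `overlapped` and `text` as inputs and be scored against `ground_truth`; the loss function and model are outside what the claims specify and are not sketched here.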
Speech enhancement, e.g. noise reduction or echo cancellation (reducing echo effects in line transmission systems H04B3/20; echo suppression in hands-free telephones H04M9/08) · CPC title
Noise filtering · CPC title
the noise being echo, reverberation of the speech · CPC title
using neural networks · CPC title
Methods for producing synthetic speech; Speech synthesisers · CPC title