Textual Echo Cancellation

US2021390975A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2021390975-A1
Application number	US-202117199347-A
Country	US
Kind code	A1
Filing date	Mar 11, 2021
Priority date	Jun 10, 2020
Publication date	Dec 16, 2021
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving an overlapped audio signal that includes audio spoken by a speaker that overlaps a segment of synthesized playback audio. The method also includes encoding a sequence of characters that correspond to the synthesized playback audio into a text embedding representation. For each character in the sequence of characters, the method also includes generating a respective cancelation probability using the text embedding representation. The cancelation probability indicates a likelihood that the corresponding character is associated with the segment of the synthesized playback audio overlapped by the audio spoken by the speaker in the overlapped audio signal.
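To make the per-character step in the abstract concrete, here is a minimal sketch, not the patent's actual model. It assumes a toy text encoder (`embed_chars`, which maps each character to a fixed pseudo-random unit vector), audio frames already projected into the same vector space, and dot-product attention as the scoring rule. All function names and the choice to take the maximum attention weight over frames are illustrative assumptions.

```python
import math
import random


def embed_chars(text, dim=8):
    """Hypothetical stand-in for a learned text encoder:
    each character maps to a fixed pseudo-random unit vector."""
    embeddings = []
    for ch in text:
        rng = random.Random(ord(ch))  # same character -> same vector
        vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        embeddings.append([v / norm for v in vec])
    return embeddings


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cancelation_probabilities(frames, char_embeddings):
    """Return one score in [0, 1] per character.

    Assumption: each frame attends over all characters (dot-product
    attention), and a character's score is its largest attention
    weight across frames.
    """
    per_char = [0.0] * len(char_embeddings)
    for frame in frames:
        scores = [sum(f * c for f, c in zip(frame, emb))
                  for emb in char_embeddings]
        weights = softmax(scores)
        per_char = [max(p, w) for p, w in zip(per_char, weights)]
    return per_char


# Toy usage: the "frames" are built to resemble the character "l",
# so both "l" positions in "hello" receive the highest score.
chars = embed_chars("hello")
frames = embed_chars("ll")
probs = cancelation_probabilities(frames, chars)
print([round(p, 3) for p in probs])
```

In the claimed method these probabilities are then passed, together with the overlapped audio, to a cancelation neural network; that stage is not sketched here.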

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving an overlapped audio signal comprising audio spoken by a speaker that overlaps a segment of synthesized playback audio; encoding a sequence of characters that correspond to the synthesized playback audio into a text embedding representation; for each character in the sequence of characters, generating, using the text embedding representation, a respective cancelation probability indicating a likelihood that the corresponding character is associated with the segment of the synthesized playback audio overlapped by the audio spoken by the speaker in the overlapped audio signal; and generating, using a cancelation neural network configured to receive the overlapped audio signal and the respective cancelation probability generated for each character in the sequence of characters as inputs, an enhanced audio signal by removing the segment of the synthesized playback audio from the overlapped audio signal.
2. The computer-implemented method of claim 1, wherein a text-to-speech (TTS) system converts the sequence of characters into synthesized speech comprising the synthesized playback audio.
3. The computer-implemented method of claim 1, wherein the text embedding representation comprises a single, fixed-dimensional text embedding vector.
4. The computer-implemented method of claim 1, wherein encoding the sequence of characters comprises encoding each character in the sequence of characters into a corresponding character embedding to generate a sequence of character embeddings.
5. The computer-implemented method of claim 4, wherein: the overlapped audio signal comprises a sequence of frames, each frame in the sequence of frames corresponding to a portion of the audio spoken by the speaker that overlaps the segment of synthesized playback audio; and generating the respective cancelation probability for each character in the sequence of characters comprises using an attention mechanism to apply a weight to the corresponding character embedding when the corresponding character embedding corresponds to one of the frames in the sequence of frames of the overlapped audio signal.
6. The computer-implemented method of claim 1, wherein the operations further comprise training the cancelation neural network on a plurality of training examples, each training example comprising: a ground truth audio signal corresponding to non-synthesized speech; a training overlapped audio signal comprising the ground truth audio signal overlapping a synthesized audio signal; and a respective textual representation of the synthesized audio signal, the textual representation comprising a sequence of characters.
7. The computer-implemented method of claim 1, wherein a text encoder of a text encoding neural network encodes the sequence of characters that correspond to the synthesized playback audio into the text embedding representation.
8. The computer-implemented method of claim 7, wherein the text encoder is shared by a text-to-speech (TTS) system, the TTS system configured to generate the synthesized playback audio from the sequence of characters.
9. The computer-implemented method of claim 1, wherein the cancelation neural network comprises a Long Short Term Memory (LSTM) network with a plurality of LSTM layers.
10. The computer-implemented method of claim 1, wherein the operations further comprise receiving an indication that a textual representation of the synthesized playback audio is available.
11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving an overlapped audio signal comprising audio spoken by a speaker that overlaps a segment of synthesized playback audio; encoding a sequence of characters that correspond to the synthesized playback audio into a text embedding representation; for each character in the sequence of characters, generating, using the text embedding representation, a respective cancelation probability indicating a likelihood that the corresponding character is associated with the segment of the synthesized playback audio overlapped by the audio spoken by the speaker in the overlapped audio signal; and generating, using a cancelation neural network configured to receive the overlapped audio signal and the respective cancelation probability generated for each character in the sequence of characters as inputs, an enhanced audio signal by removing the segment of the synthesized playback audio from the overlapped audio signal.
12. The system of claim 11, wherein a text-to-speech (TTS) system converts the sequence of characters into synthesized speech comprising the synthesized playback audio.
13. The system of claim 11, wherein the text embedding representation comprises a single, fixed-dimensional text embedding vector.
14. The system of claim 11, wherein encoding the sequence of characters comprises encoding each character in the sequence of characters into a corresponding character embedding to generate a sequence of character embeddings.
15. The system of claim 14, wherein: the overlapped audio signal comprises a sequence of frames, each frame in the sequence of frames corresponding to a portion of the audio spoken by the speaker that overlaps the segment of synthesized playback audio; and generating the respective cancelation probability for each character in the sequence of characters comprises using an attention mechanism to apply a weight to the corresponding character embedding when the corresponding character embedding corresponds to one of the frames in the sequence of frames of the overlapped audio signal.
16. The system of claim 11, wherein the operations further comprise training the cancelation neural network on a plurality of training examples, each training example comprising: a ground truth audio signal corresponding to non-synthesized speech; a training overlapped audio signal comprising the ground truth audio signal overlapping a synthesized audio signal; and a respective textual representation of the synthesized audio signal, the textual representation comprising a sequence of characters.
17. The system of claim 11, wherein a text encoder of a text encoding neural network encodes the sequence of characters that correspond to the synthesized playback audio into the text embedding representation.
18. The system of claim 17, wherein the text encoder is shared by a text-to-speech (TTS) system, the TTS system configured to generate the synthesized playback audio from the sequence of characters.
19. The system of claim 11, wherein the cancelation neural network comprises a Long Short Term Memory (LSTM) network with a plurality of LSTM layers.
20. The system of claim 11, wherein the operations further comprise receiving an indication that a textual representation of the synthesized playback audio is available.
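Claims 6 and 16 describe each training example as three parts: a ground truth signal, an overlapped signal, and the text of the synthesized audio. The sketch below shows one way such an example could be assembled, assuming audio is a plain list of float samples and that overlap means sample-wise addition at an offset. The `TrainingExample` class, `make_training_example` function, and `offset` parameter are hypothetical names for illustration, not terms from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    ground_truth_audio: List[float]  # non-synthesized speech (training target)
    overlapped_audio: List[float]    # ground truth mixed with synthesized audio
    text: str                        # characters behind the synthesized audio


def make_training_example(ground_truth, synthesized, text, offset=0):
    """Overlay synthesized samples onto the ground truth, starting at `offset`.

    Assumption: mixing is plain sample-wise addition with no gain scaling.
    """
    length = max(len(ground_truth), offset + len(synthesized))
    mixed = [0.0] * length
    for i, sample in enumerate(ground_truth):
        mixed[i] += sample
    for i, sample in enumerate(synthesized):
        mixed[offset + i] += sample
    return TrainingExample(list(ground_truth), mixed, text)


# Toy usage: three samples of "speech" overlapped by two samples of "TTS".
example = make_training_example([1.0, 1.0, 1.0], [0.5, 0.5], "hi", offset=2)
print(example.overlapped_audio)
```

A network trained on such examples would take `overlapped_audio` and `text` as inputs and be scored against `ground_truth_audio`.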

Assignees

Google LLC

Inventors

Wang Quan

Classifications

G10L21/02
Speech enhancement, e.g. noise reduction or echo cancellation (reducing echo effects in line transmission systems H04B3/20; echo suppression in hands-free telephones H04M9/08) · CPC title
G10L21/0208 (Primary)
Noise filtering · CPC title
G10L2021/02082
the noise being echo, reverberation of the speech · CPC title
G10L25/30
using neural networks · CPC title
G10L13/02
Methods for producing synthetic speech; Speech synthesisers · CPC title

Patent family

Related publications grouped by family.

View patent family 75302675

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021390975A1 cover?: A method includes receiving an overlapped audio signal that includes audio spoken by a speaker that overlaps a segment of synthesized playback audio. The method also includes encoding a sequence of characters that correspond to the synthesized playback audio into a text embedding representation. For each character in the sequence of characters, the method also includes generating a respective c…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G10L21/0208. Mapped technology areas include Physics.
When was this patent published?: Publication date Dec 16, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).