What technology area does this patent fall under?

Primary CPC classification G10L15/25. Mapped technology areas include Physics.

When was this patent published?

Published Apr 19, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Query endpointing based on lip detection

US11308963B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11308963-B2
Application number	US-202016936948-A
Country	US
Kind code	B2
Filing date	Jul 23, 2020
Priority date	Mar 14, 2017
Publication date	Apr 19, 2022
Grant date	Apr 19, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are described for improving endpoint detection of a voice query submitted by a user. In some implementations, a synchronized video data and audio data is received. A sequence of frames of the video data that includes images corresponding to lip movement on a face is determined. The audio data is endpointed based on first audio data that corresponds to a first frame of the sequence of frames and second audio data that corresponds to a last frame of the sequence of frames. A transcription of the endpointed audio data is generated by an automated speech recognizer. The generated transcription is then provided for output.
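The endpointing step in the abstract can be illustrated with a minimal sketch. It assumes a per-frame lip-movement flag is already available from some detector (not shown), and that video and audio share a common start time. The function name `endpoint_audio`, its parameters, and the integer frame-to-sample mapping are illustrative assumptions, not taken from the patent.

```python
from typing import List, Optional, Sequence


def endpoint_audio(
    samples: Sequence[float],
    lip_moving: Sequence[bool],
    fps: int,
    sample_rate: int,
) -> Optional[List[float]]:
    """Trim audio to the span between the first and last lip-movement frames.

    samples:     audio samples synchronized with the video (assumed mono)
    lip_moving:  one flag per video frame, True where lip movement was detected
    fps:         video frames per second
    sample_rate: audio samples per second
    """
    moving = [i for i, flag in enumerate(lip_moving) if flag]
    if not moving:
        return None  # no lip movement detected, so nothing to endpoint

    first_frame, last_frame = moving[0], moving[-1]
    # Audio that corresponds to the first frame starts here...
    start = first_frame * sample_rate // fps
    # ...and audio that corresponds to the last frame ends here.
    end = min((last_frame + 1) * sample_rate // fps, len(samples))
    return list(samples[start:end])


# Usage: 5 frames at 10 fps, audio at 100 samples/s (0.5 s in total).
audio = list(range(50))
flags = [False, True, True, False, False]
clip = endpoint_audio(audio, flags, fps=10, sample_rate=100)
print(len(clip))  # 20 samples, covering frames 1 and 2
```

In the flow the abstract describes, the trimmed audio would then be passed to an automated speech recognizer; that step is omitted here.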

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by one or more processors comprising: identifying a plurality of training data instances for training a machine learning model, each of the plurality of training data instances comprising: training data instance input that includes video data and audio data that is synchronized with the video data, wherein the video data includes at least one video frame, and training data instance output that includes an indication of whether the audio data captures a corresponding user speaking in the at least one video frame of the training data instance input; and training the machine learning model based on the plurality of training data instances to predict whether further video data and further audio data that is synchronized with the further video data captures a user speaking.

2. The method of claim 1, wherein the at least one video frame of the training data instance input is a speaking video frame, and wherein the indication of the training data instance output indicates that the speaking video frame captures the corresponding user speaking in the at least one video frame of the training data instance input.

3. The method of claim 2, wherein the training data instance output further includes a transcription of a word or phrase captured in the audio data that is synchronized with the video data, and wherein training the machine learning model based on the plurality of training data instances further comprises training the machine learning model to perform speech recognition on the further audio data.

4. The method of claim 3, wherein predicting whether the further video data and the further audio data that is synchronized with the further video data captures the user speaking further comprises generating a predicted transcription of the further audio data.

5. The method of claim 1, wherein the at least one video frame of the training data instance input is a non-speaking video frame, and wherein the indication of the training data instance output indicates that the non-speaking video frame does not capture the corresponding user speaking in the at least one video frame of the training data instance input.

6. The method of claim 5, wherein the non-speaking video frame captures lip movement of the user.

7. The method of claim 5, wherein the non-speaking video frame captures no lip movement of the user.

8. The method of claim 1, further comprising: subsequent to training the machine learning model: receiving additional video data and additional audio data that is synchronized with the additional video data; processing, using the trained machine learning model, the additional video data and the additional audio data; and identifying, based on processing the additional video data and the additional audio data, one or more portions of the additional audio data that include the user speaking.

9. The method of claim 1, further comprising: subsequent to training the machine learning model, causing the machine learning model to be used at a client device.

10. A system comprising: at least one processor; and at least one memory storing instructions that, when executed, cause the at least one processor to: identify a plurality of training data instances for training a machine learning model, each of the plurality of training data instances comprising: training data instance input that includes video data and audio data that is synchronized with the video data, wherein the video data includes at least one video frame, and training data instance output that includes an indication of whether the audio data captures a corresponding user speaking in the at least one video frame of the training data instance input; and train the machine learning model based on the plurality of training data instances to predict whether further video data and further audio data that is synchronized with the further video data captures a user speaking.

11. The system of claim 10, wherein the at least one video frame of the training data instance input is a speaking video frame, and wherein the indication of the training data instance output indicates that the speaking video frame captures the corresponding user speaking in the at least one video frame of the training data instance input.

12. The system of claim 11, wherein the training data instance output further includes a transcription of a word or phrase captured in the audio data that is synchronized with the video data, and wherein training the machine learning model based on the plurality of training data instances further comprises training the machine learning model to perform speech recognition on the further audio data.

13. The system of claim 12, wherein predicting whether the further video data and the further audio data that is synchronized with the further video data captures the user speaking further comprises generating a predicted transcription of the further audio data.

14. The system of claim 10, wherein the at least one video frame of the training data instance input is a non-speaking video frame, and wherein the indication of the training data instance output indicates that the non-speaking video frame does not capture the corresponding user speaking in the at least one video frame of the training data instance input.

15. The system of claim 14, wherein the non-speaking video frame captures lip movement of the user.

16. The system of claim 14, wherein the non-speaking video frame captures no lip movement of the user.

17. The system of claim 10, further comprising instructions to: subsequent to training the machine learning model: receive additional video data and additional audio data that is synchronized with the additional video data; process, using the trained machine learning model, the additional video data and the additional audio data; and identify, based on processing the additional video data and the additional audio data, one or more portions of the additional audio data that include the user speaking.

18. The system of claim 10, further comprising instructions to: subsequent to training the machine learning model, cause the machine learning model to be used at a client device.

19. A non-transitory computer-readable storage medium storing instructions that, when executed, cause at least at least one processor to: identify a plurality of training data instances for training a machine learning model, each of the plurality of training data instances comprising: training data instance input that includes video data and audio data that is synchronized with the video data, wherein the video data includes at least one video frame, and training data instance output that includes an indication of whether the audio data captures a corresponding user speaking in the at least one video frame of the training data instance input; and train the machine learning model based on the plurality of training data instances to predict whether further video data and further audio data that is synchronized with the further video data captures a user speaking.
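The training data layout in the independent claims (synchronized video and audio as input, a speaking indication as output) can be sketched as follows. The claims do not name a model type on this page, so the logistic regression, the two hand-made features (frame-to-frame change as a lip-motion proxy, mean audio energy), and every identifier below are assumptions chosen only to keep the example small.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingInstance:
    video_frames: List[List[float]]  # input: at least one frame, as flat pixel values
    audio: List[float]               # input: audio synchronized with the frames
    speaking: bool                   # output: does the audio capture this user speaking?


def features(inst: TrainingInstance) -> List[float]:
    """Hypothetical features: lip-motion proxy and mean audio energy."""
    frames = inst.video_frames
    diffs = [abs(a - b) for f0, f1 in zip(frames, frames[1:]) for a, b in zip(f0, f1)]
    motion = sum(diffs) / len(diffs) if diffs else 0.0
    energy = sum(s * s for s in inst.audio) / len(inst.audio) if inst.audio else 0.0
    return [motion, energy]


def score(model: Tuple[List[float], float], inst: TrainingInstance) -> float:
    """Predicted probability that the instance captures the user speaking."""
    weights, bias = model
    z = sum(w * x for w, x in zip(weights, features(inst))) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(instances: List[TrainingInstance], epochs: int = 500, lr: float = 0.5):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inst in instances:
            err = score((weights, bias), inst) - float(inst.speaking)
            weights = [w - lr * err * x for w, x in zip(weights, features(inst))]
            bias -= lr * err
    return weights, bias


def predict(model, inst: TrainingInstance) -> bool:
    return score(model, inst) >= 0.5


moving = [[0.0, 0.0], [1.0, 1.0]]  # frames that change: lip movement
still = [[0.0, 0.0], [0.0, 0.0]]   # frames that do not change
voiced = [1.0, -1.0, 1.0, -1.0]
silent = [0.0, 0.0, 0.0, 0.0]

examples = [
    TrainingInstance(moving, voiced, True),   # speaking frame (claim 2)
    TrainingInstance(moving, silent, False),  # non-speaking, lips move (claim 6)
    TrainingInstance(still, voiced, False),   # non-speaking, no lip movement (claim 7)
    TrainingInstance(still, silent, False),
]
model = train(examples)
print([predict(model, e) for e in examples])
```

The toy data deliberately includes a non-speaking instance with lip movement and another with audio but no lip movement, so neither signal alone is enough to separate the labels.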

Assignees

Google LLC

Inventors

Classifications

G10L2015/227: of the speaker; Human-factor methodology
G10L2015/225: Feedback of the input speech
G10L21/0356: for synchronising with other signals, e.g. video signals
G10L15/25 (Primary): using position of the lips, movement of the lips or face analysis
G10L15/22 (Primary): Procedures used during a speech recognition process, e.g. man-machine dialogue

Patent family

Related publications grouped by family.

View patent family 60452748

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11308963B2 cover?: Systems and methods are described for improving endpoint detection of a voice query submitted by a user. In some implementations, a synchronized video data and audio data is received. A sequence of frames of the video data that includes images corresponding to lip movement on a face is determined. The audio data is endpointed based on first audio data that corresponds to a first frame of the se…
Who is the assignee on this patent?: Google LLC