Query endpointing based on lip detection
US-2018268812-A1 · Sep 20, 2018 · US
US11308963B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11308963-B2 |
| Application number | US-202016936948-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 23, 2020 |
| Priority date | Mar 14, 2017 |
| Publication date | Apr 19, 2022 |
| Grant date | Apr 19, 2022 |
Official abstract text for this publication.
Systems and methods are described for improving endpoint detection of a voice query submitted by a user. In some implementations, synchronized video data and audio data are received. A sequence of frames of the video data that includes images corresponding to lip movement on a face is determined. The audio data is endpointed based on first audio data that corresponds to a first frame of the sequence of frames and second audio data that corresponds to a last frame of the sequence of frames. A transcription of the endpointed audio data is generated by an automated speech recognizer. The generated transcription is then provided for output.
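The endpointing step in the abstract can be illustrated with a minimal sketch. It assumes lip movement has already been detected per video frame and is supplied as a list of booleans, and that "the sequence of frames" is the longest contiguous run of lip-movement frames; both are assumptions made for illustration, not details stated in the abstract. The names `lip_movement_span` and `endpoint_audio`, and the `fps` and `sample_rate` parameters, are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def lip_movement_span(moving: Sequence[bool]) -> Optional[Tuple[int, int]]:
    """Return (first, last) frame indices of the longest run of lip-movement frames.

    Picking the longest run is an assumption of this sketch; ties keep the earliest run.
    """
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    # A trailing False acts as a sentinel so a run ending on the last frame is closed.
    for i, flag in enumerate(list(moving) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


def endpoint_audio(
    samples: List[int], moving: Sequence[bool], fps: float, sample_rate: int
) -> List[int]:
    """Trim synchronized audio to the samples spanning the lip-movement frames."""
    span = lip_movement_span(moving)
    if span is None:
        return []  # no lip movement detected, so nothing to endpoint
    first, last = span
    begin = int(first * sample_rate / fps)     # audio at the start of the first frame
    end = int((last + 1) * sample_rate / fps)  # audio at the end of the last frame
    return samples[begin:end]
```

In a full pipeline the trimmed samples would then be passed to a speech recognizer to produce the transcription; that step is omitted here because the abstract does not specify the recognizer's interface.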
Opening claim text (preview).
What is claimed is:

1. A method implemented by one or more processors comprising: identifying a plurality of training data instances for training a machine learning model, each of the plurality of training data instances comprising: training data instance input that includes video data and audio data that is synchronized with the video data, wherein the video data includes at least one video frame, and training data instance output that includes an indication of whether the audio data captures a corresponding user speaking in the at least one video frame of the training data instance input; and training the machine learning model based on the plurality of training data instances to predict whether further video data and further audio data that is synchronized with the further video data captures a user speaking.
2. The method of claim 1, wherein the at least one video frame of the training data instance input is a speaking video frame, and wherein the indication of the training data instance output indicates that the speaking video frame captures the corresponding user speaking in the at least one video frame of the training data instance input.
3. The method of claim 2, wherein the training data instance output further includes a transcription of a word or phrase captured in the audio data that is synchronized with the video data, and wherein training the machine learning model based on the plurality of training data instances further comprises training the machine learning model to perform speech recognition on the further audio data.
4. The method of claim 3, wherein predicting whether the further video data and the further audio data that is synchronized with the further video data captures the user speaking further comprises generating a predicted transcription of the further audio data.
5. The method of claim 1, wherein the at least one video frame of the training data instance input is a non-speaking video frame, and wherein the indication of the training data instance output indicates that the non-speaking video frame does not capture the corresponding user speaking in the at least one video frame of the training data instance input.
6. The method of claim 5, wherein the non-speaking video frame captures lip movement of the user.
7. The method of claim 5, wherein the non-speaking video frame captures no lip movement of the user.
8. The method of claim 1, further comprising: subsequent to training the machine learning model: receiving additional video data and additional audio data that is synchronized with the additional video data; processing, using the trained machine learning model, the additional video data and the additional audio data; and identifying, based on processing the additional video data and the additional audio data, one or more portions of the additional audio data that include the user speaking.
9. The method of claim 1, further comprising: subsequent to training the machine learning model, causing the machine learning model to be used at a client device.
10. A system comprising: at least one processor; and at least one memory storing instructions that, when executed, cause the at least one processor to: identify a plurality of training data instances for training a machine learning model, each of the plurality of training data instances comprising: training data instance input that includes video data and audio data that is synchronized with the video data, wherein the video data includes at least one video frame, and training data instance output that includes an indication of whether the audio data captures a corresponding user speaking in the at least one video frame of the training data instance input; and train the machine learning model based on the plurality of training data instances to predict whether further video data and further audio data that is synchronized with the further video data captures a user speaking.
11. The system of claim 10, wherein the at least one video frame of the training data instance input is a speaking video frame, and wherein the indication of the training data instance output indicates that the speaking video frame captures the corresponding user speaking in the at least one video frame of the training data instance input.
12. The system of claim 11, wherein the training data instance output further includes a transcription of a word or phrase captured in the audio data that is synchronized with the video data, and wherein training the machine learning model based on the plurality of training data instances further comprises training the machine learning model to perform speech recognition on the further audio data.
13. The system of claim 12, wherein predicting whether the further video data and the further audio data that is synchronized with the further video data captures the user speaking further comprises generating a predicted transcription of the further audio data.
14. The system of claim 10, wherein the at least one video frame of the training data instance input is a non-speaking video frame, and wherein the indication of the training data instance output indicates that the non-speaking video frame does not capture the corresponding user speaking in the at least one video frame of the training data instance input.
15. The system of claim 14, wherein the non-speaking video frame captures lip movement of the user.
16. The system of claim 14, wherein the non-speaking video frame captures no lip movement of the user.
17. The system of claim 10, further comprising instructions to: subsequent to training the machine learning model: receive additional video data and additional audio data that is synchronized with the additional video data; process, using the trained machine learning model, the additional video data and the additional audio data; and identify, based on processing the additional video data and the additional audio data, one or more portions of the additional audio data that include the user speaking.
18. The system of claim 10, further comprising instructions to: subsequent to training the machine learning model, cause the machine learning model to be used at a client device.
19. A non-transitory computer-readable storage medium storing instructions that, when executed, cause at least one processor to: identify a plurality of training data instances for training a machine learning model, each of the plurality of training data instances comprising: training data instance input that includes video data and audio data that is synchronized with the video data, wherein the video data includes at least one video frame, and training data instance output that includes an indication of whether the audio data captures a corresponding user speaking in the at least one video frame of the training data instance input; and train the machine learning model based on the plurality of training data instances to predict whether further video data and further audio data that is synchronized with the further video data captures a user speaking.
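The shape of the training data instances in the independent claims, and the post-training use described in claims 8 and 17, can be sketched as follows. This is an illustrative toy under stated assumptions, not the patent's model: the claims do not name a model type, so a small logistic regression stands in for "a machine learning model", and each instance's video and audio are reduced to two hypothetical scalar features (`lip_motion`, `audio_energy`). All names and the toy data values are invented for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias) of the stand-in classifier


@dataclass
class TrainingInstance:
    lip_motion: float    # hypothetical summary of the video frame(s) (instance input)
    audio_energy: float  # hypothetical summary of the synchronized audio (instance input)
    speaking: int        # instance output: 1 if the audio captures the user speaking


# Hypothetical toy data covering the cases the dependent claims distinguish.
TOY_INSTANCES = [
    TrainingInstance(0.9, 0.8, 1),  # speaking frame
    TrainingInstance(0.8, 0.9, 1),  # speaking frame
    TrainingInstance(0.7, 0.1, 0),  # non-speaking frame with lip movement
    TrainingInstance(0.1, 0.1, 0),  # non-speaking frame with no lip movement
    TrainingInstance(0.1, 0.8, 0),  # background sound, no lip movement
]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(instances: Sequence[TrainingInstance], epochs: int = 5000, lr: float = 0.5) -> Model:
    """Fit a logistic regression with full-batch gradient descent."""
    w, b, n = [0.0, 0.0], 0.0, len(instances)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for inst in instances:
            x = (inst.lip_motion, inst.audio_energy)
            err = _sigmoid(w[0] * x[0] + w[1] * x[1] + b) - inst.speaking
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


def predict_speaking(model: Model, lip_motion: float, audio_energy: float) -> bool:
    """Predict whether synchronized video and audio capture a user speaking."""
    w, b = model
    return _sigmoid(w[0] * lip_motion + w[1] * audio_energy + b) >= 0.5


def speaking_frames(model: Model, features: Sequence[Tuple[float, float]]) -> List[int]:
    """Return indices of frames whose audio is predicted to include the user speaking."""
    return [i for i, (m, e) in enumerate(features) if predict_speaking(model, m, e)]
```

Calling `train(TOY_INSTANCES)` and then `speaking_frames(model, features)` on per-frame features from additional video and audio mirrors the train-then-identify order of claims 1 and 8. The transcription output of claims 3 and 4 is outside the scope of this sketch.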
of the speaker; Human-factor methodology · CPC title
Feedback of the input speech · CPC title
for synchronising with other signals, e.g. video signals · CPC title
using position of the lips, movement of the lips or face analysis · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title