Method, apparatus, and computer program product for searchable real-time transcribed audio and visual content within a group-based communication system
US-2023377575-A1 · Nov 23, 2023 · US
US2022366910A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022366910-A1 |
| Application number | US-202117322765-A |
| Country | US |
| Kind code | A1 |
| Filing date | May 17, 2021 |
| Priority date | May 17, 2021 |
| Publication date | Nov 17, 2022 |
| Grant date | — |
Official abstract text for this publication.
Systems and methods described herein relate to determining whether to incorporate recognized text, that corresponds to a spoken utterance of a user of a client device, into a transcription displayed at the client device, or to cause an assistant command, that is associated with the transcription and that is based on the recognized text, to be performed by an automated assistant implemented by the client device. The spoken utterance is received during a dictation session between the user and the automated assistant. Implementations can process, using automatic speech recognition model(s), audio data that captures the spoken utterance to generate the recognized text. Further, implementations can determine whether to incorporate the recognized text into the transcription or cause the assistant command to be performed based on touch input being directed to the transcription, a state of the transcription, and/or audio-based characteristic(s) of the spoken utterance.
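The decision described in the abstract can be pictured with a minimal sketch. It is only an illustration under stated assumptions, not the implementation disclosed in the publication: the names `Action`, `DictationContext`, `COMMAND_PREFIXES`, and `decide`, the command phrases, and the pause threshold are all hypothetical. The three inputs mirror the signals the abstract names: touch input directed to the transcription, a state of the transcription, and an audio-based characteristic of the utterance.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    INCORPORATE_TEXT = "incorporate_text"
    PERFORM_COMMAND = "perform_command"


@dataclass
class DictationContext:
    # Hypothetical stand-ins for the three signals named in the abstract.
    touch_on_transcription: bool      # touch input is directed to the transcription
    transcription_is_empty: bool      # one possible "state of the transcription"
    pause_before_utterance_s: float   # one possible audio-based characteristic


# Assumed command phrases, chosen only for illustration.
COMMAND_PREFIXES = ("delete", "send", "undo", "bold")


def decide(recognized_text: str, ctx: DictationContext,
           pause_threshold_s: float = 1.0) -> Action:
    """Choose between dictating the text and running an assistant command."""
    looks_like_command = recognized_text.lower().startswith(COMMAND_PREFIXES)
    if not looks_like_command:
        return Action.INCORPORATE_TEXT
    if ctx.transcription_is_empty:
        # Assumption: with nothing in the transcription, there is nothing to act on.
        return Action.INCORPORATE_TEXT
    if ctx.touch_on_transcription or ctx.pause_before_utterance_s >= pause_threshold_s:
        return Action.PERFORM_COMMAND
    return Action.INCORPORATE_TEXT


if __name__ == "__main__":
    ctx = DictationContext(touch_on_transcription=True,
                           transcription_is_empty=False,
                           pause_before_utterance_s=0.0)
    print(decide("send it", ctx))
```

In this sketch the same words ("send it") are treated as dictated text unless one of the contextual signals suggests the user is addressing the assistant; a real system would presumably weigh these signals with trained models rather than fixed rules.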
Opening claim text (preview).
What is claimed is:

1. A method implemented by one or more processors, the method comprising: receiving audio data that captures a spoken utterance of a user of a client device, the audio data being generated by one or more microphones of the client device, and the audio data being received while touch input of the user is being directed to a transcription that is displayed at the client device via a software application accessible at the client device; determining, based on the touch input of the user being directed to the transcription and the spoken utterance, whether to: incorporate recognized text, that corresponds to the spoken utterance, into the transcription, or perform an assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance; in response to determining to incorporate the recognized text that corresponds to the spoken utterance into the transcription: automatically incorporating the recognized text that corresponds to the spoken utterance into the transcription; and in response to determining to perform the assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance: causing an automated assistant to perform the assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance.

2. The method of claim 1, further comprising: processing, using an automatic speech recognition (ASR) model, the audio data that captures the spoken utterance to generate the recognized text that corresponds to the spoken utterance.

3. The method of claim 2, further comprising: processing, using a natural language understanding (NLU) model, the recognized text that corresponds to the spoken utterance to generate annotated recognized text.

4. The method of claim 3, further comprising: determining the assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance, wherein determining the assistant command, that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance, is based on the annotated recognized text.

5. The method of claim 1, wherein the touch input of the user is being directed to one or more textual segments of the transcription that is displayed at the client device.

6. The method of claim 5, wherein the touch input of the user graphically demarcates one or more of the textual segments of the transcription that is displayed at the client device.

7. The method of claim 6, wherein determining whether to incorporate the recognized text, that corresponds to the spoken utterance, into the transcription, or to perform the assistant command, that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance, comprises determining to perform the assistant command based on the touch input of the user graphically demarcating one or more of the textual segments of the transcription.

8. The method of claim 1, wherein the touch input of the user is being directed to one or more fields of the transcription that is displayed at the client device.

9. The method of claim 8, wherein determining whether to incorporate the recognized text, that corresponds to the spoken utterance, into the transcription, or to perform the assistant command, that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance, comprises determining to perform the assistant command based on the touch input of the user being directed to one or more fields of the transcription.

10. The method of claim 1, wherein automatically incorporating the recognized text that corresponds to the spoken utterance into the transcription comprises: causing the recognized text to be visually displayed to the user via the software application accessible at the client device as part of the transcription.

11. The method of claim 10, wherein causing the recognized text to be visually displayed to the user via the software application accessible at the client device as part of the transcription comprises causing the recognized text to be maintained in the transcription after additional text is incorporated into the transcription.

12. A method implemented by one or more processors, the method comprising: receiving audio data that captures a spoken utterance of a user of a client device, the audio data being generated by one or more microphones of the client device, and the audio data being received while a transcription is being displayed at the client device via a software application accessible at the client device; processing, using an automatic speech recognition (ASR) model, the audio data that captures the spoken utterance to generate recognized text that corresponds to the spoken utterance; processing, using a natural language understanding (NLU) model, the recognized text that corresponds to the spoken utterance to generate annotated recognized text; processing, using an audio-based machine learning (ML) model, the audio data that captures the spoken utterance to determine one or more audio-based characteristics of the spoken utterance; determining, based on one or more of the annotated recognized text or one or more of the audio-based characteristics of the spoken utterance, whether to: incorporate recognized text, that corresponds to the spoken utterance, into the transcription, or perform an assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance; in response to determining to incorporate the recognized text that corresponds to the spoken utterance into the transcription: automatically incorporating the recognized text that corresponds to the spoken utterance into the transcription; and in response to determining to perform the assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance: causing an automated assistant to perform the assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance.

13. The method of claim 12, further comprising: determining the assistant command that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance, wherein determining the assistant command, that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance, is based on the annotated recognized text.

14. The method of claim 12, wherein the audio-based ML model is an endpointing model trained to detect pauses in the spoken utterance, and wherein one or more of the audio-based characteristics of the spoken utterance correspond to one or more of pauses in the spoken utterance.

15. The method of claim 14, wherein determining whether to incorporate the recognized text, that corresponds to the spoken utterance, into the transcription, or to perform the assistant command, that is associated with the transcription and that is based on the recognized text that corresponds to the spoken utterance, comprises determining to perform the assistant command associated with the transcription based on one or more of the pauses in the spoken utterance.

16.
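The processing order recited in claim 12 (ASR, then NLU, then an audio-based model, then the decision) can be sketched as a small pipeline. This is a rough illustration under assumptions, not the claimed implementation: `Annotated`, `Session`, `handle_utterance`, the `min_pause_s` threshold, and the three `fake_*` stand-in models are hypothetical names introduced here, and the rule that a detected command plus a long pause triggers the assistant is only one way to read claims 14 and 15.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Annotated:
    text: str
    command: Optional[str] = None  # hypothetical: command detected by the NLU stage, if any


@dataclass
class Session:
    transcription: List[str] = field(default_factory=list)
    performed: List[str] = field(default_factory=list)


def handle_utterance(audio: bytes, session: Session,
                     asr: Callable[[bytes], str],
                     nlu: Callable[[str], Annotated],
                     endpointer: Callable[[bytes], List[float]],
                     min_pause_s: float = 0.8) -> str:
    """Run ASR, NLU, and pause detection, then dictate or perform a command."""
    text = asr(audio)                 # recognized text
    annotated = nlu(text)             # annotated recognized text
    pauses = endpointer(audio)        # audio-based characteristic: pause durations
    long_pause = any(p >= min_pause_s for p in pauses)
    if annotated.command is not None and long_pause:
        session.performed.append(annotated.command)
        return "command"
    # Appending keeps earlier text in place when later text is added (cf. claim 11).
    session.transcription.append(text)
    return "text"


# Toy stand-ins for the three models; real models would consume actual audio.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8").replace("|", "").strip()


def fake_nlu(text: str) -> Annotated:
    if text.lower().startswith("send"):
        return Annotated(text, command="send")
    return Annotated(text)


def fake_endpointer(audio: bytes) -> List[float]:
    # Pretend each "|" byte marks a one-second pause in the audio.
    return [1.0] * audio.count(b"|")


if __name__ == "__main__":
    s = Session()
    handle_utterance(b"hello team", s, fake_asr, fake_nlu, fake_endpointer)
    handle_utterance(b"| send this", s, fake_asr, fake_nlu, fake_endpointer)
    print(s.transcription, s.performed)
```

Passing the models in as callables keeps the sketch independent of any particular speech or language library; the stand-ins exist only so the example runs on its own.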
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
using a touch-screen or digitiser, e.g. input of commands through traced gestures · CPC title
using natural language modelling · CPC title
Machine learning · CPC title
Execution procedure of a spoken command · CPC title