Processing multimodal user input for assistant systems
US-12406316-B2 · Sep 2, 2025 · US
US12499868B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12499868-B2 |
| Application number | US-202318296914-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 6, 2023 |
| Priority date | Apr 8, 2022 |
| Publication date | Dec 16, 2025 |
| Grant date | Dec 16, 2025 |
Official abstract text for this publication.
A data processing method is provided. The method includes: obtaining a speech pattern data of a target user based on a speech information of the target user, where the speech pattern data indicates a speech feature of the target user; and converting a broadcast text into an audio content based on the speech pattern data, where a text of the audio content corresponds to the broadcast text, and the audio content has the speech feature.
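As a rough illustration of the two steps in the abstract, using the lexical and grammatical data that claim 1 describes, the following Python sketch works on speech text that has already been recognised. The particle inventory, the punctuation-based grammar tagger, and all names (`SpeechPatternData`, `extract_speech_pattern`, `style_broadcast_text`) are hypothetical and are not taken from the patent. Speech recognition and speech synthesis are outside the sketch.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical particle inventory; the patent does not list specific particles.
MODAL_PARTICLES = {"well", "oh", "hmm", "right", "okay", "um"}


@dataclass
class SpeechPatternData:
    lexical_data: list[str]  # modal particles the user commonly uses
    grammatical_data: str    # tag for the user's most common sentence form


def tag_sentence(sentence: str) -> str:
    """Toy grammar tagger based only on final punctuation (an assumption)."""
    s = sentence.strip()
    if s.endswith("?"):
        return "interrogative"
    if s.endswith("!"):
        return "exclamatory"
    return "declarative"


def extract_speech_pattern(speech_text: str, top_n: int = 1) -> SpeechPatternData:
    """Derive lexical and grammatical data from recognised speech text."""
    words = re.findall(r"[a-z']+", speech_text.lower())
    particle_counts = Counter(w for w in words if w in MODAL_PARTICLES)
    lexical = [w for w, _ in particle_counts.most_common(top_n)]

    sentences = re.findall(r"[^.!?]+[.!?]", speech_text)
    tag_counts = Counter(tag_sentence(s) for s in sentences)
    grammatical = tag_counts.most_common(1)[0][0] if tag_counts else "declarative"
    return SpeechPatternData(lexical, grammatical)


def style_broadcast_text(broadcast_text: str, pattern: SpeechPatternData) -> str:
    """Prefix the broadcast text with the user's most common particle.

    This only produces the text that a speech synthesiser would read;
    one possible way to carry a lexical feature, not the patent's method.
    """
    if not pattern.lexical_data:
        return broadcast_text
    particle = pattern.lexical_data[0].capitalize()
    return f"{particle}, {broadcast_text[:1].lower()}{broadcast_text[1:]}"


speech_text = "Well, I think it is fine. Well, is it raining? Oh, it is sunny. Right."
pattern = extract_speech_pattern(speech_text)
styled = style_broadcast_text("The weather is sunny today.", pattern)
```

Here `pattern.lexical_data` holds the most frequent particle and `pattern.grammatical_data` the most frequent sentence tag; a real system would likely replace both heuristics with trained models.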
Opening claim text (preview).
What is claimed is:
1. A data processing method, comprising: obtaining a speech pattern data of a target user based on speech information of the target user, wherein the speech pattern data indicates a speech feature of the target user, wherein the speech feature comprises a lexical feature and a grammatical feature, wherein the speech pattern data comprises a lexical data indicating the lexical feature of the target user, and a grammatical data indicating the grammatical feature of the target user, and wherein the obtaining the speech pattern data of the target user based on the speech information of the target user comprises: performing a speech recognition on the speech information of the target user to obtain a speech text; obtaining a modal particle commonly used by the target user from the speech text to determine the modal particle as the lexical data; and obtaining, by analyzing a grammar in the speech text, a tag representing a grammar commonly used by the target user to determine the tag as the grammatical data; and converting a broadcast text into an audio content for delivery to the target user based on the speech pattern data, wherein a text of the audio content corresponds to the broadcast text, and the audio content has the speech feature.
2. The method according to claim 1, wherein the speech feature further comprises at least one of a pronunciation feature and a speech rate feature, and wherein the speech pattern data further comprises at least one of following: a pronunciation data indicating the pronunciation feature of the target user; and a speech rate data indicating the speech rate feature of the target user.
3. The method according to claim 1, further comprising: obtaining, based on an image information of the target user, a facial expression data indicating a facial expression of the target user, wherein the image information corresponds to the speech information; and driving, based on the facial expression data, a digital human to deliver the audio content, wherein the digital human has a broadcast expression when delivering the audio content, and wherein the broadcast expression corresponds to the facial expression of the target user.
4. The method according to claim 3, wherein the facial expression data comprises an expression type data and an expression intensity data.
5. The method according to claim 3, further comprising: obtaining a behavior data of the target user based on the image information, wherein the behavior data indicates a behavior of the target user; and wherein the driving, based on the facial expression data, the digital human to deliver the audio content comprises: driving, based on the facial expression data and the behavior data, the digital human to deliver the audio content, wherein the digital human has the broadcast expression and a broadcast behavior when delivering the audio content, and wherein the broadcast behavior corresponds to the behavior of the target user.
6. The method according to claim 5, wherein the behavior of the target user comprises at least one of a motion, a posture, a gesture, and a breathing rate, and the behavior data further comprises at least one of following: a motion data indicating the motion of the target user; a posture data indicating the posture of the target user; a gesture data indicating the gesture of the target user; and a breathing rate data indicating the breathing rate of the target user.
7. An electronic device, comprising: one or more processors; and a memory storing one or more programs configured to be executed by the one or more processors, the one or more programs comprising instructions for performing operations comprising: obtaining a speech pattern data of a target user based on speech information of the target user, wherein the speech pattern data indicates a speech feature of the target user, wherein the speech feature comprises a lexical feature and a grammatical feature, wherein the speech pattern data comprises a lexical data indicating the lexical feature of the target user, and a grammatical data indicating the grammatical feature of the target user, and wherein the obtaining the speech pattern data of the target user based on the speech information of the target user comprises: performing a speech recognition on the speech information of the target user to obtain a speech text; obtaining a modal particle commonly used by the target user from the speech text to determine the modal particle as the lexical data; and obtaining, by analyzing a grammar in the speech text, a tag representing a grammar commonly used by the target user to determine the tag as the grammatical data; and converting a broadcast text into an audio content for delivery to the target user based on the speech pattern data, wherein a text of the audio content corresponds to the broadcast text, and the audio content has the speech feature.
8. The electronic device according to claim 7, wherein the speech feature further comprises at least one of a pronunciation feature and a speech rate feature, and wherein the speech pattern data further comprises at least one of following: a pronunciation data indicating the pronunciation feature of the target user; and a speech rate data indicating the speech rate feature of the target user.
9. The electronic device according to claim 7, wherein the operations further comprising: obtaining, based on an image information of the target user, a facial expression data indicating a facial expression of the target user, wherein the image information corresponds to the speech information; and driving, based on the facial expression data, a digital human to deliver the audio content, wherein the digital human has a broadcast expression when delivering the audio content, and wherein the broadcast expression corresponds to the facial expression of the target user.
10. The electronic device according to claim 9, wherein the facial expression data comprises an expression type data and an expression intensity data.
11. The electronic device according to claim 9, wherein the operations further comprising: obtaining a behavior data of the target user based on the image information, wherein the behavior data indicates a behavior of the target user; and wherein the driving, based on the facial expression data, the digital human to deliver the audio content comprises: driving, based on the facial expression data and the behavior data, the digital human to deliver the audio content, wherein the digital human has the broadcast expression and a broadcast behavior when delivering the audio content, and wherein the broadcast behavior corresponds to the behavior of the target user.
12. The electronic device according to claim 11, wherein the behavior of the target user comprises at least one of a motion, a posture, a gesture, and a breathing rate, and the behavior data further comprises at least one of following: a motion data indicating the motion of the target user; a posture data indicating the posture of the target user; a gesture data indicating the gesture of the target user; and a breathing rate data indicating the breathing rate of the target user.
13. A non-transitory computer-readable storage medium storing one or more programs comprising instructions that, when executed by one or more processors of a computing device, cause the computing device to perform operations comprising: obtaining a speech pattern data of a target user based on speech information of the target user, wherein the speech pattern data indicates a speech feature of the target user, wherein the sp
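The facial expression data of claims 3 and 4 and the behavior data of claims 5 and 6 can be pictured as simple records that feed a digital-human driver. The Python sketch below is one possible shape under stated assumptions: the class and function names, the 0.0 to 1.0 intensity range, the breaths-per-minute unit, and the dictionary-shaped drive command are all hypothetical and do not come from the patent.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FacialExpressionData:
    expression_type: str         # e.g. "smile"; the label set is an assumption
    expression_intensity: float  # assumed to lie in [0.0, 1.0]


@dataclass
class BehaviorData:
    # Claim 6 requires at least one of these four kinds of data.
    motion: Optional[str] = None
    posture: Optional[str] = None
    gesture: Optional[str] = None
    breathing_rate: Optional[float] = None  # breaths per minute (assumed unit)

    def __post_init__(self) -> None:
        if all(value is None for value in asdict(self).values()):
            raise ValueError("behavior data needs at least one populated field")


def build_drive_command(
    audio_id: str,
    expression: FacialExpressionData,
    behavior: Optional[BehaviorData] = None,
) -> dict:
    """Assemble a hypothetical command for a digital-human renderer.

    The broadcast expression mirrors the user's expression; the broadcast
    behavior, when present, mirrors the user's observed behavior.
    """
    intensity = min(max(expression.expression_intensity, 0.0), 1.0)
    command = {
        "audio": audio_id,
        "broadcast_expression": {
            "type": expression.expression_type,
            "intensity": intensity,
        },
    }
    if behavior is not None:
        command["broadcast_behavior"] = {
            key: value for key, value in asdict(behavior).items() if value is not None
        }
    return command


command = build_drive_command(
    "clip-1",
    FacialExpressionData("smile", 1.4),  # out-of-range intensity is clamped
    BehaviorData(gesture="wave"),
)
```

Keeping expression and behavior in separate records matches the claim structure, where behavior data is an optional addition on top of the facial expression data.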
using statistical methods · CPC title
Semantic analysis · CPC title
using position of the lips, movement of the lips or face analysis · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
of characters, e.g. humans, animals or virtual beings · CPC title