US11626104B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11626104-B2 |
| Application number | US-202017115158-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 8, 2020 |
| Priority date | Dec 8, 2020 |
| Publication date | Apr 11, 2023 |
| Grant date | Apr 11, 2023 |
Official abstract text for this publication.
A device includes processors configured to determine, in a first power mode, whether an audio stream corresponds to speech of at least two talkers. The processors are configured to, based on determining that the audio stream corresponds to speech of at least two talkers, analyze, in a second power mode, audio feature data of the audio stream to generate a segmentation result. The processors are configured to perform a comparison of a plurality of user speech profiles to an audio feature data set of a plurality of audio feature data sets of a talker-homogenous audio segment to determine whether the audio feature data set matches any of the user speech profiles. The processors are configured to, based on determining that the audio feature data set does not match any of the plurality of user speech profiles, generate a user speech profile based on the plurality of audio feature data sets.
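The gating step in the abstract (a talker check in one power mode, segmentation in another only when at least two talkers are present) can be sketched as follows. This is a minimal illustration, not the patented implementation: `PowerMode`, `gate_and_segment`, and the two toy callables are hypothetical names, and the toy talker counter and segmenter stand in for whatever detectors and trained models a real device would use. Treating the first mode as the lower-power one follows dependent claim 14 and is an assumption here.

```python
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[float]


class PowerMode(Enum):
    LOW = "first power mode"    # assumed lower-power, per dependent claim 14
    HIGH = "second power mode"


def gate_and_segment(
    frames: Sequence[Frame],
    count_talkers: Callable[[Sequence[Frame]], int],
    segment: Callable[[Sequence[Frame]], List[int]],
) -> Tuple[PowerMode, Optional[List[int]]]:
    """Run the cheap talker check first; segment only if >= 2 talkers."""
    if count_talkers(frames) < 2:
        # Stay in the first power mode; no segmentation result is produced.
        return PowerMode.LOW, None
    # Switch to the second power mode and run the (costlier) segmentation.
    return PowerMode.HIGH, segment(frames)


# Toy stand-ins (hypothetical): the sign of the first feature plays the
# role of "which talker is speaking".
def toy_count_talkers(frames: Sequence[Frame]) -> int:
    return len({f[0] > 0 for f in frames})


def toy_segment(frames: Sequence[Frame]) -> List[int]:
    return [0 if f[0] > 0 else 1 for f in frames]


single = gate_and_segment([[1.0], [2.0]], toy_count_talkers, toy_segment)
multi = gate_and_segment([[1.0], [-1.0]], toy_count_talkers, toy_segment)
```

In this sketch the segmenter is never called for the single-talker input, which is the point of the gate: the expensive analysis runs only when there is more than one talker to separate.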
Opening claim text (preview).
What is claimed is:

1. A device for audio analysis comprising: a memory configured to store a plurality of user speech profiles of a plurality of users; and one or more processors configured to: determine, in a first power mode, whether an audio stream corresponds to speech of at least two distinct talkers; based on determining that the audio stream corresponds to speech of at least two distinct talkers, analyze, in a second power mode, audio feature data of the audio stream to generate a segmentation result using one or more machine-learning segmentation models that are trained to perform speaker segmentation, the segmentation result indicating talker-homogenous audio segments of the audio stream; perform a comparison of the plurality of user speech profiles to a first audio feature data set of a first plurality of audio feature data sets of a first talker-homogenous audio segment to determine whether the first audio feature data set matches any of the plurality of user speech profiles; and based on determining that the first audio feature data set does not match any of the plurality of user speech profiles: store the first audio feature data set in a first enrollment buffer associated with a first talker; store subsequent audio feature data sets corresponding to speech of the first talker in the first enrollment buffer until a stop condition is satisfied, wherein the first plurality of audio feature data sets of the first talker-homogenous audio segment includes the first audio feature data set and the subsequent audio feature data sets; generate a first user speech profile based on the first plurality of audio feature data sets; and add the first user speech profile to the plurality of user speech profiles.
2. The device of claim 1, wherein the first audio feature data set includes a first audio feature vector.
3. The device of claim 1, wherein the one or more processors are configured to analyze the audio feature data by applying a speaker segmentation neural network to the audio feature data.
4. The device of claim 1, wherein the one or more processors are configured to determine that the stop condition is satisfied in response to determining that longer than threshold silence is detected in the audio stream.
5. The device of claim 1, wherein the one or more processors are configured to add a particular audio feature data set to the first enrollment buffer based at least in part on determining that the particular audio feature data set corresponds to speech of a single talker, wherein the single talker includes the first talker.
6. The device of claim 1, wherein the one or more processors are configured to, based on determining that a count of the first plurality of audio feature data sets of the first talker-homogenous audio segment stored in the first enrollment buffer is greater than an enrollment threshold, generate the first user speech profile based on the first plurality of audio feature data sets.
7. The device of claim 1, wherein the one or more processors are configured to, based on determining that the first audio feature data set matches a particular user speech profile, update the particular user speech profile based on the first audio feature data set.
8. The device of claim 7, wherein the one or more processors are configured to, based at least in part on determining that the first audio feature data set corresponds to speech of a single talker, update the particular user speech profile based on the first audio feature data set.
9. The device of claim 1, wherein the one or more processors are configured to determine whether a second audio feature data set of a second plurality of audio feature data sets of a second talker-homogenous audio segment matches any of the plurality of user speech profiles.
10. The device of claim 9, wherein the one or more processors are configured to, based on determining that the second audio feature data set does not match any of the plurality of user speech profiles: generate a second user speech profile based on the second plurality of audio feature data sets; and add the second user speech profile to the plurality of user speech profiles.
11. The device of claim 9, wherein the one or more processors are configured to, based on determining that the second audio feature data set matches a particular user speech profile of the plurality of user speech profiles, update the particular user speech profile based on the second audio feature data set.
12. The device of claim 1, wherein the memory is configured to store profile update data, and wherein the one or more processors are configured to: in response to generating the first user speech profile, update the profile update data to indicate that the first user speech profile is updated; and based on determining that the profile update data indicates that a first count of the plurality of user speech profiles have been updated, output the first count as a count of talkers detected in the audio stream.
13. The device of claim 1, wherein the memory is configured to store user interaction data, and wherein the one or more processors are configured to: in response to generating the first user speech profile, update the user interaction data based on a speech duration of the first talker-homogenous audio segment to indicate that a first user associated with the first user speech profile interacted for the speech duration; and output at least the user interaction data.
14. The device of claim 1, wherein the first power mode is a lower power mode as compared to the second power mode.
15. The device of claim 1, wherein the one or more processors are configured to: determine, in the first power mode, audio information of the audio stream, the audio information including a count of talkers detected in the audio stream, voice activity detection (VAD) information, or both; activate one or more audio analysis applications in the second power mode; and provide the audio information to one or more audio analysis applications.
16. The device of claim 1, wherein the one or more processors are configured to, in response to determining that the segmentation result indicates that one or more second audio segments of the audio stream correspond to multiple talkers, refrain from updating the plurality of user speech profiles based on the one or more second audio segments.
17. A method of audio analysis comprising: determining, in a first power mode at a device, whether an audio stream corresponds to speech of at least two distinct talkers; based on determining that the audio stream corresponds to speech of at least two distinct talkers, analyzing, in a second power mode, audio feature data of the audio stream to generate a segmentation result using one or more machine-learning segmentation models that are trained to perform speaker segmentation, the segmentation result indicating talker-homogenous audio segments of the audio stream; performing, at the device, a comparison of a plurality of user speech profiles to a first audio feature data set of a first plurality of audio feature data sets of a first talker-homogenous audio segment to determine whether the first audio feature data set matches any of the plurality of user speech profiles; and based on determining that the first audio feature data set does not match any of the plurality of user speech profiles: storing the first audio feature data set in a first enrollment buffer associated with a first talker; storing subsequent audio feature data sets corresponding to speech of the first talker in the first enro
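The match-or-enroll flow recited in claims 1, 4, 6, and 7 can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the claims do not specify cosine similarity, averaging, or any threshold values, and `ProfileStore`, `MATCH_THRESHOLD`, and `ENROLLMENT_THRESHOLD` are hypothetical names. The sketch also simplifies the claims by using a single enrollment buffer and by comparing every incoming feature set against the stored profiles.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Vector = List[float]

MATCH_THRESHOLD = 0.8      # assumed cosine-similarity cutoff for a "match"
ENROLLMENT_THRESHOLD = 3   # assumed; claim 6 requires a count greater than this


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


@dataclass
class ProfileStore:
    profiles: Dict[str, Vector] = field(default_factory=dict)
    buffer: List[Vector] = field(default_factory=list)  # enrollment buffer

    def match(self, features: Vector) -> Optional[str]:
        """Return the id of the best profile at or above the cutoff, if any."""
        best_id, best_score = None, MATCH_THRESHOLD
        for pid, profile in self.profiles.items():
            score = cosine(features, profile)
            if score >= best_score:
                best_id, best_score = pid, score
        return best_id

    def observe(self, features: Vector) -> Optional[str]:
        """Handle one feature set from a single-talker segment."""
        pid = self.match(features)
        if pid is not None:
            # Claim 7: update the matched profile (averaging is an assumption).
            self.profiles[pid] = mean_vector([self.profiles[pid], features])
            return pid
        self.buffer.append(features)  # no match: buffer for enrollment
        return None

    def stop(self) -> Optional[str]:
        """Call when the stop condition (e.g. long silence, claim 4) is met."""
        new_id = None
        if len(self.buffer) > ENROLLMENT_THRESHOLD:
            new_id = f"talker-{len(self.profiles) + 1}"
            self.profiles[new_id] = mean_vector(self.buffer)
        self.buffer.clear()
        return new_id


store = ProfileStore()
for v in ([1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.0]):
    store.observe(v)              # nothing enrolled yet, so each is buffered
enrolled = store.stop()           # four buffered sets exceed the threshold
known = store.observe([1.0, 0.05])
unknown = store.observe([0.0, 1.0])
too_short = store.stop()          # only one buffered set: no new profile
```

Deferring profile creation until the stop condition, and only when enough feature sets were buffered, keeps a short or noisy utterance from producing a profile. A fuller version would keep one buffer per talker and skip segments the segmentation result marks as multi-talker, as claim 16 describes.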