US9761247B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9761247-B2 |
| Application number | US-201313755738-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 31, 2013 |
| Priority date | Jan 31, 2013 |
| Publication date | Sep 12, 2017 |
| Grant date | Sep 12, 2017 |
Official abstract text for this publication.
Prosodic features are used for discriminating computer-directed speech from human-directed speech. Statistics and models describing energy/intensity patterns over time, speech/pause distributions, pitch patterns, vocal effort features, and speech segment duration patterns may be used for prosodic modeling. The prosodic features for at least a portion of an utterance are monitored over a period of time to determine a shape associated with the utterance. A score may be determined to assist in classifying the current utterance as human directed or computer directed without relying on knowledge of preceding utterances or utterances following the current utterance. Outside data may be used for training lexical addressee detection systems for the human-human-computer (H-H-C) scenario. Human-computer (H-C) training data can be obtained from a single-user H-C collection, and human-human (H-H) speech can be modeled using general conversational speech. H-C and H-H language models may also be adapted using interpolation with small amounts of matched H-H-C data.
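The interpolation step mentioned at the end of the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which model order, smoothing, or interpolation weight is used, so the sketch picks add-alpha smoothed unigram models, a hand-set weight `LAMBDA`, and tiny made-up corpora. The helper names `train_unigram`, `interpolate`, and `llr_score` are hypothetical. The score is a per-word log-likelihood ratio between the adapted H-C and H-H models.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (assumed model type)."""
    counts = Counter(w for s in sentences for w in s.lower().split() if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_out, p_matched, lam):
    """Linear interpolation: lam * matched model + (1 - lam) * out-of-domain model."""
    return {w: lam * p_matched[w] + (1.0 - lam) * p_out[w] for w in p_out}


def llr_score(utterance, p_hc, p_hh):
    """Per-word log-likelihood ratio; positive values favor computer-directed speech."""
    words = [w for w in utterance.lower().split() if w in p_hc]
    if not words:
        return 0.0  # no in-vocabulary evidence either way
    return sum(math.log(p_hc[w]) - math.log(p_hh[w]) for w in words) / len(words)


# Toy, made-up corpora for illustration only (not data from the patent).
hc_out = ["play some music", "show me the weather", "go back"]          # single-user H-C
hh_out = ["what do you think", "i do not know", "maybe we should go"]   # general conversation
hc_matched = ["scroll down", "show me more"]                            # small matched H-H-C data
hh_matched = ["do you want that one"]

vocab = {w for s in hc_out + hh_out + hc_matched + hh_matched for w in s.lower().split()}
LAMBDA = 0.3  # arbitrary interpolation weight, chosen by hand for this sketch

p_hc = interpolate(train_unigram(hc_out, vocab), train_unigram(hc_matched, vocab), LAMBDA)
p_hh = interpolate(train_unigram(hh_out, vocab), train_unigram(hh_matched, vocab), LAMBDA)

for text in ("show me the weather", "what do you think"):
    label = "computer" if llr_score(text, p_hc, p_hh) > 0 else "human"
    print(f"{text!r} -> {label}")
```

In practice the weight would be tuned on held-out matched data rather than fixed by hand, and a higher-order n-gram model would likely replace the unigram model used here.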
Opening claim text (preview).
What is claimed is:
1. A method for identifying an intended addressee by a conversational understanding system executing on a computer processing device, the method comprising: receiving an utterance; extracting prosodic features from speech segments of the received utterance; performing an evaluation of the extracted prosodic features to determine whether the intended addressee is one of a human and a computer, wherein the evaluation is word-independent, context-independent, and speaker-independent; in response to determining that the intended addressee is the computer, processing the utterance to generate a response; and outputting the response through the computer processing device.
2. The method of claim 1, further comprising: characterizing the received utterance as a human-computer speaking style or a human-human speaking style based on at least one of a temporal pattern that evaluates segmental duration and a spectral pattern that evaluates at least one of: energy/intensity, pitch, and voice quality/vocal effort features.
3. The method of claim 1, further comprising: extracting energy-related features from the received utterance using fixed-length temporal windows within the received utterance; and determining the intended addressee based on the energy-related features.
4. The method of claim 1, further comprising: determining the intended addressee based on features of the received utterance, wherein the features comprise a peak count, a rate, a mean and a max distance apart, an intensity value, and a location and a value for the highest peak of the received utterance.
5. The method of claim 1, further comprising: determining the intended addressee based on speech activity information, wherein the speech activity information comprises a speaking rate and a duration.
6. The method of claim 1, further comprising: determining the intended addressee based on a waveform duration of the received utterance, a length of initial final non-speech regions, and a duration of non-speech regions.
7. The method of claim 1, further comprising: evaluating the intended addressee based on processing lexical features of the speech segments using a lexical model and processing the extracted prosodic features using a prosodic model.
8. The method of claim 7, further comprising training the lexical model used in a Human-Human-Computer dialog with out-of-domain data.
9. The method of claim 8, further comprising: determining the out-of-domain data based on anchor text.
10. The method of claim 7, wherein the lexical model comprises a similarity measure between words of the received utterance and display text.
11. A conversational understanding system comprising: a processor; and memory storing computer-executable instructions that, when executed, causes the processor to perform a method comprising: receive an utterance from a user; parse the utterance into one or more speech segments; extract prosodic features from the one or more speech segments; generate a score for the received utterance based on the prosodic features, wherein generating the score is word-independent, context-independent, and speaker-independent; determine an intended addressee of the received utterance based on the generated score, wherein the intended addressee is one of a human and a computer; in response to determining that the intended addressee is the processor, generate a response for the received utterance; and output the response to the user.
12. The system of claim 11, wherein the method further comprises: characterizing the received utterance as a human-computer speaking style or a human-human speaking style based on at least one of a temporal pattern that evaluates segmental duration and a spectral pattern that evaluates at least one of: energy/intensity, pitch, and voice quality/vocal effort features.
13. The system of claim 11, wherein the method further comprises: extracting energy-related features from the received utterance using fixed-length temporal windows within the received utterance; and determining the intended addressee based on the energy-related features.
14. The system of claim 11, wherein the method further comprises: determining the intended addressee based on features of the received utterance, wherein the features comprise a peak count, a rate, a mean and a max distance apart, an intensity value, and a location and a value for the highest peak of the received utterance.
15. The system of claim 11, wherein the computer is a computing device that is executing the conversational understanding system.
16. The system of claim 11, wherein the method further comprises: determining the intended addressee based on speech activity information, wherein the speech activity information comprises a length of initial final non-speech regions and a duration of non-speech regions for the extracted prosodic features.
17. A conversational understanding system for addressee detection, comprising: a computer processor and memory; an operating environment executed by the computer processor; and an addressee manager that is configured to perform actions comprising: receiving an utterance; extracting prosodic features from speech segments of the received utterance; performing an evaluation of the extracted prosodic features to determine whether the intended addressee is a computer or a human, wherein the evaluation is word-independent, context-independent, and speaker independent; in response to determining that the intended addressee is the computer, processing the utterance to generate a response for the received utterance; and outputting the response through the conversational understanding system.
18. The system of claim 17, wherein the performed actions further comprise: extracting energy-related features from the received utterance using fixed-length temporal windows within the received utterance and; determining the intended addressee based on the extracted energy-related features.
19. The system of claim 17, wherein the performed actions further comprise: determining the intended addressee based on features of the received utterance, wherein the features comprise of a peak count, a rate, a mean and a max distance apart, an intensity value, and a location and a value for the highest peak of the received utterance.
20. The system of claim 17, wherein the performed actions further comprise: determining the intended addressee based on speech activity information, wherein the speech activity information comprises a length of initial final non-speech regions and a duration of non-speech regions.
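The energy-related features named in claims 3, 4, 13, 14, 18, and 19 (fixed-length windows, peak count, rate, mean and max distance apart, location and value of the highest peak) could be computed along the following lines. This is a rough sketch under stated assumptions, not the patented implementation: the claims do not define how a peak is detected or how energy is measured, so the sketch uses mean squared amplitude per window and treats any strict local rise followed by a non-rise as a peak. The function names `frame_energies` and `peak_features` and the toy contour are hypothetical.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude over consecutive, non-overlapping fixed-length windows."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def peak_features(energies, frame_sec):
    """Summarize local maxima of an energy contour sampled every frame_sec seconds."""
    n = len(energies)
    # Assumed peak definition: higher than the previous frame, not lower than the next.
    peaks = [i for i in range(1, n - 1)
             if energies[i - 1] < energies[i] >= energies[i + 1]]
    gaps = [(b - a) * frame_sec for a, b in zip(peaks, peaks[1:])]
    top = max(peaks, key=lambda i: energies[i]) if peaks else None
    duration = n * frame_sec
    return {
        "peak_count": len(peaks),
        "peak_rate": len(peaks) / duration if duration else 0.0,  # peaks per second
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,       # mean distance apart (s)
        "max_gap": max(gaps) if gaps else 0.0,                    # max distance apart (s)
        "top_peak_time": top * frame_sec if top is not None else None,
        "top_peak_value": energies[top] if top is not None else None,
    }


# Made-up energy contour with 100 ms frames, for illustration only.
contour = [0.0, 1.0, 0.0, 3.0, 0.0, 0.0, 2.0, 0.0]
feats = peak_features(contour, frame_sec=0.1)
print(feats)
```

A feature dictionary like `feats` would then be passed to whatever prosodic model produces the addressee score; the claims leave the choice of model open.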
using context dependencies, e.g. language models · CPC title
Pitch determination of speech signals · CPC title
the extracted parameters being spectral information of each sub-band · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title