Who is the assignee on this patent?

Microsoft Corp, Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G10L15/22. Mapped technology areas include Physics.

When was this patent published?

Publication date: Sep 12, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Prosodic and lexical addressee detection

US9761247B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9761247-B2
Application number	US-201313755738-A
Country	US
Kind code	B2
Filing date	Jan 31, 2013
Priority date	Jan 31, 2013
Publication date	Sep 12, 2017
Grant date	Sep 12, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Prosodic features are used for discriminating computer-directed speech from human-directed speech. Statistics and models describing energy/intensity patterns over time, speech/pause distributions, pitch patterns, vocal effort features, and speech segment duration patterns may be used for prosodic modeling. The prosodic features for at least a portion of an utterance are monitored over a period of time to determine a shape associated with the utterance. A score may be determined to assist in classifying the current utterance as human directed or computer directed without relying on knowledge of preceding utterances or utterances following the current utterance. Outside data may be used for training lexical addressee detection systems for the H-H-C scenario. H-C training data can be obtained from a single-user H-C collection and that H-H speech can be modeled using general conversational speech. H-C and H-H language models may also be adapted using interpolation with small amounts of matched H-H-C data.
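The interpolation step in the abstract's last sentence can be illustrated with a minimal sketch. It assumes add-alpha smoothed unigram models, a fixed interpolation weight `lam`, and a per-word log-likelihood ratio as the lexical score; none of these choices comes from the patent text, and the toy sentences and function names are hypothetical.

```python
import math
from collections import Counter

def unigram_model(sentences, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(p_out, p_matched, lam):
    """Linear interpolation: lam * matched model + (1 - lam) * outside model."""
    return {w: lam * p_matched[w] + (1.0 - lam) * p_out[w] for w in p_out}

def llr_score(utterance, p_hc, p_hh):
    """Mean per-word log-likelihood ratio; positive favors computer-directed."""
    words = [w for w in utterance.split() if w in p_hc]
    if not words:
        return 0.0
    return sum(math.log(p_hc[w] / p_hh[w]) for w in words) / len(words)

# Hypothetical toy data: large "outside" sets and tiny matched H-H-C sets.
hc_out = ["play music", "show weather", "go back"]
hh_out = ["i think so", "what do you want", "yeah okay"]
hc_matched = ["show movies"]
hh_matched = ["do you want movies"]

all_text = hc_out + hh_out + hc_matched + hh_matched
vocab = sorted({w for s in all_text for w in s.split()})

lam = 0.3  # assumed weight; in practice it would be tuned on held-out data
p_hc = interpolate(unigram_model(hc_out, vocab), unigram_model(hc_matched, vocab), lam)
p_hh = interpolate(unigram_model(hh_out, vocab), unigram_model(hh_matched, vocab), lam)

print(llr_score("show weather", p_hc, p_hh))       # positive on this toy data
print(llr_score("what do you think", p_hc, p_hh))  # negative on this toy data
```

Because both component models are proper distributions over the same vocabulary, the interpolated model still sums to one for any `lam` between 0 and 1.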

First claim

Opening claim text (preview).

What is claimed is:
1. A method for identifying an intended addressee by a conversational understanding system executing on a computer processing device, the method comprising: receiving an utterance; extracting prosodic features from speech segments of the received utterance; performing an evaluation of the extracted prosodic features to determine whether the intended addressee is one of a human and a computer, wherein the evaluation is word-independent, context-independent, and speaker-independent; in response to determining that the intended addressee is the computer, processing the utterance to generate a response; and outputting the response through the computer processing device.
2. The method of claim 1, further comprising: characterizing the received utterance as a human-computer speaking style or a human-human speaking style based on at least one of a temporal pattern that evaluates segmental duration and a spectral pattern that evaluates at least one of: energy/intensity, pitch, and voice quality/vocal effort features.
3. The method of claim 1, further comprising: extracting energy-related features from the received utterance using fixed-length temporal windows within the received utterance; and determining the intended addressee based on the energy-related features.
4. The method of claim 1, further comprising: determining the intended addressee based on features of the received utterance, wherein the features comprise a peak count, a rate, a mean and a max distance apart, an intensity value, and a location and a value for the highest peak of the received utterance.
5. The method of claim 1, further comprising: determining the intended addressee based on speech activity information, wherein the speech activity information comprises a speaking rate and a duration.
6. The method of claim 1, further comprising: determining the intended addressee based on a waveform duration of the received utterance, a length of initial final non-speech regions, and a duration of non-speech regions.
7. The method of claim 1, further comprising: evaluating the intended addressee based on processing lexical features of the speech segments using a lexical model and processing the extracted prosodic features using a prosodic model.
8. The method of claim 7, further comprising training the lexical model used in a Human-Human-Computer dialog with out-of-domain data.
9. The method of claim 8, further comprising: determining the out-of-domain data based on anchor text.
10. The method of claim 7, wherein the lexical model comprises a similarity measure between words of the received utterance and display text.
11. A conversational understanding system comprising: a processor; and memory storing computer-executable instructions that, when executed, causes the processor to perform a method comprising: receive an utterance from a user; parse the utterance into one or more speech segments; extract prosodic features from the one or more speech segments; generate a score for the received utterance based on the prosodic features, wherein generating the score is word-independent, context-independent, and speaker-independent; determine an intended addressee of the received utterance based on the generated score, wherein the intended addressee is one of a human and a computer; in response to determining that the intended addressee is the processor, generate a response for the received utterance; and output the response to the user.
12. The system of claim 11, wherein the method further comprises: characterizing the received utterance as a human-computer speaking style or a human-human speaking style based on at least one of a temporal pattern that evaluates segmental duration and a spectral pattern that evaluates at least one of: energy/intensity, pitch, and voice quality/vocal effort features.
13. The system of claim 11, wherein the method further comprises: extracting energy-related features from the received utterance using fixed-length temporal windows within the received utterance; and determining the intended addressee based on the energy-related features.
14. The system of claim 11, wherein the method further comprises: determining the intended addressee based on features of the received utterance, wherein the features comprise a peak count, a rate, a mean and a max distance apart, an intensity value, and a location and a value for the highest peak of the received utterance.
15. The system of claim 11, wherein the computer is a computing device that is executing the conversational understanding system.
16. The system of claim 11, wherein the method further comprises: determining the intended addressee based on speech activity information, wherein the speech activity information comprises a length of initial final non-speech regions and a duration of non-speech regions for the extracted prosodic features.
17. A conversational understanding system for addressee detection, comprising: a computer processor and memory; an operating environment executed by the computer processor; and an addressee manager that is configured to perform actions comprising: receiving an utterance; extracting prosodic features from speech segments of the received utterance; performing an evaluation of the extracted prosodic features to determine whether the intended addressee is a computer or a human, wherein the evaluation is word-independent, context-independent, and speaker independent; in response to determining that the intended addressee is the computer, processing the utterance to generate a response for the received utterance; and outputting the response through the conversational understanding system.
18. The system of claim 17, wherein the performed actions further comprise: extracting energy-related features from the received utterance using fixed-length temporal windows within the received utterance and; determining the intended addressee based on the extracted energy-related features.
19. The system of claim 17, wherein the performed actions further comprise: determining the intended addressee based on features of the received utterance, wherein the features comprise of a peak count, a rate, a mean and a max distance apart, an intensity value, and a location and a value for the highest peak of the received utterance.
20. The system of claim 17, wherein the performed actions further comprise: determining the intended addressee based on speech activity information, wherein the speech activity information comprises a length of initial final non-speech regions and a duration of non-speech regions.
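As a rough illustration of the energy-related features named in the claims (fixed-length temporal windows, peak count, rate, mean and max distance apart, and the location and value of the highest peak), the sketch below computes them from a list of audio samples. The claims do not define these features precisely, so the non-overlapping RMS windows, the local-maximum definition of a peak, and the function and key names are all assumptions made for this example.

```python
import math

def window_energy(samples, win):
    """RMS energy over consecutive fixed-length, non-overlapping windows."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + win]) / win)
        for i in range(0, len(samples) - win + 1, win)
    ]

def peak_features(energy, win_seconds):
    """Summarize local maxima of an energy contour (assumed peak definition)."""
    peaks = [i for i in range(1, len(energy) - 1)
             if energy[i - 1] < energy[i] >= energy[i + 1]]
    # Time between neighboring peaks, in seconds.
    gaps = [(b - a) * win_seconds for a, b in zip(peaks, peaks[1:])]
    top = max(peaks, key=lambda i: energy[i]) if peaks else None
    duration = len(energy) * win_seconds
    return {
        "peak_count": len(peaks),
        "peak_rate": len(peaks) / duration if duration else 0.0,
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0.0,
        "top_peak_location": top * win_seconds if top is not None else None,
        "top_peak_value": energy[top] if top is not None else None,
    }

# Hypothetical energy contour with one value per 0.5-second window.
contour = [0.1, 0.5, 0.2, 0.9, 0.3, 0.4, 0.1]
features = peak_features(contour, win_seconds=0.5)
print(features["peak_count"], features["top_peak_location"], features["top_peak_value"])
```

A feature dictionary like this would then feed whatever classifier produces the human-directed versus computer-directed score; the classifier itself is outside the scope of this sketch.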

Assignees

Inventors

Classifications

G10L15/183
using context dependencies, e.g. language models · CPC title
G10L25/90
Pitch determination of speech signals · CPC title
G10L25/18
the extracted parameters being spectral information of each sub-band · CPC title
G10L15/22 · Primary
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
G06F40/284
Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

View patent family 51223888

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9761247B2 cover?: Prosodic features are used for discriminating computer-directed speech from human-directed speech. Statistics and models describing energy/intensity patterns over time, speech/pause distributions, pitch patterns, vocal effort features, and speech segment duration patterns may be used for prosodic modeling. The prosodic features for at least a portion of an utterance are monitored over a period …