US9711148B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9711148-B1 |
| Application number | US-201313944975-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jul 18, 2013 |
| Priority date | Jul 18, 2013 |
| Publication date | Jul 18, 2017 |
| Grant date | Jul 18, 2017 |
Official abstract text for this publication.
A processing system receives an audio signal encoding an utterance and determines that a first portion of the audio signal corresponds to a predefined phrase. The processing system accesses one or more text-dependent models associated with the predefined phrase and determines a first confidence based on the one or more text-dependent models associated with the predefined phrase, the first confidence corresponding to a first likelihood that a particular speaker spoke the utterance. The processing system determines a second confidence for a second portion of the audio signal using one or more text-independent models, the second confidence corresponding to a second likelihood that the particular speaker spoke the utterance. The processing system then determines that the particular speaker spoke the utterance based at least in part on the first confidence and the second confidence.
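The combination step described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation. The `SpeakerScores` type, the proportional weighting rule, and the 0.6 threshold are hypothetical. The claims say only that each weight is based at least on the quantity of training utterances, that the chosen speaker's combined confidence is greater than that of any other speaker, and (in a dependent claim) that it satisfies a predetermined threshold.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SpeakerScores:
    """Per-speaker inputs to the fusion step (all field names are illustrative)."""
    td_confidence: float  # first confidence, from the text-dependent model
    ti_confidence: float  # second confidence, from the text-independent model
    td_train_count: int   # keyword utterances used to train the text-dependent model
    ti_train_count: int   # non-keyword utterances used to train the text-independent model


def combined_confidence(s: SpeakerScores) -> float:
    """Weight each confidence by its share of the training data (an assumed rule)."""
    total = s.td_train_count + s.ti_train_count
    if total == 0:
        # No training data recorded for either model: fall back to equal weights.
        w_td = w_ti = 0.5
    else:
        w_td = s.td_train_count / total
        w_ti = s.ti_train_count / total
    return w_td * s.td_confidence + w_ti * s.ti_confidence


def identify_speaker(scores: dict[str, SpeakerScores], threshold: float = 0.6) -> str | None:
    """Return the best-scoring speaker, or None if that speaker misses the threshold."""
    if not scores:
        return None
    combined = {name: combined_confidence(s) for name, s in scores.items()}
    # Ties resolve to the first speaker encountered; a stricter reading would reject ties.
    best = max(combined, key=combined.get)
    return best if combined[best] >= threshold else None


# Hypothetical enrolled speakers and scores for one utterance.
scores = {
    "alice": SpeakerScores(0.9, 0.7, td_train_count=30, ti_train_count=10),
    "bob": SpeakerScores(0.4, 0.8, td_train_count=5, ti_train_count=15),
}
print(identify_speaker(scores))
```

Weighting by training volume lets the better-trained model dominate: a speaker with many keyword recordings but few general recordings is judged mostly on the keyword portion.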
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method comprising:

receiving, by a speech-enabled home device of an automated speaker identification system that includes the speech-enabled home device that includes one or more microphones for detecting utterances spoken in a home environment, and a server based speaker recognition engine that is associated with an automated query processor and that includes (i) one or more text-dependent speaker identification models that are trained using multiple previous utterances of a keyword by a particular speaker and by other users whose accounts are registered with the server, and (ii) one or more text-independent speaker identification models that are trained using utterances of words other than the keyword by the particular speaker and by other users whose accounts are registered with the server, an audio signal encoding an utterance that was spoken in a home environment and was detected by one or more microphones of the speech-enabled home device, and that includes the keyword and a query;

determining, by the server-based speaker recognition engine and based on an analysis of a portion of the audio signal that corresponds to the keyword by one or more of the text-dependent speaker identification models that are trained using utterances of the keyword by the particular speaker and by the other users whose accounts are registered with the server, a first speaker identification confidence value that reflects a likelihood that the particular speaker spoke the keyword;

determining, by the server-based speaker recognition engine and based on an analysis of at least a portion of the audio signal that corresponds to the query by one or more of the text-independent speaker identification models that are trained using utterances of words other than the keyword by the particular speaker and by other users whose accounts are registered with the server, a second speaker identification confidence value that reflects a likelihood that the particular speaker spoke the query;

determining, by the server-based speaker recognition engine, a first quantity of the utterances of the keyword by the particular speaker that were used to train the one or more text-dependent speaker identification models;

determining, by the server-based speaker recognition engine, a second quantity of the utterances of the words other than the keyword by the particular speaker that were used to train the one or more text-independent speaker identification models;

assigning, by the server-based speaker recognition engine, a first weight to the first speaker identification confidence value based at least on the first quantity of utterances of the keyword by the particular speaker that were used to train the one or more text-dependent speaker identification models, and a second weight to the second speaker identification confidence value based at least on the second quantity of utterances of the words other than the keyword by the particular speaker that were used to train the one or more text-independent speaker identification models;

determining, by the server-based speaker recognition engine, that the particular speaker spoke the utterance encoded in the audio signal based at least in part on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value;

in response to determining by the server-based speaker recognition engine that the particular speaker spoke the utterance encoded in the audio signal based at least in part on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value, initiating access to one or more account resources associated with the particular speaker for preparation of a personalized response to the query by the automated query processor that is associated with the server-based speaker recognition engine; and

providing, by the automated query processor that is associated with the server-based speaker recognition engine, the personalized response to the speech-enabled home device for output to the particular speaker.

2. The method of claim 1, comprising: obtaining one or more sets of mel-frequency cepstral coefficients (MFCCs) associated with the keyword, each set of MFCCs being associated with an individual speaker; and wherein determining, based on an analysis of a portion of the audio signal that corresponds to the keyword by one or more of the more text-dependent speaker identification models that are trained using utterances of the keyword by the particular speaker, the first speaker identification confidence value that reflects the likelihood that the particular speaker spoke the keyword, comprises determining, based on a comparison of the one or more sets of MFCCs to a set of MFCCs derived from the portion of the audio signal that corresponds to the keyword, a first speaker identification confidence value that reflects a likelihood that the particular speaker spoke the keyword.

3. The method of claim 1, wherein determining the second speaker identification confidence value comprises: deriving a set of mel-frequency cepstral coefficients (MFCCs) from the portion of the audio signal that corresponds to the query; accessing one or more Gaussian mixture models (GMMs), each GMM being associated with an individual speaker; and processing the set of MFCCs from the portion of the audio signal that corresponds to the query using each of the GMMs to determine the second speaker identification confidence value.

4. The method of claim 1 further comprising: analyzing the portion of the audio signal that corresponds to the keyword using the one or more text-independent models to determine a third speaker identification confidence value that reflects a likelihood that the particular speaker generated the utterance; and wherein determining that the particular speaker spoke the utterance based at least in part on the weighted first confidence and the weighted second confidence comprises determining that the particular speaker spoke the utterance based at least in part on the weighted first confidence, the weighted second confidence, and the third speaker identification confidence value.

5. The method of claim 1, wherein determining that the particular speaker spoke the utterance based at least on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value comprises: combining the weighted first speaker identification confidence value and the weighted second speaker identification confidence value to generate a combined confidence; and determining that the combined confidence for the particular speaker is greater than a combined confidence for any other speaker.

6. The method of claim 5, wherein determining that the combined confidence for the particular speaker is greater than a combined confidence for any other speaker comprises determining that the combined confidence for the particular speaker is greater than a combined confidence for any other speaker and that the combined confidence satisfies a predetermined threshold.

7. The method of claim 1, wherein determining that the particular speaker spoke the utterance based at least in part on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value comprises determining that the particular speaker from among a plurality of speakers spoke the utterance based at least in part on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value.

8. The method of claim 1, further comprising: combining the weighted first speaker identification confidence value and the weighted sec
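Claim 3 describes processing MFCCs from the query portion with one Gaussian mixture model per speaker. The Python sketch below illustrates one way such scoring could work, assuming diagonal-covariance GMMs stored as `(weight, mean, variance)` tuples and a softmax over speakers to turn log-likelihoods into confidences. The data layout, the function names, and the softmax normalization are all assumptions; the claim does not specify them, and MFCC extraction itself is out of scope here.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of MFCC frames under one speaker's GMM.

    gmm is a list of (weight, mean, var) components (an assumed layout).
    """
    if not frames:
        raise ValueError("at least one MFCC frame is required")
    total = 0.0
    for x in frames:
        # Log-sum-exp over mixture components for numerical stability.
        logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(logs)
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total / len(frames)


def score_speakers(frames, speaker_gmms):
    """Softmax per-speaker log-likelihoods into confidences that sum to 1 (assumed)."""
    ll = {name: gmm_avg_log_likelihood(frames, g) for name, g in speaker_gmms.items()}
    peak = max(ll.values())
    exp = {name: math.exp(v - peak) for name, v in ll.items()}
    norm = sum(exp.values())
    return {name: e / norm for name, e in exp.items()}


# Toy two-dimensional "MFCC" frames and single-component GMMs (hypothetical values).
speaker_gmms = {
    "alice": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "bob": [(1.0, [3.0, 3.0], [1.0, 1.0])],
}
frames = [[0.1, -0.2], [0.3, 0.0]]
conf = score_speakers(frames, speaker_gmms)
print(max(conf, key=conf.get))
```

The frames sit near the mean of the first model, so that speaker receives the higher confidence. Real MFCC vectors typically have a dozen or more dimensions and the mixtures have many components, but the scoring loop has the same shape.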
Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
Interactive procedures; Man-machine interfaces · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Training, enrolment or model building · CPC title