What technology area does this patent fall under?

Primary CPC classification G10L17/10. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jul 18, 2017 (kind code B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Dual model speaker identification

US9711148B1 · US · B1

Patent metadata
Field	Value
Publication number	US-9711148-B1
Application number	US-201313944975-A
Country	US
Kind code	B1
Filing date	Jul 18, 2013
Priority date	Jul 18, 2013
Publication date	Jul 18, 2017
Grant date	Jul 18, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A processing system receives an audio signal encoding an utterance and determines that a first portion of the audio signal corresponds to a predefined phrase. The processing system accesses one or more text-dependent models associated with the predefined phrase and determines a first confidence based on the one or more text-dependent models associated with the predefined phrase, the first confidence corresponding to a first likelihood that a particular speaker spoke the utterance. The processing system determines a second confidence for a second portion of the audio signal using one or more text-independent models, the second confidence corresponding to a second likelihood that the particular speaker spoke the utterance. The processing system then determines that the particular speaker spoke the utterance based at least in part on the first confidence and the second confidence.
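The flow in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function names, the `phrase_end` index, the unweighted mean, and the `0.5` threshold are all assumptions made for the example, and the two scorers are placeholders for whatever text-dependent and text-independent models a real system would use.

```python
from typing import Callable, Sequence, Tuple

Audio = Sequence[float]            # stand-in for samples or feature frames
Scorer = Callable[[Audio], float]  # hypothetical: returns a confidence in [0, 1]


def split_at_phrase(audio: Audio, phrase_end: int) -> Tuple[Audio, Audio]:
    """Split the signal into the predefined-phrase portion and the remainder."""
    return audio[:phrase_end], audio[phrase_end:]


def speaker_matches(
    audio: Audio,
    phrase_end: int,
    text_dependent: Scorer,
    text_independent: Scorer,
    threshold: float = 0.5,  # assumed value, not taken from the patent
) -> bool:
    """Decide whether the enrolled speaker spoke the utterance."""
    phrase_part, remainder = split_at_phrase(audio, phrase_end)
    # First confidence: text-dependent model on the predefined phrase.
    first_confidence = text_dependent(phrase_part)
    # Second confidence: text-independent model on the second portion.
    second_confidence = text_independent(remainder)
    # Assumption: combine with a plain mean; the abstract only says the
    # decision is based "at least in part" on both confidences.
    combined = (first_confidence + second_confidence) / 2.0
    return combined >= threshold
```

Passing the scorers in as callables keeps the sketch independent of any particular model type; only the split-then-score-then-combine structure comes from the abstract.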

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising:
receiving, by a speech-enabled home device of an automated speaker identification system that includes the speech-enabled home device that includes one or more microphones for detecting utterances spoken in a home environment, and a server based speaker recognition engine that is associated with an automated query processor and that includes (i) one or more text-dependent speaker identification models that are trained using multiple previous utterances of a keyword by a particular speaker and by other users whose accounts are registered with the server, and (ii) one or more text-independent speaker identification models that are trained using utterances of words other than the keyword by the particular speaker and by other users whose accounts are registered with the server, an audio signal encoding an utterance that was spoken in a home environment and was detected by one or more microphones of the speech-enabled home device, and that includes the keyword and a query;
determining, by the server-based speaker recognition engine and based on an analysis of a portion of the audio signal that corresponds to the keyword by one or more of the text-dependent speaker identification models that are trained using utterances of the keyword by the particular speaker and by the other users whose accounts are registered with the server, a first speaker identification confidence value that reflects a likelihood that the particular speaker spoke the keyword;
determining, by the server-based speaker recognition engine and based on an analysis of at least a portion of the audio signal that corresponds to the query by one or more of the text-independent speaker identification models that are trained using utterances of words other than the keyword by the particular speaker and by other users whose accounts are registered with the server, a second speaker identification confidence value that reflects a likelihood that the particular speaker spoke the query;
determining, by the server-based speaker recognition engine, a first quantity of the utterances of the keyword by the particular speaker that were used to train the one or more text-dependent speaker identification models;
determining, by the server-based speaker recognition engine, a second quantity of the utterances of the words other than the keyword by the particular speaker that were used to train the one or more text-independent speaker identification models;
assigning, by the server-based speaker recognition engine, a first weight to the first speaker identification confidence value based at least on the first quantity of utterances of the keyword by the particular speaker that were used to train the one or more text-dependent speaker identification models, and a second weight to the second speaker identification confidence value based at least on the second quantity of utterances of the words other than the keyword by the particular speaker that were used to train the one or more text-independent speaker identification models;
determining, by the server-based speaker recognition engine, that the particular speaker spoke the utterance encoded in the audio signal based at least in part on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value;
in response to determining by the server-based speaker recognition engine that the particular speaker spoke the utterance encoded in the audio signal based at least in part on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value, initiating access to one or more account resources associated with the particular speaker for preparation of a personalized response to the query by the automated query processor that is associated with the server-based speaker recognition engine; and
providing, by the automated query processor that is associated with the server-based speaker recognition engine, the personalized response to the speech-enabled home device for output to the particular speaker.

2. The method of claim 1, comprising: obtaining one or more sets of mel-frequency cepstral coefficients (MFCCs) associated with the keyword, each set of MFCCs being associated with an individual speaker; and wherein determining, based on an analysis of a portion of the audio signal that corresponds to the keyword by one or more of the more text-dependent speaker identification models that are trained using utterances of the keyword by the particular speaker, the first speaker identification confidence value that reflects the likelihood that the particular speaker spoke the keyword, comprises determining, based on a comparison of the one or more sets of MFCCs to a set of MFCCs derived from the portion of the audio signal that corresponds to the keyword, a first speaker identification confidence value that reflects a likelihood that the particular speaker spoke the keyword.

3. The method of claim 1, wherein determining the second speaker identification confidence value comprises: deriving a set of mel-frequency cepstral coefficients (MFCCs) from the portion of the audio signal that corresponds to the query; accessing one or more Gaussian mixture models (GMMs), each GMM being associated with an individual speaker; and processing the set of MFCCs from the portion of the audio signal that corresponds to the query using each of the GMMs to determine the second speaker identification confidence value.

4. The method of claim 1 further comprising: analyzing the portion of the audio signal that corresponds to the keyword using the one or more text-independent models to determine a third speaker identification confidence value that reflects a likelihood that the particular speaker generated the utterance; and wherein determining that the particular speaker spoke the utterance based at least in part on the weighted first confidence and the weighted second confidence comprises determining that the particular speaker spoke the utterance based at least in part on the weighted first confidence, the weighted second confidence, and the third speaker identification confidence value.

5. The method of claim 1, wherein determining that the particular speaker spoke the utterance based at least on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value comprises: combining the weighted first speaker identification confidence value and the weighted second speaker identification confidence value to generate a combined confidence; and determining that the combined confidence for the particular speaker is greater than a combined confidence for any other speaker.

6. The method of claim 5, wherein determining that the combined confidence for the particular speaker is greater than a combined confidence for any other speaker comprises determining that the combined confidence for the particular speaker is greater than a combined confidence for any other speaker and that the combined confidence satisfies a predetermined threshold.

7. The method of claim 1, wherein determining that the particular speaker spoke the utterance based at least in part on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value comprises determining that the particular speaker from among a plurality of speakers spoke the utterance based at least in part on the weighted first speaker identification confidence value and the weighted second speaker identification confidence value.

8. The method of claim 1, further comprising: combining the weighted first speaker identification confidence value and the weighted sec
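The weighting and decision steps recited in claims 1, 5, and 6 might be sketched as below. This is a hedged illustration, not the claimed implementation: the claims only say each weight is "based at least on" the quantity of training utterances, so the proportional weighting here is an assumption, as are the class and function names, the even split when no training data exists, and the example threshold.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SpeakerScores:
    td_confidence: float  # text-dependent confidence on the keyword portion
    ti_confidence: float  # text-independent confidence on the query portion
    td_train_count: int   # keyword utterances used to train the TD model
    ti_train_count: int   # non-keyword utterances used to train the TI model


def weights_from_counts(td_count: int, ti_count: int) -> Tuple[float, float]:
    """Assumed scheme: weight each model by its share of training utterances."""
    total = td_count + ti_count
    if total == 0:
        return 0.5, 0.5  # assumption: fall back to equal weights
    return td_count / total, ti_count / total


def combined_confidence(s: SpeakerScores) -> float:
    """Weighted sum of the two confidence values for one speaker."""
    w_td, w_ti = weights_from_counts(s.td_train_count, s.ti_train_count)
    return w_td * s.td_confidence + w_ti * s.ti_confidence


def identify_speaker(scores: Dict[str, SpeakerScores], threshold: float) -> Optional[str]:
    """Return the speaker whose combined confidence is strictly the highest
    and also meets the threshold; otherwise return None."""
    if not scores:
        return None
    combined = {name: combined_confidence(s) for name, s in scores.items()}
    best = max(combined, key=combined.get)
    # Claim 5 requires the winner to exceed every other speaker, so a tie fails.
    if any(v >= combined[best] for n, v in combined.items() if n != best):
        return None
    return best if combined[best] >= threshold else None
```

For example, a speaker with 30 keyword utterances and 10 non-keyword utterances in training would have the text-dependent confidence weighted at 0.75 and the text-independent confidence at 0.25 under this assumed scheme.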

Assignees

Google Inc

Inventors

Classifications

G10L17/10 · Primary
Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems · CPC title
G10L17/02 · Primary
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
G10L17/22
Interactive procedures; Man-machine interfaces · CPC title
G10L15/02
Feature extraction for speech recognition; Selection of recognition unit · CPC title
G10L17/04
Training, enrolment or model building · CPC title

Patent family

Related publications grouped by family.

Patent family ID: 59297825

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9711148B1 cover?: A processing system receives an audio signal encoding an utterance and determines that a first portion of the audio signal corresponds to a predefined phrase. The processing system accesses one or more text-dependent models associated with the predefined phrase and determines a first confidence based on the one or more text-dependent models associated with the predefined phrase, the first confi…
Who is the assignee on this patent?: Google Inc