Keyword detection with international phonetic alphabet by foreground model and background model

US9466289B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9466289-B2
Application number: US-201314103775-A
Country: US
Kind code: B2
Filing date: Dec 11, 2013
Priority date: Jan 29, 2013
Publication date: Oct 11, 2016
Grant date: Oct 11, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An electronic device with one or more processors and memory trains an acoustic model with an international phonetic alphabet (IPA) phoneme mapping collection and audio samples in different languages, where the acoustic model includes: a foreground model; and a background model. The device generates a phone decoder based on the trained acoustic model. The device collects keyword audio samples, decodes the keyword audio samples with the phone decoder to generate phoneme sequence candidates, and selects a keyword phoneme sequence from the phoneme sequence candidates. After obtaining the keyword phoneme sequence, the device detects one or more keywords in an input audio signal with the trained acoustic model, including: matching phonemic keyword portions of the input audio signal with phonemes in the keyword phoneme sequence with the foreground model; and filtering out phonemic non-keyword portions of the input audio signal with the background model.
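
The detection step in the abstract (match keyword portions with the foreground model, filter out non-keyword portions with the background model) can be pictured with a minimal sketch. It assumes each audio segment has already been scored by both models, and it uses a simple log-likelihood comparison with a margin as the decision rule. The abstract does not state the decision rule, so the `Segment` type, the `fg_loglik` and `bg_loglik` fields, and the `margin` parameter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    fg_loglik: float  # hypothetical score from the foreground (keyword) model
    bg_loglik: float  # hypothetical score from the background (non-keyword) model


def detect_keywords(segments: List[Segment], margin: float = 0.0) -> List[Segment]:
    """Keep segments that the foreground model explains better than the
    background model by more than `margin`; all other segments are treated
    as non-keyword audio and filtered out. (Assumed rule, not from the patent.)"""
    return [s for s in segments if s.fg_loglik - s.bg_loglik > margin]


if __name__ == "__main__":
    scored = [
        Segment(0.0, 0.5, fg_loglik=-10.0, bg_loglik=-8.0),
        Segment(0.5, 1.1, fg_loglik=-6.0, bg_loglik=-9.5),
        Segment(1.1, 1.6, fg_loglik=-7.0, bg_loglik=-7.0),
    ]
    for hit in detect_keywords(scored, margin=1.0):
        print(f"keyword candidate from {hit.start}s to {hit.end}s")
```

In a real system the two scores would come from decoding against the trained acoustic model; here they are plain numbers so the filtering logic can be read in isolation.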

First claim

Opening claim text (preview).

What is claimed is:

1. A method of detecting keywords, comprising: at an electronic device with one or more processors and memory: training an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and a plurality of audio samples in a plurality of different languages, wherein the acoustic model includes: a foreground model configured to match a phoneme in an input audio signal to a corresponding keyword, wherein the foreground model is trained by (i) obtaining a phoneme collection for each of the plurality of different languages, (ii) generating a plurality of triphones by linking phonemes in the phoneme collection corresponding to the language, and (iii) performing Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language; and a background model configured to match a phoneme in the input audio signal to a corresponding non-keyword; after training the acoustic model, generating a phone decoder based on the trained acoustic model; obtaining a keyword phoneme sequence for a respective keyword in a respective language of the plurality of different languages, including: collecting a set of keyword audio samples for the respective keyword in the respective language; decoding the set of keyword audio samples with the phone decoder to generate a set of phoneme sequence candidates for the respective keyword, each phoneme sequence candidate corresponding to a respective keyword audio sample; and selecting the keyword phoneme sequence for the respective keyword from the set of phoneme sequence candidates by choosing a phoneme of a highest confidence measure from one of the set of phoneme sequence candidates at each location in the corresponding sequence and assembling the chosen phonemes into the keyword phoneme sequence according to their locations in the corresponding sequence; after obtaining the keyword phoneme sequence, detecting one or more keywords in the input audio signal with the trained acoustic model, including: matching one or more phonemic keyword portions of the input audio signal with one or more phonemes in the keyword phoneme sequence with the foreground model; and filtering out one or more phonemic non-keyword portions of the input audio signal with the background model.

2. The method of claim 1, wherein selecting the keyword phoneme sequence from the set of phoneme sequence candidates includes: in accordance with a determination that the set of keyword audio samples includes one collected keyword audio sample, selecting the phoneme sequence candidate generated from decoding the one collected keyword audio sample as the keyword phoneme sequence; and in accordance with a determination that the set of audio samples includes two or more collected keyword audio samples, selecting one of the two or more phoneme sequence candidates generated from decoding the two or more collected keyword audio samples as the keyword phoneme sequence.

3. The method of claim 1, further including: collecting the plurality of audio samples in the plurality of different languages and labeled data for the plurality of audio samples; mapping phonemes from each phoneme collection to phonemes in the IPA so as to generate the IPA phoneme mapping collection; and wherein the acoustic model is trained based on the collected plurality of audio samples in the plurality of different languages, the collected labeled data for the plurality of audio samples, and the generated IPA phoneme mapping collection.

4. The method of claim 3, further including: processing the collected plurality of audio samples with a predetermined characteristic extraction protocol so as to obtain a plurality of corresponding audio characteristic sequences; obtaining a characteristic phoneme collection corresponding to the plurality of audio characteristic sequences based on the IPA phoneme mapping collection; training the foreground and background models based on the characteristic phoneme collection and the collected labeled data; and integrating the trained foreground and background models into the acoustic model.

5. The method of claim 4, wherein: generating the plurality of triphones by linking phonemes in the phoneme collection corresponding to the language includes, for each phoneme in the phoneme collection: obtaining a context phoneme; and generating a triphone by linking the context phoneme to a corresponding monophone for the phoneme; performing Gaussian splitting training on the triphones that are clustered with the decision tree corresponding to the language updates a parameter of the clustered triphone; and training the foreground model further includes: for each phoneme in the characteristic phoneme collection: training an initial hidden Markov model (HMM) for three statuses of a respective phoneme in the characteristic phoneme collection; obtaining data related to the respective phoneme from the collected labeled data; updating the initial HMM with the obtained data so as to obtain a monophone model for the respective phoneme; and after performing the Gaussian splitting training on the triphones that are clustered with a decision tree corresponding to the language, performing minimum phoneme error discriminative training so as to obtain triphone models for respective phonemes in the phoneme collection corresponding to the language; and training the foreground model based on the obtained monophone and triphone models.

6. The method of claim 5, further including: calculating a Gaussian Mixed Model (GMM) distance between two monophone models; comparing the calculated GMM distance with a predefined similarity threshold value; and in accordance with a determination that the calculated GMM distance is larger than the predefined similarity threshold value: clustering the two monophones corresponding to the two monophone models; and recording the two monophones in a confusion matrix, wherein the confusion matrix is configured to describe similar monophones.

7. The method of claim 6, wherein training the background model includes: generating a confusion phoneme collection by processing the phonemes in the characteristic phoneme collection with the confusion matrix; and training the background model with the generated confusion phoneme collection.

8. The method of claim 1, wherein: the set of audio samples includes two or more collected keyword audio samples; and the set of phoneme sequence candidates generated from decoding the two or more collected keyword audio samples include two or more phoneme sequence candidates; and the method including: integrating the phoneme sequence candidates into a linear structure, including: mapping the corresponding phonemes of the phoneme sequence candidates to an edge of the linear structure; and categorizing the edges corresponding to similar phonemes into a same slot of the linear structure, wherein the slots form a linear connection relation with each other; and selecting a path of the linear structure, wherein the phonemes corresponding to the edges of the selected path comprise the keyword phoneme sequence.

9. The method of claim 8, wherein selecting the path from the linear structure includes: calculating an occurrence frequency for the phoneme corresponding to each edge of the linear structure; calculating a score for each path of the linear structure by summing the occurrence frequencies of the phonemes corresponding to each edge in a respective path; sorting the scores for the paths of the linear structure from high to low, and selecting the first N paths as alternative paths, wherein N is an integer larger than 1; and calculating a confidence measure for each of t
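
The path selection described in claims 8 and 9 (slots of alternative phonemes, a score per path obtained by summing occurrence frequencies, then keeping the first N paths) can be illustrated with a small sketch. It makes several assumptions that are not in the claim text: the phoneme sequence candidates are already aligned and of equal length, "occurrence frequency" is taken to be a raw count within a slot, and the function names `build_slots` and `top_paths` are hypothetical. The confidence measure step is not sketched because the claim preview is cut off at that point.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence, Tuple


def build_slots(candidates: Sequence[Sequence[str]]) -> List[Counter]:
    """Group the phonemes at each position into a slot and count how often
    each phoneme occurs there. Assumes pre-aligned, equal-length candidates."""
    if len({len(c) for c in candidates}) != 1:
        raise ValueError("this sketch assumes pre-aligned, equal-length candidates")
    return [Counter(column) for column in zip(*candidates)]


def top_paths(slots: List[Counter], n: int) -> List[Tuple[int, Tuple[str, ...]]]:
    """Score every path (one phoneme per slot) by summing the occurrence
    counts of its phonemes, sort from high to low, and return the first n."""
    scored = []
    for path in product(*(slot.keys() for slot in slots)):
        score = sum(slot[phoneme] for slot, phoneme in zip(slots, path))
        scored.append((score, path))
    # Ties are broken alphabetically so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:n]


if __name__ == "__main__":
    # Three hypothetical decodings of the same spoken keyword.
    candidates = [
        ["n", "i", "h", "a", "o"],
        ["n", "i", "h", "a", "u"],
        ["l", "i", "h", "a", "o"],
    ]
    for score, path in top_paths(build_slots(candidates), n=2):
        print(score, " ".join(path))
```

Enumerating every path is exponential in the number of slots, which is acceptable for short keywords in a sketch; a practical system would more likely use a beam or dynamic-programming search.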

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9466289B2 cover?
An electronic device with one or more processors and memory trains an acoustic model with an international phonetic alphabet (IPA) phoneme mapping collection and audio samples in different languages, where the acoustic model includes: a foreground model; and a background model. The device generates a phone decoder based on the trained acoustic model. The device collects keyword audio samples, d…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 11, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).