Model training for automatic speech recognition from imperfect transcription data

US9280969B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9280969-B2
Application number: US-48214209-A
Country: US
Kind code: B2
Filing date: Jun 10, 2009
Priority date: Jun 10, 2009
Publication date: Mar 8, 2016
Grant date: Mar 8, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques and systems for training an acoustic model are described. In an embodiment, a technique for training an acoustic model includes dividing a corpus of training data that includes transcription errors into N parts, and on each part, decoding an utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription. The technique may further include inserting silence between a pair of words into the decoded transcription and aligning an original transcription corresponding to the utterance with the decoded transcription according to time for each part. The technique may further include selecting a segment from the utterance having at least Q contiguous matching aligned words, and training the incremental acoustic model with the selected segment. The trained incremental acoustic model may then be used on a subsequent part of the training data. Other embodiments are described and claimed.
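
The segment-selection step in the abstract can be illustrated with a minimal sketch. It assumes the original and decoded transcriptions have already been aligned by time into one list of word pairs; the tuple layout and the names `AlignedWord` and `select_segments` are hypothetical and do not come from the patent. Silence insertion is left out.

```python
from typing import List, Optional, Tuple

# Hypothetical aligned position: (original_word, decoded_word, start_sec, end_sec).
# A None word marks an insertion or deletion in the alignment (an assumption).
AlignedWord = Tuple[Optional[str], Optional[str], float, float]
Segment = Tuple[float, float, List[str]]


def select_segments(aligned: List[AlignedWord], q: int) -> List[Segment]:
    """Return every run of at least q contiguous matching aligned words."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    segments: List[Segment] = []
    run: List[AlignedWord] = []

    def flush() -> None:
        # Keep the current run only if it is long enough, then reset it.
        if len(run) >= q:
            words = [orig for orig, _, _, _ in run]
            segments.append((run[0][2], run[-1][3], words))
        run.clear()

    for item in aligned:
        orig, decoded = item[0], item[1]
        if orig is not None and orig == decoded:
            run.append(item)  # original and decoded words agree
        else:
            flush()  # a mismatch ends the current run
    flush()
    return segments


if __name__ == "__main__":
    aligned = [
        ("the", "the", 0.0, 0.2),
        ("cat", "cat", 0.2, 0.5),
        ("sat", "sad", 0.5, 0.8),  # decoder disagrees with the transcript
        ("on", "on", 0.8, 0.9),
        ("the", "the", 0.9, 1.0),
        ("mat", "mat", 1.0, 1.4),
    ]
    print(select_segments(aligned, q=3))
```

A larger Q keeps fewer segments, each one backed by more agreement between the original transcription and the decoder. In the described technique, the selected segments would then be used to train the incremental acoustic model before the next part of the corpus is decoded.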

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising:
a. aligning an utterance from a set of training data with a corresponding original transcription from the set of training data to produce a time-aligned transcription with time alignment information for each word in the utterance, wherein the set of training data includes transcription errors;
b. decoding the same utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription with time alignment information for each word;
c. aligning the time-aligned and decoded transcriptions according to time alignment information;
d. selecting all segments from the utterance having at least Q contiguous matching aligned words, where Q is a positive integer, by: including a silence in a selected segment comprising the Q matching aligned words when the selected segment is preceded or followed by a silence; and when there is no silence preceding or succeeding the selected segment: selecting the selected segment according to the original transcription with time alignment information; and inserting part of a silence segment from the beginning of the utterance into the beginning of the selected segment, and appending a part of a silence segment from the beginning of the utterance to the end of the selected segment;
e. training the incremental acoustic model with the selected segments; and
f. evaluating the accuracy of the incremental acoustic model built from the training data including transcription errors compared to the accuracy of an acoustic model built from a similar amount of training data having no transcription errors.

2. The computer-implemented method of claim 1, comprising: dividing training data comprising audio data and transcription data corresponding to the audio data into N parts of M duration, wherein each part includes one or more utterances each comprising a plurality of words, and wherein N and M are positive integers; and
g. iterating 1.a. through 1.f. for each utterance in one of the N parts; and
h. iterating 2.g. for each of the N parts.

3. The computer-implemented method of claim 2, comprising: during a first iteration on a first part, building the incremental language model from the original transcription corresponding to the first part; and during a subsequent iteration on a subsequent part, building L incremental language models, where M/L is less than or equal to one, and where each of the L incremental language models uses a portion of M/L duration of the original transcription corresponding to the subsequent part.

4. A computer-readable hardware medium storing computer-executable program instructions that when executed cause a computing system to: compute a frame posterior for each word in an utterance from a corpus comprising audio data and a corresponding transcription that contains transcription errors, wherein the instructions to compute the frame posterior include instructions that when executed cause the computing system to: decode the audio data using an existing acoustic model to generate a lattice, merging the decoded lattice with the transcription, labeling each word in the merged lattice as one of correct or incorrect by examining a percentage to which the word is overlapped in duration with the transcription, computing a posterior probability for each word in the merged lattice, and computing the frame posterior q(t) of time t by summing the posterior probabilities of all the correct words passing time t for a time interval; train an acoustic model with confidence-based maximum likelihood estimation (MLE) training using the frame posterior by estimating acoustic model parameters using the transcription, the audio data and the frame posterior; estimate the acoustic model parameters with confidence-based discriminative training using the frame posterior; evaluate the accuracy of the acoustic model built from the corpus including the corresponding transcription that contains transcription errors compared to the accuracy of an acoustic model built from a similar amount of training data having no transcription errors; and generate a finalized acoustic model.

5. The computer-readable hardware medium of claim 4, wherein the instructions to estimate model parameters include instructions that when executed cause the computing system to: calculate the update formulas for mean ($\mu_{jk}$) and variance ($\sigma_{jk}^{2}$) for a jth state and a kth mixture model as:

$$\mu_{jk} = \frac{\sum_{t=1}^{T} \bar{\zeta}_{jk}(t)\, O(t)}{\sum_{t=1}^{T} \bar{\zeta}_{jk}(t)}$$

$$\sigma_{jk}^{2} = \sum_{t=1}^{T} \bar{\zeta}_{jk}(t)\, \bigl(O(t) - \mu_{jk}'\bigr)\, ( \ldots$$
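
As a rough illustration of claim 4's frame posterior q(t) and the weighted-average form of the mean update in claim 5, the sketch below sums the posteriors of correct lattice words over the frames they cover, then uses per-frame weights in a weighted mean. Everything here is an assumption made for illustration: the tuple layout, the half-open frame intervals, the function names, and in particular the use of q(t) directly as the weight. The preview does not define the occupancy statistic in claim 5, so this is not the patent's update formula.

```python
from typing import List, Tuple

# Hypothetical lattice word: (start_frame, end_frame, posterior, is_correct).
# Frames are covered on the half-open interval [start_frame, end_frame).
LatticeWord = Tuple[int, int, float, bool]


def frame_posteriors(words: List[LatticeWord], num_frames: int) -> List[float]:
    """q(t): sum of posteriors of all correct words passing frame t."""
    q = [0.0] * num_frames
    for start, end, posterior, is_correct in words:
        if not is_correct:
            continue  # words labeled incorrect contribute nothing
        for t in range(max(start, 0), min(end, num_frames)):
            q[t] += posterior
    return q


def weighted_mean(observations: List[float], weights: List[float]) -> float:
    """Weighted average sum(w * o) / sum(w), the shape of the mean update."""
    if len(observations) != len(weights):
        raise ValueError("observations and weights must have equal length")
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all weights are zero")
    return sum(w * o for w, o in zip(weights, observations)) / total


if __name__ == "__main__":
    words = [
        (0, 3, 0.75, True),   # correct word covering frames 0..2
        (0, 3, 0.25, False),  # competing word labeled incorrect
        (2, 5, 0.5, True),    # correct word covering frames 2..4
    ]
    q = frame_posteriors(words, num_frames=5)
    print(q)
    # Scalar observations stand in for feature vectors O(t).
    print(weighted_mean([1.0, 2.0, 3.0, 4.0, 5.0], q))
```

Frames where the lattice agrees with the transcription get larger weights, so stretches of audio with likely transcription errors contribute less to the estimated parameters.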

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9280969B2 cover?
Techniques and systems for training an acoustic model are described. In an embodiment, a technique for training an acoustic model includes dividing a corpus of training data that includes transcription errors into N parts, and on each part, decoding an utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription. The technique may further inc…
Who is the assignee on this patent?
Li Jinyu, Gong Yifan, Liu Chaojun, and 2 more
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 8, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).