Speech recognizer with multi-directional decoding
US-2015095026-A1 · Apr 2, 2015 · US
US9865254B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9865254-B1 |
| Application number | US-201615187102-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jun 20, 2016 |
| Priority date | Feb 29, 2016 |
| Publication date | Jan 9, 2018 |
| Grant date | Jan 9, 2018 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Compact finite state transducers (FSTs) for automatic speech recognition (ASR). An HCLG FST and/or G FST may be compacted at training time to reduce the size of the FST to be used at runtime. The compact FSTs may be significantly smaller (e.g., 50% smaller) in terms of memory size, thus reducing the use of computing resources at runtime to operate the FSTs. The individual arcs and states of each FST may be compacted by binning individual weights, thus reducing the number of bits needed for each weight. Further, certain fields such as a next state ID may be left out of a compact FST if an estimation technique can be used to reproduce the next state at runtime. During runtime portions of the FSTs may be decompressed for processing by an ASR engine.
Opening claim text (preview).
What is claimed is: 1. A computer-implemented method for creating and using a compact finite state transducer for automatic speech recognition, the method comprising: during a training period: identifying a first representation of a finite state transducer (FST) for automatic speech recognition (ASR), wherein the FST corresponds to an ASR grammar and is configured to be traversed using a first sequence of words with corresponding scores and output a second sequence of words with corresponding scores, wherein the first representation occupies a first amount of memory and comprises: a first table representing states of the FST, the first table including a first record representing a first state of the FST, the first record comprising: a 32 bit representation of a weight of the first state, and a first reference to a second record in a second table, the second record representing at least one first arc, the first arc representing a connection from the first state to another state in the FST; and the second table representing arcs of the FST, the second table including the second record representing the first arc, the second record comprising: a 32 bit representation of a weight of the first arc, a second reference to a third record in the first table, the third record representing a second state of the FST, the second state being a destination state for the first arc, a first representation of an input label of the first arc, and a second representation of an output label of the first arc; and configuring a second representation of the FST, wherein the second representation occupies a second amount of memory less than the first amount of memory and comprises: a third table representing at least a portion of the states of the FST, the third table including a fourth record representing the first state, the fourth record comprising: a 12 bit representation of the weight of the first state, and a third reference to a fifth record in a fourth table, the fifth record representing the at least one first arc; the fourth table representing at least a portion of the arcs of the FST, the fourth table including the fifth record representing the first arc, the fifth record comprising: a 12 bit representation of the weight of the first arc, a third representation of an input/output label pair corresponding to the first arc, an identifier of a destination state of the first arc, the first representation of the input label of the first arc, and the second representation of the output label of the first arc; and during a runtime period: receiving audio data comprising audio feature vectors for ASR; and determining ASR output using the second representation of the FST and the audio data. 2. The computer-implemented method of claim 1 , further comprising, during the runtime period: identifying a third representation of a second FST, wherein the second FST is an HCL FST configured to be traversed using acoustic unit data and to output words; processing the audio data using an acoustic model to obtain acoustic unit data; loading the third representation of the second FST into a memory; loading the second representation of the FST into the memory; determining a score for the fifth record using the acoustic unit data; determining an output label using the third representation of the input/output label pair; and determining the ASR output using the output label. 3. The computer-implemented method of claim 1 , further comprising, during the training period: identifying, using the first representation, a third state of the FST; determining the third state is associated with a certain number of outgoing arcs; determining that the certain number is above a threshold; including in the third table, a sixth record corresponding to the third state, the sixth record including a bit indicating that information about the third state is available outside the third table; and including, in the second representation, the information about the third state, the information comprising: a 32 bit representation of a weight of the third state, a fourth reference to a seventh record in the fourth table, the seventh record representing a second arc, the second arc outgoing from the second state in the FST, first data representing a number of arcs associated with the third state that have a blank input label, and second data representing a number of arcs associated with the third state that have a blank output label. 4. The computer-implemented method of claim 1 , further comprising, during the runtime period: receiving an indication to capture audio for speech processing; loading the 12 bit representation of the weight of the first state into memory; converting the 12 bit representation of the weight of the first state into a second 32 bit representation of the weight of the first state; storing the second 32 bit representation of the weight of the first state; receiving the audio data; and performing speech processing using the audio data and the second 32 bit representation of the weight of the first state. 5. A computer-implemented method, comprising: receiving audio data; identifying a first representation of a first finite state transducer (FST) for automatic speech recognition (ASR), the FST configured to be traversed using input words and to output words, the first representation including: a first table representing states of the first FST, the first table including a first record representing a first state, the first record comprising: a first representation of a weight of the first state, and a reference to a second record in a second table, the second record representing a first arc; the second table representing arcs of the first FST, the second table including the second record, the second record comprising: a second representation of a weight of the first arc, a 12 bit representation of the weight of the first arc, a third representation of an input/output label pair, and an identifier of a destination state of the first arc; and identifying a second representation of a second FST, wherein the second FST is an FST configured to be traversed using acoustic unit data and to output words; loading the first representation of the first FST into a memory; loading the second representation of the second FST into the memory; performing ASR on a first portion of the audio data using the first record and the second record; determining an output label using the third representation of the input/output label pair; and determining ASR output using the output label. 6. The computer-implemented method of claim 5 , further comprising: receiving an indication to capture audio for speech processing; loading the 12 bit representation of the weight of the first state into memory; converting the 12 bit representation of the weight of the first state into a second 32 bit representation of the weight of the first state; storing the second 32 bit representation of the weight of the first state; and performing speech processing using the audio data and the second 32 bit representation of the weight of the first state. 7. The computer-implemented method of claim 5 , further comprising loading the 12 bit representation of the weight of the first state into memory after receiving the indication but prior to receiving the audio data. 8. The computer-implemented method of claim 5 , further comprising configuring the second table to include a third record corresponding to a second arc, the third record occupying a same amount of space in memory as the second record. 9. The computer-implemented method of claim 5 , further comprising configuring the third representation of the input/output la
updating or merging of old and new templates; Mean values; Weighting · CPC title
Parsing for meaning understanding · CPC title
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title
Training · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.