US11222253B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11222253-B2 |
| Application number | US-201715421424-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 31, 2017 |
| Priority date | Nov 3, 2016 |
| Publication date | Jan 11, 2022 |
| Grant date | Jan 11, 2022 |
Official abstract text for this publication.
The technology disclosed provides a so-called “joint many-task neural network model” to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model. The model is successively trained by considering linguistic hierarchies, directly connecting word representations to all model layers, explicitly using predictions in lower tasks, and applying a so-called “successive regularization” technique to prevent catastrophic forgetting. Three examples of lower level model layers are part-of-speech (POS) tagging layer, chunking layer, and dependency parsing layer. Two examples of higher level model layers are semantic relatedness layer and textual entailment layer. The model achieves the state-of-the-art results on chunking, dependency parsing, semantic relatedness and textual entailment.
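The abstract's idea of connecting word representations directly to every layer, while also feeding each layer the predictions of the tasks below it, can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the helper `layer_input`, the toy vectors, and the use of plain concatenation to combine inputs are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def layer_input(word_emb: Vector, *lower_outputs: Vector) -> Vector:
    """Build one layer's input for a single word.

    Assumption of this sketch: inputs are combined by concatenation. The
    abstract only states that word representations reach all layers and that
    lower-task predictions are used explicitly.
    """
    combined = list(word_emb)          # direct (bypass) path for the word representation
    for out in lower_outputs:          # outputs of every lower task layer
        combined.extend(out)
    return combined


# Hypothetical toy vectors for one word of the input sentence.
word_emb = [0.1, 0.2, 0.3]             # word representation
pos_label_emb = [0.7, 0.3]             # output of the POS tagging layer
chunk_label_emb = [0.5, 0.5]           # output of the chunking layer

pos_in = layer_input(word_emb)                                   # lowest layer: words only
chunk_in = layer_input(word_emb, pos_label_emb)                  # words + POS predictions
dep_in = layer_input(word_emb, pos_label_emb, chunk_label_emb)   # words + POS + chunk predictions
```

Each higher layer sees a strictly larger input, which is one simple way to picture the "growing depth" of the stack: the dependency parsing layer still receives the raw word representation rather than only a transformed version of it.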
Opening claim text (preview).
What is claimed is:
1. A neural network that processes words in an input sentence, the neural network comprising a processor that processes: a part-of-speech (POS) label embedding layer that produces, using the processor, POS label embeddings from word embeddings generated from the words in the input sentence; a chunk label embedding layer overlaying the POS label embedding layer, the chunk label embedding layer receives, using a first bypass connection the POS label embeddings from the POS label embedding layer, and using a second bypass connection the word embeddings, and produces, using the processor, chunk label embeddings and chunk state vectors from the POS label embeddings and the word embeddings; a dependency parsing layer overlaying the chunk label embedding layer, the dependency parsing layer comprising: a bi-directional long-short term memory (LSTM) that receives, using the first bypass connection the POS label embeddings; using the second bypass connection the word embeddings, and using a third bypass connection the chunk label embeddings and the chunk state vectors from the chunk label embedding layer, and processes, using the processor, the word embeddings, the POS label embeddings, the chunk label embeddings and the chunk state vectors, to produce parent label state vectors; an attention encoder that, using the processor: produces parent label probability mass vectors from the parent label state vectors; and produces parent label embedding vectors from the parent label probability mass vectors; and a dependency relationship label classifier that: exponentially normalizes the parent label state vectors and the parent label embedding vectors to produce dependency relationship label probability mass vectors; and produces dependency relationship label embedding vectors from the dependency relationship label probability mass vectors; and an output, using the processor, that outputs the dependency relationship label embedding vectors.
2. The neural network of claim 1: wherein the parent label state vectors produced by the bi-directional LSTM are forward and backward parent label state vectors for each respective word in the input sentence, which represent forward and backward progressions of interactions among the words in the input sentence from which the parent label probability mass vectors are produced; and wherein the attention encoder processes the forward and backward parent label state vectors for each respective word in the input sentence, encodes attention as vectors of inner products between each respective word and other words in the input sentence, with a linear transform applied to the forward and backward parent label state vectors for the word or the other words, and produces the parent label embedding vectors from the encoded attention vectors.
3. The neural network of claim 2, wherein the linear transform is trainable during training of the dependency relationship label classifier.
4. The neural network of claim 2, wherein a number of available analytical framework labels, over which the parent label probability mass vectors are calculated, is one-fifth or less a dimensionality of the forward and backward parent label state vectors, thereby forming a dimensionality bottleneck that reduces overfitting when training a neural network stack of bi-directional LSTMs.
5. A neural network system that processes words in an input sentence, the neural network system comprising: at least one memory configured to store a dependency parsing layer, a chunk label embedding layer, and a POS label embedding layer; the dependency parsing layer that overlies the chunk label embedding layer that produces chunk label embeddings and chunk state vectors from part-of-speech (POS) label embeddings and word embeddings of the words in the input sentence, the POS label embeddings received from the POS label embedding layer using a first bypass connection and the word embeddings received using a second bypass connection; the chunk label embedding layer, in turn, overlies the POS label embedding layer, the POS label embedding layer that produces the POS label embeddings from the word embeddings; the dependency parsing layer including a dependency parent layer and a dependency relationship label classifier, wherein the dependency parent layer includes: a dependency parent analyzer, implemented as a bi-directional long-short term memory (LSTM), that: receives, using the second bypass connection the word embeddings, using the first bypass connection the POS label embeddings from the POS label embedding layer, and using a third bypass connection the chunk label embeddings and the chunk state vector from the chunk label embedding layer; and processes the words in the input sentences, including processing, for each word, the word embeddings, the POS label embeddings, the chunk label embeddings, and the chunk state vector to accumulate forward and backward state vectors that represent forward and backward progressions of interactions among the words in the input sentence; and an attention encoder that: processes the forward and backward state vectors for each respective word in the input sentence, and encodes attention as inner products between each respective word and other words in the input sentence, with a linear transform applied to the forward and backward state vectors for the word or the other words prior to the inner products; applies exponential normalization to vectors of the inner products to produce parent label probability mass vectors and projects the parent label probability mass vectors to produce parent label embedding vectors; and wherein the dependency relationship label classifier, for each respective word in the input sentence: processes the forward and backward state vectors and the parent label embedding vectors, to produce dependency relationship label probability mass vectors; and projects the dependency relationship label probability mass vectors to produce dependency relationship label embedding vectors; and an output processor that outputs at least the dependency relationship label probability mass vectors, or the dependency relationship label embedding vectors.
6. The neural network system of claim 5, wherein the linear transform applied prior to the inner products is trainable during training of the dependency parent layer and the dependency relationship label classifier.
7. The neural network system of claim 5, wherein a number of available analytical framework labels, over which the dependency relationship label probability mass vectors are calculated, is one-fifth or less a dimensionality of the forward and backward state vectors, thereby forming a dimensionality bottleneck that reduces overfitting when training a neural network stack of the bi-directional LSTMs.
8. A method for parsing words in an input sentence using a neural network device, the method comprising: producing, at a part-of-speech (POS) label embedding layer, POS label embeddings from word embeddings of the words in the input sentence; producing, at a chunk label embedding layer that overlies the POS label embedding layer, chunk label embeddings and chunk state vectors from the POS label embeddings received from the POS embedding layer using a first bypass connection and the word embeddings received using a second bypass connection; receiving, at a dependency parsing layer that overlies a chunk label embedding layer, the POS label embeddings using the first bypass connection, the word embeddings using the second bypass connection, and the chunk label embeddings and the chunk state vectors from the chunk label embedding layer using a third bypass connection, the dependency parsing layer including a dependency parent layer and a depend
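The attention encoder recited in the claims (inner products between each word and the other words, a linear transform applied before the inner products, then exponential normalization) might be sketched as below. This is only an illustrative reading of the claim language in plain Python: the function names, the toy state vectors, the identity transform, and the probability-weighted sum used for the projection step are assumptions, and each word's forward and backward state vectors are treated as already combined into a single vector.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def softmax(scores: Vector) -> Vector:
    """Exponential normalization, shifted by the max score for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parent_probabilities(states: Matrix, transform: Matrix) -> Matrix:
    """For each word t, score every other word j as a candidate parent.

    score(t, j) = states[t] . (transform @ states[j]); each row of scores is
    then exponentially normalized. Needs at least two words. A root node is
    omitted here for brevity (a simplification of this sketch).
    """
    transformed = [matvec(transform, h) for h in states]  # linear transform first
    rows = []
    for t, h_t in enumerate(states):
        scores = [dot(h_t, transformed[j]) for j in range(len(states)) if j != t]
        probs = softmax(scores)
        probs.insert(t, 0.0)  # a word is not its own parent; keep indices aligned
        rows.append(probs)
    return rows


def project_labels(label_probs: Vector, label_embeddings: Matrix) -> Vector:
    """Turn a probability mass vector into an embedding.

    Assumption: the projection is a probability-weighted sum of one embedding
    row per label.
    """
    dim = len(label_embeddings[0])
    return [sum(p * row[k] for p, row in zip(label_probs, label_embeddings))
            for k in range(dim)]


# Hypothetical toy inputs: three words with 2-dimensional state vectors.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]    # stands in for a trainable transform
probs = parent_probabilities(states, identity)
label_embedding = project_labels([0.25, 0.75], [[1.0, 0.0], [0.0, 1.0]])
```

In a trained model the transform would be learned rather than fixed, as claims 3 and 6 note. The same probability-weighted projection could also be applied to the dependency relationship label probability mass vectors, where the label set has a fixed size.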
Backpropagation, e.g. using gradient descent · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Combinations of networks · CPC title
Learning methods · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title