US2018129931A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2018129931-A1 |
| Application number | US-201715420801-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jan 31, 2017 |
| Priority date | Nov 4, 2016 |
| Publication date | May 10, 2018 |
| Grant date | — |
Official abstract text for this publication.
The technology disclosed provides a quasi-recurrent neural network (QRNN) encoder-decoder model that alternates convolutional layers, which apply in parallel across timesteps, and minimalist recurrent pooling layers that apply in parallel across feature dimensions.
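The alternation described in the abstract can be illustrated with a minimal sketch, written in plain Python for readability rather than speed. It assumes the forget-gate pooling recurrence from the QRNN literature, c_t = f_t * c_{t-1} + (1 - f_t) * z_t, which the abstract itself does not spell out. The function names (`causal_conv`, `f_pooling`, `qrnn_layer`) and the weight layout are hypothetical choices made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_conv(inputs, weights, width):
    """Masked 1-D convolution over a list of input vectors.

    weights[j][o][i] is the (hypothetical) weight from input feature i at
    window offset j (j = 0 is the oldest timestep) to output feature o.
    Each output depends only on the inputs, never on another output, so
    the loop over t could run in parallel across timesteps.
    """
    n_in = len(inputs[0])
    n_out = len(weights[0])
    # Zero-pad the past so that timestep t never sees inputs after t.
    padded = [[0.0] * n_in for _ in range(width - 1)] + inputs
    outputs = []
    for t in range(len(inputs)):
        window = padded[t:t + width]
        outputs.append([
            sum(weights[j][o][i] * window[j][i]
                for j in range(width) for i in range(n_in))
            for o in range(n_out)
        ])
    return outputs


def f_pooling(candidates, forget_gates):
    """Recurrent pooling: sequential over time, element-wise over features.

    Assumed recurrence: c_t = f_t * c_{t-1} + (1 - f_t) * z_t.
    No feature position reads any other position, so the inner
    comprehension could run in parallel across feature dimensions.
    """
    state = [0.0] * len(candidates[0])
    states = []
    for z_t, f_t in zip(candidates, forget_gates):
        state = [f * c + (1.0 - f) * z for f, c, z in zip(f_t, state, z_t)]
        states.append(state)
    return states


def qrnn_layer(inputs, w_z, w_f, width):
    """One convolutional layer followed by one pooling layer."""
    z = [[math.tanh(v) for v in row] for row in causal_conv(inputs, w_z, width)]
    f = [[sigmoid(v) for v in row] for row in causal_conv(inputs, w_f, width)]
    return f_pooling(z, f)
```

In this sketch the only sequential dependency is the short loop in `f_pooling`; all weight multiplications sit in `causal_conv`, where timesteps are independent of one another.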
Opening claim text (preview).
What is claimed is:

1. A quasi-recurrent neural network (QRNN) system that increases computational efficiency in neural network sequence-to-sequence modeling, the system comprising:
a QRNN encoder that comprises one or more encoder convolutional layers and one or more one encoder pooling layers, at least one encoder convolutional layer receives a time series of encoder input vectors and concurrently outputs encoded convolutional vectors for time series windows, and at least one encoder pooling layer receives the encoded convolutional vectors for the time series windows, concurrently accumulates an ordered set of feature sums in an encoded state vector for a current time series window, and sequentially outputs an encoded state vector for each successive time series window among the time series windows;
a QRNN decoder that comprises one or more decoder convolutional layers and one or more one decoder pooling layers, at least one decoder convolutional layer receives a time series of decoder input vectors and concurrently outputs decoded convolutional vectors for time series windows, and at least one decoder pooling layer receives the decoded convolutional vectors for the time series windows respectively concatenated with an encoded state vector outputted by an encoder pooling layer for a final time series window, concurrently accumulates an ordered set of feature sums in a decoded state vector for a current time series window, and sequentially outputs a decoded state vector for each successive time series window among the time series windows;
a state comparator that calculates linguistic similarity between the encoded state vectors and the decoded state vectors to produce an affinity matrix with encoding-wise and decoding-wise axes;
an exponential normalizer that normalizes the affinity matrix encoding-wise to produce respective encoding-to-decoding attention weights;
an encoding mixer that respectively combines the encoded state vectors with the encoding-to-decoding attention weights to generate respective contextual summaries of the encoded state vectors; and
an attention encoder that respectively combines the decoded state vectors with the respective contextual summaries of the encoded state vectors to produce an attention encoding for each of the time series windows.

2. The system of claim 1, wherein the attention encoder is a multilayer perceptron that projects a concatenation of the decoded state vectors and respective contextual summaries of the encoded state vectors into non-linear projections to produce an attention encoding for each of the time series windows.

3. The system of claim 1, wherein the encoded state vectors are respectively multiplied by output gate vectors of the encoded convolutional vectors to produce respective encoded hidden state vectors, wherein the state comparator calculates linguistic similarity between the encoded hidden state vectors and the decoded state vectors to produce an affinity matrix with encoding-wise and decoding-wise axes, wherein the encoding mixer respectively combines the encoded hidden state vectors with the encoding-to-decoding attention weights to generate respective contextual summaries of the encoded hidden state vectors, and wherein the attention encoder respectively combines the decoded state vectors with the respective contextual summaries of the encoded hidden state vectors, and further multiplies the combinations with respective output gate vectors of the decoded convolutional vectors to produce an attention encoding for each of the time series windows.

4. The system of claim 3, wherein the attention encoder is a multilayer perceptron that projects a concatenation of the decoded state vectors and respective contextual summaries of the encoded hidden state vectors into non-linear projections, and further multiplies the non-linear projections with respective output gate vectors of the decoded convolutional vectors to produce an attention encoding for each of the time series windows.

5. The system of claim 1, wherein each of the convolution vectors comprising feature values in an activation vector and in one or more gate vectors, and the feature values in the gate vectors are parameters that, respectively, apply element-wise by ordinal position to the feature values in the activation vector.

6. The system of claim 5, wherein each pooling layer operates in parallel over feature values of a convolutional vector to concurrently accumulate ordinal position-wise, in a state vector for a current time series window, an ordered set of feature sums in dependence upon a feature value at a given ordinal position in an activation vector outputted for the current time series window, one or more feature values at the given ordinal position in one or more gate vectors outputted for the current time series window, and a feature sum at the given ordinal position in a state vector accumulated for a prior time series window.

7. The system of claim 5, wherein the gate vector is a forget gate vector, and wherein each pooling layer uses a forget gate vector for a current time series window to control accumulation of information from a state vector accumulated for a prior time series window and information from an activation vector for the current time series window.

8. The system of claim 5, wherein the gate vector is an input gate vector, and wherein each pooling layer uses an input gate vector for a current time series window to control accumulation of information from an activation vector for the current time series window.

9. The system of claim 5, wherein the gate vector is an output gate vector, and wherein each pooling layer uses an output gate vector for a current time series window to control accumulation of information from a state vector for the current time series window.

10. A method of increasing computational efficiency in neural network sequence-to-sequence modeling, the method including:
receiving a time series of encoder input vectors at an encoder convolutional layer of a QRNN encoder and concurrently outputting encoded convolutional vectors for time series windows;
receiving the encoded convolutional vectors for the time series windows at an encoder pooling layer of the QRNN encoder, concurrently accumulating an ordered set of feature sums in an encoded state vector for a current time series window, and sequentially outputting an encoded state vector for each successive time series window among the time series windows;
receiving a time series of decoder input vectors at a decoder convolutional layer of a QRNN decoder and concurrently outputting decoded convolutional vectors for time series windows;
receiving the decoded convolutional vectors for the time series windows at a decoder pooling layer of the QRNN decoder respectively concatenated with an encoded state vector outputted by an encoder pooling layer for a final time series window, concurrently accumulating an ordered set of feature sums in an decoded state vector for a current time series window, and sequentially outputting an decoded state vector for each successive time series window among the time series windows;
calculating linguistic similarity between the encoded state vectors and the decoded state vectors to produce an affinity matrix with encoding-wise and decoding-wise axes;
exponentially normalizing the affinity matrix encoding-wise to produce respective encoding-to-decoding attention weights;
combining the encoded state vectors with the encoding-to-decoding attention weights to generate respective contextual summaries of the encoded state vectors; and
combining the decoded state vectors with the respective contextual summaries of th
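
The attention elements recited in claims 1 and 2 (state comparator, exponential normalizer, encoding mixer, attention encoder) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the claimed implementation: similarity is taken to be a dot product, exponential normalization is taken to be a softmax over encoder positions, and the multilayer perceptron of claim 2 is reduced to a single tanh layer with a hypothetical weight matrix `w_out`. All function names are invented for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Exponential normalization, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_encodings(encoded, decoded, w_out):
    """Return one attention encoding per decoder state vector.

    encoded: list of encoder state vectors (all of the same length)
    decoded: list of decoder state vectors (same length as the encoder's)
    w_out:   hypothetical projection matrix; each row has length 2 * dim
    """
    dim = len(encoded[0])
    results = []
    for d in decoded:
        # State comparator: one row of the affinity matrix (assumed dot product).
        affinity_row = [dot(d, e) for e in encoded]
        # Exponential normalizer: normalize along the encoder axis.
        weights = softmax(affinity_row)
        # Encoding mixer: attention-weighted sum of the encoder states.
        summary = [sum(w * e[i] for w, e in zip(weights, encoded))
                   for i in range(dim)]
        # Attention encoder: project the concatenation [decoded; summary]
        # through a single tanh layer (a simplification of claim 2's MLP).
        combined = list(d) + summary
        results.append([math.tanh(dot(row, combined)) for row in w_out])
    return results
```

The variant in claims 3 and 4 would additionally multiply the encoder states and the final projection element-wise by output gate vectors; that step is omitted here to keep the sketch short.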
Recurrent networks, e.g. Hopfield networks · CPC title
Combinations of networks · CPC title
using statistical methods · CPC title
Architecture, e.g. interconnection topology · CPC title
Semantic analysis · CPC title