System and method for unsupervised text normalization using distributed representation of words
US-2016098386-A1 · Apr 7, 2016 · US
US11328126B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11328126-B2 |
| Application number | US-201916690350-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 21, 2019 |
| Priority date | Oct 19, 2015 |
| Publication date | May 10, 2022 |
| Grant date | May 10, 2022 |
Official abstract text for this publication.
A method, system, and non-transitory computer readable medium determining and discerning items with multiple meanings in a sequence of items including producing a distributed representation for each item of the sequence of items including a word vector and a context vector, partitioning the sequence of items into classes, for an item using a representative word vector of each class, calculating a cosine distance between the word vector of said item and the class representative vector, and producing a new sequence of items by modifying the distributed representation in the producing by replacing each occurrence of an item depending on the cosine distance calculated by the calculating.
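The distance step named in the abstract can be illustrated with a minimal sketch. It assumes that "cosine distance" means one minus cosine similarity (a common convention; the text here does not define it), and the class names and two-dimensional vectors are hypothetical toy values, not data from the patent.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity (assumed convention, see lead-in)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical class representative vectors (toy 2-D values).
class_representatives = {
    "finance": [1.0, 0.0],
    "river": [0.0, 1.0],
}

# Hypothetical word vector for one item of the sequence.
word_vector = [0.9, 0.1]

# Distance from the item to every class representative.
distances = {
    name: cosine_distance(word_vector, rep)
    for name, rep in class_representatives.items()
}

# The class whose representative is closest to the item.
nearest_class = min(distances, key=distances.get)
print(nearest_class, distances)
```

In this toy setup the item lies closest to the "finance" representative. How the representatives are chosen (the claims mention a user selection and, in claim 5, K-means) is outside this sketch.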
Opening claim text (preview).
What is claimed is:
1. A method of determining and discerning items with multiple meanings in a sequence of items, the method comprising: producing a distributed representation for each item of the sequence of items including a word vector and a context vector; partitioning the sequence of items into classes and associating a class representative vector with the classes based on a user selection; for an item using a representative word vector of each class, calculating a cosine distance between the word vector of said item and the class representative vector; producing a fixed number of classes D with smallest of the cosine distance; and producing a modified sequence of items by replacing each item I occurrence in the sequence by an I_j occurrence where j is one of the D classes for item I, wherein each word of the modified sequence conveys a ranked list of senses of the word in the original sequence based on a closeness of the item to the class representative vector as determined using the calculated cosine distance, wherein the modified sequence and the sequence include words replaced by the senses to explain the words using the senses, and wherein the calculating scans the sequence of items and sets a current center item as a vocabulary word and uses an average of the word vectors of items in a window of a predetermined size surrounding the vocabulary word, and then calculates the cosine distance of the vector average with a class representative context vector, further comprising displaying a dominant member of each class based on the member having a largest cosine distance between the word vector of a potential dominant member item and the class representative vector, wherein a distribution representation device learns the distributed representation using a tool and the distribution representation device produces a vector from the learned distributed representation, the vector including the word vector and the context vector.
2. The method of claim 1, further comprising producing a new sequence of items based on a result of the calculating.
3. The method of claim 1, wherein the sequence of items into are partitioned into the classes by applying a clustering algorithm to the word vector.
4. The method of claim 1, wherein the producing the modified sequence of items includes replacing each item with a variation of said item based on the sense of the item in a current context usage.
5. The method of claim 1, wherein the partitioning items into classes uses a K-means algorithm.
6. The method of claim 1, wherein the average comprises a weighted average with higher weights assigned to word vectors whose word is closer to a window center word.
7. The method of claim 1, wherein the words of the modified sequence are partitioned into new classes.
8. A non-transitory computer-readable recording medium recording a program for determining and discerning items with multiple meanings in a sequence of items, the program causing a computer to perform: producing a distributed representation for each item of the sequence of items including a word vector and a context vector; partitioning the sequence of items into classes and associating a class representative vector with the classes based on a user selection; for an item using a representative word vector of each class, calculating a cosine distance between the word vector of said item and the class representative vector; producing a fixed number of classes D with smallest of the cosine distance; and producing a modified sequence of items by replacing each item I occurrence in the sequence by an I_j occurrence where j is one of the D classes for item I, wherein each word of the modified sequence conveys a ranked list of senses of the word in the original sequence based on a closeness of the item to the class representative vector as determined using the calculated cosine distance, wherein the modified sequence and the sequence include words replaced by the senses to explain the words using the senses, and wherein the calculating scans the sequence of items and sets a current center item as a vocabulary word and uses an average of the word vectors of items in a window of a predetermined size surrounding the vocabulary word, and then calculates the cosine distance of the vector average with a class representative context vector, further comprising displaying a dominant member of each class based on the member having a largest cosine distance between the word vector of a potential dominant member item and the class representative vector, wherein a distribution representation device learns the distributed representation using a tool and the distribution representation device produces a vector from the learned distributed representation, the vector including the word vector and the context vector.
9. The non-transitory computer-readable recording medium of claim 8, further comprising producing a new sequence of items based on a result of the calculating.
10. The non-transitory computer-readable recording medium of claim 8, wherein the sequence of items into are partitioned into the classes by applying a clustering algorithm to the word vector.
11. A two-phase system for determining and discerning items with multiple meanings in a sequence of items, the system comprising: a processor; and a memory, the memory storing instructions to cause the processor to perform: producing a distributed representation for each item of the sequence of items including a word vector and a context vector; partitioning the sequence of items into classes and associating a class representative vector with the classes based on a user selection; for an item using a representative word vector of each class, calculating a cosine distance between the word vector of said item and the class representative vector; producing a fixed number of classes D with smallest of the cosine distance; and producing a modified sequence of items by replacing each item I occurrence in the sequence by an I_j occurrence where j is one of the D classes for item I, wherein each word of the modified sequence conveys a ranked list of senses of the word in the original sequence based on a closeness of the item to the class representative vector as determined using the calculated cosine distance, wherein the modified sequence and the sequence include words replaced by the senses to explain the words using the senses, and wherein the calculating scans the sequence of items and sets a current center item as a vocabulary word and uses an average of the word vectors of items in a window of a predetermined size surrounding the vocabulary word, and then calculates the cosine distance of the vector average with a class representative context vector, further comprising displaying a dominant member of each class based on the member having a largest cosine distance between the word vector of a potential dominant member item and the class representative vector, wherein a distribution representation device learns the distributed representation using a tool and the distribution representation device produces a vector from the learned distributed representation, the vector including the word vector and the context vector.
12. The system of claim 11, further comprising producing a new sequence of items based on a result of the calculating.
13. The system of claim 11, wherein the sequence of items into are partitioned into the classes by applying a clustering algorithm to the word vector.
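The scanning step recited in the independent claims (window average, distance to class representative context vectors, D closest classes, replacement of item I by I_j) might look roughly like the sketch below. It is an illustration under stated assumptions, not the patented implementation: the function name `label_senses`, the toy vectors, the choice to exclude the center item from the window, the unweighted average, and the use of the single closest class as j are all hypothetical. Cosine distance is again assumed to be one minus cosine similarity.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def label_senses(tokens, word_vectors, class_context_vectors, window=2, top_d=1):
    """Tag each vocabulary token with its closest class.

    Returns a list of (labeled_token, ranked_classes) pairs, where
    ranked_classes holds the top_d classes ordered by increasing distance.
    Tokens without a vector, or without any in-vocabulary neighbor,
    are passed through unchanged with an empty ranking.
    """
    result = []
    for i, token in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Assumption: the window covers neighbors only, not the center item.
        neighbors = [
            word_vectors[tokens[j]]
            for j in range(lo, hi)
            if j != i and tokens[j] in word_vectors
        ]
        if token not in word_vectors or not neighbors:
            result.append((token, []))
            continue
        # Unweighted average of the neighbor word vectors.
        dims = len(neighbors[0])
        average = [sum(v[d] for v in neighbors) / len(neighbors) for d in range(dims)]
        # Rank classes by distance between the average and each class vector.
        ranked = sorted(
            class_context_vectors,
            key=lambda c: cosine_distance(average, class_context_vectors[c]),
        )[:top_d]
        result.append((f"{token}_{ranked[0]}", ranked))
    return result


# Hypothetical toy vocabulary and two class representative context vectors.
word_vectors = {
    "bank": [0.5, 0.5],
    "money": [1.0, 0.0],
    "loan": [0.9, 0.1],
    "river": [0.0, 1.0],
    "water": [0.1, 0.9],
}
class_context_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}

financial = label_senses(["money", "bank", "loan"], word_vectors, class_context_vectors)
riverside = label_senses(["river", "bank", "water"], word_vectors, class_context_vectors)
print(financial)
print(riverside)
```

With these toy values the same surface word "bank" receives a different class suffix in each sequence, which is the disambiguation effect the claims describe. Claim 6's weighted average could be substituted for the unweighted mean by weighting each neighbor according to its distance from the window center.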
Editing, e.g. inserting or deleting · CPC title
Parsing for meaning understanding · CPC title
Semantic analysis · CPC title
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title
Creation or modification of classes or clusters · CPC title