System and method for unsupervised text normalization using distributed representation of words
US-2016098386-A1 · Apr 7, 2016 · US
US10585987B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10585987-B2 |
| Application number | US-201715442834-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 27, 2017 |
| Priority date | Oct 19, 2015 |
| Publication date | Mar 10, 2020 |
| Grant date | Mar 10, 2020 |
Official abstract text for this publication.
A method, system, and non-transitory computer readable medium determining and discerning items with multiple meanings in a sequence of items including producing a distributed representation for each item of the sequence of items including a word vector and a context vector, partitioning the sequence of items into classes, for an item using a representative word vector of each class, calculating a cosine distance between the word vector of said item and the class representative vector, and producing a new sequence of items by modifying the distributed representation in the producing by replacing each occurrence of an item depending on the cosine distance calculated by the calculating.
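As a rough illustration of the distance step named in the abstract, the sketch below ranks classes for one word by cosine distance to each class representative vector. It is a minimal sketch under stated assumptions, not the patented implementation: the representative is taken to be the mean of the class members' word vectors, the class labels are supplied by hand rather than produced by a clustering step, and the names `cosine_distance`, `class_representatives`, and `rank_classes` as well as the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector is treated as unrelated
    return 1.0 - dot / (norm_u * norm_v)


def class_representatives(word_vectors, labels):
    """Average the word vectors of each class.

    Using the mean as the representative is an assumption of this sketch.
    """
    sums, counts = {}, {}
    for word, vec in word_vectors.items():
        c = labels[word]
        if c not in sums:
            sums[c] = [0.0] * len(vec)
            counts[c] = 0
        sums[c] = [s + x for s, x in zip(sums[c], vec)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def rank_classes(word, word_vectors, representatives):
    """Order class ids from closest to farthest for the given word."""
    vec = word_vectors[word]
    return sorted(
        representatives,
        key=lambda c: cosine_distance(vec, representatives[c]),
    )


# Toy data (hypothetical): class 0 is "water-like", class 1 is "finance-like".
word_vectors = {
    "river": [1.0, 0.0],
    "shore": [0.9, 0.1],
    "money": [0.0, 1.0],
    "loan": [0.1, 0.9],
    "bank": [0.6, 0.8],
}
labels = {"river": 0, "shore": 0, "money": 1, "loan": 1, "bank": 1}

reps = class_representatives(word_vectors, labels)
ranking = rank_classes("bank", word_vectors, reps)
print(ranking)
```

In this toy setup "bank" sits closest to its own finance-like class, with the water-like class second, which is the kind of secondary sense the method is meant to surface.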
Opening claim text (preview).
What is claimed is:
1. A two-phase method of determining and discerning items with multiple meanings in a sequence of items, the method comprising: in a first phase: producing a distributed representation for each item of the sequence of items including a word vector and a context vector; partitioning the sequence of items into classes; for an item using a representative word vector of each class, calculating a cosine distance between the word vector of said item and the class representative vector not including an own class of the item; and producing a fixed number of classes D with smallest said cosine distance; and in a second phase: producing a modified sequence of items by replacing each item I occurrence in the sequence by an I_j occurrence where j is one of the D classes for item I; producing a new distributed representation for each item I_m of the modified sequence of items including a word vector and a context vector; and using said new distributed representation for determining and discerning items with multiple meanings, wherein each word of the modified sequence conveys a ranked list of senses of the word in the original sequence based on a closeness of the item to the class representative vector without being a member of the class.
2. The method of claim 1, wherein the producing a modified sequence of items includes replacing each item with a variation of said item based on the sense of the item in a current context usage.
3. The method of claim 2, wherein the replacing scans the sequence of items and sets a current center item as a vocabulary word and uses an average of the word vectors of items in a window of a predetermined size surrounding the vocabulary word, and then calculates the cosine distance of the vector average with a class representative context vector.
4. The method of claim 1, wherein the partitioning items into classes uses a K-means algorithm.
5. The method of claim 3, wherein an average comprises a weighted average with higher weights assigned to word vectors whose word is closer to a window center word.
6. The method of claim 1, wherein the using further partitions the words of the modified sequence into new classes.
7. The method of claim 6, wherein said partitioning uses a K-means algorithm.
8. The method of claim 6, wherein each word of the modified sequence is presented alongside dominant members of a new class to which said word belongs.
9. The method of claim 1, further comprising scanning the sequence of items and setting a current center item as a vocabulary word and uses an average of the word vectors of items in a window of a predetermined size surrounding the vocabulary word, and then calculates the cosine distance of the vector average with a class representative context vector.
10. A non-transitory computer-readable recording medium recording a program for determining and discerning items with multiple meanings in a sequence of items, in two-phases, the program causing a computer to perform: in a first phase: producing a distributed representation for each item of the sequence of items including a word vector and a context vector; partitioning the sequence of items into classes; for an item using a representative word vector of each class, calculating a cosine distance between the word vector of said item and the class representative vector not including an own class of the item; and producing a fixed number of classes D with smallest said cosine distance; and in a second phase: producing a modified sequence of items by replacing each item I occurrence in the sequence by an I_j occurrence where j is one of the D classes for item I; producing a new distributed representation for each item I_m of the modified sequence of items including a word vector and a context vector; and using said new distributed representation for determining and discerning items with multiple meanings, wherein each word of the modified sequence conveys a ranked list of senses of the word in the original sequence based on a closeness of the item to the class representative vector without being a member of the class.
11. The non-transitory computer readable recording medium of claim 10, wherein the producing a modified sequence of items replaces each item with a variation of said item based on the sense of the item in a current context usage.
12. The non-transitory computer readable recording medium of claim 11, wherein the replacing scans the sequence of items and sets a current center item as a vocabulary word and uses an average of the word vectors of items in a window of a predetermined size surrounding the vocabulary word, and then calculates the cosine distance of the vector average with a class representative context vector.
13. The non-transitory computer readable recording medium of claim 10, wherein the partitioning items into classes uses a K-means algorithm.
14. The non-transitory computer readable recording medium of claim 12, wherein an average comprises a weighted average with higher weights assigned to word vectors whose word is closer to a window center word.
15. The non-transitory computer readable recording medium of claim 10, wherein the using further partitions the words of the modified sequence into new classes.
16. The non-transitory computer readable recording medium of claim 15, wherein said partitioning uses a K-Means algorithm.
17. The non-transitory computer readable recording medium of claim 15, wherein each word of the modified sequence is presented alongside dominant members of a new class to which said word belongs.
18. A two-phase system for determining and discerning items with multiple meanings in a sequence of items, the system comprising: a processor, and a memory, the memory storing instructions to cause the processor to perform: in a first phase: producing a distributed representation for each item of the sequence of items including a word vector and a context vector; partitioning the sequence of items into classes; for an item using a representative word vector of each class, calculating a cosine distance between the word vector of said item and the class representative vector not including an own class of the item; and producing a fixed number of classes D with smallest said cosine distance; and in a second phase: producing a modified sequence of items by replacing each item I occurrence in the sequence by an I_j occurrence where j is one of the D classes for item I; producing a new distributed representation for each item I_m of the modified sequence of items including a word vector and a context vector, and using said new distributed representation for determining and discerning items with multiple meanings, wherein each word of the modified sequence conveys a ranked list of senses of the word in the original sequence based on a closeness of the item to the class representative vector without being a member of the class.
19. The system of claim 18, wherein the producing a modified sequence of items including replacing each item with a variation of said item based on the sense of the item in a current context usage.
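One way to picture the two phases recited in the claims is sketched below: first pick the D closest classes for an item while skipping its own class, then rewrite each occurrence of item I as I_j using a weighted average of the word vectors in a window around it. This is only a sketch under stated assumptions. The function names, the toy vectors, the hand-written class representative context vectors, the `1 / distance` weighting, and the choice to leave the center word out of the average are all hypothetical; the claims require only that words closer to the window center receive higher weights.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)


def top_d_other_classes(word, word_vectors, labels, representatives, d):
    """Phase one: the d closest classes, excluding the word's own class."""
    own = labels[word]
    others = [c for c in representatives if c != own]
    others.sort(
        key=lambda c: cosine_distance(word_vectors[word], representatives[c])
    )
    return others[:d]


def weighted_window_average(tokens, center, word_vectors, window):
    """Average neighbour vectors around `center`, nearer words weighted more."""
    dim = len(next(iter(word_vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    lo, hi = max(0, center - window), min(len(tokens), center + window + 1)
    for i in range(lo, hi):
        if i == center or tokens[i] not in word_vectors:
            continue  # assumption: the center word itself is not averaged
        w = 1.0 / abs(i - center)  # hypothetical weighting scheme
        total = [t + w * x for t, x in zip(total, word_vectors[tokens[i]])]
        weight_sum += w
    if weight_sum == 0.0:
        return None
    return [t / weight_sum for t in total]


def tag_senses(tokens, word_vectors, candidates, context_reps, window=2):
    """Phase two: replace each occurrence of item I with I_j."""
    out = []
    for i, tok in enumerate(tokens):
        classes = candidates.get(tok)
        avg = (
            weighted_window_average(tokens, i, word_vectors, window)
            if classes
            else None
        )
        if not classes or avg is None:
            out.append(tok)  # no candidate classes or no usable context
            continue
        j = min(classes, key=lambda c: cosine_distance(avg, context_reps[c]))
        out.append(f"{tok}_{j}")
    return out


# Toy data (hypothetical): "bank" has its own class 2.
word_vectors = {
    "river": [1.0, 0.0], "water": [0.9, 0.1],
    "money": [0.0, 1.0], "loan": [0.1, 0.9],
    "bank": [0.6, 0.8],
}
labels = {"river": 0, "water": 0, "money": 1, "loan": 1, "bank": 2}
word_reps = {0: [0.95, 0.05], 1: [0.05, 0.95], 2: [0.6, 0.8]}
context_reps = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # stand-in context vectors

top = top_d_other_classes("bank", word_vectors, labels, word_reps, d=2)
candidates = {"bank": top}
tokens = ["river", "bank", "water", "money", "bank", "loan"]
tagged = tag_senses(tokens, word_vectors, candidates, context_reps)
print(top, tagged)
```

In the claimed method the relabelled sequence would then be used to produce a new distributed representation, so that each I_j receives its own word vector and context vector; that retraining step is outside this sketch.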
Semantic analysis · CPC title
Editing, e.g. inserting or deleting · CPC title
Parsing for meaning understanding · CPC title
Creation or modification of classes or clusters · CPC title
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title