Domain concept discovery and clustering using word embedding in dialogue design

US11048870B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11048870-B2
Application number: US-201715841703-A
Country: US
Kind code: B2
Filing date: Dec 14, 2017
Priority date: Jun 7, 2017
Publication date: Jun 29, 2021
Grant date: Jun 29, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method performs automated domain concept discovery and clustering using word embeddings by receiving a set of documents for natural language processing for a domain, representing a plurality of entries in the set of documents as continuous vectors in a high dimensional continuous space, applying a clustering algorithm based on a mutual information optimization criterion to form a set of clusters, associating each entry of the plurality of entries with each cluster in the set of clusters through formalizing an evidence based model of each cluster given each entry, calculating a mutual information metric between each entry and each cluster using the evidence based model, and identifying a nominal center of each cluster by maximizing the mutual information.
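
The abstract's core quantities can be illustrated with a small sketch. It is not the patented implementation: the toy 2-D vectors, the softmax-over-cosine-similarity form of the evidence model, the `temperature` parameter, and all function names are assumptions chosen for illustration. It computes a posterior p(cluster | entry) for each entry and the mutual information I(X; C) = H(C) - H(C | X) between entries and clusters, then compares well-placed and poorly placed cluster centers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def posterior(x, centers, temperature=0.1):
    """Assumed evidence model: softmax over cosine similarity to each center."""
    scores = [cosine(x, c) / temperature for c in centers]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mutual_information(entries, centers, temperature=0.1):
    """I(X; C) = H(C) - H(C | X) in nats, with entries weighted uniformly."""
    posts = [posterior(x, centers, temperature) for x in entries]
    n, k = len(entries), len(centers)
    marginal = [sum(p[j] for p in posts) / n for j in range(k)]
    h_c = -sum(p * math.log(p) for p in marginal if p > 0)
    h_c_given_x = -sum(p * math.log(p) for post in posts for p in post if p > 0) / n
    return h_c - h_c_given_x

# Hypothetical toy embeddings; a real system would use high-dimensional vectors.
EMBEDDINGS = {
    "refund": [0.9, 0.1], "payment": [1.0, 0.2], "invoice": [0.8, 0.0],
    "flight": [0.1, 0.9], "airline": [0.0, 1.0], "boarding": [0.2, 0.8],
}
entries = list(EMBEDDINGS.values())

good_centers = [[1.0, 0.1], [0.1, 1.0]]   # one center near each word group
poor_centers = [[1.0, 1.0], [0.9, 1.0]]   # two nearly identical centers

mi_good = mutual_information(entries, good_centers)
mi_poor = mutual_information(entries, poor_centers)
print(f"well-separated centers: {mi_good:.3f} nats")
print(f"overlapping centers:    {mi_poor:.3f} nats")
```

With two clusters the mutual information cannot exceed ln 2 (about 0.693 nats), so centers that separate the two word groups cleanly score near that bound, while overlapping centers score near zero.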

First claim

Opening claim text (preview).

The invention claimed is:

1. A method to perform automated domain concept discovery and clustering, the method comprising: receiving a set of documents for natural language processing for a domain; representing a plurality of entries in the set of documents as a plurality of continuous vector representations in a high dimensional continuous space; applying a log-linear model algorithm to the plurality of continuous vector representations; applying a clustering algorithm to discover domain concepts based on a mutual information optimization criterion to form a set of clusters; associating each entry of the plurality of entries with each cluster in the set of clusters through formalizing an evidence based model of each cluster given each entry; calculating a mutual information between each entry and each cluster using the evidence based model; identifying a nominal center of each cluster by maximizing a corresponding calculated mutual information of that cluster; calculating an iterative derivative gradient with respect to the plurality of continuous vector representations via chain-rule backpropagation, wherein a number of iterations in calculating the iterative derivate gradient is controlled via a learning-rate hyperparameter; and moving the nominal center of each cluster along the iterative derivative gradient to thereby reduce a value of an objective function of the clustering algorithm via a pure gradient descent approach.

2. The method of claim 1, wherein a continuous vector representation of the plurality of continuous vector representations uses a pre-trained word embedding.

3. The method of claim 2, wherein the each entry is selected from a group consisting of an entity, a concept, and a relationship.

4. The method of claim 3, wherein the group is communicated in a knowledge graph.

5. The method of claim 1, wherein the documents include corpus or a body of works.

6. The method of claim 1, wherein the entries include words or phrases.

7. The method of claim 1, wherein the evidence based model includes a posterior probability model.

8. The method of claim 1, wherein representing the entries includes extracting a unique word.

9. The method of claim 1, wherein representing the entries includes mapping a word.

10. The method of claim 1, further comprising adding an entity synonym.

11. The method of claim 1, further comprising adding a slot synonym.

12. The method of claim 1, wherein associating each entry further comprises mapping each cluster to each entry.

13. The method of claim 1, further comprising identifying a theme based on the set of clusters.

14. The method of claim 1, wherein identifying the nominal center of each cluster by maximizing the mutual information includes determining a cosine distance between each entity included in each cluster.

15. An apparatus comprising: a memory storing program code; and a processor configured to access the memory and execute the program code to receive a set of documents for natural language processing for a domain, represent a plurality of entries in the set of documents as continuous vectors in a high dimensional continuous space, apply a log-linear model algorithm to the continuous vectors, apply a clustering algorithm to discover domain concepts based on a mutual information optimization criterion to form a set of clusters, associate each entry of the plurality of entries with each cluster in the set of clusters through formalizing an evidence based model of each cluster given each entry, calculate a mutual information between each entry and each cluster using the evidence based model, identify a nominal center of each cluster in the set of clusters by maximizing a corresponding calculated mutual information of that cluster, calculate an iterative derivative gradient with respect to the plurality of continuous vector representations via chain-rule backpropagation, wherein a number of iterations in calculating the iterative derivate gradient is controlled via a learning-rate hyperparameter, and move the nominal center of each cluster along the iterative derivative gradient to thereby reduce a value of an objective function of the clustering algorithm via a pure gradient descent approach.

16. The apparatus of claim 15, wherein the mutual information optimization criterion includes pointwise mutual information.

17. The apparatus of claim 15, wherein the processor is further configured to extract a word included in the set of documents.

18. The apparatus of claim 17, wherein the processor is further configured to compile a lexicon using the word.

19. A program product to perform automated domain concept discovery and clustering, the program product comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code being executable by a processor to receive a set of documents for natural language processing for a domain, represent a plurality of entries in the set of documents as continuous vectors in a high dimensional continuous space, apply a log-linear model algorithm to the continuous vectors, apply a clustering algorithm to discover domain concepts based on a mutual information optimization criterion to form a set of clusters, associate each entry of the plurality of entries with each cluster in the set of clusters through formalizing an evidence based model of each cluster given each entry, calculate a mutual information between each entry and each cluster using the evidence based model, identify a nominal center of each cluster in the set of clusters by maximizing a corresponding calculated mutual information of that cluster, calculate an iterative derivative gradient with respect to the plurality of continuous vector representations via chain-rule backpropagation, wherein a number of iterations in calculating the iterative derivate gradient is controlled via a learning-rate hyperparameter, and move the nominal center of each cluster along the iterative derivative gradient to thereby reduce a value of an objective function of the clustering algorithm via a pure gradient descent approach.
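
The gradient steps recited in the independent claims can be sketched as follows. This is one possible reading under stated assumptions, not the claimed implementation: the log-linear model is taken to be p(cluster k | entry i) proportional to exp(x_i · c_k), the objective is the negative mutual information between entries and clusters, the gradient is taken with respect to the cluster centers, and the toy vectors, fixed step count, and names `loss_and_grad` and `descend` are hypothetical. The gradient is derived by hand with the chain rule (objective to posteriors, posteriors to logits, logits to centers).

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def posteriors(entries, centers):
    # Assumed log-linear model: logit z_ik is the dot product x_i . c_k.
    return [softmax([sum(a * b for a, b in zip(x, c)) for c in centers])
            for x in entries]

def loss_and_grad(entries, centers):
    """Return (-I(X; C), gradient of that loss with respect to each center)."""
    n, k, d = len(entries), len(centers), len(centers[0])
    post = posteriors(entries, centers)
    marginal = [sum(p[j] for p in post) / n for j in range(k)]
    mi = 0.0
    grad = [[0.0] * d for _ in range(k)]
    for x, p in zip(entries, post):
        # dI/dp_ij = log(p_ij / m_j) / n
        g = [math.log(p[j] / marginal[j]) / n if p[j] > 0 else 0.0
             for j in range(k)]
        mean_g = sum(pj * gj for pj, gj in zip(p, g))
        mi += mean_g  # this entry's contribution to I(X; C)
        for j in range(k):
            dz = p[j] * (g[j] - mean_g)   # dI/dz_ij through the softmax
            for t in range(d):
                grad[j][t] -= dz * x[t]   # loss is -I, and dz_ij/dc_j = x_i
    return -mi, grad

def descend(entries, centers, learning_rate=0.5, steps=200):
    """Plain gradient descent on the centers; records the loss at each step."""
    centers = [list(c) for c in centers]
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(entries, centers)
        history.append(loss)
        for c, g in zip(centers, grad):
            for t in range(len(c)):
                c[t] -= learning_rate * g[t]
    history.append(loss_and_grad(entries, centers)[0])
    return centers, history

# Hypothetical toy vectors: two loose groups of three entries each.
entries = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0],
           [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]
start = [[0.6, 0.4], [0.4, 0.6]]
final_centers, history = descend(entries, start)
print(f"loss before: {history[0]:.4f}, loss after: {history[-1]:.4f}")
```

Each update moves a center toward the entries it already favors and away from the others, which sharpens the posteriors and lowers the loss toward its floor of -ln 2 for two clusters.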

Assignees

IBM

Inventors

Classifications

  • Semantic analysis · CPC title

  • G06F16/35 · Primary

    Clustering; Classification · CPC title

  • Document management systems · CPC title

  • Querying (for retrieval from the web G06F16/953) · CPC title

  • G06F40/205 · Primary

    Parsing · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11048870B2 cover?
A system and method performs automated domain concept discovery and clustering using word embeddings by receiving a set of documents for natural language processing for a domain, representing a plurality of entries in the set of documents as continuous vectors in a high dimensional continuous space, applying a clustering algorithm based on a mutual information optimization criterion to form a s…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/35. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 29, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).