Phenomenological semantic distance from latent dirichlet allocations (LDA) classification

US10242002B2 · US · B2

Patent metadata
Publication number: US-10242002-B2
Application number: US-201615225458-A
Country: US
Kind code: B2
Filing date: Aug 1, 2016
Priority date: Aug 1, 2016
Publication date: Mar 26, 2019
Grant date: Mar 26, 2019

Abstract

Official abstract text for this publication.

Embodiments provide a system and method for semantic distance calculation. The method can involve receiving a plurality of documents having a set of subjects extracted through the use of latent dirichlet allocation; for each document in the plurality of documents, generating a classification list comprising a ranking of the one or more subjects based on the relevance of each subject to the document; for each classification list, calculating the semantic distance between each subject present on the classification list; aggregating the plurality of classification lists; and creating a distance matrix containing the relative semantic distances between each member of the set of subjects.
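The steps in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how the per-document distance is computed or how classification lists are aggregated, so both choices here are assumptions made only for illustration: the distance between two subjects in one document is taken as the absolute difference of their normalized relevance values, and aggregation is a mean over the documents in which both subjects appear. The function names `normalize_to_primary` and `build_distance_matrix` are hypothetical, and the input is assumed to be one subject-to-relevance mapping per document, already produced by an LDA step.

```python
from collections import defaultdict
from itertools import combinations


def normalize_to_primary(relevances):
    """Scale relevance values so the primary (top) subject equals 1.0."""
    top = max(relevances.values())
    if top <= 0:
        raise ValueError("relevance values must be positive")
    return {subject: value / top for subject, value in relevances.items()}


def build_distance_matrix(documents):
    """Aggregate per-document subject distances into one pairwise matrix.

    `documents` is a list of dicts mapping subject -> raw relevance.
    ASSUMPTION (not from the patent text): per-document distance is the
    absolute difference of normalized relevance, and aggregation is the
    mean over the documents where both subjects appear.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for relevances in documents:
        if not relevances:
            continue  # nothing to rank for this document
        normalized = normalize_to_primary(relevances)
        # Classification list: subjects ranked by relevance, highest first.
        ranked = sorted(normalized, key=normalized.get, reverse=True)
        for a, b in combinations(ranked, 2):
            pair = tuple(sorted((a, b)))  # order-independent matrix key
            totals[pair] += abs(normalized[a] - normalized[b])
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}
```

Under these assumptions, two subjects that are consistently close in relevance within the same documents end up with a small aggregated distance, while a subject that only ever appears as a minor topic next to a dominant one ends up far from it.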

Claims

Full claim text for this publication.

What is claimed is:

1. A computer implemented method in a data processing system comprising a processor and a memory comprising instructions, which are executed by the processor to cause the processor to implement a system for calculating semantic distances between subjects using a natural language processing technique, the method comprising: receiving a plurality of documents having a set of subjects extracted through latent dirichlet allocation; for each subject, extrapolating the subject into one or more topic vectors; calculating relevance of the subject through analyzing the one or more topic vectors against the plurality of documents; for each document in the plurality of documents, generating a classification list comprising a ranking of the one or more subjects based on the relevance of each subject to the document; for each classification list, normalizing a relevance value to be no more than 1.0 for each subject on the classification list based on a primary subject; for each classification list, calculating the semantic distance between each subject present on the classification list; aggregating the generated classification list of each of the plurality of documents; and creating a distance matrix containing relative semantic distances between each member of the set of subjects.

2. The computer implemented method as recited in claim 1, further comprising: excluding one or more subjects from the classification list if the subjects fail to reach a predetermined relevance threshold.

3. The computer implemented method as recited in claim 1, further comprising: disregarding one or more distance matrix elements if the elements do not appear in a predetermined threshold amount of documents contained in the plurality of documents.

4. The computer implemented method as recited in claim 1, further comprising: randomizing an order of the plurality of documents prior to ingestion.

5. A computer program product for calculating semantic distance between subjects using a natural language processing technique, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to: receive a plurality of documents having a set of subjects extracted through latent dirichlet allocation; for each subject, extrapolate the subject into one or more topic vectors; calculate relevance of the subject through analyzing the one or more topic vectors against the plurality of documents; for each document in the plurality of documents, generate a classification list comprising a ranking of the one or more subjects based on the relevance of each subject to the document; for each classification list, normalize a relevance value to be no more than 1.0 for each subject on the classification list based on a primary subject; for each classification list, calculate the semantic distance between each subject present on the classification list; aggregate the generated classification list of each of the plurality of documents; and create a distance matrix containing relative semantic distances between each member of the set of subjects.

6. The computer program product as recited in claim 5, the processor further configured to: exclude one or more subjects from the classification list if the subjects fail to reach a predetermined relevance threshold.

7. The computer program product as recited in claim 5, the processor further configured to: disregard one or more distance matrix elements if the elements do not appear in a predetermined threshold amount of documents contained in the plurality of documents.

8. The computer program product as recited in claim 5, the processor further configured to: randomize an order of the plurality of documents prior to ingestion.

9. A system for calculating semantic distance between subjects using a natural language processing technique, comprising: a semantic distance calculation processor configured to: receive a plurality of documents having a set of subjects extracted through latent dirichlet allocation; for each subject, extrapolate the subject into one or more topic vectors; calculate relevance of the subject through analyzing the one or more topic vectors against the plurality of documents; for each document in the plurality of documents, generate a classification list comprising a ranking of the one or more subjects based on the relevance of each subject to the document; for each classification list, normalize a relevance value to be no more than 1.0 for each subject on the classification list based on a primary subject; for each classification list, calculate the semantic distance between each subject present on the classification list; aggregate the generated classification list of each of the plurality of documents; and create a distance matrix containing relative semantic distances between each member of the set of subjects.

10. The system as recited in claim 9, the semantic distance calculation processor further configured to: exclude one or more subjects from the classification list if the subjects fail to reach a predetermined relevance threshold.

11. The system as recited in claim 9, the semantic distance calculation processor further configured to: disregard one or more distance matrix elements if the elements do not appear in a predetermined threshold amount of documents contained in the plurality of documents.

12. The system as recited in claim 9, the semantic distance calculation processor further configured to: randomize an order of the plurality of documents prior to ingestion.
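The optional steps in the dependent claims (a relevance threshold, a minimum document count for matrix elements, and randomized document order) might look roughly like the following Python sketch. The claims give no threshold values or data structures, so the defaults `0.2` and `2`, the helper names, and the reading of "appear in a predetermined threshold amount of documents" as a count of documents in which a subject pair co-occurs are all assumptions for illustration.

```python
import random


def apply_relevance_threshold(normalized, threshold=0.2):
    """Drop subjects whose normalized relevance is below the threshold.

    ASSUMPTION: 0.2 is an arbitrary illustrative value.
    """
    return {s: r for s, r in normalized.items() if r >= threshold}


def drop_rare_pairs(pair_distances, pair_doc_counts, min_docs=2):
    """Disregard matrix elements supported by fewer than `min_docs` documents.

    `pair_doc_counts` maps each subject pair to the number of documents
    in which both subjects appeared (hypothetical bookkeeping structure).
    """
    return {
        pair: distance
        for pair, distance in pair_distances.items()
        if pair_doc_counts.get(pair, 0) >= min_docs
    }


def shuffled(documents, seed=None):
    """Return the documents in random order without mutating the input."""
    rng = random.Random(seed)  # a seed makes the order reproducible
    result = list(documents)
    rng.shuffle(result)
    return result
```

In a pipeline, `shuffled` would run before ingestion, `apply_relevance_threshold` after normalization of each classification list, and `drop_rare_pairs` once the aggregated matrix is available.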

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10242002B2 cover?
Embodiments provide a system and method for semantic distance calculation. The method can involve receiving a plurality of documents having a set of subjects extracted through the use of latent dirichlet allocation; for each document in the plurality of documents, generating a classification list comprising a ranking of the one or more subjects based on the relevance of each subject to the docu…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/93. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 26, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).