Descriptor uniqueness for entity clustering

US11544312B2 · US · B2

Patent metadata
Publication number: US-11544312-B2
Application number: US-202016792456-A
Country: US
Kind code: B2
Filing date: Feb 17, 2020
Priority date: Feb 17, 2020
Publication date: Jan 3, 2023
Grant date: Jan 3, 2023

Abstract

Official abstract text for this publication.

A mechanism is provided in a data processing system to implement a cognitive natural language processing (NLP) system with descriptor uniqueness identification to support named entity mention clustering. The mechanism annotates a set of documents from a corpus of documents for entity types and mentions, collects descriptor usages from all documents in the corpus of documents, analyzes the descriptor usages to classify the descriptors as base terms or modifier terms, generates compatibility scores for the descriptors, and performs entity merging of entity clusters based on the compatibility scores.
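
The abstract does not say how descriptor usages are turned into compatibility scores. As a rough illustration only, the sketch below counts how often descriptors are applied to the same entity and derives a smoothed, PMI-style score whose sign follows the convention stated in the claims (positive suggests compatible, negative suggests incompatible). The scoring formula, the function names `cooccurrence_counts` and `compatibility`, the `smoothing` parameter, and the toy data are all assumptions made for this example, not details taken from the patent.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(usages):
    """Count single and pairwise descriptor usages.

    `usages` maps a (hypothetical) entity id to the set of descriptors
    that were applied to that entity somewhere in the corpus.
    """
    single = Counter()
    pair = Counter()
    for descriptors in usages.values():
        single.update(descriptors)
        # Sort so each unordered pair is stored under one canonical key.
        for a, b in combinations(sorted(descriptors), 2):
            pair[(a, b)] += 1
    return single, pair


def compatibility(a, b, single, pair, n_entities, smoothing=0.5):
    """Assumed PMI-style score: > 0 suggests compatible, < 0 incompatible."""
    if single[a] == 0 or single[b] == 0:
        return 0.0  # no evidence either way for an unseen descriptor
    key = tuple(sorted((a, b)))
    p_ab = (pair[key] + smoothing) / (n_entities + smoothing)
    p_a = single[a] / n_entities
    p_b = single[b] / n_entities
    return math.log(p_ab / (p_a * p_b))


# Toy data, invented for illustration.
usages = {
    "e1": {"senator", "lawyer"},
    "e2": {"senator", "lawyer"},
    "e3": {"senator", "author"},
    "e4": {"goalkeeper"},
    "e5": {"goalkeeper", "author"},
    "e6": {"goalkeeper"},
}
single, pair = cooccurrence_counts(usages)
n = len(usages)
print(compatibility("senator", "lawyer", single, pair, n))      # positive
print(compatibility("senator", "goalkeeper", single, pair, n))  # negative
```

In this toy corpus "senator" and "lawyer" describe the same entity twice, so their score is positive, while "senator" and "goalkeeper" never co-occur and score negative. A larger magnitude would indicate stronger evidence, in line with the claim language.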

First claim

Opening claim text (preview).

What is claimed is:

1. A method, in a data processing system comprising a processor and a memory, the memory comprising instructions that are executed by the processor to specifically configure the processor to implement a cognitive question answering (QA) system with descriptor uniqueness identification to support named entity mention clustering, the method comprising:
    receiving, by an entity uniqueness identification and entity clustering engine executing within the cognitive QA system, an open-domain corpus of text documents and a domain-specific corpus of text documents;
    annotating, by an entity tagger within the entity uniqueness and entity clustering engine, a set of documents from the open-domain corpus of text documents and the domain-specific corpus of text documents for entity types and mentions;
    collecting, by an entity uniqueness identification component within the entity uniqueness and entity clustering engine, descriptor usages of descriptors from all documents in the open-domain corpus of text documents;
    analyzing, by the entity uniqueness identification component, the descriptor usages to classify the descriptors as base terms or modifier terms;
    building, by the entity uniqueness identification component, a frequency count of descriptor co-occurrences;
    generating, by the entity uniqueness identification component, specificity markers for the descriptors, wherein each specificity marker specifies whether the uniqueness of the corresponding descriptor is definite or indefinite;
    generating, by the entity uniqueness identification component, compatibility scores for combinations of the descriptors based on the frequency counts of descriptor co-occurrences and the specificity markers for the descriptors, wherein the compatibility scores comprise real-valued scores such that a negative score indicates incompatibility and a positive score indicates compatibility, with larger magnitude scores indicating strength of determination or confidence;
    performing, by an entity clustering component within the entity uniqueness and entity clustering engine, entity merging of entity clusters based on the compatibility scores; and
    generating, by the cognitive QA system, a set of candidate answers from passages within the domain-specific corpus of text documents for an input question based on results of the entity merging of entity clusters.

2. The method of claim 1, further comprising removing context dependent descriptor terms.

3. The method of claim 1, wherein annotating the set of documents comprises replacing each text-level mention with its corresponding entity type.

4. The method of claim 1, wherein each description is in a grammatical construction selected from the group consisting of copula, pre-nominal, sentence-initial adverbial, and appositive.

5. The method of claim 1, wherein generating compatibility scores for the descriptors comprises using a rule-based scorer with a taxonomy and synonym resource.

6. The method of claim 1, wherein generating compatibility scores for the descriptors comprises using a trained statistical scorer.

7. The method of claim 1, wherein performing entity merging comprises using a classification model, a distance-based model, or a Markov Chain Monte Carlo based inference model.

8. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program comprises instructions, which when executed on a processor of a computing device causes the computing device to implement a cognitive question answering (QA) system with descriptor uniqueness identification to support named entity mention clustering, wherein the computer readable program causes the computing device to:
    receive, by an entity uniqueness identification and entity clustering engine executing within the cognitive QA system, an open-domain corpus of text documents and a domain-specific corpus of text documents;
    annotate, by an entity tagger within the entity uniqueness and entity clustering engine, a set of documents from the open-domain corpus of text documents and the domain-specific corpus of text documents for entity types and mentions;
    collect, by an entity uniqueness identification component within the entity uniqueness and entity clustering engine, descriptor usages of descriptors from all documents in the open-domain corpus of text documents;
    analyze, by the entity uniqueness identification component, the descriptor usages to classify the descriptors as base terms or modifier terms;
    build, by the entity uniqueness identification component, a frequency count of descriptor co-occurrences;
    generate, by the entity uniqueness identification component, specificity markers for the descriptors, wherein each specificity marker specifies whether the uniqueness of the corresponding descriptor is definite or indefinite;
    generate, by the entity uniqueness identification component, compatibility scores for combinations of the descriptors based on the frequency counts of descriptor co-occurrences and the specificity markers for the descriptors, wherein the compatibility scores comprise real-valued scores such that a negative score indicates incompatibility and a positive score indicates compatibility, with larger magnitude scores indicating strength of determination or confidence;
    perform, by an entity clustering component within the entity uniqueness and entity clustering engine, entity merging of entity clusters based on the compatibility scores; and
    generate, by the cognitive QA system, a set of candidate answers from passages within the domain-specific corpus of text documents for an input question based on results of the entity merging of entity clusters.

9. The computer program product of claim 8, wherein the computer readable program causes the computing device to remove context dependent descriptor terms.

10. The computer program product of claim 8, wherein annotating the set of documents comprises replacing each text-level mention with its corresponding entity type.

11. The computer program product of claim 8, wherein each description is in a grammatical construction selected from the group consisting of copula, pre-nominal, sentence-initial adverbial, and appositive.

12. The computer program product of claim 8, wherein generating compatibility scores for the descriptors comprises using a rule-based scorer with a taxonomy and synonym resource.

13. The computer program product of claim 8, wherein generating compatibility scores for the descriptors comprises using a trained statistical scorer.

14. A computing device comprising: a processor; and a memory coupled to the processor, wherein the memory comprises instructions, which when executed on a processor of a computing device causes the computing device to implement a cognitive question answering (QA) system with descriptor uniqueness identification to support named entity mention clustering, wherein the instructions cause the processor to:
    receive, by an entity uniqueness identification and entity clustering engine executing within the cognitive QA system, an open-domain corpus of text documents and a domain-specific corpus of text documents;
    annotate, by an entity tagger within the entity uniqueness and entity clustering engine, a set of documents from the open-domain corpus of text documents and the domain-specific corpus of text documents for entity types and mentions;
    collect, by an entity uniqueness identification component within the entity uniqueness and entity clustering engine, descriptor usages of descriptors from all documents in the open-domain corpus of text document
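
The claims recite that compatibility scores take specificity markers into account and that entity clusters are merged based on those scores, but the preview does not give a concrete procedure. The sketch below is one simplified way this could look, under stated assumptions: a shared "definite" descriptor is treated as strong positive evidence and a shared "indefinite" one as weak evidence, and clusters are merged greedily while the best summed score stays positive. The greedy loop is a stand-in chosen for brevity; it is not the classification, distance-based, or Markov Chain Monte Carlo model named in claim 7. The names `pair_score` and `merge_clusters`, the numeric weights, and the toy data are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, markers, table):
    """Score one descriptor pair; positive = compatible, negative = incompatible."""
    if a == b:
        # Assumed rule: sharing a definite (unique) descriptor is strong
        # evidence of identity; sharing an indefinite one is only weak evidence.
        return 2.0 if markers.get(a) == "definite" else 0.25
    # Unknown pairs contribute no evidence.
    return table.get(frozenset((a, b)), 0.0)


def merge_clusters(clusters, markers, table):
    """Greedily merge the best-scoring cluster pair until no score is positive."""
    clusters = [set(c) for c in clusters]
    while True:
        best, best_pair = 0.0, None
        for i, j in combinations(range(len(clusters)), 2):
            score = sum(
                pair_score(a, b, markers, table)
                for a in clusters[i]
                for b in clusters[j]
            )
            if score > best:
                best, best_pair = score, (i, j)
        if best_pair is None:
            return clusters
        i, j = best_pair
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i is unaffected


# Toy data, invented for illustration.
markers = {
    "44th president": "definite",
    "senator": "indefinite",
    "lawyer": "indefinite",
    "goalkeeper": "indefinite",
}
table = {
    frozenset(("senator", "lawyer")): 0.8,
    frozenset(("senator", "goalkeeper")): -1.5,
}
clusters = [
    {"44th president", "senator"},
    {"44th president", "lawyer"},
    {"goalkeeper"},
]
merged = merge_clusters(clusters, markers, table)
print(merged)
```

With these toy values the first two clusters merge because they share a definite descriptor, while the "goalkeeper" cluster stays separate because its only scored pairing is negative.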

Assignees

Inventors

Classifications

  • Natural language query formulation · CPC title

  • Annotation, e.g. comment data or footnotes · CPC title

  • G06F16/355 · Primary

    Creation or modification of classes or clusters · CPC title

  • using statistics or function optimisation, e.g. modelling of probability density functions · CPC title

  • G06F40/295 · Primary

    Named entity recognition · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11544312B2 cover?
A mechanism is provided in a data processing system to implement a cognitive natural language processing (NLP) system with descriptor uniqueness identification to support named entity mention clustering. The mechanism annotates a set of documents from a corpus of documents for entity types and mentions, collects descriptor usages from all documents in the corpus of documents, analyzes the descr…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/355. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 3, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).