Automatically extending a domain taxonomy to the level of granularity present in glossaries in documents

US11475222B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11475222-B2
Application number  US-202016797430-A
Country             US
Kind code           B2
Filing date         Feb 21, 2020
Priority date       Feb 21, 2020
Publication date    Oct 18, 2022
Grant date          Oct 18, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A controller accesses an initial taxonomy for a domain comprising one or more existing terms for the domain identified in a hierarchical structure. The controller analyzes a corpus of documents for a domain to identify a selection of one or more documents with glossaries. The controller extracts, from the glossaries, one or more pairs each comprising a term and a definition. The controller attempts to map a respective term of each of the one or more pairs into the initial taxonomy for the domain based on text of a respective definition of each of the one or more pairs to generate an updated taxonomy for the domain.
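The extraction step in the abstract (pulling term and definition pairs out of a glossary) might look roughly like the sketch below. The abstract does not say how glossaries are laid out, so the one-entry-per-line "term: definition" or "term - definition" format, the `ENTRY` pattern, and the `extract_pairs` name are all assumptions made for illustration, not the patented method.

```python
import re

# Assumed layout (not specified by the patent): one glossary entry per line,
# written either as "term: definition" or as "term - definition".
ENTRY = re.compile(r"^\s*(?P<term>[^:\n]+?)\s*(?::|\s-\s)\s*(?P<definition>\S.*)$")


def extract_pairs(glossary_text):
    """Return a list of (term, definition) tuples found in the glossary text."""
    pairs = []
    for line in glossary_text.splitlines():
        match = ENTRY.match(line)
        if match:  # headings and blank lines do not match and are skipped
            pairs.append((match.group("term"), match.group("definition").strip()))
    return pairs


sample = (
    "Glossary\n"
    "\n"
    "Annuity: A contract that pays a fixed sum each year.\n"
    "Variable annuity - An annuity whose payments vary.\n"
)
for term, definition in extract_pairs(sample):
    print(f"{term!r} -> {definition!r}")
```

Real glossaries (tables, definition lists, bold run-in terms) would need a format-specific extractor; the line-based pattern here only serves to make the "pairs" data shape concrete.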

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
    applying, by a computer, rule-based annotators and statistical annotators to automate document annotation;
    annotating, by the computer, a plurality of documents in a corpus such that the plurality of documents are recognizable by a machine;
    using, by the computer, annotated documents as a dataset in machine learning for building natural language processing models used in a question answering system in the computer;
    accessing, by the computer, an initial taxonomy for a domain comprising one or more existing terms for the domain identified in a hierarchical structure;
    analyzing, by the computer, the corpus of the plurality of documents for a domain to identify a selection of one or more documents with glossaries;
    extracting, by the computer, from the glossaries, one or more pairs each comprising a term and a definition;
    attempting to map, by the computer, a respective term of each of the one or more pairs into the initial taxonomy for the domain based on text of a respective definition of each of the one or more pairs to generate an updated taxonomy for the domain;
    extracting, by the computer, a head noun phrase of the respective definition of a current entry from among the one or more pairs;
    evaluating, by the computer, whether the head noun phrase is present in the initial taxonomy;
    responsive to the head noun phrase being present in the initial taxonomy, mapping, by the computer, the respective term of the current entry to the initial taxonomy to generate the updated taxonomy;
    responsive to the head noun phrase not being present in the initial taxonomy, evaluating, by the computer, whether the head noun phrase is present in a particular definition from among the one or more pairs;
    responsive to evaluating the head noun phrase is present in the particular definition from among the one or more pairs, building, by the computer, a tiny taxonomy with the respective term of the current entry as a child node and another term paired with the particular definition as the parent node; and
    responsive to mapping the another term to the initial taxonomy to generate the updated taxonomy, mapping, by the computer system, the tiny taxonomy to the updated taxonomy.

2. The method according to claim 1, wherein accessing, by a computer, an initial taxonomy for a domain comprising one or more existing terms for the domain identified in a hierarchical structure further comprises:
    accessing, by the computer, the initial taxonomy comprising the one or more existing terms for the domain identified in the hierarchical structure comprising a parent node and one or more levels of child nodes.

3. The method according to claim 1, wherein attempting to map a respective term of each of the one or more pairs into the initial taxonomy for the domain based on text of a respective definition of each of the one or more pairs to generate an updated taxonomy for the domain further comprises:
    marking, by the computer, one or more selections of the one or more pairs that are related; and
    attempting to map, by the computer, the respective term of each of the one or more pairs into the initial taxonomy for the domain based on text of the respective definition of each of the one or more pairs and the marked one or more selections of the one or more pairs that are related to generate the updated taxonomy for the domain.

4. The method according to claim 1, further comprising:
    responsive to the head noun phrase not being present in the initial taxonomy, evaluating, by the computer, whether a last word of the head noun phrase is present the initial taxonomy; and
    responsive to evaluating the last word of the head noun phrase is present in the initial taxonomy, mapping, by the computer, the respective term of the current entry to the updated taxonomy.

5. The method according to claim 1, further comprising:
    responsive to the head noun phrase not being present in the initial taxonomy, evaluating, by the computer, whether a see also term is present in a particular definition of another entry from among the one or more pairs;
    responsive to detecting the see also term in the particular definition of the another entry, attempting to map, by the computer system, the another entry to the initial taxonomy to generate the updated taxonomy; and
    responsive to mapping the another entry to the initial taxonomy to generate the updated taxonomy, mapping the current entry to a same node as the another entry in the updated taxonomy.

6. The method according to claim 1, further comprising:
    identifying, by the computer, a remainder collection of one or more unmapped pairs from among the plurality of pairs that are not mapped to generate the updated taxonomy; and
    clustering, by the computer, one or more clusters from among the one or more unmapped pairs based on the text of the respective definition of each of the one or more unmapped pairs into one or more groups of semantically similar terms.

7. The method according to claim 1, wherein attempting to map, by the computer, a respective term of each of the one or more pairs into the initial taxonomy for the domain based on text of a respective definition of each of the one or more pairs to generate an updated taxonomy for the domain further comprises:
    identifying, by the computer, a remainder collection of one or more unmapped pairs from among the plurality of pairs that are not mapped to generate the updated taxonomy; and
    iteratively attempting to map, by the computer, the remainder collection of the one or more unmapped pairs to the updated taxonomy based on the text of the respective definition of the one or more unmapped pairs.

8. The method according to claim 6, further comprising:
    evaluating, by the computer, a top N terms from each of the one or more clusters;
    selecting, by the computer, a best match term from each selection of top N terms as a candidate concept label for the respective cluster from the one or more clusters; and
    automatically adding, by the computer, each candidate concept label to the initial taxonomy to generate the updated taxonomy.

9. A computer system comprising one or more processors, one or more computer-readable memories, one or more computer-readable storage devices, and program instructions, stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, the stored program instructions comprising:
    program instruction to apply rule-based annotators and statistical annotators to automate document annotation;
    program instructions to annotate a plurality of documents in a corpus such that the plurality of documents are recognizable by a machine;
    program instructions to use annotated documents as a dataset in machine learning for building natural language processing models used in a question answering system in the computer system;
    program instructions to access an initial taxonomy for a domain comprising one or more existing terms for the domain identified in a hierarchical structure;
    program instructions to analyze the corpus of the plurality of documents for a domain to identify a selection of one or more documents with glossaries;
    program instructions to extract, from the glossaries, one or more pairs each comprising a term and a definition;
    program instructions to attempt to map a respective term of each of the one or more pairs into the initial taxonomy for the domain based on text of a respective definition of each of the one or more pairs to generate an updated taxonomy for the domain;
    program instructions to extract a head noun phrase of the respective definition of a current entry from among the one or more pairs;
    program instructions to
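The mapping steps recited in claims 1 and 4 can be sketched as a small cascade: try the head noun phrase, then its last word, then fall back to a deferred "tiny taxonomy". Everything below is illustrative and hypothetical. The taxonomy is assumed to be a plain child-to-parent dictionary, `head_noun_phrase` is a crude word-list heuristic standing in for a real syntactic parser, lookups are made against the growing updated taxonomy, and the ordering of the last-word check before the tiny-taxonomy check is an assumption, since the claims do not fix it.

```python
import re

ARTICLES = {"a", "an", "the"}
# Words assumed to end the leading noun phrase of a definition (heuristic only).
BOUNDARY = {"that", "which", "who", "whose", "of", "for", "in", "on", "with", "to", "by", "from"}


def head_noun_phrase(definition):
    """Naive stand-in for a parser: words after the article, up to a boundary word."""
    words = re.findall(r"[a-z][a-z-]*", definition.lower())
    if words and words[0] in ARTICLES:
        words = words[1:]
    phrase = []
    for word in words:
        if word in BOUNDARY:
            break
        phrase.append(word)
    return " ".join(phrase)


def extend_taxonomy(taxonomy, pairs):
    """Return (updated child->parent dict, list of terms left unmapped)."""
    updated = dict(taxonomy)  # copy, so the initial taxonomy is not modified
    pending = {}              # tiny taxonomies: child term -> parent term
    unmapped = []
    for term, definition in pairs:
        key = term.lower()
        hnp = head_noun_phrase(definition)
        words = hnp.split()
        if hnp in updated:                      # head noun phrase is a known node
            updated[key] = hnp
        elif words and words[-1] in updated:    # claim 4: last-word fallback
            updated[key] = words[-1]
        else:
            # Tiny taxonomy: another entry whose definition mentions the phrase
            # becomes the parent of the current term.
            parent = next(
                (t.lower() for t, d in pairs
                 if hnp and t.lower() != key and hnp in d.lower()),
                None,
            )
            if parent is None:
                unmapped.append(key)
            else:
                pending[key] = parent
    # Attach each tiny taxonomy once its parent term has itself been mapped.
    changed = True
    while changed:
        changed = False
        for child, parent in list(pending.items()):
            if parent in updated:
                updated[child] = parent
                del pending[child]
                changed = True
    unmapped.extend(pending)  # tiny taxonomies whose parent never got mapped
    return updated, unmapped


initial = {"contract": None, "cost": None}  # roots have no parent
glossary = [
    ("Surrender charge", "A penalty fee for early withdrawal."),
    ("Annuity", "A contract that pays a fixed sum each year."),
    ("Indexed annuity", "An insurance contract whose return tracks a market index."),
    ("Charge", "A cost in a contract, for example a penalty fee."),
    ("Beneficiary", "The person who receives the payout."),
]
updated, unmapped = extend_taxonomy(initial, glossary)
print(updated)
print(unmapped)
```

In this toy run, "Surrender charge" is held back as a tiny taxonomy under "Charge" and is attached only after "Charge" itself lands under "cost". Terms left in `unmapped` correspond to the remainder collection that claims 6 and 7 go on to cluster or retry.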

Assignees

IBM

Inventors

Classifications

  • G06F40/30 (Primary)

    Semantic analysis · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • G06F40/289 (Primary)

    Phrasal analysis, e.g. finite state techniques or chunking · CPC title

  • Thesauruses; Synonyms · CPC title

  • Grammatical analysis; Style critique · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11475222B2 cover?
A controller accesses an initial taxonomy for a domain comprising one or more existing terms for the domain identified in a hierarchical structure. The controller analyzes a corpus of documents for a domain to identify a selection of one or more documents with glossaries. The controller extracts, from the glossaries, one or more pairs each comprising a term and a definition. The controller attempt…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 18, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).