Using natural language processing (NLP) to create subject matter synonyms from definitions

US9665568B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9665568-B2
Application number  US-201615043447-A
Country             US
Kind code           B2
Filing date         Feb 12, 2016
Priority date       Sep 13, 2013
Publication date    May 30, 2017
Grant date          May 30, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, apparatus and systems, including computer program products, for creating subject matter synonyms from definitions extracted from a subject matter glossary. Confidence scores, each representing a likelihood that two terms defined in the subject matter glossary are synonyms, are determined by applying natural language processing (e.g., passage term matching, lexical matching, and syntactic matching) to the extracted definitions. A subject matter thesaurus is built based on the confidence scores. In one embodiment, a statement containing a first term is created based on an extracted definition of the first term, a modified statement is created by substituting a second term in the statement in lieu of the first term, a corpus is searched, and a confidence score is determined based on evidence in the corpus that the modified statement is accurate. The first and second terms are marked as synonyms if the confidence score is greater than a threshold.
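The substitution embodiment described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the "term is definition" statement template, the toy corpus, and the token-overlap scoring (a crude stand-in for passage term matching) are all assumptions made for the example.

```python
import re

def make_statement(term, definition):
    """Build a statement containing the term from its glossary definition.
    The "<term> is <definition>" template is an assumption for illustration."""
    return f"{term} is {definition}"

def substitute(statement, first_term, second_term):
    """Create the modified statement: second term in lieu of the first term."""
    pattern = r"\b" + re.escape(first_term) + r"\b"
    # A callable replacement avoids backslash handling in the replacement text.
    return re.sub(pattern, lambda _m: second_term, statement, flags=re.IGNORECASE)

def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def confidence(modified_statement, second_term, corpus):
    """Assumed evidence score: among passages that mention the second term,
    the best fraction of the modified statement's tokens found in one passage."""
    wanted = tokens(modified_statement)
    if not wanted:
        return 0.0
    scores = [
        len(wanted & tokens(passage)) / len(wanted)
        for passage in corpus
        if second_term.lower() in passage.lower()
    ]
    return max(scores, default=0.0)

def are_synonyms(first_term, definition, second_term, corpus, threshold):
    """Mark the terms as synonyms if the confidence score exceeds the threshold."""
    statement = make_statement(first_term, definition)
    modified = substitute(statement, first_term, second_term)
    return confidence(modified, second_term, corpus) > threshold

# Hypothetical glossary entry and corpus, invented for this example.
DEFINITION = "the death of heart muscle caused by a blocked coronary artery"
CORPUS = [
    "A heart attack is the death of heart muscle caused by a blocked coronary artery.",
    "Stroke is an interruption of blood flow to the brain.",
]

print(are_synonyms("myocardial infarction", DEFINITION, "heart attack", CORPUS, 0.8))
print(are_synonyms("myocardial infarction", DEFINITION, "stroke", CORPUS, 0.8))
```

A production system would presumably replace the token-overlap score with the richer matching the abstract names (passage term, lexical, and syntactic matching); the sketch only shows how the create, substitute, search, and threshold steps fit together.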

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for creating subject matter synonyms from definitions of terms defined in a subject matter glossary, comprising: extracting from a subject matter glossary definitions of terms defined in the subject matter glossary, wherein the subject matter glossary comprises a list of terms associated with a particular subject matter, and wherein each of the terms is accompanied in the subject matter glossary by one or more definitions; determining a plurality of confidence scores by applying natural language processing to the definitions extracted from the subject matter glossary, wherein each confidence score represents a likelihood that two terms defined in the subject matter glossary are synonyms, and wherein determining a plurality of confidence scores by applying natural language processing to the definitions extracted from the subject matter glossary includes calculating a total confidence score for two terms defined in the subject matter glossary based on a plurality of the confidence scores determined for those two terms; building a subject matter thesaurus based on the confidence scores, wherein the subject matter thesaurus comprises a list of synonyms associated with that particular subject matter organized in a list of synonym pairs, wherein each of the synonym pairs includes two terms defined in the subject matter glossary that are synonyms, and wherein building a subject matter thesaurus based on the confidence scores includes marking those two terms as synonyms if the total confidence score calculated for those two terms is greater than a first threshold; determining the first threshold using machine learning.

2. The computer-implemented method as recited in claim 1, wherein the natural language processing includes at least one of passage term matching, lexical matching, and syntactic matching.

3. The computer-implemented method as recited in claim 1, wherein determining the first threshold using machine learning comprises: accessing a list of known synonym pairs of terms defined in a subject matter glossary, wherein each of the synonym pairs includes a first term defined in the subject matter glossary and a second term defined in the subject matter glossary; extracting from the subject matter glossary, for each of the known synonym pairs, definitions of terms defined in the subject matter glossary, wherein extracting from the subject matter glossary, for each of the known synonym pairs, definitions of terms defined in the subject matter glossary includes: extracting from the subject matter glossary, for each of the known synonym pairs, a definition of the first term; and extracting from the subject matter glossary, for each of the known synonym pairs, a definition of the second term; determining a plurality of confidence scores, for each of the known synonym pairs, by applying natural language processing to the definitions extracted from the subject matter glossary, wherein each confidence score represents a likelihood that two terms defined in the subject matter glossary are synonyms, wherein determining a plurality of confidence scores, for each of the known synonym pairs, by applying natural language processing to the definitions extracted from the subject matter glossary includes: creating a first statement, for each of the known synonym pairs, wherein the first statement contains the first term and is based on the definition of the first term extracted from the subject matter glossary; creating a modified first statement, for each of the known synonym pairs, by substituting in the first statement the second term in lieu of the first term; searching in a corpus for evidence, for each of the known synonym pairs, that the modified first statement is accurate; determining a first confidence score, for each of the known synonym pairs, based on evidence in the corpus that the modified first statement is accurate; creating a second statement, for each of the known synonym pairs, wherein the second statement contains the second term and is based on the definition of the second term extracted from the subject matter glossary; creating a modified second statement, for each of the known synonym pairs, by substituting in the second statement the first term in lieu of the second term; searching in the corpus for evidence, for each of the known synonym pairs, that the modified second statement is accurate; determining a second confidence score, for each of the known synonym pairs, based on evidence in the corpus that the modified second statement is accurate; calculating a total confidence score (TCS), for each of the known synonym pairs, based on the first confidence score and the second confidence score; and calculating the first threshold based on the TCS of one or more of the known synonym pairs.

4. The computer-implemented method as recited in claim 3, wherein calculating the first threshold based on the TCS of one or more of the known synonym pairs includes calculating the first threshold as the lowest TCS of the known synonym pairs.

5. A computer program product for creating subject matter synonyms from definitions of terms defined in a subject matter glossary, the computer program product comprising a non-transitory computer readable storage medium having program code embodied therewith, the program code executable by a processor to perform a method comprising: extracting from a subject matter glossary definitions of terms defined in the subject matter glossary, wherein the subject matter glossary comprises a list of terms associated with a particular subject matter, and wherein each of the terms is accompanied in the subject matter glossary by one or more definitions; determining a plurality of confidence scores by applying natural language processing to the definitions extracted from the subject matter glossary, wherein each confidence score represents a likelihood that two terms defined in the subject matter glossary are synonyms, and wherein determining a plurality of confidence scores by applying natural language processing to the definitions extracted from the subject matter glossary includes calculating a total confidence score for two terms defined in the subject matter glossary based on a plurality of the confidence scores determined for those two terms; building a subject matter thesaurus based on the confidence scores, wherein the subject matter thesaurus comprises a list of synonyms associated with that particular subject matter organized in a list of synonym pairs, wherein each of the synonym pairs includes two terms defined in the subject matter glossary that are synonyms, and wherein building a subject matter thesaurus based on the confidence scores includes marking those two terms as synonyms if the total confidence score calculated for those two terms is greater than a first threshold; determining the first threshold using machine learning.

6. The computer program product as recited in claim 5, wherein the natural language processing includes at least one of passage term matching, lexical matching, and syntactic matching.

7. The computer program product as recited in claim 5, wherein determining the first threshold using machine learning comprises: accessing a list of known synonym pairs of terms defined in a subject matter glossary, wherein each of the synonym pairs includes a first term defined in the subject matter glossary and a second term defined in the subject matter glossary; extracting from the subject matter glossary, for each of the known synonym pairs, definitions of terms defined in the subject matter glossary, wherein extracting from the subject matter glossary, for each of the known synonym pairs, definitions of terms defined in the subject matter glossary includes: ex
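The threshold calibration of claims 3 and 4 and the thesaurus-building step of claim 1 can be illustrated with a short sketch. The claims say only that the total confidence score (TCS) is "based on" the first and second confidence scores, so the arithmetic mean used here is an assumption, as are the function names and the toy score table; the directional scorer is left as a pluggable callable.

```python
from itertools import combinations
from typing import Callable, Iterable, List, Tuple

# Directional scorer: confidence that second_term can replace first_term.
Scorer = Callable[[str, str], float]

def total_confidence(first: str, second: str, score: Scorer) -> float:
    """TCS from the two directional scores (mean is an assumed combination)."""
    return (score(first, second) + score(second, first)) / 2.0

def learn_threshold(known_pairs: Iterable[Tuple[str, str]], score: Scorer) -> float:
    """Claim 4: the first threshold is the lowest TCS among known synonym pairs."""
    return min(total_confidence(a, b, score) for a, b in known_pairs)

def build_thesaurus(terms: Iterable[str], score: Scorer,
                    threshold: float) -> List[Tuple[str, str]]:
    """Claim 1: mark two glossary terms as synonyms if their TCS exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(terms, 2)
        if total_confidence(a, b, score) > threshold
    ]

# Hypothetical directional scores, invented for this example.
TABLE = {
    ("car", "automobile"): 1.0, ("automobile", "car"): 0.75,
    ("sofa", "couch"): 0.75, ("couch", "sofa"): 0.75,
    ("truck", "lorry"): 1.0, ("lorry", "truck"): 0.875,
}

def toy_score(first: str, second: str) -> float:
    return TABLE.get((first, second), 0.0)

KNOWN_PAIRS = [("car", "automobile"), ("sofa", "couch")]
THRESHOLD = learn_threshold(KNOWN_PAIRS, toy_score)
GLOSSARY_TERMS = ["automobile", "car", "lorry", "sofa", "truck"]
THESAURUS = build_thesaurus(GLOSSARY_TERMS, toy_score, THRESHOLD)
print(THRESHOLD, THESAURUS)
```

One consequence of reading claims 1 and 4 literally: because the comparison is strictly "greater than", a known pair whose TCS equals the lowest value sits at the threshold rather than above it, so it would not itself be marked when the thesaurus is built.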

Assignees

IBM

Inventors

Classifications

  • Indexing; Web crawling techniques · CPC title

  • Dictionaries · CPC title

  • Search customisation based on user profiles and personalisation · CPC title

  • G06F40/247 · Primary

    Thesauruses; Synonyms · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9665568B2 cover?
Methods, apparatus and systems, including computer program products, for creating subject matter synonyms from definitions extracted from a subject matter glossary. Confidence scores, each representing a likelihood that two terms defined in the subject matter glossary are synonyms, are determined by applying natural language processing (e.g., passage term matching, lexical matching, and syntact…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/247. Mapped technology areas include Physics.
When was this patent published?
Publication date May 30, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).