Generation of natural language processing model for an information domain

US9740685B2 · US · B2

Patent metadata
Publication number: US-9740685-B2
Application number: US-201213712460-A
Country: US
Kind code: B2
Filing date: Dec 12, 2012
Priority date: Dec 12, 2011
Publication date: Aug 22, 2017
Grant date: Aug 22, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments relate to a method, apparatus and program product for generating a natural language processing model for an information domain. The method derives a skeleton of a natural language lexicon from a source model and uses it to form a dictionary. It also applies a set of syntactical rules defining concepts and relationships to the dictionary and expands the skeleton of the natural language lexicon based on a plurality of reference documents from the information domain. Using the expanded skeleton of the natural language lexicon, it also provides a natural language processing model for the information domain.
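
As a reading aid, the first and last steps of the abstract (deriving a skeleton from a source model, then expanding it from reference documents) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the `SOURCE_MODEL` contents, the function names, the stop-word list, and the co-occurrence window and threshold are all hypothetical, and the syntactic-rule and intersection steps are not modeled.

```python
import re
from collections import Counter

# Hypothetical source model: objects with a parent class and attributes.
SOURCE_MODEL = {
    "Patient": {"parent": "Person", "attributes": ["blood pressure"]},
    "Physician": {"parent": "Person", "attributes": ["specialty"]},
}

# Assumed stop words, ignored when collecting candidate terms.
STOP_WORDS = {"a", "an", "the", "is", "was", "by", "each"}


def build_skeleton(source_model):
    """Map every term in the source model to an empty set of related words."""
    skeleton = {}
    for obj, info in source_model.items():
        for term in (obj, info["parent"], *info["attributes"]):
            skeleton.setdefault(term.lower(), set())
    return skeleton


def expand_skeleton(skeleton, documents, window=2, min_count=2):
    """Attach words seen near a skeleton term at least min_count times."""
    expanded = {}
    for term in skeleton:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        counts = Counter()
        for doc in documents:
            text = doc.lower()
            for match in pattern.finditer(text):
                # Up to `window` words on each side of the matched term.
                before = re.findall(r"[a-z]+", text[:match.start()])[-window:]
                after = re.findall(r"[a-z]+", text[match.end():])[:window]
                counts.update(w for w in before + after if w not in STOP_WORDS)
        expanded[term] = {w for w, c in counts.items() if c >= min_count}
    return expanded


docs = [
    "The patient was admitted by the physician.",
    "Each patient record lists blood pressure.",
    "A patient record is updated daily.",
]
model = expand_skeleton(build_skeleton(SOURCE_MODEL), docs)
print(model["patient"])
```

In this toy corpus only "record" appears near "patient" often enough to pass the assumed threshold, which shows how reference documents can add domain vocabulary that the source model itself does not contain.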

First claim

Opening claim text (preview).

What is claimed is:

1. A method for generating a natural language model for a specific information domain, comprising: building a skeleton of a natural language lexicon for the specific information domain from a source model of the specific information domain, the skeleton comprising terms found in the source model, the source model comprising classification hierarchies for the terms, the terms including objects and attributes; using the skeleton of the natural language lexicon to form a dictionary; applying a set of syntactical rules defining concepts and relationships to the dictionary; expanding the skeleton of the natural language lexicon based on a plurality of reference documents from the specific information domain, wherein expanding the skeleton comprises: clustering and scoring terms for concepts and relationships, and an intersection component for intersecting the syntactic rules and the clustered concepts and relationships; and using the expanded skeleton of the natural language lexicon, provide a natural language processing model for the specific information domain, the natural language processing model utilized by a user in the specific information domain to analyze documents in the specific information domain.

2. The method as claimed in claim 1, wherein building a skeleton of a natural language lexicon uses preferred terms in the specific information domain.

3. The method as claimed in claim 1, wherein applying a set of syntactical rules includes taking subject, predicate, object and varying order for coverage.

4. The method as claimed in claim 1, wherein expanding the skeleton further comprises: selecting a preferred term as a concept or relationship; carrying out a keyword search for the preferred term in reference documents; and providing an ordered set of potential for the preferred term.

5. The method as claimed in claim 1, further comprising: determining local n-grams; measuring one or more metrics of the n-grams; and scoring the n-grams.

6. The method as claimed in claim 1, further comprising: deriving further syntactic rules based on the reference documents.

7. The method as claimed in claim 6, further comprising: using verb structures from linguistic classes of verbs to drive the intersection applied to the clustered terms.

8. The method as claimed in claim 1, wherein expanding the skeleton starts at a starting concept or relationship and moves out through neighboring concepts or relationship links in the source model.

9. The method as claimed in claim 1, wherein expanding the skeleton dynamically changes an iterating strategy based on results comprising: determining a divergence of best terms for a concept or relationship using a score threshold.

10. The method as claimed in claim 1, wherein building a skeleton of a natural language lexicon is based on more than one source model.

11. The method as claimed in claim 1, wherein building, a skeleton of a natural language lexicon leverages open data to populate the skeleton initially wherein the ontology classes of the source model are matched to classes of open data.

12. A computer program product for a natural language processing model for a specific information domain, the computer program product comprising a non-transitory computer readable storage medium having computer readable program code embodied therewith, said computer readable program code executable by a computer, comprising: building a skeleton of a natural language lexicon for the specific information domain from a source model of the specific information domain, the skeleton comprising terms found in the source model, the source model comprising classification hierarchies for the terms, the terms including objects and attributes; using the skeleton of the natural language lexicon to from a dictionary; applying a set of syntactical rules defining concepts and relationships to the dictionary; expanding the skeleton of the natural lexicon based on a plurality of reference documents from the specific information domain, wherein expanding the skeleton comprises: clustering and scoring terms for concepts and relationships, and an intersection component for intersecting the syntactic rules and the clustered concepts and relationships; and using the expanded skeleton of the natural language lexicon, provide a natural language processing model for the specific information domain, the natural language processing model utilized by a user in the specific information domain to analyze documents in the specific information domain.

13. A system for generating a natural language processing model for specific information domain, comprising: a processor configured for building a skeleton of a natural language lexicon for the specific information domain from a source model of the specific information domain and for using the skeleton of the natural language lexicon to form a dictionary, the skeleton comprising terms found in the source model, the source model comprising classification hierarchies for the terms, the terms including objects and attributes; a syntactic rule component for applying a set of syntactical rules defining concepts and relationships to the dictionary; and an expanding component, including an intersection component, for expanding the skeleton of the natural language lexicon based on reference documents and using the syntactic rule component to provide a natural language processing model, the natural language processing model utilized by a user in the specific information domain to analyze documents in the specific information domain, wherein expanding the skeleton comprises: clustering and scoring terms for concepts and relationships, and intersecting the syntactic rules and the clustered concepts and relationships.

14. The system as claimed in claim 13, wherein the clustering and scoring terms for concepts and relationships comprises the syntactic rule applying a set of syntactical rules includes taking subject, predicate, object and varying order for coverage.

15. The system as claimed in claim 13, wherein the expanding component for expanding the skeleton includes components includes a concept/relationship clustering component for: selecting a preferred term as a concept or relationship; carrying out a keyword search for the preferred term in reference documents from the specific information domain; and providing an ordered set of potential terms for the preferred term.

16. The system as claimed in claim 13, wherein the concept/relationship clustering component is for: determining local n-grams; measuring one or more metrics of the n-grams; and scoring the n-grams.

17. The system as claimed in claim 13, wherein the expanding component for expanding the skeleton of the natural language lexicon includes: a syntactic rule generating component for deriving further syntactic rules based on the reference documents from the specific information domain.

18. The system as claimed in claim 13, wherein the expanding component for expanding the skeleton starts at a starting concept or relationship and moves out through neighboring concepts or relationship links in the source model, iterating outwards; and refines the expanded terms of concepts and relationships by augmenting scores.

19. The system as claimed in claim 13, wherein the expanding component for expanding the skeleton dynamically changes an iterating strategy based on results.
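
The keyword-search and n-gram steps named in the dependent claims (select a preferred term, search the reference documents, determine local n-grams, score them, return an ordered set of candidates) can be illustrated with a minimal Python sketch. The claims do not specify which metrics are measured, so the score used here (frequency multiplied by n-gram length) is purely an assumption, as are the function names and the restriction to single-word preferred terms.

```python
import re
from collections import Counter


def local_ngrams(tokens, index, n_max=3):
    """Yield every n-gram (n = 2..n_max) that contains the token at `index`."""
    for n in range(2, n_max + 1):
        for start in range(index - n + 1, index + 1):
            if start >= 0 and start + n <= len(tokens):
                yield tuple(tokens[start:start + n])


def rank_candidates(preferred_term, documents, n_max=3):
    """Keyword-search the documents and return local n-grams ordered by score."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        for i, token in enumerate(tokens):
            if token == preferred_term:
                counts.update(local_ngrams(tokens, i, n_max))
    # Assumed metric: raw frequency weighted by n-gram length.
    scored = {" ".join(gram): count * len(gram) for gram, count in counts.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


reference_docs = [
    "The account holder signs.",
    "An account holder may close the account.",
]
ranking = rank_candidates("account", reference_docs)
print(ranking[:3])
```

A real system would likely replace the assumed score with measures such as relative frequency or association strength, and would filter out function words before ranking.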

Assignees

IBM

Inventors

Classifications

  • G06F40/169 · Primary

    Annotation, e.g. comment data or footnotes · CPC title

  • G06F40/40 · Primary

    Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • Dictionaries · CPC title

  • Tagging; Marking up (details of markup languages G06F40/143); Designating a block; Setting of attributes (style sheets, e.g. eXtensible Stylesheet Language Transformation [XSLT], G06F40/154) · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9740685B2 cover?
Embodiments relate to a method, apparatus and program product for generating a natural language processing model for an information domain. The method derives a skeleton of a natural language lexicon from a source model and uses it to form a dictionary. It also applies a set of syntactical rules defining concepts and relationships to the dictionary and expands the skeleton of the natural la…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/169. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 22, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).