Preserving conceptual distance within unstructured documents

US9424298B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9424298-B2
Application number  US-201414508200-A
Country             US
Kind code           B2
Filing date         Oct 7, 2014
Priority date       Oct 7, 2014
Publication date    Aug 23, 2016
Grant date          Aug 23, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, system and computer-usable medium are disclosed for preserving conceptual distance within unstructured documents by characterizing conceptual relationships. Natural language processing is applied to content in a plurality of documents to identify topics and subjects. Analytic analysis is then applied to the identified topics and subjects to identify concepts. The content in each of the plurality of documents is partitioned into a first structured hierarchy, preserving at least one structure in each document inherent in the each document. Access is then provided to the content through a first index based upon utilizing the first structured hierarchy and through a second index utilizing a second structured hierarchy. The conceptual relationship criteria are based upon a directed graph with weights based upon a similarity and a distance based upon concepts.
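The last sentence of the abstract describes a directed graph whose weights depend on both a similarity and a concept-based distance. The following is a minimal sketch of one way such a graph could be built, not the patented method. Everything specific in it is an assumption made for illustration: the `PARENT` topic hierarchy, the use of Jaccard overlap as the similarity, the path length through the lowest common ancestor as the distance, the `sim / (1 + dist)` weighting, and the choice to direct edges from earlier to later sections.

```python
from collections import defaultdict

# Hypothetical topic hierarchy (child -> parent); not taken from the patent.
PARENT = {
    "cardiology": "medicine",
    "oncology": "medicine",
    "medicine": "science",
    "physics": "science",
}


def path_to_root(topic):
    """Return the chain of topics from `topic` up to the hierarchy root."""
    path = [topic]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def hierarchy_distance(a, b):
    """Count the edges between two topics via their lowest common ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {topic: i for i, topic in enumerate(path_b)}
    for i, topic in enumerate(path_a):
        if topic in depth_in_b:
            return i + depth_in_b[topic]
    raise ValueError("topics share no ancestor")


def jaccard(x, y):
    """Term-overlap similarity in [0, 1]."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 0.0


def build_graph(sections):
    """Build a directed graph over (section_id, topic, terms) tuples.

    Edges run from each section to every later section in document order.
    The weight grows with term similarity and shrinks with topic distance.
    """
    graph = defaultdict(dict)
    for i, (id_a, topic_a, terms_a) in enumerate(sections):
        for id_b, topic_b, terms_b in sections[i + 1:]:
            sim = jaccard(terms_a, terms_b)
            dist = hierarchy_distance(topic_a, topic_b)
            graph[id_a][id_b] = sim / (1 + dist)
    return dict(graph)
```

Calling `build_graph` on a list of section tuples returns a nested dictionary of edge weights. Sections that share terms and sit close together in the topic hierarchy receive the heaviest edges.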

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising:
    a processor;
    a data bus coupled to the processor; and
    a computer-usable medium embodying computer program code, the computer-usable medium being coupled to the data bus, the computer program code used for characterizing content of documents by conceptual relationships and comprising instructions executable by the processor and configured for:
    applying natural language processing (NLP) to content in a plurality of documents to identify topics and subjects;
    applying analytic analysis to the topics and subjects to identify a conceptual relationship of the content in the plurality of documents;
    partitioning the content in each of the plurality of documents into a first structured hierarchy, preserving at least one structure in each document inherent in the each document; and
    providing access to content through a first index based upon utilizing the first structured hierarchy and through a second index utilizing a second structured hierarchy; and
    wherein the content is characterized by optimizing a vector space model representation of the documents, the optimization performed by a system capable of answering questions, where:
    the content from the plurality of documents is ingested by the system;
    natural language processing is applied to the content in the plurality of documents to identify terms, topics, subjects and concepts;
    the content is partitioned according to a semantic parse distance to identify a context for partitioned content;
    the content and context is represented, by the system, utilizing a vector space model;
    entries in the vector space model are eliminated based on a difference criteria; and
    an iterative genetic algorithm is applied to optimize features of the vector space model.

2. The system of claim 1, wherein: the conceptual relationship is based upon a directed graph with weights based upon a similarity and a distance based upon concepts.

3. The system of claim 1, wherein: the distance is based upon a topic hierarchy.

4. The system of claim 1, wherein: a ground truth is an optimized feature.

5. The system of claim 4, wherein: the genetic algorithm determines which features are used during the ingesting and has weighting based on semantic distance.

6. A non-transitory, computer-readable storage medium embodying computer program code, the computer program code comprising computer executable instructions configured for:
    applying natural language processing (NLP) to content in a plurality of documents to identify topics and subjects;
    applying analytic analysis to the topics and subjects to identify a conceptual relationship of the content in the plurality of documents;
    partitioning the content in each of the plurality of documents into a first structured hierarchy, preserving at least one structure in each document inherent in the each document; and
    providing access to content through a first index based upon utilizing the first structured hierarchy and through a second index utilizing a second structured hierarchy; and
    wherein the content is characterized by optimizing a vector space model representation of the documents, the optimization performed by a system capable of answering questions, where:
    the content from the plurality of documents is ingested by the system;
    natural language processing is applied to the content in the plurality of documents to identify terms, topics, subjects and concepts;
    the content is partitioned according to a semantic parse distance to identify a context for partitioned content;
    the content and context is represented, by the system, utilizing a vector space model;
    entries in the vector space model are eliminated based on a difference criteria; and
    an iterative genetic algorithm is applied to optimize features of the vector space model.

7. The non-transitory, computer-readable storage medium of claim 6, wherein: the conceptual relationship is based upon a directed graph with weights based upon a similarity and a distance based upon concepts.

8. The non-transitory, computer-readable storage medium of claim 6, wherein: the distance is based upon a topic hierarchy.

9. The non-transitory, computer-readable storage medium of claim 6, wherein: a ground truth is an optimized feature.

10. The non-transitory, computer-readable storage medium of claim 9, wherein: the genetic algorithm determines which features are used during the ingesting and has weighting based on semantic distance.

11. The non-transitory, computer-readable storage medium of claim 6, wherein the computer executable instructions are deployable to a client system from a server system at a remote location.

12. The non-transitory, computer-readable storage medium of claim 6, wherein the computer executable instructions are provided by a service provider to a user on an on-demand basis.
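The independent claims recite three steps that lend themselves to a small illustration: representing content in a vector space model, eliminating entries by a difference criterion, and applying an iterative genetic algorithm to optimize features. The sketch below is one possible reading under stated assumptions, not the claimed implementation. The claims do not define the difference criterion or the fitness function, so the count-spread threshold in `prune`, the nearest-neighbour fitness against ground-truth labels, and all function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter


def vectorize(texts):
    """Build term-count vectors over a shared, sorted vocabulary."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def prune(vocab, vectors, min_spread=1):
    """Assumed difference criterion: drop terms whose counts vary by
    less than `min_spread` across documents."""
    keep = [j for j in range(len(vocab))
            if max(v[j] for v in vectors) - min(v[j] for v in vectors) >= min_spread]
    return [vocab[j] for j in keep], [[v[j] for j in keep] for v in vectors]


def cosine(a, b, mask):
    """Cosine similarity restricted to the features switched on in `mask`."""
    dot = sum(x * y for x, y, m in zip(a, b, mask) if m)
    na = math.sqrt(sum(x * x for x, m in zip(a, mask) if m))
    nb = math.sqrt(sum(y * y for y, m in zip(b, mask) if m))
    return dot / (na * nb) if na and nb else 0.0


def fitness(mask, vectors, labels):
    """Count documents whose nearest neighbour shares their ground-truth
    label, with a small penalty per selected feature."""
    hits = 0
    for i, v in enumerate(vectors):
        best = max((j for j in range(len(vectors)) if j != i),
                   key=lambda j: cosine(v, vectors[j], mask))
        hits += labels[best] == labels[i]
    return hits - 0.01 * sum(mask)


def evolve(vectors, labels, generations=30, pop_size=20, seed=0):
    """Search feature masks with selection, one-point crossover and mutation."""
    rng = random.Random(seed)
    n = len(vectors[0])  # assumes at least two features remain after pruning
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, vectors, labels), reverse=True)
        parents = pop[: pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] ^= 1  # flip one bit as mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, vectors, labels))
```

A typical call sequence would be `vectorize`, then `prune`, then `evolve` on the pruned vectors with one ground-truth label per document. Keeping the fitter half of each generation means the best mask found so far is never lost between iterations.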

Assignees

IBM

Inventors

Classifications

  • Named entity recognition

  • Document management systems

  • Selection or weighting of terms for indexing

  • Semantic analysis

  • Management thereof

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9424298B2 cover?
A method, system and computer-usable medium are disclosed for preserving conceptual distance within unstructured documents by characterizing conceptual relationships. Natural language processing is applied to content in a plurality of documents to identify topics and subjects. Analytic analysis is then applied to the identified topics and subjects to identify concepts. The content in each of th…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/2272. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 23, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).