Automated collective term and phrase index

US9864741B2 · US · B2

Patent metadata
Field | Value
Publication number | US-9864741-B2
Application number | US-201514862621-A
Country | US
Kind code | B2
Filing date | Sep 23, 2015
Priority date | Sep 23, 2014
Publication date | Jan 9, 2018
Grant date | Jan 9, 2018

Abstract

Official abstract text for this publication.

Knowledge automation techniques may include selecting a knowledge element from a knowledge corpus of an enterprise for extraction of n-grams, and deriving a term vector comprising terms in the knowledge element. Based at least on a frequency of occurrence of each term in the knowledge element, key terms are identified in the term vector. Thereafter, the identified key terms are used to extract one or more n-grams from the knowledge element. Each of the extracted n-grams is scored as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise, and based on the scoring, one or more of the n-grams is added to a collective term and phrase index.
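
The steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the stopword list, the `top_k` and `min_score` parameters, and the choice of bigrams that start at a key term are all assumptions made for the example.

```python
from collections import Counter
import re

# Hypothetical stopword list; the abstract does not specify any filtering.
STOPWORDS = {"the", "for", "before", "a", "of"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords (simplifying assumption)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def term_vector(tokens):
    """Map each term to its frequency of occurrence in one knowledge element."""
    return Counter(tokens)

def key_terms(vector, top_k=2):
    """Treat the top_k most frequent terms as key terms (top_k is hypothetical)."""
    return [term for term, _ in vector.most_common(top_k)]

def extract_ngrams(tokens, keys, n=2):
    """Collect the n-grams that begin at an occurrence of a key term."""
    keys = set(keys)
    return {
        " ".join(tokens[i:i + n])
        for i, tok in enumerate(tokens)
        if tok in keys and i + n <= len(tokens)
    }

def score_ngrams(grams, corpus_tokens):
    """Score each n-gram by its frequency of occurrence across the whole corpus."""
    scores = {}
    for gram in grams:
        parts = gram.split()
        n = len(parts)
        scores[gram] = sum(
            1
            for doc in corpus_tokens
            for i in range(len(doc) - n + 1)
            if doc[i:i + n] == parts
        )
    return scores

def build_index(corpus, min_score=2, top_k=2):
    """Run the pipeline per knowledge element; keep n-grams scoring >= min_score."""
    docs = [tokenize(text) for text in corpus]
    index = set()
    for tokens in docs:
        keys = key_terms(term_vector(tokens), top_k)
        scores = score_ngrams(extract_ngrams(tokens, keys), docs)
        index.update(g for g, s in scores.items() if s >= min_score)
    return index
```

Called on a small list of document strings, `build_index` returns the set of n-grams that recur across the corpus. A real system would likely use richer tokenization and a more careful scoring function than raw counts.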

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining, by a computer system, data files from a knowledge corpus of an enterprise; identifying, by the computer system, key terms within the data files; determining, by the computer system, for each identified key term, a frequency of occurrence and location within the data files; generating, by the computer system, knowledge units from the data files based on the determined frequencies of occurrence and the determined locations of the key terms in the data files; selecting, by computing system, a knowledge unit from the generated knowledge units for extraction of n-grams; deriving, by the computing system, a term vector for the knowledge unit based at least on the determined frequencies of occurrence and the determined locations of the key terms in the knowledge unit; identifying, by the computing system, the key terms in the term vector based at least on the frequency of occurrence of each key term in the knowledge unit; extracting, by the computing system, n-grams using the key terms in the term vector; scoring, by the computing system, each of the extracted n-grams as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise; and adding, by the computing system, one or more of the extracted n-grams to an index based on the scoring.

2. The method of claim 1, wherein the deriving the term vector includes modeling the key terms identified in the knowledge unit as the term vector, and wherein a value for each key term in the term vector is calculated as a function of at least the frequency of occurrence of the key term and the position of each occurrence of the key term in the knowledge unit.

3. The method of claim 2, wherein the deriving the term vector further includes performing natural language processing on the key terms in the knowledge unit, and filtering the key terms in the knowledge unit based on the natural language processing.

4. The method of claim 1, wherein the extracting the one or more n-grams using the identified key terms includes: identifying one or more terms adjacent to each key term in the knowledge unit; performing natural language processing on the one or more terms adjacent to each key term and the key terms; calculating a probability of the one or more terms adjacent to each key term in the knowledge unit as preceding or following the key term based on a function of the natural language processing; when the probability of the one or more terms being adjacent to the key term is greater than minimum threshold probability, extracting an n-gram comprising the one or more terms and the key term; and when the probability of the one or more terms being adjacent to the key term is less than the minimum threshold probability, extracting an n-gram comprising only the key term.

5. The method of claim 4, wherein the calculating the probability of the one or more terms adjacent to each key term as preceding or following the key term in the knowledge unit as preceding or following the key term is based on the function of the natural language processing and a frequency of occurrence of the one or more terms adjacent to each key term.

6. The method of claim 1, wherein the scoring each of the extracted n-grams is a function of the frequency of occurrence of the n-gram, a recency of the n-gram, and a commonality of the n-gram across the knowledge corpus of the enterprise.

7. The method of claim 1, wherein the adding the one or more of the extracted n-grams to the index includes: determining a total number of n-grams extracted for the knowledge unit; determining a top percentage of the n-grams; and adding the top percentage of the n-grams to the index.

8. The method of claim 1, wherein the adding the one or more of the extracted n-grams to the index includes: setting a minimum threshold score; determining whether the score for each of the extracted n-grams is above the minimum threshold score; and when the score for an n-gram is above the minimum threshold score, adding the n-gram to the index.

9. The method of claim 1, wherein the index is a corporate dictionary comprising a set of n-grams that identify each knowledge unit within the knowledge corpus of the enterprise, and the set of n-grams comprises the one or more of the extracted n-grams added to the index.

10. A non-transitory computer-readable storage memory storing a plurality of instructions executable by one or more processors, the plurality of instructions comprising: instructions that cause the one or more processors to obtain data files from a knowledge corpus of an enterprise; instructions that cause the one or more processors to identify key terms within the data files; instructions that cause the one or more processors to determine, for each identified key term, a frequency of occurrence and location within the data files; instructions that cause the one or more processors to generate knowledge units from the data files based on the determined frequencies of occurrence and the determined locations of the key terms in the data files; instructions that cause the one or more processors to select a knowledge unit from the generated knowledge units for extraction of n-grams; instructions that cause the one or more processors to derive a term vector comprising terms in the knowledge unit based at least on the determined frequencies of occurrence and the determined locations of the key terms in the knowledge unit; instructions that cause the one or more processors to identify the key terms in the term vector based at least on the frequency of occurrence of each key term in the knowledge unit; instructions that cause the one or more processors to calculate a probability of one or more terms adjacent to each key term in the knowledge unit as preceding or following the key term based on a function of natural language processing; instructions that cause the one or more processors to extract an n-gram comprising the one or more terms and the key term when the probability of the one or more terms being adjacent to the key term is greater than a minimum threshold probability; instructions that cause the one or more processors to extract an n-gram comprising only the key term when the probability of the one or more terms being adjacent to the key term is less than the minimum threshold probability; instructions that cause the one or more processors to score each of the extracted n-grams as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise; and instructions that cause the one or more processors to add one or more of the extracted n-grams to an index based on the scoring.

11. The non-transitory computer-readable storage memory of claim 10, wherein the plurality of instructions further comprise: instructions that cause the one or more processors to model the key terms identified in the knowledge unit as the term vector; and wherein a value for each key term in the term vector is calculated as a function of at least the frequency of occurrence of the term and the position of each occurrence of the key term in the knowledge unit.

12. The non-transitory computer-readable storage memory of claim 11, wherein the plurality of instructions further comprise: instructions that cause the one or more processors to perform natural language processing on the key terms in the knowledge unit; and instructions that cause the one or more processors to filter the key terms in the knowledge unit based on the natural language processing.

13. The non-transitory computer-readable storage m
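
Several dependent claims refine individual steps: a position-aware term vector (claim 2), an adjacency probability threshold for n-gram extraction (claim 4), and two ways of choosing which n-grams enter the index (claims 7 and 8). The sketch below illustrates one possible reading of each. The claims do not fix any formulas, so the linear position decay, the bigram-count estimate that stands in for the natural language processing function, and all function and parameter names are assumptions.

```python
from collections import Counter
import math

def position_weighted_vector(tokens):
    """Claim 2 flavour: a term's value depends on frequency and position.

    Assumption: each occurrence adds a weight that decays linearly from 2.0
    at the start of the unit towards 1.0 at the end.
    """
    vector = Counter()
    total = len(tokens)
    for pos, term in enumerate(tokens):
        vector[term] += 2.0 - pos / total
    return vector

def extract_with_adjacency(tokens, key_terms, min_prob=0.5):
    """Claim 4 flavour: attach a following term only if it is a likely neighbour.

    Assumption: P(next | key) is estimated from bigram counts within the unit,
    standing in for the unspecified natural language processing function.
    """
    key_counts = Counter(t for t in tokens if t in key_terms)
    pair_counts = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if a in key_terms
    )
    ngrams = set()
    for key, count in key_counts.items():
        likely = [
            b for (a, b), c in pair_counts.items()
            if a == key and c / count > min_prob
        ]
        if likely:
            # Probability above the threshold: n-gram of key term plus neighbour.
            ngrams.update(f"{key} {b}" for b in likely)
        else:
            # Otherwise fall back to an n-gram comprising only the key term.
            ngrams.add(key)
    return ngrams

def select_top_percentage(scores, percentage=0.25):
    """Claim 7 flavour: keep the top fraction of the scored n-grams."""
    keep = max(1, math.ceil(len(scores) * percentage))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

def select_above_threshold(scores, min_score):
    """Claim 8 flavour: keep n-grams whose score is above the minimum."""
    return [gram for gram, score in scores.items() if score > min_score]
```

The two selection helpers take a dictionary mapping each n-gram to its score. Claim 6 names frequency, recency, and commonality as scoring inputs but does not say how they combine, so no scoring formula is sketched here.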

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9864741B2 cover?
Knowledge automation techniques may include selecting a knowledge element from a knowledge corpus of an enterprise for extraction of n-grams, and deriving a term vector comprising terms in the knowledge element. Based at least on a frequency of occurrence of each term in the knowledge element, key terms are identified in the term vector. Thereafter, the identified key terms are used to extract …
Who is the assignee on this patent?
Prysm Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/242. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 9, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).