Keyword extraction for relationship maps

US9965460B1 · US · B1

Patent metadata
Publication number: US-9965460-B1
Application number: US-201615394436-A
Country: US
Kind code: B1
Filing date: Dec 29, 2016
Priority date: Dec 29, 2016
Publication date: May 8, 2018
Grant date: May 8, 2018

Abstract

Official abstract text for this publication.

Disclosed herein is a method of extracting keywords from a document based on certain statistical, positional and natural language data, as well as relationship maps between the keywords. Under this method, document data are processed to obtain an NLP result for each sentence of the document, and based on the NLP result, words in the document are filtered and grouped into terms; a frequency analysis as well as a co-occurrence analysis are performed over the terms to output one or more keywords representing the document.
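The frequency analysis named in the abstract is given a concrete form in claims 4 and 5: Score=Frequency×[1+(NER Weight+POS Weight)/2], with equal-frequency subset terms removed. The following is a minimal sketch of that step, not the patentee's implementation. The term list and weight values are hypothetical (the claims give no numbers), and "subset" is read here as "the term's words are a proper subset of another term's words", which is an assumption.

```python
from collections import Counter


def frequency_score(frequency, ner_weight, pos_weight):
    """Claim 4: Score = Frequency x [1 + (NER Weight + POS Weight) / 2]."""
    return frequency * (1 + (ner_weight + pos_weight) / 2)


def drop_subset_terms(frequencies):
    """Claim 5, read here as: among terms with equal frequency, drop a term
    whose words are a proper subset of another term's words (an assumption)."""
    kept = {}
    for term, count in frequencies.items():
        words = set(term.split())
        shadowed = any(
            count == other_count and words < set(other.split())
            for other, other_count in frequencies.items()
        )
        if not shadowed:
            kept[term] = count
    return kept


# Hypothetical term list, as it might look after filtering and grouping.
terms = ["keyword extraction", "keyword", "keyword extraction", "keyword", "graph"]
frequencies = drop_subset_terms(Counter(terms))

# Hypothetical (NER weight, POS weight) pairs; the claims give no values.
weights = {"keyword extraction": (1.0, 0.5), "graph": (0.0, 0.5)}
scores = {
    term: frequency_score(count, *weights[term])
    for term, count in frequencies.items()
}
print(scores)
```

In this toy input, "keyword" has the same frequency as "keyword extraction" and is contained in it, so only the longer phrase survives to be scored.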

First claim

Opening claim text (preview).

What is claimed is:

1. A method for keyword extraction, comprising: receiving and processing document data to obtain paragraphs and sentences, each sentence comprising one or more words; for each sentence of the document data, obtaining an NLP (Natural Language Processing) result relating to a sentence, wherein the NLP result comprises NER (Named Entity Recognition), POS (Part of Speech) and dependency of each word of the sentence, removing at least one word from the sentence based on the NLP result, grouping the remaining words of the sentence into one or more terms, wherein each term is either a word or a multi-word phrase, and storing the terms on a list; applying a frequency analysis to the terms on the list; for each paragraph of the document data, identifying terms whose occurrences in the paragraph meet a first threshold, forming a cluster for the paragraph based on a co-occurrence analysis of the identified terms, and evaluating a connectivity of the formed cluster; excluding at least one paragraph based on the connectivity of the formed clusters; selecting terms that occur in at least two remaining paragraphs; calculating a score for each of the selected terms; and based on the scores of the selected terms, outputting one or more top terms as keywords representing the document data.

2. The method of claim 1, further comprising permuting the multi-word phrases before storing the terms on the list.

3. The method of claim 1, wherein the frequency analysis comprises: for each term on the list, counting its occurrences in the document data to obtain a frequency of the term, and calculating a frequency score for the term based on its frequency.

4. The method of claim 3, wherein the frequency score is calculated according to the following equation: Score=Frequency×[1+(NER Weight+POS Weight)/2].

5. The method of claim 3, wherein the frequency analysis further comprises: if two or more terms have the same frequency and at least one term is a subset of another term, removing the at least one term that is a subset of another term.

6. The method of claim 3, wherein the frequency analysis further comprises: deleting at least one term whose frequency score does not meet a second threshold.

7. The method of claim 1, wherein the co-occurrence analysis comprises: using the identified terms as nodes of a graph; and forming a connection between any two nodes if their corresponding terms co-occur in a sentence of the paragraph, where the cluster is formed to comprise the nodes and connections between the nodes.

8. The method of claim 7, wherein the connectivity of the formed cluster is a result of dividing the actual connections in the formed cluster against a maximum of connections for the nodes in the formed cluster.

9. The method of claim 1, wherein the at least one paragraph is excluded if the connectivity of its corresponding cluster is below a third threshold.

10. The method of claim 3, further comprising: calculating a co-occurrence score for each selected term; and multiplying the frequency score and the co-occurrence score of the selected term to obtain the score of the selected term.

11. A computer program product comprising a computer usable non-transitory medium having a computer readable program code embedded therein for controlling a data processing apparatus, the computer readable program code configured to cause the data processing apparatus to execute a process for handwriting recognition, the process comprising: receiving and processing document data to obtain paragraphs and sentences, each sentence comprising one or more words; for each sentence of the document data, obtaining an NLP (Natural Language Processing) result relating to a sentence, wherein the NLP result comprises NER (Named Entity Recognition), POS (Part of Speech) and dependency of each word of the sentence, removing at least one word from the sentence based on the NLP result, grouping the remaining words of the sentence into one or more terms, wherein each term is either a word or a multi-word phrase, and storing the terms on a list; applying a frequency analysis to the terms on the list; for each paragraph of the document data, identifying terms whose occurrences in the paragraph meet a first threshold, forming a cluster for the paragraph based on a co-occurrence analysis of the identified terms, and evaluating a connectivity of the formed cluster; excluding at least one paragraph based on the connectivity of the formed clusters; selecting terms that occur in at least two remaining paragraphs; calculating a score for each of the selected terms; and based on the scores of the selected terms, outputting one or more top terms as keywords representing the document data.

12. The computer program product of claim 11, where the process further comprises permuting the multi-word phrases before storing the terms on the list.

13. The computer program product of claim 11, wherein the frequency analysis comprises: for each term on the list, counting its occurrences in the document data to obtain a frequency of the term, and calculating a frequency score for the term based on its frequency.

14. The computer program product of claim 13, wherein the frequency score is calculated according to the following equation: Score=Frequency×[1+(NER Weight+POS Weight)/2].

15. The computer program product of claim 13, wherein the frequency analysis further comprises: if two or more terms have the same frequency and at least one term is a subset of another term, removing the at least one term that is a subset of another term.

16. The computer program product of claim 13, wherein the frequency analysis further comprises: deleting at least one term whose frequency score does not meet a second threshold.

17. The computer program product of claim 11, wherein the co-occurrence analysis comprises: using the identified terms as nodes of a graph; and forming a connection between any two nodes if their corresponding terms co-occur in a sentence of the paragraph, where the cluster is formed to comprise the nodes and connections between the nodes.

18. The computer program product of claim 17, wherein the connectivity of the formed cluster is a result of dividing the actual connections in the formed cluster against a maximum of connections for the nodes in the formed cluster.

19. The computer program product of claim 11, wherein the at least one paragraph is excluded if the connectivity of its corresponding cluster is below a third threshold.

20. The computer program product of claim 13, wherein the process further comprises: calculating a co-occurrence score for each selected term; and multiplying the frequency score and the co-occurrence score of the selected term to obtain the score of the selected term.
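The co-occurrence steps in claims 7 to 9, together with the "at least two remaining paragraphs" selection of claim 1, can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the "maximum of connections" is taken to be n(n-1)/2 for n nodes of an undirected graph, a cluster with fewer than two nodes is given connectivity 0.0, and the input format, the threshold value, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cluster(sentences, terms):
    """Claim 7: terms are nodes; two nodes are connected if their terms
    co-occur in a sentence. `sentences` is a list of sets of terms."""
    nodes = set(terms)
    edges = set()
    for sentence_terms in sentences:
        present = sorted(nodes & set(sentence_terms))
        # Sorting gives each undirected edge one canonical (a, b) form.
        edges.update(combinations(present, 2))
    return nodes, edges


def connectivity(nodes, edges):
    """Claim 8: actual connections divided by the maximum possible.
    Assumes the maximum is n(n-1)/2; returns 0.0 for fewer than 2 nodes."""
    n = len(nodes)
    max_edges = n * (n - 1) // 2
    return len(edges) / max_edges if max_edges else 0.0


def select_terms(paragraphs, threshold):
    """Claim 9 plus the selection step of claim 1: drop paragraphs whose
    connectivity is below `threshold`, then keep terms that occur in at
    least two of the remaining paragraphs."""
    remaining = []
    for sentences, terms in paragraphs:
        nodes, edges = build_cluster(sentences, terms)
        if connectivity(nodes, edges) >= threshold:
            remaining.append(nodes)
    counts = Counter(term for nodes in remaining for term in nodes)
    return {term for term, count in counts.items() if count >= 2}


# Hypothetical paragraphs: (sentences as term sets, identified terms).
paragraphs = [
    ([{"a", "b"}, {"b", "c"}], ["a", "b", "c"]),   # 2 of 3 possible edges
    ([{"a", "b"}], ["a", "b"]),                    # 1 of 1 possible edges
    ([{"a"}, {"c"}, {"d"}], ["a", "c", "d"]),      # 0 of 3 possible edges
]
selected = select_terms(paragraphs, threshold=0.5)
print(sorted(selected))
```

With the hypothetical threshold of 0.5, the third paragraph is excluded because none of its terms co-occur, and only terms shared by the two remaining paragraphs are selected.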

Assignees

Konica Minolta Laboratory USA Inc

Inventors

Classifications

  • Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • G06F40/295 (Primary)

    Named entity recognition · CPC title

  • Parsing · CPC title

  • Recognition of textual entities · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9965460B1 cover?
Disclosed herein is a method of extracting keywords from a document based on certain statistical, positional and natural language data, as well as relationship maps between the keywords. Under this method, document data are processed to obtain an NLP result for each sentence of the document, and based on the NLP result, words in the document are filtered and grouped into terms; a frequency anal…
Who is the assignee on this patent?
Konica Minolta Laboratory USA Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/295. Mapped technology areas include Physics.
When was this patent published?
Publication date May 8, 2018 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).