Who is the assignee on this patent?

Konica Minolta Laboratory USA Inc

What technology area does this patent fall under?

Primary CPC classification G06F40/295. Mapped technology areas include Physics.

When was this patent published?

Publication date May 8, 2018 (B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Keyword extraction for relationship maps

US9965460B1 · US · B1

Patent metadata
Field	Value
Publication number	US-9965460-B1
Application number	US-201615394436-A
Country	US
Kind code	B1
Filing date	Dec 29, 2016
Priority date	Dec 29, 2016
Publication date	May 8, 2018
Grant date	May 8, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein is a method of extracting keywords from a document based on certain statistical, positional and natural language data, as well as relationship maps between the keywords. Under this method, document data are processed to obtain an NLP result for each sentence of the document, and based on the NLP result, words in the document are filtered and grouped into terms; a frequency analysis as well as a co-occurrence analysis are performed over the terms to output one or more keywords representing the document.

Claims

Full claim text (claims 1 to 20).

What is claimed is:

1. A method for keyword extraction, comprising: receiving and processing document data to obtain paragraphs and sentences, each sentence comprising one or more words; for each sentence of the document data, obtaining an NLP (Natural Language Processing) result relating to a sentence, wherein the NLP result comprises NER (Named Entity Recognition), POS (Part of Speech) and dependency of each word of the sentence, removing at least one word from the sentence based on the NLP result, grouping the remaining words of the sentence into one or more terms, wherein each term is either a word or a multi-word phrase, and storing the terms on a list; applying a frequency analysis to the terms on the list; for each paragraph of the document data, identifying terms whose occurrences in the paragraph meet a first threshold, forming a cluster for the paragraph based on a co-occurrence analysis of the identified terms, and evaluating a connectivity of the formed cluster; excluding at least one paragraph based on the connectivity of the formed clusters; selecting terms that occur in at least two remaining paragraphs; calculating a score for each of the selected terms; and based on the scores of the selected terms, outputting one or more top terms as keywords representing the document data.

2. The method of claim 1, further comprising permuting the multi-word phrases before storing the terms on the list.

3. The method of claim 1, wherein the frequency analysis comprises: for each term on the list, counting its occurrences in the document data to obtain a frequency of the term, and calculating a frequency score for the term based on its frequency.

4. The method of claim 3, wherein the frequency score is calculated according to the following equation: Score = Frequency × [1 + (NER Weight + POS Weight)/2].

5. The method of claim 3, wherein the frequency analysis further comprises: if two or more terms have the same frequency and at least one term is a subset of another term, removing the at least one term that is a subset of another term.

6. The method of claim 3, wherein the frequency analysis further comprises: deleting at least one term whose frequency score does not meet a second threshold.

7. The method of claim 1, wherein the co-occurrence analysis comprises: using the identified terms as nodes of a graph; and forming a connection between any two nodes if their corresponding terms co-occur in a sentence of the paragraph, where the cluster is formed to comprise the nodes and connections between the nodes.

8. The method of claim 7, wherein the connectivity of the formed cluster is a result of dividing the actual connections in the formed cluster against a maximum of connections for the nodes in the formed cluster.

9. The method of claim 1, wherein the at least one paragraph is excluded if the connectivity of its corresponding cluster is below a third threshold.

10. The method of claim 3, further comprising: calculating a co-occurrence score for each selected term; and multiplying the frequency score and the co-occurrence score of the selected term to obtain the score of the selected term.

11. A computer program product comprising a computer usable non-transitory medium having a computer readable program code embedded therein for controlling a data processing apparatus, the computer readable program code configured to cause the data processing apparatus to execute a process for handwriting recognition, the process comprising: receiving and processing document data to obtain paragraphs and sentences, each sentence comprising one or more words; for each sentence of the document data, obtaining an NLP (Natural Language Processing) result relating to a sentence, wherein the NLP result comprises NER (Named Entity Recognition), POS (Part of Speech) and dependency of each word of the sentence, removing at least one word from the sentence based on the NLP result, grouping the remaining words of the sentence into one or more terms, wherein each term is either a word or a multi-word phrase, and storing the terms on a list; applying a frequency analysis to the terms on the list; for each paragraph of the document data, identifying terms whose occurrences in the paragraph meet a first threshold, forming a cluster for the paragraph based on a co-occurrence analysis of the identified terms, and evaluating a connectivity of the formed cluster; excluding at least one paragraph based on the connectivity of the formed clusters; selecting terms that occur in at least two remaining paragraphs; calculating a score for each of the selected terms; and based on the scores of the selected terms, outputting one or more top terms as keywords representing the document data.

12. The computer program product of claim 11, where the process further comprises permuting the multi-word phrases before storing the terms on the list.

13. The computer program product of claim 11, wherein the frequency analysis comprises: for each term on the list, counting its occurrences in the document data to obtain a frequency of the term, and calculating a frequency score for the term based on its frequency.

14. The computer program product of claim 13, wherein the frequency score is calculated according to the following equation: Score = Frequency × [1 + (NER Weight + POS Weight)/2].

15. The computer program product of claim 13, wherein the frequency analysis further comprises: if two or more terms have the same frequency and at least one term is a subset of another term, removing the at least one term that is a subset of another term.

16. The computer program product of claim 13, wherein the frequency analysis further comprises: deleting at least one term whose frequency score does not meet a second threshold.

17. The computer program product of claim 11, wherein the co-occurrence analysis comprises: using the identified terms as nodes of a graph; and forming a connection between any two nodes if their corresponding terms co-occur in a sentence of the paragraph, where the cluster is formed to comprise the nodes and connections between the nodes.

18. The computer program product of claim 17, wherein the connectivity of the formed cluster is a result of dividing the actual connections in the formed cluster against a maximum of connections for the nodes in the formed cluster.

19. The computer program product of claim 11, wherein the at least one paragraph is excluded if the connectivity of its corresponding cluster is below a third threshold.

20. The computer program product of claim 13, wherein the process further comprises: calculating a co-occurrence score for each selected term; and multiplying the frequency score and the co-occurrence score of the selected term to obtain the score of the selected term.
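
Illustrative sketch (not part of the patent text). The frequency score of claim 4 and the cluster connectivity of claims 7 and 8 may be easier to follow as code. The Python below is a minimal sketch under stated assumptions, not the patentee's implementation: the function names, the example weights, and the example terms are hypothetical, and the "maximum of connections" is assumed to be n(n-1)/2 for n nodes of an undirected graph, which the claims do not spell out.

```python
from itertools import combinations


def frequency_score(frequency, ner_weight, pos_weight):
    """Claim 4: Score = Frequency x [1 + (NER Weight + POS Weight) / 2]."""
    return frequency * (1 + (ner_weight + pos_weight) / 2)


def cluster_connectivity(terms, sentences):
    """Claims 7-8: actual connections divided by the maximum possible.

    terms: identified terms of one paragraph, used as graph nodes.
    sentences: one set of terms per sentence of that paragraph.
    """
    nodes = sorted(set(terms))
    # Assumption: the maximum is n(n-1)/2, as in an undirected simple graph.
    max_edges = len(nodes) * (len(nodes) - 1) // 2
    if max_edges == 0:
        return 0.0  # assumption: fewer than two nodes counts as unconnected
    # Two nodes are connected if their terms co-occur in at least one sentence.
    edges = {
        (a, b)
        for a, b in combinations(nodes, 2)
        if any(a in s and b in s for s in sentences)
    }
    return len(edges) / max_edges


# Hypothetical inputs, chosen only to exercise the two functions.
print(frequency_score(4, ner_weight=0.5, pos_weight=0.25))  # 5.5

paragraph_terms = ["keyword", "extraction", "graph", "document"]
paragraph_sentences = [
    {"keyword", "extraction"},
    {"extraction", "graph"},
    {"document"},
]
# 2 of 6 possible connections, so roughly 0.33.
print(cluster_connectivity(paragraph_terms, paragraph_sentences))
```

Under claim 9, a paragraph whose connectivity falls below the third threshold would be excluded; the threshold value itself is not given on this page.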

Assignees

Konica Minolta Laboratory USA Inc

Classifications

Code	CPC title
G06F40/211	Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
G06F40/284	Lexical analysis, e.g. tokenisation or collocates
G06F40/295 (Primary)	Named entity recognition
G06F40/205	Parsing
G06F40/279	Recognition of textual entities

Patent family

Related publications grouped by family.

View patent family 62045111
