Unsupervised information extraction dictionary creation

US2018121444A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2018121444-A1
Application number	US-201615342361-A
Country	US
Kind code	A1
Filing date	Nov 3, 2016
Priority date	Nov 3, 2016
Publication date	May 3, 2018
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data handling system enables the unsupervised creation of an information extraction dictionary by expanding upon a word or phrase included within an expansion query. Prior to receiving the expansion query, the data handling system performs an unsupervised learning of an information corpus which includes text to assign a corpus vector to each word and phrase of the text. After the expansion query, the data handling system compares the expansion query to the corpus vectors. The data handling system ranks the corpus vectors by similarity to the expansion query and provides a ranked list of words or phrases associated with the ranked corpus vectors. The ranked list may be subsequently utilized as the information extraction dictionary.
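
Illustrative sketch (not part of the official abstract). The ranking step described above can be pictured as a similarity search over word vectors that were computed ahead of time. The abstract does not name an embedding method or a similarity metric, so the hand-written three-dimensional vectors, the use of cosine similarity, and the names `CORPUS_VECTORS`, `cosine`, and `expand` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical corpus vectors. In the described system these would be
# learned from the information corpus before any expansion query arrives.
CORPUS_VECTORS = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "acetaminophen": [0.7, 0.3, 0.0],
    "hammer": [0.0, 0.1, 0.9],
    "wrench": [0.3, 0.0, 0.8],
}


def expand(query_vector, corpus_vectors, top_k=3):
    """Rank corpus terms by similarity to the query and keep the top_k."""
    scored = [(term, cosine(query_vector, vec))
              for term, vec in corpus_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# A made-up query vector pointing toward the "medication" direction.
ranked = expand([1.0, 0.0, 0.0], CORPUS_VECTORS)
print([term for term, _ in ranked])
# prints ['aspirin', 'ibuprofen', 'acetaminophen']
```

The returned list plays the role of the ranked list of words or phrases that could then serve as an information extraction dictionary.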

First claim

Opening claim text (preview).

What is claimed is:

1. A method of forming a list of expanded words or phrases generated by an unsupervised learning of text within an information corpus, the method comprising: prior to a host device receiving an expansion query from a client device: assigning, with the host device, a corpus vector to each word and phrase within an information corpus stored within a data source local to the host system; forming, with the host device, a plurality of clusters, each cluster comprising a plurality of similar corpus vectors; and indicating, with the host device, a particular corpus vector within each cluster as being a representative corpus vector of the cluster in which the particular corpus vector resides; and subsequent to the host device receiving the expansion query from the client device: assigning, with the host device, a query vector to the expansion query; determining, with the host device, a most similar cluster to the query vector; ranking, with the host device, the corpus vectors within the most similar cluster based upon the similarity to the query vector; and forming, with the host device, a list of words or phrases associated with each of the ranked corpus vectors within the most similar cluster.

2. The method of claim 1, wherein determining the most similar cluster to the query vector comprises: comparing the query vector to each of the representative corpus vectors.

3. The method of claim 1, wherein forming the plurality of clusters comprises: iteratively conducting a pairwise comparison between a particular corpus vector and each other corpus vector assigned to each word and phrase within the information corpus; and iteratively grouping similar corpus vectors together.

4. The method of claim 1, wherein each of the representative corpus vectors are centroid vectors of the cluster in which the particular corpus vector resides.

5. The method of claim 1, wherein each of the representative corpus vectors are median vectors of the cluster in which the particular corpus vector resides.

6. The method of claim 1, wherein each of the representative corpus vectors are mode vectors of the cluster in which the particular corpus vector resides.

7. The method of claim 1, further comprising subsequent to the host device receiving the expansion query from the client device, sending, with the host device, the formed list of words or phrases to the client device.

8. The method of claim 7, further comprising subsequent to the host device sending the formed list of words or phrases to the client device, receiving, with the host device, an indication of a user-selection a word or phrase that accurately expands the expansion query.

9. A computer program product for forming a list of expanded words or phrases generated by an unsupervised learning of text within an information corpus, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions readable by a host device to cause the host device to: prior to the host device receiving an expansion query from a client device: assign a corpus vector to each word and phrase within an information corpus stored within a data source local to the host system; form a plurality of clusters, each cluster comprising a plurality of similar corpus vectors; and indicate a particular corpus vector within each cluster as being a representative corpus vector of the cluster in which the particular corpus vector resides; and subsequent to the host device receiving the expansion query from the client device: assign a query vector to the expansion query; determine a most similar cluster to the query vector; rank the corpus vectors within the most similar cluster based upon the similarity to the query vector; and form a list of words or phrases associated with each of the ranked corpus vectors within the most similar cluster.

10. The computer program product of claim 9, wherein the program instructions that when executed by the host device cause the host device to determine the most similar cluster to the query vector, further causes the host device to compare the query vector to each of the representative corpus vectors.

11. The computer program product of claim 9, wherein the program instructions that when executed by the host device cause the host device to form the plurality of clusters, further causes the host device to: iteratively conduct a pairwise comparison between a particular corpus vector and each other corpus vector assigned to each word and phrase within the information corpus; and iteratively group similar corpus vectors together.

12. The computer program product of claim 9, wherein each of the representative corpus vectors are centroid vectors of the cluster in which the particular corpus vector resides.

13. The computer program product of claim 9, wherein each of the representative corpus vectors are median vectors of the cluster in which the particular corpus vector resides.

14. The computer program product of claim 9, wherein each of the representative corpus vectors are mode vectors of the cluster in which the particular corpus vector resides.

15. The computer program product of claim 9, wherein the program instructions that when executed by the host device further cause the host device to, subsequent to the host device receiving the expansion query from the client device, send the formed list of words or phrases to the client device.

16. The computer program product of claim 15, wherein the program instructions that when executed by the host device further cause the host device to: subsequent to the host device sending the formed list of words or phrases to the client device, receive an indication of a user-selection a word or phrase that accurately expands the expansion query.

17. A computer for forming a list of expanded words or phrases generated by an unsupervised learning of text within an information corpus, the computer comprising: a processor; an information corpus stored within a data source communicatively coupled to the processor; and a memory communicatively coupled to the processor, wherein the memory is encoded with instructions, wherein the instructions when executed by the processor cause the processor to: prior to the processor receiving an expansion query from a client device: assign a corpus vector to each word and phrase within the information corpus; form a plurality of clusters, each cluster comprising a plurality of similar corpus vectors; and indicate a particular corpus vector within each cluster as being a representative corpus vector of the cluster in which the particular corpus vector resides; and subsequent to the processor receiving the expansion query from the client device: assign a query vector to the expansion query; determine a most similar cluster to the query vector; rank the corpus vectors within the most similar cluster based upon the similarity to the query vector; and form a list of words or phrases associated with each of the ranked corpus vectors within the most similar cluster.

18. The computer of claim 17, wherein the instructions that when executed by the processor to cause the processor to form the plurality of clusters further cause the processor to: iteratively conduct a pairwise comparison between a particular corpus vector and each other corpus vector assigned to each word and phrase within the information corpus; and iteratively group similar corp
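
Illustrative sketch (not part of the claim text). The two-phase flow recited in the independent claims, with clusters and representative vectors prepared before the query and cluster selection plus ranking performed after it, might look roughly like the code below. Several choices here are assumptions rather than anything the claims specify: cosine similarity as the metric, a greedy threshold grouping as a simplified stand-in for the iterative pairwise grouping, the centroid as the representative vector (the dependent claims also mention median and mode vectors), and all names and toy vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_clusters(corpus_vectors, threshold=0.8):
    """Before any query: group similar vectors and pick a representative.

    Greedy simplification: a term joins the first cluster whose seed
    vector it resembles closely enough, otherwise it starts a new cluster.
    """
    clusters = []
    for term, vec in corpus_vectors.items():
        for cluster in clusters:
            if cosine(vec, cluster["seed"]) >= threshold:
                cluster["members"][term] = vec
                break
        else:
            clusters.append({"seed": vec, "members": {term: vec}})
    for cluster in clusters:
        cluster["representative"] = centroid(list(cluster["members"].values()))
    return clusters


def expand(query_vector, clusters):
    """After the query: pick the most similar cluster, rank its members."""
    best = max(clusters,
               key=lambda c: cosine(query_vector, c["representative"]))
    ranked = sorted(best["members"].items(),
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [term for term, _ in ranked]


# Hypothetical corpus vectors standing in for learned embeddings.
CORPUS_VECTORS = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "acetaminophen": [0.7, 0.3, 0.0],
    "hammer": [0.0, 0.1, 0.9],
    "wrench": [0.3, 0.0, 0.8],
}

clusters = build_clusters(CORPUS_VECTORS)
print(expand([0.0, 0.0, 1.0], clusters))
# prints ['hammer', 'wrench']
```

Comparing the query only against one representative per cluster keeps the query-time work proportional to the number of clusters plus the size of the chosen cluster, rather than to the whole corpus.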

Assignees

IBM

Inventors

Classifications

CPC code	CPC title
G06F16/3322	using system suggestions (G06F16/3325 takes precedence)
G06F16/36 (Primary)	Creation of semantic tools, e.g. ontology or thesauri
G06F40/242 (Primary)	Dictionaries
G06F16/3338	Query expansion
G06F16/3344	using natural language analysis

Patent family

Related publications grouped by family.

Patent family ID: 62019955

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2018121444A1 cover?
A data handling system enables the unsupervised creation of an information extraction dictionary by expanding upon a word or phrase included within an expansion query. Prior to receiving the expansion query, the data handling system performs an unsupervised learning of an information corpus which includes text to assign a corpus vector to each word and phrase of the text. After the expansion qu…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/36. Mapped technology areas include Physics.
When was this patent published?
Publication date May 3, 2018 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Quantization-based fast inner product search

Computer-readable recording medium, retrieval device, and retrieval method

Similarity calculation system, method of calculating similarity, and program