What technology area does this patent fall under?

Primary CPC classification G06F40/30. Mapped technology areas include Physics.

When was this patent published?

Publication date: Nov 24, 2022 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Semantic text comparison using artificial intelligence identified source document topics

US2022374598A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2022374598-A1
Application number	US-202117303098-A
Country	US
Kind code	A1
Filing date	May 20, 2021
Priority date	May 20, 2021
Publication date	Nov 24, 2022
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer assigns a similarity value to a comparison document. The computer receives, reference document contextual word embeddings in first set of topic clusters, each with a representative embedding. The computer receives comparison document contextual word embeddings. The computer determines, using a trained neural network model classifier, for each comparison document contextual word embedding, topic correspondence values relative to the representative embeddings of said first set of clusters. The computer generates a second set of clusters by assigning comparison document embeddings to best matching one of the first clusters, according to the topic correspondence values. The computer determines a second set of representative embeddings and uses a comparison method, to determine a cluster similarity value for second set clusters compared to first set representative embeddings. The computer determines document similarity values based, at least in part, on at least one of cluster similarity values.
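The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: plain cosine similarity stands in for the trained neural network classifier, the representative embedding is taken to be the mean, each comparison cluster is compared only with its matching reference cluster, and the document score is an unweighted average. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_embedding(vectors):
    """Component-wise mean, used here as the representative embedding."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def document_similarity(ref_representatives, comparison_embeddings,
                        correspondence=cosine):
    """Return (document similarity, per-cluster similarities).

    `correspondence` is a stand-in for the trained classifier: any callable
    that scores how well an embedding matches a cluster representative.
    """
    # Assign each comparison embedding to its best-matching reference cluster.
    assigned = {i: [] for i in range(len(ref_representatives))}
    for emb in comparison_embeddings:
        scores = [correspondence(emb, rep) for rep in ref_representatives]
        best = max(range(len(scores)), key=scores.__getitem__)
        assigned[best].append(emb)

    # Compute a representative for each second-set cluster and compare it
    # with the representative of the reference cluster it was assigned to.
    cluster_sims = {}
    for i, members in assigned.items():
        if not members:
            # Assumption: a reference topic with no matching text scores zero.
            cluster_sims[i] = 0.0
        else:
            cluster_sims[i] = cosine(mean_embedding(members),
                                     ref_representatives[i])

    # Assumption: the document score is the unweighted mean over clusters.
    doc_sim = sum(cluster_sims.values()) / len(cluster_sims)
    return doc_sim, cluster_sims


if __name__ == "__main__":
    reference = [[1.0, 0.0], [0.0, 1.0]]              # two topic representatives
    comparison = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]  # toy 2-D embeddings
    print(document_similarity(reference, comparison))
```

Passing a different `correspondence` callable is how a trained classifier could be substituted for the cosine stand-in in this sketch.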

First claim

Opening claim text (preview).

What is claimed is:
1. A computer implemented method to assign a similarity value to a comparison document, comprising: receiving by said computer, for a reference document, contextual word embeddings arranged into a first set of clusters, each representing a topic and characterized by a representative embedding; receiving by said computer, for at least one comparison document, a set of contextual word embeddings; determining, by said computer, using a neural network model classifier trained to predict whether embeddings are in a same cluster, for each comparison document contextual word embedding, topic correspondence values relative to the representative embeddings of said first set of clusters; generating, by said computer, a second set of clusters by assigning each comparison document contextual word embedding to a best matching one of the first set of clusters, according to the topic correspondence values; determining by said computer, a representative embedding for each of the second set of clusters; using, by said computer, a comparison method, to determine for each centroid of the second set of clusters compared to each centroid of the first set of clusters, a cluster similarity value; and determining, by said computer for each comparison document, a document similarity value based, at least in part, on at least one of the cluster similarity values.
2. The method of claim 1, wherein said document similarity value rating is further based upon multiplying said cluster similarity values by at least one modification value selected from a group consisting of an associated cluster weight value, and a cluster relative weight proportion.
3. The method of claim 1, wherein: a set of training data for said neural network includes pairs of intra-cluster and inter-cluster embeddings from a set of clusters respectively, wherein each pair is labeled as intra-cluster or inter-cluster; and wherein said neural network generates a classifier based on said training data.
4. The method of claim 1, wherein said received contextual word embeddings are generated by passing text from the reference document through a deep neural network.
5. The method of claim 1, wherein said clusters are established by applying a clustering algorithm to said embeddings.
6. The method of claim 1, wherein said representative embeddings are based, at least in part, on a computed embedding value selected from a group consisting of mean embedding, medoid embedding, and concatenated embedding of a plurality of embeddings in said cluster.
7. The method of claim 1, wherein, responsive to a sufficiency rating exceeding a sufficiency threshold, determining the comparison document to be an acceptable representation of the reference document.
8. The method of claim 1, wherein said at least one comparison document is a plurality of comparison documents; wherein said computer generates a cluster similarity value for each of said plurality of comparison documents; and wherein said computer generates a ranked list of said plurality of comparison documents ordered, at least in part by document similarity value.
9. A system to assign a similarity value to a comparison document, which comprises: a computer system comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to: receive for a reference document, contextual word embeddings arranged into a first set of clusters, each representing a topic and characterized by a representative embedding; receive for at least one comparison document, a set of contextual word embeddings; determine using a neural network model classifier trained to predict whether embeddings are in a same cluster, for each comparison document contextual word embedding, topic correspondence values relative to the representative embeddings of said first set of clusters; generate a second set of clusters by assigning each comparison document contextual word embedding to a best matching one of the first set of clusters, according to the topic correspondence values; determine a representative embedding for each of the second set of clusters; use a comparison method, to determine for each centroid of the second set of clusters compared to each centroid of the first set of clusters, a cluster similarity value; and determine for each comparison document, a document similarity value based, at least in part, on at least one of the cluster similarity values.
10. The system of claim 9, wherein said document similarity value rating is further based upon multiplying said cluster similarity values by at least one modification value selected from a group consisting of an associated cluster weight value, and a cluster relative weight proportion.
11. The system of claim 9, wherein: a set of training data for said neural network includes pairs of intra-cluster and inter-cluster embeddings from a set of the clusters respectively, wherein each pair is labeled as intra-cluster or inter-cluster; and wherein said neural network generates a classifier based on said training data.
12. The system of claim 9, wherein said received contextual word embeddings are generated by passing text from the reference document through a deep neural network.
13. The system of claim 9, wherein said representative embeddings are based, at least in part, on a computed embedding value selected from a group consisting of mean embedding, medoid embedding, and concatenated embedding of a plurality of embeddings in said cluster.
14. The system of claim 9, wherein, the instructions responsive to a sufficiency rating exceeding a sufficiency threshold, further cause the computer to determine the comparison document to be an acceptable representation of the reference document.
15. A computer program product to assign a similarity value to a comparison document, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to: receive, using the computer, for a reference document, contextual word embeddings arranged into a first set of clusters, each representing a topic and characterized by a representative embedding; receive, using the computer, for at least one comparison document, a set of contextual word embeddings; determine, using the computer, using a neural network model classifier trained to predict whether embeddings are in a same cluster, for each comparison document contextual word embedding, topic correspondence values relative to the representative embeddings of said first set of clusters; generate, using the computer, a second set of clusters by assigning each comparison document contextual word embedding to a best matching one of the first set of clusters, according to the topic correspondence values; determine, using the computer, a representative embedding for each of the second set of clusters; use a comparison method to determine for each centroid of the second set of clusters compared to each centroid of the first set of clusters, a cluster similarity value; and determine, using the computer, for each comparison document, a document similarity value based, at least in part, on at least one of the cluster similarity values.
16. The computer program product of claim 15, wherein said document similarity value rating is further based upon multiplying said cluster similarity values by at least one modification value selected from a group c
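The dependent claims on weighting, sufficiency, and ranking (claims 2, 7, and 8) can be illustrated with a short Python sketch. It rests on assumptions the claim text does not settle: the "cluster relative weight proportion" is read as each reference cluster's share of all reference embeddings, the "sufficiency rating" is taken to be the weighted document similarity itself, and the threshold of 0.75 is arbitrary. Function names and the toy data are hypothetical.

```python
def weighted_document_similarity(cluster_sims, cluster_sizes):
    """Weight each cluster similarity by the cluster's share of embeddings.

    Assumption: the relative weight proportion of a cluster is its size
    divided by the total number of reference embeddings.
    """
    total = sum(cluster_sizes)
    if total == 0:
        return 0.0
    return sum(sim * size / total
               for sim, size in zip(cluster_sims, cluster_sizes))


def rank_documents(documents, cluster_sizes, threshold=0.75):
    """Return (name, score, acceptable) tuples, highest score first.

    `documents` maps a document name to its list of cluster similarities.
    Assumption: a document is acceptable when its score exceeds `threshold`.
    """
    ranked = []
    for name, sims in documents.items():
        score = weighted_document_similarity(sims, cluster_sizes)
        ranked.append((name, score, score > threshold))
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


if __name__ == "__main__":
    sizes = [3, 1]  # toy reference clusters: 3 embeddings and 1 embedding
    docs = {"doc_a": [0.9, 0.5], "doc_b": [0.6, 1.0]}
    for name, score, ok in rank_documents(docs, sizes):
        print(f"{name}: {score:.3f} acceptable={ok}")
```

In this toy data the larger cluster dominates the score, so a document that matches the main topic well outranks one that matches only the minor topic.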

Assignees

IBM

Inventors

Classifications

G06F40/30 · Primary
Semantic analysis · CPC title
G06F40/284
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06F40/216
using statistical methods · CPC title
G06N3/08
Learning methods · CPC title
G06F40/279 · Primary
Recognition of textual entities · CPC title

Patent family

Related publications grouped by family.

View patent family 84103905

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2022374598A1 cover?: A computer assigns a similarity value to a comparison document. The computer receives, reference document contextual word embeddings in first set of topic clusters, each with a representative embedding. The computer receives comparison document contextual word embeddings. The computer determines, using a trained neural network model classifier, for each comparison document contextual word embed…
Who is the assignee on this patent?: IBM