Identifying homogenous clusters

US2020004870A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2020004870-A1
Application number	US-201816025936-A
Country	US
Kind code	A1
Filing date	Jul 2, 2018
Priority date	Jul 2, 2018
Publication date	Jan 2, 2020
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Homogeneous clusters are generated from a first plurality of documents for generation of regular expressions. Documents that share similar characteristics are clustered, and for each cluster, features are generated for use by a homogeneity model to determine a homogeneity score for the cluster. Clusters determined to be homogenous are sent to a regular expression generator.
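The feature step in the abstract can be illustrated with a short sketch. It follows the wording of the claims listed on this page: each word's probability is the share of documents in the cluster that contain it at least once, words are binned into a predetermined number of probability groups (10 in the dependent claims), and the features are the percentage of words in each group. Whitespace tokenization, lowercasing, equal-width bins, and the function name `cluster_features` are assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter


def cluster_features(documents, num_groups=10):
    """Return the fraction of distinct words that fall in each probability group.

    Assumptions (not from the patent): words are lowercased whitespace tokens,
    and the groups are equal-width bins over the range (0, 1].
    """
    if not documents:
        raise ValueError("a cluster must contain at least one document")

    # Document frequency: the number of documents in which each word
    # occurs at least once.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))

    total_words = len(doc_freq)
    if total_words == 0:
        return [0.0] * num_groups

    n_docs = len(documents)
    group_counts = [0] * num_groups
    for count in doc_freq.values():
        # Integer arithmetic avoids floating-point edge cases; a word that
        # appears in every document lands in the last group.
        group = min(count * num_groups // n_docs, num_groups - 1)
        group_counts[group] += 1

    return [c / total_words for c in group_counts]


# Hypothetical error messages standing in for one cluster.
docs = [
    "error in module foo",
    "error in module bar",
    "error in module baz",
]
features = cluster_features(docs)
```

In this example three words ("error", "in", "module") occur in every message and three occur in only one, so half of the vocabulary falls in the top group and half in a low group. A cluster whose words mostly sit in the top group is the kind a homogeneity model would be expected to score highly.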

First claim

Opening claim text (preview).

What is claimed is:

1. A computer implemented method for generating homogeneous clusters from a first plurality of documents for generation of regular expressions, the method comprising: clustering the first plurality of documents into a first plurality of clusters, wherein each of the first plurality of documents is included in only one of the clusters, wherein each of the clusters includes one or more of the documents and wherein the documents in each cluster share certain characteristics more closely with each other than the documents of the other clusters in the first plurality of clusters; for each cluster in the first plurality of clusters: generating a word distribution for each document in the cluster; assigning, using the word distribution, each word to a probability group; determining the percentage of words in each probability group; determining features for the cluster using the probability groups; determining a homogeneity score by applying a homogeneity model to the features for the cluster; and send those of the first plurality of clusters for which the homogeneity score exceeds a homogeneity threshold to an automatic regular expression generator.

2. The method of claim 1, wherein the word distribution indicates in which documents in the cluster that each word in each document occurs.

3. The method of claim 1, wherein the percentage indicates what percentage of documents in the cluster each word occurs at least once.

4. The method of claim 1, wherein the probability groups bin each word within a predetermined number of probability groups.

5. The method of claim 4, wherein the predetermined number of probability groups is 10.

6. The method of claim 1, wherein the homogeneity model is a logistic regression model.

7. The method of claim 1, further comprising: assigning documents in a second plurality of documents to the first plurality of clusters using the regular expressions, wherein the first plurality of documents comprises a plurality of error messages and wherein each cluster corresponds to one or more related software bugs and wherein the second plurality of documents corresponds to a more recent plurality of error messages.

8. A non-transitory machine-readable storage medium that provides instructions for generating homogeneous clusters from a first plurality of documents for generation of regular expressions that, if executed by a processor, will cause said processor to perform operations comprising: clustering the first plurality of documents into a first plurality of clusters, wherein each of the first plurality of documents is included in only one of the clusters, wherein each of the clusters includes one or more of the documents and wherein the documents in each cluster share certain characteristics more closely with each other than the documents of the other clusters in the first plurality of clusters; for each cluster in the first plurality of clusters: generating a word distribution for each document in the cluster; assigning, using the word distribution, each word to a probability group; determining the percentage of words in each probability group; determining features for the cluster using the probability groups; determining a homogeneity score by applying a homogeneity model to the features for the cluster; and send those of the first plurality of clusters for which the homogeneity score exceeds a homogeneity threshold to an automatic regular expression generator.

9. The non-transitory machine-readable storage medium of claim 8, wherein the word distribution indicates in which documents in the cluster that each word in each document occurs.

10. The non-transitory machine-readable storage medium of claim 8, wherein the percentage indicates what percentage of documents in the cluster each word occurs at least once.

11. The non-transitory machine-readable storage medium of claim 8, wherein the probability groups bin each word within a predetermined number of probability groups.

12. The non-transitory machine-readable storage medium of claim 11, wherein the predetermined number of probability groups is 10.

13. The non-transitory machine-readable storage medium of claim 8, wherein the homogeneity model is a logistic regression model.

14. The non-transitory machine-readable storage medium of claim 8, the operations further comprising: assigning documents in a second plurality of documents to the first plurality of clusters using the regular expressions, wherein the first plurality of documents comprises a plurality of error messages and wherein each cluster corresponds to one or more related software bugs and wherein the second plurality of documents corresponds to a more recent plurality of error messages.

15. An article of manufacture for generating homogeneous clusters from a first plurality of documents for generation of regular expressions, the article comprising: a processor; and a memory coupled to the processor, the memory storing instructions which, when executed by the processor, cause the article to: cluster the first plurality of documents into a first plurality of clusters, wherein each of the first plurality of documents is included in only one of the clusters, wherein each of the clusters includes one or more of the documents and wherein the documents in each cluster share certain characteristics more closely with each other than the documents of the other clusters in the first plurality of clusters; for each cluster in the first plurality of clusters: generate a word distribution for each document in the cluster; assign, using the word distribution, each word to a probability group; determine the percentage of words in each probability group; determine features for the cluster using the probability groups; determine a homogeneity score by applying a homogeneity model to the features for the cluster; and send those of the first plurality of clusters for which the homogeneity score exceeds a homogeneity threshold to an automatic regular expression generator.

16. The article of claim 15, wherein the word distribution indicates in which documents in the cluster that each word in each document occurs.

17. The article of claim 15, wherein the percentage indicates what percentage of documents in the cluster each word occurs at least once.

18. The article of claim 15, wherein the probability groups bin each word within a predetermined number of probability groups.

19. The article of claim 18, wherein the predetermined number of probability groups is 10.

20. The article of claim 15, the instructions further causing the article to: assigning documents in a second plurality of documents to the first plurality of clusters using the regular expressions, wherein the first plurality of documents comprises a plurality of error messages and wherein each cluster corresponds to one or more related software bugs and wherein the second plurality of documents corresponds to a more recent plurality of error messages.
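The scoring and selection steps recited in the independent claims can be sketched as follows, assuming the logistic regression model named in claims 6 and 13. The input to each score is a cluster's feature vector (the percentage of words in each of 10 probability groups). The weights, bias, threshold, cluster names, and function names below are all hypothetical values chosen for illustration; the patent text on this page does not disclose trained parameters or a specific threshold.

```python
import math


def homogeneity_score(features, weights, bias):
    """Logistic-regression score in (0, 1) for one cluster's feature vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def select_homogeneous(features_by_cluster, weights, bias, threshold):
    """Return the ids of clusters whose score exceeds the threshold."""
    return [
        cluster_id
        for cluster_id, features in features_by_cluster.items()
        if homogeneity_score(features, weights, bias) > threshold
    ]


# Hypothetical parameters: penalize words that appear in few documents,
# reward words that appear in (almost) every document of the cluster.
EXAMPLE_WEIGHTS = [-2.0] * 9 + [3.0]
EXAMPLE_BIAS = 0.0
EXAMPLE_THRESHOLD = 0.5

# Hypothetical feature vectors for two clusters.
features_by_cluster = {
    "tight": [0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
    "loose": [0.7, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

selected = select_homogeneous(
    features_by_cluster, EXAMPLE_WEIGHTS, EXAMPLE_BIAS, EXAMPLE_THRESHOLD
)
```

With these made-up parameters only the "tight" cluster exceeds the threshold. In the claimed method, the selected clusters are the ones passed on to the automatic regular expression generator; in practice the weights would come from a model fitted on labeled clusters rather than being set by hand.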

Assignees

Salesforce Com Inc

Inventors

Dulam Ganeswara Rao

Classifications

G06F40/30
Semantic analysis · CPC title
G06F16/35 · Primary
Clustering; Classification · CPC title
G06F16/285
Clustering or classification · CPC title
G06F40/284
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06N7/01
Probabilistic graphical models, e.g. probabilistic networks · CPC title

Patent family

Related publications grouped by family.

Patent family ID: 69055133

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2020004870A1 cover?: Homogeneous clusters are generated from a first plurality of documents for generation of regular expressions. Documents that share similar characteristics are clustered, and for each cluster, features are generated for use by a homogeneity model to determine a homogeneity score for the cluster. Clusters determined to be homogenous are sent to a regular expression generator.
Who is the assignee on this patent?: Salesforce Com Inc
What technology area does this patent fall under?: Primary CPC classification G06F16/35. Mapped technology areas include Physics.
When was this patent published?: Publication date Jan 2, 2020 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Organizing survey text responses

Systems and methods for classification of software defect reports

Classifying structured documents