Large-scale anomaly detection with relative density-ratio estimation

US10909468B2 · US · B2

Patent metadata
Publication number: US-10909468-B2
Application number: US-201514634515-A
Country: US
Kind code: B2
Filing date: Feb 27, 2015
Priority date: Feb 27, 2015
Publication date: Feb 2, 2021
Grant date: Feb 2, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a set of training data consisting of inliers may be obtained. A supervised classification model may be trained using the set of training data to identify outliers. The supervised classification model may be applied to generate an anomaly score for a data point. It may be determined whether the data point is an outlier based, at least in part, upon the anomaly score.
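The flow in the abstract (train a classifier, score a data point, decide from the score) can be sketched roughly as below. This is a minimal illustration, not the patented implementation, and it rests on several assumptions that do not come from the abstract: the second class is an unlabeled test batch, a hand-written logistic regression stands in for the supervised model (the claims name a gradient boosted decision tree as one option), the data is synthetic, and the 95th-percentile threshold is an arbitrary choice. All function and variable names are hypothetical.

```python
import math
import random


def predict_proba(weights, bias, x):
    """P(label = 1 | x): how much x looks like the inlier training set."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=300, lr=0.1):
    """Fit logistic regression by batch gradient descent.

    Stand-in for the two-class model (assumption; not GBDT).
    """
    dim = len(rows[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def anomaly_score(weights, bias, x):
    """Higher score means x is less like the inlier training set."""
    return 1.0 - predict_proba(weights, bias, x)


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def gauss_point(center, spread):
    return [rng.gauss(c, spread) for c in center]


# Label 1: training set of known inliers (synthetic, illustrative only).
inliers = [gauss_point((0.0, 0.0), 1.0) for _ in range(200)]
# Label 0: unlabeled test batch, mostly inlier-like plus a few far-off points.
test_batch = [gauss_point((0.0, 0.0), 1.0) for _ in range(200)]
test_batch += [gauss_point((5.0, 5.0), 0.5) for _ in range(40)]

rows = inliers + test_batch
labels = [1] * len(inliers) + [0] * len(test_batch)
weights, bias = train_logistic(rows, labels)

# Assumed decision rule: flag anything scoring above the 95th percentile
# of the scores observed on the inlier training set.
inlier_scores = sorted(anomaly_score(weights, bias, x) for x in inliers)
threshold = inlier_scores[int(0.95 * len(inlier_scores))]


def is_outlier(x):
    return anomaly_score(weights, bias, x) > threshold


print(is_outlier([5.0, 5.0]), is_outlier([0.0, 0.0]))
```

A linear model is used here only to keep the sketch dependency-free; a tree ensemble would capture non-linear boundaries that this stand-in cannot.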

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising:
    extracting, for each one of a plurality of electronic mail (email) messages, a set of feature values corresponding to a set of features from the corresponding one of the plurality of email messages, each of the plurality of email messages being received from of a corresponding non-spammer email source;
    updating, by one or more servers, a set of training data comprising data corresponding to email messages received from non-spammer email sources such that the set of training data comprises the set of feature values for each of the plurality of email messages;
    training, by the one or more servers, a supervised two-class classification model using the set of training data to identify outliers;
    selecting, by the supervised two-class classification model, one or more features associated with more effective anomaly detection than one or more other features;
    determining a training data density associated with non-spammer email sources based, at least in part, on the set of training data comprising the data corresponding to the email messages received from non-spammer email sources;
    determining a test data density associated with spammer email sources based, at least in part, on a set of test data comprising data corresponding to email messages received from spammer email sources;
    estimating a relative importance measure, using the supervised two-class classification model, based, at least in part, on the training data density associated with non-spammer email sources and the test data density associated with spammer email sources, wherein the relative importance measure weighs the training data density associated with non-spammer email sources more heavily than the test data density associated with spammer email sources;
    receiving, by the one or more servers, an email message;
    applying, by the one or more servers, the supervised two-class classification model to generate an anomaly score for the email message according to the one or more features selected by the supervised two-class classification model and the relative importance measure; and
    determining, by the one or more servers, whether the email message is received from a spammer email source based, at least in part, on the anomaly score.

2. The method as recited in claim 1, wherein feature values extracted from the plurality of email messages correspond to features comprising a first identity of a first user sending a first email message, a second identity of a second user receiving the first email message, a time that the first email message was sent, a first user feature obtained from a first user profile of the first user, and a second user feature obtained from a second user profile of the second user.

3. The method as recited in claim 1, wherein the supervised two-class classification model comprises a gradient boosted decision tree (GBDT) algorithm.

4. The method as recited in claim 1, wherein the selecting comprises: eliminating, by the supervised two-class classification model, one or more second features associated with anomaly detection that is not effective.

5. The method as recited in claim 1, wherein the set of training data comprises digital voice data.

6. The method as recited in claim 1, comprising: processing the email message according to a result of determining whether the email message is received from a spammer email source.

7. An apparatus, comprising: at least one processor; and a memory storing thereon computer-readable instructions, the computer-readable instructions being configured such that, when executed by the at least one processor, the computer-readable instructions cause the at least one processor to:
    extract, for each one of a plurality of digital images of a corresponding one of a plurality of non-faulty semiconductors, a set of feature values corresponding to a set of features from the corresponding one of the plurality of digital images;
    update a set of training data comprising data corresponding to digital images of non-faulty semiconductors such that the set of training data comprises the set of feature values for each of the plurality of digital images;
    train a supervised two-class classification model using the set of training data to identify faulty semiconductors;
    determine a training data density associated with non-faulty semiconductors based, at least in part, on the set of training data comprising the data corresponding to digital images of non-faulty semiconductors;
    determine a test data density associated with faulty semiconductors based, at least in part, on a set of test data comprising data corresponding to digital images of faulty semiconductors;
    estimate a relative importance measure, using the supervised two-class classification model, based, at least in part, on the training data density associated with non-faulty semiconductors and the test data density associated with faulty semiconductors, wherein the relative importance measure weighs the training data density associated with non-faulty semiconductors more heavily than the test data density associated with faulty semiconductors;
    obtain an image of a semiconductor;
    apply the supervised two-class classification model to generate an anomaly score for the semiconductor using the image according to one or more features selected by the supervised two-class classification model and the relative importance measure; and
    determine whether the semiconductor is a faulty semiconductor based, at least in part, on the anomaly score.

8. The apparatus as recited in claim 7, the relative importance measure being a ratio of the training data density and the test data density.

9. The apparatus as recited in claim 7, wherein the supervised two-class classification model comprises a gradient boosted decision tree (GBDT) algorithm.

10. The apparatus as recited in claim 7, wherein the supervised two-class classification model performs feature selection to select the one or more features upon which to generate anomaly scores for semiconductors.

11. A computer program product, comprising one or more non-transitory computer readable media having computer program instructions stored therein, the computer program instructions being configured such that, when executed by one or more processors, the computer program instructions cause the one or more processors to:
    obtain a set of training data comprising data corresponding to inliers, each of the inliers comprising an electronic mail (email) message received from a non-spammer, an image of a non-faulty semiconductor, or digital voice data of an individual;
    train a supervised two-class classification model using the set of training data to identify outliers, each of the outliers comprising an email message received from a spammer, an image of a faulty semiconductor, or digital voice data that is not of the individual;
    select, by the supervised two-class classification model, one or more features determined to be effective for anomaly detection;
    determining a training data density associated with non-spammer email sources, non-faulty semiconductors, or digital voice data of the individual based, at least in part, on the set of training data comprising the data corresponding to inliers;
    determining a test data density associated with spammer email sources, faulty semiconductors, or digital voice data that that is not of the individual based, at least in part, on a set of test data comprising data corresponding to email messages received from spammer email sources, data corresponding to digital images of faulty semiconductors or data corresponding to digital voice data that is not of the individual;
    estimating a relative importance measure, u
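The claims describe the relative importance measure only in words: it is built from a training data density and a test data density. For orientation, one form commonly used in the density-ratio literature is sketched below. It is an assumption for illustration, not a formula taken from the patent text. The symbols are introduced here: \(p_{\mathrm{tr}}\) and \(p_{\mathrm{te}}\) are the training and test densities, \(\alpha\) is a mixing weight, and \(n_{\mathrm{tr}}\), \(n_{\mathrm{te}}\) are the sample counts of the two sets.

```latex
% Relative density ratio (alpha = 0 gives the plain ratio p_tr / p_te)
r_\alpha(x) = \frac{p_{\mathrm{tr}}(x)}
                   {\alpha\, p_{\mathrm{tr}}(x) + (1-\alpha)\, p_{\mathrm{te}}(x)},
\qquad 0 \le \alpha < 1

% Posterior of an ideal two-class classifier (label 1 = training sample)
P(y = 1 \mid x)
  = \frac{n_{\mathrm{tr}}\, p_{\mathrm{tr}}(x)}
         {n_{\mathrm{tr}}\, p_{\mathrm{tr}}(x) + n_{\mathrm{te}}\, p_{\mathrm{te}}(x)}
  = \alpha\, r_\alpha(x),
\qquad \alpha = \frac{n_{\mathrm{tr}}}{n_{\mathrm{tr}} + n_{\mathrm{te}}}
```

Under these assumptions, the probability output of a two-class classifier trained to separate training samples from test samples is proportional to the relative density ratio, which is one way a supervised model could yield such a measure. For \(\alpha > 0\) the ratio is bounded above by \(1/\alpha\), so it stays finite where the test density is close to zero.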

Assignees

Oath Inc, Verizon Media Inc

Classifications

  • using kernel methods, e.g. support vector machines [SVM] · CPC title

  • G06N20/00 (Primary)

    Machine learning · CPC title

  • involving long-term monitoring or reporting · CPC title

  • G06N20/20 (Primary)

    Ensemble learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Oath Inc, Verizon Media Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 2, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).