Systems and methods for partitioning sets of features for a Bayesian classifier

US10163056B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10163056-B2
Application numberUS-201615162505-A
CountryUS
Kind codeB2
Filing dateMay 23, 2016
Priority dateAug 29, 2014
Publication dateDec 25, 2018
Grant dateDec 25, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The technology disclosed relates to methods for partitioning sets of features for a Bayesian classifier, finding a data partition that makes the classification process faster and more accurate, while discovering and taking into account feature dependence among sets of features in the data set. It relates to computing class entropy scores for a class label across all tuples that share the feature-subset and arranging the tuples in order of non-decreasing entropy scores for the class label, and constructing a data partition that offers the highest improvement in predictive accuracy for the data set. Also disclosed is a method for partitioning a complete set of records of features in a batch computation, computing increasing predictive power; and also relates to starting with singleton partitions, and using an iterative process to construct a data partition that offers the highest improvement in predictive accuracy for the data set.

First claim

Opening claim text (preview).

The invention claimed is: 1. A method of building a partition list of feature subsets having probabilistic interdependence among features in the feature subsets for use with a classifier to detect fraudulent user registrations, the method including: accessing an input set including an input tuple comprising feature-values assigned to features, wherein the features are of user registration data records and wherein the feature-values of the input tuple are values from a user registration data record; identifying, from the input tuple, input subtuples comprising unique feature subsets; accessing a tuple instance count data structure stored in memory that provides counts of tuples in a data set; computing class entropy scores for the identified input subtuples that have at least a threshold support count of instances in the tuple instance count data structure, wherein the class entropy scores are based on class labels of the input subtuples, and wherein a class label for an input subtuple has a class value that indicates either a fraudulent user registration or a non-fraudulent user registration; building the partition list including: ordering at least some of the scored input subtuples by non-decreasing class entropy score; and traversing the ordered input subtuples, including: adding a feature subset of a current ordered input subtuple to the partition list, and pruning, from subsequent ordered input subtuples, input subtuples including features that overlap with features of the feature subset corresponding to the current ordered input subtuple; storing the partition list in a memory, whereby it becomes available to use with the classifier; and using the partition list with the classifier to classify additional user registration data records as either fraudulent or non-fraudulent. 2. The method of claim 1 , further including the threshold support count being in a range of 2 to 10, inclusive, thereby requiring at least 2 to 10 instances of feature-values in the input subtuples to be present in the tuple instance count data structure. 3. The method of claim 1 , wherein: the tuple instance count data structure includes tuple instances, class values and counts; and the input tuple includes the class label that accepts an assigned class value. 4. The method of claim 1 , wherein the tuple instance count data structure is subject to a minimum threshold support count and returns non-zero counts for tuple instances that have a count in the data set that meets the minimum support threshold. 5. A method of building a partition list of feature subsets having probabilistic interdependence among features in the feature subsets for use with a classifier to detect fraudulent user registrations, the method including: accessing a tuple instance count data structure stored in memory, the tuple instance count data structure based on counts of feature-values in an input set of tuples wherein the feature-values of the tuples are values from user registration data records; accessing a class value, wherein the class value indicates either a fraudulent user registration or a non-fraudulent user registration; computing predictive power for the class value of tuple instances using counts in the tuple instance count data structure for supported tuples among tuple instances that have at least a threshold support of count instances; building the partition list including: ordering at least some of the tuple instances for which the predictive power for the class value has been computed by non-increasing predictive power; and traversing the ordered tuple instances, including: adding a feature subset of a current ordered input subtuple to the partition list, and pruning, from subsequent consideration, other input subtuples that include any features in the current ordered input subtuple; storing the partition list in a memory, whereby it becomes available to use with the classifier; and using the partition list with the classifier to classify additional user registration data records as either fraudulent or non-fraudulent. 6. The method of claim 5 , further including the supported tuples satisfying the threshold support count being in a range of 2 to 10, inclusive, thereby requiring at least 2 to 10 instances of feature-values in the supported tuples to be present in the tuple instance count data structure. 7. The method of claim 5 , wherein: the tuple instance count data structure includes tuple instances, class values and counts. 8. A method of building a partition list of feature subsets having probabilistic interdependence among features in the feature subsets for use with a classifier to detect fraudulent user registrations, the method including: accessing a tuple instance count data structure stored in memory, the tuple instance count data structure based on counts of feature-values in an input set of tuples wherein the feature-values of the tuples are values from user registration data records; accessing an input set including an input tuple comprising feature-values assigned to features, wherein the features are of the user registration data records and wherein the feature-values of the input tuple are values from a user registration data record; building the partition list including: adding singleton input features present in the input set to the partition list as pending feature subsets; computing reduction in class entropy scores that would result from merging pairs of pending feature subsets using the tuple instance count data structure, limiting consideration of mergers resulting in merged feature subsets to the mergers that have at least a threshold support count of instances in the tuple instance count data structure, wherein the class entropy scores are based on class values that indicate either a fraudulent user registration or a non-fraudulent user registration; selecting a selected pair of pending feature subsets that yields a reduction in class entropy score resulting from the merger, wherein the reduction in class entropy score meets a predetermined class entropy reduction threshold; and merging the selected pair of feature subsets into a merged pending feature subset in the partition list; storing the partition list in the memory, whereby it becomes available to use with the classifier; and using the partition list with the classifier to classify additional user registration data records as either fraudulent or non-fraudulent. 9. The method of claim 8 , further including selecting the pair of pending feature subsets for merging that yields a greatest reduction in class entropy among pairs of pending feature subsets available for merger. 10. The method of claim 8 , further including the threshold support count for the merged feature subsets being in a range of 2 to 10, inclusive, thereby requiring at least 2 to 10 instances of feature-values in the merged feature subsets to be present in the tuple instance count data structure. 11. The method of claim 8 , wherein: the tuple instance count data structure includes tuple instances, class values and counts; and the input tuple includes a class label that accepts an assigned class value. 12. A method of building a partition list of feature subsets having probabilistic interdependence among features in the feature subsets for use with a classifier to detect fraudulent user registrations, the method including: accessing a tuple instance count data structure stored in memory, the tuple instance count data structure based on counts of feature-values in an input set of tuples wherein the feature-values of the tuples are values from user registration data records; accessing a class value, wherein the feat

Assignees

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Physics · mapped topic

  • Physics · mapped topic

  • Physics · mapped topic

  • G06N5/02Primary

    Knowledge representation; Symbolic representation · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10163056B2 cover?
The technology disclosed relates to methods for partitioning sets of features for a Bayesian classifier, finding a data partition that makes the classification process faster and more accurate, while discovering and taking into account feature dependence among sets of features in the data set. It relates to computing class entropy scores for a class label across all tuples that share the featur…
Who is the assignee on this patent?
Salesforce Com Inc
What technology area does this patent fall under?
Primary CPC classification G06N5/02. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Dec 25 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).