Learning with transformed data

US11521106B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11521106-B2
Application number  US-201515521441-A
Country             US
Kind code           B2
Filing date         Oct 23, 2015
Priority date       Oct 24, 2014
Publication date    Dec 6, 2022
Grant date          Dec 6, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This disclosure relates to learning with transformed data such as determining multiple training samples from multiple data samples. Each of the multiple data samples comprises one or more feature values and a label that classifies that data sample. A processor determines each of the multiple training samples by randomly selecting a subset of the multiple data samples, and combining the feature values of the data samples of the subset based on the label of each of the data samples of the subset. Since the training samples are combinations of randomly chosen data samples, the training samples can be provided to third parties without disclosing the actual training data. This is an advantage over existing methods in cases where the data is confidential and should therefore not be shared with a learner of a classifier, for example.
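The two steps in the abstract (randomly select a subset, then combine its feature values based on the labels) can be sketched as below. This is a minimal illustration, not the patent's implementation: the function name `make_training_sample`, the toy data, the labels in {-1, +1}, and the way the subset is drawn (a random size, then random rows) are all assumptions made for the example.

```python
import random

def make_training_sample(features, labels, rng, min_size=2):
    """Combine a random subset of labelled rows into one training sample."""
    n = len(features)
    dim = len(features[0])
    # Assumed selection scheme: draw a subset size, then that many row indices.
    size = rng.randint(min_size, n)
    subset = rng.sample(range(n), size)
    combined = [0.0] * dim
    for i in subset:
        for j in range(dim):
            # Label-based combination: each row is added or subtracted
            # according to its label.
            combined[j] += labels[i] * features[i][j]
    return combined

# Hypothetical confidential data: 4 rows, 2 features, labels in {-1, +1}.
features = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0], [3.0, 1.0]]
labels = [+1, -1, -1, +1]

rng = random.Random(0)
training_samples = [make_training_sample(features, labels, rng) for _ in range(5)]
```

Only `training_samples` would leave the device in this sketch; each entry mixes at least two rows, so no single original row appears on its own.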

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer implemented method for training a classifier that corresponds to confidential data, the method comprising
    collecting the confidential data as multiple data samples on a user device by way of a user interface, each of the multiple data samples comprising one or more feature values in a vector x_i, where i is an index of the data sample, and a label y_i that classifies that data sample i;
    creating non-confidential training data by determining a training sample as a vector of feature values π from the multiple data samples, the training sample π preserving the privacy of the confidential data by:
        randomly selecting a subset of the multiple data samples by defining a masking variable σ_i for each data sample i, the subset comprising more than one data sample, and
        combining the feature values of the data samples of the subset based on the label of each of the data samples of the subset, by determining a weighted sum of the feature values of the data samples of the subset wherein the feature value of a feature of the training sample is the weighted sum of the feature values of that feature of the data samples of the subset, wherein the weighted sum comprises a sum of feature values of that feature multiplied by the respective labels of each of the data samples of the subset by calculating π = Σ_i (σ_i + y_i) x_i, wherein feature j of the training sample is a weighted sum of values of feature j of the data samples;
    repeating the step of determining the training sample to determine multiple training samples;
    sending the non-confidential training data including the multiple training samples including the combined feature values to a computer system for determining the classifier weight while maintaining privacy of the confidential data by preventing access to the confidential data from the computer system;
    training, by the computer system, a classifier that corresponds to the confidential data, without access to the confidential data and with access to the non-confidential training data by calculating, by the computer system, a classifier weight associated with a feature index from the multiple training samples; and
    classifying, by the computer system using the trained classifier, a test value by determining, by the computer system, a classification of the test values based on the classifier weight.

2. The method of claim 1, wherein randomly selecting the subset of the multiple data samples comprises multiplying each of the multiple data samples by a random selection value that is unequal to zero to select that data sample or equal to zero to deselect that data sample.

3. The method of claim 1, wherein determining the sum comprises determining a weighted sum that is weighted based on the number of data samples in the subset of the multiple data samples.

4. The method of claim 1, wherein the weighted sum is weighted based on a random number such that randomly selecting the subset of the multiple data samples is performed simultaneously with combining the feature values.

5. The method of claim 1, wherein randomly selecting a subset of multiple data samples comprises randomly selecting a subset of multiple data samples based on a non-uniform distribution.

6. The method of claim 1, wherein the data samples have signed real values as features values, and the label is one of ‘−1’ and ‘+1’.

7. The method of claim 1, wherein determining each of the multiple training samples comprises determining each of the multiple training samples such that each of the multiple training samples is based on at least a predetermined number of data samples.

8. The method of claim 7, wherein randomly selecting a subset of the multiple data samples comprises randomly selecting a subset of the multiple data samples that comprises at least a predetermined number of data samples.

9. The method of claim 1, wherein determining multiple training samples further comprises: determining for each feature value of the training sample a random value and adding the random value to that feature value to determine a modified training sample.

10. A non-transitory computer readable medium comprising computer-executable instructions stored thereon, that when executed by a processor, causes the processor to perform the method of claim 1.

11. A system for training a classifier that corresponds to confidential data, the system comprising a data collection device for determining multiple training samples from multiple data samples, and a computer system for receiving and processing the training samples,
    the data collection device comprising:
        an input port to receive the multiple data samples; and
        a processor configured to
            collect the confidential data as the multiple data samples by way of a user interface, each of the multiple data samples comprising one or more feature values in a vector x_i, where i is an index of the data sample, and a label y_i that classifies that data sample i;
            create non-confidential training data by determining a training sample π from the multiple data samples, the training sample π preserving the privacy of the confidential data by randomly selecting a subset of the multiple data samples by defining a masking variable σ_i for each data sample i, the subset comprising more than one data sample, and combining the feature values of the data samples of the subset based on the label of each of the data samples of the subset by determining a weighted sum of the feature values of the data samples of the subset wherein the feature value of a feature of the training sample is the weighted sum of the feature values of that feature of the data samples of the subset, wherein the weighted sum comprises a sum of feature values of that feature multiplied by the respective labels of each of the data samples of the subset by calculating π = Σ_i (σ_i + y_i) x_i, wherein feature j of the training sample is a weighted sum of values of feature j of the data samples;
            repeating the step of determining the training sample to determine multiple training samples; and
            to send the non-confidential training data including the multiple training samples including the combined feature values to the computer system for determining the classifier weight while maintaining privacy of the confidential data by preventing access to the confidential data from the computer system;
    the computer system comprising a processor configured to:
        train, by the computer system, a classifier that corresponds to the confidential data without access to the confidential data and with access to the non-confidential training data by calculating a classifier weight associated with a feature index from the multiple training samples, and
        classify, by the computer system using the trained classifier, a test value by determining a classification of the test values based on the classifier weight.

12. The computer implemented method of claim 1 comprising:
    receiving, by the computer system, multiple training values associated with a feature index, each training value being based on a combination of a subset of multiple data values that are kept securely on a data collection device by preventing access to the multiple data samples from the computer system, based on multiple data labels, each of the multiple data labels being associated with one of the multiple data values;
    determining, by the computer system, a correlation value based on the multiple training values, such that the correlation value is indicative of a correlation between each of the multiple data values and the data label associated with that data value; and
    determining, by the computer system, the classifier coefficient bas
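The formula in claim 1, π = Σ_i (σ_i + y_i) x_i, together with the optional random value of claim 9 and the train-then-classify steps, can be illustrated with the sketch below. Several things here are assumptions rather than statements from the claims: σ_i is drawn uniformly from {-1, +1} (so that σ_i + y_i is 0 for a deselected sample, matching the zero selection value of claim 2), the added random value is Gaussian, and the learner is a deliberately naive stand-in that uses the mean of the training samples as the per-feature weight. The names `masked_training_sample`, `fit_weights`, and `classify` are hypothetical.

```python
import random

def masked_training_sample(features, labels, rng, min_size=2, noise_scale=0.0):
    """One training sample pi = sum_i (sigma_i + y_i) * x_i."""
    n, dim = len(features), len(features[0])
    while True:
        # Assumed masking variable: sigma_i in {-1, +1}.
        # Sample i is selected exactly when sigma_i equals its label y_i.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        if sum(s == y for s, y in zip(sigma, labels)) >= min_size:
            break
    pi = [0.0] * dim
    for s, y, x in zip(sigma, labels, features):
        weight = s + y  # 0 deselects the sample; otherwise 2 * y
        for j in range(dim):
            pi[j] += weight * x[j]
    if noise_scale > 0.0:
        # Optional per-feature random value; the Gaussian choice is an assumption.
        pi = [v + rng.gauss(0.0, noise_scale) for v in pi]
    return pi

def fit_weights(training_samples):
    """Naive stand-in learner: one weight per feature index (mean over samples)."""
    count, dim = len(training_samples), len(training_samples[0])
    return [sum(s[j] for s in training_samples) / count for j in range(dim)]

def classify(weights, x):
    """Sign of the weighted sum of a test value's features."""
    score = sum(w * v for w, v in zip(weights, x))
    return 1 if score >= 0 else -1

# Hypothetical confidential data held on the collecting device.
features = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0], [3.0, 1.0]]
labels = [+1, -1, -1, +1]

rng = random.Random(0)
samples = [masked_training_sample(features, labels, rng) for _ in range(50)]

# The receiving side sees only `samples`, never `features` or `labels`.
weights = fit_weights(samples)
prediction = classify(weights, [3.0, 1.0])
```

Under the uniform σ_i assumed here, each weight σ_i + y_i averages out to y_i, so the mean of many training samples tracks Σ_i y_i x_i, a per-feature measure of how features co-vary with labels. A real learner would likely optimise a loss over the training samples instead of taking this mean.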

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11521106B2 cover?
This disclosure relates to learning with transformed data such as determining multiple training samples from multiple data samples. Each of the multiple data samples comprises one or more feature values and a label that classifies that data sample. A processor determines each of the multiple training samples by randomly selecting a subset of the multiple data samples, and combining the feature …
Who is the assignee on this patent?
Nat Ict Australia Ltd
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 6, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).