Device for and method of determining clusters

US10411728B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10411728-B2
Application numberUS-201716076347-A
CountryUS
Kind codeB2
Filing dateFeb 8, 2017
Priority dateFeb 8, 2016
Publication dateSep 10, 2019
Grant dateSep 10, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A device ( 100 ) for and method of determining clusters of sequences of instances of a first type of data for compacting a data set comprising sequences of instances of the first type of data is provided. Also a method of compacting a data set, a method of transmitting compacted data and a computer program product are provided. In a sequence clustering unit ( 110 ) of the device, sequences of a first set of data are clustered on basis of conditional probabilities. Each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence. In the clustering a significant part of the mutual information between the first set of data and the second set of data is maintained.

First claim

Opening claim text (preview).

The invention claimed is: 1. A data reduction arrangement for compacting sequences of instances of a first type of data, the data reduction arrangement comprising: a device for determining clusters of sequences of instances of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data based on a characteristic of a physical entity, the device comprising: a first data set unit for obtaining a first set of data comprising sequences of instances of the first type of data; a second data set unit for obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; a sequence clustering unit for assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; and a compaction unit configured for compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set. 2. A computer-implemented method for determining clusters of sequences of instances of data of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data being based on a characteristic of a physical entity, the method comprising: obtaining a first set of data comprising sequences of instances of the first type of data; obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; and compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set. 3. The method according to claim 2 , wherein the assigning of the sequences of the first set of data to clusters comprises forming the cluster on basis of estimated conditional probabilities of the leaf nodes of the context tree, wherein, if a specific leaf node is related to a specific cluster, then all sequences of the first set of data equal to the unique sequence ending in the specific leaf node are assigned to the specific cluster and wherein the estimated conditional probability of a respective leaf node is a Krichevsky and Trofimov estimator that is calculated on basis of the counts of the respective leaf nodes. 4. The method according to claim 2 , wherein the assigning of the sequences of the first set of data to clusters uses a k-means algorithm to form the clusters and assigns sequences of the first set to clusters, and wherein the k-means algorithm uses the estimated conditional probabilities of the leaf nodes of the context tree to form the clusters. 5. The method according to claim 4 , wherein, in the assigning of the sequence of the first set of data to clusters, sequences of the first set of data ending in leaf nodes having a total count that is smaller than a minimum number of observations are assigned to two additional clusters, sequences ending in leaf nodes having an estimated conditional probability smaller than 0.5 and having a total count that is smaller than the minimum number of observations are assigned to a first one of the two additional clusters, sequences ending in leaf nodes having an estimated conditional probability larger than 0.5 and having a total count that is smaller than the minimum number of observations are assigned to a second one of the two additional clusters, sequences ending in leaf nodes having an estimated conditional probability that is equal to 0.5 and having a total count smaller than the minimum number of observations are assigned to either the first one of the additional clusters or the second one of the additional clusters. 6. The method according to claim 2 , wherein the clusters of the sequences of the first set of data are further optimized by an iterative optimization method to minimize an optimization function that is based on a conditional entropy of the second set of data given the data of the clusters. 7. The method according to claim 6 , wherein the iterative optimization method comprises simulated annealing. 8. The method according to claim 2 , wherein the sequences of instances of the first type of data of the first set of data comprise time series of sensor data, each time series comprises results of measurements of one specific sensor at consecutive moments in time and the specific sensors are of an equal type. 9. The method according to claim 2 , wherein the instances of the second set of data are binary data instances. 10. The method according to claim 9 , wherein the assigning based on conditional probabilities is based on the conditional probabilities that the data of the second set of data is one given an unique sequence of the first set of data. 11. A method of transmitting compacted data comprising at least one sequence of instances of the first type of data, the at least one sequence is a sequence to be transmitted, the method comprising: obtaining

Assignees

Inventors

Classifications

  • H03M7/3071Primary

    Prediction · CPC title

  • Clustering or classification · CPC title

  • for evaluating statistical data {, e.g. average values, frequency distributions, probability functions, regression analysis (forecasting specially adapted for a specific administrative, business or logistic context G06Q10/04)} · CPC title

  • Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression · CPC title

  • Binary arithmetic codes · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10411728B2 cover?
A device ( 100 ) for and method of determining clusters of sequences of instances of a first type of data for compacting a data set comprising sequences of instances of the first type of data is provided. Also a method of compacting a data set, a method of transmitting compacted data and a computer program product are provided. In a sequence clustering unit ( 110 ) of the device, sequences of a…
Who is the assignee on this patent?
Koninklijke Philips Nv
What technology area does this patent fall under?
Primary CPC classification H03M7/3071. Mapped technology areas include Electricity.
When was this patent published?
Publication date Tue Sep 10 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).