Method, device and computer program product for determining duplicated data

US11226935B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11226935-B2
Application number: US-201916359445-A
Country: US
Kind code: B2
Filing date: Mar 20, 2019
Priority date: Apr 28, 2018
Publication date: Jan 18, 2022
Grant date: Jan 18, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques determine (or detect) duplicated data. The techniques involve: in response to determining that data at a first position in input data is the same as predetermined data, determining a feature value of a selected portion of input data; determining whether the feature value matches with a pre-stored duplicated data pattern in a duplicated data pattern list; and in response to determining that the feature value matches with the duplicated data pattern, determining an association of the input data with reference data which is associated with the matched pattern.
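
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the use of SHA-256 as the feature value, and the dictionary used as the duplicated data pattern list are all hypothetical choices made for the example.

```python
import hashlib


def feature_value(data: bytes) -> str:
    # Assumption: a SHA-256 digest stands in for the "feature value".
    return hashlib.sha256(data).hexdigest()


def build_pattern_list(references):
    # Assumption: the pattern list maps a feature value to its reference data.
    return {feature_value(ref): ref for ref in references}


def find_reference(input_data: bytes, predetermined: bytes, pattern_list: dict):
    """Return the associated reference data, or None if there is no match."""
    # Cheap gate: data at the first position must equal the predetermined data.
    if not input_data.startswith(predetermined):
        return None
    # Only now pay for the feature value and the pattern-list lookup.
    reference = pattern_list.get(feature_value(input_data))
    if reference is None:
        return None
    # Determine the association: confirm the input really equals the reference.
    return reference if reference == input_data else None


if __name__ == "__main__":
    zeros = bytes(4096)
    patterns = build_pattern_list([zeros])
    print(find_reference(zeros, bytes(8), patterns) is not None)
```

The ordering is the point of the sketch: the inexpensive prefix comparison filters out most inputs before any hashing or lookup is performed.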

First claim

Opening claim text (preview).

We claim:

1. A method of determining duplicated data, comprising: in a first layer of comparison, determining that a first data portion of input data is the same as data from a plurality of predetermined locations of the input data; in response to determining that the first data portion of the input data is the same as the data from the plurality of predetermined locations of the input data, determining a feature value of the first data portion of the input data; in a second layer of comparison, determining that the feature value of the first data portion of the input data is matched with a pre-stored duplicated data pattern in a duplicated data pattern list; in response to determining that the feature value of the first data portion of the input data is matched with the pre-stored duplicated data pattern in the duplicated data pattern list, determining an association of the first data portion with corresponding reference data associated with the pre-stored duplicated data pattern; in a third layer of comparison, determining that the association indicates that the first data portion is not associated with the corresponding reference data; and in response to determining that the association indicates that the first data portion is not associated with the corresponding reference data, storing the input data.

2. The method according to claim 1, wherein the plurality of predetermined locations are adjacent to a plurality of equidistant locations between a starting portion and an ending portion of the input data, respectively.

3. The method according to claim 1, wherein the data from the plurality of predetermined locations are based on combining data in a second number of bytes at each of a first number of the plurality of predetermined locations of the input data.

4. The method according to claim 1, wherein determining the association of the first data portion with the corresponding reference data comprises: determining, from the first data portion, a plurality of data portions having a predetermined length; determining, based on the reference data, a plurality of second data portions having a predetermined length; and determining the association based on a comparison of the plurality of first data portions determined from the first data portion and the plurality of second data portions determined based on the reference data.

5. An apparatus for determining duplicated data, comprising: a memory configured to store one or more programs; a processing unit coupled to the memory and configured to execute the one or more programs to cause the apparatus to perform acts comprising: in a first layer of comparison, determining that a first data portion of input data is the same as data from a plurality of predetermined locations of the input data; in response to determining that the first data portion of the input data is the same as the data from the plurality of predetermined locations of the input data, determining a feature value of the first data portion of the input data; in a second layer of comparison, determining that the feature value of the first data portion of the input data is matched with a pre-stored duplicated data pattern in a duplicated data pattern list; in response to determining that the feature value of the first data portion of the input data is matched with the pre-stored duplicated data pattern in the duplicated data pattern list, determining an association of the first data portion with corresponding reference data associated with the pre-stored duplicated data pattern; in a third layer of comparison, determining that the association indicates that the first data portion is not associated with the corresponding reference data; and in response to determining that the association indicates that the first data portion is not associated with the corresponding reference data, storing the input data.

6. The apparatus according to claim 5, wherein the plurality of predetermined locations are adjacent to a plurality of equidistant locations between a starting portion and an ending portion of the input data, respectively.

7. The apparatus according to claim 5, wherein the data from the plurality of predetermined locations are based on combining data in a second number of bytes at each of a first number of the plurality of predetermined locations of the input data.

8. The apparatus according to claim 5, wherein determining the association of the first data portion with the corresponding reference data comprises: determining, from the first data portion, a plurality of data portions having a predetermined length; determining, based on the reference data, a plurality of second data portions having a predetermined length; and determining the association based on a comparison of the plurality of data portions determined from the first data portion and the plurality of second data portions determined based on the reference data.

9. A computer program product having a non-transitory computer readable medium that stores a set of instructions to detect duplicated data received by a data storage array; the set of instructions, when carried out by the data storage array, causing the data storage array to perform a method of: in a first layer of comparison, determining that a first data portion of input data is the same as from a plurality of predetermined locations of the input data; in response to determining that the first data portion of the input data is the same as the data from the plurality of predetermined locations of the input data, determining a feature value of the first data portion of the input data; in a second layer of comparison, determining that the feature value of the first data portion of the input data is matched with a pre-stored duplicated data pattern in a duplicated data pattern list; in response to determining that the feature value of the first data portion of the input data is matched with the pre-stored duplicated data pattern in the duplicated data pattern list, determining an association of the first data portion with corresponding reference data associated with the pre-stored duplicated data pattern; in a third layer of comparison, determining that the association indicates that the first data portion is not associated with the corresponding reference data; and in response to determining that the association indicates that the first data portion is not associated with the corresponding reference data, storing the input data.

10. The computer program product of claim 9, further comprising: in response to determining that the feature value of the first data portion of the input data is matched with the pre-stored duplicated data pattern in the duplicated data pattern list, applying reclaimed processing cycles to provide other computerized services.
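
One way to picture the three layers of comparison in claims 1 to 4 is the sketch below. It is an interpretation for illustration only, and every concrete choice in it is an assumption rather than something the claims specify: the names, the sample width and count, the use of CRC-32 as the feature value, and the representation of each pattern-list entry as a short repeating byte pattern that serves as the reference data.

```python
import zlib

WIDTH = 8  # assumed "second number of bytes" read at each sampled location
COUNT = 4  # assumed "first number" of sampled locations


def sample_offsets(length, count=COUNT, width=WIDTH):
    """Equidistant offsets running from the start to the end of the block."""
    step = (length - width) // (count - 1)
    return [i * step for i in range(count)]


def layer1_samples_match(block):
    # Layer 1: the first data portion must equal the data at every sampled location.
    if len(block) < WIDTH:
        return False
    head = block[:WIDTH]
    return all(block[o:o + WIDTH] == head for o in sample_offsets(len(block)))


def layer2_lookup(block, pattern_list):
    # Layer 2: feature value of the first data portion, looked up in the pattern list.
    return pattern_list.get(zlib.crc32(block[:WIDTH]))


def layer3_associated(block, pattern):
    # Layer 3: split the block into fixed-length pieces and compare each one
    # with the reference pattern (a trailing short piece counts as a mismatch).
    size = len(pattern)
    pieces = [block[i:i + size] for i in range(0, len(block), size)]
    return all(piece == pattern for piece in pieces)


def classify(block, pattern_list):
    """Return (is_duplicate, layer_that_decided)."""
    if not layer1_samples_match(block):
        return (False, 1)
    pattern = layer2_lookup(block, pattern_list)
    if pattern is None:
        return (False, 2)
    if not layer3_associated(block, pattern):
        return (False, 3)  # not associated with the reference: store the input data
    return (True, 3)


if __name__ == "__main__":
    zero = bytes(WIDTH)
    plist = {zlib.crc32(zero): zero}
    print(classify(bytes(4096), plist))
```

In this sketch each layer is more expensive than the one before it, so a block that is clearly not a known pattern is rejected after reading only a few sampled bytes, while the full piece-by-piece comparison runs only on blocks that passed both cheaper checks.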

Assignees

Inventors

Classifications

  • G06F16/21 (Primary)

    Design, administration or maintenance of databases · CPC title

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Matching criteria, e.g. proximity measures · CPC title

  • by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination · CPC title

  • Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11226935B2 cover?
Techniques determine (or detect) duplicated data. The techniques involve: in response to determining that data at a first position in input data is the same as predetermined data, determining a feature value of a selected portion of input data; determining whether the feature value matches with a pre-stored duplicated data pattern in a duplicated data pattern list; and in response to determining…
Who is the assignee on this patent?
EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/21. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 18, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).