System and method for data de-duplication

US9465823B2 · US · B2

Patent metadata
Publication number: US-9465823-B2
Application number: US-58478206-A
Country: US
Kind code: B2
Filing date: Oct 19, 2006
Priority date: Oct 19, 2006
Publication date: Oct 11, 2016
Grant date: Oct 11, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed are methods, systems, and computer program products for processing a file which include using a computer system that is programmed for performing a process of receiving the file in response to a request for storing the file, determining whether a database already contains the file, and storing the file in the database if the database does not already contain the file. The process may alternatively include receiving the file in response to a request for storing the file, determining whether a database already contains the file, and storing the file without storing the received file if the database already contains the file. The process may also alternatively include receiving the file in response to a request for storing the file in a database, separating the file into a plurality of portions, and storing the plurality of portions so each of the plurality of portions can be individually accessed.
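The store-or-skip decision in the abstract can be illustrated with a short sketch. It is not the patented implementation: the `DedupStore` class, the choice of SHA-256 to derive an identifier from the file contents, and the in-memory dictionaries standing in for the database are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store; dicts stand in for the database (assumption)."""

    def __init__(self):
        self.blobs = {}      # content identifier -> stored bytes
        self.names = {}      # file name -> content identifier
        self.requests = {}   # content identifier -> number of store requests

    def store(self, name: str, data: bytes) -> bool:
        """Store `data` under `name`; return True if new bytes were written."""
        # The identifier is derived from the contents, not from the name.
        ident = hashlib.sha256(data).hexdigest()
        is_new = ident not in self.blobs
        if is_new:
            self.blobs[ident] = data
        # Record the request either way, so a duplicate only updates bookkeeping.
        self.requests[ident] = self.requests.get(ident, 0) + 1
        self.names[name] = ident
        return is_new

    def load(self, name: str) -> bytes:
        """Return the contents that `name` refers to."""
        return self.blobs[self.names[name]]


store = DedupStore()
wrote_first = store.store("report.pdf", b"same bytes")
wrote_second = store.store("copy-of-report.pdf", b"same bytes")
print(wrote_first, wrote_second, len(store.blobs))  # True False 1
```

The second request writes no new bytes; both names resolve to the same stored contents, and only the request count changes.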

First claim

Opening claim text (preview).

What is claimed:

1. A computer implemented method of processing a data object, comprising: using a computer system which comprises at least one processor and is programmed for performing a process, the process comprising: receiving a request to store the data object; receiving the data object; determining, at a deduplication module functioning in conjunction with the at least one processor and stored in a non-transitory computer accessible storage medium, a unique identifier for the data object based in part or in whole upon contents of the data object, instead of based upon a name of the data object, at least by transforming the contents of the data object with a transformation; determining whether a database contains a first data object identified by a different data object name but comprising first data object contents identical to the contents of the data object at least by computing an identifier for at least a portion of the first data object contents and further by comparing the unique identifier for the data object to the identifier for the at least the portion of the first data object contents; and determining whether to store a duplicative copy of the contents in the database based in part or in whole upon a duplication indicator that is stored with at least the unique identifier in an identifier data structure and corresponds to the contents of the data object in the database; storing the duplicative copy of the contents in the database when it is determined that the database has already contained the first data object contents identical to the contents of the data object, and when the duplication indicator corresponding to the contents of the data object indicates that the duplicative copy of the contents is to be stored in the database; and storing the duplicative copy of the contents in the database when it is determined that the database has not contained the first data object contents identical to the contents of the data object.

2. The method of claim 1, further comprising: determining a data processing threshold; and separating the data object into a plurality of portions based on the data processing threshold.

3. The method of claim 1, further comprising compressing at least a portion of the data object.

4. The method of claim 3, further comprising storing a data compression flag to indicate that data compression has been performed.

5. The method of claim 1, further comprising determining a data compression criteria, and compressing at least a portion of the data object based on the data compression criteria or encrypting at least a portion of the data object.

6. The method of claim 1, further comprising determining a plurality of rolling identifiers for the data object during a period of time when the data object is being received; and determining a final unique identifier as the unique identifier.

7. The method of claim 6, further comprising: storing a data encryption flag to indicate that data encryption has been performed; determining one or more data encryption criteria; and encrypting at least a portion of the data object based on the one or more data encryption criteria, wherein the data object comprises a LOB data object, and the LOB data object comprises an image data object.

8. The method of claim 1, further comprising: receiving, at a data receiving module stored at least partially in memory, the request to store the data object that comprises a large object (LOB) file in the database from a client; and reflecting a state where more than one client has requested storage of the large object file at least by updating the database without storing the duplicative copy of the contents in the database when it is determined that the database has already contained the first data object contents identical to the contents of the large object file, and when the duplication indicator corresponding to the contents of the data object indicates that the duplicative copy of the contents is not to be stored in the database.

9. The method of claim 8, further comprising: receiving, at the data receiving module stored at least partially in memory, at least a portion of the large object file for the request from the client, the at least the portion being smaller than the larger object file; determining a size of the at least the portion of the large object file; generating a size comparison result at least by comparing the size of the at least the portion of the large object file to a prescribed data processing threshold; generating a cumulative amount collection result at least by determining whether a cumulatively collected amount for the large object file is equal to or larger than the prescribed data processing threshold; transmitting the cumulatively collected amount for the large object file to downstream processing when it is determined that the cumulatively collected amount for the large object file is equal to or larger than the prescribed data processing threshold; and performing the downstream processing at least by computing a first rolling unique identifier for the cumulatively collected amount of the large object file.

10. The method of claim 9, examining the identifier data structure to determine whether the first rolling unique identifier for the cumulatively collected amount is found in the identifier data structure; storing the duplicative copy of the cumulatively collected amount in the database when the first rolling unique identifier for the cumulatively collected amount is not found in the identifier data structure, wherein the database comprises the identifier data structure storing thereupon at least the duplication indicator, the unique identifier, and an index corresponding to the unique identifier as well as an index data structure storing thereupon metadata for the index, an address corresponding to a physical location of the first data object contents identical to the contents of the large object file in the database, and a duplication request counter but not the duplication indicator; storing the duplicative copy of the cumulatively collected amount in the database when the first rolling unique identifier for the cumulatively collected amount is found in the identifier data structure, and when the duplication indicator corresponding to the contents of the data object indicates that the duplicative copy of the large object file is to be stored in the database; continuing, at the data receiving module, to receive one or more additional portions of the large object file when the first unique identifier is found in the identifier data structure until another cumulatively collected amount is equal to or larger than the prescribed data processing threshold; transmitting the another cumulatively collected amount for the large object file to the downstream processing when it is determined that the another cumulatively collected amount for the large object file is equal to or larger than the prescribed data processing threshold; performing the downstream processing at least by computing a second rolling unique identifier for the another cumulatively collected amount of the large object file; examining the identifier data structure to determine whether the second rolling unique identifier for the another cumulatively collected amount is found in the identifier data structure; storing the duplicative copy of the another cumulatively collected amount in the database when the second rolling unique identifier for the cumulatively collected amount is not found in the identifier data structure; storing the duplicative copy of the another cumulatively collected amount in the database when the second rolling unique identifier for the another cumulatively
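The claim preview combines three mechanisms: a duplication indicator that decides whether an already-known object is stored again (claim 1), rolling identifiers computed while the object is still arriving (claim 6), and a data processing threshold that triggers each computation (claim 9). The sketch below is one possible reading, not the claimed implementation. The names `IdentifierEntry`, `Database`, and `rolling_identifiers` are hypothetical, SHA-256 stands in for the unspecified transformation, and treating each rolling identifier as a hash of everything received so far is an assumption.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class IdentifierEntry:
    keep_duplicates: bool = False  # the "duplication indicator" (hypothetical name)
    copies: int = 0                # physical copies stored
    requests: int = 0              # store requests seen for this content


@dataclass
class Database:
    entries: dict = field(default_factory=dict)  # identifier -> IdentifierEntry

    def should_store(self, ident: str) -> bool:
        """Store if the content is unseen, or if seen and duplicates are allowed."""
        entry = self.entries.get(ident)
        return entry is None or entry.keep_duplicates

    def record(self, ident: str) -> bool:
        """Apply the decision and update bookkeeping; return True if a copy was stored."""
        stored = self.should_store(ident)
        entry = self.entries.setdefault(ident, IdentifierEntry())
        entry.requests += 1
        if stored:
            entry.copies += 1
        return stored


def rolling_identifiers(portions, threshold: int):
    """Yield an identifier each time the buffered amount reaches `threshold`,
    then a final identifier covering the whole object."""
    digest = hashlib.sha256()
    buffered = 0
    for portion in portions:
        digest.update(portion)
        buffered += len(portion)
        if buffered >= threshold:
            # Hash of everything received so far (assumed interpretation).
            yield digest.copy().hexdigest()
            buffered = 0
    yield digest.hexdigest()


db = Database()
portions = [b"abcd", b"efgh", b"ij"]
ids = list(rolling_identifiers(portions, threshold=8))
final_id = ids[-1]
first = db.record(final_id)    # unseen content: stored
second = db.record(final_id)   # duplicate, indicator off: only counted
print(len(ids), first, second)  # 2 True False
```

Setting `keep_duplicates` to `True` on an entry makes later requests for the same contents store another copy, which mirrors the branch of claim 1 where the indicator calls for a duplicative copy.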

Classifications

  • using de-duplication of the data · CPC title

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Shergill Kam, Aleti Bharath, Pandey Dheeraj, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 11, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.