What technology area does this patent fall under?

Primary CPC classification G06F16/1752. Mapped technology areas include Physics.

When was this patent published?

Publication date Thu Jan 21 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Distributed deduplication using locality sensitive hashing

US2016019232A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2016019232-A1
Application number	US-201414337070-A
Country	US
Kind code	A1
Filing date	Jul 21, 2014
Priority date	Jul 21, 2014
Publication date	Jan 21, 2016
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Deduplication in a distributed storage system is described. A deduplication manager identifies a data item that includes multiple data chunks. The deduplication manager defines a first extent on a first node in a distributed storage system. The deduplication manager compares the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent. The deduplication manager selects a second extent from the found group of extents. The second closely matches the first extent and removes from the first extent one or more data chunks that are included in the first extent and the second extent. The deduplication manager associates, with the first extent, a pointer to the second extent for the removed one or more data chunks.

First claim

Opening claim text (preview).

What is claimed is: 1 . A method comprising: identifying a data item including a plurality of data chunks; generating, by a processing device, a fingerprint for each of the plurality of data chunks; defining a first extent on a first node in a distributed storage system; comparing the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; selecting a second extent from the found group of extents, the second extent closely matching the first extent; removing, by the processing device, from the first extent one or more data chunks that are included in the first extent and the second extent; and associating, with the first extent, a pointer to the second extent for the removed one or more data chunks. 2 . The method of claim 1 further comprising storing data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein comparing the first extent to existing groups of similar extents comprises: identifying the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identifying the second extent as being associated with the first group. 3 . The method of claim 1 , wherein comparing the first extent to existing groups of similar extents comprises: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 4 . The method of claim 1 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein removing from the first extent one or more data chunks that are included in the first extent and the second extent comprises comparing the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 5 . The method of claim 1 , wherein the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 6 . The method of claim 1 , wherein the processing device executes a daemon to identify the plurality of data chunks. 7 . The method of claim 1 , wherein at least some of the plurality of data chunks of the first extent are sequentially addressed. 8 . A computer readable storage medium having instructions that, when executed by a processing device, cause the processing device to perform operations comprising: identifying a data item including a plurality of data chunks; generating, by the processing device, a fingerprint for each of the plurality of data chunks; defining a first extent on a first node in a distributed storage system, the first extent comprising a sketch that comprises the fingerprints for each of the plurality of data chunks; comparing the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; selecting a second extent from the found group of extents, the second extent that closely matching the first extent; removing, by the processing device, from the first extent one or more data chunks that are included in the first extent and the second extent; and associating, with the first extent, a pointer to the second extent for the removed one or more data chunks. 9 . The computer readable storage medium of claim 8 , the operations further comprising storing data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein comparing the first extent to existing groups of similar extents comprises: identifying the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identifying the second extent as being associated with the first group. 10 . The computer readable storage medium of claim 8 , wherein comparing the first extent to existing groups of similar extents comprises: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 11 . The computer readable storage medium of claim 8 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein removing from the first extent one or more data chunks that are included in the first extent and the second extent comprises comparing the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 12 . The computer readable storage medium of claim 8 , wherein at least some of the plurality of data chunks of the first extent are sequentially addresses. 13 . The computer readable storage medium of claim 8 , wherein, the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 14 . A computing device comprising: a memory; and a processing device coupled to the memory, the processing device to: identify a data item including a plurality of data chunks; generate a fingerprint for each of the plurality of data chunks; define a first extent on a first node in a distributed storage system; compare the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; select a second extent from the found group of extents, the second extent closely matching the first extent; remove from the first extent one or more data chunks that are included in the first extent and the second extent; and associate, with the first extent, a pointer to the second extent for the removed one or more data chunks. 15 . The computing device of claim 14 , the processing device further to store data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein when comparing the first extent to existing groups of similar extents, the processing device is to: identify the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identify the second extent as being associated with the first group. 16 . The computing device of claim 14 , wherein when comparing the first extent to existing groups of similar extents, the processing device is to: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 17 . The computing device of claim 14 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein when removing from the first extent one or more data chunks that are included in the first extent and the second extent, the processing device is to compare the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 18 . The computing device of claim 14 , wherein the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 19 . The computing

Assignees

Red Hat Inc

Inventors

Lambright Henry Dan

Classifications

G06F16/1752Primary
based on file chunks · CPC title
G06F17/30159Primary
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 55074728

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016019232A1 cover?: Deduplication in a distributed storage system is described. A deduplication manager identifies a data item that includes multiple data chunks. The deduplication manager defines a first extent on a first node in a distributed storage system. The deduplication manager compares the first extent to existing groups of similar extents to find one of the existing groups that has extents that are simil…
Who is the assignee on this patent?: Red Hat Inc
What technology area does this patent fall under?: Primary CPC classification G06F16/1752. Mapped technology areas include Physics.
When was this patent published?: Publication date Thu Jan 21 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).