Method and system for data assignment in a distributed system

US11100073B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11100073-B2
Application numberUS-201514938990-A
CountryUS
Kind codeB2
Filing dateNov 12, 2015
Priority dateNov 12, 2015
Publication dateAug 24, 2021
Grant dateAug 24, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present teaching relates to methods, systems, and programming for data assignment in a distributed system. In one example, a plurality of tables is obtained. Each of the plurality of tables includes a plurality of buckets. Each of the plurality of tables is generated based on a same set of keys and a different function. The buckets in the plurality of tables are projected into different partitions. Data in the plurality of tables are assigned to a plurality of nodes in the distributed system such that data in buckets that are projected into a same partition are assigned to a same node in the distributed system.

First claim

Opening claim text (preview).

I claim: 1. A method, implemented on a machine having at least one processor, storage, and a communication platform connected to a network for data assignment in a distributed system, the method comprising: obtaining a plurality of first-layer hash tables, each of which includes a plurality of first buckets, wherein each of the plurality of first-layer hash tables is generated based on a same set of keys and a different first-layer hash function; determining a center point in a feature space for each first bucket included in each of the plurality of first-layer hash tables to obtain a plurality of center points; generating an aggregated table that includes each center point of each first bucket included in the plurality of first-layer hash tables; determining one or more directions in the feature space with maximum variance of each center point; determining a median point in the feature space of a center point distribution along each of the one or more directions in the feature space; mapping each center point to one of a plurality of second buckets of the aggregated table based on a second-layer hash function, the one or more directions, and the respective median points in the feature space, wherein center points corresponding to different first buckets that overlap in the feature space are mapped to a same second bucket in the aggregated table, wherein mapping each center point comprises segmenting the plurality of center points to the one of the plurality of second buckets of the aggregated table based on the second-layer hash function; and assigning at least one of the plurality of second buckets to a node in the distributed system such that data items stored within one or more of the plurality of first buckets whose center points are mapped to the same second bucket are assigned to a same node in the distributed system so that the data items assigned to the same node represent similar data items. 2. The method of claim 1 , wherein segmenting the plurality of center points to the one of the plurality of second buckets of the aggregated table based on the second-layer hash function comprises: segmenting the plurality of center points to the one of the plurality of second buckets based on a respective position of each center point in the feature space, wherein the feature space is Euclidean space. 3. The method of claim 1 , wherein segmenting further comprising: determining one or more hash functions based on the one or more directions and their respective median points in the feature space, wherein each center point in the aggregated table is segmented into different partitions based on the one or more hash functions. 4. The method of claim 1 , wherein the at least one of the plurality of second buckets is randomly assigned to the same node in the distributed system. 5. The method of claim 1 , wherein each of the plurality of first-layer hash tables is a hash table generated based on a corresponding hash function. 6. The method of claim 5 , further comprising: receiving input items each corresponding to one of the same set of keys; and computing, based on the corresponding hash function corresponding to each first-layer hash table of the plurality of first-layer hash tables, the same set of keys into the plurality of first buckets included in a corresponding first-layer hash table. 7. The method of claim 1 , wherein assigning further comprises: assigning one second bucket of the plurality of second buckets in the aggregated table to a first node in the distributed system, wherein assigning the one second bucket comprises: determining at least one first bucket from at least one first-layer hash table having a corresponding center point mapped to the one second bucket, and sending data items from each of the at least one first bucket to the first node in the distributed system; and assigning a different second bucket of the plurality of second buckets in the aggregated table to a second node in the distributed system, wherein assigning the different second bucket comprises: determining at least one different first bucket from at least one first-layer hash table having a corresponding center point mapped to the different second bucket, and sending data items from each of the different second bucket to the second node in the distributed system. 8. A system having at least one processor, storage, and a communication platform connected to a network for data assignment in a distributed system, comprising: a hash table generator configured for obtaining a plurality of first-layer hash tables, each of which includes a plurality of first buckets, wherein each of the plurality of first-layer hash tables is generated based on a same set of keys and a different first-layer hash function; a bucket center determiner configured for: determining a center point in a feature space for each first bucket included in each of the plurality of first-layer hash tables to obtain a plurality of center points, generating an aggregated table that includes each center point of each first bucket included in the plurality of first-layer hash tables, and mapping each center point to one of a plurality of second buckets of the aggregated table; a maximum variant direction determiner configured for determining one or more directions in the feature space with maximum variance of each center point; a median point determiner configured for determining a median point in the feature space of a center point distribution along each of the one or more directions in the feature space, wherein: the center point is mapped to the one of the plurality of second buckets of the aggregated table based on a second-layer hash function, the one or more directions, and the respective median points in the feature space, wherein center points corresponding to different first buckets that overlap in the feature space are mapped to a same second bucket in the aggregated table, wherein mapping each center point comprises segmenting the plurality of center points to the one of the plurality of second buckets of the aggregated table based on the second-layer hash function; and a hash bucket assigner configured for assigning at least one of the plurality of second buckets to a node in the distributed system such that data items stored within one or more of the plurality of first buckets whose center points are mapped to the same second bucket are assigned to a same node in the distributed system so that the data items assigned to the same node represent similar data items. 9. The system of claim 8 , wherein segmenting the plurality of center points to the one of the plurality of second buckets of the aggregated table based on the second-layer hash function comprises: segmenting the plurality of center points to the one of the plurality of second buckets based on a respective position of each center point in the feature space, wherein the feature space is Euclidean space. 10. The system of claim 8 , further comprising: a direction based function determiner configured for determining one or more hash functions based on the one or more directions and their respective median points in the feature space, wherein each center point in the aggregated table is segmented into different partitions based on the one or more hash functions. 11. The system of claim 8 , wherein the at least one of the plurality of second buckets is randomly assigned to the same node in the distributed system. 12. The system of claim 8 , wherein each of the plurality of first-layer hash tables is a hash table generated based on a corresponding hash function, wherein the system further comprises: a hash bucket computer configured for:

Assignees

Inventors

Classifications

  • for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS] · CPC title

  • Distributed file systems · CPC title

  • Hash tables · CPC title

  • Data partitioning, e.g. horizontal or vertical partitioning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11100073B2 cover?
The present teaching relates to methods, systems, and programming for data assignment in a distributed system. In one example, a plurality of tables is obtained. Each of the plurality of tables includes a plurality of buckets. Each of the plurality of tables is generated based on a same set of keys and a different function. The buckets in the plurality of tables are projected into different par…
Who is the assignee on this patent?
Yahoo Holdings Inc, Verizon Media Inc
What technology area does this patent fall under?
Primary CPC classification H04L67/1097. Mapped technology areas include Electricity.
When was this patent published?
Publication date Tue Aug 24 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).