Data shards for distributed processing

US12236264B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12236264-B2
Application number: US-202117163386-A
Country: US
Kind code: B2
Filing date: Jan 30, 2021
Priority date: Jan 30, 2021
Publication date: Feb 25, 2025
Grant date: Feb 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, devices, and techniques are disclosed for data shards for distributed processing. Data sets of data for users may be received. The data sets may belong to separate groups. User identifiers in the data sets may be hashed to generate hashed identifiers for the data sets. The user identifiers in the data sets may be replaced with the hashed identifiers. The data sets may be split to generate shards. The data sets may be split into the same number of shards. Merged shards may be generated by merging the shards using a separate running process for each of the merged shards. The merged shards may be generated using shards from more than one of the two or more data sets. An operation may be performed on all of the merged shards.
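
The first three steps in the abstract (hash the user identifiers, replace them, split every data set into the same number of shards) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not something the abstract states: SHA-256 as the hash, rows represented as `(identifier, record)` tuples, four shards, and the helper names `hash_user_id`, `replace_ids`, `shard_index`, and `split_into_shards`. The point it shows is that when every data set uses the same range rule on the hashed identifier, an identifier shared by two data sets lands in the same shard position in both.

```python
import hashlib

Row = tuple[str, dict]  # assumed row layout: (identifier, record)


def hash_user_id(user_id: str) -> str:
    """Return a hex digest used as the hashed identifier (SHA-256 is an assumption)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def replace_ids(rows: list[Row]) -> list[Row]:
    """Replace each raw user identifier with its hashed identifier."""
    return [(hash_user_id(user_id), record) for user_id, record in rows]


def shard_index(hashed_id: str, num_shards: int) -> int:
    """Pick a shard from the range that the leading hex digits fall into."""
    prefix = int(hashed_id[:8], 16)        # value in 0 .. 2**32 - 1
    return prefix * num_shards // 2**32    # equal-width contiguous ranges


def split_into_shards(rows: list[Row], num_shards: int) -> list[list[Row]]:
    """Split one data set into exactly num_shards shards by hashed-identifier range."""
    shards: list[list[Row]] = [[] for _ in range(num_shards)]
    for hashed_id, record in rows:
        shards[shard_index(hashed_id, num_shards)].append((hashed_id, record))
    return shards


# Hypothetical data for two groups that share one user.
group_a = [("alice@example.com", {"clicks": 3}), ("bob@example.com", {"clicks": 1})]
group_b = [("alice@example.com", {"purchases": 2}), ("carol@example.com", {"purchases": 5})]

shards_a = split_into_shards(replace_ids(group_a), num_shards=4)
shards_b = split_into_shards(replace_ids(group_b), num_shards=4)
```

Because `shard_index` depends only on the hashed identifier and the shard count, shard `i` of `shards_a` and shard `i` of `shards_b` are "equivalent" in the sense the claims use, and can later be merged without looking at any other shard.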

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising: receiving two or more data sets of data for users wherein each of the two or more data sets belongs to a separate one of two or more groups; hashing user identifiers in the two or more data sets to generate hashed identifiers for the two or more data sets; replacing the user identifiers in the two or more data sets with the hashed identifiers; splitting each of the two or more data sets to generate shards, wherein each of the two or more data sets is split into the same number of shards by splitting each of the two more data sets such that the hashed identifiers that are common to two or more of the two or more data sets are in equivalent shards from the two or more of the two or more data sets; generating merged shards by merging the shards using a separate running process for each of the merged shards, wherein each of the merged shards is generated using shards from more than one of the two or more data sets; performing an operation on all of the merged shards wherein performing an operation on all of the merged shards comprises joining the merged shards into a merged data set; and training a machine learning system using the merged data set.

2. The computer-implemented method of claim 1, wherein performing an operation on each of the merged shards comprises performing non-negative matrix factorization on the merged shards.

3. The computer-implemented method of claim 1, wherein equivalent shards of the two or more data sets comprise shards assigned data from separate data sets based on the same criteria.

4. The computer-implemented method of claim 3, wherein the criteria comprises an alphanumeric range that a hashed identifier for the data falls into.

5. The computer-implemented method of claim 1, wherein generating merged shards by merging the shards using a separate running process for each of the merged shards, wherein each of the merged shards is generated using shards from more than one of the two or more data sets comprises merging a first set of equivalent shards from the shards on a first processor and merging a second set of equivalent shards from the shards on a second processor in parallel.

6. The computer-implemented method of claim 5, wherein merging a first set of equivalent shards from the shards on the first processor further comprises: joining the data in the equivalent shards; sorting the data in the equivalent shards by hashed identifier; and merging data for any duplicate hashed identifiers.

7. A computer-implemented system for localization of matrix factorization models trained with global data comprising: one or more storage devices; and two or more processors that receive two or more data sets of data for users wherein each of the two or more data sets belongs to a separate one of two or more groups with a first of the two or processors, hash user identifiers in the two or more data sets to generate hashed identifiers for the two or more data sets with the first of the two or processors, replace the user identifiers in the two or more data sets with the hashed identifiers with the first of the two or processors, split each of the two or more data sets to generate shards, wherein each of the two or more data sets is split into the same number of shards with the first of the two or processors by splitting each of the two more data sets such that the hashed identifiers that are common to two or more of the two or more data sets are in equivalent shards from the two or more of the two or more data sets, generate merged shards by merging the shards using a separate running process on each of the two or more processors for each of the merged shards, wherein each of the merged shards is generated using shards from more than one of the two or more data sets, perform an operation on all of the merged shards with the first of the two or processors wherein performing an operation on all of the merged shards comprises joining the merged shards into a merged data set and, training a machine learning system using the merged data set.

8. The computer-implemented system of claim 7, wherein the first of the two or more processors performs an operation on each of the merged shards by performing non-negative matrix factorization on the merged shards.

9. The computer-implemented system of claim 7, wherein equivalent shards of the two or more data sets comprise shards assigned data from separate data sets based on the same criteria.

10. The computer-implemented system of claim 9, wherein the criteria comprises an alphanumeric range that a hashed identifier for the data falls into.

11. The computer-implemented system of claim 7, wherein the two or more processors generate merged shards by merging the shards using a separate running process on each of the two or more processors for each of the merged shards, wherein each of the merged shards is generated using shards from more than one of the two or more data sets by merging a first set of equivalent shards from the shards on a first processor and merging a second set of equivalent shards from the shards on a second processor in parallel.

12. The computer-implemented system of claim 11, wherein the first of the two or more processors merges a first set of equivalent shards from the shards on the first processor further by joining the data in the equivalent shards, sorting the data in the equivalent shards by hashed identifier, and merging data for any duplicate hashed identifiers.

13. A system comprising: one or more computers and one or more non-transitory storage devices storing instructions which are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving two or more data sets of data for users wherein each of the two or more data sets belongs to a separate one of two or more groups; hashing user identifiers in the two or more data sets to generate hashed identifiers for the two or more data sets; replacing the user identifiers in the two or more data sets with the hashed identifiers; splitting each of the two or more data sets to generate shards, wherein each of the two or more data sets is split into the same number of shards by splitting each of the two more data sets such that the hashed identifiers that are common to two or more of the two or more data sets are in equivalent shards from the two or more of the two or more data sets; generating merged shards by merging the shards using a separate running process for each of the merged shards, wherein each of the merged shards is generated using shards from more than one of the two or more data sets; performing an operation on all of the merged shards wherein performing an operation on all of the merged shards comprises joining the merged shards into a merged data set; and training a machine learning system using the merged data set.

14. The system of claim 13, wherein the instructions that cause the one or more computers to perform operations comprising generating merged shards by merging the shards using a separate running process for each of the merged shards, wherein each of the merged shards is generated using shards from more than one of the two or more data sets further cause the one or more computers to perform operations comprising merging a first set of equivalent shards from the shards on a first processor and merging a second set of equivalent shards from the shards on a second processor in parallel.

15. The system of claim 14, wherein the instructions that cause the one or more computers to perform operations co
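
The merge step described in the claims (join equivalent shards, sort by hashed identifier, merge duplicates, with each merged shard handled by its own process, then join the merged shards into one data set) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: rows are assumed to be `(hashed_id, record)` tuples, duplicate identifiers are assumed to be merged by combining record fields with `dict.update`, and a `ProcessPoolExecutor` sized to the number of merged shards stands in for "a separate running process for each of the merged shards". The function names are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

Row = tuple[str, dict]  # assumed row layout: (hashed_id, record)


def merge_equivalent_shards(equivalent_shards: list[list[Row]]) -> list[Row]:
    """Join equivalent shards, sort by hashed identifier, and merge duplicates."""
    joined = [row for shard in equivalent_shards for row in shard]  # join
    joined.sort(key=lambda row: row[0])                             # sort by hashed id
    merged: list[Row] = []
    for hashed_id, record in joined:
        if merged and merged[-1][0] == hashed_id:
            # Duplicate identifier: combine fields (assumed merge rule).
            merged[-1][1].update(record)
        else:
            # Copy the record so the input shards are left unmodified.
            merged.append((hashed_id, dict(record)))
    return merged


def merge_all(shards_by_data_set: list[list[list[Row]]]) -> list[list[Row]]:
    """Merge shard i of every data set, one pool worker per merged shard."""
    if len({len(shards) for shards in shards_by_data_set}) > 1:
        raise ValueError("every data set must be split into the same number of shards")
    # Group shard 0 of each data set, shard 1 of each data set, and so on.
    equivalent_sets = [list(group) for group in zip(*shards_by_data_set)]
    if not equivalent_sets:
        return []
    with ProcessPoolExecutor(max_workers=len(equivalent_sets)) as pool:
        return list(pool.map(merge_equivalent_shards, equivalent_sets))


def join_merged_shards(merged_shards: list[list[Row]]) -> list[Row]:
    """Concatenate the merged shards into a single merged data set."""
    return [row for shard in merged_shards for row in shard]
```

Since each set of equivalent shards contains every row for the identifiers in its range, no worker needs data from another worker, which is what allows the merges to run in parallel. When calling `merge_all` from a script, place the call under an `if __name__ == "__main__":` guard, as the Python `multiprocessing` documentation recommends for start methods that re-import the main module.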

Assignees

Salesforce Com Inc, Salesforce Inc

Inventors

Classifications

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title

  • from multiple instruction streams, e.g. multistreaming · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Machine learning · CPC title

  • using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12) · CPC title


Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12236264B2 cover?
Systems, devices, and techniques are disclosed for data shards for distributed processing. Data sets of data for users may be received. The data sets may belong to separate groups. User identifiers in the data sets may be hashed to generate hashed identifiers for the data sets. The user identifiers in the data sets may be replaced with the hashed identifiers. The data sets may be split to gener…
Who is the assignee on this patent?
Salesforce Com Inc, Salesforce Inc
What technology area does this patent fall under?
Primary CPC classification G06F9/4881. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 25, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).