Control of document similarity determinations by respective nodes of a plurality of computing devices

US10642912B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10642912-B2
Application number: US-201615239521-A
Country: US
Kind code: B2
Filing date: Aug 17, 2016
Priority date: Aug 17, 2016
Publication date: May 5, 2020
Grant date: May 5, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques and systems are described to control a determination of document similarity. In one example, dimensionality of the documents is reduced through computation of a signature, e.g., via a hashing technique such as “minhashing” which is also known as min-wise independent permutations locality sensitive hashing. From these signatures, another hashing technique (e.g., locality sensitive hashing) is used to determine similarity of the signatures to each other. Identification of disjoint sets is then used as a basis to partition the documents for determination of document similarity by respective nodes of a plurality of computing devices. In this way, an amount of data shuffling between the nodes as part of the determination of document similarity may be reduced. In another example, a weighting is applied to attributes of documents as part of the determination of document similarity.
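The signature and weighting ideas in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the use of `hashlib.blake2b` as a seeded hash family, the `(word, occurrence)` tokenization, and the reading of a weight as "total number of copies of the word" are all assumptions made for illustration. Repeated words are kept as distinct tokens so that added instances actually change the signature.

```python
import hashlib
from collections import Counter

def apply_weight(words, attribute_words, weight):
    """Repeat each word describing the weighted attribute `weight` times (assumed meaning of weight)."""
    weighted = []
    for word in words:
        weighted.extend([word] * (weight if word in attribute_words else 1))
    return weighted

def occurrence_tokens(words):
    """Map a word list to a set of (word, k) pairs so that repeated words are not collapsed."""
    seen = Counter()
    tokens = set()
    for word in words:
        seen[word] += 1
        tokens.add((word, seen[word]))
    return tokens

def _seeded_hash(seed, token):
    """One member of a hypothetical hash family: a 64-bit BLAKE2b digest keyed by `seed`."""
    data = f"{seed}:{token[0]}:{token[1]}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(tokens, num_hashes=64):
    """Signature = minimum hash value per seed; assumes `tokens` is non-empty."""
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]

def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity of two token sets, for comparison with the estimate."""
    return len(a & b) / len(a | b)

doc_a = "red cotton shirt slim fit".split()
doc_b = "red wool shirt loose fit".split()

# Unweighted: 3 shared tokens out of 7 distinct tokens.
plain = jaccard(occurrence_tokens(doc_a), occurrence_tokens(doc_b))

# Weighting the colour attribute ("red") by 3 raises the overlap to 5 of 9 tokens.
weighted_a = occurrence_tokens(apply_weight(doc_a, {"red"}, 3))
weighted_b = occurrence_tokens(apply_weight(doc_b, {"red"}, 3))
weighted = jaccard(weighted_a, weighted_b)

# The signature-based estimate approximates the weighted Jaccard value.
estimate = estimate_similarity(minhash_signature(weighted_a), minhash_signature(weighted_b))
print(plain, weighted, estimate)
```

Each document is reduced to a fixed-length list of integers regardless of its size, which is the dimensionality reduction the abstract refers to. More hash functions give a tighter estimate at the cost of longer signatures.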

First claim

Opening claim text (preview).

What is claimed is:

1. In a digital medium environment to determine document similarity of a plurality of documents, a method implemented by at least one computing device, the method comprising: receiving, by the at least one computing device, an input specifying a similarity threshold, the similarity threshold defining a minimum number of a plurality of buckets that two said documents are both to be included in to be considered similar; generating, by the at least one computing device, a plurality of signature data from the plurality of documents using hashing, the plurality of signature data resulting in a reduction of dimensionality of the plurality of documents; hashing, by the at least one computing device, the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data; generating, by the at least one computing device, a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold; generating, by the at least one computing device, a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data; and assigning, by the at least one computing device, the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of respective ones of the filtered set of documents, to each other, within respective said partitions.

2. The method as described in claim 1, wherein the plurality of documents is configured as webpages, product descriptions, or social network communications.

3. The method as described in claim 1, further comprising extracting word data from each document of the plurality of documents and filtering the extracted word data to locate meaningful word data.

4. The method as described in claim 3, wherein the filtering of the extracted word data to locate meaningful word data includes removing word data included in a listing of rare or common word data.

5. The method as described in claim 1, wherein the input specifying the similarity threshold is a user input.

6. The method as described in claim 1, further comprising generating a recommendation by the at least one computing device based on the determination of similarity.

7. The method as described in claim 1, wherein the generating of the plurality of signatures is based at least in part on locality-sensitive hashing (LSH).

8. The method as described in claim 1, wherein the assigning includes applying a weight to an attribute described by words in respective ones of the plurality of documents.

9. The method as described in claim 8, wherein the applying of the weight includes adding additional instances of the words that describe the attribute to the respective ones of the plurality of documents based on the weight.

10. The method as described in claim 8, wherein the attribute and the weight are user specified via one or more inputs.

11. In a digital medium environment to determine document similarity, a method implemented by at least one computing device, the method comprising: receiving, by the at least one computing device, an input specifying a similarity threshold and a weight of an attribute, the similarity threshold defining a minimum number of a plurality of buckets that two documents of a plurality of documents are both to be included in to be considered similar; applying, by the at least one computing device, the weight to a word that corresponds to the attribute, the applying including adding additional instances of the word to the respective ones of the plurality of documents based on the weight; generating, by the at least one computing device using hashing, a plurality of signature data from the plurality of documents having the applied weighting; hashing, by the at least one computing device, the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data; generating, by the at least one computing device, a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold; generating, by the at least one computing device, a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data; and assigning, by the at least one computing device, the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of the filtered set of documents, to each other, within respective said partitions.

12. The method as described in claim 11, wherein the indication of the similarity threshold data and the weight are user specified.

13. In a digital medium environment to determine document similarity, a system comprising: a processing system; and a computer-readable storage medium having instructions stored thereon that, responsive to execution by the processing system, causes the processing system to perform operations comprising: receiving an input specifying a similarity threshold, the similarity threshold defining a minimum number of a plurality of buckets that two documents, of a plurality of documents, are to be included in to be considered similar; generating a plurality of signature data from the plurality of documents using hashing; hashing the plurality of documents into respective ones of the plurality of buckets based on the plurality of signature data; generating a filtered set of documents by removing first and second said documents from the plurality of documents that are not considered similar based on the similarity threshold; generating a plurality of partitions from the filtered set of documents based on inclusion in a respective disjoint set of data of a plurality of disjoint sets of data; and assigning the plurality of partitions to respective nodes of a plurality of computing devices to determine document similarity of respective ones of the filtered set of documents, to each other, within respective said partitions.

14. The system as described in claim 13, wherein the operations further comprising: extracting word data from each document of the plurality of documents; and filtering the extracted word data to locate meaningful word data.

15. The system as described in claim 14, wherein the filtering of the extracted word data to locate meaningful words includes removing word data included in a listing of rare or common word data.

16. The system as described in claim 14, wherein the generating includes applying a weight to an attribute described by word data in respective ones of the plurality of documents to control the determination of document similarity.

17. The system as described in claim 16, wherein the application of the weight includes adding additional instances of the word data that describe the attribute to the respective ones of the plurality of documents based on the weight.

18. The system as described in claim 16, wherein the attribute and the weight are user specified via one or more inputs.

19. The method as described in claim 11, further comprising generating a recommendation based on the determination of document similarity.

20. The system as described in claim 16, the operations further comprising generating a recommendation based on the determination of document similarity.
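The bucketing, filtering, partitioning, and assignment steps recited in the independent claims can be sketched as follows. This is one possible reading, not the claimed implementation: the banding scheme, the union-find structure, the round-robin node assignment, and all function names are assumptions. Signatures are hand-written small integer lists so the example is deterministic. Documents that share at least `threshold` buckets with some other document are kept, and connected groups of such documents form the disjoint sets.

```python
from collections import defaultdict
from itertools import combinations

def band_buckets(signatures, bands):
    """Split each signature into `bands` slices; each (band index, slice) pair is a bucket key."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    return buckets

def shared_bucket_counts(buckets):
    """Count, for every pair of documents, how many buckets contain both."""
    counts = defaultdict(int)
    for docs in buckets.values():
        for pair in combinations(sorted(docs), 2):
            counts[pair] += 1
    return counts

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def partition(signatures, bands, threshold):
    """Keep pairs sharing at least `threshold` buckets and group them into disjoint sets."""
    counts = shared_bucket_counts(band_buckets(signatures, bands))
    parent = {}
    for (a, b), shared in counts.items():
        if shared < threshold:
            continue  # pair is not considered similar; it contributes nothing
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = defaultdict(set)
    for doc in parent:
        groups[find(parent, doc)].add(doc)
    return sorted(groups.values(), key=min)

def assign_to_nodes(partitions, num_nodes):
    """Round-robin assignment of whole partitions to node indices (an assumed policy)."""
    assignment = defaultdict(list)
    for i, part in enumerate(partitions):
        assignment[i % num_nodes].append(part)
    return dict(assignment)

# Hypothetical 4-value signatures, split into 2 bands of 2 rows each.
signatures = {
    "d1": [1, 2, 3, 4],
    "d2": [1, 2, 3, 4],  # shares both bands with d1
    "d3": [1, 2, 9, 9],  # shares the first band with d1 and d2
    "d4": [7, 7, 5, 6],
    "d5": [8, 8, 5, 6],  # shares the second band with d4
    "d6": [0, 0, 0, 0],  # shares no bucket with any other document
}

partitions = partition(signatures, bands=2, threshold=1)
print(partitions)                      # two groups: d1/d2/d3 and d4/d5; d6 is filtered out
print(assign_to_nodes(partitions, 2))  # one partition per node
```

Because every candidate pair falls entirely inside one partition, each node can compare its documents locally, which is how the abstract's reduction in data shuffling between nodes would arise. Raising `threshold` to 2 in this example leaves only the d1/d2 pair.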

Assignees

Adobe Inc

Inventors

Classifications

  • Search customisation based on user profiles and personalisation · CPC title

  • Physics · mapped topic

  • Business processes related to social networking or social networking services · CPC title

  • G06F16/325 · Primary

    Hash tables · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10642912B2 cover?
Techniques and systems are described to control a determination of document similarity. In one example, dimensionality of the documents is reduced through computation of a signature, e.g., via a hashing technique such as “minhashing” which is also known as min-wise independent permutations locality sensitive hashing. From these signatures, another hashing technique (e.g., locality sensitive has…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/9535. Mapped technology areas include Physics.
When was this patent published?
Publication date May 5, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).