Systems for generating indications of relationships between electronic documents

US12198459B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12198459-B2
Application number  US-202117534744-A
Country             US
Kind code           B2
Filing date         Nov 24, 2021
Priority date       Nov 24, 2021
Publication date    Jan 14, 2025
Grant date          Jan 14, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In implementations of systems for generating indications of relationships between electronic documents, a processing device implements a relationship system to segment text of electronic documents included in a document corpus into segments. The relationship system determines a subset of the electronic documents that includes electronic document pairs having a number of similar segments that is greater than a threshold number. The similar segments are identified using locality sensitive hashing. The electronic document pairs are classified as related documents or unrelated documents using a machine learning model that receives a pair of electronic documents as an input and generates an indication of a classification for the pair of electronic documents as an output. Indications of relationships between particular electronic documents included in the subset are generated based at least partially on the electronic document pairs that are classified as related documents.
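
The candidate-selection step described in the abstract (segment the text, find similar segments with locality sensitive hashing, keep document pairs above a threshold) can be illustrated with a minimal sketch. This is not the patented implementation: the sentence-based segmentation, the word-shingle MinHash signature, the banding parameters, and every name below (`segment`, `minhash`, `candidate_pairs`, `NUM_HASHES`, `BANDS`, `THRESHOLD`) are assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32  # assumed MinHash signature length
BANDS = 16       # assumed number of LSH bands (2 rows per band)
THRESHOLD = 1    # assumed: keep pairs with MORE than this many similar segments


def segment(text):
    """Split text into sentence-like segments (an assumed segmentation rule)."""
    return [s.strip().lower() for s in re.split(r"[.!?]\s*", text) if s.strip()]


def shingles(seg, k=3):
    """Return the set of k-word shingles of a segment."""
    words = seg.split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(seg):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    sh = shingles(seg)
    return tuple(
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in sh
        )
        for seed in range(NUM_HASHES)
    )


def candidate_pairs(docs):
    """Map (doc_a, doc_b) -> number of similar segment pairs, above THRESHOLD."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)  # (band index, band values) -> {(doc_id, segment index)}
    for doc_id, text in docs.items():
        for idx, seg in enumerate(segment(text)):
            sig = minhash(seg)
            for b in range(BANDS):
                buckets[(b, sig[b * rows:(b + 1) * rows])].add((doc_id, idx))

    similar = defaultdict(set)  # (doc_a, doc_b) -> matched (segment, segment) index pairs
    for members in buckets.values():
        # Sorting keeps each document pair key in a consistent order.
        for (da, ia), (db, ib) in combinations(sorted(members), 2):
            if da != db:
                similar[(da, db)].add((ia, ib))
    return {pair: len(m) for pair, m in similar.items() if len(m) > THRESHOLD}
```

In this sketch, two segments count as similar when their signatures agree on at least one full band, so only documents that share a bucket are ever compared. The surviving pairs would then be passed to the classifier that the abstract describes, which is not modeled here.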

First claim

Opening claim text (preview).

What is claimed is:

1. In a digital medium environment, a method implemented by a processing device, the method comprising: segmenting, by the processing device, text of electronic documents included in a document corpus into segments; determining, by the processing device, a subset of the electronic documents that includes electronic document pairs having a number of similar segments that is greater than a threshold number, the similar segments identified using locality sensitive hashing; classifying, by the processing device and using a machine learning model, the electronic document pairs as semantically similar documents or not semantically similar documents, the machine learning model being used to receive a pair of electronic documents as an input and generate an indication of a classification for the pair of electronic documents as an output; computing, by the processing device, containment scores for the electronic document pairs based on the number of the similar segments and a length of a shorter electronic document included in each of the electronic document pairs; and generating, by the processing device, indications of relationships between particular electronic documents included in the subset based at least partially on the electronic document pairs that are classified as semantically similar documents and the containment scores.

2. The method as described in claim 1, wherein the relationships between the particular electronic documents include a version relationship, an aggregation relationship, a repurposed relationship, or a similarity relationship.

3. The method as described in claim 1, further comprising determining, by the processing device, a maximum spanning tree from a graph that includes a node for each electronic document included in the electronic document pairs that are classified as semantically similar documents, and the indications of the relationships between the particular electronic documents are generated at least partially based on the maximum spanning tree.

4. The method as described in claim 3, wherein the nodes included in the graph are connected by edges, the edges having weights, and the weights of the edges being based on insertions and deletions in the electronic document pairs that are classified as semantically similar documents.

5. The method as described in claim 1, wherein the machine learning model is trained to classify the electronic document pairs as semantically similar documents or not semantically similar documents using training data that describes two-dimensional heatmaps generated from pairs of electronic document training samples.

6. The method as described in claim 5, wherein the two-dimensional heatmaps include first two-dimensional heatmaps for lexical similarity between the segments that are included in the pairs of the electronic document training samples and second two-dimensional heatmaps for Jaccard similarity between entities included in the segments.

7. The method as described in claim 1, further comprising generating, by the processing device, indications of semantic similarity for electronic documents included in the subset using a hierarchical attention network trained on training data to receive first and second electronic documents as an input and generate an indication of semantic similarity for the first and second electronic documents as an output.

8. The method as described in claim 7, further comprising clustering, by the processing device, the electronic documents included in the subset into similarity groups based on the indications of semantic similarity, and the indications of the relationships between the particular electronic documents are generated at least partially based on the similarity groups.

9. The method as described in claim 1, wherein the indications of the relationships between the particular electronic documents include at least one of a change summary, an explanation of similarity, or a relative ordering between the particular electronic documents.

10. One or more computer-readable storage media comprising instructions stored thereon that, responsive to execution by a processing device, causes the processing device to perform operations including: segmenting text of electronic documents included in a document corpus into segments; determining a subset of the electronic documents that includes electronic document pairs having a number of similar segments that is greater than a threshold number, the similar segments identified using locality sensitive hashing; classifying, using a machine learning model, the electronic document pairs as semantically similar documents or not semantically similar documents, the machine learning model being used to receive a pair of electronic documents as an input and generate an indication of a classification for the pair of electronic documents as an output; computing containment scores for the electronic document pairs based on the number of the similar segments and a length of a shortest electronic document included in each of the electronic document pairs; forming a graph having a node for each electronic document included in the electronic document pairs that are classified as semantically similar documents; determining a maximum spanning tree from the graph; and generating indications of relationships between particular electronic documents included in the subset based at least partially on the maximum spanning tree and the containment scores.

11. The one or more computer-readable storage media as described in claim 10, wherein the relationships between the particular electronic documents include at least one of a version relationship, an aggregation relationship, a repurposed relationship, or a similarity relationship.

12. The one or more computer-readable storage media as described in claim 10, wherein the operations further include generating indications of semantic similarity for electronic documents included in the subset using a hierarchical attention network trained on training data to receive first and second electronic documents as an input and generate an indication of semantic similarity for the first and second electronic documents as an output.

13. The one or more computer-readable storage media as described in claim 12, wherein the operations further include clustering the electronic documents included in the subset into similarity groups based on the indications of semantic similarity, and the indications of the relationships between the particular electronic documents are generated at least partially based on the similarity groups.

14. A system comprising: a processing device; and computer-readable storage media storing instructions that are executable by the processing system to perform operations including: segmenting text of electronic documents included in a document corpus into segments; determining a subset of the electronic documents that includes electronic document pairs having a number of similar segments that is greater than a threshold number, the similar segments identified using locality sensitive hashing; classifying, using a machine learning model, the electronic document pairs as semantically similar documents or not semantically similar documents, the machine learning model being used to receive a pair of electronic documents as an input and generate an indication of a classification for the pair of electronic documents as an output; computing containment scores for the electronic document pairs based on the number of the similar segments and a length of a shorter electronic document included in each of the electronic document pairs; and generating indicat
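
Two steps recited in the claims, the containment score (claims 1, 10, 14) and the maximum spanning tree over related documents (claims 3 and 10), can be sketched as follows. The claims say only that the score is based on the number of similar segments and the length of the shorter document, so the ratio used here is an assumed formula. The edge weights are taken as given inputs; claim 4 says they are based on insertions and deletions but the preview does not specify how. The function names and the Kruskal-style construction are illustrative choices, not the patent's stated method.

```python
def containment(num_similar_segments, len_a, len_b):
    """Assumed formula: similar segments divided by the shorter document's length.

    Lengths are measured in segments here, which is also an assumption.
    """
    shorter = min(len_a, len_b)
    return num_similar_segments / shorter if shorter else 0.0


def maximum_spanning_tree(nodes, edges):
    """Kruskal's algorithm with edges visited in descending weight order.

    `edges` is a list of (u, v, weight) tuples. Returns the chosen edges;
    if the graph is disconnected, the result is a maximum spanning forest.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge does not create a cycle
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# Hypothetical example: three drafts of one document with made-up weights.
nodes = ["draft_v1", "draft_v2", "draft_v3"]
edges = [
    ("draft_v1", "draft_v2", 0.9),
    ("draft_v2", "draft_v3", 0.8),
    ("draft_v1", "draft_v3", 0.2),
]
tree = maximum_spanning_tree(nodes, edges)
```

With these made-up weights the tree keeps the two strongest edges and drops the weak `draft_v1` to `draft_v3` link, which is one way a chain of versions could be recovered from pairwise classifications.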

Assignees

Adobe Inc

Classifications

  • Classification techniques · CPC title

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title

  • G06V30/418 · Primary

    Document matching, e.g. of document images · CPC title

  • Syntactic or semantic context, e.g. balancing · CPC title

  • G06V30/413 · Primary

    Classification of content, e.g. text, photographs or tables · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12198459B2 cover?
In implementations of systems for generating indications of relationships between electronic documents, a processing device implements a relationship system to segment text of electronic documents included in a document corpus into segments. The relationship system determines a subset of the electronic documents that includes electronic document pairs having a number of similar segments that is…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06V30/418. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 14, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).