Systems and methods for document analysis to produce, consume and analyze content-by-example logs for documents
US-2023419026-A1 · Dec 28, 2023 · US
US12585707B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12585707-B2 |
| Application number | US-202217851506-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 28, 2022 |
| Priority date | Jun 28, 2022 |
| Publication date | Mar 24, 2026 |
| Grant date | Mar 24, 2026 |
Official abstract text for this publication.
Document analysis systems and methods for the generation of a content-by-example log that expresses withheld documents in terms of a set of disclosed documents are disclosed. Additionally, document analysis systems and methods for the analysis of such a content-by-example log to determine withheld documents of interest without access to those withheld documents are disclosed.
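As a rough illustration of the idea in the abstract, the sketch below represents a content-by-example log as a mapping from withheld-document identifiers to identifiers of disclosed example documents, then derives one feature vector per withheld identifier from the text of those examples. The document texts, the identifiers (`D1`, `W1`, and so on), the TF-IDF-style weighting, and the choice to average the example vectors are all assumptions made for illustration; the patent text shown here requires only a weighted set of text-based features and does not prescribe a weighting or a way to combine them.

```python
import math
from collections import Counter

# Hypothetical disclosed documents the receiving party can read (id -> text).
DISCLOSED = {
    "D1": "merger pricing negotiation memo",
    "D2": "pricing schedule for the merger",
    "D3": "holiday party catering invoice",
}

# Hypothetical content-by-example log: withheld id -> ids of example documents.
CBE_LOG = {
    "W1": ["D1", "D2"],
    "W2": ["D3"],
}


def tfidf_vectors(docs):
    """Return a weighted bag-of-words vector (term -> weight) per document."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokens)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        vectors[doc_id] = {
            term: (count / len(words))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def build_feature_vector_index(log, doc_vectors):
    """Average the example-document vectors listed for each withheld identifier."""
    index = {}
    for withheld_id, example_ids in log.items():
        combined = Counter()
        for example_id in example_ids:
            for term, weight in doc_vectors[example_id].items():
                combined[term] += weight / len(example_ids)
        index[withheld_id] = dict(combined)
    return index


index = build_feature_vector_index(CBE_LOG, tfidf_vectors(DISCLOSED))
```

In this sketch the receiving party never touches the withheld text: `index["W1"]` is built only from `D1` and `D2`, which it already holds.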
Opening claim text (preview).
What is claimed is:

1. A system for document analysis, comprising:
a processor;
a non-transitory computer readable medium, comprising instructions for:
receiving, by a receiving party, a content-by-example log, the content-by-example log including an entry for each withheld document of a set of withheld documents, wherein the set of withheld documents is inaccessible to the receiving party, wherein the entry for each withheld document associates an identifier for the corresponding withheld document with example identifiers for a set of example documents, wherein the set of example documents exemplify the corresponding withheld document, wherein the set of example documents are disclosed documents accessible to the receiving party;
storing the content-by-example log at a data store;
analyzing the content-by-example log to determine identifiers of withheld documents of interest by:
transforming the content-by-example log into a feature vector index, wherein the feature vector index comprises: a feature vector associated with each of the identifiers of the withheld documents, wherein the feature vector comprises: a set of features determined from the set of example documents for the corresponding withheld document, wherein creating a respective feature vector for each withheld document comprises: generating a document feature vector for each example document identified in the content-by-example log as being associated with the withheld document, wherein the document feature vector comprises a weighted set of text based features determined from the respective example document, wherein the feature vector associates with the identifier for the withheld document in the feature vector index by using the document feature vectors generated for each example document of the set of example documents; and
determining the identifiers of withheld documents of interest based on the feature vector index by:
obtaining labels associated with identifiers of withheld documents;
obtaining a supervised machine learning model trained at a first time based on obtained labels for documents;
further training, at a second time after the first time, the supervised machine learning model using newly obtained labels for withheld documents identified in the content-by-example log, wherein the further training at the second time of the supervised machine learning model utilizes features determined from the example documents and provided by the feature vector index, wherein the features are associated with the withheld documents;
ranking identifiers for withheld documents of the content-by-example log based on the feature vector index using the further trained supervised machine learning model; and
selecting a number of top ranked identifiers of withheld documents as identifiers of the set of withheld documents of interest; and
generating requests for a producing party having the set of withheld documents, wherein the generated requests correspond to at least some of the set of withheld documents of interest by specifying a subset of identifiers of the at least some of the set of withheld documents of interest.

2. The system of claim 1, wherein determining the identifiers of withheld documents of interest comprises: searching the identifiers for the withheld documents using the feature vector index based on a query to rank the identifiers for the withheld documents; and selecting a number of top ranked identifiers of withheld documents as identifiers of the set of withheld documents of interest.

3. The system of claim 2, wherein the query is determined from content associated with the disclosed documents accessible by the receiving party.

4. The system of claim 1, wherein the features of the feature vector are the identifiers of the set of example documents.

5. The system of claim 1, wherein determining the identifiers of withheld documents of interest comprises: generating a set of clusters of identifiers of withheld documents by clustering the identifiers for the withheld documents included in the content-by-example log based on the feature vector index; selecting an identifier from each of the set of clusters of identifiers of withheld documents as identifiers of the set of withheld documents of interest.

6. The system of claim 5, wherein the identifier is selected from a cluster of the set of clusters based on a distance of that identifier from a centroid of that cluster.

7. A method for document analysis, comprising:
receiving, by a receiving party, a content-by-example log, the content-by-example log including an entry for each withheld document of a set of withheld documents, wherein the set of withheld documents is inaccessible to the receiving party, wherein the entry for each withheld document associates an identifier for the corresponding withheld document with example identifiers for a set of example documents, wherein the set of example documents exemplify the corresponding withheld document, wherein the set of example documents are disclosed documents accessible to the receiving party;
storing the content-by-example log at a data store;
analyzing the content-by-example log to determine identifiers of withheld documents of interest by:
transforming the content-by-example log into a feature vector index, wherein the feature vector index comprises: a feature vector associated with each of the identifiers of the withheld documents, wherein the feature vector comprises: a set of features determined from the set of example documents for the corresponding withheld document, wherein creating a respective feature vector for each withheld document comprises: generating a document feature vector for each example document identified in the content-by-example log as being associated with the withheld document, wherein the document feature vector comprises a weighted set of text based features determined from the respective example document, wherein the feature vector associates with the identifier for the withheld document in the feature vector index by using the document feature vectors generated for each example document of the set of example documents; and
determining the identifiers of withheld documents of interest based on the feature vector index by:
obtaining labels associated with identifiers of withheld documents;
obtaining a supervised machine learning model trained at a first time based on obtained labels for documents;
further training, at a second time after the first time, the supervised machine learning model using newly obtained labels for withheld documents identified in the content-by-example log, wherein the further training at the second time of the supervised machine learning model utilizes features determined from the example documents and provided by the feature vector index, wherein the features are associated with the withheld documents;
ranking identifiers for withheld documents of the content-by-example log based on the feature vector index using the further trained supervised machine learning model; and
selecting a number of top ranked identifiers of withheld documents as identifiers of the set of withheld documents of interest; and
generating requests for a producing party having the set of withheld documents, wherein the generated requests correspond to at least some of the set of withheld documents of interest by specifying a subset of identifiers of the at least some of the set of withheld documents of interest.

8. The method of claim 7, wherein determining the identifiers of withheld documents of interest comprises: searching the identifiers for the withheld documents using the feature vector index based on a query to rank the identifiers for the withheld documents; and select
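The train, further-train, rank, and select steps recited in the independent claims can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claims call only for "a supervised machine learning model", so the logistic model trained by stochastic gradient descent, the function names (`score`, `train`, `top_ranked`), the toy feature vector index, and the labels are all hypothetical.

```python
import math


def score(weights, features):
    """Logistic score in (0, 1) for a sparse feature vector (term -> value)."""
    z = sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(weights, labeled, index, epochs=200, lr=0.5):
    """Continue gradient training of `weights` on (identifier, label) pairs."""
    for _ in range(epochs):
        for doc_id, label in labeled:
            features = index[doc_id]
            error = label - score(weights, features)
            for term, value in features.items():
                weights[term] = weights.get(term, 0.0) + lr * error * value
    return weights


def top_ranked(weights, index, k):
    """Rank withheld identifiers by model score and keep the top k."""
    ranked = sorted(
        index, key=lambda doc_id: score(weights, index[doc_id]), reverse=True
    )
    return ranked[:k]


# Hypothetical feature vector index: withheld id -> features from its examples.
index = {
    "W1": {"merger": 1.0, "pricing": 1.0},
    "W2": {"catering": 1.0, "invoice": 1.0},
    "W3": {"merger": 1.0, "memo": 1.0},
    "W4": {"invoice": 1.0, "party": 1.0},
}

# First time: train on the labels available so far (1 = of interest).
weights = train({}, [("W1", 1), ("W2", 0)], index)
# Second time: further train the same model on newly obtained labels.
weights = train(weights, [("W3", 1), ("W4", 0)], index)

# Identifiers to name in a request to the producing party.
request_ids = top_ranked(weights, index, k=2)
```

Because `train` updates the existing weights rather than starting over, the second call plays the role of the "further training at a second time"; the resulting `request_ids` list contains only identifiers, which is all a request to the producing party needs to specify.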
Query processing · CPC title
Office automation; Time management · CPC title
Legal services · CPC title
using vector based model · CPC title
Document management systems · CPC title