Document similarity analysis
US-2018300296-A1 · Oct 18, 2018 · US
US12499703B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12499703-B2 |
| Application number | US-202318193669-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 31, 2023 |
| Priority date | Dec 30, 2022 |
| Publication date | Dec 16, 2025 |
| Grant date | Dec 16, 2025 |
Official abstract text for this publication.
The disclosure includes a system and method for obtaining a document specification in an electronic format, wherein the document specification is associated with a first document, and describes features present in valid instances of the first document; determining a set of labels describing the first document from the document specification; obtaining one or more digital images of at least one valid instance of the first document from the document specification; obtaining information describing a set of bounding boxes resulting from application, to the one or more images of the at least one valid instance of the first document, of one or more of optical character recognition and object detection; generating a set of derived checks based on the set of bounding boxes; and generating a document assembly object describing valid instances of the document and the set of derived checks usable to determine validity of a document under test.
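The flow in the abstract (bounding boxes in, derived checks and a document assembly object out) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the patent: the `BoundingBox` fields, the normalized coordinates, the single "presence and location" check type, the default tolerance of 0.05, and the JSON-friendly dictionary layout of the assembly object.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BoundingBox:
    """Hypothetical box as OCR or object detection might report it."""
    label: str          # e.g. "name_field" (illustrative label)
    x: float            # left edge, normalized to image width
    y: float            # top edge, normalized to image height
    width: float
    height: float
    text: str = ""      # recognized text, if any


def derive_checks(boxes, tolerance=0.05):
    """Turn each box from a valid instance into one presence-and-location check."""
    return [
        {
            "type": "presence_and_location",
            "label": box.label,
            "expected": {"x": box.x, "y": box.y},
            "tolerance": tolerance,
        }
        for box in boxes
    ]


def build_assembly_object(document_name, labels, boxes, tolerance=0.05):
    """Bundle labels, boxes, and derived checks into one serializable object."""
    return {
        "document": document_name,
        "labels": sorted(labels),
        "bounding_boxes": [asdict(box) for box in boxes],
        "derived_checks": derive_checks(boxes, tolerance),
    }


def run_check(check, observed_boxes):
    """Pass if a box with the expected label sits within tolerance of the expected spot."""
    expected, tol = check["expected"], check["tolerance"]
    return any(
        box.label == check["label"]
        and abs(box.x - expected["x"]) <= tol
        and abs(box.y - expected["y"]) <= tol
        for box in observed_boxes
    )


if __name__ == "__main__":
    valid_boxes = [BoundingBox("name_field", 0.10, 0.20, 0.30, 0.05, "SAMPLE")]
    assembly = build_assembly_object("sample_id", {"name_field"}, valid_boxes)
    print(json.dumps(assembly, indent=2))
```

Serializing the object as JSON is one way to keep it readable by both people and programs; the patent itself does not prescribe a format.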
Opening claim text (preview).
What is claimed is:
1. A method comprising: obtaining, using one or more processors, a document specification in an electronic format, wherein the document specification is associated with a first document, and describes features present in valid instances of the first document; determining, using the one or more processors, a set of labels describing the first document from the document specification; obtaining, using the one or more processors, one or more digital images of at least one valid instance of the first document from the document specification; obtaining, using the one or more processors, information describing a set of bounding boxes, the set of bounding boxes resulting from one or more of optical character recognition and object detection, the one or more of the optical character recognition and the object detection applied to the one or more images of the at least one valid instance of the first document; generating, using the one or more processors, a set of derived validity checks based on the set of bounding boxes; and generating, using the one or more processors, a document assembly object describing valid instances of the first document and the set of derived validity checks usable to determine whether an image of a document under test represents a valid instance of the first document.
2. The method of claim 1, the method further comprising: obtaining a set of test images representing multiple instances of the first document, the set of test images including a first test image; determining, based on a first derived validity check in the document assembly object, whether each test image in the set of test images is valid with respect to the first derived validity check or invalid with respect to the first derived validity check; and adjusting how subsequent determinations are made based on a presence of a false positive or false negative in a determination of the first test image with respect to the first derived validity check.
3. The method of claim 1, wherein adjusting how subsequent determinations are made includes one or more of: retraining a machine learning model associated with a first derived validity check to reduce an instance of a false positive or a false negative; and adjusting a tolerance.
4. The method of claim 1, the method further comprising: obtaining a set of valid document images, wherein each image in the set of valid document images represents a valid instance of the first document; applying pattern recognition to the set of valid document images; generating, based on a first detected pattern, a newly derived validity check; and adding the newly derived validity check to the document assembly object.
5. The method of claim 4, wherein the newly derived validity check is associated with an unpublished security feature present in the first document.
6. The method of claim 4, wherein the pattern recognition identifies a repetition in at least a portion of personally identifiable information (PII) text between two or more bounding boxes associated with a common, valid document instance in the set of valid document images, and wherein the newly derived validity check, when applied to the image of the document under test, checks for one or more of: whether a bounding box, which is associated with at least a partial repetition of PII in valid instances of the first document, is present in the document under test; whether the bounding box, which is associated with at least a partial repetition of PII in valid instances of the first document, in the document under test is in a location consistent with valid instances of the first document; and whether text content of the bounding box repeats a portion of PII text found elsewhere in the document under test that is consistent with valid instances of the first document.
7. The method of claim 1, wherein the set of bounding boxes includes a first bounding box that is associated with a ghost image.
8. The method of claim 1, wherein the set of bounding boxes includes a first bounding box that is associated with at least a partial repetition of PII in valid instances of the first document, is undiscernible to an average human eye absent magnification.
9. The method of claim 1, wherein the electronic format is one of hypertext markup language and printable document format and published by a trusted source.
10. The method of claim 1, wherein the document assembly object is human and machine readable.
11. A system comprising: a processor; and a memory, the memory storing instructions that, when executed by the processor, cause the system to: obtain a document specification in an electronic format, wherein the document specification is associated with a first document, and describes features present in valid instances of the first document; determine a set of labels describing the first document from the document specification; obtain one or more digital images of at least one valid instance of the first document from the document specification; obtain information describing a set of bounding boxes, the set of bounding boxes resulting from one or more of optical character recognition and object detection, the one or more of the optical character recognition and the object detection applied to the one or more images of the at least one valid instance of the first document; generate a set of derived validity checks based on the set of bounding boxes; and generate a document assembly object describing valid instances of the first document and the set of derived validity checks usable to determine whether an image of a document under test represents a valid instance of the first document.
12. The system of claim 11, wherein the instructions, when executed, cause the system to: obtain a set of test images representing multiple instances of the first document, the set of test images including a first test image; determine, based on a first derived validity check in the document assembly object, whether each image in the set of test images is valid with respect to the first derived validity check or invalid with respect to the first derived validity check; and adjust how subsequent determinations are made based on a presence of a false positive or false negative in a determination of the first test image with respect to the first derived validity check.
13. The system of claim 11, wherein adjusting how subsequent determinations are made includes one or more of: retraining a machine learning model associated with a first derived validity check to reduce an instance of a false positive or a false negative; and adjusting a tolerance.
14. The system of claim 11, wherein the instructions, when executed, cause the system to: obtain a set of valid document images, wherein each image in the set of valid document images represents a valid instance of the first document; apply pattern recognition to the set of valid document images; generate, based on a first detected pattern, a newly derived validity check; and add the newly derived validity check to the document assembly object.
15. The system of claim 14, wherein the newly derived validity check is associated with an unpublished security feature present in the first document.
16. The system of claim 14, wherein the pattern recognition identifies a repetition in at least a portion of personally identifiable information (PII) text between two or more bounding boxes associated with a common, valid document instance in the set of valid document images, and wherein the newly derived v
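The feedback step recited in claims 2, 3, 12, and 13 (run a derived validity check over test images, look for false positives or false negatives, then adjust a tolerance) can be sketched as follows. This is a hedged illustration only: the idea of a per-image "deviation" score, the rule that an image is predicted valid when its deviation is at or below the tolerance, the fixed step size, and the function names are all assumptions, not details from the claims.

```python
def count_errors(deviations, truths, tolerance):
    """Return (false_positives, false_negatives) for one derived check.

    deviations: hypothetical per-image distance from the expected feature.
    truths: True where the test image is known to be a valid instance.
    An image is predicted valid when its deviation is <= tolerance.
    """
    false_pos = sum(1 for d, ok in zip(deviations, truths) if d <= tolerance and not ok)
    false_neg = sum(1 for d, ok in zip(deviations, truths) if d > tolerance and ok)
    return false_pos, false_neg


def adjust_tolerance(deviations, truths, tolerance, step=0.01, max_rounds=100):
    """Nudge the tolerance until false positives and false negatives balance."""
    for _ in range(max_rounds):
        false_pos, false_neg = count_errors(deviations, truths, tolerance)
        if false_pos > false_neg:
            tolerance -= step   # too permissive: tighten the check
        elif false_neg > false_pos:
            tolerance += step   # too strict: loosen the check
        else:
            break               # errors are balanced (possibly both zero)
    return tolerance


if __name__ == "__main__":
    deviations = [0.01, 0.02, 0.08, 0.09]
    truths = [True, True, False, False]
    tuned = adjust_tolerance(deviations, truths, tolerance=0.10)
    print(tuned, count_errors(deviations, truths, tuned))
```

Adjusting a tolerance is only one of the options the claims list; retraining a machine learning model associated with the check is the other, and it is not shown here.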
Validation; Performance evaluation · CPC title
Classification, e.g. identification · CPC title
Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title
Classification techniques · CPC title
Determination of region of interest · CPC title