Semantic difference characterization for documents

US12086551B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12086551-B2
Application number: US-202117356037-A
Country: US
Kind code: B2
Filing date: Jun 23, 2021
Priority date: Jun 23, 2021
Publication date: Sep 10, 2024
Grant date: Sep 10, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer implemented method determines differences between documents. The method includes parsing a first document and a second document into respective distinct instances of content. The distinct instances of content are classified into different categories. Category specific matching algorithms are applied to each of the respective instances of content to determine a similarity score for each of the respective instances of content. Semantic differences between the first document and the second document are analyzed as a function of the similarity scores. A characterization of the semantic differences is generated.
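As a rough illustration of the step in which category specific matching algorithms score instances of content, the sketch below dispatches each pair of same-category instances to a scoring function for that category. It is a minimal sketch under stated assumptions, not the patented implementation: the `Instance` type, the `MATCHERS` registry, and both scoring functions are hypothetical names introduced here, and simple string heuristics (`difflib` and a Jaccard overlap) stand in for the trained machine learning models recited in the claims. Images are left out of the registry for brevity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


@dataclass
class Instance:
    """One parsed instance of content (hypothetical, simplified type)."""
    category: str  # e.g. "text" or "table"
    content: str   # assumption: every instance carries a string payload


def text_similarity(a: str, b: str) -> float:
    # Stand-in for a trained text model: character-level similarity ratio.
    return SequenceMatcher(None, a, b).ratio()


def table_similarity(a: str, b: str) -> float:
    # Stand-in for a trained table model: Jaccard overlap of "|"-separated cells.
    cells_a, cells_b = set(a.split("|")), set(b.split("|"))
    union = cells_a | cells_b
    return len(cells_a & cells_b) / len(union) if union else 1.0


# Registry mapping each category to its own matching function.
MATCHERS: Dict[str, Callable[[str, str], float]] = {
    "text": text_similarity,
    "table": table_similarity,
}


def score_pairs(doc_a: List[Instance], doc_b: List[Instance]) -> List[Tuple[int, int, float]]:
    """Score every pair of same-category instances across the two documents."""
    scores = []
    for i, a in enumerate(doc_a):
        for j, b in enumerate(doc_b):
            if a.category == b.category and a.category in MATCHERS:
                scores.append((i, j, MATCHERS[a.category](a.content, b.content)))
    return scores
```

Keeping the scorers in a registry means a new category, such as images compared by embedding, could be added without changing the pairing loop.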

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer implemented method of determining differences between documents, the method comprising: parsing a first document and a second document into respective distinct instances of content; classifying the distinct instances of content into different semantic categories including text, images, and tables; applying category specific matching algorithms to content within each of the respective instances of content to determine a similarity score for each of the respective instances of content to match the respective instances, wherein the category specific category matching algorithms comprise machine learning models trained on labeled respective category training data; analyzing semantic differences between the content within matching respective instances of the first document and the second document as a function of the similarity scores; and generating a characterization of the semantic differences.

2. The method of claim 1 wherein generating a characterization of the semantic differences comprises generating a difference label for pairs of respective instances of matched content.

3. The method of claim 1 wherein generating a characterization of the semantic differences of the degree of differences comprises generating added and removed labels for respective instances of content for unmatched content.

4. The method of claim 1 wherein the semantic differences comprise added, re-ordered, deleted, and modified, and where generating a characterization of the semantic differences comprises generating a count of the semantic differences for each type of semantic difference.

5. The method of claim 1 wherein the similarity score for respective instances of content is determined as a function of similarity of the respective instances of content and similarity of context of the respective instances of content.

6. The method of claim 1 wherein classifying the distinct instances of content into different categories comprises classifying the instances of content into one of a text, an image, or a table category.

7. The method of claim 6 wherein text is further classified into section headings, sections, headers, footers, titles, authors, references, and captions.

8. The method of claim 1 wherein the similarity score of the respective instances of content is a function of each respective instance of content's position with respect to other local identified instances of content.

9. The method of claim 8 wherein image embeddings are compared to determine contexts for respective instances of content comprising images.

10. The method of claim 1 wherein applying category specific matching algorithms to each of the respective instances of content to determine a similarity score for respective instances of content comprises for each category specific matching algorithm: comparing each instance of content of the specific category in the first document to each instance of content of the specific category in the second document; generating a similarity score for each pair of respective instances of content; and selecting the pair with the highest similarity score as a match.

11. The method of claim 10 wherein the category specific matching algorithm comprises a text matching algorithm, and wherein applying the text matching algorithm to text instances of content comprises recursively: matching sequences of text from the respective instances of text; unmatching sequences of text and evaluating longer sequences of text for matches; and matching the longer sequences of text.

12. The method of claim 1 wherein characterizing the semantic differences is performed for each matched instance of content and for each unmatched instance of content.

13. The method of claim 1 wherein respective instances are matched based on having similar content and on being in similar locations within the respective first and second documents.

14. A machine-readable storage device having instructions for execution by a processor of a machine to cause the processor to perform operations to perform a method, the operations comprising: parsing a first document and a second document into respective distinct instances of content; classifying the distinct instances of content into different semantic categories including text, images, and tables; applying category specific matching algorithms to content within each of the respective instances of content to determine a similarity score for each of the respective instances of content to match the respective instances, wherein the category specific category matching algorithms comprise machine learning models trained on labeled respective category training data; analyzing semantic differences between the content within matching respective instances of the first document and the second document as a function of the similarity scores; and generating a characterization of the semantic differences.

15. The device of claim 14 wherein generating a characterization of the semantic differences comprises generating a difference label for pairs of respective instances of matched content and wherein generating a characterization of the semantic differences of the degree of differences comprises generating added and removed labels for respective instances of content for unmatched content.

16. The device of claim 14 wherein the semantic differences comprise added, re-ordered, deleted, and modified, and where generating a characterization of the semantic differences comprises generating a count of the semantic differences for each type of semantic difference.

17. The device of claim 14 wherein applying category specific matching algorithms to each of the respective instances of content to determine a similarity score for respective instances of content comprises for each category specific matching algorithm: comparing each instance of content of the specific category in the first document to each instance of content of the specific category in the second document; generating a similarity score for each pair of respective instances of content; and selecting the pair with the highest similarity score as a match.

18. The device of claim 17 wherein the category specific matching algorithm comprises a text matching algorithm, and wherein applying the text matching algorithm to text instances of content comprises recursively: matching sequences of text from the respective instances of text; unmatching sequences of text and evaluating longer sequences of text for matches; and matching the longer sequences of text.

19. A device comprising: a processor, and a memory device coupled to the processor and having a program stored thereon for execution by the processor to perform operations comprising: parsing a first document and a second document into respective distinct instances of content; classifying the distinct instances of content into different semantic categories including text, images, and tables, applying category specific matching algorithms to content within each of the respective instances of content to determine a similarity score for each of the respective instances of content to match the respective instances, wherein the category specific category matching algorithms comprise machine learning models trained on labeled respective category training data; analyzing semantic differences between the content within matching respective instances of the first document and the second document as a function of the similarity scores; and generating a characterization of the semantic dif
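
The matching and counting steps recited in claims 4, 5, and 10 can be pictured with a small sketch. Everything here is an illustrative assumption rather than the patented method: the function names, the `content_weight` and `threshold` parameters, the greedy best-pair-first selection, and the use of `difflib` in place of trained models are all hypothetical choices. Context is reduced to relative position in the document, and the re-ordered count is a crude tally of matched pairs that appear out of sequence.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str, pos_a: float, pos_b: float,
               content_weight: float = 0.8) -> float:
    """Blend content similarity with closeness of relative position in [0, 1]."""
    content = SequenceMatcher(None, a, b).ratio()
    context = 1.0 - abs(pos_a - pos_b)
    return content_weight * content + (1.0 - content_weight) * context


def characterize(first: List[str], second: List[str],
                 threshold: float = 0.5) -> Dict[str, int]:
    """Greedily match instances, best score first, then count difference types."""
    def rel(i: int, n: int) -> float:
        return i / (n - 1) if n > 1 else 0.0

    # Score every pair, then visit pairs from highest to lowest score.
    pairs = sorted(
        ((similarity(a, b, rel(i, len(first)), rel(j, len(second))), i, j)
         for i, a in enumerate(first) for j, b in enumerate(second)),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs are too dissimilar to count as matches
        if i in used_a or j in used_b:
            continue  # each instance is matched at most once
        used_a.add(i)
        used_b.add(j)
        matches.append((i, j))

    # Matched pairs listed in first-document order; a drop in j marks a move.
    order = [j for _, j in sorted(matches)]
    return {
        "added": len(second) - len(used_b),
        "deleted": len(first) - len(used_a),
        "modified": sum(1 for i, j in matches if first[i] != second[j]),
        "re-ordered": sum(1 for prev, cur in zip(order, order[1:]) if cur < prev),
    }
```

Calling `characterize(old_paragraphs, new_paragraphs)` on two lists of paragraph strings returns one count per difference type, which mirrors the per-type count described in claim 4.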

Assignees

Microsoft Technology Licensing LLC

Classifications

Primary CPC classification: G06F40/30

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12086551B2 cover?
A computer implemented method determines differences between documents. The method includes parsing a first document and a second document into respective distinct instances of content. The distinct instances of content are classified into different categories. Category specific matching algorithms are applied to each of the respective instances of content to determine a similarity score for ea…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 10, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).