Who is the assignee on this patent?

Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G06F40/30. Mapped technology areas include Physics.

When was this patent published?

Publication date: Sep 10, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Semantic difference characterization for documents

US12086551B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12086551-B2
Application number	US-202117356037-A
Country	US
Kind code	B2
Filing date	Jun 23, 2021
Priority date	Jun 23, 2021
Publication date	Sep 10, 2024
Grant date	Sep 10, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer implemented method determines differences between documents. The method includes parsing a first document and a second document into respective distinct instances of content. The distinct instances of content are classified into different categories. Category specific matching algorithms are applied to each of the respective instances of content to determine a similarity score for each of the respective instances of content. Semantic differences between the first document and the second document are analyzed as a function of the similarity scores. A characterization of the semantic differences is generated.
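The steps in the abstract (parse, classify, match within a category, score, characterize) can be illustrated with a minimal Python sketch. It is not the patented implementation. The Instance class, the characterize function, and the 0.5 threshold are hypothetical names and values chosen for illustration. The standard-library difflib ratio stands in for the category specific matching algorithms, and every instance of content is simplified to a string. The added, removed, and modified labels are borrowed from the claim text on this page.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Instance:
    """One parsed instance of content (hypothetical structure)."""
    category: str  # assumed labels: "text", "image", "table"
    content: str   # simplification: all content is represented as a string


def similarity(a: Instance, b: Instance) -> float:
    """Stand-in scorer in [0, 1]; the patent describes category specific algorithms."""
    return SequenceMatcher(None, a.content, b.content).ratio()


def characterize(first, second, threshold=0.5):
    """Label instances as unchanged, modified, removed, or added, and count each label."""
    labels = []
    unmatched = list(second)
    for inst in first:
        # Only compare instances that were classified into the same category.
        candidates = [c for c in unmatched if c.category == inst.category]
        best = max(candidates, key=lambda c: similarity(inst, c), default=None)
        if best is None or similarity(inst, best) < threshold:
            labels.append(("removed", inst.content))
            continue
        unmatched.remove(best)
        score = similarity(inst, best)
        labels.append(("unchanged" if score == 1.0 else "modified", inst.content))
    # Anything left in the second document had no counterpart in the first.
    labels.extend(("added", inst.content) for inst in unmatched)
    return labels, Counter(label for label, _ in labels)


if __name__ == "__main__":
    old = [Instance("text", "Introduction"), Instance("table", "a,b,c")]
    new = [Instance("text", "Introduction"), Instance("image", "figure-1")]
    for label, content in characterize(old, new)[0]:
        print(label, content)
```

The threshold decides when a low score means "no match" rather than "heavily modified". That cut-off is an assumption of this sketch, not something the abstract specifies.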

First claim

Opening claim text (preview).

The invention claimed is:
1. A computer implemented method of determining differences between documents, the method comprising: parsing a first document and a second document into respective distinct instances of content; classifying the distinct instances of content into different semantic categories including text, images, and tables; applying category specific matching algorithms to content within each of the respective instances of content to determine a similarity score for each of the respective instances of content to match the respective instances, wherein the category specific category matching algorithms comprise machine learning models trained on labeled respective category training data; analyzing semantic differences between the content within matching respective instances of the first document and the second document as a function of the similarity scores; and generating a characterization of the semantic differences.
2. The method of claim 1 wherein generating a characterization of the semantic differences comprises generating a difference label for pairs of respective instances of matched content.
3. The method of claim 1 wherein generating a characterization of the semantic differences of the degree of differences comprises generating added and removed labels for respective instances of content for unmatched content.
4. The method of claim 1 wherein the semantic differences comprise added, re-ordered, deleted, and modified, and where generating a characterization of the semantic differences comprises generating a count of the semantic differences for each type of semantic difference.
5. The method of claim 1 wherein the similarity score for respective instances of content is determined as a function of similarity of the respective instances of content and similarity of context of the respective instances of content.
6. The method of claim 1 wherein classifying the distinct instances of content into different categories comprises classifying the instances of content into one of a text, an image, or a table category.
7. The method of claim 6 wherein text is further classified into section headings, sections, headers, footers, titles, authors, references, and captions.
8. The method of claim 1 wherein the similarity score of the respective instances of content is a function of each respective instance of content's position with respect to other local identified instances of content.
9. The method of claim 8 wherein image embeddings are compared to determine contexts for respective instances of content comprising images.
10. The method of claim 1 wherein applying category specific matching algorithms to each of the respective instances of content to determine a similarity score for respective instances of content comprises for each category specific matching algorithm: comparing each instance of content of the specific category in the first document to each instance of content of the specific category in the second document; generating a similarity score for each pair of respective instances of content; and selecting the pair with the highest similarity score as a match.
11. The method of claim 10 wherein the category specific matching algorithm comprises a text matching algorithm, and wherein applying the text matching algorithm to text instances of content comprises recursively: matching sequences of text from the respective instances of text; unmatching sequences of text and evaluating longer sequences of text for matches; and matching the longer sequences of text.
12. The method of claim 1 wherein characterizing the semantic differences is performed for each matched instance of content and for each unmatched instance of content.
13. The method of claim 1 wherein respective instances are matched based on having similar content and on being in similar locations within the respective first and second documents.
14. A machine-readable storage device having instructions for execution by a processor of a machine to cause the processor to perform operations to perform a method, the operations comprising: parsing a first document and a second document into respective distinct instances of content; classifying the distinct instances of content into different semantic categories including text, images, and tables; applying category specific matching algorithms to content within each of the respective instances of content to determine a similarity score for each of the respective instances of content to match the respective instances, wherein the category specific category matching algorithms comprise machine learning models trained on labeled respective category training data; analyzing semantic differences between the content within matching respective instances of the first document and the second document as a function of the similarity scores; and generating a characterization of the semantic differences.
15. The device of claim 14 wherein generating a characterization of the semantic differences comprises generating a difference label for pairs of respective instances of matched content and wherein generating a characterization of the semantic differences of the degree of differences comprises generating added and removed labels for respective instances of content for unmatched content.
16. The device of claim 14 wherein the semantic differences comprise added, re-ordered, deleted, and modified, and where generating a characterization of the semantic differences comprises generating a count of the semantic differences for each type of semantic difference.
17. The device of claim 14 wherein applying category specific matching algorithms to each of the respective instances of content to determine a similarity score for respective instances of content comprises for each category specific matching algorithm: comparing each instance of content of the specific category in the first document to each instance of content of the specific category in the second document; generating a similarity score for each pair of respective instances of content; and selecting the pair with the highest similarity score as a match.
18. The device of claim 17 wherein the category specific matching algorithm comprises a text matching algorithm, and wherein applying the text matching algorithm to text instances of content comprises recursively: matching sequences of text from the respective instances of text; unmatching sequences of text and evaluating longer sequences of text for matches; and matching the longer sequences of text.
19. A device comprising: a processor, and a memory device coupled to the processor and having a program stored thereon for execution by the processor to perform operations comprising: parsing a first document and a second document into respective distinct instances of content; classifying the distinct instances of content into different semantic categories including text, images, and tables, applying category specific matching algorithms to content within each of the respective instances of content to determine a similarity score for each of the respective instances of content to match the respective instances, wherein the category specific category matching algorithms comprise machine learning models trained on labeled respective category training data; analyzing semantic differences between the content within matching respective instances of the first document and the second document as a function of the similarity scores; and generating a characterization of the semantic dif
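Claims 5, 8, 10, and 13 describe scoring every pair of same-category instances on both content and location, then selecting the highest-scoring pair as a match. The following is a rough Python sketch of that idea for text instances only, under stated assumptions. The function names, the 0.8 content weight, the 0.5 threshold, and the greedy one-to-one selection are all hypothetical. difflib is used in place of the trained machine learning models that the claims recite.

```python
from difflib import SequenceMatcher


def content_similarity(a: str, b: str) -> float:
    """Stand-in content score in [0, 1]; the claims recite trained models instead."""
    return SequenceMatcher(None, a, b).ratio()


def position_similarity(i: int, n: int, j: int, m: int) -> float:
    """1.0 when both instances sit at the same relative position in their documents."""
    rel_i = i / max(n - 1, 1)
    rel_j = j / max(m - 1, 1)
    return 1.0 - abs(rel_i - rel_j)


def match_instances(first, second, weight=0.8, threshold=0.5):
    """Return {index in first: index in second} for greedily selected best pairs."""
    scored = []
    for i, a in enumerate(first):
        for j, b in enumerate(second):
            # Blend content similarity with similarity of location (assumed weighting).
            score = (weight * content_similarity(a, b)
                     + (1 - weight) * position_similarity(i, len(first), j, len(second)))
            scored.append((score, i, j))
    scored.sort(key=lambda item: item[0], reverse=True)

    matches, used_first, used_second = {}, set(), set()
    for score, i, j in scored:
        if score < threshold:
            break  # remaining pairs are too dissimilar to count as matches
        if i in used_first or j in used_second:
            continue  # each instance takes part in at most one match
        matches[i] = j
        used_first.add(i)
        used_second.add(j)
    return matches


if __name__ == "__main__":
    old = ["Introduction", "Methods and data", "Results"]
    new = ["Introduction", "Results", "Methods and data sources"]
    print(match_instances(old, new))
```

Instances of the first document that are absent from the returned mapping would correspond to removed content, and unmapped instances of the second document to added content. A matched pair whose order changed relative to its neighbors is a candidate for the re-ordered label in claim 4.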

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

CPC code	CPC title
G06F18/2431	Multiple classes
G06V30/418	Document matching, e.g. of document images
G06N20/00	Machine learning
G06F40/205	Parsing
G06N3/0475	Generative networks

Patent family

Related publications grouped by family.

View patent family 84542255

External sources

Related patents

Method and apparatus for determining item name, computer device, and storage medium

Method of comparing documents, electronic device and readable storage medium

Method and apparatus for summarization of dialogs

Identification of fields in documents with neural networks using global document context

Document revision change summarization

Fine-grained image similarity