Automatic document summarization using search engine intelligence

US10169453B2 · US · B2

Patent metadata
Publication number: US-10169453-B2
Application number: US-201615082052-A
Country: US
Kind code: B2
Filing date: Mar 28, 2016
Priority date: Mar 28, 2016
Publication date: Jan 1, 2019
Grant date: Jan 1, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A summary of a document is generated in near real time. In aspects, an indication to summarize the document is received and the document is processed to generate a summary. For instance, processing includes extracting sentences from the document and generating a plurality of candidate passages from the extracted sentences. Features are extracted from each of the plurality of candidate passages and each candidate passage is ranked based at least in part on the extracted features. High-ranking candidate passages are considered likely to be important and/or representative of the document. A summary of the document is generated including one or more of the high-ranking candidate passages. The summary includes portions of the document that are considered important and/or representative of the document, so a user may review the summary in lieu of reading the entire document.
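The steps in the abstract (extract sentences, build candidate passages, extract features, rank, summarize) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the regex sentence splitter, the two-sentence window, and the frequency-based `score` function are all assumptions chosen for brevity, and every function name here is hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_passages(sentences, size=2):
    """Overlapping, contiguous windows of `size` consecutive sentences."""
    if len(sentences) < size:
        return [" ".join(sentences)] if sentences else []
    return [" ".join(sentences[i:i + size])
            for i in range(len(sentences) - size + 1)]


def tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def score(passage, doc_counts, doc_total):
    """Hypothetical representativeness feature: the mean relative
    document frequency of the passage's tokens."""
    toks = tokens(passage)
    if not toks:
        return 0.0
    return sum(doc_counts[t] for t in toks) / (len(toks) * doc_total)


def summarize(text, top_k=1):
    """Return the top_k highest-scoring candidate passages."""
    passages = candidate_passages(split_sentences(text))
    doc_toks = tokens(text)
    doc_counts = Counter(doc_toks)
    ranked = sorted(passages,
                    key=lambda p: score(p, doc_counts, len(doc_toks)),
                    reverse=True)
    return ranked[:top_k]
```

With four sentences and a window of two, `candidate_passages` yields three passages (sentences 1-2, 2-3, and 3-4), which mirrors the overlapping structure recited in the first claim. A real system would presumably combine many features rather than the single score used here.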

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: at least one processing unit; and at least one memory storing computer executable instructions that, when executed by the at least one processing unit, cause the system to perform a method, the method comprising: causing a document to open in an interface; receiving an indication to summarize the document; generating a plurality of overlapping, contiguous candidate passages for the document, comprising: extracting a series of contiguous sentences; and iteratively combining in a consecutive order two or more of the series of contiguous sentences to generate at least a first candidate passage comprising a first extracted sentence and a second extracted sentence, at least a second candidate passage comprising the second extracted sentence and a third extracted sentence, and at least a third candidate passage comprising the third extracted sentence and a fourth extracted sentence; extracting one or more features for each candidate passage of the plurality of candidate passages; ranking each candidate passage of the plurality of candidate passages based at least in part on the extracted one or more features; generating a summary of the document, wherein the summary includes at least a highest ranked candidate passage of the plurality of candidate passages; and providing the summary adjacent to the document in the interface.

2. The system of claim 1, wherein the summary is generated in response to receiving the indication to summarize the document.

3. The system of claim 1, further comprising generating another plurality of candidate passages by one or more of: removing a document header and combining text falling before and after the document header; combining text surrounding a graphical element; summarizing a long list of text into a more concise list of text; and summarizing complex formatted text into condensed formatted text.

4. The system of claim 3, further comprising: generating a summary of the document, wherein the summary includes the highest similarity candidate passage of the plurality of candidate passages and at least one candidate passage of the one or more candidate passages.

5. The system of claim 1, wherein the one or more features comprise one or more of: document-level features, readability features, presentation/layout features, representativeness features and search metadata.

6. The system of claim 1, further comprising: calculating a feature vector for each of the plurality of candidate passages based on the extracted one or more features; and ranking each candidate passage of the plurality of candidate passages based at least in part on the calculated feature vector.

7. The system of claim 6, wherein the calculated feature vector for a candidate passage is representative of the extracted one or more features for the candidate passage.

8. The system of claim 1, wherein extracting the one or more features further comprises: retrieving search query data, wherein the search query data correlates at least one search query with the document; calculating a distance between the at least one search query and each candidate passage of the plurality of candidate passages; and identifying one or more candidate passages having a short distance to the at least one search query as representative of the document.

9. The system of claim 8, further comprising: calculating a feature vector for each of the plurality of candidate passages based at least in part on the distance between each candidate passage and the at least one search query; and ranking each candidate passage of the plurality of candidate passages based at least in part on the calculated feature vector.

10. The system of claim 1, wherein the one or more features comprise readability features that depict a relative complexity of each candidate passage, the readability features comprising one or more of: passage meta features, lexical density features, type-token ratio features, and direct readability features.

11. A system comprising at least one processing unit; and at least one memory storing computer executable instructions that, when executed by the at least one processing unit, cause the system to: receive an indication to summarize a document; generate a plurality of overlapping, contiguous candidate passages for the document, comprising: extract a series of contiguous sentences; and iteratively combine in a consecutive order two or more of the series of contiguous sentences to generate at least a first candidate passage comprising a first extracted sentence and a second extracted sentence, at least a second candidate passage comprising the second extracted sentence and a third extracted sentence, and at least a third candidate passage comprising the third extracted sentence and a fourth extracted sentence; extract one or more features for each candidate passage of the plurality of candidate passages; score each candidate passage of the plurality of candidate passages based at least in part on the extracted one or more features; identify one or more high-scoring candidate passages of the plurality of candidate passages, wherein the high-scoring candidate passages are considered representative of the document; and provide a summary of the document including at least a highest scored candidate passage adjacent to the document in the interface.

12. The system of claim 11, wherein identifying the one or more high-scoring candidate passages comprises highlighting the one or more high-scoring candidate passages within the document.

13. The system of claim 11, wherein identifying the one or more high-scoring candidate passages comprises generating a summary of the document, wherein the summary comprises the one or more high-scoring candidate passages in addition to the highest scored candidate passage.

14. The system of claim 13, wherein the summary is provided as an overlay covering at least a portion of the document.

15. The system of claim 11, wherein the indication to summarize the document is received when the document is caused to be opened.

16. The system of claim 11, wherein the indication to summarize the document is received in response to activation of a control.

17. The system of claim 11, the computer executable instructions further causing the system to: retrieve search query data, wherein the search query data correlates at least one search query with the document; calculate a distance between the at least one search query and each candidate passage of the plurality of candidate passages; and identify one or more candidate passages having a short distance to the at least one search query as representative of the document.

18. The system of claim 17, the computer executable instructions further causing the system to: calculate a feature vector for each of the plurality of candidate passages based at least in part on the distance between each candidate passage and the at least one search query; and rank each candidate passage of the plurality of candidate passages based at least in part on the calculated feature vector.

19. The system of claim 11, wherein extracting the one or more features further comprises: retrieving search query data, wherein the search query data correlates at least one search query with the document; calculating a distance between the at least one search query and each candidate passage of the plurality of candidate passages; and identifying one or more candidate passages having a short distance to the at least one sea
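
The claims mention readability features such as type-token ratio and lexical density, a distance between a search query and each candidate passage, and a per-passage feature vector. The sketch below shows one way these could be computed. It is illustrative only: the claim text shown here does not name a distance metric, so Jaccard distance over token sets is an assumption, as are the tiny stop-word list and all function names.

```python
import re

# Hypothetical stop-word list, used only to approximate "content" words.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or",
              "is", "are", "for", "on"}


def tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(toks):
    """Distinct tokens divided by total tokens."""
    return len(set(toks)) / len(toks) if toks else 0.0


def lexical_density(toks):
    """Share of tokens that are not stop words."""
    if not toks:
        return 0.0
    return sum(1 for t in toks if t not in STOP_WORDS) / len(toks)


def jaccard_distance(a, b):
    """Assumed metric: 1 - |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def feature_vector(passage, queries):
    """[type-token ratio, lexical density, distance to nearest query]."""
    toks = tokens(passage)
    dist = min((jaccard_distance(toks, tokens(q)) for q in queries),
               default=1.0)
    return [type_token_ratio(toks), lexical_density(toks), dist]


def rank_by_query_distance(passages, queries):
    """Passages closest to any query come first."""
    return sorted(passages, key=lambda p: feature_vector(p, queries)[2])
```

Here a shorter distance places a passage earlier in the ranking, in line with the claims treating passages with a short distance to a correlated search query as representative of the document. How the individual features are weighted against each other is not stated in the text above, so this sketch ranks on the distance component alone.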

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10169453B2 cover?
A summary of a document is generated in near real time. In aspects, an indication to summarize the document is received and the document is processed to generate a summary. For instance, processing includes extracting sentences from the document and generating a plurality of candidate passages from the extracted sentences. Features are extracted from each of the plurality of candidate passages …
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/345. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 1, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).