Method and apparatus for automatically summarizing the contents of electronic documents
US-2015095770-A1 · Apr 2, 2015 · US
US9727641B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9727641-B2 |
| Application number | US-201313870267-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 25, 2013 |
| Priority date | Apr 25, 2013 |
| Publication date | Aug 8, 2017 |
| Grant date | Aug 8, 2017 |
Official abstract text for this publication.
A technique to generate a summary of a set of sentences. Each sentence in the set can be evaluated based on a criterion, such as informativeness of the sentence. The sentences may also be evaluated for readability based on a readability measure. Sentences can be selected for inclusion in the summary based on the evaluations.
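As an illustration of the two evaluations named in the abstract, the following Python sketch scores sentences for readability and informativeness. It is not the patent's implementation. The readability score swaps in the Flesch Reading Ease formula, chosen here only because it depends on words per sentence and syllables per word, two of the factors listed in claim 1. The syllable counter is a crude vowel-run heuristic, and the informativeness score is an average tf-idf weight that treats each sentence as its own document. The function names `readability` and `informativeness` are hypothetical.

```python
import math
import re
from collections import Counter


def words(sentence):
    """Lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", sentence.lower())


def count_syllables(word):
    # Assumption: approximate syllables as runs of vowels, at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def readability(sentence):
    """Flesch Reading Ease for one sentence (higher means easier to read).

    Assumption: the patent does not name this formula; it is used here
    as one possible measure built from word and syllable counts.
    """
    toks = words(sentence)
    if not toks:
        return 0.0
    syllables = sum(count_syllables(w) for w in toks)
    return 206.835 - 1.015 * len(toks) - 84.6 * syllables / len(toks)


def informativeness(sentences):
    """Average tf-idf weight per token, one score per sentence.

    Assumption: each sentence is treated as a separate document when
    computing document frequency.
    """
    tokenized = [words(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        total = sum(tf[w] * math.log(n / df[w]) for w in tf)
        scores.append(total / len(toks))
    return scores
```

Under this sketch, a short sentence of one-syllable words scores as more readable than a sentence of long technical terms, and a sentence made of words that are rare across the document scores as more informative than one made of common words.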
Opening claim text (preview).
What is claimed is:
1. A method executed by a computer system, comprising: extracting a set of sentences from a digital document; scoring each sentence of the set of sentences using a respective informativeness measure; scoring each sentence of the set of sentences using a readability measure, wherein the readability measure is based at least in part on one of: a number of words in the sentence, a number of syllables per word, a frequency of a word based on a vocabulary frequency, a frequency of a word based on context, or if words of the sentence appear on a reading list; selecting selected sentences in the set of sentences based on the readability measures and informativeness measures, wherein the selecting comprises: determining a subset of sentences from the set of sentences, wherein the sentences in the subset of sentences have informativeness measures greater than a threshold, and selecting, from the subset of sentences, the selected sentences based on a ranking of the sentences in the subset of sentences according to readability measures of the sentences in the subset of sentences, wherein the selected sentences are the top ranked sentences in the subset of sentences; identifying a low readability, high informativeness sentence from the set of sentences, wherein: a low readability sentence includes at least one of fewer syllables per word, fewer words on a reading list, or a lower frequency of words associated with a vocabulary frequency list; and a high informativeness sentence includes greater similarity to other sentences in the set of sentences and more words having term frequency-inverse document frequency (tf-idf) values indicating that the words are key words; generating a concatenated sentence by concatenating at least one contextual sentence with the low readability, high informativeness sentence, wherein the concatenated sentence has a higher readability than the low readability, high informativeness sentence; and generating a readable summary of the digital document, the readable summary including the concatenated sentence and the selected sentences.
2. The method of claim 1, wherein the contextual sentence comprises a sentence preceding or following the identified low readability, high informativeness sentence in the digital document.
3. The method of claim 1, wherein the selected sentences are selected using a linear program optimization that maximizes informativeness and readability of the readable summary as measured by the informativeness measures and the readability measures of the sentences in the set of sentences.
4. The method of claim 1, further comprising: computing a readability measure of the concatenated sentence; and including the concatenated sentence in the readable summary in response to the readability measure of the concatenated sentence satisfying a specified criterion.
5. The method of claim 4, wherein the specified criterion comprises a specified threshold, and including the concatenated sentence in the readable summary is in response to the readability measure of the concatenated sentence exceeding the specified threshold.
6. The method of claim 4, wherein the specified criterion comprises a threshold amount greater than a readability measure of the low readability, high informativeness sentence, and including the concatenated sentence in the readable summary is in response to the readability measure of the concatenated sentence exceeding the readability measure of the low readability, high informativeness sentence by greater than the threshold amount.
7. A system comprising: a processor; and a non-transitory storage medium storing instructions executable on the processor to: extract a plurality of sentences from a digital document; identify sentences from the plurality of sentences for inclusion in a summary of the digital document based on a criterion; evaluate a readability of the identified sentences using respective readability measures, wherein each readability measure assigned to each sentence is based at least in part on one of: a number of words in the sentence, a number of syllables per word, a frequency of a word based on a vocabulary frequency, a frequency of a word based on context, or if words of the sentence appear on a reading list; select sentences based in part on the evaluated readability of the identified sentences, wherein the selecting comprises: determining a subset of sentences from the plurality of sentences, wherein the sentences in the subset of sentences have informativeness measures greater than a threshold, and selecting, from the subset of sentences, the selected sentences based on a ranking of the sentences in the subset of sentences according to readability measures of the sentences in the subset of sentences, wherein the selected sentences are the top ranked sentences in the subset of sentences; add a low readability, high informativeness sentence to at least one of the selected sentences to create a concatenated sentence, wherein the concatenated sentence has a higher readability than the low readability, high informativeness sentence, and wherein: a low readability sentence includes at least one of fewer syllables per word, fewer words on a reading list, or a lower frequency of words associated with a vocabulary frequency list; and a high informativeness sentence includes greater similarity to other sentences in the plurality of sentences and more words having term frequency-inverse document frequency (tf-idf) values indicating that the words are key words.
8. The system of claim 7, wherein the instructions are executable on the processor to assign an informativeness measure to each sentence of the plurality of sentences, wherein the identifying is based on the informativeness measures.
9. The system of claim 8, wherein the criterion is informativeness.
10. The system of claim 7, wherein the instructions are executable on the processor to: compute a readability measure of the concatenated sentence; and include the concatenated sentence in the summary in response to the readability measure of the concatenated sentence satisfying a specified criterion.
11. A non-transitory computer readable storage medium storing instructions that when executed cause a computer system to: assign a respective informativeness measure to each sentence of a set of sentences in a digital document; assign a respective readability measure to each sentence of the set of sentences; select selected sentences in the set of sentences based on the readability measures and informativeness measures, wherein the selecting comprises: determining a subset of sentences from the set of sentences, wherein the sentences in the subset of sentences have informativeness measures greater than a threshold, and selecting, from the subset of sentences, the selected sentences based on a ranking of the sentences in the subset of sentences according to readability measures of the sentences in the subset of sentences, wherein the selected sentences are the top ranked sentences in the subset of sentences; identify a low readability, high informativeness sentence from the set of sentences, wherein: a low readability sentence includes at least one of fewer syllables per word, fewer words on a reading list, or a lower frequency of words associated with a vocabulary frequency list; and a high informativeness sentence includes greater similarity to other sentences in the set of sentences and more words having term frequency-inverse document frequency (tf-idf) values indicating that the words are key words; generate a concatenated sentence by concatenating at least one contextual sentence onto the low readability, high informativeness
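The selection and concatenation steps recited in the claims can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names, the `top_k` and `min_gain` parameters, and the toy readability measure `avg_word_length_readability` are all hypothetical. The selection follows the threshold-then-rank wording of claim 1, the contextual sentence is a neighbor as in claim 2, and the acceptance test mirrors the "exceeds by a threshold amount" criterion of claim 6.

```python
def select_sentences(info, read, info_threshold, top_k):
    """Return indices of the top_k most readable sentences among those
    whose informativeness exceeds info_threshold."""
    candidates = [i for i, score in enumerate(info) if score > info_threshold]
    ranked = sorted(candidates, key=lambda i: read[i], reverse=True)
    return ranked[:top_k]


def concatenate_with_context(sentences, idx, readability, min_gain=0.0):
    """Join sentence idx with its preceding or following neighbor.

    Returns the more readable of the two joined candidates if it beats
    the original sentence's readability by more than min_gain,
    otherwise None.
    """
    base = readability(sentences[idx])
    options = []
    if idx > 0:
        options.append(sentences[idx - 1] + " " + sentences[idx])
    if idx + 1 < len(sentences):
        options.append(sentences[idx] + " " + sentences[idx + 1])
    best = max(options, key=readability, default=None)
    if best is not None and readability(best) - base > min_gain:
        return best
    return None


def avg_word_length_readability(sentence):
    # Toy measure (an assumption for this example only): shorter words
    # on average give a higher, less negative score.
    tokens = sentence.split()
    return -sum(len(t) for t in tokens) / len(tokens)


if __name__ == "__main__":
    doc = [
        "We ran a test.",
        "Heteroscedasticity invalidated inference.",
        "So we did it again.",
    ]
    joined = concatenate_with_context(doc, 1, avg_word_length_readability)
    print(joined)
```

Passing the readability measure in as a function keeps the sketch independent of any particular scoring formula, so a different measure from the list in claim 1 could be substituted without changing the selection logic.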
Physics · mapped topic
Summarisation for human users · CPC title