Unifying text segmentation and long document summarization

US2024220709A1 · US · A1

Patent metadata
Publication number: US-2024220709-A1
Application number: US-202218090132-A
Country: US
Kind code: A1
Filing date: Dec 28, 2022
Priority date: Dec 28, 2022
Publication date: Jul 4, 2024
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method including receiving an input comprising natural language texts; segmenting the natural language texts into sections; summarizing the natural language texts; developing a first model based on the plurality of sections and the summary of the natural language texts; identifying one or more salient sentences within the natural language texts using the first model; determining a sentence quality score based on how informative a salient sentence is; determining a sentence similarity score based on a salient sentence's similarity to another salient sentence; developing a second model based on the sentence quality score and the sentence similarity score; combining the first model and the second model into a final model; selecting sentences based on the final model; and generating an extractive summarization using the selected sentences.
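
The selection step in the abstract (prefer informative sentences, avoid ones that resemble sentences already chosen) can be illustrated with a small sketch. This is not the patented method: it swaps in a simple greedy quality-minus-redundancy rule and bag-of-words cosine similarity, and the names `select_sentences`, `quality`, and `penalty` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_sentences(sentences, quality, k, penalty=1.0):
    """Greedily pick up to k sentence indices, trading quality against redundancy.

    quality[i] is an assumed informativeness score for sentences[i]; in the
    patent it would come from a trained model, here it is supplied by hand.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    chosen = []
    while len(chosen) < min(k, len(sentences)):
        best, best_gain = None, -math.inf
        for i in range(len(sentences)):
            if i in chosen:
                continue
            # Redundancy = highest similarity to any sentence already selected.
            redundancy = max((cosine(bags[i], bags[j]) for j in chosen), default=0.0)
            gain = quality[i] - penalty * redundancy
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
    return sorted(chosen)  # restore document order for the extractive summary


sentences = [
    "the model segments long documents into sections",
    "the model segments long documents into sections quickly",
    "summaries are built from selected sentences",
]
quality = [0.9, 0.8, 0.6]  # made-up scores for illustration only
summary_ids = select_sentences(sentences, quality, k=2)
print([sentences[i] for i in summary_ids])
```

The second sentence scores well on quality but is nearly identical to the first, so the lower-scoring third sentence is selected instead.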

First claim

Opening claim text (preview).

1. A method executed by at least one processor, the method comprising: receiving an input comprising natural language texts; segmenting the natural language texts into a plurality of sections; summarizing the natural language texts; developing a first model based on the plurality of sections and the summary of the natural language texts; identifying two or more salient sentences within the natural language texts using the first model; determining a sentence quality score for each of the two or more salient sentences based on how informative the salient sentence is; determining, for each of the two or more salient sentences, a sentence similarity score based on a similarity of the salient sentence to another salient sentence of the two or more salient sentences; generating a second model, as a negative log-probability of a ground-truth extractive summary, based on performing batch matrix multiplication (BMM) between the sentence quality scores and the sentence similarity scores to calculate a matrix product; combining the first model and the second model into a final model; selecting sentences from the natural language texts based on the final model; and generating an extractive summarization of the natural language texts using the selected sentences.

2. The method according to claim 1, wherein segmenting the natural language texts and summarizing the natural language texts occur simultaneously.

3. The method according to claim 1, further comprising calculating a pairwise repulsiveness between the salient sentence and the another salient sentence to increase diversity of the identified salient sentences and eliminate redundancy.

4. The method according to claim 1, wherein developing the first model further comprises minimizing a per-sentence empirical cross-entropy of the first model with respect to standard summary labels.

5. The method according to claim 1, further comprising using a determinantal point process.

6. The method according to claim 1, further comprising training the final model.

7. The method according to claim 1, further comprising pre-training the final model on either arXiv or PubMed.

8. An apparatus comprising: at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program code, the program code comprising: receiving code configured to cause the at least one processor to receive an input comprising natural language texts; segmenting code configured to cause the at least one processor to segment the natural language texts into a plurality of sections; summarizing code configured to cause the at least one processor to summarize the natural language texts; first developing code configured to cause the at least one processor to develop a first model based on the plurality of sections and the summary of the natural language texts; identifying code configured to cause the at least one processor to identify two or more salient sentences within the natural language texts using the first model; first determining code configured to cause the at least one processor to determine a sentence quality score for each of the two or more salient sentences based on how informative the salient sentence is; second determining code configured to cause the at least one processor to determine for each of the two or more salient sentences a sentence similarity score based on a similarity of the salient sentence to another salient sentence of the two or more salient sentences; generating code configured to cause the at least one processor to generate a second model, as a negative log-probability of a ground-truth extractive summary, based on performing batch matrix multiplication (BMM) between the sentence quality scores and the sentence similarity scores to calculate a matrix product; combining code configured to cause the at least one processor to combine the first model and the second model into a final model; selecting code configured to cause the at least one processor to select sentences from the natural language texts based on the final model; and generating code configured to cause the at least one processor to generate an extractive summarization of the natural language texts using the selected sentences.

9. The apparatus according to claim 8, wherein segmenting the natural language texts and summarizing the natural language texts occur simultaneously.

10. The apparatus according to claim 8, wherein the program code is further configured to cause the at least one processor to calculate a pairwise repulsiveness between the salient sentence and the another salient sentence to increase diversity of the identified salient sentences and eliminate redundancy.

11. The apparatus according to claim 8, wherein developing the first model further comprises minimizing a per-sentence empirical cross-entropy of the first model with respect to standard summary labels.

12. The apparatus according to claim 8, wherein the program code is further configured to cause the at least one processor to use a determinantal point process.

13. The apparatus according to claim 8, further comprising training the final model.

14. The apparatus according to claim 8, further comprising pre-training the final model on either arXiv or PubMed.

15. A non-transitory computer-readable storage medium, storing instructions, which, when executed by at least one processor, cause the at least one processor to: receive an input comprising natural language texts; segment the natural language texts into a plurality of sections; summarize the natural language texts; develop a first model based on the plurality of sections and the summary of the natural language texts; identify two or more salient sentences within the natural language texts using the first model; determine a sentence quality score for each of the two or more salient sentences based on how informative the salient sentence is; determine, for each of the two or more salient sentences, a sentence similarity score based on a similarity to another salient sentence of the two or more salient sentences; generate a second model, as a negative log-probability of a ground-truth extractive summary, based on performing batch matrix multiplication (BMM) between the sentence quality scores and the sentence similarity scores to calculate a matrix product; combine the first model and the second model into a final model; select sentences from the natural language texts based on the final model; and generate an extractive summarization of the natural language texts using the selected sentences.

16. The non-transitory computer-readable storage medium according to claim 15, wherein segmenting the natural language texts and summarizing the natural language texts occur simultaneously.

17. The non-transitory computer-readable storage medium according to claim 15, wherein where the instructions are further configured to, when executed by the at least one processor, cause the at least one processor to calculate a pairwise repulsiveness between the salient sentence and the another salient sentence to increase diversity of the identified salient sentences and eliminate redundancy.

18. The non-transitory computer-readable storage medium according to claim 15, wherein developing the first model further comprises minimizing a per-sentence empirical cross-entropy of the first model with respect to standard summary labels.

19. The non-transitory computer-readabl
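
The claims combine two training signals: a per-sentence cross-entropy (claim 4) and a negative log-probability of the ground-truth summary built from quality and similarity scores, with a determinantal point process mentioned in claim 5. The page gives no formulas, so the sketch below assumes the standard L-ensemble form, where the kernel is L = diag(q) S diag(q) and P(Y) = det(L_Y) / det(L + I). The weighted sum in `combined_loss`, the `weight` parameter, and all function names are assumptions made for illustration. Pure Python is used to keep the example dependency-free; a real implementation would likely use batched tensor operations.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result  # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def dpp_kernel(quality, similarity):
    """Assumed kernel: L[i][j] = q_i * S[i][j] * q_j, i.e. diag(q) S diag(q)."""
    n = len(quality)
    return [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
            for i in range(n)]


def dpp_nll(quality, similarity, gold):
    """Negative log-probability of the gold index subset under an L-ensemble DPP.

    Raises ValueError if the gold sentences are exact duplicates (zero determinant).
    """
    L = dpp_kernel(quality, similarity)
    n = len(L)
    L_gold = [[L[i][j] for j in gold] for i in gold]
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return -math.log(det(L_gold)) + math.log(det(L_plus_I))


def cross_entropy(probs, labels):
    """Mean per-sentence binary cross-entropy against 0/1 summary labels."""
    eps = 1e-12  # guards against log(0)
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(labels)


def combined_loss(probs, labels, quality, similarity, weight=0.1):
    """Hypothetical combination: cross-entropy plus a weighted DPP term."""
    gold = [i for i, y in enumerate(labels) if y == 1]
    return cross_entropy(probs, labels) + weight * dpp_nll(quality, similarity, gold)


# Sentences 0 and 1 are near-duplicates; sentence 2 is unrelated to both.
S = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
q = [1.0, 1.0, 1.0]
print(dpp_nll(q, S, [0, 1]), dpp_nll(q, S, [0, 2]))
```

Under these assumptions a redundant pair of gold sentences receives a higher negative log-probability than a diverse pair, which is the "pairwise repulsiveness" effect described in claim 3.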

Assignees

Tencent America LLC

Inventors

Classifications

  • Semantic analysis · CPC title

  • Phrasal analysis, e.g. finite state techniques or chunking · CPC title

  • Machine learning · CPC title

  • G06F40/166 · Primary

    Editing, e.g. inserting or deleting · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024220709A1 cover?
A method including receiving an input comprising natural language texts; segmenting the natural language texts into sections; summarizing the natural language texts; developing a first model based on the plurality of sections and the summary of the natural language texts; identifying one or more salient sentences within the natural language texts using the first model; determining a sentence qu…
Who is the assignee on this patent?
Tencent America LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/166. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 4, 2024 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).