Method and system of intelligently generating a title for a group of documents

US2024104055A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2024104055-A1
Application number  US-202217950475-A
Country             US
Kind code           A1
Filing date         Sep 22, 2022
Priority date       Sep 22, 2022
Publication date    Mar 28, 2024
Grant date          (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for automatically generating a title for a cluster of documents include accessing a plurality of documents that have been categorized as belonging to a document cluster and providing the plurality of documents as an input to a trained title generating machine-learning (ML) model. The trained title generating ML model is trained for generating a title for a document and provides a title for each of the plurality of documents. An embedding is created for the generated titles and then an embedding is generated for the document cluster. A similarity between the embeddings for the titles and the embedding for the document cluster is measured to identify titles that are more similar to the embedding for the document cluster, and based on the similarity, one or more titles are selected as title candidates for the document cluster and provided as an output.
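
The selection step described in the abstract can be illustrated with a short sketch. It assumes the title embeddings and the cluster embedding have already been computed, and it uses cosine similarity as the measure; the abstract does not name a specific similarity measure, so that choice, the function names, and the `top_k` parameter are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def select_title_candidates(titles, title_embeddings, cluster_embedding, top_k=1):
    """Return the top_k titles whose embeddings are closest to the cluster embedding."""
    scored = [
        (cosine_similarity(embedding, cluster_embedding), title)
        for title, embedding in zip(titles, title_embeddings)
    ]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


# Toy two-dimensional vectors, invented for this example.
titles = ["Quarterly budget review", "Team offsite agenda", "Budget planning notes"]
title_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cluster_embedding = [1.0, 0.1]
candidates = select_title_candidates(titles, title_embeddings, cluster_embedding, top_k=2)
```

In a real system the vectors would come from an embedding model rather than being written by hand; only the ranking logic is shown here.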

First claim

Opening claim text (preview).

What is claimed is:

1. A data processing system comprising: a processor; and a memory in communication with the processor, the memory comprising executable instructions that, when executed by the processor, cause the data processing system to perform functions of: accessing a plurality of documents, the plurality of documents being documents that have been categorized as belonging to a document cluster; providing the plurality of documents as an input to a trained title generating machine-learning (ML) model, the trained title generating ML model being trained for generating a title for a document; receiving a plurality of titles from the trained title generating ML model, each of the plurality of titles being a title for one of the plurality of documents; creating an embedding for one or more of the plurality of titles; creating an embedding for the document cluster; measuring a similarity between the embeddings for the one or more of the plurality of titles and embedding for the document cluster to identify titles that are more similar to the embedding for the document cluster; selecting, based on the similarity, one or more titles from among the plurality of titles as title candidates for the document cluster; and providing the one or more title candidates as an output.

2. The data processing system of claim 1, wherein the trained title generating ML model is a trained encoder-decoder language model that generates abstractive titles for a document in the document cluster.

3. The data processing system of claim 1, wherein creating an embedding for one or more of the plurality of titles includes generating numerical vector representations of text for each of the plurality of titles.

4. The data processing system of claim 1, wherein creating an embedding for the document cluster includes creating an averaged embedding for the document cluster.

5. The data processing system of claim 4, wherein creating an averaged embedding includes: utilizing a model trained for generating topic embeddings from text inputs to generate one or more embeddings for the plurality of documents in the document cluster; and calculating an average of the generated topic embeddings to generate the averaged embedding for the document cluster.

6. The data processing system of claim 1, wherein measuring the similarity between the embeddings for the one or more of the plurality of titles and embedding for the document cluster includes calculating a similarity score between each of the embeddings for the one or more of the plurality of titles and the embedding for the document cluster.

7. The data processing system of claim 6, wherein the title candidates are selected based on the similarity score.

8. A method for automatically generating a title for a cluster of documents comprising: accessing a plurality of documents in the document cluster, the plurality of documents being documents that have been categorized as belonging to the document cluster; providing the plurality of documents as an input to a trained title generating machine-learning (ML) model, the trained title generating ML model being trained for generating a title for a document; receiving a title from the trained title generating ML model for each of the plurality of documents; creating an embedding for each of the received titles; creating a topic embedding for the document cluster; measuring a similarity between each of the embeddings for the received titles and the topic embedding for the document cluster; and selecting, based on the similarity, one or more titles from among the received titles as title candidates for the document cluster.

9. The method of claim 8, wherein the trained title generating ML model is a trained text to text language model that receives the document as the input and generates the titles for the document as an output.

10. The method of claim 8, wherein the trained title generating ML model is trained by using a publicly available labeled dataset to fine-tune a pretrained ML model.

11. The method of claim 10, wherein the pretrained ML model is an encoder-decoder deep learning model.

12. The method of claim 8, wherein creating an embedding for each of the received titles includes generating numerical vector representations of text for each of the received titles.

13. The method of claim 8, wherein creating the embedding for the document cluster includes creating an averaged embedding for the document cluster.

14. The method of claim 13, further comprising: utilizing a model trained for generating topic embeddings from text inputs to generate one or more embeddings for the plurality of documents in the document cluster; and calculating an average of the generated topic embeddings to generate the topic embedding for the document cluster.

15. The method of claim 8, wherein measuring the similarity between the embeddings for the one or more of the plurality of titles and embedding for the document cluster includes calculating a similarity score between each of the embeddings for the one or more of the plurality of titles and the embedding for the document cluster.

16. The method of claim 15, wherein the title candidates are selected based on the similarity score.

17. A non-transitory computer readable medium on which are stored instructions that, when executed, cause a programmable device to perform functions of: accessing a document, the document including content from a plurality of shorter documents, the shorter documents being documents that have been identified as belonging to a document cluster; providing the document as an input to a trained title generating machine-learning (ML) model, the trained title generating ML model being trained for generating a title for a document that includes a plurality of shorter documents that belong to the document cluster; receiving a title from the trained title generating ML model as an output; and providing the title as a cluster title for the document cluster.

18. The non-transitory computer readable medium of claim 17, wherein the trained title generating ML model is a trained text to text language model that receives the document as the input and generates the title for the document cluster as the output.

19. The non-transitory computer readable medium of claim 17, wherein the trained title generating ML model is trained by using a publicly available labeled dataset to fine-tune a pretrained ML model.

20. The non-transitory computer readable medium of claim 17, wherein the document is concatenated to include the documents in the document cluster.
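
Two steps in the claims lend themselves to a brief illustration: the averaged cluster embedding of claims 4, 5, 13, and 14, and the concatenated-document variant of claims 17 and 20. The sketch below is a minimal reading of those steps, not the implementation from the patent. The function names, the separator, and the `title_model` callable (a stand-in for the trained title generating ML model) are hypothetical.

```python
def average_embedding(doc_embeddings):
    """Element-wise mean of per-document embeddings, used as the cluster embedding."""
    if not doc_embeddings:
        raise ValueError("cluster has no document embeddings")
    dim = len(doc_embeddings[0])
    if any(len(vector) != dim for vector in doc_embeddings):
        raise ValueError("all embeddings must have the same length")
    count = len(doc_embeddings)
    return [sum(vector[i] for vector in doc_embeddings) / count for i in range(dim)]


def title_from_concatenation(documents, title_model, separator="\n\n"):
    """Join the cluster's documents into one text and ask the model for a single title.

    title_model is a hypothetical callable (text -> title) standing in for
    the trained model; the separator is an assumption as well.
    """
    combined = separator.join(documents)
    return title_model(combined)


# Toy inputs, invented for this example.
cluster_vector = average_embedding([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]

# A trivial stand-in model that returns the first line of its input.
first_line_model = lambda text: text.splitlines()[0]
cluster_title = title_from_concatenation(
    ["Budget review notes", "Budget planning agenda"], first_line_model
)
```

The first function corresponds to the ranking path of claims 1 through 16, where generated titles are compared against the cluster embedding. The second corresponds to the alternative path of claims 17 through 20, where a single title is produced directly from the combined text and no similarity ranking is involved.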

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G06F16/164 (Primary)

    File meta data generation · CPC title

  • G06F16/35 (Primary)

    Clustering; Classification · CPC title

  • Document management systems · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024104055A1 cover?
A system and method for automatically generating a title for a cluster of documents include accessing a plurality of documents that have been categorized as belonging to a document cluster and providing the plurality of documents as an input to a trained title generating machine-learning (ML) model. The trained title generating ML model is trained for generating a title for a document and provides…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/164. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 28, 2024 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).