Systems and methods for generating labeled short text sequences

US11797594B2 · US · B2

Patent metadata
Field | Value
Publication number | US-11797594-B2
Application number | US-202017093722-A
Country | US
Kind code | B2
Filing date | Nov 10, 2020
Priority date | Dec 9, 2019
Publication date | Oct 24, 2023
Grant date | Oct 24, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A set of documents related to a particular topic, industry, or entity are received. Sentences are extracted from each document. The sentences are grouped into tuples of one, two, or three consecutive sentences (i.e., short text sequences). The sentence tuples are clustered based on vector representations of the sentences. For each cluster, a set of tuples that best represents or best fits the cluster is selected. These sentence tuples are fed to an ontology to determine ontological entities associated with each tuple. These determined ontological entities are associated with the clusters corresponding to each tuple. The sentence tuples associated with each cluster are labeled based on the ontological entities associated with the cluster. The labeled sentence tuples may then be used for a variety of purposes such as training a model to determine the topic of short text sequences.
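
The tuple-building step in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the regex-based sentence splitter and the names `split_sentences` and `sentence_tuples` are assumptions made for illustration, and a production system would likely use a proper sentence tokenizer.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_tuples(sentences: List[str], max_len: int = 3) -> List[Tuple[str, ...]]:
    """Return every run of 1..max_len consecutive sentences, shortest runs first."""
    tuples: List[Tuple[str, ...]] = []
    for n in range(1, max_len + 1):
        for start in range(len(sentences) - n + 1):
            tuples.append(tuple(sentences[start:start + n]))
    return tuples


# Hypothetical four-sentence document.
doc = "Rates rose again. Banks tightened lending. Borrowers felt the squeeze. Markets fell."
sents = split_sentences(doc)
seqs = sentence_tuples(sents)
```

With four sentences this yields four single-sentence tuples, three pairs, and two triplets, so each sentence appears in several overlapping short text sequences.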

First claim

Opening claim text (preview).

What is claimed:

1. A method for automatically generating labeled short text sequences from a document corpus without a human reviewer comprising: receiving a plurality of documents by a computing device; for each document of the plurality of documents, extracting a plurality of sentences from the document by the computing device; for each of the plurality of sentences, by the computing device: calculating a complexity for each sentence in the plurality of sentences, and removing sentences from the plurality of sentences with a calculated complexity that does not exceed a threshold; for each document of the plurality of documents, generating a plurality of short text sequences from the plurality of sentences extracted from the document by the computing device; assigning each of the plurality of short text sequences into one or more clusters of a plurality of clusters by the computing device; determining one or more topics for each cluster based on one or more of the short text sequences associated with the cluster by the computing device; for each short text sequence, labeling the short text sequence using the one or more topics determined for the one or more clusters of the plurality of clusters that the short text sequence is assigned to by the computing device; and training a model to classify short text sequence inputs using the plurality of labeled short text sequences by the computing device.

2. The method of claim 1, wherein assigning each of the plurality of short text sequences into one or more clusters of the plurality of clusters comprises: for each short text sequence, generating a vector representation of the short text sequence; and assigning each of the plurality of short text sequences into one or more clusters of the plurality of clusters based on the vector representations.

3. The method of claim 2, wherein determining one or more topics for each cluster based on the short text sequences associated with the cluster comprises: for each short text sequence, calculating the probability that the vector representation of the short text sequence belongs to each cluster; for each cluster, selecting a subset of the vector representations based on the calculated probabilities; for each cluster, using an ontology to determine ontological entities associated with the short text sequences corresponding to the vector representations in the selected subset of vector representations for the cluster by the computing device; and for each cluster, determining the one or more topics based on the determined ontological entities.

4. The method of claim 1, further comprising training a model using the labeled short text sequences.

5. The method of claim 1, wherein the threshold is zero.

6. The method of claim 1, wherein calculating the complexity for a sentence comprises calculating a number of complex nominals for the sentence.

7. The method of claim 1, wherein generating the plurality of short text sequences from the plurality of sentences extracted from the document comprises generating a short text sequence from each sentence of the plurality of sentences.

8. The method of claim 1, wherein generating the plurality of short text sequences from the plurality of sentences extracted from the document comprises generating a short text sequence from each pair of consecutive sentences of the plurality of sentences.

9. The method of claim 1, wherein generating the plurality of short text sequences from the plurality of sentences extracted from the document comprises generating a short text sequence from each triplet of consecutive sentences of the plurality of sentences.

10. A system for automatically generating labeled short text sequences from a document corpus without a human reviewer comprising: at least one computing device; and a computer-readable medium storing computer-executable instructions that when executed by the at least one computing device cause the at least one computing device to: for each document of the plurality of documents, extract a plurality of sentences from the document; for each of the plurality of sentences: calculate a complexity for each sentence in the plurality of sentences, and remove sentences from the plurality of sentences with a calculated complexity that does not exceed a threshold; for each document of the plurality of documents, generate a plurality of short text sequences from the plurality of sentences extracted from the document; assign each of the plurality of short text sequences into one or more clusters of a plurality of clusters; determine one or more topics for each cluster based on the short text sequences associated with the cluster; for each short text sequence, label the short text sequence using the one or more topics determined for the one or more clusters of the plurality of clusters that the short text sequence is assigned to; and train a model to classify short text sequence inputs using the plurality of labeled short text sequences.

11. The system of claim 10, wherein assigning each of the plurality of short text sequences into one or more clusters of the plurality of clusters comprises: for each short text sequence, generating a vector representation of the short text sequence; and assigning each of the plurality of short text sequences into one or more clusters of the plurality of clusters based on the vector representations.

12. The system of claim 11, wherein determining one or more topics for each cluster based on the short text sequences associated with the cluster comprises: for each short text sequence, calculating the probability that the vector representation of the short text sequence belongs to each cluster; for each cluster, selecting a subset of the vector representations based on the calculated probabilities; for each cluster, using an ontology to determine ontological entities associated with the short text sequences corresponding to the vector representations in the selected subset of vector representations for the cluster; and for each cluster, determining the one or more topics based on the determined ontological entities.

13. The system of claim 10, wherein the threshold is zero.

14. The system of claim 10, wherein calculating the complexity for a sentence comprises calculating a number of complex nominals for the sentence.

15. The system of claim 10, wherein generating the plurality of short text sequences from the plurality of sentences extracted from the document comprises generating a short text sequence from each sentence of the plurality of sentences.

16. The system of claim 10, wherein generating the plurality of short text sequences from the plurality of sentences extracted from the document comprises generating a short text sequence from each pair of consecutive sentences of the plurality of sentences.

17. A non-transitory computer-readable medium with instructions stored thereon that when executed by a processor cause the processor to: for each document of the plurality of documents, extract a plurality of sentences from the document; for each document of the plurality of documents, generate a plurality of short text sequences from the plurality of sentences extracted from the document; for each of the plurality of sentences: calculate a complexity for each sentence in the plurality of sentences, and remove sentences from the plurality of sentences with a calculated complexity that does not exceed a threshold; assign each of the plurality of short text sequences into one or more clusters of a plurality of clusters; determine o
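
To make the claimed steps more concrete, the following Python sketch walks through the complexity filter of claim 1 and the probability-based topic determination of claim 3 on toy data. Everything specific here is an assumption for illustration and not the patented implementation: `toy_complexity` counts long words as a stand-in for complex nominals, `membership_probs` uses a softmax over negative squared distances as the membership probability, and the `ontology` dictionary, vectors, and centroids are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def toy_complexity(sentence: str) -> int:
    """Stand-in score: words with six or more letters (NOT true complex nominals)."""
    return sum(1 for w in sentence.split() if len(w.strip(".,;:!?")) >= 6)


def filter_by_complexity(sentences: List[str],
                         complexity: Callable[[str], float],
                         threshold: float = 0) -> List[str]:
    """Remove sentences whose complexity does not exceed the threshold."""
    return [s for s in sentences if complexity(s) > threshold]


def membership_probs(vec: Vector, centroids: List[Vector]) -> List[float]:
    """Assumed scoring rule: softmax over negative squared distance to each centroid."""
    scores = [-sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_topics(seqs: List[str], vectors: List[Vector], centroids: List[Vector],
                   ontology: Dict[str, str], top_k: int = 2):
    """Pick the top_k most probable sequences per cluster and look up their entities."""
    probs = [membership_probs(v, centroids) for v in vectors]
    topics: Dict[int, List[str]] = {}
    for c in range(len(centroids)):
        ranked = sorted(range(len(seqs)), key=lambda i: probs[i][c], reverse=True)[:top_k]
        counts: Counter = Counter()
        for i in ranked:
            for word in seqs[i].lower().split():
                entity = ontology.get(word.strip(".,;:!?"))
                if entity is not None:
                    counts[entity] += 1
        # Keep the single most frequent entity as the cluster topic (a design choice).
        topics[c] = [t for t, _ in counts.most_common(1)]
    return probs, topics


def label_sequences(seqs: List[str], probs: List[List[float]],
                    topics: Dict[int, List[str]]) -> List[Tuple[str, List[str]]]:
    """Label each sequence with the topics of its most probable cluster."""
    labeled = []
    for seq, p in zip(seqs, probs):
        best = max(range(len(p)), key=p.__getitem__)
        labeled.append((seq, topics[best]))
    return labeled


# Hypothetical inputs: five sentences, one of which is too simple to keep.
raw = ["The bank raised its mortgage rate.", "Loan approvals slowed this quarter.",
       "Ok then.", "The clinic hired two nurses.", "Patients praised the new surgeon."]
kept = filter_by_complexity(raw, toy_complexity, threshold=0)
vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]  # made-up embeddings
centroids = [(1.0, 0.0), (0.0, 1.0)]                        # made-up cluster centers
ontology = {"bank": "finance", "mortgage": "finance", "loan": "finance",
            "clinic": "healthcare", "nurses": "healthcare",
            "patients": "healthcare", "surgeon": "healthcare"}
probs, topics = cluster_topics(kept, vectors, centroids, ontology)
labeled = label_sequences(kept, probs, topics)
```

The resulting `labeled` pairs are the kind of automatically labeled short text sequences that the final claimed step would pass to a classifier for training; the claims also allow a sequence to belong to more than one cluster, which this sketch simplifies to the single most probable cluster.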

Assignees

Verint Americas Inc

Inventors

Classifications

  • G06F16/355 (Primary)

    Creation or modification of classes or clusters · CPC title

  • Ontology · CPC title

  • Phrasal analysis, e.g. finite state techniques or chunking · CPC title

  • Semantic analysis · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11797594B2 cover?
A set of documents related to a particular topic, industry, or entity are received. Sentences are extracted from each document. The sentences are grouped into tuples of one, two, or three consecutive sentences (i.e., short text sequences). The sentence tuples are clustered based on vector representations of the sentences. For each cluster, a set of tuples that best represents or best fits the clu…
Who is the assignee on this patent?
Verint Americas Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/355. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 24, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).