System and method for classifying textual data blocks

US12346364B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12346364-B2
Application number: US-202218087069-A
Country: US
Kind code: B2
Filing date: Dec 22, 2022
Priority date: Dec 22, 2022
Publication date: Jul 1, 2025
Grant date: Jul 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and a system of classifying textual data blocks are claimed. The method includes receiving at least one textual data block in an original version, including a plurality of textual data elements; performing a preprocessing procedure on the at least one textual data block in the original version, wherein the preprocessing procedure includes replacing the textual data elements characterized by pertinence to at least one specific part-of-speech (POS) category with a respective POS token, thereby obtaining the at least one textual data block in a preprocessed version; inferring a pretrained ML-based model on the at least one textual data block in the preprocessed version, to classify the at least one textual data block by pertinence to the at least one class.
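
The POS-token replacement step described in the abstract can be illustrated with a minimal sketch. It assumes a crude capitalization heuristic in place of a real part-of-speech tagger, and the token spelling `<PROPN>` and both function names are hypothetical; the patent does not specify any of them.

```python
# Hypothetical placeholder token; the patent does not define token spellings.
PROPN_TOKEN = "<PROPN>"


def looks_like_proper_noun(word: str, is_first: bool) -> bool:
    """Crude stand-in for a POS tagger (an assumption, not the patent's method):
    a capitalized alphabetic word that is not the first word of the block."""
    return (not is_first) and word[:1].isupper() and word.isalpha()


def replace_pos_elements(block: str) -> str:
    """Replace each word judged to be a proper noun with PROPN_TOKEN."""
    out = []
    for i, word in enumerate(block.split()):
        core = word.strip(".,;:!?")  # ignore surrounding punctuation
        if core and looks_like_proper_noun(core, is_first=(i == 0)):
            out.append(word.replace(core, PROPN_TOKEN))
        else:
            out.append(word)
    return " ".join(out)


preprocessed = replace_pos_elements("Best regards, Alice Smith")
print(preprocessed)
```

Replacing concrete names with a shared token means a downstream classifier sees the same pattern for every signature, whatever the sender's name. A production pipeline would swap the heuristic for a trained POS tagger.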

First claim

Opening claim text (preview).

The invention claimed is:

1. A method of classifying textual data blocks by at least one processor, the method comprising: receiving textual data blocks in an original version, each of the textual data blocks comprising a plurality of textual data elements; performing a preprocessing procedure on each of the textual data blocks in the original version, wherein the preprocessing procedure comprises: replacing, for each of the textual data blocks, each of the textual data elements characterized by presence of a specific character or sequence of characters with a respective character-based token to generate a first partial tokenized version of the respective textual data block; replacing, for each first partial tokenized version of the textual data blocks, each of the first partial tokenized versions further characterized by a specific contextual definition with a respective context-based token to generate a second partial tokenized version of the respective first partial tokenized first; and replacing, for each second partial tokenized version of the textual data blocks, each of the second partial tokenized versions further characterized by pertinence to at least one specific part-of-speech (POS) category with a respective POS token, thereby obtaining, for each of the textual data blocks, the textual data block in a preprocessed tokenized version of the respective textual data block; forming a training dataset comprising the textual data blocks in the preprocessed tokenized version labeled with an indication of pertinence to at least once class indicative of an email signature block; training a machine learning-based (ML-based) model to classify textual data blocks by pertinence to the at least one class, based on the training dataset, wherein the ML-based model comprises an artificial neural network; receiving a new textual data block in an original version, the new textual data block comprising a plurality of textual data elements and performing the preprocessing procedure to obtain, for the new textual data block, the new textual data block in the preprocessed tokenized version; performing machine learning, using the trained ML-based model, on the new textual data block in the preprocessed tokenized version to classify the new textual data block by pertinence to the at least one class.

2. The method of claim 1, wherein the at least one specific POS category is “proper noun” category; and the preprocessing procedure further comprises defining the textual data elements as pertaining to a “proper noun” category.

3. The method of claim 1, wherein the specific character or sequence of characters are a character or sequence of characters specific for at least one of: an email address, an alphanumeric or numeric code, and a Uniform Resource Locator (URL).

4. The method of claim 1, wherein the specific contextual definition represents a definition of the textual data elements as pertaining to a “named entity” category.

5. The method of claim 1, wherein each one of the textual data blocks comprises textual data elements arranged in at least one line; and the method further comprises preliminarily classifying each one of the textual data blocks by pertinence to the at least one class based on a length of the at least one line of the textual data elements.

6. The method of claim 1, wherein the preprocessing procedure further comprises embedding each preprocessed tokenized version of the textual data blocks into a vector space; wherein each one of the textual data blocks represents a paragraph of an email and the textual data elements represent words.

7. The method of claim 6, wherein embedding each preprocessed tokenized version of the textual data blocks into a vector space comprises creating a vector representation of each textual data element based on a term frequency-inverse document frequency (TF-IDF) measure.

8. The method of claim 1, wherein each one of the textual data blocks represents a paragraph of an email and the textual data elements represent words.

9. The method of claim 1, wherein each one of the textual data blocks represents an email section of an email.

10. A system for classifying textual data blocks, the system comprising: a non-transitory memory device, wherein modules of instruction code are stored, and at least one processor associated with the memory device, and configured to execute the modules of instruction code, whereupon execution of said modules of instruction code, the at least one processor is configured to: receive textual data blocks in an original version, each of the textual data blocks comprising a plurality of textual data elements; perform a preprocessing procedure on each of the textual data blocks in the original version, wherein the preprocessing procedure comprises: replacing, for each of the textual data blocks, each of the textual data elements characterized by presence of a specific character or sequence of characters with a respective character-based token to generate a first partial tokenized version of the respective textual data block; replacing, for each first partial tokenized version of the textual data blocks, each of the first partial tokenized versions further characterized by a specific contextual definition with a respective context-based token to generate a second partial tokenized version of the respective first partial tokenized first; and replacing, for each second partial tokenized version of the textual data blocks, each of the second partial tokenized versions further characterized by pertinence to at least one specific part-of-speech (POS) category with a respective POS token, thereby obtaining, for each of the textual data blocks, the textual data block in a preprocessed tokenized version of the respective textual data block; form a training dataset comprising the textual data blocks in the preprocessed tokenized version labeled with an indication of pertinence to at least once class indicative of an email signature block; train a machine learning-based (ML-based) model to classify textual data blocks by pertinence to the at least one class, based on the training dataset, wherein the ML-based model comprises an artificial neural network; receive a new textual data block in an original version, the new textual data block comprising a plurality of textual data elements and performing the preprocessing procedure to obtain, for the new textual data block, the new textual data block in the preprocessed version; perform machine learning, using the trained ML-based model, on the new textual data block in the preprocessed tokenized version to classify the new textual data block by pertinence to the at least one class.

11. The system of claim 10, wherein the at least one specific POS category is a “proper noun” category; and the preprocessing procedure further comprises defining the textual data elements as pertaining to a “proper noun” category.

12. The system of claim 10, wherein the specific character or sequence of characters are a character or sequence of characters specific for at least one of: an email address, an alphanumeric or numeric code, and a Uniform Resource Locator (URL).

13. The system of claim 10, wherein the specific contextual definition represents a definition of the textual data elements as pertaining to a “named entity” category.

14. The system of claim 10, wherein each one of the textual data blocks comprises textual data elements arranged in at least one line; and the at least one processor is further configured to preliminarily classify each one of the textual data blocks by pertinence to the at least one class based on a length of the at
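
Three of the claimed steps can be sketched in a few lines: the character-based token replacement (email address, URL, numeric code), the preliminary line-length classification, and the TF-IDF embedding. This is a minimal sketch under stated assumptions, not the patented implementation. The regular expressions, the token spellings, the 40-character threshold, the choice that short lines suggest a signature, and the TF-IDF variant (relative term frequency times log of inverse document frequency) are all assumptions; the claims fix none of them.

```python
import math
import re
from collections import Counter

# Hypothetical token spellings and patterns; the claims do not define them.
CHAR_PATTERNS = [
    ("<EMAIL>", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("<URL>", re.compile(r"\b(?:https?://|www\.)\S+")),
    ("<CODE>", re.compile(r"\+?\d[\d\-() ]{5,}\d")),
]


def replace_char_tokens(block: str) -> str:
    """First stage: swap character-pattern matches for character-based tokens."""
    for token, pattern in CHAR_PATTERNS:
        block = pattern.sub(token, block)
    return block


def is_signature_candidate(block: str, max_len: int = 40) -> bool:
    """Preliminary screen on line length. Assumes (not stated in the claims)
    that blocks made only of short lines are signature candidates."""
    lines = [line for line in block.splitlines() if line.strip()]
    return bool(lines) and all(len(line) <= max_len for line in lines)


def tfidf(blocks: list[list[str]]) -> list[dict[str, float]]:
    """Sparse TF-IDF vectors, one dict per block of word tokens."""
    n = len(blocks)
    # Document frequency: number of blocks in which each term appears.
    df = Counter(term for block in blocks for term in set(block))
    vectors = []
    for block in blocks:
        tf = Counter(block)
        vectors.append(
            {t: (c / len(block)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


signature = "Jane Doe\njane@example.com\nhttps://example.com\n+1 555 0100"
body = ("Please review the attached report and send feedback by the end of "
        "next week so that we can proceed")

tokenized = [replace_char_tokens(b) for b in (signature, body)]
vectors = tfidf([b.split() for b in tokenized])
print(tokenized[0])
print(is_signature_candidate(tokenized[0]), is_signature_candidate(tokenized[1]))
```

The resulting sparse vectors could serve as input to a neural-network classifier trained on labeled blocks, as the independent claims describe. The context-based (named entity) and POS stages are omitted here because they need a trained tagger.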

Assignees

Genesys Cloud Services Inc

Inventors

Classifications

Primary CPC: G06F16/353

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12346364B2 cover?
A method and a system of classifying textual data blocks are claimed. The method includes receiving at least one textual data block in an original version, including a plurality of textual data elements; performing a preprocessing procedure on the at least one textual data block in the original version, wherein the preprocessing procedure includes replacing the textual data elements characteriz…
Who is the assignee on this patent?
Genesys Cloud Services Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/353. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 1, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).