Natural language processing for categorizing sequences of text data

US12340182B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12340182-B2
Application number: US-202217707110-A
Country: US
Kind code: B2
Filing date: Mar 29, 2022
Priority date: Apr 1, 2021
Publication date: Jun 24, 2025
Grant date: Jun 24, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are system, method, and computer program product embodiments for categorizing sequences of text extracted from documents using natural language processing. In some embodiments, a categorization system may receive a first document file in a machine readable format. The categorization system may analyze a sequence of text from the first document file and identify a numeric text string in the sequence. The categorization system may also identify text data in the sequence matching text data from a second document file. The categorization system may remove the numeric text string and the matching data from the sequence to generate a trimmed version of the sequence. The categorization system may then apply a vectorization model to the trimmed version of the sequence as well as a trained deep learning model to the vector version to identify a corresponding category for the sequence of text.
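
The trimming step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `trim_sequence`, the whitespace tokenization, the rule that a "numeric text string" is any run of digits, and the sample transaction text are all assumptions made for illustration.

```python
import re

# Assumption: a "numeric text string" is any run of digits, even inside a token.
NUMERIC_RUN = re.compile(r"\d+")


def trim_sequence(sequence, second_document_tokens):
    """Return a trimmed version of one sequence of text.

    `second_document_tokens` is a set of upper-case tokens taken from a
    second document file (hypothetical representation).
    """
    kept = []
    for token in sequence.split():
        # Remove the numeric text string, e.g. "REF12345" becomes "REF".
        stripped = NUMERIC_RUN.sub("", token)
        # Skip tokens with no letters left (purely numeric or punctuation).
        if not any(ch.isalpha() for ch in stripped):
            continue
        # Skip text that also appears in the second document file.
        if stripped.upper() in second_document_tokens:
            continue
        kept.append(stripped)
    return " ".join(kept)


second_doc = {"POS", "PURCHASE"}
print(trim_sequence("POS PURCHASE REF12345 ACME COFFEE 0042", second_doc))
# prints "REF ACME COFFEE"
```

In a fuller pipeline the trimmed string would then be passed to a vectorization model and a trained deep learning model; those stages are omitted here because the abstract does not fix their details.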

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for categorizing text data, comprising: receiving a first document file in a machine-readable format, wherein the first document file includes one or more sequences of text; analyzing a sequence of text from the one or more sequences to identify a numeric text string in the sequence of text that forms an alphanumeric reference; generating a trimmed version of the sequence of text by: removing first text data from the sequence of text based on an amount of matches between the first text data and respective text data for a plurality of document files satisfying a match threshold, and removing the numeric text string from the sequence of text to transform the alphanumeric reference to second text data; generating, based on the trimmed version of the sequence of text put into a vectorization model, a vector version of the trimmed version of the sequence of text; and generating, based on the vector version put into a deep learning model, a categorization of the sequence of text, wherein the deep learning model is pre-trained to categorize vector representations of the text data into predefined categories based on language pattern dependencies indicated by states of cells of the vector version that correspond to each portion of another sequence of text that is managed by a respective neural network of a plurality of neural networks of the deep learning model, wherein the language pattern dependencies correspond to a first language that is different from a second language indicated by the trimmed version of the sequence of text.

2. The computer-implemented method of claim 1, wherein the first document file is a commercial bank statement.

3. The computer-implemented method of claim 2, wherein the sequence of text is a row of transaction description text from the commercial bank statement.

4. The computer-implemented method of claim 1, wherein analyzing the sequence of text further comprises: applying a crowd learning algorithm to compare the first text data to text data from the plurality of document files including a second document file.

5. The computer-implemented method of claim 1, wherein the generating the vector version of the trimmed version of the sequence is further based on: a word2vec algorithm.

6. The computer-implemented method of claim 1, wherein the deep learning model is a long short-term memory (LSTM) model.

7. The computer-implemented method of claim 4, wherein the first document file includes a plurality of sequences of text, the method further comprising: analyzing the plurality of sequences to identify third text data matching fourth text data from the second document file; in response to analyzing the plurality of sequences to identify the third text data matching the fourth text data, removing the third text data from the plurality of sequences to generate a trimmed version of the plurality of sequences; applying the vectorization model to the trimmed version of the plurality of sequences to generate a vector version of the plurality of sequences; and applying the deep learning model to the vector version of the plurality of sequences to categorize each sequence from the plurality of sequences.

8. A system for categorizing text data, comprising: a memory; and at least one processor coupled to the memory and configured to: receive a first document file in a machine-readable format, wherein the first document file includes one or more sequences of text; analyze a sequence of text from the one or more sequences to identify a numeric text string in the sequence of text that forms an alphanumeric reference; generate a trimmed version of the sequence of text by: removing first text data from the sequence of text based on an amount of matches between the first text data and respective text data for a plurality of document files satisfying a match threshold, and removing one or more numeric text strings from the sequence of text; generate, based on the trimmed version of the sequence of text put into a vectorization model, a vector version of the trimmed version of the sequence of text; and generating, based on the vector version put into a deep learning model, a categorization of the sequence of text, wherein the deep learning model is pre-trained to categorize vector representations of the text data into predefined categories based on language pattern dependencies indicated by states of cells of the vector version that correspond to each portion of another sequence of text that is managed by a respective neural network of a plurality of neural networks of the deep learning model, wherein the language pattern dependencies correspond to a first language that is different from a second language indicated by the trimmed version of the sequence of text.

9. The system of claim 8, wherein the first document file is a commercial bank statement.

10. The system of claim 9, wherein the sequence of text is a row of transaction description text from the commercial bank statement.

11. The system of claim 8, wherein to analyze the sequence of text, the at least one processor is further configured to: apply a crowd learning algorithm to compare the first text data to text data from the plurality of document files including a second document file.

12. The system of claim 8, wherein to generate the vector version of the trimmed version of the sequence the at least one processor is further configured to: execute a word2vec algorithm.

13. The system of claim 8, wherein the deep learning model is a long short-term memory (LSTM) model.

14. The system of claim 11, wherein the first document file includes a plurality of sequences of text and wherein the at least one processor is further configured to: analyze the plurality of sequences to identify third text data matching fourth text data from the second document file; in response to analyzing the plurality of sequences to identify the third text data matching the fourth text data, remove the third text data from the plurality of sequences to generate a trimmed version of the plurality of sequences; apply the vectorization model to the trimmed version of the plurality of sequences to generate a vector version of the plurality of sequences; and apply the deep learning model to the vector version of the plurality of sequences to categorize each sequence from the plurality of sequences.

15. A non-transitory computer-readable device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform operations comprising: receiving a first document file in a machine-readable format, wherein the first document file includes one or more sequences of text; analyzing a sequence of text from the one or more sequences to identify a numeric text string in the sequence of text that forms an alphanumeric reference; generating a trimmed version of the sequence of text by: removing first text data from the sequence of text based on an amount of matches between the first text data and respective text data for a plurality of document files satisfying a match threshold, and removing one or more numeric text strings from the sequence of text; generating, based on the trimmed version of the sequence put into a vectorization model, a vector version of the trimmed version of the sequence of text; and generating, based on the vector version put into a deep learning model, a categorization of the sequence of text, wherein the deep learning model is pre-trained to categorize vector representations of text data into predefin
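
The independent claims remove text "based on an amount of matches ... satisfying a match threshold" across a plurality of document files. The claim preview does not say how matches are counted, so the following Python sketch assumes one simple reading: a token is removed when it appears in at least `match_threshold` documents. The helper names `frequent_tokens` and `remove_matched` and the sample statements are hypothetical.

```python
from collections import Counter


def frequent_tokens(documents, match_threshold):
    """Return tokens that appear in at least `match_threshold` documents.

    `documents` is a list of documents, each a list of text rows.
    Assumption: a "match" is counted once per document, not once per row.
    """
    doc_counts = Counter()
    for rows in documents:
        seen = {tok.upper() for row in rows for tok in row.split()}
        doc_counts.update(seen)
    return {tok for tok, n in doc_counts.items() if n >= match_threshold}


def remove_matched(sequence, matched):
    """Drop every token of `sequence` that is in the `matched` set."""
    return " ".join(t for t in sequence.split() if t.upper() not in matched)


statements = [
    ["CARD PAYMENT ACME COFFEE", "CARD PAYMENT CITY GAS"],
    ["CARD PAYMENT BOOK SHOP"],
    ["TRANSFER PAYMENT RENT"],
]
common = frequent_tokens(statements, match_threshold=3)
print(sorted(common))                                       # prints ['PAYMENT']
print(remove_matched("CARD PAYMENT ACME COFFEE", common))   # prints "CARD ACME COFFEE"
```

With a threshold of 3, only `PAYMENT` is treated as boilerplate because `CARD` occurs in two of the three sample documents. Lowering the threshold to 2 would remove `CARD` as well, which shows how the threshold trades off noise removal against loss of descriptive text.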

Assignees

American Express Travel Related Services Co Inc, American Express India Private Ltd

Inventors

Classifications

  • Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables

  • Document matching, e.g. of document images

  • Orthographic correction, e.g. spell checking or vowelisation

  • Credit; Loans; Processing thereof

  • Banking, e.g. interest calculation or account maintenance (credit or loans G06Q40/03)

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12340182B2 cover?
Disclosed herein are system, method, and computer program product embodiments for categorizing sequences of text extracted from documents using natural language processing. In some embodiments, a categorization system may receive a first document file in a machine readable format. The categorization system may analyze a sequence of text from the first document file and identify a numeric text s…
Who is the assignee on this patent?
American Express Travel Related Services Co Inc, American Express India Private Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/40. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 24, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).