Pre-trained contextual embedding models for named entity recognition and confidence prediction
US-2021149993-A1 · May 20, 2021 · US
US12340182B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12340182-B2 |
| Application number | US-202217707110-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 29, 2022 |
| Priority date | Apr 1, 2021 |
| Publication date | Jun 24, 2025 |
| Grant date | Jun 24, 2025 |
Official abstract text for this publication.
Disclosed herein are system, method, and computer program product embodiments for categorizing sequences of text extracted from documents using natural language processing. In some embodiments, a categorization system may receive a first document file in a machine readable format. The categorization system may analyze a sequence of text from the first document file and identify a numeric text string in the sequence. The categorization system may also identify text data in the sequence matching text data from a second document file. The categorization system may remove the numeric text string and the matching data from the sequence to generate a trimmed version of the sequence. The categorization system may then apply a vectorization model to the trimmed version of the sequence as well as a trained deep learning model to the vector version to identify a corresponding category for the sequence of text.
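The trimming step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the whitespace tokenizer, the token-level matching, the `MATCH_THRESHOLD` value, and the function names are all assumptions introduced here for illustration. The sketch drops tokens that also appear in at least a threshold number of other documents, then strips digit runs so that an alphanumeric reference such as `REF12345` is reduced to its text part. Vectorization and the deep learning model are left out.

```python
import re
from collections import Counter

# Hypothetical threshold: a token counts as shared boilerplate when it
# appears in at least this many of the other documents.
MATCH_THRESHOLD = 2

NUMERIC_RUN = re.compile(r"\d+")


def tokens_of(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def common_tokens(other_documents, threshold=MATCH_THRESHOLD):
    """Return tokens that occur in at least `threshold` of the other documents."""
    doc_counts = Counter()
    for doc in other_documents:
        # Count each token once per document, not once per occurrence.
        doc_counts.update(set(tokens_of(doc)))
    return {tok for tok, n in doc_counts.items() if n >= threshold}


def trim_sequence(sequence, other_documents, threshold=MATCH_THRESHOLD):
    """Remove shared tokens and digit runs from one sequence of text."""
    boilerplate = common_tokens(other_documents, threshold)
    kept = []
    for tok in tokens_of(sequence):
        if tok in boilerplate:
            continue  # text matched in enough other documents
        stripped = NUMERIC_RUN.sub("", tok)  # "ref12345" -> "ref"
        if stripped:  # purely numeric tokens vanish entirely
            kept.append(stripped)
    return " ".join(kept)


# Made-up transaction descriptions, used only to exercise the sketch.
other_docs = [
    "POS DEBIT CARD PURCHASE 0412 STARBUCKS",
    "POS DEBIT CARD PURCHASE 0413 SHELL OIL",
    "ACH CREDIT PAYROLL ACME",
]
row = "POS DEBIT CARD PURCHASE REF12345 WALMART 0098"
print(trim_sequence(row, other_docs))
```

With these sample rows, `pos`, `debit`, `card`, and `purchase` each appear in two of the other documents and are removed, leaving only the merchant-specific text for a downstream vectorization model. Raising `threshold` makes the trimming more conservative.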
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for categorizing text data, comprising: receiving a first document file in a machine-readable format, wherein the first document file includes one or more sequences of text; analyzing a sequence of text from the one or more sequences to identify a numeric text string in the sequence of text that forms an alphanumeric reference; generating a trimmed version of the sequence of text by: removing first text data from the sequence of text based on an amount of matches between the first text data and respective text data for a plurality of document files satisfying a match threshold, and removing the numeric text string from the sequence of text to transform the alphanumeric reference to second text data; generating, based on the trimmed version of the sequence of text put into a vectorization model, a vector version of the trimmed version of the sequence of text; and generating, based on the vector version put into a deep learning model, a categorization of the sequence of text, wherein the deep learning model is pre-trained to categorize vector representations of the text data into predefined categories based on language pattern dependencies indicated by states of cells of the vector version that correspond to each portion of another sequence of text that is managed by a respective neural network of a plurality of neural networks of the deep learning model, wherein the language pattern dependencies correspond to a first language that is different from a second language indicated by the trimmed version of the sequence of text.
2. The computer-implemented method of claim 1, wherein the first document file is a commercial bank statement.
3. The computer-implemented method of claim 2, wherein the sequence of text is a row of transaction description text from the commercial bank statement.
4. The computer-implemented method of claim 1, wherein analyzing the sequence of text further comprises: applying a crowd learning algorithm to compare the first text data to text data from the plurality of document files including a second document file.
5. The computer-implemented method of claim 1, wherein the generating the vector version of the trimmed version of the sequence is further based on: a word2vec algorithm.
6. The computer-implemented method of claim 1, wherein the deep learning model is a long short-term memory (LSTM) model.
7. The computer-implemented method of claim 4, wherein the first document file includes a plurality of sequences of text, the method further comprising: analyzing the plurality of sequences to identify third text data matching fourth text data from the second document file; in response to analyzing the plurality of sequences to identify the third text data matching the fourth text data, removing the third text data from the plurality of sequences to generate a trimmed version of the plurality of sequences; applying the vectorization model to the trimmed version of the plurality of sequences to generate a vector version of the plurality of sequences; and applying the deep learning model to the vector version of the plurality of sequences to categorize each sequence from the plurality of sequences.
8. A system for categorizing text data, comprising: a memory; and at least one processor coupled to the memory and configured to: receive a first document file in a machine-readable format, wherein the first document file includes one or more sequences of text; analyze a sequence of text from the one or more sequences to identify a numeric text string in the sequence of text that forms an alphanumeric reference; generate a trimmed version of the sequence of text by: removing first text data from the sequence of text based on an amount of matches between the first text data and respective text data for a plurality of document files satisfying a match threshold, and removing one or more numeric text strings from the sequence of text; generate, based on the trimmed version of the sequence of text put into a vectorization model, a vector version of the trimmed version of the sequence of text; and generating, based on the vector version put into a deep learning model, a categorization of the sequence of text, wherein the deep learning model is pre-trained to categorize vector representations of the text data into predefined categories based on language pattern dependencies indicated by states of cells of the vector version that correspond to each portion of another sequence of text that is managed by a respective neural network of a plurality of neural networks of the deep learning model, wherein the language pattern dependencies correspond to a first language that is different from a second language indicated by the trimmed version of the sequence of text.
9. The system of claim 8, wherein the first document file is a commercial bank statement.
10. The system of claim 9, wherein the sequence of text is a row of transaction description text from the commercial bank statement.
11. The system of claim 8, wherein to analyze the sequence of text, the at least one processor is further configured to: apply a crowd learning algorithm to compare the first text data to text data from the plurality of document files including a second document file.
12. The system of claim 8, wherein to generate the vector version of the trimmed version of the sequence the at least one processor is further configured to: execute a word2vec algorithm.
13. The system of claim 8, wherein the deep learning model is a long short-term memory (LSTM) model.
14. The system of claim 11, wherein the first document file includes a plurality of sequences of text and wherein the at least one processor is further configured to: analyze the plurality of sequences to identify third text data matching fourth text data from the second document file; in response to analyzing the plurality of sequences to identify the third text data matching the fourth text data, remove the third text data from the plurality of sequences to generate a trimmed version of the plurality of sequences; apply the vectorization model to the trimmed version of the plurality of sequences to generate a vector version of the plurality of sequences; and apply the deep learning model to the vector version of the plurality of sequences to categorize each sequence from the plurality of sequences.
15. A non-transitory computer-readable device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform operations comprising: receiving a first document file in a machine-readable format, wherein the first document file includes one or more sequences of text; analyzing a sequence of text from the one or more sequences to identify a numeric text string in the sequence of text that forms an alphanumeric reference; generating a trimmed version of the sequence of text by: removing first text data from the sequence of text based on an amount of matches between the first text data and respective text data for a plurality of document files satisfying a match threshold, and removing one or more numeric text strings from the sequence of text; generating, based on the trimmed version of the sequence put into a vectorization model, a vector version of the trimmed version of the sequence of text; and generating, based on the vector version put into a deep learning model, a categorization of the sequence of text, wherein the deep learning model is pre-trained to categorize vector representations of text data into predefin
Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title
Document matching, e.g. of document images · CPC title
Orthographic correction, e.g. spell checking or vowelisation · CPC title
Credit; Loans; Processing thereof · CPC title
Banking, e.g. interest calculation or account maintenance (credit or loans G06Q40/03) · CPC title