Generating unique word embeddings for jargon-specific tabular data for neural network training and usage

US12254265B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12254265-B2
Application number: US-202117483989-A
Country: US
Kind code: B2
Filing date: Sep 24, 2021
Priority date: Sep 24, 2021
Publication date: Mar 18, 2025
Grant date: Mar 18, 2025

Abstract

Official abstract text for this publication.

Tabular data is accessed that contains multiple entries of alphanumeric data. Multiple tokens are generated of the multiple entries of alphanumeric data using a tokenization process. The tokenization process maintains jargon-specific features of the alphanumeric data. Multiple embeddings of the multiple entries of alphanumeric data are generated using the tokens. The embeddings capture similarity of the multiple entries considering all of global features, column features, and row features in the tokens of the tabular data. A neural network is used to predict probabilities for pre-defined classes for the tabular data using the generated embeddings.
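
To illustrate the tokenization step summarized above: claim 1 describes replacing every numerical character with a replacement character, one for one, so that the masked string keeps its original length and character order. Claims 4 and 5 describe storing the replaced digits and encoding them as counts for each of the digits zero through nine. The Python sketch below is one possible reading of those two steps, not the patented implementation. The function names mask_digits and digit_frequency_vector, the mask character "#", and the sample part number are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

ASCII_DIGITS = "0123456789"
MASK_CHAR = "#"  # assumption: this excerpt does not name the replacement character


def mask_digits(entry: str, mask_char: str = MASK_CHAR) -> Tuple[str, str]:
    """Mask every digit in `entry`, one replacement character per digit.

    Returns (masked_string, removed_digits). Non-digit characters are left
    unchanged, so the masked string has the same length and order as the input.
    """
    masked = "".join(mask_char if ch in ASCII_DIGITS else ch for ch in entry)
    removed = "".join(ch for ch in entry if ch in ASCII_DIGITS)
    return masked, removed


def digit_frequency_vector(removed_digits: str) -> List[int]:
    """Count how often each digit 0..9 occurs among the removed digits.

    This is one interpretation of the "numerical frequency encoded vector"
    of claims 4 and 5: a 10-element vector, index d holding the count of digit d.
    """
    counts = Counter(removed_digits)
    return [counts.get(str(d), 0) for d in range(10)]


if __name__ == "__main__":
    # Hypothetical jargon-style cell value, not taken from the patent.
    masked, removed = mask_digits("PN-4471-A2")
    print(masked)                           # PN-####-A#
    print(removed)                          # 44712
    print(digit_frequency_vector(removed))  # [0, 1, 1, 0, 2, 0, 0, 1, 0, 0]
```

Under this reading, two entries that differ only in their digits map to the same masked token, while the frequency vector keeps a coarse record of the digits that were masked. Claim 4 describes concatenating that vector to the token's embedding to form the final embedding vector.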

First claim

Opening claim text (preview).

What is claimed is:

1. A method of using a computing device to generate unique word embeddings for jargon-specific tabular data comprising:
    accessing by the computing device tabular data containing a plurality of entries of alphanumeric data, individual entries comprising one or more strings;
    generating, by the computing device using a tokenization process, a plurality of tokens of the plurality of entries of alphanumeric data, the tokenization process maintaining jargon-specific features of the alphanumeric data by masking a set comprising every individual numerical character in all strings of the plurality of entries of alphanumeric data in the tabular data by replacing the individual numerical characters of the set with an equal size set of individual replacement characters to form masked strings, wherein one or more of the plurality of tokens comprise masked strings that maintain an original sequence of the alphanumeric data while masking one or more characters in the original sequence and keeping other characters in the one or more characters as unchanged;
    generating, by the computing device using the tokens, a plurality of embeddings of the plurality of entries of alphanumeric data, the plurality of embeddings capturing similarity of the plurality of entries considering all of global features, column features, and row features in the tokens of the tabular data, and wherein the generating the plurality of embeddings creates an embedded table;
    forming a total context by: for a cell in the embedded table, extracting a sliced table containing the cell and adjacent cells; selecting a row context and a column context for the cell from rows and columns of the sliced table; and concatenating the row context and the column context to form the total context;
    training a supervised attention-based neural network at least by applying cells, of the embedded table, and corresponding total context to the supervised attention-based neural network using pre-defined classes; and
    predicting, by the computing device using the supervised attention-based neural network, probabilities for the pre-defined classes for the tabular data using the generated plurality of embeddings.

2. The method of claim 1, where the tabular data is located in a spreadsheet.

3. The method of claim 1, wherein: the tabular data is considered to have a format comprising: the tabular data is considered to be a document; rows into which the tabular data is organized are considered to be context; and columns into which the tabular data is organized are considered to be one or more words and the tokens have replaced corresponding tabular data in the columns, and the generating the plurality of embeddings of the plurality of entries of alphanumeric data uses this format for the tabular data.

4. The method of claim 3, wherein: the tokenization process stores numerical characters that were replaced during the masking and in association with a corresponding token that had alphanumeric data where masking was performed; and the generating the plurality of embeddings of the plurality of entries of alphanumeric data comprises: predicting target cells for embedded output using context of the rows and columns associated with the target cells to create embeddings for the target cells; and generating final embedding vectors for the target cells at least by concatenating the numerical characters to associated previously created embeddings for the target cells as corresponding numerical frequency encoded vectors.

5. The method of claim 4, wherein the numerical frequency encoded vectors have information indicating a number of times an associated number for the token in an associated target cell has been seen, for each of digits zero through nine.

6. The method of claim 4, wherein the predicting target cells for embedded output using context of the rows and columns associated with the target cells to create embeddings for the target cell further comprises considering all cell entries in the row of the target cell and a next N cell entries from the column of the target cell, N being one or more but less than all of the entries in the column.

7. The method of claim 4, wherein the predicting target cells for embedded output using context of the rows and columns associated with the target cells to create embeddings for the target cell further comprises considering one or more of: a top K frequent cell entries within a column and a row for the target cell; M cell entries within the row of the target cell; or N cell entries in the column of the target cell.

8. The method of claim 1, wherein the pre-defined classes are headers of columns in the tabular data.

9. The method of claim 1, wherein selecting context for the cell from rows and columns of the sliced table comprises: selecting a first row in the sliced table as the row context; and selecting a jth column in the sliced table as the column context.

10. The method of claim 1, wherein: the tokenization process forms a table of the plurality of tokens; and the generating, by the computing device using the tokens, the plurality of embeddings identifies the table as a document with each row as a sentence and masked strings as words.

11. A computing device to generate unique word embeddings for jargon-specific tabular data, comprising: one or more memories having computer-readable code thereon; and one or more processors, the one or more processors, in response to retrieval and execution of the computer-readable code, causing the computing device to perform operations comprising:
    accessing by the computing device tabular data containing a plurality of entries of alphanumeric data, individual entries comprising one or more strings;
    generating, by the computing device using a tokenization process, a plurality of tokens of the plurality of entries of alphanumeric data, the tokenization process maintaining jargon-specific features of the alphanumeric data by masking a set comprising every individual numerical character in all strings of the plurality of entries of alphanumeric data in the tabular data by replacing the individual numerical characters of the set with an equal size set of individual replacement characters to form masked strings, wherein one or more of the plurality of tokens comprise masked strings that maintain an original sequence of the alphanumeric data while masking one or more characters in the original sequence and keeping other characters in the one or more characters as unchanged;
    generating, by the computing device using the tokens, a plurality of embeddings of the plurality of entries of alphanumeric data, the plurality of embeddings capturing similarity of the plurality of entries considering all of global features, column features, and row features in the tokens of the tabular data, and wherein the generating the plurality of embeddings creates an embedded table;
    forming a total context by: for a cell in the embedded table, extracting a sliced table containing the cell and adjacent cells; selecting a row context and a column context for the cell from rows and columns of the sliced table; and concatenating the row context and the column context to form the total context;
    training a supervised attention-based neural network at least by applying cells, of the embedded table, and corresponding total context to the supervised attention-based neural network using pre-defined classes; and
    predicting, by the computing device using the supervised attention-based neural network, probabilities for the pre-defined classes for the tabular data using the generated plurality of embeddings.

12. The computing device of claim 11, w
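
The "forming a total context" step of claim 1 can be sketched in a few lines. Claim 9 narrows the selection to the first row of the sliced table as the row context and the jth column of the sliced table as the column context. The Python sketch below is a hedged illustration of that reading and not the patent's implementation. The function names slice_table and total_context, the radius parameter that defines "adjacent cells", the clipping at table edges, and the treatment of the jth column as the column passing through the target cell are all assumptions.

```python
from typing import List, Tuple

Vector = List[float]
EmbeddedTable = List[List[Vector]]  # rows x columns x embedding dimension


def slice_table(table: EmbeddedTable, i: int, j: int,
                radius: int = 1) -> Tuple[EmbeddedTable, int]:
    """Extract the cells within `radius` of cell (i, j), clipped at table edges.

    Returns the sliced table and the index of the target column inside it.
    `radius` is a hypothetical parameter; the claim only says "adjacent cells".
    """
    top, bottom = max(0, i - radius), min(len(table), i + radius + 1)
    left, right = max(0, j - radius), min(len(table[0]), j + radius + 1)
    sliced = [row[left:right] for row in table[top:bottom]]
    return sliced, j - left


def total_context(table: EmbeddedTable, i: int, j: int,
                  radius: int = 1) -> Vector:
    """Concatenate a row context and a column context for cell (i, j)."""
    sliced, j_local = slice_table(table, i, j, radius)
    # Row context: first row of the sliced table (one reading of claim 9).
    row_context = [x for cell in sliced[0] for x in cell]
    # Column context: the sliced column that passes through the target cell.
    column_context = [x for row in sliced for x in row[j_local]]
    return row_context + column_context


if __name__ == "__main__":
    # Toy 3x3 embedded table with 1-dimensional embeddings (made-up values).
    table = [[[1.0], [2.0], [3.0]],
             [[4.0], [5.0], [6.0]],
             [[7.0], [8.0], [9.0]]]
    print(total_context(table, 1, 1))  # [1.0, 2.0, 3.0, 2.0, 5.0, 8.0]
    print(total_context(table, 0, 0))  # [1.0, 2.0, 1.0, 4.0]
```

In this sketch the context is shorter for cells at the table edge because the slice is clipped, so a fixed-size network input would need padding or a different edge policy. The excerpt does not say how the patent handles that case.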

Assignees

IBM

Inventors

Classifications

  • Probabilistic or stochastic networks

  • Lexical analysis, e.g. tokenisation or collocates

  • Learning methods

  • Supervised learning

  • Combinations of networks

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/18. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 18, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).