Systems and methods for training language models to reason over tables
US-2022309087-A1 · Sep 29, 2022 · US
US12254265B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12254265-B2 |
| Application number | US-202117483989-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 24, 2021 |
| Priority date | Sep 24, 2021 |
| Publication date | Mar 18, 2025 |
| Grant date | Mar 18, 2025 |
Official abstract text for this publication.
Tabular data is accessed that contains multiple entries of alphanumeric data. Multiple tokens are generated of the multiple entries of alphanumeric data using a tokenization process. The tokenization process maintains jargon-specific features of the alphanumeric data. Multiple embeddings of the multiple entries of alphanumeric data are generated using the tokens. The embeddings capture similarity of the multiple entries considering all of global features, column features, and row features in the tokens of the tabular data. A neural network is used to predict probabilities for pre-defined classes for the tabular data using the generated embeddings.
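The abstract does not say how the tokenizer keeps jargon-specific features; claim 1 specifies that every numerical character is replaced, one for one, with a replacement character so the original sequence is preserved, and claims 4 and 5 add that the replaced digits are stored and later encoded as per-digit counts. A minimal Python sketch of that reading follows. The names `mask_digits`, `digit_frequency_vector`, `tokenize_table`, and the `#` replacement character are illustrative assumptions, not terms from the patent.

```python
from typing import List, Tuple

DIGITS = "0123456789"
MASK_CHAR = "#"  # assumption: the patent does not name a replacement character


def mask_digits(entry: str, mask_char: str = MASK_CHAR) -> Tuple[str, str]:
    """Replace each ASCII digit with mask_char, one for one.

    Returns the masked string (same length and character order as the
    input) and the digits that were replaced, in their original order.
    """
    masked = "".join(mask_char if ch in DIGITS else ch for ch in entry)
    replaced = "".join(ch for ch in entry if ch in DIGITS)
    return masked, replaced


def digit_frequency_vector(replaced: str) -> List[int]:
    """Count how often each digit 0 through 9 occurs in the replaced digits.

    This is one possible reading of the "numerical frequency encoded
    vectors" in claims 4 and 5, not a confirmed implementation.
    """
    return [replaced.count(d) for d in DIGITS]


def tokenize_table(table: List[List[str]]) -> List[List[Tuple[str, str]]]:
    """Apply mask_digits to every cell, keeping the row and column layout."""
    return [[mask_digits(cell) for cell in row] for row in table]


if __name__ == "__main__":
    masked, replaced = mask_digits("TX-4821-B")
    print(masked)    # TX-####-B
    print(replaced)  # 4821
    print(digit_frequency_vector(replaced))
```

Masking this way maps codes such as `TX-4821-B` and `TX-9930-B` to the same token, so an embedding model sees them as one word, while the stored digits can still be appended to the learned vector afterwards.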
Opening claim text (preview).
What is claimed is:

1. A method of using a computing device to generate unique word embeddings for jargon-specific tabular data comprising:
accessing by the computing device tabular data containing a plurality of entries of alphanumeric data, individual entries comprising one or more strings;
generating, by the computing device using a tokenization process, a plurality of tokens of the plurality of entries of alphanumeric data, the tokenization process maintaining jargon-specific features of the alphanumeric data by masking a set comprising every individual numerical character in all strings of the plurality of entries of alphanumeric data in the tabular data by replacing the individual numerical characters of the set with an equal size set of individual replacement characters to form masked strings, wherein one or more of the plurality of tokens comprise masked strings that maintain an original sequence of the alphanumeric data while masking one or more characters in the original sequence and keeping other characters in the one or more characters as unchanged;
generating, by the computing device using the tokens, a plurality of embeddings of the plurality of entries of alphanumeric data, the plurality of embeddings capturing similarity of the plurality of entries considering all of global features, column features, and row features in the tokens of the tabular data, and wherein the generating the plurality of embeddings creates an embedded table;
forming a total context by: for a cell in the embedded table, extracting a sliced table containing the cell and adjacent cells; selecting a row context and a column context for the cell from rows and columns of the sliced table; and concatenating the row context and the column context to form the total context;
training a supervised attention-based neural network at least by applying cells, of the embedded table, and corresponding total context to the supervised attention-based neural network using pre-defined classes; and
predicting, by the computing device using the supervised attention-based neural network, probabilities for the pre-defined classes for the tabular data using the generated plurality of embeddings.

2. The method of claim 1, where the tabular data is located in a spreadsheet.

3. The method of claim 1, wherein: the tabular data is considered to have a format comprising: the tabular data is considered to be a document; rows into which the tabular data is organized are considered to be context; and columns into which the tabular data is organized are considered to be one or more words and the tokens have replaced corresponding tabular data in the columns, and the generating the plurality of embeddings of the plurality of entries of alphanumeric data uses this format for the tabular data.

4. The method of claim 3, wherein: the tokenization process stores numerical characters that were replaced during the masking and in association with a corresponding token that had alphanumeric data where masking was performed; and the generating the plurality of embeddings of the plurality of entries of alphanumeric data comprises: predicting target cells for embedded output using context of the rows and columns associated with the target cells to create embeddings for the target cells; and generating final embedding vectors for the target cells at least by concatenating the numerical characters to associated previously created embeddings for the target cells as corresponding numerical frequency encoded vectors.

5. The method of claim 4, wherein the numerical frequency encoded vectors have information indicating a number of times an associated number for the token in an associated target cell has been seen, for each of digits zero through nine.

6. The method of claim 4, wherein the predicting target cells for embedded output using context of the rows and columns associated with the target cells to create embeddings for the target cell further comprises considering all cell entries in the row of the target cell and a next N cell entries from the column of the target cell, N being one or more but less than all of the entries in the column.

7. The method of claim 4, wherein the predicting target cells for embedded output using context of the rows and columns associated with the target cells to create embeddings for the target cell further comprises considering one or more of: a top K frequent cell entries within a column and a row for the target cell; M cell entries within the row of the target cell; or N cell entries in the column of the target cell.

8. The method of claim 1, wherein the pre-defined classes are headers of columns in the tabular data.

9. The method of claim 1, wherein selecting context for the cell from rows and columns of the sliced table comprises: selecting a first row in the sliced table as the row context; and selecting a jth column in the sliced table as the column context.

10. The method of claim 1, wherein: the tokenization process forms a table of the plurality of tokens; and the generating, by the computing device using the tokens, the plurality of embeddings identifies the table as a document with each row as a sentence and masked strings as words.

11. A computing device to generate unique word embeddings for jargon-specific tabular data, comprising:
one or more memories having computer-readable code thereon; and
one or more processors, the one or more processors, in response to retrieval and execution of the computer-readable code, causing the computing device to perform operations comprising:
accessing by the computing device tabular data containing a plurality of entries of alphanumeric data, individual entries comprising one or more strings;
generating, by the computing device using a tokenization process, a plurality of tokens of the plurality of entries of alphanumeric data, the tokenization process maintaining jargon-specific features of the alphanumeric data by masking a set comprising every individual numerical character in all strings of the plurality of entries of alphanumeric data in the tabular data by replacing the individual numerical characters of the set with an equal size set of individual replacement characters to form masked strings, wherein one or more of the plurality of tokens comprise masked strings that maintain an original sequence of the alphanumeric data while masking one or more characters in the original sequence and keeping other characters in the one or more characters as unchanged;
generating, by the computing device using the tokens, a plurality of embeddings of the plurality of entries of alphanumeric data, the plurality of embeddings capturing similarity of the plurality of entries considering all of global features, column features, and row features in the tokens of the tabular data, and wherein the generating the plurality of embeddings creates an embedded table;
forming a total context by: for a cell in the embedded table, extracting a sliced table containing the cell and adjacent cells; selecting a row context and a column context for the cell from rows and columns of the sliced table; and concatenating the row context and the column context to form the total context;
training a supervised attention-based neural network at least by applying cells, of the embedded table, and corresponding total context to the supervised attention-based neural network using pre-defined classes; and
predicting, by the computing device using the supervised attention-based neural network, probabilities for the pre-defined classes for the tabular data using the generated plurality of embeddings.

12. The computing device of claim 11, w
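The "forming a total context" step in claims 1 and 9 can be illustrated with a short Python sketch. It assumes the embedded table is a rectangular grid of equal-length vectors, that the sliced table is a square window clipped at the table edges, and that the "jth column" of claim 9 is the column of the window that contains the target cell. The claims do not fix any of these details, and the names `slice_table`, `total_context`, and `radius` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
EmbeddedTable = List[List[Vector]]


def slice_table(
    table: EmbeddedTable, row: int, col: int, radius: int = 1
) -> Tuple[EmbeddedTable, int]:
    """Extract the window of cells within `radius` of (row, col).

    The window is clipped at the table edges. Also returns the index of
    the target cell's column inside the window.
    """
    top = max(0, row - radius)
    bottom = min(len(table), row + radius + 1)
    left = max(0, col - radius)
    right = min(len(table[0]), col + radius + 1)
    window = [r[left:right] for r in table[top:bottom]]
    return window, col - left


def total_context(
    table: EmbeddedTable, row: int, col: int, radius: int = 1
) -> Vector:
    """Concatenate a row context and a column context for one cell.

    Row context: the first row of the sliced table (as in claim 9).
    Column context: the window column containing the target cell
    (an assumed reading of the "jth column").
    """
    window, j = slice_table(table, row, col, radius)
    row_context = [x for vec in window[0] for x in vec]
    column_context = [x for r in window for x in r[j]]
    return row_context + column_context


if __name__ == "__main__":
    # Toy 3x3 embedded table: each cell's vector is [row index, column index].
    toy = [[[float(r), float(c)] for c in range(3)] for r in range(3)]
    print(total_context(toy, 1, 1))
```

In the claimed method, each cell embedding and its total context would then be applied to the supervised attention-based network together with the pre-defined class labels; that training step is not sketched here.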
Probabilistic or stochastic networks · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Learning methods · CPC title
Supervised learning · CPC title
Combinations of networks · CPC title