What technology area does this patent fall under?

Primary CPC classification G06F16/211. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jan 28, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Inferring a dataset schema from input files

US12210491B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12210491-B2
Application number	US-202418438301-A
Country	US
Kind code	B2
Filing date	Feb 9, 2024
Priority date	Jul 20, 2017
Publication date	Jan 28, 2025
Grant date	Jan 28, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method comprises selecting a sample excerpt from a data input file; in response to the determining that a first row in the sample excerpt does not contain a delimited value and a second row does contain a delimited value, determining that the first row consists of header data; identifying one or more jagged rows based on row delimiters that were erroneously placed; causing displaying text that led to creation of a jagged row; receiving an addition or removal of a specific row delimiter to the text; updating the sample excerpt based on the addition or the removal; analyzing the sample excerpt to determine a row delimiter for the data input file; identifying a plurality of rows that is not included in the header data; identifying a plurality of candidate column delimiters and generating a candidate schema for the data input file.
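The header test and the candidate column delimiter step in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names, the fixed list of delimiters to try, and the "same non-zero count on every row" rule are all assumptions made for illustration. The header check reads the claim language as "the first row has no numeric field, the second row has at least one".

```python
# Hypothetical list of delimiters to try; the patent does not fix one.
CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]


def is_number(text: str) -> bool:
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def has_numeric_field(row: str, delimiter: str) -> bool:
    """True if the row splits into several fields and one of them is numeric."""
    fields = row.split(delimiter)
    return len(fields) > 1 and any(is_number(f.strip()) for f in fields)


def first_row_is_header(rows: list[str], delimiter: str) -> bool:
    """Assumed heuristic: first row has no numeric field, second row has one."""
    if len(rows) < 2:
        return False
    return (not has_numeric_field(rows[0], delimiter)
            and has_numeric_field(rows[1], delimiter))


def candidate_column_delimiters(rows: list[str]) -> list[str]:
    """Keep delimiters that occur the same non-zero number of times in every row."""
    candidates = []
    for delimiter in CANDIDATE_DELIMITERS:
        counts = {row.count(delimiter) for row in rows}
        if len(counts) == 1 and 0 not in counts:
            candidates.append(delimiter)
    return candidates


# Sample excerpt split on an assumed row delimiter of "\n".
sample = "name,age,city\nAda,36,London\nBob,41,Paris\n"
rows = [r for r in sample.split("\n") if r]
header = first_row_is_header(rows, ",")
body = rows[1:] if header else rows
print(header, candidate_column_delimiters(body))
```

In this sketch the header check is given a delimiter up front; a fuller pipeline would presumably try each candidate in turn.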

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving a data input file to be stored in a database, the data input file having unknown schema; selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file; determining that a first row in the sample excerpt does not contain a delimited numeric value; determining that a second row in the sample excerpt following the first row does contain a delimited value; in response to the determining that the first row in the sample excerpt does not contain a delimited value and the second row in the sample excerpt does contain a delimited value, determining that the first row consists of header data for the data input file; identifying one or more jagged rows based on row delimiters that were erroneously placed in the sample excerpt instead of column delimiters; causing displaying, via a graphical user interface (GUI), text in the sample excerpt that led to creation of a jagged row of the one or more jagged rows; receiving, via the GUI, an addition or removal of a specific row delimiter of the row delimiters to the text; updating the sample excerpt based on the addition or the removal; analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row in the data input file; using the row delimiter, identifying a plurality of rows from the sample excerpt that is not included in the header data for the data input file; identifying a plurality of candidate column delimiters in the sample excerpt; using the plurality of candidate column delimiters and the row delimiter to generate a candidate schema for the data input file, wherein the method is performed using one or more processors.

2. The method of claim 1, further comprising: receiving a second data input file; selecting a second sample excerpt from the second data input file, the second sample excerpt comprising a subset of the second data input file; receiving, via the GUI, a specification of a certain row in the second sample excerpt as containing certain header data; ignoring all rows up to and including the certain row in the second sample excerpt in determining a row delimiter or a column delimiter for the second data input file.

3. The method of claim 1, further comprising receiving, via the GUI, a specification of an encoding for the data input file, the encoding being a pattern that represents a plurality of characters in a character set.

4. The method of claim 1, further comprising: receiving, via the GUI, a request to rerun scheme inference using only string types; re-identifying column delimiters for the sample excerpt by setting data format types for columns as strings.

5. The method of claim 1, further comprising: determining that an entry in a column based on the candidate schema does not match a determined data type for the column; causing a display, via the GUI, of a row containing the entry with an error message.

6. The method of claim 1, further comprising: receiving, via the GUI, a specification of a certain column delimiter; causing a display, via the GUI, of a warning that using the certain column delimiter causes a jagged row.

7. The method of claim 1, further comprising: detecting that a column having an integer type contains a text version of an integer; converting the text version to the integer for the column.

8. The method of claim 1, further comprising: computing a score indicating a percentage of a column that can be processed as a certain data type; causing a display, via the GUI, of rows for which the column cannot be processed as the certain data type.

9. The method of claim 1, further comprising: using the candidate schema for the data input file, translating the data input file into a second plurality of rows and a plurality of columns; storing, in the database, the second plurality of rows and the plurality of columns.

10. The method of claim 1, further comprising: receiving, via the GUI, a second row delimiter for the data input file; using the second row delimiter, identifying a second plurality of rows from the sample excerpt that is not included in the header data for the data input file; identifying one or more second candidate column delimiters in the second plurality of rows; using the one or more second candidate column delimiters and the second row delimiter to generate a second candidate schema for the data input file.

11. A system comprising: a memory; one or more processors coupled with the memory and configured to perform: receiving a data input file to be stored in a database, the data input file having unknown schema; selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file; determining that a first row in the sample excerpt does not contain a delimited numeric value; determining that a second row in the sample excerpt following the first row does contain a delimited value; in response to the determining that the first row in the sample excerpt does not contain a delimited value and the second row in the sample excerpt does contain a delimited value, determining that the first row consists of header data for the data input file; identifying one or more jagged rows based on row delimiters that were erroneously placed in the sample excerpt instead of column delimiters; causing displaying, via a graphical user interface (GUI), text in the sample excerpt that led to creation of a jagged row of the one or more jagged rows; receiving, via the GUI, an addition or removal of a specific row delimiter of the row delimiters to the text; updating the sample excerpt based on the addition or the removal; analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row in the data input file; using the row delimiter, identifying a plurality of rows from the sample excerpt that is not included in the header data for the data input file; identifying a plurality of candidate column delimiters in the sample excerpt; using the plurality of candidate column delimiters and the row delimiter to generate a candidate schema for the data input file.

12. The system of claim 11, the one or more processors further configured to perform: receiving a second data input file; selecting a second sample excerpt from the second data input file, the second sample excerpt comprising a subset of the second data input file; receiving, via the GUI, a specification of a certain row in the second sample excerpt as containing certain header data; ignoring all rows up to and including the certain row in the second sample excerpt in determining a row delimiter or a column delimiter for the second data input file.

13. The system of claim 11, the one or more processors further configured to perform receiving, via the GUI, a specification of an encoding for the data input file, the encoding being a pattern that represents a plurality of characters in a character set.

14. The system of claim 11, the one or more processors further configured to perform: receiving, via the GUI, a request to rerun scheme inference using only string types; re-identifying column delimiters for the sample excerpt by setting data format types for columns as strings.

15. The system of claim 11, the one or more processors further configured to perform: determining that an entry in a column based on the candidate schema does not match a determined d
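The jagged-row warning of claim 6 and the type score of claim 8 can be sketched in Python as follows. This is an illustrative reading only, not the patent's implementation. It assumes a jagged row is one whose field count differs from the most common field count, and that the score is the share of rows whose entry parses with a given parser. All function names are hypothetical.

```python
from collections import Counter


def jagged_row_indexes(rows: list[str], column_delimiter: str) -> list[int]:
    """Indexes of rows whose field count differs from the most common count."""
    if not rows:
        return []
    widths = [len(row.split(column_delimiter)) for row in rows]
    expected = Counter(widths).most_common(1)[0][0]
    return [i for i, width in enumerate(widths) if width != expected]


def parses_as_int(text: str) -> bool:
    """True if the text version of a value converts to an integer."""
    try:
        int(text.strip())
        return True
    except ValueError:
        return False


def column_type_score(rows, column_delimiter, column, parses=parses_as_int):
    """Return (percentage of rows that parse, indexes of rows that do not)."""
    failing = []
    for i, row in enumerate(rows):
        fields = row.split(column_delimiter)
        # A missing field counts as a failure for that row.
        if column >= len(fields) or not parses(fields[column]):
            failing.append(i)
    score = 100.0 * (len(rows) - len(failing)) / len(rows) if rows else 0.0
    return score, failing


rows = ["Ada,36", "Bob,forty", "Cy,41", "Di,29"]
print(column_type_score(rows, ",", 1))
print(jagged_row_indexes(["a,b,c", "1,2", "3,4,5", "6,7,8"], ","))
```

A GUI built on this sketch could show the failing row indexes next to the score, and warn when a user-chosen column delimiter yields a non-empty jagged-row list.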

Assignees

Palantir Technologies Inc

Inventors

Classifications

G06F40/205
Parsing · CPC title
G06F3/0638
Organizing or formatting or addressing of data · CPC title
G06F16/211 (Primary)
Schema design and management · CPC title

Patent family

Related publications grouped by family.

Patent family ID: 65241728

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12210491B2 cover?: A method comprises selecting a sample excerpt from a data input file; in response to the determining that a first row in the sample excerpt does not contain a delimited value and a second row does contain a delimited value, determining that the first row consists of header data; identifying one or more jagged rows based on row delimiters that were erroneously placed; causing displaying text tha…
Who is the assignee on this patent?: Palantir Technologies Inc