Automatic Detection on String and Column Delimiters in Tabular Data Files
US-2018314883-A1 · Nov 1, 2018 · US
US11907181B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11907181-B2 |
| Application number | US-202016748351-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 21, 2020 |
| Priority date | Jul 20, 2017 |
| Publication date | Feb 20, 2024 |
| Grant date | Feb 20, 2024 |
Official abstract text for this publication.
Techniques for generating a schema for a data input file are described herein. In an embodiment, a server computer receives a data input file. The server computer system selects a sample excerpt from the data input which comprises a subset of the data input file. The server computer system analyzes the sample excerpt to determine a row delimiter for the data input file, a column delimiter for the data input file, and a plurality of data format types. Using the column delimiter, row delimiter, and plurality of data format types, the server computer system generates a candidate schema for the data input file.
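As a rough illustration of the first two steps in the abstract (taking a sample excerpt, then finding a row delimiter in it), the sketch below picks the whitelisted candidate that occurs first in the sample, which is the rule recited in claim 3. The whitelist contents, the excerpt size, the tie-breaking rule, and the function names are all assumptions made for this example; the publication does not specify them.

```python
# Hypothetical whitelist of candidate row delimiters; the source does not
# enumerate specific candidates.
ROW_DELIMITER_WHITELIST = ["\r\n", "\n", "\r"]


def select_sample_excerpt(text: str, max_chars: int = 4096) -> str:
    """Return a leading subset of the input file (the size is an assumption)."""
    return text[:max_chars]


def detect_row_delimiter(sample: str, candidates=ROW_DELIMITER_WHITELIST):
    """Return the whitelisted candidate whose first occurrence is earliest.

    When two candidates start at the same position, the longer one wins,
    so "\r\n" is preferred over a bare "\r" at the same index. That
    tie-break is an illustrative choice, not something the source states.
    """
    best = None  # (position, -length, candidate)
    for cand in candidates:
        pos = sample.find(cand)
        if pos == -1:
            continue  # candidate never appears in the sample
        key = (pos, -len(cand), cand)
        if best is None or key < best:
            best = key
    return best[2] if best else None


sample = select_sample_excerpt("id,name\r\n1,Widget\r\n2,Gadget\r\n")
delimiter = detect_row_delimiter(sample)
rows = [r for r in sample.split(delimiter) if r] if delimiter else [sample]
```

Working on an excerpt rather than the whole file keeps the analysis cheap for large inputs, at the cost of possibly missing a delimiter that only appears later in the file.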
Opening claim text (preview).
What is claimed is:

1. A method comprising:
receiving a data input file to be stored in a database, the data input file having unknown schema;
selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;
analyzing the sample excerpt to determine header data for the data input file, determining the header data for the data input file comprising:
determining that a first row in the sample excerpt does not contain a delimited numeric value;
determining that a second row in the sample excerpt following the first row does contain a delimited value;
in response to the determining that the first row in the sample excerpt does not contain a delimited value and the second row in the sample excerpt does contain a delimited value, determining that the first row consists of the header data for the data input file;
identifying one or more jagged rows based on row delimiters that were erroneously placed in the sample excerpt instead of column delimiters;
causing displaying, via a graphical user interface (GUI), text in the sample excerpt that led to creation of a jagged row of the one or more jagged rows;
receiving, via the GUI, an addition or removal of a specific row delimiter of the row delimiters to the text;
updating the sample excerpt based on the addition or the removal;
analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row in the data input file;
using the row delimiter, identifying a plurality of rows from the sample excerpt that is not included in the header data for the data input file;
storing column delimiter whitelist data comprising a plurality of particular candidate column delimiters;
storing column delimiter blacklist data comprising data identifying one or more symbols that are not candidate column delimiters;
identifying one or more particular candidate column delimiters of the plurality of particular candidate column delimiters in the plurality of rows from the sample excerpt;
identifying a plurality of candidate column delimiters as symbols in the sample excerpt that are not contained in either the column delimiter whitelist data or the column delimiter blacklist data;
receiving, from a user device, a second row delimiter for the data input file;
using the second row delimiter, identifying a second plurality of rows from the sample excerpt that is not included in the header data for the data input file;
identifying one or more second candidate column delimiters in the second plurality of rows;
using the one or more second candidate column delimiters and the second row delimiter to generate a candidate schema for the data input file;
using the candidate schema for the data input file, translating the data input file into the second plurality of rows and a plurality of columns;
storing, in the database, the second plurality of rows and the plurality of columns,
wherein the method is performed using one or more processors.

2. The method of claim 1, further comprising using the header data for the data input file, extracting one or more column names for the plurality of columns.

3. The method of claim 1, wherein analyzing the sample excerpt to determine a row delimiter for the data input file comprises:
storing row delimiter whitelist data comprising a plurality of candidate row delimiters;
searching the sample excerpt to locate a particular candidate row delimiter, wherein the particular candidate row delimiter is a first occurrence of any of the plurality of candidate row delimiters;
selecting the particular candidate row delimiter as the row delimiter for the data input file.

4. The method of claim 1, wherein identifying the plurality of candidate column delimiters comprises:
determining that the sample excerpt does not contain any of the plurality of particular candidate column delimiters;
identifying, as the one or more candidate column delimiters, one or more symbols in the sample excerpt that are not contained in the column delimiter blacklist data.

5. The method of claim 1, wherein analyzing the sample excerpt to determine a column delimiter for the data input file comprises:
identifying, in the sample excerpt, a set of symbols following an open quotation and preceding a close quotation;
identifying a particular symbol immediately following the close quotation;
selecting the particular symbol immediately following the close quotation as the column delimiter.

6. A system comprising:
one or more processors;
one or more storage media storing instructions which, when executed by the one or more processors, cause performance of:
receiving a data input file to be stored in a database, the data input file having unknown schema;
selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;
analyzing the sample excerpt to determine header data for the data input file, determining the header data for the data input file comprising:
determining that a first row in the sample excerpt does not contain a delimited numeric value;
determining that a second row in the sample excerpt following the first row does contain a delimited value;
in response to the determining that the first row in the sample excerpt does not contain a delimited value and the second row in the sample excerpt does contain a delimited value, determining that the first row consists of the header data for the data input file;
identifying one or more jagged rows based on row delimiters that were erroneously placed in the sample excerpt instead of column delimiters;
causing displaying, via a graphical user interface (GUI), text in the sample excerpt that led to creation of a jagged row of the one or more jagged rows;
receiving, via the GUI, an addition or removal of a specific row delimiter of the row delimiters to the text;
updating the sample excerpt based on the addition or the removal;
analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row in the data input file;
using the row delimiter, identifying a plurality of rows from the sample excerpt that is not included in the header data for the data input file;
storing column delimiter whitelist data comprising a plurality of particular candidate column delimiters;
storing column delimiter blacklist data comprising data identifying one or more symbols that are not candidate column delimiters;
identifying one or more particular candidate column delimiters of the plurality of particular candidate column delimiters in the plurality of rows from the sample excerpt;
identifying a plurality of candidate column delimiters as symbols in the sample excerpt that are not contained in either the column delimiter whitelist data or the column delimiter blacklist data;
receiving, from a user device, a second row delimiter for the data input file;
using the second row delimiter, identifying a second plurality of rows from the sample excerpt that is not included in the header data for the data input file;
identifying one or more second candidate column delimiters in the second plurality of rows;
using the one or more second candidate column delimiters and the second row delimiter to generate a candidate schema for the data input file;
using the candidate schema for the data input file, translating the data input file into the second plurality of rows and a plurality of columns;
storing, in the database, the second plurality of rows and the plurality of columns.

7. The system of claim 6, wherein the instructions, when executed by the one or more processors, further cause performa
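To make the header test and the column-delimiter steps in claims 1, 4, and 5 more concrete, here is a minimal sketch under stated assumptions. The whitelist and blacklist contents, the definition of a "symbol" as any non-alphanumeric character, the regular expression used for a "delimited numeric value", and all function names are hypothetical choices for illustration; the claims do not fix any of them.

```python
import re

# Hypothetical list contents; the claims only say that such lists are stored.
COLUMN_WHITELIST = [",", "\t", ";", "|"]
COLUMN_BLACKLIST = {".", "-", "_", '"', "'", " ", ":", "/"}

# Assumed reading of "delimited numeric value": a number that is not glued
# to letters, digits, underscores, or dots on either side.
_NUMERIC = re.compile(r"(?<![\w.])[-+]?\d+(?:\.\d+)?(?![\w.])")


def first_row_is_header(rows):
    """Header test: row 1 has no numeric value, row 2 has one (assumed reading)."""
    if len(rows) < 2:
        return False
    return not _NUMERIC.search(rows[0]) and bool(_NUMERIC.search(rows[1]))


def candidate_column_delimiters(rows):
    """Return (whitelisted candidates present, other unlisted symbols).

    If no whitelisted candidate is present, the second list is simply every
    symbol outside the blacklist, which matches the fallback in claim 4.
    """
    symbols = {ch for ch in "".join(rows) if not ch.isalnum()}
    listed = [c for c in COLUMN_WHITELIST if c in symbols]
    unlisted = sorted(symbols - set(COLUMN_WHITELIST) - COLUMN_BLACKLIST)
    return listed, unlisted


def delimiter_after_quote(sample):
    """Claim 5 idea: take the symbol right after a closing quotation mark."""
    for match in re.finditer(r'"[^"\r\n]*"', sample):
        nxt = sample[match.end():match.end() + 1]
        if nxt and nxt not in "\r\n":  # skip quoted fields that end a row
            return nxt
    return None


rows = ["id~name~price", "1~Widget~9.99", "2~Gadget~19.50"]
has_header = first_row_is_header(rows)
listed, unlisted = candidate_column_delimiters(rows[1:] if has_header else rows)
```

In this example the rows use `~`, which is in neither hypothetical list, so it surfaces only through the "not in whitelist or blacklist" path; the `.` inside the prices is filtered out by the blacklist.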
Schema design and management · CPC title
Organizing or formatting or addressing of data · CPC title
Parsing · CPC title