Distributed data set storage, retrieval and analysis
US-9684543-B1 · Jun 20, 2017 · US
US11468083B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11468083-B2 |
| Application number | US-202016915693-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 29, 2020 |
| Priority date | Dec 28, 2016 |
| Publication date | Oct 11, 2022 |
| Grant date | Oct 11, 2022 |
Official abstract text for this publication.
A computer-implemented system or process is programmed or configured to use a configuration file to specify one or more tasks to apply to raw ingested data. A task may be a sequence of instructions programmed or configured to format raw ingested data into a dataset in a CSV format. Examples of tasks may include: a parser to parse Cobol data into a CSV, a parser to parse XML into a CSV, a parser to parse text using fixed-width fields to a CSV, a parser to parse files in a zip archive into a CSV, a regular expression search/replace function, or formatting logic to remove lines or blank lines from raw ingested data. In one embodiment, the configuration file may specify a schema definition for a task to use for generating a dataset. In one embodiment, the configuration file may also include one or more access control list (ACL) definitions for the generated dataset. In one embodiment, the building of datasets using the configuration file is automated, for example, on a nightly basis.
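The configuration-driven flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the configuration layout, its key names (`tasks`, `id`, `schema`, `acl`), the task names, and the `build_dataset` helper are all hypothetical, and only two of the listed task types (blank-line removal and regular expression search/replace) are shown.

```python
import re

# Hypothetical configuration; key names and values are illustrative assumptions,
# not taken from the patent text.
CONFIG = {
    "tasks": [
        {"id": "remove_blank_lines"},
        {"id": "regex_replace", "pattern": r"\s*\|\s*", "replacement": ","},
    ],
    "schema": ["id", "name", "city"],
    "acl": [{"group": "analysts", "permission": "read"}],
}


def remove_blank_lines(lines, task):
    """Drop lines that are empty or contain only whitespace."""
    return [ln for ln in lines if ln.strip()]


def regex_replace(lines, task):
    """Apply the configured search/replace pattern to every line."""
    pattern = re.compile(task["pattern"])
    return [pattern.sub(task["replacement"], ln) for ln in lines]


# Registry mapping a task identifier to the function that implements it.
TASKS = {
    "remove_blank_lines": remove_blank_lines,
    "regex_replace": regex_replace,
}


def build_dataset(raw_text, config):
    """Run the configured tasks in order and return CSV text plus the ACL."""
    lines = raw_text.splitlines()
    for task in config["tasks"]:
        lines = TASKS[task["id"]](lines, task)
    header = ",".join(config["schema"])
    return {"csv": "\n".join([header] + lines), "acl": config["acl"]}


if __name__ == "__main__":
    raw = "1 | Ann | Oslo\n\n2 | Bob | Rome\n"
    print(build_dataset(raw, CONFIG)["csv"])
```

Keeping the tasks in a registry keyed by identifier means a new parser could be added without touching the loop that reads the configuration; a scheduled job (for example, nightly) would simply call `build_dataset` for each newly ingested file.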
Opening claim text (preview).
What is claimed is:

1. A method comprising: retrieving at least one configuration file, the at least one configuration file comprising: a plurality of different data transformation tasks, each of the tasks denoted with a task identifier that identifies a particular task to apply to a set of input data and associated with task-specific criteria for execution of the particular task; a schema definition for a dataset, wherein the schema definition defines a plurality of columns; receiving an input file that includes an input dataset comprising a single text column with fixed-width fields; in response to receiving the input file, based on reading the at least one configuration file, applying the plurality of different data transformation tasks to the input dataset to generate an output dataset that is formatted differently from the input dataset, wherein the applying comprises using an array of fixed-width values specified in the at least one configuration file to map the fixed-width fields of the input dataset to the output dataset, and wherein the output dataset is formatted according to the task-specific criteria and aligns with the plurality of columns as defined by the schema definition; wherein the method is performed using one or more processors.

2. The method of claim 1, wherein the output dataset is formatted as a comma separated value (CSV) file.

3. The method of claim 1, wherein the input file is a text file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: applying a search-and-replace regular expression, specified in the at least one configuration file, to each line of the input dataset.

4. The method of claim 1, wherein the input file is a COBOL binary file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: using an expected byte size specified in the at least one configuration file to identify a location of a field in the input dataset; and retrieving the field from the input dataset.

5. The method of claim 1, wherein the input file is an extensible markup language (XML) file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: mapping tagged fields of the XML file to the output dataset.

6. The method of claim 1, wherein the input file is a zip archive comprising text files, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: mapping content of the text files to the output dataset.

7. The method of claim 6, wherein the zip archive is encrypted, and wherein the using further comprises: decrypting the zip archive.

8. The method of claim 6, wherein the input file is a text file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: removing all blank lines from the input dataset.

9. The method of claim 6, wherein the input dataset is formatted as rows of text, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: using a first setting specified in the at least one configuration file to identify a number of header rows to remove from the input dataset; and using a second setting specified in the at least one configuration file to identify a number of footer rows to remove from the input dataset.

10. The method of claim 6, the input dataset being a single-column dataset, and further comprising using the at least one configuration file to transform the single-column dataset into a multi-column dataset that is delimited according to the schema definition.

11. The method of claim 6, wherein the at least one configuration file further comprises an access control list that defines one or more access control permissions for the dataset, and further comprising, in response to receiving the input file, based on reading the at least one configuration file, determining output access control permissions for the output dataset based on the access control list.

12. One or more non-transitory computer-readable media storing instructions, which when executed by one or more processors cause: retrieving at least one configuration file, the at least one configuration file comprising: a plurality of different data transformation tasks, each of the tasks denoted with a task identifier that identifies a particular task to apply to a set of input data and associated with task-specific criteria for execution of the particular task; a schema definition for a dataset, wherein the schema definition defines a plurality of columns; receiving an input file that includes an input dataset comprising a single text column with fixed-width fields; in response to receiving the input file, based on reading the at least one configuration file, applying the plurality of different data transformation tasks to the input dataset to generate an output dataset that is formatted differently from the input dataset, wherein the applying comprises using an array of fixed-width values specified in the at least one configuration file to map the fixed-width fields of the input dataset to the output dataset, and wherein the output dataset is formatted according to the task-specific criteria and aligns with the plurality of columns as defined by the schema definition.

13. The one or more non-transitory computer-readable media of claim 12, wherein the input file is a text file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: applying a search-and-replace regular expression, specified in the at least one configuration file, to each line of the input dataset.

14. The one or more non-transitory computer-readable media of claim 12, wherein the input file is a COBOL binary file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: using an expected byte size specified in the at least one configuration file to identify a location of a field in the input dataset; and retrieving the field from the input dataset.

15. The one or more non-transitory computer-readable media of claim 12, wherein the input file is an extensible markup language (XML) file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: mapping tagged fields of the XML file to the output dataset.

16. The one or more non-transitory computer-readable media of claim 12, wherein the input file is a zip archive comprising text files, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: mapping content of the text files to the output dataset.

17. The one or more non-transitory computer-readable media of claim 12, wherein the input dataset is formatted as rows of text, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: using a first setting specified in the at least one configuration file to identify a number of header rows to remove from the input dataset; and using a second setting specified in the at least one configuration file to identify a number of footer rows to remove from the input dataset.
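As a rough illustration of the fixed-width mapping recited in the independent claims, combined with the header and footer row removal of the dependent claims, the sketch below splits each single-column text row using an array of widths and writes a CSV aligned to the schema columns. The configuration keys (`widths`, `schema`, `header_rows`, `footer_rows`) and all function names are assumptions made for this example; the claims do not specify them.

```python
import csv
import io


def strip_header_footer(rows, header_rows, footer_rows):
    """Remove the configured number of leading and trailing rows."""
    end = len(rows) - footer_rows
    return rows[header_rows:end]


def split_fixed_width(line, widths):
    """Cut one text line into fields using consecutive fixed widths."""
    fields, pos = [], 0
    for width in widths:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields


def fixed_width_to_csv(text, config):
    """Map a single-column, fixed-width dataset to CSV per the schema."""
    rows = strip_header_footer(
        text.splitlines(), config["header_rows"], config["footer_rows"]
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(config["schema"])  # column names from the schema definition
    for row in rows:
        writer.writerow(split_fixed_width(row, config["widths"]))
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical configuration values chosen for the example input below.
    config = {
        "widths": [3, 5, 4],
        "schema": ["id", "name", "city"],
        "header_rows": 1,
        "footer_rows": 1,
    }
    text = "HEADER\n001Ann  Oslo\n002Bob  Rome\nTRAILER"
    print(fixed_width_to_csv(text, config))
```

In this sketch the number of widths is expected to match the number of schema columns; a fuller version would likely validate that before processing.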
Querying · CPC title
Data format conversion from or to a database · CPC title
Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
Mapping to a database · CPC title