Automatically executing tasks and configuring access control lists in a data transformation system

US11468083B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11468083-B2
Application number  US-202016915693-A
Country             US
Kind code           B2
Filing date         Jun 29, 2020
Priority date       Dec 28, 2016
Publication date    Oct 11, 2022
Grant date          Oct 11, 2022

Abstract

A computer-implemented system or process is programmed or configured to use a configuration file to specify one or more tasks to apply to raw ingested data. A task may be a sequence of instructions programmed or configured to format raw ingested data into a dataset in a CSV format. Examples of tasks may include: a parser to parse Cobol data into a CSV, a parser to parse XML into a CSV, a parser to parse text using fixed-width fields to a CSV, a parser to parse files in a zip archive into a CSV, a regular expression search/replace function, or formatting logic to remove lines or blank lines from raw ingested data. In one embodiment, the configuration file may specify a schema definition for a task to use for generating a dataset. In one embodiment, the configuration file may also include one or more access control list (ACL) definitions for the generated dataset. In one embodiment, the building of datasets using the configuration file is automated, for example, on a nightly basis.
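
The configuration-driven approach described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the configuration layout, the task identifiers (`trim_rows`, `remove_blank_lines`, `regex_replace`), the ACL entry format, and every function name are assumptions made for the example, since the abstract does not specify a concrete file format.

```python
import re

# Hypothetical configuration; the patent text does not publish a concrete format.
CONFIG = {
    "tasks": [
        {"id": "trim_rows", "header_rows": 1, "footer_rows": 1},
        {"id": "remove_blank_lines"},
        {"id": "regex_replace", "pattern": r"\s*\|\s*", "replacement": ","},
    ],
    # Hypothetical ACL definition for the generated dataset.
    "acl": [{"group": "analysts", "permission": "read"}],
}


def trim_rows(lines, header_rows=0, footer_rows=0):
    """Drop a fixed number of rows from the start and end of the input."""
    end = len(lines) - footer_rows
    return lines[header_rows:end]


def remove_blank_lines(lines):
    """Drop lines that are empty or contain only whitespace."""
    return [line for line in lines if line.strip()]


def regex_replace(lines, pattern, replacement):
    """Apply a search-and-replace regular expression to each line."""
    compiled = re.compile(pattern)
    return [compiled.sub(replacement, line) for line in lines]


# Registry mapping each task identifier to the function that implements it.
TASKS = {
    "trim_rows": trim_rows,
    "remove_blank_lines": remove_blank_lines,
    "regex_replace": regex_replace,
}


def apply_tasks(lines, config):
    """Run every configured task in order, passing its task-specific settings."""
    for task in config["tasks"]:
        params = {key: value for key, value in task.items() if key != "id"}
        lines = TASKS[task["id"]](lines, **params)
    return lines


def permissions_for(config):
    """Derive the output dataset's permissions from the ACL definitions."""
    return {entry["group"]: entry["permission"] for entry in config.get("acl", [])}
```

Keeping the tasks in a registry keyed by identifier means a new transformation can be added by registering one function and listing it in the configuration, without changing `apply_tasks`. A scheduler (for example, a nightly job) could call `apply_tasks` on each newly ingested file.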

First claim

What is claimed is:

1. A method comprising: retrieving at least one configuration file, the at least one configuration file comprising: a plurality of different data transformation tasks, each of the tasks denoted with a task identifier that identifies a particular task to apply to a set of input data and associated with task-specific criteria for execution of the particular task; a schema definition for a dataset, wherein the schema definition defines a plurality of columns; receiving an input file that includes an input dataset comprising a single text column with fixed-width fields; in response to receiving the input file, based on reading the at least one configuration file, applying the plurality of different data transformation tasks to the input dataset to generate an output dataset that is formatted differently from the input dataset, wherein the applying comprises using an array of fixed-width values specified in the at least one configuration file to map the fixed-width fields of the input dataset to the output dataset, and wherein the output dataset is formatted according to the task-specific criteria and aligns with the plurality of columns as defined by the schema definition; wherein the method is performed using one or more processors.

2. The method of claim 1, wherein the output dataset is formatted as a comma separated value (CSV) file.

3. The method of claim 1, wherein the input file is a text file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: applying a search-and-replace regular expression, specified in the at least one configuration file, to each line of the input dataset.

4. The method of claim 1, wherein the input file is a COBOL binary file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: using an expected byte size specified in the at least one configuration file to identify a location of a field in the input dataset; and retrieving the field from the input dataset.

5. The method of claim 1, wherein the input file is an extensible markup language (XML) file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: mapping tagged fields of the XML file to the output dataset.

6. The method of claim 1, wherein the input file is a zip archive comprising text files, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: mapping content of the text files to the output dataset.

7. The method of claim 6, wherein the zip archive is encrypted, and wherein the using further comprises: decrypting the zip archive.

8. The method of claim 6, wherein the input file is a text file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: removing all blank lines from the input dataset.

9. The method of claim 6, wherein the input dataset is formatted as rows of text, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: using a first setting specified in the at least one configuration file to identify a number of header rows to remove from the input dataset; and using a second setting specified in the at least one configuration file to identify a number of footer rows to remove from the input dataset.

10. The method of claim 6, the input dataset being a single-column dataset, and further comprising using the at least one configuration file to transform the single-column dataset into a multi-column dataset that is delimited according to the schema definition.

11. The method of claim 6, wherein the at least one configuration file further comprises an access control list that defines one or more access control permissions for the dataset, and further comprising, in response to receiving the input file, based on reading the at least one configuration file, determining output access control permissions for the output dataset based on the access control list.

12. One or more non-transitory computer-readable media storing instructions, which when executed by one or more processors cause: retrieving at least one configuration file, the at least one configuration file comprising: a plurality of different data transformation tasks, each of the tasks denoted with a task identifier that identifies a particular task to apply to a set of input data and associated with task-specific criteria for execution of the particular task; a schema definition for a dataset, wherein the schema definition defines a plurality of columns; receiving an input file that includes an input dataset comprising a single text column with fixed-width fields; in response to receiving the input file, based on reading the at least one configuration file, applying the plurality of different data transformation tasks to the input dataset to generate an output dataset that is formatted differently from the input dataset, wherein the applying comprises using an array of fixed-width values specified in the at least one configuration file to map the fixed-width fields of the input dataset to the output dataset, and wherein the output dataset is formatted according to the task-specific criteria and aligns with the plurality of columns as defined by the schema definition.

13. The one or more non-transitory computer-readable media of claim 12, wherein the input file is a text file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: applying a search-and-replace regular expression, specified in the at least one configuration file, to each line of the input dataset.

14. The one or more non-transitory computer-readable media of claim 12, wherein the input file is a COBOL binary file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: using an expected byte size specified in the at least one configuration file to identify a location of a field in the input dataset; and retrieving the field from the input dataset.

15. The one or more non-transitory computer-readable media of claim 12, wherein the input file is an extensible markup language (XML) file, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: mapping tagged fields of the XML file to the output dataset.

16. The one or more non-transitory computer-readable media of claim 12, wherein the input file is a zip archive comprising text files, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: mapping content of the text files to the output dataset.

17. The one or more non-transitory computer-readable media of claim 12, wherein the input dataset is formatted as rows of text, and further comprising using the at least one configuration file to apply the particular task to the input dataset, the using comprises: using a first setting specified in the at least one configuration file to identify a number of header rows to remove from the input dataset; and using a second setting specified in the at least one configuration file to identify a number of footer rows to remove from the input dataset.
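
The fixed-width mapping recited in the independent claims (an array of fixed-width values in the configuration file, plus a schema definition that names the output columns) might look something like the sketch below. The configuration keys `schema` and `fixed_widths`, the column names, and the function names are hypothetical; the claims do not prescribe them, and CSV output is only one of the formats the claims allow (claim 2).

```python
import csv
import io

# Hypothetical configuration: widths and column names are illustrative only.
CONFIG = {
    "schema": {"columns": ["account_id", "name", "balance"]},
    "fixed_widths": [6, 10, 8],
}


def split_fixed_width(line, widths):
    """Slice one text line into fields using an array of fixed widths."""
    fields = []
    start = 0
    for width in widths:
        # Padding inside each fixed-width field is stripped from the value.
        fields.append(line[start:start + width].strip())
        start += width
    return fields


def fixed_width_to_csv(lines, config):
    """Map a single-column, fixed-width dataset to CSV text with a header row."""
    columns = config["schema"]["columns"]
    widths = config["fixed_widths"]
    if len(columns) != len(widths):
        raise ValueError("schema columns and fixed widths must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for line in lines:
        writer.writerow(split_fixed_width(line, widths))
    return buffer.getvalue()
```

Checking that the number of widths matches the number of schema columns is one simple way to ensure the output "aligns with the plurality of columns as defined by the schema definition"; a fuller implementation would presumably also validate column types.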

Assignees

Palantir Technologies Inc

Inventors

Classifications

  • Querying · CPC title

  • G06F16/258 · Primary

    Data format conversion from or to a database · CPC title

  • Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Mapping to a database · CPC title

Frequently asked questions

What does patent US11468083B2 cover?
It covers a computer-implemented system or process that uses a configuration file to specify one or more tasks to apply to raw ingested data, typically to format that data into a CSV dataset. Example tasks include parsers that convert COBOL data, XML, fixed-width text, or files in a zip archive to CSV, a regular expression search/replace function, and logic to remove lines or blank lines. The configuration file may also specify a schema definition and access control list (ACL) definitions for the generated dataset, and dataset builds may be automated, for example on a nightly basis.
Who is the assignee on this patent?
Palantir Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/258. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 11, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).