Dynamic data-ingestion pipeline

US10122783B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10122783-B2
Application number  US-201514944934-A
Country             US
Kind code           B2
Filing date         Nov 18, 2015
Priority date       Nov 18, 2015
Publication date    Nov 6, 2018
Grant date          Nov 6, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In order to ingest data from an arbitrary source in a set of sources, a computer system accesses predefined configuration instructions. Then, the computer system generates a dynamic data-ingestion pipeline that is compatible with a Hadoop file system based on the predefined configuration instructions. This dynamic data-ingestion pipeline includes a modular arrangement of operators from a set of operators that includes: an extraction operator for extracting the data of interest from the source, a converter operator for transforming the data, and a quality-checker operator for checking the transformed data. Moreover, the computer system receives the data from the source. Next, the computer system processes the data using the dynamic data-ingestion pipeline as the data is received without storing the data in memory for the purpose of subsequent ingestion processing.
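The flow the abstract describes (read configuration, assemble operators into a pipeline, process records as they arrive) can be illustrated with a small Python sketch. This is not the patented implementation. The operator names, the `CONFIG` layout, and the use of JSON lines as a stand-in for a "common format" are all assumptions made for illustration. Generators stand in for processing the data as it is received: each record flows through every stage before the next one is read.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List

Stage = Callable[[Iterable], Iterator]


def extract_fields(fields: List[str]) -> Stage:
    """Hypothetical extraction operator: keep only the fields of interest."""
    def stage(records: Iterable[dict]) -> Iterator[dict]:
        for rec in records:
            yield {k: rec[k] for k in fields if k in rec}
    return stage


def convert_to_json_lines() -> Stage:
    """Hypothetical converter operator: JSON lines as the assumed common format."""
    def stage(records: Iterable[dict]) -> Iterator[str]:
        for rec in records:
            yield json.dumps(rec, sort_keys=True)
    return stage


def check_non_empty() -> Stage:
    """Hypothetical quality-checker operator: drop records that converted to '{}'."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        for line in lines:
            if line != "{}":
                yield line
    return stage


# Registry mapping configuration names to operator factories (illustrative names).
OPERATORS: Dict[str, Callable[..., Stage]] = {
    "extract": extract_fields,
    "convert": convert_to_json_lines,
    "quality_check": check_non_empty,
}


def build_pipeline(config: dict) -> Stage:
    """Assemble the stages named in the configuration, in order."""
    stages = [OPERATORS[step["op"]](**step.get("args", {}))
              for step in config["operators"]]

    def pipeline(source: Iterable[dict]) -> Iterator[str]:
        stream: Iterable = source
        for stage in stages:
            stream = stage(stream)  # lazy chaining: nothing runs until consumed
        return iter(stream)
    return pipeline


# Assumed configuration shape; the patent does not specify one on this page.
CONFIG = {
    "operators": [
        {"op": "extract", "args": {"fields": ["id", "name"]}},
        {"op": "convert"},
        {"op": "quality_check"},
    ]
}


def demo_source() -> Iterator[dict]:
    """Stand-in for an arbitrary source that yields records one at a time."""
    yield {"id": 1, "name": "a", "noise": True}
    yield {"unrelated": 0}
    yield {"id": 2, "name": "b"}


OUTPUT = list(build_pipeline(CONFIG)(demo_source()))
```

Changing the pipeline in this sketch means editing `CONFIG` rather than the code, which is the sense in which the arrangement of operators is modular.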

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-system-implemented method for ingesting data from a source in a set of sources, the method comprising: accessing predefined configuration instructions; generating a dynamic data-ingestion pipeline compatible with a Hadoop file system based on the predefined configuration instructions, wherein the dynamic data-ingestion pipeline includes a modular arrangement of operators from a set of operators that includes: an extraction operator for extracting the data of interest from the source, a converter operator for transforming the data, and a quality-checker operator for checking the transformed data; receiving the data from the source; and processing the data, by a computer system, using the dynamic data-ingestion pipeline implemented on the computer system by: using the converter operator to transform the received data into a common format compatible with the Hadoop file system; wherein the data is processed in real-time as the data is received, without storing the data in memory for the purpose of subsequent ingestion processing, thereby reducing storage requirements of the computer system.

2. The method of claim 1, wherein the set of sources is compatible with one of: a database, a message broker, a distributed key-value storage system, a Simple Storage Service (S3) file system, a first file system on a first network accessible via HyperText Transfer Protocol, a second file system on a second network accessible via File Transfer Protocol, and a local file system.

3. The method of claim 1, wherein the dynamic data-ingestion pipeline egresses the data to a sink in a set of sinks; wherein the sink is compatible with the Hadoop file system; and wherein the set of operators includes a writer operator for outputting the data to the sink.

4. The method of claim 1, wherein the set of operators includes a publisher operator for outputting the data to an output directory.

5. The method of claim 4, wherein the publisher operator outputs the data when all of the operators in the dynamic data-ingestion pipeline are successfully completed.

6. The method of claim 4, wherein the publisher operator outputs the data when a subset of the operators in the dynamic data-ingestion pipeline is successfully completed.

7. The method of claim 1, wherein the processing of the data using the dynamic data-ingestion pipeline is performed as a batch process.

8. The method of claim 1, wherein the processing of the data using the dynamic data-ingestion pipeline involves parallel processing of workunits.

9. The method of claim 1, wherein the quality-checker operator checks one of: a record-level policy, and a task-level policy.

10. An apparatus, comprising: one or more processors; memory; and a program module, wherein the program module is stored in the memory and, during operation of the apparatus, is executed by the one or more processors to ingest data from a source in a set of sources, the program module including: instructions for accessing predefined configuration instructions; instructions for generating a dynamic data-ingestion pipeline compatible with a Hadoop file system based on the predefined configuration instructions, wherein the dynamic data-ingestion pipeline includes a modular arrangement of operators from a set of operators that includes: an extraction operator for extracting the data of interest from the source, a converter operator for transforming the data, and a quality-checker operator for checking the transformed data; instructions for receiving the data from the source; and instructions for processing the data using the dynamic data-ingestion pipeline by: using the converter operator to transform the received data into a common format compatible with the Hadoop file system; wherein the data is processed in real-time as the data is received, without storing the data in memory for the purpose of subsequent ingestion processing, thereby reducing storage requirements on the memory.

11. The apparatus of claim 10, wherein the set of sources is compatible with one of: a database, a message broker, a distributed key-value storage system, a Simple Storage Service (S3) file system, a first file system on a first network accessible via HyperText Transfer Protocol, a second file system on a second network accessible via File Transfer Protocol, and a local file system.

12. The apparatus of claim 10, wherein the dynamic data-ingestion pipeline egresses the data to a sink in a set of sinks; wherein the sink is compatible with the Hadoop file system; and wherein the set of operators includes a writer operator for outputting the data to the sink.

13. The apparatus of claim 10, wherein the set of operators includes a publisher operator for outputting the data to an output directory.

14. The apparatus of claim 13, wherein the publisher operator outputs the data when all of the operators in the dynamic data-ingestion pipeline are successfully completed.

15. The apparatus of claim 13, wherein the publisher operator outputs the data when a subset of the operators in the dynamic data-ingestion pipeline is successfully completed.

16. The apparatus of claim 10, wherein the processing of the data using the dynamic data-ingestion pipeline is performed as a batch process.

17. The apparatus of claim 10, wherein the processing of the data using the dynamic data-ingestion pipeline involves parallel processing of workunits.

18. The apparatus of claim 10, wherein the quality-checker operator checks one of: a record-level policy, and a task-level policy.

19. A system, comprising: a processing module comprising one or more processors and a non-transitory computer readable medium storing instructions that, when executed by the one or more processors, cause the system to: access predefined configuration instructions; generate a dynamic data-ingestion pipeline compatible with a Hadoop file system based on the predefined configuration instructions, wherein the dynamic data-ingestion pipeline includes a modular arrangement of operators from a set of operators that includes: an extraction operator for extracting the data of interest from the source, a converter operator for transforming the data, and a quality-checker operator for checking the transformed data; receive the data from a source in a set of sources; and process the data using the dynamic data-ingestion pipeline by: using the converter operator to transform the received data into a common format compatible with the Hadoop file system; wherein the data is processed in real-time as the data is received, without storing the data in memory for the purpose of subsequent ingestion processing, thereby reducing storage requirements of the system.

20. The system of claim 19, wherein the dynamic data-ingestion pipeline egresses the data to a sink in a set of sinks; wherein the sink is compatible with the Hadoop file system; and wherein the set of operators includes a writer operator for outputting the data to the sink.
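Several dependent claims describe behaviour that is easier to see in code: record-level versus task-level quality policies (claims 9 and 18), parallel processing of workunits (claims 8 and 17), and a publisher that outputs data either when all operators succeed or when only a subset does (claims 5, 6, 14, and 15). The Python sketch below is one possible reading under stated assumptions, not the claimed implementation. `TaskResult`, `run_task`, `publish`, and the two policy functions are hypothetical names, and the in-memory `records` list merely stands in for whatever staging location a real writer would use.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class TaskResult:
    """Outcome of processing one workunit (hypothetical structure)."""
    name: str
    records: List[dict] = field(default_factory=list)
    rejected: int = 0
    succeeded: bool = False


def run_task(name: str,
             records: Iterable[dict],
             record_policy: Callable[[dict], bool],
             task_policy: Callable[[TaskResult], bool]) -> TaskResult:
    """Apply a record-level policy per record, then a task-level policy once."""
    result = TaskResult(name)
    for rec in records:
        if record_policy(rec):
            result.records.append(rec)
        else:
            result.rejected += 1
    result.succeeded = task_policy(result)
    return result


def publish(results: List[TaskResult], require_all: bool) -> Dict[str, List[dict]]:
    """Publish everything only if all tasks passed, or just the passing subset."""
    passed = [r for r in results if r.succeeded]
    if require_all and len(passed) != len(results):
        return {}  # all-or-nothing: a single failed task blocks publication
    return {r.name: r.records for r in passed}


def has_id(rec: dict) -> bool:
    """Assumed record-level policy: every record must carry an 'id' field."""
    return "id" in rec


def mostly_clean(result: TaskResult) -> bool:
    """Assumed task-level policy: rejected records must not outnumber kept ones."""
    return result.rejected <= len(result.records)


# Two illustrative workunits; the second contains mostly invalid records.
WORKUNITS = {
    "wu-0": [{"id": 1}, {"id": 2}],
    "wu-1": [{"id": 3}, {"x": 1}, {"y": 2}],
}

# Workunits are independent, so they can be processed in parallel.
# Executor.map returns results in the order of its inputs.
with ThreadPoolExecutor(max_workers=2) as pool:
    RESULTS = list(pool.map(
        lambda item: run_task(item[0], item[1], has_id, mostly_clean),
        WORKUNITS.items(),
    ))

ALL_OR_NOTHING = publish(RESULTS, require_all=True)
PARTIAL = publish(RESULTS, require_all=False)
```

In this sketch `wu-1` fails its task-level policy, so `ALL_OR_NOTHING` is empty while `PARTIAL` contains only the records from `wu-0`.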

Assignees

LinkedIn Corp, Microsoft Technology Licensing LLC

Inventors

Classifications

  • H04L67/02 · Primary

    Protocols based on web technology, e.g. hypertext transfer protocol [HTTP] · CPC title

  • Physics · mapped topic

  • G06F16/11 · Primary

    File system administration, e.g. details of archiving or snapshots (error detection or correction of the data by redundancy in operations G06F11/14) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10122783B2 cover?
In order to ingest data from an arbitrary source in a set of sources, a computer system accesses predefined configuration instructions. Then, the computer system generates a dynamic data-ingestion pipeline that is compatible with a Hadoop file system based on the predefined configuration instructions. This dynamic data-ingestion pipeline includes a modular arrangement of operators from a set of…
Who is the assignee on this patent?
LinkedIn Corp, Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification H04L67/02. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Nov 6, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).