Atomic incremental load for map-reduce systems on append-only file systems

US2016306799A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2016306799-A1
Application number  US-201615198345-A
Country             US
Kind code           A1
Filing date         Jun 30, 2016
Priority date       Aug 30, 2012
Publication date    Oct 20, 2016
Grant date          (not shown)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Augmenting data files in a repository of an append-only file system includes maintaining a companion metadata file for each corresponding data file in a map-reduce system using the append-only file system. Each companion metadata file tracks a logical end-of-file (EOF) for each data file. Global versioning of each companion metadata is maintained. A map-reduce append job is performed for a set of data files using a current global version number for the companion metadata file. The map-reduce job including multiple append tasks. For each successful append job, a logical EOF for each appended file is incremented to a new physical EOF. For each failed append task of the append job, a logical EOF is maintained for each failed append task by not incrementing the logical EOF for each failed append task.
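The mechanism in the abstract can be illustrated with a small in-memory sketch. It is not the patented implementation: the names `Companion`, `AppendOnlyFile`, `run_append_job`, and `read_valid` are hypothetical, a `bytearray` stands in for a file on an append-only file system, and a failed task is simulated by writing only half of its payload. The sketch also assumes success and failure are tracked per task within a single job.

```python
from dataclasses import dataclass


@dataclass
class Companion:
    """Hypothetical companion metadata for one data file."""
    version: int = 0      # global version of the last successful append
    logical_eof: int = 0  # readers must not read past this offset


class AppendOnlyFile:
    """In-memory stand-in for a data file that can only grow."""

    def __init__(self):
        self.data = bytearray()
        self.meta = Companion()

    def physical_eof(self):
        return len(self.data)


def run_append_job(files, payloads, global_version, fail_on=frozenset()):
    """Append payloads[name] to files[name] under one new global version.

    Tasks named in fail_on are simulated failures: they leave partial
    bytes behind, and their metadata is deliberately left untouched.
    """
    new_version = global_version + 1
    for name, payload in payloads.items():
        f = files[name]
        if name in fail_on:
            f.data += payload[: len(payload) // 2]  # partial write survives
            continue  # logical EOF and version stay as they were
        f.data += payload
        # Success: the logical EOF moves up to the new physical EOF.
        f.meta = Companion(version=new_version, logical_eof=f.physical_eof())
    return new_version


def read_valid(f):
    """Return only the bytes below the logical EOF."""
    return bytes(f.data[: f.meta.logical_eof])
```

After a job in which one task fails, the failed file's physical EOF has moved but its logical EOF has not, so readers never see the partial bytes. This sketch stops at a single job; appending again to a file that still holds leftover bytes would require recording that span as an invalid region, as claim 5 describes.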

First claim

Opening claim text (preview).

What is claimed is:

1. A method of augmenting data files in a repository of an append-only file system, comprising: maintaining a companion metadata file for each corresponding data file in a map-reduce system using the append-only file system, wherein each companion metadata file tracks a logical end-of-file (EOF) for each data file; maintaining global versioning of each companion metadata; performing a map-reduce append job for a set of data files using a current global version number for the companion metadata file, wherein the map-reduce job including multiple append tasks; for each successful append job, incrementing a logical EOF for each appended file to a new physical EOF; and for each failed append task of the append job, maintaining a logical EOF for each failed append task by not incrementing the logical EOF for each failed append task.

2. The method of claim 1, wherein global versioning is used to increment a valid companion metadata file version for each data file appended, and said valid companion metadata file version indicates the logical EOF corresponding to the new physical EOF for each of the data files appended.

3. The method of claim 2, wherein subsequent append tasks that read a data file for retrying failed append tasks use metadata to stop reading upon reaching the logical EOF for the failed append task even when a current physical EOF is not reached.

4. The method of claim 1, further comprising: for a failed data file append task, maintaining a current companion metadata file version for the data file, wherein partially appended bytes are ignored.

5. The method of claim 1, further comprising: for a failed append task: in a next successful append task updating the companion metadata file to skip a region corresponding to a failed append task; and in subsequent tasks, referring to the skipped region as an invalid region; and after a failed append task, in a subsequent append task, incrementing the logical EOF to a new physical EOF.

6. The method of claim 4, further comprising: using a single writer for write instructions to avoid concurrent writers; upon a determination that an existing metadata file exists with a version value set to a new version value, deleting the metadata file and creating a new metadata file on completion of a write instruction; and creating a new metadata file with the version set to the new version value.

7. The method of claim 4, further comprising: for each data file being read: setting a local version value of a file to a maximum metadata version value; reading a metadata file having the local version value; configuring a record reader for each record, split with invalid regions and the logical EOF; and reading data up to the logical EOF while skipping over invalid regions.

8. The method of claim 4, further comprising: performing periodic garbage collection comprising rewriting a data file, omitting invalid regions, updating the metadata file to purge all of the invalid regions, and pointing to the new logical EOF.

9. The method of claim 8, wherein garbage collection is performed while all other read instructions are stopped.

10. The method of claim 1, wherein the file system comprises a Hadoop Distributed File System (HDFS).

11. A computer program product for augmenting data files in a repository of an append-only file system, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, wherein the program instructions executable by a computer to cause the computer to: maintain, by the computer, a companion metadata file for each corresponding data file in a map-reduce system using the append-only file system, wherein each companion metadata file tracks a logical end-of-file (EOF) for each data file; maintain, by the computer, global versioning of each companion metadata file; perform, by the computer, a map-reduce append job for a set of data files a current global version number for the companion metadata file, wherein map-reduce job including multiple append tasks; for each successful append job, increment, by the computer, a logical EOF for each appended file to a new physical EOF; and for each failed append task of the append job, maintain, by the computer, a logical EOF for each failed append task by not incrementing the logical EOF for each failed append task.

12. The computer program product of claim 11, wherein the program instructions further cause the computer to: perform, by the computer, periodic garbage collection comprising rewriting a data file, omitting invalid regions, updating the metadata file to purge all of the invalid regions, and pointing to the new logical EOF.

13. The computer program product of claim 12, wherein garbage collection is performed while all other read instructions are stopped.

14. A storage device comprising: a memory storing instructions; and a processor configured to execute the instructions including: maintaining a companion metadata file in the memory for each corresponding data file in an append-only file system, wherein each companion metadata file tracks a logical end-of-file (EOF) for each data file; maintaining global versioning of each companion metadata in the memory; performing a map-reduce append job for a set of data files using a current global version number for the companion metadata file, wherein the map-reduce job including multiple append tasks; for each successful append job, incrementing a logical EOF for each appended file to a new physical EOF; and for each failed append task of the append job, maintaining a logical EOF for each failed append task by not incrementing the logical EOF for each failed append task.

15. The storage device of claim 14, wherein: the processor uses global versioning to increment a valid companion metadata file version for each data file appended; said valid companion metadata file version indicates the logical EOF corresponding to the new physical EOF for each of the data files appended; and subsequent append tasks that read a data file for retrying failed append tasks use metadata to stop reading upon reaching the logical EOF for the failed append task even when a current physical EOF is not reached.

16. The storage device of claim 14, wherein the processor is further configured to perform further instructions including: for a failed append task: in a next successful append task updating the companion metadata file to skip a region corresponding to a failed append task; and in subsequent tasks, referring to said region as an invalid region; and after a failed append task, in a subsequent append task, incrementing the logical EOF to a new physical EOF.

17. The storage device of claim 14, wherein the processor is further configured to perform further instructions including: causing only a single writer to perform write instructions to avoid concurrent writers performing write instructions; upon a determination that an existing metadata file exists with a version value set to a new version value, deleting the metadata file and creating a new metadata file in the memory on completion of a write instruction; and creating a new metadata file in the memory with the version set to the new version value.

18. The storage device of claim 14, wherein the processor is further configured to perform further instructions including: for each data file being read: setting a local version value of a data file to a maximum
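The failure handling, read path, and garbage collection described in claims 5, 7, and 8 can be sketched in a few lines. This is a rough illustration under stated assumptions, not the claimed implementation: `Metadata`, `commit_append`, `read_skipping_invalid`, and `garbage_collect` are hypothetical names, invalid regions are modeled as half-open `(start, end)` byte ranges, and a `bytearray` stands in for the append-only data file.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical companion metadata: version, logical EOF, skipped spans."""
    version: int = 0
    logical_eof: int = 0
    invalid: list = field(default_factory=list)  # half-open (start, end) ranges


def commit_append(data, meta, payload, new_version):
    """Successful append after possible earlier failures (cf. claim 5).

    Bytes sitting between the old logical EOF and the physical EOF were
    left by a failed task; they are recorded as an invalid region before
    the logical EOF is advanced to the new physical EOF.
    """
    invalid = list(meta.invalid)
    if len(data) > meta.logical_eof:
        invalid.append((meta.logical_eof, len(data)))
    data += payload
    return Metadata(new_version, len(data), invalid)


def read_skipping_invalid(data, meta):
    """Read up to the logical EOF while skipping invalid regions (cf. claim 7)."""
    out = bytearray()
    pos = 0
    for start, end in sorted(meta.invalid):
        out += data[pos:start]
        pos = end
    out += data[pos:meta.logical_eof]
    return bytes(out)


def garbage_collect(data, meta, new_version):
    """Rewrite the file without invalid regions and reset metadata (cf. claim 8)."""
    clean = bytearray(read_skipping_invalid(data, meta))
    return clean, Metadata(new_version, len(clean), [])
```

In this sketch a failed task is just bytes appended to `data` with no matching `commit_append` call. The next successful commit fences those bytes off, and `garbage_collect` later produces a compacted copy whose logical EOF equals its physical EOF. Claim 9 adds that garbage collection runs while other reads are stopped, which this single-threaded sketch does not model.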

Assignees

IBM

Inventors

Classifications

  • Management specifically adapted to NAS (management of storage area networks [SAN] G06F3/067) · CPC title

  • Versioning file systems, temporal file systems, e.g. file system supporting different historic versions of files · CPC title

  • G06F16/164 (Primary)

    File meta data generation · CPC title

  • Append-only file systems, e.g. using logs or journals to store data · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016306799A1 cover?
Augmenting data files in a repository of an append-only file system includes maintaining a companion metadata file for each corresponding data file in a map-reduce system using the append-only file system. Each companion metadata file tracks a logical end-of-file (EOF) for each data file. Global versioning of each companion metadata is maintained. A map-reduce append job is performed for a set …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/164. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 20, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).