Index utilization in ETL tools

US10114878B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10114878-B2
Application number: US-201314108067-A
Country: US
Kind code: B2
Filing date: Dec 16, 2013
Priority date: Dec 16, 2013
Publication date: Oct 30, 2018
Grant date: Oct 30, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer manages methods for utilizing an index to manage access to data in a dataset stored in one or more file locations in an ETL tool by receiving a request to access a dataset associated with one or more file locations, wherein the dataset is stored in the one or more file locations. The computer queries an index for the one or more file locations associated with the dataset, wherein the dataset has another index for data in the dataset. The computer receives the one or more file locations associated with the dataset. The computer determines to cache the request to access the one or more file locations for the dataset until one or more thresholds are met, wherein the cached request is part of a total number of cached requests.
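
The two indexes mentioned in the abstract can be pictured with a small sketch. It is illustrative only and is not taken from the patent: the names `location_index`, `record_index`, `locate_dataset`, and `locate_record`, as well as the file paths and offsets, are hypothetical. The first mapping answers "which file locations hold this dataset?", and the second answers "where inside those files is the data for a given key?".

```python
from typing import Dict, List, Tuple

# Hypothetical first index: dataset name -> file locations holding that dataset.
location_index: Dict[str, List[str]] = {
    "employees": ["/data/part-0.dat", "/data/part-1.dat"],
}

# Hypothetical second index: (dataset, key) -> (file location, byte offset).
record_index: Dict[Tuple[str, str], Tuple[str, int]] = {
    ("employees", "Alice"): ("/data/part-0.dat", 0),
    ("employees", "Bob"): ("/data/part-1.dat", 128),
}


def locate_dataset(dataset: str) -> List[str]:
    """Return the file locations recorded for a dataset (empty if unknown)."""
    return list(location_index.get(dataset, []))


def locate_record(dataset: str, key: str) -> Tuple[str, int]:
    """Return the (file location, offset) recorded for one key in a dataset."""
    return record_index[(dataset, key)]
```

Plain dictionaries are used here only to keep the sketch short; the abstract does not say how either index is stored.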

First claim

Opening claim text (preview).

What is claimed is:

1. A method for utilizing an index to manage access to data in a dataset stored in one or more file locations in an Extract Transform Load tool, the method comprising:
    receiving, by one or more processors, a request to access a first dataset of a plurality of datasets stored at a source system during an Extract Transform Load process between the source system and an end target system during a period of high I/O requests, wherein the period of high I/O requests is based on a number of I/O requests received for a particular duration;
    querying, by one or more processors, a first index for one or more file locations where the first dataset is stored at the source system;
    responsive to determining that caching the request does not disrupt an order assigned to the request to access the first dataset of the plurality of datasets stored at the source system, caching, by one or more processors, the request to access the first dataset stored at the source system, wherein the cached access request for the first dataset is one of a plurality of cached requests to access the plurality of datasets stored in a plurality of file locations at the source system;
    responsive to determining a total size of the plurality of cached requests in temporary storage does not met a first threshold level, determining, by one or more processors, whether a duration for which the cached request for the first dataset of the plurality of cached requests has been cached in the temporary storage has met a second threshold level;
    responsive to determining that the duration for which the cached request for the first dataset of the plurality of cached requests has been cached has met the second threshold, identifying, by one or more processors, a first file location of the plurality of file locations to access in order to satisfy a portion of the plurality of cached requests for datasets that includes the cached request to access the first dataset; and
    accessing, by one or more processors, the first file location at the source system to satisfy the portion of the plurality of cached requests for datasets that includes the cached request to access the first dataset stored at the first file location, wherein accessing the first file location to satisfy the portion of the plurality of cached requests for the datasets stored at the first file location reduces the total size of the plurality of cached requests to access the source system during the Extract Transform Load process.

2. The method of claim 1, further comprising:
    prior to receiving a request to access a first dataset, receiving, by one or more processors, data which is to be stored in the first dataset, wherein the data includes employee date of birth, employee salary amount, and employee expertise level;
    creating, by one or processors, a second index using one or more keys representing the data present in the received first dataset, wherein the one or more keys includes employee names; and
    identifying, by one or more processors, each field of the second index to store one or more entries, wherein each entry is associated with one or more file locations along with an offset of the data to be stored in the first dataset in the one or more file locations.

3. A computer program product for utilizing an index to manage access to data in a dataset stored in one or more file locations in an Extract Transform Load tool, the computer program product comprising:
    one or more computer readable storage media;
    program instructions stored on the one or more computer readable storage media, which when executed by one or more processors, to:
    receive a request to access a first dataset of a plurality of datasets stored at a source system during an Extract Transform Load process between the source system and an end target system during a period of high I/O requests, wherein the period of high I/O requests is based on a number of I/O requests received for a particular duration;
    query a first index for one or more file locations where the first dataset is stored at the source system;
    responsive to determining that caching the request does not disrupt an order assigned to the request to access the first dataset of the plurality of datasets stored at the source system, cache the request to access the first dataset stored at the source system, wherein the cached access request for the first dataset is one of a plurality of cached requests to access the plurality of datasets stored in a plurality of file locations at the source system;
    responsive to determining a total size of the plurality of cached requests in temporary storage does not met a first threshold level, determine whether a duration for which the cached request for the first dataset of the plurality of cached requests has been cached in the temporary storage has met a second threshold level;
    responsive to determining the duration for which the cached request for the first dataset of the plurality of cached requests has been cached has met the second threshold, identify a first file location of the plurality of file locations to access in order to satisfy a portion of the plurality of cached requests for datasets that includes the cached request to access the first dataset; and
    access the first file location at the source system to satisfy the portion of the plurality of cached requests for datasets that includes the cached request to access the first data set stored at the first file location, wherein accessing the first file location to satisfy the portion of the plurality of cached requests for the datasets stored at the first file location reduces the total size of the plurality of cached requests to access the source system during the Extract Transform Load process.

4. The computer program product of claim 3, further comprising program instructions, stored on the one or more computer readable storage media, which when executed by a processor, to:
    prior to receiving a request to access a first dataset, receive data which is to be stored in the first dataset, wherein the data includes employee date of birth, employee salary amount, and employee expertise level;
    create a second index using one or more keys representing the data present in the received first dataset, wherein the one or more keys includes employee names; and
    identify, by one or more processors, each field of the second index to store one or more entries, wherein each entry is associated with one or more file locations along with an offset of the data to be stored in the first dataset in the one or more file locations.

5. A computer system for utilizing an index to manage access to data in a dataset stored in one or more file locations in an ETL tool, the computer system comprising:
    one or more computer processors;
    one or more computer readable storage media;
    program instructions stored on the one or more computer readable storage media, for execution by at least one of the one or more computer processors, which when executed, to:
    receive a request to access a first dataset of a plurality of datasets stored at a source system during an Extract Transform Load process between the source system and an end target system during a period of high I/O requests, wherein the period of high I/O requests is based on a number of I/O requests received for a particular duration;
    query a first index for one or more file locations where the first dataset is stored at the source system;
    responsive to determining that caching the request does not disrupt an order assigned to the request to access the first dataset of the plurality of datasets stored at the source system, cache the request to access the first dataset stored at the source system, wherein the cached access request for the first dataset is one of a plurality of cached requests to access the plurality of datasets
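
The caching and threshold steps recited in claim 1 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the class `RequestCache`, its method names, and the parameters `max_total_size` and `max_age_seconds` are hypothetical, and the order check and the detection of a high-I/O period are left out. Flushing when the size threshold is met is also an assumption, since the claim preview only spells out the duration branch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedRequest:
    dataset: str      # dataset the request wants to access
    location: str     # file location returned by the index lookup
    size: int         # size of the cached request in temporary storage
    cached_at: float  # time at which the request was cached


class RequestCache:
    """Hypothetical holder for cached access requests."""

    def __init__(self, max_total_size: int, max_age_seconds: float) -> None:
        self.max_total_size = max_total_size    # stands in for the first threshold
        self.max_age_seconds = max_age_seconds  # stands in for the second threshold
        self.pending: List[CachedRequest] = []

    def add(self, dataset: str, location: str, size: int, now: float) -> None:
        """Cache a request instead of accessing the file location immediately."""
        self.pending.append(CachedRequest(dataset, location, size, now))

    def total_size(self) -> int:
        return sum(r.size for r in self.pending)

    def next_batch(self, now: float) -> List[CachedRequest]:
        """Return the requests one file access can satisfy, or [] if no threshold is met."""
        if not self.pending:
            return []
        oldest = min(self.pending, key=lambda r: r.cached_at)
        size_met = self.total_size() >= self.max_total_size  # assumed flush trigger
        age_met = now - oldest.cached_at >= self.max_age_seconds
        if not (size_met or age_met):
            return []
        # Serve every cached request that targets the oldest request's file location.
        batch = [r for r in self.pending if r.location == oldest.location]
        self.pending = [r for r in self.pending if r.location != oldest.location]
        return batch
```

Grouping by the oldest request's file location is one simple way to make sure the returned batch includes the request whose waiting time triggered the access; removing the batch from `pending` is what reduces the total size of cached requests.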

Assignees

IBM

Inventors

Classifications

  • G06F16/254 · Primary

    Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10114878B2 cover?
A computer manages methods for utilizing an index to manage access to data in a dataset stored in one or more file locations in an ETL tool by receiving a request to access a dataset associated with one or more file locations, wherein the dataset is stored in the one or more file locations. The computer queries an index for the one or more file locations associated with the dataset, wherein the…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/254. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 30, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).