Mapping of extensible datasets to relational database schemas

US9916313B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9916313-B2
Application number: US-201414339391-A
Country: US
Kind code: B2
Filing date: Jul 23, 2014
Priority date: Feb 14, 2014
Publication date: Mar 13, 2018
Grant date: Mar 13, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Data including a text file is received. The text file is arranged in an extensible format and includes a plurality of metadata lines, a header line, and a plurality of content lines. Metadata from the metadata lines is mapped to a plurality of metadata tables in a database that are formed according to a relational database schema using prefix parameters from each metadata line. Content from the content lines is mapped to a plurality of content tables in the database that are formed according to the relational database schema using the header line. A first subset of the content tables have a static structure and a second subset of the content tables have a dynamic structure. Related apparatus, systems, techniques and articles are also described.
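
The mapping described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation. It assumes the variant call format conventions named in claim 3 (metadata lines start with `##`, the header line starts with `#`, fields are tab-separated), and it follows the single versus multiple key-value split described in claim 5. The function names `split_sections` and `map_metadata` are hypothetical, and the result is held in plain Python structures rather than database tables.

```python
import re


def split_sections(text):
    """Split the file into metadata lines, the header line, and content lines."""
    metadata, header, content = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):      # assumed metadata marker
            metadata.append(line[2:])
        elif line.startswith("#"):     # assumed header marker
            header = line[1:].split("\t")
        else:
            content.append(line.split("\t"))
    return metadata, header, content


def map_metadata(metadata):
    """Route each metadata line to a table chosen by its prefix parameter."""
    single = []   # one shared table: (key, value) rows
    multi = {}    # prefix -> rows with one column per key
    for line in metadata:
        prefix, _, rest = line.partition("=")
        if rest.startswith("<") and rest.endswith(">"):
            # Several key-value pairs: each key becomes a column name.
            pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', rest[1:-1])
            row = {key: value.strip('"') for key, value in pairs}
            multi.setdefault(prefix, []).append(row)
        else:
            # A single key-value pair goes to the shared key/value table.
            single.append((prefix, rest))
    return single, multi
```

Under these assumptions, a line such as `##fileformat=VCFv4.1` lands in the shared key/value table, while a line such as `##INFO=<ID=DP,Number=1,...>` becomes a row in a table named after its `INFO` prefix.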

First claim

Opening claim text (preview).

What is claimed is:

1. A computer implemented method comprising: receiving data comprising a text file, the text file being arranged in an extensible format and comprising a plurality of metadata lines, a header line, and a plurality of content lines; generating, based at least on metadata from the metadata lines, a plurality of metadata tables in a database, the plurality of metadata tables being formed according to a relational database schema, and the metadata from the plurality of metadata lines being mapped, based at least on prefix parameters from each metadata line, to the plurality of metadata tables; generating, based at least on content from the content lines, a plurality of content tables in the database, the plurality of content tables being formed according to the relational database schema, the content from the plurality of content lines being mapped, based at least on the header line, to the plurality of content tables, a first table of the plurality of content tables having a static structure that includes a fixed number of columns for storing content lines mapped to the first table, a second table of the plurality of content tables having a dynamic structure that enables an addition of at least one new column during runtime, the at least one new column accommodating an additional field in data received subsequent to the generation of the plurality of content tables; and performing, based at least on the plurality of metadata tables and/or the plurality of content tables, a database operation with respect to the data comprising the text file, the performance of the database operation comprising adding, to the second table, the at least one new column, the performance of the database operation further comprising generating a corresponding entry in a database log, the database log being used during a recovery to replay one or more operations performed on the data comprising the text file since a last savepoint.

2. The method of claim 1, wherein the mapping of the content from the content lines to the plurality of content tables comprises mapping gene sequence variations to corresponding reference genome position where the gene sequence variations occur.

3. The method of claim 1, wherein the text file is arranged according to the genomic variant call format.

4. The method of claim 1, wherein the generating of the plurality of metadata tables comprises the prefix parameters determining at least one of a metadata table name, number of columns and data types for the columns.

5. The method of claim 1, wherein the mapping of the metadata from the metadata lines to the plurality of metadata tables comprises: identifying key-value pairs from each metadata line; storing contents from metadata lines that contain only one key-value pair in single key-value pair metadata tables comprising a single column for the keys from all the metadata lines and a corresponding column for values from all the metadata lines; and storing contents from metadata lines that contain multiple key-value pairs in multiple key-value pair metadata tables comprising a column for each unique key, wherein the column is named after the key, and wherein the corresponding values are mapped into the rows of the columns named after the keys.

6. The method of claim 1, wherein the mapping of the content from the content lines to the plurality of content tables comprises: identifying header parameters from the header line; and generating content tables with the header parameters from the header line defining the content tables names within the relational database; wherein each content table comprises at least a column storing contents associated with the header parameter defining the name of the content table.

7. The method of claim 6, further comprising: determining that the header parameter has more than one corresponding value; and generating, in the content table, an additional row under the header parameter for each corresponding value, and incrementing a record count by one with each additional row.

8. The method of claim 6, wherein the header parameters from the header line comprise at least one of a position, identification, alternate, quality and filter parameter.

9. The method of claim 6, wherein an alternates content table stores reference alleles and corresponding alternate alleles, wherein the alternates content table comprises at least one column storing both the reference allele and corresponding alternate allele at a particular chromosome position, wherein each reference allele and corresponding alternate alleles at a particular chromosome position are mapped into separate rows in the column, wherein a record count value increments by one with each additional row, and wherein an index number, starting at zero to represent the reference allele increments by one for each alternate allele mapped into each additional row, and wherein the alternate content table includes an allele length column.

10. The method of claim 1, wherein the database comprises a columnar data store storing database tables as sections of columns.

11. The method of claim 1, wherein the database is an in-memory database storing the metadata tables and the content tables in main memory.

12. The method of claim 1, further comprising: generating a lookup table, the lookup table providing a mapping between the at least one new column and the keys.

13. A non-transitory computer program product storing instructions which, when executed by at least one data processor forming part of at least one computing system, result in operations comprising: receiving data comprising a text file, the text file being arranged in an extensible format and comprising a plurality of metadata lines, a header line, and a plurality of content lines; generating, based at least on metadata from the metadata lines, a plurality of metadata tables in a database, the plurality of metadata tables being formed according to a relational database schema, and the metadata from the plurality of metadata lines being mapped, based at least on prefix parameters from each metadata line, to the plurality of metadata tables; generating, based at least on content from the content lines, a plurality of content tables in the database, the plurality of content tables being formed according to the relational database schema, the content from the plurality of content lines being mapped, based at least on the header line, to the plurality of content tables, a first table of the plurality of content tables having a static structure that includes a fixed number of columns for storing content lines mapped to the first table, a second table of the plurality of content tables having a dynamic structure that enables an addition of at least one new column during runtime, the at least one new column accommodating an additional field in data received subsequent to the generation of the plurality of content tables; and performing, based at least on the plurality of metadata tables and/or the plurality of content tables, a database operation with respect to the data comprising the text file, the performance of the database operation comprising adding, to the second table, the at least one new column, the performance of the database operation further comprising generating a corresponding entry in a database log, the database log being used during a recovery to replay one or more operations performed on the data comprising the text file since a last savepoint.

14. The non-transitory computer program product as in claim 13, wherein the mapping of the content from the content lines to the plurality of content tables comprises mapping g
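
The distinction in claim 1 between a static content table and a dynamic one, together with the lookup table of claim 12, can be illustrated with a small sketch. It rests on several assumptions: Python's built-in `sqlite3` stands in for the columnar in-memory database of claims 10 and 11, a plain Python list stands in for the database log, and the table names (`variants`, `info`, `info_lookup`), generated column names (`c0`, `c1`, ...), and function names are all hypothetical.

```python
import sqlite3


def create_tables(conn):
    # Static table: the number of columns is fixed up front.
    conn.execute("CREATE TABLE variants (record_id INTEGER, chrom TEXT, pos INTEGER)")
    # Dynamic table: starts minimal and gains columns at runtime.
    conn.execute("CREATE TABLE info (record_id INTEGER)")
    # Lookup table: maps each source key to its generated column name.
    conn.execute("CREATE TABLE info_lookup (key TEXT PRIMARY KEY, column_name TEXT)")


def ensure_column(conn, log, key):
    """Return the column for `key`, adding a new column if the key is unseen."""
    row = conn.execute(
        "SELECT column_name FROM info_lookup WHERE key = ?", (key,)
    ).fetchone()
    if row:
        return row[0]
    count = conn.execute("SELECT COUNT(*) FROM info_lookup").fetchone()[0]
    column = f"c{count}"  # generated name, so safe to interpolate below
    conn.execute(f'ALTER TABLE info ADD COLUMN "{column}" TEXT')
    conn.execute("INSERT INTO info_lookup VALUES (?, ?)", (key, column))
    log.append(("add_column", "info", key, column))  # stand-in log entry
    return column


def insert_info(conn, log, record_id, fields):
    """Insert one row, extending the dynamic table for any new field."""
    columns = [ensure_column(conn, log, key) for key in fields]
    names = ", ".join(["record_id"] + [f'"{c}"' for c in columns])
    marks = ", ".join("?" for _ in range(len(columns) + 1))
    conn.execute(
        f"INSERT INTO info ({names}) VALUES ({marks})",
        [record_id, *fields.values()],
    )
    log.append(("insert", "info", record_id, dict(fields)))
```

In this sketch a later record that carries a previously unseen field triggers an `ALTER TABLE ... ADD COLUMN` on the dynamic table, and earlier rows simply hold NULL in the new column. The list of logged tuples only gestures at the replay-since-savepoint recovery described in the claim; a real database log would be durable and ordered by the storage engine.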

Assignees

Inventors

Classifications

  • Column-oriented storage; Management thereof · CPC title

  • Subject matter not provided for in other groups of this subclass · CPC title

  • File meta data generation · CPC title

  • ICT programming tools or database systems specially adapted for bioinformatics · CPC title

  • Mapping to a database · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9916313B2 cover?
Data including a text file is received. The text file is arranged in an extensible format and includes a plurality of metadata lines, a header line, and a plurality of content lines. Metadata from the metadata lines is mapped to a plurality of metadata tables in a database that are formed according to a relational database schema using prefix parameters from each metadata line. Content from the…
Who is the assignee on this patent?
Kumar Srinivasan, Bog Anja, Avudai Kannan, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06F16/13. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 13, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).