Data pruning based on metadata

US10437780B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10437780-B2
Application number  US-201615210536-A
Country             US
Kind code           B2
Filing date         Jul 14, 2016
Priority date       Jul 14, 2016
Publication date    Oct 8, 2019
Grant date          Oct 8, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system, apparatus, and method for processing queries wherein the query includes a request to access or delete data and accessing metadata associated with the set of data, the metadata defining data characteristics of the set of data and identifying at least sets of data that need or not need to be accessed or deleted based on the metadata without accessing the actual data in the set of data; also methods to optimize processing of some operations based on the collected metadata on data.
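
The idea in the abstract, deciding from metadata alone which sets of data need to be accessed or can be deleted, can be illustrated with a minimal sketch. It is not the patented implementation: the `ColumnStats` record, the `classify_range` and `plan_delete` helpers, integer-only column values, and the inclusive range predicate are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Match(Enum):
    NONE = "none"        # no row can match: the file can be skipped
    PARTIAL = "partial"  # some rows may match: the file must be scanned
    FULL = "full"        # every row matches: the whole file can be acted on


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file metadata for one integer column."""
    min_value: int
    max_value: int
    null_count: int


def classify_range(stats: ColumnStats, low: int, high: int) -> Match:
    """Classify a file against `low <= column <= high` using metadata only."""
    # The file's value range lies entirely outside the predicate range.
    if stats.max_value < low or stats.min_value > high:
        return Match.NONE
    # The file's value range lies entirely inside the predicate range.
    # NULL rows never satisfy a range predicate, so they rule out FULL.
    if low <= stats.min_value and stats.max_value <= high and stats.null_count == 0:
        return Match.FULL
    return Match.PARTIAL


def plan_delete(
    files: dict[str, ColumnStats], low: int, high: int
) -> tuple[list[str], list[str], list[str]]:
    """Split files into (drop whole, scan row by row, skip) for a delete."""
    drop, scan, skip = [], [], []
    for name, stats in files.items():
        match = classify_range(stats, low, high)
        if match is Match.FULL:
            drop.append(name)
        elif match is Match.PARTIAL:
            scan.append(name)
        else:
            skip.append(name)
    return drop, scan, skip
```

Under these assumptions, only files classified as `PARTIAL` ever need their contents read; `FULL` files can be removed as a unit and `NONE` files are ignored.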

First claim

Opening claim text (preview).

What is claimed is:

1. A method for managing query operations, the method comprising: receiving a query directed to a set of files comprising data, wherein the query comprises a plurality of predicates; determining whether data in each file matches at least one predicate of the plurality of predicates based on file metadata without accessing the set of files; removing files that do not match at least one predicate from the set of files to create a reduced set of files; identifying, based on the file metadata, one or more predicates of the query that do not fully match any file in the set of files without accessing the set of files; removing the one or more predicates that do not fully match any file in the set of files from the query to create a modified query; executing the modified query against the reduced set of files to create a final set of files; and returning the final set of files in response to the query.

2. The method of claim 1, wherein the method further comprises, based on the file metadata, one or more of the following: identifying, without accessing a set of files, zero or more files in the set of files that do not need to be analyzed for a given query; identifying, without accessing a set of files, zero or more files in the set of files that need to be analyzed for a given query; and identifying, without accessing a set of files, zero or more files in the set of files to fully delete.

3. The method of claim 2, further comprising: analyzing queries with a plurality of conjunctive predicates; checking each of the plurality of conjunctive predicates against each file in the set of files; and reducing the set of files for consideration for each of the plurality of conjunctive predicates by removing the files that did not match the previous predicate.

4. The method of claim 2, wherein determining whether any of the files do not fully match at least one of the plurality of predicates comprises determining whether at least one of the plurality of predicates defines a range of values not represented in data within the set of files.

5. The method of claim 1, wherein the method further comprises identifying one or more columns within one or more files of the reduced set of files that are only used in predicates that have been deleted.

6. The method of claim 1, wherein receiving a query directed to a set of files further comprises receiving a request to delete tuples in the query.

7. The method of claim 6, further comprising identifying files wherein all tuples in the identified files match a delete predicate based on the file metadata and deleting the identified files without accessing data from the actual file.

8. The method of claim 7, wherein deleting is performed during a query preparation phase.

9. The method of claim 1, further comprising identifying files wherein all tuples in the identified files match a delete predicate based on the file metadata and deleting the identified files without accessing data from the actual file.

10. The method of claim 2, further comprising receiving a join query wherein the join query comprises one or more relations between a first set of files and a second set of files, and at least one predicate on the first set of files.

11. The method of claim 10, further comprising deriving a new predicate on the second set of files from the predicate on the first set of files.

12. The method of claim 11, further comprising deriving a revised predicate from a range found in metadata corresponding to a first relation and wherein the derived predicate is used for filtering the second set of files.

13. The method of claim 1, further comprising determining a range of characters based on the file metadata and selecting one or more string operations based on the range of characters.

14. The method of claim 13, wherein selecting the one or more string operations comprises selecting one or more of an ASCII specific string operation and a UNICODE specific string operation.

15. The method of claim 1, further comprising deriving metadata for complex functions based on metadata for a set of base attributes in a data set and filtering the set of files using complex predicates with arbitrary functions.

16. The method of claim 1, wherein the file metadata comprises a number of distinct values; a number of null values; and a minimum value and a maximum value for each file.

17. The method of claim 16, wherein the file metadata further comprises string length information and ranges of characters in strings.

18. The method of claim 1, further comprising: creating one or more files comprising the file metadata; collecting the file metadata when there are changes made to the data on a per column and a per file basis during data ingestion or as a separate process after data is loaded; receiving the file metadata on a per column and a per file basis; and storing the file metadata in a metadata store.

19. A processor that is programmable to execute instructions stored in non-transitory computer readable storage media, the instructions comprising: receiving a query directed to a set of files comprising data, wherein the query comprises a plurality of predicates; accessing metadata associated with the set of files, determining whether data in each file of the set of files matches at least one predicate of the plurality of predicates based on the metadata without accessing the set of files, and removing files that do not match at least one predicate from the set of files to create a reduced set of files; identifying, based on the metadata, one or more predicates of the query that do not fully match any file in the set of files without accessing the set of files; removing the one or more predicates that do not fully match any file in the set of files from the query to create a modified query; executing the modified query against the reduced set of files to create a final set of files; and returning the final set of files in response to the query.

20. The processor of claim 19, wherein the instructions further comprise determining whether at least one of the plurality of predicates defines a range of values that is not contained in data within the set of files.

21. The processor of claim 19, wherein the instructions further comprise deriving a revised predicate from a range found in metadata corresponding to a first relation and wherein the revised predicate is used for filtering a second relation.

22. The processor of claim 19, wherein the instructions further comprise one or more of the following based on the metadata: identifying, without accessing a set of files, zero or more files in the set of files that do not need to be analyzed for a given query; identifying, without accessing a set of files, zero or more files in the set of files that need to be analyzed for a given query; and identifying, without accessing a set of files, zero or more files in the set of files to fully delete.

23. The processor of claim 22, wherein the instructions further comprise: analyzing queries with a plurality of conjunctive predicates; checking each of the plurality of conjunctive predicates against each file in the set of files; and reducing the set of files for consideration for each of the plurality of conjunctive predicates by removing the files that did not match the previous predicate.

24. The processor of claim 22, wherein the instructions further comprise determining whether at
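
The join-related claims (10 to 12, and 21) describe deriving a predicate for a second relation from the metadata range of a first relation. The following is a rough sketch of one way that could work, not the method as actually implemented. It assumes an inner equi-join on a single integer key, a range predicate placed directly on that key, and hypothetical names (`FileStats`, `prune`, `derived_range`, `prune_join`) that do not come from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata: min and max of the join key."""
    name: str
    key_min: int
    key_max: int


def prune(files: list[FileStats], low: int, high: int) -> list[FileStats]:
    """Keep files whose key range overlaps [low, high]; no file is read."""
    return [f for f in files if f.key_max >= low and f.key_min <= high]


def derived_range(files: list[FileStats]) -> Optional[tuple[int, int]]:
    """Smallest single range covering the key values of all given files."""
    if not files:
        return None
    return min(f.key_min for f in files), max(f.key_max for f in files)


def prune_join(
    first: list[FileStats], second: list[FileStats], low: int, high: int
) -> tuple[list[FileStats], list[FileStats]]:
    """Prune both sides of `first JOIN second ON key` given
    `low <= first.key <= high`, using metadata only."""
    kept_first = prune(first, low, high)
    covered = derived_range(kept_first)
    if covered is None:
        # Nothing survives on the first side, so an inner join is empty.
        return [], []
    # Join keys coming from the first side lie inside both the original
    # predicate range and the range recorded in the surviving metadata.
    d_low = max(covered[0], low)
    d_high = min(covered[1], high)
    # The derived predicate `d_low <= second.key <= d_high` filters side two.
    return kept_first, prune(second, d_low, d_high)
```

The derived range can be tighter than the original predicate when the surviving files on the first side cover only part of it, which is where the metadata adds pruning power beyond simply copying the predicate across the join.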

Assignees

Snowflake Inc

Inventors

Classifications

  • G06F16/162 · Primary

    Delete operations (erasing in storage systems G06F3/0652) · CPC title

  • Indexing; Data structures therefor; Storage structures · CPC title

  • Join order optimisation · CPC title

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Search customisation based on user profiles and personalisation · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10437780B2 cover?
A system, apparatus, and method for processing queries wherein the query includes a request to access or delete data and accessing metadata associated with the set of data, the metadata defining data characteristics of the set of data and identifying at least sets of data that need or not need to be accessed or deleted based on the metadata without accessing the actual data in the set of data; …
Who is the assignee on this patent?
Snowflake Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/162. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 8, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).