Post-hoc management of datasets

US10417439B2 · US · B2

Patent metadata
Publication number: US-10417439-B2
Application number: US-201715480971-A
Country: US
Kind code: B2
Filing date: Apr 6, 2017
Priority date: Apr 8, 2016
Publication date: Sep 17, 2019
Grant date: Sep 17, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating a catalog for multiple datasets, the method comprising accessing multiple extant data sets, the extant data sets including data sets that are independently generated and structurally dissimilar; organizing the data sets into collections, each data set in each collection belonging to the collection based on collection data associated with the data set; for each collection of data sets: determining, from a subset of the data sets that belong to the collection, metadata that describe the data sets that belong to the collection, wherein the metadata does not include the collection data, and attributing, to other data sets in the collection, the metadata determined from the subset of data sets; and generating, from the collections of data sets and the determined metadata, a catalog for the multiple datasets.
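
The steps in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the collection data is the dataset path with digit runs abstracted away, that metadata is determined from the first `sample_size` datasets of each collection, and that the metadata extractor is supplied by the caller. The names `collection_key`, `build_catalog`, `describe_by_extension`, and `sample_size` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List


def collection_key(path: str) -> str:
    """Hypothetical collection data: the path with digit runs replaced by '*'."""
    return re.sub(r"\d+", "*", path)


def build_catalog(
    paths: List[str],
    describe: Callable[[str], Dict[str, str]],
    sample_size: int = 1,
) -> Dict[str, Dict[str, object]]:
    # Organize datasets into collections keyed by their collection data.
    collections: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        collections[collection_key(path)].append(path)

    catalog: Dict[str, Dict[str, object]] = {}
    for key, members in collections.items():
        # Determine metadata from a subset of the collection only.
        metadata: Dict[str, str] = {}
        for path in sorted(members)[:sample_size]:
            metadata.update(describe(path))
        # Attribute that metadata to every dataset in the collection.
        for path in members:
            catalog[path] = {"collection": key, "metadata": dict(metadata)}
    return catalog


def describe_by_extension(path: str) -> Dict[str, str]:
    """Toy metadata extractor (assumption): file format from the extension."""
    return {"file_format": path.rsplit(".", 1)[-1]}


paths = [
    "/logs/2016/04/08/events.csv",
    "/logs/2016/04/09/events.csv",
    "/models/v1/weights.bin",
    "/models/v2/weights.bin",
]
catalog = build_catalog(paths, describe_by_extension)
```

In this sketch only one dataset per collection is inspected, yet every dataset receives a catalog entry naming its collection and the attributed metadata. A real extractor would read schemas, owners, or timestamps rather than the file extension.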

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented in data processing apparatus of a plurality of computers, the method comprising: accessing a plurality of extant data sets, the plurality of extant data sets including data sets that are generated independent of each other and structurally dissimilar; organizing the data sets into a plurality of collections, each collection including two or more data sets from the plurality of data sets, each data set in each collection belonging to the collection based on collection data associated with the data set, and each collection corresponding to collection data that is different from the collection data that corresponds to the other collections; for each collection of data sets: determining, from a subset of the data sets that belong to the collection, metadata that describe the data sets that belong to the collection, wherein the metadata does not include the collection data; attributing, to other data sets in the collection that are not included in the subset of data sets in the collection, the metadata determined from the subset of data sets such that the metadata determined to describe the subset of the data sets that belong to the collection are also determined to describe the other data sets in the collection that are not included in the subset of data sets in the collection; generating, from the collections of data sets and the metadata determined from the respective subsets of datasets, a catalog for the plurality of datasets, wherein the catalog includes an entry for each dataset in the plurality of datasets, and each entry describes the collection to which the dataset belongs and the metadata for the dataset.

2. The method of claim 1, wherein the collection data is a path of a location of a dataset.

3. The method of claim 2, wherein the datasets that belong to a collection include different instances of a particular dataset stored at a location identified by the path.

4. The method of claim 2, wherein the datasets that belong to a collection include datasets that each have at least a sub-path name in its respective path that is common to each path of each dataset included in the collection.

5. The method of claim 2, wherein the metadata include timestamp, file format, owners, and access permissions of the datasets.

6. The method of claim 2, wherein the metadata includes provenance data that describes one or more of dataset production in a workflow, dataset consumption in a workflow, dataset parent dependencies, and dataset child dependencies.

7. The method of claim 2, wherein the metadata include a schema of the dataset.

8. The method of claim 2, wherein the metadata include a content summary that describes the content stored in a dataset.

9. The method of claim 2, wherein the metadata include user-defined annotations.

10. The method of claim 1, further comprising: receiving one or more keyword search queries; matching the one or more keyword search queries to one or more datasets included in the plurality of datasets in the generated data catalog; ranking the matched one or more datasets using a scoring function, wherein the scoring function ranks each dataset based on one or more of (i) the type of dataset, (ii) the index section of the dataset, (iii) lineage fan-out of the dataset or (iv) a description of the dataset; and providing a predetermined number of highest ranking datasets for output in response to receiving the one or more keyword search queries.

11. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform a method comprising: accessing a plurality of extant data sets, the plurality of extant data sets including data sets that are generated independent of each other and structurally dissimilar; organizing the data sets into a plurality of collections, each collection including two or more data sets from the plurality of data sets, each data set in each collection belonging to the collection based on collection data associated with the data set, and each collection corresponding to collection data that is different from the collection data that corresponds to the other collections; for each collection of data sets: determining, from a subset of the data sets that belong to the collection, metadata that describe the data sets that belong to the collection, wherein the metadata does not include the collection data; attributing, to other data sets in the collection that are not included in the subset of data sets in the collection, the metadata determined from the subset of data sets such that the metadata determined to describe the subset of the data sets that belong to the collection are also determined to describe the other data sets in the collection that are not included in the subset of data sets in the collection; generating, from the collections of data sets and the metadata determined from the respective subsets of datasets, a catalog for the plurality of datasets, wherein the catalog includes an entry for each dataset in the plurality of datasets, and each entry describes the collection to which the dataset belongs and the metadata for the dataset.

12. The system of claim 11, wherein the collection data is a path of a location of a dataset.

13. The system of claim 12, wherein the datasets that belong to a collection include different instances of a particular dataset stored at a location identified by the path.

14. The system of claim 12, wherein the datasets that belong to a collection include datasets that each have at least a sub-path name in its respective path that is common to each path of each dataset included in the collection.

15. The system of claim 12, wherein the metadata include timestamp, file format, owners, and access permissions of the datasets.

16. The system of claim 12, wherein the metadata includes provenance data that describes one or more of dataset production in a workflow, dataset consumption in a workflow, dataset parent dependencies, and dataset child dependencies.

17. The system of claim 12, wherein the metadata include a schema of the dataset.

18. The system of claim 12, wherein the metadata include a content summary that describes the content stored in a dataset.

19. The system of claim 12, wherein the metadata include user-defined annotations.

20. A computer-readable storage medium comprising instructions stored thereon that are executable by a processing device and upon such execution cause the processing device to perform a method comprising: accessing a plurality of extant data sets, the plurality of extant data sets including data sets that are generated independent of each other and structurally dissimilar; organizing the data sets into a plurality of collections, each collection including two or more data sets from the plurality of data sets, each data set in each collection belonging to the collection based on collection data associated with the data set, and each collection corresponding to collection data that is different from the collection data that corresponds to the other collections; for each collection of data sets: determining, from a subset of the data sets that belong to the collection, metadata that describe the data sets that belong to the collection, wherein the metadata does not include the collection data; attributing, to other data sets in the collection that are not included in the subset of data sets in the collection,
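
The keyword search and ranking step recited in claim 10 can be illustrated with a short Python sketch. The claim names four ranking signals (dataset type, index section, lineage fan-out, description) but gives no formula, so everything below is an assumption: the weights in `TYPE_WEIGHT` and `SECTION_WEIGHT`, the additive combination, the logarithmic damping of fan-out, and the names `Entry`, `matched_sections`, `score`, and `search` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical weights; the claims name the signals but give no values.
TYPE_WEIGHT = {"table": 1.0, "file": 0.5}
SECTION_WEIGHT = {"path": 2.0, "description": 1.0}


@dataclass
class Entry:
    path: str
    dataset_type: str
    description: str = ""
    fan_out: int = 0  # number of downstream datasets that read this one


def matched_sections(entry: Entry, keyword: str) -> List[str]:
    """Return the index sections of the entry that contain the keyword."""
    kw = keyword.lower()
    sections = []
    if kw in entry.path.lower():
        sections.append("path")
    if kw in entry.description.lower():
        sections.append("description")
    return sections


def score(entry: Entry, sections: List[str]) -> float:
    """Combine the four signals additively (an assumed combination)."""
    section_score = sum(SECTION_WEIGHT[s] for s in sections)
    type_score = TYPE_WEIGHT.get(entry.dataset_type, 0.0)
    described = 0.5 if entry.description else 0.0
    return section_score + type_score + described + math.log1p(entry.fan_out)


def search(entries: List[Entry], keyword: str, top_k: int = 2) -> List[Entry]:
    scored = []
    for entry in entries:
        sections = matched_sections(entry, keyword)
        if sections:  # keep only datasets that match the query
            scored.append((score(entry, sections), entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Return a predetermined number of the highest-ranking datasets.
    return [entry for _, entry in scored[:top_k]]


entries = [
    Entry("/logs/events.csv", "file", "raw click events", fan_out=0),
    Entry("/warehouse/events", "table", "cleaned events table", fan_out=7),
    Entry("/models/weights.bin", "file"),
]
results = search(entries, "events")
```

With these toy weights, the widely consumed table ranks above the raw log file for the query `events`, and the unrelated model file is not returned at all.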

Assignees

Google LLC

Inventors

Classifications

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • G06F16/211 · Primary

    Schema design and management · CPC title

  • to a system of files or objects, e.g. local or distributed file system or database · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10417439B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating a catalog for multiple datasets, the method comprising accessing multiple extant data sets, the extant data sets including data sets that are independently generated and structurally dissimilar; organizing the data sets into collections, each data set in each collection belonging to th…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/211. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 17, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).