Object metadata query with distributed processing systems

US10318491B1 · US · B1

Patent metadata
Field: Value
Publication number: US-10318491-B1
Application number: US-201514674324-A
Country: US
Kind code: B1
Filing date: Mar 31, 2015
Priority date: Mar 31, 2015
Publication date: Jun 11, 2019
Grant date: Jun 11, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A distributed object store can expose object metadata, in addition to object data, to distributed processing systems, such as Hadoop and Apache Spark. The distributed object store may act as a Hadoop Compatible File System (HCFS), exposing object metadata as a collection of records that can be efficiently processed by MapReduce (MR) and other distributed processing frameworks. A distributed processing job can specify a metadata query to narrow the set of objects returned. Related methods are also described.
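The idea in the abstract, metadata exposed as a collection of records and narrowed by a query, can be illustrated with a minimal in-memory sketch. Everything here is hypothetical and not taken from the patent: the `StoredObject` type, the `metadata_records` function, and the use of a Python callable as the "metadata query" are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical stand-in for one object held by the store."""
    bucket: str
    object_id: str
    metadata: dict  # e.g. {"content-type": "image/png", "size": 1024}


def metadata_records(
    objects: Iterable[StoredObject],
    bucket: str,
    query: Optional[Callable[[dict], bool]] = None,
) -> Iterator[dict]:
    """Yield one metadata record per object in `bucket`.

    `query` plays the role of the metadata query: when given, only
    objects whose metadata satisfies it are returned.
    """
    for obj in objects:
        if obj.bucket != bucket:
            continue
        if query is None or query(obj.metadata):
            yield {
                "bucket": obj.bucket,
                "object_id": obj.object_id,
                "metadata": dict(obj.metadata),  # copy, so records are independent
            }
```

A processing job would then iterate over `metadata_records(objects, "logs", lambda m: m["size"] > 100)` instead of reading object data, which is the narrowing effect the abstract describes.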

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: providing one or more computer processors configured to perform:

providing access, from a distributed processing system, to a distributed object store configured for storing and retrieving object data and associated metadata, where the distributed object store uniquely identifies objects contained therein using an object key comprising one or more namespace identifiers and a unique object identifier (id) within the identified namespace, wherein the namespace identifiers comprise a tenant id uniquely identifying a tenant within the distributed object store, and a bucket id uniquely identifying a bucket that comprises a plurality of objects, the bucket defined by and belonging to the tenant, wherein the distributed object store is configured as part of a distributed key-value store, the distributed key-value store comprising:

a set of data and object metadata for the plurality of objects;

a primary index configured for storing a mapping of object ids to storage locations, for the plurality of objects; and

one or more secondary indexes each configured for storing information relating to properties of the plurality of objects different than the object ids and different than the set of data and metadata for the plurality of objects, wherein the one or more secondary indexes are each defined as part of a respective specified bucket, wherein each of the one or more secondary indexes comprises: information based on the properties of the object metadata itself and information based on object access patterns for one or more applications that query the distributed object store; a secondary index table maintaining a mapping between properties of the object metadata and cached object metadata properties for each stored object; and a plurality of secondary index definitions, the secondary index definitions comprising information about one or more secondary indexes that have been created in the distributed key-value store, the information about the one or more secondary indexes comprising, for each respective secondary index, index name, indexed metadata keys, and cached metadata keys, the cached metadata keys corresponding to duplicates of one or more of the cached object metadata properties; wherein providing duplicates of one or more of the cached object metadata properties is configured to improve data request performance by reducing time to access information about object metadata properties; and wherein at least one of the one or more secondary indexes is configured to improve the efficiency of responding to data requests;

receiving a data request for object metadata from the distributed processing system, the data request associated with a first bucket within the distributed object store, wherein the first bucket comprises at least a first respective secondary index, the data request comprising an object metadata query, identifying one or more objects within the first bucket that satisfy the object metadata query; wherein the object metadata query includes at least one query predicate involving an object metadata key, wherein identifying the one or more objects within the first bucket that satisfy the object metadata query comprises: parsing the object metadata query into a query parse tree; generating a plurality of candidate query plans, each of the candidate query plans being semantically equivalent to the object metadata query; selecting one of the candidate query plans; and identifying one or more objects that satisfy the query predicate by retrieving object ids from the first respective secondary index using a first bucket id associated with the first bucket and the object metadata key involved in the query predicate;

for each object identified as satisfying the object metadata query: determining a location of corresponding object metadata stored within the distributed object store; retrieving the corresponding object metadata using the determined location; and generating a metadata record from the corresponding object metadata;

combining the metadata records from the identified objects into a metadata collection having a format compatible with the distributed processing system; and

returning the metadata collection to the distributed processing system in connection with the response to the data request.

2. The method of claim 1 wherein the data request further identifies a partition, wherein identifying one or more objects as objects associated with the first bucket comprises identifying one or more objects as objects associated with the first bucket and the partition.

3. The method of claim 1 wherein receiving the data request for object metadata from a distributed processing system comprises receiving a data request from a Hadoop cluster.

4. The method of claim 3 wherein receiving the data request for object metadata comprises receiving a Hadoop Distributed File System (HDFS) DataNode request.

5. The method of claim 3 further comprising receiving a Hadoop Distributed File System (HDFS) NameNode request from the distributed processing system, the HDFS NameNode request identifying a bucket within the distributed object store.

6. The method of claim 3 wherein generating the metadata record from the corresponding object metadata comprises generating a record in Apache Avro format, Apache Thrift format, Apache Parquet format, Simple Key/Value format, JSON format, Hadoop SequenceFile format, or Google Protocol Buffer format.

7. The method of claim 1 wherein receiving the data request for object metadata from a distributed processing system comprises receiving a data request from an Apache Spark cluster.

8. The method of claim 7 wherein combining the metadata records into the metadata collection comprises forming a Resilient Distributed Dataset (RDD).

9. The method of claim 1 where determining the location of corresponding object metadata stored within the distributed object store comprises using the distributed key/value store.

10. The method of claim 1 wherein identifying the one or more objects in the first bucket comprises issuing a PREFIX-GET command to the distributed key/value store, the PREFIX-GET command identifying a tenant and the bucket for the one or more objects.

11. The method of claim 1 wherein selecting one of candidate query plans comprises: evaluating the candidate query plans based upon a cost model, the cost model based on usage of at least one of time and processing resources for the given candidate query plan; and selecting one of candidate query plans based upon the cost model evaluation, wherein the selected candidate query plan has the lowest cost.

12. The method of claim 11 wherein generating candidate query plans includes: generating at least one logical query plan according to the received query; and generating a plurality of physical query plans according to the logical query plan, wherein the selected query plan corresponds to one of the plurality of physical query plans.

13. The method of claim 12 wherein generating a plurality of physical query plans comprises generating a tree representation, wherein nodes of the tree representation correspond to operations, the method further comprising traversing the nodes of the tree representation and executing the corresponding operations.

14. The method of claim 11 wherein evaluating the candidate query plans based upon a cost model comprises utilizing statistical information about the first respective secondary index computed from the distributed key-value store.

15. The method of claim 1 wherein retrieving object ids from the first respective
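
The flow recited in claims 1 and 11 (object keys built from tenant id, bucket id, and object id; a primary index from keys to locations; a per-bucket secondary index; several equivalent candidate plans; lowest-cost selection; one metadata record per match) can be sketched in a toy, single-process form. This is an illustrative reading, not the patented implementation: `ToyStore`, `Plan`, the single equality predicate, and the "entries touched" cost estimate are all assumptions, and query parsing into a parse tree is omitted.

```python
from typing import Callable, NamedTuple


class ObjectKey(NamedTuple):
    tenant_id: str
    bucket_id: str
    object_id: str


class Plan(NamedTuple):
    name: str
    cost: int                 # assumed cost model: number of entries touched
    run: Callable[[], list]   # returns the matching object ids


class ToyStore:
    def __init__(self):
        self.primary = {}    # ObjectKey -> storage location
        self.metadata = {}   # storage location -> metadata dict
        self.secondary = {}  # (tenant, bucket, meta_key) -> {value: {object_id}}

    def create_index(self, tenant_id, bucket_id, meta_key):
        # Call before put(); this sketch does not back-fill existing objects.
        self.secondary[(tenant_id, bucket_id, meta_key)] = {}

    def put(self, key, location, metadata):
        self.primary[key] = location
        self.metadata[location] = metadata
        for (tenant, bucket, meta_key), table in self.secondary.items():
            if (tenant, bucket) == (key.tenant_id, key.bucket_id) and meta_key in metadata:
                table.setdefault(metadata[meta_key], set()).add(key.object_id)

    def candidate_plans(self, tenant_id, bucket_id, meta_key, value):
        keys = [k for k in self.primary
                if (k.tenant_id, k.bucket_id) == (tenant_id, bucket_id)]

        def scan():
            out = []
            for k in keys:
                md = self.metadata[self.primary[k]]
                if meta_key in md and md[meta_key] == value:
                    out.append(k.object_id)
            return sorted(out)

        plans = [Plan("bucket-scan", len(keys), scan)]
        table = self.secondary.get((tenant_id, bucket_id, meta_key))
        if table is not None:
            hits = table.get(value, set())
            plans.append(Plan("index-lookup", len(hits), lambda: sorted(hits)))
        return plans

    def query(self, tenant_id, bucket_id, meta_key, value):
        plans = self.candidate_plans(tenant_id, bucket_id, meta_key, value)
        plan = min(plans, key=lambda p: p.cost)  # lowest estimated cost wins
        records = []
        for object_id in plan.run():
            location = self.primary[ObjectKey(tenant_id, bucket_id, object_id)]
            records.append({"object_id": object_id, "metadata": self.metadata[location]})
        return plan.name, records
```

Both plans return the same object ids for any query, which is the sense in which they are semantically equivalent; only their estimated cost differs. A real system would presumably derive costs from index statistics (claim 14) rather than from exact counts as done here.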

Assignees

Inventors

Classifications

  • G06F16/182 · Primary

    Distributed file systems · CPC title

  • Indexing; Web crawling techniques · CPC title

  • Query processing support for facilitating data mining operations in structured databases · CPC title

  • Distributed queries · CPC title


Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10318491B1 cover?
A distributed object store can expose object metadata, in addition to object data, to distributed processing systems, such as Hadoop and Apache Spark. The distributed object store may act as a Hadoop Compatible File System (HCFS), exposing object metadata as a collection of records that can be efficiently processed by MapReduce (MR) and other distributed processing frameworks. A distributed pr…
Who is the assignee on this patent?
EMC Corp, EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/182. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 11, 2019 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).