Data stream processing

US11556563B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11556563-B2
Application number  US-202016900357-A
Country             US
Kind code           B2
Filing date         Jun 12, 2020
Priority date       Jun 12, 2020
Publication date    Jan 17, 2023
Grant date          Jan 17, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for partitioning data from a data stream into batches and inferring schema for individual batches based on the field values of each batch are disclosed. The system may infer different schemas corresponding to different batches of data records even though the batches are received from a common data stream or a common data source. The system may infer a schema by determining whether a field contains single values or multiple values. Then the system determines the field type(s) associated with the values. These determinations are then stored in a dictionary generated for each batch.
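The flow summarized in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`batches`, `type_name`, `infer_schema`), the fixed-capacity batching rule, the use of Python lists to represent multi-valued fields, and the four type labels are all assumptions made for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def batches(stream: Iterable[Record], capacity: int) -> Iterator[List[Record]]:
    """Partition a record stream into batches of at most `capacity` records.

    Assumption: a new batch starts once the current one reaches capacity.
    """
    it = iter(stream)
    while True:
        batch = list(islice(it, capacity))
        if not batch:
            return
        yield batch


def type_name(value: Any) -> str:
    """Map a Python value to a hypothetical field-type label."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    return "string"


def infer_schema(batch: List[Record]) -> Dict[str, Dict[str, Any]]:
    """Build a per-batch dictionary: field -> multi-valued flag and observed types."""
    schema: Dict[str, Dict[str, Any]] = {}
    for record in batch:
        for field, value in record.items():
            entry = schema.setdefault(field, {"multi_valued": False, "types": set()})
            values = value if isinstance(value, list) else [value]
            # Two or more values for one field in the same record marks it multi-valued.
            if len(values) > 1:
                entry["multi_valued"] = True
            entry["types"].update(type_name(v) for v in values)
    return schema


# Hypothetical records from one stream; the same fields vary in type and arity.
stream = [
    {"status": 200, "host": "a"},
    {"status": "OK", "host": "b"},
    {"status": 404, "host": ["c", "d"]},
    {"status": 500, "host": "e"},
]

# One schema dictionary per batch, inferred independently of the other batches.
schemas = [infer_schema(batch) for batch in batches(stream, capacity=2)]
```

With a capacity of two, the sample stream yields two batches whose inferred dictionaries differ for the same `status` field, which is the situation the abstract describes for batches drawn from a common stream.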

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving a data stream comprising a plurality of data records; partitioning the plurality of data records into a plurality of batches comprising (a) a first batch of data records and (b) a second batch of data records; inferring field types for the plurality of data records, received in a same data stream, on a per-batch basis at least by: processing the first batch of data records received via the stream based on a first inferred set of field types at least by: identifying a first plurality of fields corresponding to the data records of the first batch, the first plurality of fields comprising a particular field; analyzing a first set of values, associated with the particular field, in the data records of the first batch to determine the first inferred set of one or more field types corresponding to the particular field for data records of the first batch; wherein the first inferred set of field types, corresponding to the particular field, comprises a first field type corresponding to a first value for the particular field in a first record and a second field type corresponding to a second value for the particular field in a second record; indexing the first set of values in association with (a) the particular field and (b) a respective inferred field type of the first inferred set of field types; processing the second batch of data records received via the stream based on a second inferred set of field types at least by: identifying a second plurality of fields corresponding to the data records of the second batch, the second plurality of fields comprising the same particular field comprised in the first plurality of fields; analyzing a second set of values, associated with the particular field, in the data records of the second batch to determine the second inferred set of one or more field types corresponding to the particular field for data records of the second batch, wherein the first inferred set of one or more field types corresponding to the particular field is different than the second inferred set of one or more field types corresponding to the particular field; and indexing the second set of values in association with (a) the particular field and (b) a respective inferred field type of the second inferred set of field types.

2. The method of claim 1, wherein: the second inferred set of field types, corresponding to the particular field, consists of a single field type.

3. The method of claim 1, wherein analyzing the first set of values, associated with the particular field, in the first batch to determine a first inferred set of one or more field types corresponding to the particular field comprises: detecting at least two values, corresponding to the particular field, in a same data record of the first batch.

4. The method of claim 1, wherein analyzing the first set of values, associated with the particular field, in the first batch to determine a first inferred set of one or more field types corresponding to the particular field comprises: identifying a number of field types corresponding to the particular field; and inferring a type of each field type of the number of field types by applying a machine learning model.

5. The method of claim 1, wherein analyzing the first set of values, associated with the particular field, in the first batch to determine a first inferred set of one or more field types corresponding to the particular field comprises: selecting a value of the first batch using statistical sampling; and applying regular expression analysis to the value to determine a type of the value.

6. The method of claim 1, wherein further comprising receiving the plurality of data records in a data stream, wherein the plurality of data records are partitioned into the first batch and the second batch concurrent with receipt of additional data records via the data stream.

7. The method of claim 1, further comprising inferring a first schema for the first batch based on the first plurality of fields and a second schema for the second batch based on the second plurality of fields.

8. The method of claim 1, further comprising: prior to the partitioning operation, receiving the plurality of data records in a single data stream, wherein partitioning particular records of the plurality of data records into the second batch is based on a data record capacity threshold of the first batch being met.

9. The method of claim 1, wherein the plurality of data records are received in a single data stream, wherein the first plurality of fields is identical to the second plurality of fields, and wherein a first set of types corresponding to the first plurality of fields is not identical to a second set of types corresponding to the second plurality of fields.

10. The method of claim 1, wherein: the method is performed by a data storage system; and the plurality of data records comprise records that do not comply with any predetermined schema.

11. One or more non-transitory computer-readable media storing instructions, which when executed by one or more hardware processors, cause performance of operations comprising: receiving a data stream comprising a plurality of data records; partitioning the plurality of data records into a plurality of batches comprising (a) a first batch of data records and (b) a second batch of data records; inferring field types for the plurality of data records, received in a same data stream, on a per-batch basis at least by: processing the first batch of data records received via the stream based on a first inferred set of field types at least by: identifying a first plurality of fields corresponding to the data records of the first batch, the first plurality of fields comprising a particular field; analyzing a first set of values, associated with the particular field, in the data records of the first batch to determine the first inferred set of one or more field types corresponding to the particular field data records of the first batch; wherein the first inferred set of field types, corresponding to the particular field, comprises a first field type corresponding to a first value for the particular field in a first record and a second field type corresponding to a second value for the particular field in a second record; indexing the first set of values in association with (a) the particular field and (b) a respective inferred field type of the first inferred set of field types; processing the second batch of data records received via the stream based on a second inferred set of field types at least by: identifying a second plurality of fields corresponding to the data records of the second batch, the second plurality of fields comprising the same particular field comprised in the first plurality of fields; analyzing a second set of values, associated with the particular field, in the data records of the second batch to determine a second inferred set of one or more field types corresponding to the particular field data records of the second batch, wherein the first inferred set of one or more field types corresponding to the particular field is different than the second inferred set of one or more field types corresponding to the particular field; and indexing the second set of values in association with (a) the particular field and (b) a respective inferred field type of the second inferred set of field types.

12. The one or more media of claim 11, wherein: the second inferred set of field types, corresponding to the particular field consists of a single field type.

13. The one or more media of claim 11, wherein anal
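Two steps named in the claims, determining a value's type by regular expression analysis on a statistical sample (claim 5) and indexing values in association with both the field and its inferred type (claim 1), can be sketched as follows. This is an illustrative assumption rather than the claimed implementation: the pattern table, the type labels, uniform random sampling, and the helper names `regex_type`, `sampled_types`, and `index_batch` are all hypothetical, and values are assumed to arrive as raw strings.

```python
import random
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical pattern table; checked in order, first full match wins.
PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("float", re.compile(r"[+-]?\d+\.\d+")),
    ("boolean", re.compile(r"true|false", re.IGNORECASE)),
]


def regex_type(text: str) -> str:
    """Classify a raw string value by regular expression; default to 'string'."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "string"


def sampled_types(values: List[str], sample_size: int, rng: random.Random) -> Set[str]:
    """Infer the set of types for a field from a uniform random sample of its values."""
    sample = rng.sample(values, min(sample_size, len(values)))
    return {regex_type(v) for v in sample}


def index_batch(batch: List[Dict[str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """Index each value under the key (field name, inferred type) for one batch."""
    index: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for record in batch:
        for field, value in record.items():
            index[(field, regex_type(value))].append(value)
    return dict(index)


# Hypothetical batch in which the same field carries two different types.
first_batch = [{"code": "200"}, {"code": "OK"}, {"code": "404"}]
first_index = index_batch(first_batch)
```

Keying the index on the pair of field and type lets a single field hold values of several types within one batch without forcing them into a common type. Sampling trades accuracy for speed: a type that occurs rarely in a batch may be missed when the sample is smaller than the batch.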

Assignees

Oracle Int Corp

Inventors

Classifications

  • G06F16/278 (Primary)

    Data partitioning, e.g. horizontal or vertical partitioning · CPC title

  • Schema design and management · CPC title

  • Data stream processing; Continuous queries · CPC title

  • Query processing support for facilitating data mining operations in structured databases · CPC title

  • Indexing structures · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11556563B2 cover?
Techniques for partitioning data from a data stream into batches and inferring schema for individual batches based on the field values of each batch are disclosed. The system may infer different schemas corresponding to different batches of data records even though the batches are received from a common data stream or a common data source. The system may infer a schema by determining whether a …
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/278. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 17, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).