US11556563B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11556563-B2 |
| Application number | US-202016900357-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 12, 2020 |
| Priority date | Jun 12, 2020 |
| Publication date | Jan 17, 2023 |
| Grant date | Jan 17, 2023 |
Official abstract text for this publication.
Techniques for partitioning data from a data stream into batches and inferring schema for individual batches based on the field values of each batch are disclosed. The system may infer different schemas corresponding to different batches of data records even though the batches are received from a common data stream or a common data source. The system may infer a schema by determining whether a field contains single values or multiple values. Then the system determines the field type(s) associated with the values. These determinations are then stored in a dictionary generated for each batch.
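The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that records arrive as dictionaries, that batches are cut at a fixed record count, and that Python type names stand in for field types. The names `batches` and `infer_batch_schema` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def batches(stream: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Cut a record stream into consecutive batches of at most `size` records."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def infer_batch_schema(batch: List[Record]) -> Dict[str, Dict[str, Any]]:
    """Build one dictionary per batch: field -> {"multi_valued", "types"}."""
    schema: Dict[str, Dict[str, Any]] = {}
    for record in batch:
        for field, value in record.items():
            entry = schema.setdefault(field, {"multi_valued": False, "types": set()})
            # Assumption: a list or tuple means the field holds multiple values.
            if isinstance(value, (list, tuple)):
                entry["multi_valued"] = True
                values = value
            else:
                values = [value]
            for v in values:
                entry["types"].add(type(v).__name__)
    return schema


# Hypothetical stream: the same fields yield different schemas per batch.
stream = [
    {"id": 1, "tag": "a"},
    {"id": "x2", "tag": ["a", "b"]},
    {"id": 3, "tag": "c"},
    {"id": 4, "tag": "d"},
]
schemas = [infer_batch_schema(b) for b in batches(stream, 2)]
```

In this sketch the first batch records `id` as holding both integer and string values and `tag` as multi-valued, while the second batch records a single type for each field, even though both batches come from the same stream.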
Opening claim text (preview).
What is claimed is:

1. A method comprising: receiving a data stream comprising a plurality of data records; partitioning the plurality of data records into a plurality of batches comprising (a) a first batch of data records and (b) a second batch of data records; inferring field types for the plurality of data records, received in a same data stream, on a per-batch basis at least by: processing the first batch of data records received via the stream based on a first inferred set of field types at least by: identifying a first plurality of fields corresponding to the data records of the first batch, the first plurality of fields comprising a particular field; analyzing a first set of values, associated with the particular field, in the data records of the first batch to determine the first inferred set of one or more field types corresponding to the particular field for data records of the first batch; wherein the first inferred set of field types, corresponding to the particular field, comprises a first field type corresponding to a first value for the particular field in a first record and a second field type corresponding to a second value for the particular field in a second record; indexing the first set of values in association with (a) the particular field and (b) a respective inferred field type of the first inferred set of field types; processing the second batch of data records received via the stream based on a second inferred set of field types at least by: identifying a second plurality of fields corresponding to the data records of the second batch, the second plurality of fields comprising the same particular field comprised in the first plurality of fields; analyzing a second set of values, associated with the particular field, in the data records of the second batch to determine the second inferred set of one or more field types corresponding to the particular field for data records of the second batch, wherein the first inferred set of one or more field types corresponding to the particular field is different than the second inferred set of one or more field types corresponding to the particular field; and indexing the second set of values in association with (a) the particular field and (b) a respective inferred field type of the second inferred set of field types.

2. The method of claim 1, wherein: the second inferred set of field types, corresponding to the particular field, consists of a single field type.

3. The method of claim 1, wherein analyzing the first set of values, associated with the particular field, in the first batch to determine a first inferred set of one or more field types corresponding to the particular field comprises: detecting at least two values, corresponding to the particular field, in a same data record of the first batch.

4. The method of claim 1, wherein analyzing the first set of values, associated with the particular field, in the first batch to determine a first inferred set of one or more field types corresponding to the particular field comprises: identifying a number of field types corresponding to the particular field; and inferring a type of each field type of the number of field types by applying a machine learning model.

5. The method of claim 1, wherein analyzing the first set of values, associated with the particular field, in the first batch to determine a first inferred set of one or more field types corresponding to the particular field comprises: selecting a value of the first batch using statistical sampling; and applying regular expression analysis to the value to determine a type of the value.

6. The method of claim 1, wherein further comprising receiving the plurality of data records in a data stream, wherein the plurality of data records are partitioned into the first batch and the second batch concurrent with receipt of additional data records via the data stream.

7. The method of claim 1, further comprising inferring a first schema for the first batch based on the first plurality of fields and a second schema for the second batch based on the second plurality of fields.

8. The method of claim 1, further comprising: prior to the partitioning operation, receiving the plurality of data records in a single data stream, wherein partitioning particular records of the plurality of data records into the second batch is based on a data record capacity threshold of the first batch being met.

9. The method of claim 1, wherein the plurality of data records are received in a single data stream, wherein the first plurality of fields is identical to the second plurality of fields, and wherein a first set of types corresponding to the first plurality of fields is not identical to a second set of types corresponding to the second plurality of fields.

10. The method of claim 1, wherein: the method is performed by a data storage system; and the plurality of data records comprise records that do not comply with any predetermined schema.

11. One or more non-transitory computer-readable media storing instructions, which when executed by one or more hardware processors, cause performance of operations comprising: receiving a data stream comprising a plurality of data records; partitioning the plurality of data records into a plurality of batches comprising (a) a first batch of data records and (b) a second batch of data records; inferring field types for the plurality of data records, received in a same data stream, on a per-batch basis at least by: processing the first batch of data records received via the stream based on a first inferred set of field types at least by: identifying a first plurality of fields corresponding to the data records of the first batch, the first plurality of fields comprising a particular field; analyzing a first set of values, associated with the particular field, in the data records of the first batch to determine the first inferred set of one or more field types corresponding to the particular field data records of the first batch; wherein the first inferred set of field types, corresponding to the particular field, comprises a first field type corresponding to a first value for the particular field in a first record and a second field type corresponding to a second value for the particular field in a second record; indexing the first set of values in association with (a) the particular field and (b) a respective inferred field type of the first inferred set of field types; processing the second batch of data records received via the stream based on a second inferred set of field types at least by: identifying a second plurality of fields corresponding to the data records of the second batch, the second plurality of fields comprising the same particular field comprised in the first plurality of fields; analyzing a second set of values, associated with the particular field, in the data records of the second batch to determine a second inferred set of one or more field types corresponding to the particular field data records of the second batch, wherein the first inferred set of one or more field types corresponding to the particular field is different than the second inferred set of one or more field types corresponding to the particular field; and indexing the second set of values in association with (a) the particular field and (b) a respective inferred field type of the second inferred set of field types.

12. The one or more media of claim 11, wherein: the second inferred set of field types, corresponding to the particular field consists of a single field type.

13. The one or more media of claim 11, wherein anal
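
Two steps recited in the claims, classifying a sampled value with regular expressions (claim 5) and indexing values in association with a field and an inferred field type (claim 1), can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the regular expressions in `TYPE_PATTERNS`, the type names, and the helpers `regex_type`, `sample_field_type`, and `index_batch` are all hypothetical, and uniform random choice stands in for "statistical sampling".

```python
import random
import re
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hypothetical patterns, checked in order; the claims do not specify any.
TYPE_PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("float", re.compile(r"[+-]?\d+\.\d+")),
    ("boolean", re.compile(r"true|false", re.IGNORECASE)),
]


def regex_type(value: Any) -> str:
    """Classify one value by full-match regular expressions; default to string."""
    text = str(value)
    for name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "string"


def sample_field_type(batch: List[Dict[str, Any]], field: str, rng: random.Random) -> str:
    """Pick one value of `field` at random and classify it.

    Assumes at least one record in the batch carries the field.
    """
    values = [record[field] for record in batch if field in record]
    return regex_type(rng.choice(values))


def index_batch(batch: List[Dict[str, Any]]) -> Dict[Tuple[str, str], List[Any]]:
    """Index every value under the key (field, inferred field type)."""
    index = defaultdict(list)
    for record in batch:
        for field, value in record.items():
            index[(field, regex_type(value))].append(value)
    return dict(index)


# Hypothetical batches from one stream: same field, different inferred types.
first_batch = [{"code": "42"}, {"code": "A7"}]
second_batch = [{"code": "43"}, {"code": "44"}]
first_index = index_batch(first_batch)
second_index = index_batch(second_batch)
```

Here `first_index` holds the `code` values under two keys, one per inferred type, while `second_index` holds them under a single key, mirroring the situation in claims 1 and 2 where the same field carries a different inferred set of types in each batch.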
Data partitioning, e.g. horizontal or vertical partitioning · CPC title
Schema design and management · CPC title
Data stream processing; Continuous queries · CPC title
Query processing support for facilitating data mining operations in structured databases · CPC title
Indexing structures · CPC title