Creating NoSQL database index for semi-structured data

US9953102B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9953102-B2
Application number: US-201514599296-A
Country: US
Kind code: B2
Filing date: Jan 16, 2015
Priority date: Jan 20, 2014
Publication date: Apr 24, 2018
Grant date: Apr 24, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Semi-structured source data is preprocessed to obtain text partitions to be stored into a data table with a first combined primary key including a structure thread primary key and a sequence value primary key. The structure thread primary key identifies a structure thread that is segmented into several consecutive intervals according to a determined or predetermined sequence. An inverted index table, created for the preprocessed text partitions, includes a second combined primary key including the structure thread primary key and a keyword primary key. Corresponding to values of the primary keys in the second combined primary key, related text partition sequence IDs are recorded as index values of the inverted index table. Index values having a same keyword primary key value but different structure thread primary key values are located in different rows in the inverted index table. The present techniques improve query efficiency of database index and facilitate updating.
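The two tables described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: plain dictionaries stand in for NoSQL tables, the structure thread key is assumed to be a date string, and `tokenize`, `build_tables`, and `query` are hypothetical names introduced only for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Hypothetical keyword extraction: lowercase whitespace split."""
    return set(text.lower().split())


def build_tables(partitions):
    """Build a data table and an inverted index table.

    partitions: iterable of (thread_key, text) pairs, where thread_key
    is the value of the structure thread primary key (assumed here to
    be a date string).

    data_table key     = (thread_key, sequence_value)  # first combined key
    inverted_index key = (thread_key, keyword)         # second combined key
    """
    data_table = {}
    inverted_index = defaultdict(list)
    for seq, (thread_key, text) in enumerate(partitions):
        # The sequence value uniquely identifies the text partition.
        data_table[(thread_key, seq)] = text
        for keyword in tokenize(text):
            # Same keyword, different thread key -> different row.
            inverted_index[(thread_key, keyword)].append(seq)
    return data_table, dict(inverted_index)


def query(inverted_index, data_table, keyword, thread_keys):
    """Return partitions containing keyword within the given thread keys."""
    hits = []
    for thread_key in thread_keys:
        for seq in inverted_index.get((thread_key, keyword), []):
            hits.append(data_table[(thread_key, seq)])
    return hits


partitions = [
    ("2014-01-20", "disk error on node1"),
    ("2014-01-20", "login ok"),
    ("2014-01-21", "disk full"),
]
data_table, inverted_index = build_tables(partitions)
result = query(inverted_index, data_table, "disk", ["2014-01-21"])
```

Because the keyword "disk" is stored in one row per structure thread value, a query limited to one interval reads only that interval's row, and adding data for a new interval does not rewrite rows of earlier intervals.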

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: preprocessing semi-structured source data to obtain text partitions to be stored into a database; storing the preprocessed text partitions into a data table including a first combined primary key, the first combined primary key including a structure thread primary key and a sequence value primary key, the structure thread primary key identifying a structure thread, the structure thread being a time primary key corresponding to generation time of source data of the text partitions, the sequence value primary key assigning, to a respective text partition, a sequence value uniquely corresponding to the respective text partition; creating an inverted index table for the preprocessed text partitions, the inverted index table including a second combined primary key, the second combined primary including the structure thread primary key and a keyword primary key, index values having a same keyword primary key value but different structure thread primary key values being located in different rows in the inverted index table; and updating the inverted index table by reading new data from the data table, the new data having a same time primary key.

2. The method of claim 1, wherein the storing the preprocessed text partitions into the data table comprises storing a respective text partition corresponding to a respective first combined primary key into a corresponding record.

3. The method of claim 1, further comprising segmenting the structure thread into several consecutive intervals according to a predetermined sequence.

4. The method of claim 3, further comprising assigning a respective key value to a respective interval to serve as a value of a respective structure thread primary key.

5. The method of claim 3, wherein the segmenting comprising: segmenting the generation time of the source data of the text partitions into several time periods; and assigning a respective key value to a respective time period to serve as a value of the structure thread primary key.

6. The method of claim 5, wherein the respective key value is one of: a starting point of the respective time period; an ending point of the respective time period; a middle point of the respective time period; a point in the respective time period; or a unique identifier determined for the respective time period.

7. The method of claim 1, further comprising recording a respective text partition sequence ID corresponding to values of primary keys in the second combined primary key as a respective index value in the inverted index table.

8. The method of claim 7, further comprising assigning a special symbol to text partitions, corresponding to a same data source primary key value and having a same structure thread primary key value, which include a respective keyword as the respective index value in the inverted index table.

9. The method of claim 7, further comprising representing the respective text partition sequence value in a form of a base value and an offset value, the base value corresponding to a value of the structure thread primary key, the offset value being assigned sequentially to the respective text partition among text partitions corresponding to a same structure thread primary key value.

10. The method of claim 9, wherein: the first combined primary key further includes a data source primary key, the data source primary key identifying data sources of the text partitions; the second combined primary key further includes the data source primary key, wherein index values having different data source primary key values are located in different rows in the inverted index table; and the method further comprises: assigning the base value for text partitions that correspond to a same data source primary key value and have a same structure thread primary key value; and assigning a binary bit array for the text partitions that correspond to the same data source primary key value and have the same structure thread primary key value, a i-th binary digit in the bit array indicating if a i-th text partition including a keyword listed in the keyword primary key of a record where the i-th text partition is located.

11. The method of claim 9, wherein: the first combined primary key further includes a data source primary key, the data source primary key identifying data sources of the text partitions; the second combined primary key further includes the data source primary key, wherein index values having different data source primary key values are located in different rows in the inverted index table; and the method further comprises recording an integer to represent the offset value if one or more text partitions, corresponding to a same data source primary key value and having a same structure thread primary key value, which include a respective keyword.

12. The method of claim 1, wherein: the first combined primary key further includes a data source primary key, the data source primary key identifying data sources of the text partitions; and the second combined primary key further includes the data source primary key, wherein index values having different data source primary key values are located in different rows in the inverted index table.

13. The method of claim 12, further comprising, when the data table and the inverted index table are created for the text partitions, reading text partitions from a same data source and having a same structure thread primary key value in one time.

14. The method of claim 12, wherein: in the first combined primary key, the data source primary key uses a hash value calculated based on a respective data sources of the respective text partition and an original value of the structure thread of the respective text partition; or in the second combined primary key, the data source primary key uses a hash value calculated based on a respective data source and a respective keyword of the respective text partition.

15. An apparatus comprising: one or more processors; memory communicatively coupled to the one or more processors, the memory storing a plurality of units executable by the one or more processors, that when executed by the one or more processors, cause the plurality of units to perform associated operations, the plurality of units comprising: a preprocessing unit configured to preprocess semi-structured source data to obtain text partitions to be stored into a database; a data table creating unit configured to create a data table for storing the preprocessed text partitions, the data table including a first combined primary key, the first combined primary key including a structure thread primary key and a sequence value primary key, the structure thread primary key identifying a structure thread, the structure thread being segmented into several consecutive intervals according to a predetermined sequence, a specific key value being assigned to a respective interval to serve as a value of the structure thread primary key; a sequence value primary key assigning, to a respective text partition, a sequence value uniquely corresponding to the respective text partition; and an inverted index table creating unit configured to create an inverted index table for the preprocessed text partitions, the inverted index table including a second combined primary key, the second combined primary key including the structure thread primary key and a keyword primary key, index values having a same keyword primary key value but different structure thread primary key values are located in different rows in the inverted index table.
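As an illustration of claims 5, 6, 9, and 10, the following Python sketch shows one possible way to derive a time-period key and to store index values as a base value plus a bit array. Several details are assumptions not stated in the claims: the one-hour period length, the choice of the period's starting point as the key value, and reconstructing a sequence value by adding the offset to the base. The function names are hypothetical.

```python
from datetime import datetime, timezone

PERIOD_SECONDS = 3600  # assumed period length: one hour


def thread_key(generated_at, period=PERIOD_SECONDS):
    """Map a generation time to the starting point of its time period.

    The starting point is one of the key-value options listed in
    claim 6; it is returned as Unix epoch seconds.
    """
    ts = int(generated_at.timestamp())
    return ts - ts % period


def encode_bitmap(offsets, partition_count):
    """Build a bit array in which bit i is 1 when the i-th text
    partition of the interval contains the keyword."""
    bits = [0] * partition_count
    for offset in offsets:
        bits[offset] = 1
    return bits


def decode_sequence_values(base, bits):
    """Recover sequence values, assuming sequence value = base + offset."""
    return [base + i for i, bit in enumerate(bits) if bit]


created = datetime(2014, 1, 20, 10, 37, 5, tzinfo=timezone.utc)
key = thread_key(created)

# Partitions 0 and 3 (of 5 in this interval) contain the keyword.
bits = encode_bitmap([0, 3], 5)
sequence_values = decode_sequence_values(1000, bits)
```

Storing one base value and one bit per partition keeps each index row compact when an interval holds many partitions. Claim 11 describes the alternative of recording offsets as integers, which may be preferable when only a few partitions contain the keyword.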

Assignees

Alibaba Group Holding Ltd

Inventors

Classifications

  • G06F16/81 · Primary

    Indexing, e.g. XML tags; Data structures therefor; Storage structures · CPC title

  • Tablespace storage structures; Management thereof · CPC title

  • Inverted lists · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9953102B2 cover?
Semi-structured source data is preprocessed to obtain text partitions to be stored into a data table with a first combined primary key including a structure thread primary key and a sequence value primary key. The structure thread primary key identifies a structure thread that is segmented into several consecutive intervals according to a determined or predetermined sequence. An inverted index …
Who is the assignee on this patent?
Alibaba Group Holding Ltd
What technology area does this patent fall under?
Primary CPC classification G06F16/81. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 24, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).