Storage and retrieval of data from a bit vector search index

US2016378807A1 · US · A1

Patent metadata
FieldValue
Publication numberUS-2016378807-A1
Application numberUS-201615186210-A
CountryUS
Kind codeA1
Filing dateJun 17, 2016
Priority dateJun 23, 2015
Publication dateDec 29, 2016
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The technology described herein provides for storing and retrieving data in a bit vector search index. The bit vector search index stores data about terms from documents using bit vectors. Each bit vector comprises an array of bits and corresponds to a different set of terms. Each bit in the bit vector is used to represent whether a document includes at least one term from the set of terms. A band table is used to store bit vector configurations for bands of terms having similar term characteristics. Each term is indexed in the bit vector search index according to a bit vector configuration for a band to which it belongs. When identifying bit vector storage locations for terms, explicit mappings are used for some terms and ad hoc approaches used for other terms. Explicit mappings provide specific locations for terms, while ad hoc approaches use mapping algorithms assigned to bands.

First claim

Opening claim text (preview).

The invention claimed is: 1 . One or more computer storage media storing computer-useable instructions that, when used by one or more computing devices, cause the one or more computing devices to perform operations comprising: receiving a search query; identifying a term based on the search query; accessing one or more data structures for determining storage locations of bit vectors for terms indexed in a search index, each bit vector comprising an array of bits, each of at least a portion of the bits representing whether at least one document contains at least one term from a corresponding set of terms, the one or more data structures including explicit mappings for a subset of the terms indexed in the search index, the explicit mappings identifying actual storage locations for each term in the subset of terms, the one or more data structures also including ad hoc information mapping term characteristics to mapping algorithms for calculating storage locations of bit vectors for terms with the term characteristics; if an explicit mapping for the term is provided in the one or more data structures, identifying storage locations of bit vectors for the term from the explicit mapping; and if an explicit mapping for the term is not provided in the one or more data structures, using ad hoc information to identify storage locations of bit vectors for the term. 2 . The one or more computer storage media of claim 1 , wherein using ad hoc information to identify storage locations of bit vectors for the term comprises: determining one or more term characteristics of the term; identifying mapping algorithms for the one or more term characteristics from one or more data structures; and employing the mapping algorithms to determine storage locations for bit vectors for the term. 3 . The one or more computer storage media of claim 1 , wherein the one or more term characteristics of the term include at least one selected from the following: term location in document, number of words included in the term, term frequency in search index, and term frequency in search queries. 4 . The one or more computer storage media of claim 1 , wherein the mapping algorithms determine storage locations of bit vectors for the term as a function of the hash of the term. 5 . The one or more computer storage media of claim 1 , wherein the operations further comprise: accessing the bit vectors for the term; and intersecting the bit vectors for the term to identify matching documents for the search query. 6 . The one or more computer storage media of claim 1 , wherein the operations further comprise: determining one or more other terms based on the search query; using the one or more data structures to identify storage locations of bit vectors for the one or more other terms; accessing the bit vectors for the term and the one or more other terms; and intersecting the bit vectors for the term and the one or more other terms to identify matching documents for the search query. 7 . The one or more computer storage media of claim 1 , wherein the one or more data structures include: a band table mapping term characteristics to a bit vector configuration and ad hoc information for each of a plurality of bands; a term table the provides the explicit mappings. 8 . A method comprising: assigning term characteristics to each of a plurality of bands; assigning a bit vector configuration to each of the plurality of bands, each bit vector configuration specifying a number and length of bit vectors; and storing, on one or more computer storage media, a data structure mapping the term characteristics to bit vector configuration for each of the bands. 9 . The method of claim 8 , wherein the method further comprises generating a search index using the data structure, the search index include a plurality of bit vectors, each bit vector comprising an array of bits, each of at least a portion of the bit vectors having bits in which each bit represents whether one or more documents includes at least one term from a corresponding set of terms. 10 . The method of claim 9 , wherein generating the search index using the data structure comprises: assigning terms to each of the plurality of bands based on term characteristics of the terms and term characteristics assigned to each of the plurality of bands; generating the search index with the plurality of bit vectors in which each of at least a portion of the terms is assigned to one or more bit vectors in accordance with the bit vector configuration of the band to which each of the at least a portion of the terms is assigned. 11 . The method of claim 9 , wherein the method further comprises: assigning ad hoc mapping information to each of the plurality of bands, the ad hoc mapping information providing one or more mapping algorithms for determining a location of bit vectors for terms in the search index; and storing the ad hoc mapping information in the data structure. 12 . The method of claim 11 , wherein the method further comprises indexing information about a document in the search index by: identifying a term in the document; determining term characteristics for the term; identifying a band based on the term characteristics for the term; determining a band configuration for the term and storage locations of bit vectors for the term based on the band identified for the term; and setting bits in the bit vectors for the term. 13 . The method of claim 11 , wherein the method further comprises: receiving a search query; determining a term based on the search query; determining one or more term characteristics for the term; identifying ad hoc mapping information from the data structure based on the one or more term characteristics for the term; employing the ad hoc mapping information to derive locations of a plurality of bit vectors for the term in the search index; and intersecting the plurality of bit vectors to identify matching documents for the search query. 14 . The method of claim 8 , wherein the term characteristics assigned to each of the plurality of bands includes one or more selected from the following: an identification of a shard of the search index, a term's location in a document, a number of words in a term, a term frequency in a shard of the search index, a term frequency in search queries. 15 . The method of claim 8 , wherein each bit vector configuration is assigned to each of the plurality of bands based on design goals of balancing storage with a relevance metric. 16 . The method of claim 15 , wherein the relevance metric is selected from the following: a false positive rate that reflects a number of percentage of matching documents expected to be returned that don't match terms from search queries, an error rate based on a fraction of valid matching documents that could have been returned but were displaced by documents that don't match terms from search queries, a fix-up costs that represents a cost in additional storage and/or processing time required to identify and remove documents expected to be returned that don't match terms from search queries. 17 . The method of claim 15 , wherein each bit vector configuration is assigned to each of the plurality of bands using a cost function that is a weighted sum of the relevance metric and storage requirements. 18 . The method of claim 17 , wherein the weighted sum uses a weighting that is configurable based on a relative importance of relevance and storage to a design of a search system. 19 . A s

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016378807A1 cover?
The technology described herein provides for storing and retrieving data in a bit vector search index. The bit vector search index stores data about terms from documents using bit vectors. Each bit vector comprises an array of bits and corresponds to a different set of terms. Each bit in the bit vector is used to represent whether a document includes at least one term from the set of terms. A b…
Who is the assignee on this patent?
Microsoft Technology Licensing Llc
What technology area does this patent fall under?
Primary CPC classification G06F17/30324. Mapped technology areas include Physics.
When was this patent published?
Publication date Thu Dec 29 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).