Dynamic text tokenization for index-based searching of annotated data assets using keyword-based text searching

US12499137B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12499137-B2
Application number: US-202318541280-A
Country: US
Kind code: B2
Filing date: Dec 15, 2023
Priority date: Dec 15, 2023
Publication date: Dec 16, 2025
Grant date: Dec 16, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Devices, systems, and methods for tokenizing search attributes and terms of a search query for an index-based search. A method may include receiving, by a search service of a provider network, a first search query to search a first searchable document set, the first search query including a first search term in a first language; applying a first tokenization rule to identify the first search term in the first search query; determining that the first search term is in the first language; applying a second tokenization rule to tokenize the first search term based on the first search term being in the first language; causing a launch of a search instance by a managed compute service of the provider network, the search instance to execute a search function for a keyword-based text search using the tokenized first search term.
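The tokenization flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the script-based language detector, the character-bigram rule for Chinese, and the 0.8 threshold are all assumptions chosen for illustration. It shows a first (universal) rule dividing the query into terms, a per-term language determination with a confidence score, and a language-specific rule applied to each term that passes the threshold.

```python
import re
from typing import Callable, Dict, List, Tuple


def universal_tokenize(query: str) -> List[str]:
    """Hypothetical universal rule: split on whitespace, commas, semicolons."""
    return [t for t in re.split(r"[\s,;]+", query) if t]


def detect_language(term: str) -> Tuple[str, float]:
    """Toy detector (assumption): guess the language from the script.

    Returns (language code, confidence in [0, 1]).
    """
    cjk = sum(1 for ch in term if "\u4e00" <= ch <= "\u9fff")
    if cjk:
        return "zh", cjk / len(term)
    latin = sum(1 for ch in term if ch.isascii() and ch.isalpha())
    return "en", latin / len(term)


def tokenize_en(term: str) -> List[str]:
    """Hypothetical English rule: lowercase the term."""
    return [term.lower()]


def tokenize_zh(term: str) -> List[str]:
    """Hypothetical Chinese rule: overlapping character bigrams."""
    return [term[i:i + 2] for i in range(len(term) - 1)] or [term]


# Language-specific rules, keyed by detected language.
LANGUAGE_RULES: Dict[str, Callable[[str], List[str]]] = {
    "en": tokenize_en,
    "zh": tokenize_zh,
}


def tokenize_query(query: str, threshold: float = 0.8) -> List[str]:
    """Divide the query, detect each term's language, then tokenize it."""
    tokens: List[str] = []
    for term in universal_tokenize(query):
        lang, confidence = detect_language(term)
        # Terms with a low-confidence language guess are left out of the search.
        if confidence < threshold or lang not in LANGUAGE_RULES:
            continue
        tokens.extend(LANGUAGE_RULES[lang](term))
    return tokens


tokens = tokenize_query("Sales 数据资产 12345")
```

In this sketch a mixed-language query yields one token list, and a term such as `12345`, for which the toy detector has no confidence, is dropped rather than searched. A production system would presumably use a real language-identification model and tokenizers suited to each language.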

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving, by a search service of a provider network, a first search query to search a first searchable document set, the first search query including a first search term in a first language, a second search term in a second language, and an identification of an attribute in which to search for the first search term and the second search term, the attribute associated with a document type in the first searchable document set; applying a universal tokenization rule to divide the first search query into the first search term and the second search term; determining that the first search term is in the first language; determining that the second search term is in the second language; applying a first language-specific tokenization rule to tokenize the first search term based on the first search term being in the first language; applying a second language-specific tokenization rule to tokenize the second search term based on the second search term being in the second language; causing a launch of a search instance by an on-demand code execution service of the provider network, the search instance to execute a search function for a keyword-based text search using the first search query; and by the search instance executing the search function: identifying, using the document type, the attribute, the tokenized first search term, and the tokenized second search term, a first bitmap in a data store, the first bitmap including a bit for each document in the first searchable document set; determining, without accessing a first document, that the first document in the first searchable document set contains at least one of the tokenized first search term or the tokenized second search term based at least in part on a set bit corresponding to the first document in the first bitmap; and generating a search response to the keyword-based text search, wherein the search response includes the first document.

2. The method of claim 1, further comprising: generating a first confidence score indicating a likelihood that the first search term is in the first language; generating a second confidence score indicating a likelihood that the second search term is in the second language; determining that the first confidence score exceeds a confidence threshold; and determining that the second confidence score exceeds the confidence threshold, wherein determining that the first search term is in the first language is based on the confidence score exceeding the confidence threshold, and wherein determining that the second search term is in the second language is based on the confidence score exceeding the confidence threshold.

3. The method of claim 1, wherein the universal tokenization rule further divides the search query into a third search term, the method further comprising: generating a first confidence score indicating a likelihood that the third search term is in a third language; and determining that the first confidence score is below a confidence threshold, wherein to execute the search function using the first search query excludes the third search term from the search function based on the first confidence score being below the confidence threshold.

4. The method of claim 1, wherein to tokenize the second search term comprises to split the second term into a third search term and a fourth search term based on the second language-specific tokenization rule, and wherein to execute the search function using the first search query includes the third search term and the fourth search term in the search function.

5. The method of claim 1, wherein the tokenized first search term forms at least a portion of a key to access the first bitmap.

6. A computer-implemented method comprising: receiving, by a search service of a provider network, a first search query to search a first searchable document set, the first search query including a first search term in a first language; applying a first tokenization rule to identify the first search term in the first search query; determining that the first search term is in the first language; applying a second tokenization rule to tokenize the first search term based on the first search term being in the first language; causing a launch of a search instance by a managed compute service of the provider network, the search instance to execute a search function using the first search query for a keyword-based text search; and by the search instance executing the search function: identifying, using the tokenized first search term, a first bitmap in a data store, the first bitmap including a bit for each document in the first searchable document set; determining that a first document in the first searchable document set contains the tokenized first search term based at least in part on a set bit corresponding to the first document in the first bitmap; and generating a search response to the keyword-based text search, wherein the search response includes an indication of the first document.

7. The method of claim 6, wherein applying the first tokenization rule divides the first search query into the first search term and a second search term, the method further comprising: determining that the second search term is in a second language.

8. The method of claim 7, further comprising: applying a third tokenization rule to tokenize the second search term based on the second search term being in the second language.

9. The method of claim 8, wherein to execute the search function includes the tokenized first search term and the tokenized second search term.

10. The method of claim 8, further comprising: identifying, using the tokenized second search term, a second bitmap in the data store, the second bitmap including a bit for each document in the first searchable document set.

11. The method of claim 10, further comprising: determining that a second document in the first searchable document set contains the tokenized second search term based at least in part on a set bit corresponding to the first document in the second bitmap.

12. The method of claim 11, wherein the search response further includes an indication of the second document.

13. The method of claim 6, further comprising: generating a first confidence score indicating a likelihood that the first search term is in the first language; determining that the first confidence score exceeds a confidence threshold; and wherein determining that the first search term is in the first language is based on the confidence score exceeding the confidence threshold.

14. The method of claim 6, wherein applying the first tokenization rule divides the first search query into the first search term and a second search term, the method further comprising: generating a first confidence score indicating a likelihood that the second search term is in a second language; and determining that the first confidence score is below a confidence threshold, wherein to execute the search function using the first search query excludes the second search term from the search function based on the first confidence score being below the confidence threshold.

15. The method of claim 6, wherein to tokenize the first search term comprises to split the first term into a second search term and a third search term based on the second tokenization rule, and wherein to execute the search function using the first search query includes the second search term and the third search term in the search function.
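The bitmap lookup recited in claims 1, 5, and 6 can be sketched in Python as follows. This is an illustrative assumption, not the claimed data store: the `BitmapIndex` class, its method names, the use of a Python integer as the bitmap, and the sample document identifiers are all hypothetical. The sketch keys each bitmap by document type, attribute, and token, keeps one bit per document in the searchable set, and answers a query by reading set bits only, without opening any document.

```python
from typing import Dict, List, Tuple

# Hypothetical key layout: (document type, attribute, token).
BitmapKey = Tuple[str, str, str]


class BitmapIndex:
    """Toy bitmap index: one Python int per key, one bit per document."""

    def __init__(self, doc_ids: List[str]) -> None:
        self.doc_ids = doc_ids
        # Bit position of each document within every bitmap.
        self.position = {doc_id: i for i, doc_id in enumerate(doc_ids)}
        self.bitmaps: Dict[BitmapKey, int] = {}

    def add(self, doc_type: str, attribute: str, token: str, doc_id: str) -> None:
        """Set the document's bit in the bitmap for this key (index time)."""
        key = (doc_type, attribute, token)
        self.bitmaps[key] = self.bitmaps.get(key, 0) | (1 << self.position[doc_id])

    def search(self, doc_type: str, attribute: str, tokens: List[str]) -> List[str]:
        """Return documents whose bit is set for at least one token."""
        combined = 0
        for token in tokens:
            # The token forms part of the key used to find the bitmap.
            combined |= self.bitmaps.get((doc_type, attribute, token), 0)
        # Membership is decided from the bits alone; no document is read.
        return [d for i, d in enumerate(self.doc_ids) if (combined >> i) & 1]


index = BitmapIndex(["doc-a", "doc-b", "doc-c"])
index.add("table", "description", "sales", "doc-a")
index.add("table", "description", "数据", "doc-c")
index.add("table", "name", "sales", "doc-b")

matches = index.search("table", "description", ["sales", "数据"])
```

Here the bitmaps are combined with a bitwise OR, which mirrors the "at least one of" wording of claim 1; an AND across tokens would be the natural variant for queries that require every term to match.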

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Language identification · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Translation of the query language, e.g. Chinese to English · CPC title

  • Vectors, bitmaps or matrices · CPC title

  • G06F16/31 · Primary

    Indexing; Data structures therefor; Storage structures · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12499137B2 cover?
Devices, systems, and methods for tokenizing search attributes and terms of a search query for an index-based search. A method may include receiving, by a search service of a provider network, a first search query to search a first searchable document set, the first search query including a first search term in a first language; applying a first tokenization rule to identify the first search te…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/31. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 16, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).