System and method for hybrid multilingual search indexing
US-2024119076-A1 · Apr 11, 2024 · US
US12499137B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12499137-B2 |
| Application number | US-202318541280-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 15, 2023 |
| Priority date | Dec 15, 2023 |
| Publication date | Dec 16, 2025 |
| Grant date | Dec 16, 2025 |
Official abstract text for this publication.
Devices, systems, and methods for tokenizing search attributes and terms of a search query for an index-based search. A method may include receiving, by a search service of a provider network, a first search query to search a first searchable document set, the first search query including a first search term in a first language; applying a first tokenization rule to identify the first search term in the first search query; determining that the first search term is in the first language; applying a second tokenization rule to tokenize the first search term based on the first search term being in the first language; causing a launch of a search instance by a managed compute service of the provider network, the search instance to execute a search function for a keyword-based text search using the tokenized first search term.
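The two tokenization stages named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The whitespace split standing in for the first tokenization rule, the Unicode-script heuristic standing in for language detection, the `"en"` and `"cjk"` labels, the character-bigram rule, and every function name are assumptions chosen only for illustration. A real detector would also distinguish individual languages that share a script.

```python
import unicodedata


def universal_split(query: str) -> list[str]:
    """First rule (assumed): split the raw query into terms on whitespace."""
    return query.split()


def detect_language(term: str) -> tuple[str, float]:
    """Hypothetical detector: label a term by its share of CJK ideographs.

    Returns (language_label, confidence). The labels are illustrative only.
    """
    letters = [ch for ch in term if ch.isalpha()]
    if not letters:
        return ("und", 0.0)
    cjk = sum(1 for ch in letters if "CJK" in unicodedata.name(ch, ""))
    share = cjk / len(letters)
    return ("cjk", share) if share >= 0.5 else ("en", 1.0 - share)


def tokenize_term(term: str, language: str) -> list[str]:
    """Second rule (assumed): language-specific tokenization of one term."""
    if language == "cjk":
        # Overlapping character bigrams; one term may yield several tokens.
        if len(term) < 2:
            return [term]
        return [term[i:i + 2] for i in range(len(term) - 1)]
    # Default rule: lowercase and trim surrounding punctuation.
    token = term.lower().strip(".,;:!?\"'()")
    return [token] if token else []


def tokenize_query(query: str) -> list[tuple[str, str]]:
    """Run both stages and return (language_label, token) pairs."""
    result = []
    for term in universal_split(query):
        language, _confidence = detect_language(term)
        for token in tokenize_term(term, language):
            result.append((language, token))
    return result
```

Under these assumptions, `tokenize_query("Tokyo 東京都")` yields one lowercase token for the first term and two bigram tokens for the second, which mirrors the case where a language-specific rule splits a single search term into several.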
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: receiving, by a search service of a provider network, a first search query to search a first searchable document set, the first search query including a first search term in a first language, a second search term in a second language, and an identification of an attribute in which to search for the first search term and the second search term, the attribute associated with a document type in the first searchable document set; applying a universal tokenization rule to divide the first search query into the first search term and the second search term; determining that the first search term is in the first language; determining that the second search term is in the second language; applying a first language-specific tokenization rule to tokenize the first search term based on the first search term being in the first language; applying a second language-specific tokenization rule to tokenize the second search term based on the second search term being in the second language; causing a launch of a search instance by an on-demand code execution service of the provider network, the search instance to execute a search function for a keyword-based text search using the first search query; and by the search instance executing the search function: identifying, using the document type, the attribute, the tokenized first search term, and the tokenized second search term, a first bitmap in a data store, the first bitmap including a bit for each document in the first searchable document set; determining, without accessing a first document, that the first document in the first searchable document set contains at least one of the tokenized first search term or the tokenized second search term based at least in part on a set bit corresponding to the first document in the first bitmap; and generating a search response to the keyword-based text search, wherein the search response includes the first document.
2. The method of claim 1, further comprising: generating a first confidence score indicating a likelihood that the first search term is in the first language; generating a second confidence score indicating a likelihood that the second search term is in the second language; determining that the first confidence score exceeds a confidence threshold; and determining that the second confidence score exceeds the confidence threshold, wherein determining that the first search term is in the first language is based on the confidence score exceeding the confidence threshold, and wherein determining that the second search term is in the second language is based on the confidence score exceeding the confidence threshold.
3. The method of claim 1, wherein the universal tokenization rule further divides the search query into a third search term, the method further comprising: generating a first confidence score indicating a likelihood that the third search term is in a third language; and determining that the first confidence score is below a confidence threshold, wherein to execute the search function using the first search query excludes the third search term from the search function based on the first confidence score being below the confidence threshold.
4. The method of claim 1, wherein to tokenize the second search term comprises to split the second term into a third search term and a fourth search term based on the second language-specific tokenization rule, and wherein to execute the search function using the first search query includes the third search term and the fourth search term in the search function.
5. The method of claim 1, wherein the tokenized first search term forms at least a portion of a key to access the first bitmap.
6. A computer-implemented method comprising: receiving, by a search service of a provider network, a first search query to search a first searchable document set, the first search query including a first search term in a first language; applying a first tokenization rule to identify the first search term in the first search query; determining that the first search term is in the first language; applying a second tokenization rule to tokenize the first search term based on the first search term being in the first language; causing a launch of a search instance by a managed compute service of the provider network, the search instance to execute a search function using the first search query for a keyword-based text search; and by the search instance executing the search function: identifying, using the tokenized first search term, a first bitmap in a data store, the first bitmap including a bit for each document in the first searchable document set; determining that a first document in the first searchable document set contains the tokenized first search term based at least in part on a set bit corresponding to the first document in the first bitmap; and generating a search response to the keyword-based text search, wherein the search response includes an indication of the first document.
7. The method of claim 6, wherein applying the first tokenization rule divides the first search query into the first search term and a second search term, the method further comprising: determining that the second search term is in a second language.
8. The method of claim 7, further comprising: applying a third tokenization rule to tokenize the second search term based on the second search term being in the second language.
9. The method of claim 8, wherein to execute the search function includes the tokenized first search term and the tokenized second search term.
10. The method of claim 8, further comprising: identifying, using the tokenized second search term, a second bitmap in the data store, the second bitmap including a bit for each document in the first searchable document set.
11. The method of claim 10, further comprising: determining that a second document in the first searchable document set contains the tokenized second search term based at least in part on a set bit corresponding to the first document in the second bitmap.
12. The method of claim 11, wherein the search response further includes an indication of the second document.
13. The method of claim 6, further comprising: generating a first confidence score indicating a likelihood that the first search term is in the first language; determining that the first confidence score exceeds a confidence threshold; and wherein determining that the first search term is in the first language is based on the confidence score exceeding the confidence threshold.
14. The method of claim 6, wherein applying the first tokenization rule divides the first search query into the first search term and a second search term, the method further comprising: generating a first confidence score indicating a likelihood that the second search term is in a second language; and determining that the first confidence score is below a confidence threshold, wherein to execute the search function using the first search query excludes the second search term from the search function based on the first confidence score being below the confidence threshold.
15. The method of claim 6, wherein to tokenize the first search term comprises to split the first term into a second search term and a third search term based on the second tokenization rule, and wherein to execute the search function using the first search query includes the second search term and the third search term in the search function.
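The bitmap lookup and the confidence-threshold exclusion recited in the claims can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed system: the data store is modeled as an in-memory dictionary, each bitmap is a Python integer with one bit per document position, the key layout `(doc_type, attribute, token)` is one possible reading of "forms at least a portion of a key", and the names `build_index`, `search`, and the `0.5` default threshold are hypothetical.

```python
from collections import defaultdict


def build_index(docs: list[dict], doc_type: str, attribute: str) -> dict:
    """Build one bitmap per token; bit i is set when document i contains it.

    Assumption: documents are dicts and the attribute holds plain text.
    """
    index = defaultdict(int)
    for position, doc in enumerate(docs):
        for token in doc[attribute].lower().split():
            index[(doc_type, attribute, token)] |= 1 << position
    return index


def search(index: dict, doc_type: str, attribute: str,
           scored_tokens: list[tuple[str, float]],
           threshold: float = 0.5) -> list[int]:
    """Return positions of documents matching at least one accepted token.

    Tokens whose language-confidence score is below the threshold are
    excluded. Only bitmaps are read; the documents are never opened.
    """
    combined = 0
    for token, confidence in scored_tokens:
        if confidence < threshold:
            continue  # low-confidence term is left out of the search
        combined |= index.get((doc_type, attribute, token), 0)
    return [pos for pos in range(combined.bit_length())
            if (combined >> pos) & 1]


docs = [
    {"title": "tokyo travel guide"},
    {"title": "paris museums"},
    {"title": "kyoto and tokyo by rail"},
]
index = build_index(docs, "article", "title")
matches = search(index, "article", "title", [("tokyo", 0.98), ("paris", 0.2)])
```

Here `matches` holds positions 0 and 2: the bitmap for "tokyo" has those two bits set, while "paris" is skipped because its assumed confidence score of 0.2 falls below the threshold.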
Language identification · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Translation of the query language, e.g. Chinese to English · CPC title
Vectors, bitmaps or matrices · CPC title
Indexing; Data structures therefor; Storage structures · CPC title