Method and System for Fuzzy Keyword Search Over Encrypted Data
US-2020125563-A1 · Apr 23, 2020 · US
US2021319854A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021319854-A1 |
| Application number | US-201917272598-A |
| Country | US |
| Kind code | A1 |
| Filing date | Aug 13, 2019 |
| Priority date | Aug 28, 2018 |
| Publication date | Oct 14, 2021 |
| Grant date | — |
Official abstract text for this publication.
A method (100) for standardizing gene nomenclature, comprising: (i) receiving (110) a source; (ii) tokenizing (120) the source; (iii) comparing (130) a first token to a prefix tree structure with a root node, edges, and leaf nodes; (iv) determining (140) which edge extending from the root node to associated first leaf nodes the first token matches; (v) updating (150) an identification pointer with the location of the first leaf node; (vi) determining (160) which of one or more edges that a second token matches; (vii) updating (170) the identification pointer with the location of the second leaf node; (viii) repeating (172) the determining (160) and updating (170) steps with subsequent tokens until a subsequent token fails to match an edge extending from a leaf node or there is no edge extending from the leaf node; and (ix) providing (180) an identification of a canonical gene name.
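The traversal described in the abstract can be sketched as a token-level prefix tree walk. This is a minimal illustration and not the patent's implementation: the names `Node`, `build_trie`, `match_at`, and `find_genes` are hypothetical, the small alias table is illustrative only, and the choice to remember the last node that completed a full identifier (so a partial walk falls back to it) is an assumption about how the identification pointer is resolved.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Outgoing edges, keyed by token (hypothetical layout).
    children: dict = field(default_factory=dict)
    # Canonical gene name if the path to this node spells a full identifier.
    canonical: Optional[str] = None


def build_trie(table):
    """Build a prefix tree from (token_list, canonical_name) pairs."""
    root = Node()
    for tokens, canonical in table:
        node = root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())
        node.canonical = canonical
    return root


def match_at(root, tokens, start):
    """Follow matching edges from the root, starting at tokens[start].

    Returns (canonical_name, end_index) for the longest complete identifier,
    or None. `best` plays the role of the identification pointer.
    """
    node, best, i = root, None, start
    while i < len(tokens) and tokens[i] in node.children:
        node = node.children[tokens[i]]
        i += 1
        if node.canonical is not None:
            best = (node.canonical, i)
    return best


def find_genes(root, tokens):
    """Scan a token stream and return (start, end, canonical_name) hits."""
    hits, i = [], 0
    while i < len(tokens):
        m = match_at(root, tokens, i)
        if m:
            canonical, end = m
            hits.append((i, end, canonical))
            i = end
        else:
            i += 1
    return hits


# Illustrative alias table; a real one would come from a curated database.
TABLE = [
    (["tp53"], "TP53"),
    (["p53"], "TP53"),
    (["tumor", "protein", "p53"], "TP53"),
    (["brca1"], "BRCA1"),
    (["breast", "cancer", "1"], "BRCA1"),
]

root = build_trie(TABLE)
stream = ["mutations", "in", "tumor", "protein", "p53",
          "and", "breast", "cancer", "1"]
print(find_genes(root, stream))
```

Keying edges by whole tokens rather than characters lets multi-word identifiers such as "tumor protein p53" resolve in one walk, and returning token offsets alongside the name keeps the location of each identifier in the source available.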
Opening claim text (preview).
1. A computer implemented method for standardizing gene nomenclature, comprising: receiving a source comprising one or more gene identifiers; tokenizing text from the source into a token stream; comparing a first token from the token stream to a data structure generated from a database of gene identifiers and corresponding canonical gene name for each of a plurality of genes, the data structure comprising a prefix tree structure with a root node, a plurality of edges, and a plurality of leaf nodes; determining which of one or more edges extending from the root node to associated first leaf nodes the first token matches; updating an identification pointer with the location of the first leaf node associated with the matching edge; determining which, if any, of one or more edges extending from the first leaf node to second leaf nodes that a second, subsequent token from the token stream matches; updating, if the second token matches an edge extending from the first leaf node, the identification pointer with the location of the second leaf node associated with the matching edge; repeating the determining and updating steps with subsequent tokens from the token stream using any additional extending edges and leaf nodes until a subsequent token fails to match an edge extending from a leaf node or there is no edge extending from the leaf node; and providing, when a subsequent token fails to match an edge extending from a leaf node, or if there is no edge extending from the leaf node, an identification of a canonical gene name based on a most recent location from the identification pointer.
2. The method of claim 1, further comprising: generating a curated table of gene identifiers and associated canonical gene names; tokenizing the curated table into a token stream; and generating, using the token stream, the prefix tree structure.
3. The method of claim 2, wherein tokenization of the curated table into a token stream and the tokenization of text from the source into a token stream utilizes the same tokenization logic.
4. The method of claim 2, further comprising: generating a list of most common words in a language; comparing the list of most common words to the gene identifiers in the curated table; and identifying any gene identifiers found in the list of most common words.
5. The method of claim 1, wherein providing an identification of a canonical gene name based on a most recent location from the identification pointer comprises: updating the identification pointer with a location of a leaf node upstream of the most recent matching leaf node.
6. The method of claim 1, further comprising the step of determining whether the identified canonical gene name is also a common natural language word.
7. The method of claim 1, wherein tokenization comprises generating an acronym for every capitalized phrase in a document, and preventing or removing each occurrence of a generated acronym from the token stream.
8. The method of claim 1, wherein tokenization comprises generating a single token of each string of Roman letters, generating a single token of each run of numbers, and generating a single token with each Greek letter.
9. The method of claim 1, wherein providing an identification of a canonical gene name comprises information about a location of a gene identifier in the source.
10. A system for standardizing gene nomenclature, comprising: a source comprising one or more gene identifiers; a data structure generated from a database of gene identifiers and corresponding canonical gene name for each of a plurality of genes, the data structure comprising a prefix tree structure with a root node, a plurality of edges, and a plurality of leaf nodes; and a processor configured to: (i) tokenize text from the source into a token stream; (ii) compare a first token from the token stream to the data structure; (iii) determine which of one or more edges extending from the root node to associated first leaf nodes the first token matches; (iv) update an identification pointer with the location of the first leaf node associated with the matching edge; (v) determine which, if any, of one or more edges extending from the first leaf node to second leaf nodes that a second, subsequent token from the token stream matches; (vi) update, if the second token matches an edge extending from the first leaf node, the identification pointer with the location of the second leaf node associated with the matching edge; (vii) repeat the determining and updating with subsequent tokens from the token stream using any additional extending edges and leaf nodes until a subsequent token fails to match an edge extending from a leaf node or there is no edge extending from the leaf node; and (viii) provide, when a subsequent token fails to match an edge extending from a leaf node, or if there is no edge extending from the leaf node, an identification of a canonical gene name based on a most recent location from the identification pointer.
11. The system of claim 10, wherein the processor is further configured to: generate a curated table of gene identifiers and associated canonical gene names; tokenize the curated table into a token stream; and generate, using the token stream, the prefix tree structure.
12. The system of claim 10, wherein the processor is further configured to: generate a list of most common words in a language; compare the list of most common words to the gene identifiers in the curated table; and identify any gene identifiers found in the list of most common words.
13. The system of claim 10, wherein the processor is further configured to update the identification pointer with a location of a leaf node upstream of the most recent matching leaf node.
14. The system of claim 10, wherein tokenization comprises generating an acronym for every capitalized phrase in a document, and preventing or removing each occurrence of a generated acronym from the token stream.
15. The system of claim 10, wherein providing an identification of a canonical gene name comprises information about a location of a gene identifier in the source.
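The tokenization rule in claim 8 and the common-word comparison in claims 4 and 12 could look roughly like the sketch below. Everything named here is hypothetical (`TOKEN_RE`, `tokenize`, `flag_common_identifiers`), and lowercasing tokens, dropping punctuation, and recording character offsets are assumptions that the claims do not specify. The word list and identifiers are illustrative samples only.

```python
import re

# One token per run of Roman letters, per run of digits, or per single
# Greek letter (basic upper- and lowercase ranges). Punctuation and
# whitespace are skipped, which is an assumption made for this sketch.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[\u0391-\u03A9\u03B1-\u03C9]")


def tokenize(text):
    """Return (token, start_offset) pairs; tokens are lowercased."""
    return [(m.group().lower(), m.start()) for m in TOKEN_RE.finditer(text)]


def flag_common_identifiers(identifiers, common_words):
    """Return the identifiers that also appear in a common-word list."""
    common = {w.lower() for w in common_words}
    return sorted(g for g in identifiers if g.lower() in common)


print(tokenize("TGF-β1 and p53"))
print(flag_common_identifiers(["TP53", "CAT", "REST", "BRCA1"],
                              ["the", "cat", "rest", "was"]))
```

Running the same `tokenize` function over both the curated table and the source text is one way to satisfy the requirement in claim 3 that both use the same tokenization logic, and the recorded offsets give the location information mentioned in claims 9 and 15.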
Ontologies; Annotations · CPC title
Editing, e.g. inserting or deleting · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title