Method and system for normalization of gene names in medical text
US-2021319854-A1 · Oct 14, 2021 · US
US11763083B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11763083-B2 |
| Application number | US-202017798638-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 18, 2020 |
| Priority date | May 18, 2020 |
| Publication date | Sep 19, 2023 |
| Grant date | Sep 19, 2023 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token ID and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token IDs that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_ID(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_ID(s).
Opening claim text (preview).
The invention claimed is: 1. A computer-implemented method comprising: performing, by one or more processors of a processing system, tokenization of a string of text, comprising: analyzing a first node of a vocabulary trie structure, and identifying a link between the first node and a second node of the vocabulary trie structure corresponding to a first character of the string; determining not to store a token associated with the first node based on the link between the first node and the second node; analyzing the second node, and identifying a link between the second node and a third node of the vocabulary trie structure corresponding to a second character of the string; determining not to store a token associated with the second node based on the link between the second node and the third node; analyzing the third node to determine that the third node has no link corresponding to a third character of the string, and identifying a fail link between the third node and a fourth node of the vocabulary trie structure; storing a first token associated with the third node, the first token representing a word or wordpiece comprised of the first character and the second character of the string; analyzing the fourth node to determine that the fourth node has no link corresponding to the third character of the string; storing a second token associated with the fourth node, the second token representing a word or wordpiece comprised of the third character of the string; and concatenating the first token and the second token to form an array of tokens; and providing, by the one or more processors, the array of tokens to a neural network for natural language processing. 2. The method of claim 1 , wherein the first token comprises a word or wordpiece including the first character and second character of the string, and the second token includes the third character of the string. 3. The method of claim 1 , wherein the first token identifies an entry in a vocabulary for a word or wordpiece including the first character and second character of the string, and the second token identifies an entry in the vocabulary for the third character of the string. 4. The method of claim 1 , wherein the string further comprises a fourth character. 5. The method of claim 4 , wherein the fourth character is a symbol representing the end of the string. 6. A computer-implemented method comprising: performing, by one or more processors of a processing system, tokenization of a string of text, comprising: analyzing a first node of a vocabulary trie structure, and identifying a link between the first node and a second node of the vocabulary trie structure corresponding to a first character of the string; determining not to store a token based on the link between the first node and the second node; analyzing the second node, and identifying a link between the second node and a third node of the vocabulary trie structure corresponding to a second character of the string; determining not to store a token based on the link between the second node and the third node; analyzing the third node, and identifying a link between the third node and a fourth node of the vocabulary trie structure corresponding to a third character of the string; determining not to store a token based on the link between the third node and the fourth node; analyzing the fourth node, and identifying a link between the fourth node and a fifth node of the vocabulary trie structure corresponding to a fourth character of the string; determining not to store a token based on the link between the fourth node and the fifth node; analyzing the fifth node to determine that the fifth node has no link corresponding to a fifth character of the string, and identifying a fail link between the fifth node and a sixth node of the vocabulary trie structure, and a previous match link between the fifth node and the third node; storing a first token associated with the third node, the first token representing a word or wordpiece comprised of the first character and the second character of the string; storing a second token associated with the fifth node, the second token representing a word or wordpiece comprised of the third character of the string; analyzing the sixth node to determine that the sixth node has no link corresponding to the fifth character of the string, and no previous match link; storing a third token associated with the sixth node, the third token representing a word or wordpiece comprised of the fourth character of the string; and concatenating the first token, the second token, and the third token to form an array of tokens; and providing, by the one or more processors, the array of tokens to a neural network for natural language processing. 7. The method of claim 6 , wherein the first token comprises a word or wordpiece including the first character and second character of the string, the second token includes the third character of the string, and the third token includes the fourth character of the string. 8. The method of claim 6 , wherein the first token identifies an entry in a vocabulary for a word or wordpiece including the first character and second character of the string, the second token identifies an entry in the vocabulary for the third character of the string, and the third token identifies an entry in the vocabulary for the fourth character of the string. 9. The method of claim 6 , wherein the string further comprises a fifth character. 10. The method of claim 9 , wherein the fifth character is a symbol representing the end of the string. 11. A processing system comprising: a memory; and one or more processors coupled to the memory and configured to: perform tokenization of a string of text, comprising: analyzing a first node of a vocabulary trie structure, and identifying a link between the first node and a second node of the vocabulary trie structure corresponding to a first character of the string; determining not to store a token associated with the first node based on the link between the first node and the second node; analyzing the second node, and identifying a link between the second node and a third node of the vocabulary trie structure corresponding to a second character of the string; determining not to store a token associated with the second node based on the link between the second node and the third node; analyzing the third node to determine that the third node has no link corresponding to a third character of the string, and identifying a fail link between the third node and a fourth node of the vocabulary trie structure; storing a first token associated with the third node, the first token representing a word or wordpiece comprised of the first character and the second character of the string; analyzing the fourth node to determine that the fourth node has no link corresponding to the third character of the string; storing a second token associated with the fourth node, the second token representing a word or wordpiece comprised of the third character of the string; and concatenating the first token and the second token to form an array of tokens; and provide the array of tokens to a neural network for natural language processing. 12. The system of claim 11 , wherein the first token comprises a word or wordpiece including the first character and second character of the string, and the second token includes the third character of the string. 13. The system of claim 11 , wherein the first token identifies an entry in a vocabulary for a word or wordpiece including the first character and second character of the string, and the
Lexical tools · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Selection or weighting of terms from queries, including natural language queries · CPC title
Trees · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.