Method and system for normalization of gene names in medical text

US2021319854A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2021319854-A1
Application number: US-201917272598-A
Country: US
Kind code: A1
Filing date: Aug 13, 2019
Priority date: Aug 28, 2018
Publication date: Oct 14, 2021
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method (100) for standardizing gene nomenclature, comprising: (i) receiving (110) a source; (ii) tokenizing (120) the source; (iii) comparing (130) a first token to a prefix tree structure with a root node, edges, and leaf nodes; (iv) determining (140) which edge extending from the root node to associated first leaf nodes the first token matches; (v) updating (150) an identification pointer with the location of the first leaf node; (vi) determining (160) which of one or more edges that a second token matches; (vii) updating (170) the identification pointer with the location of the second leaf node; (viii) repeating (172) the determining (160) and updating (170) steps with subsequent tokens until a subsequent token fails to match an edge extending from a leaf node or there is no edge extending from the leaf node; and (ix) providing (180) an identification of a canonical gene name.
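The matching loop in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the names `Node`, `build_trie`, and `normalize`, the lowercased pre-tokenized input, and the four-row table are all assumptions made for illustration. The variable `node` plays the role of the identification pointer, and `last` remembers the most recent position on the path that completes a known identifier, so a partial match falls back to the longest identifier seen so far.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Outgoing edges keyed by token (hypothetical layout, not from the patent).
    edges: dict = field(default_factory=dict)
    # Canonical gene name if the path from the root to here spells a full identifier.
    canonical: Optional[str] = None


def build_trie(table):
    """Build a prefix tree from (identifier_tokens, canonical_name) pairs."""
    root = Node()
    for tokens, canonical in table:
        node = root
        for tok in tokens:
            node = node.edges.setdefault(tok, Node())
        node.canonical = canonical
    return root


def normalize(tokens, root):
    """Return (start, end, canonical_name) for each longest match in the token stream."""
    hits = []
    i = 0
    while i < len(tokens):
        node = root   # the "identification pointer" starts at the root
        last = None   # most recent point on the path that completes an identifier
        j = i
        # Follow edges until a token fails to match or the node has no edges.
        while j < len(tokens) and tokens[j] in node.edges:
            node = node.edges[tokens[j]]
            j += 1
            if node.canonical is not None:
                last = (j, node.canonical)
        if last is not None:
            end, canonical = last
            hits.append((i, end, canonical))
            i = end   # resume scanning after the matched identifier
        else:
            i += 1    # no identifier starts here; advance by one token
    return hits


# Illustrative table only: identifiers are already tokenized and lowercased.
table = [
    (["her", "2"], "ERBB2"),
    (["erbb", "2"], "ERBB2"),
    (["p", "53"], "TP53"),
    (["tp", "53"], "TP53"),
]
root = build_trie(table)
print(normalize(["mutations", "in", "her", "2", "and", "p", "53"], root))
# → [(2, 4, 'ERBB2'), (5, 7, 'TP53')]
```

Each hit carries the token offsets of the identifier in the source, which is one way to return location information alongside the canonical name.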

First claim

Opening claim text (preview).

1. A computer implemented method for standardizing gene nomenclature, comprising:
    receiving a source comprising one or more gene identifiers;
    tokenizing text from the source into a token stream;
    comparing a first token from the token stream to a data structure generated from a database of gene identifiers and corresponding canonical gene name for each of a plurality of genes, the data structure comprising a prefix tree structure with a root node, a plurality of edges, and a plurality of leaf nodes;
    determining which of one or more edges extending from the root node to associated first leaf nodes the first token matches;
    updating an identification pointer with the location of the first leaf node associated with the matching edge;
    determining which, if any, of one or more edges extending from the first leaf node to second leaf nodes that a second, subsequent token from the token stream matches;
    updating, if the second token matches an edge extending from the first leaf node, the identification pointer with the location of the second leaf node associated with the matching edge;
    repeating the determining and updating steps with subsequent tokens from the token stream using any additional extending edges and leaf nodes until a subsequent token fails to match an edge extending from a leaf node or there is no edge extending from the leaf node; and
    providing, when a subsequent token fails to match an edge extending from a leaf node, or if there is no edge extending from the leaf node, an identification of a canonical gene name based on a most recent location from the identification pointer.

2. The method of claim 1, further comprising: generating a curated table of gene identifiers and associated canonical gene names; tokenizing the curated table into a token stream; and generating, using the token stream, the prefix tree structure.

3. The method of claim 2, wherein tokenization of the curated table into a token stream and the tokenization of text from the source into a token stream utilizes the same tokenization logic.

4. The method of claim 2, further comprising: generating a list of most common words in a language; comparing the list of most common words to the gene identifiers in the curated table; and identifying any gene identifiers found in the list of most common words.

5. The method of claim 1, wherein providing an identification of a canonical gene name based on a most recent location from the identification pointer comprises: updating the identification pointer with a location of a leaf node upstream of the most recent matching leaf node.

6. The method of claim 1, further comprising the step of determining whether the identified canonical gene name is also a common natural language word.

7. The method of claim 1, wherein tokenization comprises generating an acronym for every capitalized phrase in a document, and preventing or removing each occurrence of a generated acronym from the token stream.

8. The method of claim 1, wherein tokenization comprises generating a single token of each string of Roman letters, generating a single token of each run of numbers, and generating a single token with each Greek letter.

9. The method of claim 1, wherein providing an identification of a canonical gene name comprises information about a location of a gene identifier in the source.

10. A system for standardizing gene nomenclature, comprising:
    a source comprising one or more gene identifiers;
    a data structure generated from a database of gene identifiers and corresponding canonical gene name for each of a plurality of genes, the data structure comprising a prefix tree structure with a root node, a plurality of edges, and a plurality of leaf nodes; and
    a processor configured to:
    (i) tokenize text from the source into a token stream;
    (ii) compare a first token from the token stream to the data structure;
    (iii) determine which of one or more edges extending from the root node to associated first leaf nodes the first token matches;
    (iv) update an identification pointer with the location of the first leaf node associated with the matching edge;
    (v) determine which, if any, of one or more edges extending from the first leaf node to second leaf nodes that a second, subsequent token from the token stream matches;
    (vi) update, if the second token matches an edge extending from the first leaf node, the identification pointer with the location of the second leaf node associated with the matching edge;
    (vii) repeat the determining and updating with subsequent tokens from the token stream using any additional extending edges and leaf nodes until a subsequent token fails to match an edge extending from a leaf node or there is no edge extending from the leaf node; and
    (viii) provide, when a subsequent token fails to match an edge extending from a leaf node, or if there is no edge extending from the leaf node, an identification of a canonical gene name based on a most recent location from the identification pointer.

11. The system of claim 10, wherein the processor is further configured to: generate a curated table of gene identifiers and associated canonical gene names; tokenize the curated table into a token stream; and generate, using the token stream, the prefix tree structure.

12. The system of claim 10, wherein the processor is further configured to: generate a list of most common words in a language; compare the list of most common words to the gene identifiers in the curated table; and identify any gene identifiers found in the list of most common words.

13. The system of claim 10, wherein the processor is further configured to update the identification pointer with a location of a leaf node upstream of the most recent matching leaf node.

14. The system of claim 10, wherein tokenization comprises generating an acronym for every capitalized phrase in a document, and preventing or removing each occurrence of a generated acronym from the token stream.

15. The system of claim 10, wherein providing an identification of a canonical gene name comprises information about a location of a gene identifier in the source.
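The tokenization rule of claim 8 and the common-word check of claims 4 and 12 can be sketched as follows. This is a rough illustration under stated assumptions, not the patented logic: the function names `tokenize` and `flag_common_words`, the regular expression, the lowercasing step, and the sample word lists are all hypothetical. Applying the same `tokenize` function to both the curated table and the source text is one way to satisfy the "same tokenization logic" condition of claim 3.

```python
import re

# One token per run of Roman letters, per run of digits, or per single Greek letter.
# The Greek ranges cover the basic upper- and lowercase alphabet only (an assumption).
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[\u0391-\u03A9\u03B1-\u03C9]")


def tokenize(text):
    """Split text into letter runs, digit runs, and single Greek letters.

    Lowercasing is an added assumption so that "HER2" and "her2" tokenize alike.
    Punctuation and whitespace are dropped.
    """
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def flag_common_words(identifiers, common_words):
    """Return the gene identifiers that also appear in a list of common words."""
    common = {w.lower() for w in common_words}
    return sorted(ident for ident in identifiers if ident.lower() in common)


print(tokenize("TGF-β1 and HER2"))
# → ['tgf', 'β', '1', 'and', 'her', '2']

# Illustrative inputs only: a real common-word list would come from a language corpus.
print(flag_common_words(["CAT", "WAS", "ERBB2"], ["the", "was", "cat", "and"]))
# → ['CAT', 'WAS']
```

Flagged identifiers such as these are candidates for extra handling, since a plain-text match on them is more likely to be an ordinary word than a gene mention.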

Assignees

Koninklijke Philips Nv

Inventors

Classifications

  • G16B50/10 (Primary)

    Ontologies; Annotations · CPC title

  • Editing, e.g. inserting or deleting · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021319854A1 cover?
A method (100) for standardizing gene nomenclature, comprising: (i) receiving (110) a source; (ii) tokenizing (120) the source; (iii) comparing (130) a first token to a prefix tree structure with a root node, edges, and leaf nodes; (iv) determining (140) which edge extending from the root node to associated first leaf nodes the first token matches; (v) updating (150) an identification pointer w…
Who is the assignee on this patent?
Koninklijke Philips Nv
What technology area does this patent fall under?
Primary CPC classification G16B50/10. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 14, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).