Rapid genomic sequence classification using probabilistic data structures

US11037654B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11037654-B2
Application number: US-201815977667-A
Country: US
Kind code: B2
Filing date: May 11, 2018
Priority date: May 12, 2017
Publication date: Jun 15, 2021
Grant date: Jun 15, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for identifying and/or classifying genomic information are provided. In some embodiments, genomic information may be identified by computing systems without access to a database of reference genomic information, instead relying on locally stored probabilistic data structures representing reference genomic information. Query genomic data, such as data taken from a read-set, may be divided into sub-strings, and each of the locally-stored probabilistic data structures may be queried by each of the extracted sub-strings, generating probabilistic outputs indicating either that (a) the sub-string is probably included in the set of data represented by the probabilistic data structure or (b) the sub-string is definitely not included in the set of data. Based on the number and/or proportion of sub-strings from a read-set that are indicated as being likely represented by a probabilistic data structure, a likely identity or classification for the genomic information in the read-set may be determined.
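
The query flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes the probabilistic data structure is a Bloom filter (the abstract only says "probabilistic data structure", though the "probably included / definitely not included" behaviour matches one), and the class, function names, toy sequences, sub-string length `K`, and filter sizes are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: answers 'probably present' or 'definitely absent'."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably a member"; False means "definitely not".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(sequence: str, k: int):
    """Yield every overlapping sub-string of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_filter(reference: str, k: int, num_bits: int = 1 << 16,
                 num_hashes: int = 4) -> BloomFilter:
    """Encode the sub-strings of one reference genome as one filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for sub in kmers(reference, k):
        bf.add(sub)
    return bf


def coverage(read: str, filters: dict, k: int) -> dict:
    """Percentage of the read's sub-strings reported as probable members,
    computed separately for each organism's filter."""
    subs = list(kmers(read, k))
    if not subs:
        return {name: 0.0 for name in filters}
    return {
        name: 100.0 * sum(sub in bf for sub in subs) / len(subs)
        for name, bf in filters.items()
    }


# Toy reference data (made up for illustration only).
references = {
    "organism_a": "ACGTACGTTAGCCGATAGGCTTAACGGATCCA",
    "organism_b": "TTGACCGGTATTCCGAGGAAGTCTTGACAGGT",
}
K = 8
filters = {name: build_filter(seq, K) for name, seq in references.items()}

read = references["organism_a"][4:24]  # a read drawn from organism_a
scores = coverage(read, filters, K)
best = max(scores, key=scores.get)     # organism with the highest percentage
print(scores, best)
```

Because a Bloom filter never produces false negatives, a read taken verbatim from a reference always scores 100% against that reference's filter; scores against unrelated filters depend on true shared sub-strings plus the filter's false-positive rate.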

First claim

Opening claim text (preview).

What is claimed is:

1. A system for identifying genomic information in a computing environment remote from a database of genomic reference data, the system comprising: one or more hardware processors; a memory storing one or more programs, the one or more programs configured to be executed by the one or more hardware processors and including instructions to: receive encoded data representing genomic reference data of a plurality of organisms, wherein the encoded data comprises: a plurality of probabilistic data structures each corresponding respectively to an organism of the plurality of organisms, wherein each of the plurality of probabilistic data structures represents a respective plurality of elements as members of a set, wherein each of the plurality of elements corresponds to a nucleic acid sub-string of the genomic reference data of the respective organism; and metadata indicating an association of each of the plurality of probabilistic data structures with a respective one of the plurality of organisms; receive data representing a nucleic acid sequence; divide the data representing the nucleic acid sequence into a plurality of portions, wherein each of the plurality of portions represents a sub-string of the nucleic acid sequence; and for each of the plurality of probabilistic data structures in the encoded genomic reference data: query the probabilistic data structure by each of the plurality of portions of the data representing the nucleic acid sequence; generate, in response to querying the probabilistic data structure, result data comprising one or more indications of whether each of the plurality of portions of the data representing the nucleic acid sequence is a member of the set of sub-strings of the genomic reference data of the respective organism; store the result data in a data structure comprising an indication of the respective organism associated with the metadata associated with the probabilistic data structure; and calculate one or more coverage metrics, wherein calculating the one or more coverage metrics comprises calculating a percentage of the plurality of portions of the data representing the nucleic acid sequence that are determined to be members of the set of sub-strings of the genomic reference data of the respective organism.

2. The system of claim 1, wherein the one or more programs include instructions to, generate an output indicating the one or more organisms associated with the probabilistic data structures for which the calculated percentages are the highest among the probabilistic data structures in the encoded data.

3. The system of claim 1, wherein generating result data comprises one of generating data indicating that an element is definitely not a member of the set and generating data indicating that an element is probably a member of the set.

4. The system of claim 1, wherein each of the probabilistic data structures has a predefined false-positive probability.

5. The system of claim 4, wherein the predefined false-positive probability is set at least in part in accordance with available processing resources of the one or more hardware processors or of associated storage.

6. The system of claim 4, wherein the predefined false-positive probability is set at least in part in accordance with available storage resources associated with the one or more hardware processors.

7. The system of claim 4, wherein the predefined false-positive probability is set at least in part in accordance with requirements for accuracy of comparisons to be made against the probabilistic data structure.

8. The system of claim 1, wherein each of the plurality of probabilistic data structures is configured such that redundant reference sub-strings are represented as members of the respective set only once.

9. The system of claim 8, calculating the one or more coverage metrics comprises accounting for a number of times that one or more of the redundant sub-strings appeared in the genomic reference data.

10. The system of claim 1, wherein the one or more programs further include instructions to, for each of the plurality of probabilistic data structures in the encoded genomic reference data, if the percentage exceeds a predefined threshold percentage, determine that the nucleic acid sequence is genetically associated with the reference genome.

11. The system of claim 10, wherein determining that the nucleic acid sequence is genetically associated with the reference genome comprises determining that the nucleic acid sequence and the reference genome represent one or both of: a same species, and a same strain.

12. The system of claim 1, wherein the plurality of probabilistic data structures comprises, for each organism of the plurality of organisms, a suite of probabilistic data structures representing sub-strings of varying lengths.

13. The system of claim 12, wherein the one or more programs include instructions to, determine, based on querying multiple probabilistic data structures in one or more of the suites of probabilistic data structures, a consensus across the probabilistic data structures for multiple different sub-string lengths.

14. A method for identifying genomic information in a computing environment remote from a database of genomic reference data, the method comprising: at a system comprising one or more processors and a memory: receiving encoded data representing genomic reference data of a plurality of organisms, wherein the encoded data comprises: a plurality of probabilistic data structures each corresponding respectively to an organism of the plurality of organisms, wherein each of the plurality of probabilistic data structures represents a respective plurality of elements as members of a set, wherein each of the plurality of elements corresponds to a nucleic acid sub-string of the genomic reference data of the respective organism; and metadata indicating an association of each of the plurality of probabilistic data structures with a respective one of the plurality of organisms; receiving data representing a nucleic acid sequence; dividing the data representing the nucleic acid sequence into a plurality of portions, wherein each of the plurality of portions represents a sub-string of the nucleic acid sequence; and for each of the plurality of probabilistic data structures in the encoded genomic reference data: querying the probabilistic data structure by each of the plurality of portions of the data representing the nucleic acid sequence; generating, in response to querying the probabilistic data structure, result data comprising one or more indications of whether each of the plurality of portions of the data representing the nucleic acid sequence is a member of the set of sub-strings of the genomic reference data of the respective organism; storing the result data in a data structure comprising an indication of the organism associated with the metadata associated with the probabilistic data structure; and calculating one or more coverage metrics, wherein calculating the one or more coverage metrics comprises calculating a percentage of the plurality of portions of the data representing the nucleic acid sequence that are determined to be members of the set of sub-strings of the genomic reference data of the respective organism.

15. A non-transitory computer-readable storage medium storing one or more programs for identifying genomic information in a computing environment remote from a database of genomic reference data, the one or more programs configured to be executed by one or more processors and including instructions to: receive encoded data re
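
Claims 4 to 7 tie the predefined false-positive probability to available storage and to accuracy requirements. As a rough illustration of that trade-off, the sketch below applies the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. It assumes the probabilistic data structure is a Bloom filter, which the claims do not state, and the function name `bloom_parameters` is hypothetical.

```python
import math


def bloom_parameters(num_items: int, false_positive_rate: float) -> tuple:
    """Return (num_bits, num_hashes) for a Bloom filter holding num_items
    distinct sub-strings at roughly the requested false-positive rate."""
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    # m = -n * ln(p) / (ln 2)^2
    num_bits = math.ceil(
        -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes


# Compare the storage cost of two accuracy targets for one million
# distinct reference sub-strings (an arbitrary, illustrative count).
for p in (0.01, 0.001):
    bits, hashes = bloom_parameters(1_000_000, p)
    print(f"p={p}: {bits / 8 / 1024:.0f} KiB, {hashes} hash functions")
```

Tightening the false-positive rate by a factor of ten costs roughly 4.8 additional bits per stored sub-string, which is the kind of storage-versus-accuracy choice the dependent claims describe.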

Assignees

Noblis Inc

Inventors

Classifications

  • G16B30/00 (Primary)

    ICT specially adapted for sequence analysis involving nucleotides or amino acids

  • ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

  • by using string matching techniques

  • Sequence alignment; Homology search

  • Presentation of query results

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11037654B2 cover?
Techniques for identifying and/or classifying genomic information are provided. In some embodiments, genomic information may be identified by computing systems without access to a database of reference genomic information, instead relying on locally stored probabilistic data structures representing reference genomic information. Query genomic data, such as data taken from a read-set, may be div…
Who is the assignee on this patent?
Noblis Inc
What technology area does this patent fall under?
Primary CPC classification G16B30/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 15, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).