Probabilistic data structures for concordance management
US-10642994-B1 · May 5, 2020 · US
US11037654B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11037654-B2 |
| Application number | US-201815977667-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 11, 2018 |
| Priority date | May 12, 2017 |
| Publication date | Jun 15, 2021 |
| Grant date | Jun 15, 2021 |
Official abstract text for this publication.
Techniques for identifying and/or classifying genomic information are provided. In some embodiments, genomic information may be identified by computing systems without access to a database of reference genomic information, instead relying on locally stored probabilistic data structures representing reference genomic information. Query genomic data, such as data taken from a read-set, may be divided into sub-strings, and each of the locally-stored probabilistic data structures may be queried by each of the extracted sub-strings, generating probabilistic outputs indicating either that (a) the sub-string is probably included in the set of data represented by the probabilistic data structure or (b) the sub-string is definitely not included in the set of data. Based on the number and/or proportion of sub-strings from a read-set that are indicated as being likely represented by a probabilistic data structure, a likely identity or classification for the genomic information in the read-set may be determined.
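The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes a Bloom filter as the probabilistic data structure (the abstract does not name one), and the names `BloomFilter`, `kmers`, `build_filter`, and `coverage`, the filter size, and the toy sequences are all hypothetical.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """Minimal Bloom filter; an assumed choice of probabilistic set structure."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably a member"; False means "definitely not a member".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(sequence: str, k: int) -> List[str]:
    """Divide a sequence into overlapping sub-strings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_filter(reference: str, k: int, num_bits: int = 1 << 16, num_hashes: int = 4) -> BloomFilter:
    """Encode the distinct sub-strings of one reference sequence."""
    bloom = BloomFilter(num_bits, num_hashes)
    for sub in set(kmers(reference, k)):  # each distinct sub-string is added once
        bloom.add(sub)
    return bloom


def coverage(read: str, filters: Dict[str, BloomFilter], k: int) -> Dict[str, float]:
    """Percentage of the read's sub-strings reported as probable members, per organism."""
    subs = kmers(read, k)
    if not subs:
        return {name: 0.0 for name in filters}
    return {
        name: 100.0 * sum(sub in bloom for sub in subs) / len(subs)
        for name, bloom in filters.items()
    }


# Toy reference data (made up for illustration); the dict keys stand in for the metadata
# that associates each filter with an organism.
references = {
    "organism_a": "ACGTACGTTAGCCGATAGGCTA",
    "organism_b": "TTTTGGGGCCCCAAAATTTTGG",
}
filters = {name: build_filter(seq, k=5) for name, seq in references.items()}
scores = coverage("CGTTAGCCGATAG", filters, k=5)
best = max(scores, key=scores.get)  # organism with the highest coverage percentage
```

Only the filters and their organism labels are needed at query time, which is what lets the lookup run without access to the reference database. Because a Bloom filter never reports a stored element as absent, a read drawn entirely from one reference scores 100% against that reference; any non-zero score against an unrelated reference comes from shared sub-strings or false positives.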
Opening claim text (preview).
What is claimed is:
1. A system for identifying genomic information in a computing environment remote from a database of genomic reference data, the system comprising: one or more hardware processors; a memory storing one or more programs, the one or more programs configured to be executed by the one or more hardware processors and including instructions to: receive encoded data representing genomic reference data of a plurality of organisms, wherein the encoded data comprises: a plurality of probabilistic data structures each corresponding respectively to an organism of the plurality of organisms, wherein each of the plurality of probabilistic data structures represents a respective plurality of elements as members of a set, wherein each of the plurality of elements corresponds to a nucleic acid sub-string of the genomic reference data of the respective organism; and metadata indicating an association of each of the plurality of probabilistic data structures with a respective one of the plurality of organisms; receive data representing a nucleic acid sequence; divide the data representing the nucleic acid sequence into a plurality of portions, wherein each of the plurality of portions represents a sub-string of the nucleic acid sequence; and for each of the plurality of probabilistic data structures in the encoded genomic reference data: query the probabilistic data structure by each of the plurality of portions of the data representing the nucleic acid sequence; generate, in response to querying the probabilistic data structure, result data comprising one or more indications of whether each of the plurality of portions of the data representing the nucleic acid sequence is a member of the set of sub-strings of the genomic reference data of the respective organism; store the result data in a data structure comprising an indication of the respective organism associated with the metadata associated with the probabilistic data structure; and calculate one or more coverage metrics, wherein calculating the one or more coverage metrics comprises calculating a percentage of the plurality of portions of the data representing the nucleic acid sequence that are determined to be members of the set of sub-strings of the genomic reference data of the respective organism.
2. The system of claim 1, wherein the one or more programs include instructions to, generate an output indicating the one or more organisms associated with the probabilistic data structures for which the calculated percentages are the highest among the probabilistic data structures in the encoded data.
3. The system of claim 1, wherein generating result data comprises one of generating data indicating that an element is definitely not a member of the set and generating data indicating that an element is probably a member of the set.
4. The system of claim 1, wherein each of the probabilistic data structures has a predefined false-positive probability.
5. The system of claim 4, wherein the predefined false-positive probability is set at least in part in accordance with available processing resources of the one or more hardware processors or of associated storage.
6. The system of claim 4, wherein the predefined false-positive probability is set at least in part in accordance with available storage resources associated with the one or more hardware processors.
7. The system of claim 4, wherein the predefined false-positive probability is set at least in part in accordance with requirements for accuracy of comparisons to be made against the probabilistic data structure.
8. The system of claim 1, wherein each of the plurality of probabilistic data structures is configured such that redundant reference sub-strings are represented as members of the respective set only once.
9. The system of claim 8, calculating the one or more coverage metrics comprises accounting for a number of times that one or more of the redundant sub-strings appeared in the genomic reference data.
10. The system of claim 1, wherein the one or more programs further include instructions to, for each of the plurality of probabilistic data structures in the encoded genomic reference data, if the percentage exceeds a predefined threshold percentage, determine that the nucleic acid sequence is genetically associated with the reference genome.
11. The system of claim 10, wherein determining that the nucleic acid sequence is genetically associated with the reference genome comprises determining that the nucleic acid sequence and the reference genome represent one or both of: a same species, and a same strain.
12. The system of claim 1, wherein the plurality of probabilistic data structures comprises, for each organism of the plurality of organisms, a suite of probabilistic data structures representing sub-strings of varying lengths.
13. The system of claim 12, wherein the one or more programs include instructions to, determine, based on querying multiple probabilistic data structures in one or more of the suites of probabilistic data structures, a consensus across the probabilistic data structures for multiple different sub-string lengths.
14. A method for identifying genomic information in a computing environment remote from a database of genomic reference data, the method comprising: at a system comprising one or more processors and a memory: receiving encoded data representing genomic reference data of a plurality of organisms, wherein the encoded data comprises: a plurality of probabilistic data structures each corresponding respectively to an organism of the plurality of organisms, wherein each of the plurality of probabilistic data structures represents a respective plurality of elements as members of a set, wherein each of the plurality of elements corresponds to a nucleic acid sub-string of the genomic reference data of the respective organism; and metadata indicating an association of each of the plurality of probabilistic data structures with a respective one of the plurality of organisms; receiving data representing a nucleic acid sequence; dividing the data representing the nucleic acid sequence into a plurality of portions, wherein each of the plurality of portions represents a sub-string of the nucleic acid sequence; and for each of the plurality of probabilistic data structures in the encoded genomic reference data: querying the probabilistic data structure by each of the plurality of portions of the data representing the nucleic acid sequence; generating, in response to querying the probabilistic data structure, result data comprising one or more indications of whether each of the plurality of portions of the data representing the nucleic acid sequence is a member of the set of sub-strings of the genomic reference data of the respective organism; storing the result data in a data structure comprising an indication of the organism associated with the metadata associated with the probabilistic data structure; and calculating one or more coverage metrics, wherein calculating the one or more coverage metrics comprises calculating a percentage of the plurality of portions of the data representing the nucleic acid sequence that are determined to be members of the set of sub-strings of the genomic reference data of the respective organism.
15. A non-transitory computer-readable storage medium storing one or more programs for identifying genomic information in a computing environment remote from a database of genomic reference data, the one or more programs configured to be executed by one or more processors and including instructions to: receive encoded data re
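Claims 4 to 7 tie the predefined false-positive probability to available processing and storage resources and to accuracy requirements. As a rough illustration of that trade-off, and assuming the probabilistic data structure is a Bloom filter (the claims do not say so), the standard Bloom filter sizing formulas give the bit-array size and hash count for a target false-positive probability. The function name `bloom_parameters` and the example figures are hypothetical.

```python
import math
from typing import Tuple


def bloom_parameters(num_items: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (num_bits, num_hashes) for a Bloom filter holding num_items elements.

    Uses the standard approximations m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be strictly between 0 and 1")
    num_bits = math.ceil(-num_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes


# Example: one million distinct reference sub-strings at two accuracy targets.
loose_bits, loose_hashes = bloom_parameters(1_000_000, 0.01)
strict_bits, strict_hashes = bloom_parameters(1_000_000, 0.0001)
```

Under these formulas, tightening the false-positive probability increases both the memory footprint and the number of hash evaluations per query, which is one way to read the dependence on storage and processing resources in claims 5 and 6.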
ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title
ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title
by using string matching techniques · CPC title
Sequence alignment; Homology search · CPC title
Presentation of query results · CPC title