DNA alignment using a hierarchical inverted index table

US12087403B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-12087403-B2
Application numberUS-202318114065-A
CountryUS
Kind codeB2
Filing dateFeb 24, 2023
Priority dateOct 21, 2015
Publication dateSep 10, 2024
Grant dateSep 10, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

System and method for constructing a hierarchical index table usable for matching a search sequence to reference data. The index table may be constructed to contain entries associated with an exhaustive list of all subsequences of a given length, wherein each entry contains the number and locations of matches of each subsequence in the reference data. The hierarchical index table may be constructed in an iterative manner, wherein entries for each lengthened subsequence are selectively and iteratively constructed based on the number of matches being greater than each of a set of respective thresholds. The hierarchical index table may be used to search for matches between a search sequence and reference data, and to perform misfit identification and characterization upon each respective candidate match.

First claim

Opening claim text (preview).

What is claimed is: 1. A multiprocessor array (MPA) comprising interspersed processors and communication elements coupled to at least one non-transitory computer-readable memory medium, wherein the at least one non-transitory computer-readable memory medium comprises program instructions executable by the MPA to: determine a length and a type of an indel in a search sequence using an anchor position in the search sequence, wherein the search sequence is matched to reference data at a match location; determine a starting position of the indel, wherein, in determining the starting position of the indel, the program instructions are executable to cause the MPA to: compute a first running error sum between the search sequence and the reference data matched at the match location; and compute a second running error sum between the search sequence and the reference data matched at an offset from the match location, wherein the offset is based on the type and length of the indel; determine the starting position of the indel to correspond to a minimum of the second running error sum subtracted from the first running error sum; and store the length, size, and starting position of the indel in the non-transitory computer-readable memory medium. 2. The MPA of claim 1 , wherein the program instructions are further executable to cause the MPA to: select the anchor position in the search sequence, wherein, in selecting the anchor position in the search sequence, the program instructions are executable to cause the MPA to: determine a first number of mismatches at a low end of the search sequence and a second number of mismatches at a high end of the search sequence; and select the anchor position to be the low end or the high end based on a comparison of the first and second numbers. 3. The MPA of claim 1 , wherein, in determining the length and the type of the indel using the anchor position, the program instructions are executable to cause the MPA to: for each of a plurality of offsets from the match location: determine a respective number of mismatches between the search sequence and the reference data within a predetermined number of locations from an opposite end of the search sequence from the anchor position; determine the length of the indel to correspond to a magnitude of a first offset of the plurality of offsets corresponding to a smallest number of mismatches; and determine the type of the indel to be an insertion when the first offset is negative and a deletion when the first offset is positive. 4. The MPA of claim 1 , wherein the first and second running error sums are computed starting from the anchor position. 5. The MPA of claim 1 , wherein the program instructions are further executable to cause the MPA to: match the search sequence to the reference data using a hierarchical index table. 6. The MPA of claim 1 , wherein the reference data comprises a reference genome, and wherein the search sequence comprises a short read of a sample genome. 7. The MPA of claim 1 , wherein the program instructions are further executable to cause the MPA to: determine a running average of a number of mismatches of the search sequence to the reference data from a first end of the search sequence for a predetermined number of locations; and determine whether the running average is within a predetermined threshold of a first ratio, wherein determining the length, type and starting position of the indel is performed responsive to a determination that the running average is within the predetermined threshold of the first ratio. 8. The MPA of claim 7 , wherein the reference data comprises a reference genome, wherein the search sequence comprises a short read of a sample genome, and wherein the first ratio is ¾. 9. A non-transitory computer-readable memory medium comprising program instructions for characterizing an indel in a search sequence, wherein the program instructions are executable to: determine a length and a type of the indel using an anchor position in the search sequence, wherein the search sequence is matched to reference data at a match location; determine a starting position of the indel, wherein, in determining the starting position of the indel, the program instructions are executable to: compute a first running error sum between the search sequence and the reference data matched at the match location; and compute a second running error sum between the search sequence and the reference data matched at an offset from the match location, wherein the offset is based on the type and length of the indel; determine the starting position of the indel to correspond to a minimum of the second running error sum subtracted from the first running error sum; and store the length, size, and starting position of the indel in the non-transitory computer-readable memory medium. 10. The non-transitory computer-readable memory medium of claim 9 , wherein the program instructions are further executable to: select the anchor position in the search sequence, wherein, in selecting the anchor position in the search sequence, the program instructions are executable to: determine a first number of mismatches at a low end of the search sequence and a second number of mismatches at a high end of the search sequence; and select the anchor position to be the low end or the high end based on a comparison of the first and second numbers. 11. The non-transitory computer-readable memory medium of claim 9 , wherein, in determining the length and the type of the indel using the anchor position, the program instructions are executable to: for each of a plurality of offsets from the match location: determine a respective number of mismatches between the search sequence and the reference data within a predetermined number of locations from an opposite end of the search sequence from the anchor position; determine the length of the indel to correspond to a magnitude of a first offset of the plurality of offsets corresponding to a smallest number of mismatches; and determine the type of the indel to be an insertion when the first offset is negative and a deletion when the first offset is positive. 12. The non-transitory computer-readable memory medium of claim 9 , wherein the first and second running error sums are computed starting from the anchor position. 13. The non-transitory computer-readable memory medium of claim 9 , wherein the reference data comprises a reference genome, and wherein the search sequence comprises a short read of a sample genome. 14. A method for characterizing an indel in a search sequence, wherein the search sequence is matched to reference data at a match location, the method comprising: determining a length and a type of the indel using an anchor position in the search sequence; determining a starting position of the indel, comprising: computing a first running error sum between the search sequence and the reference data matched at the match location; and computing a second running error sum between the search sequence and the reference data matched at an offset from the match location, wherein the offset is based on the type and length of the indel; determining the starting position of the indel to correspond to a minimum of the second running error sum subtracted from the first running error sum; and storing the length, size, and starting position of the indel in a non-transitory computer-readable memory medium. 15. The method of claim 14 , further comprising: selecting the anchor position in the search sequence, comprising: determining a first num

Assignees

Inventors

Classifications

  • Data warehousing; Computing architectures · CPC title

  • G06F16/319Primary

    Inverted lists · CPC title

  • ICT programming tools or database systems specially adapted for bioinformatics · CPC title

  • G16B30/00Primary

    ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • Indexing structures · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12087403B2 cover?
System and method for constructing a hierarchical index table usable for matching a search sequence to reference data. The index table may be constructed to contain entries associated with an exhaustive list of all subsequences of a given length, wherein each entry contains the number and locations of matches of each subsequence in the reference data. The hierarchical index table may be constru…
Who is the assignee on this patent?
Coherent Logix Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/319. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Sep 10 2024 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).