Methods of optimizing genome assembly parameters

US11830581B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11830581-B2
Application number  US-201916295836-A
Country             US
Kind code           B2
Filing date         Mar 7, 2019
Priority date       Mar 7, 2019
Publication date    Nov 28, 2023
Grant date          Nov 28, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An iterative process for optimizing one or more parameters used by a k-mer based de novo genome assembler program to assemble a set of sequenced nucleic acids is described. The method utilizes quality metrics whose desired values are initially specified. Computed values of the quality metrics are calculated during the assembly process and compared to the desired values. The assembly process stops when the computed values are not desired values. After modification of one or more of the parameters (e.g., k-mer value), the assembly process re-initiates using the modified parameter set. This process repeats until the computed values of the quality metrics meet the desired values. The final parameter set is then used to generate or complete one or more final assembled genomes.
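The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `optimize_parameters`, `run_assembler`, `toy_assembler`, the fixed `k_step`, the iteration cap, and the "greater than or equal to target" test are all assumptions made for the example. The patent itself speaks of priority scores and weights for choosing what to modify, which this sketch does not model.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]
Params = Dict[str, int]


def optimize_parameters(
    run_assembler: Callable[[Params], Metrics],
    params: Params,
    targets: Metrics,
    k_step: int = 2,
    max_iterations: int = 50,
) -> Params:
    """Re-run assembly with a modified k until every metric meets its target.

    Assumption: a metric "meets" its target when computed >= target.
    """
    current = dict(params)  # copy so the caller's parameter set is untouched
    for _ in range(max_iterations):
        # Hypothetical callable: runs the assembler and returns computed metrics.
        computed = run_assembler(current)
        if all(computed[name] >= target for name, target in targets.items()):
            return current  # final parameter set
        # Targets missed: this run's output is discarded (nothing is kept here)
        # and one parameter is modified before assembly is re-initiated.
        current["k"] += k_step
    raise RuntimeError("no parameter set met the targets within the iteration cap")


def toy_assembler(p: Params) -> Metrics:
    """Stand-in for a real assembler: pretends N50 grows linearly with k."""
    return {"n50": 1000.0 * p["k"]}


final_params = optimize_parameters(toy_assembler, {"k": 35}, {"n50": 41000.0})
print(final_params)
```

With the toy assembler, the loop tries k = 35, 37, 39 and stops at k = 41, the first value whose computed N50 reaches the target. A real run would replace `toy_assembler` with a wrapper around an actual assembler invocation.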

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: selecting a k-mer-based de novo genome assembler program, designated assembler; providing a set of parameters for assembly of a set of sequenced nucleic acids, the parameters having respective values and respective priority scores, the parameters including a k-mer parameter having an initial length k in number of nucleotides, k being a positive integer equal to at least 35; providing a set of quality metrics for assembly of a set of sequenced nucleic acids, the quality metrics having respective weights indicating importance, respective target values, and respective computed values calculated during assembly, a given quality metric depending on one or more of the parameters; initiating assembly of the set of sequenced nucleic acids by the assembler, the assembler utilizing the set of parameters and the set of quality metrics, thereby forming intermediate assembled sequences; performing a procedure iteratively until the respective computed values of the quality metrics equal the respective target values, the procedure comprising the steps of: i) stopping the assembler when the respective computed values of the quality metrics do not equal the respective target values, ii) modifying at least one of the parameters and/or at least one of the quality metrics, iii) deleting any intermediate assembled sequences, iv) initiating assembly of the set of sequenced nucleic acids by the assembler, and v) calculating values of the quality metrics including any modified quality metrics, the procedure terminating while utilizing a set of final parameters and a set of final quality metrics, the final parameters including a final k-mer parameter of length k′; and completing assembly of the set of sequenced nucleic acids using the assembler, the set of final parameters, and the set of final quality metrics, thereby forming one or more final assembled genomes; wherein the intermediate assembled sequences have sizes, a mean size, a median size, and a standard deviation of the sizes of the intermediate assembled sequences calculated during the assembly; wherein a given taxonomic rank contains reference genomes having sizes, a mean size, a median size, and a standard deviation of the sizes of the reference genomes; and wherein the assembly includes a quality metric for the difference between the mean size of the intermediate assembled sequences and the mean size of the reference genomes.

2. The method of claim 1, wherein the method comprises storing the final parameters and the final quality metrics with the one or more final assembled genomes.

3. The method of claim 1, wherein the method is performed by a computer system without human intervention.

4. The method of claim 1, wherein the intermediate assembled sequences are partially assembled genomes.

5. The method of claim 1, wherein the intermediate assembled sequences are wholly assembled genomes.

6. The method of claim 1, wherein the quality metrics include a member selected from the group consisting of N50, NA50, NG50, NGA50, L50, LA50, LG50, LGA50, and combinations thereof.

7. The method of claim 1, wherein the quality metrics include a member selected from the group consisting of number of contigs, number of contigs above a given size, number of contig edges, number of connections within contigs, and combinations thereof.

8. The method of claim 1, wherein the assembly includes a quality metric for the difference between the median size of the intermediate assembled sequences and the median size of the reference genomes.

9. The method of claim 5, wherein the assembly includes a quality metric for the difference between the standard deviation of the sizes of the intermediate assembled sequences and the standard deviation of the sizes of the reference genomes.

10. The method of claim 1, wherein the quality metrics include a member selected from the group consisting of coverages per nucleotide base, coverages per contig, coverages per assembly, number of misassembles, relative abundances of nucleotides, repetitive content of nucleotides, and combinations thereof.

11. The method of claim 1, wherein k′ is optimal for one of the final assembled genomes.

12. The method of claim 1, wherein k′ is an average of values of the k-mer parameter used in the procedure.

13. The method of claim 1, wherein the k-mer parameter has a lower priority score and/or lower ranking compared to another parameter.

14. The method of claim 13, wherein said another parameter is a member selected from the group consisting of coverage cutoff value, number of mismatches allowed, expected genome size, insert size, and sequencing error rate of the intermediate assembled sequences.

15. The method of claim 1, wherein k′ is optimal for a subset of the final assembled genomes, the subset comprising more than one final assembled genome.

16. The method of claim 15, wherein the subset is defined by taxonomic rank.

17. The method of claim 15, wherein the subset is defined by sequencing method.

18. The method of claim 17, wherein the sequencing method is a member selected from the group consisting of Sanger, Illumina, PacBio, 454, Ion Torrent, and SOLid.

19. A system comprising one or more computer processor circuits configured and arranged to: select a k-mer-based de novo genome assembler program, designated assembler; provide a set of parameters for assembly of a set of sequenced nucleic acids, the parameters having respective values and respective priority scores, the parameters including a k-mer parameter having an initial length k in number of nucleotides, k being a positive integer equal to at least 35; provide a set of quality metrics for assembly of a set of sequenced nucleic acids, the quality metrics having respective weights indicating importance, respective target values, and respective computed values calculated during assembly, a given quality metric depending on one or more of the parameters; initiate assembly of the set of sequenced nucleic acids by the assembler, the assembler utilizing the set of parameters and the set of quality metrics, thereby forming intermediate assembled sequences, wherein the intermediate assembled sequences are wholly assembled genomes; perform a procedure iteratively until the respective computed values of the quality metrics equal the respective target values, the procedure comprising the steps of: i) stopping the assembler when the respective computed values of the quality metrics do not equal the respective target values, ii) modifying at least one of the parameters and/or at least one of the quality metrics, iii) deleting any intermediate assembled sequences, iv) initiating assembly of the set of sequenced nucleic acids by the assembler, and v) calculating values of the quality metrics including any modified quality metrics, the procedure terminating while utilizing a set of final parameters and a set of final quality metrics, the final parameters including a final k-mer parameter of length k′; and complete assembly of the set of sequenced nucleic acids using the assembler, the set of final parameters, and the set of final quality metrics, thereby forming one or more final assembled genomes; wherein the assembly includes a quality metric for the difference between the standard deviation of the sizes of the intermediate assembled sequences and the standard deviation of the sizes of the reference genomes.

20. A computer program product, comprising a computer readable hardware storage device having a computer-readable pro
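
Two of the quality metrics named in the claims are easy to illustrate: N50 (claim 6) and the difference between the mean size of the intermediate assembled sequences and the mean size of the reference genomes (claim 1). The sketch below uses the common definition of N50 and assumes a sign convention of "assembled minus reference" for the size difference; the function names `n50` and `mean_size_difference` are hypothetical, and the patent text shown here does not specify how either value is computed.

```python
from statistics import mean
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length of the contig at which the running total, taken from the
    longest contig downward, first reaches half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def mean_size_difference(
    assembled_sizes: Sequence[int], reference_sizes: Sequence[int]
) -> float:
    """Mean assembled size minus mean reference size (sign is an assumption)."""
    return mean(assembled_sizes) - mean(reference_sizes)


contigs = [80, 70, 50, 40, 30, 20]  # made-up contig lengths
print(n50(contigs))
print(mean_size_difference([100, 200], [120, 160]))
```

For the made-up contig lengths, the total is 290 and the two longest contigs (80 + 70 = 150) already cover half of it, so N50 is 70. Either value could serve as one entry in the set of computed metrics that the iterative procedure compares against its targets.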

Assignees

Inventors

Classifications

  • G16B30/20 · Primary

    Sequence assembly · CPC title

  • Sequence alignment; Homology search · CPC title

  • Supervised data analysis · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11830581B2 cover?
An iterative process for optimizing one or more parameters used by a k-mer based de novo genome assembler program to assemble a set of sequenced nucleic acids is described. The method utilizes quality metrics whose desired values are initially specified. Computed values of the quality metrics are calculated during the assembly process and compared to the desired values. The assembly process sto…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G16B30/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 28, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).