Methods and systems for large scale scaffolding of genome assemblies

US2016239602A1 · US · A1

Patent metadata
Field · Value
Publication number · US-2016239602-A1
Application number · US-201415024990-A
Country · US
Kind code · A1
Filing date · Sep 27, 2014
Priority date · Sep 27, 2013
Publication date · Aug 18, 2016
Grant date · (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Computational methods used for large scale scaffolding of a genome assembly are provided. Such methods may include a step of applying a location clustering model to a test set of contigs to form two or more location cluster groups, each location cluster group comprising one or more location-clustered contigs; a step of applying an ordering model to each of the two or more location cluster groups to form an ordered set of one or more location-clustered contigs within each cluster group; and a step of applying an orienting model to each ordered set of one or more location-clustered contigs to assign a relative orientation to each of the location-clustered contigs within each location cluster group. In some aspects, the test set of contigs are generated from aligning a set of reads generated by a chromosome conformation analysis technique (e.g., Hi-C) with a draft assembly, a reference assembly, or both.
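The first step in the abstract, location clustering, can be illustrated with a small sketch. The average-linkage agglomerative approach follows the wording of claim 2; everything else here is an assumption made for illustration and is not taken from the patent: the function name `average_linkage_clusters`, the toy contig names and link counts, and the choice to normalize link counts by the product of contig lengths to get a "link density".

```python
from itertools import combinations


def average_linkage_clusters(links, lengths, n_clusters):
    """Merge contigs bottom-up until n_clusters groups remain.

    links:   dict mapping frozenset({a, b}) -> count of read pairs linking a and b
    lengths: dict mapping contig name -> contig length
    """
    # Start with every contig in its own cluster.
    clusters = [frozenset([c]) for c in sorted(lengths)]

    def density(a, b):
        # Assumed normalization: link count divided by the product of lengths.
        n = links.get(frozenset((a, b)), 0)
        return n / (lengths[a] * lengths[b])

    def avg_link(c1, c2):
        # Average-linkage: mean density over all cross-cluster contig pairs.
        total = sum(density(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the highest average link density.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_link(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical toy data: A, B, C share many links; D and E share many links.
toy_lengths = {"A": 10, "B": 10, "C": 10, "D": 10, "E": 10}
toy_links = {
    frozenset("AB"): 50, frozenset("BC"): 40, frozenset("AC"): 20,
    frozenset("DE"): 60, frozenset("CD"): 2, frozenset("AE"): 1,
}

groups = average_linkage_clusters(toy_links, toy_lengths, n_clusters=2)
print(sorted(sorted(g) for g in groups))
```

In this toy case the number of groups is passed in directly, on the assumption that the expected chromosome count is known in advance. The quadratic search over cluster pairs is fine for a handful of contigs but would need a more efficient structure for a real assembly.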

First claim

Opening claim text (preview).

1. A method performed by a computing system for large scale scaffolding of a genome assembly comprising: applying a location clustering model to a test set of contigs to form two or more location cluster groups, each location cluster group comprising one or more location-clustered contigs; applying an ordering model to each of the two or more location cluster groups to form an ordered set of one or more location-clustered contigs within each cluster group; and applying an orienting model to each ordered set of one or more location-clustered contigs to assign a relative orientation to each of the location-clustered contigs within each location cluster group; wherein the test set of contigs are generated from aligning a set of reads generated by a chromosome conformation analysis technique with a draft assembly, a reference assembly, or both.

2. The method of claim 1, wherein the location clustering model comprises building a graph and applying a hierarchical agglomerative clustering algorithm with an average-linkage metric to calculate a link density between each of the contigs of the test set.

3. The method of claim 1, wherein the two or more location cluster groups are two or more chromosome groups, each chromosome group comprising one or more contigs derived from the same chromosome.

4. The method of claim 1, wherein the ordering model comprises building a graph and calculating a minimum spanning tree.

5. The method of claim 1, wherein the orienting model comprises building a graph and calculating an orientation quality score for each location-clustered contig, and wherein the graph is optionally a weighted directed acyclic graph (WDAG).

6. (canceled)

7. The method of claim 1, wherein the chromosome conformation analysis technique is Chromatin Conformation Capture (3C), Circularized Chromatin Conformation Capture (4C), Carbon Copy Chromosome Conformation Capture (5C), Chromatin Immunoprecipitation (ChIP), ChIP-Loop, Hi-C, combined 3C-ChIP-cloning (6C), or Capture-C.

8. The method of claim 1, further comprising, prior to applying a location clustering model, applying a species clustering model to a heterogeneous set of contigs to form two or more species cluster groups, each species cluster group comprising one or more species-clustered contigs from a single species; wherein the heterogeneous set of contigs are generated from aligning a set of reads generated by a chromosome conformation analysis technique with a metagenome assembly, and wherein the one or more species-clustered contigs are used as the test set of contigs.

9. A system for performing large scale scaffolding of a genome assembly comprising: a computer readable storage medium which stores computer-executable instructions comprising instructions for applying a location clustering model to a test set of contigs to form two or more location cluster groups, each location cluster group comprising one or more location-clustered contigs; instructions for applying an ordering model to each of the two or more location cluster groups to form an ordered set of one or more location-clustered contigs within each cluster group; and instructions for applying an orienting model to each ordered set of one or more location-clustered contigs to assign a relative orientation to each of the location-clustered contigs within each location cluster group; wherein the test set of contigs are generated from aligning a set of reads generated by a chromosome conformation analysis technique with a draft assembly, a reference assembly, or both. a processor which is configured to perform steps comprising receiving a set of input files which comprise a file comprising the set of reads generated by a chromosome conformation analysis technique; and the draft assembly, reference assembly, or both; executing the computer-executable instructions stored in the computer-readable storage medium.

10. The system of claim 9, wherein the location clustering model comprises building a graph and applying a hierarchical agglomerative clustering algorithm with an average-linkage metric to calculate a link density between each of the contigs of the test set.

11. The system of claim 9, wherein the two or more location cluster groups are two or more chromosome groups, each chromosome group comprising one or more contigs derived from the same chromosome.

12. The system of claim 9, wherein the ordering model comprises building a graph and calculating a minimum spanning tree.

13. The system of claim 9, wherein the orienting model comprises building a graph and calculating an orientation quality score for each location-clustered contig, and wherein the graph is optionally a weighted directed acyclic graph (WDAG).

14. (canceled)

15. The system of claim 9, wherein the chromosome conformation analysis technique is Chromatin Conformation Capture (3C), Circularized Chromatin Conformation Capture (4C), Carbon Copy Chromosome Conformation Capture (5C), Chromatin Immunoprecipitation (ChIP), ChIP-Loop, Hi-C, combined 3C-ChIP-cloning (6C), or Capture-C.

16. The system of claim 9, wherein the computer-executable instructions further comprises instructions for applying a species clustering model to a heterogeneous set of contigs to form two or more species cluster groups, each species cluster group comprising one or more species-clustered contigs from a single species; wherein the heterogeneous set of contigs are generated from aligning a set of reads generated by a chromosome conformation analysis technique with a metagenome assembly, and wherein the one or more species-clustered contigs are used as the test set of contigs in the instructions for applying a location clustering model.

17. A computer readable storage medium which stores computer-executable instructions comprising: instructions for applying a location clustering model to a test set of contigs to form two or more location cluster groups, each location cluster group comprising one or more location-clustered contigs; instructions for applying an ordering model to each of the two or more location cluster groups to form an ordered set of one or more location-clustered contigs within each cluster group; instructions for applying an orienting model to each ordered set of one or more location-clustered contigs to assign a relative orientation to each of the location-clustered contigs within each location cluster group; and instructions for applying a species clustering model to a heterogeneous set of contigs to form two or more species cluster groups, each species cluster group comprising one or more species-clustered contigs from a single species; wherein the test set of contigs are generated from aligning a set of reads generated by a chromosome conformation analysis technique with a draft assembly, a reference assembly, or both; wherein the heterogeneous set of contigs are generated from aligning a set of reads generated by a chromosome conformation analysis technique with a metagenome assembly, and wherein the one or more species-clustered contigs are used as the test set of contigs in the instructions for applying a location clustering model.

18. The computer readable storage medium of claim 17, wherein the location clustering model comprises building a graph and applying a hierarchical agglomerative clustering algorithm with an average-linkage metric to calculate a link density between each of the contigs of the test set.

19. The computer readable storage medium of claim 17, wherein the two or mo
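The ordering step named in the claims (building a graph and calculating a minimum spanning tree) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: edge distances are taken as the inverse of the link count, the contig order is read off the longest path through the tree, and all names (`minimum_spanning_tree`, `farthest_path`, `order_contigs`, the contigs `c1` to `c4` and their link counts) are hypothetical.

```python
def minimum_spanning_tree(nodes, weights):
    """Kruskal's algorithm. weights: dict frozenset({a, b}) -> distance."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = {n: [] for n in nodes}
    # Visit edges from shortest to longest; ties are broken by contig name.
    for edge, _ in sorted(weights.items(), key=lambda kv: (kv[1], sorted(kv[0]))):
        a, b = sorted(edge)
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge joins two separate components, so keep it
            parent[ra] = rb
            tree[a].append(b)
            tree[b].append(a)
    return tree


def farthest_path(tree, start):
    """Return the path from start to the node farthest away by hop count."""
    best = [start]
    stack = [(start, [start])]
    seen = {start}
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    return best


def order_contigs(nodes, links):
    # Assumed distance: contigs with more links are treated as closer together.
    weights = {e: 1.0 / n for e, n in links.items() if n > 0}
    tree = minimum_spanning_tree(nodes, weights)
    # Two sweeps find the longest path in a tree: walk to one end, then back.
    end = farthest_path(tree, nodes[0])[-1]
    return farthest_path(tree, end)


# Hypothetical toy data: link counts fall off with distance along c1-c2-c3-c4.
toy_nodes = ["c3", "c1", "c4", "c2"]
toy_links = {
    frozenset(["c1", "c2"]): 90, frozenset(["c2", "c3"]): 80,
    frozenset(["c3", "c4"]): 85, frozenset(["c1", "c3"]): 30,
    frozenset(["c2", "c4"]): 25, frozenset(["c1", "c4"]): 5,
}

print(order_contigs(toy_nodes, toy_links))
```

The result is an order only up to reversal, since a path through the tree has no preferred direction. Contigs that fall on side branches of the tree are left out of the path in this sketch and would need a separate placement step.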

Assignees

  • Univ Washington

Inventors

Classifications

  • ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • Physics · mapped topic

  • G06F19/12 · Primary

    Physics · mapped topic

  • Sequence assembly · CPC title

  • ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016239602A1 cover?
Computational methods used for large scale scaffolding of a genome assembly are provided. Such methods may include a step of applying a location clustering model to a test set of contigs to form two or more location cluster groups, each location cluster group comprising one or more location-clustered contigs; a step of applying an ordering model to each of the two or more location cluster group…
Who is the assignee on this patent?
Univ Washington
What technology area does this patent fall under?
Primary CPC classification G06F19/12. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 18, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).