Classification of nucleotide sequences by latent semantic analysis

US9659145B2 · US · B2

Patent metadata
Field                Value
Publication number   US-9659145-B2
Application number   US-201313954925-A
Country              US
Kind code            B2
Filing date          Jul 30, 2013
Priority date        Jul 30, 2012
Publication date     May 23, 2017
Grant date           May 23, 2017

Abstract

Official abstract text for this publication.

DNA sequences are analyzed using latent semantic analysis. A set of nucleotide sequences is received in which the set has a first number of sequences. A set of basis vectors is determined, in which the set has a second number of basis vectors, the second number being smaller than the first number. Each basis vector represents a specific combination of predetermined nucleotide segments. For each of the nucleotide sequences, an approximate representation of the nucleotide sequence is determined based on a combination of the basis vectors. For each pair of nucleotide sequences, a distance between the pair of nucleotide sequences is determined according to the distance between the approximate representations of the pair of nucleotide sequences. The set of nucleotide sequences is classified based on the distances between the pairs of nucleotide sequences.
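The pipeline in the abstract can be illustrated with a small sketch. It is not the patented implementation: it assumes the "predetermined nucleotide segments" are k-mers and that the basis vectors are the leading left singular vectors of a k-mer-by-sequence count matrix (both options are named in the dependent claims). The singular vectors are approximated here with a plain subspace iteration so that only the Python standard library is needed; the function names `kmer_matrix`, `basis_vectors`, and `coefficients`, the toy sequences, and the choice k = 2 are all hypothetical.

```python
import math
import random
from itertools import product


def kmer_matrix(sequences, k=2):
    """Return (kmers, A): rows of A are k-mers, columns are sequences,
    and A[i][j] counts how often k-mer i occurs in sequence j."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    row_of = {kmer: i for i, kmer in enumerate(kmers)}
    A = [[0.0] * len(sequences) for _ in kmers]
    for j, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            i = row_of.get(seq[pos:pos + k])  # windows with non-ACGT symbols are skipped
            if i is not None:
                A[i][j] += 1.0
    return kmers, A


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors):
    """Gram-Schmidt; vectors that become (near) zero are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [x - c * y for x, y in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-9:
            basis.append([x / norm for x in w])
    return basis


def basis_vectors(A, r, iterations=100, seed=0):
    """Approximate the r leading left singular vectors of A by
    subspace iteration on A * A^T (an assumption of this sketch)."""
    columns = [list(col) for col in zip(*A)]
    rng = random.Random(seed)
    Q = orthonormalize([[rng.random() for _ in A] for _ in range(r)])
    for _ in range(iterations):
        Z = []
        for q in Q:
            z = [0.0] * len(A)
            for col in columns:          # z = A * (A^T q), one column at a time
                weight = dot(col, q)
                for i, x in enumerate(col):
                    z[i] += weight * x
            Z.append(z)
        Q = orthonormalize(Z)
    return Q


def coefficients(A, basis):
    """Coordinates of every sequence (column of A) in the reduced space."""
    return [[dot(b, col) for b in basis] for col in zip(*A)]


# Toy data: four sequences (the "first number"), two basis vectors (the "second number").
sequences = ["ACGTACGTACGT", "ACGTACGTACGA", "GGGGCCCCGGGG", "GGGCCCCGGGGC"]
kmers, A = kmer_matrix(sequences, k=2)
basis = basis_vectors(A, r=2)
coords = coefficients(A, basis)
distances = [[math.dist(u, v) for v in coords] for u in coords]
```

In this toy example the two ACGT-rich sequences end up close together in the reduced space and far from the two GC-rich sequences, which is the property a later classification step would rely on.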

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
    receiving a first set of nucleotide sequences, the first set having a first number of nucleotide sequences, the first set of nucleotide sequences including a first portion and a second portion, the first portion including nucleotide sequences that belong to known species;
    determining, by a data processor, a set of basis vectors, the set having a second number of basis vectors, in which the second number is smaller than the first number, the second number is equal to or larger than two, and each basis vector represents a specific combination of predetermined nucleotide segments;
    for each of the first set of nucleotide sequences, determining an approximate representation of the nucleotide sequence based on a combination of the basis vectors;
    for each pair of a plurality of pairs of nucleotide sequences, determining distances between the pair of nucleotide sequences according to distances between the approximate representations of the pair of nucleotide sequences;
    classifying the first set of nucleotide sequences based on the distances between the pairs of nucleotide sequences;
    for each nucleotide sequence in the second portion, determining whether the nucleotide sequence is associated with one of the known species based on the classification of the first set of nucleotide sequences; and
    generating, by the data processor, an output having information about, for each of those nucleotide sequences in the second portion that are associated with known species, which one of the known species is associated with the nucleotide sequence.

2. The method of claim 1 in which the first portion of the first set of nucleotide sequences belong to known species of at least one of prokaryotes, eukaryotes, or viruses, the second portion of the first set of nucleotide sequences are obtained from a patient, and the method comprises, for each nucleotide sequence in the second portion, determining whether the nucleotide sequence is associated with one of the known species of the at least one of prokaryotes, eukaryotes, or viruses based on the classification of the first set of nucleotide sequences.

3. The method of claim 2, comprising generating an output having information that indicates, for each nucleotide sequence in the second portion, which one of the known species of the at least one of prokaryotes, eukaryotes, or viruses, if any, is associated with the nucleotide sequence.

4. The method of claim 1 in which the predetermined nucleotide segments are k-mers each having k nucleobases, k being a positive integer, and each basis vector represents a specific combination of the k-mers.

5. The method of claim 4 in which determining a set of basis vectors comprises forming a k-mer-sequence matrix in which rows of the matrix represent the k-mers and columns of the matrix represent the nucleotide sequences, k being a positive integer, and each element in the matrix represents a repetition frequency of the segment represented by the corresponding row within the sequence represented by the corresponding column, and applying a dimension reduction process to the k-mer-sequence matrix to determine the basis vectors.

6. The method of claim 5 in which applying a dimension reduction process comprises applying at least one of non-negative matrix factorization or singular value decomposition to the segment-sequence matrix to determine the basis vectors.

7. The method of claim 1 in which determining a set of basis vectors comprises forming a segment-sequence matrix in which rows of the matrix represent the nucleotide segments and columns of the matrix represent the sequences, each element in the matrix representing a repetition frequency of the segment represented by the corresponding row within the sequence represented by the corresponding column, and applying a dimension reduction process to the segment-sequence matrix to determine the basis vectors.

8. The method of claim 7 in which applying a dimension reduction process comprises applying at least one of non-negative matrix factorization or singular value decomposition to the segment-sequence matrix to determine the basis vectors.

9. The method of claim 1 in which determining an approximate representation of the nucleotide sequence based on a combination of the basis vectors comprises determining an approximate representation of the nucleotide sequence based on a linear combination of the basis vectors.

10. The method of claim 1 in which determining an approximate representation of the nucleotide sequence comprises determining coefficients for a linear combination of the basis vectors that represents an approximation of the nucleotide sequence.

11. The method of claim 1 in which the distance between the approximate representations of the pair of nucleotide sequences is determined according to at least one of (i) Euclidean distance between the approximate representations of the pair of nucleotide sequences or (ii) correlation between the approximate representations of the pair of nucleotide sequences.

12. The method of claim 1, comprising determining the distance between every pair of nucleotide sequences, and classifying the first set of nucleotide sequences based on the distances between all of the pairs of nucleotide sequences.

13. The method of claim 1 in which species of the second portion of the first set of nucleotide sequences are initially unknown.

14. The method of claim 1, comprising generating a phylogenetic tree for the first set of nucleotide sequences based on the classification of the first set of nucleotide sequences.

15. The method of claim 1, comprising determining whether one or more of the first set of nucleotide sequences are associated with pathogenic species based on the classification of the first set of nucleotide sequences, and generating an output having information about which one or more of the first set of nucleotide sequences are associated with pathogenic species.

16. The method of claim 1, comprising determining which nucleotide sequences are associated with low risk species, and which nucleotide sequences are associated with high risk species, based on the classification of the first set of nucleotide sequences, and generating an output indicating which nucleotide sequences are associated with low risk species, and which nucleotide sequences are associated with high risk species.

17. The method of claim 1, comprising receiving a second set of nucleotide sequences that includes nucleotide sequences from a host and nucleotide sequences from a plurality of known species different from the host, classifying the second set of nucleotide sequences based on the distances between the pairs of nucleotide sequences, and identifying nucleotide sequences that are primarily associated with the host based on the classification of the first and second sets of nucleotide sequences.

18. The method of claim 17, comprising removing, from the second set of nucleotide sequences, nucleotide sequences that are primarily associated with the host.

19. The method of claim 18, comprising generating an output having information about the nucleotide sequences remaining in the second set of nucleotide sequences after the nucleotide sequences primarily associated with the host have been removed.

20. The method of claim 17 in which the second set of nucleotide sequences comprises a second set of 16S ribosomal RNA sequences.

21. The method of claim 1, comprising obtaining a sample from an animal or a human, and genera
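The distance and association steps in the claims (Euclidean distance or correlation between reduced representations, then deciding which known species, if any, an unknown sequence belongs to) can be sketched as follows. The claims do not say how the classification turns distances into a species call, so the nearest-labelled-neighbour rule with a cutoff used here is an assumption made for illustration only. The names `assign_species` and `cutoff`, and the example coordinates, are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two reduced representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_distance(u, v):
    """1 minus the Pearson correlation; 0 means perfectly correlated."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    denom = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    if denom == 0:
        return 1.0  # correlation is undefined for a constant vector; treat as uncorrelated
    return 1.0 - sum(a * b for a, b in zip(du, dv)) / denom


def assign_species(known, unknown, distance=euclidean, cutoff=1.0):
    """known: list of (species, coords) pairs for the first portion.
    unknown: dict mapping a sequence id to coords for the second portion.
    Returns a dict mapping each id to the nearest known species, or None
    when nothing lies within the cutoff (assumed rule, not from the patent)."""
    calls = {}
    for seq_id, coords in unknown.items():
        best_species, best_distance = None, float("inf")
        for species, ref in known:
            d = distance(coords, ref)
            if d < best_distance:
                best_species, best_distance = species, d
        calls[seq_id] = best_species if best_distance <= cutoff else None
    return calls


known = [("species_A", [1.0, 0.0, 0.2]), ("species_B", [0.0, 1.0, 0.1])]
unknown = {"read_1": [0.9, 0.1, 0.2], "read_2": [5.0, 5.0, 5.0]}
calls = assign_species(known, unknown)
```

Here `read_1` is assigned to `species_A`, while `read_2` is left unassigned because no known sequence lies within the cutoff, which mirrors the "if any" wording of claim 3.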

Assignees

Inventors

Classifications

  • G06F19/22 · Primary

    Physics · mapped topic

  • Physics · mapped topic

  • G16B30/00 · Primary

    ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9659145B2 cover?
DNA sequences are analyzed using latent semantic analysis. A set of nucleotide sequences is received in which the set has a first number of sequences. A set of basis vectors is determined, in which the set has a second number of basis vectors, the second number being smaller than the first number. Each basis vector represents a specific combination of predetermined nucleotide segments. For each…
Who is the assignee on this patent?
Sayood Khalid, Way Sam, Nalbantoglu Ozkan Ufuk, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06F19/22. Mapped technology areas include Physics.
When was this patent published?
Publication date May 23, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).