Who is the assignee on this patent?

Sayood Khalid, Way Sam, Nalbantoglu Ozkan Ufuk, and 3 more

What technology area does this patent fall under?

Primary CPC classification G06F19/22. Mapped technology areas include Physics.

When was this patent published?

Publication date May 23, 2017 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Classification of nucleotide sequences by latent semantic analysis

US9659145B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9659145-B2
Application number	US-201313954925-A
Country	US
Kind code	B2
Filing date	Jul 30, 2013
Priority date	Jul 30, 2012
Publication date	May 23, 2017
Grant date	May 23, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

DNA sequences are analyzed using latent semantic analysis. A set of nucleotide sequences is received in which the set has a first number of sequences. A set of basis vectors is determined, in which the set has a second number of basis vectors, the second number being smaller than the first number. Each basis vector represents a specific combination of predetermined nucleotide segments. For each of the nucleotide sequences, an approximate representation of the nucleotide sequence is determined based on a combination of the basis vectors. For each pair of nucleotide sequences, a distance between the pair of nucleotide sequences is determined according to the distance between the approximate representations of the pair of nucleotide sequences. The set of nucleotide sequences is classified based on the distances between the pairs of nucleotide sequences.
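The pipeline in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the toy sequences, the choice of k = 2, the rank of 2, and all function names are assumptions made for illustration, and singular value decomposition (one of the two dimension-reduction options named in the claims) is used via NumPy.

```python
from itertools import product

import numpy as np


def kmer_matrix(seqs, k=2):
    """Rows are k-mers, columns are sequences, entries are occurrence counts."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    m = np.zeros((len(kmers), len(seqs)))
    for j, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            row = index.get(s[i:i + k])  # k-mers with non-ACGT symbols are skipped
            if row is not None:
                m[row, j] += 1
    return m


def reduced_representation(m, rank):
    """Truncated SVD: basis vectors in k-mer space plus per-sequence coefficients."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    basis = u[:, :rank]                     # each column combines k-mers
    coeffs = s[:rank, None] * vt[:rank, :]  # shape (rank, n_sequences)
    return basis, coeffs


def pairwise_distances(coeffs):
    """Euclidean distance between every pair of coefficient columns."""
    diff = coeffs[:, :, None] - coeffs[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))


# Hypothetical toy input: two pairs of near-identical sequences.
seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "TTTTGGGGTTTT", "TTTTGGGCTTTT"]
m = kmer_matrix(seqs, k=2)
basis, coeffs = reduced_representation(m, rank=2)
dist = pairwise_distances(coeffs)
```

Because the SVD basis columns are orthonormal, the distance between two coefficient vectors equals the distance between the corresponding approximate representations `basis @ coeffs`, so the comparison can be done entirely in the reduced space.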

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: receiving a first set of nucleotide sequences, the first set having a first number of nucleotide sequences, the first set of nucleotide sequences including a first portion and a second portion, the first portion including nucleotide sequences that belong to known species; determining, by a data processor, a set of basis vectors, the set having a second number of basis vectors, in which the second number is smaller than the first number, the second number is equal to or larger than two, and each basis vector represents a specific combination of predetermined nucleotide segments; for each of the first set of nucleotide sequences, determining an approximate representation of the nucleotide sequence based on a combination of the basis vectors; for each pair of a plurality of pairs of nucleotide sequences, determining distances between the pair of nucleotide sequences according to distances between the approximate representations of the pair of nucleotide sequences; classifying the first set of nucleotide sequences based on the distances between the pairs of nucleotide sequences; for each nucleotide sequence in the second portion, determining whether the nucleotide sequence is associated with one of the known species based on the classification of the first set of nucleotide sequences; and generating, by the data processor, an output having information about, for each of those nucleotide sequences in the second portion that are associated with known species, which one of the known species is associated with the nucleotide sequence.
2. The method of claim 1 in which the first portion of the first set of nucleotide sequences belong to known species of at least one of prokaryotes, eukaryotes, or viruses, the second portion of the first set of nucleotide sequences are obtained from a patient, and the method comprises, for each nucleotide sequence in the second portion, determining whether the nucleotide sequence is associated with one of the known species of the at least one of prokaryotes, eukaryotes, or viruses based on the classification of the first set of nucleotide sequences.
3. The method of claim 2, comprising generating an output having information that indicates, for each nucleotide sequence in the second portion, which one of the known species of the at least one of prokaryotes, eukaryotes, or viruses, if any, is associated with the nucleotide sequence.
4. The method of claim 1 in which the predetermined nucleotide segments are k-mers each having k nucleobases, k being a positive integer, and each basis vector represents a specific combination of the k-mers.
5. The method of claim 4 in which determining a set of basis vectors comprises forming a k-mer-sequence matrix in which rows of the matrix represent the k-mers and columns of the matrix represent the nucleotide sequences, k being a positive integer, and each element in the matrix represents a repetition frequency of the segment represented by the corresponding row within the sequence represented by the corresponding column, and applying a dimension reduction process to the k-mer-sequence matrix to determine the basis vectors.
6. The method of claim 5 in which applying a dimension reduction process comprises applying at least one of non-negative matrix factorization or singular value decomposition to the segment-sequence matrix to determine the basis vectors.
7. The method of claim 1 in which determining a set of basis vectors comprises forming a segment-sequence matrix in which rows of the matrix represent the nucleotide segments and columns of the matrix represent the sequences, each element in the matrix representing a repetition frequency of the segment represented by the corresponding row within the sequence represented by the corresponding column, and applying a dimension reduction process to the segment-sequence matrix to determine the basis vectors.
8. The method of claim 7 in which applying a dimension reduction process comprises applying at least one of non-negative matrix factorization or singular value decomposition to the segment-sequence matrix to determine the basis vectors.
9. The method of claim 1 in which determining an approximate representation of the nucleotide sequence based on a combination of the basis vectors comprises determining an approximate representation of the nucleotide sequence based on a linear combination of the basis vectors.
10. The method of claim 1 in which determining an approximate representation of the nucleotide sequence comprises determining coefficients for a linear combination of the basis vectors that represents an approximation of the nucleotide sequence.
11. The method of claim 1 in which the distance between the approximate representations of the pair of nucleotide sequences is determined according to at least one of (i) Euclidean distance between the approximate representations of the pair of nucleotide sequences or (ii) correlation between the approximate representations of the pair of nucleotide sequences.
12. The method of claim 1, comprising determining the distance between every pair of nucleotide sequences, and classifying the first set of nucleotide sequences based on the distances between all of the pairs of nucleotide sequences.
13. The method of claim 1 in which species of the second portion of the first set of nucleotide sequences are initially unknown.
14. The method of claim 1, comprising generating a phylogenetic tree for the first set of nucleotide sequences based on the classification of the first set of nucleotide sequences.
15. The method of claim 1, comprising determining whether one or more of the first set of nucleotide sequences are associated with pathogenic species based on the classification of the first set of nucleotide sequences, and generating an output having information about which one or more of the first set of nucleotide sequences are associated with pathogenic species.
16. The method of claim 1, comprising determining which nucleotide sequences are associated with low risk species, and which nucleotide sequences are associated with high risk species, based on the classification of the first set of nucleotide sequences, and generating an output indicating which nucleotide sequences are associated with low risk species, and which nucleotide sequences are associated with high risk species.
17. The method of claim 1, comprising receiving a second set of nucleotide sequences that includes nucleotide sequences from a host and nucleotide sequences from a plurality of known species different from the host, classifying the second set of nucleotide sequences based on the distances between the pairs of nucleotide sequences, and identifying nucleotide sequences that are primarily associated with the host based on the classification of the first and second sets of nucleotide sequences.
18. The method of claim 17, comprising removing, from the second set of nucleotide sequences, nucleotide sequences that are primarily associated with the host.
19. The method of claim 18, comprising generating an output having information about the nucleotide sequences remaining in the second set of nucleotide sequences after the nucleotide sequences primarily associated with the host have been removed.
20. The method of claim 17 in which the second set of nucleotide sequences comprises a second set of 16S ribosomal RNA sequences.
21. The method of claim 1, comprising obtaining a sample from an animal or a human, and genera
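Claims 1 and 11 describe deciding, from distances between reduced representations, whether a sequence of unknown origin is associated with a known species. The claims do not prescribe a particular decision rule, so the sketch below assumes a simple nearest-neighbour rule with a distance cutoff. The function names, the `max_dist` threshold, the species labels, and the coefficient values are all hypothetical.

```python
import numpy as np


def euclidean(a, b):
    """Euclidean distance between two coefficient vectors."""
    return float(np.linalg.norm(a - b))


def correlation_distance(a, b):
    """1 minus the Pearson correlation; 0 means perfectly correlated."""
    return float(1.0 - np.corrcoef(a, b)[0, 1])


def assign_species(known, labels, unknown, max_dist, dist=euclidean):
    """Assumed rule: take the label of the nearest known sequence,
    or None when even the nearest one is farther than max_dist."""
    assigned = []
    for u in unknown:
        d = [dist(u, k) for k in known]
        best = int(np.argmin(d))
        assigned.append(labels[best] if d[best] <= max_dist else None)
    return assigned


# Hypothetical reduced representations (one row per sequence).
known = np.array([[5.0, 0.1, 0.2], [0.2, 6.0, 0.1]])
labels = ["species_A", "species_B"]
unknown = np.array([[4.8, 0.3, 0.1], [0.1, 5.5, 0.3], [9.0, 9.0, 9.0]])

result = assign_species(known, labels, unknown, max_dist=1.0)
```

Passing `dist=correlation_distance` switches to the correlation-based option in claim 11; the cutoff would then need to be chosen on that scale (0 to 2) rather than in Euclidean units.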

Classifications

G06F19/22 (Primary): Physics · mapped topic
G06F19/24: Physics · mapped topic
G16B30/00 (Primary): ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title
G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title

Patent family

Related publications grouped by family.

View patent family 50028483
