Method, system and software arrangement for detecting or determining similarity regions between datasets

US9390163B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9390163-B2
Application number  US-41069206-A
Country             US
Kind code           B2
Filing date         Apr 24, 2006
Priority date       Apr 22, 2005
Publication date    Jul 12, 2016
Grant date          Jul 12, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and computer-readable media are provided which can identify and provide local variations in regions of similarity among two or more data sets. These data sets may be represented as sequences such as, e.g., genomic sequences or words in a text. The local variations in similarity levels can be provided by selecting an initial prior distribution relating the data sets, organizing the first data set into windows and the remaining data sets into blocks, using the priors to sample one or more sets of words from the first data set, computing a similarity curve from exact and inexact matches for these words and, if convergence of results is not achieved, computing a new set of priors and repeating the sampling and computation of similarity curves. The computations can be performed using an amount of computational time that is linearly proportional to the size of the data sets.
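The loop described in the abstract can be illustrated with a small sketch. This is a simplified reading, not the patented implementation, and everything named here is an assumption made for illustration: the function names, the Beta(1, 1) prior per window, the use of posterior variance to steer extra samples, substitution-only inexact matches, and all default parameters. The second data set is indexed as a hash set of k-letter words; words are sampled from windows of the first data set; each window's similarity estimate is updated from exact and one-substitution matches until the curve stops changing.

```python
import random


def build_index(text, k):
    """Hash set of every k-letter word in `text` (the 'blocks' side)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def has_match(word, index, alphabet):
    """True on an exact match or a match with one substituted letter."""
    if word in index:
        return True
    for i, c in enumerate(word):
        for a in alphabet:
            if a != c and word[:i] + a + word[i + 1:] in index:
                return True
    return False


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def similarity_curve(first, second, window=20, k=8, extra=8,
                     max_rounds=10, tol=0.01, seed=0):
    """Estimate one similarity value per window of `first` against `second`."""
    rng = random.Random(seed)
    alphabet = sorted(set(first) | set(second))
    index = build_index(second, k)
    starts = list(range(0, len(first) - window + 1, window))
    if not starts:
        return []
    # Assumed prior: Beta(1, 1) per window, stored as [hits + 1, misses + 1].
    post = [[1.0, 1.0] for _ in starts]
    prev = [0.5] * len(starts)
    curve = prev
    for _ in range(max_rounds):
        # Every window gets one sample; extra samples go to the windows
        # whose current estimate is most uncertain.
        weights = [beta_variance(a, b) for a, b in post]
        picks = list(range(len(starts)))
        picks += rng.choices(range(len(starts)), weights=weights, k=extra)
        for w in picks:
            pos = starts[w] + rng.randrange(window - k + 1)
            word = first[pos:pos + k]
            if has_match(word, index, alphabet):
                post[w][0] += 1
            else:
                post[w][1] += 1
        curve = [a / (a + b) for a, b in post]  # posterior means
        if max(abs(c - p) for c, p in zip(curve, prev)) < tol:
            break  # convergence: the curve barely moved this round
        prev = curve
    return curve
```

Calling `similarity_curve(first, second)` returns one value between 0 and 1 per window of `first`. Each sampled word costs a bounded number of hash lookups, which is the kind of property a linear-time bound would rely on, though this sketch makes no claim to reproduce the patent's analysis.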

First claim

Opening claim text (preview).

What is claimed is:

1. A method for obtaining information associated with a sequence similarity across a plurality of regions among a plurality of strings, comprising:
    (a) receiving a particular string and at least one further string, wherein at least one of the particular string or the at least one further string is ordered;
    (b) organizing the particular string into the regions;
    (c) generating a plurality of subregions corresponding to a sample of the regions;
    (d) obtaining particular information related to at least one of an initially unknown or guessed-at predetermined statistical distribution of local variations in the sequence similarity;
    (e) using a processing arrangement, automatically determining local variations in the sequence similarity across the subregions with respect to the at least one further string based on the particular information;
    (f) using the processing arrangement, modifying the particular information based on the determined local variations in sequence similarity;
    (g) repeating procedures (e) and (f) until a preselected criterion is met; and
    (h) at least one of storing the particular information in a storage arrangement in at least one of a user-accessible format or a user-readable format or displaying the particular information on a hardware display,
    wherein the particular information is provided in a form of a sequence similarity curve that includes a highest similarity of a particular subregion of the subregions at a particular position in the particular string compared to a further subregion of all positions in the at least one further string, and wherein a computational time utilized to perform procedures (e) to (g) is linearly proportional to a size of at least one of the particular string or the at least one further string.
2. The method according to claim 1, wherein the particular string and the at least one further string have a form of a linear sequence.
3. The method according to claim 1, wherein the particular information is domain-dependent.
4. The method according to claim 3, wherein the particular information is based on at least one of (i) an independent string that is different from the particular string and the at least one further string, or (ii) a statistical model related to at least one of the particular string or the at least one further string.
5. The method according to claim 4, wherein the determining procedure further comprises the subprocedure of verifying the local variations in sequence similarity based on the computational time.
6. The method according to claim 5, wherein the verifying subprocedure comprises comparing the local variations in sequence similarity and the particular information.
7. The method according to claim 1, further comprising providing regions of predetermined sequence similarity in the particular string with respect to the at least one further string.
8. The method according to claim 1, wherein the sequence similarity is determined based on a number of exact matches and inexact matches of the subregions with respect to the at least one further string.
9. The method according to claim 8, wherein the subregions comprise at least one of words or mers.
10. The method according to claim 1, wherein the particular information is provided in a further form of at least one of a statistical prior distribution, a hierarchal prior model, a uniform distribution, or a homology curve.
11. The method according to claim 1, wherein the particular information is based on a statistical model of at least one of edit operations or mutations.
12. The method according to claim 1, wherein the particular subregions are selected based on the particular information.
13. The method according to claim 12, wherein the determining procedure further comprises detecting exact matches and inexact matches of the particular subregions with respect to subregions of the at least one further string.
14. The method according to claim 12, wherein the particular subregions of the particular string have the form of a set of windows.
15. The method according to claim 14, wherein the particular subregions are arranged into subgroups, and wherein the determining procedure comprises comparing each particular subregion associated with the subgroups to further subregions of the at least one further string to detect at least one of (i) an exact match, (ii) an inexact match comprising a single error, or (iii) an inexact match comprising two errors between the particular subregion and the further subregions.
16. The method according to claim 14, wherein subregions of the at least one further string have a form of a set of blocks.
17. The method according to claim 16, wherein the set of blocks has a form of at least one of a hash table or a suffix array.
18. The method according to claim 1, wherein the modifying procedure comprises using at least one of a Bayesian technique, a Bayesian technique together with a boosting technique, or a Bayesian estimator.
19. The method according to claim 1, wherein the determining procedure is performed using a Bayesian estimator.
20. The method according to claim 1, wherein the determining procedure comprises associating at least one of a mean, a standard deviation or a confidence data with each of the regions.
21. The method according to claim 1, wherein the determining procedure further comprises the subprocedure of generating a local alignment data based on the local variations in the sequence similarity.
22. The method according to claim 1, wherein the preselected criterion comprises at least one of a total computational time, a total number of executions of the repeating procedure, or a preselected change in the determined local variations in sequence similarity between iterations of procedures (e)-(g).
23. The method according to claim 1, wherein the particular string and the at least one further string have a form of entire genome sequences.
24. The method according to claim 1, wherein the particular string and the at least one further string have a form of at least one of a webpage code listing, a weblog, or a computer code listing.
25. The method according to claim 1, wherein the particular string and the at least one further string have a form of at least one of a natural-language text, a times-series string, speech, an image, or video data.
26. The method according to claim 1, wherein the particular string comprises at least one of a genome sequence of nucleotides, a short sequence of amino acids, natural-language text, natural-generated text, computer-generated text, hypertext, a weblog, blog, network log, or computer scripts, source code or object code.
27. The method according to claim 1, wherein the sequence similarity curve further includes the highest similarity of the particular subregion at at least one further particular position in the particular string compared to the particular subregion of all of the positions in the at least one further string.
28. The method according to claim 1, wherein the particular subregion has a particular length.
29. A system for determining information associated with a sequence similarity across a plurality of regions among a plurality of strings, comprising: a non-transitory computer-readable medium which includes thereon a set of instructions, wherein the set of instructions are configured to program
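As an illustration of the matching step in claims 15 to 17, the sketch below looks a word up in a hash table built from the further string and reports whether it matches exactly, with a single error, or with two errors. It is a minimal example under stated assumptions, not the claimed method: the names `build_block_index`, `neighbors`, and `classify_match` are hypothetical, and an "error" is taken to mean a single-letter substitution only, whereas the patent text also mentions edit operations more generally.

```python
from itertools import combinations, product


def build_block_index(text, k):
    """Hash table mapping each k-letter word of `text` to its start positions."""
    index = {}
    for i in range(len(text) - k + 1):
        index.setdefault(text[i:i + k], []).append(i)
    return index


def neighbors(word, alphabet, errors):
    """Yield every word that differs from `word` in exactly `errors` positions."""
    for positions in combinations(range(len(word)), errors):
        choices = [[a for a in alphabet if a != word[p]] for p in positions]
        for subs in product(*choices):
            chars = list(word)
            for p, a in zip(positions, subs):
                chars[p] = a
            yield "".join(chars)


def classify_match(word, index, alphabet, max_errors=2):
    """Return the smallest error count (0, 1 or 2) at which `word` hits the
    index, or None if there is no hit within `max_errors` substitutions."""
    for e in range(max_errors + 1):
        for candidate in neighbors(word, alphabet, e):
            if candidate in index:
                return e
    return None
```

Enumerating substituted variants keeps each lookup independent of the length of the indexed string, at the cost of generating more candidates as the word length and alphabet grow; a suffix array, the alternative named in claim 17, would trade that differently.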

Assignees

Inventors

Classifications

  • ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks · CPC title

  • G06F16/334 (Primary)

    Query execution (filtering based on additional data G06F16/335) · CPC title

  • ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks · CPC title

  • ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9390163B2 cover?
Methods, systems, and computer-readable media are provided which can identify and provide local variations in regions of similarity among two or more data sets. These data sets may be represented as sequences such as, e.g., genomic sequences or words in a text. The local variations in similarity levels can be provided by selecting an initial prior distribution relating the data sets, organizing…
Who is the assignee on this patent?
Paxia Salvatore, Mishra Bhubaneswar, Zhou Yi, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F16/334. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 12, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).