Construction of reference database accurately representing complete set of data items for faster and tractable classification usage

US11314781B2 · US · B2

Patent metadata
Publication number: US-11314781-B2
Application number: US-201816146576-A
Country: US
Kind code: B2
Filing date: Sep 28, 2018
Priority date: Sep 28, 2018
Publication date: Apr 26, 2022
Grant date: Apr 26, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

For each unique pair of a complete set of data items, a computing device determines a distance between the data items of the unique pair. The computing device repeats the following until no data items remain in the complete set. For each data item remaining in the complete set, the computing device determines a similarity subset including each other data item that the distance between the data item and the other data item is less than a target difference threshold. The computing device moves a selected data item from a largest similarity subset to a reference database that is a subset of the complete set. The computing device removes each data item from the complete set that the distance between the selected data item and the data item is less than the threshold. A new data item can be classified using the reference database.
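The loop described in the abstract can be sketched as a greedy selection. The following is a minimal sketch, not the patented implementation. It assumes that the "selected data item" is the item whose similarity subset is largest, and that ties go to the earliest item; the abstract states neither. The function name `build_reference_database` and the `distance` callback are hypothetical names introduced for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Set, Tuple, TypeVar

T = TypeVar("T")


def build_reference_database(
    items: Sequence[T],
    distance: Callable[[T, T], float],
    threshold: float,
) -> List[T]:
    """Greedy sketch: pick representatives until no items remain."""
    # Determine the distance for each unique pair of data items once.
    dist: Dict[Tuple[int, int], float] = {
        (a, b): distance(items[a], items[b])
        for a, b in combinations(range(len(items)), 2)
    }

    def d(a: int, b: int) -> float:
        return dist[(a, b)] if a < b else dist[(b, a)]

    remaining: Set[int] = set(range(len(items)))
    reference: List[T] = []

    while remaining:
        # Similarity subset of each remaining item: the other remaining
        # items whose distance to it is below the threshold.
        subsets = {
            i: {j for j in remaining if j != i and d(i, j) < threshold}
            for i in remaining
        }
        # Assumption: select the item with the largest similarity subset;
        # max() over the sorted indices breaks ties by lowest index.
        selected = max(sorted(remaining), key=lambda i: len(subsets[i]))
        # Move the selected item to the reference database.
        reference.append(items[selected])
        # Remove the selected item and everything within the threshold.
        remaining -= subsets[selected]
        remaining.discard(selected)

    return reference


# Toy usage with integers and absolute difference as the distance.
representatives = build_reference_database(
    [0, 10, 20, 100, 105], lambda a, b: abs(a - b), 15
)
```

In the toy usage, 10 is selected first because both 0 and 20 lie within the threshold of it, and 100 then covers 105. Recomputing every similarity subset on each pass keeps the sketch close to the abstract's wording; an implementation aimed at large data sets would likely update the subsets incrementally.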

First claim

Opening claim text (preview).

We claim:
1. A method comprising: storing, by a computing device on a storage device, a reference database that is a subset of a complete set of data items, the reference database requires less storage space than the complete set of data items, each data item of the complete set having a distance to a data item of the subset that is less than a target difference threshold, the reference database having a classification accuracy equal to the k-th power of one minus the target difference threshold, and wherein k is an associated length of each data item; and classifying, by the computing device using logic implemented at least in hardware, a new data item using the reference database.
2. The method of claim 1, wherein each data item is a genome having a plurality of substrings that are each of length k.
3. The method of claim 2, wherein each the plurality of substrings is a kmer.
4. The method of claim 1, wherein the classification accuracy is a probability that a new data item is classified correctly using just the data items moved to the reference database and not all the data items of the complete set.
5. The method of claim 1, wherein classifying the new data item using just the reference database and not the complete set occurs more quickly than using the complete set.
6. The method of claim 1, wherein the target difference threshold is represented as a percentage.
7. The method of claim 1, wherein each data item is a protein having one or more substrings that are each of length k.
8. A computer system comprising: one or more computer processors, one or more computer-readable storage media, and program instructions stored on the one or more computer-readable storage media for execution by at least one of the one or more computer processors, the program instructions comprising: program instructions to store, by a computing device on a storage device, a reference database that is a subset of a complete set of data items, the reference database requires less storage space than the complete set of data items, each data item of the complete set having a distance to a data item of the subset that is less than a target difference threshold, wherein the reference database has a classification accuracy equal to the k-th power of one minus the target difference threshold, and wherein k is an associated length of each data item; and program instructions to classify, by the computing device using logic implemented at least in hardware, a new data item using the reference database.
9. The computer system of claim 8, wherein each data item is a genome having a plurality of substrings that are each of length k.
10. The computer system of claim 9, wherein each of the plurality of substrings is a kmer.
11. The computer system of claim 8, wherein the classification accuracy is a probability that a new data item is classified correctly using just the data items moved to the reference database and not all the data items of the complete set.
12. The computer system of claim 8, wherein classifying the new data item using just the reference database and not the complete set occurs more quickly than using the complete set.
13. The computer system of claim 8, wherein the target difference threshold is represented as a percentage.
14. The computer system of claim 8, wherein each data item is a protein having one or more substrings that are each of length k.
15. A computer program product comprising: one or more computer-readable storage media and program instructions stored on the one or more computer-readable storage media, the program instructions comprising: program instructions to store, by a computing device on a storage device, a reference database that is a subset of a complete set of data items, the reference database requires less storage space than the complete set of data items, each data item of the complete set having a distance to a data item of the subset that is less than a target difference threshold, wherein the reference database has a classification accuracy equal to the k-th power of one minus the target difference threshold, and wherein k is an associated length of each data item; and program instructions to classify, by the computing device using logic implemented at least in hardware, a new data item using the reference database.
16. The computer system of claim 15, wherein each data item is a genome having a plurality of substrings that are each of length k.
17. The computer system of claim 16, wherein each the plurality of substrings is a kmer.
18. The computer system of claim 15, wherein the classification accuracy is a probability that a new data item is classified correctly using just the data items moved to the reference database and not all the data items of the complete set.
19. The computer system of claim 15, wherein classifying the new data item using just the reference database and not the complete set occurs more quickly than using the complete set.
20. The computer system of claim 15, wherein the target difference threshold is represented as a percentage.
21. The computer system of claim 15, wherein each data item is a protein having one or more substrings that are each of length k.
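The accuracy relationship recited in the independent claims, accuracy = (1 - threshold)^k, and the length-k substrings (kmers) of the dependent claims can be illustrated with a short sketch. The function names `classification_accuracy` and `kmers` are hypothetical, and the values 0.01 (a 1% threshold) and k = 31 are illustrative assumptions, not figures from the patent.

```python
from typing import List


def classification_accuracy(threshold: float, k: int) -> float:
    """Return (1 - threshold) ** k, the accuracy expression in the claims.

    threshold is a fraction in [0, 1]; a percentage such as 1% is 0.01.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a fraction between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    return (1.0 - threshold) ** k


def kmers(sequence: str, k: int) -> List[str]:
    """Return every contiguous substring of length k, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Illustrative values only: a 1% threshold with k = 31.
example_accuracy = classification_accuracy(0.01, 31)
example_kmers = kmers("ACGTAC", 3)
```

With these assumed values the accuracy comes out at roughly 0.73, which shows how quickly the expression falls as k grows for a fixed threshold. The toy sequence "ACGTAC" yields the four 3-mers ACG, CGT, GTA and TAC.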

Assignees

IBM, Mars Inc

Inventors

Classifications

  • G06F16/285 (Primary)

    Clustering or classification · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11314781B2 cover?
For each unique pair of a complete set of data items, a computing device determines a distance between the data items of the unique pair. The computing device repeats the following until no data items remain in the complete set. For each data item remaining in the complete set, the computing device determines a similarity subset including each other data item that the distance between the data …
Who is the assignee on this patent?
IBM, Mars Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 26, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).