Method and system for implementing a fast dataset search using a compressed representation of a plurality of datasets

US11971868B2 · US · B2

Patent metadata
Publication number: US-11971868-B2
Application number: US-202318328376-A
Country: US
Kind code: B2
Filing date: Jun 2, 2023
Priority date: Nov 17, 2021
Publication date: Apr 30, 2024
Grant date: Apr 30, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes storing, by one or more processors of one or more computing devices, a plurality of datasets in a non-transitory computer memory associated with the one or more computing devices. A plurality of index representations is generated where each one of the plurality of index representations includes a compressed representation of a respective one of the plurality of datasets. The plurality of index representations is stored in the non-transitory computer memory. A sample dataset is received by the one or more processors. A sample dataset representation is generated that includes a compressed representation of the sample dataset. A determination that at least one of the plurality of datasets is most similar to the sample dataset based on the sample dataset representation and the plurality of index representations is performed.
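The flow in the abstract (compress each stored dataset, keep the compressed forms as an index, compress the incoming sample the same way, then pick the closest entry) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes NumPy is available, uses PCA via a singular value decomposition as the compressor, summarizes each dataset by the mean latent coordinate of its data objects, and compares with Euclidean distance. The function names (`fit_encoder`, `encode`, `most_similar`), the dataset names, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_encoder(datasets, k=2):
    """Fit one shared PCA basis on the data objects of all datasets.

    Each dataset is an n x m array: n feature rows, m data-object columns.
    """
    stacked = np.hstack(datasets)                 # shape (n, total objects)
    mean = stacked.mean(axis=1, keepdims=True)    # per-feature mean
    # Left singular vectors of the centered data are the principal directions.
    u, _, _ = np.linalg.svd(stacked - mean, full_matrices=False)
    return mean, u[:, :k]


def encode(dataset, mean, basis):
    """Compress an n x m dataset to a k-vector (assumed summary: mean latent coordinate)."""
    latent = basis.T @ (dataset - mean)           # shape (k, m)
    return latent.mean(axis=1)


def most_similar(sample_repr, index):
    """Return the name of the index entry with the smallest Euclidean distance."""
    return min(index, key=lambda name: np.linalg.norm(index[name] - sample_repr))


# Synthetic stand-ins for stored datasets: 5 features, 40 data objects each.
rng = np.random.default_rng(0)
datasets = {
    "dataset_a": rng.normal(0.0, 1.0, size=(5, 40)),
    "dataset_b": rng.normal(5.0, 1.0, size=(5, 40)),
    "dataset_c": rng.normal(-5.0, 1.0, size=(5, 40)),
}

# Build the index of compressed representations once, ahead of any query.
mean, basis = fit_encoder(list(datasets.values()), k=2)
index = {name: encode(data, mean, basis) for name, data in datasets.items()}

# The sample is a slice of dataset_b, so it should land nearest that entry.
sample = datasets["dataset_b"][:, :10]
best_match = most_similar(encode(sample, mean, basis), index)
print(best_match)
```

The search compares only short vectors, so its cost depends on the number of stored datasets and the latent dimension `k` rather than on the size of the raw data. Any of the encoders named in the claims could in principle replace the PCA step here.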

First claim

Opening claim text (preview).

The invention claimed is:

1. A method comprising: transforming, by at least one processor of at least one computing device, using an encoder module, comprising at least one first machine learning model, a plurality of datasets in a database respectively into a plurality of compressed latent space datasets in a latent space representation that groups similar features in data objects in each of the plurality of datasets into feature clusters in the latent space representation; transforming, by the at least one processor, using the encoder module, a user-provided sample dataset from the plurality of datasets into a sample compressed latent space dataset in the latent space representation; determining, by the at least one processor, using a comparator module, comprising at least one second machine learning model, a plurality of distances between the sample compressed latent space dataset and each of the plurality of compressed latent space datasets in the latent space representation; identifying, by the at least one processor, at least one compressed latent space dataset from the plurality of compressed latent space datasets having a distance smaller than a predefined threshold distance; determining, by the at least one processor, at least one possible location of the user-provided sample dataset in the database based at least in part on: the at least one compressed latent space dataset in the latent space representation, and an indexing between the plurality of datasets and the plurality of compressed latent space datasets; and instructing, by the at least one processor, over a communication network, a display to display on a graphic user interface, the at least one possible location of the user-provided sample dataset in the database.

2. The method of claim 1, further comprising storing, by the at least one processor, each dataset from the plurality of datasets as an n×m matrix of N dimensions in a non-transitory computer memory; wherein the n×m matrix comprises m columns of data objects from each dataset and n rows of features of the data objects.

3. The method of claim 2, wherein the transforming of the plurality of datasets comprises performing by the encoder module, a matrix decomposition process on each dataset for: (i) reducing the n×m matrix of N dimensions to a decomposed matrix having a dimension smaller than N, and (ii) identifying the similar features in the data objects respectively for the feature clusters in the latent space representation respectively for each compressed latent space dataset; wherein each compressed latent space dataset in the latent space representation comprises the decomposed matrix.

4. The method according to claim 3, wherein each compressed latent space dataset comprises the decomposed matrix with a lower order dimension of either 2 or 3, and further comprising providing, by the at least one processor, a visual representation of the features of the data objects in each compressed latent space dataset.

5. The method of claim 1, wherein the at least one first machine learning model is trained to execute encoder algorithms on each of the plurality of datasets comprising a non-negative matrix factorization (NMF) process, a principal component analysis (PCA), independent component analysis (ICA), an auto-encoder, or a latent space representation generator; and further comprising transforming, by the at least one processor, the plurality of datasets by applying the at least one first machine learning model to the plurality of datasets.

6. The method according to claim 1, wherein the at least one second machine learning model is a trained machine learning model to determine each of the plurality of distances using a Euclidian distance algorithm, a Manhattan distance algorithm, a Levenshtein distance algorithm, a cosine similarity algorithm, or any combination thereof; and wherein the determining of the plurality of distances comprises determining the plurality of distances using the trained machine learning model.

7. The method according to claim 1, further comprising reconstructing, by the at least one processor, from the plurality of compressed latent space datasets, a lossy representation of the plurality of datasets using a decoder module comprising at least one third machine learning model.

8. The method according to claim 1, wherein the plurality of datasets comprises a plurality of text-based datasets; and wherein the transforming of the plurality of text-based datasets to the plurality of compressed latent space datasets comprises applying a word embedding algorithm to the plurality of text-based datasets.

9. The method according to claim 1, wherein the plurality of datasets comprises a plurality of image-based datasets; wherein each of the plurality of image-based datasets comprises a high dimensional pixel space representation of image data objects; further comprising storing, by the at least one processor, each image-based dataset from the plurality of image-based datasets as an n×m matrix of N dimensions in a non-transitory computer memory; and wherein the n×m matrix comprises m columns of data objects from each image-based datasets and n rows of pixel values.

10. The method according to claim 9, wherein the transforming of the plurality of image-based datasets to the plurality of compressed latent space datasets comprises applying a T-distributed stochastic neighbor embedding (t-SNE) machine learning algorithm to the plurality of image-based datasets.

11. A system comprising: a non-transitory computer memory storing computer code; and at least one processor, that when executing the computer code, configures the at least one processor to: transform using an encoder module, comprising at least one first machine learning model, a plurality of datasets in a database respectively into a plurality of compressed latent space datasets in a latent space representation that groups similar features in data objects in each of the plurality of datasets into feature clusters in the latent space representation; transform using the encoder module, a user-provided sample dataset from the plurality of datasets into a sample compressed latent space dataset in the latent space representation; determine using a comparator module, comprising at least one second machine learning model, a plurality of distances between the sample compressed latent space dataset and each of the plurality of compressed latent space datasets in the latent space representation; identify at least one compressed latent space dataset from the plurality of compressed latent space datasets having a distance smaller than a predefined threshold distance; determine at least one possible location of the user-provided sample dataset in the database based at least in part on: the at least one compressed latent space dataset in the latent space representation, and an indexing between the plurality of datasets and the plurality of compressed latent space datasets; and instruct over a communication network, a display to display on a graphic user interface, the at least one possible location of the user-provided sample dataset in the database.

12. The system of claim 11, wherein the at least one processor is configured to store each dataset from the plurality of datasets as an n×m matrix of N dimensions in the non-transitory computer memory; wherein the n×m matrix comprises m columns of data objects from each dataset and n rows of features of the data objects.

13. The system of claim 12, wherein the at least one processor is configured to transform the plurality of datasets by performing by the encoder module, a matrix decomposition pr
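The comparison and lookup steps of claim 1 (compute distances, keep entries under a predefined threshold, map them back to database locations through an indexing) can be sketched in plain Python. This is an assumption-laden illustration: the claims describe the comparator as a trained machine learning model, whereas the sketch uses ordinary distance functions for three of the metrics named in claim 6. The function `candidate_locations`, the dataset identifiers, the location strings, and the latent vectors are all hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def candidate_locations(sample, latent_index, locations, threshold,
                        distance=euclidean):
    """Return locations of datasets closer than `threshold`, nearest first."""
    hits = []
    for dataset_id, vector in latent_index.items():
        d = distance(sample, vector)
        if d < threshold:                      # strict "smaller than" test
            hits.append((d, locations[dataset_id]))
    return [location for _, location in sorted(hits)]


# Hypothetical compressed representations keyed by dataset identifier.
latent_index = {
    "ds1": [-4.0, 1.0],
    "ds2": [3.0, 4.0],
    "ds3": [3.5, 4.5],
}
# Hypothetical indexing from dataset identifier to its place in the database.
locations = {
    "ds1": "schema_a.table_1",
    "ds2": "schema_b.table_2",
    "ds3": "schema_b.table_3",
}

sample = [3.0, 4.2]
near_euclidean = candidate_locations(sample, latent_index, locations, 1.0)
near_manhattan = candidate_locations(sample, latent_index, locations, 0.5,
                                     distance=manhattan)
print(near_euclidean)
print(near_manhattan)
```

The threshold is scale-dependent, so a value tuned for one metric generally does not carry over to another; that is why the two calls above use different thresholds.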

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Capital One Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/2228. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 30, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).