Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform
US-2019317961-A1 · Oct 17, 2019 · US
US11971868B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11971868-B2 |
| Application number | US-202318328376-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 2, 2023 |
| Priority date | Nov 17, 2021 |
| Publication date | Apr 30, 2024 |
| Grant date | Apr 30, 2024 |
Official abstract text for this publication.
A method includes storing, by one or more processors of one or more computing devices, a plurality of datasets in a non-transitory computer memory associated with the one or more computing devices. A plurality of index representations is generated where each one of the plurality of index representations includes a compressed representation of a respective one of the plurality of datasets. The plurality of index representations is stored in the non-transitory computer memory. A sample dataset is received by the one or more processors. A sample dataset representation is generated that includes a compressed representation of the sample dataset. A determination that at least one of the plurality of datasets is most similar to the sample dataset based on the sample dataset representation and the plurality of index representations is performed.
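The flow in the abstract (compress each stored dataset, keep the compressed forms as an index, compress the incoming sample the same way, then pick the closest stored dataset) can be illustrated with a minimal sketch. This is not the patented encoder: per-feature mean and standard deviation are assumed here as a stand-in for the compressed representation, and the function names (`compress`, `build_index`, `most_similar`) and the toy datasets are hypothetical.

```python
import math
import statistics


def compress(dataset):
    """Reduce a dataset (rows of numeric features) to a short fixed-length vector.

    Assumption: per-feature mean and population standard deviation stand in
    for the learned compressed representation described in the abstract.
    """
    columns = list(zip(*dataset))  # transpose rows into feature columns
    means = [statistics.fmean(col) for col in columns]
    spreads = [statistics.pstdev(col) for col in columns]
    return means + spreads


def build_index(datasets):
    """Map each dataset name to its compressed (index) representation."""
    return {name: compress(rows) for name, rows in datasets.items()}


def most_similar(sample, index):
    """Return (name, distance) of the stored dataset closest to the sample."""
    sample_repr = compress(sample)
    return min(
        ((name, math.dist(sample_repr, rep)) for name, rep in index.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy datasets: each row is one data object with two features.
datasets = {
    "sensor_a": [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0]],
    "sensor_b": [[5.0, 50.0], [5.5, 52.0], [4.5, 48.0]],
}
index = build_index(datasets)
best_name, best_distance = most_similar([[1.1, 10.5], [0.9, 9.5]], index)
```

Because every dataset is reduced to a vector of the same length, the sample can be compared against the index without touching the original rows. A learned encoder such as those listed in claim 5 would replace `compress` in a fuller implementation.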
Opening claim text (preview).
The invention claimed is:
1. A method comprising: transforming, by at least one processor of at least one computing device, using an encoder module, comprising at least one first machine learning model, a plurality of datasets in a database respectively into a plurality of compressed latent space datasets in a latent space representation that groups similar features in data objects in each of the plurality of datasets into feature clusters in the latent space representation; transforming, by the at least one processor, using the encoder module, a user-provided sample dataset from the plurality of datasets into a sample compressed latent space dataset in the latent space representation; determining, by the at least one processor, using a comparator module, comprising at least one second machine learning model, a plurality of distances between the sample compressed latent space dataset and each of the plurality of compressed latent space datasets in the latent space representation; identifying, by the at least one processor, at least one compressed latent space dataset from the plurality of compressed latent space datasets having a distance smaller than a predefined threshold distance; determining, by the at least one processor, at least one possible location of the user-provided sample dataset in the database based at least in part on: the at least one compressed latent space dataset in the latent space representation, and an indexing between the plurality of datasets and the plurality of compressed latent space datasets; and instructing, by the at least one processor, over a communication network, a display to display on a graphic user interface, the at least one possible location of the user-provided sample dataset in the database.
2. The method of claim 1, further comprising storing, by the at least one processor, each dataset from the plurality of datasets as an n×m matrix of N dimensions in a non-transitory computer memory; wherein the n×m matrix comprises m columns of data objects from each dataset and n rows of features of the data objects.
3. The method of claim 2, wherein the transforming of the plurality of datasets comprises performing by the encoder module, a matrix decomposition process on each dataset for: (i) reducing the n×m matrix of N dimensions to a decomposed matrix having a dimension smaller than N, and (ii) identifying the similar features in the data objects respectively for the feature clusters in the latent space representation respectively for each compressed latent space dataset; wherein each compressed latent space dataset in the latent space representation comprises the decomposed matrix.
4. The method according to claim 3, wherein each compressed latent space dataset comprises the decomposed matrix with a lower order dimension of either 2 or 3, and further comprising providing, by the at least one processor, a visual representation of the features of the data objects in each compressed latent space dataset.
5. The method of claim 1, wherein the at least one first machine learning model is trained to execute encoder algorithms on each of the plurality of datasets comprising a non-negative matrix factorization (NMF) process, a principal component analysis (PCA), independent component analysis (ICA), an auto-encoder, or a latent space representation generator; and further comprising transforming, by the at least one processor, the plurality of datasets by applying the at least one first machine learning model to the plurality of datasets.
6. The method according to claim 1, wherein the at least one second machine learning model is a trained machine learning model to determine each of the plurality of distances using a Euclidian distance algorithm, a Manhattan distance algorithm, a Levenshtein distance algorithm, a cosine similarity algorithm, or any combination thereof; and wherein the determining of the plurality of distances comprises determining the plurality of distances using the trained machine learning model.
7. The method according to claim 1, further comprising reconstructing, by the at least one processor, from the plurality of compressed latent space datasets, a lossy representation of the plurality of datasets using a decoder module comprising at least one third machine learning model.
8. The method according to claim 1, wherein the plurality of datasets comprises a plurality of text-based datasets; and wherein the transforming of the plurality of text-based datasets to the plurality of compressed latent space datasets comprises applying a word embedding algorithm to the plurality of text-based datasets.
9. The method according to claim 1, wherein the plurality of datasets comprises a plurality of image-based datasets; wherein each of the plurality of image-based datasets comprises a high dimensional pixel space representation of image data objects; further comprising storing, by the at least one processor, each image-based dataset from the plurality of image-based datasets as an n×m matrix of N dimensions in a non-transitory computer memory; and wherein the n×m matrix comprises m columns of data objects from each image-based datasets and n rows of pixel values.
10. The method according to claim 9, wherein the transforming of the plurality of image-based datasets to the plurality of compressed latent space datasets comprises applying a T-distributed stochastic neighbor embedding (t-SNE) machine learning algorithm to the plurality of image-based datasets.
11. A system comprising: a non-transitory computer memory storing computer code; and at least one processor, that when executing the computer code, configures the at least one processor to: transform using an encoder module, comprising at least one first machine learning model, a plurality of datasets in a database respectively into a plurality of compressed latent space datasets in a latent space representation that groups similar features in data objects in each of the plurality of datasets into feature clusters in the latent space representation; transform using the encoder module, a user-provided sample dataset from the plurality of datasets into a sample compressed latent space dataset in the latent space representation; determine using a comparator module, comprising at least one second machine learning model, a plurality of distances between the sample compressed latent space dataset and each of the plurality of compressed latent space datasets in the latent space representation; identify at least one compressed latent space dataset from the plurality of compressed latent space datasets having a distance smaller than a predefined threshold distance; determine at least one possible location of the user-provided sample dataset in the database based at least in part on: the at least one compressed latent space dataset in the latent space representation, and an indexing between the plurality of datasets and the plurality of compressed latent space datasets; and instruct over a communication network, a display to display on a graphic user interface, the at least one possible location of the user-provided sample dataset in the database.
12. The system of claim 11, wherein the at least one processor is configured to store each dataset from the plurality of datasets as an n×m matrix of N dimensions in the non-transitory computer memory; wherein the n×m matrix comprises m columns of data objects from each dataset and n rows of features of the data objects.
13. The system of claim 12, wherein the at least one processor is configured to transform the plurality of datasets by performing by the encoder module, a matrix decomposition pr
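To make the comparator step concrete, the following minimal sketch shows the four distance measures named in claim 6 and the threshold filter from claim 1. Everything here is illustrative: the claims describe a trained machine learning model as the comparator, whereas this sketch uses plain hand-written functions, and the names `candidates_within` and `latent_index` as well as the example locations are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity, so smaller means more similar.

    Undefined (division by zero) when either vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def levenshtein(s, t):
    """Edit distance between two sequences (row-by-row dynamic programming)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def candidates_within(sample, latent_index, threshold, distance=euclidean):
    """Return (location, distance) pairs closer than the threshold, nearest first."""
    hits = [(loc, distance(sample, vec)) for loc, vec in latent_index.items()]
    return sorted((h for h in hits if h[1] < threshold), key=lambda h: h[1])


# Hypothetical index from database location to compressed latent vector.
latent_index = {
    "db/table_1": [0.0, 0.0],
    "db/table_2": [3.0, 4.0],
    "db/table_3": [0.5, 0.5],
}
matches = candidates_within([0.0, 0.0], latent_index, threshold=1.0)
```

Keying the index by database location means the surviving entries directly give the candidate locations that claim 1 says are shown to the user. Levenshtein distance operates on sequences such as strings rather than numeric vectors, so it would apply to text-like representations only.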
Indexing structures · CPC title
Binary matching operations · CPC title
using ranking · CPC title