Distributed, multi-model, self-learning platform for machine learning
US-2016132787-A1 · May 12, 2016 · US
US11474978B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11474978-B2 |
| Application number | US-201916405956-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 7, 2019 |
| Priority date | Jul 6, 2018 |
| Publication date | Oct 18, 2022 |
| Grant date | Oct 18, 2022 |
Official abstract text for this publication.
Systems and methods for searching data are disclosed. For example, the system may include one or more memory units storing instructions and one or more processors configured to execute the instructions to perform operations. The operations may include receiving a sample dataset and identifying a data schema of the sample dataset. The operations may include generating a sample data vector that includes statistical metrics of the sample dataset and information based on the data schema of the sample dataset. The operations may include searching a data index comprising a plurality of stored data vectors corresponding to a plurality of reference datasets. The stored data vectors may include statistical metrics of the reference datasets and information based on corresponding data schema. The operations may include generating, based on the search and the sample data vector, one or more similarity metrics of the sample dataset to individual ones of the reference datasets.
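As an illustration of the "sample data vector" described in the abstract, the following is a minimal sketch, not the patented implementation. It assumes a dataset is a mapping from column names to lists of values, and it substitutes plain descriptive statistics for the data-profiling model. The names `infer_type` and `profile_dataset`, and the two-way numeric/text schema, are hypothetical choices made for this example only.

```python
from statistics import mean, pstdev


def infer_type(values):
    """Crude schema guess (assumption): 'numeric' if every value is an int or float."""
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    return "numeric" if values and all(is_number(v) for v in values) else "text"


def profile_dataset(columns):
    """Build a data vector holding schema information and statistical metrics.

    `columns` maps a column name to its list of values (hypothetical layout).
    """
    schema = {name: infer_type(vals) for name, vals in columns.items()}
    stats = {}
    for name, vals in columns.items():
        if schema[name] == "numeric":
            # Average, standard deviation, and range are among the metrics
            # the claims list as examples.
            stats[name] = {
                "mean": mean(vals),
                "std": pstdev(vals),
                "range": max(vals) - min(vals),
            }
    return {"schema": schema, "stats": stats}


sample = {"age": [30, 40, 50], "city": ["Oslo", "Rome", "Lima"]}
vector = profile_dataset(sample)
print(vector["schema"])
print(vector["stats"]["age"]["mean"])
```

The same function could be applied to each reference dataset to populate a data index, so that sample and stored vectors share one layout.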
Opening claim text (preview).
What is claimed is:

1. A system for searching data, comprising: at least one memory storing instructions; and one or more processors that execute the instructions to perform operations comprising: receiving a search request comprising a sample dataset and a vector similarity threshold of similarity between vectors; in response to the received search request, performing: identifying, using a data-profiling model comprising a machine learning model configured to compute statistical metrics descriptive of the sample dataset, a data schema of the sample dataset; computing, using the data-profiling model, statistical metrics describing at least one statistical attribute of data within the sample dataset; generating a sample data vector comprising the computed statistical metrics of the sample dataset and information describing the data schema of the sample dataset; searching a data index comprising a plurality of stored data vectors corresponding to a plurality of reference datasets, the stored data vectors comprising statistical metrics of data within the reference datasets and information describing corresponding data schema of the reference datasets, wherein searching the data index comprises: performing data schema comparisons between the data schema of the sample data vector and data schemas of the stored vectors; and performing statistical metric comparisons between the computed statistical metrics of the sample data vector and statistical metrics of the stored vectors; generating, based on both the data schema comparisons and the statistical metric comparisons, one or more similarity metrics of the sample dataset to individual ones of the reference datasets; determining, based on the one or more similarity metrics, at least a portion of the reference datasets having at least one data vector satisfying the vector similarity threshold; and returning, as a result of the received search request, the at least a portion of the reference datasets.

2. The system of claim 1, the operations further comprising: receiving a new reference dataset; identifying a data schema of the new reference dataset; generating a new reference data vector comprising statistical measures of the new reference dataset; and updating the data index based on the new reference data vector.

3. The system of claim 1, the operations further comprising: receiving, by an aggregator, the reference datasets; identifying, by the aggregator, the corresponding data schema of the reference datasets; generating, by the aggregator, the stored data vectors; generating, by the aggregator, the data index; and storing the data index in an aggregation database.

4. The system of claim 3, wherein: the operations further comprise transmitting, to one or more computing environments, a request for the reference datasets; and the reference datasets are received from individual ones of the computing environments.

5. The system of claim 1, wherein computing statistical metrics describing at least one statistical attribute of the sample dataset comprises applying a model to the sample dataset, the model comprising at least one of: a regression model, a Bayesian model, a statistical model, a linear discriminant analysis model, or a machine learning model.

6. The system of claim 1, wherein searching the data index further comprises searching data profiles of variables of the reference datasets.

7. The system of claim 1, wherein at least one of the stored data vectors corresponds to a data column of one of the reference datasets.

8. The system of claim 1, wherein at least one of the stored data vectors comprises statistical metrics of a plurality of stored data vectors corresponding to data columns of one of the reference datasets.

9. The system of claim 1, wherein the sample data vector corresponds to a data column of the sample dataset.

10. The system of claim 1, wherein the sample data vector comprises statistical metrics of a plurality of data column data vectors comprising statistical metrics of respective data columns of the sample dataset.

11. The system of claim 1, wherein: the sample data vector is a first sample data vector; the operations further comprise generating a second sample data vector comprising statistical metrics of the sample dataset and information based on the data schema of the sample dataset; and computing the similarity metrics is based on the second sample data vector.

12. The system of claim 1, wherein the similarity metrics are based on a weight associated with the data schema of the sample dataset.

13. The system of claim 1, wherein identifying the data schema comprises classifying a complex data type.

14. The system of claim 1, wherein: performing the statistical metric comparisons comprises comparing individual statistical metrics of the sample dataset to individual statistical metrics of the reference datasets; and the individual statistical metrics of the sample dataset and the individual statistical metrics of the reference datasets comprise at least one of: an average, a standard deviation, a range, a moment, a variance, a covariance, or a covariance matrix.

15. The system of claim 1, wherein the operations further comprise transmitting the generated one or more similarity metrics to a client device.

16. The system of claim 1, wherein searching the data index comprises fuzzy searching.

17. The system of claim 1, wherein at least one of the similarity metrics represents a likelihood that the sample data derives from one of the reference datasets.

18. A method for searching data, the method comprising the following operations performed by one or more processors: receiving a search request comprising a sample dataset and a vector similarity threshold of similarity between vectors; in response to the received search request, performing: identifying, using a data-profiling model comprising a machine learning model configured to compute statistical metrics descriptive of the sample dataset, a data schema of data within the sample dataset; computing, using the data-profiling model, statistical metrics describing at least one statistical attribute of the sample dataset; generating a sample data vector comprising the computed statistical metrics of the sample dataset and information describing the data schema of the sample dataset; searching a data index comprising a plurality of stored data vectors corresponding to a plurality of reference datasets, the stored data vectors comprising statistical metrics of data within the corresponding reference datasets and information describing corresponding data schema of the reference datasets, wherein searching the data index comprises: performing data schema comparisons between the data schema of the sample data vector and data schemas of the stored vectors; and performing statistical metric comparisons between the computed statistical metrics of the sample data vector and statistical metrics of the stored vectors; generating, based on both the data schema comparisons and the statistical metric comparisons, one or more similarity metrics of the sample dataset to individual ones of the reference datasets; determining, based on the one or more similarity metrics, at least a portion of the reference datasets having at least one data vector satisfying the vector similarity threshold; and returning, as a result of the received search request, the at least a portion of the reference datasets.

19. A system for searching data, comprising: at least one memo
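The search step recited in claims 1 and 18 (schema comparisons, statistical metric comparisons, a combined similarity metric, and a threshold filter) might be sketched as follows. This is an assumption-laden illustration rather than the claimed method: the claims do not specify how the comparisons are computed or combined, so Jaccard overlap, cosine similarity, and the `schema_weight` blend are stand-ins chosen for this example. All function names and the index layout (a dict of vectors, each holding a `schema` dict and a flat `stats` list) are hypothetical.

```python
import math


def schema_similarity(a, b):
    """Jaccard overlap of (column name, type) pairs; an assumed schema comparison."""
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


def stats_similarity(a, b):
    """Cosine similarity of two equal-length metric lists; an assumed statistical comparison."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_index(sample, index, threshold, schema_weight=0.5):
    """Return (name, score) for reference datasets whose score meets the threshold."""
    results = []
    for name, stored in index.items():
        # Blend both comparisons into one similarity metric (hypothetical weighting).
        score = (
            schema_weight * schema_similarity(sample["schema"], stored["schema"])
            + (1 - schema_weight) * stats_similarity(sample["stats"], stored["stats"])
        )
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


sample = {"schema": {"age": "numeric", "city": "text"}, "stats": [40.0, 8.0]}
index = {
    "customers": {"schema": {"age": "numeric", "city": "text"}, "stats": [41.0, 8.2]},
    "sensors": {"schema": {"temp": "numeric"}, "stats": [0.5, 30.0]},
}
matches = search_index(sample, index, threshold=0.8)
print([name for name, _ in matches])
```

Raising `schema_weight` makes structural agreement dominate the score, which loosely mirrors the schema-associated weight mentioned in claim 12. A real index would likely avoid the linear scan used here.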
Query formulation · CPC title
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
Probabilistic or stochastic networks · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Combinations of networks · CPC title