Telemetry data contextualized across datasets
US-2017364561-A1 · Dec 21, 2017 · US
US10339147B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10339147-B1 |
| Application number | US-201615189735-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jun 22, 2016 |
| Priority date | Jun 22, 2016 |
| Publication date | Jul 2, 2019 |
| Grant date | Jul 2, 2019 |
Official abstract text for this publication.
Technology is provided for data set scoring. In one example, a method includes analyzing first and second characteristics of a data set. The first and second characteristics represent a quality of data values in the data set. At least the first characteristic is independent of the data values in the data set. The method further includes assigning a score to the data set based on the first and second characteristics. The data set may be ranked against a plurality of other data sets based on the score. The score of the data set may be provided together with a scoring scale to enable a determination of the quality of the data values based on the score.
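The scoring and ranking flow in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented method: the `DataSet` type, the equal weighting, the 0 to 100 scale, and the one-year freshness horizon are all assumptions made for the example. Freshness and completeness are used because the claims name them as candidate characteristics; freshness is computed from metadata alone, so it is independent of the data values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DataSet:
    """Hypothetical container; None marks a missing value."""
    name: str
    rows: list
    last_updated: datetime


def freshness(ds: DataSet, now: datetime, horizon_days: int = 365) -> float:
    """1.0 for a data set updated today, falling to 0.0 at the assumed horizon."""
    age_days = (now - ds.last_updated).days
    return max(0.0, 1.0 - age_days / horizon_days)


def completeness(ds: DataSet) -> float:
    """Share of cells that hold a value (1 minus the missing proportion)."""
    cells = [value for row in ds.rows for value in row.values()]
    if not cells:
        return 0.0
    return sum(value is not None for value in cells) / len(cells)


def score(ds: DataSet, now: datetime) -> float:
    """Combine both characteristics on an assumed 0-100 scale, equally weighted."""
    return 100 * (0.5 * freshness(ds, now) + 0.5 * completeness(ds))


def rank(datasets: list, now: datetime) -> list:
    """Order data sets from highest to lowest score."""
    return sorted(datasets, key=lambda ds: score(ds, now), reverse=True)


now = datetime(2019, 7, 2, tzinfo=timezone.utc)
recent = DataSet("recent", [{"a": 1, "b": None}, {"a": 2, "b": 3}],
                 now - timedelta(days=73))
stale = DataSet("stale", [{"a": 1, "b": 2}], now - timedelta(days=400))
ranked = rank([stale, recent], now)
```

Returning the score together with its scale (here 0 to 100) is what lets a reader judge quality from the number alone, as the abstract's last sentence describes.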
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: analyzing a first characteristic of a data set using a set of rules to evaluate suitability for machine learning models, the first characteristic being independent of data values in the data set and representing a quality of the data values; assigning a first characteristic score to the first characteristic using the set of rules; analyzing a second characteristic of the data set using the set of rules; assigning a second characteristic score to the second characteristic using the set of rules; assigning a data set score to the data set based on the first and second characteristic scores; ranking the data set as suitable to use with a machine learning model against a plurality of other data sets based on at least one of the first characteristic score, the second characteristic score or the data set score; receiving a search request for data sets suitable for the machine learning model, which match a query received from a client device; and providing, using an electronic page, the data set and the plurality of other data sets for access by the client device through the electronic page based in part on the ranking and in response to the search request to compare the quality of the data set against the plurality of other data sets; providing a uniform resource locator (URL) for the data set to the client device after one or more data sets are selected from the electronic page, wherein the URL includes authentication credentials; and verifying the authentication credentials in the URL to determine access to the data set, in response to receipt of the URL.
2. The method of claim 1, further comprising: selecting the first characteristic from at least one of completeness, dimensionality or freshness characteristics; wherein completeness indicates a proportion of entries in at least a portion the data set that are missing data values; dimensionality indicates a number of columns or attributes relative to a number of rows or values in the data set; and freshness indicates at least one of a frequency of updates to the data set or an amount of time since a latest update to the data set.
3. The method of claim 1, further comprising: selecting the second characteristic from at least one of consistency, predictiveness, conformity, schema or uniqueness characteristics; wherein consistency indicates a degree of consistency of a data schema across a plurality of dataset snapshots; predictiveness indicates one or more of: a number of categorical attributes relative to text attributes in the data set, variance of numerical attributes in the data set, or correlation across attributes in the data set; conformity indicates how well the data values conform to the data schema; schema indicates a number of free-form attributes relative to a number of numerical attributes relative to a number of categorical attributes in the data schema; and uniqueness indicates a number of unique data values in the data set.
4. The method of claim 1, wherein the search request comprises a request to filter the data set and the plurality of other data sets based on a threshold value for one of the first characteristic score, the second characteristic score or the data set score.
5. The method of claim 1, wherein: each of the plurality of other data sets includes a plurality of scores based on the first characteristic or the second characteristic; providing the data set and the plurality of other data sets for access through the electronic page based on the ranking and in response to the search request which comprises: providing an identifier of the data set and the plurality of other data sets together with the first characteristic score, the second characteristic score or the data set score and with at least one of the plurality of scores for each of the plurality of other data sets; and providing a scoring scale for the first characteristic score, the second characteristic score or the data set score for the data set, or the at least one of the plurality of scores for each of the plurality of other data sets, to be compared against.
6. A computer-implemented method, comprising: analyzing a first characteristic of a data set using a set of rules to evaluate suitability for machine learning models, the first characteristic being determinable independent of data values in the data set and representing a quality of the data values; analyzing a second characteristic of the data set using the set of rules, the second characteristic representing a quality of the data values; assigning a score to the data set based on the set of rules analyzing the first and second characteristics; ranking the data set score within a display provided to a client device against a plurality of other data sets based on the score; providing the score of the data set within the display to enable a comparison of the quality of the data set against the plurality of other data sets; providing a uniform resource locator (URL) for the data set to the client device after one or more data sets are selected from the display, wherein the URL includes authentication credentials; and verifying the authentication credentials in the URL to determine access to the data set, in response to receipt of the URL.
7. The method of claim 6, further comprising analyzing the first or second characteristics in view of an entirety of the data set.
8. The method of claim 6, further comprising analyzing the first or second characteristics in view of an individual attribute or value of the data set.
9. The method of claim 6, wherein the score is a numeric value or a qualitative grade.
10. The method of claim 6, further comprising: assigning the score to the data set based on predictiveness that identifies how the data set is utilized by machine learning models; wherein the score indicates relative precision of numerical entries, categorical entries, and freeform entries to make predictions by the machine learning models.
11. The method of claim 6, further comprising: selecting the first characteristic from at least one of completeness, dimensionality or freshness characteristics; wherein the second characteristic is selected from a different one or more of completeness, dimensionality or freshness than the first characteristic; completeness indicates a proportion of entries in at least a portion the data set that are missing data values; dimensionality indicates a number of columns or attributes relative to a number of rows or values in the data set; and freshness indicates at least one of a frequency of updates to the data set or an amount of time since a latest update to the data set.
12. The method of claim 6, further comprising scoring and ranking the plurality of other data sets using a same scoring method and fully automatic.
13. The method of claim 6, further comprising: modifying the score based on at least one of: data provider quality, data set consumability or machine learning adaptability; wherein data provider quality is calculated based on previous scores of other datasets provided by a provider of the data set; dataset consumability is calculated based on a number of customers who use the dataset; and machine learning adaptability is calculated based on a number of machine learning solutions that use the dataset.
14. The method of claim 6, further comprising: identifying types of machine learning that use the data set; and providing identification of the types of the machine learning within the display, which are valuable for use with the data set, along with the score of the data s
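Claims 1 and 6 both end with providing a URL that includes authentication credentials and verifying those credentials when the URL is received. The claims do not say what form the credentials take. The sketch below assumes one common approach, an HMAC-signed token with an expiry time carried in the query string; `SECRET_KEY`, `make_url`, `verify_url`, and the `example.com` base address are hypothetical names chosen for illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical server-side key, never sent to clients


def sign(dataset_id: str, expires: int) -> str:
    """HMAC-SHA256 over the data set id and expiry time (assumed scheme)."""
    message = f"{dataset_id}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_url(dataset_id: str, expires: int,
             base: str = "https://example.com/datasets") -> str:
    """Build a URL whose query string carries the credentials."""
    query = urlencode({"expires": expires, "token": sign(dataset_id, expires)})
    return f"{base}/{dataset_id}?{query}"


def verify_url(url: str, now: int) -> bool:
    """Return True only if the token matches and the URL has not expired."""
    parsed = urlparse(url)
    dataset_id = parsed.path.rsplit("/", 1)[-1]
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        token = params["token"][0]
    except (KeyError, ValueError):
        return False  # credentials missing or malformed
    if now > expires:
        return False
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(token, sign(dataset_id, expires))


url = make_url("ds-42", expires=1000)
```

Because the signature covers the data set identifier, a URL issued for one data set does not verify if the path is edited to point at another.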
Machine learning · CPC title
using ranking · CPC title
Search customisation based on user profiles and personalisation · CPC title
Forward inferencing; Production systems · CPC title
Electronic shopping [e-shopping] · CPC title