Data set scoring

US10339147B1 · US · B1

Patent metadata
Publication number: US-10339147-B1
Application number: US-201615189735-A
Country: US
Kind code: B1
Filing date: Jun 22, 2016
Priority date: Jun 22, 2016
Publication date: Jul 2, 2019
Grant date: Jul 2, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Technology is provided for data set scoring. In one example, a method includes analyzing first and second characteristics of a data set. The first and second characteristics represent a quality of data values in the data set. At least the first characteristic is independent of the data values in the data set. The method further includes assigning a score to the data set based on the first and second characteristics. The data set may be ranked against a plurality of other data sets based on the score. The score of the data set may be provided together with a scoring scale to enable a determination of the quality of the data values based on the score.
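The flow in the abstract (two characteristic scores, one data set score, a ranking, and a scoring scale) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the `ScoredDataSet` class, the equal-weight average, the `SCALE` thresholds, and the sample catalog.

```python
from dataclasses import dataclass


@dataclass
class ScoredDataSet:
    name: str
    first_score: float   # characteristic independent of the data values (e.g. freshness)
    second_score: float  # a second quality characteristic

    @property
    def score(self) -> float:
        # Assumption: equal-weight average. The abstract gives no formula.
        return (self.first_score + self.second_score) / 2


# Hypothetical scoring scale: (minimum score, grade), highest threshold first.
SCALE = [(0.9, "A"), (0.75, "B"), (0.5, "C"), (0.0, "D")]


def grade(score: float) -> str:
    """Map a numeric score onto the hypothetical scale above."""
    for minimum, label in SCALE:
        if score >= minimum:
            return label
    return SCALE[-1][1]


def rank(data_sets):
    """Order data sets from highest to lowest data set score."""
    return sorted(data_sets, key=lambda d: d.score, reverse=True)


# Made-up catalog entries, for illustration only.
catalog = [
    ScoredDataSet("sales", 0.5, 1.0),
    ScoredDataSet("weather", 1.0, 1.0),
    ScoredDataSet("clicks", 0.25, 0.5),
]

ranked = rank(catalog)
for d in ranked:
    print(d.name, d.score, grade(d.score))
```

Showing the grade next to the numeric score is one way a scale could let a reader judge quality from the score alone; other scales or weightings would fit the abstract equally well.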

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: analyzing a first characteristic of a data set using a set of rules to evaluate suitability for machine learning models, the first characteristic being independent of data values in the data set and representing a quality of the data values; assigning a first characteristic score to the first characteristic using the set of rules; analyzing a second characteristic of the data set using the set of rules; assigning a second characteristic score to the second characteristic using the set of rules; assigning a data set score to the data set based on the first and second characteristic scores; ranking the data set as suitable to use with a machine learning model against a plurality of other data sets based on at least one of the first characteristic score, the second characteristic score or the data set score; receiving a search request for data sets suitable for the machine learning model, which match a query received from a client device; and providing, using an electronic page, the data set and the plurality of other data sets for access by the client device through the electronic page based in part on the ranking and in response to the search request to compare the quality of the data set against the plurality of other data sets; providing a uniform resource locator (URL) for the data set to the client device after one or more data sets are selected from the electronic page, wherein the URL includes authentication credentials; and verifying the authentication credentials in the URL to determine access to the data set, in response to receipt of the URL.

2. The method of claim 1, further comprising: selecting the first characteristic from at least one of completeness, dimensionality or freshness characteristics; wherein completeness indicates a proportion of entries in at least a portion the data set that are missing data values; dimensionality indicates a number of columns or attributes relative to a number of rows or values in the data set; and freshness indicates at least one of a frequency of updates to the data set or an amount of time since a latest update to the data set.

3. The method of claim 1, further comprising: selecting the second characteristic from at least one of consistency, predictiveness, conformity, schema or uniqueness characteristics; wherein consistency indicates a degree of consistency of a data schema across a plurality of dataset snapshots; predictiveness indicates one or more of: a number of categorical attributes relative to text attributes in the data set, variance of numerical attributes in the data set, or correlation across attributes in the data set; conformity indicates how well the data values conform to the data schema; schema indicates a number of free-form attributes relative to a number of numerical attributes relative to a number of categorical attributes in the data schema; and uniqueness indicates a number of unique data values in the data set.

4. The method of claim 1, wherein the search request comprises a request to filter the data set and the plurality of other data sets based on a threshold value for one of the first characteristic score, the second characteristic score or the data set score.

5. The method of claim 1, wherein: each of the plurality of other data sets includes a plurality of scores based on the first characteristic or the second characteristic; providing the data set and the plurality of other data sets for access through the electronic page based on the ranking and in response to the search request which comprises: providing an identifier of the data set and the plurality of other data sets together with the first characteristic score, the second characteristic score or the data set score and with at least one of the plurality of scores for each of the plurality of other data sets; and providing a scoring scale for the first characteristic score, the second characteristic score or the data set score for the data set, or the at least one of the plurality of scores for each of the plurality of other data sets, to be compared against.

6. A computer-implemented method, comprising: analyzing a first characteristic of a data set using a set of rules to evaluate suitability for machine learning models, the first characteristic being determinable independent of data values in the data set and representing a quality of the data values; analyzing a second characteristic of the data set using the set of rules, the second characteristic representing a quality of the data values; assigning a score to the data set based on the set of rules analyzing the first and second characteristics; ranking the data set score within a display provided to a client device against a plurality of other data sets based on the score; providing the score of the data set within the display to enable a comparison of the quality of the data set against the plurality of other data sets; providing a uniform resource locator (URL) for the data set to the client device after one or more data sets are selected from the display, wherein the URL includes authentication credentials; and verifying the authentication credentials in the URL to determine access to the data set, in response to receipt of the URL.

7. The method of claim 6, further comprising analyzing the first or second characteristics in view of an entirety of the data set.

8. The method of claim 6, further comprising analyzing the first or second characteristics in view of an individual attribute or value of the data set.

9. The method of claim 6, wherein the score is a numeric value or a qualitative grade.

10. The method of claim 6, further comprising: assigning the score to the data set based on predictiveness that identifies how the data set is utilized by machine learning models; wherein the score indicates relative precision of numerical entries, categorical entries, and freeform entries to make predictions by the machine learning models.

11. The method of claim 6, further comprising: selecting the first characteristic from at least one of completeness, dimensionality or freshness characteristics; wherein the second characteristic is selected from a different one or more of completeness, dimensionality or freshness than the first characteristic; completeness indicates a proportion of entries in at least a portion the data set that are missing data values; dimensionality indicates a number of columns or attributes relative to a number of rows or values in the data set; and freshness indicates at least one of a frequency of updates to the data set or an amount of time since a latest update to the data set.

12. The method of claim 6, further comprising scoring and ranking the plurality of other data sets using a same scoring method and fully automatic.

13. The method of claim 6, further comprising: modifying the score based on at least one of: data provider quality, data set consumability or machine learning adaptability; wherein data provider quality is calculated based on previous scores of other datasets provided by a provider of the data set; dataset consumability is calculated based on a number of customers who use the dataset; and machine learning adaptability is calculated based on a number of machine learning solutions that use the dataset.

14. The method of claim 6, further comprising: identifying types of machine learning that use the data set; and providing identification of the types of the machine learning within the display, which are valuable for use with the data set, along with the score of the data s
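The three characteristics defined in claims 2 and 11 (completeness, dimensionality, freshness) can be made concrete with a small sketch. This is one possible reading of those definitions, not the patent's implementation. The function names, the use of `None` to mark a missing entry, the choice of days as the freshness unit, and the sample rows are all assumptions for illustration.

```python
from datetime import datetime, timezone


def missing_proportion(rows):
    """Completeness input: proportion of entries that are missing.

    Assumption: a missing entry is represented as None.
    """
    entries = [value for row in rows for value in row.values()]
    if not entries:
        return 0.0
    return sum(value is None for value in entries) / len(entries)


def dimensionality(rows):
    """Number of columns (attributes) relative to the number of rows."""
    if not rows:
        return 0.0
    columns = {key for row in rows for key in row}
    return len(columns) / len(rows)


def days_since_update(last_updated, now):
    """Freshness input: time since the latest update, expressed in days."""
    return (now - last_updated).total_seconds() / 86400


# Made-up sample data, for illustration only.
rows = [
    {"id": 1, "price": 9.5, "region": "EU"},
    {"id": 2, "price": None, "region": "US"},
    {"id": 3, "price": 4.0, "region": None},
    {"id": 4, "price": 7.25, "region": "US"},
]
last_updated = datetime(2019, 6, 22, tzinfo=timezone.utc)
now = datetime(2019, 7, 2, tzinfo=timezone.utc)

print(missing_proportion(rows))              # 2 of 12 entries are missing
print(dimensionality(rows))                  # 3 columns over 4 rows
print(days_since_update(last_updated, now))  # 10 days between the two dates
```

Dimensionality and freshness depend only on the shape of the data set and its update timestamps, which is consistent with the claims' requirement that the first characteristic be determinable independent of the data values. How these raw measures would be turned into characteristic scores is left to the set of rules the claims refer to.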

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Machine learning

  • using ranking

  • Search customisation based on user profiles and personalisation

  • Forward inferencing; Production systems

  • Electronic shopping [e-shopping]

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10339147B1 cover?
Technology is provided for data set scoring. In one example, a method includes analyzing first and second characteristics of a data set. The first and second characteristics represent a quality of data values in the data set. At least the first characteristic is independent of the data values in the data set. The method further includes assigning a score to the data set based on the first and s…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/24578. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 2, 2019 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).