Automated dynamic data quality assessment

US10360516B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10360516-B2
Application number	US-201715619786-A
Country	US
Kind code	B2
Filing date	Jun 12, 2017
Priority date	Nov 22, 2013
Publication date	Jul 23, 2019
Grant date	Jul 23, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In general, embodiments of the present invention provide systems, methods and computer readable media for automated dynamic data quality assessment. One aspect of the subject matter described in this specification includes the actions of receiving a data quality job including a new data sample; and, if the new data sample is determined to be added to a reservoir of data samples, sending a quality verification request to an oracle; receiving a new data sample quality estimate from the oracle; and adding the new data sample and estimate to the reservoir. A second aspect of the subject matter includes the actions of receiving, from a predictive model, a judgment associated with a new data sample; analyzing the new data sample based in part on the judgment to determine whether to send a new data sample quality verification request to an oracle; and, if a new data sample quality estimate is received from the oracle, determining whether to add the new data sample and the judgment to the reservoir.
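The first aspect in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the abstract does not say how the add-to-reservoir decision is made, so the code assumes classic reservoir sampling (Algorithm R), and the names `Reservoir`, `offer`, and `Oracle` are hypothetical. The oracle is consulted only for samples that are selected for the reservoir.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical oracle type: maps a data sample to a quality estimate.
Oracle = Callable[[dict], float]


@dataclass
class Reservoir:
    capacity: int
    rng: random.Random = field(default_factory=random.Random)
    items: List[Tuple[dict, float]] = field(default_factory=list)
    seen: int = 0  # number of samples offered so far

    def offer(self, sample: dict, oracle: Oracle) -> bool:
        """Decide whether to keep the sample; return True if it was stored."""
        self.seen += 1
        if len(self.items) < self.capacity:
            slot = len(self.items)  # reservoir not full yet: always keep
        else:
            # Assumed policy (Algorithm R): keep with probability capacity/seen.
            slot = self.rng.randrange(self.seen)
            if slot >= self.capacity:
                return False  # not selected, so no verification request is sent

        # Quality verification request: only issued for selected samples.
        estimate = oracle(sample)

        # Store the sample together with its quality estimate.
        if slot == len(self.items):
            self.items.append((sample, estimate))
        else:
            self.items[slot] = (sample, estimate)  # replace an older entry
        return True
```

Deferring the oracle call until after the sampling decision keeps verification cost proportional to the number of samples actually stored, which matters if the oracle is slow or expensive (for example, a crowd).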

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method, comprising: receiving, by a processor, a data quality job, the data quality job including configuration data and a new data sample having a particular data type and one or more attributes associated therewith; determining, by the processor, whether to add the new data sample to a reservoir of data samples, the reservoir of data samples identified based at least in part on the particular data type; and in an instance in which the new data sample is to be added to the reservoir of data samples, sending, by the processor, a quality verification request including the new data sample to an oracle of a plurality of oracles, the oracle selected based on one or more of the particular data type and the one or more attributes associated with the new data sample; receiving, by the processor, a data quality estimate associated with the new data sample from the oracle in response to the quality verification request; and adding, by the processor, the new data sample and the associated data quality estimate to the reservoir of data samples in response to receiving the data quality estimate.
2. The method of claim 1, further comprising: updating, by the processor, the reservoir summary statistics.
3. The method of claim 2, wherein updating, by the processor, the reservoir summary statistics comprises: calculating an overall data quality estimate for the reservoir using data quality estimates respectively associated with each of the data samples; and calculating a statistical variance for the data samples.
4. The method of claim 2, wherein updating, by the processor, the reservoir summary statistics further comprises: logging the updated reservoir summary statistics in persistent storage.
5. The method of claim 2, further comprising: receiving, by the processor, corpus summary statistics calculated for a corpus of previously collected data samples, wherein each of the previously collected data samples are respectively associated with the particular data type; and generating, by the processor, an analysis comparing the updated reservoir summary statistics with the corpus summary statistics.
6. The method of claim 1, wherein determining whether to add the new data sample to the reservoir is based on at the value of at least one of the attributes of the new data sample.
7. The method of claim 1, wherein determining whether to add the new data sample to the reservoir is based on a probabilistic sampling approach.
8. The method of claim 1, wherein the oracle is a crowd, a flat file of previously received crowd data verification results, or a software system.
9. The method of claim 1, wherein the new data sample is collected from a data stream.
10. The method of claim 9, wherein the new data sample is a single data instance or a set of data instances collected from the data stream within a pre-defined time window.
11. The method of claim 1, wherein the new data sample has been pre-processed by a data cleaning process.
12. A computer program product, stored on a non-transitory computer readable medium, comprising instructions that when executed on one or more computers cause the one or more computers to: receive, by a processor, a data quality job, the data quality job including configuration data and a new data sample having a particular data type and one or more attributes associated therewith; determine, by the processor, whether to add the new data sample to a reservoir of data samples, the reservoir of data samples identified based at least in part on the particular data type; and in an instance in which the new data sample is to be added to the reservoir of data samples, send, by the processor, a quality verification request including the new data sample to an oracle of a plurality of oracles, the oracle selected based on one or more of the particular data type and the one or more attributes associated with the new data sample; receive, by the processor, a data quality estimate associated with the new data sample from the oracle in response to the quality verification request; and add, by the processor, the new data sample and the associated data quality estimate to the reservoir of data samples in response to receiving the data quality estimate.
13. The computer program product of claim 12, wherein the instructions that when executed on one or more computers further cause the one or more computers to: update the reservoir summary statistics.
14. The computer program product of claim 13, wherein updating the reservoir summary statistics comprises: calculating an overall data quality estimate for the reservoir using data quality estimates respectively associated with each of the data samples; and calculating a statistical variance for the data samples.
15. The computer program product of claim 13, wherein updating the reservoir summary statistics further comprises: logging the updated reservoir summary statistics in persistent storage.
16. The computer program product of claim 13, wherein the instructions that when executed on one or more computers further cause the one or more computers to: receive corpus summary statistics calculated for a corpus of previously collected data samples, wherein each of the previously collected data samples are respectively associated with the particular data type; and generate an analysis comparing the updated reservoir summary statistics with the corpus summary statistics.
17. The computer program product of claim 12, wherein determining whether to add the new data sample to the reservoir is based on at the value of at least one of the attributes of the new data sample.
18. The computer program product of claim 12, wherein determining whether to add the new data sample to the reservoir is based on a probabilistic sampling approach.
19. The computer program product of claim 12, wherein the oracle is a crowd, a flat file of previously received crowd data verification results, or a software system.
20. The computer program product of claim 12, wherein the new data sample is collected from a data stream.
21. The computer program product of claim 20, wherein the new data sample is a single data instance or a set of data instances collected from the data stream within a pre-defined time window.
22. The computer program product of claim 12, wherein the new data sample has been pre-processed by a data cleaning process.
23. A system, comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to: receive, by a processor, a data quality job, the data quality job including configuration data and a new data sample having a particular data type and one or more attributes associated therewith; determine, by the processor, whether to add the new data sample to a reservoir of data samples, the reservoir of data samples identified based at least in part on the particular data type; and in an instance in which the new data sample is to be added to the reservoir of data samples, send, by the processor, a quality verification request including the new data sample to an oracle of a plurality of oracles, the oracle selected based on one or more of the particular data type and the one or more attributes associated with the new data sample; receive, by the processor, a data quality estimate associated with the new data sample from t…
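The summary-statistics steps in claims 2, 3, and 5 (and their counterparts in claims 13, 14, and 16) might look something like the sketch below. It rests on assumptions the claim text does not settle: the "overall data quality estimate" is taken to be the mean of the per-sample estimates, the "statistical variance" is taken over those same estimates, and the comparison with corpus statistics is reduced to simple differences. `SummaryStats`, `summarize`, and `compare` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance
from typing import Dict, Sequence


@dataclass(frozen=True)
class SummaryStats:
    quality: float   # assumed: mean of the per-sample quality estimates
    variance: float  # assumed: population variance of those estimates
    count: int       # number of samples summarized


def summarize(estimates: Sequence[float]) -> SummaryStats:
    """Compute summary statistics from the reservoir's quality estimates."""
    if not estimates:
        raise ValueError("cannot summarize an empty reservoir")
    return SummaryStats(fmean(estimates), pvariance(estimates), len(estimates))


def compare(reservoir: SummaryStats, corpus: SummaryStats) -> Dict[str, float]:
    """Assumed form of the analysis: signed differences, reservoir minus corpus."""
    return {
        "quality_delta": reservoir.quality - corpus.quality,
        "variance_delta": reservoir.variance - corpus.variance,
    }
```

A negative `quality_delta` would suggest that recently sampled data scores lower than the historical corpus; how large a gap counts as significant is left open here.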

Assignees

Groupon Inc

Inventors

Classifications

G06F16/215
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
G06F16/2358
Change logging, detection, and notification (replication G06F16/27) · CPC title
G06F16/2365
Ensuring data consistency and integrity · CPC title
G06N20/00 · Primary
Machine learning · CPC title
G06N5/02 · Primary
Knowledge representation; Symbolic representation · CPC title

Patent family

Related publications grouped by family.

View patent family 57860115

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10360516B2 cover?: In general, embodiments of the present invention provide systems, methods and computer readable media for automated dynamic data quality assessment. One aspect of the subject matter described in this specification includes the actions of receiving a data quality job including a new data sample; and, if the new data sample is determined to be added to a reservoir of data samples, sending a quali…
Who is the assignee on this patent?: Groupon Inc
What technology area does this patent fall under?: Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?: Publication date Jul 23, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).