Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification G06F16/2282. Mapped technology areas include Physics.

When was this patent published?

Publication date: Oct 26, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Leveraging a collection of training tables to accurately predict errors within a variety of tables

US11157479B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11157479-B2
Application number	US-201916378155-A
Country	US
Kind code	B2
Filing date	Apr 8, 2019
Priority date	Apr 8, 2019
Publication date	Oct 26, 2021
Grant date	Oct 26, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to systems, methods, and computer-readable media for using a variety of hypothesis tests to identify errors within tables and other structured datasets. For example, systems disclosed herein can generate a modified table from an input table by removing one or more entries from the input table. The systems disclosed herein can further leverage a collection of training tables to determine probabilities associated with whether the input table and modified table are drawn from the collection of training tables. The systems disclosed herein can additionally compare the probabilities to accurately determine whether the one or more entries include errors therein. The systems disclosed herein may apply to a variety of different sizes and types of tables to identify different types of common errors within input tables.
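The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the abstract does not name a specific statistic, so the metric (`max_mad_score`, the largest robust z-score in a column), the smoothed empirical probability, the helper names, and the `ratio_threshold` value are all assumptions made for the example. The sketch treats a single numeric column as the "table".

```python
from statistics import median


def max_mad_score(values):
    """Hypothetical similarity metric: the largest robust z-score in a column."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return 0.0
    return max(abs(v - med) / mad for v in values)


def probability_drawn_from(values, training_columns):
    """Share of training columns at least as extreme as `values` (add-one smoothed).

    Stands in for "the probability that the table is drawn from the
    collection of training tables"; the estimator itself is an assumption.
    """
    score = max_mad_score(values)
    at_least = sum(1 for col in training_columns if max_mad_score(col) >= score)
    return (at_least + 1) / (len(training_columns) + 1)


def compare_with_removal(values, training_columns, remove_idx, ratio_threshold=0.5):
    """Remove entries, then compare the two probabilities as a ratio.

    Returns (flagged, ratio). A small ratio means the table looks much more
    like the training tables once the removed entries are gone.
    """
    modified = [v for i, v in enumerate(values) if i not in remove_idx]
    p_input = probability_drawn_from(values, training_columns)
    p_modified = probability_drawn_from(modified, training_columns)
    ratio = p_input / p_modified
    return ratio <= ratio_threshold, ratio


# Toy "clean" training columns and an input column with one suspicious value.
training = [
    [10, 11, 12, 13, 14],
    [20, 22, 21, 23, 19],
    [5, 6, 7, 8, 9],
    [100, 101, 99, 102, 98],
]
column = [10, 11, 12, 13, 1400]
flagged, ratio = compare_with_removal(column, training, remove_idx={4})
print(flagged, ratio)
```

In this toy setup, removing the value 1400 makes the column look like the training columns again, so the ratio falls below the threshold and the entry is flagged. Removing an ordinary value leaves both probabilities about equal, so nothing is flagged.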

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: receiving an input table comprising a plurality of entries, wherein each entry of the plurality of entries comprises an associated value; removing one or more entries from the plurality of entries to generate a modified input table; determining a first metric of similarity based on a comparison of a first distribution of values from the input table and distributions of values from a plurality of training tables, the plurality of training tables including a set of reference tables presumed to have clean data; determining a second metric of similarity based on a comparison of a second distribution of values from the modified input table and distributions of values from the plurality of training tables: determining a first probability that the input table is drawn from a plurality of training tables based on the first metric of similarity; determining a second probability that the modified input table is drawn from the plurality of training tables based on the second metric of similarity; determining that the one or more entries removed from the input table contain an error based on a comparison of the first probability and the second probability; and providing, via a graphical user interface of a client device, an indication of the error in conjunction with a display of the one or more entries.

2. The method of claim 1, further comprising identifying the plurality of training tables by identifying a subset of training tables from a collection of training tables based on one or more shared features of the input table and the subset of training tables.

3. The method of claim 2, wherein the one or more shared features comprise one or more of: a datatype of the plurality of entries; a number of entries from the plurality of entries; a number of rows of entries from the plurality of entries; or a value prevalence associated with values from the plurality of entries.

4. The method of claim 1, further comprising selectively identifying the one or more entries from the plurality of entries based on outlying values for the one or more entries relative to values of additional entries from the plurality of entries.

5. The method of claim 1, further comprising: identifying a threshold perturbation value for generating the modified input table, the maximum perturbation value indicating a threshold number or a threshold percentage of entries to remove from the plurality of entries when generating the modified input table; and selectively identifying a number of the one or more entries to remove from the plurality of entries based on the threshold perturbation value.

6. The method of claim 1, further comprising identifying the one or more entries by applying a minimization model to the input table, wherein the minimization model identifies the one or more entries based on a threshold expected ratio between the first probability and the second probability.

7. The method of claim 1, wherein determining that the one or more entries removed from the input table contain the error comprises: calculating a ratio between the first probability and the second probability; and determining that the one or more entries contain the error based on the calculated ratio.

8. The method of claim 1, further comprising: tagging the one or more entries of the input table; and providing an indication of the tagging via the graphical user interface of the client device in conjunction with a presentation of the input table.

9. A method, comprising: accessing a collection of training tables, wherein the collection of training tables comprises a plurality of training tables organized in rows and columns of entry values, the collection of training tables includes a set of reference tables presumed to have clean data; and training a table perturbation model based on the collection of training tables that, when applied to a given table, selectively identifies one or more errors within entries of the given table by: generating a modified table by removing one or more entries from the given table; determining a first metric of similarity based on a comparison of a first distribution of values from the given table and distributions of values from the collection of training tables; determining a second metric of similarity based on a comparison of a second distribution of values from the modified table and distributions of values from the collection of training tables; determining a first probability that the given table is drawn from the collection of training tables based on the first metric of similarity; determining a second probability that the modified table is drawn from the collection of training tables based on the second metric of similarity; and determining that the one or more entries from the given table contains an error based on a comparison of the first probability and the second probability applying the table perturbation model to an input table comprising a plurality of table entries to identify one or more errors within the plurality of table entries.

10. The method of claim 9, wherein applying the table perturbation model to the input table causes a client device to provide, via a graphical user interface, an indication of the one or more errors in conjunction with a display of the plurality of table entries.

11. The method of claim 10, further comprising: identifying the one or more errors within the plurality of table entries based on applying the table perturbation model to respective columns of the input table; tagging one or more entries of the plurality of table entries associated with the identified one or more errors; and providing an indication of tagging via the graphical user interface of the client device in conjunction with a presentation of the input table.

12. The method of claim 9, wherein the table perturbation model is further trained to selectively identify the one or more errors within entries of the given table by identifying a subset of training tables from the collection of training tables based on one or more shared features of the given table and the subset of training tables.

13. The method of claim 9, further comprising providing the table perturbation model to a computing device to enable the computing device to locally apply the table perturbation model to an input table accessible to the computing device.

14. The method of claim 9, wherein the table perturbation model is further trained to: identify a threshold perturbation value for generating the modified table, the maximum perturbation value indicating a threshold number or a threshold percentage of entries to remove from the given table in generating the modified table, wherein the threshold perturbation value is based on one or more of a number of entries of the given table or a datatype of entries from the given table.

15. A system, comprising: one or more processors; memory in electronic communication with the one or more processors; and instructions stored in the memory, the instructions being executable by the one or more processors to cause a computing device to: receive an input table comprising a plurality of entries, wherein each entry of the plurality of entries comprises an associated value; remove one or more entries from the plurality of entries to generate a modified input table; determine a first metric of similarity based on a comparison of a first distribution of values from the input table and distributions of values from a plurality of training tables, the plurality of training tables including a set of reference tables presumed to have clea
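Two of the supporting steps in the dependent claims lend themselves to a short sketch: choosing a subset of training tables that share features with the input (claims 2 and 3), and capping how many entries may be removed (claim 5). The code below is only an illustration under stated assumptions. The function names, the two features used (datatype and row count), the `row_tolerance` rule, and the way a count limit and a percentage limit are combined are all hypothetical and not taken from the patent text.

```python
def column_features(values):
    """Return (datatype, number of rows) for a column; a deliberately coarse feature set."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return ("numeric" if numeric else "string", len(values))


def select_training_subset(input_values, collection, row_tolerance=0.5):
    """Keep training columns with the same datatype and a similar row count.

    `row_tolerance` is an assumed knob: a training column qualifies when its
    length differs from the input length by at most that fraction.
    """
    dtype, n_rows = column_features(input_values)
    return [
        col
        for col in collection
        if column_features(col)[0] == dtype
        and abs(len(col) - n_rows) <= row_tolerance * n_rows
    ]


def removal_budget(n_entries, max_count=None, max_percent=None):
    """Largest number of entries that may be removed when building the modified table.

    Accepts a threshold number, a threshold percentage, or both; when both
    are given, the stricter limit wins (an assumption for this sketch).
    """
    limits = []
    if max_count is not None:
        limits.append(max_count)
    if max_percent is not None:
        limits.append(n_entries * max_percent // 100)
    if not limits:
        raise ValueError("provide max_count, max_percent, or both")
    return min(limits)


collection = [[1, 2, 3, 4], ["a", "b", "c", "d"], [1.0] * 20, [5, 6, 7]]
subset = select_training_subset([9, 8, 7, 6], collection)
budget = removal_budget(200, max_percent=5)
print(subset, budget)
```

Filtering the collection first keeps the comparison like-for-like, for example by not scoring a four-row numeric column against string columns or much longer tables. The budget keeps the perturbation small relative to the table size.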

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G06F16/215
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
G06F16/2282 (Primary)
Tablespace storage structures; Management thereof · CPC title
G06N20/00
Machine learning · CPC title
G06F17/18
for evaluating statistical data {, e.g. average values, frequency distributions, probability functions, regression analysis (forecasting specially adapted for a specific administrative, business or logistic context G06Q10/04)} · CPC title
G06F40/177 (Primary)
of tables; using ruled lines · CPC title

Patent family

Related publications grouped by family.

View patent family 70166159

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11157479B2 cover?: The present disclosure relates to systems, methods, and computer-readable media for using a variety of hypothesis tests to identify errors within tables and other structured datasets. For example, systems disclosed herein can generate a modified table from an input table by removing one or more entries from the input table. The systems disclosed herein can further leverage a collection of train…