Novel arabic spell checking error model
US-2019188255-A1 · Jun 20, 2019 · US
US11157479B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11157479-B2 |
| Application number | US-201916378155-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 8, 2019 |
| Priority date | Apr 8, 2019 |
| Publication date | Oct 26, 2021 |
| Grant date | Oct 26, 2021 |
Official abstract text for this publication.
The present disclosure relates to systems, methods, and computer-readable media for using a variety of hypothesis tests to identify errors within tables and other structured datasets. For example, systems disclosed herein can generate a modified table from an input table by removing one or more entries from the input table. The systems disclosed herein can further leverage a collection of training tables to determine probabilities associated with whether the input table and modified table are drawn from the collection of training tables. The systems disclosed herein can additionally compare the probabilities to accurately determine whether the one or more entries include errors therein. The systems disclosed herein may apply to a variety of different sizes and types of tables to identify different types of common errors within input tables.
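The comparison described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent text: the similarity metric (largest deviation from the median in units of the median absolute deviation), the smoothed empirical probability, the function names `max_mad_score`, `drawn_probability`, and `flag_suspect_entry`, and the default `ratio_threshold` are all hypothetical. The sketch handles a single numeric column and removes only the single most outlying entry.

```python
from statistics import median


def max_mad_score(column):
    """Hypothetical similarity metric: the largest deviation from the median,
    measured in units of the median absolute deviation (MAD)."""
    med = median(column)
    deviations = [abs(v - med) for v in column]
    mad = median(deviations)
    if mad == 0:
        return 0.0
    return max(deviations) / mad


def drawn_probability(column, training_columns):
    """Assumed estimate of the probability that `column` is drawn from the
    training collection: the smoothed fraction of training columns whose
    metric is at least as extreme as the metric of `column`."""
    score = max_mad_score(column)
    at_least = sum(1 for t in training_columns if max_mad_score(t) >= score)
    return (at_least + 1) / (len(training_columns) + 1)


def flag_suspect_entry(column, training_columns, ratio_threshold=0.5):
    """Remove the most outlying entry, then compare the probability of the
    original column with the probability of the modified column."""
    med = median(column)
    idx = max(range(len(column)), key=lambda i: abs(column[i] - med))
    modified = column[:idx] + column[idx + 1:]
    p_original = drawn_probability(column, training_columns)
    p_modified = drawn_probability(modified, training_columns)
    ratio = p_original / p_modified
    # A small ratio means the column looks far more typical without the entry.
    return idx, ratio, ratio < ratio_threshold


# Illustrative data only: columns assumed to be clean, plus one input column.
training = [
    [10, 11, 12, 13, 14],
    [5, 6, 7, 8, 9, 10],
    [100, 102, 101, 99, 98],
    [1, 2, 3, 4, 5, 6, 7],
]
suspect_column = [20, 21, 22, 23, 2300, 24, 25]
clean_column = [20, 21, 22, 23, 24, 25, 26]

suspect_idx, suspect_ratio, suspect_flag = flag_suspect_entry(suspect_column, training)
clean_idx, clean_ratio, clean_flag = flag_suspect_entry(clean_column, training)
```

With this toy data, removing the value 2300 makes the first column look much more like the training columns, so the ratio falls below the threshold and the entry is flagged. Removing an entry from the second column changes little, so nothing is flagged. A real system would likely use richer metrics and consider more than one candidate entry.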
Opening claim text (preview).
What is claimed is:
1. A method, comprising: receiving an input table comprising a plurality of entries, wherein each entry of the plurality of entries comprises an associated value; removing one or more entries from the plurality of entries to generate a modified input table; determining a first metric of similarity based on a comparison of a first distribution of values from the input table and distributions of values from a plurality of training tables, the plurality of training tables including a set of reference tables presumed to have clean data; determining a second metric of similarity based on a comparison of a second distribution of values from the modified input table and distributions of values from the plurality of training tables; determining a first probability that the input table is drawn from a plurality of training tables based on the first metric of similarity; determining a second probability that the modified input table is drawn from the plurality of training tables based on the second metric of similarity; determining that the one or more entries removed from the input table contain an error based on a comparison of the first probability and the second probability; and providing, via a graphical user interface of a client device, an indication of the error in conjunction with a display of the one or more entries.
2. The method of claim 1, further comprising identifying the plurality of training tables by identifying a subset of training tables from a collection of training tables based on one or more shared features of the input table and the subset of training tables.
3. The method of claim 2, wherein the one or more shared features comprise one or more of: a datatype of the plurality of entries; a number of entries from the plurality of entries; a number of rows of entries from the plurality of entries; or a value prevalence associated with values from the plurality of entries.
4. The method of claim 1, further comprising selectively identifying the one or more entries from the plurality of entries based on outlying values for the one or more entries relative to values of additional entries from the plurality of entries.
5. The method of claim 1, further comprising: identifying a threshold perturbation value for generating the modified input table, the maximum perturbation value indicating a threshold number or a threshold percentage of entries to remove from the plurality of entries when generating the modified input table; and selectively identifying a number of the one or more entries to remove from the plurality of entries based on the threshold perturbation value.
6. The method of claim 1, further comprising identifying the one or more entries by applying a minimization model to the input table, wherein the minimization model identifies the one or more entries based on a threshold expected ratio between the first probability and the second probability.
7. The method of claim 1, wherein determining that the one or more entries removed from the input table contain the error comprises: calculating a ratio between the first probability and the second probability; and determining that the one or more entries contain the error based on the calculated ratio.
8. The method of claim 1, further comprising: tagging the one or more entries of the input table; and providing an indication of the tagging via the graphical user interface of the client device in conjunction with a presentation of the input table.
9. A method, comprising: accessing a collection of training tables, wherein the collection of training tables comprises a plurality of training tables organized in rows and columns of entry values, the collection of training tables includes a set of reference tables presumed to have clean data; and training a table perturbation model based on the collection of training tables that, when applied to a given table, selectively identifies one or more errors within entries of the given table by: generating a modified table by removing one or more entries from the given table; determining a first metric of similarity based on a comparison of a first distribution of values from the given table and distributions of values from the collection of training tables; determining a second metric of similarity based on a comparison of a second distribution of values from the modified table and distributions of values from the collection of training tables; determining a first probability that the given table is drawn from the collection of training tables based on the first metric of similarity; determining a second probability that the modified table is drawn from the collection of training tables based on the second metric of similarity; and determining that the one or more entries from the given table contains an error based on a comparison of the first probability and the second probability; and applying the table perturbation model to an input table comprising a plurality of table entries to identify one or more errors within the plurality of table entries.
10. The method of claim 9, wherein applying the table perturbation model to the input table causes a client device to provide, via a graphical user interface, an indication of the one or more errors in conjunction with a display of the plurality of table entries.
11. The method of claim 10, further comprising: identifying the one or more errors within the plurality of table entries based on applying the table perturbation model to respective columns of the input table; tagging one or more entries of the plurality of table entries associated with the identified one or more errors; and providing an indication of tagging via the graphical user interface of the client device in conjunction with a presentation of the input table.
12. The method of claim 9, wherein the table perturbation model is further trained to selectively identify the one or more errors within entries of the given table by identifying a subset of training tables from the collection of training tables based on one or more shared features of the given table and the subset of training tables.
13. The method of claim 9, further comprising providing the table perturbation model to a computing device to enable the computing device to locally apply the table perturbation model to an input table accessible to the computing device.
14. The method of claim 9, wherein the table perturbation model is further trained to: identify a threshold perturbation value for generating the modified table, the maximum perturbation value indicating a threshold number or a threshold percentage of entries to remove from the given table in generating the modified table, wherein the threshold perturbation value is based on one or more of a number of entries of the given table or a datatype of entries from the given table.
15. A system, comprising: one or more processors; memory in electronic communication with the one or more processors; and instructions stored in the memory, the instructions being executable by the one or more processors to cause a computing device to: receive an input table comprising a plurality of entries, wherein each entry of the plurality of entries comprises an associated value; remove one or more entries from the plurality of entries to generate a modified input table; determine a first metric of similarity based on a comparison of a first distribution of values from the input table and distributions of values from a plurality of training tables, the plurality of training tables including a set of reference tables presumed to have clea
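Two of the supporting steps in the claims could be sketched as follows: choosing a subset of training tables that share features with the input (claims 2, 3, and 12) and bounding how many entries may be removed (claims 5 and 14). This is only an illustration under stated assumptions. The helper names `column_datatype`, `select_training_subset`, and `removal_budget`, the coarse two-way datatype label, the `size_tolerance` rule, and the choice to combine a count cap with a percentage cap are all hypothetical; the claims name the features and the threshold but not how they are computed.

```python
import math


def column_datatype(column):
    """Coarse, assumed datatype label: 'numeric' when every value is an int
    or float (booleans excluded), otherwise 'text'."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
    )
    return "numeric" if numeric else "text"


def select_training_subset(input_column, collection, size_tolerance=0.5):
    """Keep training columns that share the input's datatype and whose number
    of entries is within `size_tolerance` (a fraction) of the input's."""
    dtype = column_datatype(input_column)
    n = len(input_column)
    return [
        c
        for c in collection
        if column_datatype(c) == dtype and abs(len(c) - n) <= size_tolerance * n
    ]


def removal_budget(num_entries, max_fraction=0.1, max_count=5):
    """Assumed threshold perturbation value: at most `max_fraction` of the
    entries and at most `max_count` entries, but always at least one."""
    return max(1, min(max_count, math.floor(num_entries * max_fraction)))


# Illustrative data only.
collection = [
    [1, 2, 3, 4],
    ["a", "b", "c", "d"],
    [1.5, 2.5, 3.5],
    list(range(100)),
]
subset = select_training_subset([10, 20, 30, 40, 50], collection)
small_budget = removal_budget(5)
large_budget = removal_budget(200)
```

Here the text column and the 100-entry column are filtered out, leaving two short numeric columns for comparison. Taking the smaller of the count cap and the percentage cap is one possible reading of "a threshold number or a threshold percentage"; either cap could also be used alone.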
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
Tablespace storage structures; Management thereof · CPC title
Machine learning · CPC title
for evaluating statistical data {, e.g. average values, frequency distributions, probability functions, regression analysis (forecasting specially adapted for a specific administrative, business or logistic context G06Q10/04)} · CPC title
of tables; using ruled lines · CPC title