Abnormal data detection
US-2020133999-A1 · Apr 30, 2020 · US
US11061994B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11061994-B2 |
| Application number | US-201916257741-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 25, 2019 |
| Priority date | Jan 26, 2018 |
| Publication date | Jul 13, 2021 |
| Grant date | Jul 13, 2021 |
Official abstract text for this publication.
This specification describes techniques for detecting abnormal data in a data set. One example method includes obtaining, by a data processing platform, a to-be-validated data group including to-be-validated data corresponding to a predetermined feature; obtaining, by the data processing platform, a comparison data group including historical data associated with the to-be-validated data group, wherein the historical and the to-be-validated data are from a same data source; performing, by the data processing platform, a two-group significance test on the to-be-validated data group and the comparison data group to generate a test result; and determining, by the data processing platform, whether there is abnormal data in the to-be-validated data group based on the test result.
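The abstract does not name a specific significance test. As a minimal sketch, assuming a large-sample two-sided z-test on the group means, the comparison could look like the following. The function names, the choice of z-test, and the sample data are illustrative assumptions; the 0.01% cutoff is taken from claim 5 of this publication.

```python
import math
from statistics import mean, variance


def two_group_p_value(sample_a, sample_b):
    """Two-sided p-value for the hypothesis that both groups share one mean.

    Assumption: a large-sample z approximation; the patent text shown here
    does not say which two-group significance test is used.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    std_err = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    diff = mean(sample_a) - mean(sample_b)
    if std_err == 0.0:
        # Both groups are constant: identical means or a certain difference.
        return 1.0 if diff == 0 else 0.0
    z = diff / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def has_abnormal_data(to_be_validated, comparison, threshold=0.0001):
    """Flag the group when the p-value is not greater than 0.01% (claim 5)."""
    return two_group_p_value(to_be_validated, comparison) <= threshold


# Hypothetical data: historical values and a new batch from the same source.
historical = [100 + (i % 7) for i in range(200)]
new_batch = [100 + ((i + 3) % 7) for i in range(200)]
shifted_batch = [x + 5 for x in new_batch]

print(has_abnormal_data(new_batch, historical))
print(has_abnormal_data(shifted_batch, historical))
```

A z approximation keeps the sketch within the standard library; for small groups a t-based test would usually be the more appropriate choice.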
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: obtaining, by a data processing platform and from a second data platform, a to-be-validated data group including to-be-validated data corresponding to a predetermined feature, wherein the second data platform collects initial data and modifies the initial data by an encryption process that encrypts the initial data to generate the to-be-validated data; performing, by the data processing platform, a data preprocessing operation on the to-be-validated data group, comprising dividing the to-be-validated data group into a plurality of to-be-validated sub data groups each having a smaller size than that of the to-be-validated data group, and using one of the to-be-validated sub data group as the to-be-validated data group, or transforming data in the to-be-validated data group to have a predefined distribution by performing a corresponding data transformation on the data based on a distribution feature of the data; obtaining, by the data processing platform, a comparison data group including historical data associated with the to-be-validated data group, wherein the historical data and the to-be-validated data are from a same data source; performing, by the data processing platform, a two-group significance test on the to-be-validated data group and the comparison data group to generate a test result that is indicative of a degree of difference between the to-be-validated data group and the comparison data group; determining, by the data processing platform, that there is abnormal data in the to-be-validated data group based on the test result; in response, dividing, by the data processing platform and according to a predetermined data division rule, the to-be-validated data group into a plurality of to-be-validated sub data groups; performing, by the data processing platform, the two-group significance test on each to-be-validated sub data group of the plurality of to-be-validated sub data groups and the comparison data group to generate new test results; and determining, by the data processing platform, whether each to-be-validated sub data group includes abnormal data based on the new test results.
2. The computer-implemented method of claim 1, wherein obtaining the comparison data group including the historical data includes: obtaining a plurality of groups of historical data associated with the to-be-validated data group; performing a two-group significance test on each of two groups of the historical data; and determining a group of the historical data that contains no abnormal data as a comparison data group based on a test result of the two-group significance test.
3. The computer-implemented method of claim 1, wherein the predetermined feature defines a type of numerical values within a predetermined range.
4. The computer-implemented method of claim 1, wherein performing a two-group significance test includes determining a probability that a population mean associated with the to-be-validated data group is a same with a population mean associated with the comparison data group.
5. The computer-implemented method of claim 4, wherein it is determined that there is no abnormal data in the to-be-validated data group if the probability is greater than 0.01%.
6. The computer-implemented method of claim 1, wherein the predefined distribution is a normal distribution, and wherein the corresponding data transformation comprises one or more of a logarithmic transformation, a square root transformation, a reciprocal transformation, or an arcsine square root transformation.
7. A non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform operations comprising: obtaining, by a data processing platform and from a second data platform, a to-be-validated data group including to-be-validated data corresponding to a predetermined feature, wherein the second data platform collects initial data and modifies the initial data by an encryption process that encrypts the initial data to generate the to-be-validated data; performing, by the data processing platform, a data preprocessing operation on the to-be-validated data group, comprising dividing the to-be-validated data group into a plurality of to-be-validated sub data groups each having a smaller size than that of the to-be-validated data group, and using one of the to-be-validated sub data group as the to-be-validated data group, or transforming data in the to-be-validated data group to have a predefined distribution by performing a corresponding data transformation on the data based on a distribution feature of the data; obtaining, by the data processing platform, a comparison data group including historical data associated with the to-be-validated data group, wherein the historical data and the to-be-validated data are from a same data source; performing, by the data processing platform, a two-group significance test on the to-be-validated data group and the comparison data group to generate a test result that is indicative of a degree of difference between the to-be-validated data group and the comparison data group; determining, by the data processing platform, that there is abnormal data in the to-be-validated data group based on the test result; in response, dividing, by the data processing platform and according to a predetermined data division rule, the to-be-validated data group into a plurality of to-be-validated sub data groups; performing, by the data processing platform, the two-group significance test on each to-be-validated sub data group of the plurality of to-be-validated sub data groups and the comparison data group to generate new test results; and determining, by the data processing platform, whether each to-be-validated sub data group includes abnormal data based on the new test results.
8. The non-transitory, computer-readable medium of claim 7, wherein obtaining the comparison data group including the historical data includes: obtaining a plurality of groups of historical data associated with the to-be-validated data group; performing a two-group significance test on each of two groups of the historical data; and determining a group of the historical data that contains no abnormal data as a comparison data group based on a test result of the two-group significance test.
9. The non-transitory, computer-readable medium of claim 7, wherein the predetermined feature defines a type of numerical values within a predetermined range.
10. The non-transitory, computer-readable medium of claim 7, wherein performing a two-group significance test includes determining a probability that a population mean associated with the to-be-validated data group is a same with a population mean associated with the comparison data group.
11. The non-transitory, computer-readable medium of claim 10, wherein it is determined that there is no abnormal data in the to-be-validated data group if the probability is greater than 0.01%.
12. The non-transitory, computer-readable medium of claim 7, wherein the predefined distribution is a normal distribution, and wherein the corresponding data transformation comprises one or more of a logarithmic transformation, a square root transformation, a reciprocal transformation, or an arcsine square root transformation.
13. A computer-implemented system, comprising: one or more computers; and one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing one or more instructions that, when executed by the one or more computers, perform o
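The drill-down recited in the independent claims (test the whole group, then divide it and test each sub group against the same comparison group) and the transformations listed in the dependent claims can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the z-test, the fixed-size contiguous division rule, and every function and variable name here are hypothetical choices made for the example.

```python
import math
from statistics import mean, variance

# The four transformations named in the claims; each has a domain restriction.
TRANSFORMS = {
    "log": math.log,                                     # positive values only
    "sqrt": math.sqrt,                                   # non-negative values only
    "reciprocal": lambda x: 1.0 / x,                     # non-zero values only
    "arcsine_sqrt": lambda x: math.asin(math.sqrt(x)),   # proportions in [0, 1]
}


def transform(values, kind):
    """Apply one of the named transformations to every value."""
    return [TRANSFORMS[kind](v) for v in values]


def p_value(sample_a, sample_b):
    """Two-sided p-value for equal means (assumed large-sample z-test)."""
    std_err = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    diff = mean(sample_a) - mean(sample_b)
    if std_err == 0.0:
        return 1.0 if diff == 0 else 0.0
    return math.erfc(abs(diff / std_err) / math.sqrt(2))


def split_into_chunks(values, chunk_size):
    """Assumed division rule: contiguous, fixed-size sub groups."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def find_abnormal_subgroups(to_be_validated, comparison, chunk_size, threshold=0.0001):
    """Return indices of sub groups flagged as abnormal, or [] if the whole group passes."""
    if p_value(to_be_validated, comparison) > threshold:
        return []  # no abnormal data detected at the group level
    flagged = []
    for index, chunk in enumerate(split_into_chunks(to_be_validated, chunk_size)):
        # Sub groups with fewer than two values cannot be tested and are skipped.
        if len(chunk) >= 2 and p_value(chunk, comparison) <= threshold:
            flagged.append(index)
    return flagged


# Hypothetical data: the last quarter of the batch is shifted upward.
comparison = [100 + (i % 7) for i in range(200)]
batch = [100 + (i % 7) for i in range(150)] + [130 + (i % 7) for i in range(50)]
print(find_abnormal_subgroups(batch, comparison, chunk_size=50))
```

If a transformation is applied to the to-be-validated group before testing, applying the same transformation to the comparison group would presumably be needed for the means to remain comparable; the claim text shown here does not spell that out.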
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
for evaluating statistical data {, e.g. average values, frequency distributions, probability functions, regression analysis (forecasting specially adapted for a specific administrative, business or logistic context G06Q10/04)} · CPC title
Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection · CPC title
Validation; Performance evaluation; Active pattern learning techniques · CPC title
Error or fault detection not based on redundancy (power supply failures G06F1/30; network fault management H04L41/06) · CPC title