Neural architecture for self supervised event learning and anomaly detection
US-2020410322-A1 · Dec 31, 2020 · US
US11797565B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11797565-B2 |
| Application number | US-202016816778-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 12, 2020 |
| Priority date | Dec 30, 2019 |
| Publication date | Oct 24, 2023 |
| Grant date | Oct 24, 2023 |
Official abstract text for this publication.
Techniques are disclosed relating to data validation using encode values. In various embodiments, a data monitoring system may retrieve a plurality of datasets from a live database at a non-production datacenter. The data monitoring system may perform encoding operations on one or more of the plurality of datasets to generate encode values that correspond to the plurality of datasets. The data monitoring system may then retrieve an updated dataset, for example from an experimental database at the non-production datacenter, and perform validation operations to validate one or more characteristics of the updated dataset. For example, in some embodiments, the data monitoring system may retrieve the encode values corresponding to the plurality of datasets and use the encode values to validate the updated dataset. The data monitoring system may then generate a validation output indicative of a result of the validation operations.
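The encode-then-validate flow in the abstract can be illustrated with a small sketch. It assumes the simplest kind of encode value, a schema summary recording the number of data fields and the formats observed per field, and it is not the patent's implementation. The names `field_format`, `schema_encode`, and `validate_schema` are hypothetical, and Python type names stand in for field formats.

```python
from typing import Any


def field_format(value: Any) -> str:
    """Coarse format label for one field value (hypothetical helper)."""
    return type(value).__name__


def schema_encode(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize a dataset's schema: field count plus formats seen per field."""
    formats: dict[str, set[str]] = {}
    for record in records:
        for name, value in record.items():
            formats.setdefault(name, set()).add(field_format(value))
    return {"num_fields": len(formats), "formats": formats}


def validate_schema(baseline: dict[str, Any],
                    experimental_records: list[dict[str, Any]]) -> list[str]:
    """Compare an experimental dataset's schema with the baseline encode value.

    Returns a list of problems; an empty list means validation succeeded.
    """
    candidate = schema_encode(experimental_records)
    problems: list[str] = []
    if candidate["num_fields"] != baseline["num_fields"]:
        problems.append(
            f"field count {candidate['num_fields']} != {baseline['num_fields']}"
        )
    for name, seen in candidate["formats"].items():
        allowed = baseline["formats"].get(name)
        if allowed is None:
            problems.append(f"unexpected field: {name}")
        elif not seen <= allowed:
            problems.append(f"field {name}: new formats {sorted(seen - allowed)}")
    for name in baseline["formats"]:
        if name not in candidate["formats"]:
            problems.append(f"missing field: {name}")
    return problems


# Illustrative data only; not taken from the patent.
current = [{"id": 1, "country": "US"}, {"id": 2, "country": "DE"}]
experimental_ok = [{"id": 3, "country": "FR"}]
experimental_bad = [{"id": "4", "country": "FR", "note": "x"}]

baseline = schema_encode(current)
publish_ok = not validate_schema(baseline, experimental_ok)
problems_bad = validate_schema(baseline, experimental_bad)
```

In this sketch, an empty problem list plays the role of the validation output that permits publication; a real system would presumably combine several such checks before allowing an update.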
Opening claim text (preview).
What is claimed is:
1. A method, comprising: accessing, by a data monitoring system, one or more current datasets used by a first live database at a production datacenter and a second live database at a non-production datacenter, wherein the first live database uses the one or more current datasets to support a production version of a web service for client use, and wherein the second live database uses the one or more current datasets to perform analytics on the production version of the web service; performing, by the data monitoring system, encoding operations on the one or more current datasets to generate encode values corresponding to the one or more current datasets; retrieving, by the data monitoring system, an experimental dataset from an experimental database at the non-production datacenter, wherein the one or more current datasets and the experimental dataset include data organized into multiple data records having values corresponding to multiple data fields, the one or more current datasets and the experimental dataset have respective dataset schemas, attributes of the dataset schemas include a number of data fields and formats of data fields, and the experimental dataset is a new or updated dataset as compared to the one or more current datasets; performing, by the data monitoring system, validation operations on the experimental dataset, wherein the validation operations include: retrieving the encode values corresponding to the one or more current datasets; and using the encode values to validate one or more characteristics of the experimental dataset; and in response to a determination of success of the validation operations, generating, by the data monitoring system, a validation output permitting publication of the experimental dataset to the first and second live databases for updating or modification of the first and second live databases.
2. The method of claim 1, wherein the encoding operations include: training an autoencoder machine learning model based on the one or more current datasets to generate a trained autoencoder.
3. The method of claim 2, wherein the validation operations further include: applying the trained autoencoder to the experimental dataset to detect one or more anomalous data records in the experimental dataset.
4. The method of claim 1, wherein the performing validation operations includes validating the dataset schema associated with the experimental dataset.
5. The method of claim 4, wherein the performing encoding operations includes training an autoencoder machine learning model using the one or more current datasets, wherein the encode values include a schema encode value that indicates one or more baseline attributes that correspond to the dataset schemas of the one or more current datasets.
6. The method of claim 5, wherein the validating the dataset schema associated with the experimental dataset includes: identifying one or more attributes associated with the dataset schema of the experimental dataset; and comparing the one or more attributes associated with the dataset schema of the experimental dataset to the one or more baseline attributes associated with the dataset schemas of the one or more current datasets.
7. The method of claim 1, wherein the performing validation operations includes validating an update pattern associated with one or more data records in the experimental dataset.
8. The method of claim 7, wherein the experimental dataset is an updated version of a first dataset, and wherein the one or more current datasets includes a historical version of the first dataset; and wherein the performing encoding operations includes encoding the historical version of the first dataset to generate update pattern encode values associated with the first dataset.
9. The method of claim 8, wherein the validating the update pattern includes comparing the one or more data records in the experimental dataset to the update pattern encode values associated with the first dataset.
10. A non-transitory, computer-readable medium having instructions stored thereon that are executable by a data monitoring system to perform operations comprising: accessing one or more current datasets used by a first live database at a production datacenter and a second live database at a non-production datacenter, wherein the first live database uses the one or more current datasets to support a production version of a web service for client use, and wherein the second live database uses the one or more current datasets to perform analytics on the production version of the web service; performing encoding operations on the one or more current datasets to generate encode values corresponding to the one or more current datasets; retrieving an experimental dataset from an experimental database at the non-production datacenter, wherein the one or more current datasets and the experimental dataset include data organized into multiple data records having values corresponding to multiple data fields, the one or more current datasets and the experimental dataset have respective dataset schemas, attributes of the dataset schemas include a number of data fields and formats of data fields, and the experimental dataset is a new or updated dataset of at least one of the one or more current datasets; performing validation operations on the experimental dataset, wherein the validation operations include: retrieving the encode values corresponding to the one or more current datasets; and using the encode values to validate one or more characteristics of the experimental dataset; and in response to a determination of success of the validation operations, generating a validation output permitting publication of the experimental dataset to the first and second live databases for updating or modification of the first and second live databases.
11. The non-transitory, computer-readable medium of claim 10, wherein the performing validation operations includes validating a value distribution associated with the experimental dataset.
12. The non-transitory, computer-readable medium of claim 11, wherein the performing encoding operations includes: training an autoencoder machine learning model based on the one or more current datasets to generate a trained autoencoder model; and calculating a first latent probability distribution across multiple data record keys corresponding to the one or more current datasets using the trained autoencoder model.
13. The non-transitory, computer-readable medium of claim 12, wherein the autoencoder machine learning model is a Deep Autoencoding Gaussian Mixture Model (DAGMM).
14. The non-transitory, computer-readable medium of claim 12, wherein the validating the value distribution associated with the experimental dataset includes validating numerical data in the experimental dataset, including by: applying the trained autoencoder model to the experimental dataset to calculate a second latent probability distribution across multiple data record keys corresponding to the experimental dataset; and comparing the first and second latent probability distributions.
15. The non-transitory, computer-readable medium of claim 10, wherein the performing validation operations includes validating a value format of string-type data included in the experimental dataset.
16. The non-transitory, computer-readable medium of claim 15, wherein the performing encoding operations includes: generating one or more regular expressions based on string-type data included in at least one of the one or more current datasets; and
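As an illustration of the string-format validation described in claims 15 and 16, the sketch below derives regular expressions from string values in a current dataset and flags experimental values that match none of them. The claim preview is cut off before it says how the expressions are used, so the matching step is an assumption, as are the generalization rule (digit runs and ASCII letter runs of fixed length) and the names `shape_pattern`, `learn_patterns`, and `invalid_strings`.

```python
from __future__ import annotations

import re

# Split a string into digit runs, ASCII letter runs, and single other characters.
TOKEN = re.compile(r"(?P<digits>\d+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def shape_pattern(text: str) -> str:
    """Generalize one string into a regex source (assumed rule, for illustration)."""
    parts = []
    for match in TOKEN.finditer(text):
        length = len(match.group())
        if match.lastgroup == "digits":
            parts.append(rf"\d{{{length}}}")
        elif match.lastgroup == "letters":
            parts.append(f"[A-Za-z]{{{length}}}")
        else:
            # Punctuation, whitespace, and other characters are kept literally.
            parts.append(re.escape(match.group()))
    return "".join(parts)


def learn_patterns(values: list[str]) -> list[re.Pattern]:
    """Build the set of distinct regexes seen in the current dataset's strings."""
    sources = sorted({shape_pattern(value) for value in values})
    return [re.compile(source) for source in sources]


def invalid_strings(patterns: list[re.Pattern], values: list[str]) -> list[str]:
    """Return experimental strings that fully match none of the learned regexes."""
    return [
        value
        for value in values
        if not any(pattern.fullmatch(value) for pattern in patterns)
    ]


# Illustrative values only; not taken from the patent.
current_values = ["US-2020-001", "DE-2021-042"]
experimental_values = ["FR-2022-007", "fr_2022_7"]

patterns = learn_patterns(current_values)
rejected = invalid_strings(patterns, experimental_values)
```

Here `rejected` contains only `"fr_2022_7"`, since its separators and digit counts differ from every format learned from the current values. Fixed-length runs make the check strict; a looser rule (for example `\d+`) would trade sensitivity for fewer false alarms.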
Auto-encoder networks; Encoder-decoder networks · CPC title
Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title
for performance assessment · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
Ensuring data consistency and integrity · CPC title