Data validation using encode values

US11797565B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11797565-B2
Application number  US-202016816778-A
Country             US
Kind code           B2
Filing date         Mar 12, 2020
Priority date       Dec 30, 2019
Publication date    Oct 24, 2023
Grant date          Oct 24, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are disclosed relating to data validation using encode values. In various embodiments, a data monitoring system may retrieve a plurality of datasets from a live database at a non-production datacenter. The data monitoring system may perform encoding operations on one or more of the plurality of datasets to generate encode values that correspond to the plurality of datasets. The data monitoring system may then retrieve an updated dataset, for example from an experimental database at the non-production datacenter, and perform validation operations to validate one or more characteristics of the updated dataset. For example, in some embodiments, the data monitoring system may retrieve the encode values corresponding to the plurality of datasets and use the encode values to validate the updated dataset. The data monitoring system may then generate a validation output indicative of a result of the validation operations.
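The flow in the abstract (encode the current datasets, then use the stored encode values to check an updated dataset and emit a validation output) can be illustrated with a small sketch. It covers only one characteristic, the dataset schema, and it assumes the encode value is a plain summary of field count and field formats. The function names `encode_schema` and `validate_schema`, the record layout, and the shape of the output dictionary are hypothetical and do not come from the patent, which also describes autoencoder-based encoding that this sketch does not attempt.

```python
def field_format(value):
    """Coarse format label for one field value (an illustrative stand-in)."""
    return type(value).__name__


def encode_schema(records):
    """Build a hypothetical schema encode value: field count plus per-field formats."""
    formats = {}
    for record in records:
        for name, value in record.items():
            formats.setdefault(name, set()).add(field_format(value))
    return {"num_fields": len(formats), "formats": formats}


def validate_schema(baseline, records):
    """Compare the updated dataset's schema attributes with the baseline encode value."""
    candidate = encode_schema(records)
    problems = []
    if candidate["num_fields"] != baseline["num_fields"]:
        problems.append(
            f"field count {candidate['num_fields']} != {baseline['num_fields']}"
        )
    for name, seen in candidate["formats"].items():
        expected = baseline["formats"].get(name)
        if expected is None:
            problems.append(f"unexpected field: {name}")
        elif not seen <= expected:
            problems.append(
                f"field {name}: formats {sorted(seen - expected)} not in baseline"
            )
    for name in baseline["formats"]:
        if name not in candidate["formats"]:
            problems.append(f"missing field: {name}")
    # The returned dictionary plays the role of the "validation output".
    return {"valid": not problems, "problems": problems}


# Made-up records standing in for current and updated datasets.
current = [{"id": 1, "country": "US"}, {"id": 2, "country": "DE"}]
updated_ok = [{"id": 3, "country": "FR"}]
updated_bad = [{"id": "4", "country": "FR", "extra": 1.0}]

baseline = encode_schema(current)
ok_result = validate_schema(baseline, updated_ok)
bad_result = validate_schema(baseline, updated_bad)
print(ok_result["valid"], bad_result["valid"])
for problem in bad_result["problems"]:
    print(problem)
```

In this sketch the baseline is computed once from the current data and reused for each candidate dataset, which mirrors the abstract's separation between encoding operations and validation operations.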

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: accessing, by a data monitoring system, one or more current datasets used by a first live database at a production datacenter and a second live database at a non-production datacenter, wherein the first live database uses the one or more current datasets to support a production version of a web service for client use, and wherein the second live database uses the one or more current datasets to perform analytics on the production version of the web service; performing, by the data monitoring system, encoding operations on the one or more current datasets to generate encode values corresponding to the one or more current datasets; retrieving, by the data monitoring system, an experimental dataset from an experimental database at the non-production datacenter, wherein the one or more current datasets and the experimental dataset include data organized into multiple data records having values corresponding to multiple data fields, the one or more current datasets and the experimental dataset have respective dataset schemas, attributes of the dataset schemas include a number of data fields and formats of data fields, and the experimental dataset is a new or updated dataset as compared to the one or more current datasets; performing, by the data monitoring system, validation operations on the experimental dataset, wherein the validation operations include: retrieving the encode values corresponding to the one or more current datasets; and using the encode values to validate one or more characteristics of the experimental dataset; and in response to a determination of success of the validation operations, generating, by the data monitoring system, a validation output permitting publication of the experimental dataset to the first and second live databases for updating or modification of the first and second live databases.

2. The method of claim 1, wherein the encoding operations include: training an autoencoder machine learning model based on the one or more current datasets to generate a trained autoencoder.

3. The method of claim 2, wherein the validation operations further include: applying the trained autoencoder to the experimental dataset to detect one or more anomalous data records in the experimental dataset.

4. The method of claim 1, wherein the performing validation operations includes validating the dataset schema associated with the experimental dataset.

5. The method of claim 4, wherein the performing encoding operations includes training an autoencoder machine learning model using the one or more current datasets, wherein the encode values include a schema encode value that indicates one or more baseline attributes that correspond to the dataset schemas of the one or more current datasets.

6. The method of claim 5, wherein the validating the dataset schema associated with the experimental dataset includes: identifying one or more attributes associated with the dataset schema of the experimental dataset; and comparing the one or more attributes associated with the dataset schema of the experimental dataset to the one or more baseline attributes associated with the dataset schemas of the one or more current datasets.

7. The method of claim 1, wherein the performing validation operations includes validating an update pattern associated with one or more data records in the experimental dataset.

8. The method of claim 7, wherein the experimental dataset is an updated version of a first dataset, and wherein the one or more current datasets includes a historical version of the first dataset; and wherein the performing encoding operations includes encoding the historical version of the first dataset to generate update pattern encode values associated with the first dataset.

9. The method of claim 8, wherein the validating the update pattern includes comparing the one or more data records in the experimental dataset to the update pattern encode values associated with the first dataset.

10. A non-transitory, computer-readable medium having instructions stored thereon that are executable by a data monitoring system to perform operations comprising: accessing one or more current datasets used by a first live database at a production datacenter and a second live database at a non-production datacenter, wherein the first live database uses the one or more current datasets to support a production version of a web service for client use, and wherein the second live database uses the one or more current datasets to perform analytics on the production version of the web service; performing encoding operations on the one or more current datasets to generate encode values corresponding to the one or more current datasets; retrieving an experimental dataset from an experimental database at the non-production datacenter, wherein the one or more current datasets and the experimental dataset include data organized into multiple data records having values corresponding to multiple data fields, the one or more current datasets and the experimental dataset have respective dataset schemas, attributes of the dataset schemas include a number of data fields and formats of data fields, and the experimental dataset is a new or updated dataset of at least one of the one or more current datasets; performing validation operations on the experimental dataset, wherein the validation operations include: retrieving the encode values corresponding to the one or more current datasets; and using the encode values to validate one or more characteristics of the experimental dataset; and in response to a determination of success of the validation operations, generating a validation output permitting publication of the experimental dataset to the first and second live databases for updating or modification of the first and second live databases.

11. The non-transitory, computer-readable medium of claim 10, wherein the performing validation operations includes validating a value distribution associated with the experimental dataset.

12. The non-transitory, computer-readable medium of claim 11, wherein the performing encoding operations includes: training an autoencoder machine learning model based on the one or more current datasets to generate a trained autoencoder model; and calculating a first latent probability distribution across multiple data record keys corresponding to the one or more current datasets using the trained autoencoder model.

13. The non-transitory, computer-readable medium of claim 12, wherein the autoencoder machine learning model is a Deep Autoencoding Gaussian Mixture Model (DAGMM).

14. The non-transitory, computer-readable medium of claim 12, wherein the validating the value distribution associated with the experimental dataset includes validating numerical data in the experimental dataset, including by: applying the trained autoencoder model to the experimental dataset to calculate a second latent probability distribution across multiple data record keys corresponding to the experimental dataset; and comparing the first and second latent probability distributions.

15. The non-transitory, computer-readable medium of claim 10, wherein the performing validation operations includes validating a value format of string-type data included in the experimental dataset.

16. The non-transitory, computer-readable medium of claim 15, wherein the performing encoding operations includes: generating one or more regular expressions based on string-type data included in at least one of the one or more current datasets; and
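Claims 15 and 16 refer to validating the format of string-type data with regular expressions generated from the current datasets. The claim text in this preview is cut off before it says how those expressions are used, so the sketch below is only one plausible reading: it derives a character-class pattern from each current string and then flags experimental strings that match none of the patterns. The helper names `generalize`, `generate_patterns`, and `validate_strings`, the generalization rule, and the sample values are all assumptions made for illustration.

```python
import re


def char_class(ch):
    """Map one character to a regex fragment for its class (ASCII only)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Z]" if ch.isupper() else "[a-z]"
    # Anything else (punctuation, spaces, non-ASCII) is matched literally.
    return re.escape(ch)


def generalize(value):
    """Turn one string into a pattern, collapsing runs of one class into {n}."""
    parts = []
    for ch in value:
        cls = char_class(ch)
        if parts and parts[-1][0] == cls:
            parts[-1][1] += 1
        else:
            parts.append([cls, 1])
    return "".join(cls if n == 1 else f"{cls}{{{n}}}" for cls, n in parts)


def generate_patterns(values):
    """Collect the distinct format patterns seen in the current string data."""
    return sorted({generalize(v) for v in values})


def validate_strings(patterns, values):
    """Return the values that match none of the baseline patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [v for v in values if not any(c.fullmatch(v) for c in compiled)]


# Made-up identifiers standing in for a string field of a current dataset.
current_values = ["AB-1234", "CD-5678", "EF-0001"]
experimental_values = ["GH-4321", "gh-4321", "IJ-99"]

patterns = generate_patterns(current_values)
rejected = validate_strings(patterns, experimental_values)
print(patterns)
print(rejected)
```

Exact-length patterns like these are strict: any format variation absent from the current data is reported, so a real system would likely need a looser generalization rule or a tolerance threshold.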

Assignees

PayPal Inc

Inventors

Classifications

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • G06F16/27 · Primary

    Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title

  • for performance assessment · CPC title

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Ensuring data consistency and integrity · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11797565B2 cover?
Techniques are disclosed relating to data validation using encode values. In various embodiments, a data monitoring system may retrieve a plurality of datasets from a live database at a non-production datacenter. The data monitoring system may perform encoding operations on one or more of the plurality of datasets to generate encode values that correspond to the plurality of datasets. The data …
Who is the assignee on this patent?
PayPal Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/27. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 24, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).