Detection of matching datasets using encode values

US11669428B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11669428-B2
Application number  US-202016878429-A
Country             US
Kind code           B2
Filing date         May 19, 2020
Priority date       May 19, 2020
Publication date    Jun 6, 2023
Grant date          Jun 6, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are disclosed relating to detecting matching datasets using encode values. In various embodiments, a data monitoring system may perform encoding operations on a first dataset to generate a first encode value that corresponds to a particular one of one or more fields included in the first dataset. The data monitoring system may then determine whether the first dataset matches a previously analyzed dataset. For example, in some embodiments, the data monitoring system may compare the first encode value to a previous encode value that corresponds to a second field of the previously analyzed dataset. Based on this comparison, the data monitoring system may generate an output value that is indicative of a similarity between the first encode value and the previous encode value. The data monitoring system may then determine whether the first dataset matches the previously analyzed dataset based on this output value.
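
The flow described in the abstract (encode a field, compare against a stored encode value, produce a similarity output, decide on a match) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the character-class histogram used as the "encode value", the similarity measure, the function names, and the 0.9 threshold are hypothetical and are not taken from the patent.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical dataset shape: field name -> list of string values in that field.
Dataset = Dict[str, List[str]]

CLASSES = ("digit", "alpha", "other")


def encode_field(values: List[str]) -> Dict[str, float]:
    """Illustrative encode value: share of digits, letters, and other characters."""
    counts: Counter = Counter()
    for value in values:
        for ch in value:
            if ch.isdigit():
                counts["digit"] += 1
            elif ch.isalpha():
                counts["alpha"] += 1
            else:
                counts["other"] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty fields
    return {name: counts[name] / total for name in CLASSES}


def similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Output value in [0, 1]: one minus half the L1 distance between encode values."""
    return 1.0 - 0.5 * sum(abs(a[name] - b[name]) for name in CLASSES)


def datasets_match(
    first: Dataset,
    first_field: str,
    previous: Dataset,
    previous_field: str,
    threshold: float = 0.9,  # assumed cut-off, not specified in the source
) -> Tuple[float, bool]:
    """Compare one field of each dataset and decide on a match from the output value."""
    output = similarity(
        encode_field(first[first_field]),
        encode_field(previous[previous_field]),
    )
    return output, output >= threshold
```

For example, `datasets_match({"phone": ["555-1234"]}, "phone", {"tel": ["555-0000"]}, "tel")` compares two fields that share a digit-and-hyphen layout, so the output value is high even though the field names and values differ.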

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: performing, by a data monitoring system, encoding operations on a first dataset to generate a set of encode values including first and second encode values generated using different types of encoding, wherein the first dataset includes first data organized into a first plurality of fields, the first data includes data records having data values for multiple fields within the first plurality of fields, and the first encode value corresponds to a particular field of the first plurality of fields; determining, by the data monitoring system, whether the first dataset matches a previously analyzed dataset, wherein the previously analyzed dataset includes second data organized into a second plurality of fields, the second data includes data records having data values for multiple fields within the second plurality of fields, and the determining includes: comparing the set of encode values to a previous set of encode values, wherein the previous set of encode values includes third and fourth encode values generated using different types of encoding and the third encode value corresponds to a second field, of the second plurality of fields, of the previously analyzed dataset; based on the comparing, generating an output value that is indicative of a similarity between the set of encode values and the previous set of encode values; and based on the output value, determining whether the first dataset matches the previously analyzed dataset.

2. The method of claim 1, wherein, for the particular field, the performing the encoding operations includes: selecting a particular one of a plurality of encoder modules based on a data type associated with the particular field; and encoding data included in the particular field of the first dataset, using the particular encoder module, to generate the first encode value.

3. The method of claim 2, wherein the second field of the previously analyzed dataset has a same data type as the data type associated with the particular field of the first dataset; and wherein the determining whether the first dataset matches the previously analyzed dataset further includes: retrieving the third encode value corresponding to the second field of the previously analyzed dataset, wherein the third encode value was generated by encoding data included in the second field of the previously analyzed dataset using the particular encoder module.

4. The method of claim 2, wherein the particular field of the first dataset includes string-type data; a type of encoding used to generate the first encode value includes semantic encoding; and the encoding data included in the particular field of the first dataset includes: generating one or more vector word-embedding representations of the string-type data included in the particular field of the first dataset.

5. The method of claim 2, wherein the particular field of the first dataset includes string-type data; a type of encoding used to generate the first encode value includes value-format encoding; and the encoding data included in the particular field of the first dataset includes: generating a first regular expression based on the string-type data included in the particular field of the first dataset.

6. The method of claim 5, wherein the third encode value is a second regular expression generated based on string-type data included in the second field of the previously analyzed dataset; and wherein the comparing the set of encode values to the previous set of encode values includes comparing the first and second regular expressions.

7. The method of claim 2, wherein the particular field of the first dataset includes numerical data; a type of encoding used to generate the first encode value includes numerical distribution encoding; and the encoding data included in the particular field of the first dataset includes: calculating a first latent probability distribution corresponding to the numerical data in the particular field of the first dataset.

8. The method of claim 7, wherein the third encode value is a second latent probability distribution corresponding to numerical data included in the second field of the previously analyzed dataset; and wherein the comparing the set of encode values to the previous set of encode values includes comparing the first and second latent probability distributions.

9. The method of claim 1, further comprising: prior to determining whether the first dataset matches the previously analyzed dataset, comparing, by the data monitoring system, properties of a first schema associated with the first dataset to properties of a second schema associated with the previously analyzed dataset, wherein properties of the first schema include characteristics of the data records or the first plurality of fields; and in response to a determination that the first schema does not match the second schema, determining that the first dataset does not match the previously analyzed dataset.

10. The method of claim 1, wherein the output value is specified using Kullback-Leibler divergence.

11. A non-transitory, computer-readable medium having instructions stored thereon that are executable by a data monitoring system to perform operations comprising: performing encoding operations on a first dataset to generate a set of encode values including first and second encode values generated using different types of encoding, wherein the first dataset includes first data organized into a first plurality of fields, the first data includes data records having data values for multiple fields within the first plurality of fields, and the first encode value corresponds to a particular field of the first plurality of fields; determining whether the first dataset matches a previously analyzed dataset, wherein the previously analyzed dataset includes second data organized into a second plurality of fields, the second data includes data records having data values for multiple fields within the second plurality of fields, and the determining includes: comparing the set of encode values to a previous set of encode values, wherein the previous set of encode values includes third and fourth encode values generated using different types of encoding and the third encode value corresponds to a second field, of the second plurality of fields, of the previously analyzed dataset; based on the comparing, generating an output value that is indicative of a similarity between the set of encode values and the previous set of encode values; and based on the output value, determining whether the first dataset matches the previously analyzed dataset.

12. The non-transitory, computer-readable medium of claim 11, wherein, for the particular field, the performing the encoding operations includes: selecting a particular one of a plurality of encoder modules based on a data type associated with the particular field; and encoding data included in the particular field of the first dataset, using the particular encoder module, to generate the first encode value.

13. The non-transitory, computer-readable medium of claim 12, wherein the second field has a same data type as the data type associated with the particular field of the first dataset; and wherein the determining whether the first dataset matches the previously analyzed dataset further includes: retrieving the third encode value corresponding to the second field of the previously analyzed dataset, wherein the third encode value was generated by encoding data included in the second field of the previously analyzed dataset using the particular encoder module.
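
Claims 2, 5 to 8, and 10 describe selecting an encoder by data type, generating a regular expression for string fields, computing a probability distribution for numerical fields, and expressing the output value with Kullback-Leibler divergence. The sketch below shows one way these pieces could fit together. It is an illustration under stated assumptions, not the patented implementation: a smoothed histogram stands in for the "latent probability distribution", the regex generalization rule is a simple character-run heuristic, and all function names and parameters are hypothetical.

```python
import math
import re
from itertools import groupby
from typing import Callable, Dict, List, Sequence


def _char_class(ch: str) -> str:
    """Classify a character as ASCII digit ('d'), ASCII letter ('a'), or other ('o')."""
    if "0" <= ch <= "9":
        return "d"
    if ch.isascii() and ch.isalpha():
        return "a"
    return "o"


def value_format_regex(values: Sequence[str]) -> str:
    """Value-format encoding (assumed heuristic): one regex covering the observed layouts."""
    def pattern_for(value: str) -> str:
        parts = []
        for kind, run in groupby(value, key=_char_class):
            text = "".join(run)
            if kind == "d":
                parts.append("[0-9]{%d}" % len(text))
            elif kind == "a":
                parts.append("[A-Za-z]{%d}" % len(text))
            else:
                parts.append(re.escape(text))  # keep punctuation literally
        return "".join(parts)

    # Distinct layouts are joined as alternatives; sorting keeps the result stable.
    return "|".join(sorted({pattern_for(v) for v in values}))


def numeric_distribution(
    values: Sequence[float],
    bins: int = 10,
    lo: float = 0.0,
    hi: float = 1.0,
    eps: float = 1e-9,
) -> List[float]:
    """Numerical distribution encoding stand-in: a smoothed, normalized histogram."""
    counts = [0] * bins
    for x in values:
        index = int((x - lo) / (hi - lo) * bins)
        counts[min(bins - 1, max(0, index))] += 1  # clamp out-of-range values
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]  # eps keeps every bin nonzero


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Encoder modules keyed by a hypothetical data-type label (claim 2).
ENCODERS: Dict[str, Callable] = {
    "string": value_format_regex,
    "number": numeric_distribution,
}


def encode_field(values: Sequence, data_type: str):
    """Select the encoder module for the field's data type and apply it."""
    return ENCODERS[data_type](values)
```

With this sketch, `encode_field(["555-1234", "800-0000"], "string")` yields a pattern that also matches `"123-4567"`, and `kl_divergence` returns a value near zero for fields with similar numerical distributions and a large value for dissimilar ones. Because KL divergence is asymmetric and unbounded, a real system would need to choose a direction and a threshold; neither is specified in the text above.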

Assignees

Paypal Inc

Inventors

Classifications

  • G06F16/215 (Primary)

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

  • for evaluating statistical data {, e.g. average values, frequency distributions, probability functions, regression analysis (forecasting specially adapted for a specific administrative, business or logistic context G06Q10/04)}

  • using adaptive string matching, e.g. the Lempel-Ziv method

  • Encoder aspects

  • Vector coding (for television signals, see H04N19/94)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11669428B2 cover?
Techniques are disclosed relating to detecting matching datasets using encode values. In various embodiments, a data monitoring system may perform encoding operations on a first dataset to generate a first encode value that corresponds to a particular one of one or more fields included in the first dataset. The data monitoring system may then determine whether the first dataset matches a previo…
Who is the assignee on this patent?
Paypal Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 6, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).