Detection of matching datasets using encode values

US2021365344A1 · US · A1

Patent metadata
Publication number: US-2021365344-A1
Application number: US-202016878429-A
Country: US
Kind code: A1
Filing date: May 19, 2020
Priority date: May 19, 2020
Publication date: Nov 25, 2021
Grant date: (none listed)

Abstract

Official abstract text for this publication.

Techniques are disclosed relating to detecting matching datasets using encode values. In various embodiments, a data monitoring system may perform encoding operations on a first dataset to generate a first encode value that corresponds to a particular one of one or more fields included in the first dataset. The data monitoring system may then determine whether the first dataset matches a previously analyzed dataset. For example, in some embodiments, the data monitoring system may compare the first encode value to a previous encode value that corresponds to a second field of the previously analyzed dataset. Based on this comparison, the data monitoring system may generate an output value that is indicative of a similarity between the first encode value and the previous encode value. The data monitoring system may then determine whether the first dataset matches the previously analyzed dataset based on this output value.
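
The flow in the abstract (encode a field, compare against a stored encode value, produce a similarity output, decide on a match) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: datasets are modeled as dicts of field name to list of values, the encode value is a toy character-class frequency vector, the output value is cosine similarity, and the 0.95 threshold is arbitrary.

```python
import math
from collections import Counter


def encode_field(values):
    """Toy encode value (assumption): share of digits, letters, and other characters."""
    counts = Counter()
    for value in values:
        for ch in str(value):
            if ch.isdigit():
                counts["digit"] += 1
            elif ch.isalpha():
                counts["alpha"] += 1
            else:
                counts["other"] += 1
    total = sum(counts.values()) or 1
    return [counts[key] / total for key in ("digit", "alpha", "other")]


def similarity(first_encode, previous_encode):
    """Output value (assumption): cosine similarity between two encode values."""
    dot = sum(x * y for x, y in zip(first_encode, previous_encode))
    norm = math.sqrt(sum(x * x for x in first_encode)) * math.sqrt(
        sum(y * y for y in previous_encode)
    )
    return dot / norm if norm else 0.0


def datasets_match(first_dataset, previous_encode, field, threshold=0.95):
    """Encode one field of the first dataset and compare it to a stored encode value."""
    first_encode = encode_field(first_dataset[field])
    output_value = similarity(first_encode, previous_encode)
    return output_value >= threshold, output_value


# The previously analyzed dataset is encoded once and only its encode value is kept.
previous_encode = encode_field(["555-0100", "555-0199"])
matched, score = datasets_match({"phone": ["555-0142", "555-0177"]}, previous_encode, "phone")
```

Storing only the encode value of the previously analyzed dataset means the comparison never needs the earlier raw data, which is one plausible reason for comparing encodes rather than records.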

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: performing, by a data monitoring system, encoding operations on a first dataset to generate a first encode value, wherein the first dataset includes a first plurality of fields, and wherein the first encode value corresponds to a particular field of the first plurality of fields; determining, by the data monitoring system, whether the first dataset matches a previously analyzed dataset, wherein the previously analyzed dataset includes a second plurality of fields, wherein the determining includes: comparing the first encode value to a previous encode value that corresponds to a second field, of the second plurality of fields, of the previously analyzed dataset; based on the comparing, generating an output value that is indicative of a similarity between the first encode value and the previous encode value; and based on the output value, determining whether the first dataset matches the previously analyzed dataset.

2. The method of claim 1, wherein, for the particular field, the performing the encoding operations includes: selecting a particular one of a plurality of encoder modules based on a data type associated with the particular field; and encoding data included in the particular field of the first dataset, using the particular encoder module, to generate the first encode value.

3. The method of claim 2, wherein the second field of the previously analyzed dataset has a same data type as the data type associated with the particular field of the first dataset; and wherein the determining whether the first dataset matches the previously analyzed dataset further includes: retrieving the previous encode value corresponding to the previously analyzed dataset, wherein the previous encode value was generated by encoding data included in the second field of the previously analyzed dataset using the particular encoder module.

4. The method of claim 3, wherein the particular field of the first dataset includes string-type data; and wherein the encoding the data included in the particular field of the first dataset includes: generating one or more vector word-embedding representations of the string-type data included in the particular field of the first dataset.

5. The method of claim 3, wherein the particular field of the first dataset includes string-type data; and wherein the encoding the data included in the particular field of the first dataset includes: generating a first regular expression based on the string-type data included in the particular field of the first dataset.

6. The method of claim 5, wherein the previous encode value is a second regular expression generated based on string-type data included in the second field of the previously analyzed dataset; and wherein the comparing the first encode value to the previous encode value includes comparing the first and second regular expressions.

7. The method of claim 3, wherein the particular field of the first dataset includes numerical data; and wherein the encoding the data included in the particular field of the first dataset includes: calculating a first latent probability distribution corresponding to the numerical data in the particular field of the first dataset.

8. The method of claim 7, wherein the previous encode value is a second latent probability distribution corresponding to numerical data included in the second field of the previously analyzed dataset; and wherein the comparing the first encode value to the previous encode value includes comparing the first and second latent probability distributions.

9. The method of claim 1, further comprising: prior to determining whether the first dataset matches the previously analyzed dataset, comparing, by the data monitoring system, a first schema associated with the first dataset to a second schema associated with the previously analyzed dataset; and in response to a determination that the first schema does not match the second schema, determining that the first dataset does not match the previously analyzed dataset.

10. The method of claim 1, wherein the output value is specified using Kullback-Leibler divergence.

11. A non-transitory, computer-readable medium having instructions stored thereon that are executable by a data monitoring system to perform operations comprising: performing encoding operations on a first dataset to generate a first encode value, wherein the first dataset includes a first plurality of fields, and wherein the first encode value corresponds to a particular field of the first plurality of fields; determining whether the first dataset matches a previously analyzed dataset, wherein the previously analyzed dataset includes a second plurality of fields, wherein the determining includes: comparing the first encode value to a previous encode value that corresponds to a second field, of the second plurality of fields, of the previously analyzed dataset; based on the comparing, generating an output value that is indicative of a similarity between the first encode value and the previous encode value; and based on the output value, determining whether the first dataset matches the previously analyzed dataset.

12. The non-transitory, computer-readable medium of claim 11, wherein, for the particular field, the performing the encoding operations includes: selecting a particular one of a plurality of encoder modules based on a data type associated with the particular field; and encoding data included in the particular field of the first dataset, using the particular encoder module, to generate the first encode value.

13. The non-transitory, computer-readable medium of claim 12, wherein the second field has a same data type as the data type associated with the particular field of the first dataset; and wherein the determining whether the first dataset matches the previously analyzed dataset further includes: retrieving the previous encode value corresponding to the previously analyzed dataset, wherein the previous encode value was generated by encoding data included in the second field of the previously analyzed dataset using the particular encoder module.

14. The non-transitory, computer-readable medium of claim 12, wherein the particular field of the first dataset includes string-type data; and wherein the encoding the data included in the particular field of the first dataset includes: generating one or more vector word-embedding representations of the string-type data included in the particular field of the first dataset; and generating a first regular expression based on the string-type data included in the particular field of the first dataset.

15. The non-transitory, computer-readable medium of claim 12, wherein the particular field of the first dataset includes numerical data; and wherein the encoding the data included in the particular field of the first dataset includes: applying a trained autoencoder machine learning model to the numerical data in the particular field to calculate a first latent probability distribution corresponding to the particular field of the first dataset.

16. The non-transitory, computer-readable medium of claim 11, wherein the operations further comprise: prior to the determining whether the first dataset matches the previously analyzed dataset, generating encode values for one or more of the second plurality of fields included in the previously analyzed dataset.

17. A method, comprising: receiving, by a data monitoring system, a first dataset that inc
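
Several dependent claims above (encoder selection by data type, regular expressions for string fields, probability distributions for numeric fields, a schema check before comparison, and Kullback-Leibler divergence as the output value) can be sketched together. This is a hedged illustration and not the patented implementation: the names `encode_string_field`, `encode_numeric_field`, `ENCODERS`, and `datasets_match` are hypothetical, a smoothed fixed-bin histogram is swapped in for the latent distribution that claim 15 obtains from a trained autoencoder, regular expressions are compared by simple string equality, requiring every field to agree is an assumption, and the `max_kl` threshold is arbitrary.

```python
import math
import re


def _char_class(ch):
    """Map one character to a regex fragment describing its 'shape'."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)


def encode_string_field(values):
    """Encode a string field as a regular expression built from value shapes."""
    shapes = {"".join(_char_class(ch) for ch in value) for value in values}
    return "|".join(sorted(shapes))


def encode_numeric_field(values, bins=10, lo=0.0, hi=100.0):
    """Encode a numeric field as a smoothed, normalized histogram (assumed stand-in
    for a latent probability distribution)."""
    counts = [1e-6] * bins  # small floor keeps the KL divergence finite
    width = (hi - lo) / bins
    for value in values:
        index = min(bins - 1, max(0, int((value - lo) / width)))
        counts[index] += 1
    total = sum(counts)
    return [count / total for count in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Encoder module selected by the data type of each field.
ENCODERS = {"string": encode_string_field, "number": encode_numeric_field}


def encode_dataset(schema, data):
    """Generate one encode value per field, using the encoder for its data type."""
    return {field: ENCODERS[dtype](data[field]) for field, dtype in schema.items()}


def datasets_match(schema_a, encodes_a, schema_b, encodes_b, max_kl=0.1):
    """Schema check first, then a field-by-field comparison of encode values."""
    if schema_a != schema_b:
        return False  # mismatched schemas are treated as a non-match up front
    for field, dtype in schema_a.items():
        if dtype == "string":
            if encodes_a[field] != encodes_b[field]:
                return False
        elif kl_divergence(encodes_a[field], encodes_b[field]) > max_kl:
            return False
    return True


schema = {"zip": "string", "age": "number"}
previous = encode_dataset(schema, {"zip": ["94105", "10001"], "age": [25, 35, 45, 55]})
current = encode_dataset(schema, {"zip": ["60601", "73301"], "age": [26, 34, 46, 54]})
is_match = datasets_match(schema, current, schema, previous)
```

The schema check is cheap compared with encoding, so running it first avoids comparing encode values for datasets that cannot match anyway.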

Assignees

  • Paypal Inc

Classifications

  • Combinations of networks

  • Probabilistic graphical models, e.g. probabilistic networks

  • Auto-encoder networks; Encoder-decoder networks

  • Evolutionary algorithms, e.g. genetic algorithms or genetic programming

  • using adaptive string matching, e.g. the Lempel-Ziv method

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021365344A1 cover?
Techniques are disclosed relating to detecting matching datasets using encode values. In various embodiments, a data monitoring system may perform encoding operations on a first dataset to generate a first encode value that corresponds to a particular one of one or more fields included in the first dataset. The data monitoring system may then determine whether the first dataset matches a previo…
Who is the assignee on this patent?
Paypal Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 25, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).