Data identification method, apparatus, device, and readable medium

US11314897B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11314897-B2
Application number  US-202117362022-A
Country             US
Kind code           B2
Filing date         Jun 29, 2021
Priority date       Jul 24, 2020
Publication date    Apr 26, 2022
Grant date          Apr 26, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations of the present specification disclose a data identification method, apparatus, device, and a computer-readable medium. A solution includes: obtaining a first data set, data samples in the first data set being at least a part of data of a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, a data type of the data samples in the second data set being known; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set; determining a ratio between a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to that the ratio is greater than a second threshold.
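The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patentee's implementation: the function names (`build_matrices`, `sample_probability`, `field_matches`), the nested-dictionary representation of each state transition matrix, and the choice to score unseen transitions and length mismatches as 0.0 are all assumptions made for illustration. The product of per-position probabilities follows claim 7.

```python
from collections import defaultdict


def build_matrices(known_samples):
    """Estimate one transition table per character position by counting.

    Assumes all known samples share the same length (hypothetical helper).
    """
    length = len(known_samples[0])
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(length - 1)]
    for sample in known_samples:
        for i in range(length - 1):
            counts[i][sample[i]][sample[i + 1]] += 1

    matrices = []
    for position_counts in counts:
        table = {}
        for current_char, next_counts in position_counts.items():
            total = sum(next_counts.values())
            table[current_char] = {c: n / total for c, n in next_counts.items()}
        matrices.append(table)
    return matrices


def sample_probability(sample, matrices):
    """Product of per-position transition probabilities for one sample.

    Assumption: length mismatches and unseen transitions score 0.0.
    """
    if len(sample) != len(matrices) + 1:
        return 0.0
    probability = 1.0
    for i, table in enumerate(matrices):
        probability *= table.get(sample[i], {}).get(sample[i + 1], 0.0)
    return probability


def field_matches(field_samples, matrices, first_threshold, second_threshold):
    """True if enough samples in the field score above the first threshold."""
    if not field_samples:
        return False
    hits = sum(
        1 for s in field_samples
        if sample_probability(s, matrices) > first_threshold
    )
    return hits / len(field_samples) > second_threshold


if __name__ == "__main__":
    matrices = build_matrices(["AB12", "AB13", "AB12", "AC12"])
    print(field_matches(["AB12", "AB13", "ZZ99", "AB12"], matrices, 0.1, 0.5))
```

In practice, long samples drive the product toward zero, so a real system would likely work with log-probabilities; the sketch keeps plain products to stay close to the claim wording.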

First claim

Opening claim text (preview).

The invention claimed is:

1. A data identification method, comprising: obtaining a first data set, data samples in the first data set correspond to a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, wherein the state transition matrix set includes a plurality of state transition matrices, and at least one of the plurality of state transition matrices represents probabilities of state transition conditions of a value of a first character at a character position relative to a value of a second character at a next character position relative to the character position in the data samples in the second data set; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, the sample state transition probabilities representing a similarity between a data type of the data samples in the first data set and the data type of the data samples in the second data set; determining a ratio between (a) a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and (b) a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to determining that the ratio is greater than a second threshold.

2. The method according to claim 1, further comprising: determining state transition matrices corresponding to character positions based on the data samples in the second data set to obtain the state transition matrix set.

3. The method according to claim 2, further comprising: obtaining a given data set; and determining at least one base data set from the given data set, wherein data samples in a same base data set have a same length and the at least one base data set includes the second data set.

4. The method according to claim 1, further comprising: before the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, determining that sample lengths of the data samples in the first data set are the same as sample lengths of the data samples in the second data set.

5. The method according to claim 1, wherein the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set includes: for the data samples in the first data set, obtaining character state transition probabilities corresponding to character positions in the data samples based on the state transition matrix set; and calculating the sample state transition probabilities corresponding to the data samples based on the character state transition probabilities corresponding to the character positions in the data samples.

6. The method according to claim 5, wherein the obtaining the character state transition probabilities corresponding to the character positions in the data samples based on the state transition matrix set includes: determining a value of a first character at a first character position in the data samples; determining a value of a second character at a next character position relative to the first character position; determining a first state transition matrix corresponding to the first character position from the state transition matrix set; and obtaining a first state transition probability corresponding to the first character position from the first state transition matrix based on the value of the first character and the value of the second character.

7. The method according to claim 5, wherein the calculating the sample state transition probabilities corresponding to the data samples based on the character state transition probabilities corresponding to the character positions in the data samples includes: calculating at least a product of the character state transition probabilities corresponding to the character positions in the data samples.

8. The method according to claim 1, further comprising: determining state occurrence probabilities corresponding to the data samples in the second data set based on the state transition matrix set; and using a fractile of the state occurrence probabilities corresponding to the data samples in the second data set as the first threshold.

9. The method according to claim 1, wherein the data samples in the second data set are private data, and the determining the data corresponding to the to-be-identified field as being of the same data type as the data samples in the second data set includes: determining the data corresponding to the to-be-identified field as private data.

10. The method according to claim 9, further comprising: after the determining the data corresponding to the to-be-identified field as the private data, anonymizing the data corresponding to the to-be-identified field.

11. A non-transitory computer readable medium storing contents that, when executed by one or more processors, cause the one or more processors to perform actions comprising: obtaining a first data set, data samples in the first data set correspond to a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, wherein the state transition matrix set includes a plurality of state transition matrices, and at least one of the plurality of state transition matrices represents probabilities of state transition conditions of a value of a first character at a character position relative to a value of a second character at a next character position relative to the character position in the data samples in the second data set; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, the sample state transition probabilities representing a similarity between a data type of the data samples in the first data set and the data type of the data samples in the second data set; determining a ratio between (a) a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and (b) a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to determining that the ratio is greater than a second threshold.

12. The computer readable medium according to claim 11, the actions further comprising: determining state transition matrices corresponding to character positions based on the data samples in the second data set to obtain the state transition matrix set.

13. The computer readable medium according to claim 12, the actions further comprising: obtaining a given data set; and determining at least one base data set from the given data set, wherein data samples in a same base data set have a same length and the at least one base data set includes the second data set.

14. The computer readable medium according to claim 11, the actions further comprising: before the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, determining that sample lengths of the data samples in the first data set are the same as sample lengths of the data samples in the second data set.

15. The computer readable medium accordi
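Two of the dependent claims lend themselves to a short illustration: grouping a given data set into same-length base data sets (claim 3) and taking a fractile of the known samples' probabilities as the first threshold (claim 8). The Python sketch below is only one possible reading. The helper names `group_by_length` and `fractile` are hypothetical, the nearest-rank fractile definition is an assumption (the claims do not specify one), and the input scores are assumed to have been computed elsewhere from the state transition matrix set.

```python
import math
from collections import defaultdict


def group_by_length(samples):
    """Split a given data set into base data sets of equal-length samples."""
    groups = defaultdict(list)
    for sample in samples:
        groups[len(sample)].append(sample)
    return dict(groups)


def fractile(values, q):
    """Nearest-rank fractile of `values` for q in [0, 1].

    Assumption: nearest-rank is used here for simplicity; other
    fractile definitions (e.g. with interpolation) are equally plausible.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical probabilities of known samples under their own matrices.
    known_scores = [0.5, 0.1, 0.9, 0.3]
    first_threshold = fractile(known_scores, 0.25)
    print(first_threshold)
    print(group_by_length(["123", "4567", "890", "12"]))
```

Choosing a low fractile keeps most known samples above the first threshold, which trades a higher chance of false positives for fewer missed matches; the appropriate value depends on the data and is not fixed by the claims.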

Assignees

Alipay Hangzhou Inf Tech Co Ltd

Inventors

Classifications

  • characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

  • Classification techniques

  • Probabilistic graphical models, e.g. probabilistic networks

  • Matching criteria, e.g. proximity measures

  • adaptive, e.g. self learning

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11314897B2 cover?
Implementations of the present specification disclose a data identification method, apparatus, device, and a computer-readable medium. A solution includes: obtaining a first data set, data samples in the first data set being at least a part of data of a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, a data type…
Who is the assignee on this patent?
Alipay Hangzhou Inf Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F18/2155. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 26, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).