Who is the assignee on this patent?

Alipay Hangzhou Inf Tech Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06F18/2155. Mapped technology areas include Physics.

When was this patent published?

Publication date Apr 26, 2022 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Data identification method, apparatus, device, and readable medium

US11314897B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11314897-B2
Application number	US-202117362022-A
Country	US
Kind code	B2
Filing date	Jun 29, 2021
Priority date	Jul 24, 2020
Publication date	Apr 26, 2022
Grant date	Apr 26, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations of the present specification disclose a data identification method, apparatus, device, and a computer-readable medium. A solution includes: obtaining a first data set, data samples in the first data set being at least a part of data of a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, a data type of the data samples in the second data set being known; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set; determining a ratio between a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to that the ratio is greater than a second threshold.
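
The steps in the abstract can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes each state transition matrix holds the conditional probability of the next character given the current character at one position, that all known samples share one length, and that unseen transitions score zero. The function names and both threshold values are hypothetical.

```python
from collections import Counter, defaultdict


def build_transition_matrices(known_samples):
    """Build one transition matrix per character position (assumed form:
    P(next character | current character) at that position)."""
    if not known_samples:
        raise ValueError("known_samples must not be empty")
    length = len(known_samples[0])
    if any(len(s) != length for s in known_samples):
        raise ValueError("all known samples must have the same length")

    # counts[i][a][b] = how often character a at position i is followed by b
    counts = [defaultdict(Counter) for _ in range(length - 1)]
    for sample in known_samples:
        for i in range(length - 1):
            counts[i][sample[i]][sample[i + 1]] += 1

    matrices = []
    for position_counts in counts:
        matrix = {}
        for current, followers in position_counts.items():
            total = sum(followers.values())
            matrix[current] = {nxt: n / total for nxt, n in followers.items()}
        matrices.append(matrix)
    return matrices


def sample_state_transition_probability(sample, matrices):
    """Multiply the per-position transition probabilities of one sample."""
    if len(sample) != len(matrices) + 1:
        return 0.0  # assumption: a length mismatch counts as no match
    probability = 1.0
    for i, matrix in enumerate(matrices):
        probability *= matrix.get(sample[i], {}).get(sample[i + 1], 0.0)
    return probability


def field_matches(first_data_set, matrices, first_threshold, second_threshold):
    """Return True when the share of high-probability samples exceeds
    the second threshold."""
    if not first_data_set:
        return False
    hits = sum(
        1
        for sample in first_data_set
        if sample_state_transition_probability(sample, matrices) > first_threshold
    )
    return hits / len(first_data_set) > second_threshold
```

In this sketch, `first_data_set` would hold sampled values of the to-be-identified field and `known_samples` would hold values of a known data type; a real system would likely smooth the probabilities or work in log space to avoid underflow on long samples.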

First claim

Opening claim text (preview).

The invention claimed is:

1. A data identification method, comprising: obtaining a first data set, data samples in the first data set correspond to a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, wherein the state transition matrix set includes a plurality of state transition matrices, and at least one of the plurality of state transition matrices represents probabilities of state transition conditions of a value of a first character at a character position relative to a value of a second character at a next character position relative to the character position in the data samples in the second data set; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, the sample state transition probabilities representing a similarity between a data type of the data samples in the first data set and the data type of the data samples in the second data set; determining a ratio between (a) a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and (b) a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to determining that the ratio is greater than a second threshold.

2. The method according to claim 1, further comprising: determining state transition matrices corresponding to character positions based on the data samples in the second data set to obtain the state transition matrix set.

3. The method according to claim 2, further comprising: obtaining a given data set; and determining at least one base data set from the given data set, wherein data samples in a same base data set have a same length and the at least one base data set includes the second data set.

4. The method according to claim 1, further comprising: before the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, determining that sample lengths of the data samples in the first data set are the same as sample lengths of the data samples in the second data set.

5. The method according to claim 1, wherein the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set includes: for the data samples in the first data set, obtaining character state transition probabilities corresponding to character positions in the data samples based on the state transition matrix set; and calculating the sample state transition probabilities corresponding to the data samples based on the character state transition probabilities corresponding to the character positions in the data samples.

6. The method according to claim 5, wherein the obtaining the character state transition probabilities corresponding to the character positions in the data samples based on the state transition matrix set includes: determining a value of a first character at a first character position in the data samples; determining a value of a second character at a next character position relative to the first character position; determining a first state transition matrix corresponding to the first character position from the state transition matrix set; and obtaining a first state transition probability corresponding to the first character position from the first state transition matrix based on the value of the first character and the value of the second character.

7. The method according to claim 5, wherein the calculating the sample state transition probabilities corresponding to the data samples based on the character state transition probabilities corresponding to the character positions in the data samples includes: calculating at least a product of the character state transition probabilities corresponding to the character positions in the data samples.

8. The method according to claim 1, further comprising: determining state occurrence probabilities corresponding to the data samples in the second data set based on the state transition matrix set; and using a fractile of the state occurrence probabilities corresponding to the data samples in the second data set as the first threshold.

9. The method according to claim 1, wherein the data samples in the second data set are private data, and the determining the data corresponding to the to-be-identified field as being of the same data type as the data samples in the second data set includes: determining the data corresponding to the to-be-identified field as private data.

10. The method according to claim 9, further comprising: after the determining the data corresponding to the to-be-identified field as the private data, anonymizing the data corresponding to the to-be-identified field.

11. A non-transitory computer readable medium storing contents that, when executed by one or more processors, cause the one or more processors to perform actions comprising: obtaining a first data set, data samples in the first data set correspond to a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, wherein the state transition matrix set includes a plurality of state transition matrices, and at least one of the plurality of state transition matrices represents probabilities of state transition conditions of a value of a first character at a character position relative to a value of a second character at a next character position relative to the character position in the data samples in the second data set; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, the sample state transition probabilities representing a similarity between a data type of the data samples in the first data set and the data type of the data samples in the second data set; determining a ratio between (a) a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and (b) a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to determining that the ratio is greater than a second threshold.

12. The computer readable medium according to claim 11, the actions further comprising: determining state transition matrices corresponding to character positions based on the data samples in the second data set to obtain the state transition matrix set.

13. The computer readable medium according to claim 12, the actions further comprising: obtaining a given data set; and determining at least one base data set from the given data set, wherein data samples in a same base data set have a same length and the at least one base data set includes the second data set.

14. The computer readable medium according to claim 11, the actions further comprising: before the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, determining that sample lengths of the data samples in the first data set are the same as sample lengths of the data samples in the second data set.

15. The computer readable medium accordi
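
The supporting steps in claims 3, 4, and 8 (splitting a given data set into same-length base data sets, checking lengths before scoring, and taking a fractile of the known samples' probabilities as the first threshold) might look like the Python sketch below. The helper names, the nearest-rank fractile rule, and the fractile level `q` are assumptions made for illustration; the claims do not specify them.

```python
from collections import defaultdict


def split_into_base_sets(given_data_set):
    """Group samples by length so every base data set has one sample length."""
    groups = defaultdict(list)
    for sample in given_data_set:
        groups[len(sample)].append(sample)
    return dict(groups)


def same_sample_length(first_data_set, second_data_set):
    """Check that both data sets contain samples of one common length."""
    lengths = {len(s) for s in first_data_set} | {len(s) for s in second_data_set}
    return len(lengths) == 1


def fractile_threshold(probabilities, q):
    """Pick the q-fractile of the known samples' probabilities.

    Assumption: a simple rank-based rule without interpolation; q is in [0, 1].
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(probabilities)
    return ordered[int(q * (len(ordered) - 1))]
```

A low fractile (for example `q = 0.05`, a hypothetical choice) would set the first threshold so that most known samples score above it, which is one plausible way to calibrate the threshold from the second data set itself.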

Assignees

Alipay Hangzhou Inf Tech Co Ltd

Inventors

Classifications

G06F18/2155 · Primary
characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling · CPC title
G06F18/24
Classification techniques · CPC title
G06N7/01
Probabilistic graphical models, e.g. probabilistic networks · CPC title
G06F18/22
Matching criteria, e.g. proximity measures · CPC title
G06F7/023
adaptive, e.g. self learning · CPC title

Patent family

Related publications grouped by family.

Patent family 72657619
