Method and device for verifying recognition result in character recognition
US-2019114512-A1 · Apr 18, 2019 · US
US11314897B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11314897-B2 |
| Application number | US-202117362022-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 29, 2021 |
| Priority date | Jul 24, 2020 |
| Publication date | Apr 26, 2022 |
| Grant date | Apr 26, 2022 |
Official abstract text for this publication.
Implementations of the present specification disclose a data identification method, apparatus, device, and a computer-readable medium. A solution includes: obtaining a first data set, data samples in the first data set being at least a part of data of a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, a data type of the data samples in the second data set being known; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set; determining a ratio between a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to that the ratio is greater than a second threshold.
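The last two steps of the abstract, the ratio test against two thresholds, can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `field_matches_type` and its parameter names are hypothetical, and the per-sample state transition probabilities are assumed to have been computed already.

```python
def field_matches_type(sample_probs, first_threshold, second_threshold):
    """Decide whether a field has the same data type as the reference set.

    sample_probs: per-sample state transition probabilities for the samples
    drawn from the to-be-identified field (assumed precomputed).
    """
    if not sample_probs:
        # Assumption: an empty sample set is treated as "no match".
        return False
    # Count the samples whose probability exceeds the first threshold.
    hits = sum(1 for p in sample_probs if p > first_threshold)
    # Ratio of high-probability samples to all samples.
    ratio = hits / len(sample_probs)
    # The field matches only if the ratio exceeds the second threshold.
    return ratio > second_threshold
```

Both comparisons are strict ("greater than"), following the wording of the abstract.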
Opening claim text (preview).
The invention claimed is:
1. A data identification method, comprising: obtaining a first data set, data samples in the first data set correspond to a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, wherein the state transition matrix set includes a plurality of state transition matrices, and at least one of the plurality of state transition matrices represents probabilities of state transition conditions of a value of a first character at a character position relative to a value of a second character at a next character position relative to the character position in the data samples in the second data set; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, the sample state transition probabilities representing a similarity between a data type of the data samples in the first data set and the data type of the data samples in the second data set; determining a ratio between (a) a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and (b) a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to determining that the ratio is greater than a second threshold.
2. The method according to claim 1, further comprising: determining state transition matrices corresponding to character positions based on the data samples in the second data set to obtain the state transition matrix set.
3. The method according to claim 2, further comprising: obtaining a given data set; and determining at least one base data set from the given data set, wherein data samples in a same base data set have a same length and the at least one base data set includes the second data set.
4. The method according to claim 1, further comprising: before the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, determining that sample lengths of the data samples in the first data set are the same as sample lengths of the data samples in the second data set.
5. The method according to claim 1, wherein the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set includes: for the data samples in the first data set, obtaining character state transition probabilities corresponding to character positions in the data samples based on the state transition matrix set; and calculating the sample state transition probabilities corresponding to the data samples based on the character state transition probabilities corresponding to the character positions in the data samples.
6. The method according to claim 5, wherein the obtaining the character state transition probabilities corresponding to the character positions in the data samples based on the state transition matrix set includes: determining a value of a first character at a first character position in the data samples; determining a value of a second character at a next character position relative to the first character position; determining a first state transition matrix corresponding to the first character position from the state transition matrix set; and obtaining a first state transition probability corresponding to the first character position from the first state transition matrix based on the value of the first character and the value of the second character.
7. The method according to claim 5, wherein the calculating the sample state transition probabilities corresponding to the data samples based on the character state transition probabilities corresponding to the character positions in the data samples includes: calculating at least a product of the character state transition probabilities corresponding to the character positions in the data samples.
8. The method according to claim 1, further comprising: determining state occurrence probabilities corresponding to the data samples in the second data set based on the state transition matrix set; and using a fractile of the state occurrence probabilities corresponding to the data samples in the second data set as the first threshold.
9. The method according to claim 1, wherein the data samples in the second data set are private data, and the determining the data corresponding to the to-be-identified field as being of the same data type as the data samples in the second data set includes: determining the data corresponding to the to-be-identified field as private data.
10. The method according to claim 9, further comprising: after the determining the data corresponding to the to-be-identified field as the private data, anonymizing the data corresponding to the to-be-identified field.
11. A non-transitory computer readable medium storing contents that, when executed by one or more processors, cause the one or more processors to perform actions comprising: obtaining a first data set, data samples in the first data set correspond to a to-be-identified field; obtaining a state transition matrix set generated based on statistics of data samples in a second data set, wherein the state transition matrix set includes a plurality of state transition matrices, and at least one of the plurality of state transition matrices represents probabilities of state transition conditions of a value of a first character at a character position relative to a value of a second character at a next character position relative to the character position in the data samples in the second data set; determining sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, the sample state transition probabilities representing a similarity between a data type of the data samples in the first data set and the data type of the data samples in the second data set; determining a ratio between (a) a number of data samples in the first data set whose sample state transition probabilities are greater than a first threshold and (b) a total number of the data samples in the first data set; and determining data corresponding to the to-be-identified field as being of a same data type as the data samples in the second data set in response to determining that the ratio is greater than a second threshold.
12. The computer readable medium according to claim 11, the actions further comprising: determining state transition matrices corresponding to character positions based on the data samples in the second data set to obtain the state transition matrix set.
13. The computer readable medium according to claim 12, the actions further comprising: obtaining a given data set; and determining at least one base data set from the given data set, wherein data samples in a same base data set have a same length and the at least one base data set includes the second data set.
14. The computer readable medium according to claim 11, the actions further comprising: before the determining the sample state transition probabilities corresponding to the data samples in the first data set based on the state transition matrix set, determining that sample lengths of the data samples in the first data set are the same as sample lengths of the data samples in the second data set.
15. The computer readable medium accordi
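Claims 2 and 4 through 8 describe building one state transition matrix per character position, multiplying per-position probabilities into a sample probability, and taking a fractile of the reference probabilities as the first threshold. The sketch below illustrates one way this could look. It rests on several assumptions not stated in the claims: each matrix is stored as a dictionary of conditional frequencies P(next character | current character), unseen transitions get probability 0, and the fractile is picked by a simple sorted-index rule. All function names are hypothetical.

```python
from collections import Counter
from math import prod


def build_transition_matrices(samples):
    """Return one matrix per character position i (claim 2).

    Each matrix maps (char at i, char at i+1) to an estimated
    conditional probability, counted over equal-length samples.
    """
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all reference samples must have the same length")
    matrices = []
    for i in range(length - 1):
        pair_counts = Counter((s[i], s[i + 1]) for s in samples)
        first_counts = Counter(s[i] for s in samples)
        # Assumption: normalise by how often the first character occurs at i.
        matrices.append(
            {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}
        )
    return matrices


def sample_probability(sample, matrices):
    """Product of per-position transition probabilities (claims 5 to 7)."""
    if len(sample) != len(matrices) + 1:
        # Length check in the spirit of claim 4: mismatched lengths score 0.
        return 0.0
    # Assumption: a transition never seen in the reference set has probability 0.
    return prod(
        m.get((sample[i], sample[i + 1]), 0.0) for i, m in enumerate(matrices)
    )


def fractile(values, q):
    """Pick the value at fraction q of the sorted list (claim 8).

    Assumption: a simple index rule; other quantile definitions would also fit.
    """
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]
```

A possible use: build the matrices from reference samples of a known type, score each reference sample with `sample_probability`, and pass a low fractile of those scores as the first threshold when scoring samples from the to-be-identified field.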
characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling · CPC title
Classification techniques · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Matching criteria, e.g. proximity measures · CPC title
adaptive, e.g. self learning · CPC title