US2020089775A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2020089775-A1 |
| Application number | US-201816132687-A |
| Country | US |
| Kind code | A1 |
| Filing date | Sep 17, 2018 |
| Priority date | Sep 17, 2018 |
| Publication date | Mar 19, 2020 |
| Grant date | — |
Official abstract text for this publication.
Methods, systems, and computer program products are provided for language entity identification. In one embodiment, a computer-implemented method is disclosed. In the method, respective pinyin codes may be determined for respective Chinese characters comprised in a string that is to be processed. Then, respective pinyin features may be generated from the respective pinyin codes. Next, a candidate language entity may be identified from the string based on the respective pinyin features and a mapping describing an association between pinyin features and language entity. In other embodiments, a computer-implemented system and a computer program product for security management are disclosed.
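The three steps in the abstract (pinyin codes, pinyin features, lookup against a mapping) can be sketched as follows. This is a minimal illustration, not the patented implementation: the lookup table `PINYIN_TABLE`, the dictionary-based `MAPPING`, the function names, and the `max_len` window are all hypothetical, and a feature here is simply the toneless pinyin code itself.

```python
# Hypothetical toy table; a real system would use a full pinyin dictionary.
PINYIN_TABLE = {
    "阿": "a", "司": "si", "匹": "pi", "林": "lin",
    "北": "bei", "京": "jing", "吃": "chi",
}

# Assumed form of the mapping: feature sequence -> (entity, name type).
MAPPING = {
    ("a", "si", "pi", "lin"): ("阿司匹林", "drug"),
    ("bei", "jing"): ("北京", "place"),
}


def pinyin_codes(text):
    """Step 1: one pinyin code per character (unknown characters map to '')."""
    return [PINYIN_TABLE.get(ch, "") for ch in text]


def pinyin_features(codes):
    """Step 2: derive features from codes; in this sketch a feature is the code."""
    return tuple(codes)


def identify_candidates(text, mapping=MAPPING, max_len=4):
    """Step 3: report substrings whose feature sequence appears in the mapping."""
    found = []
    for start in range(len(text)):
        last = min(len(text), start + max_len)
        for end in range(start + 1, last + 1):
            feats = pinyin_features(pinyin_codes(text[start:end]))
            if feats in mapping:
                # Return the matched substring together with its name type.
                found.append((text[start:end], mapping[feats][1]))
    return found


print(identify_candidates("吃阿司匹林"))  # [('阿司匹林', 'drug')]
```

Because the lookup key is built from pronunciation rather than from the characters themselves, a substring written with different characters that share the same pinyin codes would match the same entry in this sketch.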
Opening claim text (preview).
1. A computer-implemented method, comprising: determining, by one or more processors, respective pinyin codes for respective Chinese characters comprised in a string that is to be processed; generating, by one or more processors, respective pinyin features from the respective pinyin codes; and identifying, by one or more processors, a candidate language entity from the string based on the respective pinyin features and a mapping, stored in computer memory, describing an association between pinyin features and language entities.
2. The computer-implemented method of claim 1, wherein the determination of respective pinyin codes further comprises: with respect to a Chinese character comprised in the string, determining, by one or more processors, a tone mark associated with the Chinese character; and updating, by one or more processors, the determined pinyin code for the Chinese character based on the determined tone mark.
3. The computer-implemented method of claim 1, wherein the determination of respective pinyin codes further comprises: with respect to a Chinese character comprised in the string, determining, by one or more processors, an initial portion in a pinyin code for the Chinese character; and updating, by one or more processors, the pinyin code based on the determined initial portion.
4. The computer-implemented method of claim 1, wherein the determination of respective pinyin codes further comprises: with respect to a Chinese character comprised in the string, determining, by one or more processors, a final portion in a pinyin code for the Chinese character; and updating the pinyin code based on the determined final portion.
5. The computer-implemented method of claim 1, wherein the generation of the respective pinyin features comprises: with respect to a pinyin code, obtaining, by one or more processors, a predefined length for generating a pinyin feature from the pinyin code; generating, by one or more processors, the pinyin feature based on at least one padding symbol and the pinyin code in response to a length of the pinyin code being below the predefined length.
6. The computer-implemented method of claim 1, further comprising: obtaining, by one or more processors, a plurality of sample language entities; with respect to one of the plurality of sample language entities, determining, by one or more processors, respective sample pinyin codes for respective Chinese characters comprised in the sample language entity; generating, by one or more processors, respective sample pinyin features from the respective sample pinyin codes; and training, by one or more processors, the mapping based on the respective sample pinyin features and the sample language entity, such that the trained mapping identifies the sample language entity.
7. The computer-implemented method of claim 6, wherein one of the sample language entities is labeled with a name type, and the training of the mapping further comprises: training, by one or more processors, the mapping based on the name type, such that the trained mapping identifies the sample language entity as the name type.
8. The computer-implemented method of claim 7, further comprising: providing, by one or more processors, a candidate name type associated with the candidate language entity.
9. The computer-implemented method of claim 7, wherein the obtaining of the plurality of sample language entities comprises: selecting a sample language entity that is translated from a foreign language, and wherein the name type comprises at least one of: a name of a person, a name of place, and a name of a drug.
10. A computer-implemented system, comprising a computer processor coupled to a computer-readable memory unit, the memory unit comprising instructions that when executed by the computer processor implements a method comprising: determining, by one or more processors, respective pinyin codes for respective Chinese characters comprised in a string that is to be processed; generating, by one or more processors, respective pinyin features from the respective pinyin codes; and identifying, by one or more processors, a candidate language entity from the string based on the respective pinyin features and a mapping, stored in computer memory, describing an association between pinyin features and language entities.
11. The computer-implemented system of claim 10, wherein the determination of respective pinyin codes further comprises: with respect to a Chinese character comprised in the string, determining, by one or more processors, a tone mark associated with the Chinese character; and updating, by one or more processors, the determined pinyin code for the Chinese character based on the determined tone mark.
12. The computer-implemented system of claim 10, wherein the determination of respective pinyin codes further comprises: with respect to a Chinese character comprised in the string, determining, by one or more processors, an initial portion in a pinyin code for the Chinese character; and updating, by one or more processors, the pinyin code based on the determined initial portion.
13. The computer-implemented system of claim 10, wherein the determination of respective pinyin codes further comprises: with respect to a Chinese character comprised in the string, determining, by one or more processors, a final portion in a pinyin code for the Chinese character; and updating the pinyin code based on the determined final portion.
14. The computer-implemented system of claim 10, wherein the generation of the respective pinyin features comprises: with respect to a pinyin code, obtaining, by one or more processors, a predefined length for generating a pinyin feature from the pinyin code; generating, by one or more processors, the pinyin feature based on at least one padding symbol and the pinyin code in response to a length of the pinyin code being below the predefined length.
15. The computer-implemented system of claim 10, further comprising: obtaining, by one or more processors, a plurality of sample language entities; with respect to one of the plurality of sample language entities, determining, by one or more processors, respective sample pinyin codes for respective Chinese characters comprised in the sample language entity; generating, by one or more processors, respective sample pinyin features from the respective sample pinyin codes; and training, by one or more processors, the mapping based on the respective sample pinyin features and the sample language entity, such that the trained mapping identifies the sample language entity.
16. The computer-implemented system of claim 15, wherein one of the sample language entities is labeled with a name type, and the training of the mapping further comprises: training, by one or more processors, the mapping based on the name type, such that the trained mapping identifies the sample language entity as the name type.
17. The computer-implemented system of claim 16, further comprising: providing, by one or more processors, a candidate name type associated with the candidate language entity.
18. The computer-implemented system of claim 16, wherein the obtaining of the plurality of sample language entities comprises: selecting a sample language entity that is translated from a foreign language, and wherein the name type comprises at least one of: a name of a person, a name of place, and a name of a drug.
19. A computer program product, the computer program
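The dependent claims on tone marks, initial and final portions, and padded features can be illustrated with a short sketch. The claims do not say how the pinyin code is "updated" or where padding symbols go, so everything here is an assumption made for illustration: the helper names are hypothetical, the tone is appended as a digit, padding is added on the right with `_`, and `y` and `w` are treated as initials for simplicity.

```python
# Commonly listed pinyin initials; two-letter ones come first so "zh" wins over "z".
# Treating "y" and "w" as initials is a simplifying assumption of this sketch.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_initial_final(code):
    """Split a toneless pinyin code into (initial, final); the initial may be empty."""
    for ini in INITIALS:
        if code.startswith(ini):
            return ini, code[len(ini):]
    return "", code


def with_tone(code, tone):
    """Assumed encoding of a tone mark: append the tone number to the code."""
    return f"{code}{tone}"


def pad_feature(code, length, pad="_"):
    """Right-pad the code with a padding symbol only when it is shorter than `length`."""
    if len(code) < length:
        return code + pad * (length - len(code))
    return code


print(split_initial_final("zhang"))  # ('zh', 'ang')
print(with_tone("lin", 2))           # lin2
print(pad_feature("a", 6))           # a_____
```

Padding every code to one predefined length gives fixed-width features, which is convenient when the mapping is a trained model that expects inputs of uniform size.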
Processing of non-Latin text (kana-to-kanji conversion G06F40/129; vowelisation G06F40/232) · CPC title
Named entity recognition · CPC title
Physics · mapped topic