Text information processing method and apparatus

US2018217979A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2018217979-A1
Application number: US-201815940159-A
Country: US
Kind code: A1
Filing date: Mar 29, 2018
Priority date: Feb 18, 2016
Publication date: Aug 2, 2018
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosure provides a text information processing method. Training textual data is determined according to text information, and characters and strings are identified from the training textual data. For each of the identified characters, a respective independent probability of appearance among the training textual data is calculated. For each of the identified strings, a respective joint probability of appearance among the training textual data is calculated. Whether a particular string of the identified strings corresponds to a candidate neologism is determined according to independent probabilities of various characters of the particular string and the joint probability of the particular string. Moreover, the candidate neologism is determined as a neologism when the candidate neologism is not in a preset dictionary and a joint probability of the candidate neologism is greater than a preset threshold.
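
The flow summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function name `find_neologisms`, the `max_len` cap on string length, and the choice to drop whitespace before counting are all hypothetical. Probabilities are computed as count divided by the total number of characters, and a string becomes a candidate when its joint probability exceeds the product of its characters' independent probabilities, as the dependent claims describe.

```python
from collections import Counter


def find_neologisms(text, dictionary, threshold, max_len=3):
    """Return (string, joint_probability) pairs judged to be neologisms.

    Hypothetical sketch: names and the max_len cap are assumptions.
    """
    # Assumption: whitespace is not treated as a character of the corpus.
    chars = [c for c in text if not c.isspace()]
    total = len(chars)
    if total == 0:
        return []

    # Count single characters and strings of 2..max_len consecutive characters.
    char_counts = Counter(chars)
    string_counts = Counter(
        "".join(chars[i:i + n])
        for n in range(2, max_len + 1)
        for i in range(total - n + 1)
    )

    results = []
    for s, count in string_counts.items():
        # Joint probability: string count over total number of characters.
        joint = count / total
        # Product of the independent probabilities of the string's characters.
        independent = 1.0
        for c in s:
            independent *= char_counts[c] / total
        # Candidate test: the characters co-occur more often than chance.
        is_candidate = joint > independent
        # Final test: not in the preset dictionary and above the threshold.
        if is_candidate and s not in dictionary and joint > threshold:
            results.append((s, joint))

    # Most frequent neologisms first.
    return sorted(results, key=lambda item: -item[1])


if __name__ == "__main__":
    print(find_neologisms("xyxyxyxyab", {"yx"}, 0.25, max_len=2))
```

In the toy input, "xy" and "yx" both pass the candidate test and the threshold, but "yx" is filtered out because it is already in the preset dictionary. Real corpora would need a much lower threshold, since joint probabilities shrink as the corpus grows.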

First claim

Opening claim text (preview).

What is claimed is:

1. A text information processing method, comprising: determining training textual data according to text information; identifying, by processing circuitry of a text information processing apparatus, characters and strings from the training textual data; calculating for each of the identified characters a respective independent probability of appearance among the training textual data; calculating for each of the identified strings a respective joint probability of appearance among the training textual data; determining, by the processing circuitry of the text information processing apparatus, whether a particular string of the identified strings corresponds to a candidate neologism according to independent probabilities of various characters of the particular string and the joint probability of the particular string; and after the particular string is determined to correspond to the candidate neologism, determining, by the processing circuitry of the text information processing apparatus, the candidate neologism as a neologism when the candidate neologism is not in a preset dictionary and a joint probability of the candidate neologism is greater than a preset threshold.

2. The method according to claim 1, further comprising: collecting respective count numbers of the identified characters in the training textual data, respective count numbers of the identified strings in the training textual data, and a total number of characters in the training textual data, wherein the calculating for each of the identified characters the respective independent probability of appearance among the training textual data includes calculating the independent probability of a particular character according to the count number of the particular character in the training textual data and the total number of characters in the training textual data, and the calculating for each of the identified strings the respective joint probability of appearance among the training textual data includes calculating the joint probability of a particular string according to the count number of the particular string in the training textual data and the total number of characters in the training textual data.

3. The method according to claim 1, wherein the determining whether the particular string of the identified strings corresponds to the candidate neologism comprises: determining that the particular string corresponds to the candidate neologism when the joint probability of the particular string is greater than a product of the independent probabilities of various characters of the particular string.

4. The method according to claim 1, wherein each string comprises at least two consecutive characters.

5. The method according to claim 1, further comprising: after the particular string is determined to correspond to the candidate neologism, determining the joint probability of the candidate neologism according to the joint probability of the particular string and a pattern of the candidate neologism in the training textual data.

6. The method according to claim 5, wherein the determining the joint probability of the candidate neologism comprises: estimating a time required for reading from a training start position in the training textual data to a position of the candidate neologism, to obtain a forward time; estimating a time required for reading from the position of the candidate neologism to a training end position in the training textual data, to obtain a backward time; and updating the joint probability of the candidate neologism by using a preset exponential decay function according to the forward time and the backward time.

7. The method according to claim 6, wherein the exponential decay function is constructed according to an Ebbinghaus forgetting curve.

8. The method according to claim 6, wherein the estimating the time required for reading from the training start position in the training textual data to the position of the candidate neologism comprises: calculating a distance between the training start position in the training textual data to the position of the candidate neologism, to obtain a first distance; and dividing the first distance by a preset reading speed, to obtain the forward time.

9. The method according to claim 6, wherein the estimating the time required for reading from the position of the candidate neologism to the training end position in the training textual data comprises: calculating a distance between the position of the candidate neologism to the training end position in the training textual data, to obtain a second distance; and dividing the second distance by a preset reading speed, to obtain the backward time.

10. A text information processing apparatus, comprising: processing circuitry configured to: determine training textual data according to text information; identify characters and strings from the training textual data; calculate for each of the identified characters a respective independent probability of appearance among the training textual data; calculate for each of the identified strings a respective joint probability of appearance among the training textual data; determine whether a particular string of the identified strings corresponds to a candidate neologism according to independent probabilities of various characters of the particular string and the joint probability of the particular string; and after the particular string is determined to correspond to the candidate neologism, determine the candidate neologism as a neologism when the candidate neologism is not in a preset dictionary and a joint probability of the candidate neologism is greater than a preset threshold.

11. The apparatus according to claim 10, wherein the processing circuitry is further configured to: collect respective count numbers of the identified characters in the training textual data, respective count numbers of the identified strings in the training textual data, and a total number of characters in the training textual data; calculate the independent probability of a particular character according to the count number of the particular character in the training textual data and the total number of characters in the training textual data; and calculate the joint probability of a particular string according to the count number of the particular string in the training textual data and the total number of characters in the training textual data.

12. The apparatus according to claim 10, wherein the processing circuitry is further configured to: determine that the particular string correspond to the candidate neologism when the joint probability of the particular string is greater than a product of the independent probabilities of various characters of the particular string.

13. The apparatus according to claim 10, wherein each string comprises at least two consecutive characters.

14. The apparatus according to claim 10, wherein the processing circuitry is further configured to: after the particular string is determined to correspond to the candidate neologism, determine the joint probability of the candidate neologism according to the joint probability of the particular string and a pattern of the candidate neologism in the training textual data.

15. The apparatus according to claim 14, wherein the processing circuitry is further configured to: estimate a time required for reading from a training start position in the training textual data to a position of the candidate neologism, to obtain a forward time; esti
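
Claims 6 to 9 describe adjusting a candidate's joint probability by a reading-time estimate and an exponential decay. The sketch below illustrates that idea only. The claim text shown here states that each time is a distance divided by a preset reading speed, but it does not give the decay function or how the forward and backward times are combined. The averaging of two decay terms, the `decay_rate` parameter, and the function names are therefore assumptions made for illustration.

```python
import math


def reading_time(distance_chars, reading_speed):
    """Time to read a span: distance divided by a preset reading speed."""
    if reading_speed <= 0:
        raise ValueError("reading_speed must be positive")
    return distance_chars / reading_speed


def decayed_probability(joint_prob, position, start, end,
                        reading_speed, decay_rate):
    """Update a candidate's joint probability with an exponential decay.

    position, start, and end are character offsets in the training text.
    """
    # Forward time: from the training start position to the candidate.
    forward_time = reading_time(position - start, reading_speed)
    # Backward time: from the candidate to the training end position.
    backward_time = reading_time(end - position, reading_speed)

    # ASSUMPTION: the weight is the mean of two exponential decay terms.
    # The claims only say a preset exponential decay function is applied
    # according to both times; this particular combination is hypothetical.
    weight = 0.5 * (math.exp(-decay_rate * forward_time)
                    + math.exp(-decay_rate * backward_time))
    return joint_prob * weight


if __name__ == "__main__":
    # A candidate at offset 50 in a 100-character text, read at 10 chars/s.
    print(decayed_probability(0.4, 50, 0, 100, 10.0, 0.1))
```

With a decay rate of zero the probability is left unchanged, which makes the decay easy to switch off when comparing against the unadjusted joint probability.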

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • G06F40/289 (Primary) · Phrasal analysis, e.g. finite state techniques or chunking · CPC title

  • G06F40/216 (Primary) · using statistical methods · CPC title

  • G06F40/242 (Primary) · Dictionaries · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2018217979A1 cover?
The disclosure provides a text information processing method. Training textual data is determined according to text information, and characters and strings are identified from the training textual data. For each of the identified characters, a respective independent probability of appearance among the training textual data is calculated. For each of the identified strings, a respective joint pr…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/289. Mapped technology areas include Physics.
When was this patent published?
Published Aug 2, 2018 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).