System and method for text normalization using atomic tokens
US-2016125872-A1 · May 5, 2016 · US
US10769387B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10769387-B2 |
| Application number | US-201816135493-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 19, 2018 |
| Priority date | Sep 21, 2017 |
| Publication date | Sep 8, 2020 |
| Grant date | Sep 8, 2020 |
Official abstract text for this publication.
Implementations of the present disclosure are directed to a method, a system, and an article for translating chat messages. An example method can include: receiving an electronic text message from a client device of a user; normalizing the electronic text message to generate a normalized text message; tagging at least one phrase in the normalized text message with a marker to generate a tagged text message, the marker indicating that the at least one phrase will be translated using a rule-based system; translating the tagged text message using the rule-based system and a machine translation system to generate an initial translation; and post-processing the initial translation to generate a final translation.
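The flow described in the abstract (tag, translate with two systems, post-process) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `<rb>` marker name, the `RULES` table and its entries, and the `mt` callable standing in for the machine translation system are all hypothetical.

```python
import re
from typing import Callable

# Hypothetical rule table: source phrase -> fixed target-language rendering.
RULES = {"gg": "buena partida", "brb": "ya vuelvo"}

# Hypothetical XML-style marker wrapping phrases reserved for the rule-based system.
MARKED = re.compile(r"<rb>(.*?)</rb>")


def tag_phrases(text: str) -> str:
    """Lower-case, split on whitespace, and wrap rule-table phrases in a marker."""
    tokens = text.lower().split()
    return " ".join(f"<rb>{t}</rb>" if t in RULES else t for t in tokens)


def translate_tagged(tagged: str, mt: Callable[[str], str]) -> str:
    """Send marked phrases to the rule table and everything else to `mt`."""
    parts, pos = [], 0
    for match in MARKED.finditer(tagged):
        before = tagged[pos:match.start()].strip()
        if before:
            parts.append(mt(before))
        parts.append(f"<rb>{RULES[match.group(1)]}</rb>")
        pos = match.end()
    tail = tagged[pos:].strip()
    if tail:
        parts.append(mt(tail))
    return " ".join(parts)


def post_process(initial: str) -> str:
    """Remove markers and re-attach punctuation split off during tokenization."""
    text = MARKED.sub(r"\1", initial)
    return re.sub(r"\s+([,.!?])", r"\1", text)


def translate_message(text: str, mt: Callable[[str], str]) -> str:
    """Tag, translate, and post-process an already normalized message."""
    return post_process(translate_tagged(tag_phrases(text), mt))
```

With a stand-in such as `mt=str.upper`, the marked phrases come from the rule table and the remaining text passes through the stand-in translator, which makes the split between the two systems easy to see.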
Opening claim text (preview).
What is claimed is:

1. A method, comprising: receiving an electronic text message from a client device of a user; normalizing the electronic text message to generate a normalized text message comprising characters having a consistent width, the normalizing comprising at least one of: converting a full-width character into a half-width character; or converting a half-width character into a full-width character; tagging at least one phrase in the normalized text message with a marker to generate a tagged text message, the marker indicating that the at least one phrase will be translated using a rule-based system; translating the tagged text message using the rule-based system and a machine translation system to generate an initial translation, wherein translating the tagged text message comprises (i) translating the at least one phrase using the rule-based system and (ii) translating other words or phrases using the machine translation system, and wherein the machine translation system is trained using training data comprising characters having a consistent width; and post-processing the initial translation to generate a final translation.

2. The method of claim 1, wherein receiving the electronic text message comprises: splitting the electronic text message into discrete sentences.

3. The method of claim 1, wherein normalizing the electronic text message comprises at least one of: converting a control character into a space; converting a UNICODE space character into an ASCII space character; converting a Japanese character with Dakuten or Handakuten from a two-UNICODE point representation to a one-UNICODE point representation; replacing an XML related character; or replacing a special character utilized by the machine translation system.

4. The method of claim 1, wherein the normalized message comprises characters having a consistent form.

5. The method of claim 1, wherein the marker comprises an XML marker.

6. The method of claim 1, wherein tagging the at least one phrase comprises tokenizing the normalized text message into discrete words.

7. The method of claim 1, wherein tagging the at least one phrase comprises converting at least one upper case character in the normalized text message to a lower case character.

8. The method of claim 1, wherein the machine translation system comprises a statistical machine translator.

9. The method of claim 1, wherein post-processing the initial translation comprises at least one of: detokenizing the initial translation; removing the marker from the initial translation; or reintroducing into the initial translation a special character used by the machine translation system.

10. A system, comprising: one or more computer processors programmed to perform operations comprising: receiving an electronic text message from a client device of a user; normalizing the electronic text message to generate a normalized text message comprising characters having a consistent width, the normalizing comprising at least one of: converting a full-width character into a half-width character; or converting a half-width character into a full-width character; tagging at least one phrase in the normalized text message with a marker to generate a tagged text message, the marker indicating that the at least one phrase will be translated using a rule-based system; translating the tagged text message using the rule-based system and a machine translation system to generate an initial translation, wherein translating the tagged text message comprises (i) translating the at least one phrase using the rule-based system and (ii) translating other words or phrases using the machine translation system, and wherein the machine translation system is trained using training data comprising characters having a consistent width; and post-processing the initial translation to generate a final translation.

11. The system of claim 10, wherein receiving the electronic text message comprises: splitting the electronic text message into discrete sentences.

12. The system of claim 10, wherein normalizing the electronic text message comprises at least one of: converting a control character into a space; converting a UNICODE space character into an ASCII space character; converting a Japanese character with Dakuten or Handakuten from a two-UNICODE point representation to a one-UNICODE point representation; replacing an XML related character; or replacing a special character utilized by the machine translation system.

13. The system of claim 10, wherein the normalized message comprises characters having a consistent form.

14. The system of claim 10, wherein tagging the at least one phrase comprises tokenizing the normalized text message into discrete words.

15. The system of claim 10, wherein tagging the at least one phrase comprises converting at least one upper case character in the normalized text message to a lower case character.

16. The system of claim 10, wherein the machine translation system comprises a statistical machine translator.

17. The system of claim 10, wherein post-processing the initial translation comprises at least one of: detokenizing the initial translation; removing the marker from the initial translation; or reintroducing into the initial translation a special character used by the machine translation system.

18. An article, comprising: a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computer processors, cause the computer processors to perform operations comprising: receiving an electronic text message from a client device of a user; normalizing the electronic text message to generate a normalized text message comprising characters having a consistent width, the normalizing comprising at least one of: converting a full-width character into a half-width character; or converting a half-width character into a full-width character; tagging at least one phrase in the normalized text message with a marker to generate a tagged text message, the marker indicating that the at least one phrase will be translated using a rule-based system; translating the tagged text message using the rule-based system and a machine translation system to generate an initial translation, wherein translating the tagged text message comprises (i) translating the at least one phrase using the rule-based system and (ii) translating other words or phrases using the machine translation system, and wherein the machine translation system is trained using training data comprising characters having a consistent width; and post-processing the initial translation to generate a final translation.
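The normalization operations recited in the independent claims and in claims 3 and 12 can be illustrated with Python's standard library. This is a sketch under assumptions, not the claimed implementation: the function names and the order of the steps are hypothetical, only the full-width to half-width direction is shown, and NFC composition is one plausible way to obtain the one-code-point Dakuten and Handakuten forms.

```python
import unicodedata
from xml.sax.saxutils import escape

# Full-width ASCII variants occupy U+FF01..U+FF5E; subtracting 0xFEE0 gives
# the corresponding half-width (ASCII) character U+0021..U+007E.
FULLWIDTH_START, FULLWIDTH_END, FULLWIDTH_OFFSET = 0xFF01, 0xFF5E, 0xFEE0


def to_half_width(text: str) -> str:
    """Convert full-width ASCII variants to their half-width forms."""
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET)
        if FULLWIDTH_START <= ord(ch) <= FULLWIDTH_END
        else ch
        for ch in text
    )


def normalize_spacing(text: str) -> str:
    """Map control characters (Cc) and Unicode space separators (Zs) to ' '."""
    return "".join(
        " " if unicodedata.category(ch) in ("Cc", "Zs") else ch for ch in text
    )


def normalize_message(text: str) -> str:
    """Apply the illustrated normalization steps in an assumed order."""
    text = unicodedata.normalize("NFC", text)  # e.g. U+304B + U+3099 -> U+304C
    text = to_half_width(text)
    text = normalize_spacing(text)
    return escape(text)  # replaces &, <, > so later XML markers stay unambiguous
```

For example, `normalize_message("ＡＢＣ\u3000１２３")` yields `"ABC 123"`: the full-width letters and digits become half-width and the ideographic space becomes an ASCII space.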
Rule-based translation · CPC title
Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation · CPC title
Statistical methods, e.g. probability models · CPC title
Processing of non-Latin text (kana-to-kanji conversion G06F40/129; vowelisation G06F40/232) · CPC title
Named entity recognition · CPC title