Systems and methods for language detection

US10699073B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10699073-B2
Application number: US-201816210405-A
Country: US
Kind code: B2
Filing date: Dec 5, 2018
Priority date: Oct 17, 2014
Publication date: Jun 30, 2020
Grant date: Jun 30, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations of the present disclosure are directed to a method, a system, and a computer program storage device for identifying a language in a message. Non-language characters are removed from a text message to generate a sanitized text message. An alphabet and/or a script are detected in the sanitized text message by performing at least one of (i) an alphabet-based language detection test to determine a first set of scores and (ii) a script-based language detection test to determine a second set of scores. Each score in the first set of scores represents a likelihood that the sanitized text message includes the alphabet for one of a plurality of different languages. Each score in the second set of scores represents a likelihood that the sanitized text message includes the script for one of the plurality of different languages. The language in the sanitized text message is identified based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.
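The flow described in the abstract (sanitize, score by alphabet, score by script, combine) can be sketched as follows. This is a minimal illustration and not the patented implementation: the language list, the `SCRIPT_LANGUAGES` and `ALPHABETS` tables, the function names, and the simple averaging of the two score sets are all assumptions made for the example. Scripts are inferred from the first word of each character's Unicode name, which is a convenient shortcut rather than a method taken from the patent.

```python
import unicodedata

# Hypothetical lookup tables; a real system would cover many more languages.
SCRIPT_LANGUAGES = {"LATIN": ["en", "es"], "CYRILLIC": ["ru", "uk"]}
ALPHABETS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "es": set("abcdefghijklmnñopqrstuvwxyzáéíóúü"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
    "uk": set("абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"),
}


def sanitize(message):
    """Replace everything that is not a letter or combining mark with a
    space (emoji, punctuation, digits, carriage returns), then collapse
    runs of whitespace."""
    message = unicodedata.normalize("NFC", message)
    kept = (ch if unicodedata.category(ch)[0] in "LM" else " " for ch in message)
    return " ".join("".join(kept).split())


def _letters(text):
    """Lower-cased non-space characters of an already sanitized text."""
    return [ch for ch in text.lower() if not ch.isspace()]


def alphabet_scores(text):
    """Fraction of letters that belong to each language's alphabet."""
    letters = _letters(text)
    total = len(letters) or 1
    return {
        lang: sum(ch in alphabet for ch in letters) / total
        for lang, alphabet in ALPHABETS.items()
    }


def script_scores(text):
    """Fraction of letters whose script is used by each language."""
    letters = _letters(text)
    counts = {lang: 0 for lang in ALPHABETS}
    for ch in letters:
        # Assumption: the first word of the Unicode name identifies the script.
        script = unicodedata.name(ch, "").split(" ")[0]
        for lang in SCRIPT_LANGUAGES.get(script, []):
            counts[lang] += 1
    total = len(letters) or 1
    return {lang: n / total for lang, n in counts.items()}


def identify(message):
    """Return the language with the highest averaged score, or None when
    the sanitized message contains no letters."""
    text = sanitize(message)
    if not text:
        return None
    first, second = alphabet_scores(text), script_scores(text)
    combined = {lang: (first[lang] + second[lang]) / 2 for lang in ALPHABETS}
    return max(combined, key=combined.get)
```

In this sketch the script test alone cannot separate Russian from Ukrainian, since both use Cyrillic; the alphabet test breaks the tie because a letter such as "і" appears only in the Ukrainian table. For example, `identify("Привіт, як справи? 😀")` selects `"uk"`.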

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: removing non-language characters from a text message to generate a sanitized text message; performing a plurality of language detection tests on the sanitized text message, wherein each language detection test determines a respective set of scores, and wherein each score in the set of scores represents a likelihood that the sanitized text message is in a respective language of a plurality of different languages; providing one or more combinations of the score sets as input to a plurality of classifiers, wherein each classifier is trained using outputs from different combinations of the language detection tests; obtaining as output from at least one of the plurality of classifiers a respective confidence score that the sanitized text message is in one of a plurality of different languages; and identifying the language of the sanitized text message based on one of the confidence scores.

2. The method of claim 1, wherein the non-language characters comprise at least one of an emoji, a punctuation mark, an extra space, a carriage return, and a numerical character.

3. The method of claim 1, wherein each language detection test comprises one of a byte n-gram language detection test, a dictionary-based language detection test, an alphabet-based language detection test, a script-based language detection test, and a user language profile language detection test.

4. The method of claim 1, wherein the plurality of language detection tests are performed substantially simultaneously.

5. The method of claim 1, wherein the one or more combinations of the score sets comprise score sets from at least one of a script-based language detection test and an alphabet-based language detection test.

6. The method of claim 1, wherein the one or more combinations of the score sets comprise score sets from a byte n-gram language detection test and a dictionary-based language detection test.

7. The method of claim 1, wherein the score sets comprise at least one score from a user language profile language detection test that identifies a language preference from a user based on previous text messages authored by the user.

8. The method of claim 1, wherein each classifier comprises one of a supervised learning model, a partially supervised learning model, an unsupervised learning model, and an interpolation.

9. The method of claim 1, wherein identifying the language of the sanitized text message comprises: selecting the confidence score based on an expected language detection accuracy.

10. The method of claim 1, wherein identifying the language of the sanitized text message comprises: selecting the confidence score based on a linguistic domain of the sanitized text message.

11. A system, comprising: one or more computer processors programmed to perform operations to: remove non-language characters from a text message to generate a sanitized text message; perform a plurality of language detection tests on the sanitized text message, wherein each language detection test determines a respective set of scores, and wherein each score in the set of scores represents a likelihood that the sanitized text message is in a respective language of a plurality of different languages; provide one or more combinations of the score sets as input to a plurality of classifiers, wherein each classifier is trained using outputs from different combinations of the language detection tests; obtain as output from at least one of the plurality of classifiers a respective confidence score that the sanitized text message is in one of a plurality of different languages; and identify the language of the sanitized text message based on one of the confidence scores.

12. The system of claim 11, wherein the non-language characters comprise at least one of an emoji, a punctuation mark, an extra space, a carriage return, and a numerical character.

13. The system of claim 11, wherein each language detection test comprises one of a byte n-gram language detection test, a dictionary-based language detection test, an alphabet-based language detection test, a script-based language detection test, and a user language profile language detection test.

14. The system of claim 11, wherein the one or more combinations of the score sets comprise score sets from at least one of a script-based language detection test and an alphabet-based language detection test.

15. The system of claim 11, wherein the one or more combinations of the score sets comprise score sets from a byte n-gram language detection test and a dictionary-based language detection test.

16. The system of claim 11, wherein the score sets comprise at least one score from a user language profile language detection test that identifies a language preference from a user based on previous text messages authored by the user.

17. The system of claim 11, wherein each classifier comprises one of a supervised learning model, a partially supervised learning model, an unsupervised learning model, and an interpolation.

18. The system of claim 11, wherein to identify the language of the sanitized text message the one or more computer processors are further to: select the confidence score based on an expected language detection accuracy.

19. The system of claim 11, wherein to identify the language of the sanitized text message the one or more computer processors are further to: select the confidence score based on a linguistic domain of the sanitized text message.

20. A non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computer processors, cause the one or more computer processors to: remove non-language characters from a text message to generate a sanitized text message; perform a plurality of language detection tests on the sanitized text message, wherein each language detection test determines a respective set of scores, and wherein each score in the set of scores represents a likelihood that the sanitized text message is in a respective language of a plurality of different languages; provide one or more combinations of the score sets as input to a plurality of classifiers, wherein each classifier is trained using outputs from different combinations of the language detection tests; obtain as output from at least one of the plurality of classifiers a respective confidence score that the sanitized text message is in one of a plurality of different languages; and identify the language of the sanitized text message based on one of the confidence scores.
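The classifier stage recited in the independent claims can be pictured with a small sketch. It assumes the score sets from the individual detection tests are already available as dictionaries, uses a plain weighted interpolation as the classifier (one of the options listed in claims 8 and 17), and picks among classifier outputs by expected accuracy (as in claims 9 and 18). The class name, the hand-set weights, the `expected_accuracy` values, and the sample scores are all hypothetical; in the patent the classifiers are trained, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

ScoreSet = Dict[str, float]  # language code -> likelihood from one test


@dataclass
class InterpolationClassifier:
    tests: Tuple[str, ...]      # which detection tests feed this classifier
    weights: Tuple[float, ...]  # interpolation weights (assumed to sum to 1)
    expected_accuracy: float    # hypothetical, e.g. measured on held-out data

    def predict(self, score_sets: Dict[str, ScoreSet]) -> Tuple[str, float]:
        """Blend the selected score sets and return (language, confidence)."""
        combined: ScoreSet = {}
        for test, weight in zip(self.tests, self.weights):
            for lang, score in score_sets[test].items():
                combined[lang] = combined.get(lang, 0.0) + weight * score
        best = max(combined, key=combined.get)
        return best, combined[best]


def identify_language(
    score_sets: Dict[str, ScoreSet],
    classifiers: List[InterpolationClassifier],
) -> Tuple[str, float]:
    """Use the output of the classifier with the highest expected accuracy."""
    chosen = max(classifiers, key=lambda c: c.expected_accuracy)
    return chosen.predict(score_sets)


# Illustrative inputs only; these numbers are made up for the example.
score_sets = {
    "byte_ngram": {"en": 0.6, "fr": 0.4},
    "dictionary": {"en": 0.3, "fr": 0.7},
    "script": {"en": 1.0, "fr": 1.0},
}
classifiers = [
    InterpolationClassifier(("byte_ngram", "dictionary"), (0.5, 0.5), 0.90),
    InterpolationClassifier(("script",), (1.0,), 0.60),
]
language, confidence = identify_language(score_sets, classifiers)
```

Here each classifier consumes a different combination of test outputs, mirroring the claim language. Selecting by linguistic domain (claims 10 and 19) would replace the `expected_accuracy` key with a lookup keyed on the message's domain.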

Assignees

MZ IP Holdings, LLC

Inventors

Classifications

  • G06F40/263 (Primary)

    Language identification · CPC title

  • G06F40/232 (Primary)

    Orthographic correction, e.g. spell checking or vowelisation · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10699073B2 cover?
Implementations of the present disclosure are directed to a method, a system, and a computer program storage device for identifying a language in a message. Non-language characters are removed from a text message to generate a sanitized text message. An alphabet and/or a script are detected in the sanitized text message by performing at least one of (i) an alphabet-based language detection test…
Who is the assignee on this patent?
MZ IP Holdings, LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/263. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 30, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).