Systems and methods for language detection

US10162811B2 · US · B2

Patent metadata
Publication number: US-10162811-B2
Application number: US-201615283646-A
Country: US
Kind code: B2
Filing date: Oct 3, 2016
Priority date: Oct 17, 2014
Publication date: Dec 25, 2018
Grant date: Dec 25, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations of the present disclosure are directed to a method, a system, and a computer program storage device for identifying a language in a message. Non-language characters are removed from a text message to generate a sanitized text message. An alphabet and/or a script are detected in the sanitized text message by performing at least one of (i) an alphabet-based language detection test to determine a first set of scores and (ii) a script-based language detection test to determine a second set of scores. Each score in the first set of scores represents a likelihood that the sanitized text message includes the alphabet for one of a plurality of different languages. Each score in the second set of scores represents a likelihood that the sanitized text message includes the script for one of the plurality of different languages. The language in the sanitized text message is identified based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.
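The first two steps of the abstract (sanitizing the message, then scoring it by script) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `SCRIPT_LANGUAGES` table, the function names, and the use of Unicode character names as a stand-in for a script test are all assumptions made for this example.

```python
import unicodedata
from collections import Counter
from typing import Dict

# Hypothetical table: Unicode name prefix -> candidate languages (assumption).
SCRIPT_LANGUAGES = {
    "LATIN": ["en", "fr", "es"],
    "CYRILLIC": ["ru", "uk"],
    "HANGUL": ["ko"],
    "HIRAGANA": ["ja"],
    "KATAKANA": ["ja"],
    "CJK": ["zh", "ja"],
}


def sanitize(message: str) -> str:
    """Drop emoji, punctuation, digits and control characters.

    Only letters (category L*) and combining marks (category M*) are kept;
    everything else becomes a space, and runs of spaces are collapsed.
    """
    kept = [
        ch if unicodedata.category(ch)[0] in ("L", "M") else " "
        for ch in message
    ]
    return " ".join("".join(kept).split())


def script_scores(sanitized: str) -> Dict[str, float]:
    """Score each candidate language by the share of characters in its script."""
    counts: Counter = Counter()
    total = 0
    for ch in sanitized:
        if ch.isspace():
            continue
        total += 1
        name = unicodedata.name(ch, "")
        for script, langs in SCRIPT_LANGUAGES.items():
            if name.startswith(script):
                for lang in langs:
                    counts[lang] += 1
                break
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


clean = sanitize("Привет, мир! 😀 123")
print(clean)                 # sanitized text message
print(script_scores(clean))  # per-language script scores
```

A script test alone cannot separate languages that share a script (Russian and Ukrainian both score 1.0 here), which is why the disclosure combines it with other tests.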

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of identifying a language in a message, the method comprising: obtaining a text message generated by a user; removing non-language characters from the text message to generate a sanitized text message; detecting an alphabet and a script present in the sanitized text message, wherein (i) detecting the alphabet comprises performing an alphabet-based language detection test to determine a first set of scores, and wherein each score in the first set of scores represents a likelihood that the sanitized text message comprises the alphabet for one of a plurality of different languages, and (ii) detecting the script comprises performing a script-based language detection test to determine a second set of scores, and wherein each score in the second set of scores represents a likelihood that the sanitized text message comprises the script for one of the plurality of different languages; providing one or more combinations of the first and second sets of scores as input to one or more classifiers including a first classifier and a second classifier, wherein the first classifier was trained using outputs from a first combination of language detection tests and the second classifier was trained using outputs from a second combination of language detection tests; obtaining as output from at least one of the one or more classifiers a respective confidence score that the sanitized text message is in one of a plurality of different languages; and identifying the language in the sanitized text message based on the confidence score from at least one of the one or more classifiers.

2. The method of claim 1, wherein the non-language characters comprise at least one of an emoji, a punctuation mark, an extra space, a carriage return, and a numerical character.

3. The method of claim 1, wherein the one or more combinations comprise an interpolation between the first and second sets of scores.

4. The method of claim 1, wherein providing one or more combinations of the first and second sets of scores comprises: performing a language detection test on the sanitized text message to generate a third set of scores, wherein each score in the third set of scores represents a likelihood that the sanitized text message comprises one of a plurality of different languages.

5. The method of claim 4, wherein the language detection test is selected from a plurality of language detection tests based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.

6. The method of claim 4, wherein the language detection test comprises a language detection method.

7. The method of claim 6, wherein the language detection method comprises at least one of a dictionary-based language detection test, an n-gram language detection test, and a user language profile language detection test.

8. The method of claim 4, comprising: processing the third set of scores using the one or more classifiers to identify the language in the sanitized text message.

9. The method of claim 1, wherein the one or more classifiers comprise at least one of a supervised learning model, a partially supervised learning model, an unsupervised learning model, and an interpolation.

10. A computer-implemented system for identifying a language in a message, comprising: one or more computer processors programmed to implement a sanitizer module, a grouper module, and a language detector module, wherein the sanitizer module obtains a text message generated by a user and removes non-language characters from the text message to generate a sanitized text message, wherein the grouper module detects an alphabet and a script present in the sanitized text message, and wherein (i) detecting the alphabet comprises performing an alphabet-based language detection test to determine a first set of scores, and wherein each score in the first set of scores represents a likelihood that the sanitized text message comprises the alphabet for one of a plurality of different languages, and (ii) detecting the script comprises performing a script-based language detection test to determine a second set of scores, and wherein each score in the second set of scores represents a likelihood that the sanitized text message comprises the script for one of the plurality of different languages, and wherein the language detector module is operable to perform operations comprising: providing one or more combinations of the first and second sets of scores as input to one or more classifiers including a first classifier and a second classifier, wherein the first classifier was trained using outputs from a first combination of language detection tests and the second classifier was trained using outputs from a second combination of language detection tests; obtaining as output from at least one of the one or more classifiers a respective confidence score that the sanitized text message is in one of a plurality of different languages; and identifying the language in the sanitized text message based on the confidence score from at least one of the one or more classifiers.

11. The system of claim 10, wherein the non-language characters comprise at least one of an emoji, a punctuation mark, an extra space, a carriage return, and a numerical character.

12. The system of claim 10, wherein the one or more combinations comprise an interpolation between the first and second sets of scores.

13. The system of claim 10, wherein the grouper module is operable to perform operations comprising: selecting the language detector module from a plurality of language detector modules based on at least one of the first set of scores, the second set of scores, and a combination of the first and second sets of scores.

14. The system of claim 10, wherein the language detector module comprises: a language detection methods module operable to perform operations comprising: performing a language detection test on the sanitized text message to generate a third set of scores, wherein each score in the third set of scores represents a likelihood that the sanitized text message comprises one of a plurality of different languages.

15. The system of claim 14, wherein the language detection test comprises at least one of a dictionary-based language detection test, an n-gram language detection test, and a user language profile language detection test.

16. The system of claim 14, wherein the language detector module comprises: a classifier module operable to perform operations comprising: processing the third set of scores using the one or more classifiers to identify the language in the sanitized text message.

17. The system of claim 16, wherein the classifier module is operable to perform operations comprising: outputting an indication that the sanitized text message is in the identified language, wherein the indication comprises a confidence score.

18. The system of claim 10, wherein the one or more classifiers comprise at least one of a supervised learning model, a partially supervised learning model, an unsupervised learning model, and an interpolation.

19. An article, comprising: a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computers, cause the computers to perform operations comprising: obtaining a text message generated by a user; removing non-language characters from the text message to generate a sanitized text message; detecting an alphabet and a s
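Claims 3 and 12 mention an interpolation between the first (alphabet) and second (script) sets of scores. A minimal sketch of one way to do that is a per-language linear interpolation, followed by picking the top-scoring language. The function names, the `weight` parameter, and the normalized-share "confidence" are assumptions for illustration; in the claims the combined scores are fed to trained classifiers, which this sketch does not model.

```python
from typing import Dict, Tuple


def interpolate(
    first: Dict[str, float],
    second: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Per-language linear interpolation: weight * first + (1 - weight) * second.

    A language missing from one score set is treated as scoring 0.0 there.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    languages = set(first) | set(second)
    return {
        lang: weight * first.get(lang, 0.0)
        + (1.0 - weight) * second.get(lang, 0.0)
        for lang in languages
    }


def identify(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the top-scoring language and its share of the total score.

    The share is only a stand-in for the classifier confidence score
    described in the claims (assumption).
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("no positive scores to choose from")
    best = max(scores, key=lambda lang: scores[lang])
    return best, scores[best] / total


# Hypothetical score sets for one sanitized message.
alphabet_scores = {"ru": 0.9, "uk": 0.6}
script_scores = {"ru": 1.0, "uk": 1.0}

combined = interpolate(alphabet_scores, script_scores, weight=0.5)
language, confidence = identify(combined)
print(language, round(confidence, 3))
```

Here the script test cannot separate the two Cyrillic-script languages, so the alphabet scores decide the outcome after interpolation.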

Assignees

MZ IP Holdings, LLC

Inventors

Classifications

  • G06F40/263 · Primary

    Language identification · CPC title

  • G06F40/232 · Primary

    Orthographic correction, e.g. spell checking or vowelisation · CPC title

  • Physics · mapped topic

  • G06F17/275 · Primary

    Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10162811B2 cover?
Implementations of the present disclosure are directed to a method, a system, and a computer program storage device for identifying a language in a message. Non-language characters are removed from a text message to generate a sanitized text message. An alphabet and/or a script are detected in the sanitized text message by performing at least one of (i) an alphabet-based language detection test…
Who is the assignee on this patent?
MZ IP Holdings, LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/263. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 25, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).