Language segmentation of multilingual texts

US9400787B2 · US · B2

Patent metadata
Field · Value
Publication number · US-9400787-B2
Application number · US-201314073036-A
Country · US
Kind code · B2
Filing date · Nov 6, 2013
Priority date · Feb 8, 2011
Publication date · Jul 26, 2016
Grant date · Jul 26, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The claimed subject matter provides a system and/or method for segmenting a multi-language text. An exemplary method comprises determining an initial probability distribution for sentences in the multi-language text, the initial probability distribution indicating the likelihood of each sentence being in each of a set of languages. A probability of language transitions across sentences may be learned based on the initial probability distribution. Additionally, a highest probability language sequence of sentences in the multi-language text may be determined based on a combination of the probability of language transitions and the prior probability distribution provided by an initial model. Further, web documents are annotated at a sentence by sentence level such that each sentence of a web document is labeled in a given language according to the highest probability language determined.
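The decoding step described in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the function name `viterbi`, the two-language example, and all probability values are hypothetical, and treating each sentence's initial distribution as the per-sentence score inside a first-order model is an assumption made for illustration.

```python
import math


def viterbi(sentence_probs, transitions, languages):
    """Return the highest-probability language sequence, one label per sentence.

    sentence_probs: list of dicts (one per sentence), language -> probability
                    from an initial per-sentence detector (hypothetical input).
    transitions:    dict mapping (previous_language, next_language) -> probability.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][lang] is the best log score of any path ending in lang at sentence i.
    best = [{lang: log(sentence_probs[0][lang]) for lang in languages}]
    back = []  # back[i][lang] is the best previous language for sentence i + 1
    for probs in sentence_probs[1:]:
        scores, pointers = {}, {}
        for lang in languages:
            prev = max(
                languages,
                key=lambda p: best[-1][p] + log(transitions[(p, lang)]),
            )
            scores[lang] = (
                best[-1][prev] + log(transitions[(prev, lang)]) + log(probs[lang])
            )
            pointers[lang] = prev
        best.append(scores)
        back.append(pointers)

    # Trace the best path backwards from the best final language.
    path = [max(languages, key=lambda lang: best[-1][lang])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical example: the second sentence looks slightly French in isolation.
languages = ["en", "fr"]
sentence_probs = [
    {"en": 0.8, "fr": 0.2},
    {"en": 0.45, "fr": 0.55},
    {"en": 0.8, "fr": 0.2},
    {"en": 0.1, "fr": 0.9},
    {"en": 0.2, "fr": 0.8},
]
# Assumed transition probabilities that favor staying in the same language.
transitions = {
    ("en", "en"): 0.9, ("en", "fr"): 0.1,
    ("fr", "en"): 0.1, ("fr", "fr"): 0.9,
}
labels = viterbi(sentence_probs, transitions, languages)
```

In this toy input, labeling each sentence independently would mark the second sentence as French. Combining the per-sentence scores with transition probabilities that discourage switching labels it English instead, which gives two contiguous monolingual runs.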

First claim

Opening claim text (preview).

What is claimed is:

1. A method of segmenting a multi-language text, comprising: determining, using a processing unit, an initial probability distribution for sentences in a web document in the multi-language text, the initial probability distribution indicating the likelihood of each sentence being in each of a set of languages; learning, using the processing unit, a probability of language transitions across sentences based on the initial probability distribution; determining, using the processing unit, a highest probability language sequence of sentences in the multi-language text based on a combination of the probability of language transitions and a prior probability distribution provided by an initial model; and annotating web documents at a sentence by sentence level such that each sentence of a web document is labeled in a given language according to the highest probability language determined.

2. The method recited in claim 1, comprising using an automatic language detector to determine the sentences in the multi-language text.

3. The method recited in claim 1, wherein learning the probability of language transitions comprises using a hidden Markov model.

4. The method recited in claim 1, wherein learning the probability of language transitions comprises using a forward backward algorithm.

5. The method recited in claim 1, wherein determining a highest probability language sequence comprises using a Viterbi Algorithm.

6. The method recited in claim 1, comprising segmenting, using the processing unit, the multi-language text into a plurality of monolingual texts based on the highest probability language sequence.

7. The method recited in claim 1, wherein learning the probability of language transitions comprises using a second order Markov model.

8. A system for segmenting a multi-language text, the system comprising: a processing unit; and a system memory, wherein the system memory comprises code configured to direct the processing unit to: determine an initial probability distribution for sentences in the multi-language text, the initial probability distribution indicating the likelihood of each sentence being in each of a set of languages; learn a probability of language transitions across sentences based on the initial probability distribution; determine, using the processing unit, a highest probability language sequence of sentences in the multi-language text based on a combination of the probability of language transitions and a prior probability distribution provided by an initial model; and annotate web documents at a sentence by sentence level such that each sentence of a web document is labeled in a given language according to the highest probability language determined.

9. The system recited in claim 8, comprising using an automatic language detector to determine the sentences in the multi-language text.

10. The system recited in claim 8, wherein learning the probability of language transitions comprises using a hidden Markov model.

11. The system recited in claim 8, wherein learning the probability of language transitions comprises using a forward backward algorithm.

12. The system recited in claim 8, wherein determining a highest probability language sequence comprises using a Viterbi Algorithm.

13. The system recited in claim 8, comprising segmenting the multi-language text into a plurality of monolingual texts based on the highest probability language sequence.

14. The system recited in claim 8, wherein learning the probability of language transitions comprises using a second order Markov model.
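Claims 4 and 11 name a forward backward algorithm for learning the language transition probabilities. The sketch below shows one common way such an estimate can be computed for a first-order hidden Markov model, holding the per-sentence language probabilities fixed and re-estimating only the transitions. The function name `learn_transitions`, the uniform starting point, the iteration count, and the example data are all assumptions for illustration; the claims do not specify them.

```python
def learn_transitions(sentence_probs, languages, iterations=10):
    """Estimate language transition probabilities with forward-backward (EM).

    sentence_probs: list of dicts (one per sentence), language -> probability,
                    held fixed and used as per-sentence scores (an assumption).
    Returns a dict mapping (previous_language, next_language) -> probability.
    """
    n = len(languages)
    # Start from uniform transitions (hypothetical initialization).
    trans = {(a, b): 1.0 / n for a in languages for b in languages}
    for _ in range(iterations):
        # Forward pass, normalized at each sentence to avoid underflow.
        alpha = [dict(sentence_probs[0])]
        for probs in sentence_probs[1:]:
            prev = alpha[-1]
            cur = {
                b: probs[b] * sum(prev[a] * trans[(a, b)] for a in languages)
                for b in languages
            }
            z = sum(cur.values())
            alpha.append({b: v / z for b, v in cur.items()})

        # Backward pass, also normalized at each sentence.
        beta = [{a: 1.0 for a in languages}]
        for probs in reversed(sentence_probs[1:]):
            nxt = beta[0]
            cur = {
                a: sum(trans[(a, b)] * probs[b] * nxt[b] for b in languages)
                for a in languages
            }
            z = sum(cur.values())
            beta.insert(0, {a: v / z for a, v in cur.items()})

        # Expected number of times each transition is used.
        counts = {key: 0.0 for key in trans}
        for t in range(len(sentence_probs) - 1):
            xi = {
                (a, b): alpha[t][a] * trans[(a, b)]
                * sentence_probs[t + 1][b] * beta[t + 1][b]
                for a in languages
                for b in languages
            }
            z = sum(xi.values())
            for key, value in xi.items():
                counts[key] += value / z

        # Re-estimate each row so that it sums to 1.
        for a in languages:
            row = sum(counts[(a, b)] for b in languages)
            if row > 0:
                for b in languages:
                    trans[(a, b)] = counts[(a, b)] / row
    return trans


# Hypothetical document: four mostly English sentences, then four mostly French.
langs = ["en", "fr"]
doc_probs = [{"en": 0.9, "fr": 0.1}] * 4 + [{"en": 0.1, "fr": 0.9}] * 4
learned = learn_transitions(doc_probs, langs)
```

On this toy document the learned probabilities favor staying in the same language over switching, which is the kind of signal a decoder can then combine with the per-sentence distribution.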

Assignees

Inventors

Classifications

  • G06F40/58 · Primary

    Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation · CPC title

  • G06F40/263 · Primary

    Language identification · CPC title

  • G06F17/289 · Primary

    Physics · mapped topic

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9400787B2 cover?
The claimed subject matter provides a system and/or method for segmenting a multi-language text. An exemplary method comprises determining an initial probability distribution for sentences in the multi-language text, the initial probability distribution indicating the likelihood of each sentence being in each of a set of languages. A probability of language transitions across sentences may be l…
Who is the assignee on this patent?
Aue Anthony, Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/58. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 26, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).