Optimized statistical machine translation system with rapid adaptation capability

US9959271B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9959271-B1
Application number  US-201514868083-A
Country             US
Kind code           B1
Filing date         Sep 28, 2015
Priority date       Sep 28, 2015
Publication date    May 1, 2018
Grant date          May 1, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Technologies are disclosed herein for statistical machine translation. In particular, the disclosed technologies include extensions to conventional machine translation pipelines: the use of multiple domain-specific and non-domain-specific dynamic language translation models and language models; cluster-based language models; and large-scale discriminative training. Incremental update technologies are also disclosed for use in updating a machine translation system in four areas: word alignment; translation modeling; language modeling; and parameter estimation. A mechanism is also disclosed for training and utilizing a runtime machine translation quality classifier for estimating the quality of machine translations without the benefit of reference translations. The runtime machine translation quality classifier is generated in a manner to offset imbalances in the number of training instances in various classes, and to assign a greater penalty to the misclassification of lower-quality translations as higher-quality translations than to misclassification of higher-quality translations as lower-quality translations.
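
The two properties the abstract attributes to the quality classifier (offsetting class imbalance, and penalizing over-rating more than under-rating) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the inverse-frequency weighting formula, the penalty values, and the function names are hypothetical and are not taken from the patent. The class names follow claim 2. The patent describes modifying the misclassification cost and loss function used in training; this sketch only shows the same asymmetry applied when choosing a class from predicted probabilities.

```python
from collections import Counter

# Quality classes ordered from lowest to highest (names follow claim 2).
CLASSES = ["residual", "understandable", "near_perfect"]


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count).

    This common heuristic is an assumption, not the patent's formula.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def asymmetric_cost(true_idx, pred_idx, over_penalty=3.0, under_penalty=1.0):
    """Cost of predicting pred_idx when the true class is true_idx.

    Over-rating (pred > true) costs more per class step than under-rating.
    The penalty values are illustrative assumptions.
    """
    gap = pred_idx - true_idx
    if gap > 0:
        return over_penalty * gap
    return under_penalty * (-gap)


def min_expected_cost_class(probs):
    """Return the class index with the lowest expected cost.

    probs[i] is the estimated probability that the true class is CLASSES[i].
    """
    def expected(pred_idx):
        return sum(p * asymmetric_cost(t, pred_idx) for t, p in enumerate(probs))

    return min(range(len(probs)), key=expected)
```

With probabilities [0.3, 0.3, 0.4] over (residual, understandable, near_perfect), the most probable class is near_perfect, but the expected costs under these assumed penalties are roughly 1.1, 1.3, and 2.7, so the cost-sensitive choice is residual. That is the conservative behavior the abstract describes.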

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus, comprising: one or more processors; and one or more non-transitory computer-readable storage media having instructions stored thereupon which are executable by the one or more processors and which, when executed, cause the apparatus to: determine a number of out-of-vocabulary words in input text segments; generate an estimated difficulty feature score of a supervised machine learning model in translating the input text segments based, at least in part, on the number of out-of-vocabulary words; modify a misclassification cost associated with the supervised machine learning model, stored in a memory, to offset an imbalance between a plurality of classes of training data utilized to train a machine translation quality classifier to classify a quality of machine translated text segments, the training data comprising one or more feature scores including the estimated difficulty feature score; modify a loss function associated with the supervised machine learning model stored in the memory to penalize a misclassification of a lower-quality text segment as a higher-quality text segment more greatly than a misclassification of a higher-quality text segment as a lower-quality text segment; train the machine translation quality classifier utilizing the supervised machine learning model based, at least in part on the misclassification cost and the loss function; cause the machine translation quality classifier to be deployed to a computer in a service provider network; and utilize the machine translation quality classifier is utilized to classify a quality of translated segments received from a machine translation system operating in the service provider network into one of the plurality of classes in real time.

2. The apparatus of claim 1, wherein the plurality of classes comprises a perfect or near perfect class, an understandable class, and a residual class.

3. The apparatus of claim 1, wherein the non-transitory computer-readable storage media has further instructions stored thereupon to: aggregate a plurality of classifications for translated segments; and generate one or more document-level or corpus-level distribution statistics based upon the aggregated classifications.

4. The apparatus of claim 1, wherein the one or more feature scores are associated with machine translated text segments in a target language and correct class labels for the machine translated text segments in the target language.

5. The apparatus of claim 4, wherein the correct class labels for the machine translated text segments in the target language are generated, at least in part, based upon a translation edit rate (“TER”) between the machine translated text segments in the target language and associated reference translations.

6. The apparatus of claim 4, wherein the one or more feature scores associated with the machine translated text segments in the target language comprise one or more of: a fluency of the machine translated text segments, a level of ambiguity experienced by the machine translation system in translating the input text segments, a difference in length or punctuation between the input text segments and the machine translated text segments, or one or more statistical confidence measures generated by the machine translation system for the machine translated text segments.

7. A computer-implemented method for classifying a quality of translated segments generated by a machine translation system, the method comprising: generating an estimated difficulty feature score of a supervised machine learning model in translating input text segments based, at least in part, on a number of out-of-vocabulary words in the input text segments; training a machine translation quality classifier stored in a memory to classify the quality of the translated segments utilizing the supervised machine learning model configured with a misclassification cost configured to offset an imbalance between a plurality of classes of training data, the training data comprising one or more feature scores associated with machine translated segments of a target language and correct class labels for the machine translated segments in the target language, the one or more feature scores including the estimated difficulty feature score; and a loss function configured to penalize a misclassification of a lower-quality translated segment as a higher-quality translated segment more greatly than a misclassification of a higher-quality translated segment as a lower-quality translated segment; and utilizing the machine translation quality classifier at a computer in a service provider network to classify the quality of the translated segments generated by the machine translation system into the plurality of classes.

8. The computer-implemented method of claim 7, wherein the correct class labels for the machine translated segments in the target language are generated, at least in part, based upon a translation edit rate (“TER”) between the machine translated segments in the target language and associated reference translations.

9. The computer-implemented method of claim 7, wherein the one or more feature scores associated with the machine translated segments in the target language comprise one or more of: a fluency of the machine translated segments, a level of ambiguity experienced by the machine translation system in translating the input text segments, a difference in length or punctuation between the input text segments and the machine translated text segments, or one or more statistical confidence measures generated by the machine translation system for the machine translated text segments.

10. The computer-implemented method of claim 7, further comprising: aggregating a plurality of classifications for translated segments; and generating one or more document-level or corpus-level distribution statistics based upon the aggregated classifications.

11. The computer-implemented method of claim 10, further comprising initiating one or more actions based, at least in part, on the document-level or corpus-level distribution statistics.

12. The computer-implemented method of claim 7, wherein the plurality of classes comprises a perfect or near perfect class, an understandable class, and a residual class.

13. The computer-implemented method of claim 12, further comprising providing translated segments in the understandable class to an editor for post-editing.

14. The computer-implemented method of claim 12, further comprising: discarding translated segments in the residual class; and providing input segments associated with translated segments in the residual class to an editor for translation.

15. The computer-implemented method of claim 12, further comprising: discarding translated segments in the residual class; and retranslating input segments associated with translated segments in the residual class using a dedicated cluster of instances of a statistical machine translation system configured to examine broad segments of text that are compute intensive and statistically complex.

16. The computer-implemented method of claim 12, further comprising: calculating a compute cost associated with retranslating one or more translated segments in the residual class using a dedicated cluster of instances of a statistical machine translation system configured to examine broad segments of text that are compute intensive and statistically complex; determining that the compute cost associated with translating the one or more translated segments in the residual class exceeds a cos
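
Several steps named in the claims above (counting out-of-vocabulary words, deriving class labels from TER, aggregating classifications into distribution statistics, and acting on each class) can be pictured with a short Python sketch. It is illustrative only: the whitespace tokenization, the TER thresholds, the function names, and the "use_as_is" action for the top class are all assumptions that do not appear in the patent text, and TER itself is taken as a precomputed input rather than calculated here.

```python
from collections import Counter


def oov_count(segment, vocabulary):
    """Count whitespace-separated tokens that are absent from the vocabulary.

    Lowercasing and whitespace splitting are simplifying assumptions.
    """
    return sum(1 for tok in segment.lower().split() if tok not in vocabulary)


def label_from_ter(ter, near_perfect_max=0.1, understandable_max=0.4):
    """Map a precomputed TER score to a quality class.

    Lower TER means fewer edits were needed. The thresholds are hypothetical.
    """
    if ter <= near_perfect_max:
        return "near_perfect"
    if ter <= understandable_max:
        return "understandable"
    return "residual"


def class_distribution(classifications):
    """Return the fraction of segments in each class for a document or corpus."""
    total = len(classifications)
    if total == 0:
        return {}
    counts = Counter(classifications)
    return {c: counts[c] / total for c in counts}


# Per-class actions. The "understandable" and "residual" entries paraphrase
# claims 13 to 15; "use_as_is" for the top class is an assumption.
ACTIONS = {
    "near_perfect": "use_as_is",
    "understandable": "send_to_post_editor",
    "residual": "discard_and_retranslate",
}


def action_for(quality_class):
    """Look up the follow-up action for a classified segment."""
    return ACTIONS[quality_class]
```

For example, under these assumed thresholds a segment with TER 0.3 is labeled understandable and routed to a post-editor, while a document whose segments are classified as two near_perfect, one understandable, and one residual has a distribution of 0.5, 0.25, and 0.25.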

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9959271B1 cover?
Technologies are disclosed herein for statistical machine translation. In particular, the disclosed technologies include extensions to conventional machine translation pipelines: the use of multiple domain-specific and non-domain-specific dynamic language translation models and language models; cluster-based language models; and large-scale discriminative training. Incremental update technologi…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/44. Mapped technology areas include Physics.
When was this patent published?
Publication date May 1, 2018 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).