Using multilingual lexical resources to improve lexical simplification

US10318633B2 · US · B2

Patent metadata
Publication number: US-10318633-B2
Application number: US-201715396709-A
Country: US
Kind code: B2
Filing date: Jan 2, 2017
Priority date: Jan 2, 2017
Publication date: Jun 11, 2019
Grant date: Jun 11, 2019

Abstract

Official abstract text for this publication.

An approach is provided that receives a word that belongs to a first natural language and retrieves a first set of complexity data pertaining to the word in the first natural language. The approach translates the word to one or more translated words, with each of the translated words corresponding to one or more second natural languages. The approach then retrieves sets of complexity data, with the sets of complexity data corresponding to a different translated word. The approach determines a complexity of the word in the first natural language based on an analysis of the first and second sets of complexity data.
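The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `TRANSLATIONS` and `FREQUENCIES` tables are hypothetical toy data standing in for real multilingual lexical resources, and the scoring rule in `word_complexity` (average length divided by a frequency term) is an assumption chosen only to show how the first and second sets of complexity data might be combined.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ComplexityData:
    length: int       # number of characters in the word
    frequency: float  # relative corpus frequency (toy values)


# Hypothetical toy resources; a real system would query lexicons and corpora.
TRANSLATIONS = {
    "ubiquitous": {"es": "ubicuo", "de": "allgegenwärtig"},
    "common": {"es": "común", "de": "häufig"},
}
FREQUENCIES = {
    "ubiquitous": 0.00001, "ubicuo": 0.00002, "allgegenwärtig": 0.00001,
    "common": 0.002, "común": 0.003, "häufig": 0.002,
}


def complexity_data(word: str) -> ComplexityData:
    """Retrieve the (length, frequency) complexity data for one word."""
    return ComplexityData(len(word), FREQUENCIES.get(word, 0.0))


def word_complexity(word: str) -> float:
    """Score a first-language word using its own data plus its translations' data."""
    first = complexity_data(word)
    second = [complexity_data(t) for t in TRANSLATIONS.get(word, {}).values()]
    all_data = [first, *second]
    overall_length = mean(d.length for d in all_data)
    overall_frequency = mean(d.frequency for d in all_data)
    # Assumed scoring rule: longer and rarer words count as more complex.
    return overall_length / (1.0 + 1000.0 * overall_frequency)
```

With these toy tables, `word_complexity("ubiquitous")` comes out higher than `word_complexity("common")`, because the former is longer and rarer in every language consulted.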

First claim

Opening claim text (preview).

What is claimed is:

1. An information handling system comprising: one or more processors; a memory coupled to at least one of the processors; and a set of computer program instructions stored in the memory and executed by at least one of the processors in order to perform actions comprising: creating a multi-language word mapping by a multi-language word mapping generator executing on the information handling system, wherein the creating further comprises: retrieving a word that belongs to a first natural language; retrieving a first set of complexity data pertaining to the word in the first natural language, wherein the first set of complexity data comprises a first word length and a first word frequency; translating the word to one or more translated words, wherein each of the translated words corresponds to one or more second natural languages; retrieving one or more second sets of complexity data, wherein each of the second sets of complexity data correspond to a different one of the translated words, wherein the one or more second sets of complexity data comprises one or more second word lengths and one or more second word frequencies; and computing a complexity of the word in the first natural language based on an overall word length and an overall word frequency, wherein the overall word length is based on the first word length and the one or more second word lengths, and wherein the overall word frequency is based on the first word frequency and the one or more second word frequencies; and storing the computed complexity of the word in the multi-language word mapping; and performing, by the information handling system, lexical simplification on the document that comprises replacing the word in a document with one of the one or more translated words based on the computed complexity of the word stored in the multi-language word mapping.

2. The information handling system of claim 1 wherein the first set of complexity data includes a first word n-gram of the word in the first natural language, wherein the second sets of complexity data includes one or more second word n-grams of the word in each of the second natural languages, and wherein the actions further comprise: determining an overall word n-gram based on the first word n-gram and the second one or more word n-grams, wherein the complexity of the word is based on the overall word n-gram.

3. The information handling system of claim 1 wherein the first set of complexity data includes a first word encyclopedia entry of the word in the first natural language, wherein the second sets of complexity data includes one or more second encyclopedia entries of the word in each of the second natural languages, and wherein the actions further comprise: determining an overall word n-gram based on the first word encyclopedia entry and the second one or more encyclopedia entries, wherein the complexity of the word is based on the overall word n-gram.

4. The information handling system of claim 1 wherein the complexity of the word is based on an average length of characters of the word and the translated words in each of the first and second natural languages, a total number of translated words, a frequency of the word in the first natural language, a sum of the normalized frequencies of the one or more translated words in the second natural languages, an existence of an encyclopedia entry of the word, a number of encyclopedia entries of the translated words in the second natural languages, and a vector value of possible character n-grams in the second natural languages collectively.

5. The information handling system of claim 1 wherein the translated words include synonyms of the translated words in the second natural languages.

6. A computer program product stored in a non-transitory computer readable storage medium, comprising computer program code that, when executed by an information handling system, performs actions comprising: creating a multi-language word mapping by a multi-language word mapping generator executing on the information handling system, wherein the creating further comprises: retrieving a word that belongs to a first natural language; retrieving a first set of complexity data pertaining to the word in the first natural language, wherein the first set of complexity data comprises a first word length and a first word frequency; translating the word to one or more translated words, wherein each of the translated words corresponds to one or more second natural languages; retrieving one or more second sets of complexity data, wherein each of the second sets of complexity data correspond to a different one of the translated words, wherein the one or more second sets of complexity data comprises one or more second word lengths and one or more second word frequencies; and computing a complexity of the word in the first natural language based on an overall word length and an overall word frequency, wherein the overall word length is based on the first word length and the one or more second word lengths, and wherein the overall word frequency is based on the first word frequency and the one or more second word frequencies; and storing the computed complexity of the word in the multi-language word mapping; and performing, by the information handling system, lexical simplification on the document that comprises replacing the word in a document with one of the one or more translated words based on the computed complexity of the word stored in the multi-language word mapping.

7. The computer program product of claim 6 wherein the first set of complexity data includes a first word n-gram of the word in the first natural language, wherein the second sets of complexity data includes one or more second word n-grams of the word in each of the second natural languages, and wherein the actions further comprise: determining an overall word n-gram based on the first word n-gram and the second one or more word n-grams, wherein the complexity of the word is based on the overall word n-gram.

8. The computer program product of claim 6 wherein the first set of complexity data includes a first word encyclopedia entry of the word in the first natural language, wherein the second sets of complexity data includes one or more second encyclopedia entries of the word in each of the second natural languages, and wherein the actions further comprise: determining an overall word n-gram based on the first word encyclopedia entry and the second one or more encyclopedia entries, wherein the complexity of the word is based on the overall word n-gram.

9. The computer program product of claim 6 wherein the complexity of the word is based on an average length of characters of the word and the translated words in each of the first and second natural languages, a total number of translated words, a frequency of the word in the first natural language, a sum of the normalized frequencies of the one or more translated words in the second natural languages, an existence of an encyclopedia entry of the word, a number of encyclopedia entries of the translated words in the second natural languages, and a vector value of possible character n-grams in the second natural languages collectively.
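Claims 4 and 9 enumerate the signals that feed the complexity estimate. The sketch below shows one way those signals could be gathered into a feature record. It is illustrative only: the `Translation` type, the `build_features` function, the dictionary keys, and the choice of character bigrams (`n=2`) are assumptions made for the example, and the claims do not say how the features are weighted or combined.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    word: str                    # translated word in a second natural language
    normalized_frequency: float  # frequency normalized within that language
    has_encyclopedia_entry: bool


def char_ngrams(word: str, n: int = 2) -> Counter:
    """Count the overlapping character n-grams of a word."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


def build_features(word, word_frequency, has_encyclopedia_entry, translations, n=2):
    """Collect the signals listed in claim 4 for one first-language word."""
    words = [word] + [t.word for t in translations]
    ngram_vector = Counter()
    for t in translations:
        # N-grams are pooled over all second-language words collectively.
        ngram_vector.update(char_ngrams(t.word, n))
    return {
        "average_length": sum(len(w) for w in words) / len(words),
        "translation_count": len(translations),
        "word_frequency": word_frequency,
        "translation_frequency_sum": sum(t.normalized_frequency for t in translations),
        "has_encyclopedia_entry": has_encyclopedia_entry,
        "translation_entry_count": sum(t.has_encyclopedia_entry for t in translations),
        "ngram_vector": ngram_vector,
    }


# Toy usage with made-up frequencies and entry flags.
features = build_features(
    "cat",
    word_frequency=0.5,
    has_encyclopedia_entry=True,
    translations=[
        Translation("gato", 0.5, True),
        Translation("chat", 0.25, True),
        Translation("Katze", 0.25, False),
    ],
)
```

A downstream scorer or classifier could consume this record; how the claimed system turns these signals into a single complexity value is not specified in the text shown here.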

Assignees

IBM

Classifications

  • Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation · CPC title

  • Semantic analysis · CPC title

  • G06F40/284 · Primary

    Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Natural language query formulation · CPC title

  • G06F17/277 · Primary

    Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10318633B2 cover?
An approach is provided that receives a word that belongs to a first natural language and retrieves a first set of complexity data pertaining to the word in the first natural language. The approach translates the word to one or more translated words, with each of the translated words corresponding to one or more second natural languages. The approach then retrieves sets of complexity data, with…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 11, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).