Methods for automatic generation of parallel corpora

US10552548B2 · US · B2

Patent metadata
Publication number: US-10552548-B2
Application number: US-201815884336-A
Country: US
Kind code: B2
Filing date: Jan 30, 2018
Priority date: Feb 28, 2014
Publication date: Feb 4, 2020
Grant date: Feb 4, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of forming parallel corpora comprises receiving sets of items in first language and second languages, each of the sets having one or more associated descriptions and metadata. The metadata is collected from the two sets of items and are aligned using the metadata. The aligned metadata are mapped from the first language to the second language for each of the sets. The descriptions of two items are fetched and the structural similarity of the descriptions is measured to assess whether two items are likely to be translations of each other. For mapped items with structurally similar descriptions, the mapped item descriptions are formed into respective sentences in first language and in the second language. The sentences are parallel corpora which may be used to translate an item from the first language to the second language, and also to train a machine translation system.
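The flow described in the abstract (align items by metadata, check structural similarity, keep the matching descriptions as a parallel pair) can be illustrated with a minimal Python sketch. The `Listing` record, its field names, and the `is_similar` predicate are assumptions made for illustration; the abstract does not specify a data model or a similarity function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """Hypothetical listing record; the field names are assumptions."""
    item_id: str      # metadata identifying the item, e.g. a product code
    language: str
    description: str


def align_by_metadata(first, second):
    """Pair listings from the two language sets that share an item_id."""
    by_id = {listing.item_id: listing for listing in second}
    return [(a, by_id[a.item_id]) for a in first if a.item_id in by_id]


def build_parallel_corpus(first, second, is_similar):
    """Keep description pairs whose structure passes the is_similar check."""
    return [
        (a.description, b.description)
        for a, b in align_by_metadata(first, second)
        if is_similar(a.description, b.description)
    ]


# Usage with a placeholder predicate that accepts every aligned pair.
english = [
    Listing("SKU-1", "en", "<p>Red mug</p>"),
    Listing("SKU-2", "en", "<p>Blue cap</p>"),
]
german = [Listing("SKU-1", "de", "<p>Rote Tasse</p>")]
pairs = build_parallel_corpus(english, german, lambda a, b: True)
```

Only `SKU-1` appears in both sets, so only that pair of descriptions is kept. In practice the placeholder predicate would be replaced by a structural comparison of the two descriptions.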

First claim

Opening claim text (preview).

What is claimed is:

1. A computer implemented method comprising: obtaining a first item listing that is posted by a first seller in a first language and that is related to selling a particular item that is a product, a service, or a combination of a product and service; obtaining a second item listing that is posted by a second seller in a second language and that is also related to selling the particular item; aligning the first item listing with the second item listing in response to the first item listing and the second item listing both being related to selling the same item; identifying a first organizational structure with respect to first hierarchal relationships between first hypertext markup language (HTML) tags of first HTML code of a first description of the first item listing; identifying a second organizational structure with respect to second hierarchal relationships between second HTML tags of second HTML code of a second description of the second item listing; measuring, based on the aligning of the first item listing with the second item listing, an organizational structural similarity of the first HTML code with respect to the second HTML code by comparing the first organizational structure against the second organizational structure, the comparing including comparing the first hierarchal relationships against the second hierarchal relationships by comparing first nodes and first edges of a first tree that represents the first hierarchal relationships against second nodes and second edges of a second tree that represents the second hierarchal relationships; and in response to the first HTML code and the second HTML code being determined as being organizationally structurally similar based on the measured organizational structural similarity, forming the first description into a first sentence in the first language as a translation of the second description into the first language.

2. The method of claim 1, further comprising forming the second description into a second sentence in the second language as a translation of the first description into the second language in response to the first HTML code and the second HTML code being organizationally structurally similar.

3. The method of claim 2, wherein the first sentence and the second sentence comprise parallel corpora and the method further comprises using the parallel corpora to translate another item listing from the first language to the second language.

4. The method of claim 2, wherein the first sentence and the second sentence comprise parallel corpora and the method further comprises using the parallel corpora to train a machine translation system.

5. The method of claim 1, wherein measuring the organizational structural similarity of the first HTML code with respect to the second HTML code includes measuring tree isomorphism of the first tree with respect to the second tree.

6. The method of claim 1, further comprising determining that the first HTML code and the second HTML code are organizationally structurally similar based on the organizational structural similarity meeting a similarity threshold.

7. The method of claim 1, further comprising: collecting first metadata from the first item listing, the first metadata identifying the particular item; and collecting second metadata from the second item listing, the second metadata identifying the particular item, wherein the aligning of the first item listing with the second item listing is in response to the first metadata and the second metadata both identifying the particular item.

8. A system comprising: one or more processors; a memory to store instructions that, in response to being executed by the one or more processors, cause the system to perform operations comprising: obtaining a first item listing that is posted by a first seller in a first language and that is related to selling a particular item that is a product, a service, or a combination of a product and service; obtaining a second item listing that is posted by a second seller in a second language and that is also related to selling the particular item; aligning the first item listing with the second item listing in response to the first item listing and the second item listing both being related to selling the same item; measuring, based on the aligning of the first item listing with the second item listing, an organizational structural similarity of first hypertext markup language (HTML) code of a first description of the first item listing with respect to second HTML code of a second description of the second item listing, the measuring of the organizational structural similarity including comparing first hierarchal connections between first HTML tags of the first HTML code against second hierarchal connections between second HTML tags of the second HTML code; in response to the first HTML code and the second HTML code being determined as being organizationally structurally similar based on the measured organizational structural similarity, forming the first description into a first sentence in the first language and forming the second description into a second sentence in the second language in which the first sentence and the second sentence are parallel corpora; and using the parallel corpora to perform one or more operations selected from a group of operations consisting of: translating another item listing from the first language to the second language; and training a machine translation system.

9. The system of claim 8, wherein the operations further comprise: using the first sentence as a translation of the second description into the first language; and using the second sentence as a translation of the first description into the second language.

10. The system of claim 8, wherein comparing the first hierarchal connections against the second hierarchal connections includes measuring similarity of first nodes and first edges of a first tree that represents the first HTML code with respect to second nodes and second edges of a second tree that represents the second HTML code.

11. The system of claim 8, wherein measuring the organizational structural similarity of the first HTML code with respect to the second HTML code includes measuring tree isomorphism of the first HTML code with respect to the second HTML code.

12. The system of claim 8, wherein the operations further comprise determining that the first HTML code and the second HTML code are organizationally structurally similar based on the organizational structural similarity meeting a similarity threshold.

13. The system of claim 8, wherein the operations further comprise: collecting first metadata from the first item listing, the first metadata identifying the particular item; and collecting second metadata from the second item listing, the second metadata identifying the particular item, wherein the aligning of the first item listing with the second item listing is in response to the first metadata and the second metadata both identifying the particular item.

14. The system of claim 8, wherein the operations further comprise: identifying the first hierarchal connections between the first HTML tags; and identifying the second hierarchal relationships between the second HTML tags.

15. One or more non-transitory computer-readable media embodying instructions that, in response to being executed by one or more processors of a system, cause the system to perform operations comprising: obtaining a first item listing that is posted by a first seller in a first language and that is related to selling a particular item that is a pro
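
The comparison of nodes and edges of two HTML tag trees, followed by a similarity threshold, can be sketched in Python as below. This is an illustration under stated assumptions, not the patented method: the claims do not give a scoring formula, so the multiset Jaccard overlap, the equal weighting of nodes and edges, the 0.8 default threshold, and all function and class names are choices made for this example. It also does not test tree isomorphism, which some dependent claims mention.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Records each tag (node) and each parent-child tag pair (edge)."""

    def __init__(self):
        super().__init__()
        self.stack = ["#root"]   # currently open elements
        self.nodes = []          # one tag name per element
        self.edges = []          # (parent tag, child tag) pairs

    def handle_starttag(self, tag, attrs):
        self.nodes.append(tag)
        self.edges.append((self.stack[-1], tag))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise close back to the match.
        if tag in self.stack[1:]:
            while self.stack.pop() != tag:
                pass


def tag_tree(html):
    """Return (node multiset, edge multiset) for an HTML fragment."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return Counter(builder.nodes), Counter(builder.edges)


def multiset_jaccard(a, b):
    """Overlap of two multisets: |intersection| / |union|."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def structural_similarity(html_a, html_b):
    """Average of node overlap and edge overlap (an assumed scoring rule)."""
    nodes_a, edges_a = tag_tree(html_a)
    nodes_b, edges_b = tag_tree(html_b)
    return (multiset_jaccard(nodes_a, nodes_b)
            + multiset_jaccard(edges_a, edges_b)) / 2


def is_structurally_similar(html_a, html_b, threshold=0.8):
    """Apply an assumed similarity threshold to the score."""
    return structural_similarity(html_a, html_b) >= threshold


en = "<div><h1>Red mug</h1><ul><li>Ceramic</li><li>350 ml</li></ul></div>"
de = "<div><h1>Rote Tasse</h1><ul><li>Keramik</li><li>350 ml</li></ul></div>"
other = "<p>Blue cap</p>"
```

Here `en` and `de` share the same tag tree and score 1.0, while `en` and `other` share no tags or edges and score 0.0. Because the text content is ignored, the score depends only on markup layout, which is what lets it compare descriptions written in different languages.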

Assignees

Inventors

Classifications

  • Example-based machine translation; Alignment · CPC title

  • using very large corpora, e.g. the web · CPC title

  • Data-driven translation · CPC title

  • Electronic shopping [e-shopping] · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10552548B2 cover?
A method of forming parallel corpora comprises receiving sets of items in first language and second languages, each of the sets having one or more associated descriptions and metadata. The metadata is collected from the two sets of items and are aligned using the metadata. The aligned metadata are mapped from the first language to the second language for each of the sets. The descriptions of tw…
Who is the assignee on this patent?
Paypal Inc
What technology area does this patent fall under?
Primary CPC classification G06Q30/0601. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 4, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).