System and method for locating bilingual web sites

US9471565B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9471565-B2
Application number  US-201113194668-A
Country             US
Kind code           B2
Filing date         Jul 29, 2011
Priority date       Jul 29, 2011
Publication date    Oct 18, 2016
Grant date          Oct 18, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for bootstrapping a language translation system. A system configured to practice the method performs a bidirectional web crawl to identify a bilingual website. The system analyzes data on the bilingual website to make a classification decision about whether the root of the bilingual website is an entry point for the bilingual website. The bilingual site can contain pairs of parallel pages. Each pair can include a first website in a first language and a second website in a second language, and a first portion of the first web page corresponds to a second portion of the second web page. Then the system analyzes the first and second web pages to identify corresponding information pairs in the first and second languages, and extracts the corresponding information pairs from the first and second web pages for use in a language translation model.
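
The bidirectional crawl described in the abstract can be illustrated with a small sketch. This is a toy under stated assumptions, not the patented implementation: the in-memory link graph, the page names, the `LANGUAGE_CUES` set, and the keyword-based `link_relevance` score are all hypothetical stand-ins (the claims describe relevance learned with a support vector machine). The sketch shows only the control flow: follow both forward links and back links, rank candidates in a priority frontier, and skip links with low relevance.

```python
import heapq

# Hypothetical toy link graph: page -> list of (anchor_text, target_page).
FORWARD_LINKS = {
    "hub": [("Acme home", "acme/en"), ("weather", "news")],
    "acme/en": [("Español", "acme/es"), ("contact", "acme/en/contact")],
    "acme/es": [],  # no link back to the English page
    "news": [("sports", "sports")],
    "sports": [],
    "acme/en/contact": [],
}

# Hypothetical anchor-text cues that suggest a link to another language version.
LANGUAGE_CUES = {"english", "español", "français", "deutsch"}


def link_relevance(anchor_text):
    """Stand-in scorer: high when the anchor text names a language."""
    words = anchor_text.lower().split()
    return 1.0 if any(word in LANGUAGE_CUES for word in words) else 0.1


def back_links(graph):
    """Invert the graph so each page lists the pages that link to it."""
    inverted = {page: [] for page in graph}
    for source, links in graph.items():
        for anchor, target in links:
            inverted.setdefault(target, []).append((anchor, source))
    return inverted


def bidirectional_crawl(seed, graph, min_relevance=0.5, max_pages=10):
    """Visit pages reachable through relevant forward or back links."""
    inverted = back_links(graph)
    frontier = [(-1.0, seed)]  # max-heap via negated scores
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        neighbors = graph.get(page, []) + inverted.get(page, [])
        for anchor, target in neighbors:
            score = link_relevance(anchor)
            if score < min_relevance or target in seen:
                continue  # avoid low-relevance links and repeat visits
            seen.add(target)
            heapq.heappush(frontier, (-score, target))
    return visited


pages = bidirectional_crawl("acme/es", FORWARD_LINKS)
```

Seeded at the Spanish page, which has no outgoing links, the sketch still reaches the English page through the back link, and it never visits the unrelated `news` or `sports` pages because their links score below `min_relevance`.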

First claim

Opening claim text (preview).

We claim:

1. A method comprising: performing a generic web crawl to identify a first webpage in a first language having a link thereon which points to a second webpage in a second language, wherein the first webpage and the second webpage comprise a bilingual website comprising the first webpage and the second webpage in respective languages; based on an analysis of parameters on the first webpage comprising at least two of: the link pointing to the second webpage, a title, a link neighborhood, a link context and data indicating a separate version of the first webpage, classifying the first webpage as a root page and as an entry point for the bilingual website via the link to the second webpage; identifying, using a visitation policy which constrains web-crawling to a graph neighborhood of bilingual websites, a pattern of links within between the first webpage and the second webpage, to yield a bipartite graph; ranking a relevance of candidate links which point to parallel text in the first webpage and the second webpage, to yield classifications of links based on the bipartite graph and the relevance; performing, based on the relevance, a bidirectional web crawl of the candidate links, to identify the first webpage and the second webpage as a bilingual website, the bidirectional web crawl utilizing the classifications of links to avoid links having a low respective relevance; analyzing the first webpage and the second webpage to identify information pairs in the first language and the second language; extracting the information pairs from the first webpage and the second webpage for use in a language translation model, the information pairs comprising at least one of a sentence pair and a paragraph pair; and updating a statistical model with domain representative data using the information pairs.

2. The method of claim 1, wherein the information pairs comprise one of a word pair, a phrase pair, a sentence pair, and a paragraph pair.

3. The method of claim 1, further comprising bootstrapping the language translation model using the information pairs.

4. The method of claim 1, wherein identification of the bilingual website comprises identifying the pair of parallel pages.

5. The method of claim 1, wherein the bidirectional web crawl considers back links and forward links.

6. The method of claim 5, wherein each of the back links and the forward links is associated with a relevance score.

7. The method of claim 6, wherein the relevance score is based on a context of a link in a neighborhood of elements.

8. The method of claim 1, wherein the relevance is based on supervised learning via a support vector machine and a link predictor, and wherein the link predictor filters irrelevant pages when the irrelevant pages have less than a threshold amount of relevant links.

9. The method of claim 1, wherein a frontier scheduler generates a list of links for use in the bidirectional web crawl.

10. The method of claim 1, further comprising augmenting a statistical model with domain representative data based on the information pairs.

11. The method of claim 1, wherein the language translation model is one of a machine translation model, a cross-lingual document retrieval model, and a language model.

12. A system comprising: a processor; a computer-readable storage memory having instructions stored which, when executed by the processor, cause the processor to perform operations comprising: performing a generic web crawl to identify a first webpage in a first language having a link thereon which points to a second webpage in a second language, wherein the first webpage and the second webpage comprise a bilingual website comprising the first webpage and the second webpage in respective languages; based on an analysis of parameters on the first webpage comprising at least two of: the link pointing to the second webpage, a title, a link neighborhood, a link context and data indicating a separate version of the first webpage, classifying the first webpage as a root page and as an entry point for the bilingual website via the link to the second webpage; identifying, using a visitation policy which constrains web-crawling to a graph neighborhood of bilingual websites, a pattern of links within between the first webpage and the second webpage, to yield a bipartite graph; ranking a relevance of candidate links which point to parallel text in the first webpage and the second webpage, to yield classifications of links based on the bipartite graph and the relevance; performing, based on the relevance, a bidirectional web crawl of the candidate links, to identify the first webpage and the second webpage as a bilingual website, the bidirectional web crawl utilizing the classifications of links to avoid links having a low respective relevance; analyzing the first webpage and the second webpage to identify information pairs in the first language and the second language; extracting the information pairs from the first webpage and the second webpage for use in a language translation model, the information pairs comprising at least one of a sentence pair and a paragraph pair; and updating a statistical model with domain representative data using the information pairs.

13. The system of claim 12, wherein the information pairs comprise one of a word pair, a phrase pair, a sentence pair, and a paragraph pair.

14. The system of claim 12, the computer-readable storage medium having additional instructions which, when executed by the processor, cause the processor to perform operations comprising bootstrapping the language translation model using the information pairs.

15. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising: performing a generic web crawl to identify a first webpage in a first language having a link thereon which points to a second webpage in a second language, wherein the first webpage and the second webpage comprise a bilingual website comprising the first webpage and the second webpage in respective languages; based on an analysis of parameters on the first webpage comprising at least two of: the link pointing to the second webpage, a title, a link neighborhood, a link context and data indicating a separate version of the first webpage, classifying the first webpage as a root page and as an entry point for the bilingual website via the link to the second webpage; identifying, using a visitation policy which constrains web-crawling to a graph neighborhood of bilingual websites, a pattern of links within between the first webpage and the second webpage, to yield a bipartite graph; ranking a relevance of candidate links which point to parallel text in the first webpage and the second webpage, to yield classifications of links based on the bipartite graph and the relevance; performing, based on the relevance, a bidirectional web crawl of the candidate links, to identify the first webpage and the second webpage as a bilingual website, the bidirectional web crawl utilizing the classifications of links to avoid links having a low respective relevance; analyzing the first webpage and the second webpage to identify information pairs in the first language and the second language; extracting the information pairs from the first webpage and the second webpage for use in a language translation model, the information pairs comprising at least one of a sentence pair and a paragraph pair; and updating a statistical model with domain representative data using the information pairs.

16. T
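
The extraction step recited in the independent claims (identifying sentence pairs in two parallel pages) can be sketched as follows. The claim text shown here does not specify an alignment method, so everything below is an assumption made for illustration: the function names, the position-based pairing, the regular-expression sentence splitter, and the `max_length_ratio` filter are hypothetical, and a production system would likely use a proper sentence aligner.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def extract_sentence_pairs(first_text, second_text, max_length_ratio=2.0):
    """Pair sentences by position and drop pairs with implausible lengths.

    Assumption: parallel pages list their sentences in the same order, so
    the i-th sentence of one page corresponds to the i-th of the other.
    """
    first = split_sentences(first_text)
    second = split_sentences(second_text)
    pairs = []
    for a, b in zip(first, second):
        ratio = max(len(a), len(b)) / min(len(a), len(b))
        if ratio <= max_length_ratio:
            pairs.append((a, b))
    return pairs


english_page = "Welcome to our store. We ship worldwide."
spanish_page = "Bienvenido a nuestra tienda. Enviamos a todo el mundo."
sentence_pairs = extract_sentence_pairs(english_page, spanish_page)
```

Each resulting tuple holds one sentence in the first language and its counterpart in the second, which is the kind of information pair the claims feed into a language translation model.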

Assignees

Inventors

Classifications

  • Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title

  • G06F40/49 · Primary

    using very large corpora, e.g. the web · CPC title

  • Language identification · CPC title

  • using ranking · CPC title

  • G06F16/951 · Primary

    Indexing; Web crawling techniques · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9471565B2 cover?
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for bootstrapping a language translation system. A system configured to practice the method performs a bidirectional web crawl to identify a bilingual website. The system analyzes data on the bilingual website to make a classification decision about whether the root of the bilingual website is an entry poi…
Who is the assignee on this patent?
Barbosa Luciano De Andrade, Bangalore Srinivas, Rangarajan Sridhar Vivek Kumar, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F40/49. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 18, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).