Who is the assignee on this patent?

Baidu Online Network Technology (Beijing) Co., Ltd.

What technology area does this patent fall under?

Primary CPC classification G06F16/5846. Mapped technology areas include Physics.

When was this patent published?

Publication date Aug 25, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and apparatus for retrieving image-text block from web page

US10755091B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10755091-B2
Application number	US-201816133355-A
Country	US
Kind code	B2
Filing date	Sep 17, 2018
Priority date	Oct 11, 2017
Publication date	Aug 25, 2020
Grant date	Aug 25, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for retrieving an image-text block from a web page is provided, which comprises: retrieving an image node; filtering the image node to obtain candidate image nodes; traversing, for each of the candidate image nodes, a node in sequence toward an ancestor node of the candidate image node in a preset maximum traversal depth until an ancestor node with a text is visited, using the ancestor node with the text as a candidate image-text block; clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks; and determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the image-text block cluster based on the common ancestor node.
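The traversal step in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the sample page, the filtering rule (drop images with an empty `src`), the default depth of 3, and the names `has_text` and `candidate_blocks` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this sketch.
PAGE = """<html><body>
<ul>
  <li><a><img src="a.png"/></a><p>First caption</p></li>
  <li><a><img src="b.png"/></a><p>Second caption</p></li>
</ul>
<div><img src=""/></div>
</body></html>"""


def has_text(node):
    """True if the node or any descendant carries non-whitespace text."""
    return any(t.strip() for t in node.itertext())


def candidate_blocks(root, max_depth=3):
    """Pair each kept image with its nearest ancestor that contains text."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    blocks = []
    for img in root.iter("img"):
        # Assumed filtering rule: skip images without a usable src.
        if not img.get("src"):
            continue
        node = img
        # Walk upward, at most max_depth ancestors, until text is found.
        for _ in range(max_depth):
            node = parent.get(node)
            if node is None:
                break
            if has_text(node):
                blocks.append((img.get("src"), node))
                break
    return blocks


root = ET.fromstring(PAGE)
blocks = candidate_blocks(root)
for src, node in blocks:
    print(src, node.tag)
```

In this sample each image sits inside an `a` element with no text, so the walk continues to the enclosing `li`, which becomes the candidate image-text block. With `max_depth=1` the walk would stop at `a` and yield no candidates.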

First claim

Opening claim text (preview).

What is claimed is:

1. A method for retrieving an image-text block from a web page, the method comprising: retrieving an image node from a document object model of a to-be-processed web page, the image node including an attribute of an image; filtering the image node based on a preset filtering rule to obtain candidate image nodes; traversing, for each of the candidate image nodes, a node in sequence toward an ancestor node of the each candidate image node in a preset maximum traversal depth until an ancestor node with a text is visited, using the ancestor node with the text as a candidate image-text block corresponding to the each candidate image node, and generating path information of the candidate image-text blocks based on locations of the candidate image-text blocks in the document object model, wherein the candidate image-text block includes a text content and the candidate image node; clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks to obtain at least one image-text block cluster; and determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the each image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the each image-text block cluster based on the common ancestor node, wherein the method is performed by at least one processor.

2. The method according to claim 1, wherein before the clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks to obtain at least one image-text block cluster, the method further comprises: structuring the candidate image-text blocks into a structure including following data information: the path information of the candidate image-text blocks, path information of the candidate image-text blocks formatted based on a preset format, an image resource path in candidate image nodes corresponding to the candidate image-text blocks, and the hash values of the path information of the candidate image-text blocks.

3. The method according to claim 2, wherein the path information comprises path information labeled with a path language of an extensible markup language, and the hash values of the path information of the candidate image-text blocks are hash values for the path information of the candidate image-text blocks excluding a predicate condition; and the determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the each image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the each image-text block cluster based on the common ancestor node comprises: determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks based on the path information of the candidate image-text blocks to obtain a predicate condition of path information of the common ancestor node; and combining the path information of the candidate image-text blocks within the each image-text block cluster based on the predicate condition of the path information of the common ancestor node, and using the combined path information as the path information of the each image-text block cluster.

4. The method according to claim 1, further comprising: comparing the path information of the image-text block clusters to filter out overlapped path information.

5. The method according to claim 1, wherein before the retrieving an image node from a document object model of a to-be-processed web page, the method further comprises: cleaning data of the document object model of the to-be-processed web page to remove invalid nodes in the document object model.

6. An apparatus for retrieving an image-text block from a web page, the apparatus comprising: at least one processor; and a memory storing instructions, wherein the instructions when executed by the at least one processor, cause the at least one processor to perform operations, the operations comprising: retrieving an image node from a document object model of a to-be-processed web page, the image node including an attribute of an image; filtering the image node based on a preset filtering rule to obtain candidate image nodes; traversing, for each of the candidate image nodes, a node in sequence towards an ancestor node of the each candidate image node in a preset maximum traversal depth until an ancestor node with a text is visited, using the ancestor node with the text as a candidate image-text block corresponding to the each candidate image node, and generating path information of the candidate image-text blocks based on locations of the candidate image-text blocks in the document object model, wherein the candidate image-text block includes a text content and the candidate image node; clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks to obtain at least one image-text block cluster; and determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the each image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the each image-text block cluster based on the common ancestor node.

7. The apparatus according to claim 6, wherein before the clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks to obtain at least one image-text block cluster, the operations further comprise: structuring the candidate image-text blocks into a structure including following data information: the path information of the candidate image-text blocks, path information of the candidate image-text blocks formatted based on a preset format, an image resource path in candidate image nodes corresponding to the candidate image-text blocks, and the hash values of the path information of the candidate image-text blocks.

8. The apparatus according to claim 7, wherein the path information comprises path information labeled with a path language of an extensible markup language, and the hash values of the path information of the candidate image-text blocks are hash values for the path information of the candidate image-text blocks excluding a predicate condition; and the determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the each image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the each image-text block cluster based on the common ancestor node comprises: determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks based on the path information of the candidate image-text blocks to obtain a predicate condition of path information of the common ancestor node; and combining the path information of the candidate image-text blocks within the each image-text block cluster based on the predicate condition of the path information of the common ancestor node, and using the combined path information as the path information of the each image-text block cluster.

9. The apparatus according to claim 6, wherein the operations further comprise: comparing the path information of the image-text block clusters to filter out overlapped path information.

10. The apparatus according to claim 6, wherein before the retrieving an image node from a document object model of a to-be-processed web page, the operations further comprise: cleaning data of the
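Claims 1 and 3 describe clustering by a hash of the path with predicate conditions excluded, then deriving a cluster path from the common ancestor. The Python sketch below shows one possible reading of those steps, working on XPath-like strings. The choice of MD5, the regular expression used to drop predicates, the sample paths, and all function names are assumptions for illustration; the claims do not specify them.

```python
import hashlib
import re
from collections import defaultdict


def strip_predicates(xpath):
    """Drop bracketed predicates, e.g. /div[2]/li[1] -> /div/li."""
    return re.sub(r"\[[^\]]*\]", "", xpath)


def path_hash(xpath):
    # MD5 is an arbitrary choice here; the patent names no hash function.
    return hashlib.md5(strip_predicates(xpath).encode("utf-8")).hexdigest()


def cluster_by_hash(paths):
    """Group paths whose predicate-free forms hash to the same value."""
    clusters = defaultdict(list)
    for p in paths:
        clusters[path_hash(p)].append(p)
    return list(clusters.values())


def common_ancestor_steps(paths):
    """Longest shared prefix of path steps, predicates included."""
    split = [p.strip("/").split("/") for p in paths]
    prefix = []
    for steps in zip(*split):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return prefix


def cluster_path(paths):
    """Keep predicates up to the common ancestor, generalize below it."""
    prefix = common_ancestor_steps(paths)
    rest = strip_predicates(paths[0]).strip("/").split("/")[len(prefix):]
    return "/" + "/".join(prefix + rest)


# Hypothetical candidate block paths.
p1 = "/html/body/div[2]/ul/li[1]"
p2 = "/html/body/div[2]/ul/li[2]"
p3 = "/html/body/div[3]/p[1]"

clusters = cluster_by_hash([p1, p2, p3])
for c in clusters:
    print(cluster_path(c), c)
```

Here the two `li` paths differ only in their final predicate, so they share a hash and fall into one cluster whose combined path keeps `div[2]` from the common ancestor and drops the index on `li`. The third path forms a cluster of its own and is returned unchanged.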

Assignees

Baidu Online Network Technology (Beijing) Co., Ltd.

Classifications

G06F16/5846 (Primary): using extracted text
G06V30/414: Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
G06V30/413: Classification of content, e.g. text, photographs or tables
G06V30/43: Editing text-bitmaps, e.g. alignment, spacing; Semantic analysis of bitmaps of text without OCR
G06F16/9535: Search customisation based on user profiles and personalisation

Patent family

Related publications grouped by family.

View patent family 61052295

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10755091B2 cover?: A method for retrieving an image-text block from a web page is provided, which comprises: retrieving an image node; filtering the image node to obtain candidate image nodes; traversing, for each of the candidate image nodes, a node in sequence toward an ancestor node of the candidate image node in a preset maximum traversal depth until an ancestor node with a text is visited, using the ancestor…