Webpage pre-reading method, apparatus and smart terminal device
US-2017013072-A1 · Jan 12, 2017 · US
US10755091B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10755091-B2 |
| Application number | US-201816133355-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 17, 2018 |
| Priority date | Oct 11, 2017 |
| Publication date | Aug 25, 2020 |
| Grant date | Aug 25, 2020 |
Official abstract text for this publication.
A method for retrieving an image-text block from a web page is provided, which comprises: retrieving an image node; filtering the image node to obtain candidate image nodes; traversing, for each of the candidate image nodes, a node in sequence toward an ancestor node of the candidate image node in a preset maximum traversal depth until an ancestor node with a text is visited, using the ancestor node with the text as a candidate image-text block; clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks; and determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the image-text block cluster based on the common ancestor node.
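The first three steps of the abstract (retrieve image nodes, filter them, walk toward ancestors until one contains text) can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the page is assumed to be well-formed XHTML so that the standard-library `xml.etree.ElementTree` parser can stand in for a browser DOM, the filtering rule (`passes_filter`, `MIN_WIDTH`), the depth limit `MAX_DEPTH`, and all function names are hypothetical, and path information is written as a simple positional XPath.

```python
import xml.etree.ElementTree as ET

MAX_DEPTH = 3   # assumed "preset maximum traversal depth"
MIN_WIDTH = 50  # assumed threshold for the "preset filtering rule"


def passes_filter(img):
    """Hypothetical rule: keep images that have a src and are not declared tiny."""
    width = img.get("width", "")
    too_small = width.isdigit() and int(width) < MIN_WIDTH
    return bool(img.get("src")) and not too_small


def has_text(node):
    """True if the node or any descendant carries non-whitespace text."""
    return any(t.strip() for t in node.itertext())


def find_block(img, parents, max_depth=MAX_DEPTH):
    """Climb from the image toward the root, at most max_depth steps,
    and return the first ancestor that contains text (or None)."""
    node = img
    for _ in range(max_depth):
        node = parents.get(node)
        if node is None:
            return None
        if has_text(node):
            return node
    return None


def xpath_of(node, parents):
    """Build a positional XPath such as /html/body[1]/ul[1]/li[2]."""
    steps = []
    while node is not None:
        parent = parents.get(node)
        if parent is None:
            steps.append(node.tag)  # root element, no position predicate
        else:
            same_tag = [c for c in parent if c.tag == node.tag]
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "/" + "/".join(reversed(steps))


def candidate_blocks(page, max_depth=MAX_DEPTH):
    """Return (block path, image src) pairs for every candidate image node."""
    root = ET.fromstring(page)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    blocks = []
    for img in root.iter("img"):
        if not passes_filter(img):
            continue
        block = find_block(img, parents, max_depth)
        if block is not None:
            blocks.append((xpath_of(block, parents), img.get("src")))
    return blocks


PAGE = """
<html><body>
  <ul>
    <li><a><img src="a.jpg" width="120"/></a><p>First story</p></li>
    <li><a><img src="b.jpg" width="120"/></a><p>Second story</p></li>
  </ul>
  <div><img src="dot.png" width="1"/></div>
</body></html>
"""

if __name__ == "__main__":
    for path, src in candidate_blocks(PAGE):
        print(path, src)
```

On the sample page the two list items become candidate image-text blocks, while the 1-pixel image is dropped by the assumed filter. Without that filter, the climb from the tiny image would stop at `body`, which shows why the filtering step precedes the traversal.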
Opening claim text (preview).
What is claimed is:

1. A method for retrieving an image-text block from a web page, the method comprising: retrieving an image node from a document object model of a to-be-processed web page, the image node including an attribute of an image; filtering the image node based on a preset filtering rule to obtain candidate image nodes; traversing, for each of the candidate image nodes, a node in sequence toward an ancestor node of the each candidate image node in a preset maximum traversal depth until an ancestor node with a text is visited, using the ancestor node with the text as a candidate image-text block corresponding to the each candidate image node, and generating path information of the candidate image-text blocks based on locations of the candidate image-text blocks in the document object model, wherein the candidate image-text block includes a text content and the candidate image node; clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks to obtain at least one image-text block cluster; and determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the each image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the each image-text block cluster based on the common ancestor node, wherein the method is performed by at least one processor.

2. The method according to claim 1, wherein before the clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks to obtain at least one image-text block cluster, the method further comprises: structuring the candidate image-text blocks into a structure including following data information: the path information of the candidate image-text blocks, path information of the candidate image-text blocks formatted based on a preset format, an image resource path in candidate image nodes corresponding to the candidate image-text blocks, and the hash values of the path information of the candidate image-text blocks.

3. The method according to claim 2, wherein the path information comprises path information labeled with a path language of an extensible markup language, and the hash values of the path information of the candidate image-text blocks are hash values for the path information of the candidate image-text blocks excluding a predicate condition; and the determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the each image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the each image-text block cluster based on the common ancestor node comprises: determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks based on the path information of the candidate image-text blocks to obtain a predicate condition of path information of the common ancestor node; and combining the path information of the candidate image-text blocks within the each image-text block cluster based on the predicate condition of the path information of the common ancestor node, and using the combined path information as the path information of the each image-text block cluster.

4. The method according to claim 1, further comprising: comparing the path information of the image-text block clusters to filter out overlapped path information.

5. The method according to claim 1, wherein before the retrieving an image node from a document object model of a to-be-processed web page, the method further comprises: cleaning data of the document object model of the to-be-processed web page to remove invalid nodes in the document object model.

6. An apparatus for retrieving an image-text block from a web page, the apparatus comprising: at least one processor; and a memory storing instructions, wherein the instructions when executed by the at least one processor, cause the at least one processor to perform operations, the operations comprising: retrieving an image node from a document object model of a to-be-processed web page, the image node including an attribute of an image; filtering the image node based on a preset filtering rule to obtain candidate image nodes; traversing, for each of the candidate image nodes, a node in sequence towards an ancestor node of the each candidate image node in a preset maximum traversal depth until an ancestor node with a text is visited, using the ancestor node with the text as a candidate image-text block corresponding to the each candidate image node, and generating path information of the candidate image-text blocks based on locations of the candidate image-text blocks in the document object model, wherein the candidate image-text block includes a text content and the candidate image node; clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks to obtain at least one image-text block cluster; and determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the each image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the each image-text block cluster based on the common ancestor node.

7. The apparatus according to claim 6, wherein before the clustering the candidate image-text blocks based on hash values of the path information of the candidate image-text blocks to obtain at least one image-text block cluster, the operations further comprise: structuring the candidate image-text blocks into a structure including following data information: the path information of the candidate image-text blocks, path information of the candidate image-text blocks formatted based on a preset format, an image resource path in candidate image nodes corresponding to the candidate image-text blocks, and the hash values of the path information of the candidate image-text blocks.

8. The apparatus according to claim 7, wherein the path information comprises path information labeled with a path language of an extensible markup language, and the hash values of the path information of the candidate image-text blocks are hash values for the path information of the candidate image-text blocks excluding a predicate condition; and the determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks within the each image-text block cluster based on the path information of the candidate image-text blocks, and determining path information of the each image-text block cluster based on the common ancestor node comprises: determining, for each image-text block cluster, a common ancestor node of the candidate image-text blocks based on the path information of the candidate image-text blocks to obtain a predicate condition of path information of the common ancestor node; and combining the path information of the candidate image-text blocks within the each image-text block cluster based on the predicate condition of the path information of the common ancestor node, and using the combined path information as the path information of the each image-text block cluster.

9. The apparatus according to claim 6, wherein the operations further comprise: comparing the path information of the image-text block clusters to filter out overlapped path information.

10. The apparatus according to claim 6, wherein before the retrieving an image node from a document object model of a to-be-processed web page, the operations further comprise: cleaning data of the
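Claims 3 and 8 describe hashing the XPath of each candidate block with its predicate conditions excluded, clustering on that hash, and then deriving one path per cluster from the common ancestor. The sketch below is one possible reading of those steps, not the patentee's code: the choice of SHA-256, the regular expression used to strip predicates, and the merge rule (keep predicates up to and including the common ancestor, drop them below it) are all assumptions, and every function name is hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Matches an XPath predicate such as [2] or [@class='item'] (no nesting assumed).
PREDICATE = re.compile(r"\[[^\]]*\]")


def path_hash(path):
    """Hash the path with predicate conditions removed, so that
    /ul[1]/li[1] and /ul[1]/li[2] receive the same key."""
    stripped = PREDICATE.sub("", path)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def cluster_paths(paths):
    """Group candidate block paths whose predicate-free form is identical."""
    clusters = defaultdict(list)
    for path in paths:
        clusters[path_hash(path)].append(path)
    return list(clusters.values())


def cluster_path(members):
    """Combine the member paths into one path for the cluster.

    Steps shared verbatim by every member lead to the common ancestor and
    keep their predicates; from the first differing step onward the
    predicates are dropped so the path matches all members.
    """
    split = [m.strip("/").split("/") for m in members]
    merged = []
    diverged = False
    # Members of one cluster share tag names, so their step counts match.
    for steps in zip(*split):
        if not diverged and len(set(steps)) == 1:
            merged.append(steps[0])
        else:
            diverged = True
            merged.append(PREDICATE.sub("", steps[0]))
    return "/" + "/".join(merged)


if __name__ == "__main__":
    sample = [
        "/html/body[1]/ul[1]/li[1]",
        "/html/body[1]/ul[1]/li[2]",
        "/html/body[1]/div[2]/p[1]",
    ]
    for members in cluster_paths(sample):
        print(cluster_path(members), members)
```

With the sample input, the two `li` paths fall into one cluster whose combined path is `/html/body[1]/ul[1]/li`, and the lone `p` path forms its own cluster and keeps its full path.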
using extracted text · CPC title
Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title
Classification of content, e.g. text, photographs or tables · CPC title
Editing text-bitmaps, e.g. alignment, spacing; Semantic analysis of bitmaps of text without OCR · CPC title
Search customisation based on user profiles and personalisation · CPC title