Image processing of webpages

US10713545B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10713545-B2
Application number  US-201816172646-A
Country             US
Kind code           B2
Filing date         Oct 26, 2018
Priority date       Oct 26, 2018
Publication date    Jul 14, 2020
Grant date          Jul 14, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A web detection system processes webpage information and performs automated feature extraction of webpages including machine processable information. In an embodiment, the web detection system determines a subset of webpages having a target characteristic by processing markup language. For a webpage of the subset, the web detection system determines that a first image overlaps at least a portion of a second image in the webpage. The web detection system generates an image of the webpage such that the portion of the second image is obscured by the first image. The web detection system determines a graphical feature of the webpage by processing the image, e.g., using optical character recognition. Responsive to determining that the graphical feature corresponds to graphical features of images of a different set of webpages associated with a target entity, the web detection system determines that the webpage is also associated with the target entity.
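
The overlap step described in the abstract can be illustrated with a small sketch. It assumes, hypothetically, that each image's position has already been resolved from the markup into an axis-aligned box with a stacking order. The `Box` type, its `z` field, and the `obscured_region` function are illustrative names and are not taken from the patent, which does not specify how overlap is computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box for a positioned image, in pixels."""
    left: int
    top: int
    width: int
    height: int
    z: int = 0  # assumed stacking order; a higher value is drawn on top

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height


def obscured_region(first: Box, second: Box) -> Optional[Box]:
    """Return the part of `second` that is covered by `first`, or None.

    `first` obscures `second` only if it is drawn above it and the two
    rectangles share a region of non-zero area.
    """
    if first.z <= second.z:
        return None
    left = max(first.left, second.left)
    top = max(first.top, second.top)
    right = min(first.right, second.right)
    bottom = min(first.bottom, second.bottom)
    if right <= left or bottom <= top:
        return None
    return Box(left, top, right - left, bottom - top, first.z)


# A small overlay placed over the lower-right corner of a larger image.
banner = Box(left=0, top=0, width=300, height=100, z=1)
overlay = Box(left=200, top=60, width=150, height=80, z=2)
region = obscured_region(overlay, banner)
```

In this sketch the returned region is the area a later stage (for example OCR on the rendered page image) would inspect, since it is where the visible content differs from the underlying image.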

First claim

Opening claim text (preview).

What is claimed is:

1. A system for automated feature extraction of webpages including machine processable information, the system comprising: a markup language engine configured to: identify a plurality of webpages, and process markup language of the plurality of webpages to determine that a subset of the plurality of webpages includes a target characteristic; a rendering engine configured to: determine, for a webpage of the subset, that a first image overlaps at least a portion of a second image in the webpage based at least on markup language of the webpage, and generate, for the webpage of the subset, an image of the webpage such that the portion of the second image is obscured by the first image; and a detection engine configured to: determine, for the webpage of the subset, at least one graphical feature of the webpage by processing the image of the webpage, the at least one graphical feature corresponding to the portion of the second image, determine, for the webpage of the subset, that the at least one graphical feature corresponds to graphical features of images of a different plurality of webpages associated with a target entity, and generating, responsive to the determination that the at least one graphical feature corresponds to the graphical features of images of the different plurality of webpages, an association between the webpage and the target entity for storage in a database.

2. The system of claim 1, wherein the detection engine is further configured to: determine, using optical character recognition, a first string represented by the portion of the second image; determine, using optical character recognition, a second string represented by another portion of the first image that overlaps the portion of the second image; and determine a deviation between the first string and the second string, wherein the at least one graphical feature indicates the deviation.

3. The system of claim 2, wherein the first string includes at least one alphanumeric character, and wherein the second string is different from the first string by one or more alphanumeric characters.

4. The system of claim 3, wherein the first string describes at least one of a phone number, email address, or physical address.

5. The system of claim 1, wherein determining the plurality of webpages comprises: receiving, at the web detection system using an application programming interface, a plurality of webpage identifiers, the plurality of webpages determined using the plurality of webpage identifiers.

6. The method of claim 1, wherein determining that the subset of the plurality of webpages includes the target characteristic comprises: performing, by the web detection system, textual analysis of the markup language of the plurality of webpages, the target characteristic including at least one keyword in the markup language of the plurality of webpages.

7. A method for automated feature extraction of webpages including machine processable information, the method comprising: identifying, by a web detection system, a plurality of webpages; processing, by the web detection system, markup language of the plurality of webpages to determine that a subset of the plurality of webpages includes a target characteristic; responsive to determining that the subset of the plurality of webpages includes the target characteristic, for a webpage of the subset: determining, by the web detection system, that a first object overlaps at least a portion of a second object in the webpage based at least on markup language of the webpage; generating, by the web detection system, an image of the webpage such that the portion of the second object is obscured or altered by the first object; determining, by the web detection system, at least one feature of the webpage by processing the image of the webpage, the at least one feature corresponding to the portion of the second object; determining, by the web detection system, that the at least one feature corresponds to features of images of a different plurality of webpages associated with a target entity; and responsive to determining that the at least one feature corresponds to the features of images of the different plurality of webpages: generating, by the web detection system, an association between the webpage and the target entity for storage in a database.

8. The method of claim 7, further comprising: determining, by the web detection system using optical character recognition, a first string represented by the portion of the second object; determining, by the web detection system using optical character recognition, a second string represented by another portion of the first object that overlaps the portion of the second object; and determining, by the web detection system, a deviation between the first string and the second string, wherein the at least one feature indicates the deviation.

9. The method of claim 8, wherein the first string includes at least one alphanumeric character, and wherein the second string is different from the first string by one or more alphanumeric characters.

10. The method of claim 9, wherein the first string describes at least one of a phone number, email address, or physical address.

11. The method of claim 7, wherein determining the plurality of webpages comprises: receiving, at the web detection system using an application programming interface, a plurality of webpage identifiers, the plurality of webpages determined using the plurality of webpage identifiers.

12. The method of claim 7, wherein determining that the subset of the plurality of webpages includes the target characteristic comprises: performing, by the web detection system, textual analysis of the markup language of the plurality of webpages, the target characteristic including at least one keyword in the markup language of the plurality of webpages.

13. The method of claim 7, further comprising, responsive to determining that the subset of the plurality of webpages includes the target characteristic: determining, by the web detection system, metadata using the image of the webpage, wherein determining that the webpage is associated with the target entity is further based on the metadata.

14. The method of claim 7, further comprising: generating, by the web detection system, a report describing webpages of the plurality of webpages associated with the target entity; and transmitting the report by the web detection system to a client device.

15. A non-transitory computer-readable storage medium storing instructions for automated feature extraction of webpages including machine processable information, the instructions when executed by a processor causing the processor to: identify a plurality of webpages; process markup language of the plurality of webpages to determine that a subset of the plurality of webpages includes a target characteristic; responsive to determining that the subset of the plurality of webpages includes the target characteristic, for a webpage of the subset: determine that a first image overlaps at least a portion of a second image in the webpage based at least on markup language of the webpage; generate an image of the webpage such that the portion of the second image is obscured by the first image; determine at least one graphical feature of the webpage by processing the image of the webpage, the at least one graphical feature corresponding to the portion of the second image; determine that the at least one graphical feature corresponds to graphical features of images of a different plurality of webpages associated with a t
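
Claims 2 to 4 and 8 to 10 describe comparing a string read from the obscured portion with a string read from the overlapping image. A minimal sketch of one possible deviation measure follows. It assumes the two strings have already been produced by an OCR step (not shown) and uses Python's standard `difflib` as a stand-in comparison; the claims do not name a particular algorithm, and the `string_deviation` function and the sample phone-number strings are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def string_deviation(first: str, second: str) -> List[Tuple[str, str]]:
    """Return (from_first, from_second) pairs for each span that differs.

    An empty list means the two strings are identical.
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Assumed OCR outputs: the underlying image and the overlay differ by one digit.
underlying = "555-0142"
overlaid = "555-0147"
deviation = string_deviation(underlying, overlaid)
```

A non-empty result here plays the role of the "deviation" in the claims: it records which characters of the first string were changed by the overlapping image.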

Assignees

Merck Sharp & Dohme

Inventors

Classifications

  • G06F16/80 (Primary)

    of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML (content-based retrieval of web data G06F16/95) · CPC title

  • Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors · CPC title

  • Image acquisition (document image scanning and transmission H04N1/00; control of digital cameras H04N23/60) · CPC title

  • Matching criteria, e.g. proximity measures · CPC title

  • Character recognition · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10713545B2 cover?
A web detection system processes webpage information and performs automated feature extraction of webpages including machine processable information. In an embodiment, the web detection system determines a subset of webpages having a target characteristic by processing markup language. For a webpage of the subset, the web detection system determines that a first image overlaps at least a portio…
Who is the assignee on this patent?
Merck Sharp & Dohme
What technology area does this patent fall under?
Primary CPC classification G06F16/80. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 14, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).