Organizational logo enrichment

US10002292B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10002292-B2
Application number  US-201514929116-A
Country             US
Kind code           B2
Filing date         Oct 30, 2015
Priority date       Sep 30, 2015
Publication date    Jun 19, 2018
Grant date          Jun 19, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In an example embodiment, a web page is obtained using a web page address stored in a first record and is parsed to extract one or more images from the web page along with a second plurality of features for each of the one or more images from the web page. Information about each image of the web page and the extracted second plurality of features for the web page are input into a supervised machine learning classifier to calculate a logo confidence score for each image of the web page, the logo confidence score indicating the probability that the image is an organization logo. In response to a particular image in the web page having a logo confidence score transgressing a first threshold, the particular image is injected into an organization logo field of the first record.
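The final step of the abstract, writing an image into the record once its score passes a threshold, can be sketched as follows. This is a minimal illustration and not the patented implementation: the record layout, the `enrich_logo_field` name, the 0.8 threshold, and the rule of keeping only the highest-scoring image are all assumptions, and "transgressing" is read here as "exceeding". The `score_image` callable stands in for the trained classifier.

```python
from typing import Callable, Dict, List, Optional


def enrich_logo_field(
    record: Dict,
    images: List[Dict],
    score_image: Callable[[Dict], float],
    threshold: float = 0.8,  # assumed value; the source does not give one
) -> Optional[str]:
    """Write the best-scoring image into record["logo"] if it exceeds the threshold."""
    best_src, best_score = None, threshold
    for image in images:
        score = score_image(image)  # stand-in for the classifier's logo confidence score
        if score > best_score:
            best_src, best_score = image["src"], score
    if best_src is not None:
        record["logo"] = best_src  # the "injection" into the logo field
    return best_src


# Hypothetical record and pre-extracted images with made-up scores.
record = {"name": "Example Corp", "url": "https://example.com", "logo": None}
images = [
    {"src": "/img/banner.jpg", "score": 0.35},
    {"src": "/img/example-logo.png", "score": 0.93},
]
chosen = enrich_logo_field(record, images, score_image=lambda img: img["score"])
```

If no image clears the threshold, the record is left unchanged and the function returns `None`.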

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for enrichment of a logo field of a first record in a computer system, the method comprising: retrieving a plurality of sample web pages; for each sample web page in the plurality of sample web pages, parsing the sample web page to extract one or more images from the sample web page along with a first plurality of features for each of the one or more images from the sample web page, the plurality of features including proximity of each of the one or more images from the sample web page to a top of the sample web page; labeling at least one of the images in one of the plurality of sample web pages as being an organization logo; feeding information about the labeled at least one of the images, the extracted first plurality of feature for each sample web page into a supervised machine learning classifier to train the supervised machine learning classifier to recognize when an image in a web page is an organization logo, wherein the supervised machine learning classifier utilizes a random forest algorithm and a k-nearest neighbor classifier; obtaining a web page using a web page address stored in the first record; parsing the web page to extract one or more images from the web page along with a second plurality of features for each of the one or more images from the web page, the first plurality of features including proximity of each of the one or more images from the web page to a top of the web page, without rendering the web page; inputting information about each image of the web page and the extracted second plurality of features for the web page into the supervised machine learning classifier to calculate a logo confidence score for each image of the web page, the logo confidence score indicating a probability that the image is an organization logo; and in response to a particular image in the web page having a logo confidence score transgressing a first threshold, injecting the particular image into an organization logo field of the first record.

2. The method of claim 1, wherein the first and second plurality of features both include image dimension.

3. The method of claim 1, wherein the first and second plurality of features include whether an image links to a home page of an organization web site.

4. The method of claim 1, wherein the second plurality of features include similarity of an organization name extracted from the first record to a file name of an image.

5. The method of claim 1, wherein the second plurality of features include similarity of an organization name extracted from the first record to tag attributes surrounding a tag in the web page.

6. The method of claim 1, further comprising feeding a vector of common terms either positively or negatively correlated with a logo into the supervised machine learning classifier and using the presence of any of these terms in a tag in the web page to influence whether a nearby image corresponds to an organization logo.

7. A system comprising: a computer-readable medium having instructions stored there on, which, when executed by a processor, cause the system to perform operations comprising: retrieving a plurality of sample web pages; for each sample web page in the plurality of sample web pages, parsing the sample web page to extract one or more images from the sample web page along with a first plurality of features for each of the one or more images from the sample web page, the plurality of features including proximity of each of the one or more images from the sample web page to a top of the sample web page; labeling at least one of the images in one of the plurality of sample web pages as being an organization logo; feeding information about the labeled at least one of the images, the extracted first plurality of feature for each sample web page into a supervised machine learning classifier to train the supervised machine learning classifier to recognize when an image in a web page is an organization logo, wherein the supervised machine learning classifier utilizes a random forest algorithm and a k-nearest neighbor classifier; obtaining a web page using a web page address stored in the first record; parsing the web page to extract one or more images from the web page along with a second plurality of features for each of the one or more images from the web page, the first plurality of features including proximity of each of the one or more images from the web page to a top of the web page, without rendering the web page; inputting information about each image of the web page and the extracted second plurality of features for the web page into the supervised machine learning classifier to calculate a logo confidence score for each image of the web page, the logo confidence score indicating a probability that the image is an organization logo; and in response to a particular image in the web page having a logo confidence score transgressing a first threshold, injecting the particular image into an organization logo field of the first record.

8. The system of claim 7, wherein the first and second plurality of features both include image dimension.

9. The system of claim 7, wherein the first and second plurality of features include whether an image links to a home page of an organization web site.

10. The system of claim 7, wherein the second plurality of features include similarity of an organization name extracted from the first record to a file name of an image.

11. The system of claim 7, wherein the second plurality of features include similarity of an organization name extracted from the first record to tag attributes surrounding a tag in the web page.

12. The system of claim 7, wherein the operations further comprise feeding a vector of common terms either positively or negatively correlated with a logo into the supervised machine learning classifier and using the presence of any of these terms in a tag in the web page to influence whether a nearby image corresponds to an organization logo.

13. A non-transitory machine-readable storage medium comprising instructions, which when implemented by one or more machines, cause the one or more machines to perform operations comprising: retrieving a plurality of sample web pages; for each sample web page in the plurality of sample web pages, parsing the sample web page to extract one or more images from the sample web page along with a first plurality of features for each of the one or more images from the sample web page, the plurality of features including proximity of each of the one or more images from the sample web page to a top of the sample web page; labeling at least one of the images in one of the plurality of sample web pages as being an organization logo; feeding information about the labeled at least one of the images, the extracted first plurality of feature for each sample web page into a supervised machine learning classifier to train the supervised machine learning classifier to recognize when an image in a web page is an organization logo, wherein the supervised machine learning classifier utilizes a random forest algorithm and a k-nearest neighbor classifier; obtaining a web page using a web page address stored in the first record; parsing the web page to extract one or more images from the web page along with a second plurality of features for each of the one or more images from the web page, the first plurality of features including proximity of each of the one or more images from the web page to a top of the web page, without rendering the web page; inputting information about each image of the web page and the extracted second plurality of features
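Several of the features named in the claims can be computed from the raw HTML alone, without rendering the page. The sketch below is one possible reading, not the patent's method: it uses Python's standard `html.parser`, approximates "proximity to a top of the web page" by counting the tags seen before each image, uses `difflib.SequenceMatcher` as a stand-in for the unspecified name-similarity measure, and checks correlated terms only in the image tag's own attributes. The class name, the feature keys, and the two term lists are hypothetical.

```python
import posixpath
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

LOGO_TERMS = ("logo", "brand")         # hypothetical positively correlated terms
NON_LOGO_TERMS = ("banner", "sprite")  # hypothetical negatively correlated terms


class ImageFeatureParser(HTMLParser):
    """Collects per-image features from HTML source without rendering it."""

    def __init__(self, page_url, org_name):
        super().__init__()
        self.page_url = page_url
        self.org_name = org_name.lower()
        self.home = urljoin(page_url, "/").rstrip("/")
        self.open_links = []  # targets of the <a> tags currently open
        self.tag_count = 0    # tags seen so far, a proxy for distance from the top
        self.images = []

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href")
            self.open_links.append(
                urljoin(self.page_url, href).rstrip("/") if href else None
            )
        elif tag == "img" and attrs.get("src"):
            self.images.append(self._features(attrs))

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            self.open_links.pop()

    def _features(self, attrs):
        src = attrs["src"]
        stem = posixpath.splitext(posixpath.basename(urlparse(src).path))[0].lower()
        text = " ".join(v.lower() for v in attrs.values() if v)

        def size(key):
            value = attrs.get(key) or ""
            return int(value) if value.isdigit() else None

        return {
            "src": urljoin(self.page_url, src),
            "tags_before": self.tag_count,
            "width": size("width"),
            "height": size("height"),
            "links_home": bool(self.open_links) and self.open_links[-1] == self.home,
            "name_similarity": SequenceMatcher(None, self.org_name, stem).ratio(),
            "logo_terms": sum(term in text for term in LOGO_TERMS),
            "non_logo_terms": sum(term in text for term in NON_LOGO_TERMS),
        }


# Made-up page for a hypothetical organization named "Acme".
html = """
<header><a href="/"><img src="/static/acme-logo.png" alt="Acme logo" width="120" height="40"></a></header>
<main><img src="/static/hero-banner.jpg" alt="Spring sale banner"></main>
"""
parser = ImageFeatureParser("https://acme.example/products", "Acme")
parser.feed(html)
features = parser.images
```

Each dictionary in `features` could then be turned into a numeric vector for a classifier. Counting tags is only a rough substitute for on-page position, since CSS can move elements anywhere once the page is rendered.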

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G06F16/215 (Primary)

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

  • Active pattern learning

  • Classification techniques

  • Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

  • Proximity, similarity or dissimilarity measures

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10002292B2 cover?
In an example embodiment, a web page is obtained using a web page address stored in a first record and is parsed to extract one or more images from the web page along with a second plurality of features for each of the one or more images from the web page. Information about each image of the web page and the extracted second plurality of features for the web page are input into a supervised mac…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 19, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).