Large-scale image tagging using image-to-topic embedding
US-2018267997-A1 · Sep 20, 2018 · US
US11874882B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11874882-B2 |
| Application number | US-201916460776-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 2, 2019 |
| Priority date | Jul 2, 2019 |
| Publication date | Jan 16, 2024 |
| Grant date | Jan 16, 2024 |
Official abstract text for this publication.
A system for extracting key phrase candidates from a corpus of documents, including a processor, a memory, and a program executing on the processor. The system is configured to run a key phrase model to extract one or more key phrase candidates from each document in the corpus and convert each extracted key phrase candidate into a feature vector. The key phrase model also filters the feature vectors to remove duplicates using a classifier that was trained on a set of key phrase pairs with manual labels indicating whether two key phrases are duplicates of each other, to produce remaining key phrase candidates. The system uses the remaining key phrase candidates in a computer-implemented application.
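The three stages named in the abstract (extract candidates, convert them to feature vectors, filter duplicates with a pairwise classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the n-gram extractor, the three features, the `Candidate` type, and every function name are hypothetical, and the hand-set weights in `is_duplicate` stand in for a classifier that the abstract says is trained on manually labeled key phrase pairs.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "for", "to", "on"}


@dataclass(frozen=True)
class Candidate:
    phrase: str
    features: tuple  # (term frequency, relative position, title overlap ratio)


def extract_candidates(text, max_len=2):
    """Return (phrase, position) for word n-grams not starting or ending with a stopword."""
    tokens = text.split()
    out = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0].lower() in STOPWORDS or gram[-1].lower() in STOPWORDS:
                continue
            out.append((" ".join(gram), i))
    return out


def to_feature_vectors(text, title, max_len=2):
    """Convert each distinct candidate into a small feature vector."""
    tokens = text.split()
    raw = extract_candidates(text, max_len)
    counts = Counter(phrase for phrase, _ in raw)
    title_words = {w.lower() for w in title.split()}
    seen = {}
    for phrase, pos in raw:
        if phrase in seen:
            continue  # keep the features of the first occurrence
        words = phrase.lower().split()
        overlap = sum(w in title_words for w in words) / len(words)
        seen[phrase] = Candidate(phrase, (counts[phrase], pos / len(tokens), overlap))
    return list(seen.values())


def pair_features(a, b):
    """Features describing a pair of candidates: token Jaccard and case-insensitive equality."""
    ta, tb = set(a.phrase.lower().split()), set(b.phrase.lower().split())
    jaccard = len(ta & tb) / len(ta | tb)
    same_ignoring_case = float(a.phrase.lower() == b.phrase.lower())
    return (jaccard, same_ignoring_case)


def is_duplicate(a, b, weights=(0.4, 1.0), bias=-0.5):
    """Stand-in binary classifier: a linear score with assumed, untrained weights."""
    score = sum(w * x for w, x in zip(weights, pair_features(a, b))) + bias
    return score > 0


def filter_duplicates(candidates):
    """Keep a candidate only if it is not a duplicate of one already kept."""
    kept = []
    for cand in candidates:
        if not any(is_duplicate(cand, k) for k in kept):
            kept.append(cand)
    return kept


text = "Solar panels convert sunlight and solar panels save money"
title = "Solar Panels Guide"
remaining = filter_duplicates(to_feature_vectors(text, title))
print([c.phrase for c in remaining])
```

With these assumed weights only candidates that differ by letter case are merged; a trained classifier would presumably also catch abbreviations and near-synonyms.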
Opening claim text (preview).
The invention claimed is:
1. A system for including a webpage in a ranked list of webpages returned to a client computing device in response to receipt of a query, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform acts comprising: receiving, on a search engine executing on the system, the query from the client computing device, wherein the client computing device is in network communication with the system; returning the ranked list of webpages to the client computing device based upon the query, wherein the webpage belongs to a website, and further wherein a position of the webpage in the ranked list of webpages is based upon a topical authority score assigned to the website with respect to a topic, the topical authority score is representative of an authoritativeness of the website with respect to the topic, wherein the topical authority score for the website is computed by way of acts comprising: extracting key phrase candidates from webpages that belong to the website; converting the extracted key phrase candidates into feature vectors; filtering the feature vectors to remove duplicates using a classifier that was trained on a set of key phrase pairs with manual labels indicating whether two key phrases are duplicates of each other, to produce remaining feature vectors for remaining key phrase candidates; and assigning scores to the remaining key phrase candidates based upon the remaining feature vectors, wherein a key phrase candidate is the topic, and further wherein the topical authority score assigned to the website is based upon a score in the scores assigned to the key phrase candidate.
2. The system of claim 1, wherein the remaining key phrase candidates are topics, and further wherein topical authority scores assigned to the website are based upon the scores assigned to the remaining key phrase candidates, such that the web site has multiple topical authority scores assigned thereto for different topics.
3. The system of claim 1, wherein the classifier is a binary classifier.
4. The system of claim 1, wherein a regression model is employed to assign the scores to the remaining key phrase candidates.
5. The system of claim 1, wherein converting the extracted key-phrase candidates include at least one of the following features: language statistics, including Term Frequency (TF) and Inverted Document Frequency (IDF); a ratio of overlap with page title and page URL; relative position on the web page; one-hot encoding; features of surrounding words; case encoding; and stopword features.
6. The system of claim 1, the acts further comprising: prior to filtering the feature vectors using the classifier, filtering the feature vectors using deduplication rules.
7. The system of claim 6, wherein the deduplication rules include at least one of: a first rule that combines candidates with different case; a second rule that combines candidates which are the same entity in full name and abbreviation; a third rule that deduplicates candidates that overlap with each other on language statistics; or a fourth rule that drops candidates starting or ending with stopwords or containing curse words.
8. The system of claim 1, wherein extracting the key phrase candidates from the webpages that belong to the website comprises using grammar rules to extract the key phrase candidates from the webpages, wherein the grammar rules are based upon queries previously submitted to the search engine.
9. The system of claim 8, wherein the grammar rules are based upon parts of speech assigned to terms in the queries.
10. The system of claim 8, wherein the grammar rules comprise a sequence of parts of speech, and further wherein extracting the key phrase candidates from the webpages that belong to the website comprises: assigning parts of speech to terms in the webpages; and extracting a sequence of terms in a webpage as a key phrase candidate due to the sequence of terms being assigned respective parts of speech that matches the sequence of parts of speech identified in the grammar rules.
11. A method performed by a computing system that executes a search engine, the method comprising: receiving, at the search engine executing on the computing system, a query from a client computing device that is in network communication with the computing system; returning a ranked list of webpages to the client computing device based upon the query, wherein the ranked list of webpages comprises a webpage that belongs to a website, and further wherein a position of the webpage in the ranked list of webpages is based upon a topical authority score assigned to the website with respect to a topic, the topical authority score is representative of an authoritativeness of the website with respect to the topic, wherein the topical authority score for the website is computed by way of acts comprising: identifying key phrase candidates in webpages that belong to the website; converting the identified key phrase candidates into feature vectors; using a classifier that was trained on a set of key phrase pairs having labels indicating whether two key phrases are duplicates of one another, removing duplicate feature vectors to produce remaining feature vectors for remaining key phrase candidates; and assigning scores to the remaining key phrase candidates based upon the remaining feature vectors, wherein a key phrase candidate in the remaining key phrase candidates is the topic, and further wherein the topical authority score assigned to the website is based upon a score in the scores assigned to the key phrase candidate.
12. The method of claim 11, wherein the remaining key phrase candidates are topics, and further wherein topical authority scores assigned to the website are based upon the scores assigned to the remaining key phrase candidates, such that the web site has multiple topical authority scores assigned thereto for different topics.
13. The method of claim 11, wherein the classifier is a binary classifier.
14. The method of claim 11, wherein a regression model is employed to assign the scores to the remaining key phrase candidates.
15. The method of claim 11, wherein converting the identified key-phrase candidates include at least one of the following features: language statistics, including Term Frequency (TF) and Inverted Document Frequency (IDF); a ratio of overlap with page title and page URL; relative position on the web page; one-hot encoding; features of surrounding words; case encoding; and stopword features.
16. The method of claim 11, further comprising: prior to the duplicates being removed through use of the classifier, filtering the feature vectors using deduplication rules.
17. The method of claim 16, wherein the deduplication rules include at least one of: a first rule that combines candidates with different case; a second rule that combines candidates which are the same entity in full name and abbreviation; a third rule that deduplicates candidates that overlap with each other on language statistics; or a fourth rule that drops candidates starting or ending with stopwords or containing curse words.
18. A non-transitory computer-readable medium comprising instruction that, when executed by a processor, cause the processor to perform acts comprising: receiving, at a search engine executing on a computing system, a query from a client computing device that is in network communication with the computing system; returning a ranked list of webpages to the client computing d
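Claims 8 through 10 describe grammar rules expressed as sequences of parts of speech, derived from past search queries and then matched against webpage text. The sketch below shows one way that matching could work. It is an assumption-laden illustration: the `TOY_POS` lexicon replaces a real part-of-speech tagger, and the function names, the tag set, and the `min_count` threshold are all hypothetical rather than taken from the patent.

```python
from collections import Counter

# Hypothetical toy lexicon; a real system would use a trained part-of-speech tagger.
TOY_POS = {
    "the": "DET", "best": "ADJ", "electric": "ADJ",
    "car": "NOUN", "cars": "NOUN", "battery": "NOUN", "batteries": "NOUN",
    "charge": "VERB", "last": "VERB", "quickly": "ADV", "longer": "ADV",
}


def tag(tokens):
    """Assign a part of speech to each token; unknown words get the tag 'X'."""
    return [TOY_POS.get(tok.lower(), "X") for tok in tokens]


def rules_from_queries(queries, min_count=1):
    """Derive grammar rules: the part-of-speech sequences seen in past queries."""
    counts = Counter(tuple(tag(q.split())) for q in queries)
    return {seq for seq, n in counts.items() if n >= min_count}


def extract_by_rules(text, rules):
    """Extract every token span whose tag sequence equals one of the rules."""
    tokens = text.split()
    tags = tag(tokens)
    found = []
    # Longest rules first, ties broken alphabetically, so output order is deterministic.
    for seq in sorted(rules, key=lambda s: (-len(s), s)):
        n = len(seq)
        for i in range(len(tokens) - n + 1):
            if tuple(tags[i:i + n]) == seq:
                found.append(" ".join(tokens[i:i + n]))
    return found


past_queries = ["electric cars", "best electric cars", "car battery"]
page = "The best electric cars charge quickly and car batteries last longer"
rules = rules_from_queries(past_queries)
print(extract_by_rules(page, rules))
```

Deriving the rules from queries, as the claims describe, biases extraction toward phrase shapes that people actually search for; the exact-match scan here is the simplest possible matcher and ignores overlapping or nested spans.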
Indexing; Web crawling techniques · CPC title
Clustering; Classification · CPC title
using information identifiers, e.g. uniform resource locators [URL] · CPC title
Search customisation based on social or collaborative filtering · CPC title
Phrasal analysis, e.g. finite state techniques or chunking · CPC title