Who is the assignee on this patent?

Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G06F16/951. Mapped technology areas include Physics.

When was this patent published?

Publication date Jan 16, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Extracting key phrase candidates from documents and producing topical authority ranking

US11874882B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11874882-B2
Application number	US-201916460776-A
Country	US
Kind code	B2
Filing date	Jul 2, 2019
Priority date	Jul 2, 2019
Publication date	Jan 16, 2024
Grant date	Jan 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system for extracting key phrase candidates from a corpus of documents, including a processor, a memory, and a program executing on the processor. The system is configured to run a key phrase model to extract one or more key phrase candidates from each document in the corpus and convert each extracted key phrase candidate into a feature vector. The key phrase model also filters the feature vectors to remove duplicates using a classifier that was trained on a set of key phrase pairs with manual labels indicating whether two key phrases are duplicates of each other, to produce remaining key phrase candidates. The system uses the remaining key phrase candidates in a computer-implemented application.
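The flow in the abstract (extract candidates, convert each to a feature vector, filter duplicates pairwise) can be sketched roughly as below. This is an illustrative sketch only, not the patented implementation: `Candidate`, `featurize`, `jaccard_is_duplicate`, and `filter_duplicates` are hypothetical names, the three features are simplified versions of ones named in the claims, and the token-overlap threshold is a stand-in for the trained classifier, which the abstract says is learned from manually labeled key phrase pairs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    phrase: str
    features: List[float]  # hypothetical layout: [term frequency, title overlap, relative position]


def featurize(phrase: str, document: str, title: str) -> List[float]:
    """Build a toy feature vector for one key phrase candidate."""
    doc_lower = document.lower()
    phrase_lower = phrase.lower()
    tokens = phrase_lower.split()
    # Raw count of the phrase in the document (a crude term frequency).
    tf = float(doc_lower.count(phrase_lower))
    # Fraction of the phrase's tokens that also appear in the page title.
    title_tokens = set(title.lower().split())
    overlap = sum(t in title_tokens for t in tokens) / len(tokens) if tokens else 0.0
    # Position of the first occurrence, scaled to [0, 1]; 1.0 if the phrase is absent.
    first = doc_lower.find(phrase_lower)
    rel_pos = first / len(doc_lower) if first >= 0 and doc_lower else 1.0
    return [tf, overlap, rel_pos]


def jaccard_is_duplicate(a: Candidate, b: Candidate, threshold: float = 0.5) -> bool:
    """Assumed stand-in for the trained duplicate classifier: token-set Jaccard overlap."""
    ta, tb = set(a.phrase.lower().split()), set(b.phrase.lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def filter_duplicates(
    candidates: List[Candidate],
    is_duplicate: Callable[[Candidate, Candidate], bool],
) -> List[Candidate]:
    """Keep a candidate only if it is not judged a duplicate of one already kept."""
    kept: List[Candidate] = []
    for cand in candidates:
        if not any(is_duplicate(cand, k) for k in kept):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    doc = "Topical authority matters. Topical authority ranking uses key phrases."
    title = "Topical authority ranking"
    phrases = ["topical authority", "Topical Authority ranking", "key phrases"]
    cands = [Candidate(p, featurize(p, doc, title)) for p in phrases]
    for c in filter_duplicates(cands, jaccard_is_duplicate):
        print(c.phrase, c.features)
```

Passing the duplicate test in as a callable keeps the filtering loop independent of how duplicates are decided, so a learned pairwise model could replace the threshold without touching the rest of the sketch.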

First claim

Opening claim text (preview).

The invention claimed is:
1. A system for including a webpage in a ranked list of webpages returned to a client computing device in response to receipt of a query, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform acts comprising: receiving, on a search engine executing on the system, the query from the client computing device, wherein the client computing device is in network communication with the system; returning the ranked list of webpages to the client computing device based upon the query, wherein the webpage belongs to a website, and further wherein a position of the webpage in the ranked list of webpages is based upon a topical authority score assigned to the website with respect to a topic, the topical authority score is representative of an authoritativeness of the website with respect to the topic, wherein the topical authority score for the website is computed by way of acts comprising: extracting key phrase candidates from webpages that belong to the website; converting the extracted key phrase candidates into feature vectors; filtering the feature vectors to remove duplicates using a classifier that was trained on a set of key phrase pairs with manual labels indicating whether two key phrases are duplicates of each other, to produce remaining feature vectors for remaining key phrase candidates; and assigning scores to the remaining key phrase candidates based upon the remaining feature vectors, wherein a key phrase candidate is the topic, and further wherein the topical authority score assigned to the website is based upon a score in the scores assigned to the key phrase candidate.
2. The system of claim 1, wherein the remaining key phrase candidates are topics, and further wherein topical authority scores assigned to the website are based upon the scores assigned to the remaining key phrase candidates, such that the web site has multiple topical authority scores assigned thereto for different topics.
3. The system of claim 1, wherein the classifier is a binary classifier.
4. The system of claim 1, wherein a regression model is employed to assign the scores to the remaining key phrase candidates.
5. The system of claim 1, wherein converting the extracted key-phrase candidates include at least one of the following features: language statistics, including Term Frequency (TF) and Inverted Document Frequency (IDF); a ratio of overlap with page title and page URL; relative position on the web page; one-hot encoding; features of surrounding words; case encoding; and stopword features.
6. The system of claim 1, the acts further comprising: prior to filtering the feature vectors using the classifier, filtering the feature vectors using deduplication rules.
7. The system of claim 6, wherein the deduplication rules include at least one of: a first rule that combines candidates with different case; a second rule that combines candidates which are the same entity in full name and abbreviation; a third rule that deduplicates candidates that overlap with each other on language statistics; or a fourth rule that drops candidates starting or ending with stopwords or containing curse words.
8. The system of claim 1, wherein extracting the key phrase candidates from the webpages that belong to the website comprises using grammar rules to extract the key phrase candidates from the webpages, wherein the grammar rules are based upon queries previously submitted to the search engine.
9. The system of claim 8, wherein the grammar rules are based upon parts of speech assigned to terms in the queries.
10. The system of claim 8, wherein the grammar rules comprise a sequence of parts of speech, and further wherein extracting the key phrase candidates from the webpages that belong to the website comprises: assigning parts of speech to terms in the webpages; and extracting a sequence of terms in a webpage as a key phrase candidate due to the sequence of terms being assigned respective parts of speech that matches the sequence of parts of speech identified in the grammar rules.
11. A method performed by a computing system that executes a search engine, the method comprising: receiving, at the search engine executing on the computing system, a query from a client computing device that is in network communication with the computing system; returning a ranked list of webpages to the client computing device based upon the query, wherein the ranked list of webpages comprises a webpage that belongs to a website, and further wherein a position of the webpage in the ranked list of webpages is based upon a topical authority score assigned to the website with respect to a topic, the topical authority score is representative of an authoritativeness of the website with respect to the topic, wherein the topical authority score for the website is computed by way of acts comprising: identifying key phrase candidates in webpages that belong to the website; converting the identified key phrase candidates into feature vectors; using a classifier that was trained on a set of key phrase pairs having labels indicating whether two key phrases are duplicates of one another, removing duplicate feature vectors to produce remaining feature vectors for remaining key phrase candidates; and assigning scores to the remaining key phrase candidates based upon the remaining feature vectors, wherein a key phrase candidate in the remaining key phrase candidates is the topic, and further wherein the topical authority score assigned to the website is based upon a score in the scores assigned to the key phrase candidate.
12. The method of claim 11, wherein the remaining key phrase candidates are topics, and further wherein topical authority scores assigned to the website are based upon the scores assigned to the remaining key phrase candidates, such that the web site has multiple topical authority scores assigned thereto for different topics.
13. The method of claim 11, wherein the classifier is a binary classifier.
14. The method of claim 11, wherein a regression model is employed to assign the scores to the remaining key phrase candidates.
15. The method of claim 11, wherein converting the identified key-phrase candidates include at least one of the following features: language statistics, including Term Frequency (TF) and Inverted Document Frequency (IDF); a ratio of overlap with page title and page URL; relative position on the web page; one-hot encoding; features of surrounding words; case encoding; and stopword features.
16. The method of claim 11, further comprising: prior to the duplicates being removed through use of the classifier, filtering the feature vectors using deduplication rules.
17. The method of claim 16, wherein the deduplication rules include at least one of: a first rule that combines candidates with different case; a second rule that combines candidates which are the same entity in full name and abbreviation; a third rule that deduplicates candidates that overlap with each other on language statistics; or a fourth rule that drops candidates starting or ending with stopwords or containing curse words.
18. A non-transitory computer-readable medium comprising instruction that, when executed by a processor, cause the processor to perform acts comprising: receiving, at a search engine executing on a computing system, a query from a client computing device that is in network communication with the computing system; returning a ranked list of webpages to the client computing d
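
Two steps recited in the dependent claims lend themselves to a short illustration: matching a sequence of parts of speech against tagged page terms (claim 10), and rule-based deduplication before the classifier runs (claims 6 and 7). The sketch below is a loose reading under stated assumptions, not the patent's implementation. The function names, the tag labels, the example rules, and the stopword list are all hypothetical; the tagged input is supplied by hand rather than by a real tagger; and only the case-folding and stopword-boundary rules are shown.

```python
from typing import List, Sequence, Tuple

TaggedToken = Tuple[str, str]  # (term, part-of-speech tag); tag names are illustrative

# Illustrative stopword list; a real system would use a much larger one.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "to"})


def extract_by_grammar(
    tagged: Sequence[TaggedToken],
    rules: Sequence[Sequence[str]],
) -> List[str]:
    """Return every run of terms whose tags match one of the part-of-speech sequences."""
    found: List[str] = []
    for rule in rules:
        n = len(rule)
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if tuple(tag for _, tag in window) == tuple(rule):
                found.append(" ".join(term for term, _ in window))
    return found


def apply_dedup_rules(phrases: Sequence[str]) -> List[str]:
    """Merge case variants and drop phrases that start or end with a stopword."""
    seen = set()
    kept: List[str] = []
    for phrase in phrases:
        tokens = phrase.lower().split()
        if not tokens or tokens[0] in STOPWORDS or tokens[-1] in STOPWORDS:
            continue  # stopword-boundary rule
        key = " ".join(tokens)
        if key in seen:
            continue  # case-variant rule: keep the first spelling encountered
        seen.add(key)
        kept.append(phrase)
    return kept


if __name__ == "__main__":
    tagged = [
        ("Topical", "ADJ"), ("authority", "NOUN"), ("of", "ADP"),
        ("the", "DET"), ("web", "NOUN"), ("site", "NOUN"),
    ]
    rules = [("ADJ", "NOUN"), ("NOUN", "NOUN")]
    print(extract_by_grammar(tagged, rules))
    print(apply_dedup_rules(["Topical authority", "topical Authority", "the web site", "web site"]))
```

In the claims, the grammar rules are derived from parts of speech assigned to previously submitted queries; here they are hard-coded purely to show the matching step.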

Assignees

Microsoft Technology Licensing LLC

Classifications

G06F16/951 (Primary)
Indexing; Web crawling techniques · CPC title
G06F16/906
Clustering; Classification · CPC title
G06F16/955
using information identifiers, e.g. uniform resource locators [URL] · CPC title
G06F16/9536
Search customisation based on social or collaborative filtering · CPC title
G06F40/289
Phrasal analysis, e.g. finite state techniques or chunking · CPC title

Patent family

Related publications grouped by family.

View patent family 71948663
