Who is the assignee on this patent?

Thambiratnam Albert Joseph Kishan, Meng Sha, Li Gang, and 2 more

What technology area does this patent fall under?

Primary CPC classification G06F16/7844. Mapped technology areas include Physics.

When was this patent published?

Publication date Nov 1, 2016 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Keyword generation for media content

US9483557B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9483557-B2
Application number	US-201113040640-A
Country	US
Kind code	B2
Filing date	Mar 4, 2011
Priority date	Mar 4, 2011
Publication date	Nov 1, 2016
Grant date	Nov 1, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In various embodiments, a transcript that represents a media file is created. Keyword candidates that may represent topics and/or content associated with the media content are then extracted from the transcript. Furthermore, a keyword set may be generated for the media content utilizing a mutual information criteria. In other embodiments, one or more queries may be generated based at least in part on the transcript, and a plurality of web documents may be retrieved based at least in part on the one or more queries. Additional keyword candidates may be extracted from each web document and then ranked. A subset of the keyword candidates may then be selected to form a keyword set associated with the media content.

First claim

Opening claim text (preview).

The invention claimed is:

1. A method comprising: generating a transcript from media content using automatic speech recognition; extracting transcript keyword candidates from the transcript once the transcript is generated from the media content; determining that a first subset of the extracted transcript keyword candidates relates to a primary topic of the media content and a second subset of the extracted transcript keyword candidates relates to a second topic of the media content, wherein the primary topic is different from the second topic; generating web queries from the transcript keyword candidates and submitting the web queries to a search engine; accumulating results from the web queries to form a set of web documents and extracting web document keyword candidates from the set of web documents; determining mutual information criteria for individual ones of the extracted transcript keyword candidates based at least in part on a number of documents from the set of web documents that contain a co-occurrence of an extracted transcript keyword candidate with a web document keyword candidate, and a total number of documents of the set of web documents; ranking, based at least in part on the mutual information criteria, the extracted transcript keyword candidates to generate ranked transcript keyword candidates, wherein the first subset of the extracted transcript keyword candidates related to the primary topic of the media content are ranked higher than the second subset of the extracted transcript keyword candidates related to the second topic of the media content; selecting one or more of the extracted transcript keyword candidates based at least in part on the ranked transcript keyword candidates to form a keyword set; and associating the keyword set with the media content.

2. A method as recited in claim 1, further comprising associating the keyword set with the media content such that the keyword set is presented with the media content, the keyword set including keywords that represent the primary topic and the second topic associated with the media content.

3. A method as recited in claim 1, wherein the keyword set includes one or more other words or phrases in addition to words or phrases included in the transcript.

4. A method as recited in claim 1, further comprising presenting the keyword set to a user before rendering the media content.

5. A method as recited in claim 1, wherein selecting the one or more of the extracted transcript keyword candidates further comprises identifying a predetermined number of top-ranked transcript keyword candidates from the extracted transcript keyword candidates and associating the top-ranked transcript keyword candidates with the media content.

6. A method as recited in claim 1, further comprising indexing the media content based at least in part on the selected one or more extracted transcript keyword candidates included in the keyword set.

7. A method comprising: generating a transcript from media content using automatic speech recognition; extracting transcript keyword candidates from the transcript once the transcript is generated from the media content; generating web queries from the extracted transcript keyword candidates and submitting the web queries to a search engine; accumulating results from the web queries to form a set of web documents and extracting web document keyword candidates from the set of web documents; determining mutual information criteria for individual ones of the extracted transcript keyword candidates based at least in part on a number of documents from the set of web documents that contain a co-occurrence of an extracted transcript keyword candidate with an extracted web document keyword candidate, and a total number of documents of the set of web documents; ranking, based at least in part on the mutual information criteria, the extracted transcript keyword candidates to generate ranked extracted transcript keyword candidates; and selecting one or more of the extracted transcript keyword candidates based at least in part on the ranked extracted transcript keyword candidates to form a keyword set; and associating the keyword set with the media content.

8. A method as recited in claim 7, wherein the media content does not include any associated keywords prior to the transcript keyword candidates being extracted.

9. A method as recited in claim 7, wherein associating the keyword set with the media content comprises presenting the keyword set with the media content, the keyword set including keywords that represent one or more topics associated with the media content.

10. A method as recited in claim 7, wherein ranking the extracted transcript keyword candidates further comprises ranking the extracted transcript keyword candidates based on respective relevance of the extracted transcript keyword candidates with respect to the media content.

11. A method as recited in claim 7, further comprising pruning at least one of the selected one or more transcript keyword candidates from the keyword set based at least in part on whether the at least one of the selected one or more transcript keyword candidates is below a relatedness threshold.

12. A system comprising: one or more processors; one or more storage devices storing modules that are executable by the one or more processors, the modules including: a speech recognizer component that generates a transcript based on speech and non-speech included in media content; an extraction component that extracts transcript keyword candidates from the transcript once the transcript is generated from the media content by the speech recognizer component; a keyword collection component that: accumulates search results from web queries formed from the extracted transcript keyword candidates submitted to a search engine; forms a set of web documents from the search results; and extracts web document keyword candidates from the set of web documents; and a keyword selector component that: determines mutual information criteria for individual ones of the extracted transcript keyword candidates based at least in part on a number of documents from the set of web documents that contain a co-occurrence of an extracted transcript keyword candidate with a web document keyword candidate, and a total number of documents of the set of web documents; ranks the extracted transcript keyword candidates, based at least in part on the mutual information criteria, to form a ranking; and selects one or more of the extracted transcript keyword candidates based at least in part on the ranking to form a keyword set that is to be associated with the media content.

13. A system as recited in claim 12, wherein: at least the web queries include a phrase including multiple words from the extracted transcript keyword candidates; and the extracted transcript keyword candidates are identified based at least in part on meta information associated with the web documents or text in a body of the web documents.

14. A system as recited in claim 13, wherein the meta information associated with the web documents comprises manually generated keyword lists, and the manually generated keyword lists are used as a constraint for selecting the one or more of the extracted transcript keyword candidates.

Assignees

Inventors

Classifications

G06F16/951
Indexing; Web crawling techniques · CPC title
G06F16/7844 · Primary
using original textual content or text extracted from visual content or transcript of audio data · CPC title
G06F17/30796 · Primary
Physics · mapped topic
G06F17/30864
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 46753947

External sources
