What technology area does this patent fall under?

Primary CPC classification G06F16/438. Mapped technology areas include Physics.

When was this patent published?

Publication date Jan 27, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Multimedia data search using multi-modal feature embeddings

US12536217B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12536217-B2
Application number	US-202318401144-A
Country	US
Kind code	B2
Filing date	Dec 29, 2023
Priority date	Dec 29, 2023
Publication date	Jan 27, 2026
Grant date	Jan 27, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Aspects of the disclosed technology provide solutions for searching objects within multimedia content based on multi-modal embeddings. An example method can include receiving media content including a plurality of video frames. The method can include steps for generating, using a pre-output layer of a machine learning algorithm, one or more multimodal feature embeddings describing at least one object for the plurality of video frames, receiving a query including a request to search the media content for a matching object, determining whether the media content includes the matching object based on the one or more multimodal feature embeddings describing the at least one object, and returning one or more results in response to determining that the media content includes the matching object. Systems and machine-readable media are also provided.
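The matching step in the abstract can be pictured as a nearest-neighbour lookup over per-frame embeddings. The following is a minimal sketch under stated assumptions, not the patented implementation: the abstract does not name a similarity measure, so cosine similarity, the `threshold` value, the function names, and the toy three-dimensional vectors are all illustrative choices. In practice the embeddings would come from a model's pre-output layer rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search_frames(frame_embeddings, query_embedding, threshold=0.9):
    """Return (frame_index, score) pairs at or above the threshold, best first.

    The threshold is a hypothetical tuning parameter, not a value from the patent.
    """
    hits = []
    for index, embedding in enumerate(frame_embeddings):
        score = cosine_similarity(embedding, query_embedding)
        if score >= threshold:
            hits.append((index, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Toy embeddings, one per video frame (made-up values for illustration).
frame_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
query_embedding = [1.0, 0.0, 0.0]

results = search_frames(frame_embeddings, query_embedding)
# "Determining whether the media content includes the matching object"
# reduces here to checking whether any frame cleared the threshold.
media_contains_match = len(results) > 0
```

Because the comparison runs on vectors alone, no text label for the object is needed at query time, which is in line with the claims that describe searching without relying on semantic labels.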

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising: one or more memories; and at least one processor coupled to at least one of the one or more memories and configured to perform operations comprising: generating, by a pre-output layer of a machine learning algorithm, one or more feature embeddings describing a first object depicted in at least one video frame of a media content comprising a plurality of video frames; receiving a query including a request to search the media content for a second object matching the first object; in response to the query, searching a set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object; based on a search result obtained from searching the set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object, determining whether at least a portion of the media content depicts the second object; and generating one or more query results in response to determining that at least a portion of the media content depicts the second object.
2. The system of claim 1, wherein the at least one processor is configured to perform operations comprising: determining whether to filter at least one query result from one or more query results based on a privacy filter, the privacy filter comprising at least one feature embedding generated based on a user input identifying one or more items to exclude from query results, the at least one feature embedding describing the one or more items identified in the user input; and filtering the at least one query result from the one or more query results based on the privacy filter comprising the at least one feature embedding.
3. The system of claim 1, wherein the at least one processor is configured to perform operations comprising: tagging portions of the media content with respective feature embeddings from the set of feature embeddings; and searching the portions of the media content for the second object based on tags generated by the tagging of the portions of the media content with the respective feature embeddings.
4. The system of claim 1, wherein the second object comprises a simulated object generated based on an object description.
5. The system of claim 1, wherein the media content comprises a live video feed or recording, and wherein determining whether at least a portion of the media content depicts the second object comprises determining whether at least a portion of the live video feed or recording depicts the second object.
6. The system of claim 1, wherein the request to search the media content for the second object comprises a request to search the media content for motion associated with the second object, and wherein the one or more feature embeddings encode information about the motion associated with the second object.
7. The system of claim 1, wherein the request to search the media content for the second object comprises a request to search an audio portion of the media content for sound associated with the second object, and wherein the one or more feature embeddings encode information about the sound associated with the second object.
8. The system of claim 1, wherein the media content comprises the set of feature embeddings.
9. The system of claim 1, wherein searching a set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object and determining whether at least a portion of the media content depicts the second object are performed without using or relying on semantic labels.
10. The system of claim 1, wherein the one or more feature embeddings comprises one or more multimodal feature embeddings generated based on two or more signals, the two or more signals comprising at least one of a visual signal, an audio signal, a text signal, and a motion signal, and wherein the one or more multimodal feature embeddings encode information about the first object from the two or more signals.
11. A computer-implemented method for processing media content, the computer-implemented method comprising: generating, by a pre-output layer of a machine learning algorithm, one or more feature embeddings describing a first object depicted in at least one video frame of a media content comprising a plurality of video frames; receiving a query including a request to search the media content for a second object matching the first object; in response to the query, searching a set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object; based on a search result obtained from searching the set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object, determining whether at least a portion of the media content depicts the second object; and generating one or more query results in response to determining that at least a portion of the media content depicts the second object.
12. The computer-implemented method of claim 11, further comprising: determining whether to filter at least one query result from one or more query results based on a privacy filter, the privacy filter comprising at least one feature embedding generated based on a user input identifying one or more items to exclude from query results, the at least one feature embedding describing the one or more items identified in the user input; and filtering the at least one query result from the one or more query results based on the privacy filter comprising the at least one feature embedding.
13. The computer-implemented method of claim 11, wherein searching a set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object and determining whether at least a portion of the media content depicts the second object are performed without using or relying on semantic labels.
14. The computer-implemented method of claim 11, wherein the second object comprises a simulated object generated based on an object description.
15. The computer-implemented method of claim 11, wherein the media content comprises a live video feed or recording, and wherein determining whether at least a portion of the media content depicts the second object comprises determining whether at least a portion of the live video feed or recording depicts the second object.
16. The computer-implemented method of claim 11, wherein the request to search the media content for the second object comprises a request to search the media content for motion associated with the second object, and wherein the one or more feature embeddings encode information about the motion associated with the second object.
17. The computer-implemented method of claim 11, wherein the request to search the media content for the second object comprises a request to search an audio portion of the media content for sound associated with the second object, and wherein the one or more feature embeddings encode information about the sound associated with the second object.
18. The computer-implemented method of claim 11, wherein the media content comprises the set of feature embeddings.
19. The computer-implemented method of claim 11, wherein the one or more feature embeddings comprises one or more multimodal feature embeddings generated based on two or more signals, the two or more signals comprising at least one of a visual signal,
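Claims 2 and 12 describe a privacy filter that is itself made of feature embeddings for items the user wants excluded. One way to picture that, as a rough sketch only: compare each query result's embedding against the excluded-item embeddings and drop results that are too close. The claims do not specify the comparison, so cosine similarity, the `threshold` value, the `apply_privacy_filter` name, and the result dictionaries below are assumptions for illustration. How the user input is turned into an embedding is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def apply_privacy_filter(query_results, excluded_embeddings, threshold=0.9):
    """Keep only results that are not close to any excluded-item embedding.

    `threshold` is a hypothetical cut-off, not a value taken from the claims.
    """
    kept = []
    for result in query_results:
        blocked = any(
            cosine_similarity(result["embedding"], excluded) >= threshold
            for excluded in excluded_embeddings
        )
        if not blocked:
            kept.append(result)
    return kept


# Made-up query results, each carrying the embedding that produced the hit.
query_results = [
    {"frame": 12, "embedding": [1.0, 0.0]},
    {"frame": 40, "embedding": [0.0, 1.0]},
]
# Embedding standing in for an item the user asked to exclude.
excluded_embeddings = [[0.0, 2.0]]

filtered = apply_privacy_filter(query_results, excluded_embeddings)
```

In this toy run the frame-40 result points in the same direction as the excluded embedding, so it is removed and only the frame-12 result remains.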

Assignees

Roku Inc

Inventors

Classifications

G06F16/438 (Primary)
Presentation of query results · CPC title
G06F16/435
Filtering based on additional data, e.g. user or group profiles · CPC title
G06F16/432 (Primary)
Query formulation · CPC title
G06F16/7837
using objects detected or recognised in the video content · CPC title
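The "Physics" technology area shown for this patent follows from the structure of a CPC symbol: the leading letter is the section, and section G is Physics. A small sketch of how such a symbol splits into its parts is given below. The `parse_cpc` helper, the regular expression, and the dictionary keys are illustrative assumptions rather than anything this page or an official tool provides, and only the section title needed here is included.

```python
import re

# Only section G is listed because it is the one used on this page;
# other CPC sections are deliberately omitted from this sketch.
CPC_SECTION_TITLES = {"G": "Physics"}

# Section letter, two-digit class, subclass letter, main group, "/", subgroup.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol):
    """Split a CPC symbol such as 'G06F16/438' into its hierarchical parts."""
    match = CPC_PATTERN.match(symbol)
    if match is None:
        raise ValueError(f"not a recognised CPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, main_group, subgroup = match.groups()
    return {
        "section": section,
        "section_title": CPC_SECTION_TITLES.get(section, "unknown"),
        "class": section + class_digits,
        "subclass": section + class_digits + subclass_letter,
        "main_group": main_group,
        "subgroup": subgroup,
    }


primary = parse_cpc("G06F16/438")
```

All four classifications listed above share the subclass G06F and main group 16, differing only in the subgroup after the slash.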

Patent family

Related publications grouped by family.

View patent family 96173860

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12536217B2 cover?: Aspects of the disclosed technology provide solutions for searching objects within multimedia content based on multi-modal embeddings. An example method can include receiving media content including a plurality of video frames. The method can include steps for generating, using a pre-output layer of a machine learning algorithm, one or more multimodal feature embeddings describing at least one …
Who is the assignee on this patent?: Roku Inc

Related patents

Assignment of Unique Identifications to People in Multi-Camera Field of View

Method and apparatus for video searches and index construction

Systems and methods for providing content based on consumption in a distinct domain
