Multimedia data search using multi-modal feature embeddings

US12536217B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12536217-B2
Application number: US-202318401144-A
Country: US
Kind code: B2
Filing date: Dec 29, 2023
Priority date: Dec 29, 2023
Publication date: Jan 27, 2026
Grant date: Jan 27, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Aspects of the disclosed technology provide solutions for searching objects within multimedia content based on multi-modal embeddings. An example method can include receiving media content including a plurality of video frames. The method can include steps for generating, using a pre-output layer of a machine learning algorithm, one or more multimodal feature embeddings describing at least one object for the plurality of video frames, receiving a query including a request to search the media content for a matching object, determining whether the media content includes the matching object based on the one or more multimodal feature embeddings describing the at least one object, and returning one or more results in response to determining that the media content includes the matching object. Systems and machine-readable media are also provided.
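
The search flow summarized in the abstract can be illustrated with a minimal sketch. It assumes the feature embeddings have already been produced (for example, as activations taken from the layer just before a model's output layer) and are stored per video frame. The names `FrameEmbedding` and `search_media`, the use of cosine similarity, the 0.9 threshold, and the toy vectors are all illustrative assumptions; the patent text shown here does not specify a similarity measure or data layout.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameEmbedding:
    """Hypothetical record pairing a video frame index with its embedding."""
    frame_index: int
    vector: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_media(index: List[FrameEmbedding],
                 query_vector: List[float],
                 threshold: float = 0.9) -> List[Tuple[int, float]]:
    """Return (frame_index, score) pairs whose similarity meets the threshold,
    best match first. An empty list means no matching object was found."""
    hits = []
    for item in index:
        score = cosine_similarity(item.vector, query_vector)
        if score >= threshold:
            hits.append((item.frame_index, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Toy stand-ins for embeddings of objects seen in three frames (assumed values).
index = [
    FrameEmbedding(0, [1.0, 0.0, 0.0]),
    FrameEmbedding(1, [0.9, 0.1, 0.0]),
    FrameEmbedding(2, [0.0, 1.0, 0.0]),
]

# The query object is embedded in the same space, then compared to the index.
results = search_media(index, [1.0, 0.0, 0.0])
matching_frames = [frame for frame, _ in results]
```

Because the comparison runs directly on vectors, the sketch never consults a semantic label; this is one possible reading of the label-free search described in the claims, not a statement of how the patented system is implemented.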

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: one or more memories; and at least one processor coupled to at least one of the one or more memories and configured to perform operations comprising: generating, by a pre-output layer of a machine learning algorithm, one or more feature embeddings describing a first object depicted in at least one video frame of a media content comprising a plurality of video frames; receiving a query including a request to search the media content for a second object matching the first object; in response to the query, searching a set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object; based on a search result obtained from searching the set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object, determining whether at least a portion of the media content depicts the second object; and generating one or more query results in response to determining that at least a portion of the media content depicts the second object.

2. The system of claim 1, wherein the at least one processor is configured to perform operations comprising: determining whether to filter at least one query result from one or more query results based on a privacy filter, the privacy filter comprising at least one feature embedding generated based on a user input identifying one or more items to exclude from query results, the at least one feature embedding describing the one or more items identified in the user input; and filtering the at least one query result from the one or more query results based on the privacy filter comprising the at least one feature embedding.

3. The system of claim 1, wherein the at least one processor is configured to perform operations comprising: tagging portions of the media content with respective feature embeddings from the set of feature embeddings; and searching the portions of the media content for the second object based on tags generated by the tagging of the portions of the media content with the respective feature embeddings.

4. The system of claim 1, wherein the second object comprises a simulated object generated based on an object description.

5. The system of claim 1, wherein the media content comprises a live video feed or recording, and wherein determining whether at least a portion of the media content depicts the second object comprises determining whether at least a portion of the live video feed or recording depicts the second object.

6. The system of claim 1, wherein the request to search the media content for the second object comprises a request to search the media content for motion associated with the second object, and wherein the one or more feature embeddings encode information about the motion associated with the second object.

7. The system of claim 1, wherein the request to search the media content for the second object comprises a request to search an audio portion of the media content for sound associated with the second object, and wherein the one or more feature embeddings encode information about the sound associated with the second object.

8. The system of claim 1, wherein the media content comprises the set of feature embeddings.

9. The system of claim 1, wherein searching a set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object and determining whether at least a portion of the media content depicts the second object are performed without using or relying on semantic labels.

10. The system of claim 1, wherein the one or more feature embeddings comprises one or more multimodal feature embeddings generated based on two or more signals, the two or more signals comprising at least one of a visual signal, an audio signal, a text signal, and a motion signal, and wherein the one or more multimodal feature embeddings encode information about the first object from the two or more signals.

11. A computer-implemented method for processing media content, the computer-implemented method comprising: generating, by a pre-output layer of a machine learning algorithm, one or more feature embeddings describing a first object depicted in at least one video frame of a media content comprising a plurality of video frames; receiving a query including a request to search the media content for a second object matching the first object; in response to the query, searching a set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object; based on a search result obtained from searching the set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object, determining whether at least a portion of the media content depicts the second object; and generating one or more query results in response to determining that at least a portion of the media content depicts the second object.

12. The computer-implemented method of claim 11, further comprising: determining whether to filter at least one query result from one or more query results based on a privacy filter, the privacy filter comprising at least one feature embedding generated based on a user input identifying one or more items to exclude from query results, the at least one feature embedding describing the one or more items identified in the user input; and filtering the at least one query result from the one or more query results based on the privacy filter comprising the at least one feature embedding.

13. The computer-implemented method of claim 11, wherein searching a set of feature embeddings associated with the media content for the one or more feature embeddings describing the first object and determining whether at least a portion of the media content depicts the second object are performed without using or relying on semantic labels.

14. The computer-implemented method of claim 11, wherein the second object comprises a simulated object generated based on an object description.

15. The computer-implemented method of claim 11, wherein the media content comprises a live video feed or recording, and wherein determining whether at least a portion of the media content depicts the second object comprises determining whether at least a portion of the live video feed or recording depicts the second object.

16. The computer-implemented method of claim 11, wherein the request to search the media content for the second object comprises a request to search the media content for motion associated with the second object, and wherein the one or more feature embeddings encode information about the motion associated with the second object.

17. The computer-implemented method of claim 11, wherein the request to search the media content for the second object comprises a request to search an audio portion of the media content for sound associated with the second object, and wherein the one or more feature embeddings encode information about the sound associated with the second object.

18. The computer-implemented method of claim 11, wherein the media content comprises the set of feature embeddings.

19. The computer-implemented method of claim 11, wherein the one or more feature embeddings comprises one or more multimodal feature embeddings generated based on two or more signals, the two or more signals comprising at least one of a visual signal,
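
The privacy filter recited in claims 2 and 12 can be sketched in the same spirit. The example assumes each query result carries the embedding that produced it, and that the user's exclusion input has already been turned into one or more embeddings. The function name `apply_privacy_filter`, the cosine-similarity comparison, and the 0.9 threshold are hypothetical choices made for illustration; the claims do not state how a result is compared against the filter.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def apply_privacy_filter(results: List[Tuple[int, List[float]]],
                         excluded_vectors: List[List[float]],
                         threshold: float = 0.9) -> List[Tuple[int, List[float]]]:
    """Drop every (frame_index, vector) result that is close to an excluded item."""
    kept = []
    for frame_index, vector in results:
        blocked = any(
            cosine_similarity(vector, excluded) >= threshold
            for excluded in excluded_vectors
        )
        if not blocked:
            kept.append((frame_index, vector))
    return kept


# Assumed query results: (frame index, embedding of the object found there).
results = [(3, [1.0, 0.0]), (7, [0.0, 1.0]), (9, [0.7, 0.7])]

# Assumed embedding of an item the user asked to exclude from results.
excluded = [[0.0, 1.0]]

visible_frames = [frame for frame, _ in apply_privacy_filter(results, excluded)]
```

In this toy data only frame 7 matches the excluded item closely enough to be removed; frame 9 overlaps with it only partially and stays in the results.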

Assignees

Roku Inc

Inventors

Classifications

  • G06F16/438 · Primary

    Presentation of query results · CPC title

  • Filtering based on additional data, e.g. user or group profiles · CPC title

  • G06F16/432 · Primary

    Query formulation · CPC title

  • using objects detected or recognised in the video content · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12536217B2 cover?
Aspects of the disclosed technology provide solutions for searching objects within multimedia content based on multi-modal embeddings. An example method can include receiving media content including a plurality of video frames. The method can include steps for generating, using a pre-output layer of a machine learning algorithm, one or more multimodal feature embeddings describing at least one …
Who is the assignee on this patent?
Roku Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/438. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 27, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).