Translating texts for videos based on video context

US12299408B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12299408-B2
Application number: US-202218049185-A
Country: US
Kind code: B2
Filing date: Oct 24, 2022
Priority date: Nov 8, 2019
Publication date: May 13, 2025
Grant date: May 13, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure describes systems, non-transitory computer-readable media, and methods that can generate contextual identifiers indicating context for frames of a video and utilize those contextual identifiers to generate translations of text corresponding to such video frames. By analyzing a digital video file, the disclosed systems can identify video frames corresponding to a scene and a term sequence corresponding to a subset of the video frames. Based on image features of the video frames corresponding to the scene, the disclosed systems can utilize a contextual neural network to generate a contextual identifier (e.g., a contextual tag) indicating context for the video frames. Based on the contextual identifier, the disclosed systems can subsequently apply a translation neural network to generate a translation of the term sequence from a source language to a target language. In some cases, the translation neural network also generates affinity scores for the translation.
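To make the role of a contextual identifier concrete, here is a minimal Python sketch. It is not the patented method: the abstract describes a translation neural network, whereas this sketch swaps in a plain lookup table so the effect of context on word choice is easy to see. The table contents, the tag names, and the function name `translate_with_context` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable

# Hypothetical toy data: English term -> {contextual tag -> Spanish translation}.
# A real system would learn this mapping; a lookup table is used here only
# to show how a scene-level tag can disambiguate a term.
SENSES: Dict[str, Dict[str, str]] = {
    "bank": {"river": "orilla", "finance": "banco"},
}

# Hypothetical fallback used when no contextual tag matches a known sense.
DEFAULTS: Dict[str, str] = {"bank": "banco"}


def translate_with_context(term: str, contextual_tags: Iterable[str]) -> str:
    """Return the translation whose sense matches the first recognised tag."""
    senses = SENSES.get(term, {})
    for tag in contextual_tags:
        if tag in senses:
            return senses[tag]
    # No tag matched: fall back to the default, or leave the term untranslated.
    return DEFAULTS.get(term, term)


if __name__ == "__main__":
    print(translate_with_context("bank", ["river", "boat"]))   # orilla
    print(translate_with_context("bank", ["finance"]))         # banco
```

The same source term yields different target words depending on the tags attached to the scene, which is the behaviour the abstract attributes to translation conditioned on contextual identifiers.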

First claim

Opening claim text (preview).

We claim:

1. A method comprising: identifying, from a digital video file, a subset of frames corresponding to a scene in the digital video file, and a term sequence from the subset of frames; determining a reference frame that represents the scene from the subset of frames based on a feature-matching score for the reference frame relative to the subset of frames; identifying, from the digital video file, an adjacent subset of frames corresponding to a different scene in the digital video file; determining an additional reference frame that represents the different scene from the adjacent subset of frames based on an additional feature-matching score for the additional reference frame relative to the adjacent subset of frames; generating a first contextual identifier for the subset of frames utilizing a contextual neural network to analyze image features of the reference frame; generating a second contextual identifier for the adjacent subset of frames utilizing the contextual neural network to analyze image features of the additional reference frame; and generating, utilizing a translation neural network, a contextual translation of the term sequence and corresponding affinity scores by: encoding the first contextual identifier, the second contextual identifier, and the term sequence into an encoded vector; and decoding the encoded vector into the contextual translation, a first affinity score reflecting a degree to which the first contextual identifier and the contextual translation are connected, and a second affinity score reflecting a degree to which the second contextual identifier and the contextual translation are connected.

2. The method of claim 1, further comprising generating an attention vector from the encoded vector utilizing an attention neural network.

3. The method of claim 2, wherein decoding the encoded vector into the contextual translation comprises decoding the attention vector.

4. The method of claim 1, wherein generating the first contextual identifier utilizing the contextual neural network comprises: generating a frame vector based on a frame from the subset of frames utilizing convolutional layers of the contextual neural network; and generating the first contextual identifier based on the frame vector utilizing long-short-term-memory layers of the contextual neural network.

5. The method of claim 1, further comprising: generating an affinity array comprising the first affinity score and the second affinity score.

6. The method of claim 1, based on the second affinity score satisfying a threshold, utilizing the contextual translation for the term sequence.

7. The method of claim 1, based on the second affinity score not satisfying a threshold, generating an updated contextual translation for the term sequence.

8. The method of claim 1, wherein identifying, from the digital video file, the subset of frames comprises identifying frames corresponding to the scene in the digital video file, and identifying, from the digital video file, the adjacent subset of frames comprises identifying frames corresponding to the different scene in the digital video file, based on: metadata within the digital video file or within video-data packets, or similarity of image features between or among contiguous frames in the digital video file.

9. A non-transitory computer-readable medium storing instructions thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising: identifying, from a digital video file, a subset of frames corresponding to a scene in the digital video file, and a term sequence from the subset of frames; determining a reference frame that represents the scene from the subset of frames based on a feature-matching score for the reference frame relative to the subset of frames; identifying, from the digital video file, an adjacent subset of frames corresponding to a different scene in the digital video file; determining an additional reference frame that represents the different scene from the adjacent subset of frames based on an additional feature-matching score for the additional reference frame relative to the adjacent subset of frames; generating a first contextual identifier for the subset of frames utilizing a contextual neural network to analyze image features of the reference frame; generating a second contextual identifier for the adjacent subset of frames utilizing the contextual neural network to analyze image features of the additional reference frame; and generating, utilizing a translation neural network, a contextual translation of the term sequence and corresponding affinity scores by: encoding the first contextual identifier, the second contextual identifier, and the term sequence into an encoded vector, and decoding the encoded vector into the contextual translation a first affinity score reflecting a degree to which the first contextual identifier and the contextual translation are connected, and a second affinity score reflecting a degree to which the second contextual identifier and the contextual translation are connected.

10. The non-transitory computer-readable medium of claim 9, wherein the operations further comprise generating an attention vector from the encoded vector utilizing an attention neural network.

11. The non-transitory computer-readable medium of claim 10, wherein decoding the encoded vector into the contextual translation comprises decoding the attention vector.

12. The non-transitory computer-readable medium of claim 9, wherein generating the first contextual identifier utilizing the contextual neural network comprises: generating a frame vector based on a frame from the subset of frames utilizing convolutional layers of the contextual neural network; and generating the first contextual identifier based on the frame vector utilizing long-short-term-memory layers of the contextual neural network.

13. The non-transitory computer-readable medium of claim 9, wherein the operations further comprise: generating an affinity array comprising the first affinity score and the second affinity score.

14. The non-transitory computer-readable medium of claim 9, based on the second affinity score satisfying a threshold, utilizing the contextual translation for the term sequence.

15. The non-transitory computer-readable medium of claim 9, based on the second affinity score not satisfying a threshold, generating an updated contextual translation for the term sequence.

16. The non-transitory computer-readable medium of claim 9, wherein identifying, from the digital video file, the subset of frames comprises identifying frames corresponding to the scene in the digital video file, and identifying, from the digital video file, the adjacent subset of frames comprises identifying frames corresponding to the different scene in the digital video file, based on: metadata within the digital video file or within video-data packets, or similarity of image features between or among contiguous frames in the digital video file.

17. A system comprising: one or more memory devices; and one or more processors coupled to the one or more memory devices that cause the system to perform operations comprising: determining a reference frame that represents a scene from a subset of frames of a digital video, based on a feature-matching score for the reference frame relative to the subset of frames; determining an additional reference frame that represent a different scene from an adjacent subset of frames of the digital video, based on an add
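Three steps recited in the claims above can be illustrated with a short Python sketch: grouping contiguous frames into scenes by similarity of image features (claim 8), choosing a reference frame by a feature-matching score (claim 1), and checking the second affinity score against a threshold (claims 6 and 7). The claim text shown here does not define how these scores are computed, so everything specific below is an assumption: frames are represented by small hand-written feature vectors, similarity is cosine similarity, the feature-matching score is the mean similarity of a frame to the other frames in its scene, and "satisfying a threshold" is read as greater than or equal to it. All function names and default values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_scenes(features: List[Vector], cut_below: float = 0.8) -> List[List[int]]:
    """Group contiguous frame indices into scenes.

    Assumption: a new scene starts whenever the similarity between a frame
    and the previous frame drops below `cut_below`.
    """
    scenes: List[List[int]] = []
    for i in range(len(features)):
        if i == 0 or cosine(features[i - 1], features[i]) < cut_below:
            scenes.append([i])
        else:
            scenes[-1].append(i)
    return scenes


def reference_frame(features: List[Vector], scene: List[int]) -> int:
    """Pick the frame with the highest mean similarity to the rest of its scene.

    Assumption: this mean similarity stands in for the feature-matching score.
    """
    def score(i: int) -> float:
        others = [j for j in scene if j != i]
        if not others:
            return 1.0  # a single-frame scene is trivially represented by itself
        return sum(cosine(features[i], features[j]) for j in others) / len(others)

    return max(scene, key=score)


def translation_accepted(affinity_array: Sequence[float], threshold: float = 0.5) -> bool:
    """Return True if the second affinity score meets the threshold.

    Assumption: the array is ordered [first affinity score, second affinity score].
    A False result corresponds to generating an updated translation.
    """
    return affinity_array[1] >= threshold


if __name__ == "__main__":
    # Hypothetical 2-D features: three similar frames, then two different ones.
    frames = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.4], [0.0, 1.0], [0.1, 1.0]]
    scenes = split_scenes(frames)
    print(scenes)                               # [[0, 1, 2], [3, 4]]
    print(reference_frame(frames, scenes[0]))   # 1
    print(translation_accepted([0.9, 0.7]))     # True
```

In the first scene the middle frame is selected because it is closest on average to its neighbours, which matches the intuition of a frame that "represents the scene". A production system would presumably use learned image features rather than hand-written vectors.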

Assignees

Adobe Inc

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet]

  • Transfer learning

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Supervised learning

  • Auto-encoder networks; Encoder-decoder networks

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12299408B2 cover?
The present disclosure describes systems, non-transitory computer-readable media, and methods that can generate contextual identifiers indicating context for frames of a video and utilize those contextual identifiers to generate translations of text corresponding to such video frames. By analyzing a digital video file, the disclosed systems can identify video frames corresponding to a scene and…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/58. Mapped technology areas include Physics.
When was this patent published?
Publication date May 13, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).