Method and apparatus for sound object following

US11277702B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11277702-B2
Application number: US-202016812183-A
Country: US
Kind code: B2
Filing date: Mar 6, 2020
Priority date: Mar 8, 2019
Publication date: Mar 15, 2022
Grant date: Mar 15, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to a method and apparatus for processing a multimedia signal. More specifically, the present disclosure relates to a method comprising obtaining at least one video object from the multimedia signal and at least one audio object from the multimedia signal, extracting video feature information for the at least one video object and audio feature information for the at least one audio object, and determining a correlation between the at least one video object and the at least one audio object through an object matching engine based on the video feature information and the audio feature information, and an apparatus therefor.
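The correlation step in the abstract can be illustrated with a minimal sketch. The patent describes an object matching engine (a trained model, per claim 13) and does not disclose its internals here, so the sketch swaps in plain cosine similarity as a stand-in. The function names, the `threshold` default, and the premise that audio and video features are vectors of equal length in a shared space are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_audio_to_video(audio_feature, video_features, threshold=0.5):
    """Pick the video object that best matches one audio object.

    Hypothetical stand-in for the matching engine: scores every video
    object against the audio object and keeps the highest-scoring one,
    but only if its score exceeds the (assumed) threshold.
    Returns (index, score); index is None when nothing passes.
    """
    if not video_features:
        return None, 0.0
    scores = [cosine_similarity(audio_feature, v) for v in video_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] > threshold:
        return best, scores[best]
    return None, scores[best]


# Usage: two candidate video objects, one audio object.
index, score = match_audio_to_video([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]])
```

Returning the raw score alongside the index keeps the "greater than a predetermined value" test and the "greatest value" test visible as two separate steps.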

First claim

Opening claim text (preview).

What is claimed is:

1. A method of processing a multimedia signal by an apparatus, the method comprising: obtaining at least one video object from the multimedia signal and at least one audio object from the multimedia signal, wherein the at least one video object represents a face; extracting video feature information for the at least one video object and audio feature information for the at least one audio object; determining a correlation between the at least one video object and the at least one audio object through an object matching engine based on the video feature information and the audio feature information; and based on the correlation, controlling output of left and right audio signals for sound image localization at a screen location of the at least one video object with respect to a listener, wherein the video feature information is extracted based on: a ratio between a distance between a top boundary of a first rectangular region including the face and a top boundary of a second rectangular region including lips within the first rectangular region, a height of the second rectangular region, and a distance between a bottom boundary of the second rectangular region and a bottom boundary of the first rectangular region; and a ratio between a distance between a left boundary of the second rectangular region and a left boundary of the first rectangular region, a width of the second rectangular region, and a distance between a right boundary of the second rectangular region and a right boundary of the first rectangular region.

2. The method of claim 1, wherein determining the correlation between the at least one video object and the at least one audio object comprises: obtaining information about a relationship between each of the at least one video object and a specific audio object of the at least one audio object through the object matching engine based on the video feature information and the audio feature information; and determining, from among the at least one video object, a specific video object related to the specific audio object based on the information about the relationship between each of the at least one video object and the specific audio object.

3. The method of claim 2, wherein determining the specific video object related to the specific audio object comprises: based on a value of the information about the relationship being greater than a predetermined value, determining a video object related to the value of the information about the relationship as the specific video object.

4. The method of claim 3, wherein determining the specific video object related to the specific audio object further comprises: based on the least one video object comprising a plurality of video objects, determining, from among the plurality of video objects, a video object related to a greatest value of the information about the relationship as the specific video object.

5. The method of claim 2, wherein determining the specific video object related to the specific audio object comprises: based on a value of the information about the relationship being less than a predetermined value, determining a video object related to the value of the information about the relationship as the specific video object.

6. The method of claim 5, wherein determining the specific video object related to the specific audio object further comprises: based on the least one video object comprising a plurality of video objects, determining, from the plurality of video objects, a video object related to a smallest value of the information about the relationship as the specific video object.

7. The method of claim 2, wherein the information about the relationship has a real number value.

8. The method of claim 1, wherein the video feature information is extracted based on a vertical length and a horizontal length of a lip skeleton.

9. The method of claim 1, wherein the audio feature information is extracted based on linear prediction coding (LPC).

10. The method of claim 1, wherein the audio feature information is extracted based on a log-Mel filters-of-bank.

11. The method of claim 1, wherein the audio feature information is extracted based on Mel-frequency cepstral coefficients (MFCC).

12. The method of claim 1, wherein the audio feature information comprises onset information about the at least one audio object.

13. The method of claim 1, wherein the object matching engine comprises a model trained based on learning.

14. An apparatus configured to process a multimedia signal, the apparatus comprising: a memory storing instructions; and at least one processor operatively coupled to the memory and configured to, when executing the instructions, implement operations comprising: obtaining at least one video object from the multimedia signal and at least one audio object from the multimedia signal, wherein the at least one video object represents a face; extracting video feature information for the at least one video object and audio feature information for the at least one audio object; determining a correlation between the at least one video object and the at least one audio object through an object matching engine based on the video feature information and the audio feature information; and based on the correlation, controlling output of left and right audio signals for sound image localization at a screen location of the at least one video object with respect to a listener, wherein the video feature information is extracted based on: a ratio between a distance between a top boundary of a first rectangular region including the face and a top boundary of a second rectangular region including lips within the first rectangular region, a height of the second rectangular region, and a distance between a bottom boundary of the second rectangular region and a bottom boundary of the first rectangular region; and a ratio between a distance between a left boundary of the second rectangular region and a left boundary of the first rectangular region, a width of the second rectangular region, and a distance between a right boundary of the second rectangular region and a right boundary of the first rectangular region.

15. The apparatus of claim 14, wherein determining the correlation between the at least one video object and the at least one audio object comprises: obtaining information about a relationship between each of the at least one video object and a specific audio object of the at least one audio object through the object matching engine based on the video feature information and the audio feature information; and determining, from among the at least one video object, a specific video object related to the specific audio object based on the information about the relationship between each of the at least one video object and the specific audio object.

16. The apparatus of claim 15, wherein determining the specific video object related to the specific audio object comprises: based on a value of the information about the relationship being greater than a predetermined value, determining a video object related to the value of the information about the relationship as the specific video object.

17. The apparatus of claim 16, wherein determining the specific video object related to the specific audio object further comprises: based on the least one video object comprising a plurality of video objects, determining, from among the plurality of video objects, a video object related to a greatest value of the information about the relationship as the specific video obj
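The two ratios recited in claims 1 and 14 can be sketched in a few lines. The claims only say the video feature is "extracted based on" these ratios, so the `Rect` type, the function name, the image-style coordinates (y grows downward), and the choice to normalize each three-part ratio by the face rectangle's height or width are assumptions made for illustration, not the patent's stated formula.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in image coordinates (y increases downward)."""
    left: float
    top: float
    right: float
    bottom: float


Triple = Tuple[float, float, float]


def lip_region_ratios(face: Rect, lips: Rect) -> Tuple[Triple, Triple]:
    """Return (vertical, horizontal) three-part ratios of lips within face.

    vertical   = (gap above lips, lip height, gap below lips) / face height
    horizontal = (gap left of lips, lip width, gap right of lips) / face width
    Each triple sums to 1 because of the (assumed) normalization.
    """
    face_h = face.bottom - face.top
    face_w = face.right - face.left
    if face_h <= 0 or face_w <= 0:
        raise ValueError("face rectangle must have positive size")
    inside = (face.left <= lips.left <= lips.right <= face.right
              and face.top <= lips.top <= lips.bottom <= face.bottom)
    if not inside:
        raise ValueError("lip rectangle must lie within the face rectangle")

    vertical = ((lips.top - face.top) / face_h,
                (lips.bottom - lips.top) / face_h,
                (face.bottom - lips.bottom) / face_h)
    horizontal = ((lips.left - face.left) / face_w,
                  (lips.right - lips.left) / face_w,
                  (face.right - lips.right) / face_w)
    return vertical, horizontal


# Usage: a 100x200 face box with a 40x20 lip box in its lower half.
vertical, horizontal = lip_region_ratios(Rect(0, 0, 100, 200),
                                         Rect(30, 140, 70, 160))
```

Tracking how the middle term of each triple changes from frame to frame gives a rough signal of mouth opening that is independent of how large the face appears on screen.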

Assignees

LG Electronics Inc

Inventors

Classifications

  • H04S7/30 (Primary)

    Control circuits for electronic adaptation of the sound field · CPC title

  • of extracted features · CPC title

  • G06V40/171 (Primary)

    Local features and components; Facial parts (eye characteristics G06V40/18); Occluding parts, e.g. glasses; Geometrical relationships · CPC title

  • Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16) · CPC title

  • Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11277702B2 cover?
The present disclosure relates to a method and apparatus for processing a multimedia signal. More specifically, the present disclosure relates to a method comprising obtaining at least one video object from the multimedia signal and at least one audio object from the multimedia signal, extracting video feature information for the at least one video object and audio feature information for the a…
Who is the assignee on this patent?
LG Electronics Inc
What technology area does this patent fall under?
Primary CPC classification H04S7/30. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Mar 15, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).