Computerized system and method for formatted transcription of multimedia content

US11315546B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11315546-B2
Application number  US-201916449731-A
Country             US
Kind code           B2
Filing date         Jun 24, 2019
Priority date       Sep 2, 2015
Publication date    Apr 26, 2022
Grant date          Apr 26, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed are systems and methods for improving interactions with and between computers in content searching, generating, hosting and/or providing systems supported by or configured with personal computing devices, servers and/or platforms. The systems interact to identify and retrieve data within or across platforms, which can be used to improve the quality of data used in processing interactions between or among processors in such systems. The disclosed systems and methods provide systems and methods for automatic creation of a formatted, readable transcript of multimedia content, which is derived, extracted, determined, or otherwise identified from the multimedia content. The formatted, readable transcript can be utilized to increase accuracy and efficiency in search engine optimization, as well as identification of relevant digital content available for communication to a user.

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
    analyzing, via a computing device, a video file to identify audio data associated with the video file, said audio data comprising information associated with text corresponding to speech that is to be rendered contemporaneously with video data of the video file;
    determining, via the computing device, a phoneme-level transcription from the audio data by extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text, the phoneme-level transcription representing audible content and non-audible content from the audio data and a mapping of the audible content and non-audible content from within the audio data, the non-audible content corresponding to a region of no speech within the audio data;
    determining, via the computing device, a timestamp for the audible and non-audible content in the phoneme-level transcription that indicates a time that a word and a non-word appears in the phoneme-level transcription;
    determining, via the computing device, a time-aligned transcription of the audio data based on the phoneme-level transcription and associated timestamps, said time-aligned transcription determination comprising comparing occurrences of words and non-words in the phoneme-level transcription and their associated timestamps against an acoustic model that comprises information indicating a dictionary of terms and a timing scheme corresponding to a length of the video file and a beginning and end of the audio data of the video file, such that each word and non-word and their associated timestamps are mapped and stored in association with each other based on the information comprised within the acoustic model;
    automatically inserting, via the computing device, punctuation into the time-aligned transcription based on the text in the time-aligned transcription and the indicated mapping from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks;
    determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing said character set in the punctuated time-aligned transcription; and
    storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-aligned transcription.

2. The method of claim 1, wherein said inserting punctuation further comprises: parsing the time-aligned transcription and identifying a feature indicating a space between said text characters, said space associated with a natural language pause between words of said speech as indicated by said non-audible content and said mapping between the non-audible content and the audible content; and inserting a punctuation mark in said time-aligned transcription based on said identified feature.

3. The method of claim 2, further comprising: analyzing said feature, and based on said analysis, determining a dimensional value of the feature; and determining a type of said punctuation mark, wherein said inserted punctuation mark is based on said type.

4. The method of claim 1, wherein said capitalizing further comprises: applying a language model to said punctuated time-aligned transcription, wherein said determined character set is further based on the applied language model.

5. The method of claim 1, wherein said video file comprises video data and said audio data, wherein said audio data is extracted from said video file.

6. The method of claim 1, wherein said audio data is stored as an audio file in association with said video file in said database, wherein said method further comprises: identifying said audio file in said database based on information associated with said video file.

7. The method of claim 1, further comprising: determining a set of words from the text of the phoneme-level transcription; comparing each word from the set to the dictionary of terms; and confirming each word upon said comparison satisfying a similarity threshold.

8. The method of claim 1, further comprising: receiving a search request for a video file; and identifying, based on the search request, said video file.

9. The method of claim 8, further comprising: performing a search for said video file by analyzing modified time-aligned transcripts of video files in the database.

10. The method of claim 1, further comprising: receiving a request for the video file; determining a context of the video file based on the modified time-aligned transcript associated with the video file; causing communication, over the network, of said context to a third party content platform to obtain a digital content item associated with said context; and communicating said identified digital content item in association with said communication of said video file.

11. A non-transitory computer-readable storage medium tangibly encoded with computer-executable instructions, that when executed by a computing device, perform a method comprising:
    analyzing, via the computing device, a video file to identify audio data associated with the video file, said audio data comprising information associated with text corresponding to speech that is to be rendered contemporaneously with video data of the video file;
    determining, via the computing device, a phoneme-level transcription from the audio data by extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text, the phoneme-level transcription representing audible content and non-audible content from the audio data and a mapping of the audible content and non-audible content from within the audio data, the non-audible content corresponding to a region of no speech within the audio data;
    determining, via the computing device, a timestamp for the audible and non-audible content in the phoneme-level transcription that indicates a time that a word and a non-word appears in the phoneme-level transcription;
    determining, via the computing device, a time-aligned transcription of the audio data based on the phoneme-level transcription and associated timestamps, said time-aligned transcription determination comprising comparing occurrences of words and non-words in the phoneme-level transcription and their associated timestamps against an acoustic model that comprises information indicating a dictionary of terms and a timing scheme corresponding to a length of the video file and a beginning and end of the audio data of the video file, such that each word and non-word and their associated timestamps are mapped and stored in association with each other based on the information comprised within the acoustic model;
    automatically inserting, via the computing device, punctuation into the time-aligned transcription based on the text in the time-aligned transcription and the indicated mapping from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks;
    determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing said character set in the punctuated time-aligned transcription; and
    storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-
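
Illustrative sketch (not part of the patent text). Claims 1 to 3 describe inserting punctuation based on regions of no speech and then capitalizing based on that punctuation. The Python below is one possible reading of those steps, not the patented implementation: it treats the "dimensional value" of claim 3 as the pause duration, and the `Token` type, the function names, and the three pause thresholds are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds in seconds; the patent does not specify values.
COMMA_PAUSE = 0.3
PERIOD_PAUSE = 0.7
PARAGRAPH_PAUSE = 2.0


@dataclass
class Token:
    """One entry of a time-aligned transcription (assumed shape)."""
    text: Optional[str]  # None marks a non-word, i.e. a region of no speech
    start: float         # start time in seconds
    end: float           # end time in seconds


def pause_separator(duration: float) -> str:
    """Choose the text placed between two words from the pause length."""
    if duration >= PARAGRAPH_PAUSE:
        return ".\n\n"   # long pause: end the sentence and the paragraph
    if duration >= PERIOD_PAUSE:
        return ". "      # medium pause: end the sentence
    if duration >= COMMA_PAUSE:
        return ", "      # short pause: comma
    return " "           # negligible pause: plain space


def format_transcript(tokens: List[Token]) -> str:
    """Insert punctuation at pauses, then capitalize sentence starts."""
    pieces: List[str] = []
    sep = ""  # separator to emit before the next word
    for tok in tokens:
        if tok.text is None:
            if pieces:  # silence before the first word is ignored
                sep = pause_separator(tok.end - tok.start)
            continue
        starts_sentence = not pieces or sep.startswith(".")
        word = tok.text[:1].upper() + tok.text[1:] if starts_sentence else tok.text
        pieces.append(sep + word)
        sep = " "  # default when no silence token follows
    return "".join(pieces) + ("." if pieces else "")


if __name__ == "__main__":
    demo = [
        Token("hello", 0.0, 0.4), Token(None, 0.4, 0.5),
        Token("world", 0.5, 0.9), Token(None, 0.9, 1.8),
        Token("this", 1.8, 2.0), Token("is", 2.0, 2.1),
        Token(None, 2.1, 2.5), Token("fine", 2.5, 2.9),
    ]
    print(format_transcript(demo))
```

Claim 4 additionally applies a language model when choosing which characters to capitalize; this sketch omits that step and capitalizes only the first letter after sentence-ending punctuation.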

Assignees

Verizon Patent & Licensing Inc

Inventors

Classifications

  • G10L15/26 · Primary

    Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Phonemes, fenemes or fenones being the recognition units · CPC title

  • Electronic editing of digitised analogue information signals, e.g. audio or video signals · CPC title

  • G10L15/02 · Primary

    Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • Word boundary detection · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11315546B2 cover?
Disclosed are systems and methods for improving interactions with and between computers in content searching, generating, hosting and/or providing systems supported by or configured with personal computing devices, servers and/or platforms. The systems interact to identify and retrieve data within or across platforms, which can be used to improve the quality of data used in processing interacti…
Who is the assignee on this patent?
Verizon Patent & Licensing Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/26. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 26, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).