Server side hotwording
US-2024412734-A1 · Dec 12, 2024 · US
US11315546B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11315546-B2 |
| Application number | US-201916449731-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 24, 2019 |
| Priority date | Sep 2, 2015 |
| Publication date | Apr 26, 2022 |
| Grant date | Apr 26, 2022 |
Official abstract text for this publication.
Disclosed are systems and methods for improving interactions with and between computers in content searching, generating, hosting and/or providing systems supported by or configured with personal computing devices, servers and/or platforms. The systems interact to identify and retrieve data within or across platforms, which can be used to improve the quality of data used in processing interactions between or among processors in such systems. The disclosed systems and methods provide systems and methods for automatic creation of a formatted, readable transcript of multimedia content, which is derived, extracted, determined, or otherwise identified from the multimedia content. The formatted, readable transcript can be utilized to increase accuracy and efficiency in search engine optimization, as well as identification of relevant digital content available for communication to a user.
Opening claim text (preview).
What is claimed is:

1. A method comprising:
analyzing, via a computing device, a video file to identify audio data associated with the video file, said audio data comprising information associated with text corresponding to speech that is to be rendered contemporaneously with video data of the video file;
determining, via the computing device, a phoneme-level transcription from the audio data by extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text, the phoneme-level transcription representing audible content and non-audible content from the audio data and a mapping of the audible content and non-audible content from within the audio data, the non-audible content corresponding to a region of no speech within the audio data;
determining, via the computing device, a timestamp for the audible and non-audible content in the phoneme-level transcription that indicates a time that a word and a non-word appears in the phoneme-level transcription;
determining, via the computing device, a time-aligned transcription of the audio data based on the phoneme-level transcription and associated timestamps, said time-aligned transcription determination comprising comparing occurrences of words and non-words in the phoneme-level transcription and their associated timestamps against an acoustic model that comprises information indicating a dictionary of terms and a timing scheme corresponding to a length of the video file and a beginning and end of the audio data of the video file, such that each word and non-word and their associated timestamps are mapped and stored in association with each other based on the information comprised within the acoustic model;
automatically inserting, via the computing device, punctuation into the time-aligned transcription based on the text in the time-aligned transcription and the indicated mapping from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks;
determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing said character set in the punctuated time-aligned transcription; and
storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-aligned transcription.

2. The method of claim 1, wherein said inserting punctuation further comprises: parsing the time-aligned transcription and identifying a feature indicating a space between said text characters, said space associated with a natural language pause between words of said speech as indicated by said non-audible content and said mapping between the non-audible content and the audible content; and inserting a punctuation mark in said time-aligned transcription based on said identified feature.

3. The method of claim 2, further comprising: analyzing said feature, and based on said analysis, determining a dimensional value of the feature; and determining a type of said punctuation mark, wherein said inserted punctuation mark is based on said type.

4. The method of claim 1, wherein said capitalizing further comprises: applying a language model to said punctuated time-aligned transcription, wherein said determined character set is further based on the applied language model.

5. The method of claim 1, wherein said video file comprises video data and said audio data, wherein said audio data is extracted from said video file.

6. The method of claim 1, wherein said audio data is stored as an audio file in association with said video file in said database, wherein said method further comprises: identifying said audio file in said database based on information associated with said video file.

7. The method of claim 1, further comprising: determining a set of words from the text of the phoneme-level transcription; comparing each word from the set to the dictionary of terms; and confirming each word upon said comparison satisfying a similarity threshold.

8. The method of claim 1, further comprising: receiving a search request for a video file; and identifying, based on the search request, said video file.

9. The method of claim 8, further comprising: performing a search for said video file by analyzing modified time-aligned transcripts of video files in the database.

10. The method of claim 1, further comprising: receiving a request for the video file; determining a context of the video file based on the modified time-aligned transcript associated with the video file; causing communication, over the network, of said context to a third party content platform to obtain a digital content item associated with said context; and communicating said identified digital content item in association with said communication of said video file.

11. A non-transitory computer-readable storage medium tangibly encoded with computer-executable instructions, that when executed by a computing device, perform a method comprising:
analyzing, via the computing device, a video file to identify audio data associated with the video file, said audio data comprising information associated with text corresponding to speech that is to be rendered contemporaneously with video data of the video file;
determining, via the computing device, a phoneme-level transcription from the audio data by extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text, the phoneme-level transcription representing audible content and non-audible content from the audio data and a mapping of the audible content and non-audible content from within the audio data, the non-audible content corresponding to a region of no speech within the audio data;
determining, via the computing device, a timestamp for the audible and non-audible content in the phoneme-level transcription that indicates a time that a word and a non-word appears in the phoneme-level transcription;
determining, via the computing device, a time-aligned transcription of the audio data based on the phoneme-level transcription and associated timestamps, said time-aligned transcription determination comprising comparing occurrences of words and non-words in the phoneme-level transcription and their associated timestamps against an acoustic model that comprises information indicating a dictionary of terms and a timing scheme corresponding to a length of the video file and a beginning and end of the audio data of the video file, such that each word and non-word and their associated timestamps are mapped and stored in association with each other based on the information comprised within the acoustic model;
automatically inserting, via the computing device, punctuation into the time-aligned transcription based on the text in the time-aligned transcription and the indicated mapping from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks;
determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing said character set in the punctuated time-aligned transcription; and
storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-
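
As a reading aid for the punctuation and capitalization steps recited in claims 1 to 3, the following is a minimal sketch, not the patented implementation. It assumes a hypothetical `Token` structure in which a region of no speech is an entry whose `text` is `None`, and it treats the duration of that region as the "dimensional value" that selects the punctuation type. The three thresholds are invented for illustration; the claims state no numeric values, and the language model of claim 4 is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One entry of a time-aligned transcription (hypothetical structure)."""
    text: Optional[str]  # a word, or None for a region of no speech
    start: float         # start time in seconds
    end: float           # end time in seconds


# Assumed pause thresholds in seconds; not taken from the patent.
COMMA_PAUSE = 0.3
PERIOD_PAUSE = 0.7
PARAGRAPH_PAUSE = 2.0


def punctuate_and_capitalize(tokens: List[Token]) -> str:
    """Insert punctuation from pause lengths, then capitalize sentence starts."""
    paragraphs: List[str] = []
    words: List[str] = []
    capitalize_next = True  # the first word of the transcript starts a sentence

    for tok in tokens:
        if tok.text is None:
            # Skip silence before any word, or directly after punctuation.
            if not words or words[-1][-1] in ".,":
                continue
            pause = tok.end - tok.start
            if pause >= PERIOD_PAUSE:
                words[-1] += "."
                capitalize_next = True
                if pause >= PARAGRAPH_PAUSE:
                    # A very long pause also closes the current paragraph.
                    paragraphs.append(" ".join(words))
                    words = []
            elif pause >= COMMA_PAUSE:
                words[-1] += ","
            continue

        word = tok.text
        if capitalize_next:
            word = word[:1].upper() + word[1:]
            capitalize_next = False
        words.append(word)

    if words:
        if words[-1][-1] not in ".,":
            words[-1] += "."  # close the final sentence
        paragraphs.append(" ".join(words))
    return "\n\n".join(paragraphs)
```

Under these assumed thresholds, a 0.4 s gap yields a comma, a 0.8 s gap ends the sentence, and a 2.5 s gap also starts a new paragraph. A production system would more plausibly combine pause length with lexical cues than rely on fixed cutoffs.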
Speech to text systems (G10L15/08 takes precedence) · CPC title
Phonemes, fenemes or fenones being the recognition units · CPC title
Electronic editing of digitised analogue information signals, e.g. audio or video signals · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Word boundary detection · CPC title