System and method of video capture and search optimization for creating an acoustic voiceprint
US-2022139399-A1 · May 5, 2022 · US
US2025068673A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025068673-A1 |
| Application number | US-202418813647-A |
| Country | US |
| Kind code | A1 |
| Filing date | Aug 23, 2024 |
| Priority date | Aug 25, 2023 |
| Publication date | Feb 27, 2025 |
| Grant date | — |
Official abstract text for this publication.
A computing system is configured to obtain a plurality of media files that each includes speech of one or more speakers. The computing system is further configured to process the plurality of media files to generate indexed data, wherein the indexed data includes a corresponding embedding for each speaker of the one or more speakers identified in the media file and a corresponding one or more keywords identified in the speech in the media file. The computing system is further configured to receive an indication of at least one of a selection of a particular speaker from the one or more speakers or a selection of a particular keyword from a plurality of keywords. The computing system is further configured to generate one or more correlations based on the indexed data. The computing system is further configured to output an alert regarding the one or more correlations.
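The abstract's flow (index each file by speaker and keyword, accept a speaker or keyword selection, generate correlations, output an alert) can be sketched in Python. This is a minimal, hypothetical illustration, not the applicant's implementation. It assumes diarization, embedding-based speaker identification, and keyword spotting have already produced a stable speaker ID and a keyword set per speaker for each file, so the per-speaker embeddings themselves are omitted. `IndexedFile`, `speaker_cooccurrence`, and `correlate` are invented names, and speaker association is modeled only as co-occurrence in the same file (the kind of association claims 7 and 10 describe).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import combinations
from typing import Optional


@dataclass
class IndexedFile:
    """Hypothetical index record for one media file."""
    file_id: str
    # speaker ID -> keywords spotted in that speaker's transcribed speech
    speaker_keywords: dict[str, set[str]] = field(default_factory=dict)


def speaker_cooccurrence(index: list[IndexedFile]) -> dict[frozenset, int]:
    """Count, for each pair of speakers, how many files they both speak in."""
    counts: dict[frozenset, int] = defaultdict(int)
    for rec in index:
        for a, b in combinations(sorted(rec.speaker_keywords), 2):
            counts[frozenset((a, b))] += 1
    return dict(counts)


def correlate(index: list[IndexedFile], speaker: Optional[str] = None,
              keyword: Optional[str] = None) -> list[str]:
    """Build alert messages for a selected speaker and/or keyword."""
    alerts: list[str] = []
    for rec in index:
        speakers = rec.speaker_keywords
        if speaker is not None and speaker in speakers:
            others = sorted(s for s in speakers if s != speaker)
            if others:  # speaker association: same-file co-occurrence
                alerts.append(f"{rec.file_id}: {speaker} co-occurs with {', '.join(others)}")
        if keyword is not None:
            hits = sorted(s for s, kws in speakers.items() if keyword in kws)
            if hits:  # keyword association: which speakers said the selected keyword
                alerts.append(f"{rec.file_id}: '{keyword}' spoken by {', '.join(hits)}")
    return alerts
```

For example, if `spk1` speaks with `spk2` in one file and with `spk3` in another, selecting `speaker="spk1"` yields one co-occurrence alert per file, while selecting a keyword lists, per file, every speaker whose transcribed speech contains it.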
Opening claim text (preview).
What is claimed is:
1. A method, comprising: obtaining, by a computing system and from one or more sources that provide media files over one or more networks, a plurality of media files that each includes speech of one or more speakers; processing, by the computing system, the plurality of media files to generate indexed data, wherein the indexed data includes, for each media file of the plurality of media files, a corresponding embedding for each speaker of the one or more speakers identified in the media file and a corresponding transcript of speech for each language identified in the speech in the media file; receiving, by the computing system, an indication of at least one of a selection of a particular speaker from the one or more speakers or a selection of a particular keyword from a plurality of keywords; generating, by the computing system, one or more correlations based on the indexed data, wherein the one or more correlations include at least one of an association among the one or more speakers or an association among keywords detected in the transcripts as spoken by the one or more speakers; and outputting, by the computing system, based on the one or more correlations, an indication regarding the one or more correlations.
2. The method of claim 1, wherein processing the plurality of media files into indexed data includes clustering excerpts from the plurality of media files.
3. The method of claim 1, wherein processing the plurality of media files into indexed data includes: extracting embeddings from overlapping windows from each of the plurality of media files; and applying clustering to the embeddings.
4. The method of claim 1, further comprising: generating master data by matching keywords included in a watchlist to the transcripts.
5. The method of claim 1, further comprising: receiving, by the computing system, an indication of a selection of a particular keyword, and wherein outputting the indication includes generating the indication based on determining that at least one media file of the plurality of media files includes a speaker speaking the particular keyword.
6. The method of claim 1, wherein the indexed data includes, for each speaker, at least one of: a gender identifier, or an identifier of the language spoken by the speaker.
7. The method of claim 6, further comprising: determining, by the computing system, a subset of the one or more speakers, where each speaker of the subset of the one or more speakers is associated with the particular speaker, and wherein determining the subset includes: determining that the speakers of the subset of one or more speakers and the particular speaker speak in a same media file.
8. The method of claim 1, further comprising: receiving, by the computing system, an indication of a selection of a particular speaker of the one or more speakers; and identifying, by the computing system, one or more media files that include speech by the particular speaker.
9. The method of claim 1, wherein the one or more sources comprise a Clearnet site and a darknet site.
10. The method of claim 1, wherein the indexed data includes, for a media file of the plurality of media files, respective identifiers for multiple speakers of the one or more speakers that speak in the media file, and wherein generating the one or more correlations based on the indexed data comprises identifying an association among the multiple speakers based on the identifiers for the multiple speakers that speak in the media file.
11. The method of claim 1, further comprising: outputting, by the computing system, a graphical user interface (GUI), wherein the GUI includes one or more visual representations of the one or more correlations.
12. The method of claim 1, wherein processing the plurality of media files to generate the indexed data includes: processing the plurality of media files to generate, for each media file, respective embeddings for one or more speakers having speech in the media file; and matching speaker embeddings included in a watchlist to the embeddings for the one or more speakers having speech in the media file.
13. The method of claim 1, wherein, for each media file of the plurality of media files, the corresponding one or more keywords identified in the speech in the media file are present in a transcript of the media file.
14. The method of claim 1, wherein the media file is a first media file, wherein the indication is a first indication, and further comprising: receiving, by the computing system, a second media file of a speaker for enrollment, wherein the second media file includes speech of at least one speaker; processing, by the computing system, the second media file, wherein processing the second media file includes: extracting an embedding of the at least one speaker from the second media file, and matching the embedding to a cluster of one or more clusters of a plurality of speakers, wherein each cluster of the plurality of clusters corresponds to a respective speaker of a plurality of speakers; and outputting, by the computing system, based on matching the embedding to the cluster, a second indication that includes an indication of a match between the at least one speaker and a speaker of the plurality of speakers.
15. The method of claim 1, wherein the plurality of media files include media files with audio events, and wherein generating the correlations includes generating correlations that include associations among the audio events.
16. A computing system, comprising: memory; and one or more programmable processors in communication with the memory and configured to: obtain, from one or more sources that provide media files over one or more networks, a plurality of media files that each includes speech of one or more speakers; process the plurality of media files to generate indexed data, wherein the indexed data includes, for each media file of the plurality of media files, a corresponding embedding for each speaker of the one or more speakers identified in the media file and a corresponding transcript of speech for each language identified in the speech in the media file; receive an indication of at least one of a selection of a particular speaker from the one or more speakers or a selection of a particular keyword from a plurality of keywords; generate one or more correlations based on the indexed data, wherein the one or more correlations include at least one of an association among the one or more speakers or an association among keywords detected in the transcripts as spoken by the one or more speakers; and output, based on the one or more correlations, an indication regarding the one or more correlations.
17. The computing system of claim 16, wherein to process the plurality of media files into indexed data, the one or more programmable processors are further configured to: cluster excerpts from the plurality of media files.
18. The computing system of claim 16, wherein to process the plurality of media files into indexed data, the one or more programmable processors are further configured to: extract embeddings from overlapping windows from each of the plurality of media files; and apply clustering to the embeddings.
19. The computing system of claim 16, wherein the one or more programmable processors are further configured to generate master data by matching keywords included in a watchlist to the transcripts.
20. Non-transit
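Claims 3 and 14 describe extracting embeddings from overlapping windows, clustering them, and matching an enrollment embedding to a speaker cluster. The text here does not name the embedding model, window length, similarity measure, or clustering algorithm, so the Python sketch below assumes all of them. `embed` is a hypothetical stand-in for a speaker-embedding model, similarity is cosine, clustering is a simple greedy centroid assignment (leader-style clustering, swapped in for illustration) with an assumed 0.8 threshold, and every function name is invented. A real system could substitute a trained extractor and a different clustering method.

```python
import math
from typing import Callable, Optional, Sequence

Vector = list[float]


def overlapping_windows(n_samples: int, win: int, hop: int) -> list[tuple[int, int]]:
    """(start, end) sample ranges of length `win`, advanced by `hop`; hop < win gives overlap."""
    if n_samples <= 0:
        return []
    if n_samples < win:
        return [(0, n_samples)]  # short file: a single partial window
    return [(s, s + win) for s in range(0, n_samples - win + 1, hop)]


def embed_windows(samples: Sequence[float], embed: Callable[[Sequence[float]], Vector],
                  win: int, hop: int) -> list[Vector]:
    """Apply a (hypothetical) speaker-embedding function to each overlapping window."""
    return [embed(samples[s:e]) for s, e in overlapping_windows(len(samples), win, hop)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(members: Sequence[Vector]) -> Vector:
    """Element-wise mean of a cluster's embeddings."""
    return [sum(col) / len(members) for col in zip(*members)]


def match_speaker(emb: Vector, clusters: Sequence[Sequence[Vector]],
                  threshold: float = 0.8) -> Optional[int]:
    """Index of the cluster whose centroid is most similar to `emb`, or None if none reaches `threshold`."""
    best: Optional[int] = None
    best_sim = threshold
    for i, members in enumerate(clusters):
        sim = cosine(emb, centroid(members))
        if sim >= best_sim:
            best, best_sim = i, sim
    return best


def cluster_embeddings(embs: Sequence[Vector], threshold: float = 0.8) -> list[list[Vector]]:
    """Greedy clustering: join the best-matching cluster, else start a new one (one cluster ~ one speaker)."""
    clusters: list[list[Vector]] = []
    for e in embs:
        i = match_speaker(e, clusters, threshold)
        if i is None:
            clusters.append([e])
        else:
            clusters[i].append(e)
    return clusters
```

Enrollment in the sense of claim 14 then reduces to embedding the enrollment file and calling `match_speaker` against the existing clusters; a `None` result means no indexed speaker is similar enough under the assumed threshold. The same call could serve watchlist matching (claim 12) if watchlist embeddings are treated as single-member clusters.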
Language recognition · CPC title
Word spotting · CPC title
Speaker identification or verification techniques · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Indexing; Data structures therefor; Storage structures · CPC title