Dynamic domain-adapted automatic speech recognition system

US12374328B2 · US · B2

Patent metadata
Publication number: US-12374328-B2
Application number: US-202318511077-A
Country: US
Kind code: B2
Filing date: Nov 16, 2023
Priority date: Mar 26, 2021
Publication date: Jul 29, 2025
Grant date: Jul 29, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are system, apparatus, article of manufacture, method, and computer program product embodiments for adapting an automated speech recognition system to provide more accurate suggestions to voice queries involving media content including recently created or recently available content. An example computer-implemented method includes transcribing the voice query, identifying respective components of the query such as the media content being requested and the action to be performed, and generating fuzzy candidates that potentially match the media content based on phonetic representations of the identified components. Phonetic representations of domain-specific candidates are stored in a domain entities index, which is continuously updated with new entries so as to maintain the accuracy of the speech recognition of voice queries for recently created or recently available content.
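
The index described in the abstract can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the abstract does not say which phonetic algorithm is used, so `phonetic_key` is a simplified Soundex-style code, and `DomainEntitiesIndex`, `add`, and `candidates` are hypothetical names rather than terms from the patent.

```python
from collections import defaultdict

# Soundex-style consonant groups. This is an illustrative stand-in; the
# abstract does not name a specific phonetic representation.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def phonetic_key(text: str, length: int = 6) -> str:
    """Return a simplified Soundex-style key, ignoring spaces and punctuation."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Skip uncoded letters and immediate repeats of the same code.
        if code and code != prev:
            digits.append(code)
        prev = code
    return (letters[0].upper() + "".join(digits) + "0" * length)[:length]


class DomainEntitiesIndex:
    """Hypothetical index mapping phonetic keys to known media titles."""

    def __init__(self) -> None:
        self._by_key = defaultdict(set)

    def add(self, title: str) -> None:
        # New titles can be added at any time, keeping the index current.
        self._by_key[phonetic_key(title)].add(title)

    def candidates(self, transcribed: str) -> list:
        # Titles whose phonetic key equals the key of the transcribed text.
        return sorted(self._by_key.get(phonetic_key(transcribed), set()))


index = DomainEntitiesIndex()
index.add("Bridgerton")
print(index.candidates("bridger town"))
```

Because lookups go through the phonetic key rather than the spelling, an imperfect transcription such as "bridger town" can still reach the stored title, and a title added later becomes reachable without retraining the speech recognizer.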

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for adapting an automatic speech recognition engine implemented within a multimedia environment, comprising: receiving, at a user device, a voice query that includes identification of requested media content and an action to be performed on the requested media content, wherein the identification of the requested media content comprises a first entity representing a title of the requested media content and a second entity representing at least one metadata associated with the requested media content; generating a transcription of the voice query, wherein the transcription is generated using the automatic speech recognition engine, wherein the transcription includes an imperfect textual representation of the requested media content; parsing the transcription to identify the first entity and the second entity; generating a phonetic representation of the requested media content based on the first entity and the second entity; generating, based on the phonetic representation, a fuzzy candidate list comprising a plurality of fuzzy candidates representing potential matches to the requested media content, wherein each fuzzy candidate of the plurality of fuzzy candidates is associated with a popularity score; ranking the fuzzy candidate list to form a ranked fuzzy candidate list including a highest ranked fuzzy candidate corresponding to a best potential match for the requested media content, wherein the highest ranked fuzzy candidate corresponding to the best potential match for the requested media content is determined based on a comparison of the popularity score of each fuzzy candidate of the plurality of fuzzy candidates; and performing the action associated with the highest ranked fuzzy candidate.

2. The computer-implemented method of claim 1, wherein the user device is one of a remote control, a media device, or a display device.

3. The computer-implemented method of claim 1, wherein a first fuzzy candidate of the plurality of fuzzy candidates is associated with a first popularity score and a second fuzzy candidate of the plurality of fuzzy candidates is associated with a second popularity score.

4. The computer-implemented method of claim 3, wherein each fuzzy candidate is associated with a match count including the first fuzzy candidate being associated with a first match count and the second fuzzy candidate being associated with a second match count, wherein the match count comprises a numerical value indicating a number of matching strategies that indicate each fuzzy candidate as being a quality match to the requested media content.

5. The computer-implemented method of claim 4, wherein the matching strategies comprises at least two of grapheme spelling, grapheme n-gram, and phoneme.

6. The computer-implemented method of claim 3, further comprising: determining the first popularity score of the first fuzzy candidate based on a first frequency of performed actions associated with the first fuzzy candidate within the multimedia environment; and determining the second popularity score of the second fuzzy candidate based on a second frequency of performed actions associated with the second fuzzy candidate with the multimedia environment.

7. The computer-implemented method of claim 6, wherein the performed actions are based on metrics associated with the first fuzzy candidate and the second fuzzy candidate collected within the multimedia environment.

8. The computer-implemented method of claim 6, wherein the first frequency of performed actions comprises at least one of a number of times the first fuzzy candidate was streamed within the multimedia environment or a number of times the first fuzzy candidate was requested within the multimedia environment.

9. An apparatus implemented within a multimedia environment comprising: a memory; and at least one processor communicatively coupled to the memory and configured to: receive a voice query that includes identification of requested media content and an action to be performed on the requested media content, wherein the identification of the requested media content comprises a first entity representing a title of the requested media content and a second entity representing at least one metadata associated with the requested media content; generate a transcription of the voice query, wherein the transcription is generated using an automatic speech recognition engine, wherein the transcription includes an imperfect textual representation of the requested media content; parse the transcription to identify the first entity and the second entity; generate a phonetic representation of the requested media content based on the first entity and the second entity; generate, based on the phonetic representation, a fuzzy candidate list comprising a plurality of fuzzy candidates representing potential matches to the requested media content, wherein each fuzzy candidate is associated with a popularity score; rank the fuzzy candidate list to form a ranked fuzzy candidate list including a highest ranked fuzzy candidate corresponding to a best potential match for the requested media content, wherein the highest ranked fuzzy candidate corresponding to the best potential match for the requested media content is determined based on a comparison of the popularity score of each fuzzy candidate of the plurality of fuzzy candidates; and perform the action associated with the highest ranked fuzzy candidate.

10. The apparatus of claim 9, wherein the apparatus is implemented as one of a remote control, a media device, or a display device.

11. The apparatus of claim 9, wherein a first fuzzy candidate of the plurality of fuzzy candidates is associated with a first popularity score and a second fuzzy candidate of the plurality of fuzzy candidates is associated with a second popularity score.

12. The apparatus of claim 11, wherein each fuzzy candidate is associated with a match count including the first fuzzy candidate being associated with a first match count and the second fuzzy candidate being associated with a second match count, wherein the match count comprises a numerical value indicating a number of matching strategies that indicate each fuzzy candidate as being a quality match to the requested media content.

13. The apparatus of claim 12, wherein the matching strategies comprises at least two of grapheme spelling, grapheme n-gram, and phoneme.

14. The apparatus of claim 11, wherein the at least one processor is further configured to: determine the first popularity score of the first fuzzy candidate based on a first frequency of performed actions associated with the first fuzzy candidate within the multimedia environment; and determine the second popularity score of the second fuzzy candidate based on a second frequency of performed actions associated with the second fuzzy candidate with the multimedia environment.

15. The apparatus of claim 14, wherein the first frequency of performed actions comprises at least one of a number of times the first fuzzy candidate was streamed within the multimedia environment or a number of times the first fuzzy candidate was requested within the multimedia environment.

16. A non-transitory computer-readable medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform operations comprising: receiving, at a user device, a voice query that includes identification of requested media content and an action to be performed on the requested media content, wherein the identification of the reques
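
The match count and popularity ranking recited in the claims can be sketched in Python as follows. This is only an illustration under stated assumptions: the thresholds, the rule that popularity is the sum of stream and request counts, the use of match count as a tie-breaker, the consonant-skeleton stand-in for phoneme matching, and every name (`FuzzyCandidate`, `rank_candidates`, and so on) are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class FuzzyCandidate:
    title: str
    stream_count: int   # hypothetical metric: times streamed
    request_count: int  # hypothetical metric: times requested
    match_count: int = 0

    @property
    def popularity(self) -> int:
        # Assumption: popularity is the plain sum of the two frequencies.
        return self.stream_count + self.request_count


def spelling_match(query: str, title: str, threshold: float = 0.8) -> bool:
    """Grapheme spelling: character-level similarity of the full strings."""
    return SequenceMatcher(None, query.lower(), title.lower()).ratio() >= threshold


def _ngrams(text: str, n: int = 3) -> set:
    s = text.lower().replace(" ", "")
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}


def ngram_match(query: str, title: str, threshold: float = 0.5) -> bool:
    """Grapheme n-gram: Jaccard overlap of character trigrams."""
    a, b = _ngrams(query), _ngrams(title)
    return len(a & b) / len(a | b) >= threshold


def _skeleton(text: str) -> str:
    return "".join(ch for ch in text.lower() if ch.isalpha() and ch not in "aeiouhwy")


def phoneme_match(query: str, title: str) -> bool:
    """Phoneme: crude stand-in that compares consonant skeletons."""
    return _skeleton(query) == _skeleton(title)


STRATEGIES = (spelling_match, ngram_match, phoneme_match)


def rank_candidates(query: str, candidates: list) -> list:
    """Set each match count, drop non-matches, and sort by popularity."""
    for cand in candidates:
        cand.match_count = sum(s(query, cand.title) for s in STRATEGIES)
    matched = [c for c in candidates if c.match_count > 0]
    # Popularity decides the order; match count only breaks ties (assumption).
    return sorted(matched, key=lambda c: (c.popularity, c.match_count), reverse=True)


ranked = rank_candidates("bridger town", [
    FuzzyCandidate("Bridge to Terabithia", stream_count=200, request_count=50),
    FuzzyCandidate("Bridgerton", stream_count=900, request_count=100),
])
print([(c.title, c.match_count, c.popularity) for c in ranked])
```

In this sketch the action named in the voice query would then be performed on `ranked[0]`, the highest ranked candidate.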

Assignees

Roku Inc

Inventors

Classifications

  • using fuzzy logic · CPC title

  • Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • Parsing for meaning understanding · CPC title

  • Named entity recognition · CPC title

  • Parsing · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12374328B2 cover?
Disclosed herein are system, apparatus, article of manufacture, method, and computer program product embodiments for adapting an automated speech recognition system to provide more accurate suggestions to voice queries involving media content including recently created or recently available content. An example computer-implemented method includes transcribing the voice query, identifying respec…
Who is the assignee on this patent?
Roku Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/187. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 29, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).