Audio recognition method, electronic device and storage medium

US12373486B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12373486-B2
Application number  US-202217703564-A
Country             US
Kind code           B2
Filing date         Mar 24, 2022
Priority date       Sep 22, 2021
Publication date    Jul 29, 2025
Grant date          Jul 29, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes obtaining a query content. The query content includes segment information representing a to-be-recognized audio. The method further includes selecting the preset quantity of candidate audios corresponding to the query content from a preset library. Each candidate audio includes a candidate audio segment matched with the segment information. The method further includes inputting the candidate audio segment into a trained detection model so as to obtain target segment information including the segment information and a target audio where the target segment information is located.
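
The candidate-selection step summarized above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it borrows the ranking and longest-consecutive-match details from claim 2, and it assumes that whitespace-separated words stand in for morphemes, that each library entry is a list of lyric lines, and that similarity is the share of query tokens found in the lyrics. The names `tokenize`, `longest_run`, and `select_candidates` are hypothetical.

```python
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Assumption: lowercase whitespace tokens stand in for morphemes."""
    return text.lower().split()


def longest_run(a: List[str], b: List[str]) -> int:
    """Length of the longest run of consecutive tokens shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def select_candidates(
    query: str, library: Dict[str, List[str]], k: int
) -> List[Tuple[str, str]]:
    """Rank audios by similarity, keep the top k, and pick for each the
    line holding the longest consecutive match with the query."""
    q = tokenize(query)

    def score(lines: List[str]) -> float:
        # Assumed similarity: fraction of query tokens present in the lyrics.
        vocab = {tok for line in lines for tok in tokenize(line)}
        return sum(tok in vocab for tok in q) / len(q) if q else 0.0

    ranked = sorted(library, key=lambda title: score(library[title]), reverse=True)
    result = []
    for title in ranked[:k]:
        best_line = max(library[title], key=lambda ln: longest_run(q, tokenize(ln)))
        result.append((title, best_line))
    return result


library = {
    "A": ["twinkle twinkle little star", "how i wonder what you are"],
    "B": ["row row row your boat", "gently down the stream"],
    "C": ["little lamb little lamb", "mary had a little lamb"],
}
candidates = select_candidates("wonder what you", library, k=2)
```

Here `k` plays the role of the "preset quantity". A production system would likely use an inverted index or a learned retriever in place of the linear scan.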

First claim

Opening claim text (preview).

What is claimed is:

1. An audio recognition method, performed by an electronic device, comprising:
    obtaining a query content, wherein the query content comprises segment information representing a to-be-recognized audio;
    selecting a preset quantity of candidate audios corresponding to the query content from a preset library, wherein the candidate audio comprises a candidate audio segment matched with the segment information;
    obtaining a to-be-detected vector corresponding to the candidate audio according to the segment information and the candidate audio segment;
    inputting the to-be-detected vector corresponding to the candidate audio into a trained detection model, so as to obtain a detection result data output by the trained detection model; and
    obtaining target segment information comprising the segment information and a target audio where the target segment information is located according to the detection result data;
    wherein the detection result data comprise first probability data and second probability data which correspondingly indicate that a morpheme in the candidate audio segment is located at a starting position and an ending position respectively, and
    wherein: obtaining the target segment information comprising the segment information and the target audio where the target segment information is located according to the detection result data comprises:
        determining, in response to determining that the starting position is smaller than the ending position, a target audio segment from the candidate audio segment based on a product of the first probability data and the second probability data; and
        using the target audio segment as the target segment information recognized from the query content, and using an audio where the target segment information is located as the target audio;
    wherein determining the target audio segment from the candidate audio segment based on the product of the first probability data and the second probability data comprises:
        determining a starting morpheme at the starting position and an ending morpheme at the ending position when the product of the first probability data and the second probability data is the largest; and
        determining that all morphemes between the starting morpheme and the ending morpheme constitute the target audio segment.

2. The method according to claim 1, wherein selecting the preset quantity of the candidate audios corresponding to the query content from the preset library comprises:
    determining a similarity between a morpheme of the segment information and text information of an audio in the preset library;
    sorting audios in the preset library according to the similarity from large to small so as to obtain a sorting result;
    determining a preset quantity of audios with sorting positions at the front as the candidate audios based on the sorting result, the candidate audio comprising at least one audio segment matched with the morphemes of the segment information; and
    obtaining an audio segment comprising the longest consecutive matching morpheme from the at least one audio segment of the candidate audio, so as to obtain the candidate audio segment, matched with the segment information, of the candidate audio.

3. The method according to claim 1, wherein obtaining the to-be-detected vector corresponding to the candidate audio according to the segment information and the candidate audio segment comprises:
    splicing the segment information with the candidate audio segment of the candidate audio respectively, so as to obtain the to-be-detected vector corresponding to the candidate audio, and wherein the to-be-detected vector at least comprises a first identifier and a second identifier, the first identifier is configured to identify a starting position of the to-be-detected vector, and the second identifier is configured to identify a splicing position and an ending position of the to-be-detected vector.

4. The method according to claim 1, wherein the to-be-recognized audio is a song, and the segment information refers to part of lyrics in the song.

5. An electronic device, comprising:
    a processor; and
    a memory configured to store computer instructions executable by the processor;
    wherein, when the processor executes the instructions the processor is configured to:
        obtain a query content, wherein the query content comprises segment information representing a to-be-recognized audio;
        select a preset quantity of candidate audios corresponding to the query content from a preset library, wherein the candidate audio comprises a candidate audio segment matched with the segment information;
        obtain a to-be-detected vector corresponding to the candidate audio according to the segment information and the candidate audio segment;
        input the to-be-detected vector corresponding to the candidate audio into a trained detection model, so as to obtain a detection result data output by the trained detection model; and
        obtain target segment information comprising the segment information and a target audio where the target segment information is located according to the detection result data;
    wherein the detection result data comprise first probability data and second probability data which correspondingly indicate that a morpheme in the candidate audio segment is located at a starting position and an ending position respectively; and
    wherein the processor is further configured to:
        determine, in response to determining that the starting position is smaller than the ending position, a target audio segment from the candidate audio segment based on a product of the first probability data and the second probability data; and
        use the target audio segment as the target segment information recognized from the query content, and using an audio where the target segment information is located as the target audio;
    wherein the processor is further configured to:
        determine a starting morpheme at the starting position and an ending morpheme at the ending position when the product of the first probability data and the second probability data is the largest; and
        determine that all morphemes between the starting morpheme and the ending morpheme constitute the target audio segment.

6. The electronic device according to claim 5, wherein the processor is further configured to:
    determine a similarity between a morpheme of the segment information and text information of an audio in the preset library;
    sort audios in the preset library according to the similarity from large to small so as to obtain a sorting result;
    determine a preset quantity of audios with sorting positions at the front as the candidate audios based on the sorting result, the candidate audio comprising at least one audio segment matched with the morphemes of the segment information; and
    obtain an audio segment comprising the longest consecutive matching morpheme from the at least one audio segment of the candidate audio, so as to obtain the candidate audio segment, matched with the segment information, of the candidate audio.

7. The electronic device according to claim 5, wherein the processor is further configured to:
    splice the segment information with the candidate audio segment of the candidate audio respectively, so as to obtain the to-be-detected vector corresponding to the candidate audio, and wherein the to-be-detected vector at least comprises a first identifier and a second identifier, and wherein the first identifier is configured to identify a starting position of the to-be-detected vector, and the second identifier is configured to identify a splicing position and an ending position of the to-be-detected vector.

8. The electronic device according to claim 5, wherein the to-be-recognized audio is a song, and the
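
The splicing and span-selection steps recited in claims 1 and 3 can be illustrated with a short sketch. It is a hedged reading of the claim language, not the patentee's code: the marker strings "[CLS]" and "[SEP]" are assumed stand-ins for the "first identifier" and "second identifier", the probability lists are assumed to be per-morpheme outputs of the detection model, and the selected span is assumed to include both end points. The function names `splice`, `best_span`, and `extract_segment` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed marker tokens; the claims only speak of a first and second identifier.
FIRST_ID = "[CLS]"
SECOND_ID = "[SEP]"


def splice(query: List[str], candidate: List[str]) -> List[str]:
    """Join query and candidate segment: the first identifier marks the start,
    the second marks the splicing position and the end."""
    return [FIRST_ID] + query + [SECOND_ID] + candidate + [SECOND_ID]


def best_span(
    p_start: Sequence[float], p_end: Sequence[float]
) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing p_start[start] * p_end[end]
    subject to start < end, as required by the claim."""
    if len(p_start) != len(p_end):
        raise ValueError("probability lists must have equal length")
    best_i, best_j, best_score = -1, -1, -1.0
    for i in range(len(p_start)):
        for j in range(i + 1, len(p_end)):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best_i, best_j, best_score = i, j, score
    if best_i < 0:
        raise ValueError("need at least two positions to form a span")
    return best_i, best_j, best_score


def extract_segment(
    morphemes: Sequence[str], p_start: Sequence[float], p_end: Sequence[float]
) -> List[str]:
    """All morphemes from the starting to the ending morpheme (inclusive,
    by assumption) form the target segment."""
    i, j, _ = best_span(p_start, p_end)
    return list(morphemes[i : j + 1])


# Made-up probabilities for illustration only.
morphemes = ["how", "i", "wonder", "what", "you", "are"]
p_start = [0.05, 0.05, 0.70, 0.10, 0.05, 0.05]
p_end = [0.02, 0.03, 0.05, 0.10, 0.70, 0.10]
segment = extract_segment(morphemes, p_start, p_end)
```

The double loop is quadratic in segment length, which is acceptable for short lyric segments. Running the same scoring over every candidate and keeping the highest-scoring one would give the target audio.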

Assignees

Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronics Co Ltd

Inventors

Classifications

  • using artificial neural networks · CPC title

  • G06F16/685 · Primary

    using automatically derived transcript of audio data, e.g. lyrics (speech recognition G10L15/00) · CPC title

  • using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings · CPC title

  • Phonemes, fenemes or fenones being the recognition units · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12373486B2 cover?
A method includes obtaining a query content. The query content includes segment information representing a to-be-recognized audio. The method further includes selecting the preset quantity of candidate audios corresponding to the query content from a preset library. Each candidate audio includes a candidate audio segment matched with the segment information. The method further includes inputtin…
Who is the assignee on this patent?
Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F16/685. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 29, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).