Systems and methods for character-to-phone conversion

US12555563B2 · US · B2

Patent metadata
Publication number: US-12555563-B2
Application number: US-202217888243-A
Country: US
Kind code: B2
Filing date: Aug 15, 2022
Priority date: Aug 15, 2022
Publication date: Feb 17, 2026
Grant date: Feb 17, 2026

Abstract

Official abstract text for this publication.

Systems and methods for training a model to perform end-to-end character-to-phoneme (C2P) conversion include: selecting a plurality of unlabeled sentences from a first data source, selecting a plurality of labeled sentences from a second data source, preprocessing a combined corpus of the selected unlabeled and labeled sentences to extract a plurality of linguistic features, generating mixed training data by automatically labeling tokens in the preprocessed corpus based on the plurality of extracted linguistic features, and training a pre-trained model, using the mixed training data, to perform end-to-end C2P conversion.
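The data flow in the abstract can be illustrated with a small Python sketch. It is not the patented implementation. The function and its parameters (`build_mixed_training_data`, `preprocess`, `baseline_label`) are hypothetical, a phoneme label is assumed to be one pinyin syllable per character, and the toy baseline uses a single hand-written rule for the polyphone 行 where the patent describes a baseline system of rules and decision trees. In this sketch, manually labeled sentences simply keep their labels; claim 7 describes a finer token-level reconciliation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a sentence paired with one phoneme label per character.
LabeledSentence = Tuple[str, List[str]]
Features = List[Dict[str, str]]


def build_mixed_training_data(
    unlabeled: List[str],
    labeled: List[LabeledSentence],
    preprocess: Callable[[str], Features],
    baseline_label: Callable[[str, Features], List[str]],
) -> List[LabeledSentence]:
    """Combine both sources into one corpus of (sentence, phoneme labels) pairs."""
    mixed: List[LabeledSentence] = []
    # Sentences from the labeled source keep their manual labels.
    for sentence, phonemes in labeled:
        if len(phonemes) != len(sentence):
            raise ValueError(f"expected one label per character in {sentence!r}")
        mixed.append((sentence, list(phonemes)))
    # Sentences from the unlabeled source are preprocessed, then auto-labeled.
    for sentence in unlabeled:
        features = preprocess(sentence)
        mixed.append((sentence, baseline_label(sentence, features)))
    return mixed


# --- Hypothetical toy components, for illustration only ---
TOY_LEXICON = {"银": "yin2", "不": "bu4", "人": "ren2"}


def toy_preprocess(sentence: str) -> Features:
    # Toy "linguistic feature": the character preceding each position.
    return [{"prev": sentence[i - 1] if i > 0 else ""} for i in range(len(sentence))]


def toy_baseline(sentence: str, features: Features) -> List[str]:
    # Toy rule: 行 after 银 (as in 银行, "bank") is hang2, otherwise xing2.
    labels = []
    for ch, feat in zip(sentence, features):
        if ch == "行":
            labels.append("hang2" if feat["prev"] == "银" else "xing2")
        else:
            labels.append(TOY_LEXICON.get(ch, "<unk>"))
    return labels


if __name__ == "__main__":
    data = build_mixed_training_data(
        unlabeled=["银行", "不行"],
        labeled=[("行人", ["xing2", "ren2"])],
        preprocess=toy_preprocess,
        baseline_label=toy_baseline,
    )
    for sentence, phonemes in data:
        print(sentence, phonemes)
```

Running it prints the three sentences with their labels: the manual labels for 行人 are kept unchanged, 银行 is auto-labeled `['yin2', 'hang2']`, and 不行 is auto-labeled `['bu4', 'xing2']`.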

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training a model to perform end-to-end character-to-phoneme (C2P) conversion, performed by at least one processor and comprising: selecting, for a combined corpus of selected unlabeled and labeled sentences, a plurality of unlabeled sentences from a first data source wherein at least a predetermined minimum number of sentences are selected with respect to a target polyphone character; selecting, for the combined corpus, a plurality of labeled sentences from a second data source; preprocessing the combined corpus to extract a plurality of linguistic features; generating mixed training data by automatically labeling tokens in the preprocessed corpus based on the plurality of extracted linguistic features; further training a pre-trained model, using the mixed training data, to perform end-to-end C2P conversion, wherein the pre-trained model is configured to output a plurality of phoneme labels of a plurality of characters of a sentence that is input into the pre-trained model; and inputting a plurality of characters into the further trained model to obtain a plurality of phonemes corresponding to the plurality of characters.

2. The method of claim 1, wherein selecting the plurality of unlabeled sentences from the first data source comprises: selecting at least one sentence, from the plurality of unlabeled sentences, that includes a target polyphone character, for each of a plurality of target polyphone characters.

3. The method of claim 1, wherein selecting the plurality of unlabeled sentences from the first data source comprises: selecting a sentence, from the plurality of unlabeled sentences, only once for each of a plurality of target polyphone characters; and selecting at least a predetermined minimum number of sentences, from the plurality of unlabeled sentences, for each of the plurality of target polyphone characters.

4. The method of claim 1, wherein preprocessing the combined corpus of the selected unlabeled and labeled sentences to extract a plurality of linguistic features comprises: identifying one or more of a word boundary, part-of-speech (POS) tag, or named entity phrase using one or more linguistic tools; and extracting each identified word boundary, POS tag, or named entity phrase as a linguistic feature.

5. The method of claim 1, wherein preprocessing the combined corpus of the selected unlabeled and labeled sentences to extract a plurality of linguistic features comprises: identifying one or more tokens of the combined corpus that are associated with an extracted linguistic feature and a manually-labeled linguistic feature, wherein the extracted and manually-labeled linguistic features are mismatched; and masking the mismatched extracted linguistic feature associated with each of the one or more identified tokens.

6. The method of claim 1, wherein generating mixed training data by automatically labeling tokens in the preprocessed corpus based on the plurality of extracted linguistic features comprises: inputting the preprocessed corpus to a baseline system for automatic labeling, wherein the baseline system includes a plurality of rules and decision trees configured to label one or more tokens in the preprocessed corpus; obtaining a plurality of labeled tokens, output by the baseline system in response to inputting the preprocessed corpus, each token corresponding to a character in the preprocessed corpus, and each label corresponding to a phoneme associated with the character; and mixing the plurality of labeled tokens.

7. The method of claim 6, wherein mixing the plurality of labeled tokens comprises: identifying a mismatch among the plurality of labeled tokens, between an automatically generated label associated with a token and a manually assigned label associated with a token; and converting the mismatched automatically generated label to be consistent with the manually assigned label.

8. The method of claim 1, wherein training the pre-trained model, using the mixed training data, to perform end-to-end C2P conversion comprises: obtaining the pre-trained model that is previously trained to perform C2P conversion; and retraining the pre-trained model, using the mixed training data, to simultaneously label all characters in an input sentence when performing C2P conversion on the input sentence.

9. An electronic device comprising: at least one memory configured to store computer program code; and at least one processor configured to operate as instructed by the computer program code, the computer program code including: selecting code configured to cause the at least one processor to select, for a combined corpus of selected unlabeled and labeled sentences, a plurality of unlabeled sentences from a first data source wherein at least a predetermined minimum number of sentences are selected with respect to a target polyphone character, and select, for the combined corpus, a plurality of labeled sentences from a second data source, preprocessing code configured to cause the at least one processor to preprocess the combined corpus to extract a plurality of linguistic features, generating code configured to cause the at least one processor to generate mixed training data by automatically labeling tokens in the preprocessed corpus based on the plurality of extracted linguistic features, training code configured to cause the at least one processor to further train a pre-trained model, using the mixed training data, to perform end-to-end C2P conversion, wherein the pre-trained model is configured to output a plurality of phoneme labels of a plurality of characters of a sentence that is input into the pre-trained model; and inference code configured to cause the at least one processor to input a plurality of characters into the further trained model to obtain a plurality of phonemes corresponding to the plurality of characters.

10. The electronic device of claim 9, wherein the selecting code is configured to cause the at least one processor to: select a sentence, from the plurality of unlabeled sentences, only once for each of a plurality of target polyphone characters; and select at least a predetermined minimum number of sentences, from the plurality of unlabeled sentences, for each of the plurality of target polyphone characters.

11. The electronic device of claim 9, wherein the preprocessing code is configured to cause the at least one processor to: identify one or more of a word boundary, part-of-speech (POS) tag, or named entity phrase using one or more linguistic tools; and extract each identified word boundary, POS tag, or named entity phrase as a linguistic feature.

12. The electronic device of claim 9, wherein the generating code is configured to cause the at least one processor to: input the preprocessed corpus to a baseline system for automatic labeling, wherein the baseline system includes a plurality of rules and decision trees configured to label one or more tokens in the preprocessed corpus; obtain a plurality of labeled tokens, output by the baseline system in response to inputting the preprocessed corpus, each token corresponding to a character in the preprocessed corpus, and each label corresponding to a phoneme associated with the character; and mix the plurality of labeled tokens.

13. The electronic device of claim 9, wherein the training code is configured to cause the at least one processor to: obtain the pre-trained model that is previously trained to perform C2P conversion; and retrain the pre-trained model, using the mixed training data, to simultaneously label all characters in an input
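Claims 1 to 3 (and claim 10) constrain how unlabeled sentences are drawn: every target polyphone must be covered, each by at least a predetermined minimum number of sentences, and a sentence is selected only once. One way to satisfy those constraints is a single greedy pass over the sentence pool, sketched below in Python. The greedy strategy, the function name `select_unlabeled`, and the reading of "only once" as "no sentence enters the corpus twice" are assumptions, not details taken from the patent.

```python
from typing import Dict, Iterable, List, Set, Tuple


def select_unlabeled(
    pool: Iterable[str],
    targets: Set[str],
    min_per_target: int,
) -> Tuple[List[str], Dict[str, int]]:
    """Greedily pick sentences until each target polyphone has min_per_target
    examples. Returns the selection and any remaining shortfall per target."""
    counts: Dict[str, int] = {t: 0 for t in targets}
    selected: List[str] = []
    seen: Set[str] = set()
    for sentence in pool:
        if sentence in seen:
            continue  # a sentence is never selected twice
        present = [t for t in targets if t in sentence]
        # Keep the sentence only if it helps a target that is still under quota.
        if not any(counts[t] < min_per_target for t in present):
            continue
        seen.add(sentence)
        selected.append(sentence)
        for t in present:
            counts[t] += 1
    # Targets the pool could not cover sufficiently (an empty dict means all met).
    shortfall = {t: min_per_target - n for t, n in counts.items() if n < min_per_target}
    return selected, shortfall


if __name__ == "__main__":
    pool = ["银行很近", "不行", "银行关门", "行人很多", "不行", "他会长大"]
    chosen, missing = select_unlabeled(pool, targets={"行", "长"}, min_per_target=2)
    print(chosen)   # sentences kept for the combined corpus
    print(missing)  # targets still below the minimum
```

With this pool, 行 reaches its minimum of two after the first two sentences, so later sentences that contain only 行 are skipped, and the shortfall `{'长': 1}` flags that the pool held just one sentence with 长. A production system might instead prioritize sentences containing rarer polyphones so the minimum is met with fewer sentences.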
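Claims 5 and 7 describe two token-level reconciliations between automatic output and manual annotation. When an extracted linguistic feature disagrees with a manually labeled one, the extracted feature is masked (claim 5); when an automatically generated phoneme label disagrees with a manually assigned one, the automatic label is converted to match the manual label (claim 7). The Python sketch below assumes token-aligned lists in which `None` marks a token with no manual annotation; the mask token `[MASK]` and the function names are hypothetical.

```python
from typing import List, Optional, Sequence

MASK = "[MASK]"  # hypothetical placeholder for a masked feature


def _check_aligned(a: Sequence[object], b: Sequence[object]) -> None:
    # Both sequences must describe the same tokens, one entry per token.
    if len(a) != len(b):
        raise ValueError(f"token count mismatch: {len(a)} vs {len(b)}")


def mask_mismatched_features(
    extracted: Sequence[str], manual: Sequence[Optional[str]], mask: str = MASK
) -> List[str]:
    """Claim 5 sketch: mask an extracted feature that contradicts a manual one."""
    _check_aligned(extracted, manual)
    return [mask if m is not None and e != m else e for e, m in zip(extracted, manual)]


def mix_labels(auto: Sequence[str], manual: Sequence[Optional[str]]) -> List[str]:
    """Claim 7 sketch: replace an automatic label that contradicts a manual one."""
    _check_aligned(auto, manual)
    return [m if m is not None and a != m else a for a, m in zip(auto, manual)]


if __name__ == "__main__":
    # POS tags from a tool vs. a partial manual annotation (second tag disagrees).
    print(mask_mismatched_features(["n", "v", "n"], [None, "n", "n"]))
    # Baseline labeled 银行 as yin2 xing2; the manual label for 行 is hang2.
    print(mix_labels(["yin2", "xing2"], ["yin2", "hang2"]))
```

Note the asymmetry, which follows the claim wording: conflicting features are masked, whereas conflicting phoneme labels are overwritten with the manual value.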

Assignees

Tencent America LLC

Inventors

Classifications

  • G06F40/284 (primary)

    Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

  • Semantic analysis · CPC title

  • Named entity recognition · CPC title

  • G10L13/047 (primary)

    Architecture of speech synthesisers · CPC title

Frequently asked questions

What does patent US12555563B2 cover?
Systems and methods for training a model to perform end-to-end character-to-phoneme (C2P) conversion include: selecting a plurality of unlabeled sentences from a first data source, selecting a plurality of labeled sentences from a second data source, preprocessing a combined corpus of the selected unlabeled and labeled sentences to extract a plurality of linguistic features, generating mixed tr…
Who is the assignee on this patent?
Tencent America LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Published on Feb 17, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).