System for speaker diarization based multilateral automatic speech translation system and its operating method, and apparatus supporting the same
US-2015227510-A1 · Aug 13, 2015 · US
US10108606B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10108606-B2 |
| Application number | US-201615214215-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 19, 2016 |
| Priority date | Mar 3, 2016 |
| Publication date | Oct 23, 2018 |
| Grant date | Oct 23, 2018 |
Official abstract text for this publication.
Provided are an automatic interpretation system and method for generating a synthetic sound having characteristics similar to those of an original speaker's voice. The automatic interpretation system for generating a synthetic sound having characteristics similar to those of an original speaker's voice includes a speech recognition module configured to generate text data by performing speech recognition for an original speech signal of an original speaker and extract at least one piece of characteristic information among pitch information, vocal intensity information, speech speed information, and vocal tract characteristic information of the original speech, an automatic translation module configured to generate a synthesis-target translation by translating the text data, and a speech synthesis module configured to generate a synthetic sound of the synthesis-target translation.
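The three-module flow in the abstract can be pictured as a simple data pipeline. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `VoiceCharacteristics`, `RecognitionResult`, `interpret`, and the `demo_*` stand-ins are hypothetical, and passing the extracted characteristics to the synthesis step is an assumption drawn from the stated goal of producing a sound similar to the original speaker's voice.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class VoiceCharacteristics:
    """Characteristic information listed in the abstract.

    Every field is optional because the abstract requires only
    at least one piece of characteristic information.
    """
    pitch: Optional[float] = None
    vocal_intensity: Optional[float] = None
    speech_speed: Optional[float] = None
    vocal_tract: Optional[List[float]] = None


@dataclass
class RecognitionResult:
    """Output of the speech recognition module: text plus characteristics."""
    text: str
    characteristics: VoiceCharacteristics


def interpret(
    signal: bytes,
    recognize: Callable[[bytes], RecognitionResult],
    translate: Callable[[str], str],
    synthesize: Callable[[str, VoiceCharacteristics], str],
) -> str:
    """Chain recognition, translation, and synthesis in the abstract's order."""
    result = recognize(signal)
    translation = translate(result.text)  # the synthesis-target translation
    return synthesize(translation, result.characteristics)


# Hypothetical stand-ins so the sketch runs end to end (no real audio involved).
def demo_recognize(signal: bytes) -> RecognitionResult:
    return RecognitionResult("hello", VoiceCharacteristics(speech_speed=1.2))


def demo_translate(text: str) -> str:
    return {"hello": "bonjour"}.get(text, text)


def demo_synthesize(translation: str, ch: VoiceCharacteristics) -> str:
    return f"<audio text={translation!r} speed_ratio={ch.speech_speed}>"


output = interpret(b"", demo_recognize, demo_translate, demo_synthesize)
print(output)
```

Keeping the modules as injected callables mirrors the abstract's separation of recognition, translation, and synthesis; a real system would replace each stand-in with an actual engine.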
Opening claim text (preview).
What is claimed is:

1. An automatic interpretation system for generating a synthetic sound having characteristics similar to those of an original speaker's voice, the system comprising: a processor; and a non-transitory computer readable medium having computer executable instructions stored thereon which, when executed by the processor, performs the following method: generating text data, using a speech recognition module, by performing speech recognition for an original speech signal of an original speaker and extract one or more pieces of characteristic information among pitch information, vocal intensity information, speech speed information, and vocal tract characteristic information of an original speech; generating, using an automatic translation module, a synthesis-target translation by translating the text data; and generating, using a speech synthesis module, a synthetic sound of the synthesis-target translation, wherein the speech recognition module includes a speech speed extractor measuring a speech speed of the original speech signal in units of one or more of words, sentences, and intonation phrases, comparing the measured speech speed and an average speech speed, the average speech speed being based on numbers of syllables according to corresponding types of units and acquired from one or more previously built massive male and female conversational speech databases, and storing a ratio of the speech speed of the original speaker to the average speech speed based on a comparison result.

2. The automatic interpretation system of claim 1, wherein the speech recognition module further includes: a word and sentence extractor configured to extract words and sentences from the original speech signal and convert the extracted words and sentences into the text data; a pitch extractor configured to extract a pitch and a pitch trajectory from the original speech signal; a vocal intensity extractor configured to extract a vocal intensity from the original speech signal; and a vocal tract characteristic extractor configured to extract a vocal tract parameter from the original speech signal.

3. The automatic interpretation system of claim 2, wherein the pitch extractor additionally extracts prosody structures from the original speech signal according to intonation phrases.

4. The automatic interpretation system of claim 2, wherein the vocal intensity extractor compares the extracted vocal intensity with a gender-specific average vocal intensity acquired from one or more of previously built massive male and female conversational speech databases and stores a ratio of the vocal intensity of the original speaker to the average vocal intensity based on a comparison result.

5. The automatic interpretation system of claim 2, wherein the vocal tract characteristic extractor extracts at least one of characteristic parameters of a Mel-frequency cepstral coefficient (MFCC) and a glottal wave.

6. The automatic interpretation system of claim 1, wherein, when the automatic translation module is a rule-based machine translator, the automatic translation module extracts correspondence information in units of one or more of words, intonation phrases, and sentences corresponding to a language of the original speech and a language of the synthesis-target translation in a translation process.

7. The automatic interpretation system of claim 1, wherein, when the automatic translation module is a statistical machine translator, the automatic translation module extracts correspondence information in units of one or more of words, intonation phrases, and sentences using dictionary information and alignment information of a translation process or using results of chunking in units of words, phrases, and clauses.

8. The automatic interpretation system of claim 1, wherein the speech synthesis module further includes: a preprocessor configured to convert numbers and marks in the synthesis-target translation into characters; a pronunciation converter configured to convert pronunciations to correspond to the characters of the converted synthesis-target translation; and a synthetic sound generator configured to search for synthesis units of the synthesis-target translation that has been subjected to the prosody processing and generate the synthetic sound of the synthesis-target translation based on search results.

9. The automatic interpretation system of claim 8, wherein the synthetic sound generator generates the synthetic sound of the synthesis-target translation based on the speech speed information of the original speech signal, the vocal tract characteristic information of the original speech signal, or both.

10. A method of generating a synthetic sound having characteristics similar to those of an original speaker's voice in an automatic interpretation system, the method comprising: generating text data by performing speech recognition for an original speech signal of an original speaker and extracting one or more pieces of characteristic information among pitch information, vocal intensity information, speech speed information, and vocal tract characteristic information of the original speech signal; generating a synthesis-target translation by automatically translating the text data; and generating a synthetic sound of the synthesis-target translation, wherein the extracting of the one or more pieces of characteristic information includes: measuring a speech speed of the original speech signal in units of one or more of words, sentences, and intonation phrases; comparing the measured speech speed and an average speech speed, the average speech speed being based on numbers of syllables according to corresponding types of units and acquired from one or more previously built massive male and female conversational speech databases; and storing a ratio of the speech speed of the original speaker to the average speech speed based on a comparison result.

11. The method of claim 10, wherein the extracting of the one or more pieces of characteristic information further includes additionally extracting prosody structures from the original speech signal according to the intonation phrases.

12. The method of claim 10, wherein the comparison result is a first comparison result, and wherein the extracting of the one or more pieces of characteristic information further includes: comparing a vocal intensity with a gender-specific average vocal intensity acquired from the one or more previously built massive male and female conversational speech databases to generate a second comparison result; and storing a ratio of the vocal intensity of the original speaker to the average vocal intensity based on the second comparison result.

13. The method of claim 10, wherein the extracting of the one or more pieces of characteristic information further includes extracting at least one of characteristic parameters of a Mel-frequency cepstral coefficient (MFCC) and a glottal wave.

14. The method of claim 10, wherein in case of a rule-based machine translator, the generating of the synthesis-target translation includes extracting correspondence information in units of one or more of words, intonation phrases, and sentences corresponding to a language of the original speech and a language of a translation result in a translation process, and in case of a statistical machine translator, the generating of the synthesis-target translation includes extracting correspondence information in units of one or more of words, intonation phrases, and sentences using dictionary information and alignment information of the interpretation process or using results of chunk
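
The speech speed ratio recited in claims 1 and 10 can be illustrated with a small calculation. This is a minimal sketch under stated assumptions, not the claimed implementation: it assumes speech speed is expressed as syllables per second, and the `Unit` type, the `speed_ratio` helper, and the average values in `AVERAGE_SYLLABLES_PER_SECOND` are hypothetical placeholders rather than figures from the patent or from any real speech database.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical averages (syllables per second) per unit type. The claims say
# such averages come from previously built conversational speech databases;
# these numbers are placeholders for illustration only.
AVERAGE_SYLLABLES_PER_SECOND: Dict[str, float] = {
    "word": 4.0,
    "intonation_phrase": 5.0,
    "sentence": 5.0,
}


@dataclass
class Unit:
    """One measured unit of the original speech signal."""
    kind: str          # "word", "sentence", or "intonation_phrase"
    syllables: int     # number of syllables in the unit
    duration_s: float  # spoken duration of the unit in seconds


def speech_speed(unit: Unit) -> float:
    """Measured speed of the unit, assumed here to be syllables per second."""
    if unit.duration_s <= 0:
        raise ValueError("duration must be positive")
    return unit.syllables / unit.duration_s


def speed_ratio(
    unit: Unit,
    averages: Dict[str, float] = AVERAGE_SYLLABLES_PER_SECOND,
) -> float:
    """Ratio of the speaker's speed to the average for the same unit type."""
    return speech_speed(unit) / averages[unit.kind]


# A 12-syllable sentence spoken in 2 seconds: 6.0 syllables/s against an
# assumed average of 5.0 syllables/s.
ratio = speed_ratio(Unit(kind="sentence", syllables=12, duration_s=2.0))
print(round(ratio, 2))
```

A ratio above 1 would indicate a speaker faster than the assumed average, and a value below 1 a slower one. The vocal intensity ratio in claims 4 and 12 follows the same pattern, with a gender-specific average in place of the per-unit speed average.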
Pitch control · CPC title
Elementary speech units used in speech synthesisers; Concatenation rules · CPC title
specially adapted for particular use · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Rule-based translation · CPC title