Who is the assignee on this patent?

Electronics and Telecommunications Research Institute (recorded as "Electronics & Telecommunications Res Inst")

What technology area does this patent fall under?

Primary CPC classification G10L13/0335. Mapped technology areas include Physics.

When was this patent published?

Publication date Oct 23, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Automatic interpretation system and method for generating synthetic sound having characteristics similar to those of original speaker's voice

US10108606B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10108606-B2
Application number	US-201615214215-A
Country	US
Kind code	B2
Filing date	Jul 19, 2016
Priority date	Mar 3, 2016
Publication date	Oct 23, 2018
Grant date	Oct 23, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided are an automatic interpretation system and method for generating a synthetic sound having characteristics similar to those of an original speaker's voice. The automatic interpretation system for generating a synthetic sound having characteristics similar to those of an original speaker's voice includes a speech recognition module configured to generate text data by performing speech recognition for an original speech signal of an original speaker and extract at least one piece of characteristic information among pitch information, vocal intensity information, speech speed information, and vocal tract characteristic information of the original speech, an automatic translation module configured to generate a synthesis-target translation by translating the text data, and a speech synthesis module configured to generate a synthetic sound of the synthesis-target translation.
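The three modules named in the abstract can be read as a simple pipeline. The following is a minimal sketch of that data flow only, not the patented implementation: the names `VoiceCharacteristics`, `RecognitionResult`, and `interpret` are hypothetical, and passing the extracted characteristics on to the synthesizer is an assumption drawn from the title and claim 9 rather than from the abstract itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class VoiceCharacteristics:
    """Characteristic information listed in the abstract (hypothetical layout).

    The abstract requires "at least one" of these, so every field is optional.
    """
    pitch: Optional[float] = None
    vocal_intensity: Optional[float] = None
    speech_speed: Optional[float] = None
    vocal_tract: Optional[List[float]] = None


@dataclass
class RecognitionResult:
    """Output of the speech recognition module: text plus characteristics."""
    text: str
    characteristics: VoiceCharacteristics


def interpret(
    signal: Any,
    recognize: Callable[[Any], RecognitionResult],
    translate: Callable[[str], str],
    synthesize: Callable[[str, VoiceCharacteristics], Any],
) -> Any:
    """Chain recognition, translation, and synthesis in the order described."""
    result = recognize(signal)              # text data + characteristic information
    translation = translate(result.text)    # synthesis-target translation
    # Assumption: the synthesizer receives the speaker characteristics so the
    # output can resemble the original voice.
    return synthesize(translation, result.characteristics)
```

Each stage is injected as a callable, so any recognizer, translator, or synthesizer can be substituted without changing the flow.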

First claim

Opening claim text (preview).

What is claimed is:

1. An automatic interpretation system for generating a synthetic sound having characteristics similar to those of an original speaker's voice, the system comprising: a processor; and a non-transitory computer readable medium having computer executable instructions stored thereon which, when executed by the processor, performs the following method: generating text data, using a speech recognition module, by performing speech recognition for an original speech signal of an original speaker and extract one or more pieces of characteristic information among pitch information, vocal intensity information, speech speed information, and vocal tract characteristic information of an original speech; generating, using an automatic translation module, a synthesis-target translation by translating the text data; and generating, using a speech synthesis module, a synthetic sound of the synthesis-target translation, wherein the speech recognition module includes a speech speed extractor measuring a speech speed of the original speech signal in units of one or more of words, sentences, and intonation phrases, comparing the measured speech speed and an average speech speed, the average speech speed being based on numbers of syllables according to corresponding types of units and acquired from one or more previously built massive male and female conversational speech databases, and storing a ratio of the speech speed of the original speaker to the average speech speed based on a comparison result.

2. The automatic interpretation system of claim 1, wherein the speech recognition module further includes: a word and sentence extractor configured to extract words and sentences from the original speech signal and convert the extracted words and sentences into the text data; a pitch extractor configured to extract a pitch and a pitch trajectory from the original speech signal; a vocal intensity extractor configured to extract a vocal intensity from the original speech signal; and a vocal tract characteristic extractor configured to extract a vocal tract parameter from the original speech signal.

3. The automatic interpretation system of claim 2, wherein the pitch extractor additionally extracts prosody structures from the original speech signal according to intonation phrases.

4. The automatic interpretation system of claim 2, wherein the vocal intensity extractor compares the extracted vocal intensity with a gender-specific average vocal intensity acquired from one or more of previously built massive male and female conversational speech databases and stores a ratio of the vocal intensity of the original speaker to the average vocal intensity based on a comparison result.

5. The automatic interpretation system of claim 2, wherein the vocal tract characteristic extractor extracts at least one of characteristic parameters of a Mel-frequency cepstral coefficient (MFCC) and a glottal wave.

6. The automatic interpretation system of claim 1, wherein, when the automatic translation module is a rule-based machine translator, the automatic translation module extracts correspondence information in units of one or more of words, intonation phrases, and sentences corresponding to a language of the original speech and a language of the synthesis-target translation in a translation process.

7. The automatic interpretation system of claim 1, wherein, when the automatic translation module is a statistical machine translator, the automatic translation module extracts correspondence information in units of one or more of words, intonation phrases, and sentences using dictionary information and alignment information of a translation process or using results of chunking in units of words, phrases, and clauses.

8. The automatic interpretation system of claim 1, wherein the speech synthesis module further includes: a preprocessor configured to convert numbers and marks in the synthesis-target translation into characters; a pronunciation converter configured to convert pronunciations to correspond to the characters of the converted synthesis-target translation; and a synthetic sound generator configured to search for synthesis units of the synthesis-target translation that has been subjected to the prosody processing and generate the synthetic sound of the synthesis-target translation based on search results.

9. The automatic interpretation system of claim 8, wherein the synthetic sound generator generates the synthetic sound of the synthesis-target translation based on the speech speed information of the original speech signal, the vocal tract characteristic information of the original speech signal, or both.

10. A method of generating a synthetic sound having characteristics similar to those of an original speaker's voice in an automatic interpretation system, the method comprising: generating text data by performing speech recognition for an original speech signal of an original speaker and extracting one or more pieces of characteristic information among pitch information, vocal intensity information, speech speed information, and vocal tract characteristic information of the original speech signal; generating a synthesis-target translation by automatically translating the text data; and generating a synthetic sound of the synthesis-target translation, wherein the extracting of the one or more pieces of characteristic information includes: measuring a speech speed of the original speech signal in units of one or more of words, sentences, and intonation phrases; comparing the measured speech speed and an average speech speed, the average speech speed being based on numbers of syllables according to corresponding types of units and acquired from one or more previously built massive male and female conversational speech databases; and storing a ratio of the speech speed of the original speaker to the average speech speed based on a comparison result.

11. The method of claim 10, wherein the extracting of the one or more pieces of characteristic information further includes additionally extracting prosody structures from the original speech signal according to the intonation phrases.

12. The method of claim 10, wherein the comparison result is a first comparison result, and wherein the extracting of the one or more pieces of characteristic information further includes: comparing a vocal intensity with a gender-specific average vocal intensity acquired from the one or more previously built massive male and female conversational speech databases to generate a second comparison result; and storing a ratio of the vocal intensity of the original speaker to the average vocal intensity based on the second comparison result.

13. The method of claim 10, wherein the extracting of the one or more pieces of characteristic information further includes extracting at least one of characteristic parameters of a Mel-frequency cepstral coefficient (MFCC) and a glottal wave.

14. The method of claim 10, wherein in case of a rule-based machine translator, the generating of the synthesis-target translation includes extracting correspondence information in units of one or more of words, intonation phrases, and sentences corresponding to a language of the original speech and a language of a translation result in a translation process, and in case of a statistical machine translator, the generating of the synthesis-target translation includes extracting correspondence information in units of one or more of words, intonation phrases, and sentences using dictionary information and alignment information of the interpretation process or using results of chunk
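The speech speed comparison recited in claims 1 and 10 (measure the speaker's speed per unit, compare it with a database average, store the ratio) can be illustrated with a short sketch. This is only one plausible reading: the claims do not state the measurement unit, so syllables per second is an assumption here, and the function names and the example average are hypothetical values, not figures from the patent.

```python
def speech_speed(syllable_count: int, duration_s: float) -> float:
    """Speed of one unit (word, sentence, or intonation phrase).

    Assumption: speed is expressed in syllables per second.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return syllable_count / duration_s


def speed_ratio(syllable_count: int, duration_s: float, average_speed: float) -> float:
    """Ratio of the speaker's speed to the average speed for the same unit type.

    A value above 1.0 means the speaker is faster than the average.
    """
    if average_speed <= 0:
        raise ValueError("average_speed must be positive")
    return speech_speed(syllable_count, duration_s) / average_speed


# Hypothetical example: a 12-syllable sentence spoken in 2.0 s, compared with
# an assumed database average of 4.0 syllables per second for sentences.
ratio = speed_ratio(12, 2.0, average_speed=4.0)
print(ratio)  # 1.5
```

The same ratio form applies to the vocal intensity comparison in claims 4 and 12, with a gender-specific average intensity in place of the average speed.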

Assignees

Electronics & Telecommunications Res Inst

Inventors

Classifications

G10L13/0335 · Primary
Pitch control · CPC title
G10L13/06
Elementary speech units used in speech synthesisers; Concatenation rules · CPC title
G10L25/48
specially adapted for particular use · CPC title
G10L15/26
Speech to text systems (G10L15/08 takes precedence) · CPC title
G06F40/55 · Primary
Rule-based translation · CPC title

Patent family

Related publications grouped by family.

View patent family 59724267
