System and method for transcription of spoken words using multilingual mismatched crowd

US2018061417A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2018061417-A1
Application number: US-201715476505-A
Country: US
Kind code: A1
Filing date: Mar 31, 2017
Priority date: Aug 30, 2016
Publication date: Mar 1, 2018
Grant date: (none shown)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosure generally relates to transcription of spoken words, and more particularly to a system and method for transcription of spoken words using a multilingual mismatched crowd. The process comprises collecting multi-scripted noisy transcriptions of the spoken word from workers of the multilingual mismatched crowd. The collected transcriptions are mapped to a phoneme sequence in the source language using a script-specific grapheme-to-phoneme model. Further, the process builds transcription-script-specific, worker-specific, and global insertion-deletion-substitution (IDS) channel models from the multi-scripted transcriptions. Furthermore, the disclosure determines the reputation of workers in order to allocate the transcription task. Reputation is determined from word belief. The word belief is the ratio of the likelihood of the mapped phoneme sequences of the transcriptions given the current estimate of the word to the sum of the likelihoods of the mapped phoneme sequences of the transcriptions given the phoneme sequence of each dictionary word.
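The word-belief ratio described in the abstract can be illustrated with a minimal Python sketch. The function name `word_belief`, the placeholder dictionary words, and the likelihood values are all hypothetical and chosen only for illustration; the publication does not give code or numbers.

```python
def word_belief(likelihoods, current_estimate):
    """Ratio of the likelihood under the current word estimate to the
    sum of likelihoods over every dictionary word.

    likelihoods: dict mapping each dictionary word to the likelihood of
    the mapped phoneme sequences of the transcriptions given that word.
    """
    total = sum(likelihoods.values())
    if total == 0.0:
        # No dictionary word explains the transcriptions at all.
        return 0.0
    return likelihoods[current_estimate] / total


# Made-up likelihoods for three placeholder dictionary words.
likelihoods = {"word_a": 0.06, "word_b": 0.03, "word_c": 0.01}

# The current estimate is taken here as the maximum-likelihood word.
estimate = max(likelihoods, key=likelihoods.get)
belief = word_belief(likelihoods, estimate)
print(estimate, round(belief, 3))
```

A belief close to 1 means the transcriptions collected so far point clearly to one dictionary word; a low belief suggests the word may need more transcriptions.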

First claim

Opening claim text (preview).

What is claimed is:

1. A computer implemented method for transcribing one or more spoken word utterances of a source language using a multilingual mismatched crowd, the method comprises: collecting, at word transcription table, a plurality of multi-scripted noisy transcriptions of the spoken word obtained from plurality of workers of the multilingual mismatched crowd; mapping each of the collected plurality of multi-scripted transcriptions to a phoneme sequence in the source language using script specific graphemes to phoneme model; building worker specific insertion-deletion-substitution (IDS) channel model, multi-scripted transcription script specific IDS channel model and a global IDS channel model from the multi-scripted transcriptions; filtering out a set of workers of the plurality of workers based on the reputation of the workers, estimated by simulating IDS channel for worker specific on the dictionary words using worker reputation module; allocating the transcription tasks to the set of workers such that the required number of transcriptions per word are minimized; and decoding, at a transcription decoding module, the plurality of multi-scripted transcriptions are combined to decode the transcription in source script, wherein the decoding comprises steps of: finding likelihood probability of the mapped phoneme sequences of the multi-scripted mismatched crowd transcriptions with each of the predefined dictionary word's phoneme sequence using insertion-deletion-substitution channel parameters and voting the dictionary word that maximizes above likelihood; and determining word belief by taking ratio of likelihood probability of mapped phoneme sequences of transcriptions given the current estimate of word to the sum of likelihood probabilities of mapped phoneme sequences of the transcriptions given the phoneme sequence of each dictionary word.

2. The method of claim 1, wherein the building of the worker specific IDS channel model using the gold standard transcription test.

3. The method of claim 1, wherein the multilingual mismatched crowd includes the plurality of workers who are having different script for the same word, unfamiliar to the source language and transcribing the given words of the source language in their own language.

4. The method of claim 1, wherein the multi-scripted transcriptions of words, whose ground truth phoneme sequences are known in source language, are used to train worker script's grapheme to source language's phoneme mapping models using expectation maximization algorithm.

5. The method of claim 4, further wherein with the help of ground truth phoneme sequences, the worker specific, the transcription script specific and the global IDS channel models are trained using expectation maximization algorithm.

6. The method of claim 1, wherein the multi-script transcriptions of words whose ground truths are not known are first mapped to phoneme sequence using G2P model, and with the help of phoneme sequences of dictionary words the worker specific, the transcription script specific and the global IDS channel models are trained in iterative fashion using expectation maximization algorithm.

7. The method of claim 1, wherein the estimation of worker reputation is based on transcribed words whose ground truth is known and task allocation follows the estimated worker reputation.

8. The method of claim 1, wherein the task allocation utilizes the bipartite matching algorithm to allocate the tasks to worker such that average word belief is maximized.

9. The method of claim 1, wherein the likelihood probability of a mapped phoneme sequence of the plurality of worker's transcription of the spoken word given a phoneme sequence of dictionary word is obtained by aligning both phoneme sequences using the linearly combined IDS channel parameters of worker, his/her script and global.

10. The method of claim 1, wherein the likelihood probabilities of each mapped phoneme sequences of multi-scripted transcriptions of a word with a predefined dictionary word's phoneme sequences are obtained and multiplied so as to obtain the likelihood probability. Further wherein, the predefined dictionary word that provides maximum likelihood probability is considered as decoded word.

11. A system for transcribing one or more spoken word utterances of a source language using a multilingual mismatched crowd, the system comprising: a processor; a memory communicatively coupled to the processor and the memory contains instructions that are readable by the processor; a database parted within the memory, wherein the database comprises an audio chunk table and a word transcription table; a plurality of typing interfaces are configured according to script preference of each of the plurality of workers of mismatched crowd; a reputation module is configured to compute the worker reputation and filter out the spammer from the plurality of workers; a task allocation module is configured to compute word beliefs and to allocate the transcription tasks to the plurality of workers, wherein the reputation of the plurality of workers is estimated by simulating the worker specific IDS channel on dictionary words; and a transcription decoding module is configured to generate transcription in the source language from multi-transcriptions of the plurality of workers.

12. The system of claim 11, wherein the audio chunk table is configured to store one or more information of the plurality of workers, one or more spoken word segments of the each of the plurality of workers, number of responses given by each of the plurality of workers and transcription score of each of the plurality of workers;

13. The system of claim 11, wherein the word transcription table is configured to store transcription responses of the spoken word segments presented to the plurality of workers, the audio chunk id, the each of the plurality of workers id, and the workers transcription text;

14. The system of claim 11, wherein the transcription decoding module is configured to invoke a grapheme to phoneme model for mapping the received multi-transcriptions with the phoneme sequence of the source language.

15. The system of claim 11, wherein the transcription decoding module is configured to compute likelihood by aligning phoneme sequence of predefined dictionary with the phoneme sequence of multi-transcriptions of the plurality of workers.

16. The system of claim 11, wherein the transcription decoding module is configured to invoke insertion-deletion-substitution channel model for decoding the word transcription in the source language from multi-scripted transcriptions.

17. The system of claim 11, wherein the task allocation module is to decide the word that needs more transcriptions from its belief probability.

18. A non-transitory computer readable medium embodying a program executable in a computing device for transcribing one or more spoken word utterances of a source language using a multilingual mismatched crowd, the program comprising: a program code for collecting, at word transcription table, a plurality of multi-scripted noisy transcriptions of the spoken word obtained from plurality of workers of the multilingual mismatched crowd; mapping each of the collected plurality of multi-scripted transcriptions to a phoneme sequence in the source language using script specific graphemes to phoneme model; building worker specific insertion-deletion-substitution (IDS) channel model, multi-scripted transcription script specific IDS channel
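The decoding steps recited in claims 1, 9, and 10 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the channel is simplified to a memoryless one with three scalar probabilities, the likelihood is a forward-style dynamic program that sums over insertion, deletion, and substitution alignments (and is not guaranteed to be a normalized probability), and the names `IDSParams`, `combine`, `likelihood`, `decode`, the mixing weights, the alphabet size, and all phoneme data are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class IDSParams:
    p_ins: float  # probability of inserting a spurious phoneme
    p_del: float  # probability of deleting a reference phoneme
    p_sub: float  # probability of substituting a reference phoneme


def combine(worker, script, global_, weights=(0.5, 0.3, 0.2)):
    """Linear combination of worker, script, and global parameters.
    The weights are an assumption; the claims do not state them."""
    w, s, g = weights
    return IDSParams(
        w * worker.p_ins + s * script.p_ins + g * global_.p_ins,
        w * worker.p_del + s * script.p_del + g * global_.p_del,
        w * worker.p_sub + s * script.p_sub + g * global_.p_sub,
    )


def likelihood(observed, reference, params, alphabet_size):
    """Sum over all IDS alignments of `observed` against `reference`."""
    p_match = 1.0 - params.p_del - params.p_sub
    n, m = len(reference), len(observed)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:  # reference phoneme deleted
                total += f[i - 1][j] * params.p_del
            if j > 0:  # observed phoneme inserted, uniform over alphabet
                total += f[i][j - 1] * params.p_ins / alphabet_size
            if i > 0 and j > 0:  # match or substitution
                if reference[i - 1] == observed[j - 1]:
                    total += f[i - 1][j - 1] * p_match
                else:
                    total += f[i - 1][j - 1] * params.p_sub / (alphabet_size - 1)
            f[i][j] = total
    return f[n][m]


def decode(observations, dictionary, alphabet_size):
    """Multiply per-transcription likelihoods for each dictionary word,
    pick the maximizing word, and compute its word belief."""
    scores = {
        word: prod(likelihood(obs, phones, p, alphabet_size)
                   for obs, p in observations)
        for word, phones in dictionary.items()
    }
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    return best, (scores[best] / total if total else 0.0)


# Made-up channel parameters and phoneme data for illustration only.
combined = combine(IDSParams(0.08, 0.06, 0.15),
                   IDSParams(0.05, 0.05, 0.10),
                   IDSParams(0.02, 0.04, 0.05))
dictionary = {
    "word_a": ("p", "a", "n", "i"),
    "word_b": ("r", "a", "n", "i"),
    "word_c": ("p", "a", "n"),
}
# Three mapped phoneme sequences; one combined channel is reused for brevity.
observations = [(("p", "a", "n", "i"), combined),
                (("p", "a", "n"), combined),
                (("b", "a", "n", "i"), combined)]
best, belief = decode(observations, dictionary, alphabet_size=40)
print(best, belief)
```

In a fuller version each observation would carry the combined parameters of the worker who produced it, and the parameters would be estimated (for example by expectation maximization, as claims 4 to 6 describe) rather than set by hand.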

Assignees

Tata Consultancy Services Ltd

Inventors

Classifications

  • G10L15/26 · Primary

    Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Language recognition · CPC title

  • Training · CPC title

  • Database cache management · CPC title

  • Programmable keyboards (key guide holders G06F3/0224) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2018061417A1 cover?
The disclosure generally relates to transcription of spoken words, and more particularly to a system and method for transcription of spoken words using a multilingual mismatched crowd. The process comprises collection of multi-scripted noisy transcriptions of the spoken word obtained from workers of the multilingual mismatched crowd. The collected words are mapped to a phoneme sequence in the sou…
Who is the assignee on this patent?
Tata Consultancy Services Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/26. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 1, 2018 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).