Synthetic data generation for training of natural language understanding models

US11508360B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11508360-B2
Application number  US-202017021892-A
Country             US
Kind code           B2
Filing date         Sep 15, 2020
Priority date       Sep 15, 2020
Publication date    Nov 22, 2022
Grant date          Nov 22, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This document relates to machine learning. One example includes a method or technique that can be performed on a computing device. The method or technique can include obtaining a task-adapted generative model that has been tuned using one or more task-specific seed examples. The method or technique can also include inputting dialog acts into the task-adapted generative model and obtaining synthetic utterances that are output by the task-adapted generative model. The method or technique can also include populating a synthetic training corpus with synthetic training examples that include the synthetic utterances. The synthetic training corpus may be suitable for training a natural language understanding model.
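
The flow the abstract describes (dialog acts in, synthetic utterances out, corpus populated with the pairs) can be sketched as follows. This is only an illustration under stated assumptions: `DialogAct`, `TrainingExample`, `populate_corpus`, and `placeholder_generator` are hypothetical names, and the placeholder function merely stands in for the task-adapted generative model, which the abstract does not specify at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class DialogAct:
    """Hypothetical container: an intent value plus (slot name, slot value) pairs."""
    intent: str
    slots: Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class TrainingExample:
    """A synthetic utterance paired with the dialog act that produced it."""
    dialog_act: DialogAct
    utterance: str


def populate_corpus(
    generate: Callable[[DialogAct], str],
    dialog_acts: List[DialogAct],
    per_act: int = 1,
) -> List[TrainingExample]:
    """Feed each dialog act to the generator and collect labeled examples."""
    corpus = []
    for act in dialog_acts:
        for _ in range(per_act):
            corpus.append(TrainingExample(dialog_act=act, utterance=generate(act)))
    return corpus


def placeholder_generator(act: DialogAct) -> str:
    """Assumption: a trivial stand-in for the task-adapted generative model."""
    slot_text = " ".join(f"{name} {value}" for name, value in act.slots)
    return f"{act.intent} {slot_text}".strip()


acts = [DialogAct("book_table", (("cuisine", "thai"), ("party_size", "4")))]
corpus = populate_corpus(placeholder_generator, acts, per_act=2)
for example in corpus:
    print(example.dialog_act.intent, "->", example.utterance)
```

Because each example keeps the dialog act that produced its utterance, the corpus is already labeled and could be passed to a separate training step for a natural language understanding model.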

First claim

Opening claim text (preview).

The invention claimed is:

1. A method comprising: obtaining a task-adapted generative model that has been tuned using one or more task-specific seed examples; inputting dialog acts into the task-adapted generative model; obtaining synthetic utterances that are output by the task-adapted generative model; and populating a synthetic training corpus with synthetic training examples that include the synthetic utterances, the synthetic training corpus suitable for training a natural language understanding model.

2. The method of claim 1, wherein each of the synthetic training examples comprise a particular synthetic utterance and a particular dialog act that was input to the task-adapted generative model to generate the particular synthetic utterance.

3. The method of claim 2, wherein the dialog acts comprising intent values and slot values.

4. The method of claim 3, wherein obtaining the synthetic utterances comprises sampling tokens from an output distribution of the task-adapted generative model.

5. The method of claim 1, further comprising: training the natural language understanding model using the synthetic training corpus.

6. The method of claim 5, wherein the task-adapted generative model comprises one or more transformer decoders and the natural language understanding model comprises one or more transformer encoders.

7. The method of claim 1, further comprising: receiving a request to train the natural language understanding model; receiving the task-specific seed examples for generating the natural language understanding model; determining whether additional task-specific examples are appropriate for training the natural language understanding model; and populating the synthetic training corpus in an instance when additional task-specific examples are determined to be appropriate for generating the natural language understanding model.

8. The method of claim 7, further comprising: outputting an offer to generate the synthetic training corpus responsive to a determination that additional task-specific examples are appropriate for generating the natural language understanding model; and populating the synthetic training corpus responsive to acceptance of the offer.

9. A system comprising: a processor; and a storage medium storing instructions which, when executed by the processor, cause the processor to: using a task-adapted generative model tuned for a particular task, generate synthetic training examples for the particular task; and populate a synthetic training corpus with the synthetic training examples.

10. The system of claim 9, wherein the instructions, when executed by the processor, cause the processor to: sample predicted next tokens from an output distribution of the task-adapted generative model to provide a diverse set of synthetic training examples.

11. The system of claim 10, wherein the instructions, when executed by the processor, cause the processor to: receive input designating a requested diversity of the synthetic training examples; and sample the output distribution based at least on the requested diversity.

12. The system of claim 11, wherein the instructions, when executed by the processor, cause the processor to: select a specified number of predicted next tokens from the output distribution based at least on the requested diversity.

13. The system of claim 11, wherein the instructions, when executed by the processor, cause the processor to: select predicted next tokens having respective probabilities above a probability threshold from the output distribution, the probability threshold corresponding to the requested diversity.

14. The system of claim 11, wherein the instructions, when executed by the processor, cause the processor to: identify a defined set of slot values for the synthetic training corpus; and filter out synthetic training examples produced by the task-adapted generative model that lack corresponding slot values from the defined set.

15. The system of claim 11, wherein the instructions, when executed by the processor, cause the processor to: train a natural language understanding model using the synthetic training corpus.

16. A method comprising: obtaining a pretrained generative model that has been pretrained using a first training data set having unlabeled training examples; semantically conditioning the pretrained generative model based at least on a second training data set having dialog act-labeled utterances to obtain a semantically-conditioned generative model; tuning the semantically-conditioned generative model using a third training data set having task-specific seed examples to obtain a task-adapted generative model; and outputting the task-adapted generative model.

17. The method of claim 16, wherein the semantically conditioning comprises: inputting individual dialog acts from the second training data set to the pretrained generative model and training the pretrained generative model to generate corresponding utterances that are labeled with the individual dialog acts.

18. The method of claim 17, wherein the tuning comprises: inputting individual task-specific dialog acts from the third training data set to the semantically-conditioned generative model and training the semantically-conditioned generative model to generate corresponding task-specific utterances that are labeled with the individual task-specific dialog acts.

19. The method of claim 18, wherein the semantically conditioning and the tuning comprise performing next token prediction.

20. The method of claim 16, wherein the third training data set includes slot labels that are not present in the second training data set.
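
Claims 10 to 14 describe sampling next tokens with a requested diversity and filtering examples by slot value. The sketch below illustrates one possible reading and is not the patented implementation. The function names are hypothetical, the output distribution is a hand-written dictionary rather than a real model output, and the mapping from "requested diversity" to `k` or `min_prob` is an assumption. The slot filter reflects one interpretation of claim 14, keeping only examples whose slot values all come from the defined set.

```python
import random
from typing import Dict, Set


def top_k_filter(dist: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize (compare claim 12)."""
    top = sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {token: prob / total for token, prob in top}


def threshold_filter(dist: Dict[str, float], min_prob: float) -> Dict[str, float]:
    """Keep tokens whose probability exceeds min_prob (compare claim 13)."""
    kept = {token: prob for token, prob in dist.items() if prob > min_prob}
    if not kept:
        # Assumption: fall back to the single most probable token.
        best = max(dist, key=dist.get)
        kept = {best: dist[best]}
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


def sample_token(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(dist)
    weights = [dist[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def has_allowed_slots(slots: Dict[str, str], allowed: Dict[str, Set[str]]) -> bool:
    """True if every slot value belongs to the defined set for its slot name."""
    return all(value in allowed.get(name, set()) for name, value in slots.items())


# Toy next-token distribution (made-up numbers, for illustration only).
next_token_dist = {"thai": 0.5, "italian": 0.3, "xyzzy": 0.2}
rng = random.Random(0)

# Lower requested diversity -> smaller k or higher threshold.
narrow = top_k_filter(next_token_dist, k=2)
print(sample_token(narrow, rng))

allowed_values = {"cuisine": {"thai", "italian"}}
print(has_allowed_slots({"cuisine": "thai"}, allowed_values))
print(has_allowed_slots({"cuisine": "xyzzy"}, allowed_values))
```

Both filters narrow the distribution before sampling, so a smaller `k` or a higher `min_prob` yields less varied utterances, while the slot check runs afterward on each generated example.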

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G10L15/063 · Primary

    Training · CPC title

  • Parsing for meaning understanding · CPC title

  • Semantic analysis · CPC title

  • Natural language generation · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11508360B2 cover?
This document relates to machine learning. One example includes a method or technique that can be performed on a computing device. The method or technique can include obtaining a task-adapted generative model that has been tuned using one or more task-specific seed examples. The method or technique can also include inputting dialog acts into the task-adapted generative model and obtaining synth…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 22, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).