Synthetic data generation for training of natural language understanding models

US11875787B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11875787-B2
Application number  US-202217963766-A
Country             US
Kind code           B2
Filing date         Oct 11, 2022
Priority date       Sep 15, 2020
Publication date    Jan 16, 2024
Grant date          Jan 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This document relates to machine learning. One example includes a method or technique that can be performed on a computing device. The method or technique can include obtaining a task-semantically-conditioned generative model that has been pretrained based at least on a first training data set having unlabeled training examples and semantically conditioned based at least on a second training data set having dialog act-labeled utterances. The method or technique can also include inputting dialog acts into the semantically-conditioned generative model and obtaining synthetic utterances that are output by the semantically-conditioned generative model. The method or technique can also include outputting the synthetic utterances.
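The flow the abstract describes (dialog acts in, labeled synthetic utterances out) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `DialogAct` class, the `intent ( slot = value )` prompt layout, and `generate_corpus` are hypothetical names chosen for this example, and `toy_generator` is a template-based stand-in for the semantically-conditioned generative model, which is not reproduced here.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class DialogAct:
    """An intent plus slot name/value pairs (hypothetical representation)."""
    intent: str
    slots: Tuple[Tuple[str, str], ...]


def linearize(act: DialogAct) -> str:
    """Flatten a dialog act into a text prefix a decoder could be conditioned on.

    The "intent ( slot = value ; ... )" layout is an assumption for this sketch.
    """
    slot_text = " ; ".join(f"{name} = {value}" for name, value in act.slots)
    return f"{act.intent} ( {slot_text} )"


def toy_generator(prompt: str, rng: random.Random) -> str:
    """Stand-in for the conditioned model: fills a random fixed template.

    A real system would decode tokens from a trained model instead.
    """
    inside = prompt[prompt.index("(") + 1 : prompt.rindex(")")]
    values = [pair.split("=", 1)[1].strip() for pair in inside.split(";") if "=" in pair]
    templates = ["I would like {v} please", "Can I get {v}?", "{v} for me, thanks"]
    return rng.choice(templates).format(v=" and ".join(values))


def generate_corpus(
    acts: List[DialogAct],
    generator: Callable[[str, random.Random], str],
    per_act: int,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Generate `per_act` utterances per dialog act, each labeled with its act."""
    rng = random.Random(seed)
    corpus: List[Dict[str, object]] = []
    for act in acts:
        prompt = linearize(act)
        for _ in range(per_act):
            corpus.append(
                {"text": generator(prompt, rng), "intent": act.intent, "slots": dict(act.slots)}
            )
    return corpus


if __name__ == "__main__":
    acts = [DialogAct("order_food", (("food", "pizza"), ("size", "large")))]
    for example in generate_corpus(acts, toy_generator, per_act=3):
        print(example)
```

Because every generated utterance carries the dialog act that produced it, the output can serve directly as labeled training data for a downstream natural language understanding model.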

First claim

Opening claim text (preview).

The invention claimed is:

1. A method comprising: obtaining a pretrained generative model that has been pretrained using a first training data set having unlabeled training examples; semantically conditioning the pretrained generative model based at least on a second training data set having dialog act-labeled utterances to obtain a semantically-conditioned generative model; and outputting the semantically-conditioned generative model.
2. The method of claim 1, wherein the semantically-conditioned generative model comprises one or more transformer decoders that are semantically conditioned based at least on the dialog act-labeled utterances.
3. The method of claim 2, wherein semantically conditioning the pretrained generative model comprises: inputting, to the pretrained generative model, a token from a particular utterance from the second training data set that is labeled with a particular dialog act; predicting a subsequent token of the particular utterance using the pretrained generative model; and adjusting parameters of the one or more transformer decoders based at least on whether the predicted subsequent token matches an actual subsequent token of the particular utterance.
4. The method of claim 3, wherein the particular dialog act includes a particular intent value and a particular slot value, and the predicting of the subsequent token is conditioned on the particular intent value and the particular slot value.
5. The method of claim 1, wherein the second training data set includes dialog act-labeled utterances from a plurality of different task domains.
6. The method of claim 5, wherein the plurality of different task domains includes at least a food ordering domain and a travel domain.
7. The method of claim 1, wherein the semantically-conditioned generative model is adapted, by the semantically conditioning, to conduct dialogs with users.
8. The method of claim 7, wherein the semantically-conditioned generative model is adapted to output synthetic training examples suitable for populating a synthetic training corpus.
9. A system comprising: a processor; and a storage medium storing instructions which, when executed by the processor, cause the processor to: using a semantically-conditioned generative model trained to conduct dialogs with users, generate synthetic training examples; and output the synthetic training examples into a synthetic training corpus.
10. The system of claim 9, wherein the instructions, when executed by the processor, cause the processor to: receive a specific dialog act comprising a specific intent value and a specific slot value; and using the semantically-conditioned generative model, generate multiple different synthetic training examples labeled with the specific dialog act.
11. The system of claim 10, wherein the instructions, when executed by the processor, cause the processor to: select individual tokens of the multiple different synthetic training examples from an output distribution of the semantically-conditioned generative model based at least on the individual tokens having respective probabilities exceeding a threshold; and discard at least one other token from the output distribution having a probability that does not exceed the threshold.
12. The system of claim 11, wherein the instructions, when executed by the processor, cause the processor to: receive user input identifying a defined set of slot values for the synthetic training corpus; and filter out synthetic training examples produced by the semantically-conditioned generative model that lack corresponding slot values from the defined set.
13. The system of claim 10, wherein the instructions, when executed by the processor, cause the processor to: select a specified number of top tokens from an output distribution of the semantically-conditioned generative model to obtain the multiple different synthetic training examples; and discard at least one other token from the output distribution.
14. The system of claim 10, wherein the instructions, when executed by the processor, cause the processor to: train a natural language understanding model using the synthetic training corpus.
15. A method comprising: obtaining a semantically-conditioned generative model that has been pretrained based at least on a first training data set having unlabeled training examples and semantically conditioned based at least on a second training data set having dialog act-labeled utterances; inputting dialog acts into the semantically-conditioned generative model; obtaining synthetic utterances that are output by the semantically-conditioned generative model; and outputting the synthetic utterances.
16. The method of claim 15, wherein the semantically-conditioned generative model comprises one or more transformer decoders that have been semantically conditioned based at least on the dialog act-labeled utterances.
17. The method of claim 16, wherein the semantically-conditioned generative model comprises at least two transformer decoders each having a feed-forward layer and a masked self-attention layer.
18. The method of claim 17, further comprising: receiving input specifying a set of dialog acts to input to the semantically-conditioned generative model.
19. The method of claim 18, further comprising: outputting a graphical user interface having a field for specifying the set of dialog acts; and receiving the input via the field of the graphical user interface.
20. The method of claim 17, further comprising: sampling predicted next tokens from an output distribution of the semantically-conditioned generative model to obtain the synthetic utterances.
21. The method of claim 20, further comprising: receiving input designating a requested diversity of the synthetic utterances; and sampling the output distribution based at least on the requested diversity.
22. The method of claim 21, further comprising: outputting a graphical user interface having a field for specifying the requested diversity; and receiving the input via the field of the graphical user interface.
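The token-selection and filtering steps in claims 11, 12, 13, and 21 can be illustrated with a small sketch over a single next-token distribution. Several choices here are assumptions rather than anything the claims specify: using softmax temperature as the "requested diversity" control, renormalizing after tokens are discarded, and reading the claim 12 filter as "keep examples whose labeled slot value is in the user-defined set". All function names are hypothetical.

```python
import math
import random
from typing import Dict, Iterable, List, Set


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn logits into probabilities; a higher temperature flattens the distribution.

    Temperature as the diversity knob is an assumption for this sketch.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no tokens survived filtering")
    return {tok: p / total for tok, p in probs.items()}


def keep_above_threshold(probs: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Claim 11 style: keep tokens whose probability exceeds the threshold."""
    return renormalize({tok: p for tok, p in probs.items() if p > threshold})


def keep_top_k(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Claim 13 style: keep the k most probable tokens and discard the rest."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    return renormalize(dict(top))


def sample_token(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


def filter_by_slot_values(examples: Iterable[Dict[str, str]], allowed: Set[str]) -> List[Dict[str, str]]:
    """Claim 12 style (one possible reading): drop examples whose slot value is not allowed."""
    return [ex for ex in examples if ex["slot_value"] in allowed]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = {"pizza": 2.0, "pasta": 1.0, "soup": 0.0}  # made-up values
    probs = softmax_with_temperature(logits, temperature=1.0)
    print(sample_token(keep_top_k(probs, 2), rng))
    print(sample_token(keep_above_threshold(probs, 0.2), rng))
```

In a full decoder these filters would be applied at every generation step, so repeated sampling from the same dialog act yields multiple different utterances, as claim 10 describes.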

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G10L15/18 · Primary

    using natural language modelling · CPC title

  • Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title

  • Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • G10L15/063 · Primary

    Training · CPC title

  • Parsing for meaning understanding · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11875787B2 cover?
It covers generating synthetic training data for natural language understanding models. A generative model is pretrained on unlabeled training examples and then semantically conditioned on dialog act-labeled utterances; dialog acts are input to the conditioned model, which outputs synthetic utterances. See the Abstract section for the official text.
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/18. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 16, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).