US11699434B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11699434-B2 |
| Application number | US-202017112670-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 4, 2020 |
| Priority date | Dec 4, 2020 |
| Publication date | Jul 11, 2023 |
| Grant date | Jul 11, 2023 |
Official abstract text for this publication.
Embodiments provide for improved data sequence validity processing, for example to determine validity of sentences or other language within a particular language domain. Such improved processing is useful at least for arranging data sequences based on determined validity, and/or making determinations and/or performing actions based on the determined validity. A determined probability (e.g., transformed into the perplexity space) of each token appearing in a data sequence is used in any of a myriad of manners to perform such data sequence validity processing. Example embodiments provide for generating a perplexity value set for each data sequence in a plurality of data sequences, generating a probabilistic ranking set for the plurality of data sequences based on the perplexity value sets and at least one sequence ranking metric, and generating an arrangement of the plurality of data sequences based on the probabilistic ranking set.
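The flow the abstract describes (per-token perplexity values, a ranking derived from them, then an arrangement) can be illustrated with a minimal sketch. It is not the patented implementation. It assumes each token's perplexity is the inverse of the probability the language model assigns to it, and it uses the average sequence perplexity of claim 6 as the ranking metric. The function names and the probability values are hypothetical and stand in for a trained language model's output.

```python
from statistics import mean


def token_perplexities(token_probs):
    """Map per-token probabilities to per-token perplexity values.

    Assumption for illustration: perplexity of a token is 1 / p.
    """
    return [1.0 / p for p in token_probs]


def average_sequence_perplexity(perplexities):
    """Mean of the per-token perplexity values of one data sequence."""
    return mean(perplexities)


def arrange_by_average_perplexity(sequences):
    """Return sequence texts ordered from lowest to highest average perplexity.

    `sequences` maps each sequence text to its list of per-token probabilities.
    """
    scored = {
        text: average_sequence_perplexity(token_perplexities(probs))
        for text, probs in sequences.items()
    }
    return sorted(scored, key=scored.get)


# Hypothetical per-token probabilities; a real system would obtain them
# from a language model trained on the relevant language domain.
candidates = {
    "patient has a fever": [0.5, 0.25, 0.5, 0.25],
    "fever a has patient": [0.1, 0.05, 0.1, 0.05],
}
arrangement = arrange_by_average_perplexity(candidates)
print(arrangement)  # -> ['patient has a fever', 'fever a has patient']
```

Sequences with lower average perplexity appear first, so the likeliest-valid sentence leads the arrangement and candidates near the end can be flagged or excluded, as in claims 3 and 4.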
Opening claim text (preview).
What is claimed is:
1. An apparatus comprising at least one processor and at least one memory, the at least one memory having computer-coded instructions stored thereon, wherein the computer-coded instructions, in execution with the at least one processor, configures the apparatus to: for each data sequence of a plurality of data sequences, each data sequence comprising a token sequence: generate, utilizing a language model, a perplexity value set associated with the data sequence, wherein the perplexity value set comprises a perplexity value for each data token in the token sequence of the data sequence, wherein the language model comprises a trained machine learning model configured to generate the perplexity value for each data token; and generate a probabilistic ranking set for the plurality of data sequences, the probabilistic ranking set including a probabilistic ranking for each data sequence in the plurality of data sequences, and the probabilistic ranking set generated based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence of the plurality of data sequences, wherein generating the probabilistic ranking set comprises: generating a bucket-based sequence perplexity value set including a bucket-based sequence perplexity value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences: determining an unacceptable bucket token count associated with the data sequence; determining the bucket-based sequence perplexity values for the data sequence based at least in part on the unacceptable bucket token count associated with the data sequence; and generating the probabilistic ranking set based at least in part on the bucket-based sequence perplexity value set; and generate an arrangement of the plurality of data sequences based at least in part on the probabilistic ranking set.
2. The apparatus according to claim 1, the apparatus further configured to: provide the arrangement of the plurality of data sequences to a client device for output configured for rendering via a display interface of the client device or audio output from the client device.
3. The apparatus according to claim 1, the apparatus further configured to: identify, based at least in part on the arrangement of the plurality of data sequences, at least one invalid data sequence from the plurality of data sequences.
4. The apparatus according to claim 1, the apparatus further configured to: exclude at least one data sequence from the plurality of data sequences based at least in part on the arrangement of the plurality of data sequences.
5. The apparatus according to claim 1, wherein the language model is trained on a domain-specific set of language training data.
6. The apparatus according to claim 1, wherein to generate the probabilistic ranking set for the plurality of data sequences based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence, the apparatus is configured to: generate an average sequence perplexity value set including an average sequence perplexity value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences: determining the average sequence perplexity value for the data sequence, wherein the average sequence perplexity value represents a mean value based at least in part on the perplexity value for each data token in the token sequence of the data sequence; and generate the probabilistic ranking set based at least in part on the average sequence perplexity value set.
7. The apparatus according to claim 1, wherein to generate the probabilistic ranking set for the plurality of data sequences based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence the apparatus is configured to: generate an area violating threshold value set including an area violating threshold value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences: determining the area violating threshold value for the data sequence, wherein the area violating threshold value is based at least in part on the perplexity value set for the data sequence and an unacceptable perplexity threshold; and generating the probabilistic ranking set based at least in part on the area violating threshold value set.
8. The apparatus according to claim 1, wherein the probabilistic ranking set is determined utilizing the equation: $a = \frac{1}{m} \sum_{i=0}^{n} (X^{i})\, C_{i}$, wherein $X$ represents a number greater than one, $a$ represents the probabilistic ranking for a particular data sequence of the plurality of data sequences, $m$ represents a number of tokens in the particular data sequence, $i$ represents an order of unacceptable buckets, $n$ represents a number of unacceptable buckets minus 1, and $C_{i}$ represents a number of tokens in an unacceptable bucket represented by $i$.
9. The apparatus according to claim 1, wherein the language model is language agnostic and direction agnostic.
10. The apparatus according to claim 1, further configured to: collect a set of training data sequences associated with a language domain, wherein the set of training data sequences is collected from one or more external computing devices associated with the language domain; and train the language model based at least in part on the set of training data.
11. A computer-implemented method comprising: for each data sequence of a plurality of data sequences, each data sequence comprising a token sequence: generating, utilizing a language model, a perplexity value set associated with the data sequence, wherein the perplexity value set comprises a perplexity value for each data token in the token sequence of the data sequence, wherein the language model comprises a trained machine learning model configured to generate the perplexity value for each data token; and generating a probabilistic ranking set for the plurality of data sequences, the probabilistic ranking set including a probabilistic ranking for each data sequence in the plurality of data sequences, and the probabilistic ranking set generated based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence of the plurality of data sequences, wherein generating the probabilistic ranking set comprises: generating a bucket-based sequence perplexity value set including a bucket-based sequence perplexity value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences: determining an unacce
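Claims 7 and 8 name two further sequence arrangement metrics, and a short sketch may make them concrete. It reads the claim 8 equation as a = (1/m) · Σ Xⁱ · Cᵢ for i from 0 to n, so tokens in higher-order unacceptable buckets are penalized exponentially. The bucket boundaries, the default X = 2, and the choice to compute the claim 7 "area violating threshold" as the summed excess of each token's perplexity over the threshold are all assumptions made for illustration; the claim text shown here does not fix them.

```python
def bucket_counts(perplexities, bucket_edges):
    """Count tokens falling in each unacceptable bucket.

    `bucket_edges` holds ascending lower bounds (assumed values). Bucket i
    covers [edges[i], edges[i + 1]); the last bucket is open-ended. Tokens
    below edges[0] are treated as acceptable and are not counted.
    """
    counts = [0] * len(bucket_edges)
    for ppl in perplexities:
        # Scan from the highest bucket down to find where the token lands.
        for i in reversed(range(len(bucket_edges))):
            if ppl >= bucket_edges[i]:
                counts[i] += 1
                break
    return counts


def bucket_based_score(perplexities, bucket_edges, x=2.0):
    """Claim 8 equation as read here: a = (1/m) * sum(x**i * C_i)."""
    counts = bucket_counts(perplexities, bucket_edges)
    m = len(perplexities)
    return sum((x ** i) * c for i, c in enumerate(counts)) / m


def area_violating_threshold(perplexities, threshold):
    """Assumed reading of claim 7: total perplexity in excess of the threshold."""
    return sum(max(0.0, ppl - threshold) for ppl in perplexities)


# Hypothetical per-token perplexity values for one four-token sequence.
ppl_values = [3.0, 12.0, 55.0, 150.0]
edges = [10.0, 50.0, 100.0]  # assumed lower bounds of three unacceptable buckets

print(bucket_counts(ppl_values, edges))            # -> [1, 1, 1]
print(bucket_based_score(ppl_values, edges))       # -> 1.75
print(area_violating_threshold(ppl_values, 10.0))  # -> 187.0
```

Under this reading, a sequence whose tokens are all acceptable scores 0 on both metrics. A single token in a high-order bucket raises the bucket-based score more than several tokens in the lowest bucket, which is the effect of requiring X to be greater than one.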
using probabilistic model · CPC title
using context dependencies, e.g. language models · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Training · CPC title
Semantic analysis · CPC title