Systems, computer-implemented methods, and computer program products for data sequence validity processing

US11699434B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11699434-B2
Application number: US-202017112670-A
Country: US
Kind code: B2
Filing date: Dec 4, 2020
Priority date: Dec 4, 2020
Publication date: Jul 11, 2023
Grant date: Jul 11, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments provide for improved data sequence validity processing, for example to determine validity of sentences or other language within a particular language domain. Such improved processing is useful at least for arranging data sequences based on determined validity, and/or making determinations and/or performing actions based on the determined validity. A determined probability (e.g., transformed into the perplexity space) of each token appearing in a data sequence is used in any of a myriad of manners to perform such data sequence validity processing. Example embodiments provide for generating a perplexity value set for each data sequence in a plurality of data sequences, generating a probabilistic ranking set for the plurality of data sequences based on the perplexity value sets and at least one sequence ranking metric, and generating an arrangement of the plurality of data sequences based on the probabilistic ranking set.
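
As a rough illustration of the pipeline the abstract describes, the sketch below turns per-token probabilities into per-token perplexity values, averages them per sequence, and arranges the sequences from most to least plausible. It rests on assumptions not stated on this page: per-token perplexity is taken to be the inverse probability 1/p, the function names are hypothetical, and the probabilities are made-up stand-ins for the output of a trained language model.

```python
from statistics import mean


def token_perplexities(token_probs):
    """Map each token probability p to 1 / p (assumed per-token perplexity)."""
    return [1.0 / p for p in token_probs]


def arrange_by_mean_perplexity(sequences):
    """Return (ordered ids, scores); lower mean perplexity ranks first."""
    scores = {
        seq_id: mean(token_perplexities(probs))
        for seq_id, probs in sequences.items()
    }
    return sorted(scores, key=scores.get), scores


# Hypothetical per-token probabilities standing in for language model output.
example = {
    "odd": [0.5, 0.01, 0.1],
    "fluent": [0.5, 0.25, 0.5],
}
order, scores = arrange_by_mean_perplexity(example)
print(order)
```

Here the mean plays the role of the sequence ranking metric; a single very unlikely token (probability 0.01) is enough to push a sequence to the end of the arrangement.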

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus comprising at least one processor and at least one memory, the at least one memory having computer-coded instructions stored thereon, wherein the computer-coded instructions, in execution with the at least one processor, configures the apparatus to: for each data sequence of a plurality of data sequences, each data sequence comprising a token sequence: generate, utilizing a language model, a perplexity value set associated with the data sequence, wherein the perplexity value set comprises a perplexity value for each data token in the token sequence of the data sequence, wherein the language model comprises a trained machine learning model configured to generate the perplexity value for each data token; and generate a probabilistic ranking set for the plurality of data sequences, the probabilistic ranking set including a probabilistic ranking for each data sequence in the plurality of data sequences, and the probabilistic ranking set generated based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence of the plurality of data sequences, wherein generating the probabilistic ranking set comprises: generating a bucket-based sequence perplexity value set including a bucket-based sequence perplexity value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences: determining an unacceptable bucket token count associated with the data sequence; determining the bucket-based sequence perplexity values for the data sequence based at least in part on the unacceptable bucket token count associated with the data sequence; and generating the probabilistic ranking set based at least in part on the bucket-based sequence perplexity value set; and generate an arrangement of the plurality of data sequences based at least in part on the probabilistic ranking set.

2. The apparatus according to claim 1, the apparatus further configured to: provide the arrangement of the plurality of data sequences to a client device for output configured for rendering via a display interface of the client device or audio output from the client device.

3. The apparatus according to claim 1, the apparatus further configured to: identify, based at least in part on the arrangement of the plurality of data sequences, at least one invalid data sequence from the plurality of data sequences.

4. The apparatus according to claim 1, the apparatus further configured to: exclude at least one data sequence from the plurality of data sequences based at least in part on the arrangement of the plurality of data sequences.

5. The apparatus according to claim 1, wherein the language model is trained on a domain-specific set of language training data.

6. The apparatus according to claim 1, wherein to generate the probabilistic ranking set for the plurality of data sequences based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence, the apparatus is configured to: generate an average sequence perplexity value set including an average sequence perplexity value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences: determining the average sequence perplexity value for the data sequence, wherein the average sequence perplexity value represents a mean value based at least in part on the perplexity value for each data token in the token sequence of the data sequence; and generate the probabilistic ranking set based at least in part on the average sequence perplexity value set.

7. The apparatus according to claim 1, wherein to generate the probabilistic ranking set for the plurality of data sequences based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence the apparatus is configured to: generate an area violating threshold value set including an area violating threshold value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences: determining the area violating threshold value for the data sequence, wherein the area violating threshold value is based at least in part on the perplexity value set for the data sequence and an unacceptable perplexity threshold; and generating the probabilistic ranking set based at least in part on the area violating threshold value set.

8. The apparatus according to claim 1, wherein the probabilistic ranking set is determined utilizing the equation: $a = \frac{1}{m} \sum_{i=0}^{n} (X^{i})\, C_{i}$, wherein $X$ represents a number greater than one, $a$ represents the probabilistic ranking for a particular data sequence of the plurality of data sequences, $m$ represents a number of tokens in the particular data sequence, $i$ represents an order of unacceptable buckets, $n$ represents a number of unacceptable buckets minus 1, and $C_{i}$ represents a number of tokens in an unacceptable bucket represented by $i$.

9. The apparatus according to claim 1, wherein the language model is language agnostic and direction agnostic.

10. The apparatus according to claim 1, further configured to: collect a set of training data sequences associated with a language domain, wherein the set of training data sequences is collected from one or more external computing devices associated with the language domain; and train the language model based at least in part on the set of training data.

11. A computer-implemented method comprising: for each data sequence of a plurality of data sequences, each data sequence comprising a token sequence: generating, utilizing a language model, a perplexity value set associated with the data sequence, wherein the perplexity value set comprises a perplexity value for each data token in the token sequence of the data sequence, wherein the language model comprises a trained machine learning model configured to generate the perplexity value for each data token; and generating a probabilistic ranking set for the plurality of data sequences, the probabilistic ranking set including a probabilistic ranking for each data sequence in the plurality of data sequences, and the probabilistic ranking set generated based at least in part on at least one sequence arrangement metric and the perplexity value set for each data sequence of the plurality of data sequences, wherein generating the probabilistic ranking set comprises: generating a bucket-based sequence perplexity value set including a bucket-based sequence perplexity value for each data sequence of the plurality of data sequences by, for each data sequence of the plurality of data sequences: determining an unacce
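
The bucket-based metric of claim 8 and the area violating threshold value of claim 7 can be sketched as follows. This is only one possible reading: the bucket boundaries, the default X = 2, the threshold value, and the function names are hypothetical, and the claims preview does not say how buckets are delimited or how the "area" is computed. Here a token falls into unacceptable bucket i when its perplexity is at or above that bucket's lower bound, and the area is taken to be the summed excess of perplexity over the threshold.

```python
from bisect import bisect_right


def bucket_score(perplexities, lower_bounds, x=2.0):
    """Compute a = (1/m) * sum_{i=0}^{n} x**i * C_i for one sequence.

    lower_bounds: ascending lower edges of the unacceptable buckets
    (assumed layout); tokens below lower_bounds[0] count as acceptable.
    """
    if x <= 1:
        raise ValueError("X must be greater than one")
    counts = [0] * len(lower_bounds)  # C_i for each unacceptable bucket
    for p in perplexities:
        i = bisect_right(lower_bounds, p) - 1  # -1 means acceptable
        if i >= 0:
            counts[i] += 1
    m = len(perplexities)
    return sum(x ** i * c for i, c in enumerate(counts)) / m


def area_above_threshold(perplexities, threshold):
    """Sum of perplexity in excess of the threshold (assumed reading of claim 7)."""
    return sum(max(0.0, p - threshold) for p in perplexities)


# Hypothetical bucket edges: [50, 100), [100, 200), [200, inf).
bounds = [50, 100, 200]
print(bucket_score([2, 60, 150, 400], bounds))    # (1 + 2 + 4) / 4
print(area_above_threshold([2, 60, 150, 400], 50))  # 10 + 100 + 350
```

Because X is greater than one, tokens in higher-order buckets are weighted exponentially more, so a few badly out-of-domain tokens raise the score faster than many mildly unusual ones. Dividing by m keeps scores comparable across sequences of different lengths.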

Assignees

Arria Data2Text Ltd

Inventors

Classifications

  • using probabilistic model · CPC title

  • G10L15/183 · Primary

    using context dependencies, e.g. language models · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Training · CPC title

  • Semantic analysis · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11699434B2 cover?
Embodiments provide for improved data sequence validity processing, for example to determine validity of sentences or other language within a particular language domain. Such improved processing is useful at least for arranging data sequences based on determined validity, and/or making determinations and/or performing actions based on the determined validity. A determined probability (e.g., tra…
Who is the assignee on this patent?
Arria Data2Text Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/183. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 11, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).