Autoregressively generating sequences of data elements defining actions to be performed by an agent

US12547890B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12547890-B2
Application number  US-202117410689-A
Country             US
Kind code           B2
Filing date         Aug 24, 2021
Priority date       Aug 24, 2021
Publication date    Feb 10, 2026
Grant date          Feb 10, 2026

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting actions to be performed by an agent to interact with an environment using an action selection neural network. In one aspect, a method comprises, at each time step in a sequence of time steps: generating a current representation of a state of a task being performed by the agent in the environment as of the current time step as a sequence of data elements; autoregressively generating a sequence of data elements representing a current action to be performed by the agent at the current time step; and after autoregressively generating the sequence of data elements representing the current action, causing the agent to perform the current action at the current time step.
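
The per-time-step loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `toy_scores` is a hypothetical stand-in for the action selection neural network, and `VOCAB_SIZE`, `ACTION_LENGTH`, and sampling from a softmax are choices made only for this example.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

VOCAB_SIZE = 8      # hypothetical number of possible data elements
ACTION_LENGTH = 3   # hypothetical number of data elements per action


def toy_scores(state: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for the action selection neural network.

    A real system would run a trained network on the state sequence here.
    """
    total = sum(state)
    return [math.sin(total + k) for k in range(VOCAB_SIZE)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def generate_action(
    state: Sequence[int],
    score_fn: Callable[[Sequence[int]], List[float]],
    action_length: int,
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Generate one action element by element, growing the state each time."""
    state = list(state)
    action: List[int] = []
    for _ in range(action_length):
        # Score every possible data element given the current state.
        probs = softmax(score_fn(state))
        # Select an element in accordance with the score distribution.
        element = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        action.append(element)
        # Concatenate the selected element to the state representation.
        state.append(element)
    return action, state


rng = random.Random(0)
observation_tokens = [5, 1, 7]  # hypothetical tokenized observation
action, new_state = generate_action(
    observation_tokens, toy_scores, ACTION_LENGTH, rng
)
print(action, new_state)
```

Only after all `ACTION_LENGTH` elements have been generated would the agent be caused to perform the decoded action. The extended state can then serve as the history for the next time step.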

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more computers for selecting actions to be performed by an agent to interact with an environment using an action selection neural network, the method comprising, at each time step in a sequence of time steps:

  generating a current representation of a state of a task being performed by the agent in the environment as of the current time step as a sequence of data elements;

  autoregressively generating a sequence of data elements representing a current action to be performed by the agent at the current time step, comprising, for each position starting from a first position in the sequence of data elements representing the current action:

    processing the current representation of the state of the task using the action selection neural network to generate a score distribution over a set of possible data elements;

    selecting a data element for the position in the sequence of data elements representing the current action in accordance with the score distribution; and

    updating the current representation of the state of the task by concatenating the selected data element for the position to the current representation of the state of the task; and

  after autoregressively generating the sequence of data elements representing the current action, causing the agent to perform the current action at the current time step;

  wherein the action selection neural network has been trained on a set of training examples that includes respective training examples from multiple different control domains;

  wherein each training example from each control domain is defined by a training sequence of data elements and includes an interleaved sequence of: (i) observations of a corresponding environment, each observation including a respective sequence of observation data elements, and (ii) actions performed by the agent to interact with the corresponding environment, each action including a respective sequence of action data elements; and

  wherein the multiple different control domains include a first control domain where observations of the corresponding environment include sequences of observation data elements that each have a first sequence length and represent observations with a first dimensionality, and a second control domain where observations of the corresponding environment include sequences of observation data elements that each have a second, different sequence length and represent observations with a second, different dimensionality.

2. The method of claim 1, wherein for each time step in the sequence of time steps, generating the current representation of the state of the task as of the current time step comprises: receiving a current observation characterizing a state of the environment at the current time step; generating a representation of the current observation as a sequence of data elements; and including the representation of the current observation as a sequence of data elements in the current representation of the state of the task as of the current time step.

3. The method of claim 2, wherein the current observation is defined by a collection of numerical values, and generating the representation of the current observation as a sequence of data elements comprises: concatenating each numerical value in the collection of numerical values defining the current observation into a sequence of numerical values in a predefined order.

4. The method of claim 3, wherein generating the representation of the current observation as a sequence of data elements further comprises: discretizing each numerical value in the collection of numerical values defining the current observation.

5. The method of claim 2, wherein the current observation characterizing the current state of the environment at the current time step comprises an image defined by an array of pixels.

6. The method of claim 2, wherein generating the representation of the current observation as a sequence of data elements comprises: combining a target return to be achieved by interaction of the agent with the environment with the representation of the current observation as a sequence of data elements, wherein the target return defines a cumulative measure of rewards to be achieved as a result of the interaction of the agent with the environment.

7. The method of claim 2, wherein for each time step after a first time step in the sequence of time steps, including the representation of the current observation as a sequence of data elements in the current representation of the state of the task as of the current time step comprises: receiving a representation of the state of the task as of a previous time step as a sequence of data elements; and concatenating the representation of the current observation as a sequence of data elements to the representation of the state of the task as of the previous time step as a sequence of data elements to generate the current representation of the state of the task as of the current time step.

8. The method of claim 7, wherein the representation of the state of the task as of the previous time step represents, for each time step preceding the current time step: (i) a respective observation characterizing a state of the environment at the time step, and (ii) a respective action performed by the agent at the time step.

9. The method of claim 2, wherein at a first time step in the sequence of time steps, including the representation of the current observation as a sequence of data elements in the current representation of the state of the task as of the current time step comprises: receiving a prompt that comprises data characterizing the task to be performed by the agent in the environment; generating a representation of the prompt as a sequence of data elements; and concatenating the representation of the current observation as a sequence of data elements to the representation of the prompt as a sequence of data elements to generate the current representation of the state of the task as of the current time step.

10. The method of claim 9, wherein the prompt comprises one or more of: a demonstration of the task, a goal observation characterizing a goal state of the environment, or a sequence of text in a natural language that provides instructions related to the task.

11. The method of claim 1, wherein, for each training example in the set of training examples: at least one of the data elements in the training sequence of data elements defining the training example is designated as an action data element; and training the action selection neural network on the training example comprises training the action selection neural network to generate the action data elements included in the training example.

12. The method of claim 1, wherein the multiple different control domains include a first control domain where actions performed by the corresponding agent have a first dimensionality, and a second control domain where actions performed by the corresponding agent have a second, different dimensionality.

13. The method of claim 1, wherein the set of training examples includes a plurality of language modeling training examples, wherein each language modeling training example represents a sequence of text in a natural language.

14. The method of claim 1, wherein the action selection neural network comprises a plurality of self-attention neural network layers.

15. The method of claim 1, wherein for each position starting from the first position in the sequences of data elements representing the current action, selecting the data e…
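
Claims 1, 3, 4 and 11 describe flattening numerical observations in a predefined order, discretizing them, and interleaving observation and action elements into one training sequence in which the action elements are designated as such. The sketch below illustrates one way this could look in Python. The uniform binning over [-1, 1], the `NUM_BINS` value, the use of sorted dictionary keys as the "predefined order", and all function names are assumptions made for this example; the claims do not specify them.

```python
from typing import Dict, List, Sequence, Tuple

NUM_BINS = 16  # hypothetical number of discrete bins

Observation = Dict[str, Sequence[float]]
Step = Tuple[Observation, Sequence[float]]


def discretize(value: float, low: float = -1.0, high: float = 1.0) -> int:
    """Map a number to a bin index using uniform bins (an assumption)."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * NUM_BINS), NUM_BINS - 1)


def tokenize_observation(observation: Observation) -> List[int]:
    """Flatten an observation in a fixed order and discretize each value."""
    tokens: List[int] = []
    for key in sorted(observation):  # fixed order: sorted key names
        tokens.extend(discretize(v) for v in observation[key])
    return tokens


def build_training_sequence(episode: Sequence[Step]) -> Tuple[List[int], List[bool]]:
    """Interleave observation and action tokens; flag which are actions."""
    tokens: List[int] = []
    is_action: List[bool] = []
    for observation, action in episode:
        obs_tokens = tokenize_observation(observation)
        act_tokens = [discretize(a) for a in action]
        tokens += obs_tokens + act_tokens
        # Only action positions would contribute to the training objective.
        is_action += [False] * len(obs_tokens) + [True] * len(act_tokens)
    return tokens, is_action


# Two hypothetical control domains with different observation dimensionality.
domain_a = [({"pos": [0.0, 0.5], "vel": [-1.0]}, [1.0])]
domain_b = [({"joint_angles": [0.25, -0.25, 0.75, -0.75, 0.0]}, [0.5, -0.5])]

tokens_a, mask_a = build_training_sequence(domain_a)
tokens_b, mask_b = build_training_sequence(domain_b)
print(tokens_a, mask_a)
print(len(tokens_b), mask_b)
```

Because both domains end up as flat sequences of elements from the same discrete set, a single sequence model could in principle be trained on both, even though their observations have different lengths.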

Assignees

  • Gdm Holding Llc

Classifications

  • Architecture, e.g. interconnection topology · CPC title

  • Backpropagation, e.g. using gradient descent · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

  • G06N3/0455 · Primary

    Auto-encoder networks; Encoder-decoder networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Gdm Holding Llc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 10, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).