Random action replay for reinforcement learning

US12367350B2 · US · B2

Patent metadata
Publication number: US-12367350-B2
Application number: US-202016946586-A
Country: US
Kind code: B2
Filing date: Jun 29, 2020
Priority date: Jun 29, 2020
Publication date: Jul 22, 2025
Grant date: Jul 22, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An artificial intelligence (AI) platform to support random action replay for natural language (NL) learning. A NL conversation is subject to exploration to train a neural network. One or more tuples are leveraged for the training, with each tuple representing an input action, a vector, an output action, and a reward value. An action is sampled from the vector, with the sampling configured to assess a corresponding first gradient. The first gradient is applied to selectively adjust the neural network. As NL input is received and applied to the selectively adjusted neural network, an output corresponding to the NL input is identified and a corresponding action is subject to be executed.
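The training step summarized in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ReplayTuple` class, the tabular softmax "network", the learning rate, and the use of a cross-entropy gradient as the "distance" between the network output and the sampled action are all assumptions made for the example. The reward value is stored in the tuple but not used in this simplified step.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReplayTuple:
    """Hypothetical container for one recorded exploration step."""
    input_action: int            # index of the action received
    policy_vector: List[float]   # probabilities over candidate output actions
    output_action: int           # action taken during exploration
    reward: float                # reward observed for that action


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(policy_vector: List[float], rng: random.Random) -> int:
    """Draw one action index in proportion to the policy vector."""
    return rng.choices(range(len(policy_vector)), weights=policy_vector, k=1)[0]


def replay_step(weights: List[List[float]], tup: ReplayTuple,
                rng: random.Random, lr: float = 0.5) -> Tuple[int, List[float]]:
    """Sample an action from the tuple and nudge the model toward it.

    weights[i][j] is the logit of output action j given input action i.
    Assumption: the gradient is that of a cross-entropy loss with respect
    to the logits, i.e. (model probabilities - one-hot sampled action).
    """
    sampled = sample_action(tup.policy_vector, rng)
    probs = softmax(weights[tup.input_action])
    grad = [p - (1.0 if j == sampled else 0.0) for j, p in enumerate(probs)]
    row = weights[tup.input_action]
    weights[tup.input_action] = [w - lr * g for w, g in zip(row, grad)]
    return sampled, grad


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[0.0, 0.0, 0.0] for _ in range(2)]
    tup = ReplayTuple(input_action=0, policy_vector=[0.1, 0.7, 0.2],
                      output_action=1, reward=1.0)
    sampled, grad = replay_step(weights, tup, rng)
    print("sampled action:", sampled)
    print("gradient:", grad)
```

Because the action is drawn from the stored policy vector rather than copied from the recorded output action, repeated replays of the same tuple can push the model toward different actions, weighted by their recorded probabilities.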

First claim

Opening claim text (preview).

What is claimed is:

1. A computer system comprising: a processing unit operatively coupled to memory; an artificial intelligence (AI) platform operatively coupled to the processing unit, the AI platform configured with one or more tools to support random action replay for natural language (NL) learning, the one or more tools comprising: a training manager configured to train a neural network, the training further comprising the training manager to: explore a NL conversation, the exploration to leverage one or more tuples associated with the NL conversation, each tuple representing at least an input action, an output action, a policy vector, and a reward value; select a tuple and sample a first action, from a distribution of actions, associated with the selected tuple; assess the sampled first action, including generate output associated with the assessment, compare the generated output to a value of the sampled first action corresponding to the policy vector and, based on the comparison, calculate a first gradient representing a distance of the generated output from the sampled first action in the selected tuple associated with the NL conversation; and apply the first gradient to selectively adjust the neural network; a language manager operatively coupled to the training manager, the language manager configured to receive and apply NL input to the selectively adjusted neural network, and generate a NL output corresponding to the received NL input; and the language manager configured to execute an identified action corresponding to the identified output.

2. The computer system of claim 1, further comprising an interaction manager operatively coupled to the training manager, the interaction manager configured to create the one or more tuples in an interactive environment with corresponding first and second agents, the interactive environment to identify one or more actions from the distribution of actions as a response to receipt of the input action.

3. The computer system of claim 1, further comprising the training manager configured to re-train the neural network and incorporate a sampled second action from the distribution of actions, calculate a second gradient representing a distance of the sampled second action from the input action, and apply the second gradient to selectively adjust the neural network.

4. The computer system of claim 3, further comprising the training manager configured to assess the first and second gradients, and responsive to identification of a convergence of the first and second gradients the training manager further configured to terminate training of the neural network.

5. The computer system of claim 1, further comprising the training manager configured to utilize a random choice function to select the first action from the distribution of actions for sampling.

6. The computer system of claim 1, wherein the trained neural network is configured to evaluate the received NL input and to determine one or more NL components of the evaluated NL input.

7. The computer system of claim 6, further comprising the trained neural network configured to evaluate the determined one or more NL components and determine an action corresponding to the received NL input.

8. A computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by a processor to: train a neural network, the training further comprising the program code to: explore a natural language (NL) conversation, the exploration to leverage one or more tuples associated with the NL conversation, each tuple representing at least an input action, an output action, a policy vector, and a reward value; select a tuple and sample a first action, from a distribution of actions, associated with the selected tuple; assess the sampled first action, including generate output associated with the assessment, compare the generated output to a value of the sampled first action corresponding to the policy vector and, based on the comparison, calculate a first gradient representing a distance of the generated output from the sampled first action in the selected tuple associated with the NL conversation; and apply the first gradient to selectively adjust the neural network; receive and apply NL input to the selectively adjusted neural network, and generate a NL output corresponding to the received NL input; and execute an identified action corresponding to the identified output.

9. The computer program product of claim 8, further comprising the program code executable by the processor to create the one or more tuples in an interactive environment with corresponding first and second agents, the interactive environment to identify one or more actions from the distribution of actions as a response to receipt of the input action.

10. The computer program product of claim 8, further comprising the program code executable by the processor to re-train the neural network and incorporate a sampled second action from the distribution of actions, calculate a second gradient representing a distance of the sampled second action from the input action; and apply the second gradient to selectively adjust the neural network.

11. The computer program product of claim 10, further comprising the program code executable by the processor to assess the first and second gradients, and responsive to identification of a convergence of the first and second gradients terminate training of the neural network.

12. The computer program product of claim 8, further comprising the program code executable by the processor to utilize a random choice function to select the first action from the distribution of actions for sampling.

13. A computer implemented method comprising: training a neural network, the training further comprising: exploring a natural language (NL) conversation, the exploration to leverage one or more tuples associated with the NL conversation, each tuple representing an input action, an output action, a policy vector, and a reward value; selecting a tuple and sampling a first action, from a distribution of actions, associated with the selected tuple; assessing the sampled first action, including generate output associated with the assessment, compare the generated output to a value of the sampled first action corresponding to the policy vector and, based on the comparison, calculate a first gradient representing a distance of the generated output from the sampled first action in the selected tuple associated with the NL conversation; and applying the first gradient to selectively adjust the neural network; receiving and applying NL input to the selectively adjusted neural network, and generating a NL output corresponding to received NL input; and executing an identified action corresponding to the identified output.

14. The method of claim 13, further comprising creating the one or more tuples in an interactive environment with corresponding first and second agents, the interactive environment to identify one or more actions from the distribution of actions as a response to receipt of the input action.

15. The method of claim 13, further comprising re-training the neural network and incorporating a sampled second action from the distribution of actions, calculating a second gradient representing a distance of the sampled second action from the input action, and applying the second gradient to selectively adjust the neural network.

16. The method of claim 15, further comprising assessing the first and second gradients, and responsive to identificat
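The re-training and termination behavior described in the dependent claims (a random choice function for sampling, a second gradient from a second sampled action, and stopping once the first and second gradients converge) can be sketched as below. The claims do not define "convergence", so the Euclidean distance threshold `tol` is an assumption, as are the function names, the softmax model, and the learning rate. For simplicity the sketch measures each gradient against the model output, which is a simplification of the claim wording.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradients_converged(first: Sequence[float], second: Sequence[float],
                        tol: float = 1e-3) -> bool:
    """Assumed convergence test: consecutive gradients differ by at most tol."""
    return math.dist(first, second) <= tol


def train_until_converged(policy_vector: Sequence[float],
                          logits: Sequence[float],
                          rng: random.Random,
                          lr: float = 0.5,
                          tol: float = 1e-3,
                          max_steps: int = 1000) -> Tuple[List[float], int]:
    """Replay one tuple's policy vector until successive gradients converge.

    Each pass samples an action with a random choice function, computes a
    gradient (model probabilities minus the one-hot sampled action), applies
    it, and compares it with the gradient from the previous pass.
    """
    current = list(logits)
    previous = None
    for step in range(1, max_steps + 1):
        sampled = rng.choices(range(len(policy_vector)),
                              weights=policy_vector, k=1)[0]
        probs = softmax(current)
        grad = [p - (1.0 if j == sampled else 0.0) for j, p in enumerate(probs)]
        current = [w - lr * g for w, g in zip(current, grad)]
        if previous is not None and gradients_converged(previous, grad, tol):
            return current, step   # terminate training on convergence
        previous = grad
    return current, max_steps


if __name__ == "__main__":
    final_logits, steps = train_until_converged(
        policy_vector=[0.0, 1.0, 0.0],   # always samples action 1
        logits=[0.0, 0.0, 0.0],
        rng=random.Random(0),
    )
    print("stopped after", steps, "steps; logits:", final_logits)
```

With a policy vector that spreads probability over several actions, successive sampled gradients may keep differing, so in practice a convergence test of this kind would likely be applied to averaged or smoothed gradients; that choice is outside what the claim text specifies.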

Assignees

IBM

Inventors

Classifications

  • Reinforcement learning · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Natural language generation · CPC title

  • modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12367350B2 cover?
An artificial intelligence (AI) platform to support random action replay for natural language (NL) learning. A NL conversation is subject to exploration to train a neural network. One or more tuples are leveraged for the training, with each tuple representing an input action, a vector, an output action, and a reward value. An action is sampled from the vector, with the sampling configured to as…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/35. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 22, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).