Interpretability of deep reinforcement learning models in assistant systems
US-11715042-B1 · Aug 1, 2023 · US
US12367350B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12367350-B2 |
| Application number | US-202016946586-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 29, 2020 |
| Priority date | Jun 29, 2020 |
| Publication date | Jul 22, 2025 |
| Grant date | Jul 22, 2025 |
Official abstract text for this publication.
An artificial intelligence (AI) platform to support random action replay for natural language (NL) learning. A NL conversation is subject to exploration to train a neural network. One or more tuples are leveraged for the training, with each tuple representing an input action, a vector, an output action, and a reward value. An action is sampled from the vector, with the sampling configured to assess a corresponding first gradient. The first gradient is applied to selectively adjust the neural network. As NL input is received and applied to the selectively adjusted neural network, an output corresponding to the NL input is identified and a corresponding action is subject to be executed.
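The training loop in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The names `ReplayTuple`, `TinyPolicy`, and `replay_step`, the learning rate, and the cross-entropy-style gradient used as the "distance" between the network output and the sampled action are all assumptions made for illustration. A table of logits stands in for the neural network, and the stored reward value is carried in the tuple but not used in this simplified update.

```python
import math
import random
from typing import List, NamedTuple


class ReplayTuple(NamedTuple):
    """One stored interaction; field names are hypothetical."""
    input_action: int            # index of the action/utterance received
    policy_vector: List[float]   # stored distribution over output actions
    output_action: int           # action taken during exploration
    reward: float                # reward observed (unused in this sketch)


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TinyPolicy:
    """Stand-in for the neural network: one row of logits per input action."""

    def __init__(self, n_inputs: int, n_actions: int) -> None:
        self.logits = [[0.0] * n_actions for _ in range(n_inputs)]

    def forward(self, input_action: int) -> List[float]:
        return softmax(self.logits[input_action])


def replay_step(policy: TinyPolicy, replay: List[ReplayTuple],
                rng: random.Random, lr: float = 0.05) -> List[float]:
    """Select a tuple, sample an action from its policy vector, and
    nudge the policy toward the sampled action. Returns the gradient."""
    tup = rng.choice(replay)                      # select a tuple
    actions = list(range(len(tup.policy_vector)))
    # Random choice weighted by the stored policy vector.
    sampled = rng.choices(actions, weights=tup.policy_vector, k=1)[0]
    output = policy.forward(tup.input_action)     # generated output
    # Assumed distance measure: gradient of cross-entropy with respect to
    # the logits, i.e. output minus the one-hot vector of the sampled action.
    gradient = [p - (1.0 if a == sampled else 0.0)
                for a, p in enumerate(output)]
    row = policy.logits[tup.input_action]
    for a, g in enumerate(gradient):              # apply the gradient
        row[a] -= lr * g
    return gradient


if __name__ == "__main__":
    rng = random.Random(0)
    replay = [ReplayTuple(0, [0.1, 0.8, 0.1], 1, 1.0)]
    policy = TinyPolicy(n_inputs=1, n_actions=3)
    for _ in range(2000):
        replay_step(policy, replay, rng)
    print([round(p, 2) for p in policy.forward(0)])
```

Because the sampled action is drawn from the stored policy vector, the update in this sketch averages out to the difference between the network output and that vector, so repeated replay steps pull the output toward the stored distribution rather than toward the single action that happened to be taken.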
Opening claim text (preview).
What is claimed is:
1. A computer system comprising: a processing unit operatively coupled to memory; an artificial intelligence (AI) platform operatively coupled to the processing unit, the AI platform configured with one or more tools to support random action replay for natural language (NL) learning, the one or more tools comprising: a training manager configured to train a neural network, the training further comprising the training manager to: explore a NL conversation, the exploration to leverage one or more tuples associated with the NL conversation, each tuple representing at least an input action, an output action, a policy vector, and a reward value; select a tuple and sample a first action, from a distribution of actions, associated with the selected tuple; assess the sampled first action, including generate output associated with the assessment, compare the generated output to a value of the sampled first action corresponding to the policy vector and, based on the comparison, calculate a first gradient representing a distance of the generated output from the sampled first action in the selected tuple associated with the NL conversation; and apply the first gradient to selectively adjust the neural network; a language manager operatively coupled to the training manager, the language manager configured to receive and apply NL input to the selectively adjusted neural network, and generate a NL output corresponding to the received NL input; and the language manager configured to execute an identified action corresponding to the identified output.
2. The computer system of claim 1, further comprising an interaction manager operatively coupled to the training manager, the interaction manager configured to create the one or more tuples in an interactive environment with corresponding first and second agents, the interactive environment to identify one or more actions from the distribution of actions as a response to receipt of the input action.
3. The computer system of claim 1, further comprising the training manager configured to re-train the neural network and incorporate a sampled second action from the distribution of actions, calculate a second gradient representing a distance of the sampled second action from the input action, and apply the second gradient to selectively adjust the neural network.
4. The computer system of claim 3, further comprising the training manager configured to assess the first and second gradients, and responsive to identification of a convergence of the first and second gradients the training manager further configured to terminate training of the neural network.
5. The computer system of claim 1, further comprising the training manager configured to utilize a random choice function to select the first action from the distribution of actions for sampling.
6. The computer system of claim 1, wherein the trained neural network is configured to evaluate the received NL input and to determine one or more NL components of the evaluated NL input.
7. The computer system of claim 6, further comprising the trained neural network configured to evaluate the determined one or more NL components and determine an action corresponding to the received NL input.
8. A computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by a processor to: train a neural network, the training further comprising the program code to: explore a natural language (NL) conversation, the exploration to leverage one or more tuples associated with the NL conversation, each tuple representing at least an input action, an output action, a policy vector, and a reward value; select a tuple and sample a first action, from a distribution of actions, associated with the selected tuple; assess the sampled first action, including generate output associated with the assessment, compare the generated output to a value of the sampled first action corresponding to the policy vector and, based on the comparison, calculate a first gradient representing a distance of the generated output from the sampled first action in the selected tuple associated with the NL conversation; and apply the first gradient to selectively adjust the neural network; receive and apply NL input to the selectively adjusted neural network, and generate a NL output corresponding to the received NL input; and execute an identified action corresponding to the identified output.
9. The computer program product of claim 8, further comprising the program code executable by the processor to create the one or more tuples in an interactive environment with corresponding first and second agents, the interactive environment to identify one or more actions from the distribution of actions as a response to receipt of the input action.
10. The computer program product of claim 8, further comprising the program code executable by the processor to re-train the neural network and incorporate a sampled second action from the distribution of actions, calculate a second gradient representing a distance of the sampled second action from the input action; and apply the second gradient to selectively adjust the neural network.
11. The computer program product of claim 10, further comprising the program code executable by the processor to assess the first and second gradients, and responsive to identification of a convergence of the first and second gradients terminate training of the neural network.
12. The computer program product of claim 8, further comprising the program code executable by the processor to utilize a random choice function to select the first action from the distribution of actions for sampling.
13. A computer implemented method comprising: training a neural network, the training further comprising: exploring a natural language (NL) conversation, the exploration to leverage one or more tuples associated with the NL conversation, each tuple representing an input action, an output action, a policy vector, and a reward value; selecting a tuple and sampling a first action, from a distribution of actions, associated with the selected tuple; assessing the sampled first action, including generate output associated with the assessment, compare the generated output to a value of the sampled first action corresponding to the policy vector and, based on the comparison, calculate a first gradient representing a distance of the generated output from the sampled first action in the selected tuple associated with the NL conversation; and applying the first gradient to selectively adjust the neural network; receiving and applying NL input to the selectively adjusted neural network, and generating a NL output corresponding to received NL input; and executing an identified action corresponding to the identified output.
14. The method of claim 13, further comprising creating the one or more tuples in an interactive environment with corresponding first and second agents, the interactive environment to identify one or more actions from the distribution of actions as a response to receipt of the input action.
15. The method of claim 13, further comprising re-training the neural network and incorporating a sampled second action from the distribution of actions, calculating a second gradient representing a distance of the sampled second action from the input action, and applying the second gradient to selectively adjust the neural network.
16. The method of claim 15, further comprising assessing the first and second gradients, and responsive to identificat
Reinforcement learning · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Natural language generation · CPC title
modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title