Intelligent compute resource selection for machine learning training jobs
US-11537439-B1 · Dec 27, 2022 · US
US12061964B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12061964-B2 |
| Application number | US-202017032562-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 25, 2020 |
| Priority date | Sep 25, 2019 |
| Publication date | Aug 13, 2024 |
| Grant date | Aug 13, 2024 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes sampling a behavior modulation in accordance with a current probability distribution; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network to generate a respective action score for each action in a set of possible actions that can be performed by the agent; modifying the action scores using the sampled behavior modulation; and selecting the action to be performed by the agent at the time step based on the modified action scores; determining a fitness measure corresponding to the sampled behavior modulation; and updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation.
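The outer loop in the abstract (sample a modulation, act, measure fitness, update the distribution) can be sketched as a small bandit. This is a minimal illustration, not the patented implementation: the names `update_distribution`, `run_bandit`, and `episode_fitness`, the preference-plus-softmax update, the step size, and the fixed window length are all assumptions. The comparison against a mean of recent fitness values follows the idea in claims 10 and 11.

```python
import math
import random
from collections import deque


def update_distribution(prefs, index, fitness, history, step=0.5):
    """Update preferences in place and return the new probability distribution.

    Assumed rule: raise the preference of the sampled modulation if its fitness
    exceeds the mean of recent fitness values, lower it if it falls below.
    """
    baseline = sum(history) / len(history) if history else fitness
    if fitness > baseline:
        prefs[index] += step
    elif fitness < baseline:
        prefs[index] -= step
    history.append(fitness)
    # Softmax over preferences, shifted by the maximum for numerical stability.
    peak = max(prefs)
    weights = [math.exp(p - peak) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def run_bandit(modulations, episode_fitness, rounds=200, window=20, seed=0):
    """Repeatedly sample a modulation, score it, and update the distribution.

    `episode_fitness(modulation, rng)` is a hypothetical callback that runs the
    agent for one or more time steps and returns a fitness value, for example
    the sum of rewards received.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(modulations)
    probs = [1.0 / len(modulations)] * len(modulations)
    history = deque(maxlen=window)  # fixed-length window of recent fitness values
    for _ in range(rounds):
        index = rng.choices(range(len(modulations)), weights=probs)[0]
        fitness = episode_fitness(modulations[index], rng)
        probs = update_distribution(prefs, index, fitness, history)
    return probs
```

Calling `run_bandit([0.0, 0.1, 0.5], episode_fitness)` returns one probability per candidate modulation. In a full system the callback would run the action selection neural network in the environment, which is omitted here.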
Opening claim text (preview).
What is claimed is:
1. A method performed by one or more data processing apparatus for selecting actions to be performed by an agent interacting with an environment to accomplish a task, wherein the method comprises repeatedly performing operations including: sampling a behavior modulation from a set of possible behavior modulations in accordance with a current probability distribution over the set of possible behavior modulations, wherein: the behavior modulation includes a respective value for each of one or more modulation factors; and the behavior modulation defines an exploration policy for modifying a set of action scores that includes a respective action score for each action in a set of possible actions that can be performed by the agent to interact with the environment; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network to generate a respective action score for each action in the set of possible actions that can be performed by the agent; modifying the action scores using the sampled behavior modulation; and selecting the action to be performed by the agent at the time step based on the modified action scores; determining a fitness measure corresponding to the sampled behavior modulation that characterizes an estimated progress in improving a performance of the agent in accomplishing the task that would result from training the action selection neural network on training data characterizing interaction of the agent with the environment over the one or more time steps where the action scores were modified using the sampled behavior modulation; and updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation.
2. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies a respective bias value for each action in the set of possible actions, and modifying the action scores using the sampled behavior modulation comprises adding the corresponding bias value to each action score.
3. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies a temperature parameter, and modifying the action scores using the behavior modulation factor comprises dividing each action score by the temperature parameter.
4. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies an ϵ probability value, and modifying the action scores using the behavior modulation comprises modifying the action scores such that, with probability given by the ϵ probability value, each action has an equal likelihood of being selected to be performed by the agent.
5. The method of claim 1, wherein the one or more behavior modulation factors include a behavior modulation factor that specifies an action-repeat probability value, and modifying the action scores using the behavior modulation factor comprises modifying the action scores such that, with probability given by the action-repeat probability value, the action selected to be performed by the agent at the time step is the action performed by the agent at a previous time step.
6. The method of claim 1, wherein: the action score for each action is specified by data characterizing a probability distribution over possible returns that would result from the agent performing the action; wherein the one or more behavior modulation factors include a behavior modulation factor that specifies one or more parameters of a distortion function; and modifying the action scores using the behavior modulation factor comprises, for each action, applying the distortion function to the data characterizing the probability distribution over possible returns that would result from the agent performing the action.
7. The method of claim 1, wherein the progress in improving the performance of the agent in accomplishing the task characterizes an expected difference between a cumulative measure of rewards received by the agent as a result of interacting with the environment by performing actions selected using the action selection neural network: (i) after the action selection neural network is trained on the training data characterizing the interaction of the agent with the environment over the one or more time steps where the action scores were modified using the behavior modulation, and (ii) before the action selection neural network is trained on the training data characterizing the interaction of the agent with the environment over the one or more time steps where the action scores were modified using the behavior modulation.
8. The method of claim 1, wherein determining the fitness measure comprises determining the fitness measure based on a reward received by the agent at each of the one or more time steps.
9. The method of claim 8, wherein the fitness measure is determined as a sum of the rewards received by the agent at each of the one or more time steps.
10. The method of claim 1, wherein updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation comprises: determining a measure of central tendency of a plurality of given fitness measures that have previously been determined; and determining an updated probability distribution based at least in part on whether the fitness measure corresponding to the behavior modulation exceeds the measure of central tendency of the plurality of given fitness measures.
11. The method of claim 10, wherein the measure of central tendency of the plurality of given fitness measures is a mean of the plurality of given fitness measures.
12. The method of claim 10, further comprising adjusting a length of a time window specifying which given fitness measures are eligible for inclusion in the plurality of given fitness measures.
13. The method of claim 1, further comprising updating current values of parameters of the action selection neural network using reinforcement learning techniques.
14. The method of claim 1, wherein selecting the action to be performed by the agent at the time step based on the modified action scores comprises: sampling the action to be performed by the agent at the time step based on a probability distribution over the set of possible actions that is specified by the modified action scores.
15. A system for selecting actions to be performed by an agent interacting with an environment to accomplish a task, the system comprising one or more computers and one or more storage devices that when executed by the one or more computers cause the one or more computers to repeatedly perform operations comprising: sampling a behavior modulation from a set of possible behavior modulations in accordance with a current probability distribution over the set of possible behavior modulations, wherein: the behavior modulation includes a respective value for each of one or more modulation factors; and the behavior modulation defines an exploration policy for modifying a set of action scores that includes a respective action score for each action in a set of possible actions that can be performed by the agent to interact with the environment; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network t
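The per-step modulation factors named in claims 2 to 5 (per-action bias, temperature, ϵ, and action-repeat probability) can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function name `modulated_action`, the dictionary keys, the order in which the factors are applied, and the choice to realize the ϵ and action-repeat factors as branches at selection time instead of as explicit score edits. The final step samples from a softmax over the adjusted scores, in the spirit of claim 14.

```python
import math
import random


def modulated_action(scores, modulation, previous_action, rng):
    """Pick an action index from raw action scores under a behavior modulation.

    `modulation` is a hypothetical dict with keys "bias" (one value per action),
    "temperature" (must be > 0), "epsilon", and "repeat_prob".
    """
    # Action repeat: with probability repeat_prob, reuse the previous action.
    if previous_action is not None and rng.random() < modulation["repeat_prob"]:
        return previous_action
    # Epsilon: with probability epsilon, every action is equally likely.
    if rng.random() < modulation["epsilon"]:
        return rng.randrange(len(scores))
    # Bias and temperature: add the per-action bias, then divide by temperature.
    adjusted = [
        (score + bias) / modulation["temperature"]
        for score, bias in zip(scores, modulation["bias"])
    ]
    # Sample from the softmax of the adjusted scores (shifted for stability).
    peak = max(adjusted)
    weights = [math.exp(a - peak) for a in adjusted]
    return rng.choices(range(len(scores)), weights=weights)[0]
```

In this sketch a low temperature makes selection close to greedy and a high temperature flattens it toward uniform, while ϵ and the repeat probability override the scores entirely on the steps where they trigger.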
Convolutional networks [CNN, ConvNet] · CPC title
Reinforcement learning · CPC title
Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16) · CPC title
using neural networks · CPC title
using classification, e.g. of video objects · CPC title