Training actor-critic algorithms in laboratory settings

US12423571B2 · US · B2

Patent metadata
Publication number: US-12423571-B2
Application number: US-202017003673-A
Country: US
Kind code: B2
Filing date: Aug 26, 2020
Priority date: Aug 26, 2020
Publication date: Sep 23, 2025
Grant date: Sep 23, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Reinforcement learning methods can use actor-critic networks where (1) additional laboratory-only state information is used to train a policy that must act without this additional laboratory-only information in a production setting; and (2) complex resource-demanding policies are distilled into a less-demanding policy that can be more easily run at production with limited computational resources. The production actor network can be optimized using a frozen version of a large critic network, previously trained with a large actor network. Aspects of these methods can leverage actor-critic methods in which the critic network models the action value function, as opposed to the state value function.
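The role of the frozen critic can be pictured with a toy example. The sketch below is not the patented implementation; it assumes a hand-written stand-in for a trained critic (`frozen_critic`, with its action gradient `critic_action_grad`) and a one-parameter linear production actor, all of which are hypothetical names introduced here. It shows why an action value function Q(s, a) is convenient: the production actor can be improved by following the critic's gradient with respect to the action on logged states, with no further environment interaction.

```python
import random


def frozen_critic(state, action):
    """Hypothetical frozen critic Q(s, a): a stand-in for a trained network.

    It scores an action highest when action == 2 * state.
    """
    return -(action - 2.0 * state) ** 2


def critic_action_grad(state, action):
    """Gradient dQ/da of the toy critic above (assumption: known in closed form)."""
    return -2.0 * (action - 2.0 * state)


def train_production_actor(states, lr=0.05, epochs=200):
    """Fit a tiny linear actor a = w * s against the frozen critic.

    Only logged states are used; the environment is never queried.
    """
    w = 0.0
    for _ in range(epochs):
        for s in states:
            action = w * s
            # Chain rule: dQ/dw = dQ/da * da/dw, with da/dw = s.
            w += lr * critic_action_grad(s, action) * s
    return w


rng = random.Random(0)
logged_states = [rng.uniform(-1.0, 1.0) for _ in range(64)]
production_weight = train_production_actor(logged_states)
print(round(production_weight, 3))
```

A state value function V(s) would not support this step on its own, because it does not say how the value changes when the action changes.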

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training an agent, comprising: in a laboratory setting: training a policy using an actor-critic algorithm using an actor network and a critic network, the critic network using state information available in a both the laboratory setting and in a production setting and the actor network using state information available only in the production setting, wherein the state information used in the training of the policy in the laboratory setting includes information that is collected in the laboratory setting by at least one of (1) additional sensors of the agent available in the laboratory setting and not in the production setting and (2) greater compute resources available to the agent in the laboratory setting and not in the production setting, wherein the information is not collected in the production setting; and optimizing action choices of the actor network against the critic network; after training the policy in the laboratory setting, prior to deploying the policy to the production setting: duplicating the critic network into a frozen critic network; providing a production actor network through a distillation of the actor network and by optimizing the production actor network with the frozen critic network outside of the laboratory setting without requiring any further interaction with an environment of the laboratory setting; wherein the critic network is only required during training in the laboratory setting; and wherein the critic network is modeled based on an action value function as opposed to a state value function.

2. The method of claim 1, wherein the production actor network is the same as the actor network.

3. The method of claim 1, wherein the state information in the laboratory setting includes information from sensors unavailable in the production setting.

4. The method of claim 1, wherein the production actor network is smaller than the actor network.

5. The method of claim 1, wherein a first actor-critic algorithm runs during the training of the critic network and a second actor-critic algorithm runs during the step of optimizing the production actor network using the frozen critic network.

6. The method of claim 5, wherein the first actor-critic algorithm is the same as the second actor-critic algorithm.

7. A method of training an agent, comprising: in the laboratory setting: training a policy with an actor-critic algorithm using an actor network and a critic network the critic network using state information available in a both the laboratory setting and in a production setting and the actor network using state information available only in the production setting, wherein the state information is collected in the laboratory setting by at least one of (1) additional sensors of the agent available in the laboratory setting and not in the production setting and (2) greater compute resources available to the agent in the laboratory setting and not in the production setting, wherein the information; and optimizing action choices of the actor network against the critic network; and after training the policy in the laboratory setting, prior to deploying the policy to the production setting: providing a production actor network through a distillation of the actor network; and duplicating the critic network, when the training is complete, into a frozen critic network and optimizing the production actor network using the frozen critic network outside of the laboratory setting without requiring any further interaction with an environment of the laboratory setting, wherein wherein the critic network is only required during training in the laboratory setting; the production actor network is smaller than the actor network, and the critic network is modeled based on an action value function as opposed to a state value function.

8. The method of claim 7, wherein a first actor-critic algorithm runs during the training of the critic network and a second actor-critic algorithm runs during the step of optimizing the production actor network using the frozen critic network.

9. The method of claim 8, wherein the first actor-critic algorithm is the same as the second actor-critic algorithm.

10. The method of claim 7, wherein the state information used in the training of the policy in the laboratory setting includes information that is collected in the laboratory setting and is not collected in the production setting.

11. A non-transitory computer-readable storage medium with an executable program stored thereon, wherein the program instructs one or more processors to perform the following steps: in a laboratory setting: training a policy using an actor network and a critic network using an actor-critic algorithm, the critic network using state information available in a both the laboratory setting and in a production setting and the actor network using state information available only in the production setting, wherein the state information used in the training of the policy in the laboratory setting includes information that is collected in the laboratory setting by at least one of (1) additional sensors of the agent available in the laboratory setting and not in the production setting and (2) greater compute resources available to the agent in the laboratory setting and not in the production setting, wherein the information is not collected in the production setting; and optimizing action choices of the actor network against the critic network; and after training the policy in the laboratory setting, prior to deploying the policy to the production setting: duplicating the critic network into a frozen critic network; providing a production actor network through a distillation of the actor network and by optimizing the production actor network with the frozen critic network outside of the laboratory setting without requiring any further interaction with an environment of the laboratory setting; wherein the critic network is only required during training in the laboratory setting; and wherein the critic network is modeled based on an action value function as opposed to a state value function.

12. The non-transitory computer-readable storage medium of claim 11, wherein the critic network is modeled based on an action value function.

13. The non-transitory computer-readable storage medium of claim 11, wherein the program instructs one or more processors to further perform: when the training is complete, duplicating the critic network into a frozen critic network; and optimizing the production actor network using the frozen critic network.

14. The non-transitory computer-readable storage medium of claim 13, wherein the production actor network is smaller than the actor network.

15. The method of claim 1, wherein only the production actor network is run in the production setting.

16. The non-transitory computer-readable storage medium of claim 13, wherein only the production actor network is run in the production setting.
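The information split described in the independent claims can be sketched with a toy, tabular example. Everything below is an illustrative assumption rather than the claimed implementation: lookup tables stand in for the networks, `x` is an observation available in both settings, `h` is a laboratory-only signal, and `lab_reward`, `train_in_lab`, and `distill_production_actor` are hypothetical names. The critic is indexed by `(x, h, action)`, the actor only by `x`, and the production actor is chosen against a frozen copy of the critic using logged states only. The sketch covers the frozen-critic optimization step and does not model distillation from the laboratory actor.

```python
import copy
import random


def lab_reward(x, h, action):
    """Toy lab environment: matching x pays 1; the lab-only signal h shifts reward."""
    return (1.0 if action == x else 0.0) + 2.0 * h


def train_in_lab(steps=2000, lr=0.1, seed=0):
    """Laboratory phase: the critic sees (x, h); the actor sees x only."""
    rng = random.Random(seed)
    critic = {(x, h, a): 0.0 for x in (0, 1) for h in (0, 1) for a in (0, 1)}
    actor_prefs = {(x, a): 0.0 for x in (0, 1) for a in (0, 1)}
    logged_states = []
    for _ in range(steps):
        x, h = rng.randint(0, 1), rng.randint(0, 1)
        # Epsilon-greedy action chosen from the production-visible x alone.
        if rng.random() < 0.2:
            action = rng.randint(0, 1)
        else:
            action = max((0, 1), key=lambda a: actor_prefs[(x, a)])
        reward = lab_reward(x, h, action)
        # One-step task, so the action-value target is the reward itself.
        critic[(x, h, action)] += lr * (reward - critic[(x, h, action)])
        # Improve the actor against the critic, which uses the lab-only h.
        for a in (0, 1):
            actor_prefs[(x, a)] += lr * (critic[(x, h, a)] - actor_prefs[(x, a)])
        logged_states.append((x, h))
    return critic, actor_prefs, logged_states


def distill_production_actor(frozen_critic, logged_states):
    """Pick, per x, the action with the highest total frozen-critic value.

    Uses logged lab states only; lab_reward is never called here.
    """
    totals = {(x, a): 0.0 for x in (0, 1) for a in (0, 1)}
    for x, h in logged_states:
        for a in (0, 1):
            totals[(x, a)] += frozen_critic[(x, h, a)]
    return {x: max((0, 1), key=lambda a: totals[(x, a)]) for x in (0, 1)}


critic, actor_prefs, logged_states = train_in_lab()
frozen_critic = copy.deepcopy(critic)  # duplicate, then never update again
production_actor = distill_production_actor(frozen_critic, logged_states)
print(production_actor)
```

In this toy setup the resulting production actor is a plain mapping from `x` to an action, so nothing that depends on `h` or on the critic needs to run in the production setting.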

Assignees

Sony Corp, Sony Corp America, Sony Group Corp

Inventors

Classifications

  • G06N3/045 (Primary)

    Combinations of networks · CPC title

  • Feedforward networks · CPC title

  • Transfer learning · CPC title

  • Reinforcement learning · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12423571B2 cover?
Reinforcement learning methods can use actor-critic networks where (1) additional laboratory-only state information is used to train a policy that must act without this additional laboratory-only information in a production setting; and (2) complex resource-demanding policies are distilled into a less-demanding policy that can be more easily run at production with limited computational resource…
Who is the assignee on this patent?
Sony Corp, Sony Corp America, Sony Group Corp
What technology area does this patent fall under?
Primary CPC classification G06N3/045. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 23, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).