Agent training method, electronic device and storage medium

US2025378391A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2025378391-A1
Application number: US-202519310720-A
Country: US
Kind code: A1
Filing date: Aug 26, 2025
Priority date: Jun 19, 2025
Publication date: Dec 11, 2025
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for training an agent includes: for each subtask of a sample task, determining action priorities for a plurality of first candidate actions in a plurality of sets of experience data corresponding to the subtask in an experience pool of the agent, wherein the action priorities represent values of the plurality of the first candidate actions; selecting target experience data corresponding to the subtask from the plurality of sets of experience data based on the action priorities; and training the agent based on the target experience data.
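The selection step in the abstract can be illustrated with a short sketch. It assumes that each experience record already carries a computed action priority and that "selecting based on the action priorities" means keeping the top-k records per subtask; the abstract does not state the selection rule, and the names `Experience`, `select_target_experience`, and the sample subtasks are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, TypedDict


class Experience(TypedDict):
    subtask: str      # hypothetical subtask identifier
    action: str       # candidate action recorded in this experience
    priority: float   # action priority, assumed to be computed beforehand


def select_target_experience(pool: List[Experience], k: int = 1) -> Dict[str, List[Experience]]:
    """Group the pool by subtask and keep the k highest-priority records for each.

    Top-k is an assumed selection rule, not one stated in the abstract.
    """
    by_subtask: Dict[str, List[Experience]] = defaultdict(list)
    for exp in pool:
        by_subtask[exp["subtask"]].append(exp)
    return {
        subtask: sorted(exps, key=lambda e: e["priority"], reverse=True)[:k]
        for subtask, exps in by_subtask.items()
    }


# Made-up experience pool for illustration only.
pool: List[Experience] = [
    {"subtask": "open_settings", "action": "tap_icon", "priority": 0.9},
    {"subtask": "open_settings", "action": "swipe_up", "priority": 0.2},
    {"subtask": "toggle_wifi", "action": "tap_switch", "priority": 0.7},
]
targets = select_target_experience(pool, k=1)
```

The selected records in `targets` would then be passed to whatever training step the agent uses; that step is outside the scope of this sketch.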

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training an agent, comprising: for each subtask of a sample task, determining action priorities for a plurality of first candidate actions in a plurality of sets of experience data corresponding to the subtask in an experience pool of the agent, wherein the action priorities represent values of the plurality of the first candidate actions; selecting target experience data corresponding to the subtask from the plurality of sets of experience data based on the action priorities; and training the agent based on the target experience data.

2. The method of claim 1, wherein determining the action priorities for the plurality of the first candidate actions in the plurality of sets of experience data corresponding to the subtask in the experience pool of the agent comprises: determining a first dominance value for any first candidate action of the plurality of first candidate actions; determining an uncertainty penalty coefficient corresponding to the subtask, wherein the uncertainty penalty coefficient represents a degree of uncertainty in action selection; and determining an action priority for the any first candidate action based on the first dominance value and the uncertainty penalty coefficient.

3. The method of claim 2, wherein determining the first dominance value for the any first candidate action of the plurality of the first candidate actions comprises: determining an expected cumulative reward for adopting the any first candidate action for the subtask; for the subtask, determining a maximum value from expected cumulative rewards corresponding to first candidate actions other than the any first candidate action in the plurality of the first candidate actions; and determining the first dominance value based on the expected cumulative reward corresponding to the any first candidate action and the maximum value.

4. The method of claim 2, wherein determining the uncertainty penalty coefficient corresponding to the subtask comprises: determining an expected cumulative reward for adopting the any first candidate action for the subtask; determining a first probability for adopting the any first candidate action for the subtask based on expected cumulative rewards corresponding to the plurality of the first candidate actions; and determining the uncertainty penalty coefficient based on first probabilities respectively corresponding to the plurality of the first candidate actions.

5. The method of claim 1, wherein training the agent based on the target experience data comprises: determining a reward value for a second candidate action in the target experience data; and training the agent based on the reward value of the second candidate action.

6. The method of claim 5, wherein determining the reward value for the second candidate action in the target experience data comprises: determining an instant reward for adopting the second candidate action for the subtask; predicting a long-term reward for completing the sample task in a case of adopting the second candidate action for the subtask; and determining the reward value based on the instant reward and the long-term reward.

7. The method of claim 6, wherein determining the instant reward for adopting the second candidate action for the subtask comprises: determining a matching degree between an instruction description of the subtask and a current interface state corresponding to the subtask; and determining the instant reward based on the matching degree.

8. The method of claim 6, wherein predicting the long-term reward for completing the sample task in the case of adopting the second candidate action for the subtask comprises: predicting a completion probability of the sample task based on task instruction information of the sample task and a current interface state corresponding to the subtask; and determining the long-term reward based on the completion probability.

9. The method of claim 5, wherein training the agent based on the reward value for the second candidate action comprises: selecting a target action from second candidate actions according to reward values for the second candidate actions; determining a second probability for selecting the target action for the subtask and a second dominance value of the target action; and training the agent based on the second probability, the second dominance value and the reward value for the second candidate action.

10. The method of claim 9, wherein training the agent based on the second probability, the second dominance value and the reward value for the second candidate action comprises: adjusting parameters of a large model in the agent based on the reward value for the second candidate action to obtain the large model with adjusted parameters; adjusting parameters of a policy network in the agent based on the second probability and the second dominance value to obtain the policy network with adjusted parameters; and obtaining a trained agent based on the large model with the adjusted parameters and the policy network with the adjusted parameters.

11. The method of claim 10, wherein there are a plurality of second candidate actions, and adjusting the parameters of the large model in the agent based on the reward value for the second candidate action to obtain the large model with the adjusted parameters comprises: determining a meta-gradient based on a reward value of a target action having a largest reward value for the plurality of second candidate actions and expected values of reward values of candidate actions other than the target action in the plurality of second candidate actions; and adjusting the parameters of the large model based on the meta-gradient to obtain the large model with adjusted parameters.

12. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor is caused to: for each subtask of a sample task, determine action priorities for a plurality of first candidate actions in a plurality of sets of experience data corresponding to the subtask in an experience pool of the agent, wherein the action priorities represent values of the plurality of the first candidate actions; select target experience data corresponding to the subtask from the plurality of sets of experience data based on the action priorities; and train the agent based on the target experience data.

13. The electronic device of claim 12, wherein the at least one processor is caused to: determine a first dominance value for any first candidate action of the plurality of first candidate actions; determine an uncertainty penalty coefficient corresponding to the subtask, wherein the uncertainty penalty coefficient represents a degree of uncertainty in action selection; and determine an action priority for the any first candidate action based on the first dominance value and the uncertainty penalty coefficient.

14. The electronic device of claim 13, wherein the at least one processor is caused to: determine an expected cumulative reward for adopting the any first candidate action for the subtask; for the subtask, determine a maximum value from expected cumulative rewards corresponding to first candidate actions other than the any first candidate action in the plurality of the first candidate actions; and determine the first dominance value based on the expected cumu
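Claims 2 to 4 name the inputs to the action priority but not the formulas. The sketch below is one possible reading, under loudly stated assumptions: the dominance value is taken as the expected cumulative reward of an action minus the best reward among the other actions, the "first probability" is a softmax over the rewards, the uncertainty penalty coefficient is the normalized entropy of those probabilities, and the priority is the dominance value minus a weighted penalty. None of these specific forms, nor the names `dominance_values`, `uncertainty_penalty`, `action_priorities`, `temperature`, and `penalty_weight`, come from the claim text.

```python
import math
from typing import Dict


def dominance_values(q: Dict[str, float]) -> Dict[str, float]:
    """Assumed form: Q(a) minus the largest Q among the other candidates.

    Requires at least two candidate actions.
    """
    return {
        a: qa - max(v for b, v in q.items() if b != a)
        for a, qa in q.items()
    }


def uncertainty_penalty(q: Dict[str, float], temperature: float = 1.0) -> float:
    """Assumed form: normalized entropy of a softmax over the Q-values.

    Returns a value in [0, 1]; 1 means all actions look equally good.
    """
    m = max(q.values())  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in q.values()]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(q))


def action_priorities(q: Dict[str, float], penalty_weight: float = 1.0) -> Dict[str, float]:
    """Assumed combination: dominance value minus a weighted uncertainty penalty."""
    dom = dominance_values(q)
    pen = uncertainty_penalty(q)
    return {a: d - penalty_weight * pen for a, d in dom.items()}


# Made-up expected cumulative rewards for three candidate actions of one subtask.
q_values = {"tap_icon": 2.0, "swipe_up": 0.5, "long_press": 0.5}
priorities = action_priorities(q_values)
best_action = max(priorities, key=priorities.get)
```

Under these assumptions the penalty is shared by every action of a subtask, so it does not change the ranking within the subtask; it lowers the priority of all experience from subtasks where the action choice is ambiguous.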

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • G06N20/00 (Primary)

    Machine learning · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025378391A1 cover?
A method for training an agent includes: for each subtask of a sample task, determining action priorities for a plurality of first candidate actions in a plurality of sets of experience data corresponding to the subtask in an experience pool of the agent, wherein the action priorities represent values of the plurality of the first candidate actions; selecting target experience data correspondin…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 11, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).