US2025378391A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025378391-A1 |
| Application number | US-202519310720-A |
| Country | US |
| Kind code | A1 |
| Filing date | Aug 26, 2025 |
| Priority date | Jun 19, 2025 |
| Publication date | Dec 11, 2025 |
| Grant date | — |
Official abstract text for this publication.
A method for training an agent includes: for each subtask of a sample task, determining action priorities for a plurality of first candidate actions in a plurality of sets of experience data corresponding to the subtask in an experience pool of the agent, wherein the action priorities represent values of the plurality of the first candidate actions; selecting target experience data corresponding to the subtask from the plurality of sets of experience data based on the action priorities; and training the agent based on the target experience data.
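The selection step in the abstract can be illustrated with a minimal Python sketch. It assumes each set of experience data already carries a precomputed priority, and it picks the top-k entries per subtask. The `Experience` class, the `select_target_experience` function, the top-k rule, and the sample subtask and action names are all hypothetical: the abstract only says the selection is "based on the action priorities".

```python
from dataclasses import dataclass


@dataclass
class Experience:
    """One set of experience data (hypothetical structure, not from the patent)."""
    subtask: str
    action: str
    priority: float  # assumed to be computed beforehand


def select_target_experience(pool, subtask, k=2):
    """Return up to k experiences for `subtask`, highest priority first.

    Top-k selection is an assumption; the source does not fix a selection rule.
    """
    candidates = [e for e in pool if e.subtask == subtask]
    return sorted(candidates, key=lambda e: e.priority, reverse=True)[:k]


# Illustrative experience pool covering two subtasks of one sample task.
pool = [
    Experience("open_settings", "tap_gear_icon", 0.9),
    Experience("open_settings", "swipe_left", 0.1),
    Experience("open_settings", "tap_menu", 0.6),
    Experience("toggle_wifi", "tap_switch", 0.8),
]

# Select target experience data for each subtask of the sample task.
targets = {
    subtask: select_target_experience(pool, subtask)
    for subtask in ["open_settings", "toggle_wifi"]
}
```

The selected entries in `targets` would then feed the training step, which this sketch leaves out.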
Opening claim text (preview).
What is claimed is:

1. A method for training an agent, comprising: for each subtask of a sample task, determining action priorities for a plurality of first candidate actions in a plurality of sets of experience data corresponding to the subtask in an experience pool of the agent, wherein the action priorities represent values of the plurality of the first candidate actions; selecting target experience data corresponding to the subtask from the plurality of sets of experience data based on the action priorities; and training the agent based on the target experience data.

2. The method of claim 1, wherein determining the action priorities for the plurality of the first candidate actions in the plurality of sets of experience data corresponding to the subtask in the experience pool of the agent comprises: determining a first dominance value for any first candidate action of the plurality of first candidate actions; determining an uncertainty penalty coefficient corresponding to the subtask, wherein the uncertainty penalty coefficient represents a degree of uncertainty in action selection; and determining an action priority for the any first candidate action based on the first dominance value and the uncertainty penalty coefficient.

3. The method of claim 2, wherein determining the first dominance value for the any first candidate action of the plurality of the first candidate actions comprises: determining an expected cumulative reward for adopting the any first candidate action for the subtask; for the subtask, determining a maximum value from expected cumulative rewards corresponding to first candidate actions other than the any first candidate action in the plurality of the first candidate actions; and determining the first dominance value based on the expected cumulative reward corresponding to the any first candidate action and the maximum value.

4. The method of claim 2, wherein determining the uncertainty penalty coefficient corresponding to the subtask comprises: determining an expected cumulative reward for adopting the any first candidate action for the subtask; determining a first probability for adopting the any first candidate action for the subtask based on expected cumulative rewards corresponding to the plurality of the first candidate actions; and determining the uncertainty penalty coefficient based on first probabilities respectively corresponding to the plurality of the first candidate actions.

5. The method of claim 1, wherein training the agent based on the target experience data comprises: determining a reward value for a second candidate action in the target experience data; and training the agent based on the reward value of the second candidate action.

6. The method of claim 5, wherein determining the reward value for the second candidate action in the target experience data comprises: determining an instant reward for adopting the second candidate action for the subtask; predicting a long-term reward for completing the sample task in a case of adopting the second candidate action for the subtask; and determining the reward value based on the instant reward and the long-term reward.

7. The method of claim 6, wherein determining the instant reward for adopting the second candidate action for the subtask comprises: determining a matching degree between an instruction description of the subtask and a current interface state corresponding to the subtask; and determining the instant reward based on the matching degree.

8. The method of claim 6, wherein predicting the long-term reward for completing the sample task in the case of adopting the second candidate action for the subtask comprises: predicting a completion probability of the sample task based on task instruction information of the sample task and a current interface state corresponding to the subtask; and determining the long-term reward based on the completion probability.

9. The method of claim 5, wherein training the agent based on the reward value for the second candidate action comprises: selecting a target action from second candidate actions according to reward values for the second candidate actions; determining a second probability for selecting the target action for the subtask and a second dominance value of the target action; and training the agent based on the second probability, the second dominance value and the reward value for the second candidate action.

10. The method of claim 9, wherein training the agent based on the second probability, the second dominance value and the reward value for the second candidate action comprises: adjusting parameters of a large model in the agent based on the reward value for the second candidate action to obtain the large model with adjusted parameters; adjusting parameters of a policy network in the agent based on the second probability and the second dominance value to obtain the policy network with adjusted parameters; and obtaining a trained agent based on the large model with the adjusted parameters and the policy network with the adjusted parameters.

11. The method of claim 10, wherein there are a plurality of second candidate actions, and adjusting the parameters of the large model in the agent based on the reward value for the second candidate action to obtain the large model with the adjusted parameters comprises: determining a meta-gradient based on a reward value of a target action having a largest reward value for the plurality of second candidate actions and expected values of reward values of candidate actions other than the target action in the plurality of second candidate actions; and adjusting the parameters of the large model based on the meta-gradient to obtain the large model with adjusted parameters.

12. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor is caused to: for each subtask of a sample task, determine action priorities for a plurality of first candidate actions in a plurality of sets of experience data corresponding to the subtask in an experience pool of the agent, wherein the action priorities represent values of the plurality of the first candidate actions; select target experience data corresponding to the subtask from the plurality of sets of experience data based on the action priorities; and train the agent based on the target experience data.

13. The electronic device of claim 12, wherein the at least one processor is caused to: determine a first dominance value for any first candidate action of the plurality of first candidate actions; determine an uncertainty penalty coefficient corresponding to the subtask, wherein the uncertainty penalty coefficient represents a degree of uncertainty in action selection; and determine an action priority for the any first candidate action based on the first dominance value and the uncertainty penalty coefficient.

14. The electronic device of claim 13, wherein the at least one processor is caused to: determine an expected cumulative reward for adopting the any first candidate action for the subtask; for the subtask, determine a maximum value from expected cumulative rewards corresponding to first candidate actions other than the any first candidate action in the plurality of the first candidate actions; and determine the first dominance value based on the expected cumu
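Claims 2 to 4 and 6 to 8 name the inputs to the action priority and the reward value but do not give formulas. The Python sketch below is one possible reading and not the claimed implementation. Every function name and default parameter is hypothetical, and the following are assumptions:

- the first dominance value is the action's expected cumulative reward minus the maximum over the other actions;
- the first probabilities come from a softmax over the expected cumulative rewards;
- the uncertainty penalty coefficient is the normalized entropy of those probabilities;
- the priority is a linear combination of the dominance value and the penalty;
- the reward value is a weighted sum of the instant and long-term rewards.

```python
import math


def dominance(q_values, action):
    """Expected cumulative reward of `action` minus the best alternative (assumed form)."""
    others = [q for a, q in q_values.items() if a != action]
    if not others:
        raise ValueError("need at least two candidate actions")
    return q_values[action] - max(others)


def action_probabilities(q_values, temperature=1.0):
    """Softmax over expected cumulative rewards (assumed form of the 'first probability')."""
    top = max(q_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((q - top) / temperature) for a, q in q_values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def uncertainty_penalty(q_values):
    """Normalized entropy in [0, 1]; 1 means the actions are indistinguishable (assumption)."""
    probs = action_probabilities(q_values)
    if len(probs) < 2:
        raise ValueError("need at least two candidate actions")
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return entropy / math.log(len(probs))


def action_priority(q_values, action, penalty_weight=1.0):
    """Dominance reduced by the subtask-level uncertainty penalty (assumed combination)."""
    return dominance(q_values, action) - penalty_weight * uncertainty_penalty(q_values)


def reward_value(matching_degree, completion_probability, long_term_weight=0.5):
    """Instant reward from the matching degree plus a weighted long-term reward (assumed)."""
    return matching_degree + long_term_weight * completion_probability


# Illustrative expected cumulative rewards for three candidate actions of one subtask.
q = {"tap_gear_icon": 2.0, "tap_menu": 1.0, "swipe_left": 0.0}
priorities = {a: action_priority(q, a) for a in q}
```

Because the penalty is computed once per subtask, it shifts every candidate's priority by the same amount. In this reading it matters when priorities are compared across subtasks, or against a fixed threshold, rather than within one subtask.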