Systems and methods for executing confidence-aware reinforcement learning

US2025165796A1 · US · A1

Patent metadata
Publication number: US-2025165796-A1
Application number: US-202318516125-A
Country: US
Kind code: A1
Filing date: Nov 21, 2023
Priority date: Nov 21, 2023
Publication date: May 22, 2025
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems for executing confidence-aware reinforcement learning for an Artificial Intelligence (AI) model for subsequent deployment of that AI model in an environment are disclosed. The method includes accessing a set of expert trajectories, each expert trajectory comprising a sequence of expert state-action pairs, the expert entities complying with an expert constraint that is unknown. The method also includes generating a main constraint for the set of expert trajectories, the main constraint being conditioned on a pre-determined confidence level, the pre-determined confidence level being indicative of a probability that the main constraint is at least as constraining as the expert constraint, the main constraint comprising one or more rules limiting the actions that are executable by the AI model, determining a target policy among a plurality of policies, the target policy complying with the main constraint and executing the target policy by the AI model.
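
The last two steps of the abstract (determining a target policy that complies with the main constraint, then executing it) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the main constraint reduces to a scalar cost budget and that each candidate policy already has an estimated expected cost and expected return. The names `CandidatePolicy`, `select_target_policy`, and `cost_budget` are hypothetical and do not appear in the publication.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CandidatePolicy:
    """Hypothetical summary of one policy (not a structure from the patent)."""
    name: str
    expected_return: float  # estimated cumulative reward
    expected_cost: float    # estimated cumulative constraint cost


def select_target_policy(
    policies: Sequence[CandidatePolicy], cost_budget: float
) -> Optional[CandidatePolicy]:
    """Return the highest-return policy whose cost stays within the budget.

    Assumption: the main constraint is a scalar budget, and a policy
    complies when its expected cost does not exceed that budget.
    Returns None when no candidate complies.
    """
    compliant = [p for p in policies if p.expected_cost <= cost_budget]
    if not compliant:
        return None
    return max(compliant, key=lambda p: p.expected_return)


candidates = [
    CandidatePolicy("aggressive", expected_return=12.0, expected_cost=5.0),
    CandidatePolicy("balanced", expected_return=9.0, expected_cost=2.0),
    CandidatePolicy("cautious", expected_return=6.0, expected_cost=0.5),
]
target = select_target_policy(candidates, cost_budget=3.0)
print(target.name if target else "no compliant policy")
```

With a budget of 3.0 the "aggressive" candidate is excluded, and the sketch returns the best of the remaining compliant candidates. A tighter main constraint (a smaller budget) shrinks the compliant set, which is the trade-off the confidence level controls.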

First claim

Claim text as published (claims 1 to 20).

1. A computer-implemented method for executing confidence-aware reinforcement learning for an Artificial Intelligence (AI) model for subsequent deployment of that AI model in an environment, the method comprising: accessing a set of expert trajectories, each expert trajectory comprising a sequence of expert state-action pairs, a given one of the expert trajectories including information about a given state of the environment and a corresponding action that is to be executed in response to the given state, the expert entities complying with an expert constraint that is unknown; generating a main constraint for the set of expert trajectories, the main constraint being conditioned on a pre-determined confidence level, the pre-determined confidence level being indicative of a probability that the main constraint is at least as constraining as the expert constraint, the main constraint comprising one or more rules limiting the actions that are executable by the AI model; determining a target policy among a plurality of policies, the target policy complying with the main constraint; and executing the target policy by the AI model.

2. The method of claim 1, further comprising: accessing a set of policies, each policy being a mapping from states to actions for the sequences of expert state-action pairs of the expert trajectories, an execution of the policy aiming at maximizing a reward; determining a policy complying with the main constraint; executing the policy by iteratively: executing the actions of the policy, receiving indication of rewards from and states of the environment, and adjusting the policy based on outcomes of the actions and received rewards.

3. The method of claim 2, further comprising, prior to executing the policy: determining a policy-value of the target policy; in response to the policy-value being below a pre-determined value threshold, flagging the set of expert trajectories as insufficient.

4. The method of claim 3, further comprising augmenting the set of expert trajectories with additional expert trajectories until the policy-value exceeds the pre-determined value threshold.

5. The method of claim 3, wherein determining the policy-value of the policy comprises determining an expected cumulative reward based on rewards associated with the action-state pairs of the policy.

6. The method of claim 1, wherein generating a main constraint for the set of expert trajectories comprises: determining a constraint distribution based on the set of expert trajectories; selecting a constraint from the constraint distribution based on the pre-determined confidence level as the main constraint.

7. The method of claim 6, wherein selecting the constraint from the constraint distribution comprises selecting the lower boundary constraint of a quantile of the constraint distribution based on the pre-determined confidence level.

8. The method of claim 7, wherein the main constraint is: quantile_{P(c)}(1−λ), where P(c) is the constraint distribution and λ is the pre-determined confidence level.

9. The method of claim 6, wherein determining a constraint distribution comprises: employing a neural network encoding the set of expert trajectories to determine, for each of the expert trajectory, a set of contribution factors; and adjusting a template distribution according to the set of contribution factors to form the constraint distribution.

10. The method of claim 9, wherein each expert trajectory is encoded with a corresponding encoder having corresponding weights in the neural network.

11. A system for executing confidence-aware reinforcement learning for an Artificial Intelligence (AI) model for subsequent deployment of that AI model in an environment, the system comprising a controller and a memory storing a plurality of executable instructions which, when executed by the controller, cause the system to: access a set of expert trajectories, each expert trajectory comprising a sequence of expert state-action pairs, a given one of the expert trajectories including information about a given state of the environment and a corresponding action that is to be executed in response to the given state, the expert entities complying with an expert constraint that is unknown; generate a main constraint for the set of expert trajectories, the main constraint being conditioned on a pre-determined confidence level, the pre-determined confidence level being indicative of a probability that the main constraint is at least as constraining as the expert constraint, the main constraint comprising one or more rules limiting the actions that are executable by the AI model; determine a target policy among a plurality of policies, the target policy complying with the main constraint; and execute the target policy by the AI model.

12. The system of claim 11, wherein the system is further configured to: access a set of policies, each policy being a mapping from states to actions for the sequences of expert state-action pairs of the expert trajectories, an execution of the policy aiming at maximizing a reward; determine a policy complying with the main constraint; execute the policy by iteratively: executing the actions of the policy, receiving indication of rewards from and states of the environment, and adjusting the policy based on outcomes of the actions and received rewards.

13. The system of claim 12, wherein the system is further configured to, prior to executing the policy: determine a policy-value of the target policy; in response to the policy-value being below a pre-determined value threshold, flag the set of expert trajectories as insufficient.

14. The system of claim 13, wherein the system is further configured to augment the set of expert trajectories with additional expert trajectories until the policy-value exceeds the pre-determined value threshold.

15. The system of claim 13, wherein the system is further configured to, upon determining the policy-value of the policy, determine an expected cumulative reward based on rewards associated with the action-state pairs of the policy.

16. The system of claim 11, wherein the system is further configured to, upon generating a main constraint for the set of expert trajectories: determine a constraint distribution based on the set of expert trajectories; select a constraint from the constraint distribution based on the pre-determined confidence level as the main constraint.

17. The system of claim 16, wherein the system is further configured to select the constraint from the constraint distribution by selecting the lower boundary constraint of a quantile of the constraint distribution based on the pre-determined confidence level.

18. The system of claim 17, wherein the main constraint is: quantile_{P(c)}(1−λ), where P(c) is the constraint distribution and λ is the pre-determined confidence level.

19. The system of claim 16, wherein the system is further configured to, upon determining a constraint distribution: employ a neural network encoding the set of expert trajectories to determine, for each of the expert trajectory, a set of contribution factors; and adjust a template distribution according to the set of contribution factors to form the constraint distribution.

20. The system of claim 19, wherein each expert trajectory is encoded with a corresponding encoder having corresponding weights in the neural network.
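
Claims 6 to 8 (quantile-based selection of the main constraint) and claims 3 and 5 (the policy-value check) can be sketched in a few lines of Python. This is a hedged illustration, not the applicants' implementation. It assumes the constraint is a single scalar threshold where smaller values are more restrictive, that the constraint distribution P(c) is available as a set of samples, and that the policy-value is estimated as an undiscounted mean over sampled rollouts. Under those assumptions, taking the (1−λ)-quantile leaves a fraction of roughly λ of the sampled thresholds at or above the chosen one, which matches the stated meaning of the confidence level. The function names and the normal-distribution stand-in for P(c) are hypothetical.

```python
import numpy as np


def main_constraint(threshold_samples, confidence):
    """Pick the (1 - confidence)-quantile of sampled thresholds (claim 8 form).

    Assumption: each sample is a scalar threshold drawn from P(c), and a
    smaller threshold is a more restrictive constraint.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    samples = np.asarray(threshold_samples, dtype=float)
    return float(np.quantile(samples, 1.0 - confidence))


def policy_value(reward_sequences):
    """Estimate expected cumulative reward as the mean rollout total.

    Assumption: rollouts are undiscounted; the claims do not specify this.
    """
    totals = [sum(rewards) for rewards in reward_sequences]
    return float(np.mean(totals))


def trajectories_insufficient(value, value_threshold):
    """Flag the expert set when the policy-value is below the threshold."""
    return value < value_threshold


rng = np.random.default_rng(0)
samples = rng.normal(loc=10.0, scale=2.0, size=5000)  # stand-in for P(c)
loose = main_constraint(samples, confidence=0.5)
tight = main_constraint(samples, confidence=0.95)
print(tight < loose)  # higher confidence gives a smaller, stricter threshold

value = policy_value([[1, 1, 1], [0, 2, 4]])
print(value, trajectories_insufficient(value, value_threshold=5.0))
```

Raising the confidence level moves the selected threshold toward the restrictive tail of the distribution, which can lower the achievable policy-value. When that value falls below the threshold, claim 4 describes adding expert trajectories until it recovers; the sketch only raises the flag.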

Assignees

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Combinations of networks · CPC title

  • Learning methods · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • G06N3/092 · Primary

    Reinforcement learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025165796A1 cover?
Methods and systems for executing confidence-aware reinforcement learning for an Artificial Intelligence (AI) model for subsequent deployment of that AI model in an environment are disclosed. The method includes accessing a set of expert trajectories, each expert trajectory comprising a sequence of expert state-action pairs, the expert entities complying with an expert constraint that is unknow…
Who is the assignee on this patent?
Liu Guiliang, Poupart Pascal, Huawei Tech Canada Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/092. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 22, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).